9 datasets found
  1. PandasPlotBench

    • huggingface.co
    Updated Nov 25, 2024
    Cite
    JetBrains Research (2024). PandasPlotBench [Dataset]. https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench
    Explore at:
Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Nov 25, 2024
    Dataset provided by
JetBrains (http://jetbrains.com/)
    Authors
    JetBrains Research
    License

Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    PandasPlotBench

PandasPlotBench is a benchmark that assesses the capability of models to write code for visualizations given the description of a Pandas DataFrame. 🛠️ Task. Given the plotting task and the description of a Pandas DataFrame, write the code to build a plot. The dataset is based on the MatPlotLib gallery. The paper can be found on arXiv: https://arxiv.org/abs/2412.02764v1. To score your model on this dataset, you can use our GitHub repository. 📩 If you have… See the full description on the dataset page: https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench.
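A minimal sketch for loading the benchmark with the Hugging Face datasets library (the split name "test" is an assumption; check the dataset page for the actual configuration):

    from datasets import load_dataset

    # Split name "test" is an assumption; inspect the dataset page for actual splits.
    ds = load_dataset("JetBrains-Research/PandasPlotBench", split="test")
    print(ds[0])  # one benchmark task: a plotting instruction plus a DataFrame description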

2. polyOne Data Set - 100 million hypothetical polymers including 29 properties...

    • data.niaid.nih.gov
    Updated Mar 24, 2023
    Cite
    Rampi Ramprasad (2023). polyOne Data Set - 100 million hypothetical polymers including 29 properties [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7124187
    Explore at:
    Dataset updated
    Mar 24, 2023
    Dataset provided by
    Christopher Kuenneth
    Rampi Ramprasad
    Description

    polyOne Data Set

The data set contains 100 million hypothetical polymers, each with 29 properties predicted using machine learning models. We use PSMILES strings to represent polymer structures, see here and here. The polymers are generated by decomposing previously synthesized polymers into unique chemical fragments. Random and enumerative compositions of these fragments yield 100 million hypothetical PSMILES strings. All PSMILES strings are chemically valid polymers, but most have never been synthesized before. More information can be found in the paper. Please note the license agreement in the LICENSE file.

    Full data set including the properties

The data files are in Apache Parquet format; the file names match the pattern polyOne_*.parquet.

    I recommend using dask (pip install dask) to load and process the data set. Pandas also works but is slower.

Load the sharded data set with dask:

    import dask.dataframe as dd
    ddf = dd.read_parquet("*.parquet", engine="pyarrow")

For example, compute summary statistics of the data set:

    df_describe = ddf.describe().compute()
    df_describe

    
    
PSMILES strings only

• generated_polymer_smiles_train.txt - 80 million PSMILES strings for training polyBERT. One string per line.
• generated_polymer_smiles_dev.txt - 20 million PSMILES strings for testing polyBERT. One string per line.
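A small sketch for reading one of the PSMILES text files (assumes the file has been downloaded locally; plain Python suffices since the format is one string per line):

    # One PSMILES string per line; no header, no delimiter to parse.
    with open("generated_polymer_smiles_dev.txt") as f:
        psmiles = [line.strip() for line in f]
    print(len(psmiles), psmiles[:3])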
    
3. ‘Datasets for Pandas’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Datasets for Pandas’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-datasets-for-pandas-e46e/3d497e33/?iid=002-090&v=presentation
    Explore at:
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Analysis of ‘Datasets for Pandas’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/rajacsp/datasets-for-pandas on 28 January 2022.

    --- No further description of dataset provided by original source ---

    --- Original source retains full ownership of the source dataset ---

4. Global Startup Accelerator Dataset

    • opendatabay.com
    Updated Jul 5, 2025
    Cite
    Datasimple (2025). Global Startup Accelerator Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/b0d74f48-70be-497b-948f-eba5336c5a26
    Explore at:
    Dataset updated
    Jul 5, 2025
    Dataset authored and provided by
    Datasimple
    License

CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Area covered
    Finance & Banking Analytics
    Description

    This dataset provides an overview of companies listed in the Y Combinator directory, scraped on 13 July 2023. It offers a valuable resource for analysing the startup ecosystem, allowing users to explore companies by industry, geographic location, company size, and more. Y Combinator is a prominent startup accelerator that has funded over 4,000 companies, collectively valued at over $600 billion, with the primary aim of supporting new ventures in their growth.

    Columns

    • company_id: Unique identifier for each company, provided by Y Combinator.
    • company_name: The name of the company.
    • short_description: A concise, one-line summary of the company.
    • long_description: A more detailed description of the company.
    • batch: The specific Y Combinator batch the company belongs to.
    • status: The current operational status of the company.
    • tags: Industry-specific tags associated with the company.
    • location: The physical location of the company.
    • country: The country where the company is located.
    • year_founded: The year the company was established.
    • num_founders: The number of founders associated with the company.
    • founders_names: The full names of the company's founders.
    • team_size: The number of employees in the company.
    • website: The official website URL for the company.
    • cb_url: The Crunchbase URL for the company.
    • linkedin_url: The LinkedIn profile URL for the company.

    Distribution

    The dataset is supplied as a CSV file, based on data scraped on 27 February 2023. While specific total row or record counts are not available, various distributions of column values have been noted.
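A quick exploration sketch with pandas (the file name "yc_companies.csv" is a placeholder; use the name of the downloaded file):

    import pandas as pd

    # "yc_companies.csv" is a hypothetical file name for the downloaded CSV.
    df = pd.read_csv("yc_companies.csv")

    # Count companies per batch and per country using the documented columns.
    print(df["batch"].value_counts().head())
    print(df["country"].value_counts().head())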

    Usage

    This dataset is ideal for market research, competitive intelligence, and startup ecosystem analysis. It can be used to identify industry trends, study company demographics, or explore investment opportunities within the Y Combinator portfolio.

    Coverage

    The dataset covers companies globally, with locations and countries explicitly noted for each entry. The time range for company founding years spans from 2005 to 2023. The data was collected as of 13 July 2023.

    License

CC0 1.0 Universal (CC0)

    Who Can Use It

    • Researchers: For academic studies on startup accelerators, entrepreneurship, and tech industry trends.
    • Business Analysts: To gain insights into market segments, competitor landscapes, and potential partnership opportunities.
    • Investors: For identifying promising startups and understanding investment patterns.
    • Aspiring Entrepreneurs: To learn about successful startup profiles and their development paths.

    Dataset Name Suggestions

    • Y Combinator Company Directory
    • YC Startup Data
    • Global Startup Accelerator Dataset
    • Y Combinator Investment Portfolio

    Attributes

    Original Data Source: Y Combinator Directory

  5. Klib library python

    • kaggle.com
    Updated Jan 11, 2021
    Cite
    Sripaad Srinivasan (2021). Klib library python [Dataset]. https://www.kaggle.com/sripaadsrinivasan/klib-library-python/discussion
    Explore at:
Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Jan 11, 2021
    Dataset provided by
Kaggle (http://kaggle.com/)
    Authors
    Sripaad Srinivasan
    Description

The klib library enables us to quickly visualize missing data, perform data cleaning, and plot data distributions, correlations, and categorical column values. klib is a Python library for importing, cleaning, analyzing and preprocessing data. Explanations of key functionalities can be found on Medium / TowardsDataScience in the examples section or on YouTube (Data Professor).

Original GitHub repo: https://github.com/akanz1/klib

    Usage

!pip install klib

import klib
import pandas as pd

df = pd.DataFrame(data)  # "data" stands in for your own records

# klib functions for visualizing datasets
klib.cat_plot(df)         # visualizes the number and frequency of categorical features
klib.corr_mat(df)         # returns a color-encoded correlation matrix
klib.corr_plot(df)        # returns a color-encoded heatmap, ideal for correlations
klib.dist_plot(df)        # returns a distribution plot for every numeric feature
klib.missingval_plot(df)  # returns a figure with information about missing values

    Examples

    Take a look at this starter notebook.

    Further examples, as well as applications of the functions can be found here.

    Contributing

Pull requests and ideas, especially for further functions, are welcome. For major changes or feedback, please open an issue first to discuss what you would like to change. Take a look at this GitHub repo.

    License

    MIT

6. oldIT2modIT

    • huggingface.co
    Updated Jun 3, 2025
    Cite
    Massimo Romano (2025). oldIT2modIT [Dataset]. https://huggingface.co/datasets/cybernetic-m/oldIT2modIT
    Explore at:
    Dataset updated
    Jun 3, 2025
    Authors
    Massimo Romano
    License

MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Download the dataset

At the moment, to download the dataset you should use a Pandas DataFrame:

    import pandas as pd
    df = pd.read_csv("https://huggingface.co/datasets/cybernetic-m/oldIT2modIT/resolve/main/oldIT2modIT_dataset.csv")

You can visualize the dataset with:

    df.head()

To convert it into a Hugging Face dataset:

    from datasets import Dataset
    dataset = Dataset.from_pandas(df)

Dataset Description

This is an Italian dataset consisting of 200 old (ancient) Italian sentences and… See the full description on the dataset page: https://huggingface.co/datasets/cybernetic-m/oldIT2modIT.

  7. SELTO Dataset

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated May 23, 2023
    Cite
Sören Dittmer; David Erzmann; Henrik Harms; Rielson Falck; Marco Gosch (2023). SELTO Dataset [Dataset]. http://doi.org/10.5281/zenodo.7781392
    Explore at:
Available download formats: application/gzip
    Dataset updated
    May 23, 2023
    Dataset provided by
Zenodo (http://zenodo.org/)
    Authors
Sören Dittmer; David Erzmann; Henrik Harms; Rielson Falck; Marco Gosch
    License

Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    A Benchmark Dataset for Deep Learning for 3D Topology Optimization

    This dataset represents voxelized 3D topology optimization problems and solutions. The solutions have been generated in cooperation with the Ariane Group and Synera using the Altair OptiStruct implementation of SIMP within the Synera software. The SELTO dataset consists of four different 3D datasets for topology optimization, called disc simple, disc complex, sphere simple and sphere complex. Each of these datasets is further split into a training and a validation subset.

    The following paper provides full documentation and examples:

    Dittmer, S., Erzmann, D., Harms, H., Maass, P., SELTO: Sample-Efficient Learned Topology Optimization (2022) https://arxiv.org/abs/2209.05098.

    The Python library DL4TO (https://github.com/dl4to/dl4to) can be used to download and access all SELTO dataset subsets.
    Each TAR.GZ file container consists of multiple enumerated pairs of CSV files. Each pair describes a unique topology optimization problem and contains an associated ground truth solution. Each problem-solution pair consists of two files, where one contains voxel-wise information and the other file contains scalar information. For example, the i-th sample is stored in the files i.csv and i_info.csv, where i.csv contains all voxel-wise information and i_info.csv contains all scalar information. We define all spatially varying quantities at the center of the voxels, rather than on the vertices or surfaces. This allows for a shape-consistent tensor representation.

    For the i-th sample, the columns of i_info.csv correspond to the following scalar information:

    • E - Young's modulus [Pa]
    • ν - Poisson's ratio [-]
    • σ_ys - a yield stress [Pa]
    • h - discretization size of the voxel grid [m]

    The columns of i.csv correspond to the following voxel-wise information:

    • x, y, z - the indices that state the location of the voxel within the voxel mesh
    • Ω_design - design space information for each voxel. This is a ternary variable that indicates the type of density constraint on the voxel. 0 and 1 indicate that the density is fixed at 0 or 1, respectively. -1 indicates the absence of constraints, i.e., the density in that voxel can be freely optimized
    • Ω_dirichlet_x, Ω_dirichlet_y, Ω_dirichlet_z - homogeneous Dirichlet boundary conditions for each voxel. These are binary variables that define whether the voxel is subject to homogeneous Dirichlet boundary constraints in the respective dimension
• F_x, F_y, F_z - floating point variables that define the three spatial components of external forces applied to each voxel. All forces are body forces given in [N/m^3]
    • density - defines the binary voxel-wise density of the ground truth solution to the topology optimization problem

    How to Import the Dataset

with DL4TO: With the Python library DL4TO (https://github.com/dl4to/dl4to) it is straightforward to download and access the dataset as a customized PyTorch torch.utils.data.Dataset object. As shown in the tutorial, this can be done via:

    from dl4to.datasets import SELTODataset
    
    dataset = SELTODataset(root=root, name=name, train=train)

    Here, root is the path where the dataset should be saved. name is the name of the SELTO subset and can be one of "disc_simple", "disc_complex", "sphere_simple" and "sphere_complex". train is a boolean that indicates whether the corresponding training or validation subset should be loaded. See here for further documentation on the SELTODataset class.

    without DL4TO: After downloading and unzipping, any of the i.csv files can be manually imported into Python as a Pandas dataframe object:

    import pandas as pd
    
    root = ...
    file_path = f'{root}/{i}.csv'
    columns = ['x', 'y', 'z', 'Ω_design','Ω_dirichlet_x', 'Ω_dirichlet_y', 'Ω_dirichlet_z', 'F_x', 'F_y', 'F_z', 'density']
    df = pd.read_csv(file_path, names=columns)

Similarly, we can import an i_info.csv file via:

    file_path = f'{root}/{i}_info.csv'
    info_column_names = ['E', 'ν', 'σ_ys', 'h']
    df_info = pd.read_csv(file_path, names=info_column_names)

    We can extract PyTorch tensors from the Pandas dataframe df using the following function:

    import torch
    
    def get_torch_tensors_from_dataframe(df, dtype=torch.float32):
      shape = df[['x', 'y', 'z']].iloc[-1].values.astype(int) + 1
      voxels = [df['x'].values, df['y'].values, df['z'].values]
    
      Ω_design = torch.zeros(1, *shape, dtype=int)
  Ω_design[:, voxels[0], voxels[1], voxels[2]] = torch.from_numpy(df['Ω_design'].values.astype(int))
    
      Ω_Dirichlet = torch.zeros(3, *shape, dtype=dtype)
      Ω_Dirichlet[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_x'].values, dtype=dtype)
      Ω_Dirichlet[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_y'].values, dtype=dtype)
      Ω_Dirichlet[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_z'].values, dtype=dtype)
    
      F = torch.zeros(3, *shape, dtype=dtype)
      F[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_x'].values, dtype=dtype)
      F[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_y'].values, dtype=dtype)
      F[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_z'].values, dtype=dtype)
    
      density = torch.zeros(1, *shape, dtype=dtype)
      density[:, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['density'].values, dtype=dtype)
    
      return Ω_design, Ω_Dirichlet, F, density
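As a usage sketch, the tensors for the i-th sample can then be assembled from the dataframe df loaded above:

    # Build the design-space, Dirichlet, force, and density tensors for one sample.
    Ω_design, Ω_Dirichlet, F, density = get_torch_tensors_from_dataframe(df)
    print(Ω_design.shape, Ω_Dirichlet.shape, F.shape, density.shape)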

8. kaggle-entity-annotated-corpus-ner-dataset

    • huggingface.co
    Updated Jul 10, 2022
    Cite
    Rafael Arias Calles (2022). kaggle-entity-annotated-corpus-ner-dataset [Dataset]. https://huggingface.co/datasets/rjac/kaggle-entity-annotated-corpus-ner-dataset
    Explore at:
Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Jul 10, 2022
    Authors
    Rafael Arias Calles
    License

Open Database License (ODbL) (https://choosealicense.com/licenses/odbl/)

    Description

Date: 2022-07-10
Files: ner_dataset.csv
Source: Kaggle entity annotated corpus
Notes: the dataset only contains the tokens and NER tag labels. Labels are uppercase.

About Dataset

from Kaggle Datasets

Context

Annotated Corpus for Named Entity Recognition using the GMB (Groningen Meaning Bank) corpus for entity classification, with enhanced and popular natural-language-processing features applied to the data set. Tip: use a Pandas DataFrame to load the dataset if using Python for… See the full description on the dataset page: https://huggingface.co/datasets/rjac/kaggle-entity-annotated-corpus-ner-dataset.
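Following the tip above, a minimal sketch for loading the dataset from the Hugging Face Hub (the split name "train" is an assumption; check the dataset page):

    from datasets import load_dataset

    # Split name "train" is an assumption.
    ds = load_dataset("rjac/kaggle-entity-annotated-corpus-ner-dataset", split="train")
    df = ds.to_pandas()  # tokens and NER tag labels as a DataFrame
    print(df.head())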

9. text2pandas

    • huggingface.co
    Updated Sep 25, 2024
    Cite
    Zeyad Usf (2024). text2pandas [Dataset]. https://huggingface.co/datasets/zeyadusf/text2pandas
    Explore at:
Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Sep 25, 2024
    Authors
    Zeyad Usf
    License

Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    About Data

I found two datasets on Hugging Face for converting text with context into pandas code, but the challenge lies in the context: it differs between the two datasets, which degrades model results. First let's mention the datasets I found, then show examples, the solution, and some other problems. Rahima411/text-to-pandas: the data is divided into Train with 57.5k examples and Test with 19.2k.

    The data has two columns as you can see in the example:… See the full description on the dataset page: https://huggingface.co/datasets/zeyadusf/text2pandas.
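A minimal loading sketch with the Hugging Face datasets library (the split name "train" is an assumption; check the dataset page):

    from datasets import load_dataset

    # Split name "train" is an assumption.
    ds = load_dataset("zeyadusf/text2pandas", split="train")
    print(ds[0])  # a text+context prompt paired with target pandas code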

