100+ datasets found

Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code...
zenodo.org
Updated May 18, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ekaterina Trofimova; Ekaterina Trofimova; Emil Sataev; Anastasia Drozdova; Polina Guseva; Anna Scherbakova; Andrey Ustyuzhanin; Anastasia Gorodilova; Valeriy Berezovskiy; Emil Sataev; Anastasia Drozdova; Polina Guseva; Anna Scherbakova; Andrey Ustyuzhanin; Anastasia Gorodilova; Valeriy Berezovskiy (2024). Code4ML: a Large-scale Dataset of annotated Machine Learning Code [Dataset]. http://doi.org/10.5281/zenodo.11213783
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.11213783
Dataset updated
May 18, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Ekaterina Trofimova; Ekaterina Trofimova; Emil Sataev; Anastasia Drozdova; Polina Guseva; Anna Scherbakova; Andrey Ustyuzhanin; Anastasia Gorodilova; Valeriy Berezovskiy; Emil Sataev; Anastasia Drozdova; Polina Guseva; Anna Scherbakova; Andrey Ustyuzhanin; Anastasia Gorodilova; Valeriy Berezovskiy
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is an enriched version of Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle. The initial corpus consists of ≈ 2.5 million snippets of ML code collected from ≈ 100 thousand Jupyter notebooks. A representative fraction of the snippets is annotated by human assessors through a user-friendly interface specially designed for that purpose.

The data is organized as a set of tables in CSV format. It includes several central entities: raw code blocks collected from Kaggle (code_blocks.csv), kernels (kernels_meta.csv) and competitions meta information (competitions_meta.csv). Manually annotated code blocks are presented as a separate table (murkup_data.csv). As this table contains the numeric id of the code block semantic type, we also provide a mapping from the id to semantic class and subclass (vertices.csv).

Snippets information (code_blocks.csv) can be mapped with kernels meta-data via kernel_id. Kernels metadata is linked to Kaggle competitions information through comp_name. To ensure the quality of the data kernels_meta.csv includes only notebooks with an available Kaggle score.

Automatic classification of code_blocks are stored in data_with_preds.csv. The mapping of this table with code_blocks.csv can be doe through code_blocks_index column, which corresponds to code_blocks indices.

The updated Code4ML 2.0 corpus includes kernels retrieved from Code Kaggle Meta. These kernels correspond to the kaggle competitions launched since 2020. The natural descriptions of the competitions are retrieved with the aim of LLM.

kernels_meta2.csv may contain kernels without Kaggle score, but with the place in the leader board (rank).

Code4ML 2.0 dataset can be used for various purposes, including training and evaluating models for code generation, code understanding, and natural language processing tasks.
Feature Scaling using Python
kaggle.com
Updated Nov 25, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
abhishek (2020). Feature Scaling using Python [Dataset]. https://www.kaggle.com/datasets/abhishekgupta31/feature-scaling-using-python/data
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 25, 2020
Dataset provided by
Kaggle
Authors
abhishek
Description
Dataset

This dataset was created by abhishek

Contents
Rescaled CIFAR-10 dataset
zenodo.org
Updated Jun 27, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Andrzej Perzanowski; Andrzej Perzanowski; Tony Lindeberg; Tony Lindeberg (2025). Rescaled CIFAR-10 dataset [Dataset]. http://doi.org/10.5281/zenodo.15188748
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.15188748
Dataset updated
Jun 27, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Andrzej Perzanowski; Andrzej Perzanowski; Tony Lindeberg; Tony Lindeberg
Description
Motivation

The goal of introducing the Rescaled CIFAR-10 dataset is to provide a dataset that contains scale variations (up to a factor of 4), to evaluate the ability of networks to generalise to scales not present in the training data.

The Rescaled CIFAR-10 dataset was introduced in the paper:

[1] A. Perzanowski and T. Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, Journal of Mathematical Imaging and Vision, 67(29), https://doi.org/10.1007/s10851-025-01245-x.

with a pre-print available at arXiv:

[2] Perzanowski and Lindeberg (2024) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, arXiv preprint arXiv:2409.11140.

Importantly, the Rescaled CIFAR-10 dataset contains substantially more natural textures and patterns than the MNIST Large Scale dataset, introduced in:

[3] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2

and is therefore significantly more challenging.

Access and rights

The Rescaled CIFAR-10 dataset is provided on the condition that you provide proper citation for the original CIFAR-10 dataset:

[4] Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Tech. rep., University of Toronto.

and also for this new rescaled version, using the reference [1] above.

The data set is made available on request. If you would be interested in trying out this data set, please make a request in the system below, and we will grant you access as soon as possible.

The dataset

The Rescaled CIFAR-10 dataset is generated by rescaling 32×32 RGB images of animals and vehicles from the original CIFAR-10 dataset [4]. The scale variations are up to a factor of 4. In order to have all test images have the same resolution, mirror extension is used to extend the images to size 64x64. The imresize() function in Matlab was used for the rescaling, with default anti-aliasing turned on, and bicubic interpolation overshoot removed by clipping to the [0, 255] range. The details of how the dataset was created can be found in [1].

There are 10 distinct classes in the dataset: “airplane”, “automobile”, “bird”, “cat”, “deer”, “dog”, “frog”, “horse”, “ship” and “truck”. In the dataset, these are represented by integer labels in the range [0, 9].

The dataset is split into 40 000 training samples, 10 000 validation samples and 10 000 testing samples. The training dataset is generated using the initial 40 000 samples from the original CIFAR-10 training set. The validation dataset, on the other hand, is formed from the final 10 000 image batch of that same training set. For testing, all test datasets are built from the 10 000 images contained in the original CIFAR-10 test set.

The h5 files containing the dataset

The training dataset file (~5.9 GB) for scale 1, which also contains the corresponding validation and test data for the same scale, is:

cifar10_with_scale_variations_tr40000_vl10000_te10000_outsize64-64_scte1p000_scte1p000.h5

Additionally, for the Rescaled CIFAR-10 dataset, there are 9 datasets (~1 GB each) for testing scale generalisation at scales not present in the training set. Each of these datasets is rescaled using a different image scaling factor, 2^k/4, with k being integers in the range [-4, 4]:

cifar10_with_scale_variations_te10000_outsize64-64_scte0p500.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte0p595.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte0p707.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte0p841.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p000.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p189.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p414.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p682.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte2p000.h5

These dataset files were used for the experiments presented in Figures 9, 10, 15, 16, 20 and 24 in [1].

Instructions for loading the data set

The datasets are saved in HDF5 format, with the partitions in the respective h5 files named as
('/x_train', '/x_val', '/x_test', '/y_train', '/y_test', '/y_val'); which ones exist depends on which data split is used.

The training dataset can be loaded in Python as:

with h5py.File(`

x_train = np.array( f["/x_train"], dtype=np.float32)
x_val = np.array( f["/x_val"], dtype=np.float32)
x_test = np.array( f["/x_test"], dtype=np.float32)
y_train = np.array( f["/y_train"], dtype=np.int32)
y_val = np.array( f["/y_val"], dtype=np.int32)
y_test = np.array( f["/y_test"], dtype=np.int32)

We also need to permute the data, since Pytorch uses the format [num_samples, channels, width, height], while the data is saved as [num_samples, width, height, channels]:

x_train = np.transpose(x_train, (0, 3, 1, 2))
x_val = np.transpose(x_val, (0, 3, 1, 2))
x_test = np.transpose(x_test, (0, 3, 1, 2))

The test datasets can be loaded in Python as:

with h5py.File(`

x_test = np.array( f["/x_test"], dtype=np.float32)
y_test = np.array( f["/y_test"], dtype=np.int32)

The test datasets can be loaded in Matlab as:

x_test = h5read(`

The images are stored as [num_samples, x_dim, y_dim, channels] in HDF5 files. The pixel intensity values are not normalised, and are in a [0, 255] range.
Z
Multimodal Vision-Audio-Language Dataset
data.niaid.nih.gov
zenodo.org
Updated Jul 11, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Choksi, Bhavin (2024). Multimodal Vision-Audio-Language Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10060784
Explore at:
Dataset updated
Jul 11, 2024
Dataset provided by
Roig, Gemma
Choksi, Bhavin
Schaumlöffel, Timothy
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities. Details can be found in the attached report. Annotation The annotation files are provided as Parquet files. They can be read using Python and the pandas and pyarrow library. The split into train, validation and test set follows the split of the original datasets. Installation

pip install pandas pyarrow Example

import pandas as pddf = pd.read_parquet('annotation_train.parquet', engine='pyarrow')print(df.iloc[0])

dataset AudioSet filename train/---2_BBVHAA.mp3 captions_visual [a man in a black hat and glasses.] captions_auditory [a man speaks and dishes clank.] tags [Speech] Description The annotation file consists of the following fields:filename: Name of the corresponding file (video or audio file)dataset: Source dataset associated with the data pointcaptions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual contentcaptions_auditory: A list of captions related to the auditory content of the videotags: A list of tags, classifying the sound of a file. It can be NaN if no tags are provided Data files The raw data files for most datasets are not released due to licensing issues. They must be downloaded from the source. However, due to missing files, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de
Python package Modin
kaggle.com
Updated Nov 15, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kaihua Zhang (2020). Python package Modin [Dataset]. https://www.kaggle.com/zhangkaihua88/pythonmodin/code
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 15, 2020
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Kaihua Zhang
Description
Dataset

This dataset was created by Kaihua Zhang

Contents
h
PySecDB
huggingface.co
Updated Oct 18, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
SunLab-GMU (2023). PySecDB [Dataset]. https://huggingface.co/datasets/sunlab/PySecDB
Explore at:
Dataset updated
Oct 18, 2023
Dataset authored and provided by
SunLab-GMU
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
PySecDB: security commit dataset in Python

Description

To foster large-scale research on vulnerability mitigation and to enable a comparison of different detection approaches, we make our dataset PySecDB from our ICSME23 paper publicly available. PySecDB is a real-world Python security commit dataset that contains around 1.2K security commits and 2.8K non-security commits. You can find more details on the dataset in the paper "Exploring Security Commits in Python".… See the full description on the dataset page: https://huggingface.co/datasets/sunlab/PySecDB.
P
DyPyBench Docker Dataset
paperswithcode.com
Updated Dec 19, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2023). DyPyBench Docker Dataset [Dataset]. https://paperswithcode.com/dataset/dypybench-docker
Explore at:
Dataset updated
Dec 19, 2023
Description
The first benchmark of Python projects that is large-scale, diverse, ready-to-run (i.e., with fully configured and prepared test suites), and ready-to-analyze (i.e., using an integrated Python dynamic analysis framework). The benchmark encompasses 50 popular open-source projects from various application domains, with a total of 681K lines of Python code, and 30K test cases.
P
Project CodeNet Dataset
paperswithcode.com
Updated Jun 10, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ruchir Puri; David S. Kung; Geert Janssen; Wei zhang; Giacomo Domeniconi; Vladimir Zolotov; Julian Dolby; Jie Chen; Mihir Choudhury; Lindsey Decker; Veronika Thost; Luca Buratti; Saurabh Pujar; Shyam Ramji; Ulrich Finkler; Susan Malaika; Frederick Reiss (2022). Project CodeNet Dataset [Dataset]. https://paperswithcode.com/dataset/project-codenet
Explore at:
Dataset updated
Jun 10, 2022
Authors
Ruchir Puri; David S. Kung; Geert Janssen; Wei zhang; Giacomo Domeniconi; Vladimir Zolotov; Julian Dolby; Jie Chen; Mihir Choudhury; Lindsey Decker; Veronika Thost; Luca Buratti; Saurabh Pujar; Shyam Ramji; Ulrich Finkler; Susan Malaika; Frederick Reiss
Description
Project CodeNet is a large-scale dataset with approximately 14 million code samples, each of which is an intended solution to one of 4000 coding problems. The code samples are written in over 50 programming languages (although the dominant languages are C++, C, Python, and Java) and they are annotated with a rich set of information, such as its code size, memory footprint, cpu run time, and status, which indicates acceptance or error types. The dataset is accompanied by a repository, where we provide a set of tools to aggregate codes samples based on user criteria and to transform code samples into token sequences, simplified parse trees and other code graphs. A detailed discussion of Project CodeNet is available in this paper.

The rich annotation of Project CodeNet enables research in code search, code completion, code-code translation, and a myriad of other use cases. We also extracted several benchmarks in Python, Java and C++ to drive innovation in deep learning and machine learning models in code classification and code similarity.

Citation @inproceedings{puri2021codenet, author = {Ruchir Puri and David Kung and Geert Janssen and Wei Zhang and Giacomo Domeniconi and Vladmir Zolotov and Julian Dolby and Jie Chen and Mihir Choudhury and Lindsey Decker and Veronika Thost and Luca Buratti and Saurabh Pujar and Ulrich Finkler}, title = {Project CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks}, year = {2021}, }
T
wikihow
tensorflow.org
paperswithcode.com
+2more
Updated Dec 6, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2022). wikihow [Dataset]. https://www.tensorflow.org/datasets/catalog/wikihow
Explore at:
Dataset updated
Dec 6, 2022
Description
WikiHow is a new large-scale dataset using the online WikiHow (http://www.wikihow.com/) knowledge base.

There are two features: - text: wikihow answers texts. - headline: bold lines as summary.

There are two separate versions: - all: consisting of the concatenation of all paragraphs as the articles and the bold lines as the reference summaries. - sep: consisting of each paragraph and its summary.

Download "wikihowAll.csv" and "wikihowSep.csv" from https://github.com/mahnazkoupaee/WikiHow-Dataset and place them in manual folder https://www.tensorflow.org/datasets/api_docs/python/tfds/download/DownloadConfig. Train/validation/test splits are provided by the authors. Preprocessing is applied to remove short articles (abstract length < 0.75 article length) and clean up extra commas.

To use this dataset:

import tensorflow_datasets as tfds ds = tfds.load('wikihow', split='train') for ex in ds.take(4): print(ex)

See the guide for more informations on tensorflow_datasets.
T
bccd
tensorflow.org
datasetninja.com
Updated Dec 6, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2022). bccd [Dataset]. https://www.tensorflow.org/datasets/catalog/bccd
Explore at:
Dataset updated
Dec 6, 2022
Description
BCCD Dataset is a small-scale dataset for blood cells detection.

Thanks the original data and annotations from cosmicad and akshaylamba. The original dataset is re-organized into VOC format. BCCD Dataset is under MIT licence.

Data preparation is important to use machine learning. In this project, the Faster R-CNN algorithm from keras-frcnn for Object Detection is used. From this dataset, nicolaschen1 developed two Python scripts to make preparation data (CSV file and images) for recognition of abnormalities in blood cells on medical images.

export.py: it creates the file "test.csv" with all data needed: filename, class_name, x1,y1,x2,y2. plot.py: it plots the boxes for each image and save it in a new directory.

Image Type : jpeg(JPEG) Width x Height : 640 x 480

To use this dataset:

import tensorflow_datasets as tfds ds = tfds.load('bccd', split='train') for ex in ds.take(4): print(ex)

See the guide for more informations on tensorflow_datasets.

https://storage.googleapis.com/tfds-data/visualization/fig/bccd-1.0.0.png" alt="Visualization" width="500px">
Dataset of Jupyter Notebooks from the paper "A Large-Scale Comparison of...
zenodo.org
bin, csv
Updated May 17, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Konstantin Grotov; Konstantin Grotov; Sergey Titov; Sergey Titov; Vladimir Sotnikov; Vladimir Sotnikov; Yaroslav Golubev; Yaroslav Golubev; Timofey Bryksin; Timofey Bryksin (2022). Dataset of Jupyter Notebooks from the paper "A Large-Scale Comparison of Python Code in Jupyter Notebooks and Scripts" [Dataset]. http://doi.org/10.5281/zenodo.6383115
Explore at:
csv, binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.6383115
Dataset updated
May 17, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Konstantin Grotov; Konstantin Grotov; Sergey Titov; Sergey Titov; Vladimir Sotnikov; Vladimir Sotnikov; Yaroslav Golubev; Yaroslav Golubev; Timofey Bryksin; Timofey Bryksin
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This archive contains the dataset of properly-licensed Jupyter notebooks from the MSR'22 paper "A Large-Scale Comparison of Python Code in Jupyter Notebooks and Scripts". The dataset contains 847,881 notebooks stored in the PostgreSQL dump file. You can find the details about the database in the README file. To transform the notebooks into this convenient format and to calcuate the structural metrics, we used our library called Matroskin, which can be found here: https://github.com/JetBrains-Research/Matroskin.
d
(HS 2) Automate Workflows using Jupyter notebook to create Large Extent...
search.dataone.org
hydroshare.org
Updated Oct 19, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Young-Don Choi (2024). (HS 2) Automate Workflows using Jupyter notebook to create Large Extent Spatial Datasets [Dataset]. http://doi.org/10.4211/hs.a52df87347ef47c388d9633925cde9ad
Explore at:
Unique identifier
https://doi.org/10.4211/hs.a52df87347ef47c388d9633925cde9ad
Dataset updated
Oct 19, 2024
Dataset provided by
Hydroshare
Authors
Young-Don Choi
Description
We implemented automated workflows using Jupyter notebooks for each state. The GIS processing, crucial for merging, extracting, and projecting GeoTIFF data, was performed using ArcPy—a Python package for geographic data analysis, conversion, and management within ArcGIS (Toms, 2015). After generating state-scale LES (large extent spatial) datasets in GeoTIFF format, we utilized the xarray and rioxarray Python packages to convert GeoTIFF to NetCDF. Xarray is a Python package to work with multi-dimensional arrays and rioxarray is rasterio xarray extension. Rasterio is a Python library to read and write GeoTIFF and other raster formats. Xarray facilitated data manipulation and metadata addition in the NetCDF file, while rioxarray was used to save GeoTIFF as NetCDF. These procedures resulted in the creation of three HydroShare resources (HS 3, HS 4 and HS 5) for sharing state-scale LES datasets. Notably, due to licensing constraints with ArcGIS Pro, a commercial GIS software, the Jupyter notebook development was undertaken on a Windows OS.
A Dataset for a Paper Entitled "A Large-Scale Security-Oriented Static...
zenodo.org
csv
Updated Jun 19, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jukka Ruohonen; Jukka Ruohonen; Kalle Hjerppe; Kalle Hjerppe; Kalle Rindell; Kalle Rindell (2025). A Dataset for a Paper Entitled "A Large-Scale Security-Oriented Static Analysis of Python Packages in PyPI" [Dataset]. http://doi.org/10.5281/zenodo.15694891
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.15694891
Dataset updated
Jun 19, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Jukka Ruohonen; Jukka Ruohonen; Kalle Hjerppe; Kalle Hjerppe; Kalle Rindell; Kalle Rindell
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The deposit provides the data used in the paper.
Z
Data from: Pangeo-Enabled ESM Pattern Scaling (PEEPS): A customizable...
data.niaid.nih.gov
Updated Jan 22, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Snyder, Abigail (2023). Pangeo-Enabled ESM Pattern Scaling (PEEPS): A customizable dataset of emulated Earth System Model output [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7139977
Explore at:
Dataset updated
Jan 22, 2023
Dataset provided by
Kravitz, Ben
Snyder, Abigail
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Earth
Description
We produce a dataset that uses pattern scaling, a common method of emulating climate models. Our dataset is built on the Pangeo CMIP6 archive, which has the advantage that we don't need to actually download the climate model output. Here we demonstrate the utility of our dataset, called Pangeo-Enabled ESM Pattern Scaling (PEEPS). The dataset, which is encapsulated in a Jupyter notebook (and replicated in a Python file), is flexible and can be extended to multiple scenarios and multiple variables, as long as they are in the Pangeo-accessible archive.
h
CodeReasoningPro
huggingface.co
Updated May 6, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
M Mashhudur Rahim (2025). CodeReasoningPro [Dataset]. https://huggingface.co/datasets/XythicK/CodeReasoningPro
Explore at:
Dataset updated
May 6, 2025
Authors
M Mashhudur Rahim
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
CodeReasoningPro

Dataset Summary

CodeReasoningPro is a large-scale synthetic dataset comprising 1,785,725 competitive programming problems in Python, created by XythicK, an MLOps Engineer. Designed for supervised fine-tuning (SFT) of machine learning models for coding tasks, it draws inspiration from datasets like OpenCodeReasoning. The dataset includes problem statements, Python solutions, and reasoning explanations, covering algorithmic topics such as arrays, subarrays… See the full description on the dataset page: https://huggingface.co/datasets/XythicK/CodeReasoningPro.
US Military Pay Scales
kaggle.com
Updated May 9, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
MarkByrne (2025). US Military Pay Scales [Dataset]. https://www.kaggle.com/rugbyrne/us-military-pay-scales/discussion
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 9, 2025
Dataset provided by
Kaggle
Authors
MarkByrne
License
http://www.gnu.org/licenses/lgpl-3.0.htmlhttp://www.gnu.org/licenses/lgpl-3.0.html
Description
This dataset compiles historical US Military pay scales, broken down by rank, time in service, and year. Dataset includes historical data from 1949-Present for Active Duty military pay. Includes Ranks E-1 through O-10, and from <2 Years in Service to 40 years in service.

Raw Data PDFs can be found in milpayscraper/files.
T
winogrande
tensorflow.org
opendatalab.com
Updated Dec 11, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). winogrande [Dataset]. https://www.tensorflow.org/datasets/catalog/winogrande
Explore at:
Dataset updated
Dec 11, 2024
Description
The WinoGrande, a large-scale dataset of 44k problems, inspired by the original Winograd Schema Challenge design, but adjusted to improve both the scale and the hardness of the dataset.

To use this dataset:

import tensorflow_datasets as tfds ds = tfds.load('winogrande', split='train') for ex in ds.take(4): print(ex)

See the guide for more informations on tensorflow_datasets.
P
LSMI Dataset
paperswithcode.com
Updated Nov 15, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dongyoung Kim; Jinwoo Kim; Seonghyeon Nam; Dongwoo Lee; Yeonkyung Lee; Nahyup Kang; Hyong-Euk Lee; ByungIn Yoo; Jae-Joon Han; Seon Joo Kim (2022). LSMI Dataset [Dataset]. https://paperswithcode.com/dataset/lsmi
Explore at:
Dataset updated
Nov 15, 2022
Authors
Dongyoung Kim; Jinwoo Kim; Seonghyeon Nam; Dongwoo Lee; Yeonkyung Lee; Nahyup Kang; Hyong-Euk Lee; ByungIn Yoo; Jae-Joon Han; Seon Joo Kim
Description
Large Scale Multi-Illuminant (LSMI) Dataset for Developing White Balance Algorithm under Mixed Illumination (ICCV 2021)

Change Log LSMI Dataset Version : 1.1

1.0 : LSMI dataset released. (Aug 05, 2021)

1.1 : Add option for saving sub-pair images for 3-illuminant scene (ex. _1,_12,_13) & saving subtracted image (ex. _2,_3,_23) (Feb 20, 2022)

About [Paper] [Project site] [Download Dataset] [Video]

This is an official repository of "Large Scale Multi-Illuminant (LSMI) Dataset for Developing White Balance Algorithm under Mixed Illumination", which is accepted as a poster in ICCV 2021.

This repository provides 1. Preprocessing code of "Large Scale Multi Illuminant (LSMI) Dataset" 2. Code of Pixel-level illumination inference U-Net 3. Pre-trained model parameter for testing U-Net

If you use our code or dataset, please cite our paper: @inproceedings{kim2021large, title={Large Scale Multi-Illuminant (LSMI) Dataset for Developing White Balance Algorithm Under Mixed Illumination}, author={Kim, Dongyoung and Kim, Jinwoo and Nam, Seonghyeon and Lee, Dongwoo and Lee, Yeonkyung and Kang, Nahyup and Lee, Hyong-Euk and Yoo, ByungIn and Han, Jae-Joon and Kim, Seon Joo}, booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision}, pages={2410--2419}, year={2021} }

Requirements Our running environment is as follows:

Python version 3.8.3 Pytorch version 1.7.0 CUDA version 11.2

We provide a docker image, which supports all extra requirements (ex. dcraw,rawpy,tensorboard...), including specified version of python, pytorch, CUDA above.

You can download the docker image here.

The following instructions are assumed to run in a docker container that uses the docker image we provided.

Getting Started Clone this repo In the docker container, clone this repository first.

sh git clone https://github.com/DY112/LSMI-dataset.git

Download the LSMI dataset You should first download the LSMI dataset from here.

The dataset is composed of 3 sub-folers named "galaxy", "nikon", "sony".

Folders named by each camera include several scenes, and each scene folder contains full-resolution RAW files and JPG files that is converted to sRGB color space.

Move all three folders to the root of cloned repository.

In each sub-folders, we provides metadata (meta.json), and train/val/test scene index (split.json).

In meta.json, we provides following informations.

NumOfLights : Number of illuminants in the scene MCCCoord : Locations of Macbeth color chart Light1,2,3 : Normalized chromaticities of each illuminant (calculated through running 1_make_mixture_map.py)

Preprocess the LSMI dataset

Convert raw images to tiff files

To convert original 1-channel bayer-pattern images to 3-channel RGB tiff images, run following code:

sh python 0_cvt2tiff.py You should modify SOURCE and EXT variables properly.

The converted tiff files are generated at the same location as the source file.

This process uses DCRAW command, with '-h -D -4 -T' as options.

There is no black level subtraction, saturated pixel clipping or else.

You can change the parameters as appropriate for your purpose.

Make mixture map sh python 1_make_mixture_map.py Change the CAMERA variable properly to the target directory you want.

This code does the following operations for each scene:

Subtract black level (no saturation clipping) Use Macbeth Color Chart's achromatic patches, find each illuminant's chromaticities Use green channel pixel values, calculate pixel level illuminant mixture map Mask uncalculable pixel positions (which have 0 as value for all scene pairs) to ZERO_MASK

After running this code, npy tpye mixture map data will be generated at each scene's directory.

:warning: If you run this code with ZERO_MASK=-1, the full resolution mixture map may contains -1 for uncalculable pixels. You MUST replace this value appropriately before resizing to prevent this negative value from interpolating with other values.

Crop for train/test U-Net (Optional) sh python 2_preprocess_data.py

This preprocessing code is written only for U-Net, so you can skip this step and freely process the full resolution LSMI set (tiff and npy files).

The image and the mixture map are resized as a square with a length of the SIZE variable inside the code, and the ground-truth image is also generated.

Note that the side of the image will be cropped to make the image shape square.

If you don't want to crop the side of the image and just want to resize whole image anyway, use SQUARE_CROP=False

We set the default test size to 256, and set train size to 512, and SQUARE_CROP=True.

The new dataset is created in a folder with the name of the CAMERA_SIZE. (Ex. galaxy_512)

Use U-Net for pixel-level AWB You can download pre-trained model parameter here.

Pre-trained model is trained on 512x512 data with random crop & random pixel level relighting augmentation method.

Locate downloaded models folder into SVWB_Unet.

Test U-Net sh cd SVWB_Unet sh test.sh

Train U-Net sh cd SVWB_Unet sh train.sh

Dataset License http://creativecommons.org/licenses/by-nc/4.0/">https://i.creativecommons.org/l/by-nc/4.0/88x31.png" />
This work is licensed under a http://creativecommons.org/licenses/by-nc/4.0/">Creative Commons Attribution-NonCommercial 4.0 International License.
Rescaled Fashion-MNIST with translations dataset
zenodo.org
Updated Jun 27, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Andrzej Perzanowski; Andrzej Perzanowski; Tony Lindeberg; Tony Lindeberg (2025). Rescaled Fashion-MNIST with translations dataset [Dataset]. http://doi.org/10.5281/zenodo.15188439
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.15188439
Dataset updated
Jun 27, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Andrzej Perzanowski; Andrzej Perzanowski; Tony Lindeberg; Tony Lindeberg
Time period covered
Apr 10, 2025
Description
Motivation

The goal of introducing the Rescaled Fashion-MNIST with translations dataset is to provide a dataset that contains scale variations (up to a factor of 4), to evaluate the ability of networks to generalise to scales not present in the training data, and to additionally provide a way to test network object detection and object localisation abilities on image data where the objects are not centred.

The Rescaled Fashion-MNIST with translations dataset was introduced in the paper:

[1] A. Perzanowski and T. Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, Journal of Mathematical Imaging and Vision, 67(29), https://doi.org/10.1007/s10851-025-01245-x.

with a pre-print available at arXiv:

[2] Perzanowski and Lindeberg (2024) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, arXiv preprint arXiv:2409.11140.

Importantly, the Rescaled Fashion-MNIST with translations dataset is more challenging than the MNIST Large Scale dataset, introduced in:

[3] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2.

Access and rights

The Rescaled Fashion-MNIST with translations dataset is provided on the condition that you provide proper citation for the original Fashion-MNIST dataset:

[4] Xiao, H., Rasul, K., and Vollgraf, R. (2017) “Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms”, arXiv preprint arXiv:1708.07747

and also for this new rescaled version, using the reference [1] above.

The data set is made available on request. If you would be interested in trying out this data set, please make a request in the system below, and we will grant you access as soon as possible.

The dataset

The Rescaled FashionMNIST with translations dataset is generated by rescaling 28×28 gray-scale images of clothes from the original FashionMNIST dataset [4]. The scale variations are up to a factor of 4, and the images are embedded within black images of size 72x72. The objects within the images have also been randomly shifted in the spatial domain, with the object always at least 4 pixels away from the image boundary. The imresize() function in Matlab was used for the rescaling, with default anti-aliasing turned on, and bicubic interpolation overshoot removed by clipping to the [0, 255] range. The details of how the dataset was created can be found in [1].

There are 10 different classes in the dataset: “T-shirt/top”, “trouser”, “pullover”, “dress”, “coat”, “sandal”, “shirt”, “sneaker”, “bag” and “ankle boot”. In the dataset, these are represented by integer labels in the range [0, 9].

The dataset is split into 50 000 training samples, 10 000 validation samples and 10 000 testing samples. The training dataset is generated using the initial 50 000 samples from the original Fashion-MNIST training set. The validation dataset, on the other hand, is formed from the final 10 000 images of that same training set. For testing, all test datasets are built from the 10 000 images contained in the original Fashion-MNIST test set.

The h5 files containing the dataset

The training dataset file (~2.9 GB) for scale 1, which also contains the corresponding validation and test data for the same scale, is:

fashionmnist_with_scale_variations_and_translations_tr50000_vl10000_te10000_outsize72-72_scte1p000_scte1p000.h5

Additionally, for the Rescaled FashionMNIST with translations dataset, there are 9 datasets (~415 MB each) for testing scale generalisation at scales not present in the training set. Each of these datasets is rescaled using a different image scaling factor, 2^k/4, with k being integers in the range [-4, 4]:

fashionmnist_with_scale_variations_and_translations_te10000_outsize72-72_scte0p500.h5
fashionmnist_with_scale_variations_and_translations_te10000_outsize72-72_scte0p595.h5
fashionmnist_with_scale_variations_and_translations_te10000_outsize72-72_scte0p707.h5
fashionmnist_with_scale_variations_and_translations_te10000_outsize72-72_scte0p841.h5
fashionmnist_with_scale_variations_and_translations_te10000_outsize72-72_scte1p000.h5
fashionmnist_with_scale_variations_and_translations_te10000_outsize72-72_scte1p189.h5
fashionmnist_with_scale_variations_and_translations_te10000_outsize72-72_scte1p414.h5
fashionmnist_with_scale_variations_and_translations_te10000_outsize72-72_scte1p682.h5
fashionmnist_with_scale_variations_and_translations_te10000_outsize72-72_scte2p000.h5

These dataset files were used for the experiments presented in Figure 8 in [1].

Instructions for loading the data set

The datasets are saved in HDF5 format, with the partitions in the respective h5 files named as
('/x_train', '/x_val', '/x_test', '/y_train', '/y_test', '/y_val'); which ones exist depends on which data split is used.

The training dataset can be loaded in Python as:

with h5py.File(`

x_train = np.array( f["/x_train"], dtype=np.float32)
x_val = np.array( f["/x_val"], dtype=np.float32)
x_test = np.array( f["/x_test"], dtype=np.float32)
y_train = np.array( f["/y_train"], dtype=np.int32)
y_val = np.array( f["/y_val"], dtype=np.int32)
y_test = np.array( f["/y_test"], dtype=np.int32)

We also need to permute the data, since Pytorch uses the format [num_samples, channels, width, height], while the data is saved as [num_samples, width, height, channels]:

x_train = np.transpose(x_train, (0, 3, 1, 2))
x_val = np.transpose(x_val, (0, 3, 1, 2))
x_test = np.transpose(x_test, (0, 3, 1, 2))

The test datasets can be loaded in Python as:

with h5py.File(`

x_test = np.array( f["/x_test"], dtype=np.float32)
y_test = np.array( f["/y_test"], dtype=np.int32)

The test datasets can be loaded in Matlab as:

x_test = h5read(`

The images are stored as [num_samples, x_dim, y_dim, channels] in HDF5 files. The pixel intensity values are not normalised, and are in a [0, 255] range.

There is also a closely related Fashion-MNIST dataset, which in addition to scaling variations keeps the objects in the frame centred, meaning no spatial translations are used.
d
Multi-task Deep Learning for Water Temperature and Streamflow Prediction...
catalog.data.gov
Updated Jul 6, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2024). Multi-task Deep Learning for Water Temperature and Streamflow Prediction (ver. 1.1, June 2022) [Dataset]. https://catalog.data.gov/dataset/multi-task-deep-learning-for-water-temperature-and-streamflow-prediction-ver-1-1-june-2022
Explore at:
Dataset updated
Jul 6, 2024
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Description
This item contains data and code used in experiments that produced the results for Sadler et. al (2022) (see below for full reference). We ran five experiments for the analysis, Experiment A, Experiment B, Experiment C, Experiment D, and Experiment AuxIn. Experiment A tested multi-task learning for predicting streamflow with 25 years of training data and using a different model for each of 101 sites. Experiment B tested multi-task learning for predicting streamflow with 25 years of training data and using a single model for all 101 sites. Experiment C tested multi-task learning for predicting streamflow with just 2 years of training data. Experiment D tested multi-task learning for predicting water temperature with over 25 years of training data. Experiment AuxIn used water temperature as an input variable for predicting streamflow. These experiments and their results are described in detail in the WRR paper. Data from a total of 101 sites across the US was used for the experiments. The model input data and streamflow data were from the Catchment Attributes and Meteorology for Large-sample Studies (CAMELS) dataset (Newman et. al 2014, Addor et. al 2017). The water temperature data were gathered from the National Water Information System (NWIS) (U.S. Geological Survey, 2016). The contents of this item are broken into 13 files or groups of files aggregated into zip files:

input_data_processing.zip: A zip file containing the scripts used to collate the observations, input weather drivers, and catchment attributes for the multi-task modeling experiments

flow_observations.zip: A zip file containing collated daily streamflow data for the sites used in multi-task modeling experiments. The streamflow data were originally accessed from the CAMELs dataset. The data are stored in csv and Zarr formats.

temperature_observations.zip: A zip file containing collated daily water temperature data for the sites used in multi-task modeling experiments. The data were originally accessed via NWIS. The data are stored in csv and Zarr formats.

temperature_sites.geojson: Geojson file of the locations of the water temperature and streamflow sites used in the analysis.

model_drivers.zip: A zip file containing the daily input weather driver data for the multi-task deep learning models. These data are from the Daymet drivers and were collated from the CAMELS dataset. The data are stored in csv and Zarr formats.

catchment_attrs.csv: Catchment attributes collatted from the CAMELS dataset. These data are used for the Random Forest modeling. For full metadata regarding these data see CAMELS dataset.

experiment_workflow_files.zip: A zip file containing workflow definitions used to run multi-task deep learning experiments. These are Snakemake workflows. To run a given experiment, one would run (for experiment A) 'snakemake -s expA_Snakefile --configfile expA_config.yml'

river-dl-paper_v0.zip: A zip file containing python code used to run multi-task deep learning experiments. This code was called by the Snakemake workflows contained in 'experiment_workflow_files.zip'.

random_forest_scripts.zip: A zip file containing Python code and a Python Jupyter Notebook used to prepare data for, train, and visualize feature importance of a Random Forest model.

plotting_code.zip: A zip file containing python code and Snakemake workflow used to produce figures showing the results of multi-task deep learning experiments.

results.zip: A zip file containing results of multi-task deep learning experiments. The results are stored in csv and netcdf formats. The netcdf files were used by the plotting libraries in 'plotting_code.zip'. These files are for five experiments, 'A', 'B', 'C', 'D', and 'AuxIn'. These experiment names are shown in the file name.

sample_scripts.zip: A zip file containing scripts for creating sample output to demonstrate how the modeling workflow was executed.

sample_output.zip: A zip file containing sample output data. Similar files are created by running the sample scripts provided.

A. Newman; K. Sampson; M. P. Clark; A. Bock; R. J. Viger; D. Blodgett, 2014. A large-sample watershed-scale hydrometeorological dataset for the contiguous USA. Boulder, CO: UCAR/NCAR. https://dx.doi.org/10.5065/D6MW2F4D

N. Addor, A. Newman, M. Mizukami, and M. P. Clark, 2017. Catchment attributes for large-sample studies. Boulder, CO: UCAR/NCAR. https://doi.org/10.5065/D6G73C3Q

Sadler, J. M., Appling, A. P., Read, J. S., Oliver, S. K., Jia, X., Zwart, J. A., & Kumar, V. (2022). Multi-Task Deep Learning of Daily Streamflow and Water Temperature. Water Resources Research, 58(4), e2021WR030138. https://doi.org/10.1029/2021WR030138

U.S. Geological Survey, 2016, National Water Information System data available on the World Wide Web (USGS Water Data for the Nation), accessed Dec. 2020.

Facebook

Twitter

Click to copy link

Link copied

Cite

Ekaterina Trofimova; Ekaterina Trofimova; Emil Sataev; Anastasia Drozdova; Polina Guseva; Anna Scherbakova; Andrey Ustyuzhanin; Anastasia Gorodilova; Valeriy Berezovskiy; Emil Sataev; Anastasia Drozdova; Polina Guseva; Anna Scherbakova; Andrey Ustyuzhanin; Anastasia Gorodilova; Valeriy Berezovskiy (2024). Code4ML: a Large-scale Dataset of annotated Machine Learning Code [Dataset]. http://doi.org/10.5281/zenodo.11213783

Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code

Explore at:

Unique identifier

https://doi.org/10.5281/zenodo.11213783

Dataset updated

May 18, 2024

Dataset provided by

Zenodohttp://zenodo.org/

Authors

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This is an enriched version of Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle. The initial corpus consists of ≈ 2.5 million snippets of ML code collected from ≈ 100 thousand Jupyter notebooks. A representative fraction of the snippets is annotated by human assessors through a user-friendly interface specially designed for that purpose.

The data is organized as a set of tables in CSV format. It includes several central entities: raw code blocks collected from Kaggle (code_blocks.csv), kernels (kernels_meta.csv) and competitions meta information (competitions_meta.csv). Manually annotated code blocks are presented as a separate table (murkup_data.csv). As this table contains the numeric id of the code block semantic type, we also provide a mapping from the id to semantic class and subclass (vertices.csv).

Snippets information (code_blocks.csv) can be mapped with kernels meta-data via kernel_id. Kernels metadata is linked to Kaggle competitions information through comp_name. To ensure the quality of the data kernels_meta.csv includes only notebooks with an available Kaggle score.

Automatic classification of code_blocks are stored in data_with_preds.csv. The mapping of this table with code_blocks.csv can be doe through code_blocks_index column, which corresponds to code_blocks indices.

The updated Code4ML 2.0 corpus includes kernels retrieved from Code Kaggle Meta. These kernels correspond to the kaggle competitions launched since 2020. The natural descriptions of the competitions are retrieved with the aim of LLM.

kernels_meta2.csv may contain kernels without Kaggle score, but with the place in the leader board (rank).

Code4ML 2.0 dataset can be used for various purposes, including training and evaluating models for code generation, code understanding, and natural language processing tasks.

Clear search

Close search

Google apps

Main menu

Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code...

Feature Scaling using Python

Dataset

Contents

Rescaled CIFAR-10 dataset

Motivation

Access and rights

The dataset

The h5 files containing the dataset

Instructions for loading the data set

Multimodal Vision-Audio-Language Dataset

Python package Modin

Dataset

Contents

PySecDB

DyPyBench Docker Dataset

Project CodeNet Dataset

wikihow

bccd

Dataset of Jupyter Notebooks from the paper "A Large-Scale Comparison of...

(HS 2) Automate Workflows using Jupyter notebook to create Large Extent...

A Dataset for a Paper Entitled "A Large-Scale Security-Oriented Static...

Data from: Pangeo-Enabled ESM Pattern Scaling (PEEPS): A customizable...

CodeReasoningPro

US Military Pay Scales

winogrande

LSMI Dataset

Rescaled Fashion-MNIST with translations dataset