44 datasets found
  1. Meta Kaggle Code

    • kaggle.com
    Updated Jun 5, 2025
    Cite
    Kaggle (2025). Meta Kaggle Code [Dataset]. https://www.kaggle.com/datasets/kaggle/meta-kaggle-code/code
    Explore at:
    Dataset updated
    Jun 5, 2025
    Dataset authored and provided by
    Kaggle (http://kaggle.com/)
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Explore our public notebook content!

    Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.

    Why we’re releasing this dataset

    By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.

    Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.

    The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!

    Sensitive data

    While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.

    Joining with Meta Kaggle

    The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.
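
    For example, here is a minimal sketch of looking up notebook metadata for one code file by joining on those ids, assuming the usual Meta Kaggle files KernelVersions.csv and Kernels.csv with Id, KernelId, and TotalVotes columns (verify the headers before relying on them):

      import pandas as pd

      # Paths are placeholders; point them at your local copies of Meta Kaggle.
      kernel_versions = pd.read_csv("meta-kaggle/KernelVersions.csv")
      kernels = pd.read_csv("meta-kaggle/Kernels.csv")

      version_id = 123456789  # id taken from a Meta Kaggle Code file name, e.g. 123456789.ipynb

      version_row = kernel_versions.loc[kernel_versions["Id"] == version_id]
      # Join to the parent kernel to pull in, e.g., vote counts (column names assumed).
      meta = version_row.merge(kernels, left_on="KernelId", right_on="Id", suffixes=("_version", "_kernel"))
      print(meta[["Id_version", "KernelId", "TotalVotes"]])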

    File organization

    The files are organized into a two-level directory structure. Each top level folder contains up to 1 million files, e.g. - folder 123 contains all versions from 123,000,000 to 123,999,999. Each sub folder contains up to 1 thousand files, e.g. - 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
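
    Under that scheme, the directory for a given version id can be derived arithmetically; a small sketch (the file extension depends on whether the notebook is Python or R):

      def code_file_prefix(version_id: int) -> str:
          """Return the 'top/sub' folder path implied by the two-level layout described above."""
          top = version_id // 1_000_000        # e.g. 123 for ids 123,000,000-123,999,999
          sub = (version_id // 1_000) % 1_000  # e.g. 456 for ids 123,456,000-123,456,999
          return f"{top}/{sub}/{version_id}"

      print(code_file_prefix(123456789))  # -> "123/456/123456789"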

    The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays
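
    A hedged sketch of pulling a single notebook (with outputs) from that requester-pays bucket using the google-cloud-storage client; the billing project id and the object path are placeholders:

      from google.cloud import storage

      # Requester pays: the named GCP project is billed for the download.
      client = storage.Client(project="your-billing-project")
      bucket = client.bucket("kaggle-meta-kaggle-code-downloads", user_project="your-billing-project")

      # Object path is illustrative; the bucket layout should mirror the folder structure described above.
      blob = bucket.blob("123/456/123456789.ipynb")
      blob.download_to_filename("123456789.ipynb")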

    Questions / Comments

    We love feedback! Let us know in the Discussion tab.

    Happy Kaggling!

  2. Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code...

    • zenodo.org
    csv
    Updated Sep 15, 2023
    + more versions
    Cite
    Anonymous authors; Anonymous authors (2023). Code4ML: a Large-scale Dataset of annotated Machine Learning Code [Dataset]. http://doi.org/10.5281/zenodo.6607065
    Explore at:
    Available download formats: csv
    Dataset updated
    Sep 15, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous authors; Anonymous authors
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle.

    The data is organized in a table structure. Code4ML includes several main objects: competition information, raw code blocks collected from Kaggle, and manually marked-up snippets. Each table is stored as a .csv file.

    Each competition has a text description and metadata reflecting the characteristics of the competition and the dataset used, as well as the evaluation metric (competitions.csv). The corresponding datasets can be loaded using the Kaggle API and the listed data sources.

    The code blocks and their metadata are grouped into data frames by the publishing year of the original kernels. The current version of the corpus includes two code-block files: snippets from kernels published up to 2020 (code_blocks_upto_20.csv) and those from 2021 (code_blocks_21.csv), each with corresponding metadata. The corpus consists of 2,743,615 ML code blocks collected from 107,524 Jupyter notebooks.

    Marked-up code blocks have the following metadata: an anonymized id, the format of the data used (for example, table or audio), the id of the semantic type, a flag for code errors, the estimated relevance to the semantic class (from 1 to 5), the id of the parent notebook, and the name of the competition. The current version of the corpus has ~12,000 labeled snippets (markup_data_20220415.csv).

    Because the marked-up code blocks reference the numeric id of a semantic type, we also provide a mapping from this id to the semantic type and subclass (actual_graph_2022-06-01.csv).
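
    A minimal sketch of resolving those numeric ids to readable semantic types with pandas; the column names (graph_vertex_id on the markup side, a matching id plus type columns in the mapping file) are assumptions, so check the CSV headers first:

      import pandas as pd

      markup = pd.read_csv("markup_data_20220415.csv")
      graph = pd.read_csv("actual_graph_2022-06-01.csv")

      # Column names below are illustrative; inspect the headers of both files before joining.
      labeled = markup.merge(graph, left_on="graph_vertex_id", right_on="id", how="left")
      print(labeled.head())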

    The dataset can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.

  3. Mayo Clinic - STRIP AI: Tiled, Thresholded Dataset

    • kaggle.com
    Updated Sep 17, 2022
    Cite
    Mrinal Tyagi (2022). Mayo Clinic - STRIP AI: Tiled, Thresholded Dataset [Dataset]. https://www.kaggle.com/datasets/tr1gg3rtrash/mayo-clinic
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 17, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mrinal Tyagi
    License

    CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    This is a preprocessed dataset derived from the Mayo Clinic - STRIP AI competition data.

    Example Images from the dataset.


    Collection Methodology

    The dataset was downloaded from the Kaggle website. Using OpenSlide, 600 x 600 tiles were extracted from the last level of a DeepZoomGenerator and saved in PNG format. A thresholding step then compared the area covered by slide content in each tile against the total tile area: if slide content covered more than 30% of the tile, the image was kept; otherwise it was discarded. Tiles that were entirely white were removed manually from the dataset.
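
    A rough sketch of that tiling-and-thresholding pipeline, assuming a whole-slide image readable by OpenSlide and a simple grayscale cutoff to decide what counts as slide content (the author's exact content test is not specified beyond the 30% area rule):

      import numpy as np
      from openslide import OpenSlide
      from openslide.deepzoom import DeepZoomGenerator

      slide = OpenSlide("example_slide.tif")  # placeholder path
      tiles = DeepZoomGenerator(slide, tile_size=600, overlap=0)

      level = tiles.level_count - 1            # last (highest-resolution) DeepZoom level
      cols, rows = tiles.level_tiles[level]

      kept = []
      for col in range(cols):
          for row in range(rows):
              tile = tiles.get_tile(level, (col, row)).convert("L")
              arr = np.asarray(tile)
              content_fraction = (arr < 220).mean()  # non-white pixels treated as content (cutoff assumed)
              if content_fraction > 0.30:
                  kept.append(((col, row), tile))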

    Important Information

    We spent days and nights creating this dataset, but due to resource and time constraints we were not able to make a good submission on the leaderboard using it. We encourage everyone on the leaderboard to use the dataset if they can and come up with a great, generalized solution to this problem.

  4. Code4ML 2.0

    • zenodo.org
    csv, txt
    Updated May 19, 2025
    Cite
    Anonymous authors; Anonymous authors (2025). Code4ML 2.0 [Dataset]. http://doi.org/10.5281/zenodo.15465737
    Explore at:
    Available download formats: csv, txt
    Dataset updated
    May 19, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous authors; Anonymous authors
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is an enriched version of the Code4ML dataset, a large-scale corpus of annotated Python code snippets, competition summaries, and data descriptions sourced from Kaggle. The initial release includes approximately 2.5 million snippets of machine learning code extracted from around 100,000 Jupyter notebooks. A portion of these snippets has been manually annotated by human assessors through a custom-built, user-friendly interface designed for this task.

    The original dataset is organized into multiple CSV files, each containing structured data on different entities:

    • code_blocks.csv: Contains raw code snippets extracted from Kaggle.
    • kernels_meta.csv: Metadata for the notebooks (kernels) from which the code snippets were derived.
    • competitions_meta.csv: Metadata describing Kaggle competitions, including information about tasks and data.
    • markup_data.csv: Annotated code blocks with semantic types, allowing deeper analysis of code structure.
    • vertices.csv: A mapping from numeric IDs to semantic types and subclasses, used to interpret annotated code blocks.

    Table 1. code_blocks.csv structure

    | Column | Description |
    |:---|:---|
    | code_blocks_index | Global index linking code blocks to markup_data.csv. |
    | kernel_id | Identifier for the Kaggle Jupyter notebook from which the code block was extracted. |
    | code_block_id | Position of the code block within the notebook. |
    | code_block | The actual machine learning code snippet. |

    Table 2. kernels_meta.csv structure

    | Column | Description |
    |:---|:---|
    | kernel_id | Identifier for the Kaggle Jupyter notebook. |
    | kaggle_score | Performance metric of the notebook. |
    | kaggle_comments | Number of comments on the notebook. |
    | kaggle_upvotes | Number of upvotes the notebook received. |
    | kernel_link | URL to the notebook. |
    | comp_name | Name of the associated Kaggle competition. |

    Table 3. competitions_meta.csv structure

    | Column | Description |
    |:---|:---|
    | comp_name | Name of the Kaggle competition. |
    | description | Overview of the competition task. |
    | data_type | Type of data used in the competition. |
    | comp_type | Classification of the competition. |
    | subtitle | Short description of the task. |
    | EvaluationAlgorithmAbbreviation | Metric used for assessing competition submissions. |
    | data_sources | Links to datasets used. |
    | metric type | Class label for the assessment metric. |

    Table 4. markup_data.csv structure

    | Column | Description |
    |:---|:---|
    | code_block | Machine learning code block. |
    | too_long | Flag indicating whether the block spans multiple semantic types. |
    | marks | Confidence level of the annotation. |
    | graph_vertex_id | ID of the semantic type. |

    The dataset allows mapping between these tables. For example:

    • code_blocks.csv can be linked to kernels_meta.csv via the kernel_id column.
    • kernels_meta.csv is connected to competitions_meta.csv through comp_name. To maintain quality, kernels_meta.csv includes only notebooks with available Kaggle scores.

    In addition, data_with_preds.csv contains automatically classified code blocks, with a mapping back to code_blocks.csv via the code_blocks_index column.
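
    A minimal sketch of those joins with pandas, assuming the CSV files are in the working directory:

      import pandas as pd

      code_blocks = pd.read_csv("code_blocks.csv")
      kernels_meta = pd.read_csv("kernels_meta.csv")
      competitions_meta = pd.read_csv("competitions_meta.csv")

      # Link snippets to their notebooks, then notebooks to their competitions.
      blocks_with_kernels = code_blocks.merge(kernels_meta, on="kernel_id", how="inner")
      full = blocks_with_kernels.merge(competitions_meta, on="comp_name", how="left")

      print(full[["code_blocks_index", "kernel_id", "comp_name", "kaggle_score"]].head())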

    Code4ML 2.0 Enhancements

    The updated Code4ML 2.0 corpus introduces kernels extracted from Meta Kaggle Code. These kernels correspond to Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the aid of an LLM.

    Notebooks in kernels_meta2.csv may not have a Kaggle score but include a leaderboard ranking (rank), providing additional context for evaluation.

    competitions_meta_2.csv is enriched with data_cards, describing the data used in the competitions.

    Applications

    The Code4ML 2.0 corpus is a versatile resource, enabling training and evaluation of models in areas such as:

    • Code generation
    • Code understanding
    • Natural language processing of code-related tasks
  5. FSDKaggle2018

    • zenodo.org
    • opendatalab.com
    • +2more
    zip
    Updated Jan 24, 2020
    Cite
    Eduardo Fonseca; Eduardo Fonseca; Xavier Favory; Jordi Pons; Frederic Font; Frederic Font; Manoj Plakal; Daniel P. W. Ellis; Daniel P. W. Ellis; Xavier Serra; Xavier Serra; Xavier Favory; Jordi Pons; Manoj Plakal (2020). FSDKaggle2018 [Dataset]. http://doi.org/10.5281/zenodo.2552860
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Eduardo Fonseca; Eduardo Fonseca; Xavier Favory; Jordi Pons; Frederic Font; Frederic Font; Manoj Plakal; Daniel P. W. Ellis; Daniel P. W. Ellis; Xavier Serra; Xavier Serra; Xavier Favory; Jordi Pons; Manoj Plakal
    Description

    FSDKaggle2018 is an audio dataset containing 11,073 audio files annotated with 41 labels of the AudioSet Ontology. FSDKaggle2018 has been used for the DCASE Challenge 2018 Task 2, which was run as a Kaggle competition titled Freesound General-Purpose Audio Tagging Challenge.

    Citation

    If you use the FSDKaggle2018 dataset or part of it, please cite our DCASE 2018 paper:

    Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Favory, Jordi Pons, Xavier Serra. "General-purpose Tagging of Freesound Audio with AudioSet Labels: Task Description, Dataset, and Baseline". Proceedings of the DCASE 2018 Workshop (2018)

    You can also consider citing our ISMIR 2017 paper, which describes how we gathered the manual annotations included in FSDKaggle2018.

    Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, "Freesound Datasets: A Platform for the Creation of Open Audio Datasets", In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017

    Contact

    You are welcome to contact Eduardo Fonseca should you have any questions at eduardo.fonseca@upf.edu.

    About this dataset

    Freesound Dataset Kaggle 2018 (or FSDKaggle2018 for short) is an audio dataset containing 11,073 audio files annotated with 41 labels of the AudioSet Ontology [1]. FSDKaggle2018 has been used for the Task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2018. Please visit the DCASE2018 Challenge Task 2 website for more information. This Task was hosted on the Kaggle platform as a competition titled Freesound General-Purpose Audio Tagging Challenge. It was organized by researchers from the Music Technology Group of Universitat Pompeu Fabra, and from Google Research’s Machine Perception Team.

    The goal of this competition was to build an audio tagging system that can categorize an audio clip as belonging to one of a set of 41 diverse categories drawn from the AudioSet Ontology.

    All audio samples in this dataset are gathered from Freesound [2] and are provided here as uncompressed PCM 16 bit, 44.1 kHz, mono audio files. Note that because Freesound content is collaboratively contributed, recording quality and techniques can vary widely.

    The ground truth data provided in this dataset has been obtained after a data labeling process which is described below in the Data labeling process section. FSDKaggle2018 clips are unequally distributed in the following 41 categories of the AudioSet Ontology:

    "Acoustic_guitar", "Applause", "Bark", "Bass_drum", "Burping_or_eructation", "Bus", "Cello", "Chime", "Clarinet", "Computer_keyboard", "Cough", "Cowbell", "Double_bass", "Drawer_open_or_close", "Electric_piano", "Fart", "Finger_snapping", "Fireworks", "Flute", "Glockenspiel", "Gong", "Gunshot_or_gunfire", "Harmonica", "Hi-hat", "Keys_jangling", "Knock", "Laughter", "Meow", "Microwave_oven", "Oboe", "Saxophone", "Scissors", "Shatter", "Snare_drum", "Squeak", "Tambourine", "Tearing", "Telephone", "Trumpet", "Violin_or_fiddle", "Writing".

    Some other relevant characteristics of FSDKaggle2018:

    • The dataset is split into a train set and a test set.

    • The train set is meant to be for system development and includes ~9.5k samples unequally distributed among 41 categories. The minimum number of audio samples per category in the train set is 94, and the maximum 300. The duration of the audio samples ranges from 300ms to 30s due to the diversity of the sound categories and the preferences of Freesound users when recording sounds. The total duration of the train set is roughly 18h.

    • Out of the ~9.5k samples from the train set, ~3.7k have manually-verified ground truth annotations and ~5.8k have non-verified annotations. The non-verified annotations of the train set have a quality estimate of at least 65-70% in each category. Check out the Data labeling process section below for more information about this aspect.

    • Non-verified annotations in the train set are properly flagged in train.csv so that participants can opt to use this information during the development of their systems (see the sketch after this list).

    • The test set is composed of 1.6k samples with manually-verified annotations and with a category distribution similar to that of the train set. The total duration of the test set is roughly 2h.

    • All audio samples in this dataset have a single label (i.e. are only annotated with one label). Check out the Data labeling process section below for more information about this aspect. A single label should be predicted for each file in the test set.
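
    As a rough sketch of using those annotations, assuming train.csv exposes fname, label, and a manually_verified flag (the column names are an assumption; check the file header):

      import pandas as pd

      train = pd.read_csv("train.csv")

      # Separate manually-verified from non-verified annotations using the flag noted above.
      verified = train[train["manually_verified"] == 1]
      non_verified = train[train["manually_verified"] == 0]

      print(len(verified), "verified clips,", len(non_verified), "non-verified clips")
      print(train["label"].value_counts().head())  # per-category clip counts (one label per clip)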

    Data labeling process

    The data labeling process started from a manual mapping between Freesound tags and AudioSet Ontology categories (or labels), which was carried out by researchers at the Music Technology Group, Universitat Pompeu Fabra, Barcelona. Using this mapping, a number of Freesound audio samples were automatically annotated with labels from the AudioSet Ontology. These annotations can be understood as weak labels since they express the presence of a sound category in an audio sample.

    Then, a data validation process was carried out in which a number of participants did listen to the annotated sounds and manually assessed the presence/absence of an automatically assigned sound category, according to the AudioSet category description.

    Audio samples in FSDKaggle2018 are only annotated with a single ground truth label (see train.csv). A total of 3,710 annotations included in the train set of FSDKaggle2018 have been manually validated as present and predominant (some with inter-annotator agreement, but not all of them). This means that in most cases there is no additional acoustic material other than the labeled category. In a few cases there may be some additional sound events, but these additional events won't belong to any of the 41 categories of FSDKaggle2018.

    The rest of the annotations have not been manually validated and therefore some of them could be inaccurate. Nonetheless, we have estimated that at least 65-70% of the non-verified annotations per category in the train set are indeed correct. It can happen that some of these non-verified audio samples present several sound sources even though only one label is provided as ground truth. These additional sources are typically out of the set of the 41 categories, but in a few cases they could be within.

    More details about the data labeling process can be found in [3].

    License

    FSDKaggle2018 has licenses at two different levels, as explained next.

    All sounds in Freesound are released under Creative Commons (CC) licenses, and each audio clip has its own license as defined by the audio clip uploader in Freesound. For attribution purposes and to facilitate attribution of these files to third parties, we include a relation of the audio clips included in FSDKaggle2018 and their corresponding license. The licenses are specified in the files train_post_competition.csv and test_post_competition_scoring_clips.csv.

    In addition, FSDKaggle2018 as a whole is the result of a curation process and it has an additional license. FSDKaggle2018 is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the FSDKaggle2018.doc zip file.

    Files

    FSDKaggle2018 can be downloaded as a series of zip files with the following directory structure:

    root
    │
    └───FSDKaggle2018.audio_train/                     Audio clips in the train set
    │
    └───FSDKaggle2018.audio_test/                      Audio clips in the test set
    │
    └───FSDKaggle2018.meta/                            Files for evaluation setup
    │   │
    │   └───train_post_competition.csv                 Data split and ground truth for the train set
    │   │
    │   └───test_post_competition_scoring_clips.csv    Ground truth for the test set
    │
    └───FSDKaggle2018.doc/
        │
        └───README.md                                  The dataset description file you are reading
        │
        └───LICENSE-DATASET

  6. Miss America Titleholders

    • kaggle.com
    Updated Nov 17, 2022
    Cite
    The Devastator (2022). Miss America Titleholders [Dataset]. https://www.kaggle.com/datasets/thedevastator/miss-america-titleholders-a-comprehensive-datase
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 17, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Miss America Titleholders

    Miss America over the years

    About this dataset

    Every year, young women from across the United States compete for the title of Miss America. The competition is open to women between the ages of 17 and 25, and includes a talent portion, an interview, and a swimsuit competition (which was removed in 2018). The winner is crowned by the previous year's titleholder and goes on to tour the nation for about 20,000 miles a month, promoting her particular platform of interest.

    The Miss America dataset contains information on all Miss America titleholders from 1921 to 2022. It includes columns for the year of the pageant, the name of the crowned winner, the state or district she represented, awards won, talent performed, and notes about her win.

    How to use the dataset

    This dataset contains information on Miss America titleholders from 1921 to 2022. The data includes the name of the winner, her state or district, the city she represented, her talent, and the year she won.

    Research Ideas

    • Miss America could be used to study changes in American culture over time. For example, the decline in the swimsuit competition could be seen as a sign of increasing body positivity in the US.
    • The dataset could be used to study the effect of winning Miss America has on a woman's career. Does winning lead to more opportunities?
    • The dataset could be used to study geographical patterns in Miss America winners. For example, are there any states that have produced more winners than others?

    Acknowledgements

    License

    License: Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

    You are free to:
    • Share - copy and redistribute the material in any medium or format for any purpose, even commercially.
    • Adapt - remix, transform, and build upon the material for any purpose, even commercially.

    You must:
    • Give appropriate credit - Provide a link to the license, and indicate if changes were made.
    • ShareAlike - You must distribute your contributions under the same license as the original.

    Columns

    File: miss_america_titleholders.csv

    | Column name | Description |
    |:---|:---|
    | year | The year the Miss America pageant was held. (Integer) |
    | crowned | The name of the Miss America titleholder. (String) |
    | winner | The name of the Miss America winner. (String) |
    | state_or_district | The state or district represented by the Miss America winner. (String) |
    | city | The city represented by the Miss America winner. (String) |
    | awards | The awards won by the Miss America winner. (String) |
    | talent | The talent performed by the Miss America winner. (String) |
    | notes | Notes about the Miss America winner. (String) |

    File: eurovision_winners.csv

    | Column name | Description |
    |:---|:---|
    | Year | The year the pageant was held. (Integer) |
    | Date | The date the pageant was held. (Date) |
    | Host City | The city where the pageant was held. (String) |
    | Winner | The name of the pageant winner. (String) |
    | Song | The song performed by the pageant winner. (String) |
    | Performer | The name of the performer of the pageant winner's song. (String) |
    | Points | The number of points the pageant winner received. (Integer) |
    | Margin | The margin of points between the pageant winner and runner-up. (Integer) |
    | Runner-up | The name of the pageant runner-up. (String) |

  7. olympiad-math-contest-llama3-20k

    • huggingface.co
    Updated Jun 1, 2024
    Cite
    Kevin Amiri (2024). olympiad-math-contest-llama3-20k [Dataset]. https://huggingface.co/datasets/kevin009/olympiad-math-contest-llama3-20k
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 1, 2024
    Authors
    Kevin Amiri
    Description

    AMC/AIME Mathematics Problem and Solution Dataset

      Dataset Details
    

    Dataset Name: AMC/AIME Mathematics Problem and Solution Dataset
    Version: 1.0
    Release Date: 2024-06-01
    Authors: Kevin Amiri

      Intended Use
    

    Primary Use: The dataset is created and intended for research and the AI Mathematical Olympiad Kaggle competition.
    Intended Users: Researchers in AI, mathematics, or science.

      Dataset Composition
    

    Number of Examples: 20,300 problems and solution sets… See the full description on the dataset page: https://huggingface.co/datasets/kevin009/olympiad-math-contest-llama3-20k.

  8. LLM Science Dataset

    • kaggle.com
    Updated Aug 7, 2023
    Cite
    Zhecheng Li (2023). LLM Science Dataset [Dataset]. https://www.kaggle.com/datasets/lizhecheng/llm-science-dataset/suggestions
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 7, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Zhecheng Li
    License

    Community Data License Agreement - Sharing, Version 1.0 (https://cdla.io/sharing-1-0/)

    Description

    Version 3 contains 6 datasets.

    1 - The original training dataset in LLM Science Exam

    2 - 6.0k train examples for LLM Science Exam from RADEK OSMULSKI, the dataset link is here

    3 - 500 train examples for LLM Science Exam from RADEK OSMULSKI, the dataset link is here

    4 - 600 train examples collected by Zhecheng LI using ChatGPT 3.5, here

    5 - wikipedia-stem-1k dataset collected by LEONID KULYK, the dataset link is here

    6 - MMLU Dataset: I chose about 3,600 examples that are suitable for fine-tuning for this competition; the original dataset I have published here

    Thanks for their contributions to this competition and to many NLP projects.

  9. FSDKaggle2019

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
    + more versions
    Cite
    Manoj Plakal (2020). FSDKaggle2019 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3612636
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Daniel P. W. Ellis
    Manoj Plakal
    Xavier Serra
    Frederic Font
    Eduardo Fonseca
    Description

    FSDKaggle2019 is an audio dataset containing 29,266 audio files annotated with 80 labels of the AudioSet Ontology. FSDKaggle2019 has been used for the DCASE Challenge 2019 Task 2, which was run as a Kaggle competition titled Freesound Audio Tagging 2019.

    Citation

    If you use the FSDKaggle2019 dataset or part of it, please cite our DCASE 2019 paper:

    Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Serra. "Audio tagging with noisy labels and minimal supervision". Proceedings of the DCASE 2019 Workshop, NYC, US (2019)

    You can also consider citing our ISMIR 2017 paper, which describes how we gathered the manual annotations included in FSDKaggle2019.

    Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, "Freesound Datasets: A Platform for the Creation of Open Audio Datasets", In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017

    Data curators

    Eduardo Fonseca, Manoj Plakal, Xavier Favory, Jordi Pons

    Contact

    You are welcome to contact Eduardo Fonseca should you have any questions at eduardo.fonseca@upf.edu.

    ABOUT FSDKaggle2019

    Freesound Dataset Kaggle 2019 (or FSDKaggle2019 for short) is an audio dataset containing 29,266 audio files annotated with 80 labels of the AudioSet Ontology [1]. FSDKaggle2019 has been used for Task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2019. Please visit the DCASE2019 Challenge Task 2 website for more information. This Task was hosted on the Kaggle platform as a competition titled Freesound Audio Tagging 2019. It was organized by researchers from the Music Technology Group (MTG) of Universitat Pompeu Fabra (UPF), and from the Sound Understanding team at Google AI Perception. The competition intended to provide insight towards the development of broadly-applicable sound event classifiers able to cope with label noise and minimal supervision conditions.

    FSDKaggle2019 employs audio clips from the following sources:

    Freesound Dataset (FSD): a dataset being collected at the MTG-UPF based on Freesound content organized with the AudioSet Ontology

    The soundtracks of a pool of Flickr videos taken from the Yahoo Flickr Creative Commons 100M dataset (YFCC)

    The audio data is labeled using a vocabulary of 80 labels from Google’s AudioSet Ontology [1], covering diverse topics: Guitar and other Musical Instruments, Percussion, Water, Digestive, Respiratory sounds, Human voice, Human locomotion, Hands, Human group actions, Insect, Domestic animals, Glass, Liquid, Motor vehicle (road), Mechanisms, Doors, and a variety of Domestic sounds. The full list of categories can be inspected in vocabulary.csv (see Files & Download below). The goal of the task was to build a multi-label audio tagging system that can predict appropriate label(s) for each audio clip in a test set.

    What follows is a summary of some of the most relevant characteristics of FSDKaggle2019. Nevertheless, it is highly recommended to read our DCASE 2019 paper for a more in-depth description of the dataset and how it was built.

    Ground Truth Labels

    The ground truth labels are provided at the clip-level, and express the presence of a sound category in the audio clip, hence can be considered weak labels or tags. Audio clips have variable lengths (roughly from 0.3 to 30s).

    The audio content from FSD has been manually labeled by humans following a data labeling process using the Freesound Annotator platform. Most labels have inter-annotator agreement but not all of them. More details about the data labeling process and the Freesound Annotator can be found in [2].

    The YFCC soundtracks were labeled using automated heuristics applied to the audio content and metadata of the original Flickr clips. Hence, a substantial amount of label noise can be expected. The label noise can vary widely in amount and type depending on the category, including in- and out-of-vocabulary noises. More information about some of the types of label noise that can be encountered is available in [3].

    Specifically, FSDKaggle2019 features three types of label quality, one for each set in the dataset:

    curated train set: correct (but potentially incomplete) labels

    noisy train set: noisy labels

    test set: correct and complete labels

    Further details can be found below in the sections for each set.

    Format

    All audio clips are provided as uncompressed PCM 16 bit, 44.1 kHz, mono audio files.
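
    A quick sanity check of that format with the soundfile library (the file name is a placeholder):

      import soundfile as sf

      info = sf.info("example_clip.wav")  # any clip from the dataset
      print(info.samplerate, info.channels, info.subtype)  # expected: 44100 1 PCM_16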

    DATA SPLIT

    FSDKaggle2019 consists of two train sets and one test set. The idea is to limit the supervision provided for training (i.e., the manually-labeled, hence reliable, data), thus promoting approaches to deal with label noise.

    Curated train set

    The curated train set consists of manually-labeled data from FSD.

    Number of clips/class: 75, except in a few cases (where there are fewer)

    Total number of clips: 4970

    Avg number of labels/clip: 1.2

    Total duration: 10.5 hours

    The duration of the audio clips ranges from 0.3 to 30s due to the diversity of the sound categories and the preferences of Freesound users when recording/uploading sounds. Labels are correct but potentially incomplete. It can happen that a few of these audio clips present additional acoustic material beyond the provided ground truth label(s).

    Noisy train set

    The noisy train set is a larger set of noisy web audio data from Flickr videos taken from the YFCC dataset [5].

    Number of clips/class: 300

    Total number of clips: 19,815

    Avg number of labels/clip: 1.2

    Total duration: ~80 hours

    The duration of the audio clips ranges from 1s to 15s, with the vast majority lasting 15s. Labels are automatically generated and purposefully noisy. No human validation is involved. The label noise can vary widely in amount and type depending on the category, including in- and out-of-vocabulary noises.

    Considering the numbers above, the per-class data distribution available for training is, for most of the classes, 300 clips from the noisy train set and 75 clips from the curated train set. This means 80% noisy / 20% curated at the clip level, while at the duration level the proportion is more extreme considering the variable-length clips.

    Test set

    The test set is used for system evaluation and consists of manually-labeled data from FSD.

    Number of clips/class: between 50 and 150

    Total number of clips: 4481

    Avg number of labels/clip: 1.4

    Total duration: 12.9 hours

    The acoustic material present in the test set clips is labeled exhaustively using the aforementioned vocabulary of 80 classes. Most labels have inter-annotator agreement, but not all of them. Barring human error, the label(s) are correct and complete considering the target vocabulary; nonetheless, a few clips could still present additional (unlabeled) acoustic content that is out of the vocabulary.

    During the DCASE2019 Challenge Task 2, the test set was split into two subsets, for the public and private leaderboards, and only the data corresponding to the public leaderboard was provided. In this current package you will find the full test set with all the test labels. To allow comparison with previous work, the file test_post_competition.csv includes a flag to determine the corresponding leaderboard (public or private) for each test clip (see more info in Files & Download below).

    Acoustic mismatch

    As mentioned before, FSDKaggle2019 uses audio clips from two sources:

    FSD: curated train set and test set, and

    YFCC: noisy train set.

    While the sources of audio (Freesound and Flickr) are collaboratively contributed and pretty diverse themselves, a certain acoustic mismatch can be expected between FSD and YFCC. We conjecture this mismatch comes from a variety of reasons. For example, through acoustic inspection of a small sample of both data sources, we find a higher percentage of high quality recordings in FSD. In addition, audio clips in Freesound are typically recorded with the purpose of capturing audio, which is not necessarily the case in YFCC.

    This mismatch can have an impact in the evaluation, considering that most of the train data come from YFCC, while all test data are drawn from FSD. This constraint (i.e., noisy training data coming from a different web audio source than the test set) is sometimes a real-world condition.

    LICENSE

    All clips in FSDKaggle2019 are released under Creative Commons (CC) licenses. For attribution purposes and to facilitate attribution of these files to third parties, we include a mapping from the audio clips to their corresponding licenses.

    Curated train set and test set. All clips in Freesound are released under different modalities of Creative Commons (CC) licenses, and each audio clip has its own license as defined by the audio clip uploader in Freesound, some of them requiring attribution to their original authors and some forbidding further commercial reuse. The licenses are specified in the files train_curated_post_competition.csv and test_post_competition.csv. These licenses can be CC0, CC-BY, CC-BY-NC and CC Sampling+.

    Noisy train set. Similarly, the licenses of the soundtracks from Flickr used in FSDKaggle2019 are specified in the file train_noisy_post_competition.csv. These licenses can be CC-BY and CC BY-SA.

    In addition, FSDKaggle2019 as a whole is the result of a curation process and it has an additional license. FSDKaggle2019 is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the FSDKaggle2019.doc zip file.

    FILES & DOWNLOAD

    FSDKaggle2019 can be downloaded as a series of zip files with the following directory structure:

    root
    │
    └───FSDKaggle2019.audio_train_curated/    Audio clips in the curated train set
    │
    └───FSDKaggle2019.audio_train_noisy/      Audio clips in the noisy train set

  10. SVG Code Generation Sample Training Data

    • kaggle.com
    Updated May 3, 2025
    Cite
    Vinothkumar Sekar (2025). SVG Code Generation Sample Training Data [Dataset]. https://www.kaggle.com/datasets/vinothkumarsekar89/svg-generation-sample-training-data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 3, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Vinothkumar Sekar
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This training data was generated using GPT-4o as part of the 'Drawing with LLM' competition (https://www.kaggle.com/competitions/drawing-with-llms). It can be used to fine-tune small language models for the competition or serve as an augmentation dataset alongside other data sources.

    The dataset is generated in two steps using the GPT-4o model.

    • In the first step, topic descriptions relevant to the competition are generated using a specific prompt. By running this prompt multiple times, over 3,000 descriptions were collected.

     
    prompt=f""" I am participating in an SVG code generation competition.
      
       The competition involves generating SVG images based on short textual descriptions of everyday objects and scenes, spanning a wide range of categories. The key guidelines are as follows:
      
       - Descriptions are generic and do not contain brand names, trademarks, or personal names.
       - No descriptions include people, even in generic terms.
       - Descriptions are concise—each is no more than 200 characters, with an average length of about 50 characters.
       - Categories cover various domains, with some overlap between public and private test sets.
      
       To train a small LLM model, I am preparing a synthetic dataset. Could you generate 100 unique topics aligned with the competition style?
      
       Requirements:
       - Each topic should range between **20 and 200 characters**, with an **average around 60 characters**.
       - Ensure **diversity and creativity** across topics.
       - **50% of the topics** should come from the categories of **landscapes**, **abstract art**, and **fashion**.
       - Avoid duplication or overly similar phrasing.
      
       Example topics:
                     a purple forest at dusk, gray wool coat with a faux fur collar, a lighthouse overlooking the ocean, burgundy corduroy, pants with patch pockets and silver buttons, orange corduroy overalls, a purple silk scarf with tassel trim, a green lagoon under a cloudy sky, crimson rectangles forming a chaotic grid,  purple pyramids spiraling around a bronze cone, magenta trapezoids layered on a translucent silver sheet,  a snowy plain, black and white checkered pants,  a starlit night over snow-covered peaks, khaki triangles and azure crescents,  a maroon dodecahedron interwoven with teal threads.
      
       Please return the 100 topics in csv format.
       """
     
    • In the second step, SVG code is generated by prompting the GPT-4o model. The following prompt is used to query the model to generate svg.
     
      prompt = f"""
          Generate SVG code to visually represent the following text description, while respecting the given constraints.
          
          Allowed Elements: `svg`, `path`, `circle`, `rect`, `ellipse`, `line`, `polyline`, `polygon`, `g`, `linearGradient`, `radialGradient`, `stop`, `defs`
          Allowed Attributes: `viewBox`, `width`, `height`, `fill`, `stroke`, `stroke-width`, `d`, `cx`, `cy`, `r`, `x`, `y`, `rx`, `ry`, `x1`, `y1`, `x2`, `y2`, `points`, `transform`, `opacity`
          
    
          Please ensure that the generated SVG code is well-formed, valid, and strictly adheres to these constraints. 
          Focus on a clear and concise representation of the input description within the given limitations. 
          Always give the complete SVG code with nothing omitted. Never use an ellipsis.
    
          The code is scored based on similarity to the description, Visual question answering and aesthetic components.
          Please generate a detailed svg code accordingly.
    
          input description: {text}
          """
     

    The raw SVG output is then cleaned and sanitized using a competition-specific sanitization class. After that, the cleaned SVG is scored using the SigLIP model to evaluate text-to-SVG similarity. Only SVGs with a score above 0.5 are included in the dataset. On average, out of three SVG generations, only one meets the quality threshold after the cleaning, sanitization, and scoring process.
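
    A rough sketch of that text-to-SVG similarity scoring, assuming the SVG is rasterized with cairosvg and scored with a SigLIP checkpoint from Hugging Face (the checkpoint name and the scoring details are assumptions, not the author's exact setup):

      import io

      import cairosvg
      import torch
      from PIL import Image
      from transformers import AutoModel, AutoProcessor

      processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")
      model = AutoModel.from_pretrained("google/siglip-base-patch16-224")

      def siglip_score(svg_code: str, description: str) -> float:
          """Render the SVG to an image and return its SigLIP image-text similarity."""
          png_bytes = cairosvg.svg2png(bytestring=svg_code.encode())
          image = Image.open(io.BytesIO(png_bytes)).convert("RGB")
          inputs = processor(text=[description], images=image, padding="max_length", return_tensors="pt")
          with torch.no_grad():
              logits = model(**inputs).logits_per_image  # SigLIP pairs a sigmoid with this logit
          return torch.sigmoid(logits)[0, 0].item()

      # Keep only generations that clear the 0.5 threshold described above, e.g.:
      # keep = siglip_score(svg_code, description) > 0.5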

  11. Google Patent Phrase Similarity Dataset

    • kaggle.com
    Updated Jul 22, 2022
    Cite
    Google (2022). Google Patent Phrase Similarity Dataset [Dataset]. https://www.kaggle.com/datasets/google/google-patent-phrase-similarity-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 22, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Google
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a human-rated contextual phrase-to-phrase matching dataset focused on technical terms from patents. In addition to the similarity scores typically included in other benchmark datasets, we include granular rating classes similar to WordNet relations, such as synonym, antonym, hypernym, hyponym, holonym, meronym, and domain related. The dataset was used in the U.S. Patent Phrase to Phrase Matching competition.

    The dataset was generated with focus on the following:

    • Phrase disambiguation: certain keywords and phrases can have multiple different meanings. For example, the phrase "mouse" may refer to an animal or a computer input device. To help disambiguate the phrases we have included Cooperative Patent Classification (CPC) classes with each pair of phrases.
    • Adversarial keyword match: there are phrases that have matching keywords but are otherwise unrelated (e.g. “container section” → “kitchen container”, “offset table” → “table fan”). Many models will not do well on such data (e.g. bag of words models). Our dataset is designed to include many such examples.
    • Hard negatives: we created our dataset with the aim to improve upon the current state of the art in language models. Specifically, we have used the BERT model to generate some of the target phrases. So our dataset contains many human-rated examples of phrase pairs that BERT may identify as very similar but in fact may not be.

    Each entry of the dataset contains two phrases (anchor and target), a context CPC class, a rating class, and a similarity score. The rating classes have the following meanings:

    • 4 - Very high.
    • 3 - High.
    • 2 - Medium.
    • 2a - Hyponym (broad-narrow match).
    • 2b - Hypernym (narrow-broad match).
    • 2c - Structural match.
    • 1 - Low.
    • 1a - Antonym.
    • 1b - Meronym (a part of).
    • 1c - Holonym (a whole of).
    • 1d - Other high-level domain match.
    • 0 - Not related.

    The dataset is split into a training (75%), validation (5%), and test (20%) sets. When splitting the data all of the entries with the same anchor are kept together in the same set. There are 106 different context CPC classes and all of them are represented in the training set.
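
    A minimal sketch of reproducing that kind of anchor-grouped split with scikit-learn; the file name and column names (anchor, target, context, score) are assumptions to check against the actual CSV:

      import pandas as pd
      from sklearn.model_selection import GroupShuffleSplit

      df = pd.read_csv("phrase_pairs.csv")  # placeholder file name

      # Keep all rows sharing the same anchor in the same split, as described above.
      splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
      train_idx, holdout_idx = next(splitter.split(df, groups=df["anchor"]))

      train, holdout = df.iloc[train_idx], df.iloc[holdout_idx]
      assert set(train["anchor"]).isdisjoint(holdout["anchor"])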

    More details about the dataset are available in the corresponding paper. Please cite the paper if you use the dataset.

  12. Multisubject, multimodal face processing

    • openneuro.org
    Updated Dec 18, 2020
    Cite
    DG Wakeman; RN Henson (2020). Multisubject, multimodal face processing [Dataset]. http://doi.org/10.18112/openneuro.ds000117.v1.0.4
    Explore at:
    Dataset updated
    Dec 18, 2020
    Dataset provided by
    OpenNeuro (https://openneuro.org/)
    Authors
    DG Wakeman; RN Henson
    Description

    This dataset was obtained from the OpenNeuro project (https://www.openneuro.org). Accession #: ds000117

    The same dataset is also available here: ftp://ftp.mrc-cbu.cam.ac.uk/personal/rik.henson/wakemandg_hensonrn/, but in a non-BIDS format (which may be easier to download by subject rather than by modality)

    Note that it is a subset of the data available on OpenfMRI (http://www.openfmri.org; Accession #: ds000117).

    Description: Multi-subject, multi-modal (sMRI+fMRI+MEG+EEG) neuroimaging dataset on face processing

    Please cite the following reference if you use these data:

     Wakeman, D.G. & Henson, R.N. (2015). A multi-subject, multi-modal human neuroimaging dataset. Sci. Data 2:150001 doi: 10.1038/sdata.2015.1
    

    The data have been used in several publications including, for example:

    Henson, R.N., Abdulrahman, H., Flandin, G. & Litvak, V. (2019). Multimodal integration of M/EEG and f/MRI data in SPM12. Frontiers in Neuroscience, Methods, 13, 300.

    Henson, R.N., Wakeman, D.G., Litvak, V. & Friston, K.J. (2011). A Parametric Empirical Bayesian framework for the EEG/MEG inverse problem: generative models for multisubject and multimodal integration. Frontiers in Human Neuroscience, 5, 76, 1-16.
    
    Chapter 42 of the SPM12 manual (http://www.fil.ion.ucl.ac.uk/spm/doc/manual.pdf)
    

    (see ftp://ftp.mrc-cbu.cam.ac.uk/personal/rik.henson/wakemandg_hensonrn/Publications for the full list), as well as the BioMag2010 data competition and the Kaggle competition (https://www.kaggle.com/c/decoding-the-human-brain).

    ==================================================================================

    func/

    Unlike in v1-v3 of this dataset, the first two (dummy) volumes have now been removed (as stated in *.json), so event onset times correctly refer to t=0 at the start of the third volume.

    Note that, owing to a scanner error, Subject 10 only has 170 volumes in the last run (Run 9), hence the BIDS warning that some onsets in the events.tsv file are later than the data.

    meg/

    Three anatomical fiducials were digitized for aligning the MEG with the MRI: the nasion (lowest depression between the eyes) and the left and right ears (lowest depression between the tragus and the helix, above the tragus). This procedure is illustrated here: http://neuroimage.usc.edu/brainstorm/CoordinateSystems#Subject_Coordinate_System_.28SCS_.2F_CTF.29 and in task-facerecognition_fidinfo.pdf

    The following triggers are included in the .fif files and are also used in the “trigger” column of the meg and bold events files:

    | Trigger | Label | Simplified Label |
    |:---|:---|:---|
    | 5 | Initial Famous Face | FAMOUS |
    | 6 | Immediate Repeat Famous Face | FAMOUS |
    | 7 | Delayed Repeat Famous Face | FAMOUS |
    | 13 | Initial Unfamiliar Face | UNFAMILIAR |
    | 14 | Immediate Repeat Unfamiliar Face | UNFAMILIAR |
    | 15 | Delayed Repeat Unfamiliar Face | UNFAMILIAR |
    | 17 | Initial Scrambled Face | SCRAMBLED |
    | 18 | Immediate Repeat Scrambled Face | SCRAMBLED |
    | 19 | Delayed Repeat Scrambled Face | SCRAMBLED |
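
    For example, a small sketch of collapsing the trigger codes to the simplified condition labels while reading one of the events files (the path is a placeholder; the trigger column name follows the description above):

      import pandas as pd

      SIMPLIFIED = {
          5: "FAMOUS", 6: "FAMOUS", 7: "FAMOUS",
          13: "UNFAMILIAR", 14: "UNFAMILIAR", 15: "UNFAMILIAR",
          17: "SCRAMBLED", 18: "SCRAMBLED", 19: "SCRAMBLED",
      }

      events = pd.read_csv("sub-01/ses-meg/meg/sub-01_ses-meg_task-facerecognition_run-01_events.tsv", sep="\t")
      events["condition"] = events["trigger"].map(SIMPLIFIED)
      print(events[["onset", "trigger", "condition"]].head())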

    stimuli/meg/

    The .bmp files correspond to those described in the text. There are 6 additional images in this directory, which were used in the practice experiment to familiarize participants with the task (hence some more BIDS validator warnings)

    stimuli/mri/

    The .bmp files correspond to those described in the text.

    Defacing

    Defacing of MPRAGE T1 images was performed by the submitter. A subset of subjects have given consent for non-defaced versions to be shared - in which case, please contact rik.henson@mrc-cbu.cam.ac.uk.

    Quality Control

    Mriqc was run on the dataset. Results are located in derivatives/mriqc. Learn more about it here: https://mriqc.readthedocs.io/en/latest/

    Known Issues

    N/A

    Relationship of Subject Numbering relative to other versions of Dataset

    There are multiple versions of the dataset available on the web (see notes above), and these entailed a renumbering of the subjects for various reasons. Here are all the versions and how to match subjects between them (plus some rationale and history for different versions):

    1. Original Paper (N=19): Wakeman & Henson (2015): doi:10.1038/sdata.2015.1. Numbering refers to the order in which subjects were tested (and some, e.g. 4, 7, 13, etc., were excluded for not completing both MRI and MEG sessions).

    2. openfMRI, renumbered from the paper: http://openfmri.org/s3-browser/?prefix=ds000117/ds000117_R0.1.1/uncompressed/ (numbers 1-19 were simply made contiguous).

    3. FTP subset of N=16: ftp://ftp.mrc-cbu.cam.ac.uk/personal/rik.henson/wakemandg_hensonrn/
      This set was used for SPM courses and was designed to illustrate multimodal integration, so good MRI+MEG+EEG data were wanted for all subjects. The original subject_01 and subject_06 were removed because of bad EEG data, and subject_19 because of poor EEG and fMRI data (and subject_14 was renumbered for some reason).

    4. Current OpenNeuro subset of N=16, used for BIDS: https://openneuro.org/datasets/ds000117. OpenNeuro was a rebranding of openfMRI and enforces the BIDS format. Since this version was designed to illustrate multi-modal BIDS, it keeps the same numbering as the FTP subset.

    | W&H2015 | openfMRI | FTP | openNeuro |
    |:---|:---|:---|:---|
    | subject_01 | sub001 | | |
    | subject_02 | sub002 | Sub01 | sub-01 |
    | subject_03 | sub003 | Sub02 | sub-02 |
    | subject_05 | sub004 | Sub03 | sub-03 |
    | subject_06 | sub005 | | |
    | subject_08 | sub006 | Sub05 | sub-05 |
    | subject_09 | sub007 | Sub06 | sub-06 |
    | subject_10 | sub008 | Sub07 | sub-07 |
    | subject_11 | sub009 | Sub08 | sub-08 |
    | subject_12 | sub010 | Sub09 | sub-09 |
    | subject_14 | sub011 | Sub04 | sub-04 |
    | subject_15 | sub012 | Sub10 | sub-10 |
    | subject_16 | sub013 | Sub11 | sub-11 |
    | subject_17 | sub014 | Sub12 | sub-12 |
    | subject_18 | sub015 | Sub13 | sub-13 |
    | subject_19 | sub016 | | |
    | subject_23 | sub017 | Sub14 | sub-14 |
    | subject_24 | sub018 | Sub15 | sub-15 |
    | subject_25 | sub019 | Sub16 | sub-16 |

  13. quora-duplicates

    • huggingface.co
    Cite
    Sentence Transformers, quora-duplicates [Dataset]. https://huggingface.co/datasets/sentence-transformers/quora-duplicates
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset authored and provided by
    Sentence Transformers
    Description

    Dataset Card for Quora Duplicate Questions

    This dataset contains the Quora Question Pairs dataset in four formats that are easily used with Sentence Transformers to train embedding models. The data was originally created by Quora for this Kaggle Competition.

      Dataset Subsets
    
    
    
    
    
      pair-class subset
    

    Columns: "sentence1", "sentence2", "label" Column types: str, str, class with {"0": "different", "1": "duplicate"} Examples:{ 'sentence1': 'What is the step by step… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/quora-duplicates.

  14. chatbot-arena-llm-judges

    • huggingface.co
    Updated Aug 26, 2024
    Cite
    Potsawee Manakul (2024). chatbot-arena-llm-judges [Dataset]. https://huggingface.co/datasets/potsawee/chatbot-arena-llm-judges
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 26, 2024
    Authors
    Potsawee Manakul
    Description

    Chatbot-Arena

    Competition data: https://www.kaggle.com/competitions/lmsys-chatbot-arena/data
    Single-turn data: https://huggingface.co/datasets/potsawee/chatbot-arena-llm-judges

    examples = 49938

    split: A_win = 17312 (34.67%), B_win = 16985 (34.01%), tie = 15641 (31.32%)

    2-way only examples = 34297 (68.68%)

      This repository
    

    train.single-turn.json: data extracted from the train file from LMSys on Kaggle; each example has attributes - id, model_[a, b], winner_model_[a, b, tie], question… See the full description on the dataset page: https://huggingface.co/datasets/potsawee/chatbot-arena-llm-judges.

  15. A collection of fully-annotated soundscape recordings from the southern...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 12, 2024
    Cite
    Stefan Kahl (2024). A collection of fully-annotated soundscape recordings from the southern Sierra Nevada mountain range [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7525804
    Explore at:
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Stefan Kahl
    Megan McKenna
    Mary Clapp
    Erik Meyer
    Gail Patricelli
    Holger Klinck
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Sierra Nevada, California
    Description

    This collection contains 100 soundscape recordings of 10 minutes duration, which have been annotated with 10,296 bounding box labels for 21 different bird species from the Western United States. The data were recorded in 2015 in the southern end of the Sierra Nevada mountain range in California, USA. This collection has been featured as test data in the 2020 BirdCLEF and Kaggle Birdcall Identification competition and can primarily be used for training and evaluation of machine learning algorithms.

    Data collection

    The recordings were made in Sequoia and Kings Canyon National Parks, two contiguous national parks in the southern Sierra Nevada mountain range in California, USA. The focus of the acoustic study was the high-elevation region of the Parks; specifically, the headwater lake basins above 3,000 m in elevation. The original intent of the study was to monitor seasonal activity of birds and bats at lakes containing trout and lakes without trout, because the cascading impacts of trout on the adjacent terrestrial zone remain poorly understood. Soundscapes were recorded for 24 h continuously at 10 lakes (5 fishless, 5 fish-containing) throughout Sequoia and Kings Canyon National Parks during June-September 2015. Song Meter SM2+ units (Wildlife Acoustics, USA) powered by custom-made solar panels were used to obviate the need to swap batteries, because the recording locations are extremely difficult to access. The Song Meters continuously recorded mono-channel, 16-bit uncompressed WAVE files at a 48 kHz sampling rate. For this collection, recordings were resampled to 32 kHz and converted to FLAC.

    Sampling and annotation protocol

    A total of 100 10-minute segments of audio between July 9 and 12, 2015 from morning hours (06:10-09:10 PDT) from all 10 sites were selected at random. Annotators were asked to box every bird call they could recognize, ignoring those that are too faint or unidentifiable. Every sound that could not be confidently assigned an identity was reviewed with 1-2 other experts in bird identification. To minimize observer bias, all identifying information about the location, date and time of the recordings was hidden from the annotator. Raven Pro software was used to annotate the data. Provided labels contain full bird calls that are boxed in time and frequency. In this collection, we use eBird species codes as labels, following the 2021 eBird taxonomy (Clements list). Unidentifiable calls have been marked with “????” and were added as bounding box labels to the ground truth annotations. Parts of this dataset have previously been used in the 2020 BirdCLEF and Kaggle Birdcall Identification competition.

    Files in this collection

    Audio recordings can be accessed by downloading and extracting the “soundscape_data.zip” file. Soundscape recording filenames contain a sequential file ID, recording date and timestamp in PDT (UTC-7). As an example, the file “HSN_001_20150708_061805.flac” has sequential ID 001 and was recorded on July 8th 2015 at 06:18:05 PDT. Ground truth annotations are listed in “annotations.csv” where each line specifies the corresponding filename, start and end time in seconds, low and high frequency in Hertz and an eBird species code. These species codes can be assigned to scientific and common name of a species with the “species.csv” file. The approximate recording location with longitude and latitude can be found in the “recording_location.txt” file.
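    The sketch below illustrates, under the naming scheme described above, how the timestamp could be parsed out of a recording filename and how annotation rows could be read. The annotations.csv column names are not spelled out here, so the reader simply prints whatever header the file provides.

    # Illustrative sketch based on the file naming scheme described above.
    # Check the actual annotations.csv header before relying on field names.
    import csv
    from datetime import datetime

    def parse_recording_filename(name):
        # e.g. "HSN_001_20150708_061805.flac" -> site, sequential ID, local time (PDT)
        stem = name.rsplit(".", 1)[0]
        site, file_id, date_str, time_str = stem.split("_")
        return site, file_id, datetime.strptime(date_str + time_str, "%Y%m%d%H%M%S")

    print(parse_recording_filename("HSN_001_20150708_061805.flac"))

    with open("annotations.csv", newline="") as f:
        for row in csv.DictReader(f):
            print(row)  # filename, start/end time (s), low/high frequency (Hz), species code
            break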

    Acknowledgements

    Compiling this extensive dataset was a major undertaking, and we are very thankful to the domain experts who helped to collect and manually annotate the data for this collection (individual contributors in alphabetic order): Anna Calderón, Thomas Hahn, Ruoshi Huang, Angelly Tovar

  16. The Maestro Dataset v2

    • kaggle.com
    zip
    Updated Jun 14, 2020
    Cite
    Jack Vial (2020). The Maestro Dataset v2 [Dataset]. https://www.kaggle.com/datasets/jackvial/themaestrodatasetv2
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Jun 14, 2020
    Authors
    Jack Vial
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Note

    I did not have any part in creating this dataset; I am only uploading it here to make it easily available to others on Kaggle. More info about the dataset can be found here: https://magenta.tensorflow.org/datasets/maestro

    Wav -> mp3 Conversion

    I had to convert the WAV audio files to MP3 so the dataset would fit within Kaggle's 20 GB limit. As a result, all audio files have the extension .mp3, which is inconsistent with the .wav extensions in the .csv meta files.
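    A small sketch of one way to work around this mismatch: rewrite the audio paths in the metadata CSV so they point at the .mp3 files. The CSV filename and the "audio_filename" column are assumptions about this upload's layout, not confirmed details.

    # Sketch: remap .wav paths in the metadata CSV to the uploaded .mp3 files.
    # The CSV name and the "audio_filename" column are assumptions; check the
    # files actually present in this dataset.
    import pandas as pd

    meta = pd.read_csv("maestro-v2.0.0.csv")
    meta["audio_filename"] = meta["audio_filename"].str.replace(r"\.wav$", ".mp3", regex=True)
    print(meta["audio_filename"].head())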

    Summary

    MAESTRO (MIDI and Audio Edited for Synchronous Tracks and Organization) is a dataset composed of over 200 hours of virtuosic piano performances captured with fine alignment (~3 ms) between note labels and audio waveforms.

    Dataset (from the Magenta site https://magenta.tensorflow.org/datasets/maestro )

    We partnered with organizers of the International Piano-e-Competition for the raw data used in this dataset. During each installment of the competition virtuoso pianists perform on Yamaha Disklaviers which, in addition to being concert-quality acoustic grand pianos, utilize an integrated high-precision MIDI capture and playback system. Recorded MIDI data is of sufficient fidelity to allow the audition stage of the competition to be judged remotely by listening to contestant performances reproduced over the wire on another Disklavier instrument.

    The dataset contains over 200 hours of paired audio and MIDI recordings from ten years of International Piano-e-Competition. The MIDI data includes key strike velocities and sustain/sostenuto/una corda pedal positions. Audio and MIDI files are aligned with ∼3 ms accuracy and sliced to individual musical pieces, which are annotated with composer, title, and year of performance. Uncompressed audio is of CD quality or higher (44.1–48 kHz 16-bit PCM stereo).

    A train/validation/test split configuration is also proposed, so that the same composition, even if performed by multiple contestants, does not appear in multiple subsets. Repertoire is mostly classical, including composers from the 17th to early 20th century.

    For more information about how the dataset was created and several applications of it, please see the paper where it was introduced: Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset.

    For an example application of the dataset, see our blog post on Wave2Midi2Wave.

    License

    The dataset is made available by Google LLC under a Creative Commons Attribution Non-Commercial Share-Alike 4.0 (CC BY-NC-SA 4.0) license.

    Acknowledgements

    More info on the MAESTRO dataset https://magenta.tensorflow.org/datasets/maestro Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset https://arxiv.org/abs/1810.12247

    Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse Engel, and Douglas Eck. "Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset." In International Conference on Learning Representations, 2019.

  17. Rock Paper Scissors Agents Battles

    • kaggle.com
    Updated Nov 8, 2020
    Cite
    Nikos Koumbakis (2020). Rock Paper Scissors Agents Battles [Dataset]. https://www.kaggle.com/jumaru/rock-paper-scissors-agents-battles/tasks
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 8, 2020
    Dataset provided by
    Kaggle
    Authors
    Nikos Koumbakis
    License

    CC0 1.0 Universal (Public Domain Dedication)https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Used for the Rock Paper Scissors Competition

    What insights can you get by analyzing a large number of games played by a pair of agents? If a good strategy is indistinguishable from a random strategy, how random is your agent?

    Content

    This dataset contains actions and rewards from game runs for the Rock Paper Scissors Competition. Data were recorded during evaluation of consequential RPS environment runs; an example can be seen in the "(Not so) Markov vs Nash Equilibrium" notebook. Use the starter notebook, "Starter: Rock Paper Scissors Agents Battles", to get insights into the battles.
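    One purely illustrative way to probe the "how random is your agent?" question is a chi-square test of an agent's action frequencies against a uniform distribution, sketched below. The file name and the "action" column are hypothetical placeholders for this dataset's actual layout.

    # Illustrative only: test an agent's action frequencies for uniformity.
    # "battles.csv" and the "action" column are hypothetical placeholders.
    import pandas as pd
    from scipy.stats import chisquare

    games = pd.read_csv("battles.csv")
    counts = games["action"].value_counts().reindex([0, 1, 2], fill_value=0)

    stat, p_value = chisquare(counts)  # H0: rock/paper/scissors played uniformly
    print(counts.to_dict(), f"chi2={stat:.2f}", f"p={p_value:.3f}")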

    Acknowledgements

    Inspiration

    A Beautiful Mind: dating a blonde girl and the Nash equilibrium

  18. Power Transformers FDD and RUL

    • kaggle.com
    zip
    Updated Sep 1, 2024
    Cite
    Iurii Katser (2024). Power Transformers FDD and RUL [Dataset]. https://www.kaggle.com/datasets/yuriykatser/power-transformers-fdd-and-rul
    Explore at:
    Available download formats: zip (33405750 bytes)
    Dataset updated
    Sep 1, 2024
    Authors
    Iurii Katser
    License

    CC0 1.0 Universal (Public Domain Dedication)https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Datasets with dissolved gas concentrations in power transformer oil for remaining useful life (RUL) and fault detection and diagnosis (FDD) problems.

    Introduction

    Power transformers (PTs) are an important component of a nuclear power plant (NPP). They convert alternating voltage and are instrumental in supplying power both to external NPP energy consumers and to the NPPs themselves. Currently, many PTs have exceeded their planned service life, which has been extended beyond the designated 25 years. Because of this extension, monitoring the technical condition of PTs has become an urgent matter.

    An important method for monitoring and diagnosing PTs is Chromatographic Analysis of Dissolved Gas (CADG). It is based on the principle of forced extraction and analysis of dissolved gases from PT oil. Almost all types of equipment defects are accompanied by the formation of gases that dissolve in the oil; certain types of defects generate certain gases in different quantities. The concentrations also differ at various stages of defect development, which allows the RUL of the PT to be calculated. At present, NPP control and diagnostic systems for PT equipment use predefined control limits for the concentration of dissolved gases in oil. The main disadvantages of this approach are the lack of automatic control and the insufficient quality of diagnostics, especially for PTs with extended service life. To address these shortcomings, machine learning (ML) methods can be used in diagnostic systems for the analysis of data obtained using CADG, just as they are used in the diagnostics of many other NPP components.

    Data description

    The datasets are available as .csv files containing 420 records of gas concentrations, presented as a time series. The gases are H2, CO, C2H4 and C2H2. The period between time points is 12 hours. There are 3000 datasets, split into train (2100 datasets) and test (900 datasets) sets.

    For the RUL problem, annotations are available (in separate files): each .csv file corresponds to a value, in points, equal to the time remaining until the equipment fails, measured from the end of the record.

    For FDD problems, there are labels (in separate files) with four PT operating modes (classes), listed below; a placeholder loading sketch follows the list:

    1. Normal mode (2436 datasets);
    2. Partial discharge: local dielectric breakdown in gas-filled cavities (127 datasets);
    3. Low energy discharge: sparking or arc discharges in poor contact connections of structural elements with different or floating potential; discharges between PT core structural elements, high voltage winding taps and the tank, high voltage winding and grounding; discharges in oil during contact switching (162 datasets);
    4. Low-temperature overheating: oil flow disruption in winding cooling channels and the magnetic system, causing low efficiency of the cooling system at temperatures < 300 °C (275 datasets).
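    The sketch below shows one way a single record of this collection might be loaded: a 420-row CSV of H2, CO, C2H4 and C2H2 concentrations sampled every 12 hours, plus separate label files for the FDD class and the RUL target. Every file and column name here is a placeholder, since the exact layout is not spelled out above.

    # Placeholder sketch: load one gas-concentration record plus the separate
    # FDD and RUL label files. All file and column names are placeholders.
    import pandas as pd

    gases = pd.read_csv("train/dataset_0001.csv")      # expected shape: (420, 4) for H2, CO, C2H4, C2H2
    fdd_labels = pd.read_csv("train_fdd_labels.csv")   # operating mode (class 1-4) per dataset
    rul_labels = pd.read_csv("train_rul_labels.csv")   # remaining useful life per dataset

    print(gases.shape)
    print(fdd_labels.head())
    print(rul_labels.head())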

    Data in this repository is an extension (test set added) of data from here and here.

    FDD problems statement

    In our case, the fault detection problem becomes a classification problem, since each record belongs to one of four labeled classes (one normal and three anomalous), so the model's output needs to be a class number. The problem can be stated as binary classification (healthy/anomalous) for fault detection, or multi-class classification (one of 4 states) for fault diagnosis.

    RUL problem statement

    To ensure high-quality maintenance and repair, it is vital to be aware of potential malfunctions and to predict the RUL of transformer equipment. Therefore, it is necessary to create a mathematical model that determines the RUL from the final 420 points.

    Data usage examples

    • The dataset was used in this article.
    • The dataset was used in research by Katser et al. that addresses the problem by proposing an ensemble of classifiers.
  19. Data Fusion Contest 2025 - Task 1 "Label Craft"

    • kaggle.com
    Updated Apr 1, 2025
    Cite
    Egor Andreasyan (2025). Data Fusion Contest 2025 - Task 1 "Label Craft" [Dataset]. https://www.kaggle.com/datasets/egorandreasyan/data-fusion-contest-2025-task-1-label-craft
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Egor Andreasyan
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    In this competition, you will have to develop an algorithm for the automatic categorization of products by their name and attributes, even under conditions of incomplete labeling.

    The category system is arranged in the form of a hierarchical tree (up to 5 levels of nesting), and product data comes from many trading platforms, which creates a number of difficulties:

    • Product labeling is incomplete and imperfect.
    • The attributes of the same products may differ on different platforms or be absent altogether.
    • As the catalog expands, new platforms and categories may appear that did not exist before.

    In this competition, we invite participants to try their hand at a task set up to be as close to the real-world one as possible (a purely illustrative baseline is sketched after the list below):

    • The task is complicated by the fact that, in the training sample, categories are labeled for only half of the products, and the test sample will contain categories for which there are no labeled examples.
    • At the same time, you are provided with a complete list of possible categories in advance - will you be able to take them all into account and accurately assign products to the right classes?
    • LLMs (Large Language Models) look promising for such tasks – but will they really help? This is a great chance to test different approaches!
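    A purely illustrative baseline for the setup above (not part of the competition materials): concatenate the product name and attributes into one text field and fit a linear classifier on the labeled half of the training data. The file and column names below are hypothetical.

    # Purely illustrative baseline; "train.csv" and the columns "name",
    # "attributes", "category" are hypothetical placeholders.
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    train = pd.read_csv("train.csv")
    labeled = train.dropna(subset=["category"])  # only part of the products are labeled
    text = labeled["name"].fillna("") + " " + labeled["attributes"].fillna("")

    model = make_pipeline(TfidfVectorizer(min_df=2), LogisticRegression(max_iter=1000))
    model.fit(text, labeled["category"])
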
  20. YIEDL Numerai Crypto data - Historical chunked

    • kaggle.com
    Updated Apr 6, 2025
    Cite
    Duuscha (2025). YIEDL Numerai Crypto data - Historical chunked [Dataset]. https://www.kaggle.com/datasets/duuuscha/yiedl-numerai-crypto-data-historical/versions/22
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 6, 2025
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Duuscha
    Description

    Originally taken from https://yiedl.ai/competition/datasets. The data have been downloaded and transformed to make working with them easier: NaN values have been set to -1 and features converted to int16 for memory efficiency. I left the dataset separated into chunks, as this is easier to work with.

    Example notebook showing how to load and combine the chunks: https://www.kaggle.com/code/duuuscha/example-usage-of-yiedl-data

    Daily version for submissions: https://www.kaggle.com/datasets/duuuscha/yiedl-numerai-crypto-dataset-daily/data
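    For orientation only, a generic chunk-combining sketch is given below; the chunk file names and format (parquet) are placeholders, and the linked example notebook shows the dataset's actual loading code.

    # Generic placeholder sketch for combining chunked files; file names and
    # the parquet format are assumptions - see the linked example notebook
    # for the real layout.
    import glob
    import pandas as pd

    chunk_paths = sorted(glob.glob("yiedl_chunks/chunk_*.parquet"))
    combined = pd.concat((pd.read_parquet(p) for p in chunk_paths), ignore_index=True)

    # Features are stored as int16 with NaN encoded as -1 (see description above).
    print(combined.shape)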

