100+ datasets found
  1. Meta Kaggle Code

    • kaggle.com
    zip
    Updated Jun 5, 2025
    Cite
    Kaggle (2025). Meta Kaggle Code [Dataset]. https://www.kaggle.com/datasets/kaggle/meta-kaggle-code/code
    Explore at:
    zip (143,722,388,562 bytes)
    Dataset updated
    Jun 5, 2025
    Dataset authored and provided by
    Kaggle (http://kaggle.com/)
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Explore our public notebook content!

    Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle, used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.

    Why we’re releasing this dataset

    By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.

    Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.

    The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!

    Sensitive data

    While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.

    Joining with Meta Kaggle

    The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.
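    As a hedged illustration of that join, the sketch below (Python with pandas) lists the code files on disk, treats each file name stem as a version id, and matches it against KernelVersions.csv. The local directory paths and the assumption that the id column is named "Id" are placeholders to adapt to your own copy of the data.

    ```python
    import pandas as pd
    from pathlib import Path

    # Local paths are placeholders; point them at your copies of the two datasets.
    meta_kaggle_dir = Path("meta-kaggle")
    meta_kaggle_code_dir = Path("meta-kaggle-code")

    # Assumption: KernelVersions.csv identifies each notebook version via an "Id"
    # column whose value matches the code file's name (stem) in Meta Kaggle Code.
    kernel_versions = pd.read_csv(meta_kaggle_dir / "KernelVersions.csv")

    # Collect the ids of the code files that are actually present on disk.
    code_ids = pd.DataFrame({
        "Id": [
            int(p.stem)
            for p in meta_kaggle_code_dir.rglob("*")
            if p.is_file() and p.stem.isdigit()
        ]
    })

    # Inner join keeps only the (commit-session) versions whose source is here.
    versions_with_code = kernel_versions.merge(code_ids, on="Id", how="inner")
    print(len(versions_with_code), "kernel versions have a matching source file")
    ```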

    File organization

    The files are organized into a two-level directory structure. Each top level folder contains up to 1 million files, e.g. - folder 123 contains all versions from 123,000,000 to 123,999,999. Each sub folder contains up to 1 thousand files, e.g. - 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
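    The folder for a given version id can therefore be computed directly. A minimal sketch, assuming ids map to folders exactly as described above:

    ```python
    from pathlib import Path

    def code_folder(version_id: int, root: str = "meta-kaggle-code") -> Path:
        """Return the two-level folder holding a given kernel version id."""
        top = version_id // 1_000_000        # folder 123 covers 123,000,000 to 123,999,999
        sub = (version_id // 1_000) % 1_000  # folder 123/456 covers 123,456,000 to 123,456,999
        return Path(root) / str(top) / str(sub)

    # Example: version id 123,456,789 lives under meta-kaggle-code/123/456/
    print(code_folder(123_456_789))
    ```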

    The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays
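    For the requester-pays bucket, a download might look like the sketch below using the google-cloud-storage Python client; the billing project id and the object path inside the bucket are placeholders, not values documented here.

    ```python
    from google.cloud import storage

    # Requester pays: downloads are billed to your own GCP project.
    BILLING_PROJECT = "your-gcp-project-id"  # placeholder

    client = storage.Client(project=BILLING_PROJECT)
    bucket = client.bucket("kaggle-meta-kaggle-code-downloads", user_project=BILLING_PROJECT)

    # The object path below is hypothetical; list the bucket to see the real layout.
    blob = bucket.blob("123/456/123456789.ipynb")
    blob.download_to_filename("123456789.ipynb")
    ```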

    Questions / Comments

    We love feedback! Let us know in the Discussion tab.

    Happy Kaggling!

  2. kaggle-entity-annotated-corpus-ner-dataset

    • huggingface.co
    Updated Jul 10, 2022
    Cite
    Rafael Arias Calles (2022). kaggle-entity-annotated-corpus-ner-dataset [Dataset]. https://huggingface.co/datasets/rjac/kaggle-entity-annotated-corpus-ner-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 10, 2022
    Authors
    Rafael Arias Calles
    License

    ODbL (https://choosealicense.com/licenses/odbl/)

    Description

    Date: 2022-07-10
    Files: ner_dataset.csv
    Source: Kaggle entity annotated corpus
    Notes: The dataset only contains the tokens and NER tag labels. Labels are uppercase.

      About Dataset
    

    from Kaggle Datasets

      Context
    

    Annotated Corpus for Named Entity Recognition, built from the GMB (Groningen Meaning Bank) corpus for entity classification, with enhanced and popular features produced by natural language processing applied to the data set. Tip: use a Pandas DataFrame to load the dataset if using Python for… See the full description on the dataset page: https://huggingface.co/datasets/rjac/kaggle-entity-annotated-corpus-ner-dataset.
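    A minimal loading sketch with the Hugging Face datasets library is shown below; the split name and the column layout are assumptions to confirm against the dataset card.

    ```python
    from datasets import load_dataset

    # The dataset id comes from the citation above; the split name and the
    # columns printed here are assumptions to confirm via ds and ds.features.
    ds = load_dataset("rjac/kaggle-entity-annotated-corpus-ner-dataset")
    print(ds)              # shows the available splits and features
    print(ds["train"][0])  # assumes a "train" split exists
    ```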

  3. 785 Million Language Translation Database for AI

    • kaggle.com
    Updated Aug 28, 2023
    Cite
    Ramakrishnan Lakshmanan (2023). 785 Million Language Translation Database for AI [Dataset]. https://www.kaggle.com/datasets/ramakrishnan1984/785-million-language-translation-database-ai-ml
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 28, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ramakrishnan Lakshmanan
    License

    GNU LGPL v3.0 (http://www.gnu.org/licenses/lgpl-3.0.html)

    Description

    Our groundbreaking translation dataset represents a monumental advancement in the field of natural language processing and machine translation. Comprising a staggering 785 million records, this corpus bridges language barriers by offering translations from English to an astonishing 548 languages. The dataset promises to be a cornerstone resource for researchers, engineers, and developers seeking to enhance their machine translation models, cross-lingual analysis, and linguistic investigations.

    Size of the dataset: 41 GB uncompressed, 20 GB compressed.

    Key Features:

    Scope and Scale: With a comprehensive collection of 785 million records, this dataset provides an unparalleled wealth of translated text. Each record consists of an English sentence paired with its translation in one of the 548 target languages, enabling multi-directional translation applications.

    Language Diversity: Encompassing translations into 548 languages, this dataset represents a diverse array of linguistic families, dialects, and scripts. From widely spoken languages to those with limited digital representation, the dataset bridges communication gaps on a global scale.

    Quality and Authenticity: The translations have been meticulously curated, verified, and cross-referenced to ensure high quality and authenticity. This attention to detail guarantees that the dataset is not only extensive but also reliable, serving as a solid foundation for machine learning applications. The data was collected from various open datasets for my personal ML projects and is being shared here for wider use.

    Use Case Versatility: Researchers and practitioners across a spectrum of domains can harness this dataset for a myriad of applications. It facilitates the training and evaluation of machine translation models, empowers cross-lingual sentiment analysis, aids in linguistic typology studies, and supports cultural and sociolinguistic investigations.

    Machine Learning Advancement: Machine translation models, especially neural machine translation (NMT) systems, can leverage this dataset to enhance their training. The large-scale nature of the dataset allows for more robust and contextually accurate translation outputs.

    Fine-tuning and Customization: Developers can fine-tune translation models using specific language pairs, offering a powerful tool for specialized translation tasks. This customization capability ensures that the dataset is adaptable to various industries and use cases.

    Data Format: The dataset is provided in a structured JSON format, facilitating easy integration into existing machine learning pipelines. This structured approach expedites research and experimentation. Each JSON record contains an English word and its equivalent word in the target language. The data was exported from a MongoDB database to ensure the uniqueness of the records; each record is unique and the records are sorted.
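    A minimal parsing sketch under those assumptions follows. The file name and the field names ("english", "language", "translation") are hypothetical, since the exact schema is not spelled out above, and a newline-delimited export (typical of mongoexport) is assumed; inspect one record of the real file first and adjust.

    ```python
    import json

    # Assumes a newline-delimited JSON export with hypothetical field names.
    pairs = []
    with open("translations.json", encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            pairs.append((record["english"], record["language"], record["translation"]))

    print(pairs[:3])
    ```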

    Access: The dataset is available for academic and research purposes, enabling the global AI community to contribute to and benefit from its usage. A well-documented API and sample code are provided to expedite exploration and integration.

    The English-to-548-languages translation dataset represents an incredible leap forward in advancing multilingual communication, breaking down barriers to understanding, and fostering collaboration on a global scale. It holds the potential to reshape how we approach cross-lingual communication, linguistic studies, and the development of cutting-edge translation technologies.

    Dataset Composition: The dataset is a culmination of translations from English, a widely spoken and understood language, into 548 distinct languages. Each language represents a unique linguistic and cultural background, providing a rich array of translation contexts. This diverse range of languages spans across various language families, regions, and linguistic complexities, making the dataset a comprehensive repository for linguistic research.

    Data Volume and Scale: With a staggering 785 million records, the dataset boasts an immense scale that captures a vast array of translations and linguistic nuances. Each translation entry consists of an English source text paired with its corresponding translation in one of the 548 target languages. This vast corpus allows researchers and practitioners to explore patterns, trends, and variations across languages, enabling the development of robust and adaptable translation models.

    Linguistic Coverage: The dataset covers an extensive set of languages, including but not limited to Indo-European, Afroasiatic, Sino-Tibetan, Austronesian, Niger-Congo, and many more. This broad linguistic coverage ensures that languages with varying levels of grammatical complexity, vocabulary richness, and syntactic structures are included, enhancing the applicability of translation models across diverse linguistic landscapes.

    Dataset Preparation: The translation ...

  4. Data from: AI & Computer Vision Dataset

    • kaggle.com
    Updated Apr 1, 2025
    Cite
    Khushi Yadav (2025). AI & Computer Vision Dataset [Dataset]. https://www.kaggle.com/datasets/khushikyad001/ai-and-computer-vision-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Khushi Yadav
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This dataset contains metadata related to three categories of AI and computer vision applications:

    Handwritten Math Solutions: Metadata on images of handwritten math problems with step-by-step solutions.

    Multi-lingual Street Signs: Road sign images in various languages, with translations.

    Security Camera Anomalies: Surveillance footage metadata distinguishing between normal and suspicious activities.

    The dataset is useful for machine learning, image recognition, OCR (Optical Character Recognition), anomaly detection, and AI model training.

  5. Wikibooks Dataset

    • kaggle.com
    • huggingface.co
    zip
    Updated Oct 22, 2021
    Cite
    Dhruvil Dave (2021). Wikibooks Dataset [Dataset]. https://www.kaggle.com/datasets/dhruvildave/wikibooks-dataset
    Explore at:
    zip (1,958,715,136 bytes)
    Dataset updated
    Oct 22, 2021
    Authors
    Dhruvil Dave
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Description

    This is the complete dataset of the contents of all the Wikibooks in 12 languages. The content contains books in the following languages: English, French, German, Spanish, Portuguese, Italian, Russian, Japanese, Dutch, Polish, Hungarian, and Hebrew; each in its own directory. Wikibooks are divided into chapters and each chapter has its own webpage. This dataset can be used for tasks like Machine Translation, Text Generation, Text Parsing, and Semantic Understanding of Natural Language. Body contents are provided both in newline-delimited textual format, as would be visible on the page, and as HTML for better semantic parsing.

    Refer to the starter notebook: Starter: Wikibooks dataset

    Data as of October 22, 2021.

    Image Credits: Unsplash - itfeelslikefilm

  6. FSDKaggle2018

    • zenodo.org
    • opendatalab.com
    • +2 more
    zip
    Updated Jan 24, 2020
    Cite
    Eduardo Fonseca; Xavier Favory; Jordi Pons; Frederic Font; Manoj Plakal; Daniel P. W. Ellis; Xavier Serra (2020). FSDKaggle2018 [Dataset]. http://doi.org/10.5281/zenodo.2552860
    Explore at:
    zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Eduardo Fonseca; Xavier Favory; Jordi Pons; Frederic Font; Manoj Plakal; Daniel P. W. Ellis; Xavier Serra
    Description

    FSDKaggle2018 is an audio dataset containing 11,073 audio files annotated with 41 labels of the AudioSet Ontology. FSDKaggle2018 has been used for the DCASE Challenge 2018 Task 2, which was run as a Kaggle competition titled Freesound General-Purpose Audio Tagging Challenge.

    Citation

    If you use the FSDKaggle2018 dataset or part of it, please cite our DCASE 2018 paper:

    Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Favory, Jordi Pons, Xavier Serra. "General-purpose Tagging of Freesound Audio with AudioSet Labels: Task Description, Dataset, and Baseline". Proceedings of the DCASE 2018 Workshop (2018)

    You can also consider citing our ISMIR 2017 paper, which describes how we gathered the manual annotations included in FSDKaggle2018.

    Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, "Freesound Datasets: A Platform for the Creation of Open Audio Datasets", In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017

    Contact

    You are welcome to contact Eduardo Fonseca should you have any questions at eduardo.fonseca@upf.edu.

    About this dataset

    Freesound Dataset Kaggle 2018 (or FSDKaggle2018 for short) is an audio dataset containing 11,073 audio files annotated with 41 labels of the AudioSet Ontology [1]. FSDKaggle2018 has been used for the Task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2018. Please visit the DCASE2018 Challenge Task 2 website for more information. This Task was hosted on the Kaggle platform as a competition titled Freesound General-Purpose Audio Tagging Challenge. It was organized by researchers from the Music Technology Group of Universitat Pompeu Fabra, and from Google Research’s Machine Perception Team.

    The goal of this competition was to build an audio tagging system that can categorize an audio clip as belonging to one of a set of 41 diverse categories drawn from the AudioSet Ontology.

    All audio samples in this dataset are gathered from Freesound [2] and are provided here as uncompressed PCM 16 bit, 44.1 kHz, mono audio files. Note that because Freesound content is collaboratively contributed, recording quality and techniques can vary widely.

    The ground truth data provided in this dataset has been obtained after a data labeling process which is described below in the Data labeling process section. FSDKaggle2018 clips are unequally distributed in the following 41 categories of the AudioSet Ontology:

    "Acoustic_guitar", "Applause", "Bark", "Bass_drum", "Burping_or_eructation", "Bus", "Cello", "Chime", "Clarinet", "Computer_keyboard", "Cough", "Cowbell", "Double_bass", "Drawer_open_or_close", "Electric_piano", "Fart", "Finger_snapping", "Fireworks", "Flute", "Glockenspiel", "Gong", "Gunshot_or_gunfire", "Harmonica", "Hi-hat", "Keys_jangling", "Knock", "Laughter", "Meow", "Microwave_oven", "Oboe", "Saxophone", "Scissors", "Shatter", "Snare_drum", "Squeak", "Tambourine", "Tearing", "Telephone", "Trumpet", "Violin_or_fiddle", "Writing".

    Some other relevant characteristics of FSDKaggle2018:

    • The dataset is split into a train set and a test set.

    • The train set is meant to be for system development and includes ~9.5k samples unequally distributed among 41 categories. The minimum number of audio samples per category in the train set is 94, and the maximum 300. The duration of the audio samples ranges from 300ms to 30s due to the diversity of the sound categories and the preferences of Freesound users when recording sounds. The total duration of the train set is roughly 18h.

    • Out of the ~9.5k samples from the train set, ~3.7k have manually-verified ground truth annotations and ~5.8k have non-verified annotations. The non-verified annotations of the train set have a quality estimate of at least 65-70% in each category. Check out the Data labeling process section below for more information about this aspect.

    • Non-verified annotations in the train set are properly flagged in train.csv so that participants can opt to use this information during the development of their systems (see the sketch after this list).

    • The test set is composed of 1.6k samples with manually-verified annotations and with a category distribution similar to that of the train set. The total duration of the test set is roughly 2h.

    • All audio samples in this dataset have a single label (i.e. they are only annotated with one label). Check out the Data labeling process section below for more information about this aspect. A single label should be predicted for each file in the test set.
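    A minimal sketch of using that flag with pandas, assuming the metadata CSV exposes a "label" column and a 0/1 "manually_verified" column (check the actual header of train_post_competition.csv):

    ```python
    import pandas as pd

    # Assumes "label" and a 0/1 "manually_verified" column in the metadata CSV.
    train = pd.read_csv("FSDKaggle2018.meta/train_post_competition.csv")

    verified = train[train["manually_verified"] == 1]
    non_verified = train[train["manually_verified"] == 0]

    print(f"verified: {len(verified)}, non-verified: {len(non_verified)}")
    print(verified["label"].value_counts().head())
    ```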

    Data labeling process

    The data labeling process started from a manual mapping between Freesound tags and AudioSet Ontology categories (or labels), which was carried out by researchers at the Music Technology Group, Universitat Pompeu Fabra, Barcelona. Using this mapping, a number of Freesound audio samples were automatically annotated with labels from the AudioSet Ontology. These annotations can be understood as weak labels since they express the presence of a sound category in an audio sample.

    Then, a data validation process was carried out in which a number of participants listened to the annotated sounds and manually assessed the presence/absence of an automatically assigned sound category, according to the AudioSet category description.

    Audio samples in FSDKaggle2018 are only annotated with a single ground truth label (see train.csv). A total of 3,710 annotations included in the train set of FSDKaggle2018 are annotations that have been manually validated as present and predominant (some with inter-annotator agreement but not all of them). This means that in most cases there is no additional acoustic material other than the labeled category. In a few cases there may be some additional sound events, but these additional events won't belong to any of the 41 categories of FSDKaggle2018.

    The rest of the annotations have not been manually validated and therefore some of them could be inaccurate. Nonetheless, we have estimated that at least 65-70% of the non-verified annotations per category in the train set are indeed correct. It can happen that some of these non-verified audio samples present several sound sources even though only one label is provided as ground truth. These additional sources are typically out of the set of the 41 categories, but in a few cases they could be within.

    More details about the data labeling process can be found in [3].

    License

    FSDKaggle2018 has licenses at two different levels, as explained next.

    All sounds in Freesound are released under Creative Commons (CC) licenses, and each audio clip has its own license as defined by the audio clip uploader in Freesound. For attribution purposes and to facilitate attribution of these files to third parties, we include a relation of the audio clips included in FSDKaggle2018 and their corresponding license. The licenses are specified in the files train_post_competition.csv and test_post_competition_scoring_clips.csv.

    In addition, FSDKaggle2018 as a whole is the result of a curation process and it has an additional license. FSDKaggle2018 is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the FSDKaggle2018.doc zip file.

    Files

    FSDKaggle2018 can be downloaded as a series of zip files with the following directory structure:

    root
    │
    └───FSDKaggle2018.audio_train/                     Audio clips in the train set
    │
    └───FSDKaggle2018.audio_test/                      Audio clips in the test set
    │
    └───FSDKaggle2018.meta/                            Files for evaluation setup
    │   │
    │   └───train_post_competition.csv                 Data split and ground truth for the train set
    │   │
    │   └───test_post_competition_scoring_clips.csv    Ground truth for the test set
    │
    └───FSDKaggle2018.doc/
        │
        └───README.md                                  The dataset description file you are reading
        │
        └───LICENSE-DATASET

  7. FSDKaggle2019

    • zenodo.org
    • data.niaid.nih.gov
    bin, zip
    Updated Jan 24, 2020
    Cite
    Eduardo Fonseca; Manoj Plakal; Frederic Font; Daniel P. W. Ellis; Xavier Serra (2020). FSDKaggle2019 [Dataset]. http://doi.org/10.5281/zenodo.3612637
    Explore at:
    bin, zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Eduardo Fonseca; Manoj Plakal; Frederic Font; Daniel P. W. Ellis; Xavier Serra
    Description

    FSDKaggle2019 is an audio dataset containing 29,266 audio files annotated with 80 labels of the AudioSet Ontology. FSDKaggle2019 has been used for the DCASE Challenge 2019 Task 2, which was run as a Kaggle competition titled Freesound Audio Tagging 2019.

    Citation

    If you use the FSDKaggle2019 dataset or part of it, please cite our DCASE 2019 paper:

    Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Serra. "Audio tagging with noisy labels and minimal supervision". Proceedings of the DCASE 2019 Workshop, NYC, US (2019)

    You can also consider citing our ISMIR 2017 paper, which describes how we gathered the manual annotations included in FSDKaggle2019.

    Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, "Freesound Datasets: A Platform for the Creation of Open Audio Datasets", In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017

    Data curators

    Eduardo Fonseca, Manoj Plakal, Xavier Favory, Jordi Pons

    Contact

    You are welcome to contact Eduardo Fonseca should you have any questions at eduardo.fonseca@upf.edu.

    ABOUT FSDKaggle2019

    Freesound Dataset Kaggle 2019 (or FSDKaggle2019 for short) is an audio dataset containing 29,266 audio files annotated with 80 labels of the AudioSet Ontology [1]. FSDKaggle2019 has been used for the Task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2019. Please visit the DCASE2019 Challenge Task 2 website for more information. This Task was hosted on the Kaggle platform as a competition titled Freesound Audio Tagging 2019. It was organized by researchers from the Music Technology Group (MTG) of Universitat Pompeu Fabra (UPF), and from the Sound Understanding team at Google AI Perception. The competition was intended to provide insight towards the development of broadly-applicable sound event classifiers able to cope with label noise and minimal supervision conditions.

    FSDKaggle2019 employs audio clips from the following sources:

    1. Freesound Dataset (FSD): a dataset being collected at the MTG-UPF based on Freesound content organized with the AudioSet Ontology
    2. The soundtracks of a pool of Flickr videos taken from the Yahoo Flickr Creative Commons 100M dataset (YFCC)

    The audio data is labeled using a vocabulary of 80 labels from Google’s AudioSet Ontology [1], covering diverse topics: Guitar and other Musical Instruments, Percussion, Water, Digestive, Respiratory sounds, Human voice, Human locomotion, Hands, Human group actions, Insect, Domestic animals, Glass, Liquid, Motor vehicle (road), Mechanisms, Doors, and a variety of Domestic sounds. The full list of categories can be inspected in vocabulary.csv (see Files & Download below). The goal of the task was to build a multi-label audio tagging system that can predict appropriate label(s) for each audio clip in a test set.

    What follows is a summary of some of the most relevant characteristics of FSDKaggle2019. Nevertheless, it is highly recommended to read our DCASE 2019 paper for a more in-depth description of the dataset and how it was built.

    Ground Truth Labels

    The ground truth labels are provided at the clip-level, and express the presence of a sound category in the audio clip, hence can be considered weak labels or tags. Audio clips have variable lengths (roughly from 0.3 to 30s).

    The audio content from FSD has been manually labeled by humans following a data labeling process using the Freesound Annotator platform. Most labels have inter-annotator agreement but not all of them. More details about the data labeling process and the Freesound Annotator can be found in [2].

    The YFCC soundtracks were labeled using automated heuristics applied to the audio content and metadata of the original Flickr clips. Hence, a substantial amount of label noise can be expected. The label noise can vary widely in amount and type depending on the category, including in- and out-of-vocabulary noises. More information about some of the types of label noise that can be encountered is available in [3].

    Specifically, FSDKaggle2019 features three types of label quality, one for each set in the dataset:

    • curated train set: correct (but potentially incomplete) labels
    • noisy train set: noisy labels
    • test set: correct and complete labels

    Further details can be found below in the sections for each set.

    Format

    All audio clips are provided as uncompressed PCM 16 bit, 44.1 kHz, mono audio files.

    DATA SPLIT

    FSDKaggle2019 consists of two train sets and one test set. The idea is to limit the supervision provided for training (i.e., the manually-labeled, hence reliable, data), thus promoting approaches to deal with label noise.

    Curated train set

    The curated train set consists of manually-labeled data from FSD.

    • Number of clips/class: 75 except in a few cases (where there are fewer)
    • Total number of clips: 4970
    • Avg number of labels/clip: 1.2
    • Total duration: 10.5 hours

    The duration of the audio clips ranges from 0.3 to 30s due to the diversity of the sound categories and the preferences of Freesound users when recording/uploading sounds. Labels are correct but potentially incomplete. It can happen that a few of these audio clips present additional acoustic material beyond the provided ground truth label(s).

    Noisy train set

    The noisy train set is a larger set of noisy web audio data from Flickr videos taken from the YFCC dataset [5].

    • Number of clips/class: 300
    • Total number of clips: 19,815
    • Avg number of labels/clip: 1.2
    • Total duration: ~80 hours

    The duration of the audio clips ranges from 1s to 15s, with the vast majority lasting 15s. Labels are automatically generated and purposefully noisy. No human validation is involved. The label noise can vary widely in amount and type depending on the category, including in- and out-of-vocabulary noises.

    Considering the numbers above, the per-class data distribution available for training is, for most of the classes, 300 clips from the noisy train set and 75 clips from the curated train set. This means 80% noisy / 20% curated at the clip level, while at the duration level the proportion is more extreme considering the variable-length clips.
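    A hedged sketch of combining the two train sets while tracking their provenance; the file names (train_curated.csv, train_noisy.csv) and the comma-separated "labels" column are assumptions to verify against the downloaded metadata.

    ```python
    import pandas as pd

    # File and column names are assumptions; verify against the downloaded CSVs.
    curated = pd.read_csv("train_curated.csv").assign(noisy_flag=0)
    noisy = pd.read_csv("train_noisy.csv").assign(noisy_flag=1)

    train = pd.concat([curated, noisy], ignore_index=True)
    train["label_list"] = train["labels"].str.split(",")  # multi-label ground truth

    print(train["noisy_flag"].value_counts(normalize=True))  # roughly 0.8 noisy / 0.2 curated
    print(train["label_list"].apply(len).mean())             # average labels per clip (~1.2)
    ```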

    Test set

    The test set is used for system evaluation and consists of manually-labeled data from FSD.

    • Number of clips/class: between 50 and 150
    • Total number of clips: 4481
    • Avg number of labels/clip: 1.4
    • Total duration: 12.9 hours

    The acoustic material present in the test set clips is labeled exhaustively using the aforementioned vocabulary of 80 classes. Most labels have inter-annotator agreement but not all of them. Except for human error, the label(s) are correct and complete considering the target vocabulary; nonetheless, a few clips could still present additional (unlabeled) acoustic content out of the vocabulary.

    During the DCASE2019 Challenge Task 2, the test set was split into two subsets, for the public and private leaderboards, and only the data corresponding to the public leaderboard was provided. In this current package you will find the full test set with all the test labels. To allow comparison with previous work, the file test_post_competition.csv includes a flag to determine the corresponding leaderboard (public or private).

  8. 🌟 Emoji Trends Dataset

    • kaggle.com
    Updated Jul 31, 2024
    Cite
    Waqar Ali (2024). 🌟 Emoji Trends Dataset [Dataset]. https://www.kaggle.com/datasets/waqi786/emoji-trends-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 31, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Waqar Ali
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This dataset provides a detailed analysis of emoji usage across various social media platforms. It captures how different emojis are used in different contexts, reflecting emotions, trends, and user demographics.

    With emojis becoming a universal digital language, this dataset helps researchers, marketers, and data analysts explore how people express emotions online and identify patterns in social media communication.

    📌 Key Features:

    😊 Emoji Details:
    • Emoji 🎭: The specific emoji used in a post, comment, or message.
    • Context 💬: The meaning or emotion associated with the emoji (e.g., Happy, Love, Funny, Sad).
    • Platform 🌐: The social media platform where the emoji was used (e.g., Facebook, Instagram, Twitter).

    👤 User Demographics:
    • User Age 🎂: Age of the user who posted the emoji (ranges from 13 to 65 years).
    • User Gender 🚻: Gender of the user (Male/Female).

    📈 Additional Insights:
    • Emoji Popularity 🔥: Frequency of each emoji’s usage across platforms.
    • Trends Over Time 📅: How emoji usage changes based on trends or events.
    • Regional Usage Patterns 🌍: How different cultures and regions use emojis differently.

    📊 Use Cases & Applications:
    🔹 Understanding emoji trends across social media
    🔹 Analyzing emotional expression through digital communication
    🔹 Exploring demographic differences in emoji usage
    🔹 Identifying platform-specific emoji preferences
    🔹 Enhancing sentiment analysis models with emoji insights
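    As a sketch of how the tabular features above might be explored, the snippet below counts emoji usage per platform with pandas; the file name and column names mirror the feature list but are assumptions about the actual CSV header.

    ```python
    import pandas as pd

    # File name and column names mirror the feature list above but are assumptions.
    df = pd.read_csv("emoji_trends.csv")

    # Most frequent emojis per platform, a simple proxy for "emoji popularity".
    popularity = (
        df.groupby(["Platform", "Emoji"])
          .size()
          .rename("count")
          .reset_index()
          .sort_values(["Platform", "count"], ascending=[True, False])
    )
    print(popularity.groupby("Platform").head(3))
    ```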

    ⚠️ Important Note: This dataset is synthetically generated for educational and analytical purposes. It does not contain real user data but is designed to reflect real-world trends in emoji usage.

  9. ChatGPT API and BERT NLP

    • figshare.com
    application/csv
    Updated Mar 13, 2024
    Cite
    Carmen Atkins (2024). ChatGPT API and BERT NLP [Dataset]. http://doi.org/10.6084/m9.figshare.25403407.v2
    Explore at:
    application/csv
    Dataset updated
    Mar 13, 2024
    Dataset provided by
    figshare
    Authors
    Carmen Atkins
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    • input_prompts.csv provides the inputs for the ChatGPT API (countries and their respective prompts).
    • topic_consolidations.csv contains the 4,018 unique topics listed across all ChatGPT responses to prompts in our study and their corresponding cluster labels after applying K-means++ clustering (n = 50) via natural language processing with Bidirectional Encoder Representations from Transformers (BERT). ChatGPT response topics come from both versions (3.5 and 4) over 10 iterations each (per each country).
    • ChatGPT_prompt_automation.ipynb is the Jupyter notebook of Python code used to run the API to prompt ChatGPT and gather responses.
    • topic_consolidation_BERT.ipynb is the Jupyter notebook of Python code used to process the 4,018 unique topics gathered through BERT NLP. This code was adapted from Vimal Pillar on Kaggle (https://www.kaggle.com/code/vimalpillai/text-clustering-with-sentence-bert).
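    A minimal re-creation of the topic-consolidation step described above (not the authors' exact notebook): embed each topic string with a sentence-BERT model and cluster the embeddings with k-means++. The model name and placeholder topics are assumptions.

    ```python
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    # Placeholder topics stand in for the 4,018 unique ChatGPT response topics.
    topics = ["example topic A", "example topic B", "example topic C"]

    model = SentenceTransformer("all-MiniLM-L6-v2")  # model choice is an assumption
    embeddings = model.encode(topics)

    # k-means++ with n = 50 clusters as described above (capped for this tiny demo).
    kmeans = KMeans(n_clusters=min(50, len(topics)), init="k-means++", n_init=10, random_state=0)
    cluster_labels = kmeans.fit_predict(embeddings)
    print(dict(zip(topics, cluster_labels)))
    ```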

  10. Google Patent Phrase Similarity Dataset

    • kaggle.com
    Updated Jul 22, 2022
    Cite
    Google (2022). Google Patent Phrase Similarity Dataset [Dataset]. https://www.kaggle.com/datasets/google/google-patent-phrase-similarity-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 22, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Google
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This is a human-rated contextual phrase-to-phrase matching dataset focused on technical terms from patents. In addition to the similarity scores that are typically included in other benchmark datasets, we include granular rating classes similar to WordNet, such as synonym, antonym, hypernym, hyponym, holonym, meronym, and domain-related. The dataset was used in the U.S. Patent Phrase to Phrase Matching competition.

    The dataset was generated with focus on the following:
    • Phrase disambiguation: certain keywords and phrases can have multiple different meanings. For example, the phrase "mouse" may refer to an animal or a computer input device. To help disambiguate the phrases we have included Cooperative Patent Classification (CPC) classes with each pair of phrases.
    • Adversarial keyword match: there are phrases that have matching keywords but are otherwise unrelated (e.g. “container section” → “kitchen container”, “offset table” → “table fan”). Many models will not do well on such data (e.g. bag of words models). Our dataset is designed to include many such examples.
    • Hard negatives: we created our dataset with the aim to improve upon current state of the art language models. Specifically, we have used the BERT model to generate some of the target phrases. So our dataset contains many human rated examples of phrase pairs that BERT may identify as very similar but in fact they may not be.

    Each entry of the dataset contains two phrases (anchor and target), a context CPC class, a rating class, and a similarity score. The rating classes have the following meanings:
    • 4 - Very high.
    • 3 - High.
    • 2 - Medium.
    • 2a - Hyponym (broad-narrow match).
    • 2b - Hypernym (narrow-broad match).
    • 2c - Structural match.
    • 1 - Low.
    • 1a - Antonym.
    • 1b - Meronym (a part of).
    • 1c - Holonym (a whole of).
    • 1d - Other high level domain match.
    • 0 - Not related.

    The dataset is split into a training (75%), validation (5%), and test (20%) sets. When splitting the data all of the entries with the same anchor are kept together in the same set. There are 106 different context CPC classes and all of them are represented in the training set.
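    The official split ships with the dataset, but the grouping principle can be illustrated with scikit-learn's GroupShuffleSplit, which keeps all rows that share an anchor on the same side of the split; the file and column names below are assumptions.

    ```python
    import pandas as pd
    from sklearn.model_selection import GroupShuffleSplit

    # File and column names are assumptions; the official split ships with the data.
    df = pd.read_csv("phrase_similarity.csv")

    splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
    train_idx, test_idx = next(splitter.split(df, groups=df["anchor"]))

    train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]
    # No anchor appears on both sides of the split.
    assert set(train_df["anchor"]).isdisjoint(set(test_df["anchor"]))
    ```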

    More details about the dataset are available in the corresponding paper. Please cite the paper if you use the dataset.

  11. Kaggle EyePACS Dataset

    • paperswithcode.com
    Updated Oct 28, 2020
    Cite
    (2020). Kaggle EyePACS Dataset [Dataset]. https://paperswithcode.com/dataset/kaggle-eyepacs
    Explore at:
    Dataset updated
    Oct 28, 2020
    Description

    Diabetic retinopathy is the leading cause of blindness in the working-age population of the developed world. It is estimated to affect over 93 million people.


    The US Center for Disease Control and Prevention estimates that 29.1 million people in the US have diabetes and the World Health Organization estimates that 347 million people have the disease worldwide. Diabetic Retinopathy (DR) is an eye disease associated with long-standing diabetes. Around 40% to 45% of Americans with diabetes have some stage of the disease. Progression to vision impairment can be slowed or averted if DR is detected in time; however, this can be difficult, as the disease often shows few symptoms until it is too late to provide effective treatment.

    Currently, detecting DR is a time-consuming and manual process that requires a trained clinician to examine and evaluate digital color fundus photographs of the retina. By the time human readers submit their reviews, often a day or two later, the delayed results lead to lost follow-up, miscommunication, and delayed treatment.

    Clinicians can identify DR by the presence of lesions associated with the vascular abnormalities caused by the disease. While this approach is effective, its resource demands are high. The expertise and equipment required are often lacking in areas where the rate of diabetes in local populations is high and DR detection is most needed. As the number of individuals with diabetes continues to grow, the infrastructure needed to prevent blindness due to DR will become even more insufficient.

    The need for a comprehensive and automated method of DR screening has long been recognized, and previous efforts have made good progress using image classification, pattern recognition, and machine learning. With color fundus photography as input, the goal of this competition is to push an automated detection system to the limit of what is possible – ideally resulting in models with realistic clinical potential. The winning models will be open sourced to maximize the impact such a model can have on improving DR detection.

    Acknowledgements This competition is sponsored by the California Healthcare Foundation.

    Retinal images were provided by EyePACS, a free platform for retinopathy screening.

  12. English Text Translation Dataset

    • kaggle.com
    Updated Aug 20, 2024
    Cite
    Arjun Basandrai (2024). English Text Translation Dataset [Dataset]. https://www.kaggle.com/datasets/arjunbasandrai/english-german-arabic-text-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 20, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Arjun Basandrai
    License

    CC0 1.0 Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    This is a dataset of translations from English to various other languages.

    Content: English to German and Arabic.

    Acknowledgements: Arabic translations taken from this dataset.

  13. Sanskrit-English Translation Dataset

    • kaggle.com
    Updated May 1, 2023
    Cite
    Devesh Parmar (2023). Sanskrit-English Translation Dataset [Dataset]. https://www.kaggle.com/datasets/deveshparmar01/sanskrit2/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 1, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Devesh Parmar
    Description

    Dataset

    This dataset was created by Devesh Parmar

    Released under Data files © Original Authors

    Contents

  14. Iris Dataset

    • kaggle.com
    Updated Jul 25, 2021
    Cite
    Rutuja Vaidya (2021). Iris Dataset [Dataset]. https://www.kaggle.com/rutujavaidya/iris-dataset/activity
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 25, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rutuja Vaidya
    Description

    Predict the optimum number of clusters and represent it visually.
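    One common way to pick the optimum number of clusters is the elbow method; a minimal sketch on scikit-learn's bundled iris data (rather than the uploaded CSV) is shown below.

    ```python
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import load_iris

    # Elbow method: plot inertia against k and look for the "bend".
    X = load_iris().data

    ks = range(1, 11)
    inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_ for k in ks]

    plt.plot(ks, inertias, marker="o")
    plt.xlabel("Number of clusters k")
    plt.ylabel("Inertia (within-cluster sum of squares)")
    plt.title("Elbow method on the iris data")
    plt.show()
    ```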

  15. Farsi words with Farsi meaning

    • kaggle.com
    Updated Feb 28, 2025
    Cite
    mohammad soleimanirad (2025). Farsi words with Farsi meaning [Dataset]. https://www.kaggle.com/datasets/mohammadsoleimanirad/farsi-words-with-farsi-meaning/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 28, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    mohammad soleimanirad
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by mohammad soleimanirad

    Released under CC BY-NC-SA 4.0

    Contents

  16. k-mean-customer-analysis

    • kaggle.com
    Updated Sep 20, 2024
    Cite
    Quân Nguyễn20202 (2024). k-mean-customer-analysis [Dataset]. https://www.kaggle.com/qunnguyn20202/k-mean-customer-analysis/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 20, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Quân Nguyễn20202
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Quân Nguyễn20202

    Released under MIT

    Contents

  17. Anxiety and Depression Mental Health Factors

    • kaggle.com
    Updated Mar 14, 2025
    Cite
    AKshay (2025). Anxiety and Depression Mental Health Factors [Dataset]. https://www.kaggle.com/datasets/ak0212/anxiety-and-depression-mental-health-factors
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 14, 2025
    Dataset provided by
    Kaggle
    Authors
    AKshay
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This dataset contains data related to anxiety, depression, and mental health influences. It includes demographic details, lifestyle habits, mental health indicators, medical history, coping mechanisms, and stress factors. The dataset is designed for mental health analysis, predictive modeling, and research on the impact of various factors on mental well-being.

    Features Included:

    Demographics: Age, Gender, Education, Employment Status

    Lifestyle Factors: Sleep Hours, Physical Activity, Social Support

    Mental Health Metrics: Anxiety Score, Depression Score, Stress Level

    Medical History: Family History of Mental Illness, Chronic Illnesses, Medication Use

    Coping Strategies: Therapy, Meditation, Substance Use

    Additional Factors: Financial Stress, Work Stress, Self-Esteem, Life Satisfaction, Loneliness

    Age

    Gender

    Education_Level

    Employment_Status

    Sleep_Hours

    Physical_Activity_Hrs

    Social_Support_Score

    Anxiety_Score

    Depression_Score

    Stress_Level

    Family_History_Mental_Illness

    Chronic_Illnesses

    Medication_Use

    Therapy

    Meditation

    Substance_Use

    Financial_Stress

    Work_Stress

    Self_Esteem_Score

    Life_Satisfaction_Score

    Loneliness_Score

  18. frick-my-real-insights-means

    • kaggle.com
    Updated Feb 21, 2022
    Cite
    Aniruddh K Budhgavi (2022). frick-my-real-insights-means [Dataset]. https://www.kaggle.com/datasets/aniruddhkb/frick-my-real-insights-means
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 21, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aniruddh K Budhgavi
    Description

    Dataset

    This dataset was created by Aniruddh K Budhgavi

    Contents

  19. Clustering the n mean

    • kaggle.com
    Updated Aug 12, 2021
    Cite
    Yashaswi .B (2021). Clustering the n mean [Dataset]. https://www.kaggle.com/datasets/yashaswib/clustering-the-n-mean
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 12, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Yashaswi .B
    Description

    Dataset

    This dataset was created by Yashaswi .B

    Contents

  20. Online Retail K-means & Hierarchical Clustering

    • kaggle.com
    Updated Oct 26, 2019
    Cite
    Manish Kumar (2019). Online Retail K-means & Hierarchical Clustering [Dataset]. https://www.kaggle.com/hellbuoy/online-retail-customer-clustering/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 26, 2019
    Dataset provided by
    Kaggle
    Authors
    Manish Kumar
    Description

    Context

    Online retail is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

    Business Goal

    We will use the online retail transnational dataset to build an RFM clustering and choose the best set of customers that the company should target.
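    A hedged sketch of the RFM-plus-k-means approach described above; the column names follow the classic online-retail schema (InvoiceNo, InvoiceDate, Quantity, UnitPrice, CustomerID) and the file name and encoding are placeholders, so adjust them to the actual upload.

    ```python
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans

    # Column names, file name, and encoding are assumptions about the uploaded CSV.
    df = pd.read_csv("OnlineRetail.csv", encoding="ISO-8859-1", parse_dates=["InvoiceDate"])
    df = df.dropna(subset=["CustomerID"])
    df["Amount"] = df["Quantity"] * df["UnitPrice"]

    # RFM features: days since last purchase, number of invoices, total spend.
    snapshot = df["InvoiceDate"].max() + pd.Timedelta(days=1)
    rfm = df.groupby("CustomerID").agg(
        Recency=("InvoiceDate", lambda s: (snapshot - s.max()).days),
        Frequency=("InvoiceNo", "nunique"),
        Monetary=("Amount", "sum"),
    )

    X = StandardScaler().fit_transform(rfm)
    rfm["Cluster"] = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
    print(rfm.groupby("Cluster").mean())
    ```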
