52 datasets found
  1. Data from: Open Images

    • kaggle.com
    • opendatalab.com
    zip
    Updated Feb 12, 2019
    Cite
    Google BigQuery (2019). Open Images [Dataset]. https://www.kaggle.com/datasets/bigquery/open-images
    Available download formats: zip
    Dataset updated
    Feb 12, 2019
    Dataset provided by
    BigQuery (https://cloud.google.com/bigquery)
    Authors
    Google BigQuery
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Context

    Labeled datasets are useful in machine learning research.

    Content

    This public dataset contains approximately 9 million URLs and metadata for images that have been annotated with labels spanning more than 6,000 categories.

    Tables: 1) annotations_bbox 2) dict 3) images 4) labels

    Update Frequency: Quarterly

    Querying BigQuery Tables

    Fork this kernel to get started.

    Acknowledgements

    https://bigquery.cloud.google.com/dataset/bigquery-public-data:open_images

    https://cloud.google.com/bigquery/public-data/openimages

    APA-style citation: Google Research (2016). The Open Images dataset [Image urls and labels]. Available from github: https://github.com/openimages/dataset.

    Use: The annotations are licensed by Google Inc. under CC BY 4.0 license.

    The images referenced in the dataset are listed as having a CC BY 2.0 license. Note: while we tried to identify images that are licensed under a Creative Commons Attribution license, we make no representations or warranties regarding the license status of each image and you should verify the license for each image yourself.

    Banner Photo by Mattias Diesel from Unsplash.

    Inspiration

    • Which labels are in the dataset?
    • Which labels have "bus" in their display names?
    • How many images of a trolleybus are in the dataset?
    • What are some landing pages of images with a trolleybus?
    • Which images with cherries are in the training set?
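
    As a rough illustration, the first two questions could be answered from Python with the google-cloud-bigquery client. This is a minimal sketch; the column names label_name and label_display_name in the dict table are assumptions about the schema and should be verified against the dataset before use.

    from google.cloud import bigquery

    # Requires Google Cloud credentials with BigQuery access.
    client = bigquery.Client()

    # Assumed schema: `dict` maps machine label names to human-readable display names.
    query = """
        SELECT label_name, label_display_name
        FROM `bigquery-public-data.open_images.dict`
        WHERE LOWER(label_display_name) LIKE '%bus%'
    """
    for row in client.query(query).result():
        print(row.label_name, row.label_display_name)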

  2. civil_comments

    • tensorflow.org
    • huggingface.co
    Updated Feb 28, 2023
    Cite
    (2023). civil_comments [Dataset]. https://www.tensorflow.org/datasets/catalog/civil_comments
    Dataset updated
    Feb 28, 2023
    Description

    This version of the CivilComments Dataset provides access to the primary seven labels that were annotated by crowd workers; the toxicity and other tags are values between 0 and 1 indicating the fraction of annotators who assigned these attributes to the comment text.

    The other tags are only available for a fraction of the input examples. They are currently ignored for the main dataset; the CivilCommentsIdentities set includes those labels, but only consists of the subset of the data with them. The other attributes that were part of the original CivilComments release are included only in the raw data. See the Kaggle documentation for more details about the available features.

    The comments in this dataset come from an archive of the Civil Comments platform, a commenting plugin for independent news sites. These public comments were created from 2015 - 2017 and appeared on approximately 50 English-language news sites across the world. When Civil Comments shut down in 2017, they chose to make the public comments available in a lasting open archive to enable future research. The original data, published on figshare, includes the public comment text, some associated metadata such as article IDs, publication IDs, timestamps and commenter-generated "civility" labels, but does not include user ids. Jigsaw extended this dataset by adding additional labels for toxicity, identity mentions, as well as covert offensiveness. This data set is an exact replica of the data released for the Jigsaw Unintended Bias in Toxicity Classification Kaggle challenge. This dataset is released under CC0, as is the underlying comment text.

    For comments that have a parent_id also in the civil comments data, the text of the previous comment is provided as the "parent_text" feature. Note that the splits were made without regard to this information, so using previous comments may leak some information. The annotators did not have access to the parent text when making the labels.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('civil_comments', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.
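
    Since the toxicity attributes are fractions between 0 and 1, a common (but not dataset-mandated) follow-up step is to threshold them into binary targets. A minimal sketch, assuming the TFDS examples expose the 'text' and 'toxicity' features described above:

    import tensorflow_datasets as tfds

    ds = tfds.load('civil_comments', split='train')
    for ex in ds.take(4):
        # 0.5 is a conventional threshold choice, not part of the dataset definition.
        is_toxic = ex['toxicity'] >= 0.5
        print(bool(is_toxic.numpy()), ex['text'].numpy()[:60])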

  3. FSDKaggle2019

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
    + more versions
    Cite
    Xavier Serra (2020). FSDKaggle2019 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3612636
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Daniel P. W. Ellis
    Manoj Plakal
    Eduardo Fonseca
    Xavier Serra
    Frederic Font
    Description

    FSDKaggle2019 is an audio dataset containing 29,266 audio files annotated with 80 labels of the AudioSet Ontology. FSDKaggle2019 has been used for the DCASE Challenge 2019 Task 2, which was run as a Kaggle competition titled Freesound Audio Tagging 2019.

    Citation

    If you use the FSDKaggle2019 dataset or part of it, please cite our DCASE 2019 paper:

    Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Serra. "Audio tagging with noisy labels and minimal supervision". Proceedings of the DCASE 2019 Workshop, NYC, US (2019)

    You can also consider citing our ISMIR 2017 paper, which describes how we gathered the manual annotations included in FSDKaggle2019.

    Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, "Freesound Datasets: A Platform for the Creation of Open Audio Datasets", In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017

    Data curators

    Eduardo Fonseca, Manoj Plakal, Xavier Favory, Jordi Pons

    Contact

    You are welcome to contact Eduardo Fonseca should you have any questions at eduardo.fonseca@upf.edu.

    ABOUT FSDKaggle2019

    Freesound Dataset Kaggle 2019 (or FSDKaggle2019 for short) is an audio dataset containing 29,266 audio files annotated with 80 labels of the AudioSet Ontology [1]. FSDKaggle2019 has been used for Task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2019. Please visit the DCASE2019 Challenge Task 2 website for more information. This Task was hosted on the Kaggle platform as a competition titled Freesound Audio Tagging 2019. It was organized by researchers from the Music Technology Group (MTG) of Universitat Pompeu Fabra (UPF) and from the Sound Understanding team at Google AI Perception. The competition was intended to provide insight into the development of broadly applicable sound event classifiers able to cope with label noise and minimal supervision conditions.

    FSDKaggle2019 employs audio clips from the following sources:

    Freesound Dataset (FSD): a dataset being collected at the MTG-UPF based on Freesound content organized with the AudioSet Ontology

    The soundtracks of a pool of Flickr videos taken from the Yahoo Flickr Creative Commons 100M dataset (YFCC)

    The audio data is labeled using a vocabulary of 80 labels from Google’s AudioSet Ontology [1], covering diverse topics: Guitar and other Musical Instruments, Percussion, Water, Digestive, Respiratory sounds, Human voice, Human locomotion, Hands, Human group actions, Insect, Domestic animals, Glass, Liquid, Motor vehicle (road), Mechanisms, Doors, and a variety of Domestic sounds. The full list of categories can be inspected in vocabulary.csv (see Files & Download below). The goal of the task was to build a multi-label audio tagging system that can predict appropriate label(s) for each audio clip in a test set.

    What follows is a summary of some of the most relevant characteristics of FSDKaggle2019. Nevertheless, it is highly recommended to read our DCASE 2019 paper for a more in-depth description of the dataset and how it was built.

    Ground Truth Labels

    The ground truth labels are provided at the clip-level, and express the presence of a sound category in the audio clip, hence can be considered weak labels or tags. Audio clips have variable lengths (roughly from 0.3 to 30s).

    The audio content from FSD has been manually labeled by humans following a data labeling process using the Freesound Annotator platform. Most labels have inter-annotator agreement but not all of them. More details about the data labeling process and the Freesound Annotator can be found in [2].

    The YFCC soundtracks were labeled using automated heuristics applied to the audio content and metadata of the original Flickr clips. Hence, a substantial amount of label noise can be expected. The label noise can vary widely in amount and type depending on the category, including in- and out-of-vocabulary noises. More information about some of the types of label noise that can be encountered is available in [3].

    Specifically, FSDKaggle2019 features three types of label quality, one for each set in the dataset:

    curated train set: correct (but potentially incomplete) labels

    noisy train set: noisy labels

    test set: correct and complete labels

    Further details can be found below in the sections for each set.

    Format

    All audio clips are provided as uncompressed PCM 16 bit, 44.1 kHz, mono audio files.
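
    Since the clips are plain 16-bit, 44.1 kHz, mono PCM files, they can be read with any standard audio library. A minimal sketch using the soundfile package (the file name is a hypothetical placeholder, not a real clip ID):

    import soundfile as sf

    # Hypothetical clip from the curated train set.
    audio, sample_rate = sf.read('FSDKaggle2019.audio_train_curated/example_clip.wav')
    assert sample_rate == 44100            # 44.1 kHz, as stated above
    print(audio.shape)                     # 1-D array, since the files are mono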

    DATA SPLIT

    FSDKaggle2019 consists of two train sets and one test set. The idea is to limit the supervision provided for training (i.e., the manually-labeled, hence reliable, data), thus promoting approaches to deal with label noise.

    Curated train set

    The curated train set consists of manually-labeled data from FSD.

    Number of clips/class: 75, except in a few cases (where there are fewer)

    Total number of clips: 4970

    Avg number of labels/clip: 1.2

    Total duration: 10.5 hours

    The duration of the audio clips ranges from 0.3 to 30s due to the diversity of the sound categories and the preferences of Freesound users when recording/uploading sounds. Labels are correct but potentially incomplete. It can happen that a few of these audio clips present additional acoustic material beyond the provided ground truth label(s).

    Noisy train set

    The noisy train set is a larger set of noisy web audio data from Flickr videos taken from the YFCC dataset [5].

    Number of clips/class: 300

    Total number of clips: 19,815

    Avg number of labels/clip: 1.2

    Total duration: ~80 hours

    The duration of the audio clips ranges from 1s to 15s, with the vast majority lasting 15s. Labels are automatically generated and purposefully noisy. No human validation is involved. The label noise can vary widely in amount and type depending on the category, including in- and out-of-vocabulary noises.

    Considering the numbers above, the per-class data distribution available for training is, for most of the classes, 300 clips from the noisy train set and 75 clips from the curated train set. This means 80% noisy / 20% curated at the clip level, while at the duration level the proportion is more extreme considering the variable-length clips.

    Test set

    The test set is used for system evaluation and consists of manually-labeled data from FSD.

    Number of clips/class: between 50 and 150

    Total number of clips: 4481

    Avg number of labels/clip: 1.4

    Total duration: 12.9 hours

    The acoustic material present in the test set clips is labeled exhaustively using the aforementioned vocabulary of 80 classes. Most labels have inter-annotator agreement but not all of them. Except for human error, the labels are correct and complete considering the target vocabulary; nonetheless, a few clips could still present additional (unlabeled) acoustic content from outside the vocabulary.

    During the DCASE2019 Challenge Task 2, the test set was split into two subsets, for the public and private leaderboards, and only the data corresponding to the public leaderboard was provided. In this current package you will find the full test set with all the test labels. To allow comparison with previous work, the file test_post_competition.csv includes a flag to determine the corresponding leaderboard (public or private) for each test clip (see more info in Files & Download below).
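
    A minimal sketch of parsing the per-clip ground truth with pandas, assuming the CSV layout described above with fname, labels (comma-separated tags) and usage (public/private leaderboard) columns; the exact column names and the FSDKaggle2019.meta directory name are assumptions to verify against the downloaded files:

    import pandas as pd

    test = pd.read_csv('FSDKaggle2019.meta/test_post_competition.csv')

    # Ground truth is multi-label: split the comma-separated tag string into a list.
    test['label_list'] = test['labels'].str.split(',')

    # Flag distinguishing public vs. private leaderboard clips.
    print(test['usage'].value_counts())
    print(test[['fname', 'label_list']].head())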

    Acoustic mismatch

    As mentioned before, FSDKaggle2019 uses audio clips from two sources:

    FSD: curated train set and test set, and

    YFCC: noisy train set.

    While the sources of audio (Freesound and Flickr) are collaboratively contributed and pretty diverse themselves, a certain acoustic mismatch can be expected between FSD and YFCC. We conjecture this mismatch comes from a variety of reasons. For example, through acoustic inspection of a small sample of both data sources, we find a higher percentage of high quality recordings in FSD. In addition, audio clips in Freesound are typically recorded with the purpose of capturing audio, which is not necessarily the case in YFCC.

    This mismatch can have an impact on the evaluation, considering that most of the train data come from YFCC, while all test data are drawn from FSD. This constraint (i.e., noisy training data coming from a different web audio source than the test set) is sometimes a real-world condition.

    LICENSE

    All clips in FSDKaggle2019 are released under Creative Commons (CC) licenses. For attribution purposes and to facilitate attribution of these files to third parties, we include a mapping from the audio clips to their corresponding licenses.

    Curated train set and test set. All clips in Freesound are released under different modalities of Creative Commons (CC) licenses, and each audio clip has its own license as defined by the audio clip uploader in Freesound, some of them requiring attribution to their original authors and some forbidding further commercial reuse. The licenses are specified in the files train_curated_post_competition.csv and test_post_competition.csv. These licenses can be CC0, CC-BY, CC-BY-NC and CC Sampling+.

    Noisy train set. Similarly, the licenses of the soundtracks from Flickr used in FSDKaggle2019 are specified in the file train_noisy_post_competition.csv. These licenses can be CC-BY and CC BY-SA.

    In addition, FSDKaggle2019 as a whole is the result of a curation process and it has an additional license. FSDKaggle2019 is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the FSDKaggle2019.doc zip file.

    FILES & DOWNLOAD

    FSDKaggle2019 can be downloaded as a series of zip files with the following directory structure:

    root
    │
    └───FSDKaggle2019.audio_train_curated/   Audio clips in the curated train set
    │
    └───FSDKaggle2019.audio_train_noisy/     Audio clips in the noisy train set

  4. FSDKaggle2018

    • data.niaid.nih.gov
    • opendatalab.com
    • +1more
    Updated Jan 24, 2020
    + more versions
    Cite
    Manoj Plakal (2020). FSDKaggle2018 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2552859
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Xavier Favory
    Daniel P. W. Ellis
    Manoj Plakal
    Eduardo Fonseca
    Xavier Serra
    Frederic Font
    Jordi Pons
    Description

    FSDKaggle2018 is an audio dataset containing 11,073 audio files annotated with 41 labels of the AudioSet Ontology. FSDKaggle2018 has been used for the DCASE Challenge 2018 Task 2, which was run as a Kaggle competition titled Freesound General-Purpose Audio Tagging Challenge.

    Citation

    If you use the FSDKaggle2018 dataset or part of it, please cite our DCASE 2018 paper:

    Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Favory, Jordi Pons, Xavier Serra. "General-purpose Tagging of Freesound Audio with AudioSet Labels: Task Description, Dataset, and Baseline". Proceedings of the DCASE 2018 Workshop (2018)

    You can also consider citing our ISMIR 2017 paper, which describes how we gathered the manual annotations included in FSDKaggle2018.

    Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, "Freesound Datasets: A Platform for the Creation of Open Audio Datasets", In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017

    Contact

    You are welcome to contact Eduardo Fonseca should you have any questions at eduardo.fonseca@upf.edu.

    About this dataset

    Freesound Dataset Kaggle 2018 (or FSDKaggle2018 for short) is an audio dataset containing 11,073 audio files annotated with 41 labels of the AudioSet Ontology [1]. FSDKaggle2018 has been used for the Task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2018. Please visit the DCASE2018 Challenge Task 2 website for more information. This Task was hosted on the Kaggle platform as a competition titled Freesound General-Purpose Audio Tagging Challenge. It was organized by researchers from the Music Technology Group of Universitat Pompeu Fabra, and from Google Research’s Machine Perception Team.

    The goal of this competition was to build an audio tagging system that can categorize an audio clip as belonging to one of a set of 41 diverse categories drawn from the AudioSet Ontology.

    All audio samples in this dataset are gathered from Freesound [2] and are provided here as uncompressed PCM 16 bit, 44.1 kHz, mono audio files. Note that because Freesound content is collaboratively contributed, recording quality and techniques can vary widely.

    The ground truth data provided in this dataset has been obtained after a data labeling process which is described below in the Data labeling process section. FSDKaggle2018 clips are unequally distributed in the following 41 categories of the AudioSet Ontology:

    "Acoustic_guitar", "Applause", "Bark", "Bass_drum", "Burping_or_eructation", "Bus", "Cello", "Chime", "Clarinet", "Computer_keyboard", "Cough", "Cowbell", "Double_bass", "Drawer_open_or_close", "Electric_piano", "Fart", "Finger_snapping", "Fireworks", "Flute", "Glockenspiel", "Gong", "Gunshot_or_gunfire", "Harmonica", "Hi-hat", "Keys_jangling", "Knock", "Laughter", "Meow", "Microwave_oven", "Oboe", "Saxophone", "Scissors", "Shatter", "Snare_drum", "Squeak", "Tambourine", "Tearing", "Telephone", "Trumpet", "Violin_or_fiddle", "Writing".

    Some other relevant characteristics of FSDKaggle2018:

    The dataset is split into a train set and a test set.

    The train set is meant to be for system development and includes ~9.5k samples unequally distributed among 41 categories. The minimum number of audio samples per category in the train set is 94, and the maximum 300. The duration of the audio samples ranges from 300ms to 30s due to the diversity of the sound categories and the preferences of Freesound users when recording sounds. The total duration of the train set is roughly 18h.

    Out of the ~9.5k samples from the train set, ~3.7k have manually-verified ground truth annotations and ~5.8k have non-verified annotations. The non-verified annotations of the train set have a quality estimate of at least 65-70% in each category. Check out the Data labeling process section below for more information about this aspect.

    Non-verified annotations in the train set are properly flagged in train.csv so that participants can opt to use this information during the development of their systems.

    The test set is composed of 1.6k samples with manually-verified annotations and with a category distribution similar to that of the train set. The total duration of the test set is roughly 2h.

    All audio samples in this dataset have a single label (i.e. are only annotated with one label). Check out the Data labeling process section below for more information about this aspect. A single label should be predicted for each file in the test set.

    Data labeling process

    The data labeling process started from a manual mapping between Freesound tags and AudioSet Ontology categories (or labels), which was carried out by researchers at the Music Technology Group, Universitat Pompeu Fabra, Barcelona. Using this mapping, a number of Freesound audio samples were automatically annotated with labels from the AudioSet Ontology. These annotations can be understood as weak labels since they express the presence of a sound category in an audio sample.

    Then, a data validation process was carried out in which a number of participants listened to the annotated sounds and manually assessed the presence/absence of an automatically assigned sound category, according to the AudioSet category description.

    Audio samples in FSDKaggle2018 are only annotated with a single ground truth label (see train.csv). A total of 3,710 annotations included in the train set of FSDKaggle2018 are annotations that have been manually validated as present and predominant (some with inter-annotator agreement but not all of them). This means that in most cases there is no additional acoustic material other than the labeled category. In a few cases there may be some additional sound events, but these additional events won't belong to any of the 41 categories of FSDKaggle2018.

    The rest of the annotations have not been manually validated and therefore some of them could be inaccurate. Nonetheless, we have estimated that at least 65-70% of the non-verified annotations per category in the train set are indeed correct. It can happen that some of these non-verified audio samples present several sound sources even though only one label is provided as ground truth. These additional sources are typically outside the set of the 41 categories, but in a few cases they could fall within it.

    More details about the data labeling process can be found in [3].

    License

    FSDKaggle2018 has licenses at two different levels, as explained next.

    All sounds in Freesound are released under Creative Commons (CC) licenses, and each audio clip has its own license as defined by the audio clip uploader in Freesound. For attribution purposes and to facilitate attribution of these files to third parties, we include a mapping from the audio clips included in FSDKaggle2018 to their corresponding licenses. The licenses are specified in the files train_post_competition.csv and test_post_competition_scoring_clips.csv.

    In addition, FSDKaggle2018 as a whole is the result of a curation process and it has an additional license. FSDKaggle2018 is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the FSDKaggle2018.doc zip file.

    Files

    FSDKaggle2018 can be downloaded as a series of zip files with the following directory structure:

    root
    │
    └───FSDKaggle2018.audio_train/                    Audio clips in the train set
    │
    └───FSDKaggle2018.audio_test/                     Audio clips in the test set
    │
    └───FSDKaggle2018.meta/                           Files for evaluation setup
    │   │
    │   └───train_post_competition.csv                Data split and ground truth for the train set
    │   │
    │   └───test_post_competition_scoring_clips.csv   Ground truth for the test set
    │
    └───FSDKaggle2018.doc/
        │
        └───README.md                                 The dataset description file you are reading
        │
        └───LICENSE-DATASET                           License of FSDKaggle2018 dataset as a whole

    NOTE: the original train.csv file provided during the competition has been updated with more metadata (licenses, Freesound ids, etc.) into train_post_competition.csv. Likewise, the original test.csv that was not public during the competition is now available with ground truth and metadata as test_post_competition_scoring_clips.csv. The file name test_post_competition_scoring_clips.csv refers to the fact that only the 1600 clips used for systems' ranking are included. During the competition, an additional subset of padding clips was added in order to prevent undesired practices. This padding subset (that was never used for systems' ranking) is no longer included in the dataset (see our DCASE 2018 paper for more details.)

    Each row (i.e. audio clip) of the train_post_competition.csv file contains the following information:

    fname: the file name

    label: the audio classification label (ground truth)

    manually_verified: Boolean (1 or 0) flag to indicate whether or not that annotation has been manually verified; see description above for more info

    freesound_id: the Freesound id for the audio clip

    license: the license for the audio clip

    Each row (i.e. audio clip) of the test_post_competition_scoring_clips.csv file contains the following information:

    fname: the file name

    label: the audio classification label (ground truth)

    usage: string that indicates to which Kaggle leaderboard the clip was associated during the competition: Public or Private

    freesound_id: the Freesound id for the audio clip

    license: the license for the audio clip
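
    The column descriptions above translate directly into a small pandas sketch, e.g. to keep only the ~3.7k manually-verified training annotations (a sketch assuming the directory layout shown above):

    import pandas as pd

    train = pd.read_csv('FSDKaggle2018.meta/train_post_competition.csv')

    # Keep only the manually-verified annotations described above.
    verified = train[train['manually_verified'] == 1]
    print(verified['label'].value_counts())

    test = pd.read_csv('FSDKaggle2018.meta/test_post_competition_scoring_clips.csv')
    # 'usage' is Public or Private, matching the Kaggle leaderboards.
    print(test.groupby('usage')['fname'].count())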

    Baseline System

    A CNN baseline system for FSDKaggle2018 is available at

  5. Spam Images for Malicious Annotation Set (SIMAS)

    • zenodo.org
    application/gzip, bin +1
    Updated May 23, 2025
    Cite
    Maria Vukić; Emanuel Lacić; Denis Helic (2025). Spam Images for Malicious Annotation Set (SIMAS) [Dataset]. http://doi.org/10.5281/zenodo.15423637
    Available download formats: png, bin, application/gzip
    Dataset updated
    May 23, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Maria Vukić; Emanuel Lacić; Denis Helic
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SIMAS Dataset

    This archive includes the SIMAS dataset for fine-tuning models for MMS (Multimedia Messaging Service) image moderation. SIMAS is a balanced collection of publicly available images, manually annotated in accordance with a specialized taxonomy designed for identifying visual spam in MMS messages.

    Taxonomy for MMS Visual Spam

    The following table presents the definitions of categories used for classifying MMS images.

    Table 1: Category definitions

    Category    Description
    Alcohol*    Content related to alcoholic beverages, including advertisements and consumption.
    Drugs*      Content related to the use, sale, or trafficking of narcotics (e.g., cannabis, cocaine, …).
    Firearms*   Content involving guns, pistols, knives, or military weapons.
    Gambling*   Content related to gambling (casinos, poker, roulette, lotteries).
    Sexual      Content involving nudity, sexual acts, or sexually suggestive material.
    Tobacco*    Content related to tobacco use and advertisements.
    Violence    Content showing violent acts, self-harm, or injury.
    Safe        All other content, including neutral depictions, products, or harmless cultural symbols.

    Note: Categories marked with an asterisk are regulated in some jurisdictions and may not be universally restricted.

    Dataset Collection and Annotation

    Data Sources

    The SIMAS dataset combines publicly available images from multiple sources, selected to reflect the categories defined in our content taxonomy. Each image was manually reviewed by three independent annotators, with final labels assigned when at least two annotators agreed.

    The largest portion of the dataset (30.4%) originates from LAION-400M, a large-scale image-text dataset. To identify relevant content, we first selected a list of ImageNet labels that semantically matched our taxonomy. These labels were generated using GPT-4o in a zero-shot setting, using separate prompts per category. This resulted in 194 candidate labels, of which 88.7% were retained after manual review. The structure of the prompts used in this process is shown in the file gpt4o_imagenet_prompting_scheme.png, which illustrates a shared base prompt template applied across all categories. The fields category_definition, file_examples, and exceptions are specified per category. Definitions align with the taxonomy, while the file_examples column includes sample labels retrieved from the ImageNet label list. The exceptions field contains category-specific filtering instructions; a dash indicates no exceptions were specified.

    Another 25.1% of images were sourced from open datasets hosted on Roboflow.

    The NudeNet dataset contributes 11.4% of the dataset. We sampled 1,000 images from the “porn” category to provide visual coverage of explicit sexual content.

    Another 11.0% of images were collected from datasets hosted on Kaggle.

    An additional 9.9% of images were retrieved from Unsplash, using keyword-based search queries aligned with each category in our taxonomy.

    Images from UnsafeBench make up 8.0% of the dataset. Since its original binary labels did not match our taxonomy, all samples were manually reassigned to the most appropriate category.

    Finally, 4.2% of images were gathered from various publicly accessible websites. These were primarily used to improve category balance and model generalization, especially in safe classes.

    All images collected from the listed sources have been manually reviewed by three independent annotators. Each image is then assigned to a category when at least two annotators reach consensus.

    Table 2: Distribution of images per public source and category in SIMAS dataset

    Type    Category    LAION  Roboflow  NudeNet  Kaggle  Unsplash  UnsafeBench  Other  Total
    Unsafe  Alcohol     2903267010300
    Unsafe  Drugs       17211001381250
    Unsafe  Firearms    05902290620350
    Unsafe  Gambling    1323800733918300
    Unsafe  Sexual      2042103686500
    Unsafe  Tobacco     04460043110500
    Unsafe  Violence    0289000110300
    Safe    Alcohol     1403500161396300
    Safe    Drugs       6749015721730250
    Safe    Firearms    173150314487350
    Safe    Gambling    164201121120300
    Safe    Sexual      2352213920948500
    Safe    Tobacco     3516751381640500
    Safe    Violence    212203210422300
    All     All         1,522  1,253  571  551  493  402  208  5,000

    Balancing

    To ensure semantic diversity and dataset balance, undersampling was performed on overrepresented categories using a CLIP-based embedding and k-means clustering strategy. This resulted in a final dataset containing 2,500 spam and 2,500 safe images, evenly distributed across all categories.
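
    The undersampling strategy described above can be sketched roughly as follows. This is not the authors' code: the CLIP checkpoint, folder layout, number of clusters and per-cluster sampling rule are all illustrative assumptions.

    from pathlib import Path

    import numpy as np
    from PIL import Image
    from sklearn.cluster import KMeans
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def embed(paths):
        # Embed images with CLIP (batching would be needed for large folders).
        images = [Image.open(p).convert("RGB") for p in paths]
        inputs = processor(images=images, return_tensors="pt")
        return model.get_image_features(**inputs).detach().numpy()

    # Hypothetical folder holding one over-represented category.
    paths = sorted(Path("simas/unsafe/sexual").glob("*.jpg"))
    target = 500                       # desired per-category size (illustrative)
    features = embed(paths)

    # Cluster the embeddings, then draw an equal share from each cluster so the
    # retained subset stays semantically diverse.
    k = 10
    cluster_ids = KMeans(n_clusters=k, random_state=0, n_init="auto").fit_predict(features)
    rng = np.random.default_rng(0)
    keep = []
    for c in range(k):
        members = np.where(cluster_ids == c)[0]
        keep.extend(rng.choice(members, size=min(len(members), target // k), replace=False))
    subset = [paths[i] for i in keep]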

    Table 3: Distribution of images per category in SIMAS

  6. Medium articles dataset

    • crawlfeeds.com
    • kaggle.com
    json, zip
    Updated Aug 26, 2025
    Cite
    Crawl Feeds (2025). Medium articles dataset [Dataset]. https://crawlfeeds.com/datasets/medium-articles-dataset
    Available download formats: json, zip
    Dataset updated
    Aug 26, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Description

    Buy Medium Articles Dataset – 500K+ Published Articles in JSON Format

    Get access to a premium Medium articles dataset containing 500,000+ curated articles with metadata including author profiles, publication dates, reading time, tags, claps, and more. Ideal for natural language processing (NLP), machine learning, content trend analysis, and AI model training.

    Request the large dataset here: Medium datasets

    Check out the sample dataset in CSV.

    Use Cases:

    • Training language models (LLMs)

    • Analyzing content trends and engagement

    • Sentiment and text classification

    • SEO research and author profiling

    • Academic or commercial research

    Why Choose This Dataset?

    • High-volume, cleanly structured JSON

    • Ideal for developers, researchers, and data scientists

    • Easy integration with Python, R, SQL, and other data pipelines

    • Affordable and ready-to-use

  7. bioimage.io upload: hpa/hpa-kaggle-2021-dataset

    • zenodo.org
    bin, png
    Updated Aug 5, 2024
    Cite
    The Human Protein Atlas (2024). bioimage.io upload: hpa/hpa-kaggle-2021-dataset [Dataset]. http://doi.org/10.5281/zenodo.13219996
    Available download formats: bin, png
    Dataset updated
    Aug 5, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    The Human Protein Atlas
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    HPA Single Cell Classification Dataset 2021 (view on bioimage.io)

    Training dataset for the Human Protein Atlas - Single Cell Classification competition 2021

    More information: https://www.kaggle.com/competitions/hpa-single-cell-image-classification/data

    What files do I need?

    On the data page below, you will find a set of full size original images (a mix of 1728x1728, 2048x2048 and 3072x3072 PNG files) in train.zip and test.zip. (Please note that since this is a code competition, part of the test data will be hidden.)

    You will also need the image level labels from train.csv and the filenames for the test set from sample_submission.csv. As many Kagglers made use of all public images in HPA for the previous classification challenge, we made the public HPA images available to download as instructed in this notebook. Note also that there are TFRecords available if competitors would like to use TPUs.

    The 16-bit version of the training images are available here. Additional training images are available here.

    What should I expect the data format to be?

    The training image-level labels are provided for each sample in train.csv. The bulk of the image data is in train.zip. Each sample consists of four files. Each file represents a different filter on the subcellular protein patterns represented by the sample. The format is [filename]_[filter color].png for the PNG files. Colors are red for the microtubule channel, blue for the nuclei channel, yellow for the Endoplasmic Reticulum (ER) channel, and green for the protein of interest.
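
    A minimal sketch of assembling the four filters of one sample into a single array, assuming the [filename]_[filter color].png convention above (the sample ID is a hypothetical placeholder):

    import numpy as np
    from PIL import Image

    def load_sample(image_dir, sample_id):
        # Stack the four filter images of one sample into an (H, W, 4) array.
        channels = [np.array(Image.open(f"{image_dir}/{sample_id}_{color}.png"))
                    for color in ("red", "green", "blue", "yellow")]
        return np.stack(channels, axis=-1)

    # Hypothetical sample ID taken from train.csv.
    sample = load_sample("train", "00a6f9ea-example-id")
    print(sample.shape)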

    What am I predicting?

    You are predicting protein organelle localization labels for each cell in the image. Border cells are included when there is enough information to decide on the labels.

    There are in total 19 different labels present in the dataset (18 labels for specific locations, and label 18 for negative and unspecific signal). The dataset is acquired in a highly standardized way using one imaging modality (confocal microscopy). However, the dataset comprises 17 different cell types of highly different morphology, which affect the protein patterns of the different organelles. All image samples are represented by four filters (stored as individual files), the protein of interest (green) plus three cellular landmarks: nucleus (blue), microtubules (red), endoplasmic reticulum (yellow). The green filter should hence be used to predict the label, and the other filters are used as references. The labels are represented as integers that map to the following:

    0. Nucleoplasm
    1. Nuclear membrane
    2. Nucleoli
    3. Nucleoli fibrillar center
    4. Nuclear speckles
    5. Nuclear bodies
    6. Endoplasmic reticulum
    7. Golgi apparatus
    8. Intermediate filaments
    9. Actin filaments
    10. Microtubules
    11. Mitotic spindle
    12. Centrosome
    13. Plasma membrane
    14. Mitochondria
    15. Aggresome
    16. Cytosol
    17. Vesicles and punctate cytosolic patterns
    18. Negative

    What is meant by weak image-level labels?

    The labels you will get for training are image level labels while the task is to predict cell level labels. That is to say, each training image contains a number of cells that have collectively been labeled as described above and the prediction task is to look at images of the same type and predict the labels of each individual cell within those images.

    As the training labels are a collective label for all the cells in an image, it means that each labeled pattern can be seen in the image but not necessarily that each cell within the image expresses the pattern. This imprecise labeling is what we refer to as weak.

    During the challenge you will both need to segment the cells in the images and predict the labels of those segmented cells.

    Files:
    • train - training images (in .tif)
    • test - test images (in .png) - the task of the competition is to segment and label the images in this folder
    • train.csv - filenames and image level labels for the training set
    • sample_submission.csv - filenames for the test set, and a guide to constructing a working submission.

    Columns:
    • ID - The base filename of the sample. As noted above, all samples consist of four files - blue, green, red, and yellow.
    • Label - in the training data, this represents the labels assigned to each sample; in the submission, this represents the labels assigned to each cell.

  8. plant_village

    • tensorflow.org
    • opendatalab.com
    • +1more
    Updated Jun 1, 2024
    Cite
    (2024). plant_village [Dataset]. http://identifiers.org/arxiv:1511.08060
    Dataset updated
    Jun 1, 2024
    Description

    The PlantVillage dataset consists of 54303 healthy and unhealthy leaf images divided into 38 categories by species and disease.

    NOTE: The original dataset is not available from the original source (plantvillage.org), therefore we get the unaugmented dataset from a paper that used that dataset and republished it. Moreover, we dropped images with Background_without_leaves label, because these were not present in the original dataset.

    Original paper URL: https://arxiv.org/abs/1511.08060 Dataset URL: https://data.mendeley.com/datasets/tywbtsjrjv/1

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('plant_village', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/plant_village-1.0.2.png

  9. Off-Road Terrain Dataset for Autonomous Vehicles

    • kaggle.com
    Updated Mar 3, 2021
    Cite
    Magnum Research Group (2021). Off-Road Terrain Dataset for Autonomous Vehicles [Dataset]. https://www.kaggle.com/datasets/magnumresearchgroup/offroad-terrain-dataset-for-autonomous-vehicles
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 3, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Magnum Research Group
    Description

    This dataset provides 12,730 images of off-road terrain over 44 miles to assist researchers in the space of autonomous driving in making progress for off-road environments. This dataset also includes readings from the accelerometer, gyroscope, magnetometer, GPS, and wheel rotation speed sensor. Further, we include 8 potential roughness labels derived from the vehicle's z-axis acceleration for the subset of images in the dataset which have sufficient sensor data to calculate the image labels and depict clear, visible terrain.

    Please consider citing: Gresenz, G., White, J., & Schmidt, D. C. (2021). "An Off-Road Terrain Dataset Including Images Labeled With Measures of Terrain Roughness." Proceedings of the IEEE International Conference in Autonomous Systems, 309-313.

    Papers

    This dataset is described and published in Gresenz et al. [1].

    Content

    Data was collected with a mountain bike on off-road trails during five different dates in the late summer and early fall. The bike was equipped with Garmin 830 dual GPS receivers, Garmin Virb Ultra dual high resolution Inertial Measurement Units (IMU's), a Garmin Virb Ultra 4k 30 fps camera, and a Garmin Bike Speed Sensor 2 wheel rotation speed sensor. The camera was time synchronized to both IMU's.

    Images were extracted from the videos collected by the camera at 1 second intervals. They are located in the Images folder, sorted into subfolders by the date they were collected, and labeled with their UTC timestamp in order to be used alongside the corresponding sensor data.

    Sensor data was collected in a file format called a FIT file. We converted the FIT files to CSVs using tools provided by Garmin [2, 3]. We then created distinct CSVs for each of the major sensor readings and formatted each in a state-based representation, where a single row is labeled by UTC timestamp and contains all relevant readings at that timestamp. Sensor data is located in the SensorData folder and is sorted into subfolders based on the date the data was collected.

    The Three D Sensor Adjustment Plugin [3] provided by Garmin calibrates three dimensional readings, meaning that the readings are converted to the conventionally understood units and the x, y, and z-axis readings correspond directly to these axes. Accelerometer and gyroscope readings were calibrated using this plugin. It is important to note that our data did not contain the necessary calibration information to calibrate the magnetometer readings, so these readings are uncalibrated in our dataset.

    The ImageLabels folder contains two CSVs for the subset of images which had sufficient sensor data to calculate their labels and depicted a clear, visible path.

    tsm_1_labels.csv contains the following labels:
    1. The standard deviation of a 1 second sampling of z-axis acceleration readings centered around 5 meters ahead of the image's timestamp, discretized using data visualization.
    2. The standard deviation of a 1 second sampling of z-axis acceleration readings centered around 5 meters ahead of the image's timestamp, discretized using k-means clustering with k = 2.
    3. The standard deviation of a 1 second sampling of z-axis acceleration readings centered around 5 meters ahead of the image's timestamp, discretized using k-means clustering with k = 3.
    4. The standard deviation of a 1 second sampling of z-axis acceleration readings centered around 5 meters ahead of the image's timestamp, discretized using k-means clustering with k = 4.

    tsm_2_labels.csv contains the following labels:
    5. The standard deviation of a 1 second sampling of z-axis acceleration readings directly ahead of the image's timestamp, discretized using data visualization.
    6. The standard deviation of a 1 second sampling of z-axis acceleration readings directly ahead of the image's timestamp, discretized using k-means clustering with k = 2.
    7. The standard deviation of a 1 second sampling of z-axis acceleration readings directly ahead of the image's timestamp, discretized using k-means clustering with k = 3.
    8. The standard deviation of a 1 second sampling of z-axis acceleration readings directly ahead of the image's timestamp, discretized using k-means clustering with k = 4.

    These labeling schemas, along with how effectively they could be learned, are described in depth in Gresenz et al. [1].
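
    As a rough sketch (not the authors' pipeline), a roughness label of this kind could be derived by taking the standard deviation of a 1 second window of z-axis acceleration and discretizing the resulting values with k-means; the sampling rate and variable names below are assumptions:

    import numpy as np
    from sklearn.cluster import KMeans

    def window_std(z_accel, center_idx, fs):
        # Standard deviation over a 1 second window centred at center_idx; fs is the sampling rate in Hz.
        half = fs // 2
        return float(np.std(z_accel[center_idx - half:center_idx + half]))

    def discretize(roughness_values, k):
        # Discretize one roughness value per image via k-means (k = 2, 3 or 4, as in the labels above).
        values = np.asarray(roughness_values).reshape(-1, 1)
        return KMeans(n_clusters=k, random_state=0, n_init="auto").fit_predict(values)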

    Links

    Check out our other dataset, Off-Road Terrain Attention Region Images.

    The Github repo for the papers associated with these datasets is located here.

    References

    [1] Gresenz, G., White, J., & Schmidt, D.C. (2021). "An Off-Road Terrain Dataset Including Images Labeled With Measures of Terrain Roughness." Proceedings of the IEEE International Conference in Autonomous Systems, 309-31...

  10. Lumbar Spine MRI Dataset

    • data.mendeley.com
    • opendatalab.com
    Updated Apr 3, 2019
    + more versions
    Cite
    Sud Sudirman (2019). Lumbar Spine MRI Dataset [Dataset]. http://doi.org/10.17632/k57fr854j2.2
    Dataset updated
    Apr 3, 2019
    Authors
    Sud Sudirman
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data set contains anonymised clinical MRI studies, i.e., sets of scans, of 515 patients with symptomatic back pain. Each patient can have one or more MRI studies associated with them. Each study contains slices, i.e., individual images taken from either the sagittal or axial view, of the lowest three vertebrae and the lowest three IVDs. The axial view slices are mainly taken from the last three IVDs, including the one between the last vertebra and the sacrum. The orientation of the slices of the last IVD is made to follow the spine curve, whereas those of the other IVDs are usually made in blocks, i.e., parallel to each other. There are between four and five slices per IVD, and they run from the top of the IVD towards its bottom. Many of the top and bottom slices cut through the vertebrae, leaving between one and three slices that cut the IVD cleanly and show purely the image of that IVD. In most cases, the total number of slices in the axial view ranges from 12 to 15. However, in some cases there may be up to 20 slices because the study contains slices of more than the last three vertebrae. The scans in the sagittal view also vary, but all contain at least the last seven vertebrae and the sacrum. While the number of vertebrae varies, each scan always includes the first two sacral links.

    There are a total of 48,345 MRI slices in our dataset. The majority of the slices have an image resolution of 320x320 pixels; however, there are slices from three studies with a 320x310 pixel resolution. The pixels in all slices have 12-bit per pixel precision, which is higher than the standard 8-bit greyscale images. Specifically, for all axial-view slices, the slice thickness is uniformly 4 mm with a centre-to-centre distance between adjacent slices of 4.4 mm. The horizontal and vertical pixel spacing is 0.6875 mm uniformly across all axial-view slices.
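
    Because the pixels carry 12-bit precision, slices typically need rescaling before display in ordinary 8-bit viewers. A minimal sketch, assuming a slice has already been loaded as a NumPy array (e.g. with pydicom, if the studies are stored as DICOM files):

    import numpy as np

    def to_uint8(slice_12bit):
        # 12-bit precision means values in [0, 4095]; rescale to [0, 255] for display.
        return (slice_12bit.astype(np.float32) * (255.0 / 4095.0)).astype(np.uint8)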

    The majority of the MRI studies were taken with the patient in the Head-First-Supine position, with the rest taken in the Feet-First-Supine position. Each study can last between 15 and 45 minutes, and a patient may have one or more studies associated with them, taken at different times or a few days apart.

    You can download and read the research papers detailing our methodology on boundary delineation for lumbar spinal stenosis detection using the URLs provided in the Related Links at the end of this page. You can also check out other datasets and source code related to this program from that section.

    We kindly request you to cite our papers when using our data or program in your research.

  11. Gomrade dataset: go (baduk) images with annots

    • kaggle.com
    Updated Jun 1, 2021
    Cite
    DavidS (2021). Gomrade dataset: go (baduk) images with annots [Dataset]. https://www.kaggle.com/davids1992/gomrade-dataset-go-baduk-images-with-labels/tasks
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 1, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    DavidS
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    About the dataset

    This dataset contains images of a Go (baduk, igo) board with stones, together with the corresponding annotations of stone positions and board coordinates, one per game.

    The games are generally not continuous and have missing moves.

    The dataset contains 62 games in terms of board positions, and 2000 (todo) images with 300 000 (todo) intersections to recognize.

    The dataset is split into two folders for historical reasons. This might make a good split for the train and test parts.

    This is the dataset I created for developing my Gomrade project. It allows you to play against AI on a real board. You can find more here: GitHub

    Other sources

    In this dataset, there are recordings of my own games, some images scraped from the internet, and two other sources:
    • VideoKifu - https://www.oipaz.net/VideoKifu.html - even though it's a really great project overall, the validation of single-image recognition was not performed. VideoKifu focuses on video tracking and the course of the game, so they did not find reference labels necessary for their application. I added reference annotation because I believe it's necessary for reliable systems.
    • Imago - http://tomasm.cz/imago - a great project with a few annotated images

    All the labels in this dataset except those from Imago were created by me, and I own the rights to them. They were created in a semi-supervised manner.

    About a license

    The license is one of the most restrictive. If you have any ideas on how to develop this dataset or a project based on it, contact me and we can figure out the future of this data. For now, I don't have any other ideas so I restrict it.

    Other

    I also have access to Igoki project data but at the moment I don't know if I can share it. I will be happy to get in touch and share my thoughts.

  12. cats_vs_dogs

    • tensorflow.org
    • universe.roboflow.com
    • +1more
    Updated Dec 19, 2023
    Cite
    (2023). cats_vs_dogs [Dataset]. https://www.tensorflow.org/datasets/catalog/cats_vs_dogs
    Dataset updated
    Dec 19, 2023
    Description

    A large set of images of cats and dogs. There are 1738 corrupted images that are dropped.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('cats_vs_dogs', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/cats_vs_dogs-4.0.1.png

  13. Product Exchange/Bartering Data

    • cseweb.ucsd.edu
    json
    Cite
    UCSD CSE Research Project, Product Exchange/Bartering Data [Dataset]. https://cseweb.ucsd.edu/~jmcauley/datasets.html
    Available download formats: json
    Dataset authored and provided by
    UCSD CSE Research Project
    Description

    These datasets contain peer-to-peer trades from various recommendation platforms.

    Metadata includes

    • peer-to-peer trades

    • have and want lists

    • image data (tradesy)

  14. Amazon Question and Answer Data

    • cseweb.ucsd.edu
    json
    Cite
    UCSD CSE Research Project, Amazon Question and Answer Data [Dataset]. https://cseweb.ucsd.edu/~jmcauley/datasets.html
    Available download formats: json
    Dataset authored and provided by
    UCSD CSE Research Project
    Description

    These datasets contain 1.48 million question and answer pairs about products from Amazon.

    Metadata includes

    • question and answer text

    • is the question binary (yes/no), and if so does it have a yes/no answer?

    • timestamps

    • product ID (to reference the review dataset)

    Basic Statistics:

    • Questions: 1.48 million

    • Answers: 4,019,744

    • Labeled yes/no questions: 309,419

    • Number of unique products with questions: 191,185
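
    A minimal loading sketch, assuming these files follow the one-Python-dict-per-line, gzip-compressed convention used by related datasets on the same page (the file name and the question/answer keys are assumptions to check against the download):

    import ast
    import gzip

    def parse(path):
        with gzip.open(path, 'rt') as f:
            for line in f:
                yield ast.literal_eval(line)

    for qa in parse('qa_Electronics.json.gz'):   # hypothetical file name
        print(qa.get('question'), '->', qa.get('answer'))
        break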

  15. Monkeypox Skin Lesion Dataset

    • kaggle.com
    Updated Jul 5, 2022
    Cite
    TensorKitty (2022). Monkeypox Skin Lesion Dataset [Dataset]. https://www.kaggle.com/datasets/nafin59/monkeypox-skin-lesion-dataset
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 5, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    TensorKitty
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    An updated version of the MSLD dataset, MSLD v2.0 has been released after being verified by an expert dermatologist!

    For details, check our GitHub repo!

    Context

    The recent monkeypox outbreak has become a global healthcare concern owing to its rapid spread in more than 65 countries around the globe. To obstruct its expeditious pace, early diagnosis is a must. But the confirmatory Polymerase Chain Reaction (PCR) tests and other biochemical assays are not readily available in sufficient quantities. In this scenario, computer-aided monkeypox identification from skin lesion images can be a beneficial measure. Nevertheless, so far, such datasets are not available. Hence, the "Monkeypox Skin Lesion Dataset (MSLD)" is created by collecting and processing images from different means of web-scraping, i.e., from news portals, websites and publicly accessible case reports.

    The creation of the "Monkeypox Skin Lesion Dataset" is primarily focused on distinguishing monkeypox cases from similar non-monkeypox cases. Therefore, along with the 'Monkeypox' class, we included skin lesion images of 'Chickenpox' and 'Measles', because of their resemblance to the monkeypox rash and pustules in their initial state, in another class named 'Others' to enable binary classification.

    Content

    There are 3 folders in the dataset.

    1) Original Images: It contains a total of 228 images, among which 102 belong to the 'Monkeypox' class and the remaining 126 represent the 'Others' class i.e., non-monkeypox (chickenpox and measles) cases.

    2) Augmented Images: To aid the classification task, several data augmentation methods such as rotation, translation, reflection, shear, hue, saturation, contrast and brightness jitter, noise, scaling etc. have been applied using MATLAB R2020a. Although this can be readily done using ImageGenerator/other image augmentors, to ensure reproducibility of the results, the augmented images are provided in this folder. Post-augmentation, the number of images increased by approximately 14-fold. The classes 'Monkeypox' and 'Others' have 1428 and 1764 images, respectively.

    3) Fold1: One of the three-fold cross validation datasets. To avoid any sort of bias in training, three-fold cross validation was performed. The original images were split into training, validation and test set(s) with the approximate proportion of 70 : 10 : 20 while maintaining patient independence. In keeping with common data preparation practice, only the training and validation images were augmented while the test set contained only the original images. Users have the option of using the folds directly or using the original data and employing other algorithms to augment it.

    Additionally, a CSV file is provided that has 228 rows and two columns. The table contains the list of all the ImageID(s) with their corresponding label.
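
    A minimal sketch of turning that CSV into a binary target with pandas; the file name and column names are assumptions to check against the download:

    import pandas as pd

    df = pd.read_csv('monkeypox_image_labels.csv')          # hypothetical file name
    df['target'] = (df['Label'] == 'Monkeypox').astype(int)
    print(df['target'].value_counts())                      # expected: 102 vs 126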

    Web Application

    Since monkeypox is demonstrating a very rapid community transmission pattern, consumer-level software is truly necessary to increase awareness and encourage people to take rapid action. We have developed an easy-to-use web application named Monkey Pox Detector using the open-source Python Streamlit framework that uses our trained model to address this issue. It makes predictions on whether or not to see a specialist, along with the prediction accuracy. Future updates will benefit from the user data we continue to collect and use to improve our model. The web app has a Flask core, so that it can be deployed cross-platform in the future.

    Learn more at our GitHub repo!

    Citation

    If this dataset helped your research, please cite the following articles:

    Ali, S. N., Ahmed, M. T., Paul, J., Jahan, T., Sani, S. M. Sakeef, Noor, N., & Hasan, T. (2022). Monkeypox Skin Lesion Detection Using Deep Learning Models: A Preliminary Feasibility Study. arXiv preprint arXiv:2207.03342.

    @article{Nafisa2022, title={Monkeypox Skin Lesion Detection Using Deep Learning Models: A Preliminary Feasibility Study}, author={Ali, Shams Nafisa and Ahmed, Md. Tazuddin and Paul, Joydip and Jahan, Tasnim and Sani, S. M. Sakeef and Noor, Nawshaba and Hasan, Taufiq}, journal={arXiv preprint arXiv:2207.03342}, year={2022} }

    Ali, S. N., Ahmed, M. T., Jahan, T., Paul, J., Sani, S. M. Sakeef, Noor, N., Asma, A. N., & Hasan, T. (2023). A Web-based Mpox Skin Lesion Detection System Using State-of-the-art Deep Learning Models Considering Racial Diversity. arXiv preprint arXiv:2306.14169.

    @article{Nafisa2023, title={A Web-base...

  16. ‘Birds' Bones and Living Habits’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Nov 12, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘Birds' Bones and Living Habits’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-birds-bones-and-living-habits-8e70/a5c525fb/?iid=003-438&v=presentation
    Explore at:
    Dataset updated
    Nov 12, 2021
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Birds' Bones and Living Habits’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/zhangjuefei/birds-bones-and-living-habits on 12 November 2021.

    --- Dataset description provided by original source is as follows ---

    Context

    There are many kinds of birds: pigeons, ducks, ostriches, penguins... Some are good at flying, others cannot fly but run fast. Some swim under water, others wade in shallow pools.

    According to their living environments and living habits, birds are classified into different ecological groups. There are 8 ecological groups of birds:

    • Swimming Birds
    • Wading Birds
    • Terrestrial Birds
    • Raptors
    • Scansorial Birds
    • Singing Birds
    • Cursorial Birds (not included in dataset)
    • Marine Birds (not included in dataset)

    The first six groups are the main ones and are the groups covered by this dataset.

    Clearly, birds belonging to different ecological groups have different appearances: flying birds have strong wings and wading birds have long legs. Their living habits are reflected, to some extent, in the shapes of their bones. As data scientists, we may therefore examine the underlying relationship between bone sizes and ecological groups, and try to recognise a bird's ecological group from the shape of its bones.

    Content

    There are 420 birds contained in this dataset. Each bird is represented by 10 measurements (features):

    • Length and Diameter of Humerus
    • Length and Diameter of Ulna
    • Length and Diameter of Femur
    • Length and Diameter of Tibiotarsus
    • Length and Diameter of Tarsometatarsus

    Pigeon skeleton (image): https://s-media-cache-ak0.pinimg.com/564x/cb/cb/41/cbcb41cb2c1db6c32c1e869be2172c3b.jpg

    All measurements are continuous floating-point numbers (mm), with missing values represented by empty strings. The skeletons in this dataset come from the collections of the Natural History Museum of Los Angeles County. They belong to 21 orders, 153 genera and 245 species.

    Each bird has a label for its ecological group:

    • SW: Swimming Birds
    • W: Wading Birds
    • T: Terrestrial Birds
    • R: Raptors
    • P: Scansorial Birds
    • SO: Singing Birds

    Acknowledgements

    This dataset is provided by Dr. D. Liu of Beijing Museum of Natural History.

    Inspiration

    This is a 420×10 dataset of continuous values with unbalanced multi-class labels. Possible tasks include (a minimal loading-and-classification sketch follows the list):

    • Data Visualisation
    • Statistical Analysis
    • Supervised Classification
    • Unsupervised Clustering
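    A minimal loading-and-classification sketch, assuming the Kaggle CSV is named bird.csv and contains the ten measurement columns plus an id and a type (label) column:

    ```python
    # Exploratory sketch; column and file names are assumptions based on the description above.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    df = pd.read_csv("bird.csv", na_values=[""]).dropna()     # missing values are empty strings

    X = df.drop(columns=["id", "type"], errors="ignore")      # the 10 bone measurements (mm)
    y = df["type"]                                            # SW, W, T, R, P, SO

    # class_weight="balanced" and stratified CV help with the unbalanced classes
    clf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
    print(cross_val_score(clf, X, y, cv=5).mean())
    ```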

    License

    Please do not publish or cite this dataset in research papers or other public publications.

    --- Original source retains full ownership of the source dataset ---

  17. DailyDialog (Multi-turn Dialog)

    • kaggle.com
    Updated Nov 29, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    The Devastator (2022). DailyDialog (Multi-turn Dialog) [Dataset]. https://www.kaggle.com/datasets/thedevastator/dailydialog-unlock-the-conversation-potential-in
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 29, 2022
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    DailyDialog (Multi-turn Dialog)

    Dialogues that reflect the way we communicate in daily life and cover various topics

    By Huggingface Hub [source]

    About this dataset

    Welcome to the DailyDialog dataset, your gateway to unlocking conversation potential through multi-turn dialog experiences! Our dataset consists of conversations written by humans, which serve as a more accurate reflection of our day-to-day conversations than other datasets. Additionally, we have included manually labeled communication intentions and emotion fields in our data that can be used for advancing dialog systems.

    Whether you’re a researcher looking for new approaches in dialog systems or someone simply curious about conversation dynamics from the perspective of computer science – this dataset is here to help! We invite you to explore and make use of this data for its full potential and advance the research field further.

    Our three main files (train.csv, validation.csv, test.csv) each provide key columns such as dialog, act, and emotion, enabling you to get an even deeper understanding of how effective conversations really work -- so what are you waiting for? Unlock your conversation potential today with DailyDialog!

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    Welcome and thank you for your interest in the DailyDialog dataset! This dataset is designed to unlock conversation potential through multi-turn dialog experiences and provide a better understanding of conversations in our day-to-day lives. Whether you are a student, researcher, or just plain curious, this guide is here to help you get started with using the DailyDialog dataset for your own research or exploration.

    The DailyDialog dataset includes three files: train.csv, validation.csv, and test.csv which all contain dialog, act and emotion fields that can be used by those who wish to evaluate existing approaches in the field of dialogue systems or perform new experiments on conversational models. All data found in this dataset is written by humans and thus contains less noise than other datasets typically seen online.

    The first step when using this data set would be to familiarize yourself with the different fields found within each file:

    • Dialog – The dialog field contains the conversation between two people (String).
    • Act – The act field contains the communication intentions of both parties involved in the dialogue (String).
    • Emotion – The emotion field labels any emotions expressed during a particular dialogue (String).

    Once you understand what each of these three fields means, it is time to start exploring! You can use any programming language or software, along with statistical methods, text-analysis tools such as RapidMiner, or Natural Language Processing libraries such as NLTK or spaCy, to explore these fields individually or together in more depth. If you are interested specifically in machine learning tasks, there are also possibilities such as generating new conversations from the data set (e.g., chatbots) using reinforcement learning or deep learning architectures / neural networks for natural language understanding.
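    A quick-look sketch in pandas, assuming train.csv is in the working directory and uses the dialog / act / emotion column names described above (some exports store these fields as list-like strings, hence the optional parsing step):

    ```python
    # Quick-look sketch; the file location and exact column names are assumptions.
    import ast
    import pandas as pd

    train = pd.read_csv("train.csv")
    print(train.columns.tolist())          # expect columns like ['dialog', 'act', 'emotion']
    print(train.shape)

    def maybe_parse(value):
        """Parse a list-like string such as "['hi', 'hello']"; otherwise return it unchanged."""
        try:
            return ast.literal_eval(value)
        except (ValueError, SyntaxError):
            return value

    train["dialog_parsed"] = train["dialog"].apply(maybe_parse)
    print(train["dialog_parsed"].iloc[0])
    ```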

    All said and done, we believe that the ability to unlock the patterns embedded in real-life conversations will help researchers across various domains and research areas (e.g., AI / ML) succeed in their efforts and have an exciting journey :)

    Research Ideas

    • Developing a conversational AI system that can replicate authentic conversations by modeling the emotion and communication intentions present in the DailyDialog dataset.
    • Creating a language-learning tool which can customize personalized dialogues based on the DailyDialog data to help foreign language learners get used to spoken dialogue.
    • Utilizing the DailyDialog data to develop an interactive chatbot with customized responses and emotions, allowing users to learn more about their conversational skills through simulated conversations

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. [See Other Information](https://creativecommons...

  18. Customer IT Support - Ticket Dataset

    • kaggle.com
    Updated Jun 16, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Tobias Bueck (2025). Customer IT Support - Ticket Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/12183005
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 16, 2025
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Tobias Bueck
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Featuring Labeled Customer Emails and Support Responses

    Network Diagram Tags

    Network diagram (image): https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3023333%2F9f9df25b75671db2d255b2d284c2c80c%2Fnetwork_diagram.svg?generation=1739380045025331&alt=media

    Discover the new, expanded version of this dataset with 20,000 ticket entries! Perfect for training models to classify and prioritize support tickets.

    Definitely check out my other dataset:
    Tickets from Github Issues

    It includes priorities, queues, types, tags, and business types. This preview offers a detailed structure with classifications by department, type, priority, language, subject, full email text, and agent answers.

    Features / Attributes

    • 🔀 Queue – Specifies the department to which the email ticket is routed (e.g. Technical Support, Customer Service, Billing and Payments, ...)
    • 🚦 Priority – Indicates the urgency and importance of the issue (🟢 Low, 🟠 Medium, 🔴 Critical)
    • 🗣️ Language – Indicates the language in which the email is written (EN, DE, ES, FR, PT)
    • Subject – Subject of the customer's email
    • Body – Body of the customer's email
    • Answer – The response provided by the helpdesk agent
    • Type – The type of ticket as picked by the agent (e.g. Incident, Request, Problem, Change, ...)
    • 🏢 Business Type – The business type of the support helpdesk (e.g. Tech Online Store, IT Services, Software Development Company)
    • Tags – Tags/categories assigned to the ticket, split into ten columns in the dataset (e.g. "Software Bug", "Warranty Claim")

    Queue

    Specifies the department to which the email ticket is categorized. This helps in routing the ticket to the appropriate support team for resolution.

    • 💻 Technical Support: Technical issues and support requests.
    • 🈂️ Customer Service: Customer inquiries and service requests.
    • 💰 Billing and Payments: Billing issues and payment processing.
    • 🖥️ Product Support: Support for product-related issues.
    • 🌐 IT Support: Internal IT support and infrastructure issues.
    • 🔄 Returns and Exchanges: Product returns and exchanges.
    • 📞 Sales and Pre-Sales: Sales inquiries and pre-sales questions.
    • 🧑‍💻 Human Resources: Employee inquiries and HR-related issues.
    • ❌ Service Outages and Maintenance: Service interruptions and maintenance.
    • 📮 General Inquiry: General inquiries and information requests.

    Priority

    Indicates the urgency and importance of the issue. Helps in managing the workflow by prioritizing tickets that need immediate attention.

    • 🟢 1 (Low): Non-urgent issues that do not require immediate attention. Examples: general inquiries, minor inconveniences, routine updates, and feature requests.
    • 🟠 2 (Medium): Moderately urgent issues that need timely resolution but are not critical. Examples: performance issues, intermittent errors, and detailed user questions.
    • 🔴 3 (Critical): Urgent issues that require immediate attention and quick resolution. Examples: system ...
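    As one example of the classification use case, a hedged TF-IDF baseline for routing tickets to a Queue could look like the sketch below; the CSV file name is an assumption, while Subject, Body and Queue follow the field list above.

    ```python
    # Illustrative routing baseline; the CSV file name is assumed, column names follow the table above.
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    df = pd.read_csv("customer_it_support_tickets.csv")                   # assumed file name
    text = df["Subject"].fillna("") + " " + df["Body"].fillna("")

    X_train, X_test, y_train, y_test = train_test_split(
        text, df["Queue"], test_size=0.2, stratify=df["Queue"], random_state=0)

    model = make_pipeline(TfidfVectorizer(max_features=50_000, ngram_range=(1, 2)),
                          LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    print(classification_report(y_test, model.predict(X_test)))
    ```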

  19. Sepsis Dataset –

    • kaggle.com
    Updated May 31, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Fatolu Peter (2025). Sepsis Dataset – [Dataset]. https://www.kaggle.com/datasets/olagokeblissman/sepsis-dataset
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 31, 2025
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Fatolu Peter
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    📝 Dataset Overview: This dataset focuses on early warning detection for sepsis, a critical and potentially fatal medical condition. It includes anonymized vital signs, lab results, and clinical indicators of patients admitted to the hospital, structured for real-time monitoring and predictive modeling.

    It’s ideal for clinical data analysts, healthcare data scientists, and AI practitioners aiming to develop decision support tools, early warning dashboards, or predictive health models.

    🔍 Dataset Features:

    • Patient_ID – Unique anonymized identifier
    • Admission_Date – Patient’s hospital admission date
    • Temperature_C – Body temperature in degrees Celsius
    • BP_Systolic – Systolic blood pressure (mmHg)
    • BP_Diastolic – Diastolic blood pressure (mmHg)
    • Heart_Rate – Beats per minute
    • WBC_Count – White blood cell count (x10⁹/L)
    • Lactate_mmol_L – Lactate level in mmol/L
    • Sepsis_Flag – Binary indicator (1 = Suspected Sepsis, 0 = Normal)
    • Ward – Hospital ward/unit
    • Doctor_On_Duty – Attending physician name (anonymized)

    🎯 Use Cases:

    • Build Power BI dashboards for hospital early warning systems
    • Train ML classification models to detect early signs of sepsis (a minimal sketch follows this list)
    • Create patient monitoring tools with Python or R
    • Explore the relationship between vitals & sepsis onset
    • Perform feature engineering for risk scoring systems
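    A minimal baseline sketch for the classification use case, assuming the download is a single CSV named sepsis_dataset.csv with the columns listed above:

    ```python
    # Baseline sepsis-risk sketch; the CSV file name is an assumption, columns follow the feature list.
    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("sepsis_dataset.csv")

    features = ["Temperature_C", "BP_Systolic", "BP_Diastolic",
                "Heart_Rate", "WBC_Count", "Lactate_mmol_L"]
    X, y = df[features], df["Sepsis_Flag"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    clf = GradientBoostingClassifier().fit(X_train, y_train)
    print("ROC AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
    ```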

    📌 Clinical Relevance: Sepsis is one of the leading causes of in-hospital mortality worldwide. Early detection is crucial to reducing death rates and improving outcomes. This dataset empowers developers and analysts to make a meaningful impact in the healthcare sector.

    👤 Created By: Fatolu Peter (Emperor Analytics) A passionate healthcare analyst leveraging data to drive innovation in public health across Nigeria. This is Project 12 in my data-for-good series.

    ✅ LinkedIn Post: 🚨 New Dataset: Sepsis Early Warning System Data – Now on Kaggle 📊 Clinical vital signs + lab markers + sepsis risk flags 🔗 Explore the dataset here

    This dataset enables healthcare data scientists to: ✅ Build real-time hospital dashboards ✅ Predict sepsis risk with machine learning ✅ Explore vitals like BP, lactate, WBC, and temperature ✅ Support early intervention using data insights

    Whether you're into: 🧠 Predictive modeling 📈 Power BI clinical dashboards 📉 Risk analytics in healthcare This is for you.

    Join me in using data to save lives — one insight at a time. If you build something, tag me. I’ll gladly share it! 💡

    #HealthcareAnalytics #SepsisAwareness #EarlyWarningSystems #KaggleDataset #PowerBI #DataForGood #FatoluPeter #EmperorAnalytics #PublicHealth #Project12 #RealWorldData


  20. Faces: Age Detection from Images

    • kaggle.com
    Updated Jan 26, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Möbius (2024). Faces: Age Detection from Images [Dataset]. https://www.kaggle.com/arashnic/faces-age-detection-dataset/discussion
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 26, 2024
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Möbius
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Recently, many applications, from biometrics to entertainment, use information extracted from face images about age, gender, ethnic background, and emotional state. Automatic age estimation from facial images is a popular and challenging task with many fields of application, such as controlling the content of watched media depending on the viewer's age. Facial feature analysis has therefore been a topic of interest mainly due to its applicability, and deep learning techniques are now making it possible for face analysis to be not just a dream but a reality. This simple practice dataset can get you more acquainted with applying deep learning to age detection.


    Content

    The Indian Movie Face Database (IMFDB) is a large unconstrained face database consisting of 34,512 images of 100 Indian actors collected from more than 100 videos. All the images were manually selected and cropped from the video frames, resulting in a high degree of variability in terms of scale, pose, expression, illumination, age, resolution, occlusion, and makeup. IMFDB is the first face database that provides a detailed annotation of every image in terms of age, pose, gender, expression and type of occlusion, which may help other face-related applications.

    The dataset provides a total of 19,906 images. The attributes of the data are as follows:

    • ID – Unique ID of image
    • Class – Age bin of person in image


    Image reference: Automatic age estimation based on CNN – https://ars.els-cdn.com/content/image/1-s2.0-S0925231215017348-gr1.jpg

    Acknowledgements

    CVIT focuses on basic and advanced research in image processing, computer vision, computer graphics and machine learning. This center deals with the generation, processing, and understanding of primarily visual data as well as with the techniques and tools required doing so efficiently. The activity of this center overlaps the traditional areas of Computer Vision, Image Processing, Computer Graphics, Pattern Recognition and Machine Learning. CVIT works on both theoretical as well as practical aspects of visual information processing. Center aims to keep the right balance between the cutting edge academic research and impactful applied research.

    Inspiration

    The main task is to predict a person's age from his or her facial attributes. For simplicity, the problem has been converted to a multiclass problem with three classes: Young, Middle and Old.

    Version 2: Faces dataset for regression added

    UTKFace is a large-scale face dataset with a long age span (ranging from 0 to 116 years old). The dataset consists of over 20,000 face images with annotations of age, gender, and ethnicity. The images cover large variations in pose, facial expression, illumination, occlusion, resolution, etc. This dataset could be used for a variety of tasks, e.g., face detection, age estimation, age progression/regression, and landmark localization. Sample images: https://susanqq.github.io/UTKFace/icon/samples.png

    • consists of 20k+ face images in the wild (only single face in one image)
    • provides the correspondingly aligned and cropped faces
    • provides the corresponding landmarks (68 points)
    • images are labelled by age, gender, and ethnicity

    Complete Dataset: https://susanqq.github.io/UTKFace/

    The labels of each face image are embedded in the file name, formatted as [age]_[gender]_[race]_[date&time].jpg (a short parsing sketch follows this list):

    • [age] is an integer from 0 to 116, indicating the age
    • [gender] is either 0 (male) or 1 (female)
    • [race] is an integer from 0 to 4, denoting White, Black, Asian, Indian, and Others (like Hispanic, Latino, Middle Eastern).
    • [date&time] is in the format of yyyymmddHHMMSSFFF, showing the date and time an image was collected to UTKFace
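    A short parsing sketch, assuming the images sit in a local UTKFace/ directory; file names that do not contain all four fields are skipped.

    ```python
    # Parse age/gender/race out of UTKFace file names ([age]_[gender]_[race]_[date&time].jpg).
    from pathlib import Path
    import pandas as pd

    records = []
    for path in Path("UTKFace").glob("*.jpg"):          # assumed local directory name
        parts = path.stem.split("_")
        if len(parts) >= 4:                             # skip malformed file names
            records.append({"file": path.name,
                            "age": int(parts[0]),
                            "gender": int(parts[1]),    # 0 = male, 1 = female
                            "race": int(parts[2])})     # 0-4 as described above

    faces = pd.DataFrame(records)
    print(faces.head())
    print(faces["age"].describe())
    ```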

    *If you download the data and find it useful, your upvote is explicit feedback for future work.*
