52 datasets found
  1. Data from: Open Images

    • kaggle.com
    • opendatalab.com
    zip
    Updated Feb 12, 2019
    Cite
    Google BigQuery (2019). Open Images [Dataset]. https://www.kaggle.com/datasets/bigquery/open-images
    Available download formats: zip
    Dataset updated
    Feb 12, 2019
    Dataset provided by
    BigQuery (https://cloud.google.com/bigquery)
    Authors
    Google BigQuery
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Context

    Labeled datasets are useful in machine learning research.

    Content

    This public dataset contains approximately 9 million URLs and metadata for images that have been annotated with labels spanning more than 6,000 categories.

    Tables: 1) annotations_bbox 2) dict 3) images 4) labels

    Update Frequency: Quarterly

    Querying BigQuery Tables

    Fork this kernel to get started.

    Acknowledgements

    https://bigquery.cloud.google.com/dataset/bigquery-public-data:open_images

    https://cloud.google.com/bigquery/public-data/openimages

    APA-style citation: Google Research (2016). The Open Images dataset [Image urls and labels]. Available from github: https://github.com/openimages/dataset.

    Use: The annotations are licensed by Google Inc. under CC BY 4.0 license.

    The images referenced in the dataset are listed as having a CC BY 2.0 license. Note: while we tried to identify images that are licensed under a Creative Commons Attribution license, we make no representations or warranties regarding the license status of each image and you should verify the license for each image yourself.

    Banner Photo by Mattias Diesel from Unsplash.

    Inspiration

    • Which labels are in the dataset?
    • Which labels have "bus" in their display names?
    • How many images of a trolleybus are in the dataset?
    • What are some landing pages of images with a trolleybus?
    • Which images with cherries are in the training set?
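
    As a rough illustration, the first two questions could be answered from Python with the google-cloud-bigquery client. This is a minimal sketch; the column names label_name and label_display_name in the dict table are assumptions about the schema and should be verified against the dataset before use.

    from google.cloud import bigquery

    # Requires Google Cloud credentials with BigQuery access.
    client = bigquery.Client()

    # Assumed schema: `dict` maps machine label names to human-readable display names.
    query = """
        SELECT label_name, label_display_name
        FROM `bigquery-public-data.open_images.dict`
        WHERE LOWER(label_display_name) LIKE '%bus%'
    """
    for row in client.query(query).result():
        print(row.label_name, row.label_display_name)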

  2. civil_comments

    • tensorflow.org
    • huggingface.co
    Updated Feb 28, 2023
    Cite
    (2023). civil_comments [Dataset]. https://www.tensorflow.org/datasets/catalog/civil_comments
    Dataset updated
    Feb 28, 2023
    Description

    This version of the CivilComments Dataset provides access to the primary seven labels that were annotated by crowd workers; the toxicity and other tags are values between 0 and 1 indicating the fraction of annotators who assigned these attributes to the comment text.

    The other tags are only available for a fraction of the input examples. They are currently ignored for the main dataset; the CivilCommentsIdentities set includes those labels, but only consists of the subset of the data with them. The other attributes that were part of the original CivilComments release are included only in the raw data. See the Kaggle documentation for more details about the available features.

    The comments in this dataset come from an archive of the Civil Comments platform, a commenting plugin for independent news sites. These public comments were created from 2015 - 2017 and appeared on approximately 50 English-language news sites across the world. When Civil Comments shut down in 2017, they chose to make the public comments available in a lasting open archive to enable future research. The original data, published on figshare, includes the public comment text, some associated metadata such as article IDs, publication IDs, timestamps and commenter-generated "civility" labels, but does not include user ids. Jigsaw extended this dataset by adding additional labels for toxicity, identity mentions, as well as covert offensiveness. This data set is an exact replica of the data released for the Jigsaw Unintended Bias in Toxicity Classification Kaggle challenge. This dataset is released under CC0, as is the underlying comment text.

    For comments that have a parent_id also in the civil comments data, the text of the previous comment is provided as the "parent_text" feature. Note that the splits were made without regard to this information, so using previous comments may leak some information. The annotators did not have access to the parent text when making the labels.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('civil_comments', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.
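
    Since the toxicity attributes are fractions between 0 and 1, a common (but not dataset-mandated) follow-up step is to threshold them into binary targets. A minimal sketch, assuming the TFDS examples expose the 'text' and 'toxicity' features described above:

    import tensorflow_datasets as tfds

    ds = tfds.load('civil_comments', split='train')
    for ex in ds.take(4):
        # 0.5 is a conventional threshold choice, not part of the dataset definition.
        is_toxic = ex['toxicity'] >= 0.5
        print(bool(is_toxic.numpy()), ex['text'].numpy()[:60])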

  3. FSDKaggle2019

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
    + more versions
    Cite
    Xavier Serra (2020). FSDKaggle2019 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3612636
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Daniel P. W. Ellis
    Manoj Plakal
    Eduardo Fonseca
    Xavier Serra
    Frederic Font
    Description

    FSDKaggle2019 is an audio dataset containing 29,266 audio files annotated with 80 labels of the AudioSet Ontology. FSDKaggle2019 has been used for the DCASE Challenge 2019 Task 2, which was run as a Kaggle competition titled Freesound Audio Tagging 2019.

    Citation

    If you use the FSDKaggle2019 dataset or part of it, please cite our DCASE 2019 paper:

    Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Serra. "Audio tagging with noisy labels and minimal supervision". Proceedings of the DCASE 2019 Workshop, NYC, US (2019)

    You can also consider citing our ISMIR 2017 paper, which describes how we gathered the manual annotations included in FSDKaggle2019.

    Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, "Freesound Datasets: A Platform for the Creation of Open Audio Datasets", In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017

    Data curators

    Eduardo Fonseca, Manoj Plakal, Xavier Favory, Jordi Pons

    Contact

    You are welcome to contact Eduardo Fonseca should you have any questions at eduardo.fonseca@upf.edu.

    ABOUT FSDKaggle2019

    Freesound Dataset Kaggle 2019 (or FSDKaggle2019 for short) is an audio dataset containing 29,266 audio files annotated with 80 labels of the AudioSet Ontology [1]. FSDKaggle2019 has been used for Task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2019. Please visit the DCASE2019 Challenge Task 2 website for more information. This Task was hosted on the Kaggle platform as a competition titled Freesound Audio Tagging 2019. It was organized by researchers from the Music Technology Group (MTG) of Universitat Pompeu Fabra (UPF) and from the Sound Understanding team at Google AI Perception. The competition was intended to provide insight into the development of broadly applicable sound event classifiers able to cope with label noise and minimal supervision conditions.

    FSDKaggle2019 employs audio clips from the following sources:

    Freesound Dataset (FSD): a dataset being collected at the MTG-UPF based on Freesound content organized with the AudioSet Ontology

    The soundtracks of a pool of Flickr videos taken from the Yahoo Flickr Creative Commons 100M dataset (YFCC)

    The audio data is labeled using a vocabulary of 80 labels from Google’s AudioSet Ontology [1], covering diverse topics: Guitar and other Musical Instruments, Percussion, Water, Digestive, Respiratory sounds, Human voice, Human locomotion, Hands, Human group actions, Insect, Domestic animals, Glass, Liquid, Motor vehicle (road), Mechanisms, Doors, and a variety of Domestic sounds. The full list of categories can be inspected in vocabulary.csv (see Files & Download below). The goal of the task was to build a multi-label audio tagging system that can predict appropriate label(s) for each audio clip in a test set.

    What follows is a summary of some of the most relevant characteristics of FSDKaggle2019. Nevertheless, it is highly recommended to read our DCASE 2019 paper for a more in-depth description of the dataset and how it was built.

    Ground Truth Labels

    The ground truth labels are provided at the clip-level, and express the presence of a sound category in the audio clip, hence can be considered weak labels or tags. Audio clips have variable lengths (roughly from 0.3 to 30s).

    The audio content from FSD has been manually labeled by humans following a data labeling process using the Freesound Annotator platform. Most labels have inter-annotator agreement but not all of them. More details about the data labeling process and the Freesound Annotator can be found in [2].

    The YFCC soundtracks were labeled using automated heuristics applied to the audio content and metadata of the original Flickr clips. Hence, a substantial amount of label noise can be expected. The label noise can vary widely in amount and type depending on the category, including in- and out-of-vocabulary noises. More information about some of the types of label noise that can be encountered is available in [3].

    Specifically, FSDKaggle2019 features three types of label quality, one for each set in the dataset:

    curated train set: correct (but potentially incomplete) labels

    noisy train set: noisy labels

    test set: correct and complete labels

    Further details can be found below in the sections for each set.

    Format

    All audio clips are provided as uncompressed PCM 16 bit, 44.1 kHz, mono audio files.
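
    Since the clips are plain 16-bit, 44.1 kHz, mono PCM files, they can be read with any standard audio library. A minimal sketch using the soundfile package (the file name is a hypothetical placeholder, not a real clip ID):

    import soundfile as sf

    # Hypothetical clip from the curated train set.
    audio, sample_rate = sf.read('FSDKaggle2019.audio_train_curated/example_clip.wav')
    assert sample_rate == 44100            # 44.1 kHz, as stated above
    print(audio.shape)                     # 1-D array, since the files are mono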

    DATA SPLIT

    FSDKaggle2019 consists of two train sets and one test set. The idea is to limit the supervision provided for training (i.e., the manually-labeled, hence reliable, data), thus promoting approaches to deal with label noise.

    Curated train set

    The curated train set consists of manually-labeled data from FSD.

    Number of clips/class: 75, except in a few cases (where there are fewer)

    Total number of clips: 4970

    Avg number of labels/clip: 1.2

    Total duration: 10.5 hours

    The duration of the audio clips ranges from 0.3 to 30s due to the diversity of the sound categories and the preferences of Freesound users when recording/uploading sounds. Labels are correct but potentially incomplete. It can happen that a few of these audio clips present additional acoustic material beyond the provided ground truth label(s).

    Noisy train set

    The noisy train set is a larger set of noisy web audio data from Flickr videos taken from the YFCC dataset [5].

    Number of clips/class: 300

    Total number of clips: 19,815

    Avg number of labels/clip: 1.2

    Total duration: ~80 hours

    The duration of the audio clips ranges from 1s to 15s, with the vast majority lasting 15s. Labels are automatically generated and purposefully noisy. No human validation is involved. The label noise can vary widely in amount and type depending on the category, including in- and out-of-vocabulary noises.

    Considering the numbers above, the per-class data distribution available for training is, for most of the classes, 300 clips from the noisy train set and 75 clips from the curated train set. This means 80% noisy / 20% curated at the clip level, while at the duration level the proportion is more extreme considering the variable-length clips.

    Test set

    The test set is used for system evaluation and consists of manually-labeled data from FSD.

    Number of clips/class: between 50 and 150

    Total number of clips: 4481

    Avg number of labels/clip: 1.4

    Total duration: 12.9 hours

    The acoustic material present in the test set clips is labeled exhaustively using the aforementioned vocabulary of 80 classes. Most labels have inter-annotator agreement but not all of them. Except for human error, the labels are correct and complete considering the target vocabulary; nonetheless, a few clips could still present additional (unlabeled) acoustic content from outside the vocabulary.

    During the DCASE2019 Challenge Task 2, the test set was split into two subsets, for the public and private leaderboards, and only the data corresponding to the public leaderboard was provided. In this current package you will find the full test set with all the test labels. To allow comparison with previous work, the file test_post_competition.csv includes a flag to determine the corresponding leaderboard (public or private) for each test clip (see more info in Files & Download below).
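
    A minimal sketch of parsing the per-clip ground truth with pandas, assuming the CSV layout described above with fname, labels (comma-separated tags) and usage (public/private leaderboard) columns; the exact column names and the FSDKaggle2019.meta directory name are assumptions to verify against the downloaded files:

    import pandas as pd

    test = pd.read_csv('FSDKaggle2019.meta/test_post_competition.csv')

    # Ground truth is multi-label: split the comma-separated tag string into a list.
    test['label_list'] = test['labels'].str.split(',')

    # Flag distinguishing public vs. private leaderboard clips.
    print(test['usage'].value_counts())
    print(test[['fname', 'label_list']].head())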

    Acoustic mismatch

    As mentioned before, FSDKaggle2019 uses audio clips from two sources:

    FSD: curated train set and test set, and

    YFCC: noisy train set.

    While the sources of audio (Freesound and Flickr) are collaboratively contributed and pretty diverse themselves, a certain acoustic mismatch can be expected between FSD and YFCC. We conjecture this mismatch comes from a variety of reasons. For example, through acoustic inspection of a small sample of both data sources, we find a higher percentage of high quality recordings in FSD. In addition, audio clips in Freesound are typically recorded with the purpose of capturing audio, which is not necessarily the case in YFCC.

    This mismatch can have an impact on the evaluation, considering that most of the train data come from YFCC, while all test data are drawn from FSD. This constraint (i.e., noisy training data coming from a different web audio source than the test set) is sometimes a real-world condition.

    LICENSE

    All clips in FSDKaggle2019 are released under Creative Commons (CC) licenses. For attribution purposes and to facilitate attribution of these files to third parties, we include a mapping from the audio clips to their corresponding licenses.

    Curated train set and test set. All clips in Freesound are released under different modalities of Creative Commons (CC) licenses, and each audio clip has its own license as defined by the audio clip uploader in Freesound, some of them requiring attribution to their original authors and some forbidding further commercial reuse. The licenses are specified in the files train_curated_post_competition.csv and test_post_competition.csv. These licenses can be CC0, CC-BY, CC-BY-NC and CC Sampling+.

    Noisy train set. Similarly, the licenses of the soundtracks from Flickr used in FSDKaggle2019 are specified in the file train_noisy_post_competition.csv. These licenses can be CC-BY and CC BY-SA.

    In addition, FSDKaggle2019 as a whole is the result of a curation process and it has an additional license. FSDKaggle2019 is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the FSDKaggle2019.doc zip file.

    FILES & DOWNLOAD

    FSDKaggle2019 can be downloaded as a series of zip files with the following directory structure:

    root
    │
    └───FSDKaggle2019.audio_train_curated/   Audio clips in the curated train set
    │
    └───FSDKaggle2019.audio_train_noisy/     Audio clips in the noisy train set

  4. FSDKaggle2018

    • data.niaid.nih.gov
    • opendatalab.com
    • +1more
    Updated Jan 24, 2020
    + more versions
    Cite
    Manoj Plakal (2020). FSDKaggle2018 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2552859
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Xavier Favory
    Daniel P. W. Ellis
    Manoj Plakal
    Eduardo Fonseca
    Xavier Serra
    Frederic Font
    Jordi Pons
    Description

    FSDKaggle2018 is an audio dataset containing 11,073 audio files annotated with 41 labels of the AudioSet Ontology. FSDKaggle2018 has been used for the DCASE Challenge 2018 Task 2, which was run as a Kaggle competition titled Freesound General-Purpose Audio Tagging Challenge.

    Citation

    If you use the FSDKaggle2018 dataset or part of it, please cite our DCASE 2018 paper:

    Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Favory, Jordi Pons, Xavier Serra. "General-purpose Tagging of Freesound Audio with AudioSet Labels: Task Description, Dataset, and Baseline". Proceedings of the DCASE 2018 Workshop (2018)

    You can also consider citing our ISMIR 2017 paper, which describes how we gathered the manual annotations included in FSDKaggle2018.

    Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, "Freesound Datasets: A Platform for the Creation of Open Audio Datasets", In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017

    Contact

    You are welcome to contact Eduardo Fonseca should you have any questions at eduardo.fonseca@upf.edu.

    About this dataset

    Freesound Dataset Kaggle 2018 (or FSDKaggle2018 for short) is an audio dataset containing 11,073 audio files annotated with 41 labels of the AudioSet Ontology [1]. FSDKaggle2018 has been used for the Task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2018. Please visit the DCASE2018 Challenge Task 2 website for more information. This Task was hosted on the Kaggle platform as a competition titled Freesound General-Purpose Audio Tagging Challenge. It was organized by researchers from the Music Technology Group of Universitat Pompeu Fabra, and from Google Research’s Machine Perception Team.

    The goal of this competition was to build an audio tagging system that can categorize an audio clip as belonging to one of a set of 41 diverse categories drawn from the AudioSet Ontology.

    All audio samples in this dataset are gathered from Freesound [2] and are provided here as uncompressed PCM 16 bit, 44.1 kHz, mono audio files. Note that because Freesound content is collaboratively contributed, recording quality and techniques can vary widely.

    The ground truth data provided in this dataset has been obtained after a data labeling process which is described below in the Data labeling process section. FSDKaggle2018 clips are unequally distributed in the following 41 categories of the AudioSet Ontology:

    "Acoustic_guitar", "Applause", "Bark", "Bass_drum", "Burping_or_eructation", "Bus", "Cello", "Chime", "Clarinet", "Computer_keyboard", "Cough", "Cowbell", "Double_bass", "Drawer_open_or_close", "Electric_piano", "Fart", "Finger_snapping", "Fireworks", "Flute", "Glockenspiel", "Gong", "Gunshot_or_gunfire", "Harmonica", "Hi-hat", "Keys_jangling", "Knock", "Laughter", "Meow", "Microwave_oven", "Oboe", "Saxophone", "Scissors", "Shatter", "Snare_drum", "Squeak", "Tambourine", "Tearing", "Telephone", "Trumpet", "Violin_or_fiddle", "Writing".

    Some other relevant characteristics of FSDKaggle2018:

    The dataset is split into a train set and a test set.

    The train set is meant to be for system development and includes ~9.5k samples unequally distributed among 41 categories. The minimum number of audio samples per category in the train set is 94, and the maximum 300. The duration of the audio samples ranges from 300ms to 30s due to the diversity of the sound categories and the preferences of Freesound users when recording sounds. The total duration of the train set is roughly 18h.

    Out of the ~9.5k samples from the train set, ~3.7k have manually-verified ground truth annotations and ~5.8k have non-verified annotations. The non-verified annotations of the train set have a quality estimate of at least 65-70% in each category. Check out the Data labeling process section below for more information about this aspect.

    Non-verified annotations in the train set are properly flagged in train.csv so that participants can opt to use this information during the development of their systems.

    The test set is composed of 1.6k samples with manually-verified annotations and with a category distribution similar to that of the train set. The total duration of the test set is roughly 2h.

    All audio samples in this dataset have a single label (i.e. are only annotated with one label). Check out the Data labeling process section below for more information about this aspect. A single label should be predicted for each file in the test set.

    Data labeling process

    The data labeling process started from a manual mapping between Freesound tags and AudioSet Ontology categories (or labels), which was carried out by researchers at the Music Technology Group, Universitat Pompeu Fabra, Barcelona. Using this mapping, a number of Freesound audio samples were automatically annotated with labels from the AudioSet Ontology. These annotations can be understood as weak labels since they express the presence of a sound category in an audio sample.

    Then, a data validation process was carried out in which a number of participants listened to the annotated sounds and manually assessed the presence/absence of an automatically assigned sound category, according to the AudioSet category description.

    Audio samples in FSDKaggle2018 are only annotated with a single ground truth label (see train.csv). A total of 3,710 annotations included in the train set of FSDKaggle2018 are annotations that have been manually validated as present and predominant (some with inter-annotator agreement but not all of them). This means that in most cases there is no additional acoustic material other than the labeled category. In a few cases there may be some additional sound events, but these additional events won't belong to any of the 41 categories of FSDKaggle2018.

    The rest of the annotations have not been manually validated and therefore some of them could be inaccurate. Nonetheless, we have estimated that at least 65-70% of the non-verified annotations per category in the train set are indeed correct. It can happen that some of these non-verified audio samples present several sound sources even though only one label is provided as ground truth. These additional sources are typically outside the set of the 41 categories, but in a few cases they could fall within it.

    More details about the data labeling process can be found in [3].

    License

    FSDKaggle2018 has licenses at two different levels, as explained next.

    All sounds in Freesound are released under Creative Commons (CC) licenses, and each audio clip has its own license as defined by the audio clip uploader in Freesound. For attribution purposes and to facilitate attribution of these files to third parties, we include a mapping from the audio clips included in FSDKaggle2018 to their corresponding licenses. The licenses are specified in the files train_post_competition.csv and test_post_competition_scoring_clips.csv.

    In addition, FSDKaggle2018 as a whole is the result of a curation process and it has an additional license. FSDKaggle2018 is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the FSDKaggle2018.doc zip file.

    Files

    FSDKaggle2018 can be downloaded as a series of zip files with the following directory structure:

    root
    │
    └───FSDKaggle2018.audio_train/                    Audio clips in the train set
    │
    └───FSDKaggle2018.audio_test/                     Audio clips in the test set
    │
    └───FSDKaggle2018.meta/                           Files for evaluation setup
    │   │
    │   └───train_post_competition.csv                Data split and ground truth for the train set
    │   │
    │   └───test_post_competition_scoring_clips.csv   Ground truth for the test set
    │
    └───FSDKaggle2018.doc/
        │
        └───README.md                                 The dataset description file you are reading
        │
        └───LICENSE-DATASET                           License of FSDKaggle2018 dataset as a whole

    NOTE: the original train.csv file provided during the competition has been updated with more metadata (licenses, Freesound ids, etc.) into train_post_competition.csv. Likewise, the original test.csv that was not public during the competition is now available with ground truth and metadata as test_post_competition_scoring_clips.csv. The file name test_post_competition_scoring_clips.csv refers to the fact that only the 1600 clips used for systems' ranking are included. During the competition, an additional subset of padding clips was added in order to prevent undesired practices. This padding subset (that was never used for systems' ranking) is no longer included in the dataset (see our DCASE 2018 paper for more details.)

    Each row (i.e. audio clip) of the train_post_competition.csv file contains the following information:

    fname: the file name

    label: the audio classification label (ground truth)

    manually_verified: Boolean (1 or 0) flag to indicate whether or not that annotation has been manually verified; see description above for more info

    freesound_id: the Freesound id for the audio clip

    license: the license for the audio clip

    Each row (i.e. audio clip) of the test_post_competition_scoring_clips.csv file contains the following information:

    fname: the file name

    label: the audio classification label (ground truth)

    usage: string that indicates to which Kaggle leaderboard the clip was associated during the competition: Public or Private

    freesound_id: the Freesound id for the audio clip

    license: the license for the audio clip
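
    The column descriptions above translate directly into a small pandas sketch, e.g. to keep only the ~3.7k manually-verified training annotations (a sketch assuming the directory layout shown above):

    import pandas as pd

    train = pd.read_csv('FSDKaggle2018.meta/train_post_competition.csv')

    # Keep only the manually-verified annotations described above.
    verified = train[train['manually_verified'] == 1]
    print(verified['label'].value_counts())

    test = pd.read_csv('FSDKaggle2018.meta/test_post_competition_scoring_clips.csv')
    # 'usage' is Public or Private, matching the Kaggle leaderboards.
    print(test.groupby('usage')['fname'].count())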

    Baseline System

    A CNN baseline system for FSDKaggle2018 is available at

  5. Spam Images for Malicious Annotation Set (SIMAS)

    • zenodo.org
    application/gzip, bin +1
    Updated May 23, 2025
    Cite
    Maria Vukić; Emanuel Lacić; Denis Helic (2025). Spam Images for Malicious Annotation Set (SIMAS) [Dataset]. http://doi.org/10.5281/zenodo.15423637
    Available download formats: png, bin, application/gzip
    Dataset updated
    May 23, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Maria Vukić; Emanuel Lacić; Denis Helic
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SIMAS Dataset

    This archive includes the SIMAS dataset for fine-tuning models for MMS (Multimedia Messaging Service) image moderation. SIMAS is a balanced collection of publicly available images, manually annotated in accordance with a specialized taxonomy designed for identifying visual spam in MMS messages.

    Taxonomy for MMS Visual Spam

    The following table presents the definitions of categories used for classifying MMS images.

    Table 1: Category definitions

    Category    Description
    Alcohol*    Content related to alcoholic beverages, including advertisements and consumption.
    Drugs*      Content related to the use, sale, or trafficking of narcotics (e.g., cannabis, cocaine, …).
    Firearms*   Content involving guns, pistols, knives, or military weapons.
    Gambling*   Content related to gambling (casinos, poker, roulette, lotteries).
    Sexual      Content involving nudity, sexual acts, or sexually suggestive material.
    Tobacco*    Content related to tobacco use and advertisements.
    Violence    Content showing violent acts, self-harm, or injury.
    Safe        All other content, including neutral depictions, products, or harmless cultural symbols.

    Note: Categories marked with an asterisk are regulated in some jurisdictions and may not be universally restricted.

    Dataset Collection and Annotation

    Data Sources

    The SIMAS dataset combines publicly available images from multiple sources, selected to reflect the categories defined in our content taxonomy. Each image was manually reviewed by three independent annotators, with final labels assigned when at least two annotators agreed.

    The largest portion of the dataset (30.4%) originates from LAION-400M, a large-scale image-text dataset. To identify relevant content, we first selected a list of ImageNet labels that semantically matched our taxonomy. These labels were generated using GPT-4o in a zero-shot setting, using separate prompts per category. This resulted in 194 candidate labels, of which 88.7% were retained after manual review. The structure of the prompts used in this process is shown in the file gpt4o_imagenet_prompting_scheme.png, which illustrates a shared base prompt template applied across all categories. The fields category_definition, file_examples, and exceptions are specified per category. Definitions align with the taxonomy, while the file_examples column includes sample labels retrieved from the ImageNet label list. The exceptions field contains category-specific filtering instructions; a dash indicates no exceptions were specified.

    Another 25.1% of images were sourced from open datasets hosted on Roboflow.

    The NudeNet dataset contributes 11.4% of the dataset. We sampled 1,000 images from the “porn” category to provide visual coverage of explicit sexual content.

    Another 11.0% of images were collected from datasets hosted on Kaggle.

    An additional 9.9% of images were retrieved from Unsplash, using keyword-based search queries aligned with each category in our taxonomy.

    Images from UnsafeBench make up 8.0% of the dataset. Since its original binary labels did not match our taxonomy, all samples were manually reassigned to the most appropriate category.

    Finally, 4.2% of images were gathered from various publicly accessible websites. These were primarily used to improve category balance and model generalization, especially in safe classes.

    All images collected from the listed sources have been manually reviewed by three independent annotators. Each image is then assigned to a category when at least two annotators reach consensus.

    Table 2: Distribution of images per public source and category in SIMAS dataset

    Type    Category    LAION  Roboflow  NudeNet  Kaggle  Unsplash  UnsafeBench  Other  Total
    Unsafe  Alcohol     2903267010300
    Unsafe  Drugs       17211001381250
    Unsafe  Firearms    05902290620350
    Unsafe  Gambling    1323800733918300
    Unsafe  Sexual      2042103686500
    Unsafe  Tobacco     04460043110500
    Unsafe  Violence    0289000110300
    Safe    Alcohol     1403500161396300
    Safe    Drugs       6749015721730250
    Safe    Firearms    173150314487350
    Safe    Gambling    164201121120300
    Safe    Sexual      2352213920948500
    Safe    Tobacco     3516751381640500
    Safe    Violence    212203210422300
    All     All         1,522  1,253  571  551  493  402  208  5,000

    Balancing

    To ensure semantic diversity and dataset balance, undersampling was performed on overrepresented categories using a CLIP-based embedding and k-means clustering strategy. This resulted in a final dataset containing 2,500 spam and 2,500 safe images, evenly distributed across all categories.
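
    The undersampling strategy described above can be sketched roughly as follows. This is not the authors' code: the CLIP checkpoint, folder layout, number of clusters and per-cluster sampling rule are all illustrative assumptions.

    from pathlib import Path

    import numpy as np
    from PIL import Image
    from sklearn.cluster import KMeans
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def embed(paths):
        # Embed images with CLIP (batching would be needed for large folders).
        images = [Image.open(p).convert("RGB") for p in paths]
        inputs = processor(images=images, return_tensors="pt")
        return model.get_image_features(**inputs).detach().numpy()

    # Hypothetical folder holding one over-represented category.
    paths = sorted(Path("simas/unsafe/sexual").glob("*.jpg"))
    target = 500                       # desired per-category size (illustrative)
    features = embed(paths)

    # Cluster the embeddings, then draw an equal share from each cluster so the
    # retained subset stays semantically diverse.
    k = 10
    cluster_ids = KMeans(n_clusters=k, random_state=0, n_init="auto").fit_predict(features)
    rng = np.random.default_rng(0)
    keep = []
    for c in range(k):
        members = np.where(cluster_ids == c)[0]
        keep.extend(rng.choice(members, size=min(len(members), target // k), replace=False))
    subset = [paths[i] for i in keep]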

    Table 3: Distribution of images per category in SIMAS

  6. Medium articles dataset

    • crawlfeeds.com
    • kaggle.com
    json, zip
    Updated Aug 26, 2025
    Cite
    Crawl Feeds (2025). Medium articles dataset [Dataset]. https://crawlfeeds.com/datasets/medium-articles-dataset
    Available download formats: json, zip
    Dataset updated
    Aug 26, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Description

    Buy Medium Articles Dataset – 500K+ Published Articles in JSON Format

    Get access to a premium Medium articles dataset containing 500,000+ curated articles with metadata including author profiles, publication dates, reading time, tags, claps, and more. Ideal for natural language processing (NLP), machine learning, content trend analysis, and AI model training.

    Request the large dataset here: Medium datasets

    Check out the sample dataset in CSV.

    Use Cases:

    • Training language models (LLMs)

    • Analyzing content trends and engagement

    • Sentiment and text classification

    • SEO research and author profiling

    • Academic or commercial research

    Why Choose This Dataset?

    • High-volume, cleanly structured JSON

    • Ideal for developers, researchers, and data scientists

    • Easy integration with Python, R, SQL, and other data pipelines

    • Affordable and ready-to-use

  7. bioimage.io upload: hpa/hpa-kaggle-2021-dataset

    • zenodo.org
    bin, png
    Updated Aug 5, 2024
    Cite
    The Human Protein Atlas (2024). bioimage.io upload: hpa/hpa-kaggle-2021-dataset [Dataset]. http://doi.org/10.5281/zenodo.13219996
    Available download formats: bin, png
    Dataset updated
    Aug 5, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    The Human Protein Atlas
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    HPA Single Cell Classification Dataset 2021 (view on bioimage.io)

    Training dataset for the Human Protein Atlas - Single Cell Classification competition 2021

    More information: https://www.kaggle.com/competitions/hpa-single-cell-image-classification/data

    What files do I need?

    On the data page below, you will find a set of full size original images (a mix of 1728x1728, 2048x2048 and 3072x3072 PNG files) in train.zip and test.zip. (Please note that since this is a code competition, part of the test data will be hidden.)

    You will also need the image level labels from train.csv and the filenames for the test set from sample_submission.csv. As many Kagglers made use of all public images in HPA for the previous classification challenge, we made the public HPA images available to download as instructed in this notebook. Note also that there are TFRecords available if competitors would like to use TPUs.

    The 16-bit version of the training images are available here. Additional training images are available here.

    What should I expect the data format to be?

    The training image-level labels are provided for each sample in train.csv. The bulk of the image data is in train.zip. Each sample consists of four files. Each file represents a different filter on the subcellular protein patterns represented by the sample. The format is [filename]_[filter color].png for the PNG files. Colors are red for the microtubule channel, blue for the nuclei channel, yellow for the Endoplasmic Reticulum (ER) channel, and green for the protein of interest.
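
    A minimal sketch of assembling the four filters of one sample into a single array, assuming the [filename]_[filter color].png convention above (the sample ID is a hypothetical placeholder):

    import numpy as np
    from PIL import Image

    def load_sample(image_dir, sample_id):
        # Stack the four filter images of one sample into an (H, W, 4) array.
        channels = [np.array(Image.open(f"{image_dir}/{sample_id}_{color}.png"))
                    for color in ("red", "green", "blue", "yellow")]
        return np.stack(channels, axis=-1)

    # Hypothetical sample ID taken from train.csv.
    sample = load_sample("train", "00a6f9ea-example-id")
    print(sample.shape)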

    What am I predicting?

    You are predicting protein organelle localization labels for each cell in the image. Border cells are included when there is enough information to decide on the labels.

    There are in total 19 different labels present in the dataset (18 labels for specific locations, and label 18 for negative and unspecific signal). The dataset is acquired in a highly standardized way using one imaging modality (confocal microscopy). However, the dataset comprises 17 different cell types of highly different morphology, which affect the protein patterns of the different organelles. All image samples are represented by four filters (stored as individual files), the protein of interest (green) plus three cellular landmarks: nucleus (blue), microtubules (red), endoplasmic reticulum (yellow). The green filter should hence be used to predict the label, and the other filters are used as references. The labels are represented as integers that map to the following:

    0. Nucleoplasm
    1. Nuclear membrane
    2. Nucleoli
    3. Nucleoli fibrillar center
    4. Nuclear speckles
    5. Nuclear bodies
    6. Endoplasmic reticulum
    7. Golgi apparatus
    8. Intermediate filaments
    9. Actin filaments
    10. Microtubules
    11. Mitotic spindle
    12. Centrosome
    13. Plasma membrane
    14. Mitochondria
    15. Aggresome
    16. Cytosol
    17. Vesicles and punctate cytosolic patterns
    18. Negative

    What is meant by weak image-level labels?

    The labels you will get for training are image level labels while the task is to predict cell level labels. That is to say, each training image contains a number of cells that have collectively been labeled as described above and the prediction task is to look at images of the same type and predict the labels of each individual cell within those images.

    As the training labels are a collective label for all the cells in an image, it means that each labeled pattern can be seen in the image but not necessarily that each cell within the image expresses the pattern. This imprecise labeling is what we refer to as weak.

    During the challenge you will both need to segment the cells in the images and predict the labels of those segmented cells.

    Files:
    • train - training images (in .tif)
    • test - test images (in .png) - the task of the competition is to segment and label the images in this folder
    • train.csv - filenames and image level labels for the training set
    • sample_submission.csv - filenames for the test set, and a guide to constructing a working submission.

    Columns:
    • ID - The base filename of the sample. As noted above, all samples consist of four files - blue, green, red, and yellow.
    • Label - in the training data, this represents the labels assigned to each sample; in the submission, this represents the labels assigned to each cell.

  8. plant_village

    • tensorflow.org
    • opendatalab.com
    • +1more
    Updated Jun 1, 2024
    Cite
    (2024). plant_village [Dataset]. http://identifiers.org/arxiv:1511.08060
    Dataset updated
    Jun 1, 2024
    Description

    The PlantVillage dataset consists of 54303 healthy and unhealthy leaf images divided into 38 categories by species and disease.

    NOTE: The original dataset is not available from the original source (plantvillage.org), therefore we get the unaugmented dataset from a paper that used that dataset and republished it. Moreover, we dropped images with Background_without_leaves label, because these were not present in the original dataset.

    Original paper URL: https://arxiv.org/abs/1511.08060 Dataset URL: https://data.mendeley.com/datasets/tywbtsjrjv/1

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('plant_village', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/plant_village-1.0.2.png

  9. Off-Road Terrain Dataset for Autonomous Vehicles

    • kaggle.com
    Updated Mar 3, 2021
    Cite
    Magnum Research Group (2021). Off-Road Terrain Dataset for Autonomous Vehicles [Dataset]. https://www.kaggle.com/datasets/magnumresearchgroup/offroad-terrain-dataset-for-autonomous-vehicles
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 3, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Magnum Research Group
    Description

    This dataset provides 12,730 images of off-road terrain over 44 miles to assist researchers in the space of autonomous driving in making progress for off-road environments. This dataset also includes readings from the accelerometer, gyroscope, magnetometer, GPS, and wheel rotation speed sensor. Further, we include 8 potential roughness labels derived from the vehicle's z-axis acceleration for the subset of images in the dataset which have sufficient sensor data to calculate the image labels and depict clear, visible terrain.

    Please consider citing: Gresenz, G., White, J., & Schmidt, D. C. (2021). "An Off-Road Terrain Dataset Including Images Labeled With Measures of Terrain Roughness." Proceedings of the IEEE International Conference in Autonomous Systems, 309-313.

    Papers

    This dataset is described and published in Gresenz et al. [1].

    Content

    Data was collected with a mountain bike on off-road trails during five different dates in the late summer and early fall. The bike was equipped with Garmin 830 dual GPS receivers, Garmin Virb Ultra dual high resolution Inertial Measurement Units (IMU's), a Garmin Virb Ultra 4k 30 fps camera, and a Garmin Bike Speed Sensor 2 wheel rotation speed sensor. The camera was time synchronized to both IMU's.

    Images were extracted from the videos collected by the camera at 1 second intervals. They are located in the Images folder, sorted into subfolders by the date they were collected, and labeled with their UTC timestamp in order to be used alongside the corresponding sensor data.

    Sensor data was collected in a file format called a FIT file. We converted the FIT files to CSVs using tools provided by Garmin [2, 3]. We then created distinct CSVs for each of the major sensor readings and formatted each in a state-based representation, where a single row is labeled by UTC timestamp and contains all relevant readings at that timestamp. Sensor data is located in the SensorData folder and is sorted into subfolders based on the date the data was collected.

    The Three D Sensor Adjustment Plugin [3] provided by Garmin calibrates three dimensional readings, meaning that the readings are converted to the conventionally understood units and the x, y, and z-axis readings correspond directly to these axes. Accelerometer and gyroscope readings were calibrated using this plugin. It is important to note that our data did not contain the necessary calibration information to calibrate the magnetometer readings, so these readings are uncalibrated in our dataset.

    The ImageLabels folder contains two CSVs for the subset of images which had sufficient sensor data to calculate their labels and depicted a clear, visible path.

    tsm_1_labels.csv contains the following labels:
    1. The standard deviation of a 1 second sampling of z-axis acceleration readings centered around 5 meters ahead of the image's timestamp, discretized using data visualization.
    2. The standard deviation of a 1 second sampling of z-axis acceleration readings centered around 5 meters ahead of the image's timestamp, discretized using k-means clustering with k = 2.
    3. The standard deviation of a 1 second sampling of z-axis acceleration readings centered around 5 meters ahead of the image's timestamp, discretized using k-means clustering with k = 3.
    4. The standard deviation of a 1 second sampling of z-axis acceleration readings centered around 5 meters ahead of the image's timestamp, discretized using k-means clustering with k = 4.

    tsm_2_labels.csv contains the following labels:
    5. The standard deviation of a 1 second sampling of z-axis acceleration readings directly ahead of the image's timestamp, discretized using data visualization.
    6. The standard deviation of a 1 second sampling of z-axis acceleration readings directly ahead of the image's timestamp, discretized using k-means clustering with k = 2.
    7. The standard deviation of a 1 second sampling of z-axis acceleration readings directly ahead of the image's timestamp, discretized using k-means clustering with k = 3.
    8. The standard deviation of a 1 second sampling of z-axis acceleration readings directly ahead of the image's timestamp, discretized using k-means clustering with k = 4.

    These labeling schemas, along with how effectively they could be learned, are described in depth in Gresenz et al. [1].
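
    As a rough sketch (not the authors' pipeline), a roughness label of this kind could be derived by taking the standard deviation of a 1 second window of z-axis acceleration and discretizing the resulting values with k-means; the sampling rate and variable names below are assumptions:

    import numpy as np
    from sklearn.cluster import KMeans

    def window_std(z_accel, center_idx, fs):
        # Standard deviation over a 1 second window centred at center_idx; fs is the sampling rate in Hz.
        half = fs // 2
        return float(np.std(z_accel[center_idx - half:center_idx + half]))

    def discretize(roughness_values, k):
        # Discretize one roughness value per image via k-means (k = 2, 3 or 4, as in the labels above).
        values = np.asarray(roughness_values).reshape(-1, 1)
        return KMeans(n_clusters=k, random_state=0, n_init="auto").fit_predict(values)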

    Links

    Check out our other dataset, Off-Road Terrain Attention Region Images.

    The Github repo for the papers associated with these datasets is located here.

    References

    [1] Gresenz, G., White, J., & Schmidt, D.C. (2021). "An Off-Road Terrain Dataset Including Images Labeled With Measures of Terrain Roughness." Proceedings of the IEEE International Conference in Autonomous Systems, 309-31...

  10. Lumbar Spine MRI Dataset

    • data.mendeley.com
    • opendatalab.com
    Updated Apr 3, 2019
    + more versions
    Cite
    Sud Sudirman (2019). Lumbar Spine MRI Dataset [Dataset]. http://doi.org/10.17632/k57fr854j2.2
    Dataset updated
    Apr 3, 2019
    Authors
    Sud Sudirman
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data set contains anonymised clinical MRI studies, i.e., sets of scans, of 515 patients with symptomatic back pain. Each patient can have one or more MRI studies associated with them. Each study contains slices, i.e., individual images taken from either the sagittal or axial view, of the lowest three vertebrae and the lowest three IVDs. The axial view slices are mainly taken from the last three IVDs, including the one between the last vertebra and the sacrum. The orientation of the slices of the last IVD is made to follow the spine curve, whereas those of the other IVDs are usually made in blocks, i.e., parallel to each other. There are between four and five slices per IVD, and they run from the top of the IVD towards its bottom. Many of the top and bottom slices cut through the vertebrae, leaving between one and three slices that cut the IVD cleanly and show purely the image of that IVD. In most cases, the total number of slices in the axial view ranges from 12 to 15. However, in some cases there may be up to 20 slices because the study contains slices of more than the last three vertebrae. The scans in the sagittal view also vary, but all contain at least the last seven vertebrae and the sacrum. While the number of vertebrae varies, each scan always includes the first two sacral links.

    There are a total of 48,345 MRI slices in our dataset. The majority of the slices have an image resolution of 320x320 pixels; however, there are slices from three studies with a 320x310 pixel resolution. The pixels in all slices have 12-bit per pixel precision, which is higher than the standard 8-bit greyscale images. Specifically, for all axial-view slices, the slice thickness is uniformly 4 mm with a centre-to-centre distance between adjacent slices of 4.4 mm. The horizontal and vertical pixel spacing is 0.6875 mm uniformly across all axial-view slices.
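
    Because the pixels carry 12-bit precision, slices typically need rescaling before display in ordinary 8-bit viewers. A minimal sketch, assuming a slice has already been loaded as a NumPy array (e.g. with pydicom, if the studies are stored as DICOM files):

    import numpy as np

    def to_uint8(slice_12bit):
        # 12-bit precision means values in [0, 4095]; rescale to [0, 255] for display.
        return (slice_12bit.astype(np.float32) * (255.0 / 4095.0)).astype(np.uint8)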

    The majority of the MRI studies were taken with the patient in the Head-First-Supine position, with the rest taken in the Feet-First-Supine position. Each study can last between 15 and 45 minutes, and a patient may have one or more studies associated with them, taken at different times or a few days apart.

    You can download and read the research papers detailing our methodology on boundary delineation for lumbar spinal stenosis detection using the URLs provided in the Related Links at the end of this page. You can also check out other datasets and source code related to this program from that section.

    We kindly request you to cite our papers when using our data or program in your research.

  11. Gomrade dataset: go (baduk) images with annots

    • kaggle.com
    Updated Jun 1, 2021
    Cite
    DavidS (2021). Gomrade dataset: go (baduk) images with annots [Dataset]. https://www.kaggle.com/davids1992/gomrade-dataset-go-baduk-images-with-labels/tasks
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 1, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    DavidS
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    About the dataset

    This dataset contains images of a Go (baduk, igo) board with stones, together with the corresponding annotations of stone positions and board coordinates, one per game.

    The games are generally not continuous and have missing moves.

    The dataset contains 62 games in terms of board positions, and 2000 (todo) images with 300 000 (todo) intersections to recognize.

    The dataset is split into two folders for historical reasons. This might make a good split for the train and test parts.

    This is the dataset I created for developing my Gomrade project. It allows you to play against AI on a real board. You can find more here: GitHub

    Other sources

    In this dataset, there are recordings of my own games, some images scraped from the internet, and two other sources:
    • VideoKifu - https://www.oipaz.net/VideoKifu.html - even though it's a really great project overall, the validation of single-image recognition was not performed. VideoKifu focuses on video tracking and the course of the game, so they did not find reference labels necessary for their application. I added reference annotation because I believe it's necessary for reliable systems.
    • Imago - http://tomasm.cz/imago - a great project with a few annotated images

    All the labels in this dataset except those from Imago were created by me, and I own the rights to them. They were created in a semi-supervised manner.

    About a license

    The license is one of the most restrictive. If you have any ideas on how to develop this dataset or a project based on it, contact me and we can figure out the future of this data. For now, I don't have any other ideas so I restrict it.

    Other

    I also have access to Igoki project data but at the moment I don't know if I can share it. I will be happy to get in touch and share my thoughts.

  12. cats_vs_dogs

    • tensorflow.org
    • universe.roboflow.com
    • +1more
    Updated Dec 19, 2023
    Cite
    (2023). cats_vs_dogs [Dataset]. https://www.tensorflow.org/datasets/catalog/cats_vs_dogs
    Dataset updated
    Dec 19, 2023
    Description

    A large set of images of cats and dogs. There are 1738 corrupted images that are dropped.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('cats_vs_dogs', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/cats_vs_dogs-4.0.1.png

  13. Product Exchange/Bartering Data

    • cseweb.ucsd.edu
    json
    Cite
    UCSD CSE Research Project, Product Exchange/Bartering Data [Dataset]. https://cseweb.ucsd.edu/~jmcauley/datasets.html
    Available download formats: json
    Dataset authored and provided by
    UCSD CSE Research Project
    Description

    These datasets contain peer-to-peer trades from various recommendation platforms.

    Metadata includes

    • peer-to-peer trades

    • have and want lists

    • image data (tradesy)

  14. Amazon Question and Answer Data

    • cseweb.ucsd.edu
    json
    Cite
    UCSD CSE Research Project, Amazon Question and Answer Data [Dataset]. https://cseweb.ucsd.edu/~jmcauley/datasets.html
    Available download formats: json
    Dataset authored and provided by
    UCSD CSE Research Project
    Description

    These datasets contain 1.48 million question and answer pairs about products from Amazon.

    Metadata includes

    • question and answer text

    • is the question binary (yes/no), and if so does it have a yes/no answer?

    • timestamps

    • product ID (to reference the review dataset)

    Basic Statistics:

    • Questions: 1.48 million

    • Answers: 4,019,744

    • Labeled yes/no questions: 309,419

    • Number of unique products with questions: 191,185
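
    A minimal loading sketch, assuming these files follow the one-Python-dict-per-line, gzip-compressed convention used by related datasets on the same page (the file name and the question/answer keys are assumptions to check against the download):

    import ast
    import gzip

    def parse(path):
        with gzip.open(path, 'rt') as f:
            for line in f:
                yield ast.literal_eval(line)

    for qa in parse('qa_Electronics.json.gz'):   # hypothetical file name
        print(qa.get('question'), '->', qa.get('answer'))
        break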

  15. Monkeypox Skin Lesion Dataset

    • kaggle.com
    Updated Jul 5, 2022
    Cite
    TensorKitty (2022). Monkeypox Skin Lesion Dataset [Dataset]. https://www.kaggle.com/datasets/nafin59/monkeypox-skin-lesion-dataset
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 5, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    TensorKitty
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    An updated version of the MSLD dataset, MSLD v2.0 has been released after being verified by an expert dermatologist!

    For details, check our GitHub repo!

    Context

    The recent monkeypox outbreak has become a global healthcare concern owing to its rapid spread in more than 65 countries around the globe. To obstruct its expeditious pace, early diagnosis is a must. But the confirmatory Polymerase Chain Reaction (PCR) tests and other biochemical assays are not readily available in sufficient quantities. In this scenario, computer-aided monkeypox identification from skin lesion images can be a beneficial measure. Nevertheless, so far, such datasets are not available. Hence, the "Monkeypox Skin Lesion Dataset (MSLD)" is created by collecting and processing images from different means of web-scraping, i.e., from news portals, websites and publicly accessible case reports.

    The creation of the "Monkeypox Skin Lesion Dataset" is primarily focused on distinguishing monkeypox cases from similar non-monkeypox cases. Therefore, along with the 'Monkeypox' class, we included skin lesion images of 'Chickenpox' and 'Measles', because of their resemblance to the monkeypox rash and pustules in their initial state, in another class named 'Others' to enable binary classification.

    Content

    There are 3 folders in the dataset.

    1) Original Images: It contains a total of 228 images, among which 102 belong to the 'Monkeypox' class and the remaining 126 represent the 'Others' class i.e., non-monkeypox (chickenpox and measles) cases.

    2) Augmented Images: To aid the classification task, several data augmentation methods such as rotation, translation, reflection, shear, hue, saturation, contrast and brightness jitter, noise, scaling etc. have been applied using MATLAB R2020a. Although this can be readily done using ImageGenerator/other image augmentors, to ensure reproducibility of the results, the augmented images are provided in this folder. Post-augmentation, the number of images increased by approximately 14-fold. The classes 'Monkeypox' and 'Others' have 1428 and 1764 images, respectively.

    3) Fold1: One of the three-fold cross validation datasets. To avoid any sort of bias in training, three-fold cross validation was performed. The original images were split into training, validation and test set(s) with the approximate proportion of 70 : 10 : 20 while maintaining patient independence. In keeping with common data preparation practice, only the training and validation images were augmented while the test set contained only the original images. Users have the option of using the folds directly or using the original data and employing other algorithms to augment it.

    Additionally, a CSV file is provided that has 228 rows and two columns. The table contains the list of all the ImageID(s) with their corresponding label.
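
    A minimal sketch of turning that CSV into a binary target with pandas; the file name and column names are assumptions to check against the download:

    import pandas as pd

    df = pd.read_csv('monkeypox_image_labels.csv')          # hypothetical file name
    df['target'] = (df['Label'] == 'Monkeypox').astype(int)
    print(df['target'].value_counts())                      # expected: 102 vs 126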

    Web Application

    Since monkeypox is demonstrating a very rapid community transmission pattern, consumer-level software is truly necessary to increase awareness and encourage people to take rapid action. We have developed an easy-to-use web application named Monkey Pox Detector using the open-source Python Streamlit framework that uses our trained model to address this issue. It makes predictions on whether or not to see a specialist, along with the prediction accuracy. Future updates will benefit from the user data we continue to collect and use to improve our model. The web app has a Flask core, so that it can be deployed cross-platform in the future.

    Learn more at our GitHub repo!

    Citation

    If this dataset helped your research, please cite the following articles:

    Ali, S. N., Ahmed, M. T., Paul, J., Jahan, T., Sani, S. M. Sakeef, Noor, N., & Hasan, T. (2022). Monkeypox Skin Lesion Detection Using Deep Learning Models: A Preliminary Feasibility Study. arXiv preprint arXiv:2207.03342.

    @article{Nafisa2022, title={Monkeypox Skin Lesion Detection Using Deep Learning Models: A Preliminary Feasibility Study}, author={Ali, Shams Nafisa and Ahmed, Md. Tazuddin and Paul, Joydip and Jahan, Tasnim and Sani, S. M. Sakeef and Noor, Nawshaba and Hasan, Taufiq}, journal={arXiv preprint arXiv:2207.03342}, year={2022} }

    Ali, S. N., Ahmed, M. T., Jahan, T., Paul, J., Sani, S. M. Sakeef, Noor, N., Asma, A. N., & Hasan, T. (2023). A Web-based Mpox Skin Lesion Detection System Using State-of-the-art Deep Learning Models Considering Racial Diversity. arXiv preprint arXiv:2306.14169.

    @article{Nafisa2023, title={A Web-base...

  16. ‘Birds' Bones and Living Habits’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Nov 12, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘Birds' Bones and Living Habits’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-birds-bones-and-living-habits-8e70/a5c525fb/?iid=003-438&v=presentation
    Explore at:
    Dataset updated
    Nov 12, 2021
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Birds' Bones and Living Habits’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/zhangjuefei/birds-bones-and-living-habits on 12 November 2021.

    --- Dataset description provided by original source is as follows ---

    Context

    There are many kinds of birds: pigeons, ducks, ostriches, penguins... Some are good at flying, others cannot fly but run fast. Some swim under water, others wade in shallow pools.

    According to their living environments and living habits, birds are classified into different ecological groups. There are 8 ecological groups of birds:

    • Swimming Birds
    • Wading Birds
    • Terrestrial Birds
    • Raptors
    • Scansorial Birds
    • Singing Birds
    • Cursorial Birds (not included in dataset)
    • Marine Birds (not included in dataset)

    The first six groups are the main ones and are the groups covered by this dataset.

    Clearly, birds belonging to different ecological groups have different appearances: flying birds have strong wings and wading birds have long legs. Their living habits are reflected, to some extent, in the shapes of their bones. As data scientists, we may therefore examine the underlying relationship between bone sizes and ecological groups, and try to recognise a bird's ecological group from the shape of its bones.

    Content

    There are 420 birds contained in this dataset. Each bird is represented by 10 measurements (features):

    • Length and Diameter of Humerus
    • Length and Diameter of Ulna
    • Length and Diameter of Femur
    • Length and Diameter of Tibiotarsus
    • Length and Diameter of Tarsometatarsus

    Pigeon skeleton (image): https://s-media-cache-ak0.pinimg.com/564x/cb/cb/41/cbcb41cb2c1db6c32c1e869be2172c3b.jpg

    All measurements are continuous floating-point numbers (mm), with missing values represented by empty strings. The skeletons in this dataset come from the collections of the Natural History Museum of Los Angeles County. They belong to 21 orders, 153 genera and 245 species.

    Each bird has a label for its ecological group:

    • SW: Swimming Birds
    • W: Wading Birds
    • T: Terrestrial Birds
    • R: Raptors
    • P: Scansorial Birds
    • SO: Singing Birds

    Acknowledgements

    This dataset is provided by Dr. D. Liu of Beijing Museum of Natural History.

    Inspiration

    This is a 420×10 dataset of continuous values with unbalanced multi-class labels. Possible tasks include (a minimal loading-and-classification sketch follows the list):

    • Data Visualisation
    • Statistical Analysis
    • Supervised Classification
    • Unsupervised Clustering
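    A minimal loading-and-classification sketch, assuming the Kaggle CSV is named bird.csv and contains the ten measurement columns plus an id and a type (label) column:

    ```python
    # Exploratory sketch; column and file names are assumptions based on the description above.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    df = pd.read_csv("bird.csv", na_values=[""]).dropna()     # missing values are empty strings

    X = df.drop(columns=["id", "type"], errors="ignore")      # the 10 bone measurements (mm)
    y = df["type"]                                            # SW, W, T, R, P, SO

    # class_weight="balanced" and stratified CV help with the unbalanced classes
    clf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
    print(cross_val_score(clf, X, y, cv=5).mean())
    ```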

    License

    Please do not publish or cite this dataset in research papers or other public publications.

    --- Original source retains full ownership of the source dataset ---

  17. DailyDialog (Multi-turn Dialog)

    • kaggle.com
    Updated Nov 29, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    The Devastator (2022). DailyDialog (Multi-turn Dialog) [Dataset]. https://www.kaggle.com/datasets/thedevastator/dailydialog-unlock-the-conversation-potential-in
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 29, 2022
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    DailyDialog (Multi-turn Dialog)

    Dialogues that reflect the way we communicate in daily life and cover various topics

    By Huggingface Hub [source]

    About this dataset

    Welcome to the DailyDialog dataset, your gateway to unlocking conversation potential through multi-turn dialog experiences! Our dataset consists of conversations written by humans, which serve as a more accurate reflection of our day-to-day conversations than other datasets. Additionally, we have included manually labeled communication intentions and emotion fields in our data that can be used for advancing dialog systems.

    Whether you’re a researcher looking for new approaches in dialog systems or someone simply curious about conversation dynamics from the perspective of computer science – this dataset is here to help! We invite you to explore and make use of this data for its full potential and advance the research field further.

    Our three main files (train.csv, validation.csv, test.csv) each provide key columns such as dialog, act, and emotion, enabling you to get an even deeper understanding of how effective conversations really work -- so what are you waiting for? Unlock your conversation potential today with DailyDialog!

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    Welcome and thank you for your interest in the DailyDialog dataset! This dataset is designed to unlock conversation potential through multi-turn dialog experiences and provide a better understanding of conversations in our day-to-day lives. Whether you are a student, researcher, or just plain curious, this guide is here to help you get started with using the DailyDialog dataset for your own research or exploration.

    The DailyDialog dataset includes three files: train.csv, validation.csv, and test.csv which all contain dialog, act and emotion fields that can be used by those who wish to evaluate existing approaches in the field of dialogue systems or perform new experiments on conversational models. All data found in this dataset is written by humans and thus contains less noise than other datasets typically seen online.

    The first step when using this data set would be to familiarize yourself with the different fields found within each file:

    • Dialog – The dialog field contains the conversation between two people (String).
    • Act – The act field contains the communication intentions of both parties involved in the dialogue (String).
    • Emotion – The emotion field labels any emotions expressed during a particular dialogue (String).

    Once you understand what each of these three fields means, it is time to start exploring! You can use any programming language or software, along with statistical methods, text-analysis tools such as RapidMiner, or Natural Language Processing libraries such as NLTK or spaCy, to explore these fields individually or together in more depth. If you are interested specifically in machine learning tasks, there are also possibilities such as generating new conversations from the data set (e.g., chatbots) using reinforcement learning or deep learning architectures / neural networks for natural language understanding.
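    A quick-look sketch in pandas, assuming train.csv is in the working directory and uses the dialog / act / emotion column names described above (some exports store these fields as list-like strings, hence the optional parsing step):

    ```python
    # Quick-look sketch; the file location and exact column names are assumptions.
    import ast
    import pandas as pd

    train = pd.read_csv("train.csv")
    print(train.columns.tolist())          # expect columns like ['dialog', 'act', 'emotion']
    print(train.shape)

    def maybe_parse(value):
        """Parse a list-like string such as "['hi', 'hello']"; otherwise return it unchanged."""
        try:
            return ast.literal_eval(value)
        except (ValueError, SyntaxError):
            return value

    train["dialog_parsed"] = train["dialog"].apply(maybe_parse)
    print(train["dialog_parsed"].iloc[0])
    ```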

    All said and done, we believe that the ability to unlock the patterns embedded in real-life conversations will help researchers across various domains and research areas (e.g., AI / ML) succeed in their efforts and have an exciting journey :)

    Research Ideas

    • Developing a conversational AI system that can replicate authentic conversations by modeling the emotion and communication intentions present in the DailyDialog dataset.
    • Creating a language-learning tool which can customize personalized dialogues based on the DailyDialog data to help foreign language learners get used to spoken dialogue.
    • Utilizing the DailyDialog data to develop an interactive chatbot with customized responses and emotions, allowing users to learn more about their conversational skills through simulated conversations

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. [See Other Information](https://creativecommons...

  18. Customer IT Support - Ticket Dataset

    • kaggle.com
    Updated Jun 16, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Tobias Bueck (2025). Customer IT Support - Ticket Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/12183005
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 16, 2025
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Tobias Bueck
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Featuring Labeled Customer Emails and Support Responses

    Network Diagram Tags

    Network diagram (image): https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3023333%2F9f9df25b75671db2d255b2d284c2c80c%2Fnetwork_diagram.svg?generation=1739380045025331&alt=media

    Discover the new, expanded version of this dataset with 20,000 ticket entries! Perfect for training models to classify and prioritize support tickets.

    Definitely check out my other dataset:
    Tickets from Github Issues

    It includes priorities, queues, types, tags, and business types. This preview offers a detailed structure with classifications by department, type, priority, language, subject, full email text, and agent answers.

    Features / Attributes

    • 🔀 Queue – Specifies the department to which the email ticket is routed (e.g. Technical Support, Customer Service, Billing and Payments, ...)
    • 🚦 Priority – Indicates the urgency and importance of the issue (🟢 Low, 🟠 Medium, 🔴 Critical)
    • 🗣️ Language – Indicates the language in which the email is written (EN, DE, ES, FR, PT)
    • Subject – Subject of the customer's email
    • Body – Body of the customer's email
    • Answer – The response provided by the helpdesk agent
    • Type – The type of ticket as picked by the agent (e.g. Incident, Request, Problem, Change, ...)
    • 🏢 Business Type – The business type of the support helpdesk (e.g. Tech Online Store, IT Services, Software Development Company)
    • Tags – Tags/categories assigned to the ticket, split into ten columns in the dataset (e.g. "Software Bug", "Warranty Claim")

    Queue

    Specifies the department to which the email ticket is categorized. This helps in routing the ticket to the appropriate support team for resolution.

    • 💻 Technical Support: Technical issues and support requests.
    • 🈂️ Customer Service: Customer inquiries and service requests.
    • 💰 Billing and Payments: Billing issues and payment processing.
    • 🖥️ Product Support: Support for product-related issues.
    • 🌐 IT Support: Internal IT support and infrastructure issues.
    • 🔄 Returns and Exchanges: Product returns and exchanges.
    • 📞 Sales and Pre-Sales: Sales inquiries and pre-sales questions.
    • 🧑‍💻 Human Resources: Employee inquiries and HR-related issues.
    • ❌ Service Outages and Maintenance: Service interruptions and maintenance.
    • 📮 General Inquiry: General inquiries and information requests.

    Priority

    Indicates the urgency and importance of the issue. Helps in managing the workflow by prioritizing tickets that need immediate attention.

    • 🟢 1 (Low): Non-urgent issues that do not require immediate attention. Examples: general inquiries, minor inconveniences, routine updates, and feature requests.
    • 🟠 2 (Medium): Moderately urgent issues that need timely resolution but are not critical. Examples: performance issues, intermittent errors, and detailed user questions.
    • 🔴 3 (Critical): Urgent issues that require immediate attention and quick resolution. Examples: system ...
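    As one example of the classification use case, a hedged TF-IDF baseline for routing tickets to a Queue could look like the sketch below; the CSV file name is an assumption, while Subject, Body and Queue follow the field list above.

    ```python
    # Illustrative routing baseline; the CSV file name is assumed, column names follow the table above.
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    df = pd.read_csv("customer_it_support_tickets.csv")                   # assumed file name
    text = df["Subject"].fillna("") + " " + df["Body"].fillna("")

    X_train, X_test, y_train, y_test = train_test_split(
        text, df["Queue"], test_size=0.2, stratify=df["Queue"], random_state=0)

    model = make_pipeline(TfidfVectorizer(max_features=50_000, ngram_range=(1, 2)),
                          LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    print(classification_report(y_test, model.predict(X_test)))
    ```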

  19. Sepsis Dataset –

    • kaggle.com
    Updated May 31, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Fatolu Peter (2025). Sepsis Dataset – [Dataset]. https://www.kaggle.com/datasets/olagokeblissman/sepsis-dataset
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 31, 2025
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Fatolu Peter
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    📝 Dataset Overview: This dataset focuses on early warning detection for sepsis, a critical and potentially fatal medical condition. It includes anonymized vital signs, lab results, and clinical indicators of patients admitted to the hospital, structured for real-time monitoring and predictive modeling.

    It’s ideal for clinical data analysts, healthcare data scientists, and AI practitioners aiming to develop decision support tools, early warning dashboards, or predictive health models.

    🔍 Dataset Features:

    • Patient_ID – Unique anonymized identifier
    • Admission_Date – Patient’s hospital admission date
    • Temperature_C – Body temperature in degrees Celsius
    • BP_Systolic – Systolic blood pressure (mmHg)
    • BP_Diastolic – Diastolic blood pressure (mmHg)
    • Heart_Rate – Beats per minute
    • WBC_Count – White blood cell count (x10⁹/L)
    • Lactate_mmol_L – Lactate level in mmol/L
    • Sepsis_Flag – Binary indicator (1 = Suspected Sepsis, 0 = Normal)
    • Ward – Hospital ward/unit
    • Doctor_On_Duty – Attending physician name (anonymized)

    🎯 Use Cases:

    • Build Power BI dashboards for hospital early warning systems
    • Train ML classification models to detect early signs of sepsis (a minimal sketch follows this list)
    • Create patient monitoring tools with Python or R
    • Explore the relationship between vitals & sepsis onset
    • Perform feature engineering for risk scoring systems
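    A minimal baseline sketch for the classification use case, assuming the download is a single CSV named sepsis_dataset.csv with the columns listed above:

    ```python
    # Baseline sepsis-risk sketch; the CSV file name is an assumption, columns follow the feature list.
    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("sepsis_dataset.csv")

    features = ["Temperature_C", "BP_Systolic", "BP_Diastolic",
                "Heart_Rate", "WBC_Count", "Lactate_mmol_L"]
    X, y = df[features], df["Sepsis_Flag"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    clf = GradientBoostingClassifier().fit(X_train, y_train)
    print("ROC AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
    ```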

    📌 Clinical Relevance: Sepsis is one of the leading causes of in-hospital mortality worldwide. Early detection is crucial to reducing death rates and improving outcomes. This dataset empowers developers and analysts to make a meaningful impact in the healthcare sector.

    👤 Created By: Fatolu Peter (Emperor Analytics) A passionate healthcare analyst leveraging data to drive innovation in public health across Nigeria. This is Project 12 in my data-for-good series.

    ✅ LinkedIn Post: 🚨 New Dataset: Sepsis Early Warning System Data – Now on Kaggle 📊 Clinical vital signs + lab markers + sepsis risk flags 🔗 Explore the dataset here

    This dataset enables healthcare data scientists to: ✅ Build real-time hospital dashboards ✅ Predict sepsis risk with machine learning ✅ Explore vitals like BP, lactate, WBC, and temperature ✅ Support early intervention using data insights

    Whether you're into: 🧠 Predictive modeling 📈 Power BI clinical dashboards 📉 Risk analytics in healthcare This is for you.

    Join me in using data to save lives — one insight at a time. If you build something, tag me. I’ll gladly share it! 💡

    #HealthcareAnalytics #SepsisAwareness #EarlyWarningSystems #KaggleDataset #PowerBI #DataForGood #FatoluPeter #EmperorAnalytics #PublicHealth #Project12 #RealWorldData


  20. Faces: Age Detection from Images

    • kaggle.com
    Updated Jan 26, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Möbius (2024). Faces: Age Detection from Images [Dataset]. https://www.kaggle.com/arashnic/faces-age-detection-dataset/discussion
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 26, 2024
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Möbius
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Recently, many applications, from biometrics to entertainment, use information extracted from face images about age, gender, ethnic background, and emotional state. Automatic age estimation from facial images is a popular and challenging task with many fields of application, such as controlling the content of watched media depending on the viewer's age. Facial feature analysis has therefore been a topic of interest mainly due to its applicability, and deep learning techniques are now making it possible for face analysis to be not just a dream but a reality. This simple practice dataset can get you more acquainted with applying deep learning to age detection.


    Content

    The Indian Movie Face Database (IMFDB) is a large unconstrained face database consisting of 34,512 images of 100 Indian actors collected from more than 100 videos. All the images were manually selected and cropped from the video frames, resulting in a high degree of variability in terms of scale, pose, expression, illumination, age, resolution, occlusion, and makeup. IMFDB is the first face database that provides a detailed annotation of every image in terms of age, pose, gender, expression and type of occlusion, which may help other face-related applications.

    The dataset provides a total of 19,906 images. The attributes of the data are as follows:

    • ID – Unique ID of image
    • Class – Age bin of person in image


    Image reference: Automatic age estimation based on CNN – https://ars.els-cdn.com/content/image/1-s2.0-S0925231215017348-gr1.jpg

    Acknowledgements

    CVIT focuses on basic and advanced research in image processing, computer vision, computer graphics and machine learning. This center deals with the generation, processing, and understanding of primarily visual data as well as with the techniques and tools required doing so efficiently. The activity of this center overlaps the traditional areas of Computer Vision, Image Processing, Computer Graphics, Pattern Recognition and Machine Learning. CVIT works on both theoretical as well as practical aspects of visual information processing. Center aims to keep the right balance between the cutting edge academic research and impactful applied research.

    Inspiration

    The main task is to predict a person's age from his or her facial attributes. For simplicity, the problem has been converted to a multiclass problem with three classes: Young, Middle and Old.

    Version 2: Faces dataset for regression added

    UTKFace is a large-scale face dataset with a long age span (ranging from 0 to 116 years old). The dataset consists of over 20,000 face images with annotations of age, gender, and ethnicity. The images cover large variations in pose, facial expression, illumination, occlusion, resolution, etc. This dataset could be used for a variety of tasks, e.g., face detection, age estimation, age progression/regression, and landmark localization. Sample images: https://susanqq.github.io/UTKFace/icon/samples.png

    • consists of 20k+ face images in the wild (only single face in one image)
    • provides the correspondingly aligned and cropped faces
    • provides the corresponding landmarks (68 points)
    • images are labelled by age, gender, and ethnicity

    Complete Dataset: https://susanqq.github.io/UTKFace/

    The labels of each face image are embedded in the file name, formatted as [age]_[gender]_[race]_[date&time].jpg (a short parsing sketch follows this list):

    • [age] is an integer from 0 to 116, indicating the age
    • [gender] is either 0 (male) or 1 (female)
    • [race] is an integer from 0 to 4, denoting White, Black, Asian, Indian, and Others (like Hispanic, Latino, Middle Eastern).
    • [date&time] is in the format of yyyymmddHHMMSSFFF, showing the date and time an image was collected to UTKFace
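    A short parsing sketch, assuming the images sit in a local UTKFace/ directory; file names that do not contain all four fields are skipped.

    ```python
    # Parse age/gender/race out of UTKFace file names ([age]_[gender]_[race]_[date&time].jpg).
    from pathlib import Path
    import pandas as pd

    records = []
    for path in Path("UTKFace").glob("*.jpg"):          # assumed local directory name
        parts = path.stem.split("_")
        if len(parts) >= 4:                             # skip malformed file names
            records.append({"file": path.name,
                            "age": int(parts[0]),
                            "gender": int(parts[1]),    # 0 = male, 1 = female
                            "race": int(parts[2])})     # 0-4 as described above

    faces = pd.DataFrame(records)
    print(faces.head())
    print(faces["age"].describe())
    ```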

    *If you download the data and find it useful, your upvote is explicit feedback for future work.*
