100+ datasets found
  1. imagenet2012

    • tensorflow.org
    Updated Jun 1, 2024
    Cite: (2024). imagenet2012 [Dataset]. https://www.tensorflow.org/datasets/catalog/imagenet2012
    Description

    ILSVRC 2012, commonly known as 'ImageNet', is an image dataset organized according to the WordNet hierarchy. Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a "synonym set" or "synset". There are more than 100,000 synsets in WordNet, the majority of them (80,000+) nouns. In ImageNet, we aim to provide on average 1000 images to illustrate each synset. Images of each concept are quality-controlled and human-annotated. In its completion, we hope ImageNet will offer tens of millions of cleanly sorted images for most of the concepts in the WordNet hierarchy.

    The test split contains 100K images but no labels because no labels have been publicly released. We provide support for the test split from 2012 with the minor patch released on October 10, 2019. In order to manually download this data, a user must perform the following operations:

    1. Download the 2012 test split available here.
    2. Download the October 10, 2019 patch. There is a Google Drive link to the patch provided on the same page.
    3. Combine the two tar-balls, manually overwriting any images in the original archive with images from the patch. According to the instructions on image-net.org, this procedure overwrites just a few images.

    The resulting tar-ball may then be processed by TFDS.
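
    A hedged sketch of step 3 above (the archive names are hypothetical; the patch is extracted last so its images overwrite the originals):

    import tarfile

    # Extract the original test split, then the patch on top of it.
    with tarfile.open('ILSVRC2012_img_test.tar') as original:
        original.extractall('test_combined')
    with tarfile.open('ILSVRC2012_img_test_patch.tar') as patch:
        patch.extractall('test_combined')  # overwrites the few patched images

    # Re-pack the combined directory into a single tar-ball for TFDS.
    with tarfile.open('ILSVRC2012_img_test_patched.tar', 'w') as combined:
        combined.add('test_combined', arcname='.')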

    To assess the accuracy of a model on the ImageNet test split, one must run inference on all images in the split, export those results to a text file, and upload that file to the ImageNet evaluation server. The maintainers of the ImageNet evaluation server permit a single user up to 2 submissions per week in order to prevent overfitting.

    To evaluate the accuracy on the test split, one must first create an account at image-net.org. This account must be approved by the site administrator. After the account is created, one can submit the results to the test server at https://image-net.org/challenges/LSVRC/eval_server.php. The submission consists of several ASCII text files corresponding to multiple tasks. The task of interest is "Classification submission (top-5 cls error)". A sample of an exported text file looks like the following:

    771 778 794 387 650
    363 691 764 923 427
    737 369 430 531 124
    755 930 755 59 168
    

    The export format is described in full in "readme.txt" within the 2013 development kit, available here: https://image-net.org/data/ILSVRC/2013/ILSVRC2013_devkit.tgz. Please see the section entitled "3.3 CLS-LOC submission format". Briefly, the format of the text file is 100,000 lines corresponding to each image in the test split. Each line of integers corresponds to the rank-ordered, top-5 predictions for that test image. The integers are 1-indexed, corresponding to the line number in the corresponding labels file. See labels.txt.
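
    As an illustration of producing such a file (a minimal sketch; the predictions below are random placeholders, not model output):

    import numpy as np

    # One row of five rank-ordered, 1-indexed class IDs per test image.
    predictions = np.random.randint(1, 1001, size=(100000, 5))

    with open('submission.txt', 'w') as f:
        for row in predictions:
            f.write(' '.join(str(int(label)) for label in row) + '\n')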

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('imagenet2012', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/imagenet2012-5.1.0.png

  2. cifar10

    • tensorflow.org
    • opendatalab.com
    • +3 more
    Updated Jun 1, 2024
    Cite: (2024). cifar10 [Dataset]. https://www.tensorflow.org/datasets/catalog/cifar10
    Description

    The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('cifar10', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/cifar10-3.0.2.png

  3. Dataset for "Enhancing Cloud Detection in Sentinel-2 Imagery: A...

    • zenodo.org
    • data.niaid.nih.gov
    Updated Feb 4, 2024
    Cite: Gong Chengjuan; Yin Ranyu; Long Tengfei; He Guojin; Jiao Weili; Wang Guizhou (2024). Dataset for "Enhancing Cloud Detection in Sentinel-2 Imagery: A Spatial-Temporal Approach and Dataset" [Dataset]. http://doi.org/10.5281/zenodo.10613705
    Available download formats: bin
    Dataset provided by: Zenodo (http://zenodo.org/)
    Authors: Gong Chengjuan; Yin Ranyu; Long Tengfei; He Guojin; Jiao Weili; Wang Guizhou
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This dataset is built for time-series Sentinel-2 cloud detection and stored in TensorFlow TFRecord format (see https://www.tensorflow.org/tutorials/load_data/tfrecord).

    Each file is compressed in 7z format and can be decompressed using Bandizip or 7-Zip.

    Dataset Structure:

    Each filename can be split into three parts using underscores. The first part indicates whether it is designated for training or validation ('train' or 'val'); the second part indicates the Sentinel-2 tile name, and the last part indicates the number of samples in this file.

    Each sample includes:

    1. A sample ID;
    2. An array of time-series 4-band image patches at 10 m resolution, shaped as (n_timestamps, 4, 42, 42);
    3. A label list indicating cloud cover status for the center \(6\times6\) pixels at each timestamp;
    4. An ordinal date list for each timestamp;
    5. A sample weight list (reserved).

    Here is a demonstration function for parsing the TFRecord file:

    import tensorflow as tf

    # Initialize a TensorFlow Dataset from a file name, pairing each record
    # with the tile name parsed from the filename.
    def parseRecordDirect(fname):
        sep = '/'
        parts = tf.strings.split(fname, sep)
        tn = tf.strings.split(parts[-1], sep='_')[-2]
        nn = tf.strings.to_number(tf.strings.split(parts[-1], sep='_')[-1], tf.dtypes.int64)
        t = tf.data.Dataset.from_tensors(tn).repeat().take(nn)
        t1 = tf.data.TFRecordDataset(fname)
        ds = tf.data.Dataset.zip((t, t1))
        return ds

    keys_to_features_direct = {
        'localid': tf.io.FixedLenFeature([], tf.int64, -1),
        'image_raw_ldseries': tf.io.FixedLenFeature((), tf.string, ''),
        'labels': tf.io.FixedLenFeature((), tf.string, ''),
        'dates': tf.io.FixedLenFeature((), tf.string, ''),
        'weights': tf.io.FixedLenFeature((), tf.string, '')
    }

    # The decoder (optional); `decoder` here is assumed to be the
    # tensorflow_datasets decode module.
    class SeriesClassificationDirectDecorder(decoder.Decoder):
        """A tf.Example decoder for tfds classification datasets."""
        def __init__(self) -> None:
            super().__init__()

        def decode(self, tid, ds):
            parsed = tf.io.parse_single_example(ds, keys_to_features_direct)
            encoded = parsed['image_raw_ldseries']
            labels_encoded = parsed['labels']
            decoded = tf.io.decode_raw(encoded, tf.uint16)
            label = tf.io.decode_raw(labels_encoded, tf.int8)
            dates = tf.io.decode_raw(parsed['dates'], tf.int64)
            weight = tf.io.decode_raw(parsed['weights'], tf.float32)
            decoded = tf.reshape(decoded, [-1, 4, 42, 42])
            sample_dict = {
                'tid': tid,                    # tile ID
                'dates': dates,                # date list
                'localid': parsed['localid'],  # sample ID
                'imgs': decoded,               # image array
                'labels': label,               # label list
                'weights': weight
            }
            return sample_dict

    # Simple parsing function returning plain tensors instead of a dict.
    def preprocessDirect(tid, record):
        parsed = tf.io.parse_single_example(record, keys_to_features_direct)
        encoded = parsed['image_raw_ldseries']
        labels_encoded = parsed['labels']
        decoded = tf.io.decode_raw(encoded, tf.uint16)
        label = tf.io.decode_raw(labels_encoded, tf.int8)
        dates = tf.io.decode_raw(parsed['dates'], tf.int64)
        weight = tf.io.decode_raw(parsed['weights'], tf.float32)
        decoded = tf.reshape(decoded, [-1, 4, 42, 42])
        return tid, dates, parsed['localid'], decoded, label, weight

    t1 = parseRecordDirect('filename here')
    dataset = t1.map(preprocessDirect, num_parallel_calls=tf.data.experimental.AUTOTUNE)


    Class Definition:

    • 0: clear
    • 1: opaque cloud
    • 2: thin cloud
    • 3: haze
    • 4: cloud shadow
    • 5: snow
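
    For convenience, this mapping can be kept as a small lookup table (a trivial sketch, not part of the dataset files):

    CLASS_NAMES = {
        0: 'clear',
        1: 'opaque cloud',
        2: 'thin cloud',
        3: 'haze',
        4: 'cloud shadow',
        5: 'snow',
    }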

    Dataset Construction:

    First, we randomly generate 500 points for each tile; all these points are aligned to the pixel-grid centers of the 60 m resolution subdatasets (e.g. B10) for consistency when comparing with other products.
    This is because other cloud detection methods may use the cirrus band as a feature, which is at 60 m resolution.

    Then, the time series image patches of two shapes are cropped with each point as the center.
    The patches of shape \(42 \times 42\) are cropped from the bands in 10 m resolution (B2, B3, B4, B8) and are used to construct this dataset.
    The patches of shape \(348 \times 348\) are cropped from the True Colour Image (TCI; see the Sentinel-2 User Guide for details) file and are used to interpret class labels.

    Samples with a large number of timestamps can be time-consuming in the I/O stage, so the time series patches are divided into groups of at most 100 timestamps each.

  4. speech_commands

    • tensorflow.org
    • datasets.activeloop.ai
    • +1 more
    Updated Jan 13, 2023
    Cite: (2023). speech_commands [Dataset]. http://identifiers.org/arxiv:1804.03209
    Description

    An audio dataset of spoken words designed to help train and evaluate keyword spotting systems. Its primary goal is to provide a way to build and test small models that detect when a single word is spoken, from a set of ten target words, with as few false positives as possible from background noise or unrelated speech. Note that in the train and validation sets, the label "unknown" is much more prevalent than the labels of the target words or background noise. One difference from the release version is the handling of silent segments. While in the test set the silence segments are regular 1-second files, in the training set they are provided as long segments in the "background_noise" folder. Here we split this background noise into 1-second clips, and also keep one of the files for the validation set.
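
    A rough sketch of that splitting step (illustrative only; assumes a 16 kHz WAV file, which is the sample rate this dataset uses, and a hypothetical file name):

    import tensorflow as tf

    # Read one long background-noise recording and cut it into 1-second clips.
    audio_bytes = tf.io.read_file('background_noise.wav')  # hypothetical path
    waveform, sample_rate = tf.audio.decode_wav(audio_bytes)
    clip_len = 16000  # 1 second at 16 kHz
    n_clips = tf.shape(waveform)[0] // clip_len
    clips = tf.reshape(waveform[:n_clips * clip_len], [n_clips, clip_len, -1])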

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('speech_commands', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

  5. Simulated datasets for detector and particle flow reconstruction: CLIC...

    • nde-dev.biothings.io
    • data.niaid.nih.gov
    • +1 more
    Updated Mar 21, 2025
    Cite: Mokhtar, Farouk (2025). Simulated datasets for detector and particle flow reconstruction: CLIC detector, machine learning format [Dataset]. https://nde-dev.biothings.io/resources?id=zenodo_8409591
    Dataset provided by
    Kagan, Michael
    Garcia, Dolores
    Duarte, Javier
    Pata, Joosep
    Zhang, Mengke
    Wulff, Eric
    Mokhtar, Farouk
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Synopsis

    Machine-learning friendly format of tracks, clusters and target particles in electron-positron events, simulated with the CLIC detector. Ready to be used with jpata/particleflow:v2.3.0. Derived from the EDM4HEP ROOT files in https://zenodo.org/record/8260741.

    clic_edm_ttbar_pf.zip: e+e- -> ttbar, center of mass energy at 380 GeV

    clic_edm_qq_pf.zip: e+e- -> Z* -> qqbar, center of mass energy at 380 GeV

    clic_edm_ww_fullhad_pf.zip: e+e- -> WW -> W decaying hadronically, center of mass energy at 380 GeV

    clic-tfds.ipynb: an example notebook on how to load the files

    Contents

    Each .zip file contains the dataset in the tensorflow-datasets array_record format. We have split the full datasets into 10 subsets; due to space considerations on Zenodo, two subsets from each dataset are uploaded. Each dataset contains a train and test split of events.

    Dataset semantics (to be updated)

    Each dataset consists of events that can be iterated over using the tensorflow-datasets library and used in either tensorflow or pytorch. Each event has the following information available:

    X: the reconstruction input features, i.e. tracks and clusters

    ytarget: the ground truth particles with the features ["PDG", "charge", "pt", "eta", "sin_phi", "cos_phi", "energy", "jet_idx"], with "jet_idx" corresponding to the gen-jet assignment of this particle

    ycand: the baseline Pandora PF particles with the features ["PDG", "charge", "pt", "eta", "sin_phi", "cos_phi", "energy", "jet_idx"], with "jet_idx" corresponding to the gen-jet assignment of this particle

    The full semantics, including the list of features for X, are available at https://github.com/jpata/particleflow/blob/v2.3.0/mlpf/heptfds/clic_pf_edm4hep/utils_edm.py and https://github.com/jpata/particleflow/blob/v2.3.0/mlpf/data/key4hep/postprocessing.py.
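
    The bundled clic-tfds.ipynb is the reference for loading the files; as a hedged minimal sketch (the extraction path is hypothetical, and the feature keys are taken from the description above):

    import tensorflow_datasets as tfds

    # Point the builder at a dataset directory extracted from one of the .zip files.
    builder = tfds.builder_from_directory('path/to/extracted/dataset')

    # array_record-backed datasets can be read as a random-access data source,
    # usable from either tensorflow or pytorch training loops.
    ds = builder.as_data_source(split='train')
    event = ds[0]
    print(event['X'].shape)  # tracks and clusters for one event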

  6. imagenet_v2

    • tensorflow.google.cn
    • tensorflow.org
    Updated Dec 2, 2021
    Cite: (2021). imagenet_v2 [Dataset]. https://tensorflow.google.cn/datasets/catalog/imagenet_v2?hl=zh-cn
    Description

    ImageNet-v2 is an ImageNet test set (10 images per class) collected by closely following the original labelling protocol. Each image has been labelled by at least 10 MTurk workers, possibly more, and depending on the strategy used to select which images to include among the 10 chosen for the given class, there are three different versions of the dataset. Please refer to section four of the paper for more details on how the different variants were compiled.

    The label space is the same as that of ImageNet2012. Each example is represented as a dictionary with the following keys:

    • 'image': The image, a (H, W, 3)-tensor.
    • 'label': An integer in the range [0, 1000).
    • 'file_name': A unique string identifying the example within the dataset.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('imagenet_v2', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/imagenet_v2-matched-frequency-3.0.0.png

  7. food101

    • tensorflow.org
    • opendatalab.com
    • +3 more
    Updated Nov 23, 2022
    Cite: (2022). food101 [Dataset]. https://www.tensorflow.org/datasets/catalog/food101
    Description

    This dataset consists of 101 food categories, with 101'000 images. For each class, 250 manually reviewed test images are provided as well as 750 training images. On purpose, the training images were not cleaned, and thus still contain some amount of noise. This comes mostly in the form of intense colors and sometimes wrong labels. All images were rescaled to have a maximum side length of 512 pixels.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('food101', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/food101-2.0.0.png

  8. Anime Subtitles

    • kaggle.com
    Updated Aug 19, 2021
    Cite: Jess Fan (2021). Anime Subtitles [Dataset]. https://www.kaggle.com/datasets/jef1056/anime-subtitles/code
    Available download formats: zip (103,874,640 bytes)
    Authors
    Jess Fan
    Description

    Content

    The original extracted versions (in .srt and .ass format) are also included in this release (which, idk why, but Kaggle decompressed >:U).

    This dataset contains 1,497,770 messages across 3,836 episodes of anime. The raw dataset contains 1,563,442 messages, some of which were removed during cleaning.

    This version (V4) adapts the original (frankly, terrible) format into the newer format I developed, which is used in https://github.com/JEF1056/clean-discord. The Dataset folder contains compressed text files, which are compatible with TensorFlow datasets. These can be streamed as a TextLineDataset in the TSV format.

    V4 also fixes many (but not all) issues that the original cleaning script was too simple to realistically take care of. It also uses the clean-discord cleaner algorithms to make sentences more natural language than formatting. The script has also been optimized to run on multi-core systems, allowing it to complete cleaning this entire dataset in under 30 seconds on a 4-core machine. See the new and improved script here: https://github.com/JEF1056/clean-discord/blob/v1.2/misc/anime.py (no longer bundled in the dataset files)

    Format

    The files are now all compressed to save space and are compatible with TensorFlow datasets. You can initialize a dataset function as follows (imports added for completeness):

    import functools
    import os
    import tensorflow as tf

    def dataset_fn_local(split, shuffle_files=False):
      global nq_tsv_path
      del shuffle_files
      # Load lines from the text files as examples.
      files_to_read = [os.path.join(nq_tsv_path[split], filename)
               for filename in os.listdir(nq_tsv_path[split])
               if filename.startswith(split)]
      print(f"Split {split} contains {len(files_to_read)} files. First 10: {files_to_read[0:10]}")
      # Drop empty lines, shuffle, and parse each TSV line into (question, answer).
      ds = tf.data.TextLineDataset(files_to_read, compression_type="GZIP")
      ds = ds.filter(lambda line: tf.not_equal(tf.strings.length(line), 0))
      ds = ds.shuffle(buffer_size=600000)
      ds = ds.map(functools.partial(tf.io.decode_csv, record_defaults=["", ""],
                      field_delim="\t", use_quote_delim=False),
            num_parallel_calls=tf.data.experimental.AUTOTUNE)
      ds = ds.map(lambda *ex: dict(zip(["question", "answer"], ex)))
      return ds

    Acknowledgements

    A sincere thanks to all of my friends for helping me come up with anime titles, a shoutout to the talented and dedicated people translating Japanese anime, and an even bigger thanks to Leen Chan for compiling the actual subtitles.

    This dataset is far from complete! I hope that people who are willing to find, add, and clean the data are out there, and will do their best to help out in the effort to grow this dataset.

  9. civil_comments

    • tensorflow.org
    • huggingface.co
    Updated Feb 28, 2023
    Cite: (2023). civil_comments [Dataset]. https://www.tensorflow.org/datasets/catalog/civil_comments
    Description

    This version of the CivilComments Dataset provides access to the primary seven labels that were annotated by crowd workers; the toxicity and other tags are values between 0 and 1 indicating the fraction of annotators that assigned these attributes to the comment text.

    The other tags are only available for a fraction of the input examples. They are currently ignored for the main dataset; the CivilCommentsIdentities set includes those labels, but only consists of the subset of the data with them. The other attributes that were part of the original CivilComments release are included only in the raw data. See the Kaggle documentation for more details about the available features.

    The comments in this dataset come from an archive of the Civil Comments platform, a commenting plugin for independent news sites. These public comments were created from 2015 - 2017 and appeared on approximately 50 English-language news sites across the world. When Civil Comments shut down in 2017, they chose to make the public comments available in a lasting open archive to enable future research. The original data, published on figshare, includes the public comment text, some associated metadata such as article IDs, publication IDs, timestamps and commenter-generated "civility" labels, but does not include user ids. Jigsaw extended this dataset by adding additional labels for toxicity, identity mentions, as well as covert offensiveness. This data set is an exact replica of the data released for the Jigsaw Unintended Bias in Toxicity Classification Kaggle challenge. This dataset is released under CC0, as is the underlying comment text.

    For comments that have a parent_id also in the civil comments data, the text of the previous comment is provided as the "parent_text" feature. Note that the splits were made without regard to this information, so using previous comments may leak some information. The annotators did not have access to the parent text when making the labels.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('civil_comments', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

  10. Data from: Fashion Mnist Dataset

    • universe.roboflow.com
    • opendatalab.com
    • +3 more
    Updated Aug 10, 2022
    Cite: Popular Benchmarks (2022). Fashion Mnist Dataset [Dataset]. https://universe.roboflow.com/popular-benchmarks/fashion-mnist-ztryt/model/3
    Available download formats: zip
    Dataset authored and provided by
    Popular Benchmarks
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Clothing
    Description

    Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms

    Dataset Obtained From: https://github.com/zalandoresearch/fashion-mnist

    All images were sized 28x28 in the original dataset

    Fashion-MNIST is a dataset of Zalando's article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. We intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits. (Source: https://github.com/zalandoresearch/fashion-mnist)

    Here's an example of how the data looks (each class takes three rows): https://github.com/zalandoresearch/fashion-mnist/raw/master/doc/img/fashion-mnist-sprite.png
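
    For quick experiments with the same underlying data (rather than this Roboflow export), the original arrays also ship with Keras; a minimal sketch:

    import tensorflow as tf

    # Loads the original 28x28 grayscale Fashion-MNIST arrays.
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
    print(x_train.shape, y_train.shape)  # (60000, 28, 28) (60000,)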

    Version 1 (original-images_Original-FashionMNIST-Splits):

    • Original images, with the original splits for MNIST: train (86% of images - 60,000 images) set and test (14% of images - 10,000 images) set only.
    • This version was not trained

    Version 3 (original-images_trainSetSplitBy80_20):

    • Original, raw images, with the train set split to provide 80% of its images to the training set and 20% of its images to the validation set
    • Train/test split guidance: https://blog.roboflow.com/train-test-split/ (Train/Valid/Test Split Rebalancing diagram: https://i.imgur.com/angfheJ.png)

    Citation:

    @online{xiao2017/online,
     author    = {Han Xiao and Kashif Rasul and Roland Vollgraf},
     title    = {Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms},
     date     = {2017-08-28},
     year     = {2017},
     eprintclass = {cs.LG},
     eprinttype  = {arXiv},
     eprint    = {cs.LG/1708.07747},
    }
    
  11. placesfull

    • tensorflow.org
    Updated Dec 16, 2022
    Cite: (2022). placesfull [Dataset]. https://www.tensorflow.org/datasets/catalog/placesfull
    Description

    The Places dataset is designed following principles of human visual cognition. Our goal is to build a core of visual knowledge that can be used to train artificial systems for high-level visual understanding tasks, such as scene context, object recognition, action and event prediction, and theory-of-mind inference.

    The semantic categories of Places are defined by their function: the labels represent the entry-level categories of an environment. To illustrate, the dataset has different categories of bedrooms, streets, etc., as one does not act the same way, and does not make the same predictions of what can happen next, in a home bedroom, a hotel bedroom, or a nursery. In total, Places contains more than 10 million images comprising 400+ unique scene categories. The dataset features 5,000 to 30,000 training images per class, consistent with real-world frequencies of occurrence. Using convolutional neural networks (CNNs), the Places dataset allows learning of deep scene features for various scene recognition tasks, with the goal of establishing new state-of-the-art performance on scene-centric benchmarks.

    Here we provide the Places Database and the trained CNNs for academic research and education purposes.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('placesfull', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/placesfull-1.0.0.png

  12. coco

    • tensorflow.org
    • huggingface.co
    Updated Jun 1, 2024
    Cite: (2024). coco [Dataset]. https://www.tensorflow.org/datasets/catalog/coco
    Description

    COCO is a large-scale object detection, segmentation, and captioning dataset.

    Note:

    • Some images from the train and validation sets don't have annotations.
    • COCO 2014 and 2017 use the same images, but different train/val/test splits.
    • The test split doesn't have any annotations (only images).
    • COCO defines 91 classes, but the data only uses 80 classes.
    • Panoptic annotations define 200 classes, but only 133 are used.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('coco', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/coco-2014-1.1.0.png

  13. Data from: Cats vs. Dogs

    • kaggle.com
    Updated Jul 30, 2023
    Cite: Sree Teja Dusi (2023). Cats vs. Dogs [Dataset]. https://www.kaggle.com/datasets/sreetejadusi/cats-vs-dogs/code
    Available download formats: zip (68,612,859 bytes)
    Authors
    Sree Teja Dusi
    License

    GNU General Public License v2.0: http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    This dataset has cleaned-up data, i.e. corrupted images have been removed, and it is split into Train and Validation sets for ease of application. Dataset provided by tensorflow.org.

    sreeteja.dev

  14. Graph topological features extracted from expression profiles of...

    • data-staging.niaid.nih.gov
    Updated Jan 24, 2020
    Cite: Tranchevent, Léon-Charles; Azuaje, Francisco; Rajapakse, Jagath C (2020). Graph topological features extracted from expression profiles of neuroblastoma patients [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_3357673
    Dataset provided by
    Luxembourg Institute of Health
    Nanyang Technological University
    Authors
    Tranchevent, Léon-Charles; Azuaje, Francisco; Rajapakse, Jagath C
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    This dataset contains the data described in the paper titled "A deep neural network approach to predicting clinical outcomes of neuroblastoma patients" by Tranchevent, Azuaje and Rajapakse. More precisely, this dataset contains the topological features extracted from graphs built from publicly available expression data (see details below). This dataset does not contain the original expression data, which are available elsewhere. We thank the scientists who generated and shared these data (please see below the relevant links and publications).

    Content

    File names start with the name of the publicly available dataset they are built on (among "Fischer", "Maris" and "Versteeg"). This name is followed by a tag indicating whether they contain raw data ("raw", which means, in this case, the raw topological features) or TensorFlow-formatted data ("TF"). This tag is then followed by a unique identifier representing a unique configuration. The configuration file "Global_configuration.tsv" contains details about these configurations, such as which topological features are present and which clinical outcome is considered.

    The code associated with the same manuscript that uses these data is at https://gitlab.com/biomodlih/SingalunDeep. The procedure by which the raw data are transformed into the TensorFlow-ready data is described in the paper.

    File format

    All files are TSV files that correspond to matrices with samples as rows and features as columns (or clinical data as columns for clinical data files). The data files contain various sets of topological features that were extracted from the sample graphs (or Patient Similarity Networks - PSN). The clinical files contain relevant clinical outcomes.

    The raw data files only contain the topological data. For instance, the file "Fischer_raw_2d0000_data_tsv" contains 24 values for each sample, corresponding to the 12 centralities computed for both the microarray (Fischer-M) and RNA-seq (Fischer-R) datasets. The TensorFlow-ready files do not contain the sample identifiers in the first column. However, they contain two extra columns at the end. The first extra column is the sample weights (for the classifiers, and because we very often have a dominant class). The second extra column is the class labels (binary), based on the clinical outcome of interest.
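
    A minimal sketch of separating those columns (assuming a plain numeric TSV with no header row; the file name is illustrative):

    import numpy as np

    # Features are all but the last two columns; then sample weights, then labels.
    data = np.loadtxt('path/to/a_TF_file.tsv', delimiter='\t')
    features, weights, labels = data[:, :-2], data[:, -2], data[:, -1]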

    Dataset details

    The Fischer dataset is used to train, evaluate and validate the models, so the dataset is split into train / eval / valid files, which contain respectively 249, 125 and 124 rows (samples) of the original 498 samples. In contrast, the other two datasets (Maris and Versteeg) are smaller and are only used for validation (and therefore have no training or evaluation file).

    The Fischer dataset also has more data files because various configurations were tested (see manuscript). In contrast, the validation using the Maris and Versteeg datasets is only done for a single configuration, and there are therefore fewer files.

    For Fischer, a few configurations are listed in the global configuration file but there is no corresponding raw data. This is because these items are derived from concatenations of the original raw data (see global configuration file and manuscript for details).

    References

    This dataset is associated with Tranchevent L., Azuaje F., Rajapakse J.C., A deep neural network approach to predicting clinical outcomes of neuroblastoma patients.

    If you use these data in your research, please do not forget to also cite the researchers who have generated the original expression datasets.

    Fischer dataset:

    Zhang W. et al., Comparison of RNA-seq and microarray-based models for clinical endpoint prediction. Genome Biology 16(1) (2015). doi:10.1186/s13059-015-0694-1

    Wang C. et al., The concordance between RNA-seq and microarray data depends on chemical treatment and transcript abundance. Nat. Biotechnol. 32(9), 926–932. doi:10.1038/nbt.3001

    Versteeg dataset:

    Molenaar J.J. et al., Sequencing of neuroblastoma identifies chromothripsis and defects in neuritogenesis genes. Nature 483(7391), 589–593. doi:10.1038/nature10910

    Maris dataset:

    Wang Q. et al., Integrative genomics identifies distinct molecular classes of neuroblastoma and shows that multiple genes are targeted by regional alterations in DNA copy number. Cancer Res. 66(12), 6050–6062. doi:10.1158/0008-5472.CAN-05-4618

  15. Animals (Cats, Dogs, and Snakes)

    • kaggle.com
    Updated Nov 18, 2025
    Cite: Omar Rehan (2025). Animals (Cats, Dogs, and Snakes) [Dataset]. https://www.kaggle.com/datasets/aiomarrehan/animals-cats-dogs-and-snakes
    Available download formats: zip (40,219,983 bytes)
    Authors
    Omar Rehan
    Description

    Cats, Dogs, and Snakes Dataset

    Dataset Overview

    The dataset contains images of three animal classes: Cats, Dogs, and Snakes. It is balanced and cleaned, designed for supervised image classification tasks.

    Class  | Number of Images | Description
    Cats   | 1,000            | Includes multiple breeds and poses
    Dogs   | 1,000            | Covers various breeds and backgrounds
    Snakes | 1,000            | Includes multiple species and natural settings

    Total Images: 3,000

    Image Properties:

    • Resolution: 224×224 pixels (resized for consistency)
    • Color Mode: RGB
    • Format: JPEG/PNG
    • Cleaned: Duplicate, blurry, and irrelevant images removed

    Data Split Recommendation

    Set        | Percentage | Number of Images
    Training   | 70%        | 2,100
    Validation | 15%        | 450
    Test       | 15%        | 450

    Preprocessing

    Images in the dataset have been standardized to support machine learning pipelines:

    1. Resizing to 224×224 pixels.
    2. Normalization of pixel values to [0,1] or mean subtraction for deep learning frameworks.
    3. Label encoding: Integer encoding (0 = Cat, 1 = Dog, 2 = Snake) or one-hot encoding for model training.

    Example: Loading and Using the Dataset (Python)

    import os
    import tensorflow as tf
    from tensorflow.keras.preprocessing.image import ImageDataGenerator
    
    # Path to dataset
    dataset_path = "path/to/dataset"
    
    # ImageDataGenerator for preprocessing
    datagen = ImageDataGenerator(
      rescale=1./255,
      validation_split=0.15 # 15% for validation
    )
    
    # Load training data
    train_generator = datagen.flow_from_directory(
      dataset_path,
      target_size=(224, 224),
      batch_size=32,
      class_mode='categorical',
      subset='training',
      shuffle=True
    )
    
    # Load validation data
    validation_generator = datagen.flow_from_directory(
      dataset_path,
      target_size=(224, 224),
      batch_size=32,
      class_mode='categorical',
      subset='validation',
      shuffle=False
    )
    
    # Example: Iterate over one batch
    images, labels = next(train_generator)
    print(images.shape, labels.shape) # (32, 224, 224, 3) (32, 3)
    

    Key Features

    • Balanced: Equal number of samples per class reduces bias.
    • Cleaned: High-quality, relevant images improve model performance.
    • Diverse: Covers multiple breeds, species, and environments to ensure generalization.
    • Ready for ML: Preprocessed and easily integrated into popular deep learning frameworks.
  16. Brain Tumor Dataset

    • columbia.redivis.com
    • redivis.com
    Updated Feb 7, 2024
    Cite: Columbia Data Platform Demo (2024). Brain Tumor Dataset [Dataset]. https://columbia.redivis.com/datasets/avkx-f78pchg53
    Available download formats: arrow, spss, application/jsonl, parquet, stata, csv, avro, sas
    Dataset provided by
    Redivis Inc.
    Authors
    Columbia Data Platform Demo
    Time period covered
    Feb 7, 2024
    Description

    Abstract

    A dataset of training and test images for the Brain Tumor Identification notebook found at: https://www.kaggle.com/code/faridtaghiyev/brain-tumor-detection-using-tensorflow-2-x/notebook

    Methodology

    The dataset comprises MRI images labeled for brain tumor presence. Images are split into training (70%), validation (15%), and test (15%) sets. Preprocessing includes resizing to 256x256 pixels, normalization, and augmentation (rotation, flipping). Models are trained using TensorFlow on a CNN architecture, optimized with Adam, and evaluated based on accuracy, precision, recall, and F1-score.
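
    A minimal sketch of the preprocessing described above (function and parameter names are illustrative, not taken from the linked notebook):

    import tensorflow as tf

    def preprocess(image, label, training=False):
        # Resize to 256x256 and scale pixel values to [0, 1].
        image = tf.image.resize(image, [256, 256])
        image = tf.cast(image, tf.float32) / 255.0
        if training:
            # Augmentation: random horizontal flip and random 90-degree rotation.
            image = tf.image.random_flip_left_right(image)
            image = tf.image.rot90(image, k=tf.random.uniform([], 0, 4, tf.int32))
        return image, label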

    Usage

    This public dataset is available for non-commercial use. Any publications or derivatives from these data must credit the original source. Please cite appropriately when using or referencing this dataset in any capacity.

  17. conll2003

    • huggingface.co
    • opendatalab.com
    • +1 more
    Updated Sep 12, 2023
    Cite: Tom Aarsen (2023). conll2003 [Dataset]. https://huggingface.co/datasets/tomaarsen/conll2003
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Authors
    Tom Aarsen
    License

    Other: https://choosealicense.com/licenses/other/

    Description

    The shared task of CoNLL-2003 concerns language-independent named entity recognition. We will concentrate on four types of named entities: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups.

    The CoNLL-2003 shared task data files contain four columns separated by a single space. Each word has been put on a separate line and there is an empty line after each sentence. The first item on each line is a word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag and the fourth the named entity tag. The chunk tags and the named entity tags have the format I-TYPE, which means that the word is inside a phrase of type TYPE. Only if two phrases of the same type immediately follow each other will the first word of the second phrase have the tag B-TYPE, to show that it starts a new phrase. A word with tag O is not part of a phrase. Note that this dataset uses the IOB2 tagging scheme, whereas the original dataset uses IOB1.

    For more details see https://www.clips.uantwerpen.be/conll2003/ner/ and https://www.aclweb.org/anthology/W03-0419
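
    As an illustration of that layout (a minimal sketch, not the Hugging Face loader), a parser that collects (token, NER tag) pairs per sentence:

    # Parse a CoNLL-2003-style file: one "word POS chunk NER" line per token,
    # blank lines between sentences, and -DOCSTART- lines between documents.
    def read_conll(path):
        sentences, current = [], []
        with open(path, encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith('-DOCSTART-'):
                    if current:
                        sentences.append(current)
                        current = []
                    continue
                word, pos, chunk, ner = line.split()
                current.append((word, ner))
        if current:
            sentences.append(current)
        return sentences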

  18. imagenet_a

    • tensorflow.org
    Updated Jun 1, 2024
    Cite: (2024). imagenet_a [Dataset]. https://www.tensorflow.org/datasets/catalog/imagenet_a
    Description

    ImageNet-A is a set of images labelled with ImageNet labels that were obtained by collecting new data and keeping only those images that ResNet-50 models fail to correctly classify. For more details please refer to the paper.

    The label space is the same as that of ImageNet2012. Each example is represented as a dictionary with the following keys:

    • 'image': The image, a (H, W, 3)-tensor.
    • 'label': An integer in the range [0, 1000).
    • 'file_name': A unique string identifying the example within the dataset.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('imagenet_a', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/imagenet_a-0.1.0.png

  19. 200000 Medical Research Paper Abstracts

    • kaggle.com
    Updated Jan 14, 2022
    Cite: Anshul Mehta (2022). 200000 Medical Research Paper Abstracts [Dataset]. https://www.kaggle.com/datasets/anshulmehtakaggl/200000-abstracts-for-seq-sentence-classification/code
    Available download formats: zip (251,885,526 bytes)
    Authors
    Anshul Mehta
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    I was working with this dataset as part of a project for a TensorFlow course I was taking. It seemed to be a very interesting problem. You can check the course here.

    Content

    In the training set, each sentence of an abstract is labelled as Objective, Methods, Results, Conclusions, etc. The aim is to make sure that our model is able to split the test data, or any other abstract for that matter, in the same way, making complicated abstracts much easier to read.
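
    A hedged sketch of reading such a file, assuming the PubMed RCT layout used by the source repository linked below under Acknowledgements (lines of "###<abstract_id>" followed by LABEL<TAB>sentence pairs):

    # Group labelled sentences by abstract.
    def read_abstracts(path):
        abstracts, current = [], []
        with open(path, encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if line.startswith('###'):  # a new abstract begins
                    if current:
                        abstracts.append(current)
                    current = []
                elif line:
                    label, sentence = line.split('\t', 1)
                    current.append((label, sentence))
        if current:
            abstracts.append(current)
        return abstracts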

    Acknowledgements

    Inspiration for the project: https://arxiv.org/abs/1710.06071
    Data belongs to: https://github.com/Franck-Dernoncourt/pubmed-rct

  20. nyu_depth_v2

    • huggingface.co
    • tensorflow.org
    Updated Mar 1, 2023
    Cite: Sayak Paul (2023). nyu_depth_v2 [Dataset]. https://huggingface.co/datasets/sayakpaul/nyu_depth_v2
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Authors
    Sayak Paul
    License

    Apache License 2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    The NYU-Depth V2 data set is comprised of video sequences from a variety of indoor scenes as recorded by both the RGB and Depth cameras from the Microsoft Kinect.
