100+ datasets found

T
imagenet2012
tensorflow.org
Updated Jun 1, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). imagenet2012 [Dataset]. https://www.tensorflow.org/datasets/catalog/imagenet2012
Explore at:
Dataset updated
Jun 1, 2024
Description
ILSVRC 2012, commonly known as 'ImageNet' is an image dataset organized according to the WordNet hierarchy. Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a "synonym set" or "synset". There are more than 100,000 synsets in WordNet, majority of them are nouns (80,000+). In ImageNet, we aim to provide on average 1000 images to illustrate each synset. Images of each concept are quality-controlled and human-annotated. In its completion, we hope ImageNet will offer tens of millions of cleanly sorted images for most of the concepts in the WordNet hierarchy.

The test split contains 100K images but no labels because no labels have been publicly released. We provide support for the test split from 2012 with the minor patch released on October 10, 2019. In order to manually download this data, a user must perform the following operations:

Download the 2012 test split available here.

Download the October 10, 2019 patch. There is a Google Drive link to the patch provided on the same page.

Combine the two tar-balls, manually overwriting any images in the original archive with images from the patch. According to the instructions on image-net.org, this procedure overwrites just a few images.

The resulting tar-ball may then be processed by TFDS.

To assess the accuracy of a model on the ImageNet test split, one must run inference on all images in the split, export those results to a text file that must be uploaded to the ImageNet evaluation server. The maintainers of the ImageNet evaluation server permits a single user to submit up to 2 submissions per week in order to prevent overfitting.

To evaluate the accuracy on the test split, one must first create an account at image-net.org. This account must be approved by the site administrator. After the account is created, one can submit the results to the test server at https://image-net.org/challenges/LSVRC/eval_server.php The submission consists of several ASCII text files corresponding to multiple tasks. The task of interest is "Classification submission (top-5 cls error)". A sample of an exported text file looks like the following:

771 778 794 387 650 363 691 764 923 427 737 369 430 531 124 755 930 755 59 168

The export format is described in full in "readme.txt" within the 2013 development kit available here: https://image-net.org/data/ILSVRC/2013/ILSVRC2013_devkit.tgz Please see the section entitled "3.3 CLS-LOC submission format". Briefly, the format of the text file is 100,000 lines corresponding to each image in the test split. Each line of integers correspond to the rank-ordered, top 5 predictions for each test image. The integers are 1-indexed corresponding to the line number in the corresponding labels file. See labels.txt.

To use this dataset:

import tensorflow_datasets as tfds ds = tfds.load('imagenet2012', split='train') for ex in ds.take(4): print(ex)

See the guide for more informations on tensorflow_datasets.

https://storage.googleapis.com/tfds-data/visualization/fig/imagenet2012-5.1.0.png" alt="Visualization" width="500px">
T
cifar10
tensorflow.org
opendatalab.com
+3more
Updated Jun 1, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). cifar10 [Dataset]. https://www.tensorflow.org/datasets/catalog/cifar10
Explore at:
Dataset updated
Jun 1, 2024
Description
The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

To use this dataset:

import tensorflow_datasets as tfds ds = tfds.load('cifar10', split='train') for ex in ds.take(4): print(ex)

See the guide for more informations on tensorflow_datasets.

https://storage.googleapis.com/tfds-data/visualization/fig/cifar10-3.0.2.png" alt="Visualization" width="500px">
Dataset for "Enhancing Cloud Detection in Sentinel-2 Imagery: A...
zenodo.org
data.niaid.nih.gov
bin
Updated Feb 4, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gong Chengjuan; Yin Ranyu; Yin Ranyu; Long Tengfei; Long Tengfei; He Guojin; Jiao Weili; Wang Guizhou; Gong Chengjuan; He Guojin; Jiao Weili; Wang Guizhou (2024). Dataset for "Enhancing Cloud Detection in Sentinel-2 Imagery: A Spatial-Temporal Approach and Dataset" [Dataset]. http://doi.org/10.5281/zenodo.10613705
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.10613705
Dataset updated
Feb 4, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Gong Chengjuan; Yin Ranyu; Yin Ranyu; Long Tengfei; Long Tengfei; He Guojin; Jiao Weili; Wang Guizhou; Gong Chengjuan; He Guojin; Jiao Weili; Wang Guizhou
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
This dataset is built for time-series Sentinel-2 cloud detection and stored in Tensorflow TFRecord (refer to https://www.tensorflow.org/tutorials/load_data/tfrecord).

Each file is compressed in 7z format and can be decompressed using Bandzip or 7-zip software.

Dataset Structure:

Each filename can be split into three parts using underscores. The first part indicates whether it is designated for training or validation ('train' or 'val'); the second part indicates the Sentinel-2 tile name, and the last part indicates the number of samples in this file.

For each sample, it includes:

Sample ID;

Array of time series 4 band image patches in 10m resolution, shaped as (n_timestamps, 4, 42, 42);

Label list indicating cloud cover status for the center \(6\times6\) pixels of each timestamp;

Ordinal list for each timestamp;

Sample weight list (reserved);

Here is a demonstration function for parsing the TFRecord file:

import tensorflow as tf # init Tensorflow Dataset from file name def parseRecordDirect(fname): sep = '/' parts = tf.strings.split(fname,sep) tn = tf.strings.split(parts[-1],sep='_')[-2] nn = tf.strings.to_number(tf.strings.split(parts[-1],sep='_')[-1],tf.dtypes.int64) t = tf.data.Dataset.from_tensors(tn).repeat().take(nn) t1 = tf.data.TFRecordDataset(fname) ds = tf.data.Dataset.zip((t, t1)) return ds keys_to_features_direct = { 'localid': tf.io.FixedLenFeature([], tf.int64, -1), 'image_raw_ldseries': tf.io.FixedLenFeature((), tf.string, ''), 'labels': tf.io.FixedLenFeature((), tf.string, ''), 'dates': tf.io.FixedLenFeature((), tf.string, ''), 'weights': tf.io.FixedLenFeature((), tf.string, '') } # The Decoder (Optional) class SeriesClassificationDirectDecorder(decoder.Decoder): """A tf.Example decoder for tfds classification datasets.""" def _init_(self) -> None: super()._init_() def decode(self, tid, ds): parsed = tf.io.parse_single_example(ds, keys_to_features_direct) encoded = parsed['image_raw_ldseries'] labels_encoded = parsed['labels'] decoded = tf.io.decode_raw(encoded, tf.uint16) label = tf.io.decode_raw(labels_encoded, tf.int8) dates = tf.io.decode_raw(parsed['dates'], tf.int64) weight = tf.io.decode_raw(parsed['weights'], tf.float32) decoded = tf.reshape(decoded,[-1,4,42,42]) sample_dict = { 'tid': tid, # tile ID 'dates': dates, # Date list 'localid': parsed['localid'], # sample ID 'imgs': decoded, # image array 'labels': label, # label list 'weights': weight } return sample_dict # simple function def preprocessDirect(tid, record): parsed = tf.io.parse_single_example(record, keys_to_features_direct) encoded = parsed['image_raw_ldseries'] labels_encoded = parsed['labels'] decoded = tf.io.decode_raw(encoded, tf.uint16) label = tf.io.decode_raw(labels_encoded, tf.int8) dates = tf.io.decode_raw(parsed['dates'], tf.int64) weight = tf.io.decode_raw(parsed['weights'], tf.float32) decoded = tf.reshape(decoded,[-1,4,42,42]) return tid, dates, parsed['localid'], decoded, label, weight t1 = parseRecordDirect('filename here') dataset = t1.map(preprocessDirect, num_parallel_calls=tf.data.experimental.AUTOTUNE) #

Class Definition:

0: clear

1: opaque cloud

2: thin cloud

3: haze

4: cloud shadow

5: snow

Dataset Construction:

First, we randomly generate 500 points for each tile, and all these points are aligned to the pixel grid center of the subdatasets in 60m resolution (eg. B10) for consistence when comparing with other products.
It is because that other cloud detection method may use the cirrus band as features, which is in 60m resolution.

Then, the time series image patches of two shapes are cropped with each point as the center.
The patches of shape \(42 \times 42\) are cropped from the bands in 10m resolution (B2, B3, B4, B8) and are used to construct this dataset.
And the patches of shape \(348 \times 348\) are cropped from the True Colour Image (TCI, details see sentinel-2 user guide) file and are used to interpreting class labels.

The samples with a large number of timestamps could be time-consuming in the IO stage, thus the time series patches are divided into different groups with timestamps not exceeding 100 for every group.
T
speech_commands
tensorflow.org
datasets.activeloop.ai
+1more
Updated Jan 13, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2023). speech_commands [Dataset]. http://identifiers.org/arxiv:1804.03209
Explore at:
Unique identifier
https://identifiers.org/arxiv:1804.03209
Dataset updated
Jan 13, 2023
Description
An audio dataset of spoken words designed to help train and evaluate keyword spotting systems. Its primary goal is to provide a way to build and test small models that detect when a single word is spoken, from a set of ten target words, with as few false positives as possible from background noise or unrelated speech. Note that in the train and validation set, the label "unknown" is much more prevalent than the labels of the target words or background noise. One difference from the release version is the handling of silent segments. While in the test set the silence segments are regular 1 second files, in the training they are provided as long segments under "background_noise" folder. Here we split these background noise into 1 second clips, and also keep one of the files for the validation set.

To use this dataset:

import tensorflow_datasets as tfds ds = tfds.load('speech_commands', split='train') for ex in ds.take(4): print(ex)

See the guide for more informations on tensorflow_datasets.
Z
Simulated datasets for detector and particle flow reconstruction: CLIC...
nde-dev.biothings.io
data.niaid.nih.gov
+1more
Updated Mar 21, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mokhtar, Farouk (2025). Simulated datasets for detector and particle flow reconstruction: CLIC detector, machine learning format [Dataset]. https://nde-dev.biothings.io/resources?id=zenodo_8409591
Explore at:
Dataset updated
Mar 21, 2025
Dataset provided by
Kagan, Michael
Garcia, Dolores
Duarte, Javier
Pata, Joosep
Zhang, Mengke
Wulff, Eric
Mokhtar, Farouk
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Synopsis

Machine-learning friendly format of tracks, clusters and target particles in electron-positron events, simulated with the CLIC detector. Ready to be used with jpata/particleflow:v2.3.0. Derived from the EDM4HEP ROOT files in https://zenodo.org/record/8260741.

clic_edm_ttbar_pf.zip: e+e- -> ttbar, center of mass energy at 380 GeV

clic_edm_qq_pf.zip: e+e- -> Z* -> qqbar, center of mass energy at 380 GeV

clic_edm_ww_fullhad_pf.zip: e+e- -> WW -> W decaying hadronically, center of mass energy at 380 GeV

clic-tfds.ipynb: an example notebook on how to load the files

Contents

Each .zip file contains the dataset in the tensorflow-datasets, array_record format. We have split the full datasets into 10 subsets, due to space considerations on zenodo, two subsets from each dataset are uploaded. Each dataset contains a train and test split of events.

Dataset semantics (to be updated)

Each dataset consists of events that can be iterated over using the tensorflow-datasets library and used in either tensorflow or pytorch. Each event has the following information available:

X: the reconstruction input features, i.e. tracks and clusters

ytarget: the ground truth particles with the features ["PDG", "charge", "pt", "eta", "sin_phi", "cos_phi", "energy", "jet_idx"], with "jet_idx" corresponding to the gen-jet assignment of this particle

ycand: the baseline Pandora PF particles with the features ["PDG", "charge", "pt", "eta", "sin_phi", "cos_phi", "energy", "jet_idx"], with "jet_idx" corresponding to the gen-jet assignment of this particle

The full semantics, including the list of features for X, are available at https://github.com/jpata/particleflow/blob/v2.3.0/mlpf/heptfds/clic_pf_edm4hep/utils_edm.py and https://github.com/jpata/particleflow/blob/v2.3.0/mlpf/data/key4hep/postprocessing.py.
T
imagenet_v2
tensorflow.google.cn
tensorflow.org
Updated Dec 2, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2021). imagenet_v2 [Dataset]. https://tensorflow.google.cn/datasets/catalog/imagenet_v2?hl=zh-cn
Explore at:
Dataset updated
Dec 2, 2021
Description
ImageNet-v2 is an ImageNet test set (10 per class) collected by closely following the original labelling protocol. Each image has been labelled by at least 10 MTurk workers, possibly more, and depending on the strategy used to select which images to include among the 10 chosen for the given class there are three different versions of the dataset. Please refer to section four of the paper for more details on how the different variants were compiled.

The label space is the same as that of ImageNet2012. Each example is represented as a dictionary with the following keys:

'image': The image, a (H, W, 3)-tensor.

'label': An integer in the range [0, 1000).

'file_name': A unique sting identifying the example within the dataset.

To use this dataset:

import tensorflow_datasets as tfds ds = tfds.load('imagenet_v2', split='train') for ex in ds.take(4): print(ex)

See the guide for more informations on tensorflow_datasets.

https://storage.googleapis.com/tfds-data/visualization/fig/imagenet_v2-matched-frequency-3.0.0.png" alt="Visualization" width="500px">
T
food101
tensorflow.org
opendatalab.com
+3more
Updated Nov 23, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2022). food101 [Dataset]. https://www.tensorflow.org/datasets/catalog/food101
Explore at:
Dataset updated
Nov 23, 2022
Description
This dataset consists of 101 food categories, with 101'000 images. For each class, 250 manually reviewed test images are provided as well as 750 training images. On purpose, the training images were not cleaned, and thus still contain some amount of noise. This comes mostly in the form of intense colors and sometimes wrong labels. All images were rescaled to have a maximum side length of 512 pixels.

To use this dataset:

import tensorflow_datasets as tfds ds = tfds.load('food101', split='train') for ex in ds.take(4): print(ex)

See the guide for more informations on tensorflow_datasets.

https://storage.googleapis.com/tfds-data/visualization/fig/food101-2.0.0.png" alt="Visualization" width="500px">
Anime Subtitles
kaggle.com
zip
Updated Aug 19, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jess Fan (2021). Anime Subtitles [Dataset]. https://www.kaggle.com/datasets/jef1056/anime-subtitles/code
Explore at:
zip(103874640 bytes)Available download formats
Dataset updated
Aug 19, 2021
Authors
Jess Fan
Description
Content

The original extracted versions (in .srt and .ass format) are also included in this release (which, idk why, but kaggle decompressed >:U)

This dataset contains 1,497,770 messages across 3,836 episodes of anime. The raw dataset contains 1,563,442 messages, some of which were removed during cleaning.

This version (V4) adapts the original (frankly, terrible) format into the newer format I developed, which is used in https://github.com/JEF1056/clean-discord. The Dataset folder contains compressed text files, which are compatable with tensorflow datasets. These can be streamed as a textlinedataset in the TSV format.

V4 also fixes many (but not all) issues that the original cleaning script was too simple to realistically take care of. It also uses the clean-discord cleaner algorithms to make sentences more natural language than formatting. The script has also been optimized to run on multi-core systems, allowing it to complete cleaning this entire dataset in under 30 seconds on a 4-core machine. See the new and impoved script here: https://github.com/JEF1056/clean-discord/blob/v1.2/misc/anime.py (no longer bundled in the dataset files)

Format

The files are now all compressed to save space, and are compatable with tensorflow datasets. You can initialize a dataset function as such: def dataset_fn_local(split, shuffle_files=False): global nq_tsv_path del shuffle_files # Load lines from the text file as examples. files_to_read=[os.path.join(nq_tsv_path[split],filename) for filename in os.listdir(nq_tsv_path[split]) if filename.startswith(split)] print(f"~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Split {split} contains {len(files_to_read)} files. First 10: {files_to_read[0:10]}") ds = tf.data.TextLineDataset(files_to_read, compression_type="GZIP").filter(lambda line:tf.not_equal(tf.strings.length(line),0)) ds = ds.shuffle(buffer_size=600000) ds = ds.map(functools.partial(tf.io.decode_csv, record_defaults=["",""], field_delim="\t", use_quote_delim=False), num_parallel_calls=tf.data.experimental.AUTOTUNE) ds = ds.map(lambda *ex: dict(zip(["question", "answer"], ex))) return ds

Acknowledgements

A sincere thanks to all of my friends for helping me come up with anime titles, a shoutout to the talented and dedicated people translating Japanese anime, and an even bigger thanks to Leen Chan for compiling the actual subtitles.

This dataset is far from complete! I hope that people who are willing to find, add and clean the data are out there, and could do their best to try and help out in the effort to grow this data
T
civil_comments
tensorflow.org
huggingface.co
Updated Feb 28, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2023). civil_comments [Dataset]. https://www.tensorflow.org/datasets/catalog/civil_comments
Explore at:
Dataset updated
Feb 28, 2023
Description
This version of the CivilComments Dataset provides access to the primary seven labels that were annotated by crowd workers, the toxicity and other tags are a value between 0 and 1 indicating the fraction of annotators that assigned these attributes to the comment text.

The other tags are only available for a fraction of the input examples. They are currently ignored for the main dataset; the CivilCommentsIdentities set includes those labels, but only consists of the subset of the data with them. The other attributes that were part of the original CivilComments release are included only in the raw data. See the Kaggle documentation for more details about the available features.

The comments in this dataset come from an archive of the Civil Comments platform, a commenting plugin for independent news sites. These public comments were created from 2015 - 2017 and appeared on approximately 50 English-language news sites across the world. When Civil Comments shut down in 2017, they chose to make the public comments available in a lasting open archive to enable future research. The original data, published on figshare, includes the public comment text, some associated metadata such as article IDs, publication IDs, timestamps and commenter-generated "civility" labels, but does not include user ids. Jigsaw extended this dataset by adding additional labels for toxicity, identity mentions, as well as covert offensiveness. This data set is an exact replica of the data released for the Jigsaw Unintended Bias in Toxicity Classification Kaggle challenge. This dataset is released under CC0, as is the underlying comment text.

For comments that have a parent_id also in the civil comments data, the text of the previous comment is provided as the "parent_text" feature. Note that the splits were made without regard to this information, so using previous comments may leak some information. The annotators did not have access to the parent text when making the labels.

To use this dataset:

import tensorflow_datasets as tfds ds = tfds.load('civil_comments', split='train') for ex in ds.take(4): print(ex)

See the guide for more informations on tensorflow_datasets.
R
Data from: Fashion Mnist Dataset
universe.roboflow.com
opendatalab.com
+3more
zip
Updated Aug 10, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Popular Benchmarks (2022). Fashion Mnist Dataset [Dataset]. https://universe.roboflow.com/popular-benchmarks/fashion-mnist-ztryt/model/3
Explore at:
zipAvailable download formats
Dataset updated
Aug 10, 2022
Dataset authored and provided by
Popular Benchmarks
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Clothing
Description
Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms

Authors:

Han Xiao, Kashif Rasul and Roland Vollgraf

https://arxiv.org/abs/1708.07747

Dataset Obtained From: https://github.com/zalandoresearch/fashion-mnist

All images were sized 28x28 in the original dataset

Fashion-MNIST is a dataset of Zalando's article images—consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. We intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits. * Source

Here's an example of how the data looks (each class takes three-rows): https://github.com/zalandoresearch/fashion-mnist/raw/master/doc/img/fashion-mnist-sprite.png" alt="Visualized Fashion MNIST dataset">

Version 1 (original-images_Original-FashionMNIST-Splits):

Original images, with the original splits for MNIST: train (86% of images - 60,000 images) set and test (14% of images - 10,000 images) set only.

This version was not trained

Version 3 (original-images_trainSetSplitBy80_20):

Original, raw images, with the train set split to provide 80% of its images to the training set and 20% of its images to the validation set

https://blog.roboflow.com/train-test-split/ https://i.imgur.com/angfheJ.png" alt="Train/Valid/Test Split Rebalancing">

Citation:

@online{xiao2017/online, author = {Han Xiao and Kashif Rasul and Roland Vollgraf}, title = {Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms}, date = {2017-08-28}, year = {2017}, eprintclass = {cs.LG}, eprinttype = {arXiv}, eprint = {cs.LG/1708.07747}, }
T
placesfull
tensorflow.org
Updated Dec 16, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2022). placesfull [Dataset]. https://www.tensorflow.org/datasets/catalog/placesfull
Explore at:
Dataset updated
Dec 16, 2022
Description
The Places dataset is designed following principles of human visual cognition. Our goal is to build a core of visual knowledge that can be used to train artificial systems for high-level visual understanding tasks, such as scene context, object recognition, action and event prediction, and theory-of-mind inference.

The semantic categories of Places are defined by their function: the labels represent the entry-level of an environment. To illustrate, the dataset has different categories of bedrooms, or streets, etc, as one does not act the same way, and does not make the same predictions of what can happen next, in a home bedroom, an hotel bedroom or a nursery. In total, Places contains more than 10 million images comprising 400+ unique scene categories. The dataset features 5000 to 30,000 training images per class, consistent with real-world frequencies of occurrence. Using convolutional neural networks (CNN), Places dataset allows learning of deep scene features for various scene recognition tasks, with the goal to establish new state-of-the-art performances on scene-centric benchmarks.

Here we provide the Places Database and the trained CNNs for academic research and education purposes.

To use this dataset:

import tensorflow_datasets as tfds ds = tfds.load('placesfull', split='train') for ex in ds.take(4): print(ex)

See the guide for more informations on tensorflow_datasets.

https://storage.googleapis.com/tfds-data/visualization/fig/placesfull-1.0.0.png" alt="Visualization" width="500px">
T
coco
tensorflow.org
huggingface.co
Updated Jun 1, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). coco [Dataset]. https://www.tensorflow.org/datasets/catalog/coco
Explore at:
Dataset updated
Jun 1, 2024
Description
COCO is a large-scale object detection, segmentation, and captioning dataset.

Note: * Some images from the train and validation sets don't have annotations. * Coco 2014 and 2017 uses the same images, but different train/val/test splits * The test split don't have any annotations (only images). * Coco defines 91 classes but the data only uses 80 classes. * Panotptic annotations defines defines 200 classes but only uses 133.

To use this dataset:

import tensorflow_datasets as tfds ds = tfds.load('coco', split='train') for ex in ds.take(4): print(ex)

See the guide for more informations on tensorflow_datasets.

https://storage.googleapis.com/tfds-data/visualization/fig/coco-2014-1.1.0.png" alt="Visualization" width="500px">
Data from: Cats vs. Dogs
kaggle.com
zip
Updated Jul 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sree Teja Dusi (2023). Cats vs. Dogs [Dataset]. https://www.kaggle.com/datasets/sreetejadusi/cats-vs-dogs/code
Explore at:
zip(68612859 bytes)Available download formats
Dataset updated
Jul 30, 2023
Authors
Sree Teja Dusi
License
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.htmlhttp://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
Description
This dataset has cleaned up data, i.e images which are corrupted are removed, and split into Train and Validation for ease of application. Dataset provided by tensorflow.org.

sreeteja.dev
Z
Graph topological features extracted from expression profiles of...
data-staging.niaid.nih.gov
Updated Jan 24, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Tranchevent, Léon-Charles; Azuaje, Francisco; Rajapakse, Jagath C (2020). Graph topological features extracted from expression profiles of neuroblastoma patients [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_3357673
Explore at:
Dataset updated
Jan 24, 2020
Dataset provided by
Luxembourg Institute of Health
Nanyang Technological University
Authors
Tranchevent, Léon-Charles; Azuaje, Francisco; Rajapakse, Jagath C
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Introduction

This dataset contains the data described in the paper titled "A deep neural network approach to predicting clinical outcomes of neuroblastoma patients." by Tranchevent, Azuaje and Rajapakse. More precisely, this dataset contains the topological features extracted from graphs built from publicly available expression data (see details below). This dataset does not contain the original expression data, which are available elsewhere. We thank the scientists who did generate and share these data (please see below the relevant links and publications).

Content

File names start with the name of the publicly available dataset they are built on (among "Fischer", "Maris" and "Versteeg"). This name is followed by a tag representing whether they contain raw data ("raw", which means, in this case, the raw topological features) or TF formatted data ("TF", which stands for TensorFlow). This tag is then followed by a unique identifier representing a unique configuration. The configuration file "Global_configuration.tsv" contains details about these configurations such as which topological features are present and which clinical outcome is considered.

The code associated to the same manuscript that uses these data is at https://gitlab.com/biomodlih/SingalunDeep. The procedure by which the raw data are transformed into the TensorFlow ready data is described in the paper.

File format

All files are TSV files that correspond to matrices with samples as rows and features as columns (or clinical data as columns for clinical data files). The data files contain various sets of topological features that were extracted from the sample graphs (or Patient Similarity Networks - PSN). The clinical files contain relevant clinical outcomes.

The raw data files only contain the topological data. For instance, the file "Fischer_raw_2d0000_data_tsv" contains 24 values for each sample corresponding to the 12 centralities computed for both the microarray (Fischer-M) and RNA-seq (Fischer-R) datasets. The TensorFlow ready files do not contain the sample identifiers in the first column. However, they contain two extra columns at the end. The first extra column is the sample weights (for the classifiers and because we very often have a dominant class). The second extra column is the class labels (binary), based on the clinical outcome of interest.

Dataset details

The Fischer dataset is used to train, evaluate and validate the models, so the dataset is split into train / eval / valid files, which contains respectively 249, 125 and 124 rows (samples) of the original 498 samples. In contrast, the other two datasets (Maris and Versteeg) are smaller and are only used for validation (and therefore have no training or evaluation file).

The Fischer dataset also has more data files because various configurations were tested (see manuscript). In contrast, the validation, using the Maris and Versteeg datasets is only done for a single configuration and there are therefore less files.

For Fischer, a few configurations are listed in the global configuration file but there is no corresponding raw data. This is because these items are derived from concatenations of the original raw data (see global configuration file and manuscript for details).

References

This dataset is associated with Tranchevent L., Azuaje F.. Rajapakse J.C., A deep neural network approach to predicting clinical outcomes of neuroblastoma patients.

If you use these data in your research, please do not forget to also cite the researchers who have generated the original expression datasets.

Fischer dataset:

Zhang W. et al., Comparison of RNA-seq and microarray-based models for clinical endpoint prediction. Genome Biology 16(1) (2015). doi:10.1186/s13059-015-0694-1

Wang C. et al., The concordance between RNA-seq and microarray data depends on chemical treatment and transcript abundance. Nat. Biotechnol. 32(9), 926–932. doi:10.1038/nbt.3001

Versteeg dataset:

Molenaar J.J. et al., Sequencing of neuroblastoma identifies chromothripsis and defects in neuritogenesis genes. Nature 483(7391), 589–593. doi:10.1038/nature10910

Maris dataset:

Wang Q. et al., Integrative genomics identifies distinct molecular classes of neuroblastoma and shows that multiple genes are targeted by regional alterations in DNA copy number. Cancer Res. 66(12), 6050–6062. doi:10.1158/0008-5472.CAN-05-4618
Animals (Cats, Dogs, and Snakes)
kaggle.com
zip
Updated Nov 18, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Omar Rehan (2025). Animals (Cats, Dogs, and Snakes) [Dataset]. https://www.kaggle.com/datasets/aiomarrehan/animals-cats-dogs-and-snakes
Explore at:
zip(40219983 bytes)Available download formats
Dataset updated
Nov 18, 2025
Authors
Omar Rehan
Description
Cats, Dogs, and Snakes Dataset

Dataset Overview

The dataset contains images of three animal classes: Cats, Dogs, and Snakes. It is balanced and cleaned, designed for supervised image classification tasks.

Class Number of Images Description
Cats 1,000 Includes multiple breeds and poses
Dogs 1,000 Covers various breeds and backgrounds
Snakes 1,000 Includes multiple species and natural settings

Total Images: 3,000

Image Properties:

Resolution: 224×224 pixels (resized for consistency)

Color Mode: RGB

Format: JPEG/PNG

Cleaned: Duplicate, blurry, and irrelevant images removed

Data Split Recommendation

Set Percentage Number of Images
Training 70% 2,100
Validation 15% 450
Test 15% 450

Preprocessing

Images in the dataset have been standardized to support machine learning pipelines:

Resizing to 224×224 pixels.

Normalization of pixel values to [0,1] or mean subtraction for deep learning frameworks.

Label encoding: Integer encoding (0 = Cat, 1 = Dog, 2 = Snake) or one-hot encoding for model training.

Example: Loading and Using the Dataset (Python)

import os import tensorflow as tf from tensorflow.keras.preprocessing.image import ImageDataGenerator # Path to dataset dataset_path = "path/to/dataset" # ImageDataGenerator for preprocessing datagen = ImageDataGenerator( rescale=1./255, validation_split=0.15 # 15% for validation ) # Load training data train_generator = datagen.flow_from_directory( dataset_path, target_size=(224, 224), batch_size=32, class_mode='categorical', subset='training', shuffle=True ) # Load validation data validation_generator = datagen.flow_from_directory( dataset_path, target_size=(224, 224), batch_size=32, class_mode='categorical', subset='validation', shuffle=False ) # Example: Iterate over one batch images, labels = next(train_generator) print(images.shape, labels.shape) # (32, 224, 224, 3) (32, 3)

Key Features

Balanced: Equal number of samples per class reduces bias.

Cleaned: High-quality, relevant images improve model performance.

Diverse: Covers multiple breeds, species, and environments to ensure generalization.

Ready for ML: Preprocessed and easily integrated into popular deep learning frameworks.
Brain Tumor Dataset
columbia.redivis.com
redivis.com
application/jsonl +7
Updated Feb 7, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Columbia Data Platform Demo (2024). Brain Tumor Dataset [Dataset]. https://columbia.redivis.com/datasets/avkx-f78pchg53
Explore at:
arrow, spss, application/jsonl, parquet, stata, csv, avro, sasAvailable download formats
Dataset updated
Feb 7, 2024
Dataset provided by
Redivis Inc.
Authors
Columbia Data Platform Demo
Time period covered
Feb 7, 2024
Description
Abstract

A dataset of training and test images for the Brain Tumor Identification notebook found at: https://www.kaggle.com/code/faridtaghiyev/brain-tumor-detection-using-tensorflow-2-x/notebook

Methodology

The dataset comprises MRI images labeled for brain tumor presence. Images are split into training (70%), validation (15%), and test (15%) sets. Preprocessing includes resizing to 256x256 pixels, normalization, and augmentation (rotation, flipping). Models are trained using TensorFlow on a CNN architecture, optimized with Adam, and evaluated based on accuracy, precision, recall, and F1-score.

Usage

This public dataset is available for non-commercial use. Any publications or derivatives from these data must credit the original source. Please cite appropriately when using or referencing this dataset in any capacity.
h
conll2003
huggingface.co
opendatalab.com
+1more
Updated Sep 12, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Tom Aarsen (2023). conll2003 [Dataset]. https://huggingface.co/datasets/tomaarsen/conll2003
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 12, 2023
Authors
Tom Aarsen
License
https://choosealicense.com/licenses/other/https://choosealicense.com/licenses/other/
Description
The shared task of CoNLL-2003 concerns language-independent named entity recognition. We will concentrate on four types of named entities: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups.

The CoNLL-2003 shared task data files contain four columns separated by a single space. Each word has been put on a separate line and there is an empty line after each sentence. The first item on each line is a word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag and the fourth the named entity tag. The chunk tags and the named entity tags have the format I-TYPE which means that the word is inside a phrase of type TYPE. Only if two phrases of the same type immediately follow each other, the first word of the second phrase will have tag B-TYPE to show that it starts a new phrase. A word with tag O is not part of a phrase. Note the dataset uses IOB2 tagging scheme, whereas the original dataset uses IOB1.

For more details see https://www.clips.uantwerpen.be/conll2003/ner/ and https://www.aclweb.org/anthology/W03-0419
T
imagenet_a
tensorflow.org
Updated Jun 1, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). imagenet_a [Dataset]. https://www.tensorflow.org/datasets/catalog/imagenet_a
Explore at:
Dataset updated
Jun 1, 2024
Description
ImageNet-A is a set of images labelled with ImageNet labels that were obtained by collecting new data and keeping only those images that ResNet-50 models fail to correctly classify. For more details please refer to the paper.

The label space is the same as that of ImageNet2012. Each example is represented as a dictionary with the following keys:

'image': The image, a (H, W, 3)-tensor.

'label': An integer in the range [0, 1000).

'file_name': A unique sting identifying the example within the dataset.

To use this dataset:

import tensorflow_datasets as tfds ds = tfds.load('imagenet_a', split='train') for ex in ds.take(4): print(ex)

See the guide for more informations on tensorflow_datasets.

https://storage.googleapis.com/tfds-data/visualization/fig/imagenet_a-0.1.0.png" alt="Visualization" width="500px">
200000 Medical Research Paper Abstracts
kaggle.com
zip
Updated Jan 14, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anshul Mehta (2022). 200000 Medical Research Paper Abstracts [Dataset]. https://www.kaggle.com/datasets/anshulmehtakaggl/200000-abstracts-for-seq-sentence-classification/code
Explore at:
zip(251885526 bytes)Available download formats
Dataset updated
Jan 14, 2022
Authors
Anshul Mehta
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

I was working with this dataset as a part of a Project for a Tensorflow course that I was taking. It seemed to be a very interesting problem. You can check the course here.

Content

In the training set the Abstracts are split into Objectives, Methods, Conclusions, Results etc for each sentence. The aim is to make sure that our model is able to split the test data or any other abstract as a matter of fact into the same way making the complicated abstracts much easier to read.

Acknowledgements

Inspiration of the Project: https://arxiv.org/abs/1710.06071 Data Belongs to: https://github.com/Franck-Dernoncourt/pubmed-rct
h
nyu_depth_v2
huggingface.co
tensorflow.org
Updated Mar 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sayak Paul (2023). nyu_depth_v2 [Dataset]. https://huggingface.co/datasets/sayakpaul/nyu_depth_v2
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 1, 2023
Authors
Sayak Paul
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
The NYU-Depth V2 data set is comprised of video sequences from a variety of indoor scenes as recorded by both the RGB and Depth cameras from the Microsoft Kinect.

Class	Number of Images	Description
Cats	1,000	Includes multiple breeds and poses
Dogs	1,000	Covers various breeds and backgrounds
Snakes	1,000	Includes multiple species and natural settings

Set	Percentage	Number of Images
Training	70%	2,100
Validation	15%	450
Test	15%	450

Facebook

Twitter

Click to copy link

Link copied

Cite

(2024). imagenet2012 [Dataset]. https://www.tensorflow.org/datasets/catalog/imagenet2012

imagenet2012

Explore at:

Dataset updated

Jun 1, 2024

Description

ILSVRC 2012, commonly known as 'ImageNet' is an image dataset organized according to the WordNet hierarchy. Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a "synonym set" or "synset". There are more than 100,000 synsets in WordNet, majority of them are nouns (80,000+). In ImageNet, we aim to provide on average 1000 images to illustrate each synset. Images of each concept are quality-controlled and human-annotated. In its completion, we hope ImageNet will offer tens of millions of cleanly sorted images for most of the concepts in the WordNet hierarchy.

The test split contains 100K images but no labels because no labels have been publicly released. We provide support for the test split from 2012 with the minor patch released on October 10, 2019. In order to manually download this data, a user must perform the following operations:

Download the 2012 test split available here.
Download the October 10, 2019 patch. There is a Google Drive link to the patch provided on the same page.
Combine the two tar-balls, manually overwriting any images in the original archive with images from the patch. According to the instructions on image-net.org, this procedure overwrites just a few images.

The resulting tar-ball may then be processed by TFDS.

To assess the accuracy of a model on the ImageNet test split, one must run inference on all images in the split, export those results to a text file that must be uploaded to the ImageNet evaluation server. The maintainers of the ImageNet evaluation server permits a single user to submit up to 2 submissions per week in order to prevent overfitting.

To evaluate the accuracy on the test split, one must first create an account at image-net.org. This account must be approved by the site administrator. After the account is created, one can submit the results to the test server at https://image-net.org/challenges/LSVRC/eval_server.php The submission consists of several ASCII text files corresponding to multiple tasks. The task of interest is "Classification submission (top-5 cls error)". A sample of an exported text file looks like the following:

771 778 794 387 650
363 691 764 923 427
737 369 430 531 124
755 930 755 59 168

The export format is described in full in "readme.txt" within the 2013 development kit available here: https://image-net.org/data/ILSVRC/2013/ILSVRC2013_devkit.tgz Please see the section entitled "3.3 CLS-LOC submission format". Briefly, the format of the text file is 100,000 lines corresponding to each image in the test split. Each line of integers correspond to the rank-ordered, top 5 predictions for each test image. The integers are 1-indexed corresponding to the line number in the corresponding labels file. See labels.txt.

To use this dataset:

import tensorflow_datasets as tfds

ds = tfds.load('imagenet2012', split='train')
for ex in ds.take(4):
 print(ex)

See the guide for more informations on tensorflow_datasets.

https://storage.googleapis.com/tfds-data/visualization/fig/imagenet2012-5.1.0.png" alt="Visualization" width="500px">

Clear search

Close search

Google apps

Main menu

imagenet2012

cifar10

Dataset for "Enhancing Cloud Detection in Sentinel-2 Imagery: A...

speech_commands

Simulated datasets for detector and particle flow reconstruction: CLIC...

imagenet_v2

food101

Anime Subtitles

Content

Format

Acknowledgements

civil_comments

Data from: Fashion Mnist Dataset

Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms

Authors:

Dataset Obtained From: https://github.com/zalandoresearch/fashion-mnist

All images were sized 28x28 in the original dataset

Version 1 (original-images_Original-FashionMNIST-Splits):

Version 3 (original-images_trainSetSplitBy80_20):

Citation:

placesfull

coco

Data from: Cats vs. Dogs

Graph topological features extracted from expression profiles of...

Animals (Cats, Dogs, and Snakes)

Cats, Dogs, and Snakes Dataset

Dataset Overview

Data Split Recommendation

Preprocessing

Example: Loading and Using the Dataset (Python)

Key Features

Brain Tumor Dataset

Abstract

Methodology

Usage

conll2003

imagenet_a

200000 Medical Research Paper Abstracts

Context

Content

Acknowledgements

nyu_depth_v2

imagenet2012See More Versions

imagenet2012