The MNIST database of handwritten digits.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('mnist', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/mnist-3.0.1.png
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset is built for time-series Sentinel-2 cloud detection and stored as TensorFlow TFRecord files (refer to https://www.tensorflow.org/tutorials/load_data/tfrecord).
Each file is compressed in 7z format and can be decompressed using Bandizip or 7-Zip.
Dataset Structure:
Each filename can be split into three parts using underscores. The first part indicates whether it is designated for training or validation ('train' or 'val'); the second part indicates the Sentinel-2 tile name, and the last part indicates the number of samples in this file.
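For illustration, this naming convention can be unpacked directly in Python (the filename below is hypothetical):

# 'train_T31TCJ_87' would be a training file from tile T31TCJ containing 87 samples.
fname = 'train_T31TCJ_87'
split, tile, n_samples = fname.split('_')
print(split, tile, int(n_samples))  # -> train T31TCJ 87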
Each sample includes:
A sample ID;
An array of time-series 4-band image patches at 10 m resolution, shaped as (n_timestamps, 4, 42, 42);
A label list indicating the cloud cover status of the center 6×6 pixels at each timestamp;
An ordinal (date) list for each timestamp;
A sample weight list (reserved).
Here is a demonstration function for parsing the TFRecord file:
import tensorflow as tf

def parseRecordDirect(fname):
    sep = '/'
    parts = tf.strings.split(fname, sep)
    tn = tf.strings.split(parts[-1], sep='_')[-2]
    nn = tf.strings.to_number(tf.strings.split(parts[-1], sep='_')[-1], tf.dtypes.int64)
    t = tf.data.Dataset.from_tensors(tn).repeat().take(nn)
    t1 = tf.data.TFRecordDataset(fname)
    ds = tf.data.Dataset.zip((t, t1))
    return ds

keys_to_features_direct = {
    'localid': tf.io.FixedLenFeature([], tf.int64, -1),
    'image_raw_ldseries': tf.io.FixedLenFeature((), tf.string, ''),
    'labels': tf.io.FixedLenFeature((), tf.string, ''),
    'dates': tf.io.FixedLenFeature((), tf.string, ''),
    'weights': tf.io.FixedLenFeature((), tf.string, '')
}

# NOTE: 'decoder' refers to the TFDS decoder base module; this class is kept for
# reference only and is not required by the tf.data pipeline below.
class SeriesClassificationDirectDecoder(decoder.Decoder):
    """A tf.Example decoder for tfds classification datasets."""

    def __init__(self) -> None:
        super().__init__()

    def decode(self, tid, ds):
        parsed = tf.io.parse_single_example(ds, keys_to_features_direct)
        encoded = parsed['image_raw_ldseries']
        labels_encoded = parsed['labels']
        decoded = tf.io.decode_raw(encoded, tf.uint16)
        label = tf.io.decode_raw(labels_encoded, tf.int8)
        dates = tf.io.decode_raw(parsed['dates'], tf.int64)
        weight = tf.io.decode_raw(parsed['weights'], tf.float32)
        decoded = tf.reshape(decoded, [-1, 4, 42, 42])
        sample_dict = {
            'tid': tid,                    # tile ID
            'dates': dates,                # date list
            'localid': parsed['localid'],  # sample ID
            'imgs': decoded,               # image array
            'labels': label,               # label list
            'weights': weight              # sample weights (reserved)
        }
        return sample_dict

def preprocessDirect(tid, record):
    parsed = tf.io.parse_single_example(record, keys_to_features_direct)
    encoded = parsed['image_raw_ldseries']
    labels_encoded = parsed['labels']
    decoded = tf.io.decode_raw(encoded, tf.uint16)
    label = tf.io.decode_raw(labels_encoded, tf.int8)
    dates = tf.io.decode_raw(parsed['dates'], tf.int64)
    weight = tf.io.decode_raw(parsed['weights'], tf.float32)
    decoded = tf.reshape(decoded, [-1, 4, 42, 42])
    return tid, dates, parsed['localid'], decoded, label, weight

t1 = parseRecordDirect('filename here')
dataset = t1.map(preprocessDirect, num_parallel_calls=tf.data.experimental.AUTOTUNE)
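As a quick sanity check (a sketch; 'filename here' above must be replaced with a real TFRecord path), one parsed sample can be inspected like this:

# Take one sample and check that shapes match the structure described above.
for tid, dates, localid, imgs, labels, weights in dataset.take(1):
    print(tid.numpy(), localid.numpy())
    print(imgs.shape)    # (n_timestamps, 4, 42, 42)
    print(labels.shape)  # flattened cloud-cover labels for the center 6x6 pixels per timestamp
    print(dates.shape)   # one ordinal entry per timestamp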
Class Definition:
0: clear
1: opaque cloud
2: thin cloud
3: haze
4: cloud shadow
5: snow
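For convenience (this mapping is not stored in the files themselves), the integer labels above can be kept as a small lookup table in code:

# Mapping from label index to class name, as defined above.
CLOUD_CLASSES = {
    0: 'clear',
    1: 'opaque cloud',
    2: 'thin cloud',
    3: 'haze',
    4: 'cloud shadow',
    5: 'snow',
}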
Dataset Construction:
First, we randomly generate 500 points for each tile, and all of these points are aligned to the pixel grid centers of the 60 m resolution sub-datasets (e.g. B10) for consistency when comparing with other products. This is because other cloud detection methods may use the cirrus band, which has 60 m resolution, as a feature.
Then, time-series image patches of two sizes are cropped with each point as the center. The patches of shape 42 × 42 are cropped from the 10 m resolution bands (B2, B3, B4, B8) and are used to construct this dataset, while the patches of shape 348 × 348 are cropped from the True Colour Image (TCI; see the Sentinel-2 user guide for details) file and are used for interpreting the class labels.
Samples with a large number of timestamps can be time-consuming in the I/O stage, so the time-series patches are divided into groups, with no group exceeding 100 timestamps.
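Because the samples are split across many TFRecord files in this way, a full training dataset can be assembled by parsing each file and concatenating the results. This is a sketch only; the glob pattern and directory layout are assumptions about where the decompressed files are stored:

import glob
import tensorflow as tf

# Assumed location/naming of the decompressed TFRecord files.
train_files = glob.glob('path/to/decompressed/train_*')

parsed = [parseRecordDirect(f).map(preprocessDirect,
                                   num_parallel_calls=tf.data.experimental.AUTOTUNE)
          for f in train_files]

# Concatenate the per-file datasets into a single training dataset.
full_train = parsed[0]
for d in parsed[1:]:
    full_train = full_train.concatenate(d)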
The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('cifar10', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/cifar10-3.0.2.png
Fashion-MNIST is a dataset of Zalando's article images consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('fashion_mnist', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/fashion_mnist-3.0.1.png
Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('imdb_reviews', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
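The additional unlabeled reviews mentioned above are exposed in TFDS as the 'unsupervised' split, so they can be loaded the same way:

ds_unlabeled = tfds.load('imdb_reviews', split='unsupervised')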
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
For the data and network task we build upon previous work on a climate change classification task by Labe and Barnes [2021], using data simulated by the global climate model CESM1 (Hurrell et al. [2013]). We focus on data from the “ALL” configuration (Kay et al. [2015]). We use 2-m air temperature (T2m) maps from 1920 to 2080 and take the annual average of the monthly data; this results in T=161 temperature maps for each of the 40 ensemble members. Participants can load the already pre-processed and prepared batch of data, containing N=161 images (1 sample per year), from an .npz file into the notebook, as we will upload the finalized data to Kaggle. Moreover, we provide a trained CNN (.tf files).
The network assigns the annual temperature maps to classes based on their decade. Accordingly, the network output is a class vector of 20 classes, with each entry giving the probability that the input sample belongs to that class's decade. We provide the input temperature maps as images on an h=144 by v=95 longitude-latitude grid with 1.9° sampling in latitude and 2.5° sampling in longitude. For the CNN we maintain the latitude-longitude grid. Additionally, we include the associated class vectors (a probability entry for each class) and the year of each temperature map in the input data.
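As a minimal sketch of how the prepared data and trained network might be used together (the file names, .npz keys, and array layout below are assumptions, not part of the provided materials):

import numpy as np
import tensorflow as tf

# Hypothetical file names and .npz keys, for illustration only.
data = np.load('climate_t2m_annual.npz')
x = data['x']          # assumed: N=161 annual T2m maps on the 95 x 144 latitude-longitude grid
y = data['y']          # assumed: 20-entry class (decade) probability vectors per map
years = data['years']  # assumed: the year of each temperature map

model = tf.keras.models.load_model('decade_classifier.tf')  # the provided trained CNN
probs = model.predict(x)              # per-sample probabilities over the 20 decade classes
predicted_class = probs.argmax(axis=1)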
Dataset describing the survival status of individual passengers on the Titanic. Missing values in the original dataset are represented using ?. Float and int missing values are replaced with -1, string missing values are replaced with 'Unknown'.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('titanic', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
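If the sentinel values should be treated as missing again after loading, they can be converted back, for example (a sketch; 'age' is assumed here to be one of the numeric features in the example dict):

import numpy as np
import tensorflow_datasets as tfds

ds = tfds.load('titanic', split='train')
for ex in tfds.as_numpy(ds.take(4)):
    age = ex['age']                           # numeric feature; -1 marks a missing value
    age = np.nan if age == -1 else float(age)
    print(age)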
CelebFaces Attributes Dataset (CelebA) is a large-scale face attributes dataset with more than 200K celebrity images, each with 40 attribute annotations. The images in this dataset cover large pose variations and background clutter. CelebA has large diversity, large quantity, and rich annotations, including 10,177 identities, 202,599 face images, and, for each image, 5 landmark locations and 40 binary attribute annotations.
The dataset can be employed as the training and test sets for the following computer vision tasks: face attribute recognition, face detection, and landmark (or facial part) localization.
Note: CelebA dataset may contain potential bias. The fairness indicators example goes into detail about several considerations to keep in mind while using the CelebA dataset.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('celeb_a', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/celeb_a-2.1.0.png
This dataset contains the data from the PASCAL Visual Object Classes Challenge, corresponding to the Classification and Detection competitions.
In the Classification competition, the goal is to predict the set of labels contained in the image, while in the Detection competition the goal is to predict the bounding box and label of each individual object. WARNING: As per the official dataset, the test set of VOC2012 does not contain annotations.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('voc', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/voc-2007-5.0.0.png