Question generation using the SQuAD dataset, with the data splits described in 'Neural Question Generation from Text: A Preliminary Study' (Zhou et al., 2017) and 'Learning to Ask: Neural Question Generation for Reading Comprehension' (Du et al., 2017).
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('squad_question_generation', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Cleaned-up text for 40+ Wikipedia language editions of pages corresponding to entities. The dataset has train/dev/test splits per language. It is cleaned by page filtering to remove disambiguation pages, redirect pages, deleted pages, and non-entity pages. Each example contains the Wikidata ID of the entity and the full Wikipedia article after page processing that removes non-content sections and structured objects. The language models trained on this corpus - including 41 monolingual models and 2 multilingual models - can be found at https://tfhub.dev/google/collections/wiki40b-lm/1.
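For example, a single language edition can be loaded through its per-language configuration. This is only a sketch: the 'wiki40b/en' config name and the 'wikidata_id'/'text' field names are assumptions based on the description above.
import tensorflow_datasets as tfds

# Load one language edition; substitute the language code you need.
# 'wiki40b/en' is an assumed config name, not verified here.
ds = tfds.load('wiki40b/en', split='validation')

for ex in ds.take(1):
  # Each example is expected to carry the entity's Wikidata id and the
  # processed article text (field names assumed from the description).
  print(ex['wikidata_id'].numpy(), ex['text'].numpy()[:200])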
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('wiki40b', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
RL Unplugged is a suite of benchmarks for offline reinforcement learning. RL Unplugged is designed around the following considerations: to facilitate ease of use, we provide the datasets with a unified API which makes it easy for the practitioner to work with all data in the suite once a general pipeline has been established.
The datasets follow the RLDS format to represent steps and episodes.
We are releasing a large and diverse dataset of gameplay following the protocol described by Agarwal et al., 2020, which can be used to evaluate several discrete offline RL algorithms. The dataset is generated by running an online DQN agent and recording transitions from its replay during training with sticky actions (Machado et al., 2018). As stated in Agarwal et al., 2020, for each game we use data from five runs with 50 million transitions each. We release datasets for 46 Atari games. For details on how the dataset was generated, please refer to the paper. Please see this note about the ROM versions used to generate the datasets.
Atari is a standard RL benchmark. We recommend trying offline RL methods on Atari if you are interested in comparing your approach to other state-of-the-art offline RL methods with discrete actions.
The reward of each step is clipped to [-1, 1], and each episode includes the sum of the clipped rewards for that episode.
Each of the configurations is broken into splits. Splits correspond to checkpoints of 1M steps (note that the number of episodes may differ). Checkpoints are ordered in time (so checkpoint 0 ran before checkpoint 1).
Episodes within each split are ordered. Check https://www.tensorflow.org/datasets/determinism if you want to ensure that you read episodes in order.
This dataset corresponds to the one used in the DQN replay paper. https://research.google/tools/datasets/dqn-replay/
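As a sketch of reading one ordered checkpoint under the RLDS episode/step format: the 'Pong_run_1' config name, the 'checkpoint_00' split name, and the RLDS field names below are assumptions; check the catalog for the exact names.
import tensorflow_datasets as tfds

# Load a single game/run configuration and a single 1M-step checkpoint.
ds = tfds.load('rlu_atari_checkpoints_ordered/Pong_run_1',
               split='checkpoint_00')

for episode in ds.take(1):
  # RLDS: each episode contains a nested dataset of steps.
  for step in episode['steps'].take(3):
    print(step['reward'].numpy(), step['is_terminal'].numpy())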
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('rlu_atari_checkpoints_ordered', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
SCAN tasks with various splits.
SCAN is a set of simple language-driven navigation tasks for studying compositional learning and zero-shot generalization.
Most splits are described at https://github.com/brendenlake/SCAN. For the MCD splits please see https://arxiv.org/abs/1912.09713.pdf.
Basic usage:
data = tfds.load('scan/length')
More advanced example:
import tensorflow_datasets as tfds
from tensorflow_datasets.datasets.scan import scan_dataset_builder
data = tfds.load(
    'scan',
    builder_kwargs=dict(
        config=scan_dataset_builder.ScanConfig(
            name='simple_p8', directory='simple_split/size_variations')))
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('scan', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('spoc_robot', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
COCO is a large-scale object detection, segmentation, and captioning dataset.
Note:
* Some images from the train and validation sets don't have annotations.
* COCO 2014 and 2017 use the same images, but different train/val/test splits.
* The test split doesn't have any annotations (only images).
* COCO defines 91 classes but the data only uses 80 classes.
* Panoptic annotations define 200 classes but only use 133.
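To check the label space yourself, the dataset metadata can be inspected after loading; a minimal sketch using the '2017' config:
import tensorflow_datasets as tfds

# Load COCO 2017 together with its metadata.
ds, ds_info = tfds.load('coco/2017', split='validation', with_info=True)

# The feature description shows the object annotation structure; the object
# label space should list the 80 classes actually used, not all 91 defined.
print(ds_info.features)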
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('coco', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/coco-2014-1.1.0.png
CREMA-D is an audio-visual data set for emotion recognition. The data set consists of facial and vocal emotional expressions in sentences spoken in a range of basic emotional states (happy, sad, anger, fear, disgust, and neutral). 7,442 clips of 91 actors with diverse ethnic backgrounds were collected. This release contains only the audio stream from the original audio-visual recording. The samples are split between train, validation, and test so that the samples from each speaker belong to exactly one split.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('crema_d', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
ILSVRC 2012, commonly known as 'ImageNet', is an image dataset organized according to the WordNet hierarchy. Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a "synonym set" or "synset". There are more than 100,000 synsets in WordNet; the majority of them are nouns (80,000+). In ImageNet, we aim to provide on average 1000 images to illustrate each synset. Images of each concept are quality-controlled and human-annotated. In its completion, we hope ImageNet will offer tens of millions of cleanly sorted images for most of the concepts in the WordNet hierarchy.
The test split contains 100K images but no labels because no labels have been publicly released. We provide support for the test split from 2012 with the minor patch released on October 10, 2019. In order to manually download this data, a user must perform the following operations:
The resulting tar-ball may then be processed by TFDS.
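As a sketch of that last step, the manually downloaded archives can be handed to TFDS through a DownloadConfig (the manual_dir path below is just an example location):
import tensorflow_datasets as tfds

# Tell TFDS where the manually downloaded archives were placed.
builder = tfds.builder('imagenet2012')
builder.download_and_prepare(
    download_config=tfds.download.DownloadConfig(
        manual_dir='~/tensorflow_datasets/downloads/manual'))

ds = builder.as_dataset(split='train')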
To assess the accuracy of a model on the ImageNet test split, one must run inference on all images in the split and export those results to a text file that is then uploaded to the ImageNet evaluation server. The maintainers of the ImageNet evaluation server permit a single user to submit up to 2 submissions per week in order to prevent overfitting.
To evaluate the accuracy on the test split, one must first create an account at image-net.org. This account must be approved by the site administrator. After the account is created, one can submit the results to the test server at https://image-net.org/challenges/LSVRC/eval_server.php. The submission consists of several ASCII text files corresponding to multiple tasks. The task of interest is "Classification submission (top-5 cls error)". A sample of an exported text file looks like the following:
771 778 794 387 650
363 691 764 923 427
737 369 430 531 124
755 930 755 59 168
The export format is described in full in "readme.txt" within the 2013 development kit available here: https://image-net.org/data/ILSVRC/2013/ILSVRC2013_devkit.tgz. Please see the section entitled "3.3 CLS-LOC submission format". Briefly, the format of the text file is 100,000 lines, one per image in the test split. Each line of integers corresponds to the rank-ordered top-5 predictions for that test image. The integers are 1-indexed, corresponding to the line number in the corresponding labels file. See labels.txt.
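A minimal sketch of writing such a submission file, assuming predict_top5 is your own inference function returning five class ids per image, best first, already mapped to the 1-indexed ids from labels.txt:
import tensorflow_datasets as tfds

ds = tfds.load('imagenet2012', split='test')

# One line per test image: five space-separated predictions, best first.
# Make sure the line order matches the image order the evaluation server
# expects (see the devkit readme) and that ids are 1-indexed per labels.txt.
with open('submission.txt', 'w') as f:
  for ex in ds:
    top5 = predict_top5(ex['image'])  # placeholder, e.g. [771, 778, 794, 387, 650]
    f.write(' '.join(str(label) for label in top5) + '\n')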
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('imagenet2012', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/imagenet2012-5.1.0.png
An audio dataset of spoken words designed to help train and evaluate keyword spotting systems. Its primary goal is to provide a way to build and test small models that detect when a single word is spoken, from a set of ten target words, with as few false positives as possible from background noise or unrelated speech. Note that in the train and validation sets, the label "unknown" is much more prevalent than the labels of the target words or background noise. One difference from the release version is the handling of silent segments. While in the test set the silence segments are regular 1-second files, in the training set they are provided as long segments under the "background_noise" folder. Here we split this background noise into 1-second clips, and also keep one of the files for the validation set.
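For instance, the class imbalance can be checked directly; a sketch, assuming the class feature is named 'label':
import collections
import tensorflow_datasets as tfds

ds, ds_info = tfds.load('speech_commands', split='validation', with_info=True)

# Count examples per class name to see how dominant the "unknown" label is.
label_feature = ds_info.features['label']
counts = collections.Counter(
    label_feature.int2str(int(ex['label'])) for ex in tfds.as_numpy(ds))
print(counts.most_common())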
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('speech_commands', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('cifar10', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/cifar10-3.0.2.png
Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('wikipedia', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
This version of the CivilComments Dataset provides access to the primary seven labels that were annotated by crowd workers; the toxicity and other tags are values between 0 and 1 indicating the fraction of annotators that assigned these attributes to the comment text.
The other tags are only available for a fraction of the input examples. They are currently ignored for the main dataset; the CivilCommentsIdentities set includes those labels, but only consists of the subset of the data with them. The other attributes that were part of the original CivilComments release are included only in the raw data. See the Kaggle documentation for more details about the available features.
The comments in this dataset come from an archive of the Civil Comments platform, a commenting plugin for independent news sites. These public comments were created from 2015 to 2017 and appeared on approximately 50 English-language news sites across the world. When Civil Comments shut down in 2017, they chose to make the public comments available in a lasting open archive to enable future research. The original data, published on figshare, includes the public comment text, some associated metadata such as article IDs, publication IDs, timestamps and commenter-generated "civility" labels, but does not include user ids. Jigsaw extended this dataset by adding additional labels for toxicity and identity mentions, as well as covert offensiveness. This data set is an exact replica of the data released for the Jigsaw Unintended Bias in Toxicity Classification Kaggle challenge. This dataset is released under CC0, as is the underlying comment text.
For comments that have a parent_id also in the civil comments data, the text of the previous comment is provided as the "parent_text" feature. Note that the splits were made without regard to this information, so using previous comments may leak some information. The annotators did not have access to the parent text when making the labels.
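Because the labels are annotator fractions rather than hard classes, a common preprocessing step is to threshold them; a sketch, assuming the 'text' and 'toxicity' feature names and using 0.5 as an arbitrary threshold:
import tensorflow as tf
import tensorflow_datasets as tfds

ds = tfds.load('civil_comments', split='train')

# Turn the fractional toxicity score into a binary label by thresholding.
# The 0.5 threshold is a modelling choice, not part of the dataset.
def to_binary(ex):
  return ex['text'], tf.cast(ex['toxicity'] >= 0.5, tf.int32)

binary_ds = ds.map(to_binary)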
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('civil_comments', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
A collection of 3 referring expression datasets based on images in the COCO dataset. A referring expression is a piece of text that describes a unique object in an image. These datasets are collected by asking human raters to disambiguate objects delineated by bounding boxes in the COCO dataset.
RefCoco and RefCoco+ are from Kazemzadeh et al. 2014. RefCoco+ expressions are strictly appearance-based descriptions, which they enforced by preventing raters from using location-based descriptions (e.g., "person to the right" is not a valid description for RefCoco+). RefCocoG is from Mao et al. 2016, and has richer descriptions of objects compared to RefCoco due to differences in the annotation process. In particular, RefCoco was collected in an interactive game-based setting, while RefCocoG was collected in a non-interactive setting. On average, RefCocoG has 8.4 words per expression while RefCoco has 3.5 words.
Each dataset has different split allocations that are typically all reported in papers. The "testA" and "testB" sets in RefCoco and RefCoco+ contain only people and only non-people respectively. Images are partitioned into the various splits. In the "google" split, objects, not images, are partitioned between the train and non-train splits. This means that the same image can appear in both the train and validation split, but the objects being referred to in the image will be different between the two sets. In contrast, the "unc" and "umd" splits partition images between the train, validation, and test split. In RefCocoG, the "google" split does not have a canonical test set, and the validation set is typically reported in papers as "val*".
Stats for each dataset and split ("refs" is the number of referring expressions, and "images" is the number of images):
| dataset | partition | split | refs | images |
|---|---|---|---|---|
| refcoco | google | train | 40000 | 19213 |
| refcoco | google | val | 5000 | 4559 |
| refcoco | google | test | 5000 | 4527 |
| refcoco | unc | train | 42404 | 16994 |
| refcoco | unc | val | 3811 | 1500 |
| refcoco | unc | testA | 1975 | 750 |
| refcoco | unc | testB | 1810 | 750 |
| refcoco+ | unc | train | 42278 | 16992 |
| refcoco+ | unc | val | 3805 | 1500 |
| refcoco+ | unc | testA | 1975 | 750 |
| refcoco+ | unc | testB | 1798 | 750 |
| refcocog | google | train | 44822 | 24698 |
| refcocog | google | val | 5000 | 4650 |
| refcocog | umd | train | 42226 | 21899 |
| refcocog | umd | val | 2573 | 1300 |
| refcocog | umd | test | 5023 | 2600 |
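The partitions above correspond to builder configs; for example, the UNC partition of RefCoco can be loaded per split as sketched below. The 'refcoco_unc' config name follows the visualization figure referenced further down, and this assumes the data has already been prepared, since RefCoco relies on manually downloaded COCO images.
import tensorflow_datasets as tfds

# Load the UNC partition of RefCoco; 'testA' holds people-only referents
# and 'testB' non-people referents, as described above.
ds = tfds.load('ref_coco/refcoco_unc', split='testA')

for ex in ds.take(1):
  print(list(ex.keys()))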
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('ref_coco', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/ref_coco-refcoco_unc-1.1.0.png
The Oxford-IIIT pet dataset is a 37-category pet image dataset with roughly 200 images for each class. The images have large variations in scale, pose and lighting. All images have an associated ground truth annotation of breed and species. Additionally, head bounding boxes are provided for the training split, allowing this dataset to be used for simple object detection tasks. In the test split, the bounding boxes are empty.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('oxford_iiit_pet', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
The dataset contains table-question pairs and the respective answers. The questions require multi-step reasoning and various data operations such as comparison, aggregation, and arithmetic computation. The tables were randomly selected among Wikipedia tables with at least 8 rows and 5 columns.
(As per the documentation usage notes)
Dev: Mean accuracy over three (not five) splits of the training data. In other words, train on 'split-{1,2,3}-train' and test on 'split-{1,2,3}-dev', respectively, then average the accuracy.
Test: Train on 'train' and test on 'test'.
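A sketch of the dev protocol described above, assuming the splits are exposed under the names 'split-{i}-train' and 'split-{i}-dev', with train_model and evaluate standing in for your own code:
import tensorflow_datasets as tfds

# Train on each of the three training folds, evaluate on the matching dev
# fold, then average the accuracies. train_model/evaluate are placeholders.
accuracies = []
for i in (1, 2, 3):
  train_ds = tfds.load('wiki_table_questions', split=f'split-{i}-train')
  dev_ds = tfds.load('wiki_table_questions', split=f'split-{i}-dev')
  model = train_model(train_ds)
  accuracies.append(evaluate(model, dev_ds))

print(sum(accuracies) / len(accuracies))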
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('wiki_table_questions', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
WikiHow is a new large-scale dataset using the online WikiHow (http://www.wikihow.com/) knowledge base.
There are two features:
- text: WikiHow answer texts.
- headline: bold lines as summary.
There are two separate versions:
- all: consisting of the concatenation of all paragraphs as the articles and the bold lines as the reference summaries.
- sep: consisting of each paragraph and its summary.
Download "wikihowAll.csv" and "wikihowSep.csv" from https://github.com/mahnazkoupaee/WikiHow-Dataset and place them in the manual folder (see https://www.tensorflow.org/datasets/api_docs/python/tfds/download/DownloadConfig). Train/validation/test splits are provided by the authors. Preprocessing is applied to remove short articles (abstract length < 0.75 article length) and clean up extra commas.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('wikihow', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
The *-CFQ datasets (and their splits) for measuring the scalability of compositional generalization.
See https://arxiv.org/abs/2012.08266 for background.
Example usage:
data = tfds.load('star_cfq/single_pool_10x_b_cfq')
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('star_cfq', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
BillSum, summarization of US Congressional and California state bills.
There are several features:
- text: bill text.
- summary: summary of the bill.
- title: title of the bill (a feature of US bills; CA bills do not have it).
- text_len: number of chars in text.
- sum_len: number of chars in summary.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('billsum', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
DART (DAta Record to Text generation) contains RDF entity-relation triples annotated with sentence descriptions that cover all facts in the triple set. DART was constructed using existing datasets such as WikiTableQuestions, WikiSQL, WebNLG and Cleaned E2E. The tables from WikiTableQuestions and WikiSQL were transformed into subject-predicate-object triples, and the text annotations were mainly collected from MTurk. The meaning representations in E2E were also transformed into triples and their descriptions were used; some that couldn't be transformed were dropped.
The dataset splits of E2E and WebNLG are kept, and for WikiTableQuestions and WikiSQL the Jaccard similarity is used to keep similar tables in the same set (train/dev/test).
This dataset is constructed following a standardized table format.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('dart', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
This dataset contains a sparse graph representing web link structure for a small subset of the Web.
It is a processed version of a single crawl performed by CommonCrawl in 2021, where we strip everything and keep only the link->outlinks structure. The final dataset is basically in int -> List[int] format, with each integer id representing a URL.
Also, in order to increase the value of this resource, we created 6 different versions of WebGraph, each varying in the sparsity pattern and locale. We took the following processing steps, in order:
| Version | Top level domain | Min count | Num nodes | Num edges |
|---|---|---|---|---|
| sparse | | 10 | 365.4M | 30B |
| dense | | 50 | 136.5M | 22B |
| de-sparse | de | 10 | 19.7M | 1.19B |
| de-dense | de | 50 | 5.7M | 0.82B |
| in-sparse | in | 10 | 1.5M | 0.14B |
| in-dense | in | 50 | 0.5M | 0.12B |
All versions of the dataset have the following features:
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('web_graph', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.