Question generation using the SQuAD dataset, with the data splits described in 'Neural Question Generation from Text: A Preliminary Study' (Zhou et al., 2017) and 'Learning to Ask: Neural Question Generation for Reading Comprehension' (Du et al., 2017).
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('squad_question_generation', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Cleaned-up text for 40+ Wikipedia language editions of pages corresponding to entities. The dataset has train/dev/test splits per language. It is cleaned by page filtering to remove disambiguation pages, redirect pages, deleted pages, and non-entity pages. Each example contains the Wikidata ID of the entity and the full Wikipedia article after page processing that removes non-content sections and structured objects. The language models trained on this corpus - including 41 monolingual models and 2 multilingual models - can be found at https://tfhub.dev/google/collections/wiki40b-lm/1.
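For example, a single language edition can be loaded through its per-language configuration. This is only a sketch: the 'wiki40b/en' config name and the 'wikidata_id'/'text' field names are assumptions based on the description above.
import tensorflow_datasets as tfds

# Load one language edition; substitute the language code you need.
# 'wiki40b/en' is an assumed config name, not verified here.
ds = tfds.load('wiki40b/en', split='validation')

for ex in ds.take(1):
  # Each example is expected to carry the entity's Wikidata id and the
  # processed article text (field names assumed from the description).
  print(ex['wikidata_id'].numpy(), ex['text'].numpy()[:200])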
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('wiki40b', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
RL Unplugged is a suite of benchmarks for offline reinforcement learning. RL Unplugged is designed around the following considerations: to facilitate ease of use, we provide the datasets with a unified API which makes it easy for the practitioner to work with all data in the suite once a general pipeline has been established.
The datasets follow the RLDS format to represent steps and episodes.
We are releasing a large and diverse dataset of gameplay following the protocol described by Agarwal et al., 2020, which can be used to evaluate several discrete offline RL algorithms. The dataset is generated by running an online DQN agent and recording transitions from its replay during training with sticky actions (Machado et al., 2018). As stated in Agarwal et al., 2020, for each game we use data from five runs with 50 million transitions each. We release datasets for 46 Atari games. For details on how the dataset was generated, please refer to the paper. Please see this note about the ROM versions used to generate the datasets.
Atari is a standard RL benchmark. We recommend trying offline RL methods on Atari if you are interested in comparing your approach to other state-of-the-art offline RL methods with discrete actions.
The reward of each step is clipped to [-1, 1], and each episode includes the sum of the clipped rewards for that episode.
Each of the configurations is broken into splits. Splits correspond to checkpoints of 1M steps (note that the number of episodes may differ). Checkpoints are ordered in time (so checkpoint 0 ran before checkpoint 1).
Episodes within each split are ordered. Check https://www.tensorflow.org/datasets/determinism if you want to ensure that you read episodes in order.
This dataset corresponds to the one used in the DQN replay paper. https://research.google/tools/datasets/dqn-replay/
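As a sketch of reading one ordered checkpoint under the RLDS episode/step format: the 'Pong_run_1' config name, the 'checkpoint_00' split name, and the RLDS field names below are assumptions; check the catalog for the exact names.
import tensorflow_datasets as tfds

# Load a single game/run configuration and a single 1M-step checkpoint.
ds = tfds.load('rlu_atari_checkpoints_ordered/Pong_run_1',
               split='checkpoint_00')

for episode in ds.take(1):
  # RLDS: each episode contains a nested dataset of steps.
  for step in episode['steps'].take(3):
    print(step['reward'].numpy(), step['is_terminal'].numpy())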
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('rlu_atari_checkpoints_ordered', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
SCAN tasks with various splits.
SCAN is a set of simple language-driven navigation tasks for studying compositional learning and zero-shot generalization.
Most splits are described at https://github.com/brendenlake/SCAN. For the MCD splits please see https://arxiv.org/abs/1912.09713.pdf.
Basic usage:
data = tfds.load('scan/length')
More advanced example:
import tensorflow_datasets as tfds
from tensorflow_datasets.datasets.scan import scan_dataset_builder
data = tfds.load(
    'scan',
    builder_kwargs=dict(
        config=scan_dataset_builder.ScanConfig(
            name='simple_p8', directory='simple_split/size_variations')))
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('scan', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('spoc_robot', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
COCO is a large-scale object detection, segmentation, and captioning dataset.
Note:
* Some images from the train and validation sets don't have annotations.
* COCO 2014 and 2017 use the same images, but different train/val/test splits.
* The test split doesn't have any annotations (only images).
* COCO defines 91 classes but the data only uses 80 classes.
* Panoptic annotations define 200 classes but only use 133.
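To check the label space yourself, the dataset metadata can be inspected after loading; a minimal sketch using the '2017' config:
import tensorflow_datasets as tfds

# Load COCO 2017 together with its metadata.
ds, ds_info = tfds.load('coco/2017', split='validation', with_info=True)

# The feature description shows the object annotation structure; the object
# label space should list the 80 classes actually used, not all 91 defined.
print(ds_info.features)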
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('coco', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/coco-2014-1.1.0.png
CREMA-D is an audio-visual data set for emotion recognition. The data set consists of facial and vocal emotional expressions in sentences spoken in a range of basic emotional states (happy, sad, anger, fear, disgust, and neutral). 7,442 clips of 91 actors with diverse ethnic backgrounds were collected. This release contains only the audio stream from the original audio-visual recording. The samples are split between train, validation, and test so that the samples from each speaker belong to exactly one split.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('crema_d', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
ILSVRC 2012, commonly known as 'ImageNet', is an image dataset organized according to the WordNet hierarchy. Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a "synonym set" or "synset". There are more than 100,000 synsets in WordNet; the majority of them are nouns (80,000+). In ImageNet, we aim to provide on average 1000 images to illustrate each synset. Images of each concept are quality-controlled and human-annotated. In its completion, we hope ImageNet will offer tens of millions of cleanly sorted images for most of the concepts in the WordNet hierarchy.
The test split contains 100K images but no labels because no labels have been publicly released. We provide support for the test split from 2012 with the minor patch released on October 10, 2019. In order to manually download this data, a user must perform the following operations:
The resulting tar-ball may then be processed by TFDS.
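As a sketch of that last step, the manually downloaded archives can be handed to TFDS through a DownloadConfig (the manual_dir path below is just an example location):
import tensorflow_datasets as tfds

# Tell TFDS where the manually downloaded archives were placed.
builder = tfds.builder('imagenet2012')
builder.download_and_prepare(
    download_config=tfds.download.DownloadConfig(
        manual_dir='~/tensorflow_datasets/downloads/manual'))

ds = builder.as_dataset(split='train')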
To assess the accuracy of a model on the ImageNet test split, one must run inference on all images in the split and export those results to a text file that is then uploaded to the ImageNet evaluation server. The maintainers of the ImageNet evaluation server permit a single user to submit up to 2 submissions per week in order to prevent overfitting.
To evaluate the accuracy on the test split, one must first create an account at image-net.org. This account must be approved by the site administrator. After the account is created, one can submit the results to the test server at https://image-net.org/challenges/LSVRC/eval_server.php. The submission consists of several ASCII text files corresponding to multiple tasks. The task of interest is "Classification submission (top-5 cls error)". A sample of an exported text file looks like the following:
771 778 794 387 650
363 691 764 923 427
737 369 430 531 124
755 930 755 59 168
The export format is described in full in "readme.txt" within the 2013 development kit available here: https://image-net.org/data/ILSVRC/2013/ILSVRC2013_devkit.tgz. Please see the section entitled "3.3 CLS-LOC submission format". Briefly, the format of the text file is 100,000 lines, one per image in the test split. Each line of integers corresponds to the rank-ordered top-5 predictions for that test image. The integers are 1-indexed, corresponding to the line number in the corresponding labels file. See labels.txt.
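A minimal sketch of writing such a submission file, assuming predict_top5 is your own inference function returning five class ids per image, best first, already mapped to the 1-indexed ids from labels.txt:
import tensorflow_datasets as tfds

ds = tfds.load('imagenet2012', split='test')

# One line per test image: five space-separated predictions, best first.
# Make sure the line order matches the image order the evaluation server
# expects (see the devkit readme) and that ids are 1-indexed per labels.txt.
with open('submission.txt', 'w') as f:
  for ex in ds:
    top5 = predict_top5(ex['image'])  # placeholder, e.g. [771, 778, 794, 387, 650]
    f.write(' '.join(str(label) for label in top5) + '\n')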
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('imagenet2012', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/imagenet2012-5.1.0.png
An audio dataset of spoken words designed to help train and evaluate keyword spotting systems. Its primary goal is to provide a way to build and test small models that detect when a single word is spoken, from a set of ten target words, with as few false positives as possible from background noise or unrelated speech. Note that in the train and validation sets, the label "unknown" is much more prevalent than the labels of the target words or background noise. One difference from the release version is the handling of silent segments. While in the test set the silence segments are regular 1-second files, in the training set they are provided as long segments under the "background_noise" folder. Here we split this background noise into 1-second clips, and also keep one of the files for the validation set.
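For instance, the class imbalance can be checked directly; a sketch, assuming the class feature is named 'label':
import collections
import tensorflow_datasets as tfds

ds, ds_info = tfds.load('speech_commands', split='validation', with_info=True)

# Count examples per class name to see how dominant the "unknown" label is.
label_feature = ds_info.features['label']
counts = collections.Counter(
    label_feature.int2str(int(ex['label'])) for ex in tfds.as_numpy(ds))
print(counts.most_common())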
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('speech_commands', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('cifar10', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/cifar10-3.0.2.png
Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('wikipedia', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
This version of the CivilComments Dataset provides access to the primary seven labels that were annotated by crowd workers; the toxicity and other tags are values between 0 and 1 indicating the fraction of annotators that assigned these attributes to the comment text.
The other tags are only available for a fraction of the input examples. They are currently ignored for the main dataset; the CivilCommentsIdentities set includes those labels, but only consists of the subset of the data with them. The other attributes that were part of the original CivilComments release are included only in the raw data. See the Kaggle documentation for more details about the available features.
The comments in this dataset come from an archive of the Civil Comments platform, a commenting plugin for independent news sites. These public comments were created from 2015 to 2017 and appeared on approximately 50 English-language news sites across the world. When Civil Comments shut down in 2017, they chose to make the public comments available in a lasting open archive to enable future research. The original data, published on figshare, includes the public comment text, some associated metadata such as article IDs, publication IDs, timestamps and commenter-generated "civility" labels, but does not include user ids. Jigsaw extended this dataset by adding additional labels for toxicity and identity mentions, as well as covert offensiveness. This data set is an exact replica of the data released for the Jigsaw Unintended Bias in Toxicity Classification Kaggle challenge. This dataset is released under CC0, as is the underlying comment text.
For comments that have a parent_id also in the civil comments data, the text of the previous comment is provided as the "parent_text" feature. Note that the splits were made without regard to this information, so using previous comments may leak some information. The annotators did not have access to the parent text when making the labels.
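Because the labels are annotator fractions rather than hard classes, a common preprocessing step is to threshold them; a sketch, assuming the 'text' and 'toxicity' feature names and using 0.5 as an arbitrary threshold:
import tensorflow as tf
import tensorflow_datasets as tfds

ds = tfds.load('civil_comments', split='train')

# Turn the fractional toxicity score into a binary label by thresholding.
# The 0.5 threshold is a modelling choice, not part of the dataset.
def to_binary(ex):
  return ex['text'], tf.cast(ex['toxicity'] >= 0.5, tf.int32)

binary_ds = ds.map(to_binary)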
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('civil_comments', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
A collection of 3 referring expression datasets based on images in the COCO dataset. A referring expression is a piece of text that describes a unique object in an image. These datasets are collected by asking human raters to disambiguate objects delineated by bounding boxes in the COCO dataset.
RefCoco and RefCoco+ are from Kazemzadeh et al. 2014. RefCoco+ expressions are strictly appearance-based descriptions, which they enforced by preventing raters from using location-based descriptions (e.g., "person to the right" is not a valid description for RefCoco+). RefCocoG is from Mao et al. 2016, and has richer descriptions of objects compared to RefCoco due to differences in the annotation process. In particular, RefCoco was collected in an interactive game-based setting, while RefCocoG was collected in a non-interactive setting. On average, RefCocoG has 8.4 words per expression while RefCoco has 3.5 words.
Each dataset has different split allocations that are typically all reported in papers. The "testA" and "testB" sets in RefCoco and RefCoco+ contain only people and only non-people respectively. Images are partitioned into the various splits. In the "google" split, objects, not images, are partitioned between the train and non-train splits. This means that the same image can appear in both the train and validation split, but the objects being referred to in the image will be different between the two sets. In contrast, the "unc" and "umd" splits partition images between the train, validation, and test split. In RefCocoG, the "google" split does not have a canonical test set, and the validation set is typically reported in papers as "val*".
Stats for each dataset and split ("refs" is the number of referring expressions, and "images" is the number of images):
| dataset | partition | split | refs | images |
|---|---|---|---|---|
| refcoco | google | train | 40000 | 19213 |
| refcoco | google | val | 5000 | 4559 |
| refcoco | google | test | 5000 | 4527 |
| refcoco | unc | train | 42404 | 16994 |
| refcoco | unc | val | 3811 | 1500 |
| refcoco | unc | testA | 1975 | 750 |
| refcoco | unc | testB | 1810 | 750 |
| refcoco+ | unc | train | 42278 | 16992 |
| refcoco+ | unc | val | 3805 | 1500 |
| refcoco+ | unc | testA | 1975 | 750 |
| refcoco+ | unc | testB | 1798 | 750 |
| refcocog | google | train | 44822 | 24698 |
| refcocog | google | val | 5000 | 4650 |
| refcocog | umd | train | 42226 | 21899 |
| refcocog | umd | val | 2573 | 1300 |
| refcocog | umd | test | 5023 | 2600 |
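The partitions above correspond to builder configs; for example, the UNC partition of RefCoco can be loaded per split as sketched below. The 'refcoco_unc' config name follows the visualization figure referenced further down, and this assumes the data has already been prepared, since RefCoco relies on manually downloaded COCO images.
import tensorflow_datasets as tfds

# Load the UNC partition of RefCoco; 'testA' holds people-only referents
# and 'testB' non-people referents, as described above.
ds = tfds.load('ref_coco/refcoco_unc', split='testA')

for ex in ds.take(1):
  print(list(ex.keys()))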
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('ref_coco', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/ref_coco-refcoco_unc-1.1.0.png
The Oxford-IIIT pet dataset is a 37-category pet image dataset with roughly 200 images for each class. The images have large variations in scale, pose and lighting. All images have an associated ground truth annotation of breed and species. Additionally, head bounding boxes are provided for the training split, allowing this dataset to be used for simple object detection tasks. In the test split, the bounding boxes are empty.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('oxford_iiit_pet', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
The dataset contains table-question pairs and the respective answers. The questions require multi-step reasoning and various data operations such as comparison, aggregation, and arithmetic computation. The tables were randomly selected among Wikipedia tables with at least 8 rows and 5 columns.
(As per the documentation usage notes)
Dev: Mean accuracy over three (not five) splits of the training data. In other words, train on 'split-{1,2,3}-train' and test on 'split-{1,2,3}-dev', respectively, then average the accuracy.
Test: Train on 'train' and test on 'test'.
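A sketch of the dev protocol described above, assuming the splits are exposed under the names 'split-{i}-train' and 'split-{i}-dev', with train_model and evaluate standing in for your own code:
import tensorflow_datasets as tfds

# Train on each of the three training folds, evaluate on the matching dev
# fold, then average the accuracies. train_model/evaluate are placeholders.
accuracies = []
for i in (1, 2, 3):
  train_ds = tfds.load('wiki_table_questions', split=f'split-{i}-train')
  dev_ds = tfds.load('wiki_table_questions', split=f'split-{i}-dev')
  model = train_model(train_ds)
  accuracies.append(evaluate(model, dev_ds))

print(sum(accuracies) / len(accuracies))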
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('wiki_table_questions', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
WikiHow is a new large-scale dataset using the online WikiHow (http://www.wikihow.com/) knowledge base.
There are two features:
- text: WikiHow answer texts.
- headline: bold lines as summary.
There are two separate versions:
- all: consisting of the concatenation of all paragraphs as the articles and the bold lines as the reference summaries.
- sep: consisting of each paragraph and its summary.
Download "wikihowAll.csv" and "wikihowSep.csv" from https://github.com/mahnazkoupaee/WikiHow-Dataset and place them in the manual folder (see https://www.tensorflow.org/datasets/api_docs/python/tfds/download/DownloadConfig). Train/validation/test splits are provided by the authors. Preprocessing is applied to remove short articles (abstract length < 0.75 article length) and clean up extra commas.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('wikihow', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
The *-CFQ datasets (and their splits) for measuring the scalability of compositional generalization.
See https://arxiv.org/abs/2012.08266 for background.
Example usage:
data = tfds.load('star_cfq/single_pool_10x_b_cfq')
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('star_cfq', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
BillSum, summarization of US Congressional and California state bills.
There are several features:
- text: bill text.
- summary: summary of the bill.
- title: title of the bill (a feature of US bills; CA bills do not have it).
- text_len: number of chars in text.
- sum_len: number of chars in summary.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('billsum', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
DART (DAta Record to Text generation) contains RDF entity-relation triples annotated with sentence descriptions that cover all facts in the triple set. DART was constructed using existing datasets such as WikiTableQuestions, WikiSQL, WebNLG and Cleaned E2E. The tables from WikiTableQuestions and WikiSQL were transformed into subject-predicate-object triples, and the text annotations were mainly collected from MTurk. The meaning representations in E2E were also transformed into triples and their descriptions were used; some that couldn't be transformed were dropped.
The dataset splits of E2E and WebNLG are kept, and for WikiTableQuestions and WikiSQL the Jaccard similarity is used to keep similar tables in the same set (train/dev/test).
This dataset is constructed following a standardized table format.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('dart', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
This dataset contains a sparse graph representing web link structure for a small subset of the Web.
It is a processed version of a single crawl performed by CommonCrawl in 2021, where we strip everything and keep only the link->outlinks structure. The final dataset is basically in int -> List[int] format, with each integer id representing a URL.
Also, in order to increase the value of this resource, we created 6 different versions of WebGraph, each varying in the sparsity pattern and locale. We took the following processing steps, in order:
| Version | Top level domain | Min count | Num nodes | Num edges |
|---|---|---|---|---|
| sparse | | 10 | 365.4M | 30B |
| dense | | 50 | 136.5M | 22B |
| de-sparse | de | 10 | 19.7M | 1.19B |
| de-dense | de | 50 | 5.7M | 0.82B |
| in-sparse | in | 10 | 1.5M | 0.14B |
| in-dense | in | 50 | 0.5M | 0.12B |
All versions of the dataset have the following features:
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('web_graph', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.