100+ datasets found
  1. squad_question_generation

    • tensorflow.org
    Updated Dec 6, 2022
    Cite
    (2022). squad_question_generation [Dataset]. http://doi.org/10.18653/v1/P17-1123
    63 scholarly articles cite this dataset (Google Scholar).
    Description

    Question generation using the SQuAD dataset, with the data splits described in 'Neural Question Generation from Text: A Preliminary Study' (Zhou et al., 2017) and 'Learning to Ask: Neural Question Generation for Reading Comprehension' (Du et al., 2017).
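
    The two papers' split schemes are exposed as separate builder configs in the TFDS catalog; the config names below ('split_du', 'split_zhou') are assumed from that naming and worth verifying against your installed version:

    import tensorflow_datasets as tfds

    # 'split_du' follows Du et al. (2017) and 'split_zhou' follows
    # Zhou et al. (2017); config names assumed, list them via
    # tfds.builder('squad_question_generation').builder_configs.
    ds = tfds.load('squad_question_generation/split_du', split='train')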

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('squad_question_generation', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

  2. wiki40b

    • tensorflow.org
    • opendatalab.com
    Updated Aug 30, 2023
    Cite
    (2023). wiki40b [Dataset]. https://www.tensorflow.org/datasets/catalog/wiki40b
    Description

    Cleaned-up text for 40+ Wikipedia language editions, restricted to pages that correspond to entities. The dataset has train/dev/test splits per language. It is cleaned by page filtering to remove disambiguation pages, redirect pages, deleted pages, and non-entity pages. Each example contains the Wikidata ID of the entity and the full Wikipedia article after processing that removes non-content sections and structured objects. The language models trained on this corpus, including 41 monolingual models and 2 multilingual models, can be found at https://tfhub.dev/google/collections/wiki40b-lm/1.
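
    Each language is a separate builder config; for example, English (the config name 'en' and the split names are assumed from the catalog's naming):

    import tensorflow_datasets as tfds

    # Per-language config; each language carries its own train/validation/test
    # splits. Config and split names assumed from the TFDS catalog.
    ds = tfds.load('wiki40b/en', split='validation')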

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('wiki40b', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

  3. rlu_atari_checkpoints_ordered

    • tensorflow.org
    Updated Dec 9, 2021
    Cite
    (2021). rlu_atari_checkpoints_ordered [Dataset]. https://www.tensorflow.org/datasets/catalog/rlu_atari_checkpoints_ordered
    Description

    RL Unplugged is a suite of benchmarks for offline reinforcement learning. It is designed around the following considerations: to facilitate ease of use, we provide the datasets with a unified API, which makes it easy for the practitioner to work with all data in the suite once a general pipeline has been established.

    The datasets follow the RLDS format to represent steps and episodes.

    We are releasing a large and diverse dataset of gameplay following the protocol described by Agarwal et al., 2020, which can be used to evaluate several discrete offline RL algorithms. The dataset is generated by running an online DQN agent and recording transitions from its replay during training with sticky actions (Machado et al., 2018). As stated in Agarwal et al., 2020, for each game we use data from five runs with 50 million transitions each. We release datasets for 46 Atari games. For details on how the dataset was generated, please refer to the paper. Please see this note about the ROM versions used to generate the datasets.

    Atari is a standard RL benchmark. We recommend trying offline RL methods on Atari if you are interested in comparing your approach to other state-of-the-art offline RL methods with discrete actions.

    The reward of each step is clipped to [-1, 1], and each episode includes the sum of the clipped rewards over the episode.

    Each of the configurations is broken into splits. Splits correspond to checkpoints of 1M steps (note that the number of episodes may differ). Checkpoints are ordered in time (so checkpoint 0 ran before checkpoint 1).

    Episodes within each split are ordered. Check https://www.tensorflow.org/datasets/determinism if you want to ensure that you read episodes in order.
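
    As a sketch, reading a single checkpoint of one game's run might look as follows (the config name 'Breakout_run_1' and the split name 'checkpoint_00' are assumed from the catalog's naming scheme):

    import tensorflow_datasets as tfds

    # Each config is one game/run pair and each split is a 1M-step
    # checkpoint; names assumed ('{game}_run_{n}', 'checkpoint_{kk}').
    ds = tfds.load('rlu_atari_checkpoints_ordered/Breakout_run_1',
                   split='checkpoint_00')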

    This dataset corresponds to the one used in the DQN replay paper: https://research.google/tools/datasets/dqn-replay/

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('rlu_atari_checkpoints_ordered', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

  4. scan

    • tensorflow.org
    • opendatalab.com
    Updated Dec 23, 2022
    Cite
    (2022). scan [Dataset]. https://www.tensorflow.org/datasets/catalog/scan
    Description

    SCAN tasks with various splits.

    SCAN is a set of simple language-driven navigation tasks for studying compositional learning and zero-shot generalization.

    Most splits are described at https://github.com/brendenlake/SCAN. For the MCD splits, please see https://arxiv.org/abs/1912.09713.

    Basic usage:

    data = tfds.load('scan/length')
    

    More advanced example:

    import tensorflow_datasets as tfds
    from tensorflow_datasets.datasets.scan import scan_dataset_builder
    
    data = tfds.load(
      'scan',
      builder_kwargs=dict(
        config=scan_dataset_builder.ScanConfig(
          name='simple_p8', directory='simple_split/size_variations')))
    

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('scan', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

  5. spoc_robot

    • tensorflow.org
    Updated Dec 11, 2024
    Cite
    (2024). spoc_robot [Dataset]. https://www.tensorflow.org/datasets/catalog/spoc_robot
    Description

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('spoc_robot', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

  6. coco

    • tensorflow.org
    • huggingface.co
    Updated Jun 1, 2024
    Cite
    (2024). coco [Dataset]. https://www.tensorflow.org/datasets/catalog/coco
    Description

    COCO is a large-scale object detection, segmentation, and captioning dataset.

    Note:
    • Some images from the train and validation sets don't have annotations.
    • COCO 2014 and 2017 use the same images but different train/val/test splits.
    • The test split doesn't have any annotations (only images).
    • COCO defines 91 classes, but the data only uses 80 of them.
    • Panoptic annotations define 200 classes, but only 133 are used.
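
    The 2014 and 2017 releases are exposed as separate builder configs; for example (the config name '2017' is assumed from the catalog):

    import tensorflow_datasets as tfds

    # '2017' selects the COCO 2017 release (config name assumed); the
    # validation split carries annotations, while the test split does not.
    ds = tfds.load('coco/2017', split='validation')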

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('coco', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/coco-2014-1.1.0.png

  7. crema_d

    • tensorflow.org
    • opendatalab.com
    Updated Dec 6, 2022
    Cite
    (2022). crema_d [Dataset]. https://www.tensorflow.org/datasets/catalog/crema_d
    Description

    CREMA-D is an audio-visual dataset for emotion recognition. The dataset consists of facial and vocal emotional expressions in sentences spoken in a range of basic emotional states (happy, sad, anger, fear, disgust, and neutral). 7,442 clips of 91 actors with diverse ethnic backgrounds were collected. This release contains only the audio stream from the original audio-visual recordings. The samples are split between train, validation, and test so that samples from each speaker belong to exactly one split.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('crema_d', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

  8. imagenet2012

    • tensorflow.org
    Updated Jun 1, 2024
    Cite
    (2024). imagenet2012 [Dataset]. https://www.tensorflow.org/datasets/catalog/imagenet2012
    Description

    ILSVRC 2012, commonly known as 'ImageNet', is an image dataset organized according to the WordNet hierarchy. Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a "synonym set" or "synset". There are more than 100,000 synsets in WordNet; the majority of them are nouns (80,000+). In ImageNet, we aim to provide on average 1,000 images to illustrate each synset. Images of each concept are quality-controlled and human-annotated. Upon completion, we hope ImageNet will offer tens of millions of cleanly sorted images for most of the concepts in the WordNet hierarchy.

    The test split contains 100K images but no labels because no labels have been publicly released. We provide support for the test split from 2012 with the minor patch released on October 10, 2019. In order to manually download this data, a user must perform the following operations:

    1. Download the 2012 test split available here.
    2. Download the October 10, 2019 patch. There is a Google Drive link to the patch provided on the same page.
    3. Combine the two tar-balls, manually overwriting any images in the original archive with images from the patch. According to the instructions on image-net.org, this procedure overwrites just a few images.

    The resulting tar-ball may then be processed by TFDS.

    To assess the accuracy of a model on the ImageNet test split, one must run inference on all images in the split, export those results to a text file, and upload that file to the ImageNet evaluation server. The maintainers of the evaluation server permit a single user up to 2 submissions per week in order to prevent overfitting.

    To evaluate accuracy on the test split, one must first create an account at image-net.org. This account must be approved by the site administrator. After the account is created, one can submit the results to the test server at https://image-net.org/challenges/LSVRC/eval_server.php. The submission consists of several ASCII text files corresponding to multiple tasks. The task of interest is "Classification submission (top-5 cls error)". A sample of an exported text file looks like the following:

    771 778 794 387 650
    363 691 764 923 427
    737 369 430 531 124
    755 930 755 59 168
    

    The export format is described in full in "readme.txt" within the 2013 development kit available here: https://image-net.org/data/ILSVRC/2013/ILSVRC2013_devkit.tgz. Please see the section entitled "3.3 CLS-LOC submission format". Briefly, the format of the text file is 100,000 lines, one per image in the test split. Each line of integers corresponds to the rank-ordered, top-5 predictions for that test image. The integers are 1-indexed, corresponding to line numbers in the associated labels file. See labels.txt.
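
    As a minimal sketch of that format, the following writes one space-separated line of five 1-indexed labels per test image (the predictions variable is a hypothetical placeholder):

    # Hypothetical placeholder: one top-5 prediction list per test image.
    predictions = [[771, 778, 794, 387, 650]] * 100_000

    # Write the submission file: 100,000 lines, five integers per line.
    with open('submission.txt', 'w') as f:
        for top5 in predictions:
            f.write(' '.join(str(label) for label in top5) + '\n')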

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('imagenet2012', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/imagenet2012-5.1.0.png

  9. speech_commands

    • tensorflow.org
    • datasets.activeloop.ai
    Updated Jan 13, 2023
    Cite
    (2023). speech_commands [Dataset]. http://identifiers.org/arxiv:1804.03209
    Description

    An audio dataset of spoken words designed to help train and evaluate keyword spotting systems. Its primary goal is to provide a way to build and test small models that detect when a single word is spoken, from a set of ten target words, with as few false positives as possible from background noise or unrelated speech. Note that in the train and validation sets, the label "unknown" is much more prevalent than the labels of the target words or background noise. One difference from the released version is the handling of silent segments. While in the test set the silence segments are regular one-second files, in the training set they are provided as long segments under the "background_noise" folder. Here we split this background noise into one-second clips and also keep one of the files for the validation set.
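
    To map the integer labels back to keyword strings, the DatasetInfo returned by with_info=True exposes the class names (standard TFDS API, shown as a sketch):

    import tensorflow_datasets as tfds

    # info.features['label'].names lists the label vocabulary, i.e. the
    # target words plus the special "unknown" and silence classes.
    ds, info = tfds.load('speech_commands', split='train', with_info=True)
    print(info.features['label'].names)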

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('speech_commands', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

  10. cifar10

    • tensorflow.org
    • opendatalab.com
    Updated Jun 1, 2024
    Cite
    (2024). cifar10 [Dataset]. https://www.tensorflow.org/datasets/catalog/cifar10
    Description

    The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('cifar10', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/cifar10-3.0.2.png

  11. wikipedia

    • tensorflow.org
    • huggingface.co
    Cite
    wikipedia [Dataset]. https://www.tensorflow.org/datasets/catalog/wikipedia
    Description

    Wikipedia dataset containing cleaned articles in all languages. The datasets are built from the Wikipedia dumps (https://dumps.wikimedia.org/), with one split per language. Each example contains the content of one full Wikipedia article, cleaned to strip markup and unwanted sections (references, etc.).
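
    Each dump date and language pair is a separate builder config; for example (the config name below follows the catalog's 'YYYYMMDD.lang' pattern, and the exact available dates should be checked):

    import tensorflow_datasets as tfds

    # Config names combine dump date and language code; '20201201.en' is
    # assumed here, check the catalog for the dates your version ships.
    ds = tfds.load('wikipedia/20201201.en', split='train')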

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('wikipedia', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

  12. civil_comments

    • tensorflow.org
    • huggingface.co
    Updated Feb 28, 2023
    Cite
    (2023). civil_comments [Dataset]. https://www.tensorflow.org/datasets/catalog/civil_comments
    Description

    This version of the CivilComments dataset provides access to the primary seven labels that were annotated by crowd workers; the toxicity and other tags are values between 0 and 1 indicating the fraction of annotators who assigned these attributes to the comment text.

    The other tags are only available for a fraction of the input examples. They are currently ignored for the main dataset; the CivilCommentsIdentities set includes those labels, but only consists of the subset of the data with them. The other attributes that were part of the original CivilComments release are included only in the raw data. See the Kaggle documentation for more details about the available features.

    The comments in this dataset come from an archive of the Civil Comments platform, a commenting plugin for independent news sites. These public comments were created from 2015 to 2017 and appeared on approximately 50 English-language news sites across the world. When Civil Comments shut down in 2017, they chose to make the public comments available in a lasting open archive to enable future research. The original data, published on figshare, includes the public comment text and some associated metadata such as article IDs, publication IDs, timestamps, and commenter-generated "civility" labels, but does not include user IDs. Jigsaw extended this dataset by adding additional labels for toxicity and identity mentions, as well as covert offensiveness. This dataset is an exact replica of the data released for the Jigsaw Unintended Bias in Toxicity Classification Kaggle challenge. This dataset is released under CC0, as is the underlying comment text.

    For comments that have a parent_id also in the civil comments data, the text of the previous comment is provided as the "parent_text" feature. Note that the splits were made without regard to this information, so using previous comments may leak some information. The annotators did not have access to the parent text when making the labels.
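
    Because the labels are annotator fractions rather than hard classes, a common pattern is to binarize them at a threshold; 0.5 below is an illustrative choice, not part of the dataset:

    import tensorflow_datasets as tfds

    ds = tfds.load('civil_comments', split='train')
    for ex in ds.take(4):
        # 'toxicity' holds the fraction of annotators who marked the
        # comment toxic; thresholding at 0.5 is one common choice.
        is_toxic = ex['toxicity'].numpy() >= 0.5
        print(ex['text'].numpy()[:80], is_toxic)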

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('civil_comments', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

  13. ref_coco

    • tensorflow.org
    • opendatalab.com
    Updated May 31, 2024
    Cite
    (2024). ref_coco [Dataset]. https://www.tensorflow.org/datasets/catalog/ref_coco
    Description

    A collection of 3 referring expression datasets based on images in the COCO dataset. A referring expression is a piece of text that describes a unique object in an image. These datasets are collected by asking human raters to disambiguate objects delineated by bounding boxes in the COCO dataset.

    RefCoco and RefCoco+ are from Kazemzadeh et al. 2014. RefCoco+ expressions are strictly appearance-based descriptions, which was enforced by preventing raters from using location-based descriptions (e.g., "person to the right" is not a valid description for RefCoco+). RefCocoG is from Mao et al. 2016 and has richer descriptions of objects compared to RefCoco due to differences in the annotation process. In particular, RefCoco was collected in an interactive game-based setting, while RefCocoG was collected in a non-interactive setting. On average, RefCocoG has 8.4 words per expression while RefCoco has 3.5 words.

    Each dataset has different split allocations that are typically all reported in papers. The "testA" and "testB" sets in RefCoco and RefCoco+ contain only people and only non-people, respectively. Images are partitioned into the various splits. In the "google" split, objects, not images, are partitioned between the train and non-train splits. This means that the same image can appear in both the train and validation splits, but the objects being referred to will differ between the two sets. In contrast, the "unc" and "umd" splits partition images between the train, validation, and test splits. In RefCocoG, the "google" split does not have a canonical test set, and the validation set is typically reported in papers as "val*".

    Stats for each dataset and split ("refs" is the number of referring expressions, and "images" is the number of images):

    dataset   partition  split  refs   images
    refcoco   google     train  40000  19213
    refcoco   google     val     5000   4559
    refcoco   google     test    5000   4527
    refcoco   unc        train  42404  16994
    refcoco   unc        val     3811   1500
    refcoco   unc        testA   1975    750
    refcoco   unc        testB   1810    750
    refcoco+  unc        train  42278  16992
    refcoco+  unc        val     3805   1500
    refcoco+  unc        testA   1975    750
    refcoco+  unc        testB   1798    750
    refcocog  google     train  44822  24698
    refcocog  google     val     5000   4650
    refcocog  umd        train  42226  21899
    refcocog  umd        val     2573   1300
    refcocog  umd        test    5023   2600
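
    Each dataset/partition pair above is a builder config; for example, RefCoco with the unc partition (config name 'refcoco_unc', matching the visualization below):

    import tensorflow_datasets as tfds

    # 'refcoco_unc' = RefCoco with the unc partition; per the table above,
    # 'testA' is its people-only evaluation split.
    ds = tfds.load('ref_coco/refcoco_unc', split='testA')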

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('ref_coco', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/ref_coco-refcoco_unc-1.1.0.png

  14. oxford_iiit_pet

    • tensorflow.org
    • opendatalab.com
    Updated Mar 14, 2025
    Cite
    (2025). oxford_iiit_pet [Dataset]. https://www.tensorflow.org/datasets/catalog/oxford_iiit_pet
    Description

    The Oxford-IIIT Pet dataset is a 37-category pet image dataset with roughly 200 images per class. The images have large variations in scale, pose, and lighting. All images have an associated ground-truth annotation of breed and species. Additionally, head bounding boxes are provided for the training split, allowing the dataset to be used for simple object detection tasks. In the test split, the bounding boxes are empty.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('oxford_iiit_pet', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

  15. wiki_table_questions

    • tensorflow.org
    Updated Dec 6, 2022
    Cite
    (2022). wiki_table_questions [Dataset]. https://www.tensorflow.org/datasets/catalog/wiki_table_questions
    Description

    The dataset contains table-question pairs and the respective answers. The questions require multi-step reasoning and various data operations such as comparison, aggregation, and arithmetic computation. The tables were randomly selected among Wikipedia tables with at least 8 rows and 5 columns.

    As per the documentation usage notes:

    • Dev: Mean accuracy over three (not five) splits of the training data. In other words, train on 'split-{1,2,3}-train' and test on 'split-{1,2,3}-dev', respectively, then average the accuracy (sketched below).

    • Test: Train on 'train' and test on 'test'.
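
    A sketch of the Dev protocol under those split names (the accuracy helper is a hypothetical stand-in for your model's train-and-evaluate loop):

    import tensorflow_datasets as tfds

    def accuracy(train_ds, dev_ds):
        # Hypothetical stand-in: train on train_ds, return accuracy on dev_ds.
        return 0.0

    # Train on 'split-{i}-train', test on 'split-{i}-dev' for i in 1..3,
    # then average the three accuracies.
    scores = []
    for i in (1, 2, 3):
        train = tfds.load('wiki_table_questions', split=f'split-{i}-train')
        dev = tfds.load('wiki_table_questions', split=f'split-{i}-dev')
        scores.append(accuracy(train, dev))
    print(sum(scores) / len(scores))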

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('wiki_table_questions', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

  16. wikihow

    • tensorflow.org
    • opendatalab.com
    Updated Dec 6, 2022
    Cite
    (2022). wikihow [Dataset]. https://www.tensorflow.org/datasets/catalog/wikihow
    Description

    WikiHow is a new large-scale dataset using the online WikiHow (http://www.wikihow.com/) knowledge base.

    There are two features:
    • text: WikiHow answer texts.
    • headline: bold lines as summary.

    There are two separate versions:
    • all: the concatenation of all paragraphs as the article and the bold lines as the reference summary.
    • sep: each paragraph paired with its summary.

    Download "wikihowAll.csv" and "wikihowSep.csv" from https://github.com/mahnazkoupaee/WikiHow-Dataset and place them in manual folder https://www.tensorflow.org/datasets/api_docs/python/tfds/download/DownloadConfig. Train/validation/test splits are provided by the authors. Preprocessing is applied to remove short articles (abstract length < 0.75 article length) and clean up extra commas.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('wikihow', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

  17. star_cfq

    • tensorflow.org
    Updated Feb 13, 2021
    Cite
    (2021). star_cfq [Dataset]. https://www.tensorflow.org/datasets/catalog/star_cfq
    Description

    The *-CFQ datasets (and their splits) for measuring the scalability of compositional generalization.

    See https://arxiv.org/abs/2012.08266 for background.

    Example usage:

    data = tfds.load('star_cfq/single_pool_10x_b_cfq')
    

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('star_cfq', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

  18. billsum

    • tensorflow.org
    • opendatalab.com
    Updated Dec 6, 2022
    Cite
    (2022). billsum [Dataset]. https://www.tensorflow.org/datasets/catalog/billsum
    Description

    BillSum, summarization of US Congressional and California state bills.

    There are several features:
    • text: bill text.
    • summary: summary of the bill.
    • title: title of the bill (present for US bills; CA bills do not have titles).
    • text_len: number of characters in the text.
    • sum_len: number of characters in the summary.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('billsum', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

  19. dart

    • tensorflow.org
    • opendatalab.com
    Updated Dec 6, 2022
    Cite
    (2022). dart [Dataset]. https://www.tensorflow.org/datasets/catalog/dart
    Description

    DART (DAta Record to Text generation) contains RDF entity-relation triples annotated with sentence descriptions that cover all facts in the triple set. DART was constructed from existing datasets: WikiTableQuestions, WikiSQL, WebNLG, and Cleaned E2E. The tables from WikiTableQuestions and WikiSQL were transformed into subject-predicate-object triples, and their text annotations were mainly collected from MTurk. The meaning representations in E2E were also transformed into triples and their descriptions were used; those that couldn't be transformed were dropped.

    The dataset splits of E2E and WebNLG are kept, and for WikiTableQuestions and WikiSQL the Jaccard similarity is used to keep similar tables in the same set (train/dev/test).

    This dataset is constructed following a standardized table format.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('dart', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

  20. web_graph

    • tensorflow.org
    Updated Nov 23, 2022
    Cite
    (2022). web_graph [Dataset]. http://identifiers.org/arxiv:2112.02194
    Description

    This dataset contains a sparse graph representing web link structure for a small subset of the Web.

    It's a processed version of a single crawl performed by Common Crawl in 2021, where we strip everything and keep only the link -> outlinks structure. The final dataset is essentially in int -> List[int] format, with each integer ID representing a URL.

    Also, in order to increase the value of this resource, we created 6 different versions of WebGraph, each varying in sparsity pattern and locale. We took the following processing steps, in order:

    • We started with WAT files from the June 2021 crawl.
    • Since the outlinks in HTTP-Response-Metadata are stored as relative paths, we convert them to absolute paths using urllib after validating each link.
    • To study locale-specific graphs, we further filter based on 2 top-level domains: 'de' and 'in', each producing a graph with an order of magnitude fewer nodes.
    • These graphs can still have arbitrary sparsity patterns and dangling links. Thus we further filter the nodes in each graph to have a minimum of K ∈ {10, 50} inlinks and outlinks. Note that we only do this processing once, so this is still an approximation, i.e. the resulting graph might have nodes with fewer than K links.
    • Using both locale and count filters, we finalized 6 versions of the WebGraph dataset, summarized in the following table.
    Version    Top-level domain  Min count  Num nodes  Num edges
    sparse     -                 10         365.4M     30B
    dense      -                 50         136.5M     22B
    de-sparse  de                10         19.7M      1.19B
    de-dense   de                50         5.7M       0.82B
    in-sparse  in                10         1.5M       0.14B
    in-dense   in                50         0.5M       0.12B

    All versions of the dataset have the following features (a reading sketch follows the list):

    • "row_tag": a unique identifier of the row (source link).
    • "col_tag": a list of unique identifiers of non-zero columns (dest outlinks).
    • "gt_tag": a list of unique identifiers of non-zero columns used as ground truth (dest outlinks), empty for train/train_t splits.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('web_graph', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.
