3 datasets found
  1. test

    • huggingface.co
    Updated Sep 27, 2016
    Cite
    DigitalBrainLab (2016). test [Dataset]. https://huggingface.co/datasets/DBL/test
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 27, 2016
    Dataset authored and provided by
    DigitalBrainLab
    Description

    The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License. Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger vocabulary and retains the original case, punctuation and numbers -… See the full description on the dataset page: https://huggingface.co/datasets/DBL/test.

  2. WikiText-2 Dataset

    • paperswithcode.com
    • opendatalab.com
    • +1 more
    Updated Dec 13, 2023
    Cite
    Stephen Merity; Caiming Xiong; James Bradbury; Richard Socher (2023). WikiText-2 Dataset [Dataset]. https://paperswithcode.com/dataset/wikitext-2
    Explore at:
    Dataset updated
    Dec 13, 2023
    Authors
    Stephen Merity; Caiming Xiong; James Bradbury; Richard Socher
    Description

    The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.

    Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger vocabulary and retains the original case, punctuation, and numbers, all of which are removed in PTB. As it is composed of full articles, the dataset is well suited for models that can take advantage of long-term dependencies.
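The PTB comparison above hinges on preprocessing: PTB's distributed text is lowercased, digit strings are replaced with a placeholder, and punctuation is stripped, while WikiText keeps the raw surface forms. A minimal illustrative sketch of that difference (the normalization rules here are an approximation for demonstration, not the exact PTB preprocessing pipeline):

```python
import re

def ptb_style_normalize(text: str) -> str:
    """Approximate PTB-style preprocessing (illustrative sketch):
    lowercase, map digit strings to 'N', and strip punctuation."""
    text = text.lower()
    text = re.sub(r"\d+(?:[.,]\d+)*", "N", text)  # numbers -> N placeholder
    text = re.sub(r"[^\w\s]", "", text)           # drop punctuation
    return " ".join(text.split())

raw = "The Battle of Hastings took place in 1066, near Hastings."
print(raw)                       # WikiText keeps case, punctuation, numbers
print(ptb_style_normalize(raw))  # the battle of hastings took place in N near hastings
```

Because WikiText skips this normalization step, models trained on it must handle capitalization, punctuation, and numerals directly, which is part of why its vocabulary is so much larger than PTB's.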

  3. WikiText-103 Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Oct 2, 2016
    Cite
    Stephen Merity; Caiming Xiong; James Bradbury; Richard Socher (2016). WikiText-103 Dataset [Dataset]. https://paperswithcode.com/dataset/wikitext-103
    Explore at:
    Dataset updated
    Oct 2, 2016
    Authors
    Stephen Merity; Caiming Xiong; James Bradbury; Richard Socher
    Description

    The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.

    Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger vocabulary and retains the original case, punctuation, and numbers, all of which are removed in PTB. As it is composed of full articles, the dataset is well suited for models that can take advantage of long-term dependencies.

