24 datasets found
  1. WikiText-103 Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Sep 27, 2016
    Cite
    Stephen Merity; Caiming Xiong; James Bradbury; Richard Socher (2016). WikiText-103 Dataset [Dataset]. https://paperswithcode.com/dataset/wikitext-103
    Authors
    Stephen Merity; Caiming Xiong; James Bradbury; Richard Socher
    Description

    The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.

    Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger vocabulary and retains the original case, punctuation and numbers - all of which are removed in PTB. As it is composed of full articles, the dataset is well suited for models that can take advantage of long term dependencies.
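    In practice, the corpus is most often pulled from the Hugging Face hub rather than the original download links. A minimal sketch, assuming the standard "wikitext" dataset with the "wikitext-103-v1" configuration name:

        # Minimal sketch: load WikiText-103 via the Hugging Face hub mirror.
        # Assumes the `datasets` library and the standard "wikitext-103-v1" config.
        from datasets import load_dataset

        wikitext = load_dataset("wikitext", "wikitext-103-v1")

        # Train/validation/test splits of raw article text, one line per record.
        print(wikitext)
        print(wikitext["train"][10]["text"])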

  2. wikitext2

    • huggingface.co
    • paperswithcode.com
    • +1 more
    Updated Oct 21, 2023
    Cite
    Jan Karsten Kuhnke (2023). wikitext2 [Dataset]. https://huggingface.co/datasets/mindchain/wikitext2
    Authors
    Jan Karsten Kuhnke
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Dataset Card for "wikitext"

    Dataset Summary

    The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License. Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far… See the full description on the dataset page: https://huggingface.co/datasets/mindchain/wikitext2.

  3. Wikitext-103

    • academictorrents.com
    Updated Oct 16, 2018
    Cite
    Stephen Merity et al., 2016 (2018). Wikitext-103 [Dataset]. https://academictorrents.com/details/a4fee5547056c845e31ab952598f43b42333183c
    Available download formats: bittorrent (190200704)
    Dataset authored and provided by
    Stephen Merity et al., 2016
    License

    No license specified: https://academictorrents.com/nolicensespecified

    Description

    A collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. Widely used for language modeling, including the pretrained models used in the fastai library and ULMFiT algorithm.

  4. wikitext-103-v1

    • huggingface.co
    Updated Apr 10, 2025
    Cite
    ChulwonChoi (2025). wikitext-103-v1 [Dataset]. https://huggingface.co/datasets/cchoi1022/wikitext-103-v1
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Authors
    ChulwonChoi
    Description

    cchoi1022/wikitext-103-v1 dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. Wikitext-2

    • academictorrents.com
    Updated Oct 16, 2018
    Cite
    Stephen Merity et al., 2016 (2018). Wikitext-2 [Dataset]. https://academictorrents.com/details/ac7ffa98b66427246a316a81b2ea31c9b58ea5b6
    Available download formats: bittorrent (4070055)
    Dataset authored and provided by
    Stephen Merity et al., 2016
    License

    No license specified: https://academictorrents.com/nolicensespecified

    Description

    A subset of Wikitext-103; useful for testing language model training on smaller datasets.

  6. Wikitext-103 and OpenWebText Models

    • explore.openaire.eu
    • zenodo.org
    Updated Sep 27, 2020
    Cite
    Forrest Davis (2020). Wikitext-103 and OpenWebText Models [Dataset]. http://doi.org/10.5281/zenodo.4053571
    Authors
    Forrest Davis
    Description

    This repository contains 25 Wikitext-103 LSTM models and 25 LSTM models trained on a 100 million token subset of the OpenWebTextCorpus. Training/validation/test data is included with the Web models. By-epoch validation perplexity is given in the logs (within the directory for the models). Please write to me if you have any questions :)

  7. WikiText-103 & 2

    • live.european-language-grid.eu
    Updated Dec 30, 2016
    Cite
    (2016). WikiText-103 & 2 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/5169
    Available download formats: txt
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    The dataset contains word- and character-level tokens extracted from Wikipedia.

  8. wikitext-103-raw-v1_sents_min_len10_max_len30_DEBUG

    • huggingface.co
    Updated Nov 23, 2011
    Cite
    Carlos E. Jimenez (2011). wikitext-103-raw-v1_sents_min_len10_max_len30_DEBUG [Dataset]. https://huggingface.co/datasets/carlosejimenez/wikitext-103-raw-v1_sents_min_len10_max_len30_DEBUG
    Available download formats: Croissant
    Authors
    Carlos E. Jimenez
    Description

    carlosejimenez/wikitext-103-raw-v1_sents_min_len10_max_len30_DEBUG dataset hosted on Hugging Face and contributed by the HF Datasets community

  9. wikitext-103-raw-v1-para-permute-1

    • huggingface.co
    Cite
    Tongyao, wikitext-103-raw-v1-para-permute-1 [Dataset]. https://huggingface.co/datasets/tyzhu/wikitext-103-raw-v1-para-permute-1
    Available download formats: Croissant
    Authors
    Tongyao
    Description

    Dataset Card for "wikitext-103-raw-v1-para-permute-1"

    More Information needed

  10. wikitext-103-raw-v1-5percent

    • huggingface.co
    Cite
    Fuheng Wu, wikitext-103-raw-v1-5percent [Dataset]. https://huggingface.co/datasets/wufuheng/wikitext-103-raw-v1-5percent
    Available download formats: Croissant
    Authors
    Fuheng Wu
    Description

    wufuheng/wikitext-103-raw-v1-5percent dataset hosted on Hugging Face and contributed by the HF Datasets community

  11. wikitext-103-stanza

    • huggingface.co
    Updated Apr 23, 2025
    Cite
    Esen Ergun (2025). wikitext-103-stanza [Dataset]. https://huggingface.co/datasets/esenergun/wikitext-103-stanza
    Authors
    Esen Ergun
    Description

    Dataset Card for "wikitext-103-stanza"

    More Information needed

  12. PTNews Corpus

    • data.niaid.nih.gov
    Updated Jun 27, 2020
    Cite
    Nunes, Davide (2020). PTNews Corpus [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3908506
    Dataset authored and provided by
    Nunes, Davide
    Description

    The PTNews Corpus is a collection of over 19 million tokens extracted from 10 years of political news articles (in Portuguese) from the Portuguese newspaper PÚBLICO. The corpus is available under the Creative Commons Attribution-NonCommercial-ShareAlike Licence. The material contained on the PTNews Corpus is © 2010-2020 PÚBLICO Comunicação Social SA.

    The corpus's size falls between those of the preprocessed Penn Treebank (PTB) and WikiText-103. Similarly to WikiText, PTNews has a larger vocabulary than PTB and retains the original case, punctuation and numbers. The corpus contains over 31,000 publicly available full articles, which makes it well suited for models that can take advantage of long-term dependencies.

    The corpus is available as a word-level collection of articles in two versions: the first (ptnews_origin) contains a single file with all the articles in the form title, URL, date, body; the second contains only the title and body of each article and is split into train, validation, and test sets. In the processed version, words with fewer than 3 occurrences are mapped to an unknown-word token. Each sentence in an article body occupies a single line of the dataset, and the end of a paragraph is marked with a tag at the end of its last sentence. Portuguese words resulting from contractions such as "desta" or "nesta" are separated into "d" "esta" and "n" "esta", respectively.
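    To make the vocabulary rule concrete, here is a minimal sketch of mapping rare words to an unknown-word placeholder; the "<unk>" token name is an assumption, since the description above does not render the corpus's actual tag:

        # Minimal sketch of the stated rule: words with fewer than 3
        # occurrences are replaced by an unknown-word token. "<unk>" is an
        # assumed name; the corpus description does not render the real tag.
        from collections import Counter

        def build_vocab(tokens, min_count=3):
            counts = Counter(tokens)
            return {w for w, c in counts.items() if c >= min_count}

        def map_oov(tokens, vocab, unk="<unk>"):
            return [w if w in vocab else unk for w in tokens]

        words = "a b a c a b d".split()
        print(map_oov(words, build_vocab(words)))  # ['a', '<unk>', 'a', ...]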

    Sample article (in Portuguese, shown verbatim in the corpus's tokenized format):

    Carlos César : Cavaco " cansado e sem entusiasmo " quis afastar responsabilidades sobre a crise https://publico.pt/2010/06/10/politica/noticia/carlos-cesar-cavaco-cansado-e-sem-entusiasmo-quis-afastar-responsabilidades-sobre-a-crise-1441369 2010-06-10 15:38:00

    O presidente do Governo Regional dos Açores , Carlos César , considerou hoje que Cavaco Silva esteve " cansado e sem entusiasmo " no discurso do Dia de Portugal , onde afastou responsabilidades sobre a actual crise . " O país ouviu um Presidente cansado e sem entusiasmo , que andou às voltas com os papéis para dizer que não tinha nada a ver com as razões da crise " , afirmou Carlos César , num comentário à Lusa sobre o discurso do Presidente da República na cerimónia oficial do 10 de Junho , realizada em Faro . Carlos César considerou , no entanto , " positivo " que Cavaco Silva tenha feito " um discurso alinhado com um tema recorrente na apreciação do momento que vivemos , o da coesão e da corresponsabilização " . No mesmo sentido , manifestou concordância com o apelo que Cavaco Silva fez " à responsabilidade dos empregadores e empregados " , mas deixou um alerta relativamente à referência do Presidente da República à necessidade de " limpar Portugal " . Para Carlos César , se essa referência " for despida de conteúdo institucional útil , tratou-se de mais um discurso que se perderá na babugem política d aquilo que Cavaco Silva entendeu recordar como o ' rectângulo ' " .

    Reporting Results

    If you wish to report results or other resources obtained on the PTNews Corpus, contact Davide Nunes with the following information:

    Task: e.g. Language Modelling, Semantic Similarity, etc;

    Publication URL: url to published article or preprint;

    Type of Model: LSTM Neural Network, n-grams, GloVe vectors, etc;

    Evaluation Metrics: e.g. validation and testing perplexities in the case of language modelling.

    Reported results will be listed on the corpus page.

    Preprocessed Corpus Statistics

    articles: 31,919

    articles by split: train 25,537; test 3,191; val 3,191

    unique tokens: 68,318

    unique OoV tokens: 76,157

    total tokens: 19,021,661

    total OoV tokens: 95,043

    OoV rate: 0.5%

    tokens by split: train 15,242,995; test 1,895,184; val 1,883,482

    Contact Information

    If you have questions about the corpus or want to report benchmark results, contact Davide Nunes.

  13. wikitext-103-raw-v1-sent-permute-1

    • huggingface.co
    Cite
    Tongyao, wikitext-103-raw-v1-sent-permute-1 [Dataset]. https://huggingface.co/datasets/tyzhu/wikitext-103-raw-v1-sent-permute-1
    Available download formats: Croissant
    Authors
    Tongyao
    Description

    Dataset Card for "wikitext-103-raw-v1-sent-permute-1"

    More Information needed

  14. Webis-Context-sensitive-Word-Search-Queries-2022

    • anthology.aicmu.ac.cn
    • webis.de
    Updated 2022
    Cite
    Matti Wiegmann; Martin Potthast; Benno Stein (2022). Webis-Context-sensitive-Word-Search-Queries-2022 [Dataset]. http://doi.org/10.5281/zenodo.6425595
    Dataset provided by
    Bauhaus-Universität Weimar
    Leipzig University
    The Web Technology & Information Systems Network
    Authors
    Matti Wiegmann; Martin Potthast; Benno Stein
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains two datasets of word search queries. Each word search query consists of a token n-gram with one wildcard token ([MASK]). The answers to each query are the most likely tokens to replace the mask. All queries originate from wikitext-103 and CLOTH; the respective source is annotated for each query.

    The original-token dataset lists exactly one top answer for each query. The ranked-answers dataset lists multiple, sorted answers in three relevance categories, where 3 is the most relevant. Please refer to the citation for more details.
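    To make the query format concrete, a masked language model can produce candidate answers for such a query. A minimal sketch using the Hugging Face transformers fill-mask pipeline; the query string is invented for illustration, not drawn from the dataset:

        # Minimal sketch: answer a word-search query (an n-gram with one
        # [MASK]) using a masked language model. The query below is a
        # made-up example, not an actual record from the Webis dataset.
        from transformers import pipeline

        fill = pipeline("fill-mask", model="bert-base-uncased")

        query = "the [MASK] of the united states"
        for candidate in fill(query, top_k=3):
            print(candidate["token_str"], round(candidate["score"], 3))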

  15. wikitext-103-raw-v1-seq512-tokenized-grouped

    • huggingface.co
    Updated Apr 3, 2025
    Cite
    Bluebrain (2025). wikitext-103-raw-v1-seq512-tokenized-grouped [Dataset]. https://huggingface.co/datasets/BluebrainAI/wikitext-103-raw-v1-seq512-tokenized-grouped
    Dataset authored and provided by
    Bluebrain
    Description

    BluebrainAI/wikitext-103-raw-v1-seq512-tokenized-grouped dataset hosted on Hugging Face and contributed by the HF Datasets community
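    The dataset name suggests the corpus was pre-tokenized and packed into fixed 512-token sequences. The actual preprocessing script is not published in this listing, but a common recipe for producing such a variant looks roughly like the sketch below; the tokenizer choice and column names are assumptions:

        # Rough sketch of "tokenize and group into 512-token blocks"
        # preprocessing. Tokenizer and column names are assumptions; this is
        # not the BluebrainAI script itself.
        from datasets import load_dataset
        from transformers import AutoTokenizer

        block_size = 512
        tok = AutoTokenizer.from_pretrained("gpt2")
        raw = load_dataset("wikitext", "wikitext-103-raw-v1")

        def tokenize(batch):
            return tok(batch["text"])

        def group(batch):
            # Concatenate all token ids, then split into fixed-size blocks,
            # dropping the ragged remainder.
            ids = sum(batch["input_ids"], [])
            n = (len(ids) // block_size) * block_size
            return {"input_ids": [ids[i:i + block_size] for i in range(0, n, block_size)]}

        tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])
        grouped = tokenized.map(group, batched=True,
                                remove_columns=tokenized["train"].column_names)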

  16. wikitext-103-raw-v1_sents_min_len10_max_len30_openai_clip-vit-base-patch32

    • huggingface.co
    Updated Sep 25, 2022
    Cite
    Carlos E. Jimenez (2022). wikitext-103-raw-v1_sents_min_len10_max_len30_openai_clip-vit-base-patch32 [Dataset]. https://huggingface.co/datasets/carlosejimenez/wikitext-103-raw-v1_sents_min_len10_max_len30_openai_clip-vit-base-patch32
    Available download formats: Croissant
    Authors
    Carlos E. Jimenez
    Description

    carlosejimenez/wikitext-103-raw-v1_sents_min_len10_max_len30_openai_clip-vit-base-patch32 dataset hosted on Hugging Face and contributed by the HF Datasets community

  17. wikitext_103

    • huggingface.co
    Updated May 9, 2025
    Cite
    Evaluation datasets (2025). wikitext_103 [Dataset]. https://huggingface.co/datasets/lighteval/wikitext_103
    Available download formats: Croissant
    Dataset authored and provided by
    Evaluation datasets
    Description

    Wikitext-103 dataset from this paper: https://arxiv.org/pdf/1609.07843.pdf

    Gopher's authors concatenate all the articles, set the context length to n/2 (n = max_seq_len), and use the "closed vocabulary" variant of the dataset for evaluation. In contrast, we evaluate the model on each article independently, use single-token contexts (except for the last sequence in each document), and use the raw dataset.
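    For contrast with Gopher's setup, here is a bare-bones sketch of per-article perplexity in the spirit of the description above; GPT-2 stands in for the evaluated model, and the split and field names are assumptions, so this is an illustration rather than the lighteval harness itself:

        # Bare-bones sketch of per-article perplexity, evaluating each
        # article independently as the card describes. GPT-2, the "test"
        # split, and the "text" field are all assumptions for illustration.
        import math
        import torch
        from datasets import load_dataset
        from transformers import AutoModelForCausalLM, AutoTokenizer

        tok = AutoTokenizer.from_pretrained("gpt2")
        model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

        ds = load_dataset("lighteval/wikitext_103", split="test")

        losses, counts = [], []
        for article in ds.select(range(3)):  # a few articles, for speed
            ids = tok(article["text"], return_tensors="pt", truncation=True).input_ids
            if ids.size(1) < 2:
                continue
            with torch.no_grad():
                out = model(ids, labels=ids)
            losses.append(out.loss.item() * (ids.size(1) - 1))
            counts.append(ids.size(1) - 1)

        print("perplexity:", math.exp(sum(losses) / sum(counts)))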
    
  18. wikitext-103-raw-v1-rwkv-v5-tokenized

    • huggingface.co
    Updated Sep 12, 2024
    Cite
    Ivan Rodkin, wikitext-103-raw-v1-rwkv-v5-tokenized [Dataset]. https://huggingface.co/datasets/irodkin/wikitext-103-raw-v1-rwkv-v5-tokenized
    Authors
    Ivan Rodkin
    Description

    irodkin/wikitext-103-raw-v1-rwkv-v5-tokenized dataset hosted on Hugging Face and contributed by the HF Datasets community

  19. wikitext-103-raw-v1_gpt2-20k

    • huggingface.co
    Updated Feb 12, 2024
    Cite
    Pietro Lesci (2024). wikitext-103-raw-v1_gpt2-20k [Dataset]. https://huggingface.co/datasets/pietrolesci/wikitext-103-raw-v1_gpt2-20k
    Available download formats: Croissant
    Authors
    Pietro Lesci
    Description

    Dataset Card for "wikitext-103-raw-v1_gpt2-20k"

    More Information needed

  20. wikitext_document_level

    • huggingface.co
    Updated Mar 10, 2023
    Cite
    EleutherAI (2023). wikitext_document_level [Dataset]. https://huggingface.co/datasets/EleutherAI/wikitext_document_level
    Available download formats: Croissant
    Dataset authored and provided by
    EleutherAI (https://eleuther.ai/)
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Wikitext Document Level

    This is a modified version of https://huggingface.co/datasets/wikitext that returns Wiki pages instead of Wiki text line-by-line. The original readme is contained below.

    Dataset Card for "wikitext"

    Dataset Summary

    The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons… See the full description on the dataset page: https://huggingface.co/datasets/EleutherAI/wikitext_document_level.
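    A minimal sketch contrasting the line-by-line and document-level variants; the "wikitext-103-v1" config name for the EleutherAI dataset and its "page" field name are assumptions:

        # Minimal sketch: line-by-line records vs. whole-page records.
        # The EleutherAI config name and the "page" field are assumptions.
        from datasets import load_dataset

        lines = load_dataset("wikitext", "wikitext-103-v1", split="test")
        pages = load_dataset("EleutherAI/wikitext_document_level",
                             "wikitext-103-v1", split="test")

        print(len(lines), "line records vs.", len(pages), "page records")
        print(pages[0]["page"][:200])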
