The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.
Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger vocabulary and retains the original case, punctuation and numbers - all of which are removed in PTB. As it is composed of full articles, the dataset is well suited for models that can take advantage of long term dependencies.
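For readers who want to experiment with the data, a minimal sketch of loading both corpora with the Hugging Face datasets library might look like the following; the configuration names follow the standard "wikitext" repository layout, where the "raw" variants keep the original case, punctuation and numbers and the non-raw variants are the preprocessed, closed-vocabulary versions:

```python
# Sketch: loading WikiText with the Hugging Face `datasets` library.
# Configuration names assume the standard "wikitext" repository layout.
from datasets import load_dataset

# Raw variants keep the original case, punctuation and numbers.
wikitext2 = load_dataset("wikitext", "wikitext-2-raw-v1")
wikitext103 = load_dataset("wikitext", "wikitext-103-raw-v1")

print(wikitext2)                          # train/validation/test splits
print(wikitext103["train"][0]["text"])    # one line of raw article text
```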
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Dataset Card for "wikitext"
Dataset Summary
The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License. Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far… See the full description on the dataset page: https://huggingface.co/datasets/mindchain/wikitext2.
No license specified (academictorrents.com)
A collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. Widely used for language modeling, including the pretrained models used in the fastai library and ULMFiT algorithm.
cchoi1022/wikitext-103-v1 dataset hosted on Hugging Face and contributed by the HF Datasets community
No license specified (academictorrents.com)
A subset of Wikitext-103; useful for testing language model training on smaller datasets.
This repository contains 25 Wikitext-103 LSTM models and 25 LSTM models trained on a 100 million token subset of the OpenWebTextCorpus. Training/validation/test data is included with the Web models. By-epoch validation perplexity is given in the logs (within the directory for the models). Please write to me if you have any questions :)
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
The dataset contains word- and character-level tokens extracted from Wikipedia.
carlosejimenez/wikitext-103-raw-v1_sents_min_len10_max_len30_DEBUG dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for "wikitext-103-raw-v1-para-permute-1"
More Information needed
wufuheng/wikitext-103-raw-v1-5percent dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for "wikitext-103-stanza"
More Information needed
The PTNews Corpus is a collection of over 19 million tokens extracted from 10 years of political news articles (in Portuguese) from the Portuguese newspaper PÚBLICO. The corpus is available under the Creative Commons Attribution-NonCommercial-ShareAlike Licence. The material contained in the PTNews Corpus is © 2010-2020 PÚBLICO Comunicação Social SA.
The corpus size falls between that of the preprocessed version of Penn Treebank (PTB) and that of WikiText-103. Similarly to WikiText, PTNews has a larger vocabulary than PTB and retains the original case, punctuation and numbers. The corpus contains over 31,000 publicly available full articles, which makes it well suited for models that can take advantage of long-term dependencies.
The corpus is available as a word-level collection of articles in two versions: the first (ptnews_origin) contains a single file with all the articles in the form title, URL, date, body; the second contains only the title and body of the news articles and is split into train, validation and test sets. In this processed version, words with fewer than 3 occurrences are mapped to an unknown-word token. Each sentence in an article body occupies a single line of the dataset, and the end of a paragraph is marked with a dedicated tag at the end of the sentence. Portuguese words resulting from contractions such as "desta" or "nesta" are separated into "d", "esta" and "n", "esta", respectively.
Sample article:
Carlos César : Cavaco " cansado e sem entusiasmo " quis afastar responsabilidades sobre a crise https://publico.pt/2010/06/10/politica/noticia/carlos-cesar-cavaco-cansado-e-sem-entusiasmo-quis-afastar-responsabilidades-sobre-a-crise-1441369 2010-06-10 15:38:00
O presidente do Governo Regional dos Açores , Carlos César , considerou hoje que Cavaco Silva esteve " cansado e sem entusiasmo " no discurso do Dia de Portugal , onde afastou responsabilidades sobre a actual crise . " O país ouviu um Presidente cansado e sem entusiasmo , que andou às voltas com os papéis para dizer que não tinha nada a ver com as razões da crise " , afirmou Carlos César , num comentário à Lusa sobre o discurso do Presidente da República na cerimónia oficial do 10 de Junho , realizada em Faro . Carlos César considerou , no entanto , " positivo " que Cavaco Silva tenha feito " um discurso alinhado com um tema recorrente na apreciação do momento que vivemos , o da coesão e da corresponsabilização " . No mesmo sentido , manifestou concordância com o apelo que Cavaco Silva fez " à responsabilidade dos empregadores e empregados " , mas deixou um alerta relativamente à referência do Presidente da República à necessidade de " limpar Portugal " . Para Carlos César , se essa referência " for despida de conteúdo institucional útil , tratou-se de mais um discurso que se perderá na babugem política d aquilo que Cavaco Silva entendeu recordar como o ' rectângulo ' " .
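As an illustration only, a short sketch of iterating over one of the processed word-level splits is given below; the file name is a hypothetical placeholder, and the tokenization simply follows the one-sentence-per-line, whitespace-separated format described above:

```python
# Sketch: iterating over a processed PTNews split.
# "ptnews.train.txt" is a hypothetical placeholder; the corpus release defines the real file names.
from collections import Counter

def iter_sentences(path):
    """Yield one whitespace-tokenized sentence per line, as in the processed corpus."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            if tokens:
                yield tokens

vocab = Counter()
for sentence in iter_sentences("ptnews.train.txt"):
    vocab.update(sentence)

print(f"{sum(vocab.values()):,} tokens, {len(vocab):,} unique")
```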
Reporting Results
If you wish to report results or other resources obtained on the PTNews Corpus, contact Davide Nunes with the following information:
Task: e.g. Language Modelling, Semantic Similarity, etc;
Publication URL: url to published article or preprint;
Type of Model: LSTM Neural Network, n-grams, GloVe vectors, etc;
Evaluation Metrics: e.g. validation and testing perplexities in the case of language modelling.
Reported results will be displayed here.
Preprocessed Corpus Statistics
articles: 31,919
articles by split:
train: 25,537
test: 3,191
val: 3,191
unique tokens: 68,318
unique OoV tokens: 76,157
total tokens: 19,021,661
total OoV tokens: 95,043
OoV rate: 0.5%
tokens by split:
train: 15,242,995
test: 1,895,184
val: 1,883,482
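As a quick consistency check on the figures above: 25,537 + 3,191 + 3,191 = 31,919 articles, 15,242,995 + 1,895,184 + 1,883,482 = 19,021,661 tokens, and 95,043 / 19,021,661 ≈ 0.005, i.e. the reported 0.5% OoV rate.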
Contact Information
If you have questions about the corpus or want to report benchmark results, contact Davide Nunes.
Dataset Card for "wikitext-103-raw-v1-sent-permute-1"
More Information needed
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains two datasets with word search queries. Each word search query consists of a token n-gram with one wildcard token ([MASK]). The answers to each query are the tokens most likely to replace the mask. All queries originate from WikiText-103 and CLOTH, and the respective source is annotated for each query.
The original-token dataset lists exactly one top answer for each query. The ranked-answers dataset lists multiple answers, sorted into three relevance categories, where 3 is the most relevant. Please refer to the citation for more details.
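As a rough illustration of how such queries can be answered (not part of the dataset release), a masked language model can rank candidate fill-ins; a minimal sketch with the Hugging Face transformers fill-mask pipeline, using bert-base-uncased as an arbitrary example model and a made-up query, might look like:

```python
# Sketch: answering a word-search query with a masked language model.
# The model choice and the example query are illustrative, not part of the dataset.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

query = "the capital of France is [MASK] ."
for candidate in fill_mask(query, top_k=5):
    print(f"{candidate['token_str']:>10}  {candidate['score']:.3f}")
```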
BluebrainAI/wikitext-103-raw-v1-seq512-tokenized-grouped dataset hosted on Hugging Face and contributed by the HF Datasets community
carlosejimenez/wikitext-103-raw-v1_sents_min_len10_max_len30_openai_clip-vit-base-patch32 dataset hosted on Hugging Face and contributed by the HF Datasets community
The WikiText-103 dataset from this paper: https://arxiv.org/pdf/1609.07843.pdf. Gopher's authors concatenate all the articles, set the context length to n/2 (where n = max_seq_len), and use the "closed vocabulary" variant of the dataset for evaluation. In contrast, we evaluate the model on each article independently, use single-token contexts (except for the last sequence in each document), and use the raw dataset.
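A minimal sketch of one possible reading of this per-article setup is shown below; the GPT-2 model, the window size, and the windowing details are illustrative assumptions rather than the evaluation code actually used:

```python
# Sketch of per-article perplexity (one possible reading of the setup above).
# The GPT-2 model and the 1024-token window are illustrative assumptions.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def article_nll(text, max_len=1024):
    """Return (summed NLL, number of predicted tokens) for one article, scored independently."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    nll, count = 0.0, 0
    # Consecutive windows share one context token, so no token is scored twice.
    for start in range(0, len(ids) - 1, max_len - 1):
        window = ids[start : start + max_len].unsqueeze(0)
        with torch.no_grad():
            out = model(window, labels=window)
        n = window.size(1) - 1          # tokens actually predicted in this window
        nll += out.loss.item() * n
        count += n
    return nll, count

articles = ["First article text ...", "Second article text ..."]
total_nll, total_tokens = map(sum, zip(*(article_nll(a) for a in articles)))
print("perplexity:", math.exp(total_nll / total_tokens))
```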
irodkin/wikitext-103-raw-v1-rwkv-v5-tokenized dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for "wikitext-103-raw-v1_gpt2-20k"
More Information needed
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Wikitext Document Level
This is a modified version of https://huggingface.co/datasets/wikitext that returns Wiki pages instead of Wiki text line-by-line. The original readme is contained below.
Dataset Card for "wikitext"
Dataset Summary
The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons… See the full description on the dataset page: https://huggingface.co/datasets/EleutherAI/wikitext_document_level.
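A minimal sketch of loading this document-level variant with the datasets library follows; the configuration name and the "page" column are assumptions based on the original WikiText naming, so check the dataset card for the exact schema:

```python
# Sketch: loading the document-level WikiText variant.
# The config name and the "page" column are assumptions; verify against the dataset card.
from datasets import load_dataset

docs = load_dataset("EleutherAI/wikitext_document_level", "wikitext-103-raw-v1")
first_page = docs["train"][0]["page"]    # one full Wikipedia article per example
print(first_page[:500])
```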