The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.
Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger vocabulary and retains the original case, punctuation and numbers - all of which are removed in PTB. As it is composed of full articles, the dataset is well suited for models that can take advantage of long term dependencies.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
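As a quick illustration (not part of the original description), the snippet below loads the raw WikiText-2 configuration with the Hugging Face datasets library and prints one line, showing that case, punctuation, and numbers are preserved. It assumes the library is installed and uses the Salesforce/wikitext repository named further down this page.

```python
# Minimal sketch, assuming the Hugging Face `datasets` library is installed and the
# Salesforce/wikitext repository (referenced later on this page) is reachable.
from datasets import load_dataset

# The "raw" configuration keeps the original case, punctuation, and numbers.
wikitext2 = load_dataset("Salesforce/wikitext", "wikitext-2-raw-v1")

print(wikitext2)                       # DatasetDict with train/validation/test splits
print(wikitext2["train"][10]["text"])  # one raw line of article text
```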
Dataset Card for "wikitext"
Dataset Summary
The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License. Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger… See the full description on the dataset page: https://huggingface.co/datasets/Salesforce/wikitext.
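A hedged sketch of how the configurations published on that page can be discovered and loaded; the names in the comment reflect the variants commonly available and may change upstream.

```python
# Sketch assuming the `datasets` library; config names may change upstream.
from datasets import get_dataset_config_names, load_dataset

configs = get_dataset_config_names("Salesforce/wikitext")
print(configs)  # typically: wikitext-103-raw-v1, wikitext-103-v1, wikitext-2-raw-v1, wikitext-2-v1

# Pick the small WikiText-2 variant for a quick look at the validation split.
val = load_dataset("Salesforce/wikitext", "wikitext-2-raw-v1", split="validation")
print(len(val), "lines in the validation split")
```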
No license specified: https://academictorrents.com/nolicensespecified
A subset of Wikitext-103; useful for testing language model training on smaller datasets.
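A hedged sketch of one way to carve out such a subset yourself, assuming the Hugging Face datasets library and the Salesforce/wikitext repository rather than the Academic Torrents copy; the 1% figure is arbitrary.

```python
# Sketch: take a small slice of WikiText-103 for fast smoke tests.
from datasets import load_dataset

small_train = load_dataset(
    "Salesforce/wikitext", "wikitext-103-raw-v1", split="train[:1%]"
)
print(len(small_train), "lines in the 1% training subset")
```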
This dataset is generated by Lilac for a HuggingFace Space: huggingface.co/spaces/lilacai/lilac. Original dataset: https://huggingface.co/datasets/wikitext

Lilac dataset config:
  name: wikitext-2-raw-v1
  source:
    dataset_name: wikitext
    config_name: wikitext-2-raw-v1
    source_name: huggingface
  embeddings:
  - path: text
    embedding: gte-small
  signals:
  - path: text
    signal:
      signal_name: near_dup
  - path: text
    signal:
      signal_name: pii
  - path: text
    signal: …

See the full description on the dataset page: https://huggingface.co/datasets/lilacai/lilac-wikitext-2-raw-v1.
Dataset Card for "wikitext-2-raw-v1-shuffled"
More Information needed
This dataset was created by Nazhura
mikasenghaas/wikitext-2 dataset hosted on Hugging Face and contributed by the HF Datasets community
claran/wikitext-2-nonulls-sample-v2 dataset hosted on Hugging Face and contributed by the HF Datasets community
Alignment-Lab-AI/wikitext-2-raw-bytepair dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
This dataset contains word- and character-level tokens extracted from Wikipedia.
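Illustrative only: a minimal sketch of the difference between word-level and character-level tokens, using naive whitespace splitting rather than whatever tokenizer produced this dataset.

```python
# Naive illustration of word-level vs character-level tokenization.
# This is not the exact procedure used to build the dataset.
line = "Senjō no Valkyria 3 was released in January 2011."

word_tokens = line.split()   # word-level: split on whitespace
char_tokens = list(line)     # character-level: every character is a token

print(word_tokens[:5])  # ['Senjō', 'no', 'Valkyria', '3', 'was']
print(char_tokens[:5])  # ['S', 'e', 'n', 'j', 'ō']
```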
Self-GRIT/wikitext-2-raw-v1-forbidden-titles-1k dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Statistics of the PTB and WikiText-2 datasets.
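A hedged sketch of how the kind of statistics such a table reports (token and vocabulary counts) can be computed for WikiText-2; PTB is not freely redistributable, so only the WikiText-2 side is shown, and whitespace splitting only approximates the original tokenization.

```python
# Sketch: approximate token and vocabulary counts for WikiText-2.
# Assumes the `datasets` library; exact numbers may differ from published tables.
from collections import Counter
from datasets import load_dataset

train = load_dataset("Salesforce/wikitext", "wikitext-2-v1", split="train")

vocab = Counter()
for row in train:
    vocab.update(row["text"].split())

print("tokens:", sum(vocab.values()))
print("vocabulary size:", len(vocab))
```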
Self-GRIT/wikitext-2-raw-v1-forbidden-titles-5k dataset hosted on Hugging Face and contributed by the HF Datasets community
Self-GRIT/wikitext-2-raw-v1-preprocessed-200-PI_KFI_-wikipedia-dpr-k-1-OP-False-train-PI_KFI_IK-perplexity dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Word-level validation and test perplexity on PTB.
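For reference, word-level perplexity is the exponential of the average per-word negative log-likelihood; the snippet below is a generic illustration of that relationship, not the evaluation code behind this particular table.

```python
# Sketch: perplexity = exp(mean negative log-likelihood per word).
# The losses below are made-up numbers purely to show the arithmetic.
import math

word_nlls = [4.2, 3.9, 5.1, 4.4]       # hypothetical per-word NLL (natural log)
perplexity = math.exp(sum(word_nlls) / len(word_nlls))
print(f"perplexity: {perplexity:.1f}")  # ~ exp(4.4) ≈ 81
```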
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Pseudocode of the learning-rate back-tracking procedure.
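The referenced figure is not reproduced here; below is a hedged sketch of one common form of learning-rate back-tracking (restore the best parameters and shrink the learning rate whenever the loss worsens), shown on a toy quadratic. It is not necessarily the exact procedure from the source.

```python
# Toy sketch of learning-rate back-tracking on f(w) = (w - 3)^2.
# Assumption: "back-tracking" here means reverting to the best parameters and
# halving the learning rate whenever the loss fails to improve.
def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)

w, lr = 10.0, 1.5          # deliberately large lr so back-tracking triggers
best_w, best_loss = w, loss(w)

for step in range(50):
    w = w - lr * grad(w)
    if loss(w) < best_loss:            # improvement: accept and remember
        best_w, best_loss = w, loss(w)
    else:                              # back-track: restore best point, shrink lr
        w, lr = best_w, lr * 0.5

print(f"w ≈ {w:.4f}, loss ≈ {loss(w):.6f}")
```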
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Pseudocode of the gradient batch training algorithm.
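Again the figure itself is not reproduced; the sketch below shows a generic mini-batch ("gradient batch") training loop on a toy linear-regression problem, as one plausible reading of the algorithm named in the caption.

```python
# Toy sketch of mini-batch ("gradient batch") training for y = 2x + 1,
# in plain Python so the example stays self-contained.
import random

random.seed(0)
data = [(i / 100, 2.0 * (i / 100) + 1.0) for i in range(100)]
w, b, lr, batch_size = 0.0, 0.0, 0.1, 16

for epoch in range(200):
    random.shuffle(data)
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Average gradient of the squared error over the batch.
        gw = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
        gb = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
        w -= lr * gw
        b -= lr * gb

print(f"w ≈ {w:.3f}, b ≈ {b:.3f}")   # should approach 2 and 1
```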
Self-GRIT/wikitext-2-raw-v1-preprocessed-200-PP_RIP_PTP-train-perplexity dataset hosted on Hugging Face and contributed by the HF Datasets community
Self-GRIT/wikitext-2-raw-v1-preprocessed-200-PP_RIP_PTP_claude-wikipedia-dpr-k-1-OP-True dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Perplexity of different initializations and improvement strategies.