Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Dataset Card for "wikitext"
Dataset Summary
The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License. Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger… See the full description on the dataset page: https://huggingface.co/datasets/mindchain/wikitext2.
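For reference, the dataset can be loaded with the Hugging Face datasets library. A minimal sketch, assuming the canonical 'wikitext' dataset id and one of its standard configurations (the page above is a mirror of it):

import datasets

# Load the raw WikiText-2 configuration; 'wikitext-103-raw-v1' is the larger variant.
wikitext = datasets.load_dataset('wikitext', 'wikitext-2-raw-v1')
print(wikitext['train'][0]['text'])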
Ritu27/Wikitext-103 dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset is sampled from wikitext/wikitext-2-v1/train. Code used to generate this dataset:

import datasets

# Load the standard WikiText-2 configuration.
dataset = datasets.load_dataset('wikitext', 'wikitext-2-v1')

# Select the first 24 training lines that are short prose
# (9-16 whitespace-separated tokens) and not section headings.
selected = []
i = -1
while len(selected) < 24:
    i += 1
    text = dataset['train'][i]['text']
    if 8 < len(text.split(' ')) <= 16 and '=' not in text:
        selected.append(i)

tiny_dataset = dataset['train'].select(selected)
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Wikitext 103 Raw Dataset
The original download link for this dataset is now down, so the original zip file is hosted here.
Dracones/wikitext-2-v1 dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The Wikitext-fr language modeling dataset consists of over 70 million tokens extracted from the set of French Wikipedia articles classified as "quality articles" or "good articles". The aim is to replicate the English benchmark.
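A similar loading sketch for the French benchmark; the repository id 'asi/wikitext_fr' is an assumption here, not confirmed by this page, so inspect the available configurations before loading:

import datasets

# Repository id is assumed; list its configurations and load the first one.
configs = datasets.get_dataset_config_names('asi/wikitext_fr')
print(configs)
wikitext_fr = datasets.load_dataset('asi/wikitext_fr', configs[0])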
Self-GRIT/wikitext-2-raw-v1-forbidden-titles-train dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This one is not entirely identical to the default "wiki.train.raw" dataset used with llama.cpp, so instead of this one, get the recommended one from here: https://huggingface.co/datasets/ggml-org/ci
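A sketch of fetching the recommended archive from that repository with huggingface_hub; the filename 'wikitext-2-raw-v1.zip' is an assumption based on common llama.cpp setups, so check the repository's file listing first:

from huggingface_hub import hf_hub_download

# Download from the ggml-org/ci dataset repository; the filename is assumed.
path = hf_hub_download(repo_id='ggml-org/ci',
                       filename='wikitext-2-raw-v1.zip',
                       repo_type='dataset')
print(path)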
wufuheng/wikitext-103-v1-5p dataset hosted on Hugging Face and contributed by the HF Datasets community
mikasenghaas/wikitext-2 dataset hosted on Hugging Face and contributed by the HF Datasets community
haryoaw/wikitext-v2-clean dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
segyges/wikitext-103 dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for "wikitext-2-raw-v1-shuffled"
More Information needed
carlosejimenez/wikitext-103-raw-v1_sents_min_len10_max_len30_DEBUG dataset hosted on Hugging Face and contributed by the HF Datasets community
Ankit1057/wikitext dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Lang008/wikitext dataset hosted on Hugging Face and contributed by the HF Datasets community
mmosbach/wikitext-103-special dataset hosted on Hugging Face and contributed by the HF Datasets community
vish26/wikitext-103-v1-cleaned dataset hosted on Hugging Face and contributed by the HF Datasets community