Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Dataset Card for "wikitext"
Dataset Summary
The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License. Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger… See the full description on the dataset page: https://huggingface.co/datasets/mindchain/wikitext2.
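For reference, the dataset can be loaded with the Hugging Face datasets library. A minimal sketch, assuming the canonical 'wikitext' dataset id and one of its standard configurations (the page above is a mirror of it):

import datasets

# Load the raw WikiText-2 configuration; 'wikitext-103-raw-v1' is the larger variant.
wikitext = datasets.load_dataset('wikitext', 'wikitext-2-raw-v1')
print(wikitext['train'][0]['text'])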
Ritu27/Wikitext-103 dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset is sampled from wikitext/wikitext-2-v1/train. Code used to generate this dataset:

import datasets

# Load the standard WikiText-2 configuration.
dataset = datasets.load_dataset('wikitext', 'wikitext-2-v1')

# Select the first 24 training lines that are short prose
# (9-16 whitespace-separated tokens) and not section headings.
selected = []
i = -1
while len(selected) < 24:
    i += 1
    text = dataset['train'][i]['text']
    if 8 < len(text.split(' ')) <= 16 and '=' not in text:
        selected.append(i)

tiny_dataset = dataset['train'].select(selected)
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Wikitext 103 Raw Dataset
The original download link for this dataset is now down, so the original zip file is hosted here.
Dracones/wikitext-2-v1 dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The Wikitext-fr language modeling dataset consists of over 70 million tokens extracted from the set of French Wikipedia articles classified as "quality articles" or "good articles". The aim is to replicate the English benchmark.
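A similar loading sketch for the French benchmark; the repository id 'asi/wikitext_fr' is an assumption here, not confirmed by this page, so inspect the available configurations before loading:

import datasets

# Repository id is assumed; list its configurations and load the first one.
configs = datasets.get_dataset_config_names('asi/wikitext_fr')
print(configs)
wikitext_fr = datasets.load_dataset('asi/wikitext_fr', configs[0])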
Self-GRIT/wikitext-2-raw-v1-forbidden-titles-train dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This one is not entirely identical to the default "wiki.train.raw" dataset used with llama.cpp, so instead of this one, get the recommended one from here: https://huggingface.co/datasets/ggml-org/ci
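A sketch of fetching the recommended archive from that repository with huggingface_hub; the filename 'wikitext-2-raw-v1.zip' is an assumption based on common llama.cpp setups, so check the repository's file listing first:

from huggingface_hub import hf_hub_download

# Download from the ggml-org/ci dataset repository; the filename is assumed.
path = hf_hub_download(repo_id='ggml-org/ci',
                       filename='wikitext-2-raw-v1.zip',
                       repo_type='dataset')
print(path)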
wufuheng/wikitext-103-v1-5p dataset hosted on Hugging Face and contributed by the HF Datasets community
mikasenghaas/wikitext-2 dataset hosted on Hugging Face and contributed by the HF Datasets community
haryoaw/wikitext-v2-clean dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
segyges/wikitext-103 dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for "wikitext-2-raw-v1-shuffled"
More Information needed
carlosejimenez/wikitext-103-raw-v1_sents_min_len10_max_len30_DEBUG dataset hosted on Hugging Face and contributed by the HF Datasets community
Ankit1057/wikitext dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Lang008/wikitext dataset hosted on Hugging Face and contributed by the HF Datasets community
mmosbach/wikitext-103-special dataset hosted on Hugging Face and contributed by the HF Datasets community
vish26/wikitext-103-v1-cleaned dataset hosted on Hugging Face and contributed by the HF Datasets community