Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Dataset Card for "wikitext"
Dataset Summary
The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License. Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger… See the full description on the dataset page: https://huggingface.co/datasets/mindchain/wikitext2.
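A minimal loading sketch, assuming the Hugging Face datasets library; the canonical 'wikitext' dataset with the 'wikitext-2-v1' configuration is used here, and the mirror linked above ('mindchain/wikitext2') could be substituted as the first argument.

from datasets import load_dataset

# Download WikiText-2 (word-tokenized version) with its train/validation/test splits.
wikitext2 = load_dataset('wikitext', 'wikitext-2-v1')
print(wikitext2)                       # DatasetDict with 'train', 'validation', 'test'
print(wikitext2['train'][10]['text'])  # each row holds one line of article text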
This dataset was created by VIVEK METTU
Dracones/wikitext-2-v1 dataset hosted on Hugging Face and contributed by the HF Datasets community
The dataset used in this paper is not explicitly described. However, it is mentioned that the authors used the WikiText-2 dataset for text generation tasks.
Self-GRIT/wikitext-2-raw-v1-forbidden-titles-train dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.
Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger vocabulary and retains the original case, punctuation and numbers - all of which are removed in PTB. As it is composed of full articles, the dataset is well suited for models that can take advantage of long-term dependencies.
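As a small illustration of the retained case, punctuation, and numbers, the sketch below loads both Hub configurations of WikiText-2: 'wikitext-2-raw-v1' (raw text, suitable for character-level work) and 'wikitext-2-v1' (the word-level version in which rare words are replaced by <unk>). The printed lines are illustrative; row contents differ between the two configurations.

from datasets import load_dataset

# Raw text: original casing, punctuation, and numbers are preserved.
raw = load_dataset('wikitext', 'wikitext-2-raw-v1', split='train')
# Word-level version: tokenized text with rare words mapped to <unk>.
tok = load_dataset('wikitext', 'wikitext-2-v1', split='train')

print(raw[3]['text'])
print(tok[3]['text'])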
Dataset Card for "wikitext-2-raw-v1-shuffled"
More Information needed
This dataset was created by bestwater
This dataset is sampled from wikitext/wikitext-2-v1/train. Code to generate this dataset:

import datasets

# Load the word-tokenized WikiText-2 dataset from the Hugging Face Hub.
dataset = datasets.load_dataset('wikitext', 'wikitext-2-v1')

# Keep the first 24 training lines with 9 to 16 space-separated tokens,
# skipping section headings (which contain '=' markers).
selected = []
i = -1
while len(selected) < 24:
    i += 1
    text = dataset['train'][i]['text']
    if 8 < len(text.split(' ')) <= 16 and '=' not in text:
        selected.append(i)

tiny_dataset = dataset['train'].select(selected)
mikasenghaas/wikitext-2 dataset hosted on Hugging Face and contributed by the HF Datasets community
The WikiText-2 dataset is a benchmark for evaluating the performance of large language models.
claran/wikitext-2-nonulls-sample dataset hosted on Hugging Face and contributed by the HF Datasets community
The dataset is taken from https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset
tartspuppy/wikitext-2-second-half dataset hosted on Hugging Face and contributed by the HF Datasets community
claran/wikitext-2-noheader-sample dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
The dataset contains word- and character-level tokens extracted from Wikipedia.
This dataset was created by Romi Kumar
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Word-level validation and test perplexity on WikiText-2.
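For context, a minimal sketch of how such a perplexity figure is typically computed, using GPT-2 from the transformers library on the WikiText-2 test split; the model and stride choices are illustrative rather than taken from this entry, and the result is a sub-word (BPE) perplexity rather than the word-level figure reported.

import math
import torch
from datasets import load_dataset
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = GPT2LMHeadModel.from_pretrained('gpt2').to(device)
tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')

# Concatenate the test split into one long string and tokenize it once.
test = load_dataset('wikitext', 'wikitext-2-raw-v1', split='test')
encodings = tokenizer('\n\n'.join(test['text']), return_tensors='pt')

max_length = model.config.n_positions   # 1024-token context window for GPT-2
stride = 512                            # overlap windows so scored tokens keep context
nlls, n_scored, prev_end = [], 0, 0
for begin in range(0, encodings.input_ids.size(1), stride):
    end = min(begin + max_length, encodings.input_ids.size(1))
    trg_len = end - prev_end             # tokens actually scored in this window
    input_ids = encodings.input_ids[:, begin:end].to(device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100      # mask context-only tokens from the loss
    with torch.no_grad():
        loss = model(input_ids, labels=target_ids).loss
    nlls.append(loss * trg_len)
    n_scored += trg_len
    prev_end = end
    if end == encodings.input_ids.size(1):
        break

ppl = math.exp(torch.stack(nlls).sum().item() / n_scored)
print(f'GPT-2 perplexity on the WikiText-2 test set: {ppl:.2f}')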
wikitext-2-raw-v1-test
Test split for lm-eval-harness.
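A hedged sketch of how that test split is typically consumed, assuming a recent EleutherAI lm-evaluation-harness (the lm_eval package) in which simple_evaluate and the built-in wikitext task are available; the model and arguments here are illustrative only.

import lm_eval

# Evaluate GPT-2 perplexity on the harness's built-in wikitext task,
# which draws on the WikiText-2 raw test split.
results = lm_eval.simple_evaluate(
    model='hf',
    model_args='pretrained=gpt2',
    tasks=['wikitext'],
)
print(results['results']['wikitext'])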
Self-GRIT/wikitext-2-raw-v1-forbidden-titles-1k dataset hosted on Hugging Face and contributed by the HF Datasets community