The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.
Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger vocabulary and retains the original case, punctuation and numbers - all of which are removed in PTB. As it is composed of full articles, the dataset is well suited for models that can take advantage of long term dependencies.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
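As a quick illustration (not part of the original description), the snippet below loads the raw WikiText-2 configuration with the Hugging Face datasets library and prints one line, showing that case, punctuation, and numbers are preserved. It assumes the library is installed and uses the Salesforce/wikitext repository named further down this page.

```python
# Minimal sketch, assuming the Hugging Face `datasets` library is installed and the
# Salesforce/wikitext repository (referenced later on this page) is reachable.
from datasets import load_dataset

# The "raw" configuration keeps the original case, punctuation, and numbers.
wikitext2 = load_dataset("Salesforce/wikitext", "wikitext-2-raw-v1")

print(wikitext2)                       # DatasetDict with train/validation/test splits
print(wikitext2["train"][10]["text"])  # one raw line of article text
```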
Dataset Card for "wikitext"
Dataset Summary
The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License. Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger… See the full description on the dataset page: https://huggingface.co/datasets/Salesforce/wikitext.
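A hedged sketch of how the configurations published on that page can be discovered and loaded; the names in the comment reflect the variants commonly available and may change upstream.

```python
# Sketch assuming the `datasets` library; config names may change upstream.
from datasets import get_dataset_config_names, load_dataset

configs = get_dataset_config_names("Salesforce/wikitext")
print(configs)  # typically: wikitext-103-raw-v1, wikitext-103-v1, wikitext-2-raw-v1, wikitext-2-v1

# Pick the small WikiText-2 variant for a quick look at the validation split.
val = load_dataset("Salesforce/wikitext", "wikitext-2-raw-v1", split="validation")
print(len(val), "lines in the validation split")
```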
No license specified: https://academictorrents.com/nolicensespecified
A subset of Wikitext-103; useful for testing language model training on smaller datasets.
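A hedged sketch of one way to carve out such a subset yourself, assuming the Hugging Face datasets library and the Salesforce/wikitext repository rather than the Academic Torrents copy; the 1% figure is arbitrary.

```python
# Sketch: take a small slice of WikiText-103 for fast smoke tests.
from datasets import load_dataset

small_train = load_dataset(
    "Salesforce/wikitext", "wikitext-103-raw-v1", split="train[:1%]"
)
print(len(small_train), "lines in the 1% training subset")
```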
This dataset is generated by Lilac for a HuggingFace Space: huggingface.co/spaces/lilacai/lilac. Original dataset: https://huggingface.co/datasets/wikitext

Lilac dataset config:
  name: wikitext-2-raw-v1
  source:
    dataset_name: wikitext
    config_name: wikitext-2-raw-v1
    source_name: huggingface
  embeddings:
  - path: text
    embedding: gte-small
  signals:
  - path: text
    signal:
      signal_name: near_dup
  - path: text
    signal:
      signal_name: pii
  - path: text
    signal: …

See the full description on the dataset page: https://huggingface.co/datasets/lilacai/lilac-wikitext-2-raw-v1.
Dataset Card for "wikitext-2-raw-v1-shuffled"
More Information needed
This dataset was created by Nazhura
mikasenghaas/wikitext-2 dataset hosted on Hugging Face and contributed by the HF Datasets community
claran/wikitext-2-nonulls-sample-v2 dataset hosted on Hugging Face and contributed by the HF Datasets community
Alignment-Lab-AI/wikitext-2-raw-bytepair dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
This dataset contains word- and character-level tokens extracted from Wikipedia.
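Illustrative only: a minimal sketch of the difference between word-level and character-level tokens, using naive whitespace splitting rather than whatever tokenizer produced this dataset.

```python
# Naive illustration of word-level vs character-level tokenization.
# This is not the exact procedure used to build the dataset.
line = "Senjō no Valkyria 3 was released in January 2011."

word_tokens = line.split()   # word-level: split on whitespace
char_tokens = list(line)     # character-level: every character is a token

print(word_tokens[:5])  # ['Senjō', 'no', 'Valkyria', '3', 'was']
print(char_tokens[:5])  # ['S', 'e', 'n', 'j', 'ō']
```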
Self-GRIT/wikitext-2-raw-v1-forbidden-titles-1k dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Statistics of the PTB and WikiText-2 datasets.
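A hedged sketch of how the kind of statistics such a table reports (token and vocabulary counts) can be computed for WikiText-2; PTB is not freely redistributable, so only the WikiText-2 side is shown, and whitespace splitting only approximates the original tokenization.

```python
# Sketch: approximate token and vocabulary counts for WikiText-2.
# Assumes the `datasets` library; exact numbers may differ from published tables.
from collections import Counter
from datasets import load_dataset

train = load_dataset("Salesforce/wikitext", "wikitext-2-v1", split="train")

vocab = Counter()
for row in train:
    vocab.update(row["text"].split())

print("tokens:", sum(vocab.values()))
print("vocabulary size:", len(vocab))
```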
Self-GRIT/wikitext-2-raw-v1-forbidden-titles-5k dataset hosted on Hugging Face and contributed by the HF Datasets community
Self-GRIT/wikitext-2-raw-v1-preprocessed-200-PI_KFI_-wikipedia-dpr-k-1-OP-False-train-PI_KFI_IK-perplexity dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Word-level validation and test perplexity on PTB.
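For reference, word-level perplexity is the exponential of the average per-word negative log-likelihood; the snippet below is a generic illustration of that relationship, not the evaluation code behind this particular table.

```python
# Sketch: perplexity = exp(mean negative log-likelihood per word).
# The losses below are made-up numbers purely to show the arithmetic.
import math

word_nlls = [4.2, 3.9, 5.1, 4.4]       # hypothetical per-word NLL (natural log)
perplexity = math.exp(sum(word_nlls) / len(word_nlls))
print(f"perplexity: {perplexity:.1f}")  # ~ exp(4.4) ≈ 81
```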
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Pseudocode of the learning-rate back-tracking procedure.
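The referenced figure is not reproduced here; below is a hedged sketch of one common form of learning-rate back-tracking (restore the best parameters and shrink the learning rate whenever the loss worsens), shown on a toy quadratic. It is not necessarily the exact procedure from the source.

```python
# Toy sketch of learning-rate back-tracking on f(w) = (w - 3)^2.
# Assumption: "back-tracking" here means reverting to the best parameters and
# halving the learning rate whenever the loss fails to improve.
def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)

w, lr = 10.0, 1.5          # deliberately large lr so back-tracking triggers
best_w, best_loss = w, loss(w)

for step in range(50):
    w = w - lr * grad(w)
    if loss(w) < best_loss:            # improvement: accept and remember
        best_w, best_loss = w, loss(w)
    else:                              # back-track: restore best point, shrink lr
        w, lr = best_w, lr * 0.5

print(f"w ≈ {w:.4f}, loss ≈ {loss(w):.6f}")
```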
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Pseudocode of the gradient batch training algorithm.
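Again the figure itself is not reproduced; the sketch below shows a generic mini-batch ("gradient batch") training loop on a toy linear-regression problem, as one plausible reading of the algorithm named in the caption.

```python
# Toy sketch of mini-batch ("gradient batch") training for y = 2x + 1,
# in plain Python so the example stays self-contained.
import random

random.seed(0)
data = [(i / 100, 2.0 * (i / 100) + 1.0) for i in range(100)]
w, b, lr, batch_size = 0.0, 0.0, 0.1, 16

for epoch in range(200):
    random.shuffle(data)
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Average gradient of the squared error over the batch.
        gw = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
        gb = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
        w -= lr * gw
        b -= lr * gb

print(f"w ≈ {w:.3f}, b ≈ {b:.3f}")   # should approach 2 and 1
```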
Self-GRIT/wikitext-2-raw-v1-preprocessed-200-PP_RIP_PTP-train-perplexity dataset hosted on Hugging Face and contributed by the HF Datasets community
Self-GRIT/wikitext-2-raw-v1-preprocessed-200-PP_RIP_PTP_claude-wikipedia-dpr-k-1-OP-True dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Perplexity of different initializations and improvement strategies.