59 datasets found
  1. wikitext2

    • huggingface.co
    • opendatalab.com
    Updated Oct 21, 2023
    Cite
    Jan Karsten Kuhnke (2023). wikitext2 [Dataset]. https://huggingface.co/datasets/mindchain/wikitext2
    Authors
    Jan Karsten Kuhnke
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Dataset Card for "wikitext"

      Dataset Summary
    

    The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License. Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger… See the full description on the dataset page: https://huggingface.co/datasets/mindchain/wikitext2.
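
    As a quick, hedged illustration (not part of the dataset card), a mirror like this can usually be loaded with the Hugging Face datasets library; the exact config and split names for mindchain/wikitext2 are an assumption here:

      # Hypothetical usage sketch: load this WikiText-2 mirror by its repo id.
      from datasets import load_dataset

      wikitext2 = load_dataset("mindchain/wikitext2")   # repo id taken from the citation above
      print(wikitext2)                                  # lists the available splits
      print(wikitext2["train"][0]["text"])              # peek at one raw line (assumes a "train" split)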

  2. Wikitext-2

    • academictorrents.com
    bittorrent
    Updated Oct 16, 2018
    Cite
    Stephen Merity et al., 2016 (2018). Wikitext-2 [Dataset]. https://academictorrents.com/details/ac7ffa98b66427246a316a81b2ea31c9b58ea5b6
    Available download formats: bittorrent (4070055)
    Dataset authored and provided by
    Stephen Merity et al., 2016
    License

    No license specified (https://academictorrents.com/nolicensespecified)

    Description

    A subset of Wikitext-103; useful for testing language model training on smaller datasets.

  3. lilac-wikitext-2-raw-v1

    • huggingface.co
    Updated Sep 1, 2025
    Cite
    Lilac AI (2025). lilac-wikitext-2-raw-v1 [Dataset]. https://huggingface.co/datasets/lilacai/lilac-wikitext-2-raw-v1
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset authored and provided by
    Lilac AI
    Description

    This dataset is generated by Lilac for a HuggingFace Space: huggingface.co/spaces/lilacai/lilac. Original dataset: https://huggingface.co/datasets/wikitext

    Lilac dataset config:

      name: wikitext-2-raw-v1
      source:
        dataset_name: wikitext
        config_name: wikitext-2-raw-v1
        source_name: huggingface
      embeddings:
        - path: text
          embedding: gte-small
      signals:
        - path: text
          signal:
            signal_name: near_dup
        - path: text
          signal:
            signal_name: pii
        - path: text
          signal:…

    See the full description on the dataset page: https://huggingface.co/datasets/lilacai/lilac-wikitext-2-raw-v1.

  4. wikitext-2-raw-v1-shuffled

    • huggingface.co
    Cite
    Tongyao, wikitext-2-raw-v1-shuffled [Dataset]. https://huggingface.co/datasets/tyzhu/wikitext-2-raw-v1-shuffled
    Available download formats: Croissant
    Authors
    Tongyao
    Description

    Dataset Card for "wikitext-2-raw-v1-shuffled"

    More Information needed

  5. wikitext-2-raw-v1-forbidden-titles-train

    • huggingface.co
    Updated Jul 25, 2024
    Cite
    ActiveRetrieval (2024). wikitext-2-raw-v1-forbidden-titles-train [Dataset]. https://huggingface.co/datasets/Self-GRIT/wikitext-2-raw-v1-forbidden-titles-train
    Available download formats: Croissant
    Dataset authored and provided by
    ActiveRetrieval
    Description

    The Self-GRIT/wikitext-2-raw-v1-forbidden-titles-train dataset is hosted on Hugging Face and was contributed by the HF Datasets community.

  6. wikitext-2

    • kaggle.com
    Updated Sep 15, 2021
    Cite
    Ashutosh Saxena (2021). wikitext-2 [Dataset]. https://www.kaggle.com/datasets/ashuto5h/wikitext2
    Available download formats: Croissant
    Dataset provided by
    Kaggle
    Authors
    Ashutosh Saxena
    Description

    Dataset

    This dataset was created by Ashutosh Saxena


  7. WikiText-103 Dataset

    • paperswithcode.com
    • opendatalab.com
    Cite
    Stephen Merity; Caiming Xiong; James Bradbury; Richard Socher, WikiText-103 Dataset [Dataset]. https://paperswithcode.com/dataset/wikitext-103
    Authors
    Stephen Merity; Caiming Xiong; James Bradbury; Richard Socher
    Description

    The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.

    Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger vocabulary and retains the original case, punctuation and numbers - all of which are removed in PTB. As it is composed of full articles, the dataset is well suited for models that can take advantage of long term dependencies.
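
    As a hedged illustration of the properties described above (full articles, original case, punctuation and numbers), WikiText-103 is commonly loaded via the Hugging Face datasets library; the canonical wikitext repo and the wikitext-103-raw-v1 config named below are assumptions, since this listing itself points to Papers with Code rather than a download:

      # Minimal sketch: peek at raw WikiText-103 text, which keeps case,
      # punctuation and numbers (unlike the preprocessed Penn Treebank).
      from datasets import load_dataset

      wt103 = load_dataset("wikitext", "wikitext-103-raw-v1", split="validation")
      first_line = next(line for line in wt103["text"] if line.strip())
      print(first_line)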

  8. wikitext-2

    • huggingface.co
    Cite
    Mika Senghaas, wikitext-2 [Dataset]. https://huggingface.co/datasets/mikasenghaas/wikitext-2
    Available download formats: Croissant
    Authors
    Mika Senghaas
    Description

    The mikasenghaas/wikitext-2 dataset is hosted on Hugging Face and was contributed by the HF Datasets community.

  9. wikitext-2-sample

    • huggingface.co
    Cite
    Clara Na, wikitext-2-sample [Dataset]. https://huggingface.co/datasets/claran/wikitext-2-sample
    Available download formats: Croissant
    Authors
    Clara Na
    Description

    The claran/wikitext-2-sample dataset is hosted on Hugging Face and was contributed by the HF Datasets community.

  10. wikitext-tiny

    • huggingface.co
    • aifasthub.com
    Updated Aug 31, 2023
    Cite
    yujiepan (2023). wikitext-tiny [Dataset]. https://huggingface.co/datasets/yujiepan/wikitext-tiny
    Available download formats: Croissant
    Authors
    yujiepan
    Description

    This dataset is sampled from wikitext/wikitext-2-v1/train. Code to generate this dataset:

      import datasets

      # Load the word-level WikiText-2 corpus from the Hugging Face Hub.
      dataset = datasets.load_dataset('wikitext', 'wikitext-2-v1')

      # Keep the first 24 training lines that are 9-16 words long and are not
      # section headings (headings contain '=' markers in WikiText).
      selected = []
      i = -1
      while len(selected) < 24:
          i += 1
          text = dataset['train'][i]['text']
          if 8 < len(text.split(' ')) <= 16 and '=' not in text:
              selected.append(i)

      tiny_dataset = dataset['train'].select(selected)
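
    A quick sanity check of the result, using the same datasets API as above:

      print(len(tiny_dataset))         # 24 rows, by construction of the loop
      print(tiny_dataset['text'][:3])  # first three selected lines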

  11. WikiText-103 & 2

    • live.european-language-grid.eu
    txt
    Updated Dec 30, 2016
    Cite
    (2016). WikiText-103 & 2 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/5169
    Available download formats: txt
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    The dataset contains word- and character-level tokens extracted from Wikipedia.

  12. wikitext-2-nonulls-sample-v2

    • huggingface.co
    Updated Sep 15, 2008
    Cite
    Clara Na (2008). wikitext-2-nonulls-sample-v2 [Dataset]. https://huggingface.co/datasets/claran/wikitext-2-nonulls-sample-v2
    Available download formats: Croissant
    Authors
    Clara Na
    Description

    The claran/wikitext-2-nonulls-sample-v2 dataset is hosted on Hugging Face and was contributed by the HF Datasets community.

  13. Word-level valid and test perplexity on WikiText-2

    • plos.figshare.com
    xls
    Updated Jun 10, 2023
    Cite
    Lu Yuwen; Shuyu Chen; Xiaohan Yuan (2023). Word-level valid and test perplexity on WikiText-2. [Dataset]. http://doi.org/10.1371/journal.pone.0249820.t010
    Available download formats: xls
    Dataset provided by
    PLOS ONE
    Authors
    Lu Yuwen; Shuyu Chen; Xiaohan Yuan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Word-level valid and test perplexity on WikiText-2.
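
    For context on what the table reports: word-level perplexity is the exponential of the average negative log-likelihood per word. A minimal, self-contained sketch of that computation (the log-probabilities below are hypothetical values, not numbers from the table):

      import math

      # Per-word log-probabilities assigned by some language model to a held-out text.
      log_probs = [-4.2, -1.3, -6.8, -0.9]           # hypothetical values
      avg_nll = -sum(log_probs) / len(log_probs)     # average negative log-likelihood per word
      perplexity = math.exp(avg_nll)
      print(round(perplexity, 2))                    # lower is better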

  14. wikitext-2-second-half

    • huggingface.co
    Cite
    Hao Li, wikitext-2-second-half [Dataset]. https://huggingface.co/datasets/tartspuppy/wikitext-2-second-half
    Available download formats: Croissant
    Authors
    Hao Li
    Description

    The tartspuppy/wikitext-2-second-half dataset is hosted on Hugging Face and was contributed by the HF Datasets community.

  15. wikitext-2-noheader-sample

    • huggingface.co
    Cite
    Clara Na, wikitext-2-noheader-sample [Dataset]. https://huggingface.co/datasets/claran/wikitext-2-noheader-sample
    Available download formats: Croissant
    Authors
    Clara Na
    Description

    The claran/wikitext-2-noheader-sample dataset is hosted on Hugging Face and was contributed by the HF Datasets community.

  16. Perplexity of different initializations and improvement strategies

    • plos.figshare.com
    xls
    Updated Jun 10, 2023
    Cite
    Lu Yuwen; Shuyu Chen; Xiaohan Yuan (2023). Perplexity of different initializations and improvement strategies. [Dataset]. http://doi.org/10.1371/journal.pone.0249820.t007
    Available download formats: xls
    Dataset provided by
    PLOS ONE
    Authors
    Lu Yuwen; Shuyu Chen; Xiaohan Yuan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Perplexity of different initializations and improvement strategies.

  17. wikitext-2-raw-v1-preprocessed-1k

    • huggingface.co
    Cite
    DevQuasar, wikitext-2-raw-v1-preprocessed-1k [Dataset]. https://huggingface.co/datasets/DevQuasar/wikitext-2-raw-v1-preprocessed-1k
    Dataset authored and provided by
    DevQuasar
    Description

    The DevQuasar/wikitext-2-raw-v1-preprocessed-1k dataset is hosted on Hugging Face and was contributed by the HF Datasets community.

  18. The pseudocode of the learning rate back-tracking

    • figshare.com
    xls
    Updated Apr 15, 2021
    Cite
    Lu Yuwen; Shuyu Chen; Xiaohan Yuan (2021). The pseudocode of the learning rate back-tracking. [Dataset]. http://doi.org/10.1371/journal.pone.0249820.t006
    Available download formats: xls
    Dataset provided by
    PLOS ONE
    Authors
    Lu Yuwen; Shuyu Chen; Xiaohan Yuan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The pseudocode of the learning rate back-tracking.
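
    The table holds the paper's own pseudocode, which is not reproduced here. As a rough, generic illustration of the idea behind learning-rate back-tracking (an assumption about the general technique, not the authors' exact procedure): when a step fails to improve the objective, restore the previous best state and shrink the learning rate. A self-contained toy sketch:

      # Generic learning-rate back-tracking on a 1-D toy objective (illustrative
      # only; not the algorithm from the cited paper).
      def loss(w):
          return (w - 3.0) ** 2

      w, lr = 0.0, 1.5                       # deliberately oversized initial learning rate
      best_w, best_loss = w, loss(w)
      for _ in range(20):
          grad = 2.0 * (w - 3.0)
          candidate = w - lr * grad          # tentative gradient step
          if loss(candidate) < best_loss:    # improvement: accept the step
              w = best_w = candidate
              best_loss = loss(candidate)
          else:                              # back-track: restore the best point, halve lr
              w, lr = best_w, lr * 0.5
      print(round(w, 3), lr)                 # w approaches the minimum at 3.0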

  19. wikitext-2-raw-v1-forbidden-titles-1k

    • huggingface.co
    Updated Aug 7, 2024
    Cite
    ActiveRetrieval (2024). wikitext-2-raw-v1-forbidden-titles-1k [Dataset]. https://huggingface.co/datasets/Self-GRIT/wikitext-2-raw-v1-forbidden-titles-1k
    Available download formats: Croissant
    Dataset authored and provided by
    ActiveRetrieval
    Description

    The Self-GRIT/wikitext-2-raw-v1-forbidden-titles-1k dataset is hosted on Hugging Face and was contributed by the HF Datasets community.

  20. The pseudocode of Adadelta optimization algorithm

    • plos.figshare.com
    xls
    Updated Jun 4, 2023
    Cite
    Lu Yuwen; Shuyu Chen; Xiaohan Yuan (2023). The pseudocode of Adadelta optimization algorithm. [Dataset]. http://doi.org/10.1371/journal.pone.0249820.t003
    Available download formats: xls
    Dataset provided by
    PLOS ONE
    Authors
    Lu Yuwen; Shuyu Chen; Xiaohan Yuan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The pseudocode of Adadelta optimization algorithm.
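
    For context on what the tabulated algorithm does: Adadelta (Zeiler, 2012) scales each gradient step by the ratio of the running RMS of past updates to the running RMS of past gradients, so no global learning rate has to be chosen. A self-contained sketch of the standard update on a 1-D toy objective (not a transcription of the table above):

      # Standard Adadelta update rule applied to the toy objective (w - 3)^2.
      rho, eps = 0.95, 1e-6          # decay rate and epsilon as in the original paper
      w = 0.0                        # parameter being optimized
      eg2 = edx2 = 0.0               # running averages E[g^2] and E[dx^2]
      for _ in range(2000):
          g = 2.0 * (w - 3.0)                                      # gradient of (w - 3)^2
          eg2 = rho * eg2 + (1.0 - rho) * g * g                    # accumulate squared gradients
          dx = -((edx2 + eps) ** 0.5) / ((eg2 + eps) ** 0.5) * g   # scaled update, no learning rate
          edx2 = rho * edx2 + (1.0 - rho) * dx * dx                # accumulate squared updates
          w += dx
      print(round(w, 1))             # has moved from 0.0 close to the minimum at 3.0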
