100+ datasets found
  1. wikitext2

    • huggingface.co
    • opendatalab.com
    Updated Oct 21, 2023
    Cite
    Jan Karsten Kuhnke (2023). wikitext2 [Dataset]. https://huggingface.co/datasets/mindchain/wikitext2
    Authors
    Jan Karsten Kuhnke
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Dataset Card for "wikitext"

      Dataset Summary
    

    The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License. Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger… See the full description on the dataset page: https://huggingface.co/datasets/mindchain/wikitext2.

  2. Wikitext-103

    • huggingface.co
    Updated Aug 11, 2025
    Cite
    Rushitha Mamidala (2025). Wikitext-103 [Dataset]. https://huggingface.co/datasets/Ritu27/Wikitext-103
    Authors
    Rushitha Mamidala
    Description

    Ritu27/Wikitext-103 dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. wikitext-tiny

    • huggingface.co
    Updated Aug 31, 2023
    Cite
    yujiepan (2023). wikitext-tiny [Dataset]. https://huggingface.co/datasets/yujiepan/wikitext-tiny
    Authors
    yujiepan
    Description

    This dataset is sampled from wikitext/wikitext-2-v1/train. Code to generate this dataset:

        import datasets

        dataset = datasets.load_dataset('wikitext', 'wikitext-2-v1')

        # Collect the first 24 training rows with 9 to 16 space-separated tokens and no '='.
        selected = []
        i = -1
        while len(selected) < 24:
            i += 1
            text = dataset['train'][i]['text']
            if 8 < len(text.split(' ')) <= 16 and '=' not in text:
                selected.append(i)

        tiny_dataset = dataset['train'].select(selected)
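The selection rule in the snippet above can be tried offline on a plain list of strings. The following is a minimal sketch, not the dataset author's code: the helper name `select_short_lines`, its parameters, and the sample strings are hypothetical. Unlike the original loop, it stops quietly if fewer than `limit` rows match instead of running past the end of the data.

```python
def select_short_lines(texts, limit=24, min_words=8, max_words=16):
    """Return indices of the first `limit` texts that pass the length and '=' filter."""
    selected = []
    for i, text in enumerate(texts):
        if len(selected) >= limit:
            break
        # Same tokenisation as the dataset card: split on single spaces.
        n_words = len(text.split(' '))
        # Keep rows with more than min_words and at most max_words tokens;
        # rows containing '=' (section headings in WikiText) are skipped.
        if min_words < n_words <= max_words and '=' not in text:
            selected.append(i)
    return selected


# Hypothetical sample rows, used only to exercise the filter.
sample = [
    " = Heading = \n",
    "one two three four five six seven eight nine ten",
    "too short",
    "a b c d e f g h i = j",
]
indices = select_short_lines(sample, limit=2)
```

With the real data, the returned indices could be passed to `dataset['train'].select(...)` as in the dataset card.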

  4. wikitext-103-raw

    • huggingface.co
    Updated Jun 26, 2024
    Cite
    Matthew Watson (2024). wikitext-103-raw [Dataset]. https://huggingface.co/datasets/mattdangerw/wikitext-103-raw
    Authors
    Matthew Watson
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Wikitext 103 Raw Dataset

    The original link for this dataset is now down, so the original zip file is hosted here.

  5. wikitext-2-v1

    • huggingface.co
    Updated Apr 20, 2023
    Cite
    Mark Mealman (2023). wikitext-2-v1 [Dataset]. https://huggingface.co/datasets/Dracones/wikitext-2-v1
    Authors
    Mark Mealman
    Description

    Dracones/wikitext-2-v1 dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. wikitext_fr

    • huggingface.co
    Updated May 31, 2024
    Cite
    Antoine SIMOULIN (2024). wikitext_fr [Dataset]. https://huggingface.co/datasets/asi/wikitext_fr
    Authors
    Antoine SIMOULIN
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The Wikitext-fr language modeling dataset consists of over 70 million tokens extracted from the set of French Wikipedia articles that are classified as "quality articles" or "good articles". The aim is to replicate the English benchmark.

  7. wikitext-2-raw-v1-forbidden-titles-train

    • huggingface.co
    Updated Jul 25, 2024
    Cite
    ActiveRetrieval (2024). wikitext-2-raw-v1-forbidden-titles-train [Dataset]. https://huggingface.co/datasets/Self-GRIT/wikitext-2-raw-v1-forbidden-titles-train
    Dataset authored and provided by
    ActiveRetrieval
    Description

    Self-GRIT/wikitext-2-raw-v1-forbidden-titles-train dataset hosted on Hugging Face and contributed by the HF Datasets community

  8. wikitext-csv

    • huggingface.co
    Updated Feb 14, 2024
    Cite
    Pooodle Shmith (2024). wikitext-csv [Dataset]. https://huggingface.co/datasets/Tom9000/wikitext-csv
    Authors
    Pooodle Shmith
    License

    Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This one is not entirely identical to the default "wiki.train.raw" dataset used with llama.cpp, so instead of this one, get the recommended one from here: https://huggingface.co/datasets/ggml-org/ci

  9. wikitext-103-v1-5p

    • huggingface.co
    Cite
    Fuheng Wu, wikitext-103-v1-5p [Dataset]. https://huggingface.co/datasets/wufuheng/wikitext-103-v1-5p
    Authors
    Fuheng Wu
    Description

    wufuheng/wikitext-103-v1-5p dataset hosted on Hugging Face and contributed by the HF Datasets community

  10. wikitext-2

    • huggingface.co
    Cite
    Mika Senghaas, wikitext-2 [Dataset]. https://huggingface.co/datasets/mikasenghaas/wikitext-2
    Authors
    Mika Senghaas
    Description

    mikasenghaas/wikitext-2 dataset hosted on Hugging Face and contributed by the HF Datasets community

  11. wikitext-v2-clean

    • huggingface.co
    Cite
    HaryoAW, wikitext-v2-clean [Dataset]. https://huggingface.co/datasets/haryoaw/wikitext-v2-clean
    Authors
    HaryoAW
    Description

    haryoaw/wikitext-v2-clean dataset hosted on Hugging Face and contributed by the HF Datasets community

  12. wikitext-103

    • huggingface.co
    Updated Apr 5, 2024
    Cite
    SE Gyges (2024). wikitext-103 [Dataset]. https://huggingface.co/datasets/segyges/wikitext-103
    Authors
    SE Gyges
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    segyges/wikitext-103 dataset hosted on Hugging Face and contributed by the HF Datasets community

  13. wikitext-2-raw-v1-shuffled

    • huggingface.co
    Cite
    Tongyao, wikitext-2-raw-v1-shuffled [Dataset]. https://huggingface.co/datasets/tyzhu/wikitext-2-raw-v1-shuffled
    Authors
    Tongyao
    Description

    Dataset Card for "wikitext-2-raw-v1-shuffled"

    More Information needed

  14. wikitext-103-raw-v1_sents_min_len10_max_len30_DEBUG

    • huggingface.co
    Updated Nov 23, 2011
    Cite
    Carlos E. Jimenez (2011). wikitext-103-raw-v1_sents_min_len10_max_len30_DEBUG [Dataset]. https://huggingface.co/datasets/carlosejimenez/wikitext-103-raw-v1_sents_min_len10_max_len30_DEBUG
    Authors
    Carlos E. Jimenez
    Description

    carlosejimenez/wikitext-103-raw-v1_sents_min_len10_max_len30_DEBUG dataset hosted on Hugging Face and contributed by the HF Datasets community

  15. wikitext

    • huggingface.co
    Updated Oct 17, 2023
    Cite
    Ankit (2023). wikitext [Dataset]. https://huggingface.co/datasets/Ankit1057/wikitext
    Authors
    Ankit
    Description

    Ankit1057/wikitext dataset hosted on Hugging Face and contributed by the HF Datasets community

  16. wikitext

    • huggingface.co
    Cite
    yuval goldstein, wikitext [Dataset]. https://huggingface.co/datasets/golyuval/wikitext
    Authors
    yuval goldstein
    Description

    golyuval/wikitext dataset hosted on Hugging Face and contributed by the HF Datasets community

  17. wikitext

    • huggingface.co
    Updated Nov 13, 2025
    Cite
    Lang Xiong (2025). wikitext [Dataset]. https://huggingface.co/datasets/Lang008/wikitext
    Authors
    Lang Xiong
    License

    MIT License https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Lang008/wikitext dataset hosted on Hugging Face and contributed by the HF Datasets community

  18. wikitext-103-special

    • huggingface.co
    Cite
    Marius Mosbach, wikitext-103-special [Dataset]. https://huggingface.co/datasets/mmosbach/wikitext-103-special
    Authors
    Marius Mosbach
    Description

    mmosbach/wikitext-103-special dataset hosted on Hugging Face and contributed by the HF Datasets community

  19. wikitext

    • huggingface.co
    Updated Mar 11, 2024
    Cite
    Renati (2024). wikitext [Dataset]. https://huggingface.co/datasets/RRickk/wikitext
    Authors
    Renati
    Description

    RRickk/wikitext dataset hosted on Hugging Face and contributed by the HF Datasets community

  20. wikitext-103-v1-cleaned

    • huggingface.co
    Cite
    Vishwa Teja, wikitext-103-v1-cleaned [Dataset]. https://huggingface.co/datasets/vish26/wikitext-103-v1-cleaned
    Authors
    Vishwa Teja
    Description

    vish26/wikitext-103-v1-cleaned dataset hosted on Hugging Face and contributed by the HF Datasets community
