36 datasets found
  1. openwebtext-10k

    • huggingface.co
    • opendatalab.com
    Updated Aug 27, 2021
    Stas Bekman (2021). openwebtext-10k [Dataset]. https://huggingface.co/datasets/stas/openwebtext-10k
    Dataset updated
    Aug 27, 2021
    Authors
    Stas Bekman
    Description

    An open-source replication of the WebText dataset from OpenAI.

    This is a small subset representing the first 10K records from the original dataset - created for testing.

    The full 8M-record dataset is at https://huggingface.co/datasets/openwebtext

  2. OpenWebText

    • live.european-language-grid.eu
    • opendatalab.com
    • +2 more
    Updated Apr 30, 2024
    (2024). OpenWebText [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7790
    Dataset updated
    Apr 30, 2024
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    An open-source replication of the WebText dataset from OpenAI. For more info please visit https://skylion007.github.io/OpenWebTextCorpus/

    @misc{Gokaslan2019OpenWeb,
      title={OpenWebText Corpus},
      author={Aaron Gokaslan and Vanya Cohen},
      howpublished={\url{http://Skylion007.github.io/OpenWebTextCorpus}},
      year={2019}
    }

  3. openwebtext-100k

    • huggingface.co
    Updated Sep 7, 2025
    Logan Riggs Smith (2025). openwebtext-100k [Dataset]. https://huggingface.co/datasets/Elriggs/openwebtext-100k
    Dataset updated
    Sep 7, 2025
    Authors
    Logan Riggs Smith
    Description

    Dataset Card for "openwebtext-100k"

    More Information needed

  4. openwebtext

    • huggingface.co
    Updated Feb 1, 2017
    + more versions
    Oliver Carey (2017). openwebtext [Dataset]. https://huggingface.co/datasets/olivercareyncl/openwebtext
    Dataset updated
    Feb 1, 2017
    Authors
    Oliver Carey
    Description

    olivercareyncl/openwebtext dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. OpenWebText (Gokaslan's distribution, 2019), GPT-2 Tokenized

    • academictorrents.com
    bittorrent
    Updated Jun 1, 2019
    eukaryote31 and Joshua Peterson and Aaron Gokaslan and Vanya Cohen (2019). OpenWebText (Gokaslan's distribution, 2019), GPT-2 Tokenized [Dataset]. https://academictorrents.com/details/36c39b25657ce1639ccec0a91cf242b42e1f01db
    Available download formats: bittorrent (16023403913)
    Dataset updated
    Jun 1, 2019
    Dataset authored and provided by
    eukaryote31 and Joshua Peterson and Aaron Gokaslan and Vanya Cohen
    License

    https://academictorrents.com/nolicensespecified

    Description

    Code by eukaryote31 and Joshua Peterson. Scraped by Aaron Gokaslan and Vanya Cohen. Tokenized by eukaryote31.

  6. openwebtext-gpt2

    • huggingface.co
    Updated Nov 18, 2024
    + more versions
    David Chanin (2024). openwebtext-gpt2 [Dataset]. https://huggingface.co/datasets/chanind/openwebtext-gpt2
    Dataset updated
    Nov 18, 2024
    Authors
    David Chanin
    Description

    chanind/openwebtext-gpt2 dataset hosted on Hugging Face and contributed by the HF Datasets community

  7. openwebtext-gemma-2-context-128

    • huggingface.co
    Updated Aug 2, 2025
    + more versions
    Marlon May (2025). openwebtext-gemma-2-context-128 [Dataset]. https://huggingface.co/datasets/Marlon154/openwebtext-gemma-2-context-128
    Dataset updated
    Aug 2, 2025
    Authors
    Marlon May
    Description

    OpenWebTextCorpus tokenized for Gemma 2 with 128 context size

    This dataset is a pre-tokenized version of the Skylion007/openwebtext dataset using the gemma tokenizer. As such, this dataset follows the same licensing as the original openwebtext dataset. This pre-tokenization is done as a performance optimization for using the openwebtext dataset with a Gemma model (gemma-2b, gemma-2b-it, gemma-7b, gemma-7b-it). This dataset was created using SAELens, with the following settings:… See the full description on the dataset page: https://huggingface.co/datasets/Marlon154/openwebtext-gemma-2-context-128.

  8. Subham Sekhar Sahoo, Aaron Gokaslan, Chris De Sa, Volodymyr Kuleshov (2024)....

    • service.tib.eu
    Updated Dec 2, 2024
    (2024). Subham Sekhar Sahoo, Aaron Gokaslan, Chris De Sa, Volodymyr Kuleshov (2024). Dataset: OpenWebText Corpus. https://doi.org/10.57702/yw8o2eqh [Dataset]. https://service.tib.eu/ldmservice/dataset/openwebtext-corpus
    Dataset updated
    Dec 2, 2024
    Description

    A dataset for language modeling, where the goal is to predict the next word in a sequence given the previous words.

  9. openwebtext

    • huggingface.co
    Updated Jul 29, 2024
    Thijmen Nijdam (2024). openwebtext [Dataset]. https://huggingface.co/datasets/Thijmen/openwebtext
    Dataset updated
    Jul 29, 2024
    Authors
    Thijmen Nijdam
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Thijmen/openwebtext dataset hosted on Hugging Face and contributed by the HF Datasets community

  10. Wikitext-103 and OpenWebText Models

    • zenodo.org
    application/gzip
    Updated Sep 30, 2020
    Forrest Davis (2020). Wikitext-103 and OpenWebText Models [Dataset]. http://doi.org/10.5281/zenodo.4053572
    Available download formats: application/gzip
    Dataset updated
    Sep 30, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Forrest Davis
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This repository contains 25 Wikitext-103 LSTM models and 25 LSTM models trained on a 100 million token subset of the OpenWebTextCorpus. Training/validation/test data is included with the Web models. By-epoch validation perplexity is given in the logs (within the directory for the models). Please write to me if you have any questions :)

  11. openwebtext-10k

    • huggingface.co
    + more versions
    Zhongnan Wang, openwebtext-10k [Dataset]. https://huggingface.co/datasets/wangzn2001/openwebtext-10k
    Authors
    Zhongnan Wang
    Description

    wangzn2001/openwebtext-10k dataset hosted on Hugging Face and contributed by the HF Datasets community

  12. OpenWebText-urls-26M-filtered.xz

    • academictorrents.com
    bittorrent
    Updated May 12, 2019
    eukaryote and jcpeterson (2019). OpenWebText-urls-26M-filtered.xz [Dataset]. https://academictorrents.com/details/f5161721b322bca66ed74da32b963c1066e64312
    Available download formats: bittorrent (480280068)
    Dataset updated
    May 12, 2019
    Dataset authored and provided by
    eukaryote and jcpeterson
    License

    https://academictorrents.com/nolicensespecified

    Description

    Every outbound Reddit link from before Dec 31, 2018 with at least 3 karma. The list is filtered to remove image sites, non-scraper-friendly sites, and other media files.
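
The filtering described for this URL list can be sketched roughly as follows. This is only an illustration under stated assumptions: the blocklists, the `keep_link` helper, and the sample links are hypothetical and are not taken from the dataset or from the original scraper code; only the "at least 3 karma" threshold comes from the description.

```python
from urllib.parse import urlparse

# Hypothetical blocklists for illustration only; the real lists are not given in this listing.
BLOCKED_DOMAINS = {"imgur.com", "i.redd.it", "gfycat.com"}
MEDIA_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".mp4", ".mp3", ".pdf")
MIN_KARMA = 3  # "at least 3 karma", per the dataset description


def keep_link(url: str, karma: int) -> bool:
    """Return True if a link passes the karma threshold and the site/media filters."""
    if karma < MIN_KARMA:
        return False
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]  # treat www.example.com and example.com alike
    if host in BLOCKED_DOMAINS:
        return False
    # Drop links that point directly at media files.
    return not parsed.path.lower().endswith(MEDIA_EXTENSIONS)


# Made-up sample of (url, karma) pairs.
links = [
    ("https://example.com/article", 5),
    ("https://imgur.com/abc", 10),
    ("https://example.org/video.mp4", 7),
    ("https://example.net/post", 2),
]
kept = [url for url, karma in links if keep_link(url, karma)]
print(kept)
```

Only the first sample link survives here: the second is on a blocked image host, the third is a media file, and the fourth falls below the karma threshold.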

  13. OpenLLMText Dataset

    • zenodo.org
    zip
    Updated Oct 19, 2023
    Yutian Chen; Hao Kang; Yiyan Zhai; Liangze Li; Rita Singh; Bhiksha Raj (2023). OpenLLMText Dataset [Dataset]. http://doi.org/10.5281/zenodo.8285326
    Available download formats: zip
    Dataset updated
    Oct 19, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Yutian Chen; Hao Kang; Yiyan Zhai; Liangze Li; Rita Singh; Bhiksha Raj
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The dataset contains approximately 300k text entries collected from 5 different sources (Human, ChatGPT, PaLM, LLaMA, GPT2-XL).

    60k are human-written, randomly selected from the OpenWebText dataset. These entries were collected from user-generated content on Reddit before 2019.

    60k are ChatGPT's (gpt3.5-turbo) paragraph-by-paragraph rephrasings of the human-written data.

    60k are PaLM's (Pathway Language Model, text-bison-001) paragraph-by-paragraph rephrasings of the human-written data.

    60k are LLaMA-7B's (Large Language Model Meta AI) paragraph-by-paragraph rephrasings of the human-written data.

    60k are adapted from the GPT2-output dataset released by OpenAI (GPT2-XL).

  14. openwebtext.json

    • huggingface.co
    Updated Mar 12, 2025
    Oliver Carey (2025). openwebtext.json [Dataset]. https://huggingface.co/datasets/olivercareyncl/openwebtext.json
    Dataset updated
    Mar 12, 2025
    Authors
    Oliver Carey
    Description

    olivercareyncl/openwebtext.json dataset hosted on Hugging Face and contributed by the HF Datasets community

  15. openwebtext-cc-196K

    • huggingface.co
    Updated Jun 8, 2025
    + more versions
    Woojin Chung (2025). openwebtext-cc-196K [Dataset]. https://huggingface.co/datasets/gartland/openwebtext-cc-196K
    Dataset updated
    Jun 8, 2025
    Authors
    Woojin Chung
    Description

    gartland/openwebtext-cc-196K dataset hosted on Hugging Face and contributed by the HF Datasets community

  16. openwebtext

    • huggingface.co
    Updated Sep 14, 2025
    andythebreaker (2025). openwebtext [Dataset]. https://huggingface.co/datasets/andythebreaker/openwebtext
    Dataset updated
    Sep 14, 2025
    Authors
    andythebreaker
    Description

    andythebreaker/openwebtext dataset hosted on Hugging Face and contributed by the HF Datasets community

  17. openwebtext

    • huggingface.co
    Updated Sep 6, 2025
    laxman vijay (2025). openwebtext [Dataset]. https://huggingface.co/datasets/laxmanvijay24/openwebtext
    Dataset updated
    Sep 6, 2025
    Authors
    laxman vijay
    Description

    laxmanvijay24/openwebtext dataset hosted on Hugging Face and contributed by the HF Datasets community

  18. openwebtext-cc-98K

    • huggingface.co
    Updated Jun 8, 2025
    Woojin Chung (2025). openwebtext-cc-98K [Dataset]. https://huggingface.co/datasets/gartland/openwebtext-cc-98K
    Dataset updated
    Jun 8, 2025
    Authors
    Woojin Chung
    Description

    gartland/openwebtext-cc-98K dataset hosted on Hugging Face and contributed by the HF Datasets community

  19. deepseek-v3-10k

    • huggingface.co
    Updated Jan 3, 2025
    + more versions
    tony shark (2025). deepseek-v3-10k [Dataset]. https://huggingface.co/datasets/tonyshark/deepseek-v3-10k
    Dataset updated
    Jan 3, 2025
    Authors
    tony shark
    License

    https://choosealicense.com/licenses/bigscience-bloom-rail-1.0/

    Description

    The first 10K elements of The Pile, useful for debugging models trained on it. See the HuggingFace page for the full Pile for more info. Inspired by stas' great resource doing the same for OpenWebText

  20. openwebtext-1b-llama3-tokenized-cxt-1024

    • huggingface.co
    Updated Aug 16, 2024
    + more versions
    Jiatong Han (2024). openwebtext-1b-llama3-tokenized-cxt-1024 [Dataset]. http://doi.org/10.57967/hf/2890
    Dataset updated
    Aug 16, 2024
    Authors
    Jiatong Han
    Description

    Juliushanhanhan/openwebtext-1b-llama3-tokenized-cxt-1024 dataset hosted on Hugging Face and contributed by the HF Datasets community

stas/openwebtext-10k: 5 scholarly articles cite this dataset.