53 datasets found
  1. openwebtext-10k

    • huggingface.co
    • opendatalab.com
    Updated Aug 27, 2021
    Cite
    Stas Bekman (2021). openwebtext-10k [Dataset]. https://huggingface.co/datasets/stas/openwebtext-10k
    7 scholarly articles cite this dataset (View in Google Scholar)
    Authors
    Stas Bekman
    Description

    An open-source replication of the WebText dataset from OpenAI.

    This is a small subset representing the first 10K records from the original dataset - created for testing.

    The full 8M-record dataset is at https://huggingface.co/datasets/openwebtext
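    Subsets like this are typically produced by taking the first N records from a streamed pass over the full corpus. A minimal, dataset-agnostic sketch of that operation (the `corpus` generator below is a stand-in for the real data, not the actual openwebtext loader):

```python
from itertools import islice

def take_first_n(records, n):
    """Collect the first n records from any iterable, without consuming the rest."""
    return list(islice(records, n))

# Stand-in for a streamed corpus; in practice this would be an iterator
# over the full openwebtext dataset (e.g. in streaming mode).
corpus = ({"text": f"document {i}"} for i in range(1_000_000))
subset = take_first_n(corpus, 10_000)
print(len(subset))  # 10000
```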

  2. openwebtext

    • huggingface.co
    • opendatalab.com
    • + 2 more
    Updated Feb 1, 2017
    Cite
    Dylan Ebert (2017). openwebtext [Dataset]. https://huggingface.co/datasets/dylanebert/openwebtext
    Authors
    Dylan Ebert
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    Dataset Card for "openwebtext"

      Dataset Summary

    An open-source replication of the WebText dataset from OpenAI, which was used to train GPT-2. This distribution was created by Aaron Gokaslan and Vanya Cohen of Brown University.

      Supported Tasks and Leaderboards

    More Information Needed

      Languages

    More Information Needed

      Dataset Structure

      Data Instances

      plain_text

    Size of downloaded dataset files: 13.51 GB. Size of the… See the full description on the dataset page: https://huggingface.co/datasets/dylanebert/openwebtext.

  3. openwebtext

    • huggingface.co
    Updated Feb 1, 2017
    + more versions
    Cite
    Oliver Carey (2017). openwebtext [Dataset]. https://huggingface.co/datasets/olivercareyncl/openwebtext
    Authors
    Oliver Carey
    Description

    olivercareyncl/openwebtext dataset hosted on Hugging Face and contributed by the HF Datasets community

  4. openwebtext-100k

    • huggingface.co
    Updated Sep 7, 2025
    Cite
    Logan Riggs Smith (2025). openwebtext-100k [Dataset]. https://huggingface.co/datasets/Elriggs/openwebtext-100k
    Authors
    Logan Riggs Smith
    Description

    Dataset Card for "openwebtext-100k"

    More Information needed

  5. OpenWebText-gpt2

    • kaggle.com
    zip
    Updated Jan 25, 2025
    Cite
    windmaple (2025). OpenWebText-gpt2 [Dataset]. https://www.kaggle.com/datasets/windmaple/openwebtext-gpt2
    Available download formats: zip (12138662851 bytes)
    Authors
    windmaple
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This is the processed dataset using Andrej Karpathy's script https://github.com/karpathy/nanoGPT/tree/master/data/openwebtext. The original dataset is from https://huggingface.co/datasets/Skylion007/openwebtext, which now requires datasets lib version < 3 to download.
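    For context, nanoGPT's preprocessing stores the tokenized corpus as flat binary files of uint16 token ids (GPT-2's 50,257-id vocabulary fits in 16 bits) that training code memory-maps. A rough sketch of that storage scheme, using a toy character-level tokenizer in place of the real GPT-2 BPE:

```python
import os
import tempfile

import numpy as np

def write_token_bin(token_ids, path):
    # GPT-2 token ids are all below 50257, so uint16 suffices.
    np.array(token_ids, dtype=np.uint16).tofile(path)

def read_token_bin(path):
    # Memory-mapping lets a training loop sample batches
    # without loading the whole file into RAM.
    return np.memmap(path, dtype=np.uint16, mode="r")

# Toy "tokenizer": each character's code point (stand-in only).
tokens = [ord(c) for c in "hello openwebtext"]
path = os.path.join(tempfile.mkdtemp(), "train.bin")
write_token_bin(tokens, path)
loaded = read_token_bin(path)
print(len(loaded), loaded.dtype)
```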

  6. OpenWebText 2M Subset

    • kaggle.com
    Updated Mar 17, 2025
    Cite
    Nikhil R (2025). OpenWebText 2M Subset [Dataset]. https://www.kaggle.com/datasets/nikhilr612/openwebtext-2m-subset
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Nikhil R
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    A subset of OpenWebText, an open-source recreation of OpenAI's internal WebText corpus. This subset contains ~2 million documents, mainly in English, scraped from the Web. Highly unstructured text data; not necessarily clean.

  7. Skylion007-openwebtext-tokenizer-gpt2

    • huggingface.co
    Updated Nov 15, 2024
    Cite
    Apollo Research (2024). Skylion007-openwebtext-tokenizer-gpt2 [Dataset]. https://huggingface.co/datasets/apollo-research/Skylion007-openwebtext-tokenizer-gpt2
    Dataset authored and provided by
    Apollo Research
    Description

    apollo-research/Skylion007-openwebtext-tokenizer-gpt2 dataset hosted on Hugging Face and contributed by the HF Datasets community

  8. OpenWebText Dataset (TFRecords)

    • kaggle.com
    zip
    Updated Jan 29, 2025
    Cite
    Taha Bouhsine (2025). OpenWebText Dataset (TFRecords) [Dataset]. https://www.kaggle.com/skywolfmo/openwebtext-tfrecords
    Available download formats: zip (12374355616 bytes)
    Authors
    Taha Bouhsine
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The OpenWebText dataset, preprocessed and formatted as TFRecords for efficient use with TPUs. This format optimizes data loading and processing for large-scale language modeling tasks. Ideal for training transformer models on Google Colab with TPUs.
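    The TFRecord container itself is a simple length-prefixed framing; a pure-Python sketch of that framing is below. Note this is illustrative only: real TFRecord files carry masked CRC32C checksums, which this sketch writes as zeros and never verifies. In practice you would read these files with tf.data.TFRecordDataset.

```python
import io
import struct

def write_record(stream, data: bytes):
    # TFRecord framing: uint64 length, uint32 length-CRC, payload, uint32 data-CRC.
    # Real writers use masked CRC32C checksums; this sketch writes zeros.
    stream.write(struct.pack("<Q", len(data)))
    stream.write(b"\x00\x00\x00\x00")
    stream.write(data)
    stream.write(b"\x00\x00\x00\x00")

def read_records(stream):
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return
        (length,) = struct.unpack("<Q", header)
        stream.read(4)              # skip length CRC (not verified here)
        data = stream.read(length)
        stream.read(4)              # skip data CRC (not verified here)
        yield data

buf = io.BytesIO()
for doc in [b"first document", b"second document"]:
    write_record(buf, doc)
buf.seek(0)
print([r.decode() for r in read_records(buf)])
```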

  9. OpenWebText

    • kaggle.com
    zip
    Updated Nov 23, 2024
    Cite
    Florian 0627 (2024). OpenWebText [Dataset]. https://www.kaggle.com/datasets/florian0627/openwebtext
    Available download formats: zip (12138664253 bytes)
    Authors
    Florian 0627
    Description

    Dataset

    This dataset was created by Florian 0627

    Contents

  10. OpenWebText-Dataset

    • kaggle.com
    zip
    Updated Nov 4, 2023
    Cite
    Himon Sarkar (2023). OpenWebText-Dataset [Dataset]. https://www.kaggle.com/himonsarkar/openwebtext-dataset
    Available download formats: zip (16214363432 bytes)
    Authors
    Himon Sarkar
    Description

    Dataset

    This dataset was created by Himon Sarkar

    Contents

  11. quarter_of_openwebtext

    • kaggle.com
    zip
    Updated Sep 7, 2023
    Cite
    ben alla ismail (2023). quarter_of_openwebtext [Dataset]. https://www.kaggle.com/datasets/benallaismail/gpt-data/code
    Available download formats: zip (5830148796 bytes)
    Authors
    ben alla ismail
    Description

    I am Ismail Ben Alla, a computer science engineer with a keen interest in advancing natural language processing and deep learning. As part of a comprehensive project to develop GPT-2 from the ground up, I undertook the task of preprocessing a substantial portion of the OpenWebText dataset.

    The dataset preprocessing involved tokenization using the highly effective GPT-2 tokenizer, resulting in the creation of two distinct sets:

    • Test Dataset (Approximately 856M): This dataset is carefully curated for testing and evaluation purposes.
    • Training Dataset (7.66G): This extensive dataset serves as a robust foundation for training and enhancing deep learning models.
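    The two-set split described above can be sketched as a simple cutoff over the tokenized documents. Everything below is illustrative: the whitespace tokenizer stands in for the real GPT-2 tokenizer, and the roughly one-in-ten split is an assumption inferred from the stated sizes (~856M test vs. 7.66G train):

```python
def split_documents(docs, test_fraction=0.1):
    """Split a list of (tokenized) documents into train/test sets by cutoff."""
    cut = int(len(docs) * (1 - test_fraction))
    return docs[:cut], docs[cut:]

# Toy whitespace tokenizer standing in for the GPT-2 tokenizer.
def toy_tokenize(text):
    return text.split()

docs = [f"sample document number {i}" for i in range(100)]
tokenized = [toy_tokenize(d) for d in docs]
train, test = split_documents(tokenized)
print(len(train), len(test))  # 90 10
```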

    For additional insights, updates, and access to this valuable dataset, please refer to the following links:

  12. openwebtext-gpt2

    • huggingface.co
    Updated Nov 18, 2024
    + more versions
    Cite
    David Chanin (2024). openwebtext-gpt2 [Dataset]. https://huggingface.co/datasets/chanind/openwebtext-gpt2
    Authors
    David Chanin
    Description

    chanind/openwebtext-gpt2 dataset hosted on Hugging Face and contributed by the HF Datasets community

  13. Roughly one quarter of openwebtext

    • kaggle.com
    zip
    Updated May 5, 2019
    Cite
    Isamu (2019). Roughly one quarter of openwebtext [Dataset]. https://www.kaggle.com/isamuisozaki/roughly-one-quarter-of-openwebtext
    Available download formats: zip (3464531598 bytes)
    Authors
    Isamu
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Context

    This is a dataset made by slightly modifying the code from this GitHub repository. It is part of the open-webtext movement, where people are trying to replicate the data that GPT-2 (a text-generation model once said to be too dangerous to publish) was trained on. This is just roughly a quarter of the entire data, but as Vanya Cohen stated in the Discussion section, the full dataset is available at https://skylion007.github.io/OpenWebTextCorpus/.

    Content

    This dataset contains an id column identifying each text record, with ids up to 2019434 in the large data.db and up to 743 in the smaller .db. Each id corresponds to a text column holding a chunk of text.
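    Records in such a SQLite file can be pulled out with Python's built-in sqlite3 module. A self-contained sketch against an in-memory table mirroring the described id/text layout (the table and column names here are assumptions; inspect the actual .db schema before querying):

```python
import sqlite3

# Build an in-memory table mirroring the described schema: an integer
# id column and a text column. (Names are assumptions, not the real schema.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (id INTEGER PRIMARY KEY, text TEXT)")
conn.executemany(
    "INSERT INTO data (id, text) VALUES (?, ?)",
    [(i, f"text chunk {i}") for i in range(1, 6)],
)
conn.commit()

# Fetch a record by id, as a consumer of data.db would.
row = conn.execute("SELECT text FROM data WHERE id = ?", (3,)).fetchone()
print(row[0])
```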

    Acknowledgements

    Thanks to eukaryote31 for the GitHub repo https://github.com/eukaryote31/openwebtext

    Inspiration

    OpenAI's GPT-2 model is said to have still underfitted the 40GB WebText corpus, so I'm wondering how much data would be just right.

  14. openwebtext_1M

    • kaggle.com
    zip
    Updated Mar 18, 2024
    Cite
    Tanay Mehta (2024). openwebtext_1M [Dataset]. https://www.kaggle.com/datasets/heyytanay/openwebtext-1m/code
    Available download formats: zip (2043993317 bytes)
    Authors
    Tanay Mehta
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    A subset of the Skylion007/openwebtext dataset consisting of 1 million tokenized samples in the Lance file format for blazing-fast, memory-efficient I/O.

    The files were tokenized using the gpt2 tokenizer with no extra tokens.

    For detailed information on how the dataset was created, refer to my article on Curating Custom Datasets for efficient LLM training using Lance.

    Instructions for using this dataset

    This dataset is not meant to be used in Kaggle Kernels: Lance needs write access to the dataset's directory, which Kaggle's input directory does not provide, and the dataset is too large to move to /kaggle/working. To use it, download the dataset via the Kaggle API or from this page, then move the unzipped files into a folder called openwebtext_1M.lance. Below are detailed snippets on how to download and use this dataset.

    First download and unzip the dataset from your terminal (make sure your Kaggle API key is at ~/.kaggle/):

    $ pip install -q kaggle pyarrow pylance
    $ kaggle datasets download -d heyytanay/openwebtext-1m
    $ mkdir openwebtext_1M.lance/
    $ unzip -qq openwebtext-1m.zip -d openwebtext_1M.lance/
    $ rm openwebtext-1m.zip
    

    Once this is done, you will find your dataset in the openwebtext_1M.lance/ folder. Now to load and get a gist of the data, run the below snippet.

    import lance
    dataset = lance.dataset('openwebtext_1M.lance/')
    print(dataset.count_rows())
    

    This will give you the total number of rows (tokenized samples) in the dataset.

  15. openwebtext-gemma

    • huggingface.co
    Updated Jun 16, 2024
    + more versions
    Cite
    David Chanin (2024). openwebtext-gemma [Dataset]. https://huggingface.co/datasets/chanind/openwebtext-gemma
    Authors
    David Chanin
    Description

    OpenWebTextCorpus tokenized for Gemma

    This dataset is a pre-tokenized version of the Skylion007/openwebtext dataset using the gemma tokenizer. As such, this dataset follows the same licensing as the original openwebtext dataset. This pre-tokenization is done as a performance optimization for using the openwebtext dataset with a Gemma model (gemma-2b, gemma-2b-it, gemma-7b, gemma-7b-it). This dataset was created using SAELens, with the following settings:

    context_size: 8192… See the full description on the dataset page: https://huggingface.co/datasets/chanind/openwebtext-gemma.
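    Pre-tokenizing at a fixed context size generally means concatenating token ids into one stream and slicing it into equal-length blocks. A minimal sketch of that slicing (context size 8 here for readability; the card above reports 8192):

```python
def chunk_tokens(token_ids, context_size):
    """Slice a flat token stream into full blocks of context_size,
    dropping any incomplete final block."""
    n_full = len(token_ids) // context_size
    return [
        token_ids[i * context_size : (i + 1) * context_size]
        for i in range(n_full)
    ]

stream = list(range(20))
blocks = chunk_tokens(stream, context_size=8)
print(blocks)  # two full blocks of 8; the trailing 4 tokens are dropped
```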

  16. OpenWebText

    • kaggle.com
    zip
    Updated Mar 12, 2023
    Cite
    Sean and Joanna (2023). OpenWebText [Dataset]. https://www.kaggle.com/datasets/seanandjoanna/openwebtext
    Available download formats: zip (12877185949 bytes)
    Authors
    Sean and Joanna
    Description

    A copy of OpenWebText, an open-source replication of OpenAI's internal WebText corpus, which was used to train GPT-2.

  17. openwebtext_split

    • kaggle.com
    zip
    Updated Feb 12, 2024
    Cite
    tuggypetu (2024). openwebtext_split [Dataset]. https://www.kaggle.com/datasets/tuggypetu/openwebtext-split
    Available download formats: zip (16193939469 bytes)
    Authors
    tuggypetu
    Description

    Dataset

    This dataset was created by tuggypetu

    Contents

  18. openwebtext-10k

    • huggingface.co
    + more versions
    Cite
    Zhongnan Wang, openwebtext-10k [Dataset]. https://huggingface.co/datasets/wangzn2001/openwebtext-10k
    Authors
    Zhongnan Wang
    Description

    wangzn2001/openwebtext-10k dataset hosted on Hugging Face and contributed by the HF Datasets community

  19. openwebtext

    • huggingface.co
    Updated Jul 9, 2023
    Cite
    Thijmen Nijdam (2023). openwebtext [Dataset]. https://huggingface.co/datasets/Thijmen/openwebtext
    Authors
    Thijmen Nijdam
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Thijmen/openwebtext dataset hosted on Hugging Face and contributed by the HF Datasets community

  20. openwebtext-tokenized-9b

    • huggingface.co
    Updated Mar 15, 2023
    Cite
    Neel Nanda (2023). openwebtext-tokenized-9b [Dataset]. https://huggingface.co/datasets/NeelNanda/openwebtext-tokenized-9b
    Authors
    Neel Nanda
    Description

    Dataset Card for "openwebtext-tokenized-9b"

    More Information needed
