10 datasets found
  1. openwebtext

    • huggingface.co
    • paperswithcode.com
    • +4 more
    Updated Sep 28, 2020
    Aaron Gokaslan (2020). openwebtext [Dataset]. https://huggingface.co/datasets/Skylion007/openwebtext
    Authors
    Aaron Gokaslan
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    An open-source replication of the WebText dataset from OpenAI.

  2. openwebtext-10k

    • opendatalab.com
    • huggingface.co
    zip
    Updated Jan 1, 2019
    (2019). openwebtext-10k [Dataset]. https://opendatalab.com/OpenDataLab/openwebtext-10k
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    A 10K-record slice of OpenWebText, an open-source replication of the WebText dataset from OpenAI. This small subset contains the first 10K records of the original dataset and was created for testing.
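
    As a rough illustration of what taking "the first 10K records" involves, the sketch below reads only the leading N items from a lazily produced stream. The helper `first_n_records` and the generator `toy_records` are hypothetical names used for illustration; they are not part of any dataset tooling, and the real subset may have been built differently.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def first_n_records(records: Iterable[str], n: int = 10_000) -> List[str]:
    """Return the first n records without consuming the rest of the stream."""
    return list(islice(records, n))


def toy_records() -> Iterator[str]:
    # Hypothetical stand-in for a large corpus; yields documents lazily.
    i = 0
    while True:
        yield f"document {i}"
        i += 1


subset = first_n_records(toy_records(), n=10_000)
print(len(subset), subset[0], subset[-1])
```

    Because `islice` stops after N items, the full corpus never has to be loaded, which is the main appeal of a small test slice.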

  3. openwebtext-tokenized-Llama-3.2

    • huggingface.co
    Updated Mar 26, 2025
    Alex Gulko (2025). openwebtext-tokenized-Llama-3.2 [Dataset]. https://huggingface.co/datasets/GulkoA/openwebtext-tokenized-Llama-3.2
    Authors
    Alex Gulko
    Description

    OpenWebText dataset (an open-source replication of the WebText dataset from OpenAI, which was used to train GPT-2), tokenized for Llama 3.2 models. Useful for accelerated training and testing of sparse autoencoders. Context size: 128; not shuffled.
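
    One way to picture a "context size: 128, not shuffled" layout is a token stream cut into consecutive rows of 128 ids in their original order. The sketch below assumes non-overlapping windows and drops a trailing partial row; the description does not say how the dataset was actually chunked, and `chunk_tokens` and the placeholder ids are hypothetical.

```python
from typing import List, Sequence

CONTEXT_SIZE = 128  # context size stated in the description


def chunk_tokens(
    token_ids: Sequence[int], context_size: int = CONTEXT_SIZE
) -> List[List[int]]:
    """Split a token stream into consecutive, non-overlapping rows.

    Order is preserved (no shuffling). A trailing partial row is dropped,
    which is an assumption of this sketch.
    """
    if context_size <= 0:
        raise ValueError("context_size must be positive")
    n_full = len(token_ids) // context_size
    return [
        list(token_ids[i * context_size:(i + 1) * context_size])
        for i in range(n_full)
    ]


fake_ids = list(range(300))  # placeholder ids, not real tokenizer output
rows = chunk_tokens(fake_ids)
print(len(rows), len(rows[0]), rows[1][0])
```

    Fixed-length rows like these can be batched directly, which is presumably why pre-tokenized copies are convenient for sparse autoencoder experiments.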

  4. openwebtext.json

    • huggingface.co
    Updated Mar 12, 2025
    Oliver Carey (2025). openwebtext.json [Dataset]. https://huggingface.co/datasets/olivercareyncl/openwebtext.json
    Authors
    Oliver Carey
    Description

    olivercareyncl/openwebtext.json dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. openwebtext

    • huggingface.co
    Updated Feb 1, 2017
    Oliver Carey (2017). openwebtext [Dataset]. https://huggingface.co/datasets/olivercareyncl/openwebtext
    Authors
    Oliver Carey
    Description

    olivercareyncl/openwebtext dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. openwebtext-gemma-1024-abbrv-2B

    • huggingface.co
    Updated Feb 20, 2025
    David Chanin (2025). openwebtext-gemma-1024-abbrv-2B [Dataset]. https://huggingface.co/datasets/chanind/openwebtext-gemma-1024-abbrv-2B
    Authors
    David Chanin
    Description

    chanind/openwebtext-gemma-1024-abbrv-2B dataset hosted on Hugging Face and contributed by the HF Datasets community

  7. Skylion007-openwebtext-tokenizer-gpt2

    • huggingface.co
    Updated Mar 11, 2025
    Xiangchen Song (2025). Skylion007-openwebtext-tokenizer-gpt2 [Dataset]. https://huggingface.co/datasets/xiangchensong/Skylion007-openwebtext-tokenizer-gpt2
    Authors
    Xiangchen Song
    Description

    xiangchensong/Skylion007-openwebtext-tokenizer-gpt2 dataset hosted on Hugging Face and contributed by the HF Datasets community

  8. Skylion007-openwebtext-tokenizer-gpt2-64

    • huggingface.co
    Updated Jul 14, 2024
    Perception, Control, and Cognition Lab (2024). Skylion007-openwebtext-tokenizer-gpt2-64 [Dataset]. https://huggingface.co/datasets/pccl-org/Skylion007-openwebtext-tokenizer-gpt2-64
    Dataset authored and provided by
    Perception, Control, and Cognition Lab
    Description

    pccl-org/Skylion007-openwebtext-tokenizer-gpt2-64 dataset hosted on Hugging Face and contributed by the HF Datasets community

  9. openwebtext_val_vocab

    • kaggle.com
    zip
    Updated Feb 15, 2024
    tuggypetu (2024). openwebtext_val_vocab [Dataset]. https://www.kaggle.com/tuggypetu/openwebtext-val-vocab
    Available download formats: zip (1618447286 bytes)
    Authors
    tuggypetu
    Description

    This dataset was created by tuggypetu.

  10. pile-10k

    • opendatalab.com
    zip
    Updated Dec 16, 2023
    (2023). pile-10k [Dataset]. https://opendatalab.com/OpenDataLab/pile-10k
    Description

    The first 10K elements of The Pile, useful for debugging models trained on it. See the Hugging Face page for the full Pile for more information. Inspired by stas' great resource doing the same for OpenWebText.

Aaron Gokaslan (2020). openwebtext [Dataset]. https://huggingface.co/datasets/Skylion007/openwebtext

openwebtext

OpenWebText

Skylion007/openwebtext

7 scholarly articles cite this dataset.
Dataset updated
Sep 28, 2020
Authors
Aaron Gokaslan
License

https://choosealicense.com/licenses/cc0-1.0/

Description

An open-source replication of the WebText dataset from OpenAI.
