24 datasets found
  1.

    pile-uncopyrighted

    • huggingface.co
    Updated Aug 30, 2023
    + more versions
    Cite
    monology (2023). pile-uncopyrighted [Dataset]. https://huggingface.co/datasets/monology/pile-uncopyrighted
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 30, 2023
    Authors
    monology
    License

    https://choosealicense.com/licenses/other/

    Description

    Pile Uncopyrighted

    In response to authors demanding that LLMs stop using their works, here's a copy of The Pile with all copyrighted content removed. Please consider using this dataset to train your future LLMs, to respect authors and abide by copyright law. Creating an uncopyrighted version of a larger dataset (i.e., RedPajama) is planned, with no ETA.

    Methodology

    Cleaning was performed by removing everything from the Books3, BookCorpus2, OpenSubtitles, YTSubtitles, and OWT2… See the full description on the dataset page: https://huggingface.co/datasets/monology/pile-uncopyrighted.

  2.

    sae-monology-pile-uncopyrighted-tokenizer-gpt2

    • huggingface.co
    Updated Sep 13, 2024
    + more versions
    Cite
    zhangjunyu (2024). sae-monology-pile-uncopyrighted-tokenizer-gpt2 [Dataset]. https://huggingface.co/datasets/dayu-ai/sae-monology-pile-uncopyrighted-tokenizer-gpt2
    Explore at:
    Dataset updated
    Sep 13, 2024
    Authors
    zhangjunyu
    Description

    dayu-ai/sae-monology-pile-uncopyrighted-tokenizer-gpt2 dataset hosted on Hugging Face and contributed by the HF Datasets community

  3.

    monology-pile-uncopyrighted-tokenizer-gpt2

    • huggingface.co
    Updated Mar 7, 2024
    Cite
    Apollo Research (2024). monology-pile-uncopyrighted-tokenizer-gpt2 [Dataset]. https://huggingface.co/datasets/apollo-research/monology-pile-uncopyrighted-tokenizer-gpt2
    Explore at:
    Croissant
    Dataset updated
    Mar 7, 2024
    Dataset authored and provided by
    Apollo Research
    Description

    apollo-research/monology-pile-uncopyrighted-tokenizer-gpt2 dataset hosted on Hugging Face and contributed by the HF Datasets community

  4.

    pile-uncopyrighted-gemma-1024-abbrv-2B

    • huggingface.co
    Updated Sep 12, 2025
    + more versions
    Cite
    David Chanin (2025). pile-uncopyrighted-gemma-1024-abbrv-2B [Dataset]. https://huggingface.co/datasets/chanind/pile-uncopyrighted-gemma-1024-abbrv-2B
    Explore at:
    Dataset updated
    Sep 12, 2025
    Authors
    David Chanin
    Description

    Pre-tokenized dataset of the first 10 million lines of monology/pile-uncopyrighted without any concatenated lines, tokenized for Gemma-2 using SAELens. This dataset has a context size of 1024 and about 2.5B tokens.

  5.

    pile-uncopyrighted-index

    • huggingface.co
    Updated Jul 30, 2025
    Cite
    Won Bae (2025). pile-uncopyrighted-index [Dataset]. https://huggingface.co/datasets/won-bae/pile-uncopyrighted-index
    Explore at:
    Dataset updated
    Jul 30, 2025
    Authors
    Won Bae
    Description

    won-bae/pile-uncopyrighted-index dataset hosted on Hugging Face and contributed by the HF Datasets community

  6.

    pile-uncopyrighted-qwen-2-1024-abbrv-1B

    • huggingface.co
    Updated May 11, 2025
    Cite
    David Chanin (2025). pile-uncopyrighted-qwen-2-1024-abbrv-1B [Dataset]. https://huggingface.co/datasets/chanind/pile-uncopyrighted-qwen-2-1024-abbrv-1B
    Explore at:
    Dataset updated
    May 11, 2025
    Authors
    David Chanin
    Description

    chanind/pile-uncopyrighted-qwen-2-1024-abbrv-1B dataset hosted on Hugging Face and contributed by the HF Datasets community

  7.

    pile-uncopyrighted-sample

    • huggingface.co
    Updated Mar 23, 2025
    Cite
    Caden Juang (2025). pile-uncopyrighted-sample [Dataset]. https://huggingface.co/datasets/kh4dien/pile-uncopyrighted-sample
    Explore at:
    Dataset updated
    Mar 23, 2025
    Authors
    Caden Juang
    Description

    kh4dien/pile-uncopyrighted-sample dataset hosted on Hugging Face and contributed by the HF Datasets community

  8.

    monology-pile-uncopyrighted-tokenizer-EleutherAI-gpt-neox-20b

    • huggingface.co
    Updated Sep 10, 2025
    Cite
    Apollo Research (2025). monology-pile-uncopyrighted-tokenizer-EleutherAI-gpt-neox-20b [Dataset]. https://huggingface.co/datasets/apollo-research/monology-pile-uncopyrighted-tokenizer-EleutherAI-gpt-neox-20b
    Explore at:
    Croissant
    Dataset updated
    Sep 10, 2025
    Dataset authored and provided by
    Apollo Research
    Description

    apollo-research/monology-pile-uncopyrighted-tokenizer-EleutherAI-gpt-neox-20b dataset hosted on Hugging Face and contributed by the HF Datasets community

  9.

    pile-uncopyrighted-llama-3_2-1024-abbrv-1B

    • huggingface.co
    Updated May 13, 2025
    Cite
    David Chanin (2025). pile-uncopyrighted-llama-3_2-1024-abbrv-1B [Dataset]. https://huggingface.co/datasets/chanind/pile-uncopyrighted-llama-3_2-1024-abbrv-1B
    Explore at:
    Dataset updated
    May 13, 2025
    Authors
    David Chanin
    Description

    chanind/pile-uncopyrighted-llama-3_2-1024-abbrv-1B dataset hosted on Hugging Face and contributed by the HF Datasets community

  10.

    pile-uncopyrighted-gemma-1024-abbrv

    • huggingface.co
    Updated Aug 23, 2024
    Cite
    David Chanin (2024). pile-uncopyrighted-gemma-1024-abbrv [Dataset]. https://huggingface.co/datasets/chanind/pile-uncopyrighted-gemma-1024-abbrv
    Explore at:
    Croissant
    Dataset updated
    Aug 23, 2024
    Authors
    David Chanin
    Description

    chanind/pile-uncopyrighted-gemma-1024-abbrv dataset hosted on Hugging Face and contributed by the HF Datasets community

  11.

    pile-uncopyrighted-olmo-tokenized-2048-10p0

    • huggingface.co
    Updated Sep 27, 2025
    Cite
    Kaden Zheng (2025). pile-uncopyrighted-olmo-tokenized-2048-10p0 [Dataset]. https://huggingface.co/datasets/kzheng/pile-uncopyrighted-olmo-tokenized-2048-10p0
    Explore at:
    Dataset updated
    Sep 27, 2025
    Authors
    Kaden Zheng
    Description

    kzheng/pile-uncopyrighted-olmo-tokenized-2048-10p0 dataset hosted on Hugging Face and contributed by the HF Datasets community

  12.

    pile-uncopyrighted-6b-tokenized-gpt2

    • huggingface.co
    Updated Oct 7, 2025
    Cite
    Geonwoo Hong (2025). pile-uncopyrighted-6b-tokenized-gpt2 [Dataset]. https://huggingface.co/datasets/Geonwoohong/pile-uncopyrighted-6b-tokenized-gpt2
    Explore at:
    Dataset updated
    Oct 7, 2025
    Authors
    Geonwoo Hong
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Description

    This dataset is a pre-tokenized version of the Pile, encoded with the GPT-2 Byte Pair Encoding tokenizer. This corresponds to the uncopyrighted subset of the Pile and was tokenized using the tiktoken library. Each record corresponds to a fixed-length 1024-token segment, already padded and masked for direct use in decoder-only LMs.

    Dataset Structure

    Data Fields

    Field  Type  Description
    meta   dict  Metadata containing the source…

    See the full description on the dataset page: https://huggingface.co/datasets/Geonwoohong/pile-uncopyrighted-6b-tokenized-gpt2.
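The "padded and masked" fixed-length records described above can be illustrated with a minimal sketch. This is not the dataset author's code: the function name `pad_and_mask`, the output keys `input_ids` and `attention_mask`, and the padding id `0` are all assumptions chosen for illustration, and the listing does not show the dataset's actual field names beyond `meta`.

```python
from typing import Dict, List


def pad_and_mask(token_ids: List[int], length: int = 1024, pad_id: int = 0) -> Dict[str, List[int]]:
    """Truncate or right-pad a token sequence to a fixed length.

    The key names and pad_id are hypothetical; the real dataset may differ.
    """
    kept = token_ids[:length]          # drop anything past the segment length
    n_pad = length - len(kept)         # number of padding positions needed
    return {
        "input_ids": kept + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding to be ignored by the model
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


# Small example with a segment length of 5 instead of 1024.
record = pad_and_mask([5, 6, 7], length=5)
```

With a mask like this, a decoder-only model can skip the padded positions when computing attention or loss.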

  13.

    pile-hackernews

    • huggingface.co
    Updated Mar 30, 2025
    + more versions
    Cite
    Timaeus (2025). pile-hackernews [Dataset]. https://huggingface.co/datasets/timaeus/pile-hackernews
    Explore at:
    Croissant
    Dataset updated
    Mar 30, 2025
    Dataset authored and provided by
    Timaeus
    Description

    Dataset Creation Process

    These subsets were created by streaming over the rows from monology/pile-uncopyrighted and filtering by the meta column. Each subset is generally limited to the first 100,000 qualifying rows encountered.

    Citations

    If you use this dataset, please cite the original Pile papers: @article{gao2020pile, title={The Pile: An 800GB dataset of diverse text for language modeling}, author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and… See the full description on the dataset page: https://huggingface.co/datasets/timaeus/pile-hackernews.

  14.

    pile-cp-free-100k-part-2

    • huggingface.co
    Updated Sep 30, 2024
    + more versions
    Cite
    Finn Strom (2024). pile-cp-free-100k-part-2 [Dataset]. https://huggingface.co/datasets/finnstrom3693/pile-cp-free-100k-part-2
    Explore at:
    Croissant
    Dataset updated
    Sep 30, 2024
    Authors
    Finn Strom
    Description

    Second 100K partial version of this dataset: https://huggingface.co/datasets/monology/pile-uncopyrighted

  15.

    pile-github

    • huggingface.co
    Updated Mar 30, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Timaeus (2025). pile-github [Dataset]. https://huggingface.co/datasets/timaeus/pile-github
    Explore at:
    Croissant
    Dataset updated
    Mar 30, 2025
    Dataset authored and provided by
    Timaeus
    Description

    Dataset Creation Process

    These subsets were created by streaming over the rows from monology/pile-uncopyrighted and filtering by the meta column. Each subset is generally limited to the first 100,000 qualifying rows encountered.
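The creation process described for these subsets (stream rows, filter on the meta column, stop after the first 100,000 matches) might look roughly like the sketch below. It is an illustration under assumptions, not the Timaeus pipeline: the helper name `first_matching_rows` is hypothetical, and the meta key `pile_set_name` and the subset label `"Github"` are assumed rather than confirmed by this listing.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator

Row = Dict[str, Any]


def first_matching_rows(
    rows: Iterable[Row],
    subset_name: str,
    limit: int = 100_000,
    meta_key: str = "pile_set_name",  # assumed key inside the meta column
) -> Iterator[Row]:
    """Lazily yield the first `limit` rows whose meta entry matches subset_name."""
    matches = (row for row in rows if row.get("meta", {}).get(meta_key) == subset_name)
    return islice(matches, limit)  # stops reading the stream once the limit is hit


# Tiny in-memory stand-in for a streamed dataset.
sample_rows = [
    {"text": "a", "meta": {"pile_set_name": "HackerNews"}},
    {"text": "b", "meta": {"pile_set_name": "Github"}},
    {"text": "c", "meta": {"pile_set_name": "Github"}},
    {"text": "d", "meta": {"pile_set_name": "Github"}},
]
github_rows = list(first_matching_rows(sample_rows, "Github", limit=2))
```

Because both the generator and `islice` are lazy, the same function should work on any row iterable, for example one obtained from the Hugging Face `datasets` library with `load_dataset("monology/pile-uncopyrighted", split="train", streaming=True)`, without loading the full corpus into memory.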

  16.

    pile-pubmed_central

    • huggingface.co
    Updated Mar 30, 2025
    + more versions
    Cite
    Timaeus (2025). pile-pubmed_central [Dataset]. https://huggingface.co/datasets/timaeus/pile-pubmed_central
    Explore at:
    Croissant
    Dataset updated
    Mar 30, 2025
    Dataset authored and provided by
    Timaeus
    Description

    Dataset Creation Process

    These subsets were created by streaming over the rows from monology/pile-uncopyrighted and filtering by the meta column. Each subset is generally limited to the first 100,000 qualifying rows encountered.

  17.

    pile-ubuntu_irc-broken

    • huggingface.co
    Updated Mar 30, 2025
    Cite
    Timaeus (2025). pile-ubuntu_irc-broken [Dataset]. https://huggingface.co/datasets/timaeus/pile-ubuntu_irc-broken
    Explore at:
    Dataset updated
    Mar 30, 2025
    Dataset authored and provided by
    Timaeus
    Description

    โš ๏ธ Warning: This dataset will probably make you run out of memory if you try loading it. Don't do it.

      Dataset Creation Process
    

    These subsets were created by streaming over the rows from monology/pile-uncopyrighted and filtering by the meta column. Each subset is generally limited to the first 100,000 qualifying rows encountered.

      Citations
    

    If you use this dataset, please cite the original Pile papers: @article{gao2020pile, title={The Pile: An 800GB dataset of diverseโ€ฆ See the full description on the dataset page: https://huggingface.co/datasets/timaeus/pile-ubuntu_irc-broken.

  18.

    pile-dm_mathematics

    • huggingface.co
    Updated Mar 30, 2025
    Cite
    Timaeus (2025). pile-dm_mathematics [Dataset]. https://huggingface.co/datasets/timaeus/pile-dm_mathematics
    Explore at:
    Croissant
    Dataset updated
    Mar 30, 2025
    Dataset authored and provided by
    Timaeus
    Description

    Dataset Creation Process

    These subsets were created by streaming over the rows from monology/pile-uncopyrighted and filtering by the meta column. Each subset is generally limited to the first 100,000 qualifying rows encountered.

    Citations

    If you use this dataset, please cite the original Pile papers: @article{gao2020pile, title={The Pile: An 800GB dataset of diverse text for language modeling}, author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and… See the full description on the dataset page: https://huggingface.co/datasets/timaeus/pile-dm_mathematics.

  19.

    PileUncopyrighted-NER-BIO

    • huggingface.co
    Updated Aug 7, 2025
    Cite
    DISI UniBo NLP (2025). PileUncopyrighted-NER-BIO [Dataset]. https://huggingface.co/datasets/disi-unibo-nlp/PileUncopyrighted-NER-BIO
    Explore at:
    Dataset updated
    Aug 7, 2025
    Dataset authored and provided by
    DISI UniBo NLP
    Description

    disi-unibo-nlp/PileUncopyrighted-NER-BIO dataset hosted on Hugging Face and contributed by the HF Datasets community

  20.

    Pile-RS-Truncated

    • huggingface.co
    + more versions
    Cite
    Si Ci Ong, Pile-RS-Truncated [Dataset]. https://huggingface.co/datasets/ongsici/Pile-RS-Truncated
    Explore at:
    Authors
    Si Ci Ong
    Description

    Dataset Card for Pile-RS-Truncated

    Dataset Summary

    This dataset was created as part of my Master's thesis research on "Leveraging Model Checkpoints for Membership Inference Attacks on Large Language Models". The aim was to create a clean setup, free from distribution shifts, to study proposed checkpoint MIA methods and their performance using checkpoints from Pythia models. It was derived from Pile Uncopyrighted by applying reservoir sampling and data processing steps.… See the full description on the dataset page: https://huggingface.co/datasets/ongsici/Pile-RS-Truncated.
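Reservoir sampling, mentioned in the summary above, draws a uniform random sample of fixed size from a stream whose length is not known in advance. The sketch below shows the classic single-pass variant (often called Algorithm R); it is a generic illustration, not the thesis code, and the function name `reservoir_sample` and the `seed` parameter are hypothetical.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return k items chosen uniformly at random from a stream, in one pass."""
    rng = random.Random(seed)  # local generator so the sample is reproducible
    reservoir: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1),
            # replacing a uniformly chosen existing entry.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1_000), k=10, seed=42)
```

Only `k` items are held in memory at any time, which is why the technique suits corpora too large to load at once.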

Cite
monology (2023). pile-uncopyrighted [Dataset]. https://huggingface.co/datasets/monology/pile-uncopyrighted

pile-uncopyrighted

monology/pile-uncopyrighted

Explore at:
45 scholarly articles cite this dataset (View in Google Scholar)
Croissant
Dataset updated
Aug 30, 2023
Authors
monology
License

https://choosealicense.com/licenses/other/

Description

Pile Uncopyrighted

In response to authors demanding that LLMs stop using their works, here's a copy of The Pile with all copyrighted content removed. Please consider using this dataset to train your future LLMs, to respect authors and abide by copyright law. Creating an uncopyrighted version of a larger dataset (i.e., RedPajama) is planned, with no ETA.

Methodology

Cleaning was performed by removing everything from the Books3, BookCorpus2, OpenSubtitles, YTSubtitles, and OWT2… See the full description on the dataset page: https://huggingface.co/datasets/monology/pile-uncopyrighted.
