License: other (https://choosealicense.com/licenses/other/)
Pile Uncopyrighted
In response to authors demanding that LLMs stop using their works, here's a copy of The Pile with all copyrighted content removed. Please consider using this dataset to train your future LLMs, to respect authors and abide by copyright law. Creating an uncopyrighted version of a larger dataset (i.e. RedPajama) is planned, with no ETA.
Methodology
Cleaning was performed by removing everything from the Books3, BookCorpus2, OpenSubtitles, YTSubtitles, and OWT2… See the full description on the dataset page: https://huggingface.co/datasets/monology/pile-uncopyrighted.
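The cleaning described above amounts to a metadata filter over the original Pile. A minimal sketch follows; the `pile_set_name` key and the exact subset strings are assumptions based on the Pile's usual `meta` layout, not confirmed by this card:

```python
# Hypothetical sketch of the cleaning step: drop every row whose Pile
# subset is one of the copyrighted sources named above. The subset
# names and the "pile_set_name" key are assumptions.
COPYRIGHTED_SUBSETS = {
    "Books3", "BookCorpus2", "OpenSubtitles", "YoutubeSubtitles", "OWT2",
}

def is_uncopyrighted(row: dict) -> bool:
    """Keep a row only if its source subset is not a removed one."""
    return row.get("meta", {}).get("pile_set_name") not in COPYRIGHTED_SUBSETS
```

A streaming `filter(is_uncopyrighted)` over the original Pile would then yield the cleaned copy without materializing the full corpus.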
dayu-ai/sae-monology-pile-uncopyrighted-tokenizer-gpt2 dataset hosted on Hugging Face and contributed by the HF Datasets community
apollo-research/monology-pile-uncopyrighted-tokenizer-gpt2 dataset hosted on Hugging Face and contributed by the HF Datasets community
Pre-tokenized dataset of the first 10 million lines of monology/pile-uncopyrighted without any concatenated lines, tokenized for Gemma-2 using SAELens. This dataset has a 1024 context size and about 2.5B tokens.
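The "without any concatenated lines" detail means each source row becomes at most one fixed-length context rather than being packed against its neighbours. A minimal sketch of that policy, with the handling of short rows as an assumption (the card does not say):

```python
def to_fixed_context(token_ids: list[int], ctx: int = 1024):
    """Turn one document's token ids into a single fixed-size context.
    Documents are never concatenated with each other; here, documents
    shorter than the context are dropped (an assumption -- they could
    equally be padded)."""
    if len(token_ids) < ctx:
        return None
    return token_ids[:ctx]
```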
won-bae/pile-uncopyrighted-index dataset hosted on Hugging Face and contributed by the HF Datasets community
chanind/pile-uncopyrighted-qwen-2-1024-abbrv-1B dataset hosted on Hugging Face and contributed by the HF Datasets community
kh4dien/pile-uncopyrighted-sample dataset hosted on Hugging Face and contributed by the HF Datasets community
apollo-research/monology-pile-uncopyrighted-tokenizer-EleutherAI-gpt-neox-20b dataset hosted on Hugging Face and contributed by the HF Datasets community
chanind/pile-uncopyrighted-llama-3_2-1024-abbrv-1B dataset hosted on Hugging Face and contributed by the HF Datasets community
chanind/pile-uncopyrighted-gemma-1024-abbrv dataset hosted on Hugging Face and contributed by the HF Datasets community
kzheng/pile-uncopyrighted-olmo-tokenized-2048-10p0 dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Description
This dataset is a pre-tokenized version of the Pile, encoded with the GPT-2 Byte Pair Encoding tokenizer. It corresponds to the uncopyrighted subset of the Pile and was tokenized using the tiktoken library. Each record is a fixed-length 1024-token segment, already padded and masked for direct use in decoder-only LMs.
Dataset Structure
Data Fields
Field   Type   Description
meta    dict   Metadata containing the source…
See the full description on the dataset page: https://huggingface.co/datasets/Geonwoohong/pile-uncopyrighted-6b-tokenized-gpt2.
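The "padded and masked" segments described above can be sketched as follows. Using GPT-2's <|endoftext|> id (50256) as the pad token is an assumption; the card does not name the pad id:

```python
def pad_and_mask(ids: list[int], ctx: int = 1024, pad_id: int = 50256):
    """Truncate or pad one tokenized segment to a fixed length and build
    the matching attention mask (1 = real token, 0 = padding).
    pad_id 50256 is GPT-2's <|endoftext|> token -- an assumption here."""
    ids = ids[:ctx]
    n_pad = ctx - len(ids)
    return ids + [pad_id] * n_pad, [1] * len(ids) + [0] * n_pad
```

With tiktoken, `ids` would come from `tiktoken.get_encoding("gpt2").encode(text)`.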
Dataset Creation Process
These subsets were created by streaming over the rows from monology/pile-uncopyrighted and filtering by the meta column. Each subset is generally limited to the first 100,000 qualifying rows encountered.
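The stream-and-filter process above can be sketched like this; the `pile_set_name` key inside the `meta` column is an assumption about its layout:

```python
from itertools import islice
from typing import Iterable, Iterator

def take_matching(rows: Iterable[dict], subset: str,
                  limit: int = 100_000) -> Iterator[dict]:
    """Stream rows, keep those whose meta names the wanted Pile subset,
    and stop after `limit` matches (one pass, constant memory)."""
    matching = (r for r in rows
                if r.get("meta", {}).get("pile_set_name") == subset)
    return islice(matching, limit)
```

The row iterator would come from `datasets.load_dataset("monology/pile-uncopyrighted", streaming=True)`, so the full corpus never has to be materialized.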
Citations
If you use this dataset, please cite the original Pile papers: @article{gao2020pile, title={The Pile: An 800GB dataset of diverse text for language modeling}, author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and… See the full description on the dataset page: https://huggingface.co/datasets/timaeus/pile-hackernews.
Second 100K-row partial version, taken from this dataset: https://huggingface.co/datasets/monology/pile-uncopyrighted
⚠️ Warning: This dataset will probably make you run out of memory if you try loading it. Don't do it.
Dataset Creation Process
These subsets were created by streaming over the rows from monology/pile-uncopyrighted and filtering by the meta column. Each subset is generally limited to the first 100,000 qualifying rows encountered.
Citations
If you use this dataset, please cite the original Pile papers: @article{gao2020pile, title={The Pile: An 800GB dataset of diverse… See the full description on the dataset page: https://huggingface.co/datasets/timaeus/pile-ubuntu_irc-broken.
Dataset Creation Process
These subsets were created by streaming over the rows from monology/pile-uncopyrighted and filtering by the meta column. Each subset is generally limited to the first 100,000 qualifying rows encountered.
Citations
If you use this dataset, please cite the original Pile papers: @article{gao2020pile, title={The Pile: An 800GB dataset of diverse text for language modeling}, author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and… See the full description on the dataset page: https://huggingface.co/datasets/timaeus/pile-dm_mathematics.
disi-unibo-nlp/PileUncopyrighted-NER-BIO dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for Pile-RS-Truncated
Dataset Summary
This dataset was created as part of my Master's thesis research on "Leveraging Model Checkpoints for Membership Inference Attacks on Large Language Models". The aim was to create a clean setup, free from distribution shifts, to study proposed checkpoint MIA methods and their performance using checkpoints from Pythia models. It was derived from Pile Uncopyrighted by applying reservoir sampling and data processing steps… See the full description on the dataset page: https://huggingface.co/datasets/ongsici/Pile-RS-Truncated.
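Reservoir sampling (Algorithm R) draws a uniform random sample of k rows from a stream of unknown length in a single pass with O(k) memory, which is why it suits a streamed corpus like this. A minimal sketch, independent of this dataset's exact processing steps:

```python
import random

def reservoir_sample(stream, k: int, seed: int = 0) -> list:
    """Algorithm R: keep the first k items, then for each later item i
    (0-indexed) replace a random reservoir slot with probability k/(i+1)."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir
```

Each item ends up in the final sample with equal probability k/n, without knowing n in advance.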