License: other (https://choosealicense.com/licenses/other/)
Pile Uncopyrighted
In response to authors demanding that LLMs stop using their works, here's a copy of The Pile with all copyrighted content removed. Please consider using this dataset to train your future LLMs, to respect authors and abide by copyright law. Creating an uncopyrighted version of a larger dataset (i.e. RedPajama) is planned, with no ETA.
Methodology
Cleaning was performed by removing everything from the Books3, BookCorpus2, OpenSubtitles, YTSubtitles, and OWT2… See the full description on the dataset page: https://huggingface.co/datasets/monology/pile-uncopyrighted.
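The cleaning described above amounts to a metadata filter over the original Pile. A minimal sketch follows; the `pile_set_name` key and the exact subset strings are assumptions based on the Pile's usual `meta` layout, not confirmed by this card:

```python
# Hypothetical sketch of the cleaning step: drop every row whose Pile
# subset is one of the copyrighted sources named above. The subset
# names and the "pile_set_name" key are assumptions.
COPYRIGHTED_SUBSETS = {
    "Books3", "BookCorpus2", "OpenSubtitles", "YoutubeSubtitles", "OWT2",
}

def is_uncopyrighted(row: dict) -> bool:
    """Keep a row only if its source subset is not a removed one."""
    return row.get("meta", {}).get("pile_set_name") not in COPYRIGHTED_SUBSETS
```

A streaming `filter(is_uncopyrighted)` over the original Pile would then yield the cleaned copy without materializing the full corpus.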
dayu-ai/sae-monology-pile-uncopyrighted-tokenizer-gpt2 dataset hosted on Hugging Face and contributed by the HF Datasets community
apollo-research/monology-pile-uncopyrighted-tokenizer-gpt2 dataset hosted on Hugging Face and contributed by the HF Datasets community
Pre-tokenized dataset of the first 10 million lines of monology/pile-uncopyrighted without any concatenated lines, tokenized for Gemma-2 using SAELens. This dataset has a 1024 context size and about 2.5B tokens.
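The "without any concatenated lines" detail means each source row becomes at most one fixed-length context rather than being packed against its neighbours. A minimal sketch of that policy, with the handling of short rows as an assumption (the card does not say):

```python
def to_fixed_context(token_ids: list[int], ctx: int = 1024):
    """Turn one document's token ids into a single fixed-size context.
    Documents are never concatenated with each other; here, documents
    shorter than the context are dropped (an assumption -- they could
    equally be padded)."""
    if len(token_ids) < ctx:
        return None
    return token_ids[:ctx]
```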
won-bae/pile-uncopyrighted-index dataset hosted on Hugging Face and contributed by the HF Datasets community
chanind/pile-uncopyrighted-qwen-2-1024-abbrv-1B dataset hosted on Hugging Face and contributed by the HF Datasets community
kh4dien/pile-uncopyrighted-sample dataset hosted on Hugging Face and contributed by the HF Datasets community
apollo-research/monology-pile-uncopyrighted-tokenizer-EleutherAI-gpt-neox-20b dataset hosted on Hugging Face and contributed by the HF Datasets community
chanind/pile-uncopyrighted-llama-3_2-1024-abbrv-1B dataset hosted on Hugging Face and contributed by the HF Datasets community
chanind/pile-uncopyrighted-gemma-1024-abbrv dataset hosted on Hugging Face and contributed by the HF Datasets community
kzheng/pile-uncopyrighted-olmo-tokenized-2048-10p0 dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Description
This dataset is a pre-tokenized version of the Pile, encoded with the GPT-2 Byte Pair Encoding tokenizer. It corresponds to the uncopyrighted subset of the Pile and was tokenized using the tiktoken library. Each record is a fixed-length 1024-token segment, already padded and masked for direct use in decoder-only LMs.
Dataset Structure
Data Fields
Field   Type   Description
meta    dict   Metadata containing the source…
See the full description on the dataset page: https://huggingface.co/datasets/Geonwoohong/pile-uncopyrighted-6b-tokenized-gpt2.
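The "padded and masked" segments described above can be sketched as follows. Using GPT-2's <|endoftext|> id (50256) as the pad token is an assumption; the card does not name the pad id:

```python
def pad_and_mask(ids: list[int], ctx: int = 1024, pad_id: int = 50256):
    """Truncate or pad one tokenized segment to a fixed length and build
    the matching attention mask (1 = real token, 0 = padding).
    pad_id 50256 is GPT-2's <|endoftext|> token -- an assumption here."""
    ids = ids[:ctx]
    n_pad = ctx - len(ids)
    return ids + [pad_id] * n_pad, [1] * len(ids) + [0] * n_pad
```

With tiktoken, `ids` would come from `tiktoken.get_encoding("gpt2").encode(text)`.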
Dataset Creation Process
These subsets were created by streaming over the rows from monology/pile-uncopyrighted and filtering by the meta column. Each subset is generally limited to the first 100,000 qualifying rows encountered.
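The stream-and-filter process above can be sketched like this; the `pile_set_name` key inside the `meta` column is an assumption about its layout:

```python
from itertools import islice
from typing import Iterable, Iterator

def take_matching(rows: Iterable[dict], subset: str,
                  limit: int = 100_000) -> Iterator[dict]:
    """Stream rows, keep those whose meta names the wanted Pile subset,
    and stop after `limit` matches (one pass, constant memory)."""
    matching = (r for r in rows
                if r.get("meta", {}).get("pile_set_name") == subset)
    return islice(matching, limit)
```

The row iterator would come from `datasets.load_dataset("monology/pile-uncopyrighted", streaming=True)`, so the full corpus never has to be materialized.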
Citations
If you use this dataset, please cite the original Pile papers: @article{gao2020pile, title={The Pile: An 800GB dataset of diverse text for language modeling}, author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and… See the full description on the dataset page: https://huggingface.co/datasets/timaeus/pile-hackernews.
Second 100K-row partial version, taken from this dataset: https://huggingface.co/datasets/monology/pile-uncopyrighted
⚠️ Warning: This dataset will probably make you run out of memory if you try loading it. Don't do it.
Dataset Creation Process
These subsets were created by streaming over the rows from monology/pile-uncopyrighted and filtering by the meta column. Each subset is generally limited to the first 100,000 qualifying rows encountered.
Citations
If you use this dataset, please cite the original Pile papers: @article{gao2020pile, title={The Pile: An 800GB dataset of diverse… See the full description on the dataset page: https://huggingface.co/datasets/timaeus/pile-ubuntu_irc-broken.
Dataset Creation Process
These subsets were created by streaming over the rows from monology/pile-uncopyrighted and filtering by the meta column. Each subset is generally limited to the first 100,000 qualifying rows encountered.
Citations
If you use this dataset, please cite the original Pile papers: @article{gao2020pile, title={The Pile: An 800GB dataset of diverse text for language modeling}, author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and… See the full description on the dataset page: https://huggingface.co/datasets/timaeus/pile-dm_mathematics.
disi-unibo-nlp/PileUncopyrighted-NER-BIO dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for Pile-RS-Truncated
Dataset Summary
This dataset was created as part of my Master's thesis research on "Leveraging Model Checkpoints for Membership Inference Attacks on Large Language Models". The aim was to create a clean setup, free from distribution shifts, to study proposed checkpoint MIA methods and their performance using checkpoints from Pythia models. It was derived from Pile Uncopyrighted by applying reservoir sampling and data processing steps… See the full description on the dataset page: https://huggingface.co/datasets/ongsici/Pile-RS-Truncated.
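Reservoir sampling (Algorithm R) draws a uniform random sample of k rows from a stream of unknown length in a single pass with O(k) memory, which is why it suits a streamed corpus like this. A minimal sketch, independent of this dataset's exact processing steps:

```python
import random

def reservoir_sample(stream, k: int, seed: int = 0) -> list:
    """Algorithm R: keep the first k items, then for each later item i
    (0-indexed) replace a random reservoir slot with probability k/(i+1)."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir
```

Each item ends up in the final sample with equal probability k/n, without knowing n in advance.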