License: https://choosealicense.com/licenses/other/
Pile Uncopyrighted
In response to authors demanding that LLMs stop using their works, here's a copy of The Pile with all copyrighted content removed. Please consider using this dataset to train your future LLMs, to respect authors and abide by copyright law. Creating an uncopyrighted version of a larger dataset (e.g. RedPajama) is planned, with no ETA.
Methodology
Cleaning was performed by removing everything from the Books3, BookCorpus2, OpenSubtitles, YTSubtitles, and OWT2… See the full description on the dataset page: https://huggingface.co/datasets/RichardErkhov/pile-uncopyrighted.
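The cleaning step described above amounts to dropping every record that belongs to one of the named subsets. A minimal sketch follows, assuming each record is a dict with its subset label stored under `meta["pile_set_name"]`, which is how The Pile's JSON Lines files are commonly laid out. The helper names `keep_record` and `filter_records` are hypothetical, the label strings are copied from the description as written and may not match the exact values stored in the data, and the subset list is truncated in the source, so the set below is knowingly incomplete.

```python
# Subset labels exactly as they appear in the description above.
# The source list ends in an ellipsis, so this set is incomplete by design,
# and the labels stored in the real data may be spelled differently.
EXCLUDED_SUBSETS = {"Books3", "BookCorpus2", "OpenSubtitles", "YTSubtitles", "OWT2"}


def keep_record(record):
    """Return True if the record does not belong to an excluded subset."""
    # Assumed layout: {"text": "...", "meta": {"pile_set_name": "..."}}
    subset = record.get("meta", {}).get("pile_set_name")
    return subset not in EXCLUDED_SUBSETS


def filter_records(records):
    """Keep only the records whose subset label is not excluded."""
    return [record for record in records if keep_record(record)]


if __name__ == "__main__":
    sample = [
        {"text": "a novel", "meta": {"pile_set_name": "Books3"}},
        {"text": "an abstract", "meta": {"pile_set_name": "PubMed Abstracts"}},
    ]
    for record in filter_records(sample):
        print(record["meta"]["pile_set_name"])
```

Records with no subset label pass through this filter unchanged; a stricter pipeline might prefer to drop or log them instead.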
License: https://choosealicense.com/licenses/other/
Pile Uncopyrighted
In response to authors demanding that LLMs stop using their works, here's a copy of The Pile with all copyrighted content removed. Please consider using this dataset to train your future LLMs, to respect authors and abide by copyright law. Creating an uncopyrighted version of a larger dataset (e.g. RedPajama) is planned, with no ETA.
Methodology
Cleaning was performed by removing everything from the Books3, BookCorpus2, OpenSubtitles, YTSubtitles, and OWT2… See the full description on the dataset page: https://huggingface.co/datasets/yanjx21/PubMed.
License: https://choosealicense.com/licenses/other/
Pile Uncopyrighted
In response to authors demanding that LLMs stop using their works, here's a copy of The Pile with all copyrighted content removed. Please consider using this dataset to train your future LLMs, to respect authors and abide by copyright law. Creating an uncopyrighted version of a larger dataset (e.g. RedPajama) is planned, with no ETA.
Methodology
Cleaning was performed by removing everything from the Books3, BookCorpus2, OpenSubtitles, YTSubtitles, and OWT2… See the full description on the dataset page: https://huggingface.co/datasets/monology/pile-uncopyrighted.