3 datasets found
  1. pile-uncopyrighted

    • huggingface.co
    Updated Aug 27, 2024
    Cite
    Richard Erkhov (2024). pile-uncopyrighted [Dataset]. https://huggingface.co/datasets/RichardErkhov/pile-uncopyrighted
    Dataset updated: Aug 27, 2024
    Authors: Richard Erkhov
    License: https://choosealicense.com/licenses/other/

    Description

    Pile Uncopyrighted

    In response to authors demanding that LLMs stop using their works, here's a copy of The Pile with all copyrighted content removed. Please consider using this dataset to train your future LLMs, to respect authors and abide by copyright law. Creating an uncopyrighted version of a larger dataset (i.e. RedPajama) is planned, with no ETA.

    Methodology

    Cleaning was performed by removing everything from the Books3, BookCorpus2, OpenSubtitles, YTSubtitles, and OWT2… See the full description on the dataset page: https://huggingface.co/datasets/RichardErkhov/pile-uncopyrighted.

  2. PubMed

    • huggingface.co
    Updated Apr 3, 2025
    Cite
    yanjx (2025). PubMed [Dataset]. https://huggingface.co/datasets/yanjx21/PubMed
    Dataset updated: Apr 3, 2025
    Authors: yanjx
    License: https://choosealicense.com/licenses/other/

    Description

    Pile Uncopyrighted

    In response to authors demanding that LLMs stop using their works, here's a copy of The Pile with all copyrighted content removed. Please consider using this dataset to train your future LLMs, to respect authors and abide by copyright law. Creating an uncopyrighted version of a larger dataset (i.e. RedPajama) is planned, with no ETA.

    Methodology

    Cleaning was performed by removing everything from the Books3, BookCorpus2, OpenSubtitles, YTSubtitles, and OWT2… See the full description on the dataset page: https://huggingface.co/datasets/yanjx21/PubMed.

  3. pile-uncopyrighted

    • huggingface.co
    Updated Aug 30, 2023
    Cite
    monology (2023). pile-uncopyrighted [Dataset]. https://huggingface.co/datasets/monology/pile-uncopyrighted
    Dataset updated: Aug 30, 2023
    Authors: monology
    License: https://choosealicense.com/licenses/other/

    Description

    Pile Uncopyrighted

    In response to authors demanding that LLMs stop using their works, here's a copy of The Pile with all copyrighted content removed. Please consider using this dataset to train your future LLMs, to respect authors and abide by copyright law. Creating an uncopyrighted version of a larger dataset (i.e. RedPajama) is planned, with no ETA.

    Methodology

    Cleaning was performed by removing everything from the Books3, BookCorpus2, OpenSubtitles, YTSubtitles, and OWT2… See the full description on the dataset page: https://huggingface.co/datasets/monology/pile-uncopyrighted.
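
The cleaning step described in the listings (dropping every record that belongs to certain source subsets) might be sketched roughly as below. This is an illustration only, not the dataset authors' actual script. It assumes each record is a dict carrying its subset label under `meta["pile_set_name"]`, and it reuses the subset names exactly as the listing abbreviates them. The listing is truncated, so the set may be incomplete, and the label strings in the real data may differ. `drop_excluded` and `EXCLUDED_SUBSETS` are hypothetical names.

```python
from typing import Iterable, Iterator

# Subset names copied verbatim from the listing. ASSUMPTION: the listing is
# truncated, so this set may be incomplete, and the actual label strings in
# the data may be spelled differently.
EXCLUDED_SUBSETS = frozenset(
    {"Books3", "BookCorpus2", "OpenSubtitles", "YTSubtitles", "OWT2"}
)


def drop_excluded(
    records: Iterable[dict],
    excluded: frozenset = EXCLUDED_SUBSETS,
    key: str = "pile_set_name",  # ASSUMPTION: metadata field holding the subset label
) -> Iterator[dict]:
    """Yield only the records whose subset label is not in `excluded`."""
    for record in records:
        subset = record.get("meta", {}).get(key)
        # Records without a label are kept; a stricter filter could drop them.
        if subset not in excluded:
            yield record


# Tiny in-memory sample standing in for the real data stream.
sample = [
    {"text": "a", "meta": {"pile_set_name": "PubMed Abstracts"}},
    {"text": "b", "meta": {"pile_set_name": "Books3"}},
    {"text": "c", "meta": {}},
]
kept = list(drop_excluded(sample))
```

Because the filter is a generator, it can be applied to a streamed dataset without loading everything into memory; the excluded set is a parameter so the labels can be adjusted once the real ones are confirmed on the dataset page.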



pile-uncopyrighted

RichardErkhov/pile-uncopyrighted

45 scholarly articles cite this dataset (View in Google Scholar)
