5 datasets found
  1. pile_v2

    • huggingface.co
    Updated Jul 21, 2023
    Rob Myers (2023). pile_v2 [Dataset]. https://huggingface.co/datasets/robertmyers/pile_v2
    Authors
    Rob Myers
    License

    https://choosealicense.com/licenses/other/

    Description

    The Pile is an 825 GiB diverse, open source language modelling data set that consists of 22 smaller, high-quality datasets combined together.

  2. the_pile_github

    • huggingface.co
    Updated Mar 16, 2023
    André Storhaug (2023). the_pile_github [Dataset]. https://huggingface.co/datasets/andstor/the_pile_github
    Authors
    André Storhaug
    License

    https://choosealicense.com/licenses/other/

    Description

    The Pile is an 825 GiB diverse, open source language modelling data set that consists of 22 smaller, high-quality datasets combined together.

  3. pile

    • huggingface.co
    Updated Jun 3, 2023
    Evaluation datasets (2023). pile [Dataset]. https://huggingface.co/datasets/lighteval/pile
    Dataset authored and provided by
    Evaluation datasets
    Description

    The Pile is an 825 GiB diverse, open source language modeling data set that consists of 22 smaller, high-quality datasets combined together. To score well on Pile BPB (bits per byte), a model must be able to understand many disparate domains including books, GitHub repositories, webpages, chat logs, and medical, physics, math, computer science, and philosophy papers.

  4. Pile-subset

    • huggingface.co
    Updated Aug 1, 2023
    YukangChen (2023). Pile-subset [Dataset]. https://huggingface.co/datasets/Yukang/Pile-subset
    Authors
    YukangChen
    Description

    The Pile is an 825 GiB diverse, open source language modelling data set that consists of 22 smaller, high-quality datasets combined together.

  5. pile

    • huggingface.co
    Updated Jul 5, 2004
    EleutherAI (2004). pile [Dataset]. https://huggingface.co/datasets/EleutherAI/pile
    Dataset authored and provided by
    EleutherAI (https://eleuther.ai/)
    License

    https://choosealicense.com/licenses/other/

    Description

    The Pile is an 825 GiB diverse, open source language modelling data set that consists of 22 smaller, high-quality datasets combined together.


Rob Myers (2023). pile_v2 [Dataset]. https://huggingface.co/datasets/robertmyers/pile_v2

pile_v2

robertmyers/pile_v2

2 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Jul 21, 2023
Authors
Rob Myers
License

https://choosealicense.com/licenses/other/

Description

The Pile is an 825 GiB diverse, open source language modelling data set that consists of 22 smaller, high-quality datasets combined together.
