5 datasets found

pile_v2
huggingface.co
Updated Jul 21, 2023
Cite
Rob Myers (2023). pile_v2 [Dataset]. https://huggingface.co/datasets/robertmyers/pile_v2
Dataset updated
Jul 21, 2023
Authors
Rob Myers
License
https://choosealicense.com/licenses/other/
Description
The Pile is an 825 GiB diverse, open-source language modelling data set that consists of 22 smaller, high-quality datasets combined together.
the_pile_github
huggingface.co
Updated Mar 16, 2023
Cite
André Storhaug (2023). the_pile_github [Dataset]. https://huggingface.co/datasets/andstor/the_pile_github
Dataset updated
Mar 16, 2023
Authors
André Storhaug
License
https://choosealicense.com/licenses/other/
Description
The Pile is an 825 GiB diverse, open-source language modelling data set that consists of 22 smaller, high-quality datasets combined together.
pile
huggingface.co
Updated Jun 3, 2023
Cite
Evaluation datasets (2023). pile [Dataset]. https://huggingface.co/datasets/lighteval/pile
Dataset updated
Jun 3, 2023
Dataset authored and provided by
Evaluation datasets
Description
The Pile is an 825 GiB diverse, open-source language modeling data set that consists of 22 smaller, high-quality datasets combined together. To score well on Pile BPB (bits per byte), a model must be able to understand many disparate domains, including books, GitHub repositories, webpages, chat logs, and medical, physics, math, computer science, and philosophy papers.
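For readers unfamiliar with the BPB metric mentioned above, the following is a minimal sketch of how bits per byte can be computed. It assumes a model reports a summed negative log-likelihood in nats for a piece of text. The function name `bits_per_byte` is hypothetical, and this is not necessarily how this dataset's evaluation harness computes its scores.

```python
import math


def bits_per_byte(total_nll_nats: float, text: str) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per UTF-8 byte.

    Assumption: total_nll_nats is the model's total negative log-likelihood
    over `text`, measured in nats (natural-log units).
    """
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must contain at least one byte")
    # Dividing by ln(2) converts nats to bits; dividing by the byte count
    # normalizes by text length independently of the tokenizer.
    return total_nll_nats / (math.log(2) * n_bytes)


# Illustrative call with a made-up loss value, not a real model score.
example_bpb = bits_per_byte(8 * math.log(2), "abcd")
```

Normalizing by bytes rather than tokens makes scores comparable across models that use different tokenizers, which is presumably why a byte-level metric suits a multi-domain corpus like this one.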
Pile-subset
huggingface.co
Updated Aug 1, 2023
Cite
YukangChen (2023). Pile-subset [Dataset]. https://huggingface.co/datasets/Yukang/Pile-subset
Dataset updated
Aug 1, 2023
Authors
YukangChen
Description
The Pile is an 825 GiB diverse, open-source language modelling data set that consists of 22 smaller, high-quality datasets combined together.
pile
huggingface.co
Updated Jul 5, 2004
Cite
EleutherAI (2004). pile [Dataset]. https://huggingface.co/datasets/EleutherAI/pile
Dataset updated
Jul 5, 2004
Dataset authored and provided by
EleutherAI (https://eleuther.ai/)
License
https://choosealicense.com/licenses/other/
Description
The Pile is an 825 GiB diverse, open-source language modelling data set that consists of 22 smaller, high-quality datasets combined together.