2 datasets found

babylm-100M
huggingface.co
Updated Feb 24, 2024
Cite
Niels Horn (2024). babylm-100M [Dataset]. https://huggingface.co/datasets/nilq/babylm-100M
Dataset updated
Feb 24, 2024
Authors
Niels Horn
Description
BabyLM 100M

This curated dataset originates from the BabyLM Challenge. It contains ~100M words of mixed-domain text, drawn from the following sources:

CHILDES (child-directed speech)
Subtitles (speech)
BNC (speech)
TED talks (speech)
Children's books (simple written language)
babylm-10M
huggingface.co
Updated Feb 24, 2024
Cite
Niels Horn (2024). babylm-10M [Dataset]. https://huggingface.co/datasets/nilq/babylm-10M
Dataset updated
Feb 24, 2024
Authors
Niels Horn
Description
BabyLM 10M

This curated dataset originates from the BabyLM Challenge. It contains ~10M words of mixed-domain text, drawn from the following sources:

CHILDES (child-directed speech)
Subtitles (speech)
BNC (speech)
TED talks (speech)
Children's books (simple written language)