BabyLM 100M
This curated dataset originates from the BabyLM Challenge. It contains roughly 100M words of mixed-domain text drawn from the following sources:
CHILDES (child-directed speech); Subtitles (speech); BNC (speech); TED talks (speech); children's books (simple written language)
BabyLM 10M
This curated dataset originates from the BabyLM Challenge. It contains roughly 10M words of mixed-domain text drawn from the following sources:
CHILDES (child-directed speech); Subtitles (speech); BNC (speech); TED talks (speech); children's books (simple written language)