75 datasets found

babylm-100M
huggingface.co
Updated Feb 24, 2024
Niels Horn (2024). babylm-100M [Dataset]. https://huggingface.co/datasets/nilq/babylm-100M
Dataset updated
Feb 24, 2024
Authors
Niels Horn
Description
BabyLM 100M

This curated dataset originates from the BabyLM Challenge. It contains ~100M words of mixed-domain text drawn from the following sources:

CHILDES (child-directed speech)
Subtitles (speech)
BNC (speech)
TED talks (speech)
children's books (simple written language)
babylm-nld
huggingface.co
Updated Oct 29, 2025
BabyLM Challenge (2025). babylm-nld [Dataset]. https://huggingface.co/datasets/BabyLM-community/babylm-nld
Dataset updated
Oct 29, 2025
Dataset authored and provided by
BabyLM Challenge
License
https://choosealicense.com/licenses/unknown/
Description
BabyLM Dataset

Dataset Description

This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm

Dataset Summary

Language: nld
Script: Latn
Tier: 100M
Byte Premium Factor: 1.051606
Size (MB): 569.49
Expected Size (MB): 571.02
Number of Documents: 304,611
Total Tokens: 109,885,564
Tokenizer: separate by whitespace

Tokens Per Category

child-books: 4,576,823 tokens
child-directed-speech: 3,304,756…

See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-nld.
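The Dataset Summary fields in these entries follow a fixed "Label: value" pattern, so they can be read into a dictionary for comparison across languages. The following is a minimal sketch, not part of the dataset tooling: the function names `parse_summary` and `to_number` are hypothetical, the label list is copied from the listing, and the bytes-per-token figure assumes that "MB" means 10^6 bytes, which the listing does not state.

```python
import re

# Field labels exactly as they appear in the listing's "Dataset Summary" block.
FIELDS = [
    "Language", "Script", "Tier", "Byte Premium Factor", "Size (MB)",
    "Expected Size (MB)", "Number of Documents", "Total Tokens", "Tokenizer",
]

# One alternation over all labels; the capture group makes re.split keep them.
_LABEL_RE = re.compile("(" + "|".join(re.escape(f) for f in FIELDS) + r"):\s*")


def parse_summary(text: str) -> dict:
    """Split a summary into {label: value}; works for one line or one field per line."""
    parts = _LABEL_RE.split(text)
    # parts is [prefix, label1, value1, label2, value2, ...]
    return {label: value.strip() for label, value in zip(parts[1::2], parts[2::2])}


def to_number(value: str) -> float:
    """Convert a value such as '109,885,564' or '569.49' to a float."""
    return float(value.replace(",", ""))


summary = parse_summary(
    "Language: nld Script: Latn Tier: 100M Byte Premium Factor: 1.051606 "
    "Size (MB): 569.49 Expected Size (MB): 571.02 Number of Documents: 304,611 "
    "Total Tokens: 109,885,564 Tokenizer: separate by whitespace"
)

# Assumption: 1 MB = 1,000,000 bytes.
bytes_per_token = to_number(summary["Size (MB)"]) * 1_000_000 / to_number(summary["Total Tokens"])
print(summary["Tokenizer"], round(bytes_per_token, 2))
```

Values are kept as strings on purpose, because some entries report a tier such as "< 100M" that is not a plain number.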
babylm-kor
huggingface.co
Updated Oct 29, 2025
BabyLM Challenge (2025). babylm-kor [Dataset]. https://huggingface.co/datasets/BabyLM-community/babylm-kor
Dataset updated
Oct 29, 2025
Dataset authored and provided by
BabyLM Challenge
License
https://choosealicense.com/licenses/unknown/
Description
BabyLM Dataset

Dataset Description

This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm

Dataset Summary

Language: kor
Script: Hang
Tier: 1M
Byte Premium Factor: 1.293311
Size (MB): 7.07
Expected Size (MB): 7.02
Number of Documents: 290
Total Tokens: 2,453,075
Tokenizer: LGAI-EXAONE/EXAONE-4.0-1.2B

Tokens Per Category

child-books: 15,458 tokens
child-directed-speech: 2,163,779 tokens…

See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-kor.
babylm-ron
huggingface.co
Updated Oct 29, 2025
BabyLM Challenge (2025). babylm-ron [Dataset]. https://huggingface.co/datasets/BabyLM-community/babylm-ron
Dataset updated
Oct 29, 2025
Dataset authored and provided by
BabyLM Challenge
License
https://choosealicense.com/licenses/unknown/
Description
BabyLM Dataset

Dataset Description

This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm

Dataset Summary

Language: ron
Script: Latn
Tier: 1M
Byte Premium Factor: 1.115121
Size (MB): 6.10
Expected Size (MB): 6.06
Number of Documents: 18,763
Total Tokens: 972,105
Tokenizer: separate by whitespace

Tokens Per Category

child-books: 284,101 tokens
child-directed-speech: 294,696 tokens…

See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-ron.
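Several entries report "Tokenizer: separate by whitespace". One plausible reading, sketched below, is that a token is any maximal run of non-whitespace characters. This is an assumption: the listing does not say how punctuation or Unicode whitespace was handled, the function name `count_whitespace_tokens` is hypothetical, and the sample sentence is invented for illustration rather than taken from the dataset.

```python
def count_whitespace_tokens(text: str) -> int:
    """Count tokens by splitting on runs of whitespace (spaces, tabs, newlines)."""
    # str.split() with no argument collapses consecutive whitespace and
    # ignores leading/trailing whitespace, so empty tokens never appear.
    return len(text.split())


# Hypothetical sample text, not drawn from babylm-ron.
sample = "Ana are  mere.\nMerele sunt roșii."
token_count = count_whitespace_tokens(sample)
print(token_count)
```

Under this reading punctuation stays attached to the neighbouring word ("mere." is one token), so counts may differ from those produced by a subword tokenizer such as the ones named for the kor, jpn, yue, and zho entries.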
babylm-jav
huggingface.co
Updated Oct 29, 2025
BabyLM Challenge (2025). babylm-jav [Dataset]. https://huggingface.co/datasets/BabyLM-community/babylm-jav
Dataset updated
Oct 29, 2025
Dataset authored and provided by
BabyLM Challenge
License
https://choosealicense.com/licenses/unknown/
Description
BabyLM Dataset

Dataset Description

This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm

Dataset Summary

Language: jav
Script: Latn
Tier: 1M
Byte Premium Factor: 1.146846
Size (MB): 6.23
Expected Size (MB): 6.23
Number of Documents: 6,657
Total Tokens: 952,647
Tokenizer: separate by whitespace

Tokens Per Category

child-books: 307,282 tokens
padding: 645,365 tokens

Tokens Per…

See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-jav.
babylm-sun
huggingface.co
Updated Oct 29, 2025
BabyLM Challenge (2025). babylm-sun [Dataset]. https://huggingface.co/datasets/BabyLM-community/babylm-sun
Dataset updated
Oct 29, 2025
Dataset authored and provided by
BabyLM Challenge
License
https://choosealicense.com/licenses/unknown/
Description
BabyLM Dataset

Dataset Description

This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm

Dataset Summary

Language: sun
Script: Latn
Tier: 1M
Byte Premium Factor: 1.096981
Size (MB): 5.96
Expected Size (MB): 5.96
Number of Documents: 11,201
Total Tokens: 892,088
Tokenizer: separate by whitespace

Tokens Per Category

child-books: 17,264 tokens
educational: 177 tokens
padding: 874,647 tokens…

See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-sun.
babylm-bul
huggingface.co
Updated Oct 29, 2025
BabyLM Challenge (2025). babylm-bul [Dataset]. https://huggingface.co/datasets/BabyLM-community/babylm-bul
Dataset updated
Oct 29, 2025
Dataset authored and provided by
BabyLM Challenge
License
https://choosealicense.com/licenses/unknown/
Description
BabyLM Dataset

Dataset Description

This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm

Dataset Summary

Language: bul
Script: Cyrl
Tier: 100M
Byte Premium Factor: 1.812316
Size (MB): 981.07
Expected Size (MB): 984.09
Number of Documents: 2,143,576
Total Tokens: 115,362,693
Tokenizer: separate by whitespace

Tokens Per Category

child-books: 24,799,312 tokens
padding-opensubtitles: 90,563…

See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-bul.
babylm-ban
huggingface.co
Updated Oct 29, 2025
BabyLM Challenge (2025). babylm-ban [Dataset]. https://huggingface.co/datasets/BabyLM-community/babylm-ban
Dataset updated
Oct 29, 2025
Dataset authored and provided by
BabyLM Challenge
License
https://choosealicense.com/licenses/unknown/
Description
BabyLM Dataset

Dataset Description

This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm

Dataset Summary

Language: ban
Script: Latn
Tier: 1M
Byte Premium Factor: 1.269544
Size (MB): 6.87
Expected Size (MB): 6.89
Number of Documents: 13,503
Total Tokens: 938,725
Tokenizer: separate by whitespace

Tokens Per Category

child-books: 63,826 tokens
padding: 580,523 tokens
padding-wikipedia: 294…

See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-ban.
babylm-ces
huggingface.co
Updated Oct 29, 2025
BabyLM Challenge (2025). babylm-ces [Dataset]. https://huggingface.co/datasets/BabyLM-community/babylm-ces
Dataset updated
Oct 29, 2025
Dataset authored and provided by
BabyLM Challenge
License
https://choosealicense.com/licenses/unknown/
Description
BabyLM Dataset

Dataset Description

This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm

Dataset Summary

Language: ces
Script: Latn
Tier: 1M
Byte Premium Factor: 1.035849
Size (MB): 5.64
Expected Size (MB): 5.62
Number of Documents: 540
Total Tokens: 762,576
Tokenizer: separate by whitespace

Tokens Per Category

child-directed-speech: 377,313 tokens
padding-fineweb-c: 78,540 tokens…

See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-ces.
babylm-srp
huggingface.co
Updated Oct 29, 2025
BabyLM Challenge (2025). babylm-srp [Dataset]. https://huggingface.co/datasets/BabyLM-community/babylm-srp
Dataset updated
Oct 29, 2025
Dataset authored and provided by
BabyLM Challenge
License
https://choosealicense.com/licenses/unknown/
Description
BabyLM Dataset

Dataset Description

This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm

Dataset Summary

Language: srp
Script: Cyrl, Latn
Tier: < 100M
Byte Premium Factor: 0.826258
Size (MB): 77.21
Expected Size (MB): 448.66
Number of Documents: 2,244
Total Tokens: 15,227,050
Tokenizer: separate by whitespace

Tokens Per Category

child-books: 29,896 tokens
child-directed-speech: 1,489,908…

See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-srp.
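In most entries the reported Size (MB) and Expected Size (MB) nearly coincide, while this entry, whose tier is written "< 100M", reports 77.21 against 448.66. The sketch below only compares the two reported numbers. The listing does not define how Expected Size is derived, so reading the ratio as "fraction of the tier target reached" is an assumption, and `size_ratio` is a hypothetical helper name.

```python
def size_ratio(size_mb: float, expected_mb: float) -> float:
    """Return reported size divided by expected size."""
    if expected_mb <= 0:
        raise ValueError("expected size must be positive")
    return size_mb / expected_mb


# (Size (MB), Expected Size (MB)) pairs copied from the listing.
reported = {
    "nld": (569.49, 571.02),  # Tier: 100M
    "srp": (77.21, 448.66),   # Tier: < 100M
}

ratios = {lang: size_ratio(size, expected) for lang, (size, expected) in reported.items()}
for lang, ratio in ratios.items():
    print(f"{lang}: {ratio:.3f}")
```

A ratio far below 1 flags entries whose tier label carries a "<" qualifier; the same comparison can be applied to the yue and zho entries, which use "< 10M" and "> 100M".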
babylm-spa
huggingface.co
Updated Oct 29, 2025
BabyLM Challenge (2025). babylm-spa [Dataset]. https://huggingface.co/datasets/BabyLM-community/babylm-spa
Dataset updated
Oct 29, 2025
Dataset authored and provided by
BabyLM Challenge
License
https://choosealicense.com/licenses/unknown/
Description
BabyLM Dataset

Dataset Description

This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm

Dataset Summary

Language: spa
Script: Latn
Tier: 10M
Byte Premium Factor: 1.083832
Size (MB): 58.75
Expected Size (MB): 58.85
Number of Documents: 11,502
Total Tokens: 9,709,092
Tokenizer: separate by whitespace

Tokens Per Category

child-available-speech: 103,394 tokens
child-books: 3,950,325 tokens…

See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-spa.
babylm-jpn
huggingface.co
Updated Oct 29, 2025
BabyLM Challenge (2025). babylm-jpn [Dataset]. https://huggingface.co/datasets/BabyLM-community/babylm-jpn
Dataset updated
Oct 29, 2025
Dataset authored and provided by
BabyLM Challenge
License
https://choosealicense.com/licenses/unknown/
Description
BabyLM Dataset

Dataset Description

This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm

Dataset Summary

Language: jpn
Script: Hira, Jpan, Kana
Tier: 10M
Byte Premium Factor: 1.321974
Size (MB): 71.78
Expected Size (MB): 71.78
Number of Documents: 2,043
Total Tokens: 16,524,324
Tokenizer: tohoku-nlp/bert-base-japanese

Tokens Per Category

child-books: 9,712,521 tokens
educational: 291,053…

See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-jpn.
babylm-yue
huggingface.co
Updated Oct 29, 2025
BabyLM Challenge (2025). babylm-yue [Dataset]. https://huggingface.co/datasets/BabyLM-community/babylm-yue
Dataset updated
Oct 29, 2025
Dataset authored and provided by
BabyLM Challenge
License
https://choosealicense.com/licenses/unknown/
Description
BabyLM Dataset

Dataset Description

This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm

Dataset Summary

Language: yue
Script: Hani, Hant
Tier: < 10M
Byte Premium Factor: 0.862461
Size (MB): 43.34
Expected Size (MB): 46.83
Number of Documents: 28,318
Total Tokens: 15,045,195
Tokenizer: Qwen/Qwen1.5-7B-Chat

Tokens Per Category

child-books: 191,861 tokens
child-directed-speech: 2,982,684…

See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-yue.
babylm-eus
huggingface.co
Updated Nov 13, 2025
BabyLM Challenge (2025). babylm-eus [Dataset]. https://huggingface.co/datasets/BabyLM-community/babylm-eus
Dataset updated
Nov 13, 2025
Dataset authored and provided by
BabyLM Challenge
License
https://choosealicense.com/licenses/unknown/
Description
BabyLM Dataset

Dataset Description

This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm

Dataset Summary

Language: eus
Script: Latn
Tier: 10M
Byte Premium Factor: 1.059584
Size (MB): 57.06
Expected Size (MB): 57.54
Number of Documents: 13,421
Total Tokens: 8,189,297
Tokenizer: separate by whitespace

Tokens Per Category

child-directed-speech: 201,402 tokens
child-wiki: 1,716,026 tokens…

See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-eus.
babylm-ace
huggingface.co
Updated Oct 29, 2025
BabyLM Challenge (2025). babylm-ace [Dataset]. https://huggingface.co/datasets/BabyLM-community/babylm-ace
Dataset updated
Oct 29, 2025
Dataset authored and provided by
BabyLM Challenge
License
https://choosealicense.com/licenses/unknown/
Description
BabyLM Dataset

Dataset Description

This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm

Dataset Summary

Language: ace
Script: Latn
Tier: 1M
Byte Premium Factor: 1.241957
Size (MB): 6.74
Expected Size (MB): 6.74
Number of Documents: 20,883
Total Tokens: 968,194
Tokenizer: separate by whitespace

Tokens Per Category

child-books: 242,613 tokens
padding: 283,843 tokens
padding-wikipedia: 441…

See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-ace.
babylm-zho
huggingface.co
Updated Oct 29, 2025
BabyLM Challenge (2025). babylm-zho [Dataset]. https://huggingface.co/datasets/BabyLM-community/babylm-zho
Dataset updated
Oct 29, 2025
Dataset authored and provided by
BabyLM Challenge
License
https://choosealicense.com/licenses/unknown/
Description
BabyLM Dataset

Dataset Description

This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm

Dataset Summary

Language: zho
Script: Hani, Hans, Latn
Tier: > 100M
Byte Premium Factor: 0.935966
Size (MB): 518.85
Expected Size (MB): 508.23
Number of Documents: 203,891
Total Tokens: 137,835,046
Tokenizer: Qwen/Qwen3-0.6B

Tokens Per Category

child-available-speech: 98,731,442 tokens
child-books: 15…

See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-zho.
babylm-ell
huggingface.co
Updated Oct 29, 2025
BabyLM Challenge (2025). babylm-ell [Dataset]. https://huggingface.co/datasets/BabyLM-community/babylm-ell
Dataset updated
Oct 29, 2025
Dataset authored and provided by
BabyLM Challenge
License
https://choosealicense.com/licenses/unknown/
Description
BabyLM Dataset

Dataset Description

This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm

Dataset Summary

Language: ell
Script: Greek, Grek
Tier: 10M
Byte Premium Factor: 1.967262
Size (MB): 106.81
Expected Size (MB): 106.82
Number of Documents: 11,104
Total Tokens: 10,882,556
Tokenizer: separate by whitespace

Tokens Per Category

child-available-speech: 1,673,255 tokens
child-books: 1,390…

See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-ell.
babylm-ara
huggingface.co
Updated Oct 29, 2025
BabyLM Challenge (2025). babylm-ara [Dataset]. https://huggingface.co/datasets/BabyLM-community/babylm-ara
Dataset updated
Oct 29, 2025
Dataset authored and provided by
BabyLM Challenge
License
https://choosealicense.com/licenses/unknown/
Description
BabyLM Dataset

Dataset Description

This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm

Dataset Summary

Language: ara
Script: Arab
Tier: 10M
Byte Premium Factor: 1.465018
Size (MB): 79.57
Expected Size (MB): 79.55
Number of Documents: 30,533
Total Tokens: 8,353,682
Tokenizer: separate by whitespace

Tokens Per Category

child-available-speech: 3,160,747 tokens
child-books: 1,667,683 tokens…

See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-ara.
babylm-german
huggingface.co
Updated Mar 17, 2025
Bastian Bunzeck (2025). babylm-german [Dataset]. https://huggingface.co/datasets/bbunzeck/babylm-german
Dataset updated
Mar 17, 2025
Authors
Bastian Bunzeck
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
German BabyLM dataset

This is a pre-training dataset for training developmentally plausible language models in German (also called BabyLMs), compiled by the Computational Linguistics Group (CLAUSE) at Bielefeld University. If you are looking for ways to evaluate your German BabyLMs, we recommend our own lexical decision dataset, CLAMS for syntactic evaluation and XCOMPS for conceptual semantics/world knowledge. The composition is inspired by the original, English BabyLM dataset (see… See the full description on the dataset page: https://huggingface.co/datasets/bbunzeck/babylm-german.
babylm-fas
huggingface.co
Updated Oct 29, 2025
BabyLM Challenge (2025). babylm-fas [Dataset]. https://huggingface.co/datasets/BabyLM-community/babylm-fas
Dataset updated
Oct 29, 2025
Dataset authored and provided by
BabyLM Challenge
License
https://choosealicense.com/licenses/unknown/
Description
BabyLM Dataset

Dataset Description

This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm

Dataset Summary

Language: fas
Script: Arab
Tier: 100M
Byte Premium Factor: 1.597326
Size (MB): 867.30
Expected Size (MB): 867.35
Number of Documents: 217,776
Total Tokens: 98,506,081
Tokenizer: separate by whitespace

Tokens Per Category

child-books: 67,165 tokens
educational: 94,320,928 tokens…

See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-fas.

babylm-100M (nilq/babylm-100M): 16 scholarly articles cite this dataset (View in Google Scholar)
