BabyLM 100M
This curated dataset is originally from the BabyLM Challenge. It consists of ~100M words of mixed-domain text from the following sources:
CHILDES (child-directed speech)
Subtitles (speech)
BNC (speech)
TED talks (speech)
children's books (simple written language)
License: unknown (https://choosealicense.com/licenses/unknown/)
BabyLM Dataset
Dataset Description
This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm
Dataset Summary
Language: nld
Script: Latn
Tier: 100M
Byte Premium Factor: 1.051606
Size (MB): 569.49
Expected Size (MB): 571.02
Number of Documents: 304,611
Total Tokens: 109,885,564
Tokenizer: separate by whitespace
Tokens Per Category
child-books: 4,576,823 tokens
child-directed-speech: 3,304,756…
See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-nld.
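Several entries in this listing give the tokenizer as "separate by whitespace". A minimal sketch of what that plausibly means for counting tokens, assuming tokens are simply the pieces left after splitting on runs of whitespace. The function name `count_whitespace_tokens` and the sample sentence are illustrative and do not come from the dataset or its tooling.

```python
def count_whitespace_tokens(text: str) -> int:
    """Count tokens by splitting on runs of whitespace.

    Assumption: this mirrors the "separate by whitespace" tokenizer named
    in the listing; the actual counting script may differ in details.
    """
    # str.split() with no arguments splits on any whitespace run and
    # drops leading/trailing empty strings.
    return len(text.split())


# Illustrative Dutch sentence, not taken from the dataset.
sample = "De kat zit op de mat."
print(count_whitespace_tokens(sample))
```

Under this scheme punctuation stays attached to the neighbouring word, so counts are not directly comparable with the entries that use a subword tokenizer.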
License: unknown (https://choosealicense.com/licenses/unknown/)
BabyLM Dataset
Dataset Description
This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm
Dataset Summary
Language: kor
Script: Hang
Tier: 1M
Byte Premium Factor: 1.293311
Size (MB): 7.07
Expected Size (MB): 7.02
Number of Documents: 290
Total Tokens: 2,453,075
Tokenizer: LGAI-EXAONE/EXAONE-4.0-1.2B
Tokens Per Category
child-books: 15,458 tokens
child-directed-speech: 2,163,779 tokens…
See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-kor.
License: unknown (https://choosealicense.com/licenses/unknown/)
BabyLM Dataset
Dataset Description
This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm
Dataset Summary
Language: ron
Script: Latn
Tier: 1M
Byte Premium Factor: 1.115121
Size (MB): 6.10
Expected Size (MB): 6.06
Number of Documents: 18,763
Total Tokens: 972,105
Tokenizer: separate by whitespace
Tokens Per Category
child-books: 284,101 tokens
child-directed-speech: 294,696 tokens…
See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-ron.
License: unknown (https://choosealicense.com/licenses/unknown/)
BabyLM Dataset
Dataset Description
This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm
Dataset Summary
Language: jav
Script: Latn
Tier: 1M
Byte Premium Factor: 1.146846
Size (MB): 6.23
Expected Size (MB): 6.23
Number of Documents: 6,657
Total Tokens: 952,647
Tokenizer: separate by whitespace
Tokens Per Category
child-books: 307,282 tokens
padding: 645,365 tokens
Tokens Per…
See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-jav.
License: unknown (https://choosealicense.com/licenses/unknown/)
BabyLM Dataset
Dataset Description
This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm
Dataset Summary
Language: sun
Script: Latn
Tier: 1M
Byte Premium Factor: 1.096981
Size (MB): 5.96
Expected Size (MB): 5.96
Number of Documents: 11,201
Total Tokens: 892,088
Tokenizer: separate by whitespace
Tokens Per Category
child-books: 17,264 tokens
educational: 177 tokens
padding: 874,647 tokens…
See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-sun.
License: unknown (https://choosealicense.com/licenses/unknown/)
BabyLM Dataset
Dataset Description
This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm
Dataset Summary
Language: bul
Script: Cyrl
Tier: 100M
Byte Premium Factor: 1.812316
Size (MB): 981.07
Expected Size (MB): 984.09
Number of Documents: 2,143,576
Total Tokens: 115,362,693
Tokenizer: separate by whitespace
Tokens Per Category
child-books: 24,799,312 tokens
padding-opensubtitles: 90,563…
See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-bul.
License: unknown (https://choosealicense.com/licenses/unknown/)
BabyLM Dataset
Dataset Description
This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm
Dataset Summary
Language: ban
Script: Latn
Tier: 1M
Byte Premium Factor: 1.269544
Size (MB): 6.87
Expected Size (MB): 6.89
Number of Documents: 13,503
Total Tokens: 938,725
Tokenizer: separate by whitespace
Tokens Per Category
child-books: 63,826 tokens
padding: 580,523 tokens
padding-wikipedia: 294…
See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-ban.
License: unknown (https://choosealicense.com/licenses/unknown/)
BabyLM Dataset
Dataset Description
This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm
Dataset Summary
Language: ces
Script: Latn
Tier: 1M
Byte Premium Factor: 1.035849
Size (MB): 5.64
Expected Size (MB): 5.62
Number of Documents: 540
Total Tokens: 762,576
Tokenizer: separate by whitespace
Tokens Per Category
child-directed-speech: 377,313 tokens
padding-fineweb-c: 78,540 tokens…
See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-ces.
License: unknown (https://choosealicense.com/licenses/unknown/)
BabyLM Dataset
Dataset Description
This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm
Dataset Summary
Language: srp
Script: Cyrl, Latn
Tier: < 100M
Byte Premium Factor: 0.826258
Size (MB): 77.21
Expected Size (MB): 448.66
Number of Documents: 2,244
Total Tokens: 15,227,050
Tokenizer: separate by whitespace
Tokens Per Category
child-books: 29,896 tokens
child-directed-speech: 1,489,908…
See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-srp.
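The figures in this listing appear consistent with Expected Size (MB) being the Byte Premium Factor multiplied by a per-tier baseline of roughly 5.43 MB per 1M tier. For this entry, 448.66 / 0.826258 ≈ 543 MB, which matches the 100M-tier baseline. The sketch below checks that relationship. The baseline constants are inferred from the numbers on this page, not taken from the BabyLM documentation, and the names `BASELINE_MB` and `expected_size_mb` are hypothetical.

```python
# Baseline size in MB for each tier, inferred from the listed figures
# (an assumption, not a documented constant).
BASELINE_MB = {"1M": 5.43, "10M": 54.3, "100M": 543.0}


def expected_size_mb(byte_premium: float, tier: str) -> float:
    """Estimate the expected dataset size for a language.

    byte_premium: the listed Byte Premium Factor for the language.
    tier: the target tier, one of "1M", "10M", "100M".
    """
    return BASELINE_MB[tier] * byte_premium


# Serbian entry: byte premium 0.826258, measured against the 100M tier.
print(round(expected_size_mb(0.826258, "100M"), 2))
```

Read this way, a tier such as "< 100M" would indicate that the actual size (77.21 MB here) falls short of the expected size for that tier.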
License: unknown (https://choosealicense.com/licenses/unknown/)
BabyLM Dataset
Dataset Description
This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm
Dataset Summary
Language: spa
Script: Latn
Tier: 10M
Byte Premium Factor: 1.083832
Size (MB): 58.75
Expected Size (MB): 58.85
Number of Documents: 11,502
Total Tokens: 9,709,092
Tokenizer: separate by whitespace
Tokens Per Category
child-available-speech: 103,394 tokens
child-books: 3,950,325 tokens…
See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-spa.
License: unknown (https://choosealicense.com/licenses/unknown/)
BabyLM Dataset
Dataset Description
This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm
Dataset Summary
Language: jpn
Script: Hira, Jpan, Kana
Tier: 10M
Byte Premium Factor: 1.321974
Size (MB): 71.78
Expected Size (MB): 71.78
Number of Documents: 2,043
Total Tokens: 16,524,324
Tokenizer: tohoku-nlp/bert-base-japanese
Tokens Per Category
child-books: 9,712,521 tokens
educational: 291,053…
See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-jpn.
License: unknown (https://choosealicense.com/licenses/unknown/)
BabyLM Dataset
Dataset Description
This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm
Dataset Summary
Language: yue
Script: Hani, Hant
Tier: < 10M
Byte Premium Factor: 0.862461
Size (MB): 43.34
Expected Size (MB): 46.83
Number of Documents: 28,318
Total Tokens: 15,045,195
Tokenizer: Qwen/Qwen1.5-7B-Chat
Tokens Per Category
child-books: 191,861 tokens
child-directed-speech: 2,982,684…
See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-yue.
License: unknown (https://choosealicense.com/licenses/unknown/)
BabyLM Dataset
Dataset Description
This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm
Dataset Summary
Language: eus
Script: Latn
Tier: 10M
Byte Premium Factor: 1.059584
Size (MB): 57.06
Expected Size (MB): 57.54
Number of Documents: 13,421
Total Tokens: 8,189,297
Tokenizer: separate by whitespace
Tokens Per Category
child-directed-speech: 201,402 tokens
child-wiki: 1,716,026 tokens…
See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-eus.
License: unknown (https://choosealicense.com/licenses/unknown/)
BabyLM Dataset
Dataset Description
This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm
Dataset Summary
Language: ace
Script: Latn
Tier: 1M
Byte Premium Factor: 1.241957
Size (MB): 6.74
Expected Size (MB): 6.74
Number of Documents: 20,883
Total Tokens: 968,194
Tokenizer: separate by whitespace
Tokens Per Category
child-books: 242,613 tokens
padding: 283,843 tokens
padding-wikipedia: 441…
See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-ace.
License: unknown (https://choosealicense.com/licenses/unknown/)
BabyLM Dataset
Dataset Description
This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm
Dataset Summary
Language: zho
Script: Hani, Hans, Latn
Tier: > 100M
Byte Premium Factor: 0.935966
Size (MB): 518.85
Expected Size (MB): 508.23
Number of Documents: 203,891
Total Tokens: 137,835,046
Tokenizer: Qwen/Qwen3-0.6B
Tokens Per Category
child-available-speech: 98,731,442 tokens
child-books: 15…
See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-zho.
License: unknown (https://choosealicense.com/licenses/unknown/)
BabyLM Dataset
Dataset Description
This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm
Dataset Summary
Language: ell
Script: Greek, Grek
Tier: 10M
Byte Premium Factor: 1.967262
Size (MB): 106.81
Expected Size (MB): 106.82
Number of Documents: 11,104
Total Tokens: 10,882,556
Tokenizer: separate by whitespace
Tokens Per Category
child-available-speech: 1,673,255 tokens
child-books: 1,390…
See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-ell.
License: unknown (https://choosealicense.com/licenses/unknown/)
BabyLM Dataset
Dataset Description
This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm
Dataset Summary
Language: ara
Script: Arab
Tier: 10M
Byte Premium Factor: 1.465018
Size (MB): 79.57
Expected Size (MB): 79.55
Number of Documents: 30,533
Total Tokens: 8,353,682
Tokenizer: separate by whitespace
Tokens Per Category
child-available-speech: 3,160,747 tokens
child-books: 1,667,683 tokens…
See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-ara.
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
German BabyLM dataset
This is a pre-training dataset for training developmentally plausible language models (also called BabyLMs) in German, compiled by the Computational Linguistics Group (CLAUSE) at Bielefeld University. To evaluate your German BabyLMs, we recommend our own lexical decision dataset, CLAMS for syntactic evaluation, and XCOMPS for conceptual semantics/world knowledge. The composition is inspired by the original English BabyLM dataset (see…
See the full description on the dataset page: https://huggingface.co/datasets/bbunzeck/babylm-german.
License: unknown (https://choosealicense.com/licenses/unknown/)
BabyLM Dataset
Dataset Description
This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm
Dataset Summary
Language: fas
Script: Arab
Tier: 100M
Byte Premium Factor: 1.597326
Size (MB): 867.30
Expected Size (MB): 867.35
Number of Documents: 217,776
Total Tokens: 98,506,081
Tokenizer: separate by whitespace
Tokens Per Category
child-books: 67,165 tokens
educational: 94,320,928 tokens…
See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-fas.