75 datasets found
  1. babylm-100M

    • huggingface.co
    Updated Feb 24, 2024
    + more versions
    Cite
    Niels Horn (2024). babylm-100M [Dataset]. https://huggingface.co/datasets/nilq/babylm-100M
    Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 24, 2024
    Authors
    Niels Horn
    Description

    BabyLM 100M

    This curated dataset is originally from the BabyLM Challenge. It contains ~100M words from mixed domains, drawn from the following sources:

    CHILDES (child-directed speech)
    Subtitles (speech)
    BNC (speech)
    TED talks (speech)
    children's books (simple written language)
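The citation above gives the Hub repository id nilq/babylm-100M. A minimal sketch of fetching it with the Hugging Face `datasets` library follows. The helper names `dataset_url` and `load_babylm` are hypothetical, the library has to be installed separately, and the split and column names it returns are not stated in this listing, so inspect the result before relying on them.

```python
def dataset_url(repo_id: str) -> str:
    """Build the Hugging Face dataset page URL for a repository id."""
    return f"https://huggingface.co/datasets/{repo_id}"


def load_babylm(repo_id: str = "nilq/babylm-100M"):
    """Download the dataset from the Hugging Face Hub (needs network access)."""
    # Imported lazily so dataset_url() works without the library installed.
    from datasets import load_dataset

    return load_dataset(repo_id)


if __name__ == "__main__":
    print(dataset_url("nilq/babylm-100M"))
    # corpus = load_babylm()   # uncomment to download
    # print(corpus)            # shows the available splits and columns
```

The same call should work for the other repository ids cited in this listing, for example BabyLM-community/babylm-nld, assuming those repositories are publicly readable.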

  2. babylm-nld

    • huggingface.co
    Updated Oct 29, 2025
    + more versions
    Cite
    BabyLM Challenge (2025). babylm-nld [Dataset]. https://huggingface.co/datasets/BabyLM-community/babylm-nld
    Dataset updated
    Oct 29, 2025
    Dataset authored and provided by
    BabyLM Challenge
    License

    https://choosealicense.com/licenses/unknown/

    Description

    BabyLM Dataset

      Dataset Description
    

    This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm

      Dataset Summary

    Language: nld
    Script: Latn
    Tier: 100M
    Byte Premium Factor: 1.051606
    Size (MB): 569.49
    Expected Size (MB): 571.02
    Number of Documents: 304,611
    Total Tokens: 109,885,564
    Tokenizer: separate by whitespace

      Tokens Per Category

    child-books: 4,576,823 tokens
    child-directed-speech: 3,304,756… See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-nld.
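When the summary fields are available only as one flattened string of label/value pairs, they can be split back into a record. The sketch below assumes the nine labels that appear in this listing; the function name `parse_summary` and the `FIELDS` list are illustrative, not part of any BabyLM tooling.

```python
import re

# Labels as they appear in the dataset summaries of this listing.
FIELDS = [
    "Language", "Script", "Tier", "Byte Premium Factor", "Size (MB)",
    "Expected Size (MB)", "Number of Documents", "Total Tokens", "Tokenizer",
]

# Longest labels first, so "Expected Size (MB)" is preferred over "Size (MB)".
_LABEL = re.compile(
    "(" + "|".join(re.escape(f) for f in sorted(FIELDS, key=len, reverse=True)) + "):"
)


def parse_summary(line: str) -> dict[str, str]:
    """Split 'Label: value Label: value ...' into a dict of raw strings."""
    matches = list(_LABEL.finditer(line))
    record = {}
    for i, m in enumerate(matches):
        # A value runs from the end of its label to the start of the next label.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(line)
        record[m.group(1)] = line[m.end():end].strip()
    return record


nld = parse_summary(
    "Language: nld Script: Latn Tier: 100M Byte Premium Factor: 1.051606 "
    "Size (MB): 569.49 Expected Size (MB): 571.02 Number of Documents: 304,611 "
    "Total Tokens: 109,885,564 Tokenizer: separate by whitespace"
)
total_tokens = int(nld["Total Tokens"].replace(",", ""))
```

Values are kept as strings because some fields are not plain numbers (for example a tier of "< 100M" or a script list such as "Cyrl, Latn"); convert individual fields only where needed.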

  3. babylm-kor

    • huggingface.co
    Updated Oct 29, 2025
    Cite
    BabyLM Challenge (2025). babylm-kor [Dataset]. https://huggingface.co/datasets/BabyLM-community/babylm-kor
    Dataset updated
    Oct 29, 2025
    Dataset authored and provided by
    BabyLM Challenge
    License

    https://choosealicense.com/licenses/unknown/

    Description

    BabyLM Dataset

      Dataset Description
    

    This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm

      Dataset Summary

    Language: kor
    Script: Hang
    Tier: 1M
    Byte Premium Factor: 1.293311
    Size (MB): 7.07
    Expected Size (MB): 7.02
    Number of Documents: 290
    Total Tokens: 2,453,075
    Tokenizer: LGAI-EXAONE/EXAONE-4.0-1.2B

      Tokens Per Category

    child-books: 15,458 tokens
    child-directed-speech: 2,163,779 tokens… See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-kor.

  4. babylm-ron

    • huggingface.co
    Updated Oct 29, 2025
    Cite
    BabyLM Challenge (2025). babylm-ron [Dataset]. https://huggingface.co/datasets/BabyLM-community/babylm-ron
    Dataset updated
    Oct 29, 2025
    Dataset authored and provided by
    BabyLM Challenge
    License

    https://choosealicense.com/licenses/unknown/

    Description

    BabyLM Dataset

      Dataset Description
    

    This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm

      Dataset Summary

    Language: ron
    Script: Latn
    Tier: 1M
    Byte Premium Factor: 1.115121
    Size (MB): 6.10
    Expected Size (MB): 6.06
    Number of Documents: 18,763
    Total Tokens: 972,105
    Tokenizer: separate by whitespace

      Tokens Per Category

    child-books: 284,101 tokens
    child-directed-speech: 294,696 tokens… See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-ron.
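Several summaries in this listing give the tokenizer as "separate by whitespace". One plausible reading is a plain split on runs of whitespace, sketched below. The listing does not specify the exact counting procedure (for instance how punctuation or empty documents are handled), so the helper names and behavior here are assumptions rather than the collection's actual method.

```python
def count_whitespace_tokens(text: str) -> int:
    """Count tokens by splitting on any run of whitespace."""
    # str.split() with no argument splits on spaces, tabs and newlines
    # and ignores leading or trailing whitespace.
    return len(text.split())


def count_corpus_tokens(documents: list[str]) -> int:
    """Sum whitespace token counts over a list of documents."""
    return sum(count_whitespace_tokens(doc) for doc in documents)


sample_docs = ["the  cat\nsat on\tthe mat", "", "  one more  "]
sample_total = count_corpus_tokens(sample_docs)
```

Counts obtained this way are not comparable with entries that name a subword tokenizer instead, since those count model tokens rather than whitespace-delimited words.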

  5. babylm-jav

    • huggingface.co
    Updated Oct 29, 2025
    Cite
    BabyLM Challenge (2025). babylm-jav [Dataset]. https://huggingface.co/datasets/BabyLM-community/babylm-jav
    Dataset updated
    Oct 29, 2025
    Dataset authored and provided by
    BabyLM Challenge
    License

    https://choosealicense.com/licenses/unknown/

    Description

    BabyLM Dataset

      Dataset Description
    

    This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm

      Dataset Summary

    Language: jav
    Script: Latn
    Tier: 1M
    Byte Premium Factor: 1.146846
    Size (MB): 6.23
    Expected Size (MB): 6.23
    Number of Documents: 6,657
    Total Tokens: 952,647
    Tokenizer: separate by whitespace

      Tokens Per Category

    child-books: 307,282 tokens
    padding: 645,365 tokens

      Tokens Per… See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-jav.
    
  6. babylm-sun

    • huggingface.co
    Updated Oct 29, 2025
    Cite
    BabyLM Challenge (2025). babylm-sun [Dataset]. https://huggingface.co/datasets/BabyLM-community/babylm-sun
    Dataset updated
    Oct 29, 2025
    Dataset authored and provided by
    BabyLM Challenge
    License

    https://choosealicense.com/licenses/unknown/

    Description

    BabyLM Dataset

      Dataset Description
    

    This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm

      Dataset Summary

    Language: sun
    Script: Latn
    Tier: 1M
    Byte Premium Factor: 1.096981
    Size (MB): 5.96
    Expected Size (MB): 5.96
    Number of Documents: 11,201
    Total Tokens: 892,088
    Tokenizer: separate by whitespace

      Tokens Per Category

    child-books: 17,264 tokens
    educational: 177 tokens
    padding: 874,647 tokens… See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-sun.

  7. babylm-bul

    • huggingface.co
    Updated Oct 29, 2025
    + more versions
    Cite
    BabyLM Challenge (2025). babylm-bul [Dataset]. https://huggingface.co/datasets/BabyLM-community/babylm-bul
    Dataset updated
    Oct 29, 2025
    Dataset authored and provided by
    BabyLM Challenge
    License

    https://choosealicense.com/licenses/unknown/

    Description

    BabyLM Dataset

      Dataset Description
    

    This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm

      Dataset Summary

    Language: bul
    Script: Cyrl
    Tier: 100M
    Byte Premium Factor: 1.812316
    Size (MB): 981.07
    Expected Size (MB): 984.09
    Number of Documents: 2,143,576
    Total Tokens: 115,362,693
    Tokenizer: separate by whitespace

      Tokens Per Category

    child-books: 24,799,312 tokens
    padding-opensubtitles: 90,563… See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-bul.
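The listing does not define Byte Premium Factor or say how Expected Size is derived. As a quick arithmetic check on the figures quoted above, dividing Expected Size by Byte Premium Factor for the two Tier 100M entries (nld and bul) gives nearly the same number, which suggests, but does not confirm, that the expected size is a fixed per-tier baseline scaled by the factor. The function name `implied_baseline_mb` is illustrative.

```python
# (language, byte premium factor, expected size in MB), copied from the listing.
ROWS_100M = [
    ("nld", 1.051606, 571.02),
    ("bul", 1.812316, 984.09),
]


def implied_baseline_mb(factor: float, expected_mb: float) -> float:
    """Size the tier would have at a byte premium factor of exactly 1.0."""
    return expected_mb / factor


baselines = {
    lang: implied_baseline_mb(factor, expected)
    for lang, factor, expected in ROWS_100M
}

for lang, value in baselines.items():
    print(f"{lang}: {value:.1f} MB")
```

Both entries come out at roughly 543 MB. Treat this as an observed regularity in the listed numbers, not as a documented formula; the project page linked in the descriptions is the place to confirm the definition.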

  8. babylm-ban

    • huggingface.co
    Updated Oct 29, 2025
    Cite
    BabyLM Challenge (2025). babylm-ban [Dataset]. https://huggingface.co/datasets/BabyLM-community/babylm-ban
    Dataset updated
    Oct 29, 2025
    Dataset authored and provided by
    BabyLM Challenge
    License

    https://choosealicense.com/licenses/unknown/

    Description

    BabyLM Dataset

      Dataset Description
    

    This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm

      Dataset Summary

    Language: ban
    Script: Latn
    Tier: 1M
    Byte Premium Factor: 1.269544
    Size (MB): 6.87
    Expected Size (MB): 6.89
    Number of Documents: 13,503
    Total Tokens: 938,725
    Tokenizer: separate by whitespace

      Tokens Per Category

    child-books: 63,826 tokens
    padding: 580,523 tokens
    padding-wikipedia: 294… See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-ban.

  9. babylm-ces

    • huggingface.co
    Updated Oct 29, 2025
    Cite
    BabyLM Challenge (2025). babylm-ces [Dataset]. https://huggingface.co/datasets/BabyLM-community/babylm-ces
    Dataset updated
    Oct 29, 2025
    Dataset authored and provided by
    BabyLM Challenge
    License

    https://choosealicense.com/licenses/unknown/

    Description

    BabyLM Dataset

      Dataset Description
    

    This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm

      Dataset Summary

    Language: ces
    Script: Latn
    Tier: 1M
    Byte Premium Factor: 1.035849
    Size (MB): 5.64
    Expected Size (MB): 5.62
    Number of Documents: 540
    Total Tokens: 762,576
    Tokenizer: separate by whitespace

      Tokens Per Category

    child-directed-speech: 377,313 tokens
    padding-fineweb-c: 78,540 tokens… See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-ces.

  10. babylm-srp

    • huggingface.co
    Updated Oct 29, 2025
    + more versions
    Cite
    BabyLM Challenge (2025). babylm-srp [Dataset]. https://huggingface.co/datasets/BabyLM-community/babylm-srp
    Dataset updated
    Oct 29, 2025
    Dataset authored and provided by
    BabyLM Challenge
    License

    https://choosealicense.com/licenses/unknown/

    Description

    BabyLM Dataset

      Dataset Description
    

    This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm

      Dataset Summary

    Language: srp
    Script: Cyrl, Latn
    Tier: < 100M
    Byte Premium Factor: 0.826258
    Size (MB): 77.21
    Expected Size (MB): 448.66
    Number of Documents: 2,244
    Total Tokens: 15,227,050
    Tokenizer: separate by whitespace

      Tokens Per Category

    child-books: 29,896 tokens
    child-directed-speech: 1,489,908… See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-srp.

  11. babylm-spa

    • huggingface.co
    Updated Oct 29, 2025
    Cite
    BabyLM Challenge (2025). babylm-spa [Dataset]. https://huggingface.co/datasets/BabyLM-community/babylm-spa
    Dataset updated
    Oct 29, 2025
    Dataset authored and provided by
    BabyLM Challenge
    License

    https://choosealicense.com/licenses/unknown/

    Description

    BabyLM Dataset

      Dataset Description
    

    This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm

      Dataset Summary

    Language: spa
    Script: Latn
    Tier: 10M
    Byte Premium Factor: 1.083832
    Size (MB): 58.75
    Expected Size (MB): 58.85
    Number of Documents: 11,502
    Total Tokens: 9,709,092
    Tokenizer: separate by whitespace

      Tokens Per Category

    child-available-speech: 103,394 tokens
    child-books: 3,950,325 tokens… See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-spa.

  12. babylm-jpn

    • huggingface.co
    Updated Oct 29, 2025
    Cite
    BabyLM Challenge (2025). babylm-jpn [Dataset]. https://huggingface.co/datasets/BabyLM-community/babylm-jpn
    Dataset updated
    Oct 29, 2025
    Dataset authored and provided by
    BabyLM Challenge
    License

    https://choosealicense.com/licenses/unknown/

    Description

    BabyLM Dataset

      Dataset Description
    

    This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm

      Dataset Summary

    Language: jpn
    Script: Hira, Jpan, Kana
    Tier: 10M
    Byte Premium Factor: 1.321974
    Size (MB): 71.78
    Expected Size (MB): 71.78
    Number of Documents: 2,043
    Total Tokens: 16,524,324
    Tokenizer: tohoku-nlp/bert-base-japanese

      Tokens Per Category

    child-books: 9,712,521 tokens
    educational: 291,053… See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-jpn.

  13. babylm-yue

    • huggingface.co
    Updated Oct 29, 2025
    Cite
    BabyLM Challenge (2025). babylm-yue [Dataset]. https://huggingface.co/datasets/BabyLM-community/babylm-yue
    Dataset updated
    Oct 29, 2025
    Dataset authored and provided by
    BabyLM Challenge
    License

    https://choosealicense.com/licenses/unknown/

    Description

    BabyLM Dataset

      Dataset Description
    

    This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm

      Dataset Summary

    Language: yue
    Script: Hani, Hant
    Tier: < 10M
    Byte Premium Factor: 0.862461
    Size (MB): 43.34
    Expected Size (MB): 46.83
    Number of Documents: 28,318
    Total Tokens: 15,045,195
    Tokenizer: Qwen/Qwen1.5-7B-Chat

      Tokens Per Category

    child-books: 191,861 tokens
    child-directed-speech: 2,982,684… See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-yue.

  14. babylm-eus

    • huggingface.co
    Updated Nov 13, 2025
    Cite
    BabyLM Challenge (2025). babylm-eus [Dataset]. https://huggingface.co/datasets/BabyLM-community/babylm-eus
    Dataset updated
    Nov 13, 2025
    Dataset authored and provided by
    BabyLM Challenge
    License

    https://choosealicense.com/licenses/unknown/

    Description

    BabyLM Dataset

      Dataset Description
    

    This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm

      Dataset Summary

    Language: eus
    Script: Latn
    Tier: 10M
    Byte Premium Factor: 1.059584
    Size (MB): 57.06
    Expected Size (MB): 57.54
    Number of Documents: 13,421
    Total Tokens: 8,189,297
    Tokenizer: separate by whitespace

      Tokens Per Category

    child-directed-speech: 201,402 tokens
    child-wiki: 1,716,026 tokens… See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-eus.

  15. babylm-ace

    • huggingface.co
    Updated Oct 29, 2025
    Cite
    BabyLM Challenge (2025). babylm-ace [Dataset]. https://huggingface.co/datasets/BabyLM-community/babylm-ace
    Dataset updated
    Oct 29, 2025
    Dataset authored and provided by
    BabyLM Challenge
    License

    https://choosealicense.com/licenses/unknown/

    Description

    BabyLM Dataset

      Dataset Description
    

    This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm

      Dataset Summary

    Language: ace
    Script: Latn
    Tier: 1M
    Byte Premium Factor: 1.241957
    Size (MB): 6.74
    Expected Size (MB): 6.74
    Number of Documents: 20,883
    Total Tokens: 968,194
    Tokenizer: separate by whitespace

      Tokens Per Category

    child-books: 242,613 tokens
    padding: 283,843 tokens
    padding-wikipedia: 441… See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-ace.

  16. babylm-zho

    • huggingface.co
    Updated Oct 29, 2025
    Cite
    BabyLM Challenge (2025). babylm-zho [Dataset]. https://huggingface.co/datasets/BabyLM-community/babylm-zho
    Dataset updated
    Oct 29, 2025
    Dataset authored and provided by
    BabyLM Challenge
    License

    https://choosealicense.com/licenses/unknown/

    Description

    BabyLM Dataset

      Dataset Description
    

    This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm

      Dataset Summary

    Language: zho
    Script: Hani, Hans, Latn
    Tier: > 100M
    Byte Premium Factor: 0.935966
    Size (MB): 518.85
    Expected Size (MB): 508.23
    Number of Documents: 203,891
    Total Tokens: 137,835,046
    Tokenizer: Qwen/Qwen3-0.6B

      Tokens Per Category

    child-available-speech: 98,731,442 tokens
    child-books: 15… See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-zho.

  17. babylm-ell

    • huggingface.co
    Updated Oct 29, 2025
    Cite
    BabyLM Challenge (2025). babylm-ell [Dataset]. https://huggingface.co/datasets/BabyLM-community/babylm-ell
    Dataset updated
    Oct 29, 2025
    Dataset authored and provided by
    BabyLM Challenge
    License

    https://choosealicense.com/licenses/unknown/

    Description

    BabyLM Dataset

      Dataset Description
    

    This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm

      Dataset Summary

    Language: ell
    Script: Greek, Grek
    Tier: 10M
    Byte Premium Factor: 1.967262
    Size (MB): 106.81
    Expected Size (MB): 106.82
    Number of Documents: 11,104
    Total Tokens: 10,882,556
    Tokenizer: separate by whitespace

      Tokens Per Category

    child-available-speech: 1,673,255 tokens
    child-books: 1,390… See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-ell.

  18. babylm-ara

    • huggingface.co
    Updated Oct 29, 2025
    Cite
    BabyLM Challenge (2025). babylm-ara [Dataset]. https://huggingface.co/datasets/BabyLM-community/babylm-ara
    Dataset updated
    Oct 29, 2025
    Dataset authored and provided by
    BabyLM Challenge
    License

    https://choosealicense.com/licenses/unknown/

    Description

    BabyLM Dataset

      Dataset Description
    

    This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm

      Dataset Summary

    Language: ara
    Script: Arab
    Tier: 10M
    Byte Premium Factor: 1.465018
    Size (MB): 79.57
    Expected Size (MB): 79.55
    Number of Documents: 30,533
    Total Tokens: 8,353,682
    Tokenizer: separate by whitespace

      Tokens Per Category

    child-available-speech: 3,160,747 tokens
    child-books: 1,667,683 tokens… See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-ara.

  19. babylm-german

    • huggingface.co
    Updated Mar 17, 2025
    Cite
    Bastian Bunzeck (2025). babylm-german [Dataset]. https://huggingface.co/datasets/bbunzeck/babylm-german
    Dataset updated
    Mar 17, 2025
    Authors
    Bastian Bunzeck
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    German BabyLM dataset

    This is a pre-training dataset for training developmentally plausible language models in German (also called BabyLMs), compiled by the Computational Linguistics Group (CLAUSE) at Bielefeld University. If you are looking for ways to evaluate your German BabyLMs, we recommend our own lexical decision dataset, CLAMS for syntactic evaluation and XCOMPS for conceptual semantics/world knowledge. The composition is inspired by the original, English BabyLM dataset (see… See the full description on the dataset page: https://huggingface.co/datasets/bbunzeck/babylm-german.

  20. babylm-fas

    • huggingface.co
    Updated Oct 29, 2025
    + more versions
    Cite
    BabyLM Challenge (2025). babylm-fas [Dataset]. https://huggingface.co/datasets/BabyLM-community/babylm-fas
    Dataset updated
    Oct 29, 2025
    Dataset authored and provided by
    BabyLM Challenge
    License

    https://choosealicense.com/licenses/unknown/

    Description

    BabyLM Dataset

      Dataset Description
    

    This dataset is part of the BabyLM multilingual collection. More information at: babylm.github.io/babybabellm

      Dataset Summary

    Language: fas
    Script: Arab
    Tier: 100M
    Byte Premium Factor: 1.597326
    Size (MB): 867.30
    Expected Size (MB): 867.35
    Number of Documents: 217,776
    Total Tokens: 98,506,081
    Tokenizer: separate by whitespace

      Tokens Per Category

    child-books: 67,165 tokens
    educational: 94,320,928 tokens… See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-fas.

Cite
Niels Horn (2024). babylm-100M [Dataset]. https://huggingface.co/datasets/nilq/babylm-100M

babylm-100M

BabyLM 100M

nilq/babylm-100M

16 scholarly articles cite this dataset (View in Google Scholar)