9 datasets found
  1. babylm-100M

    • huggingface.co
    Updated Feb 24, 2024
    Niels Horn (2024). babylm-100M [Dataset]. https://huggingface.co/datasets/nilq/babylm-100M
    Authors
    Niels Horn
    Description

    BabyLM 100M

    This curated dataset comes from the BabyLM Challenge. It contains ~100M words from mixed domains, drawn from the following sources:

    CHILDES (child-directed speech), Subtitles (speech), BNC (speech), TED talks (speech), and children's books (simple written language).

  2. BabyLM Evaluation Data

    • zenodo.org
    zip
    Updated Mar 21, 2023
    Aaron Mueller; Alex Warstadt; Ethan Gotlieb Wilcox; Leshem Choshen; Chengxu Zhuang; Haokun Liu (2023). BabyLM Evaluation Data [Dataset]. http://doi.org/10.5281/zenodo.7754565
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Aaron Mueller; Alex Warstadt; Ethan Gotlieb Wilcox; Leshem Choshen; Chengxu Zhuang; Haokun Liu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Evaluation data for the BabyLM Challenge. We filter for examples where each word has appeared in our strict-small dataset at least twice.
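
The vocabulary filter described above can be sketched roughly as follows. This is only an illustration, not the challenge organizers' actual code: the function names, the lowercased whitespace tokenization, and the `min_count` parameter are all assumptions made for the example.

```python
from collections import Counter


def build_word_counts(corpus_lines):
    """Count word occurrences in the training corpus.

    Assumption: words are lowercased and split on whitespace; the real
    preprocessing may tokenize differently.
    """
    counts = Counter()
    for line in corpus_lines:
        counts.update(line.lower().split())
    return counts


def filter_examples(examples, counts, min_count=2):
    """Keep only examples in which every word occurs at least min_count times.

    min_count=2 mirrors "at least twice" in the description; the parameter
    itself is hypothetical.
    """
    return [
        example
        for example in examples
        if all(counts[word] >= min_count for word in example.lower().split())
    ]
```

Note that an empty example has no words to check, so this sketch keeps it; a real pipeline would likely handle that case explicitly.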

  3. babylm-german

    • huggingface.co
    Updated Mar 17, 2025
    Bastian Bunzeck (2025). babylm-german [Dataset]. https://huggingface.co/datasets/bbunzeck/babylm-german
    Authors
    Bastian Bunzeck
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    German BabyLM dataset

    This is a pre-training dataset for training developmentally plausible language models (also called BabyLMs) in German, compiled by the Computational Linguistics Group (CLAUSE) at Bielefeld University. To evaluate your German BabyLMs, we recommend our own lexical decision dataset, CLAMS for syntactic evaluation, and XCOMPS for conceptual semantics/world knowledge. The composition is inspired by the original English BabyLM dataset (see… See the full description on the dataset page: https://huggingface.co/datasets/bbunzeck/babylm-german.

  4. slightly-cleaner-babylm

    • huggingface.co
    Updated Mar 24, 2025
    Qing Yao (2025). slightly-cleaner-babylm [Dataset]. https://huggingface.co/datasets/qing-yao/slightly-cleaner-babylm
    Authors
    Qing Yao
    Description

    qing-yao/slightly-cleaner-babylm dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. babylm-2024-baby-cosmo-fine-100m

    • huggingface.co
    Updated Sep 17, 2024
    Language Technology Group (University of Oslo) (2024). babylm-2024-baby-cosmo-fine-100m [Dataset]. https://huggingface.co/datasets/ltg/babylm-2024-baby-cosmo-fine-100m
    Dataset authored and provided by
    Language Technology Group (University of Oslo)
    Description

    @misc{charpentier2024gptbertboth,
      title={GPT or BERT: why not both?},
      author={Lucas Georges Gabriel Charpentier and David Samuel},
      year={2024},
      eprint={2410.24159},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.24159},
    }

  6. phoneme-babylm-10M

    • huggingface.co
    Updated Jan 27, 2025
    Bastian Bunzeck (2025). phoneme-babylm-10M [Dataset]. https://huggingface.co/datasets/bbunzeck/phoneme-babylm-10M
    Authors
    Bastian Bunzeck
    Description

    bbunzeck/phoneme-babylm-10M dataset hosted on Hugging Face and contributed by the HF Datasets community

  7. babylm-10M-wikipedia

    • huggingface.co
    babylm-10M-wikipedia [Dataset]. https://huggingface.co/datasets/deven367/babylm-10M-wikipedia
    Authors
    Deven Mistry
    Description

    Dataset Card for "babylm-10M-wikipedia"

    More Information needed

  8. counterfactual-babylm-only_measure_nps_as_singular_removal

    • huggingface.co
    Updated Dec 13, 2023
    counterfactual-babylm-only_measure_nps_as_singular_removal [Dataset]. https://huggingface.co/datasets/kanishka/counterfactual-babylm-only_measure_nps_as_singular_removal
    Authors
    Kanishka Misra
    Description

    kanishka/counterfactual-babylm-only_measure_nps_as_singular_removal dataset hosted on Hugging Face and contributed by the HF Datasets community

  9. SLING

    • huggingface.co
    Updated Apr 19, 2024
    Suchir Salhan (2024). SLING [Dataset]. https://huggingface.co/datasets/suchirsalhan/SLING
    Authors
    Suchir Salhan
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    SLING: Sino-Linguistic Evaluation of Large Language Models

    This is the official SLING dataset, accompanying the EMNLP 2022 paper "SLING: Sino-Linguistic Evaluation of Large Language Models" by Yixiao Song, Kalpesh Krishna, Rajesh Bhatt, and Mohit Iyyer. The paper is available on arXiv. We use this dataset to evaluate a small-scale Chinese language model for the BabyLM Challenge.

    SLING Dataset

    See SLING_Data and the readme file in it. A complete list of all… See the full description on the dataset page: https://huggingface.co/datasets/suchirsalhan/SLING.


babylm-100M (nilq/babylm-100M): 10 scholarly articles cite this dataset.