BabyLM 100M
This curated dataset comes from the BabyLM Challenge. It contains ~100M words of mixed-domain text drawn from the following sources:
CHILDES (child-directed speech), Subtitles (speech), BNC (speech), TED talks (speech), children's books (simple written language)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Evaluation data for the BabyLM Challenge. We keep only examples in which every word appears at least twice in our strict-small dataset.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
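The vocabulary filter described for the BabyLM evaluation data above (keep an example only if each of its words occurs at least twice in the training corpus) might look roughly like the sketch below. The listing does not say how words are tokenized or normalized, so the lowercasing and whitespace splitting here are assumptions, and the function names `build_vocab` and `filter_examples` are hypothetical.

```python
from collections import Counter


def build_vocab(corpus_lines, min_count=2):
    """Return the set of words seen at least `min_count` times in the corpus.

    Assumption: words are lowercased and split on whitespace; the actual
    dataset may use a different tokenization.
    """
    counts = Counter(
        word for line in corpus_lines for word in line.lower().split()
    )
    return {word for word, count in counts.items() if count >= min_count}


def filter_examples(examples, vocab):
    """Keep only the examples whose every word is in `vocab`."""
    return [
        example
        for example in examples
        if all(word in vocab for word in example.lower().split())
    ]


# Toy stand-in for the strict-small training corpus.
corpus = ["the cat sat", "the cat ran", "a dog sat"]
vocab = build_vocab(corpus)

# "dog" occurs only once in the toy corpus, so the second example is dropped.
kept = filter_examples(["the cat sat", "the dog sat", "cat sat"], vocab)
```

Building the vocabulary once and reusing it as a set keeps each membership check constant-time, which matters when the evaluation set is large.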
German BabyLM dataset
This is a pre-training dataset for training developmentally plausible language models in German (also called BabyLMs), compiled by the Computational Linguistics Group (CLAUSE) at Bielefeld University. If you are looking for ways to evaluate your German BabyLMs, we recommend our own lexical decision dataset, CLAMS for syntactic evaluation and XCOMPS for conceptual semantics/world knowledge. The composition is inspired by the original, English BabyLM dataset (see… See the full description on the dataset page: https://huggingface.co/datasets/bbunzeck/babylm-german.
qing-yao/slightly-cleaner-babylm dataset hosted on Hugging Face and contributed by the HF Datasets community
@misc{charpentier2024gptbertboth,
  title={GPT or BERT: why not both?},
  author={Lucas Georges Gabriel Charpentier and David Samuel},
  year={2024},
  eprint={2410.24159},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2410.24159},
}
bbunzeck/phoneme-babylm-10M dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for "babylm-10M-wikipedia"
More Information needed
kanishka/counterfactual-babylm-only_measure_nps_as_singular_removal dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
SLING: Sino-Linguistic Evaluation of Large Language Models
This is the official SLING dataset, accompanying the EMNLP 2022 paper "SLING: Sino-Linguistic Evaluation of Large Language Models" by Yixiao Song, Kalpesh Krishna, Rajesh Bhatt, and Mohit Iyyer. The paper is available on arXiv. We use this dataset to evaluate a small-scale Chinese language model for the BabyLM Challenge.
SLING Dataset
See SLING_Data and the readme file in it. A complete list of all… See the full description on the dataset page: https://huggingface.co/datasets/suchirsalhan/SLING.