2 datasets found
  1. babylm-100M

    • huggingface.co
    Updated Feb 24, 2024
    Cite
    Niels Horn (2024). babylm-100M [Dataset]. https://huggingface.co/datasets/nilq/babylm-100M
    Authors
    Niels Horn
    Description

    BabyLM 100M

    This curated dataset originates from the BabyLM Challenge. It contains ~100M words of mixed-domain text drawn from the following sources:

    • CHILDES (child-directed speech)
    • Subtitles (speech)
    • BNC (speech)
    • TED talks (speech)
    • Children's books (simple written language)

  2. babylm-10M

    • huggingface.co
    Updated Feb 24, 2024
    Cite
    Niels Horn (2024). babylm-10M [Dataset]. https://huggingface.co/datasets/nilq/babylm-10M
    Authors
    Niels Horn
    Description

    BabyLM 10M

    This curated dataset originates from the BabyLM Challenge. It contains ~10M words of mixed-domain text drawn from the following sources:

    • CHILDES (child-directed speech)
    • Subtitles (speech)
    • BNC (speech)
    • TED talks (speech)
    • Children's books (simple written language)
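
Both listings give only approximate sizes (~100M and ~10M words). A minimal sketch of how one might check those figures is shown here, assuming the Hugging Face `datasets` library is installed, that the repository exposes a `train` split, and that each row stores its text in a column named `text`. The split and column names are assumptions not confirmed by the listing, and the helper names `count_words` and `load_babylm_texts` are hypothetical.

```python
from typing import Iterable, Iterator


def count_words(texts: Iterable[str]) -> int:
    """Count whitespace-separated tokens across an iterable of strings."""
    return sum(len(text.split()) for text in texts)


def load_babylm_texts(
    repo_id: str = "nilq/babylm-10M",
    split: str = "train",   # assumed split name
    column: str = "text",   # assumed column name
) -> Iterator[str]:
    """Yield raw text rows from the Hugging Face Hub (requires network access)."""
    # Imported lazily so that count_words can be used without `datasets` installed.
    from datasets import load_dataset

    dataset = load_dataset(repo_id, split=split)
    for row in dataset:
        yield row[column]


# Offline sanity check on a toy sample:
#   count_words(["the cat sat", "on the mat"])
# With network access, the full count could be obtained with:
#   count_words(load_babylm_texts("nilq/babylm-100M"))
```

Whitespace splitting is only one way to define a word, so the total may differ somewhat from the figure quoted in the description.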



babylm-100M

10 scholarly articles cite this dataset.