    babylm-nso

    • huggingface.co
    Updated Oct 29, 2025
    Cite
    BabyLM Challenge (2025). babylm-nso [Dataset]. https://huggingface.co/datasets/BabyLM-community/babylm-nso
    Dataset updated: Oct 29, 2025
    Dataset authored and provided by: BabyLM Challenge
    License: https://choosealicense.com/licenses/unknown/

    Description

      Dataset Description
    

    This dataset is part of the BabyLM multilingual collection.

      Dataset Summary
    

    Language: nso
    Script: Latin
    Number of Documents: 26772
    Total Tokens: 1067761

      Tokens Per Category
    

    child-books: 122083 tokens
    child-news: 130 tokens
    educational: 92589 tokens
    padding-mt: 206703 tokens
    padding-news: 150960 tokens
    padding-wikipedia: 495296 tokens
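
    As a quick consistency check, the per-category counts can be summed and compared with the reported total. The sketch below copies the numbers from this listing; the names TOKENS_PER_CATEGORY, REPORTED_TOTAL, and category_shares are illustrative and are not part of the dataset or any library.

```python
# Token counts per category, copied from the listing above.
TOKENS_PER_CATEGORY = {
    "child-books": 122083,
    "child-news": 130,
    "educational": 92589,
    "padding-mt": 206703,
    "padding-news": 150960,
    "padding-wikipedia": 495296,
}

# "Total Tokens" as reported in the Dataset Summary.
REPORTED_TOTAL = 1067761


def category_shares(counts):
    """Return each category's fraction of the summed token count."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


if __name__ == "__main__":
    total = sum(TOKENS_PER_CATEGORY.values())
    print(f"sum of categories: {total} (reported: {REPORTED_TOTAL})")
    shares = category_shares(TOKENS_PER_CATEGORY)
    # Largest categories first.
    for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{name:18s} {share:6.1%}")
```

    The six counts add up to the reported total of 1067761, and the three padding-* categories together make up about 80% of the tokens.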

      Data Fields
    

    text: The document text
    category: Type of content (e.g.…

    See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-nso
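
    A minimal sketch of how records with these two fields might be tallied by category. Several things here are assumptions rather than facts from the dataset page: whitespace splitting stands in for whatever tokenizer produced the reported counts, the split name "train" is a guess, and the helper names and sample records are hypothetical. Only the text and category fields are taken from the description above; the field list is truncated, so other fields may exist.

```python
from collections import Counter


def tokens_by_category(records):
    """Count whitespace-separated tokens per category.

    `records` is any iterable of mappings with "text" and "category" keys.
    Whitespace splitting is an assumption; the dataset's own token counts
    may have been computed with a different tokenizer.
    """
    counts = Counter()
    for record in records:
        counts[record["category"]] += len(record["text"].split())
    return counts


def load_records(split="train"):
    """Load the dataset from the Hugging Face Hub.

    Needs the `datasets` package and network access. The split name
    "train" is an assumption and is not confirmed by the listing.
    """
    from datasets import load_dataset

    return load_dataset("BabyLM-community/babylm-nso", split=split)


if __name__ == "__main__":
    # Hypothetical placeholder records, not real dataset content.
    sample = [
        {"text": "one two three", "category": "child-books"},
        {"text": "four five", "category": "educational"},
    ]
    print(dict(tokens_by_category(sample)))
```

    Calling tokens_by_category(load_records()) would give a rough per-category breakdown to compare with the counts listed under Tokens Per Category, with differences expected if the tokenization differs.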
