babylm-nso
Dataset Description
This dataset is part of the BabyLM multilingual collection.
Dataset Summary
Language: nso
Script: Latin
Number of Documents: 26772
Total Tokens: 1067761
Tokens Per Category
child-books: 122083 tokens
child-news: 130 tokens
educational: 92589 tokens
padding-mt: 206703 tokens
padding-news: 150960 tokens
padding-wikipedia: 495296 tokens
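As a quick sanity check on the counts listed above, the following minimal sketch confirms that the per-category figures add up to the reported total and computes each category's share. The helper names (`category_shares`, `padding_share`) are hypothetical, and treating every category whose name starts with `padding-` as one group is an assumption made for illustration, not something the dataset card states.

```python
# Token counts per category, copied from the listing above.
TOKENS_PER_CATEGORY = {
    "child-books": 122083,
    "child-news": 130,
    "educational": 92589,
    "padding-mt": 206703,
    "padding-news": 150960,
    "padding-wikipedia": 495296,
}

# Total token count as reported in the Dataset Summary.
REPORTED_TOTAL = 1067761


def category_shares(counts):
    """Return each category's fraction of the summed token count."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def padding_share(counts, prefix="padding-"):
    """Fraction of tokens in categories whose name starts with `prefix`.

    Grouping by name prefix is an assumption made for this sketch.
    """
    total = sum(counts.values())
    padding = sum(n for name, n in counts.items() if name.startswith(prefix))
    return padding / total


if __name__ == "__main__":
    # The per-category counts should sum to the reported total.
    assert sum(TOKENS_PER_CATEGORY.values()) == REPORTED_TOTAL

    # Print categories from largest to smallest share.
    shares = category_shares(TOKENS_PER_CATEGORY)
    for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{name:20s} {share:6.1%}")

    print(f"{'padding-* combined':20s} {padding_share(TOKENS_PER_CATEGORY):6.1%}")
```

The sum check is the useful part: if the card is updated and the figures drift apart, the assertion fails immediately.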
Data Fields
text: The document text
category: Type of content (e.g.…

See the full description on the dataset page: https://huggingface.co/datasets/BabyLM-community/babylm-nso.