https://choosealicense.com/licenses/odc-by/
MADLAD-400
Dataset and Introduction
MADLAD-400 (Multilingual Audited Dataset: Low-resource And Document-level) is a document-level multilingual dataset based on Common Crawl, covering 419 languages in total. This uses all snapshots of CommonCrawl available as of August 1, 2022. The primary advantage of this dataset over similar datasets is that it is more multilingual (419 languages), it is audited and more highly filtered, and it is document-level. The main disadvantage… See the full description on the dataset page: https://huggingface.co/datasets/allenai/MADLAD-400.
Usage madlad-400-udmurt
from datasets import load_dataset
dataset = load_dataset("udmurtNLP/madlad-400-udmurt")
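To look at a few records without downloading the whole dataset, the load call above can be combined with streaming mode. This is only a sketch: the `train` split name is an assumption not confirmed by the dataset card, and `peek` and `load_udmurt_stream` are hypothetical helper names.

```python
from itertools import islice


def peek(records, n=3):
    """Return the first n items of any iterable as a list."""
    return list(islice(records, n))


def load_udmurt_stream():
    """Open the dataset lazily instead of downloading it in full."""
    # Imported here so that `peek` stays usable without the `datasets` package.
    from datasets import load_dataset

    # Assumption: the repository exposes a "train" split.
    return load_dataset(
        "udmurtNLP/madlad-400-udmurt", split="train", streaming=True
    )
```

Calling `peek(load_udmurt_stream())` would then fetch only the first three records; the field names in each record depend on the dataset and are not assumed here.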
Mosaic format for extra dataset to train Malaysian LLM
This repository stores dataset shards in Mosaic format.
Prepared at https://github.com/malaysia-ai/dedup-text-dataset/blob/main/pretrain-llm/combine-madlad-400-ms.ipynb using the tokenizer https://huggingface.co/malaysia-ai/bpe-tokenizer with a context length of 4096.
How to use
Clone the repository:
git lfs clone https://huggingface.co/datasets/malaysia-ai/mosaic-madlad-400-ms
Load it:
from streaming import LocalDataset import… See the full description on the dataset page: https://huggingface.co/datasets/malaysia-ai/mosaic-madlad-400-ms.
https://choosealicense.com/licenses/odc-by/
This pre-training dataset was created for shisa-base-7b-v1. It is primarily composed of a DSIR sampling of MADLAD-400 JA/EN tokens in a 90%/10% ratio.
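As a rough illustration of what a 90%/10% token ratio means for a fixed budget, the arithmetic can be sketched as follows. This does not implement DSIR sampling and is not the authors' pipeline; the function name and the example budget are hypothetical, and only the JA/EN labels and the 90/10 split come from the description above.

```python
def split_token_budget(total_tokens, percents):
    """Split an integer token budget by whole-number percentages.

    Any rounding remainder is assigned to the largest share so the
    parts always sum back to total_tokens.
    """
    if sum(percents.values()) != 100:
        raise ValueError("percentages must sum to 100")
    budget = {name: total_tokens * pct // 100 for name, pct in percents.items()}
    largest = max(percents, key=percents.get)
    budget[largest] += total_tokens - sum(budget.values())
    return budget


# Hypothetical budget of 1,000 tokens, split 90% JA / 10% EN.
example = split_token_budget(1000, {"ja": 90, "en": 10})
```

With integer percentages the split avoids floating-point rounding, which matters when the per-language counts must add up exactly to the budget.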