4 datasets found
  1. MADLAD-400

    • huggingface.co
    • opendatalab.com
    Updated Oct 30, 2023
    Cite
    Ai2 (2023). MADLAD-400 [Dataset]. https://huggingface.co/datasets/allenai/MADLAD-400
    Dataset updated
    Oct 30, 2023
    Dataset provided by
    Allen Institute for AI (http://allenai.org/)
    Authors
    Ai2
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    MADLAD-400

      Dataset and Introduction
    

    MADLAD-400 (Multilingual Audited Dataset: Low-resource And Document-level) is a document-level multilingual dataset based on Common Crawl, covering 419 languages in total. It uses all snapshots of Common Crawl available as of August 1, 2022. Its primary advantages over similar datasets are that it is more multilingual (419 languages), audited and more highly filtered, and document-level. The main disadvantage… See the full description on the dataset page: https://huggingface.co/datasets/allenai/MADLAD-400.

  2. madlad-400-udmurt

    • huggingface.co
    Cite
    udmurtNLP, madlad-400-udmurt [Dataset]. https://huggingface.co/datasets/udmurtNLP/madlad-400-udmurt
    Dataset authored and provided by
    udmurtNLP
    Description

    Usage madlad-400-udmurt

    from datasets import load_dataset

    dataset = load_dataset("udmurtNLP/madlad-400-udmurt")

  3. mosaic-madlad-400-ms

    • huggingface.co
    Updated Nov 30, 2023
    Cite
    Malaysia AI (2023). mosaic-madlad-400-ms [Dataset]. https://huggingface.co/datasets/malaysia-ai/mosaic-madlad-400-ms
    Dataset updated
    Nov 30, 2023
    Dataset authored and provided by
    Malaysia AI
    Description

    Mosaic format for extra dataset to train Malaysian LLM

    This repository stores dataset shards in Mosaic format.

    The shards were prepared with https://github.com/malaysia-ai/dedup-text-dataset/blob/main/pretrain-llm/combine-madlad-400-ms.ipynb using the tokenizer https://huggingface.co/malaysia-ai/bpe-tokenizer and a context length of 4096.

      how-to
    

    Clone the repository:

    git lfs clone https://huggingface.co/datasets/malaysia-ai/mosaic-madlad-400-ms

    Then load it:

    from streaming import LocalDataset
    import…

    See the full description on the dataset page: https://huggingface.co/datasets/malaysia-ai/mosaic-madlad-400-ms.

  4. shisa-pretrain-en-ja-v1

    • huggingface.co
    Updated Dec 7, 2023
    Cite
    AUGMXNT (2023). shisa-pretrain-en-ja-v1 [Dataset]. https://huggingface.co/datasets/augmxnt/shisa-pretrain-en-ja-v1
    Dataset updated
    Dec 7, 2023
    Dataset authored and provided by
    AUGMXNT
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    This pre-training dataset was created for shisa-base-7b-v1. It is primarily composed of a DSIR sampling of MADLAD-400 JA/EN tokens in a 90%/10% ratio.
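
    The 90%/10% token ratio described above can be illustrated with a minimal sketch. It is not DSIR itself (DSIR selects documents by importance resampling) and it is not the dataset authors' pipeline: it only shows one way to interleave two document pools so that the token shares approach a target split. The function name, its parameters, and the whitespace token counter are all assumptions made for illustration; a real pipeline would count tokens with the model's own tokenizer.

```python
from typing import Callable, Iterable, List, Tuple


def whitespace_token_count(text: str) -> int:
    """Stand-in token counter (assumption): splits on whitespace only."""
    return len(text.split())


def mix_by_token_ratio(
    primary_docs: Iterable[str],
    secondary_docs: Iterable[str],
    primary_share: float = 0.9,
    token_budget: int = 1000,
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> Tuple[List[str], int, int]:
    """Greedily interleave two pools until token_budget tokens are collected.

    At each step the next document is drawn from the primary pool while its
    share of collected tokens is below primary_share, otherwise from the
    secondary pool. Stops early if the chosen pool runs out of documents.
    """
    primary_iter, secondary_iter = iter(primary_docs), iter(secondary_docs)
    mixed: List[str] = []
    primary_tokens = 0
    secondary_tokens = 0

    while primary_tokens + secondary_tokens < token_budget:
        total = primary_tokens + secondary_tokens
        take_primary = total == 0 or primary_tokens / total < primary_share
        doc = next(primary_iter if take_primary else secondary_iter, None)
        if doc is None:
            break  # the pool we needed is exhausted
        n_tokens = count_tokens(doc)
        if take_primary:
            primary_tokens += n_tokens
        else:
            secondary_tokens += n_tokens
        mixed.append(doc)

    return mixed, primary_tokens, secondary_tokens
```

    Calling mix_by_token_ratio(ja_docs, en_docs) with Japanese documents as the primary pool and English documents as the secondary pool returns the mixed list together with the token count drawn from each pool, so the achieved split can be checked against the target.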


MADLAD-400

allenai/MADLAD-400

182 scholarly articles cite this dataset (View in Google Scholar)
