14 datasets found
  1. dclm-baseline-1.0

    • huggingface.co
    Updated Jul 22, 2024
    Cite
    ML Foundations (2024). dclm-baseline-1.0 [Dataset]. https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0
    Explore at:
    15 scholarly articles cite this dataset (View in Google Scholar)
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 22, 2024
    Dataset authored and provided by
    ML Foundations
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    DCLM-baseline

    DCLM-baseline is a 4T-token / 3B-document pretraining dataset that achieves strong performance on language model benchmarks. Below are comparisons of models trained on DCLM-baseline with other models in the 7B regime.

    Model        Params  Tokens  Open dataset?  CORE  MMLU  EXTENDED

    Open weights, closed datasets

    Llama2       7B      2T      ✗              49.2  45.8  34.1
    DeepSeek     7B      2T      ✗              50.7  48.5  35.3
    Mistral-0.3  7B      ?       ✗              57.0  62.7  45.1
    QWEN-2       7B      ?       ✗              57.5  71.9  50.5
    Llama3       8B      15T     ✗              57.6  …

    See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0.

  2. dclm-10B

    • huggingface.co
    Cite
    rg, dclm-10B [Dataset]. https://huggingface.co/datasets/robbiegwaldd/dclm-10B
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Authors
    rg
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    10 billion GPT-2 tokens extracted and parquetified from the dclm-baseline dataset. NOTE: I am not the creator nor part of the team that created the dclm-baseline dataset. This dataset follows the same license (CC BY 4.0). The original dataset can be found here: https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0

  3. dclm-baseline

    • huggingface.co
    Cite
    tokyotech-llm, dclm-baseline [Dataset]. https://huggingface.co/datasets/tokyotech-llm/dclm-baseline
    Explore at:
    Dataset authored and provided by
    tokyotech-llm
    Description

    tokyotech-llm/dclm-baseline dataset hosted on Hugging Face and contributed by the HF Datasets community

  4. dclm-baseline-1.0

    • huggingface.co
    Updated Jul 20, 2024
    + more versions
    Cite
    Orion Weller (2024). dclm-baseline-1.0 [Dataset]. https://huggingface.co/datasets/orionweller/dclm-baseline-1.0
    Explore at:
    Dataset updated
    Jul 20, 2024
    Authors
    Orion Weller
    Description

    orionweller/dclm-baseline-1.0 dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. dclm-baseline-1.0-parquet_urls

    • huggingface.co
    Updated May 15, 2025
    Cite
    Nick Hagar (2025). dclm-baseline-1.0-parquet_urls [Dataset]. http://doi.org/10.57967/hf/5454
    Explore at:
    Dataset updated
    May 15, 2025
    Authors
    Nick Hagar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for dclm-baseline-1.0-parquet_urls

    This dataset provides the URLs and top-level domains associated with training records in mlfoundations/dclm-baseline-1.0-parquet. It is part of a collection of datasets curated to make exploring LLM training datasets more straightforward and accessible.

      Dataset Details

      Dataset Description

    This dataset was created by downloading the source data, extracting URLs and top-level domains, and retaining only… See the full description on the dataset page: https://huggingface.co/datasets/nhagar/dclm-baseline-1.0-parquet_urls.
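The dataset card above describes extracting URLs and top-level domains from training records. The exact pipeline is not shown on this page, but the idea can be sketched with the standard library alone. Note this is a naive illustration: `url_to_tld` simply takes the last hostname label, whereas production pipelines typically consult the Public Suffix List to handle multi-label suffixes such as `co.uk`.

```python
from urllib.parse import urlparse

def url_to_tld(url: str) -> str:
    """Return the last dot-separated label of the URL's hostname.

    Naive sketch: real TLD extraction usually relies on the Public
    Suffix List rather than splitting on the final dot.
    """
    host = urlparse(url).netloc.lower().split(":")[0]  # drop any port
    return host.rsplit(".", 1)[-1]

# e.g. url_to_tld("https://huggingface.co/datasets/x") returns "co"
```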

  6. dclm-baseline-100M

    • huggingface.co
    Updated Jul 7, 2025
    + more versions
    Cite
    Asankhaya Sharma (2025). dclm-baseline-100M [Dataset]. https://huggingface.co/datasets/codelion/dclm-baseline-100M
    Explore at:
    Dataset updated
    Jul 7, 2025
    Authors
    Asankhaya Sharma
    Description

    codelion/dclm-baseline-100M dataset hosted on Hugging Face and contributed by the HF Datasets community

  7. dclm-baseline-10M

    • huggingface.co
    Updated Jul 7, 2025
    + more versions
    Cite
    Asankhaya Sharma (2025). dclm-baseline-10M [Dataset]. https://huggingface.co/datasets/codelion/dclm-baseline-10M
    Explore at:
    Dataset updated
    Jul 7, 2025
    Authors
    Asankhaya Sharma
    Description

    codelion/dclm-baseline-10M dataset hosted on Hugging Face and contributed by the HF Datasets community

  8. dclm-baseline-1.0-2.5k

    • huggingface.co
    Cite
    Fizz 🏳️‍⚧️, dclm-baseline-1.0-2.5k [Dataset]. https://huggingface.co/datasets/Fizzarolli/dclm-baseline-1.0-2.5k
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Authors
    Fizz 🏳️‍⚧️
    Description

    Fizzarolli/dclm-baseline-1.0-2.5k dataset hosted on Hugging Face and contributed by the HF Datasets community

  9. dclm-baseline-1.0_subset_1M

    • huggingface.co
    + more versions
    Cite
    Suhas Kotha, dclm-baseline-1.0_subset_1M [Dataset]. https://huggingface.co/datasets/kothasuhas/dclm-baseline-1.0_subset_1M
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Authors
    Suhas Kotha
    Description

    kothasuhas/dclm-baseline-1.0_subset_1M dataset hosted on Hugging Face and contributed by the HF Datasets community

  10. dclm-baseline-1.0-138k

    • huggingface.co
    Cite
    MLX Community, dclm-baseline-1.0-138k [Dataset]. https://huggingface.co/datasets/mlx-community/dclm-baseline-1.0-138k
    Explore at:
    Dataset authored and provided by
    MLX Community
    Description

    This dataset was created from DCLM. It is a small subset of 138,000 samples intended to be used as calibration data for mlx-lm quantizations. The script used to create the dataset:

    import datasets

    files = [
        "global-shard_01_of_10/local-shard_0_of_10/shard_00000009_processed.jsonl.zst",
        "global-shard_01_of_10/local-shard_0_of_10/shard_00000000_processed.jsonl.zst",
    ]
    ds = datasets.load_dataset("mlfoundations/dclm-baseline-1.0", data_files=files)
    ds = ds["train"]
    feats = … See the full description on the dataset page: https://huggingface.co/datasets/mlx-community/dclm-baseline-1.0-138k.
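The script above is truncated after `feats = …`, so the exact selection step is not reproduced here. As a hedged sketch, drawing a fixed-size, reproducible calibration subset from a list of loaded records can look like this (the function name and seed are illustrative, not part of the mlx-community script):

```python
import random

def calibration_sample(records, n=138_000, seed=0):
    """Draw a reproducible random subset for quantization calibration.

    Illustrative only: the original script's selection logic is
    truncated above, so this just shows the general subsetting idea.
    """
    rng = random.Random(seed)
    if len(records) <= n:
        return list(records)
    return rng.sample(records, n)
```

Seeding the RNG keeps the calibration set stable across runs, which matters when comparing quantization results.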

  11. dclm-baseline-1.0-final

    • huggingface.co
    Updated Aug 31, 2025
    Cite
    wangjiapeng (2025). dclm-baseline-1.0-final [Dataset]. https://huggingface.co/datasets/Lyric1010/dclm-baseline-1.0-final
    Explore at:
    Dataset updated
    Aug 31, 2025
    Authors
    wangjiapeng
    Description

    Lyric1010/dclm-baseline-1.0-final dataset hosted on Hugging Face and contributed by the HF Datasets community

  12. dclm-baseline-fasttext

    • huggingface.co
    Updated Jun 7, 2014
    Cite
    Ivan Lukic (2014). dclm-baseline-fasttext [Dataset]. https://huggingface.co/datasets/ivlu2000/dclm-baseline-fasttext
    Explore at:
    Dataset updated
    Jun 7, 2014
    Authors
    Ivan Lukic
    Description

    ivlu2000/dclm-baseline-fasttext dataset hosted on Hugging Face and contributed by the HF Datasets community

  13. datashop-science-qa

    • huggingface.co
    Updated May 19, 2025
    Cite
    The Marin Project (2025). datashop-science-qa [Dataset]. https://huggingface.co/datasets/marin-community/datashop-science-qa
    Explore at:
    Dataset updated
    May 19, 2025
    Dataset authored and provided by
    The Marin Project
    License

    https://choosealicense.com/licenses/cc/

    Description

    Dataset Card for Datashop Science QA

      Dataset Details

      Dataset Description

    This science-focused dataset was curated by applying model-based filtering to the DCLM-Baseline dataset, extracting around 40B Llama-3 tokens of data, which were later rewritten into question-answer pairs by Llama-3.1-8B-Instruct. It yields strong out-of-the-box performance for improving MMLU scores, particularly the MMLU STEM subset. We observe a +4-point increase in the MMLU STEM subset… See the full description on the dataset page: https://huggingface.co/datasets/marin-community/datashop-science-qa.
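The card describes model-based filtering: scoring each document with a classifier and keeping only those above a quality threshold. A minimal sketch of that pattern, assuming a stand-in `scorer` callable (the actual classifier and the 0.9 threshold are hypothetical, not details from the dataset card):

```python
def filter_by_score(docs, scorer, threshold=0.9):
    """Keep documents a model-based filter scores at or above threshold.

    `scorer` stands in for a real classifier (e.g. a fastText or
    LM-based science-relevance model); the threshold is illustrative.
    """
    return [d for d in docs if scorer(d) >= threshold]
```

In practice the scorer is the expensive part; the filtering itself is just a threshold over its outputs.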

  14. olmo-mix-1124

    • huggingface.co
    Updated Jun 4, 2024
    Cite
    Ai2 (2024). olmo-mix-1124 [Dataset]. https://huggingface.co/datasets/allenai/olmo-mix-1124
    Explore at:
    Dataset updated
    Jun 4, 2024
    Dataset provided by
    Allen Institute for AIhttp://allenai.org/
    Authors
    Ai2
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    OLMo 2 (November 2024) Pretraining set

    Collection of data used to train OLMo-2-1124 models. The majority of this dataset comes from DCLM-Baseline with no additional filtering, but we provide the explicit breakdowns below.

    Name             Tokens  Bytes (uncompressed)  Documents  License

    DCLM-Baseline    3.70T   21.3TB                2.95B      CC-BY-4.0
    Arxiv            20.8B   77.2GB                3.95M      ODC-BY
    pes2o            58.6B   412GB                 38M        ODC-BY
    starcoder        83.0B   458GB                 78.7M      ODC-BY
    Algebraic-stack  11.8B   44.0GB                2.83M      ODC-BY
    OpenWebMath      …

    See the full description on the dataset page: https://huggingface.co/datasets/allenai/olmo-mix-1124.
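A quick arithmetic check on the rows listed for this mix (the table is truncated, so this is a partial subtotal only) confirms the card's claim that the majority of the data comes from DCLM-Baseline:

```python
# Token counts for the listed components, in raw tokens.
# The source table is truncated after OpenWebMath, so this subtotal
# covers only the rows shown above.
components = {
    "DCLM-Baseline": 3.70e12,
    "Arxiv": 20.8e9,
    "pes2o": 58.6e9,
    "starcoder": 83.0e9,
    "Algebraic-stack": 11.8e9,
}
subtotal = sum(components.values())          # ~3.87T tokens listed
dclm_share = components["DCLM-Baseline"] / subtotal  # ~95.5% of listed tokens
```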

  15. Not seeing a result you expected?
    Learn how you can add new datasets to our index.
