Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
DCLM-baseline
DCLM-baseline is a 4T token / 3B document pretraining dataset that achieves strong performance on language model benchmarks. Below are comparisons of models trained on DCLM-baseline with other models in the 7B regime.
Model | Params | Tokens | Open dataset? | CORE | MMLU | EXTENDED
Open weights, closed datasets
Llama2 | 7B | 2T | ✗ | 49.2 | 45.8 | 34.1
DeepSeek | 7B | 2T | ✗ | 50.7 | 48.5 | 35.3
Mistral-0.3 | 7B | ? | ✗ | 57.0 | 62.7 | 45.1
QWEN-2 | 7B | ? | ✗ | 57.5 | 71.9 | 50.5
Llama3 | 8B | 15T | ✗ | 57.6… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0.
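For a quick look at the corpus without downloading all 4T tokens, the dataset can be streamed with the Hugging Face datasets library. This is a minimal sketch, not part of the official card; it assumes each record exposes a "text" field.

import itertools
from datasets import load_dataset

# Stream the corpus so nothing is downloaded up front.
ds = load_dataset("mlfoundations/dclm-baseline-1.0", split="train", streaming=True)

# Peek at the first few documents (field name assumed to be "text").
for record in itertools.islice(ds, 3):
    print(record["text"][:200])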
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
10 billion GPT-2 tokens extracted from the dclm-baseline dataset and converted to Parquet. NOTE: I am not the creator nor part of the team that created the dclm-baseline dataset. This dataset follows the same license (CC BY 4.0). The original dataset can be found here: https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0
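The card does not include the extraction script; the sketch below shows how such a GPT-2 token dump could be produced. The tokenizer choice, source column name, and token budget here are illustrative assumptions, not the author's exact pipeline.

import pyarrow as pa
import pyarrow.parquet as pq
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
ds = load_dataset("mlfoundations/dclm-baseline-1.0", split="train", streaming=True)

# Tokenize streamed documents until a (small, illustrative) token budget is met.
rows, total, budget = [], 0, 1_000_000
for record in ds:
    ids = tokenizer(record["text"])["input_ids"]
    rows.append({"input_ids": ids})
    total += len(ids)
    if total >= budget:
        break

pq.write_table(pa.Table.from_pylist(rows), "dclm_gpt2_tokens.parquet")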
tokyotech-llm/dclm-baseline dataset hosted on Hugging Face and contributed by the HF Datasets community
orionweller/dclm-baseline-1.0 dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for dclm-baseline-1.0-parquet_urls
This dataset provides the URLs and top-level domains associated with training records in mlfoundations/dclm-baseline-1.0-parquet. It is part of a collection of datasets curated to make exploring LLM training datasets more straightforward and accessible.
Dataset Details
Dataset Description
This dataset was created by downloading the source data, extracting URLs and top-level domains, and retaining only… See the full description on the dataset page: https://huggingface.co/datasets/nhagar/dclm-baseline-1.0-parquet_urls.
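The card summarizes the process rather than the code; a simplified sketch of the URL and domain extraction step is below. It assumes the parquet records expose a "url" column and uses tldextract for domain parsing; neither detail is confirmed by the card.

import tldextract
from datasets import load_dataset

# Stream the source parquet dataset and keep only URL-level fields.
ds = load_dataset("mlfoundations/dclm-baseline-1.0-parquet", split="train", streaming=True)

def url_fields(record):
    parts = tldextract.extract(record["url"])  # column name assumed
    return {"url": record["url"], "domain": parts.registered_domain, "suffix": parts.suffix}

for i, record in enumerate(ds):
    print(url_fields(record))
    if i == 4:
        break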
codelion/dclm-baseline-100M dataset hosted on Hugging Face and contributed by the HF Datasets community
codelion/dclm-baseline-10M dataset hosted on Hugging Face and contributed by the HF Datasets community
Fizzarolli/dclm-baseline-1.0-2.5k dataset hosted on Hugging Face and contributed by the HF Datasets community
kothasuhas/dclm-baseline-1.0_subset_1M dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset was created from DCLM. It is a small subset of 138,000 samples intended to be used as calibration data for mlx-lm quantizations. The script used to create the dataset:

import datasets

files = [
    "global-shard_01_of_10/local-shard_0_of_10/shard_00000009_processed.jsonl.zst",
    "global-shard_01_of_10/local-shard_0_of_10/shard_00000000_processed.jsonl.zst",
]
ds = datasets.load_dataset("mlfoundations/dclm-baseline-1.0", data_files=files)
ds = ds["train"]
feats =… See the full description on the dataset page: https://huggingface.co/datasets/mlx-community/dclm-baseline-1.0-138k.
Lyric1010/dclm-baseline-1.0-final dataset hosted on Hugging Face and contributed by the HF Datasets community
ivlu2000/dclm-baseline-fasttext dataset hosted on Hugging Face and contributed by the HF Datasets community
https://choosealicense.com/licenses/cc/
Dataset Card for Datashop Science QA
Dataset Details
Dataset Description
This science-focused dataset was curated by applying model-based filtering to the DCLM Baseline dataset, extracting around 40B Llama-3 tokens of data, which were later rewritten into QA-pair format by Llama-3.1-8B-Instruct. It yields strong out-of-the-box performance for improving MMLU scores, particularly on the MMLU STEM subset. We observe a +4 point increase in the MMLU STEM subset… See the full description on the dataset page: https://huggingface.co/datasets/marin-community/datashop-science-qa.
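The card describes the pipeline (model-based filtering of DCLM Baseline, then rewriting into QA pairs with Llama-3.1-8B-Instruct) without code. The sketch below illustrates only the rewriting step; the prompt wording, generation settings, and use of the transformers pipeline are assumptions rather than the authors' actual setup.

from transformers import pipeline

# Illustrative stand-in for the QA rewriting stage; not the authors' pipeline.
rewriter = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")

def rewrite_to_qa(passage: str) -> str:
    prompt = (
        "Rewrite the following science passage as a question-and-answer pair.\n\n"
        f"Passage:\n{passage}\n\nQ:"
    )
    out = rewriter(prompt, max_new_tokens=256, do_sample=False)
    return out[0]["generated_text"]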
https://choosealicense.com/licenses/odc-by/
OLMo 2 (November 2024) Pretraining set
Collection of data used to train OLMo-2-1124 models. The majority of this dataset comes from DCLM-Baseline with no additional filtering, but we provide the explicit breakdowns below.
Name | Tokens | Bytes (uncompressed) | Documents | License
DCLM-Baseline | 3.70T | 21.3TB | 2.95B | CC-BY-4.0
Arxiv | 20.8B | 77.2GB | 3.95M | ODC-BY
pes2o | 58.6B | 412GB | 38M | ODC-BY
starcoder | 83.0B | 458GB | 78.7M | ODC-BY
Algebraic-stack | 11.8B | 44.0GB | 2.83M | ODC-BY
OpenWebMath… See the full description on the dataset page: https://huggingface.co/datasets/allenai/olmo-mix-1124.