Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
DCLM-baseline
DCLM-baseline is a 4T token / 3B document pretraining dataset that achieves strong performance on language model benchmarks. Below are comparisons of models trained on DCLM-baseline with other models in the 7B regime.
Model | Params | Tokens | Open dataset? | CORE | MMLU | EXTENDED
Open weights, closed datasets
Llama2 | 7B | 2T | ✗ | 49.2 | 45.8 | 34.1
DeepSeek | 7B | 2T | ✗ | 50.7 | 48.5 | 35.3
Mistral-0.3 | 7B | ? | ✗ | 57.0 | 62.7 | 45.1
QWEN-2 | 7B | ? | ✗ | 57.5 | 71.9 | 50.5
Llama3 | 8B | 15T | ✗ | 57.6… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0.
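For a quick look at the corpus without downloading all 4T tokens, the dataset can be streamed with the Hugging Face datasets library. This is a minimal sketch, not part of the official card; it assumes each record exposes a "text" field.

import itertools
from datasets import load_dataset

# Stream the corpus so nothing is downloaded up front.
ds = load_dataset("mlfoundations/dclm-baseline-1.0", split="train", streaming=True)

# Peek at the first few documents (field name assumed to be "text").
for record in itertools.islice(ds, 3):
    print(record["text"][:200])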
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
10 billion GPT-2 tokens extracted from the dclm-baseline dataset and converted to Parquet. NOTE: I am not the creator nor part of the team that created the dclm-baseline dataset. This dataset follows the same license (CC BY 4.0). The original dataset can be found here: https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0
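The card does not include the extraction script; the sketch below shows how such a GPT-2 token dump could be produced. The tokenizer choice, source column name, and token budget here are illustrative assumptions, not the author's exact pipeline.

import pyarrow as pa
import pyarrow.parquet as pq
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
ds = load_dataset("mlfoundations/dclm-baseline-1.0", split="train", streaming=True)

# Tokenize streamed documents until a (small, illustrative) token budget is met.
rows, total, budget = [], 0, 1_000_000
for record in ds:
    ids = tokenizer(record["text"])["input_ids"]
    rows.append({"input_ids": ids})
    total += len(ids)
    if total >= budget:
        break

pq.write_table(pa.Table.from_pylist(rows), "dclm_gpt2_tokens.parquet")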
tokyotech-llm/dclm-baseline dataset hosted on Hugging Face and contributed by the HF Datasets community
orionweller/dclm-baseline-1.0 dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for dclm-baseline-1.0-parquet_urls
This dataset provides the URLs and top-level domains associated with training records in mlfoundations/dclm-baseline-1.0-parquet. It is part of a collection of datasets curated to make exploring LLM training datasets more straightforward and accessible.
Dataset Details
Dataset Description
This dataset was created by downloading the source data, extracting URLs and top-level domains, and retaining only… See the full description on the dataset page: https://huggingface.co/datasets/nhagar/dclm-baseline-1.0-parquet_urls.
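The card summarizes the process rather than the code; a simplified sketch of the URL and domain extraction step is below. It assumes the parquet records expose a "url" column and uses tldextract for domain parsing; neither detail is confirmed by the card.

import tldextract
from datasets import load_dataset

# Stream the source parquet dataset and keep only URL-level fields.
ds = load_dataset("mlfoundations/dclm-baseline-1.0-parquet", split="train", streaming=True)

def url_fields(record):
    parts = tldextract.extract(record["url"])  # column name assumed
    return {"url": record["url"], "domain": parts.registered_domain, "suffix": parts.suffix}

for i, record in enumerate(ds):
    print(url_fields(record))
    if i == 4:
        break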
codelion/dclm-baseline-100M dataset hosted on Hugging Face and contributed by the HF Datasets community
codelion/dclm-baseline-10M dataset hosted on Hugging Face and contributed by the HF Datasets community
Fizzarolli/dclm-baseline-1.0-2.5k dataset hosted on Hugging Face and contributed by the HF Datasets community
kothasuhas/dclm-baseline-1.0_subset_1M dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset was created from DCLM. It is a small subset of 138,000 samples intended to be used as calibration data for mlx-lm quantizations. The script used to create the dataset:

import datasets

files = [
    "global-shard_01_of_10/local-shard_0_of_10/shard_00000009_processed.jsonl.zst",
    "global-shard_01_of_10/local-shard_0_of_10/shard_00000000_processed.jsonl.zst",
]
ds = datasets.load_dataset("mlfoundations/dclm-baseline-1.0", data_files=files)
ds = ds["train"]
feats =… See the full description on the dataset page: https://huggingface.co/datasets/mlx-community/dclm-baseline-1.0-138k.
Lyric1010/dclm-baseline-1.0-final dataset hosted on Hugging Face and contributed by the HF Datasets community
ivlu2000/dclm-baseline-fasttext dataset hosted on Hugging Face and contributed by the HF Datasets community
https://choosealicense.com/licenses/cc/
Dataset Card for Datashop Science QA
Dataset Details
Dataset Description
This science-focused dataset was curated by applying model-based filtering to the DCLM Baseline dataset, extracting around 40B Llama-3 tokens of data, which were later rewritten into QA-pair format by Llama-3.1-8B-Instruct. It yields strong out-of-the-box performance for improving MMLU scores, particularly on the MMLU STEM subset. We observe a +4 point increase in the MMLU STEM subset… See the full description on the dataset page: https://huggingface.co/datasets/marin-community/datashop-science-qa.
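The card describes the pipeline (model-based filtering of DCLM Baseline, then rewriting into QA pairs with Llama-3.1-8B-Instruct) without code. The sketch below illustrates only the rewriting step; the prompt wording, generation settings, and use of the transformers pipeline are assumptions rather than the authors' actual setup.

from transformers import pipeline

# Illustrative stand-in for the QA rewriting stage; not the authors' pipeline.
rewriter = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")

def rewrite_to_qa(passage: str) -> str:
    prompt = (
        "Rewrite the following science passage as a question-and-answer pair.\n\n"
        f"Passage:\n{passage}\n\nQ:"
    )
    out = rewriter(prompt, max_new_tokens=256, do_sample=False)
    return out[0]["generated_text"]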
https://choosealicense.com/licenses/odc-by/
OLMo 2 (November 2024) Pretraining set
Collection of data used to train OLMo-2-1124 models. The majority of this dataset comes from DCLM-Baseline with no additional filtering, but we provide the explicit breakdowns below.
Name | Tokens | Bytes (uncompressed) | Documents | License
DCLM-Baseline | 3.70T | 21.3TB | 2.95B | CC-BY-4.0
Arxiv | 20.8B | 77.2GB | 3.95M | ODC-BY
pes2o | 58.6B | 412GB | 38M | ODC-BY
starcoder | 83.0B | 458GB | 78.7M | ODC-BY
Algebraic-stack | 11.8B | 44.0GB | 2.83M | ODC-BY
OpenWebMath… See the full description on the dataset page: https://huggingface.co/datasets/allenai/olmo-mix-1124.