https://choosealicense.com/licenses/odc-by/
OLMo 2 1124 13B Preference Mixture
Note that this collection is licensed under the ODC-BY-1.0 license; different licenses apply to subsets of the data, and some portions of the dataset are non-commercial. We present the mixture as a research artifact. This mix is made up of the following on-policy preference datasets generated using a synthetic data generation pipeline similar to Tulu
Reused prompts from the SFT mix (via ai2-adapt-dev/sft_v3.9_used_on_policy_po_olmo2_13b and… See the full description on the dataset page: https://huggingface.co/datasets/allenai/olmo-2-1124-13b-preference-mix.
https://choosealicense.com/licenses/odc-by/
OLMo 2 (November 2024) Pretraining set
Collection of data used to train OLMo-2-1124 models. The majority of this dataset comes from DCLM-Baseline with no additional filtering, but we provide the explicit breakdowns below.
| Name | Tokens | Bytes (uncompressed) | Documents | License |
|---|---|---|---|---|
| DCLM-Baseline | 3.70T | 21.3TB | 2.95B | CC-BY-4.0 |
| Arxiv | 20.8B | 77.2GB | 3.95M | ODC-BY |
| pes2o | 58.6B | 412GB | 38M | ODC-BY |
| starcoder | 83.0B | 458GB | 78.7M | ODC-BY |
| Algebraic-stack | 11.8B | 44.0GB | 2.83M | ODC-BY |
OpenWebMath… See the full description on the dataset page: https://huggingface.co/datasets/allenai/olmo-mix-1124.
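To make the "majority comes from DCLM-Baseline" claim concrete, the following minimal sketch computes each source's share of tokens among the rows listed above. The row list in this excerpt is truncated, so the shares are relative to the listed rows only and are not the shares within the full mix; the helper names `parse_tokens` and `token_shares` are illustrative and not part of any dataset tooling.

```python
# Token counts copied from the rows shown above. The source list is truncated,
# so this is NOT the complete mix.
LISTED_TOKENS = {
    "DCLM-Baseline": "3.70T",
    "Arxiv": "20.8B",
    "pes2o": "58.6B",
    "starcoder": "83.0B",
    "Algebraic-stack": "11.8B",
}

# Multipliers for the suffixes used in the table (B = billion, T = trillion).
SUFFIX = {"B": 1e9, "T": 1e12}


def parse_tokens(text: str) -> float:
    """Convert a string such as '3.70T' or '20.8B' into a token count."""
    return float(text[:-1]) * SUFFIX[text[-1]]


def token_shares(rows: dict) -> dict:
    """Return each source's fraction of the total tokens across `rows`."""
    counts = {name: parse_tokens(value) for name, value in rows.items()}
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


if __name__ == "__main__":
    for name, share in token_shares(LISTED_TOKENS).items():
        print(f"{name:16s} {share:6.1%}")
```

Among the listed rows, DCLM-Baseline accounts for roughly 95% of tokens. Because further sources are omitted from this excerpt, its share of the full mix would be somewhat lower.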
jacobmorrison/olmo2-32b-combined-outputs dataset hosted on Hugging Face and contributed by the HF Datasets community
saumyamalik/olmo-2-pref-mix-no-source dataset hosted on Hugging Face and contributed by the HF Datasets community
saumyamalik/olmo-2-7b-pref-mix-delta-olmo2 dataset hosted on Hugging Face and contributed by the HF Datasets community
jacobmorrison/olmo-2-1124-7b-preference-mix-filtered-overlapping dataset hosted on Hugging Face and contributed by the HF Datasets community
scottgeng00/olmo2-delta-qwen2.5_3b_over_1.5b dataset hosted on Hugging Face and contributed by the HF Datasets community
jacobmorrison/olmo-2-1124-13b-preference-mix-randomcase dataset hosted on Hugging Face and contributed by the HF Datasets community
natolambert/rlhf-library-OLMo-2-1124-7B-DPO dataset hosted on Hugging Face and contributed by the HF Datasets community
saumyamalik/ultrafeedback-cleaned-olmo2-7b-unused-gemma3 dataset hosted on Hugging Face and contributed by the HF Datasets community
fedzbar/olmo2-13b-generated dataset hosted on Hugging Face and contributed by the HF Datasets community
Used to train OLMo 2 32B. From the blog post:
Filtered out instructions from the SFT dataset and the chosen responses of the preference data that included mentions of a date cutoff from the synthetic data generation process. This resulted in a new version of the instruction dataset, Tulu 3 SFT Mixture 0225, and preference dataset, OLMo-2-32B-pref-mix-0325. We use majority voting to improve the quality of answers to our synthetic math questions. For our Persona MATH and Grade School Math… See the full description on the dataset page: https://huggingface.co/datasets/allenai/tulu-3-sft-olmo-2-mixture-0225.
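The majority-voting step mentioned above can be illustrated with a generic sketch. The excerpt does not say how answers are sampled, normalized, or thresholded, so everything here is an assumption: the function name `majority_vote`, the whitespace-only normalization, and the `min_share` cutoff are hypothetical choices, not the authors' pipeline.

```python
from collections import Counter


def majority_vote(answers, min_share=0.5):
    """Return the most common answer if it wins more than `min_share` of the
    votes, otherwise None.

    Assumption: answers are compared as whitespace-stripped strings. A real
    math pipeline would likely need stronger normalization (e.g. treating
    '1/2' and '0.5' as the same answer).
    """
    normalized = [a.strip() for a in answers if a and a.strip()]
    if not normalized:
        return None
    answer, count = Counter(normalized).most_common(1)[0]
    # Require a strict majority so that ties do not produce a "winner".
    return answer if count / len(normalized) > min_share else None


if __name__ == "__main__":
    sampled = ["42", "42 ", "41", "42"]  # hypothetical sampled final answers
    print(majority_vote(sampled))
```

Returning `None` when no answer clears the threshold lets a caller discard questions on which the samples disagree instead of keeping a low-confidence answer.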
jacobmorrison/olmo-2-0325-32b-preference-mix-leetspeak dataset hosted on Hugging Face and contributed by the HF Datasets community
saumyamalik/olmo-2-0325-32b-preference-mix-20-pct-perturbed dataset hosted on Hugging Face and contributed by the HF Datasets community
VGraf/olmo-2-0325-32b-preference-mix dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for Evaluation run of allenai/OLMo-2-1124-7B-Instruct
Dataset automatically created during the evaluation run of model allenai/OLMo-2-1124-7B-Instruct. The dataset is composed of 38 configurations, each corresponding to one of the evaluated tasks. The dataset has been created from 1 run. Each run can be found as a specific split in each configuration, with the split named using the timestamp of the run. The "train" split is always pointing to the latest… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/allenai_OLMo-2-1124-7B-Instruct-details.
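Since each run is stored as a split named after its timestamp, the most recent run can be picked from a list of split names without relying on the "train" alias. The sketch below assumes the timestamps are zero-padded strings that sort chronologically (for example `2024_12_03T10_15_00.000000`); that format, the sample names, and the helper `latest_run_split` are assumptions for illustration and are not stated in the excerpt.

```python
def latest_run_split(split_names, aliases=("train",)):
    """Return the timestamp-named split of the most recent run.

    Assumption: split names other than the aliases are zero-padded timestamps,
    so plain string comparison orders them chronologically.
    """
    runs = [name for name in split_names if name not in aliases]
    if not runs:
        raise ValueError("no timestamp-named splits found")
    return max(runs)


if __name__ == "__main__":
    # Hypothetical split names for one configuration.
    names = ["2024_11_30T08_00_00.000000", "2024_12_03T10_15_00.000000", "train"]
    print(latest_run_split(names))
```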
Neelectric/OLMo-2-0425-1B-Instruct_DPO dataset hosted on Hugging Face and contributed by the HF Datasets community
Removed sources:
ai2-adapt-dev/tulu_v3.9_wildjailbreak_decontaminated_50k
ai2-adapt-dev/tulu_v3.9_synthetic_finalresp_wildguardmixtrain_decontaminated_50k
ai2-adapt-dev/coconot_converted
ai2-adapt-dev/tulu_hard_coded_repeated_10
ai2-adapt-dev/tulu_v3.9_aya_100k
ai2-adapt-dev/numinamath_tir_math_decontaminated
ai2-adapt-dev/tulu_v3.9_open_math_2_gsm8k_50k
ai2-adapt-dev/tulu_v3.9_personahub_math_interm_algebra_20k
allenai/tulu-3-sft-personas-math-filtered… See the full description on the dataset page: https://huggingface.co/datasets/PJMixers-Dev/allenai_tulu-3-sft-olmo-2-mixture-0225-filtered-ShareGPT.
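Filtering by source of this kind can be sketched as a simple membership test. The example assumes each row is a dictionary carrying its origin under a `source` key; that field name, the sample rows, and the helper `drop_removed_sources` are hypothetical. Only the source names that appear in full above are included, since the list is truncated in this excerpt.

```python
# Source names that appear in full in the list above; the truncated final
# entry and anything after it are deliberately left out.
REMOVED_SOURCES = {
    "ai2-adapt-dev/tulu_v3.9_wildjailbreak_decontaminated_50k",
    "ai2-adapt-dev/tulu_v3.9_synthetic_finalresp_wildguardmixtrain_decontaminated_50k",
    "ai2-adapt-dev/coconot_converted",
    "ai2-adapt-dev/tulu_hard_coded_repeated_10",
    "ai2-adapt-dev/tulu_v3.9_aya_100k",
    "ai2-adapt-dev/numinamath_tir_math_decontaminated",
    "ai2-adapt-dev/tulu_v3.9_open_math_2_gsm8k_50k",
    "ai2-adapt-dev/tulu_v3.9_personahub_math_interm_algebra_20k",
}


def drop_removed_sources(rows, removed=REMOVED_SOURCES, source_key="source"):
    """Keep only rows whose source is not in `removed`.

    Assumption: every row is a dict and stores its origin under `source_key`.
    """
    return [row for row in rows if row.get(source_key) not in removed]


if __name__ == "__main__":
    # Hypothetical rows for illustration only.
    rows = [
        {"id": 1, "source": "ai2-adapt-dev/coconot_converted"},
        {"id": 2, "source": "example/kept_source"},
    ]
    print(drop_removed_sources(rows))
```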
saumyamalik/daringanteater-prefs-olmo2-7b-unused-gemma3 dataset hosted on Hugging Face and contributed by the HF Datasets community
jacobmorrison/olmo-2-0325-32b-preference-mix-mis-sense dataset hosted on Hugging Face and contributed by the HF Datasets community