https://choosealicense.com/licenses/odc-by/
Tulu 3 SFT Mixture
Note that this collection is licensed under the ODC-BY-1.0 license; different licenses apply to subsets of the data. Some portions of the dataset are non-commercial. We present the mixture as a research artifact. The Tulu 3 SFT mixture was used to train the Tulu 3 series of models. It contains 939,344 samples from the following sets:
CoCoNot (ODC-BY-1.0), 10,983 prompts (Brahman et al., 2024); FLAN v2 via ai2-adapt-dev/flan_v2_converted, 89,982 prompts (Longpre et… See the full description on the dataset page: https://huggingface.co/datasets/allenai/tulu-3-sft-mixture.
https://choosealicense.com/licenses/odc-by/
Note that this collection is licensed under the ODC-BY-1.0 license; different licenses apply to subsets of the data. Some portions of the dataset are non-commercial. We present the mixture as a research artifact. The OLMo 2 SFT mixture was used to train the OLMo 2 models. It contains 939,344 samples from the following sets:
CoCoNot (ODC-BY-1.0), 10,983 prompts (Brahman et al., 2024); FLAN v2 via ai2-adapt-dev/flan_v2_converted, 89,982 prompts (Longpre et al., 2023); No Robots (CC-BY-NC-4.0), 9,500… See the full description on the dataset page: https://huggingface.co/datasets/allenai/tulu-3-sft-olmo-2-mixture.
Used to train OLMo 2 32B. From the blog post:
Filtered out instructions from the SFT dataset and the chosen responses of the preference data that included mentions of a date cutoff from the synthetic data generation process. This resulted in a new version of the instruction dataset, Tulu 3 SFT Mixture 0225, and preference dataset, OLMo-2-32B-pref-mix-0325. We use majority voting to improve the quality of answers to our synthetic math questions. For our Persona MATH and Grade School Math… See the full description on the dataset page: https://huggingface.co/datasets/allenai/tulu-3-sft-olmo-2-mixture-0225.
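The majority-voting step described above is a self-consistency filter over several sampled answers per question. A minimal sketch in Python; the answer-extraction machinery is not shown, and the strict-majority acceptance rule is an assumption, not Ai2's published code:

from collections import Counter

def majority_vote(answers):
    """Keep a synthetic answer only if most sampled completions agree on it.

    `answers` holds the final answers extracted from several independently
    sampled completions for the same synthetic math question.
    """
    if not answers:
        return None
    answer, votes = Counter(answers).most_common(1)[0]
    # Require a strict majority; otherwise discard the question as unreliable.
    return answer if votes > len(answers) / 2 else None

print(majority_vote(["42", "42", "41", "42"]))  # -> "42"
print(majority_vote(["1", "2", "3"]))           # -> None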
Llama 3.1 Tulu 3 SFT reused (off-policy)
Note that this collection is licensed under the ODC-BY-1.0 license; different licenses apply to subsets of the data. Some portions of the dataset are non-commercial. We present the mixture as a research artifact. This preference dataset is part of our Tulu 3 preference mixture: it contains prompts from our SFT mixture and 96,911 generation pairs obtained using the following models:
Mistral 7B Instruct v0.2 (Apache 2.0) Mistral… See the full description on the dataset page: https://huggingface.co/datasets/allenai/tulu-3-sft-reused-off-policy.
Created with open-instruct data tools:
python scripts/data/filtering_and_updates/update_subsets.py \
    --base_ds allenai/tulu-3-sft-mixture-filter-datecutoff \
    --remove_sources ai2-adapt-dev/personahub_math_v5_regen_149960 allenai/tulu-3-sft-personas-math-grade \
    --add_ds allenai/tulu-3-sft-personas-math-filtered allenai/tulu-3-sft-personas-math-grade-filtered \
    --remove_keys prompt dataset \
    --push_to_hub \
    --repo_id allenai/tulu-3-sft-mixture-0225
https://choosealicense.com/licenses/odc-by/
OLMoE SFT Mix
The SFT mix used is an expanded version of the Tulu v2 SFT mix, with new additions for code (CodeFeedback-Filtered-Instruction), reasoning (MetaMathQA), and instruction following (No Robots and a subset of Daring Anteater). Please see the referenced datasets for the multiple licenses that apply to the constituent data. We do not introduce any new data with this dataset. Config for creation via open-instruct: dataset_mixer: allenai/tulu-v2-sft-mixture-olmo-4096: 1.0… See the full description on the dataset page: https://huggingface.co/datasets/allenai/tulu-v3.1-mix-preview-4096-OLMoE.
aeolian83/allenai-tulu-3-sft-mixture_filtered dataset hosted on Hugging Face and contributed by the HF Datasets community
Llama 3.1 Tulu 3 IF-Augmented (on-policy 8B)
Note that this collection is licensed under the ODC-BY-1.0 license; different licenses apply to subsets of the data. Some portions of the dataset are non-commercial. We present the mixture as a research artifact. This preference dataset is part of our Tulu 3 preference mixture: it contains prompts from SFT data, with constraints from https://huggingface.co/datasets/google/IFEval. It contains 65,530 generation pairs (some of which are on-policy… See the full description on the dataset page: https://huggingface.co/datasets/allenai/tulu-3-IF-augmented-on-policy-8b.
https://choosealicense.com/licenses/odc-by/
Tulu 3 8B Preference Mixture
Note that this collection is licensed under the ODC-BY-1.0 license; different licenses apply to subsets of the data. Some portions of the dataset are non-commercial. We present the mixture as a research artifact. This mix is made up of the following preference datasets:
https://huggingface.co/datasets/allenai/tulu-3-sft-reused-off-policy https://huggingface.co/datasets/allenai/tulu-3-sft-reused-on-policy-8b… See the full description on the dataset page: https://huggingface.co/datasets/allenai/llama-3.1-tulu-3-8b-preference-mixture.
Tulu 3 SFT Mixture (Sampled)
This dataset is a sampled and filtered subset of the allenai/tulu-3-sft-mixture, curated and rebalanced for structured instruction fine-tuning. The goal is to support research and model development in math reasoning, coding, knowledge recall, instruction following (IF), and conversational alignment, while explicitly excluding safety, multilingual, and certain task-specific sources.
📦 Dataset Structure
Source: Filtered from… See the full description on the dataset page: https://huggingface.co/datasets/ushakov15/MNLP_M3_rag_dataset.
allenai/tulu-3-sft-olmo-2-mixture-filter-datecutoff dataset hosted on Hugging Face and contributed by the HF Datasets community
https://choosealicense.com/licenses/odc-by/
allenai tulu-3-sft-mixture DolphinLabeled
Part of the DolphinLabeled series of datasets
Presented by Eric Hartford and Cognitive Computations
The purpose of this dataset is to enable filtering of the allenai/tulu-3-sft-mixture dataset. I have modified the original dataset using two scripts.
dedupe.py removes rows with identical final message content.
label.py adds a "flags" column containing the following boolean… See the full description on the dataset page: https://huggingface.co/datasets/QuixiAI/allenai_tulu-3-sft-mixture-DolphinLabeled.
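A minimal sketch of what dedupe.py's rule amounts to, assuming the usual Tulu chat format with a messages list of {role, content} turns (the actual script may differ):

from datasets import load_dataset

ds = load_dataset("allenai/tulu-3-sft-mixture", split="train")

seen = set()

def first_occurrence(row):
    # Key each row by the content of its final message; drop later repeats.
    final = row["messages"][-1]["content"]
    if final in seen:
        return False
    seen.add(final)
    return True

deduped = ds.filter(first_occurrence)  # single-process so `seen` stays consistent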
This is a reduced subsample of allenai/tulu-3-sft-mixture. After removing the wildchat/wildjailbreak and related subsets, each subset/source was deduplicated using a WIP script based on avsolatorio/GIST-large-Embedding-v0 (which can likely be improved). This removed a lot of the math and coding samples, so I'm currently re-running those categories to pad out their sample counts. Additional code and math samples have been added as separate .jsonl files.
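A rough sketch of embedding-based near-duplicate removal with that model via sentence-transformers; the 0.95 cosine threshold and the greedy O(n²) scan are illustrative choices, not the WIP script itself:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("avsolatorio/GIST-large-Embedding-v0")

def near_duplicates(texts, threshold=0.95):
    """Return indices of texts that are near-duplicates of an earlier text."""
    emb = model.encode(texts, normalize_embeddings=True)
    drop = set()
    for i in range(len(texts)):
        if i in drop:
            continue
        # On unit-normalized vectors, the dot product is cosine similarity.
        sims = emb[i + 1:] @ emb[i]
        for offset, s in enumerate(sims):
            if s >= threshold:
                drop.add(i + 1 + offset)
    return drop

texts = ["What is 2+2?", "What is 2 + 2 ?", "Name a prime number."]
print(near_duplicates(texts))  # likely {1}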
Llama 3.1 Tulu 3 405B Preference Mixture
Note that this collection is licensed under the ODC-BY-1.0 license; different licenses apply to subsets of the data. Some portions of the dataset are non-commercial. We present the mixture as a research artifact. This preference mixture was used for DPO on our Llama 3.1 Tulu 3 405B SFT checkpoint to obtain Llama 3.1 Tulu 3 405B DPO. It contains 360,924 generation pairs obtained using the following models:
Mistral 7B Instruct v0.2 (Apache 2.0)… See the full description on the dataset page: https://huggingface.co/datasets/allenai/llama-3.1-tulu-3-405b-preference-mixture.
Origin Datasets: allenai/tulu-3-sft-mixture
Dataset Sampling for Merge-Up SLM Training
To prepare a dataset of 100,000 samples for Merge-Up SLM training, the following steps were taken:
Filtering for English Only: We used a regular expression to filter the dataset, retaining only the samples composed exclusively of English characters. Proportional Sampling by Token Length: Starting from 4,000 tokens, we counted the number of samples in increments of 200 tokens. Based on the resulting… See the full description on the dataset page: https://huggingface.co/datasets/aeolian83/allenai_tulu_3_sft_mixture_filtered_10k_sampled.
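The card doesn't publish the exact regular expression or tokenizer, so the ASCII pattern and whitespace token count below are stand-ins; a sketch of the two steps:

import re
from collections import Counter
from datasets import load_dataset

ds = load_dataset("allenai/tulu-3-sft-mixture", split="train")

# Step 1: keep rows whose messages are pure ASCII (stand-in for the
# unpublished "English characters exclusively" regex).
ascii_only = re.compile(r"^[\x00-\x7F]*$")

def is_english_only(row):
    return all(ascii_only.match(m["content"]) for m in row["messages"])

english = ds.filter(is_english_only)

# Step 2: count samples in 200-token buckets starting at 4,000 tokens
# (a whitespace split stands in for the unspecified tokenizer).
def n_tokens(row):
    return sum(len(m["content"].split()) for m in row["messages"])

def bucket(tokens, start=4000, step=200):
    return 0 if tokens < start else 1 + (tokens - start) // step

counts = Counter(bucket(n_tokens(row)) for row in english)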
https://choosealicense.com/licenses/odc-by/
Tulu-3-Pool-Annotated
Project | GitHub | Paper | HuggingFace collection. An annotated version of tulu-3-sft-mixture, used as a data pool in MIG. The annotations include #InsTag tags, DEITA scores, and CaR scores (a loading sketch follows this entry).
Dataset Details
Tulu3 Dataset Sources
Repository: https://huggingface.co/datasets/allenai/tulu-3-sft-mixture
Paper: Tulu 3: Pushing Frontiers in Open Language Model Post-Training
MIG Dataset Sources
Repository:… See the full description on the dataset page: https://huggingface.co/datasets/xsample/tulu-3-pool-annotated.
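As a quick start for the annotated pool above, a loading sketch; the annotation column name used for filtering is a guess from the card's description, so check pool.column_names before relying on it:

from datasets import load_dataset

pool = load_dataset("xsample/tulu-3-pool-annotated", split="train")
print(pool.column_names)  # verify the actual annotation fields first

# Hypothetical column name for the DEITA quality score.
high_quality = pool.filter(lambda row: row.get("deita_score", 0.0) >= 8.0)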
Removed sources:
ai2-adapt-dev/tulu_v3.9_wildjailbreak_decontaminated_50k
ai2-adapt-dev/tulu_v3.9_synthetic_finalresp_wildguardmixtrain_decontaminated_50k
ai2-adapt-dev/coconot_converted
ai2-adapt-dev/tulu_hard_coded_repeated_10
ai2-adapt-dev/tulu_v3.9_aya_100k
ai2-adapt-dev/numinamath_tir_math_decontaminated
ai2-adapt-dev/tulu_v3.9_open_math_2_gsm8k_50k
ai2-adapt-dev/tulu_v3.9_personahub_math_interm_algebra_20k
allenai/tulu-3-sft-personas-math-filtered… See the full description on the dataset page: https://huggingface.co/datasets/PJMixers-Dev/allenai_tulu-3-sft-olmo-2-mixture-0225-filtered-ShareGPT.
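The kind of filtering implied by this list can be reproduced with the datasets library, assuming each row carries a source column as in the upstream mixture (a sketch, not the repo's actual script):

from datasets import load_dataset

REMOVED_SOURCES = {
    "ai2-adapt-dev/tulu_v3.9_wildjailbreak_decontaminated_50k",
    "ai2-adapt-dev/coconot_converted",
    # ...remaining entries from the list above
}

ds = load_dataset("allenai/tulu-3-sft-olmo-2-mixture-0225", split="train")
kept = ds.filter(lambda row: row["source"] not in REMOVED_SOURCES)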
Tulu-3-sft-mixture_scored - with OpenDataArena Scores
This dataset is a scored version of the original allenai/tulu-3-sft-mixture dataset. The scoring was performed using the OpenDataArena-Tool, a comprehensive suite of automated evaluation methods for assessing instruction-following datasets. This version of the dataset includes rich, multi-dimensional scores for both the instructions (questions) and the instruction-response pairs, allowing for highly granular data analysis and… See the full description on the dataset page: https://huggingface.co/datasets/OpenDataArena/tulu-3-sft-mixture_scored.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset has been generated from 10k prompts randomly subsampled from allenai/tulu-3-sft-mixture. The LLM used for inference is THUDM/GLM-Z1-32B-0414. Outputs that didn't contain a closing </think> token have been discarded. More details on the generation process are in this article: How to Create Reasoning Datasets with GLM-Z1 at Low Cost. Total generation cost: $17 (H100 from RunPod).
Developed by: The Kaitchup
Language(s) (NLP): English
License: Apache 2.0
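A minimal sketch of the discard rule described above, keeping only completions whose reasoning trace is properly closed:

def has_closed_reasoning(completion):
    # GLM-Z1 emits its reasoning between <think> and </think>; a missing
    # closing tag usually means the generation was truncated.
    return "</think>" in completion

outputs = [
    "<think>2 + 2 = 4.</think>The answer is 4.",
    "<think>2 + 2 =",  # truncated: discarded
]
kept = [o for o in outputs if has_closed_reasoning(o)]
print(len(kept))  # 1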
This dataset is derived from allenai/tulu-3-sft-mixture. It is intended for use as calibration data for DWQ with MLX LM and Qwen3 models. Half the data is from the original corpus. The other half of the data are synthetically generated with prompts from the original corpus and completions from Qwen3-30B-A3B. The completions include thinking traces ("reasoning_content"). The completions were generated with mlx-lm and the following code: import json from mlx_lm import load, generate from… See the full description on the dataset page: https://huggingface.co/datasets/mlx-community/qwen3_dwq_calibration_1332.
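The generation code on the card is truncated above; a minimal mlx-lm sketch of the same kind of completion loop looks like this (the model path and max_tokens are illustrative, and the chat-template call assumes a Hugging Face-style tokenizer):

from mlx_lm import load, generate

# Illustrative quantized checkpoint; the card's completions come from Qwen3-30B-A3B.
model, tokenizer = load("mlx-community/Qwen3-30B-A3B-4bit")

messages = [{"role": "user", "content": "Explain why the sky is blue."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Qwen3 interleaves a thinking trace before the answer; it appears in the text.
completion = generate(model, tokenizer, prompt=prompt, max_tokens=512)
print(completion)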