https://choosealicense.com/licenses/odc-by/
Tulu 3 SFT Mixture
Note that this collection is licensed under the ODC-BY-1.0 license; different licenses apply to subsets of the data. Some portions of the dataset are non-commercial. We present the mixture as a research artifact. The Tulu 3 SFT mixture was used to train the Tulu 3 series of models. It contains 939,344 samples from the following sets:
CoCoNot (ODC-BY-1.0), 10,983 prompts (Brahman et al., 2024); FLAN v2 via ai2-adapt-dev/flan_v2_converted, 89,982 prompts (Longpre et… See the full description on the dataset page: https://huggingface.co/datasets/allenai/tulu-3-sft-mixture.
https://choosealicense.com/licenses/odc-by/
Note that this collection is licensed under the ODC-BY-1.0 license; different licenses apply to subsets of the data. Some portions of the dataset are non-commercial. We present the mixture as a research artifact. The OLMo 2 SFT mixture was used to train the OLMo 2 models. It contains 939,344 samples from the following sets:
CoCoNot (ODC-BY-1.0), 10,983 prompts (Brahman et al., 2024); FLAN v2 via ai2-adapt-dev/flan_v2_converted, 89,982 prompts (Longpre et al., 2023); No Robots (CC-BY-NC-4.0), 9,500… See the full description on the dataset page: https://huggingface.co/datasets/allenai/tulu-3-sft-olmo-2-mixture.
Used to train OLMo 2 32B. From the blog post:
Filtered out instructions from the SFT dataset and the chosen responses of the preference data that included mentions of a date cutoff from the synthetic data generation process. This resulted in a new version of the instruction dataset, Tulu 3 SFT Mixture 0225, and preference dataset, OLMo-2-32B-pref-mix-0325. We use majority voting to improve the quality of answers to our synthetic math questions. For our Persona MATH and Grade School Math… See the full description on the dataset page: https://huggingface.co/datasets/allenai/tulu-3-sft-olmo-2-mixture-0225.
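The majority-voting step described above is a self-consistency filter over several sampled answers per question. A minimal sketch in Python; the answer-extraction machinery is not shown, and the strict-majority acceptance rule is an assumption, not Ai2's published code:

from collections import Counter

def majority_vote(answers):
    """Keep a synthetic answer only if most sampled completions agree on it.

    `answers` holds the final answers extracted from several independently
    sampled completions for the same synthetic math question.
    """
    if not answers:
        return None
    answer, votes = Counter(answers).most_common(1)[0]
    # Require a strict majority; otherwise discard the question as unreliable.
    return answer if votes > len(answers) / 2 else None

print(majority_vote(["42", "42", "41", "42"]))  # -> "42"
print(majority_vote(["1", "2", "3"]))           # -> None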
Llama 3.1 Tulu 3 SFT reused (off-policy)
Note that this collection is licensed under the ODC-BY-1.0 license; different licenses apply to subsets of the data. Some portions of the dataset are non-commercial. We present the mixture as a research artifact. This preference dataset is part of our Tulu 3 preference mixture: it contains prompts from our SFT mixture and 96,911 generation pairs obtained using the following models:
Mistral 7B Instruct v0.2 (Apache 2.0) Mistral… See the full description on the dataset page: https://huggingface.co/datasets/allenai/tulu-3-sft-reused-off-policy.
Created with open-instruct data tools:
python scripts/data/filtering_and_updates/update_subsets.py \
    --base_ds allenai/tulu-3-sft-mixture-filter-datecutoff \
    --remove_sources ai2-adapt-dev/personahub_math_v5_regen_149960 allenai/tulu-3-sft-personas-math-grade \
    --add_ds allenai/tulu-3-sft-personas-math-filtered allenai/tulu-3-sft-personas-math-grade-filtered \
    --remove_keys prompt dataset \
    --push_to_hub \
    --repo_id allenai/tulu-3-sft-mixture-0225
https://choosealicense.com/licenses/odc-by/
OLMoE SFT Mix
The SFT mix used is an expanded version of the Tulu v2 SFT mix, with new additions for code (CodeFeedback-Filtered-Instruction), reasoning (MetaMathQA), and instruction following (No Robots and a subset of Daring Anteater). Please see the referenced datasets for the multiple licenses that apply to the constituent data. We do not introduce any new data with this dataset. Config for creation via open-instruct: dataset_mixer: allenai/tulu-v2-sft-mixture-olmo-4096: 1.0… See the full description on the dataset page: https://huggingface.co/datasets/allenai/tulu-v3.1-mix-preview-4096-OLMoE.
aeolian83/allenai-tulu-3-sft-mixture_filtered dataset hosted on Hugging Face and contributed by the HF Datasets community
Llama 3.1 Tulu 3 IF-Augmented (on-policy 8B)
Note that this collection is licensed under the ODC-BY-1.0 license; different licenses apply to subsets of the data. Some portions of the dataset are non-commercial. We present the mixture as a research artifact. This preference dataset is part of our Tulu 3 preference mixture: it contains prompts from SFT data, with constraints from https://huggingface.co/datasets/google/IFEval. It contains 65,530 generation pairs (some of which are on-policy… See the full description on the dataset page: https://huggingface.co/datasets/allenai/tulu-3-IF-augmented-on-policy-8b.
https://choosealicense.com/licenses/odc-by/
Tulu 3 8B Preference Mixture
Note that this collection is licensed under the ODC-BY-1.0 license; different licenses apply to subsets of the data. Some portions of the dataset are non-commercial. We present the mixture as a research artifact. This mix is made up of the following preference datasets:
https://huggingface.co/datasets/allenai/tulu-3-sft-reused-off-policy https://huggingface.co/datasets/allenai/tulu-3-sft-reused-on-policy-8b… See the full description on the dataset page: https://huggingface.co/datasets/allenai/llama-3.1-tulu-3-8b-preference-mixture.
Tulu 3 SFT Mixture (Sampled)
This dataset is a sampled and filtered subset of the allenai/tulu-3-sft-mixture, curated and rebalanced for structured instruction fine-tuning. The goal is to support research and model development in math reasoning, coding, knowledge recall, instruction following (IF), and conversational alignment, while explicitly excluding safety, multilingual, and certain task-specific sources.
📦 Dataset Structure
Source: Filtered from… See the full description on the dataset page: https://huggingface.co/datasets/ushakov15/MNLP_M3_rag_dataset.
allenai/tulu-3-sft-olmo-2-mixture-filter-datecutoff dataset hosted on Hugging Face and contributed by the HF Datasets community
https://choosealicense.com/licenses/odc-by/
allenai tulu-3-sft-mixture DolphinLabeled
Part of the DolphinLabeled series of datasets
Presented by Eric Hartford and Cognitive Computations
The purpose of this dataset is to enable filtering of the allenai/tulu-3-sft-mixture dataset. I have modified the original dataset using two scripts.
dedupe.py removes rows with identical final message content.
label.py adds a "flags" column containing the following boolean… See the full description on the dataset page: https://huggingface.co/datasets/QuixiAI/allenai_tulu-3-sft-mixture-DolphinLabeled.
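A minimal sketch of what dedupe.py's rule amounts to, assuming the usual Tulu chat format with a messages list of {role, content} turns (the actual script may differ):

from datasets import load_dataset

ds = load_dataset("allenai/tulu-3-sft-mixture", split="train")

seen = set()

def first_occurrence(row):
    # Key each row by the content of its final message; drop later repeats.
    final = row["messages"][-1]["content"]
    if final in seen:
        return False
    seen.add(final)
    return True

deduped = ds.filter(first_occurrence)  # single-process so `seen` stays consistent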
This is a reduced subsample of allenai/tulu-3-sft-mixture. After removing the wildchat/wildjailbreak and related subsets, each subset/source was deduplicated using a WIP script based on avsolatorio/GIST-large-Embedding-v0 (which can likely be improved). This removed a lot of the math and coding samples, so I'm currently re-running those categories to pad out their sample counts. Additional code and math samples have been added as separate .jsonl files.
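A rough sketch of embedding-based near-duplicate removal with that model via sentence-transformers; the 0.95 cosine threshold and the greedy O(n²) scan are illustrative choices, not the WIP script itself:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("avsolatorio/GIST-large-Embedding-v0")

def near_duplicates(texts, threshold=0.95):
    """Return indices of texts that are near-duplicates of an earlier text."""
    emb = model.encode(texts, normalize_embeddings=True)
    drop = set()
    for i in range(len(texts)):
        if i in drop:
            continue
        # On unit-normalized vectors, the dot product is cosine similarity.
        sims = emb[i + 1:] @ emb[i]
        for offset, s in enumerate(sims):
            if s >= threshold:
                drop.add(i + 1 + offset)
    return drop

texts = ["What is 2+2?", "What is 2 + 2 ?", "Name a prime number."]
print(near_duplicates(texts))  # likely {1}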
Llama 3.1 Tulu 3 405B Preference Mixture
Note that this collection is licensed under the ODC-BY-1.0 license; different licenses apply to subsets of the data. Some portions of the dataset are non-commercial. We present the mixture as a research artifact. This preference mixture was used for DPO on our Llama 3.1 Tulu 3 405B SFT checkpoint to obtain Llama 3.1 Tulu 3 405B DPO. It contains 360,924 generation pairs obtained using the following models:
Mistral 7B Instruct v0.2 (Apache 2.0)… See the full description on the dataset page: https://huggingface.co/datasets/allenai/llama-3.1-tulu-3-405b-preference-mixture.
Origin Datasets: allenai/tulu-3-sft-mixture
Dataset Sampling for Merge-Up SLM Training
To prepare a dataset of 100,000 samples for Merge-Up SLM training, the following steps were taken:
Filtering for English Only: We used a regular expression to filter the dataset, retaining only the samples composed exclusively of English characters. Proportional Sampling by Token Length: Starting from 4,000 tokens, we counted the number of samples in increments of 200 tokens. Based on the resulting… See the full description on the dataset page: https://huggingface.co/datasets/aeolian83/allenai_tulu_3_sft_mixture_filtered_10k_sampled.
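The card doesn't publish the exact regular expression or tokenizer, so the ASCII pattern and whitespace token count below are stand-ins; a sketch of the two steps:

import re
from collections import Counter
from datasets import load_dataset

ds = load_dataset("allenai/tulu-3-sft-mixture", split="train")

# Step 1: keep rows whose messages are pure ASCII (stand-in for the
# unpublished "English characters exclusively" regex).
ascii_only = re.compile(r"^[\x00-\x7F]*$")

def is_english_only(row):
    return all(ascii_only.match(m["content"]) for m in row["messages"])

english = ds.filter(is_english_only)

# Step 2: count samples in 200-token buckets starting at 4,000 tokens
# (a whitespace split stands in for the unspecified tokenizer).
def n_tokens(row):
    return sum(len(m["content"].split()) for m in row["messages"])

def bucket(tokens, start=4000, step=200):
    return 0 if tokens < start else 1 + (tokens - start) // step

counts = Counter(bucket(n_tokens(row)) for row in english)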
https://choosealicense.com/licenses/odc-by/
Tulu-3-Pool-Annotated
Project | GitHub | Paper | HuggingFace collection. An annotated version of tulu-3-sft-mixture, used as a data pool in MIG. The annotations include #InsTag tags, DEITA scores, and CaR scores (a loading sketch follows this entry).
Dataset Details
Tulu3 Dataset Sources
Repository: https://huggingface.co/datasets/allenai/tulu-3-sft-mixture
Paper: Tulu 3: Pushing Frontiers in Open Language Model Post-Training
MIG Dataset Sources
Repository:… See the full description on the dataset page: https://huggingface.co/datasets/xsample/tulu-3-pool-annotated.
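As a quick start for the annotated pool above, a loading sketch; the annotation column name used for filtering is a guess from the card's description, so check pool.column_names before relying on it:

from datasets import load_dataset

pool = load_dataset("xsample/tulu-3-pool-annotated", split="train")
print(pool.column_names)  # verify the actual annotation fields first

# Hypothetical column name for the DEITA quality score.
high_quality = pool.filter(lambda row: row.get("deita_score", 0.0) >= 8.0)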
Removed sources:
ai2-adapt-dev/tulu_v3.9_wildjailbreak_decontaminated_50k
ai2-adapt-dev/tulu_v3.9_synthetic_finalresp_wildguardmixtrain_decontaminated_50k
ai2-adapt-dev/coconot_converted
ai2-adapt-dev/tulu_hard_coded_repeated_10
ai2-adapt-dev/tulu_v3.9_aya_100k
ai2-adapt-dev/numinamath_tir_math_decontaminated
ai2-adapt-dev/tulu_v3.9_open_math_2_gsm8k_50k
ai2-adapt-dev/tulu_v3.9_personahub_math_interm_algebra_20k
allenai/tulu-3-sft-personas-math-filtered… See the full description on the dataset page: https://huggingface.co/datasets/PJMixers-Dev/allenai_tulu-3-sft-olmo-2-mixture-0225-filtered-ShareGPT.
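The kind of filtering implied by this list can be reproduced with the datasets library, assuming each row carries a source column as in the upstream mixture (a sketch, not the repo's actual script):

from datasets import load_dataset

REMOVED_SOURCES = {
    "ai2-adapt-dev/tulu_v3.9_wildjailbreak_decontaminated_50k",
    "ai2-adapt-dev/coconot_converted",
    # ...remaining entries from the list above
}

ds = load_dataset("allenai/tulu-3-sft-olmo-2-mixture-0225", split="train")
kept = ds.filter(lambda row: row["source"] not in REMOVED_SOURCES)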
Tulu-3-sft-mixture_scored - with OpenDataArena Scores
This dataset is a scored version of the original allenai/tulu-3-sft-mixture dataset. The scoring was performed using the OpenDataArena-Tool, a comprehensive suite of automated evaluation methods for assessing instruction-following datasets. This version of the dataset includes rich, multi-dimensional scores for both the instructions (questions) and the instruction-response pairs, allowing for highly granular data analysis and… See the full description on the dataset page: https://huggingface.co/datasets/OpenDataArena/tulu-3-sft-mixture_scored.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset has been generated from 10k prompts randomly subsampled from allenai/tulu-3-sft-mixture. The LLM used for inference is THUDM/GLM-Z1-32B-0414. Outputs that didn't contain a closing </think> token have been discarded. More details on the generation process are in this article: How to Create Reasoning Datasets with GLM-Z1 at Low Cost. Total generation cost: $17 (H100 from RunPod).
Developed by: The Kaitchup
Language(s) (NLP): English
License: Apache 2.0
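A minimal sketch of the discard rule described above, keeping only completions whose reasoning trace is properly closed:

def has_closed_reasoning(completion):
    # GLM-Z1 emits its reasoning between <think> and </think>; a missing
    # closing tag usually means the generation was truncated.
    return "</think>" in completion

outputs = [
    "<think>2 + 2 = 4.</think>The answer is 4.",
    "<think>2 + 2 =",  # truncated: discarded
]
kept = [o for o in outputs if has_closed_reasoning(o)]
print(len(kept))  # 1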
This dataset is derived from allenai/tulu-3-sft-mixture. It is intended for use as calibration data for DWQ with MLX LM and Qwen3 models. Half the data is from the original corpus. The other half of the data are synthetically generated with prompts from the original corpus and completions from Qwen3-30B-A3B. The completions include thinking traces ("reasoning_content"). The completions were generated with mlx-lm and the following code: import json from mlx_lm import load, generate from… See the full description on the dataset page: https://huggingface.co/datasets/mlx-community/qwen3_dwq_calibration_1332.
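The generation code on the card is truncated above; a minimal mlx-lm sketch of the same kind of completion loop looks like this (the model path and max_tokens are illustrative, and the chat-template call assumes a Hugging Face-style tokenizer):

from mlx_lm import load, generate

# Illustrative quantized checkpoint; the card's completions come from Qwen3-30B-A3B.
model, tokenizer = load("mlx-community/Qwen3-30B-A3B-4bit")

messages = [{"role": "user", "content": "Explain why the sky is blue."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Qwen3 interleaves a thinking trace before the answer; it appears in the text.
completion = generate(model, tokenizer, prompt=prompt, max_tokens=512)
print(completion)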