25 datasets found
  1. h

    tulu-3-sft-mixture

    • huggingface.co
    Updated Nov 21, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ai2 (2024). tulu-3-sft-mixture [Dataset]. https://huggingface.co/datasets/allenai/tulu-3-sft-mixture
    Explore at:
    Dataset updated
    Nov 21, 2024
    Dataset authored and provided by
    Ai2
    License

    https://choosealicense.com/licenses/odc-by/https://choosealicense.com/licenses/odc-by/

    Description

    Tulu 3 SFT Mixture

    Note that this collection is licensed under ODC-BY-1.0 license; different licenses apply to subsets of the data. Some portions of the dataset are non-commercial. We present the mixture as a research artifact. The Tulu 3 SFT mixture was used to train the Tulu 3 series of models. It contains 939,344 samples from the following sets:

    CoCoNot (ODC-BY-1.0), 10,983 prompts (Brahman et al., 2024) FLAN v2 via ai2-adapt-dev/flan_v2_converted, 89,982 prompts (Longpre et… See the full description on the dataset page: https://huggingface.co/datasets/allenai/tulu-3-sft-mixture.

  2. h

    tulu-3-sft-olmo-2-mixture

    • huggingface.co
    Updated Oct 30, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ai2 (2024). tulu-3-sft-olmo-2-mixture [Dataset]. https://huggingface.co/datasets/allenai/tulu-3-sft-olmo-2-mixture
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 30, 2024
    Dataset authored and provided by
    Ai2
    License

    https://choosealicense.com/licenses/odc-by/https://choosealicense.com/licenses/odc-by/

    Description

    Note that this collection is licensed under ODC-BY-1.0 license; different licenses apply to subsets of the data. Some portions of the dataset are non-commercial. We present the mixture as a research artifact. The OLMo v2 SFT mixture was used to train the OLMo models. It contains 939,344 samples from the following sets:

    CoCoNot (ODC-BY-1.0), 10,983 prompts (Brahman et al., 2024) FLAN v2 via ai2-adapt-dev/flan_v2_converted, 89,982 prompts (Longpre et al., 2023) No Robots (CC-BY-NC-4.0), 9,500… See the full description on the dataset page: https://huggingface.co/datasets/allenai/tulu-3-sft-olmo-2-mixture.

  3. h

    tulu-3-sft-olmo-2-mixture-0225

    • huggingface.co
    Updated Mar 13, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ai2 (2025). tulu-3-sft-olmo-2-mixture-0225 [Dataset]. https://huggingface.co/datasets/allenai/tulu-3-sft-olmo-2-mixture-0225
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 13, 2025
    Dataset authored and provided by
    Ai2
    Description

    Used to train OLMo 2 32B. From the blog post:

    Filtered out instructions from the SFT dataset and the chosen responses of the preference data that included mentions of a date cutoff from the synthetic data generation process. This resulted in a new version of the instruction dataset, Tulu 3 SFT Mixture 0225, and preference dataset, OLMo-2-32B-pref-mix-0325. We use majority voting to improve the quality of answers to our synthetic math questions. For our Persona MATH and Grade School Math… See the full description on the dataset page: https://huggingface.co/datasets/allenai/tulu-3-sft-olmo-2-mixture-0225.

  4. tulu-3-sft-reused-off-policy

    • huggingface.co
    Updated Apr 30, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ai2 (2025). tulu-3-sft-reused-off-policy [Dataset]. https://huggingface.co/datasets/allenai/tulu-3-sft-reused-off-policy
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 30, 2025
    Dataset provided by
    Allen Institute for AIhttp://allenai.org/
    Authors
    Ai2
    Description

    Llama 3.1 Tulu 3 SFT reused (off-policy)

    Note that this collection is licensed under ODC-BY-1.0 license; different licenses apply to subsets of the data. Some portions of the dataset are non-commercial. We present the mixture as a research artifact. This preference dataset is part of our Tulu 3 preference mixture: it contains prompts from our SFT mixture and it contains 96,911 generation pairs obtained using the following models:

    Mistral 7B Instruct v0.2 (Apache 2.0) Mistral… See the full description on the dataset page: https://huggingface.co/datasets/allenai/tulu-3-sft-reused-off-policy.

  5. tulu-3-sft-mixture-0225

    • huggingface.co
    Updated Mar 13, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ai2 (2025). tulu-3-sft-mixture-0225 [Dataset]. https://huggingface.co/datasets/allenai/tulu-3-sft-mixture-0225
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 13, 2025
    Dataset provided by
    Instituto Allen para la Inteligencia Artificialhttp://allenai.org/
    Authors
    Ai2
    Description

    Created with open-instruct data tools: python scripts/data/filtering_and_updates/update_subsets.py
    --base_ds allenai/tulu-3-sft-mixture-filter-datecutoff
    --remove_sources ai2-adapt-dev/personahub_math_v5_regen_149960 allenai/tulu-3-sft-personas-math-grade
    --add_ds allenai/tulu-3-sft-personas-math-filtered allenai/tulu-3-sft-personas-math-grade-filtered
    --remove_keys prompt dataset
    --push_to_hub
    --repo_id allenai/tulu-3-sft-mixture-0225

  6. tulu-v3.1-mix-preview-4096-OLMoE

    • huggingface.co
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ai2, tulu-v3.1-mix-preview-4096-OLMoE [Dataset]. https://huggingface.co/datasets/allenai/tulu-v3.1-mix-preview-4096-OLMoE
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset provided by
    Allen Institute for AIhttp://allenai.org/
    Authors
    Ai2
    License

    https://choosealicense.com/licenses/odc-by/https://choosealicense.com/licenses/odc-by/

    Description

    OLMoE SFT Mix

    The SFT mix used is an expanded version of the Tulu v2 SFT mix with new additions for code, CodeFeedback-Filtered-Instruction, reasoning, MetaMathQA, and instruction following, No Robots and a subset of Daring Anteater. Please see the referenced datasets for the multiple licenses used in subsequent data. We do not introduce any new data with this dataset. Config for creation via open-instruct: dataset_mixer: allenai/tulu-v2-sft-mixture-olmo-4096: 1.0… See the full description on the dataset page: https://huggingface.co/datasets/allenai/tulu-v3.1-mix-preview-4096-OLMoE.

  7. h

    allenai-tulu-3-sft-mixture_filtered

    • huggingface.co
    Updated Mar 23, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    jungki son (2025). allenai-tulu-3-sft-mixture_filtered [Dataset]. https://huggingface.co/datasets/aeolian83/allenai-tulu-3-sft-mixture_filtered
    Explore at:
    Dataset updated
    Mar 23, 2025
    Authors
    jungki son
    Description

    aeolian83/allenai-tulu-3-sft-mixture_filtered dataset hosted on Hugging Face and contributed by the HF Datasets community

  8. tulu-3-IF-augmented-on-policy-8b

    • huggingface.co
    Updated Apr 30, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ai2 (2025). tulu-3-IF-augmented-on-policy-8b [Dataset]. https://huggingface.co/datasets/allenai/tulu-3-IF-augmented-on-policy-8b
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 30, 2025
    Dataset provided by
    Allen Institute for AIhttp://allenai.org/
    Authors
    Ai2
    Description

    Llama 3.1 Tulu 3 IF-Augmented (on-policy 8B)

    Note that this collection is licensed under ODC-BY-1.0 license; different licenses apply to subsets of the data. Some portions of the dataset are non-commercial. We present the mixture as a research artifact. This preference dataset is part of our Tulu 3 preference mixture: it contains prompts from SFT Data, with constraints from https://huggingface.co/datasets/google/IFEval. It contains 65,530 generation pairs (some of which on-policy… See the full description on the dataset page: https://huggingface.co/datasets/allenai/tulu-3-IF-augmented-on-policy-8b.

  9. llama-3.1-tulu-3-8b-preference-mixture

    • huggingface.co
    Updated Nov 21, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ai2 (2024). llama-3.1-tulu-3-8b-preference-mixture [Dataset]. https://huggingface.co/datasets/allenai/llama-3.1-tulu-3-8b-preference-mixture
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 21, 2024
    Dataset provided by
    Allen Institute for AIhttp://allenai.org/
    Authors
    Ai2
    License

    https://choosealicense.com/licenses/odc-by/https://choosealicense.com/licenses/odc-by/

    Description

    Tulu 3 8B Preference Mixture

    Note that this collection is licensed under ODC-BY-1.0 license; different licenses apply to subsets of the data. Some portions of the dataset are non-commercial. We present the mixture as a research artifact. This mix is made up from the following preference datasets:

    https://huggingface.co/datasets/allenai/tulu-3-sft-reused-off-policy https://huggingface.co/datasets/allenai/tulu-3-sft-reused-on-policy-8b… See the full description on the dataset page: https://huggingface.co/datasets/allenai/llama-3.1-tulu-3-8b-preference-mixture.

  10. h

    MNLP_M3_rag_dataset

    • huggingface.co
    Updated Jun 12, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ivan Ushakov (2025). MNLP_M3_rag_dataset [Dataset]. https://huggingface.co/datasets/ushakov15/MNLP_M3_rag_dataset
    Explore at:
    Dataset updated
    Jun 12, 2025
    Authors
    Ivan Ushakov
    Description

    Tulu 3 SFT Mixture (Sampled)

    This dataset is a sampled and filtered subset of the allenai/tulu-3-sft-mixture, curated and rebalanced for structured instruction fine-tuning. The goal is to support research and model development in math reasoning, coding, knowledge recall, instruction following (IF), and conversational alignment, while explicitly excluding safety, multilingual, and certain task-specific sources.

      📦 Dataset Structure
    

    Source: Filtered from… See the full description on the dataset page: https://huggingface.co/datasets/ushakov15/MNLP_M3_rag_dataset.

  11. tulu-3-sft-olmo-2-mixture-filter-datecutoff

    • huggingface.co
    Updated Feb 8, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ai2 (2025). tulu-3-sft-olmo-2-mixture-filter-datecutoff [Dataset]. https://huggingface.co/datasets/allenai/tulu-3-sft-olmo-2-mixture-filter-datecutoff
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 8, 2025
    Dataset provided by
    Instituto Allen para la Inteligencia Artificialhttp://allenai.org/
    Authors
    Ai2
    Description

    allenai/tulu-3-sft-olmo-2-mixture-filter-datecutoff dataset hosted on Hugging Face and contributed by the HF Datasets community

  12. h

    allenai_tulu-3-sft-mixture-DolphinLabeled

    • huggingface.co
    Updated Jan 6, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Quixi AI (2025). allenai_tulu-3-sft-mixture-DolphinLabeled [Dataset]. https://huggingface.co/datasets/QuixiAI/allenai_tulu-3-sft-mixture-DolphinLabeled
    Explore at:
    Dataset updated
    Jan 6, 2025
    Dataset authored and provided by
    Quixi AI
    License

    https://choosealicense.com/licenses/odc-by/https://choosealicense.com/licenses/odc-by/

    Description

    allenai tulu-3-sft-mixture DolphinLabeled

      Part of the DolphinLabeled series of datasets
    
    
    
    
    
      Presented by Eric Hartford and Cognitive Computations
    

    The purpose of this dataset is to enable filtering of allenai/tulu-3-sft-mixture dataset. The original dataset is allenai/tulu-3-sft-mixture I have modified the dataset using two scripts.

    dedupe.py - removes rows with identical final message content label.py - adds a "flags" column containing the following boolean… See the full description on the dataset page: https://huggingface.co/datasets/QuixiAI/allenai_tulu-3-sft-mixture-DolphinLabeled.

  13. h

    tulu-mini

    • huggingface.co
    Updated May 17, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Toaster (2025). tulu-mini [Dataset]. https://huggingface.co/datasets/ToastyPigeon/tulu-mini
    Explore at:
    Dataset updated
    May 17, 2025
    Authors
    Toaster
    Description

    This is a reduced subsample from allenai/tulu-3-sft-mixture. After removing the wildchat/wildjailbreak and related subsets, each subset/source was deduplicated using a WIP script that uses avsolatorio/GIST-large-Embedding-v0 (can likely be improved). This resulted in removing a lot of the math and coding samples, so I'm currently re-running on those to pad out the samples in those categories. additional code and math samples have been added as separate .jsonl files.

  14. llama-3.1-tulu-3-405b-preference-mixture

    • huggingface.co
    Updated Jan 30, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ai2 (2025). llama-3.1-tulu-3-405b-preference-mixture [Dataset]. https://huggingface.co/datasets/allenai/llama-3.1-tulu-3-405b-preference-mixture
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 30, 2025
    Dataset provided by
    Allen Institute for AIhttp://allenai.org/
    Authors
    Ai2
    Description

    Llama 3.1 Tulu 3 405B Preference Mixture

    Note that this collection is licensed under ODC-BY-1.0 license; different licenses apply to subsets of the data. Some portions of the dataset are non-commercial. We present the mixture as a research artifact. This preference mixture used for DPO on our the Llama 3.1 Tulu 3 405B SFT checkpoint to obtain Llama 3.1 Tulu 3 405B DPO. It contains 360,924 generation pairs obtained using the following models:

    Mistral 7B Instruct v0.2 (Apache 2.0)… See the full description on the dataset page: https://huggingface.co/datasets/allenai/llama-3.1-tulu-3-405b-preference-mixture.

  15. h

    allenai_tulu_3_sft_mixture_filtered_10k_sampled

    • huggingface.co
    Updated Apr 10, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    jungki son (2025). allenai_tulu_3_sft_mixture_filtered_10k_sampled [Dataset]. https://huggingface.co/datasets/aeolian83/allenai_tulu_3_sft_mixture_filtered_10k_sampled
    Explore at:
    Dataset updated
    Apr 10, 2025
    Authors
    jungki son
    Description

    Origin Datasets: allenai/tulu-3-sft-mixture Dataset Sampling for Merge-Up SLM Training To prepare a dataset of 100,000 samples for Merge-Up SLM training, the following steps were taken:

    Filtering for English Only: We used a regular expression to filter the dataset, retaining only the samples that contain English alphabets exclusively. Proportional Sampling by Token Length: Starting from 4,000 tokens, we counted the number of samples in increments of 200 tokens. Based on the resulting… See the full description on the dataset page: https://huggingface.co/datasets/aeolian83/allenai_tulu_3_sft_mixture_filtered_10k_sampled.

  16. h

    tulu-3-pool-annotated

    • huggingface.co
    Updated Apr 27, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Chen Yicheng (2025). tulu-3-pool-annotated [Dataset]. https://huggingface.co/datasets/xsample/tulu-3-pool-annotated
    Explore at:
    Dataset updated
    Apr 27, 2025
    Authors
    Chen Yicheng
    License

    https://choosealicense.com/licenses/odc-by/https://choosealicense.com/licenses/odc-by/

    Description

    Tulu-3-Pool-Annotated

    Project | Github | Paper | HuggingFace's collection Annotated tulu-3-sft-mixture. Used as a data pool in MIG. The annotations include #InsTag tags, DEITA scores, and CaR scores.

      Dataset Details
    
    
    
    
    
    
      Tulu3 Dataset Sources
    

    Repository: https://huggingface.co/datasets/allenai/tulu-3-sft-mixture Paper [optional]: Tulu 3: Pushing Frontiers in Open Language Model Post-Training

      MIG Dataset Sources
    

    Repository:… See the full description on the dataset page: https://huggingface.co/datasets/xsample/tulu-3-pool-annotated.

  17. h

    allenai_tulu-3-sft-olmo-2-mixture-0225-filtered-ShareGPT

    • huggingface.co
    Updated Jun 12, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Peanut Jar Mixers Development (2025). allenai_tulu-3-sft-olmo-2-mixture-0225-filtered-ShareGPT [Dataset]. https://huggingface.co/datasets/PJMixers-Dev/allenai_tulu-3-sft-olmo-2-mixture-0225-filtered-ShareGPT
    Explore at:
    Dataset updated
    Jun 12, 2025
    Dataset authored and provided by
    Peanut Jar Mixers Development
    Description

    Removed sources:

    ai2-adapt-dev/tulu_v3.9_wildjailbreak_decontaminated_50k ai2-adapt-dev/tulu_v3.9_synthetic_finalresp_wildguardmixtrain_decontaminated_50k ai2-adapt-dev/coconot_converted ai2-adapt-dev/tulu_hard_coded_repeated_10 ai2-adapt-dev/tulu_v3.9_aya_100k ai2-adapt-dev/numinamath_tir_math_decontaminated ai2-adapt-dev/tulu_v3.9_open_math_2_gsm8k_50k ai2-adapt-dev/tulu_v3.9_personahub_math_interm_algebra_20k allenai/tulu-3-sft-personas-math-filtered… See the full description on the dataset page: https://huggingface.co/datasets/PJMixers-Dev/allenai_tulu-3-sft-olmo-2-mixture-0225-filtered-ShareGPT.

  18. h

    tulu-3-sft-mixture_scored

    • huggingface.co
    Updated Aug 1, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    OpenDataArena (2025). tulu-3-sft-mixture_scored [Dataset]. https://huggingface.co/datasets/OpenDataArena/tulu-3-sft-mixture_scored
    Explore at:
    Dataset updated
    Aug 1, 2025
    Authors
    OpenDataArena
    Description

    Tulu-3-sft-mixture_scored - with OpenDataArena Scores

    This dataset is a scored version of the original allenai/tulu-3-sft-mixture dataset. The scoring was performed using the OpenDataArena-Tool, a comprehensive suite of automated evaluation methods for assessing instruction-following datasets. This version of the dataset includes rich, multi-dimensional scores for both the instructions (questions) and the instruction-response pairs, allowing for highly granular data analysis and… See the full description on the dataset page: https://huggingface.co/datasets/OpenDataArena/tulu-3-sft-mixture_scored.

  19. h

    reasoning-Z1

    • huggingface.co
    Updated Apr 27, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    The Kaitchup (2025). reasoning-Z1 [Dataset]. https://huggingface.co/datasets/kaitchup/reasoning-Z1
    Explore at:
    Dataset updated
    Apr 27, 2025
    Dataset authored and provided by
    The Kaitchup
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset has been generated from 10k prompts randomly subsampled from allenai/tulu-3-sft-mixture. The LLM used for inference is THUDM/GLM-Z1-32B-0414. The outputs that didn't contain a "" token have been discarded. More details on the generation process in this article: How to Create Reasoning Datasets with GLM-Z1 at Low Cost Total generation cost: $17 (H100 from RunPod).

    Developed by: The Kaitchup Language(s) (NLP): English License: Apache 2.0 license

  20. h

    qwen3_dwq_calibration_1332

    • huggingface.co
    Updated May 10, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    MLX Community (2025). qwen3_dwq_calibration_1332 [Dataset]. https://huggingface.co/datasets/mlx-community/qwen3_dwq_calibration_1332
    Explore at:
    Dataset updated
    May 10, 2025
    Dataset authored and provided by
    MLX Community
    Description

    This dataset is derived from allenai/tulu-3-sft-mixture. It is intended for use as calibration data for DWQ with MLX LM and Qwen3 models. Half the data is from the original corpus. The other half of the data are synthetically generated with prompts from the original corpus and completions from Qwen3-30B-A3B. The completions include thinking traces ("reasoning_content"). The completions were generated with mlx-lm and the following code: import json from mlx_lm import load, generate from… See the full description on the dataset page: https://huggingface.co/datasets/mlx-community/qwen3_dwq_calibration_1332.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Ai2 (2024). tulu-3-sft-mixture [Dataset]. https://huggingface.co/datasets/allenai/tulu-3-sft-mixture

tulu-3-sft-mixture

allenai/tulu-3-sft-mixture

Explore at:
28 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Nov 21, 2024
Dataset authored and provided by
Ai2
License

https://choosealicense.com/licenses/odc-by/https://choosealicense.com/licenses/odc-by/

Description

Tulu 3 SFT Mixture

Note that this collection is licensed under ODC-BY-1.0 license; different licenses apply to subsets of the data. Some portions of the dataset are non-commercial. We present the mixture as a research artifact. The Tulu 3 SFT mixture was used to train the Tulu 3 series of models. It contains 939,344 samples from the following sets:

CoCoNot (ODC-BY-1.0), 10,983 prompts (Brahman et al., 2024) FLAN v2 via ai2-adapt-dev/flan_v2_converted, 89,982 prompts (Longpre et… See the full description on the dataset page: https://huggingface.co/datasets/allenai/tulu-3-sft-mixture.

Search
Clear search
Close search
Google apps
Main menu