15 datasets found
  1. oasst1_converted

    • huggingface.co
    Updated Jul 27, 2025
    Cite
    Victoria Graf (2025). oasst1_converted [Dataset]. https://huggingface.co/datasets/VGraf/oasst1_converted
    Dataset updated
    Jul 27, 2025
    Authors
    Victoria Graf
    Description

    This is a converted version of the Open Assistant 1 dataset into Tulu SFT training format. The conversion script can be found in our open-instruct repo. The conversion took the following parameters:

    apply_keyword_filters: False
    apply_empty_message_filters: False
    top_k: 3
    push_to_hub: True
    hf_entity: VGraf
    converted_dataset_name: None
    local_save_dir: /results

    Please refer to the original dataset for more information about this dataset and the license.
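    The converted datasets in this listing share a conversational "messages" layout for Tulu SFT training. As a rough, hypothetical sketch (the exact schema and the open-instruct filter behavior are assumptions here, not confirmed by this listing), an example record and an empty-message filter in the spirit of the apply_empty_message_filters option might look like:

    ```python
    # Hypothetical sketch of the Tulu SFT "messages" format: each example
    # holds a list of {"role", "content"} turns. The filter below mirrors
    # what an apply_empty_message_filters-style option might do; the real
    # open-instruct implementation may differ.

    def has_empty_message(example):
        """Return True if any turn in the conversation is empty after stripping."""
        return any(not turn["content"].strip() for turn in example["messages"])

    examples = [
        {"messages": [
            {"role": "user", "content": "What is Open Assistant?"},
            {"role": "assistant", "content": "A crowd-sourced chat dataset."},
        ]},
        {"messages": [
            {"role": "user", "content": "Hello"},
            {"role": "assistant", "content": "   "},  # empty after stripping
        ]},
    ]

    filtered = [ex for ex in examples if not has_empty_message(ex)]
    print(len(filtered))  # 1
    ```

    With the filter applied, the second conversation is dropped because its assistant turn is blank.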

  2. bespoke_stratos_17k_converted

    • huggingface.co
    Updated Jan 22, 2025
    Cite
    Nathan Lambert (2025). bespoke_stratos_17k_converted [Dataset]. https://huggingface.co/datasets/natolambert/bespoke_stratos_17k_converted
    Dataset updated
    Jan 22, 2025
    Authors
    Nathan Lambert
    Description

    This is a converted version of the bespokelabs/Bespoke-Stratos-17k dataset into Tulu SFT training format. The conversion script can be found in our open-instruct repo. The conversion took the following parameters:

    apply_keyword_filters: False
    apply_empty_message_filters: False
    push_to_hub: True
    hf_entity: natolambert
    converted_dataset_name: bespoke_stratos_17k_converted
    local_save_dir: None

    Please refer to the original dataset for more information about this dataset and the license.

  3. numinamath_tir_converted

    • huggingface.co
    Updated Mar 23, 2025
    Cite
    Mingyuan Zhang (2025). numinamath_tir_converted [Dataset]. https://huggingface.co/datasets/KKHYA/numinamath_tir_converted
    Dataset updated
    Mar 23, 2025
    Authors
    Mingyuan Zhang
    Description

    This is a converted version of the NuminaMath-TIR subset into Tulu SFT training format. The conversion script can be found in the open-instruct repo. The conversion took the following parameters:

    apply_keyword_filters: True
    apply_empty_message_filters: True
    push_to_hub: True
    hf_entity: KKHYA
    converted_dataset_name: None
    local_save_dir: ./data/sft/numinamath

    Please refer to the original TIR dataset for more information about this dataset and the license.

  4. coconot_converted

    • huggingface.co
    Updated Jul 9, 2025
    Cite
    AI2 Adapt Dev (2025). coconot_converted [Dataset]. https://huggingface.co/datasets/ai2-adapt-dev/coconot_converted
    Dataset updated
    Jul 9, 2025
    Dataset provided by
    Allen Institute for AI (http://allenai.org/)
    Authors
    AI2 Adapt Dev
    Description

    This is a converted version of the CoCoNoT dataset into Tulu SFT training format. The conversion script can be found in our open-instruct repo. The conversion took the following parameters:

    apply_keyword_filters: True
    apply_empty_message_filters: False
    push_to_hub: True
    hf_entity: ai2-adapt-dev
    converted_dataset_name: coconot_converted
    local_save_dir: ./data/sft/coconot

    Please refer to the original dataset for more information about this dataset and the license.

  5. evol_codealpaca_converted

    • huggingface.co
    Updated Apr 6, 2025
    Cite
    Mingyuan Zhang (2025). evol_codealpaca_converted [Dataset]. https://huggingface.co/datasets/KKHYA/evol_codealpaca_converted
    Dataset updated
    Apr 6, 2025
    Authors
    Mingyuan Zhang
    Description

    This is a converted version of the Evol-CodeAlpaca dataset into Tulu SFT training format. The conversion script can be found in our open-instruct repo. The conversion took the following parameters:

    apply_keyword_filters: True
    apply_empty_message_filters: True
    push_to_hub: True
    hf_entity: KKHYA
    converted_dataset_name: evol_codealpaca_converted
    local_save_dir: ./data/sft/evol_codealpaca

    The original Evol-CodeAlpaca dataset can be found here. Our conversion is based on the decontaminated… See the full description on the dataset page: https://huggingface.co/datasets/KKHYA/evol_codealpaca_converted.

  6. wizardlm_converted

    • huggingface.co
    Updated Oct 18, 2024
    Cite
    AI2 Adapt Dev (2024). wizardlm_converted [Dataset]. https://huggingface.co/datasets/ai2-adapt-dev/wizardlm_converted
    Dataset updated
    Oct 18, 2024
    Dataset provided by
    Allen Institute for AI (http://allenai.org/)
    Authors
    AI2 Adapt Dev
    Description

    This is a converted version of the WizardLM evol instruct dataset into Tulu SFT training format. The conversion script can be found in our open-instruct repo. The conversion took the following parameters:

    apply_keyword_filters: True
    apply_empty_message_filters: True
    push_to_hub: True
    hf_entity: ai2-adapt-dev
    converted_dataset_name: wizardlm_converted
    local_save_dir: ./data/sft/wizardlm

    Please refer to the original dataset for more information about this dataset and the license.

  7. flan_v2_converted

    • huggingface.co
    Updated Jul 27, 2025
    Cite
    Victoria Graf (2025). flan_v2_converted [Dataset]. https://huggingface.co/datasets/VGraf/flan_v2_converted
    Dataset updated
    Jul 27, 2025
    Authors
    Victoria Graf
    Description

    This is a converted version of the Flan dataset into Tulu SFT training format. The conversion script can be found in our open-instruct repo. The conversion took the following parameters:

    apply_keyword_filters: False
    apply_empty_message_filters: False
    push_to_hub: True
    hf_entity: VGraf
    converted_dataset_name: flan_v2_converted
    local_save_dir: None

    The original FLAN dataset needs extensive effort to be regenerated, so we are using a reproduced version by the OpenOrca team. More specifically… See the full description on the dataset page: https://huggingface.co/datasets/VGraf/flan_v2_converted.

  8. no_robots_converted

    • huggingface.co
    Updated Oct 17, 2024
    Cite
    AI2 Adapt Dev (2024). no_robots_converted [Dataset]. https://huggingface.co/datasets/ai2-adapt-dev/no_robots_converted
    Dataset updated
    Oct 17, 2024
    Dataset provided by
    Allen Institute for AI (http://allenai.org/)
    Authors
    AI2 Adapt Dev
    Description

    This is a converted version of the no_robots dataset into Tulu SFT training format. The conversion script can be found in our open-instruct repo. The conversion took the following parameters:

    apply_keyword_filters: False
    apply_empty_message_filters: False
    push_to_hub: True
    hf_entity: ai2-adapt-dev
    converted_dataset_name: no_robots_converted
    local_save_dir: ./data/sft/no_robots

    Please refer to the original dataset for more information about this dataset and the license.

  9. Tulu V2 Dataset

    • kaggle.com
    Updated Nov 23, 2023
    Cite
    The Devastator (2023). Tulu V2 Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/science-based-tulu-nlp-model
    Available download formats: zip (355939514 bytes)
    Dataset updated
    Nov 23, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Tulu V2 Dataset

    Assisting Assistive Tasks with Language Data Mixtures

    By Huggingface Hub [source]

    About this dataset

    This dataset, Tulu-V2, is a science-based natural language resource for assistive tasks that contains a mixture of language data drawn from research and analysis. It consists of messages in the Tulu language, enabling machine learning algorithms to train more accurate models of language usage and context. It offers researchers and analysts a resource for studying linguistics, speech recognition technology, artificial intelligence applications, and more. With a wide variety of registers included, from formal literature to informal conversation, the collection supports insights into how people interact with their environment through dialogue.


    How to use the dataset

    Once you have an understanding of the data format in this dataset you are ready to start the development process with your own model design! Here are some tips on how to get started:

    • Pre-process your data by cleaning up any irrelevant parts before training your model – this ensures that only useful information is used when creating a model
    • Split your data into smaller chunks that feed into individual models – this makes it easier to find mistakes during development and reduces time spent debugging
    • Use techniques such as feature engineering to allow for different levels of complexity in downstream tasks like classification
    • Optimize performance by testing different parameter values on separate configurations
    • Develop an evaluation metric appropriate for the task – consider metrics such as precision/recall measures when developing your metric

    Following these steps when working with the Science-Based Tulu NLP Model should help you create accurate results faster than manually fine-tuning models every time.

    Research Ideas

    • Developing a speech recognition system to understand Tulu conversations
    • Building a machine learning model for automatic translation from Tulu to English
    • Creating an artificial intelligence-based natural language processing platform for assisting people with disabilities in understanding and navigating the world around them using Tulu as their primary language

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Huggingface Hub.

    License

    License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication, No Copyright. You can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv

    | Column name | Description                                 |
    |:------------|:--------------------------------------------|
    | dataset     | The name of the dataset. (String)           |
    | messages    | The messages in the Tulu language. (String) |
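    A minimal sketch of reading the two-column train.csv described above, using only the Python standard library. The listing does not say how the messages column is serialized, so the JSON encoding here is an assumption; adjust the parsing to the actual file.

    ```python
    import csv
    import io
    import json

    # In-memory stand-in for train.csv; the messages column is assumed
    # (not confirmed by the listing) to hold a JSON-serialized conversation.
    sample = io.StringIO(
        'dataset,messages\n'
        'oasst1,"[{""role"": ""user"", ""content"": ""Hi""}]"\n'
    )

    rows = list(csv.DictReader(sample))
    for row in rows:
        turns = json.loads(row["messages"])  # decode the conversation turns
        print(row["dataset"], len(turns))
    ```

    For the real file, replace the io.StringIO stand-in with open("train.csv", newline="").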


  10. ToolBench

    • opendatalab.com
    Updated Oct 3, 2023
    Cite
    Tsinghua University (2023). ToolBench [Dataset]. https://opendatalab.com/OpenDataLab/ToolBench
    Available download formats: zip
    Dataset updated
    Oct 3, 2023
    Dataset provided by
    Yale University
    Tsinghua University
    Renmin University of China
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This project (ToolLLM) aims to construct open-source, large-scale, high-quality instruction tuning SFT data to facilitate the construction of powerful LLMs with general tool-use capability. We aim to empower open-source LLMs to master thousands of diverse real-world APIs. We achieve this by collecting a high-quality instruction-tuning dataset. It is constructed automatically using the latest ChatGPT (gpt-3.5-turbo-16k), which is upgraded with enhanced function call capabilities. We provide the dataset, the corresponding training and evaluation scripts, and a capable model ToolLLaMA fine-tuned on ToolBench.
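    To make the function-call idea concrete, here is an illustrative sketch of what one instruction-tuning record for tool use could look like. The field names and structure are hypothetical, chosen for illustration; they are not ToolBench's actual schema.

    ```python
    import json

    # Hypothetical record pairing a user query with the API call an
    # assistant should emit and the final answer it should produce.
    record = {
        "query": "What is the weather in Boston?",
        "api_calls": [
            {
                "name": "get_current_weather",
                "arguments": {"location": "Boston", "unit": "celsius"},
            }
        ],
        "final_answer": "It is 18 degrees Celsius in Boston.",
    }

    # Round-trip through JSON, as such records are typically stored on disk.
    serialized = json.dumps(record)
    restored = json.loads(serialized)
    print(restored["api_calls"][0]["name"])  # get_current_weather
    ```

    Training on pairs like this is what teaches a model to map a natural-language request onto a structured API invocation.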

  11. VARGPT_datasets

    • huggingface.co
    Updated Jan 23, 2025
    Cite
    VARGPT-family (2025). VARGPT_datasets [Dataset]. https://huggingface.co/datasets/VARGPT-family/VARGPT_datasets
    Dataset updated
    Jan 23, 2025
    Authors
    VARGPT-family
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Datasets for training VARGPT

      Dataset Structure

    stage1-pt: Contains a 1.28M-sample pre-training instruction fine-tuning dataset for VARGPT.

    stage2-sft: Includes datasets for the second stage of VARGPT instruction fine-tuning:

    stage2-sft/llava_v1_5_mix665k: Derived entirely from LLaVA-1.5 training data.
    stage2-sft/llava_onevision_508k: Sampled from the LLaVA-onevision Dataset.
    stage2-sft/ImageNet-Instruct-5k: Sampled from our stage3-sft/ImageNet-Instruct-130k dataset.
    … See the full description on the dataset page: https://huggingface.co/datasets/VARGPT-family/VARGPT_datasets.

  12. LIMI

    • huggingface.co
    Updated Sep 23, 2025
    Cite
    SII - GAIR (2025). LIMI [Dataset]. https://huggingface.co/datasets/GAIR/LIMI
    Dataset updated
    Sep 23, 2025
    Dataset authored and provided by
    SII - GAIR
    License

    https://choosealicense.com/licenses/other/

    Description

    LIMI: Less is More for Agency

    To learn more about LIMI, feel free to explore our documentation and resources. This dataset release includes the following sections:

    Data Fields: Message schema, tool definitions, and tool-call format.
    Splits: Available partitions and counts.
    Examples: Representative JSON samples.

    News
    

    2025.10.08: 📝 Released training scripts for Qwen3 dense models (4B/8B/32B) - check… See the full description on the dataset page: https://huggingface.co/datasets/GAIR/LIMI.

  13. smoltalk2_everyday_convs_think

    • huggingface.co
    Updated Sep 24, 2025
    Cite
    Hugging Face Smol Models Research (2025). smoltalk2_everyday_convs_think [Dataset]. https://huggingface.co/datasets/HuggingFaceTB/smoltalk2_everyday_convs_think
    Dataset updated
    Sep 24, 2025
    Dataset provided by
    Hugging Face (https://huggingface.co/)
    Authors
    Hugging Face Smol Models Research
    Description

    SmolTalk2

      Dataset description
    

    This dataset contains the smoltalk_everyday_convs_reasoning_Qwen3_32B_think subset from SmolTalk2. We processed the dataset using SmolLM3's chat template and made it available for the SFT exercises from the smol course. The script we used to create the dataset is available in the create_dataset.py file in this repository. You can load the dataset using from datasets import load_dataset

    To load the train split you can run

    ds =… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/smoltalk2_everyday_convs_think.

  14. aya_dataset_converted

    • huggingface.co
    Updated Jul 27, 2025
    Cite
    Victoria Graf (2025). aya_dataset_converted [Dataset]. https://huggingface.co/datasets/VGraf/aya_dataset_converted
    Dataset updated
    Jul 27, 2025
    Authors
    Victoria Graf
    Description

    This is a converted version of the Aya dataset into Tulu SFT training format. The conversion script can be found in our open-instruct repo. The conversion took the following parameters:

    apply_keyword_filters: False
    apply_empty_message_filters: False
    push_to_hub: True
    hf_entity: VGraf
    converted_dataset_name: aya_dataset_converted
    local_save_dir: None

    Please refer to the original dataset for more information about this dataset and the license.

  15. smollm-corpus

    • huggingface.co
    Updated Jul 16, 2024
    Cite
    Hugging Face Smol Models Research (2024). smollm-corpus [Dataset]. https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus
    Dataset updated
    Jul 16, 2024
    Dataset provided by
    Hugging Face (https://huggingface.co/)
    Authors
    Hugging Face Smol Models Research
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    SmolLM-Corpus

    This dataset is a curated collection of high-quality educational and synthetic data designed for training small language models. You can find more details about the models trained on this dataset in our SmolLM blog post.

      Dataset subsets

      Cosmopedia v2

    Cosmopedia v2 is an enhanced version of Cosmopedia, the largest synthetic dataset for pre-training, consisting of over 39 million textbooks, blog posts, and stories generated by… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus.

