This is a converted version of the Open Assistant 1 dataset into Tulu SFT training format. The conversion script can be found in our open-instruct repo. The conversion took the following parameters:
- apply_keyword_filters: False
- apply_empty_message_filters: False
- top_k: 3
- push_to_hub: True
- hf_entity: VGraf
- converted_dataset_name: None
- local_save_dir: /results
Please refer to the original dataset for more information about this dataset and the license.
This is a converted version of the bespokelabs/Bespoke-Stratos-17k dataset into Tulu SFT training format. The conversion script can be found in our open-instruct repo. The conversion took the following parameters:
- apply_keyword_filters: False
- apply_empty_message_filters: False
- push_to_hub: True
- hf_entity: natolambert
- converted_dataset_name: bespoke_stratos_17k_converted
- local_save_dir: None
Please refer to the original dataset for more information about this dataset and the license.
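These converted sets can be pulled straight from the Hub. A minimal loading sketch, assuming the repo id follows the hf_entity/converted_dataset_name pattern from the parameters above, and that the Tulu SFT format stores each conversation in a messages column of role/content turns:

```python
from datasets import load_dataset

# Repo id inferred from hf_entity + converted_dataset_name above (an assumption,
# not a confirmed path).
ds = load_dataset("natolambert/bespoke_stratos_17k_converted", split="train")

# Tulu SFT format (assumed): each row carries a list of {"role", "content"} turns.
for turn in ds[0]["messages"][:2]:
    print(turn["role"], "->", turn["content"][:80])
```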
This is a converted version of the NuminaMath-TIR subset into Tulu SFT training format. The conversion script can be found in our open-instruct repo. The conversion took the following parameters:
- apply_keyword_filters: True
- apply_empty_message_filters: True
- push_to_hub: True
- hf_entity: KKHYA
- converted_dataset_name: None
- local_save_dir: ./data/sft/numinamath
Please refer to the original TIR dataset for more information about this dataset and the license.
This is a converted version of the CoCoNoT dataset into Tulu SFT training format. The conversion script can be found in our open-instruct repo. The conversion took the following parameters:
- apply_keyword_filters: True
- apply_empty_message_filters: False
- push_to_hub: True
- hf_entity: ai2-adapt-dev
- converted_dataset_name: coconot_converted
- local_save_dir: ./data/sft/coconot
Please refer to the original dataset for more information about this dataset and the license.
This is a converted version of the Evol-CodeAlpaca dataset into Tulu SFT training format. The conversion script can be found in our open-instruct repo. The conversion took the following parameters:
- apply_keyword_filters: True
- apply_empty_message_filters: True
- push_to_hub: True
- hf_entity: KKHYA
- converted_dataset_name: evol_codealpaca_converted
- local_save_dir: ./data/sft/evol_codealpaca
The original Evol-CodeAlpaca dataset can be found here. Our conversion is based on the decontaminated… See the full description on the dataset page: https://huggingface.co/datasets/KKHYA/evol_codealpaca_converted.
This is a converted version of the WizardLM evol instruct dataset into Tulu SFT training format. The conversion script can be found in our open-instruct repo. The conversion took the following parameters:
- apply_keyword_filters: True
- apply_empty_message_filters: True
- push_to_hub: True
- hf_entity: ai2-adapt-dev
- converted_dataset_name: wizardlm_converted
- local_save_dir: ./data/sft/wizardlm
Please refer to the original dataset for more information about this dataset and the license.
This is a converted version of the Flan dataset into Tulu SFT training format. The conversion script can be found in our open-instruct repo. The conversion took the following parameters:
- apply_keyword_filters: False
- apply_empty_message_filters: False
- push_to_hub: True
- hf_entity: VGraf
- converted_dataset_name: flan_v2_converted
- local_save_dir: None
The original FLAN dataset needs extensive effort to be regenerated, so we are using a reproduced version by the OpenOrca team. More specifically… See the full description on the dataset page: https://huggingface.co/datasets/VGraf/flan_v2_converted.
This is a converted version of the no_robots dataset into Tulu SFT training format. The conversion script can be found in our open-instruct repo. The conversion took the following parameters:
- apply_keyword_filters: False
- apply_empty_message_filters: False
- push_to_hub: True
- hf_entity: ai2-adapt-dev
- converted_dataset_name: no_robots_converted
- local_save_dir: ./data/sft/no_robots
Please refer to the original dataset for more information about this dataset and the license.
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
By Huggingface Hub [source]
This dataset, Tulu-V2, is a science-oriented natural-language dataset for assistive tasks, containing a mixture of language data from research and analysis. It consists of messages in the Tulu language, enabling machine learning algorithms to train more accurate models of language usage and context. It offers researchers and analysts a valuable resource for studying linguistics, speech recognition technology, artificial intelligence applications, and more. With a wide variety of material included, from formal literature to informal conversation, the collection supports insights into how people interact with their environment through dialogue.
Once you understand the data format in this dataset, you are ready to start developing your own model design. Here are some tips on how to get started:

1. Pre-process your data by cleaning up any irrelevant parts before training your model – this ensures that only useful information is used when building the model.
2. Split your data into smaller chunks that feed into individual models – this makes it easier to find mistakes during development and reduces time spent debugging.
3. Use techniques such as feature engineering to allow for different levels of complexity in downstream tasks like classification.
4. Optimize performance by testing different parameter values across separate configurations.
5. Develop an evaluation metric appropriate for the task – consider measures such as precision and recall.

Following these steps when working with the science-based Tulu NLP data should help you reach accurate results faster than manually fine-tuning models every time. The sketch below illustrates steps 2 and 5.
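A minimal sketch of splitting data and scoring with precision/recall, using a synthetic placeholder dataset and model (neither is part of Tulu-V2; both are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Synthetic placeholder data standing in for your pre-processed features.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Step 2: hold out a test chunk so mistakes surface on unseen examples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
preds = model.predict(X_test)

# Step 5: precision/recall as the evaluation metric.
print("precision:", precision_score(y_test, preds))
print("recall:", recall_score(y_test, preds))
```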
Potential applications include:

- Developing a speech recognition system to understand Tulu conversations
- Building a machine learning model for automatic translation from Tulu to English
- Creating an artificial intelligence-based natural language processing platform for assisting people with disabilities in understanding and navigating the world around them using Tulu as their primary language
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0), Public Domain Dedication. No copyright: you can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission.
File: train.csv

| Column name | Description                                  |
|:------------|:---------------------------------------------|
| dataset     | The name of the dataset. (String)            |
| messages    | The messages in the Tulu language. (String)  |
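A minimal sketch of inspecting train.csv with pandas, assuming the file has been downloaded to the working directory:

```python
import pandas as pd

# Load the CSV described in the table above.
df = pd.read_csv("train.csv")
print(df.columns.tolist())               # expected: ['dataset', 'messages']
print(df["dataset"].value_counts().head())  # which source datasets dominate
print(df["messages"].iloc[0])            # one conversation, stored as a string
```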
If you use this dataset in your research, please credit the original authors and Huggingface Hub.
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This project (ToolLLM) aims to construct open-source, large-scale, high-quality instruction tuning SFT data to facilitate the construction of powerful LLMs with general tool-use capability. We aim to empower open-source LLMs to master thousands of diverse real-world APIs. We achieve this by collecting a high-quality instruction-tuning dataset. It is constructed automatically using the latest ChatGPT (gpt-3.5-turbo-16k), which is upgraded with enhanced function call capabilities. We provide the dataset, the corresponding training and evaluation scripts, and a capable model ToolLLaMA fine-tuned on ToolBench.
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Datasets for training VARGPT
Dataset Structure
stage1-pt: Contains the 1.28M-sample pre-training instruction fine-tuning dataset for VARGPT.
stage2-sft: Includes datasets for the second stage of VARGPT instruction fine-tuning:
- stage2-sft/llava_v1_5_mix665k: Derived entirely from LLaVA-1.5 training data.
- stage2-sft/llava_onevision_508k: Sampled from the LLaVA-onevision Dataset.
- stage2-sft/ImageNet-Instruct-5k: Sampled from our stage3-sft/ImageNet-Instruct-130k dataset.

… See the full description on the dataset page: https://huggingface.co/datasets/VARGPT-family/VARGPT_datasets.
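A minimal sketch of pulling one subset with the datasets library; the data_dir value is an assumption based on the folder names above, and the repo's actual file layout may differ:

```python
from datasets import load_dataset

# Assumption: subsets live as folders in the repo addressable via data_dir.
ds = load_dataset(
    "VARGPT-family/VARGPT_datasets",
    data_dir="stage2-sft/llava_v1_5_mix665k",  # hypothetical path from the list above
    split="train",
)
print(ds[0])
```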
License: other (https://choosealicense.com/licenses/other/)
LIMI: Less is More for Agency
To learn more about LIMI, feel free to explore our documentation and resources. This dataset release includes the following sections:
- Data Fields: Message schema, tool definitions, and tool-call format.
- Splits: Available partitions and counts.
- Examples: Representative JSON samples.
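For orientation, a hypothetical record in a common tool-call message style; the field names here are illustrative assumptions, not the confirmed LIMI schema (see the release's Data Fields section for the real one):

```python
# Hypothetical tool-call record; field names are assumptions, not the LIMI schema.
example = {
    "messages": [
        {"role": "user", "content": "What is the weather in Paris?"},
        {
            "role": "assistant",
            "tool_calls": [
                {
                    "name": "get_weather",           # hypothetical tool name
                    "arguments": {"city": "Paris"},  # hypothetical arguments
                }
            ],
        },
        {"role": "tool", "content": "18°C, partly cloudy"},
    ]
}
print(example["messages"][1]["tool_calls"][0]["name"])
```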
News
2025.10.08: 📝 Released training scripts for Qwen3 dense models (4B/8B/32B) - check… See the full description on the dataset page: https://huggingface.co/datasets/GAIR/LIMI.
SmolTalk2
Dataset description
This dataset contains the smoltalk_everyday_convs_reasoning_Qwen3_32B_think subset from SmolTalk2. We processed the dataset using SmolLM3's chat template and made it available for the SFT exercises from the smol course. The script we used to create the dataset is available in the create_dataset.py file in this repository. You can load the dataset using from datasets import load_dataset
ds =… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/smoltalk2_everyday_convs_think.
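Completing the truncated snippet above, a minimal sketch; the repo id is taken from the dataset page URL, and the split name is an assumption:

```python
from datasets import load_dataset

# Repo id from the dataset page URL above; split name is an assumption.
ds = load_dataset("HuggingFaceTB/smoltalk2_everyday_convs_think", split="train")
print(ds)
print(ds[0]["messages"][:2])  # assumes a chat-style "messages" column
```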
This is a converted version of the Aya dataset into Tulu SFT training format. The conversion script can be found in our open-instruct repo. The conversion took the following parameters:
- apply_keyword_filters: False
- apply_empty_message_filters: False
- push_to_hub: True
- hf_entity: VGraf
- converted_dataset_name: aya_dataset_converted
- local_save_dir: None
Please refer to the original dataset for more information about this dataset and the license.
License: ODC-By (https://choosealicense.com/licenses/odc-by/)
SmolLM-Corpus
This dataset is a curated collection of high-quality educational and synthetic data designed for training small language models. You can find more details about the models trained on this dataset in our SmolLM blog post.
Dataset subsets
Cosmopedia v2
Cosmopedia v2 is an enhanced version of Cosmopedia, the largest synthetic dataset for pre-training, consisting of over 39 million textbooks, blog posts, and stories generated by… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus.
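Given the corpus size, a minimal sketch of streaming one subset rather than downloading it; the configuration name "cosmopedia-v2" is an assumption inferred from the subset title above:

```python
from datasets import load_dataset

# Stream rather than download ~39M documents; config name is an assumption.
ds = load_dataset(
    "HuggingFaceTB/smollm-corpus",
    "cosmopedia-v2",   # assumed config name for the Cosmopedia v2 subset
    split="train",
    streaming=True,
)
print(next(iter(ds)))  # one sample document
```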