This is a converted version of the Open Assistant 1 dataset into Tulu SFT training format. The conversion script can be found in our open-instruct repo. The conversion took the following parameters:
- apply_keyword_filters: False
- apply_empty_message_filters: False
- top_k: 3
- push_to_hub: True
- hf_entity: VGraf
- converted_dataset_name: None
- local_save_dir: /results
Please refer to the original dataset for more information about this dataset and the license.
This is a converted version of the bespokelabs/Bespoke-Stratos-17k dataset into Tulu SFT training format. The conversion script can be found in our open-instruct repo. The conversion took the following parameters:
- apply_keyword_filters: False
- apply_empty_message_filters: False
- push_to_hub: True
- hf_entity: natolambert
- converted_dataset_name: bespoke_stratos_17k_converted
- local_save_dir: None
Please refer to the original dataset for more information about this dataset and the license.
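These converted sets can be pulled straight from the Hub. A minimal loading sketch, assuming the repo id follows the hf_entity/converted_dataset_name pattern from the parameters above, and that the Tulu SFT format stores each conversation in a messages column of role/content turns:

```python
from datasets import load_dataset

# Repo id inferred from hf_entity + converted_dataset_name above (an assumption,
# not a confirmed path).
ds = load_dataset("natolambert/bespoke_stratos_17k_converted", split="train")

# Tulu SFT format (assumed): each row carries a list of {"role", "content"} turns.
for turn in ds[0]["messages"][:2]:
    print(turn["role"], "->", turn["content"][:80])
```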
This is a converted version of the NuminaMath-TIR subset into Tulu SFT training format. The conversion script can be found in our open-instruct repo. The conversion took the following parameters:
- apply_keyword_filters: True
- apply_empty_message_filters: True
- push_to_hub: True
- hf_entity: KKHYA
- converted_dataset_name: None
- local_save_dir: ./data/sft/numinamath
Please refer to the original TIR dataset for more information about this dataset and the license.
This is a converted version of the CoCoNoT dataset into Tulu SFT training format. The conversion script can be found in our open-instruct repo. The conversion took the following parameters:
- apply_keyword_filters: True
- apply_empty_message_filters: False
- push_to_hub: True
- hf_entity: ai2-adapt-dev
- converted_dataset_name: coconot_converted
- local_save_dir: ./data/sft/coconot
Please refer to the original dataset for more information about this dataset and the license.
This is a converted version of the Evol-CodeAlpaca dataset into Tulu SFT training format. The conversion script can be found in our open-instruct repo. The conversion took the following parameters:
- apply_keyword_filters: True
- apply_empty_message_filters: True
- push_to_hub: True
- hf_entity: KKHYA
- converted_dataset_name: evol_codealpaca_converted
- local_save_dir: ./data/sft/evol_codealpaca
The original Evol-CodeAlpaca dataset can be found here. Our conversion is based on the decontaminated… See the full description on the dataset page: https://huggingface.co/datasets/KKHYA/evol_codealpaca_converted.
This is a converted version of the WizardLM evol instruct dataset into Tulu SFT training format. The conversion script can be found in our open-instruct repo. The conversion took the following parameters:
- apply_keyword_filters: True
- apply_empty_message_filters: True
- push_to_hub: True
- hf_entity: ai2-adapt-dev
- converted_dataset_name: wizardlm_converted
- local_save_dir: ./data/sft/wizardlm
Please refer to the original dataset for more information about this dataset and the license.
This is a converted version of the Flan dataset into Tulu SFT training format. The conversion script can be found in our open-instruct repo. The conversion took the following parameters:
- apply_keyword_filters: False
- apply_empty_message_filters: False
- push_to_hub: True
- hf_entity: VGraf
- converted_dataset_name: flan_v2_converted
- local_save_dir: None
The original FLAN dataset needs extensive effort to be regenerated, so we are using a reproduced version by the OpenOrca team. More specifically… See the full description on the dataset page: https://huggingface.co/datasets/VGraf/flan_v2_converted.
This is a converted version of the no_robots dataset into Tulu SFT training format. The conversion script can be found in our open-instruct repo. The conversion took the following parameters:
- apply_keyword_filters: False
- apply_empty_message_filters: False
- push_to_hub: True
- hf_entity: ai2-adapt-dev
- converted_dataset_name: no_robots_converted
- local_save_dir: ./data/sft/no_robots
Please refer to the original dataset for more information about this dataset and the license.
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
By Huggingface Hub [source]
This dataset, Tulu-V2, is a science-oriented natural-language dataset for assistive tasks, containing a mixture of language data from research and analysis. It consists of messages in the Tulu language, enabling machine learning algorithms to train more accurate models of language usage and context. It offers researchers and analysts a valuable resource for studying linguistics, speech recognition technology, artificial intelligence applications, and more. With a wide variety of material included, from formal literature to informal conversation, the collection supports insights into how people interact with their environment through dialogue.
Once you understand the data format in this dataset, you are ready to start developing your own model design. Here are some tips on how to get started:

1. Pre-process your data by cleaning up any irrelevant parts before training your model – this ensures that only useful information is used when building the model.
2. Split your data into smaller chunks that feed into individual models – this makes it easier to find mistakes during development and reduces time spent debugging.
3. Use techniques such as feature engineering to allow for different levels of complexity in downstream tasks like classification.
4. Optimize performance by testing different parameter values across separate configurations.
5. Develop an evaluation metric appropriate for the task – consider measures such as precision and recall.

Following these steps when working with the science-based Tulu NLP data should help you reach accurate results faster than manually fine-tuning models every time. The sketch below illustrates steps 2 and 5.
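A minimal sketch of splitting data and scoring with precision/recall, using a synthetic placeholder dataset and model (neither is part of Tulu-V2; both are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Synthetic placeholder data standing in for your pre-processed features.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Step 2: hold out a test chunk so mistakes surface on unseen examples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
preds = model.predict(X_test)

# Step 5: precision/recall as the evaluation metric.
print("precision:", precision_score(y_test, preds))
print("recall:", recall_score(y_test, preds))
```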
Potential applications include:

- Developing a speech recognition system to understand Tulu conversations
- Building a machine learning model for automatic translation from Tulu to English
- Creating an artificial intelligence-based natural language processing platform for assisting people with disabilities in understanding and navigating the world around them using Tulu as their primary language
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0), Public Domain Dedication. No copyright: you can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission.
File: train.csv

| Column name | Description                                  |
|:------------|:---------------------------------------------|
| dataset     | The name of the dataset. (String)            |
| messages    | The messages in the Tulu language. (String)  |
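A minimal sketch of inspecting train.csv with pandas, assuming the file has been downloaded to the working directory:

```python
import pandas as pd

# Load the CSV described in the table above.
df = pd.read_csv("train.csv")
print(df.columns.tolist())               # expected: ['dataset', 'messages']
print(df["dataset"].value_counts().head())  # which source datasets dominate
print(df["messages"].iloc[0])            # one conversation, stored as a string
```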
If you use this dataset in your research, please credit the original authors and Huggingface Hub.
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This project (ToolLLM) aims to construct open-source, large-scale, high-quality instruction tuning SFT data to facilitate the construction of powerful LLMs with general tool-use capability. We aim to empower open-source LLMs to master thousands of diverse real-world APIs. We achieve this by collecting a high-quality instruction-tuning dataset. It is constructed automatically using the latest ChatGPT (gpt-3.5-turbo-16k), which is upgraded with enhanced function call capabilities. We provide the dataset, the corresponding training and evaluation scripts, and a capable model ToolLLaMA fine-tuned on ToolBench.
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Datasets for training VARGPT
Dataset Structure
stage1-pt: Contains the 1.28M-sample pre-training instruction fine-tuning dataset for VARGPT.
stage2-sft: Includes datasets for the second stage of VARGPT instruction fine-tuning:
- stage2-sft/llava_v1_5_mix665k: Derived entirely from LLaVA-1.5 training data.
- stage2-sft/llava_onevision_508k: Sampled from the LLaVA-onevision Dataset.
- stage2-sft/ImageNet-Instruct-5k: Sampled from our stage3-sft/ImageNet-Instruct-130k dataset.

… See the full description on the dataset page: https://huggingface.co/datasets/VARGPT-family/VARGPT_datasets.
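A minimal sketch of pulling one subset with the datasets library; the data_dir value is an assumption based on the folder names above, and the repo's actual file layout may differ:

```python
from datasets import load_dataset

# Assumption: subsets live as folders in the repo addressable via data_dir.
ds = load_dataset(
    "VARGPT-family/VARGPT_datasets",
    data_dir="stage2-sft/llava_v1_5_mix665k",  # hypothetical path from the list above
    split="train",
)
print(ds[0])
```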
License: other (https://choosealicense.com/licenses/other/)
LIMI: Less is More for Agency
To learn more about LIMI, feel free to explore our documentation and resources. This dataset release includes the following sections:
- Data Fields: Message schema, tool definitions, and tool-call format.
- Splits: Available partitions and counts.
- Examples: Representative JSON samples.
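For orientation, a hypothetical record in a common tool-call message style; the field names here are illustrative assumptions, not the confirmed LIMI schema (see the release's Data Fields section for the real one):

```python
# Hypothetical tool-call record; field names are assumptions, not the LIMI schema.
example = {
    "messages": [
        {"role": "user", "content": "What is the weather in Paris?"},
        {
            "role": "assistant",
            "tool_calls": [
                {
                    "name": "get_weather",           # hypothetical tool name
                    "arguments": {"city": "Paris"},  # hypothetical arguments
                }
            ],
        },
        {"role": "tool", "content": "18°C, partly cloudy"},
    ]
}
print(example["messages"][1]["tool_calls"][0]["name"])
```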
News
2025.10.08: 📝 Released training scripts for Qwen3 dense models (4B/8B/32B) - check… See the full description on the dataset page: https://huggingface.co/datasets/GAIR/LIMI.
SmolTalk2
Dataset description
This dataset contains the smoltalk_everyday_convs_reasoning_Qwen3_32B_think subset from SmolTalk2. We processed the dataset using SmolLM3's chat template and made it available for the SFT exercises from the smol course. The script we used to create the dataset is available in the create_dataset.py file in this repository. You can load the dataset using from datasets import load_dataset
ds =… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/smoltalk2_everyday_convs_think.
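Completing the truncated snippet above, a minimal sketch; the repo id is taken from the dataset page URL, and the split name is an assumption:

```python
from datasets import load_dataset

# Repo id from the dataset page URL above; split name is an assumption.
ds = load_dataset("HuggingFaceTB/smoltalk2_everyday_convs_think", split="train")
print(ds)
print(ds[0]["messages"][:2])  # assumes a chat-style "messages" column
```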
This is a converted version of the Aya dataset into Tulu SFT training format. The conversion script can be found in our open-instruct repo. The conversion took the following parameters:
- apply_keyword_filters: False
- apply_empty_message_filters: False
- push_to_hub: True
- hf_entity: VGraf
- converted_dataset_name: aya_dataset_converted
- local_save_dir: None
Please refer to the original dataset for more information about this dataset and the license.
License: ODC-By (https://choosealicense.com/licenses/odc-by/)
SmolLM-Corpus
This dataset is a curated collection of high-quality educational and synthetic data designed for training small language models. You can find more details about the models trained on this dataset in our SmolLM blog post.
Dataset subsets
Cosmopedia v2
Cosmopedia v2 is an enhanced version of Cosmopedia, the largest synthetic dataset for pre-training, consisting of over 39 million textbooks, blog posts, and stories generated by… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus.
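Given the corpus size, a minimal sketch of streaming one subset rather than downloading it; the configuration name "cosmopedia-v2" is an assumption inferred from the subset title above:

```python
from datasets import load_dataset

# Stream rather than download ~39M documents; config name is an assumption.
ds = load_dataset(
    "HuggingFaceTB/smollm-corpus",
    "cosmopedia-v2",   # assumed config name for the Cosmopedia v2 subset
    split="train",
    streaming=True,
)
print(next(iter(ds)))  # one sample document
```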