https://choosealicense.com/licenses/other/
Data Description
We release the training dataset of ChatQA. It is built and derived from existing datasets: DROP, NarrativeQA, NewsQA, Quoref, ROPES, SQuAD1.1, SQuAD2.0, TAT-QA, an SFT dataset, as well as our synthetic conversational QA dataset generated by GPT-3.5-turbo-0613. The SFT dataset is built and derived from: Soda, ELI5, FLAN, the FLAN collection, Self-Instruct, Unnatural Instructions, OpenAssistant, and Dolly. For more information about ChatQA, check the website!
Other… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/ChatQA-Training-Data.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Update
[01/31/2024] We updated the OpenAI Moderation API results for ToxicChat (0124) based on their moderation model updated on Jan 25, 2024.
[01/28/2024] We released an official T5-Large model trained on ToxicChat (toxicchat0124). Go check it out for your baseline comparison!
[01/19/2024] We have a new version of ToxicChat (toxicchat0124)!
Content
This dataset contains toxicity annotations on 10K user prompts collected from the Vicuna online demo. We utilize a human-AI… See the full description on the dataset page: https://huggingface.co/datasets/lmsys/toxic-chat.
https://choosealicense.com/licenses/cdla-sharing-1.0/
Bitext - Telco Tagged Training Dataset for LLM-based Virtual Assistants
Overview
This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [telco] sector can be easily achieved using our two-step approach to LLM Fine-Tuning. An overview of… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-telco-llm-chatbot-training-dataset.
Post-training-Data-Flywheel/lmsys-chat-1m dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Description
Original Repository: https://github.com/project-baize/baize-chatbot/tree/main/data This is a dataset of the training data used to train the Baize family of models. This dataset is used for instruction fine-tuning of LLMs, particularly in "chat" format. Human and AI messages are marked by [|Human|] and [|AI|] tags respectively. The data from the original repo consists of 4 datasets (alpaca, medical, quora, stackoverflow), and this dataset combines all four into… See the full description on the dataset page: https://huggingface.co/datasets/linkanjarad/baize-chat-data.
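Transcripts in this format can be split into structured turns by scanning for the two role tags. A minimal sketch, where the sample transcript is invented for illustration:

```python
import re

def parse_baize_transcript(text):
    """Split a Baize-style transcript into (role, message) turns
    using the [|Human|] and [|AI|] markers."""
    parts = re.split(r"(\[\|Human\|\]|\[\|AI\|\])", text)
    turns, role = [], None
    for part in parts:
        if part in ("[|Human|]", "[|AI|]"):
            role = "human" if part == "[|Human|]" else "ai"
        elif role and part.strip():
            turns.append((role, part.strip()))
    return turns

sample = "[|Human|] How do I sort a list in Python? [|AI|] Use the built-in sorted() function."
print(parse_baize_transcript(sample))
```

Using a capturing group in `re.split` keeps the tags in the output so each message can be paired with the speaker that precedes it.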
LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset
This dataset contains one million real-world conversations with 25 state-of-the-art LLMs. It is collected from 210K unique IP addresses in the wild on the Vicuna demo and Chatbot Arena website from April to August 2023. Each sample includes a conversation ID, model name, conversation text in OpenAI API JSON format, detected language tag, and OpenAI moderation API tag. User consent is obtained through the "Terms of use"… See the full description on the dataset page: https://huggingface.co/datasets/lmsys/lmsys-chat-1m.
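The fields listed above suggest records shaped roughly like the following. This is a hypothetical sketch: the field names and values are assumptions based on the description, not the confirmed schema from the dataset page.

```python
# Hypothetical record shaped like the fields described above
# (conversation ID, model name, OpenAI API JSON conversation,
# language tag, moderation tag); the real schema may differ.
record = {
    "conversation_id": "abc123",
    "model": "vicuna-13b",
    "conversation": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
    ],
    "language": "English",
    "openai_moderation": {"flagged": False},
}

# Extract just the user turns, e.g. for prompt-distribution analysis.
user_turns = [m["content"] for m in record["conversation"] if m["role"] == "user"]
print(user_turns)
```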
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
🦜 VideoChat-Flash-Training-Data
This repo contains all annotations and most videos for training VideoChat-Flash.
📕 How to use the LongVid data?
For a video_dir such as longvid_subset/coin_grounding_10k_zip, you need to concatenate the files in that directory into a single zip file, for example: cat ego4dhcap_eventunderstanding_2k_zip/* > ego4dhcap_eventunderstanding_2k.zip
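The cat pattern above simply reassembles a file that was distributed as byte-ranged parts. A self-contained illustration with dummy data (not the actual dataset files):

```shell
# Create a dummy binary file, split it into parts, and reassemble it
# with the same cat pattern used for the LongVid zip directories.
head -c 1000 /dev/urandom > archive.bin
split -b 300 archive.bin parts_        # produces parts_aa, parts_ab, ...
cat parts_* > rebuilt.bin              # the shell glob sorts parts in order
cmp archive.bin rebuilt.bin && echo "reassembled intact"
```

This works because `split` names parts in lexicographic order, so the glob expands them back in the original sequence.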
✏️ Citation
@article{li2024videochatflash, title={VideoChat-Flash: Hierarchical Compression for Long-Context Video… See the full description on the dataset page: https://huggingface.co/datasets/OpenGVLab/VideoChat-Flash-Training-Data.
https://choosealicense.com/licenses/cdla-sharing-1.0/
Bitext - Retail (eCommerce) Tagged Training Dataset for LLM-based Virtual Assistants
Overview
This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [Retail (eCommerce)] sector can be easily achieved using our two-step approach to LLM… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-retail-ecommerce-llm-chatbot-training-dataset.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
OpenAssistant Conversations Dataset (OASST1)
Dataset Summary
In an effort to democratize research on large-scale alignment, we release OpenAssistant Conversations (OASST1), a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages in 35 different languages, annotated with 461,292 quality ratings, resulting in over 10,000 fully annotated conversation trees. The corpus is a product of a worldwide crowd-sourcing effort… See the full description on the dataset page: https://huggingface.co/datasets/OpenAssistant/oasst1.
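OASST1 messages form conversation trees via parent links. A minimal sketch of rebuilding trees from flat rows; the sample messages here are invented, and the exact field names should be checked against the dataset card.

```python
from collections import defaultdict

# Invented sample rows; OASST1 messages link to their parent via an ID
# (roots of conversation trees have no parent).
messages = [
    {"message_id": "m1", "parent_id": None, "role": "prompter",
     "text": "How do plants make energy?"},
    {"message_id": "m2", "parent_id": "m1", "role": "assistant",
     "text": "Through photosynthesis..."},
    {"message_id": "m3", "parent_id": "m1", "role": "assistant",
     "text": "They convert sunlight..."},
]

# Index children by parent so each tree can be walked top-down.
children = defaultdict(list)
for m in messages:
    children[m["parent_id"]].append(m)

def tree_size(node_id):
    """Count all messages in the subtree below node_id."""
    return sum(1 + tree_size(c["message_id"]) for c in children[node_id])

roots = children[None]
print(len(roots), tree_size("m1"))
```

Multiple assistant replies to the same prompt (m2 and m3 above) are what make these trees rather than linear threads.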
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Post-training-Data-Flywheel/CollectiveCognition-chats-data-2023-09-22 dataset hosted on Hugging Face and contributed by the HF Datasets community
https://choosealicense.com/licenses/cc/
Chatbot Arena Conversations Dataset
This dataset contains 33K cleaned conversations with pairwise human preferences. It is collected from 13K unique IP addresses on the Chatbot Arena from April to June 2023. Each sample includes a question ID, two model names, their full conversation text in OpenAI API JSON format, the user vote, the anonymized user ID, the detected language tag, the OpenAI moderation API tag, the additional toxic tag, and the timestamp. To ensure the safe release… See the full description on the dataset page: https://huggingface.co/datasets/lmsys/chatbot_arena_conversations.
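Pairwise votes like these are commonly tallied into per-model win rates. A sketch with invented battles; the field names (model_a, model_b, winner) are assumptions, not confirmed from the dataset schema.

```python
from collections import Counter

# Invented sample battles shaped like pairwise-preference rows.
battles = [
    {"model_a": "vicuna-13b", "model_b": "alpaca-13b", "winner": "model_a"},
    {"model_a": "vicuna-13b", "model_b": "koala-13b",  "winner": "model_b"},
    {"model_a": "alpaca-13b", "model_b": "vicuna-13b", "winner": "model_b"},
]

wins, games = Counter(), Counter()
for b in battles:
    games[b["model_a"]] += 1
    games[b["model_b"]] += 1
    if b["winner"] == "model_a":
        wins[b["model_a"]] += 1
    elif b["winner"] == "model_b":
        wins[b["model_b"]] += 1
    # ties or "both bad" votes would fall through uncounted

win_rates = {m: wins[m] / games[m] for m in games}
print(win_rates)
```

Raw win rates ignore opponent strength; rating systems such as Elo or Bradley-Terry are the usual next step for leaderboards built on this kind of data.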
TOOLVERIFIER: Generalization to New Tools via Self-Verification
This repository contains the ToolSelect dataset which was used to fine-tune Llama-2 70B for tool selection.
Data
ToolSelect data is synthetic training data generated for the tool selection task using Llama-2 70B and Llama-2-Chat-70B. It consists of 555 samples corresponding to 173 tools. Each training sample is composed of a user instruction, a candidate set of tools that includes the ground truth tool, and a… See the full description on the dataset page: https://huggingface.co/datasets/facebook/toolverifier.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Card for Alpaca
Dataset Summary
Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction-tuning for language models and make the language model follow instructions better. The authors built on the data generation pipeline from the Self-Instruct framework and made the following modifications:
The text-davinci-003 engine to generate the instruction data instead… See the full description on the dataset page: https://huggingface.co/datasets/tatsu-lab/alpaca.
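Each Alpaca row (instruction, optional input, output) is typically rendered into a prompt before fine-tuning. A sketch of that formatting; the template wording is paraphrased from the Stanford Alpaca repo, so check the repo for the canonical text.

```python
# Paraphrase of the Alpaca prompt templates: one for rows with an
# input field, one for rows without.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def format_example(ex):
    """Render one Alpaca row into a training prompt."""
    if ex.get("input"):
        return PROMPT_WITH_INPUT.format(
            instruction=ex["instruction"], input=ex["input"])
    return PROMPT_NO_INPUT.format(instruction=ex["instruction"])

prompt = format_example({"instruction": "Name three primary colors.", "input": ""})
print(prompt)
```

The model's target output is then appended after the `### Response:` marker during training.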
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is the blind eval dataset of high-quality, diverse, human-written instructions with demonstrations. We will be using this for step 3 evaluations in our RLHF pipeline.
https://choosealicense.com/licenses/other/
Amod/mental_health_counseling_conversations
Dataset Summary
This dataset is a collection of real counselling question-and-answer pairs taken from two public mental-health platforms. It is intended for training and evaluating language models that provide safer, context-aware mental-health responses.
Supported Tasks
Text generation and question-answering with an advice-giving focus.
Languages
English (en)
Dataset Structure
Data… See the full description on the dataset page: https://huggingface.co/datasets/Amod/mental_health_counseling_conversations.
Dataset Card for "sales-conversations"
This dataset was created for the purpose of training a sales agent chatbot that can convince people. The initial idea came from the paper "Textbooks Are All You Need" (https://arxiv.org/abs/2306.11644); gpt-3.5-turbo was used for the generation.
Structure
The conversations alternate between a customer and a salesman: customer, salesman, customer, salesman, etc. The customer always starts the conversation. Who ends the… See the full description on the dataset page: https://huggingface.co/datasets/goendalf666/sales-conversations.
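Because the alternation is fixed and the customer always speaks first, roles can be assigned from turn position alone. A minimal sketch with an invented conversation:

```python
def label_turns(utterances):
    """Assign roles by position: the customer always speaks first,
    then roles strictly alternate."""
    roles = ("customer", "salesman")
    return [(roles[i % 2], u) for i, u in enumerate(utterances)]

convo = [
    "Hi, I'm looking at your premium plan.",
    "Great choice! It includes priority support.",
    "How much does it cost per month?",
]
print(label_turns(convo))
```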
Dataset Card for Fujitsu Primergy Expert Training Data
This dataset contains all the files used to fine-tune and test the LLaMA 7B Chat Fujitsu Primergy Expert LLM. The model can be accessed under https://huggingface.co/MaxSchwrzr/LLama-2-7B-Chat-Primergy-Expert.
Dataset Sources
The dataset is created from public data on Fujitsu Primergy servers. The data can be accessed under https://www.fujitsu.com/de/products/computing/servers/primergy/index.html. The data has been… See the full description on the dataset page: https://huggingface.co/datasets/MaxSchwrzr/Fujitsu-Primergy-Expert-Training-Data.
Small Chat
This is a small multi-turn chat dataset based on publicly available data sources. It contains approximately 5,000 examples, allowing for quick training and evaluation of base models to assess their chat capabilities. The dataset prioritizes quality over quantity. It was compiled using the following sources:
HuggingFaceH4/OpenHermes-2.5-1k-longest LDJnr/Pure-Dove
I performed some preprocessing to merge the datasets into a consistent format and to remove artifacts from… See the full description on the dataset page: https://huggingface.co/datasets/SvenStahlmannNLP/small-chat.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for "Collective Cognition ChatGPT Conversations"
Dataset Description
Dataset Summary
The "Collective Cognition ChatGPT Conversations" dataset is a collection of chat logs between users and the ChatGPT model. These conversations have been shared by users on the "Collective Cognition" website. The dataset provides insights into user interactions with language models and can be utilized for multiple purposes, including training, research, and… See the full description on the dataset page: https://huggingface.co/datasets/CollectiveCognition/chats-data-2023-10-16.
https://choosealicense.com/licenses/llama2/
llama-instruct
This dataset was used to finetune Llama-2-7B-32K-Instruct. We follow the distillation paradigm used by Alpaca, Vicuna, WizardLM, and Orca: producing instructions by querying a powerful LLM, which in our case is the Llama-2-70B-Chat model released by Meta. To build Llama-2-7B-32K-Instruct, we collect instructions from 19K human inputs extracted from ShareGPT-90K (only using human inputs, not ChatGPT outputs). The actual script handles multi-turn conversations… See the full description on the dataset page: https://huggingface.co/datasets/togethercomputer/llama-instruct.