44 datasets found
  1. ChatQA-Training-Data

    • huggingface.co
    Updated Jun 30, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    NVIDIA (2023). ChatQA-Training-Data [Dataset]. https://huggingface.co/datasets/nvidia/ChatQA-Training-Data
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 30, 2023
    Dataset provided by
    Nvidiahttp://nvidia.com/
    Authors
    NVIDIA
    License

    https://choosealicense.com/licenses/other/https://choosealicense.com/licenses/other/

    Description

    Data Description

    We release the training dataset of ChatQA. It is built and derived from existing datasets: DROP, NarrativeQA, NewsQA, Quoref, ROPES, SQuAD1.1, SQuAD2.0, TAT-QA, a SFT dataset, as well as a our synthetic conversational QA dataset by GPT-3.5-turbo-0613. The SFT dataset is built and derived from: Soda, ELI5, FLAN, the FLAN collection, Self-Instruct, Unnatural Instructions, OpenAssistant, and Dolly. For more information about ChatQA, check the website!

      Other… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/ChatQA-Training-Data.
    
  2. h

    toxic-chat

    • huggingface.co
    Updated Jan 25, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Large Model Systems Organization (2024). toxic-chat [Dataset]. https://huggingface.co/datasets/lmsys/toxic-chat
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 25, 2024
    Dataset authored and provided by
    Large Model Systems Organization
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Update

    [01/31/2024] We update the OpenAI Moderation API results for ToxicChat (0124) based on their updated moderation model on on Jan 25, 2024.[01/28/2024] We release an official T5-Large model trained on ToxicChat (toxicchat0124). Go and check it for you baseline comparision![01/19/2024] We have a new version of ToxicChat (toxicchat0124)!

      Content
    

    This dataset contains toxicity annotations on 10K user prompts collected from the Vicuna online demo. We utilize a human-AI… See the full description on the dataset page: https://huggingface.co/datasets/lmsys/toxic-chat.

  3. h

    Bitext-telco-llm-chatbot-training-dataset

    • huggingface.co
    Updated Jan 10, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Bitext (2025). Bitext-telco-llm-chatbot-training-dataset [Dataset]. https://huggingface.co/datasets/bitext/Bitext-telco-llm-chatbot-training-dataset
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 10, 2025
    Dataset authored and provided by
    Bitext
    License

    https://choosealicense.com/licenses/cdla-sharing-1.0/https://choosealicense.com/licenses/cdla-sharing-1.0/

    Description

    Bitext - Telco Tagged Training Dataset for LLM-based Virtual Assistants

      Overview
    

    This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [telco] sector can be easily achieved using our two-step approach to LLM Fine-Tuning. An overview of… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-telco-llm-chatbot-training-dataset.

  4. h

    lmsys-chat-1m

    • huggingface.co
    Updated Sep 12, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Post-training-Data-Flywheel (2024). lmsys-chat-1m [Dataset]. https://huggingface.co/datasets/Post-training-Data-Flywheel/lmsys-chat-1m
    Explore at:
    Dataset updated
    Sep 12, 2024
    Dataset authored and provided by
    Post-training-Data-Flywheel
    Description

    Post-training-Data-Flywheel/lmsys-chat-1m dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. h

    baize-chat-data

    • huggingface.co
    Updated Jul 20, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Link An Jarad (2023). baize-chat-data [Dataset]. https://huggingface.co/datasets/linkanjarad/baize-chat-data
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 20, 2023
    Authors
    Link An Jarad
    Description

    Dataset Description

    Original Repository: https://github.com/project-baize/baize-chatbot/tree/main/data This is a dataset of the training data used to train the Baize family of models. This dataset is used for instruction fine-tuning of LLMs, particularly in "chat" format. Human and AI messages are marked by [|Human|] and [|AI|] tags respectively. The data from the orignial repo consists of 4 datasets (alpaca, medical, quora, stackoverflow), and this dataset combines all four into… See the full description on the dataset page: https://huggingface.co/datasets/linkanjarad/baize-chat-data.

  6. h

    lmsys-chat-1m

    • huggingface.co
    • opendatalab.com
    Updated Sep 17, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Large Model Systems Organization (2023). lmsys-chat-1m [Dataset]. https://huggingface.co/datasets/lmsys/lmsys-chat-1m
    Explore at:
    Dataset updated
    Sep 17, 2023
    Dataset authored and provided by
    Large Model Systems Organization
    Description

    LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset

    This dataset contains one million real-world conversations with 25 state-of-the-art LLMs. It is collected from 210K unique IP addresses in the wild on the Vicuna demo and Chatbot Arena website from April to August 2023. Each sample includes a conversation ID, model name, conversation text in OpenAI API JSON format, detected language tag, and OpenAI moderation API tag. User consent is obtained through the "Terms of use"… See the full description on the dataset page: https://huggingface.co/datasets/lmsys/lmsys-chat-1m.

  7. h

    VideoChat-Flash-Training-Data

    • huggingface.co
    Updated Apr 20, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    OpenGVLab (2025). VideoChat-Flash-Training-Data [Dataset]. https://huggingface.co/datasets/OpenGVLab/VideoChat-Flash-Training-Data
    Explore at:
    Dataset updated
    Apr 20, 2025
    Dataset authored and provided by
    OpenGVLab
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    🦜 VideoChat-Flash-Training-Data

    This repos contains all annotaions and most videos for training VideoChat-Flash.

      📕 How to use the LongVid data?
    

    For video_dir like longvid_subset/coin_grounding_10k_zip, you need to concat this dir to a zip file as follows: cat ego4dhcap_eventunderstanding_2k_zip/* > ego4dhcap_eventunderstanding_2k.zip

      ✏️ Citation
    

    @article{li2024videochatflash, title={VideoChat-Flash: Hierarchical Compression for Long-Context Video… See the full description on the dataset page: https://huggingface.co/datasets/OpenGVLab/VideoChat-Flash-Training-Data.

  8. h

    Bitext-retail-ecommerce-llm-chatbot-training-dataset

    • huggingface.co
    Updated Aug 6, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Bitext (2024). Bitext-retail-ecommerce-llm-chatbot-training-dataset [Dataset]. https://huggingface.co/datasets/bitext/Bitext-retail-ecommerce-llm-chatbot-training-dataset
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 6, 2024
    Dataset authored and provided by
    Bitext
    License

    https://choosealicense.com/licenses/cdla-sharing-1.0/https://choosealicense.com/licenses/cdla-sharing-1.0/

    Description

    Bitext - Retail (eCommerce) Tagged Training Dataset for LLM-based Virtual Assistants

      Overview
    

    This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [Retail (eCommerce)] sector can be easily achieved using our two-step approach to LLM… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-retail-ecommerce-llm-chatbot-training-dataset.

  9. h

    oasst1

    • huggingface.co
    • paperswithcode.com
    Updated Apr 12, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    OpenAssistant (2023). oasst1 [Dataset]. https://huggingface.co/datasets/OpenAssistant/oasst1
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 12, 2023
    Dataset authored and provided by
    OpenAssistant
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    OpenAssistant Conversations Dataset (OASST1)

      Dataset Summary
    

    In an effort to democratize research on large-scale alignment, we release OpenAssistant Conversations (OASST1), a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages in 35 different languages, annotated with 461,292 quality ratings, resulting in over 10,000 fully annotated conversation trees. The corpus is a product of a worldwide crowd-sourcing effort… See the full description on the dataset page: https://huggingface.co/datasets/OpenAssistant/oasst1.

  10. h

    CollectiveCognition-chats-data-2023-09-22

    • huggingface.co
    Updated Sep 22, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Post-training-Data-Flywheel (2023). CollectiveCognition-chats-data-2023-09-22 [Dataset]. https://huggingface.co/datasets/Post-training-Data-Flywheel/CollectiveCognition-chats-data-2023-09-22
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 22, 2023
    Dataset authored and provided by
    Post-training-Data-Flywheel
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Post-training-Data-Flywheel/CollectiveCognition-chats-data-2023-09-22 dataset hosted on Hugging Face and contributed by the HF Datasets community

  11. h

    chatbot_arena_conversations

    • huggingface.co
    Updated Jul 18, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Large Model Systems Organization (2023). chatbot_arena_conversations [Dataset]. https://huggingface.co/datasets/lmsys/chatbot_arena_conversations
    Explore at:
    Dataset updated
    Jul 18, 2023
    Dataset authored and provided by
    Large Model Systems Organization
    License

    https://choosealicense.com/licenses/cc/https://choosealicense.com/licenses/cc/

    Description

    Chatbot Arena Conversations Dataset

    This dataset contains 33K cleaned conversations with pairwise human preferences. It is collected from 13K unique IP addresses on the Chatbot Arena from April to June 2023. Each sample includes a question ID, two model names, their full conversation text in OpenAI API JSON format, the user vote, the anonymized user ID, the detected language tag, the OpenAI moderation API tag, the additional toxic tag, and the timestamp. To ensure the safe release… See the full description on the dataset page: https://huggingface.co/datasets/lmsys/chatbot_arena_conversations.

  12. h

    toolverifier

    • huggingface.co
    Updated Mar 14, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    AI at Meta (2024). toolverifier [Dataset]. https://huggingface.co/datasets/facebook/toolverifier
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 14, 2024
    Dataset authored and provided by
    AI at Meta
    Description

    TOOLVERIFIER: Generalization to New Tools via Self-Verification

    This repository contains the ToolSelect dataset which was used to fine-tune Llama-2 70B for tool selection.

      Data
    

    ToolSelect data is synthetic training data generated for tool selection task using Llama-2 70B and Llama-2-Chat-70B. It consists of 555 samples corresponding to 173 tools. Each training sample is composed of a user instruction, a candidate set of tools that includes the ground truth tool, and a… See the full description on the dataset page: https://huggingface.co/datasets/facebook/toolverifier.

  13. h

    alpaca

    • huggingface.co
    • opendatalab.com
    Updated Mar 14, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Tatsu Lab (2023). alpaca [Dataset]. https://huggingface.co/datasets/tatsu-lab/alpaca
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 14, 2023
    Dataset authored and provided by
    Tatsu Lab
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Dataset Card for Alpaca

      Dataset Summary
    

    Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction-tuning for language models and make the language model follow instruction better. The authors built on the data generation pipeline from Self-Instruct framework and made the following modifications:

    The text-davinci-003 engine to generate the instruction data instead… See the full description on the dataset page: https://huggingface.co/datasets/tatsu-lab/alpaca.

  14. instruction-dataset

    • huggingface.co
    • opendatalab.com
    Updated Feb 10, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Hugging Face H4 (2023). instruction-dataset [Dataset]. https://huggingface.co/datasets/HuggingFaceH4/instruction-dataset
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 10, 2023
    Dataset provided by
    Hugging Facehttps://huggingface.co/
    Authors
    Hugging Face H4
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This is the blind eval dataset of high-quality, diverse, human-written instructions with demonstrations. We will be using this for step 3 evaluations in our RLHF pipeline.

  15. h

    mental_health_counseling_conversations

    • huggingface.co
    • opendatalab.com
    Updated Jul 6, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Amod (2025). mental_health_counseling_conversations [Dataset]. http://doi.org/10.57967/hf/1581
    Explore at:
    Dataset updated
    Jul 6, 2025
    Authors
    Amod
    License

    https://choosealicense.com/licenses/other/https://choosealicense.com/licenses/other/

    Description

    Amod/mental_health_counseling_conversations

      Dataset Summary
    

    This dataset is a collection of real counselling question-and-answer pairs taken from two public mental-health platforms.It is intended for training and evaluating language models that provide safer, context-aware mental-health responses.

      Supported Tasks
    

    Text generation and question-answering with an advice-giving focus.

      Languages
    

    English (en)

      Dataset Structure
    
    
    
    
    
      Data… See the full description on the dataset page: https://huggingface.co/datasets/Amod/mental_health_counseling_conversations.
    
  16. h

    sales-conversations

    • huggingface.co
    Updated Sep 28, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    ENGEL (2023). sales-conversations [Dataset]. https://huggingface.co/datasets/goendalf666/sales-conversations
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 28, 2023
    Authors
    ENGEL
    Description

    Dataset Card for "sales-conversations"

    This dataset was created for the purpose of training a sales agent chatbot that can convince people. The initial idea came from: textbooks is all you need https://arxiv.org/abs/2306.11644 gpt-3.5-turbo was used for the generation

      Structure
    

    The conversations have a customer and a salesman which appear always in changing order. customer, salesman, customer, salesman, etc. The customer always starts the conversation Who ends the… See the full description on the dataset page: https://huggingface.co/datasets/goendalf666/sales-conversations.

  17. h

    Fujitsu-Primergy-Expert-Training-Data

    • huggingface.co
    Updated Jun 21, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Max Schwarzer (2024). Fujitsu-Primergy-Expert-Training-Data [Dataset]. https://huggingface.co/datasets/MaxSchwrzr/Fujitsu-Primergy-Expert-Training-Data
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 21, 2024
    Authors
    Max Schwarzer
    Description

    Dataset Card for Fujitsu Primergy Expert Training Data

    This dataset contains all the files used to fine-tune and test the LLaMA 7B Chat Fujitsu Primergy Expert LLM. The model can be accessed under https://huggingface.co/MaxSchwrzr/LLama-2-7B-Chat-Primergy-Expert.

      Dataset Sources
    

    The dataset is created on public data of Fujitsu Primergy Server. The data can accessed under https://www.fujitsu.com/de/products/computing/servers/primergy/index.html. The data has been… See the full description on the dataset page: https://huggingface.co/datasets/MaxSchwrzr/Fujitsu-Primergy-Expert-Training-Data.

  18. h

    Data from: small-chat

    • huggingface.co
    Updated Sep 12, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Sven Stahlmann (2024). small-chat [Dataset]. https://huggingface.co/datasets/SvenStahlmannNLP/small-chat
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 12, 2024
    Authors
    Sven Stahlmann
    Description

    Small Chat

    This is a small multi-turn chat dataset based on publicly available data sources. It contains approximately 5,000 examples, allowing for quick training and evaluation of base models to assess their chat capabilities. The dataset prioritizes quality over quantity. It was compiled using the following sources:

    HuggingFaceH4/OpenHermes-2.5-1k-longest LDJnr/Pure-Dove

    I performed some preprocessing to merge the datasets into a consistent format and to remove artifacts from… See the full description on the dataset page: https://huggingface.co/datasets/SvenStahlmannNLP/small-chat.

  19. h

    chats-data-2023-10-16

    • huggingface.co
    Updated Oct 16, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Collective Cognition (2023). chats-data-2023-10-16 [Dataset]. https://huggingface.co/datasets/CollectiveCognition/chats-data-2023-10-16
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 16, 2023
    Dataset authored and provided by
    Collective Cognition
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for "Collective Cognition ChatGPT Conversations"

      Dataset Description
    
    
    
    
    
      Dataset Summary
    

    The "Collective Cognition ChatGPT Conversations" dataset is a collection of chat logs between users and the ChatGPT model. These conversations have been shared by users on the "Collective Cognition" website. The dataset provides insights into user interactions with language models and can be utilized for multiple purposes, including training, research, and… See the full description on the dataset page: https://huggingface.co/datasets/CollectiveCognition/chats-data-2023-10-16.

  20. h

    llama-instruct

    • huggingface.co
    Updated Aug 18, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Together (2023). llama-instruct [Dataset]. https://huggingface.co/datasets/togethercomputer/llama-instruct
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 18, 2023
    Dataset authored and provided by
    Together
    License

    https://choosealicense.com/licenses/llama2/https://choosealicense.com/licenses/llama2/

    Description

    llama-instruct

    This dataset was used to finetune Llama-2-7B-32K-Instruct. We follow the distillation paradigm that is used by Alpaca, Vicuna, WizardLM, Orca — producing instructions by querying a powerful LLM, which in our case, is the Llama-2-70B-Chat model released by Meta. To build Llama-2-7B-32K-Instruct, we collect instructions from 19K human inputs extracted from ShareGPT-90K (only using human inputs, not ChatGPT outputs). The actual script handles multi-turn conversations… See the full description on the dataset page: https://huggingface.co/datasets/togethercomputer/llama-instruct.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
NVIDIA (2023). ChatQA-Training-Data [Dataset]. https://huggingface.co/datasets/nvidia/ChatQA-Training-Data
Organization logo

ChatQA-Training-Data

nvidia/ChatQA-Training-Data

Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 30, 2023
Dataset provided by
Nvidiahttp://nvidia.com/
Authors
NVIDIA
License

https://choosealicense.com/licenses/other/https://choosealicense.com/licenses/other/

Description

Data Description

We release the training dataset of ChatQA. It is built and derived from existing datasets: DROP, NarrativeQA, NewsQA, Quoref, ROPES, SQuAD1.1, SQuAD2.0, TAT-QA, a SFT dataset, as well as a our synthetic conversational QA dataset by GPT-3.5-turbo-0613. The SFT dataset is built and derived from: Soda, ELI5, FLAN, the FLAN collection, Self-Instruct, Unnatural Instructions, OpenAssistant, and Dolly. For more information about ChatQA, check the website!

  Other… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/ChatQA-Training-Data.
Search
Clear search
Close search
Google apps
Main menu