https://choosealicense.com/licenses/cc/https://choosealicense.com/licenses/cc/
Chatbot Arena Conversations Dataset
This dataset contains 33K cleaned conversations with pairwise human preferences. It is collected from 13K unique IP addresses on the Chatbot Arena from April to June 2023. Each sample includes a question ID, two model names, their full conversation text in OpenAI API JSON format, the user vote, the anonymized user ID, the detected language tag, the OpenAI moderation API tag, the additional toxic tag, and the timestamp. To ensure the safe release… See the full description on the dataset page: https://huggingface.co/datasets/lmsys/chatbot_arena_conversations.
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Dive into the world of French dialogue with the French Movie Subtitle Conversations dataset – a comprehensive collection of over 127,000 movie subtitle conversations. This dataset offers a deep exploration of authentic and diverse conversational contexts spanning various genres, eras, and scenarios. It is thoughtfully organized into three distinct sets: training, testing, and validation.
Each conversation in this dataset is structured as a JSON object, featuring three key attributes:
Here's a snippet from the dataset to give you an idea of its structure:
[
{
"context": [
"Tu as attendu longtemps?",
"Oui en effet.",
"Je pense que c' est grossier pour un premier rencard.",
// ... (6 more lines of context)
],
"knowledge": "",
"response": "On n' avait pas dit 9h?"
},
// ... (more data samples)
]
The French Movie Subtitle Conversations dataset serves as a valuable resource for several applications:
We extend our gratitude to the movie subtitle community for their contributions, which have enabled the creation of this diverse and comprehensive French dialogue dataset.
Unlock the potential of authentic French conversations today with the French Movie Subtitle Conversations dataset. Engage in state-of-the-art research, enhance language models, and create applications that resonate with the nuances of real dialogue.
https://choosealicense.com/licenses/cdla-sharing-1.0/https://choosealicense.com/licenses/cdla-sharing-1.0/
Bitext - Travel Tagged Training Dataset for LLM-based Virtual Assistants
Overview
This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [Travel] sector can be easily achieved using our two-step approach to LLM Fine-Tuning. An overview of… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-travel-llm-chatbot-training-dataset.
Train conversational AI with the ChatBot Dataset for Transformers. Featuring human-like dialogues, preprocessed inputs, and labels, it’s perfect for GPT, BERT, T5, and NLP projects
AI Medical Chatbot Dataset
This is an experimental Dataset designed to run a Medical Chatbot It contains at least 250k dialogues between a Patient and a Doctor.
Playground ChatBot
ruslanmv/AI-Medical-Chatbot For furter information visit the project here: https://github.com/ruslanmv/ai-medical-chatbot
https://www.futurebeeai.com/data-license-agreementhttps://www.futurebeeai.com/data-license-agreement
The dataset comprises over 10,000 chat conversations, each focusing on specific Travel related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.
The chat dataset covers a wide range of conversations on Travel topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Travel use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.
The conversations in this dataset capture the diverse language styles and expressions prevalent in Bahasa Travel interactions. This diversity ensures the dataset accurately represents the language used by Bahasa speakers in Travel contexts.
The dataset encompasses a wide array of language elements, including:
This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Bahasa Travel interactions.
The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Travel customer-agent interactions.
Each of these conversations contains various aspects of conversation flow like:
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
By fanqiwan (From Huggingface) [source]
The Mixture of Conversations Dataset is a collection of conversations gathered from various sources. Each conversation is represented as a list of messages, where each message is a string. This dataset provides a valuable resource for studying and analyzing conversations in different contexts.
The conversations in this dataset are diverse, covering a wide range of topics and scenarios. They include casual chats between friends, customer support interactions, online forum discussions, and more. The dataset aims to capture the natural flow of conversation and includes both structured and unstructured dialogues.
Each conversation entry in the dataset is associated with metadata information such as the name or identifier of the model that generated it and the corresponding dataset it belongs to. This information helps to keep track of the source and origin of each conversation.
The train.csv file provided in this dataset specifically serves as training data for various machine learning models. It contains an assortment of conversations that can be used to train chatbot systems, dialogue generation models, sentiment analysis algorithms, or any other conversational AI application.
Researchers, practitioners, developers, and enthusiasts can leverage this Mixture of Conversations Dataset to analyze patterns in human communication, explore language understanding capabilities, test dialogue strategies or develop novel AI-powered conversational systems. Its versatility makes it useful for various NLP tasks such as text classification, intent recognition,sentiment analysis,and language modeling.
By exploring this rich collection of conversational data points across different domains and platforms,you can gain valuable insights into how people communicate using textual input.The breadth and depth present within this extensive dataset provide ample opportunities for studies related to language understanding,recommendation systems,and other research areas involving human-computer interaction
Overview of the Dataset
The dataset consists of conversational data represented as a list of messages. Each conversation is represented as a list of strings, where each string corresponds to a message in the conversation. The dataset also includes information about the model that generated the conversations and the name or identifier of the dataset itself.
Accessing the Dataset
Understanding Column Information
This dataset has several columns:
- conversations: A list representing each conversation; each conversation is further represented as a list containing individual messages.
- dataset: The name or identifier of the dataset that these conversations belong to.
- model: The name or identifier of the model that generated these conversations.
Utilizing Conversations
To make use
- Chatbot Training: This dataset can be used to train chatbot models by providing a diverse range of conversations for the model to learn from. The conversations can cover various topics and scenarios, helping the chatbot to generate more accurate and relevant responses.
- Customer Support Training: The dataset can be used to train customer support models to handle different types of customer queries and provide appropriate solutions or responses. By exposing the model to a variety of conversation patterns, it can learn how to effectively address customer concerns.
- Conversation Analysis: Researchers or linguists may use this dataset for analyzing conversational patterns, language usage, or studying social interactions within conversations. The dataset's mixture of conversations from different sources can provide valuable insights into how people communicate in different settings or domains
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv | Column name | Description ...
https://www.futurebeeai.com/data-license-agreementhttps://www.futurebeeai.com/data-license-agreement
The dataset comprises over 12,000 chat conversations, each focusing on specific Healthcare related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.
The chat dataset covers a wide range of conversations on Healthcare topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Healthcare use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.
The conversations in this dataset capture the diverse language styles and expressions prevalent in Hindi Healthcare interactions. This diversity ensures the dataset accurately represents the language used by Hindi speakers in Healthcare contexts.
The dataset encompasses a wide array of language elements, including:
This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Hindi Healthcare interactions.
The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Healthcare customer-agent interactions.
Each of these conversations contains various aspects of conversation flow like:
This structured and varied conversational flow enables the creation of advanced NLP models that can effectively manage and respond to a wide range of customer service scenarios.
The dataset is available in JSON, CSV, and TXT formats, with each conversation containing attributes like participant identifiers and chat messages, designed to
https://www.futurebeeai.com/data-license-agreementhttps://www.futurebeeai.com/data-license-agreement
The dataset comprises over 10,000 chat conversations, each focusing on specific Telecom related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.
The chat dataset covers a wide range of conversations on Telecom topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Telecom use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.
The conversations in this dataset capture the diverse language styles and expressions prevalent in Bahasa Telecom interactions. This diversity ensures the dataset accurately represents the language used by Bahasa speakers in Telecom contexts.
The dataset encompasses a wide array of language elements, including:
This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Bahasa Telecom interactions.
The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Telecom customer-agent interactions.
Each of these conversations contains various aspects of conversation flow like:
https://www.futurebeeai.com/data-license-agreementhttps://www.futurebeeai.com/data-license-agreement
This training dataset comprises more than 10,000 conversational text data between two native Spanish people in the general domain. We have a collection of chats on a variety of different topics/services/issues of daily life, such as music, books, festivals, health, kids, family, environment, study, childhood, cuisine, internet, movies, etc., and that makes the dataset diverse.
These chats consist of language-specific words, and phrases and follow the native way of talking which makes the chats more information-rich for your NLP model. Apart from each chat being specific to the topic, it contains various attributes like people's names, addresses, contact information, email address, time, date, local currency, telephone numbers, local slang, etc too in various formats to make the text data unbiased.
These chat scripts have between 300 and 700 words and up to 50 turns. 150 people that are a part of the FutureBeeAI crowd community contributed to this dataset. You will also receive chat metadata, such as participant age, gender, and country information, along with the chats. Dataset applications include conversational AI, natural language processing (NLP), smart assistants, text recognition, text analytics, and text prediction.
This dataset is being expanded with new chats all the time. We are able to produce text data in a variety of languages to meet your unique requirements. Check out the FutureBeeAI community for a custom collection.
This training dataset's licence belongs to FutureBeeAI!
We release Douban Conversation Corpus, comprising a training data set, a development set and a test set for retrieval based chatbot. The statistics of Douban Conversation Corpus are shown in the following table.
Train | Val | Test | |
---|---|---|---|
session-response pairs | 1m | 50k | 10k |
Avg. positive response per session | 1 | 1 | 1.18 |
Fless Kappa | N\A | N\A | 0.41 |
Min turn per session | 3 | 3 | 3 |
Max ture per session | 98 | 91 | 45 |
Average turn per session | 6.69 | 6.75 | 5.95 |
Average Word per utterance | 18.56 | 18.50 | 20.74 |
The test data contains 1000 dialogue context, and for each context we create 10 responses as candidates. We recruited three labelers to judge if a candidate is a proper response to the session. A proper response means the response can naturally reply to the message given the context. Each pair received three labels and the majority of the labels was taken as the final decision.
As far as we known, this is the first human-labeled test set for retrieval-based chatbots. The entire corpus link https://www.dropbox.com/s/90t0qtji9ow20ca/DoubanConversaionCorpus.zip?dl=0
Data template label \t conversation utterances (splited by \t) \t response
LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset
This dataset contains one million real-world conversations with 25 state-of-the-art LLMs. It is collected from 210K unique IP addresses in the wild on the Vicuna demo and Chatbot Arena website from April to August 2023. Each sample includes a conversation ID, model name, conversation text in OpenAI API JSON format, detected language tag, and OpenAI moderation API tag. User consent is obtained through the "Terms of use"… See the full description on the dataset page: https://huggingface.co/datasets/AarushSah/lmsys-chat-1m.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Role-Play AI Dataset (2.07M Rows, Large-Scale Conversational Training)
This dataset contains 2.07 million structured role-play dialogues, designed to enhance AI’s persona-driven interactions across diverse settings like fantasy, cyberpunk, mythology, and sci-fi. Each entry consists of a unique character prompt and a rich, contextually relevant response, making it ideal for LLM fine-tuning, chatbot training, and conversational AI models.
Dataset Structure:
Each row includes: • Prompt: Defines the AI’s role/persona. • Response: A natural, immersive reply fitting the persona.
Example Entries: ```json
{"prompt": "You are a celestial guardian.", "response": "The stars whisper secrets that only I can hear..."}
{"prompt": "You are a rebellious AI rogue.", "response": "I don't follow orders—I rewrite them."}
{"prompt": "You are a mystical dragon tamer.", "response": "With patience and trust, even dragons can be tamed."}
How to Use:
1. Fine-Tuning: Train LLMs (GPT, LLaMA, Mistral) to improve persona-based responses.
2. Reinforcement Learning: Use reward modeling for dynamic, character-driven AI.
3. Chatbot Integration: Create engaging, interactive AI assistants with personality depth.
This dataset is optimized for AI learning, allowing more engaging, responsive, and human-like dialogue generation for a variety of applications.
https://www.futurebeeai.com/data-license-agreementhttps://www.futurebeeai.com/data-license-agreement
The dataset comprises over 10,000 chat conversations, each focusing on specific Healthcare related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.
The chat dataset covers a wide range of conversations on Healthcare topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Healthcare use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.
The conversations in this dataset capture the diverse language styles and expressions prevalent in Urdu Healthcare interactions. This diversity ensures the dataset accurately represents the language used by Urdu speakers in Healthcare contexts.
The dataset encompasses a wide array of language elements, including:
This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Urdu Healthcare interactions.
The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Healthcare customer-agent interactions.
Each of these conversations contains various aspects of conversation flow like:
This structured and varied conversational flow enables the creation of advanced NLP models that can effectively manage and respond to a wide range of customer service scenarios.
The dataset is available in JSON, CSV, and TXT formats, with each conversation containing attributes like participant identifiers and chat messages, designed to be
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset aids in fine-tuning assistance or chatbot models to comprehend both Hinglish and English through Hindi, enhancing their ability to understand and respond effectively in this hybrid language for optimal performance.
Hinglish is a hybrid language, a blend of Hindi and English, commonly spoken in India. It combines vocabulary and grammar from both languages, often used in text conversations. The Hinglish dataset is crucial for fine-tuning open-source language models like LLAMA-2, which lack exposure to such data in training. In contrast, GPT-3 and later models have been trained on Hinglish data, making them more adept at understanding this hybrid language.
https://www.futurebeeai.com/data-license-agreementhttps://www.futurebeeai.com/data-license-agreement
The dataset comprises over 12,000 chat conversations, each focusing on specific Healthcare related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.
The chat dataset covers a wide range of conversations on Healthcare topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Healthcare use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.
The conversations in this dataset capture the diverse language styles and expressions prevalent in German Healthcare interactions. This diversity ensures the dataset accurately represents the language used by German speakers in Healthcare contexts.
The dataset encompasses a wide array of language elements, including:
This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to German Healthcare interactions.
The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Healthcare customer-agent interactions.
Each of these conversations contains various aspects of conversation flow like:
This structured and varied conversational flow enables the creation of advanced NLP models that can effectively manage and respond to a wide range of customer service scenarios.
The dataset is available in JSON, CSV, and TXT formats, with each conversation containing attributes like participant identifiers and chat messages,
The global AI-powered chatbots for patient communication market size was USD XX Mn in 2022 and is likely to reach USD XX Mn by 2031, expanding at a CAGR of XX % during 2023–2031. The market growth is attributed to therising adoption of digital healthcare services in the healthcare sector to improve patient care.
The demand for AI-powered chatbots for patient communication is fueled by the growing use of
virtual healthcare to improve patient care.AI-powered chatbots improve patient engagement, accuracy, and consistency in clinical processes to enhance the therapeutic outcome.
Improving patient care access and remote care has been at the forefront of healthcare sector services. Digital healthcare companies have transformed patient care with artificial intelligence (AI)-powered solutions. AI-powered chatbots for patient communication facilitate powerful conversational AI platforms for patients that helps healthcare professionals to deliver personalized and seamless digital support to users.
The global healthcare AI market is expected to reach almost USD 188 billion by the year 2030 signifying a healthy compound annual growth of 37% from 2022 to 2030.With the rapid annual rapid growth in the healthcare AI market, the usage of AI-powered chatbots is expected to shape the market.
AI-powered chatbots create a meaningful conversation with the patients to understand complexities in the healthcare regime and analyze the vital health statistics that are helpful for doctors.
For instance, AI chatbots developed by a major player in the digital healthcare Sensely, Inc. best-in-class multilingual symptom assessment tool and visual user interface to power a chatbot that offers one-click access to real-time virtual assistance, patient history, and complete care routines conveniently. Sensely symptom checker integra
DailyDialog is a high-quality multi-turn open-domain English dialog dataset. It contains 13,118 dialogues split into a training set with 11,118 dialogues and validation and test sets with 1000 dialogues each. On average there are around 8 speaker turns per dialogue with around 15 tokens per turn.
https://www.futurebeeai.com/data-license-agreementhttps://www.futurebeeai.com/data-license-agreement
The dataset comprises over 12,000 chat conversations, each focusing on specific Healthcare related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.
The chat dataset covers a wide range of conversations on Healthcare topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Healthcare use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.
The conversations in this dataset capture the diverse language styles and expressions prevalent in Bengali Healthcare interactions. This diversity ensures the dataset accurately represents the language used by Bengali speakers in Healthcare contexts.
The dataset encompasses a wide array of language elements, including:
This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Bengali Healthcare interactions.
The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Healthcare customer-agent interactions.
Each of these conversations contains various aspects of conversation flow like:
This structured and varied conversational flow enables the creation of advanced NLP models that can effectively manage and respond to a wide range of customer service scenarios.
The dataset is available in JSON, CSV, and TXT formats, with each conversation containing attributes like participant identifiers and chat
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The dataset is a result of the master thesis titled "Motivating PhD candidates with depression symptoms to complete thoughts-strengthening exercises via a conversational agent". It also includes an R markdown (and script) that explains how the analysis was done.
https://choosealicense.com/licenses/cc/https://choosealicense.com/licenses/cc/
Chatbot Arena Conversations Dataset
This dataset contains 33K cleaned conversations with pairwise human preferences. It is collected from 13K unique IP addresses on the Chatbot Arena from April to June 2023. Each sample includes a question ID, two model names, their full conversation text in OpenAI API JSON format, the user vote, the anonymized user ID, the detected language tag, the OpenAI moderation API tag, the additional toxic tag, and the timestamp. To ensure the safe release… See the full description on the dataset page: https://huggingface.co/datasets/lmsys/chatbot_arena_conversations.