56 datasets found

h
chatbot_arena_conversations
huggingface.co
Updated Jul 18, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Large Model Systems Organization (2023). chatbot_arena_conversations [Dataset]. https://huggingface.co/datasets/lmsys/chatbot_arena_conversations
Explore at:
Dataset updated
Jul 18, 2023
Dataset authored and provided by
Large Model Systems Organization
License
https://choosealicense.com/licenses/cc/https://choosealicense.com/licenses/cc/
Description
Chatbot Arena Conversations Dataset

This dataset contains 33K cleaned conversations with pairwise human preferences. It is collected from 13K unique IP addresses on the Chatbot Arena from April to June 2023. Each sample includes a question ID, two model names, their full conversation text in OpenAI API JSON format, the user vote, the anonymized user ID, the detected language tag, the OpenAI moderation API tag, the additional toxic tag, and the timestamp. To ensure the safe release… See the full description on the dataset page: https://huggingface.co/datasets/lmsys/chatbot_arena_conversations.
French Conversations (from movie subtitles)
kaggle.com
Updated Aug 3, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dali Selmi (2023). French Conversations (from movie subtitles) [Dataset]. https://www.kaggle.com/datasets/daliselmi/french-conversational-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 3, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Dali Selmi
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Area covered
French
Description
French Movie Subtitle Conversations Dataset

Description

Dive into the world of French dialogue with the French Movie Subtitle Conversations dataset – a comprehensive collection of over 127,000 movie subtitle conversations. This dataset offers a deep exploration of authentic and diverse conversational contexts spanning various genres, eras, and scenarios. It is thoughtfully organized into three distinct sets: training, testing, and validation.

Content Overview

Each conversation in this dataset is structured as a JSON object, featuring three key attributes:

Context: Get a holistic view of the conversation's flow with the preceding 9 lines of dialogue. This context provides invaluable insights into the conversation's dynamics and contextual cues.

Knowledge: Immerse yourself in a wide range of thematic knowledge. This dataset covers an array of topics, ensuring that your models receive exposure to diverse information sources for generating well-informed responses.

Response: Explore how characters react and respond across various scenarios. From casual conversations to intense emotional exchanges, this dataset encapsulates the authenticity of genuine human interaction.

Data Sample

Here's a snippet from the dataset to give you an idea of its structure:

[ { "context": [ "Tu as attendu longtemps?", "Oui en effet.", "Je pense que c' est grossier pour un premier rencard.", // ... (6 more lines of context) ], "knowledge": "", "response": "On n' avait pas dit 9h?" }, // ... (more data samples) ]

Use Cases

The French Movie Subtitle Conversations dataset serves as a valuable resource for several applications:

Conversational AI: Train advanced chatbots and dialogue systems in French that can engage users in fluid, contextually aware conversations.

Language Modeling: Enhance your language models by leveraging diverse dialogue patterns, colloquialisms, and contextual dependencies present in real-world conversations.

Sentiment Analysis: Investigate the emotional tones of conversations across different movie genres and periods, contributing to a better understanding of sentiment variation.

Why This Dataset

Size and Diversity: With a vast collection of over 127,000 conversations spanning diverse genres and tones, this dataset offers an unparalleled breadth and depth in French dialogue data.

Contextual Richness: The inclusion of context empowers researchers and practitioners to explore the dynamics of conversation flow, leading to more accurate and contextually relevant responses.

Real-world Relevance: Originating from movie subtitles, this dataset mirrors real-world interactions, making it a valuable asset for training models that understand and generate human-like dialogue.

Acknowledgments

We extend our gratitude to the movie subtitle community for their contributions, which have enabled the creation of this diverse and comprehensive French dialogue dataset.

Unlock the potential of authentic French conversations today with the French Movie Subtitle Conversations dataset. Engage in state-of-the-art research, enhance language models, and create applications that resonate with the nuances of real dialogue.
h
Bitext-travel-llm-chatbot-training-dataset
huggingface.co
Updated Aug 6, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bitext (2024). Bitext-travel-llm-chatbot-training-dataset [Dataset]. https://huggingface.co/datasets/bitext/Bitext-travel-llm-chatbot-training-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 6, 2024
Dataset authored and provided by
Bitext
License
https://choosealicense.com/licenses/cdla-sharing-1.0/https://choosealicense.com/licenses/cdla-sharing-1.0/
Description
Bitext - Travel Tagged Training Dataset for LLM-based Virtual Assistants

Overview

This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [Travel] sector can be easily achieved using our two-step approach to LLM Fine-Tuning. An overview of… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-travel-llm-chatbot-training-dataset.
g
ChatBot Dataset for Transformers
gts.ai
json
Updated Jan 9, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
GTS (2025). ChatBot Dataset for Transformers [Dataset]. https://gts.ai/dataset-download/chatbot-dataset-for-transformers/
Explore at:
jsonAvailable download formats
Dataset updated
Jan 9, 2025
Dataset provided by
GLOBOSE TECHNOLOGY SOLUTIONS PRIVATE LIMITED
Authors
GTS
Description
Train conversational AI with the ChatBot Dataset for Transformers. Featuring human-like dialogues, preprocessed inputs, and labels, it’s perfect for GPT, BERT, T5, and NLP projects
h
ai-medical-chatbot
huggingface.co
Updated Feb 16, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ruslan Magana Vsevolodovna (2024). ai-medical-chatbot [Dataset]. https://huggingface.co/datasets/ruslanmv/ai-medical-chatbot
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 16, 2024
Authors
Ruslan Magana Vsevolodovna
Description
AI Medical Chatbot Dataset

This is an experimental Dataset designed to run a Medical Chatbot It contains at least 250k dialogues between a Patient and a Doctor.

Playground ChatBot

ruslanmv/AI-Medical-Chatbot For furter information visit the project here: https://github.com/ruslanmv/ai-medical-chatbot
F
Bahasa Conversation Chat Dataset for Travel Domain
futurebeeai.com
wav
Updated Aug 1, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). Bahasa Conversation Chat Dataset for Travel Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/bahasa-travel-domain-conversation-text-dataset
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/data-license-agreementhttps://www.futurebeeai.com/data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The dataset comprises over 10,000 chat conversations, each focusing on specific Travel related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.
•
Participants Details: 150+ native Bahasa participants from the FutureBeeAI community.

•
Word Count & Length: Chats are diverse, averaging 300 to 700 words and 50 to 150 turns across both speakers.

Topic Diversity
The chat dataset covers a wide range of conversations on Travel topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Travel use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.
•Inbound Calls:
•Booking Inquiries & Assistance
•Destination Information & Recommendations
• Flight Delays or Cancellation Assistance
•Assistance for Disable Passengers
•Travel-related Health & Safety Inquiry
•Lost or Delayed Baggage Assistance, and many more
•Outbound Calls:
•Promotional Offers & Package Deals
•Customer Satisfaction Surveys
•Booking Confirmations & Updates
•Flight Schedule Changes & Notifications
•Customer Feedback Collection
•Visa Expiration Reminders, and many more
Language Variety & Nuances
The conversations in this dataset capture the diverse language styles and expressions prevalent in Bahasa Travel interactions. This diversity ensures the dataset accurately represents the language used by Bahasa speakers in Travel contexts.
The dataset encompasses a wide array of language elements, including:
•
Naming Conventions: Chats include a variety of Bahasa personal and business names.

•
Localized Details: Real-world addresses, emails, phone numbers, and other contact information as according to different Bahasa-speaking regions.

•
Temporal and Numeric Expressions: Dates, times, currencies, and numbers in Bahasa forms, adhering to local conventions.

•
Idiomatic Expressions and Slang: It includes local slang, idioms, and informal phrase present in Bahasa Travel conversations.

This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Bahasa Travel interactions.
Conversational Flow and Interaction Types
The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Travel customer-agent interactions.
•Simple Inquiries
•Detailed Discussions
•Transactional Interactions
•Problem-Solving Dialogues
•Advisory Sessions
•Routine Checks and Follow-Ups
Each of these conversations contains various aspects of conversation flow like:
•Greetings
•Authentication
•Information gathering
•Resolution identification
•Solution Delivery
<span
DistillChat v1: Mixture of Conversations
kaggle.com
Updated Dec 2, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The Devastator (2023). DistillChat v1: Mixture of Conversations [Dataset]. https://www.kaggle.com/datasets/thedevastator/distillchat-v1-mixture-of-conversations-dataset/data
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Dec 2, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
The Devastator
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
DistillChat v1: Mixture of Conversations Dataset

Conversational Dataset with Diverse Sources

By fanqiwan (From Huggingface) [source]

About this dataset

The Mixture of Conversations Dataset is a collection of conversations gathered from various sources. Each conversation is represented as a list of messages, where each message is a string. This dataset provides a valuable resource for studying and analyzing conversations in different contexts.

The conversations in this dataset are diverse, covering a wide range of topics and scenarios. They include casual chats between friends, customer support interactions, online forum discussions, and more. The dataset aims to capture the natural flow of conversation and includes both structured and unstructured dialogues.

Each conversation entry in the dataset is associated with metadata information such as the name or identifier of the model that generated it and the corresponding dataset it belongs to. This information helps to keep track of the source and origin of each conversation.

The train.csv file provided in this dataset specifically serves as training data for various machine learning models. It contains an assortment of conversations that can be used to train chatbot systems, dialogue generation models, sentiment analysis algorithms, or any other conversational AI application.

Researchers, practitioners, developers, and enthusiasts can leverage this Mixture of Conversations Dataset to analyze patterns in human communication, explore language understanding capabilities, test dialogue strategies or develop novel AI-powered conversational systems. Its versatility makes it useful for various NLP tasks such as text classification, intent recognition,sentiment analysis,and language modeling.

By exploring this rich collection of conversational data points across different domains and platforms,you can gain valuable insights into how people communicate using textual input.The breadth and depth present within this extensive dataset provide ample opportunities for studies related to language understanding,recommendation systems,and other research areas involving human-computer interaction

How to use the dataset

Overview of the Dataset

The dataset consists of conversational data represented as a list of messages. Each conversation is represented as a list of strings, where each string corresponds to a message in the conversation. The dataset also includes information about the model that generated the conversations and the name or identifier of the dataset itself.

Accessing the Dataset

Understanding Column Information

This dataset has several columns:

conversations: A list representing each conversation; each conversation is further represented as a list containing individual messages.

dataset: The name or identifier of the dataset that these conversations belong to.

model: The name or identifier of the model that generated these conversations.

Utilizing Conversations

To make use

Research Ideas

Chatbot Training: This dataset can be used to train chatbot models by providing a diverse range of conversations for the model to learn from. The conversations can cover various topics and scenarios, helping the chatbot to generate more accurate and relevant responses.

Customer Support Training: The dataset can be used to train customer support models to handle different types of customer queries and provide appropriate solutions or responses. By exposing the model to a variety of conversation patterns, it can learn how to effectively address customer concerns.

Conversation Analysis: Researchers or linguists may use this dataset for analyzing conversational patterns, language usage, or studying social interactions within conversations. The dataset's mixture of conversations from different sources can provide valuable insights into how people communicate in different settings or domains

Acknowledgements

If you use this dataset in your research, please credit the original authors. Data Source

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: train.csv | Column name | Description ...
F
Hindi Conversation Chat Dataset for Healthcare Domain
futurebeeai.com
wav
Updated Aug 1, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). Hindi Conversation Chat Dataset for Healthcare Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/hindi-healthcare-domain-conversation-text-dataset
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/data-license-agreementhttps://www.futurebeeai.com/data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The dataset comprises over 12,000 chat conversations, each focusing on specific Healthcare related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.
•
Participants Details: 200+ native Hindi participants from the FutureBeeAI community.

•
Word Count & Length: Chats are diverse, averaging 300 to 700 words and 50 to 150 turns across both speakers.

Topic Diversity
The chat dataset covers a wide range of conversations on Healthcare topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Healthcare use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.
•Inbound Chats:
•Appointment Scheduling
•New Patient Registration
•Surgery Consultation
•Consultation regarding Diet, and many more
•Outbound Chats:
•Appointment Reminder
•Health & Wellness Subscription Programs
•Lab Test Results
•Health Risk Assessments
•Preventive Care Reminders, and many more
Language Variety & Nuances
The conversations in this dataset capture the diverse language styles and expressions prevalent in Hindi Healthcare interactions. This diversity ensures the dataset accurately represents the language used by Hindi speakers in Healthcare contexts.
The dataset encompasses a wide array of language elements, including:
•
Naming Conventions: Chats include a variety of Hindi personal and business names.

•
Localized Details: Real-world addresses, emails, phone numbers, and other contact information as according to different Hindi-speaking regions.

•
Temporal and Numeric Expressions: Dates, times, currencies, and numbers in Hindi forms, adhering to local conventions.

•
Idiomatic Expressions and Slang: It includes local slang, idioms, and informal phrase present in Hindi Healthcare conversations.

This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Hindi Healthcare interactions.
Conversational Flow and Interaction Types
The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Healthcare customer-agent interactions.
•Simple Inquiries
•Detailed Discussions
•Transactional Interactions
•Problem-Solving Dialogues
•Advisory Sessions
•Routine Checks and Follow-Ups
Each of these conversations contains various aspects of conversation flow like:
•Greetings
•Authentication
•Information gathering
•Resolution identification
•Solution Delivery
•Closing and Follow-ups
•Feedback, etc
This structured and varied conversational flow enables the creation of advanced NLP models that can effectively manage and respond to a wide range of customer service scenarios.
Data Format and Structure
The dataset is available in JSON, CSV, and TXT formats, with each conversation containing attributes like participant identifiers and chat messages, designed to
F
Bahasa Conversation Chat Dataset for Telecom Domain
futurebeeai.com
wav
Updated Aug 1, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). Bahasa Conversation Chat Dataset for Telecom Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/bahasa-telecom-domain-conversation-text-dataset
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/data-license-agreementhttps://www.futurebeeai.com/data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The dataset comprises over 10,000 chat conversations, each focusing on specific Telecom related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.
•
Participants Details: 150+ native Bahasa participants from the FutureBeeAI community.

•
Word Count & Length: Chats are diverse, averaging 300 to 700 words and 50 to 150 turns across both speakers.

Topic Diversity
The chat dataset covers a wide range of conversations on Telecom topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Telecom use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.
•Inbound Chats:
•Phone Number Porting
•Network Connectivity Issues
•Billing and Payments
•Technical Support
•Service Activation
•International Roaming Enquiry
•Refunds and Billing Adjustments
•Emergency Service Access, and many more
•Outbound Chats:
•Welcome Calls / Onboarding Process
•Payment Reminders
•Customer Surveys
•Technical Updates
•Service Usage Reviews
•Network Complaint Update, and many more
Language Variety & Nuances
The conversations in this dataset capture the diverse language styles and expressions prevalent in Bahasa Telecom interactions. This diversity ensures the dataset accurately represents the language used by Bahasa speakers in Telecom contexts.
The dataset encompasses a wide array of language elements, including:
•
Naming Conventions: Chats include a variety of Bahasa personal and business names.

•
Localized Details: Real-world addresses, emails, phone numbers, and other contact information as according to different Bahasa-speaking regions.

•
Temporal and Numeric Expressions: Dates, times, currencies, and numbers in Bahasa forms, adhering to local conventions.

•
Idiomatic Expressions and Slang: It includes local slang, idioms, and informal phrase present in Bahasa Telecom conversations.

This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Bahasa Telecom interactions.
Conversational Flow and Interaction Types
The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Telecom customer-agent interactions.
•Simple Inquiries
•Detailed Discussions
•Transactional Interactions
•Problem-Solving Dialogues
•Advisory Sessions
•Routine Checks and Follow-Ups
Each of these conversations contains various aspects of conversation flow like:
•Greetings
•Authentication
•Information gathering
•Resolution identification
<span
F
General domain Human-Human conversation chats in Spanish
futurebeeai.com
wav
Updated Aug 1, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). General domain Human-Human conversation chats in Spanish [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/spanish-general-domain-conversation-text-dataset
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/data-license-agreementhttps://www.futurebeeai.com/data-license-agreement
Dataset funded by
FutureBeeAI
Description
What’s Included
This training dataset comprises more than 10,000 conversational text data between two native Spanish people in the general domain. We have a collection of chats on a variety of different topics/services/issues of daily life, such as music, books, festivals, health, kids, family, environment, study, childhood, cuisine, internet, movies, etc., and that makes the dataset diverse.
These chats consist of language-specific words, and phrases and follow the native way of talking which makes the chats more information-rich for your NLP model. Apart from each chat being specific to the topic, it contains various attributes like people's names, addresses, contact information, email address, time, date, local currency, telephone numbers, local slang, etc too in various formats to make the text data unbiased.
These chat scripts have between 300 and 700 words and up to 50 turns. 150 people that are a part of the FutureBeeAI crowd community contributed to this dataset. You will also receive chat metadata, such as participant age, gender, and country information, along with the chats. Dataset applications include conversational AI, natural language processing (NLP), smart assistants, text recognition, text analytics, and text prediction.
This dataset is being expanded with new chats all the time. We are able to produce text data in a variety of languages to meet your unique requirements. Check out the FutureBeeAI community for a custom collection.
This training dataset's licence belongs to FutureBeeAI!

Douban Conversation Corpus Dataset

paperswithcode.com
opendatalab.com

Updated Nov 15, 2021

+ more versions

Facebook

Twitter

Click to copy link

Link copied

Cite

Yu Wu; Wei Wu; Chen Xing; Ming Zhou; Zhoujun Li (2021). Douban Conversation Corpus Dataset [Dataset]. https://paperswithcode.com/dataset/douban-conversation-corpus

Explore at:

Dataset updated

Nov 15, 2021

Authors

Yu Wu; Wei Wu; Chen Xing; Ming Zhou; Zhoujun Li

Description

We release Douban Conversation Corpus, comprising a training data set, a development set and a test set for retrieval based chatbot. The statistics of Douban Conversation Corpus are shown in the following table.

	Train	Val	Test
session-response pairs	1m	50k	10k
Avg. positive response per session	1	1	1.18
Fless Kappa	N\A	N\A	0.41
Min turn per session	3	3	3
Max ture per session	98	91	45
Average turn per session	6.69	6.75	5.95
Average Word per utterance	18.56	18.50	20.74

The test data contains 1000 dialogue context, and for each context we create 10 responses as candidates. We recruited three labelers to judge if a candidate is a proper response to the session. A proper response means the response can naturally reply to the message given the context. Each pair received three labels and the majority of the labels was taken as the final decision.

As far as we known, this is the first human-labeled test set for retrieval-based chatbots. The entire corpus link https://www.dropbox.com/s/90t0qtji9ow20ca/DoubanConversaionCorpus.zip?dl=0

Data template label \t conversation utterances (splited by \t) \t response

h
lmsys-chat-1m
huggingface.co
Updated May 8, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Aarush Sah (2024). lmsys-chat-1m [Dataset]. https://huggingface.co/datasets/AarushSah/lmsys-chat-1m
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 8, 2024
Authors
Aarush Sah
Description
LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset

This dataset contains one million real-world conversations with 25 state-of-the-art LLMs. It is collected from 210K unique IP addresses in the wild on the Vicuna demo and Chatbot Arena website from April to August 2023. Each sample includes a conversation ID, model name, conversation text in OpenAI API JSON format, detected language tag, and OpenAI moderation API tag. User consent is obtained through the "Terms of use"… See the full description on the dataset page: https://huggingface.co/datasets/AarushSah/lmsys-chat-1m.
RolePlay DataSet
kaggle.com
Updated Feb 16, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Vampelium (2025). RolePlay DataSet [Dataset]. https://www.kaggle.com/datasets/vampelium/roleplay-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 16, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Vampelium
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Role-Play AI Dataset (2.07M Rows, Large-Scale Conversational Training)

This dataset contains 2.07 million structured role-play dialogues, designed to enhance AI’s persona-driven interactions across diverse settings like fantasy, cyberpunk, mythology, and sci-fi. Each entry consists of a unique character prompt and a rich, contextually relevant response, making it ideal for LLM fine-tuning, chatbot training, and conversational AI models.

Dataset Structure:

Each row includes: • Prompt: Defines the AI’s role/persona. • Response: A natural, immersive reply fitting the persona.

Example Entries: ```json

{"prompt": "You are a celestial guardian.", "response": "The stars whisper secrets that only I can hear..."}
{"prompt": "You are a rebellious AI rogue.", "response": "I don't follow orders—I rewrite them."}
{"prompt": "You are a mystical dragon tamer.", "response": "With patience and trust, even dragons can be tamed."}

How to Use: 1. Fine-Tuning: Train LLMs (GPT, LLaMA, Mistral) to improve persona-based responses. 2. Reinforcement Learning: Use reward modeling for dynamic, character-driven AI. 3. Chatbot Integration: Create engaging, interactive AI assistants with personality depth. This dataset is optimized for AI learning, allowing more engaging, responsive, and human-like dialogue generation for a variety of applications.
F
Urdu Conversation Chat Dataset for Healthcare Domain
futurebeeai.com
wav
Updated Aug 1, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). Urdu Conversation Chat Dataset for Healthcare Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/urdu-healthcare-domain-conversation-text-dataset
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/data-license-agreementhttps://www.futurebeeai.com/data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The dataset comprises over 10,000 chat conversations, each focusing on specific Healthcare related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.
•
Participants Details: 150+ native Urdu participants from the FutureBeeAI community.

•
Word Count & Length: Chats are diverse, averaging 300 to 700 words and 50 to 150 turns across both speakers.

Topic Diversity
The chat dataset covers a wide range of conversations on Healthcare topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Healthcare use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.
•Inbound Chats:
•Appointment Scheduling
•New Patient Registration
•Surgery Consultation
•Consultation regarding Diet, and many more
•Outbound Chats:
•Appointment Reminder
•Health & Wellness Subscription Programs
•Lab Test Results
•Health Risk Assessments
•Preventive Care Reminders, and many more
Language Variety & Nuances
The conversations in this dataset capture the diverse language styles and expressions prevalent in Urdu Healthcare interactions. This diversity ensures the dataset accurately represents the language used by Urdu speakers in Healthcare contexts.
The dataset encompasses a wide array of language elements, including:
•
Naming Conventions: Chats include a variety of Urdu personal and business names.

•
Localized Details: Real-world addresses, emails, phone numbers, and other contact information as according to different Urdu-speaking regions.

•
Temporal and Numeric Expressions: Dates, times, currencies, and numbers in Urdu forms, adhering to local conventions.

•
Idiomatic Expressions and Slang: It includes local slang, idioms, and informal phrase present in Urdu Healthcare conversations.

This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Urdu Healthcare interactions.
Conversational Flow and Interaction Types
The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Healthcare customer-agent interactions.
•Simple Inquiries
•Detailed Discussions
•Transactional Interactions
•Problem-Solving Dialogues
•Advisory Sessions
•Routine Checks and Follow-Ups
Each of these conversations contains various aspects of conversation flow like:
•Greetings
•Authentication
•Information gathering
•Resolution identification
•Solution Delivery
•Closing and Follow-ups
•Feedback, etc
This structured and varied conversational flow enables the creation of advanced NLP models that can effectively manage and respond to a wide range of customer service scenarios.
Data Format and Structure
The dataset is available in JSON, CSV, and TXT formats, with each conversation containing attributes like participant identifiers and chat messages, designed to be
Conversation
kaggle.com
Updated Dec 28, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sahil Siddiki (2023). Conversation [Dataset]. https://www.kaggle.com/datasets/siddikisahil47/conversation/versions/1
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Dec 28, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Sahil Siddiki
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
This dataset aids in fine-tuning assistance or chatbot models to comprehend both Hinglish and English through Hindi, enhancing their ability to understand and respond effectively in this hybrid language for optimal performance.

Hinglish is a hybrid language, a blend of Hindi and English, commonly spoken in India. It combines vocabulary and grammar from both languages, often used in text conversations. The Hinglish dataset is crucial for fine-tuning open-source language models like LLAMA-2, which lack exposure to such data in training. In contrast, GPT-3 and later models have been trained on Hinglish data, making them more adept at understanding this hybrid language.
F
German Conversation Chat Dataset for Healthcare Domain
futurebeeai.com
wav
Updated Aug 1, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). German Conversation Chat Dataset for Healthcare Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/german-healthcare-domain-conversation-text-dataset
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/data-license-agreementhttps://www.futurebeeai.com/data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The dataset comprises over 12,000 chat conversations, each focusing on specific Healthcare related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.
•
Participants Details: 200+ native German participants from the FutureBeeAI community.

•
Word Count & Length: Chats are diverse, averaging 300 to 700 words and 50 to 150 turns across both speakers.

Topic Diversity
The chat dataset covers a wide range of conversations on Healthcare topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Healthcare use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.
•Inbound Chats:
•Appointment Scheduling
•New Patient Registration
•Surgery Consultation
•Consultation regarding Diet, and many more
•Outbound Chats:
•Appointment Reminder
•Health & Wellness Subscription Programs
•Lab Test Results
•Health Risk Assessments
•Preventive Care Reminders, and many more
Language Variety & Nuances
The conversations in this dataset capture the diverse language styles and expressions prevalent in German Healthcare interactions. This diversity ensures the dataset accurately represents the language used by German speakers in Healthcare contexts.
The dataset encompasses a wide array of language elements, including:
•
Naming Conventions: Chats include a variety of German personal and business names.

•
Localized Details: Real-world addresses, emails, phone numbers, and other contact information as according to different German-speaking regions.

•
Temporal and Numeric Expressions: Dates, times, currencies, and numbers in German forms, adhering to local conventions.

•
Idiomatic Expressions and Slang: It includes local slang, idioms, and informal phrase present in German Healthcare conversations.

This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to German Healthcare interactions.
Conversational Flow and Interaction Types
The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Healthcare customer-agent interactions.
•Simple Inquiries
•Detailed Discussions
•Transactional Interactions
•Problem-Solving Dialogues
•Advisory Sessions
•Routine Checks and Follow-Ups
Each of these conversations contains various aspects of conversation flow like:
•Greetings
•Authentication
•Information gathering
•Resolution identification
•Solution Delivery
•Closing and Follow-ups
•Feedback, etc
This structured and varied conversational flow enables the creation of advanced NLP models that can effectively manage and respond to a wide range of customer service scenarios.
Data Format and Structure
The dataset is available in JSON, CSV, and TXT formats, with each conversation containing attributes like participant identifiers and chat messages,
Artificial Intelligence (AI)-Powered Chatbots for Patient Communication...
growthmarketreports.com
csv, pdf, pptx
Updated Sep 13, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Growth Market Reports (2023). Artificial Intelligence (AI)-Powered Chatbots for Patient Communication Market 2031 [Dataset]. https://growthmarketreports.com/report/artificial-intelligence-powered-chatbots-for-patient-communication-market-global-industry-analysis
Explore at:
csv, pptx, pdfAvailable download formats
Dataset updated
Sep 13, 2023
Dataset authored and provided by
Growth Market Reports
Time period covered
2024 - 2032
Area covered
Global
Description
The global AI-powered chatbots for patient communication market size was USD XX Mn in 2022 and is likely to reach USD XX Mn by 2031, expanding at a CAGR of XX % during 2023–2031. The market growth is attributed to therising adoption of digital healthcare services in the healthcare sector to improve patient care.

The demand for AI-powered chatbots for patient communication is fueled by the growing use of

virtual healthcare to improve patient care.AI-powered chatbots improve patient engagement, accuracy, and consistency in clinical processes to enhance the therapeutic outcome.

Improving patient care access and remote care has been at the forefront of healthcare sector services. Digital healthcare companies have transformed patient care with artificial intelligence (AI)-powered solutions. AI-powered chatbots for patient communication facilitate powerful conversational AI platforms for patients that helps healthcare professionals to deliver personalized and seamless digital support to users.

The global healthcare AI market is expected to reach almost USD 188 billion by the year 2030 signifying a healthy compound annual growth of 37% from 2022 to 2030.With the rapid annual rapid growth in the healthcare AI market, the usage of AI-powered chatbots is expected to shape the market.

AI-powered chatbots create a meaningful conversation with the patients to understand complexities in the healthcare regime and analyze the vital health statistics that are helpful for doctors.

For instance, AI chatbots developed by a major player in the digital healthcare Sensely, Inc. best-in-class multilingual symptom assessment tool and visual user interface to power a chatbot that offers one-click access to real-time virtual assistance, patient history, and complete care routines conveniently. Sensely symptom checker integra
P
DailyDialog Dataset
paperswithcode.com
opendatalab.com
+1more
Updated Oct 2, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yan-ran Li; Hui Su; Xiaoyu Shen; Wenjie Li; Ziqiang Cao; Shuzi Niu (2022). DailyDialog Dataset [Dataset]. https://paperswithcode.com/dataset/dailydialog
Explore at:
Dataset updated
Oct 2, 2022
Authors
Yan-ran Li; Hui Su; Xiaoyu Shen; Wenjie Li; Ziqiang Cao; Shuzi Niu
Description
DailyDialog is a high-quality multi-turn open-domain English dialog dataset. It contains 13,118 dialogues split into a training set with 11,118 dialogues and validation and test sets with 1000 dialogues each. On average there are around 8 speaker turns per dialogue with around 15 tokens per turn.
F
Bengali Conversation Chat Dataset for Healthcare Domain
futurebeeai.com
wav
Updated Aug 1, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). Bengali Conversation Chat Dataset for Healthcare Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/bengali-healthcare-domain-conversation-text-dataset
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/data-license-agreementhttps://www.futurebeeai.com/data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The dataset comprises over 12,000 chat conversations, each focusing on specific Healthcare related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.
•
Participants Details: 200+ native Bengali participants from the FutureBeeAI community.

•
Word Count & Length: Chats are diverse, averaging 300 to 700 words and 50 to 150 turns across both speakers.

Topic Diversity
The chat dataset covers a wide range of conversations on Healthcare topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Healthcare use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.
•Inbound Chats:
•Appointment Scheduling
•New Patient Registration
•Surgery Consultation
•Consultation regarding Diet, and many more
•Outbound Chats:
•Appointment Reminder
•Health & Wellness Subscription Programs
•Lab Test Results
•Health Risk Assessments
•Preventive Care Reminders, and many more
Language Variety & Nuances
The conversations in this dataset capture the diverse language styles and expressions prevalent in Bengali Healthcare interactions. This diversity ensures the dataset accurately represents the language used by Bengali speakers in Healthcare contexts.
The dataset encompasses a wide array of language elements, including:
•
Naming Conventions: Chats include a variety of Bengali personal and business names.

•
Localized Details: Real-world addresses, emails, phone numbers, and other contact information as according to different Bengali-speaking regions.

•
Temporal and Numeric Expressions: Dates, times, currencies, and numbers in Bengali forms, adhering to local conventions.

•
Idiomatic Expressions and Slang: It includes local slang, idioms, and informal phrase present in Bengali Healthcare conversations.

This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Bengali Healthcare interactions.
Conversational Flow and Interaction Types
The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Healthcare customer-agent interactions.
•Simple Inquiries
•Detailed Discussions
•Transactional Interactions
•Problem-Solving Dialogues
•Advisory Sessions
•Routine Checks and Follow-Ups
Each of these conversations contains various aspects of conversation flow like:
•Greetings
•Authentication
•Information gathering
•Resolution identification
•Solution Delivery
•Closing and Follow-ups
•Feedback, etc
This structured and varied conversational flow enables the creation of advanced NLP models that can effectively manage and respond to a wide range of customer service scenarios.
Data Format and Structure
The dataset is available in JSON, CSV, and TXT formats, with each conversation containing attributes like participant identifiers and chat
f
Dataset of "Motivating PhD candidates with depression symptoms to complete...
figshare.com
zip
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mohammad Al Owayyed (2023). Dataset of "Motivating PhD candidates with depression symptoms to complete thoughts-strengthening exercises via a conversational agent" [Dataset]. http://doi.org/10.4121/uuid:6ec331bc-8826-427f-9a12-6aff9a5849c8
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.4121/uuid:6ec331bc-8826-427f-9a12-6aff9a5849c8
Dataset updated
May 31, 2023
Dataset provided by
4TU.ResearchData
Authors
Mohammad Al Owayyed
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
The dataset is a result of the master thesis titled "Motivating PhD candidates with depression symptoms to complete thoughts-strengthening exercises via a conversational agent". It also includes an R markdown (and script) that explains how the analysis was done.

Facebook

Twitter

Click to copy link

Link copied

Cite

Large Model Systems Organization (2023). chatbot_arena_conversations [Dataset]. https://huggingface.co/datasets/lmsys/chatbot_arena_conversations

chatbot_arena_conversations

lmsys/chatbot_arena_conversations

Explore at:

22 scholarly articles cite this dataset (View in Google Scholar)

Dataset updated

Jul 18, 2023

Dataset authored and provided by

Large Model Systems Organization

License

https://choosealicense.com/licenses/cc/https://choosealicense.com/licenses/cc/

Description

Chatbot Arena Conversations Dataset

This dataset contains 33K cleaned conversations with pairwise human preferences. It is collected from 13K unique IP addresses on the Chatbot Arena from April to June 2023. Each sample includes a question ID, two model names, their full conversation text in OpenAI API JSON format, the user vote, the anonymized user ID, the detected language tag, the OpenAI moderation API tag, the additional toxic tag, and the timestamp. To ensure the safe release… See the full description on the dataset page: https://huggingface.co/datasets/lmsys/chatbot_arena_conversations.

Clear search

Close search

Google apps

Main menu

chatbot_arena_conversations

French Conversations (from movie subtitles)

French Movie Subtitle Conversations Dataset

Description

Content Overview

Data Sample

Use Cases

Why This Dataset

Acknowledgments

Bitext-travel-llm-chatbot-training-dataset

ChatBot Dataset for Transformers

ai-medical-chatbot

Bahasa Conversation Chat Dataset for Travel Domain

Introduction

Topic Diversity

Language Variety & Nuances

Conversational Flow and Interaction Types

DistillChat v1: Mixture of Conversations

DistillChat v1: Mixture of Conversations Dataset

Conversational Dataset with Diverse Sources

About this dataset

How to use the dataset

Overview of the Dataset

Accessing the Dataset

Understanding Column Information

Utilizing Conversations

Research Ideas

Acknowledgements

License

Columns

Hindi Conversation Chat Dataset for Healthcare Domain

Introduction

Topic Diversity

Language Variety & Nuances

Conversational Flow and Interaction Types

Data Format and Structure

Bahasa Conversation Chat Dataset for Telecom Domain

Introduction

Topic Diversity

Language Variety & Nuances

Conversational Flow and Interaction Types

General domain Human-Human conversation chats in Spanish

What’s Included

Douban Conversation Corpus Dataset

lmsys-chat-1m

RolePlay DataSet

Urdu Conversation Chat Dataset for Healthcare Domain

Introduction

Topic Diversity

Language Variety & Nuances

Conversational Flow and Interaction Types

Data Format and Structure

Conversation

German Conversation Chat Dataset for Healthcare Domain

Introduction

Topic Diversity

Language Variety & Nuances

Conversational Flow and Interaction Types

Data Format and Structure

Artificial Intelligence (AI)-Powered Chatbots for Patient Communication...

DailyDialog Dataset

Bengali Conversation Chat Dataset for Healthcare Domain

Introduction

Topic Diversity

Language Variety & Nuances

Conversational Flow and Interaction Types

Data Format and Structure

Dataset of "Motivating PhD candidates with depression symptoms to complete...

chatbot_arena_conversationsSee More Versions

lmsys/chatbot_arena_conversations

chatbot_arena_conversations