100+ datasets found

h
lmsys-chat-1m
huggingface.co
opendatalab.com
Updated Sep 17, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Large Model Systems Organization (2023). lmsys-chat-1m [Dataset]. https://huggingface.co/datasets/lmsys/lmsys-chat-1m
Explore at:
Dataset updated
Sep 17, 2023
Dataset authored and provided by
Large Model Systems Organization
Description
LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset

This dataset contains one million real-world conversations with 25 state-of-the-art LLMs. It is collected from 210K unique IP addresses in the wild on the Vicuna demo and Chatbot Arena website from April to August 2023. Each sample includes a conversation ID, model name, conversation text in OpenAI API JSON format, detected language tag, and OpenAI moderation API tag. User consent is obtained through the "Terms of use"… See the full description on the dataset page: https://huggingface.co/datasets/lmsys/lmsys-chat-1m.
h
toxic-chat
huggingface.co
Updated Jan 25, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Large Model Systems Organization (2024). toxic-chat [Dataset]. https://huggingface.co/datasets/lmsys/toxic-chat
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 25, 2024
Dataset authored and provided by
Large Model Systems Organization
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Update

[01/31/2024] We update the OpenAI Moderation API results for ToxicChat (0124) based on their updated moderation model on on Jan 25, 2024.[01/28/2024] We release an official T5-Large model trained on ToxicChat (toxicchat0124). Go and check it for you baseline comparision![01/19/2024] We have a new version of ToxicChat (toxicchat0124)!

Content

This dataset contains toxicity annotations on 10K user prompts collected from the Vicuna online demo. We utilize a human-AI… See the full description on the dataset page: https://huggingface.co/datasets/lmsys/toxic-chat.
Synthetic-Persona-Chat
huggingface.co
Updated Dec 20, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Google (2023). Synthetic-Persona-Chat [Dataset]. https://huggingface.co/datasets/google/Synthetic-Persona-Chat
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Dec 20, 2023
Dataset authored and provided by
Googlehttp://google.com/
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset Card for SPC: Synthetic-Persona-Chat Dataset

Abstract from the paper introducing this dataset:

High-quality conversational datasets are essential for developing AI models that can communicate with users. One way to foster deeper interactions between a chatbot and its user is through personas, aspects of the user's character that provide insights into their personality, motivations, and behaviors. Training Natural Language Processing (NLP) models on a diverse and… See the full description on the dataset page: https://huggingface.co/datasets/google/Synthetic-Persona-Chat.
F
French Conversation Chat Dataset for Delivery & Logistics Domain
futurebeeai.com
wav
Updated Aug 1, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). French Conversation Chat Dataset for Delivery & Logistics Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/french-delivery-domain-conversation-text-dataset
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement
Area covered
French
Dataset funded by
FutureBeeAI
Description
Introduction
The dataset comprises over 10,000 chat conversations, each focusing on specific Delivery & Logistics related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.
[object Object][object Object]
Topic Diversity
The chat dataset covers a wide range of conversations on Delivery & Logistics topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Delivery & Logistics use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.
[object Object][object Object][object Object][object Object][object Object][object Object][object Object][object Object][object Object][object Object][object Object][object Object][object Object][object Object][object Object]
Language Variety & Nuances
The conversations in this dataset capture the diverse language styles and expressions prevalent in French Delivery & Logistics interactions. This diversity ensures the dataset accurately represents the language used by French speakers in Delivery & Logistics contexts.
The dataset encompasses a wide array of language elements, including:
[object Object][object Object][object Object][object Object]
This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to French Delivery & Logistics interactions.
Conversational Flow and Interaction Types
The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Delivery & Logistics customer-agent interactions.
[object Object][object Object][object Object][object Object][object Object][object Object]
Each of these conversations contains various aspects of conversation flow like:
[object Object][object Object][object Object][object Object][object Object][object Object][object Object]
This structured and varied conversational flow enables the creation of advanced NLP models that can effectively manage and respond to a wide range of customer service scenarios.
Data Format and Structure
The dataset is available in JSON, CSV, and TXT formats, with each conversation containing attributes like participant identifiers and chat messages, designed to be easily accessible and compatible with popular NLP frameworks.
Usage and Application
This dataset is useful for various applications in NLP and conversational AI, including:
[object Object][object Object][object Object][object Object][object Object]
Secure and Ethical Collection
[object Object][object Object][object Object]
Updates and Customization
The dataset is regularly updated with new chat data. Customization options are available to meet specific needs, including:
[object Object][object Object][object Object][object Object]
License
This French Conversational Chat Dataset for Delivery & Logistics is created by FutureBeeAI and is available for commercial use.
F
Bahasa Conversation Chat Dataset for Telecom Domain
futurebeeai.com
wav
Updated Aug 1, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). Bahasa Conversation Chat Dataset for Telecom Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/bahasa-telecom-domain-conversation-text-dataset
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/data-license-agreementhttps://www.futurebeeai.com/data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The dataset comprises over 10,000 chat conversations, each focusing on specific Telecom related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.
•
Participants Details: 150+ native Bahasa participants from the FutureBeeAI community.

•
Word Count & Length: Chats are diverse, averaging 300 to 700 words and 50 to 150 turns across both speakers.

Topic Diversity
The chat dataset covers a wide range of conversations on Telecom topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Telecom use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.
•Inbound Chats:
•Phone Number Porting
•Network Connectivity Issues
•Billing and Payments
•Technical Support
•Service Activation
•International Roaming Enquiry
•Refunds and Billing Adjustments
•Emergency Service Access, and many more
•Outbound Chats:
•Welcome Calls / Onboarding Process
•Payment Reminders
•Customer Surveys
•Technical Updates
•Service Usage Reviews
•Network Complaint Update, and many more
Language Variety & Nuances
The conversations in this dataset capture the diverse language styles and expressions prevalent in Bahasa Telecom interactions. This diversity ensures the dataset accurately represents the language used by Bahasa speakers in Telecom contexts.
The dataset encompasses a wide array of language elements, including:
•
Naming Conventions: Chats include a variety of Bahasa personal and business names.

•
Localized Details: Real-world addresses, emails, phone numbers, and other contact information as according to different Bahasa-speaking regions.

•
Temporal and Numeric Expressions: Dates, times, currencies, and numbers in Bahasa forms, adhering to local conventions.

•
Idiomatic Expressions and Slang: It includes local slang, idioms, and informal phrase present in Bahasa Telecom conversations.

This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Bahasa Telecom interactions.
Conversational Flow and Interaction Types
The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Telecom customer-agent interactions.
•Simple Inquiries
•Detailed Discussions
•Transactional Interactions
•Problem-Solving Dialogues
•Advisory Sessions
•Routine Checks and Follow-Ups
Each of these conversations contains various aspects of conversation flow like:
•Greetings
•Authentication
•Information gathering
•Resolution identification
<span
s
Childhood Adenotonsillectomy Trial
sleepdata.org
Updated Feb 27, 2014
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2014). Childhood Adenotonsillectomy Trial [Dataset]. http://doi.org/10.25822/d68d-8g03
Explore at:
Unique identifier
https://doi.org/10.25822/d68d-8g03
Dataset updated
Feb 27, 2014
Description
The CHAT is a multi-center, single-blind, randomized, controlled trial designed to test whether after a 7-month observation period, children, ages 5 to 9.9 years, with mild to moderate obstructive sleep apnea randomized to early adenotonsillectomy (eAT) will show greater levels of neurocognitive functioning, specifically in the attention-executive functioning domain, than children randomized to watchful waiting plus supportive care (WWSC). Other outcomes assessed included other indices of neurocognitive functioning (learning and memory, information processing, etc.), physical growth, blood pressure, metabolic profile, symptoms and quality of life. Physiological measures of sleep were assessed at baseline and at 7-months with standardized full polysomnography with central scoring at the Brigham and Women’s Sleep Reading Center. In total, 1,447 children had screening polysomnographs and 464 were randomized to treatment.
P
ToxicChat Dataset
paperswithcode.com
Updated May 15, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zi Lin; Zihan Wang; Yongqi Tong; Yangkun Wang; Yuxin Guo; Yujia Wang; Jingbo Shang (2025). ToxicChat Dataset [Dataset]. https://paperswithcode.com/dataset/toxicchat
Explore at:
Dataset updated
May 15, 2025
Authors
Zi Lin; Zihan Wang; Yongqi Tong; Yangkun Wang; Yuxin Guo; Yujia Wang; Jingbo Shang
Description
ToxicChat is a novel benchmark dataset constructed based on real user queries from an open-source chatbot. Unlike previous toxicity detection benchmarks that primarily rely on social media content, ToxicChat captures the rich and nuanced phenomena inherent in real-world user-AI interactions. This unique dataset reveals significant domain differences compared to social media contents, making it a valuable resource for exploring the challenges of toxicity detection in user-AI conversations¹.

Here are some key details about the ToxicChat dataset:

Construction: ToxicChat was created using real user queries collected from an open-source chatbot. Challenges: It contains phenomena that can be tricky for current toxicity detection models to identify. Domain Difference: ToxicChat exhibits a significant domain difference when compared to social media content. Purpose: ToxicChat serves as a benchmark to drive advancements in building a safe and healthy environment for user-AI interactions.

Source: Conversation with Bing, 3/17/2024 (1) ToxicChat: Unveiling Hidden Challenges of Toxicity Detection in Real .... https://aclanthology.org/2023.findings-emnlp.311/. (2) arXiv:2310.17389v1 [cs.CL] 26 Oct 2023. https://arxiv.org/pdf/2310.17389. (3) README.md · lmsys/toxic-chat at main - Hugging Face. https://huggingface.co/datasets/lmsys/toxic-chat/blob/main/README.md. (4) The Toxicity Dataset - GitHub. https://github.com/surge-ai/toxicity. (5) undefined. https://aclanthology.org/2023.findings-emnlp.311. (6) undefined. https://aclanthology.org/2023.findings-emnlp.311.pdf.
Reddit Conversation Dataset
kaggle.com
Updated Jun 20, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sai J (2023). Reddit Conversation Dataset [Dataset]. https://www.kaggle.com/datasets/psyflow/reddit-conversation-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 20, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Sai J
Description
This is aimed to build a conversation AI bot or for next word prediction. Please Upvote the dataset so that it reaches to maximum Kagglers and it can help them to build a well chat bot as the size of dataset is 2.6GB
h
Topical-Chat
huggingface.co
Updated Dec 31, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Conversational Reasoning (2023). Topical-Chat [Dataset]. https://huggingface.co/datasets/Conversational-Reasoning/Topical-Chat
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Dec 31, 2023
Dataset authored and provided by
Conversational Reasoning
Description
Topical-Chat

We introduce Topical-Chat, a knowledge-grounded human-human conversation dataset where the underlying knowledge spans 8 broad topics and conversation partners don’t have explicitly defined roles. Topical-Chat broadly consists of two types of files:

Conversations: JSON files containing conversations between pairs of Amazon Mechanical Turk workers. Reading Sets: JSON files containing knowledge sections rendered as reading content to the Turkers having conversations.… See the full description on the dataset page: https://huggingface.co/datasets/Conversational-Reasoning/Topical-Chat.
F
English Conversation Chat Dataset for Real Estate Domain
futurebeeai.com
wav
Updated Aug 1, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). English Conversation Chat Dataset for Real Estate Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/english-realestate-domain-conversation-text-dataset
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/data-license-agreementhttps://www.futurebeeai.com/data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The dataset comprises over 12,000 chat conversations, each focusing on specific Real Estate related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.
•
Participants Details: 200+ native English participants from the FutureBeeAI community.

•
Word Count & Length: Chats are diverse, averaging 300 to 700 words and 50 to 150 turns across both speakers.

Topic Diversity
The chat dataset covers a wide range of conversations on Real Estate topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Real Estate use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.
•Inbound Chats:
•Property Inquiry
•Rental Property Search & Availability
•Renovation Inquiries
•Property Features & Amenities Inquiry
•Investment Property Analysis & Advice
•Property History & Ownership Details, and many more
•Outbound Chats:
•New Property Listing Update
•Post Purchase Follow-ups
•Investment Opportunities & Property Recommendations
•Property Value Updates
•Customer Satisfaction Surveys, and many more
Language Variety & Nuances
The conversations in this dataset capture the diverse language styles and expressions prevalent in English Real Estate interactions. This diversity ensures the dataset accurately represents the language used by English speakers in Real Estate contexts.
The dataset encompasses a wide array of language elements, including:
•
Naming Conventions: Chats include a variety of English personal and business names.

•
Localized Details: Real-world addresses, emails, phone numbers, and other contact information as according to different English-speaking regions.

•
Temporal and Numeric Expressions: Dates, times, currencies, and numbers in English forms, adhering to local conventions.

•
Idiomatic Expressions and Slang: It includes local slang, idioms, and informal phrase present in English Real Estate conversations.

This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to English Real Estate interactions.
Conversational Flow and Interaction Types
The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Real Estate customer-agent interactions.
•Simple Inquiries
•Detailed Discussions
•Transactional Interactions
•Problem-Solving Dialogues
•Advisory Sessions
•Routine Checks and Follow-Ups
Each of these conversations contains various aspects of conversation flow like:
•Greetings
•Authentication
•Information gathering
•Resolution identification
•Solution Delivery
•Closing and Follow-ups
<span
F
Hindi Conversation Chat Dataset for Retail & E-commerce Domain
futurebeeai.com
wav
Updated Aug 1, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). Hindi Conversation Chat Dataset for Retail & E-commerce Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/hindi-retail-domain-conversation-text-dataset
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/data-license-agreementhttps://www.futurebeeai.com/data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The dataset comprises over 12,000 chat conversations, each focusing on specific Retail & E-Commerce related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.
•
Participants Details: 200+ native Hindi participants from the FutureBeeAI community.

•
Word Count & Length: Chats are diverse, averaging 300 to 700 words and 50 to 150 turns across both speakers.

Topic Diversity
The chat dataset covers a wide range of conversations on Retail & E-Commerce topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Retail & E-Commerce use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.
•Inbound Chats:
•Product Inquiry
•Return/Exchange Request
•Order Cancellation
•Refund Request
•Membership/Subscriptions Enquiry
•Order Cancellations, and many more
•Outbound Chats:
•Order Confirmation
•Cross-selling and Upselling
•Account Updates
•Loyalty Program Offers
•Special Offers and Promotions
•Customer Verification, and many more
Language Variety & Nuances
The conversations in this dataset capture the diverse language styles and expressions prevalent in Hindi Retail & E-Commerce interactions. This diversity ensures the dataset accurately represents the language used by Hindi speakers in Retail & E-Commerce contexts.
The dataset encompasses a wide array of language elements, including:
•
Naming Conventions: Chats include a variety of Hindi personal and business names.

•
Localized Details: Real-world addresses, emails, phone numbers, and other contact information as according to different Hindi-speaking regions.

•
Temporal and Numeric Expressions: Dates, times, currencies, and numbers in Hindi forms, adhering to local conventions.

•
Idiomatic Expressions and Slang: It includes local slang, idioms, and informal phrase present in Hindi Retail & E-Commerce conversations.

This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Hindi Retail & E-Commerce interactions.
Conversational Flow and Interaction Types
The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Retail & E-Commerce customer-agent interactions.
•Simple Inquiries
•Detailed Discussions
•Transactional Interactions
•Problem-Solving Dialogues
•Advisory Sessions
•Routine Checks and Follow-Ups
Each of these conversations contains various aspects of conversation flow like:
•Greetings
•Authentication
•Information gathering
•Resolution identification
•Solution Delivery
•Closing and Follow-ups
<div style="margin-top:10px;
H
Twitch.tv Chat Log Data
dataverse.harvard.edu
search.dataone.org
Updated Aug 1, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jeongmin Kim (2019). Twitch.tv Chat Log Data [Dataset]. http://doi.org/10.7910/DVN/VE0IVQ
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/VE0IVQ
Dataset updated
Aug 1, 2019
Dataset provided by
Harvard Dataverse
Authors
Jeongmin Kim
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Collection of chat log of 2,162 Twitch streaming videos by 52 streamers. Time period of target streaming video is from 2018-04-24 to 2018-06-24. Description of columns follows below: body: Actual text for user chat channel_id: Channel identifier (integer) commenter_id: User identifier (integer) commenter_type: User type (character) created_at: Time of when chat was entered (ISO 8601 date and time) fragments: Chat text including parsing information of Twitch emote (JSON list) offset: Time offset between start time of video stream and the time of when chat was entered (float) updated_at: Time of when chat was edited (ISO 8601 date and time) video_id: Video identifier (integer) File name indicates name of Twitch stream channel. This dataset is saved as python3 pandas.DataFrame with python pickle format. import pandas as pd pd.read_pickle('ninja.pkl')
Enriched Topical-Chat Dataset for Knowledge-Grounded Dialogue Systems
registry.opendata.aws
Updated Aug 24, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Amazon (2020). Enriched Topical-Chat Dataset for Knowledge-Grounded Dialogue Systems [Dataset]. https://registry.opendata.aws/topical-chat-enriched/
Explore at:
Dataset updated
Aug 24, 2020
Dataset provided by
Amazon.comhttp://amazon.com/
Description
This dataset provides extra annotations on top of the publicly released Topical-Chat dataset(https://github.com/alexa/Topical-Chat) which will help in reproducing the results in our paper "Policy-Driven Neural Response Generation for Knowledge-Grounded Dialogue Systems" (https://arxiv.org/abs/2005.12529?context=cs.CL). The dataset contains 5 files: train.json, valid_freq.json, valid_rare.json, test_freq.json and test_rare.json. Each of these files will have additional annotations on top of the original Topical-Chat dataset. These specific annotations are: dialogue act annotations and knowledge sentence annotations. The annotations were computed automatically using off the shelf models which are mentioned in the README.txt
h
persona-chat
huggingface.co
Updated Apr 25, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Aleksey Korshuk (2023). persona-chat [Dataset]. https://huggingface.co/datasets/AlekseyKorshuk/persona-chat
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 25, 2023
Authors
Aleksey Korshuk
Description
AlekseyKorshuk/persona-chat dataset hosted on Hugging Face and contributed by the HF Datasets community
m
Chat Bot Dataset for AI/ML models
data.macgence.com
mp3
Updated Aug 4, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Macgence (2024). Chat Bot Dataset for AI/ML models [Dataset]. https://data.macgence.com/dataset/chat-bot-dataset-for-aiml-models
Explore at:
mp3Available download formats
Dataset updated
Aug 4, 2024
Dataset authored and provided by
Macgence
License
https://data.macgence.com/terms-and-conditionshttps://data.macgence.com/terms-and-conditions
Time period covered
2025
Area covered
Worldwide
Variables measured
Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
Description
Get a high-quality chat bot dataset for AI/ML models. Enhance NLP training with diverse conversational data for accurate, efficient machine learning applications.
Leading AI character chat categories WRTN 2024
statista.com
Updated Feb 6, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Statista (2025). Leading AI character chat categories WRTN 2024 [Dataset]. https://www.statista.com/statistics/1553814/wrtn-popular-ai-character-chat-categories/
Explore at:
Dataset updated
Feb 6, 2025
Dataset authored and provided by
Statistahttp://statista.com/
Time period covered
2024
Area covered
South Korea
Description
In 2024, romance was the leading character chat category among WRTN users. WRTN was a South Korean artificial intelligence (AI) service offering various generative AI solutions ranging from chatbots to summarizing documents. In particular, WRTN's character chat has proven to be popular among its users.
C
Customer Service Live Chat System Report
datainsightsmarket.com
doc, pdf, ppt
Updated Apr 25, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Data Insights Market (2025). Customer Service Live Chat System Report [Dataset]. https://www.datainsightsmarket.com/reports/customer-service-live-chat-system-1959011
Explore at:
doc, pdf, pptAvailable download formats
Dataset updated
Apr 25, 2025
Dataset authored and provided by
Data Insights Market
License
https://www.datainsightsmarket.com/privacy-policyhttps://www.datainsightsmarket.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The global Customer Service Live Chat System market is experiencing robust growth, driven by the increasing adoption of digital channels for customer interaction and the rising demand for improved customer service experiences. The market, estimated at $10 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033, reaching approximately $30 billion by 2033. This expansion is fueled by several key factors: the escalating preference for instant communication among consumers, the cost-effectiveness of live chat compared to traditional support methods like phone calls, and the ability of live chat systems to improve customer satisfaction and loyalty through personalized and efficient interactions. The web-based segment dominates the market due to its accessibility and scalability, while the retail and e-commerce sector accounts for the largest application share, reflecting the growing importance of online sales and the need for immediate customer support in online transactions. However, challenges remain, such as ensuring data security and integrating live chat seamlessly with other customer relationship management (CRM) systems. Significant regional variations exist within the market. North America currently holds the largest market share, driven by the early adoption of digital technologies and a strong focus on customer experience. However, Asia Pacific is projected to witness the highest growth rate during the forecast period, propelled by rapid e-commerce expansion and increasing internet penetration across developing economies like India and China. Key players in the market, including Comm100, Freshdesk, Intercom, JivoSite, Kayako, LivePerson, Zendesk, LogMeIn, LiveChat, and SnapEngage, are constantly innovating to enhance their offerings, adding features like AI-powered chatbots and advanced analytics to improve customer service efficiency and effectiveness. The market's future trajectory hinges on technological advancements, such as the integration of artificial intelligence and machine learning, and the ongoing need for businesses to provide seamless, omnichannel customer support.
F
General domain Human-Human conversation chats in Bahasa
futurebeeai.com
wav
Updated Aug 1, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). General domain Human-Human conversation chats in Bahasa [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/bahasa-general-domain-conversation-text-dataset
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/data-license-agreementhttps://www.futurebeeai.com/data-license-agreement
Dataset funded by
FutureBeeAI
Description
What’s Included
This training dataset comprises more than 10,000 conversational text data between two native Bahasa people in the general domain. We have a collection of chats on a variety of different topics/services/issues of daily life, such as music, books, festivals, health, kids, family, environment, study, childhood, cuisine, internet, movies, etc., and that makes the dataset diverse.
These chats consist of language-specific words, and phrases and follow the native way of talking which makes the chats more information-rich for your NLP model. Apart from each chat being specific to the topic, it contains various attributes like people's names, addresses, contact information, email address, time, date, local currency, telephone numbers, local slang, etc too in various formats to make the text data unbiased.
These chat scripts have between 300 and 700 words and up to 50 turns. 150 people that are a part of the FutureBeeAI crowd community contributed to this dataset. You will also receive chat metadata, such as participant age, gender, and country information, along with the chats. Dataset applications include conversational AI, natural language processing (NLP), smart assistants, text recognition, text analytics, and text prediction.
This dataset is being expanded with new chats all the time. We are able to produce text data in a variety of languages to meet your unique requirements. Check out the FutureBeeAI community for a custom collection.
This training dataset's licence belongs to FutureBeeAI!
Sensai: Toxic Chat Dataset
kaggle.com
Updated Nov 1, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
uetchy (2021). Sensai: Toxic Chat Dataset [Dataset]. https://www.kaggle.com/uetchy/sensai/code
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 1, 2021
Dataset provided by
Kagglehttp://kaggle.com/
Authors
uetchy
License
ODC Public Domain Dedication and Licence (PDDL) v1.0http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
Description
Dataset

This dataset was created by uetchy

Released under ODC Public Domain Dedication and Licence (PDDL)

Contents
WhatsApp Data Set
figshare.com
txt
Updated Feb 23, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anika Seufert; Fabian Poignée; Tobias Hoßfeld; Michael Seufert (2023). WhatsApp Data Set [Dataset]. http://doi.org/10.6084/m9.figshare.19785193.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.19785193.v1
Dataset updated
Feb 23, 2023
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Anika Seufert; Fabian Poignée; Tobias Hoßfeld; Michael Seufert
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Anonymized private WhatsApp chat histories, presented in A.Seufert, F. Poignée, M. Seufert, and T. Hoßfeld. "Share and Multiply: Modeling Communication and Generated Traffic in Private WhatsApp Groups," in IEEE Access 2023.

Details on the dataset format are given in README.md.

Facebook

Twitter

Click to copy link

Link copied

Cite

Large Model Systems Organization (2023). lmsys-chat-1m [Dataset]. https://huggingface.co/datasets/lmsys/lmsys-chat-1m

lmsys-chat-1m

lmsys/lmsys-chat-1m

Explore at:

247 scholarly articles cite this dataset (View in Google Scholar)

Dataset updated

Sep 17, 2023

Dataset authored and provided by

Large Model Systems Organization

Description

LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset

This dataset contains one million real-world conversations with 25 state-of-the-art LLMs. It is collected from 210K unique IP addresses in the wild on the Vicuna demo and Chatbot Arena website from April to August 2023. Each sample includes a conversation ID, model name, conversation text in OpenAI API JSON format, detected language tag, and OpenAI moderation API tag. User consent is obtained through the "Terms of use"… See the full description on the dataset page: https://huggingface.co/datasets/lmsys/lmsys-chat-1m.

Clear search

Close search

Google apps

Main menu

lmsys-chat-1m

toxic-chat

Synthetic-Persona-Chat

French Conversation Chat Dataset for Delivery & Logistics Domain

Introduction

Topic Diversity

Language Variety & Nuances

Conversational Flow and Interaction Types

Data Format and Structure

Usage and Application

Secure and Ethical Collection

Updates and Customization

License

Bahasa Conversation Chat Dataset for Telecom Domain

Introduction

Topic Diversity

Language Variety & Nuances

Conversational Flow and Interaction Types

Childhood Adenotonsillectomy Trial

ToxicChat Dataset

Reddit Conversation Dataset

Topical-Chat

English Conversation Chat Dataset for Real Estate Domain

Introduction

Topic Diversity

Language Variety & Nuances

Conversational Flow and Interaction Types

Hindi Conversation Chat Dataset for Retail & E-commerce Domain

Introduction

Topic Diversity

Language Variety & Nuances

Conversational Flow and Interaction Types

Twitch.tv Chat Log Data

Enriched Topical-Chat Dataset for Knowledge-Grounded Dialogue Systems

persona-chat

Chat Bot Dataset for AI/ML models

Leading AI character chat categories WRTN 2024

Customer Service Live Chat System Report

General domain Human-Human conversation chats in Bahasa

What’s Included

Sensai: Toxic Chat Dataset

Dataset

Contents

WhatsApp Data Set

lmsys-chat-1mSee More Versions

lmsys/lmsys-chat-1m

lmsys-chat-1m