LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset
This dataset contains one million real-world conversations with 25 state-of-the-art LLMs. It is collected from 210K unique IP addresses in the wild on the Vicuna demo and Chatbot Arena website from April to August 2023. Each sample includes a conversation ID, model name, conversation text in OpenAI API JSON format, detected language tag, and OpenAI moderation API tag. User consent is obtained through the "Terms of use"… See the full description on the dataset page: https://huggingface.co/datasets/lmsys/lmsys-chat-1m.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Update
[01/31/2024] We update the OpenAI Moderation API results for ToxicChat (0124) based on their updated moderation model on on Jan 25, 2024.[01/28/2024] We release an official T5-Large model trained on ToxicChat (toxicchat0124). Go and check it for you baseline comparision![01/19/2024] We have a new version of ToxicChat (toxicchat0124)!
Content
This dataset contains toxicity annotations on 10K user prompts collected from the Vicuna online demo. We utilize a human-AI… See the full description on the dataset page: https://huggingface.co/datasets/lmsys/toxic-chat.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for SPC: Synthetic-Persona-Chat Dataset
Abstract from the paper introducing this dataset:
High-quality conversational datasets are essential for developing AI models that can communicate with users. One way to foster deeper interactions between a chatbot and its user is through personas, aspects of the user's character that provide insights into their personality, motivations, and behaviors. Training Natural Language Processing (NLP) models on a diverse and… See the full description on the dataset page: https://huggingface.co/datasets/google/Synthetic-Persona-Chat.
https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement
The dataset comprises over 10,000 chat conversations, each focusing on specific Delivery & Logistics related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.
[object Object][object Object]The chat dataset covers a wide range of conversations on Delivery & Logistics topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Delivery & Logistics use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.
[object Object][object Object][object Object][object Object][object Object][object Object][object Object][object Object][object Object][object Object][object Object][object Object][object Object][object Object][object Object]The conversations in this dataset capture the diverse language styles and expressions prevalent in French Delivery & Logistics interactions. This diversity ensures the dataset accurately represents the language used by French speakers in Delivery & Logistics contexts.
The dataset encompasses a wide array of language elements, including:
[object Object][object Object][object Object][object Object]This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to French Delivery & Logistics interactions.
The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Delivery & Logistics customer-agent interactions.
[object Object][object Object][object Object][object Object][object Object][object Object]Each of these conversations contains various aspects of conversation flow like:
[object Object][object Object][object Object][object Object][object Object][object Object][object Object]This structured and varied conversational flow enables the creation of advanced NLP models that can effectively manage and respond to a wide range of customer service scenarios.
The dataset is available in JSON, CSV, and TXT formats, with each conversation containing attributes like participant identifiers and chat messages, designed to be easily accessible and compatible with popular NLP frameworks.
This dataset is useful for various applications in NLP and conversational AI, including:
[object Object][object Object][object Object][object Object][object Object]The dataset is regularly updated with new chat data. Customization options are available to meet specific needs, including:
[object Object][object Object][object Object][object Object]This French Conversational Chat Dataset for Delivery & Logistics is created by FutureBeeAI and is available for commercial use.
https://www.futurebeeai.com/data-license-agreementhttps://www.futurebeeai.com/data-license-agreement
The dataset comprises over 10,000 chat conversations, each focusing on specific Telecom related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.
The chat dataset covers a wide range of conversations on Telecom topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Telecom use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.
The conversations in this dataset capture the diverse language styles and expressions prevalent in Bahasa Telecom interactions. This diversity ensures the dataset accurately represents the language used by Bahasa speakers in Telecom contexts.
The dataset encompasses a wide array of language elements, including:
This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Bahasa Telecom interactions.
The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Telecom customer-agent interactions.
Each of these conversations contains various aspects of conversation flow like:
The CHAT is a multi-center, single-blind, randomized, controlled trial designed to test whether after a 7-month observation period, children, ages 5 to 9.9 years, with mild to moderate obstructive sleep apnea randomized to early adenotonsillectomy (eAT) will show greater levels of neurocognitive functioning, specifically in the attention-executive functioning domain, than children randomized to watchful waiting plus supportive care (WWSC). Other outcomes assessed included other indices of neurocognitive functioning (learning and memory, information processing, etc.), physical growth, blood pressure, metabolic profile, symptoms and quality of life. Physiological measures of sleep were assessed at baseline and at 7-months with standardized full polysomnography with central scoring at the Brigham and Women’s Sleep Reading Center. In total, 1,447 children had screening polysomnographs and 464 were randomized to treatment.
ToxicChat is a novel benchmark dataset constructed based on real user queries from an open-source chatbot. Unlike previous toxicity detection benchmarks that primarily rely on social media content, ToxicChat captures the rich and nuanced phenomena inherent in real-world user-AI interactions. This unique dataset reveals significant domain differences compared to social media contents, making it a valuable resource for exploring the challenges of toxicity detection in user-AI conversations¹.
Here are some key details about the ToxicChat dataset:
Construction: ToxicChat was created using real user queries collected from an open-source chatbot. Challenges: It contains phenomena that can be tricky for current toxicity detection models to identify. Domain Difference: ToxicChat exhibits a significant domain difference when compared to social media content. Purpose: ToxicChat serves as a benchmark to drive advancements in building a safe and healthy environment for user-AI interactions.
Source: Conversation with Bing, 3/17/2024 (1) ToxicChat: Unveiling Hidden Challenges of Toxicity Detection in Real .... https://aclanthology.org/2023.findings-emnlp.311/. (2) arXiv:2310.17389v1 [cs.CL] 26 Oct 2023. https://arxiv.org/pdf/2310.17389. (3) README.md · lmsys/toxic-chat at main - Hugging Face. https://huggingface.co/datasets/lmsys/toxic-chat/blob/main/README.md. (4) The Toxicity Dataset - GitHub. https://github.com/surge-ai/toxicity. (5) undefined. https://aclanthology.org/2023.findings-emnlp.311. (6) undefined. https://aclanthology.org/2023.findings-emnlp.311.pdf.
This is aimed to build a conversation AI bot or for next word prediction. Please Upvote the dataset so that it reaches to maximum Kagglers and it can help them to build a well chat bot as the size of dataset is 2.6GB
Topical-Chat
We introduce Topical-Chat, a knowledge-grounded human-human conversation dataset where the underlying knowledge spans 8 broad topics and conversation partners don’t have explicitly defined roles. Topical-Chat broadly consists of two types of files:
Conversations: JSON files containing conversations between pairs of Amazon Mechanical Turk workers. Reading Sets: JSON files containing knowledge sections rendered as reading content to the Turkers having conversations.… See the full description on the dataset page: https://huggingface.co/datasets/Conversational-Reasoning/Topical-Chat.
https://www.futurebeeai.com/data-license-agreementhttps://www.futurebeeai.com/data-license-agreement
The dataset comprises over 12,000 chat conversations, each focusing on specific Real Estate related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.
The chat dataset covers a wide range of conversations on Real Estate topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Real Estate use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.
The conversations in this dataset capture the diverse language styles and expressions prevalent in English Real Estate interactions. This diversity ensures the dataset accurately represents the language used by English speakers in Real Estate contexts.
The dataset encompasses a wide array of language elements, including:
This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to English Real Estate interactions.
The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Real Estate customer-agent interactions.
Each of these conversations contains various aspects of conversation flow like:
https://www.futurebeeai.com/data-license-agreementhttps://www.futurebeeai.com/data-license-agreement
The dataset comprises over 12,000 chat conversations, each focusing on specific Retail & E-Commerce related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.
The chat dataset covers a wide range of conversations on Retail & E-Commerce topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Retail & E-Commerce use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.
The conversations in this dataset capture the diverse language styles and expressions prevalent in Hindi Retail & E-Commerce interactions. This diversity ensures the dataset accurately represents the language used by Hindi speakers in Retail & E-Commerce contexts.
The dataset encompasses a wide array of language elements, including:
This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Hindi Retail & E-Commerce interactions.
The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Retail & E-Commerce customer-agent interactions.
Each of these conversations contains various aspects of conversation flow like:
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Collection of chat log of 2,162 Twitch streaming videos by 52 streamers. Time period of target streaming video is from 2018-04-24 to 2018-06-24. Description of columns follows below: body: Actual text for user chat channel_id: Channel identifier (integer) commenter_id: User identifier (integer) commenter_type: User type (character) created_at: Time of when chat was entered (ISO 8601 date and time) fragments: Chat text including parsing information of Twitch emote (JSON list) offset: Time offset between start time of video stream and the time of when chat was entered (float) updated_at: Time of when chat was edited (ISO 8601 date and time) video_id: Video identifier (integer) File name indicates name of Twitch stream channel. This dataset is saved as python3 pandas.DataFrame with python pickle format. import pandas as pd pd.read_pickle('ninja.pkl')
This dataset provides extra annotations on top of the publicly released Topical-Chat dataset(https://github.com/alexa/Topical-Chat) which will help in reproducing the results in our paper "Policy-Driven Neural Response Generation for Knowledge-Grounded Dialogue Systems" (https://arxiv.org/abs/2005.12529?context=cs.CL). The dataset contains 5 files: train.json, valid_freq.json, valid_rare.json, test_freq.json and test_rare.json. Each of these files will have additional annotations on top of the original Topical-Chat dataset. These specific annotations are: dialogue act annotations and knowledge sentence annotations. The annotations were computed automatically using off the shelf models which are mentioned in the README.txt
AlekseyKorshuk/persona-chat dataset hosted on Hugging Face and contributed by the HF Datasets community
https://data.macgence.com/terms-and-conditionshttps://data.macgence.com/terms-and-conditions
Get a high-quality chat bot dataset for AI/ML models. Enhance NLP training with diverse conversational data for accurate, efficient machine learning applications.
In 2024, romance was the leading character chat category among WRTN users. WRTN was a South Korean artificial intelligence (AI) service offering various generative AI solutions ranging from chatbots to summarizing documents. In particular, WRTN's character chat has proven to be popular among its users.
https://www.datainsightsmarket.com/privacy-policyhttps://www.datainsightsmarket.com/privacy-policy
The global Customer Service Live Chat System market is experiencing robust growth, driven by the increasing adoption of digital channels for customer interaction and the rising demand for improved customer service experiences. The market, estimated at $10 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033, reaching approximately $30 billion by 2033. This expansion is fueled by several key factors: the escalating preference for instant communication among consumers, the cost-effectiveness of live chat compared to traditional support methods like phone calls, and the ability of live chat systems to improve customer satisfaction and loyalty through personalized and efficient interactions. The web-based segment dominates the market due to its accessibility and scalability, while the retail and e-commerce sector accounts for the largest application share, reflecting the growing importance of online sales and the need for immediate customer support in online transactions. However, challenges remain, such as ensuring data security and integrating live chat seamlessly with other customer relationship management (CRM) systems. Significant regional variations exist within the market. North America currently holds the largest market share, driven by the early adoption of digital technologies and a strong focus on customer experience. However, Asia Pacific is projected to witness the highest growth rate during the forecast period, propelled by rapid e-commerce expansion and increasing internet penetration across developing economies like India and China. Key players in the market, including Comm100, Freshdesk, Intercom, JivoSite, Kayako, LivePerson, Zendesk, LogMeIn, LiveChat, and SnapEngage, are constantly innovating to enhance their offerings, adding features like AI-powered chatbots and advanced analytics to improve customer service efficiency and effectiveness. The market's future trajectory hinges on technological advancements, such as the integration of artificial intelligence and machine learning, and the ongoing need for businesses to provide seamless, omnichannel customer support.
https://www.futurebeeai.com/data-license-agreementhttps://www.futurebeeai.com/data-license-agreement
This training dataset comprises more than 10,000 conversational text data between two native Bahasa people in the general domain. We have a collection of chats on a variety of different topics/services/issues of daily life, such as music, books, festivals, health, kids, family, environment, study, childhood, cuisine, internet, movies, etc., and that makes the dataset diverse.
These chats consist of language-specific words, and phrases and follow the native way of talking which makes the chats more information-rich for your NLP model. Apart from each chat being specific to the topic, it contains various attributes like people's names, addresses, contact information, email address, time, date, local currency, telephone numbers, local slang, etc too in various formats to make the text data unbiased.
These chat scripts have between 300 and 700 words and up to 50 turns. 150 people that are a part of the FutureBeeAI crowd community contributed to this dataset. You will also receive chat metadata, such as participant age, gender, and country information, along with the chats. Dataset applications include conversational AI, natural language processing (NLP), smart assistants, text recognition, text analytics, and text prediction.
This dataset is being expanded with new chats all the time. We are able to produce text data in a variety of languages to meet your unique requirements. Check out the FutureBeeAI community for a custom collection.
This training dataset's licence belongs to FutureBeeAI!
ODC Public Domain Dedication and Licence (PDDL) v1.0http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
This dataset was created by uetchy
Released under ODC Public Domain Dedication and Licence (PDDL)
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Anonymized private WhatsApp chat histories, presented in A.Seufert, F. Poignée, M. Seufert, and T. Hoßfeld. "Share and Multiply: Modeling Communication and Generated Traffic in Private WhatsApp Groups," in IEEE Access 2023.
Details on the dataset format are given in README.md.
LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset
This dataset contains one million real-world conversations with 25 state-of-the-art LLMs. It is collected from 210K unique IP addresses in the wild on the Vicuna demo and Chatbot Arena website from April to August 2023. Each sample includes a conversation ID, model name, conversation text in OpenAI API JSON format, detected language tag, and OpenAI moderation API tag. User consent is obtained through the "Terms of use"… See the full description on the dataset page: https://huggingface.co/datasets/lmsys/lmsys-chat-1m.