100+ datasets found
  1. h

    lmsys-chat-1m

    • huggingface.co
    • opendatalab.com
    Updated Sep 17, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Large Model Systems Organization (2023). lmsys-chat-1m [Dataset]. https://huggingface.co/datasets/lmsys/lmsys-chat-1m
    Explore at:
    Dataset updated
    Sep 17, 2023
    Dataset authored and provided by
    Large Model Systems Organization
    Description

    LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset

    This dataset contains one million real-world conversations with 25 state-of-the-art LLMs. It is collected from 210K unique IP addresses in the wild on the Vicuna demo and Chatbot Arena website from April to August 2023. Each sample includes a conversation ID, model name, conversation text in OpenAI API JSON format, detected language tag, and OpenAI moderation API tag. User consent is obtained through the "Terms of use"… See the full description on the dataset page: https://huggingface.co/datasets/lmsys/lmsys-chat-1m.

  2. h

    toxic-chat

    • huggingface.co
    Updated Jan 25, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Large Model Systems Organization (2024). toxic-chat [Dataset]. https://huggingface.co/datasets/lmsys/toxic-chat
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 25, 2024
    Dataset authored and provided by
    Large Model Systems Organization
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Update

    [01/31/2024] We update the OpenAI Moderation API results for ToxicChat (0124) based on their updated moderation model on on Jan 25, 2024.[01/28/2024] We release an official T5-Large model trained on ToxicChat (toxicchat0124). Go and check it for you baseline comparision![01/19/2024] We have a new version of ToxicChat (toxicchat0124)!

      Content
    

    This dataset contains toxicity annotations on 10K user prompts collected from the Vicuna online demo. We utilize a human-AI… See the full description on the dataset page: https://huggingface.co/datasets/lmsys/toxic-chat.

  3. Synthetic-Persona-Chat

    • huggingface.co
    Updated Dec 20, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Google (2023). Synthetic-Persona-Chat [Dataset]. https://huggingface.co/datasets/google/Synthetic-Persona-Chat
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 20, 2023
    Dataset authored and provided by
    Googlehttp://google.com/
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for SPC: Synthetic-Persona-Chat Dataset

    Abstract from the paper introducing this dataset:

    High-quality conversational datasets are essential for developing AI models that can communicate with users. One way to foster deeper interactions between a chatbot and its user is through personas, aspects of the user's character that provide insights into their personality, motivations, and behaviors. Training Natural Language Processing (NLP) models on a diverse and… See the full description on the dataset page: https://huggingface.co/datasets/google/Synthetic-Persona-Chat.

  4. F

    French Conversation Chat Dataset for Delivery & Logistics Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). French Conversation Chat Dataset for Delivery & Logistics Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/french-delivery-domain-conversation-text-dataset
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    French
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The dataset comprises over 10,000 chat conversations, each focusing on specific Delivery & Logistics related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.

    [object Object][object Object]

    Topic Diversity

    The chat dataset covers a wide range of conversations on Delivery & Logistics topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Delivery & Logistics use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.

    [object Object][object Object][object Object][object Object][object Object][object Object][object Object][object Object][object Object][object Object][object Object][object Object][object Object][object Object][object Object]

    Language Variety & Nuances

    The conversations in this dataset capture the diverse language styles and expressions prevalent in French Delivery & Logistics interactions. This diversity ensures the dataset accurately represents the language used by French speakers in Delivery & Logistics contexts.

    The dataset encompasses a wide array of language elements, including:

    [object Object][object Object][object Object][object Object]

    This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to French Delivery & Logistics interactions.

    Conversational Flow and Interaction Types

    The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Delivery & Logistics customer-agent interactions.

    [object Object][object Object][object Object][object Object][object Object][object Object]

    Each of these conversations contains various aspects of conversation flow like:

    [object Object][object Object][object Object][object Object][object Object][object Object][object Object]

    This structured and varied conversational flow enables the creation of advanced NLP models that can effectively manage and respond to a wide range of customer service scenarios.

    Data Format and Structure

    The dataset is available in JSON, CSV, and TXT formats, with each conversation containing attributes like participant identifiers and chat messages, designed to be easily accessible and compatible with popular NLP frameworks.

    Usage and Application

    This dataset is useful for various applications in NLP and conversational AI, including:

    [object Object][object Object][object Object][object Object][object Object]

    Secure and Ethical Collection

    [object Object][object Object][object Object]

    Updates and Customization

    The dataset is regularly updated with new chat data. Customization options are available to meet specific needs, including:

    [object Object][object Object][object Object][object Object]

    License

    This French Conversational Chat Dataset for Delivery & Logistics is created by FutureBeeAI and is available for commercial use.

  5. F

    Bahasa Conversation Chat Dataset for Telecom Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). Bahasa Conversation Chat Dataset for Telecom Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/bahasa-telecom-domain-conversation-text-dataset
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreementhttps://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The dataset comprises over 10,000 chat conversations, each focusing on specific Telecom related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.

    Participants Details: 150+ native Bahasa participants from the FutureBeeAI community.
    Word Count & Length: Chats are diverse, averaging 300 to 700 words and 50 to 150 turns across both speakers.

    Topic Diversity

    The chat dataset covers a wide range of conversations on Telecom topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Telecom use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.

    Inbound Chats:
    Phone Number Porting
    Network Connectivity Issues
    Billing and Payments
    Technical Support
    Service Activation
    International Roaming Enquiry
    Refunds and Billing Adjustments
    Emergency Service Access, and many more
    Outbound Chats:
    Welcome Calls / Onboarding Process
    Payment Reminders
    Customer Surveys
    Technical Updates
    Service Usage Reviews
    Network Complaint Update, and many more

    Language Variety & Nuances

    The conversations in this dataset capture the diverse language styles and expressions prevalent in Bahasa Telecom interactions. This diversity ensures the dataset accurately represents the language used by Bahasa speakers in Telecom contexts.

    The dataset encompasses a wide array of language elements, including:

    Naming Conventions: Chats include a variety of Bahasa personal and business names.
    Localized Details: Real-world addresses, emails, phone numbers, and other contact information as according to different Bahasa-speaking regions.
    Temporal and Numeric Expressions: Dates, times, currencies, and numbers in Bahasa forms, adhering to local conventions.
    Idiomatic Expressions and Slang: It includes local slang, idioms, and informal phrase present in Bahasa Telecom conversations.

    This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Bahasa Telecom interactions.

    Conversational Flow and Interaction Types

    The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Telecom customer-agent interactions.

    Simple Inquiries
    Detailed Discussions
    Transactional Interactions
    Problem-Solving Dialogues
    Advisory Sessions
    Routine Checks and Follow-Ups

    Each of these conversations contains various aspects of conversation flow like:

    Greetings
    Authentication
    Information gathering
    Resolution identification
    <span

  6. s

    Childhood Adenotonsillectomy Trial

    • sleepdata.org
    Updated Feb 27, 2014
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2014). Childhood Adenotonsillectomy Trial [Dataset]. http://doi.org/10.25822/d68d-8g03
    Explore at:
    Dataset updated
    Feb 27, 2014
    Description

    The CHAT is a multi-center, single-blind, randomized, controlled trial designed to test whether after a 7-month observation period, children, ages 5 to 9.9 years, with mild to moderate obstructive sleep apnea randomized to early adenotonsillectomy (eAT) will show greater levels of neurocognitive functioning, specifically in the attention-executive functioning domain, than children randomized to watchful waiting plus supportive care (WWSC). Other outcomes assessed included other indices of neurocognitive functioning (learning and memory, information processing, etc.), physical growth, blood pressure, metabolic profile, symptoms and quality of life. Physiological measures of sleep were assessed at baseline and at 7-months with standardized full polysomnography with central scoring at the Brigham and Women’s Sleep Reading Center. In total, 1,447 children had screening polysomnographs and 464 were randomized to treatment.

  7. P

    ToxicChat Dataset

    • paperswithcode.com
    Updated May 15, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Zi Lin; Zihan Wang; Yongqi Tong; Yangkun Wang; Yuxin Guo; Yujia Wang; Jingbo Shang (2025). ToxicChat Dataset [Dataset]. https://paperswithcode.com/dataset/toxicchat
    Explore at:
    Dataset updated
    May 15, 2025
    Authors
    Zi Lin; Zihan Wang; Yongqi Tong; Yangkun Wang; Yuxin Guo; Yujia Wang; Jingbo Shang
    Description

    ToxicChat is a novel benchmark dataset constructed based on real user queries from an open-source chatbot. Unlike previous toxicity detection benchmarks that primarily rely on social media content, ToxicChat captures the rich and nuanced phenomena inherent in real-world user-AI interactions. This unique dataset reveals significant domain differences compared to social media contents, making it a valuable resource for exploring the challenges of toxicity detection in user-AI conversations¹.

    Here are some key details about the ToxicChat dataset:

    Construction: ToxicChat was created using real user queries collected from an open-source chatbot. Challenges: It contains phenomena that can be tricky for current toxicity detection models to identify. Domain Difference: ToxicChat exhibits a significant domain difference when compared to social media content. Purpose: ToxicChat serves as a benchmark to drive advancements in building a safe and healthy environment for user-AI interactions.

    Source: Conversation with Bing, 3/17/2024 (1) ToxicChat: Unveiling Hidden Challenges of Toxicity Detection in Real .... https://aclanthology.org/2023.findings-emnlp.311/. (2) arXiv:2310.17389v1 [cs.CL] 26 Oct 2023. https://arxiv.org/pdf/2310.17389. (3) README.md · lmsys/toxic-chat at main - Hugging Face. https://huggingface.co/datasets/lmsys/toxic-chat/blob/main/README.md. (4) The Toxicity Dataset - GitHub. https://github.com/surge-ai/toxicity. (5) undefined. https://aclanthology.org/2023.findings-emnlp.311. (6) undefined. https://aclanthology.org/2023.findings-emnlp.311.pdf.

  8. Reddit Conversation Dataset

    • kaggle.com
    Updated Jun 20, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Sai J (2023). Reddit Conversation Dataset [Dataset]. https://www.kaggle.com/datasets/psyflow/reddit-conversation-dataset
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 20, 2023
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Sai J
    Description

    This is aimed to build a conversation AI bot or for next word prediction. Please Upvote the dataset so that it reaches to maximum Kagglers and it can help them to build a well chat bot as the size of dataset is 2.6GB

  9. h

    Topical-Chat

    • huggingface.co
    Updated Dec 31, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Conversational Reasoning (2023). Topical-Chat [Dataset]. https://huggingface.co/datasets/Conversational-Reasoning/Topical-Chat
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 31, 2023
    Dataset authored and provided by
    Conversational Reasoning
    Description

    Topical-Chat

    We introduce Topical-Chat, a knowledge-grounded human-human conversation dataset where the underlying knowledge spans 8 broad topics and conversation partners don’t have explicitly defined roles. Topical-Chat broadly consists of two types of files:

    Conversations: JSON files containing conversations between pairs of Amazon Mechanical Turk workers. Reading Sets: JSON files containing knowledge sections rendered as reading content to the Turkers having conversations.… See the full description on the dataset page: https://huggingface.co/datasets/Conversational-Reasoning/Topical-Chat.

  10. F

    English Conversation Chat Dataset for Real Estate Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). English Conversation Chat Dataset for Real Estate Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/english-realestate-domain-conversation-text-dataset
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreementhttps://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The dataset comprises over 12,000 chat conversations, each focusing on specific Real Estate related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.

    Participants Details: 200+ native English participants from the FutureBeeAI community.
    Word Count & Length: Chats are diverse, averaging 300 to 700 words and 50 to 150 turns across both speakers.

    Topic Diversity

    The chat dataset covers a wide range of conversations on Real Estate topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Real Estate use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.

    Inbound Chats:
    Property Inquiry
    Rental Property Search & Availability
    Renovation Inquiries
    Property Features & Amenities Inquiry
    Investment Property Analysis & Advice
    Property History & Ownership Details, and many more
    Outbound Chats:
    New Property Listing Update
    Post Purchase Follow-ups
    Investment Opportunities & Property Recommendations
    Property Value Updates
    Customer Satisfaction Surveys, and many more

    Language Variety & Nuances

    The conversations in this dataset capture the diverse language styles and expressions prevalent in English Real Estate interactions. This diversity ensures the dataset accurately represents the language used by English speakers in Real Estate contexts.

    The dataset encompasses a wide array of language elements, including:

    Naming Conventions: Chats include a variety of English personal and business names.
    Localized Details: Real-world addresses, emails, phone numbers, and other contact information as according to different English-speaking regions.
    Temporal and Numeric Expressions: Dates, times, currencies, and numbers in English forms, adhering to local conventions.
    Idiomatic Expressions and Slang: It includes local slang, idioms, and informal phrase present in English Real Estate conversations.

    This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to English Real Estate interactions.

    Conversational Flow and Interaction Types

    The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Real Estate customer-agent interactions.

    Simple Inquiries
    Detailed Discussions
    Transactional Interactions
    Problem-Solving Dialogues
    Advisory Sessions
    Routine Checks and Follow-Ups

    Each of these conversations contains various aspects of conversation flow like:

    Greetings
    Authentication
    Information gathering
    Resolution identification
    Solution Delivery
    Closing and Follow-ups
    <span

  11. F

    Hindi Conversation Chat Dataset for Retail & E-commerce Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). Hindi Conversation Chat Dataset for Retail & E-commerce Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/hindi-retail-domain-conversation-text-dataset
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreementhttps://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The dataset comprises over 12,000 chat conversations, each focusing on specific Retail & E-Commerce related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.

    Participants Details: 200+ native Hindi participants from the FutureBeeAI community.
    Word Count & Length: Chats are diverse, averaging 300 to 700 words and 50 to 150 turns across both speakers.

    Topic Diversity

    The chat dataset covers a wide range of conversations on Retail & E-Commerce topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Retail & E-Commerce use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.

    Inbound Chats:
    Product Inquiry
    Return/Exchange Request
    Order Cancellation
    Refund Request
    Membership/Subscriptions Enquiry
    Order Cancellations, and many more
    Outbound Chats:
    Order Confirmation
    Cross-selling and Upselling
    Account Updates
    Loyalty Program Offers
    Special Offers and Promotions
    Customer Verification, and many more

    Language Variety & Nuances

    The conversations in this dataset capture the diverse language styles and expressions prevalent in Hindi Retail & E-Commerce interactions. This diversity ensures the dataset accurately represents the language used by Hindi speakers in Retail & E-Commerce contexts.

    The dataset encompasses a wide array of language elements, including:

    Naming Conventions: Chats include a variety of Hindi personal and business names.
    Localized Details: Real-world addresses, emails, phone numbers, and other contact information as according to different Hindi-speaking regions.
    Temporal and Numeric Expressions: Dates, times, currencies, and numbers in Hindi forms, adhering to local conventions.
    Idiomatic Expressions and Slang: It includes local slang, idioms, and informal phrase present in Hindi Retail & E-Commerce conversations.

    This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Hindi Retail & E-Commerce interactions.

    Conversational Flow and Interaction Types

    The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Retail & E-Commerce customer-agent interactions.

    Simple Inquiries
    Detailed Discussions
    Transactional Interactions
    Problem-Solving Dialogues
    Advisory Sessions
    Routine Checks and Follow-Ups

    Each of these conversations contains various aspects of conversation flow like:

    Greetings
    Authentication
    Information gathering
    Resolution identification
    Solution Delivery
    Closing and Follow-ups
    <div style="margin-top:10px;

  12. H

    Twitch.tv Chat Log Data

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Aug 1, 2019
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jeongmin Kim (2019). Twitch.tv Chat Log Data [Dataset]. http://doi.org/10.7910/DVN/VE0IVQ
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 1, 2019
    Dataset provided by
    Harvard Dataverse
    Authors
    Jeongmin Kim
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Collection of chat log of 2,162 Twitch streaming videos by 52 streamers. Time period of target streaming video is from 2018-04-24 to 2018-06-24. Description of columns follows below: body: Actual text for user chat channel_id: Channel identifier (integer) commenter_id: User identifier (integer) commenter_type: User type (character) created_at: Time of when chat was entered (ISO 8601 date and time) fragments: Chat text including parsing information of Twitch emote (JSON list) offset: Time offset between start time of video stream and the time of when chat was entered (float) updated_at: Time of when chat was edited (ISO 8601 date and time) video_id: Video identifier (integer) File name indicates name of Twitch stream channel. This dataset is saved as python3 pandas.DataFrame with python pickle format. import pandas as pd pd.read_pickle('ninja.pkl')

  13. Enriched Topical-Chat Dataset for Knowledge-Grounded Dialogue Systems

    • registry.opendata.aws
    Updated Aug 24, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Amazon (2020). Enriched Topical-Chat Dataset for Knowledge-Grounded Dialogue Systems [Dataset]. https://registry.opendata.aws/topical-chat-enriched/
    Explore at:
    Dataset updated
    Aug 24, 2020
    Dataset provided by
    Amazon.comhttp://amazon.com/
    Description

    This dataset provides extra annotations on top of the publicly released Topical-Chat dataset(https://github.com/alexa/Topical-Chat) which will help in reproducing the results in our paper "Policy-Driven Neural Response Generation for Knowledge-Grounded Dialogue Systems" (https://arxiv.org/abs/2005.12529?context=cs.CL). The dataset contains 5 files: train.json, valid_freq.json, valid_rare.json, test_freq.json and test_rare.json. Each of these files will have additional annotations on top of the original Topical-Chat dataset. These specific annotations are: dialogue act annotations and knowledge sentence annotations. The annotations were computed automatically using off the shelf models which are mentioned in the README.txt

  14. h

    persona-chat

    • huggingface.co
    Updated Apr 25, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Aleksey Korshuk (2023). persona-chat [Dataset]. https://huggingface.co/datasets/AlekseyKorshuk/persona-chat
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 25, 2023
    Authors
    Aleksey Korshuk
    Description

    AlekseyKorshuk/persona-chat dataset hosted on Hugging Face and contributed by the HF Datasets community

  15. m

    Chat Bot Dataset for AI/ML models

    • data.macgence.com
    mp3
    Updated Aug 4, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Macgence (2024). Chat Bot Dataset for AI/ML models [Dataset]. https://data.macgence.com/dataset/chat-bot-dataset-for-aiml-models
    Explore at:
    mp3Available download formats
    Dataset updated
    Aug 4, 2024
    Dataset authored and provided by
    Macgence
    License

    https://data.macgence.com/terms-and-conditionshttps://data.macgence.com/terms-and-conditions

    Time period covered
    2025
    Area covered
    Worldwide
    Variables measured
    Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
    Description

    Get a high-quality chat bot dataset for AI/ML models. Enhance NLP training with diverse conversational data for accurate, efficient machine learning applications.

  16. Leading AI character chat categories WRTN 2024

    • statista.com
    Updated Feb 6, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista (2025). Leading AI character chat categories WRTN 2024 [Dataset]. https://www.statista.com/statistics/1553814/wrtn-popular-ai-character-chat-categories/
    Explore at:
    Dataset updated
    Feb 6, 2025
    Dataset authored and provided by
    Statistahttp://statista.com/
    Time period covered
    2024
    Area covered
    South Korea
    Description

    In 2024, romance was the leading character chat category among WRTN users. WRTN was a South Korean artificial intelligence (AI) service offering various generative AI solutions ranging from chatbots to summarizing documents. In particular, WRTN's character chat has proven to be popular among its users.

  17. C

    Customer Service Live Chat System Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Apr 25, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Data Insights Market (2025). Customer Service Live Chat System Report [Dataset]. https://www.datainsightsmarket.com/reports/customer-service-live-chat-system-1959011
    Explore at:
    doc, pdf, pptAvailable download formats
    Dataset updated
    Apr 25, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policyhttps://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global Customer Service Live Chat System market is experiencing robust growth, driven by the increasing adoption of digital channels for customer interaction and the rising demand for improved customer service experiences. The market, estimated at $10 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033, reaching approximately $30 billion by 2033. This expansion is fueled by several key factors: the escalating preference for instant communication among consumers, the cost-effectiveness of live chat compared to traditional support methods like phone calls, and the ability of live chat systems to improve customer satisfaction and loyalty through personalized and efficient interactions. The web-based segment dominates the market due to its accessibility and scalability, while the retail and e-commerce sector accounts for the largest application share, reflecting the growing importance of online sales and the need for immediate customer support in online transactions. However, challenges remain, such as ensuring data security and integrating live chat seamlessly with other customer relationship management (CRM) systems. Significant regional variations exist within the market. North America currently holds the largest market share, driven by the early adoption of digital technologies and a strong focus on customer experience. However, Asia Pacific is projected to witness the highest growth rate during the forecast period, propelled by rapid e-commerce expansion and increasing internet penetration across developing economies like India and China. Key players in the market, including Comm100, Freshdesk, Intercom, JivoSite, Kayako, LivePerson, Zendesk, LogMeIn, LiveChat, and SnapEngage, are constantly innovating to enhance their offerings, adding features like AI-powered chatbots and advanced analytics to improve customer service efficiency and effectiveness. The market's future trajectory hinges on technological advancements, such as the integration of artificial intelligence and machine learning, and the ongoing need for businesses to provide seamless, omnichannel customer support.

  18. F

    General domain Human-Human conversation chats in Bahasa

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). General domain Human-Human conversation chats in Bahasa [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/bahasa-general-domain-conversation-text-dataset
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreementhttps://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    This training dataset comprises more than 10,000 conversational text data between two native Bahasa people in the general domain. We have a collection of chats on a variety of different topics/services/issues of daily life, such as music, books, festivals, health, kids, family, environment, study, childhood, cuisine, internet, movies, etc., and that makes the dataset diverse.

    These chats consist of language-specific words, and phrases and follow the native way of talking which makes the chats more information-rich for your NLP model. Apart from each chat being specific to the topic, it contains various attributes like people's names, addresses, contact information, email address, time, date, local currency, telephone numbers, local slang, etc too in various formats to make the text data unbiased.

    These chat scripts have between 300 and 700 words and up to 50 turns. 150 people that are a part of the FutureBeeAI crowd community contributed to this dataset. You will also receive chat metadata, such as participant age, gender, and country information, along with the chats. Dataset applications include conversational AI, natural language processing (NLP), smart assistants, text recognition, text analytics, and text prediction.

    This dataset is being expanded with new chats all the time. We are able to produce text data in a variety of languages to meet your unique requirements. Check out the FutureBeeAI community for a custom collection.

    This training dataset's licence belongs to FutureBeeAI!

  19. Sensai: Toxic Chat Dataset

    • kaggle.com
    Updated Nov 1, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    uetchy (2021). Sensai: Toxic Chat Dataset [Dataset]. https://www.kaggle.com/uetchy/sensai/code
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 1, 2021
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    uetchy
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0http://www.opendatacommons.org/licenses/pddl/1.0/
    License information was derived automatically

    Description

    Dataset

    This dataset was created by uetchy

    Released under ODC Public Domain Dedication and Licence (PDDL)

    Contents

  20. WhatsApp Data Set

    • figshare.com
    txt
    Updated Feb 23, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Anika Seufert; Fabian Poignée; Tobias Hoßfeld; Michael Seufert (2023). WhatsApp Data Set [Dataset]. http://doi.org/10.6084/m9.figshare.19785193.v1
    Explore at:
    txtAvailable download formats
    Dataset updated
    Feb 23, 2023
    Dataset provided by
    figshare
    Figsharehttp://figshare.com/
    Authors
    Anika Seufert; Fabian Poignée; Tobias Hoßfeld; Michael Seufert
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Anonymized private WhatsApp chat histories, presented in A.Seufert, F. Poignée, M. Seufert, and T. Hoßfeld. "Share and Multiply: Modeling Communication and Generated Traffic in Private WhatsApp Groups," in IEEE Access 2023.

    Details on the dataset format are given in README.md.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Large Model Systems Organization (2023). lmsys-chat-1m [Dataset]. https://huggingface.co/datasets/lmsys/lmsys-chat-1m

lmsys-chat-1m

lmsys/lmsys-chat-1m

Explore at:
247 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Sep 17, 2023
Dataset authored and provided by
Large Model Systems Organization
Description

LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset

This dataset contains one million real-world conversations with 25 state-of-the-art LLMs. It is collected from 210K unique IP addresses in the wild on the Vicuna demo and Chatbot Arena website from April to August 2023. Each sample includes a conversation ID, model name, conversation text in OpenAI API JSON format, detected language tag, and OpenAI moderation API tag. User consent is obtained through the "Terms of use"… See the full description on the dataset page: https://huggingface.co/datasets/lmsys/lmsys-chat-1m.

Search
Clear search
Close search
Google apps
Main menu