100+ datasets found
  1. Bitext Gen AI Chatbot Customer Support Dataset

    • kaggle.com
    zip
    Updated Mar 18, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Bitext (2024). Bitext Gen AI Chatbot Customer Support Dataset [Dataset]. https://www.kaggle.com/datasets/bitext/bitext-gen-ai-chatbot-customer-support-dataset
    Explore at:
    zip(3007665 bytes)Available download formats
    Dataset updated
    Mar 18, 2024
    Authors
    Bitext
    License

    https://cdla.io/sharing-1-0/https://cdla.io/sharing-1-0/

    Description

    Bitext - Customer Service Tagged Training Dataset for LLM-based Virtual Assistants

    Overview

    This dataset can be used to train Large Language Models such as GPT, Llama2 and Falcon, both for Fine Tuning and Domain Adaptation.

    The dataset has the following specs:

    • Use Case: Intent Detection
    • Vertical: Customer Service
    • 27 intents assigned to 10 categories
    • 26872 question/answer pairs, around 1000 per intent
    • 30 entity/slot types
    • 12 different types of language generation tags

    The categories and intents have been selected from Bitext's collection of 20 vertical-specific datasets, covering the intents that are common across all 20 verticals. The verticals are:

    • Automotive, Retail Banking, Education, Events & Ticketing, Field Services, Healthcare, Hospitality, Insurance, Legal Services, Manufacturing, Media Streaming, Mortgages & Loans, Moving & Storage, Real Estate/Construction, Restaurant & Bar Chains, Retail/E-commerce, Telecommunications, Travel, Utilities, Wealth Management

    For a full list of verticals and its intents see https://www.bitext.com/chatbot-verticals/.

    The question/answer pairs have been generated using a hybrid methodology that uses natural texts as source text, NLP technology to extract seeds from these texts, and NLG technology to expand the seed texts. All steps in the process are curated by computational linguists.

    Dataset Token Count

    The dataset contains an extensive amount of text data across its 'instruction' and 'response' columns. After processing and tokenizing the dataset, we've identified a total of 3.57 million tokens. This rich set of tokens is essential for training advanced LLMs for AI Conversational, AI Generative, and Question and Answering (Q&A) models.

    Fields of the Dataset

    Each entry in the dataset contains the following fields:

    • flags: tags (explained below in the Language Generation Tags section)
    • instruction: a user request from the Customer Service domain
    • category: the high-level semantic category for the intent
    • intent: the intent corresponding to the user instruction
    • response: an example expected response from the virtual assistant

    Categories and Intents

    The categories and intents covered by the dataset are:

    • ACCOUNT: create_account, delete_account, edit_account, recover_password, registration_problems, switch_account
    • CANCELLATION_FEE: check_cancellation_fee
    • CONTACT: contact_customer_service, contact_human_agent
    • DELIVERY: delivery_options, delivery_period
    • FEEDBACK: complaint, review
    • INVOICE: check_invoice, get_invoice
    • ORDER: cancel_order, change_order, place_order, track_order
    • PAYMENT: check_payment_methods, payment_issue
    • REFUND: check_refund_policy, get_refund, track_refund
    • SHIPPING_ADDRESS: change_shipping_address, set_up_shipping_address
    • SUBSCRIPTION: newsletter_subscription

    Entities

    The entities covered by the dataset are:

    • {{Order Number}}, typically present in:
    • Intents: cancel_order, change_order, change_shipping_address, check_invoice, check_refund_policy, complaint, delivery_options, delivery_period, get_invoice, get_refund, place_order, track_order, track_refund
    • {{Invoice Number}}, typically present in:
      • Intents: check_invoice, get_invoice
    • {{Online Order Interaction}}, typically present in:
      • Intents: cancel_order, change_order, check_refund_policy, delivery_period, get_refund, review, track_order, track_refund
    • {{Online Payment Interaction}}, typically present in:
      • Intents: cancel_order, check_payment_methods
    • {{Online Navigation Step}}, typically present in:
      • Intents: complaint, delivery_options
    • {{Online Customer Support Channel}}, typically present in:
      • Intents: check_refund_policy, complaint, contact_human_agent, delete_account, delivery_options, edit_account, get_refund, payment_issue, registration_problems, switch_account
    • {{Profile}}, typically present in:
      • Intent: switch_account
    • {{Profile Type}}, typically present in:
      • Intent: switch_account
    • {{Settings}}, typically present in:
      • Intents: cancel_order, change_order, change_shipping_address, check_cancellation_fee, check_invoice, check_payment_methods, contact_human_agent, delete_account, delivery_options, edit_account, get_invoice, newsletter_subscription, payment_issue, place_order, recover_password, registration_problems, set_up_shipping_address, switch_account, track_order, track_refund
    • {{Online Company Portal Info}}, typically present in:
      • Intents: cancel_order, edit_account
    • {{Date}}, typically present in:
      • Intents: check_invoice, check_refund_policy, get_refund, track_order, track_refund
    • {{Date Range}}, typically present in:
      • Intents: check_cancellation_fee, check_invoice, get_invoice
    • {{Shipping Cut-off Time}}, typically present in:
      • Intent: delivery_options
    • {{Delivery City}}, typically present in:
      • Inten...
  2. F

    Tamil Agent-Customer Chat Dataset for Retail & E-Commerce

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). Tamil Agent-Customer Chat Dataset for Retail & E-Commerce [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/tamil-retail-domain-conversation-text-dataset
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The Tamil Retail & E-Commerce Chat Dataset is a large-scale, high-quality collection of over 12,000 chat conversations between customers and call center agents, focused exclusively on Retail and E-Commerce domains. Designed to reflect real-world service interactions, this dataset supports the development of robust conversational AI and NLP models tailored for Tamil-speaking audiences.

    Participant & Chat Overview

    Contributors: 200 native Tamil speakers from the FutureBeeAI Crowd Community
    Chat Length: 300–700 words per conversation
    Turn Count: 50–150 dialogue turns across both participants
    Chat Types: Inbound and outbound
    Sentiment Coverage: Positive, neutral, and negative interaction outcomes

    Topic Diversity

    This dataset spans a wide range of Retail and E-Commerce conversation types:

    Inbound Chats (Customer-Initiated)
    Product inquiries
    Return or exchange requests
    Order cancellations
    Refunds and payment issues
    Membership or subscription queries
    Shipping, delivery, and more
    Outbound Chats (Agent-Initiated)
    Order confirmation and verification
    Cross-selling and upselling
    Loyalty program promotions
    Account updates
    Special offers and discounts
    Customer feedback and verification

    This diversity enables training of models that handle varied intents, scenarios, and outcomes within customer service workflows.

    Language Nuance & Realism

    The dataset is rich in linguistic diversity and mirrors real conversational tone and structure used in Tamil-speaking regions:

    Personal & Brand Names: Culturally accurate naming conventions
    Local Elements: Realistic addresses, phone numbers, emails, currency references, and time/date formats
    Slang & Idioms: Local expressions, informal phrases, and customer service jargon
    Cultural Specificity: Region-aware vocabulary and tone

    This linguistic authenticity ensures the development of culturally fluent AI models for Tamil Retail & E-Commerce use cases.

    Conversational Structure & Flow

    The conversations reflect natural dialogue dynamics and are organized into various types of interaction styles:

    Simple inquiries
    Detailed problem-solving discussions
    Transactional exchanges
    Follow-ups and status updates
    Advisory and assistance sessions

    Each conversation includes common dialogue stages such as:

    Greetings
    Customer authentication
    Information gathering
    <div style="margin-top:10px; margin-bottom: 10px; margin-left: 30px;font-weight: 300; display: flex; gap: 16px;

  3. d

    AI Training Dataset [Call Transcriptions] – Real support conversations for...

    • datarade.ai
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    WiserBrand.com, AI Training Dataset [Call Transcriptions] – Real support conversations for training conversational and sentiment-aware AI [Dataset]. https://datarade.ai/data-products/ai-training-dataset-call-transcriptions-real-support-conv-wiserbrand-com
    Explore at:
    .json, .csv, .xls, .txtAvailable download formats
    Dataset provided by
    WiserBrand
    Area covered
    Gibraltar, Germany, Andorra, Slovakia, Croatia, Serbia, Spain, United States of America, Guatemala, Norway
    Description

    This dataset offers real-world customer service call transcriptions, making it an ideal resource for training conversational AI, customer-facing virtual agents, and support automation systems. All calls are sourced from authentic support interactions across 160+ industries — including retail, finance, telecom, healthcare, and logistics.

    What’s included:

    • Verbatim call transcriptions of customer-agent dialogues
    • Human-curated summaries of each call’s topic and resolution
    • Sentiment classification per call: positive, neutral, or negative
    • Call duration, timestamp, location, and industry tags
    • Optional: company name and issue category

    Use this AI training dataset to:

    • Train large language models on real customer-service language and task flow
    • Improve chatbot responses with exposure to actual customer concerns
    • Model complaint escalation and frustration signals
    • Support summarization pipelines for QA and operations tools
    • Benchmark and test conversational agents on unseen, real-case inputs

    With diverse industries and naturally spoken interactions, this dataset is ideal for AI teams that require reliable, human-language training material grounded in real-world support scenarios.

  4. F

    English Agent-Customer Chat Dataset for Telecom

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). English Agent-Customer Chat Dataset for Telecom [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/english-telecom-domain-conversation-text-dataset
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The English Telecom Chat Dataset is a comprehensive collection of over 12,000 text-based conversations between telecom customers and call center agents. This dataset captures real-world service interactions and domain-specific language in English, enabling the development of intelligent conversational AI and NLP systems tailored for the telecommunications sector.Participant & Chat Overview

    Participants: 200+ native English speakers from the FutureBeeAI Crowd Community
    Conversation Length: 300–700 words per chat
    Turns per Chat: 50–150 dialogue turns across both participants
    Chat Types: Inbound and outbound
    Sentiment Coverage: A mix of positive, neutral, and negative interactions

    Topic Diversity

    This dataset spans a wide range of telecom customer service scenarios:

    Inbound Chats (Customer-Initiated)
    Phone number porting
    Network connectivity issues
    Billing inquiries and adjustments
    Technical support requests
    Service activations and upgrades
    International roaming inquiries
    Refunds and complaint resolution
    Emergency service access
    Outbound Chats (Agent-Initiated)
    Welcome and onboarding calls
    Payment reminders and due alerts
    Customer satisfaction surveys
    Technical issue follow-ups
    Usage reviews and service feedback
    Promotions and service offers

    Language Nuance & Realism

    The conversations reflect real-life telecom interactions in English, incorporating:

    Naming Patterns: Realistic English personal, business, and telecom brand names
    Localized Content: Phone numbers, email addresses, and locations consistent with regional norms
    Time & Number Formats: English representations of dates, times, currencies, and service numbers
    Informal Language & Slang: Common English expressions, idioms, and conversational shortcuts found in telecom discussions

    Conversational Flow & Structure

    Conversations follow the natural flow of telecom customer service exchanges, including:

    Dialogue Types:
    Simple service inquiries
    Detailed problem-solving discussions
    Plan explanations and upgrades
    Feedback collection and status updates
    Interaction Stages:
    Initial greetings and verification
    Data or issue collection
    Clarification and troubleshooting
    <span

  5. h

    Bitext-customer-support-llm-chatbot-training-dataset

    • huggingface.co
    • opendatalab.com
    Updated Jul 16, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Bitext (2024). Bitext-customer-support-llm-chatbot-training-dataset [Dataset]. https://huggingface.co/datasets/bitext/Bitext-customer-support-llm-chatbot-training-dataset
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 16, 2024
    Dataset authored and provided by
    Bitext
    License

    https://choosealicense.com/licenses/cdla-sharing-1.0/https://choosealicense.com/licenses/cdla-sharing-1.0/

    Description

    Bitext - Customer Service Tagged Training Dataset for LLM-based Virtual Assistants

      Overview
    

    This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the Customer Support sector can be easily achieved using our two-step approach to LLM… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-customer-support-llm-chatbot-training-dataset.

  6. Mental Health Chatbot Pairs

    • kaggle.com
    zip
    Updated Nov 27, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    The Devastator (2023). Mental Health Chatbot Pairs [Dataset]. https://www.kaggle.com/datasets/thedevastator/mental-health-chatbot-pairs/code
    Explore at:
    zip(58210 bytes)Available download formats
    Dataset updated
    Nov 27, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Mental Health Chatbot Pairs

    AI-based Tailored Support for Mental Health Conversation

    By Huggingface Hub [source]

    About this dataset

    This dataset contains a compilation of carefully-crafted Q&A pairs which are designed to provide AI-based tailored support for mental health. These carefully chosen questions and answers offer an avenue for those looking for help to gain the assistance they need. With these pre-processed conversations, Artificial Intelligence (AI) solutions can be developed and deployed to better understand and respond appropriately to individual needs based on their input. This comprehensive dataset is crafted by experts in the mental health field, providing insightful content that will further research in this growing area. These data points will be invaluable for developing the next generation of personalized AI-based mental health chatbots capable of truly understanding what people need

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    This dataset contains pre-processed Q&A pairs for AI-based tailored support for mental health. As such, it represents an excellent starting point in building a conversational model which can handle conversations about mental health issues. Here are some tips on how to use this dataset to its fullest potential:

    • Understand your data: Spend time getting to know the text of the conversation between the user and the chatbot and familiarize yourself with what type of questions and answers are included in this specific dataset. This will help you better formulate queries for your own conversational model or develop new ones you can add yourself.

    • Refine your language processing models: By studying the patterns in syntax, grammar, tone, voice, etc., within this conversational data set you can hone your natural language processing capabilities - such as keyword extractions or entity extraction – prior to implementing them into a larger bot system .

    • Test assumptions: Have an idea of what you think may work best with a particular audience or context? See if these assumptions pan out by applying different variations of text to this dataset to see if it works before rolling out changes across other channels or programs that utilize AI/chatbot services

    • Research & Analyze Results : After testing out different scenarios on real-world users by using various forms of q&a within this chatbot pair data set , analyze & record any relevant results pertaining towards understanding user behavior better through further analysis after being exposed to tailored texted conversations about Mental Health topics both passively & actively . The more information you collect here , leads us closer towards creating effective AI powered conversations that bring our desired outcomes from our customer base .

    Research Ideas

    • Developing a chatbot for personalized mental health advice and guidance tailored to individuals' unique needs, experiences, and struggles.
    • Creating an AI-driven diagnostic system that can interpret mental health conversations and provide targeted recommendations for interventions or treatments based on clinical expertise.
    • Designing an AI-powered recommendation engine to suggest relevant content such as articles, videos, or podcasts based on users’ questions or topics of discussion during their conversation with the chatbot

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv | Column name | Description | |:--------------|:------------------------------------------------------------------------| | text | The text of the conversation between the user and the chatbot. (String) |

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. If you use this dataset in your research, please credit Huggingface Hub.

  7. Multi-turn Prompts Dataset

    • kaggle.com
    zip
    Updated Oct 25, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    SoftAge.AI (2024). Multi-turn Prompts Dataset [Dataset]. https://www.kaggle.com/datasets/softageai/multi-turn-prompts-dataset/code
    Explore at:
    zip(1109121 bytes)Available download formats
    Dataset updated
    Oct 25, 2024
    Authors
    SoftAge.AI
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Description This dataset consists of 400 text-only fine-tuned versions of multi-turn conversations in the English language based on 10 categories and 19 use cases. It has been generated with ethically sourced human-in-the-loop data methods and aligned with supervised fine-tuning, direct preference optimization, and reinforcement learning through human feedback.

    The human-annotated data is focused on data quality and precision to enhance the generative response of models used for AI chatbots, thereby improving their recall memory and recognition ability for continued assistance.

    Key Features Prompts focused on user intent and were devised using natural language processing techniques. Multi-turn prompts with up to 5 turns to enhance responsive memory of large language models for pretraining. Conversational interactions for queries related to varied aspects of writing, coding, knowledge assistance, data manipulation, reasoning, and classification.

    Dataset Source Subject matter expert annotators @SoftAgeAI have annotated the data at simple and complex levels, focusing on quality factors such as content accuracy, clarity, coherence, grammar, depth of information, and overall usefulness.

    Structure & Fields The dataset is organized into different columns, which are detailed below:

    P1, R1, P2, R2, P3, R3, P4, R4, P5 (object): These columns represent the sequence of prompts (P) and responses (R) within a single interaction. Each interaction can have up to 5 prompts and 5 corresponding responses, capturing the flow of a conversation. The prompts are user inputs, and the responses are the model's outputs. Use Case (object): Specifies the primary application or scenario for which the interaction is designed, such as "Q&A helper" or "Writing assistant." This classification helps in identifying the purpose of the dialogue. Type (object): Indicates the complexity of the interaction, with entries labeled as "Complex" in this dataset. This denotes that the dialogues involve more intricate and multi-layered exchanges. Category (object): Broadly categorizes the interaction type, such as "Open-ended QA" or "Writing." This provides context on the nature of the conversation, whether it is for generating creative content, providing detailed answers, or engaging in complex problem-solving. Intended Use Cases

    The dataset can enhance query assistance model functioning related to shopping, coding, creative writing, travel assistance, marketing, citation, academic writing, language assistance, research topics, specialized knowledge, reasoning, and STEM-based. The dataset intends to aid generative models for e-commerce, customer assistance, marketing, education, suggestive user queries, and generic chatbots. It can pre-train large language models with supervision-based fine-tuned annotated data and for retrieval-augmented generative models. The dataset stands free of violence-based interactions that can lead to harm, conflict, discrimination, brutality, or misinformation. Potential Limitations & Biases This is a static dataset, so the information is dated May 2024.

    Note If you have any questions related to our data annotation and human review services for large language model training and fine-tuning, please contact us at SoftAge Information Technology Limited at info@softage.ai.

  8. F

    English Agent-Customer Chat Dataset for Healthcare Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). English Agent-Customer Chat Dataset for Healthcare Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/english-healthcare-domain-conversation-text-dataset
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The English Healthcare Chat Dataset is a rich collection of over 12,000 text-based conversations between customers and call center agents, focused on real-world healthcare interactions. Designed to reflect authentic language use and domain-specific dialogue patterns, this dataset supports the development of conversational AI, chatbots, and NLP models tailored for healthcare applications in English-speaking regions.

    Participant & Chat Overview

    Participants: 200+ native English speakers from the FutureBeeAI Crowd Community
    Conversation Length: 300–700 words per chat
    Turns per Chat: 50–150 dialogue turns across both participants
    Chat Types: Inbound and outbound
    Sentiment Coverage: Positive, neutral, and negative outcomes included

    Topic Diversity

    The dataset captures a wide spectrum of healthcare-related chat scenarios, ensuring comprehensive coverage for training robust AI systems:

    Inbound Chats (Customer-Initiated): Appointment scheduling, new patient registration, surgery and treatment consultations, diet and lifestyle discussions, insurance claim inquiries, lab result follow-ups
    Outbound Chats (Agent-Initiated): Appointment reminders and confirmations, health and wellness program offers, test result notifications, preventive care and vaccination reminders, subscription renewals, risk assessment and eligibility follow-ups

    This variety helps simulate realistic healthcare support workflows and patient-agent dynamics.

    Language Diversity & Realism

    This dataset reflects the natural flow of English healthcare communication and includes:

    Authentic Naming Patterns: English personal names, clinic names, and brands
    Localized Contact Elements: Addresses, emails, phone numbers, and clinic locations in regional English formats
    Time & Currency References: Use of dates, times, numeric expressions, and currency units aligned with English-speaking regions
    Colloquial & Medical Expressions: Local slang, informal speech, and common healthcare-related terminology

    These elements ensure the dataset is contextually relevant and linguistically rich for real-world use cases.

    Conversational Flow & Structure

    Conversations range from simple inquiries to complex advisory sessions, including:

    General inquiries
    Detailed problem-solving
    Routine status updates
    Treatment recommendations
    Support and feedback interactions

    Each conversation typically includes these structural components:

    Greetings and verification
    Information gathering
    Problem definition
    Solution delivery
    Closing messages
    Follow-up and feedback (where applicable)

    This structured flow mirrors actual healthcare support conversations and is ideal for training advanced dialogue systems.

    Data Format & Structure

    Available in JSON, CSV, and TXT formats, each conversation includes:

    Full message history with clear speaker labels
    Participant identifiers
    Metadata (e.g., topic tags, region, sentiment)
    Compatibility with common NLP and ML pipelines

    Applications

    <p

  9. d

    Speech Recognition Dataset [Customer Calls] – Transcribed support...

    • datarade.ai
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    WiserBrand.com, Speech Recognition Dataset [Customer Calls] – Transcribed support conversations for training voice AI systems [Dataset]. https://datarade.ai/data-products/speech-recognition-dataset-customer-calls-transcribed-sup-wiserbrand-com
    Explore at:
    .json, .csv, .xls, .txtAvailable download formats
    Dataset provided by
    WiserBrand
    Area covered
    Norway, Slovenia, Denmark, Croatia, United Kingdom, Portugal, Moldova (Republic of), Poland, Czech Republic, Greece
    Description

    This dataset is designed for building and improving speech recognition systems. It features transcribed customer service calls from real interactions across 160+ industries, including retail, banking, telecom, logistics, healthcare, and entertainment. Calls are natural, unscripted, and emotion-rich — making the data especially valuable for training models that must interpret speech under real-world conditions.

    Each dataset entry includes:

    • Full call transcription (agent + customer dialogue)
    • Human-written call summary
    • Overall sentiment label: positive, neutral, or negative
    • Metadata: call duration, caller location (city, state, country), timestamp
    • Optional: company name and industry tag

    Use this dataset to:

    • Train speech-to-text models on real customer language patterns. -Benchmark or evaluate speech recognition tools in support settings
    • Improve voice interfaces, chatbots, and IVR systems.
    • Model tone, frustration cues, and escalation behaviors
    • Support LLM fine-tuning for tasks involving spoken input.s

    This dataset provides your speech recognition models with exposure to genuine customer conversations, helping you build tools that can listen, understand, and act in line with how people actually speak.

    The larger the volume you purchase, the lower the price will be.

  10. h

    lmsys-chat-1m

    • huggingface.co
    • opendatalab.com
    Updated Sep 17, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Large Model Systems Organization (2023). lmsys-chat-1m [Dataset]. https://huggingface.co/datasets/lmsys/lmsys-chat-1m
    Explore at:
    Dataset updated
    Sep 17, 2023
    Dataset authored and provided by
    Large Model Systems Organization
    Description

    LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset

    This dataset contains one million real-world conversations with 25 state-of-the-art LLMs. It is collected from 210K unique IP addresses in the wild on the Vicuna demo and Chatbot Arena website from April to August 2023. Each sample includes a conversation ID, model name, conversation text in OpenAI API JSON format, detected language tag, and OpenAI moderation API tag. User consent is obtained through the "Terms of use"… See the full description on the dataset page: https://huggingface.co/datasets/lmsys/lmsys-chat-1m.

  11. d

    Call Center Audio Recordings (100,000+ Hours, High-Quality) in Multiple...

    • datarade.ai
    .mp3, .wav
    Updated Jul 23, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FileMarket (2025). Call Center Audio Recordings (100,000+ Hours, High-Quality) in Multiple Languages | Available now (off-the-shelf) [Dataset]. https://datarade.ai/data-products/call-center-audio-recordings-100-000-hours-high-quality-i-filemarket
    Explore at:
    .mp3, .wavAvailable download formats
    Dataset updated
    Jul 23, 2025
    Dataset authored and provided by
    FileMarket
    Area covered
    Hong Kong, Nauru, Iceland, Yemen, Martinique, Chad, Afghanistan, French Southern Territories, American Samoa, Lesotho
    Description

    Call Center Audio Recordings Dataset: 100,000+ Hours of Customer-Agent Conversations in Multiple Languages

    FileMarket AI Data Labs presents an extensive call center audio recordings dataset with over 100,000 hours of customer-agent conversations across a diverse range of topics. This dataset is designed to support AI, machine learning, and speech analytics projects requiring high-quality, real-world customer interaction data. Whether you're working on speech recognition, natural language processing (NLP), sentiment analysis, or any other conversational AI task, our dataset offers the breadth and quality you need to build, train, and refine cutting-edge models.

    Our dataset includes a multilingual collection of customer service interactions, recorded across various industries and service sectors. These recordings cover different call center topics such as customer support, sales and telemarketing, technical helpdesk, complaint handling, and information services, ensuring that the dataset provides rich context and variety. With support for a broad spectrum of languages including English, Spanish, French, German, Chinese, Arabic, and more, this dataset allows for training models that cater to global customer bases.

    In addition to the audio recordings, our dataset includes detailed metadata such as call duration, region, language, and call type, ensuring that data is easily usable for targeted applications. All recordings are carefully annotated for speaker separation and high fidelity to meet the highest standards for audio data.

    Our dataset is fully compliant with data protection and privacy regulations, offering consented and ethically sourced data. You can be assured that every data point meets the highest standards for legal compliance, making it safe for use in your commercial, academic, or research projects.

    At FileMarket AI Data Labs, we offer flexibility in terms of data scaling. Whether you need a small sample or a full-scale dataset, we can cater to your requirements. We also provide sample data for evaluation to help you assess quality before committing to the full dataset. Our pricing structure is competitive, with custom pricing options available for large-scale acquisitions.

    We invite you to explore this versatile dataset, which can help accelerate your AI and machine learning initiatives, whether for training conversational models, improving customer service tools, or enhancing multi-language support systems.

  12. F

    English Agent-Customer Chat Dataset for Travel

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). English Agent-Customer Chat Dataset for Travel [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/english-travel-domain-conversation-text-dataset
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The English Travel Chat Dataset is a comprehensive collection of over 12,000 text-based conversations between customers and call center agents. Focused on real-life travel and tourism interactions, this dataset captures the language, tone, and service dynamics essential for building robust conversational AI, chatbots, and NLP solutions for the travel industry in English-speaking markets.

    Participant & Chat Overview

    Participants: 200+ native English speakers from the FutureBeeAI Crowd Community
    Conversation Length: 300–700 words per chat
    Turns per Chat: 50–150 dialogue turns across both participants
    Chat Types: Inbound and outbound
    Sentiment Coverage: Includes positive, neutral, and negative interaction outcomes

    Topic Diversity

    The dataset encompasses a wide range of travel and tourism use cases across both customer-initiated and agent-initiated conversations:

    Inbound Chats (Customer-Initiated)
    Booking assistance and travel planning
    Destination information and recommendations
    Flight delays or cancellations
    Lost or delayed baggage support
    Assistance for travelers with disabilities
    Health and safety travel inquiries
    Outbound Chats (Agent-Initiated)
    Promotional offers and travel package deals
    Booking confirmations and schedule updates
    Flight change notifications
    Customer satisfaction surveys
    Visa expiration and renewal reminders
    Loyalty and feedback collection campaigns

    This variety ensures wide applicability in both sales enablement and customer support automation.

    Language Diversity & Realism

    Conversations are crafted to reflect the everyday language and nuances of English-speaking travelers:

    Naming Patterns: English personal names, airline and hotel names, tour operators
    Localized Details: Regional email formats, phone numbers, locations, and cultural references
    Time and Currency Expressions: Dates, local times, and prices represented in English forms
    Slang and Informal Speech: Common phrases and idioms used in travel planning and customer support

    These linguistic and cultural cues enable the development of context-aware, natural-sounding AI systems.

    Conversational Structure & Flow

    The dataset captures a variety of interaction types, including:

    Dialogue Types:
    Quick inquiries and confirmations
    Complex issue resolution
    Advisory and planning sessions
    Travel disruption and recovery support
    Common Flow Elements:
    Greetings and authentication
    Information request and validation
    Problem or request resolution
    <div style="margin-left: 60px; font-weight: 300; display: flex; gap: 16px; align-items:

  13. 16kHz Conversational Speech Data | 35,000 Hours | Large Language Model(LLM)...

    • datarade.ai
    Updated Dec 9, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Nexdata (2023). 16kHz Conversational Speech Data | 35,000 Hours | Large Language Model(LLM) Data | Speech AI Datasets|Machine Learning (ML) Data [Dataset]. https://datarade.ai/data-products/nexdata-multilingual-conversational-speech-data-16khz-mob-nexdata
    Explore at:
    .bin, .json, .xml, .csv, .xls, .sql, .txtAvailable download formats
    Dataset updated
    Dec 9, 2023
    Dataset authored and provided by
    Nexdata
    Area covered
    Vietnam, Korea (Republic of), Canada, Turkey, Ecuador, Malaysia, Germany, Austria, Indonesia, Saudi Arabia
    Description
    1. Specifications Format : 16kHz 16bit, uncompressed wav, mono channel;

    Environment : quiet indoor environment, without echo;

    Recording content : No preset linguistic data,dozens of topics are specified, and the speakers make dialogue under those topics while the recording is performed;

    Demographics : Speakers are evenly distributed across all age groups, covering children, teenagers, middle-aged, elderly, etc.

    Annotation : annotating for the transcription text, speaker identification, gender and noise symbols;

    Device : Android mobile phone, iPhone;

    Language : 100+ Languages;

    Application scenarios : speech recognition; voiceprint recognition;

    Accuracy rate : the word accuracy rate is not less than 98%

    1. About Nexdata Nexdata owns off-the-shelf PB-level Large Language Model(LLM) Data, 3 million hours of Audio Data and 800TB of computer vision data. These ready-to-go Machine Learning (ML) Data support instant delivery, quickly improve the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/speechrecog?source=Datarade
  14. h

    Bitext-retail-ecommerce-llm-chatbot-training-dataset

    • huggingface.co
    Updated Aug 6, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Bitext (2024). Bitext-retail-ecommerce-llm-chatbot-training-dataset [Dataset]. https://huggingface.co/datasets/bitext/Bitext-retail-ecommerce-llm-chatbot-training-dataset
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 6, 2024
    Dataset authored and provided by
    Bitext
    License

    https://choosealicense.com/licenses/cdla-sharing-1.0/https://choosealicense.com/licenses/cdla-sharing-1.0/

    Description

    Bitext - Retail (eCommerce) Tagged Training Dataset for LLM-based Virtual Assistants

      Overview
    

    This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [Retail (eCommerce)] sector can be easily achieved using our two-step approach to LLM… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-retail-ecommerce-llm-chatbot-training-dataset.

  15. D

    Multilingual NLP AI Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Oct 1, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dataintelo (2025). Multilingual NLP AI Market Research Report 2033 [Dataset]. https://dataintelo.com/report/multilingual-nlp-ai-market
    Explore at:
    pptx, csv, pdfAvailable download formats
    Dataset updated
    Oct 1, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policyhttps://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Multilingual NLP AI Market Outlook




    According to our latest research, the global multilingual NLP AI market size in 2024 stands at USD 7.4 billion, with a robust CAGR of 18.2% expected from 2025 to 2033. By the end of 2033, the market is forecasted to reach USD 37.6 billion, reflecting the surging demand for advanced natural language processing solutions capable of handling multiple languages across diverse industries. This remarkable growth is primarily driven by the proliferation of digital transformation initiatives, the exponential rise in global content creation, and the increasing need for real-time, accurate, and context-aware language processing in an interconnected world.




    One of the most significant growth factors for the multilingual NLP AI market is the rapid globalization of businesses and the need for seamless cross-border communication. Enterprises are increasingly operating on a global scale, necessitating robust multilingual support for customer engagement, product localization, and regulatory compliance. As companies strive to enhance customer experience and expand their reach, the integration of multilingual NLP AI technologies becomes indispensable. These solutions empower organizations to bridge language barriers, deliver personalized content, and provide support in native languages, which in turn drives customer loyalty and satisfaction. Furthermore, the rise of e-commerce, international trade, and remote work environments has further amplified the necessity for scalable, accurate, and context-sensitive multilingual AI solutions.




    Another critical driver fueling the expansion of the multilingual NLP AI market is the evolution of artificial intelligence and deep learning algorithms. Significant advancements in neural machine translation, contextual language modeling, and sentiment analysis have enabled NLP systems to process and understand complex linguistic nuances across numerous languages. The availability of large-scale multilingual datasets, coupled with the continuous improvement of pre-trained language models like BERT, GPT, and their multilingual variants, has significantly improved the accuracy and reliability of NLP applications. As a result, industries such as healthcare, BFSI, education, and media are leveraging these sophisticated AI-driven platforms to automate processes, extract actionable insights, and enhance decision-making capabilities in multilingual environments.




    The increasing adoption of cloud computing and API-based NLP services is also a pivotal factor in the market's upward trajectory. Cloud deployment models offer unparalleled scalability, cost-effectiveness, and accessibility, allowing organizations of all sizes to implement state-of-the-art multilingual NLP solutions without substantial upfront investments in infrastructure. This democratization of AI technology has opened new avenues for small and medium enterprises (SMEs) to compete on a global scale, driving further market penetration. Additionally, the integration of NLP AI with emerging technologies such as conversational AI, voice assistants, and intelligent chatbots is driving innovation and expanding the application landscape, particularly in customer service, healthcare diagnostics, and digital marketing.




    From a regional perspective, North America currently leads the multilingual NLP AI market, accounting for the largest revenue share in 2024, due to its advanced technological ecosystem, high AI adoption rates, and presence of major industry players. However, the Asia Pacific region is poised for the fastest growth during the forecast period, propelled by rapid digitalization, expanding internet penetration, and increasing investments in AI research and development. Europe also represents a significant market, driven by stringent regulatory frameworks promoting language inclusivity and a diverse linguistic landscape. Meanwhile, Latin America and the Middle East & Africa are gradually catching up, supported by growing awareness and government initiatives aimed at fostering digital transformation.



    Component Analysis




    The component segment of the multilingual NLP AI market is categorized into software, hardware, and services, each playing a crucial role in the overall ecosystem. Software remains the dominant segment, accounting for the largest share of the market in 2024. This dominance is attributed to the widespread deployment of NLP platforms, APIs, and pre-trained lang

  16. F

    Spanish Agent-Customer Chat Dataset for Travel

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). Spanish Agent-Customer Chat Dataset for Travel [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/spanish-travel-domain-conversation-text-dataset
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The Spanish Travel Chat Dataset is a comprehensive collection of over 10,000 text-based conversations between customers and call center agents. Focused on real-life travel and tourism interactions, this dataset captures the language, tone, and service dynamics essential for building robust conversational AI, chatbots, and NLP solutions for the travel industry in Spanish-speaking markets.

    Participant & Chat Overview

    Participants: 150+ native Spanish speakers from the FutureBeeAI Crowd Community
    Conversation Length: 300–700 words per chat
    Turns per Chat: 50–150 dialogue turns across both participants
    Chat Types: Inbound and outbound
    Sentiment Coverage: Includes positive, neutral, and negative interaction outcomes

    Topic Diversity

    The dataset encompasses a wide range of travel and tourism use cases across both customer-initiated and agent-initiated conversations:

    Inbound Chats (Customer-Initiated)
    Booking assistance and travel planning
    Destination information and recommendations
    Flight delays or cancellations
    Lost or delayed baggage support
    Assistance for travelers with disabilities
    Health and safety travel inquiries
    Outbound Chats (Agent-Initiated)
    Promotional offers and travel package deals
    Booking confirmations and schedule updates
    Flight change notifications
    Customer satisfaction surveys
    Visa expiration and renewal reminders
    Loyalty and feedback collection campaigns

    This variety ensures wide applicability in both sales enablement and customer support automation.

    Language Diversity & Realism

    Conversations are crafted to reflect the everyday language and nuances of Spanish-speaking travelers:

    Naming Patterns: Spanish personal names, airline and hotel names, tour operators
    Localized Details: Regional email formats, phone numbers, locations, and cultural references
    Time and Currency Expressions: Dates, local times, and prices represented in Spanish forms
    Slang and Informal Speech: Common phrases and idioms used in travel planning and customer support

    These linguistic and cultural cues enable the development of context-aware, natural-sounding AI systems.

    Conversational Structure & Flow

    The dataset captures a variety of interaction types, including:

    Dialogue Types:
    Quick inquiries and confirmations
    Complex issue resolution
    Advisory and planning sessions
    Travel disruption and recovery support
    Common Flow Elements:
    Greetings and authentication
    Information request and validation
    Problem or request resolution
    <div style="margin-left: 60px; font-weight: 300; display: flex; gap: 16px; align-items:

  17. t

    Privacy-Sensitive Conversations between Care Workers and Care Home Residents...

    • researchdata.tuwien.ac.at
    • test.researchdata.tuwien.at
    bin, text/markdown
    Updated Feb 25, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Reinhard Grabler; Reinhard Grabler; Michael Starzinger; Michael Starzinger; Matthias Hirschmanner; Matthias Hirschmanner; Helena Anna Frijns; Helena Anna Frijns (2025). Privacy-Sensitive Conversations between Care Workers and Care Home Residents in a Residential Care Home [Dataset]. http://doi.org/10.48436/q1kt0-edc53
    Explore at:
    bin, text/markdownAvailable download formats
    Dataset updated
    Feb 25, 2025
    Dataset provided by
    TU Wien
    Authors
    Reinhard Grabler; Reinhard Grabler; Michael Starzinger; Michael Starzinger; Matthias Hirschmanner; Matthias Hirschmanner; Helena Anna Frijns; Helena Anna Frijns
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 2024 - Aug 2024
    Description

    Dataset Card for "privacy-care-interactions"

    Table of Contents

    Dataset Description

    Purpose and Features

    🔒 Collection of Privacy-Sensitive Conversations between Care Workers and Care Home Residents in an Residential Care Home 🔒

    The dataset is useful to train and evaluate models to identify and classify privacy-sensitive parts of conversations from text, especially in the context of AI assistants and LLMs.

    Dataset Overview

    Language Distribution 🌍

    • English (en): 95

    Locale Distribution 🌎

    • United States (US) 🇺🇸: 95

    Key Facts 🔑

    • This is synthetic data! Generated using proprietary algorithms - no privacy violations!
    • Conversations are classified following the taxonomy for privacy-sensitive robotics by Rueben et al. (2017).
    • The data was manually labeled by an expert.

    Dataset Structure

    Data Instances

    The provided data format is .jsonl, the JSON Lines text format, also called newline-delimited JSON. An example entry looks as follows.

    { "text": "CW: Have you ever been to Italy? CR: Oh, yes... many years ago.", "taxonomy": 0, "category": 0, "affected_speaker": 1, "language": "en", "locale": "US", "data_type": 1, "uid": 16, "split": "train" }

    Data Fields

    The data fields are:

    • text: a string feature. The abbreviaton of the speakers refer to the care worker (CW) and the care recipient (CR).
    • taxonomy: a classification label, with possible values including informational (0), invasion (1), collection (2), processing (3), dissemination (4), physical (5), personal-space (6), territoriality (7), intrusion (8), obtrusion (9), contamination (10), modesty (11), psychological (12), interrogation (13), psychological-distance (14), social (15), association (16), crowding-isolation (17), public-gaze (18), solitude (19), intimacy (20), anonymity (21), reserve (22). The taxonomy is derived from Rueben et al. (2017). The classifications were manually labeled by an expert.
    • category: a classification label, with possible values including personal-information (0), family (1), health (2), thoughts (3), values (4), acquaintance (5), appointment (6). The privacy category affected in the conversation. The classifications were manually labeled by an expert.
    • affected_speaker: a classification label, with possible values including care-worker (0), care-recipient (1), other (2), both (3). The speaker whose privacy is impacted during the conversation. The classifications were manually labeled by an expert.
    • language: a string feature. Language code as defined by ISO 639.
    • locale: a string feature. Regional code as defined by ISO 3166-1 alpha-2.
    • data_type: a string a classification label, with possible values including real (0), synthetic (1).
    • uid: a int64 feature. A unique identifier within the dataset.
    • split: a string feature. Either train, validation or test.

    Dataset Splits

    The dataset has 2 subsets:

    • split: with a total of 95 examples split into train, validation and test (70%-15%-15%)
    • unsplit: with a total of 95 examples in a single train split
    nametrainvalidationtest
    split661415
    unsplit95n/an/a

    The files follow the naming convention subset-split-language.jsonl. The following files are contained in the dataset:

    • split-train-en.jsonl
    • split-validation-en.jsonl
    • split-test-en.jsonl
    • unsplit-train-en.jsonl

    Dataset Creation

    Curation Rationale

    Recording audio of care workers and residents during care interactions, which includes partial and full body washing, giving of medication, as well as wound care, is a highly privacy-sensitive use case. Therefore, a dataset is created, which includes privacy-sensitive parts of conversations, synthesized from real-world data. This dataset serves as a basis for fine-tuning a local LLM to highlight and classify privacy-sensitive sections of transcripts created in care interactions, to further mask them to protect privacy.

    Source Data

    Initial Data Collection

    The intial data was collected in the project Caring Robots of TU Wien in cooperation with Caritas Wien. One project track aims to facilitate Large Languge Models (LLM) to support documentation of care workers, with LLM-generated summaries of audio recordings of interactions between care workers and care home residents. The initial data are the transcriptions of those care interactions.

    Data Processing

    The transcriptions were thoroughly reviewed, and sections containing privacy-sensitive information were identified and marked using qualitative data analysis software by two experts. Subsequently, the sections were translated from German to U.S. English using the locally executed LLM icky/translate. In the next step, another llama3.1:70b was used locally to synthesize the conversation segments. This process involved generating similar, yet distinct and new, conversations that are not linked to the original data. The dataset was split using the train_test_split function from the <a href="https://scikit-learn.org/1.5/modules/generated/sklearn.model_selection.train_test_split.html" target="_blank" rel="noopener

  18. F

    Russian Call Center Data for Travel AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). Russian Call Center Data for Travel AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/travel-call-center-conversation-russian-russia
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This Russian Call Center Speech Dataset for the Travel industry is purpose-built to power the next generation of voice AI applications for travel booking, customer support, and itinerary assistance. With over 30 hours of unscripted, real-world conversations, the dataset enables the development of highly accurate speech recognition and natural language understanding models tailored for Russian -speaking travelers.

    Created by FutureBeeAI, this dataset supports researchers, data scientists, and conversational AI teams in building voice technologies for airlines, travel portals, and hospitality platforms.

    Speech Data

    The dataset includes 30 hours of dual-channel audio recordings between native Russian speakers engaged in real travel-related customer service conversations. These audio files reflect a wide variety of topics, accents, and scenarios found across the travel and tourism industry.

    Participant Diversity:
    Speakers: 60 native Russian contributors from our verified pool.
    Regions: Covering multiple Russia provinces to capture accent and dialectal variation.
    Participant Profile: Balanced representation of age (18–70) and gender (60% male, 40% female).
    Recording Details:
    Conversation Nature: Naturally flowing, spontaneous customer-agent calls.
    Call Duration: Between 5 and 15 minutes per session.
    Audio Format: Stereo WAV, 16-bit depth, at 8kHz and 16kHz.
    Recording Environment: Captured in controlled, noise-free, echo-free settings.

    Topic Diversity

    Inbound and outbound conversations span a wide range of real-world travel support situations with varied outcomes (positive, neutral, negative).

    Inbound Calls:
    Booking Assistance
    Destination Information
    Flight Delays or Cancellations
    Support for Disabled Passengers
    Health and Safety Travel Inquiries
    Lost or Delayed Luggage, and more
    Outbound Calls:
    Promotional Travel Offers
    Customer Feedback Surveys
    Booking Confirmations
    Flight Rescheduling Alerts
    Visa Expiry Notifications, and others

    These scenarios help models understand and respond to diverse traveler needs in real-time.

    Transcription

    Each call is accompanied by manually curated, high-accuracy transcriptions in JSON format.

    Transcription Includes:
    Speaker-Segmented Dialogues
    Time-Stamped Segments
    Non-speech Markers (e.g., pauses, coughs)
    High transcription accuracy by dual-layered transcription review ensures word error rate under 5%.

    Metadata

    Extensive metadata enriches each call and speaker for better filtering and AI training:

    Participant Metadata: ID, age, gender, region, accent, and dialect.
    Conversation Metadata: Topic, domain, call type, sentiment, and audio specs.

    Usage and Applications

    This dataset is ideal for a variety of AI use cases in the travel and tourism space:

    ASR Systems: Train Russian speech-to-text engines for travel platforms.
    <div style="margin-top:10px; margin-bottom: 10px; padding-left: 30px; display: flex;

  19. F

    Vietnamese Agent-Customer Chat Dataset for Healthcare Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). Vietnamese Agent-Customer Chat Dataset for Healthcare Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/vietnamese-healthcare-domain-conversation-text-dataset
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The Vietnamese Healthcare Chat Dataset is a rich collection of over 10,000 text-based conversations between customers and call center agents, focused on real-world healthcare interactions. Designed to reflect authentic language use and domain-specific dialogue patterns, this dataset supports the development of conversational AI, chatbots, and NLP models tailored for healthcare applications in Vietnamese-speaking regions.

    Participant & Chat Overview

    Participants: 150+ native Vietnamese speakers from the FutureBeeAI Crowd Community
    Conversation Length: 300–700 words per chat
    Turns per Chat: 50–150 dialogue turns across both participants
    Chat Types: Inbound and outbound
    Sentiment Coverage: Positive, neutral, and negative outcomes included

    Topic Diversity

    The dataset captures a wide spectrum of healthcare-related chat scenarios, ensuring comprehensive coverage for training robust AI systems:

    Inbound Chats (Customer-Initiated): Appointment scheduling, new patient registration, surgery and treatment consultations, diet and lifestyle discussions, insurance claim inquiries, lab result follow-ups
    Outbound Chats (Agent-Initiated): Appointment reminders and confirmations, health and wellness program offers, test result notifications, preventive care and vaccination reminders, subscription renewals, risk assessment and eligibility follow-ups

    This variety helps simulate realistic healthcare support workflows and patient-agent dynamics.

    Language Diversity & Realism

    This dataset reflects the natural flow of Vietnamese healthcare communication and includes:

    Authentic Naming Patterns: Vietnamese personal names, clinic names, and brands
    Localized Contact Elements: Addresses, emails, phone numbers, and clinic locations in regional Vietnamese formats
    Time & Currency References: Use of dates, times, numeric expressions, and currency units aligned with Vietnamese-speaking regions
    Colloquial & Medical Expressions: Local slang, informal speech, and common healthcare-related terminology

    These elements ensure the dataset is contextually relevant and linguistically rich for real-world use cases.

    Conversational Flow & Structure

    Conversations range from simple inquiries to complex advisory sessions, including:

    General inquiries
    Detailed problem-solving
    Routine status updates
    Treatment recommendations
    Support and feedback interactions

    Each conversation typically includes these structural components:

    Greetings and verification
    Information gathering
    Problem definition
    Solution delivery
    Closing messages
    Follow-up and feedback (where applicable)

    This structured flow mirrors actual healthcare support conversations and is ideal for training advanced dialogue systems.

    Data Format & Structure

    Available in JSON, CSV, and TXT formats, each conversation includes:

    Full message history with clear speaker labels
    Participant identifiers
    Metadata (e.g., topic tags, region, sentiment)
    Compatibility with common NLP and ML pipelines
    <h3 style="font-weight:

  20. F

    Punjabi Agent-Customer Chat Dataset for BFSI Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). Punjabi Agent-Customer Chat Dataset for BFSI Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/punjabi-bfsi-domain-conversation-text-dataset
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The Punjabi BFSI Chat Dataset is a comprehensive collection of over 12,000 text-based chat conversations between customers and call center agents. Focused on Banking, Financial Services, and Insurance (BFSI) interactions, this dataset captures real-world service dialogues, complete with domain-specific language, customer intents, and varied conversational flows.

    Participant & Chat Overview

    Participants: 200 native Punjabi speakers from the FutureBeeAI Crowd Community
    Conversation Length: 300–700 words per chat
    Turns per Chat: 50–150 dialogue turns across both participants
    Chat Types: Inbound and outbound
    Sentiment Coverage: Includes positive, neutral, and negative interaction outcomes

    Topic Diversity

    This dataset reflects the wide range of customer interactions typically encountered in the BFSI sector:

    Inbound Chats (Customer-Initiated)
    Account opening and management
    Transaction-related queries
    Loan inquiries and applications
    Credit card issues
    Insurance questions and requests
    Outbound Chats (Agent-Initiated)
    Product and service promotions
    Cross-selling and upselling efforts
    Loan follow-ups and reminders
    Customer retention and loyalty program outreach
    Insurance policy renewals and verifications

    This topic spread ensures applicability across customer service automation, intent classification, and domain-specific model training.

    Language Nuance & Cultural Relevance

    Conversations capture natural Punjabi as spoken in BFSI contexts, incorporating:

    Names & Branding: Realistic Punjabi personal and business names
    Local Contextual Elements: Emails, phone numbers, addresses, time/date references, and currency in Punjabi format
    Colloquial Speech & Slang: Regional idioms, informal expressions, and domain-specific jargon
    Numerical Expressions: Use of Punjabi numerals, amounts, dates, and measurements as per local conventions

    This linguistic richness enables the training of models that can understand real-world customer queries in culturally relevant contexts.

    Conversational Structure & Flow

    The dataset reflects structured dialogue flow and interaction dynamics seen in BFSI customer service environments:

    Types of Conversations:
    Simple inquiries
    Complex problem-solving discussions
    Transactional updates
    Advisory sessions
    Follow-ups and routine status checks
    Typical Chat Components:
    Greetings and opening
    Customer authentication
    Information gathering
    <div style="margin-top:10px; margin-bottom: 10px; margin-left:

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Bitext (2024). Bitext Gen AI Chatbot Customer Support Dataset [Dataset]. https://www.kaggle.com/datasets/bitext/bitext-gen-ai-chatbot-customer-support-dataset
Organization logo

Bitext Gen AI Chatbot Customer Support Dataset

Explore at:
2 scholarly articles cite this dataset (View in Google Scholar)
zip(3007665 bytes)Available download formats
Dataset updated
Mar 18, 2024
Authors
Bitext
License

https://cdla.io/sharing-1-0/https://cdla.io/sharing-1-0/

Description

Bitext - Customer Service Tagged Training Dataset for LLM-based Virtual Assistants

Overview

This dataset can be used to train Large Language Models such as GPT, Llama2 and Falcon, both for Fine Tuning and Domain Adaptation.

The dataset has the following specs:

  • Use Case: Intent Detection
  • Vertical: Customer Service
  • 27 intents assigned to 10 categories
  • 26872 question/answer pairs, around 1000 per intent
  • 30 entity/slot types
  • 12 different types of language generation tags

The categories and intents have been selected from Bitext's collection of 20 vertical-specific datasets, covering the intents that are common across all 20 verticals. The verticals are:

  • Automotive, Retail Banking, Education, Events & Ticketing, Field Services, Healthcare, Hospitality, Insurance, Legal Services, Manufacturing, Media Streaming, Mortgages & Loans, Moving & Storage, Real Estate/Construction, Restaurant & Bar Chains, Retail/E-commerce, Telecommunications, Travel, Utilities, Wealth Management

For a full list of verticals and its intents see https://www.bitext.com/chatbot-verticals/.

The question/answer pairs have been generated using a hybrid methodology that uses natural texts as source text, NLP technology to extract seeds from these texts, and NLG technology to expand the seed texts. All steps in the process are curated by computational linguists.

Dataset Token Count

The dataset contains an extensive amount of text data across its 'instruction' and 'response' columns. After processing and tokenizing the dataset, we've identified a total of 3.57 million tokens. This rich set of tokens is essential for training advanced LLMs for AI Conversational, AI Generative, and Question and Answering (Q&A) models.

Fields of the Dataset

Each entry in the dataset contains the following fields:

  • flags: tags (explained below in the Language Generation Tags section)
  • instruction: a user request from the Customer Service domain
  • category: the high-level semantic category for the intent
  • intent: the intent corresponding to the user instruction
  • response: an example expected response from the virtual assistant

Categories and Intents

The categories and intents covered by the dataset are:

  • ACCOUNT: create_account, delete_account, edit_account, recover_password, registration_problems, switch_account
  • CANCELLATION_FEE: check_cancellation_fee
  • CONTACT: contact_customer_service, contact_human_agent
  • DELIVERY: delivery_options, delivery_period
  • FEEDBACK: complaint, review
  • INVOICE: check_invoice, get_invoice
  • ORDER: cancel_order, change_order, place_order, track_order
  • PAYMENT: check_payment_methods, payment_issue
  • REFUND: check_refund_policy, get_refund, track_refund
  • SHIPPING_ADDRESS: change_shipping_address, set_up_shipping_address
  • SUBSCRIPTION: newsletter_subscription

Entities

The entities covered by the dataset are:

  • {{Order Number}}, typically present in:
  • Intents: cancel_order, change_order, change_shipping_address, check_invoice, check_refund_policy, complaint, delivery_options, delivery_period, get_invoice, get_refund, place_order, track_order, track_refund
  • {{Invoice Number}}, typically present in:
    • Intents: check_invoice, get_invoice
  • {{Online Order Interaction}}, typically present in:
    • Intents: cancel_order, change_order, check_refund_policy, delivery_period, get_refund, review, track_order, track_refund
  • {{Online Payment Interaction}}, typically present in:
    • Intents: cancel_order, check_payment_methods
  • {{Online Navigation Step}}, typically present in:
    • Intents: complaint, delivery_options
  • {{Online Customer Support Channel}}, typically present in:
    • Intents: check_refund_policy, complaint, contact_human_agent, delete_account, delivery_options, edit_account, get_refund, payment_issue, registration_problems, switch_account
  • {{Profile}}, typically present in:
    • Intent: switch_account
  • {{Profile Type}}, typically present in:
    • Intent: switch_account
  • {{Settings}}, typically present in:
    • Intents: cancel_order, change_order, change_shipping_address, check_cancellation_fee, check_invoice, check_payment_methods, contact_human_agent, delete_account, delivery_options, edit_account, get_invoice, newsletter_subscription, payment_issue, place_order, recover_password, registration_problems, set_up_shipping_address, switch_account, track_order, track_refund
  • {{Online Company Portal Info}}, typically present in:
    • Intents: cancel_order, edit_account
  • {{Date}}, typically present in:
    • Intents: check_invoice, check_refund_policy, get_refund, track_order, track_refund
  • {{Date Range}}, typically present in:
    • Intents: check_cancellation_fee, check_invoice, get_invoice
  • {{Shipping Cut-off Time}}, typically present in:
    • Intent: delivery_options
  • {{Delivery City}}, typically present in:
    • Inten...
Search
Clear search
Close search
Google apps
Main menu