48 datasets found
  1. Hindi Human-Human Chat Dataset for Conversational AI & NLP

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Hindi Human-Human Chat Dataset for Conversational AI & NLP [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/hindi-general-domain-conversation-text-dataset
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The Hindi General Domain Chat Dataset is a high-quality, text-based dataset designed to train and evaluate conversational AI, NLP models, and smart assistants in real-world Hindi usage. Collected through FutureBeeAI’s trusted crowd community, this dataset reflects natural, native-level Hindi conversations covering a broad spectrum of everyday topics.

    Conversational Text Data

    This dataset includes over 15,000 chat transcripts, each featuring free-flowing dialogue between two native Hindi speakers. The conversations are spontaneous, context-rich, and mimic informal, real-life texting behavior.

    Words per Chat: 300–700
    Turns per Chat: Up to 50 dialogue turns
    Contributors: 200 native Hindi speakers from the FutureBeeAI Crowd Community
    Format: TXT, DOCS, JSON or CSV (customizable)
    Structure: Each record contains the full chat, topic tag, and metadata block

    Diversity and Domain Coverage

    Conversations span a wide variety of general-domain topics to ensure comprehensive model exposure:

    Music, books, and movies
    Health and wellness
    Children and parenting
    Family life and relationships
    Food and cooking
    Education and studying
    Festivals and traditions
    Environment and daily life
    Internet and tech usage
    Childhood memories and casual chatting

    This diversity ensures the dataset is useful across multiple NLP and language understanding applications.

    Linguistic Authenticity

    Chats reflect informal, native-level Hindi usage with:

    Colloquial expressions and local dialect influence
    Domain-relevant terminology
    Language-specific grammar, phrasing, and sentence flow
    Inclusion of realistic details such as names, phone numbers, email addresses, locations, dates, times, local currencies, and culturally grounded references
    Representation of different writing styles and input quirks to ensure training data realism

    Metadata

    Every chat instance is accompanied by structured metadata, which includes:

    Participant Age
    Gender
    Country/Region
    Chat Domain
    Chat Topic
    Dialect

    This metadata supports model filtering, demographic-specific evaluation, and more controlled fine-tuning workflows.
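
    As a rough illustration of metadata-based filtering, the sketch below assumes each chat is a dictionary carrying a `metadata` block. The field names (`participant_age`, `gender`, `dialect`, `chat_topic`), the `chat_id` key, and the sample values are hypothetical; the listing does not publish the actual schema.

```python
# Hypothetical records; the real field names and values may differ.
SAMPLE_CHATS = [
    {"chat_id": "c1", "metadata": {"participant_age": 24, "gender": "female",
                                   "dialect": "Awadhi", "chat_topic": "Food and cooking"}},
    {"chat_id": "c2", "metadata": {"participant_age": 41, "gender": "male",
                                   "dialect": "Bhojpuri", "chat_topic": "Health and wellness"}},
    {"chat_id": "c3", "metadata": {"participant_age": 19, "gender": "male",
                                   "dialect": "Awadhi", "chat_topic": "Music, books, and movies"}},
]


def filter_chats(chats, *, min_age=None, max_age=None, **equals):
    """Return chats whose metadata meets the age bounds and exact-match fields."""
    selected = []
    for chat in chats:
        meta = chat.get("metadata", {})
        age = meta.get("participant_age")
        # Records without an age are excluded whenever an age bound is requested.
        if min_age is not None and (age is None or age < min_age):
            continue
        if max_age is not None and (age is None or age > max_age):
            continue
        # Every keyword argument must match the metadata value exactly.
        if any(meta.get(key) != value for key, value in equals.items()):
            continue
        selected.append(chat)
    return selected


young_awadhi = filter_chats(SAMPLE_CHATS, max_age=30, dialect="Awadhi")
print([chat["chat_id"] for chat in young_awadhi])  # ['c1', 'c3']
```

    The same helper could select an evaluation slice for a single demographic group, provided the delivered metadata uses comparable keys.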

    Data Quality Assurance

    All chat records pass through a rigorous QA process to maintain consistency and accuracy:

    Manual review for content completeness
    Format checks for chat turns and metadata
    Linguistic verification by native speakers
    Removal of inappropriate or unusable samples

    This ensures a clean, reliable dataset ready for high-performance AI model training.

    Applications

    This dataset is ideal for training and evaluating a wide range of text-based AI systems:

    Conversational AI / Chatbots
    Smart assistants and voicebots

  2. 797 Hours - Hindi(India) Spontaneous Dialogue Smartphone speech dataset

    • m.nexdata.ai
    • nexdata.ai
    Updated Jun 1, 2025
    + more versions
    Cite
    Nexdata (2025). 797 Hours - Hindi(India) Spontaneous Dialogue Smartphone speech dataset [Dataset]. https://m.nexdata.ai/datasets/speechrecog/1156?source=Huggingface
    Dataset updated
    Jun 1, 2025
    Dataset authored and provided by
    Nexdata
    Area covered
    India
    Variables measured
    Format, Country, Speaker, Language, Accuracy Rate, Content category, Recording device, Recording condition, Language(Region) Code, Features of annotation
    Description

    Hindi (India) Spontaneous Dialogue Smartphone speech dataset, collected from dialogues based on given topics and covering 20+ domains. Transcribed with text content, speaker ID, gender, age, and other attributes. The dataset was collected from a large, geographically diverse pool of speakers (1,022 native speakers), which is intended to improve model performance on real, complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, protecting user privacy and legal rights throughout data collection, storage, and usage; our datasets are all GDPR, CCPA, and PIPL compliant.

  3. Hindi General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Hindi General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-hindi-india
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Hindi General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Hindi speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Hindi communication.

    Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Hindi speech models that understand and respond to authentic Indian accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Hindi. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native Hindi speakers from FutureBeeAI’s contributor community.
    Regions: Representing various provinces of India to ensure dialectal diversity and demographic balance.
    Demographics: A 60% male, 40% female gender split, with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at a 16 kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.
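
    A minimal sketch of how the stated audio format (stereo, 16-bit, 16 kHz WAV) might be checked with Python's standard `wave` module. The helper names and the `EXPECTED` dictionary are illustrative assumptions, and the example builds a silent in-memory file so that it runs without any dataset files.

```python
import io
import wave

# Format stated in the listing: stereo, 16-bit (2 bytes per sample), 16 kHz.
EXPECTED = {"channels": 2, "sample_width_bytes": 2, "sample_rate_hz": 16000}


def read_wav_format(source):
    """Read channel count, sample width, and sample rate from a WAV path or file object."""
    with wave.open(source, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": wav.getframerate(),
        }


def make_silent_wav(seconds=1, channels=2, sample_width=2, rate=16000):
    """Build a silent in-memory WAV so the example needs no dataset files."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * (seconds * rate * channels * sample_width))
    buffer.seek(0)
    return buffer


fmt = read_wav_format(make_silent_wav())
print(fmt == EXPECTED)  # True
```

    In practice `read_wav_format` would be pointed at each delivered file path, and any file whose result differs from `EXPECTED` would be flagged before training.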

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
    High transcription accuracy, achieved through double QA pass, average WER < 5%

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
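
    The listing does not say how the quoted WER figure was measured. For reference, the sketch below uses the conventional definition: word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The function name and the romanized sample sentences are illustrative only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion (reference word missing)
                current[j - 1] + 1,      # insertion (extra hypothesis word)
                previous[j - 1] + cost,  # substitution or exact match
            ))
        previous = current
    return previous[-1] / len(ref)


reference = "main kal bazaar ja raha hoon"
hypothesis = "main kal bazar ja raha"
# One substitution (bazaar -> bazar) and one deletion (hoon) over 6 words.
print(round(word_error_rate(reference, hypothesis), 3))  # 0.333
```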

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    Usage and Applications

    This dataset is a versatile resource for multiple Hindi speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for Hindi.
    Voice Assistants: Build smart assistants capable of understanding natural Indian conversations.

  4. Hindi Dataset

    • shaip.com
    Updated Mar 22, 2023
    + more versions
    Cite
    Shaip (2023). Hindi Dataset [Dataset]. https://www.shaip.com/offerings/speech-data-catalog/hindi-dataset/
    Dataset updated
    Mar 22, 2023
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Hindi Dataset: High-Quality Hindi TTS, General Conversation, and Podcast Dataset for AI & ASR Models.

  5. Hindi Agent-Customer Chat Dataset for Healthcare Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Hindi Agent-Customer Chat Dataset for Healthcare Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/hindi-healthcare-domain-conversation-text-dataset
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The Hindi Healthcare Chat Dataset is a rich collection of over 12,000 text-based conversations between customers and call center agents, focused on real-world healthcare interactions. Designed to reflect authentic language use and domain-specific dialogue patterns, this dataset supports the development of conversational AI, chatbots, and NLP models tailored for healthcare applications in Hindi-speaking regions.

    Participant & Chat Overview

    Participants: 200+ native Hindi speakers from the FutureBeeAI Crowd Community
    Conversation Length: 300–700 words per chat
    Turns per Chat: 50–150 dialogue turns across both participants
    Chat Types: Inbound and outbound
    Sentiment Coverage: Positive, neutral, and negative outcomes included

    Topic Diversity

    The dataset captures a wide spectrum of healthcare-related chat scenarios, ensuring comprehensive coverage for training robust AI systems:

    Inbound Chats (Customer-Initiated): Appointment scheduling, new patient registration, surgery and treatment consultations, diet and lifestyle discussions, insurance claim inquiries, lab result follow-ups
    Outbound Chats (Agent-Initiated): Appointment reminders and confirmations, health and wellness program offers, test result notifications, preventive care and vaccination reminders, subscription renewals, risk assessment and eligibility follow-ups

    This variety helps simulate realistic healthcare support workflows and patient-agent dynamics.

    Language Diversity & Realism

    This dataset reflects the natural flow of Hindi healthcare communication and includes:

    Authentic Naming Patterns: Hindi personal names, clinic names, and brands
    Localized Contact Elements: Addresses, emails, phone numbers, and clinic locations in regional Hindi formats
    Time & Currency References: Use of dates, times, numeric expressions, and currency units aligned with Hindi-speaking regions
    Colloquial & Medical Expressions: Local slang, informal speech, and common healthcare-related terminology

    These elements ensure the dataset is contextually relevant and linguistically rich for real-world use cases.

    Conversational Flow & Structure

    Conversations range from simple inquiries to complex advisory sessions, including:

    General inquiries
    Detailed problem-solving
    Routine status updates
    Treatment recommendations
    Support and feedback interactions

    Each conversation typically includes these structural components:

    Greetings and verification
    Information gathering
    Problem definition
    Solution delivery
    Closing messages
    Follow-up and feedback (where applicable)

    This structured flow mirrors actual healthcare support conversations and is ideal for training advanced dialogue systems.

    Data Format & Structure

    Available in JSON, CSV, and TXT formats, each conversation includes:

    Full message history with clear speaker labels
    Participant identifiers
    Metadata (e.g., topic tags, region, sentiment)
    Compatibility with common NLP and ML pipelines
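
    To make the record structure concrete, here is a hedged sketch that checks one conversation against the ranges quoted in this entry (50–150 turns, 300–700 words). The `messages`, `speaker`, and `text` keys and the placeholder dialogue are assumptions; the actual JSON layout is not shown in the listing.

```python
def summarize_chat(chat):
    """Count turns and words in a chat and compare them with the listed ranges."""
    messages = chat["messages"]
    turns = len(messages)
    words = sum(len(message["text"].split()) for message in messages)
    return {
        "speakers": sorted({message["speaker"] for message in messages}),
        "turns": turns,
        "words": words,
        "turns_in_range": 50 <= turns <= 150,   # "50–150 dialogue turns"
        "words_in_range": 300 <= words <= 700,  # "300–700 words per chat"
    }


# Hypothetical record: 60 alternating turns of six placeholder words each.
sample_chat = {
    "messages": [
        {"speaker": "customer" if i % 2 == 0 else "agent",
         "text": "namaste main aapki madad kar sakta"}
        for i in range(60)
    ],
    "metadata": {"topic": "Appointment scheduling", "sentiment": "neutral"},
}

summary = summarize_chat(sample_chat)
print(summary["turns"], summary["words"])  # 60 360
```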

    Applications


  6. Hindi Agent-Customer Chat Dataset for Telecom

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Hindi Agent-Customer Chat Dataset for Telecom [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/hindi-telecom-domain-conversation-text-dataset
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The Hindi Telecom Chat Dataset is a comprehensive collection of over 12,000 text-based conversations between telecom customers and call center agents. This dataset captures real-world service interactions and domain-specific language in Hindi, enabling the development of intelligent conversational AI and NLP systems tailored for the telecommunications sector.

    Participant & Chat Overview

    Participants: 200+ native Hindi speakers from the FutureBeeAI Crowd Community
    Conversation Length: 300–700 words per chat
    Turns per Chat: 50–150 dialogue turns across both participants
    Chat Types: Inbound and outbound
    Sentiment Coverage: A mix of positive, neutral, and negative interactions

    Topic Diversity

    This dataset spans a wide range of telecom customer service scenarios:

    Inbound Chats (Customer-Initiated)
    Phone number porting
    Network connectivity issues
    Billing inquiries and adjustments
    Technical support requests
    Service activations and upgrades
    International roaming inquiries
    Refunds and complaint resolution
    Emergency service access
    Outbound Chats (Agent-Initiated)
    Welcome and onboarding calls
    Payment reminders and due alerts
    Customer satisfaction surveys
    Technical issue follow-ups
    Usage reviews and service feedback
    Promotions and service offers

    Language Nuance & Realism

    The conversations reflect real-life telecom interactions in Hindi, incorporating:

    Naming Patterns: Realistic Hindi personal, business, and telecom brand names
    Localized Content: Phone numbers, email addresses, and locations consistent with regional norms
    Time & Number Formats: Hindi representations of dates, times, currencies, and service numbers
    Informal Language & Slang: Common Hindi expressions, idioms, and conversational shortcuts found in telecom discussions

    Conversational Flow & Structure

    Conversations follow the natural flow of telecom customer service exchanges, including:

    Dialogue Types:
    Simple service inquiries
    Detailed problem-solving discussions
    Plan explanations and upgrades
    Feedback collection and status updates
    Interaction Stages:
    Initial greetings and verification
    Data or issue collection
    Clarification and troubleshooting
    Resolution and

  7. hindi-speech-recognition-dataset

    • huggingface.co
    Updated Aug 1, 2025
    Cite
    Unidata NLP (2025). hindi-speech-recognition-dataset [Dataset]. https://huggingface.co/datasets/ud-nlp/hindi-speech-recognition-dataset
    Dataset updated
    Aug 1, 2025
    Authors
    Unidata NLP
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) (https://creativecommons.org/licenses/by-nc-nd/4.0/)
    License information was derived automatically

    Description

    Hindi Telephone Dialogues Dataset - 760 Hours

    The dataset comprises 760 hours of high-quality audio recordings from 1,000+ native Hindi speakers, featuring telephone dialogues across diverse topics and domains. With a 95% sentence accuracy rate, it is suited to training and evaluating Hindi speech recognition systems.

      Dataset characteristics:
    

    Characteristic Data

    Description Audio of telephone dialogues in Hindi for training… See the full description on the dataset page: https://huggingface.co/datasets/ud-nlp/hindi-speech-recognition-dataset.

  8. 34 Hours - Hindi(India) Children Real-world Casual Conversation and...

    • nexdata.ai
    Updated Nov 16, 2023
    Cite
    Nexdata (2023). 34 Hours - Hindi(India) Children Real-world Casual Conversation and Monologue speech dataset [Dataset]. https://www.nexdata.ai/datasets/speechrecog/1377
    Dataset updated
    Nov 16, 2023
    Dataset authored and provided by
    Nexdata
    Area covered
    India
    Variables measured
    Age, Format, Country, Accuracy, Language, Content category, Language(Region) Code, Recording environment, Features of annotation
    Description

    Hindi (India) Children Real-world Casual Conversation and Monologue speech dataset, covering self-media, conversation, live streams, lectures, variety shows, and other generic domains, and mirroring real-world interactions. Transcribed with text content, speaker ID, gender, age, accent, and other attributes. The dataset was collected from a large, geographically diverse pool of speakers (children aged 12 and younger), which is intended to improve model performance on real, complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, protecting user privacy and legal rights throughout data collection, storage, and usage; our datasets are all GDPR, CCPA, and PIPL compliant.

  9. Call Center Conversation Speech Datasets in Indian Hindi for Customer...

    • data.macgence.com
    mp3
    Updated Jul 21, 2024
    Cite
    Macgence (2024). Call Center Conversation Speech Datasets in Indian Hindi for Customer Service [Dataset]. https://data.macgence.com/dataset/call-center-conversation-speech-datasets-in-indian-hindi-for-customer-service
    Available download formats: mp3
    Dataset updated
    Jul 21, 2024
    Dataset authored and provided by
    Macgence
    License

    https://data.macgence.com/terms-and-conditions

    Time period covered
    2025
    Area covered
    Worldwide, India
    Variables measured
    Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
    Description

    Elevate customer service with Macgence's Indian Hindi call center dataset. Perfect for AI and analytics, delivering accurate and actionable insights!

  10. hindi-speech-recognition-dataset

    • huggingface.co
    Updated Mar 7, 2025
    Cite
    Unidata (2025). hindi-speech-recognition-dataset [Dataset]. https://huggingface.co/datasets/UniDataPro/hindi-speech-recognition-dataset
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 7, 2025
    Authors
    Unidata
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) (https://creativecommons.org/licenses/by-nc-nd/4.0/)
    License information was derived automatically

    Description

    Hindi Speech Dataset for recognition task

    Dataset comprises 760 hours of telephone dialogues in Hindi, collected from 1,000+ native speakers across various topics and domains. This dataset boasts an impressive 95% sentence accuracy rate, making it a valuable resource for advancing speech recognition technology. By utilizing this dataset, researchers and developers can advance their understanding and capabilities in automatic speech recognition (ASR) systems, transcribing audio, and… See the full description on the dataset page: https://huggingface.co/datasets/UniDataPro/hindi-speech-recognition-dataset.

  11. General conversation speech datasets in Hindi for Collaboration

    • data.macgence.com
    mp3
    Updated May 21, 2025
    Cite
    Macgence (2025). General conversation speech datasets in Hindi for Collaboration [Dataset]. https://data.macgence.com/dataset/general-conversation-speech-datasets-in-hindi-for-collaboration
    Available download formats: mp3
    Dataset updated
    May 21, 2025
    Dataset authored and provided by
    Macgence
    License

    https://data.macgence.com/terms-and-conditions

    Time period covered
    2025
    Area covered
    Worldwide
    Variables measured
    Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
    Description

    Explore Hindi speech datasets for collaboration, ideal for AI, NLP, and research projects. Access high-quality conversational data for your needs.

  12. General conversation speech datasets in Hindi for General

    • data.macgence.com
    mp3
    Updated Aug 4, 2024
    + more versions
    Cite
    Macgence (2024). General conversation speech datasets in Hindi for General [Dataset]. https://data.macgence.com/dataset/general-conversation-speech-datasets-in-hindi-for-general
    Available download formats: mp3
    Dataset updated
    Aug 4, 2024
    Dataset authored and provided by
    Macgence
    License

    https://data.macgence.com/terms-and-conditions

    Time period covered
    2025
    Area covered
    Worldwide
    Variables measured
    Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
    Description

    Explore high-quality Hindi general conversation speech datasets for AI, NLP, and speech recognition research. Download and enhance your projects today!

  13. Hindi Agent-Customer Chat Dataset for Delivery & Logistics

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Hindi Agent-Customer Chat Dataset for Delivery & Logistics [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/hindi-delivery-domain-conversation-text-dataset
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The Hindi Delivery & Logistics Chat Dataset is a comprehensive collection of over 12,000 text-based conversations between customers and call center agents. Focused on real-world delivery and logistics interactions, this dataset captures the language, tone, and service patterns essential for developing robust Hindi-language conversational AI, chatbots, and NLP systems across the delivery ecosystem.

    Participant & Chat Overview

    Participants: 200+ native Hindi speakers from the FutureBeeAI Crowd Community
    Conversation Length: 300–700 words per chat
    Turns per Chat: 50–150 dialogue turns between customer and agent
    Chat Types: Inbound (customer-initiated) and outbound (agent-initiated)
    Sentiment Coverage: Includes positive, neutral, and negative interaction outcomes

    Topic Diversity

    The dataset spans a wide range of delivery and logistics scenarios, ensuring strong coverage across customer service and operational workflows.

    Inbound Chats (Customer-Initiated)
    Order tracking and delivery status inquiries
    Complaints about late or missing deliveries
    Undeliverable or incorrect address resolution
    Return process and pickup scheduling
    Order modifications and change requests
    Enquiries about delivery method options
    Outbound Chats (Agent-Initiated)
    Delivery confirmations and dispatch updates
    Subscription renewal or delivery reminders
    Notification of delivery issues or missed attempts
    Out-of-stock or product unavailability alerts
    Satisfaction surveys and service feedback collection
    Address verification for upcoming deliveries

    This topical spread ensures wide applicability in both customer support automation and logistics optimization use cases.

    Language Diversity & Realism

    The conversations reflect the authentic language and interaction style of Hindi-speaking customers and delivery agents, incorporating:

    Naming Patterns: Personal names, business names, and logistics company references
    Localized Details: Hindi-format emails, phone numbers, regional addresses, and delivery zones
    Temporal and Numeric Expressions: Dates, delivery windows, prices, and tracking IDs in Hindi formats
    Slang and Informal Speech: Everyday expressions and delivery-specific idioms used across Hindi dialects

    This linguistic realism enables the development of context-aware and naturally responsive AI systems.

    Conversational Structure & Flow

    The dataset captures a diverse range of interaction types and delivery workflows:

    Dialogue Types:
    Quick status checks and confirmations
    Multi-turn issue resolution
    Process walkthroughs and guidance
    Feedback and escalation handling
    Common Flow Elements:
    Greetings and caller verification
    Request or complaint initiation

  14. Vadivelu Comedy Hindi Dataset

    • sn.shaip.com
    Updated Sep 19, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Shaip (2024). Vadivelu Comedy Hindi Dataset [Dataset]. https://sn.shaip.com/offerings/speech-data-catalog/hindi-dataset/
    Dataset updated
    Sep 19, 2024
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Hindi Dataset: High-Quality Hindi TTS, General Conversation, and Podcast Dataset for AI & ASR Models.

  15. General conversation speech datasets in Hindi for Power house

    • data.macgence.com
    mp3
    Updated May 12, 2025
    + more versions
    Cite
    Macgence (2025). General conversation speech datasets in Hindi for Power house [Dataset]. https://data.macgence.com/dataset/general-conversation-speech-datasets-in-hindi-for-power-house
    Available download formats: mp3
    Dataset updated
    May 12, 2025
    Dataset authored and provided by
    Macgence
    License

    https://data.macgence.com/terms-and-conditions

    Time period covered
    2025
    Area covered
    Worldwide
    Variables measured
    Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
    Description

    Explore high-quality Hindi speech datasets for Power House. Ideal for conversational AI, NLP, and speech recognition applications. Download now!

  16. Synthetic-Hinglish-Finetuning-Dataset

    • huggingface.co
    Updated May 4, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Prakhar Bhartiya (2025). Synthetic-Hinglish-Finetuning-Dataset [Dataset]. https://huggingface.co/datasets/prakharb01/Synthetic-Hinglish-Finetuning-Dataset
    Explore at:
    Dataset updated
    May 4, 2025
    Authors
    Prakhar Bhartiya
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Hinglish Conversations Dataset

      Overview
    

    This dataset contains synthetically generated conversational dialogues in Hinglish (a blend of Hindi and English). The conversations revolve around typical college life, cultural festivities, daily routines, and general discussions, designed to be relatable and engaging.

      Dataset Details
    

    Language: Hinglish (Hindi + English)
    Domain: College life, daily interactions, cultural events, and general discussions
    Size: 3576… See the full description on the dataset page: https://huggingface.co/datasets/prakharb01/Synthetic-Hinglish-Finetuning-Dataset.

  17. Isethi yedatha yesi-Hindi

    • zu.shaip.com
    Updated Aug 6, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Shaip (2024). Isethi yedatha yesi-Hindi [Dataset]. https://zu.shaip.com/offerings/speech-data-catalog/hindi-dataset/
    Explore at:
    Dataset updated
    Aug 6, 2024
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Hindi Dataset: High-Quality Hindi TTS, General Conversation, and Podcast Dataset for AI & ASR Models.

  18. F

    Hindi Agent-Customer Chat Dataset for Retail & E-Commerce

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Hindi Agent-Customer Chat Dataset for Retail & E-Commerce [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/hindi-retail-domain-conversation-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The Hindi Retail & E-Commerce Chat Dataset is a large-scale, high-quality collection of over 12,000 chat conversations between customers and call center agents, focused exclusively on Retail and E-Commerce domains. Designed to reflect real-world service interactions, this dataset supports the development of robust conversational AI and NLP models tailored for Hindi-speaking audiences.

    Participant & Chat Overview

    Contributors: 200 native Hindi speakers from the FutureBeeAI Crowd Community
    Chat Length: 300–700 words per conversation
    Turn Count: 50–150 dialogue turns across both participants
    Chat Types: Inbound and outbound
    Sentiment Coverage: Positive, neutral, and negative interaction outcomes

    Topic Diversity

    This dataset spans a wide range of Retail and E-Commerce conversation types:

    Inbound Chats (Customer-Initiated):
      Product inquiries
      Return or exchange requests
      Order cancellations
      Refunds and payment issues
      Membership or subscription queries
      Shipping, delivery, and more

    Outbound Chats (Agent-Initiated):
      Order confirmation and verification
      Cross-selling and upselling
      Loyalty program promotions
      Account updates
      Special offers and discounts
      Customer feedback and verification

    This diversity enables training of models that handle varied intents, scenarios, and outcomes within customer service workflows.

    Language Nuance & Realism

    The dataset is rich in linguistic diversity and mirrors real conversational tone and structure used in Hindi-speaking regions:

    Personal & Brand Names: Culturally accurate naming conventions
    Local Elements: Realistic addresses, phone numbers, emails, currency references, and time/date formats
    Slang & Idioms: Local expressions, informal phrases, and customer service jargon
    Cultural Specificity: Region-aware vocabulary and tone

    This linguistic authenticity ensures the development of culturally fluent AI models for Hindi Retail & E-Commerce use cases.

    Conversational Structure & Flow

    The conversations reflect natural dialogue dynamics and are organized into various types of interaction styles:

    Simple inquiries
    Detailed problem-solving discussions
    Transactional exchanges
    Follow-ups and status updates
    Advisory and assistance sessions

    Each conversation includes common dialogue stages such as:

    Greetings
    Customer authentication
    Information gathering

  19. h

    Hinglish-Everyday-Conversations-1M

    • huggingface.co
    Updated Jan 13, 2025
    Cite
    Abhishek Khatri (2025). Hinglish-Everyday-Conversations-1M [Dataset]. https://huggingface.co/datasets/Abhishekcr448/Hinglish-Everyday-Conversations-1M
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 13, 2025
    Authors
    Abhishek Khatri
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset Card for Hinglish Everyday Conversations Dataset

    A synthetically created Hinglish dataset with two columns, in which every row represents a unique conversation between two people in Hinglish about everyday life topics.

      Use Model
    

    Access the model made using this dataset: Tiny-Hinglish-Chat-21M. For more information about this model, its training process, or related resources, see the GitHub repository Tiny-Hinglish-Chat-21M-Scripts.

      Dataset Details… See the full description on the dataset page: https://huggingface.co/datasets/Abhishekcr448/Hinglish-Everyday-Conversations-1M.
    
  20. h

    hindi-end-of-utterance-detection

    • huggingface.co
    Updated Jul 16, 2025
    Cite
    Yash Soni (2025). hindi-end-of-utterance-detection [Dataset]. https://huggingface.co/datasets/yashsoni78/hindi-end-of-utterance-detection
    Explore at:
    Dataset updated
    Jul 16, 2025
    Authors
    Yash Soni
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Hindi Conversational End-of-Utterance (EOU) Dataset

    A high-quality, balanced dataset of 1000 Hindi conversational phrases labeled for end-of-utterance detection. This dataset is designed for training models to detect whether a speaker has finished their turn in a dialogue.

      Dataset Summary
    

    This dataset contains short conversational phrases in Hindi, each labeled as either:

    1 (EOU): A complete utterance or turn (e.g., a complete question, answer, command, or statement). 0… See the full description on the dataset page: https://huggingface.co/datasets/yashsoni78/hindi-end-of-utterance-detection.

Cite
FutureBee AI (2022). Hindi Human-Human Chat Dataset for Conversational AI & NLP [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/hindi-general-domain-conversation-text-dataset

Hindi Human-Human Chat Dataset for Conversational AI & NLP

Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License

https://www.futurebeeai.com/policies/ai-data-license-agreement

Dataset funded by
FutureBeeAI
Description

Introduction

The Hindi General Domain Chat Dataset is a high-quality, text-based dataset designed to train and evaluate conversational AI, NLP models, and smart assistants in real-world Hindi usage. Collected through FutureBeeAI’s trusted crowd community, this dataset reflects natural, native-level Hindi conversations covering a broad spectrum of everyday topics.

Conversational Text Data

This dataset includes over 15,000 chat transcripts, each featuring free-flowing dialogue between two native Hindi speakers. The conversations are spontaneous, context-rich, and mimic informal, real-life texting behavior.

Words per Chat: 300–700
Turns per Chat: Up to 50 dialogue turns
Contributors: 200 native Hindi speakers from the FutureBeeAI Crowd Community
Format: TXT, DOCS, JSON or CSV (customizable)
Structure: Each record contains the full chat, topic tag, and metadata block

Diversity and Domain Coverage

Conversations span a wide variety of general-domain topics to ensure comprehensive model exposure:

Music, books, and movies
Health and wellness
Children and parenting
Family life and relationships
Food and cooking
Education and studying
Festivals and traditions
Environment and daily life
Internet and tech usage
Childhood memories and casual chatting

This diversity ensures the dataset is useful across multiple NLP and language understanding applications.

Linguistic Authenticity

Chats reflect informal, native-level Hindi usage with:

Colloquial expressions and local dialect influence
Domain-relevant terminology
Language-specific grammar, phrasing, and sentence flow
Inclusion of realistic details such as names, phone numbers, email addresses, locations, dates, times, local currencies, and culturally grounded references
Representation of different writing styles and input quirks to ensure training data realism

Metadata

Every chat instance is accompanied by structured metadata, which includes:

Participant Age
Gender
Country/Region
Chat Domain
Chat Topic
Dialect

This metadata supports model filtering, demographic-specific evaluation, and more controlled fine-tuning workflows.
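
As a rough illustration of metadata-based filtering, the sketch below selects chats by exact match on metadata fields. The record layout, the field names (age, gender, region, domain, topic, dialect), and the sample values are all assumptions made for this example; the actual schema delivered with the dataset may differ.

```python
from dataclasses import dataclass


@dataclass
class ChatRecord:
    """Hypothetical record layout; field names are assumed, not the vendor's schema."""
    chat: str
    age: int
    gender: str
    region: str
    domain: str
    topic: str
    dialect: str


def filter_records(records, **criteria):
    """Return the records whose metadata equals every given criterion."""
    return [
        r for r in records
        if all(getattr(r, field) == value for field, value in criteria.items())
    ]


# Made-up placeholder records, used only to exercise the filter.
records = [
    ChatRecord("...", 24, "female", "India", "General", "Food and cooking", "Standard"),
    ChatRecord("...", 31, "male", "India", "General", "Food and cooking", "Standard"),
    ChatRecord("...", 45, "female", "India", "General", "Health and wellness", "Standard"),
]

# Select one demographic slice for evaluation or fine-tuning.
subset = filter_records(records, topic="Food and cooking", gender="female")
print(len(subset))
```

Exact-match filtering keeps the sketch short; range conditions on fields such as age would need a predicate-based variant.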

Data Quality Assurance

All chat records pass through a rigorous QA process to maintain consistency and accuracy:

Manual review for content completeness
Format checks for chat turns and metadata
Linguistic verification by native speakers
Removal of inappropriate or unusable samples

This ensures a clean, reliable dataset ready for high-performance AI model training.

Applications

This dataset is ideal for training and evaluating a wide range of text-based AI systems:

Conversational AI / Chatbots
Smart assistants and voicebots
