48 datasets found

Hindi Human-Human Chat Dataset for Conversational AI & NLP
futurebeeai.com
wav
Updated Aug 1, 2022
+ more versions
Cite
FutureBee AI (2022). Hindi Human-Human Chat Dataset for Conversational AI & NLP [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/hindi-general-domain-conversation-text-dataset
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The Hindi General Domain Chat Dataset is a high-quality, text-based dataset designed to train and evaluate conversational AI, NLP models, and smart assistants in real-world Hindi usage. Collected through FutureBeeAI’s trusted crowd community, this dataset reflects natural, native-level Hindi conversations covering a broad spectrum of everyday topics.
Conversational Text Data
This dataset includes over 15,000 chat transcripts, each featuring free-flowing dialogue between two native Hindi speakers. The conversations are spontaneous, context-rich, and mimic informal, real-life texting behavior.
•Words per Chat: 300–700
•Turns per Chat: Up to 50 dialogue turns
•Contributors: 200 native Hindi speakers from the FutureBeeAI Crowd Community
•Format: TXT, DOCS, JSON or CSV (customizable)
•Structure: Each record contains the full chat, topic tag, and metadata block
Diversity and Domain Coverage
Conversations span a wide variety of general-domain topics to ensure comprehensive model exposure:
•Music, books, and movies
•Health and wellness
•Children and parenting
•Family life and relationships
•Food and cooking
•Education and studying
•Festivals and traditions
•Environment and daily life
•Internet and tech usage
•Childhood memories and casual chatting
This diversity ensures the dataset is useful across multiple NLP and language understanding applications.
Linguistic Authenticity
Chats reflect informal, native-level Hindi usage with:
•Colloquial expressions and local dialect influence
•Domain-relevant terminology
•Language-specific grammar, phrasing, and sentence flow
•Inclusion of realistic details such as names, phone numbers, email addresses, locations, dates, times, local currencies, and culturally grounded references
•Representation of different writing styles and input quirks to ensure training data realism
Metadata
Every chat instance is accompanied by structured metadata, which includes:
•Participant Age
•Gender
•Country/Region
•Chat Domain
•Chat Topic
•Dialect
This metadata supports model filtering, demographic-specific evaluation, and more controlled fine-tuning workflows.
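As a rough illustration of the metadata-based filtering described above, the following Python sketch selects chats by topic and participant age. The record layout and field names (`chat_id`, `topic`, `age`, `gender`, `dialect`, `text`) are assumptions made for this example; the listing does not publish the actual schema, so the delivered files may be organized differently.

```python
from dataclasses import dataclass


@dataclass
class ChatRecord:
    # Hypothetical fields; the real metadata block may use other names.
    chat_id: str
    topic: str
    age: int
    gender: str
    dialect: str
    text: str


def filter_records(records, *, min_age=None, max_age=None, gender=None, topic=None):
    """Return the records that satisfy every criterion that is not None."""
    selected = []
    for record in records:
        if min_age is not None and record.age < min_age:
            continue
        if max_age is not None and record.age > max_age:
            continue
        if gender is not None and record.gender != gender:
            continue
        if topic is not None and record.topic != topic:
            continue
        selected.append(record)
    return selected


# Placeholder records standing in for parsed dataset entries.
records = [
    ChatRecord("c1", "Food and cooking", 24, "female", "placeholder", "..."),
    ChatRecord("c2", "Health and wellness", 41, "male", "placeholder", "..."),
    ChatRecord("c3", "Food and cooking", 35, "male", "placeholder", "..."),
]

subset = filter_records(records, topic="Food and cooking", max_age=30)
print([record.chat_id for record in subset])
```

Keeping each criterion optional lets the same helper serve both demographic-specific evaluation (for example, one age band at a time) and topic-restricted fine-tuning subsets.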
Data Quality Assurance
All chat records pass through a rigorous QA process to maintain consistency and accuracy:
•Manual review for content completeness
•Format checks for chat turns and metadata
•Linguistic verification by native speakers
•Removal of inappropriate or unusable samples
This ensures a clean, reliable dataset ready for high-performance AI model training.
Applications
This dataset is ideal for training and evaluating a wide range of text-based AI systems:
•Conversational AI / Chatbots
•Smart assistants and voicebots
797 Hours - Hindi(India) Spontaneous Dialogue Smartphone speech dataset
m.nexdata.ai
nexdata.ai
Updated Jun 1, 2025
+ more versions
Cite
Nexdata (2025). 797 Hours - Hindi(India) Spontaneous Dialogue Smartphone speech dataset [Dataset]. https://m.nexdata.ai/datasets/speechrecog/1156?source=Huggingface
Dataset updated
Jun 1, 2025
Dataset authored and provided by
Nexdata
Area covered
India
Variables measured
Format, Country, Speaker, Language, Accuracy Rate, Content category, Recording device, Recording condition, Language(Region) Code, Features of annotation
Description
Hindi (India) spontaneous dialogue smartphone speech dataset, collected from dialogues on given topics covering 20+ domains. Transcribed with text content, speaker ID, gender, age, and other attributes. The dataset was collected from a large, geographically diverse pool of 1,022 native speakers, which is intended to improve model performance on real-world, complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, protecting user privacy and legal rights throughout data collection, storage, and usage; all our datasets comply with GDPR, CCPA, and PIPL.
Hindi General Conversation Speech Dataset for ASR
futurebeeai.com
wav
Updated Aug 1, 2022
+ more versions
Cite
FutureBee AI (2022). Hindi General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-hindi-india
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
Welcome to the Hindi General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Hindi speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Hindi communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Hindi speech models that understand and respond to authentic Indian accents and dialects.
Speech Data
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Hindi. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
•Participant Diversity:
  •Speakers: 60 verified native Hindi speakers from FutureBeeAI’s contributor community.
  •Regions: Representing various provinces of India to ensure dialectal diversity and demographic balance.
  •Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
•Recording Details:
  •Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
  •Duration: Each conversation ranges from 15 to 60 minutes.
  •Audio Format: Stereo WAV files, 16-bit depth, recorded at 16 kHz sample rate.
  •Environment: Quiet, echo-free settings with no background noise.
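The stated audio format (stereo WAV, 16-bit, 16 kHz) can be checked on receipt with Python's standard `wave` module. The sketch below is only an illustration: it builds one second of silent audio in memory so that it runs without any dataset files, and the helper names `wav_properties` and `matches_spec` are made up for this example.

```python
import io
import wave

# Values taken from the "Audio Format" line of the listing.
EXPECTED = {"channels": 2, "sample_width_bytes": 2, "sample_rate_hz": 16000}


def wav_properties(fileobj):
    """Read the header fields of a WAV file (path or binary file object)."""
    with wave.open(fileobj, "rb") as wf:
        return {
            "channels": wf.getnchannels(),
            "sample_width_bytes": wf.getsampwidth(),  # 2 bytes = 16-bit
            "sample_rate_hz": wf.getframerate(),
            "duration_s": wf.getnframes() / wf.getframerate(),
        }


def matches_spec(props):
    """True if channels, bit depth, and sample rate all match EXPECTED."""
    return all(props[key] == value for key, value in EXPECTED.items())


# Stand-in for a real recording: one second of silent stereo 16-bit audio.
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(2)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    wf.writeframes(b"\x00" * (16000 * 2 * 2))  # frames * channels * bytes
buf.seek(0)

props = wav_properties(buf)
print(props, matches_spec(props))
```

For real files, passing a file path to `wav_properties` instead of the in-memory buffer should work the same way, since `wave.open` accepts either.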

Topic Diversity
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
•Sample Topics Include:
•Family & Relationships
•Food & Recipes
•Education & Career
•Healthcare Discussions
•Social Issues
•Technology & Gadgets
•Travel & Local Culture
•Shopping & Marketplace Experiences, and many more.
Transcription
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
•Transcription Highlights:
•Speaker-segmented dialogues
•Time-coded utterances
•Non-speech elements (pauses, laughter, etc.)
•High transcription accuracy, achieved through double QA pass, average WER < 5%
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
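For readers unfamiliar with the WER figure quoted above, the following Python sketch computes word error rate in its usual form: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The listing does not say how its own figure was measured, so this is the standard definition rather than the provider's procedure, and the sample sentences are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


reference = "मैं कल दिल्ली जाऊँगा"
hypothesis = "मैं कल मुंबई जाऊँगा"  # one substituted word out of four
print(word_error_rate(reference, hypothesis))
```

Whitespace tokenization is a simplification; in practice, punctuation and Unicode normalization choices for Devanagari text can shift the result noticeably.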
Metadata
The dataset comes with granular metadata for both speakers and recordings:
•Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
•Recording Metadata: Topic, duration, audio format, device type, and sample rate.
Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
Usage and Applications
This dataset is a versatile resource for multiple Hindi speech and language AI applications:
•ASR Development: Train accurate speech-to-text systems for Hindi.
•Voice Assistants: Build smart assistants capable of understanding natural Indian conversations.
Hindi Dataset
shaip.com
Updated Mar 22, 2023
+ more versions
Cite
Shaip (2023). Hindi Dataset [Dataset]. https://www.shaip.com/offerings/speech-data-catalog/hindi-dataset/
Dataset updated
Mar 22, 2023
Dataset authored and provided by
Shaip
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Hindi Dataset (हिंदी डेटासेट): High-Quality Hindi TTS, General Conversation, and Podcast Dataset for AI & ASR Models. Sections: General Conversation, Podcast Data, TTS…
Hindi Agent-Customer Chat Dataset for Healthcare Domain
futurebeeai.com
wav
Updated Aug 1, 2022
+ more versions
Cite
FutureBee AI (2022). Hindi Agent-Customer Chat Dataset for Healthcare Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/hindi-healthcare-domain-conversation-text-dataset
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The Hindi Healthcare Chat Dataset is a rich collection of over 12,000 text-based conversations between customers and call center agents, focused on real-world healthcare interactions. Designed to reflect authentic language use and domain-specific dialogue patterns, this dataset supports the development of conversational AI, chatbots, and NLP models tailored for healthcare applications in Hindi-speaking regions.
Participant & Chat Overview
•Participants: 200+ native Hindi speakers from the FutureBeeAI Crowd Community
•Conversation Length: 300–700 words per chat
•Turns per Chat: 50–150 dialogue turns across both participants
•Chat Types: Inbound and outbound
•Sentiment Coverage: Positive, neutral, and negative outcomes included
Topic Diversity
The dataset captures a wide spectrum of healthcare-related chat scenarios, ensuring comprehensive coverage for training robust AI systems:
•Inbound Chats (Customer-Initiated): Appointment scheduling, new patient registration, surgery and treatment consultations, diet and lifestyle discussions, insurance claim inquiries, lab result follow-ups
•Outbound Chats (Agent-Initiated): Appointment reminders and confirmations, health and wellness program offers, test result notifications, preventive care and vaccination reminders, subscription renewals, risk assessment and eligibility follow-ups
This variety helps simulate realistic healthcare support workflows and patient-agent dynamics.
Language Diversity & Realism
This dataset reflects the natural flow of Hindi healthcare communication and includes:
•Authentic Naming Patterns: Hindi personal names, clinic names, and brands
•Localized Contact Elements: Addresses, emails, phone numbers, and clinic locations in regional Hindi formats
•Time & Currency References: Use of dates, times, numeric expressions, and currency units aligned with Hindi-speaking regions
•Colloquial & Medical Expressions: Local slang, informal speech, and common healthcare-related terminology
These elements ensure the dataset is contextually relevant and linguistically rich for real-world use cases.
Conversational Flow & Structure
Conversations range from simple inquiries to complex advisory sessions, including:
•General inquiries
•Detailed problem-solving
•Routine status updates
•Treatment recommendations
•Support and feedback interactions
Each conversation typically includes these structural components:
•Greetings and verification
•Information gathering
•Problem definition
•Solution delivery
•Closing messages
•Follow-up and feedback (where applicable)
This structured flow mirrors actual healthcare support conversations and is ideal for training advanced dialogue systems.
Data Format & Structure
Available in JSON, CSV, and TXT formats, each conversation includes:
•Full message history with clear speaker labels
•Participant identifiers
•Metadata (e.g., topic tags, region, sentiment)
•Compatibility with common NLP and ML pipelines
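To make the JSON layout described above more concrete, here is a small Python sketch that parses one conversation and summarizes its turns and word count. The key names (`conversation_id`, `metadata`, `messages`, `speaker`, `text`) and the sample content are assumptions for illustration only; the listing names the components of each record but does not publish the exact schema.

```python
import json
from collections import Counter

# Hypothetical record shaped after the components listed above:
# message history with speaker labels, identifiers, and metadata.
SAMPLE = """
{
  "conversation_id": "demo-001",
  "metadata": {"topic": "appointment scheduling", "region": "IN", "sentiment": "positive"},
  "messages": [
    {"speaker": "customer", "text": "नमस्ते, मुझे अपॉइंटमेंट बुक करना है।"},
    {"speaker": "agent", "text": "जी ज़रूर, कृपया अपना नाम बताइए।"},
    {"speaker": "customer", "text": "मेरा नाम रीना है।"}
  ]
}
"""


def summarize(conversation: dict) -> dict:
    """Count dialogue turns, turns per speaker, and whitespace-separated words."""
    messages = conversation["messages"]
    return {
        "turns": len(messages),
        "turns_by_speaker": dict(Counter(m["speaker"] for m in messages)),
        "words": sum(len(m["text"].split()) for m in messages),
    }


conversation = json.loads(SAMPLE)
summary = summarize(conversation)
print(conversation["metadata"]["topic"], summary)
```

A summary like this could be compared against the advertised ranges (300–700 words and 50–150 turns per chat) as a quick sanity check on delivered files.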
Applications
Hindi Agent-Customer Chat Dataset for Telecom
futurebeeai.com
wav
Updated Aug 1, 2022
+ more versions
Cite
FutureBee AI (2022). Hindi Agent-Customer Chat Dataset for Telecom [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/hindi-telecom-domain-conversation-text-dataset
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The Hindi Telecom Chat Dataset is a comprehensive collection of over 12,000 text-based conversations between telecom customers and call center agents. This dataset captures real-world service interactions and domain-specific language in Hindi, enabling the development of intelligent conversational AI and NLP systems tailored for the telecommunications sector.
Participant & Chat Overview
•Participants: 200+ native Hindi speakers from the FutureBeeAI Crowd Community
•Conversation Length: 300–700 words per chat
•Turns per Chat: 50–150 dialogue turns across both participants
•Chat Types: Inbound and outbound
•Sentiment Coverage: A mix of positive, neutral, and negative interactions
Topic Diversity
This dataset spans a wide range of telecom customer service scenarios:
•Inbound Chats (Customer-Initiated)
•Phone number porting
•Network connectivity issues
•Billing inquiries and adjustments
•Technical support requests
•Service activations and upgrades
•International roaming inquiries
•Refunds and complaint resolution
•Emergency service access
•Outbound Chats (Agent-Initiated)
•Welcome and onboarding calls
•Payment reminders and due alerts
•Customer satisfaction surveys
•Technical issue follow-ups
•Usage reviews and service feedback
•Promotions and service offers
Language Nuance & Realism
The conversations reflect real-life telecom interactions in Hindi, incorporating:
•Naming Patterns: Realistic Hindi personal, business, and telecom brand names
•Localized Content: Phone numbers, email addresses, and locations consistent with regional norms
•Time & Number Formats: Hindi representations of dates, times, currencies, and service numbers
•Informal Language & Slang: Common Hindi expressions, idioms, and conversational shortcuts found in telecom discussions
Conversational Flow & Structure
Conversations follow the natural flow of telecom customer service exchanges, including:
•Dialogue Types:
•Simple service inquiries
•Detailed problem-solving discussions
•Plan explanations and upgrades
•Feedback collection and status updates
•Interaction Stages:
•Initial greetings and verification
•Data or issue collection
•Clarification and troubleshooting
•Resolution and
hindi-speech-recognition-dataset
huggingface.co
Updated Aug 1, 2025
Cite
Unidata NLP (2025). hindi-speech-recognition-dataset [Dataset]. https://huggingface.co/datasets/ud-nlp/hindi-speech-recognition-dataset
Dataset updated
Aug 1, 2025
Authors
Unidata NLP
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Description
Hindi Telephone Dialogues Dataset - 760 Hours

Dataset comprises 760 hours of high-quality audio recordings from 1,000+ native Hindi speakers, featuring telephone dialogues across diverse topics and domains. With a 95% sentence accuracy rate, this essential dataset is ideal for training and evaluating Hindi speech recognition systems.

Dataset characteristics:

Characteristic Data

Description Audio of telephone dialogues in Hindi for training… See the full description on the dataset page: https://huggingface.co/datasets/ud-nlp/hindi-speech-recognition-dataset.
34 Hours - Hindi(India) Children Real-world Casual Conversation and...
nexdata.ai
Updated Nov 16, 2023
Cite
Nexdata (2023). 34 Hours - Hindi(India) Children Real-world Casual Conversation and Monologue speech dataset [Dataset]. https://www.nexdata.ai/datasets/speechrecog/1377
Dataset updated
Nov 16, 2023
Dataset authored and provided by
Nexdata
Area covered
India
Variables measured
Age, Format, Country, Accuracy, Language, Content category, Language(Region) Code, Recording environment, Features of annotation
Description
Hindi (India) children's real-world casual conversation and monologue speech dataset. It covers self-media, conversation, live streams, lectures, variety shows, and other generic domains, mirroring real-world interactions. Transcribed with text content, speaker ID, gender, age, accent, and other attributes. The dataset was collected from a large, geographically diverse pool of speakers (children aged 12 and younger), which is intended to improve model performance on real-world, complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, protecting user privacy and legal rights throughout data collection, storage, and usage; all our datasets comply with GDPR, CCPA, and PIPL.
Call Center Conversation Speech Datasets in Indian Hindi for Customer...
data.macgence.com
mp3
Updated Jul 21, 2024
Cite
Macgence (2024). Call Center Conversation Speech Datasets in Indian Hindi for Customer Service [Dataset]. https://data.macgence.com/dataset/call-center-conversation-speech-datasets-in-indian-hindi-for-customer-service
Available download formats: mp3
Dataset updated
Jul 21, 2024
Dataset authored and provided by
Macgence
License
https://data.macgence.com/terms-and-conditions
Time period covered
2025
Area covered
Worldwide, India
Variables measured
Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
Description
Elevate customer service with Macgence's Indian Hindi call center dataset. Perfect for AI and analytics, delivering accurate and actionable insights!
hindi-speech-recognition-dataset
huggingface.co
Updated Mar 7, 2025
Cite
Unidata (2025). hindi-speech-recognition-dataset [Dataset]. https://huggingface.co/datasets/UniDataPro/hindi-speech-recognition-dataset
Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Mar 7, 2025
Authors
Unidata
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Description
Hindi Speech Dataset for recognition task

Dataset comprises 760 hours of telephone dialogues in Hindi, collected from 1,000+ native speakers across various topics and domains. This dataset boasts an impressive 95% sentence accuracy rate, making it a valuable resource for advancing speech recognition technology. By utilizing this dataset, researchers and developers can advance their understanding and capabilities in automatic speech recognition (ASR) systems, transcribing audio, and… See the full description on the dataset page: https://huggingface.co/datasets/UniDataPro/hindi-speech-recognition-dataset.
General conversation speech datasets in Hindi for Collaboration
data.macgence.com
mp3
Updated May 21, 2025
Cite
Macgence (2025). General conversation speech datasets in Hindi for Collaboration [Dataset]. https://data.macgence.com/dataset/general-conversation-speech-datasets-in-hindi-for-collaboration
Available download formats: mp3
Dataset updated
May 21, 2025
Dataset authored and provided by
Macgence
License
https://data.macgence.com/terms-and-conditions
Time period covered
2025
Area covered
Worldwide
Variables measured
Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
Description
Explore Hindi speech datasets for collaboration, ideal for AI, NLP, and research projects. Access high-quality conversational data for your needs.
General conversation speech datasets in Hindi for General
data.macgence.com
mp3
Updated Aug 4, 2024
+ more versions
Cite
Macgence (2024). General conversation speech datasets in Hindi for General [Dataset]. https://data.macgence.com/dataset/general-conversation-speech-datasets-in-hindi-for-general
Available download formats: mp3
Dataset updated
Aug 4, 2024
Dataset authored and provided by
Macgence
License
https://data.macgence.com/terms-and-conditions
Time period covered
2025
Area covered
Worldwide
Variables measured
Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
Description
Explore high-quality Hindi general conversation speech datasets for AI, NLP, and speech recognition research. Download and enhance your projects today!
Hindi Agent-Customer Chat Dataset for Delivery & Logistics
futurebeeai.com
wav
Updated Aug 1, 2022
+ more versions
Cite
FutureBee AI (2022). Hindi Agent-Customer Chat Dataset for Delivery & Logistics [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/hindi-delivery-domain-conversation-text-dataset
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The Hindi Delivery & Logistics Chat Dataset is a comprehensive collection of over 12,000 text-based conversations between customers and call center agents. Focused on real-world delivery and logistics interactions, this dataset captures the language, tone, and service patterns essential for developing robust Hindi-language conversational AI, chatbots, and NLP systems across the delivery ecosystem.
Participant & Chat Overview
•Participants: 200+ native Hindi speakers from the FutureBeeAI Crowd Community
•Conversation Length: 300–700 words per chat
•Turns per Chat: 50–150 dialogue turns between customer and agent
•Chat Types: Inbound (customer-initiated) and outbound (agent-initiated)
•Sentiment Coverage: Includes positive, neutral, and negative interaction outcomes
Topic Diversity
The dataset spans a wide range of delivery and logistics scenarios, ensuring strong coverage across customer service and operational workflows.
•Inbound Chats (Customer-Initiated)
•Order tracking and delivery status inquiries
•Complaints about late or missing deliveries
•Undeliverable or incorrect address resolution
•Return process and pickup scheduling
•Order modifications and change requests
•Enquiries about delivery method options
•Outbound Chats (Agent-Initiated)
•Delivery confirmations and dispatch updates
•Subscription renewal or delivery reminders
•Notification of delivery issues or missed attempts
•Out-of-stock or product unavailability alerts
•Satisfaction surveys and service feedback collection
•Address verification for upcoming deliveries
This topical spread ensures wide applicability in both customer support automation and logistics optimization use cases.
Language Diversity & Realism
The conversations reflect the authentic language and interaction style of Hindi-speaking customers and delivery agents, incorporating:
•Naming Patterns: Personal names, business names, and logistics company references
•Localized Details: Hindi-format emails, phone numbers, regional addresses, and delivery zones
•Temporal and Numeric Expressions: Dates, delivery windows, prices, and tracking IDs in Hindi formats
•Slang and Informal Speech: Everyday expressions and delivery-specific idioms used across Hindi dialects
This linguistic realism enables the development of context-aware and naturally responsive AI systems.
Conversational Structure & Flow
The dataset captures a diverse range of interaction types and delivery workflows:
•Dialogue Types:
•Quick status checks and confirmations
•Multi-turn issue resolution
•Process walkthroughs and guidance
•Feedback and escalation handling
•Common Flow Elements:
•Greetings and caller verification
•Request or complaint initiation
Vadivelu Comedy Hindi Dataset
sn.shaip.com
Updated Sep 19, 2024
Cite
Shaip (2024). Vadivelu Comedy Hindi Dataset [Dataset]. https://sn.shaip.com/offerings/speech-data-catalog/hindi-dataset/
Dataset updated
Sep 19, 2024
Dataset authored and provided by
Shaip
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Hindi Dataset (हिंदी डेटासेट): High-Quality Hindi TTS, General Conversation, and Podcast Dataset for AI & ASR Models. Sections: General Conversation, Podcast Data, TTS...
General conversation speech datasets in Hindi for Power house
data.macgence.com
mp3
Updated May 12, 2025
+ more versions
Cite
Macgence (2025). General conversation speech datasets in Hindi for Power house [Dataset]. https://data.macgence.com/dataset/general-conversation-speech-datasets-in-hindi-for-power-house
Available download formats: mp3
Dataset updated
May 12, 2025
Dataset authored and provided by
Macgence
License
https://data.macgence.com/terms-and-conditions
Time period covered
2025
Area covered
Worldwide
Variables measured
Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
Description
Explore high-quality Hindi speech datasets for Power House. Ideal for conversational AI, NLP, and speech recognition applications. Download now!
Synthetic-Hinglish-Finetuning-Dataset
huggingface.co
Updated May 4, 2025
Cite
Prakhar Bhartiya (2025). Synthetic-Hinglish-Finetuning-Dataset [Dataset]. https://huggingface.co/datasets/prakharb01/Synthetic-Hinglish-Finetuning-Dataset
Dataset updated
May 4, 2025
Authors
Prakhar Bhartiya
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Hinglish Conversations Dataset

Overview

This dataset contains synthetically generated conversational dialogues in Hinglish (a blend of Hindi and English). The conversations revolve around typical college life, cultural festivities, daily routines, and general discussions, designed to be relatable and engaging.

Dataset Details

Language: Hinglish (Hindi + English)
Domain: College life, daily interactions, cultural events, and general discussions
Size: 3576… See the full description on the dataset page: https://huggingface.co/datasets/prakharb01/Synthetic-Hinglish-Finetuning-Dataset.
Isethi yedatha yesi-Hindi
zu.shaip.com
Updated Aug 6, 2024
Cite
Shaip (2024). Isethi yedatha yesi-Hindi [Dataset]. https://zu.shaip.com/offerings/speech-data-catalog/hindi-dataset/
Dataset updated
Aug 6, 2024
Dataset authored and provided by
Shaip
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Hindi Dataset (हिंदी डेटासेट): High-Quality Hindi TTS, General Conversation, and Podcast Dataset for AI & ASR Models. Sections: General Conversation, Podcast Data, TTS…
Hindi Agent-Customer Chat Dataset for Retail & E-Commerce
futurebeeai.com
wav
Updated Aug 1, 2022
+ more versions
Cite
FutureBee AI (2022). Hindi Agent-Customer Chat Dataset for Retail & E-Commerce [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/hindi-retail-domain-conversation-text-dataset
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The Hindi Retail & E-Commerce Chat Dataset is a large-scale, high-quality collection of over 12,000 chat conversations between customers and call center agents, focused exclusively on Retail and E-Commerce domains. Designed to reflect real-world service interactions, this dataset supports the development of robust conversational AI and NLP models tailored for Hindi-speaking audiences.
Participant & Chat Overview
•Contributors: 200 native Hindi speakers from the FutureBeeAI Crowd Community
•Chat Length: 300–700 words per conversation
•Turn Count: 50–150 dialogue turns across both participants
•Chat Types: Inbound and outbound
•Sentiment Coverage: Positive, neutral, and negative interaction outcomes
Topic Diversity
This dataset spans a wide range of Retail and E-Commerce conversation types:
•Inbound Chats (Customer-Initiated)
•Product inquiries
•Return or exchange requests
•Order cancellations
•Refunds and payment issues
•Membership or subscription queries
•Shipping, delivery, and more
•Outbound Chats (Agent-Initiated)
•Order confirmation and verification
•Cross-selling and upselling
•Loyalty program promotions
•Account updates
•Special offers and discounts
•Customer feedback and verification
This diversity enables training of models that handle varied intents, scenarios, and outcomes within customer service workflows.
Language Nuance & Realism
The dataset is rich in linguistic diversity and mirrors real conversational tone and structure used in Hindi-speaking regions:
•Personal & Brand Names: Culturally accurate naming conventions
•Local Elements: Realistic addresses, phone numbers, emails, currency references, and time/date formats
•Slang & Idioms: Local expressions, informal phrases, and customer service jargon
•Cultural Specificity: Region-aware vocabulary and tone
This linguistic authenticity ensures the development of culturally fluent AI models for Hindi Retail & E-Commerce use cases.
Conversational Structure & Flow
The conversations reflect natural dialogue dynamics and are organized into various types of interaction styles:
•Simple inquiries
•Detailed problem-solving discussions
•Transactional exchanges
•Follow-ups and status updates
•Advisory and assistance sessions
Each conversation includes common dialogue stages such as:
•Greetings
•Customer authentication
•Information gathering
<div style="margin-top:10px; margin-bottom: 10px; margin-left: 30px;font-weight: 300; display: flex; gap: 16px;
Hinglish-Everyday-Conversations-1M
huggingface.co
Updated Jan 13, 2025
Cite
Abhishek Khatri (2025). Hinglish-Everyday-Conversations-1M [Dataset]. https://huggingface.co/datasets/Abhishekcr448/Hinglish-Everyday-Conversations-1M
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Jan 13, 2025
Authors
Abhishek Khatri
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Dataset Card for Hinglish Everyday Conversations Dataset

A synthetically created Hinglish dataset with two columns, in which every row represents a unique conversation between two people about everyday-life topics.

Use Model

Access the model made using this dataset: Tiny-Hinglish-Chat-21M. For more information about the model, its training process, or related resources, see the GitHub repository Tiny-Hinglish-Chat-21M-Scripts.

Dataset Details… See the full description on the dataset page: https://huggingface.co/datasets/Abhishekcr448/Hinglish-Everyday-Conversations-1M.
hindi-end-of-utterance-detection
huggingface.co
Updated Jul 16, 2025
Cite
Yash Soni (2025). hindi-end-of-utterance-detection [Dataset]. https://huggingface.co/datasets/yashsoni78/hindi-end-of-utterance-detection
Dataset updated
Jul 16, 2025
Authors
Yash Soni
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
Hindi Conversational End-of-Utterance (EOU) Dataset

A high-quality, balanced dataset of 1000 Hindi conversational phrases labeled for end-of-utterance detection. This dataset is designed for training models to detect whether a speaker has finished their turn in a dialogue.

Dataset Summary

This dataset contains short conversational phrases in Hindi, each labeled as either:

1 (EOU): A complete utterance or turn (e.g., a complete question, answer, command, or statement). 0… See the full description on the dataset page: https://huggingface.co/datasets/yashsoni78/hindi-end-of-utterance-detection.
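
The binary labeling described above can be checked for balance with a few lines of Python. This is only a sketch: it assumes each row is a (phrase, label) pair, which is a hypothetical layout and not the dataset's documented schema, and the rows shown are placeholders rather than real samples. Label 1 marks an end of utterance; the description of label 0 is truncated in this listing, so the code treats it simply as the other class.

```python
from collections import Counter


def label_counts(rows):
    """Count rows per label. Assumes (phrase, label) pairs (hypothetical layout)."""
    return Counter(label for _, label in rows)


def is_balanced(rows, tolerance=0.05):
    """Return True if the share of label 1 is within `tolerance` of 50%."""
    counts = label_counts(rows)
    total = sum(counts.values())
    if total == 0:
        return False
    share_eou = counts.get(1, 0) / total
    return abs(share_eou - 0.5) <= tolerance


# Placeholder rows for illustration only; not taken from the dataset.
rows = [("phrase-1", 1), ("phrase-2", 0), ("phrase-3", 1), ("phrase-4", 0)]

print(label_counts(rows))
print(is_balanced(rows))
```

The tolerance is an arbitrary choice for the sketch; a stricter or looser threshold may suit a given training setup better.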

Hindi Human-Human Chat Dataset for Conversational AI & NLP

Linguistic Authenticity

Chats reflect informal, native-level Hindi usage with:

•Colloquial expressions and local dialect influence

•Domain-relevant terminology

•Language-specific grammar, phrasing, and sentence flow

•Inclusion of realistic details such as names, phone numbers, email addresses, locations, dates, times, local currencies, and culturally grounded references

•Representation of different writing styles and input quirks to ensure training data realism

Metadata

Every chat instance is accompanied by structured metadata, which includes:

•Participant Age

•Gender

•Country/Region

•Chat Domain

•Chat Topic

•Dialect

This metadata supports model filtering, demographic-specific evaluation, and more controlled fine-tuning workflows.
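
As a rough illustration of metadata-based filtering, the sketch below selects chats by topic and participant age. The `ChatRecord` class, its field names, and the sample rows are assumptions made for this example (modeled loosely on the metadata list above); they are not the vendor's actual schema or data, and a real record may carry metadata for both participants.

```python
from dataclasses import dataclass


@dataclass
class ChatRecord:
    # Hypothetical layout: field names mirror the metadata list, not a real schema.
    chat: str
    participant_age: int
    gender: str
    region: str
    domain: str
    topic: str
    dialect: str


def filter_records(records, *, topic=None, dialect=None, min_age=None, max_age=None):
    """Return the records that match every criterion that is not None."""
    selected = []
    for r in records:
        if topic is not None and r.topic != topic:
            continue
        if dialect is not None and r.dialect != dialect:
            continue
        if min_age is not None and r.participant_age < min_age:
            continue
        if max_age is not None and r.participant_age > max_age:
            continue
        selected.append(r)
    return selected


# Placeholder rows for illustration only; not taken from the dataset.
records = [
    ChatRecord("...", 24, "female", "India", "General", "Food and cooking", "dialect-a"),
    ChatRecord("...", 41, "male", "India", "General", "Health and wellness", "dialect-b"),
    ChatRecord("...", 33, "female", "India", "General", "Food and cooking", "dialect-b"),
]

subset = filter_records(records, topic="Food and cooking", min_age=30)
print(len(subset))
```

Keyword-only criteria keep call sites readable when several filters are combined, and the same function can be reused to build demographic-specific evaluation splits.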

Data Quality Assurance

All chat records pass through a rigorous QA process to maintain consistency and accuracy:

•Manual review for content completeness

•Format checks for chat turns and metadata

•Linguistic verification by native speakers

•Removal of inappropriate or unusable samples

This ensures a clean, reliable dataset ready for high-performance AI model training.

Applications

This dataset is ideal for training and evaluating a wide range of text-based AI systems:

•Conversational AI / Chatbots

•Smart assistants and voicebots

