84 datasets found

Hindi Conversation Chat Dataset for Telecom Domain
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Hindi Conversation Chat Dataset for Telecom Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/hindi-telecom-domain-conversation-text-dataset
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The dataset comprises over 12,000 chat conversations, each focusing on a specific Telecom-related topic. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.
•Participant Details: 200+ native Hindi participants from the FutureBeeAI community.
•Word Count & Length: Chats are diverse, averaging 300 to 700 words and 50 to 150 turns across both speakers.
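
The word and turn figures above can be checked against a delivered chat with a few lines of code. The following is a minimal sketch, assuming each chat is available as a list of (speaker, text) pairs and that words are whitespace-separated; the names chat_stats, within_stated_range, and sample_chat are hypothetical and not part of the dataset. The listing gives averages, so a single chat falling outside the range is not necessarily an error.

```python
def chat_stats(turns):
    """Return (turn_count, word_count) for a list of (speaker, text) pairs."""
    # Assumption: whitespace tokenization is a good enough word count for Hindi text.
    word_count = sum(len(text.split()) for _, text in turns)
    return len(turns), word_count


def within_stated_range(turns, word_range=(300, 700), turn_range=(50, 150)):
    """Check one chat against the ranges quoted in the listing (stated as averages)."""
    n_turns, n_words = chat_stats(turns)
    return (word_range[0] <= n_words <= word_range[1]
            and turn_range[0] <= n_turns <= turn_range[1])


# Hypothetical two-turn excerpt, far shorter than a full chat.
sample_chat = [
    ("agent", "नमस्ते, मैं आपकी क्या सहायता कर सकता हूँ?"),
    ("customer", "मेरा नेटवर्क काम नहीं कर रहा है"),
]
sample_stats = chat_stats(sample_chat)
```

Running within_stated_range over every chat and averaging the counts gives a quick sanity check on a delivery before training.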

Topic Diversity
The chat dataset covers a wide range of conversations on Telecom topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Telecom use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.
•Inbound Chats:
  •Phone Number Porting
  •Network Connectivity Issues
  •Billing and Payments
  •Technical Support
  •Service Activation
  •International Roaming Enquiry
  •Refunds and Billing Adjustments
  •Emergency Service Access, and many more
•Outbound Chats:
  •Welcome Calls / Onboarding Process
  •Payment Reminders
  •Customer Surveys
  •Technical Updates
  •Service Usage Reviews
  •Network Complaint Update, and many more
Language Variety & Nuances
The conversations in this dataset capture the diverse language styles and expressions prevalent in Hindi Telecom interactions. This diversity ensures the dataset accurately represents the language used by Hindi speakers in Telecom contexts.
The dataset encompasses a wide array of language elements, including:
•Naming Conventions: Chats include a variety of Hindi personal and business names.
•Localized Details: Real-world addresses, emails, phone numbers, and other contact information in the formats used in different Hindi-speaking regions.
•Temporal and Numeric Expressions: Dates, times, currencies, and numbers in Hindi forms, adhering to local conventions.
•Idiomatic Expressions and Slang: Local slang, idioms, and informal phrases present in Hindi Telecom conversations.

This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Hindi Telecom interactions.
Conversational Flow and Interaction Types
The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Telecom customer-agent interactions.
•Simple Inquiries
•Detailed Discussions
•Transactional Interactions
•Problem-Solving Dialogues
•Advisory Sessions
•Routine Checks and Follow-Ups
Each of these conversations contains various aspects of conversation flow like:
•Greetings
•Authentication
•Information gathering
•Resolution identification
Hindi General Conversation Speech Dataset for ASR
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Hindi General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-hindi-india
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
Welcome to the Hindi General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Hindi speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Hindi communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Hindi speech models that understand and respond to authentic Indian accents and dialects.
Speech Data
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Hindi. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
•Participant Diversity:
  •Speakers: 60 verified native Hindi speakers from FutureBeeAI’s contributor community.
  •Regions: Representing various provinces of India to ensure dialectal diversity and demographic balance.
  •Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
•Recording Details:
  •Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
  •Duration: Each conversation ranges from 15 to 60 minutes.
  •Audio Format: Stereo WAV files, 16-bit depth, recorded at a 16 kHz sample rate.
  •Environment: Quiet, echo-free settings with no background noise.
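
The stated audio format (stereo, 16-bit, 16 kHz WAV) can be verified from a file's header before any training run. The sketch below uses only Python's standard wave module; the helper names wav_properties and matches_expected are hypothetical, and the demo builds one second of silent audio in memory instead of reading a real dataset file.

```python
import io
import wave

# Expected header values, taken from the recording details in the listing.
EXPECTED = {"channels": 2, "sample_width_bytes": 2, "sample_rate_hz": 16000}


def wav_properties(source):
    """Read header fields from a WAV file path or a binary file-like object."""
    with wave.open(source, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),  # 2 bytes = 16-bit
            "sample_rate_hz": wav.getframerate(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }


def matches_expected(props):
    """True when channel count, bit depth, and sample rate match the listing."""
    return all(props[key] == value for key, value in EXPECTED.items())


# Demo: write one second of silent stereo 16-bit audio into memory.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(2)
    wav.setsampwidth(2)
    wav.setframerate(16000)
    wav.writeframes(b"\x00" * (16000 * 2 * 2))  # frames * channels * bytes
buffer.seek(0)
demo_props = wav_properties(buffer)
```

For a real file, pass its path to wav_properties; the duration_s field also makes it easy to confirm the 15 to 60 minute range per conversation.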

Topic Diversity
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
•Sample Topics Include:
  •Family & Relationships
  •Food & Recipes
  •Education & Career
  •Healthcare Discussions
  •Social Issues
  •Technology & Gadgets
  •Travel & Local Culture
  •Shopping & Marketplace Experiences, and many more.
Transcription
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
•Transcription Highlights:
  •Speaker-segmented dialogues
  •Time-coded utterances
  •Non-speech elements (pauses, laughter, etc.)
  •High transcription accuracy, achieved through a double QA pass, with an average WER < 5%
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
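
For readers unfamiliar with the WER figure quoted above, the following sketch shows how word error rate is commonly computed: the word-level edit distance between a reference and a hypothesis transcript, divided by the number of reference words. This is the standard definition, not necessarily the vendor's exact procedure, and the function name word_error_rate and the example sentences are illustrative assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # reference word missing from hypothesis
                current[j - 1] + 1,      # extra word in hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)


# Illustrative pair: the hypothesis drops the last of five reference words.
demo_wer = word_error_rate("मुझे कल दिल्ली जाना है", "मुझे कल दिल्ली जाना")
```

A corpus-level figure is usually obtained by summing the edit distances over all utterances and dividing by the total number of reference words, instead of averaging per-utterance rates.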
Metadata
The dataset comes with granular metadata for both speakers and recordings:
•Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
•Recording Metadata: Topic, duration, audio format, device type, and sample rate.

Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
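
As an illustration of use-case-specific filtering, the sketch below selects recordings by topic and speaker attributes. The record layout, field names (recording_id, topic, speakers, age, gender), and sample values are all assumptions made for the example; the actual metadata schema should be taken from the delivered files.

```python
def speaker_matches(speaker, gender=None, age_range=None):
    """True when one speaker satisfies every filter that was supplied."""
    if gender is not None and speaker["gender"] != gender:
        return False
    if age_range is not None and not (age_range[0] <= speaker["age"] <= age_range[1]):
        return False
    return True


def select_recordings(records, topic=None, gender=None, age_range=None):
    """Return IDs of recordings on the topic with at least one matching speaker."""
    return [
        record["recording_id"]
        for record in records
        if (topic is None or record["topic"] == topic)
        and any(speaker_matches(s, gender, age_range) for s in record["speakers"])
    ]


# Hypothetical metadata for three two-speaker recordings.
recordings = [
    {"recording_id": "r1", "topic": "Food & Recipes",
     "speakers": [{"age": 34, "gender": "female"}, {"age": 52, "gender": "male"}]},
    {"recording_id": "r2", "topic": "Technology & Gadgets",
     "speakers": [{"age": 21, "gender": "male"}, {"age": 19, "gender": "male"}]},
    {"recording_id": "r3", "topic": "Food & Recipes",
     "speakers": [{"age": 67, "gender": "female"}, {"age": 45, "gender": "female"}]},
]
young_female = select_recordings(recordings, gender="female", age_range=(18, 40))
food_topic = select_recordings(recordings, topic="Food & Recipes")
```

Requiring a single speaker to satisfy all speaker filters at once avoids selecting a recording where one participant matches the gender and the other matches the age range.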
Usage and Applications
This dataset is a versatile resource for multiple Hindi speech and language AI applications:
•ASR Development: Train accurate speech-to-text systems for Hindi.
•Voice Assistants: Build smart assistants capable of understanding natural Indian conversations.

Hindi Dataset
ht.shaip.com
shaip.com
Updated Dec 27, 2024
Cite
Shaip (2024). Hindi Dataset [Dataset]. https://ht.shaip.com/offerings/speech-data-catalog/hindi-dataset/
Explore at:
Dataset updated
Dec 27, 2024
Dataset authored and provided by
Shaip
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
Hindi Dataset: high-quality Hindi TTS, general conversation, and podcast datasets for AI and ASR models.
General conversation speech datasets in Hindi for General
data.macgence.com
mp3
Updated Aug 4, 2024
Cite
Macgence (2024). General conversation speech datasets in Hindi for General [Dataset]. https://data.macgence.com/dataset/general-conversation-speech-datasets-in-hindi-for-general
Explore at:
Available download formats: mp3
Dataset updated
Aug 4, 2024
Dataset authored and provided by
Macgence
License
https://data.macgence.com/terms-and-conditions
Time period covered
2025
Area covered
Worldwide
Variables measured
Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
Description
Explore high-quality Hindi general conversation speech datasets for AI, NLP, and speech recognition research. Download and enhance your projects today!
General domain Human-Human conversation chats in Hindi
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). General domain Human-Human conversation chats in Hindi [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/hindi-general-domain-conversation-text-dataset
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
What’s Included
This training dataset comprises more than 10,000 text conversations between two native Hindi speakers in the general domain. The chats cover a variety of daily-life topics, services, and issues, such as music, books, festivals, health, kids, family, environment, study, childhood, cuisine, internet, and movies, which makes the dataset diverse.
These chats contain language-specific words and phrases and follow the native way of talking, which makes them more information-rich for your NLP model. Besides being specific to its topic, each chat contains various attributes, such as people's names, addresses, contact information, email addresses, times, dates, local currency, telephone numbers, and local slang, in various formats to make the text data unbiased.
These chat scripts have between 300 and 700 words and up to 50 turns. 150 people who are part of the FutureBeeAI crowd community contributed to this dataset. You will also receive chat metadata, such as participant age, gender, and country information, along with the chats. Dataset applications include conversational AI, natural language processing (NLP), smart assistants, text recognition, text analytics, and text prediction.
This dataset is being expanded with new chats all the time. We are able to produce text data in a variety of languages to meet your unique requirements. Check out the FutureBeeAI community for a custom collection.
This training dataset's licence belongs to FutureBeeAI!
760 Hours - Hindi(India) Spontaneous Dialogue Telephony speech dataset
m.nexdata.ai
nexdata.ai
Updated Oct 14, 2023
Cite
Nexdata (2023). 760 Hours - Hindi(India) Spontaneous Dialogue Telephony speech dataset [Dataset]. https://m.nexdata.ai/datasets/speechrecog/1206
Explore at:
Dataset updated
Oct 14, 2023
Dataset provided by
Nexdata
nexdata technology inc
Authors
Nexdata
Area covered
India
Variables measured
Format, Country, Speaker, Language, Accuracy rate, Content category, Recording device, Recording condition, Language(Region) Code, Features of annotation
Description
Hindi (India) spontaneous dialogue telephony speech dataset, collected from dialogues based on given topics and covering 20+ domains. Transcribed with text content, speaker ID, gender, age, and other attributes. The dataset was collected from a large and geographically diverse pool of speakers (1,004 native speakers), enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout data collection, storage, and usage. Our datasets are all GDPR, CCPA, and PIPL compliant.
34 Hours - Hindi(India) Children Real-world Casual Conversation and...
nexdata.ai
m.nexdata.ai
Updated May 2, 2024
Cite
Nexdata (2024). 34 Hours - Hindi(India) Children Real-world Casual Conversation and Monologue speech dataset [Dataset]. https://www.nexdata.ai/datasets/speechrecog/1377?source=Github
Explore at:
Dataset updated
May 2, 2024
Dataset authored and provided by
Nexdata
Area covered
India
Variables measured
Age, Format, Country, Accuracy, Language, Content category, Language(Region) Code, Recording environment, Features of annotation
Description
Hindi (India) children's real-world casual conversation and monologue speech dataset. It covers self-media, conversation, live, lecture, variety show, and other generic domains, mirroring real-world interactions. Transcribed with text content, speaker ID, gender, age, accent, and other attributes. The dataset was collected from a large and geographically diverse pool of speakers (children aged 12 and younger), enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout data collection, storage, and usage. Our datasets are all GDPR, CCPA, and PIPL compliant.
Call Center conversation speech datasets in Hindi for Retail
data.macgence.com
mp3
Updated Mar 27, 2024
Cite
Macgence (2024). Call Center conversation speech datasets in Hindi for Retail [Dataset]. https://data.macgence.com/dataset/call-center-conversation-speech-datasets-in-hindi--for-retail
Explore at:
Available download formats: mp3
Dataset updated
Mar 27, 2024
Dataset authored and provided by
Macgence
License
https://data.macgence.com/terms-and-conditions
Time period covered
2025
Area covered
Worldwide
Variables measured
Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
Description
The audio dataset includes call center conversations from the retail domain, featuring Hindi speakers from India, with detailed metadata.
Hindi Conversation Chat Dataset for Healthcare Domain
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Hindi Conversation Chat Dataset for Healthcare Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/hindi-healthcare-domain-conversation-text-dataset
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The dataset comprises over 12,000 chat conversations, each focusing on a specific Healthcare-related topic. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.
•Participant Details: 200+ native Hindi participants from the FutureBeeAI community.
•Word Count & Length: Chats are diverse, averaging 300 to 700 words and 50 to 150 turns across both speakers.

Topic Diversity
The chat dataset covers a wide range of conversations on Healthcare topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Healthcare use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.
•Inbound Chats:
  •Appointment Scheduling
  •New Patient Registration
  •Surgery Consultation
  •Consultation regarding Diet, and many more
•Outbound Chats:
  •Appointment Reminder
  •Health & Wellness Subscription Programs
  •Lab Test Results
  •Health Risk Assessments
  •Preventive Care Reminders, and many more
Language Variety & Nuances
The conversations in this dataset capture the diverse language styles and expressions prevalent in Hindi Healthcare interactions. This diversity ensures the dataset accurately represents the language used by Hindi speakers in Healthcare contexts.
The dataset encompasses a wide array of language elements, including:
•Naming Conventions: Chats include a variety of Hindi personal and business names.
•Localized Details: Real-world addresses, emails, phone numbers, and other contact information in the formats used in different Hindi-speaking regions.
•Temporal and Numeric Expressions: Dates, times, currencies, and numbers in Hindi forms, adhering to local conventions.
•Idiomatic Expressions and Slang: Local slang, idioms, and informal phrases present in Hindi Healthcare conversations.

This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Hindi Healthcare interactions.
Conversational Flow and Interaction Types
The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Healthcare customer-agent interactions.
•Simple Inquiries
•Detailed Discussions
•Transactional Interactions
•Problem-Solving Dialogues
•Advisory Sessions
•Routine Checks and Follow-Ups
Each of these conversations contains various aspects of conversation flow like:
•Greetings
•Authentication
•Information gathering
•Resolution identification
•Solution Delivery
•Closing and Follow-ups
•Feedback, etc
This structured and varied conversational flow enables the creation of advanced NLP models that can effectively manage and respond to a wide range of customer service scenarios.
Data Format and Structure
The dataset is available in JSON, CSV, and TXT formats, with each conversation containing attributes like participant identifiers and chat messages, designed to
indic-instruct-data-v0.1
huggingface.co
Updated Jan 26, 2024
Cite
AI4Bharat (2024). indic-instruct-data-v0.1 [Dataset]. https://huggingface.co/datasets/ai4bharat/indic-instruct-data-v0.1
Explore at:
Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
Dataset updated
Jan 26, 2024
Dataset authored and provided by
AI4Bharat
Description
Indic Instruct Data v0.1

A collection of different instruction datasets spanning English and Hindi languages. The collection consists of:

•Anudesh
•wikiHow
•Flan v2 (67k sample subset)
•Dolly
•Anthropic-HHH (5k sample subset)
•OpenAssistant v1
•LMSYS-Chat (50k sample subset)

We translate the English subset of specific datasets using IndicTrans2 (Gala et al., 2023). The chrF++ scores of the back-translated example and the corresponding example are provided for quality assessment of the… See the full description on the dataset page: https://huggingface.co/datasets/ai4bharat/indic-instruct-data-v0.1.
General conversation speech datasets in Hindi for Collaboration
data.macgence.com
mp3
Updated May 21, 2025
Cite
Macgence (2025). General conversation speech datasets in Hindi for Collaboration [Dataset]. https://data.macgence.com/dataset/general-conversation-speech-datasets-in-hindi-for-collaboration
Explore at:
Available download formats: mp3
Dataset updated
May 21, 2025
Dataset authored and provided by
Macgence
License
https://data.macgence.com/terms-and-conditions
Time period covered
2025
Area covered
Worldwide
Variables measured
Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
Description
Explore Hindi speech datasets for collaboration, ideal for AI, NLP, and research projects. Access high-quality conversational data for your needs.
General conversation speech datasets in Hindi for Power house
data.macgence.com
mp3
Updated May 12, 2025
Cite
Macgence (2025). General conversation speech datasets in Hindi for Power house [Dataset]. https://data.macgence.com/dataset/general-conversation-speech-datasets-in-hindi-for-power-house
Explore at:
Available download formats: mp3
Dataset updated
May 12, 2025
Dataset authored and provided by
Macgence
License
https://data.macgence.com/terms-and-conditions
Time period covered
2025
Area covered
Worldwide
Variables measured
Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
Description
Explore high-quality Hindi speech datasets for Power House. Ideal for conversational AI, NLP, and speech recognition applications. Download now!
Call Center Conversation Speech Datasets in Indian Hindi for Customer...
data.macgence.com
mp3
Updated Jul 21, 2024
Cite
Macgence (2024). Call Center Conversation Speech Datasets in Indian Hindi for Customer Service [Dataset]. https://data.macgence.com/dataset/call-center-conversation-speech-datasets-in-indian-hindi-for-customer-service
Explore at:
Available download formats: mp3
Dataset updated
Jul 21, 2024
Dataset authored and provided by
Macgence
License
https://data.macgence.com/terms-and-conditions
Time period covered
2025
Area covered
Worldwide, India
Variables measured
Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
Description
Elevate customer service with Macgence's Indian Hindi call center dataset. Perfect for AI and analytics, delivering accurate and actionable insights!
english-hindi-colloquial-dataset
huggingface.co
Updated Feb 21, 2025
Cite
deeksha bajpai (2025). english-hindi-colloquial-dataset [Dataset]. https://huggingface.co/datasets/bajpaideeksha/english-hindi-colloquial-dataset
Explore at:
Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
Dataset updated
Feb 21, 2025
Authors
deeksha bajpai
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
A curated dataset of colloquial English phrases and their corresponding Hindi translations. This dataset focuses on informal language, including slang, idioms, and everyday expressions, making it ideal for training models that handle casual conversations.
Dataset Details:
•Size: e.g., 500+ phrase pairs
•Source: Collected from publicly available conversational datasets, social media, and crowdsourced contributions.
•Language Pair: English → Hindi
•Annotations: Each phrase pair is manually verified… See the full description on the dataset page: https://huggingface.co/datasets/bajpaideeksha/english-hindi-colloquial-dataset.
494 Hours - Hindi(India) Real-world Casual Conversation and Monologue speech...
m.nexdata.ai
Updated Nov 11, 2023
Cite
Nexdata (2023). 494 Hours - Hindi(India) Real-world Casual Conversation and Monologue speech dataset [Dataset]. https://m.nexdata.ai/datasets/speechrecog/1269
Explore at:
Dataset updated
Nov 11, 2023
Dataset provided by
nexdata technology inc
Authors
Nexdata
Area covered
World, India
Variables measured
Format, Country, Language, Accuracy Rate, Language(Region) Code, Recording environment, Features of annotation
Description
Hindi (India) real-world casual conversation and monologue speech dataset, mirroring real-world interactions. Transcribed with text content, speaker ID, gender, and other attributes. The dataset was collected from a large and geographically diverse pool of speakers, enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout data collection, storage, and usage. Our datasets are all GDPR, CCPA, and PIPL compliant.
Hindi Conversation Chat Dataset for Delivery & Logistics Domain
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Hindi Conversation Chat Dataset for Delivery & Logistics Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/hindi-delivery-domain-conversation-text-dataset
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The dataset comprises over 12,000 chat conversations, each focusing on a specific Delivery & Logistics-related topic. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.
•Participant Details: 200+ native Hindi participants from the FutureBeeAI community.
•Word Count & Length: Chats are diverse, averaging 300 to 700 words and 50 to 150 turns across both speakers.

Topic Diversity
The chat dataset covers a wide range of conversations on Delivery & Logistics topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Delivery & Logistics use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.
•Inbound Chats:
  •Order Tracking
  •Delivery Complaint
  •Undeliverable Address
  •Delivery Method Selection
  •Return Process Enquiry
  •Order Modification, and many more
•Outbound Chats:
  •Delivery Confirmation
  •Delivery Subscription
  •Incorrect Address
  •Missed Delivery Attempt
  •Delivery Feedback
  •Out-of-Stock Notification
  •Delivery Satisfaction Survey, and many more
Language Variety & Nuances
The conversations in this dataset capture the diverse language styles and expressions prevalent in Hindi Delivery & Logistics interactions. This diversity ensures the dataset accurately represents the language used by Hindi speakers in Delivery & Logistics contexts.
The dataset encompasses a wide array of language elements, including:
•Naming Conventions: Chats include a variety of Hindi personal and business names.
•Localized Details: Real-world addresses, emails, phone numbers, and other contact information in the formats used in different Hindi-speaking regions.
•Temporal and Numeric Expressions: Dates, times, currencies, and numbers in Hindi forms, adhering to local conventions.
•Idiomatic Expressions and Slang: Local slang, idioms, and informal phrases present in Hindi Delivery & Logistics conversations.

This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Hindi Delivery & Logistics interactions.
Conversational Flow and Interaction Types
The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Delivery & Logistics customer-agent interactions.
•Simple Inquiries
•Detailed Discussions
•Transactional Interactions
•Problem-Solving Dialogues
•Advisory Sessions
•Routine Checks and Follow-Ups
Each of these conversations contains various aspects of conversation flow like:
•Greetings
•Authentication
•Information gathering
•Resolution identification
•Solution
gooftagoo
huggingface.co
Updated Mar 17, 2024
Cite
Adithya Kamath (2024). gooftagoo [Dataset]. https://huggingface.co/datasets/adi-kmt/gooftagoo
Explore at:
Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
Dataset updated
Mar 17, 2024
Authors
Adithya Kamath
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
Hindi/Hinglish Conversation Dataset

This repository contains a dataset of conversational text in Hindi and Hinglish (a mix of Hindi and English). The dataset contains multi-turn conversations on multiple topics, usually revolving around daily real-life experiences. A small number of reasoning tasks have also been added (specifically CoT-style reasoning and coding), with about 1k samples from OpenHermes 2.5.

Caution

This dataset was… See the full description on the dataset page: https://huggingface.co/datasets/adi-kmt/gooftagoo.
Synthetic-Hinglish-Finetuning-Dataset
huggingface.co
Cite
Prakhar Bhartiya, Synthetic-Hinglish-Finetuning-Dataset [Dataset]. https://huggingface.co/datasets/prakharb01/Synthetic-Hinglish-Finetuning-Dataset
Explore at:
Authors
Prakhar Bhartiya
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
Hinglish Conversations Dataset

Overview

This dataset contains synthetically generated conversational dialogues in Hinglish (a blend of Hindi and English). The conversations revolve around typical college life, cultural festivities, daily routines, and general discussions, designed to be relatable and engaging.

Dataset Details

•Language: Hinglish (Hindi + English)
•Domain: College life, daily interactions, cultural events, and general discussions
•Size: 3576… See the full description on the dataset page: https://huggingface.co/datasets/prakharb01/Synthetic-Hinglish-Finetuning-Dataset.
hind_encorp
huggingface.co
paperswithcode.com
+1more
Updated Mar 22, 2014
Cite
Pavel Rychlý (2014). hind_encorp [Dataset]. https://huggingface.co/datasets/pary/hind_encorp
Explore at:
Dataset updated
Mar 22, 2014
Authors
Pavel Rychlý
License
Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0) (https://creativecommons.org/licenses/by-nc-sa/3.0/)
License information was derived automatically
Description
HindEnCorp parallel texts (sentence-aligned) come from the following sources: Tides, which contains 50K sentence pairs taken mainly from news articles. This dataset was originally collected for the DARPA-TIDES surprise-language contest in 2002, later refined at IIIT Hyderabad and provided for the NLP Tools Contest at ICON 2008 (Venkatapathy, 2008).

Commentaries by Daniel Pipes contain 322 articles in English written by a journalist Daniel Pipes and translated into Hindi.

EMILLE. This corpus (Baker et al., 2002) consists of three components: monolingual, parallel and annotated corpora. There are fourteen monolingual subcorpora, including both written and (for some languages) spoken data for fourteen South Asian languages. The EMILLE monolingual corpora contain in total 92,799,000 words (including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu). The parallel corpus consists of 200,000 words of text in English and its accompanying translations into Hindi and other languages.

Smaller datasets as collected by Bojar et al. (2010) include the corpus used at ACL 2005 (a subcorpus of EMILLE), a corpus of named entities from Wikipedia (crawled in 2009), and an Agriculture domain parallel corpus. For the current release, we are extending the parallel corpus using these sources: Intercorp (Čermák and Rosen, 2012) is a large multilingual parallel corpus of 32 languages including Hindi. The central language used for alignment is Czech. Intercorp’s core texts amount to 202 million words. These core texts are most suitable for us because their sentence alignment is manually checked and therefore very reliable. They cover predominantly short stories and novels. There are seven Hindi texts in Intercorp. Unfortunately, the English translation is available for only three of them; the other four are aligned only with Czech texts. The Hindi subcorpus of Intercorp contains 118,000 words in Hindi.

TED talks, held in various languages, primarily English, are equipped with transcripts, and these are translated into 102 languages. There are 179 talks for which a Hindi translation is available.

The Indic multi-parallel corpus (Birch et al., 2011; Post et al., 2012) is a corpus of texts from Wikipedia translated from the respective Indian language into English by non-expert translators hired over Mechanical Turk. The quality is thus somewhat mixed in many respects, ranging from typesetting and punctuation through capitalization, spelling, and word choice to sentence structure. A little bit of control could in principle be obtained from the fact that every input sentence was translated four times. We used the 2012 release of the corpus.

Launchpad.net is a software collaboration platform that hosts many open-source projects and also facilitates collaborative localization of the tools. We downloaded all revisions of all the hosted projects and extracted the localization (.po) files.
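
As an illustration of what extracting parallel text from localization (.po) files can involve, the following is a minimal Python sketch, not the HindEnCorp authors' actual pipeline. It assumes the standard gettext layout of `msgid` (source string) and `msgstr` (translation) entries; the function name `parse_po_pairs` and the sample text are hypothetical. The sketch skips the header entry, untranslated entries, and plural forms.

```python
import re

# Matches one double-quoted PO string, allowing backslash escapes inside it.
_QUOTED = re.compile(r'"((?:[^"\\]|\\.)*)"')
_ESCAPES = {"n": "\n", "t": "\t", '"': '"', "\\": "\\"}


def _unescape(s):
    """Turn PO escape sequences such as \\n into the characters they denote."""
    return re.sub(r"\\(.)", lambda m: _ESCAPES.get(m.group(1), m.group(1)), s)


def _quoted(line):
    """Return the unescaped content of the first quoted string on the line."""
    m = _QUOTED.search(line)
    return _unescape(m.group(1)) if m else ""


def parse_po_pairs(text):
    """Collect (source, translation) pairs from the text of a .po file.

    Hypothetical helper: handles plain msgid/msgstr entries with optional
    continuation lines; plural forms and msgctxt are ignored.
    """
    pairs = []
    msgid, msgstr, field = None, None, None

    def flush():
        nonlocal msgid, msgstr
        # An empty msgid is the header; an empty msgstr is untranslated.
        if msgid and msgstr:
            pairs.append((msgid, msgstr))
        msgid, msgstr = None, None

    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments (including obsolete entries)
        if line.startswith("msgid "):
            flush()
            msgid, field = _quoted(line), "msgid"
        elif line.startswith("msgstr "):
            msgstr, field = _quoted(line), "msgstr"
        elif line.startswith('"'):
            # Continuation line: append to whichever field is open.
            if field == "msgid":
                msgid += _quoted(line)
            elif field == "msgstr":
                msgstr += _quoted(line)
        else:
            field = None  # msgctxt, msgid_plural, msgstr[n]: not handled here
    flush()
    return pairs


SAMPLE_PO = '''
msgid ""
msgstr ""
"Content-Type: text/plain; charset=UTF-8\\n"

#: main.c:10
msgid "Open"
msgstr "खोलें"

msgid "Save "
"file"
msgstr "फ़ाइल सहेजें"

msgid "Untranslated"
msgstr ""
'''

for source, target in parse_po_pairs(SAMPLE_PO):
    print(source, "->", target)
```

A real pipeline over many revisions would also need deduplication and filtering of strings that are identical in both languages; those steps are left out here.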

Other smaller datasets. This time, we added Wikipedia entities as crawled in 2013 (including any morphological variants of the named entity that appear on the Hindi variant of the Wikipedia page) and words, word examples and quotes from the Shabdkosh online dictionary.
h
Hinglish-Everyday-Conversations-1M
huggingface.co
Updated Jan 13, 2025
Cite
Abhishek Khatri (2025). Hinglish-Everyday-Conversations-1M [Dataset]. https://huggingface.co/datasets/Abhishekcr448/Hinglish-Everyday-Conversations-1M
Explore at:
Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 13, 2025
Authors
Abhishek Khatri
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Dataset Card for Hinglish Everyday Conversations Dataset

A synthetically created Hinglish dataset with two columns, where every row represents a unique conversation between two people in Hinglish about everyday life topics.

Use Model

Access the model made using this dataset: Tiny-Hinglish-Chat-21M. For more information about this model, its training process, or related resources, you can check the GitHub repository Tiny-Hinglish-Chat-21M-Scripts.

Dataset Details… See the full description on the dataset page: https://huggingface.co/datasets/Abhishekcr448/Hinglish-Everyday-Conversations-1M.


