49 datasets found
  1. Hindi Conversation Chat Dataset for Healthcare Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Hindi Conversation Chat Dataset for Healthcare Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/hindi-healthcare-domain-conversation-text-dataset
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The dataset comprises over 12,000 chat conversations, each focusing on specific Healthcare related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.

    Participant Details: 200+ native Hindi participants from the FutureBeeAI community.
    Word Count & Length: Chats are diverse, averaging 300 to 700 words and 50 to 150 turns across both speakers.

    Topic Diversity

    The chat dataset covers a wide range of conversations on Healthcare topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Healthcare use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.

    Inbound Chats:
    Appointment Scheduling
    New Patient Registration
    Surgery Consultation
    Consultation regarding Diet, and many more
    Outbound Chats:
    Appointment Reminder
    Health & Wellness Subscription Programs
    Lab Test Results
    Health Risk Assessments
    Preventive Care Reminders, and many more

    Language Variety & Nuances

    The conversations in this dataset capture the diverse language styles and expressions prevalent in Hindi Healthcare interactions. This diversity ensures the dataset accurately represents the language used by Hindi speakers in Healthcare contexts.

    The dataset encompasses a wide array of language elements, including:

    Naming Conventions: Chats include a variety of Hindi personal and business names.
    Localized Details: Real-world addresses, emails, phone numbers, and other contact information appropriate to different Hindi-speaking regions.
    Temporal and Numeric Expressions: Dates, times, currencies, and numbers in Hindi forms, adhering to local conventions.
    Idiomatic Expressions and Slang: Local slang, idioms, and informal phrases present in Hindi Healthcare conversations.

    This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Hindi Healthcare interactions.

    Conversational Flow and Interaction Types

    The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Healthcare customer-agent interactions.

    Simple Inquiries
    Detailed Discussions
    Transactional Interactions
    Problem-Solving Dialogues
    Advisory Sessions
    Routine Checks and Follow-Ups

    Each of these conversations contains various aspects of conversation flow like:

    Greetings
    Authentication
    Information gathering
    Resolution identification
    Solution Delivery
    Closing and Follow-ups
    Feedback, etc

    This structured and varied conversational flow enables the creation of advanced NLP models that can effectively manage and respond to a wide range of customer service scenarios.

    Data Format and Structure

    The dataset is available in JSON, CSV, and TXT formats, with each conversation containing attributes like participant identifiers and chat messages, designed to

  2. 797 Hours - Hindi(India) Spontaneous Dialogue Smartphone speech dataset

    • nexdata.ai
    • m.nexdata.ai
    Updated Apr 13, 2024
    Cite
    Nexdata (2024). 797 Hours - Hindi(India) Spontaneous Dialogue Smartphone speech dataset [Dataset]. https://www.nexdata.ai/datasets/speechrecog/1156
    Dataset updated
    Apr 13, 2024
    Dataset provided by
    nexdata technology inc
    Nexdata
    Authors
    Nexdata
    Area covered
    India
    Variables measured
    Format, Country, Speaker, Language, Accuracy Rate, Content category, Recording device, Recording condition, Language(Region) Code, Features of annotation
    Description

    Hindi(India) Spontaneous Dialogue Smartphone speech dataset, collected from dialogues based on given topics, covering 20+ domains. Transcribed with text content, speaker's ID, gender, age and other attributes. Our dataset was collected from a large and diverse pool of speakers (1,022 native speakers) across different regions, enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring the maintenance of user privacy and legal rights throughout the data collection, storage, and usage processes. Our datasets are all GDPR, CCPA, and PIPL compliant.

  3. Hindi (India) General Conversation Speech Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Hindi (India) General Conversation Speech Dataset [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-hindi-india
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    India
    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Welcome to the Hindi Language General Conversation Speech Dataset, a comprehensive and diverse collection of voice data specifically curated to advance the development of Hindi language speech recognition models, with a particular focus on Indian accents and dialects.

    With high-quality audio recordings, detailed metadata, and accurate transcriptions, it empowers researchers and developers to enhance natural language processing, conversational AI, and Generative Voice AI algorithms. Moreover, it facilitates the creation of sophisticated voice assistants and voice bots tailored to the unique linguistic nuances found in the Hindi language spoken in India.

    Speech Data:

    This training dataset comprises 150 hours of audio recordings covering a wide range of topics and scenarios, ensuring robustness and accuracy in speech technology applications. To achieve this, we collaborated with a diverse network of 160 native Hindi speakers from different parts of India. This collaborative effort guarantees a balanced representation of Indian accents, dialects, and demographics, reducing biases and promoting inclusivity.

    Each audio recording captures the essence of spontaneous, unscripted conversations between two individuals, with an average duration ranging from 15 to 60 minutes. The speech data is available in WAV format, with stereo channel files having a bit depth of 16 bits and a sample rate of 8 kHz. The recording environment is generally quiet, without background noise and echo.
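
    The format figures quoted above (stereo, 16-bit, 8 kHz WAV) can be checked against a downloaded file. The following is a minimal sketch using Python's standard `wave` module. The helper names and the file name are hypothetical, and a silent stand-in file is written so the check runs without access to the real recordings.

```python
import os
import tempfile
import wave


def describe_wav(path):
    """Read basic format properties from a WAV file header."""
    with wave.open(path, "rb") as wav:
        frame_rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "bit_depth": wav.getsampwidth() * 8,  # bytes per sample -> bits
            "sample_rate_hz": frame_rate,
            "duration_s": wav.getnframes() / frame_rate,
        }


def matches_listing(info, channels=2, bit_depth=16, sample_rate_hz=8000):
    """Compare a file against the stereo / 16-bit / 8 kHz figures in the listing."""
    return (
        info["channels"] == channels
        and info["bit_depth"] == bit_depth
        and info["sample_rate_hz"] == sample_rate_hz
    )


def write_silent_wav(path, seconds=1, channels=2, sample_width=2, rate=8000):
    """Write a silent stand-in file so the check can be tried without real data."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * (channels * sample_width * rate * seconds))


# Hypothetical file name; a real recording from the dataset would go here instead.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_conversation.wav")
write_silent_wav(demo_path)
demo_info = describe_wav(demo_path)
print(demo_info, matches_listing(demo_info))
```

    Running the same check over every delivered file is a cheap way to catch recordings that were resampled or mixed down to mono before delivery.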

    Metadata:

    In addition to the audio recordings, our dataset provides comprehensive metadata for each participant. This metadata includes the participant's age, gender, country, state, and dialect. Furthermore, additional metadata such as recording device detail, topic of recording, bit depth, and sample rate will be provided.

    The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Hindi language speech recognition models.

    Transcription:

    This dataset provides a manual verbatim transcription of each audio file to enhance your workflow efficiency. The transcriptions are available in JSON format. Each transcription is segmented by speaker with time codes and includes non-speech labels and tags.

    Our goal is to expedite the deployment of Hindi language conversational AI and NLP models by offering ready-to-use transcriptions, ultimately saving valuable time and resources in the development process.
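
    To illustrate how speaker-wise, time-coded JSON transcriptions might be consumed, here is a small sketch that sums talk time per speaker. The listing does not publish its JSON schema, so the layout and every field name below (`segments`, `speaker`, `start`, `end`, `text`) and the `<laugh>` tag are assumptions made for illustration only.

```python
import json
from collections import defaultdict

# Hypothetical transcript layout: field names and the "<laugh>" non-speech tag
# are assumptions, not the vendor's documented schema.
SAMPLE_TRANSCRIPT = """
{
  "segments": [
    {"speaker": "speaker_1", "start": 0.0, "end": 2.5, "text": "namaste"},
    {"speaker": "speaker_2", "start": 2.5, "end": 4.0, "text": "namaste, kaise hain aap"},
    {"speaker": "speaker_1", "start": 4.0, "end": 7.0, "text": "<laugh> main theek hoon"}
  ]
}
"""


def talk_time_by_speaker(transcript_json):
    """Sum the duration in seconds of time-coded segments for each speaker."""
    transcript = json.loads(transcript_json)
    totals = defaultdict(float)
    for segment in transcript["segments"]:
        totals[segment["speaker"]] += segment["end"] - segment["start"]
    return dict(totals)


print(talk_time_by_speaker(SAMPLE_TRANSCRIPT))
```

    The field names would need to be adapted to the actual files once a sample of the dataset is available.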

    Updates and Customization:

    We understand the importance of collecting data in various environments to build robust ASR models. Therefore, our voice dataset is regularly updated with new audio data captured in diverse real-world conditions.

    If you require a custom training dataset with specific environmental conditions such as in-car, busy street, restaurant, or any other scenario, we can accommodate your request. We can provide voice data with customized sample rates ranging from 8 kHz to 48 kHz, allowing you to fine-tune your models for different audio recording setups. We can also customize the transcription to follow your specific guidelines and requirements, to further support your ASR development process.

    License:

    This audio dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Whether you are training or fine-tuning speech recognition models, advancing NLP algorithms, exploring generative voice AI, or building cutting-edge voice assistants and bots, our dataset serves as a reliable and valuable resource.

  4. Hindi Conversation Chat Dataset for Delivery & Logistics Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Hindi Conversation Chat Dataset for Delivery & Logistics Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/hindi-delivery-domain-conversation-text-dataset
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The dataset comprises over 12,000 chat conversations, each focusing on specific Delivery & Logistics related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.

    Participant Details: 200+ native Hindi participants from the FutureBeeAI community.
    Word Count & Length: Chats are diverse, averaging 300 to 700 words and 50 to 150 turns across both speakers.

    Topic Diversity

    The chat dataset covers a wide range of conversations on Delivery & Logistics topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Delivery & Logistics use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.

    Inbound Chats:
    Order Tracking
    Delivery Complaint
    Undeliverable Address
    Delivery Method Selection
    Return Process Enquiry
    Order Modification, and many more
    Outbound Chats:
    Delivery Confirmation
    Delivery Subscription
    Incorrect Address
    Missed Delivery Attempt
    Delivery Feedback
    Out-of-Stock Notification
    Delivery Satisfaction Survey, and many more

    Language Variety & Nuances

    The conversations in this dataset capture the diverse language styles and expressions prevalent in Hindi Delivery & Logistics interactions. This diversity ensures the dataset accurately represents the language used by Hindi speakers in Delivery & Logistics contexts.

    The dataset encompasses a wide array of language elements, including:

    Naming Conventions: Chats include a variety of Hindi personal and business names.
    Localized Details: Real-world addresses, emails, phone numbers, and other contact information appropriate to different Hindi-speaking regions.
    Temporal and Numeric Expressions: Dates, times, currencies, and numbers in Hindi forms, adhering to local conventions.
    Idiomatic Expressions and Slang: Local slang, idioms, and informal phrases present in Hindi Delivery & Logistics conversations.

    This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Hindi Delivery & Logistics interactions.

    Conversational Flow and Interaction Types

    The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Delivery & Logistics customer-agent interactions.

    Simple Inquiries
    Detailed Discussions
    Transactional Interactions
    Problem-Solving Dialogues
    Advisory Sessions
    Routine Checks and Follow-Ups

    Each of these conversations contains various aspects of conversation flow like:

    Greetings
    Authentication
    Information gathering
    Resolution identification
    Solution

  5. IndicTTS-Hindi-male

    • huggingface.co
    Updated May 29, 2025
    Cite
    Aditya Anjan (2025). IndicTTS-Hindi-male [Dataset]. https://huggingface.co/datasets/Anjan9320/IndicTTS-Hindi-male
    Dataset updated
    May 29, 2025
    Authors
    Aditya Anjan
    Description

    Hindi Indic TTS Dataset

    This dataset is derived from the Indic TTS Database project, specifically using the Hindi monolingual recordings from both male and female speakers. The dataset contains high-quality speech recordings with corresponding text transcriptions, making it suitable for text-to-speech (TTS) research and development.

    Dataset Details

    Language: Hindi
    Total Duration: Male: 5.16 hours
    Audio Format: WAV
    Sampling Rate: 48000Hz
    Speakers: 1 male native Hindi…

    See the full description on the dataset page: https://huggingface.co/datasets/Anjan9320/IndicTTS-Hindi-male.

  6. General conversation speech datasets in Hindi for Health Tech

    • data.macgence.com
    mp3
    Updated Mar 29, 2024
    Cite
    Macgence (2024). General conversation speech datasets in Hindi for Health Tech [Dataset]. https://data.macgence.com/dataset/general-conversation-speech-datasets-in-hindi-for-health-tech
    Available download formats: mp3
    Dataset updated
    Mar 29, 2024
    Dataset authored and provided by
    Macgence
    License

    https://data.macgence.com/terms-and-conditions

    Time period covered
    2025
    Area covered
    Worldwide
    Variables measured
    Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
    Description

    The audio dataset contains general conversation recordings featuring Hindi speakers from India, with detailed metadata.

  7. 34 Hours - Hindi(India) Children Real-world Casual Conversation and...

    • nexdata.ai
    • m.nexdata.ai
    Updated May 2, 2024
    Cite
    Nexdata (2024). 34 Hours - Hindi(India) Children Real-world Casual Conversation and Monologue speech dataset [Dataset]. https://www.nexdata.ai/datasets/speechrecog/1377?source=Github
    Dataset updated
    May 2, 2024
    Dataset authored and provided by
    Nexdata
    Area covered
    India
    Variables measured
    Age, Format, Country, Accuracy, Language, Content category, Language(Region) Code, Recording environment, Features of annotation
    Description

    Hindi(India) Children Real-world Casual Conversation and Monologue speech dataset, covering self-media, conversation, live, lecture, variety show and other generic domains, and mirroring real-world interactions. Transcribed with text content, speaker's ID, gender, age, accent and other attributes. Our dataset was collected from a large and diverse pool of speakers (children aged 12 and younger) across different regions, enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring the maintenance of user privacy and legal rights throughout the data collection, storage, and usage processes. Our datasets are all GDPR, CCPA, and PIPL compliant.

  8. General conversation speech datasets in Hindi for Friendly Advice

    • data.macgence.com
    mp3
    Updated Apr 6, 2024
    Cite
    Macgence (2024). General conversation speech datasets in Hindi for Friendly Advice [Dataset]. https://data.macgence.com/dataset/general-conversation-speech-datasets-in-hindi--for-friendly-advice
    Available download formats: mp3
    Dataset updated
    Apr 6, 2024
    Dataset authored and provided by
    Macgence
    License

    https://data.macgence.com/terms-and-conditions

    Time period covered
    2025
    Area covered
    Worldwide
    Variables measured
    Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
    Description

    The audio dataset contains general conversation recordings featuring Hindi speakers from India, with detailed metadata.

  9. Call Center conversation speech datasets in Hindi for Retail

    • data.macgence.com
    mp3
    Updated Mar 27, 2024
    Cite
    Macgence (2024). Call Center conversation speech datasets in Hindi for Retail [Dataset]. https://data.macgence.com/dataset/call-center-conversation-speech-datasets-in-hindi--for-retail
    Available download formats: mp3
    Dataset updated
    Mar 27, 2024
    Dataset authored and provided by
    Macgence
    License

    https://data.macgence.com/terms-and-conditions

    Time period covered
    2025
    Area covered
    Worldwide
    Variables measured
    Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
    Description

    The audio dataset contains call center conversations from the Retail domain featuring Hindi speakers from India, with detailed metadata.

  10. CSLU: Foreign Accented English Release 1.2

    • search.dataone.org
    • borealisdata.ca
    Updated Dec 28, 2023
    Cite
    Lander, T (2023). CSLU: Foreign Accented English Release 1.2 [Dataset]. http://doi.org/10.5683/SP2/K7EQTE
    Dataset updated
    Dec 28, 2023
    Dataset provided by
    Borealis
    Authors
    Lander, T
    Description

    Introduction

    This file contains documentation on CSLU: Foreign Accented English Release 1.2, Linguistic Data Consortium (LDC) catalog number LDC2006S38 and ISBN 1-58563-392-5.

    CSLU: Foreign Accented English Release 1.2 consists of continuous speech in English by native speakers of 22 different languages: Arabic, Cantonese, Czech, Farsi, French, German, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Mandarin Chinese, Malay, Polish, Portuguese (Brazilian and Iberian), Russian, Swedish, Spanish, Swahili, Tamil and Vietnamese. The corpus contains 4925 telephone-quality utterances, information about the speakers' linguistic backgrounds and perceptual judgments about the accents in the utterances. The speakers were asked to speak about themselves in English for 20 seconds. Three native speakers of American English independently listened to each utterance and judged the speakers' accents on a 4-point scale: negligible/no accent, mild accent, strong accent and very strong accent.

    This corpus is intended to support the study of the underlying characteristics of foreign accent and to enable research, development and evaluation of algorithms for the identification and understanding of accented speech. Some of the files in this corpus are also contained in CSLU: 22 Languages Corpus, LDC2005S26.

    Samples

    For an example of the data in this corpus, please listen to this audio sample.

    Copyright

    Portions © 2000-2002 Center for Spoken Language Understanding, Oregon Health & Science University, © 2007 Trustees of the University of Pennsylvania

  11. Call Center conversation in Hindi for Insurance

    • data.macgence.com
    mp3
    Updated Mar 23, 2024
    Cite
    Macgence (2024). Call Center conversation in Hindi for Insurance [Dataset]. https://data.macgence.com/dataset/call-center-conversation-in-hindi--for-insurance
    Available download formats: mp3
    Dataset updated
    Mar 23, 2024
    Dataset authored and provided by
    Macgence
    License

    https://data.macgence.com/terms-and-conditions

    Time period covered
    2025
    Area covered
    Worldwide
    Variables measured
    Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
    Description

    The audio dataset contains call center conversations from the Insurance domain featuring Hindi speakers from India, with detailed metadata.

  12. Data from: CWID-hi: A Dataset for Complex Word Identification in Hindi Text

    • explore.openaire.eu
    • zenodo.org
    Updated Aug 21, 2021
    Cite
    Gayatri Venugopal; Dhanya Pramod (2021). CWID-hi: A Dataset for Complex Word Identification in Hindi Text [Dataset]. http://doi.org/10.5281/zenodo.5229159
    Dataset updated
    Aug 21, 2021
    Authors
    Gayatri Venugopal; Dhanya Pramod
    Description

    This dataset was created by conducting a human intelligence test, wherein native and non-native Hindi speakers annotated words they could not understand in Hindi text. They were then asked to rank the complexity of these words along with their synonyms. A word that received an average rank of <=3 (out of 5) is labeled 1 and the word that received an average rank of >3 is labeled 0. 1 indicates complex and 0 indicates simple.
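
    The labeling rule described above can be written out as a short function. This is a sketch of the rule exactly as the description states it (average rank <= 3 gives label 1, otherwise 0), not the authors' code. The function name and the example ranks are made up for illustration.

```python
def label_complexity(ranks, threshold=3):
    """Return 1 (complex) if the average rank is <= threshold, else 0 (simple)."""
    if not ranks:
        raise ValueError("at least one rank is required")
    average = sum(ranks) / len(ranks)
    return 1 if average <= threshold else 0


# Made-up annotator ranks on the 1-5 scale, for illustration only.
print(label_complexity([2, 3, 3]))  # average is about 2.67
print(label_complexity([4, 5, 3]))  # average is 4.0
```

    Note that a word whose average rank is exactly 3 falls on the "complex" side of the boundary under this rule.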

  13. Indian Agent to Indian Customer call center Speech Dataset in Hindi for...

    • data.macgence.com
    mp3
    Updated May 12, 2025
    Cite
    Macgence (2025). Indian Agent to Indian Customer call center Speech Dataset in Hindi for Banking [Dataset]. https://data.macgence.com/dataset/indian-agent-to-indian-customer-call-center-speech-dataset-in-hindi-for-banking
    Available download formats: mp3
    Dataset updated
    May 12, 2025
    Dataset authored and provided by
    Macgence
    License

    https://data.macgence.com/terms-and-conditions

    Time period covered
    2025
    Area covered
    Worldwide, India
    Variables measured
    Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
    Description

    The audio dataset contains call center conversations featuring Hindi speakers from India, with detailed metadata.

  14. Gram Vaani data set

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated May 15, 2019
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2019). Gram Vaani data set [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0405/
    Dataset updated
    May 15, 2019
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    Description

    The Gram Vaani data set consists of 130 hours (21,000 different audio recordings) recorded by 4,000 unique Hindi speakers from the states of Bihar, Jharkhand, and Madhya Pradesh in India (20-25% female, 60% people under 30 years of age, mostly rural).

    The data set was collected via a voice-based community media platform that runs over IVR (Interactive Voice Response) telephone systems. Users can call into the system and listen to audio messages, or record their own message in response to messages they hear. This therefore serves as a discussion forum on voice, but without needing the Internet, and suitable even for less-literate populations. The platform is used for discussions on local policies, local news, questions and answers on agriculture, health and social norms, and poetry. All content recorded by the users is manually reviewed before it can be heard by other users over the IVR, to reject content with poor audio quality or editorial violations such as hate speech or false information.

    The environment for recordings is mostly outdoor, with a medium level of background noise from roadside and public places. Speech samples are stored as 8 kHz MP3 files.

    An orthographic transcription is provided (transliteration in Latin characters), including the following tagged named entities:

    #person:
    #location:
    #organization: an NGO or a government department
    #crop: farming products, e.g. paddy, wheat, mushroom
    #scheme: multi-word names of government schemes and services, like employment guarantee, food subsidy, health center, hospital
    #disease: e.g. malaria, dengue, diarrhea, heat stroke
    #event: e.g. festivals like diwali, chath, or event classes like flood, violence, curfew, rally, election

    References: Aparna Moitra, Vishnupriya Das, Gram Vaani, Archna Kumar, and Aaditeshwar Seth. 2016. Design Lessons from Creating a Mobile-based Community Media Platform in Rural India. In Proceedings of the Eighth International Conference on Information and Communication Technologies and Development (ICTD ’16). Association for Computing Machinery, New York, NY, USA, Article 14, 1–11. DOI:https://doi.org/10.1145/2909609.2909670

  15. Travel Call Center Speech Data: Hindi (India)

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Travel Call Center Speech Data: Hindi (India) [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/travel-call-center-conversation-hindi-india
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    India
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Hindi Call Center Speech Dataset for the Travel domain designed to enhance the development of call center speech recognition models specifically for the Travel industry. This dataset is meticulously curated to support advanced speech recognition, natural language processing, conversational AI, and generative voice AI algorithms.

    Speech Data:

    This training dataset comprises 30 Hours of call center audio recordings covering various topics and scenarios related to the Travel domain, designed to build robust and accurate customer service speech technology.

    Participant Diversity:
    Speakers: 60 expert native Hindi speakers from the FutureBeeAI Community.
    Regions: Different states/provinces of India, ensuring a balanced representation of Hindi accents, dialects, and demographics.
    Participant Profile: Participants range from 18 to 70 years old, representing both males and females in a 60:40 ratio, respectively.
    Recording Details:
    Conversation Nature: Unscripted and spontaneous conversations between call center agents and customers.
    Call Duration: Average duration of 5 to 15 minutes per call.
    Formats: WAV format with stereo channels, a bit depth of 16 bits, and sample rates of 8 kHz and 16 kHz.
    Environment: Without background noise and without echo.

    Topic Diversity

    This dataset offers a diverse range of conversation topics, call types, and outcomes, including both inbound and outbound calls with positive, neutral, and negative outcomes.

    Inbound Calls:
    Booking inquiries and assistance
    Destination information and recommendations
    Assistance with flight delays or cancellations
    Special assistance for passengers with disabilities
    Travel-related health and safety inquiry
    Assistance with lost or delayed baggage, and many more
    Outbound Calls:
    Promotional offers and package deals
    Customer satisfaction surveys
    Booking confirmations and updates
    Flight schedule changes and notifications
    Customer feedback collection
    Reminders for passport or visa expiration date, and many more

    This extensive coverage ensures the dataset includes realistic call center scenarios, which is essential for developing effective customer support speech recognition models.

    Transcription

    To facilitate your workflow, the dataset includes manual verbatim transcriptions of each call center audio file in JSON format. These transcriptions feature:

    Speaker-wise Segmentation: Time-coded segments for both agents and customers.
    Non-Speech Labels: Tags and labels for non-speech elements.
    Word Error Rate: Word error rate is less than 5% thanks to the dual layer of QA.

    These ready-to-use transcriptions accelerate the development of the Travel domain call center conversational AI and ASR models for the Hindi language.
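
    For readers unfamiliar with the word error rate figure quoted above, the sketch below shows the standard calculation: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The listing does not describe how its QA process measures WER, so this is the conventional definition rather than the vendor's procedure, and the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] is the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current_row.append(
                min(
                    previous_row[j] + 1,  # deletion
                    current_row[j - 1] + 1,  # insertion
                    previous_row[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


# Made-up transliterated example: one word dropped out of five.
print(word_error_rate("mujhe dilli ka ticket chahiye", "mujhe dilli ticket chahiye"))
```

    Under this definition, a figure below 5% means fewer than one word-level error per twenty reference words on average.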

    Metadata

    The dataset provides comprehensive metadata for each conversation and participant:

    Participant Metadata: Unique identifier, age, gender, country, state, district, accent and dialect.
    Conversation Metadata: Domain, topic, call type, outcome/sentiment, bit depth, and sample rate.

  16. Hindi Conversation Chat Dataset for Real Estate Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Hindi Conversation Chat Dataset for Real Estate Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/hindi-realestate-domain-conversation-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The dataset comprises over 12,000 chat conversations, each focusing on specific Real Estate related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.

    Participants Details: 200+ native Hindi participants from the FutureBeeAI community.
    Word Count & Length: Chats are diverse, averaging 300 to 700 words and 50 to 150 turns across both speakers.

    Topic Diversity

    The chat dataset covers a wide range of conversations on Real Estate topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Real Estate use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.

    Inbound Chats:
    Property Inquiry
    Rental Property Search & Availability
    Renovation Inquiries
    Property Features & Amenities Inquiry
    Investment Property Analysis & Advice
    Property History & Ownership Details, and many more
    Outbound Chats:
    New Property Listing Update
    Post Purchase Follow-ups
    Investment Opportunities & Property Recommendations
    Property Value Updates
    Customer Satisfaction Surveys, and many more

    Language Variety & Nuances

    The conversations in this dataset capture the diverse language styles and expressions prevalent in Hindi Real Estate interactions. This diversity ensures the dataset accurately represents the language used by Hindi speakers in Real Estate contexts.

    The dataset encompasses a wide array of language elements, including:

    Naming Conventions: Chats include a variety of Hindi personal and business names.
    Localized Details: Real-world addresses, emails, phone numbers, and other contact information according to different Hindi-speaking regions.
    Temporal and Numeric Expressions: Dates, times, currencies, and numbers in Hindi forms, adhering to local conventions.
    Idiomatic Expressions and Slang: It includes local slang, idioms, and informal phrases present in Hindi Real Estate conversations.

    This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Hindi Real Estate interactions.

    Conversational Flow and Interaction Types

    The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Real Estate customer-agent interactions.

    Simple Inquiries
    Detailed Discussions
    Transactional Interactions
    Problem-Solving Dialogues
    Advisory Sessions
    Routine Checks and Follow-Ups

    Each of these conversations contains various aspects of conversation flow like:

    Greetings
    Authentication
    Information gathering
    Resolution identification
    Solution Delivery
    Closing and Follow-ups

  17. m

    Call Center conversation in Hindi for Payment collection

    • data.macgence.com
    mp3
    Updated Jun 20, 2024
    Cite
    Macgence (2024). Call Center conversation in Hindi for Payment collection [Dataset]. https://data.macgence.com/dataset/call-center-conversation-in-hindi-for-payment-collection
    Explore at:
    Available download formats: mp3
    Dataset updated
    Jun 20, 2024
    Dataset authored and provided by
    Macgence
    License

    https://data.macgence.com/terms-and-conditions

    Time period covered
    2025
    Area covered
    Worldwide
    Variables measured
    Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
    Description

    The audio dataset includes call center conversations about payment collection, featuring Hindi speakers from India, with detailed metadata.

  18. d

    Replication Data for: The acquisition of Hindi split-ergativity and...

    • search.dataone.org
    • dataverse.azure.uit.no
    • +1 more
    Updated Sep 25, 2024
    Cite
    Ponnet, Aaricia; De Cuypere, Ludovic (2024). Replication Data for: The acquisition of Hindi split-ergativity and Differential Object Marking by Dutch L1 speakers: systematicity and variation [Dataset]. http://doi.org/10.18710/3YWQ8R
    Explore at:
    Dataset updated
    Sep 25, 2024
    Dataset provided by
    DataverseNO
    Authors
    Ponnet, Aaricia; De Cuypere, Ludovic
    Description

    Dataset abstract: The dataset includes annotated corpus data of N = 1811 utterances based on a picture description task that elicited semi-spontaneous oral production data from 15 Dutch learners of Hindi, from four (cross-sectional) stages (Years) of the Hindi course trajectory. The corpus data is annotated for (i) Learner, (ii) Year of study of the learner, (iii) the use of ne as an ergative marker, (iv) correct usage of the ne-marker, (v) the use of ko as a Differential Object Marker, (vi) the use of ko as another marker, and multiple features associated with ne- and ko-marking, including: (vii) specificity of the Direct Object, (viii) animacy of the Direct Object, (ix) transitivity of the sentence Verb, (x) perfectivity of the sentence Verb, (xi) other uses of the ko-marker, (xii) the semantic role of these other uses of the ko-marker.

    Article abstract: We investigated the acquisition of Hindi split ergativity (ne-marking) and Differential Object Marking (zero or ko-marking) by L1 speakers of Dutch. Both grammatical phenomena are conditioned by multiple syntactic and semantic features. On a descriptive level, the study aims to examine when and how Dutch learners acquire and apply the conditional features associated with ne- and ko-marking. A specific learner corpus was created based on a picture description task that elicited semi-spontaneous oral production data from 15 Dutch learners of Hindi, from four (cross-sectional) stages of the Hindi course trajectory. We annotated the corpus data for multiple features associated with ne- and ko-marking. Using a mixed-effects logistic regression analysis, we found an increase in the use and accuracy of each case marker over the different years of study, but individual learner profile analyses revealed considerable intersubject differences in learner behaviour. We show that it is possible to define developmental stages for the acquisition of ne- and ko-marking in line with Processability Theory.

  19. P

    IndicTTS Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Oct 15, 2016
    Cite
    (2016). IndicTTS Dataset [Dataset]. https://paperswithcode.com/dataset/indictts
    Explore at:
    Dataset updated
    Oct 15, 2016
    Description

    A special corpus of Indian languages covering 13 major languages of India. It comprises 10,000+ spoken sentences/utterances each of mono and English, recorded by both male and female native speakers. Speech waveform files are available in .wav format along with the corresponding text. We hope that these recordings will be useful for researchers and speech technologists working on synthesis and recognition. You can request zip archives of the entire database here.

  20. m

    Call Center conversation in Hindi for Appointment fixing

    • data.macgence.com
    mp3
    Updated Apr 8, 2024
    Cite
    Macgence (2024). Call Center conversation in Hindi for Appointment fixing [Dataset]. https://data.macgence.com/dataset/call-center-conversation-in-hindi-for-appointment-fixing
    Explore at:
    Available download formats: mp3
    Dataset updated
    Apr 8, 2024
    Dataset authored and provided by
    Macgence
    License

    https://data.macgence.com/terms-and-conditions

    Time period covered
    2025
    Area covered
    Worldwide
    Variables measured
    Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
    Description

    The audio dataset includes call center conversations about appointment fixing, featuring Hindi speakers from India, with detailed metadata.

Cite
FutureBee AI (2022). Hindi Conversation Chat Dataset for Healthcare Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/hindi-healthcare-domain-conversation-text-dataset

Hindi Conversation Chat Dataset for Healthcare Domain

Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License

https://www.futurebeeai.com/policies/ai-data-license-agreement

Dataset funded by
FutureBeeAI
Description

Introduction

The dataset comprises over 12,000 chat conversations, each focusing on specific Healthcare related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.

Participants Details: 200+ native Hindi participants from the FutureBeeAI community.
Word Count & Length: Chats are diverse, averaging 300 to 700 words and 50 to 150 turns across both speakers.

Topic Diversity

The chat dataset covers a wide range of conversations on Healthcare topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Healthcare use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.

Inbound Chats:
Appointment Scheduling
New Patient Registration
Surgery Consultation
Consultation regarding Diet, and many more
Outbound Chats:
Appointment Reminder
Health & Wellness Subscription Programs
Lab Test Results
Health Risk Assessments
Preventive Care Reminders, and many more

Language Variety & Nuances

The conversations in this dataset capture the diverse language styles and expressions prevalent in Hindi Healthcare interactions. This diversity ensures the dataset accurately represents the language used by Hindi speakers in Healthcare contexts.

The dataset encompasses a wide array of language elements, including:

Naming Conventions: Chats include a variety of Hindi personal and business names.
Localized Details: Real-world addresses, emails, phone numbers, and other contact information according to different Hindi-speaking regions.
Temporal and Numeric Expressions: Dates, times, currencies, and numbers in Hindi forms, adhering to local conventions.
Idiomatic Expressions and Slang: It includes local slang, idioms, and informal phrases present in Hindi Healthcare conversations.

This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Hindi Healthcare interactions.

Conversational Flow and Interaction Types

The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Healthcare customer-agent interactions.

Simple Inquiries
Detailed Discussions
Transactional Interactions
Problem-Solving Dialogues
Advisory Sessions
Routine Checks and Follow-Ups

Each of these conversations contains various aspects of conversation flow like:

Greetings
Authentication
Information gathering
Resolution identification
Solution Delivery
Closing and Follow-ups
Feedback, etc

This structured and varied conversational flow enables the creation of advanced NLP models that can effectively manage and respond to a wide range of customer service scenarios.

Data Format and Structure

The dataset is available in JSON, CSV, and TXT formats, with each conversation containing attributes like participant identifiers and chat messages, designed to
