49 datasets found
  1.

    Hindi Conversation Chat Dataset for Telecom Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Hindi Conversation Chat Dataset for Telecom Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/hindi-telecom-domain-conversation-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The dataset comprises over 12,000 chat conversations, each focusing on a specific Telecom-related topic. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.

    Participant Details: 200+ native Hindi participants from the FutureBeeAI community.
    Word Count & Length: Chats are diverse, averaging 300 to 700 words and 50 to 150 turns across both speakers.

    Topic Diversity

    The chat dataset covers a wide range of conversations on Telecom topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Telecom use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.

    Inbound Chats:
    Phone Number Porting
    Network Connectivity Issues
    Billing and Payments
    Technical Support
    Service Activation
    International Roaming Enquiry
    Refunds and Billing Adjustments
    Emergency Service Access, and many more
    Outbound Chats:
    Welcome Calls / Onboarding Process
    Payment Reminders
    Customer Surveys
    Technical Updates
    Service Usage Reviews
    Network Complaint Update, and many more

    Language Variety & Nuances

    The conversations in this dataset capture the diverse language styles and expressions prevalent in Hindi Telecom interactions. This diversity ensures the dataset accurately represents the language used by Hindi speakers in Telecom contexts.

    The dataset encompasses a wide array of language elements, including:

    Naming Conventions: Chats include a variety of Hindi personal and business names.
    Localized Details: Real-world addresses, emails, phone numbers, and other contact information, formatted according to the conventions of different Hindi-speaking regions.
    Temporal and Numeric Expressions: Dates, times, currencies, and numbers in Hindi forms, adhering to local conventions.
    Idiomatic Expressions and Slang: Local slang, idioms, and informal phrases present in Hindi Telecom conversations.

    This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Hindi Telecom interactions.

    Conversational Flow and Interaction Types

    The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Telecom customer-agent interactions.

    Simple Inquiries
    Detailed Discussions
    Transactional Interactions
    Problem-Solving Dialogues
    Advisory Sessions
    Routine Checks and Follow-Ups

    Each conversation contains various stages of the conversation flow, such as:

    Greetings
    Authentication
    Information gathering
    Resolution identification

  2.

    Hindi Conversation Chat Dataset for Healthcare Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Hindi Conversation Chat Dataset for Healthcare Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/hindi-healthcare-domain-conversation-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The dataset comprises over 12,000 chat conversations, each focusing on a specific Healthcare-related topic. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.

    Participant Details: 200+ native Hindi participants from the FutureBeeAI community.
    Word Count & Length: Chats are diverse, averaging 300 to 700 words and 50 to 150 turns across both speakers.

    Topic Diversity

    The chat dataset covers a wide range of conversations on Healthcare topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Healthcare use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.

    Inbound Chats:
    Appointment Scheduling
    New Patient Registration
    Surgery Consultation
    Consultation regarding Diet, and many more
    Outbound Chats:
    Appointment Reminder
    Health & Wellness Subscription Programs
    Lab Test Results
    Health Risk Assessments
    Preventive Care Reminders, and many more

    Language Variety & Nuances

    The conversations in this dataset capture the diverse language styles and expressions prevalent in Hindi Healthcare interactions. This diversity ensures the dataset accurately represents the language used by Hindi speakers in Healthcare contexts.

    The dataset encompasses a wide array of language elements, including:

    Naming Conventions: Chats include a variety of Hindi personal and business names.
    Localized Details: Real-world addresses, emails, phone numbers, and other contact information, formatted according to the conventions of different Hindi-speaking regions.
    Temporal and Numeric Expressions: Dates, times, currencies, and numbers in Hindi forms, adhering to local conventions.
    Idiomatic Expressions and Slang: Local slang, idioms, and informal phrases present in Hindi Healthcare conversations.

    This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Hindi Healthcare interactions.

    Conversational Flow and Interaction Types

    The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Healthcare customer-agent interactions.

    Simple Inquiries
    Detailed Discussions
    Transactional Interactions
    Problem-Solving Dialogues
    Advisory Sessions
    Routine Checks and Follow-Ups

    Each conversation contains various stages of the conversation flow, such as:

    Greetings
    Authentication
    Information gathering
    Resolution identification
    Solution Delivery
    Closing and Follow-ups
    Feedback, etc

    This structured and varied conversational flow enables the creation of advanced NLP models that can effectively manage and respond to a wide range of customer service scenarios.

    Data Format and Structure

    The dataset is available in JSON, CSV, and TXT formats, with each conversation containing attributes like participant identifiers and chat messages, designed to

  3.

    797 Hours - Hindi(India) Spontaneous Dialogue Smartphone speech dataset

    • m.nexdata.ai
    Updated Apr 13, 2024
    Cite
    797 Hours - Hindi(India) Spontaneous Dialogue Smartphone speech dataset [Dataset]. https://m.nexdata.ai/datasets/speechrecog/1156
    Explore at:
    Dataset updated
    Apr 13, 2024
    Dataset provided by
    nexdata technology inc
    Authors
    Nexdata
    Area covered
    India
    Variables measured
    Format, Country, Speaker, Language, Accuracy Rate, Content category, Recording device, Recording condition, Language(Region) Code, Features of annotation
    Description

    Hindi (India) spontaneous dialogue smartphone speech dataset, collected from dialogues on given topics and covering 20+ domains. Transcribed with text content, speaker ID, gender, age, and other attributes. The dataset was collected from a large, geographically diverse pool of speakers (1,022 native speakers), enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring the maintenance of user privacy and legal rights throughout the data collection, storage, and usage processes. All our datasets comply with GDPR, CCPA, and PIPL.

  4.

    Hindi Language Datasets | Audio Data for ASR, Virtual Assistant

    • hmn.shaip.com
    • fa.shaip.com
    • +71more
    Updated Aug 30, 2024
    Cite
    Shaip (2024). Hindi Language Datasets | Audio Data for ASR, Virtual Assistant [Dataset]. https://hmn.shaip.com/offerings/speech-data-catalog/hindi-dataset/
    Explore at:
    Dataset updated
    Aug 30, 2024
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Enhance your Conversational AI model with our Off-the-Shelf Hindi Language Datasets. Shaip's high-quality audio datasets are a quick and effective solution for model training.

  5.

    General domain Human-Human conversation chats in Hindi

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). General domain Human-Human conversation chats in Hindi [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/hindi-general-domain-conversation-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    This training dataset comprises more than 10,000 text conversations between two native Hindi speakers in the general domain. The chats cover a variety of topics, services, and issues of daily life, such as music, books, festivals, health, kids, family, environment, study, childhood, cuisine, internet, and movies, which makes the dataset diverse.

    These chats use language-specific words and phrases and follow the native way of talking, which makes them more information-rich for your NLP model. Besides being specific to its topic, each chat contains attributes such as people's names, addresses, contact information, email addresses, times, dates, local currency, telephone numbers, and local slang in various formats to make the text data unbiased.

    These chat scripts have between 300 and 700 words and up to 50 turns. 150 people who are part of the FutureBeeAI crowd community contributed to this dataset. You will also receive chat metadata, such as participant age, gender, and country information, along with the chats. Dataset applications include conversational AI, natural language processing (NLP), smart assistants, text recognition, text analytics, and text prediction.

    This dataset is being expanded with new chats all the time. We are able to produce text data in a variety of languages to meet your unique requirements. Check out the FutureBeeAI community for a custom collection.

    This training dataset's licence belongs to FutureBeeAI!

  6.

    gooftagoo

    • huggingface.co
    Updated Mar 17, 2024
    Cite
    Tensoic AI (2024). gooftagoo [Dataset]. https://huggingface.co/datasets/Tensoic/gooftagoo
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 17, 2024
    Dataset authored and provided by
    Tensoic AI
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Hindi/Hinglish Conversation Dataset

    This repository contains a dataset of conversational text in conversational Hindi and Hinglish (a mix of Hindi and English). The Conversation Dataset contains multi-turn conversations on multiple topics, usually revolving around daily real-life experiences. A small number of reasoning tasks have also been added (specifically CoT-style reasoning and coding), with about 1k samples from OpenHermes 2.5.

      Caution
    

    This dataset… See the full description on the dataset page: https://huggingface.co/datasets/Tensoic/gooftagoo.

  7. 34 Hours - Hindi(India) Children Real-world Casual Conversation and...

    • m.nexdata.ai
    Updated Nov 16, 2023
    Cite
    Nexdata (2023). 34 Hours - Hindi(India) Children Real-world Casual Conversation and Monologue speech dataset [Dataset]. https://m.nexdata.ai/datasets/speechrecog/1377
    Explore at:
    Dataset updated
    Nov 16, 2023
    Dataset authored and provided by
    Nexdata
    Area covered
    World
    Variables measured
    Age, Format, Country, Accuracy, Language, Content category, Language(Region) Code, Recording environment, Features of annotation
    Description

    Hindi (India) children's real-world casual conversation and monologue speech dataset. It covers self-media, conversation, live, lecture, variety show, and other generic domains, and mirrors real-world interactions. Transcribed with text content, speaker ID, gender, age, accent, and other attributes. The dataset was collected from a large, geographically diverse pool of speakers (children aged 12 and younger), enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring the maintenance of user privacy and legal rights throughout the data collection, storage, and usage processes. All our datasets comply with GDPR, CCPA, and PIPL.

  8.

    Hindi (India) General Conversation Speech Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Hindi (India) General Conversation Speech Dataset [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-hindi-india
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Welcome to the Hindi Language General Conversation Speech Dataset, a comprehensive and diverse collection of voice data specifically curated to advance the development of Hindi language speech recognition models, with a particular focus on Indian accents and dialects.

    With high-quality audio recordings, detailed metadata, and accurate transcriptions, it empowers researchers and developers to enhance natural language processing, conversational AI, and Generative Voice AI algorithms. Moreover, it facilitates the creation of sophisticated voice assistants and voice bots tailored to the unique linguistic nuances found in the Hindi language spoken in India.

    Speech Data:

    This training dataset comprises 150 hours of audio recordings covering a wide range of topics and scenarios, ensuring robustness and accuracy in speech technology applications. To achieve this, we collaborated with a diverse network of 160 native Hindi speakers from different parts of India. This collaborative effort guarantees a balanced representation of Indian accents, dialects, and demographics, reducing biases and promoting inclusivity.

    Each audio recording captures the essence of spontaneous, unscripted conversations between two individuals, with an average duration ranging from 15 to 60 minutes. The speech data is available in WAV format, with stereo channel files having a bit depth of 16 bits and a sample rate of 8 kHz. The recording environment is generally quiet, without background noise and echo.
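
    The stated audio format (WAV, stereo, 16-bit, 8 kHz) can be checked on a downloaded file with Python's standard `wave` module. The following is a minimal sketch, not part of the dataset's own tooling: the function names `describe_wav`, `matches_listing`, and `make_silent_wav` are hypothetical, and the silent in-memory file only stands in for a real recording.

```python
import io
import wave


def describe_wav(source):
    """Return (channels, bit_depth, sample_rate_hz) for a WAV path or file object."""
    with wave.open(source, "rb") as wav:
        return wav.getnchannels(), wav.getsampwidth() * 8, wav.getframerate()


def matches_listing(source, channels=2, bit_depth=16, sample_rate_hz=8000):
    """Check a file against the stereo / 16-bit / 8 kHz format stated in the listing."""
    return describe_wav(source) == (channels, bit_depth, sample_rate_hz)


def make_silent_wav(channels=2, bit_depth=16, sample_rate_hz=8000, seconds=1):
    """Build a silent in-memory WAV (a stand-in for a real dataset recording)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(bit_depth // 8)
        wav.setframerate(sample_rate_hz)
        frame_bytes = channels * (bit_depth // 8)
        wav.writeframes(b"\x00" * (frame_bytes * sample_rate_hz * seconds))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(describe_wav(make_silent_wav()))
    print(matches_listing(make_silent_wav(channels=1)))
```

    Passing a file path instead of the in-memory buffer runs the same check on an actual recording.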

    Metadata:

    In addition to the audio recordings, our dataset provides comprehensive metadata for each participant. This metadata includes the participant's age, gender, country, state, and dialect. Additional metadata, such as recording device details, topic of recording, bit depth, and sample rate, is also provided.

    The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Hindi language speech recognition models.

    Transcription:

    This dataset provides a manual verbatim transcription of each audio file to enhance your workflow efficiency. The transcriptions are available in JSON format. The transcriptions capture speaker-wise transcription with time-coded segmentation along with non-speech labels and tags.

    Our goal is to expedite the deployment of Hindi language conversational AI and NLP models by offering ready-to-use transcriptions, ultimately saving valuable time and resources in the development process.

    Updates and Customization:

    We understand the importance of collecting data in various environments to build robust ASR models. Therefore, our voice dataset is regularly updated with new audio data captured in diverse real-world conditions.

    If you require a custom training dataset with specific environmental conditions such as in-car, busy street, restaurant, or any other scenario, we can accommodate your request. We can provide voice data with customized sample rates ranging from 8kHz to 48kHz, allowing you to fine-tune your models for different audio recording setups. Additionally, we can also customize the transcription following your specific guidelines and requirements, to further support your ASR development process.

    License:

    This audio dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Whether you are training or fine-tuning speech recognition models, advancing NLP algorithms, exploring generative voice AI, or building cutting-edge voice assistants and bots, our dataset serves as a reliable and valuable resource.

  9.

    Hindi Conversation Chat Dataset for Real Estate Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Hindi Conversation Chat Dataset for Real Estate Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/hindi-realestate-domain-conversation-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The dataset comprises over 12,000 chat conversations, each focusing on a specific Real Estate-related topic. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.

    Participant Details: 200+ native Hindi participants from the FutureBeeAI community.
    Word Count & Length: Chats are diverse, averaging 300 to 700 words and 50 to 150 turns across both speakers.

    Topic Diversity

    The chat dataset covers a wide range of conversations on Real Estate topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Real Estate use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.

    Inbound Chats:
    Property Inquiry
    Rental Property Search & Availability
    Renovation Inquiries
    Property Features & Amenities Inquiry
    Investment Property Analysis & Advice
    Property History & Ownership Details, and many more
    Outbound Chats:
    New Property Listing Update
    Post Purchase Follow-ups
    Investment Opportunities & Property Recommendations
    Property Value Updates
    Customer Satisfaction Surveys, and many more

    Language Variety & Nuances

    The conversations in this dataset capture the diverse language styles and expressions prevalent in Hindi Real Estate interactions. This diversity ensures the dataset accurately represents the language used by Hindi speakers in Real Estate contexts.

    The dataset encompasses a wide array of language elements, including:

    Naming Conventions: Chats include a variety of Hindi personal and business names.
    Localized Details: Real-world addresses, emails, phone numbers, and other contact information, formatted according to the conventions of different Hindi-speaking regions.
    Temporal and Numeric Expressions: Dates, times, currencies, and numbers in Hindi forms, adhering to local conventions.
    Idiomatic Expressions and Slang: Local slang, idioms, and informal phrases present in Hindi Real Estate conversations.

    This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Hindi Real Estate interactions.

    Conversational Flow and Interaction Types

    The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Real Estate customer-agent interactions.

    Simple Inquiries
    Detailed Discussions
    Transactional Interactions
    Problem-Solving Dialogues
    Advisory Sessions
    Routine Checks and Follow-Ups

    Each conversation contains various stages of the conversation flow, such as:

    Greetings
    Authentication
    Information gathering
    Resolution identification
    Solution Delivery
    Closing and Follow-ups

  10.

    hind_encorp

    • huggingface.co
    • paperswithcode.com
    • +3more
    Updated Mar 22, 2014
    Cite
    Pavel Rychlý (2014). hind_encorp [Dataset]. https://huggingface.co/datasets/pary/hind_encorp
    Explore at:
    Dataset updated
    Mar 22, 2014
    Authors
    Pavel Rychlý
    License

    Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0) (https://creativecommons.org/licenses/by-nc-sa/3.0/)
    License information was derived automatically

    Description

    HindEnCorp parallel texts (sentence-aligned) come from the following sources: Tides, which contains 50K sentence pairs taken mainly from news articles. This dataset was originally collected for the DARPA-TIDES surprise-language contest in 2002, later refined at IIIT Hyderabad and provided for the NLP Tools Contest at ICON 2008 (Venkatapathy, 2008).

    Commentaries by Daniel Pipes contain 322 articles in English written by the journalist Daniel Pipes and translated into Hindi.

    EMILLE. This corpus (Baker et al., 2002) consists of three components: monolingual, parallel and annotated corpora. There are fourteen monolingual subcorpora, including both written and (for some languages) spoken data for fourteen South Asian languages. The EMILLE monolingual corpora contain in total 92,799,000 words (including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu). The parallel corpus consists of 200,000 words of text in English and its accompanying translations into Hindi and other languages.

    Smaller datasets as collected by Bojar et al. (2010) include the corpus used at ACL 2005 (a subcorpus of EMILLE), a corpus of named entities from Wikipedia (crawled in 2009), and an Agriculture-domain parallel corpus.

    For the current release, we are extending the parallel corpus using these sources: Intercorp (Čermák and Rosen, 2012) is a large multilingual parallel corpus of 32 languages including Hindi. The central language used for alignment is Czech. Intercorp’s core texts amount to 202 million words. These core texts are most suitable for us because their sentence alignment is manually checked and therefore very reliable. They cover predominantly short stories and novels. There are seven Hindi texts in Intercorp. Unfortunately, the English translation is available for only three of them; the other four are aligned only with Czech texts. The Hindi subcorpus of Intercorp contains 118,000 words in Hindi.

    TED talks, held in various languages, primarily English, are equipped with transcripts, and these are translated into 102 languages. There are 179 talks for which a Hindi translation is available.

    The Indic multi-parallel corpus (Birch et al., 2011; Post et al., 2012) is a corpus of texts from Wikipedia translated from the respective Indian language into English by non-expert translators hired over Mechanical Turk. The quality is thus somewhat mixed in many respects, from typesetting and punctuation through capitalization, spelling, and word choice to sentence structure. A little bit of control could in principle be obtained from the fact that every input sentence was translated 4 times. We used the 2012 release of the corpus.

    Launchpad.net is a software collaboration platform that hosts many open-source projects and facilitates also collaborative localization of the tools. We downloaded all revisions of all the hosted projects and extracted the localization (.po) files.
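
    To illustrate what extracting parallel text from localization (.po) files can look like, here is a minimal sketch that pulls (msgid, msgstr) pairs out of PO text. It is not the HindEnCorp authors' extraction code: the function names and the sample entries are hypothetical, and the parser handles only simple singular entries with well-formed quoted strings (no plural forms or message contexts).

```python
import ast


def _unquote(fragment):
    """Decode one double-quoted PO string fragment, including common escapes."""
    return ast.literal_eval(fragment)


def parse_po_pairs(text):
    """Return (msgid, msgstr) pairs, skipping the header and untranslated entries."""
    pairs = []
    msgid, msgstr, current = "", "", None

    def flush():
        # The header has an empty msgid; untranslated entries have an empty msgstr.
        if msgid and msgstr:
            pairs.append((msgid, msgstr))

    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("msgid "):
            flush()
            msgid, msgstr, current = _unquote(line[len("msgid "):]), "", "msgid"
        elif line.startswith("msgstr "):
            msgstr, current = _unquote(line[len("msgstr "):]), "msgstr"
        elif line.startswith('"') and current == "msgid":
            msgid += _unquote(line)  # continuation line of a multi-line msgid
        elif line.startswith('"') and current == "msgstr":
            msgstr += _unquote(line)  # continuation line of a multi-line msgstr
        else:
            current = None  # comments, blank lines, plural forms, contexts
    flush()
    return pairs


# Hypothetical sample entries, for illustration only.
SAMPLE_PO = '''
msgid ""
msgstr ""
"Content-Type: text/plain; charset=UTF-8\\n"

#: ui/menu.c:12
msgid "Open"
msgstr "खोलें"

msgid "Close"
msgstr ""

msgid "Save "
"as"
msgstr "इस रूप में सहेजें"
'''

if __name__ == "__main__":
    for source, target in parse_po_pairs(SAMPLE_PO):
        print(source, "->", target)
```

    For real projects, a dedicated PO library is likely a safer choice than this hand-rolled parser, since the format also includes plural forms, contexts, and obsolete entries.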

    Other smaller datasets. This time, we added Wikipedia entities as crawled in 2013 (including any morphological variants of the named entity that appear on the Hindi variant of the Wikipedia page) and words, word examples and quotes from the Shabdkosh online dictionary.

  11.

    english-hindi-colloquial-dataset

    • huggingface.co
    Updated Feb 21, 2025
    Cite
    deeksha bajpai (2025). english-hindi-colloquial-dataset [Dataset]. https://huggingface.co/datasets/bajpaideeksha/english-hindi-colloquial-dataset
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 21, 2025
    Authors
    deeksha bajpai
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    A curated dataset of colloquial English phrases and their corresponding Hindi translations. This dataset focuses on informal language, including slang, idioms, and everyday expressions, making it ideal for training models that handle casual conversations. Dataset Details: Size: [e.g., 500+ phrase pairs]. Source: Collected from publicly available conversational datasets, social media, and crowdsourced contributions. Language Pair: English → Hindi. Annotations: Each phrase pair is manually verified… See the full description on the dataset page: https://huggingface.co/datasets/bajpaideeksha/english-hindi-colloquial-dataset.

  12.

    494 Hours - Hindi(India) Real-world Casual Conversation and Monologue speech...

    • m.nexdata.ai
    Updated Nov 11, 2023
    Cite
    Nexdata (2023). 494 Hours - Hindi(India) Real-world Casual Conversation and Monologue speech dataset [Dataset]. https://m.nexdata.ai/datasets/speechrecog/1269
    Explore at:
    Dataset updated
    Nov 11, 2023
    Dataset provided by
    nexdata technology inc
    Authors
    Nexdata
    Area covered
    World
    Variables measured
    Format, Country, Language, Accuracy Rate, Content category, Language(Region) Code, Recording environment, Features of annotation
    Description

    Hindi (India) real-world casual conversation and monologue speech dataset. It covers the education, interview, and sports domains and mirrors real-world interactions. Transcribed with text content, speaker ID, gender, and other attributes. The dataset was collected from a large, geographically diverse pool of speakers, enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring the maintenance of user privacy and legal rights throughout the data collection, storage, and usage processes. All our datasets comply with GDPR, CCPA, and PIPL.

  13.

    Hindi Conversation Chat Dataset for Delivery & Logistics Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Hindi Conversation Chat Dataset for Delivery & Logistics Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/hindi-delivery-domain-conversation-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The dataset comprises over 12,000 chat conversations, each focusing on a specific Delivery & Logistics-related topic. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.

    Participant Details: 200+ native Hindi participants from the FutureBeeAI community.
    Word Count & Length: Chats are diverse, averaging 300 to 700 words and 50 to 150 turns across both speakers.

    Topic Diversity

    The chat dataset covers a wide range of conversations on Delivery & Logistics topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Delivery & Logistics use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.

    Inbound Chats:
    Order Tracking
    Delivery Complaint
    Undeliverable Address
    Delivery Method Selection
    Return Process Enquiry
    Order Modification, and many more
    Outbound Chats:
    Delivery Confirmation
    Delivery Subscription
    Incorrect Address
    Missed Delivery Attempt
    Delivery Feedback
    Out-of-Stock Notification
    Delivery Satisfaction Survey, and many more

    Language Variety & Nuances

    The conversations in this dataset capture the diverse language styles and expressions prevalent in Hindi Delivery & Logistics interactions. This diversity ensures the dataset accurately represents the language used by Hindi speakers in Delivery & Logistics contexts.

    The dataset encompasses a wide array of language elements, including:

    Naming Conventions: Chats include a variety of Hindi personal and business names.
    Localized Details: Real-world addresses, emails, phone numbers, and other contact information appropriate to different Hindi-speaking regions.
    Temporal and Numeric Expressions: Dates, times, currencies, and numbers in Hindi forms, adhering to local conventions.
    Idiomatic Expressions and Slang: Local slang, idioms, and informal phrases found in Hindi Delivery & Logistics conversations.

    This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Hindi Delivery & Logistics interactions.

    Conversational Flow and Interaction Types

    The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Delivery & Logistics customer-agent interactions.

    Simple Inquiries
    Detailed Discussions
    Transactional Interactions
    Problem-Solving Dialogues
    Advisory Sessions
    Routine Checks and Follow-Ups

    Each of these conversations contains various aspects of conversation flow like:

    Greetings
    Authentication
    Information gathering
    Resolution identification
    Solution

  14. F

    Hindi Conversation Chat Dataset for Retail & E-commerce Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    FutureBee AI (2022). Hindi Conversation Chat Dataset for Retail & E-commerce Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/hindi-retail-domain-conversation-text-dataset
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The dataset comprises over 12,000 chat conversations, each focusing on specific Retail & E-Commerce-related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.

    Participants Details: 200+ native Hindi participants from the FutureBeeAI community.
    Word Count & Length: Chats are diverse, averaging 300 to 700 words and 50 to 150 turns across both speakers.

    Topic Diversity

    The chat dataset covers a wide range of conversations on Retail & E-Commerce topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Retail & E-Commerce use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.

    Inbound Chats:
    Product Inquiry
    Return/Exchange Request
    Order Cancellation
    Refund Request
    Membership/Subscriptions Enquiry
    Order Cancellations, and many more
    Outbound Chats:
    Order Confirmation
    Cross-selling and Upselling
    Account Updates
    Loyalty Program Offers
    Special Offers and Promotions
    Customer Verification, and many more

    Language Variety & Nuances

    The conversations in this dataset capture the diverse language styles and expressions prevalent in Hindi Retail & E-Commerce interactions. This diversity ensures the dataset accurately represents the language used by Hindi speakers in Retail & E-Commerce contexts.

    The dataset encompasses a wide array of language elements, including:

    Naming Conventions: Chats include a variety of Hindi personal and business names.
    Localized Details: Real-world addresses, emails, phone numbers, and other contact information appropriate to different Hindi-speaking regions.
    Temporal and Numeric Expressions: Dates, times, currencies, and numbers in Hindi forms, adhering to local conventions.
    Idiomatic Expressions and Slang: Local slang, idioms, and informal phrases found in Hindi Retail & E-Commerce conversations.

    This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Hindi Retail & E-Commerce interactions.

    Conversational Flow and Interaction Types

    The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Retail & E-Commerce customer-agent interactions.

    Simple Inquiries
    Detailed Discussions
    Transactional Interactions
    Problem-Solving Dialogues
    Advisory Sessions
    Routine Checks and Follow-Ups

    Each of these conversations contains various aspects of conversation flow like:

    Greetings
    Authentication
    Information gathering
    Resolution identification
    Solution Delivery
    Closing and Follow-ups

  15. m

    Indo-HateSpeech

    • data.mendeley.com
    Updated Dec 2, 2024
    Pravin Kaware (2024). Indo-HateSpeech [Dataset]. http://doi.org/10.17632/snc7mxpj6t.1
    Dataset updated
    Dec 2, 2024
    Authors
    Pravin Kaware
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Indo-HateSpeech dataset is a Hindi-English code-mixed dataset specifically designed for identifying hate speech in social media platforms. Given the multilingual nature of Indian social media users, code-mixing (the blending of two or more languages within a conversation) is prevalent, especially between Hindi and English.

  16. F

    Hindi Conversation Chat Dataset for Travel Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    FutureBee AI (2022). Hindi Conversation Chat Dataset for Travel Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/hindi-travel-domain-conversation-text-dataset
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The dataset comprises over 12,000 chat conversations, each focusing on specific Travel-related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.

    Participants Details: 200+ native Hindi participants from the FutureBeeAI community.
    Word Count & Length: Chats are diverse, averaging 300 to 700 words and 50 to 150 turns across both speakers.

    Topic Diversity

    The chat dataset covers a wide range of conversations on Travel topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Travel use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.

    Inbound Calls:
    Booking Inquiries & Assistance
    Destination Information & Recommendations
    Flight Delays or Cancellation Assistance
    Assistance for Disabled Passengers
    Travel-related Health & Safety Inquiry
    Lost or Delayed Baggage Assistance, and many more
    Outbound Calls:
    Promotional Offers & Package Deals
    Customer Satisfaction Surveys
    Booking Confirmations & Updates
    Flight Schedule Changes & Notifications
    Customer Feedback Collection
    Visa Expiration Reminders, and many more

    Language Variety & Nuances

    The conversations in this dataset capture the diverse language styles and expressions prevalent in Hindi Travel interactions. This diversity ensures the dataset accurately represents the language used by Hindi speakers in Travel contexts.

    The dataset encompasses a wide array of language elements, including:

    Naming Conventions: Chats include a variety of Hindi personal and business names.
    Localized Details: Real-world addresses, emails, phone numbers, and other contact information appropriate to different Hindi-speaking regions.
    Temporal and Numeric Expressions: Dates, times, currencies, and numbers in Hindi forms, adhering to local conventions.
    Idiomatic Expressions and Slang: Local slang, idioms, and informal phrases found in Hindi Travel conversations.

    This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Hindi Travel interactions.

    Conversational Flow and Interaction Types

    The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Travel customer-agent interactions.

    Simple Inquiries
    Detailed Discussions
    Transactional Interactions
    Problem-Solving Dialogues
    Advisory Sessions
    Routine Checks and Follow-Ups

    Each of these conversations contains various aspects of conversation flow like:

    Greetings
    Authentication
    Information gathering
    Resolution identification
    Solution Delivery

  17. g

    Hinglish media Audio Dataset

    • gts.ai
    json
    Updated Nov 24, 2023
    GTS (2023). Hinglish media Audio Dataset [Dataset]. https://gts.ai/case-study/hinglish-media-audio-dataset-speech-and-voice-for-ai-and-ml/
    Available download formats: json
    Dataset updated
    Nov 24, 2023
    Dataset provided by
    GLOBOSE TECHNOLOGY SOLUTIONS PRIVATE LIMITED
    Authors
    GTS
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The “Hinglish Media Audio Dataset” project is designed to create a comprehensive audio dataset that combines Hindi and English languages (Hinglish) for advanced speech recognition applications.

  18. h

    hindi-speech-recognition-dataset

    • huggingface.co
    Updated Mar 7, 2025
    UniData (2025). hindi-speech-recognition-dataset [Dataset]. https://huggingface.co/datasets/UniDataPro/hindi-speech-recognition-dataset
    Dataset updated
    Mar 7, 2025
    Authors
    UniData
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Hindi Speech Dataset for recognition task

    Dataset comprises 760 hours of telephone dialogues in Hindi, collected from 1,000+ native speakers across various topics and domains. This dataset boasts an impressive 95% sentence accuracy rate, making it a valuable resource for advancing speech recognition technology. By utilizing this dataset, researchers and developers can advance their understanding and capabilities in automatic speech recognition (ASR) systems, transcribing audio, and… See the full description on the dataset page: https://huggingface.co/datasets/UniDataPro/hindi-speech-recognition-dataset.

  19. F

    Hindi Conversation Chat Dataset for BFSI Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    FutureBee AI (2022). Hindi Conversation Chat Dataset for BFSI Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/hindi-bfsi-domain-conversation-text-dataset
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The dataset comprises over 12,000 chat conversations, each focusing on specific BFSI-related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.

    Participants Details: 200+ native Hindi participants from the FutureBeeAI community.
    Word Count & Length: Chats are diverse, averaging 300 to 700 words and 50 to 150 turns across both speakers.

    Topic Diversity

    The chat dataset covers a wide range of conversations on BFSI topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various BFSI use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.

    Inbound Chats:
    Account Opening
    Account Management
    Transactions
    Loan Inquiries & Applications
    Credit Card Services, and many more
    Outbound Chats:
    Product & Service Promotions
    Cross-selling & Upselling
    Customer Retention & Loyalty Programs
    Loan Application Follow-ups
    Insurance Policy Renewals/Reminders, and many more

    Language Variety & Nuances

    The conversations in this dataset capture the diverse language styles and expressions prevalent in Hindi BFSI interactions. This diversity ensures the dataset accurately represents the language used by Hindi speakers in BFSI contexts.

    The dataset encompasses a wide array of language elements, including:

    Naming Conventions: Chats include a variety of Hindi personal and business names.
    Localized Details: Real-world addresses, emails, phone numbers, and other contact information appropriate to different Hindi-speaking regions.
    Temporal and Numeric Expressions: Dates, times, currencies, and numbers in Hindi forms, adhering to local conventions.
    Idiomatic Expressions and Slang: Local slang, idioms, and informal phrases found in Hindi BFSI conversations.

    This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Hindi BFSI interactions.

    Conversational Flow and Interaction Types

    The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of BFSI customer-agent interactions.

    Simple Inquiries
    Detailed Discussions
    Transactional Interactions
    Problem-Solving Dialogues
    Advisory Sessions
    Routine Checks and Follow-Ups

    Each of these conversations contains various aspects of conversation flow like:

    Greetings
    Authentication
    Information gathering
    Resolution identification
    Solution Delivery
    Closing and Follow-ups
    Feedback, etc

    This structured and varied conversational flow enables the creation of advanced NLP models that can effectively manage and respond to a wide range of customer service scenarios.

    Data Format and Structure

    The

  20. 302 Person - Hindi and English Bilingual Spontaneous Monologue smartphone...

    • m.nexdata.ai
    Updated Feb 29, 2024
    Nexdata (2024). 302 Person - Hindi and English Bilingual Spontaneous Monologue smartphone speech dataset [Dataset]. https://m.nexdata.ai/datasets/speechrecog/1420
    Dataset updated
    Feb 29, 2024
    Dataset authored and provided by
    Nexdata
    Variables measured
    Format, Country, Speaker, Language, Content category, Recording device, Recording condition
    Description

    Hindi and English bilingual spontaneous monologue smartphone speech dataset, collected from dialogues based on given topics and covering the generic domain. The dataset was collected from a large and geographically diverse pool of speakers (302 people in total, ages 18 to 46), which is intended to improve model performance on real-world, complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, maintaining user privacy and legal rights throughout data collection, storage, and usage; all our datasets comply with GDPR, CCPA, and PIPL.

FutureBee AI (2022). Hindi Conversation Chat Dataset for Telecom Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/hindi-telecom-domain-conversation-text-dataset

Hindi Conversation Chat Dataset for Telecom Domain

Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License

https://www.futurebeeai.com/data-license-agreement

Dataset funded by
FutureBeeAI
Description

Introduction

The dataset comprises over 12,000 chat conversations, each focusing on specific Telecom-related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.

Participants Details: 200+ native Hindi participants from the FutureBeeAI community.
Word Count & Length: Chats are diverse, averaging 300 to 700 words and 50 to 150 turns across both speakers.

Topic Diversity

The chat dataset covers a wide range of conversations on Telecom topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Telecom use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.

Inbound Chats:
Phone Number Porting
Network Connectivity Issues
Billing and Payments
Technical Support
Service Activation
International Roaming Enquiry
Refunds and Billing Adjustments
Emergency Service Access, and many more
Outbound Chats:
Welcome Calls / Onboarding Process
Payment Reminders
Customer Surveys
Technical Updates
Service Usage Reviews
Network Complaint Update, and many more

Language Variety & Nuances

The conversations in this dataset capture the diverse language styles and expressions prevalent in Hindi Telecom interactions. This diversity ensures the dataset accurately represents the language used by Hindi speakers in Telecom contexts.

The dataset encompasses a wide array of language elements, including:

Naming Conventions: Chats include a variety of Hindi personal and business names.
Localized Details: Real-world addresses, emails, phone numbers, and other contact information appropriate to different Hindi-speaking regions.
Temporal and Numeric Expressions: Dates, times, currencies, and numbers in Hindi forms, adhering to local conventions.
Idiomatic Expressions and Slang: Local slang, idioms, and informal phrases found in Hindi Telecom conversations.

This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Hindi Telecom interactions.

Conversational Flow and Interaction Types

The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Telecom customer-agent interactions.

Simple Inquiries
Detailed Discussions
Transactional Interactions
Problem-Solving Dialogues
Advisory Sessions
Routine Checks and Follow-Ups

Each of these conversations contains various aspects of conversation flow like:

Greetings
Authentication
Information gathering
Resolution identification