100+ datasets found
  1. F

    Hindi Conversation Chat Dataset for Telecom Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Hindi Conversation Chat Dataset for Telecom Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/hindi-telecom-domain-conversation-text-dataset
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The dataset comprises over 12,000 chat conversations, each focusing on specific Telecom-related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.

    • Participant Details: 200+ native Hindi participants from the FutureBeeAI community.
    • Word Count & Length: Chats are diverse, averaging 300 to 700 words and 50 to 150 turns across both speakers.

    Topic Diversity

    The chat dataset covers a wide range of conversations on Telecom topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Telecom use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.

    • Inbound Chats:
        • Phone Number Porting
        • Network Connectivity Issues
        • Billing and Payments
        • Technical Support
        • Service Activation
        • International Roaming Enquiry
        • Refunds and Billing Adjustments
        • Emergency Service Access, and many more
    • Outbound Chats:
        • Welcome Calls / Onboarding Process
        • Payment Reminders
        • Customer Surveys
        • Technical Updates
        • Service Usage Reviews
        • Network Complaint Update, and many more

    Language Variety & Nuances

    The conversations in this dataset capture the diverse language styles and expressions prevalent in Hindi Telecom interactions. This diversity ensures the dataset accurately represents the language used by Hindi speakers in Telecom contexts.

    The dataset encompasses a wide array of language elements, including:

    • Naming Conventions: Chats include a variety of Hindi personal and business names.
    • Localized Details: Real-world addresses, emails, phone numbers, and other contact information according to different Hindi-speaking regions.
    • Temporal and Numeric Expressions: Dates, times, currencies, and numbers in Hindi forms, adhering to local conventions.
    • Idiomatic Expressions and Slang: Local slang, idioms, and informal phrases present in Hindi Telecom conversations.

    This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Hindi Telecom interactions.

    Conversational Flow and Interaction Types

    The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Telecom customer-agent interactions.

    • Simple Inquiries
    • Detailed Discussions
    • Transactional Interactions
    • Problem-Solving Dialogues
    • Advisory Sessions
    • Routine Checks and Follow-Ups

    Each of these conversations contains various aspects of conversation flow like:

    • Greetings
    • Authentication
    • Information gathering
    • Resolution identification

  2. F

    Hindi Conversation Chat Dataset for Healthcare Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Hindi Conversation Chat Dataset for Healthcare Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/hindi-healthcare-domain-conversation-text-dataset
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The dataset comprises over 12,000 chat conversations, each focusing on specific Healthcare-related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.

    • Participant Details: 200+ native Hindi participants from the FutureBeeAI community.
    • Word Count & Length: Chats are diverse, averaging 300 to 700 words and 50 to 150 turns across both speakers.

    Topic Diversity

    The chat dataset covers a wide range of conversations on Healthcare topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Healthcare use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.

    • Inbound Chats:
        • Appointment Scheduling
        • New Patient Registration
        • Surgery Consultation
        • Consultation regarding Diet, and many more
    • Outbound Chats:
        • Appointment Reminder
        • Health & Wellness Subscription Programs
        • Lab Test Results
        • Health Risk Assessments
        • Preventive Care Reminders, and many more

    Language Variety & Nuances

    The conversations in this dataset capture the diverse language styles and expressions prevalent in Hindi Healthcare interactions. This diversity ensures the dataset accurately represents the language used by Hindi speakers in Healthcare contexts.

    The dataset encompasses a wide array of language elements, including:

    • Naming Conventions: Chats include a variety of Hindi personal and business names.
    • Localized Details: Real-world addresses, emails, phone numbers, and other contact information according to different Hindi-speaking regions.
    • Temporal and Numeric Expressions: Dates, times, currencies, and numbers in Hindi forms, adhering to local conventions.
    • Idiomatic Expressions and Slang: Local slang, idioms, and informal phrases present in Hindi Healthcare conversations.

    This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Hindi Healthcare interactions.

    Conversational Flow and Interaction Types

    The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Healthcare customer-agent interactions.

    • Simple Inquiries
    • Detailed Discussions
    • Transactional Interactions
    • Problem-Solving Dialogues
    • Advisory Sessions
    • Routine Checks and Follow-Ups

    Each of these conversations contains various aspects of conversation flow like:

    • Greetings
    • Authentication
    • Information gathering
    • Resolution identification
    • Solution Delivery
    • Closing and Follow-ups
    • Feedback, etc.

    This structured and varied conversational flow enables the creation of advanced NLP models that can effectively manage and respond to a wide range of customer service scenarios.

    Data Format and Structure

    The dataset is available in JSON, CSV, and TXT formats, with each conversation containing attributes like participant identifiers and chat messages, designed to

  3. s

    Hindi Language Datasets | Audio Data for ASR, Virtual Assistant

    • fr.shaip.com
    • pl.shaip.com
    • +25more
    Updated Jul 17, 2023
    Cite
    Shaip (2023). Hindi Language Datasets | Audio Data for ASR, Virtual Assistant [Dataset]. https://fr.shaip.com/offerings/speech-data-catalog/hindi-dataset/
    Dataset updated
    Jul 17, 2023
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Enhance your Conversational AI model with our Off-the-Shelf Hindi Language Datasets. Shaip's high-quality audio datasets are a quick and effective solution for model training.

  4. i

    Speech Dataset in Hindi Language

    • ieee-dataport.org
    Updated Jun 9, 2020
    Cite
    Shivam Shukla (2020). Speech Dataset in Hindi Language [Dataset]. http://doi.org/10.21227/1187-yv12
    Dataset updated
    Jun 9, 2020
    Dataset provided by
    IEEE Dataport
    Authors
    Shivam Shukla
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset covers 100 speakers, each contributing 5 voice samples for training and 1 voice sample for testing, for a total of 600 voice samples. The samples were collected in different audio formats (mpeg, mp4, mp3, ogg, etc.), then preprocessed and converted into .wav format. Each voice sample lasts 5-10 seconds; because the lengths differ, parameters should be tuned before use. The whole dataset is 600 MB in size and 1 hour 40 minutes in duration. It can be used for speech synthesis, speaker identification, speaker recognition, speech recognition, etc. Preprocessing of the data is required.
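    The figures in this description can be cross-checked with a few lines of arithmetic. The sketch below only uses the numbers quoted above (100 speakers, 5 + 1 samples each, 5-10 seconds per sample, 1 hour 40 minutes in total); the variable names are illustrative and not part of the dataset.

```python
# Cross-check the sample count and total duration quoted in the description.
speakers = 100
samples_per_speaker = 5 + 1          # 5 training samples + 1 test sample
total_samples = speakers * samples_per_speaker

# Each sample is stated to last between 5 and 10 seconds.
min_minutes = total_samples * 5 / 60
max_minutes = total_samples * 10 / 60

stated_minutes = 1 * 60 + 40         # "1 hour 40 minutes"

print(f"total samples: {total_samples}")
print(f"duration range: {min_minutes:.0f} to {max_minutes:.0f} minutes")
print(f"stated duration: {stated_minutes} minutes")
```

    The stated total duration coincides with the upper end of the range implied by the per-sample length.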

  5. F

    Hindi (India) General Conversation Speech Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Hindi (India) General Conversation Speech Dataset [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-hindi-india
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What's Included

    Welcome to the Hindi Language General Conversation Speech Dataset, a comprehensive and diverse collection of voice data specifically curated to advance the development of Hindi language speech recognition models, with a particular focus on Indian accents and dialects.

    With high-quality audio recordings, detailed metadata, and accurate transcriptions, it empowers researchers and developers to enhance natural language processing, conversational AI, and Generative Voice AI algorithms. Moreover, it facilitates the creation of sophisticated voice assistants and voice bots tailored to the unique linguistic nuances found in the Hindi language spoken in India.

    Speech Data:

    This training dataset comprises 150 hours of audio recordings covering a wide range of topics and scenarios, ensuring robustness and accuracy in speech technology applications. To achieve this, we collaborated with a diverse network of 160 native Hindi speakers from different parts of India. This collaborative effort guarantees a balanced representation of Indian accents, dialects, and demographics, reducing biases and promoting inclusivity.

    Each audio recording captures the essence of spontaneous, unscripted conversations between two individuals, with an average duration ranging from 15 to 60 minutes. The speech data is available in WAV format, with stereo channel files having a bit depth of 16 bits and a sample rate of 8 kHz. The recording environment is generally quiet, without background noise and echo.
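    A downloaded recording can be checked against these stated audio properties (stereo, 16-bit, 8 kHz) before training. The following is a minimal sketch using Python's standard `wave` module; the function names and the in-memory silent test file are assumptions made for illustration and are not part of the dataset or any FutureBeeAI tooling.

```python
import io
import wave


def wav_matches_spec(source, channels=2, sample_width_bytes=2, sample_rate_hz=8000):
    """Return True if the WAV header matches the expected layout.

    Defaults reflect the description above: stereo, 16-bit (2 bytes), 8 kHz.
    `source` may be a file path or a binary file-like object.
    """
    with wave.open(source, "rb") as wav:
        return (
            wav.getnchannels() == channels
            and wav.getsampwidth() == sample_width_bytes
            and wav.getframerate() == sample_rate_hz
        )


def make_silent_wav(channels, sample_width_bytes, sample_rate_hz, seconds=1):
    """Build an in-memory WAV file of silence (hypothetical stand-in for a real recording)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width_bytes)
        wav.setframerate(sample_rate_hz)
        wav.writeframes(b"\x00" * channels * sample_width_bytes * sample_rate_hz * seconds)
    buffer.seek(0)
    return buffer


ok = wav_matches_spec(make_silent_wav(2, 2, 8000))        # matches the stated spec
mismatch = wav_matches_spec(make_silent_wav(1, 2, 16000))  # mono, 16 kHz: does not match
print(ok, mismatch)
```

    With real data, pass the path of a .wav file from the dataset instead of the synthetic buffer.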

    Metadata:

    In addition to the audio recordings, our dataset provides comprehensive metadata for each participant. This metadata includes the participant's age, gender, country, state, and dialect. Additional metadata such as recording device details, topic of recording, bit depth, and sample rate is also provided.

    The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Hindi language speech recognition models.

    Transcription:

    This dataset provides a manual verbatim transcription of each audio file to enhance your workflow efficiency. The transcriptions are available in JSON format. The transcriptions capture speaker-wise transcription with time-coded segmentation along with non-speech labels and tags.

    Our goal is to expedite the deployment of Hindi language conversational AI and NLP models by offering ready-to-use transcriptions, ultimately saving valuable time and resources in the development process.

    Updates and Customization:

    We understand the importance of collecting data in various environments to build robust ASR models. Therefore, our voice dataset is regularly updated with new audio data captured in diverse real-world conditions.

    If you require a custom training dataset with specific environmental conditions such as in-car, busy street, restaurant, or any other scenario, we can accommodate your request. We can provide voice data with customized sample rates ranging from 8kHz to 48kHz, allowing you to fine-tune your models for different audio recording setups. Additionally, we can also customize the transcription following your specific guidelines and requirements, to further support your ASR development process.

    License:

    This audio dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Whether you are training or fine-tuning speech recognition models, advancing NLP algorithms, exploring generative voice AI, or building cutting-edge voice assistants and bots, our dataset serves as a reliable and valuable resource.

  6. n

    494 Hours - Hindi(India) Real-world Casual Conversation and Monologue speech...

    • m.nexdata.ai
    • nexdata.ai
    Updated Jul 10, 2024
    Cite
    Nexdata (2024). 494 Hours - Hindi(India) Real-world Casual Conversation and Monologue speech dataset [Dataset]. https://m.nexdata.ai/dataset/1269
    Dataset updated
    Jul 10, 2024
    Dataset provided by
    nexdata technology inc
    Authors
    Nexdata
    Area covered
    India, World
    Variables measured
    Format, Country, Language, Accuracy Rate, Content category, Language(Region) Code, Recording environment, Features of annotation
    Description

    The Hindi (India) Real-world Casual Conversation and Monologue speech dataset covers the education, interview, and sports domains and mirrors real-world interactions. It is transcribed with text content, speaker ID, gender, and other attributes. The dataset was collected from a large and geographically diverse pool of speakers, enhancing model performance in real and complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout data collection, storage, and usage; all our datasets are GDPR, CCPA, and PIPL compliant.

  7. h

    indic-instruct-data-v0.1

    • huggingface.co
    Updated Jan 25, 2024
    Cite
    AI4Bharat (2024). indic-instruct-data-v0.1 [Dataset]. https://huggingface.co/datasets/ai4bharat/indic-instruct-data-v0.1
    Dataset updated
    Jan 25, 2024
    Dataset authored and provided by
    AI4Bharat
    Description

    Indic Instruct Data v0.1

    A collection of different instruction datasets spanning English and Hindi languages. The collection consists of:

    • Anudesh
    • wikiHow
    • Flan v2 (67k sample subset)
    • Dolly
    • Anthropic-HHH (5k sample subset)
    • OpenAssistant v1
    • LymSys-Chat (50k sample subset)

    We translate the English subset of specific datasets using IndicTrans2 (Gala et al., 2023). The chrF++ scores of the back-translated example and the corresponding example are provided for quality assessment of… See the full description on the dataset page: https://huggingface.co/datasets/ai4bharat/indic-instruct-data-v0.1.

  8. F

    Hindi Conversation Chat Dataset for Delivery & Logistics Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Hindi Conversation Chat Dataset for Delivery & Logistics Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/hindi-delivery-domain-conversation-text-dataset
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The dataset comprises over 12,000 chat conversations, each focusing on specific Delivery & Logistics-related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.

    • Participant Details: 200+ native Hindi participants from the FutureBeeAI community.
    • Word Count & Length: Chats are diverse, averaging 300 to 700 words and 50 to 150 turns across both speakers.

    Topic Diversity

    The chat dataset covers a wide range of conversations on Delivery & Logistics topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Delivery & Logistics use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.

    • Inbound Chats:
        • Order Tracking
        • Delivery Complaint
        • Undeliverable Address
        • Delivery Method Selection
        • Return Process Enquiry
        • Order Modification, and many more
    • Outbound Chats:
        • Delivery Confirmation
        • Delivery Subscription
        • Incorrect Address
        • Missed Delivery Attempt
        • Delivery Feedback
        • Out-of-Stock Notification
        • Delivery Satisfaction Survey, and many more

    Language Variety & Nuances

    The conversations in this dataset capture the diverse language styles and expressions prevalent in Hindi Delivery & Logistics interactions. This diversity ensures the dataset accurately represents the language used by Hindi speakers in Delivery & Logistics contexts.

    The dataset encompasses a wide array of language elements, including:

    • Naming Conventions: Chats include a variety of Hindi personal and business names.
    • Localized Details: Real-world addresses, emails, phone numbers, and other contact information according to different Hindi-speaking regions.
    • Temporal and Numeric Expressions: Dates, times, currencies, and numbers in Hindi forms, adhering to local conventions.
    • Idiomatic Expressions and Slang: Local slang, idioms, and informal phrases present in Hindi Delivery & Logistics conversations.

    This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Hindi Delivery & Logistics interactions.

    Conversational Flow and Interaction Types

    The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Delivery & Logistics customer-agent interactions.

    • Simple Inquiries
    • Detailed Discussions
    • Transactional Interactions
    • Problem-Solving Dialogues
    • Advisory Sessions
    • Routine Checks and Follow-Ups

    Each of these conversations contains various aspects of conversation flow like:

    • Greetings
    • Authentication
    • Information gathering
    • Resolution identification
    • Solution

  9. E

    Hindi Visual Genome 1.1

    • live.european-language-grid.eu
    • lindat.mff.cuni.cz
    binary format
    Updated Dec 31, 2019
    Cite
    (2019). Hindi Visual Genome 1.1 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/1309
    Available download formats: binary format
    Dataset updated
    Dec 31, 2019
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Data

    ----

    Hindi Visual Genome 1.1 is an updated version of Hindi Visual Genome 1.0. The update concerns primarily the text part of Hindi Visual Genome, fixing translation issues reported during WAT 2019 multimodal task. In the image part, only one segment and thus one image were removed from the dataset.

    Hindi Visual Genome 1.1 serves in "WAT 2020 Multi-Modal Machine Translation Task".

    Hindi Visual Genome is a multimodal dataset consisting of text and images suitable for English-to-Hindi multimodal machine translation task and multimodal research. We have selected short English segments (captions) from Visual Genome along with associated images and automatically translated them to Hindi with manual post-editing, taking the associated images into account.

    The training set contains 29K segments. A further 1K and 1.6K segments are provided in the development and test sets, respectively, which follow the same (random) sampling from the original Hindi Visual Genome.

    A third test set, called the "challenge test set", consists of 1.4K segments and was released for the WAT2019 multi-modal task. The challenge test set was created by searching for (particularly) ambiguous English words based on the embedding similarity and manually selecting those where the image helps to resolve the ambiguity. The surrounding words in the sentence, however, also often include sufficient cues to identify the correct meaning of the ambiguous word.

    Dataset Formats

    --------------

    The multimodal dataset contains both text and images.

    The text parts of the dataset (train and test sets) are in simple tab-delimited plain text files.

    All the text files have seven columns as follows:

    Column1 - image_id

    Column2 - X

    Column3 - Y

    Column4 - Width

    Column5 - Height

    Column6 - English Text

    Column7 - Hindi Text

    The image part contains the full images with the corresponding image_id as the file name. The X, Y, Width and Height columns indicate the rectangular region in the image described by the caption.
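    The seven-column layout described above can be read with Python's standard `csv` module. This is a minimal sketch, assuming the files have no header row and that the coordinate columns hold integers; the `Segment` class, the `parse_segments` function, and the sample row are made up for illustration and do not come from the dataset.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class Segment:
    image_id: str   # also the file name of the image
    x: int          # region origin and size within the image
    y: int
    width: int
    height: int
    english: str
    hindi: str


def parse_segments(lines: Iterable[str]) -> Iterator[Segment]:
    """Yield one Segment per well-formed tab-delimited line."""
    # QUOTE_NONE keeps quotation marks inside captions as literal text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 7:
            continue  # skip blank or malformed lines
        image_id, x, y, width, height, english, hindi = row
        yield Segment(image_id, int(x), int(y), int(width), int(height), english, hindi)


# Invented example row; real rows come from the dataset's train/dev/test files.
sample = "2328549\t129\t255\t158\t95\tA red umbrella\tएक लाल छाता\n"
segments = list(parse_segments(io.StringIO(sample)))
print(segments[0])
```

    To read a real file, pass an open text file (opened with `encoding="utf-8"` and `newline=""`) instead of the `io.StringIO` sample.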

    Data Statistics

    ----------------

    The statistics of the current release are given below.

    Parallel Corpus Statistics

    ---------------------------

    Dataset          Segments   English Words   Hindi Words
    --------------   --------   -------------   -----------
    Train               28930          143164        145448
    Dev                   998            4922          4978
    Test                 1595            7853          7852
    Challenge Test       1400            8186          8639
    --------------   --------   -------------   -----------
    Total               32923          164125        166917

    The word counts are approximate, prior to tokenization.

    Citation

    --------

    If you use this corpus, please cite the following paper:

    @article{hindi-visual-genome:2019,

    title={{Hindi Visual Genome: A Dataset for Multimodal English-to-Hindi Machine Translation}},

    author={Parida, Shantipriya and Bojar, Ond{\v{r}}ej and Dash, Satya Ranjan},

    journal={Computaci{\'o}n y Sistemas},

    volume={23},

    number={4},

    pages={1499--1505},

    year={2019}

    }

  10. s

    Hindi-English Off-the-Shelf Datasets

    • bn.shaip.com
    • no.shaip.com
    • +3more
    json
    Updated Jan 10, 2023
    Cite
    Shaip (2023). Hindi-English Off-the-Shelf Datasets [Dataset]. https://bn.shaip.com/offerings/speech-data-catalog/
    Available download formats: json
    Dataset updated
    Jan 10, 2023
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Off-the-shelf Hindi-English (Hinglish) audio dataset with a total volume of 800 hours, split into 300 hours of 8 kHz unscripted, synthetic telephonic call center conversations ('agent' and 'customer') and 500 hours of 16 kHz public domain media and podcast audio/video conversations. Topics include Agriculture, Art, Aviation, Banking, Consumer, Crime, Culture, Delivery, Entertainment, Finance, Food, Gaming, Health, Hospitality, IT, Insurance, Legal, News, Oil, Politics, Real Estate, Religion, Retail, Spirituality, Sports, Technology, Telecom, Travel, Weather, Automotive. Audio format: .wav; transcription format: .json.

  11. h

    BB_HindiHinglishV2

    • huggingface.co
    Updated Dec 31, 2023
    Cite
    Rohan (2023). BB_HindiHinglishV2 [Dataset]. https://huggingface.co/datasets/rohansolo/BB_HindiHinglishV2
    Dataset updated
    Dec 31, 2023
    Authors
    Rohan
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Overview

    This dataset is a comprehensive collection of popular Hindi instruction-type datasets. It has been meticulously curated and merged into a unified format, making it ideal for use with Hugging Face's alignment notebook. The primary objective of creating this dataset is to offer a single, standardized resource for training models in understanding and generating Hindi and Hinglish (Hindi-English) conversations.

    Data Sources

    The dataset is an amalgamation of several individual datasets… See the full description on the dataset page: https://huggingface.co/datasets/rohansolo/BB_HindiHinglishV2.

  12. E

    Hindi Visual Genome 1.0

    • live.european-language-grid.eu
    • lindat.mff.cuni.cz
    binary format
    Updated May 29, 2019
    Cite
    (2019). Hindi Visual Genome 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/1270
    Available download formats: binary format
    Dataset updated
    May 29, 2019
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Data

    ----

    Hindi Visual Genome 1.0 is a multimodal dataset consisting of text and images suitable for English-to-Hindi multimodal machine translation task and multimodal research. We have selected short English segments (captions) from Visual Genome along with associated images and automatically translated them to Hindi with manual post-editing, taking the associated images into account. The training set contains 29K segments. A further 1K and 1.6K segments are provided in the development and test sets, respectively, which follow the same (random) sampling from the original Hindi Visual Genome.

    Additionally, a challenge test set of 1400 segments will be released for the WAT2019 multi-modal task. This challenge test set was created by searching for (particularly) ambiguous English words based on the embedding similarity and manually selecting those where the image helps to resolve the ambiguity.

    Dataset Formats

    --------------

    The multimodal dataset contains both text and images.

    The text parts of the dataset (train and test sets) are in simple tab-delimited plain text files.

    All the text files have seven columns as follows:

    Column1 - image_id

    Column2 - X

    Column3 - Y

    Column4 - Width

    Column5 - Height

    Column6 - English Text

    Column7 - Hindi Text

    The image part contains the full images with the corresponding image_id as the file name. The X, Y, Width and Height columns indicate the rectangular region in the image described by the caption.

    Data Statistics

    ----------------

    The statistics of the current release are given below.

    Parallel Corpus Statistics

    ---------------------------

    Dataset          Segments   English Words   Hindi Words
    --------------   --------   -------------   -----------
    Train               28932          143178        136722
    Dev                   998            4922          4695
    Test                 1595            7852          7535
    Challenge Test       1400            8185          8665   (Released separately)
    --------------   --------   -------------   -----------
    Total               32925          164137        157617

    The word counts are approximate, prior to tokenization.

    Citation

    --------

    If you use this corpus, please cite the following paper:

    @article{hindi-visual-genome:2019,

    title={{Hindi Visual Genome: A Dataset for Multimodal English-to-Hindi Machine Translation}},

    author={Parida, Shantipriya and Bojar, Ond{\v{r}}ej and Dash, Satya Ranjan},

    journal={Computaci{\'o}n y Sistemas},

    note={In print. Presented at CICLing 2019, La Rochelle, France},

    year={2019},

    }

  13. F

    General domain Human-Human conversation chats in Hindi

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). General domain Human-Human conversation chats in Hindi [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/hindi-general-domain-conversation-text-dataset
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What's Included

    This training dataset comprises more than 10,000 text conversations between two native Hindi speakers in the general domain. The chats cover a variety of everyday topics, services, and issues, such as music, books, festivals, health, kids, family, environment, study, childhood, cuisine, internet, and movies, which makes the dataset diverse.

    These chats consist of language-specific words and phrases and follow the native way of talking, which makes them more information-rich for your NLP model. Besides being specific to its topic, each chat contains various attributes such as people's names, addresses, contact information, email addresses, times, dates, local currency, telephone numbers, and local slang, in various formats, to make the text data unbiased.

    These chat scripts have between 300 and 700 words and up to 50 turns. 150 people from the FutureBeeAI crowd community contributed to this dataset. You will also receive chat metadata, such as participant age, gender, and country information, along with the chats. Dataset applications include conversational AI, natural language processing (NLP), smart assistants, text recognition, text analytics, and text prediction.

    This dataset is being expanded with new chats all the time. We are able to produce text data in a variety of languages to meet your unique requirements. Check out the FutureBeeAI community for a custom collection.

    This training dataset's licence belongs to FutureBeeAI!

  14. P

    MaSaC_ERC Dataset

    • paperswithcode.com
    Updated Apr 9, 2024
    Cite
    Shivani Kumar; Ramaneswaran S; Md Shad Akhtar; Tanmoy Chakraborty (2024). MaSaC_ERC Dataset [Dataset]. https://paperswithcode.com/dataset/masac-erc
    Explore at:
    Dataset updated
    Apr 9, 2024
    Authors
    Shivani Kumar; Ramaneswaran S; Md Shad Akhtar; Tanmoy Chakraborty
    Description

    The E-MASAC Dataset is a collection of code-mixed conversations sourced from an Indian TV series, focusing on Hindi-English interactions. It was derived from the MASAC dataset and specifically annotated for Emotion Recognition in Conversations (ERC) tasks. The dataset comprises 8,607 dialogues with 11,440 utterances, containing instances of sarcasm and humor. Emotions such as anger, fear, joy, sadness, surprise, contempt, and neutral are annotated for each utterance by three fluent English and Hindi-speaking linguists, ensuring a high inter-annotator agreement of 0.85.

  15. 34 Hours - Hindi(India) Children Real-world Casual Conversation and...

    • m.nexdata.ai
    • nexdata.ai
    Updated May 31, 2023
    Cite
    Nexdata (2023). 34 Hours - Hindi(India) Children Real-world Casual Conversation and Monologue speech dataset [Dataset]. https://m.nexdata.ai/dataset/1377
    Explore at:
    Dataset updated
    May 31, 2023
    Dataset authored and provided by
    Nexdata
    Area covered
    World
    Variables measured
    Age, Format, Country, Accuracy, Language, Content category, Language(Region) Code, Recording environment, Features of annotation
    Description

    The Hindi(India) Children Real-world Casual Conversation and Monologue speech dataset covers self-media, conversation, live, lecture, variety show and other generic domains, mirroring real-world interactions. It is transcribed with text content, speaker's ID, gender, age, accent and other attributes. The dataset was collected from a large and geographically diverse set of speakers (children 12 years old and younger), enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring the maintenance of user privacy and legal rights throughout the data collection, storage, and usage processes; our datasets are all GDPR, CCPA, and PIPL compliant.

  16. F

    Hindi Conversation Chat Dataset for Travel Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Hindi Conversation Chat Dataset for Travel Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/hindi-travel-domain-conversation-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The dataset comprises over 12,000 chat conversations, each focusing on specific Travel-related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.

    • Participants Details: 200+ native Hindi participants from the FutureBeeAI community.
    • Word Count & Length: Chats are diverse, averaging 300 to 700 words and 50 to 150 turns across both speakers.

    Topic Diversity

    The chat dataset covers a wide range of conversations on Travel topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Travel use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.

    • Inbound Calls:
      • Booking Inquiries & Assistance
      • Destination Information & Recommendations
      • Flight Delays or Cancellation Assistance
      • Assistance for Disabled Passengers
      • Travel-related Health & Safety Inquiry
      • Lost or Delayed Baggage Assistance, and many more
    • Outbound Calls:
      • Promotional Offers & Package Deals
      • Customer Satisfaction Surveys
      • Booking Confirmations & Updates
      • Flight Schedule Changes & Notifications
      • Customer Feedback Collection
      • Visa Expiration Reminders, and many more

    Language Variety & Nuances

    The conversations in this dataset capture the diverse language styles and expressions prevalent in Hindi Travel interactions. This diversity ensures the dataset accurately represents the language used by Hindi speakers in Travel contexts.

    The dataset encompasses a wide array of language elements, including:

    • Naming Conventions: Chats include a variety of Hindi personal and business names.
    • Localized Details: Real-world addresses, emails, phone numbers, and other contact information according to different Hindi-speaking regions.
    • Temporal and Numeric Expressions: Dates, times, currencies, and numbers in Hindi forms, adhering to local conventions.
    • Idiomatic Expressions and Slang: It includes local slang, idioms, and informal phrases present in Hindi Travel conversations.

    This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Hindi Travel interactions.

    Conversational Flow and Interaction Types

    The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Travel customer-agent interactions.

    • Simple Inquiries
    • Detailed Discussions
    • Transactional Interactions
    • Problem-Solving Dialogues
    • Advisory Sessions
    • Routine Checks and Follow-Ups

    Each of these conversations contains various aspects of conversation flow like:

    • Greetings
    • Authentication
    • Information gathering
    • Resolution identification
    • Solution Delivery

  17. g

    Hinglish media Audio Dataset

    • gts.ai
    json
    Updated Nov 24, 2023
    Cite
    GTS (2023). Hinglish media Audio Dataset [Dataset]. https://gts.ai/case-study/hinglish-media-audio-dataset-speech-and-voice-for-ai-and-ml/
    Explore at:
    Available download formats: json
    Dataset updated
    Nov 24, 2023
    Dataset provided by
    GLOBOSE TECHNOLOGY SOLUTIONS PRIVATE LIMITED
    Authors
    GTS
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The “Hinglish Media Audio Dataset” project is designed to create a comprehensive audio dataset that combines Hindi and English languages (Hinglish) for advanced speech recognition applications.

  18. P

    HindEnCorp Dataset

    • paperswithcode.com
    • opendatalab.com
    • +1more
    Updated Mar 30, 2014
    Cite
    Ondřej Bojar; Vojtěch Diatka; Pavel Rychlý; Pavel Straňák; Vít Suchomel; Aleš Tamchyna; Daniel Zeman (2014). HindEnCorp Dataset [Dataset]. https://paperswithcode.com/dataset/hindencorp
    Explore at:
    Dataset updated
    Mar 30, 2014
    Authors
    Ondřej Bojar; Vojtěch Diatka; Pavel Rychlý; Pavel Straňák; Vít Suchomel; Aleš Tamchyna; Daniel Zeman
    Description

    HindEnCorp is a parallel corpus of Hindi and English, and HindMonoCorp is a monolingual corpus of Hindi; both are described in their release version 0.5. Both corpora were collected from web sources and preprocessed primarily for the training of statistical machine translation systems. HindEnCorp consists of 274k parallel sentences (3.9 million Hindi and 3.8 million English tokens). HindMonoCorp amounts to 787 million tokens in 44 million sentences.

  19. P

    NISP- A Multi-lingual Multi-accent Dataset for Speaker Profiling Dataset

    • paperswithcode.com
    Updated Jul 11, 2020
    Cite
    Shareef Babu Kalluri; Deepu Vijayasenan; Sriram Ganapathy; Ragesh Rajan M; Prashant Krishnan (2020). NISP- A Multi-lingual Multi-accent Dataset for Speaker Profiling Dataset [Dataset]. https://paperswithcode.com/dataset/nisp-a-multi-lingual-multi-accent-dataset-for
    Explore at:
    Dataset updated
    Jul 11, 2020
    Authors
    Shareef Babu Kalluri; Deepu Vijayasenan; Sriram Ganapathy; Ragesh Rajan M; Prashant Krishnan
    Description

    We announce the release of a new multilingual speaker dataset called NITK-IISc Multilingual Multi-accent Speaker Profiling(NISP) dataset. The dataset contains speech in six different languages -- five Indian languages along with Indian English. The dataset contains speech data from 345 bilingual speakers in India. Each speaker has contributed about 4-5 minutes of data that includes recordings in both English and their mother tongue. The transcript for the text is provided in UTF-8 format. For every speaker, the dataset contains speaker meta-data such as L1, native place, medium of instruction, current residing place etc. In addition the dataset also contains physical parameter information of the speakers such as age, height, shoulder size and weight. We hope that the dataset is useful for a diverse set of research activities including multilingual speaker recognition, language and accent recognition, automatic speech recognition etc.

  20. P

    WITS Dataset

    • paperswithcode.com
    Updated Mar 28, 2022
    + more versions
    Cite
    Shivani Kumar; Atharva Kulkarni; Md Shad Akhtar; Tanmoy Chakraborty (2022). WITS Dataset [Dataset]. https://paperswithcode.com/dataset/wits
    Explore at:
    Dataset updated
    Mar 28, 2022
    Authors
    Shivani Kumar; Atharva Kulkarni; Md Shad Akhtar; Tanmoy Chakraborty
    Description

    This dataset is an extension of MASAC, a multimodal, multi-party, Hindi-English code-mixed dialogue dataset compiled from the popular Indian TV show, ‘Sarabhai v/s Sarabhai’. WITS was created by augmenting MASAC with natural language explanations for each sarcastic dialogue. The dataset consists of the transcribed sarcastic dialogues from 55 episodes of the TV show, along with audio and video multimodal signals. It was designed to facilitate Sarcasm Explanation in Dialogue (SED), a novel task aimed at generating a natural language explanation for a given sarcastic dialogue, that spells out the intended irony. Each data instance in WITS is associated with a corresponding video, audio, and textual transcript where the last utterance is sarcastic in nature. All the final selected explanations contain the following attributes:

    • Sarcasm source: The speaker in the dialog who is being sarcastic.
    • Sarcasm target: The person/thing towards whom the sarcasm is directed.
    • Action word: Verb/action used to describe how the sarcasm is taking place, e.g. mocking, insults, taunts, etc.
    • Description: A description of the scene to help contextualize the sarcasm.

Cite
FutureBee AI (2022). Hindi Conversation Chat Dataset for Telecom Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/hindi-telecom-domain-conversation-text-dataset

Hindi Conversation Chat Dataset for Telecom Domain

Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License

https://www.futurebeeai.com/data-license-agreement

Dataset funded by
FutureBeeAI
Description

Introduction

The dataset comprises over 12,000 chat conversations, each focusing on specific Telecom-related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.

• Participants Details: 200+ native Hindi participants from the FutureBeeAI community.
• Word Count & Length: Chats are diverse, averaging 300 to 700 words and 50 to 150 turns across both speakers.

Topic Diversity

The chat dataset covers a wide range of conversations on Telecom topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Telecom use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.

• Inbound Chats:
  • Phone Number Porting
  • Network Connectivity Issues
  • Billing and Payments
  • Technical Support
  • Service Activation
  • International Roaming Enquiry
  • Refunds and Billing Adjustments
  • Emergency Service Access, and many more
• Outbound Chats:
  • Welcome Calls / Onboarding Process
  • Payment Reminders
  • Customer Surveys
  • Technical Updates
  • Service Usage Reviews
  • Network Complaint Update, and many more

Language Variety & Nuances

The conversations in this dataset capture the diverse language styles and expressions prevalent in Hindi Telecom interactions. This diversity ensures the dataset accurately represents the language used by Hindi speakers in Telecom contexts.

The dataset encompasses a wide array of language elements, including:

• Naming Conventions: Chats include a variety of Hindi personal and business names.
• Localized Details: Real-world addresses, emails, phone numbers, and other contact information according to different Hindi-speaking regions.
• Temporal and Numeric Expressions: Dates, times, currencies, and numbers in Hindi forms, adhering to local conventions.
• Idiomatic Expressions and Slang: It includes local slang, idioms, and informal phrases present in Hindi Telecom conversations.

This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Hindi Telecom interactions.

Conversational Flow and Interaction Types

The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Telecom customer-agent interactions.

• Simple Inquiries
• Detailed Discussions
• Transactional Interactions
• Problem-Solving Dialogues
• Advisory Sessions
• Routine Checks and Follow-Ups

Each of these conversations contains various aspects of conversation flow like:

• Greetings
• Authentication
• Information gathering
• Resolution identification
