47 datasets found
  1. Telecom domain Human-Human conversation chats in Hindi

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Telecom domain Human-Human conversation chats in Hindi [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/hindi-telecom-domain-conversation-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    This training dataset comprises more than 10,000 text conversations between two native Hindi speakers in the telecom domain. The chats cover a variety of topics, services, and issues of daily life, such as music, books, festivals, health, kids, family, environment, study, childhood, cuisine, internet, and movies, which makes the dataset diverse.

    The chats use language-specific words and phrases and follow the native way of talking, which makes them more information-rich for your NLP model. Besides being specific to its topic, each chat contains attributes such as people's names, addresses, contact information, email addresses, times, dates, local currency, telephone numbers, and local slang in various formats, to make the text data unbiased.

    Each chat script has between 300 and 700 words and up to 50 turns. 150 members of the FutureBeeAI crowd community contributed to this dataset. Along with the chats, you also receive chat metadata such as participant age, gender, and country. Dataset applications include conversational AI, natural language processing (NLP), smart assistants, text recognition, text analytics, and text prediction.

    This dataset is continually expanded with new chats. We can produce text data in a variety of languages to meet your unique requirements. Check out the FutureBeeAI community for a custom collection.

    This training dataset's licence belongs to FutureBeeAI!

  2. Hindi Language Datasets | Audio Data for ASR, Virtual Assistant

    • fr.shaip.com
    • pl.shaip.com
    • +23more
    Updated Jul 17, 2023
    Cite
    Shaip (2023). Hindi Language Datasets | Audio Data for ASR, Virtual Assistant [Dataset]. https://fr.shaip.com/offerings/speech-data-catalog/hindi-dataset/
    Explore at:
    Dataset updated
    Jul 17, 2023
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Enhance your conversational AI model with our off-the-shelf Hindi language datasets. Shaip's high-quality audio datasets are a quick and effective solution for model training.

  3. Healthcare domain Human-Human conversation chats in Hindi

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Healthcare domain Human-Human conversation chats in Hindi [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/hindi-healthcare-domain-conversation-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    This training dataset comprises more than 10,000 text conversations between two native Hindi speakers in the healthcare domain. The chats cover a variety of topics, services, and issues of daily life, such as music, books, festivals, health, kids, family, environment, study, childhood, cuisine, internet, and movies, which makes the dataset diverse.

    The chats use language-specific words and phrases and follow the native way of talking, which makes them more information-rich for your NLP model. Besides being specific to its topic, each chat contains attributes such as people's names, addresses, contact information, email addresses, times, dates, local currency, telephone numbers, and local slang in various formats, to make the text data unbiased.

    Each chat script has between 300 and 700 words and up to 50 turns. 150 members of the FutureBeeAI crowd community contributed to this dataset. Along with the chats, you also receive chat metadata such as participant age, gender, and country. Dataset applications include conversational AI, natural language processing (NLP), smart assistants, text recognition, text analytics, and text prediction.

    This dataset is continually expanded with new chats. We can produce text data in a variety of languages to meet your unique requirements. Check out the FutureBeeAI community for a custom collection.

    This training dataset's licence belongs to FutureBeeAI!

  4. Speech Dataset in Hindi Language

    • ieee-dataport.org
    Updated Jun 9, 2020
    Cite
    Shivam Shukla (2020). Speech Dataset in Hindi Language [Dataset]. http://doi.org/10.21227/1187-yv12
    Explore at:
    Dataset updated
    Jun 9, 2020
    Dataset provided by
    Institute of Electrical and Electronics Engineers (http://www.ieee.ro/)
    Authors
    Shivam Shukla
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    100 speakers, each contributing 5 voice samples for training data and 1 voice sample for testing data. A total of 600 voice samples were collected in different audio formats such as mpeg, mp4, mp3, and ogg. These samples were then preprocessed and converted into .wav format. Each voice sample lasts 5-10 seconds; because the lengths differ, parameters should be tuned before use. The whole dataset is 600 MB in size and 1 hour 40 minutes in duration. This dataset can be used for speech synthesis, speaker identification, speaker recognition, speech recognition, etc. Preprocessing of the data is required.
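    The figures in this description can be cross-checked with a few lines of arithmetic. The sketch below only restates the numbers quoted above (100 speakers, 5 training and 1 testing sample each, 5-10 seconds per sample); the variable names are illustrative and not part of the dataset.

```python
# Figures quoted in the listing above.
speakers = 100
train_per_speaker = 5
test_per_speaker = 1
min_seconds, max_seconds = 5, 10

# Sample counts implied by the per-speaker figures.
total_samples = speakers * (train_per_speaker + test_per_speaker)
train_samples = speakers * train_per_speaker
test_samples = speakers * test_per_speaker

# Total duration bounds in minutes, given 5-10 seconds per sample.
min_minutes = total_samples * min_seconds / 60
max_minutes = total_samples * max_seconds / 60

print(total_samples, train_samples, test_samples)
print(min_minutes, max_minutes)
```

    The sample count matches the stated 600, and the stated duration of 1 hour 40 minutes (100 minutes) sits at the upper end of the implied range.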

  5. Hindi (India) General Conversation Speech Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Hindi (India) General Conversation Speech Dataset [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-hindi-india
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Welcome to the Hindi Language General Conversation Speech Dataset, a comprehensive and diverse collection of voice data specifically curated to advance the development of Hindi language speech recognition models, with a particular focus on Indian accents and dialects.

    With high-quality audio recordings, detailed metadata, and accurate transcriptions, it empowers researchers and developers to enhance natural language processing, conversational AI, and Generative Voice AI algorithms. Moreover, it facilitates the creation of sophisticated voice assistants and voice bots tailored to the unique linguistic nuances found in the Hindi language spoken in India.

    Speech Data:

    This training dataset comprises 150 hours of audio recordings covering a wide range of topics and scenarios, ensuring robustness and accuracy in speech technology applications. To achieve this, we collaborated with a diverse network of 160 native Hindi speakers from different parts of India. This collaborative effort guarantees a balanced representation of Indian accents, dialects, and demographics, reducing biases and promoting inclusivity.

    Each audio recording captures the essence of spontaneous, unscripted conversations between two individuals, with an average duration ranging from 15 to 60 minutes. The speech data is available in WAV format, with stereo channel files having a bit depth of 16 bits and a sample rate of 8 kHz. The recording environment is generally quiet, without background noise and echo.
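    The stated audio format (WAV, stereo, 16-bit, 8 kHz) can be verified on a downloaded file with Python's standard `wave` module. The following is a minimal sketch, not part of the dataset tooling: the function names and the `EXPECTED` dictionary are assumptions made for illustration, and the demo writes one second of silence to a temporary file so that it runs without the dataset.

```python
import os
import tempfile
import wave

# Format stated in the listing: stereo, 16-bit (2 bytes per sample), 8 kHz.
EXPECTED = {"channels": 2, "sample_width_bytes": 2, "sample_rate_hz": 8000}


def describe_wav(path):
    """Read the header fields of a WAV file."""
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": wav.getframerate(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }


def matches_listing(info):
    """Return True when the header agrees with the stated format."""
    return all(info[key] == value for key, value in EXPECTED.items())


# Demo on a synthetic file: one second of stereo 16-bit silence at 8 kHz.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as out:
    out.setnchannels(2)
    out.setsampwidth(2)
    out.setframerate(8000)
    out.writeframes(b"\x00" * (8000 * 2 * 2))

demo_info = describe_wav(demo_path)
print(demo_info, matches_listing(demo_info))
```

    Pointing `describe_wav` at a real recording instead of the synthetic file would show whether it matches the listing before any model training starts.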

    Metadata:

    In addition to the audio recordings, our dataset provides comprehensive metadata for each participant. This metadata includes the participant's age, gender, country, state, and dialect. Furthermore, additional metadata such as recording device detail, topic of recording, bit depth, and sample rate will be provided.

    The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Hindi language speech recognition models.

    Transcription:

    This dataset provides a manual verbatim transcription of each audio file to enhance your workflow efficiency. The transcriptions are available in JSON format. The transcriptions capture speaker-wise transcription with time-coded segmentation along with non-speech labels and tags.

    Our goal is to expedite the deployment of Hindi language conversational AI and NLP models by offering ready-to-use transcriptions, ultimately saving valuable time and resources in the development process.

    Updates and Customization:

    We understand the importance of collecting data in various environments to build robust ASR models. Therefore, our voice dataset is regularly updated with new audio data captured in diverse real-world conditions.

    If you require a custom training dataset with specific environmental conditions such as in-car, busy street, restaurant, or any other scenario, we can accommodate your request. We can provide voice data with customized sample rates ranging from 8kHz to 48kHz, allowing you to fine-tune your models for different audio recording setups. Additionally, we can also customize the transcription following your specific guidelines and requirements, to further support your ASR development process.

    License:

    This audio dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Whether you are training or fine-tuning speech recognition models, advancing NLP algorithms, exploring generative voice AI, or building cutting-edge voice assistants and bots, our dataset serves as a reliable and valuable resource.

  6. hind_encorp

    • huggingface.co
    • opendatalab.com
    Updated Sep 23, 2023
    + more versions
    Cite
    The HF Datasets community (2023). hind_encorp [Dataset]. https://huggingface.co/datasets/hind_encorp
    Explore at:
    Dataset updated
    Sep 23, 2023
    Dataset authored and provided by
    The HF Datasets community
    License

    Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/
    License information was derived automatically

    Description

    HindEnCorp parallel texts (sentence-aligned) come from the following sources: Tides, which contains 50K sentence pairs taken mainly from news articles. This dataset was originally collected for the DARPA-TIDES surprise-language contest in 2002, later refined at IIIT Hyderabad and provided for the NLP Tools Contest at ICON 2008 (Venkatapathy, 2008).

    Commentaries by Daniel Pipes contain 322 articles in English written by a journalist Daniel Pipes and translated into Hindi.

    EMILLE. This corpus (Baker et al., 2002) consists of three components: monolingual, parallel and annotated corpora. There are fourteen monolingual subcorpora, including both written and (for some languages) spoken data for fourteen South Asian languages. The EMILLE monolingual corpora contain in total 92,799,000 words (including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu). The parallel corpus consists of 200,000 words of text in English and its accompanying translations into Hindi and other languages.

    Smaller datasets as collected by Bojar et al. (2010) include the corpus used at ACL 2005 (a subcorpus of EMILLE), a corpus of named entities from Wikipedia (crawled in 2009), and an agriculture-domain parallel corpus.

    For the current release, we are extending the parallel corpus using these sources: Intercorp (Čermák and Rosen, 2012) is a large multilingual parallel corpus of 32 languages including Hindi. The central language used for alignment is Czech. Intercorp's core texts amount to 202 million words. These core texts are most suitable for us because their sentence alignment is manually checked and therefore very reliable. They cover predominantly short stories and novels. There are seven Hindi texts in Intercorp. Unfortunately, an English translation is available for only three of them; the other four are aligned only with Czech texts. The Hindi subcorpus of Intercorp contains 118,000 words in Hindi.

    TED talks, held in various languages, primarily English, are equipped with transcripts, and these are translated into 102 languages. There are 179 talks for which a Hindi translation is available.

    The Indic multi-parallel corpus (Birch et al., 2011; Post et al., 2012) is a corpus of texts from Wikipedia translated from the respective Indian language into English by non-expert translators hired over Mechanical Turk. The quality is thus somewhat mixed in many respects, from typesetting and punctuation through capitalization, spelling, and word choice to sentence structure. Some control could in principle be obtained from the fact that every input sentence was translated 4 times. We used the 2012 release of the corpus.

    Launchpad.net is a software collaboration platform that hosts many open-source projects and facilitates also collaborative localization of the tools. We downloaded all revisions of all the hosted projects and extracted the localization (.po) files.

    Other smaller datasets. This time, we added Wikipedia entities as crawled in 2013 (including any morphological variants of the named entity that appear on the Hindi variant of the Wikipedia page) and words, word examples and quotes from the Shabdkosh online dictionary.

  7. 494 Hours - Hindi Spontaneous Speech Data

    • datatang.ai
    Updated Nov 12, 2023
    + more versions
    Cite
    Datatang (2023). 494 Hours - Hindi Spontaneous Speech Data [Dataset]. https://www.datatang.ai/datasets/1269
    Explore at:
    Dataset updated
    Nov 12, 2023
    Dataset provided by
    datatang technology inc
    Authors
    Datatang
    Variables measured
    Format, Country, Language, Accuracy Rate, Content category, Language(Region) Code, Recording environment, Features of annotation
    Description

    Hindi (India) real-world casual conversation and monologue speech dataset covering the education, interview, and sports domains and mirroring real-world interactions. Transcribed with text content, speaker ID, gender, and other attributes. The dataset was collected from a large, geographically diverse pool of speakers, enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout data collection, storage, and usage. All our datasets comply with GDPR, CCPA, and PIPL.

  8. 797 Hours - Hindi(India) Spontaneous Dialogue Smartphone speech dataset

    • nexdata.ai
    Updated Apr 13, 2024
    + more versions
    Cite
    Nexdata (2024). 797 Hours - Hindi(India) Spontaneous Dialogue Smartphone speech dataset [Dataset]. https://www.nexdata.ai/dataset/1156
    Explore at:
    Dataset updated
    Apr 13, 2024
    Dataset authored and provided by
    Nexdata
    Area covered
    India
    Variables measured
    Format, Country, Speaker, Language, Accuracy Rate, Content category, Recording device, Recording condition, Language(Region) Code, Features of annotation
    Description

    Hindi (India) spontaneous dialogue smartphone speech dataset, collected from dialogues based on given topics and covering 20+ domains. Transcribed with text content, speaker ID, gender, age, and other attributes. The dataset was collected from a large, geographically diverse pool of speakers (1,002 native speakers), enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout data collection, storage, and usage. All our datasets comply with GDPR, CCPA, and PIPL.

  9. Hindi (India) scripted telephony

    • appen.co.jp
    Updated Sep 24, 2021
    + more versions
    Cite
    Appen (2021). Hindi (India) scripted telephony [Dataset]. https://appen.co.jp/off-the-shelf-datasets/
    Explore at:
    Dataset updated
    Sep 24, 2021
    Dataset authored and provided by
    Appen (https://appen.com/)
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Audio - ASR, Virtual Assistant - Mobile phone - 224 hours - Appen Global - Scripted Speech - Hindi - India - Low background noise - 1920 - 1 - 96000 - 9853 - 8 - alaw - Fully transcribed to SpeechDAT type conventions. The dataset is accompanied by a pronunciation lexicon [SAMPA] containing all transcribed words. 50 prompts per speaker, including digits, natural numbers, personal, business and place names, web addresses, confirmation items (yes, no + fuzzy), generic command and control items, and phonetically rich sentences and words.

  10. Hindi-English Off-the-Shelf Datasets

    • bn.shaip.com
    • no.shaip.com
    • +3more
    json
    Updated Jan 10, 2023
    Cite
    Shaip (2023). Hindi-English Off-the-Shelf Datasets [Dataset]. https://bn.shaip.com/offerings/speech-data-catalog/
    Explore at:
    Available download formats: json
    Dataset updated
    Jan 10, 2023
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Off-the-shelf Hindi-English (Hinglish) audio dataset. Total volume: 800 hrs, split into 300 hrs of 8 kHz unscripted, synthetic telephonic call center conversations ('agent' and 'customer') and 500 hrs of 16 kHz public domain media and podcast audio/video conversations. Topics include Agriculture, Art, Aviation, Banking, Consumer, Crime, Culture, Delivery, Entertainment, Finance, Food, Gaming, Health, Hospitality, IT, Insurance, Legal, News, Oil, Politics, Real Estate, Religion, Retail, Spirituality, Sports, Technology, Telecom, Travel, Weather, Automotive. Audio format: .wav. Transcription format: .json.

  11. 34 Hours - Hindi Child's Spontaneous Speech Data

    • m.datatang.ai
    Updated Nov 16, 2023
    + more versions
    Cite
    Datatang (2023). 34 Hours - Hindi Child's Spontaneous Speech Data [Dataset]. https://m.datatang.ai/datasets/1377
    Explore at:
    Dataset updated
    Nov 16, 2023
    Dataset provided by
    datatang technology inc
    Authors
    Datatang
    Variables measured
    Age, Format, Country, Accuracy, Language, Content category, Language(Region) Code, Recording environment, Features of annotation
    Description

    Hindi (India) children's real-world casual conversation and monologue speech dataset covering self-media, conversation, live, lecture, variety show, and other generic domains and mirroring real-world interactions. Transcribed with text content, speaker ID, gender, age, accent, and other attributes. The dataset was collected from a large, geographically diverse pool of speakers (children 12 years old and younger), enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout data collection, storage, and usage. All our datasets comply with GDPR, CCPA, and PIPL.

  12. indic-instruct-data-v0.1

    • huggingface.co
    Updated Feb 16, 2024
    Cite
    AI4Bharat (2024). indic-instruct-data-v0.1 [Dataset]. https://huggingface.co/datasets/ai4bharat/indic-instruct-data-v0.1
    Explore at:
    Dataset updated
    Feb 16, 2024
    Dataset authored and provided by
    AI4Bharat
    Description

    Indic Instruct Data v0.1

    A collection of different instruction datasets spanning English and Hindi languages. The collection consists of:

    Anudesh, wikiHow, Flan v2 (67k sample subset), Dolly, Anthropic-HHH (5k sample subset), OpenAssistant v1, LymSys-Chat (50k sample subset)

    We translate the English subset of specific datasets using IndicTrans2 (Gala et al., 2023). The chrF++ scores of the back-translated example and the corresponding example are provided for quality assessment of… See the full description on the dataset page: https://huggingface.co/datasets/ai4bharat/indic-instruct-data-v0.1.

  13. Delivery & Logistics domain Human-Human conversation chats in Hindi

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Delivery & Logistics domain Human-Human conversation chats in Hindi [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/hindi-delivery-domain-conversation-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    This training dataset comprises more than 10,000 text conversations between two native Hindi speakers in the delivery & logistics domain. The chats cover a variety of topics, services, and issues of daily life, such as music, books, festivals, health, kids, family, environment, study, childhood, cuisine, internet, and movies, which makes the dataset diverse.

    The chats use language-specific words and phrases and follow the native way of talking, which makes them more information-rich for your NLP model. Besides being specific to its topic, each chat contains attributes such as people's names, addresses, contact information, email addresses, times, dates, local currency, telephone numbers, and local slang in various formats, to make the text data unbiased.

    Each chat script has between 300 and 700 words and up to 50 turns. 150 members of the FutureBeeAI crowd community contributed to this dataset. Along with the chats, you also receive chat metadata such as participant age, gender, and country. Dataset applications include conversational AI, natural language processing (NLP), smart assistants, text recognition, text analytics, and text prediction.

    This dataset is continually expanded with new chats. We can produce text data in a variety of languages to meet your unique requirements. Check out the FutureBeeAI community for a custom collection.

    This training dataset's licence belongs to FutureBeeAI!

  14. multi_language_conversation

    • huggingface.co
    Updated Jun 22, 2022
    Cite
    Nexdata (2022). multi_language_conversation [Dataset]. https://huggingface.co/datasets/Nexdata/multi_language_conversation
    Explore at:
    Dataset updated
    Jun 22, 2022
    Authors
    Nexdata
    Description

    Dataset Card for multi_language_conversation

      Dataset Summary
    

    The dataset contains 12,000 hours of multi-language conversation speech data. It is recorded by native speakers and covers English, French, German, Russian, Spanish, Japanese, Korean, Hindi, Vietnamese, etc. The speakers start the conversation around a familiar topic to ensure that the conversation is smooth and natural. The format is 16 kHz, 16-bit, uncompressed wav, mono channel. The sentence accuracy is… See the full description on the dataset page: https://huggingface.co/datasets/Nexdata/multi_language_conversation.

  15. Hindi Visual Genome 1.1

    • live.european-language-grid.eu
    • lindat.mff.cuni.cz
    binary format
    Updated Dec 31, 2019
    Cite
    (2019). Hindi Visual Genome 1.1 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/1309
    Explore at:
    Available download formats: binary format
    Dataset updated
    Dec 31, 2019
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Data

    ----

    Hindi Visual Genome 1.1 is an updated version of Hindi Visual Genome 1.0. The update concerns primarily the text part of Hindi Visual Genome, fixing translation issues reported during WAT 2019 multimodal task. In the image part, only one segment and thus one image were removed from the dataset.

    Hindi Visual Genome 1.1 serves in "WAT 2020 Multi-Modal Machine Translation Task".

    Hindi Visual Genome is a multimodal dataset consisting of text and images suitable for English-to-Hindi multimodal machine translation task and multimodal research. We have selected short English segments (captions) from Visual Genome along with associated images and automatically translated them to Hindi with manual post-editing, taking the associated images into account.

    The training set contains 29K segments. A further 1K and 1.6K segments are provided in the development and test sets, respectively, which follow the same (random) sampling from the original Hindi Visual Genome.

    A third test set, called the "challenge test set", consists of 1.4K segments and was released for the WAT2019 multi-modal task. The challenge test set was created by searching for (particularly) ambiguous English words based on embedding similarity and manually selecting those where the image helps to resolve the ambiguity. However, the surrounding words in the sentence often also include sufficient cues to identify the correct meaning of the ambiguous word.

    Dataset Formats

    --------------

    The multimodal dataset contains both text and images.

    The text parts of the dataset (train and test sets) are in simple tab-delimited plain text files.

    All the text files have seven columns as follows:

    Column1 - image_id

    Column2 - X

    Column3 - Y

    Column4 - Width

    Column5 - Height

    Column6 - English Text

    Column7 - Hindi Text

    The image part contains the full images with the corresponding image_id as the file name. The X, Y, Width and Height columns indicate the rectangular region in the image described by the caption.
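    The seven-column layout described above can be read with Python's standard `csv` module. The following is a minimal sketch under stated assumptions: the files are assumed to have no header row and integer coordinates, the `Segment` type and `read_segments` function are illustrative names, and the sample row is invented for the demonstration rather than taken from the corpus.

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO


class Segment(NamedTuple):
    """One caption row: image id, region rectangle, and both texts."""
    image_id: str
    x: int
    y: int
    width: int
    height: int
    english: str
    hindi: str


def read_segments(handle: TextIO) -> Iterator[Segment]:
    """Yield one Segment per well-formed seven-column line."""
    # QUOTE_NONE keeps quote characters inside captions as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 7:
            continue  # skip blank or malformed lines
        image_id, x, y, width, height, english, hindi = row
        yield Segment(image_id, int(x), int(y), int(width), int(height),
                      english, hindi)


# Invented sample line (not from the corpus), followed by a malformed one.
sample = io.StringIO(
    "2412112\t308\t124\t142\t131\ta red bus\tएक लाल बस\n"
    "incomplete\tline\n"
)
segments = list(read_segments(sample))
print(len(segments), segments[0].english, segments[0].width)
```

    For a real file, the `io.StringIO` object would be replaced by `open(path, encoding="utf-8", newline="")`; the image for each row is then the file named after `image_id`.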

    Data Statistics

    ----------------

    The statistics of the current release are given below.

    Parallel Corpus Statistics

    ---------------------------

    Dataset         Segments  English Words  Hindi Words
    --------------  --------  -------------  -----------
    Train              28930         143164       145448
    Dev                  998           4922         4978
    Test                1595           7853         7852
    Challenge Test      1400           8186         8639
    --------------  --------  -------------  -----------
    Total              32923         164125       166917

    The word counts are approximate, prior to tokenization.
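    The totals row in the statistics above can be cross-checked against the per-split figures. This is a small sketch that only re-adds the numbers from the table; the `splits` dictionary is an illustrative structure, not a file shipped with the corpus.

```python
# (segments, English words, Hindi words) per split, copied from the table.
splits = {
    "Train": (28930, 143164, 145448),
    "Dev": (998, 4922, 4978),
    "Test": (1595, 7853, 7852),
    "Challenge Test": (1400, 8186, 8639),
}

# Sum each column across the four splits.
totals = tuple(sum(column) for column in zip(*splits.values()))
print(totals)  # (32923, 164125, 166917)
```

    The three sums agree with the Total row, so the table is internally consistent.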

    Citation

    --------

    If you use this corpus, please cite the following paper:

    @article{hindi-visual-genome:2019,

    title={{Hindi Visual Genome: A Dataset for Multimodal English-to-Hindi Machine Translation}},

    author={Parida, Shantipriya and Bojar, Ond{\v{r}}ej and Dash, Satya Ranjan},

    journal={Computaci{\'o}n y Sistemas},

    volume={23},

    number={4},

    pages={1499--1505},

    year={2019}

    }

  16. General domain Human-Human conversation chats in Hindi

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). General domain Human-Human conversation chats in Hindi [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/hindi-general-domain-conversation-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    This training dataset comprises more than 10,000 text conversations between two native Hindi speakers in the general domain. The chats cover a variety of topics, services, and issues of daily life, such as music, books, festivals, health, kids, family, environment, study, childhood, cuisine, internet, and movies, which makes the dataset diverse.

    The chats use language-specific words and phrases and follow the native way of talking, which makes them more information-rich for your NLP model. Besides being specific to its topic, each chat contains attributes such as people's names, addresses, contact information, email addresses, times, dates, local currency, telephone numbers, and local slang in various formats, to make the text data unbiased.

    Each chat script has between 300 and 700 words and up to 50 turns. 150 members of the FutureBeeAI crowd community contributed to this dataset. Along with the chats, you also receive chat metadata such as participant age, gender, and country. Dataset applications include conversational AI, natural language processing (NLP), smart assistants, text recognition, text analytics, and text prediction.

    This dataset is continually expanded with new chats. We can produce text data in a variety of languages to meet your unique requirements. Check out the FutureBeeAI community for a custom collection.

    This training dataset's licence belongs to FutureBeeAI!

  17. P

    NISP- A Multi-lingual Multi-accent Dataset for Speaker Profiling Dataset

    • paperswithcode.com
    Updated Feb 11, 2021
    Cite
    Shareef Babu Kalluri; Deepu Vijayasenan; Sriram Ganapathy; Ragesh Rajan M; Prashant Krishnan (2021). NISP- A Multi-lingual Multi-accent Dataset for Speaker Profiling Dataset [Dataset]. https://paperswithcode.com/dataset/nisp-a-multi-lingual-multi-accent-dataset-for
    Explore at:
    Dataset updated
    Feb 11, 2021
    Authors
    Shareef Babu Kalluri; Deepu Vijayasenan; Sriram Ganapathy; Ragesh Rajan M; Prashant Krishnan
    Description

    We announce the release of a new multilingual speaker dataset called the NITK-IISc Multilingual Multi-accent Speaker Profiling (NISP) dataset. The dataset contains speech in six different languages: five Indian languages along with Indian English. The dataset contains speech data from 345 bilingual speakers in India. Each speaker has contributed about 4-5 minutes of data that includes recordings in both English and their mother tongue. The transcript for the text is provided in UTF-8 format. For every speaker, the dataset contains speaker metadata such as L1, native place, medium of instruction, and current place of residence. In addition, the dataset contains physical parameter information of the speakers such as age, height, shoulder size, and weight. We hope that the dataset is useful for a diverse set of research activities including multilingual speaker recognition, language and accent recognition, and automatic speech recognition.

  18. d

    Exoma | Call Center Audio Data (Hindi Language & Accent) | Aviation/...

    • datarade.ai
    Updated Jan 12, 2023
    Cite
    Exoma Services (2023). Exoma | Call Center Audio Data (Hindi Language & Accent) | Aviation/ Tourism/ Retail Sectors [Dataset]. https://datarade.ai/data-products/call-center-audio-data-globally-for-all-languages-exoma-services
    Explore at:
    Available download formats: .json, .csv, .xls, .txt
    Dataset updated
    Jan 12, 2023
    Dataset authored and provided by
    Exoma Services
    Area covered
    India, France, Philippines
    Description

    Key Features:

    1. Advanced Speech Recognition: We employ state-of-the-art speech recognition technology, ensuring accurate and efficient transcription of audio data in Hindi. Seamlessly convert customer interactions into valuable insights.

    2. Industry-Specific Customization: Tailored for the Aviation, Tourism, and Retail sectors, Exoma understands the nuances of industry-specific terminology and jargon. This customization enhances the precision and relevance of the transcribed data.

    3. Actionable Insights: Uncover valuable insights from customer conversations. Identify trends, sentiments, and key performance indicators to make informed decisions that drive operational excellence and customer satisfaction.

    4. Multilingual Capabilities: Beyond Hindi, we offer multilingual support, allowing you to analyze customer interactions in various languages spoken within your business environment.

    5. Data Security: We prioritize the security of your sensitive information. Exoma employs robust encryption and data protection measures, ensuring confidentiality and compliance with industry standards.

    6. User-Friendly Interface: Exoma is designed with user convenience in mind. The intuitive interface allows for easy navigation, making it accessible for both technical and non-technical users.

    We ensure that all PHI is removed from the data. We can supply audio in any format, such as MP3 or WAV, as per your requirements.

    Exoma has a wide global network of call centers, which makes it easy to cater to your needs quickly. Once a customer initiates a request, we start the process by sending samples for approval and then kick off the project in line with the customer's expectations. We ensure quality data delivery, which in turn will help you train your AI models.

  19. h

    hindi-political-chat

    • huggingface.co
    Updated Aug 14, 2023
    Cite
    Rajat Raturi (2023). hindi-political-chat [Dataset]. https://huggingface.co/datasets/rajat-jarvis/hindi-political-chat
    Explore at:
    Dataset updated
    Aug 14, 2023
    Authors
    Rajat Raturi
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    rajat-jarvis/hindi-political-chat dataset hosted on Hugging Face and contributed by the HF Datasets community

  20. E

    HinDialect 1.1: 26 Hindi-related languages and dialects of the Indic...

    • live.european-language-grid.eu
    binary format
    Updated Jul 13, 2022
    + more versions
    Cite
    (2022). HinDialect 1.1: 26 Hindi-related languages and dialects of the Indic Continuum in North India [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/20494
    Explore at:
    Available download formats: binary format
    Dataset updated
    Jul 13, 2022
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    HinDialect: 26 Hindi-related languages and dialects of the Indic Continuum in North India

    Languages

    This is a collection of folksongs for 26 languages that form a dialect continuum in North India and nearby regions, namely Angika, Awadhi, Baiga, Bengali, Bhadrawahi, Bhili, Bhojpuri, Braj, Bundeli, Chhattisgarhi, Garhwali, Gujarati, Haryanvi, Himachali, Hindi, Kanauji, Khadi Boli, Korku, Kumaoni, Magahi, Malvi, Marathi, Nimadi, Panjabi, Rajasthani, Sanskrit.

    This data was originally collected by the Kavita Kosh Project at http://www.kavitakosh.org/. Here are the main characteristics of the languages in this collection:

    - They are all Indic languages except for Korku.
    - The majority of them are closely related to the standard Hindi dialect genealogically (such as Haryanvi and Bhojpuri), although the collection also contains languages such as Bengali and Gujarati, which are more distant relatives.
    - They are all primarily spoken in (North) India (Bengali is also spoken in Bangladesh).
    - All except Sanskrit are living languages.

    Data

    Categorising them by pre-existing available NLP resources, we have:

    * Band 1 languages: Hindi, Panjabi, Gujarati, Bengali, Nepali. These languages already have other large standard datasets available. Kavita Kosh may have very little data for these languages.
    * Band 2 languages: Bhojpuri, Magahi, Awadhi, Braj. These languages have growing interest and some datasets of a relatively small size as compared to Band 1 language resources.
    * Band 3 languages: All other languages in the collection are previously zero-resource languages. These are the languages for which this dataset is the most relevant.

    Script

    This dataset is entirely in Devanagari. Content in the case of languages not written in Devanagari (such as Bengali and Gujarati) has been transliterated by the Kavita Kosh Project.

    Format

    The dataset contains a single text file containing folksongs per language. Folksongs are separated from each other by an empty line. The first line of a new piece is the title of the folksong, and line separation within folksongs is preserved.
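
    The file layout described above (one text file per language, songs separated by an empty line, title on the first line of each song) could be read with a short parser. The following is a minimal sketch, not part of the dataset's own tooling: the names `Folksong` and `parse_folksongs` are hypothetical, the sample text is invented for illustration, and treating whitespace-only lines as empty is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Folksong:
    title: str        # first line of the song block
    lines: List[str]  # remaining lines, in their original order


def parse_folksongs(text: str) -> List[Folksong]:
    """Split the contents of one language file into songs.

    Assumes songs are separated by one or more empty lines and that
    whitespace-only lines count as empty (an assumption, not documented).
    """
    songs: List[Folksong] = []
    block: List[str] = []
    # The trailing "" sentinel flushes the last song even if the file
    # does not end with an empty line.
    for raw in text.splitlines() + [""]:
        line = raw.rstrip()
        if line:
            block.append(line)
        elif block:
            songs.append(Folksong(title=block[0], lines=block[1:]))
            block = []
    return songs


# Invented sample in the described layout (real files are in Devanagari).
sample = "Title one\nline a\nline b\n\nTitle two\nline c\n"
songs = parse_folksongs(sample)
```

    In practice the text would come from one of the per-language files, read with UTF-8 decoding; the file names are not stated in the description, so none are assumed here.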

