100+ datasets found
  1. F

    Hindi (India) General Conversation Speech Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
  2. P

    MaSaC_ERC Dataset

    • paperswithcode.com
    Updated Oct 18, 2023
  3. F

    Telecom domain Human-Human conversation chats in Hindi

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
  4. Hindi-English TED talks, Wikipedia articles, etc.

    • kaggle.com
    zip
    Updated Oct 31, 2020
  5. E

    Hindi Visual Genome 1.1

    • live.european-language-grid.eu
    • lindat.mff.cuni.cz
    binary format
    Updated Dec 31, 2019
  6. P

    WITS Dataset

    • paperswithcode.com
    Updated Mar 28, 2022
    + more versions
  7. F

    General domain Human-Human conversation chats in Hindi

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
  8. E

    USPDATRO: Underrepresented Speech Dataset from Romanian language Open Data

    • live.european-language-grid.eu
    txt
    Updated Jun 4, 2023
  9. F

    Healthcare domain Human-Human conversation chats in Hindi

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
  10. P

    NISP- A Multi-lingual Multi-accent Dataset for Speaker Profiling Dataset

    • paperswithcode.com
    Updated Jul 11, 2020
  11. F

    Delivery & Logistics domain Human-Human conversation chats in Hindi

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
  12. F

    Travel Call Center Speech Data: Hindi (India)

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
  13. E

    Data from: Hindi Web Texts

    • live.european-language-grid.eu
    • lindat.mff.cuni.cz
    • +1more
    binary format
    Updated Nov 22, 2011
  14. E

    HinDialect 1.1: 26 Hindi-related languages and dialects of the Indic...

    • live.european-language-grid.eu
    binary format
    Updated Jul 13, 2022
    + more versions
  15. c

    Data from: HindEnCorp 0.5

    • lindat.mff.cuni.cz
    • paperswithcode.com
    • +3more
    Updated Mar 21, 2014
    + more versions
  16. s

    Hindi Language Datasets | Audio Data for ASR, Virtual Assistant

    • shaip.com
    • pa.shaip.com
    • +24more
    Updated Mar 22, 2023
  17. 302 Person - Hindi and English Bilingual Spontaneous Monologue smartphone...

    • nexdata.ai
    • m.datatang.ai
    • +1more
    Updated Oct 21, 2023
    + more versions
  18. s

    Hindi-English Off-the-Shelf Datasets

    • bn.shaip.com
    • no.shaip.com
    • +3more
    json
    Updated Jan 10, 2023
  19. d

    Shaip - Multilingual Conversational AI Training Data (Text & Audio)

    • datarade.ai
    .json
    Updated Sep 23, 2020
  20. d

    Exoma | Call Center Audio Data (Hindi Language & Accent) | Aviation/...

    • datarade.ai
    Updated Mar 11, 2024
Cite
FutureBee AI (2022). Hindi (India) General Conversation Speech Dataset [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-hindi-india

Hindi (India) General Conversation Speech Dataset

Hindi General Conversation Speech Corpus

Available download formats
wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License

https://www.futurebeeai.com/data-license-agreement

Dataset funded by
FutureBeeAI
Description

Welcome to the Hindi Language General Conversation Speech Dataset, a comprehensive and diverse collection of voice data specifically curated to advance the development of Hindi language speech recognition models, with a particular focus on Indian accents and dialects.

With high-quality audio recordings, detailed metadata, and accurate transcriptions, it empowers researchers and developers to enhance natural language processing, conversational AI, and Generative Voice AI algorithms. Moreover, it facilitates the creation of sophisticated voice assistants and voice bots tailored to the unique linguistic nuances found in the Hindi language spoken in India.

Speech Data:

This training dataset comprises 150 hours of audio recordings covering a wide range of topics and scenarios, ensuring robustness and accuracy in speech technology applications. To achieve this, we collaborated with a diverse network of 160 native Hindi speakers from different parts of India. This collaborative effort guarantees a balanced representation of Indian accents, dialects, and demographics, reducing biases and promoting inclusivity.

Each audio recording captures a spontaneous, unscripted conversation between two individuals, with durations typically ranging from 15 to 60 minutes. The speech data is available in WAV format as stereo files with a bit depth of 16 bits and a sample rate of 8 kHz. The recording environment is generally quiet, without background noise or echo.
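As a quick sanity check on downloaded files, the stated audio specification (stereo, 16-bit, 8 kHz WAV) can be verified with Python's standard `wave` module. This is a minimal sketch rather than an official FutureBeeAI tool; the helper names `read_wav_specs` and `spec_mismatches` and the file path in the usage comment are assumptions made for illustration.

```python
import wave

# Specification as stated in the dataset description:
# stereo (2 channels), 16-bit samples (2 bytes), 8 kHz sample rate.
EXPECTED = {"channels": 2, "sample_width_bytes": 2, "sample_rate_hz": 8000}


def read_wav_specs(path):
    """Return channel count, sample width (bytes), sample rate, and duration (s)."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_s": frames / rate,
        }


def spec_mismatches(specs, expected=EXPECTED):
    """Return {field: (actual, expected)} for every field that differs."""
    return {k: (specs[k], v) for k, v in expected.items() if specs[k] != v}


# Usage (hypothetical path, not part of the dataset documentation):
# specs = read_wav_specs("recordings/conversation_001.wav")
# print(spec_mismatches(specs) or "matches the stated specification")
```

At these settings uncompressed PCM audio takes 8,000 × 2 × 2 = 32,000 bytes per second, so 150 hours works out to roughly 17.3 GB before file headers.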

Metadata:

In addition to the audio recordings, our dataset provides comprehensive metadata for each participant. This metadata includes the participant's age, gender, country, state, and dialect. Additional metadata such as recording device details, topic of recording, bit depth, and sample rate is also provided.

The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Hindi language speech recognition models.

Transcription:

This dataset provides a manual verbatim transcription of each audio file to enhance your workflow efficiency. The transcriptions are available in JSON format and are organized by speaker, with time-coded segmentation along with non-speech labels and tags.
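To illustrate how speaker-wise, time-coded transcriptions might be consumed, the sketch below totals talk time per speaker. The actual JSON schema is not documented here, so the field names (`segments`, `speaker`, `start`, `end`, `text`), the `[noise]` tag, and the sample content are all hypothetical placeholders; adjust them to match the delivered files.

```python
import json
from collections import defaultdict

# HYPOTHETICAL schema and content: the real field names and tag format
# used in the dataset's transcription files may differ.
SAMPLE = """
{"segments": [
  {"speaker": "spk1", "start": 0.0, "end": 2.5, "text": "namaste"},
  {"speaker": "spk2", "start": 2.5, "end": 4.0, "text": "[noise]"},
  {"speaker": "spk1", "start": 4.0, "end": 7.0, "text": "aap kaise hain?"}
]}
"""


def talk_time_by_speaker(transcript):
    """Sum segment durations (in seconds) for each speaker label."""
    totals = defaultdict(float)
    for seg in transcript["segments"]:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)


transcript = json.loads(SAMPLE)
totals = talk_time_by_speaker(transcript)
print(totals)
```

The same loop can be extended to skip segments whose text is only a non-speech tag once the real tag convention is known.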

Our goal is to expedite the deployment of Hindi language conversational AI and NLP models by offering ready-to-use transcriptions, ultimately saving valuable time and resources in the development process.

Updates and Customization:

We understand the importance of collecting data in various environments to build robust ASR models. Therefore, our voice dataset is regularly updated with new audio data captured in diverse real-world conditions.

If you require a custom training dataset with specific environmental conditions such as in-car, busy street, restaurant, or any other scenario, we can accommodate your request. We can provide voice data with customized sample rates ranging from 8 kHz to 48 kHz, allowing you to fine-tune your models for different audio recording setups. Additionally, we can customize the transcription to follow your specific guidelines and requirements, to further support your ASR development process.

License:

This audio dataset, created by FutureBeeAI, is now available for commercial use.

Conclusion:

Whether you are training or fine-tuning speech recognition models, advancing NLP algorithms, exploring generative voice AI, or building cutting-edge voice assistants and bots, our dataset serves as a reliable and valuable resource.
