100+ datasets found
  1. In-Cabin Speech Data | 15,000 Hours | AI Training Data | Speech Recognition...

    • datarade.ai
    Updated Apr 23, 2024
    Cite
    Nexdata (2024). In-Cabin Speech Data | 15,000 Hours | AI Training Data | Speech Recognition Data | Audio Data |Natural Language Processing (NLP) Data [Dataset]. https://datarade.ai/data-products/nexdata-in-car-speech-data-15-000-hours-audio-ai-ml-t-nexdata
    Explore at:
Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Apr 23, 2024
    Dataset authored and provided by
    Nexdata
    Area covered
    Switzerland, Russian Federation, Netherlands, Germany, Austria, Poland, Turkey, Egypt, Argentina, Romania
    Description
1. Specifications Format : Audio format: 48kHz, 16bit, uncompressed wav, mono channel; Video format: MP4

Recording Environment : In-car; 1 quiet scene, 1 low-noise scene, 3 medium-noise scenes, and 2 high-noise scenes

Recording Content : It covers 5 fields: navigation, multimedia, telephone, car control, and question-and-answer; 500 sentences per person

Speaker : Speakers are evenly distributed across all age groups, covering children, teenagers, the middle-aged, and the elderly.

    Device : High fidelity microphone; Binocular camera

    Language : 20 languages

    Transcription content : text

    Accuracy rate : 98%

    Application scenarios : speech recognition, Human-computer interaction; Natural language processing and text analysis; Visual content understanding, etc.

1. About Nexdata Nexdata owns off-the-shelf PB-level Large Language Model (LLM) Data, 1 million hours of Audio Data, and 800TB of Annotated Imagery Data. These ready-to-go Natural Language Processing (NLP) Data support instant delivery and quickly improve the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/speechrecog?source=Datarade
  2. Scripted Monologues Speech Data | 65,000 Hours | Generative AI Audio Data|...

    • datarade.ai
    Updated Dec 11, 2023
    Cite
    Nexdata (2023). Scripted Monologues Speech Data | 65,000 Hours | Generative AI Audio Data| Speech Recognition Data | Machine Learning (ML) Data [Dataset]. https://datarade.ai/data-products/nexdata-multilingual-read-speech-data-65-000-hours-aud-nexdata
    Explore at:
Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Dec 11, 2023
    Dataset authored and provided by
    Nexdata
    Area covered
    Puerto Rico, Italy, Uruguay, Poland, France, Japan, Luxembourg, Taiwan, Pakistan, Chile
    Description
    1. Specifications Format : 16kHz, 16bit, uncompressed wav, mono channel

    Recording environment : quiet indoor environment, without echo

    Recording content (read speech) : economy, entertainment, news, oral language, numbers, letters

    Speaker : native speaker, gender balance

    Device : Android mobile phone, iPhone

    Language : 100+ languages

    Transcription content : text, time point of speech data, 5 noise symbols, 5 special identifiers

    Accuracy rate : 95% (the accuracy rate of noise symbols and other identifiers is not included)

    Application scenarios : speech recognition, voiceprint recognition

1. About Nexdata Nexdata owns off-the-shelf PB-level Large Language Model (LLM) Data, 1 million hours of Audio Data, and 800TB of Annotated Imagery Data. These ready-to-go Machine Learning (ML) Data support instant delivery and quickly improve the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/speechrecog?source=Datarade
  3. Bengali Speech Recognition Dataset (BSRD)

    • kaggle.com
    Updated Jan 14, 2025
    Cite
    Shuvo Kumar Basak-4004 (2025). Bengali Speech Recognition Dataset (BSRD) [Dataset]. http://doi.org/10.34740/kaggle/dsv/10465455
    Explore at:
Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jan 14, 2025
    Dataset provided by
Kaggle (http://kaggle.com/)
    Authors
    Shuvo Kumar Basak-4004
    License

MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    The Bengali Speech Recognition Dataset (BSRD) is a comprehensive dataset designed for the development and evaluation of Bengali speech recognition and text-to-speech systems. It includes a collection of Bengali characters and their corresponding audio files, generated using speech synthesis models, and serves as an essential resource for researchers and developers working on automatic speech recognition (ASR) and text-to-speech (TTS) applications for the Bengali language.

    Key Features:
    • Bengali Characters: The dataset contains a wide range of Bengali characters, including consonants, vowels, and unique symbols used in the Bengali script, such as 'ক', 'খ', 'গ', and many more.
    • Corresponding Speech Data: For each Bengali character, an MP3 audio file contains the correct pronunciation of that character. This audio is generated by a Bengali text-to-speech model, ensuring clear and accurate pronunciation.
    • 1000 Audio Samples per Folder: Each character is associated with at least 1000 MP3 files. These multiple samples provide variations of the character's pronunciation, which is essential for training robust speech recognition systems.
    • Language and Phonetic Diversity: The dataset offers a phonetic diversity of Bengali sounds, covering different tones and pronunciations commonly found in spoken Bengali, so it can be used to train models capable of recognizing diverse speech patterns.

    Use Cases:
    • Automatic Speech Recognition (ASR): BSRD is ideal for training ASR systems, as it provides accurate audio samples linked to specific Bengali characters.
    • Text-to-Speech (TTS): Researchers can use this dataset to fine-tune TTS systems for generating natural Bengali speech from text.
    • Phonetic Analysis: The dataset can be used for phonetic analysis and for developing models that study the linguistic features of Bengali pronunciation.

    Applications:
    • Voice Assistants: Build and train voice recognition systems and personal assistants that understand Bengali.
    • Speech-to-Text Systems: Develop accurate transcription systems for Bengali audio content.
    • Language Learning Tools: Create educational tools aimed at teaching Bengali pronunciation.
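The folder-per-character layout described above (one directory per Bengali character, each holding at least 1000 MP3 samples) can be indexed with a short Python sketch. This is an illustration only: the actual folder names and nesting in the Kaggle download may differ, so `root` and the `*.mp3` glob are assumptions to verify against the real data.

```python
from pathlib import Path

def index_bsrd(root):
    """Index a folder-per-character dataset layout: root/<character>/*.mp3.

    Returns a dict mapping each character (folder name) to the sorted list
    of its MP3 sample paths. The layout is assumed from the dataset
    description; check the actual download before relying on it.
    """
    return {
        d.name: sorted(d.glob("*.mp3"))
        for d in Path(root).iterdir()
        if d.is_dir()
    }
```

From the resulting index it is straightforward to build (audio, character) training pairs or to spot folders with fewer samples than expected.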

Note for Researchers Using the Dataset:

This dataset was created by Shuvo Kumar Basak. If you use this dataset for research or academic purposes, please cite it appropriately. If you have published research using this dataset, please share a link to your paper. Good luck.

  4. Arabic Speech Commands Dataset

    • zenodo.org
    • explore.openaire.eu
    zip
    Updated Apr 5, 2021
    Cite
Abdulkader Ghandoura (2021). Arabic Speech Commands Dataset [Dataset]. http://doi.org/10.5281/zenodo.4662481
    Explore at:
Available download formats: zip
    Dataset updated
    Apr 5, 2021
    Dataset provided by
Zenodo (http://zenodo.org/)
    Authors
Abdulkader Ghandoura
    License

Attribution 4.0 International (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Arabic Speech Commands Dataset

    This dataset is designed to help train simple machine learning models that serve educational and research purposes in the speech recognition domain, mainly for keyword spotting tasks.

    Dataset Description

Our dataset is a list of pairs (x, y), where x is the input speech signal and y is the corresponding keyword. The final dataset consists of 12000 such pairs, comprising 40 keywords. Each audio file is one second long, sampled at 16 kHz. We have 30 participants, each of whom recorded 10 utterances for each keyword. Therefore, we have 300 audio files for each keyword in total (30 * 10 * 40 = 12000), and the total size of all the recorded keywords is ~384 MB. The dataset also contains several background noise recordings we obtained from various natural sources of noise. We saved these audio files in a separate folder named background_noise, with a total size of ~49 MB.

    Dataset Structure

    There are 40 folders, each of which represents one keyword and contains 300 files. The first eight digits of each file name identify the contributor, while the last two digits identify the round number. For example, the file path rotate/00000021_NO_06.wav indicates that the contributor with the ID 00000021 pronounced the keyword rotate for the 6th time.
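The naming convention above can be parsed mechanically. A minimal Python sketch, assuming the `<keyword>/<8-digit id>_NO_<round>.wav` pattern holds for every file (`parse_recording` is a hypothetical helper name, not part of the dataset):

```python
from pathlib import Path

def parse_recording(path):
    """Split an Arabic Speech Commands file path into its parts.

    Paths look like keyword/<8-digit contributor id>_NO_<round>.wav,
    e.g. rotate/00000021_NO_06.wav -> ("rotate", "00000021", 6).
    """
    p = Path(path)
    keyword = p.parent.name               # folder name is the keyword
    contributor, _, round_no = p.stem.split("_")
    return keyword, contributor, int(round_no)
```

For example, `parse_recording("rotate/00000021_NO_06.wav")` yields the keyword `rotate`, contributor `00000021`, and round 6, matching the description above.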

    Data Split

    We recommend using the provided CSV files in your experiments. We kept 60% of the dataset for training, 20% for validation, and the remaining 20% for testing. In our split method, we guarantee that all recordings of a certain contributor are within the same subset.
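Since the split guarantees that all recordings of a contributor stay in one subset, it is worth verifying that property after loading the CSVs. A sketch under the assumption that each row pairs a file path with its subset name (the real CSV column layout may differ; both function names are illustrative):

```python
from collections import defaultdict

def contributors_per_split(rows):
    """Group contributor IDs by subset.

    rows: iterable of (file_path, subset) pairs, where file_path follows
    the keyword/<contributor>_NO_<round>.wav convention.
    """
    splits = defaultdict(set)
    for path, subset in rows:
        contributor = path.rsplit("/", 1)[-1].split("_")[0]
        splits[subset].add(contributor)
    return splits

def is_speaker_independent(splits):
    """True if no contributor appears in more than one subset."""
    names = list(splits)
    return all(
        splits[a].isdisjoint(splits[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    )
```

Speaker-independent splits matter here because keyword-spotting accuracy measured on held-out utterances from seen speakers overstates real-world performance.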

    License

    This dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. For more details, see the LICENSE file in this folder.

    Citations

    If you want to use the Arabic Speech Commands dataset in your work, please cite it as:

    @article{arabicspeechcommandsv1,
       author = {Ghandoura, Abdulkader and Hjabo, Farouk and Al Dakkak, Oumayma},
       title = {Building and Benchmarking an Arabic Speech Commands Dataset for Small-Footprint Keyword Spotting},
       journal = {Engineering Applications of Artificial Intelligence},
       year = {2021},
       publisher={Elsevier}
    }

  5. In-Cabin Speech Data | 15,000 Hours | AI Training Data | Speech Recognition...

    • data.nexdata.ai
    Updated Aug 3, 2024
    Cite
    Nexdata (2024). In-Cabin Speech Data | 15,000 Hours | AI Training Data | Speech Recognition Data | Audio Data |Natural Language Processing (NLP) Data [Dataset]. https://data.nexdata.ai/products/nexdata-in-car-speech-data-15-000-hours-audio-ai-ml-t-nexdata
    Explore at:
    Dataset updated
    Aug 3, 2024
    Dataset authored and provided by
    Nexdata
    Area covered
    Sweden, Bangladesh, Tanzania, Thailand, Slovakia, Ireland, South Africa, Colombia, Australia, Peru
    Description

The Natural Language Processing (NLP) Data of in-car speech covers 20+ languages, including read speech, wake-up words, command words, code-switching, multimodal, and noise data.

  6. 8kHz Conversational Speech Data | 15,000 Hours | Audio Data | Speech...

    • datarade.ai
    Updated Dec 10, 2023
    Cite
    Nexdata (2023). 8kHz Conversational Speech Data | 15,000 Hours | Audio Data | Speech Recognition Data| Machine Learning (ML) Data [Dataset]. https://datarade.ai/data-products/nexdata-multilingual-conversational-speech-data-8khz-tele-nexdata
    Explore at:
Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Dec 10, 2023
    Dataset authored and provided by
    Nexdata
    Area covered
    United Arab Emirates, Argentina, Romania, Singapore, Netherlands, Vietnam, Czech Republic, Poland, Philippines, United States of America
    Description
    1. Specifications Format : 8kHz, 8bit, u-law/a-law pcm, mono channel;

    Environment : quiet indoor environment, without echo;

Recording content : No preset script; dozens of topics are specified, and the speakers converse on those topics while the recording is performed;

Demographics : Speakers are evenly distributed across all age groups, covering children, teenagers, the middle-aged, and the elderly.

    Annotation : annotating for the transcription text, speaker identification, gender and noise symbols;

    Device : Telephony recording system;

    Language : 100+ Languages;

    Application scenarios : speech recognition; voiceprint recognition;

    Accuracy rate : the word accuracy rate is not less than 98%

1. About Nexdata Nexdata owns off-the-shelf PB-level Large Language Model (LLM) Data, 1 million hours of Audio Data, and 800TB of Annotated Imagery Data. These ready-to-go Machine Learning (ML) Data support instant delivery and quickly improve the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/speechrecog?source=Datarade
  7. Japanese Speech Recognition Dataset

    • unidata.pro
    u-law/a-law, wav
    Updated Feb 26, 2025
    Cite
    Unidata L.L.C-FZ (2025). Japanese Speech Recognition Dataset [Dataset]. https://unidata.pro/datasets/japanese-speech-recognition-dataset/
    Explore at:
Available download formats: wav, u-law/a-law
    Dataset updated
    Feb 26, 2025
    Dataset authored and provided by
    Unidata L.L.C-FZ
    Description

    Train AI to understand Japanese with Unidata’s dataset, featuring diverse speech samples for better transcription accuracy

  8. Data from: wav2vec: Unsupervised Pre-Training for Speech Recognition

    • service.tib.eu
    Updated Dec 16, 2024
    Cite
    (2024). wav2vec: Unsupervised Pre-Training for Speech Recognition [Dataset]. https://service.tib.eu/ldmservice/dataset/wav2vec--unsupervised-pre-training-for-speech-recognition
    Explore at:
    Dataset updated
    Dec 16, 2024
    Description

    Unsupervised Pre-Training for Speech Recognition

  9. M-AILabs Speech Dataset

    • paperswithcode.com
    Updated Aug 2, 2022
    Cite
    (2022). M-AILabs speech dataset Dataset [Dataset]. https://paperswithcode.com/dataset/m-ailabs-speech-dataset
    Explore at:
    Dataset updated
    Aug 2, 2022
    Description

The M-AILABS Speech Dataset is the first large dataset that we are providing free of charge, freely usable as training data for speech recognition and speech synthesis. Most of the data is based on LibriVox and Project Gutenberg. The training data consist of nearly a thousand hours of audio and text files in prepared format. A transcription is provided for each clip. Clips vary in length from 1 to 20 seconds and have a total length approximately as shown in the list (and in the respective info.txt files) below. The texts were published between 1884 and 1964 and are in the public domain. The audio was recorded by the LibriVox project and is also in the public domain.

  10. Video Dataset for training AI/ML Models

    • data.macgence.com
    mp3
    Updated Jul 18, 2024
    Cite
    Macgence (2024). Video Dataset for training AI/ML Models [Dataset]. https://data.macgence.com/dataset/video-dataset-for-training-aiml-models
    Explore at:
Available download formats: mp3
    Dataset updated
    Jul 18, 2024
    Dataset authored and provided by
    Macgence
    License

https://data.macgence.com/terms-and-conditions

    Time period covered
    2025
    Area covered
    Worldwide
    Variables measured
    Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
    Description

    Enhance AI/ML training with Macgence's diverse video dataset. High-quality visuals optimized for accuracy, reliability, and advanced model development!

  11. Speech & Audio Datasets

    • sapien.io
    Updated Apr 22, 2025
    Cite
    Sapien (2025). Speech & Audio Datasets [Dataset]. https://www.sapien.io/dataset-marketplace/speech-audio-datasets-for-ai-training
    Explore at:
    Dataset updated
    Apr 22, 2025
    Dataset authored and provided by
    Sapien
    License

https://www.sapien.io/terms

    Description

    High-quality speech audio datasets designed for AI model training, supporting various applications like speech recognition, voice identification, and multilingual speech data.

  12. English (UK) General Conversation Speech Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). English (UK) General Conversation Speech Dataset [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-english-uk
    Explore at:
Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    United Kingdom
    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Welcome to the English Language General Conversation Speech Dataset, a comprehensive and diverse collection of voice data specifically curated to advance the development of English language speech recognition models, with a particular focus on British accents and dialects.

With high-quality audio recordings, detailed metadata, and accurate transcriptions, it empowers researchers and developers to enhance natural language processing, conversational AI, and Generative Voice AI algorithms. Moreover, it facilitates the creation of sophisticated voice assistants and voice bots tailored to the unique linguistic nuances of English as spoken in the United Kingdom.

    Speech Data:

This training dataset comprises 30 hours of audio recordings covering a wide range of topics and scenarios, ensuring robustness and accuracy in speech technology applications. To achieve this, we collaborated with a diverse network of 40 native English speakers from different regions of the United Kingdom. This collaborative effort guarantees a balanced representation of British accents, dialects, and demographics, reducing biases and promoting inclusivity.

    Each audio recording captures the essence of spontaneous, unscripted conversations between two individuals, with an average duration ranging from 15 to 60 minutes. The speech data is available in WAV format, with stereo channel files having a bit depth of 16 bits and a sample rate of 8 kHz. The recording environment is generally quiet, without background noise and echo.
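Given the stated container format (stereo WAV, 16-bit, 8 kHz), a quick header check before training can catch mismatched files. A minimal sketch using Python's standard `wave` module; the expected values come from the description above, and `matches_spec` is a hypothetical helper name, not part of the dataset:

```python
import wave

def wav_properties(path):
    """Read (channels, bit_depth, sample_rate) from a WAV file header."""
    with wave.open(path, "rb") as w:
        return w.getnchannels(), w.getsampwidth() * 8, w.getframerate()

def matches_spec(path, channels=2, bit_depth=16, sample_rate=8000):
    """Check a file against the dataset's stated stereo/16-bit/8 kHz format."""
    return wav_properties(path) == (channels, bit_depth, sample_rate)
```

The same check generalizes to the customized 8 kHz to 48 kHz sample rates mentioned later by passing a different `sample_rate`.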

    Metadata:

    In addition to the audio recordings, our dataset provides comprehensive metadata for each participant. This metadata includes the participant's age, gender, country, state, and dialect. Furthermore, additional metadata such as recording device detail, topic of recording, bit depth, and sample rate will be provided.

    The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of English language speech recognition models.

    Transcription:

    This dataset provides a manual verbatim transcription of each audio file to enhance your workflow efficiency. The transcriptions are available in JSON format. The transcriptions capture speaker-wise transcription with time-coded segmentation along with non-speech labels and tags.

    Our goal is to expedite the deployment of English language conversational AI and NLP models by offering ready-to-use transcriptions, ultimately saving valuable time and resources in the development process.

    Updates and Customization:

    We understand the importance of collecting data in various environments to build robust ASR models. Therefore, our voice dataset is regularly updated with new audio data captured in diverse real-world conditions.

    If you require a custom training dataset with specific environmental conditions such as in-car, busy street, restaurant, or any other scenario, we can accommodate your request. We can provide voice data with customized sample rates ranging from 8kHz to 48kHz, allowing you to fine-tune your models for different audio recording setups. Additionally, we can also customize the transcription following your specific guidelines and requirements, to further support your ASR development process.

    License:

    This audio dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Whether you are training or fine-tuning speech recognition models, advancing NLP algorithms, exploring generative voice AI, or building cutting-edge voice assistants and bots, our dataset serves as a reliable and valuable resource.

  13. Speech Recognition Data Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Jun 23, 2025
    Cite
    Archive Market Research (2025). Speech Recognition Data Report [Dataset]. https://www.archivemarketresearch.com/reports/speech-recognition-data-563446
    Explore at:
Available download formats: doc, ppt, pdf
    Dataset updated
    Jun 23, 2025
    Dataset authored and provided by
    Archive Market Research
    License

https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

The global speech recognition market is experiencing robust growth, driven by the increasing adoption of voice assistants, the proliferation of smart devices, and advancements in artificial intelligence (AI). The market is projected to reach a substantial size, exhibiting a significant Compound Annual Growth Rate (CAGR). While precise figures for market size and CAGR aren't provided, considering the industry's rapid expansion and the involvement of major tech players like Google, Amazon, and Microsoft, a reasonable estimate would place the 2025 market size at approximately $15 billion, growing at a CAGR of 18% from 2025 to 2033.

    This substantial growth is fueled by several key factors. The rising demand for hands-free and voice-enabled interfaces in various applications, including automotive, healthcare, and customer service, is a major driver. Furthermore, continuous advancements in deep learning and natural language processing (NLP) technologies are leading to more accurate and efficient speech recognition systems. The increasing availability of large datasets for training these systems also contributes to improved performance and wider adoption.

    However, challenges remain. Data privacy concerns related to the collection and use of voice data pose a significant restraint. The need for robust security measures and transparent data handling practices is paramount to maintaining consumer trust and promoting wider market acceptance. Furthermore, achieving high accuracy in diverse acoustic environments and with varied accents continues to be an area of ongoing development. Despite these challenges, the long-term outlook for the speech recognition market remains highly positive, with continued innovation and expanding applications promising considerable growth throughout the forecast period. The market segmentation is expected to evolve, with specialized solutions for particular industries becoming increasingly prevalent.

  14. Italian Speech Recognition Dataset

    • unidata.pro
    a-law/u-law, pcm
    Updated Feb 26, 2025
    Cite
    Unidata L.L.C-FZ (2025). Italian Speech Recognition Dataset [Dataset]. https://unidata.pro/datasets/italian-speech-recognition-dataset/
    Explore at:
Available download formats: a-law/u-law, pcm
    Dataset updated
    Feb 26, 2025
    Dataset authored and provided by
    Unidata L.L.C-FZ
    Description

    Unidata’s Italian Speech Recognition dataset refines AI models for better speech-to-text conversion and language comprehension

  15. Travel Scripted Monologue Speech Data: Japanese (Japan)

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Travel Scripted Monologue Speech Data: Japanese (Japan) [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/travel-scripted-speech-monologues-japanese-japan
    Explore at:
Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    Japan
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Japanese Scripted Monologue Speech Dataset for the Travel Domain. This meticulously curated dataset is designed to advance the development of Japanese language speech recognition models, particularly for the Travel industry.

    Speech Data

    This training dataset comprises over 6,000 high-quality scripted prompt recordings in Japanese. These recordings cover various topics and scenarios relevant to the Travel domain, designed to build robust and accurate customer service speech technology.

    Participant Diversity:
    Speakers: 60 native Japanese speakers from different regions of Japan.
    Regions: Ensures a balanced representation of Japanese accents, dialects, and demographics.
Participant Profile: Participants range from 18 to 70 years old, with males and females represented in a 60:40 ratio.
    Recording Details:
    Recording Nature: Audio recordings of scripted prompts/monologues.
    Audio Duration: Average duration of 5 to 30 seconds per recording.
    Formats: WAV format with mono channels, a bit depth of 16 bits, and sample rates of 8 kHz and 16 kHz.
    Environment: Recordings are conducted in quiet settings without background noise and echo.
    Topic Diversity: The dataset encompasses a wide array of topics and conversational scenarios to ensure comprehensive coverage of the Travel sector. Topics include:
    Customer Service Interactions
    Booking and Reservations
    Travel Inquiries
    Technical Support
    General Information and Advice
    Promotional and Sales Events
    Domain Specific Statements
    Other Elements: To enhance realism and utility, the scripted prompts incorporate various elements commonly encountered in Travel interactions:
    Names: Region-specific names of males and females in various formats.
    Addresses: Region-specific addresses in different spoken formats.
    Dates & Times: Inclusion of date and time in various travel contexts, such as booking dates, departure and arrival times.
    Destinations: Specific names of cities, countries, and tourist attractions relevant to the travel sector.
    Numbers & Prices: Various numbers and prices related to ticket costs, hotel rates, and transaction amounts.
    Booking IDs and Confirmation Numbers: Inclusion of booking identification and confirmation details for realistic customer service scenarios.

    Each scripted prompt is crafted to reflect real-life scenarios encountered in the Travel domain, ensuring applicability in training robust natural language processing and speech recognition models.

    Transcription Data

    In addition to high-quality audio recordings, the dataset includes meticulously prepared text files with verbatim transcriptions of each audio file. These transcriptions are essential for training accurate and robust speech recognition models.

    Content: Each text file contains the exact scripted prompt corresponding to its audio file, ensuring consistency.
    Format: Transcriptions are provided in plain text (.TXT) format, with files named to match their associated audio files for easy reference.

  16. Speech Recognition Data Collection Services | 100+ Languages Resources...

    • data.nexdata.ai
    Updated Aug 3, 2024
    Cite
    Nexdata (2024). Speech Recognition Data Collection Services | 100+ Languages Resources |Audio Data | Speech Recognition Data | Machine Learning (ML) Data [Dataset]. https://data.nexdata.ai/products/nexdata-speech-recognition-data-collection-services-100-nexdata
    Explore at:
    Dataset updated
    Aug 3, 2024
    Dataset authored and provided by
    Nexdata
    Area covered
    Luxembourg, Tunisia, Netherlands, Lebanon, Mongolia, Finland, New Zealand, Singapore, Jordan, Cambodia
    Description

Nexdata is equipped with professional recording equipment, has a resource pool spanning 70+ countries and regions, and provides various types of speech recognition data collection services for Machine Learning (ML) Data.

  17. Speech Synthesis Data | 400 Hours | TTS Data | Audio Data | AI Training...

    • datarade.ai
    Updated Dec 10, 2023
    Cite
    Nexdata (2023). Speech Synthesis Data | 400 Hours | TTS Data | Audio Data | AI Training Data| AI Datasets [Dataset]. https://datarade.ai/data-products/nexdata-multilingual-speech-synthesis-data-400-hours-a-nexdata
    Explore at:
Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Dec 10, 2023
    Dataset authored and provided by
    Nexdata
    Area covered
    Canada, Austria, Philippines, Malaysia, Sweden, Colombia, Finland, Belgium, Singapore, Hong Kong
    Description
    1. Specifications Format : 44.1 kHz/48 kHz, 16bit/24bit, uncompressed wav, mono channel.

    Recording environment : professional recording studio.

    Recording content : general narrative sentences, interrogative sentences, etc.

    Speaker : native speaker

    Annotation Feature : word transcription, part-of-speech, phoneme boundary, four-level accents, four-level prosodic boundary.

    Device : Microphone

Language : American English, British English, Japanese, French, Dutch, Cantonese, Canadian French, Australian English, Italian, New Zealand English, Spanish, Mexican Spanish

    Application scenarios : speech synthesis

Accuracy rate: Word transcription: sentence accuracy rate not less than 99%. Part-of-speech annotation: sentence accuracy rate not less than 98%. Phoneme annotation: sentence accuracy rate not less than 98% (errors on voiced and swallowed phonemes are not counted, as that labelling is more subjective). Accent annotation: word accuracy rate not less than 95%. Prosodic boundary annotation: sentence accuracy rate not less than 97%. Phoneme boundary annotation: phoneme accuracy rate not less than 95% (boundary error within 5%).

    1. About Nexdata Nexdata owns off-the-shelf PB-level Large Language Model (LLM) Data, 1 million hours of Audio Data, and 800TB of Annotated Imagery Data. These ready-to-go AI & ML Training Data support instant delivery and quickly improve the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/tts?source=Datarade
  18.

    Healthcare Scripted Monologue Speech Data: English (Australia)

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Healthcare Scripted Monologue Speech Data: English (Australia) [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/healthcare-scripted-speech-monologues-english-australia
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    Australia
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Australian English Scripted Monologue Speech Dataset for the Healthcare Domain. This meticulously curated dataset is designed to advance the development of English language speech recognition models, particularly for the Healthcare industry.

    Speech Data

    This training dataset comprises over 6,000 high-quality scripted prompt recordings in Australian English. These recordings cover various topics and scenarios relevant to the Healthcare domain, designed to build robust and accurate customer service speech technology.

    Participant Diversity:
    Speakers: 60 native English speakers from different regions of Australia.
    Regions: Ensures a balanced representation of Australian English accents, dialects, and demographics.
    Participant Profile: Participants range from 18 to 70 years old, with a 60:40 male-to-female ratio.
    Recording Details:
    Recording Nature: Audio recordings of scripted prompts/monologues.
    Audio Duration: Average duration of 5 to 30 seconds per recording.
    Formats: WAV format with mono channels, a bit depth of 16 bits, and sample rates of 8 kHz and 16 kHz.
    Environment: Recordings are conducted in quiet settings without background noise and echo.
    Topic Diversity: The dataset encompasses a wide array of topics and conversational scenarios to ensure comprehensive coverage of the Healthcare sector. Topics include:
    Patient Interactions
    Medical Consultations
    Healthcare Services Inquiries
    Technical Support
    General Information and Advice
    Regulatory and Compliance Queries
    Emergency and Urgent Care
    Domain Specific Statements
    Other Elements: To enhance realism and utility, the scripted prompts incorporate various elements commonly encountered in Healthcare interactions:
    Names: Region-specific names of males and females in various formats.
    Addresses: Region-specific addresses in different spoken formats.
    Dates & Times: Inclusion of date and time in various healthcare contexts, such as appointment dates and medication schedules.
    Medical Terms: Specific medical terminology relevant to diagnoses, treatments, and procedures.
    Numbers & Measurements: Various numbers and measurements related to dosages, test results, and medical statistics.
    Healthcare Facilities: Names of hospitals, clinics, and medical institutions relevant to the healthcare sector.

    Each scripted prompt is crafted to reflect real-life scenarios encountered in the Healthcare domain, ensuring applicability in training robust natural language processing and speech recognition models.
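    The audio spec above (mono WAV, 16-bit depth, 8 kHz or 16 kHz sample rate, 5 to 30 seconds per recording) can be checked programmatically before training. A minimal sketch using Python's standard `wave` module; the function name and file path are illustrative, not part of the dataset:

    ```python
    import wave

    def check_recording_spec(path):
        """Check a recording against the stated dataset spec:
        mono WAV, 16-bit samples, 8 kHz or 16 kHz, 5-30 s duration."""
        with wave.open(path, "rb") as w:
            assert w.getnchannels() == 1, "expected mono audio"
            assert w.getsampwidth() == 2, "expected 16-bit samples (2 bytes)"
            assert w.getframerate() in (8000, 16000), "expected 8 kHz or 16 kHz"
            duration = w.getnframes() / w.getframerate()
            assert 5 <= duration <= 30, f"duration {duration:.1f}s outside 5-30s"
        return True
    ```

    Running such a check over every file before ingestion catches mis-exported audio early, before it silently degrades model training.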

    Transcription Data

    In addition to high-quality audio recordings, the dataset includes meticulously prepared text files with verbatim transcriptions of each audio file. These transcriptions are essential for training accurate and robust speech recognition models.

    Content: Each text file contains the exact scripted prompt corresponding to its audio file, ensuring consistency.
    Format:

  19.

    Punjabi General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Punjabi General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-punjabi-india
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Punjabi General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Punjabi speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Punjabi communication.

    Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Punjabi speech models that understand and respond to authentic Indian accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Punjabi. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native Punjabi speakers from FutureBeeAI’s contributor community.
    Regions: Representing various provinces of Punjab to ensure dialectal diversity and demographic balance.
    Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.
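    Since the recordings are stereo two-speaker conversations, a common preprocessing step is splitting each file into per-speaker mono streams. A sketch using Python's standard `wave` and `array` modules; the assumption that each participant occupies one channel is mine and is not stated by the dataset:

    ```python
    import array
    import wave

    def split_stereo_channels(path):
        """Split a 16-bit stereo WAV into two mono sample streams
        (assumed here: one conversation participant per channel)."""
        with wave.open(path, "rb") as w:
            assert w.getnchannels() == 2, "expected stereo"
            assert w.getsampwidth() == 2, "expected 16-bit samples"
            samples = array.array("h")
            samples.frombytes(w.readframes(w.getnframes()))
        # Stereo samples are interleaved L, R, L, R, ...
        return samples[0::2], samples[1::2]
    ```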

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
    High transcription accuracy achieved through a double QA pass (average WER < 5%)

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
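    The exact JSON schema is not documented here, so the field names below (`speaker`, `start`, `end`, `text`) are assumptions; this sketch only illustrates how speaker-segmented, time-coded utterances might be consumed in a pipeline:

    ```python
    import json

    def speaker_talk_time(utterances):
        """Sum per-speaker talk time (in seconds) from time-coded utterances."""
        talk_time = {}
        for u in utterances:
            talk_time[u["speaker"]] = (
                talk_time.get(u["speaker"], 0.0) + (u["end"] - u["start"])
            )
        return talk_time

    # Hypothetical transcript snippet in the assumed schema.
    transcript = json.loads("""
    [
      {"speaker": "spk1", "start": 0.0, "end": 4.25, "text": "..."},
      {"speaker": "spk2", "start": 4.4, "end": 9.1,  "text": "..."}
    ]
    """)
    ```

    On this snippet, `speaker_talk_time(transcript)` attributes about 4.25 s to spk1 and 4.7 s to spk2, the kind of balance check often run before training.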

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
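    A sketch of the kind of demographic filtering this metadata enables; the records and field names here are hypothetical, not the dataset's actual schema:

    ```python
    def filter_speakers(records, min_age=18, max_age=70, gender=None):
        """Select speaker-metadata records by age range and optional gender,
        e.g. to assemble a demographically balanced training subset."""
        return [
            r for r in records
            if min_age <= r["age"] <= max_age
            and (gender is None or r["gender"] == gender)
        ]

    # Hypothetical speaker-metadata records.
    speakers = [
        {"id": "PB001", "age": 24, "gender": "female", "dialect": "Majhi"},
        {"id": "PB002", "age": 63, "gender": "male", "dialect": "Malwai"},
        {"id": "PB003", "age": 41, "gender": "female", "dialect": "Doabi"},
    ]
    ```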

    Usage and Applications

    This dataset is a versatile resource for multiple Punjabi speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for Punjabi.
    Voice Assistants: Build smart assistants capable of understanding natural Indian conversations.

  20.

    Automatic speech recognition datasets for Gronings, Nasal, and Besemah

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 18, 2023
    Cite
    McDonnell (2023). Automatic speech recognition datasets for Gronings, Nasal, and Besemah [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7946869
    Explore at:
    Dataset updated
    May 18, 2023
    Dataset provided by
    Bartelds, San, McDonnell, Jurafsky, and Wieling
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Automatic speech recognition datasets for Gronings, Nasal, and Besemah for experiments reported in Bartelds, San, McDonnell, Jurafsky and Wieling (2023). Making More of Little Data: Improving Low-Resource Automatic Speech Recognition Using Data Augmentation. ACL 2023.

    Model training code available at: https://github.com/Bartelds/asr-augmentation
