100+ datasets found
  1. Spanish Speech Recognition Dataset

    • kaggle.com
    zip
    Updated Jun 25, 2025
    + more versions
    Cite
    Unidata (2025). Spanish Speech Recognition Dataset [Dataset]. https://www.kaggle.com/datasets/unidpro/spanish-speech-recognition-dataset
    Explore at:
    Available download formats: zip (93217 bytes)
    Dataset updated
    Jun 25, 2025
    Authors
    Unidata
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Spanish Speech Dataset for recognition task

    Dataset comprises 488 hours of telephone dialogues in Spanish, collected from 600 native speakers across various topics and domains. This dataset boasts an impressive 98% word accuracy rate, making it a valuable resource for advancing speech recognition technology.

    By utilizing this dataset, researchers and developers can advance their understanding and capabilities in automatic speech recognition (ASR) systems, audio transcription, and natural language processing (NLP).

    The dataset includes high-quality audio recordings with text transcriptions, making it ideal for training and evaluating speech recognition models.

    💵 Buy the Dataset: This is a limited preview of the data. To access the full dataset, please contact us at https://unidata.pro to discuss your requirements and pricing options.

    Metadata for the dataset

    • Audio files: High-quality recordings in WAV format
    • Text transcriptions: Accurate and detailed transcripts for each audio segment
    • Speaker information: Metadata on native speakers, including gender, etc.
    • Topics: Diverse domains such as general conversations, business, etc.

    This dataset is a valuable resource for researchers and developers working on speech recognition, language models, and speech technology.

    🌐 UniData provides high-quality datasets, content moderation, data collection and annotation for your AI/ML projects

  2. Speech Synthesis Data | 400 Hours | TTS Data | Audio Data | AI Training...

    • datarade.ai
    Updated Dec 10, 2023
    Cite
    Nexdata (2023). Speech Synthesis Data | 400 Hours | TTS Data | Audio Data | AI Training Data| AI Datasets [Dataset]. https://datarade.ai/data-products/nexdata-multilingual-speech-synthesis-data-400-hours-a-nexdata
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Dec 10, 2023
    Dataset authored and provided by
    Nexdata
    Area covered
    Austria, Malaysia, Belgium, Sweden, China, Philippines, Hong Kong, Singapore, Colombia, Canada
    Description
    1. Specifications

    Format: 44.1 kHz/48 kHz, 16-bit/24-bit, uncompressed WAV, mono channel.

    Recording environment: professional recording studio.

    Recording content: general narrative sentences, interrogative sentences, etc.

    Speaker: native speakers.

    Annotation features: word transcription, part-of-speech, phoneme boundary, four-level accents, four-level prosodic boundary.

    Device: microphone.

    Languages: American English, British English, Japanese, French, Dutch, Cantonese, Canadian French, Australian English, Italian, New Zealand English, Spanish, Mexican Spanish.

    Application scenarios: speech synthesis.

    Accuracy rates:

    • Word transcription: sentence accuracy rate not less than 99%.
    • Part-of-speech annotation: sentence accuracy rate not less than 98%.
    • Phoneme annotation: sentence accuracy rate not less than 98% (errors on voiced and swallowed phonemes are excluded, since their labelling is more subjective).
    • Accent annotation: word accuracy rate not less than 95%.
    • Prosodic boundary annotation: sentence accuracy rate not less than 97%.
    • Phoneme boundary annotation: phoneme accuracy rate not less than 95% (boundary error range within 5%).

    2. About Nexdata

    Nexdata owns off-the-shelf PB-level Large Language Model (LLM) data, 3 million hours of audio data, and 800 TB of annotated imagery data. These ready-to-go AI & ML training data support instant delivery and quickly improve the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/tts?source=Datarade
  3. Speech Words to Text

    • kaggle.com
    zip
    Updated Nov 24, 2020
    Cite
    Shriamrut V (2020). Speech Words to Text [Dataset]. https://www.kaggle.com/datasets/shriamrut/speech-words-to-text
    Explore at:
    Available download formats: zip (1038579384 bytes)
    Dataset updated
    Nov 24, 2020
    Authors
    Shriamrut V
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Context

    A speech words-to-text model that recognizes simple words and converts them to text.

    Content

    The model is trained on TensorFlow's speech recognition dataset. It recognizes words like left, right, up, down, one, two, three, four, five, six, seven, eight, nine, yes, and no. The model achieved an accuracy of 0.9933 on the training set and 0.93 on the test/validation set. To find out how the model was trained, check out this repo: https://github.com/shriamrut/Speech-Words-to-Text.

    Inspiration

    How is audio understood by a computer? That question is where the inspiration came from.

  4. Persian Speech to Test dataset

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    csv
    Updated Dec 28, 2022
    + more versions
    Cite
    Seyyed Mohammad Masoud Parpanchi; Seyyed Mohammad Masoud Parpanchi (2022). Persian Speech to Test dataset [Dataset]. http://doi.org/10.5281/zenodo.7486154
    Explore at:
    Available download formats: csv
    Dataset updated
    Dec 28, 2022
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Seyyed Mohammad Masoud Parpanchi; Seyyed Mohammad Masoud Parpanchi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Persian Speech to Text dataset is an open source dataset for training machine learning models for the task of transcribing audio files in the Persian language into text. It is the largest open source dataset of its kind, with a size of approximately 60GB of data. The dataset consists of audio files in the WAV format and their transcripts in CSV file format. This dataset is a valuable resource for researchers and developers working on natural language processing tasks involving the Persian language, and it provides a large and diverse set of data to train and evaluate machine learning models on. The open source nature of the dataset means that it is freely available to be used and modified by anyone, making it an important resource for advancing research and development in the field.

  5. French Speech Recognition Dataset

    • kaggle.com
    Updated Jun 25, 2025
    + more versions
    Cite
    Unidata (2025). French Speech Recognition Dataset [Dataset]. https://www.kaggle.com/datasets/unidpro/french-speech-recognition-dataset
    Explore at:
    Croissant — a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 25, 2025
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Unidata
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Area covered
    French
    Description

    French Speech Dataset for recognition task

    Dataset comprises 547 hours of telephone dialogues in French, collected from 964 native speakers across various topics and domains, with an impressive 98% Word Accuracy Rate. It is designed for research in speech recognition, focusing on various recognition models, primarily aimed at meeting the requirements for automatic speech recognition (ASR) systems.

    By utilizing this dataset, researchers and developers can advance their understanding and capabilities in natural language processing (NLP), speech recognition, and machine learning technologies.

    The dataset includes high-quality audio recordings with accurate transcriptions, making it ideal for training and evaluating speech recognition models.

    💵 Buy the Dataset: This is a limited preview of the data. To access the full dataset, please contact us at https://unidata.pro to discuss your requirements and pricing options.

    Metadata for the dataset


    • Audio files: High-quality recordings in WAV format
    • Text transcriptions: Accurate and detailed transcripts for each audio segment
    • Speaker information: Metadata on native speakers, including gender, etc.
    • Topics: Diverse domains such as general conversations, business, etc.

    The native speakers and the variety of topics and domains covered make the dataset an ideal resource for the research community, allowing researchers to study spoken languages, dialects, and language patterns.

    🌐 UniData provides high-quality datasets, content moderation, data collection and annotation for your AI/ML projects

  6. german-speech-recognition-dataset

    • huggingface.co
    Updated Aug 2, 2025
    Cite
    Unidata NLP (2025). german-speech-recognition-dataset [Dataset]. https://huggingface.co/datasets/ud-nlp/german-speech-recognition-dataset
    Explore at:
    Dataset updated
    Aug 2, 2025
    Authors
    Unidata NLP
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    German Telephone Dialogues Dataset - 431 Hours

    Dataset comprises 431 hours of high-quality audio recordings from 590+ native German speakers, featuring telephone dialogues across diverse topics and domains. With a 95% sentence accuracy rate, this essential dataset is ideal for training and evaluating German speech recognition systems.

    Dataset characteristics:

    Characteristic | Data
    Description | Audio of telephone dialogues in German for training… See the full description on the dataset page: https://huggingface.co/datasets/ud-nlp/german-speech-recognition-dataset.

  7. Italian Speech Recognition Dataset

    • unidata.pro
    a-law/u-law, pcm
    Cite
    Unidata L.L.C-FZ, Italian Speech Recognition Dataset [Dataset]. https://unidata.pro/datasets/italian-speech-recognition-dataset/
    Explore at:
    Available download formats: a-law/u-law, pcm
    Dataset authored and provided by
    Unidata L.L.C-FZ
    Description

    Unidata’s Italian Speech Recognition dataset refines AI models for better speech-to-text conversion and language comprehension.

  8. AI Training Data | US Transcription Data| Unique Consumer Sentiment Data:...

    • datarade.ai
    Updated Jan 13, 2025
    + more versions
    Cite
    WiserBrand.com (2025). AI Training Data | US Transcription Data| Unique Consumer Sentiment Data: Transcription of the calls to the companies [Dataset]. https://datarade.ai/data-products/wiserbrand-ai-training-data-us-transcription-data-unique-wiserbrand-com
    Explore at:
    Available download formats: .json, .csv, .xls, .txt
    Dataset updated
    Jan 13, 2025
    Dataset provided by
    WiserBrand
    Area covered
    United States
    Description

    WiserBrand's Comprehensive Customer Call Transcription Dataset: Tailored Insights

    WiserBrand offers a customizable dataset comprising transcribed customer call records, meticulously tailored to your specific requirements. This extensive dataset includes:

    • User ID and Firm Name: Identify and categorize calls by unique user IDs and company names.
    • Call Duration: Analyze engagement levels through call lengths.
    • Geographical Information: Detailed data on city, state, and country for regional analysis.
    • Call Timing: Track peak interaction times with precise timestamps.
    • Call Reason and Group: Categorised reasons for calls, helping to identify common customer issues.
    • Device and OS Types: Information on the devices and operating systems used, for technical support analysis.
    • Transcriptions: Full-text transcriptions of each call, enabling sentiment analysis, keyword extraction, and detailed interaction reviews.
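    The field list above can be read as one flat record per transcribed call. A minimal sketch of such a schema in Python (the class and field names here are hypothetical illustrations, not WiserBrand's actual column names, which depend on the agreed customization):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class CallRecord:
    """Hypothetical per-call record mirroring the fields listed above."""
    user_id: str            # User ID
    firm_name: str          # Firm Name
    call_duration_s: int    # Call Duration, in seconds
    city: str               # Geographical Information
    state: str
    country: str
    call_time: datetime     # Call Timing (timestamp)
    call_reason: str        # Call Reason and Group
    call_group: str
    device_type: str        # Device and OS Types
    os_type: str
    transcription: str      # Full-text transcription of the call
```

    A record like this can then feed sentiment analysis or keyword extraction directly from the `transcription` field.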

    WiserBrand's dataset is essential for companies looking to leverage Consumer Data and B2B Marketing Data to drive their strategic initiatives in the English-speaking markets of the USA, UK, and Australia. By accessing this rich dataset, businesses can uncover trends and insights critical for improving customer engagement and satisfaction.

    Cases:

    1. Training Speech Recognition (Speech-to-Text) and Speech Synthesis (Text-to-Speech) Models

    WiserBrand's Comprehensive Customer Call Transcription Dataset is an excellent resource for training and improving speech recognition models (Speech-to-Text, STT) and speech synthesis systems (Text-to-Speech, TTS). Here’s how this dataset can contribute to these tasks:

    Enriching STT Models: The dataset comprises a diverse range of real-world customer service calls, featuring various accents, tones, and terminologies. This makes it highly valuable for training speech-to-text models to better recognize different dialects, regional speech patterns, and industry-specific jargon. It could help improve accuracy in transcribing conversations in customer service, sales, or technical support.

    Contextualized Speech Recognition: Given the contextual information (e.g., reasons for calls, call categories, etc.), it can help models differentiate between various types of conversations (technical support vs. sales queries), which would improve the model’s ability to transcribe in a more contextually relevant manner.

    Improving TTS Systems: The transcriptions, along with their associated metadata (such as call duration, timing, and call reason), can aid in training Text-to-Speech models that mimic natural conversation patterns, including pauses, tone variation, and proper intonation. This is especially beneficial for developing conversational agents that sound more natural and human-like in their responses.

    Noise and Speech Quality Handling: Real-world customer service calls often contain background noise, overlapping speech, and interruptions, which are crucial elements for training speech models to handle real-life scenarios more effectively.

    2. Training AI Agents for Replacing Customer Service Representatives

    WiserBrand’s dataset can be incredibly valuable for businesses looking to develop AI-powered customer support agents that can replace or augment human customer service representatives. Here’s how this dataset supports AI agent training:

    Customer Interaction Simulation: The transcriptions provide a comprehensive view of real customer interactions, including common queries, complaints, and support requests. By training AI models on this data, businesses can equip their virtual agents with the ability to understand customer concerns, follow up on issues, and provide meaningful solutions, all while mimicking human-like conversational flow.

    Sentiment Analysis and Emotional Intelligence: The full-text transcriptions, along with associated call metadata (e.g., reason for the call, call duration, and geographical data), allow for sentiment analysis, enabling AI agents to gauge the emotional tone of customers. This helps the agents respond appropriately, whether it’s providing reassurance during frustrating technical issues or offering solutions in a polite, empathetic manner. Such capabilities are essential for improving customer satisfaction in automated systems.

    Customizable Dialogue Systems: The dataset allows for categorizing and identifying recurring call patterns and issues. This means AI agents can be trained to recognize the types of queries that come up frequently, allowing them to automate routine tasks such as order inquiries, account management, or technical troubleshooting without needing human intervention.

    Improving Multilingual and Cross-Regional Support: Given that the dataset includes geographical information (e.g., city, state, and country), AI agents can be trained to recognize region-specific slang, phrases, and cultural nuances, which is particularly valuable for multinational companies operating in diverse markets (e.g., the USA, UK, and Australia...

  9. ShefCE: A Cantonese-English bilingual speech corpus -- speech recognition...

    • orda.shef.ac.uk
    application/gzip
    Updated May 31, 2023
    Cite
    Wai Man Ng; Alvin C.M. Kwan; Tan Lee; Thomas Hain (2023). ShefCE: A Cantonese-English bilingual speech corpus -- speech recognition model sets and recording transcripts [Dataset]. http://doi.org/10.15131/shef.data.4522925.v1
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    May 31, 2023
    Dataset provided by
    The University of Sheffield
    Authors
    Wai Man Ng; Alvin C.M. Kwan; Tan Lee; Thomas Hain
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This online repository contains the speech recognition model sets and the recording transcripts used in the phoneme/syllable recognition experiments reported in [1].

    Speech recognition model sets

    The speech recognition model sets are available as a tarball, named model.tar.gz, in this repository. The models were trained on Cantonese and English data. For each language, two model sets were trained, according to the background setting and the mixed-condition setting respectively. All models are DNN-HMM models: hybrid feed-forward neural network models with 6 hidden layers and 2048 neurons per layer. Details can be found in [1]. The Cantonese models include a bigram syllable language model; the English models include a bigram phoneme language model. All model sets are provided in the Kaldi format.

    1. The background-cantonese model was trained on CUSENT (68 speakers, 19.4 hours) of read Cantonese speech.
    2. The background-english model was trained on WSJ-SI84 (83 speakers, 15.2 hours) of read English speech.
    3. The mixed-condition-cantonese model was trained on background-cantonese data and ShefCE Cantonese training data (25 speakers, 9.7 hours).
    4. The mixed-condition-english model was trained on background-english data and ShefCE English training data (25 speakers, 2.3 hours).

    Recording transcripts

    The recording transcripts are available as a tarball, named stms.tar.gz, in this repository. These transcripts cover the ShefCE portion of the training data and the ShefCE test data. Four files can be found in the stms.tar.gz archive:

    - ShefCE_RC.train.v*.stm contains the transcripts for the ShefCE training set (Cantonese)
    - ShefCE_RE.train.v*.stm contains the transcripts for the ShefCE training set (English)
    - ShefCE_RC.test.v*.stm contains the transcripts for the ShefCE test set (Cantonese)
    - ShefCE_RE.test.v*.stm contains the transcripts for the ShefCE test set (English)

    The ShefCE corpus data can be accessed online with DOI: 10.15131/shef.data.4522907. Please cite [1] for the use of ShefCE data, models or transcripts.

    [1] Raymond W. M. Ng, Alvin C.M. Kwan, Tan Lee and Thomas Hain, "ShefCE: A Cantonese-English Bilingual Speech Corpus for Pronunciation Assessment", in Proc. the 42nd IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.

  10. Arabic Speech Commands Dataset

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Apr 5, 2021
    Cite
    Abdulkader Ghandoura; Abdulkader Ghandoura (2021). Arabic Speech Commands Dataset [Dataset]. http://doi.org/10.5281/zenodo.4662481
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 5, 2021
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Abdulkader Ghandoura; Abdulkader Ghandoura
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Arabic Speech Commands Dataset

    This dataset is designed to help train simple machine learning models that serve educational and research purposes in the speech recognition domain, mainly for keyword spotting tasks.

    Dataset Description

    Our dataset is a list of pairs (x, y), where x is the input speech signal, and y is the corresponding keyword. The final dataset consists of 12000 such pairs, comprising 40 keywords. Each audio file is one-second in length sampled at 16 kHz. We have 30 participants, each of them recorded 10 utterances for each keyword. Therefore, we have 300 audio files for each keyword in total (30 * 10 * 40 = 12000), and the total size of all the recorded keywords is ~384 MB. The dataset also contains several background noise recordings we obtained from various natural sources of noise. We saved these audio files in a separate folder with the name background_noise and a total size of ~49 MB.

    Dataset Structure

    There are 40 folders, each of which represents one keyword and contains 300 files. The first eight digits of each file name identify the contributor, while the last two digits identify the round number. For example, the file path rotate/00000021_NO_06.wav indicates that the contributor with the ID 00000021 pronounced the keyword rotate for the 6th time.
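    The file-naming convention above can be unpacked mechanically. A minimal Python sketch (the helper name `parse_command_path` is mine, not part of the dataset):

```python
from pathlib import Path

def parse_command_path(path: str):
    """Split an Arabic Speech Commands file path into its documented parts.

    Example path: rotate/00000021_NO_06.wav
      parent folder    -> keyword
      first 8 digits   -> contributor ID
      last two digits  -> round number
    """
    p = Path(path)
    keyword = p.parent.name
    stem = p.stem                 # e.g. "00000021_NO_06"
    contributor_id = stem[:8]
    round_number = int(stem[-2:])
    return keyword, contributor_id, round_number

print(parse_command_path("rotate/00000021_NO_06.wav"))
# ('rotate', '00000021', 6)
```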

    Data Split

    We recommend using the provided CSV files in your experiments. We kept 60% of the dataset for training, 20% for validation, and the remaining 20% for testing. In our split method, we guarantee that all recordings of a certain contributor are within the same subset.
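    The provided CSV files already implement this split; purely as an illustration of the speaker-disjoint guarantee described above, a 60/20/20 split by contributor could be sketched as follows (function name, seed, and exact cut points are assumptions):

```python
import random
from collections import defaultdict

def speaker_disjoint_split(paths, seed=0):
    """Illustrative 60/20/20 split that keeps each contributor in one subset.

    `paths` are strings like "rotate/00000021_NO_06.wav"; the first 8 digits
    of the file name identify the contributor, so we group by that prefix
    and assign whole speakers to a subset, never individual recordings.
    """
    by_speaker = defaultdict(list)
    for p in paths:
        speaker = p.split("/")[-1][:8]
        by_speaker[speaker].append(p)

    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)
    n = len(speakers)
    cut1, cut2 = int(0.6 * n), int(0.8 * n)

    split = {"train": [], "val": [], "test": []}
    for i, s in enumerate(speakers):
        subset = "train" if i < cut1 else "val" if i < cut2 else "test"
        split[subset].extend(by_speaker[s])
    return split
```

    Because whole speakers are assigned, no contributor's voice leaks from training into validation or test.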

    License

    This dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. For more details, see the LICENSE file in this folder.

    Citations

    If you want to use the Arabic Speech Commands dataset in your work, please cite it as:

    @article{arabicspeechcommandsv1,
       author = {Ghandoura, Abdulkader and Hjabo, Farouk and Al Dakkak, Oumayma},
       title = {Building and Benchmarking an Arabic Speech Commands Dataset for Small-Footprint Keyword Spotting},
       journal = {Engineering Applications of Artificial Intelligence},
       year = {2021},
       publisher={Elsevier}
    }

  11. SoraniTTS dataset: Central Kurdish (CK) Speech Corpus for Text-to-Speech

    • data.mendeley.com
    Updated Sep 16, 2025
    Cite
    Farhad Rahimi (2025). SoraniTTS dataset : Central Kurdish (CK) Speech Corpus for Text-to-Speech [Dataset]. http://doi.org/10.17632/jmtn248cc9.5
    Explore at:
    Dataset updated
    Sep 16, 2025
    Authors
    Farhad Rahimi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset represents a comprehensive resource for advancing Kurdish TTS systems. Converting text to speech is one of the important topics in the design and construction of multimedia systems, human-machine communication, and information and communication technology. Its purpose, along with speech recognition, is to establish communication between humans and machines in its most basic and natural form, that is, spoken language.
    For our text corpus, we collected 6,565 sentences from texts in various categories, including news, sport, health, question and exclamation sentences, science, general information, politics, education and literature, story, miscellaneous, and tourism, to create the training sentences. We thoroughly reviewed and normalized the texts, and then they were recorded by a male speaker. We recorded the audio in a voice recording studio at 44,100 Hz, and all audio files are downsampled to 22,050 Hz in our modeling process. The audio ranges from 3 to 36 seconds in length. The resulting speech corpus contains about 6,565 text and audio pairings, totalling around 19 hours. The audio files are saved in WAV format, and the texts are saved in text files in the corresponding sub-folders. Furthermore, for model training, all of the audio files are gathered in a single folder. Each line in the transcript files is formatted as WAVS | audio file's name.wav | transcript, where the audio file's name includes the extension and the transcript is the text of the speech. The audio recording and editing process lasted 90 days and involved capturing over 6,565 WAV files and over 19 hours of recorded speech. The dataset helps researchers improve Kurdish TTS sooner, thereby reducing the time consumed for this process.
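    The pipe-delimited transcript format described above is easy to consume. A short Python sketch (the helper name and the sample file name are illustrative assumptions, not taken from the corpus):

```python
def parse_transcript_line(line: str):
    """Parse one transcript line of the form:
    WAVS | audio file's name.wav | transcript
    and return the (wav_name, transcript) pair.
    """
    # Split on at most two pipes so pipes inside the transcript survive.
    tag, wav_name, text = (field.strip() for field in line.split("|", 2))
    if not wav_name.endswith(".wav"):
        raise ValueError(f"expected a .wav file name, got {wav_name!r}")
    return wav_name, text

print(parse_transcript_line("WAVS | sample_0001.wav | sample transcript"))
# ('sample_0001.wav', 'sample transcript')
```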

    Acknowledgments: We would like to express our sincere gratitude to Ayoub Mohammadzadeh for his invaluable support in recording the corpus.

  12. Artificial Intelligence Training Dataset Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated May 3, 2025
    + more versions
    Cite
    Data Insights Market (2025). Artificial Intelligence Training Dataset Report [Dataset]. https://www.datainsightsmarket.com/reports/artificial-intelligence-training-dataset-1958994
    Explore at:
    Available download formats: doc, ppt, pdf
    Dataset updated
    May 3, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global Artificial Intelligence (AI) Training Dataset market is experiencing robust growth, driven by the increasing adoption of AI across diverse sectors. The market's expansion is fueled by the burgeoning need for high-quality data to train sophisticated AI algorithms capable of powering applications like smart campuses, autonomous vehicles, and personalized healthcare solutions. The demand for diverse dataset types, including image classification, voice recognition, natural language processing, and object detection datasets, is a key factor contributing to market growth. While the exact market size in 2025 is unavailable, considering a conservative estimate of a $10 billion market in 2025 based on the growth trend and reported market sizes of related industries, and a projected CAGR (Compound Annual Growth Rate) of 25%, the market is poised for significant expansion in the coming years. Key players in this space are leveraging technological advancements and strategic partnerships to enhance data quality and expand their service offerings. Furthermore, the increasing availability of cloud-based data annotation and processing tools is further streamlining operations and making AI training datasets more accessible to businesses of all sizes. Growth is expected to be particularly strong in regions with burgeoning technological advancements and substantial digital infrastructure, such as North America and Asia Pacific. However, challenges such as data privacy concerns, the high cost of data annotation, and the scarcity of skilled professionals capable of handling complex datasets remain obstacles to broader market penetration. The ongoing evolution of AI technologies and the expanding applications of AI across multiple sectors will continue to shape the demand for AI training datasets, pushing this market toward higher growth trajectories in the coming years. 
The diversity of applications—from smart homes and medical diagnoses to advanced robotics and autonomous driving—creates significant opportunities for companies specializing in this market. Maintaining data quality, security, and ethical considerations will be crucial for future market leadership.
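    The $10 billion 2025 base and 25% CAGR quoted above compound, arithmetically, to roughly $60 billion by 2033. A quick back-of-envelope check (the out-year figures are my extrapolation from those two inputs, not numbers stated in the report):

```python
# Compound the report's conservative 2025 base at the projected 25% CAGR.
base_2025_usd_bn = 10.0   # conservative 2025 market estimate, $ billions
cagr = 0.25               # projected compound annual growth rate

for year in (2029, 2033):
    size = base_2025_usd_bn * (1 + cagr) ** (year - 2025)
    print(f"{year}: ~${size:.1f}B")
# 2029: ~$24.4B
# 2033: ~$59.6B
```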

  13. Speech Synthesis Data | 400 Hours |Text-to-Speech(TTS) Data | Audio Data |...

    • data.nexdata.ai
    Updated Aug 3, 2024
    Cite
    Nexdata (2024). Speech Synthesis Data | 400 Hours |Text-to-Speech(TTS) Data | Audio Data | AI Training Data| Speech Data [Dataset]. https://data.nexdata.ai/products/nexdata-multilingual-speech-synthesis-data-400-hours-a-nexdata
    Explore at:
    Dataset updated
    Aug 3, 2024
    Dataset authored and provided by
    Nexdata
    Area covered
    Singapore, Hong Kong, New Zealand, Poland, Greece, United States, Romania, South Korea, Mexico, Hungary
    Description

    Text-to-Speech (TTS) data is recorded by native speakers with authentic accents and pleasant voices. The phoneme coverage is balanced, and professional phoneticians participate in the annotation. It precisely matches the research and development needs of speech synthesis.

  14. spanish-speech-recognition-dataset

    • huggingface.co
    Updated Jul 30, 2025
    + more versions
    Cite
    Unidata NLP (2025). spanish-speech-recognition-dataset [Dataset]. https://huggingface.co/datasets/ud-nlp/spanish-speech-recognition-dataset
    Explore at:
    Dataset updated
    Jul 30, 2025
    Authors
    Unidata NLP
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Spanish Telephone Dialogues Dataset - 488 Hours

    Dataset comprises 488 hours of high-quality telephone audio recordings in Spanish, featuring 600 native speakers and achieving a 95% sentence accuracy rate. Designed for advancing speech recognition models and language processing, this extensive speech data corpus covers diverse topics and domains, making it ideal for training robust automatic speech recognition (ASR) systems.

      Dataset characteristics:… See the full description on the dataset page: https://huggingface.co/datasets/ud-nlp/spanish-speech-recognition-dataset.
    
  15. Bengali Speech Recognition Dataset (BSRD)

    • kaggle.com
    zip
    Updated Jan 14, 2025
    Cite
    Shuvo Kumar Basak-4004 (2025). Bengali Speech Recognition Dataset (BSRD) [Dataset]. https://www.kaggle.com/datasets/shuvokumarbasak4004/bengali-speech-recognition-dataset-bsrd
    Explore at:
    zip(300882482 bytes)Available download formats
    Dataset updated
    Jan 14, 2025
    Authors
    Shuvo Kumar Basak-4004
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The BengaliSpeechRecognitionDataset (BSRD) is a comprehensive dataset designed for the development and evaluation of Bengali speech recognition and text-to-speech systems. It includes a collection of Bengali characters and their corresponding audio files, generated using speech synthesis models, and serves as an essential resource for researchers and developers working on automatic speech recognition (ASR) and text-to-speech (TTS) applications for the Bengali language.

    Key features:

    •Bengali characters: The dataset contains a wide range of Bengali characters, including consonants, vowels, and unique symbols used in the Bengali script, such as 'ক', 'খ', and 'গ'.
    •Corresponding speech data: Each Bengali character comes with an MP3 audio file containing its correct pronunciation, generated by a Bengali text-to-speech model to ensure clear and accurate pronunciation.
    •1000 audio samples per folder: Each character is associated with at least 1000 MP3 files. These multiple samples provide variations of the character's pronunciation, which is essential for training robust speech recognition systems.
    •Language and phonetic diversity: The dataset covers different tones and pronunciations commonly found in spoken Bengali, so models trained on it can recognize diverse speech patterns.
    •Use cases:
      o Automatic speech recognition (ASR): accurate audio samples linked to specific Bengali characters make BSRD ideal for training ASR systems.
      o Text-to-speech (TTS): researchers can use the dataset to fine-tune TTS systems for generating natural Bengali speech from text.
      o Phonetic analysis: the dataset supports phonetic analysis and models that study the linguistic features of Bengali pronunciation.
    •Applications:
      o Voice assistants: building and training voice recognition systems and personal assistants that understand Bengali.
      o Speech-to-text systems: developing accurate transcription systems for Bengali audio content.
      o Language learning tools: creating educational tools aimed at teaching Bengali pronunciation.

    Note for researchers using the dataset

    This dataset was created by Shuvo Kumar Basak. If you use this dataset for your research or academic purposes, please cite it appropriately. If you have published research using this dataset, please share a link to your paper. Good luck.
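The folder layout described above (one directory per character, each holding at least 1000 MP3 clips) can be indexed with a short script. This is a minimal sketch; the root path and folder names are hypothetical, not the dataset's published layout:

```python
# Sketch: build a character -> audio-file manifest from a BSRD-style
# folder layout (one folder per Bengali character, MP3 clips inside).
from pathlib import Path

def build_manifest(root):
    """Map each character folder name to its sorted list of MP3 paths."""
    root = Path(root)
    manifest = {}
    for char_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        clips = sorted(char_dir.glob("*.mp3"))
        if clips:  # skip folders with no audio
            manifest[char_dir.name] = clips
    return manifest
```

A manifest like this is a convenient starting point for train/validation splits or for feeding paths into an audio-loading pipeline.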

  16. British English Scripted Monologue Speech Data for Healthcare

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). British English Scripted Monologue Speech Data for Healthcare [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/healthcare-scripted-speech-monologues-english-uk
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    United Kingdom
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Introducing the UK English Scripted Monologue Speech Dataset for the Healthcare Domain, a voice dataset built to accelerate the development and deployment of English language automatic speech recognition (ASR) systems, with a sharp focus on real-world healthcare interactions.

    Speech Data

    This dataset includes over 6,000 high-quality scripted audio prompts recorded in UK English, representing typical voice interactions found in the healthcare industry. The data is tailored for use in voice technology systems that power virtual assistants, patient-facing AI tools, and intelligent customer service platforms.

    Participant Diversity

    •Speakers: 60 native UK English speakers.
    •Regional Balance: Participants are sourced from multiple regions across the United Kingdom, reflecting diverse dialects and linguistic traits.
    •Demographics: Includes a mix of male and female participants (60:40 ratio), aged between 18 and 70 years.

    Recording Specifications

    •Nature of Recordings: Scripted monologues based on healthcare-related use cases.
    •Duration: Each clip ranges from 5 to 30 seconds, offering short, context-rich speech samples.
    •Audio Format: WAV files recorded in mono, with 16-bit depth and sample rates of 8 kHz and 16 kHz.
    •Environment: Clean and echo-free spaces ensure clear and noise-free audio capture.
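The recording specifications above can be verified programmatically before training. This is a minimal sketch using only the Python standard library; the file path is hypothetical:

```python
# Sketch: check that a clip matches the stated specs
# (mono WAV, 16-bit depth, 8 kHz or 16 kHz sample rate).
import wave

def check_specs(path):
    """Return True if the WAV file matches the dataset's stated format."""
    with wave.open(path, "rb") as w:
        return (
            w.getnchannels() == 1             # mono
            and w.getsampwidth() == 2         # 16-bit = 2 bytes per sample
            and w.getframerate() in (8000, 16000)
        )
```

Running a check like this over the whole corpus catches mis-encoded files early, before they silently degrade model training.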

    Topic Coverage

    The prompts span a broad range of healthcare-specific interactions, such as:

    •Patient check-in and follow-up communication
    •Appointment booking and cancellation dialogues
    •Insurance and regulatory support queries
    •Medication, test results, and consultation discussions
    •General health tips and wellness advice
    •Emergency and urgent care communication
    •Technical support for patient portals and apps
    •Domain-specific scripted statements and FAQs

    Contextual Depth

    To maximize authenticity, the prompts integrate linguistic elements and healthcare-specific terms such as:

    •Names: Gender- and region-appropriate United Kingdom names
    •Addresses: Varied local address formats spoken naturally
    •Dates & Times: References to appointment dates, times, follow-ups, and schedules
    •Medical Terminology: Common medical procedures, symptoms, and treatment references
    •Numbers & Measurements: Health data like dosages, vitals, and test result values
    •Healthcare Institutions: Names of clinics, hospitals, and diagnostic centers

    These elements make the dataset exceptionally suited for training AI systems to understand and respond to natural healthcare-related speech patterns.

    Transcription

    Every audio recording is accompanied by a verbatim, manually verified transcription.

    •Content: The transcription mirrors the exact scripted prompt recorded by the speaker.
    •Format: Files are delivered in plain text (.TXT) format with consistent naming conventions for seamless integration.
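Since each recording and its transcript share a base filename, pairing them is straightforward. A minimal sketch, assuming separate audio and text directories (the actual delivery layout may differ):

```python
# Sketch: pair each WAV clip with its .TXT transcript via the shared
# base filename, skipping clips with no matching transcript.
from pathlib import Path

def pair_clips(audio_dir, text_dir):
    """Yield (wav_path, transcript_text) pairs."""
    text_dir = Path(text_dir)
    for wav in sorted(Path(audio_dir).glob("*.wav")):
        txt = text_dir / (wav.stem + ".txt")
        if txt.exists():
            yield wav, txt.read_text(encoding="utf-8").strip()
```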

  17. french-speech-recognition-dataset

    • huggingface.co
    Updated Sep 29, 2025
    Cite
    Unidata NLP (2025). french-speech-recognition-dataset [Dataset]. https://huggingface.co/datasets/ud-nlp/french-speech-recognition-dataset
    Explore at:
    Dataset updated
    Sep 29, 2025
    Authors
    Unidata NLP
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Area covered
    French
    Description

    French Telephone Dialogues Dataset - 547 Hours

    This speech recognition dataset comprises 547 hours of telephone dialogues in French from 964 native speakers, providing audio recordings with detailed annotations (text, speaker ID, gender, age) to support speech recognition systems, natural language processing, and deep learning models for training and evaluating automatic speech recognition technology. - Get the data

      Dataset characteristics: see the full description on the dataset page: https://huggingface.co/datasets/ud-nlp/french-speech-recognition-dataset.
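The per-utterance annotations listed above (text, speaker ID, gender, age) could be modeled as a simple record. The field names below are assumptions for illustration; the vendor's actual schema is not published on this page:

```python
# Sketch: a hypothetical record type for one annotated utterance.
from dataclasses import dataclass

@dataclass
class Utterance:
    audio_path: str   # path to the audio clip
    text: str         # verbatim transcript
    speaker_id: str   # anonymized speaker identifier
    gender: str       # speaker gender label
    age: int          # speaker age in years

u = Utterance("dialog_0001.wav", "Bonjour, je voudrais un rendez-vous.",
              "spk_042", "F", 34)
```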

  18. Scripted Monologues Speech Data | 65,000 Hours | GenAI Audio Data|...

    • datarade.ai
    Updated Dec 11, 2023
    Cite
    Nexdata (2023). Scripted Monologues Speech Data | 65,000 Hours | GenAI Audio Data| Text-to-Speech Data| Multilingual Language Data [Dataset]. https://datarade.ai/data-products/nexdata-multilingual-read-speech-data-65-000-hours-aud-nexdata
    Explore at:
    .bin, .json, .xml, .csv, .xls, .sql, .txtAvailable download formats
    Dataset updated
    Dec 11, 2023
    Dataset authored and provided by
    Nexdata
    Area covered
    Russian Federation, Lebanon, Japan, Netherlands, Cambodia, Jordan, El Salvador, Brazil, Slovenia, France
    Description
    1. Specifications

    Format : 16kHz, 16bit, uncompressed wav, mono channel

    Recording environment : quiet indoor environment, without echo

    Recording content (read speech) : economy, entertainment, news, oral language, numbers, letters

    Speaker : native speaker, gender balance

    Device : Android mobile phone, iPhone

    Language : 100+ languages

    Transcription content : text, time point of speech data, 5 noise symbols, 5 special identifiers

    Accuracy rate : 95% (the accuracy rate of noise symbols and other identifiers is not included)

    Application scenarios : speech recognition, voiceprint recognition
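The quoted transcription accuracy rate can be related to word error rate (WER). A minimal sketch of WER via word-level edit distance, under the assumption that word accuracy is taken as 1 āˆ’ WER (the vendor's exact scoring method may differ):

```python
# Sketch: word error rate via Levenshtein distance over word tokens.
def wer(reference, hypothesis):
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / max(len(r), 1)
```

For example, one substituted word in a four-word reference gives a WER of 0.25, i.e. 75% word accuracy under this convention.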

    2. About Nexdata

    Nexdata owns off-the-shelf PB-level Large Language Model (LLM) data, 3 million hours of multilingual language data, and 800 TB of computer vision data. These ready-to-go machine learning (ML) datasets support instant delivery and quickly improve the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/speechrecog?source=Datarade
  19. american-speech-recognition-dataset

    • huggingface.co
    Updated Sep 29, 2025
    Cite
    Unidata NLP (2025). american-speech-recognition-dataset [Dataset]. https://huggingface.co/datasets/ud-nlp/american-speech-recognition-dataset
    Explore at:
    Dataset updated
    Sep 29, 2025
    Authors
    Unidata NLP
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    American Telephone Dialogues Dataset - 1,136 Hours

    The dataset includes 1,136 hours of annotated telephone dialogues from 1,416 native speakers across the United States. Designed for advancing speech recognition models and language processing, this extensive speech data corpus covers diverse topics and domains, making it ideal for training robust automatic speech recognition (ASR) systems. - Get the data

      Dataset characteristics: see the full description on the dataset page: https://huggingface.co/datasets/ud-nlp/american-speech-recognition-dataset.

  20. Speech Recognition Data Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Nov 7, 2025
    Cite
    Data Insights Market (2025). Speech Recognition Data Report [Dataset]. https://www.datainsightsmarket.com/reports/speech-recognition-data-1966543
    Explore at:
    pdf, ppt, docAvailable download formats
    Dataset updated
    Nov 7, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policyhttps://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    Explore the booming Speech Recognition Data market, driven by AI advancements and voice technology adoption. Discover market size, CAGR, key drivers, and regional trends shaping the future of voice data.
