Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The dataset comprises 488 hours of telephone dialogues in Spanish, collected from 600 native speakers across various topics and domains. With a 98% word accuracy rate, it is a valuable resource for advancing speech recognition technology.
By utilizing this dataset, researchers and developers can advance their understanding and capabilities in automatic speech recognition (ASR), audio transcription, and natural language processing (NLP).
The dataset includes high-quality audio recordings with text transcriptions, making it ideal for training and evaluating speech recognition models.
- Audio files: High-quality recordings in WAV format
- Text transcriptions: Accurate and detailed transcripts for each audio segment
- Speaker information: Metadata on native speakers, including gender and other attributes
- Topics: Diverse domains, such as general conversations, business, and more
This dataset is a valuable resource for researchers and developers working on speech recognition, language models, and speech technology.
Recording environment: professional recording studio.
Recording content: general narrative sentences, interrogative sentences, etc.
Speaker: native speakers
Annotation features: word transcription, part-of-speech, phoneme boundaries, four-level accents, four-level prosodic boundaries
Device: microphone
Languages: American English, British English, Japanese, French, Dutch, Cantonese, Canadian French, Australian English, Italian, New Zealand English, Spanish, Mexican Spanish
Application scenarios: speech synthesis
Accuracy rates:
- Word transcription: sentence accuracy rate of at least 99%
- Part-of-speech annotation: sentence accuracy rate of at least 98%
- Phoneme annotation: sentence accuracy rate of at least 98% (errors on voiced and swallowed phonemes are not counted, as their labelling is more subjective)
- Accent annotation: word accuracy rate of at least 95%
- Prosodic boundary annotation: sentence accuracy rate of at least 97%
- Phoneme boundary annotation: phoneme accuracy rate of at least 95% (boundary error tolerance within 5%)
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
A speech-to-text model that recognizes simple spoken words and converts them to text.
The model is trained on TensorFlow's speech recognition dataset and recognizes the words left, right, up, down, one, two, three, four, five, six, seven, eight, nine, yes, and no. It achieved an accuracy of 0.9933 on the training set and 0.93 on the test/validation set. To find out how the model was trained, check out this repo: https://github.com/shriamrut/Speech-Words-to-Text.
How is audio understood by a computer? That question is where the inspiration came from.
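For illustration, here is a minimal inference sketch for a keyword classifier of this kind. The model file name, the spectrogram front-end, and the label order are assumptions made for the sketch, not details taken from the repo above.

```python
import numpy as np
import tensorflow as tf

# Hypothetical label order; check the training repo for the actual one.
LABELS = ["left", "right", "up", "down", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine", "yes", "no"]

# Assumes the trained model was exported as a Keras HDF5 file.
model = tf.keras.models.load_model("speech_words_model.h5")

def predict_word(waveform: np.ndarray) -> str:
    """Classify a one-second, 16 kHz mono waveform into one of the keywords."""
    # A magnitude-spectrogram front-end is assumed here; the real model
    # may use a different feature pipeline.
    spectrogram = tf.signal.stft(waveform.astype("float32"),
                                 frame_length=255, frame_step=128)
    spectrogram = tf.abs(spectrogram)[tf.newaxis, ..., tf.newaxis]
    probs = model.predict(spectrogram, verbose=0)[0]
    return LABELS[int(np.argmax(probs))]
```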
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Persian Speech to Text dataset is an open source dataset for training machine learning models for the task of transcribing audio files in the Persian language into text. It is the largest open source dataset of its kind, with a size of approximately 60GB of data. The dataset consists of audio files in the WAV format and their transcripts in CSV file format. This dataset is a valuable resource for researchers and developers working on natural language processing tasks involving the Persian language, and it provides a large and diverse set of data to train and evaluate machine learning models on. The open source nature of the dataset means that it is freely available to be used and modified by anyone, making it an important resource for advancing research and development in the field.
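Since the release pairs WAV audio with CSV transcripts, a minimal loading sketch follows; the column names 'path' and 'text' are assumptions and should be checked against the actual CSV header.

```python
import csv
import wave

def load_persian_corpus(csv_path: str):
    """Yield (audio_path, transcript) pairs from a transcript CSV.
    The column names 'path' and 'text' are assumptions, not documented facts."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            yield row["path"], row["text"]

def wav_duration_seconds(wav_path: str) -> float:
    """Return the duration of a WAV file using only the standard library."""
    with wave.open(wav_path, "rb") as w:
        return w.getnframes() / w.getframerate()
```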
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The dataset comprises 547 hours of telephone dialogues in French, collected from 964 native speakers across various topics and domains, with a 98% word accuracy rate. It is designed for research in speech recognition, covering a variety of recognition models and aimed primarily at meeting the requirements of automatic speech recognition (ASR) systems.
By utilizing this dataset, researchers and developers can advance their understanding and capabilities in natural language processing (NLP), speech recognition, and machine learning technologies.
The dataset includes high-quality audio recordings with accurate transcriptions, making it ideal for training and evaluating speech recognition models.
The native speakers and the various topics and domains covered make the dataset an ideal resource for the research community, allowing researchers to study spoken language, dialects, and language patterns.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
German Telephone Dialogues Dataset - 431 Hours
The dataset comprises 431 hours of high-quality audio recordings from 590+ native German speakers, featuring telephone dialogues across diverse topics and domains. With a 95% sentence accuracy rate, it is ideal for training and evaluating German speech recognition systems.
Dataset characteristics:
Description: Audio of telephone dialogues in German for training… See the full description on the dataset page: https://huggingface.co/datasets/ud-nlp/german-speech-recognition-dataset.
Unidata's Italian Speech Recognition dataset refines AI models for better speech-to-text conversion and language comprehension.
WiserBrand's Comprehensive Customer Call Transcription Dataset: Tailored Insights
WiserBrand offers a customizable dataset comprising transcribed customer call records, meticulously tailored to your specific requirements. This extensive dataset includes:
WiserBrand's dataset is essential for companies looking to leverage Consumer Data and B2B Marketing Data to drive their strategic initiatives in the English-speaking markets of the USA, UK, and Australia. By accessing this rich dataset, businesses can uncover trends and insights critical for improving customer engagement and satisfaction.
Cases:
WiserBrand's Comprehensive Customer Call Transcription Dataset is an excellent resource for training and improving speech recognition models (Speech-to-Text, STT) and speech synthesis systems (Text-to-Speech, TTS). Here's how this dataset can contribute to these tasks:
Enriching STT Models: The dataset comprises a diverse range of real-world customer service calls, featuring various accents, tones, and terminologies. This makes it highly valuable for training speech-to-text models to better recognize different dialects, regional speech patterns, and industry-specific jargon. It could help improve accuracy in transcribing conversations in customer service, sales, or technical support.
Contextualized Speech Recognition: Given the contextual information (e.g., reasons for calls, call categories, etc.), it can help models differentiate between various types of conversations (technical support vs. sales queries), which would improve the model's ability to transcribe in a more contextually relevant manner.
Improving TTS Systems: The transcriptions, along with their associated metadata (such as call duration, timing, and call reason), can aid in training Text-to-Speech models that mimic natural conversation patterns, including pauses, tone variation, and proper intonation. This is especially beneficial for developing conversational agents that sound more natural and human-like in their responses.
Noise and Speech Quality Handling: Real-world customer service calls often contain background noise, overlapping speech, and interruptions, which are crucial elements for training speech models to handle real-life scenarios more effectively.
Customer Interaction Simulation: The transcriptions provide a comprehensive view of real customer interactions, including common queries, complaints, and support requests. By training AI models on this data, businesses can equip their virtual agents with the ability to understand customer concerns, follow up on issues, and provide meaningful solutions, all while mimicking human-like conversational flow.
Sentiment Analysis and Emotional Intelligence: The full-text transcriptions, along with associated call metadata (e.g., reason for the call, call duration, and geographical data), allow for sentiment analysis, enabling AI agents to gauge the emotional tone of customers. This helps the agents respond appropriately, whether it's providing reassurance during frustrating technical issues or offering solutions in a polite, empathetic manner. Such capabilities are essential for improving customer satisfaction in automated systems.
Customizable Dialogue Systems: The dataset allows for categorizing and identifying recurring call patterns and issues. This means AI agents can be trained to recognize the types of queries that come up frequently, allowing them to automate routine tasks such as order inquiries, account management, or technical troubleshooting without needing human intervention.
Improving Multilingual and Cross-Regional Support: Given that the dataset includes geographical information (e.g., city, state, and country), AI agents can be trained to recognize region-specific slang, phrases, and cultural nuances, which is particularly valuable for multinational companies operating in diverse markets (e.g., the USA, UK, and Australia...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This online repository contains the speech recognition model sets and the recording transcripts used in the phoneme/syllable recognition experiments reported in [1].

Speech recognition model sets

The speech recognition model sets are available as a tarball, named model.tar.gz, in this repository. The models were trained on Cantonese and English data. For each language, two model sets were trained, according to the background setting and the mixed-condition setting respectively. All models are DNN-HMM models, which are hybrid feed-forward neural network models with 6 hidden layers and 2048 neurons per layer. Details can be found in [1]. The Cantonese models include a bigram syllable language model. The English models include a bigram phoneme language model. All model sets are provided in the Kaldi format.

1. The background-cantonese model was trained on CUSENT (68 speakers, 19.4 hours) of read Cantonese speech.
2. The background-english model was trained on WSJ-SI84 (83 speakers, 15.2 hours) of read English speech.
3. The mixed-condition-cantonese model was trained on background-cantonese data and ShefCE Cantonese training data (25 speakers, 9.7 hours).
4. The mixed-condition-english model was trained on background-english data and ShefCE English training data (25 speakers, 2.3 hours).

Recording transcripts

The recording transcripts are available as a tarball, named stms.tar.gz, in this repository. These transcripts cover the ShefCE portion of the training data and the ShefCE test data. Four files can be found in the stms.tar.gz archive:
- ShefCE_RC.train.v*.stm contains the transcripts for the ShefCE training set (Cantonese)
- ShefCE_RE.train.v*.stm contains the transcripts for the ShefCE training set (English)
- ShefCE_RC.test.v*.stm contains the transcripts for the ShefCE test set (Cantonese)
- ShefCE_RE.test.v*.stm contains the transcripts for the ShefCE test set (English)

The ShefCE corpus data can be accessed online with DOI: 10.15131/shef.data.4522907. Please cite [1] for any use of the ShefCE data, models, or transcripts.

[1] Raymond W. M. Ng, Alvin C. M. Kwan, Tan Lee and Thomas Hain, "ShefCE: A Cantonese-English Bilingual Speech Corpus for Pronunciation Assessment", in Proc. 42nd IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
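The .stm files follow the standard NIST STM layout (waveform id, channel, speaker, start time, end time, an optional bracketed label set, then the transcript). A minimal parser sketch under that assumption:

```python
from dataclasses import dataclass

@dataclass
class StmSegment:
    waveform_id: str
    channel: str
    speaker: str
    begin: float
    end: float
    transcript: str

def parse_stm(path: str) -> list[StmSegment]:
    """Parse an STM file, assuming the standard NIST layout."""
    segments = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith(";;"):  # ';;' marks comment lines
                continue
            parts = line.split(None, 5)
            if len(parts) < 6:
                continue  # skip lines without a transcript
            wav_id, channel, speaker, begin, end, rest = parts
            # Drop an optional label set such as <o,f0,male> before the text.
            if rest.startswith("<") and " " in rest:
                rest = rest.split(None, 1)[1]
            segments.append(StmSegment(wav_id, channel, speaker,
                                       float(begin), float(end), rest))
    return segments
```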
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Arabic Speech Commands Dataset
This dataset is designed to help train simple machine learning models that serve educational and research purposes in the speech recognition domain, mainly for keyword spotting tasks.
Dataset Description
Our dataset is a list of pairs (x, y), where x is the input speech signal and y is the corresponding keyword. The final dataset consists of 12,000 such pairs, comprising 40 keywords. Each audio file is one second long, sampled at 16 kHz. There are 30 participants, each of whom recorded 10 utterances for each keyword. Therefore, we have 300 audio files for each keyword in total (30 * 10 * 40 = 12,000), and the total size of all the recorded keywords is ~384 MB. The dataset also contains several background noise recordings obtained from various natural sources of noise. These audio files are saved in a separate folder named background_noise, with a total size of ~49 MB.
Dataset Structure
There are 40 folders, each of which represents one keyword and contains 300 files. The first eight digits of each file name identify the contributor, while the last two digits identify the round number. For example, the file path rotate/00000021_NO_06.wav indicates that the contributor with the ID 00000021 pronounced the keyword rotate for the 6th time.
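A small sketch of decoding this naming convention; the meaning of the middle token (e.g. 'NO') is not specified in the description above, so it is ignored here.

```python
from pathlib import Path

def parse_keyword_path(path: str) -> dict:
    """Decode a path like 'rotate/00000021_NO_06.wav' into its components."""
    p = Path(path)
    contributor, _, round_no = p.stem.split("_")  # '00000021', 'NO', '06'
    return {"keyword": p.parent.name,  # folder name is the keyword
            "contributor_id": contributor,
            "round": int(round_no)}

print(parse_keyword_path("rotate/00000021_NO_06.wav"))
# {'keyword': 'rotate', 'contributor_id': '00000021', 'round': 6}
```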
Data Split
We recommend using the provided CSV files in your experiments. We kept 60% of the dataset for training, 20% for validation, and the remaining 20% for testing. In our split method, we guarantee that all recordings of a certain contributor are within the same subset.
License
This dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. For more details, see the LICENSE file in this folder.
Citations
If you want to use the Arabic Speech Commands dataset in your work, please cite it as:
@article{arabicspeechcommandsv1,
author = {Ghandoura, Abdulkader and Hjabo, Farouk and Al Dakkak, Oumayma},
title = {Building and Benchmarking an Arabic Speech Commands Dataset for Small-Footprint Keyword Spotting},
journal = {Engineering Applications of Artificial Intelligence},
year = {2021},
publisher={Elsevier}
}
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset represents a comprehensive resource for advancing Kurdish TTS systems. Converting text to speech is an important topic in the design and construction of multimedia systems, human-machine communication, and information and communication technology; together with speech recognition, its purpose is to establish communication between humans and machines in its most basic and natural form, that is, spoken language.
For our text corpus, we collected 6,565 sentences from a set of texts in various categories, including news, sport, health, question and exclamation sentences, science, general information, politics, education and literature, story, miscellaneous, and tourism, to create the training sentences. We thoroughly reviewed and normalized the texts, which were then recorded by a male speaker. The audio was recorded in a voice recording studio at 44,100 Hz, and all audio files were downsampled to 22,050 Hz for the modeling process. The audio ranges from 3 to 36 seconds in length. The resulting speech corpus contains about 6,565 text-audio pairs, amounting to around 19 hours of speech. Audio files are saved in WAV format, and the texts are saved in text files in the corresponding sub-folders. For model training, all of the audio files are gathered in a single folder. Each line in the transcript files is formatted as WAVS | audio file's name.wav | transcript, where the audio file's name includes the extension and the transcript is the text of the speech.
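A minimal parser sketch for the transcript layout just described; a literal pipe character is assumed as the field delimiter.

```python
def parse_kurdish_transcripts(path: str) -> list[tuple[str, str]]:
    """Parse lines of the form 'WAVS | <name>.wav | <transcript>'
    into (wav_name, transcript) pairs."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split("|", 2)  # prefix, file name, transcript
            if len(parts) != 3:
                continue  # skip malformed or empty lines
            _, wav_name, transcript = (p.strip() for p in parts)
            pairs.append((wav_name, transcript))
    return pairs
```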
The audio recording and editing process lasted 90 days and produced over 6,565 WAV files and more than 19 hours of recorded speech. The dataset helps researchers make early progress on Kurdish TTS, reducing the time this process would otherwise consume.
Acknowledgments: We would like to express our sincere gratitude to Ayoub Mohammadzadeh for his invaluable support in recording the corpus.
https://www.datainsightsmarket.com/privacy-policy
The global Artificial Intelligence (AI) Training Dataset market is experiencing robust growth, driven by the increasing adoption of AI across diverse sectors. The market's expansion is fueled by the burgeoning need for high-quality data to train sophisticated AI algorithms capable of powering applications like smart campuses, autonomous vehicles, and personalized healthcare solutions. The demand for diverse dataset types, including image classification, voice recognition, natural language processing, and object detection datasets, is a key factor contributing to market growth.

While the exact market size in 2025 is unavailable, a conservative estimate of a $10 billion market in 2025, based on the growth trend and reported market sizes of related industries, together with a projected CAGR (Compound Annual Growth Rate) of 25%, suggests significant expansion in the coming years. Key players in this space are leveraging technological advancements and strategic partnerships to enhance data quality and expand their service offerings. Furthermore, the increasing availability of cloud-based data annotation and processing tools is streamlining operations and making AI training datasets more accessible to businesses of all sizes. Growth is expected to be particularly strong in regions with rapid technological advancement and substantial digital infrastructure, such as North America and Asia Pacific.

However, challenges such as data privacy concerns, the high cost of data annotation, and the scarcity of professionals skilled in handling complex datasets remain obstacles to broader market penetration. The ongoing evolution of AI technologies and the expanding applications of AI across multiple sectors will continue to shape demand for AI training datasets, pushing the market toward higher growth trajectories. The diversity of applications, from smart homes and medical diagnoses to advanced robotics and autonomous driving, creates significant opportunities for companies specializing in this market. Maintaining data quality, security, and ethical standards will be crucial for future market leadership.
Text-to-Speech (TTS) data is recorded by native speakers with authentic accents and pleasant voices. The phoneme coverage is balanced, and professional phoneticians participate in the annotation. The data precisely matches the research and development needs of speech synthesis.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Spanish Telephone Dialogues Dataset - 488 Hours
The dataset comprises 488 hours of high-quality telephone audio recordings in Spanish, featuring 600 native speakers and achieving a 95% sentence accuracy rate. Designed for advancing speech recognition models and language processing, this extensive speech data corpus covers diverse topics and domains, making it ideal for training robust automatic speech recognition (ASR) systems.
Dataset characteristics: see the full description on the dataset page: https://huggingface.co/datasets/ud-nlp/spanish-speech-recognition-dataset.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The BengaliSpeechRecognitionDataset (BSRD) is a comprehensive dataset designed for the development and evaluation of Bengali speech recognition and text-to-speech systems. This dataset includes a collection of Bengali characters and their corresponding audio files, which are generated using speech synthesis models. It serves as an essential resource for researchers and developers working on automatic speech recognition (ASR) and text-to-speech (TTS) applications for the Bengali language.

Key features:
- Bengali characters: The dataset contains a wide range of Bengali characters, including consonants, vowels, and unique symbols used in the Bengali script.
- Corresponding speech data: For each Bengali character, an MP3 audio file is provided containing the correct pronunciation of that character. This audio is generated by a Bengali text-to-speech model, ensuring clear and accurate pronunciation.
- 1000 audio samples per folder: Each character is associated with at least 1000 MP3 files. These multiple samples provide variations of the character's pronunciation, which is essential for training robust speech recognition systems.
- Language and phonetic diversity: The dataset offers phonetic diversity across Bengali sounds, covering different tones and pronunciations commonly found in spoken Bengali. This ensures that the dataset can be used for training models capable of recognizing diverse speech patterns.
- Use cases:
  - Automatic speech recognition (ASR): BSRD is ideal for training ASR systems, as it provides accurate audio samples linked to specific Bengali characters.
  - Text-to-speech (TTS): Researchers can use this dataset to fine-tune TTS systems for generating natural Bengali speech from text.
  - Phonetic analysis: The dataset can be used for phonetic analysis and for developing models that study the linguistic features of Bengali pronunciation.
- Applications:
  - Voice assistants: The dataset can be used to build and train voice recognition systems and personal assistants that understand Bengali.
  - Speech-to-text systems: BSRD can aid in developing accurate transcription systems for Bengali audio content.
  - Language learning tools: The dataset can help in creating educational tools aimed at teaching Bengali pronunciation.
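A minimal indexing sketch, assuming the folder-per-character layout described above; the root directory name is a placeholder.

```python
from pathlib import Path

def index_bsrd(root: str) -> list[tuple[str, Path]]:
    """Build (character_label, mp3_path) pairs from a folder-per-character
    layout, where each subfolder is named after the character it contains."""
    pairs = []
    for char_dir in sorted(Path(root).iterdir()):
        if not char_dir.is_dir():
            continue
        for mp3 in sorted(char_dir.glob("*.mp3")):
            pairs.append((char_dir.name, mp3))
    return pairs

pairs = index_bsrd("BengaliSpeechRecognitionDataset")  # placeholder root path
print(len(pairs), "audio samples indexed")
```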
Note for Researchers Using the Dataset
This dataset was created by Shuvo Kumar Basak. If you use this dataset for research or academic purposes, please cite it appropriately. If you have published research using this dataset, please share a link to your paper. Good luck.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Introducing the UK English Scripted Monologue Speech Dataset for the Healthcare Domain, a voice dataset built to accelerate the development and deployment of English language automatic speech recognition (ASR) systems, with a sharp focus on real-world healthcare interactions.
This dataset includes over 6,000 high-quality scripted audio prompts recorded in UK English, representing typical voice interactions found in the healthcare industry. The data is tailored for use in voice technology systems that power virtual assistants, patient-facing AI tools, and intelligent customer service platforms.
The prompts span a broad range of healthcare-specific interactions, such as:
To maximize authenticity, the prompts integrate linguistic elements and healthcare-specific terms such as:
These elements make the dataset exceptionally suited for training AI systems to understand and respond to natural healthcare-related speech patterns.
Every audio recording is accompanied by a verbatim, manually verified transcription.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
French Telephone Dialogues Dataset - 547 Hours
This speech recognition dataset comprises 547 hours of telephone dialogues in French from 964 native speakers, providing audio recordings with detailed annotations (text, speaker ID, gender, age) to support speech recognition systems, natural language processing, and deep learning models for training and evaluating automatic speech recognition technology.
Dataset characteristics: see the full description on the dataset page: https://huggingface.co/datasets/ud-nlp/french-speech-recognition-dataset.
Recording environment: quiet indoor environment, without echo
Recording content (read speech): economy, entertainment, news, oral language, numbers, letters
Speaker: native speakers, gender-balanced
Device: Android mobile phone, iPhone
Language: 100+ languages
Transcription content: text, time points of speech data, 5 noise symbols, 5 special identifiers
Accuracy rate: 95% (the accuracy rate of noise symbols and other identifiers is not included)
Application scenarios: speech recognition, voiceprint recognition
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
American Telephone Dialogues Dataset - 1,136 Hours
The dataset includes 1,136 hours of annotated telephone dialogues from 1,416 native speakers across the United States. Designed for advancing speech recognition models and language processing, this extensive speech data corpus covers diverse topics and domains, making it ideal for training robust automatic speech recognition (ASR) systems.
Dataset characteristics: see the full description on the dataset page: https://huggingface.co/datasets/ud-nlp/american-speech-recognition-dataset.
https://www.datainsightsmarket.com/privacy-policy
Explore the booming Speech Recognition Data market, driven by AI advancements and voice technology adoption. Discover market size, CAGR, key drivers, and regional trends shaping the future of voice data.