Environment : quiet indoor environment, without echo;
Recording content : no preset linguistic data; dozens of topics are specified, and the speakers hold a dialogue on those topics while the recording is performed;
Demographics : speakers are evenly distributed across all age groups, covering children, teenagers, the middle-aged, the elderly, etc.
Annotation : transcription text, speaker identification, gender and noise symbols;
Device : Telephony recording system;
Language : 100+ Languages;
Application scenarios : speech recognition; voiceprint recognition;
Accuracy rate : the word accuracy rate is not less than 98%
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
VIVOS Corpus
VIVOS is a free Vietnamese speech corpus consisting of 15 hours of recorded speech prepared for the automatic speech recognition task.
The corpus was published by AILAB, a computer science lab of VNUHCM - University of Science, with Prof. Vu Hai Quan as the head.
We publish this corpus in the hope of attracting more scientists to solve Vietnamese speech recognition problems. The corpus should only be used for academic purposes.
Recording environment : quiet indoor environment, without echo
Recording content (read speech) : general category; human-machine interaction category
Demographics : speakers are evenly distributed across all age groups, covering children, teenagers, the middle-aged, the elderly, etc.
Device : Android mobile phone, iPhone;
Language : English-Korean, English-Japanese, German-English, Hong Kong Cantonese-English, Taiwanese-English, etc.
Application scenarios : speech recognition; voiceprint recognition.
Accuracy rate : 97%
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Canadian English General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of English speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Canadian English communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade English speech models that understand and respond to authentic Canadian accents and dialects.
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Canadian English. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
The dataset comes with granular metadata for both speakers and recordings:
Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
This dataset is a versatile resource for multiple English speech and language AI applications:
Recording environment : quiet indoor environment, low background noise, without echo.
Recording content (read speech) : generic category; human-machine interaction category; smart home command and control category; in-car command and control category; numbers.
Demographics : speakers are evenly distributed across all age groups, covering children, teenagers, the middle-aged, the elderly, etc.
Device : Android mobile phone, iPhone.
Language : American English, British English, Canadian English, Australian English, French English, German English, Spanish English, Italian English, Portuguese English, Russian English, Indian English, Japanese English, Korean English, Singaporean English, etc.
Application scenarios : speech recognition; voiceprint recognition.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Czech General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Czech speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Czech communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Czech speech models that understand and respond to authentic Czech accents and dialects.
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Czech. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
The dataset comes with granular metadata for both speakers and recordings:
Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
This dataset is a versatile resource for multiple Czech speech and language AI applications:
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
The Aurora project was originally set up to establish a worldwide standard for the feature extraction software which forms the core of the front-end of a DSR (Distributed Speech Recognition) system. The AURORA-5 database has been developed mainly to investigate the influence of hands-free speech input in noisy room environments on the performance of automatic speech recognition. Furthermore, two test conditions are included to study the influence of transmitting the speech over a mobile communication system.

The earlier three Aurora experiments focused on additive noise and the influence of some telephone frequency characteristics. Aurora-5 tries to cover all effects as they occur in realistic application scenarios. The focus was put on two scenarios. The first is hands-free speech input in the noisy car environment, with the intention of either controlling devices in the car itself or retrieving information from a remote speech server over the telephone. The second covers hands-free speech input in an office or a living room, e.g. to control a telephone device or some audio/video equipment.

The AURORA-5 database contains the following data:
• Artificially distorted versions of the recordings from adult speakers in the TI-Digits speech database, downsampled to a sampling frequency of 8000 Hz. The distortions consist of: additive background noise; the simulation of a hands-free speech input in rooms; the simulation of transmitting speech over cellular telephone networks.
• A subset of recordings from the meeting recorder project at the International Computer Science Institute. The recordings contain sequences of digits uttered by different speakers in hands-free mode in a meeting room.
• A set of scripts for running recognition experiments on the above-mentioned speech data. The experiments are based on the freely available HTK software package; HTK itself is not part of this resource.

Further information is also available at the following address: http://aurora.hsnr.de
Recording environment : quiet indoor environment, without echo
Recording content (read speech) : economy, entertainment, news, oral language, numbers, letters
Speaker : native speaker, gender balance
Device : Android mobile phone, iPhone
Language : 100+ languages
Transcription content : text, time point of speech data, 5 noise symbols, 5 special identifiers
Accuracy rate : 95% (the accuracy rate of noise symbols and other identifiers is not included)
Application scenarios : speech recognition, voiceprint recognition
Speech samples from over 1,000 individuals with impaired speech have been collected for Project Euphonia, which aims to improve automated speech recognition systems for disordered speech. While participants consented to making the dataset public, this work is pursuing ways to allow data contribution to a central repository that can open access to other researchers.
Environment : quiet indoor environment, without echo;
Recording content : no preset linguistic data; dozens of topics are specified, and the speakers hold a dialogue on those topics while the recording is performed;
Demographics : speakers are evenly distributed across all age groups, covering children, teenagers, the middle-aged, the elderly, etc.
Annotation : transcription text, speaker identification, gender and noise symbols;
Device : Android mobile phone, iPhone;
Language : 100+ Languages;
Application scenarios : speech recognition; voiceprint recognition;
Accuracy rate : the word accuracy rate is not less than 98%
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The package contains an Amharic speech corpus with audio data in the directory /data. The data directory contains 2 subdirectories: a. train - speech data and transcription for training automatic speech recognition, in Kaldi ASR format [1]; b. test - speech data and transcription for testing automatic speech recognition, in Kaldi ASR format.
A text corpus and language model are provided in the directory /LM, and a lexicon in the directory /lang.
Directory: /data/train. Files: text (training transcription), wav.scp (file id and path), utt2spk (file id and audio id), spk2utt (audio id and file id), wav (.wav files). For more information about the format, please refer to the Kaldi website: http://kaldi-asr.org/doc/data_prep.html. Description: training data in Kaldi format, about 20 hours. Note: the paths of the wav files in wav.scp have to be modified to point to their actual location.
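To make that note concrete, here is a minimal Python sketch (not part of the corpus) that rewrites the wav.scp paths to a local audio root and regenerates spk2utt from utt2spk so the two files stay consistent; the directory and audio-root names are assumptions for illustration only.

from pathlib import Path
from collections import defaultdict

data_dir = Path("data/train")                   # hypothetical local copy of /data/train
new_audio_root = Path("/path/to/amharic/wav")   # hypothetical actual location of the .wav files

# wav.scp holds "<file-id> <path>" per line; keep the id and swap in the new path.
fixed = []
for line in (data_dir / "wav.scp").read_text().splitlines():
    utt_id, old_path = line.split(maxsplit=1)
    fixed.append(f"{utt_id} {new_audio_root / Path(old_path).name}")
(data_dir / "wav.scp").write_text("\n".join(fixed) + "\n")

# utt2spk holds "<file-id> <audio-id>"; spk2utt groups the file ids under each audio id.
groups = defaultdict(list)
for line in (data_dir / "utt2spk").read_text().splitlines():
    utt_id, spk_id = line.split(maxsplit=1)
    groups[spk_id].append(utt_id)
(data_dir / "spk2utt").write_text(
    "\n".join(f"{spk} {' '.join(utts)}" for spk, utts in sorted(groups.items())) + "\n"
)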
Directory: /data/test. Files: text (test transcription), wav.scp (file id and path), utt2spk (file id and audio id), spk2utt (audio id and file id), wav (.wav files). Description: testing data in Kaldi format, about 2 hours. The audio files for testing have the format
Directory: /lm. Files: amharic_lm_PART1.zip, amharic_lm_PART2.zip. These files have to be unzipped and reassembled into one file to constitute the original language model "amharic.train.lm.data.arpa". This language model was created with SRILM using 3-grams; the text is segmented into morphemes using Morfessor 2.0 [2][3].
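A minimal sketch of that reassembly step in Python, assuming each zip archive holds one text part of the ARPA model (the member names inside the archives, and the local directory name, are guesses):

import zipfile
from pathlib import Path

lm_dir = Path("lm")  # the corpus /lm directory, unpacked locally
parts = ["amharic_lm_PART1.zip", "amharic_lm_PART2.zip"]

# Unzip each part and append its contents, in order, to rebuild the original ARPA file.
with open(lm_dir / "amharic.train.lm.data.arpa", "wb") as out:
    for name in parts:
        with zipfile.ZipFile(lm_dir / name) as zf:
            for member in zf.namelist():  # typically a single member per archive
                out.write(zf.read(member))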
Directory: /lang. Files: lexicon.txt (lexicon), nonsilence_phones.txt (speech phones), optional_silence.txt (silence phone). Description: the lexicon contains words and their respective pronunciations, plus non-speech sounds and noise, in Kaldi format; the tokens have been extracted after morpheme-level segmentation using Morfessor 2.0 [3].
In /kaldi-scripts you will find the scripts used to train and test the models. The paths in 00_init_paths.sh and 01_init_datas.sh have to be changed so they work in your own directory (do not forget to set the appropriate paths for test.pl and train.pl). From the existing data and lang directories you can then directly run the sequence, starting with 04_train_mono.sh. For this corpus, widen the beam and retry-beam slightly in the Kaldi mono training script (mono_train.sh), i.e. change the beam size from --beam=$[$beam] to --beam=$[$beam*4] and the retry-beam size from --retry-beam=$[$beam*4] to --retry-beam=$[$beam*22]. Then continue with:
04a_train_triphone.sh + 04b_train_MLLT_LDA.sh + 04c_train_SAT_FMLLR.sh + 04d_train_MMI_FMMI.sh + 04e_train_sgmm.sh
---- %MER 19.89 [ 1234 / 6203, 91 ins, 362 del, 781 sub ]
---- %SER 80.78 [ 290 / 359 ]
---- %CER 15.02
---- %MER 10.83 [ 672 / 6203, 71 ins, 156 del, 445 sub ]
---- %SER 62.12 [ 223 / 359 ]
---- %CER 6.94
---- %WER 9.62 [ 597 / 6203, 94 ins, 106 del, 397 sub ]
---- %SER 60.45 [ 217 / 359 ]
---- %CER 6.46
---- %WER 8.61 [ 534 / 6203, 76 ins, 101 del, 357 sub ]
---- %SER 56.27 [ 202 / 359 ]
---- %CER 5.34
---- %WER 9.24 [ 573 / 6203, 86 ins, 109 del, 378 sub ]
---- %SER 59.89 [ 215 / 359 ]
---- %CER 6.27
---- %MER 10.83 [ 672 / 6203, 85 ins, 157 del, 430 sub ]
---- %SER 64.07 [ 230 / 359 ]
---- %CER 7.66
Iteration 3
---- %MER 10.56 [ 655 / 6203, 80 ins, 154 del, 421 sub ]
---- %SER 64.62 [ 232 / 359 ]
---- %CER 7.13
Iteration 4
---- %MER 10.59 [ 657 / 6203, 75 ins, 162 del, 420 sub ]
---- %SER 64.90 [ 233 / 359 ]
---- %CER 7.31
Iteration 5
---- %MER 10.37 [ 643 / 6203, 81 ins, 145 del, 417 sub ]
---- %SER 63.51 [ 228 / 359 ]
---- %CER 7.07
Iteration 6
---- %MER 10.45 [ 648 / 6203, 83 ins, 147 del, 418 sub ]
---- %SER 64.07 [ 230 / 359 ]
---- %CER 7.21
Iteration 7
---- %MER 10.35 [ 642 / 6203, 79 ins, 147 del, 416 sub ]
---- %SER 64.62 [ 232 / 359 ]
---- %CER 7.05
Iteration 8
---- %MER 10.43 [ 647 / 6203, 79 ins, 155 del, 413 sub ]
---- %SER 64.62 [ 232 / 359 ]
---- %CER 7.34
---- %MER 8.75 [ 543 / 6203, 52 ins, 134 del, 357 sub ]
---- %SER 57.10 [ 205 / 359 ]
---- %CER 5.50
Iteration 1
---- %MER 8.59 [ 533 / 6203, 46 ins, 131 del, 356 sub ]
---- %SER 55.71 [ 200 ...
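For reference, the bracketed figures in the listing above follow Kaldi's scoring convention: the error rate is the number of insertions, deletions and substitutions divided by the number of reference tokens. A quick Python check of the "%WER 9.62 [ 597 / 6203, 94 ins, 106 del, 397 sub ]" line:

ins, dels, subs, ref_tokens = 94, 106, 397, 6203  # counts taken from the %WER 9.62 line
wer = 100 * (ins + dels + subs) / ref_tokens
print(f"%WER {wer:.2f} [ {ins + dels + subs} / {ref_tokens} ]")  # -> %WER 9.62 [ 597 / 6203 ]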
Recording environment : professional recording studio.
Recording content : general narrative sentences, interrogative sentences, etc.
Speaker : native speaker
Annotation Feature : word transcription, part-of-speech, phoneme boundary, four-level accents, four-level prosodic boundary.
Device : Microphone
Language : American English, British English, Japanese, French, Dutch, Cantonese, Canadian French, Australian English, Italian, New Zealand English, Spanish, Mexican Spanish
Application scenarios : speech synthesis
Accuracy rate : Word transcription: the sentence accuracy rate is not less than 99%. Part-of-speech annotation: the sentence accuracy rate is not less than 98%. Phoneme annotation: the sentence accuracy rate is not less than 98% (the error rate of voiced and swallowed phonemes is not included, because that labelling is more subjective). Accent annotation: the word accuracy rate is not less than 95%. Prosodic boundary annotation: the sentence accuracy rate is not less than 97%. Phoneme boundary annotation: the phoneme accuracy rate is not less than 95% (the error range of the boundary is within 5%).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a FOSD-based Male Speech Dataset (extracted from approximately 30 hours of FPT Open Speech Data, released publicly in 2018 by FPT Corporation under the FPT Public License) that is useful for creating text-to-speech models. It comprises 9,474 audio files totalling more than 10.5 hours of recordings. All files are in *.wav format (16 kHz sampling rate, 32-bit float, mono). This dataset is useful for various TTS-related applications.
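Before training a TTS model it can be worth confirming that every clip really matches the stated format. A minimal Python sketch, assuming the third-party soundfile package is installed and using a hypothetical local directory name (fosd_male_speech):

from pathlib import Path

import soundfile as sf  # third-party package, assumed to be installed

def matches_spec(path: Path) -> bool:
    info = sf.info(str(path))
    return (
        info.samplerate == 16000      # 16 kHz sampling rate
        and info.channels == 1        # mono
        and info.subtype == "FLOAT"   # 32-bit float samples
    )

dataset_dir = Path("fosd_male_speech")  # hypothetical local copy of the dataset
bad = [p for p in dataset_dir.rglob("*.wav") if not matches_spec(p)]
print(f"{len(bad)} file(s) deviate from the stated format")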
Copyright 2018 FPT Corporation Permission is hereby granted, free of charge, non-exclusive, worldwide, irrevocable, to any person obtaining a copy of this data or software and associated documentation files (the “Data or Software”), to deal in the Data or Software without restriction, including without limitation the rights to use, copy, modify, remix, transform, merge, build upon, publish, distribute and redistribute, sublicense, and/or sell copies of the Data or Software, for any purpose, even commercially, and to permit persons to whom the Data or Software is furnished to do so, subject to the following conditions: The above copyright notice, and this permission notice, and indication of any modification to the Data or Software, shall be included in all copies or substantial portions of the Data or Software.
THE DATA OR SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE DATA OR SOFTWARE OR THE USE OR OTHER DEALINGS IN THE DATA OR SOFTWARE. Patent and trademark rights are not licensed under this FPT Public License.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description: The "Indian Languages Audio Dataset" is a collection of audio samples featuring a diverse set of 10 Indian languages. Each audio sample in this dataset is precisely 5 seconds in duration and is provided in MP3 format. It is important to note that this dataset is a subset of a larger collection known as the "Audio Dataset with 10 Indian Languages." The source of these audio samples is regional videos freely available on YouTube, and none of the audio samples or source videos are owned by the dataset creator.
Languages Included: 1. Bengali 2. Gujarati 3. Hindi 4. Kannada 5. Malayalam 6. Marathi 7. Punjabi 8. Tamil 9. Telugu 10. Urdu
This dataset offers a valuable resource for researchers, linguists, and machine learning enthusiasts who are interested in studying and analyzing the phonetics, accents, and linguistic characteristics of the Indian subcontinent. It is a representative sample of the linguistic diversity present in India, encompassing a wide array of languages and dialects. Researchers and developers are encouraged to explore this dataset to build applications or conduct research related to speech recognition, language identification, and other audio processing tasks.
Additionally, the dataset is not limited to these 10 languages and has the potential for expansion. Given the dynamic nature of language use in India, this dataset can serve as a foundation for future data collection efforts involving additional Indian languages and dialects.
Access to the "Indian Multilingual Audio Dataset - 10 Languages" is provided with the understanding that users will comply with applicable copyright and licensing restrictions. If users plan to extend this dataset or use it for commercial purposes, it is essential to seek proper permissions and adhere to relevant copyright and licensing regulations.
By utilizing this dataset responsibly and ethically, users can contribute to the advancement of language technology and research, ultimately benefiting language preservation, speech recognition, and cross-cultural communication.
This dataset was collected to enhance research into speech recognition systems for dysarthric speech.
This database was originally created as a resource for developing advanced automatic speech recognition models that are better suited to the needs of people with dysarthria.
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
1,950 speakers participated in the recording and conducted face-to-face communication in a natural way. They had free discussions on a number of given topics covering a wide range of fields. The voice is natural and fluent, in line with an actual dialogue scene. The text is transcribed manually, with high accuracy.
Format : Mobile phone: 16kHz, 16bit, mono channel, .wav; Voice recorder: 44.1kHz, 16bit, dual channel, .wav
Recording environment : quiet indoor environment, without echo
Recording content : dozens of topics are specified, and the speakers make dialogue under those topics while the recording is performed
Demographics : 1,950 people; 66% of all speakers are in the age group of 16-25; 962 speakers spoke in groups of two, 312 in groups of three, 396 in groups of four, and the other 280 in groups of five
Annotation : transcription text, speaker identification and gender
Device : mobile phone and voice recorder
Language : Mandarin
Application scenarios : speech recognition; voiceprint recognition
Accuracy rate : 97%
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a FOSD-based Female Speech Dataset (extracted from approximately 30 hours of FPT Open Speech Data, released publicly in 2018 by FPT Corporation under the FPT Public License) that is useful for creating text-to-speech models. It comprises 7,637 audio files totalling more than 9.5 hours of recordings. All files are in *.wav format (16 kHz sampling rate, 32-bit float, mono). This dataset is useful for various TTS-related applications.
Copyright 2018 FPT Corporation Permission is hereby granted, free of charge, non-exclusive, worldwide, irrevocable, to any person obtaining a copy of this data or software and associated documentation files (the “Data or Software”), to deal in the Data or Software without restriction, including without limitation the rights to use, copy, modify, remix, transform, merge, build upon, publish, distribute and redistribute, sublicense, and/or sell copies of the Data or Software, for any purpose, even commercially, and to permit persons to whom the Data or Software is furnished to do so, subject to the following conditions: The above copyright notice, and this permission notice, and indication of any modification to the Data or Software, shall be included in all copies or substantial portions of the Data or Software.
THE DATA OR SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE DATA OR SOFTWARE OR THE USE OR OTHER DEALINGS IN THE DATA OR SOFTWARE. Patent and trademark rights are not licensed under this FPT Public License.
https://www.futurebeeai.com/policies/ai-data-license-agreement
The Tamil Scripted Monologue Speech Dataset for the General Domain is a carefully curated resource designed to support the development of Tamil language speech recognition systems. This dataset focuses on general-purpose conversational topics and is ideal for a wide range of AI applications requiring natural, domain-agnostic Tamil speech data.
This dataset features over 6,000 high-quality scripted monologue recordings in Tamil. The prompts span diverse real-life topics commonly encountered in general conversations and are intended to help train robust and accurate speech-enabled technologies.
The dataset covers a wide variety of general conversation scenarios, including:
To enhance authenticity, the prompts include:
Each prompt is designed to reflect everyday use cases, making it suitable for developing generalized NLP and ASR solutions.
Every audio file in the dataset is accompanied by a verbatim text transcription, ensuring accurate training and evaluation of speech models.
Rich metadata is included for detailed filtering and analysis:
This dataset can power a variety of Tamil language AI technologies, including:
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
995 local Cantonese speakers participated in the recording and conducted face-to-face communication in a natural way. They had free discussions on a number of given topics covering a wide range of fields; the voice is natural and fluent, in line with an actual dialogue scene. The text is transcribed manually, with high accuracy.
Format : Mobile phone: 16kHz, 16bit, mono channel, .wav; Voice recorder: 44.1kHz, 16bit, dual channel, .wav
Recording environment : quiet indoor environment, without echo
Recording content : dozens of topics are specified, and the speakers make dialogue under those topics while the recording is performed
Demographics : 995 Cantonese speakers; 45% of all speakers are in the age group of 26-45; 504 speakers spoke in groups of two, 195 in groups of three, 196 in groups of four, and the other 100 in groups of five
Annotation : transcription text, speaker identification and gender
Device : mobile phone and voice recorder
Language : Cantonese
Application scenarios : voice recognition; voiceprint recognition
Accuracy rate : 95%
Environment : quiet indoor environment, without echo;
Recording content : no preset linguistic data; dozens of topics are specified, and the speakers hold a dialogue on those topics while the recording is performed;
Demographics : speakers are evenly distributed across all age groups, covering children, teenagers, the middle-aged, the elderly, etc.
Annotation : transcription text, speaker identification, gender and noise symbols;
Device : Telephony recording system;
Language : 100+ Languages;
Application scenarios : speech recognition; voiceprint recognition;
Accuracy rate : the word accuracy rate is not less than 98%