Specifications:
- Each user has a unique ID across the entire dataset.
- A maximum of four hours of speech per person.
- Speech is recorded and transcribed on separate tracks.
- High-quality transcriptions are provided with the data in JSON format.
- Noise-free, high-quality recordings with both male and female speakers.
- Metadata includes gender, age, and location.
- License terms: pay once and use the data commercially in your products; reselling the data is not permitted.
https://www.sapien.io/terms
High-quality speech audio datasets designed for AI model training, supporting applications such as speech recognition, voice identification, and multilingual speech processing.
Citation: DOI 10.1007/s10579-011-9145-0
Collection of audio recordings by the Department of Computer Science at the University of Toronto from speakers with and without dysarthria. Useful for tasks such as audio classification, disease detection, and speech processing.
Directory Structure:
F_Con : Audio samples of female speakers from the control group, i.e., female speakers without dysarthria. 'FC01' in the folder names and filenames refers to the first speaker, 'FC02' to the second, and so on.
F_Dys : Audio samples of female speakers with dysarthria. 'F01' refers to the first speaker, 'F03' to the second, and so on.
M_Con : Audio samples of male speakers from the control group, i.e., male speakers without dysarthria. 'MC01' refers to the first speaker, 'MC02' to the second, and so on.
M_Dys : Audio samples of male speakers with dysarthria. 'M01' refers to the first speaker, 'M03' to the second, and so on.
In every folder, 'S01' refers to the first recording session with a speaker, 'S02' to the second, and so on; 'arrayMic' indicates that the audio was recorded with an array microphone, whereas 'headMic' indicates a headpiece microphone.
Specifications:
- Each user has a unique ID across the entire dataset.
- A maximum of four hours of speech per person.
- Speech is recorded and transcribed on separate tracks.
- High-quality transcriptions are provided with the data in JSON format.
- Noise-free, high-quality recordings with both male and female speakers.
- Metadata includes gender, age, and location.
- License terms: pay once and use the data commercially in your products; reselling the data is not permitted.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the English Language General Conversation Speech Dataset, a comprehensive and diverse collection of voice data specifically curated to advance the development of English language speech recognition models, with a particular focus on British accents and dialects.
With high-quality audio recordings, detailed metadata, and accurate transcriptions, it empowers researchers and developers to enhance natural language processing, conversational AI, and generative voice AI algorithms. Moreover, it facilitates the creation of sophisticated voice assistants and voice bots tailored to the unique linguistic nuances of English as spoken in the United Kingdom.
Speech Data: This training dataset comprises 30 hours of audio recordings covering a wide range of topics and scenarios, ensuring robustness and accuracy in speech technology applications. To achieve this, we collaborated with a diverse network of 40 native English speakers from different regions of the United Kingdom. This collaborative effort guarantees a balanced representation of British accents, dialects, and demographics, reducing biases and promoting inclusivity.
Each audio recording captures a spontaneous, unscripted conversation between two individuals, with durations ranging from 15 to 60 minutes. The speech data is available in WAV format as stereo files with a bit depth of 16 bits and a sample rate of 8 kHz. The recording environment is generally quiet, with no background noise or echo.
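As a quick sanity check, the stated audio format can be verified with Python's standard wave module; a minimal sketch, where the file name is a placeholder of your own choosing:

# Minimal sketch: check a recording against the stated specs
# (stereo, 16-bit, 8 kHz). "conversation.wav" is a placeholder path.
import wave

with wave.open("conversation.wav", "rb") as wav:
    assert wav.getnchannels() == 2        # stereo
    assert wav.getsampwidth() == 2        # 16 bits = 2 bytes per sample
    assert wav.getframerate() == 8000     # 8 kHz sample rate
    minutes = wav.getnframes() / wav.getframerate() / 60
    print(f"OK, duration: {minutes:.1f} min")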
Metadata: In addition to the audio recordings, our dataset provides comprehensive metadata for each participant, including age, gender, country, state, and dialect. Additional metadata such as recording device details, topic of recording, bit depth, and sample rate is also provided.
The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of English language speech recognition models.
Transcription: This dataset provides a manual verbatim transcription of each audio file to enhance your workflow efficiency. The transcriptions are available in JSON format and capture speaker-wise text with time-coded segmentation, along with non-speech labels and tags.
Our goal is to expedite the deployment of English language conversational AI and NLP models by offering ready-to-use transcriptions, ultimately saving valuable time and resources in the development process.
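To make the transcription format concrete, here is a minimal Python sketch for reading one such file; the field names (segments, speaker, start, end, text) are assumptions for illustration, since the actual JSON schema ships with the dataset:

# Minimal sketch: read a speaker-wise, time-coded transcription.
# Field names are assumed; consult the schema shipped with the dataset.
import json

with open("transcription.json", encoding="utf-8") as f:
    doc = json.load(f)

for seg in doc["segments"]:
    print(f"[{seg['start']:7.2f}-{seg['end']:7.2f}] {seg['speaker']}: {seg['text']}")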
Updates and Customization: We understand the importance of collecting data in various environments to build robust ASR models. Therefore, our voice dataset is regularly updated with new audio data captured in diverse real-world conditions.
If you require a custom training dataset with specific environmental conditions such as in-car, busy street, restaurant, or any other scenario, we can accommodate your request. We can provide voice data with customized sample rates ranging from 8 kHz to 48 kHz, allowing you to fine-tune your models for different audio recording setups. We can also customize the transcriptions to follow your specific guidelines and requirements, further supporting your ASR development process.
License: This audio dataset, created by FutureBeeAI, is now available for commercial use.
Conclusion: Whether you are training or fine-tuning speech recognition models, advancing NLP algorithms, exploring generative voice AI, or building cutting-edge voice assistants and bots, our dataset serves as a reliable and valuable resource.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Japanese General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Japanese speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Japanese communication.
Curated by FutureBeeAI, this 40-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Japanese speech models that understand and respond to authentic Japanese accents and dialects.
The dataset comprises 40 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Japanese. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
The dataset comes with granular metadata for both speakers and recordings.
Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
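As an example of such filtering, here is a minimal Python sketch; the metadata file name and field names (speaker_gender, speaker_age, topic) are assumptions for illustration, since the exact schema ships with the dataset:

# Minimal sketch: use-case-specific filtering on recording metadata.
# File name and field names are assumed; adjust to the actual schema.
import json

with open("metadata.json", encoding="utf-8") as f:
    records = json.load(f)

# e.g., keep only recordings of female speakers under 30 discussing travel
subset = [
    r for r in records
    if r["speaker_gender"] == "female"
    and r["speaker_age"] < 30
    and r["topic"] == "travel"
]
print(f"{len(subset)} of {len(records)} recordings match")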
This dataset is a versatile resource for multiple Japanese speech and language AI applications.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The BengaliSpeechRecognitionDataset (BSRD) is a comprehensive dataset designed for the development and evaluation of Bengali speech recognition and text-to-speech systems. It includes a collection of Bengali characters and their corresponding audio files, generated using speech synthesis models, and serves as an essential resource for researchers and developers working on automatic speech recognition (ASR) and text-to-speech (TTS) applications for the Bengali language.
Key Features:
- Bengali characters: a wide range of Bengali characters, including consonants, vowels, and unique symbols used in the Bengali script, such as 'ক', 'খ', 'গ', and many more.
- Corresponding speech data: for each Bengali character, MP3 audio files contain the correct pronunciation of that character, generated by a Bengali text-to-speech model to ensure clear and accurate pronunciation.
- 1000 audio samples per folder: each character is associated with at least 1000 MP3 files. These multiple samples provide variations of the character's pronunciation, which is essential for training robust speech recognition systems.
- Language and phonetic diversity: the dataset covers different tones and pronunciations commonly found in spoken Bengali, so it can be used to train models that recognize diverse speech patterns.
Use Cases:
- Automatic speech recognition (ASR): BSRD is ideal for training ASR systems, as it provides accurate audio samples linked to specific Bengali characters.
- Text-to-speech (TTS): researchers can use this dataset to fine-tune TTS systems for generating natural Bengali speech from text.
- Phonetic analysis: the dataset can be used for phonetic analysis and for developing models that study the linguistic features of Bengali pronunciation.
Applications:
- Voice assistants: build and train voice recognition systems and personal assistants that understand Bengali.
- Speech-to-text systems: BSRD can aid in developing accurate transcription systems for Bengali audio content.
- Language learning tools: the dataset can help create educational tools for teaching Bengali pronunciation.
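A minimal Python sketch for indexing the corpus into (audio path, character label) pairs, assuming the one-folder-per-character layout described above ('BSRD' is a placeholder root path):

# Minimal sketch: build (audio path, character label) pairs from the layout.
from pathlib import Path

root = Path("BSRD")  # placeholder root directory
pairs = [(mp3, folder.name)                      # folder name = character label
         for folder in sorted(root.iterdir()) if folder.is_dir()
         for mp3 in sorted(folder.glob("*.mp3"))]
print(f"{len(pairs)} clips across {len({label for _, label in pairs})} characters")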
Note for Researchers Using the Dataset
This dataset was created by Shuvo Kumar Basak. If you use this dataset in your research or academic work, please cite it appropriately. If you have published research using this dataset, please share a link to your paper. Good luck.
https://data.macgence.com/terms-and-conditions
The audio dataset includes general conversations, featuring Shona speakers from Africa with detailed metadata.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Data Source: Kaggle Medical Speech, Transcription, and Intent
Context
8.5 hours of audio utterances paired with text for common medical symptoms.
Content
This data contains thousands of audio utterances for common medical symptoms like “knee pain” or “headache,” totaling more than 8 hours in aggregate. Each utterance was created by individual human contributors based on a given symptom. These audio snippets can be used to train conversational agents in the medical field. This Figure Eight… See the full description on the dataset page: https://huggingface.co/datasets/Hani89/medical_asr_recording_dataset.
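The corpus can be loaded directly from the Hugging Face page cited above; a minimal sketch using the datasets library:

# Minimal sketch: load the corpus from the Hugging Face Hub page above.
from datasets import load_dataset

ds = load_dataset("Hani89/medical_asr_recording_dataset")
print(ds)  # inspect available splits and column names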
Train AI to understand Japanese with Unidata’s dataset, featuring diverse speech samples for better transcription accuracy.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Speech Recognition Bias Reduction Project
Executive Summary
Welcome to the Speech Recognition Bias Reduction Project, which aims to create a more inclusive and representative dataset for improving automated speech recognition systems. The project addresses the challenges faced by speakers with non-native English accents, particularly when interacting with automated voice systems that struggle to interpret alphanumeric information such as names, phone numbers, and addresses. … See the full description on the dataset page: https://huggingface.co/datasets/sakshee05/alphanumeric-audio-dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
========================================================
This dataset was developed by Md Ashraful Islam (SUST CSE'2010) and Md Mahadi Hasan Nahid (SUST CSE'2010, nahid-cse@sust.edu)
Department of Computer Science and Engineering (CSE)
Shahjalal University of Science and Technology (SUST), www.sust.edu
Special Thanks To
Mohammad Al-Amin, SUST CSE'2011
Md Mazharul Islam Midhat, SUST CSE'2010
Md Mahedi Hasan Nayem, SUST CSE'2010
Avro Keyboard, OmicronLab, https://www.omicronlab.com/index.html
=========================================================
This is an audio-text parallel corpus. The dataset contains recorded audio of Bangla real numbers and their corresponding text, designed specifically for Bangla speech recognition.
There are five speakers (alamin, ashraful, midhat, nahid, nayem) in this dataset.
The vocabulary contains only Bangla real numbers (shunno-ekshoto, hazar, loksho, koti, doshomic, etc.).
Total number of audio files: 175 (35 from each speaker)
Age range of the speakers: 20-23
The TextData.txt file contains the text of the audio set. Each line starts with an opening tag and ends with a closing tag, and the name of the corresponding audio file is appended in parentheses, linking each line of text to its recorded audio. The text data was generated using Avro (a free, open-source writing tool).
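A minimal Python sketch for parsing one line of TextData.txt; the <s>...</s> tag names and the sample line are assumptions for illustration, since the description above does not name the tags:

# Minimal sketch: parse one line into (text, audio file name).
# Tag names and the sample line are assumed, not taken from the dataset.
import re

line = "<s> ekshoto koti </s> (sample_01.wav)"   # illustrative line only
m = re.match(r"<s>\s*(.*?)\s*</s>\s*\((.+?)\)", line)
if m:
    text, audio_file = m.groups()
    print(text, "->", audio_file)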
==========================================================
For the full dataset, please contact nahid-cse@sust.edu
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Arabic Speech Commands Dataset
This dataset is designed to help train simple machine learning models that serve educational and research purposes in the speech recognition domain, mainly for keyword spotting tasks.
Dataset Description
Our dataset is a list of pairs (x, y), where x is the input speech signal and y is the corresponding keyword. The final dataset consists of 12,000 such pairs, comprising 40 keywords. Each audio file is one second long, sampled at 16 kHz. We have 30 participants, each of whom recorded 10 utterances per keyword; therefore, we have 300 audio files per keyword (30 × 10 × 40 = 12,000), and the total size of all recorded keywords is ~384 MB. The dataset also contains several background-noise recordings obtained from various natural noise sources, saved in a separate folder named background_noise with a total size of ~49 MB.
Dataset Structure
There are 40 folders, each of which represents one keyword and contains 300 files. The first eight digits of each file name identify the contributor, while the last two digits identify the round number. For example, the file path rotate/00000021_NO_06.wav indicates that the contributor with the ID 00000021 pronounced the keyword rotate for the 6th time.
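A minimal Python sketch that recovers the keyword, contributor ID, and round number from a file path, following the naming scheme described above:

# Minimal sketch: parse keyword/contributor/round from a dataset path.
from pathlib import Path

path = Path("rotate/00000021_NO_06.wav")
keyword = path.parent.name                      # folder name is the keyword
contributor, _, round_no = path.stem.split("_")
print(keyword, contributor, int(round_no))      # -> rotate 00000021 6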
Data Split
We recommend using the provided CSV files in your experiments. We kept 60% of the dataset for training, 20% for validation, and the remaining 20% for testing. In our split method, we guarantee that all recordings of a certain contributor are within the same subset.
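A minimal Python sketch that checks this contributor-disjoint property; the CSV file names and the "file" column header are assumptions to adjust against the files actually shipped with the dataset:

# Minimal sketch: verify that no contributor appears in more than one subset.
import csv

def contributors(csv_path):
    with open(csv_path, newline="") as f:
        # contributor ID = first eight digits of the audio file name
        return {row["file"].split("/")[-1][:8] for row in csv.DictReader(f)}

train, val, test = (contributors(p) for p in ("train.csv", "validation.csv", "test.csv"))
assert not (train & val or train & test or val & test), "contributor leakage!"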
License
This dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. For more details, see the LICENSE file in this folder.
Citations
If you want to use the Arabic Speech Commands dataset in your work, please cite it as:
@article{arabicspeechcommandsv1,
  author    = {Ghandoura, Abdulkader and Hjabo, Farouk and Al Dakkak, Oumayma},
  title     = {Building and Benchmarking an Arabic Speech Commands Dataset for Small-Footprint Keyword Spotting},
  journal   = {Engineering Applications of Artificial Intelligence},
  year      = {2021},
  publisher = {Elsevier}
}
This work introduces Zambezi Voice, an open-source multilingual speech resource for Zambian languages. It contains two collections of datasets: unlabelled audio recordings of radio news and talk-show programs (160 hours) and labelled data (over 80 hours) consisting of read speech recorded from text sourced from publicly available literature books. The dataset is created for speech recognition but can be extended to multilingual speech processing research for both supervised and unsupervised learning approaches. To our knowledge, this is the first multilingual speech dataset created for Zambian languages. We exploit pretraining and cross-lingual transfer learning by fine-tuning the Wav2Vec2.0 large-scale multilingual pre-trained model to build end-to-end (E2E) speech recognition baseline models. The dataset is released publicly under a Creative Commons BY-NC-ND 4.0 license and can be accessed through the project repository.
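A minimal sketch of the cross-lingual transfer setup described above, assuming the XLSR-53 multilingual checkpoint available through Hugging Face transformers; the paper's exact checkpoint and vocabulary may differ:

# Minimal sketch: load a multilingual Wav2Vec2 model for CTC fine-tuning.
# The CTC head is newly initialized and trained on the labelled data;
# vocab_size is a placeholder for the target language's character set size.
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    vocab_size=40,
)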
https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for Common Voice Corpus 12.0
Dataset Summary
The Common Voice dataset consists of unique MP3 files and corresponding text files. Many of the 26,119 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 17,127 validated hours in 104 languages, but more voices and languages are always being added. Take a look at the Languages page to… See the full description on the dataset page: https://huggingface.co/datasets/mozilla-foundation/common_voice_12_0.
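A minimal sketch for streaming one language configuration with the datasets library; the "en" config is illustrative, and access requires accepting the dataset's terms on the Hugging Face Hub:

# Minimal sketch: stream one language config of Common Voice 12.0.
# May require authentication (huggingface-cli login) after accepting terms.
from datasets import load_dataset

cv = load_dataset("mozilla-foundation/common_voice_12_0", "en",
                  split="validation", streaming=True)
print(next(iter(cv))["sentence"])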
https://data.macgence.com/terms-and-conditions
The audio dataset includes general conversations, featuring Igbo speakers from Africa with detailed metadata.
https://data.macgence.com/terms-and-conditions
The audio dataset includes general conversations, featuring Arabic speakers from the UAE with detailed metadata.
https://data.macgence.com/terms-and-conditions
The audio dataset includes general conversations, featuring English speakers from the USA with detailed metadata.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Wolof Audio Dataset
The Wolof Audio Dataset is a collection of audio recordings and their corresponding transcriptions in Wolof. This dataset is designed to support the development of Automatic Speech Recognition (ASR) models for the Wolof language. It was created by combining three existing datasets:
- ALFFA: available at serge-wilson/wolof_speech_transcription
- FLEURS: available at vonewman/fleurs-wolof-dataset
- Urban Bus Wolof Speech Dataset: available at vonewman/urban-bus-wolof…
See the full description on the dataset page: https://huggingface.co/datasets/vonewman/wolof-audio-data.
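A minimal sketch for loading the combined corpus from the dataset page linked above:

# Minimal sketch: load the combined Wolof corpus via the datasets library.
from datasets import load_dataset

wolof = load_dataset("vonewman/wolof-audio-data")
print(wolof)  # inspect splits and the audio/transcription columns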