Open Data Commons Attribution License (ODC-By) v1.0
https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
This database includes 208 voice samples: 150 from pathological voices and 58 from healthy voices.
Common Voice is an audio dataset consisting of unique MP3 files, each with a corresponding text file. The dataset contains 9,283 recorded hours, of which 7,335 hours across 60 languages have been validated. It also includes demographic metadata such as age, sex, and accent.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)
https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Dataset Card for Chest voice and Falsetto Dataset
The original dataset, sourced from the Chest Voice and Falsetto Dataset, includes 1,280 monophonic singing audio files in .wav format, performed, recorded, and annotated by students majoring in Vocal Music at the China Conservatory of Music. The chest voice is tagged as "chest" and the falsetto voice as "falsetto." Additionally, the dataset encompasses the Mel spectrogram, Mel frequency cepstral coefficient (MFCC), and spectral… See the full description on the dataset page: https://huggingface.co/datasets/ccmusic-database/chest_falsetto.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Finnish General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Finnish speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Finnish communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Finnish speech models that understand and respond to authentic Finnish accents and dialects.
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Finnish. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
The dataset comes with granular metadata for both speakers and recordings:
Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
This dataset is a versatile resource for multiple Finnish speech and language AI applications:
moonling/voice-data dataset hosted on Hugging Face and contributed by the HF Datasets community
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the English Language General Conversation Speech Dataset, a comprehensive and diverse collection of voice data specifically curated to advance the development of English language speech recognition models, with a particular focus on British accents and dialects.
With high-quality audio recordings, detailed metadata, and accurate transcriptions, it empowers researchers and developers to enhance natural language processing, conversational AI, and generative voice AI algorithms. It also facilitates the creation of sophisticated voice assistants and voice bots tailored to the unique linguistic nuances of English as spoken in the United Kingdom.
Speech Data: This training dataset comprises 30 hours of audio recordings covering a wide range of topics and scenarios, ensuring robustness and accuracy in speech technology applications. To achieve this, we collaborated with a diverse network of 40 native English speakers from different regions of the United Kingdom. This collaborative effort guarantees a balanced representation of British accents, dialects, and demographics, reducing bias and promoting inclusivity.
Each audio recording captures the essence of spontaneous, unscripted conversation between two individuals, with durations ranging from 15 to 60 minutes. The speech data is available in WAV format as stereo files with a bit depth of 16 bits and a sample rate of 8 kHz. The recording environment is generally quiet, free of background noise and echo.
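Before feeding such files into a training pipeline, it is worth verifying each recording against the stated format (stereo WAV, 16-bit samples, 8 kHz). The sketch below does this with Python's standard wave module; the function name and return structure are illustrative, not part of the dataset's tooling.

```python
import io
import wave

def check_recording_specs(path_or_file, expected_rate=8000,
                          expected_width=2, expected_channels=2):
    """Verify a WAV file against the dataset's stated specs:
    stereo, 16-bit (2-byte) samples, 8 kHz sample rate."""
    with wave.open(path_or_file, "rb") as w:
        return {
            "channels": w.getnchannels(),
            "sample_width_bytes": w.getsampwidth(),
            "sample_rate_hz": w.getframerate(),
            "duration_s": w.getnframes() / w.getframerate(),
            "matches_spec": (
                w.getnchannels() == expected_channels
                and w.getsampwidth() == expected_width
                and w.getframerate() == expected_rate
            ),
        }
```

Running this over a delivery batch quickly surfaces any file that was resampled or down-mixed in transit.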
Metadata: In addition to the audio recordings, our dataset provides comprehensive metadata for each participant, including the participant's age, gender, country, state, and dialect. Further metadata such as recording device details, topic of recording, bit depth, and sample rate is also provided.
The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of English language speech recognition models.
Transcription: This dataset provides a manual verbatim transcription of each audio file to enhance your workflow efficiency. The transcriptions, available in JSON format, capture speaker-wise segments with time codes, along with non-speech labels and tags.
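To illustrate how a speaker-wise, time-coded transcription might be consumed, the sketch below parses a segment list and tallies words per speaker. The field names (speaker, start, end, text) are assumed for illustration and may differ from FutureBeeAI's actual JSON schema.

```python
import json

# Hypothetical segment layout; the real delivery format defines its own keys.
sample = json.loads("""
{
  "segments": [
    {"speaker": "SPK1", "start": 0.0, "end": 3.2, "text": "Hello, how are you?"},
    {"speaker": "SPK2", "start": 3.4, "end": 5.1, "text": "[laugh] I'm fine, thanks."}
  ]
}
""")

def words_per_speaker(transcript):
    """Count whitespace-delimited tokens per speaker across all segments."""
    counts = {}
    for seg in transcript["segments"]:
        counts[seg["speaker"]] = counts.get(seg["speaker"], 0) + len(seg["text"].split())
    return counts
```

Note that non-speech tags such as "[laugh]" are counted as tokens here; a real pipeline would typically filter them out first.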
Our goal is to expedite the deployment of English language conversational AI and NLP models by offering ready-to-use transcriptions, ultimately saving valuable time and resources in the development process.
Updates and Customization: We understand the importance of collecting data in various environments to build robust ASR models. Therefore, our voice dataset is regularly updated with new audio data captured in diverse real-world conditions.
If you require a custom training dataset with specific environmental conditions such as in-car, busy street, restaurant, or any other scenario, we can accommodate your request. We can provide voice data at customized sample rates from 8 kHz to 48 kHz, allowing you to fine-tune your models for different recording setups. We can also customize the transcriptions to follow your specific guidelines and requirements, further supporting your ASR development process.
License: This audio dataset, created by FutureBeeAI, is available for commercial use.
Conclusion: Whether you are training or fine-tuning speech recognition models, advancing NLP algorithms, exploring generative voice AI, or building cutting-edge voice assistants and bots, our dataset serves as a reliable and valuable resource.
https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for Common Voice Corpus 12.0
Dataset Summary
The Common Voice dataset consists of unique MP3 files, each with a corresponding text file. Many of the 26,119 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 17,127 validated hours in 104 languages, but more voices and languages are always being added. Take a look at the Languages page to… See the full description on the dataset page: https://huggingface.co/datasets/mozilla-foundation/common_voice_12_0.
ESD is an Emotional Speech Database for voice conversion research. The ESD database consists of 350 parallel utterances spoken by 10 native English and 10 native Chinese speakers and covers 5 emotion categories (neutral, happy, angry, sad and surprise). More than 29 hours of speech data were recorded in a controlled acoustic environment. The database is suitable for multi-speaker and cross-lingual emotional voice conversion studies.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Norwegian General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Norwegian speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Norwegian communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Norwegian speech models that understand and respond to authentic Norwegian accents and dialects.
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Norwegian. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
The dataset comes with granular metadata for both speakers and recordings:
Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
This dataset is a versatile resource for multiple Norwegian speech and language AI applications:
In the second quarter of 2024, mobile data traffic reached almost *** exabytes worldwide, an increase of around ** exabytes over the same quarter of the previous year. Global mobile voice traffic has remained steady at **** exabytes since the first quarter of 2016.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Japanese General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Japanese speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Japanese communication.
Curated by FutureBeeAI, this 40-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Japanese speech models that understand and respond to authentic Japanese accents and dialects.
The dataset comprises 40 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Japanese. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
The dataset comes with granular metadata for both speakers and recordings:
Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
This dataset is a versatile resource for multiple Japanese speech and language AI applications:
I would be grateful if you would cite the following two papers:
Attribution 4.0 (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present VocalSet, a singing voice dataset consisting of 10.1 hours of monophonic recorded audio of professional singers demonstrating both standard and extended vocal techniques on all 5 vowels. Existing singing voice datasets aim to capture a focused subset of singing voice characteristics, and generally consist of just a few singers. VocalSet contains recordings from 20 different singers (9 male, 11 female) and a range of voice types. VocalSet aims to improve the state of existing singing voice datasets and singing voice research by capturing not only a range of vowels, but also a diverse set of voices on many different vocal techniques, sung in contexts of scales, arpeggios, long tones, and excerpts.
We have included two .rtf files, test_singers and train_singers, listing the singers used to train and test the majority of our deep learning models.
CC0 1.0 Universal Public Domain Dedication
https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains audio files and transcripts in Italian and related to manufacturing. We collected the scripts during the Horizon Europe RIA COALA (GA 957296, project reference website) from industrial use cases and hired a service provider to generate the related audio files (BIBA - Bremer Institut für Produktion und Logistik GmbH ordered the service). The service provider checked the audio files for quality.
The service provider recruited crowd workers, and gathered their audio records, informed consent (privacy) and agreement that their records become public domain (Creative Commons 0; https://creativecommons.org/share-your-work/public-domain/cc0/). The service provider declared to follow a Crowd Code of Ethics and a Fair Pay policy.
The metadata file contains the following information:
file_name: name of the audio file
script: script the speaker had to speak
scriptId: the numeric identifier of the script
participantId: the numeric identifier of the participant (speaker)
gender: the gender as indicated by the participant (MALE or FEMALE)
age: the age in years as indicated by the participant
age_range: the age range in years (18-30, 31-45, 46+)
country: the birth country indicated by the participant
current_country: the country of residence indicated by the participant
primary_language: the language indicated as primary by the participant
ever_worked_factory: answer to the question: "Have you ever worked in a factory, manufacturing setting?" (Yes/No)
years_worked_factory: answer to the question: "If yes, for how many years?" (1-10, 10+)
background_noise_type: background noise in the audio as indicated by the participant (mild, humming/technical, no noise)
gdpr_and_ipr_consent: consent to the privacy notice and the IPR transfer to CC0 (Yes)
date_signed: date when the participant signed the consent form (US format, MM.DD.YYYY)
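The metadata fields above can be loaded and filtered with the standard csv module. The sketch below is illustrative only: the field names follow the description above, but the actual file name, delimiter, and sample rows are assumptions.

```python
import csv
import io

# Hypothetical two-row extract matching the documented metadata columns.
sample_csv = """file_name,script,scriptId,participantId,gender,age,age_range,country,current_country,primary_language,ever_worked_factory,years_worked_factory,background_noise_type,gdpr_and_ipr_consent,date_signed
audio_001.wav,Start the milling machine,1,101,FEMALE,34,31-45,Italy,Italy,Italian,Yes,1-10,no noise,Yes,03.15.2023
audio_002.wav,Check the conveyor belt,2,102,MALE,52,46+,Italy,Germany,Italian,No,,mild,Yes,03.16.2023
"""

def filter_speakers(metadata_text, **criteria):
    """Return metadata rows whose fields exactly match all given criteria."""
    rows = csv.DictReader(io.StringIO(metadata_text))
    return [r for r in rows if all(r[k] == v for k, v in criteria.items())]
```

For example, filter_speakers(sample_csv, ever_worked_factory="Yes") selects only participants with factory experience, which is useful for domain-specific subset training.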
https://www.datainsightsmarket.com/privacy-policy
The global voice data services market is experiencing robust growth, driven by the increasing adoption of voice-enabled technologies across various sectors. The market's expansion is fueled by the surge in demand for accurate and efficient transcription, translation, and analysis of voice data. This demand stems from several key factors, including the proliferation of virtual assistants, smart speakers, and contact center solutions, all reliant on sophisticated voice data processing. Furthermore, advancements in artificial intelligence (AI) and machine learning (ML) are leading to more accurate and cost-effective voice data solutions, further stimulating market growth. We estimate the market size in 2025 to be $5 billion, based on observed growth in related sectors like AI and the increasing adoption of voice technologies. A Compound Annual Growth Rate (CAGR) of 15% is projected for the forecast period (2025-2033), indicating a significant expansion of the market in the coming years. Key market segments include transcription services, translation services, and voice analytics. Leading companies like SpeechOcean, Nexdata, and others are actively shaping market dynamics through technological innovation and strategic partnerships. However, challenges remain, including data privacy concerns and the need for robust data security measures to ensure the responsible and ethical use of voice data.
The market's future trajectory is strongly linked to advancements in AI and natural language processing (NLP). Continued improvements in speech recognition accuracy, coupled with the development of more sophisticated voice biometric systems, will unlock new opportunities within the healthcare, finance, and customer service industries. While data security and privacy remain significant concerns, regulatory developments and technological advancements are addressing these issues.
The increasing adoption of cloud-based solutions is also driving efficiency and scalability within the voice data services market, reducing costs and increasing accessibility for businesses of all sizes. The competitive landscape is characterized by both established players and emerging startups, with companies focusing on innovation and differentiation through specialized services and targeted solutions. Geographic expansion, particularly in developing economies with growing digital infrastructure, is expected to significantly contribute to overall market growth.
This Infant Laugh smartphone speech dataset contains laugh sounds collected from 20 infants and young children aged 0-3 years. Its quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring the protection of user privacy and legal rights throughout the data collection, storage, and usage processes; our datasets are GDPR, CCPA, and PIPL compliant.
https://www.archivemarketresearch.com/privacy-policy
The voice data service market is experiencing rapid growth, driven by increasing demand for AI training and voice content review. The market size is expected to reach USD XXX million by 2033, growing at a CAGR of XX% during the forecast period. Key drivers include the proliferation of voice-enabled devices, advancements in natural language processing (NLP), and growing adoption of AI solutions across industries. Voice recognition data service holds the largest market share, accounting for over XX%, followed by voice synthesis data service. The market is expected to be highly competitive, with major players including Speechocean, Nexdata, and Beijing Surfing Technology.
The market is segmented by type, application, and region. By type, the market is divided into voice recognition data service, voice synthesis data service, and others. By application, the market is segmented into AI training, voice content review, financial anti-fraud, and others. North America is expected to remain the dominant region, followed by Europe and Asia Pacific. The market in emerging regions such as South America, Middle East & Africa, and Asia Pacific is anticipated to witness significant growth due to increasing adoption of voice data services in these regions. Additionally, the rising popularity of remote work and online education is driving demand for voice data services that facilitate communication and collaboration.
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
4 hours and 80 minutes of speech spoken by 2 female speakers and 2 male speakers, covering both mimics and parallel voice conversion data.
English (United Kingdom) Children Scripted Monologue Microphone speech dataset, collected from monologues based on given scripts covering educational materials for children, story books, informal language, numbers, and the alphabet. Recordings are transcribed with text content and other attributes. The dataset was collected from an extensive, geographically diverse pool of speakers (201 British children recorded with a hi-fi microphone), enhancing model performance in real and complex tasks. Its quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring the protection of user privacy and legal rights throughout the data collection, storage, and usage processes; our datasets are GDPR, CCPA, and PIPL compliant.
https://physionet.org/about/duas/bridge2ai-voice-registered-access-agreement/
The human voice contains complex acoustic markers which have been linked to important health conditions including dementia, mood disorders, and cancer. When viewed as a biomarker, voice is a promising characteristic to measure as it is simple to collect, cost-effective, and has broad clinical utility. Recent advances in artificial intelligence have provided techniques to extract previously unknown prognostically useful information from dense data elements such as images. The Bridge2AI-Voice project seeks to create an ethically sourced flagship dataset to enable future research in artificial intelligence and support critical insights into the use of voice as a biomarker of health. Here we present Bridge2AI-Voice, a comprehensive collection of data derived from voice recordings with corresponding clinical information. Bridge2AI-Voice v2.0 contains data for 19,271 recordings collected from 442 participants across five sites in North America. Participants were selected based on known conditions which manifest within the voice waveform including voice disorders, neurological disorders, mood disorders, and respiratory disorders. The release contains data considered low risk, including derivations such as spectrograms but not the original voice recordings. Detailed demographic, clinical, and validated questionnaire data are also made available. Audio recordings are included on a companion release on PhysioNet with the title "Bridge2AI-Voice: An ethically-sourced, diverse voice dataset linked to health information (Audio Included)". Please see that project for details to request access.