Licences: ELRA End User Licence (http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf); ELRA VAR Licence (http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf)
The database contains recordings from six actors, three of each gender. The following emotions have been recorded: neutral, anger, happiness, sadness and fear. The database consists of 32 isolated words, 30 short semantically neutral sentences, 30 long semantically neutral sentences and one passage of 79 words. The overall size of the database is 2,790 recordings, or approximately 3 hours of speech. Statistical evaluation shows that the database is fully phonetically balanced with respect to the phonetic statistics of the Serbian language, and that the statistics of other speech segments (syllables, consonant sets, accents) agree with the overall statistics of Serbian. The GEES database was recorded in an anechoic studio at a 22,050 Hz sampling rate.
Licences: ELRA End User Licence (http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf); ELRA VAR Licence (http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf)
The STC Russian speech database was recorded in 1996-1998. The main purpose of the database is to investigate individual speaker variability and to validate speaker recognition algorithms. The database was recorded through a 16-bit Creative Labs Vibra-16 sound card at an 11,025 Hz sampling rate. It contains Russian read speech from 89 speakers (54 male, 35 female), including 70 speakers with 15 sessions or more, 10 speakers with 10 sessions or more, and 9 speakers with fewer than 10 sessions. The speakers, all native, were recorded in Saint Petersburg and are aged 18-62. The corpus consists of 5 sentences; each speaker read each sentence carefully but fluently 15 times on different dates over a period of 1-3 months. The corpus contains a total of 6,889 utterances on 2 volumes, with a total size of 700 MB of uncompressed data. The signal of each utterance is stored as a separate file (approx. 126 KB); the total data for one speaker is approximately 9,500 KB, and the average utterance duration is about 5 seconds. A file gives information about the speakers (speaker's age and gender). The orthography and phonetic transcription of the corpus are given in separate files that contain the prompted sentences and their transcription in IPA. The signal files are raw files without any header: 16 bits per sample, linear, 11,025 Hz sampling frequency. The recording conditions were as follows:
- Microphone: dynamic omnidirectional high-quality microphone, distance to mouth 5-10 cm
- Environment: office room
- Sampling rate: 11,025 Hz
- Resolution: 16 bit
- Sound board: Creative Labs Vibra-16
- Means of delivery: CD-ROM
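Because the signal files carry no header, they can be read directly as raw 16-bit PCM. Below is a minimal loading sketch (not part of the ELRA documentation): the little-endian byte order and the file name are assumptions.

```python
import numpy as np

# Read one STC utterance, assuming headerless raw PCM with 16-bit linear
# samples at 11,025 Hz as described above.
SAMPLE_RATE = 11_025

def load_raw_utterance(path: str) -> np.ndarray:
    samples = np.fromfile(path, dtype="<i2")      # 16-bit signed integers, little-endian assumed
    return samples.astype(np.float32) / 32768.0   # scale to roughly [-1.0, 1.0)

audio = load_raw_utterance("spk01_sent1_session01.raw")  # hypothetical file name
print(f"{len(audio) / SAMPLE_RATE:.2f} seconds of audio")
```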
Licences: ELRA VAR Licence (https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf); ELRA End User Licence (https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf)
The Persian Speech Database Farsdat comprises recordings of 300 Iranian speakers, who differ in age, sex, education level, and dialect (10 dialect regions of Iran are represented: Tehrani, Torki, Esfahani, Jonubi, Shomali, Khorassani, Baluchi, Kordi, Lori, and Yazdi). Each speaker uttered 20 sentences in two sessions, and 100 of these speakers also uttered 110 isolated words. 6,000 utterances, including 386 phonetically balanced sentences, were manually segmented and labelled phonetically and phonemically using IPA characters. The acoustic signal is stored in standard WAV format so that it can be used by other application software. The sampling frequency is 22.5 kHz and the signal-to-noise ratio is 34 dB. Ambiguities in segmentation were resolved by reference to the corresponding spectrograms extracted with a KAY DSP Sona-Graph 5500.
The TORGO database of dysarthric articulation consists of aligned acoustics and measured 3D articulatory features from speakers with either cerebral palsy (CP) or amyotrophic lateral sclerosis (ALS), which are two of the most prevalent causes of speech disability (Kent and Rosen, 2004), and matched controls. This database, called TORGO, is the result of a collaboration between Computer Science and Speech-Language Pathology departments at the University of Toronto and the Holland-Bloorview Kids Rehab hospital in Toronto.
This dataset contains 2,000 samples in total, covering dysarthric males, dysarthric females, non-dysarthric males, and non-dysarthric females (500 each).
The original TORGO database contains 18 GB of data. To download it and for more information about the data, please refer to the following link: http://www.cs.toronto.edu/~complingweb/data/TORGO/torgo.html
This database should be used only for academic purposes.
Database / Licence Reference: Rudzicz, F., Namasivayam, A.K., Wolff, T. (2012) The TORGO database of acoustic and articulatory speech from speakers with dysarthria. Language Resources and Evaluation, 46(4), pages 523--541.
Data Information:
It contains four folders, described below:
- dysarthria_female: 500 samples of dysarthric female audio recorded in different sessions.
- dysarthria_male: 500 samples of dysarthric male audio recorded in different sessions.
- non_dysarthria_female: 500 samples of non-dysarthric female audio recorded in different sessions.
- non_dysarthria_male: 500 samples of non-dysarthric male audio recorded in different sessions.
data.csv columns:
- filename: audio file path
- is_dysarthria: non-dysarthria or dysarthria
- gender: male or female
Application of the data:
- Applying deep learning technology to classify dysarthric and non-dysarthric speech (see the loading sketch below)
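As an illustration of the classification use case above, the following sketch loads data.csv with pandas and builds a train/test split. The column names and label values come from the description above; the row-level stratified split and the file location are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the metadata file described above and derive a binary label for a
# dysarthria vs. non-dysarthria classifier.
df = pd.read_csv("data.csv")
df["label"] = (df["is_dysarthria"] == "dysarthria").astype(int)

# Row-level split for illustration; speaker-independent splits are preferable in practice.
train_df, test_df = train_test_split(df, test_size=0.2, stratify=df["label"], random_state=0)
print(len(train_df), "training files,", len(test_df), "test files")
```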
References: Dumane, P., Hungund, B., Chavan, S. (2021). Dysarthria Detection Using Convolutional Neural Network. In: Pawar, P.M., Balasubramaniam, R., Ronge, B.P., Salunkhe, S.B., Vibhute, A.S., Melinamath, B. (eds) Techno-Societal 2020. Springer, Cham. https://doi.org/10.1007/978-3-030-69921-5_45
Licence: Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
Dataset comprises 488 hours of telephone dialogues in Spanish, collected from 600 native speakers across various topics and domains. This dataset boasts an impressive 98% word accuracy rate, making it a valuable resource for advancing speech recognition technology.
By utilizing this dataset, researchers and developers can advance their understanding and capabilities in automatic speech recognition (ASR) systems, transcribing audio, and natural language processing (NLP).
The dataset includes high-quality audio recordings with text transcriptions, making it ideal for training and evaluating speech recognition models.
- Audio files: High-quality recordings in WAV format
- Text transcriptions: Accurate and detailed transcripts for each audio segment
- Speaker information: Metadata on native speakers, including gender and other attributes
- Topics: Diverse domains such as general conversations, business, and more
This dataset is a valuable resource for researchers and developers working on speech recognition, language models, and speech technology.
Licences: ELRA End User Licence (https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf); ELRA VAR Licence (https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf)
EWA-DB is a speech database that contains data from 3 clinical groups (Alzheimer's disease, Parkinson's disease, and mild cognitive impairment) and a control group of healthy subjects. Speech samples of each clinical group were obtained using the EWA smartphone application, which contains 4 different language tasks: sustained vowel phonation, diadochokinesis, object and action naming (30 objects and 30 actions), and picture description (two single pictures and three complex pictures). The total number of speakers in the database is 1,649. Of these, there are 87 people with Alzheimer's disease, 175 people with Parkinson's disease, 62 people with mild cognitive impairment, 2 people with a mixed diagnosis of Alzheimer's + Parkinson's disease, and 1,323 healthy controls. For speakers who provided written consent (1,003 speakers in total), we publish audio recordings in WAV format. We also attach a JSON file with the ASR transcription, manual annotation where available (965 speakers), and additional information about the speaker. For speakers who did not give their consent to publish the recording, only the JSON file is provided. ASR transcription is provided for all 1,649 speakers. All 1,649 speakers gave their consent to the provider to process their audio recordings, so third-party researchers can also carry out experiments on the unpublished audio recordings through cooperation with the provider.
Licence: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The Clarity Speech Corpus is a forty-speaker British English speech dataset. The corpus was created for running listening tests to gauge speech intelligibility and quality in the Clarity Project, which has the goal of advancing speech signal processing by hearing aids through a series of challenges. The dataset is suitable for machine learning and other uses in speech and hearing technology, acoustics and psychoacoustics. The data comprises recordings of approximately 10,000 sentences drawn from the British National Corpus (BNC) with suitable length, words and grammatical construction for speech intelligibility testing. The collection process involved the selection of a subset of BNC sentences, the recording of these sentences by 40 British English speakers, and the processing of the recordings to create individual sentence recordings with associated prompts and metadata. clarity_utterances.v1_2.tar.gz contains all the recordings as .wav files, with the accompanying metadata, such as text prompts, in clarity_master.json. Further details are given in the readme. Sample_clarity_utterances.zip contains a sample of 10 recordings. Please reference the following data paper, which has details on how the corpus was generated: Graetzer, S., Akeroyd, M.A., Barker, J., Cox, T.J., Culling, J.F., Naylor, G., Porter, E. and Muñoz, R.V., 2022. Dataset of British English speech recordings for psychoacoustics and speech processing research: the Clarity Speech Corpus. Data in Brief, p.107951.
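As a quick-start sketch (the extraction path and the location of clarity_master.json inside the archive are assumptions; the bundled readme is the authoritative reference), the metadata can be inspected like this:

```python
import json
import tarfile

# Unpack the corpus archive and load the accompanying metadata file.
with tarfile.open("clarity_utterances.v1_2.tar.gz", "r:gz") as tar:
    tar.extractall("clarity_speech_corpus")

with open("clarity_speech_corpus/clarity_master.json", encoding="utf-8") as f:
    metadata = json.load(f)

# Inspect the structure before assuming a schema (e.g. prompts keyed by utterance ID).
print(type(metadata), len(metadata))
```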
ShEMO: a large-scale validated database for Persian speech emotion detection
Abstract
This paper introduces a large-scale, validated database for Persian called the Sharif Emotional Speech Database (ShEMO). The database includes 3,000 semi-natural utterances, equivalent to 3 hours and 25 minutes of speech data extracted from online radio plays. ShEMO covers speech samples of 87 native Persian speakers for five basic emotions including anger, fear, happiness, sadness and… See the full description on the dataset page: https://huggingface.co/datasets/Mansooreh/sharif-emotional-speech-dataset.
Licence: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
Description
The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) contains 7356 files (total size: 24.8 GB). The dataset contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions, and song contains calm, happy, sad, angry, and fearful emotions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression. All conditions are available in three modality formats: Audio-only (16bit, 48kHz .wav), Audio-Video (720p H.264, AAC 48kHz, .mp4), and Video-only (no sound). Note, there are no song files for Actor_18.
The RAVDESS was developed by Dr Steven R. Livingstone, who now leads the Affective Data Science Lab, and Dr Frank A. Russo who leads the SMART Lab.
Citing the RAVDESS
The RAVDESS is released under a Creative Commons license, so please cite the RAVDESS if it is used in your work in any form. Published academic papers should use the academic paper citation for our PLoS ONE paper. Personal works, such as machine learning projects/blog posts, should provide a URL to this Zenodo page, though a reference to our PLoS ONE paper would also be appreciated.
Academic paper citation
Livingstone SR, Russo FA (2018) The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 13(5): e0196391. https://doi.org/10.1371/journal.pone.0196391.
Personal use citation
Include a link to this Zenodo page - https://zenodo.org/record/1188976
Commercial Licenses
Commercial licenses for the RAVDESS can be purchased. For more information, please visit our license fee page, or contact us at ravdess@gmail.com.
Contact Information
If you would like further information about the RAVDESS, to purchase a commercial license, or if you experience any issues downloading files, please contact us at ravdess@gmail.com.
Example Videos
Watch a sample of the RAVDESS speech and song videos.
Emotion Classification Users
If you're interested in using machine learning to classify emotional expressions with the RAVDESS, please see our new RAVDESS Facial Landmark Tracking data set [Zenodo project page].
Construction and Validation
Full details on the construction and perceptual validation of the RAVDESS are described in our PLoS ONE paper - https://doi.org/10.1371/journal.pone.0196391.
The RAVDESS contains 7356 files. Each file was rated 10 times on emotional validity, intensity, and genuineness. Ratings were provided by 247 individuals who were characteristic of untrained adult research participants from North America. A further set of 72 participants provided test-retest data. High levels of emotional validity, interrater reliability, and test-retest intrarater reliability were reported. Validation data is open-access, and can be downloaded along with our paper from PLoS ONE.
Contents
Audio-only files
Audio-only files of all actors (01-24) are available as two separate zip files (~200 MB each):
Audio-Visual and Video-only files
Video files are provided as separate zip downloads for each actor (01-24, ~500 MB each), and are split into separate speech and song downloads:
File Summary
In total, the RAVDESS collection includes 7356 files (2880+2024+1440+1012 files).
File naming convention
Each of the 7356 RAVDESS files has a unique filename. The filename consists of a 7-part numerical identifier (e.g., 02-01-06-01-02-01-12.mp4). These identifiers define the stimulus characteristics:
Filename identifiers
Filename example: 02-01-06-01-02-01-12.mp4
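A small parsing sketch for this naming scheme is given below. The field order (modality, vocal channel, emotion, intensity, statement, repetition, actor) and the emotion code mapping follow the published RAVDESS documentation, but treat them as assumptions and verify against the official identifier table before relying on them.

```python
# Hedged sketch: split a RAVDESS file name into its 7 identifier fields.
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def parse_ravdess_name(filename: str) -> dict:
    stem = filename.rsplit(".", 1)[0]
    modality, channel, emotion, intensity, statement, repetition, actor = stem.split("-")
    return {
        "modality": modality,            # e.g. 02 = video-only (per RAVDESS docs)
        "vocal_channel": channel,        # 01 = speech, 02 = song
        "emotion": EMOTIONS.get(emotion, emotion),
        "intensity": intensity,          # 01 = normal, 02 = strong
        "statement": statement,
        "repetition": repetition,
        "actor": int(actor),             # 01-24
    }

print(parse_ravdess_name("02-01-06-01-02-01-12.mp4"))  # the example filename above
```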
License information
The RAVDESS is released under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, CC BY-NC-SA 4.0
Commercial licenses for the RAVDESS can also be purchased. For more information, please visit our license fee page, or contact us at ravdess@gmail.com.
Related Data sets
Licence: CC0 1.0, https://choosealicense.com/licenses/cc0-1.0/
Speech and Noise Dataset
Overview
This dataset contains three types of audio recordings:
Clean Speech – recordings of only speech, without noise.
Noisy Speech – recordings of speech mixed with noise.
Noise Only – recordings of only background/environmental noise.
The dataset is designed for speech enhancement, noise reduction, and speech recognition research.
Dataset Structure
clean_speech/ – speech-only recordings
noisy_speech/ – speech… See the full description on the dataset page: https://huggingface.co/datasets/haydarkadioglu/speech-noise-dataset.
Licence: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
ParlSpeech V2 contains complete full-text vectors of more than 6.3 million parliamentary speeches in the key legislative chambers of Austria, the Czech Republic, Germany, Denmark, the Netherlands, New Zealand, Spain, Sweden, and the United Kingdom, covering periods of between 21 and 32 years. Metadata include information on the date, speaker, party, and partly the agenda item under which a speech was held. The accompanying release note provides a more detailed guide to the data.
Licence: ODC Public Domain Dedication and Licence (PDDL) v1.0, http://www.opendatacommons.org/licenses/pddl/1.0/
This dataset contains over 500 Donald Trump speeches from 2015-2024. It was made public so that voters can have more context before casting their 2024 US Presidential Election votes. Historians, linguists and language analysts can also use this data in years to come for their research purposes. The data is unbiased, strictly Donald Trump speech, and from a diverse range of times, topics and contexts. Please make good use of this carefully and meticulously crafted dataset and don't forget to share your findings! One last thing… don't forget to vote in 2024!
Spanish (Latin America) Scripted Monologue Smartphone speech dataset, collected from monologues based on given scripts, covering the generic domain, human-machine interaction, smart home commands, in-car commands, numbers, news and other domains. Transcribed with text content and other attributes. The dataset was collected from an extensive and geographically diverse pool of speakers (1,630 people in total, such as Mexicans, Colombians, etc.), enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring the maintenance of user privacy and legal rights throughout the data collection, storage, and usage processes; our datasets are all GDPR, CCPA and PIPL compliant.
Licence: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
Multilingual Call Center Speech Recognition Dataset: 10,000 Hours
Dataset Summary
10,000 hours of real-world call center speech recordings in 7 languages with transcripts. Train speech recognition, sentiment analysis, and conversation AI models on authentic customer support audio. Covers support, sales, billing, finance, and pharma domains
Dataset Features
Scale & Quality: 10,000 hours of inbound & outbound calls. Real-world field… See the full description on the dataset page: https://huggingface.co/datasets/AxonData/multilingual-call-center-speech-dataset.
Licence: FutureBeeAI AI Data License Agreement, https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Czech General Conversation Speech Dataset – a rich, linguistically diverse corpus purpose-built to accelerate the development of Czech speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Czech communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Czech speech models that understand and respond to authentic Czech accents and dialects.
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Czech. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
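As a rough integration sketch (the directory layout and JSON field names, such as a "transcription" key, are assumptions rather than part of the official documentation), audio/transcription pairs could be iterated like this:

```python
import json
from pathlib import Path

# Walk a local copy of the corpus and pair each .wav file with a same-named .json
# transcription file. Layout and field names are assumed; adjust to the actual delivery.
def iter_utterances(root: str):
    for wav_path in Path(root).rglob("*.wav"):
        json_path = wav_path.with_suffix(".json")
        if not json_path.exists():
            continue
        with open(json_path, encoding="utf-8") as f:
            meta = json.load(f)
        yield wav_path, meta.get("transcription", "")

for wav_path, text in iter_utterances("czech_general_conversation"):  # hypothetical root folder
    print(wav_path.name, "->", text[:60])
```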
The dataset comes with granular metadata for both speakers and recordings:
Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
This dataset is a versatile resource for multiple Czech speech and language AI applications:
Licence: Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
German Speech Dataset for recognition task
Dataset comprises 431 hours of telephone dialogues in German, collected from 590+ native speakers across various topics and domains, achieving an impressive 95% sentence accuracy rate. It is designed for research in automatic speech recognition (ASR) systems. By utilizing this dataset, researchers and developers can advance their understanding and capabilities in transcribing audio and natural language processing (NLP). See the full description on the dataset page: https://huggingface.co/datasets/UniDataPro/german-speech-recognition-dataset.
Licences: ELRA End User Licence (https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf); ELRA VAR Licence (https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf)
The Japanese Kids Speech database (Upper Grade) contains the recordings of 232 Japanese child speakers (104 males and 128 females), from 9 to 13 years old (fourth, fifth and sixth graders in elementary school), recorded in quiet rooms using smartphones. This database may be combined with the Japanese Kids Speech database (Lower Grade), also available in the ELRA Catalogue under reference ELRA-S0411.
Number of speakers, utterances, duration and age are as follows:
- Number of speakers: 232 (104 male / 128 female)
- Number of utterances (average): 385 utterances per speaker
- Total number of utterances: 89,454
- Age: from 9 to 13 years old
- Total hours of data: 145.4
1,018 sentences were used. Recordings were made through smartphones and the audio data are stored in .wav files as 16 kHz, mono, 16-bit linear PCM.
Database:
- Audio data: WAV format, 16 kHz, 16-bit, mono (recorded with smartphone)
- Recording scripts: TSV format (tab-delimited), UTF-8 (without BOM)
- Transcription data: TSV format (tab-delimited), UTF-8 (without BOM)
- Size: 16.2 GB
Number of speakers per age:
- 9 years old: 56 (21 male, 35 female)
- 10 years old: 71 (30 male, 41 female)
- 11 years old: 65 (28 male, 37 female)
- 12 years old: 38 (24 male, 14 female)
- 13 years old: 2 (1 male, 1 female)
Structure of the database:
- readme.txt
- Japanese Kids Speech Database.pdf: description document of the database
- Transcription.tsv: transcription
- scripts.tsv: script
- voices/: directory of audio data
  - high/: directory of upper grade
    - (speaker_ID)/: directory of speaker ID (six digits)
      - (audio_file): audio file (WAV format, 16 kHz, 16-bit, mono)
File naming conventions of audio files are as follows:
- Field 0: Language ID, "JA" (fixed), Japanese
- Field 1: Speaker ID, six digits, 5XXXXX
- Field 2: Script ID, HXXXX (XXXX: four digits)
- Field 3: Age, two digits
- Field 4: Gender, M: male, F: female
The field separation character is "_". For example, the audio file name "JA_500002_H0001_10_F.wav" has the following meaning:
- JA: Language ID (Japanese)
- 500002: speaker ID
- H0001: script ID
- 10: age (ten years old)
- F: gender (female)
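The naming convention above is regular enough to parse programmatically. Here is a small sketch; the regular expression mirrors the fields listed above, and the example file name is the one from the description.

```python
import re
from pathlib import Path

# Parse "JA_<speaker:6 digits>_H<script:4 digits>_<age:2 digits>_<gender M/F>.wav"
FILENAME_RE = re.compile(
    r"(?P<lang>JA)_(?P<speaker>\d{6})_(?P<script>H\d{4})_(?P<age>\d{2})_(?P<gender>[MF])\.wav"
)

def parse_kids_filename(path: str) -> dict:
    m = FILENAME_RE.fullmatch(Path(path).name)
    if m is None:
        raise ValueError(f"Unexpected file name: {path}")
    info = m.groupdict()
    info["age"] = int(info["age"])
    return info

print(parse_kids_filename("JA_500002_H0001_10_F.wav"))
# -> {'lang': 'JA', 'speaker': '500002', 'script': 'H0001', 'age': 10, 'gender': 'F'}
```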
Licences: ELRA End User Licence (https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf); ELRA VAR Licence (https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf)
The MEDIA speech database for French was produced by ELDA within the French national project MEDIA (automatic evaluation of man-machine dialogue systems), as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT). It contains 1,258 transcribed dialogues from 250 adult speakers. The method chosen for the corpus construction process is that of a "Wizard of Oz" (WoZ) system, which consists of simulating a natural language man-machine dialogue. The scenario was built in the domain of tourism and hotel reservation. The database is formatted following the SpeechDat conventions and includes the following items:
- 1,258 recorded sessions for a total of 70 hours of speech. The signals are stored in a stereo wave file format. Each of the two speech channels is recorded at 8 kHz with 16-bit quantization, least significant byte first ("lohi" or Intel format), as signed integers.
- Manual transcription of each session in XML format. Label files were created with the free transcription tool Transcriber (TRS files).
- Phonetic lexicon containing all the words spoken in the database. Column 1 contains the orthography of the French word, column 2 the frequency of the word, and column 3 the pronunciation in SAMPA format. A sample entry of the lexicon: agitée (frequency 3, pronunciation a Z i t e).
- Documentation and statistics are also provided with the database.
The semantic annotation of the corpus is available in this catalogue and referenced ELRA-E0024 (MEDIA Evaluation Package).
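Since each session is a two-channel 8 kHz WAV file, the standard-library wave module is enough to separate the channels. The sketch below assumes a hypothetical file name, and the assignment of channels to dialogue sides is an assumption to be checked against the documentation.

```python
import wave
import numpy as np

# Read one MEDIA session (stereo, 8 kHz, 16-bit little-endian) and split the channels.
with wave.open("media_session_0001.wav", "rb") as w:          # hypothetical file name
    assert w.getnchannels() == 2 and w.getsampwidth() == 2 and w.getframerate() == 8000
    frames = w.readframes(w.getnframes())

samples = np.frombuffer(frames, dtype="<i2").reshape(-1, 2)    # interleaved signed 16-bit
channel_a, channel_b = samples[:, 0], samples[:, 1]            # caller/wizard assignment unverified
print(channel_a.shape, channel_b.shape)
```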
Licence: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The TSP speech database is a dataset of speech recordings.