14 datasets found
  1. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Oct 19, 2024
    + more versions
    Cite
    Steven R. Livingstone; Frank A. Russo (2024). The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) [Dataset]. http://doi.org/10.5281/zenodo.1188976
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 19, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Steven R. Livingstone; Frank A. Russo
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) contains 7356 files (total size: 24.8 GB). The dataset contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions, and song contains calm, happy, sad, angry, and fearful emotions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression. All conditions are available in three modality formats: Audio-only (16-bit, 48 kHz .wav), Audio-Video (720p H.264, AAC 48 kHz, .mp4), and Video-only (no sound). Note: there are no song files for Actor_18.

    The RAVDESS was developed by Dr Steven R. Livingstone, who now leads the Affective Data Science Lab, and Dr Frank A. Russo, who leads the SMART Lab.

    Citing the RAVDESS

    The RAVDESS is released under a Creative Commons license, so please cite the RAVDESS if it is used in your work in any form. Published academic papers should use the academic paper citation for our PLoS ONE paper. Personal works, such as machine learning projects/blog posts, should provide a URL to this Zenodo page, though a reference to our PLoS ONE paper would also be appreciated.

    Academic paper citation

    Livingstone SR, Russo FA (2018) The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 13(5): e0196391. https://doi.org/10.1371/journal.pone.0196391.

    Personal use citation

    Include a link to this Zenodo page - https://zenodo.org/record/1188976

    Commercial Licenses

    Commercial licenses for the RAVDESS can be purchased. For more information, please visit our license fee page, or contact us at ravdess@gmail.com.

    Contact Information

    If you would like further information about the RAVDESS, to purchase a commercial license, or if you experience any issues downloading files, please contact us at ravdess@gmail.com.

    Example Videos

    Watch a sample of the RAVDESS speech and song videos.

    Emotion Classification Users

    If you're interested in using machine learning to classify emotional expressions with the RAVDESS, please see our new RAVDESS Facial Landmark Tracking data set [Zenodo project page].

    Construction and Validation

    Full details on the construction and perceptual validation of the RAVDESS are described in our PLoS ONE paper - https://doi.org/10.1371/journal.pone.0196391.

    The RAVDESS contains 7356 files. Each file was rated 10 times on emotional validity, intensity, and genuineness. Ratings were provided by 247 individuals who were characteristic of untrained adult research participants from North America. A further set of 72 participants provided test-retest data. High levels of emotional validity, interrater reliability, and test-retest intrarater reliability were reported. Validation data is open-access, and can be downloaded along with our paper from PLoS ONE.

    Contents

    Audio-only files

    Audio-only files of all actors (01-24) are available as two separate zip files (~200 MB each):

    • Speech file (Audio_Speech_Actors_01-24.zip, 215 MB) contains 1440 files: 60 trials per actor x 24 actors = 1440.
    • Song file (Audio_Song_Actors_01-24.zip, 198 MB) contains 1012 files: 44 trials per actor x 23 actors = 1012.

    Audio-Visual and Video-only files

    Video files are provided as separate zip downloads for each actor (01-24, ~500 MB each), and are split into separate speech and song downloads:

    • Speech files (Video_Speech_Actor_01.zip to Video_Speech_Actor_24.zip) collectively contain 2880 files: 60 trials per actor x 2 modalities (AV, VO) x 24 actors = 2880.
    • Song files (Video_Song_Actor_01.zip to Video_Song_Actor_24.zip) collectively contain 2024 files: 44 trials per actor x 2 modalities (AV, VO) x 23 actors = 2024.

    File Summary

    In total, the RAVDESS collection includes 7356 files (2880+2024+1440+1012 files).

    File naming convention

    Each of the 7356 RAVDESS files has a unique filename. The filename consists of a 7-part numerical identifier (e.g., 02-01-06-01-02-01-12.mp4). These identifiers define the stimulus characteristics:

    Filename identifiers

    • Modality (01 = full-AV, 02 = video-only, 03 = audio-only).
    • Vocal channel (01 = speech, 02 = song).
    • Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).
    • Emotional intensity (01 = normal, 02 = strong). NOTE: There is no strong intensity for the 'neutral' emotion.
    • Statement (01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door").
    • Repetition (01 = 1st repetition, 02 = 2nd repetition).
    • Actor (01 to 24. Odd numbered actors are male, even numbered actors are female).


    Filename example: 02-01-06-01-02-01-12.mp4

    1. Video-only (02)
    2. Speech (01)
    3. Fearful (06)
    4. Normal intensity (01)
    5. Statement "dogs" (02)
    6. 1st Repetition (01)
    7. 12th Actor (12)
    8. Female, as the actor ID number is even.
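
    Because the decoding is purely mechanical, the convention above can be turned into a small parser. Below is a minimal Python sketch (not part of the official RAVDESS distribution); the field names and code tables are copied from the naming convention listed above.

```python
# Minimal sketch for decoding a RAVDESS filename into its seven labelled fields.
# Code tables are taken from the naming convention described above.
from pathlib import Path

MODALITY = {"01": "full-AV", "02": "video-only", "03": "audio-only"}
VOCAL_CHANNEL = {"01": "speech", "02": "song"}
EMOTION = {"01": "neutral", "02": "calm", "03": "happy", "04": "sad",
           "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised"}
INTENSITY = {"01": "normal", "02": "strong"}
STATEMENT = {"01": "Kids are talking by the door",
             "02": "Dogs are sitting by the door"}

def parse_ravdess_filename(filename: str) -> dict:
    """Split a filename such as '02-01-06-01-02-01-12.mp4' into its labels."""
    parts = Path(filename).stem.split("-")
    if len(parts) != 7:
        raise ValueError(f"Expected a 7-part identifier, got {filename!r}")
    modality, channel, emotion, intensity, statement, repetition, actor = parts
    return {
        "modality": MODALITY[modality],
        "vocal_channel": VOCAL_CHANNEL[channel],
        "emotion": EMOTION[emotion],
        "intensity": INTENSITY[intensity],
        "statement": STATEMENT[statement],
        "repetition": int(repetition),
        "actor": int(actor),
        # Odd-numbered actors are male, even-numbered actors are female.
        "actor_gender": "female" if int(actor) % 2 == 0 else "male",
    }

print(parse_ravdess_filename("02-01-06-01-02-01-12.mp4"))
# video-only, speech, fearful, normal intensity, "Dogs ...", repetition 1, actor 12 (female)
```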

    License information

    The RAVDESS is released under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, CC BY-NC-SA 4.0

    Commercial licenses for the RAVDESS can also be purchased. For more information, please visit our license fee page, or contact us at ravdess@gmail.com.

    Related Data sets

  2. BEASC: Bangla emotional audio-speech corpus - A speech emotion recognition corpus for the low-resource Bangla language

    • data.mendeley.com
    Updated Feb 9, 2022
    + more versions
    Cite
    Rakesh Kumar Das (2022). BEASC: Bangla emotional audio-speech corpus - A speech emotion recognition corpus for the low-resource Bangla language [Dataset]. http://doi.org/10.17632/t9h6p943xy.2
    Explore at:
    Dataset updated
    Feb 9, 2022
    Authors
    Rakesh Kumar Das
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    BEASC is an audio-speech emotion recognition corpus for the Bangla language. The dataset consists of voice data from 34 speakers from diverse age groups between 19 and 57 (mean = 28.75, standard deviation = 9.346), equally balanced with 17 males and 17 females. It contains 1224 speech-audio recordings of four emotional states. Four emotional states were recorded for three sentences. The three sentences are i. ‘১২ টা বেজে গেছে’, ii. ‘আমি জানতাম এমন কিছু হবে’, and iii. ‘এ কেমন উপহার’. The emotional states cover four basic human emotions: Angry, Happy, Sad, and Surprise. Three trials were preserved for each emotional expression. Hence, the total number of utterances is three sentences × three repetitions × four emotions × 34 speakers = 1224 recordings. The audio files are in WAV format. Happy and sad emotional speech is considered to have normal intensity, while the angry and surprise emotional states have strong intensity.

    The data files are divided into 34 individual folders. Each folder contains 36 audio recordings from one participating actor. BEASC is a balanced dataset with 306 recordings of each individual emotion. The size of the BEASC dataset is 619 MB. While most existing datasets for other languages are recorded inside a closed studio or cover a single sentence, this dataset was collected by recording through smartphones, hence preserving a slightly noisy real-life environment. BEASC is compatible with various shallow machine learning and deep learning architectures such as CNN, LSTM, HMM, Transformer, etc.

    Each data file has a unique filename. We followed the same naming procedure as the well-known RAVDESS dataset. The filename consists of seven two-digit numerical identifiers, separated by hyphens (e.g., 03-01-01-01-02-02-02.wav). Each two-digit identifier defines the level of a different experimental factor, ordered: Modality - Statement type - Emotion - Emotion intensity - Statement - Repetition - Actor.wav. For example, the filename "03-01-01-01-02-02-02.wav" refers to: Audio only (03) - Scripted (01) - Happy (01) - Normal intensity (01) - 2nd Statement (02) - 2nd Repetition (02) - 2nd Actor, Female (02).
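
    For reference, the seven identifiers can be split out of a filename with a few lines of code. The sketch below follows the field order stated above; because the full numeric-code-to-label tables are not listed here, the codes are returned unmapped.

```python
# Minimal sketch: split a BEASC filename into its seven identifiers.
# Field order follows the description above: Modality - Statement type -
# Emotion - Emotion intensity - Statement - Repetition - Actor.
from pathlib import Path

FIELDS = ("modality", "statement_type", "emotion", "intensity",
          "statement", "repetition", "actor")

def split_beasc_filename(filename: str) -> dict:
    parts = Path(filename).stem.split("-")
    if len(parts) != len(FIELDS):
        raise ValueError(f"Expected 7 identifiers, got {len(parts)}: {filename!r}")
    return dict(zip(FIELDS, parts))

print(split_beasc_filename("03-01-01-01-02-02-02.wav"))
# {'modality': '03', 'statement_type': '01', 'emotion': '01', 'intensity': '01',
#  'statement': '02', 'repetition': '02', 'actor': '02'}
```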

  3. A Kannada Emotional Speech Dataset

    • zenodo.org
    • data.niaid.nih.gov
    wav
    Updated Mar 15, 2022
    Cite
    Vishakha Agrawal (2022). A Kannada Emotional Speech Dataset [Dataset]. http://doi.org/10.5281/zenodo.6345107
    Explore at:
    Available download formats: wav
    Dataset updated
    Mar 15, 2022
    Dataset provided by
    Zenodo
    Authors
    Vishakha Agrawal
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    There was no emotional speech dataset available in Kannada. This was a limiting factor for research in the Kannada-speaking world. I introduce a Kannada emotional speech dataset and give details about its design and content. This dataset contains six different sentences, pronounced by thirteen people (four male and nine female), in five basic emotions plus one neutral emotion. They are all Kannada speakers. The dataset has been contributed by volunteers and the recordings were not made in a controlled environment. The dataset contains a total of 468 audio samples, each one in a separate audio file. The file naming convention is as follows: AA-EE-SS.wav where AA is a two-character field that gives the actor number (01 to 13), EE is a two-character field that indicates the emotion number (01 to 06), and SS is a two-character field that gives the sentence number (01 to 06). This dataset is freely available under a Creative Commons license.

    Gender and age of each of the 13 people who contributed to the dataset

    • 01, F, 45
    • 02, F, 20
    • 03, F, 21
    • 04, M, 47
    • 05, F, 48
    • 06, M, 20
    • 07, F, 20
    • 08, F, 45
    • 09, F, 21
    • 10, F, 12
    • 11, F, 12
    • 12, M, 17
    • 13, M, 26

    Identification characters for the emotions in the dataset

    • 01, Anger
    • 02, Sadness
    • 03, Surprise
    • 04, Happiness
    • 05, Fear
    • 06, Neutral

    Identification characters for the sentences in the dataset

    • 01, ರೋಗಿಗಳಿಗೆ ಚಿಕಿತ್ಸೆ ನೀಡಿ ಉಪಚರಿಸುವುದು
    • 02, ಈ ಕಾದಂಬರಿಯು ಎರಡು ಪಾತ್ರಗಳನ್ನು ಒಳಗೊಂಡಿದೆ
    • 03, ಖಾಸಗಿ ವಿಮಾನಗಳೆಂದೂ ಸಾರ್ವಜನಿಕ ವಿಮಾನಗಳೆಂದೂ ವಿಂಗಡಿಸಿದ್ದಾರೆ
    • 04, ನಿಮ್ಮನ್ನು ಬೀಟಿಯಾಗಿ ಬಹಳ ಸಂತೋಶ ಆಯಿತು
    • 05, ರಾಮನ ಎಡಬಲ ದಲ್ಲಿ ಸೀತಾ ಲಕ್ಷ್ಮಣ ರಿದ್ದಾರೆ
    • 06, ಕನ್ನಡವನ್ನು ಕಲಿಯಬೆಕು
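
    Since the AA-EE-SS.wav convention and the emotion codes are fully specified above, decoding a filename is straightforward. The following is a minimal Python sketch (not part of the dataset itself); the example filename is hypothetical but follows the stated pattern.

```python
# Minimal sketch: decode a filename of the form AA-EE-SS.wav, using the
# emotion codes listed above (01 = Anger ... 06 = Neutral).
from pathlib import Path

EMOTION = {1: "Anger", 2: "Sadness", 3: "Surprise",
           4: "Happiness", 5: "Fear", 6: "Neutral"}

def parse_kannada_filename(filename: str) -> dict:
    actor, emotion, sentence = (int(part) for part in Path(filename).stem.split("-"))
    return {"actor": actor, "emotion": EMOTION[emotion], "sentence": sentence}

# Hypothetical example filename following the AA-EE-SS.wav pattern.
print(parse_kannada_filename("01-04-02.wav"))
# {'actor': 1, 'emotion': 'Happiness', 'sentence': 2}
```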
  4. Speech Emotion Recognition (en)

    • kaggle.com
    Updated Jan 25, 2021
    Cite
    Dmytro Babko (2021). Speech Emotion Recognition (en) [Dataset]. https://www.kaggle.com/datasets/dmitrybabko/speech-emotion-recognition-en
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 25, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Dmytro Babko
    Description

    Context

    Speech is the most natural way for humans to express ourselves. It is only natural, then, to extend this communication medium to computer applications. We define speech emotion recognition (SER) systems as a collection of methodologies that process and classify speech signals to detect the embedded emotions. SER is not a new field; it has been around for over two decades and has regained attention thanks to recent advances. These novel studies make use of advances in all fields of computing and technology, making it necessary to have an update on the current methodologies and techniques that make SER possible. We have identified and discussed distinct areas of SER, provided a detailed survey of the current literature in each, and also listed the current challenges.

    Content

    Here are the 4 most popular datasets in English: Crema, Ravdess, Savee, and Tess. Each of them contains audio in .wav format with its main labels.

    Ravdess:

    Here are the filename identifiers as per the official RAVDESS website:

    • Modality (01 = full-AV, 02 = video-only, 03 = audio-only).
    • Vocal channel (01 = speech, 02 = song).
    • Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).
    • Emotional intensity (01 = normal, 02 = strong). NOTE: There is no strong intensity for the 'neutral' emotion.
    • Statement (01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door").
    • Repetition (01 = 1st repetition, 02 = 2nd repetition).
    • Actor (01 to 24. Odd numbered actors are male, even numbered actors are female).

    So, here's an example of an audio filename: 02-01-06-01-02-01-12.wav. The metadata for this audio file is:

    • Video-only (02)
    • Speech (01)
    • Fearful (06)
    • Normal intensity (01)
    • Statement "dogs" (02)
    • 1st Repetition (01)
    • 12th Actor (12) - Female (as the actor ID number is even)

    Crema:

    The third component of the filename is responsible for the emotion label (a small helper sketch follows the list):

    • SAD - sadness
    • ANG - angry
    • DIS - disgust
    • FEA - fear
    • HAP - happy
    • NEU - neutral
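
    The following is a minimal sketch that pulls the label out of a Crema-style filename; the example filename is hypothetical and only illustrates the described pattern.

```python
# Minimal sketch: read the emotion label from a Crema-style filename, where the
# third underscore-separated component carries the emotion code listed above.
CREMA_EMOTIONS = {"SAD": "sadness", "ANG": "angry", "DIS": "disgust",
                  "FEA": "fear", "HAP": "happy", "NEU": "neutral"}

def crema_emotion(filename: str) -> str:
    stem = filename.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    code = stem.split("_")[2]  # third component
    return CREMA_EMOTIONS[code]

# Hypothetical example filename following the described pattern.
print(crema_emotion("1001_DFA_ANG_XX.wav"))  # 'angry'
```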

    Tess:

    Very similar to Crema: the emotion label is contained in the filename.

    Savee:

    The audio files in this dataset are named in such a way that the prefix letters describe the emotion classes as follows (a short parsing sketch follows the list):

    • 'a' = 'anger'
    • 'd' = 'disgust'
    • 'f' = 'fear'
    • 'h' = 'happiness'
    • 'n' = 'neutral'
    • 'sa' = 'sadness'
    • 'su' = 'surprise'
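
    Because 'sa' and 'su' share their first letter with no other class but must not be confused with plain 's', the two-letter prefixes have to be matched first. Below is a minimal Python sketch; the example filenames are hypothetical, and if a speaker code precedes the emotion letters it is stripped before matching.

```python
# Minimal sketch: map a SAVEE-style emotion prefix to its emotion class.
# Two-letter prefixes ('sa', 'su') must be tested before the single letters.
SAVEE_PREFIXES = [
    ("sa", "sadness"),
    ("su", "surprise"),
    ("a", "anger"),
    ("d", "disgust"),
    ("f", "fear"),
    ("h", "happiness"),
    ("n", "neutral"),
]

def savee_emotion(filename: str) -> str:
    name = filename.rsplit("/", 1)[-1].lower()
    # If the name carries a leading speaker code (e.g. 'XX_a01.wav'), keep only
    # the part after the last underscore before matching the prefix.
    name = name.rsplit("_", 1)[-1]
    for prefix, emotion in SAVEE_PREFIXES:
        if name.startswith(prefix):
            return emotion
    raise ValueError(f"No SAVEE emotion prefix found in {filename!r}")

# Hypothetical example filenames; real files pair the prefix with an index.
print(savee_emotion("sa01.wav"))  # 'sadness'
print(savee_emotion("h03.wav"))   # 'happiness'
```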

    Acknowledgements

    It is my pleasure to acknowledge the notebook author who inspired me to make this dataset publicly available.

  5. ravdess_speech

    • huggingface.co
    Updated Dec 13, 2021
    Cite
    Enrique Hernández Calabrés (2021). ravdess_speech [Dataset]. https://huggingface.co/datasets/ehcalabres/ravdess_speech
    Explore at:
    Dataset updated
    Dec 13, 2021
    Authors
    Enrique Hernández Calabrés
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for ravdess_speech

      Dataset Summary
    

    The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression. The conditions… See the full description on the dataset page: https://huggingface.co/datasets/ehcalabres/ravdess_speech.

  6. Long-Term Spectral Pseudo-Entropy (LTSPE) Feature

    • explore.openaire.eu
    Updated Jan 1, 2017
    Cite
    Mohammad Rasoul Kahrizi (2017). Long-Term Spectral Pseudo-Entropy (LTSPE) Feature [Dataset]. http://doi.org/10.21227/p943-qs37
    Explore at:
    Dataset updated
    Jan 1, 2017
    Authors
    Mohammad Rasoul Kahrizi
    Description

    Abstract: Speech detection systems are a type of audio classifier used to recognize, detect, or mark the parts of an audio signal that contain human speech. Here, a novel robust feature named Long-Term Spectral Pseudo-Entropy (LTSPE) is proposed for speech detection; its purpose is to improve performance in combination with other features, increase accuracy, and maintain acceptable performance. Experimental results show that combining LTSPE with other features improves detector performance. Moreover, this feature has higher accuracy compared to similar ones.

  7. Cantonese Audio-Visual Emotional Speech (CAVES)

    • kaggle.com
    Updated Aug 29, 2024
    Cite
    Nguyen Thanh Lim (2024). Cantonese Audio-Visual Emotional Speech (CAVES) [Dataset]. https://www.kaggle.com/datasets/nguyenthanhlim/cantonese-audio-visual-emotional-speech-caves/data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 29, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Nguyen Thanh Lim
    Description

    Cantonese Audio-Visual Emotional Speech (CAVES): Chinese (Neutral, happy, angry, sad, disgust, fear, surprise)

  8. Cantonese Audio-Visual Emotional Speech (CAVES) dataset

    • researchdata.edu.au
    Updated Apr 30, 2024
    Cite
    Jeesun Kim; Christopher Davis; Chee Seng Chong (2024). Cantonese Audio-Visual Emotional Speech (CAVES) dataset [Dataset]. http://doi.org/10.26183/3SE5-S316
    Explore at:
    Dataset updated
    Apr 30, 2024
    Dataset provided by
    Western Sydney University
    Authors
    Jeesun Kim; Christopher Davis; Chee Seng Chong
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This database consists of audio visual recordings of Cantonese spoken expressions of emotions produced by 10 native speakers of Cantonese.

    5 speakers are female and their folders are labeled from fm1 to fm5; 5 speakers are male and their folders are labeled from m1 to m5.

    Each folder consists of 21 zip files (7 emotions x 3 presentation modes: audio only (AO), visual only (VO), and audio visual (AV)). Each zip file contains a file for each of the 50 Cantonese sentences produced in one emotion type (angry, disgust, fear, happy, neutral, sad, surprise) and in one modality (AO, VO, AV). Note: the AV files are in MTS format (https://docs.fileformat.com/video/avchd/).

    FM5 is an exception to the above; only 25 Cantonese sentences were recorded for Sad.

    To get an idea of the material, we provide 6 files in AV format as a sample. The sample consists of sentence 1 spoken in the 6 emotions by Speaker FM1.

    The data from the perception study (validation experiment) are in the file CAVES_data_final.csv

  9. BanglaSER: A Bangla speech emotion recognition dataset

    • data.mendeley.com
    Updated Mar 14, 2022
    Cite
    Rakesh Kumar Das (2022). BanglaSER: A Bangla speech emotion recognition dataset [Dataset]. http://doi.org/10.17632/t9h6p943xy.5
    Explore at:
    Dataset updated
    Mar 14, 2022
    Authors
    Rakesh Kumar Das
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    BanglaSER is a Bangla language-based speech emotion recognition dataset. It consists of speech-audio data from 34 participating speakers from diverse age groups between 19 and 47 years, balanced between 17 male and 17 female nonprofessional participating actors. The dataset contains 1467 Bangla speech-audio recordings of five rudimentary human emotional states, namely angry, happy, neutral, sad, and surprise. Three trials were conducted for each emotional state. Hence, the total number of recordings is 3 statements × 3 repetitions × 4 emotional states (angry, happy, sad, and surprise) × 34 participating speakers = 1224 recordings, plus 3 statements × 3 repetitions × 1 emotional state (neutral) × 27 participating speakers = 243 recordings, for a total of 1467 recordings. BanglaSER was collected by recording through smartphones and laptops; it has a balanced number of recordings in each category with evenly distributed participating male and female actors, preserves a real-life recording environment, and can serve as an essential training dataset for speech emotion recognition models in terms of generalization. BanglaSER is compatible with various deep learning architectures such as CNN, LSTM, BiLSTM, etc.

  10. Audio emotions

    • kaggle.com
    zip
    Updated Jun 9, 2020
    Cite
    Uldis Valainis (2020). Audio emotions [Dataset]. https://www.kaggle.com/uldisvalainis/audio-emotions
    Explore at:
    Available download formats: zip (1203490156 bytes)
    Dataset updated
    Jun 9, 2020
    Authors
    Uldis Valainis
    Description

    Content

    The data set contains files from RAVDESS [1], CREMA-D [2], SAVEE [3], and TESS [4]. Recordings are in .wav format and are sorted into folders by emotion:

    • Angry - 2167 records (16.7%)
    • Happy - 2167 records (16.46%)
    • Sad - 2167 records (16.35%)
    • Neutral - 1795 records (14.26%)
    • Fearful - 2047 records (16.46%)
    • Disgusted - 1863 records (15.03%)
    • Surprised - 592 records (4.74%)

    By source, the data sets make up:

    • CREMA-D - 7,442 (58.15%)
    • TESS - 2,800 (21.88%)
    • RAVDESS - 2,076 (16.22%)
    • SAVEE - 480 (3.75%)

    Acknowledgements

    [1] Livingstone SR, Russo FA (2018). The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 13(5): e0196391.
    [2] Cao H, Cooper DG, Keutmann MK, Gur RC, Nenkova A, Verma R (2014). CREMA-D: Crowd-Sourced Emotional Multimodal Actors Dataset. IEEE Transactions on Affective Computing 5(4): 377-390.
    [3] Jackson P, ul Haq S (2011). Surrey Audio-Visual Expressed Emotion (SAVEE) database.
    [4] Dupuis K, Pichora-Fuller MK (2010). Toronto emotional speech set (TESS). Toronto: University of Toronto, Psychology Department.

    I don't own any of these datasets; I just put them together.

  11. EmoMatchSpanishDB

    • figshare.com
    zip
    Updated Jun 8, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Esteban García-Cuesta; Antonio Barba Salvador (2023). EmoMatchSpanishDB [Dataset]. http://doi.org/10.6084/m9.figshare.14215850.v3
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    figshare
    Authors
    Esteban García-Cuesta; Antonio Barba Salvador
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    These folders contain the dataset features used and described in the research paper entitled:
    García-Cuesta, E., Barba, A., Gachet, D. "EmoMatchSpanishDB: Study of Speech Emotion Recognition Machine Learning Models in a New Spanish Elicited Database" , Multimedia Tools and Applications, Ed. Springer, 2023

    In this paper we address the task of real-time emotion recognition for elicited emotions. For this purpose we have created a publicly accessible dataset composed of fifty subjects expressing the emotions of anger, disgust, fear, happiness, sadness, and surprise in the Spanish language. In addition, a neutral tone of each subject has been added. This article describes how this database has been created, including the recording and the crowdsourcing perception test performed in order to statistically validate the emotion of each sample and remove noisy data samples. Moreover, we present a baseline comparative study between different machine learning techniques in terms of accuracy, specificity, precision, and recall. Prosodic and spectral features are extracted and used for this classification purpose. We expect that this database will be useful for gaining new insights within this area of study.

    The first dataset is "EmoSpanishDB", which contains a set of 13 and 140 spectral and prosodic features for a total of 3550 audios of 50 individuals reproducing the 12 sentences for the six different emotions 'anger, disgust, fear, happiness, sadness, surprise' (Ekman's basic emotions) plus neutral.

    The second dataset is "EmoMatchSpanishDB", which contains a set of 13 and 140 spectral and prosodic features for a total of 2050 audios of 50 individuals reproducing the 12 sentences for the six different emotions 'anger, disgust, fear, happiness, sadness, surprise' (Ekman's basic emotions) plus neutral. These 2050 audios' features are a subset of EmoSpanishDB, resulting from the matched audios after a crowdsourcing process was applied to validate that the elicited emotion corresponds with the one expressed.

    The third dataset is "EmoMatchSpanishDB-Compare-features.zip", which contains the COMPARE features for the dependent-speaker and LOSO experiments. These datasets were used in the paper "EmoMatchSpanishDB: Study of Machine Learning Models in a New Spanish Elicited Dataset"; their creation, their contents, and a set of baseline machine learning experiments and results are fully described within it.

    The features are available under the MIT license. If you want access to the original raw audio files for creating your own features and for research purposes, you can obtain them under CC-BY-NC by completing and signing the agreement file (EMOMATCHAgreement.docx) and sending it via email to esteban.garcia@upm.es.

  12. Mexican Emotional Speech Database (MESD)

    • data.mendeley.com
    Updated Dec 8, 2021
    + more versions
    Cite
    Mathilde Marie Duville (2021). Mexican Emotional Speech Database (MESD) [Dataset]. http://doi.org/10.17632/cy34mh68j9.3
    Explore at:
    Dataset updated
    Dec 8, 2021
    Authors
    Mathilde Marie Duville
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Mexican Emotional Speech Database (MESD) provides single-word utterances for anger, disgust, fear, happiness, neutral, and sadness affective prosodies with Mexican cultural shaping. The MESD was uttered by both adult and child non-professional actors: 3 female, 2 male, and 6 child voices are available (female mean age ± SD = 23.33 ± 1.53, male mean age ± SD = 24 ± 1.41, children mean age ± SD = 9.83 ± 1.17). Words for emotional and neutral utterances come from two corpora: corpus A, composed of nouns and adjectives that are repeated across emotional prosodies and types of voice (female, male, child); and corpus B, which consists of words controlled for age of acquisition, frequency of use, familiarity, concreteness, valence, arousal, and discrete emotion dimensionality ratings. In particular, words from corpus B are nouns and adjectives whose subjective age of acquisition is under nine years old. Neutral-uttered words have valence and arousal ratings strictly greater than 4 but lower than 6 (on a 9-point scale). Emotionally uttered words have valence and arousal ratings ranging from 1 to 4, or from 6 to 9. Furthermore, a rating greater than 2.5 (on a 5-point scale) on a discrete emotion dimension supported uttering the word with the corresponding anger, disgust, fear, happiness, or sadness prosody. Finally, words from corpus B were selected so that emotional prosodies do not differ with regard to frequency of use, familiarity, and concreteness.

    The audio recordings took place in a professional studio with the following materials: (1) a Sennheiser e835 microphone with a flat frequency response (100 Hz to 10 kHz), (2) a Focusrite Scarlett 2i4 audio interface connected to the microphone with an XLR cable and to the computer, and (3) the digital audio workstation REAPER (Rapid Environment for Audio Production, Engineering, and Recording). Audio files were stored as 24-bit recordings with a sample rate of 48000 Hz.

    Utterances are shared as 864 audio files in WAV format that are named according to the following pattern:

    The MESD seems to be the first set of single-word emotional utterances that includes both adult and child voices for the Mexican population.

    Citation: M. M. Duville, L. M. Alonso-Valerdi, and D. Ibarra-Zarate, "The Mexican Emotional Speech Database (MESD): elaboration and assessment based on machine learning," 43rd Annual International Conference of the IEEE Engineering in Medicine and Biology Society, p. 4, 2021.

    Duville, M.M.; Alonso-Valerdi, L.M.; Ibarra-Zarate, D.I. Mexican Emotional Speech Database Based on Semantic, Frequency, Familiarity, Concreteness, and Cultural Shaping of Affective Prosody. Data 2021, 6, 130. https://doi.org/10.3390/data6120130

  13. KAI-indian-emotional-speech-corpus

    • huggingface.co
    Updated Jul 8, 2025
    Cite
    KratosAI (2025). KAI-indian-emotional-speech-corpus [Dataset]. https://huggingface.co/datasets/Kratos-AI/KAI-indian-emotional-speech-corpus
    Explore at:
    Dataset updated
    Jul 8, 2025
    Dataset authored and provided by
    KratosAI
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Indian Emotional Speech Corpus

      Dataset Description
    

    This dataset comprises high-quality audio recordings of Indian speakers reading a standardized 50-word paragraph in four distinct emotional tones — happy, sad, surprised, and angry. Each recording is approximately 20–25 seconds long and includes the full paragraph with tone shifts at specific points. Text spoken by all participants:

    (happy tone) Last Monday was perfect—I got the job I’d been dreaming of! I screamed… See the full description on the dataset page: https://huggingface.co/datasets/Kratos-AI/KAI-indian-emotional-speech-corpus.

  14. RATS Speech Activity Detection

    • abacus.library.ubc.ca
    pdf, txt
    Updated Aug 25, 2023
    Cite
    Abacus Data Network (2023). RATS Speech Activity Detection [Dataset]. https://abacus.library.ubc.ca/dataset.xhtml;jsessionid=3956138b4305adee4c99f1b91d26?persistentId=hdl%3A11272.1%2FAB2%2F1UISJ7&version=&q=&fileTypeGroupFacet=%22Text%22&fileAccess=&fileTag=&fileSortField=&fileSortOrder=
    Explore at:
    Available download formats: txt (3132), pdf (31909)
    Dataset updated
    Aug 25, 2023
    Dataset provided by
    Abacus Data Network
    Time period covered
    2015
    Area covered
    United States
    Dataset funded by
    Defense Advanced Research Projects Agency (DARPA)
    Description

    Introduction

    RATS Speech Activity Detection was developed by the Linguistic Data Consortium (LDC) and comprises approximately 3,000 hours of Levantine Arabic, English, Farsi, Pashto, and Urdu conversational telephone speech with automatic and manual annotation of speech segments. The corpus was created to provide training, development, and initial test sets for the Speech Activity Detection (SAD) task in the DARPA RATS (Robust Automatic Transcription of Speech) program. The goal of the RATS program was to develop human language technology systems capable of performing speech detection, language identification, speaker identification, and keyword spotting on the severely degraded audio signals that are typical of various radio communication channels, especially those employing various types of handheld portable transceiver systems. To support that goal, LDC assembled a system for the transmission, reception, and digital capture of audio data that allowed a single source audio signal to be distributed and recorded over eight distinct transceiver configurations simultaneously. Those configurations included three frequencies -- high, very high, and ultra high -- variously combined with amplitude modulation, frequency hopping spread spectrum, narrow-band frequency modulation, single-side-band, or wide-band frequency modulation. Annotations on the clear source audio signal, e.g., time boundaries for the duration of speech activity, were projected onto the corresponding eight channels recorded from the radio receivers.

    Data

    The source audio consists of conversational telephone speech recordings collected by LDC: (1) data collected for the RATS program from Levantine Arabic, Farsi, Pashto, and Urdu speakers; and (2) material from the Fisher English (LDC2004S13, LDC2005S13) and Fisher Levantine Arabic telephone studies (LDC2007S02), as well as from CALLFRIEND Farsi (LDC2014S01).

    Annotation was performed in three steps. LDC's automatic speech activity detector was run against the audio data to produce a speech segmentation for each file. Manual first-pass annotation was then performed as a quick correction of the automatic speech activity detection output. Finally, in a manual second-pass annotation step, annotators reviewed the first-pass output and made adjustments to segments as needed.

    All audio files are presented as single-channel, 16-bit PCM, 16000 samples per second; lossless FLAC compression is used on all files; when uncompressed, the files have typical "MS-WAV" (RIFF) file headers.
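
    Since all files are distributed as 16 kHz, single-channel FLAC, a few lines suffice to load one for processing. The snippet below is a minimal sketch, not part of the LDC release; it assumes the third-party soundfile library, and the filename is hypothetical.

```python
# Minimal sketch (not from the LDC release): load one of the corpus's
# FLAC-compressed audio files into a NumPy array for further processing.
# Assumes the third-party 'soundfile' library; the path below is hypothetical.
import soundfile as sf

audio, sample_rate = sf.read("example_channel_A.flac")  # hypothetical filename
assert sample_rate == 16000          # corpus files are 16 kHz, single channel
print(audio.shape, audio.dtype)      # e.g. (num_samples,) float64
```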
