14 datasets found
  1. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Oct 19, 2024
    + more versions
    Cite
    Steven R. Livingstone; Frank A. Russo (2024). The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) [Dataset]. http://doi.org/10.5281/zenodo.1188976
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 19, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Steven R. Livingstone; Frank A. Russo
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) contains 7356 files (total size: 24.8 GB). The dataset contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions, and song contains calm, happy, sad, angry, and fearful emotions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression. All conditions are available in three modality formats: Audio-only (16-bit, 48 kHz .wav), Audio-Video (720p H.264, AAC 48 kHz, .mp4), and Video-only (no sound). Note: there are no song files for Actor_18.

    The RAVDESS was developed by Dr Steven R. Livingstone, who now leads the Affective Data Science Lab, and Dr Frank A. Russo, who leads the SMART Lab.

    Citing the RAVDESS

    The RAVDESS is released under a Creative Commons license, so please cite the RAVDESS if it is used in your work in any form. Published academic papers should use the academic paper citation for our PLoS ONE paper. Personal works, such as machine learning projects/blog posts, should provide a URL to this Zenodo page, though a reference to our PLoS ONE paper would also be appreciated.

    Academic paper citation

    Livingstone SR, Russo FA (2018) The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 13(5): e0196391. https://doi.org/10.1371/journal.pone.0196391.

    Personal use citation

    Include a link to this Zenodo page - https://zenodo.org/record/1188976

    Commercial Licenses

    Commercial licenses for the RAVDESS can be purchased. For more information, please visit our license fee page, or contact us at ravdess@gmail.com.

    Contact Information

    If you would like further information about the RAVDESS, to purchase a commercial license, or if you experience any issues downloading files, please contact us at ravdess@gmail.com.

    Example Videos

    Watch a sample of the RAVDESS speech and song videos.

    Emotion Classification Users

    If you're interested in using machine learning to classify emotional expressions with the RAVDESS, please see our new RAVDESS Facial Landmark Tracking data set [Zenodo project page].

    Construction and Validation

    Full details on the construction and perceptual validation of the RAVDESS are described in our PLoS ONE paper - https://doi.org/10.1371/journal.pone.0196391.

    The RAVDESS contains 7356 files. Each file was rated 10 times on emotional validity, intensity, and genuineness. Ratings were provided by 247 individuals who were characteristic of untrained adult research participants from North America. A further set of 72 participants provided test-retest data. High levels of emotional validity, interrater reliability, and test-retest intrarater reliability were reported. Validation data is open-access, and can be downloaded along with our paper from PLoS ONE.

    Contents

    Audio-only files

    Audio-only files of all actors (01-24) are available as two separate zip files (~200 MB each):

    • Speech file (Audio_Speech_Actors_01-24.zip, 215 MB) contains 1440 files: 60 trials per actor x 24 actors = 1440.
    • Song file (Audio_Song_Actors_01-24.zip, 198 MB) contains 1012 files: 44 trials per actor x 23 actors = 1012.

    Audio-Visual and Video-only files

    Video files are provided as separate zip downloads for each actor (01-24, ~500 MB each), and are split into separate speech and song downloads:

    • Speech files (Video_Speech_Actor_01.zip to Video_Speech_Actor_24.zip) collectively contain 2880 files: 60 trials per actor x 2 modalities (AV, VO) x 24 actors = 2880.
    • Song files (Video_Song_Actor_01.zip to Video_Song_Actor_24.zip) collectively contain 2024 files: 44 trials per actor x 2 modalities (AV, VO) x 23 actors = 2024.

    File Summary

    In total, the RAVDESS collection includes 7356 files (2880+2024+1440+1012 files).

    File naming convention

    Each of the 7356 RAVDESS files has a unique filename. The filename consists of a 7-part numerical identifier (e.g., 02-01-06-01-02-01-12.mp4). These identifiers define the stimulus characteristics:

    Filename identifiers

    • Modality (01 = full-AV, 02 = video-only, 03 = audio-only).
    • Vocal channel (01 = speech, 02 = song).
    • Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).
    • Emotional intensity (01 = normal, 02 = strong). NOTE: There is no strong intensity for the 'neutral' emotion.
    • Statement (01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door").
    • Repetition (01 = 1st repetition, 02 = 2nd repetition).
    • Actor (01 to 24. Odd numbered actors are male, even numbered actors are female).


    Filename example: 02-01-06-01-02-01-12.mp4

    1. Video-only (02)
    2. Speech (01)
    3. Fearful (06)
    4. Normal intensity (01)
    5. Statement "dogs" (02)
    6. 1st Repetition (01)
    7. 12th Actor (12)
    8. Female, as the actor ID number is even.
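
    Because the decoding is purely mechanical, the convention above can be turned into a small parser. Below is a minimal Python sketch (not part of the official RAVDESS distribution); the field names and code tables are copied from the naming convention listed above.

```python
# Minimal sketch for decoding a RAVDESS filename into its seven labelled fields.
# Code tables are taken from the naming convention described above.
from pathlib import Path

MODALITY = {"01": "full-AV", "02": "video-only", "03": "audio-only"}
VOCAL_CHANNEL = {"01": "speech", "02": "song"}
EMOTION = {"01": "neutral", "02": "calm", "03": "happy", "04": "sad",
           "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised"}
INTENSITY = {"01": "normal", "02": "strong"}
STATEMENT = {"01": "Kids are talking by the door",
             "02": "Dogs are sitting by the door"}

def parse_ravdess_filename(filename: str) -> dict:
    """Split a filename such as '02-01-06-01-02-01-12.mp4' into its labels."""
    parts = Path(filename).stem.split("-")
    if len(parts) != 7:
        raise ValueError(f"Expected a 7-part identifier, got {filename!r}")
    modality, channel, emotion, intensity, statement, repetition, actor = parts
    return {
        "modality": MODALITY[modality],
        "vocal_channel": VOCAL_CHANNEL[channel],
        "emotion": EMOTION[emotion],
        "intensity": INTENSITY[intensity],
        "statement": STATEMENT[statement],
        "repetition": int(repetition),
        "actor": int(actor),
        # Odd-numbered actors are male, even-numbered actors are female.
        "actor_gender": "female" if int(actor) % 2 == 0 else "male",
    }

print(parse_ravdess_filename("02-01-06-01-02-01-12.mp4"))
# video-only, speech, fearful, normal intensity, "Dogs ...", repetition 1, actor 12 (female)
```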

    License information

    The RAVDESS is released under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, CC BY-NC-SA 4.0

    Commercial licenses for the RAVDESS can also be purchased. For more information, please visit our license fee page, or contact us at ravdess@gmail.com.

    Related Data sets

  2. BEASC: Bangla emotional audio-speech corpus - A speech emotion recognition corpus for the low-resource Bangla language

    • data.mendeley.com
    Updated Feb 9, 2022
    + more versions
    Cite
    Rakesh Kumar Das (2022). BEASC: Bangla emotional audio-speech corpus - A speech emotion recognition corpus for the low-resource Bangla language [Dataset]. http://doi.org/10.17632/t9h6p943xy.2
    Explore at:
    Dataset updated
    Feb 9, 2022
    Authors
    Rakesh Kumar Das
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    BEASC is an audio-speech emotion recognition corpus for the Bangla language. The dataset consists of voice data from 34 speakers from diverse age groups between 19 and 57 (mean = 28.75, standard deviation = 9.346), equally balanced with 17 males and 17 females. It contains 1224 speech-audio recordings of four emotional states. Four emotional states were recorded for three sentences. The three sentences are i. ‘১২ টা বেজে গেছে’, ii. ‘আমি জানতাম এমন কিছু হবে’, and iii. ‘এ কেমন উপহার’. The emotional states cover four basic human emotions: Angry, Happy, Sad, and Surprise. Three trials were preserved for each emotional expression. Hence, the total number of utterances is three sentences × three repetitions × four emotions × 34 speakers = 1224 recordings. The audio files are in WAV format. Happy and sad emotional speech is considered to have normal intensity, while the angry and surprise emotional states have strong intensity.

    The data files are divided into 34 individual folders. Each folder contains 36 audio recordings from one participating actor. BEASC is a balanced dataset with 306 recordings of each individual emotion. The size of the BEASC dataset is 619 MB. While most existing datasets for other languages are recorded inside a closed studio or cover a single sentence, this dataset was collected by recording through smartphones, hence preserving a slightly noisy real-life environment. BEASC is compatible with various shallow machine learning and deep learning architectures such as CNN, LSTM, HMM, Transformer, etc.

    Each data file has a unique filename. We followed the same naming procedure as the well-known RAVDESS dataset. The filename consists of seven two-digit numerical identifiers, separated by hyphens (e.g., 03-01-01-01-02-02-02.wav). Each two-digit identifier defines the level of a different experimental factor, ordered: Modality - Statement type - Emotion - Emotion intensity - Statement - Repetition - Actor.wav. For example, the filename "03-01-01-01-02-02-02.wav" refers to: Audio only (03) - Scripted (01) - Happy (01) - Normal intensity (01) - 2nd Statement (02) - 2nd Repetition (02) - 2nd Actor, Female (02).
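
    For reference, the seven identifiers can be split out of a filename with a few lines of code. The sketch below follows the field order stated above; because the full numeric-code-to-label tables are not listed here, the codes are returned unmapped.

```python
# Minimal sketch: split a BEASC filename into its seven identifiers.
# Field order follows the description above: Modality - Statement type -
# Emotion - Emotion intensity - Statement - Repetition - Actor.
from pathlib import Path

FIELDS = ("modality", "statement_type", "emotion", "intensity",
          "statement", "repetition", "actor")

def split_beasc_filename(filename: str) -> dict:
    parts = Path(filename).stem.split("-")
    if len(parts) != len(FIELDS):
        raise ValueError(f"Expected 7 identifiers, got {len(parts)}: {filename!r}")
    return dict(zip(FIELDS, parts))

print(split_beasc_filename("03-01-01-01-02-02-02.wav"))
# {'modality': '03', 'statement_type': '01', 'emotion': '01', 'intensity': '01',
#  'statement': '02', 'repetition': '02', 'actor': '02'}
```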

  3. A Kannada Emotional Speech Dataset

    • zenodo.org
    • data.niaid.nih.gov
    wav
    Updated Mar 15, 2022
    Cite
    Vishakha Agrawal (2022). A Kannada Emotional Speech Dataset [Dataset]. http://doi.org/10.5281/zenodo.6345107
    Explore at:
    Available download formats: wav
    Dataset updated
    Mar 15, 2022
    Dataset provided by
    Zenodo
    Authors
    Vishakha Agrawal
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    There was no emotional speech dataset available in Kannada. This was a limiting factor for research in the Kannada-speaking world. I introduce a Kannada emotional speech dataset and give details about its design and content. This dataset contains six different sentences, pronounced by thirteen people (four male and nine female), in five basic emotions plus one neutral emotion. They are all Kannada speakers. The dataset has been contributed by volunteers and the recordings were not made in a controlled environment. The dataset contains a total of 468 audio samples, each one in a separate audio file. The file naming convention is as follows: AA-EE-SS.wav where AA is a two-character field that gives the actor number (01 to 13), EE is a two-character field that indicates the emotion number (01 to 06), and SS is a two-character field that gives the sentence number (01 to 06). This dataset is freely available under a Creative Commons license.

    Gender and age of each of the 13 people who contributed to the dataset

    • 01, F, 45
    • 02, F, 20
    • 03, F, 21
    • 04, M, 47
    • 05, F, 48
    • 06, M, 20
    • 07, F, 20
    • 08, F, 45
    • 09, F, 21
    • 10, F, 12
    • 11, F, 12
    • 12, M, 17
    • 13, M, 26

    Identification characters for the emotions in the dataset

    • 01, Anger
    • 02, Sadness
    • 03, Surprise
    • 04, Happiness
    • 05, Fear
    • 06, Neutral

    Identification characters for the sentences in the dataset

    • 01, ರೋಗಿಗಳಿಗೆ ಚಿಕಿತ್ಸೆ ನೀಡಿ ಉಪಚರಿಸುವುದು
    • 02, ಈ ಕಾದಂಬರಿಯು ಎರಡು ಪಾತ್ರಗಳನ್ನು ಒಳಗೊಂಡಿದೆ
    • 03, ಖಾಸಗಿ ವಿಮಾನಗಳೆಂದೂ ಸಾರ್ವಜನಿಕ ವಿಮಾನಗಳೆಂದೂ ವಿಂಗಡಿಸಿದ್ದಾರೆ
    • 04, ನಿಮ್ಮನ್ನು ಬೀಟಿಯಾಗಿ ಬಹಳ ಸಂತೋಶ ಆಯಿತು
    • 05, ರಾಮನ ಎಡಬಲ ದಲ್ಲಿ ಸೀತಾ ಲಕ್ಷ್ಮಣ ರಿದ್ದಾರೆ
    • 06, ಕನ್ನಡವನ್ನು ಕಲಿಯಬೆಕು
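
    Since the AA-EE-SS.wav convention and the emotion codes are fully specified above, decoding a filename is straightforward. The following is a minimal Python sketch (not part of the dataset itself); the example filename is hypothetical but follows the stated pattern.

```python
# Minimal sketch: decode a filename of the form AA-EE-SS.wav, using the
# emotion codes listed above (01 = Anger ... 06 = Neutral).
from pathlib import Path

EMOTION = {1: "Anger", 2: "Sadness", 3: "Surprise",
           4: "Happiness", 5: "Fear", 6: "Neutral"}

def parse_kannada_filename(filename: str) -> dict:
    actor, emotion, sentence = (int(part) for part in Path(filename).stem.split("-"))
    return {"actor": actor, "emotion": EMOTION[emotion], "sentence": sentence}

# Hypothetical example filename following the AA-EE-SS.wav pattern.
print(parse_kannada_filename("01-04-02.wav"))
# {'actor': 1, 'emotion': 'Happiness', 'sentence': 2}
```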
  4. Speech Emotion Recognition (en)

    • kaggle.com
    Updated Jan 25, 2021
    Cite
    Dmytro Babko (2021). Speech Emotion Recognition (en) [Dataset]. https://www.kaggle.com/datasets/dmitrybabko/speech-emotion-recognition-en
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 25, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Dmytro Babko
    Description

    Context

    Speech is the most natural way for humans to express ourselves. It is only natural, then, to extend this communication medium to computer applications. We define speech emotion recognition (SER) systems as a collection of methodologies that process and classify speech signals to detect the embedded emotions. SER is not a new field; it has been around for over two decades and has regained attention thanks to recent advances. These novel studies make use of advances in all fields of computing and technology, making it necessary to have an update on the current methodologies and techniques that make SER possible. We have identified and discussed distinct areas of SER, provided a detailed survey of the current literature in each, and also listed the current challenges.

    Content

    Here are the 4 most popular datasets in English: Crema, Ravdess, Savee, and Tess. Each of them contains audio in .wav format with its main labels.

    Ravdess:

    Here are the filename identifiers as per the official RAVDESS website:

    • Modality (01 = full-AV, 02 = video-only, 03 = audio-only).
    • Vocal channel (01 = speech, 02 = song).
    • Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).
    • Emotional intensity (01 = normal, 02 = strong). NOTE: There is no strong intensity for the 'neutral' emotion.
    • Statement (01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door").
    • Repetition (01 = 1st repetition, 02 = 2nd repetition).
    • Actor (01 to 24. Odd numbered actors are male, even numbered actors are female).

    So, here's an example of an audio filename: 02-01-06-01-02-01-12.wav. The metadata for this audio file is:

    • Video-only (02)
    • Speech (01)
    • Fearful (06)
    • Normal intensity (01)
    • Statement "dogs" (02)
    • 1st Repetition (01)
    • 12th Actor (12) - Female (as the actor ID number is even)

    Crema:

    The third component of the filename is responsible for the emotion label (a small helper sketch follows the list):

    • SAD - sadness
    • ANG - angry
    • DIS - disgust
    • FEA - fear
    • HAP - happy
    • NEU - neutral
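
    The following is a minimal sketch that pulls the label out of a Crema-style filename; the example filename is hypothetical and only illustrates the described pattern.

```python
# Minimal sketch: read the emotion label from a Crema-style filename, where the
# third underscore-separated component carries the emotion code listed above.
CREMA_EMOTIONS = {"SAD": "sadness", "ANG": "angry", "DIS": "disgust",
                  "FEA": "fear", "HAP": "happy", "NEU": "neutral"}

def crema_emotion(filename: str) -> str:
    stem = filename.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    code = stem.split("_")[2]  # third component
    return CREMA_EMOTIONS[code]

# Hypothetical example filename following the described pattern.
print(crema_emotion("1001_DFA_ANG_XX.wav"))  # 'angry'
```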

    Tess:

    Very similar to Crema: the emotion label is contained in the filename.

    Savee:

    The audio files in this dataset are named in such a way that the prefix letters describe the emotion classes as follows (a short parsing sketch follows the list):

    • 'a' = 'anger'
    • 'd' = 'disgust'
    • 'f' = 'fear'
    • 'h' = 'happiness'
    • 'n' = 'neutral'
    • 'sa' = 'sadness'
    • 'su' = 'surprise'
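
    Because 'sa' and 'su' share their first letter with no other class but must not be confused with plain 's', the two-letter prefixes have to be matched first. Below is a minimal Python sketch; the example filenames are hypothetical, and if a speaker code precedes the emotion letters it is stripped before matching.

```python
# Minimal sketch: map a SAVEE-style emotion prefix to its emotion class.
# Two-letter prefixes ('sa', 'su') must be tested before the single letters.
SAVEE_PREFIXES = [
    ("sa", "sadness"),
    ("su", "surprise"),
    ("a", "anger"),
    ("d", "disgust"),
    ("f", "fear"),
    ("h", "happiness"),
    ("n", "neutral"),
]

def savee_emotion(filename: str) -> str:
    name = filename.rsplit("/", 1)[-1].lower()
    # If the name carries a leading speaker code (e.g. 'XX_a01.wav'), keep only
    # the part after the last underscore before matching the prefix.
    name = name.rsplit("_", 1)[-1]
    for prefix, emotion in SAVEE_PREFIXES:
        if name.startswith(prefix):
            return emotion
    raise ValueError(f"No SAVEE emotion prefix found in {filename!r}")

# Hypothetical example filenames; real files pair the prefix with an index.
print(savee_emotion("sa01.wav"))  # 'sadness'
print(savee_emotion("h03.wav"))   # 'happiness'
```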

    Acknowledgements

    It is my pleasure to acknowledge the notebook author who inspired me to make this dataset publicly available.

  5. ravdess_speech

    • huggingface.co
    Updated Dec 13, 2021
    Cite
    Enrique Hernández Calabrés (2021). ravdess_speech [Dataset]. https://huggingface.co/datasets/ehcalabres/ravdess_speech
    Explore at:
    Dataset updated
    Dec 13, 2021
    Authors
    Enrique Hernández Calabrés
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for ravdess_speech

      Dataset Summary
    

    The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression. The conditions… See the full description on the dataset page: https://huggingface.co/datasets/ehcalabres/ravdess_speech.

  6. Long-Term Spectral Pseudo-Entropy (LTSPE) Feature

    • explore.openaire.eu
    Updated Jan 1, 2017
    Cite
    Mohammad Rasoul Kahrizi (2017). Long-Term Spectral Pseudo-Entropy (LTSPE) Feature [Dataset]. http://doi.org/10.21227/p943-qs37
    Explore at:
    Dataset updated
    Jan 1, 2017
    Authors
    Mohammad Rasoul Kahrizi
    Description

    Abstract: Speech detection systems are a type of audio classifier used to recognize, detect, or mark the parts of an audio signal that contain human speech. Here, a novel robust feature named Long-Term Spectral Pseudo-Entropy (LTSPE) is proposed for speech detection; its purpose is to improve performance in combination with other features, increase accuracy, and maintain acceptable performance. Experimental results show that combining LTSPE with other features improves detector performance. Moreover, this feature has higher accuracy compared to similar ones.

  7. Cantonese Audio-Visual Emotional Speech (CAVES)

    • kaggle.com
    Updated Aug 29, 2024
    Cite
    Nguyen Thanh Lim (2024). Cantonese Audio-Visual Emotional Speech (CAVES) [Dataset]. https://www.kaggle.com/datasets/nguyenthanhlim/cantonese-audio-visual-emotional-speech-caves/data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 29, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Nguyen Thanh Lim
    Description

    Cantonese Audio-Visual Emotional Speech (CAVES): Chinese (Neutral, happy, angry, sad, disgust, fear, surprise)

  8. Cantonese Audio-Visual Emotional Speech (CAVES) dataset

    • researchdata.edu.au
    Updated Apr 30, 2024
    Cite
    Jeesun Kim; Christopher Davis; Chee Seng Chong (2024). Cantonese Audio-Visual Emotional Speech (CAVES) dataset [Dataset]. http://doi.org/10.26183/3SE5-S316
    Explore at:
    Dataset updated
    Apr 30, 2024
    Dataset provided by
    Western Sydney University
    Authors
    Jeesun Kim; Christopher Davis; Chee Seng Chong
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This database consists of audio visual recordings of Cantonese spoken expressions of emotions produced by 10 native speakers of Cantonese.

    5 speakers are female and their folders are labeled from fm1 to fm5; 5 speakers are male and their folders are labeled from m1 to m5.

    Each folder consists of 21 zip files (7 emotions x 3 presentation modes: audio only (AO), visual only (VO), and audio visual (AV)). Each zip file contains a file for each of the 50 Cantonese sentences produced in one emotion type (angry, disgust, fear, happy, neutral, sad, surprise) and in one modality (AO, VO, AV). Note: the AV files are in MTS format (https://docs.fileformat.com/video/avchd/).

    FM5 is an exception to the above; only 25 Cantonese sentences were recorded for Sad.

    To get an idea of the material, we provide 6 files in AV format as a sample. The sample consists of sentence 1 spoken in the 6 emotions by Speaker FM1.

    The data from the perception study (validation experiment) are in the file CAVES_data_final.csv

  9. BanglaSER: A Bangla speech emotion recognition dataset

    • data.mendeley.com
    Updated Mar 14, 2022
    Cite
    Rakesh Kumar Das (2022). BanglaSER: A Bangla speech emotion recognition dataset [Dataset]. http://doi.org/10.17632/t9h6p943xy.5
    Explore at:
    Dataset updated
    Mar 14, 2022
    Authors
    Rakesh Kumar Das
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    BanglaSER is a Bangla language-based speech emotion recognition dataset. It consists of speech-audio data from 34 participating speakers from diverse age groups between 19 and 47 years, balanced between 17 male and 17 female nonprofessional participating actors. The dataset contains 1467 Bangla speech-audio recordings of five rudimentary human emotional states, namely angry, happy, neutral, sad, and surprise. Three trials were conducted for each emotional state. Hence, the total number of recordings is 3 statements × 3 repetitions × 4 emotional states (angry, happy, sad, and surprise) × 34 participating speakers = 1224 recordings, plus 3 statements × 3 repetitions × 1 emotional state (neutral) × 27 participating speakers = 243 recordings, for a total of 1467 recordings. BanglaSER was collected by recording through smartphones and laptops; it has a balanced number of recordings in each category with evenly distributed participating male and female actors, preserves a real-life recording environment, and can serve as an essential training dataset for speech emotion recognition models in terms of generalization. BanglaSER is compatible with various deep learning architectures such as CNN, LSTM, BiLSTM, etc.

  10. Audio emotions

    • kaggle.com
    zip
    Updated Jun 9, 2020
    Cite
    Uldis Valainis (2020). Audio emotions [Dataset]. https://www.kaggle.com/uldisvalainis/audio-emotions
    Explore at:
    Available download formats: zip (1203490156 bytes)
    Dataset updated
    Jun 9, 2020
    Authors
    Uldis Valainis
    Description

    Content

    The data set contains files from RAVDESS [1], CREMA-D [2], SAVEE [3], and TESS [4]. Recordings are in .wav format and are sorted into folders by emotion:

    • Angry - 2167 records (16.7%)
    • Happy - 2167 records (16.46%)
    • Sad - 2167 records (16.35%)
    • Neutral - 1795 records (14.26%)
    • Fearful - 2047 records (16.46%)
    • Disgusted - 1863 records (15.03%)
    • Surprised - 592 records (4.74%)

    By source, the data sets make up:

    • CREMA-D - 7,442 (58.15%)
    • TESS - 2,800 (21.88%)
    • RAVDESS - 2,076 (16.22%)
    • SAVEE - 480 (3.75%)

    Acknowledgements

    [1] Livingstone SR, Russo FA (2018). The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 13(5): e0196391.
    [2] Cao H, Cooper DG, Keutmann MK, Gur RC, Nenkova A, Verma R (2014). CREMA-D: Crowd-Sourced Emotional Multimodal Actors Dataset. IEEE Transactions on Affective Computing 5(4): 377-390.
    [3] Jackson P, ul Haq S (2011). Surrey Audio-Visual Expressed Emotion (SAVEE) database.
    [4] Dupuis K, Pichora-Fuller MK (2010). Toronto emotional speech set (TESS). Toronto: University of Toronto, Psychology Department.

    I don't own any of these datasets; I just put them together.

  11. EmoMatchSpanishDB

    • figshare.com
    zip
    Updated Jun 8, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Esteban García-Cuesta; Antonio Barba Salvador (2023). EmoMatchSpanishDB [Dataset]. http://doi.org/10.6084/m9.figshare.14215850.v3
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    figshare
    Authors
    Esteban García-Cuesta; Antonio Barba Salvador
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    These folders contain the dataset features used and described in the research paper entitled:
    García-Cuesta, E., Barba, A., Gachet, D. "EmoMatchSpanishDB: Study of Speech Emotion Recognition Machine Learning Models in a New Spanish Elicited Database" , Multimedia Tools and Applications, Ed. Springer, 2023

    In this paper we address the task of real-time emotion recognition for elicited emotions. For this purpose we have created a publicly accessible dataset composed of fifty subjects expressing the emotions of anger, disgust, fear, happiness, sadness, and surprise in the Spanish language. In addition, a neutral tone of each subject has been added. This article describes how this database has been created, including the recording and the crowdsourcing perception test performed in order to statistically validate the emotion of each sample and remove noisy data samples. Moreover, we present a baseline comparative study between different machine learning techniques in terms of accuracy, specificity, precision, and recall. Prosodic and spectral features are extracted and used for this classification purpose. We expect that this database will be useful for gaining new insights within this area of study.

    The first dataset is "EmoSpanishDB", which contains a set of 13 and 140 spectral and prosodic features for a total of 3550 audios of 50 individuals reproducing the 12 sentences for the six different emotions 'anger, disgust, fear, happiness, sadness, surprise' (Ekman's basic emotions) plus neutral.

    The second dataset is "EmoMatchSpanishDB", which contains a set of 13 and 140 spectral and prosodic features for a total of 2050 audios of 50 individuals reproducing the 12 sentences for the six different emotions 'anger, disgust, fear, happiness, sadness, surprise' (Ekman's basic emotions) plus neutral. These 2050 audios' features are a subset of EmoSpanishDB, resulting from the matched audios after a crowdsourcing process was applied to validate that the elicited emotion corresponds with the one expressed.

    The third dataset is "EmoMatchSpanishDB-Compare-features.zip", which contains the COMPARE features for the dependent-speaker and LOSO experiments. These datasets were used in the paper "EmoMatchSpanishDB: Study of Machine Learning Models in a New Spanish Elicited Dataset"; their creation, their contents, and a set of baseline machine learning experiments and results are fully described within it.

    The features are available under the MIT license. If you want access to the original raw audio files for creating your own features and for research purposes, you can obtain them under CC-BY-NC by completing and signing the agreement file (EMOMATCHAgreement.docx) and sending it via email to esteban.garcia@upm.es.

  12. Mexican Emotional Speech Database (MESD)

    • data.mendeley.com
    Updated Dec 8, 2021
    + more versions
    Cite
    Mathilde Marie Duville (2021). Mexican Emotional Speech Database (MESD) [Dataset]. http://doi.org/10.17632/cy34mh68j9.3
    Explore at:
    Dataset updated
    Dec 8, 2021
    Authors
    Mathilde Marie Duville
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Mexican Emotional Speech Database (MESD) provides single-word utterances for anger, disgust, fear, happiness, neutral, and sadness affective prosodies with Mexican cultural shaping. The MESD was uttered by both adult and child non-professional actors: 3 female, 2 male, and 6 child voices are available (female mean age ± SD = 23.33 ± 1.53, male mean age ± SD = 24 ± 1.41, children mean age ± SD = 9.83 ± 1.17). Words for emotional and neutral utterances come from two corpora: corpus A, composed of nouns and adjectives that are repeated across emotional prosodies and types of voice (female, male, child); and corpus B, which consists of words controlled for age of acquisition, frequency of use, familiarity, concreteness, valence, arousal, and discrete emotion dimensionality ratings. In particular, words from corpus B are nouns and adjectives whose subjective age of acquisition is under nine years old. Neutral-uttered words have valence and arousal ratings strictly greater than 4 but lower than 6 (on a 9-point scale). Emotionally uttered words have valence and arousal ratings ranging from 1 to 4, or from 6 to 9. Furthermore, a rating greater than 2.5 (on a 5-point scale) on a discrete emotion dimension supported uttering the word with the corresponding anger, disgust, fear, happiness, or sadness prosody. Finally, words from corpus B were selected so that emotional prosodies do not differ with regard to frequency of use, familiarity, and concreteness.

    The audio recordings took place in a professional studio with the following materials: (1) a Sennheiser e835 microphone with a flat frequency response (100 Hz to 10 kHz), (2) a Focusrite Scarlett 2i4 audio interface connected to the microphone with an XLR cable and to the computer, and (3) the digital audio workstation REAPER (Rapid Environment for Audio Production, Engineering, and Recording). Audio files were stored as 24-bit recordings with a sample rate of 48000 Hz.

    Utterances are shared as 864 audio files in WAV format that are named according to the following pattern:

    The MESD seems to be the first set of single-word emotional utterances that includes both adult and child voices for the Mexican population.

    Citation: M. M. Duville, L. M. Alonso-Valerdi, and D. Ibarra-Zarate, "The Mexican Emotional Speech Database (MESD): elaboration and assessment based on machine learning," 43rd Annual International Conference of the IEEE Engineering in Medicine and Biology Society, p. 4, 2021.

    Duville, M.M.; Alonso-Valerdi, L.M.; Ibarra-Zarate, D.I. Mexican Emotional Speech Database Based on Semantic, Frequency, Familiarity, Concreteness, and Cultural Shaping of Affective Prosody. Data 2021, 6, 130. https://doi.org/10.3390/data6120130

  13. KAI-indian-emotional-speech-corpus

    • huggingface.co
    Updated Jul 8, 2025
    Cite
    KratosAI (2025). KAI-indian-emotional-speech-corpus [Dataset]. https://huggingface.co/datasets/Kratos-AI/KAI-indian-emotional-speech-corpus
    Explore at:
    Dataset updated
    Jul 8, 2025
    Dataset authored and provided by
    KratosAI
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Indian Emotional Speech Corpus

      Dataset Description
    

    This dataset comprises high-quality audio recordings of Indian speakers reading a standardized 50-word paragraph in four distinct emotional tones — happy, sad, surprised, and angry. Each recording is approximately 20–25 seconds long and includes the full paragraph with tone shifts at specific points. Text spoken by all participants:

    (happy tone) Last Monday was perfect—I got the job I’d been dreaming of! I screamed… See the full description on the dataset page: https://huggingface.co/datasets/Kratos-AI/KAI-indian-emotional-speech-corpus.

  14. RATS Speech Activity Detection

    • abacus.library.ubc.ca
    pdf, txt
    Updated Aug 25, 2023
    Cite
    Abacus Data Network (2023). RATS Speech Activity Detection [Dataset]. https://abacus.library.ubc.ca/dataset.xhtml;jsessionid=3956138b4305adee4c99f1b91d26?persistentId=hdl%3A11272.1%2FAB2%2F1UISJ7&version=&q=&fileTypeGroupFacet=%22Text%22&fileAccess=&fileTag=&fileSortField=&fileSortOrder=
    Explore at:
    Available download formats: txt (3132), pdf (31909)
    Dataset updated
    Aug 25, 2023
    Dataset provided by
    Abacus Data Network
    Time period covered
    2015
    Area covered
    United States
    Dataset funded by
    Defense Advanced Research Projects Agency (DARPA)
    Description

    Introduction

    RATS Speech Activity Detection was developed by the Linguistic Data Consortium (LDC) and comprises approximately 3,000 hours of Levantine Arabic, English, Farsi, Pashto, and Urdu conversational telephone speech with automatic and manual annotation of speech segments. The corpus was created to provide training, development, and initial test sets for the Speech Activity Detection (SAD) task in the DARPA RATS (Robust Automatic Transcription of Speech) program. The goal of the RATS program was to develop human language technology systems capable of performing speech detection, language identification, speaker identification, and keyword spotting on the severely degraded audio signals that are typical of various radio communication channels, especially those employing various types of handheld portable transceiver systems. To support that goal, LDC assembled a system for the transmission, reception, and digital capture of audio data that allowed a single source audio signal to be distributed and recorded over eight distinct transceiver configurations simultaneously. Those configurations included three frequencies -- high, very high, and ultra high -- variously combined with amplitude modulation, frequency hopping spread spectrum, narrow-band frequency modulation, single-side-band, or wide-band frequency modulation. Annotations on the clear source audio signal, e.g., time boundaries for the duration of speech activity, were projected onto the corresponding eight channels recorded from the radio receivers.

    Data

    The source audio consists of conversational telephone speech recordings collected by LDC: (1) data collected for the RATS program from Levantine Arabic, Farsi, Pashto, and Urdu speakers; and (2) material from the Fisher English (LDC2004S13, LDC2005S13) and Fisher Levantine Arabic telephone studies (LDC2007S02), as well as from CALLFRIEND Farsi (LDC2014S01).

    Annotation was performed in three steps. LDC's automatic speech activity detector was run against the audio data to produce a speech segmentation for each file. Manual first-pass annotation was then performed as a quick correction of the automatic speech activity detection output. Finally, in a manual second-pass annotation step, annotators reviewed the first-pass output and made adjustments to segments as needed.

    All audio files are presented as single-channel, 16-bit PCM, 16000 samples per second; lossless FLAC compression is used on all files; when uncompressed, the files have typical "MS-WAV" (RIFF) file headers.
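
    Since all files are distributed as 16 kHz, single-channel FLAC, a few lines suffice to load one for processing. The snippet below is a minimal sketch, not part of the LDC release; it assumes the third-party soundfile library, and the filename is hypothetical.

```python
# Minimal sketch (not from the LDC release): load one of the corpus's
# FLAC-compressed audio files into a NumPy array for further processing.
# Assumes the third-party 'soundfile' library; the path below is hypothetical.
import soundfile as sf

audio, sample_rate = sf.read("example_channel_A.flac")  # hypothetical filename
assert sample_rate == 16000          # corpus files are 16 kHz, single channel
print(audio.shape, audio.dtype)      # e.g. (num_samples,) float64
```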
