100+ datasets found
  1. VOICES Dataset

    • paperswithcode.com
    Updated Apr 17, 2018
    Cite
    (2018). VOICES Dataset [Dataset]. https://paperswithcode.com/dataset/voices
    Explore at:
    Dataset updated
    Apr 17, 2018
    Description

    The VOICES (Voices Obscured in Complex Environmental Settings) corpus is a dataset to promote speech and signal processing research on speech recorded by far-field microphones in noisy room conditions.

    For this corpus, audio was recorded in furnished rooms with background noise played in conjunction with foreground speech selected from the LibriSpeech corpus. Multiple sessions were recorded in each room to cover all foreground speech and background noise combinations. Audio was recorded using twelve microphones placed throughout the room, resulting in 120 hours of audio per microphone.

  2. VOICED Database

    • physionet.org
    Updated Jun 7, 2018
    Cite
    Laura Verde; Giovanna Sannino (2018). VOICED Database [Dataset]. http://doi.org/10.13026/C25Q2N
    Explore at:
    Dataset updated
    Jun 7, 2018
    Authors
    Laura Verde; Giovanna Sannino
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    This database includes 208 voice samples: 150 from pathological voices and 58 from healthy voices.

  3. NOAA Voices Data Map

    • noaa.hub.arcgis.com
    Updated May 17, 2023
    Cite
    NOAA GeoPlatform (2023). NOAA Voices Data Map [Dataset]. https://noaa.hub.arcgis.com/maps/09d293a8ed9745bbba97d03d06dd5d0f
    Explore at:
    Dataset updated
    May 17, 2023
    Dataset provided by
    National Oceanic and Atmospheric Administration (http://www.noaa.gov/)
    Authors
    NOAA GeoPlatform
    Area covered
    Description

    This web map was developed to show the geographic distribution of the oral history interviews contained within the archive of the NOAA Voices program. This map is used in the NOAA Voices Oral History Interview Mapping Application, found here: https://noaa.maps.arcgis.com/home/item.html?id=a220357bec444ab0be7e586fb5ecd26e

    Each interview is treated as a separate data point with a variety of attributes, including: narrator, interviewer, date of interview, city, state, project, link to interview, and interview description. Each point in this dataset is plotted at the city level, and the size of each point is directly tied to the number of interviews at that location.

    The data and metadata for this application can be found on the NOAA Voices website: https://voices.nmfs.noaa.gov/. Each interview has its own landing page on the NOAA Voices site, and the information on these landing pages mirrors the data in this application.

  4. EmoV-DB Sorted

    • kaggle.com
    Updated Dec 12, 2021
    Cite
    Phantasm34 (2021). EmoV-DB Sorted [Dataset]. https://www.kaggle.com/phantasm34/emovdb-sorted/code
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Dec 12, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Phantasm34
    Description

    EmoV-DB

    See also

    https://github.com/noetits/ICE-Talk for controllable TTS

    How to use

    Download link

    Sorted version (recommended), new link: https://openslr.org/115/

    Old link (slow download), but it gives you the folder structure needed to use the "load_emov_db()" function: https://mega.nz/#F!KBp32apT!gLIgyWf9iQ-yqnWFUFuUHg

    Not sorted version: http://www.coe.neu.edu/Research/AClab/Speech%20Data/

    Forced alignments with gentle

    "It is the process of taking the text transcription of an audio speech segment and determining where in time particular words occur in the speech segment." source

    It also allows to separate verbal and non-verbal vocalizations (laughs, yawns, etc.)

    1. Go to https://github.com/lowerquality/gentle
    2. Clone the repo.
    3. In "Getting started", use the third option: ./install.sh
    4. Copy align_db.py into the repository.
    5. In align_db.py, change the "path" variable so that it corresponds to the path of EmoV-DB.
    6. Launch the command "python align_db.py". You will probably have to install some packages to make it work.
    7. It should create a folder called "alignments" in the repo, with the same structure as the database, containing a JSON file for each sentence of the database.

    8. The function "get_start_end_from_json(path)" extracts the start and end times of the computed forced alignment.

    9. You can play a file with the function "play(path)".

    10. You can play the part of the file in which there is speech, according to the forced alignment, with "play_start_end(path, start, end)" (see the sketch after these steps).
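    A minimal usage sketch of steps 8-10, assuming align_db.py from the EmoV-DB repository is importable and the "alignments" folder has already been generated with gentle; the concrete file and folder paths are illustrative placeholders, not part of the dataset.

```python
# Minimal sketch of steps 8-10 above. Assumes align_db.py (from the EmoV-DB repo)
# is on the Python path and gentle has produced the "alignments" folder (step 7).
# The paths below are illustrative placeholders.
from align_db import get_start_end_from_json, play, play_start_end

wav_path = "EmoV-DB/bea/anger_1-28_0011.wav"        # audio file, named per the scheme described below
json_path = "alignments/bea/anger_1-28_0011.json"   # forced alignment produced in step 7

start, end = get_start_end_from_json(json_path)     # step 8: speech start/end from the alignment
play(wav_path)                                      # step 9: play the whole file
play_start_end(wav_path, start, end)                # step 10: play only the speech segment
```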

    Overview of data

    The Emotional Voices Database: Towards Controlling the Emotional Expressiveness in Voice Generation Systems

    • This dataset is built for the purpose of emotional speech synthesis. The transcripts are based on the CMU Arctic database: http://www.festvox.org/cmu_arctic/cmuarctic.data.

    • It includes recordings of four speakers: two male and two female.

    • The emotional styles are neutral, sleepiness, anger, disgust and amused.

    • Each audio file is recorded in 16-bit .wav format.

    • Spk-Je (Female, English: Neutral(417 files), Amused(222 files), Angry(523 files), Sleepy(466 files), Disgust(189 files))

    • Spk-Bea (Female, English: Neutral(373 files), Amused(309 files), Angry(317 files), Sleepy(520 files), Disgust(347 files))

    • Spk-Sa (Male, English: Neutral(493 files), Amused(501 files), Angry(468 files), Sleepy(495 files), Disgust(497 files))

    • Spk-Jsh (Male, English: Neutral(302 files), Amused(298 files), Sleepy(263 files))

    • File naming (audio_folder): anger_1-28_0011.wav. The first word is the emotional style, 1-28 is the annotation document file range, and the last four digits are the sentence number.

    • File naming (annotation_folder): anger_1-28.TextGrid. The first word is the emotional style and 1-28 is the annotation document range.

    References

    A description of the database here: https://arxiv.org/pdf/1806.09514.pdf

    Please reference this paper when using this database:

    Bibtex:

    @article{adigwe2018emotional,
      title={The emotional voices database: Towards controlling the emotion dimension in voice generation systems},
      author={Adigwe, Adaeze and Tits, No{\'e} and Haddad, Kevin El and Ostadabbas, Sarah and Dutoit, Thierry},
      journal={arXiv preprint arXiv:1806.09514},
      year={2018}
    }

  5. NOAA Voices Oral History Archives

    • catalog.data.gov
    • fisheries.noaa.gov
    Updated Oct 19, 2024
    + more versions
    Cite
    (Custodian) (2024). NOAA Voices Oral History Archives [Dataset]. https://catalog.data.gov/dataset/noaa-voices-oral-history-archives1
    Explore at:
    Dataset updated
    Oct 19, 2024
    Dataset provided by
    (Custodian)
    Description

    The NOAA Voices Oral History Archives (VOHA) seeks to document the human experience as it relates to the changing environment, climate, oceans and coasts, and other key areas of NOAA's work through firsthand oral history accounts from across the US and its territories. Oral histories contribute to NOAA's mission of "Science, Service, and Stewardship" by creating, compiling, archiving, and sharing the experiences of stakeholders, scientists, and others.

    Any individual or organization can participate in the VOHA program by contributing individual oral history interviews or collections of interviews that are related to the project scope and mission, or by using the interviews archived here in their research, scholarship, exhibits, or general use. We accept oral histories produced by NOAA staff (including social scientists, historians, and others) as well as from external organizations, universities, researchers, and oral history practitioners. This content is made available to the public in this digital repository for educational and research purposes. The Voices Oral History Archives database is a powerful resource available to the public to inform, educate, and provide primary information for researchers interested in the local, human experience associated with the varied facets of NOAA's mission (including but not limited to climate, fisheries, weather, and heritage).

  6. chest_falsetto

    • huggingface.co
    Updated Aug 4, 2024
    Cite
    CCMUSIC Database (2024). chest_falsetto [Dataset]. https://huggingface.co/datasets/ccmusic-database/chest_falsetto
    Explore at:
    Dataset updated
    Aug 4, 2024
    Dataset authored and provided by
    CCMUSIC Database
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Dataset Card for Chest voice and Falsetto Dataset

    The original dataset, sourced from the Chest Voice and Falsetto Dataset, includes 1,280 monophonic singing audio files in .wav format, performed, recorded, and annotated by students majoring in Vocal Music at the China Conservatory of Music. The chest voice is tagged as "chest" and the falsetto voice as "falsetto." Additionally, the dataset encompasses the Mel spectrogram, Mel frequency cepstral coefficient (MFCC), and spectral… See the full description on the dataset page: https://huggingface.co/datasets/ccmusic-database/chest_falsetto.

  7. Emotional Voice Messages (EMOVOME) database

    • data.niaid.nih.gov
    Updated Jun 13, 2024
    Cite
    Gómez-Zaragozá, Lucía (2024). Emotional Voice Messages (EMOVOME) database [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6453063
    Explore at:
    Dataset updated
    Jun 13, 2024
    Dataset provided by
    Naranjo, Valery
    Marín-Morales, Javier
    Parra Vargas, Elena
    Gómez-Zaragozá, Lucía
    Alcañiz Raya, Mariano
    del Amor, Rocío
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Emotional Voice Messages (EMOVOME) database is a speech dataset collected for emotion recognition in real-world conditions. It contains 999 spontaneous voice messages from 100 Spanish speakers, collected from real conversations on a messaging app. EMOVOME includes both expert and non-expert emotional annotations, covering valence and arousal dimensions, along with emotion categories for the expert annotations. Detailed participant information is provided, including sociodemographic data and personality trait assessments using the NEO-FFI questionnaire. Moreover, EMOVOME provides audio recordings of participants reading a given text, as well as transcriptions of all 999 voice messages. Additionally, baseline models for valence and arousal recognition are provided, utilizing both speech and audio transcriptions.

    Description

    For details on the EMOVOME database, please refer to the article:

    "EMOVOME Database: Advancing Emotion Recognition in Speech Beyond Staged Scenarios". Lucía Gómez-Zaragozá, Rocío del Amor, María José Castro-Bleda, Valery Naranjo, Mariano Alcañiz Raya, Javier Marín-Morales. (pre-print available in https://doi.org/10.48550/arXiv.2403.02167)

    Content

    The Zenodo repository contains four files:

    EMOVOME_agreement.pdf: agreement file required to access the original audio files, detailed in section Usage Notes.

    labels.csv: ratings of the three non-experts and the expert annotator, independently and combined.

    participants_ids.csv: table mapping each numerical file ID to its corresponding alphanumeric participant ID.

    transcriptions.csv: transcriptions of each audio.

    The repository also includes three folders:

    Audios: it contains the file features_eGeMAPSv02.csv corresponding to the standard acoustic feature set used in the baseline model, and two folders:

    Lecture: contains the audio files corresponding to the text readings, with each file named according to the participant's ID.

    Emotions: contains the voice recordings from the messaging app provided by the user, named with a file ID.

    Questionnaires: it contains three files: sociodemographic_spanish.csv and sociodemographic_english.csv contain the sociodemographic data of participants in Spanish and English, respectively, and NEO-FFI_spanish.csv includes the participants' answers to the Spanish version of the NEO-FFI questionnaire. All three files include a column indicating the participant's ID to link the information.

    Baseline_emotion_recognition: it includes three files and two folders. The file partitions.csv specifies the proposed data partition. Specifically, the dataset is divided into 80% for development and 20% for testing using a speaker-independent approach, i.e., samples from the same speaker are not included in both development and test (see the sketch below). The development set includes 80 participants (40 female, 40 male) with the following label distribution: 241 negative, 305 neutral and 261 positive valence; and 148 low, 328 neutral and 331 high arousal. The test set includes 20 participants (10 female, 10 male) with the following label distribution: 57 negative, 62 neutral and 73 positive valence; and 13 low, 70 neutral and 109 high arousal. The files baseline_speech.ipynb and baseline_text.ipynb contain the code used to create the baseline emotion recognition models based on speech and text, respectively. The trained models for valence and arousal prediction are provided in the folders models_speech and models_text.
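    For illustration only, the sketch below shows a speaker-independent 80/20 split of the kind partitions.csv describes; it is not the dataset's own baseline code, and the column names file_id and participant_id are assumptions rather than the actual EMOVOME headers.

```python
# Illustrative sketch of a speaker-independent 80/20 split like the one described
# above. Column names (file_id, participant_id) are assumed for illustration and
# may differ from the actual EMOVOME CSV headers.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

labels = pd.read_csv("labels.csv")                 # annotator ratings (see Content above)
ids = pd.read_csv("participants_ids.csv")          # file ID -> participant ID mapping
data = labels.merge(ids, on="file_id")

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
dev_idx, test_idx = next(splitter.split(data, groups=data["participant_id"]))
dev, test = data.iloc[dev_idx], data.iloc[test_idx]  # no speaker appears in both subsets
```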

    Audio files in “Lecture” and “Emotions” are only provided to users who complete the agreement file described in the Usage Notes section. Audio files are in Ogg Vorbis format at 16-bit and 44.1 kHz or 48 kHz. The total size of the “Audios” folder is about 213 MB.

    Usage Notes

    All the data included in the EMOVOME database is publicly available under the Creative Commons Attribution 4.0 International license. The only exception is the original raw audio files, for which an additional step is required as a security measure to safeguard the speakers' privacy. To request access, interested authors should first complete and sign the agreement file EMOVOME_agreement.pdf and send it to the corresponding author (jamarmo@htech.upv.es). The data included in the EMOVOME database is expected to be used for research purposes only. Therefore, the agreement file states that the authors are not allowed to share the data with profit-making companies or organisations. They are also not expected to distribute the data to other research institutions; instead, they are suggested to kindly refer interested colleagues to the corresponding author of this article. By agreeing to the terms of the agreement, the authors also commit to refraining from publishing the audio content on the media (such as television and radio), in scientific journals (or any other publications), as well as on other platforms on the internet. The agreement must bear the signature of the legally authorised representative of the research institution (e.g., head of laboratory/department). Once the signed agreement is received and validated, the corresponding author will deliver the "Audios" folder containing the audio files through a download procedure. A direct connection between the EMOVOME authors and the applicants guarantees that updates regarding additional materials included in the database can be received by all EMOVOME users.

  8. Data from: Gender Recognition by Voice

    • kaggle.com
    Updated Aug 26, 2016
    Cite
    Kory Becker (2016). Gender Recognition by Voice [Dataset]. https://www.kaggle.com/datasets/primaryobjects/voicegender/code
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 26, 2016
    Dataset provided by
    Kaggle
    Authors
    Kory Becker
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Voice Gender

    Gender Recognition by Voice and Speech Analysis

    This database was created to identify a voice as male or female, based upon acoustic properties of the voice and speech. The dataset consists of 3,168 recorded voice samples, collected from male and female speakers. The voice samples are pre-processed by acoustic analysis in R using the seewave and tuneR packages, with an analyzed frequency range of 0 Hz–280 Hz (human vocal range).

    The Dataset

    The following acoustic properties of each voice are measured and included within the CSV:

    • meanfreq: mean frequency (in kHz)
    • sd: standard deviation of frequency
    • median: median frequency (in kHz)
    • Q25: first quantile (in kHz)
    • Q75: third quantile (in kHz)
    • IQR: interquantile range (in kHz)
    • skew: skewness (see note in specprop description)
    • kurt: kurtosis (see note in specprop description)
    • sp.ent: spectral entropy
    • sfm: spectral flatness
    • mode: mode frequency
    • centroid: frequency centroid (see specprop)
    • peakf: peak frequency (frequency with highest energy)
    • meanfun: average of fundamental frequency measured across acoustic signal
    • minfun: minimum fundamental frequency measured across acoustic signal
    • maxfun: maximum fundamental frequency measured across acoustic signal
    • meandom: average of dominant frequency measured across acoustic signal
    • mindom: minimum of dominant frequency measured across acoustic signal
    • maxdom: maximum of dominant frequency measured across acoustic signal
    • dfrange: range of dominant frequency measured across acoustic signal
    • modindx: modulation index. Calculated as the accumulated absolute difference between adjacent measurements of fundamental frequencies divided by the frequency range
    • label: male or female

    Accuracy

    • Baseline (always predict male): 50% / 50%
    • Logistic Regression: 97% / 98%
    • CART: 96% / 97%
    • Random Forest: 100% / 98%
    • SVM: 100% / 99%
    • XGBoost: 100% / 99%
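    As a rough illustration of this kind of evaluation (not the original analysis that produced the numbers above), the sketch below fits one of the listed model types on the CSV; the local file name voice.csv is an assumption.

```python
# Minimal sketch, not the original analysis: fit a random forest on the acoustic
# features listed above. Assumes the Kaggle CSV has been saved locally as voice.csv.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("voice.csv")
X = df.drop(columns=["label"])                     # acoustic features (meanfreq, sd, ..., modindx)
y = df["label"]                                    # "male" or "female"

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
model = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```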

    Research Questions

    An original analysis of the data-set can be found in the following article:

    Identifying the Gender of a Voice using Machine Learning

    The best model achieves 99% accuracy on the test set. According to a CART model, it appears that looking at the mean fundamental frequency might be enough to accurately classify a voice. However, some male voices use a higher frequency, even though their resonance differs from female voices, and may be incorrectly classified as female. To the human ear, there is apparently more than simple frequency that determines a voice's gender.

    Questions

    • What other features differ between male and female voices?
    • Can we find a difference in resonance between male and female voices?
    • Can we identify falsetto from regular voices? (separate data-set likely needed for this)
    • Are there other interesting features in the data?

    CART Diagram

    CART model diagram: http://i.imgur.com/Npr2U7O.png

    Mean fundamental frequency appears to be an indicator of voice gender, with a threshold of 140 Hz separating male from female classifications.

    References

    The Harvard-Haskins Database of Regularly-Timed Speech

    Telecommunications & Signal Processing Laboratory (TSP) Speech Database at McGill University

    VoxForge Speech Corpus

    Festvox CMU_ARCTIC Speech Database at Carnegie Mellon University

  9. Global Voices Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Oct 12, 2024
    Cite
    Khanh Nguyen; Hal Daumé III (2024). Global Voices Dataset [Dataset]. https://paperswithcode.com/dataset/global-voices
    Explore at:
    Dataset updated
    Oct 12, 2024
    Authors
    Khanh Nguyen; Hal Daumé III
    Description

    Global Voices is a multilingual dataset for evaluating cross-lingual summarization methods. It is extracted from social-network descriptions of Global Voices news articles to cheaply collect evaluation data for into-English and from-English summarization in 15 languages.

  10. Supplementary material for "A database for the comparison of measured datasets of human voice directivity"

    • zenodo.org
    pdf, zip
    Updated Jul 12, 2024
    Cite
    Christoph Pörschmann; Christoph Pörschmann (2024). Supplementary material for "A database for the comparison of measured datasets of human voice directivity" [Dataset]. http://doi.org/10.5281/zenodo.7834211
    Explore at:
    Available download formats: pdf, zip
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Christoph Pörschmann; Christoph Pörschmann
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    [1] C. Pörschmann. “A database for the comparison of measured datasets of human voice directivity," in Proceedings of the Forum Acusticum, Torino, Italy, 2023.

    This study presents a database that allows direct comparison and visualization of datasets from 19 different studies. The data is collected from tables, plots, and datasets from the supplemental material of the respective studies. Some studies present directivity patterns averaged over a whole sentence, while others report phoneme-dependent data. Furthermore, these datasets vary in their sampling grids, with many measured in the horizontal plane and just a few measured spherically. Most datasets included in this work present frequency-band averaged values, for example, in one-third octave bands, while a few newer studies provide the raw data in the form of transfer functions.

    Furthermore, the supplementary material contains voice directivity datasets averaged over a complete, phonetically balanced German sentence (measured twice for 13 subjects).

    The .pdf file contains

    • information on the database that allows comparing the results of 19 publications on voice directivities
    • general information on the voice directivity files in the SOFA format
    • information on the indices and names of the SOFA-files
    • additional plots

    The Database.zip

    • Excel-file containing datasets from the publications given in frequency-bands
    • SOFA-Files (sampled on sparse grid) of all own datasets from previous studies
    • Matlab scripts for importing, upsampling and visualizing all voice directivity datasets considered in the database

    The VoiceDirectivitySentence.zip files contain voice directivity patterns in the SOFA format, averaged over one phonetically balanced sentence:

    • sampled on the sparse measuring grid
    • upsampled to a dense grid

  11. Data from: Voice Conversion Challenge 2020 database v1.0

    • data.niaid.nih.gov
    • explore.openaire.eu
    • +1more
    Updated Dec 23, 2020
    + more versions
    Cite
    Xiaohai Tian (2020). Voice Conversion Challenge 2020 database v1.0 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4345688
    Explore at:
    Dataset updated
    Dec 23, 2020
    Dataset provided by
    Xiaohai Tian
    Tomi Kinnunen
    Zhao Yi
    Tomoki Toda
    Wen-Chin Huang
    Rohan Kumar Das
    Zhenhua Ling
    Junichi Yamagishi
    Description

    Voice conversion (VC) is a technique to transform a speaker identity included in a source speech waveform into a different one while preserving linguistic information of the source speech waveform.

    In 2016, we launched the Voice Conversion Challenge (VCC) 2016 [1][2] at Interspeech 2016. The objective of the 2016 challenge was to better understand different VC techniques built on a freely available common dataset with a common goal, and to share views about unsolved problems and challenges faced by current VC techniques. The VCC 2016 focused on the most basic VC task, that is, the construction of VC models that automatically transform the voice identity of a source speaker into that of a target speaker using a parallel clean training database, where source and target speakers read out the same set of utterances in a professional recording studio. 17 research groups participated in the 2016 challenge. The challenge was successful and it established a new standard evaluation methodology and protocols for benchmarking the performance of VC systems.

    In 2018, we launched the second edition of VCC, the VCC 2018 [3]. In the second edition, we revised three aspects of the challenge. First, we reduced the amount of speech data used for the construction of participants' VC systems to half. This was based on feedback from participants in the previous challenge, and it is also essential for practical applications. Second, we introduced a more challenging task, referred to as the Spoke task, in addition to a task similar to the first edition, which we call the Hub task. In the Spoke task, participants needed to build their VC systems using a non-parallel database in which source and target speakers read out different sets of utterances. We then evaluated both parallel and non-parallel voice conversion systems via the same large-scale crowdsourced listening test. Third, we also attempted to bridge the gap between the ASV and VC communities. Since new VC systems developed for the VCC 2018 may be strong candidates for enhancing the ASVspoof 2015 database, we also assessed the spoofing performance of the VC systems based on anti-spoofing scores.

    In 2020, we launched the third edition of VCC, the VCC 2020 [4][5]. In this third edition, we constructed and distributed a new database for two tasks, intra-lingual semi-parallel and cross-lingual VC. The dataset for intra-lingual VC consists of a smaller parallel corpus and a larger nonparallel corpus, where both of them are of the same language. The dataset for cross-lingual VC consists of a corpus of the source speakers speaking in the source language and another corpus of the target speakers speaking in the target language. As a more challenging task than the previous ones, we focused on cross-lingual VC, in which the speaker identity is transformed between two speakers uttering different languages, which requires handling completely nonparallel training over different languages.

    This repository contains the training and evaluation data released to participants, target speaker’s speech data in English for reference purpose, and the transcriptions for evaluation data. For more details about the challenge and the listening test results please refer to [4] and README file.

    [1] Tomoki Toda, Ling-Hui Chen, Daisuke Saito, Fernando Villavicencio, Mirjam Wester, Zhizheng Wu, Junichi Yamagishi "The Voice Conversion Challenge 2016" in Proc. of Interspeech, San Francisco.

    [2] Mirjam Wester, Zhizheng Wu, Junichi Yamagishi "Analysis of the Voice Conversion Challenge 2016 Evaluation Results" in Proc. of Interspeech 2016.

    [3] Jaime Lorenzo-Trueba, Junichi Yamagishi, Tomoki Toda, Daisuke Saito, Fernando Villavicencio, Tomi Kinnunen, Zhenhua Ling, "The Voice Conversion Challenge 2018: Promoting Development of Parallel and Nonparallel Methods", Proc Speaker Odyssey 2018, June 2018.

    [4] Yi Zhao, Wen-Chin Huang, Xiaohai Tian, Junichi Yamagishi, Rohan Kumar Das, Tomi Kinnunen, Zhenhua Ling, and Tomoki Toda. "Voice conversion challenge 2020: Intra-lingual semi-parallel and cross-lingual voice conversion" Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, 80-98, DOI: 10.21437/VCC_BC.2020-14.

  12. Laryngeal Voice Disorder Classification

    • kaggle.com
    zip
    Updated Apr 22, 2024
    Cite
    Daniil Krasnoproshin (2024). Laryngeal Voice Disorder Classification [Dataset]. https://www.kaggle.com/datasets/daniilkrasnoproshin/healthy-vs-laryngeal-disorder-classification
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Apr 22, 2024
    Authors
    Daniil Krasnoproshin
    Description

    Description:

    Unlock the potential of voice analysis in diagnosing laryngeal disorders with this voice recordings dataset. This comprehensive dataset, gathered at the Republican Scientific and Practical Center for Otorhinolaryngology in Belarus, comprises anonymized voice recordings from 60 individuals.

    Key Features:

    • Diverse Samples: Explore voice samples from 30 healthy individuals and 30 individuals with various laryngeal disorders, including vocal fold nodules, laryngeal paralysis, and functional dysphonia.
    • Anonymized Data: Each voice sample is anonymized and labeled with alphanumeric codes to ensure privacy and confidentiality.
    • No Personal Information: Rest assured, the recordings contain no personal data such as names or ages, maintaining the anonymity of participants.
    • High-Quality Recordings: The recordings were captured under controlled conditions in the phoniatric department, ensuring consistency and reliability.
    • Potential Applications: Use this dataset to develop machine learning models for automatic diagnosis and classification of laryngeal disorders, aiding healthcare professionals in timely and accurate assessments.
    • Research Opportunities: Investigate patterns and features in voice data to uncover insights into the manifestation and progression of various laryngeal conditions.
    • Community Collaboration: Join a vibrant community of researchers, data scientists, and healthcare professionals on Kaggle to collaborate, share insights, and advance the field of voice-based diagnostics.

    If you use this dataset in your research, please credit the authors.

    Citation

    Analysis of acoustic voice parameters for larynx pathology detection (link)

    License

    License was not specified at the source, yet access to the data is public and a citation was requested.

  13. ESD Dataset

    • paperswithcode.com
    Updated Jun 30, 2023
    + more versions
    Cite
    Kun Zhou; Berrak Sisman; Rui Liu; Haizhou Li (2023). ESD Dataset [Dataset]. https://paperswithcode.com/dataset/esd
    Explore at:
    Dataset updated
    Jun 30, 2023
    Authors
    Kun Zhou; Berrak Sisman; Rui Liu; Haizhou Li
    Description

    ESD is an Emotional Speech Database for voice conversion research. The ESD database consists of 350 parallel utterances spoken by 10 native English and 10 native Chinese speakers and covers 5 emotion categories (neutral, happy, angry, sad and surprise). More than 29 hours of speech data were recorded in a controlled acoustic environment. The database is suitable for multi-speaker and cross-lingual emotional voice conversion studies.

  14. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 19, 2024
    + more versions
    Cite
    Livingstone, Steven R. (2024). The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_1188975
    Explore at:
    Dataset updated
    Oct 19, 2024
    Dataset provided by
    Livingstone, Steven R.
    Russo, Frank A.
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Description

    The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) contains 7356 files (total size: 24.8 GB). The dataset contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions, and song contains calm, happy, sad, angry, and fearful emotions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression. All conditions are available in three modality formats: Audio-only (16bit, 48kHz .wav), Audio-Video (720p H.264, AAC 48kHz, .mp4), and Video-only (no sound). Note, there are no song files for Actor_18.

    The RAVDESS was developed by Dr Steven R. Livingstone, who now leads the Affective Data Science Lab, and Dr Frank A. Russo who leads the SMART Lab.

    Citing the RAVDESS

    The RAVDESS is released under a Creative Commons Attribution license, so please cite the RAVDESS if it is used in your work in any form. Published academic papers should use the academic paper citation for our PLoS1 paper. Personal works, such as machine learning projects/blog posts, should provide a URL to this Zenodo page, though a reference to our PLoS1 paper would also be appreciated.

    Academic paper citation

    Livingstone SR, Russo FA (2018) The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 13(5): e0196391. https://doi.org/10.1371/journal.pone.0196391.

    Personal use citation

    Include a link to this Zenodo page - https://zenodo.org/record/1188976

    Commercial Licenses

    Commercial licenses for the RAVDESS can be purchased. For more information, please visit our license fee page, or contact us at ravdess@gmail.com.

    Contact Information

    If you would like further information about the RAVDESS, to purchase a commercial license, or if you experience any issues downloading files, please contact us at ravdess@gmail.com.

    Example Videos

    Watch a sample of the RAVDESS speech and song videos.

    Emotion Classification Users

    If you're interested in using machine learning to classify emotional expressions with the RAVDESS, please see our new RAVDESS Facial Landmark Tracking data set [Zenodo project page].

    Construction and Validation

    Full details on the construction and perceptual validation of the RAVDESS are described in our PLoS ONE paper - https://doi.org/10.1371/journal.pone.0196391.

    The RAVDESS contains 7356 files. Each file was rated 10 times on emotional validity, intensity, and genuineness. Ratings were provided by 247 individuals who were characteristic of untrained adult research participants from North America. A further set of 72 participants provided test-retest data. High levels of emotional validity, interrater reliability, and test-retest intrarater reliability were reported. Validation data is open-access, and can be downloaded along with our paper from PLoS ONE.

    Contents

    Audio-only files

    Audio-only files of all actors (01-24) are available as two separate zip files (~200 MB each):

    Speech file (Audio_Speech_Actors_01-24.zip, 215 MB) contains 1440 files: 60 trials per actor x 24 actors = 1440.

    Song file (Audio_Song_Actors_01-24.zip, 198 MB) contains 1012 files: 44 trials per actor x 23 actors = 1012.

    Audio-Visual and Video-only files

    Video files are provided as separate zip downloads for each actor (01-24, ~500 MB each), and are split into separate speech and song downloads:

    Speech files (Video_Speech_Actor_01.zip to Video_Speech_Actor_24.zip) collectively contains 2880 files: 60 trials per actor x 2 modalities (AV, VO) x 24 actors = 2880.

    Song files (Video_Song_Actor_01.zip to Video_Song_Actor_24.zip) collectively contains 2024 files: 44 trials per actor x 2 modalities (AV, VO) x 23 actors = 2024.

    File Summary

    In total, the RAVDESS collection includes 7356 files (2880+2024+1440+1012 files).

    File naming convention

    Each of the 7356 RAVDESS files has a unique filename. The filename consists of a 7-part numerical identifier (e.g., 02-01-06-01-02-01-12.mp4). These identifiers define the stimulus characteristics:

    Modality (01 = full-AV, 02 = video-only, 03 = audio-only).

    Vocal channel (01 = speech, 02 = song).

    Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).

    Emotional intensity (01 = normal, 02 = strong). NOTE: There is no strong intensity for the 'neutral' emotion.

    Statement (01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door").

    Repetition (01 = 1st repetition, 02 = 2nd repetition).

    Actor (01 to 24. Odd numbered actors are male, even numbered actors are female).

    Filename example: 02-01-06-01-02-01-12.mp4

    Video-only (02)

    Speech (01)

    Fearful (06)

    Normal intensity (01)

    Statement "dogs" (02)

    1st Repetition (01)

    12th Actor (12)

    Female, as the actor ID number is even.
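    A minimal sketch that decodes this naming convention in code; the helper parse_ravdess is illustrative and not part of the RAVDESS distribution, but the field mappings follow the convention above.

```python
# Minimal sketch decoding the 7-part RAVDESS filename described above.
# parse_ravdess is an illustrative helper, not something shipped with the dataset.
MODALITY = {"01": "full-AV", "02": "video-only", "03": "audio-only"}
VOCAL_CHANNEL = {"01": "speech", "02": "song"}
EMOTION = {"01": "neutral", "02": "calm", "03": "happy", "04": "sad",
           "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised"}
INTENSITY = {"01": "normal", "02": "strong"}
STATEMENT = {"01": "Kids are talking by the door", "02": "Dogs are sitting by the door"}

def parse_ravdess(filename: str) -> dict:
    """Split e.g. '02-01-06-01-02-01-12.mp4' into its stimulus attributes."""
    stem = filename.rsplit(".", 1)[0]
    modality, channel, emotion, intensity, statement, repetition, actor = stem.split("-")
    return {
        "modality": MODALITY[modality],
        "vocal_channel": VOCAL_CHANNEL[channel],
        "emotion": EMOTION[emotion],
        "intensity": INTENSITY[intensity],
        "statement": STATEMENT[statement],
        "repetition": int(repetition),
        "actor": int(actor),
        "actor_sex": "female" if int(actor) % 2 == 0 else "male",  # odd = male, even = female
    }

print(parse_ravdess("02-01-06-01-02-01-12.mp4"))
```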

    License information

    The RAVDESS is released under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, CC BY-NC-SA 4.0

    Commercial licenses for the RAVDESS can also be purchased. For more information, please visit our license fee page, or contact us at ravdess@gmail.com.

    Related Data sets

    RAVDESS Facial Landmark Tracking data set [Zenodo project page].

  15. Common Voice Dataset

    • paperswithcode.com
    • opendatalab.com
    • +1more
    Updated Jan 7, 2021
    Cite
    Rosana Ardila; Megan Branson; Kelly Davis; Michael Henretty; Michael Kohler; Josh Meyer; Reuben Morais; Lindsay Saunders; Francis M. Tyers; Gregor Weber (2021). Common Voice Dataset [Dataset]. https://paperswithcode.com/dataset/common-voice
    Explore at:
    Dataset updated
    Jan 7, 2021
    Authors
    Rosana Ardila; Megan Branson; Kelly Davis; Michael Henretty; Michael Kohler; Josh Meyer; Reuben Morais; Lindsay Saunders; Francis M. Tyers; Gregor Weber
    Description

    Common Voice is an audio dataset consisting of unique MP3 files with corresponding text files. There are 9,283 recorded hours in the dataset. The dataset also includes demographic metadata like age, sex, and accent. The dataset consists of 7,335 validated hours in 60 languages.

  16. Finance Interactive Voice Response System

    • catalog.data.gov
    • s.cnmilf.com
    • +2more
    Updated Jul 4, 2025
    + more versions
    Cite
    Social Security Administration (2025). Finance Interactive Voice Response System [Dataset]. https://catalog.data.gov/dataset/finance-interactive-voice-response-system
    Explore at:
    Dataset updated
    Jul 4, 2025
    Dataset provided by
    Social Security Administration (http://ssa.gov/)
    Description

    The database stores information to support the capability to access (by phone) vendor invoice/payment status reports using an Interactive Voice Response System.

  17. common_voice_12_0

    • huggingface.co
    Updated Mar 24, 2023
    + more versions
    Cite
    Mozilla Foundation (2023). common_voice_12_0 [Dataset]. https://huggingface.co/datasets/mozilla-foundation/common_voice_12_0
    Explore at:
    Dataset updated
    Mar 24, 2023
    Dataset authored and provided by
    Mozilla Foundation (http://mozilla.org/)
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    Dataset Card for Common Voice Corpus 12.0

    Dataset Summary

    The Common Voice dataset consists of unique MP3 files with corresponding text files. Many of the 26,119 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 17,127 validated hours in 104 languages, but more voices and languages are always added. Take a look at the Languages page to… See the full description on the dataset page: https://huggingface.co/datasets/mozilla-foundation/common_voice_12_0.
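    For readers who want to load this corpus programmatically, here is a minimal sketch with the Hugging Face datasets library; it assumes you have accepted the dataset's terms on the Hub and are logged in, and the field names "sentence" and "audio" are taken from the dataset card rather than verified here.

```python
# Minimal sketch, assuming the dataset terms have been accepted on the Hugging Face
# Hub and you are logged in (e.g. via `huggingface-cli login`). Streaming avoids
# downloading an entire language split up front. Depending on your datasets version,
# you may also need to pass trust_remote_code=True.
from datasets import load_dataset

cv = load_dataset("mozilla-foundation/common_voice_12_0", "en",
                  split="train", streaming=True)

sample = next(iter(cv))
print(sample["sentence"])                 # transcript text (field name per the dataset card)
print(sample["audio"]["sampling_rate"])   # decoded audio metadata
```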

  18. COALA voice data and transcripts Italian

    • data.niaid.nih.gov
    Updated Oct 7, 2023
    Cite
    Massimo Curti (2023). COALA voice data and transcripts Italian [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8413134
    Explore at:
    Dataset updated
    Oct 7, 2023
    Dataset provided by
    Evangelos Niforatos
    Massimo Curti
    Samuel Kernan Freire
    Stefan Wellsandt
    Mina Foosherian
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains audio files and transcripts in Italian related to manufacturing. We collected the scripts during the Horizon Europe RIA COALA (GA 957296, project reference website) from industrial use cases and hired a service provider to generate the related audio files (BIBA - Bremer Institut für Produktion und Logistik GmbH ordered the service). The service provider checked the audio files for quality.

    The service provider recruited crowd workers, and gathered their audio records, informed consent (privacy) and agreement that their records become public domain (Creative Commons 0; https://creativecommons.org/share-your-work/public-domain/cc0/). The service provider declared to follow a Crowd Code of Ethics and a Fair Pay policy.

    The metadata file contains the following information:

    file_name: name of the audio file

    script: script the speaker had to speak

    scriptId: the numeric identifier of the script

    participantId: the numeric identifier of the participant (speaker)

    gender: the gender as indicated by the participant (MALE or FEMALE)

    age: the age in years as indicated by the participant

    age_range: the age range in years (18-30, 31-45, 46+)

    country: the birth country indicated by the participant

    current_country: the country of residence indicated by the participant

    primary_language: the language indicated as primary by the participant

    ever_worked_factory: answer to the question: "Have you ever worked in a factory, manufacturing setting?" (Yes/No)

    years_worked_factory: answer to the question: "If yes, for how many years?" (1-10, 10+)

    background_noise_type: background noise in the audio as indicated by the participant (mild, humming/technical, no noise)

    gdpr_and_ipr_consent: answer to the privacy notice and the ipr transfer to CC-0 (Yes)

    date_signed: date when the participant signed the consent form (US format, MM.DD.YYYY)
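    A minimal sketch of filtering on these metadata fields; the file name metadata.csv is an assumption, since the actual metadata file name is not stated here, while the column names follow the list above.

```python
# Minimal sketch using the metadata fields listed above. The file name
# "metadata.csv" is an assumption; adjust it to the actual metadata file.
import pandas as pd

meta = pd.read_csv("metadata.csv")

# Speakers who reported factory experience, recorded without background noise.
subset = meta[(meta["ever_worked_factory"] == "Yes") &
              (meta["background_noise_type"] == "no noise")]
print(subset[["file_name", "script", "gender", "age_range"]].head())
```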

  19. TC-STAR Bilingual Voice-Conversion Spanish Speech Database

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Dec 21, 2010
    + more versions
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2010). TC-STAR Bilingual Voice-Conversion Spanish Speech Database [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0311/
    Explore at:
    Dataset updated
    Dec 21, 2010
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    Description

    4 hours and 80 minutes of speech as spoken by 2 female speakers and 2 male speakers, covering both mimics and parallel voice conversion data.

  20. Data from: The Uncommonwealth

    • data.virginia.gov
    url
    Updated Oct 2, 2024
    Cite
    Library of Virginia (2024). The Uncommonwealth [Dataset]. https://data.virginia.gov/dataset/the-uncommonwealth
    Explore at:
    Available download formats: url
    Dataset updated
    Oct 2, 2024
    Dataset authored and provided by
    Library of Virginia
    Description

    Learn about what we do, why we do it, and how our efforts relate to current issues and events. In addition to our intriguing collections and groundbreaking projects, we’ll spotlight public libraries, staff members, and specialized professions. Visit uncommonwealth.virginiamemory.com to learn more!
