100+ datasets found
  1. VOICES Dataset

    • paperswithcode.com
    Updated Apr 17, 2018
    Cite
    (2018). VOICES Dataset [Dataset]. https://paperswithcode.com/dataset/voices
    Explore at:
    Dataset updated
    Apr 17, 2018
    Description

    The VOICES (Voices Obscured in Complex Environmental Settings) corpus is a dataset to promote speech and signal processing research on speech recorded by far-field microphones in noisy room conditions.

    For this corpus, audio was recorded in furnished rooms with background noise played in conjunction with foreground speech selected from the LibriSpeech corpus. Multiple sessions were recorded in each room to cover all foreground speech and background noise combinations. Audio was recorded using twelve microphones placed throughout the room, resulting in 120 hours of audio per microphone.

  2. VOICED Database

    • physionet.org
    Updated Jun 7, 2018
    Cite
    Laura Verde; Giovanna Sannino (2018). VOICED Database [Dataset]. http://doi.org/10.13026/C25Q2N
    Explore at:
    Dataset updated
    Jun 7, 2018
    Authors
    Laura Verde; Giovanna Sannino
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    This database includes 208 voice samples: 150 from pathological voices and 58 from healthy voices.

  3. NOAA Voices Data Map

    • noaa.hub.arcgis.com
    Updated May 17, 2023
    Cite
    NOAA GeoPlatform (2023). NOAA Voices Data Map [Dataset]. https://noaa.hub.arcgis.com/maps/09d293a8ed9745bbba97d03d06dd5d0f
    Explore at:
    Dataset updated
    May 17, 2023
    Dataset provided by
    National Oceanic and Atmospheric Administration (http://www.noaa.gov/)
    Authors
    NOAA GeoPlatform
    Area covered
    Description

    This web map was developed to show the geographic distribution of the oral history interviews contained within the archive of the NOAA Voices program. This map is used in the NOAA Voices Oral History Interview Mapping Application, found here: https://noaa.maps.arcgis.com/home/item.html?id=a220357bec444ab0be7e586fb5ecd26e

    Each interview is treated as a separate data point with a variety of attributes, including: narrator, interviewer, date of interview, city, state, project, link to interview, and interview description. Each point in this dataset is plotted at the city level, and the size of each point is directly tied to the number of interviews at that location.

    The data and metadata for this application can be found on the NOAA Voices website: https://voices.nmfs.noaa.gov/. Each interview has its own landing page on the NOAA Voices site, and the information on these landing pages mirrors the data in this application.

  4. EmoV-DB Sorted

    • kaggle.com
    Updated Dec 12, 2021
    Cite
    Phantasm34 (2021). EmoV-DB Sorted [Dataset]. https://www.kaggle.com/phantasm34/emovdb-sorted/code
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Dec 12, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Phantasm34
    Description

    EmoV-DB

    See also

    https://github.com/noetits/ICE-Talk for controllable TTS

    How to use

    Download link

    Sorted version (recommended), new link: https://openslr.org/115/

    Old link (slow download), but it gives you the folder structure needed to use the "load_emov_db()" function: https://mega.nz/#F!KBp32apT!gLIgyWf9iQ-yqnWFUFuUHg

    Not sorted version: http://www.coe.neu.edu/Research/AClab/Speech%20Data/

    Forced alignments with gentle

    "It is the process of taking the text transcription of an audio speech segment and determining where in time particular words occur in the speech segment." source

    It also allows to separate verbal and non-verbal vocalizations (laughs, yawns, etc.)

    1. Go to https://github.com/lowerquality/gentle
    2. Clone the repo.
    3. In "Getting started", use the third option: ./install.sh
    4. Copy align_db.py into the repository.
    5. In align_db.py, change the "path" variable so that it corresponds to the path of EmoV-DB.
    6. Launch the command "python align_db.py". You will probably have to install some packages to make it work.
    7. It should create a folder called "alignments" in the repo, with the same structure as the database, containing a JSON file for each sentence of the database.

    8. The function "get_start_end_from_json(path)" extracts the start and end times of the computed forced alignment.

    9. You can play a file with the function "play(path)".

    10. You can play the part of the file in which there is speech, according to the forced alignment, with "play_start_end(path, start, end)" (see the sketch after these steps).
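    A minimal usage sketch of steps 8-10, assuming align_db.py from the EmoV-DB repository is importable and the "alignments" folder has already been generated with gentle; the concrete file and folder paths are illustrative placeholders, not part of the dataset.

```python
# Minimal sketch of steps 8-10 above. Assumes align_db.py (from the EmoV-DB repo)
# is on the Python path and gentle has produced the "alignments" folder (step 7).
# The paths below are illustrative placeholders.
from align_db import get_start_end_from_json, play, play_start_end

wav_path = "EmoV-DB/bea/anger_1-28_0011.wav"        # audio file, named per the scheme described below
json_path = "alignments/bea/anger_1-28_0011.json"   # forced alignment produced in step 7

start, end = get_start_end_from_json(json_path)     # step 8: speech start/end from the alignment
play(wav_path)                                      # step 9: play the whole file
play_start_end(wav_path, start, end)                # step 10: play only the speech segment
```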

    Overview of data

    The Emotional Voices Database: Towards Controlling the Emotional Expressiveness in Voice Generation Systems

    • This dataset is built for the purpose of emotional speech synthesis. The transcripts are based on the CMU Arctic database: http://www.festvox.org/cmu_arctic/cmuarctic.data.

    • It includes recordings of four speakers: two male and two female.

    • The emotional styles are neutral, sleepiness, anger, disgust and amused.

    • Each audio file is recorded in 16-bit .wav format.

    • Spk-Je (Female, English: Neutral(417 files), Amused(222 files), Angry(523 files), Sleepy(466 files), Disgust(189 files))

    • Spk-Bea (Female, English: Neutral(373 files), Amused(309 files), Angry(317 files), Sleepy(520 files), Disgust(347 files))

    • Spk-Sa (Male, English: Neutral(493 files), Amused(501 files), Angry(468 files), Sleepy(495 files), Disgust(497 files))

    • Spk-Jsh (Male, English: Neutral(302 files), Amused(298 files), Sleepy(263 files))

    • File naming (audio_folder): anger_1-28_0011.wav. The first word is the emotional style, 1-28 is the annotation document file range, and the last four digits are the sentence number.

    • File naming (annotation_folder): anger_1-28.TextGrid. The first word is the emotional style and 1-28 is the annotation document range.

    References

    A description of the database here: https://arxiv.org/pdf/1806.09514.pdf

    Please reference this paper when using this database:

    Bibtex:

    @article{adigwe2018emotional,
      title={The emotional voices database: Towards controlling the emotion dimension in voice generation systems},
      author={Adigwe, Adaeze and Tits, No{\'e} and Haddad, Kevin El and Ostadabbas, Sarah and Dutoit, Thierry},
      journal={arXiv preprint arXiv:1806.09514},
      year={2018}
    }

  5. NOAA Voices Oral History Archives

    • catalog.data.gov
    • fisheries.noaa.gov
    Updated Oct 19, 2024
    + more versions
    Cite
    (Custodian) (2024). NOAA Voices Oral History Archives [Dataset]. https://catalog.data.gov/dataset/noaa-voices-oral-history-archives1
    Explore at:
    Dataset updated
    Oct 19, 2024
    Dataset provided by
    (Custodian)
    Description

    The NOAA Voices Oral History Archives (VOHA) seeks to document the human experience as it relates to the changing environment, climate, oceans and coasts, and other key areas of NOAA's work through firsthand oral history accounts from across the US and its territories. Oral histories contribute to NOAA's mission of "Science, Service, and Stewardship" by creating, compiling, archiving, and sharing the experiences of stakeholders, scientists, and others.

    Any individual or organization can participate in the VOHA program by contributing individual oral history interviews or collections of interviews that are related to the project scope and mission, or by using the interviews archived here in their research, scholarship, exhibits, or general use. We accept oral histories produced by NOAA staff (including social scientists, historians, and others) as well as from external organizations, universities, researchers, and oral history practitioners. This content is made available to the public in this digital repository for educational and research purposes. The Voices Oral History Archives database is a powerful resource available to the public to inform, educate, and provide primary information for researchers interested in the local, human experience associated with the varied facets of NOAA's mission (including but not limited to climate, fisheries, weather, and heritage).

  6. chest_falsetto

    • huggingface.co
    Updated Aug 4, 2024
    Cite
    CCMUSIC Database (2024). chest_falsetto [Dataset]. https://huggingface.co/datasets/ccmusic-database/chest_falsetto
    Explore at:
    Dataset updated
    Aug 4, 2024
    Dataset authored and provided by
    CCMUSIC Database
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Dataset Card for Chest voice and Falsetto Dataset

    The original dataset, sourced from the Chest Voice and Falsetto Dataset, includes 1,280 monophonic singing audio files in .wav format, performed, recorded, and annotated by students majoring in Vocal Music at the China Conservatory of Music. The chest voice is tagged as "chest" and the falsetto voice as "falsetto." Additionally, the dataset encompasses the Mel spectrogram, Mel frequency cepstral coefficient (MFCC), and spectral… See the full description on the dataset page: https://huggingface.co/datasets/ccmusic-database/chest_falsetto.

  7. Emotional Voice Messages (EMOVOME) database

    • data.niaid.nih.gov
    Updated Jun 13, 2024
    Cite
    Gómez-Zaragozá, Lucía (2024). Emotional Voice Messages (EMOVOME) database [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6453063
    Explore at:
    Dataset updated
    Jun 13, 2024
    Dataset provided by
    Naranjo, Valery
    Marín-Morales, Javier
    Parra Vargas, Elena
    Gómez-Zaragozá, Lucía
    Alcañiz Raya, Mariano
    del Amor, Rocío
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Emotional Voice Messages (EMOVOME) database is a speech dataset collected for emotion recognition in real-world conditions. It contains 999 spontaneous voice messages from 100 Spanish speakers, collected from real conversations on a messaging app. EMOVOME includes both expert and non-expert emotional annotations, covering valence and arousal dimensions, along with emotion categories for the expert annotations. Detailed participant information is provided, including sociodemographic data and personality trait assessments using the NEO-FFI questionnaire. Moreover, EMOVOME provides audio recordings of participants reading a given text, as well as transcriptions of all 999 voice messages. Additionally, baseline models for valence and arousal recognition are provided, utilizing both speech and audio transcriptions.

    Description

    For details on the EMOVOME database, please refer to the article:

    "EMOVOME Database: Advancing Emotion Recognition in Speech Beyond Staged Scenarios". Lucía Gómez-Zaragozá, Rocío del Amor, María José Castro-Bleda, Valery Naranjo, Mariano Alcañiz Raya, Javier Marín-Morales. (pre-print available in https://doi.org/10.48550/arXiv.2403.02167)

    Content

    The Zenodo repository contains four files:

    EMOVOME_agreement.pdf: agreement file required to access the original audio files, detailed in section Usage Notes.

    labels.csv: ratings of the three non-experts and the expert annotator, independently and combined.

    participants_ids.csv: table mapping each numerical file ID to its corresponding alphanumeric participant ID.

    transcriptions.csv: transcriptions of each audio.

    The repository also includes three folders:

    Audios: it contains the file features_eGeMAPSv02.csv corresponding to the standard acoustic feature set used in the baseline model, and two folders:

    Lecture: contains the audio files corresponding to the text readings, with each file named according to the participant's ID.

    Emotions: contains the voice recordings from the messaging app provided by the user, named with a file ID.

    Questionnaires: it contains three files: sociodemographic_spanish.csv and sociodemographic_english.csv contain the sociodemographic data of participants in Spanish and English, respectively, and NEO-FFI_spanish.csv includes the participants' answers to the Spanish version of the NEO-FFI questionnaire. All three files include a column indicating the participant's ID to link the information.

    Baseline_emotion_recognition: it includes three files and two folders. The file partitions.csv specifies the proposed data partition. Specifically, the dataset is divided into 80% for development and 20% for testing using a speaker-independent approach, i.e., samples from the same speaker are not included in both development and test (see the sketch below). The development set includes 80 participants (40 female, 40 male) with the following label distribution: 241 negative, 305 neutral and 261 positive valence; and 148 low, 328 neutral and 331 high arousal. The test set includes 20 participants (10 female, 10 male) with the following label distribution: 57 negative, 62 neutral and 73 positive valence; and 13 low, 70 neutral and 109 high arousal. The files baseline_speech.ipynb and baseline_text.ipynb contain the code used to create the baseline emotion recognition models based on speech and text, respectively. The trained models for valence and arousal prediction are provided in the folders models_speech and models_text.
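    For illustration only, the sketch below shows a speaker-independent 80/20 split of the kind partitions.csv describes; it is not the dataset's own baseline code, and the column names file_id and participant_id are assumptions rather than the actual EMOVOME headers.

```python
# Illustrative sketch of a speaker-independent 80/20 split like the one described
# above. Column names (file_id, participant_id) are assumed for illustration and
# may differ from the actual EMOVOME CSV headers.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

labels = pd.read_csv("labels.csv")                 # annotator ratings (see Content above)
ids = pd.read_csv("participants_ids.csv")          # file ID -> participant ID mapping
data = labels.merge(ids, on="file_id")

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
dev_idx, test_idx = next(splitter.split(data, groups=data["participant_id"]))
dev, test = data.iloc[dev_idx], data.iloc[test_idx]  # no speaker appears in both subsets
```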

    Audio files in “Lecture” and “Emotions” are only provided to users who complete the agreement file described in the Usage Notes section. Audio files are in Ogg Vorbis format at 16-bit and 44.1 kHz or 48 kHz. The total size of the “Audios” folder is about 213 MB.

    Usage Notes

    All the data included in the EMOVOME database is publicly available under the Creative Commons Attribution 4.0 International license. The only exception is the original raw audio files, for which an additional step is required as a security measure to safeguard the speakers' privacy. To request access, interested authors should first complete and sign the agreement file EMOVOME_agreement.pdf and send it to the corresponding author (jamarmo@htech.upv.es). The data included in the EMOVOME database is expected to be used for research purposes only. Therefore, the agreement file states that the authors are not allowed to share the data with profit-making companies or organisations. They are also not expected to distribute the data to other research institutions; instead, they are suggested to kindly refer interested colleagues to the corresponding author of this article. By agreeing to the terms of the agreement, the authors also commit to refraining from publishing the audio content on the media (such as television and radio), in scientific journals (or any other publications), as well as on other platforms on the internet. The agreement must bear the signature of the legally authorised representative of the research institution (e.g., head of laboratory/department). Once the signed agreement is received and validated, the corresponding author will deliver the "Audios" folder containing the audio files through a download procedure. A direct connection between the EMOVOME authors and the applicants guarantees that updates regarding additional materials included in the database can be received by all EMOVOME users.

  8. Data from: Gender Recognition by Voice

    • kaggle.com
    Updated Aug 26, 2016
    Cite
    Kory Becker (2016). Gender Recognition by Voice [Dataset]. https://www.kaggle.com/datasets/primaryobjects/voicegender/code
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 26, 2016
    Dataset provided by
    Kaggle
    Authors
    Kory Becker
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Voice Gender

    Gender Recognition by Voice and Speech Analysis

    This database was created to identify a voice as male or female, based upon acoustic properties of the voice and speech. The dataset consists of 3,168 recorded voice samples, collected from male and female speakers. The voice samples are pre-processed by acoustic analysis in R using the seewave and tuneR packages, with an analyzed frequency range of 0 Hz–280 Hz (human vocal range).

    The Dataset

    The following acoustic properties of each voice are measured and included within the CSV:

    • meanfreq: mean frequency (in kHz)
    • sd: standard deviation of frequency
    • median: median frequency (in kHz)
    • Q25: first quantile (in kHz)
    • Q75: third quantile (in kHz)
    • IQR: interquantile range (in kHz)
    • skew: skewness (see note in specprop description)
    • kurt: kurtosis (see note in specprop description)
    • sp.ent: spectral entropy
    • sfm: spectral flatness
    • mode: mode frequency
    • centroid: frequency centroid (see specprop)
    • peakf: peak frequency (frequency with highest energy)
    • meanfun: average of fundamental frequency measured across acoustic signal
    • minfun: minimum fundamental frequency measured across acoustic signal
    • maxfun: maximum fundamental frequency measured across acoustic signal
    • meandom: average of dominant frequency measured across acoustic signal
    • mindom: minimum of dominant frequency measured across acoustic signal
    • maxdom: maximum of dominant frequency measured across acoustic signal
    • dfrange: range of dominant frequency measured across acoustic signal
    • modindx: modulation index. Calculated as the accumulated absolute difference between adjacent measurements of fundamental frequencies divided by the frequency range
    • label: male or female

    Accuracy

    • Baseline (always predict male): 50% / 50%
    • Logistic Regression: 97% / 98%
    • CART: 96% / 97%
    • Random Forest: 100% / 98%
    • SVM: 100% / 99%
    • XGBoost: 100% / 99%
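    As a rough illustration of this kind of evaluation (not the original analysis that produced the numbers above), the sketch below fits one of the listed model types on the CSV; the local file name voice.csv is an assumption.

```python
# Minimal sketch, not the original analysis: fit a random forest on the acoustic
# features listed above. Assumes the Kaggle CSV has been saved locally as voice.csv.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("voice.csv")
X = df.drop(columns=["label"])                     # acoustic features (meanfreq, sd, ..., modindx)
y = df["label"]                                    # "male" or "female"

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
model = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```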

    Research Questions

    An original analysis of the data-set can be found in the following article:

    Identifying the Gender of a Voice using Machine Learning

    The best model achieves 99% accuracy on the test set. According to a CART model, it appears that looking at the mean fundamental frequency might be enough to accurately classify a voice. However, some male voices use a higher frequency, even though their resonance differs from female voices, and may be incorrectly classified as female. To the human ear, there is apparently more than simple frequency that determines a voice's gender.

    Questions

    • What other features differ between male and female voices?
    • Can we find a difference in resonance between male and female voices?
    • Can we identify falsetto from regular voices? (separate data-set likely needed for this)
    • Are there other interesting features in the data?

    CART Diagram

    CART model diagram: http://i.imgur.com/Npr2U7O.png

    Mean fundamental frequency appears to be an indicator of voice gender, with a threshold of 140 Hz separating male from female classifications.

    References

    The Harvard-Haskins Database of Regularly-Timed Speech

    Telecommunications & Signal Processing Laboratory (TSP) Speech Database at McGill University

    VoxForge Speech Corpus

    Festvox CMU_ARCTIC Speech Database at Carnegie Mellon University

  9. Global Voices Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Oct 12, 2024
    Cite
    Khanh Nguyen; Hal Daumé III (2024). Global Voices Dataset [Dataset]. https://paperswithcode.com/dataset/global-voices
    Explore at:
    Dataset updated
    Oct 12, 2024
    Authors
    Khanh Nguyen; Hal Daumé III
    Description

    Global Voices is a multilingual dataset for evaluating cross-lingual summarization methods. It is extracted from social-network descriptions of Global Voices news articles to cheaply collect evaluation data for into-English and from-English summarization in 15 languages.

  10. Supplementary material for "A database for the comparison of measured datasets of human voice directivity"

    • zenodo.org
    pdf, zip
    Updated Jul 12, 2024
    Cite
    Christoph Pörschmann; Christoph Pörschmann (2024). Supplementary material for "A database for the comparison of measured datasets of human voice directivity" [Dataset]. http://doi.org/10.5281/zenodo.7834211
    Explore at:
    Available download formats: pdf, zip
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Christoph Pörschmann; Christoph Pörschmann
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    [1] C. Pörschmann. “A database for the comparison of measured datasets of human voice directivity," in Proceedings of the Forum Acusticum, Torino, Italy, 2023.

    This study presents a database that allows direct comparison and visualization of datasets from 19 different studies. The data is collected from tables, plots, and datasets from the supplemental material of the respective studies. Some studies present directivity patterns averaged over a whole sentence, while others report phoneme-dependent data. Furthermore, these datasets vary in their sampling grids, with many measured in the horizontal plane and just a few measured spherically. Most datasets included in this work present frequency-band averaged values, for example, in one-third octave bands, while a few newer studies provide the raw data in the form of transfer functions.

    Furthermore, the supplementary material contains voice directivity datasets averaged over a complete, phonetically balanced German sentence (measured twice for 13 subjects).

    The .pdf file contains

    • information on the database that allows comparing the results of 19 publications on voice directivities
    • general information on the voice directivity files in the SOFA format
    • information on the indices and names of the SOFA-files
    • additional plots

    The Database.zip

    • Excel-file containing datasets from the publications given in frequency-bands
    • SOFA-Files (sampled on sparse grid) of all own datasets from previous studies
    • Matlab scripts for importing, upsampling and visualizing all voice directivity datasets considered in the database

    The VoiceDirectivitySentence.zip files contain voice directivity patterns in the SOFA format, averaged over one phonetically balanced sentence:

    • sampled on the sparse measuring grid
    • upsampled to a dense grid

  11. Data from: Voice Conversion Challenge 2020 database v1.0

    • data.niaid.nih.gov
    • explore.openaire.eu
    • +1more
    Updated Dec 23, 2020
    + more versions
    Cite
    Xiaohai Tian (2020). Voice Conversion Challenge 2020 database v1.0 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4345688
    Explore at:
    Dataset updated
    Dec 23, 2020
    Dataset provided by
    Xiaohai Tian
    Tomi Kinnunen
    Zhao Yi
    Tomoki Toda
    Wen-Chin Huang
    Rohan Kumar Das
    Zhenhua Ling
    Junichi Yamagishi
    Description

    Voice conversion (VC) is a technique to transform a speaker identity included in a source speech waveform into a different one while preserving linguistic information of the source speech waveform.

    In 2016, we launched the Voice Conversion Challenge (VCC) 2016 [1][2] at Interspeech 2016. The objective of the 2016 challenge was to better understand different VC techniques built on a freely available common dataset with a common goal, and to share views about unsolved problems and challenges faced by current VC techniques. The VCC 2016 focused on the most basic VC task, that is, the construction of VC models that automatically transform the voice identity of a source speaker into that of a target speaker using a parallel clean training database, where source and target speakers read out the same set of utterances in a professional recording studio. 17 research groups participated in the 2016 challenge. The challenge was successful and it established a new standard evaluation methodology and protocols for benchmarking the performance of VC systems.

    In 2018, we launched the second edition of VCC, the VCC 2018 [3]. In the second edition, we revised three aspects of the challenge. First, we reduced the amount of speech data used for the construction of participants' VC systems to half. This was based on feedback from participants in the previous challenge, and it is also essential for practical applications. Second, we introduced a more challenging task, referred to as the Spoke task, in addition to a task similar to the first edition, which we call the Hub task. In the Spoke task, participants needed to build their VC systems using a non-parallel database in which source and target speakers read out different sets of utterances. We then evaluated both parallel and non-parallel voice conversion systems via the same large-scale crowdsourced listening test. Third, we also attempted to bridge the gap between the ASV and VC communities. Since new VC systems developed for the VCC 2018 may be strong candidates for enhancing the ASVspoof 2015 database, we also assessed the spoofing performance of the VC systems based on anti-spoofing scores.

    In 2020, we launched the third edition of VCC, the VCC 2020 [4][5]. In this third edition, we constructed and distributed a new database for two tasks, intra-lingual semi-parallel and cross-lingual VC. The dataset for intra-lingual VC consists of a smaller parallel corpus and a larger nonparallel corpus, where both of them are of the same language. The dataset for cross-lingual VC consists of a corpus of the source speakers speaking in the source language and another corpus of the target speakers speaking in the target language. As a more challenging task than the previous ones, we focused on cross-lingual VC, in which the speaker identity is transformed between two speakers uttering different languages, which requires handling completely nonparallel training over different languages.

    This repository contains the training and evaluation data released to participants, target speaker’s speech data in English for reference purpose, and the transcriptions for evaluation data. For more details about the challenge and the listening test results please refer to [4] and README file.

    [1] Tomoki Toda, Ling-Hui Chen, Daisuke Saito, Fernando Villavicencio, Mirjam Wester, Zhizheng Wu, Junichi Yamagishi "The Voice Conversion Challenge 2016" in Proc. of Interspeech, San Francisco.

    [2] Mirjam Wester, Zhizheng Wu, Junichi Yamagishi "Analysis of the Voice Conversion Challenge 2016 Evaluation Results" in Proc. of Interspeech 2016.

    [3] Jaime Lorenzo-Trueba, Junichi Yamagishi, Tomoki Toda, Daisuke Saito, Fernando Villavicencio, Tomi Kinnunen, Zhenhua Ling, "The Voice Conversion Challenge 2018: Promoting Development of Parallel and Nonparallel Methods", Proc Speaker Odyssey 2018, June 2018.

    [4] Yi Zhao, Wen-Chin Huang, Xiaohai Tian, Junichi Yamagishi, Rohan Kumar Das, Tomi Kinnunen, Zhenhua Ling, and Tomoki Toda. "Voice conversion challenge 2020: Intra-lingual semi-parallel and cross-lingual voice conversion" Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, 80-98, DOI: 10.21437/VCC_BC.2020-14.

  12. Laryngeal Voice Disorder Classification

    • kaggle.com
    zip
    Updated Apr 22, 2024
    Cite
    Daniil Krasnoproshin (2024). Laryngeal Voice Disorder Classification [Dataset]. https://www.kaggle.com/datasets/daniilkrasnoproshin/healthy-vs-laryngeal-disorder-classification
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Apr 22, 2024
    Authors
    Daniil Krasnoproshin
    Description

    Description:

    Unlock the potential of voice analysis in diagnosing laryngeal disorders with this voice recordings dataset. This comprehensive dataset, gathered at the Republican Scientific and Practical Center for Otorhinolaryngology in Belarus, comprises anonymized voice recordings from 60 individuals.

    Key Features:

    • Diverse Samples: Explore voice samples from 30 healthy individuals and 30 individuals with various laryngeal disorders, including vocal fold nodules, laryngeal paralysis, and functional dysphonia.
    • Anonymized Data: Each voice sample is anonymized and labeled with alphanumeric codes to ensure privacy and confidentiality.
    • No Personal Information: Rest assured, the recordings contain no personal data such as names or ages, maintaining the anonymity of participants.
    • High-Quality Recordings: The recordings were captured under controlled conditions in the phoniatric department, ensuring consistency and reliability.
    • Potential Applications: Use this dataset to develop machine learning models for automatic diagnosis and classification of laryngeal disorders, aiding healthcare professionals in timely and accurate assessments.
    • Research Opportunities: Investigate patterns and features in voice data to uncover insights into the manifestation and progression of various laryngeal conditions.
    • Community Collaboration: Join a vibrant community of researchers, data scientists, and healthcare professionals on Kaggle to collaborate, share insights, and advance the field of voice-based diagnostics.

    If you use this dataset in your research, please credit the authors.

    Citation

    Analysis of acoustic voice parameters for larynx pathology detection (link)

    License

    License was not specified at the source, yet access to the data is public and a citation was requested.

  13. ESD Dataset

    • paperswithcode.com
    Updated Jun 30, 2023
    + more versions
    Cite
    Kun Zhou; Berrak Sisman; Rui Liu; Haizhou Li (2023). ESD Dataset [Dataset]. https://paperswithcode.com/dataset/esd
    Explore at:
    Dataset updated
    Jun 30, 2023
    Authors
    Kun Zhou; Berrak Sisman; Rui Liu; Haizhou Li
    Description

    ESD is an Emotional Speech Database for voice conversion research. The ESD database consists of 350 parallel utterances spoken by 10 native English and 10 native Chinese speakers and covers 5 emotion categories (neutral, happy, angry, sad and surprise). More than 29 hours of speech data were recorded in a controlled acoustic environment. The database is suitable for multi-speaker and cross-lingual emotional voice conversion studies.

  14. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 19, 2024
    + more versions
    Cite
    Livingstone, Steven R. (2024). The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_1188975
    Explore at:
    Dataset updated
    Oct 19, 2024
    Dataset provided by
    Livingstone, Steven R.
    Russo, Frank A.
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Description

    The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) contains 7356 files (total size: 24.8 GB). The dataset contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions, and song contains calm, happy, sad, angry, and fearful emotions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression. All conditions are available in three modality formats: Audio-only (16bit, 48kHz .wav), Audio-Video (720p H.264, AAC 48kHz, .mp4), and Video-only (no sound). Note, there are no song files for Actor_18.

    The RAVDESS was developed by Dr Steven R. Livingstone, who now leads the Affective Data Science Lab, and Dr Frank A. Russo who leads the SMART Lab.

    Citing the RAVDESS

    The RAVDESS is released under a Creative Commons Attribution license, so please cite the RAVDESS if it is used in your work in any form. Published academic papers should use the academic paper citation for our PLoS1 paper. Personal works, such as machine learning projects/blog posts, should provide a URL to this Zenodo page, though a reference to our PLoS1 paper would also be appreciated.

    Academic paper citation

    Livingstone SR, Russo FA (2018) The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 13(5): e0196391. https://doi.org/10.1371/journal.pone.0196391.

    Personal use citation

    Include a link to this Zenodo page - https://zenodo.org/record/1188976

    Commercial Licenses

    Commercial licenses for the RAVDESS can be purchased. For more information, please visit our license fee page, or contact us at ravdess@gmail.com.

    Contact Information

    If you would like further information about the RAVDESS, to purchase a commercial license, or if you experience any issues downloading files, please contact us at ravdess@gmail.com.

    Example Videos

    Watch a sample of the RAVDESS speech and song videos.

    Emotion Classification Users

    If you're interested in using machine learning to classify emotional expressions with the RAVDESS, please see our new RAVDESS Facial Landmark Tracking data set [Zenodo project page].

    Construction and Validation

    Full details on the construction and perceptual validation of the RAVDESS are described in our PLoS ONE paper - https://doi.org/10.1371/journal.pone.0196391.

    The RAVDESS contains 7356 files. Each file was rated 10 times on emotional validity, intensity, and genuineness. Ratings were provided by 247 individuals who were characteristic of untrained adult research participants from North America. A further set of 72 participants provided test-retest data. High levels of emotional validity, interrater reliability, and test-retest intrarater reliability were reported. Validation data is open-access, and can be downloaded along with our paper from PLoS ONE.

    Contents

    Audio-only files

    Audio-only files of all actors (01-24) are available as two separate zip files (~200 MB each):

    Speech file (Audio_Speech_Actors_01-24.zip, 215 MB) contains 1440 files: 60 trials per actor x 24 actors = 1440.

    Song file (Audio_Song_Actors_01-24.zip, 198 MB) contains 1012 files: 44 trials per actor x 23 actors = 1012.

    Audio-Visual and Video-only files

    Video files are provided as separate zip downloads for each actor (01-24, ~500 MB each), and are split into separate speech and song downloads:

    Speech files (Video_Speech_Actor_01.zip to Video_Speech_Actor_24.zip) collectively contains 2880 files: 60 trials per actor x 2 modalities (AV, VO) x 24 actors = 2880.

    Song files (Video_Song_Actor_01.zip to Video_Song_Actor_24.zip) collectively contains 2024 files: 44 trials per actor x 2 modalities (AV, VO) x 23 actors = 2024.

    File Summary

    In total, the RAVDESS collection includes 7356 files (2880+2024+1440+1012 files).

    File naming convention

    Each of the 7356 RAVDESS files has a unique filename. The filename consists of a 7-part numerical identifier (e.g., 02-01-06-01-02-01-12.mp4). These identifiers define the stimulus characteristics:

    Modality (01 = full-AV, 02 = video-only, 03 = audio-only).

    Vocal channel (01 = speech, 02 = song).

    Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).

    Emotional intensity (01 = normal, 02 = strong). NOTE: There is no strong intensity for the 'neutral' emotion.

    Statement (01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door").

    Repetition (01 = 1st repetition, 02 = 2nd repetition).

    Actor (01 to 24. Odd numbered actors are male, even numbered actors are female).

    Filename example: 02-01-06-01-02-01-12.mp4

    Video-only (02)

    Speech (01)

    Fearful (06)

    Normal intensity (01)

    Statement "dogs" (02)

    1st Repetition (01)

    12th Actor (12)

    Female, as the actor ID number is even.
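    A minimal sketch that decodes this naming convention in code; the helper parse_ravdess is illustrative and not part of the RAVDESS distribution, but the field mappings follow the convention above.

```python
# Minimal sketch decoding the 7-part RAVDESS filename described above.
# parse_ravdess is an illustrative helper, not something shipped with the dataset.
MODALITY = {"01": "full-AV", "02": "video-only", "03": "audio-only"}
VOCAL_CHANNEL = {"01": "speech", "02": "song"}
EMOTION = {"01": "neutral", "02": "calm", "03": "happy", "04": "sad",
           "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised"}
INTENSITY = {"01": "normal", "02": "strong"}
STATEMENT = {"01": "Kids are talking by the door", "02": "Dogs are sitting by the door"}

def parse_ravdess(filename: str) -> dict:
    """Split e.g. '02-01-06-01-02-01-12.mp4' into its stimulus attributes."""
    stem = filename.rsplit(".", 1)[0]
    modality, channel, emotion, intensity, statement, repetition, actor = stem.split("-")
    return {
        "modality": MODALITY[modality],
        "vocal_channel": VOCAL_CHANNEL[channel],
        "emotion": EMOTION[emotion],
        "intensity": INTENSITY[intensity],
        "statement": STATEMENT[statement],
        "repetition": int(repetition),
        "actor": int(actor),
        "actor_sex": "female" if int(actor) % 2 == 0 else "male",  # odd = male, even = female
    }

print(parse_ravdess("02-01-06-01-02-01-12.mp4"))
```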

    License information

    The RAVDESS is released under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, CC BY-NC-SA 4.0

    Commercial licenses for the RAVDESS can also be purchased. For more information, please visit our license fee page, or contact us at ravdess@gmail.com.

    Related Data sets

    RAVDESS Facial Landmark Tracking data set [Zenodo project page].

  15. Common Voice Dataset

    • paperswithcode.com
    • opendatalab.com
    • +1more
    Updated Jan 7, 2021
    Cite
    Rosana Ardila; Megan Branson; Kelly Davis; Michael Henretty; Michael Kohler; Josh Meyer; Reuben Morais; Lindsay Saunders; Francis M. Tyers; Gregor Weber (2021). Common Voice Dataset [Dataset]. https://paperswithcode.com/dataset/common-voice
    Explore at:
    Dataset updated
    Jan 7, 2021
    Authors
    Rosana Ardila; Megan Branson; Kelly Davis; Michael Henretty; Michael Kohler; Josh Meyer; Reuben Morais; Lindsay Saunders; Francis M. Tyers; Gregor Weber
    Description

    Common Voice is an audio dataset consisting of unique MP3 files with corresponding text files. There are 9,283 recorded hours in the dataset. The dataset also includes demographic metadata like age, sex, and accent. The dataset consists of 7,335 validated hours in 60 languages.

  16. Finance Interactive Voice Response System

    • catalog.data.gov
    • s.cnmilf.com
    • +2more
    Updated Jul 4, 2025
    + more versions
    Cite
    Social Security Administration (2025). Finance Interactive Voice Response System [Dataset]. https://catalog.data.gov/dataset/finance-interactive-voice-response-system
    Explore at:
    Dataset updated
    Jul 4, 2025
    Dataset provided by
    Social Security Administration (http://ssa.gov/)
    Description

    The database stores information to support the capability to access (by phone) vendor invoice/payment status reports using an Interactive Voice Response System.

  17. common_voice_12_0

    • huggingface.co
    Updated Mar 24, 2023
    + more versions
    Cite
    Mozilla Foundation (2023). common_voice_12_0 [Dataset]. https://huggingface.co/datasets/mozilla-foundation/common_voice_12_0
    Explore at:
    Dataset updated
    Mar 24, 2023
    Dataset authored and provided by
    Mozilla Foundation (http://mozilla.org/)
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    Dataset Card for Common Voice Corpus 12.0

    Dataset Summary

    The Common Voice dataset consists of unique MP3 files with corresponding text files. Many of the 26,119 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 17,127 validated hours in 104 languages, but more voices and languages are always added. Take a look at the Languages page to… See the full description on the dataset page: https://huggingface.co/datasets/mozilla-foundation/common_voice_12_0.
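    For readers who want to load this corpus programmatically, here is a minimal sketch with the Hugging Face datasets library; it assumes you have accepted the dataset's terms on the Hub and are logged in, and the field names "sentence" and "audio" are taken from the dataset card rather than verified here.

```python
# Minimal sketch, assuming the dataset terms have been accepted on the Hugging Face
# Hub and you are logged in (e.g. via `huggingface-cli login`). Streaming avoids
# downloading an entire language split up front. Depending on your datasets version,
# you may also need to pass trust_remote_code=True.
from datasets import load_dataset

cv = load_dataset("mozilla-foundation/common_voice_12_0", "en",
                  split="train", streaming=True)

sample = next(iter(cv))
print(sample["sentence"])                 # transcript text (field name per the dataset card)
print(sample["audio"]["sampling_rate"])   # decoded audio metadata
```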

  18. COALA voice data and transcripts Italian

    • data.niaid.nih.gov
    Updated Oct 7, 2023
    Cite
    Massimo Curti (2023). COALA voice data and transcripts Italian [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8413134
    Explore at:
    Dataset updated
    Oct 7, 2023
    Dataset provided by
    Evangelos Niforatos
    Massimo Curti
    Samuel Kernan Freire
    Stefan Wellsandt
    Mina Foosherian
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains audio files and transcripts in Italian related to manufacturing. We collected the scripts during the Horizon Europe RIA COALA (GA 957296, project reference website) from industrial use cases and hired a service provider to generate the related audio files (BIBA - Bremer Institut für Produktion und Logistik GmbH ordered the service). The service provider checked the audio files for quality.

    The service provider recruited crowd workers, and gathered their audio records, informed consent (privacy) and agreement that their records become public domain (Creative Commons 0; https://creativecommons.org/share-your-work/public-domain/cc0/). The service provider declared to follow a Crowd Code of Ethics and a Fair Pay policy.

    The metadata file contains the following information:

    file_name: name of the audio file

    script: script the speaker had to speak

    scriptId: the numeric identifier of the script

    participantId: the numeric identifier of the participant (speaker)

    gender: the gender as indicated by the participant (MALE or FEMALE)

    age: the age in years as indicated by the participant

    age_range: the age range in years (18-30, 31-45, 46+)

    country: the birth country indicated by the participant

    current_country: the country of residence indicated by the participant

    primary_language: the language indicated as primary by the participant

    ever_worked_factory: answer to the question: "Have you ever worked in a factory, manufacturing setting?" (Yes/No)

    years_worked_factory: answer to the question: "If yes, for how many years?" (1-10, 10+)

    background_noise_type: background noise in the audio as indicated by the participant (mild, humming/technical, no noise)

    gdpr_and_ipr_consent: answer to the privacy notice and the ipr transfer to CC-0 (Yes)

    date_signed: date when the participant signed the consent form (US format, MM.DD.YYYY)
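    A minimal sketch of filtering on these metadata fields; the file name metadata.csv is an assumption, since the actual metadata file name is not stated here, while the column names follow the list above.

```python
# Minimal sketch using the metadata fields listed above. The file name
# "metadata.csv" is an assumption; adjust it to the actual metadata file.
import pandas as pd

meta = pd.read_csv("metadata.csv")

# Speakers who reported factory experience, recorded without background noise.
subset = meta[(meta["ever_worked_factory"] == "Yes") &
              (meta["background_noise_type"] == "no noise")]
print(subset[["file_name", "script", "gender", "age_range"]].head())
```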

  19. TC-STAR Bilingual Voice-Conversion Spanish Speech Database

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Dec 21, 2010
    + more versions
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2010). TC-STAR Bilingual Voice-Conversion Spanish Speech Database [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0311/
    Explore at:
    Dataset updated
    Dec 21, 2010
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    Description

    4 hours and 80 minutes of speech as spoken by 2 female speakers and 2 male speakers, covering both mimics and parallel voice conversion data.

  20. Data from: The Uncommonwealth

    • data.virginia.gov
    url
    Updated Oct 2, 2024
    Cite
    Library of Virginia (2024). The Uncommonwealth [Dataset]. https://data.virginia.gov/dataset/the-uncommonwealth
    Explore at:
    Available download formats: url
    Dataset updated
    Oct 2, 2024
    Dataset authored and provided by
    Library of Virginia
    Description

    Learn about what we do, why we do it, and how our efforts relate to current issues and events. In addition to our intriguing collections and groundbreaking projects, we’ll spotlight public libraries, staff members, and specialized professions. Visit uncommonwealth.virginiamemory.com to learn more!
