The VOICES corpus is a dataset to promote speech and signal processing research of speech recorded by far-field microphones in noisy room conditions.
For this corpus, audio was recorded in furnished rooms with background noise played in conjunction with foreground speech selected from the LibriSpeech corpus. Multiple sessions were recorded in each room to accommodate all foreground speech-background noise combinations. Audio was recorded using twelve microphones placed throughout the room, resulting in 120 hours of audio per microphone.
Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
This database includes 208 voice samples: 150 from pathological voices and 58 from healthy voices.
This web map was developed to show the geographic distribution of the oral history interviews contained within the archive of the NOAA Voices program. This map is used in the NOAA Voices Oral History Interview Mapping Application, found here: https://noaa.maps.arcgis.com/home/item.html?id=a220357bec444ab0be7e586fb5ecd26e. Each interview is treated as a separate data point with a variety of attributes. These attributes include: narrator, interviewer, date of interview, city, state, project, link to interview, and interview description. Each point in this dataset is plotted at the city level. The size of these points is directly tied to the number of interviews within that location. The data and metadata for this application can be found on the NOAA Voices website, here: https://voices.nmfs.noaa.gov/. Each interview has its own landing page on the NOAA Voices site, and the information on these landing pages mirrors the data in this application.
https://github.com/noetits/ICE-Talk for controllable TTS
Sorted version (recommended), new link: https://openslr.org/115/
Old link (slow download), but it gives you the folder structure needed to use the "load_emov_db()" function: https://mega.nz/#F!KBp32apT!gLIgyWf9iQ-yqnWFUFuUHg
Not sorted version: http://www.coe.neu.edu/Research/AClab/Speech%20Data/
"It is the process of taking the text transcription of an audio speech segment and determining where in time particular words occur in the speech segment." source
It also allows separating verbal and non-verbal vocalizations (laughs, yawns, etc.).
It should create a folder called "alignments" in the repo, with the same structure as the database, containing a json file for each sentence of the database.
The function "get_start_end_from_json(path)" allows you to extract start and end of the computed force alignment
You can play a file with the function "play(path)".
You can play the part of the file that contains speech according to the forced alignment with "play_start_end(path, start, end)".
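A minimal Python usage sketch of the helpers described above follows; the module name "emov_db" and the example file paths are assumptions to adapt to the EmoV-DB repository layout and your local copy.

# Usage sketch for the alignment helpers described above. The module name
# "emov_db" and the example paths are placeholders, not part of the release.
from emov_db import load_emov_db, get_start_end_from_json, play, play_start_end

data = load_emov_db()  # index the database (expects the folder structure of the sorted version)

wav_path = "bea/Angry/anger_1-28_0011.wav"               # hypothetical audio file path
json_path = "alignments/bea/Angry/anger_1-28_0011.json"  # matching forced-alignment JSON

start, end = get_start_end_from_json(json_path)  # start/end of speech from the forced alignment
play(wav_path)                                   # play the whole file
play_start_end(wav_path, start, end)             # play only the speech portion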
The Emotional Voices Database: Towards Controlling the Emotional Expressiveness in Voice Generation Systems
This dataset is built for the purpose of emotional speech synthesis. The transcripts were based on the CMU ARCTIC database: http://www.festvox.org/cmu_arctic/cmuarctic.data.
It includes recordings for four speakers: two male and two female.
The emotional styles are neutral, sleepiness, anger, disgust and amused.
Each audio file is recorded in 16-bit .wav format.
Spk-Je (Female, English: Neutral(417 files), Amused(222 files), Angry(523 files), Sleepy(466 files), Disgust(189 files))
Spk-Bea (Female, English: Neutral(373 files), Amused(309 files), Angry(317 files), Sleepy(520 files), Disgust(347 files))
Spk-Sa (Male, English: Neutral(493 files), Amused(501 files), Angry(468 files), Sleepy(495 files), Disgust(497 files))
Spk-Jsh (Male, English: Neutral(302 files), Amused(298 files), Sleepy(263 files))
File naming (audio_folder): anger_1-28_0011.wav. The first word is the emotional style, 1-28 is the annotation document file range, and the last four digits are the sentence number.
File naming (annotation_folder): anger_1-28.TextGrid. The first word is the emotional style and 1-28 is the annotation document range.
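The naming convention above can be parsed programmatically; the helper below is an illustrative Python sketch (not part of the database) that splits an audio file name into its three parts.

# Illustrative parser for the audio file naming convention described above.
import re

def parse_emov_filename(name: str) -> dict:
    """Split a name such as 'anger_1-28_0011.wav' into its documented parts."""
    m = re.match(r"(?P<emotion>[A-Za-z]+)_(?P<doc_range>\d+-\d+)_(?P<sentence>\d{4})\.wav$", name)
    if m is None:
        raise ValueError(f"unexpected file name: {name}")
    return {
        "emotion": m.group("emotion"),         # e.g. 'anger'
        "doc_range": m.group("doc_range"),     # e.g. '1-28'
        "sentence": int(m.group("sentence")),  # e.g. 11
    }

print(parse_emov_filename("anger_1-28_0011.wav"))
# {'emotion': 'anger', 'doc_range': '1-28', 'sentence': 11}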
A description of the database here: https://arxiv.org/pdf/1806.09514.pdf
Please reference this paper when using this database:
Bibtex:
@article{adigwe2018emotional,
title={The emotional voices database: Towards controlling the emotion dimension in voice generation systems},
author={Adigwe, Adaeze and Tits, No{\'e} and Haddad, Kevin El and Ostadabbas, Sarah and Dutoit, Thierry},
journal={arXiv preprint arXiv:1806.09514},
year={2018}
}
The NOAA Voices Oral History Archives (VOHA) seeks to document the human experience as it relates to the changing environment, climate, oceans and coasts, and other key areas of NOAA's work through firsthand oral history accounts from across the US and its territories. Oral histories contribute to NOAA's mission of "Science, Service, and Stewardship" by creating, compiling, archiving, and sharing the experiences of stakeholders, scientists, and others. Any individual or organization can participate in the VOHA program by contributing individual oral history interviews or collections of interviews that are related to the project scope and mission, or by using the interviews archived here in their research, scholarship, exhibits, or general use. We accept oral histories produced by NOAA staff (including social scientists, historians, and others) as well as from external organizations, universities, researchers, and oral history practitioners. This content is made available to the public in this digital repository for educational and research purposes. The Voices Oral History Archives database is a powerful resource available to the public to inform, educate, and provide primary information for researchers interested in our local, human experience associated with the varied facets of NOAA's mission (including but not limited to climate, fisheries, weather, and heritage).
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
Dataset Card for Chest voice and Falsetto Dataset
The original dataset, sourced from the Chest Voice and Falsetto Dataset, includes 1,280 monophonic singing audio files in .wav format, performed, recorded, and annotated by students majoring in Vocal Music at the China Conservatory of Music. The chest voice is tagged as "chest" and the falsetto voice as "falsetto." Additionally, the dataset encompasses the Mel spectrogram, Mel frequency cepstral coefficient (MFCC), and spectral… See the full description on the dataset page: https://huggingface.co/datasets/ccmusic-database/chest_falsetto.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The Emotional Voice Messages (EMOVOME) database is a speech dataset collected for emotion recognition in real-world conditions. It contains 999 spontaneous voice messages from 100 Spanish speakers, collected from real conversations on a messaging app. EMOVOME includes both expert and non-expert emotional annotations, covering valence and arousal dimensions, along with emotion categories for the expert annotations. Detailed participant information is provided, including sociodemographic data and personality trait assessments using the NEO-FFI questionnaire. Moreover, EMOVOME provides audio recordings of participants reading a given text, as well as transcriptions of all 999 voice messages. Additionally, baseline models for valence and arousal recognition are provided, utilizing both speech and audio transcriptions.
Description
For details on the EMOVOME database, please refer to the article:
"EMOVOME Database: Advancing Emotion Recognition in Speech Beyond Staged Scenarios". LucĂa GĂłmez-Zaragozá, RocĂo del Amor, MarĂa JosĂ© Castro-Bleda, Valery Naranjo, Mariano Alcañiz Raya, Javier MarĂn-Morales. (pre-print available in https://doi.org/10.48550/arXiv.2403.02167)
Content
The Zenodo repository contains four files:
EMOVOME_agreement.pdf: agreement file required to access the original audio files, detailed in section Usage Notes.
labels.csv: ratings from the three non-expert annotators and the expert annotator, both individually and combined.
participants_ids.csv: table mapping each numerical file ID to its corresponding alphanumeric participant ID.
transcriptions.csv: transcriptions of each audio.
The repository also includes three folders:
Audios: it contains the file features_eGeMAPSv02.csv corresponding to the standard acoustic feature set used in the baseline model, and two folders:
Lecture: contains the audio files corresponding to the text readings, with each file named according to the participant's ID.
Emotions: contains the voice recordings from the messaging app provided by the user, named with a file ID.
Questionnaires: it contains three files. sociodemographic_spanish.csv and sociodemographic_english.csv contain the sociodemographic data of participants in Spanish and English, respectively, and NEO-FFI_spanish.csv contains the participants’ answers to the Spanish version of the NEO-FFI questionnaire. All three files include a column indicating the participant's ID to link the information.
Baseline_emotion_recognition: it includes three files and two folders. The file partitions.csv specifies the proposed data partition. The dataset is divided into 80% for development and 20% for testing using a speaker-independent approach, i.e., samples from the same speaker do not appear in both the development and test sets (a minimal sketch of such a split follows this list). The development set includes 80 participants (40 female, 40 male) with the following label distribution: 241 negative, 305 neutral, and 261 positive valence; and 148 low, 328 neutral, and 331 high arousal. The test set includes 20 participants (10 female, 10 male) with the following label distribution: 57 negative, 62 neutral, and 73 positive valence; and 13 low, 70 neutral, and 109 high arousal. The files baseline_speech.ipynb and baseline_text.ipynb contain the code used to create the baseline emotion recognition models based on speech and text, respectively. The trained models for valence and arousal prediction are provided in the folders models_speech and models_text.
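The released partitions.csv should be used to reproduce the official split; the Python fragment below is only a hedged illustration of how a comparable speaker-independent 80/20 split could be generated. The column names "file_id" and "participant_id" are assumptions about labels.csv and participants_ids.csv, not documented headers.

# Illustrative speaker-independent 80/20 split; prefer the released partitions.csv
# for the official partition. Column names below are assumed, not documented.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

labels = pd.read_csv("labels.csv")
ids = pd.read_csv("participants_ids.csv")
data = labels.merge(ids, on="file_id")  # attach the participant ID to each voice message

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
dev_idx, test_idx = next(splitter.split(data, groups=data["participant_id"]))
dev, test = data.iloc[dev_idx], data.iloc[test_idx]

# Speaker independence: no participant appears in both partitions.
assert set(dev["participant_id"]).isdisjoint(set(test["participant_id"]))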
Audio files in “Lecture” and “Emotions” are only provided to users who complete the agreement file described in the Usage Notes section. Audio files are in Ogg Vorbis format at 16-bit and 44.1 kHz or 48 kHz. The total size of the “Audios” folder is about 213 MB.
Usage Notes
All the data included in the EMOVOME database is publicly available under the Creative Commons Attribution 4.0 International license. The only exception is the original raw audio files, for which an additional step is required as a security measure to safeguard the speakers' privacy. To request access, interested authors should first complete and sign the agreement file EMOVOME_agreement.pdf and send it to the corresponding author (jamarmo@htech.upv.es). The data included in the EMOVOME database is expected to be used for research purposes only. Therefore, the agreement file states that the authors are not allowed to share the data with profit-making companies or organisations. They are also not expected to distribute the data to other research institutions; instead, they are suggested to kindly refer interested colleagues to the corresponding author of this article. By agreeing to the terms of the agreement, the authors also commit to refraining from publishing the audio content on the media (such as television and radio), in scientific journals (or any other publications), as well as on other platforms on the internet. The agreement must bear the signature of the legally authorised representative of the research institution (e.g., head of laboratory/department). Once the signed agreement is received and validated, the corresponding author will deliver the "Audios" folder containing the audio files through a download procedure. A direct connection between the EMOVOME authors and the applicants guarantees that updates regarding additional materials included in the database can be received by all EMOVOME users.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
Gender Recognition by Voice and Speech Analysis
This database was created to identify a voice as male or female, based upon acoustic properties of the voice and speech. The dataset consists of 3,168 recorded voice samples, collected from male and female speakers. The voice samples are pre-processed by acoustic analysis in R using the seewave and tuneR packages, with an analyzed frequency range of 0 Hz-280 Hz (human vocal range).
Acoustic properties of each voice are measured and included within the CSV. The original analysis reports train / test accuracies for the evaluated models of 50% / 50%, 97% / 98%, 96% / 97%, 100% / 98%, 100% / 99%, and 100% / 99%, ranging from a chance-level baseline to the best-performing models.
An original analysis of the dataset can be found in the following article:
Identifying the Gender of a Voice using Machine Learning
The best model achieves 99% accuracy on the test set. According to a CART model, it appears that looking at the mean fundamental frequency might be enough to accurately classify a voice. However, some male voices use a higher frequency, even though their resonance differs from female voices, and may be incorrectly classified as female. To the human ear, there is apparently more than simple frequency that determines a voice's gender.
CART model: http://i.imgur.com/Npr2U7O.png
Mean fundamental frequency appears to be an indicator of voice gender, with a threshold of 140 Hz separating male from female classifications.
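As an illustration of this single-feature rule (not the models from the article above), a minimal Python sketch follows; the file name "voice.csv", the column names, and the unit of the frequency column are assumptions to adapt to the actual dataset layout.

# Hedged sketch of the 140 Hz threshold rule described above; file and column
# names are placeholders, and the frequency column is assumed to be in Hz.
import pandas as pd

df = pd.read_csv("voice.csv")
df["predicted_label"] = df["mean_fundamental_freq_hz"].apply(
    lambda f: "male" if f < 140.0 else "female"
)
accuracy = (df["predicted_label"] == df["label"]).mean()  # assumes a ground-truth "label" column
print(f"Threshold-rule accuracy: {accuracy:.2%}")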
The Harvard-Haskins Database of Regularly-Timed Speech
Telecommunications & Signal Processing Laboratory (TSP) Speech Database at McGill University, Home
Festvox CMU_ARCTIC Speech Database at Carnegie Mellon University
Global Voices is a multilingual dataset for evaluating cross-lingual summarization methods. It is extracted from social-network descriptions of Global Voices news articles to cheaply collect evaluation data for into-English and from-English summarization in 15 languages.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
[1] C. Pörschmann, "A database for the comparison of measured datasets of human voice directivity," in Proceedings of the Forum Acusticum, Torino, Italy, 2023.
This study presents a database that allows direct comparison and visualization of datasets from 19 different studies. The data is collected from tables, plots, and datasets from the supplemental material of the respective studies. Some studies present directivity patterns averaged over a whole sentence, while others report phoneme-dependent data. Furthermore, these datasets vary in their sampling grids, with many measured in the horizontal plane and just a few measured spherically. Most datasets included in this work present frequency-band averaged values, for example, in one-third octave bands, while a few newer studies provide the raw data in the form of transfer functions.
Furthermore, the supplementary material contains voice directivity datasets determined over a complete, phonetically balanced German sentence (measured twice for 13 subjects).
The .pdf file contains
The Database.zip
The VoiceDirectivitySentence.zip files contain voice directivity patterns averaged over one phonetically balanced sentence, in the SOFA format.
Voice conversion (VC) is a technique for transforming the speaker identity of a source speech waveform into that of a different speaker while preserving the linguistic information of the source speech.
In 2016, we launched the Voice Conversion Challenge (VCC) 2016 [1][2] at Interspeech 2016. The objective of the 2016 challenge was to better understand different VC techniques built on a freely available common dataset toward a common goal, and to share views about unsolved problems and challenges faced by current VC techniques. The VCC 2016 focused on the most basic VC task, that is, the construction of VC models that automatically transform the voice identity of a source speaker into that of a target speaker using a parallel clean training database, where source and target speakers read out the same set of utterances in a professional recording studio. Seventeen research groups participated in the 2016 challenge. The challenge was successful and established a new standard evaluation methodology and protocols for benchmarking the performance of VC systems.
In 2018, we launched the second edition of the VCC, the VCC 2018 [3]. In the second edition, we revised three aspects of the challenge. First, we reduced the amount of speech data used for the construction of participants' VC systems by half. This was based on feedback from participants in the previous challenge, and it is also essential for practical applications. Second, we introduced a more challenging task, referred to as the Spoke task, in addition to a task similar to that of the first edition, which we call the Hub task. In the Spoke task, participants need to build their VC systems using a non-parallel database in which source and target speakers read out different sets of utterances. We then evaluated both parallel and non-parallel voice conversion systems via the same large-scale crowdsourced listening test. Third, we also attempted to bridge the gap between the ASV and VC communities. Since new VC systems developed for the VCC 2018 may be strong candidates for enhancing the ASVspoof 2015 database, we also assessed the spoofing performance of the VC systems based on anti-spoofing scores.
In 2020, we launched the third edition of the VCC, the VCC 2020 [4][5]. In this third edition, we constructed and distributed a new database for two tasks: intra-lingual semi-parallel and cross-lingual VC. The dataset for intra-lingual VC consists of a smaller parallel corpus and a larger nonparallel corpus, both in the same language. The dataset for cross-lingual VC consists of a corpus of the source speakers speaking in the source language and another corpus of the target speakers speaking in the target language. As a more challenging task than the previous ones, we focused on cross-lingual VC, in which the speaker identity is transformed between two speakers uttering different languages, which requires handling completely nonparallel training over different languages.
This repository contains the training and evaluation data released to participants, the target speakers' speech data in English for reference purposes, and the transcriptions for the evaluation data. For more details about the challenge and the listening test results, please refer to [4] and the README file.
[1] Tomoki Toda, Ling-Hui Chen, Daisuke Saito, Fernando Villavicencio, Mirjam Wester, Zhizheng Wu, Junichi Yamagishi, "The Voice Conversion Challenge 2016," in Proc. of Interspeech 2016, San Francisco.
[2] Mirjam Wester, Zhizheng Wu, Junichi Yamagishi "Analysis of the Voice Conversion Challenge 2016 Evaluation Results" in Proc. of Interspeech 2016.
[3] Jaime Lorenzo-Trueba, Junichi Yamagishi, Tomoki Toda, Daisuke Saito, Fernando Villavicencio, Tomi Kinnunen, Zhenhua Ling, "The Voice Conversion Challenge 2018: Promoting Development of Parallel and Nonparallel Methods", Proc Speaker Odyssey 2018, June 2018.
[4] Yi Zhao, Wen-Chin Huang, Xiaohai Tian, Junichi Yamagishi, Rohan Kumar Das, Tomi Kinnunen, Zhenhua Ling, and Tomoki Toda. "Voice conversion challenge 2020: Intra-lingual semi-parallel and cross-lingual voice conversion" Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, 80-98, DOI: 10.21437/VCC_BC.2020-14.
Description:
Unlock the potential of voice analysis in diagnosing laryngeal disorders with this voice recordings dataset. This comprehensive dataset, gathered at Belarus' prestigious Republican Scientific and Practical Center for Otorhinolaryngology, comprises anonymized voice recordings from 60 individuals.
Key Features:
If you use this dataset in your research, please credit the authors.
Citation
Analysis of acoustic voice parameters for larynx pathology detection (link)
License
License was not specified at the source, yet access to the data is public and a citation was requested.
ESD is an Emotional Speech Database for voice conversion research. The ESD database consists of 350 parallel utterances spoken by 10 native English and 10 native Chinese speakers and covers 5 emotion categories (neutral, happy, angry, sad and surprise). More than 29 hours of speech data were recorded in a controlled acoustic environment. The database is suitable for multi-speaker and cross-lingual emotional voice conversion studies.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
Description
The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) contains 7356 files (total size: 24.8 GB). The dataset contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions, and song contains calm, happy, sad, angry, and fearful emotions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression. All conditions are available in three modality formats: Audio-only (16bit, 48kHz .wav), Audio-Video (720p H.264, AAC 48kHz, .mp4), and Video-only (no sound). Note, there are no song files for Actor_18.
The RAVDESS was developed by Dr Steven R. Livingstone, who now leads the Affective Data Science Lab, and Dr Frank A. Russo who leads the SMART Lab.
Citing the RAVDESS
The RAVDESS is released under a Creative Commons Attribution license, so please cite the RAVDESS if it is used in your work in any form. Published academic papers should use the academic paper citation for our PLoS ONE paper. Personal works, such as machine learning projects/blog posts, should provide a URL to this Zenodo page, though a reference to our PLoS ONE paper would also be appreciated.
Academic paper citation
Livingstone SR, Russo FA (2018) The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 13(5): e0196391. https://doi.org/10.1371/journal.pone.0196391.
Personal use citation
Include a link to this Zenodo page - https://zenodo.org/record/1188976
Commercial Licenses
Commercial licenses for the RAVDESS can be purchased. For more information, please visit our license fee page, or contact us at ravdess@gmail.com.
Contact Information
If you would like further information about the RAVDESS, to purchase a commercial license, or if you experience any issues downloading files, please contact us at ravdess@gmail.com.
Example Videos
Watch a sample of the RAVDESS speech and song videos.
Emotion Classification Users
If you're interested in using machine learning to classify emotional expressions with the RAVDESS, please see our new RAVDESS Facial Landmark Tracking data set [Zenodo project page].
Construction and Validation
Full details on the construction and perceptual validation of the RAVDESS are described in our PLoS ONE paper - https://doi.org/10.1371/journal.pone.0196391.
The RAVDESS contains 7356 files. Each file was rated 10 times on emotional validity, intensity, and genuineness. Ratings were provided by 247 individuals who were characteristic of untrained adult research participants from North America. A further set of 72 participants provided test-retest data. High levels of emotional validity, interrater reliability, and test-retest intrarater reliability were reported. Validation data is open-access, and can be downloaded along with our paper from PLoS ONE.
Contents
Audio-only files
Audio-only files of all actors (01-24) are available as two separate zip files (~200 MB each):
Speech file (Audio_Speech_Actors_01-24.zip, 215 MB) contains 1440 files: 60 trials per actor x 24 actors = 1440.
Song file (Audio_Song_Actors_01-24.zip, 198 MB) contains 1012 files: 44 trials per actor x 23 actors = 1012.
Audio-Visual and Video-only files
Video files are provided as separate zip downloads for each actor (01-24, ~500 MB each), and are split into separate speech and song downloads:
Speech files (Video_Speech_Actor_01.zip to Video_Speech_Actor_24.zip) collectively contain 2880 files: 60 trials per actor x 2 modalities (AV, VO) x 24 actors = 2880.
Song files (Video_Song_Actor_01.zip to Video_Song_Actor_24.zip) collectively contain 2024 files: 44 trials per actor x 2 modalities (AV, VO) x 23 actors = 2024.
File Summary
In total, the RAVDESS collection includes 7356 files (2880+2024+1440+1012 files).
File naming convention
Each of the 7356 RAVDESS files has a unique filename. The filename consists of a 7-part numerical identifier (e.g., 02-01-06-01-02-01-12.mp4). These identifiers define the stimulus characteristics:
Filename identifiers
Modality (01 = full-AV, 02 = video-only, 03 = audio-only).
Vocal channel (01 = speech, 02 = song).
Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).
Emotional intensity (01 = normal, 02 = strong). NOTE: There is no strong intensity for the 'neutral' emotion.
Statement (01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door").
Repetition (01 = 1st repetition, 02 = 2nd repetition).
Actor (01 to 24. Odd numbered actors are male, even numbered actors are female).
Filename example: 02-01-06-01-02-01-12.mp4
Video-only (02)
Speech (01)
Fearful (06)
Normal intensity (01)
Statement "dogs" (02)
1st Repetition (01)
12th Actor (12)
Female, as the actor ID number is even.
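Decoding the identifier can be automated; the Python helper below is an illustrative sketch (the function name is not part of the RAVDESS release) that follows the 7-part convention listed above.

# Illustrative parser for the 7-part RAVDESS filename convention described above.
EMOTIONS = {"01": "neutral", "02": "calm", "03": "happy", "04": "sad",
            "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised"}
MODALITIES = {"01": "full-AV", "02": "video-only", "03": "audio-only"}
STATEMENTS = {"01": "Kids are talking by the door", "02": "Dogs are sitting by the door"}

def parse_ravdess_filename(filename: str) -> dict:
    stem = filename.rsplit(".", 1)[0]
    modality, channel, emotion, intensity, statement, repetition, actor = stem.split("-")
    return {
        "modality": MODALITIES[modality],
        "vocal_channel": "speech" if channel == "01" else "song",
        "emotion": EMOTIONS[emotion],
        "intensity": "normal" if intensity == "01" else "strong",
        "statement": STATEMENTS[statement],
        "repetition": int(repetition),
        "actor": int(actor),
        "actor_sex": "male" if int(actor) % 2 == 1 else "female",
    }

print(parse_ravdess_filename("02-01-06-01-02-01-12.mp4"))
# video-only, speech, fearful, normal intensity, "Dogs" statement, 1st repetition, actor 12 (female)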
License information
The RAVDESS is released under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, CC BY-NC-SA 4.0
Commercial licenses for the RAVDESS can also be purchased. For more information, please visit our license fee page, or contact us at ravdess@gmail.com.
Related Data sets
RAVDESS Facial Landmark Tracking data set [Zenodo project page].
Common Voice is an audio dataset consisting of unique MP3 files and corresponding text files. There are 9,283 recorded hours in the dataset. The dataset also includes demographic metadata like age, sex, and accent. The dataset consists of 7,335 validated hours in 60 languages.
The database stores information that supports phone access to vendor invoice/payment status reports through an Interactive Voice Response system.
https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for Common Voice Corpus 12.0
Dataset Summary
The Common Voice dataset consists of a unique MP3 and corresponding text file. Many of the 26119 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 17127 validated hours in 104 languages, but more voices and languages are always added. Take a look at the Languages page to… See the full description on the dataset page: https://huggingface.co/datasets/mozilla-foundation/common_voice_12_0.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains audio files and transcripts in Italian related to manufacturing. We collected the scripts during the Horizon Europe RIA COALA (GA 957296, project reference website) from industrial use cases and hired a service provider to generate the related audio files (BIBA - Bremer Institut für Produktion und Logistik GmbH ordered the service). The service provider checked the audio files for quality.
The service provider recruited crowd workers and gathered their audio recordings, informed consent (privacy), and agreement that their recordings become public domain (Creative Commons 0; https://creativecommons.org/share-your-work/public-domain/cc0/). The service provider declared to follow a Crowd Code of Ethics and a Fair Pay policy.
The metadata file contains the following information:
file_name: name of the audio file
script: script the speaker had to speak
scriptId: the numeric identifier of the script
participantId: the numeric identifier of the participant (speaker)
gender: the gender as indicated by the participant (MALE or FEMALE)
age: the age in years as indicated by the participant
age_range: the age range in years (18-30, 31-45, 46+)
country: the birth country indicated by the participant
current_country: the country of residence indicated by the participant
primary_language: the language indicated as primary by the participant
ever_worked_factory: answer to the question: "Have you ever worked in a factory, manufacturing setting?" (Yes/No)
years_worked_factory: answer to the question: "If yes, for how many years?" (1-10, 10+)
background_noise_type: background noise in the audio as indicated by the participant (mild, humming/technical, no noise)
gdpr_and_ipr_consent: answer to the privacy notice and the IPR transfer to CC0 (Yes)
date_signed: date when the participant signed the consent form (US format, MM.DD.YYYY)
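The metadata fields listed above can be loaded with standard tooling; the Python sketch below assumes the metadata is a CSV file named "metadata.csv" (the actual file name may differ in the distribution), while the column names follow the field list.

# Minimal sketch of filtering the metadata described above; "metadata.csv" is an
# assumed file name, while the column names follow the documented field list.
import pandas as pd

meta = pd.read_csv("metadata.csv")

# Example: recordings by speakers with factory experience and no background noise.
subset = meta[(meta["ever_worked_factory"] == "Yes")
              & (meta["background_noise_type"] == "no noise")]
print(subset[["file_name", "script", "primary_language"]].head())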
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
4 hours and 80 minutes of speech as spoken by 2 female speakers and 2 male speakers, covering both mimics and parallel voice conversion data.
Learn about what we do, why we do it, and how our efforts relate to current issues and events. In addition to our intriguing collections and groundbreaking projects, we’ll spotlight public libraries, staff members, and specialized professions. Visit uncommonwealth.virginiamemory.com to learn more!