License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Our project, “Indonesian Media Audio Database,” is designed to establish a rich and diverse dataset tailored for training advanced machine learning models in language processing, speech recognition, and cultural analysis.
License: ELRA End User, https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
The Aurora project was originally set up to establish a worldwide standard for the feature extraction software which forms the core of the front-end of a DSR (Distributed Speech Recognition) system. The AURORA-5 database has been developed mainly to investigate the influence of hands-free speech input in noisy room environments on the performance of automatic speech recognition. Furthermore, two test conditions are included to study the influence of transmitting the speech over a mobile communication system.

The earlier three Aurora experiments focused on additive noise and the influence of some telephone frequency characteristics. Aurora-5 tries to cover all effects as they occur in realistic application scenarios. The focus was put on two scenarios. The first one is hands-free speech input in the noisy car environment, with the intention of controlling either devices in the car itself or retrieving information from a remote speech server over the telephone. The second one covers hands-free speech input in a type of office or living room to control e.g. a telephone device or some audio/video equipment.

The AURORA-5 database contains the following data:
• Artificially distorted versions of the recordings from adult speakers in the TI-Digits speech database, downsampled to a sampling frequency of 8000 Hz. The distortions consist of: additive background noise, the simulation of a hands-free speech input in rooms, and the simulation of transmitting speech over cellular telephone networks.
• A subset of recordings from the meeting recorder project at the International Computer Science Institute. The recordings contain sequences of digits uttered by different speakers in hands-free mode in a meeting room.
• A set of scripts for running recognition experiments on the above-mentioned speech data. The experiments are based on the freely available software package HTK; HTK is not part of this resource.

Further information is available at http://aurora.hsnr.de
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Noise is an unwanted component of audio data, yet it plays an important role in machine learning on audio.
The dataset can be used for noise filtering, noise generation, and noise recognition in audio classification, audio recognition, audio generation, and other audio-related machine learning tasks. I, Min Si Thu, have used this dataset in open-source projects.
The dataset contains ten types of noise that I collected.
Location - Myanmar, Mandalay, Amarapura Township
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
ARTUR is a speech database designed for the needs of automatic speech recognition for the Slovenian language. The database includes 1,035 hours of speech, of which 840 hours are transcribed and the remaining 195 hours are without transcription. The data is divided into 4 parts:
(1) approx. 520 hours of read speech, which includes the reading of pre-defined sentences selected from the Gigafida 2.0 corpus (http://hdl.handle.net/11356/1320); each sentence is contained in one file; speakers are demographically balanced; spelling is included in special files; all with manual transcriptions;
(2) approx. 204 hours of public speech, which includes media recordings, online recordings of conferences, workshops, education videos, etc.; 56 hours are manually transcribed;
(3) approx. 110 hours of private speech, which includes monologues and dialogues between two persons, recorded for the purposes of the speech database; the speakers are demographically balanced; two subsets for domain-specific ASR (i.e., smart-home and face-description) are included; 63 hours are manually transcribed;
(4) approx. 201 hours of parliamentary speech, which includes recordings from the Slovene National Assembly, all with manual transcriptions.
Audio files are WAV, 44.1 kHz, PCM, 16-bit, mono.
This entry includes the recordings only; transcriptions are available at http://hdl.handle.net/11356/1718.
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) contains 7356 files (total size: 24.8 GB). The dataset contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions, and song contains calm, happy, sad, angry, and fearful emotions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression. All conditions are available in three modality formats: Audio-only (16bit, 48kHz .wav), Audio-Video (720p H.264, AAC 48kHz, .mp4), and Video-only (no sound). Note, there are no song files for Actor_18.
The RAVDESS was developed by Dr Steven R. Livingstone, who now leads the Affective Data Science Lab, and Dr Frank A. Russo who leads the SMART Lab.
Citing the RAVDESS
The RAVDESS is released under a Creative Commons Attribution license, so please cite the RAVDESS if it is used in your work in any form. Published academic papers should use the academic paper citation for our PLoS ONE paper. Personal works, such as machine learning projects and blog posts, should provide a URL to this Zenodo page, though a reference to our PLoS ONE paper would also be appreciated.
Academic paper citation
Livingstone SR, Russo FA (2018) The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 13(5): e0196391. https://doi.org/10.1371/journal.pone.0196391.
Personal use citation
Include a link to this Zenodo page - https://zenodo.org/record/1188976
Commercial Licenses
Commercial licenses for the RAVDESS can be purchased. For more information, please visit our license fee page, or contact us at ravdess@gmail.com.
Contact Information
If you would like further information about the RAVDESS, to purchase a commercial license, or if you experience any issues downloading files, please contact us at ravdess@gmail.com.
Example Videos
Watch a sample of the RAVDESS speech and song videos.
Emotion Classification Users
If you're interested in using machine learning to classify emotional expressions with the RAVDESS, please see our new RAVDESS Facial Landmark Tracking data set [Zenodo project page].
Construction and Validation
Full details on the construction and perceptual validation of the RAVDESS are described in our PLoS ONE paper - https://doi.org/10.1371/journal.pone.0196391.
The RAVDESS contains 7356 files. Each file was rated 10 times on emotional validity, intensity, and genuineness. Ratings were provided by 247 individuals who were characteristic of untrained adult research participants from North America. A further set of 72 participants provided test-retest data. High levels of emotional validity, interrater reliability, and test-retest intrarater reliability were reported. Validation data is open-access, and can be downloaded along with our paper from PLoS ONE.
Contents
Audio-only files
Audio-only files of all actors (01-24) are available as two separate zip files (~200 MB each):
Audio-Visual and Video-only files
Video files are provided as separate zip downloads for each actor (01-24, ~500 MB each), and are split into separate speech and song downloads:
File Summary
In total, the RAVDESS collection includes 7356 files (2880+2024+1440+1012 files).
File naming convention
Each of the 7356 RAVDESS files has a unique filename. The filename consists of a 7-part numerical identifier (e.g., 02-01-06-01-02-01-12.mp4). These identifiers define the stimulus characteristics:
Filename identifiers
Filename example: 02-01-06-01-02-01-12.mp4
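For convenience, the following Python sketch decodes a RAVDESS filename into its stimulus characteristics, following the 7-part naming convention published with the dataset (modality, vocal channel, emotion, intensity, statement, repetition, actor). Treat it as an illustrative helper rather than an official tool.

```python
# Illustrative decoder for RAVDESS filenames such as "02-01-06-01-02-01-12.mp4".
# Field order and codes follow the published RAVDESS naming convention.
MODALITY = {"01": "full-AV", "02": "video-only", "03": "audio-only"}
VOCAL_CHANNEL = {"01": "speech", "02": "song"}
EMOTION = {"01": "neutral", "02": "calm", "03": "happy", "04": "sad",
           "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised"}
INTENSITY = {"01": "normal", "02": "strong"}

def parse_ravdess_filename(name: str) -> dict:
    stem = name.rsplit(".", 1)[0]
    modality, channel, emotion, intensity, statement, repetition, actor = stem.split("-")
    return {
        "modality": MODALITY[modality],
        "vocal_channel": VOCAL_CHANNEL[channel],
        "emotion": EMOTION[emotion],
        "intensity": INTENSITY[intensity],
        "statement": statement,        # 01 or 02 (the two lexically matched statements)
        "repetition": repetition,      # 01 = first, 02 = second repetition
        "actor": int(actor),           # 01-24; odd = male, even = female
    }

print(parse_ravdess_filename("02-01-06-01-02-01-12.mp4"))
# -> video-only, speech, fearful, normal intensity, statement 02, repetition 01, actor 12
```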
License information
The RAVDESS is released under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, CC BY-NC-SA 4.0
Commercial licenses for the RAVDESS can also be purchased. For more information, please visit our license fee page, or contact us at ravdess@gmail.com.
Related Data sets
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset represents a comprehensive resource for advancing Kurdish TTS systems. Text-to-speech conversion is an important topic in the design and construction of multimedia systems, human-machine communication, and information and communication technology; together with speech recognition, its purpose is to enable communication between humans and machines in its most basic and natural form, that is, spoken language.
For our text corpus, we collected 6,565 sentences from texts in various categories, including news, sport, health, question and exclamation sentences, science, general information, politics, education and literature, story, miscellaneous, and tourism, to create the training sentences. We thoroughly reviewed and normalized the texts, which were then recorded by a male speaker. Audio was recorded in a voice recording studio at 44,100 Hz, and all audio files are downsampled to 22,050 Hz in our modeling process. The audio ranges from 3 to 36 seconds in length. The resulting speech corpus contains about 6,565 text-audio pairs, totalling around 19 hours. Audio files are saved in WAV format, and the texts are saved in text files in the corresponding sub-folders. Furthermore, for model training, all of the audio files are gathered in a single folder. Each line in the transcript files is formatted as WAVS | audio file's name.wav | transcript, where the audio file's name includes the extension and the transcript is the text of the speech.
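As an illustration, the following Python sketch parses such a pipe-delimited transcript file into a name-to-text mapping. The file name transcript.txt is an assumption for illustration, not part of the distribution.

```python
# Minimal sketch: parse pipe-delimited transcript lines of the form
#   WAVS | <audio file name>.wav | <transcript>
# "transcript.txt" is a hypothetical file name used for illustration.
from pathlib import Path

def load_transcripts(path: str) -> dict[str, str]:
    """Return a mapping from audio file name to its transcript."""
    pairs = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if line.count("|") < 2:
            continue  # skip blank or malformed lines
        # Split into the three fields and strip surrounding whitespace.
        prefix, wav_name, text = (field.strip() for field in line.split("|", 2))
        pairs[wav_name] = text
    return pairs

if __name__ == "__main__":
    transcripts = load_transcripts("transcript.txt")  # hypothetical path
    print(len(transcripts), "utterances loaded")
```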
The audio recording and editing process lasted 90 days and involved capturing over 6,565 WAV files and over 19 hours of recorded speech. The dataset helps researchers advance Kurdish TTS more quickly, reducing the time required for this process.
Acknowledgments: We would like to express our sincere gratitude to Ayoub Mohammadzadeh for his invaluable support in recording the corpus.
License: ELRA End User, https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
License: ELRA VAR, https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
This is an audio-visual speech database for training and testing of Czech audio-visual continuous speech recognition systems, collected under impaired illumination conditions. The corpus consists of about 20 hours of audio-visual recordings of 50 speakers in laboratory conditions. Recorded subjects were instructed to remain static. The illumination varied, and chunks of each speaker were recorded under several different conditions, such as full illumination or illumination from one side (left or right) only. These conditions make the database usable for training lip-/head-tracking systems under various illumination conditions, independently of the language. Speakers were asked to read 200 sentences each (50 common to all speakers and 150 specific to each speaker). The average total length of recording per speaker was 23 minutes.

Acoustic data are stored in wave files using PCM format, sampling frequency 44 kHz, resolution 16 bits. Each speaker's acoustic data set represents about 180 MB of disk space (about 8.8 GB in total). Visual data are stored in video files (.avi format) using the digital video (DV) codec. Visual data per speaker take about 3.7 GB of disk space (about 185 GB as a whole) and are stored on an IDE hard disk (NTFS format).
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Database description:
The written and spoken digits database is not a new database but one constructed from existing databases, in order to provide a ready-to-use dataset for multimodal fusion.
The written digits database is the original MNIST handwritten digits database [1] with no additional processing. It consists of 70000 images (60000 for training and 10000 for test) of 28 x 28 = 784 dimensions.
The spoken digits database was extracted from Google Speech Commands [2], an audio dataset of spoken words that was proposed to train and evaluate keyword spotting systems. It consists of 105829 utterances of 35 words, amongst which 38908 utterances of the ten digits (34801 for training and 4107 for test). A pre-processing was done via the extraction of the Mel Frequency Cepstral Coefficients (MFCC) with a framing window size of 50 ms and frame shift size of 25 ms. Since the speech samples are approximately 1 s long, we end up with 39 time slots. For each one, we extract 12 MFCC coefficients with an additional energy coefficient. Thus, we have a final vector of 39 x 13 = 507 dimensions. Standardization and normalization were applied on the MFCC features.
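The sketch below shows how such features could be computed with librosa. The exact MFCC implementation and parameters used by the dataset authors are not specified here, so the window/hop settings are assumptions that merely reproduce the stated 50 ms / 25 ms framing and 13 coefficients.

```python
# Sketch: MFCC extraction roughly matching the described setup
# (50 ms window, 25 ms frame shift, 12 MFCCs + energy ≈ 13 coefficients per frame).
# Parameter choices are assumptions; the original preprocessing may differ.
import librosa
import numpy as np

def mfcc_features(wav_path: str, sr: int = 16000) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=sr)
    y = librosa.util.fix_length(y, size=sr)          # pad/trim to 1 second
    win = int(0.050 * sr)                            # 50 ms window
    hop = int(0.025 * sr)                            # 25 ms frame shift
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=win, win_length=win, hop_length=hop)
    # Standardize per coefficient, then flatten to a single feature vector
    # (roughly 39-41 frames x 13 coefficients, depending on padding/centering).
    mfcc = (mfcc - mfcc.mean(axis=1, keepdims=True)) / (mfcc.std(axis=1, keepdims=True) + 1e-8)
    return mfcc.T.flatten()
```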
To construct the multimodal digits dataset, we associated written and spoken digits of the same class, respecting the initial partitioning in [1] and [2] for the training and test subsets. Since we have fewer samples for the spoken digits, we duplicated some random samples to match the number of written digits and obtain a multimodal digits database of 70000 samples (60000 for training and 10000 for test).
The dataset is provided in six files as described below. Therefore, if a shuffle is performed on the training or test subsets, it must be performed in unison with the same order for the written digits, spoken digits and labels.
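For example, such a unison shuffle can be done with a single shared permutation. This is a sketch assuming the three modalities have been loaded as NumPy arrays of equal length.

```python
# Sketch: shuffle written digits, spoken digits and labels in unison,
# assuming they are NumPy arrays with the same number of rows.
import numpy as np

def shuffle_in_unison(written, spoken, labels, seed=0):
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(labels))   # one shared permutation for all arrays
    return written[order], spoken[order], labels[order]
```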
Files:
References:
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
| Column | Description |
|---|---|
| id | file id (string) |
| file_path | file path to .wav file (string) |
| speech | transcription of the audio file (string) |
| speaker | speaker name, use this as the target variable if you are doing audio classification (string) |
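As a sketch, the metadata described above could be loaded and used for speaker classification like this. The file name metadata.csv is an assumption; substitute whatever metadata file ships with the dataset.

```python
# Sketch: load the column layout described above and prepare a
# speaker-classification split. "metadata.csv" is a hypothetical file name.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("metadata.csv")          # columns: id, file_path, speech, speaker
X = df[["file_path", "speech"]]           # inputs: audio path and transcription
y = df["speaker"]                         # target variable for audio classification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
print(y.value_counts())
```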
The CMU_ARCTIC databases were constructed at the Language Technologies Institute at Carnegie Mellon University as phonetically balanced, US-English single-speaker databases designed for unit selection speech synthesis research. A detailed report on the structure and content of the database and the recording environment etc is available as a Carnegie Mellon University, Language Technologies Institute Tech Report CMU-LTI-03-177 and is also available here.
The databases consist of around 1150 utterances carefully selected from out-of-copyright texts from Project Gutenberg. The databases include US English male (bdl) and female (slt) speakers (both experienced voice talent) as well as other accented speakers.
The 1132 sentence prompt list is available from cmuarctic.data
The distributions include 16 kHz waveforms and simultaneous EGG signals. Full phonetic labeling was performed with CMU Sphinx using the FestVox-based labeling scripts. Complete runnable Festival voices are included with the database distributions as examples, though better voices can be made by improving the labeling, etc.
This work was partially supported by the U.S. National Science Foundation under Grant No. 0219687, "ITR/CIS Evaluation and Personalization of Synthetic Voices". Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Our Laboratory of Artificial Neural Network Applications (LANNA) at the Czech Technical University in Prague (head of the laboratory: professor Jana Tučková) collaborates on a project with the Department of Paediatric Neurology, 2nd Faculty of Medicine of Charles University in Prague, and with the Motol University Hospital (head of clinic: professor Vladimír Komárek), which focuses on the study of children with SLI.

The speech database contains two subgroups of recordings of children's speech from different types of speakers. The first subgroup (healthy) consists of recordings of children without speech disorders; the second subgroup (patients) consists of recordings of children with SLI. These children have different degrees of severity (1 – mild, 2 – moderate, and 3 – severe); the speech therapists and specialists from Motol Hospital decided upon this classification. The children's speech was recorded in the period 2003-2013. These databases were commonly created in a schoolroom or a speech therapist's consulting room, in the presence of surrounding background noise. This situation simulates the natural environment in which the children live, and is important for capturing the normal behavior of children.

The database of healthy children's speech was created as a reference database for the computer processing of children's speech. It was recorded on a SONY digital Dictaphone (sampling frequency fs = 16 kHz, 16-bit resolution, stereo mode, standardized wav format) and on an MD SONY MZ-N710 (sampling frequency fs = 44.1 kHz, 16-bit resolution, stereo mode, standardized wav format). The corpus was recorded in the natural environment of a schoolroom and in a clinic. This subgroup contains a total of 44 native Czech participants (15 boys, 29 girls) aged 4 to 12 years, and was recorded during the period 2003–2005.

The database of children with SLI was recorded in a private speech therapist's office. The children's speech was captured by means of a SHURE lapel microphone using the solution by the company AVID (MBox – USB AD/DA converter and ProTools LE software) on an Apple laptop (iBook G4). The sound recordings are saved in the standardized wav format. The sampling frequency is set to 44.1 kHz with 16-bit resolution in mono mode. This subgroup contains a total of 54 native Czech participants (35 boys, 19 girls) aged 6 to 12 years, and was recorded during the period 2009–2013. This package contains wav data sets for developing and testing methods for detecting children with SLI.

Software pack:
FORANA - original software developed for formant analysis, based on the MATLAB programming environment. Its development was mainly driven by the need to perform formant analysis correctly and to fully automate the extraction of formants from the recorded speech signals. Development of this application is still ongoing. The software was developed in the LANNA at CTU FEE in Prague.
LABELING - a program used for segmentation of the speech signal. It is a part of the SOMLab program system. The software was developed in the LANNA at CTU FEE in Prague.
PRAAT - an acoustic analysis software package created by Paul Boersma and David Weenink of the Institute of Phonetic Sciences of the University of Amsterdam. Home page: http://www.praat.org or http://www.fon.hum.uva.nl/praat/.
openSMILE - a feature extraction tool that enables you to extract large audio feature spaces in real time. It combines features from Music Information Retrieval and Speech Processing. SMILE is an acronym for Speech & Music Interpretation by Large-space Extraction. It is written in C++ and is available as both a standalone command-line executable and a dynamic library. The main features of openSMILE are its capability of on-line incremental processing and its modularity. Feature extractor components can be freely interconnected to create new and custom features, all via a simple configuration file. New components can be added to openSMILE via an easy binary plugin interface and a comprehensive API. Citing: Florian Eyben, Martin Wöllmer, Björn Schuller: "openSMILE - The Munich Versatile and Fast Open-Source Audio Feature Extractor", In Proc. ACM Multimedia (MM), ACM, Florence, Italy, ACM, ISBN 978-1-60558-933-6, pp. 1459-1462, October 2010. doi:10.1145/1873951.1874246
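As a rough example of how the openSMILE command-line tool is typically invoked, a small Python wrapper might look like the sketch below. The configuration file path is an assumption; use any of the configuration files distributed with openSMILE.

```python
# Sketch: invoke the SMILExtract command-line tool to extract features
# from one recording. The configuration file path is an assumption;
# substitute any config shipped with your openSMILE installation.
import subprocess

def extract_features(wav_path: str, out_csv: str,
                     config: str = "config/emobase/emobase.conf") -> None:
    subprocess.run(
        ["SMILExtract",
         "-C", config,        # feature configuration file
         "-I", wav_path,      # input WAV file
         "-O", out_csv],      # output file (format depends on the config)
        check=True,
    )

extract_features("recording.wav", "features.csv")  # hypothetical file names
```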
This is a small sample from the RAVDESS Emotional speech audio dataset, used only for basic audio data analysis.
"The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)" by Livingstone & Russo is licensed under CC BY-NC-SA 4.0.
License: ELRA VAR, http://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
License: ELRA End User, http://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
This is an audio-visual speech database for training and testing of Czech audio-visual continuous speech recognition systems. The corpus consists of about 25 hours of audio-visual recordings of 65 speakers in laboratory conditions. Data collection was done with static illumination, and recorded subjects were instructed to remain static. The average speaker age was 22 years. Speakers were asked to read 200 sentences each (50 common to all speakers and 150 specific to each speaker). The average total length of recording per speaker is 23 minutes.

All audio-visual data are transcribed (.trs files) and divided into sentences (one sentence per file). For each video file there is a description file containing information about the position and size of the region of interest.

Acoustic data are stored in wave files using PCM format, sampling frequency 44 kHz, resolution 16 bits. Each speaker's acoustic data set represents about 140 MB of disk space (about 9 GB as a whole). Visual data are stored in video files (.avi format) using the digital video (DV) codec. Visual data per speaker take about 3 GB of disk space (about 195 GB as a whole) and are stored on an IDE hard disk (NTFS format).
== Quick facts ==
- The most up-to-date and comprehensive podcast database available
- All languages & all countries
- Includes over 3,600,000 podcasts
- Features 35+ data fields, such as basic metadata, global rank, RSS feed (with audio URLs), Spotify links, and more
- Delivered in SQLite format
Learn how we build a high quality podcast database: https://www.listennotes.help/article/105-high-quality-podcast-database-from-listen-notes
== Use Cases ==
- AI training, including speech recognition, generative AI, voice cloning / synthesis, and news analysis
- Alternative data for investment research, such as sentiment analysis of executive interviews, market research and tracking investment themes
- PR and marketing, including social monitoring, content research, outreach, and guest booking
- ...
== Data Attributes ==
See the full list of data attributes on this page: https://www.listennotes.com/podcast-datasets/fields/?filter=podcast_only
How to access podcast audio files: Our dataset includes RSS feed URLs for all podcasts. You can retrieve audio for over 170 million episodes directly from these feeds. With access to the raw audio, you’ll have high-quality podcast speech data ideal for AI training and related applications.
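For instance, episode audio URLs can be pulled from a podcast's RSS feed with a standard feed parser. This is a sketch; the feed URL below is a placeholder, and in practice you would use the RSS feed URLs provided in the dataset.

```python
# Sketch: retrieve episode audio URLs from a podcast RSS feed.
# The feed URL is a placeholder; use the RSS feed URLs shipped in the dataset.
import feedparser

def episode_audio_urls(rss_url: str) -> list[str]:
    feed = feedparser.parse(rss_url)
    urls = []
    for entry in feed.entries:
        for enclosure in entry.get("enclosures", []):
            urls.append(enclosure.get("href"))   # direct link to the audio file
    return urls

print(episode_audio_urls("https://example.com/podcast/feed.xml")[:5])
```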
== Custom Offers ==
We can provide custom datasets based on your needs, such as language-specific data, daily/weekly/monthly update frequency, or one-time purchases.
We also provide a RESTful API at PodcastAPI.com
Contact us: hello@listennotes.com
== Need Help? ==
If you have any questions about our products, feel free to reach out to hello@listennotes.com
== About Listen Notes, Inc. ==
Since 2017, Listen Notes, Inc. has provided the leading podcast search engine and podcast database.
License: CDLA Permissive 1.0, https://cdla.io/permissive-1-0
We present a speech data corpus that simulates a "dinner party" scenario taking place in an everyday home environment. The corpus was created by recording multiple groups of four Amazon employee volunteers having a natural conversation in English around a dining table. The participants were recorded by a single-channel close-talk microphone and by five far-field 7-microphone array devices positioned at different locations in the recording room. The dataset contains the audio recordings and human-labeled transcripts of a total of 10 sessions with a duration between 15 and 45 minutes. The corpus was created to advance the field of noise-robust and distant speech processing and is intended to serve as a public research and benchmarking dataset.
License: ELRA End User, https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
License: ELRA VAR, https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
The corpus contains 26 pairs of Afghanistan Southern Pashto spontaneous conversational speech recordings from 52 speakers (27 males and 25 females). For this collection, the 2 speakers of each pair performed the recording in separate quiet rooms. The database covers 21 topics. The audio duration is 160.3 hours and the speech duration is about 50.8 hours, including reasonable leading and trailing silence. The total size of this database is 8.6 GB.
License: ELRA VAR, https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
License: ELRA End User, https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
The STC Russian speech database was recorded in 1996-1998. The main purpose of the database is to investigate individual speaker variability and to validate speaker recognition algorithms. The database was recorded through a 16-bit Vibra-16 Creative Labs sound card with an 11,025 Hz sampling rate.

The database contains Russian read speech of 89 different speakers (54 male, 35 female), including 70 speakers with 15 sessions or more, 10 speakers with 10 sessions or more and 9 speakers with less than 10 sessions. The speakers were recorded in Saint-Petersburg and are aged 18-62. All are native speakers. The corpus consists of 5 sentences. Each speaker reads each sentence carefully but fluently 15 times, on different dates over a period of 1-3 months. The corpus contains a total of 6,889 utterances on 2 volumes, with a total size of 700 MB of uncompressed data. The signal of each utterance is stored as a separate file (approx. 126 KB). The total size of data for one speaker is approximately 9,500 KB. Average utterance duration is about 5 seconds.

A file gives information about the speakers (speaker's age and gender). The orthography and phonetic transcription of the corpus are given in separate files which contain the prompted sentences and their transcription in IPA. The signal files are raw files without any header, 16 bits per sample, linear, 11,025 Hz sample frequency.

The recording conditions were as follows:
Microphone: dynamic omnidirectional high-quality microphone, distance to mouth 5-10 cm
Environment: office room
Sampling rate: 11,025 Hz
Resolution: 16 bit
Sound board: Creative Labs Vibra-16
Means of delivery: CD-ROM
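Because the signal files are headerless raw PCM, they can be read by interpreting the bytes directly, as in the sketch below. The file name is hypothetical, and little-endian byte order is an assumption.

```python
# Sketch: read a headerless STC utterance file (16-bit linear PCM, 11,025 Hz).
# The file name is hypothetical; little-endian byte order is an assumption.
import numpy as np

SAMPLE_RATE = 11_025

def read_raw_utterance(path: str) -> np.ndarray:
    samples = np.fromfile(path, dtype="<i2")        # 16-bit signed integers
    return samples.astype(np.float32) / 32768.0     # scale to [-1, 1)

signal = read_raw_utterance("utterance_001.raw")    # hypothetical file name
print(f"{len(signal) / SAMPLE_RATE:.2f} s of audio")
```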
This database was originally created as a resource for developing advanced models in automatic speech recognition that are better suited to the needs of people with dysarthria.
This dataset was collected to enhance research into speech recognition systems for dysarthric speech.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Emotional Voice Messages (EMOVOME) database is a speech dataset collected for emotion recognition in real-world conditions. It contains 999 spontaneous voice messages from 100 Spanish speakers, collected from real conversations on a messaging app. EMOVOME includes both expert and non-expert emotional annotations, covering valence and arousal dimensions, along with emotion categories for the expert annotations. Detailed participant information is provided, including sociodemographic data and personality trait assessments using the NEO-FFI questionnaire. Moreover, EMOVOME provides audio recordings of participants reading a given text, as well as transcriptions of all 999 voice messages. Additionally, baseline models for valence and arousal recognition are provided, utilizing both speech and audio transcriptions.
For details on the EMOVOME database, please refer to the article:
"EMOVOME Database: Advancing Emotion Recognition in Speech Beyond Staged Scenarios". Lucía Gómez-Zaragozá, Rocío del Amor, María José Castro-Bleda, Valery Naranjo, Mariano Alcañiz Raya, Javier Marín-Morales. (pre-print available in https://doi.org/10.48550/arXiv.2403.02167)
The Zenodo repository contains four files:
The repository also includes three folders:
Audio files in “Lecture” and “Emotions” are only provided to the users that complete the agreement file in section Usage Notes. Audio files are in Ogg Vorbis format at 16-bit and 44.1 kHz or 48 kHz. The total size of the “Audios” folder is about 213 MB.
All the data included in the EMOVOME database is publicly available under the Creative Commons Attribution 4.0 International license. The only exception is the original raw audio files, for which an additional step is required as a security measure to safeguard the speakers' privacy. To request access, interested authors should first complete and sign the agreement file EMOVOME_agreement.pdf and send it to the corresponding author (jamarmo@htech.upv.es). The data included in the EMOVOME database is expected to be used for research purposes only. Therefore, the agreement file states that the authors are not allowed to share the data with profit-making companies or organisations. They are also not expected to distribute the data to other research institutions; instead, they are suggested to kindly refer interested colleagues to the corresponding author of this article. By agreeing to the terms of the agreement, the authors also commit to refraining from publishing the audio content on the media (such as television and radio), in scientific journals (or any other publications), as well as on other platforms on the internet. The agreement must bear the signature of the legally authorised representative of the research institution (e.g., head of laboratory/department). Once the signed agreement is received and validated, the corresponding author will deliver the "Audios" folder containing the audio files through a download procedure. A direct connection between the EMOVOME authors and the applicants guarantees that updates regarding additional materials included in the database can be received by all EMOVOME users.
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The speech activity detection task discriminates the segments of a signal where human speech occurs from other types of sound (such as silence and noise). It is critical because it is the starting point for many speech/audio applications, including speech coding, speech recognition and speech enhancement.
Since there is no specific public dataset for this kind of task, I collected 719 audio files from three different databases:
- TIMIT: a corpus of read speech, designed to provide speech data for acoustic and phonetic studies and for the evaluation of automatic speech recognition systems.
- PTDB-TUG: a speech database for pitch tracking that provides microphone signals of 20 English speakers.
- Noizeus: contains speech data of 30 sentences. Noise signal (from the AURORA database) is artificially added to the speech signal; in particular, the database contains audio corrupted with babble (crowd of people), street, train, train station, car and restaurant noise at an SNR of 5 dB, as well as the original clean recordings.
Praat is a useful speech analysis tool which provides a wide range of functionalities, one of which creates an annotation file that records, in a text format, where the silent and sounding intervals are in the signal. For the data from TIMIT and PTDB-TUG, the silences correspond to NON-SPEECH and the sounding intervals correspond to SPEECH, because these databases do not have any kind of background noise. For the files from Noizeus I worked slightly differently: since the database also contains the original audio (without noise), I associated with each noisy file the annotations of the corresponding noiseless one.
All audio files are .wav, and all the annotation files are .TextGrid; the structure of these files is described in the folder description. The annotations can be read with the suitable Python library: https://pypi.org/project/praat-textgrids/
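For example, the annotation intervals can be read with that library roughly as follows. The file name is hypothetical, and the first tier is selected for illustration; adjust the tier selection to match the actual TextGrid files.

```python
# Sketch: read SPEECH / NON-SPEECH intervals from a .TextGrid annotation
# using the praat-textgrids package (pip install praat-textgrids).
# "example.TextGrid" is a hypothetical file name; the first tier is used here.
import textgrids

grid = textgrids.TextGrid("example.TextGrid")
tier_name = list(grid.keys())[0]                 # first annotation tier
for interval in grid[tier_name]:
    label = interval.text                        # e.g. SPEECH / NON-SPEECH
    print(f"{interval.xmin:.2f}-{interval.xmax:.2f} s  {label}")
```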