License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The SNABI speech database can be used to train continuous speech recognition for the Slovene language. The database comprises 1,530 sentences, 150 words and the alphabet. 132 speakers were recorded, each reading 200 sentences or more, resulting in more than 15,000 speech recordings contained in the database. The recordings were made in a studio (SNABI SI_SSQ) and over a telephone line (SNABI SI_SFN).
License: CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
| Column | Description |
|---|---|
| id | file id (string) |
| file_path | file path to .wav file (string) |
| speech | transcription of the audio file (string) |
| speaker | speaker name; use this as the target variable for audio classification (string) |
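If the columns above are available as a metadata table, loading a recording together with its speaker label might look like the following sketch ("metadata.csv" is a placeholder name, not part of the dataset description):

```python
# Minimal sketch, assuming the columns above are stored in a metadata CSV
# ("metadata.csv" is a placeholder name, not part of the dataset description).
import pandas as pd
import soundfile as sf

meta = pd.read_csv("metadata.csv")              # columns: id, file_path, speech, speaker
print(meta[["id", "speaker", "speech"]].head())

row = meta.iloc[0]
audio, sample_rate = sf.read(row["file_path"])  # load one recording
label = row["speaker"]                          # target variable for speaker classification
print(audio.shape, sample_rate, label)
```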
The CMU_ARCTIC databases were constructed at the Language Technologies Institute at Carnegie Mellon University as phonetically balanced, US-English single-speaker databases designed for unit selection speech synthesis research. A detailed report on the structure and content of the databases and the recording environment is available as Carnegie Mellon University Language Technologies Institute Tech Report CMU-LTI-03-177, which is also available online.
The databases consist of around 1150 utterances carefully selected from out-of-copyright texts from Project Gutenberg. The databases include US English male (bdl) and female (slt) speakers (both experienced voice talent) as well as other accented speakers.
The 1132 sentence prompt list is available from cmuarctic.data
The distributions include 16 kHz waveforms and simultaneous EGG signals. Full phonetic labeling was performed with CMU Sphinx using the FestVox-based labeling scripts. Complete runnable Festival voices are included with the database distributions as examples, though better voices can be made by improving the labeling, etc.
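The prompt list mentioned above uses Festival's Scheme-style data format, with one `( utterance_id "prompt text" )` entry per line; a minimal parsing sketch under that assumption:

```python
# Minimal sketch for reading cmuarctic.data, assuming Festival's Scheme-style
# prompt format with one entry per line, e.g.:  ( arctic_a0001 "Prompt text ..." )
import re

prompts = {}
with open("cmuarctic.data", encoding="utf-8") as f:
    for line in f:
        match = re.match(r'\(\s*(\S+)\s+"(.*)"\s*\)', line.strip())
        if match:
            utt_id, text = match.groups()
            prompts[utt_id] = text

print(len(prompts), "prompts")  # expected: 1132
```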
This work was partially supported by the U.S. National Science Foundation under Grant No. 0219687, "ITR/CIS Evaluation and Personalization of Synthetic Voices". Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
AccentDB is a multi-pairwise parallel corpus of structured and labelled accented speech. It contains speech samples from speakers of 4 non-native accents of English (8 speakers, 4 Indian languages); it also includes a compilation of 4 native accents of English (4 countries, 13 speakers) and a metropolitan Indian accent (2 speakers). The dataset available here corresponds to the release titled accentdb_extended.
License: ELRA End User License, https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
The Aurora project was originally set up to establish a worldwide standard for the feature extraction software which forms the core of the front-end of a DSR (Distributed Speech Recognition) system. The AURORA-5 database was mainly developed to investigate the influence of hands-free speech input in noisy room environments on the performance of automatic speech recognition. Furthermore, two test conditions are included to study the influence of transmitting the speech over a mobile communication system. The earlier three Aurora experiments focused on additive noise and the influence of some telephone frequency characteristics; Aurora-5 tries to cover all effects as they occur in realistic application scenarios. The focus was put on two scenarios. The first is hands-free speech input in the noisy car environment, with the intention of either controlling devices in the car itself or retrieving information from a remote speech server over the telephone. The second covers hands-free speech input in an office or a living room, e.g. to control a telephone device or some audio/video equipment.
The AURORA-5 database contains the following data:
• Artificially distorted versions of the recordings from adult speakers in the TI-Digits speech database, downsampled to a sampling frequency of 8000 Hz. The distortions consist of additive background noise, the simulation of hands-free speech input in rooms, and the simulation of transmitting speech over cellular telephone networks.
• A subset of recordings from the meeting recorder project at the International Computer Science Institute. The recordings contain sequences of digits uttered by different speakers in hands-free mode in a meeting room.
• A set of scripts for running recognition experiments on the above-mentioned speech data. The experiments are based on the freely available HTK software package; HTK itself is not part of this resource.
Further information is available at: http://aurora.hsnr.de
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
ARTUR is a speech database designed for the needs of automatic speech recognition for the Slovenian language. The database includes 1,035 hours of speech, of which 840 hours are transcribed and the remaining 195 hours are without transcription. The data is divided into 4 parts:
(1) approx. 520 hours of read speech, covering the reading of pre-defined sentences selected from the Gigafida 2.0 corpus (http://hdl.handle.net/11356/1320); each sentence is contained in one file; speakers are demographically balanced; spelling is included in special files; all with manual transcriptions;
(2) approx. 204 hours of public speech, including media recordings, online recordings of conferences, workshops, education videos, etc.; 56 hours are manually transcribed;
(3) approx. 110 hours of private speech, including monologues and dialogues between two persons, recorded for the purposes of the speech database; the speakers are demographically balanced; two subsets for domain-specific ASR (i.e., smart-home and face-description) are included; 63 hours are manually transcribed;
(4) approx. 201 hours of parliamentary speech, including recordings from the Slovene National Assembly, all with manual transcriptions.
Audio files are WAV, 44.1 kHz, PCM, 16-bit, mono.
This entry includes the recordings only; transcriptions are available at http://hdl.handle.net/11356/1718.
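As a quick sanity check of the stated audio format, a minimal sketch using the soundfile library (the file name is a placeholder):

```python
# Minimal sketch: check that a recording matches the stated format
# (WAV, 44.1 kHz, PCM, 16-bit, mono). The file name is a placeholder.
import soundfile as sf

info = sf.info("artur_example.wav")
assert info.samplerate == 44100
assert info.channels == 1
assert info.subtype == "PCM_16"
print(info)
```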
License: ELRA End User License, https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
The Aurora project was originally set up to establish a worldwide standard for the feature extraction software which forms the core of the front-end of a DSR (Distributed Speech Recognition) system. ETSI formally adopted this activity as work items 007 and 008. The two work items within ETSI are:
- ETSI DES/STQ WI007: Distributed Speech Recognition - Front-End Feature Extraction Algorithm & Compression Algorithm
- ETSI DES/STQ WI008: Distributed Speech Recognition - Advanced Feature Extraction Algorithm
This database is a subset of the Spanish-language SpeechDat-Car database, which was collected as part of the European Union funded SpeechDat-Car project. It contains isolated and connected Spanish digits spoken in the following noise and driving conditions inside a car:
1. Quiet environment: car stopped, motor running.
2. Low noise: town traffic and low-speed rough road.
3. High noise: high-speed good road.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Database description:
The written and spoken digits database is not a new database but one constructed from existing databases, in order to provide a ready-to-use database for multimodal fusion.
The written digits database is the original MNIST handwritten digits database [1] with no additional processing. It consists of 70000 images (60000 for training and 10000 for test) of 28 x 28 = 784 dimensions.
The spoken digits database was extracted from Google Speech Commands [2], an audio dataset of spoken words proposed for training and evaluating keyword spotting systems. It consists of 105829 utterances of 35 words, amongst which 38908 utterances of the ten digits (34801 for training and 4107 for test). Pre-processing consisted of extracting Mel Frequency Cepstral Coefficients (MFCC) with a framing window size of 50 ms and a frame shift of 25 ms. Since the speech samples are approximately 1 s long, this yields 39 time slots. For each one, 12 MFCC coefficients are extracted together with an additional energy coefficient, giving a final vector of 39 x 13 = 507 dimensions. Standardization and normalization were applied to the MFCC features.
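A rough re-implementation of this pre-processing, as a sketch only; the authors' exact MFCC settings, padding and normalization details are not given here, so the parameters below are assumptions:

```python
# Rough sketch of the described pre-processing: 50 ms frames, 25 ms shift,
# 12 MFCCs plus an energy term per frame, 39 frames -> 507-dimensional vector.
# The authors' exact settings are not given, so parameters here are assumptions.
import librosa
import numpy as np

y, sr = librosa.load("spoken_digit.wav", sr=16000)   # Speech Commands audio is 16 kHz
y = librosa.util.fix_length(y, size=sr)              # pad/trim to exactly 1 s

frame_length = int(0.050 * sr)                       # 50 ms -> 800 samples
hop_length = int(0.025 * sr)                         # 25 ms -> 400 samples

# 13 coefficients; the 0th cepstral coefficient stands in for the energy term.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=frame_length, hop_length=hop_length)

features = mfcc.T[:39].reshape(-1)                   # 39 frames x 13 coeffs = 507 dims
features = (features - features.mean()) / (features.std() + 1e-8)
print(features.shape)                                # (507,)
```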
To construct the multimodal digits dataset, we associated written and spoken digits of the same class, respecting the initial partitioning in [1] and [2] for the training and test subsets. Since there are fewer samples for the spoken digits, some random samples were duplicated to match the number of written digits, giving a multimodal digits database of 70000 samples (60000 for training and 10000 for test).
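The class-wise pairing with duplication could be sketched as follows; the array and function names are illustrative, not the dataset's actual identifiers:

```python
# Sketch of the class-wise pairing with duplication: for each digit, spoken
# samples are drawn with replacement so their count matches the written
# (MNIST) samples of that class. Array names are illustrative.
import numpy as np

def pair_modalities(written, written_labels, spoken, spoken_labels, seed=0):
    rng = np.random.default_rng(seed)
    paired_spoken = np.empty((len(written), spoken.shape[1]), dtype=spoken.dtype)
    for digit in range(10):
        w_idx = np.flatnonzero(written_labels == digit)
        s_idx = np.flatnonzero(spoken_labels == digit)
        chosen = rng.choice(s_idx, size=len(w_idx), replace=True)  # fewer spoken samples
        paired_spoken[w_idx] = spoken[chosen]
    return written, paired_spoken, written_labels
```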
The dataset is provided in six files as described below, with the written digits, the spoken digits and the labels stored in separate, index-aligned files. Therefore, if a shuffle is performed on the training or test subsets, it must be performed in unison, with the same order, for the written digits, spoken digits and labels.
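A minimal sketch of such a unison shuffle, assuming the three subsets are loaded as NumPy arrays of equal length (names are illustrative):

```python
# Sketch of a unison shuffle: one permutation applied to the written digits,
# the spoken digits and the labels, so the index alignment is preserved.
import numpy as np

def shuffle_in_unison(written, spoken, labels, seed=0):
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(labels))
    return written[perm], spoken[perm], labels[perm]
```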
Files:
References:
This dataset contains common speech and noise corpora for evaluating fundamental frequency estimation algorithms as convenient JBOF dataframes. Each corpus is available freely on its own, and allows redistribution:
CMU-ARCTIC (BSD license) [1]
FDA (free to download) [2]
KEELE (free for noncommercial use) [3]
MOCHA-TIMIT (free for noncommercial use) [4]
PTDB-TUG (ODBL license) [5]
NOISEX (free to download) [7]
QUT-NOISE (CC-BY-SA license) [8]
These files are published as part of my dissertation, "Pitch of Voiced Speech in the Short-Time Fourier Transform: Algorithms, Ground Truths, and Evaluation Methods", and in support of the Replication Dataset for Fundamental Frequency Estimation.
References:
John Kominek and Alan W Black. CMU ARCTIC database for speech synthesis, 2003.
Paul C Bagshaw, Steven Hiller, and Mervyn A Jack. Enhanced Pitch Tracking and the Processing of F0 Contours for Computer Aided Intonation Teaching. In EUROSPEECH, 1993.
F Plante, Georg F Meyer, and William A Ainsworth. A Pitch Extraction Reference Database. In Fourth European Conference on Speech Communication and Technology, pages 837–840, Madrid, Spain, 1995.
Alan Wrench. MOCHA MultiCHannel Articulatory database: English, November 1999.
Gregor Pirker, Michael Wohlmayr, Stefan Petrik, and Franz Pernkopf. A Pitch Tracking Corpus with Evaluation on Multipitch Tracking Scenario. 2011.
John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, David S. Pallett, Nancy L. Dahlgren, and Victor Zue. TIMIT Acoustic-Phonetic Continuous Speech Corpus, 1993.
Andrew Varga and Herman J.M. Steeneken. Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech Communication, 12(3):247–251, July 1993.
David B. Dean, Sridha Sridharan, Robert J. Vogt, and Michael W. Mason. The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms. Proceedings of Interspeech 2010, 2010.
License: Apache License 2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This database was created by Nordic Language Technology for the development of automatic speech recognition and dictation in Norwegian. In this version, the organization of the data has been altered to improve the usefulness of the database.
The acoustic databases described below were developed by the firm Nordisk språkteknologi holding AS (NST), which went bankrupt in 2003. In 2006, a consortium consisting of the University of Oslo, the University of Bergen, the Norwegian University of Science and Technology, the Norwegian Language Council and IBM bought the bankruptcy estate of NST, in order to ensure that the language resources developed by NST were preserved. In 2009, the Norwegian Ministry of Culture charged the National Library of Norway with the task of creating a Norwegian language bank, which they initiated in 2010. The resources from NST were transferred to the National Library in May 2011, and are now made available in Språkbanken, for the time being without any further modification. Språkbanken is open for feedback from users about how the resources can be improved, and we are also interested in improved versions of the databases that users wish to share with other users. Please send response and feedback to sprakbanken@nb.no.
This database was originally created as a resource for developing advanced automatic speech recognition models that are better suited to the needs of people with dysarthria.
The EMODB database is a freely available German emotional speech database, created by the Institute of Communication Science, Technical University of Berlin, Germany. Ten professional speakers (five male and five female) participated in the recordings. The database contains a total of 535 utterances covering seven emotions: 1) anger; 2) boredom; 3) anxiety; 4) happiness; 5) sadness; 6) disgust; and 7) neutral. The data was recorded at a 48 kHz sampling rate and then down-sampled to 16 kHz.
Every utterance is named according to the same scheme:
Example: 03a01Fa.wav is the audio file from Speaker 03 speaking text a01 with the emotion "Freude" (Happiness).
| letter (English) | emotion (English) | letter (German, used in file names) | emotion (German) |
|---|---|---|---|
| A | anger | W | Ärger (Wut) |
| B | boredom | L | Langeweile |
| D | disgust | E | Ekel |
| F | anxiety/fear | A | Angst |
| H | happiness | F | Freude |
| S | sadness | T | Trauer |
| N | neutral | N | Neutral |
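Based on the example file name above, a small parsing sketch; the positional layout (2-digit speaker, 3-character text code, one emotion letter, one version letter) is inferred from that example and should be checked against the database documentation:

```python
# Minimal sketch: split an EMODB file name such as "03a01Fa.wav" into its parts.
# The positional layout (2-digit speaker, 3-character text code, one emotion
# letter, one version letter) is inferred from the example above.
GERMAN_CODE_TO_EMOTION = {
    "W": "anger", "L": "boredom", "E": "disgust", "A": "anxiety/fear",
    "F": "happiness", "T": "sadness", "N": "neutral",
}

def parse_emodb_name(filename: str) -> dict:
    stem = filename.removesuffix(".wav")
    return {
        "speaker": stem[0:2],
        "text": stem[2:5],
        "emotion": GERMAN_CODE_TO_EMOTION[stem[5]],
        "version": stem[6:],
    }

print(parse_emodb_name("03a01Fa.wav"))
# {'speaker': '03', 'text': 'a01', 'emotion': 'happiness', 'version': 'a'}
```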
Emotion classification from speech is attracting increasing interest in the speech processing field. The objective of emotion classification is to recognize different emotions from the speech signal. A person's emotional state affects the speech production mechanism: breathing rate and muscle tension change from the neutral condition, so the resulting speech signal may have characteristics that differ from those of neutral speech.
The performance of speech recognition or speaker recognition decreases significantly if a model is trained on neutral speech and tested on emotional speech. This database is therefore a good starting point for machine learning practitioners to work on speech emotion recognition problems and to build robust models.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The "SpeechRec_LanguageLearning_ConversationalSkills" dataset is a collection of data generated in a game-based language learning environment, aiming to explore the impact of Speech Recognition Technology (SRT) on the development of conversational skills. The dataset encompasses speaking test results conducted within the context of language learning games utilizing SRT.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Our Laboratory of Artificial Neural Network Applications (LANNA) at the Czech Technical University in Prague (head of the laboratory: professor Jana Tučková) collaborates on a project with the Department of Paediatric Neurology, 2nd Faculty of Medicine of Charles University in Prague and with the Motol University Hospital (head of clinic: professor Vladimír Komárek), which focuses on the study of children with SLI.
The speech database contains two subgroups of recordings of children's speech from different types of speakers. The first subgroup (healthy) consists of recordings of children without speech disorders; the second subgroup (patients) consists of recordings of children with SLI. These children have different degrees of severity (1 – mild, 2 – moderate, and 3 – severe); speech therapists and specialists from Motol Hospital decided upon this classification. The children's speech was recorded in the period 2003-2013. The recordings were commonly made in a schoolroom or a speech therapist's consulting room, in the presence of surrounding background noise. This situation simulates the natural environment in which the children live, and is important for capturing the normal behavior of children.
The database of healthy children's speech was created as a reference database for the computer processing of children's speech. It was recorded on a SONY digital Dictaphone (sampling frequency fs = 16 kHz, 16-bit resolution, stereo mode, standardized wav format) and on an MD SONY MZ-N710 (sampling frequency fs = 44.1 kHz, 16-bit resolution, stereo mode, standardized wav format). The corpus was recorded in the natural environment of a schoolroom and in a clinic. This subgroup contains a total of 44 native Czech participants (15 boys, 29 girls) aged 4 to 12 years, and was recorded during the period 2003–2005.
The database of children with SLI was recorded in a private speech therapist's office. The children's speech was captured with a SHURE lapel microphone using the AVID solution (MBox – USB AD/DA converter and ProTools LE software) on an Apple laptop (iBook G4). The sound recordings are saved in the standardized wav format, with a sampling frequency of 44.1 kHz and 16-bit resolution in mono mode. This subgroup contains a total of 54 native Czech participants (35 boys, 19 girls) aged 6 to 12 years, and was recorded during the period 2009–2013.
This package contains wav data sets for developing and testing methods for detecting children with SLI.
Software pack:
FORANA – original software developed for formant analysis, based on the MATLAB programming environment. Its development was mainly driven by the need to perform formant analysis correctly and to fully automate the extraction of formants from recorded speech signals. Development of this application is still running. The software was developed in the LANNA at CTU FEE in Prague.
LABELING – a program used for segmentation of the speech signal; it is part of the SOMLab program system. The software was developed in the LANNA at CTU FEE in Prague.
PRAAT – acoustic analysis software, created by Paul Boersma and David Weenink of the Institute of Phonetic Sciences of the University of Amsterdam. Home page: http://www.praat.org or http://www.fon.hum.uva.nl/praat/.
openSMILE – a feature extraction tool that enables you to extract large audio feature spaces in real time. It combines features from Music Information Retrieval and Speech Processing. SMILE is an acronym for Speech & Music Interpretation by Large-space Extraction. It is written in C++ and is available as both a standalone command-line executable and a dynamic library. The main features of openSMILE are its capability of on-line incremental processing and its modularity. Feature extractor components can be freely interconnected to create new and custom features, all via a simple configuration file. New components can be added to openSMILE via an easy binary plugin interface and a comprehensive API. Citing: Florian Eyben, Martin Wöllmer, Björn Schuller: "openSMILE – The Munich Versatile and Fast Open-Source Audio Feature Extractor", in Proc. ACM Multimedia (MM), ACM, Florence, Italy, ISBN 978-1-60558-933-6, pp. 1459-1462, October 2010. doi:10.1145/1873951.1874246
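As an aside, openSMILE also provides an official Python wrapper (the opensmile package on PyPI); a minimal feature-extraction sketch, where the chosen feature set and the file name are assumptions rather than details taken from this package:

```python
# Minimal sketch using openSMILE's official Python wrapper (the "opensmile"
# package on PyPI); the feature set chosen here and the file name are
# assumptions, not details taken from this database package.
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)
features = smile.process_file("child_speech.wav")  # returns a pandas DataFrame
print(features.shape)
```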
This dataset was collected to enhance research into speech recognition systems for dysarthric speech.
This dataset was collected to enhance research into speech recognition systems for dysarthric speech. It consists of 19,275 isolated-word utterances from speakers with dysarthria due to cerebral palsy.
License: https://hdl.handle.net/21.11146/13/.well-known/skolem/bead126e-5168-3e2b-8959-07eaf5d458d5
This database was originally developed by Nordic Language Technology in the 1990s in order to facilitate automatic speech recognition (ASR) in Norwegian. A reorganized and more user-friendly version of this database is also available from The Language Bank; type "sbr-54" in the search bar to find the updated version.
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This distribution contains the QUT-NOISE database and the code required to create the QUT-NOISE-TIMIT database from the QUT-NOISE database and a locally installed copy of the TIMIT database. It also contains code to create the QUT-NOISE-SRE protocol on top of an existing speaker recognition evaluation database (such as NIST evaluations). Further information on the QUT-NOISE and QUT-NOISE-TIMIT databases is available in our paper:
D. Dean, S. Sridharan, R. Vogt, M. Mason (2010), "The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms", in Proceedings of Interspeech 2010, Makuhari Messe International Convention Complex, Makuhari, Japan.
This paper is also available in the file "docs/Dean2010, The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithm.pdf", distributed with this database.
Further information on the QUT-NOISE-SRE protocol is available in our paper:
D. Dean, A. Kanagasundaram, H. Ghaemmaghami, M. Hafizur, S. Sridharan (2015) . In Proceedings of Interspeech 2015, September, Dresden, Germany.
Licensing
The QUT-NOISE data itself is licensed CC-BY-SA, and the code required to create the QUT-NOISE-TIMIT database and QUT-NOISE-SRE protocols is licensed under the BSD license. Please consult the appropriate LICENSE.txt files (in the code and QUT-NOISE directories) for more information. To attribute this database, please include the following citation:
D. Dean, S. Sridharan, R. Vogt, M. Mason (2010), "The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms", in Proceedings of Interspeech 2010, Makuhari Messe International Convention Complex, Makuhari, Japan.
If your work is based upon the QUT-NOISE-SRE, please also include this citation:
D. Dean, A. Kanagasundaram, H. Ghaemmaghami, M. Hafizur, S. Sridharan (2015) . In Proceedings of Interspeech 2015, September, Dresden, Germany.
In order to construct the QUT-NOISE-TIMIT database from the QUT-NOISE data supplied here, you will need to obtain a copy of the TIMIT database from the Linguistic Data Consortium (LDC). If you just want to use the QUT-NOISE database, or you wish to combine it with different speech data, TIMIT is not required.
License: Community Data License Agreement – Permissive, Version 1.0, https://cdla.io/permissive-1-0
We present a speech data corpus that simulates a "dinner party" scenario taking place in an everyday home environment. The corpus was created by recording multiple groups of four Amazon employee volunteers having a natural conversation in English around a dining table. The participants were recorded by a single-channel close-talk microphone and by five far-field 7-microphone array devices positioned at different locations in the recording room. The dataset contains the audio recordings and human-labeled transcripts of a total of 10 sessions, each with a duration between 15 and 45 minutes. The corpus was created to advance the field of noise-robust and distant speech processing and is intended to serve as a public research and benchmarking data set.
License: CC0 1.0, https://choosealicense.com/licenses/cc0-1.0/
This database was created by Nordic Language Technology for the development of automatic speech recognition and dictation in Swedish. In this updated version, the organization of the data has been altered to improve the usefulness of the database.
In the original version of the material, the files were organized in a specific folder structure where the folder names were meaningful. However, the file names were not meaningful, and there were also cases of files with identical names in different folders. This proved to be impractical, since users had to keep the original folder structure in order to use the data. The files have been renamed, such that the file names are unique and meaningful regardless of the folder structure. The original metadata files were in spl format. These have been converted to JSON format. The converted metadata files are also anonymized and the text encoding has been converted from ANSI to UTF-8.
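The encoding conversion mentioned above (ANSI to UTF-8) can be illustrated with a generic sketch; the spl-to-JSON conversion itself is specific to NST's metadata format and is not reproduced here, and the file names are placeholders:

```python
# Generic sketch of an ANSI-to-UTF-8 re-encoding step (ANSI assumed to be
# Windows-1252). The spl-to-JSON conversion itself is specific to NST's
# metadata format and is not reproduced here; file names are placeholders.
from pathlib import Path

src = Path("metadata_ansi.spl")
dst = Path("metadata_utf8.txt")

text = src.read_text(encoding="cp1252")
dst.write_text(text, encoding="utf-8")
print(f"re-encoded {src} -> {dst}")
```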
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Our project, “Indonesian Media Audio Database,” is designed to establish a rich and diverse dataset tailored for training advanced machine learning models in language processing, speech recognition, and cultural analysis.