https://creativecommons.org/publicdomain/zero/1.0/
| Column | Description |
|---|---|
| id | file id (string) |
| file_path | file path to the .wav file (string) |
| speech | transcription of the audio file (string) |
| speaker | speaker name; use this as the target variable if you are doing audio classification (string) |
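As a quick orientation, here is a minimal sketch of loading this metadata with pandas, assuming the table is distributed as a CSV; the file name `metadata.csv` is a placeholder:

```python
import pandas as pd

# Placeholder file name; the actual distribution may package
# this metadata table differently.
df = pd.read_csv("metadata.csv")

# `speech` holds the transcription; `speaker` is the suggested
# target variable for audio classification.
print(df[["id", "file_path", "speaker"]].head())
print(df["speaker"].value_counts())
```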
The CMU_ARCTIC databases were constructed at the Language Technologies Institute at Carnegie Mellon University as phonetically balanced, US-English single-speaker databases designed for unit selection speech synthesis research. A detailed report on the structure and content of the database, the recording environment, etc. is available as Carnegie Mellon University Language Technologies Institute Tech Report CMU-LTI-03-177, and is also available here.
The databases consist of around 1150 utterances carefully selected from out-of-copyright texts from Project Gutenberg. The databases include US English male (bdl) and female (slt) speakers (both experienced voice talent) as well as other accented speakers.
The 1132 sentence prompt list is available from cmuarctic.data
The distributions include 16 kHz waveforms and simultaneous EGG signals. Full phonetic labeling was performed with CMU Sphinx using the FestVox-based labeling scripts. Complete runnable Festival voices are included with the database distributions as examples, though better voices can be made by improving the labeling, etc.
This work was partially supported by the U.S. National Science Foundation under Grant No. 0219687, "ITR/CIS Evaluation and Personalization of Synthetic Voices". Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Artur 1.0 is a speech database designed for the needs of automatic speech recognition for the Slovenian language. The database includes 1,067 hours of speech; 884 hours are transcribed, while the remaining 183 hours are recordings only. This repository entry includes audio files only; the transcriptions are available at http://hdl.handle.net/11356/1772.
The data are structured as follows: (1) Artur-B, read speech, 573 hours in total. It includes: (1a) Artur-B-Brani, 485 hours: readings of sentences pre-selected from a 10% increment of the Gigafida 2.0 corpus. The sentences were chosen so that they reflect the natural, i.e. actual, distribution of triphones in words. They were distributed among 1,000 speakers, so that approx. 30 minutes of read speech were recorded from each speaker. The speakers were balanced according to gender, age and region, and a small proportion of speakers were non-native speakers of Slovene. Each sentence is its own audio file and has a corresponding transcription file. (1b) Artur-B-Crkovani, 10 hours: spellings. Speakers were asked to spell abbreviations and personal names and surnames, all chosen so that all Slovene letters were covered, plus the most common foreign letters. (1c) Artur-B-Studio, 51 hours: designed for the development of speech synthesis. The sentences were read in a studio by a single speaker. Each sentence is its own audio file and has a corresponding transcription file. (1d) Artur-B-Izloceno, 27 hours: recordings that include different types of errors, typically incorrect reading of sentences or a noisy environment.
(2) Artur-J, public speech, 62 hours in total. It includes: (2a) Artur-J-Splosni, 62 hours: media recordings, online recordings of conferences, workshops, education videos, etc.
(3) Artur-N, private speech, 74 hours in total. It includes: (3a) Artur-N-Obrazi, 6 hours: speakers were asked to describe faces in pictures. Designed for face-description domain-specific speech recognition. (3b) Artur-N-PDom, 7 hours: speakers were asked to read pre-written sentences, as well as to express instructions for a potential smart-home system freely. Designed for smart-home domain-specific speech recognition. (3c) Artur-N-Prosti, 61 hours: monologues and dialogues between two persons, recorded for the purposes of the Artur database creation. Speakers were asked to converse or speak freely on casual topics.
(4) Artur-P, parliamentary speech, 201 hours in total. It includes: (4a) Artur-P-SejeDZ, 201 hours: Speech from the Slovene National Assembly.
Further information on the database is available in the Artur-DOC file, which is part of this repository entry.
== Quick facts ==
The most up-to-date and comprehensive podcast database available:
- All languages & all countries
- Includes over 3,500,000 podcasts
- Features 35+ data fields, such as basic metadata, global rank, RSS feed (with audio URLs), Spotify links, and more
- Delivered in SQLite format

Learn how we build a high-quality podcast database: https://www.listennotes.help/article/105-high-quality-podcast-database-from-listen-notes
== Use Cases ==
- AI training, including speech recognition, generative AI, voice cloning / synthesis, and news analysis
- Alternative data for investment research, such as sentiment analysis of executive interviews, market research, and tracking investment themes
- PR and marketing, including social monitoring, content research, outreach, and guest booking
- ...
== Data Attributes ==
See the full list of data attributes on this page: https://www.listennotes.com/podcast-datasets/fields/?filter=podcast_only
How to access podcast audio files: Our dataset includes RSS feed URLs for all podcasts. You can retrieve audio for over 170 million episodes directly from these feeds. With access to the raw audio, you’ll have high-quality podcast speech data ideal for AI training and related applications.
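As an illustration, a minimal sketch of pulling episode audio URLs out of one RSS feed with the feedparser library; the feed URL below is a placeholder, and real feed URLs come from the dataset's RSS field:

```python
import feedparser  # pip install feedparser

# Placeholder feed URL; substitute an RSS feed URL from the dataset.
feed = feedparser.parse("https://example.com/podcast/rss")

for entry in feed.entries:
    # Podcast audio is conventionally attached as an RSS <enclosure>.
    for enclosure in entry.get("enclosures", []):
        if enclosure.get("type", "").startswith("audio/"):
            print(entry.get("title", ""), enclosure.get("href"))
```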
== Custom Offers ==
We can provide custom datasets based on your needs, such as language-specific data, daily/weekly/monthly update frequency, or one-time purchases.
We also provide a RESTful API at PodcastAPI.com
Contact us: hello@listennotes.com
== Need Help? ==
If you have any questions about our products, feel free to reach out at hello@listennotes.com
== About Listen Notes, Inc. ==
Since 2017, Listen Notes, Inc. has provided the leading podcast search engine and podcast database.
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
This database was collected and packaged under the auspices of the IST-EU STREP project HIWIRE (Human Input that Works In Real Environments). The database was designed as a tool for the development and testing of speech processing and recognition techniques dealing with robust non-native speech recognition.

The database contains 8,099 English utterances pronounced by non-native speakers (31 French, 20 Greek, 20 Italian, and 10 Spanish speakers). The collected utterances correspond to human input in a command-and-control aeronautics application. The data was recorded in a studio with a close-talking microphone, and real noise recorded in an airplane cockpit was artificially added to the data. The signals are provided in clean (studio recordings with close-talking microphone), low-, mid- and high-noise conditions; the three noise levels correspond approximately to signal-to-noise ratios of 10 dB, 5 dB and -5 dB, respectively.

Clean audio data was recorded in different office rooms using a close-talking microphone (Plantronics USB-45) to minimize ambient acoustic effects. The sampling frequency is 16 kHz and data is stored in Windows PCM WAV 16-bit mono format.

Recordings correspond to prompts extracted from an aeronautic command-and-control application. A total of 8,099 utterances were recorded, corresponding to 81 speakers pronouncing 100 utterances each. The speaker distribution is as follows:
| Country | # Speakers | # Utterances |
|---|---|---|
| France | 31 (38.3%) | 3100 |
| Greece | 20 (24.7%) | 2000 |
| Italy | 20 (24.7%) | 2000 |
| Spain | 10 (12.3%) | 999 |
| Total | 81 | 8099 |
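The noisy conditions were produced by artificially adding cockpit noise to the clean studio recordings at fixed SNRs. As a sketch of that kind of mixing (not the project's actual tooling), assuming numpy arrays of samples:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean` scaled to a target signal-to-noise ratio."""
    # Tile or trim the noise to the utterance length.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]

    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Gain such that 10*log10(p_clean / (gain**2 * p_noise)) == snr_db.
    gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + gain * noise
```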
The EMODB database is a freely available German emotional speech database, created by the Institute of Communication Science at the Technical University of Berlin, Germany. Ten professional speakers (five male and five female) participated in the recordings. The database contains a total of 535 utterances and comprises seven emotions: 1) anger; 2) boredom; 3) anxiety; 4) happiness; 5) sadness; 6) disgust; and 7) neutral. The data was recorded at a 48 kHz sampling rate and then down-sampled to 16 kHz.
Every utterance is named according to the same scheme:
Example: 03a01Fa.wav is the audio file from Speaker 03 speaking text a01 with the emotion "Freude" (Happiness).
| letter | emotion (English) | letter | emotion (German) |
|---|---|---|---|
| A | anger | W | Ärger (Wut) |
| B | boredom | L | Langeweile |
| D | disgust | E | Ekel |
| F | anxiety/fear | A | Angst |
| H | happiness | F | Freude |
| S | sadness | T | Trauer |

N = neutral version
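A minimal sketch of decoding this naming scheme; the emotion letters in the file names are the German codes from the table above, and the trailing lowercase letter is read as the take/version marker, as in the example:

```python
# German letter codes -> English emotion names (from the table above).
EMOTIONS = {
    "W": "anger",         # Ärger (Wut)
    "L": "boredom",       # Langeweile
    "E": "disgust",       # Ekel
    "A": "anxiety/fear",  # Angst
    "F": "happiness",     # Freude
    "T": "sadness",       # Trauer
    "N": "neutral",
}

def parse_emodb_name(filename):
    """Decode e.g. '03a01Fa.wav' -> speaker '03', text 'a01', happiness."""
    stem = filename.rsplit(".", 1)[0]
    return {
        "speaker": stem[0:2],
        "text": stem[2:5],
        "emotion": EMOTIONS[stem[5]],
        "version": stem[6:],
    }

print(parse_emodb_name("03a01Fa.wav"))
```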
Emotion classification from speech is of increasing interest in the field of speech processing. The objective of emotion classification is to recognize different emotions from the speech signal. A person's emotional state affects the speech production mechanism: breathing rate and muscle tension change from the neutral condition, so the resulting speech signal may have different characteristics from neutral speech.
The performance of speech recognition or speaker recognition decreases significantly if a model trained on neutral speech is tested on emotional speech. Machine learning enthusiasts can therefore start working on speech emotion recognition problems and develop robust models.
Citation: DOI 10.1007/s10579-011-9145-0
Collection of audio recordings by the Department of Computer Science at the University of Toronto from speakers with and without dysarthria. Useful for tasks like audio classification, disease detection, speech processing, etc.
Directory Structure:
F_Con : audio samples of female speakers from the control group, i.e., female speakers without dysarthria. 'FC01' in the folder names and filenames refers to the first speaker, 'FC02' to the second speaker, and so on.
F_Dys : audio samples of female speakers with dysarthria. 'F01' in the folder names and filenames refers to the first speaker, 'F03' to the second speaker, and so on.
M_Con : audio samples of male speakers from the control group, i.e., male speakers without dysarthria. 'MC01' in the folder names and filenames refers to the first speaker, 'MC02' to the second speaker, and so on.
M_Dys : audio samples of male speakers with dysarthria. 'M01' in the folder names and filenames refers to the first speaker, 'M03' to the second speaker, and so on.
Across all four groups, 'S01' refers to the first recording session with a speaker, 'S02' to the second session, and so on; 'arrayMic' indicates that the audio was recorded with an array microphone, whereas 'headMic' indicates a headpiece microphone.
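A minimal sketch of indexing the corpus from these conventions; the root folder name `TORGO` and the exact nesting of session and microphone folders are assumptions:

```python
import os

root_dir = "TORGO"  # hypothetical root holding F_Con, F_Dys, M_Con, M_Dys
records = []
for root, _, files in os.walk(root_dir):
    for name in files:
        if not name.endswith(".wav"):
            continue
        path = os.path.join(root, name)
        group = os.path.relpath(path, root_dir).split(os.sep)[0]  # e.g. 'F_Dys'
        records.append({
            "path": path,
            "gender": group[0],                   # 'F' or 'M'
            "dysarthria": group.endswith("Dys"),  # control vs dysarthric group
            "mic": "array" if "arrayMic" in path else "head",
        })
```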
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
The Aurora project was originally set up to establish a worldwide standard for the feature extraction software which forms the core of the front-end of a DSR (Distributed Speech Recognition) system. The AURORA-5 database was mainly developed to investigate the influence of hands-free speech input in noisy room environments on the performance of automatic speech recognition. Furthermore, two test conditions are included to study the influence of transmitting the speech over a mobile communication system.

The earlier three Aurora experiments focused on additive noise and the influence of some telephone frequency characteristics. Aurora-5 tries to cover all effects as they occur in realistic application scenarios. The focus was put on two scenarios. The first is hands-free speech input in the noisy car environment, with the intention of controlling either devices in the car itself or retrieving information from a remote speech server over the telephone. The second covers hands-free speech input in an office or living room to control, e.g., a telephone device or some audio/video equipment.

The AURORA-5 database contains the following data:
- Artificially distorted versions of the recordings from adult speakers in the TI-Digits speech database, downsampled to a sampling frequency of 8,000 Hz. The distortions consist of: additive background noise; the simulation of hands-free speech input in rooms; and the simulation of transmitting speech over cellular telephone networks.
- A subset of recordings from the meeting recorder project at the International Computer Science Institute. The recordings contain sequences of digits uttered by different speakers in hands-free mode in a meeting room.
- A set of scripts for running recognition experiments on the above-mentioned speech data. The experiments are based on the freely available HTK software package; HTK itself is not part of this resource.

Further information is available at: http://aurora.hsnr.de
https://hdl.handle.net/21.11146/13/.well-known/skolem/bead126e-5168-3e2b-8959-07eaf5d458d5
This database was originally developed by Nordic Language Technology in the 1990s to facilitate automatic speech recognition (ASR) in Norwegian. A reorganized and more user-friendly version of this database is also available from The Language Bank. Type "sbr-54" in the search bar to find the updated version.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Artur 1.0 is a speech database designed for the needs of developing automatic speech recognition for the Slovenian language. The complete database includes 1,067 hours of speech, of which 884 hours are transcribed, while the remaining 183 hours are recordings only. This repository entry includes transcriptions only, while the audio files are available at http://hdl.handle.net/11356/1776.
Transcriptions are available in the original TRS format of the Transcriber 1.5.1 tool, which was used to make the transcriptions. All transcriptions were made manually or manually corrected.
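TRS is the XML format produced by Transcriber, so the files can be read with a standard XML parser. A minimal sketch, assuming the usual Transcriber element names (`Turn`, `Sync`) and a placeholder file name:

```python
import xml.etree.ElementTree as ET

tree = ET.parse("example.trs")  # placeholder file name

for turn in tree.iter("Turn"):
    start, end = turn.get("startTime"), turn.get("endTime")
    # In Transcriber XML, spoken text sits in the tails of <Sync> anchors.
    words = " ".join((sync.tail or "").strip() for sync in turn.iter("Sync"))
    print(f"[{start}-{end}] {words}")
```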
The data are structured as follows: (1) Artur-B, read speech, 573 hours in total. It includes: (1a) Artur-B-Brani, 485 hours: readings of sentences pre-selected from a 10% increment of the Gigafida 2.0 corpus. The sentences were chosen so that they reflect the natural, i.e. actual, distribution of triphones in words. They were distributed among 1,000 speakers, so that approx. 30 minutes of read speech were recorded from each speaker. The speakers were balanced according to gender, age and region, and a small proportion of speakers were non-native speakers of Slovene. Each sentence is its own transcription file and has a corresponding audio file. (1b) Artur-B-Crkovani, 10 hours: spellings. Speakers were asked to spell abbreviations and personal names and surnames, all chosen so that all Slovene letters were covered, plus the most common foreign letters. The transcriptions were corrected manually. (1c) Artur-B-Studio, 51 hours: designed for the development of speech synthesis. The sentences were read in a studio by a single speaker. Each sentence is its own transcription file and has a corresponding recording. (1d) Artur-B-Izloceno, 27 hours: in TRS format only. The recordings that correspond to these transcriptions include different types of errors, typically incorrect reading of sentences or a noisy environment.
(2) Artur-J, public speech, 62 hours in total. It includes: (2a) Artur-J-Splosni, 62 hours: manual transcriptions of media recordings, online recordings of conferences, workshops, educational videos, etc. Transcriptions were made in two modes:
- 'pog' files contain pronunciation-based or citation-phonemic transcriptions (the output phoneme string derived from the orthographic form by letter-to-sound rules);
- 'std' files contain standardised or expanded orthographic transcriptions (standard Slovenian spelling is used to indicate the spoken words, with additional rules and word lists for non-standard lexis).
(3) Artur-N, private speech, 74 hours in total. It includes: (3a) Artur-N-Obrazi, 6 hours: speakers were asked to describe faces in pictures. Designed for face-description domain-specific speech recognition. (3b) Artur-N-PDom, 7 hours: speakers were asked to read pre-written sentences, as well as to express instructions for a potential smart-home system freely. Designed for smart-home domain-specific speech recognition. (3c) Artur-N-Prosti, 61 hours: monologues and dialogues between two persons, recorded for the purposes of the Artur database creation. Speakers were asked to converse or speak freely on casual topics. The manual transcriptions were done in two modes, the same as for Artur-J.
(4) Artur-P, parliamentary speech, 201 hours in total. It includes: (4a) Artur-P-SejeDZ, 201 hours: Transcriptions of speech from the Slovene National Assembly. Manual transcriptions were done in two modes, the same as for Artur-J.
Further information on the database, including various statistics, is available in the Artur-DOC directory, which is part of Artur_1.0_TRS.
This dataset contains common speech and noise corpora for evaluating fundamental frequency estimation algorithms, packaged as convenient JBOF dataframes. Each corpus is freely available on its own and allows redistribution:
These files are published as part of my dissertation, "Pitch of Voiced Speech in the Short-Time Fourier Transform: Algorithms, Ground Truths, and Evaluation Methods", and in support of the Replication Dataset for Fundamental Frequency Estimation.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The "SpeechRec_LanguageLearning_ConversationalSkills" dataset is a collection of data generated in a game-based language learning environment, aiming to explore the impact of Speech Recognition Technology (SRT) on the development of conversational skills. The dataset encompasses speaking test results conducted within the context of language learning games utilizing SRT.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Part of the dissertation Pitch of Voiced Speech in the Short-Time Fourier Transform: Algorithms, Ground Truths, and Evaluation Methods.
© 2020, Bastian Bechtold. All rights reserved.
Estimating the fundamental frequency of speech remains an active area of research, with varied applications in speech recognition, speaker identification, and speech compression. A vast number of algorithms for estimating this quantity have been proposed over the years, and a number of speech and noise corpora have been developed for evaluating their performance. The present dataset contains estimated fundamental frequency tracks of 25 algorithms on six speech corpora and two noise corpora, at nine signal-to-noise ratios between -20 and 20 dB SNR, as well as an additional evaluation on synthetic harmonic tone complexes in white noise.
The dataset also contains pre-calculated performance measures, both novel and traditional, in reference to each speech corpus' ground truth, the algorithms' own clean-speech estimates, and our own consensus truth. It can thus serve as the basis for a comparison study, to replicate existing studies from a larger dataset, or as a reference for developing new fundamental frequency estimation algorithms. All source code and data are available to download and are entirely reproducible, albeit requiring about one year of processor time.
Included Code and Data
- ground truth data.zip: a JBOF dataset of fundamental frequency estimates and ground truths of all speech files in the following corpora:
- noisy speech data.zip: a JBOF dataset of fundamental frequency estimates of speech files mixed with noise from the following corpora:
- synthetic speech data.zip: a JBOF dataset of fundamental frequency estimates of synthetic harmonic tone complexes in white noise.
- noisy_speech.pkl and synthetic_speech.pkl: pickled Pandas dataframes of performance metrics derived from the above data for the following list of fundamental frequency estimation algorithms:
- noisy speech evaluation.py and synthetic speech evaluation.py: Python programs that calculate the above Pandas dataframes from the above JBOF datasets. They calculate the following performance measures:
- Pipfile: a pipenv-compatible pipfile for installing all prerequisites necessary for running the above Python programs.

The Python programs take about an hour to compute on a fast 2019 computer, and require at least 32 GB of memory.
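A minimal sketch of inspecting the pickled dataframes with pandas; the exact column layout depends on the evaluation scripts:

```python
import pandas as pd

noisy = pd.read_pickle("noisy_speech.pkl")
synthetic = pd.read_pickle("synthetic_speech.pkl")

# Inspect the performance metrics before building a comparison study.
print(noisy.columns.tolist())
print(noisy.head())
```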
https://cdla.io/permissive-1-0
We present a speech data corpus that simulates a "dinner party" scenario taking place in an everyday home environment. The corpus was created by recording multiple groups of four Amazon employee volunteers having a natural conversation in English around a dining table. The participants were recorded by a single-channel close-talk microphone and by five far-field 7-microphone array devices positioned at different locations in the recording room. The dataset contains the audio recordings and human-labeled transcripts of a total of 10 sessions with a duration between 15 and 45 minutes. The corpus was created to advance the field of noise-robust and distant speech processing and is intended to serve as a public research and benchmarking data set.
Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0) https://creativecommons.org/licenses/by-nc-sa/3.0/
License information was derived automatically
In order to analyze gender by voice and speech, a training database was required. A database was built using thousands of samples of male and female voices, each labeled by gender (male or female). Voice samples were collected from the following resources:
The output from the pre-processed WAV files was saved into a CSV file containing 3,168 rows and 21 columns (20 feature columns and one label column for the classification of male or female). You can download the pre-processed dataset in CSV format using the link above.
The following acoustic properties of each voice are measured:
- duration: length of signal
- meanfreq: mean frequency (in kHz)
- sd: standard deviation of frequency
- median: median frequency (in kHz)
- Q25: first quartile (in kHz)
- Q75: third quartile (in kHz)
- IQR: interquartile range (in kHz)
- skew: skewness (see note in specprop description)
- kurt: kurtosis (see note in specprop description)
- sp.ent: spectral entropy
- sfm: spectral flatness
- mode: mode frequency
- centroid: frequency centroid (see specprop)
- peakf: peak frequency (frequency with highest energy)
- meanfun: average of fundamental frequency measured across acoustic signal
- minfun: minimum fundamental frequency measured across acoustic signal
- maxfun: maximum fundamental frequency measured across acoustic signal
- meandom: average of dominant frequency measured across acoustic signal
- mindom: minimum of dominant frequency measured across acoustic signal
- maxdom: maximum of dominant frequency measured across acoustic signal
- dfrange: range of dominant frequency measured across acoustic signal
- modindx: modulation index, calculated as the accumulated absolute difference between adjacent measurements of fundamental frequency divided by the frequency range

Note: the duration and peak frequency (peakf) features were removed from training. Duration refers to the length of the recording, which for training is cut off at 20 seconds. peakf was omitted due to time and CPU constraints in calculating the value. As a result, all records would have had the same value for duration (20) and peak frequency (0).
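A minimal sketch of training a baseline classifier on the pre-processed CSV; the file name `voice.csv` and the label column name `label` are assumptions:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder file name; 20 feature columns plus one label column.
df = pd.read_csv("voice.csv")
X, y = df.drop(columns=["label"]), df["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```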
Photo by Jason Rosewell on Unsplash
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dialectal Voice is a community project initiated by AIOX Labs to facilitate voice recognition by intelligent systems. The need for AI systems capable of recognizing the human voice is increasingly expressed within communities, yet for some languages, such as Darija, there are not enough voice technology solutions. To meet this need, we set up this program of iterative and interactive construction of a dialectal database, open to all, in order to help improve voice recognition and generation models.
The TIMIT Acoustic-Phonetic Continuous Speech Corpus is a standard dataset used for evaluation of automatic speech recognition systems. It consists of recordings of 630 speakers of 8 dialects of American English each reading 10 phonetically-rich sentences. It also comes with the word and phone-level transcriptions of the speech.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The SNABI speech database can be used to train continuous speech recognition for the Slovene language. The database comprises 1,530 sentences, 150 words, and the alphabet. 132 speakers were recorded, each reading 200 sentences or more, which resulted in more than 15,000 recordings of speech signal in the database. The recordings were made in a studio (SNABI SI_SSQ) and over a telephone line (SNABI SI_SFN).
THCHS-30 is a free Chinese speech database that can be used to build a full-fledged Chinese speech recognition system.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Our project, “Indonesian Media Audio Database,” is designed to establish a rich and diverse dataset tailored for training advanced machine learning models in language processing, speech recognition, and cultural analysis.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Database description:
The written and spoken digits database is not a new database but one constructed from existing databases, in order to provide a ready-to-use database for multimodal fusion [1].
The written digits database is the original MNIST handwritten digits database [2] with no additional processing. It consists of 70000 images (60000 for training and 10000 for test) of 28 x 28 = 784 dimensions.
The spoken digits database was extracted from Google Speech Commands [3], an audio dataset of spoken words that was proposed to train and evaluate keyword spotting systems. It consists of 105829 utterances of 35 words, amongst which 38908 utterances of the ten digits (34801 for training and 4107 for test). A pre-processing was done via the extraction of the Mel Frequency Cepstral Coefficients (MFCC) with a framing window size of 50 ms and frame shift size of 25 ms. Since the speech samples are approximately 1 s long, we end up with 39 time slots. For each one, we extract 12 MFCC coefficients with an additional energy coefficient. Thus, we have a final vector of 39 x 13 = 507 dimensions. Standardization and normalization were applied on the MFCC features.
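A sketch of a comparable MFCC front-end using librosa; the library choice is an assumption, and exact framing/padding details may yield a slightly different frame count than the 39 reported above:

```python
import librosa

# Placeholder file name; Google Speech Commands audio is 16 kHz, ~1 s.
y, sr = librosa.load("example.wav", sr=16000)
y = librosa.util.fix_length(y, size=sr)  # pad/trim to exactly 1 s

mfcc = librosa.feature.mfcc(
    y=y, sr=sr,
    n_mfcc=13,                   # 12 coefficients + an energy-like C0
    n_fft=int(0.050 * sr),       # 50 ms framing window
    hop_length=int(0.025 * sr),  # 25 ms frame shift
)
features = mfcc.T.flatten()  # (frames x 13) -> one feature vector
print(mfcc.shape, features.shape)
```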
To construct the multimodal digits dataset, we associated written and spoken digits of the same class, respecting the initial partitioning in [2] and [3] for the training and test subsets. Since we have fewer samples for the spoken digits, we duplicated some random samples to match the number of written digits and obtain a multimodal digits database of 70000 samples (60000 for training and 10000 for test).
The dataset is provided in six files as described below. Therefore, if a shuffle is performed on the training or test subsets, it must be performed in unison with the same order for the written digits, spoken digits and labels.
Files:
data_wr_train.npy: 60000 samples of 784-dimensional written digits for training;
data_sp_train.npy: 60000 samples of 507-dimensional spoken digits for training;
labels_train.npy: 60000 labels for the training subset;
data_wr_test.npy: 10000 samples of 784-dimensional written digits for test;
data_sp_test.npy: 10000 samples of 507-dimensional spoken digits for test;
labels_test.npy: 10000 labels for the test subset.
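A minimal sketch of loading the six files and shuffling the training subset in unison, as required above; the array shapes in the comments follow the description:

```python
import numpy as np

data_wr_train = np.load("data_wr_train.npy")  # (60000, 784)
data_sp_train = np.load("data_sp_train.npy")  # (60000, 507)
labels_train = np.load("labels_train.npy")    # 60000 labels
data_wr_test = np.load("data_wr_test.npy")    # (10000, 784)
data_sp_test = np.load("data_sp_test.npy")    # (10000, 507)
labels_test = np.load("labels_test.npy")      # 10000 labels

# Shuffle with a single permutation so written digits, spoken digits,
# and labels stay aligned sample-for-sample.
order = np.random.default_rng(0).permutation(len(labels_train))
data_wr_train = data_wr_train[order]
data_sp_train = data_sp_train[order]
labels_train = labels_train[order]
```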
References:
[1] Khacef, L. et al. (2020), "Brain-Inspired Self-Organization with Cellular Neuromorphic Computing for Multimodal Unsupervised Learning".
[2] LeCun, Y. & Cortes, C. (1998), "MNIST handwritten digit database".
[3] Warden, P. (2018), "Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition".