https://creativecommons.org/publicdomain/zero/1.0/
| Column | Description |
|---|---|
| id | file id (string) |
| file_path | file path to the .wav file (string) |
| speech | transcription of the audio file (string) |
| speaker | speaker name; use this as the target variable if you are doing audio classification (string) |
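As a quick orientation, here is a minimal sketch of loading this metadata with pandas, assuming the table is distributed as a CSV; the file name `metadata.csv` is a placeholder:

```python
import pandas as pd

# Placeholder file name; the actual distribution may package
# this metadata table differently.
df = pd.read_csv("metadata.csv")

# `speech` holds the transcription; `speaker` is the suggested
# target variable for audio classification.
print(df[["id", "file_path", "speaker"]].head())
print(df["speaker"].value_counts())
```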
The CMU_ARCTIC databases were constructed at the Language Technologies Institute at Carnegie Mellon University as phonetically balanced, US-English single-speaker databases designed for unit selection speech synthesis research. A detailed report on the structure and content of the database, the recording environment, etc. is available as Carnegie Mellon University Language Technologies Institute Tech Report CMU-LTI-03-177, and is also available here.
The databases consist of around 1150 utterances carefully selected from out-of-copyright texts from Project Gutenberg. The databases include US English male (bdl) and female (slt) speakers (both experienced voice talent) as well as other accented speakers.
The 1132 sentence prompt list is available from cmuarctic.data
The distributions include 16 kHz waveforms and simultaneous EGG signals. Full phonetic labeling was performed with CMU Sphinx using the FestVox-based labeling scripts. Complete runnable Festival voices are included with the database distributions as examples, though better voices can be made by improving the labeling, etc.
This work was partially supported by the U.S. National Science Foundation under Grant No. 0219687, "ITR/CIS Evaluation and Personalization of Synthetic Voices". Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Artur 1.0 is a speech database designed for the needs of automatic speech recognition for the Slovenian language. The database includes 1,067 hours of speech; 884 hours are transcribed, while the remaining 183 hours are recordings only. This repository entry includes audio files only; the transcriptions are available at http://hdl.handle.net/11356/1772.
The data are structured as follows: (1) Artur-B, read speech, 573 hours in total. It includes: (1a) Artur-B-Brani, 485 hours: readings of sentences pre-selected from a 10% increment of the Gigafida 2.0 corpus. The sentences were chosen so that they reflect the natural, i.e. actual, distribution of triphones in words. They were distributed among 1,000 speakers, so that approx. 30 minutes of read speech were recorded from each speaker. The speakers were balanced according to gender, age and region, and a small proportion of speakers were non-native speakers of Slovene. Each sentence is its own audio file and has a corresponding transcription file. (1b) Artur-B-Crkovani, 10 hours: spellings. Speakers were asked to spell abbreviations and personal names and surnames, all chosen so that all Slovene letters were covered, plus the most common foreign letters. (1c) Artur-B-Studio, 51 hours: designed for the development of speech synthesis. The sentences were read in a studio by a single speaker. Each sentence is its own audio file and has a corresponding transcription file. (1d) Artur-B-Izloceno, 27 hours: recordings that include different types of errors, typically incorrect reading of sentences or a noisy environment.
(2) Artur-J, public speech, 62 hours in total. It includes: (2a) Artur-J-Splosni, 62 hours: media recordings, online recordings of conferences, workshops, education videos, etc.
(3) Artur-N, private speech, 74 hours in total. It includes: (3a) Artur-N-Obrazi, 6 hours: speakers were asked to describe faces in pictures. Designed for face-description domain-specific speech recognition. (3b) Artur-N-PDom, 7 hours: speakers were asked to read pre-written sentences, as well as to express instructions for a potential smart-home system freely. Designed for smart-home domain-specific speech recognition. (3c) Artur-N-Prosti, 61 hours: monologues and dialogues between two persons, recorded for the purposes of the Artur database creation. Speakers were asked to converse or speak freely on casual topics.
(4) Artur-P, parliamentary speech, 201 hours in total. It includes: (4a) Artur-P-SejeDZ, 201 hours: Speech from the Slovene National Assembly.
Further information on the database is available in the Artur-DOC file, which is part of this repository entry.
== Quick facts ==
The most up-to-date and comprehensive podcast database available:
- All languages & all countries
- Includes over 3,500,000 podcasts
- Features 35+ data fields, such as basic metadata, global rank, RSS feed (with audio URLs), Spotify links, and more
- Delivered in SQLite format

Learn how we build a high-quality podcast database: https://www.listennotes.help/article/105-high-quality-podcast-database-from-listen-notes
== Use Cases ==
- AI training, including speech recognition, generative AI, voice cloning / synthesis, and news analysis
- Alternative data for investment research, such as sentiment analysis of executive interviews, market research, and tracking investment themes
- PR and marketing, including social monitoring, content research, outreach, and guest booking
- ...
== Data Attributes ==
See the full list of data attributes on this page: https://www.listennotes.com/podcast-datasets/fields/?filter=podcast_only
How to access podcast audio files: Our dataset includes RSS feed URLs for all podcasts. You can retrieve audio for over 170 million episodes directly from these feeds. With access to the raw audio, you’ll have high-quality podcast speech data ideal for AI training and related applications.
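As an illustration, a minimal sketch of pulling episode audio URLs out of one RSS feed with the feedparser library; the feed URL below is a placeholder, and real feed URLs come from the dataset's RSS field:

```python
import feedparser  # pip install feedparser

# Placeholder feed URL; substitute an RSS feed URL from the dataset.
feed = feedparser.parse("https://example.com/podcast/rss")

for entry in feed.entries:
    # Podcast audio is conventionally attached as an RSS <enclosure>.
    for enclosure in entry.get("enclosures", []):
        if enclosure.get("type", "").startswith("audio/"):
            print(entry.get("title", ""), enclosure.get("href"))
```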
== Custom Offers ==
We can provide custom datasets based on your needs, such as language-specific data, daily/weekly/monthly update frequency, or one-time purchases.
We also provide a RESTful API at PodcastAPI.com
Contact us: hello@listennotes.com
== Need Help? ==
If you have any questions about our products, feel free to reach out at hello@listennotes.com
== About Listen Notes, Inc. ==
Since 2017, Listen Notes, Inc. has provided the leading podcast search engine and podcast database.
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
This database was collected and packaged under the auspices of the IST-EU STREP project HIWIRE (Human Input that Works In Real Environments). The database was designed as a tool for the development and testing of speech processing and recognition techniques dealing with robust non-native speech recognition.

The database contains 8,099 English utterances pronounced by non-native speakers (31 French, 20 Greek, 20 Italian, and 10 Spanish speakers). The collected utterances correspond to human input in a command-and-control aeronautics application. The data was recorded in a studio with a close-talking microphone, and real noise recorded in an airplane cockpit was artificially added to the data. The signals are provided in clean (studio recordings with close-talking microphone), low-, mid- and high-noise conditions; the three noise levels correspond approximately to signal-to-noise ratios of 10 dB, 5 dB and -5 dB, respectively.

Clean audio data was recorded in different office rooms using a close-talking microphone (Plantronics USB-45) to minimize ambient acoustic effects. The sampling frequency is 16 kHz and data is stored in Windows PCM WAV 16-bit mono format.

Recordings correspond to prompts extracted from an aeronautic command-and-control application. A total of 8,099 utterances were recorded, corresponding to 81 speakers pronouncing 100 utterances each. The speaker distribution is as follows:
| Country | # Speakers | # Utterances |
|---|---|---|
| France | 31 (38.3%) | 3100 |
| Greece | 20 (24.7%) | 2000 |
| Italy | 20 (24.7%) | 2000 |
| Spain | 10 (12.3%) | 999 |
| Total | 81 | 8099 |
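The noisy conditions were produced by artificially adding cockpit noise to the clean studio recordings at fixed SNRs. As a sketch of that kind of mixing (not the project's actual tooling), assuming numpy arrays of samples:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean` scaled to a target signal-to-noise ratio."""
    # Tile or trim the noise to the utterance length.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]

    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Gain such that 10*log10(p_clean / (gain**2 * p_noise)) == snr_db.
    gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + gain * noise
```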
The EMODB database is a freely available German emotional speech database, created by the Institute of Communication Science at the Technical University of Berlin, Germany. Ten professional speakers (five male and five female) participated in the recordings. The database contains a total of 535 utterances and comprises seven emotions: 1) anger; 2) boredom; 3) anxiety; 4) happiness; 5) sadness; 6) disgust; and 7) neutral. The data was recorded at a 48 kHz sampling rate and then down-sampled to 16 kHz.
Every utterance is named according to the same scheme:
Example: 03a01Fa.wav is the audio file from Speaker 03 speaking text a01 with the emotion "Freude" (Happiness).
| letter | emotion (English) | letter | emotion (German) |
|---|---|---|---|
| A | anger | W | Ärger (Wut) |
| B | boredom | L | Langeweile |
| D | disgust | E | Ekel |
| F | anxiety/fear | A | Angst |
| H | happiness | F | Freude |
| S | sadness | T | Trauer |

N = neutral version
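A minimal sketch of decoding this naming scheme; the emotion letters in the file names are the German codes from the table above, and the trailing lowercase letter is read as the take/version marker, as in the example:

```python
# German letter codes -> English emotion names (from the table above).
EMOTIONS = {
    "W": "anger",         # Ärger (Wut)
    "L": "boredom",       # Langeweile
    "E": "disgust",       # Ekel
    "A": "anxiety/fear",  # Angst
    "F": "happiness",     # Freude
    "T": "sadness",       # Trauer
    "N": "neutral",
}

def parse_emodb_name(filename):
    """Decode e.g. '03a01Fa.wav' -> speaker '03', text 'a01', happiness."""
    stem = filename.rsplit(".", 1)[0]
    return {
        "speaker": stem[0:2],
        "text": stem[2:5],
        "emotion": EMOTIONS[stem[5]],
        "version": stem[6:],
    }

print(parse_emodb_name("03a01Fa.wav"))
```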
Emotion classification from speech is of increasing interest in the field of speech processing. The objective of emotion classification is to recognize different emotions from the speech signal. A person's emotional state affects the speech production mechanism: breathing rate and muscle tension change from the neutral condition, so the resulting speech signal may have different characteristics from neutral speech.
The performance of speech recognition or speaker recognition decreases significantly if a model trained on neutral speech is tested on emotional speech. Machine learning enthusiasts can therefore start working on speech emotion recognition problems and develop robust models.
Citation: DOI 10.1007/s10579-011-9145-0
Collection of audio recordings by the Department of Computer Science at the University of Toronto from speakers with and without dysarthria. Useful for tasks like audio classification, disease detection, speech processing, etc.
Directory Structure:
F_Con : audio samples of female speakers from the control group, i.e., female speakers without dysarthria. 'FC01' in the folder names and filenames refers to the first speaker, 'FC02' to the second speaker, and so on.
F_Dys : audio samples of female speakers with dysarthria. 'F01' in the folder names and filenames refers to the first speaker, 'F03' to the second speaker, and so on.
M_Con : audio samples of male speakers from the control group, i.e., male speakers without dysarthria. 'MC01' in the folder names and filenames refers to the first speaker, 'MC02' to the second speaker, and so on.
M_Dys : audio samples of male speakers with dysarthria. 'M01' in the folder names and filenames refers to the first speaker, 'M03' to the second speaker, and so on.
Across all four groups, 'S01' refers to the first recording session with a speaker, 'S02' to the second session, and so on; 'arrayMic' indicates that the audio was recorded with an array microphone, whereas 'headMic' indicates a headpiece microphone.
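A minimal sketch of indexing the corpus from these conventions; the root folder name `TORGO` and the exact nesting of session and microphone folders are assumptions:

```python
import os

root_dir = "TORGO"  # hypothetical root holding F_Con, F_Dys, M_Con, M_Dys
records = []
for root, _, files in os.walk(root_dir):
    for name in files:
        if not name.endswith(".wav"):
            continue
        path = os.path.join(root, name)
        group = os.path.relpath(path, root_dir).split(os.sep)[0]  # e.g. 'F_Dys'
        records.append({
            "path": path,
            "gender": group[0],                   # 'F' or 'M'
            "dysarthria": group.endswith("Dys"),  # control vs dysarthric group
            "mic": "array" if "arrayMic" in path else "head",
        })
```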
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
The Aurora project was originally set up to establish a worldwide standard for the feature extraction software which forms the core of the front-end of a DSR (Distributed Speech Recognition) system. The AURORA-5 database was mainly developed to investigate the influence of hands-free speech input in noisy room environments on the performance of automatic speech recognition. Furthermore, two test conditions are included to study the influence of transmitting the speech over a mobile communication system.

The earlier three Aurora experiments focused on additive noise and the influence of some telephone frequency characteristics. Aurora-5 tries to cover all effects as they occur in realistic application scenarios. The focus was put on two scenarios. The first is hands-free speech input in the noisy car environment, with the intention of controlling either devices in the car itself or retrieving information from a remote speech server over the telephone. The second covers hands-free speech input in an office or living room to control, e.g., a telephone device or some audio/video equipment.

The AURORA-5 database contains the following data:
- Artificially distorted versions of the recordings from adult speakers in the TI-Digits speech database, downsampled to a sampling frequency of 8,000 Hz. The distortions consist of: additive background noise; the simulation of hands-free speech input in rooms; and the simulation of transmitting speech over cellular telephone networks.
- A subset of recordings from the meeting recorder project at the International Computer Science Institute. The recordings contain sequences of digits uttered by different speakers in hands-free mode in a meeting room.
- A set of scripts for running recognition experiments on the above-mentioned speech data. The experiments are based on the freely available HTK software package; HTK itself is not part of this resource.

Further information is available at: http://aurora.hsnr.de
https://hdl.handle.net/21.11146/13/.well-known/skolem/bead126e-5168-3e2b-8959-07eaf5d458d5
This database was originally developed by Nordic Language Technology in the 1990s to facilitate automatic speech recognition (ASR) in Norwegian. A reorganized and more user-friendly version of this database is also available from The Language Bank. Type "sbr-54" in the search bar to find the updated version.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Artur 1.0 is a speech database designed for the needs of developing automatic speech recognition for the Slovenian language. The complete database includes 1,067 hours of speech, of which 884 hours are transcribed, while the remaining 183 hours are recordings only. This repository entry includes transcriptions only, while the audio files are available at http://hdl.handle.net/11356/1776.
Transcriptions are available in the original TRS format of the Transcriber 1.5.1 tool, which was used to make the transcriptions. All transcriptions were made manually or manually corrected.
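TRS is the XML format produced by Transcriber, so the files can be read with a standard XML parser. A minimal sketch, assuming the usual Transcriber element names (`Turn`, `Sync`) and a placeholder file name:

```python
import xml.etree.ElementTree as ET

tree = ET.parse("example.trs")  # placeholder file name

for turn in tree.iter("Turn"):
    start, end = turn.get("startTime"), turn.get("endTime")
    # In Transcriber XML, spoken text sits in the tails of <Sync> anchors.
    words = " ".join((sync.tail or "").strip() for sync in turn.iter("Sync"))
    print(f"[{start}-{end}] {words}")
```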
The data are structured as follows: (1) Artur-B, read speech, 573 hours in total. It includes: (1a) Artur-B-Brani, 485 hours: readings of sentences pre-selected from a 10% increment of the Gigafida 2.0 corpus. The sentences were chosen so that they reflect the natural, i.e. actual, distribution of triphones in words. They were distributed among 1,000 speakers, so that approx. 30 minutes of read speech were recorded from each speaker. The speakers were balanced according to gender, age and region, and a small proportion of speakers were non-native speakers of Slovene. Each sentence is its own transcription file and has a corresponding audio file. (1b) Artur-B-Crkovani, 10 hours: spellings. Speakers were asked to spell abbreviations and personal names and surnames, all chosen so that all Slovene letters were covered, plus the most common foreign letters. The transcriptions were corrected manually. (1c) Artur-B-Studio, 51 hours: designed for the development of speech synthesis. The sentences were read in a studio by a single speaker. Each sentence is its own transcription file and has a corresponding recording. (1d) Artur-B-Izloceno, 27 hours: in TRS format only. The recordings that correspond to these transcriptions include different types of errors, typically incorrect reading of sentences or a noisy environment.
(2) Artur-J, public speech, 62 hours in total. It includes: (2a) Artur-J-Splosni, 62 hours: manual transcriptions of media recordings, online recordings of conferences, workshops, educational videos, etc. Transcriptions were made in two modes:
- 'pog' files contain pronunciation-based or citation-phonemic transcriptions (the output phoneme string derived from the orthographic form by letter-to-sound rules);
- 'std' files contain standardised or expanded orthographic transcriptions (standard Slovenian spelling is used to indicate the spoken words, with additional rules and word lists for non-standard lexis).
(3) Artur-N, private speech, 74 hours in total. It includes: (3a) Artur-N-Obrazi, 6 hours: speakers were asked to describe faces in pictures. Designed for face-description domain-specific speech recognition. (3b) Artur-N-PDom, 7 hours: speakers were asked to read pre-written sentences, as well as to express instructions for a potential smart-home system freely. Designed for smart-home domain-specific speech recognition. (3c) Artur-N-Prosti, 61 hours: monologues and dialogues between two persons, recorded for the purposes of the Artur database creation. Speakers were asked to converse or speak freely on casual topics. The manual transcriptions were done in two modes, the same as for Artur-J.
(4) Artur-P, parliamentary speech, 201 hours in total. It includes: (4a) Artur-P-SejeDZ, 201 hours: Transcriptions of speech from the Slovene National Assembly. Manual transcriptions were done in two modes, the same as for Artur-J.
Further information on the database, including various statistics, is available in the Artur-DOC directory, which is part of Artur_1.0_TRS.
This dataset contains common speech and noise corpora for evaluating fundamental frequency estimation algorithms, packaged as convenient JBOF dataframes. Each corpus is freely available on its own and allows redistribution:
These files are published as part of my dissertation, "Pitch of Voiced Speech in the Short-Time Fourier Transform: Algorithms, Ground Truths, and Evaluation Methods", and in support of the Replication Dataset for Fundamental Frequency Estimation.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The "SpeechRec_LanguageLearning_ConversationalSkills" dataset is a collection of data generated in a game-based language learning environment, aiming to explore the impact of Speech Recognition Technology (SRT) on the development of conversational skills. The dataset encompasses speaking test results conducted within the context of language learning games utilizing SRT.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Part of the dissertation Pitch of Voiced Speech in the Short-Time Fourier Transform: Algorithms, Ground Truths, and Evaluation Methods.
© 2020, Bastian Bechtold. All rights reserved.
Estimating the fundamental frequency of speech remains an active area of research, with varied applications in speech recognition, speaker identification, and speech compression. A vast number of algorithms for estimating this quantity have been proposed over the years, and a number of speech and noise corpora have been developed for evaluating their performance. The present dataset contains estimated fundamental frequency tracks of 25 algorithms on six speech corpora and two noise corpora, at nine signal-to-noise ratios between -20 and 20 dB SNR, as well as an additional evaluation on synthetic harmonic tone complexes in white noise.
The dataset also contains pre-calculated performance measures, both novel and traditional, in reference to each speech corpus' ground truth, the algorithms' own clean-speech estimates, and our own consensus truth. It can thus serve as the basis for a comparison study, to replicate existing studies from a larger dataset, or as a reference for developing new fundamental frequency estimation algorithms. All source code and data are available to download and are entirely reproducible, albeit requiring about one year of processor time.
Included Code and Data
- ground truth data.zip: a JBOF dataset of fundamental frequency estimates and ground truths of all speech files in the following corpora:
- noisy speech data.zip: a JBOF dataset of fundamental frequency estimates of speech files mixed with noise from the following corpora:
- synthetic speech data.zip: a JBOF dataset of fundamental frequency estimates of synthetic harmonic tone complexes in white noise.
- noisy_speech.pkl and synthetic_speech.pkl: pickled Pandas dataframes of performance metrics derived from the above data for the following list of fundamental frequency estimation algorithms:
- noisy speech evaluation.py and synthetic speech evaluation.py: Python programs that calculate the above Pandas dataframes from the above JBOF datasets. They calculate the following performance measures:
- Pipfile: a pipenv-compatible pipfile for installing all prerequisites necessary for running the above Python programs.

The Python programs take about an hour to compute on a fast 2019 computer, and require at least 32 GB of memory.
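A minimal sketch of inspecting the pickled dataframes with pandas; the exact column layout depends on the evaluation scripts:

```python
import pandas as pd

noisy = pd.read_pickle("noisy_speech.pkl")
synthetic = pd.read_pickle("synthetic_speech.pkl")

# Inspect the performance metrics before building a comparison study.
print(noisy.columns.tolist())
print(noisy.head())
```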
https://cdla.io/permissive-1-0
We present a speech data corpus that simulates a "dinner party" scenario taking place in an everyday home environment. The corpus was created by recording multiple groups of four Amazon employee volunteers having a natural conversation in English around a dining table. The participants were recorded by a single-channel close-talk microphone and by five far-field 7-microphone array devices positioned at different locations in the recording room. The dataset contains the audio recordings and human-labeled transcripts of a total of 10 sessions with a duration between 15 and 45 minutes. The corpus was created to advance the field of noise-robust and distant speech processing and is intended to serve as a public research and benchmarking data set.
Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0) https://creativecommons.org/licenses/by-nc-sa/3.0/
License information was derived automatically
In order to analyze gender by voice and speech, a training database was required. A database was built using thousands of samples of male and female voices, each labeled by gender (male or female). Voice samples were collected from the following resources:
The output from the pre-processed WAV files was saved into a CSV file containing 3,168 rows and 21 columns (20 feature columns and one label column for the classification of male or female). You can download the pre-processed dataset in CSV format using the link above.
The following acoustic properties of each voice are measured:
- duration: length of signal
- meanfreq: mean frequency (in kHz)
- sd: standard deviation of frequency
- median: median frequency (in kHz)
- Q25: first quartile (in kHz)
- Q75: third quartile (in kHz)
- IQR: interquartile range (in kHz)
- skew: skewness (see note in specprop description)
- kurt: kurtosis (see note in specprop description)
- sp.ent: spectral entropy
- sfm: spectral flatness
- mode: mode frequency
- centroid: frequency centroid (see specprop)
- peakf: peak frequency (frequency with highest energy)
- meanfun: average of fundamental frequency measured across acoustic signal
- minfun: minimum fundamental frequency measured across acoustic signal
- maxfun: maximum fundamental frequency measured across acoustic signal
- meandom: average of dominant frequency measured across acoustic signal
- mindom: minimum of dominant frequency measured across acoustic signal
- maxdom: maximum of dominant frequency measured across acoustic signal
- dfrange: range of dominant frequency measured across acoustic signal
- modindx: modulation index, calculated as the accumulated absolute difference between adjacent measurements of fundamental frequency divided by the frequency range

Note: the duration and peak frequency (peakf) features were removed from training. Duration refers to the length of the recording, which for training is cut off at 20 seconds. peakf was omitted due to time and CPU constraints in calculating the value. As a result, all records would have had the same value for duration (20) and peak frequency (0).
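A minimal sketch of training a baseline classifier on the pre-processed CSV; the file name `voice.csv` and the label column name `label` are assumptions:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder file name; 20 feature columns plus one label column.
df = pd.read_csv("voice.csv")
X, y = df.drop(columns=["label"]), df["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```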
Photo by Jason Rosewell on Unsplash
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dialectal Voice is a community project initiated by AIOX Labs to facilitate voice recognition by intelligent systems. The need for AI systems capable of recognizing the human voice is increasingly expressed within communities, yet for some languages, such as Darija, there are not enough voice technology solutions. To meet this need, we set up this program of iterative and interactive construction of a dialectal database, open to all, in order to help improve voice recognition and generation models.
The TIMIT Acoustic-Phonetic Continuous Speech Corpus is a standard dataset used for evaluation of automatic speech recognition systems. It consists of recordings of 630 speakers of 8 dialects of American English each reading 10 phonetically-rich sentences. It also comes with the word and phone-level transcriptions of the speech.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The SNABI speech database can be used to train continuous speech recognition for the Slovene language. The database comprises 1,530 sentences, 150 words, and the alphabet. 132 speakers were recorded, each reading 200 sentences or more, which resulted in more than 15,000 recordings of speech signal in the database. The recordings were made in a studio (SNABI SI_SSQ) and over a telephone line (SNABI SI_SFN).
THCHS-30 is a free Chinese speech database that can be used to build a full-fledged Chinese speech recognition system.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Our project, “Indonesian Media Audio Database,” is designed to establish a rich and diverse dataset tailored for training advanced machine learning models in language processing, speech recognition, and cultural analysis.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Database description:
The written and spoken digits database is not a new database but one constructed from existing databases, in order to provide a ready-to-use database for multimodal fusion [1].
The written digits database is the original MNIST handwritten digits database [2] with no additional processing. It consists of 70000 images (60000 for training and 10000 for test) of 28 x 28 = 784 dimensions.
The spoken digits database was extracted from Google Speech Commands [3], an audio dataset of spoken words that was proposed to train and evaluate keyword spotting systems. It consists of 105829 utterances of 35 words, amongst which 38908 utterances of the ten digits (34801 for training and 4107 for test). A pre-processing was done via the extraction of the Mel Frequency Cepstral Coefficients (MFCC) with a framing window size of 50 ms and frame shift size of 25 ms. Since the speech samples are approximately 1 s long, we end up with 39 time slots. For each one, we extract 12 MFCC coefficients with an additional energy coefficient. Thus, we have a final vector of 39 x 13 = 507 dimensions. Standardization and normalization were applied on the MFCC features.
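A sketch of a comparable MFCC front-end using librosa; the library choice is an assumption, and exact framing/padding details may yield a slightly different frame count than the 39 reported above:

```python
import librosa

# Placeholder file name; Google Speech Commands audio is 16 kHz, ~1 s.
y, sr = librosa.load("example.wav", sr=16000)
y = librosa.util.fix_length(y, size=sr)  # pad/trim to exactly 1 s

mfcc = librosa.feature.mfcc(
    y=y, sr=sr,
    n_mfcc=13,                   # 12 coefficients + an energy-like C0
    n_fft=int(0.050 * sr),       # 50 ms framing window
    hop_length=int(0.025 * sr),  # 25 ms frame shift
)
features = mfcc.T.flatten()  # (frames x 13) -> one feature vector
print(mfcc.shape, features.shape)
```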
To construct the multimodal digits dataset, we associated written and spoken digits of the same class, respecting the initial partitioning in [2] and [3] for the training and test subsets. Since we have fewer samples for the spoken digits, we duplicated some random samples to match the number of written digits and obtain a multimodal digits database of 70000 samples (60000 for training and 10000 for test).
The dataset is provided in six files as described below. Therefore, if a shuffle is performed on the training or test subsets, it must be performed in unison with the same order for the written digits, spoken digits and labels.
Files:
data_wr_train.npy: 60000 samples of 784-dimensional written digits for training;
data_sp_train.npy: 60000 samples of 507-dimensional spoken digits for training;
labels_train.npy: 60000 labels for the training subset;
data_wr_test.npy: 10000 samples of 784-dimensional written digits for test;
data_sp_test.npy: 10000 samples of 507-dimensional spoken digits for test;
labels_test.npy: 10000 labels for the test subset.
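A minimal sketch of loading the six files and shuffling the training subset in unison, as required above; the array shapes in the comments follow the description:

```python
import numpy as np

data_wr_train = np.load("data_wr_train.npy")  # (60000, 784)
data_sp_train = np.load("data_sp_train.npy")  # (60000, 507)
labels_train = np.load("labels_train.npy")    # 60000 labels
data_wr_test = np.load("data_wr_test.npy")    # (10000, 784)
data_sp_test = np.load("data_sp_test.npy")    # (10000, 507)
labels_test = np.load("labels_test.npy")      # 10000 labels

# Shuffle with a single permutation so written digits, spoken digits,
# and labels stay aligned sample-for-sample.
order = np.random.default_rng(0).permutation(len(labels_train))
data_wr_train = data_wr_train[order]
data_sp_train = data_sp_train[order]
labels_train = labels_train[order]
```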
References:
[1] Khacef, L. et al. (2020), "Brain-Inspired Self-Organization with Cellular Neuromorphic Computing for Multimodal Unsupervised Learning".
[2] LeCun, Y. & Cortes, C. (1998), "MNIST handwritten digit database".
[3] Warden, P. (2018), "Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition".