Attribution-NonCommercial-NoDerivs 3.0 (CC BY-NC-ND 3.0): https://creativecommons.org/licenses/by-nc-nd/3.0/
License information was derived automatically
The TIMIT dataset is a corpus of read speech. It consists of recordings of 630 speakers, each reading 10 phonetically balanced sentences. The dataset is divided into a training set of 462 speakers and a test set of 168 speakers.
The TIMIT corpus of read speech has been designed to provide speech data for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of automatic speech recognition systems. TIMIT has resulted from the joint efforts of several sites under sponsorship from the Defense Advanced Research Projects Agency - Information Science and Technology Office (DARPA-ISTO). Text corpus design was a joint effort among the Massachusetts Institute of Technology (MIT), Stanford Research Institute (SRI), and Texas Instruments (TI). The speech was recorded at TI, transcribed at MIT, and has been maintained, verified, and prepared for CD-ROM production by the National Institute of Standards and Technology (NIST). This file contains a brief description of the TIMIT Speech Corpus. Additional information including the referenced material and some relevant reprints of articles may be found in the printed documentation which is also available from NTIS (NTIS# PB91-100354).
TIMIT contains a total of 6300 sentences: 10 sentences spoken by each of 630 speakers from 8 major dialect regions of the United States. Table 1 shows the number of speakers for the 8 dialect regions, broken down by sex, with percentages given in parentheses. A speaker's dialect region is the geographical area of the U.S. where they lived during their childhood years. The geographical areas correspond to recognized dialect regions in the U.S. (Language Files, Ohio State University Linguistics Dept., 1982), with the exception of the Western region (dr7), in which dialect boundaries are not known with any confidence, and dialect region 8, whose speakers moved around a lot during their childhood.
Table 1: Dialect distribution of speakers

Dialect
Region (dr)    #Male        #Female      Total
-----------    ---------    ---------    ----------
    1           31 (63%)     18 (37%)     49 (8%)
    2           71 (70%)     31 (30%)    102 (16%)
    3           79 (77%)     23 (23%)    102 (16%)
    4           69 (69%)     31 (31%)    100 (16%)
    5           62 (63%)     36 (37%)     98 (16%)
    6           30 (65%)     16 (35%)     46 (7%)
    7           74 (74%)     26 (26%)    100 (16%)
    8           22 (67%)     11 (33%)     33 (5%)
-----------    ---------    ---------    ----------
    8          438 (70%)    192 (30%)    630 (100%)
The dialect regions are:
dr1: New England
dr2: Northern
dr3: North Midland
dr4: South Midland
dr5: Southern
dr6: New York City
dr7: Western
dr8: Army Brat (moved around)
The text material in the TIMIT prompts (found in the file "prompts.doc") consists of 2 dialect "shibboleth" sentences designed at SRI, 450 phonetically-compact sentences designed at MIT, and 1890 phonetically-diverse sentences selected at TI. The dialect sentences (the SA sentences) were meant to expose the dialectal variants of the speakers and were read by all 630 speakers. The phonetically-compact sentences were designed to provide a good coverage of pairs of phones, with extra occurrences of phonetic contexts thought to be either difficult or of particular interest. Each speaker read 5 of these sentences (the SX sentences) and each text was spoken by 7 different speakers. The phonetically-diverse sentences (the SI sentences) were selected from existing text sources - the Brown Corpus (Kučera and Francis, 1967) and the Playwrights Dialog (Hultzen et al., 1964) - so as to add diversity in sentence types and phonetic contexts. The selection criteria maximized the variety of allophonic contexts found in the texts. Each speaker read 3 of these sentences, with each sentence being read by only a single speaker. Table 2 summarizes the speech material in TIMIT.
Table 2: TIMIT speech material

Sentence Type    #Sentences    #Speakers    Total    #Sentences/Speaker
-------------    ----------    ---------    -----    ------------------
Dialect  (SA)         2           630        1260            2
Compact  (SX)       450             7        3150            5
Diverse  (SI)      1890             1        1890            3
-------------    ----------    ---------    -----    ------------------
Total              2342                      6300           10
Suggested Training/Test Subdivision
The speech material has been subdivided into portions for training and testing. The criteria for the subdivision are described in the file "testset.doc". THIS SUBDIVISION HAS NO RELATION TO THE DATA DISTRIBUTED ON THE PROTOTYPE VERSION OF THE CDROM.
The test data has a core portion containing 24 speakers, 2 males and 1 female from each dialect region. The core test speakers are shown in Table 3. Each speaker read a different set of SX sentences. Thus the core test material contains 192 sentences, 5 SX and 3 SI for each speaker, each having a distinct text prompt. (A short file-selection sketch follows Table 3.)
Table 3: The core test set of 24 speakers
Dialect    Male          Female
-------    ----------    ------
   1       DAB0, WBT0    ELC0
   2       TAS1, WEW0    PAS0
   3       JMP0, LNT0    PKT0
   4       LLL0, TLS0    JLM0
   5       BPM0, KLT0    NLP0
   6       CMJ0, JDH0    MGD0
   7       GRT0, NJM0    DHC0
   8       JLN0, PAM0    MLD0
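Because TIMIT names speaker directories with a sex prefix plus the speaker ID (e.g. mdab0, felc0), the core test utterances can be pulled from a local copy of the corpus with a short script. Below is a minimal Python sketch, assuming the usual lower-case test/<dialect>/<speaker> directory layout; the function name and paths are illustrative, not part of the corpus documentation.

    import os

    # Core test speakers from Table 3, with the sex prefix (M/F) used in TIMIT directory names.
    CORE_MALE = ["DAB0", "WBT0", "TAS1", "WEW0", "JMP0", "LNT0", "LLL0", "TLS0",
                 "BPM0", "KLT0", "CMJ0", "JDH0", "GRT0", "NJM0", "JLN0", "PAM0"]
    CORE_FEMALE = ["ELC0", "PAS0", "PKT0", "JLM0", "NLP0", "MGD0", "DHC0", "MLD0"]
    CORE = {("m" + s).lower() for s in CORE_MALE} | {("f" + s).lower() for s in CORE_FEMALE}

    def core_test_wavs(timit_root):
        """Yield .wav paths for the 24 core test speakers, excluding the SA sentences."""
        for dirpath, _, filenames in os.walk(os.path.join(timit_root, "test")):
            if os.path.basename(dirpath).lower() not in CORE:
                continue
            for name in filenames:
                # Keep the 5 SX and 3 SI sentences; the 2 SA dialect sentences are not
                # part of the 192-sentence core test material.
                if name.lower().endswith(".wav") and not name.lower().startswith("sa"):
                    yield os.path.join(dirpath, name)

    # len(list(core_test_wavs("/path/to/timit"))) should come out to 192 (24 speakers x 8 texts).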
A more extensive test set was obtained by including the sentences from all speakers that read any of the SX texts included in the core test set. In doing so, no sentence text appears in both the training and test sets. This complete test set contains a total of 168 speakers and 1344 utterances, accounting for about 27% of the total speech material. The resulting dialect distribution of the 168-speaker test set is given in Table 4. The complete test material contains 624 distinct texts.
Table 4: Dialect distribution for complete test set
Dialect    #Male    #Female    Total
-------    -----    -------    -----
   1          7         4        11
   2         18         8        26
   3         23         3        26
   4         16        16        32
   5         17        11        28
   6          8         3        11
   7         15         8        23
   8          8         3        11
-------    -----    -------    -----
Total       112        56       168
CDROM TIMIT Directory and File Structure
The speech and associated data are organized on the CD-ROM according to the following hierarchy:
/
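The hierarchy listing is truncated above; in the published corpus each utterance is a group of sibling files (a .wav waveform plus .txt, .wrd, and .phn transcriptions) under usage/dialect/speaker directories. A minimal Python sketch of indexing such a tree, assuming the common lower-case train/dr1/fcjf0-style layout:

    import os
    from collections import defaultdict

    def index_timit(root):
        """Map (usage, dialect, speaker, sentence_id) -> {extension: path}."""
        index = defaultdict(dict)
        for dirpath, _, filenames in os.walk(root):
            parts = dirpath.split(os.sep)
            if len(parts) < 3:
                continue
            usage, dialect, speaker = parts[-3], parts[-2], parts[-1]  # e.g. train, dr1, fcjf0
            for name in filenames:
                stem, ext = os.path.splitext(name)
                if ext.lower() in {".wav", ".txt", ".wrd", ".phn"}:
                    index[(usage, dialect, speaker, stem.lower())][ext.lower()] = os.path.join(dirpath, name)
        return index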
The TIMIT corpus of read speech was developed to provide speech data for acoustic-phonetic research and for the evaluation of automatic speech recognition systems. TIMIT contains high-quality recordings of 630 speakers covering 8 major American English dialects, with each speaker reading up to 10 phonetically rich sentences. More information on the TIMIT dataset can be found in the README available here: https://catalog.ldc.upenn.edu/docs/LDC93S1/readme.txt
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
With the rapid development of deep learning techniques, the generation and counterfeiting of multimedia material are becoming increasingly straightforward to perform. At the same time, sharing fake content on the web has become so simple that malicious users can create unpleasant situations with minimal effort. Forged media are also getting more and more complex, with manipulated videos (e.g., deepfakes, where both the visual and audio contents can be counterfeited) overtaking still images. The multimedia forensic community has addressed the possible threats that this situation implies by developing detectors that verify the authenticity of multimedia objects. However, the vast majority of these tools analyze only one modality at a time. This was not a problem as long as still images were the most widely edited media, but now that manipulated videos are becoming customary, performing monomodal analyses could be reductive. Nonetheless, the literature lacks multimodal detectors (systems that consider both audio and video components). This is due to the difficulty of developing them, but also to the scarcity of datasets containing forged multimodal data on which to train and test the designed algorithms.
In this paper we focus on the generation of an audio-visual deepfake dataset. First, we present a general pipeline for synthesizing speech deepfake content from a given real or fake video, facilitating the creation of counterfeit multimodal material. The proposed method uses Text-to-Speech (TTS) and Dynamic Time Warping (DTW) techniques to achieve realistic speech tracks. Then, we use the pipeline to generate and release TIMIT-TTS, a synthetic speech dataset generated with the most cutting-edge methods in the TTS field. It can be used as a standalone audio dataset, or combined with the DeepfakeTIMIT and VidTIMIT video datasets to perform multimodal research. Finally, we present numerous experiments to benchmark the proposed dataset in both monomodal (i.e., audio-only) and multimodal (i.e., audio and video) conditions. This highlights the need for multimodal forensic detectors and for more multimodal deepfake data.
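The pipeline's alignment step pairs TTS output with the original track via DTW. As a rough illustration (not the authors' exact implementation), the alignment path can be computed with librosa; file names, sampling rate, and the MFCC front end here are assumptions:

    import librosa

    # Load the reference (video) audio and the raw TTS output at a common sampling rate.
    ref, sr = librosa.load("reference.wav", sr=16000)   # placeholder path
    tts, _ = librosa.load("tts_output.wav", sr=16000)   # placeholder path

    # Compare MFCC sequences and compute the DTW cost matrix and warping path.
    X = librosa.feature.mfcc(y=ref, sr=sr, n_mfcc=13)
    Y = librosa.feature.mfcc(y=tts, sr=sr, n_mfcc=13)
    D, wp = librosa.sequence.dtw(X=X, Y=Y, metric="euclidean")

    # wp holds (reference_frame, tts_frame) index pairs; time-stretching the TTS audio
    # along this path yields a speech track synchronized with the original video.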
For the initial version of TIMIT-TTS (v1.0), see:
Arxiv: https://arxiv.org/abs/2209.08000
TIMIT-TTS Database v1.0: https://zenodo.org/record/6560159
The TIMIT corpus of read speech is designed to provide speech data for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. TIMIT contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. The TIMIT corpus includes time-aligned orthographic, phonetic and word transcriptions as well as a 16-bit, 16kHz speech waveform file for each utterance. Corpus design was a joint effort among the Massachusetts Institute of Technology (MIT), SRI International (SRI) and Texas Instruments, Inc. (TI). The speech was recorded at TI, transcribed at MIT and verified and prepared for CD-ROM production by the National Institute of Standards and Technology (NIST). The TIMIT corpus transcriptions have been hand verified. Test and training subsets, balanced for phonetic and dialectal coverage, are specified. Tabular computer-searchable information is included as well as written documentation.
This file contains documentation for STC-TIMIT 1.0, Linguistic Data Consortium (LDC) catalog number LDC2008S03 and ISBN 1-58563-468-9. STC-TIMIT 1.0 is a telephone version of the TIMIT Acoustic-Phonetic Continuous Speech Corpus, LDC93S1 (TIMIT). TIMIT contains broadband recordings of 630 speakers of eight major dialects of American English reading ten phonetically rich sentences. Created in 1993, TIMIT was designed to provide speech data for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. Since that time, several corpora have been developed using the TIMIT database: NTIMIT, LDC93S2 (transmitting TIMIT recordings through a telephone handset and over various channels in the NYNEX telephone network and redigitizing them); CTIMIT, LDC96S30 (passing TIMIT files through cellular telephone circuits); FFMTIMIT, LDC96S32 (re-recording TIMIT files with a free-field microphone); and HTIMIT, LDC98S67 (re-recording a subset of TIMIT files through different telephone handsets). What differentiates STC-TIMIT 1.0 from other TIMIT-derived corpora is that the entire TIMIT database was passed through an actual telephone channel in a single call. Thus, a single type of channel distortion and noise affects the whole database. The process was managed using a Dialogic switchboard for the calling and receiving ends. No transducer (microphone) was employed; the original digital signal was converted to analog using the switchboard's D/A converter, transmitted through a telephone channel and converted back to digital format before recording. As a result, the only distortion introduced is that of the telephone channel itself.
The STC-TIMIT 1.0 database is organized in the same manner as the original TIMIT corpus: 4620 files belonging to the training partition and 1680 files belonging to the test partition. Files were recorded using an 8 kHz sampling frequency and mu-law encoding. Additionally, four sets of two calibration tones were generated. These were passed through the telephone line approximately at the start of each quarter of the database (both the source and recorded calibration tones in each set are provided). The calibration tones are: a 2 sec. 1 kHz tone, and a 2 sec. sweep tone from 10 Hz to 4000 Hz. Utterances in STC-TIMIT 1.0 are time-aligned with those of TIMIT to an average precision of 0.125 ms (1 sample) by maximizing the cross-correlation between pairs of files from each corpus. Thus, labels from TIMIT may be used for STC-TIMIT 1.0, and the effects of telephone channels may be studied on a frame-by-frame basis.
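The sample-level alignment described above (maximizing cross-correlation between corpus pairs) is easy to reproduce. A minimal sketch, assuming the two signals are numpy arrays at the same sampling rate; this illustrates the idea rather than the tool used to build the corpus:

    import numpy as np
    from scipy.signal import correlate

    def best_lag(x, y):
        """Return the shift (in samples) that best aligns y to x via cross-correlation."""
        c = correlate(x, y, mode="full")
        return int(np.argmax(c)) - (len(y) - 1)

    # At the corpus's 8 kHz sampling rate, one sample corresponds to the quoted
    # 0.125 ms alignment precision.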
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The QUT-NOISE Databases and Protocols
Overview
This distribution contains the QUT-NOISE database and the code required to create the QUT-NOISE-TIMIT database from the QUT-NOISE database and a locally installed copy of the TIMIT database. It also contains code to create the QUT-NOISE-SRE protocol on top of an existing speaker recognition evaluation database (such as NIST evaluations). Further information on the QUT-NOISE and QUT-NOISE-TIMIT databases is available in our paper:
D. Dean, S. Sridharan, R. Vogt, M. Mason (2010) The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms, in Proceedings of Interspeech 2010, Makuhari Messe International Convention Complex, Makuhari, Japan.
This paper is also available in the file: docs/Dean2010, The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithm.pdf, distributed with this database.
Further information on the QUT-NOISE-SRE protocol is available in our paper: D. Dean, A. Kanagasundaram, H. Ghaemmaghami, M. Hafizur, S. Sridharan (2015) The QUT-NOISE-SRE protocol for the evaluation of noisy speaker recognition. In Proceedings of Interspeech 2015, September, Dresden, Germany.
Licensing
The QUT-NOISE data itself is licensed CC-BY-SA, and the code required to create the QUT-NOISE-TIMIT database and QUT-NOISE-SRE protocols is licensed under the BSD license. Please consult the appropriate LICENSE.txt files (in the code and QUT-NOISE directories) for more information. To attribute this database, please include the following citation:
D. Dean, S. Sridharan, R. Vogt, M. Mason (2010) The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms, in Proceedings of Interspeech 2010, Makuhari Messe International Convention Complex, Makuhari, Japan.
If your work is based upon the QUT-NOISE-SRE, please also include this citation: D. Dean, A. Kanagasundaram, H. Ghaemmaghami, M. Hafizur, S. Sridharan (2015) The QUT-NOISE-SRE protocol for the evaluation of noisy speaker recognition. In Proceedings of Interspeech 2015, September, Dresden, Germany.
Download and Installation
Download the following QUT-NOISE*.zip files and verify their integrity against the md5sums shown (a small checksum sketch follows the list):
QUT_NOISE.zip (26.7 MB, md5sum: 672461fd88782e9ea10d5c2cb7a84196)
QUT_NOISE_CAFE.zip (1.6 GB, md5sum: f87fb213c0e1c439e1b727fb258ef2cd)
QUT_NOISE_CAR.zip (1.7 GB, md5sum: d680118b4517e1257a9263b99d1ac401)
QUT_NOISE_HOME.zip (1.4 GB, md5sum: d99572ae1c118b749c1ffdb2e0cf0d2e)
QUT_NOISE_REVERB.zip (1.4 GB, md5sum: fe107ab341e6bc75de3a32c69344190e)
QUT_NOISE_STREET.zip (1.6 GB, md5sum: 68d5ebc2e60cb07927cc4d33cdf2f017)
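The checksums can be verified with md5sum on unix-based systems, or with a few lines of Python; the expected digest below is the one listed for QUT_NOISE.zip:

    import hashlib

    def md5_of(path, chunk_size=1 << 20):
        """Compute the MD5 digest of a file without reading it all into memory."""
        digest = hashlib.md5()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(chunk_size), b""):
                digest.update(block)
        return digest.hexdigest()

    assert md5_of("QUT_NOISE.zip") == "672461fd88782e9ea10d5c2cb7a84196"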
Creating QUT-NOISE-TIMIT
Obtaining TIMIT
In order to construct the QUT-NOISE-TIMIT database from the QUT-NOISE data supplied here you will need to obtain a copy of the TIMIT database from the Linguistic Data Consortium. If you just want to use the QUT-NOISE database, or you wish to combine it with different speech data, TIMIT is not required.
Creating QUT-NOISE-TIMIT
Once you have obtained TIMIT, download a copy of VOICEBOX: Speech Processing Toolbox for MATLAB and install it on your MATLABPATH.
Run matlab in the QUT-NOISE/code directory, and run the function: createQUTNOISETIMIT('/location/of/timit-cd/timit'). This will create the QUT-NOISE-TIMIT database in the QUT-NOISE/QUT-NOISE-TIMIT directory.
If you wish to verify that the QUT-NOISE-TIMIT database matches that evaluated in our original paper, please check that the md5sums (use md5sum on unix-based OSes) match those in the QUT-NOISE-TIMIT/md5sum.txt file.
Using the QUT-NOISE-SRE protocol
The code related to the QUT-NOISE-SRE protocol can be used in two ways:
To create a collection of noisy audio files across the scenarios in the QUT-NOISE database at different noise levels, or
To recreate a list of file names based on the QUT-NOISE-SRE protocol produced by another researcher, having already done (1). This allows existing research to be reproduced without having to send large volumes of audio around.
If you are interested in creating your own noisy database from an existing SRE database (1 above), please look at the example script exampleQUTNOISESRE.sh in the QUT-NOISE/code directory. You will need to make some modifications, but it should give you the right idea.
If you are interested in creating our QUT-NOISE-NIST2008 database published at Interspeech 2015, you can find the list of created noisy files in the QUT-NOISE-NIST2008.train.short2.list and QUT-NOISE-NIST2008.test.short3.list files in the QUT-NOISE/code directory.
These files can be recreated as follows (provided you have access to the NIST2008 SRE data):
Run matlab in the QUT-NOISE/code directory, and run the following function (the final argument, truncated in the original listing, presumably points at your local copy of the NIST2008 source audio; the path below is a placeholder):
createQUTNOISESREfiles('NIST2008.train.short2.list', ...
                       'QUT-NOISE-NIST2008.train.short2.list', ...
                       '/location/of/NIST2008/data')  % placeholder path to local NIST2008 audio
Abstract

Introduction

Global TIMIT Mandarin Chinese was developed by the Linguistic Data Consortium and Shanghai Jiao Tong University and consists of approximately five hours of read speech and transcripts in Mandarin Chinese. The Global TIMIT project aimed to create a series of corpora in a variety of languages with a similar set of key features as in the original TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1), which was designed for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. Specifically, these features included:

A large number of fluently-read sentences, containing a representative sample of phonetic, lexical, syntactic, semantic, and pragmatic patterns
A relatively large number of speakers
Time-aligned lexical and phonetic transcription of all utterances
Some sentences read by all speakers, others read by a few speakers, and others read by just one speaker

Data

Global TIMIT Mandarin Chinese consists of 50 speakers reading 120 sentences selected from Chinese Gigaword Fifth Edition (LDC2011T13). Among the 120 sentences read by each speaker, 20 sentences were read by all speakers, 40 sentences were read by 10 speakers, and 60 sentences were read by just one speaker, for a total of 3220 sentence types. The corpus was recorded at Shanghai Jiao Tong University, China. Speakers (25 female, 25 male) were students at the university and all achieved Class 2 Level 1 or better on the Putonghua Shuiping Ceshi (the national standard Mandarin proficiency test). All speech data are presented as 16 kHz, 16-bit flac-compressed wav files. Each file has accompanying phone and word segmentation files, as well as Praat TextGrid files.
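Each utterance ships as 16 kHz flac audio with phone and word segmentations plus Praat TextGrids. A minimal reading sketch, assuming the third-party soundfile and textgrid Python packages; the file names are illustrative, and the actual tier names follow the LDC release:

    import soundfile as sf   # pip install soundfile
    import textgrid          # pip install textgrid

    audio, sample_rate = sf.read("utterance.flac")          # 16 kHz, 16-bit audio
    tg = textgrid.TextGrid.fromFile("utterance.TextGrid")   # time-aligned annotations

    for tier in tg:          # e.g. phone and word tiers
        for interval in tier:
            print(tier.name, interval.minTime, interval.maxTime, interval.mark)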
This version of the TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1) has all the waveform files formatted with ms-wav / RIFF headers, to make the corpus more accessible to a wider audience. The TIMIT corpus of read speech is designed to provide speech data for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. TIMIT contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. The TIMIT corpus includes time-aligned orthographic, phonetic and word transcriptions as well as a 16-bit, 16kHz speech waveform file for each utterance. Corpus design was a joint effort among the Massachusetts Institute of Technology (MIT), SRI International (SRI) and Texas Instruments, Inc. (TI). The speech was recorded at TI, transcribed at MIT and verified and prepared for CD-ROM production by the National Institute of Standards and Technology (NIST). The TIMIT corpus transcriptions have been hand verified. Test and training subsets, balanced for phonetic and dialectal coverage, are specified. Tabular computer-searchable information is included as well as written documentation.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The canonical metadata on NLTK:
<package id="timit" name="TIMIT Corpus Sample"
sample="True"
license="This corpus sample is Copyright 1993 Linguistic Data Consortium, and is distributed under the terms of the Creative Commons Attribution, Non-Commercial, ShareAlike license. http://creativecommons.org/"
webpage="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1"
unzip="1"
/>
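Once the sample is fetched through NLTK's downloader, it can be explored directly from Python; a brief sketch (output naturally depends on the local install):

    import nltk
    nltk.download("timit")                  # fetch the TIMIT corpus sample
    from nltk.corpus import timit

    utterance = timit.utteranceids()[0]     # utterance ids combine speaker and sentence
    print(timit.words(utterance))           # word-level transcription
    print(timit.phones(utterance))          # phone-level transcription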
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
See http://conradsanderson.id.au/vidtimit/ for details.
Summary: Video and corresponding audio recordings of 43 people, reciting short sentences. Useful for research on topics such as automatic lip reading, multi-view face recognition, multi-modal speech recognition and person identification.
TIMITPhones: TIMIT Phoneme Dataset
This corpus is a phoneme-level derivative of the original TIMIT Acoustic-Phonetic Continuous Speech Corpus. Each entry pairs a 1-second waveform excerpt with a single phoneme label taken from the 61-phone TIMIT inventory (mappings to the 39-phone and broad-class sets are also provided). This version is designed for quick prototyping of phoneme classifiers or for probing acoustic representations.
Supported Tasks and Leaderboards
Automatic… See the full description on the dataset page: https://huggingface.co/datasets/IParraMartin/TIMITPhones.
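The dataset can be loaded with the Hugging Face datasets library; a minimal sketch in which the split and field names are assumptions to be checked against the dataset card:

    from datasets import load_dataset

    ds = load_dataset("IParraMartin/TIMITPhones", split="train")  # split name assumed
    example = ds[0]
    print(example.keys())   # e.g. waveform and phoneme-label fields, per the dataset card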
This dataset contains common speech and noise corpora for evaluating fundamental frequency estimation algorithms, packaged as convenient JBOF dataframes. Each corpus is freely available on its own and permits redistribution:
These files are published as part of my dissertation, "Pitch of Voiced Speech in the Short-Time Fourier Transform: Algorithms, Ground Truths, and Evaluation Methods", and in support of the Replication Dataset for Fundamental Frequency Estimation.
CC0 1.0 Universal (CC0 1.0) Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Tommy NgX
Released under CC0: Public Domain
The Deepfake-TIMIT dataset contains 100,000 images of faces manipulated using Deepfakes.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
DeepfakeTIMIT is a database of videos where faces are swapped using an open-source GAN-based approach (adapted from https://github.com/shaoanlu/faceswap-GAN), which was in turn developed from the original autoencoder-based deepfake algorithm.
When creating the database, we manually selected 16 similar-looking pairs of people from the publicly available VidTIMIT database. For each of the 32 subjects, we trained two different models: a lower-quality (LQ) model with 64 x 64 input/output size, and a higher-quality (HQ) model with 128 x 128 size (see the available images for an illustration). Since there are 10 videos per person in the VidTIMIT database, we generated 320 videos for each version, resulting in 640 total videos with swapped faces. For the audio, we kept the original audio track of each video, i.e., no manipulation was done to the audio channel.
Any publication (e.g., conference paper, journal article, technical report, book chapter) resulting from the usage of DeepfakeTIMIT must cite the following paper:
P. Korshunov and S. Marcel,
DeepFakes: a New Threat to Face Recognition? Assessment and Detection.
arXiv and Idiap Research Report
Any publication (e.g., conference paper, journal article, technical report, book chapter) resulting from the usage of VidTIMIT and subsequently DeepfakeTIMIT must also cite the following paper:
C. Sanderson and B.C. Lovell,
Multi-Region Probabilistic Histograms for Robust and Scalable Identity Inference.
Lecture Notes in Computer Science (LNCS), Vol. 5558, pp. 199-208, 2009.
Abstract

Introduction

The Audiovisual Database of Spoken American English, Linguistic Data Consortium (LDC) catalog number LDC2009V01 and ISBN 1-58563-496-4, was developed at Butler University, Indianapolis, IN in 2007 for use by a variety of researchers to evaluate speech production and speech recognition. It contains approximately seven hours of audiovisual recordings of fourteen American English speakers producing syllables, word lists and sentences used in both academic and clinical settings. All talkers were from the North Midland dialect region -- roughly defined as Indianapolis and north within the state of Indiana -- and had lived in that region for the majority of the time from birth to 18 years of age. Each participant read 238 different words and 166 different sentences. The sentences spoken were drawn from the following sources:

Central Institute for the Deaf (CID) Everyday Sentences (Lists A-J)
Northwestern University Auditory Test No. 6 (Lists I-IV)
Vowels in /hVd/ context (separate words)
Texas Instruments/Massachusetts Institute of Technology (TIMIT) sentences

The CID Everyday Sentences were created in the 1950s from a sample developed by the Armed Forces National Research Committee on Hearing and Bio-Acoustics. They are considered to represent everyday American speech and have the following characteristics: the vocabulary is appropriate to adults; the words appear with high frequency in one or more of the well-known word counts of the English language; proper names and proper nouns are not used; common non-slang idioms and contractions are used freely; phonetic loading and "tongue-twisting" are avoided; redundancy is high; the level of abstraction is low; and grammatical structure varies freely. Northwestern University Auditory Test No. 6 is a phonemically-balanced set of monosyllabic English words used clinically to test speech perception in adults with hearing loss. The /hVd/ vowel list was created to elicit all of the vowel sounds of American English. The TIMIT sentences are a subset (34 sentences) of the 2342 phonetically-rich sentences read by speakers in the TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1. TIMIT was designed to provide speech data for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of automatic speech recognition systems. TIMIT speakers were from eight dialect regions of the United States. The Audiovisual Database of Spoken American English will be of interest in various disciplines: to linguists for studies of phonetics, phonology, and prosody of American English; to speech scientists for investigations of motor speech production and auditory-visual speech perception; to engineers and computer scientists for investigations of machine audio-visual speech recognition (AVSR); and to speech and hearing scientists for clinical purposes, such as the examination and improvement of speech perception by listeners with hearing loss.

Data

Participants were recorded individually during a single session. A participant first completed a statement of informed consent and a questionnaire to gather biographical data and then was asked by the experimenter to mark his or her Indiana hometown on a state map. The experimenter and participant then moved to a small, sound-treated studio where the participant was seated in front of three navy blue baffles. A laptop computer was elevated to eye-level on a speaker stand and placed approximately 50-60 cm in front of the participant.
Prompts were presented to the participant in a Microsoft PowerPoint presentation. The experimenter was seated directly next to the participant, but outside the camera angle, and advanced the PowerPoint slides at a comfortable pace. Participants were recorded with a Panasonic DVC-80 digital video camera to miniDV digital video cassette tapes. All participants wore a Sennheiser MKE-2060 directional/cardioid lapel microphone throughout the recordings. Each speaker produced a total of 94 segmented files which were converted from Final Cut Express to Quicktime (.mov) files and then saved in the appropriately marked folder. If a speaker mispronounced a sentence or word during the recording process, the mispronunciations were edited out of the segments to be archived. The remaining parts of the recording, including the correct repetition of each prompt, were then sequenced together to create a continuous and complete segment. The fourteen participants were between 19 and 61 years of age (with a mean age of 30 years) and native speakers of American English.