CC0 1.0 Universal (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/
| Column | Description |
|---|---|
| id | file id (string) |
| file_path | file path to .wav file (string) |
| speech | transcription of the audio file (string) |
| speaker | speaker name; use this as the target variable for audio classification (string) |
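As a quick illustration of how this schema might be consumed, here is a minimal sketch; the index file name is hypothetical, and the actual index format depends on the distribution you download:

```python
import pandas as pd
import soundfile as sf  # pip install soundfile

# Hypothetical index file following the schema above:
# columns are id, file_path, speech, speaker.
index = pd.read_csv("cmu_arctic_index.csv")

# Inspect one row.
row = index.iloc[0]
print(row["id"], row["speaker"], row["speech"])

# Load the waveform; CMU_ARCTIC audio is distributed at 16 kHz.
audio, sample_rate = sf.read(row["file_path"])
print(audio.shape, sample_rate)

# For audio classification, `speaker` is the target variable.
labels = index["speaker"].astype("category").cat.codes
```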
The CMU_ARCTIC databases were constructed at the Language Technologies Institute at Carnegie Mellon University as phonetically balanced, US-English single-speaker databases designed for unit selection speech synthesis research. A detailed report on the structure and content of the databases, the recording environment, and related details is available as Carnegie Mellon University Language Technologies Institute Tech Report CMU-LTI-03-177.
The databases consist of around 1150 utterances carefully selected from out-of-copyright texts from Project Gutenberg. The databases include US English male (bdl) and female (slt) speakers (both experienced voice talent) as well as other accented speakers.
The 1,132-sentence prompt list is available from cmuarctic.data.
The distributions include 16 kHz waveforms and simultaneous EGG signals. Full phonetic labeling was performed with CMU Sphinx using the FestVox-based labeling scripts. Complete runnable Festival voices are included with the database distributions as examples, though better voices can be made by improving the labeling.
This work was partially supported by the U.S. National Science Foundation under Grant No. 0219687, "ITR/CIS Evaluation and Personalization of Synthetic Voices". Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Dataset comprises 488 hours of telephone dialogues in Spanish, collected from 600 native speakers across various topics and domains. This dataset boasts an impressive 98% word accuracy rate, making it a valuable resource for advancing speech recognition technology.
By utilizing this dataset, researchers and developers can advance their understanding and capabilities in automatic speech recognition (ASR), audio transcription, and natural language processing (NLP).
The dataset includes high-quality audio recordings with text transcriptions, making it ideal for training and evaluating speech recognition models.
- Audio files: High-quality recordings in WAV format
- Text transcriptions: Accurate and detailed transcripts for each audio segment
- Speaker information: Metadata on native speakers, including gender and other attributes
- Topics: Diverse domains such as general conversations, business, and more
This dataset is a valuable resource for researchers and developers working on speech recognition, language models, and speech technology.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
British English Speech Dataset for speech recognition tasks
Dataset comprises 200 hours of high-quality audio recordings featuring 310 speakers, achieving an impressive 95% Sentence Accuracy Rate. This extensive collection of speech data is designed for NLP tasks such as speech recognition, dialogue systems, and language understanding. By utilizing this dataset, developers and researchers can advance their work in automatic speech recognition and improve recognition systems. See the full description on the dataset page: https://huggingface.co/datasets/UniDataPro/british-english-speech-recognition-dataset.
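Since this dataset is hosted on the Hugging Face Hub (dataset ID taken from the page URL above), a minimal loading sketch looks like the following; the split name is an assumption, and access terms may apply:

```python
from datasets import load_dataset  # pip install datasets

# Dataset ID from the URL above; split name is an assumption.
ds = load_dataset("UniDataPro/british-english-speech-recognition-dataset",
                  split="train")

print(ds)        # inspect available columns
example = ds[0]  # first record: typically audio plus its transcription
```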
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
owahltinez/speaker-recognition-american-rhetoric dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
10,000 hours of real-world call center speech recordings in 7 languages with transcripts. Train speech recognition, sentiment analysis, and conversational AI models on authentic customer support audio. Covers support, sales, billing, finance, and pharma domains.
- Languages: English, Russian, Polish, French, German, Spanish, Portuguese; non-English calls include an English translation. Additional languages available on request: Swedish, Dutch, Arabic, Japanese, etc.
- Domains: Support, Billing/Account, Sales, Finance/Account Management, Pharma; each call is labeled by domain, and speaker roles are annotated (Agent/Customer), as illustrated in the sketch after this entry.
The full version of the dataset is available for commercial use; leave a request on our website Axonlabs to purchase the dataset 💰
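The domain labels and role annotations make per-slice filtering straightforward. A minimal sketch, assuming a hypothetical JSON Lines metadata layout (all field and file names here are assumptions; the actual schema depends on the delivery):

```python
import json

# Hypothetical metadata file: one JSON object per call.
with open("calls_metadata.jsonl") as f:
    calls = [json.loads(line) for line in f]

# Keep Spanish billing calls, using the domain labels described above.
billing = [c for c in calls
           if c["language"] == "Spanish" and c["domain"] == "Billing/Account"]

# Pull customer-side turns via the Agent/Customer role annotations.
customer_turns = [t for c in billing for t in c["turns"]
                  if t["role"] == "Customer"]
print(len(billing), len(customer_turns))
```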
Explore our Slovenian Speech Dataset with 10+ hours of clean phone dialogues in MP3/WAV, fully annotated for ASR and language models.
Discover our Arabic Speech Dataset with 10+ hours of UAE dialogues in M4A/MP3/WAV/AAC. Clean, annotated audio for ASR training.
This dataset was created by Avishkar_001
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Dataset comprises 547 hours of telephone dialogues in French, collected from 964 native speakers across various topics and domains, with an impressive 98% Word Accuracy Rate. It is designed for research in speech recognition, focusing on various recognition models, primarily aimed at meeting the requirements for automatic speech recognition (ASR) systems.
By utilizing this dataset, researchers and developers can advance their understanding and capabilities in natural language processing (NLP), speech recognition, and machine learning technologies.
The dataset includes high-quality audio recordings with accurate transcriptions, making it ideal for training and evaluating speech recognition models.
The native speakers and the various topics and domains covered make the dataset an ideal resource for the research community, allowing researchers to study spoken language, dialects, and language patterns.
Unidata’s Italian Speech Recognition dataset refines AI models for better speech-to-text conversion and language comprehension.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Russian Telephone Dialogues Dataset - 338 Hours
The Russian speech dataset includes 338 hours of telephone dialogues in Russian from 460 native speakers, offering high-quality audio recordings with detailed annotations (text, speaker ID, gender, age) to support speech recognition systems, natural language processing, and deep learning models for building accurate Russian dialogue and audio datasets.
Dataset characteristics: see the full characteristics table on the dataset page: https://huggingface.co/datasets/ud-nlp/russian-speech-recognition-dataset.
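As with the other Hub-hosted sets above, streaming avoids downloading all 338 hours up front. A minimal sketch (dataset ID from the URL above; the split name and public accessibility are assumptions):

```python
from datasets import load_dataset

# Stream records instead of downloading the full corpus.
ds = load_dataset("ud-nlp/russian-speech-recognition-dataset",
                  split="train", streaming=True)  # split name assumed

for example in ds.take(3):
    print(example.keys())  # e.g., audio plus text and speaker metadata
```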
The VoxCeleb2 dataset is a large-scale speaker recognition dataset containing 2,442 hours of raw speech from 6,112 speakers.
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
The Multi Modal Verification for Teleservices and Security applications project (M2VTS), running under the European ACTS programme, has produced a database designed to facilitate access control using multimodal identification of human faces. This technique improves recognition efficiency by combining individual modalities (i.e. face and voice). Its relative novelty means that new test material had to be created, since no existing database could offer all the modalities needed. The M2VTS database comprises 37 different faces, with 5 shots of each taken at one-week intervals, or whenever drastic face changes occurred in the meantime. During each shot, subjects were asked to count from 0 to 9 in their native language (generally French) and to move their heads from left to right, both with and without glasses. The data were then used to create three sequences: voice, motion, and "glasses off". The first sequence can be used for speech verification, 2-D dynamic face verification, and speech/lip movement correlation, while the second and third provide information for 3-D face recognition and may also be used to compare other recognition techniques.
https://www.archivemarketresearch.com/privacy-policy
The speaker recognition market is booming, projected to reach $15.1 billion by 2033, with a 15% CAGR. This comprehensive analysis explores market drivers, trends, restraints, and key players like Google, Amazon, and Microsoft, offering insights into this rapidly evolving technology.
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
The Italian speech database SIVA ("Speaker Identification and Verification Archives") comprises more than two thousand calls, collected over the public switched telephone network, and will be available soon via ELRA. The SIVA database consists of four speaker categories: male users, female users, male impostors, and female impostors. Speakers were contacted by mail before the test and asked to read the information and instructions carefully before making the call. About 500 speakers were recruited through a company specialized in the selection of population samples; the others were volunteers contacted by the institute concerned. Speakers access the recording system by calling a toll-free number, and an automatic answering system guides them through the three sessions that make up a recording. In the first session, a list of 28 words (including digits and some commands) is recorded using a standard enumerated prompt. The second session is a simple unidirectional dialogue (the caller answers prompted questions) in which personal information is requested (name, age, etc.). In the third session, the speaker is asked to read a continuous passage of phonetically balanced text resembling a short curriculum vitae. The signal is a standard 8 kHz sampled signal, coded in 8-bit mu-law format (see the decoding sketch after this entry). The data collected so far consist of:
- MU: male users, 18 speakers, 20 repetitions
- FU: female users, 16 speakers, 26 repetitions
- MI: male impostors, 189 speakers with 2 repetitions and 128 speakers with 1 repetition
- FI: female impostors, 213 speakers with 2 repetitions and 107 speakers with 1 repetition
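Since the SIVA audio is 8 kHz, 8-bit mu-law, it must be expanded to linear PCM before most toolkits can use it. A minimal sketch using the textbook continuous mu-law formula (the file name is hypothetical, and this assumes headerless raw mu-law bytes; exact G.711 decoding uses a segmented table instead):

```python
import numpy as np

def mulaw_expand(y: np.ndarray, mu: float = 255.0) -> np.ndarray:
    """Invert the continuous mu-law companding curve; y in [-1, 1]."""
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu

# Hypothetical raw file: headerless 8 kHz, 8-bit mu-law samples.
raw = np.fromfile("siva_call.ulaw", dtype=np.uint8)
companded = raw.astype(np.float32) / 127.5 - 1.0  # map 0..255 -> [-1, 1]
pcm = mulaw_expand(companded)                     # linear PCM in [-1, 1]
print(pcm.shape)  # one float sample per input byte, at 8 kHz
```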
https://data.macgence.com/terms-and-conditions
The audio dataset includes general conversations, featuring English speakers from the USA with detailed metadata.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The AXIOM Voice Dataset has the main purpose of gathering audio recordings from natural Italian speakers. This voice data collection was intended to obtain audio recording samples for training and testing the VIMAR algorithm implemented for the smart home scenario on the AXIOM board. The final goal was to develop an efficient voice recognition system using machine learning algorithms. A team of UX researchers at the University of Siena collected data for five months and tested the voice recognition system on the AXIOM board [1]. The data acquisition process involved native Italian speakers who provided their written consent to participate in the research project. The participants were selected so as to maintain a cluster with varied characteristics in gender, age, region of origin, and background.
https://www.datainsightsmarket.com/privacy-policy
Explore the booming Speaker Identification Software market, projected to reach $1.8 billion by 2033. Discover key drivers, application trends in in-car systems and healthcare, and regional growth opportunities.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Rishabh Dhawan
Released under Apache 2.0