https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for Common Voice Corpus 13.0
Dataset Summary
The Common Voice dataset consists of unique MP3 recordings, each paired with a corresponding text file. Many of the 27141 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 17689 validated hours in 108 languages, but more voices and languages are always being added. Take a look at the Languages page to… See the full description on the dataset page: https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0.
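As a quick orientation, here is a minimal sketch of streaming a few clips with the Hugging Face datasets library; it assumes the datasets package is installed, the corpus terms have been accepted on the Hub, and you are logged in, and the language code "en" is only an example.

```python
from datasets import load_dataset

# Stream the English split so nothing is downloaded up front; any other
# language code listed on the dataset page works the same way.
cv_13 = load_dataset(
    "mozilla-foundation/common_voice_13_0",
    "en",
    split="train",
    streaming=True,
)

for sample in cv_13.take(3):
    # Each sample pairs an MP3 (decoded to an array) with its transcript and
    # optional demographic metadata such as age, gender, and accent.
    print(sample["sentence"], sample["audio"]["sampling_rate"])
```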
https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for Common Voice Corpus 8.0
Dataset Summary
The Common Voice dataset consists of unique MP3 recordings, each paired with a corresponding text file. Many of the 18243 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 14122 validated hours in 87 languages, but more voices and languages are always being added. Take a look at the Languages page to… See the full description on the dataset page: https://huggingface.co/datasets/mozilla-foundation/common_voice_8_0.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The BengaliSpeechRecognitionDataset (BSRD) is a comprehensive dataset designed for the development and evaluation of Bengali speech recognition and text-to-speech systems. This dataset includes a collection of Bengali characters and their corresponding audio files, which are generated using speech synthesis models. It serves as an essential resource for researchers and developers working on automatic speech recognition (ASR) and text-to-speech (TTS) applications for the Bengali language.
Key Features:
• Bengali Characters: The dataset contains a wide range of Bengali characters, including consonants, vowels, and unique symbols used in the Bengali script. This includes standard characters such as 'ক', 'খ', 'গ', and many more.
• Corresponding Speech Data: For each Bengali character, an MP3 audio file is provided, which contains the correct pronunciation of that character. This audio is generated by a Bengali text-to-speech model, ensuring clear and accurate pronunciation.
• 1000 Audio Samples per Folder: Each character is associated with at least 1000 MP3 files. These multiple samples provide variations of the character's pronunciation, which is essential for training robust speech recognition systems.
• Language and Phonetic Diversity: The dataset offers a phonetic diversity of Bengali sounds, covering different tones and pronunciations commonly found in spoken Bengali. This ensures that the dataset can be used for training models capable of recognizing diverse speech patterns.
• Use Cases:
  o Automatic Speech Recognition (ASR): BSRD is ideal for training ASR systems, as it provides accurate audio samples linked to specific Bengali characters.
  o Text-to-Speech (TTS): Researchers can use this dataset to fine-tune TTS systems for generating natural Bengali speech from text.
  o Phonetic Analysis: The dataset can be used for phonetic analysis and developing models that study the linguistic features of Bengali pronunciation.
• Applications:
  o Voice Assistants: The dataset can be used to build and train voice recognition systems and personal assistants that understand Bengali.
  o Speech-to-Text Systems: BSRD can aid in developing accurate transcription systems for Bengali audio content.
  o Language Learning Tools: The dataset can help in creating educational tools aimed at teaching Bengali pronunciation.
Note for Researchers Using the Dataset
This dataset was created by Shuvo Kumar Basak. If you use this dataset for your research or academic purposes, please ensure to cite this dataset appropriately. If you have published your research using this dataset, please share a link to your paper. Good Luck.
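The card above describes one folder per character, each holding at least 1,000 MP3 samples, but the exact directory names are not published. Below is a minimal sketch of building a (clip, label) manifest under that assumed layout; the `BSRD` root path is a placeholder.

```python
from pathlib import Path

# Hypothetical layout: one folder per Bengali character, each holding MP3 samples.
# The card does not specify exact folder names, so adjust these paths as needed.
dataset_root = Path("BSRD")

manifest = []
for char_dir in sorted(p for p in dataset_root.iterdir() if p.is_dir()):
    clips = sorted(char_dir.glob("*.mp3"))
    manifest.extend((str(clip), char_dir.name) for clip in clips)
    print(f"{char_dir.name}: {len(clips)} clips")

# `manifest` now pairs every clip path with its character label, ready to feed
# into an ASR or TTS training pipeline.
```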
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The main purpose of this data set is to facilitate research into audio DeepFakes. We hope that this work helps in finding new detection methods to prevent such attempts. Generated media files of this kind are increasingly used for impersonation attempts and online harassment.
The data set consists of 104,885 generated audio clips (16-bit PCM wav). We examine multiple networks trained on two reference data sets. First, the LJSpeech data set consisting of 13,100 short audio clips (on average 6 seconds each; roughly 24 hours total) read by a female speaker. It features passages from 7 non-fiction books and the audio was recorded on a MacBook Pro microphone. Second, we include samples based on the JSUT data set, specifically, basic5000 corpus. This corpus consists of 5,000 sentences covering all basic kanji of the Japanese language (4.8 seconds on average; roughly 6.7 hours total). The recordings were performed by a female native Japanese speaker in an anechoic room. Finally, we include samples from a full text-to-speech pipeline (16,283 phrases; 3.8s on average; roughly 17.5 hours total). Thus, our data set consists of approximately 175 hours of generated audio files in total. Note that we do not redistribute the reference data.
We included a range of architectures in our data set:
MelGAN
Parallel WaveGAN
Multi-Band MelGAN
Full-Band MelGAN
WaveGlow
Additionally, we examined a bigger version of MelGAN and include samples from a full TTS-pipeline consisting of a conformer and parallel WaveGAN model.
Collection Process
For WaveGlow, we utilize the official implementation (commit 8afb643) in conjunction with the official pre-trained network on PyTorch Hub. We use a popular implementation available on GitHub (commit 12c677e) for the remaining networks. The repository also offers pre-trained models. We used the pre-trained networks to generate samples that are similar to their respective training distributions, LJ Speech and JSUT. When sampling the data set, we first extract Mel spectrograms from the original audio files, using the pre-processing scripts of the corresponding repositories. We then feed these Mel spectrograms to the respective models to obtain the data set. For sampling the full TTS results, we use the ESPnet project. To make sure the generated phrases do not overlap with the training set, we downloaded the Common Voice dataset and extracted 16,285 phrases from it.
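The collection process above boils down to "extract a Mel spectrogram, then resynthesize it with a pre-trained vocoder." The sketch below illustrates that loop with torchaudio; the real dataset used each repository's own preprocessing scripts and checkpoints, so the parameters here (sample rate, FFT size, hop length) are illustrative assumptions and `vocoder` stands in for any of the pre-trained models listed above.

```python
import torch
import torchaudio

# Illustrative Mel front end; the actual dataset used each repository's own
# pre-processing scripts, so these values are assumptions, not the originals.
SAMPLE_RATE = 22050
mel_frontend = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=1024,
    hop_length=256,
    n_mels=80,
)

def resynthesize(wav_path: str, vocoder: torch.nn.Module) -> torch.Tensor:
    """Extract a Mel spectrogram from a reference clip and feed it to a vocoder."""
    waveform, sr = torchaudio.load(wav_path)
    if sr != SAMPLE_RATE:
        waveform = torchaudio.functional.resample(waveform, sr, SAMPLE_RATE)
    mel = mel_frontend(waveform)          # (channels, n_mels, frames)
    with torch.no_grad():
        generated = vocoder(mel)          # vocoder maps Mel frames back to audio
    return generated
```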
This data set is licensed with a CC-BY-SA 4.0 license.
This work was supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy -- EXC-2092 CaSa -- 390781972.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Egyptian Arabic General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Arabic speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Egyptian Arabic communication.
Curated by FutureBeeAI, this 40-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Arabic speech models that understand and respond to authentic Egyptian accents and dialects.
The dataset comprises 40 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Egyptian Arabic. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
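The card states that each recording ships with a verbatim JSON transcription but does not publish the schema, so the field names below (`segments`, `speaker`, `transcript`) are hypothetical placeholders to adapt to the delivered files.

```python
import json
from pathlib import Path

# Hypothetical schema: the actual FutureBeeAI JSON layout may differ, so treat
# these keys as placeholders and adjust them to the delivered files.
def load_transcript(json_path: str) -> list[tuple[str, str]]:
    data = json.loads(Path(json_path).read_text(encoding="utf-8"))
    return [
        (segment.get("speaker", "unknown"), segment.get("transcript", ""))
        for segment in data.get("segments", [])
    ]

for speaker, text in load_transcript("conversation_001.json"):  # placeholder file name
    print(f"{speaker}: {text}")
```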
The dataset comes with granular metadata for both speakers and recordings:
Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
This dataset is a versatile resource for multiple Arabic speech and language AI applications:
Text-To-Speech Market Size 2025-2029
The text-to-speech market size is forecast to increase by USD 3.99 billion, at a CAGR of 14.1% between 2024 and 2029.
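As a sanity check on how the headline figures relate, the snippet below back-solves the implied 2024 base from the stated USD 3.99 billion increment and 14.1% CAGR, assuming five years of compounding (2024 to 2029); the report does not state the base-year size here, so the result is only an illustration.

```python
# Back-of-the-envelope check of the headline figures (illustrative only).
cagr = 0.141           # stated compound annual growth rate
years = 5              # 2024 -> 2029
increment_usd_bn = 3.99

growth_factor = (1 + cagr) ** years           # ~1.93x over the period
implied_base = increment_usd_bn / (growth_factor - 1)
implied_2029 = implied_base * growth_factor

print(f"Implied 2024 base: ~USD {implied_base:.2f} bn")   # roughly 4.3
print(f"Implied 2029 size: ~USD {implied_2029:.2f} bn")   # roughly 8.3
```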
The Text-To-Speech (TTS) market is experiencing significant growth, driven primarily by the increasing demand for voice-enabled devices. This trend is expected to continue as technology advances and voice interfaces become more integrated into daily life. Another key driver is the development of AI-based TTS models, which offer improved accuracy and natural-sounding voices. Advancements in artificial intelligence and machine learning are also transforming how TTS solutions are delivered. However, regulatory compliance poses a significant challenge for market players: as governments and regulatory bodies impose stricter guidelines on data privacy and security, TTS providers must ensure their solutions meet these requirements to maintain customer trust and avoid potential legal issues.
The proliferation of high-speed internet, smartphones, and tablets has further fueled market expansion. Companies seeking to capitalize on market opportunities in the TTS space should focus on developing advanced, AI-driven TTS models while prioritizing regulatory compliance to navigate this complex landscape.
What will be the Size of the Text-To-Speech Market during the forecast period?
Request Free Sample
The text-to-speech (TTS) market is experiencing significant advancements in speech recognition technology and voice search optimization. Metrics covering speech recognition datasets, voice modulation, and voice cloning play a crucial role in evaluating TTS system performance. Speech synthesis evaluation and voice cloning evaluation are essential for ensuring high-quality audiobook narration and call center automation. Voice modulation technology and voice cloning technology are revolutionizing industries like interactive voice response and speech interface design. VPNs and secure platforms are essential to ensure data security. Convolutional neural networks and transformer networks are driving improvements in speech recognition quality and speech synthesis quality. Voice commerce and human-computer interaction are benefiting from these advancements, with voice modulation metrics and speech-to-text metrics playing a key role in voice commerce evaluation.
Audiobook narration and speech-to-text quality are essential for digital signage applications. Vocal training and speech therapy are also utilizing speech-to-text datasets and deep neural networks for data augmentation, enhancing the overall effectiveness of these applications. Voice banking and voice interface design are further expanding the use cases for TTS technology. In summary, the TTS market is witnessing continuous innovation, with advancements in speech recognition, voice modulation, and voice cloning metrics driving improvements in various industries, including call centers, e-commerce, and digital signage.
How is this Text-To-Speech Industry segmented?
The text-to-speech industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.
Language
English
Chinese
Spanish
Others
Technology
Neural TTS
Concatenative TTS
Formant-based TTS
Type
Natural voices
Synthetic voices
End-user
Automotive and transportation
Healthcare
Consumer Electronics
Finance
Others
Geography
North America
US
Canada
Europe
France
Germany
UK
APAC
Australia
China
India
Japan
South Korea
Rest of World (ROW)
By Language Insights
The English segment is estimated to witness significant growth during the forecast period. The Text-to-Speech (TTS) market is witnessing significant growth, driven by the increasing adoption of English language systems in various sectors. English, as the most widely used language, holds a dominant position in this market due to its extensive application in business, education, media, and technology. TTS solutions for English are developed with a diverse range of voice options, including regional accents such as American, British, and Australian, and multiple speaking styles, from formal and instructional to conversational and expressive. Virtual assistants, customer service platforms, e-learning modules, and accessibility tools are among the major applications of English TTS systems.
The integration of these solutions across these domains reflects both the global reach of the English language and the technological advancements supporting it. Advanced functionalities such as speech recognition, speaker identification, and conversational AI are becoming increasingly common in TTS systems, enhancing their capabilities and usability.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
CapSpeech-CommonVoice Audio
Dataset used for the paper CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech. Please refer to 🤗CapSpeech for the whole dataset and the 🚀CapSpeech repo for more details.
Overview
🔥 CapSpeech is a new benchmark designed for style-captioned TTS (CapTTS) tasks, including style-captioned text-to-speech synthesis with sound effects (CapTTS-SE), accent-captioned TTS (AccCapTTS), emotion-captioned TTS (EmoCapTTS) and… See the full description on the dataset page: https://huggingface.co/datasets/OpenSound/CapSpeech-CommonVoice.
v1.0
2024-10-15
Personnel requirements: professional and ordinary speakers, half male and half female
Collection equipment: professional recording booth
Data format: .WAV, .TXT, .TextGrid
Data features: ordinary and professional speakers; the text content covers all Chinese phonemes as well as the main Chinese phonetic contexts; the recorded content is completely consistent with the text
Annotation content: vowel-consonant segmentation, prosody annotation, pinyin annotation
All speakers, except those aged 0-15 and over 50, have some broadcasting training and hold a Class 1 (Grade A or Grade B) Putonghua Proficiency Certificate.
Speakers without a Putonghua certificate should have natural, friendly voices and relatively standard Putonghua.
Speaking speed should be natural, and volume and pace should be kept as consistent as possible.
Speakers should be in good condition during recording, avoiding breath sounds and excessive saliva sounds in silent segments.
Each speaker records 1,500 read-aloud sentences in one complete file, in exactly the same order as in the text, with 700 ms-1 s of silence between consecutive sentences.
The audio format is .wav, with a 48 kHz sampling rate, 16-bit depth, and a single channel. The background noise of the audio is lower than -60 dB, and the signal-to-noise ratio reaches 35 dB. The peak level of a single sentence is between -9 dB and -2 dB, with no clipping.
High-frequency information of the audio is complete.
There is no obvious noise in the audio, including but not limited to background noise, electrical hum, key-pressing sounds, breathing sounds, saliva sounds, etc.
The recordings preserve the natural human voice as faithfully as possible.
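A minimal sketch of checking a delivered clip against the format specification above (48 kHz, 16-bit PCM, mono) using soundfile and numpy; the peak threshold in the comment simply restates the figures given in this card, and the file name is a placeholder.

```python
import numpy as np
import soundfile as sf

def check_clip(path: str) -> None:
    """Verify a clip against the format spec described in this card."""
    audio, sr = sf.read(path, always_2d=False)
    info = sf.info(path)

    assert sr == 48000, f"expected 48 kHz, got {sr}"
    assert info.subtype == "PCM_16", f"expected 16-bit PCM, got {info.subtype}"
    assert audio.ndim == 1, "expected a single (mono) channel"

    peak_db = 20 * np.log10(np.max(np.abs(audio)) + 1e-12)
    print(f"{path}: peak {peak_db:.1f} dBFS (spec: roughly -9 to -2 dBFS, no clipping)")

check_clip("speaker001_0001.wav")  # placeholder file name
```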
Gender distribution: 50 men, 50 women, total 100 people
Age & quantity distribution:

| Age | Gender | Number |
| --- | --- | --- |
| 0-15 years old | Male | 5 |
| 0-15 years old | Female | 6 |
| 16-50 years old | Male | 40 |
| 16-50 years old | Female | 39 |
| Over 50 years old | Male | 5 |
| Over 50 years old | Female | 5 |

Quantity: effective recording time is 1.5-2 hours per person, 142 hours of effective audio in total, 102,400 sentences in total.
## Directory Structure
root_directory/
├── audio/
│ ├── audio1.wav
│ ├── text1.txt
The SOMOS dataset is a large-scale mean opinion scores (MOS) dataset consisting solely of neural text-to-speech (TTS) samples. It can be employed to train automatic MOS prediction systems focused on the assessment of modern synthesizers, and can stimulate advancements in acoustic model evaluation. It consists of 20K synthetic utterances of the LJ Speech voice, a public-domain speech dataset that is a common benchmark for building neural acoustic models and vocoders. Utterances are generated from 200 TTS systems, including vanilla neural acoustic models as well as models that allow prosodic variations.
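Since SOMOS is meant for training and evaluating MOS predictors, a common sanity check is comparing predicted and human scores at the system level. Below is a small sketch with hypothetical per-utterance records; the real SOMOS release defines its own file layout and splits, which are not reproduced here.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (utterance_id, system_id, human_mos, predicted_mos) records;
# the real SOMOS release defines its own splits and file format.
records = [
    ("utt_001", "sys_A", 3.8, 3.6),
    ("utt_002", "sys_A", 4.1, 3.9),
    ("utt_003", "sys_B", 2.9, 3.2),
    ("utt_004", "sys_B", 3.1, 3.0),
]

# Aggregate utterance-level scores to system level, the granularity at which
# MOS predictors for synthesizer assessment are usually compared.
human, predicted = defaultdict(list), defaultdict(list)
for _, system, mos, pred in records:
    human[system].append(mos)
    predicted[system].append(pred)

for system in sorted(human):
    print(system, round(mean(human[system]), 2), round(mean(predicted[system]), 2))
```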
CommonVoice Clones
This dataset consists of recordings taken from the Common Voice English dataset. Each voice and its transcript are used as input to a voice cloner, which generates a cloned version of the voice speaking the text.
TTS Models
We use the following high-scoring models from the TTS leaderboard:
playHT
metavoice
StyleTTSv2
XttsV2
Model Comparisons
To facilitate data exploration, check out this HF space 🤗, which allows you to listen to all clones from a given… See the full description on the dataset page: https://huggingface.co/datasets/jerpint/vox-cloned-data.
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_EVALUATION.pdf
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
TC-STAR is a European integrated project focusing on Speech-to-Speech Translation (SST). To encourage significant breakthroughs in all SST technologies, annual open competitive evaluations are organized. Automatic Speech Recognition (ASR), Spoken Language Translation (SLT) and Text-To-Speech (TTS) are evaluated independently and within an end-to-end system.
The second TC-STAR evaluation campaign took place in March 2006. Three core technologies were evaluated during the campaign:
• Automatic Speech Recognition (ASR),
• Spoken Language Translation (SLT),
• Text to Speech (TTS).
Each evaluation package includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the second evaluation campaign. The aim of these evaluation packages is to enable external players to evaluate their own systems and compare their results with those obtained during the campaign itself.
The speech databases made within the TC-STAR project were validated by SPEX, in the Netherlands, to assess their compliance with the TC-STAR format and content specifications.
This package includes the material used for the TC-STAR 2006 Automatic Speech Recognition (ASR) second evaluation campaign for Spanish. The same packages are available for English (ELRA-E0011), Mandarin (ELRA-E0013), and for the EPPS task for Spanish (ELRA-E0012/02), for ASR and for SLT in 3 directions: English-to-Spanish (ELRA-E0014), Spanish-to-English (ELRA-E0015), Chinese-to-English (ELRA-E0016).
To be able to chain the components, ASR, SLT and TTS evaluation tasks were designed to use common sets of raw data and conditions. Three evaluation tasks, common to ASR, SLT and TTS, were selected: the EPPS (European Parliament Plenary Sessions) task, the CORTES (Spanish Parliament Sessions) task and the VOA (Voice of America) task. The CORTES data were used in addition to the EPPS data to evaluate ASR in Spanish and SLT from Spanish into English.
This package was used within the CORTES task and consists of 2 data sets:
- Development data set: audio recordings of CORTES sessions from 1 to 2 December 2004, manually transcribed. 3 hours of recordings were selected and transcribed, corresponding to approximately 30,000 running words in Spanish.
- Test data set: audio recordings of CORTES sessions of 24 November 2005. As for the development set, the test data set is made of 3 hours (30,000 running words).
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_EVALUATION.pdf
TC-STAR is a European integrated project focusing on Speech-to-Speech Translation (SST). To encourage significant breakthroughs in all SST technologies, annual open competitive evaluations are organized. Automatic Speech Recognition (ASR), Spoken Language Translation (SLT) and Text-To-Speech (TTS) are evaluated independently and within an end-to-end system.
The third TC-STAR evaluation campaign took place in March 2007. Three core technologies were evaluated during the campaign:
• Automatic Speech Recognition (ASR),
• Spoken Language Translation (SLT),
• Text to Speech (TTS).
Each evaluation package includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the third evaluation campaign. The aim of these evaluation packages is to enable external players to evaluate their own systems and compare their results with those obtained during the campaign itself.
The speech databases made within the TC-STAR project were validated by SPEX, in the Netherlands, to assess their compliance with the TC-STAR format and content specifications.
This package includes the material used for the TC-STAR 2007 Automatic Speech Recognition (ASR) third evaluation campaign for Spanish. The same packages are available for both English (ELRA-E0025) and Mandarin (ELRA-E0027), and for the EPPS task for Spanish (ELRA-E0026/02), for ASR and for SLT in 3 directions: English-to-Spanish (ELRA-E0028), Spanish-to-English (ELRA-E0029-01 and E0029-02), Chinese-to-English (ELRA-E0030).
To be able to chain the components, ASR, SLT and TTS evaluation tasks were designed to use common sets of raw data and conditions. Three evaluation tasks, common to ASR, SLT and TTS, were selected: the EPPS (European Parliament Plenary Sessions) task, the CORTES (Spanish Parliament Sessions) task and the VOA (Voice of America) task. The CORTES data were used in addition to the EPPS data to evaluate ASR in Spanish and SLT from Spanish into English.
This package was used within the CORTES task and consists of one test data set, composed of audio recordings of CORTES sessions of June 2006. The test data set is made of 3 hours (33,920 running words).
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for "edited_common_voice"
More Information needed. This dataset is a Thai TTS dataset that uses voices from the Common Voice dataset and modifies them so that they do not sound like the original speakers. Medium: Thai Text-To-Speech with Tacotron2
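The card does not say how the voices were altered. Purely as an illustration of one way to make a clip "not sound like the original" speaker, here is a hedged sketch that pitch-shifts a Common Voice clip with librosa; the file names and the shift amount are arbitrary placeholders, not the transformation actually used to build this dataset.

```python
import librosa
import soundfile as sf

# Placeholder input clip and output path; the +4 semitone shift is arbitrary
# and is NOT the transformation used to create this dataset.
audio, sr = librosa.load("common_voice_th_clip.mp3", sr=None)
shifted = librosa.effects.pitch_shift(audio, sr=sr, n_steps=4)
sf.write("modified_clip.wav", shifted, sr)
```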
http://www.elra.info/media/filer_public/2015/04/13/evaluation_150325.pdf
TC-STAR is a European integrated project focusing on Speech-to-Speech Translation (SST). To encourage significant breakthrough in all SST technologies, annual open competitive evaluations are organized. Automatic Speech Recognition (ASR), Spoken Language Translation (SLT) and Text-To-Speech (TTS) are evaluated independently and within an end-to-end system.
The first TC-STAR evaluation campaign took place in March 2005.
Two core technologies were evaluated during the campaign:
• Automatic Speech Recognition (ASR),
• Spoken Language Translation (SLT).
Each evaluation package includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the first evaluation campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.
The speech databases made within the TC-STAR project were validated by SPEX, in the Netherlands, to assess their compliance with the TC-STAR format and content specifications.
This package includes the material used for the TC-STAR 2005 Automatic Speech Recognition (ASR) first evaluation campaign for the Mandarin Chinese language. The same packages are available for both English (ELRA-E0002) and Spanish (ELRA-E0003) for ASR and for SLT in 3 directions, English-to-Spanish (ELRA-E0005), Spanish-to-English (ELRA-E0006), Chinese-to-English (ELRA-E0007).
To be able to chain the components, ASR and SLT evaluation tasks were designed to use common sets of raw data and conditions. Two evaluation tasks, common to ASR and SLT, were selected: EPPS (European Parliament Plenary Sessions) task and VOA (Voice of America) task. This package was used within the VOA task and consists of 2 data sets:
- Development data set: consists of 3 hours of audio recordings from the broadcast news of Mandarin Voice of America between 1 and 3 December 1998 which corresponds more or less to 42,000 Chinese characters.
- Test data set: consists of 3 hours of audio recordings from news broadcast between 14 and 22 December 1998 and corresponds to 44,000 Chinese characters.
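These evaluation packages ship their own protocols and scoring tools; as a rough stand-in, ASR output on Mandarin broadcast data of this kind is typically scored by character error rate. A minimal, self-contained sketch is shown below; the example strings are invented, not material from the package.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution
            ))
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, the usual metric for Mandarin ASR."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

# Invented example, not taken from the evaluation data.
print(cer("美国之音新闻", "美国知音新闻"))  # 1 substitution over 6 characters -> ~0.17
```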
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Overview
To assess the multilingual zero-shot voice cloning capabilities of TTS models, we have constructed a test set encompassing 24 languages. This dataset provides both audio samples for voice cloning and corresponding test texts. Specifically, the test set for each language includes: 100 distinct test sentences. Audio samples from two speakers (one male and one female) carefully selected from the Mozilla Common Voice (MCV) dataset, intended for voice cloning. Researchers can… See the full description on the dataset page: https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set.
https://choosealicense.com/licenses/cc0-1.0/
ESLTTS
The full paper can be accessed here: arXiv, IEEE Xplore.
Dataset Access
You can access this dataset through Huggingface, Google Drive, or IEEE Dataport.
Abstract
With the progress made in speaker-adaptive TTS approaches, advanced approaches have shown a remarkable capacity to reproduce the speaker’s voice in the commonly used TTS datasets. However, mimicking voices characterized by substantial accents, such as non-native English speakers, is still… See the full description on the dataset page: https://huggingface.co/datasets/MushanW/ESLTTS.
Ar-ASR
Dataset Description
This dataset is designed for Automatic Speech Recognition (ASR), focusing on Arabic speech with precise transcriptions including tashkeel (diacritics). It contains 33,607 audio samples from multiple sources: Microsoft Edge TTS API, Common Voice (validated Arabic subset), individual contributions, and manually transcribed YouTube videos (we also added the dataset ClArTTS). The dataset is paired with aligned Arabic text transcriptions and is… See the full description on the dataset page: https://huggingface.co/datasets/CUAIStudents/Ar-ASR.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for "hausa_long_voice_dataset"
Dataset Overview
Dataset Name: Hausa Long Voice Dataset Description: This dataset contains merged Hausa language audio samples from Common Voice. Audio files from the same speaker have been concatenated to create longer audio samples with their corresponding transcriptions, designed for text-to-speech (TTS) training where longer sequences are beneficial.
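The card describes concatenating same-speaker Common Voice clips into longer samples. A rough sketch of that idea with numpy and soundfile follows; the grouping, silence padding, and file names are assumptions for illustration, not the exact recipe used to build this dataset.

```python
import numpy as np
import soundfile as sf

def concatenate_clips(paths: list[str], out_path: str, gap_seconds: float = 0.3) -> None:
    """Join same-speaker clips into one longer sample, with short silences between them."""
    pieces, sr = [], None
    for path in paths:
        audio, clip_sr = sf.read(path, always_2d=False)
        sr = sr or clip_sr
        assert clip_sr == sr, "all clips must share a sampling rate before concatenation"
        pieces.append(audio)
        pieces.append(np.zeros(int(gap_seconds * sr)))  # assumed inter-clip silence
    sf.write(out_path, np.concatenate(pieces[:-1]), sr)

# Placeholder clip list for one speaker; transcriptions would be joined in the same order.
concatenate_clips(["spk01_001.wav", "spk01_002.wav"], "spk01_long.wav")
```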
Dataset Structure
Configs:
default
Data Files:
Split: train… See the full description on the dataset page: https://huggingface.co/datasets/mide7x/hausa_long_voice_dataset.
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_EVALUATION.pdf
TC-STAR is a European integrated project focusing on Speech-to-Speech Translation (SST). To encourage significant breakthroughs in all SST technologies, annual open competitive evaluations are organized. Automatic Speech Recognition (ASR), Spoken Language Translation (SLT) and Text-To-Speech (TTS) are evaluated independently and within an end-to-end system.
The third TC-STAR evaluation campaign took place in March 2007. Three core technologies were evaluated during the campaign:
• Automatic Speech Recognition (ASR),
• Spoken Language Translation (SLT),
• Text to Speech (TTS).
Each evaluation package includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the third evaluation campaign. The aim of these evaluation packages is to enable external players to evaluate their own systems and compare their results with those obtained during the campaign itself.
The speech databases made within the TC-STAR project were validated by SPEX, in the Netherlands, to assess their compliance with the TC-STAR format and content specifications.
This package includes the material used for the TC-STAR 2007 Automatic Speech Recognition (ASR) third evaluation campaign for Spanish. The same packages are available for both English (ELRA-E0025) and Mandarin (ELRA-E0027), and for the CORTES task for Spanish (ELRA-E0026/01), for ASR and for SLT in 3 directions: English-to-Spanish (ELRA-E0028), Spanish-to-English (ELRA-E0029-01 and E0029-02), Chinese-to-English (ELRA-E0030).
To be able to chain the components, ASR, SLT and TTS evaluation tasks were designed to use common sets of raw data and conditions. Three evaluation tasks, common to ASR, SLT and TTS, were selected: the EPPS (European Parliament Plenary Sessions) task, the CORTES (Spanish Parliament Sessions) task and the VOA (Voice of America) task. The CORTES data were used in addition to the EPPS data to evaluate ASR in Spanish and SLT from Spanish into English.
This package was used within the EPPS task and consists of one test data set, composed of audio recordings of Parliament sessions from June to September 2006. The test data set is made of 3 hours (28,823 running words).
In this project, the main focus has been on preparing a clean dataset and training models for automatic recognition of Persian speech. Three methods for creating the dataset have been investigated. One of these methods has been the use of open-text audio and text resources to train the model and create a data cleaning pipeline. In this regard, the CommonVoice-V16 dataset was used and a pipeline was designed to clean the data. Text-to-speech models have also been evaluated for dataset… See the full description on the dataset page: https://huggingface.co/datasets/FanavaranPars/ASR.