https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for Common Voice Corpus 13.0
Dataset Summary
The Common Voice dataset consists of unique MP3 recordings, each paired with a corresponding text file. Many of the 27141 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 17689 validated hours in 108 languages, but more voices and languages are always being added. Take a look at the Languages page to… See the full description on the dataset page: https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0.
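As a quick orientation, here is a minimal sketch of streaming a few clips with the Hugging Face datasets library; it assumes the datasets package is installed, the corpus terms have been accepted on the Hub, and you are logged in, and the language code "en" is only an example.

```python
from datasets import load_dataset

# Stream the English split so nothing is downloaded up front; any other
# language code listed on the dataset page works the same way.
cv_13 = load_dataset(
    "mozilla-foundation/common_voice_13_0",
    "en",
    split="train",
    streaming=True,
)

for sample in cv_13.take(3):
    # Each sample pairs an MP3 (decoded to an array) with its transcript and
    # optional demographic metadata such as age, gender, and accent.
    print(sample["sentence"], sample["audio"]["sampling_rate"])
```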
https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for Common Voice Corpus 8.0
Dataset Summary
The Common Voice dataset consists of unique MP3 recordings, each paired with a corresponding text file. Many of the 18243 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 14122 validated hours in 87 languages, but more voices and languages are always being added. Take a look at the Languages page to… See the full description on the dataset page: https://huggingface.co/datasets/mozilla-foundation/common_voice_8_0.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The BengaliSpeechRecognitionDataset (BSRD) is a comprehensive dataset designed for the development and evaluation of Bengali speech recognition and text-to-speech systems. This dataset includes a collection of Bengali characters and their corresponding audio files, which are generated using speech synthesis models. It serves as an essential resource for researchers and developers working on automatic speech recognition (ASR) and text-to-speech (TTS) applications for the Bengali language.
Key Features:
• Bengali Characters: The dataset contains a wide range of Bengali characters, including consonants, vowels, and unique symbols used in the Bengali script. This includes standard characters such as 'ক', 'খ', 'গ', and many more.
• Corresponding Speech Data: For each Bengali character, an MP3 audio file is provided, which contains the correct pronunciation of that character. This audio is generated by a Bengali text-to-speech model, ensuring clear and accurate pronunciation.
• 1000 Audio Samples per Folder: Each character is associated with at least 1000 MP3 files. These multiple samples provide variations of the character's pronunciation, which is essential for training robust speech recognition systems.
• Language and Phonetic Diversity: The dataset offers a phonetic diversity of Bengali sounds, covering different tones and pronunciations commonly found in spoken Bengali. This ensures that the dataset can be used for training models capable of recognizing diverse speech patterns.
• Use Cases:
  o Automatic Speech Recognition (ASR): BSRD is ideal for training ASR systems, as it provides accurate audio samples linked to specific Bengali characters.
  o Text-to-Speech (TTS): Researchers can use this dataset to fine-tune TTS systems for generating natural Bengali speech from text.
  o Phonetic Analysis: The dataset can be used for phonetic analysis and developing models that study the linguistic features of Bengali pronunciation.
• Applications:
  o Voice Assistants: The dataset can be used to build and train voice recognition systems and personal assistants that understand Bengali.
  o Speech-to-Text Systems: BSRD can aid in developing accurate transcription systems for Bengali audio content.
  o Language Learning Tools: The dataset can help in creating educational tools aimed at teaching Bengali pronunciation.
Note for Researchers Using the Dataset
This dataset was created by Shuvo Kumar Basak. If you use this dataset for your research or academic purposes, please ensure to cite this dataset appropriately. If you have published your research using this dataset, please share a link to your paper. Good Luck.
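The card above describes one folder per character, each holding at least 1,000 MP3 samples, but the exact directory names are not published. Below is a minimal sketch of building a (clip, label) manifest under that assumed layout; the `BSRD` root path is a placeholder.

```python
from pathlib import Path

# Hypothetical layout: one folder per Bengali character, each holding MP3 samples.
# The card does not specify exact folder names, so adjust these paths as needed.
dataset_root = Path("BSRD")

manifest = []
for char_dir in sorted(p for p in dataset_root.iterdir() if p.is_dir()):
    clips = sorted(char_dir.glob("*.mp3"))
    manifest.extend((str(clip), char_dir.name) for clip in clips)
    print(f"{char_dir.name}: {len(clips)} clips")

# `manifest` now pairs every clip path with its character label, ready to feed
# into an ASR or TTS training pipeline.
```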
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The main purpose of this data set is to facilitate research into audio DeepFakes. We hope that this work helps in finding new detection methods to prevent such attempts. Generated media files of this kind are increasingly used for impersonation attempts and online harassment.
The data set consists of 104,885 generated audio clips (16-bit PCM wav). We examine multiple networks trained on two reference data sets. First, the LJSpeech data set consisting of 13,100 short audio clips (on average 6 seconds each; roughly 24 hours total) read by a female speaker. It features passages from 7 non-fiction books and the audio was recorded on a MacBook Pro microphone. Second, we include samples based on the JSUT data set, specifically, basic5000 corpus. This corpus consists of 5,000 sentences covering all basic kanji of the Japanese language (4.8 seconds on average; roughly 6.7 hours total). The recordings were performed by a female native Japanese speaker in an anechoic room. Finally, we include samples from a full text-to-speech pipeline (16,283 phrases; 3.8s on average; roughly 17.5 hours total). Thus, our data set consists of approximately 175 hours of generated audio files in total. Note that we do not redistribute the reference data.
We included a range of architectures in our data set:
MelGAN
Parallel WaveGAN
Multi-Band MelGAN
Full-Band MelGAN
WaveGlow
Additionally, we examined a bigger version of MelGAN and include samples from a full TTS-pipeline consisting of a conformer and parallel WaveGAN model.
Collection Process
For WaveGlow, we utilize the official implementation (commit 8afb643) in conjunction with the official pre-trained network on PyTorch Hub. We use a popular implementation available on GitHub (commit 12c677e) for the remaining networks. The repository also offers pre-trained models. We used the pre-trained networks to generate samples that are similar to their respective training distributions, LJ Speech and JSUT. When sampling the data set, we first extract Mel spectrograms from the original audio files, using the pre-processing scripts of the corresponding repositories. We then feed these Mel spectrograms to the respective models to obtain the data set. For sampling the full TTS results, we use the ESPnet project. To make sure the generated phrases do not overlap with the training set, we downloaded the Common Voice dataset and extracted 16,285 phrases from it.
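The collection process above boils down to "extract a Mel spectrogram, then resynthesize it with a pre-trained vocoder." The sketch below illustrates that loop with torchaudio; the real dataset used each repository's own preprocessing scripts and checkpoints, so the parameters here (sample rate, FFT size, hop length) are illustrative assumptions and `vocoder` stands in for any of the pre-trained models listed above.

```python
import torch
import torchaudio

# Illustrative Mel front end; the actual dataset used each repository's own
# pre-processing scripts, so these values are assumptions, not the originals.
SAMPLE_RATE = 22050
mel_frontend = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=1024,
    hop_length=256,
    n_mels=80,
)

def resynthesize(wav_path: str, vocoder: torch.nn.Module) -> torch.Tensor:
    """Extract a Mel spectrogram from a reference clip and feed it to a vocoder."""
    waveform, sr = torchaudio.load(wav_path)
    if sr != SAMPLE_RATE:
        waveform = torchaudio.functional.resample(waveform, sr, SAMPLE_RATE)
    mel = mel_frontend(waveform)          # (channels, n_mels, frames)
    with torch.no_grad():
        generated = vocoder(mel)          # vocoder maps Mel frames back to audio
    return generated
```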
This data set is licensed with a CC-BY-SA 4.0 license.
This work was supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy -- EXC-2092 CaSa -- 390781972.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Egyptian Arabic General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Arabic speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Egyptian Arabic communication.
Curated by FutureBeeAI, this 40-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Arabic speech models that understand and respond to authentic Egyptian accents and dialects.
The dataset comprises 40 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Egyptian Arabic. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
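The card states that each recording ships with a verbatim JSON transcription but does not publish the schema, so the field names below (`segments`, `speaker`, `transcript`) are hypothetical placeholders to adapt to the delivered files.

```python
import json
from pathlib import Path

# Hypothetical schema: the actual FutureBeeAI JSON layout may differ, so treat
# these keys as placeholders and adjust them to the delivered files.
def load_transcript(json_path: str) -> list[tuple[str, str]]:
    data = json.loads(Path(json_path).read_text(encoding="utf-8"))
    return [
        (segment.get("speaker", "unknown"), segment.get("transcript", ""))
        for segment in data.get("segments", [])
    ]

for speaker, text in load_transcript("conversation_001.json"):  # placeholder file name
    print(f"{speaker}: {text}")
```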
The dataset comes with granular metadata for both speakers and recordings:
Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
This dataset is a versatile resource for multiple Arabic speech and language AI applications:
Text-To-Speech Market Size 2025-2029
The text-to-speech market size is forecast to increase by USD 3.99 billion, at a CAGR of 14.1% between 2024 and 2029.
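As a sanity check on how the headline figures relate, the snippet below back-solves the implied 2024 base from the stated USD 3.99 billion increment and 14.1% CAGR, assuming five years of compounding (2024 to 2029); the report does not state the base-year size here, so the result is only an illustration.

```python
# Back-of-the-envelope check of the headline figures (illustrative only).
cagr = 0.141           # stated compound annual growth rate
years = 5              # 2024 -> 2029
increment_usd_bn = 3.99

growth_factor = (1 + cagr) ** years           # ~1.93x over the period
implied_base = increment_usd_bn / (growth_factor - 1)
implied_2029 = implied_base * growth_factor

print(f"Implied 2024 base: ~USD {implied_base:.2f} bn")   # roughly 4.3
print(f"Implied 2029 size: ~USD {implied_2029:.2f} bn")   # roughly 8.3
```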
The Text-To-Speech (TTS) market is experiencing significant growth, driven primarily by the increasing demand for voice-enabled devices. This trend is expected to continue as technology advances and voice interfaces become more integrated into daily life. Another key driver is the development of AI-based TTS models, which offer improved accuracy and natural-sounding voices. Advancements in artificial intelligence and machine learning are also transforming how TTS solutions are delivered. However, regulatory compliance poses a significant challenge for market players: as governments and regulatory bodies impose stricter guidelines on data privacy and security, TTS providers must ensure their solutions meet these requirements to maintain customer trust and avoid potential legal issues.
The proliferation of high-speed internet, smartphones, and tablets has further fueled market expansion. Companies seeking to capitalize on market opportunities in the TTS space should focus on developing advanced, AI-driven TTS models while prioritizing regulatory compliance to navigate this complex landscape.
What will be the Size of the Text-To-Speech Market during the forecast period?
Request Free Sample
The text-to-speech (TTS) market is experiencing significant advancements in speech recognition technology and voice search optimization. Metrics covering speech recognition datasets, voice modulation, and voice cloning play a crucial role in evaluating TTS system performance. Speech synthesis evaluation and voice cloning evaluation are essential for ensuring high-quality audiobook narration and call center automation. Voice modulation technology and voice cloning technology are revolutionizing industries like interactive voice response and speech interface design. VPNs and secure platforms are essential to ensure data security. Convolutional neural networks and transformer networks are driving improvements in speech recognition quality and speech synthesis quality. Voice commerce and human-computer interaction are benefiting from these advancements, with voice modulation metrics and speech-to-text metrics playing a key role in voice commerce evaluation.
Audiobook narration and speech-to-text quality are essential for digital signage applications. Vocal training and speech therapy are also utilizing speech-to-text datasets and deep neural networks for data augmentation, enhancing the overall effectiveness of these applications. Voice banking and voice interface design are further expanding the use cases for TTS technology. In summary, the TTS market is witnessing continuous innovation, with advancements in speech recognition, voice modulation, and voice cloning metrics driving improvements in various industries, including call centers, e-commerce, and digital signage.
How is this Text-To-Speech Industry segmented?
The text-to-speech industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.
Language
English
Chinese
Spanish
Others
Technology
Neural TTS
Concatenative TTS
Formant-based TTS
Type
Natural voices
Synthetic voices
End-user
Automotive and transportation
Healthcare
Consumer Electronics
Finance
Others
Geography
North America
US
Canada
Europe
France
Germany
UK
APAC
Australia
China
India
Japan
South Korea
Rest of World (ROW)
By Language Insights
The English segment is estimated to witness significant growth during the forecast period. The Text-to-Speech (TTS) market is witnessing significant growth, driven by the increasing adoption of English language systems in various sectors. English, as the most widely used language, holds a dominant position in this market due to its extensive application in business, education, media, and technology. TTS solutions for English are developed with a diverse range of voice options, including regional accents such as American, British, and Australian, and multiple speaking styles, from formal and instructional to conversational and expressive. Virtual assistants, customer service platforms, e-learning modules, and accessibility tools are among the major applications of English TTS systems.
The integration of these solutions across these domains reflects both the global reach of the English language and the technological advancements supporting it. Advanced functionalities such as speech recognition, speaker identification, and conversational AI are becoming increasingly common in TTS systems, enhancing their capabilities and usability.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
CapSpeech-CommonVoice Audio
Dataset used for the paper CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech. Please refer to 🤗CapSpeech for the whole dataset and the 🚀CapSpeech repo for more details.
Overview
🔥 CapSpeech is a new benchmark designed for style-captioned TTS (CapTTS) tasks, including style-captioned text-to-speech synthesis with sound effects (CapTTS-SE), accent-captioned TTS (AccCapTTS), emotion-captioned TTS (EmoCapTTS) and… See the full description on the dataset page: https://huggingface.co/datasets/OpenSound/CapSpeech-CommonVoice.
v1.0
2024-10-15
Personnel requirements: professional and ordinary speakers, half male and half female
Collection equipment: professional recording booth
Data format: .WAV, .TXT, .TextGrid
Data features: ordinary and professional speakers; the text content covers all Chinese phonemes as well as the main Chinese phonetic contexts; the recorded content is completely consistent with the text
Annotation content: vowel-consonant segmentation, prosody annotation, pinyin annotation
All speakers, except those aged 0-15 and over 50, have some broadcasting training and hold a Class 1 (Grade A or Grade B) Putonghua Proficiency Certificate.
Speakers without a Putonghua certificate should have natural, friendly voices and relatively standard Putonghua.
Speaking speed should be natural, and volume and pace should be kept as consistent as possible.
Speakers should be in good condition during recording, avoiding breath sounds and excessive saliva sounds in silent segments.
Each speaker records 1,500 read-aloud sentences in one complete file, in exactly the same order as in the text, with 700 ms-1 s of silence between consecutive sentences.
The audio format is .wav, with a 48 kHz sampling rate, 16-bit depth, and a single channel. The background noise of the audio is lower than -60 dB, and the signal-to-noise ratio reaches 35 dB. The peak level of a single sentence is between -9 dB and -2 dB, with no clipping.
High-frequency information of the audio is complete.
There is no obvious noise in the audio, including but not limited to background noise, electrical hum, key-pressing sounds, breathing sounds, saliva sounds, etc.
The recordings preserve the natural human voice as faithfully as possible.
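A minimal sketch of checking a delivered clip against the format specification above (48 kHz, 16-bit PCM, mono) using soundfile and numpy; the peak threshold in the comment simply restates the figures given in this card, and the file name is a placeholder.

```python
import numpy as np
import soundfile as sf

def check_clip(path: str) -> None:
    """Verify a clip against the format spec described in this card."""
    audio, sr = sf.read(path, always_2d=False)
    info = sf.info(path)

    assert sr == 48000, f"expected 48 kHz, got {sr}"
    assert info.subtype == "PCM_16", f"expected 16-bit PCM, got {info.subtype}"
    assert audio.ndim == 1, "expected a single (mono) channel"

    peak_db = 20 * np.log10(np.max(np.abs(audio)) + 1e-12)
    print(f"{path}: peak {peak_db:.1f} dBFS (spec: roughly -9 to -2 dBFS, no clipping)")

check_clip("speaker001_0001.wav")  # placeholder file name
```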
Gender distribution: 50 men, 50 women, total 100 people
Age & quantity distribution:

| Age | Gender | Number |
| --- | --- | --- |
| 0-15 years old | Male | 5 |
| 0-15 years old | Female | 6 |
| 16-50 years old | Male | 40 |
| 16-50 years old | Female | 39 |
| Over 50 years old | Male | 5 |
| Over 50 years old | Female | 5 |

Quantity: effective recording time is 1.5-2 hours per person, 142 hours of effective audio in total, 102,400 sentences in total.
## Directory Structure
root_directory/
├── audio/
│ ├── audio1.wav
│ ├── text1.txt
The SOMOS dataset is a large-scale mean opinion scores (MOS) dataset consisting solely of neural text-to-speech (TTS) samples. It can be employed to train automatic MOS prediction systems focused on the assessment of modern synthesizers, and can stimulate advancements in acoustic model evaluation. It consists of 20K synthetic utterances of the LJ Speech voice, a public-domain speech dataset that is a common benchmark for building neural acoustic models and vocoders. Utterances are generated from 200 TTS systems, including vanilla neural acoustic models as well as models that allow prosodic variations.
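Since SOMOS is meant for training and evaluating MOS predictors, a common sanity check is comparing predicted and human scores at the system level. Below is a small sketch with hypothetical per-utterance records; the real SOMOS release defines its own file layout and splits, which are not reproduced here.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (utterance_id, system_id, human_mos, predicted_mos) records;
# the real SOMOS release defines its own splits and file format.
records = [
    ("utt_001", "sys_A", 3.8, 3.6),
    ("utt_002", "sys_A", 4.1, 3.9),
    ("utt_003", "sys_B", 2.9, 3.2),
    ("utt_004", "sys_B", 3.1, 3.0),
]

# Aggregate utterance-level scores to system level, the granularity at which
# MOS predictors for synthesizer assessment are usually compared.
human, predicted = defaultdict(list), defaultdict(list)
for _, system, mos, pred in records:
    human[system].append(mos)
    predicted[system].append(pred)

for system in sorted(human):
    print(system, round(mean(human[system]), 2), round(mean(predicted[system]), 2))
```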
CommonVoice Clones
This dataset consists of recordings taken from the Common Voice English dataset. Each voice and its transcript are used as input to a voice cloner, which generates a cloned version of the voice speaking the text.
TTS Models
We use the following high-scoring models from the TTS leaderboard:
playHT
metavoice
StyleTTSv2
XttsV2
Model Comparisons
To facilitate data exploration, check out this HF space 🤗, which allows you to listen to all clones from a given… See the full description on the dataset page: https://huggingface.co/datasets/jerpint/vox-cloned-data.
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_EVALUATION.pdf
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
TC-STAR is a European integrated project focusing on Speech-to-Speech Translation (SST). To encourage significant breakthroughs in all SST technologies, annual open competitive evaluations are organized. Automatic Speech Recognition (ASR), Spoken Language Translation (SLT) and Text-To-Speech (TTS) are evaluated independently and within an end-to-end system.
The second TC-STAR evaluation campaign took place in March 2006. Three core technologies were evaluated during the campaign:
• Automatic Speech Recognition (ASR),
• Spoken Language Translation (SLT),
• Text to Speech (TTS).
Each evaluation package includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the second evaluation campaign. The aim of these evaluation packages is to enable external players to evaluate their own systems and compare their results with those obtained during the campaign itself.
The speech databases made within the TC-STAR project were validated by SPEX, in the Netherlands, to assess their compliance with the TC-STAR format and content specifications.
This package includes the material used for the TC-STAR 2006 Automatic Speech Recognition (ASR) second evaluation campaign for Spanish. The same packages are available for English (ELRA-E0011), Mandarin (ELRA-E0013), and for the EPPS task for Spanish (ELRA-E0012/02), for ASR and for SLT in 3 directions: English-to-Spanish (ELRA-E0014), Spanish-to-English (ELRA-E0015), Chinese-to-English (ELRA-E0016).
To be able to chain the components, ASR, SLT and TTS evaluation tasks were designed to use common sets of raw data and conditions. Three evaluation tasks, common to ASR, SLT and TTS, were selected: the EPPS (European Parliament Plenary Sessions) task, the CORTES (Spanish Parliament Sessions) task and the VOA (Voice of America) task. The CORTES data were used in addition to the EPPS data to evaluate ASR in Spanish and SLT from Spanish into English.
This package was used within the CORTES task and consists of 2 data sets:
- Development data set: audio recordings of CORTES sessions from 1 to 2 December 2004, manually transcribed. 3 hours of recordings were selected and transcribed, corresponding to approximately 30,000 running words in Spanish.
- Test data set: audio recordings of CORTES sessions of 24 November 2005. As for the development set, the test data set is made of 3 hours (30,000 running words).
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_EVALUATION.pdf
TC-STAR is a European integrated project focusing on Speech-to-Speech Translation (SST). To encourage significant breakthroughs in all SST technologies, annual open competitive evaluations are organized. Automatic Speech Recognition (ASR), Spoken Language Translation (SLT) and Text-To-Speech (TTS) are evaluated independently and within an end-to-end system.
The third TC-STAR evaluation campaign took place in March 2007. Three core technologies were evaluated during the campaign:
• Automatic Speech Recognition (ASR),
• Spoken Language Translation (SLT),
• Text to Speech (TTS).
Each evaluation package includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the third evaluation campaign. The aim of these evaluation packages is to enable external players to evaluate their own systems and compare their results with those obtained during the campaign itself.
The speech databases made within the TC-STAR project were validated by SPEX, in the Netherlands, to assess their compliance with the TC-STAR format and content specifications.
This package includes the material used for the TC-STAR 2007 Automatic Speech Recognition (ASR) third evaluation campaign for Spanish. The same packages are available for both English (ELRA-E0025) and Mandarin (ELRA-E0027), and for the EPPS task for Spanish (ELRA-E0026/02), for ASR and for SLT in 3 directions: English-to-Spanish (ELRA-E0028), Spanish-to-English (ELRA-E0029-01 and E0029-02), Chinese-to-English (ELRA-E0030).
To be able to chain the components, ASR, SLT and TTS evaluation tasks were designed to use common sets of raw data and conditions. Three evaluation tasks, common to ASR, SLT and TTS, were selected: the EPPS (European Parliament Plenary Sessions) task, the CORTES (Spanish Parliament Sessions) task and the VOA (Voice of America) task. The CORTES data were used in addition to the EPPS data to evaluate ASR in Spanish and SLT from Spanish into English.
This package was used within the CORTES task and consists of one test data set, composed of audio recordings of CORTES sessions of June 2006. The test data set is made of 3 hours (33,920 running words).
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for "edited_common_voice"
More Information needed. This dataset is a Thai TTS dataset that uses voices from the Common Voice dataset and modifies them so that they do not sound like the original speakers. Medium: Thai Text-To-Speech with Tacotron2
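The card does not say how the voices were altered. Purely as an illustration of one way to make a clip "not sound like the original" speaker, here is a hedged sketch that pitch-shifts a Common Voice clip with librosa; the file names and the shift amount are arbitrary placeholders, not the transformation actually used to build this dataset.

```python
import librosa
import soundfile as sf

# Placeholder input clip and output path; the +4 semitone shift is arbitrary
# and is NOT the transformation used to create this dataset.
audio, sr = librosa.load("common_voice_th_clip.mp3", sr=None)
shifted = librosa.effects.pitch_shift(audio, sr=sr, n_steps=4)
sf.write("modified_clip.wav", shifted, sr)
```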
http://www.elra.info/media/filer_public/2015/04/13/evaluation_150325.pdf
TC-STAR is a European integrated project focusing on Speech-to-Speech Translation (SST). To encourage significant breakthrough in all SST technologies, annual open competitive evaluations are organized. Automatic Speech Recognition (ASR), Spoken Language Translation (SLT) and Text-To-Speech (TTS) are evaluated independently and within an end-to-end system.
The first TC-STAR evaluation campaign took place in March 2005.
Two core technologies were evaluated during the campaign:
• Automatic Speech Recognition (ASR),
• Spoken Language Translation (SLT).
Each evaluation package includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the first evaluation campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.
The speech databases made within the TC-STAR project were validated by SPEX, in the Netherlands, to assess their compliance with the TC-STAR format and content specifications.
This package includes the material used for the TC-STAR 2005 Automatic Speech Recognition (ASR) first evaluation campaign for the Mandarin Chinese language. The same packages are available for both English (ELRA-E0002) and Spanish (ELRA-E0003) for ASR and for SLT in 3 directions, English-to-Spanish (ELRA-E0005), Spanish-to-English (ELRA-E0006), Chinese-to-English (ELRA-E0007).
To be able to chain the components, ASR and SLT evaluation tasks were designed to use common sets of raw data and conditions. Two evaluation tasks, common to ASR and SLT, were selected: EPPS (European Parliament Plenary Sessions) task and VOA (Voice of America) task. This package was used within the VOA task and consists of 2 data sets:
- Development data set: consists of 3 hours of audio recordings from the broadcast news of Mandarin Voice of America between 1 and 3 December 1998 which corresponds more or less to 42,000 Chinese characters.
- Test data set: consists of 3 hours of audio recordings from news broadcast between 14 and 22 December 1998 and corresponds to 44,000 Chinese characters.
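These evaluation packages ship their own protocols and scoring tools; as a rough stand-in, ASR output on Mandarin broadcast data of this kind is typically scored by character error rate. A minimal, self-contained sketch is shown below; the example strings are invented, not material from the package.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution
            ))
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, the usual metric for Mandarin ASR."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

# Invented example, not taken from the evaluation data.
print(cer("美国之音新闻", "美国知音新闻"))  # 1 substitution over 6 characters -> ~0.17
```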
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Overview
To assess the multilingual zero-shot voice cloning capabilities of TTS models, we have constructed a test set encompassing 24 languages. This dataset provides both audio samples for voice cloning and corresponding test texts. Specifically, the test set for each language includes: 100 distinct test sentences. Audio samples from two speakers (one male and one female) carefully selected from the Mozilla Common Voice (MCV) dataset, intended for voice cloning. Researchers can… See the full description on the dataset page: https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set.
https://choosealicense.com/licenses/cc0-1.0/
ESLTTS
The full paper can be accessed here: arXiv, IEEE Xplore.
Dataset Access
You can access this dataset through Huggingface, Google Drive, or IEEE Dataport.
Abstract
With the progress made in speaker-adaptive TTS approaches, advanced approaches have shown a remarkable capacity to reproduce the speaker’s voice in the commonly used TTS datasets. However, mimicking voices characterized by substantial accents, such as non-native English speakers, is still… See the full description on the dataset page: https://huggingface.co/datasets/MushanW/ESLTTS.
Ar-ASR
Dataset Description
This dataset is designed for Automatic Speech Recognition (ASR), focusing on Arabic speech with precise transcriptions including tashkeel (diacritics). It contains 33,607 audio samples from multiple sources: Microsoft Edge TTS API, Common Voice (validated Arabic subset), individual contributions, and manually transcribed YouTube videos (we also added the dataset ClArTTS). The dataset is paired with aligned Arabic text transcriptions and is… See the full description on the dataset page: https://huggingface.co/datasets/CUAIStudents/Ar-ASR.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for "hausa_long_voice_dataset"
Dataset Overview
Dataset Name: Hausa Long Voice Dataset Description: This dataset contains merged Hausa language audio samples from Common Voice. Audio files from the same speaker have been concatenated to create longer audio samples with their corresponding transcriptions, designed for text-to-speech (TTS) training where longer sequences are beneficial.
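The card describes concatenating same-speaker Common Voice clips into longer samples. A rough sketch of that idea with numpy and soundfile follows; the grouping, silence padding, and file names are assumptions for illustration, not the exact recipe used to build this dataset.

```python
import numpy as np
import soundfile as sf

def concatenate_clips(paths: list[str], out_path: str, gap_seconds: float = 0.3) -> None:
    """Join same-speaker clips into one longer sample, with short silences between them."""
    pieces, sr = [], None
    for path in paths:
        audio, clip_sr = sf.read(path, always_2d=False)
        sr = sr or clip_sr
        assert clip_sr == sr, "all clips must share a sampling rate before concatenation"
        pieces.append(audio)
        pieces.append(np.zeros(int(gap_seconds * sr)))  # assumed inter-clip silence
    sf.write(out_path, np.concatenate(pieces[:-1]), sr)

# Placeholder clip list for one speaker; transcriptions would be joined in the same order.
concatenate_clips(["spk01_001.wav", "spk01_002.wav"], "spk01_long.wav")
```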
Dataset Structure
Configs:
default
Data Files:
Split: train… See the full description on the dataset page: https://huggingface.co/datasets/mide7x/hausa_long_voice_dataset.
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_EVALUATION.pdf
TC-STAR is a European integrated project focusing on Speech-to-Speech Translation (SST). To encourage significant breakthroughs in all SST technologies, annual open competitive evaluations are organized. Automatic Speech Recognition (ASR), Spoken Language Translation (SLT) and Text-To-Speech (TTS) are evaluated independently and within an end-to-end system.
The third TC-STAR evaluation campaign took place in March 2007. Three core technologies were evaluated during the campaign:
• Automatic Speech Recognition (ASR),
• Spoken Language Translation (SLT),
• Text to Speech (TTS).
Each evaluation package includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the third evaluation campaign. The aim of these evaluation packages is to enable external players to evaluate their own systems and compare their results with those obtained during the campaign itself.
The speech databases made within the TC-STAR project were validated by SPEX, in the Netherlands, to assess their compliance with the TC-STAR format and content specifications.
This package includes the material used for the TC-STAR 2007 Automatic Speech Recognition (ASR) third evaluation campaign for Spanish. The same packages are available for both English (ELRA-E0025) and Mandarin (ELRA-E0027), and for the CORTES task for Spanish (ELRA-E0026/01), for ASR and for SLT in 3 directions: English-to-Spanish (ELRA-E0028), Spanish-to-English (ELRA-E0029-01 and E0029-02), Chinese-to-English (ELRA-E0030).
To be able to chain the components, ASR, SLT and TTS evaluation tasks were designed to use common sets of raw data and conditions. Three evaluation tasks, common to ASR, SLT and TTS, were selected: the EPPS (European Parliament Plenary Sessions) task, the CORTES (Spanish Parliament Sessions) task and the VOA (Voice of America) task. The CORTES data were used in addition to the EPPS data to evaluate ASR in Spanish and SLT from Spanish into English.
This package was used within the EPPS task and consists of one test data set, composed of audio recordings of Parliament sessions from June to September 2006. The test data set is made of 3 hours (28,823 running words).
In this project, the main focus has been on preparing a clean dataset and training models for automatic recognition of Persian speech. Three methods for creating the dataset have been investigated. One of these methods has been the use of open-text audio and text resources to train the model and create a data cleaning pipeline. In this regard, the CommonVoice-V16 dataset was used and a pipeline was designed to clean the data. Text-to-speech models have also been evaluated for dataset… See the full description on the dataset page: https://huggingface.co/datasets/FanavaranPars/ASR.