21 datasets found
  1. common_voice_13_0

    • huggingface.co
    Updated Apr 1, 2023
    Cite
    Mozilla Foundation (2023). common_voice_13_0 [Dataset]. https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0
    Dataset authored and provided by
    Mozilla Foundation (http://mozilla.org/)
    License

    CC0 1.0 (https://choosealicense.com/licenses/cc0-1.0/)

    Description

    Dataset Card for Common Voice Corpus 13.0

      Dataset Summary
    

    The Common Voice dataset consists of unique MP3 files and corresponding text files. Many of the 27141 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 17689 validated hours in 108 languages, but more voices and languages are always added. Take a look at the Languages page to… See the full description on the dataset page: https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0.
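    As a quick, hedged illustration (not part of the dataset card), a split of this corpus can typically be streamed with the Hugging Face datasets library once the dataset's terms have been accepted on the Hub and you are authenticated; the language code, split name, and printed fields below follow the standard Common Voice layout and should be treated as assumptions.

```python
# Minimal sketch, assuming a gated-dataset login (`huggingface-cli login`) and
# the usual Common Voice column names; older `datasets` versions may also need
# trust_remote_code=True because the corpus ships a loading script.
from datasets import load_dataset

cv13 = load_dataset(
    "mozilla-foundation/common_voice_13_0",
    "en",                 # one of the 108 language configs
    split="validation",
    streaming=True,       # avoid downloading the full corpus
)

for example in cv13.take(3):
    # Each row pairs an MP3 clip with its transcript and optional demographics.
    print(example["sentence"], example["audio"]["sampling_rate"], example.get("age"))
```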

  2. common_voice_8_0

    • huggingface.co
    Cite
    Mozilla Foundation, common_voice_8_0 [Dataset]. https://huggingface.co/datasets/mozilla-foundation/common_voice_8_0
    Dataset authored and provided by
    Mozilla Foundation (http://mozilla.org/)
    License

    CC0 1.0 (https://choosealicense.com/licenses/cc0-1.0/)

    Description

    Dataset Card for Common Voice Corpus 8.0

      Dataset Summary
    

    The Common Voice dataset consists of unique MP3 files and corresponding text files. Many of the 18243 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 14122 validated hours in 87 languages, but more voices and languages are always added. Take a look at the Languages page to… See the full description on the dataset page: https://huggingface.co/datasets/mozilla-foundation/common_voice_8_0.

  3. Bengali Speech Recognition Dataset (BSRD)

    • kaggle.com
    Updated Jan 14, 2025
    Cite
    Shuvo Kumar Basak-4004 (2025). Bengali Speech Recognition Dataset (BSRD) [Dataset]. http://doi.org/10.34740/kaggle/dsv/10465455
    Available download formats: Croissant (see mlcommons.org/croissant)
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Shuvo Kumar Basak-4004
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    The BengaliSpeechRecognitionDataset (BSRD) is a comprehensive dataset designed for the development and evaluation of Bengali speech recognition and text-to-speech systems. This dataset includes a collection of Bengali characters and their corresponding audio files, which are generated using speech synthesis models. It serves as an essential resource for researchers and developers working on automatic speech recognition (ASR) and text-to-speech (TTS) applications for the Bengali language.

    Key Features:
    • Bengali Characters: The dataset contains a wide range of Bengali characters, including consonants, vowels, and unique symbols used in the Bengali script, such as 'ক', 'খ', 'গ', and many more.
    • Corresponding Speech Data: For each Bengali character, an MP3 audio file is provided containing the correct pronunciation of that character. This audio is generated by a Bengali text-to-speech model, ensuring clear and accurate pronunciation.
    • 1000 Audio Samples per Folder: Each character is associated with at least 1000 MP3 files. These multiple samples provide variations of the character's pronunciation, which is essential for training robust speech recognition systems.
    • Language and Phonetic Diversity: The dataset offers a phonetic diversity of Bengali sounds, covering different tones and pronunciations commonly found in spoken Bengali, so it can be used to train models capable of recognizing diverse speech patterns.

    Use Cases:
    • Automatic Speech Recognition (ASR): BSRD is ideal for training ASR systems, as it provides accurate audio samples linked to specific Bengali characters.
    • Text-to-Speech (TTS): Researchers can use this dataset to fine-tune TTS systems for generating natural Bengali speech from text.
    • Phonetic Analysis: The dataset can be used for phonetic analysis and for developing models that study the linguistic features of Bengali pronunciation.

    Applications:
    • Voice Assistants: Build and train voice recognition systems and personal assistants that understand Bengali.
    • Speech-to-Text Systems: Develop accurate transcription systems for Bengali audio content.
    • Language Learning Tools: Create educational tools aimed at teaching Bengali pronunciation.

    Note for researchers using the dataset:

    This dataset was created by Shuvo Kumar Basak. If you use this dataset for your research or academic purposes, please ensure to cite this dataset appropriately. If you have published your research using this dataset, please share a link to your paper. Good Luck.
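    The exact on-disk layout is not spelled out above, so the sketch below assumes one sub-folder per Bengali character, each holding its MP3 samples, as the "at least 1000 MP3 files per character" description suggests; the root path is a placeholder for the extracted Kaggle download.

```python
# Hedged sketch: build a (file_path, character_label) manifest from an assumed
# BSRD layout of one folder per Bengali character. Adjust `root` and the layout
# to match the actual Kaggle archive.
import csv
from pathlib import Path

root = Path("bsrd")  # placeholder extraction directory

rows = []
for char_dir in sorted(p for p in root.iterdir() if p.is_dir()):
    for mp3 in sorted(char_dir.glob("*.mp3")):
        rows.append((str(mp3), char_dir.name))  # folder name used as the label

with open("bsrd_manifest.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["path", "character"])
    writer.writerows(rows)

print(f"{len(rows)} clips across {len({label for _, label in rows})} characters")
```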

  4. WaveFake: A data set to facilitate audio DeepFake detection

    • data.niaid.nih.gov
    Updated Jul 18, 2024
    Cite
    Schönherr, Lea (2024). WaveFake: A data set to facilitate audio DeepFake detection [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4904578
    Dataset provided by
    Frank, Joel
    Schönherr, Lea
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The main purpose of this data set is to facilitate research into audio DeepFakes. We hope that this work helps in finding new detection methods to prevent such attempts. These generated media files have been increasingly used to commit impersonation attempts or online harassment.

    The data set consists of 104,885 generated audio clips (16-bit PCM wav). We examine multiple networks trained on two reference data sets. First, the LJSpeech data set consisting of 13,100 short audio clips (on average 6 seconds each; roughly 24 hours total) read by a female speaker. It features passages from 7 non-fiction books and the audio was recorded on a MacBook Pro microphone. Second, we include samples based on the JSUT data set, specifically, basic5000 corpus. This corpus consists of 5,000 sentences covering all basic kanji of the Japanese language (4.8 seconds on average; roughly 6.7 hours total). The recordings were performed by a female native Japanese speaker in an anechoic room. Finally, we include samples from a full text-to-speech pipeline (16,283 phrases; 3.8s on average; roughly 17.5 hours total). Thus, our data set consists of approximately 175 hours of generated audio files in total. Note that we do not redistribute the reference data.

    We included a range of architectures in our data set:

    MelGAN

    Parallel WaveGAN

    Multi-Band MelGAN

    Full-Band MelGAN

    WaveGlow

    Additionally, we examined a bigger version of MelGAN and include samples from a full TTS-pipeline consisting of a conformer and parallel WaveGAN model.

    Collection Process

    For WaveGlow, we utilize the official implementation (commit 8afb643) in conjunction with the official pre-trained network on PyTorch Hub. We use a popular implementation available on GitHub (commit 12c677e) for the remaining networks. The repository also offers pre-trained models. We used the pre-trained networks to generate samples that are similar to their respective training distributions, LJ Speech and JSUT. When sampling the data set, we first extract Mel spectrograms from the original audio files, using the pre-processing scripts of the corresponding repositories. We then feed these Mel spectrograms to the respective models to obtain the data set. For sampling the full TTS results, we use the ESPnet project. To make sure the generated phrases do not overlap with the training set, we downloaded the Common Voice data set and extracted 16,285 phrases from it.
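    Purely as an illustration of the resynthesis loop described above (original audio → Mel spectrogram → pretrained vocoder → generated waveform), here is a hedged sketch: the Mel extraction uses librosa with typical vocoder settings rather than the repositories' own pre-processing scripts, and the vocoder call is a placeholder, not the actual WaveGlow/MelGAN API.

```python
# Hedged sketch of mel-spectrogram extraction for vocoder resynthesis.
# Settings (22.05 kHz, 1024-point FFT, 256 hop, 80 mel bands) are common
# vocoder defaults, not the exact WaveFake configuration.
import librosa
import numpy as np

def audio_to_log_mel(path, sr=22050, n_fft=1024, hop_length=256, n_mels=80):
    """Load a clip and compute a log-Mel spectrogram (typical vocoder input)."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    return np.log(np.clip(mel, 1e-5, None))

def resynthesize(log_mel, vocoder):
    """Placeholder: a real vocoder (MelGAN, WaveGlow, ...) maps mel -> waveform."""
    return vocoder(log_mel)  # hypothetical callable, not a specific library API

log_mel = audio_to_log_mel("LJ001-0001.wav")  # example LJSpeech-style file name
print(log_mel.shape)  # (n_mels, frames)
```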

    This data set is licensed with a CC-BY-SA 4.0 license.

    This work was supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy -- EXC-2092 CaSa -- 390781972.

  5. Egyptian Arabic General Conversation Speech Dataset for ASR

    • futurebeeai.com
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Egyptian Arabic General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-arabic-egypt
    Available download formats: wav
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Egyptian Arabic General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Arabic speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Egyptian Arabic communication.

    Curated by FutureBeeAI, this 40-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Arabic speech models that understand and respond to authentic Egyptian accents and dialects.

    Speech Data

    The dataset comprises 40 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Egyptian Arabic. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 80 verified native Egyptian Arabic speakers from FutureBeeAI’s contributor community.
    Regions: Representing various provinces of Egypt to ensure dialectal diversity and demographic balance.
    Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
    High transcription accuracy, achieved through double QA pass, average WER < 5%

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
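    The exact JSON schema is not shown here, so the field names in the sketch below (utterances, speaker, start, end, text) are assumptions about a typical speaker-segmented, time-coded transcript; map them onto the keys actually used in the delivered files.

```python
# Hedged sketch: flatten an assumed speaker-segmented, time-coded transcription
# JSON into (speaker, start, end, text) rows for an ASR training manifest.
import json

def load_utterances(json_path):
    with open(json_path, encoding="utf-8") as f:
        doc = json.load(f)
    rows = []
    for utt in doc.get("utterances", []):       # assumed top-level key
        rows.append((
            utt.get("speaker"),                 # e.g. "speaker_1" / "speaker_2"
            float(utt.get("start", 0.0)),       # start time in seconds (assumed)
            float(utt.get("end", 0.0)),         # end time in seconds (assumed)
            (utt.get("text") or "").strip(),    # verbatim transcript
        ))
    return rows

for spk, start, end, text in load_utterances("conversation_001.json")[:5]:
    print(f"[{start:7.2f}-{end:7.2f}] {spk}: {text}")
```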

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    Usage and Applications

    This dataset is a versatile resource for multiple Arabic speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for Egyptian Arabic.
    Voice Assistants: Build smart assistants capable of understanding natural Egyptian conversations.

  6. Text-To-Speech Market Analysis, Size, and Forecast 2025-2029: North America...

    • technavio.com
    Updated May 26, 2025
    Cite
    Technavio (2025). Text-To-Speech Market Analysis, Size, and Forecast 2025-2029: North America (US and Canada), Europe (France, Germany, and UK), APAC (Australia, China, India, Japan, and South Korea), and Rest of World (ROW) [Dataset]. https://www.technavio.com/report/text-to-speech-market-industry-analysis
    Dataset provided by
    TechNavio
    Authors
    Technavio
    Time period covered
    2021 - 2025
    Area covered
    France, Canada, United States, Germany, United Kingdom, Global
    Description


    Text-To-Speech Market Size 2025-2029

    The text-to-speech market size is forecast to increase by USD 3.99 billion, at a CAGR of 14.1% between 2024 and 2029.

    The Text-To-Speech (TTS) market is experiencing significant growth, driven primarily by the increasing demand for voice-enabled devices. This trend is expected to continue as voice interfaces become more integrated into daily life. Another key driver is the development of AI-based TTS models, which offer improved accuracy and natural-sounding voices; technology advancements such as artificial intelligence and machine learning are revolutionizing how TTS is delivered. However, regulatory compliance poses a significant challenge for market players: as governments and regulatory bodies impose stricter guidelines on data privacy and security, TTS providers must ensure their solutions meet these requirements to maintain customer trust and avoid potential legal issues.
    The proliferation of high-speed internet, smartphones, and tablets has further fueled market expansion. Companies seeking to capitalize on market opportunities in the TTS space should focus on developing advanced, AI-driven TTS models while prioritizing regulatory compliance to navigate this complex landscape.
    

    What will be the Size of the Text-To-Speech Market during the forecast period?


    The text-to-speech (TTS) market is experiencing significant advancements in speech recognition technology and voice search optimization. Metrics such as speech recognition dataset, voice modulation, and voice cloning play a crucial role in evaluating TTS systems' performance. Speech synthesis evaluation and voice cloning evaluation are essential for ensuring high-quality audiobook narration and call center automation. Voice modulation technology and voice cloning technology are revolutionizing industries like interactive voice response and speech interface design. VPNs and secure platforms are essential to ensure data security. Convolutional neural networks and transformer networks are driving improvements in speech recognition quality and speech synthesis quality. Voice commerce and human-computer interaction are benefiting from these advancements, with voice modulation metrics and speech-to-text metrics playing a key role in voice commerce evaluation.
    Audiobook narration and speech-to-text quality are essential for digital signage applications. Vocal training and speech therapy are also utilizing speech-to-text datasets and deep neural networks for data augmentation, enhancing the overall effectiveness of these applications. Voice banking and voice interface design are further expanding the use cases for TTS technology. In summary, the TTS market is witnessing continuous innovation, with advancements in speech recognition, voice modulation, and voice cloning metrics driving improvements in various industries, including call centers, e-commerce, and digital signage.
    

    How is this Text-To-Speech Industry segmented?

    The text-to-speech industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.

    Language
    
      English
      Chinese
      Spanish
      Others
    
    
    Technology
    
      Neural TTS
      Concatenative TTS
      Formant-based TTS
    
    
    Type
    
      Natural voices
      Synthetic voices
    
    
    End-user
    
      Automotive and transportation
      Healthcare
      Consumer Electronics
      Finance
      Others
    
    
    Geography
    
      North America
    
        US
        Canada
    
    
      Europe
    
        France
        Germany
        UK
    
    
      APAC
    
        Australia
        China
        India
        Japan
        South Korea
    
    
      Rest of World (ROW)
    

    By Language Insights

    The english segment is estimated to witness significant growth during the forecast period. The Text-to-Speech (TTS) market is witnessing significant growth, driven by the increasing adoption of English language systems in various sectors. English, as the most widely used language, holds a dominant position in this market due to its extensive application in business, education, media, and technology. TTS solutions for English are developed with a diverse range of voice options, including regional accents such as American, British, and Australian, and multiple speaking styles, from formal and instructional to conversational and expressive. Virtual assistants, customer service platforms, e-learning modules, and accessibility tools are among the major applications of English TTS systems.

    The integration of these solutions in these domains reflects both the global reach of the English language and the technological advancements supporting it. Advanced functionalities such as speech recognition, speaker identification, and conversational AI are becoming increasingly common in TTS systems, enhancing their capabilities and usability. Moreover, the integration of TTS t

  7. CapSpeech-CommonVoice

    • huggingface.co
    Updated Mar 19, 2025
    Cite
    OpenSound (2025). CapSpeech-CommonVoice [Dataset]. https://huggingface.co/datasets/OpenSound/CapSpeech-CommonVoice
    Authors
    OpenSound
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    CapSpeech-CommonVoice Audio

    DataSet used for the paper: CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech Please refer to 🤗CapSpeech for the whole dataset and 🚀CapSpeech repo for more details.

      Overview
    

    🔥 CapSpeech is a new benchmark designed for style-captioned TTS (CapTTS) tasks, including style-captioned text-to-speech synthesis with sound effects (CapTTS-SE), accent-captioned TTS (AccCapTTS), emotion-captioned TTS (EmoCapTTS) and… See the full description on the dataset page: https://huggingface.co/datasets/OpenSound/CapSpeech-CommonVoice.

  8. TTS average voice library

    • kaggle.com
    Updated Oct 15, 2024
    Cite
    longmaodata (2024). TTS average voice library [Dataset]. https://www.kaggle.com/datasets/longmaodata/tts-average-voice-library
    Available download formats: Croissant (see mlcommons.org/croissant)
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    longmaodata
    Description

    🔔Due to the platform's upload size restrictions and the extensive nature of our numerous public datasets, we can only provide samples of the datasets here. If you need the full public dataset, please join our official group to access it.

    🔔It is entirely free!

    🔔This helps promote open-source development!

    Complete data size

    67.1GB

    Join the group

    🚀🚀🚀🚀https://t.me/+Y5kL2iHis9A0ZWI1

    ✅ Obtain a complete dataset

    ✅ Mutual communication within the industry

    ✅ Get more information and consultation

    ✅ Timely dataset update notifications

    Dataset Introduction

    TTS average voice library

    Version

    v1.0

    Release Date

    2024-10-15

    Data Description

    Personnel requirements: professional and ordinary speakers, half male and half female

    Collection equipment: professional recording booth

    Data format: .WAV, .TXT, .TextGrid

    Data features: ordinary and professional speakers; the text content covers all Chinese phonemes as well as the main Chinese context environments; the recorded content is completely consistent with the text

    Annotation content: vowel-consonant segmentation annotation, prosody annotation, pinyin annotation

    Requirements:

    All speakers, except those aged 0-15 and over 50, have some broadcasting training and hold a second-class or first-class Putonghua proficiency certificate.

    Speakers without a Putonghua certificate should sound natural and friendly, with relatively standard Putonghua.

    Speaking speed should be natural, and volume and speed should be kept as consistent as possible.

    Speakers should be in good condition during recording; breath sounds and excessive saliva sounds in silent segments are avoided.

    Quality

    Each speaker reads 1500 texts, recorded in one complete file in which the sentence order exactly matches the text, with 700 ms-1 s of silence between every two sentences.

    The audio format is .wav, single channel, with a 48 kHz sampling rate and 16-bit depth. The background noise of the audio is lower than -60 dB, and the signal-to-noise ratio reaches 35 dB. The peak level of a single sentence is between -9 dB and -2 dB, with no clipping.

    High-frequency information of the audio is complete.

    There is no obvious noise in the audio, including but not limited to background noise, electrical hum, key-pressing sounds, breathing sounds, saliva sounds, etc.

    The recordings preserve the real human voice as faithfully as possible.
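    Purely as an illustration, a delivered clip could be sanity-checked against the format spec above (48 kHz, 16-bit mono, background noise below -60 dB, peaks kept below full scale); the thresholds mirror the stated numbers, the noise-floor estimate is a rough heuristic, and the file name is a placeholder.

```python
# Hedged sketch: check a WAV file against the published spec using soundfile
# and numpy. The noise-floor estimate (RMS of the quietest 10% of 100 ms
# frames) is a heuristic, not the vendor's QA procedure.
import numpy as np
import soundfile as sf

def check_clip(path):
    data, sr = sf.read(path, dtype="float32")
    info = sf.info(path)
    peak_dbfs = 20 * np.log10(np.max(np.abs(data)) + 1e-12)
    frame = max(sr // 10, 1)
    frames = [data[i:i + frame] for i in range(0, max(len(data) - frame, 1), frame)]
    rms_dbfs = sorted(
        20 * np.log10(np.sqrt(np.mean(f ** 2)) + 1e-12) for f in frames
    )
    noise_floor = float(np.mean(rms_dbfs[: max(1, len(rms_dbfs) // 10)]))
    return {
        "sample_rate_ok": sr == 48000,
        "mono_ok": data.ndim == 1,
        "pcm16_ok": info.subtype == "PCM_16",
        "peak_dbfs": round(float(peak_dbfs), 1),    # spec: roughly -9 to -2 dB
        "noise_floor_dbfs": round(noise_floor, 1),  # spec: below -60 dB
    }

print(check_clip("speaker_001_utt_0001.wav"))  # placeholder file name
```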

    Gender distribution: 50 men, 50 women, total 100 people

    Age & quantity distribution (effective time 1.5-2 hours per person; total effective time 142 hours; 102,400 sentences in total):

    | Age               | Gender | Number |
    | ----------------- | ------ | ------ |
    | 0-15 years old    | Male   | 5      |
    | 0-15 years old    | Female | 6      |
    | 16-50 years old   | Male   | 40     |
    | 16-50 years old   | Female | 39     |
    | Over 50 years old | Male   | 5      |
    | Over 50 years old | Female | 5      |

    Directory Structure
    
    

    root_directory/
    ├── audio/
    │   ├── audio1.wav
    │   ├── text1.txt

  9. SOMOS Dataset

    • paperswithcode.com
    Updated Apr 8, 2022
    Cite
    Georgia Maniati; Alexandra Vioni; Nikolaos Ellinas; Karolos Nikitaras; Konstantinos Klapsas; June Sig Sung; Gunu Jho; Aimilios Chalamandaris; Pirros Tsiakoulis (2022). SOMOS Dataset [Dataset]. https://paperswithcode.com/dataset/somos
    Authors
    Georgia Maniati; Alexandra Vioni; Nikolaos Ellinas; Karolos Nikitaras; Konstantinos Klapsas; June Sig Sung; Gunu Jho; Aimilios Chalamandaris; Pirros Tsiakoulis
    Description

    The SOMOS dataset is a large-scale mean opinion scores (MOS) dataset consisting of solely neural text-to-speech (TTS) samples. It can be employed to train automatic MOS prediction systems focused on the assessment of modern synthesizers, and can stimulate advancements in acoustic model evaluation. It consists of 20K synthetic utterances of the LJ Speech voice, a public domain speech dataset which is a common benchmark for building neural acoustic models and vocoders. Utterances are generated from 200 TTS systems including vanilla neural acoustic models as well as models which allow prosodic variations.

  10. vox-cloned-data

    • huggingface.co
    Updated Oct 25, 2024
    Cite
    Jeremy Pinto (2024). vox-cloned-data [Dataset]. https://huggingface.co/datasets/jerpint/vox-cloned-data
    Available download formats: Croissant (see mlcommons.org/croissant)
    Authors
    Jeremy Pinto
    Description

    CommonVoice Clones

    This dataset consists of recordings taken from the CommonVoice English dataset. Each voice and transcript is used as input to a voice cloner to generate a cloned version of the voice and text.

      TTS Models
    

    We use the following high-scoring models from the TTS leaderboard:

    playHT, metavoice, StyleTTSv2, XttsV2

      Model Comparisons
    

    To facilitate data exploration, check out this HF space 🤗, which allows you to listen to all clones from a given… See the full description on the dataset page: https://huggingface.co/datasets/jerpint/vox-cloned-data.

  11. TC-STAR 2006 Evaluation Package - ASR Spanish - CORTES

    • catalog.elra.info
    • live.european-language-grid.eu
    Updated Mar 26, 2013
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2013). TC-STAR 2006 Evaluation Package - ASR Spanish - CORTES [Dataset]. https://catalog.elra.info/en-us/repository/browse/ELRA-E0012_01/
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalog.elra.info/static/from_media/metashare/licences/ELRA_EVALUATION.pdf

    https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    TC-STAR is a European integrated project focusing on Speech-to-Speech Translation (SST). To encourage significant breakthroughs in all SST technologies, annual open competitive evaluations are organized. Automatic Speech Recognition (ASR), Spoken Language Translation (SLT) and Text-To-Speech (TTS) are evaluated independently and within an end-to-end system.

    The second TC-STAR evaluation campaign took place in March 2006. Three core technologies were evaluated during the campaign: Automatic Speech Recognition (ASR), Spoken Language Translation (SLT), and Text to Speech (TTS). Each evaluation package includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the second evaluation campaign. The aim of these evaluation packages is to enable external players to evaluate their own systems and compare their results with those obtained during the campaign itself. The speech databases made within the TC-STAR project were validated by SPEX, in the Netherlands, to assess their compliance with the TC-STAR format and content specifications.

    This package includes the material used for the TC-STAR 2006 Automatic Speech Recognition (ASR) second evaluation campaign for Spanish. The same packages are available for English (ELRA-E0011), Mandarin (ELRA-E0013), and for the EPPS task for Spanish (ELRA-E0012/02), for ASR and for SLT in 3 directions: English-to-Spanish (ELRA-E0014), Spanish-to-English (ELRA-E0015), Chinese-to-English (ELRA-E0016).

    To be able to chain the components, the ASR, SLT and TTS evaluation tasks were designed to use common sets of raw data and conditions. Three evaluation tasks, common to ASR, SLT and TTS, were selected: the EPPS (European Parliament Plenary Sessions) task, the CORTES (Spanish Parliament Sessions) task and the VOA (Voice of America) task. The CORTES data were used in addition to the EPPS data to evaluate ASR in Spanish and SLT from Spanish into English.

    This package was used within the CORTES task and consists of 2 data sets:
    - Development data set: audio recordings of CORTES sessions from 1 to 2 December 2004, manually transcribed. 3 hours of recordings were selected and transcribed, corresponding to approximately 30,000 running words in Spanish.
    - Test data set: audio recordings of CORTES sessions of 24 November 2005. As for the development set, the test data set consists of 3 hours (30,000 running words).

  12. TC-STAR 2007 Evaluation Package - ASR Spanish - CORTES

    • catalogue.elra.info
    • catalog.elra.info
    Updated Mar 26, 2013
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2013). TC-STAR 2007 Evaluation Package - ASR Spanish - CORTES [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-E0026_01/
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_EVALUATION.pdf

    Description

    TC-STAR is a European integrated project focusing on Speech-to-Speech Translation (SST). To encourage significant breakthroughs in all SST technologies, annual open competitive evaluations are organized. Automatic Speech Recognition (ASR), Spoken Language Translation (SLT) and Text-To-Speech (TTS) are evaluated independently and within an end-to-end system.

    The third TC-STAR evaluation campaign took place in March 2007. Three core technologies were evaluated during the campaign: Automatic Speech Recognition (ASR), Spoken Language Translation (SLT), and Text to Speech (TTS). Each evaluation package includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the evaluation campaign. The aim of these evaluation packages is to enable external players to evaluate their own systems and compare their results with those obtained during the campaign itself. The speech databases made within the TC-STAR project were validated by SPEX, in the Netherlands, to assess their compliance with the TC-STAR format and content specifications.

    This package includes the material used for the TC-STAR 2007 Automatic Speech Recognition (ASR) third evaluation campaign for Spanish. The same packages are available for both English (ELRA-E0025) and Mandarin (ELRA-E0027), and for the EPPS task for Spanish (ELRA-E0026/02), for ASR and for SLT in 3 directions: English-to-Spanish (ELRA-E0028), Spanish-to-English (ELRA-E0029-01 and E0029-02), Chinese-to-English (ELRA-E0030).

    To be able to chain the components, the ASR, SLT and TTS evaluation tasks were designed to use common sets of raw data and conditions. Three evaluation tasks, common to ASR, SLT and TTS, were selected: the EPPS (European Parliament Plenary Sessions) task, the CORTES (Spanish Parliament Sessions) task and the VOA (Voice of America) task. The CORTES data were used in addition to the EPPS data to evaluate ASR in Spanish and SLT from Spanish into English.

    This package was used within the CORTES task and consists of one test data set, composed of audio recordings of CORTES sessions of June 2006. The test data set is made of 3 hours (33,920 running words).

  13. edited_common_voice

    • huggingface.co
    Updated Jul 7, 2023
    Cite
    taetiya taechamatavorn (2023). edited_common_voice [Dataset]. https://huggingface.co/datasets/lunarlist/edited_common_voice
    Available download formats: Croissant (see mlcommons.org/croissant)
    Authors
    taetiya taechamatavorn
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset Card for "edited_common_voice"

    This is a Thai TTS dataset that uses voices from the Common Voice dataset, modified so they no longer sound like the original speakers. Related Medium article: "Text-To-Speech ภาษาไทยด้วย Tacotron2" (Thai Text-To-Speech with Tacotron2).

  14. TC-STAR 2005 Evaluation Package - ASR Mandarin Chinese

    • live.european-language-grid.eu
    • catalogue.elra.info
    Updated May 1, 2011
    Cite
    (2011). TC-STAR 2005 Evaluation Package - ASR Mandarin Chinese [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/1922
    Available download formats: audio
    License

    http://www.elra.info/media/filer_public/2015/04/13/evaluation_150325.pdf

    Description

    TC-STAR is a European integrated project focusing on Speech-to-Speech Translation (SST). To encourage significant breakthrough in all SST technologies, annual open competitive evaluations are organized. Automatic Speech Recognition (ASR), Spoken Language Translation (SLT) and Text-To-Speech (TTS) are evaluated independently and within an end-to-end system.

    The first TC-STAR evaluation campaign took place in March 2005.

    Two core technologies were evaluated during the campaign:

    • Automatic Speech Recognition (ASR),

    • Spoken Language Translation (SLT).

    Each evaluation package includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the first evaluation campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.

    The speech databases made within the TC-STAR project were validated by SPEX, in the Netherlands, to assess their compliance with the TC-STAR format and content specifications.

    This package includes the material used for the TC-STAR 2005 Automatic Speech Recognition (ASR) first evaluation campaign for the Mandarin Chinese language. The same packages are available for both English (ELRA-E0002) and Spanish (ELRA-E0003) for ASR and for SLT in 3 directions, English-to-Spanish (ELRA-E0005), Spanish-to-English (ELRA-E0006), Chinese-to-English (ELRA-E0007).

    To be able to chain the components, ASR and SLT evaluation tasks were designed to use common sets of raw data and conditions. Two evaluation tasks, common to ASR and SLT, were selected: EPPS (European Parliament Plenary Sessions) task and VOA (Voice of America) task. This package was used within the VOA task and consists of 2 data sets:

    - Development data set: consists of 3 hours of audio recordings from the broadcast news of Mandarin Voice of America between 1 and 3 December 1998 which corresponds more or less to 42,000 Chinese characters.

    - Test data set: consists of 3 hours of audio recordings from news broadcast between 14 and 22 December 1998 and corresponds to 44,000 Chinese characters.

  15. TTS-Multilingual-Test-Set

    • huggingface.co
    Updated May 27, 2025
    Cite
    MiniMax (2025). TTS-Multilingual-Test-Set [Dataset]. https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set
    Dataset authored and provided by
    MiniMax
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Overview

    To assess the multilingual zero-shot voice cloning capabilities of TTS models, we have constructed a test set encompassing 24 languages. This dataset provides both audio samples for voice cloning and corresponding test texts. Specifically, the test set for each language includes 100 distinct test sentences and audio samples from two speakers (one male and one female) carefully selected from the Mozilla Common Voice (MCV) dataset, intended for voice cloning. Researchers can… See the full description on the dataset page: https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set.

  16. ESLTTS

    • huggingface.co
    Updated Jun 22, 2024
    Cite
    MushanW (2024). ESLTTS [Dataset]. https://huggingface.co/datasets/MushanW/ESLTTS
    Available download formats: Croissant (see mlcommons.org/croissant)
    Authors
    MushanW
    License

    CC0 1.0 (https://choosealicense.com/licenses/cc0-1.0/)

    Description

    ESLTTS

    The full paper can be accessed here: arXiv, IEEE Xplore.

      Dataset Access
    

    You can access this dataset through Hugging Face, Google Drive, or IEEE Dataport.

      Abstract
    

    With the progress made in speaker-adaptive TTS approaches, advanced approaches have shown a remarkable capacity to reproduce the speaker’s voice in the commonly used TTS datasets. However, mimicking voices characterized by substantial accents, such as non-native English speakers, is still… See the full description on the dataset page: https://huggingface.co/datasets/MushanW/ESLTTS.

  17. Ar-ASR

    • huggingface.co
    Updated Jun 1, 2025
    Cite
    Cairo University AI Students (2025). Ar-ASR [Dataset]. https://huggingface.co/datasets/CUAIStudents/Ar-ASR
    Dataset authored and provided by
    Cairo University AI Students
    Description

    Ar-ASR

      Dataset Description
    

    This dataset is designed for Automatic Speech Recognition (ASR), focusing on Arabic speech with precise transcriptions including tashkeel (diacritics). It contains 33,607 audio samples from multiple sources: Microsoft Edge TTS API, Common Voice (validated Arabic subset), individual contributions, and manually transcribed YouTube videos (we also added the dataset ClArTTS). The dataset is paired with aligned Arabic text transcriptions and is… See the full description on the dataset page: https://huggingface.co/datasets/CUAIStudents/Ar-ASR.

  18. hausa_long_voice_dataset

    • huggingface.co
    Updated May 27, 2025
    Cite
    Olumide Adewole (2025). hausa_long_voice_dataset [Dataset]. https://huggingface.co/datasets/mide7x/hausa_long_voice_dataset
    Authors
    Olumide Adewole
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for "hausa_long_voice_dataset"

      Dataset Overview
    

    Dataset Name: Hausa Long Voice Dataset Description: This dataset contains merged Hausa language audio samples from Common Voice. Audio files from the same speaker have been concatenated to create longer audio samples with their corresponding transcriptions, designed for text-to-speech (TTS) training where longer sequences are beneficial.
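    As a hedged sketch of that merging idea (not this dataset's actual build script), same-speaker Common Voice clips can be grouped by client_id and concatenated; the TSV and column names below follow the usual Common Voice export layout and are assumptions.

```python
# Hedged sketch: group Common Voice-style clips by speaker and concatenate
# them into one longer waveform per speaker. File and column names follow the
# standard Common Voice export (validated.tsv, client_id, path) and are
# assumptions, not this dataset's documented pipeline.
import csv
from collections import defaultdict

import librosa
import numpy as np
import soundfile as sf

clips_by_speaker = defaultdict(list)
with open("validated.tsv", encoding="utf-8") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        clips_by_speaker[row["client_id"]].append(row["path"])

for speaker, paths in clips_by_speaker.items():
    audio = [librosa.load(f"clips/{p}", sr=16000)[0] for p in paths[:20]]
    merged = np.concatenate(audio)
    sf.write(f"merged_{speaker[:8]}.wav", merged, 16000)  # 16 kHz mono output
```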

      Dataset Structure
    

    Configs:

    default

    Data Files:

    Split: train… See the full description on the dataset page: https://huggingface.co/datasets/mide7x/hausa_long_voice_dataset.

  19. TC-STAR 2007 Evaluation Package - ASR Spanish - EPPS

    • catalog.elra.info
    • live.european-language-grid.eu
    Updated Mar 26, 2013
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2013). TC-STAR 2007 Evaluation Package - ASR Spanish - EPPS [Dataset]. https://catalog.elra.info/en-us/repository/browse/ELRA-E0026_02/
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    https://catalog.elra.info/static/from_media/metashare/licences/ELRA_EVALUATION.pdf

    Description

    TC-STAR is a European integrated project focusing on Speech-to-Speech Translation (SST). To encourage significant breakthroughs in all SST technologies, annual open competitive evaluations are organized. Automatic Speech Recognition (ASR), Spoken Language Translation (SLT) and Text-To-Speech (TTS) are evaluated independently and within an end-to-end system.

    The third TC-STAR evaluation campaign took place in March 2007. Three core technologies were evaluated during the campaign: Automatic Speech Recognition (ASR), Spoken Language Translation (SLT), and Text to Speech (TTS). Each evaluation package includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the evaluation campaign. The aim of these evaluation packages is to enable external players to evaluate their own systems and compare their results with those obtained during the campaign itself. The speech databases made within the TC-STAR project were validated by SPEX, in the Netherlands, to assess their compliance with the TC-STAR format and content specifications.

    This package includes the material used for the TC-STAR 2007 Automatic Speech Recognition (ASR) third evaluation campaign for Spanish. The same packages are available for both English (ELRA-E0025) and Mandarin (ELRA-E0027), and for the CORTES task for Spanish (ELRA-E0026/01), for ASR and for SLT in 3 directions: English-to-Spanish (ELRA-E0028), Spanish-to-English (ELRA-E0029-01 and E0029-02), Chinese-to-English (ELRA-E0030).

    To be able to chain the components, the ASR, SLT and TTS evaluation tasks were designed to use common sets of raw data and conditions. Three evaluation tasks, common to ASR, SLT and TTS, were selected: the EPPS (European Parliament Plenary Sessions) task, the CORTES (Spanish Parliament Sessions) task and the VOA (Voice of America) task. The CORTES data were used in addition to the EPPS data to evaluate ASR in Spanish and SLT from Spanish into English.

    This package was used within the EPPS task and consists of one test data set, composed of audio recordings of Parliament's sessions from June to September 2006. The test data set is made of 3 hours (28,823 running words).

  20. ASR

    • huggingface.co
    Updated Sep 26, 2024
    Cite
    FanavaranPars (2024). ASR [Dataset]. https://huggingface.co/datasets/FanavaranPars/ASR
    Available download formats: Croissant (see mlcommons.org/croissant)
    Authors
    FanavaranPars
    Description

    In this project, the main focus has been on preparing a clean dataset and training models for automatic recognition of Persian speech. Three methods for creating the dataset have been investigated. One of these methods is the use of open audio and text resources to train the model and create a data-cleaning pipeline. In this regard, the CommonVoice-V16 dataset was used and a pipeline was designed to clean the data. Text-to-speech models have also been evaluated for dataset… See the full description on the dataset page: https://huggingface.co/datasets/FanavaranPars/ASR.
