100+ datasets found
  1. Hindi Text-to-Speech

    • kaggle.com
    zip
    Updated Nov 12, 2025
    Cite
    Loop Assembly (2025). Hindi Text-to-Speech [Dataset]. https://www.kaggle.com/datasets/loopassembly/hindi-tts
    Explore at:
    Available download formats: zip (7,021,369,977 bytes)
    Dataset updated
    Nov 12, 2025
    Authors
    Loop Assembly
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    Hindi Text-to-Speech (TTS) Dataset

    Overview

    This comprehensive dataset contains over 23,700 high-quality Hindi audio samples paired with their corresponding text transcriptions, specifically designed for Text-to-Speech (TTS) synthesis and speech processing research. The dataset provides a robust foundation for developing and training Hindi language speech synthesis models.

    Dataset Contents

    • Audio Files: 23.7k+ WAV format audio recordings
    • Metadata: CSV file containing text transcriptions and audio file mappings
    • Total Size: 9.12 GB of audio data
    • Language: Hindi (हिन्दी)

    Use Cases

    • Text-to-Speech Systems: Train neural TTS models for Hindi language
    • Speech Synthesis Research: Develop and evaluate speech generation algorithms
    • Voice Cloning: Create synthetic Hindi voices
    • Prosody Modeling: Study intonation and rhythm patterns in Hindi speech
    • Academic Research: Support linguistic and phonetic studies of Hindi

    Data Structure

    The dataset is organized with audio files in the audio/ directory and corresponding metadata in metadata.csv, making it easy to integrate into your ML pipeline.
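
    As a quick-start illustration, here is a minimal loading sketch using pandas and soundfile; the exact column names in metadata.csv are not documented above, so the names used below are assumptions to check against the real file:

    import pandas as pd
    import soundfile as sf

    meta = pd.read_csv("metadata.csv")
    print(meta.columns.tolist())  # inspect the actual column names first

    # Hypothetical column names ("filename", "transcription") for illustration only.
    for _, row in meta.head(5).iterrows():
        audio, sr = sf.read(f"audio/{row['filename']}")
        print(row["transcription"], sr, round(len(audio) / sr, 2))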

    Applications

    Ideal for researchers, developers, and organizations working on:

    • Voice assistants in Hindi
    • Accessibility tools for visually impaired users
    • Language learning applications
    • Speech technology products
    • AI-powered audio content creation

  2. New Zealand English TTS Speech Dataset for Speech Synthesis

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). New Zealand English TTS Speech Dataset for Speech Synthesis [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/tts-monolgue-english-newzealand
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    New Zealand
    Dataset funded by
    FutureBeeAI
    Description

    The English TTS Monologue Speech Dataset is a professionally curated resource built to train realistic, expressive, and production-grade text-to-speech (TTS) systems. It contains studio-recorded long-form speech by trained native English voice artists, each contributing 1 to 2 hours of clean, uninterrupted monologue audio.

    Unlike typical prompt-based datasets with short, isolated phrases, this collection features long-form, topic-driven monologues that mirror natural human narration. It includes content types that are directly useful for real-world applications, like audiobook-style storytelling, educational lectures, health advisories, product explainers, digital how-tos, formal announcements, and more.

    All recordings are captured in professional studios using high-end equipment and under the guidance of experienced voice directors.

    Recording & Audio Quality

    • Audio Format: WAV, 48 kHz, available in 16-bit, 24-bit, and 32-bit depth
    • SNR: Minimum 30 dB
    • Channel: Mono
    • Recording Duration: 20–30 minutes
    • Recording Environment: Studio-controlled, acoustically treated rooms
    • Per Speaker Volume: 1–2 hours of speech per artist
    • Quality Control: Each file is reviewed and cleaned for common acoustic issues, including reverberation, lip smacks, mouth clicks, thumping, hissing, plosives, sibilance, background noise, static interference, clipping, and other artifacts.

    Only clean, production-grade audio makes it into the final dataset.

    Voice Artist Selection

    All voice artists are native English speakers with professional training or prior experience in narration. We ensure a diverse pool in terms of age, gender, and region to bring a balanced and rich vocal dataset.

    • Artist Profile:
      • Gender: Male and Female
      • Age Range: 20–60 years
      • Regions: Native English-speaking regions of New Zealand
    • Selection Process: All artists are screened, onboarded, and sample-approved using FutureBeeAI's proprietary Yugo platform.

    Script Quality & Coverage

    Scripts are not generic or repetitive; they are professionally authored by domain experts to reflect real-world use cases, avoiding redundancy while covering modern vocabulary, emotional range, and phonetically rich sentence structures.

    • Word Count per Script: 3,000–5,000 words per 30-minute session
    • Content Types:
      • Storytelling
      • Script and book reading
      • Informational explainers
      • Government service instructions
      • E-commerce tutorials
      • Motivational content
      • Health & wellness guides
      • Education & career advice
    • Linguistic Design: Balanced punctuation, emotional range, modern syntax, and vocabulary diversity

    Transcripts & Alignment

    While the script is used during the recording, we also provide post-recording updates to ensure the transcript reflects the final spoken audio. Minor edits are made to adjust for skipped or rephrased words.

    • Segmentation: Time-stamped at the sentence level, aligned to actual spoken delivery (see the sketch after this list)
    • Format: Available in plain text and JSON
    • Post-processing: Corrected for disfluencies
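
    A minimal sketch of reading such a sentence-level, time-stamped JSON transcript in Python; the file name, field names, and schema are not documented here, so they are assumptions to adjust against a real file:

    import json

    # Hypothetical schema: a list of segments with start/end times (seconds) and text.
    with open("transcript.json", encoding="utf-8") as f:
        segments = json.load(f)

    for seg in segments:
        print(f"{seg['start']:.2f}-{seg['end']:.2f}: {seg['text']}")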

  3. Luganda-Swahili Speech for Text-to-Speech Synthesis

    • kaggle.com
    zip
    Updated Jul 27, 2025
    + more versions
    Cite
    Jocelyn Dumlao (2025). Luganda-Swahili Speech for Text-to-SpeechSynthesis [Dataset]. https://www.kaggle.com/datasets/jocelyndumlao/luganda-swahili-speech-for-text-to-speechsynthesis
    Explore at:
    Available download formats: zip (6,609,769,165 bytes)
    Dataset updated
    Jul 27, 2025
    Authors
    Jocelyn Dumlao
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    A Curated Crowdsourced Dataset of Luganda and Swahili Speech for Text-to-Speech Synthesis

    Description

    This dataset contains curated and preprocessed speech recordings in Luganda and Kiswahili for use in text-to-speech (TTS) research (Katumba et al., 2025). The audio and transcripts were sourced from Mozilla Common Voice (Luganda v12.0 and Kiswahili v15.0) and curated for voice consistency and quality. This dataset is designed for training and evaluating end-to-end TTS systems in low-resource African languages. The data is organized into two folders, Luganda and Kiswahili, each containing:

    wavs.zip: A ZIP archive of .wav audio files from six selected female speakers per language. All audio files have been silence-trimmed, denoised using a causal DEMUCS model, and filtered using WV-MOS to retain only clips with a predicted MOS ≄ 3.5.

    metadata.csv: A CSV file with two columns: filename and transcript. Each row corresponds to an audio file in the wavs.zip archive and provides the spoken sentence for that clip.
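
    A minimal pairing sketch; the filename and transcript columns come from the description above, while the extraction layout (wavs.zip unpacked into a wavs/ folder next to metadata.csv) is an assumption:

    import pandas as pd
    import soundfile as sf

    meta = pd.read_csv("Luganda/metadata.csv")

    first = meta.iloc[0]
    audio, sr = sf.read(f"Luganda/wavs/{first['filename']}")  # append ".wav" if the column stores bare names
    print(first["transcript"], sr, round(len(audio) / sr, 2))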

    This dataset was used in: Katumba, A., Kagumire, S., Nakatumba-Nabende, J., Quinn, J., & Murindanyi, S. (2025). Building Text-to-Speech Models for Low-Resourced Languages From Crowdsourced Data. Applied AI Letters, 6:e117. https://doi.org/10.1002/ail2.117

    Article https://doi.org/10.1002/ail2.117

    Categories

    Computer Science, Speech Processing, Natural Language Processing, Text-to-Speech, Deep Learning

    Data Source: Mendeley Dataset

  4. trump-speech-dataset-tts

    • huggingface.co
    Updated Jan 11, 2025
    Cite
    sw_tuenguyen (2025). trump-speech-dataset-tts [Dataset]. https://huggingface.co/datasets/meoconxinhxan/trump-speech-dataset-tts
    Explore at:
    Dataset updated
    Jan 11, 2025
    Authors
    sw_tuenguyen
    Description

    Trump Voice Dataset

    This dataset serves as an example for training a text-to-speech (TTS) fine-tuning platform. It consists of three columns:

    • path (required): The file path to the audio.
    • transcript (optional): The text transcript of the audio.
    • speaker_id (optional): The unique identifier for the speaker.

    If the transcript is not provided, it will be automatically generated using the Whisper-large v3 model.
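
    The platform's own transcription code is not shown here; the following is a minimal sketch of how a missing transcript could be filled in with Whisper-large v3 via the Hugging Face transformers ASR pipeline (the audio path is hypothetical):

    import torch
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-large-v3",
        torch_dtype=torch.float16,
        device="cuda:0",  # use device="cpu" if no GPU is available
    )

    result = asr("clips/example.wav")  # hypothetical audio file path
    print(result["text"])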

  5. Text to Speech Market Size, Trends Report, Share and Forecast 2030

    • mordorintelligence.com
    pdf,excel,csv,ppt
    Updated Jun 16, 2025
    Cite
    Mordor Intelligence (2025). Text to Speech Market Size, Trends Report, Share and Forecast 2030 [Dataset]. https://www.mordorintelligence.com/industry-reports/text-to-speech-market
    Explore at:
    Available download formats: pdf, excel, csv, ppt
    Dataset updated
    Jun 16, 2025
    Dataset authored and provided by
    Mordor Intelligence
    License

    https://www.mordorintelligence.com/privacy-policy

    Time period covered
    2019 - 2030
    Area covered
    Global
    Description

    The Text-To-Speech Market is Segmented by Component (Software and Services), Deployment Mode (Cloud-Based, On-Premise, and Edge Embedded), Voice Type (Neural/AI-based, Standard Concatenative, and Hybrid), Application (Consumer Media and Entertainment, E-Learning and Education, Customer Service, and More), Language (English, Spanish, Hindi, Chinese, and More), and Geography. The Market Forecasts are Provided in Terms of Value (USD).

  6. Data from: SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis

    • data.niaid.nih.gov
    • zenodo.org
    Updated Mar 7, 2025
    Cite
    Maniati, Georgia; Vioni, Alexandra; Ellinas, Nikolaos; Nikitaras, Karolos; Klapsas, Konstantinos; Sung, June Sig; Jho, Gunu; Chalamandaris, Aimilios; Tsiakoulis, Pirros (2025). SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7119399
    Explore at:
    Dataset updated
    Mar 7, 2025
    Dataset provided by
    Samsung Electronics (http://samsung.com/)
    Authors
    Maniati, Georgia; Vioni, Alexandra; Ellinas, Nikolaos; Nikitaras, Karolos; Klapsas, Konstantinos; Sung, June Sig; Jho, Gunu; Chalamandaris, Aimilios; Tsiakoulis, Pirros
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This is the public release of the Samsung Open Mean Opinion Scores (SOMOS) dataset for the evaluation of neural text-to-speech (TTS) synthesis. It consists of audio files generated by TTS models from the bibliography below, trained on a public-domain voice, together with naturalness (quality) scores assigned to each audio clip by crowdsourced listeners.

    Description

    The SOMOS dataset contains 20,000 synthetic utterances (wavs), 100 natural utterances, and 374,955 naturalness evaluations (human-assigned scores in the range 1-5). The synthetic utterances are single-speaker, generated by training several Tacotron-like acoustic models and an LPCNet vocoder on the public LJ Speech voice dataset. 2,000 text sentences were synthesized, selected from Blizzard Challenge texts of years 2007-2016, the LJ Speech corpus, as well as Wikipedia and general-domain data from the Internet.

    Naturalness evaluations were collected by crowdsourcing a listening test on Amazon Mechanical Turk in the US, GB, and CA locales. The records of listening test participants (workers) are fully anonymized. Statistics on the reliability of the scores assigned by the workers are also included, generated by processing the scores and validation controls per submission page.

    To listen to audio samples of the dataset, please see our Github page.

    The dataset release comes with a carefully designed train-validation-test split (70%-15%-15%) with unseen systems, listeners and texts, which can be used for experimentation on MOS prediction.
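
    A common use of this split is training and scoring MOS predictors; a typical report includes mean squared error plus correlations between predicted and ground-truth MOS. A minimal sketch with made-up numbers (SciPy assumed available):

    import numpy as np
    from scipy.stats import pearsonr, spearmanr

    # Hypothetical predicted vs. ground-truth MOS for a held-out test split.
    mos_true = np.array([3.8, 2.9, 4.1, 3.2, 3.6])
    mos_pred = np.array([3.6, 3.1, 4.0, 3.0, 3.9])

    print("MSE:", np.mean((mos_true - mos_pred) ** 2))
    print("Pearson r:", pearsonr(mos_true, mos_pred)[0])
    print("Spearman rho:", spearmanr(mos_true, mos_pred)[0])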

    This version also contains the necessary resources to obtain the transcripts corresponding to all dataset audios.

    Terms of use

    The dataset may be used for research purposes only, for non-commercial purposes only, and may be distributed with the same terms.

    Every time you produce research that has used this dataset, please cite the dataset appropriately.

    Cite as:

    @inproceedings{maniati22_interspeech,
      author={Georgia Maniati and Alexandra Vioni and Nikolaos Ellinas and Karolos Nikitaras and Konstantinos Klapsas and June Sig Sung and Gunu Jho and Aimilios Chalamandaris and Pirros Tsiakoulis},
      title={{SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis}},
      year=2022,
      booktitle={Proc. Interspeech 2022},
      pages={2388--2392},
      doi={10.21437/Interspeech.2022-10922}
    }

    References of resources & models used

    Voice & synthesized texts:
    K. Ito and L. Johnson, "The LJ Speech Dataset," https://keithito.com/LJ-Speech-Dataset/, 2017.

    Vocoder:
    J.-M. Valin and J. Skoglund, "LPCNet: Improving neural speech synthesis through linear prediction," in Proc. ICASSP, 2019.
    R. Vipperla, S. Park, K. Choo, S. Ishtiaq, K. Min, S. Bhattacharya, A. Mehrotra, A. G. C. P. Ramos, and N. D. Lane, "Bunched LPCNet: Vocoder for low-cost neural text-to-speech systems," in Proc. Interspeech, 2020.

    Acoustic models:
    N. Ellinas, G. Vamvoukakis, K. Markopoulos, A. Chalamandaris, G. Maniati, P. Kakoulidis, S. Raptis, J. S. Sung, H. Park, and P. Tsiakoulis, "High quality streaming speech synthesis with low, sentence-length-independent latency," in Proc. Interspeech, 2020.
    Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio et al., "Tacotron: Towards End-to-End Speech Synthesis," in Proc. Interspeech, 2017.
    J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan et al., "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions," in Proc. ICASSP, 2018.
    J. Shen, Y. Jia, M. Chrzanowski, Y. Zhang, I. Elias, H. Zen, and Y. Wu, "Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling," arXiv preprint arXiv:2010.04301, 2020.
    M. Honnibal and M. Johnson, "An Improved Non-monotonic Transition System for Dependency Parsing," in Proc. EMNLP, 2015.
    M. Dominguez, P. L. Rohrer, and J. Soler-Company, "PyToBI: A Toolkit for ToBI Labeling Under Python," in Proc. Interspeech, 2019.
    Y. Zou, S. Liu, X. Yin, H. Lin, C. Wang, H. Zhang, and Z. Ma, "Fine-grained prosody modeling in neural speech synthesis using ToBI representation," in Proc. Interspeech, 2021.
    K. Klapsas, N. Ellinas, J. S. Sung, H. Park, and S. Raptis, "Word-Level Style Control for Expressive, Non-attentive Speech Synthesis," in Proc. SPECOM, 2021.
    T. Raitio, R. Rasipuram, and D. Castellani, "Controllable neural text-to-speech synthesis using intuitive prosodic features," in Proc. Interspeech, 2020.

    Synthesized texts from the Blizzard Challenges 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2016:
    M. Fraser and S. King, "The Blizzard Challenge 2007," in Proc. SSW6, 2007.
    V. Karaiskos, S. King, R. A. Clark, and C. Mayo, "The Blizzard Challenge 2008," in Proc. Blizzard Challenge Workshop, 2008.
    A. W. Black, S. King, and K. Tokuda, "The Blizzard Challenge 2009," in Proc. Blizzard Challenge, 2009.
    S. King and V. Karaiskos, "The Blizzard Challenge 2010," 2010.
    S. King and V. Karaiskos, "The Blizzard Challenge 2011," 2011.
    S. King and V. Karaiskos, "The Blizzard Challenge 2012," 2012.
    S. King and V. Karaiskos, "The Blizzard Challenge 2013," 2013.
    S. King and V. Karaiskos, "The Blizzard Challenge 2016," 2016.

    Contact

    Alexandra Vioni - a.vioni@samsung.com

    If you have any questions or comments about the dataset, please feel free to write to us.

    We are interested in knowing if you find our dataset useful! If you use our dataset, please email us and tell us about your research.

  7. peoples_speech

    • huggingface.co
    + more versions
    Cite
    MLCommons, peoples_speech [Dataset]. https://huggingface.co/datasets/MLCommons/peoples_speech
    Explore at:
    Dataset authored and provided by
    MLCommons
    License

    Attribution 2.0 (CC BY 2.0): https://creativecommons.org/licenses/by/2.0/
    License information was derived automatically

    Description

    Dataset Card for People's Speech

      Dataset Summary
    

    The People's Speech Dataset is among the world's largest English speech recognition corpora licensed for academic and commercial use under CC-BY-SA and CC-BY 4.0. It includes 30,000+ hours of transcribed English speech from a diverse set of speakers. This open dataset is large enough to train speech-to-text systems and, crucially, is available under a permissive license.

      Supported Tasks… See the full description on the dataset page: https://huggingface.co/datasets/MLCommons/peoples_speech.
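
    A minimal loading sketch with the Hugging Face datasets library; streaming avoids downloading the full corpus up front. The configuration name "clean" and the field names are assumptions to verify on the dataset page:

    from datasets import load_dataset

    ds = load_dataset("MLCommons/peoples_speech", "clean", split="train", streaming=True)
    sample = next(iter(ds))
    print(sample.keys())  # inspect the actual fields (audio, transcript text, etc.)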
    
  8. SoraniTTS dataset: Central Kurdish (CK) Speech Corpus for Text-to-Speech

    • data.mendeley.com
    Updated Sep 16, 2025
    Cite
    Farhad Rahimi (2025). SoraniTTS dataset : Central Kurdish (CK) Speech Corpus for Text-to-Speech [Dataset]. http://doi.org/10.17632/jmtn248cc9.5
    Explore at:
    Dataset updated
    Sep 16, 2025
    Authors
    Farhad Rahimi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is a comprehensive resource for advancing Kurdish TTS systems. Text-to-speech conversion is an important topic in the design of multimedia systems, human-machine communication, and information and communication technology; together with speech recognition, its purpose is to let humans and machines communicate in the most basic and natural form, spoken language.

    For the text corpus, we collected 6,565 sentences from texts in various categories, including news, sport, health, question and exclamation sentences, science, general information, politics, education and literature, story, miscellaneous, and tourism, to create the training sentences. We thoroughly reviewed and normalized the texts, which were then recorded by a male speaker. Audio was recorded in a voice recording studio at 44,100 Hz, and all audio files were downsampled to 22,050 Hz for modeling. The recordings range from 3 to 36 seconds in length. The resulting speech corpus contains about 6,565 text-audio pairs, totaling around 19 hours. Audio files are saved in WAV format and the texts in text files in the corresponding sub-folders; for model training, all audio files are also gathered in a single folder. Each line in the transcript files is formatted as WAVS | audio file's name.wav | transcript, where the audio file's name includes the extension and the transcript is the text of the speech. The audio recording and editing process lasted 90 days and produced over 6,565 WAV files and more than 19 hours of recorded speech. The dataset helps researchers bootstrap Kurdish TTS work, reducing the time needed for this process.
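
    A minimal parsing sketch for that pipe-delimited transcript format; the transcript file name is an assumption:

    entries = []
    with open("transcript.txt", encoding="utf-8") as f:  # hypothetical file name
        for line in f:
            parts = [p.strip() for p in line.split("|")]
            if len(parts) == 3:  # expected: WAVS | <audio file name>.wav | <transcript>
                _, wav_name, transcript = parts
                entries.append((wav_name, transcript))

    print(len(entries), "text-audio pairs loaded")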

    Acknowledgments: We would like to express our sincere gratitude to Ayoub Mohammadzadeh for his invaluable support in recording the corpus.

  9. Russian TTS Speech Dataset for Speech Synthesis

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Russian TTS Speech Dataset for Speech Synthesis [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/tts-monolgue-russian-russia
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    The Russian TTS Monologue Speech Dataset is a professionally curated resource built to train realistic, expressive, and production-grade text-to-speech (TTS) systems. It contains studio-recorded long-form speech by trained native Russian voice artists, each contributing 1 to 2 hours of clean, uninterrupted monologue audio.

    Unlike typical prompt-based datasets with short, isolated phrases, this collection features long-form, topic-driven monologues that mirror natural human narration. It includes content types that are directly useful for real-world applications, like audiobook-style storytelling, educational lectures, health advisories, product explainers, digital how-tos, formal announcements, and more.

    All recordings are captured in professional studios using high-end equipment and under the guidance of experienced voice directors.

    Recording & Audio Quality

    • Audio Format: WAV, 48 kHz, available in 16-bit, 24-bit, and 32-bit depth
    • SNR: Minimum 30 dB
    • Channel: Mono
    • Recording Duration: 20–30 minutes
    • Recording Environment: Studio-controlled, acoustically treated rooms
    • Per Speaker Volume: 1–2 hours of speech per artist
    • Quality Control: Each file is reviewed and cleaned for common acoustic issues, including reverberation, lip smacks, mouth clicks, thumping, hissing, plosives, sibilance, background noise, static interference, clipping, and other artifacts.

    Only clean, production-grade audio makes it into the final dataset.

    Voice Artist Selection

    All voice artists are native Russian speakers with professional training or prior experience in narration. We ensure a diverse pool in terms of age, gender, and region to bring a balanced and rich vocal dataset.

    • Artist Profile:
      • Gender: Male and Female
      • Age Range: 20–60 years
      • Regions: Native Russian-speaking regions of Russia
    • Selection Process: All artists are screened, onboarded, and sample-approved using FutureBeeAI's proprietary Yugo platform.

    Script Quality & Coverage

    Scripts are not generic or repetitive; they are professionally authored by domain experts to reflect real-world use cases, avoiding redundancy while covering modern vocabulary, emotional range, and phonetically rich sentence structures.

    • Word Count per Script: 3,000–5,000 words per 30-minute session
    • Content Types:
      • Storytelling
      • Script and book reading
      • Informational explainers
      • Government service instructions
      • E-commerce tutorials
      • Motivational content
      • Health & wellness guides
      • Education & career advice
    • Linguistic Design: Balanced punctuation, emotional range, modern syntax, and vocabulary diversity

    Transcripts & Alignment

    While the script is used during the recording, we also provide post-recording updates to ensure the transcript reflects the final spoken audio. Minor edits are made to adjust for skipped or rephrased words.

    • Segmentation: Time-stamped at the sentence level, aligned to actual spoken delivery
    • Format: Available in plain text and JSON
    • Post-processing: Corrected for disfluencies

  10. text-to-speech-tts

    • huggingface.co
    Updated Feb 15, 2024
    Cite
    ShoyimObloqulov (2024). text-to-speech-tts [Dataset]. https://huggingface.co/datasets/shoyimobloqulov/text-to-speech-tts
    Explore at:
    Dataset updated
    Feb 15, 2024
    Authors
    ShoyimObloqulov
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for Dataset Name

    This dataset card aims to be a base template for new datasets. It has been generated using this raw template.

      Dataset Details

      Dataset Description

    Curated by: [More Information Needed]
    Funded by [optional]: [More Information Needed]
    Shared by [optional]: [More Information Needed]
    Language(s) (NLP): [More Information Needed]
    License: [More Information Needed]

      Dataset Sources [optional]
    

    Repository: [More… See the full description on the dataset page: https://huggingface.co/datasets/shoyimobloqulov/text-to-speech-tts.

  11. Lada: Ukrainian High-Quality Female Text-to-Speech Dataset

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    Updated Dec 6, 2022
    Cite
    Smoliakov, Yehor (2022). Lada: Ukrainian High-Quality Female Text-to-Speech Dataset [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_7396773
    Explore at:
    Dataset updated
    Dec 6, 2022
    Authors
    Smoliakov, Yehor
    License

    Apache License 2.0: http://www.apache.org/licenses/LICENSE-2.0

    Description

    The dataset has high-quality data recorded in a professional studio.

    Archives with the trimmed tag have silence removed (aligned) using https://github.com/proger/uk

    Features

    Quality: high

    Duration: 10h37m

    Audio formats: OPUS/WAV

    Text format: JSONL (a metadata.jsonl file)

    Frequency: 16000/22050/48000 Hz
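
    A minimal sketch for reading the metadata.jsonl file; the per-record field names are not documented above, so inspect the first record before relying on specific keys:

    import json

    with open("metadata.jsonl", encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]

    print(len(records), "records")
    print(records[0].keys() if records else "empty file")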

  12. TIMIT-TTS: a Text-to-Speech Dataset for Synthetic Speech Detection

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 21, 2022
    Cite
    Davide Salvi; Brian Hosler; Paolo Bestagini; Matthew C. Stamm; Stefano Tubaro (2022). TIMIT-TTS: a Text-to-Speech Dataset for Synthetic Speech Detection [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6560158
    Explore at:
    Dataset updated
    Sep 21, 2022
    Dataset provided by
    Politecnico di Milano, Italy
    Drexel University, USA
    Authors
    Davide Salvi; Brian Hosler; Paolo Bestagini; Matthew C. Stamm; Stefano Tubaro
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    With the rapid development of deep learning techniques, the generation and counterfeiting of multimedia material are becoming increasingly straightforward to perform. At the same time, sharing fake content on the web has become so simple that malicious users can create unpleasant situations with minimal effort. Forged media are also getting more and more complex, with manipulated videos (e.g., deepfakes, where both the visual and audio content can be counterfeited) overtaking still images. The multimedia forensic community has addressed the possible threats that this situation implies by developing detectors that verify the authenticity of multimedia objects. However, the vast majority of these tools analyze only one modality at a time. This was not a problem as long as still images were the most widely edited media, but now that manipulated videos are becoming customary, performing monomodal analyses can be reductive. Nonetheless, there is a gap in the literature regarding multimodal detectors (systems that consider both the audio and video components). This is due to the difficulty of developing them, but also to the scarcity of datasets containing forged multimodal data for training and testing the designed algorithms.

    In this paper we focus on the generation of an audio-visual deepfake dataset. First, we present a general pipeline for synthesizing speech deepfake content from a given real or fake video, facilitating the creation of counterfeit multimodal material. The proposed method uses Text-to-Speech (TTS) and Dynamic Time Warping (DTW) techniques to achieve realistic speech tracks. Then, we use the pipeline to generate and release TIMIT-TTS, a synthetic speech dataset containing the most cutting-edge methods in the TTS field. This can be used as a standalone audio dataset, or combined with DeepfakeTIMIT and VidTIMIT video datasets to perform multimodal research. Finally, we present numerous experiments to benchmark the proposed dataset in both monomodal (i.e., audio) and multimodal (i.e., audio and video) conditions. This highlights the need for multimodal forensic detectors and more multimodal deepfake data.
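
    The generation pipeline relies on Dynamic Time Warping (DTW) to align the synthetic speech with the original track. The authors' implementation is not reproduced here; the following is a generic, minimal DTW cost computation over two 1-D feature sequences, just to illustrate the alignment idea:

    import numpy as np

    def dtw_cost(x, y):
        """Classic O(len(x) * len(y)) dynamic-time-warping cost between two 1-D sequences."""
        n, m = len(x), len(y)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = abs(x[i - 1] - y[j - 1])
                D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]

    # Toy example: two time-warped versions of the same contour align with low cost.
    a = np.array([0.0, 0.5, 1.0, 0.5, 0.0])
    b = np.array([0.0, 0.0, 0.5, 1.0, 1.0, 0.5, 0.0])
    print(dtw_cost(a, b))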

    For the initial version of TIMIT-TTS v1.0

    Arxiv: https://arxiv.org/abs/2209.08000

    TIMIT-TTS Database v1.0: https://zenodo.org/record/6560159

  13. Thai TTS Speech Dataset for Speech Synthesis

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Thai TTS Speech Dataset for Speech Synthesis [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/tts-monolgue-thai-thailand
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    The Thai TTS Monologue Speech Dataset is a professionally curated resource built to train realistic, expressive, and production-grade text-to-speech (TTS) systems. It contains studio-recorded long-form speech by trained native Thai voice artists, each contributing 1 to 2 hours of clean, uninterrupted monologue audio.

    Unlike typical prompt-based datasets with short, isolated phrases, this collection features long-form, topic-driven monologues that mirror natural human narration. It includes content types that are directly useful for real-world applications, like audiobook-style storytelling, educational lectures, health advisories, product explainers, digital how-tos, formal announcements, and more.

    All recordings are captured in professional studios using high-end equipment and under the guidance of experienced voice directors.

    Recording & Audio Quality

    • Audio Format: WAV, 48 kHz, available in 16-bit, 24-bit, and 32-bit depth
    • SNR: Minimum 30 dB
    • Channel: Mono
    • Recording Duration: 20–30 minutes
    • Recording Environment: Studio-controlled, acoustically treated rooms
    • Per Speaker Volume: 1–2 hours of speech per artist
    • Quality Control: Each file is reviewed and cleaned for common acoustic issues, including reverberation, lip smacks, mouth clicks, thumping, hissing, plosives, sibilance, background noise, static interference, clipping, and other artifacts.

    Only clean, production-grade audio makes it into the final dataset.

    Voice Artist Selection

    All voice artists are native Thai speakers with professional training or prior experience in narration. We ensure a diverse pool in terms of age, gender, and region to bring a balanced and rich vocal dataset.

    • Artist Profile:
      • Gender: Male and Female
      • Age Range: 20–60 years
      • Regions: Native Thai-speaking regions of Thailand
    • Selection Process: All artists are screened, onboarded, and sample-approved using FutureBeeAI's proprietary Yugo platform.

    Script Quality & Coverage

    Scripts are not generic or repetitive; they are professionally authored by domain experts to reflect real-world use cases, avoiding redundancy while covering modern vocabulary, emotional range, and phonetically rich sentence structures.

    • Word Count per Script: 3,000–5,000 words per 30-minute session
    • Content Types:
      • Storytelling
      • Script and book reading
      • Informational explainers
      • Government service instructions
      • E-commerce tutorials
      • Motivational content
      • Health & wellness guides
      • Education & career advice
    • Linguistic Design: Balanced punctuation, emotional range, modern syntax, and vocabulary diversity

    Transcripts & Alignment

    While the script is used during the recording, we also provide post-recording updates to ensure the transcript reflects the final spoken audio. Minor edits are made to adjust for skipped or rephrased words.

    • Segmentation: Time-stamped at the sentence level, aligned to actual spoken delivery
    • Format: Available in plain text and JSON
    • Post-processing: Corrected for disfluencies

  14. Text-to-Speech Market Size USD 7.06 Billion by 2028 | TTS Industry Growth 14.7% CAGR

    • emergenresearch.com
    pdf,excel,csv,ppt
    Updated Feb 18, 2021
    Cite
    Emergen Research (2021). Text-to-Speech Market Size USD 7.06 Billion by 2028 | TTS Industry Growth 14.7% CAGR [Dataset]. https://www.emergenresearch.com/industry-report/text-to-speech-market
    Explore at:
    Available download formats: pdf, excel, csv, ppt
    Dataset updated
    Feb 18, 2021
    Dataset authored and provided by
    Emergen Research
    License

    https://www.emergenresearch.com/privacy-policy

    Area covered
    Global
    Variables measured
    Base Year, No. of Pages, Growth Drivers, Forecast Period, Segments covered, Historical Data for, Pitfalls Challenges, 2028 Value Projection, Tables, Charts, and Figures, Forecast Period 2021 - 2028 CAGR, and 1 more
    Description

    The text-to-speech market size is expected to reach USD 7.06 Billion in 2028, registering a CAGR of 14.7%. The text-to-speech industry report classifies the global market by share and trend, and on the basis of deployment mode, voice type, organization size, vertical, and region.

  15. Algerian Arabic TTS Speech Dataset for Speech Synthesis

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Algerian Arabic TTS Speech Dataset for Speech Synthesis [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/tts-monolgue-arabic-algeria
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    Algeria
    Dataset funded by
    FutureBeeAI
    Description

    The Arabic TTS Monologue Speech Dataset is a professionally curated resource built to train realistic, expressive, and production-grade text-to-speech (TTS) systems. It contains studio-recorded long-form speech by trained native Arabic voice artists, each contributing 1 to 2 hours of clean, uninterrupted monologue audio.

    Unlike typical prompt-based datasets with short, isolated phrases, this collection features long-form, topic-driven monologues that mirror natural human narration. It includes content types that are directly useful for real-world applications, like audiobook-style storytelling, educational lectures, health advisories, product explainers, digital how-tos, formal announcements, and more.

    All recordings are captured in professional studios using high-end equipment and under the guidance of experienced voice directors.

    Recording & Audio Quality

    • Audio Format: WAV, 48 kHz, available in 16-bit, 24-bit, and 32-bit depth
    • SNR: Minimum 30 dB
    • Channel: Mono
    • Recording Duration: 20–30 minutes
    • Recording Environment: Studio-controlled, acoustically treated rooms
    • Per Speaker Volume: 1–2 hours of speech per artist
    • Quality Control: Each file is reviewed and cleaned for common acoustic issues, including reverberation, lip smacks, mouth clicks, thumping, hissing, plosives, sibilance, background noise, static interference, clipping, and other artifacts.

    Only clean, production-grade audio makes it into the final dataset.

    Voice Artist Selection

    All voice artists are native Arabic speakers with professional training or prior experience in narration. We ensure a diverse pool in terms of age, gender, and region to bring a balanced and rich vocal dataset.

    • Artist Profile:
      • Gender: Male and Female
      • Age Range: 20–60 years
      • Regions: Native Arabic-speaking regions of Algeria
    • Selection Process: All artists are screened, onboarded, and sample-approved using FutureBeeAI's proprietary Yugo platform.

    Script Quality & Coverage

    Scripts are not generic or repetitive; they are professionally authored by domain experts to reflect real-world use cases, avoiding redundancy while covering modern vocabulary, emotional range, and phonetically rich sentence structures.

    • Word Count per Script: 3,000–5,000 words per 30-minute session
    • Content Types:
      • Storytelling
      • Script and book reading
      • Informational explainers
      • Government service instructions
      • E-commerce tutorials
      • Motivational content
      • Health & wellness guides
      • Education & career advice
    • Linguistic Design: Balanced punctuation, emotional range, modern syntax, and vocabulary diversity

    Transcripts & Alignment

    While the script is used during the recording, we also provide post-recording updates to ensure the transcript reflects the final spoken audio. Minor edits are made to adjust for skipped or rephrased words.

    • Segmentation: Time-stamped at the sentence level, aligned to actual spoken delivery
    • Format: Available in plain text and JSON
    • Post-processing: Corrected for disfluencies

  16. Nexdata | German Conversational Speech Data by Mobile Phone | 434 Hours | Multilingual Language Data

    • data.nexdata.ai
    Updated Nov 11, 2025
    + more versions
    Cite
    Nexdata (2025). Nexdata | German Conversational Speech Data by Mobile Phone | 434 Hours |Multilingual Language Data [Dataset]. https://data.nexdata.ai/products/nexdata-german-conversational-speech-data-by-mobile-phone-nexdata
    Explore at:
    Dataset updated
    Nov 11, 2025
    Dataset authored and provided by
    Nexdata
    Area covered
    Germany
    Description

    German (Germany) spontaneous dialogue smartphone speech dataset, collected from dialogues on given topics. Transcribed with text content, timestamps, speaker ID, gender, and other attributes.

  17. Finnish TTS Speech Dataset for Speech Synthesis

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Finnish TTS Speech Dataset for Speech Synthesis [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/tts-monolgue-finnish-finland
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    The Finnish TTS Monologue Speech Dataset is a professionally curated resource built to train realistic, expressive, and production-grade text-to-speech (TTS) systems. It contains studio-recorded long-form speech by trained native Finnish voice artists, each contributing 1 to 2 hours of clean, uninterrupted monologue audio.

    Unlike typical prompt-based datasets with short, isolated phrases, this collection features long-form, topic-driven monologues that mirror natural human narration. It includes content types that are directly useful for real-world applications, like audiobook-style storytelling, educational lectures, health advisories, product explainers, digital how-tos, formal announcements, and more.

    All recordings are captured in professional studios using high-end equipment and under the guidance of experienced voice directors.

    Recording & Audio Quality

    • Audio Format: WAV, 48 kHz, available in 16-bit, 24-bit, and 32-bit depth
    • SNR: Minimum 30 dB
    • Channel: Mono
    • Recording Duration: 20–30 minutes
    • Recording Environment: Studio-controlled, acoustically treated rooms
    • Per Speaker Volume: 1–2 hours of speech per artist
    • Quality Control: Each file is reviewed and cleaned for common acoustic issues, including reverberation, lip smacks, mouth clicks, thumping, hissing, plosives, sibilance, background noise, static interference, clipping, and other artifacts.

    Only clean, production-grade audio makes it into the final dataset.

    Voice Artist Selection

    All voice artists are native Finnish speakers with professional training or prior experience in narration. We ensure a diverse pool in terms of age, gender, and region to bring a balanced and rich vocal dataset.

    • Artist Profile:
      • Gender: Male and Female
      • Age Range: 20–60 years
      • Regions: Native Finnish-speaking regions of Finland
    • Selection Process: All artists are screened, onboarded, and sample-approved using FutureBeeAI's proprietary Yugo platform.

    Script Quality & Coverage

    Scripts are not generic or repetitive; they are professionally authored by domain experts to reflect real-world use cases, avoiding redundancy while covering modern vocabulary, emotional range, and phonetically rich sentence structures.

    • Word Count per Script: 3,000–5,000 words per 30-minute session
    • Content Types:
      • Storytelling
      • Script and book reading
      • Informational explainers
      • Government service instructions
      • E-commerce tutorials
      • Motivational content
      • Health & wellness guides
      • Education & career advice
    • Linguistic Design: Balanced punctuation, emotional range, modern syntax, and vocabulary diversity

    Transcripts & Alignment

    While the script is used during the recording, we also provide post-recording updates to ensure the transcript reflects the final spoken audio. Minor edits are made to adjust for skipped or rephrased words.

    • Segmentation: Time-stamped at the sentence level, aligned to actual spoken delivery
    • Format: Available in plain text and JSON
    • Post-processing: Corrected for disfluencies

  18. FastSpeech: Fast, Robust and Controllable Text to Speech - Dataset - LDM

    • service.tib.eu
    Updated Dec 3, 2024
    Cite
    (2024). FastSpeech: Fast, Robust and Controllable Text to Speech - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/fastspeech--fast--robust-and-controllable-text-to-speech
    Explore at:
    Dataset updated
    Dec 3, 2024
    Description

    Neural network based end-to-end text to speech (TTS) has significantly improved the quality of synthesized speech. Prominent methods (e.g., Tacotron 2) usually first generate a mel-spectrogram from text and then synthesize speech from the mel-spectrogram using a vocoder such as WaveNet. Compared with traditional concatenative and statistical parametric approaches, neural network based end-to-end models suffer from slow inference speed, and the synthesized speech is usually not robust (i.e., some words are skipped or repeated) and lacks controllability (voice speed or prosody control). In this work, we propose a novel feed-forward network based on the Transformer to generate mel-spectrograms in parallel for TTS.
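
    The description centers on mel-spectrograms as the intermediate representation between text and waveform. As a small illustration of that target representation (not of the FastSpeech model itself), here is a librosa sketch; the file name and parameters mirror common TTS preprocessing rather than FastSpeech's exact configuration:

    import librosa
    import numpy as np

    y, sr = librosa.load("sample.wav", sr=22050)  # hypothetical input file
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=256, win_length=1024, n_mels=80
    )
    log_mel = np.log(np.clip(mel, 1e-5, None))
    print(log_mel.shape)  # (n_mels, frames): the target an acoustic model learns to predict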

  19. Nexdata | American English Conversational Speech Data by Mobile Phone | 1,136 Hours | Multilingual Language Data

    • datarade.ai
    Updated Nov 12, 2025
    + more versions
    Cite
    Nexdata (2025). Nexdata | American English Conversational Speech Data by Mobile Phone | 1,136 Hours|Multilingual Language Data [Dataset]. https://datarade.ai/data-products/nexdata-american-english-conversational-speech-data-by-mobi-nexdata
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Nov 12, 2025
    Dataset authored and provided by
    Nexdata
    Area covered
    United States
    Description

    English (the United States) spontaneous dialogue smartphone speech dataset, collected from dialogues on given topics and covering the generic domain. Transcribed with text content, speaker ID, gender, and other attributes. The dataset was collected from an extensive and diverse pool of speakers (1,416 Americans) across different geographic regions, enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout data collection, storage, and usage; our datasets are GDPR, CCPA, and PIPL compliant.

    Format

    16kHz, 16 bit, uncompressed wav, mono channel;
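
    A quick way to sanity-check downloaded files against this stated format (16 kHz, 16-bit, mono WAV) with the standard-library wave module; the file name is hypothetical:

    import wave

    with wave.open("example.wav", "rb") as w:
        assert w.getframerate() == 16000, "expected 16 kHz sample rate"
        assert w.getsampwidth() == 2, "expected 16-bit samples"
        assert w.getnchannels() == 1, "expected mono audio"
        print("format OK:", round(w.getnframes() / w.getframerate(), 2), "seconds")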

    Content category

    generic domain;

    Recording condition

    Low background noise;

    Recording device

    Android Smartphone, iPhone;

    Speaker

    1,416 Americans, with 45% male and 55% female;

    Country

    the United States(USA);

    Language(Region) Code

    en-US;

    Language

    English;

    Features of annotation

    Transcription text, speaker ID, gender;

    Accuracy Rate

    Sentence Accuracy Rate (SAR) 95%

  20. ClArTTS

    • huggingface.co
    Updated Nov 13, 2025
    + more versions
    Cite
    Mohamed Bin Zayed University of Artificial Intelligence (2025). ClArTTS [Dataset]. https://huggingface.co/datasets/MBZUAI/ClArTTS
    Explore at:
    Dataset updated
    Nov 13, 2025
    Dataset authored and provided by
    Mohamed Bin Zayed University of Artificial Intelligence
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Summary

    We present a speech corpus for Classical Arabic Text-to-Speech (ClArTTS) to support the development of end-to-end TTS systems for Arabic. The speech is extracted from a LibriVox audiobook, which is then processed, segmented, and manually transcribed and annotated. The final ClArTTS corpus contains about 12 hours of speech from a single male speaker sampled at 40,100 Hz.

      Dataset Structure
    

    A typical data point comprises the name of the audio file, called… See the full description on the dataset page: https://huggingface.co/datasets/MBZUAI/ClArTTS.
