100+ datasets found
  1. Hindi Text-to-Speech

    • kaggle.com
    zip
    Updated Nov 12, 2025
    Cite
    Loop Assembly (2025). Hindi Text-to-Speech [Dataset]. https://www.kaggle.com/datasets/loopassembly/hindi-tts
    Explore at:
    Available download formats: zip (7021369977 bytes)
    Dataset updated
    Nov 12, 2025
    Authors
    Loop Assembly
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    Hindi Text-to-Speech (TTS) Dataset

    Overview

    This comprehensive dataset contains over 23,700 high-quality Hindi audio samples paired with their corresponding text transcriptions, specifically designed for Text-to-Speech (TTS) synthesis and speech processing research. The dataset provides a robust foundation for developing and training Hindi language speech synthesis models.

    Dataset Contents

    • Audio Files: 23.7k+ WAV format audio recordings
    • Metadata: CSV file containing text transcriptions and audio file mappings
    • Total Size: 9.12 GB of audio data
    • Language: Hindi (हिन्दी)

    Use Cases

    • Text-to-Speech Systems: Train neural TTS models for Hindi language
    • Speech Synthesis Research: Develop and evaluate speech generation algorithms
    • Voice Cloning: Create synthetic Hindi voices
    • Prosody Modeling: Study intonation and rhythm patterns in Hindi speech
    • Academic Research: Support linguistic and phonetic studies of Hindi

    Data Structure

    The dataset is organized with audio files in the audio/ directory and corresponding metadata in metadata.csv, making it easy to integrate into your ML pipeline.
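    As a quick orientation, here is a minimal loading sketch in Python. The listing does not document the CSV schema, so the column layout below (first column holding the audio file name) is an assumption; adjust it to the actual headers in metadata.csv.

    ```python
    import pandas as pd
    import soundfile as sf

    # Load the transcription metadata; the exact column names are not
    # documented in this listing, so treat this layout as an assumption.
    meta = pd.read_csv("metadata.csv")
    print(meta.head())

    # Read one WAV file from the audio/ directory, assuming the first
    # metadata column holds the file name.
    first_file = meta.iloc[0, 0]
    audio, sample_rate = sf.read(f"audio/{first_file}")
    print(f"{first_file}: {len(audio) / sample_rate:.2f} s at {sample_rate} Hz")
    ```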

    Applications

    Ideal for researchers, developers, and organizations working on:
    • Voice assistants in Hindi
    • Accessibility tools for visually impaired users
    • Language learning applications
    • Speech technology products
    • AI-powered audio content creation

  2. Luganda-Swahili Speech for Text-to-Speech Synthesis

    • kaggle.com
    zip
    Updated Jul 27, 2025
    + more versions
    Cite
    Jocelyn Dumlao (2025). Luganda-Swahili Speech for Text-to-SpeechSynthesis [Dataset]. https://www.kaggle.com/datasets/jocelyndumlao/luganda-swahili-speech-for-text-to-speechsynthesis
    Explore at:
    Available download formats: zip (6609769165 bytes)
    Dataset updated
    Jul 27, 2025
    Authors
    Jocelyn Dumlao
    License

    CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    A Curated Crowdsourced Dataset of Luganda and Swahili Speech for Text-to-Speech Synthesis

    Description

    This dataset contains curated and preprocessed speech recordings in Luganda and Kiswahili for use in text-to-speech (TTS) research (Katumba et al., 2025). The audio and transcripts were sourced from Mozilla Common Voice (Luganda v12.0 and Kiswahili v15.0) and curated for voice consistency and quality. This dataset is designed for training and evaluating end-to-end TTS systems in low-resource African languages. The data is organized into two folders, Luganda and Kiswahili, each containing:

    wavs.zip: A ZIP archive of .wav audio files from six selected female speakers per language. All audio files have been silence-trimmed, denoised using a causal DEMUCS model, and filtered using WV-MOS to retain only clips with a predicted MOS ≥ 3.5.

    metadata.csv: A CSV file with two columns: filename and transcript. Each row corresponds to an audio file in the wavs.zip archive and provides the spoken sentence for that clip.
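    Given that layout, a minimal Python sketch for pairing transcripts with audio might look like the following. Whether metadata.csv carries a header row, and whether ZIP members are stored under bare file names, are assumptions.

    ```python
    import csv
    import zipfile

    # Read the documented two-column metadata (filename, transcript) for
    # one language folder; a header row is assumed.
    with open("Luganda/metadata.csv", newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    print(rows[0]["filename"], "->", rows[0]["transcript"])

    # Extract the matching clip from the wavs.zip archive; member names
    # are assumed to match the filename column.
    with zipfile.ZipFile("Luganda/wavs.zip") as z:
        z.extract(rows[0]["filename"], path="Luganda/wavs")
    ```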

    This dataset was used in: Katumba, A., Kagumire, S., Nakatumba-Nabende, J., Quinn, J., & Murindanyi, S. (2025). Building Text-to-Speech Models for Low-Resourced Languages From Crowdsourced Data. Applied AI Letters, 6:e117. https://doi.org/10.1002/ail2.117

    Article https://doi.org/10.1002/ail2.117

    Categories

    Computer Science, Speech Processing, Natural Language Processing, Text-to-Speech, Deep Learning

    Data Source: Mendeley Dataset

  3. SoraniTTS dataset: Central Kurdish (CK) Speech Corpus for Text-to-Speech

    • data.mendeley.com
    Updated Sep 16, 2025
    Cite
    Farhad Rahimi (2025). SoraniTTS dataset : Central Kurdish (CK) Speech Corpus for Text-to-Speech [Dataset]. http://doi.org/10.17632/jmtn248cc9.5
    Explore at:
    Dataset updated
    Sep 16, 2025
    Authors
    Farhad Rahimi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset represents a comprehensive resource for advancing Kurdish TTS systems. Converting text to speech is one of the important topics in the design and construction of multimedia systems, human-machine communication, and information and communication technology. Its purpose, along with speech recognition, is to establish communication between humans and machines in its most basic and natural form: spoken language.

    For our text corpus, we collected 6,565 sentences from texts in various categories, including news, sport, health, question and exclamation sentences, science, general information, politics, education and literature, story, miscellaneous, and tourism, to create the training sentences. We thoroughly reviewed and normalized the texts, which were then recorded by a male speaker in a voice recording studio at 44,100 Hz; all audio files are downsampled to 22,050 Hz in our modeling process. The audio clips range from 3 to 36 seconds in length. The resulting corpus contains about 6,565 text-audio pairs, totaling around 19 hours of speech. Audio files are saved in WAV format, and the texts are saved in text files in the corresponding sub-folders; for model training, all audio files are gathered in a single folder. Each line in the transcript files is formatted as WAVS | audio file's name.wav | transcript, where the audio file's name includes the extension and the transcript is the text of the speech. The audio recording and editing process lasted 90 days and produced over 6,565 WAV files and more than 19 hours of recorded speech. This dataset helps researchers advance Kurdish TTS development quickly, reducing the time required for this process.
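    Based on the pipe-delimited line format described above, a small parsing sketch might look like this (the transcript file name is a placeholder):

    ```python
    # Parse lines of the form: WAVS | audio file's name.wav | transcript
    pairs = []
    with open("transcript.txt", encoding="utf-8") as f:
        for line in f:
            parts = [p.strip() for p in line.split("|")]
            if len(parts) == 3:
                _, wav_name, text = parts
                pairs.append((wav_name, text))

    print(f"{len(pairs)} audio-text pairs loaded")
    ```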

    Acknowledgments: We would like to express our sincere gratitude to Ayoub Mohammadzadeh for his invaluable support in recording the corpus.

  4. Scripted Monologues Speech Data | 65,000 Hours | GenAI Audio Data | Text-to-Speech Data | Multilingual Language Data

    • datarade.ai
    Updated Dec 11, 2023
    Cite
    Nexdata (2023). Scripted Monologues Speech Data | 65,000 Hours | GenAI Audio Data| Text-to-Speech Data| Multilingual Language Data [Dataset]. https://datarade.ai/data-products/nexdata-multilingual-read-speech-data-65-000-hours-aud-nexdata
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Dec 11, 2023
    Dataset authored and provided by
    Nexdata
    Area covered
    Russian Federation, Lebanon, Cambodia, Jordan, France, Japan, El Salvador, Brazil, Slovenia, Netherlands
    Description
    1. Specifications
    Format: 16 kHz, 16-bit, uncompressed WAV, mono channel

    Recording environment : quiet indoor environment, without echo

    Recording content (read speech) : economy, entertainment, news, oral language, numbers, letters

    Speaker : native speaker, gender balance

    Device : Android mobile phone, iPhone

    Language : 100+ languages

    Transcription content : text, time point of speech data, 5 noise symbols, 5 special identifiers

    Accuracy rate : 95% (the accuracy rate of noise symbols and other identifiers is not included)

    Application scenarios : speech recognition, voiceprint recognition

    2. About Nexdata
    Nexdata owns off-the-shelf PB-level Large Language Model (LLM) Data, 3 million hours of Multilingual Language Data, and 800TB of Computer Vision Data. These ready-to-go Machine Learning (ML) data support instant delivery and quickly improve the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/speechrecog?source=Datarade
  5. Speech Synthesis Data | 400 Hours | Text-to-Speech (TTS) Data | Audio Data | AI Training Data | Speech Data

    • data.nexdata.ai
    Updated Aug 3, 2024
    Cite
    Nexdata (2024). Speech Synthesis Data | 400 Hours |Text-to-Speech(TTS) Data | Audio Data | AI Training Data| Speech Data [Dataset]. https://data.nexdata.ai/products/nexdata-multilingual-speech-synthesis-data-400-hours-a-nexdata
    Explore at:
    Dataset updated
    Aug 3, 2024
    Dataset authored and provided by
    Nexdata
    Area covered
    Singapore, South Korea, Greece, Poland, Hungary, Hong Kong, Mexico, United States, New Zealand, Romania
    Description

    Text-to-Speech (TTS) data is recorded by native speakers with authentic accents and pleasant voices. Phoneme coverage is balanced, and professional phoneticians participate in the annotation. The data precisely matches the research and development needs of speech synthesis.

  6. Speech Recognition Data Collection Services | 100+ Languages Resources | Audio Data | Speech Data | Machine Learning (ML) Data

    • datarade.ai
    Updated Dec 28, 2023
    Cite
    Nexdata (2023). Speech Recognition Data Collection Services | 100+ Languages Resources |Audio Data | Speech Data | Machine Learning (ML) Data [Dataset]. https://datarade.ai/data-products/nexdata-speech-recognition-data-collection-services-100-nexdata
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Dec 28, 2023
    Dataset authored and provided by
    Nexdata
    Area covered
    Sri Lanka, Cambodia, United Kingdom, Lithuania, Brazil, Malaysia, Austria, Estonia, El Salvador, Haiti
    Description
    1. Overview
    With extensive experience in speech recognition, Nexdata has a resource pool covering more than 50 countries and regions. Our linguist team works closely with clients to assist them with dictionary and text corpus construction, speech quality inspection, linguistics consulting, etc.

    2. Our Capacity
    - Global Resources: global resources covering hundreds of languages worldwide
    - Compliance: all Machine Learning (ML) data are collected with proper authorization
    - Quality: multiple rounds of quality inspection ensure high-quality data output
    - Secure Implementation: an NDA is signed to guarantee secure implementation, and Machine Learning (ML) data is destroyed upon delivery.

    3. About Nexdata
    Nexdata is equipped with professional Speech Data collection devices, tools and environments, as well as experienced project managers in data collection and quality control, so we can meet data collection requirements in various scenarios and types. Please visit us at https://www.nexdata.ai/service/speech-recognition?source=Datarade
  7. TIMIT-TTS: a Text-to-Speech Dataset for Synthetic Speech Detection

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 21, 2022
    Cite
    Davide Salvi; Brian Hosler; Paolo Bestagini; Matthew C. Stamm; Stefano Tubaro (2022). TIMIT-TTS: a Text-to-Speech Dataset for Synthetic Speech Detection [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6560158
    Explore at:
    Dataset updated
    Sep 21, 2022
    Dataset provided by
    Drexel University, USA
    Politecnico di Milano, Italy
    Authors
    Davide Salvi; Brian Hosler; Paolo Bestagini; Matthew C. Stamm; Stefano Tubaro
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    With the rapid development of deep learning techniques, the generation and counterfeiting of multimedia material are becoming increasingly straightforward to perform. At the same time, sharing fake content on the web has become so simple that malicious users can create unpleasant situations with minimal effort. Forged media are also getting more and more complex, with manipulated videos (e.g., deepfakes, where both the visual and audio contents can be counterfeited) overtaking still images. The multimedia forensic community has addressed the possible threats that this situation implies by developing detectors that verify the authenticity of multimedia objects. However, the vast majority of these tools only analyze one modality at a time. This was not a problem as long as still images were considered the most widely edited media, but now that manipulated videos are becoming customary, performing monomodal analyses could be reductive. Nonetheless, there is a lack in the literature regarding multimodal detectors (systems that consider both audio and video components). This is due to the difficulty of developing them, but also to the scarcity of datasets containing forged multimodal data to train and test the designed algorithms.

    In this paper we focus on the generation of an audio-visual deepfake dataset. First, we present a general pipeline for synthesizing speech deepfake content from a given real or fake video, facilitating the creation of counterfeit multimodal material. The proposed method uses Text-to-Speech (TTS) and Dynamic Time Warping (DTW) techniques to achieve realistic speech tracks. Then, we use the pipeline to generate and release TIMIT-TTS, a synthetic speech dataset containing the most cutting-edge methods in the TTS field. This can be used as a standalone audio dataset, or combined with DeepfakeTIMIT and VidTIMIT video datasets to perform multimodal research. Finally, we present numerous experiments to benchmark the proposed dataset in both monomodal (i.e., audio) and multimodal (i.e., audio and video) conditions. This highlights the need for multimodal forensic detectors and more multimodal deepfake data.

    For the initial version of TIMIT-TTS v1.0:

    Arxiv: https://arxiv.org/abs/2209.08000

    TIMIT-TTS Database v1.0: https://zenodo.org/record/6560159

  8. Speech Recognition Data Collection Services | 100+ Languages Resources | Audio Data | Speech Data | Machine Learning (ML) Data

    • data.nexdata.ai
    Updated Aug 3, 2024
    + more versions
    Cite
    Nexdata (2024). Speech Recognition Data Collection Services | 100+ Languages Resources |Audio Data | Speech Data | Machine Learning (ML) Data [Dataset]. https://data.nexdata.ai/
    Explore at:
    Dataset updated
    Aug 3, 2024
    Dataset authored and provided by
    Nexdata
    Area covered
    United Kingdom, United States, Japan
    Description

    Nexdata is equipped with professional recording equipment, has a resource pool spanning 70+ countries and regions, and provides various types of speech data collection services for Machine Learning (ML) data.

  9. US English TTS Speech Dataset for Speech Synthesis

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). US English TTS Speech Dataset for Speech Synthesis [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/tts-monolgue-english-us
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    The English TTS Monologue Speech Dataset is a professionally curated resource built to train realistic, expressive, and production-grade text-to-speech (TTS) systems. It contains studio-recorded long-form speech by trained native English voice artists, each contributing 1 to 2 hours of clean, uninterrupted monologue audio.

    Unlike typical prompt-based datasets with short, isolated phrases, this collection features long-form, topic-driven monologues that mirror natural human narration. It includes content types that are directly useful for real-world applications, like audiobook-style storytelling, educational lectures, health advisories, product explainers, digital how-tos, formal announcements, and more.

    All recordings are captured in professional studios using high-end equipment and under the guidance of experienced voice directors.

    Recording & Audio Quality

    Audio Format: WAV, 48 kHz, available in 16-bit, 24-bit, and 32-bit depth
    SNR: Minimum 30 dB
    Channel: Mono
    Recording Duration: 20-30 minutes
    Recording Environment: Studio-controlled, acoustically treated rooms
    Per Speaker Volume: 1–2 hours of speech per artist
    Quality Control: Each file is reviewed and cleaned for common acoustic issues, including: reverberation, lip smacks, mouth clicks, thumping, hissing, plosives, sibilance, background noise, static interference, clipping, and other artifacts.

    Only clean, production-grade audio makes it into the final dataset.
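    As a sanity check against the stated specifications (48 kHz mono WAV at 16/24/32-bit depth), a received file could be validated along these lines; the file name is a placeholder:

    ```python
    import soundfile as sf

    # Verify a delivered file against the documented audio specs.
    info = sf.info("sample_recording.wav")
    assert info.samplerate == 48000, "expected 48 kHz"
    assert info.channels == 1, "expected mono"
    assert info.subtype in {"PCM_16", "PCM_24", "PCM_32"}, "unexpected bit depth"
    print(info)
    ```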

    Voice Artist Selection

    All voice artists are native English speakers with professional training or prior experience in narration. We ensure a diverse pool in terms of age, gender, and region to bring a balanced and rich vocal dataset.

    Artist Profile:
    Gender: Male and Female
    Age Range: 20–60 years
    Regions: native English-speaking regions of the United States of America
    Selection Process: All artists are screened, onboarded, and sample-approved using FutureBeeAI’s proprietary Yugo platform.

    Script Quality & Coverage

    Scripts are neither generic nor repetitive: they are professionally authored by domain experts to reflect real-world use cases. They avoid redundancy and include modern vocabulary, emotional range, and phonetically rich sentence structures.

    Word Count per Script: 3,000–5,000 words per 30-minute session
    Content Types:
    Storytelling
    Script and book reading
    Informational explainers
    Government service instructions
    E-commerce tutorials
    Motivational content
    Health & wellness guides
    Education & career advice
    Linguistic Design: Balanced punctuation, emotional range, modern syntax, and vocabulary diversity

    Transcripts & Alignment

    While the script is used during the recording, we also provide post-recording updates to ensure the transcript reflects the final spoken audio. Minor edits are made to adjust for skipped or rephrased words.

    Segmentation: Time-stamped at the sentence level, aligned to actual spoken delivery
    Format: Available in plain text and JSON
    Post-processing:
    Corrected for

  10. Unscripted Call Center Telephony Speech Data | 20,000 Hours | Speech Recognition Data | Speech AI Datasets | Audio Data

    • data.nexdata.ai
    Updated Mar 8, 2025
    Cite
    Nexdata (2025). Unscripted Call Center Telephony Speech Data | 20,000 Hours |Speech Recognition Data| Speech AI Datasets| Audio Data [Dataset]. https://data.nexdata.ai/products/unscripted-call-center-telephony-speech-data-20-000-hours-nexdata
    Explore at:
    Dataset updated
    Mar 8, 2025
    Dataset authored and provided by
    Nexdata
    Area covered
    Switzerland, Philippines, Uruguay, Malaysia, Norway, Russian Federation, Poland, Portugal, New Zealand, Venezuela
    Description

    Off-the-shelf 20,000 hours of Unscripted Call Center Telephony Audio Data, covering 30+ languages including English, German, French, Spanish, Italian, Portuguese, Korean, Japanese, Hindi, and Arabic. It covers multiple domains such as finance, real estate, sales, health, insurance, and telecom.

  11. Bengali Speech Recognition Dataset (BSRD)

    • kaggle.com
    zip
    Updated Jan 14, 2025
    Cite
    Shuvo Kumar Basak-4004 (2025). Bengali Speech Recognition Dataset (BSRD) [Dataset]. https://www.kaggle.com/datasets/shuvokumarbasak4004/bengali-speech-recognition-dataset-bsrd
    Explore at:
    Available download formats: zip (300882482 bytes)
    Dataset updated
    Jan 14, 2025
    Authors
    Shuvo Kumar Basak-4004
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The BengaliSpeechRecognitionDataset (BSRD) is a comprehensive dataset designed for the development and evaluation of Bengali speech recognition and text-to-speech systems. This dataset includes a collection of Bengali characters and their corresponding audio files, which are generated using speech synthesis models. It serves as an essential resource for researchers and developers working on automatic speech recognition (ASR) and text-to-speech (TTS) applications for the Bengali language.

    Key Features:
    • Bengali Characters: The dataset contains a wide range of Bengali characters, including consonants, vowels, and unique symbols used in the Bengali script, such as 'ক', 'খ', 'গ', and many more.
    • Corresponding Speech Data: For each Bengali character, an MP3 audio file is provided containing the correct pronunciation of that character. This audio is generated by a Bengali text-to-speech model, ensuring clear and accurate pronunciation.
    • 1000 Audio Samples per Folder: Each character is associated with at least 1000 MP3 files. These multiple samples provide variations of the character's pronunciation, which is essential for training robust speech recognition systems.
    • Language and Phonetic Diversity: The dataset offers a phonetic diversity of Bengali sounds, covering different tones and pronunciations commonly found in spoken Bengali. This ensures that the dataset can be used for training models capable of recognizing diverse speech patterns.

    Use Cases:
    • Automatic Speech Recognition (ASR): BSRD is ideal for training ASR systems, as it provides accurate audio samples linked to specific Bengali characters.
    • Text-to-Speech (TTS): Researchers can use this dataset to fine-tune TTS systems for generating natural Bengali speech from text.
    • Phonetic Analysis: The dataset can be used for phonetic analysis and for developing models that study the linguistic features of Bengali pronunciation.

    Applications:
    • Voice Assistants: The dataset can be used to build and train voice recognition systems and personal assistants that understand Bengali.
    • Speech-to-Text Systems: BSRD can aid in developing accurate transcription systems for Bengali audio content.
    • Language Learning Tools: The dataset can help in creating educational tools aimed at teaching Bengali pronunciation.
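    Given the folder-per-character layout described above, a quick inventory sketch might look like this (the root directory name is a placeholder):

    ```python
    from pathlib import Path

    # Count MP3 samples per character folder; the description promises
    # at least 1000 clips per character.
    root = Path("BSRD")
    for char_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        n = len(list(char_dir.glob("*.mp3")))
        print(f"{char_dir.name}: {n} samples")
    ```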

    Note for Researchers Using the Dataset

    This dataset was created by Shuvo Kumar Basak. If you use this dataset for your research or academic purposes, please ensure to cite this dataset appropriately. If you have published your research using this dataset, please share a link to your paper. Good Luck.

  12. Data from: SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis

    • data.niaid.nih.gov
    • zenodo.org
    Updated Mar 7, 2025
    Cite
    Maniati, Georgia; Vioni, Alexandra; Ellinas, Nikolaos; Nikitaras, Karolos; Klapsas, Konstantinos; Sung, June Sig; Jho, Gunu; Chalamandaris, Aimilios; Tsiakoulis, Pirros (2025). SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7119399
    Explore at:
    Dataset updated
    Mar 7, 2025
    Dataset provided by
    Samsung Electronics (http://samsung.com/)
    Authors
    Maniati, Georgia; Vioni, Alexandra; Ellinas, Nikolaos; Nikitaras, Karolos; Klapsas, Konstantinos; Sung, June Sig; Jho, Gunu; Chalamandaris, Aimilios; Tsiakoulis, Pirros
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This is the public release of the Samsung Open Mean Opinion Scores (SOMOS) dataset for the evaluation of neural text-to-speech (TTS) synthesis. It consists of audio files generated with a public domain voice from trained TTS models based on bibliography, and numbers assigned to each audio as quality (naturalness) evaluations by several crowdsourced listeners.

    The SOMOS dataset contains 20,000 synthetic utterances (wavs), 100 natural utterances and 374,955 naturalness evaluations (human-assigned scores in the range 1-5). The synthetic utterances are single-speaker, generated by training several Tacotron-like acoustic models and an LPCNet vocoder on the LJ Speech voice public dataset. 2,000 text sentences were synthesized, selected from Blizzard Challenge texts of years 2007-2016, the LJ Speech corpus, as well as Wikipedia and general domain data from the Internet.

    Naturalness evaluations were collected via crowdsourcing a listening test on Amazon Mechanical Turk in the US, GB and CA locales. The records of listening test participants (workers) are fully anonymized. Statistics on the reliability of the scores assigned by the workers are also included, generated through processing the scores and validation controls per submission page.

    To listen to audio samples of the dataset, please see our Github page.

    The dataset release comes with a carefully designed train-validation-test split (70%-15%-15%) with unseen systems, listeners and texts, which can be used for experimentation on MOS prediction.

    This version also contains the necessary resources to obtain the transcripts corresponding to all dataset audios.

    Terms of use

    The dataset may be used for research purposes only, for non-commercial purposes only, and may be distributed with the same terms.

    Every time you produce research that has used this dataset, please cite the dataset appropriately.

    Cite as:

    @inproceedings{maniati22_interspeech,
      author={Georgia Maniati and Alexandra Vioni and Nikolaos Ellinas and Karolos Nikitaras and Konstantinos Klapsas and June Sig Sung and Gunu Jho and Aimilios Chalamandaris and Pirros Tsiakoulis},
      title={{SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis}},
      year=2022,
      booktitle={Proc. Interspeech 2022},
      pages={2388--2392},
      doi={10.21437/Interspeech.2022-10922}
    }

    References of resources & models used

    Voice & synthesized texts:
    K. Ito and L. Johnson, "The LJ Speech Dataset," https://keithito.com/LJ-Speech-Dataset/, 2017.

    Vocoder:
    J.-M. Valin and J. Skoglund, "LPCNet: Improving neural speech synthesis through linear prediction," in Proc. ICASSP, 2019.
    R. Vipperla, S. Park, K. Choo, S. Ishtiaq, K. Min, S. Bhattacharya, A. Mehrotra, A. G. C. P. Ramos, and N. D. Lane, "Bunched LPCNet: Vocoder for low-cost neural text-to-speech systems," in Proc. Interspeech, 2020.

    Acoustic models:
    N. Ellinas, G. Vamvoukakis, K. Markopoulos, A. Chalamandaris, G. Maniati, P. Kakoulidis, S. Raptis, J. S. Sung, H. Park, and P. Tsiakoulis, "High quality streaming speech synthesis with low, sentence-length-independent latency," in Proc. Interspeech, 2020.
    Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio et al., "Tacotron: Towards End-to-End Speech Synthesis," in Proc. Interspeech, 2017.
    J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan et al., "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions," in Proc. ICASSP, 2018.
    J. Shen, Y. Jia, M. Chrzanowski, Y. Zhang, I. Elias, H. Zen, and Y. Wu, "Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling," arXiv preprint arXiv:2010.04301, 2020.
    M. Honnibal and M. Johnson, "An Improved Non-monotonic Transition System for Dependency Parsing," in Proc. EMNLP, 2015.
    M. Dominguez, P. L. Rohrer, and J. Soler-Company, "PyToBI: A Toolkit for ToBI Labeling Under Python," in Proc. Interspeech, 2019.
    Y. Zou, S. Liu, X. Yin, H. Lin, C. Wang, H. Zhang, and Z. Ma, "Fine-grained prosody modeling in neural speech synthesis using ToBI representation," in Proc. Interspeech, 2021.
    K. Klapsas, N. Ellinas, J. S. Sung, H. Park, and S. Raptis, "Word-Level Style Control for Expressive, Non-attentive Speech Synthesis," in Proc. SPECOM, 2021.
    T. Raitio, R. Rasipuram, and D. Castellani, "Controllable neural text-to-speech synthesis using intuitive prosodic features," in Proc. Interspeech, 2020.

    Synthesized texts from the Blizzard Challenges 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2016:
    M. Fraser and S. King, "The Blizzard Challenge 2007," in Proc. SSW6, 2007.
    V. Karaiskos, S. King, R. A. Clark, and C. Mayo, "The Blizzard Challenge 2008," in Proc. Blizzard Challenge Workshop, 2008.
    A. W. Black, S. King, and K. Tokuda, "The Blizzard Challenge 2009," in Proc. Blizzard Challenge, 2009.
    S. King and V. Karaiskos, "The Blizzard Challenge 2010," 2010.
    S. King and V. Karaiskos, "The Blizzard Challenge 2011," 2011.
    S. King and V. Karaiskos, "The Blizzard Challenge 2012," 2012.
    S. King and V. Karaiskos, "The Blizzard Challenge 2013," 2013.
    S. King and V. Karaiskos, "The Blizzard Challenge 2016," 2016.

    Contact

    Alexandra Vioni - a.vioni@samsung.com

    If you have any questions or comments about the dataset, please feel free to write to us.

    We are interested in knowing if you find our dataset useful! If you use our dataset, please email us and tell us about your research.

  13. Spanish Speech Recognition Dataset

    • kaggle.com
    zip
    Updated Jun 25, 2025
    + more versions
    Cite
    Unidata (2025). Spanish Speech Recognition Dataset [Dataset]. https://www.kaggle.com/datasets/unidpro/spanish-speech-recognition-dataset
    Explore at:
    Available download formats: zip (93217 bytes)
    Dataset updated
    Jun 25, 2025
    Authors
    Unidata
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Spanish Speech Dataset for recognition task

    Dataset comprises 488 hours of telephone dialogues in Spanish, collected from 600 native speakers across various topics and domains. This dataset boasts an impressive 98% word accuracy rate, making it a valuable resource for advancing speech recognition technology.

    By utilizing this dataset, researchers and developers can advance their understanding and capabilities in automatic speech recognition (ASR) systems, audio transcription, and natural language processing (NLP).

    The dataset includes high-quality audio recordings with text transcriptions, making it ideal for training and evaluating speech recognition models.

    💵 Buy the Dataset: This is a limited preview of the data. To access the full dataset, please contact us at https://unidata.pro to discuss your requirements and pricing options.

    Metadata for the dataset

    - Audio files: high-quality recordings in WAV format
    - Text transcriptions: accurate and detailed transcripts for each audio segment
    - Speaker information: metadata on native speakers, including gender, etc.
    - Topics: diverse domains such as general conversations, business, etc.

    This dataset is a valuable resource for researchers and developers working on speech recognition, language models, and speech technology.

    🌐 UniData provides high-quality datasets, content moderation, data collection and annotation for your AI/ML projects

  14. trump-speech-dataset-tts

    • huggingface.co
    Updated Jan 11, 2025
    Cite
    sw_tuenguyen (2025). trump-speech-dataset-tts [Dataset]. https://huggingface.co/datasets/meoconxinhxan/trump-speech-dataset-tts
    Explore at:
    Dataset updated
    Jan 11, 2025
    Authors
    sw_tuenguyen
    Description

    Trump Voice Dataset

    This dataset serves as an example for training a text-to-speech (TTS) fine-tuning platform. It consists of three columns:

    path (required): the file path to the audio.
    transcript (optional): the text transcript of the audio.
    speaker_id (optional): the unique identifier for the speaker.

    If the transcript is not provided, it will be automatically generated using the Whisper-large v3 model.
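    A minimal loading sketch via the Hugging Face datasets library, using the repo id from the citation above; the split name is an assumption:

    ```python
    from datasets import load_dataset

    ds = load_dataset("meoconxinhxan/trump-speech-dataset-tts", split="train")

    # Each row follows the three-column schema described above.
    row = ds[0]
    print(row["path"], row.get("speaker_id"), row.get("transcript"))
    ```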

  15. Speech Synthesis Data Collection Service | 50+ Languages Resources | Numerous Voice Sample | Audio Data | Text-to-Speech (TTS) Data

    • data.nexdata.ai
    Updated Mar 6, 2024
    + more versions
    Cite
    Nexdata (2024). Speech Synthesis Data Collection Service | 50+ Languages Resources | Numerous Voice Sample | Audio Data | Text-to-Speech(TTS) Data [Dataset]. https://data.nexdata.ai/products/nexdata-speech-synthesis-data-collection-services-50-lan-nexdata
    Explore at:
    Dataset updated
    Mar 6, 2024
    Dataset authored and provided by
    Nexdata
    Area covered
    Italy, Dominican Republic, Singapore, Romania, Malaysia, Portugal, China, Azerbaijan, Uruguay, Mexico
    Description

    Nexdata provides multi-language, multi-timbre, multi-domain and multi-style speech synthesis data collection services for Text-to-Speech (TTS) data.

  16. Podcast Database - Complete Podcast Metadata, All Countries & Languages

    • datarade.ai
    .json, .csv, .sql
    Updated May 27, 2025
    Cite
    Listen Notes (2025). Podcast Database - Complete Podcast Metadata, All Countries & Languages [Dataset]. https://datarade.ai/data-products/podcast-database-complete-podcast-metadata-all-countries-listen-notes
    Explore at:
    Available download formats: .json, .csv, .sql
    Dataset updated
    May 27, 2025
    Dataset authored and provided by
    Listen Notes
    Area covered
    Colombia, Zambia, Slovenia, Bosnia and Herzegovina, Anguilla, Turkey, Gibraltar, Guinea-Bissau, Iran (Islamic Republic of), Indonesia
    Description

    == Quick facts ==

    - The most up-to-date and comprehensive podcast database available
    - All languages & all countries
    - Includes over 3,600,000 podcasts
    - Features 35+ data fields, such as basic metadata, global rank, RSS feed (with audio URLs), Spotify links, and more
    - Delivered in SQLite format
    - Learn how we build a high-quality podcast database: https://www.listennotes.help/article/105-high-quality-podcast-database-from-listen-notes

    == Use Cases ==

    - AI training, including speech recognition, generative AI, voice cloning / synthesis, and news analysis
    - Alternative data for investment research, such as sentiment analysis of executive interviews, market research, and tracking investment themes
    - PR and marketing, including social monitoring, content research, outreach, and guest booking
    - ...

    == Data Attributes ==

    See the full list of data attributes on this page: https://www.listennotes.com/podcast-datasets/fields/?filter=podcast_only

    How to access podcast audio files: Our dataset includes RSS feed URLs for all podcasts. You can retrieve audio for over 170 million episodes directly from these feeds. With access to the raw audio, you’ll have high-quality podcast speech data ideal for AI training and related applications.
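    For illustration, retrieving episode audio URLs from a single RSS feed might look like this in Python (the feed URL is a placeholder):

    ```python
    import feedparser

    # Episode audio files are published as RSS enclosures.
    feed = feedparser.parse("https://example.com/podcast/rss")
    for entry in feed.entries[:5]:
        for enc in entry.enclosures:
            print(entry.title, "->", enc.href)
    ```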

    == Custom Offers ==

    We can provide custom datasets based on your needs, such as language-specific data, daily/weekly/monthly update frequency, or one-time purchases.

    We also provide a RESTful API at PodcastAPI.com

    Contact us: hello@listennotes.com

    == Need Help? ==

    If you have any questions about our products, feel free to reach out hello@listennotes.com

    == About Listen Notes, Inc. ==

    Since 2017, Listen Notes, Inc. has provided the leading podcast search engine and podcast database.

  17. persian tts dataset (female)

    • kaggle.com
    zip
    Updated Mar 10, 2023
    + more versions
    Cite
    Magnoliasis (2023). persian tts dataset (female) [Dataset]. https://www.kaggle.com/datasets/magnoliasis/persian-tts-dataset-famale/data
    Explore at:
    Available download formats: zip (3514729836 bytes)
    Dataset updated
    Mar 10, 2023
    Authors
    Magnoliasis
    License

    CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is designed for the Persian text-to-speech task. It consists of audio files generated using Microsoft Edge's online text-to-speech service, with the text extracted from the naab textual corpus. To the best of my knowledge, this is the first publicly available ~30h clear female-voice speech dataset for Persian.

  18. peoples_speech

    • huggingface.co
    + more versions
    Cite
    MLCommons, peoples_speech [Dataset]. https://huggingface.co/datasets/MLCommons/peoples_speech
    Explore at:
    Dataset authored and provided by
    MLCommons
    License

    Attribution 2.0 (CC BY 2.0): https://creativecommons.org/licenses/by/2.0/
    License information was derived automatically

    Description

    Dataset Card for People's Speech

      Dataset Summary
    

    The People's Speech Dataset is among the world's largest English speech recognition corpora available today that is licensed for academic and commercial usage under CC-BY-SA and CC-BY 4.0. It includes 30,000+ hours of transcribed English speech with a diverse set of speakers. This open dataset is large enough to train speech-to-text systems and, crucially, is available with a permissive license.

      Supported Tasks… See the full description on the dataset page: https://huggingface.co/datasets/MLCommons/peoples_speech.
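    At 30,000+ hours, streaming is usually preferable to a full download. A minimal sketch, assuming the commonly used "clean" configuration and a "text" field:

    ```python
    from datasets import load_dataset

    # Stream samples instead of downloading the full corpus.
    ds = load_dataset("MLCommons/peoples_speech", "clean",
                      split="train", streaming=True)

    sample = next(iter(ds))
    print(sample["text"])
    ```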
    
  19. Data from: The "Mići Princ" text and speech dataset of Chakavian micro-dialects

    • live.european-language-grid.eu
    binary format
    Updated Mar 4, 2024
    Cite
    (2024). The "Mići Princ" text and speech dataset of Chakavian micro-dialects [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/23600
    Explore at:
    Available download formats: binary format
    Dataset updated
    Mar 4, 2024
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The Mići Princ "text and speech" dialectal dataset is a word-aligned version of the translation of The Little Prince into various Chakavian micro-dialects, released by the Udruga Calculus and the Peek&Poke museum (http://skupnikatalog.nsk.hr/Record/nsk.NSK01001103632), both in the form of a printed book and an audio book.

    The printed book is a translation of Antoine de Saint-Exupéry's "Le Petit Prince". The translation was performed by Tea Perinčić and the following additional translators (almost every character in the book uses a different micro-dialect): Davina Ivašić, Annamaria Grus, Maria Luisa Ivašić, Marin Miš, Josip Fafanđel, Glorija Fabijanić Jelović, Vlasta Juretić, Anica Pritchard, Tea Rosić, Dino Marković, Ilinka Babić, Jadranka Ajvaz, Vlado Simičić Vava, Irena Grdinić, and Ivana Marinčić.

    The audio book has been read by Zoran Prodanović Prlja, Davina Ivašić, Josip Fafanđel, Melita Nilović, Glorija Fabijanić Jelović, Albert Sirotich, Tea Rosić, Tea Perinčić, Dino Marković, Iva Močibob, Dražen Turina Šajeta, Vlado Simčić Vava, Ilinka Babić, Melita and Svetozar Nilović, and Ivana Marinčić.

    The master encoding of this "text and speech" dataset is available in the form of JSON files (MP_13.json for the thirteenth chapter of the book), in which the text, the turn-level alignment, and the word-level alignment to the audio are available. This master encoding is available from the MP.json.tgz archive for the text and alignment part, with the audio part of the master encoding located in the MP.wav.tgz archive.

    Besides this master encoding, an encoding focused on applications in automatic speech recognition (ASR) testing and adaptation, is available as well. Chapters 13 and 15 have been selected as testing data, and the text and audio reference files MP_13.asr.json and MP_15.asr.json contain segments split by speaker turns. The remainder of the dataset has been prepared in segments of length up to 20 seconds, ideal for training / fine-tuning current ASR systems. The text and audio reference data are available in the MP.asr.json.tgz archive, while the audio data are available in form of MP3 files in the MP.mp3.tgz archive.

    The dataset also includes an encoding for the Exmaralda speech editor (https://exmaralda.org), one file per chapter (MP_13.exb for the thirteenth chapter), available from the MP.exb.tgz archive. The wav files from the MP.wav.tgz archive are required if speech data are to be available inside Exmaralda.

    Speaker information is available in the speakers.json file, each speaker having a textual and wikidata reference to the location of the micro-dialect, as well as the name of the translator in the printed book and the reader in the audio book.

    An application of the dataset to fine-tuning the current (March 2024) SotA automatic speech recognition model for standard Croatian, whisper-v3-large (https://huggingface.co/classla/whisper-large-v3-mici-princ), shows the word error rate dropping from 35.43% to 16.83% and the character error rate dropping from 11.54% to 3.95% (in-dataset test data, two seen speakers / micro-dialects, two unseen).
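    The WER/CER comparison reported above can be reproduced for any model output with the jiwer package; the strings below are placeholders:

    ```python
    import jiwer

    reference = "reference transcript from MP_13.asr.json"
    hypothesis = "output of the fine-tuned Whisper model"

    # Word and character error rates, as reported for this dataset.
    print("WER:", jiwer.wer(reference, hypothesis))
    print("CER:", jiwer.cer(reference, hypothesis))
    ```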

  20. Synthetic-Medical-Speech-Dataset

    • huggingface.co
    Updated Oct 23, 2025
    Cite
    Hani. M (2025). Synthetic-Medical-Speech-Dataset [Dataset]. https://huggingface.co/datasets/Hani89/Synthetic-Medical-Speech-Dataset
    Explore at:
    Dataset updated
    Oct 23, 2025
    Authors
    Hani. M
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Synthetic Medical Speech Dataset

      Overview
    

    Synthetic Medical Speech Dataset is a synthetic dataset of audio-text pairs designed for developing and evaluating automatic speech recognition (ASR) models in the medical domain. The corpus contains thousands of short audio clips generated from medically relevant text using a text-to-speech (TTS) system. Each clip is paired with its corresponding transcript. Because all content is synthetically produced, the dataset does not contain… See the full description on the dataset page: https://huggingface.co/datasets/Hani89/Synthetic-Medical-Speech-Dataset.
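    A sketch for benchmarking an off-the-shelf ASR model on one synthetic clip; the split and the "audio"/"text" column names are assumptions, since the listing truncates the dataset card:

    ```python
    from datasets import load_dataset
    from transformers import pipeline

    ds = load_dataset("Hani89/Synthetic-Medical-Speech-Dataset", split="train")
    asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

    # Transcribe the first clip and compare with its paired transcript.
    clip = ds[0]["audio"]
    result = asr({"raw": clip["array"], "sampling_rate": clip["sampling_rate"]})
    print("hypothesis:", result["text"])
    print("reference :", ds[0]["text"])
    ```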
