100+ datasets found
  1. Data from: SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis

    • data.niaid.nih.gov
    • zenodo.org
    Updated Mar 7, 2025
    Cite
    Tsiakoulis, Pirros (2025). SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7119399
    Explore at:
    Dataset updated
    Mar 7, 2025
    Dataset provided by
    Chalamandaris, Aimilios
    Klapsas, Konstantinos
    Sung, June Sig
    Nikitaras, Karolos
    Ellinas, Nikolaos
    Tsiakoulis, Pirros
    Vioni, Alexandra
    Jho, Gunu
    Maniati, Georgia
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This is the public release of the Samsung Open Mean Opinion Scores (SOMOS) dataset for the evaluation of neural text-to-speech (TTS) synthesis. It consists of audio files generated with a public-domain voice by TTS models trained following the literature, together with quality (naturalness) scores assigned to each audio by several crowdsourced listeners.

    The SOMOS dataset contains 20,000 synthetic utterances (wavs), 100 natural utterances and 374,955 naturalness evaluations (human-assigned scores in the range 1-5). The synthetic utterances are single-speaker, generated by training several Tacotron-like acoustic models and an LPCNet vocoder on the LJ Speech public voice dataset. 2,000 text sentences were synthesized, selected from Blizzard Challenge texts of the years 2007-2016, the LJ Speech corpus, as well as Wikipedia and general-domain data from the Internet.

    Naturalness evaluations were collected by crowdsourcing a listening test on Amazon Mechanical Turk in the US, GB and CA locales. The records of listening test participants (workers) are fully anonymized. Statistics on the reliability of the scores assigned by the workers are also included, generated by processing the scores and validation controls per submission page.

    To listen to audio samples of the dataset, please see our GitHub page.

    The dataset release comes with a carefully designed train-validation-test split (70%-15%-15%) with unseen systems, listeners and texts, which can be used for experimentation on MOS prediction.
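    As a simple illustration of the MOS-prediction experimentation this split supports, the Python sketch below aggregates utterance-level naturalness ratings (1-5) into system-level MOS values and scores a predictor with the Spearman rank correlation often reported for this task. The system names, ratings and predicted values are invented placeholders, not SOMOS data.

    import numpy as np
    from scipy.stats import spearmanr

    # Toy utterance-level naturalness ratings (1-5), grouped by TTS system (made-up numbers).
    ratings_by_system = {
        "system_A": [4, 5, 4, 3, 5, 4],
        "system_B": [3, 3, 2, 4, 3, 3],
        "system_C": [5, 4, 5, 5, 4, 5],
    }
    true_mos = {name: float(np.mean(scores)) for name, scores in ratings_by_system.items()}

    # Hypothetical outputs of a MOS-prediction model, already aggregated to system level.
    predicted_mos = {"system_A": 4.05, "system_B": 3.10, "system_C": 4.55}

    systems = sorted(true_mos)
    srcc, _ = spearmanr([true_mos[s] for s in systems], [predicted_mos[s] for s in systems])
    print("system-level MOS:", {s: round(true_mos[s], 2) for s in systems})
    print("system-level SRCC:", round(srcc, 3))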

    This version also contains the necessary resources to obtain the transcripts corresponding to all dataset audios.

    Terms of use

    The dataset may be used for research purposes only, for non-commercial purposes only, and may be distributed with the same terms.

    Every time you produce research that has used this dataset, please cite the dataset appropriately.

    Cite as:

    @inproceedings{maniati22_interspeech,
      author={Georgia Maniati and Alexandra Vioni and Nikolaos Ellinas and Karolos Nikitaras and Konstantinos Klapsas and June Sig Sung and Gunu Jho and Aimilios Chalamandaris and Pirros Tsiakoulis},
      title={{SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis}},
      year=2022,
      booktitle={Proc. Interspeech 2022},
      pages={2388--2392},
      doi={10.21437/Interspeech.2022-10922}
    }

    References of resources & models used

    Voice & synthesized texts:
    K. Ito and L. Johnson, “The LJ Speech Dataset,” https://keithito.com/LJ-Speech-Dataset/, 2017.

    Vocoder:
    J.-M. Valin and J. Skoglund, “LPCNet: Improving neural speech synthesis through linear prediction,” in Proc. ICASSP, 2019.
    R. Vipperla, S. Park, K. Choo, S. Ishtiaq, K. Min, S. Bhattacharya, A. Mehrotra, A. G. C. P. Ramos, and N. D. Lane, “Bunched LPCNet: Vocoder for low-cost neural text-to-speech systems,” in Proc. Interspeech, 2020.

    Acoustic models:
    N. Ellinas, G. Vamvoukakis, K. Markopoulos, A. Chalamandaris, G. Maniati, P. Kakoulidis, S. Raptis, J. S. Sung, H. Park, and P. Tsiakoulis, “High quality streaming speech synthesis with low, sentence-length-independent latency,” in Proc. Interspeech, 2020.
    Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio et al., “Tacotron: Towards End-to-End Speech Synthesis,” in Proc. Interspeech, 2017.
    J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan et al., “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions,” in Proc. ICASSP, 2018.
    J. Shen, Y. Jia, M. Chrzanowski, Y. Zhang, I. Elias, H. Zen, and Y. Wu, “Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling,” arXiv preprint arXiv:2010.04301, 2020.
    M. Honnibal and M. Johnson, “An Improved Non-monotonic Transition System for Dependency Parsing,” in Proc. EMNLP, 2015.
    M. Dominguez, P. L. Rohrer, and J. Soler-Company, “PyToBI: A Toolkit for ToBI Labeling Under Python,” in Proc. Interspeech, 2019.
    Y. Zou, S. Liu, X. Yin, H. Lin, C. Wang, H. Zhang, and Z. Ma, “Fine-grained prosody modeling in neural speech synthesis using ToBI representation,” in Proc. Interspeech, 2021.
    K. Klapsas, N. Ellinas, J. S. Sung, H. Park, and S. Raptis, “Word-Level Style Control for Expressive, Non-attentive Speech Synthesis,” in Proc. SPECOM, 2021.
    T. Raitio, R. Rasipuram, and D. Castellani, “Controllable neural text-to-speech synthesis using intuitive prosodic features,” in Proc. Interspeech, 2020.

    Synthesized texts from the Blizzard Challenges 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2016:
    M. Fraser and S. King, “The Blizzard Challenge 2007,” in Proc. SSW6, 2007.
    V. Karaiskos, S. King, R. A. Clark, and C. Mayo, “The Blizzard Challenge 2008,” in Proc. Blizzard Challenge Workshop, 2008.
    A. W. Black, S. King, and K. Tokuda, “The Blizzard Challenge 2009,” in Proc. Blizzard Challenge, 2009.
    S. King and V. Karaiskos, “The Blizzard Challenge 2010,” 2010.
    S. King and V. Karaiskos, “The Blizzard Challenge 2011,” 2011.
    S. King and V. Karaiskos, “The Blizzard Challenge 2012,” 2012.
    S. King and V. Karaiskos, “The Blizzard Challenge 2013,” 2013.
    S. King and V. Karaiskos, “The Blizzard Challenge 2016,” 2016.

    Contact

    Alexandra Vioni - a.vioni@samsung.com

    If you have any questions or comments about the dataset, please feel free to write to us.

    We are interested in knowing if you find our dataset useful! If you use our dataset, please email us and tell us about your research.

  2. Data from: English TTS speech corpus of air traffic (pilot) messages - German accent

    • lindat.cz
    • live.european-language-grid.eu
    Updated Dec 11, 2015
    + more versions
    Cite
    Jindřich Matoušek; Daniel Tihelka (2015). English TTS speech corpus of air traffic (pilot) messages - German accent [Dataset]. https://lindat.cz/repository/xmlui/handle/11234/1-1588
    Explore at:
    Dataset updated
    Dec 11, 2015
    Authors
    Jindřich Matoušek; Daniel Tihelka
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The corpus contains recordings of a male speaker, a native speaker of German talking in English. The sentences read by the speaker originate in the domain of air traffic control (ATC), specifically the messages used by aeroplane pilots during routine flight. The text in the corpus originates from transcripts of real recordings, part of which has been released in LINDAT/CLARIN (http://hdl.handle.net/11858/00-097C-0000-0001-CCA1-0), and the individual phrases were selected by the algorithm described in Jůzová, M. and Tihelka, D.: Minimum Text Corpus Selection for Limited Domain Speech Synthesis (DOI 10.1007/978-3-319-10816-2_48). The corpus was used to create a limited-domain speech synthesis system capable of simulating pilot communication with an ATC officer.

  3. Data from: Balinese Text-to-Speech Dataset as Digital Cultural Heritage

    • data.mendeley.com
    Updated Dec 30, 2024
    Cite
    I Gusti Agung Gede Arya Kadyanan (2024). Balinese Text-to-Speech Dataset as Digital Cultural Heritage [Dataset]. http://doi.org/10.17632/6syjwz24v5.2
    Explore at:
    Dataset updated
    Dec 30, 2024
    Authors
    I Gusti Agung Gede Arya Kadyanan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is a collection of audio recordings from native Balinese speakers. It consists of 1,187 recordings covering several speech levels of Balinese, such as Alus Singgih, Alus Mider, Andap, Mider, and Alus Sor. In addition, the dataset also records phrases and alphabets to provide wider linguistic variation. It is designed to support the development of various voice-based applications, including text-to-speech (TTS) systems, automatic speech recognition, and speech-to-text conversion, and it also supports research in natural language processing (NLP), especially for regional languages that still have minimal digital representation. The use of this dataset is expected to enrich voice-based technology and strengthen the presence of Balinese in the digital era. With this data, researchers and developers can create systems that support the preservation of regional languages as part of Indonesia's cultural heritage.

  4. Bengali Speech Recognition Dataset (BSRD)

    • kaggle.com
    Updated Jan 14, 2025
    Cite
    Shuvo Kumar Basak-4004 (2025). Bengali Speech Recognition Dataset (BSRD) [Dataset]. http://doi.org/10.34740/kaggle/dsv/10465455
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 14, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Shuvo Kumar Basak-4004
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The BengaliSpeechRecognitionDataset (BSRD) is a comprehensive dataset designed for the development and evaluation of Bengali speech recognition and text-to-speech systems. This dataset includes a collection of Bengali characters and their corresponding audio files, which are generated using speech synthesis models. It serves as an essential resource for researchers and developers working on automatic speech recognition (ASR) and text-to-speech (TTS) applications for the Bengali language.

    Key Features:
    • Bengali Characters: The dataset contains a wide range of Bengali characters, including consonants, vowels, and unique symbols used in the Bengali script, such as 'ক', 'খ', 'গ', and many more.
    • Corresponding Speech Data: For each Bengali character, an MP3 audio file is provided, containing the correct pronunciation of that character. This audio is generated by a Bengali text-to-speech model, ensuring clear and accurate pronunciation.
    • 1000 Audio Samples per Folder: Each character is associated with at least 1000 MP3 files. These multiple samples provide variations of the character's pronunciation, which is essential for training robust speech recognition systems.
    • Language and Phonetic Diversity: The dataset offers a phonetic diversity of Bengali sounds, covering different tones and pronunciations commonly found in spoken Bengali. This ensures that the dataset can be used for training models capable of recognizing diverse speech patterns.
    • Use Cases:
      o Automatic Speech Recognition (ASR): BSRD is ideal for training ASR systems, as it provides accurate audio samples linked to specific Bengali characters.
      o Text-to-Speech (TTS): Researchers can use this dataset to fine-tune TTS systems for generating natural Bengali speech from text.
      o Phonetic Analysis: The dataset can be used for phonetic analysis and for developing models that study the linguistic features of Bengali pronunciation.
    • Applications:
      o Voice Assistants: The dataset can be used to build and train voice recognition systems and personal assistants that understand Bengali.
      o Speech-to-Text Systems: BSRD can aid in developing accurate transcription systems for Bengali audio content.
      o Language Learning Tools: The dataset can help in creating educational tools aimed at teaching Bengali pronunciation.
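    To work with the folder layout described above (one folder of MP3 samples per character), a small indexing script such as the Python sketch below can build (audio path, character label) pairs. The root path and the assumption that each folder is named after its character are illustrative, not guaranteed by the dataset documentation.

    from pathlib import Path

    dataset_root = Path("BSRD")  # assumed local path to the extracted dataset

    # Collect (audio_path, character_label) pairs, assuming one sub-folder per character.
    samples = []
    for char_dir in sorted(p for p in dataset_root.iterdir() if p.is_dir()):
        label = char_dir.name  # assumes the folder is named after the Bengali character
        for mp3_path in sorted(char_dir.glob("*.mp3")):
            samples.append((mp3_path, label))

    labels = {label for _, label in samples}
    print(f"{len(samples)} audio files across {len(labels)} characters")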

    Note for researchers using the dataset:

    This dataset was created by Shuvo Kumar Basak. If you use this dataset for your research or academic purposes, please ensure to cite this dataset appropriately. If you have published your research using this dataset, please share a link to your paper. Good Luck.

  5. TIMIT-TTS: a Text-to-Speech Dataset for Synthetic Speech Detection

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 21, 2022
    Cite
    Stefano Tubaro (2022). TIMIT-TTS: a Text-to-Speech Dataset for Synthetic Speech Detection [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6560158
    Explore at:
    Dataset updated
    Sep 21, 2022
    Dataset provided by
    Brian Hosler
    Matthew C. Stamm
    Davide Salvi
    Paolo Bestagini
    Stefano Tubaro
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    With the rapid development of deep learning techniques, the generation and counterfeiting of multimedia material are becoming increasingly straightforward to perform. At the same time, sharing fake content on the web has become so simple that malicious users can create unpleasant situations with minimal effort. Forged media are also getting more and more complex, with manipulated videos (e.g., deepfakes, where both the visual and audio content can be counterfeited) overtaking still images. The multimedia forensic community has addressed the possible threats that this situation implies by developing detectors that verify the authenticity of multimedia objects. However, the vast majority of these tools analyze only one modality at a time. This was not a problem as long as still images were the most widely edited media, but now that manipulated videos are becoming customary, performing monomodal analyses can be reductive. Nonetheless, the literature lacks multimodal detectors (systems that consider both audio and video components). This is due to the difficulty of developing them, but also to the scarcity of datasets containing forged multimodal data on which to train and test the designed algorithms.

    In this paper we focus on the generation of an audio-visual deepfake dataset. First, we present a general pipeline for synthesizing speech deepfake content from a given real or fake video, facilitating the creation of counterfeit multimodal material. The proposed method uses Text-to-Speech (TTS) and Dynamic Time Warping (DTW) techniques to achieve realistic speech tracks. Then, we use the pipeline to generate and release TIMIT-TTS, a synthetic speech dataset containing the most cutting-edge methods in the TTS field. This can be used as a standalone audio dataset, or combined with DeepfakeTIMIT and VidTIMIT video datasets to perform multimodal research. Finally, we present numerous experiments to benchmark the proposed dataset in both monomodal (i.e., audio) and multimodal (i.e., audio and video) conditions. This highlights the need for multimodal forensic detectors and more multimodal deepfake data.
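    The pipeline described above relies on Dynamic Time Warping to align synthetic speech with the original track; the Python sketch below shows the general idea using librosa's DTW on MFCC features. It is a generic illustration under assumed file names, not the authors' implementation.

    import librosa

    # Hypothetical file names; both signals are resampled to 16 kHz for comparison.
    ref, sr = librosa.load("reference_video_audio.wav", sr=16000)
    syn, _ = librosa.load("synthetic_speech.wav", sr=16000)

    # MFCCs as the frame-level representation used for alignment.
    mfcc_ref = librosa.feature.mfcc(y=ref, sr=sr, n_mfcc=13)
    mfcc_syn = librosa.feature.mfcc(y=syn, sr=sr, n_mfcc=13)

    # DTW returns the accumulated cost matrix and the optimal warping path,
    # i.e. pairs of frame indices that align the two sequences in time.
    D, wp = librosa.sequence.dtw(X=mfcc_ref, Y=mfcc_syn, metric="cosine")
    print("alignment cost:", float(D[-1, -1]), "path length:", len(wp))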

    For the initial version of TIMIT-TTS (v1.0):

    arXiv: https://arxiv.org/abs/2209.08000

    TIMIT-TTS Database v1.0: https://zenodo.org/record/6560159

  6. Noisy reverberant speech database for training speech enhancement algorithms and TTS models

    • find.data.gov.scot
    • dtechtive.com
    txt, zip
    Updated Sep 14, 2017
    Cite
    University of Edinburgh (2017). Noisy reverberant speech database for training speech enhancement algorithms and TTS models [Dataset]. http://doi.org/10.7488/ds/2139
    Explore at:
    Available download formats: zip (2781.184 MB), zip (168.6 MB), zip (0.1187 MB), txt (0.0166 MB), zip (5629.952 MB)
    Dataset updated
    Sep 14, 2017
    Dataset provided by
    University of Edinburgh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    UNITED KINGDOM
    Description

    Noisy reverberant speech database. The database was designed to train and test speech enhancement (noise suppression and dereverberation) methods that operate at 48 kHz. Clean speech was made reverberant and noisy by convolving it with a room impulse response and then adding it to a noise signal that was also convolved with a room impulse response. The room impulse responses used to create this dataset were selected from:
    - The ACE challenge (http://www.commsp.ee.ic.ac.uk/~sap/projects/ace-challenge/)
    - The MIRD database (http://www.iks.rwth-aachen.de/en/research/tools-downloads/multichannel-impulse-response-database/)
    - The MARDY database (http://www.commsp.ee.ic.ac.uk/~sap/resources/mardy-multichannel-acoustic-reverberation-database-at-york-database/)
    The underlying clean speech data can be found at https://doi.org/10.7488/ds/2117. The speech-shaped and babble noise files that were used to create this dataset are available at http://homepages.inf.ed.ac.uk/cvbotinh/se/noises/.
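    The mixing recipe described above (speech and noise each convolved with a room impulse response, then summed) can be reproduced in outline with the Python sketch below. The file names and the target SNR are illustrative assumptions; the actual corpus was built with its own SNR levels and impulse-response selections.

    import numpy as np
    import soundfile as sf
    from scipy.signal import fftconvolve

    # Hypothetical 48 kHz mono inputs; the noise recording is assumed to be at least
    # as long as the clean utterance.
    clean, sr = sf.read("clean.wav")
    noise, _ = sf.read("noise.wav")
    rir_speech, _ = sf.read("rir_speech.wav")  # room impulse response applied to the speech
    rir_noise, _ = sf.read("rir_noise.wav")    # room impulse response applied to the noise

    reverberant_speech = fftconvolve(clean, rir_speech)[: len(clean)]
    reverberant_noise = fftconvolve(noise, rir_noise)[: len(clean)]

    # Scale the reverberant noise for an illustrative 5 dB SNR, then mix and save.
    target_snr_db = 5.0
    speech_power = np.mean(reverberant_speech ** 2)
    noise_power = np.mean(reverberant_noise ** 2) + 1e-12
    gain = np.sqrt(speech_power / (noise_power * 10 ** (target_snr_db / 10)))
    sf.write("noisy_reverberant.wav", reverberant_speech + gain * reverberant_noise, sr)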

  7. Scripted Monologues Speech Data | 65,000 Hours | Generative AI Audio Data | Speech Recognition Data | Machine Learning (ML) Data

    • datarade.ai
    Updated Dec 11, 2023
    Cite
    Nexdata (2023). Scripted Monologues Speech Data | 65,000 Hours | Generative AI Audio Data| Speech Recognition Data | Machine Learning (ML) Data [Dataset]. https://datarade.ai/data-products/nexdata-multilingual-read-speech-data-65-000-hours-aud-nexdata
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Dec 11, 2023
    Dataset authored and provided by
    Nexdata
    Area covered
    Chile, Puerto Rico, Italy, Luxembourg, Japan, France, Uruguay, Pakistan, Taiwan, Poland
    Description
    1. Specifications

    Format : 16kHz, 16bit, uncompressed wav, mono channel

    Recording environment : quiet indoor environment, without echo

    Recording content (read speech) : economy, entertainment, news, oral language, numbers, letters

    Speaker : native speaker, gender balance

    Device : Android mobile phone, iPhone

    Language : 100+ languages

    Transcription content : text, time point of speech data, 5 noise symbols, 5 special identifiers

    Accuracy rate : 95% (the accuracy rate of noise symbols and other identifiers is not included)

    Application scenarios : speech recognition, voiceprint recognition

    2. About Nexdata

    Nexdata owns off-the-shelf PB-level Large Language Model (LLM) Data, 1 million hours of Audio Data and 800 TB of Annotated Imagery Data. These ready-to-go Machine Learning (ML) Data support instant delivery and quickly improve the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/speechrecog?source=Datarade
  8. FOSD Male Speech Dataset

    • data.mendeley.com
    Updated Jul 31, 2020
    + more versions
    Cite
    Duc Chung Tran (2020). FOSD Male Speech Dataset [Dataset]. http://doi.org/10.17632/3zz6txz35t.7
    Explore at:
    Dataset updated
    Jul 31, 2020
    Authors
    Duc Chung Tran
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a male speech dataset based on FOSD (extracted from approximately 30 hours of FPT Open Speech Data, released publicly in 2018 by FPT Corporation under the FPT Public License), useful for creating text-to-speech models. It comprises 9,474 audio files totalling more than 10.5 hours of recordings. All files are in *.wav format (16 kHz sampling rate, 32-bit float, mono). This dataset is useful for various TTS-related applications.

    Copyright 2018 FPT Corporation Permission is hereby granted, free of charge, non-exclusive, worldwide, irrevocable, to any person obtaining a copy of this data or software and associated documentation files (the “Data or Software”), to deal in the Data or Software without restriction, including without limitation the rights to use, copy, modify, remix, transform, merge, build upon, publish, distribute and redistribute, sublicense, and/or sell copies of the Data or Software, for any purpose, even commercially, and to permit persons to whom the Data or Software is furnished to do so, subject to the following conditions: The above copyright notice, and this permission notice, and indication of any modification to the Data or Software, shall be included in all copies or substantial portions of the Data or Software.

    THE DATA OR SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE DATA OR SOFTWARE OR THE USE OR OTHER DEALINGS IN THE DATA OR SOFTWARE. Patent and trademark rights are not licensed under this FPT Public License.

  9. FESTCAT Catalan TTS baseline female speech database

    • catalog.elra.info
    • live.european-language-grid.eu
    Updated Jan 5, 2012
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2012). FESTCAT Catalan TTS baseline female speech database [Dataset]. https://catalog.elra.info/en-us/repository/browse/ELRA-S0331/
    Explore at:
    Dataset updated
    Jan 5, 2012
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    Description

    This database contains the recordings of one female Catalan professional speaker, recorded in a noise-reduced room simultaneously through a close-talk microphone, a mid-distance microphone and a laryngograph. It consists of the recordings and annotations of read text material amounting to approximately 10 hours of speech for baseline applications (text-to-speech systems). The FESTCAT Catalan TTS Baseline Female Speech Database was created within the scope of the FESTCAT project funded by the Catalan Government.

  10. A kiswahili Dataset for Development of Text-To-Speech System

    • data.mendeley.com
    Updated Nov 30, 2021
    Cite
    Kiptoo Rono (2021). A kiswahili Dataset for Development of Text-To-Speech System [Dataset]. http://doi.org/10.17632/vbvj6j6pm9.1
    Explore at:
    Dataset updated
    Nov 30, 2021
    Authors
    Kiptoo Rono
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains Kiswahili text and audio files: 7,108 text files and their corresponding audio files. It was created from open-source, non-copyrighted material, a Kiswahili audio Bible, whose authors permit use for non-profit, educational, and public benefit purposes. The downloaded audio files were longer than 12.5 s, so they were programmatically split into short clips based on silence and then recombined to random lengths such that each eventual audio file lies between 1 and 12.5 s. This was done using Python 3. The audio files were saved as single-channel, 16-bit PCM WAVE files with a sampling rate of 22.05 kHz. The dataset contains approximately 106,000 Kiswahili words, transcribed at a mean of 14.96 words per text file and saved in CSV format. Each text file is divided into three parts: a unique ID, the transcribed words, and the normalized words. The unique ID is a number assigned to each text file, the transcribed words are the text spoken by a reader, and the normalized texts are the expansion of abbreviations and numbers into full words. Each audio split was assigned the same unique ID as its text file.
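    The silence-based splitting and recombination described above can be approximated with pydub, as in the Python sketch below. The thresholds are illustrative assumptions, and the greedy packing is a simplified, deterministic stand-in for the random-length combining the authors describe.

    from pydub import AudioSegment
    from pydub.silence import split_on_silence

    audio = AudioSegment.from_file("chapter.mp3")  # hypothetical input file

    chunks = split_on_silence(
        audio,
        min_silence_len=400,             # ms of silence treated as a split point (assumed)
        silence_thresh=audio.dBFS - 16,  # threshold relative to average loudness (assumed)
        keep_silence=150,                # ms of padding kept around each chunk (assumed)
    )

    def export(segment, index):
        # 22.05 kHz, mono, 16-bit PCM WAV, as in the dataset description.
        out = segment.set_frame_rate(22050).set_channels(1).set_sample_width(2)
        out.export(f"clip_{index:05d}.wav", format="wav")

    clip, idx = AudioSegment.empty(), 0
    for chunk in chunks:
        # Keep each exported clip roughly within the 1-12.5 s range.
        if len(clip) + len(chunk) > 12_500 and len(clip) >= 1_000:
            export(clip, idx)
            clip, idx = AudioSegment.empty(), idx + 1
        clip += chunk
    if len(clip) >= 1_000:
        export(clip, idx)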

  11. UGSpeechData: A Multilingual Speech Dataset of Ghanaian Languages

    • scidb.cn
    Updated Mar 26, 2025
    Cite
    Wiafe, Isaac; Abdulai, Jamal-Deen; Ekpezu, Akon Obu; Helegah, Raynard Dodzi; Atsakpo, Elikem Doe; Nutrokpor, Charles; Winful, Fiifi Baffoe Payin; Solaga, Kafui Kwashia (2025). UGSpeechData: A Multilingual Speech Dataset of Ghanaian Languages [Dataset]. http://doi.org/10.57760/sciencedb.22298
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 26, 2025
    Dataset provided by
    Science Data Bank
    Authors
    Wiafe, Isaac; Abdulai, Jamal-Deen; Ekpezu, Akon Obu; Helegah, Raynard Dodzi; Atsakpo, Elikem Doe; Nutrokpor, Charles; Winful, Fiifi Baffoe Payin; Solaga, Kafui Kwashia
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Area covered
    Ghana
    Description

    UGSpeechData is a collection of audio speech data in Akan, Ewe, Dagaare, Dagbani, and Ikposo, which are among the most widely spoken languages in Ghana. The uploaded dataset contains a total of 970,148 audio files (5,384.28 hours) and 93,262 transcribed audio files (518 hours). The audio files are descriptions of 1,000 culturally relevant images collected from indigenous speakers of each of the languages, and each audio is between 15 and 30 seconds long. More specifically, the dataset contains five subfolders, one for each of the five languages; each language has at least 1,000 hours of speech data and 100 hours of transcribed speech data. Fig. 1 provides details of the transcribed audio corpus, including gender and recording environments for each language.

  12. English-Technical-Speech-Dataset

    • huggingface.co
    Updated Oct 26, 2024
    Cite
    Tejasva Maurya (2024). English-Technical-Speech-Dataset [Dataset]. https://huggingface.co/datasets/Tejasva-Maurya/English-Technical-Speech-Dataset
    Explore at:
    Dataset updated
    Oct 26, 2024
    Authors
    Tejasva Maurya
    Description

    English Technical Speech Dataset

      Overview
    

    The English Technical Speech Dataset is a curated collection of English technical vocabulary recordings, designed for applications like Text-to-Speech (TTS), Automatic Speech Recognition (ASR), and Audio Classification. The dataset includes 11,247 entries and provides audio files, transcriptions, and speaker embeddings to support the development of robust technical language models.

    Language: English (technical focus). Total… See the full description on the dataset page: https://huggingface.co/datasets/Tejasva-Maurya/English-Technical-Speech-Dataset.

  13. In-Cabin Speech Data | 15,000 Hours | AI Training Data | Speech Recognition Data | Audio Data | Natural Language Processing (NLP) Data

    • datarade.ai
    Updated Apr 23, 2024
    Cite
    Nexdata (2024). In-Cabin Speech Data | 15,000 Hours | AI Training Data | Speech Recognition Data | Audio Data |Natural Language Processing (NLP) Data [Dataset]. https://datarade.ai/data-products/nexdata-in-car-speech-data-15-000-hours-audio-ai-ml-t-nexdata
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Apr 23, 2024
    Dataset authored and provided by
    Nexdata
    Area covered
    Russian Federation, Romania, Poland, Austria, Switzerland, Egypt, Netherlands, Germany, Turkey, Argentina
    Description
    1. Specifications

    Format : Audio format: 48kHz, 16bit, uncompressed wav, mono channel; Video format: MP4

    Recording Environment : In-car; 1 quiet scene, 1 low-noise scene, 3 medium-noise scenes and 2 high-noise scenes

    Recording Content : Covers 5 fields: navigation, multimedia, telephone, car control, and question and answer; 500 sentences per person

    Speaker : Speakers are evenly distributed across all age groups, covering children, teenagers, middle-aged, elderly, etc.

    Device : High fidelity microphone; Binocular camera

    Language : 20 languages

    Transcription content : text

    Accuracy rate : 98%

    Application scenarios : speech recognition, Human-computer interaction; Natural language processing and text analysis; Visual content understanding, etc.

    2. About Nexdata

    Nexdata owns off-the-shelf PB-level Large Language Model (LLM) Data, 1 million hours of Audio Data and 800 TB of Annotated Imagery Data. These ready-to-go Natural Language Processing (NLP) Data support instant delivery and quickly improve the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/speechrecog?source=Datarade
  14. Data from: The "Mići Princ" text and speech dataset of Chakavian micro-dialects

    • live.european-language-grid.eu
    binary format
    Updated Mar 4, 2024
    Cite
    (2024). The "Mići Princ" text and speech dataset of Chakavian micro-dialects [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/23600
    Explore at:
    Available download formats: binary format
    Dataset updated
    Mar 4, 2024
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The Mići Princ "text and speech" dialectal dataset is a word-aligned version of the translation of The Little Prince into various Chakavian micro-dialects, released by the Udruga Calculus and the Peek&Poke museum (http://skupnikatalog.nsk.hr/Record/nsk.NSK01001103632), both in the form of a printed book and an audiobook.

    The printed book is a translation of Antoine de Saint-Exupéry's "Le Petit Prince". The translation was performed by Tea Perinčić and the following additional translators (almost every character in the book uses a different micro-dialect): Davina Ivašić, Annamaria Grus, Maria Luisa Ivašić, Marin Miš, Josip Fafanđel, Glorija Fabijanić Jelović, Vlasta Juretić, Anica Pritchard, Tea Rosić, Dino Marković, Ilinka Babić, Jadranka Ajvaz, Vlado Simičić Vava, Irena Grdinić, and Ivana Marinčić.

    The audio book has been read by Zoran Prodanović Prlja, Davina Ivašić, Josip Fafanđel, Melita Nilović, Glorija Fabijanić Jelović, Albert Sirotich, Tea Rosić, Tea Perinčić, Dino Marković, Iva Močibob, Dražen Turina Šajeta, Vlado Simčić Vava, Ilinka Babić, Melita and Svetozar Nilović, and Ivana Marinčić.

    The master encoding of this "text and speech" dataset is available in the form of JSON files (MP_13.json for the thirteenth chapter of the book), where the text, the turn-level alignment, and the word-level alignment to the audio are available. The text and alignment part of this master encoding is available from the MP.json.tgz archive, with the audio part located in the MP.wav.tgz archive.

    Besides this master encoding, an encoding focused on applications in automatic speech recognition (ASR) testing and adaptation is available as well. Chapters 13 and 15 have been selected as testing data, and the text and audio reference files MP_13.asr.json and MP_15.asr.json contain segments split by speaker turns. The remainder of the dataset has been prepared in segments of up to 20 seconds, ideal for training / fine-tuning current ASR systems. The text and audio reference data are available in the MP.asr.json.tgz archive, while the audio data are available as MP3 files in the MP.mp3.tgz archive.

    The dataset also includes an encoding for the Exmaralda speech editor (https://exmaralda.org), one file per chapter (MP_13.exb for the thirteenth chapter), available from the MP.exb.tgz archive. The wav files from the MP.wav.tgz archive are required if speech data are to be available inside Exmaralda.

    Speaker information is available in the speakers.json file, each speaker having a textual and wikidata reference to the location of the micro-dialect, as well as the name of the translator in the printed book and the reader in the audio book.

    An application of the dataset to fine-tuning the current (March 2024) SotA automatic speech recognition model for standard Croatian, whisper-large-v3 (https://huggingface.co/classla/whisper-large-v3-mici-princ), shows the word error rate dropping from 35.43% to 16.83%, and the character error rate dropping from 11.54% to 3.95% (on in-dataset test data, with two seen and two unseen speakers / micro-dialects).
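    For readers reproducing such evaluations, word and character error rates can be computed with the jiwer package, as in the minimal Python sketch below; the two strings are placeholders, not material from the dataset.

    import jiwer

    reference = "the little prince arrived on a very small planet"   # placeholder reference transcript
    hypothesis = "the little prince arrived on very small planet"    # placeholder ASR output

    wer = jiwer.wer(reference, hypothesis)
    cer = jiwer.cer(reference, hypothesis)
    print(f"WER: {wer:.2%}  CER: {cer:.2%}")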

  15. Speech Recognition Data Collection Services | 100+ Languages Resources | Audio Data | Speech Recognition Data | Machine Learning (ML) Data

    • datarade.ai
    Updated Dec 28, 2023
    Cite
    Nexdata (2023). Speech Recognition Data Collection Services | 100+ Languages Resources |Audio Data | Speech Recognition Data | Machine Learning (ML) Data [Dataset]. https://datarade.ai/data-products/nexdata-speech-recognition-data-collection-services-100-nexdata
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Dec 28, 2023
    Dataset authored and provided by
    Nexdata
    Area covered
    Austria, Estonia, Sri Lanka, Brazil, United Kingdom, Lithuania, Malaysia, El Salvador, Haiti, Cambodia
    Description
    1. Overview

    With extensive experience in speech recognition, Nexdata has a resource pool covering more than 50 countries and regions. Our linguist team works closely with clients to assist them with dictionary and text corpus construction, speech quality inspection, linguistics consulting, etc.

    2. Our Capacity

    - Global Resources: Global resources covering hundreds of languages worldwide

    - Compliance: All the Machine Learning (ML) Data are collected with proper authorization

    - Quality: Multiple rounds of quality inspection ensure high-quality data output

    - Secure Implementation: An NDA is signed to guarantee secure implementation, and Machine Learning (ML) Data are destroyed upon delivery.

    3. About Nexdata

    Nexdata is equipped with professional Machine Learning (ML) Data collection devices, tools and environments, as well as experienced project managers in data collection and quality control, so that we can meet data collection requirements in various scenarios and types. Please visit us at https://www.nexdata.ai/service/speech-recognition?source=Datarade
  16. TC-STAR Spanish Baseline Female Speech Database

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Dec 21, 2010
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2010). TC-STAR Spanish Baseline Female Speech Database [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0309/
    Explore at:
    Dataset updated
    Dec 21, 2010
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    Description

    The TC-STAR Spanish Baseline Female Speech Database was created within the scope of the TC-STAR project (IST-FP6-506738) funded by the European Commission. It contains the recordings of one female Spanish speaker recorded in a noise-reduced room simultaneously through a close-talk microphone, a mid-distance microphone and a laryngograph. It consists of the recordings and annotations of read text material amounting to approximately 10 hours of speech for baseline applications (text-to-speech systems). This database is distributed on 10 DVDs and complies with the common specifications created in the TC-STAR project.

    The annotation of the database includes manual orthographic transcriptions, automatic segmentation into phonemes and automatic generation of pitch marks; a certain percentage of the phonetic segments and pitch marks has been manually checked. A pronunciation lexicon in SAMPA with POS, lemma and phonetic transcription of all the words prompted and spoken is also provided.

    Speech samples are stored as sequences of 24-bit, 96 kHz (signed) integer samples with the least significant byte first (“lohi”, i.e. Intel/little-endian format). Each prompted utterance is stored in a separate file, and each signal file is accompanied by an ASCII SAM label file containing the relevant descriptive information. The TC-STAR Spanish Baseline Male Speech Database is also available via ELRA under reference ELRA-S0310.
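    Since the samples are raw 24-bit little-endian integers (a width NumPy has no native dtype for), decoding them requires assembling each sample from three bytes. The Python sketch below does this under the assumption that the signal files are headerless sample streams, as the description suggests; adjust if the files carry a header.

    import numpy as np

    def read_pcm24_le(path):
        """Read headerless 24-bit little-endian signed PCM into an int32 array."""
        raw = np.fromfile(path, dtype=np.uint8)
        raw = raw[: len(raw) - len(raw) % 3].reshape(-1, 3).astype(np.int32)
        samples = raw[:, 0] | (raw[:, 1] << 8) | (raw[:, 2] << 16)
        samples[samples >= 1 << 23] -= 1 << 24  # sign-extend 24-bit values to 32 bits
        return samples

    # Example (hypothetical file name): duration in seconds at the stated 96 kHz rate.
    # x = read_pcm24_le("utterance_0001.raw"); print(len(x) / 96000)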

  17. peoples_speech

    • huggingface.co
    + more versions
    Cite
    MLCommons, peoples_speech [Dataset]. https://huggingface.co/datasets/MLCommons/peoples_speech
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset authored and provided by
    MLCommons
    License

    Attribution 2.0 (CC BY 2.0): https://creativecommons.org/licenses/by/2.0/
    License information was derived automatically

    Description

    Dataset Card for People's Speech

      Dataset Summary
    

    The People's Speech Dataset is among the world's largest English speech recognition corpora available today that are licensed for academic and commercial usage, under CC-BY-SA and CC-BY 4.0. It includes 30,000+ hours of transcribed English speech from a diverse set of speakers. This open dataset is large enough to train speech-to-text systems and, crucially, is available with a permissive license.

      Supported Tasks… See the full description on the dataset page: https://huggingface.co/datasets/MLCommons/peoples_speech.
    
  18. Egyptian Arabic General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Egyptian Arabic General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-arabic-egypt
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Egyptian Arabic General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Arabic speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Egyptian Arabic communication.

    Curated by FutureBeeAI, this 40-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Arabic speech models that understand and respond to authentic Egyptian accents and dialects.

    Speech Data

    The dataset comprises 40 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Egyptian Arabic. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 80 verified native Egyptian Arabic speakers from FutureBeeAI’s contributor community.
    Regions: Representing various provinces of Egypt to ensure dialectal diversity and demographic balance.
    Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
    High transcription accuracy, achieved through double QA pass, average WER < 5%

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    Usage and Applications

    This dataset is a versatile resource for multiple Arabic speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for Egyptian Arabic.
    Voice Assistants: Build smart assistants capable of understanding natural Egyptian conversations.

  19. Travel Scripted Monologue Speech Data: Japanese (Japan)

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Travel Scripted Monologue Speech Data: Japanese (Japan) [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/travel-scripted-speech-monologues-japanese-japan
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    Japan
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Japanese Scripted Monologue Speech Dataset for the Travel Domain. This meticulously curated dataset is designed to advance the development of Japanese language speech recognition models, particularly for the Travel industry.

    Speech Data

    This training dataset comprises over 6,000 high-quality scripted prompt recordings in Japanese. These recordings cover various topics and scenarios relevant to the Travel domain, designed to build robust and accurate customer service speech technology.

    Participant Diversity:
    Speakers: 60 native Japanese speakers from different regions of Japan.
    Regions: Ensures a balanced representation of Japanese accents, dialects, and demographics.
    Participant Profile: Participants range from 18 to 70 years old, representing both males and females in a 60:40 ratio, respectively.
    Recording Details:
    Recording Nature: Audio recordings of scripted prompts/monologues.
    Audio Duration: Average duration of 5 to 30 seconds per recording.
    Formats: WAV format with mono channels, a bit depth of 16 bits, and sample rates of 8 kHz and 16 kHz.
    Environment: Recordings are conducted in quiet settings without background noise and echo.
    Topic Diversity: The dataset encompasses a wide array of topics and conversational scenarios to ensure comprehensive coverage of the Travel sector. Topics include:
    Customer Service Interactions
    Booking and Reservations
    Travel Inquiries
    Technical Support
    General Information and Advice
    Promotional and Sales Events
    Domain Specific Statements
    Other Elements: To enhance realism and utility, the scripted prompts incorporate various elements commonly encountered in Travel interactions:
    Names: Region-specific names of males and females in various formats.
    Addresses: Region-specific addresses in different spoken formats.
    Dates & Times: Inclusion of date and time in various travel contexts, such as booking dates, departure and arrival times.
    Destinations: Specific names of cities, countries, and tourist attractions relevant to the travel sector.
    Numbers & Prices: Various numbers and prices related to ticket costs, hotel rates, and transaction amounts.
    Booking IDs and Confirmation Numbers: Inclusion of booking identification and confirmation details for realistic customer service scenarios.

    Each scripted prompt is crafted to reflect real-life scenarios encountered in the Travel domain, ensuring applicability in training robust natural language processing and speech recognition models.

    Transcription Data

    In addition to high-quality audio recordings, the dataset includes meticulously prepared text files with verbatim transcriptions of each audio file. These transcriptions are essential for training accurate and robust speech recognition models.

    Content: Each text file contains the exact scripted prompt corresponding to its audio file, ensuring consistency.
    Format: Transcriptions are provided in plain text (.TXT) format, with files named to match their associated audio files for easy reference.

  20. Data from: LaFresCat: a Catalan multi-accent speech dataset for text-to-speech

    • zenodo.org
    application/gzip, txt
    Updated Feb 18, 2025
    + more versions
    Cite
    Zenodo (2025). LaFresCat: a Catalan multi-accent speech dataset for text-to-speech [Dataset]. http://doi.org/10.21437/iberspeech.2024-42
    Explore at:
    Available download formats: txt, application/gzip
    Dataset updated
    Feb 18, 2025
    Dataset provided by
    Zenodo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LaFresCat Multiaccent

    We present LaFresCat, the first Catalan multiaccented and multispeaker dataset.

    This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Commercial use is only possible through licensing by the voice artists. For further information, contact langtech@bsc.es and lafrescaproduccions@gmail.com.

    Dataset Details

    Dataset Description

    The audios in this dataset were created through professional studio recordings by professional voice actors at Lafresca Creative Studio. This is the raw version of the dataset: no resampling or trimming has been applied to the audio. Audio is stored in WAV format at a 48 kHz sampling rate.

    In total, there are 4 different accents, with 2 speakers per accent (one female and one male). After trimming, the dataset accumulates a total of 3.75 h, divided by speaker ID as follows:

    • Balear
      • olga -> 23.5 min
      • quim -> 30.93 min
    • Central
      • elia -> 33.14 min
      • grau -> 37.86 min
    • Occidental (North-Western)
      • emma -> 28.67 min
      • pere -> 25.12 min
    • Valencia
      • gina -> 22.25 min
      • lluc -> 23.58 min

    Uses

    This dataset is mainly intended for training text-to-speech and automatic speech recognition models for Catalan accents.

    Languages

    The dataset is in Catalan (ca-ES).

    Dataset Structure

    The dataset consists of 2858 audios and transcriptions in the following structure:

    lafresca_multiaccent_raw
    ├── balear
    │ ├── olga
    │ ├── olga.txt
    │ ├── quim
    │ └── quim.txt
    ├── central
    │ ├── elia
    │ ├── elia.txt
    │ ├── grau
    │ └── grau.txt
    ├── full_filelist.txt
    ├── occidental
    │ ├── emma
    │ ├── emma.txt
    │ ├── pere
    │ └── pere.txt
    └── valencia
    ├── gina
    ├── gina.txt
    ├── lluc
    └── lluc.txt

    Metadata for the dataset can be found in the file `full_filelist.txt`; each line represents one audio file and follows the format:

    audio_path | speaker_id | transcription

    The speaker ids have the following mapping:

    "quim": 0,
    "olga": 1,
    "grau": 2,
    "elia": 3,
    "pere": 4,
    "emma": 5,
    "lluc": 6,
    "gina": 7

    Dataset Creation

    This dataset has been created by members of the Language Technologies unit of the Life Sciences department at the Barcelona Supercomputing Center, except for the Valencian sentences, which were created with the support of Cenid, the Digital Intelligence Center of the University of Alicante. The voices belong to professional voice actors and were recorded at Lafresca Creative Studio.

    Source Data

    The data presented in this dataset is the source data.

    Data Collection and Processing

    These are the technical details of the data collection and processing:

    • Microphone: Austrian Audio oc818

    • Preamp: Focusrite ISA Two

    • Audio Interface: Antelope Orion 32+

    • DAW: ProTools 2023.6.0

    Processing:

    • Noise Gate: C1 Gate

    • Compression BF-76

    • De-Esser Renaissance

    • EQ Maag EQ2

    • EQ FabFilter Pro-Q3

    • Limiter: L1 Ultramaximizer

    Here's the information about the speakers:

    Dialect | Gender | County
    Central | male | Barcelonès
    Central | female | Barcelonès
    Balear | female | Pla de Mallorca
    Balear | male | Llevant
    Occidental | male | Baix Ebre
    Occidental | female | Baix Ebre
    Valencian | female | Ribera Alta
    Valencian | male | La Plana Baixa

    Who are the source data producers?

    The Language Technologies team from the Life Sciences department at the Barcelona Supercomputing Center developed this dataset. It features recordings by professional voice actors made at Lafresca Creative Studio.

    Annotations

    To check whether there were any errors in the transcriptions of the audio, we created a Label Studio space. In that space, we manually listened to a subset of the dataset and compared what we heard with the transcription. If the transcription was mistaken, we corrected it.

    Personal and Sensitive Information

    The dataset consists of professional voice actors who have recorded their voice. You agree to not attempt to determine the identity of speakers in this dataset.

    Bias, Risks, and Limitations

    Training a Text-to-Speech (TTS) model by fine-tuning with a Catalan speaker who speaks a particular dialect presents significant limitations. Mostly, the challenge is in capturing the full range of variability inherent in that accent. Each dialect has its own unique phonetic, intonational, and prosodic characteristics that can vary greatly even within a single linguistic region. Consequently, a TTS model trained on a narrow dialect sample will struggle to generalize across different accents and sub-dialects, leading to reduced accuracy and naturalness. Additionally, achieving a standard representation is exceedingly difficult because linguistic features can differ markedly not only between dialects but also among individual speakers within the same dialect group. These variations encompass subtle nuances in pronunciation, rhythm, and speech patterns that are challenging to standardize in a model trained on a limited dataset.

    Funding

    This work has been promoted and financed by the Generalitat de Catalunya through the Aina project, in addition the Valencian sentences have been created within the framework of the NEL-VIVES project 2022/TL22/00215334.

    Dataset Card Contact

    langtech@bsc.es
