100+ datasets found

Z
Data from: SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural...
data.niaid.nih.gov
zenodo.org
Updated Mar 7, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Tsiakoulis, Pirros (2025). SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7119399
Explore at:
Dataset updated
Mar 7, 2025
Dataset provided by
Vioni, Alexandra
Nikitaras, Karolos
Chalamandaris, Aimilios
Ellinas, Nikolaos
Maniati, Georgia
Klapsas, Konstantinos
Sung, June Sig
Tsiakoulis, Pirros
Jho, Gunu
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
This is the public release of the Samsung Open Mean Opinion Scores (SOMOS) dataset for the evaluation of neural text-to-speech (TTS) synthesis, which consists of audio files generated with a public domain voice from trained TTS models based on bibliography, and numbers assigned to each audio as quality (naturalness) evaluations by several crowdsourced listeners.DescriptionThe SOMOS dataset contains 20,000 synthetic utterances (wavs), 100 natural utterances and 374,955 naturalness evaluations (human-assigned scores in the range 1-5). The synthetic utterances are single-speaker, generated by training several Tacotron-like acoustic models and an LPCNet vocoder on the LJ Speech voice public dataset. 2,000 text sentences were synthesized, selected from Blizzard Challenge texts of years 2007-2016, the LJ Speech corpus as well as Wikipedia and general domain data from the Internet.Naturalness evaluations were collected via crowdsourcing a listening test on Amazon Mechanical Turk in the US, GB and CA locales. The records of listening test participants (workers) are fully anonymized. Statistics on the reliability of the scores assigned by the workers are also included, generated through processing the scores and validation controls per submission page.

To listen to audio samples of the dataset, please see our Github page.

The dataset release comes with a carefully designed train-validation-test split (70%-15%-15%) with unseen systems, listeners and texts, which can be used for experimentation on MOS prediction.

This version also contains the necessary resources to obtain the transcripts corresponding to all dataset audios.

Terms of use

The dataset may be used for research purposes only, for non-commercial purposes only, and may be distributed with the same terms.

Every time you produce research that has used this dataset, please cite the dataset appropriately.

Cite as:

@inproceedings{maniati22_interspeech, author={Georgia Maniati and Alexandra Vioni and Nikolaos Ellinas and Karolos Nikitaras and Konstantinos Klapsas and June Sig Sung and Gunu Jho and Aimilios Chalamandaris and Pirros Tsiakoulis}, title={{SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis}}, year=2022, booktitle={Proc. Interspeech 2022}, pages={2388--2392}, doi={10.21437/Interspeech.2022-10922} }

References of resources & models used

Voice & synthesized texts:K. Ito and L. Johnson, “The LJ Speech Dataset,” https://keithito.com/LJ-Speech-Dataset/, 2017.

Vocoder:J.-M. Valin and J. Skoglund, “LPCNet: Improving neural speech synthesis through linear prediction,” in Proc. ICASSP, 2019.R. Vipperla, S. Park, K. Choo, S. Ishtiaq, K. Min, S. Bhattacharya, A. Mehrotra, A. G. C. P. Ramos, and N. D. Lane, “Bunched lpcnet: Vocoder for low-cost neural text-to-speech systems,” in Proc. Interspeech, 2020.

Acoustic models:N. Ellinas, G. Vamvoukakis, K. Markopoulos, A. Chalamandaris, G. Maniati, P. Kakoulidis, S. Raptis, J. S. Sung, H. Park, and P. Tsiakoulis, “High quality streaming speech synthesis with low, sentence-length-independent latency,” in Proc. Interspeech, 2020.Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio et al., “Tacotron: Towards End-to-End Speech Synthesis,” in Proc. Interspeech, 2017.J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerrv-Ryan et al., “Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions,” in Proc. ICASSP, 2018.J. Shen, Y. Jia, M. Chrzanowski, Y. Zhang, I. Elias, H. Zen, and Y. Wu, “Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling,” arXiv preprint arXiv:2010.04301, 2020.M. Honnibal and M. Johnson, “An Improved Non-monotonic Transition System for Dependency Parsing,” in Proc. EMNLP, 2015.M. Dominguez, P. L. Rohrer, and J. Soler-Company, “PyToBI: A Toolkit for ToBI Labeling Under Python,” in Proc. Interspeech, 2019.Y. Zou, S. Liu, X. Yin, H. Lin, C. Wang, H. Zhang, and Z. Ma, “Fine-grained prosody modeling in neural speech synthesis using ToBI representation,” in Proc. Interspeech, 2021.K. Klapsas, N. Ellinas, J. S. Sung, H. Park, and S. Raptis, “WordLevel Style Control for Expressive, Non-attentive Speech Synthesis,” in Proc. SPECOM, 2021.T. Raitio, R. Rasipuram, and D. Castellani, “Controllable neural text-to-speech synthesis using intuitive prosodic features,” in Proc. Interspeech, 2020.

Synthesized texts from the Blizzard Challenges 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2016:M. Fraser and S. King, "The Blizzard Challenge 2007," in Proc. SSW6, 2007.V. Karaiskos, S. King, R. A. Clark, and C. Mayo, "The Blizzard Challenge 2008," in Proc. Blizzard Challenge Workshop, 2008.A. W. Black, S. King, and K. Tokuda, "The Blizzard Challenge 2009," in Proc. Blizzard Challenge, 2009.S. King and V. Karaiskos, "The Blizzard Challenge 2010," 2010.S. King and V. Karaiskos, "The Blizzard Challenge 2011," 2011.S. King and V. Karaiskos, "The Blizzard Challenge 2012," 2012.S. King and V. Karaiskos, "The Blizzard Challenge 2013," 2013.S. King and V. Karaiskos, "The Blizzard Challenge 2016," 2016.

Contact

Alexandra Vioni - a.vioni@samsung.com

If you have any questions or comments about the dataset, please feel free to write to us.

We are interested in knowing if you find our dataset useful! If you use our dataset, please email us and tell us about your research.
l
Data from: English TTS speech corpus of air traffic (pilot) messages -...
lindat.cz
live.european-language-grid.eu
Updated Dec 11, 2015
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jindřich Matoušek; Daniel Tihelka (2015). English TTS speech corpus of air traffic (pilot) messages - German accent [Dataset]. https://lindat.cz/repository/xmlui/handle/11234/1-1588
Explore at:
Dataset updated
Dec 11, 2015
Authors
Jindřich Matoušek; Daniel Tihelka
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
The corpus contains recordings of male speaker, native in German, talking in English. The sentences that were read by the speaker originate in the domain of air traffic control (ATC), specifically the messages used by plane pilots during routine flight. The text in the corpus originates from the transcripts of the real recordings, part of which has been released in LINDAT/CLARIN (http://hdl.handle.net/11858/00-097C-0000-0001-CCA1-0), and individual phrases were selected by special algorithm described in Jůzová, M. and Tihelka, D.: Minimum Text Corpus Selection for Limited Domain Speech Synthesis (DOI 10.1007/978-3-319-10816-2_48). The corpus was used to create a limited domain speech synthesis system capable of simulating a pilot communication with an ATC officer.
m
Data from: Balinese Text-to-Speech Dataset as Digital Cultural Heritage
data.mendeley.com
Updated Dec 30, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
I Gusti Agung Gede Arya Kadyanan (2024). Balinese Text-to-Speech Dataset as Digital Cultural Heritage [Dataset]. http://doi.org/10.17632/6syjwz24v5.2
Explore at:
Unique identifier
https://doi.org/10.17632/6syjwz24v5.2
Dataset updated
Dec 30, 2024
Authors
I Gusti Agung Gede Arya Kadyanan
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is a collection of audio recordings from native Balinese speakers. This dataset consists of 1187 recordings covering various levels of Balinese, such as Alus Singgih, Alus Mider, Andap, Mider, and Alus Sor. In addition, this dataset also records phrases and alphabets to provide a wider linguistic variation. This dataset is designed to support the development of various voice-based applications, including Text-to-Speech (TTS) systems, automatic speech recognition, and speech-to-text conversion. This dataset also supports research in the field of natural language processing (NLP), especially for regional languages that still have minimal digital representation. The use of this dataset is expected to enrich voice-based technology and strengthen the existence of Balinese in the digital era. With this data, researchers and developers can create systems that support the preservation of regional languages as part of Indonesia's cultural heritage.
Bengali Speech Recognition Dataset (BSRD)
kaggle.com
Updated Jan 14, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shuvo Kumar Basak-4004 (2025). Bengali Speech Recognition Dataset (BSRD) [Dataset]. http://doi.org/10.34740/kaggle/dsv/10465455
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.34740/kaggle/dsv/10465455
Dataset updated
Jan 14, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Shuvo Kumar Basak-4004
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
The BengaliSpeechRecognitionDataset (BSRD) is a comprehensive dataset designed for the development and evaluation of Bengali speech recognition and text-to-speech systems. This dataset includes a collection of Bengali characters and their corresponding audio files, which are generated using speech synthesis models. It serves as an essential resource for researchers and developers working on automatic speech recognition (ASR) and text-to-speech (TTS) applications for the Bengali language. Key Features: • Bengali Characters: The dataset contains a wide range of Bengali characters, including consonants, vowels, and unique symbols used in the Bengali script. This includes standard characters such as 'ক', 'খ', 'গ', and many more. • Corresponding Speech Data: For each Bengali character, an MP3 audio file is provided, which contains the correct pronunciation of that character. This audio is generated by a Bengali text-to-speech model, ensuring clear and accurate pronunciation. • 1000 Audio Samples per Folder: Each character is associated with at least 1000 MP3 files. These multiple samples provide variations of the character's pronunciation, which is essential for training robust speech recognition systems. • Language and Phonetic Diversity: The dataset offers a phonetic diversity of Bengali sounds, covering different tones and pronunciations commonly found in spoken Bengali. This ensures that the dataset can be used for training models capable of recognizing diverse speech patterns. • Use Cases: o Automatic Speech Recognition (ASR): BSRD is ideal for training ASR systems, as it provides accurate audio samples linked to specific Bengali characters. o Text-to-Speech (TTS): Researchers can use this dataset to fine-tune TTS systems for generating natural Bengali speech from text. o Phonetic Analysis: The dataset can be used for phonetic analysis and developing models that study the linguistic features of Bengali pronunciation. • Applications: o Voice Assistants: The dataset can be used to build and train voice recognition systems and personal assistants that understand Bengali. o Speech-to-Text Systems: BSRD can aid in developing accurate transcription systems for Bengali audio content. o Language Learning Tools: The dataset can help in creating educational tools aimed at teaching Bengali pronunciation.

…………………………………..Note for Researchers Using the dataset………………………………………………………………………

This dataset was created by Shuvo Kumar Basak. If you use this dataset for your research or academic purposes, please ensure to cite this dataset appropriately. If you have published your research using this dataset, please share a link to your paper. Good Luck.
Z
TIMIT-TTS: a Text-to-Speech Dataset for Synthetic Speech Detection
data.niaid.nih.gov
zenodo.org
Updated Sep 21, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Stefano Tubaro (2022). TIMIT-TTS: a Text-to-Speech Dataset for Synthetic Speech Detection [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6560158
Explore at:
Dataset updated
Sep 21, 2022
Dataset provided by
Matthew C. Stamm
Davide Salvi
Brian Hosler
Paolo Bestagini
Stefano Tubaro
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
With the rapid development of deep learning techniques, the generation and counterfeiting of multimedia material are becoming increasingly straightforward to perform. At the same time, sharing fake content on the web has become so simple that malicious users can create unpleasant situations with minimal effort. Also, forged media are getting more and more complex, with manipulated videos (e.g., deepfakes where both the visual and audio contents can be counterfeited) that are taking the scene over still images. The multimedia forensic community has addressed the possible threats that this situation could imply by developing detectors that verify the authenticity of multimedia objects. However, the vast majority of these tools only analyze one modality at a time. This was not a problem as long as still images were considered the most widely edited media, but now, since manipulated videos are becoming customary, performing monomodal analyses could be reductive. Nonetheless, there is a lack in the literature regarding multimodal detectors (systems that consider both audio and video components). This is due to the difficulty of developing them but also to the scarsity of datasets containing forged multimodal data to train and test the designed algorithms.

In this paper we focus on the generation of an audio-visual deepfake dataset. First, we present a general pipeline for synthesizing speech deepfake content from a given real or fake video, facilitating the creation of counterfeit multimodal material. The proposed method uses Text-to-Speech (TTS) and Dynamic Time Warping (DTW) techniques to achieve realistic speech tracks. Then, we use the pipeline to generate and release TIMIT-TTS, a synthetic speech dataset containing the most cutting-edge methods in the TTS field. This can be used as a standalone audio dataset, or combined with DeepfakeTIMIT and VidTIMIT video datasets to perform multimodal research. Finally, we present numerous experiments to benchmark the proposed dataset in both monomodal (i.e., audio) and multimodal (i.e., audio and video) conditions. This highlights the need for multimodal forensic detectors and more multimodal deepfake data.

For the initial version of TIMIT-TTS v1.0

Arxiv: https://arxiv.org/abs/2209.08000

TIMIT-TTS Database v1.0: https://zenodo.org/record/6560159
E
Noisy reverberant speech database for training speech enhancement algorithms...
find.data.gov.scot
dtechtive.com
txt, zip
Updated Sep 14, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
University of Edinburgh (2017). Noisy reverberant speech database for training speech enhancement algorithms and TTS models [Dataset]. http://doi.org/10.7488/ds/2139
Explore at:
zip(2781.184 MB), zip(168.6 MB), zip(0.1187 MB), txt(0.0166 MB), zip(5629.952 MB)Available download formats
Unique identifier
https://doi.org/10.7488/ds/2139
Dataset updated
Sep 14, 2017
Dataset provided by
University of Edinburgh
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
UNITED KINGDOM
Description
Noisy reverberant speech database. The database was designed to train and test speech enhancement (noise suppression and dereverberation) methods that operate at 48kHz. Clean speech was made reverberant and noisy by convolving it with a room impulse response and then adding it to a noisy signal that was also convolved with a room impulse response. The room impulse responses used to create this dataset were selected from: - The ACE challenge (http://www.commsp.ee.ic.ac.uk/~sap/projects/ace-challenge/) - The MIRD database (http://www.iks.rwth-aachen.de/en/research/tools-downloads/multichannel-impulse-response-database/) - The MARDY database (http://www.commsp.ee.ic.ac.uk/~sap/resources/mardy-multichannel-acoustic-reverberation-database-at-york-database/) The underlying clean speech data can be found in: https://doi.org/10.7488/ds/2117 . The speech-shaped and babble noise files that were used to create this dataset are available here: http://homepages.inf.ed.ac.uk/cvbotinh/se/noises/.
Scripted Monologues Speech Data | 65,000 Hours | Generative AI Audio Data|...
datarade.ai
Updated Dec 11, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nexdata (2023). Scripted Monologues Speech Data | 65,000 Hours | Generative AI Audio Data| Speech Recognition Data | Machine Learning (ML) Data [Dataset]. https://datarade.ai/data-products/nexdata-multilingual-read-speech-data-65-000-hours-aud-nexdata
Explore at:
.bin, .json, .xml, .csv, .xls, .sql, .txtAvailable download formats
Dataset updated
Dec 11, 2023
Dataset authored and provided by
Nexdata
Area covered
Chile, Uruguay, Luxembourg, Taiwan, France, Poland, Japan, Italy, Pakistan, Puerto Rico
Description
Specifications Format : 16kHz, 16bit, uncompressed wav, mono channel

Recording environment : quiet indoor environment, without echo

Recording content (read speech) : economy, entertainment, news, oral language, numbers, letters

Speaker : native speaker, gender balance

Device : Android mobile phone, iPhone

Language : 100+ languages

Transcription content : text, time point of speech data, 5 noise symbols, 5 special identifiers

Accuracy rate : 95% (the accuracy rate of noise symbols and other identifiers is not included)

Application scenarios : speech recognition, voiceprint recognition

About Nexdata Nexdata owns off-the-shelf PB-level Large Language Model(LLM) Data, 1 million hours of Audio Data and 800TB of Annotated Imagery Data. These ready-to-go Machine Learning (ML) Data support instant delivery, quickly improve the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/speechrecog?source=Datarade
m
FOSD Male Speech Dataset
data.mendeley.com
Updated Jul 31, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Duc Chung Tran (2020). FOSD Male Speech Dataset [Dataset]. http://doi.org/10.17632/3zz6txz35t.7
Explore at:
Unique identifier
https://doi.org/10.17632/3zz6txz35t.7
Dataset updated
Jul 31, 2020
Authors
Duc Chung Tran
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is FOSD-based (extracted from approximately 30-hour of FPT Open Speech Data, released publicly in 2018 by FPT Corporation, under FPT Public License) Male Speech Dataset which is useful for creating text-to-speech model. It comprises of 9474 audio files totalling more than 10.5 recording hours. All files are in *.wav format (16 kHz sampling rate, 32-bit float, mono). This dataset is useful for various TTS-related applications.

Copyright 2018 FPT Corporation Permission is hereby granted, free of charge, non-exclusive, worldwide, irrevocable, to any person obtaining a copy of this data or software and associated documentation files (the “Data or Software”), to deal in the Data or Software without restriction, including without limitation the rights to use, copy, modify, remix, transform, merge, build upon, publish, distribute and redistribute, sublicense, and/or sell copies of the Data or Software, for any purpose, even commercially, and to permit persons to whom the Data or Software is furnished to do so, subject to the following conditions: The above copyright notice, and this permission notice, and indication of any modification to the Data or Software, shall be included in all copies or substantial portions of the Data or Software.

THE DATA OR SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE DATA OR SOFTWARE OR THE USE OR OTHER DEALINGS IN THE DATA OR SOFTWARE. Patent and trademark rights are not licensed under this FPT Public License.
e
FESTCAT Catalan TTS baseline female speech database
catalog.elra.info
live.european-language-grid.eu
Updated Jan 5, 2012
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2012). FESTCAT Catalan TTS baseline female speech database [Dataset]. https://catalog.elra.info/en-us/repository/browse/ELRA-S0331/
Explore at:
Dataset updated
Jan 5, 2012
Dataset provided by
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
ELRA (European Language Resources Association)
License
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdfhttps://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdfhttps://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
Description
This database contains the recordings of one female Catalan professional speaker recorded in a noise-reduced room simultaneously through a close talk microphone, a mid distance microphone and a laryngograph signal. It consists in the recordings and annotations of read text material of approximately 10 hours of speech for baseline applications (Text-to-Speech systems). The FESTCAT Catalan TTS Baseline Female Speech Database was created within the scope of the FESTCAT project funded by the Catalan Government.
m
A kiswahili Dataset for Development of Text-To-Speech System
data.mendeley.com
Updated Nov 30, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kiptoo Rono (2021). A kiswahili Dataset for Development of Text-To-Speech System [Dataset]. http://doi.org/10.17632/vbvj6j6pm9.1
Explore at:
Unique identifier
https://doi.org/10.17632/vbvj6j6pm9.1
Dataset updated
Nov 30, 2021
Authors
Kiptoo Rono
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset contains Kiswahili text and audio files. The dataset contains 7,108 text files and audio files. The Kiswahili dataset was created from an open-source non-copyrighted material: Kiswahili audio Bible. The authors permit use for non-profit, educational, and public benefit purposes. The downloaded audio files length was more than 12.5s. Therefore, the audio files were programmatically split into short audio clips based on silence. They were then combined based on a random length such that each eventual audio file lies between 1 to 12.5s. This was done using python 3. The audio files were saved as a single channel,16 PCM WAVE file with a sampling rate of 22.05 kHz The dataset contains approximately 106,000 Kiswahili words. The words were then transcribed into mean words of 14.96 per text file and saved in CSV format. Each text file was divided into three parts: unique ID, transcribed words, and normalized words. A unique ID is a number assigned to each text file. The transcribed words are the text spoken by a reader. Normalized texts are the expansion of abbreviations and numbers into full words. An audio file split was assigned a unique ID, the same as the text file.
S
UGSpeechData: A Multilingual Speech Dataset of Ghanaian Languages
scidb.cn
Updated Mar 26, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Wiafe, Isaac; Abdulai, Jamal-Deen; Ekpezu, Akon Obu; Helegah, Raynard Dodzi; Atsakpo, Elikem Doe; Nutrokpor, Charles; Winful, Fiifi Baffoe Payin; Solaga, Kafui Kwashia (2025). UGSpeechData: A Multilingual Speech Dataset of Ghanaian Languages [Dataset]. http://doi.org/10.57760/sciencedb.22298
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.57760/sciencedb.22298
Dataset updated
Mar 26, 2025
Dataset provided by
Science Data Bank
Authors
Wiafe, Isaac; Abdulai, Jamal-Deen; Ekpezu, Akon Obu; Helegah, Raynard Dodzi; Atsakpo, Elikem Doe; Nutrokpor, Charles; Winful, Fiifi Baffoe Payin; Solaga, Kafui Kwashia
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Area covered
Ghana
Description
The UGSpeechData is a collection of audio speech data of Akan, Ewe, Dagaare, Dagbani, and Ikposo. These languages are among the most spoken languages in Ghana. The uploaded dataset contains a total of 970148 audio files (5384.28 hours) and 93262 transcribed audio files (518 hours). The audio files are descriptions of 1000 culturally relevant images collected from indigenous speakers of each of the languages. Each audio is between 15 to 30 seconds long. More specifically, the dataset contains five subfolders for each of the five respective languages. Each language has at least 1000 hours of speech data and 100 hours of transcribed speech data. Fig. 1 provides details of the transcribed audio corpus, including gender and recording environments for each language.Fig. 1. Details of transcribed audio files
h
English-Technical-Speech-Dataset
huggingface.co
Updated Oct 26, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Tejasva Maurya (2024). English-Technical-Speech-Dataset [Dataset]. https://huggingface.co/datasets/Tejasva-Maurya/English-Technical-Speech-Dataset
Explore at:
Dataset updated
Oct 26, 2024
Authors
Tejasva Maurya
Description
English Technical Speech Dataset

Overview

The English Technical Speech Dataset is a curated collection of English technical vocabulary recordings, designed for applications like Text-to-Speech (TTS), Automatic Speech Recognition (ASR), and Audio Classification. The dataset includes 11,247 entries and provides audio files, transcriptions, and speaker embeddings to support the development of robust technical language models.

Language: English (technical focus) Total… See the full description on the dataset page: https://huggingface.co/datasets/Tejasva-Maurya/English-Technical-Speech-Dataset.
In-Cabin Speech Data | 15,000 Hours | AI Training Data | Speech Recognition...
datarade.ai
Updated Apr 23, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nexdata (2024). In-Cabin Speech Data | 15,000 Hours | AI Training Data | Speech Recognition Data | Audio Data |Natural Language Processing (NLP) Data [Dataset]. https://datarade.ai/data-products/nexdata-in-car-speech-data-15-000-hours-audio-ai-ml-t-nexdata
Explore at:
.bin, .json, .xml, .csv, .xls, .sql, .txtAvailable download formats
Dataset updated
Apr 23, 2024
Dataset authored and provided by
Nexdata
Area covered
Russian Federation, Poland, Turkey, Switzerland, Netherlands, Germany, Austria, Romania, Egypt, Argentina
Description
Specifications Format : Audio format: 48kHz, 16bit, uncompressed wav, mono channel; Vedio format: MP4

Recording Environment : In-car;1 quiet scene, 1 low noise scene, 3 medium noise scenes and 2 high noise scenes

Recording Content : It covers 5 fields: navigation field, multimedia field, telephone field, car control field and question and answer field; 500 sentences per people

Speaker : Speakers are evenly distributed across all age groups, covering children, teenagers, middle-aged, elderly, etc.

Device : High fidelity microphone; Binocular camera

Language : 20 languages

Transcription content : text

Accuracy rate : 98%

Application scenarios : speech recognition, Human-computer interaction; Natural language processing and text analysis; Visual content understanding, etc.

About Nexdata Nexdata owns off-the-shelf PB-level Large Language Model(LLM) Data, 1 million hours of Audio Data and 800TB of Annotated Imagery Data. These ready-to-go Natural Language Processing (NLP) Data support instant delivery, quickly improve the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/speechrecog?source=Datarade
E
Data from: The "Mići Princ" text and speech dataset of Chakavian...
live.european-language-grid.eu
binary format
Updated Mar 4, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). The "Mići Princ" text and speech dataset of Chakavian micro-dialects [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/23600
Explore at:
binary formatAvailable download formats
Dataset updated
Mar 4, 2024
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
The Mići Princ "text and speech" dialectal dataset is a word-aligned version of the translation of The Little Prince into various Chakavian micro-dialects, released by the Udruga Calculus and the Peek&Poke museum (http://skupnikatalog.nsk.hr/Record/nsk.NSK01001103632), both in form of a printed book and an audio book.

The printed book is a translation of Antoine de Saint-Exupéry's "Le Petit Prince". The translation was performed by Tea Perinčić and the following additional translators (almost every character in the book uses a different micro-dialect): Davina Ivašić, Annamaria Grus, Maria Luisa Ivašić, Marin Miš, Josip Fafanđel, Glorija Fabijanić Jelović, Vlasta Juretić, Anica Pritchard, Tea Rosić, Dino Marković, Ilinka Babić, Jadranka Ajvaz, Vlado Simičić Vava, Irena Grdinić, and Ivana Marinčić.

The audio book has been read by Zoran Prodanović Prlja, Davina Ivašić, Josip Fafanđel, Melita Nilović, Glorija Fabijanić Jelović, Albert Sirotich, Tea Rosić, Tea Perinčić, Dino Marković, Iva Močibob, Dražen Turina Šajeta, Vlado Simčić Vava, Ilinka Babić, Melita and Svetozar Nilović, and Ivana Marinčić.

The master encoding of this "text and speech" dataset is available in form of json files (MP_13.json for the thirteenth chapter of the book), where the text, the turn-level alignment, and the word-level alignment to the audio are available. This master encoding is available from the MP.json.tgz archive for the text and alignment part, with the audio part of the master encoding located in the MP.wav.tgz archive.

Besides this master encoding, an encoding focused on applications in automatic speech recognition (ASR) testing and adaptation, is available as well. Chapters 13 and 15 have been selected as testing data, and the text and audio reference files MP_13.asr.json and MP_15.asr.json contain segments split by speaker turns. The remainder of the dataset has been prepared in segments of length up to 20 seconds, ideal for training / fine-tuning current ASR systems. The text and audio reference data are available in the MP.asr.json.tgz archive, while the audio data are available in form of MP3 files in the MP.mp3.tgz archive.

The dataset also includes an encoding for the Exmaralda speech editor (https://exmaralda.org), one file per chapter (MP_13.exb for the thirteenth chapter), available from the MP.exb.tgz archive. The wav files from the MP.wav.tgz archive are required if speech data are to be available inside Exmaralda.

Speaker information is available in the speakers.json file, each speaker having a textual and wikidata reference to the location of the micro-dialect, as well as the name of the translator in the printed book and the reader in the audio book.

An application of the dataset on fine-tuning the current (March 2024) SotA automatic speech recognition model for standard Croatian, whisper-v3-large (https://huggingface.co/classla/whisper-large-v3-mici-princ), shows for word error rate to drop from 35.43% to 16.83%, and the character error rate to drop from 11.54% to 3.95% (in-dataset test data, two seen speakers / micro-dialects, two unseen).
Speech Recognition Data Collection Services | 100+ Languages Resources...
datarade.ai
Updated Dec 28, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nexdata (2023). Speech Recognition Data Collection Services | 100+ Languages Resources |Audio Data | Speech Recognition Data | Machine Learning (ML) Data [Dataset]. https://datarade.ai/data-products/nexdata-speech-recognition-data-collection-services-100-nexdata
Explore at:
.bin, .json, .xml, .csv, .xls, .sql, .txtAvailable download formats
Dataset updated
Dec 28, 2023
Dataset authored and provided by
Nexdata
Area covered
Brazil, Cambodia, Malaysia, Estonia, Sri Lanka, Austria, United Kingdom, Lithuania, El Salvador, Haiti
Description
Overview With extensive experience in speech recognition, Nexdata has resource pool covering more than 50 countries and regions. Our linguist team works closely with clients to assist them with dictionary and text corpus construction, speech quality inspection, linguistics consulting and etc.

Our Capacity -Global Resources: Global resources covering hundreds of languages worldwide

-Compliance: All the Machine Learning (ML) Data are collected with proper authorization -Quality: Multiple rounds of quality inspections ensures high quality data output

-Secure Implementation: NDA is signed to gurantee secure implementation and Machine Learning (ML) Data is destroyed upon delivery.

About Nexdata Nexdata is equipped with professional Machine Learning (ML) Data collection devices, tools and environments, as well as experienced project managers in data collection and quality control, so that we can meet the data collection requirements in various scenarios and types. Please visit us at https://www.nexdata.ai/service/speech-recognition?source=Datarade
E
TC-STAR Spanish Baseline Female Speech Database
catalogue.elra.info
live.european-language-grid.eu
Updated Dec 21, 2010
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2010). TC-STAR Spanish Baseline Female Speech Database [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0309/
Explore at:
Dataset updated
Dec 21, 2010
Dataset provided by
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
ELRA (European Language Resources Association)
License
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdfhttps://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdfhttps://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
Description
The TC-STAR Spanish Baseline Female Speech Database was created within the scope of the TC-STAR project (IST- FP6-506738) funded by the European Commission.It contains the recordings of one female Spanish speaker recorded in a noise-reduced room simultaneously through a close talk microphone, a mid distance microphone and a laryngograph signal. It consists of the recordings and annotations of read text material of approximately 10 hours of speech for baseline applications (Text-to-Speech systems). This database is distributed on 10 DVDs. The database complies with the common specifications created in the TC-STAR project.The annotation of the database includes manual orthographic transcriptions, the automatic segmentation into phonemes and automatic generation of pitch marks. A certain percentage of phonetic segments and pitch marks has been manually checked. A pronunciation lexicon in SAMPA with POS, lemma and phonetic transcription of all the words prompted and spoken is also provided.Speech samples are stored as sequences of 24-bit 96 kHz with the least significant byte first (“lohi” or Intel format) as (signed) integers. Each prompted utterance is stored in a separate file. Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.The TC-STAR Spanish Baseline Male Speech Database is also available via ELRA under reference ELRA-S0310.
h
peoples_speech
huggingface.co
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
MLCommons, peoples_speech [Dataset]. https://huggingface.co/datasets/MLCommons/peoples_speech
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset authored and provided by
MLCommons
License
Attribution 2.0 (CC BY 2.0)https://creativecommons.org/licenses/by/2.0/
License information was derived automatically
Description
Dataset Card for People's Speech

Dataset Summary

The People's Speech Dataset is among the world's largest English speech recognition corpus today that is licensed for academic and commercial usage under CC-BY-SA and CC-BY 4.0. It includes 30,000+ hours of transcribed speech in English languages with a diverse set of speakers. This open dataset is large enough to train speech-to-text systems and crucially is available with a permissive license.

Supported Tasks… See the full description on the dataset page: https://huggingface.co/datasets/MLCommons/peoples_speech.
F
Egyptian Arabic General Conversation Speech Dataset for ASR
futurebeeai.com
wav
Updated Aug 1, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). Egyptian Arabic General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-arabic-egypt
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
Welcome to the Egyptian Arabic General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Arabic speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Egyptian Arabic communication.
Curated by FutureBeeAI, this 40 hours dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Arabic speech models that understand and respond to authentic Egyptian accents and dialects.
Speech Data
The dataset comprises 40 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Egyptian Arabic. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
•Participant Diversity:
•
Speakers: 80 verified native Egyptian Arabic speakers from FutureBeeAI’s contributor community.

•
Regions: Representing various provinces of Egypt to ensure dialectal diversity and demographic balance.

•
Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.

•Recording Details:
•
Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.

•
Duration: Each conversation ranges from 15 to 60 minutes.

•
Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.

•
Environment: Quiet, echo-free settings with no background noise.

Topic Diversity
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
•Sample Topics Include:
•Family & Relationships
•Food & Recipes
•Education & Career
•Healthcare Discussions
•Social Issues
•Technology & Gadgets
•Travel & Local Culture
•Shopping & Marketplace Experiences, and many more.
Transcription
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
•Transcription Highlights:
•Speaker-segmented dialogues
•Time-coded utterances
•Non-speech elements (pauses, laughter, etc.)
•High transcription accuracy, achieved through double QA pass, average WER < 5%
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
Metadata
The dataset comes with granular metadata for both speakers and recordings:
•
Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.

•
Recording Metadata: Topic, duration, audio format, device type, and sample rate.

Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
Usage and Applications
This dataset is a versatile resource for multiple Arabic speech and language AI applications:
•
ASR Development: Train accurate speech-to-text systems for Egyptian Arabic.

•
Voice Assistants: Build smart assistants capable of understanding natural Egyptian conversations.

<div style="margin-top:10px; margin-bottom: 10px; padding-left: 30px; display: flex; gap: 16px;
F
Travel Scripted Monologue Speech Data: Japanese (Japan)
futurebeeai.com
wav
Updated Aug 1, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). Travel Scripted Monologue Speech Data: Japanese (Japan) [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/travel-scripted-speech-monologues-japanese-japan
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement
Area covered
Japan
Dataset funded by
FutureBeeAI
Description
Introduction
Welcome to the Japanese Scripted Monologue Speech Dataset for the Travel Domain. This meticulously curated dataset is designed to advance the development of Japanese language speech recognition models, particularly for the Travel industry.
Speech Data
This training dataset comprises over 6,000 high-quality scripted prompt recordings in Japanese. These recordings cover various topics and scenarios relevant to the Travel domain, designed to build robust and accurate customer service speech technology.
•Participant Diversity:
•
Speakers: 60 native Japanese speakers from different regions of Japan.

•
Regions: Ensures a balanced representation of Japanese accents, dialects, and demographics.

•
Participant Profile: Participants range from 18 to 70 years old, representing both males and females in a 60:40 ratio, respectively.

•Recording Details:
•
Recording Nature: Audio recordings of scripted prompts/monologues.

•
Audio Duration: Average duration of 5 to 30 seconds per recording.

•
Formats: WAV format with mono channels, a bit depth of 16 bits, and sample rates of 8 kHz and 16 kHz.

•
Environment: Recordings are conducted in quiet settings without background noise and echo.

•
Topic Diversity: The dataset encompasses a wide array of topics and conversational scenarios to ensure comprehensive coverage of the Travel sector. Topics include:

•Customer Service Interactions
•Booking and Reservations
•Travel Inquiries
•Technical Support
•General Information and Advice
•Promotional and Sales Events
•Domain Specific Statements
•
Other Elements: To enhance realism and utility, the scripted prompts incorporate various elements commonly encountered in Travel interactions:

•
Names: Region-specific names of males and females in various formats.

•
Addresses: Region-specific addresses in different spoken formats.

•
Dates & Times: Inclusion of date and time in various travel contexts, such as booking dates, departure and arrival times.

•
Destinations: Specific names of cities, countries, and tourist attractions relevant to the travel sector.

•
Numbers & Prices: Various numbers and prices related to ticket costs, hotel rates, and transaction amounts.

•
Booking IDs and Confirmation Numbers: Inclusion of booking identification and confirmation details for realistic customer service scenarios.

Each scripted prompt is crafted to reflect real-life scenarios encountered in the Travel domain, ensuring applicability in training robust natural language processing and speech recognition models.
Transcription Data
In addition to high-quality audio recordings, the dataset includes meticulously prepared text files with verbatim transcriptions of each audio file. These transcriptions are essential for training accurate and robust speech recognition models.
•
Content: Each text file contains the exact scripted prompt corresponding to its audio file, ensuring consistency.

•
Format: Transcriptions are provided in plain text (.TXT) format, with files named to match their associated audio files for easy reference.

<div style="margin-top:10px;

Data from: LaFresCat: a Catalan multi-accent speech dataset for...

zenodo.org

application/gzip, txt

Updated Feb 18, 2025

+ more versions

Facebook

Twitter

Click to copy link

Link copied

Cite

Zenodo (2025). LaFresCat: a Catalan multi-accent speech dataset for text-to-speech [Dataset]. http://doi.org/10.21437/iberspeech.2024-42

Explore at:

txt, application/gzipAvailable download formats

Unique identifier

https://doi.org/10.21437/iberspeech.2024-42

Dataset updated

Feb 18, 2025

Dataset provided by

Zenodohttp://zenodo.org/

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

LaFresCat Multiaccent

We present LaFresCat, the first Catalan multiaccented and multispeaker dataset.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Commercial use is only possible through licensing by the voice artists. For further information, contact langtech@bsc.es and lafrescaproduccions@gmail.com.

Dataset Details

Dataset Description

The audios from this dataset have been created with professional studio recordings by professional voice actors in Lafresca Creative Studio. This is the raw version of the dataset, no resampling or trimming has been applied to the audios. Audios are stored in wav format at 48khz sampling rate

In total, there are 4 different accents, with 2 speakers per accent (female and male). After trimming, accumulates a total of 3,75h (divided by speaker IDs) as follows:

Balear
- olga -> 23.5 min
- quim -> 30.93 min
Central
- elia -> 33.14 min
- grau -> 37,86 min
Occidental (North-Western)
- emma -> 28,67 min
- pere -> 25,12 min
Valencia
- gina -> 22,25 min
- lluc -> 23,58 min

Uses

The purpose of this dataset is mainly for training text-to-speech and automatic speech recognition models in Catalan accents.

Languages

The dataset is in Catalan (ca-ES).

Dataset Structure

The dataset consists of 2858 audios and transcriptions in the following structure:lafresca_multiaccent_raw ├── balear │ ├── olga │ ├── olga.txt │ ├── quim │ └── quim.txt ├── central │ ├── elia │ ├── elia.txt │ ├── grau │ └── grau.txt ├── full_filelist.txt ├── occidental │ ├── emma │ ├── emma.txt │ ├── pere │ └── pere.txt └── valencia ├── gina ├── gina.txt ├── lluc └── lluc.txt

Metadata of the dataset can be found in the file `full_filelist.txt` , each line represents an audio and follows the format:

audio_path | speaker_id | transcription

The speaker ids have the following mapping:

"quim": 0,
"olga": 1,
"grau": 2,
"elia": 3,
"pere": 4,
"emma": 5,
"lluc": 6,
"gina": 7

Dataset Creation

This dataset has been created by members of the Language Technologies unit from the Life Sciences department of the Barcelona Supercomputing Center, except the valencian sentences which were created with the support of Cenid, the Digital Intelligence Center of the University of Alicante. The voices belong to professional voice actors and they've been recorded in Lafresca Creative Studio.

Source Data

The data presented in this dataset is the source data.

Data Collection and Processing

These are the technical details of the data collection and processing:

Microphone: Austrian Audio oc818
Preamp: Focusrite ISA Two
Audio Interface: Antelope Orion 32+
DAW: ProTools 2023.6.0

Processing:

Noise Gate: C1 Gate
Compression BF-76
De-Esser Renaissance
EQ Maag EQ2
EQ FabFilter Pro-Q3
Limiter: L1 Ultramaximizer

Here's the information about the speakers:

Dialect	Gender	County
Central	male	Barcelonès
Central	female	Barcelonès
Balear	female	Pla de Mallorca
Balear	male	Llevant
Occidental	male	Baix Ebre
Occidental	female	Baix Ebre
Valencian	female	Ribera Alta
Valencian	male	La Plana Baixa

Who are the source data producers?

The Language Technologies team from the Life Sciences department at the Barcelona Supercomputing Center developed this dataset. It features recordings by professional voice actors made at Lafresca Creative Studio.

Annotations

In order to check whether or not there were any errors in the transcriptions of the audios, we created a Label Studio space. In that space, we manually listened to subset of the dataset, and compared what we heard with the transcription. If the transcription was mistaken, we corrected it.

Personal and Sensitive Information

The dataset consists of professional voice actors who have recorded their voice. You agree to not attempt to determine the identity of speakers in this dataset.

Bias, Risks, and Limitations

Training a Text-to-Speech (TTS) model by fine-tuning with a Catalan speaker who speaks a particular dialect presents significant limitations. Mostly, the challenge is in capturing the full range of variability inherent in that accent. Each dialect has its own unique phonetic, intonational, and prosodic characteristics that can vary greatly even within a single linguistic region. Consequently, a TTS model trained on a narrow dialect sample will struggle to generalize across different accents and sub-dialects, leading to reduced accuracy and naturalness. Additionally, achieving a standard representation is exceedingly difficult because linguistic features can differ markedly not only between dialects but also among individual speakers within the same dialect group. These variations encompass subtle nuances in pronunciation, rhythm, and speech patterns that are challenging to standardize in a model trained on a limited dataset.

Funding

This work has been promoted and financed by the Generalitat de Catalunya through the Aina project, in addition the Valencian sentences have been created within the framework of the NEL-VIVES project 2022/TL22/00215334.

Dataset Card Contact

langtech@bsc.es

Facebook

Twitter

Click to copy link

Link copied

Cite

Tsiakoulis, Pirros (2025). SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7119399

Data from: SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis

Explore at:

Dataset updated

Mar 7, 2025

Dataset provided by

Vioni, Alexandra
Nikitaras, Karolos
Chalamandaris, Aimilios
Ellinas, Nikolaos
Maniati, Georgia
Klapsas, Konstantinos
Sung, June Sig
Tsiakoulis, Pirros
Jho, Gunu

License

Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically

Description

This is the public release of the Samsung Open Mean Opinion Scores (SOMOS) dataset for the evaluation of neural text-to-speech (TTS) synthesis, which consists of audio files generated with a public domain voice from trained TTS models based on bibliography, and numbers assigned to each audio as quality (naturalness) evaluations by several crowdsourced listeners.DescriptionThe SOMOS dataset contains 20,000 synthetic utterances (wavs), 100 natural utterances and 374,955 naturalness evaluations (human-assigned scores in the range 1-5). The synthetic utterances are single-speaker, generated by training several Tacotron-like acoustic models and an LPCNet vocoder on the LJ Speech voice public dataset. 2,000 text sentences were synthesized, selected from Blizzard Challenge texts of years 2007-2016, the LJ Speech corpus as well as Wikipedia and general domain data from the Internet.Naturalness evaluations were collected via crowdsourcing a listening test on Amazon Mechanical Turk in the US, GB and CA locales. The records of listening test participants (workers) are fully anonymized. Statistics on the reliability of the scores assigned by the workers are also included, generated through processing the scores and validation controls per submission page.

To listen to audio samples of the dataset, please see our Github page.

The dataset release comes with a carefully designed train-validation-test split (70%-15%-15%) with unseen systems, listeners and texts, which can be used for experimentation on MOS prediction.

This version also contains the necessary resources to obtain the transcripts corresponding to all dataset audios.

The dataset may be used for research purposes only, for non-commercial purposes only, and may be distributed with the same terms.

Every time you produce research that has used this dataset, please cite the dataset appropriately.

Cite as:

@inproceedings{maniati22_interspeech, author={Georgia Maniati and Alexandra Vioni and Nikolaos Ellinas and Karolos Nikitaras and Konstantinos Klapsas and June Sig Sung and Gunu Jho and Aimilios Chalamandaris and Pirros Tsiakoulis}, title={{SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis}}, year=2022, booktitle={Proc. Interspeech 2022}, pages={2388--2392}, doi={10.21437/Interspeech.2022-10922} }

References of resources & models used

Voice & synthesized texts:K. Ito and L. Johnson, “The LJ Speech Dataset,” https://keithito.com/LJ-Speech-Dataset/, 2017.

Vocoder:J.-M. Valin and J. Skoglund, “LPCNet: Improving neural speech synthesis through linear prediction,” in Proc. ICASSP, 2019.R. Vipperla, S. Park, K. Choo, S. Ishtiaq, K. Min, S. Bhattacharya, A. Mehrotra, A. G. C. P. Ramos, and N. D. Lane, “Bunched lpcnet: Vocoder for low-cost neural text-to-speech systems,” in Proc. Interspeech, 2020.

Acoustic models:N. Ellinas, G. Vamvoukakis, K. Markopoulos, A. Chalamandaris, G. Maniati, P. Kakoulidis, S. Raptis, J. S. Sung, H. Park, and P. Tsiakoulis, “High quality streaming speech synthesis with low, sentence-length-independent latency,” in Proc. Interspeech, 2020.Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio et al., “Tacotron: Towards End-to-End Speech Synthesis,” in Proc. Interspeech, 2017.J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerrv-Ryan et al., “Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions,” in Proc. ICASSP, 2018.J. Shen, Y. Jia, M. Chrzanowski, Y. Zhang, I. Elias, H. Zen, and Y. Wu, “Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling,” arXiv preprint arXiv:2010.04301, 2020.M. Honnibal and M. Johnson, “An Improved Non-monotonic Transition System for Dependency Parsing,” in Proc. EMNLP, 2015.M. Dominguez, P. L. Rohrer, and J. Soler-Company, “PyToBI: A Toolkit for ToBI Labeling Under Python,” in Proc. Interspeech, 2019.Y. Zou, S. Liu, X. Yin, H. Lin, C. Wang, H. Zhang, and Z. Ma, “Fine-grained prosody modeling in neural speech synthesis using ToBI representation,” in Proc. Interspeech, 2021.K. Klapsas, N. Ellinas, J. S. Sung, H. Park, and S. Raptis, “WordLevel Style Control for Expressive, Non-attentive Speech Synthesis,” in Proc. SPECOM, 2021.T. Raitio, R. Rasipuram, and D. Castellani, “Controllable neural text-to-speech synthesis using intuitive prosodic features,” in Proc. Interspeech, 2020.

Synthesized texts from the Blizzard Challenges 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2016:M. Fraser and S. King, "The Blizzard Challenge 2007," in Proc. SSW6, 2007.V. Karaiskos, S. King, R. A. Clark, and C. Mayo, "The Blizzard Challenge 2008," in Proc. Blizzard Challenge Workshop, 2008.A. W. Black, S. King, and K. Tokuda, "The Blizzard Challenge 2009," in Proc. Blizzard Challenge, 2009.S. King and V. Karaiskos, "The Blizzard Challenge 2010," 2010.S. King and V. Karaiskos, "The Blizzard Challenge 2011," 2011.S. King and V. Karaiskos, "The Blizzard Challenge 2012," 2012.S. King and V. Karaiskos, "The Blizzard Challenge 2013," 2013.S. King and V. Karaiskos, "The Blizzard Challenge 2016," 2016.

Contact

Alexandra Vioni - a.vioni@samsung.com

If you have any questions or comments about the dataset, please feel free to write to us.

We are interested in knowing if you find our dataset useful! If you use our dataset, please email us and tell us about your research.

Clear search

Close search

Google apps

Main menu

Data from: SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural...

Data from: English TTS speech corpus of air traffic (pilot) messages -...

Data from: Balinese Text-to-Speech Dataset as Digital Cultural Heritage

Bengali Speech Recognition Dataset (BSRD)

TIMIT-TTS: a Text-to-Speech Dataset for Synthetic Speech Detection

Noisy reverberant speech database for training speech enhancement algorithms...

Scripted Monologues Speech Data | 65,000 Hours | Generative AI Audio Data|...

FOSD Male Speech Dataset

FESTCAT Catalan TTS baseline female speech database

A kiswahili Dataset for Development of Text-To-Speech System

UGSpeechData: A Multilingual Speech Dataset of Ghanaian Languages

English-Technical-Speech-Dataset

In-Cabin Speech Data | 15,000 Hours | AI Training Data | Speech Recognition...

Data from: The "Mići Princ" text and speech dataset of Chakavian...

Speech Recognition Data Collection Services | 100+ Languages Resources...

TC-STAR Spanish Baseline Female Speech Database

peoples_speech

Egyptian Arabic General Conversation Speech Dataset for ASR

Introduction

Speech Data

Topic Diversity

Transcription

Metadata

Usage and Applications

Travel Scripted Monologue Speech Data: Japanese (Japan)