100+ datasets found
  1. 8kHz Conversational Speech Data | 15,000 Hours | Audio Data | Speech...

    • datarade.ai
    Updated Dec 10, 2023
    Cite
    Nexdata (2023). 8kHz Conversational Speech Data | 15,000 Hours | Audio Data | Speech Recognition Data| Multilingual Language Data [Dataset]. https://datarade.ai/data-products/nexdata-multilingual-conversational-speech-data-8khz-tele-nexdata
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Dec 10, 2023
    Dataset authored and provided by
    Nexdata
    Area covered
    Kazakhstan, Colombia, Ukraine, Bulgaria, Puerto Rico, United Republic of, Uzbekistan, Georgia, Jordan, Sri Lanka
    Description
    1. Specifications

    Format: 8kHz, 8-bit, u-law/a-law PCM, mono channel (see the decoding sketch after this list);

    Environment: quiet indoor environment, without echo;

    Recording content: no preset linguistic content; dozens of topics are specified, and the speakers converse on those topics while the recording is performed;

    Demographics: speakers are evenly distributed across all age groups, covering children, teenagers, the middle-aged, the elderly, etc.;

    Annotation: transcription text, speaker identification, gender, and noise symbols;

    Device: telephony recording system;

    Language: 100+ languages;

    Application scenarios: speech recognition; voiceprint recognition;

    Accuracy rate: the word accuracy rate is not less than 98%
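
    The following minimal sketch decodes 8-bit u-law samples to 16-bit linear PCM (ITU-T G.711) with NumPy. It assumes a raw, headerless byte stream and a hypothetical file name; actual deliveries may ship WAV containers instead, in which case a reader such as soundfile handles the expansion for you.

    import numpy as np

    def ulaw_to_pcm16(ulaw_bytes: bytes) -> np.ndarray:
        """Expand 8-bit u-law samples to 16-bit linear PCM (ITU-T G.711)."""
        u = ~np.frombuffer(ulaw_bytes, dtype=np.uint8)    # u-law stores the bits inverted
        sign = u & 0x80
        exponent = (u >> 4) & 0x07
        mantissa = (u & 0x0F).astype(np.int32)
        magnitude = ((mantissa << 3) + 0x84) << exponent  # 0x84 is the G.711 bias
        pcm = magnitude - 0x84
        return np.where(sign > 0, -pcm, pcm).astype(np.int16)

    with open("sample_8khz_ulaw.raw", "rb") as f:         # hypothetical file name
        pcm = ulaw_to_pcm16(f.read())
    print(f"{len(pcm)} samples = {len(pcm) / 8000:.1f} s at 8 kHz")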

    2. About Nexdata Nexdata owns off-the-shelf PB-level Large Language Model (LLM) Data, 3 million hours of Multilingual Language Data and 800TB of Computer Vision Data. These ready-to-go Machine Learning (ML) Data support instant delivery and quickly improve the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/speechrecog?source=Datarade
  2. VIVOS: Vietnamese Speech Corpus for ASR

    • kaggle.com
    zip
    Updated Dec 4, 2022
    + more versions
    Cite
    Khoa D. Vo (2022). VIVOS: Vietnamese Speech Corpus for ASR [Dataset]. https://www.kaggle.com/datasets/kynthesis/vivos-vietnamese-speech-corpus-for-asr
    Explore at:
    Available download formats: zip (1473514466 bytes)
    Dataset updated
    Dec 4, 2022
    Authors
    Khoa D. Vo
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    VIVOS Corpus

    VIVOS is a free Vietnamese speech corpus consisting of 15 hours of recorded speech, prepared for the automatic speech recognition (ASR) task.

    The corpus was published by AILAB, a computer science lab of VNUHCM - University of Science, with Prof. Vu Hai Quan as the head.

    We publish this corpus in the hope of attracting more scientists to work on Vietnamese speech recognition problems. The corpus should only be used for academic purposes.

  3. Mixed Speech Data |5,000 Hours |Code-switching|Audio Data| Speech...

    • datarade.ai
    Updated Mar 10, 2024
    Cite
    Nexdata (2024). Mixed Speech Data |5,000 Hours |Code-switching|Audio Data| Speech Recognition Data| AI Datasets [Dataset]. https://datarade.ai/data-products/nexdata-multilingual-code-switching-speech-data-5-000-hou-nexdata
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Mar 10, 2024
    Dataset authored and provided by
    Nexdata
    Area covered
    Australia, Italy, Mexico, France, Taiwan, Korea (Republic of), New Zealand, China, Hong Kong, Germany
    Description
    1. Specifications

    Format: 16kHz, 16-bit, uncompressed wav, mono channel

    Recording environment: quiet indoor environment, without echo

    Recording content (read speech): general category; human-machine interaction category

    Demographics: speakers are evenly distributed across all age groups, covering children, teenagers, the middle-aged, the elderly, etc.

    Device: Android mobile phone, iPhone

    Language: English-Korean, English-Japanese, German-English, Hong Kong Cantonese-English, Taiwanese-English, etc.

    Application scenarios: speech recognition; voiceprint recognition.

    Accuracy rate: 97%

    2. About Nexdata Nexdata owns off-the-shelf PB-level Large Language Model (LLM) Data, 3 million hours of Audio Data and 800TB of Annotated Imagery Data. These ready-to-go Natural Language Processing (NLP) Data support instant delivery and quickly improve the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/speechrecog?source=Datarade
  4. Canadian English General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Canadian English General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-english-canada
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    Canada
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Canadian English General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of English speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Canadian English communication.

    Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade English speech models that understand and respond to authentic Canadian accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Canadian English. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native Canadian English speakers from FutureBeeAI’s contributor community.
    Regions: Representing various provinces of Canada to ensure dialectal diversity and demographic balance.
    Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
    High transcription accuracy, achieved through a double QA pass (average WER < 5%)

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
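
    As a rough sketch of consuming such speaker-segmented, time-coded transcriptions, the snippet below walks one JSON file. The exact FutureBeeAI schema is not shown on this page, so the file name and the field names ("segments", "speaker", "start", "end", "text") are assumptions for illustration.

    import json

    with open("conversation_001.json", encoding="utf-8") as f:   # hypothetical file name
        transcript = json.load(f)

    # Print each time-coded utterance with its speaker label.
    for seg in transcript.get("segments", []):
        start, end = seg.get("start", 0.0), seg.get("end", 0.0)
        print(f"[{start:7.2f}-{end:7.2f}] {seg.get('speaker', '?')}: {seg.get('text', '')}")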

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    Usage and Applications

    This dataset is a versatile resource for multiple English speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for Canadian English.
    Voice Assistants: Build smart assistants capable of understanding natural Canadian conversations.

  5. Native & Accented English Speech Data |40,000 Hours | Audio Data|Speech...

    • datarade.ai
    Updated Mar 20, 2024
    Cite
    Nexdata (2024). Native & Accented English Speech Data |40,000 Hours | Audio Data|Speech Recognition Data| Text-to-Speech(TTS) Data [Dataset]. https://datarade.ai/data-products/nexdata-multilingual-native-accented-english-speech-data-nexdata
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Mar 20, 2024
    Dataset authored and provided by
    Nexdata
    Area covered
    Denmark, Myanmar, Turkey, Sweden, Egypt, Pakistan, United States of America, Taiwan, United Kingdom, Macao
    Description
    1. Specifications

    Format: 16kHz, 16-bit, uncompressed wav, mono channel.

    Recording environment: quiet indoor environment, low background noise, without echo.

    Recording content (read speech): generic category; human-machine interaction category; smart home command and control category; in-car command and control category; numbers.

    Demographics: speakers are evenly distributed across all age groups, covering children, teenagers, the middle-aged, the elderly, etc.

    Device: Android mobile phone, iPhone.

    Language: American English, British English, Canadian English, Australian English, French English, German English, Spanish English, Italian English, Portuguese English, Russian English, Indian English, Japanese English, Korean English, Singaporean English, etc.

    Application scenarios: speech recognition; voiceprint recognition.

    2. About Nexdata Nexdata owns off-the-shelf PB-level Large Language Model (LLM) Data, 3 million hours of Speech Data and 800TB of Imagery Data. These ready-to-go Machine Learning (ML) Data support instant delivery and quickly improve the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/speechrecog?source=Datarade
  6. Czech General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Czech General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-czech
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Czech General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Czech speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Czech communication.

    Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Czech speech models that understand and respond to authentic Czech accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Czech. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native Czech speakers from FutureBeeAI’s contributor community.
    Regions: Representing various regions of the Czech Republic to ensure dialectal diversity and demographic balance.
    Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
    High transcription accuracy, achieved through a double QA pass (average WER < 5%)

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    Usage and Applications

    This dataset is a versatile resource for multiple Czech speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for Czech.
    Voice Assistants: Build smart assistants capable of understanding natural Czech conversations.

  7. AURORA-5

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Aug 16, 2017
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2017). AURORA-5 [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-AURORA-CD0005/
    Explore at:
    Dataset updated
    Aug 16, 2017
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    The Aurora project was originally set up to establish a worldwide standard for the feature extraction software which forms the core of the front-end of a DSR (Distributed Speech Recognition) system. The AURORA-5 database has been developed mainly to investigate the influence of hands-free speech input in noisy room environments on the performance of automatic speech recognition. Furthermore, two test conditions are included to study the influence of transmitting the speech over a mobile communication system.

    The earlier three Aurora experiments focused on additive noise and the influence of some telephone frequency characteristics. Aurora-5 tries to cover all effects as they occur in realistic application scenarios. The focus was put on two scenarios. The first is hands-free speech input in the noisy car environment, with the intention of controlling either devices in the car itself or retrieving information from a remote speech server over the telephone. The second covers hands-free speech input in an office or living room, e.g. to control a telephone device or some audio/video equipment.

    The AURORA-5 database contains the following data:

    • Artificially distorted versions of the recordings from adult speakers in the TI-Digits speech database, downsampled to a sampling frequency of 8000 Hz. The distortions consist of: additive background noise; the simulation of hands-free speech input in rooms; the simulation of transmitting speech over cellular telephone networks.

    • A subset of recordings from the meeting recorder project at the International Computer Science Institute. The recordings contain sequences of digits uttered by different speakers in hands-free mode in a meeting room.

    • A set of scripts for running recognition experiments on the above-mentioned speech data. The experiments are based on the freely available software package HTK; HTK itself is not part of this resource.

    Further information is also available at: http://aurora.hsnr.de

  8. Scripted Monologues Speech Data | 65,000 Hours | GenAI Audio Data|...

    • datarade.ai
    Updated Dec 11, 2023
    Cite
    Nexdata (2023). Scripted Monologues Speech Data | 65,000 Hours | GenAI Audio Data| Text-to-Speech Data| Multilingual Language Data [Dataset]. https://datarade.ai/data-products/nexdata-multilingual-read-speech-data-65-000-hours-aud-nexdata
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Dec 11, 2023
    Dataset authored and provided by
    Nexdata
    Area covered
    Lebanon, Russian Federation, Netherlands, Japan, Slovenia, Jordan, Cambodia, El Salvador, Brazil, France
    Description
    1. Specifications

    Format: 16kHz, 16-bit, uncompressed wav, mono channel

    Recording environment: quiet indoor environment, without echo

    Recording content (read speech): economy, entertainment, news, oral language, numbers, letters

    Speaker: native speakers, gender-balanced

    Device: Android mobile phone, iPhone

    Language: 100+ languages

    Transcription content: text, time points of speech data, 5 noise symbols, 5 special identifiers

    Accuracy rate: 95% (the accuracy rate of noise symbols and other identifiers is not included)

    Application scenarios: speech recognition, voiceprint recognition

    2. About Nexdata Nexdata owns off-the-shelf PB-level Large Language Model (LLM) Data, 3 million hours of Multilingual Language Data and 800TB of Computer Vision Data. These ready-to-go Machine Learning (ML) Data support instant delivery and quickly improve the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/speechrecog?source=Datarade
  9. Data from: Disordered Speech Data Collection: Lessons Learned at 1 Million...

    • incluset.com
    Updated 2021
    Cite
    Robert L. MacDonald; Pan-Pan Jiang; Julie Cattiau; Rus Heywood; Richard Cave; Katie Seaver; Marilyn A. Ladewig; Jimmy Tobin; Michael P. Brenner; Philip C. Nelson; Jordan R. Green; Katrin Tomanek (2021). Disordered Speech Data Collection: Lessons Learned at 1 Million Utterances from Project Euphonia [Dataset]. https://incluset.com/datasets
    Explore at:
    Dataset updated
    2021
    Authors
    Robert L. MacDonald; Pan-Pan Jiang; Julie Cattiau; Rus Heywood; Richard Cave; Katie Seaver; Marilyn A. Ladewig; Jimmy Tobin; Michael P. Brenner; Philip C. Nelson; Jordan R. Green; Katrin Tomanek
    Description

    Speech samples from over 1,000 individuals with impaired speech were collected for Project Euphonia, which aims to improve automated speech recognition systems for disordered speech. While participants consented to making the dataset public, this work is pursuing ways to allow data contribution to a central repository that can open access to other researchers.

  10. 16kHz Conversational Speech Data | 35,000 Hours | Large Language Model(LLM)...

    • datarade.ai
    Updated Dec 9, 2023
    Cite
    Nexdata (2023). 16kHz Conversational Speech Data | 35,000 Hours | Large Language Model(LLM) Data | Speech AI Datasets|Machine Learning (ML) Data [Dataset]. https://datarade.ai/data-products/nexdata-multilingual-conversational-speech-data-16khz-mob-nexdata
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Dec 9, 2023
    Dataset authored and provided by
    Nexdata
    Area covered
    Austria, Canada, Germany, Saudi Arabia, Ecuador, Indonesia, Turkey, Vietnam, Korea (Republic of), Malaysia
    Description
    1. Specifications

    Format: 16kHz, 16-bit, uncompressed wav, mono channel;

    Environment: quiet indoor environment, without echo;

    Recording content: no preset linguistic content; dozens of topics are specified, and the speakers converse on those topics while the recording is performed;

    Demographics: speakers are evenly distributed across all age groups, covering children, teenagers, the middle-aged, the elderly, etc.;

    Annotation: transcription text, speaker identification, gender, and noise symbols;

    Device: Android mobile phone, iPhone;

    Language: 100+ languages;

    Application scenarios: speech recognition; voiceprint recognition;

    Accuracy rate: the word accuracy rate is not less than 98%

    2. About Nexdata Nexdata owns off-the-shelf PB-level Large Language Model (LLM) Data, 3 million hours of Audio Data and 800TB of Computer Vision Data. These ready-to-go Machine Learning (ML) Data support instant delivery and quickly improve the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/speechrecog?source=Datarade
  11. Amharic speech corpus

    • kaggle.com
    Updated Mar 13, 2024
    Cite
    Ashenafi Fasil Kebede (2024). Amharic speech corpus [Dataset]. https://www.kaggle.com/datasets/ashenafifasilkebede/amharic-speech-corpus
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 13, 2024
    Dataset provided by
    Kaggle
    Authors
    Ashenafi Fasil Kebede
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Amharic speech data acquired and prepared by Dr. Solomon T. and Dr. Martha Y.

    OVERVIEW

    The package contains an Amharic speech corpus with audio data in the directory /data. The data directory contains 2 subdirectories: (a) train: speech data and transcription for training automatic speech recognition, in Kaldi ASR format [1]; (b) test: speech data and transcription for testing automatic speech recognition, in Kaldi ASR format.

    A text corpus and language model are provided in the directory /LM, and a lexicon in the directory /lang.

    Amharic SPEECH CORPUS

    Directory: /data/train. Files: text (training transcription), wav.scp (file ID and path), utt2spk (utterance ID to speaker ID), spk2utt (speaker ID to utterance IDs), wav (.wav files). For more information about the format, please refer to the Kaldi website: http://kaldi-asr.org/doc/data_prep.html. Description: about 20 hours of training data in Kaldi format. Note: the paths of the wav files in wav.scp have to be modified to point to their actual location (a scripted fix is sketched after the test-directory description below).

    Directory: /data/test. Files: text (test transcription), wav.scp (file ID and path), utt2spk (utterance ID to speaker ID), spk2utt (speaker ID to utterance IDs), wav (.wav files). Description: about 2 hours of testing data in Kaldi format. The audio files for testing have the format
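
    The wav.scp path fix called for in the note above can be scripted. This is a minimal sketch, with placeholder prefixes standing in for wherever the corpus was actually unpacked:

    from pathlib import Path

    def rewrite_wav_scp(scp: Path, old_prefix: str, new_prefix: str) -> None:
        """Point each entry of a Kaldi wav.scp ("<utt-id> <path>") at the actual wav location."""
        fixed = []
        for line in scp.read_text().splitlines():
            utt_id, path = line.split(maxsplit=1)
            fixed.append(f"{utt_id} {path.replace(old_prefix, new_prefix)}")
        scp.write_text("\n".join(fixed) + "\n")

    rewrite_wav_scp(Path("data/train/wav.scp"), "/old/prefix", "/corpus/amharic")  # placeholder prefixes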

    Amharic TEXT CORPUS

    Directory: /lm. Files: amharic_lm_PART1.zip, amharic_lm_PART2.zip. Those files have to be unzipped and reassembled into one file to constitute the original language model "amharic.train.lm.data.arpa". This language model was created with SRILM using 3-grams; the text is segmented into morphemes using Morfessor 2.0 [2][3].
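
    A minimal reassembly sketch, assuming each zip holds one chunk of the ARPA file and that PART1 precedes PART2 (verify the member names before relying on this):

    import zipfile

    with open("amharic.train.lm.data.arpa", "wb") as out:
        for part in ("amharic_lm_PART1.zip", "amharic_lm_PART2.zip"):   # order assumed
            with zipfile.ZipFile(part) as zf:
                for name in sorted(zf.namelist()):
                    out.write(zf.read(name))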

    LEXICON/PRONUNCIATION DICTIONARY

    Directory: /lang. Files: lexicon.txt (lexicon), nonsilence_phones.txt (speech phones), optional_silence.txt (silence phone). Description: the lexicon contains words and their respective pronunciations, plus non-speech sounds and noise, in Kaldi format; the tokens have been extracted after morpheme-level segmentation using Morfessor 2.0 [3].

    SCRIPTS

    In /kaldi-scripts you will find the scripts used to train and test models. Paths have to be changed in 00_init_paths.sh and 01_init_datas.sh to make them work in your own directory (don't forget to set the appropriate paths for test.pl and train.pl). From the existing data and lang directories you can directly run the sequence starting with 04_train_mono.sh; change the --beam and --retry-beam sizes to slightly wider beams, as per the following specification in the Kaldi mono_train.sh script for this corpus (i.e. beam size from --beam=$[$beam] to --beam=$[$beam*4] and retry-beam size from --retry-beam=$[$beam*4] to --retry-beam=$[$beam*22]).

    04a_train_triphone.sh + 04b_train_MLLT_LDA.sh + 04c_train_SAT_FMLLR.sh + 04d_train_MMI_FMMI.sh + 04e_train_sgmm.sh

    THE FOLLOWING RESULTS WERE OBTAINED SO FAR (you should obtain the same results on this data if the same protocol is used)

    ## MER stands for Morpheme Error Rate
    ## SER stands for Sentence Error Rate
    ## CER stands for Character Error Rate per utterance [4]
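
    For context, each rate below is an edit-distance error rate: the minimum number of insertions, deletions, and substitutions needed to turn the reference into the hypothesis, divided by the reference length (morphemes for %MER, words for %WER, characters for %CER). A minimal sketch of the computation, not taken from the corpus scripts:

    def edit_distance(ref: list, hyp: list) -> int:
        """Levenshtein distance between token sequences (insertions, deletions, substitutions)."""
        prev = list(range(len(hyp) + 1))
        for i, r in enumerate(ref, 1):
            cur = [i]
            for j, h in enumerate(hyp, 1):
                cur.append(min(prev[j] + 1,               # deletion
                               cur[j - 1] + 1,            # insertion
                               prev[j - 1] + (r != h)))   # substitution (or match)
            prev = cur
        return prev[-1]

    ref, hyp = "the cat sat".split(), "the cat sat down".split()
    print(f"%WER {100.0 * edit_distance(ref, hyp) / len(ref):.2f}")   # 33.33: 1 insertion / 3 ref words
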
    Monophone (13 MFCC): %MER 19.89 [1234/6203, 91 ins, 362 del, 781 sub]; %SER 80.78 [290/359]; %CER 15.02

    Triphone (13 MFCC): %MER 10.83 [672/6203, 71 ins, 156 del, 445 sub]; %SER 62.12 [223/359]; %CER 6.94

    Triphone (13 MFCC + delta + delta2): %WER 9.62 [597/6203, 94 ins, 106 del, 397 sub]; %SER 60.45 [217/359]; %CER 6.46

    Triphone (39 features) with LDA+MLLT: %WER 8.61 [534/6203, 76 ins, 101 del, 357 sub]; %SER 56.27 [202/359]; %CER 5.34

    Triphone (39 features) SAT+FMLLR: %WER 9.24 [573/6203, 86 ins, 109 del, 378 sub]; %SER 59.89 [215/359]; %CER 6.27

    Triphone (39 features) MMI: %MER 10.83 [672/6203, 85 ins, 157 del, 430 sub]; %SER 64.07 [230/359]; %CER 7.66

    Triphone (39 features) fMMI + MMI with indirect differential:
      Iteration 3: %MER 10.56 [655/6203, 80 ins, 154 del, 421 sub]; %SER 64.62 [232/359]; %CER 7.13
      Iteration 4: %MER 10.59 [657/6203, 75 ins, 162 del, 420 sub]; %SER 64.90 [233/359]; %CER 7.31
      Iteration 5: %MER 10.37 [643/6203, 81 ins, 145 del, 417 sub]; %SER 63.51 [228/359]; %CER 7.07
      Iteration 6: %MER 10.45 [648/6203, 83 ins, 147 del, 418 sub]; %SER 64.07 [230/359]; %CER 7.21
      Iteration 7: %MER 10.35 [642/6203, 79 ins, 147 del, 416 sub]; %SER 64.62 [232/359]; %CER 7.05
      Iteration 8: %MER 10.43 [647/6203, 79 ins, 155 del, 413 sub]; %SER 64.62 [232/359]; %CER 7.34

    Triphone (39 features) SGMM: %MER 8.75 [543/6203, 52 ins, 134 del, 357 sub]; %SER 57.10 [205/359]; %CER 5.50

    Triphone (39 features) SGMM+MMI, Iteration 1: %MER 8.59 [533/6203, 46 ins, 131 del, 356 sub]; %SER 55.71 [200 ...

  12. Speech Synthesis Data | 400 Hours | TTS Data | Audio Data | AI Training...

    • datarade.ai
    Updated Dec 10, 2023
    Cite
    Nexdata (2023). Speech Synthesis Data | 400 Hours | TTS Data | Audio Data | AI Training Data| AI Datasets [Dataset]. https://datarade.ai/data-products/nexdata-multilingual-speech-synthesis-data-400-hours-a-nexdata
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Dec 10, 2023
    Dataset authored and provided by
    Nexdata
    Area covered
    China, Austria, Philippines, Malaysia, Belgium, Singapore, Colombia, Sweden, Hong Kong, Canada
    Description
    1. Specifications

    Format: 44.1kHz/48kHz, 16-bit/24-bit, uncompressed wav, mono channel.

    Recording environment: professional recording studio.

    Recording content: general narrative sentences, interrogative sentences, etc.

    Speaker: native speakers

    Annotation features: word transcription, part-of-speech, phoneme boundary, four-level accents, four-level prosodic boundary.

    Device: microphone

    Language: American English, British English, Japanese, French, Dutch, Cantonese, Canadian French, Australian English, Italian, New Zealand English, Spanish, Mexican Spanish

    Application scenarios: speech synthesis

    Accuracy rate:
    Word transcription: sentence accuracy rate not less than 99%.
    Part-of-speech annotation: sentence accuracy rate not less than 98%.
    Phoneme annotation: sentence accuracy rate not less than 98% (errors on voiced and swallowed phonemes are not counted, because that labelling is more subjective).
    Accent annotation: word accuracy rate not less than 95%.
    Prosodic boundary annotation: sentence accuracy rate not less than 97%.
    Phoneme boundary annotation: phoneme accuracy rate not less than 95% (boundary error within 5%).

    2. About Nexdata Nexdata owns off-the-shelf PB-level Large Language Model (LLM) Data, 3 million hours of Audio Data and 800TB of Annotated Imagery Data. These ready-to-go AI & ML Training Data support instant delivery and quickly improve the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/tts?source=Datarade
  13. FOSD Male Speech Dataset

    • data.mendeley.com
    Updated Jul 31, 2020
    + more versions
    Cite
    Duc Chung Tran (2020). FOSD Male Speech Dataset [Dataset]. http://doi.org/10.17632/3zz6txz35t.7
    Explore at:
    Dataset updated
    Jul 31, 2020
    Authors
    Duc Chung Tran
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is an FOSD-based male speech dataset (extracted from approximately 30 hours of FPT Open Speech Data, released publicly in 2018 by FPT Corporation under the FPT Public License), useful for creating text-to-speech models. It comprises 9,474 audio files totalling more than 10.5 hours of recordings. All files are in .wav format (16 kHz sampling rate, 32-bit float, mono), suitable for various TTS-related applications.
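
    Many ASR/TTS toolchains expect 16-bit PCM input rather than 32-bit float. A minimal conversion sketch using the soundfile library, with a placeholder file name:

    import numpy as np
    import soundfile as sf

    audio, sr = sf.read("fosd_male_0001.wav", dtype="float32")    # placeholder file name
    assert sr == 16000, "the dataset is described as 16 kHz"
    pcm16 = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16)  # scale [-1, 1] floats to int16
    sf.write("fosd_male_0001_pcm16.wav", pcm16, sr, subtype="PCM_16")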

    Copyright 2018 FPT Corporation Permission is hereby granted, free of charge, non-exclusive, worldwide, irrevocable, to any person obtaining a copy of this data or software and associated documentation files (the “Data or Software”), to deal in the Data or Software without restriction, including without limitation the rights to use, copy, modify, remix, transform, merge, build upon, publish, distribute and redistribute, sublicense, and/or sell copies of the Data or Software, for any purpose, even commercially, and to permit persons to whom the Data or Software is furnished to do so, subject to the following conditions: The above copyright notice, and this permission notice, and indication of any modification to the Data or Software, shall be included in all copies or substantial portions of the Data or Software.

    THE DATA OR SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE DATA OR SOFTWARE OR THE USE OR OTHER DEALINGS IN THE DATA OR SOFTWARE. Patent and trademark rights are not licensed under this FPT Public License.

  14. Indian Languages Audio Dataset

    • kaggle.com
    Updated Nov 3, 2023
    Cite
    HARSHMAN SOLANKI (2023). Indian Languages Audio Dataset [Dataset]. https://www.kaggle.com/datasets/hmsolanki/indian-languages-audio-dataset
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 3, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    HARSHMAN SOLANKI
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Area covered
    India
    Description

    Description: The "Indian Languages Audio Dataset" is a collection of audio samples featuring a diverse set of 10 Indian languages. Each audio sample in this dataset is precisely 5 seconds in duration and is provided in MP3 format. It is important to note that this dataset is a subset of a larger collection known as the "Audio Dataset with 10 Indian Languages." The source of these audio samples is regional videos freely available on YouTube, and none of the audio samples or source videos are owned by the dataset creator.

    Languages Included: 1. Bengali 2. Gujarati 3. Hindi 4. Kannada 5. Malayalam 6. Marathi 7. Punjabi 8. Tamil 9. Telugu 10. Urdu

    This dataset offers a valuable resource for researchers, linguists, and machine learning enthusiasts who are interested in studying and analyzing the phonetics, accents, and linguistic characteristics of the Indian subcontinent. It is a representative sample of the linguistic diversity present in India, encompassing a wide array of languages and dialects. Researchers and developers are encouraged to explore this dataset to build applications or conduct research related to speech recognition, language identification, and other audio processing tasks.
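
    As a rough sketch of the language-identification use mentioned above, the snippet below turns each 5-second MP3 clip into a fixed-length MFCC feature vector labeled by language. The <language>/<clip>.mp3 directory layout is an assumption, and librosa needs an MP3 backend such as ffmpeg.

    from pathlib import Path
    import librosa
    import numpy as np

    features, labels = [], []
    for mp3 in Path("indian_languages").glob("*/*.mp3"):          # assumed layout: <language>/<clip>.mp3
        audio, sr = librosa.load(mp3, sr=16000, mono=True)
        mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)
        features.append(mfcc.mean(axis=1))                        # one fixed-length vector per clip
        labels.append(mp3.parent.name)
    X, y = np.stack(features), np.array(labels)                   # ready for any off-the-shelf classifier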

    Additionally, the dataset is not limited to these 10 languages and has the potential for expansion. Given the dynamic nature of language use in India, this dataset can serve as a foundation for future data collection efforts involving additional Indian languages and dialects.

    Access to the "Indian Multilingual Audio Dataset - 10 Languages" is provided with the understanding that users will comply with applicable copyright and licensing restrictions. If users plan to extend this dataset or use it for commercial purposes, it is essential to seek proper permissions and adhere to relevant copyright and licensing regulations.

    By utilizing this dataset responsibly and ethically, users can contribute to the advancement of language technology and research, ultimately benefiting language preservation, speech recognition, and cross-cultural communication.

  15. Data from: Dysarthric speech database for universal access research

    • incluset.com
    Updated 2007
    Cite
    Heejin Kim; Mark Allan Hasegawa-Johnson; Adrienne Perlman; Jon Gunderson; Thomas S Huang; Kenneth Watkin; Simone Frame (2007). Dysarthric speech database for universal access research [Dataset]. https://incluset.com/datasets
    Explore at:
    Dataset updated
    2007
    Authors
    Heejin Kim; Mark Allan Hasegawa-Johnson; Adrienne Perlman; Jon Gunderson; Thomas S Huang; Kenneth Watkin; Simone Frame
    Measurement technique
    Participants read a variety of words, including digits, letters, computer commands, common words, and uncommon words from Project Gutenberg novels.
    Description

    This dataset was collected to enhance research into speech recognition systems for dysarthric speech.

  16. Data from: The TORGO database of acoustic and articulatory speech from...

    • incluset.com
    Updated 2011
    Cite
    Frank Rudzicz; Aravind Kumar Namasivayam; Talya Wolff (2011). The TORGO database of acoustic and articulatory speech from speakers with dysarthria [Dataset]. https://incluset.com/datasets
    Explore at:
    Dataset updated
    2011
    Authors
    Frank Rudzicz; Aravind Kumar Namasivayam; Talya Wolff
    Measurement technique
    It consists of aligned acoustics and measured 3D articulatory features from speakers with either cerebral palsy (CP) or amyotrophic lateral sclerosis (ALS).
    Description

    This database was originally created as a resource for developing advanced automatic speech recognition models that are better suited to the needs of people with dysarthria.

  17. Mandarin Conversational Speech Data by Mobile Phone and Voice Recorder -...

    • catalog.elra.info
    Updated Oct 6, 2022
    + more versions
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2022). Mandarin Conversational Speech Data by Mobile Phone and Voice Recorder - 1,351 Hours [Dataset]. https://catalog.elra.info/en-us/repository/browse/ELRA-S0436/
    Explore at:
    Dataset updated
    Oct 6, 2022
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    Description

    1,950 speakers participated in the recording and conducted face-to-face communication in a natural way. They had free discussion on a number of given topics covering a wide range of fields. The voice is natural and fluent, in line with a real dialogue scene. Text is transcribed manually, with high accuracy.

    Format: mobile phone: 16kHz, 16-bit, mono channel, .wav; voice recorder: 44.1kHz, 16-bit, dual channel, .wav

    Recording environment: quiet indoor environment, without echo

    Recording content: dozens of topics are specified, and the speakers converse on those topics while the recording is performed

    Demographics: 1,950 people; 66% of speakers are in the 16-25 age group; 962 speakers spoke in groups of two, 312 in groups of three, 396 in groups of four, and the other 280 in groups of five

    Annotation: transcription text, speaker identification, and gender

    Device: mobile phone and voice recorder

    Language: Mandarin

    Application scenarios: speech recognition; voiceprint recognition

    Accuracy rate: 97%

  18. FOSD Female Speech Dataset

    • data.mendeley.com
    Updated Jul 31, 2020
    + more versions
    Cite
    Duc Chung Tran (2020). FOSD Female Speech Dataset [Dataset]. http://doi.org/10.17632/4gzzc9k49n.5
    Explore at:
    Dataset updated
    Jul 31, 2020
    Authors
    Duc Chung Tran
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is an FOSD-based female speech dataset (extracted from approximately 30 hours of FPT Open Speech Data, released publicly in 2018 by FPT Corporation under the FPT Public License), useful for creating text-to-speech models. It comprises 7,637 audio files totalling more than 9.5 hours of recordings. All files are in .wav format (16 kHz sampling rate, 32-bit float, mono), suitable for various TTS-related applications.

    Copyright 2018 FPT Corporation Permission is hereby granted, free of charge, non-exclusive, worldwide, irrevocable, to any person obtaining a copy of this data or software and associated documentation files (the “Data or Software”), to deal in the Data or Software without restriction, including without limitation the rights to use, copy, modify, remix, transform, merge, build upon, publish, distribute and redistribute, sublicense, and/or sell copies of the Data or Software, for any purpose, even commercially, and to permit persons to whom the Data or Software is furnished to do so, subject to the following conditions: The above copyright notice, and this permission notice, and indication of any modification to the Data or Software, shall be included in all copies or substantial portions of the Data or Software.

    THE DATA OR SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE DATA OR SOFTWARE OR THE USE OR OTHER DEALINGS IN THE DATA OR SOFTWARE. Patent and trademark rights are not licensed under this FPT Public License.

  19. Tamil General Domain Scripted Monologue Speech Data

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Tamil General Domain Scripted Monologue Speech Data [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/general-scripted-speech-monologues-tamil-india
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The Tamil Scripted Monologue Speech Dataset for the General Domain is a carefully curated resource designed to support the development of Tamil language speech recognition systems. This dataset focuses on general-purpose conversational topics and is ideal for a wide range of AI applications requiring natural, domain-agnostic Tamil speech data.

    Speech Data

    This dataset features over 6,000 high-quality scripted monologue recordings in Tamil. The prompts span diverse real-life topics commonly encountered in general conversations and are intended to help train robust and accurate speech-enabled technologies.

    Participant Diversity
    Speakers: 60 native Tamil speakers
    Regions: Broad regional coverage ensures diverse accents and dialects
    Demographics: Participants aged 18 to 70, with a 60:40 male-to-female ratio
    Recording Specifications
    Recording Type: Scripted monologues and prompt-based recordings
    Audio Duration: 5 to 30 seconds per file
    Format: WAV, mono channel, 16-bit, 8 kHz & 16 kHz sample rates
    Environment: Clean, noise-free conditions to ensure clarity and usability

    Topic Coverage

    The dataset covers a wide variety of general conversation scenarios, including:

    Daily Conversations
    Topic-Specific Discussions
    General Knowledge and Advice
    Idioms and Sayings

    Contextual Features

    To enhance authenticity, the prompts include:

    Names: Male and female names specific to different Tamil Nadu regions
    Addresses: Commonly used address formats in daily Tamil speech
    Dates & Times: References used in general scheduling and time expressions
    Organization Names: Names of businesses, institutions, and other entities
    Numbers & Currencies: Mentions of quantities, prices, and monetary values

    Each prompt is designed to reflect everyday use cases, making it suitable for developing generalized NLP and ASR solutions.

    Transcription

    Every audio file in the dataset is accompanied by a verbatim text transcription, ensuring accurate training and evaluation of speech models; a minimal file-pairing sketch follows the highlights below.

    Content: Exact match to the spoken audio
    Format: Plain text (.TXT), named identically to the corresponding audio file
    Quality Control: All transcripts are validated by native Tamil transcribers
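
    A minimal pairing sketch, assuming the audio and .TXT transcripts sit side by side in a single placeholder directory:

    from pathlib import Path

    pairs = []
    for wav in sorted(Path("tamil_monologues").glob("*.wav")):    # placeholder directory
        txt = wav.with_suffix(".txt")                             # transcript named identically to the audio
        if txt.exists():
            pairs.append((wav.name, txt.read_text(encoding="utf-8").strip()))
    print(f"{len(pairs)} aligned audio-transcript pairs")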

    Metadata

    Rich metadata is included for detailed filtering and analysis:

    Speaker Metadata: Unique speaker ID, age, gender, region, and dialect
    Audio Metadata: Prompt transcript, recording setup, device specs, sample rate, bit depth, and format

    Applications & Use Cases

    This dataset can power a variety of Tamil language AI technologies, including:

    Speech Recognition Training: ASR model development and fine-tuning

  20. Cantonese Conversational Speech Data by Mobile Phone and Voice Recorder -...

    • catalog.elra.info
    Updated Oct 6, 2022
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2022). Cantonese Conversational Speech Data by Mobile Phone and Voice Recorder - 607 Hours [Dataset]. https://catalog.elra.info/en-us/repository/browse/ELRA-S0427/
    Explore at:
    Dataset updated
    Oct 6, 2022
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    Description

    995 local Cantonese speakers participated in the recording and conducted face-to-face communication in a natural way. They had free discussion on a number of given topics covering a wide range of fields; the voice is natural and fluent, in line with a real dialogue scene. Text is transcribed manually, with high accuracy.

    Format: mobile phone: 16kHz, 16-bit, mono channel, .wav; voice recorder: 44.1kHz, 16-bit, dual channel, .wav

    Recording environment: quiet indoor environment, without echo

    Recording content: dozens of topics are specified, and the speakers converse on those topics while the recording is performed

    Demographics: 995 Cantonese speakers; 45% of speakers are in the 26-45 age group; 504 speakers spoke in groups of two, 195 in groups of three, 196 in groups of four, and the other 100 in groups of five

    Annotation: transcription text, speaker identification, and gender

    Device: mobile phone and voice recorder

    Language: Cantonese

    Application scenarios: speech recognition; voiceprint recognition

    Accuracy rate: 95%
