6 datasets found
  1. English Pronunciation Error Detection Dataset

    • kaggle.com
    Updated Jan 16, 2025
    Cite
    Ziya (2025). English Pronunciation Error Detection Dataset [Dataset]. https://www.kaggle.com/datasets/ziya07/english-pronunciation-error-detection-dataset
    Dataset updated
    Jan 16, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ziya
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is designed for research in English pronunciation error detection using artificial intelligence. It consists of 200 samples of English speech data collected from university-level students with varying proficiency levels (Beginner, Intermediate, and Advanced). Each sample includes features such as MFCCs (Mel-frequency cepstral coefficients), pitch, and duration, extracted from speech recordings. The dataset also includes labels for common pronunciation errors such as misarticulations, vowel/consonant discrepancies, stress issues, and intonation errors. The data is annotated with word-level transcriptions, start and end times of speech segments, and proficiency levels, making it suitable for training AI models to detect and correct pronunciation errors.

    This dataset supports AI-driven frameworks aimed at providing real-time feedback for English language learners, particularly in pronunciation improvement.

  2. phonetic

    • huggingface.co
    Updated Aug 6, 2025
    Cite
    Neurlang Project (2025). phonetic [Dataset]. https://huggingface.co/datasets/neurlang/phonetic
    Dataset updated
    Aug 6, 2025
    Authors
    Neurlang Project
    License

    https://choosealicense.com/licenses/cc/

    Description

    Phonetic IPA Dataset for 85+ languages

    Download the dataset: https://github.com/neurlang/dataset
    Use our model trained on the dataset: https://www.hashtron.cloud

    Licensing information:

    • MIT: Original data for the 15 languages taken from gruut databases
    • MIT: To this, the data for the 31 languages were added from ipa dict files
    • CC0: Public Domain: Chinese/Mandarin-IPA language sentence pairs were generated from the Chinese sentences taken from dataset from kaggle based on the…

    See the full description on the dataset page: https://huggingface.co/datasets/neurlang/phonetic.

  3. 444,202 Korean Pronunciation Dictionary

    • m.nexdata.ai
    • nexdata.ai
    Updated May 6, 2025
    Cite
    Nexdata (2025). 444,202 Korean Pronunciation Dictionary [Dataset]. https://m.nexdata.ai/datasets/pronunciation/1221?source=Kaggle
    Dataset updated
    May 6, 2025
    Dataset authored and provided by
    Nexdata
    Variables measured
    Format, Content, Language, Application scenarios
    Description

    The data contains 444,202 entries. All words and pronunciations are produced by Korean linguists. It can be used in the research and development of Korean ASR technology.

  4. word-in-french

    • huggingface.co
    Updated Oct 12, 2024
    Cite
    FrancophonIA (2024). word-in-french [Dataset]. https://huggingface.co/datasets/FrancophonIA/word-in-french
    Dataset updated
    Oct 12, 2024
    Dataset provided by
    Francophonia
    Authors
    FrancophonIA
    Area covered
    French
    Description

    Note: Dataset origin: https://www.kaggle.com/datasets/stephrouen/word-in-french

      Context
    

    Lexique v3.81 on www.lexique.org/

      Content
    

    Words of the French language with pronunciation, grouping and statistics.

      Acknowledgements
    

    We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research.

  5. DARPA TIMIT Acoustic-Phonetic Continuous Speech

    • kaggle.com
    zip
    Updated Jun 5, 2019
    + more versions
    Cite
    Michael Fekadu (2019). DARPA TIMIT Acoustic-Phonetic Continuous Speech [Dataset]. https://www.kaggle.com/mfekadu/darpa-timit-acousticphonetic-continuous-speech
    Available download formats: zip (869007403 bytes)
    Dataset updated
    Jun 5, 2019
    Authors
    Michael Fekadu
    Description

    The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus

    The TIMIT corpus of read speech has been designed to provide speech data for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of automatic speech recognition systems. TIMIT has resulted from the joint efforts of several sites under sponsorship from the Defense Advanced Research Projects Agency - Information Science and Technology Office (DARPA-ISTO). Text corpus design was a joint effort among the Massachusetts Institute of Technology (MIT), Stanford Research Institute (SRI), and Texas Instruments (TI). The speech was recorded at TI, transcribed at MIT, and has been maintained, verified, and prepared for CD-ROM production by the National Institute of Standards and Technology (NIST). This file contains a brief description of the TIMIT Speech Corpus. Additional information including the referenced material and some relevant reprints of articles may be found in the printed documentation which is also available from NTIS (NTIS# PB91-100354).

    Corpus Speaker Distribution

    TIMIT contains a total of 6300 sentences, 10 sentences spoken by each of 630 speakers from 8 major dialect regions of the United States. Table 1 shows the number of speakers for the 8 dialect regions, broken down by sex. The percentages are given in parentheses. A speaker's dialect region is the geographical area of the U.S. where they lived during their childhood years. The geographical areas correspond with recognized dialect regions in U.S. (Language Files, Ohio State University Linguistics Dept., 1982), with the exception of the Western region (dr7) in which dialect boundaries are not known with any confidence and dialect region 8 where the speakers moved around a lot during their childhood.

    Table 1: Dialect distribution of speakers

       Dialect
       Region (dr)  #Male      #Female    Total
       -----------  ---------  ---------  ----------
            1        31 (63%)   18 (37%)   49   (8%)
            2        71 (70%)   31 (30%)  102  (16%)
            3        79 (77%)   23 (23%)  102  (16%)
            4        69 (69%)   31 (31%)  100  (16%)
            5        62 (63%)   36 (37%)   98  (16%)
            6        30 (65%)   16 (35%)   46   (7%)
            7        74 (74%)   26 (26%)  100  (16%)
            8        22 (67%)   11 (33%)   33   (5%)
       -----------  ---------  ---------  ----------
            8       438 (70%)  192 (30%)  630 (100%)
    
    The dialect regions are:
       dr1: New England
       dr2: Northern
       dr3: North Midland
       dr4: South Midland
       dr5: Southern
       dr6: New York City
       dr7: Western
       dr8: Army Brat (moved around)
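
The percentages in Table 1 follow directly from the speaker counts, so they can be recomputed as a sanity check. The sketch below is only an illustration: the counts are copied from the table, while the names `COUNTS`, `pct`, and `summarize` are hypothetical, and rounding to the nearest integer is an assumption about how the table was produced.

```python
# Speaker counts per dialect region, copied from Table 1 as (male, female).
COUNTS = {
    1: (31, 18), 2: (71, 31), 3: (79, 23), 4: (69, 31),
    5: (62, 36), 6: (30, 16), 7: (74, 26), 8: (22, 11),
}


def pct(part, whole):
    """Percentage of `whole` made up by `part`, rounded to the nearest integer."""
    return round(100 * part / whole)


def summarize(counts):
    """Return per-region percentages and the overall (male, female, all) totals."""
    grand_total = sum(m + f for m, f in counts.values())
    rows = {}
    for region, (male, female) in counts.items():
        total = male + female
        # (male share within region, female share within region, region share of all speakers)
        rows[region] = (pct(male, total), pct(female, total), pct(total, grand_total))
    all_male = sum(m for m, _ in counts.values())
    all_female = sum(f for _, f in counts.values())
    return rows, (all_male, all_female, grand_total)


rows, totals = summarize(COUNTS)
for region, (m_pct, f_pct, share) in rows.items():
    print(f"dr{region}: male {m_pct}%, female {f_pct}%, share of speakers {share}%")
print("totals (male, female, all):", totals)
```

Within each region the male and female shares are computed against that region's total, while the last column is computed against all 630 speakers.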
    

    Corpus Text Material

    The text material in the TIMIT prompts (found in the file "prompts.doc") consists of 2 dialect "shibboleth" sentences designed at SRI, 450 phonetically-compact sentences designed at MIT, and 1890 phonetically-diverse sentences selected at TI. The dialect sentences (the SA sentences) were meant to expose the dialectal variants of the speakers and were read by all 630 speakers. The phonetically-compact sentences were designed to provide a good coverage of pairs of phones, with extra occurrences of phonetic contexts thought to be either difficult or of particular interest. Each speaker read 5 of these sentences (the SX sentences) and each text was spoken by 7 different speakers. The phonetically-diverse sentences (the SI sentences) were selected from existing text sources - the Brown Corpus (Kuchera and Francis, 1967) and the Playwrights Dialog (Hultzen, et al., 1964) - so as to add diversity in sentence types and phonetic contexts. The selection criteria maximized the variety of allophonic contexts found in the texts. Each speaker read 3 of these sentences, with each sentence being read only by a single speaker. Table 2 summarizes the speech material in TIMIT.

    Table 2: TIMIT speech material

     Sentence Type  #Sentences  #Speakers  Total  #Sentences/Speaker
     -------------  ----------  ---------  -----  ------------------
     Dialect (SA)            2        630   1260                   2
     Compact (SX)          450          7   3150                   5
     Diverse (SI)         1890          1   1890                   3
     -------------  ----------  ---------  -----  ------------------
     Total                2342               6300                  10
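
The totals in Table 2 can be derived two ways: distinct texts times speakers per text, or speakers times sentences per speaker. The following sketch checks that both agree; the numbers are taken from the table, and the names `MATERIAL` and `utterance_totals` are hypothetical and used only for illustration.

```python
# Rows of Table 2: (distinct sentence texts, speakers per text, sentences per speaker).
MATERIAL = {
    "SA": (2, 630, 2),
    "SX": (450, 7, 5),
    "SI": (1890, 1, 3),
}
NUM_SPEAKERS = 630


def utterance_totals(material):
    """Recorded utterances per sentence type: distinct texts x speakers per text."""
    return {name: texts * speakers for name, (texts, speakers, _) in material.items()}


totals = utterance_totals(MATERIAL)
distinct_texts = sum(texts for texts, _, _ in MATERIAL.values())
per_speaker = sum(per for _, _, per in MATERIAL.values())

for name, total in totals.items():
    # Cross-check: the same total must follow from speakers x sentences per speaker.
    assert total == NUM_SPEAKERS * MATERIAL[name][2]
    print(name, total)

print("distinct texts:", distinct_texts)
print("utterances:", sum(totals.values()))
print("sentences per speaker:", per_speaker)
```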
    

    Suggested Training/Test Subdivision

    The speech material has been subdivided into portions for training and testing. The criteria for the subdivision are described in the file "testset.doc". THIS SUBDIVISION HAS NO RELATION TO THE DATA DISTRIBUTED ON THE PROTOTYPE VERSION OF THE CDROM.

    Core Test Set:

    The test data has a core portion containing 24 speakers, 2 male and 1 female from each dialect region. The core test speakers are shown in Table 3. Each speaker read a different set of SX sentences. Thus the core test material contains 192 sentences, 5 SX and 3 SI for each speaker, each having a distinct text prompt.

    Table 3: The core test set of 24 speakers

       Dialect  Male        Female
       -------  ----------  ------
          1     DAB0, WBT0  ELC0
          2     TAS1, WEW0  PAS0
          3     JMP0, LNT0  PKT0
          4     LLL0, TLS0  JLM0
          5     BPM0, KLT0  NLP0
          6     CMJ0, JDH0  MGD0
          7     GRT0, NJM0  DHC0
          8     JLN0, PAM0  MLD0
    

    Complete Test Set:

    A more extensive test set was obtained by including the sentences from all speakers that read any of the SX texts included in the core test set. In doing so, no sentence text appears in both the training and test sets. This complete test set contains a total of 168 speakers and 1344 utterances, accounting for about 27% of the total speech material. The resulting dialect distribution of the 168 speaker test set is given in Table 4. The complete test material contains 624 distinct texts.

    Table 4: Dialect distribution for complete test set

     Dialect  #Male  #Female  Total
     -------  -----  -------  -----
        1         7        4     11
        2        18        8     26
        3        23        3     26
        4        16       16     32
        5        17       11     28
        6         8        3     11
        7        15        8     23
        8         8        3     11
     -------  -----  -------  -----
     Total      112       56    168
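
The test-set sizes quoted above can be reproduced with a little arithmetic. The sketch below assumes that the two SA sentences are left out of the test counts, which the text implies by listing only 5 SX and 3 SI sentences per core test speaker; under that assumption the "about 27%" figure is measured against the non-SA material. The name `utterance_count` is hypothetical.

```python
NUM_SPEAKERS = 630
SX_PER_SPEAKER = 5
SI_PER_SPEAKER = 3
SA_PER_SPEAKER = 2

# Assumption: SA sentences are excluded from test-set counts.
NON_SA_PER_SPEAKER = SX_PER_SPEAKER + SI_PER_SPEAKER


def utterance_count(num_speakers, per_speaker=NON_SA_PER_SPEAKER):
    """Number of utterances contributed by `num_speakers` speakers."""
    return num_speakers * per_speaker


core = utterance_count(24)                    # core test set
complete = utterance_count(168)               # complete test set
non_sa_total = utterance_count(NUM_SPEAKERS)  # all SX + SI utterances
all_total = utterance_count(NUM_SPEAKERS, NON_SA_PER_SPEAKER + SA_PER_SPEAKER)

print("core test utterances:", core)
print("complete test utterances:", complete)
print(f"share of non-SA material: {complete / non_sa_total:.1%}")
print(f"share of all material incl. SA: {complete / all_total:.1%}")
```

Only the first ratio rounds to 27%, which supports reading the quoted percentage as a share of the SX and SI utterances.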
    
    CDROM TIMIT Directory and File Structure
    
    The speech and associated data is organized on the CD-ROM according to the following hierarchy:
    
    /
    
  6. Assamese Text-to-Speech Dataset

    • kaggle.com
    Updated Jun 4, 2023
    Cite
    Faizal Karim (2023). Assamese Text-to-Speech Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/5843808
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Faizal Karim
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The Assamese Text-to-Speech (TTS) dataset is a valuable resource for researchers and developers interested in the field of speech synthesis for the Assamese language. Assamese is an Indo-Aryan language spoken primarily in the northeastern state of Assam in India. With a rich cultural heritage and a significant number of speakers, Assamese plays a vital role in regional communication and literature.


    This dataset is specifically curated to support the development and training of text-to-speech systems for the Assamese language. It comprises a total of 1877 text samples in Assamese along with their corresponding audio recordings. The audio files are short and on average are about 3-4 seconds long.

    Applications

    1. Accessibility: The Assamese TTS dataset opens up opportunities for the development of assistive technologies, enabling visually impaired individuals to access written content in Assamese through synthesized speech.

    2. Language Learning: The dataset can be utilized to create interactive language learning applications or tools, aiding learners in improving their pronunciation and fluency in Assamese.

    3. Content Generation: TTS systems trained on the dataset can be employed in content creation, such as audiobook production, podcasting, or voice-over services, to generate high-quality spoken content in Assamese.

    As the dataset is small, it is recommended to utilize pretrained models as a starting point and fine-tune them using the provided data to achieve better performance and accuracy in Assamese TTS applications.



English Pronunciation Error Detection Dataset

Speech features and error labels for AI-based pronunciation correction

23 scholarly articles cite this dataset (View in Google Scholar)