14 datasets found
  1. CMU Pronunciation Dictionary Unmodified (0.7b)

    • kaggle.com
    Updated Jun 8, 2018
    Cite
    Ryan Epp (2018). CMU Pronunciation Dictionary Unmodified (0.7b) [Dataset]. https://www.kaggle.com/datasets/reppic/cmu-pronunciation-dictionary-unmodified-07b/discussion
    Dataset updated
    Jun 8, 2018
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ryan Epp
    Description

    Dataset

    This dataset was created by Ryan Epp

    Released under Other (specified in description)

    Contents

  2. English words with stress position analyzed

    • kaggle.com
    Updated Jul 5, 2024
    Cite
    Victor_42 (2024). English words with stress position analyzed [Dataset]. https://www.kaggle.com/datasets/victorcheng42/english-words-with-stress-position-analyzed
    Dataset updated
    Jul 5, 2024
    Dataset provided by
    Kaggle
    Authors
    Victor_42
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Context

    This dataset is a by-product of a notebook that investigates the rules governing stress position in English words.

    It is based on another dataset of 300k+ English words.

    I looked up phonetic transcriptions through a free dictionary API and obtained about 30k transcriptions. From these I extracted syllable counts, stress positions, and stressed syllables to build this dataset.

    File description

    words_stress_analyzed.csv is the final dataset. Other files are just intermediate steps in the process.

    Column description

    Column           Datatype         Example                Description
    ---------------  ---------------  ---------------------  -----------
    word             str              complimentary          the English word
    phonetic         str              /ˌkɒmplɪ̈ˈment(ə)ɹɪ/    the phonetic transcription of the word
    part_of_speech   str (list-like)  ['adjective']          how the word is used in sentences
    syllable_len     int              5                      the number of syllables in the word
    stress_pos       int              3                      the syllable on which the stress falls; if there is more than one stress, the position of the first stress
    stress_syllable  str              e                      the vowel of the stressed syllable

    Note: Some short words have no stress symbol in their transcription, which leaves blanks in this dataset. It is recommended to filter out rows with an empty stress_syllable and rows where syllable_len is 1.
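
The recommended filtering can be sketched in Python with the standard library alone. This is a minimal illustration, not the dataset author's code: the column names come from the column description, the first sample row is the documented example, and the other two rows are made up for the demonstration.

```python
import csv
import io

# Sample rows using the documented column names. Only the first data row is
# the example from the column description; the other two are hypothetical.
SAMPLE = """word,phonetic,part_of_speech,syllable_len,stress_pos,stress_syllable
complimentary,/ˌkɒmplɪ̈ˈment(ə)ɹɪ/,['adjective'],5,3,e
cat,/kæt/,['noun'],1,,
about,/əˈbaʊt/,['adverb'],2,2,aʊ
"""


def filter_rows(rows):
    """Keep rows with a non-empty stress_syllable and more than one syllable."""
    kept = []
    for row in rows:
        if not row["stress_syllable"].strip():
            continue  # no stress information recorded
        syllable_len = row["syllable_len"].strip()
        if not syllable_len.isdigit() or int(syllable_len) <= 1:
            continue  # monosyllabic (or malformed) entry
        kept.append(row)
    return kept


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
filtered = filter_rows(rows)
kept_words = [row["word"] for row in filtered]
print(kept_words)
```

To apply this to the real file, replace the in-memory sample with an open handle on words_stress_analyzed.csv (opened with encoding="utf-8").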

    Images

    • created with Midjourney
  3. 500,113 English Pronunciation Dictionary

    • nexdata.ai
    • m.nexdata.ai
    Updated Apr 26, 2025
    Cite
    Nexdata (2025). 500,113 English Pronunciation Dictionary [Dataset]. https://www.nexdata.ai/datasets/pronunciation/1095?source=Kaggle
    Dataset updated
    Apr 26, 2025
    Dataset provided by
    Nexdata
    nexdata technology inc
    Authors
    Nexdata
    Variables measured
    Format, Language, Data content, Application scenario
    Description

    The data contains 500,113 entries. All words and pronunciations are produced by English linguists. It can be used in the research and development of English ASR technology.

  4. Predicting English Pronunciations - Model Weights

    • kaggle.com
    Updated Jun 12, 2018
    Cite
    Ryan Epp (2018). Predicting English Pronunciations - Model Weights [Dataset]. https://www.kaggle.com/datasets/reppic/predicting-english-pronunciations-model-weights/code
    Dataset updated
    Jun 12, 2018
    Dataset provided by
    Kaggle
    Authors
    Ryan Epp
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Ryan Epp

    Released under CC0: Public Domain

    Contents

  5. phonetic

    • huggingface.co
    Updated Jul 15, 2025
    Cite
    Neurlang Project (2025). phonetic [Dataset]. https://huggingface.co/datasets/neurlang/phonetic
    Dataset updated
    Jul 15, 2025
    Authors
    Neurlang Project
    License

    https://choosealicense.com/licenses/cc/

    Description

    Phonetic IPA Dataset for 85+ languages

    Download the dataset: https://github.com/neurlang/dataset
    Use our model trained on the dataset: https://www.hashtron.cloud

    Licensing information:

    • MIT: Original data for the 15 languages, taken from gruut databases
    • MIT: To this, the data for the 31 languages were added from ipa dict files
    • CC0: Public Domain: Chinese/Mandarin-IPA language sentence pairs were generated from the Chinese sentences taken from dataset from kaggle based on the…

    See the full description on the dataset page: https://huggingface.co/datasets/neurlang/phonetic.

  6. German IPA Pronunciation Dictionary

    • kaggle.com
    Updated Dec 19, 2019
    Cite
    Christoph Minixhofer (2019). German IPA Pronunciation Dictionary [Dataset]. https://www.kaggle.com/cdminix/german-ipa-pronunciation-dictionary/code
    Dataset updated
    Dec 19, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Christoph Minixhofer
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Christoph Minixhofer

    Released under CC0: Public Domain

    Contents

  7. 109,818 Word Dictionary in .text Format

    • kaggle.com
    zip
    Updated May 14, 2021
    Cite
    christernyc (2021). 109,818 Word Dictionary in .text Format [Dataset]. http://doi.org/10.34740/kaggle/dsv/2231296
    Available download formats: zip (10016010 bytes)
    Dataset updated
    May 14, 2021
    Authors
    christernyc
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dictionary for your App

    The file has 5 pipe-separated ('|') columns:

    Word | Pronunciation | Second pronunciations (if any ) | Part of Speech | Definition.
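
A minimal Python sketch of reading one such line, assuming the five-column layout described above. The `Entry` type, the `parse_line` helper, and the sample line are hypothetical and only illustrate the format; the real file may contain edge cases this does not cover.

```python
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    word: str
    pronunciation: str
    second_pronunciation: Optional[str]  # None when the column is empty
    part_of_speech: str
    definition: str


def parse_line(line: str) -> Entry:
    """Split one pipe-separated dictionary line into its five columns."""
    # maxsplit=4 leaves any further '|' characters inside the definition
    # untouched (an assumption about how the data should be read).
    parts = [part.strip() for part in line.rstrip("\n").split("|", 4)]
    if len(parts) != 5:
        raise ValueError(f"expected 5 columns, got {len(parts)}: {line!r}")
    word, pronunciation, second, part_of_speech, definition = parts
    return Entry(word, pronunciation, second or None, part_of_speech, definition)


# Hypothetical sample line in the described layout, not copied from the file.
sample = "Aback | a-BACK |  | adv. | Toward the back or rear; backward."
entry = parse_line(sample)
print(entry)
```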

    I'm assuming this is the Webster's Unabridged Dictionary you can find here:

    gutenberg.org

    The file I used to make this was located here:

    https://www.kaggle.com/prabhaskumarpsk/dictionary

  8. word-in-french

    • huggingface.co
    Updated Oct 12, 2024
    Cite
    FrancophonIA (2024). word-in-french [Dataset]. https://huggingface.co/datasets/FrancophonIA/word-in-french
    Dataset updated
    Oct 12, 2024
    Dataset provided by
    Francophonia
    Authors
    FrancophonIA
    Area covered
    French
    Description

    Dataset origin: https://www.kaggle.com/datasets/stephrouen/word-in-french

      Context
    

    Lexique v3.81 on www.lexique.org/

      Content
    

    Words of the French language with pronunciation, grouping and statistics.

      Acknowledgements
    

    We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research.

  9. urdu-to-roman-urdu-stop-words

    • kaggle.com
    Updated Jan 30, 2023
    Cite
    Abdullah (2023). urdu-to-roman-urdu-stop-words [Dataset]. https://www.kaggle.com/datasets/kane6543/urdutoromanurdustopwords
    Dataset updated
    Jan 30, 2023
    Dataset provided by
    Kaggle
    Authors
    Abdullah
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains the most common Urdu stop words. These stop words are translated into Roman Urdu using Urdu phonetic alphabets. Each Roman Urdu version given is the most commonly used form of that word.

  10. Sichuanese Phonetic Transcription

    • kaggle.com
    Updated Oct 23, 2024
    Cite
    longmaodata (2024). Sichuanese Phonetic Transcription [Dataset]. https://www.kaggle.com/datasets/longmaodata/sichuanese-phonetic-transcription/versions/1
    Dataset updated
    Oct 23, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    longmaodata
    Description

    🔔Due to the platform's upload size restrictions and the extensive nature of our numerous public datasets, we can only provide samples of the datasets here. If you need the full public dataset, please join our official group to access it;

    🔔It is entirely free!

    🔔This helps promote open-source development!

    Complete data size

    55.4GB

    Join the group

    🚀🚀🚀🚀https://t.me/+Y5kL2iHis9A0ZWI1

    ✅ Obtain a complete dataset

    ✅ Mutual communication within the industry

    ✅ Get more information and consultation

    ✅ Timely dataset update notifications

    Dataset Introduction

    TTS average voice library

    Version

    v1.0

    Release Date

    2024-10-15

    Data Description

    Material Type: Voice Collection and Annotation

    Data Description (including: language category, etc.): Sichuan dialect

    Total Data Volume: 800 hours

    Collection Equipment Requirements (model, etc.): Telephone recording

    Collection Environment Requirements: Indoor

    Pronunciation Style (Neutral news, emotional, reading/Free topic/Natural speech): Natural speech

    Number of Collectors: 346 people

    Recording Content: Audio scenes include customer service dialogue scenarios, financial industry, simulated production scenario data

    Duration of Each Recording: 5-10 minutes

    Annotation Content: Text, timestamp, gender, background noise, English, amplitude reduction

    Annotation Format: Customer service and customer are annotated separately

    Recording Format (whether to follow reading, whether to read aloud, whether dialogue, etc.; Follow reading is only for children): Dialogue

    Voice Ability Requirements (requirements for accent; requirements for regional distribution; if it is a small language, its language ability requirements; whether professional broadcasters are needed): Normal Sichuan dialect speech

    Recording Environment (Recording studio/Signal-to-Noise Ratio SNR[Office/Shopping mall/Outdoor/Desk/Mobile phone/Car (model, speed, in-car environment)]): Office

    Audio Format: 8k, 16bit wav

    Channels: Dual channel

    Silent Reservation: No reservation

    Voice Annotation: Divided according to the speaker's pause, and the corresponding text is marked

    Acceptance Rate Requirement: 99%

    
    Directory Structure

    root_directory/
    ├── audio/
    │   ├── audio1.wav
    ├── annotations/
    │   ├── text1.txt
    ├── labels/
    │   ├── text1.txt

  11. TIMIT Dataset

    • paperswithcode.com
    Updated Feb 2, 2021
    + more versions
    Cite
    (2021). TIMIT Dataset [Dataset]. https://paperswithcode.com/dataset/timit
    Dataset updated
    Feb 2, 2021
    Description

    The TIMIT Acoustic-Phonetic Continuous Speech Corpus is a standard dataset used for evaluation of automatic speech recognition systems. It consists of recordings of 630 speakers of 8 dialects of American English each reading 10 phonetically-rich sentences. It also comes with the word and phone-level transcriptions of the speech.

  12. VR English Phoneme Learning Dataset

    • kaggle.com
    Updated May 8, 2025
    Cite
    Python Developer (2025). VR English Phoneme Learning Dataset [Dataset]. https://www.kaggle.com/datasets/programmer3/vr-english-phoneme-learning-dataset
    Dataset updated
    May 8, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Python Developer
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains multimodal neurocognitive and behavioral data collected from 450 participants engaged in immersive virtual reality (VR) environments designed to improve English phonemic competence. Each participant interacted with authentic conversational VR tasks targeting 44 challenging English phonemes.

    The dataset includes:

    • EEG-derived PSD features (Alpha, Beta, Theta) for cognitive state monitoring
    • Speech-derived MFCC features for phoneme articulation assessment
    • Eye-tracking metrics including fixation duration for visual engagement
    • GSR and ECG data for physiological arousal and heart rate analysis
    • Audio file references corresponding to individual English phoneme pronunciations

    Two target variables:

    • Correct_Pronunciation (0/1) – binary classification label
    • Neurocognitive_Load – continuous score

  13. Sindhi Stop-words

    • kaggle.com
    Updated Oct 12, 2022
    Cite
    JayKay243 (2022). Sindhi Stop-words [Dataset]. https://www.kaggle.com/datasets/jaykay243/sindhi-stopwords
    Dataset updated
    Oct 12, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    JayKay243
    Description

    A list of stop-words in the Sindhi Language sorted by dictionary order.

    Source

    They have been extracted and processed from the following research

  14. DARPA TIMIT Acoustic-Phonetic Continuous Speech

    • kaggle.com
    zip
    Updated Jun 5, 2019
    + more versions
    Cite
    Michael Fekadu (2019). DARPA TIMIT Acoustic-Phonetic Continuous Speech [Dataset]. https://www.kaggle.com/mfekadu/darpa-timit-acousticphonetic-continuous-speech
    Available download formats: zip (869007403 bytes)
    Dataset updated
    Jun 5, 2019
    Authors
    Michael Fekadu
    Description

    The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus

    The TIMIT corpus of read speech has been designed to provide speech data for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of automatic speech recognition systems. TIMIT has resulted from the joint efforts of several sites under sponsorship from the Defense Advanced Research Projects Agency - Information Science and Technology Office (DARPA-ISTO). Text corpus design was a joint effort among the Massachusetts Institute of Technology (MIT), Stanford Research Institute (SRI), and Texas Instruments (TI). The speech was recorded at TI, transcribed at MIT, and has been maintained, verified, and prepared for CD-ROM production by the National Institute of Standards and Technology (NIST). This file contains a brief description of the TIMIT Speech Corpus. Additional information including the referenced material and some relevant reprints of articles may be found in the printed documentation which is also available from NTIS (NTIS# PB91-100354).

    Corpus Speaker Distribution

    TIMIT contains a total of 6300 sentences, 10 sentences spoken by each of 630 speakers from 8 major dialect regions of the United States. Table 1 shows the number of speakers for the 8 dialect regions, broken down by sex. The percentages are given in parentheses. A speaker's dialect region is the geographical area of the U.S. where they lived during their childhood years. The geographical areas correspond with recognized dialect regions in U.S. (Language Files, Ohio State University Linguistics Dept., 1982), with the exception of the Western region (dr7) in which dialect boundaries are not known with any confidence and dialect region 8 where the speakers moved around a lot during their childhood.

      Table 1: Dialect distribution of speakers

       Dialect
       Region(dr)    #Male      #Female      Total
       ----------  ---------   ---------   ----------
           1        31 (63%)    18 (37%)    49 (8%)
           2        71 (70%)    31 (30%)   102 (16%)
           3        79 (77%)    23 (23%)   102 (16%)
           4        69 (69%)    31 (31%)   100 (16%)
           5        62 (63%)    36 (37%)    98 (16%)
           6        30 (65%)    16 (35%)    46 (7%)
           7        74 (74%)    26 (26%)   100 (16%)
           8        22 (67%)    11 (33%)    33 (5%)
       ----------  ---------   ---------   ----------
           8       438 (70%)   192 (30%)   630 (100%)
    
    The dialect regions are:
       dr1: New England
       dr2: Northern
       dr3: North Midland
       dr4: South Midland
       dr5: Southern
       dr6: New York City
       dr7: Western
       dr8: Army Brat (moved around)
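
The per-region percentages in Table 1 can be recomputed from the raw speaker counts. The following Python sketch copies the male and female counts from the table; the names `SPEAKERS` and `pct` are illustrative and not part of the TIMIT distribution.

```python
# (male, female) speaker counts per dialect region, copied from Table 1.
SPEAKERS = {
    1: (31, 18),
    2: (71, 31),
    3: (79, 23),
    4: (69, 31),
    5: (62, 36),
    6: (30, 16),
    7: (74, 26),
    8: (22, 11),
}


def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to an integer."""
    return round(100 * part / whole)


total_male = sum(male for male, _ in SPEAKERS.values())
total_female = sum(female for _, female in SPEAKERS.values())
total = total_male + total_female

for region, (male, female) in SPEAKERS.items():
    row_total = male + female
    print(
        f"dr{region}: male {pct(male, row_total)}%, "
        f"female {pct(female, row_total)}%, "
        f"share of all speakers {pct(row_total, total)}%"
    )
print(f"all: {total_male} male, {total_female} female, {total} speakers")
```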
    

    Corpus Text Material

    The text material in the TIMIT prompts (found in the file "prompts.doc") consists of 2 dialect "shibboleth" sentences designed at SRI, 450 phonetically-compact sentences designed at MIT, and 1890 phonetically-diverse sentences selected at TI. The dialect sentences (the SA sentences) were meant to expose the dialectal variants of the speakers and were read by all 630 speakers. The phonetically-compact sentences were designed to provide a good coverage of pairs of phones, with extra occurrences of phonetic contexts thought to be either difficult or of particular interest. Each speaker read 5 of these sentences (the SX sentences) and each text was spoken by 7 different speakers. The phonetically-diverse sentences (the SI sentences) were selected from existing text sources - the Brown Corpus (Kuchera and Francis, 1967) and the Playwrights Dialog (Hultzen, et al., 1964) - so as to add diversity in sentence types and phonetic contexts. The selection criteria maximized the variety of allophonic contexts found in the texts. Each speaker read 3 of these sentences, with each sentence being read only by a single speaker. Table 2 summarizes the speech material in TIMIT.

    Table 2: TIMIT speech material

     Sentence Type  #Sentences  #Speakers  Total  #Sentences/Speaker
     -------------  ----------  ---------  -----  ------------------
     Dialect (SA)            2        630   1260                   2
     Compact (SX)          450          7   3150                   5
     Diverse (SI)         1890          1   1890                   3
     -------------  ----------  ---------  -----  ------------------
     Total                2342             6300                  10
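
The totals in Table 2 follow from multiplying each sentence count by the number of speakers who read each sentence. A short Python check, with the figures copied from the table and illustrative variable names:

```python
# (sentence type, distinct sentences, speakers per sentence, sentences per speaker)
TABLE_2 = [
    ("Dialect (SA)", 2, 630, 2),
    ("Compact (SX)", 450, 7, 5),
    ("Diverse (SI)", 1890, 1, 3),
]
NUM_SPEAKERS = 630

# Distinct sentence texts across all three types.
distinct_texts = sum(n_sent for _, n_sent, _, _ in TABLE_2)

# Recorded utterances: each distinct sentence is read by `n_spk` speakers.
utterances = sum(n_sent * n_spk for _, n_sent, n_spk, _ in TABLE_2)

# Sentences read by a single speaker across the three types.
per_speaker = sum(n_each for _, _, _, n_each in TABLE_2)

print(distinct_texts, utterances, per_speaker)

# Counting per speaker must give the same number of utterances.
assert utterances == NUM_SPEAKERS * per_speaker
```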
    

    Suggested Training/Test Subdivision

    The speech material has been subdivided into portions for training and testing. The criteria for the subdivision are described in the file "testset.doc". THIS SUBDIVISION HAS NO RELATION TO THE DATA DISTRIBUTED ON THE PROTOTYPE VERSION OF THE CDROM.

    Core Test Set:

    The test data has a core portion containing 24 speakers, 2 male and 1 female from each dialect region. The core test speakers are shown in Table 3. Each speaker read a different set of SX sentences. Thus the core test material contains 192 sentences, 5 SX and 3 SI for each speaker, each having a distinct text prompt.

      Table 3: The core test set of 24 speakers
    
       Dialect    Male   Female
       -------    ------   ------
        1    DAB0, WBT0  ELC0  
        2    TAS1, WEW0  PAS0  
        3    JMP0, LNT0  PKT0  
        4    LLL0, TLS0  JLM0  
        5    BPM0, KLT0  NLP0  
        6    CMJ0, JDH0  MGD0  
        7    GRT0, NJM0  DHC0
        8    JLN0, PAM0  MLD0
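
The core test set size can be checked against Table 3 in a few lines of Python. The speaker IDs are copied from the table; the dictionary layout and constant names are illustrative choices, not a format defined by TIMIT.

```python
# Core test speakers per dialect region, copied from Table 3.
CORE_TEST = {
    1: {"male": ["DAB0", "WBT0"], "female": ["ELC0"]},
    2: {"male": ["TAS1", "WEW0"], "female": ["PAS0"]},
    3: {"male": ["JMP0", "LNT0"], "female": ["PKT0"]},
    4: {"male": ["LLL0", "TLS0"], "female": ["JLM0"]},
    5: {"male": ["BPM0", "KLT0"], "female": ["NLP0"]},
    6: {"male": ["CMJ0", "JDH0"], "female": ["MGD0"]},
    7: {"male": ["GRT0", "NJM0"], "female": ["DHC0"]},
    8: {"male": ["JLN0", "PAM0"], "female": ["MLD0"]},
}

# Each core test speaker contributes 5 SX and 3 SI sentences (SA is excluded).
SX_PER_SPEAKER = 5
SI_PER_SPEAKER = 3

num_speakers = sum(
    len(group["male"]) + len(group["female"]) for group in CORE_TEST.values()
)
num_sentences = num_speakers * (SX_PER_SPEAKER + SI_PER_SPEAKER)

print(num_speakers, num_sentences)
```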
    

    Complete Test Set:

    A more extensive test set was obtained by including the sentences from all speakers that read any of the SX texts included in the core test set. In doing so, no sentence text appears in both the training and test sets. This complete test set contains a total of 168 speakers and 1344 utterances, accounting for about 27% of the total speech material. The resulting dialect distribution of the 168 speaker test set is given in Table 4. The complete test material contains 624 distinct texts.

    Table 4: Dialect distribution for complete test set

     Dialect  #Male  #Female  Total
     -------  -----  -------  -----
        1        7       4      11
        2       18       8      26
        3       23       3      26
        4       16      16      32
        5       17      11      28
        6        8       3      11
        7       15       8      23
        8        8       3      11
     -------  -----  -------  -----
      Total    112      56     168
    
    CDROM TIMIT Directory and File Structure
    
    The speech and associated data is organized on the CD-ROM according to the following hierarchy:
    
    /
    