This dataset was created by Ryan Epp
Released under Other (specified in description)
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset is a side product of a notebook to find out the rules of stress position in English.
It is a work based on another dataset with 300k+ English words.
I looked up phonetic transcriptions with this free dictionary API and obtained about 30k transcriptions. From these I extracted syllable counts, stress positions, and stressed syllables to build this new dataset.
words_stress_analyzed.csv is the final dataset; the other files are intermediate steps in the process.
Column | Datatype | Example | Description |
---|---|---|---|
word | str | complimentary | the English word |
phonetic | str | /ˌkɒmplɪ̈ˈment(ə)ɹɪ/ | the phonetic transcription of the word |
part_of_speech | str (list-like) | ['adjective'] | how the word is used in sentences |
syllable_len | int | 5 | the number of syllables in the word |
stress_pos | int | 3 | the syllable on which the stress falls; if there is more than one stress, the position of the first stress |
stress_syllable | str | e | the vowel of the stressed syllable |
Note: The absence of a stress symbol in some short words led to blanks in this dataset. It is recommended to filter out rows with an empty stress_syllable and rows where syllable_len is 1.
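The recommended filtering can be sketched with the standard-library csv module. The column names come from the table above; the three sample rows here are illustrative stand-ins for the real file contents.

```python
import csv
import io

# A small in-memory sample standing in for words_stress_analyzed.csv.
sample = """word,phonetic,part_of_speech,syllable_len,stress_pos,stress_syllable
complimentary,/ˌkɒmplɪ̈ˈment(ə)ɹɪ/,['adjective'],5,3,e
cat,/kæt/,['noun'],1,1,
above,/əˈbʌv/,['preposition'],2,2,ʌ
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# Keep only rows with a non-empty stress_syllable and more than one syllable.
clean = [r for r in rows
         if r["stress_syllable"] and int(r["syllable_len"]) > 1]

print([r["word"] for r in clean])  # → ['complimentary', 'above']
```

With the real file, replace the io.StringIO wrapper with open("words_stress_analyzed.csv", encoding="utf-8").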
The data contains 500,113 entries. All words and pronunciations were produced by English linguists. It can be used in the research and development of English ASR technology.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Ryan Epp
Released under CC0: Public Domain
https://choosealicense.com/licenses/cc/
Phonetic IPA Dataset for 85+ languages
Download the dataset: https://github.com/neurlang/dataset. Use our model trained on the dataset: https://www.hashtron.cloud
Licensing information:
- MIT: original data for the 15 languages taken from the gruut databases
- MIT: to this, data for 31 languages were added from ipa-dict files
- CC0 (Public Domain): Chinese/Mandarin IPA language sentence pairs were generated from the Chinese sentences taken from a Kaggle dataset based on the… See the full description on the dataset page: https://huggingface.co/datasets/neurlang/phonetic.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Christoph Minixhofer
Released under CC0: Public Domain
https://creativecommons.org/publicdomain/zero/1.0/
The file has 5 pipe ('|') separated columns:
Word | Pronunciation | Second pronunciation (if any) | Part of Speech | Definition.
I'm assuming this is the Webster's Unabridged Dictionary you can find here:
The file I used to make this was located here:
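A minimal sketch of reading the five pipe-separated columns described above. The field names and the sample line are illustrative assumptions, not taken from the actual file.

```python
# Each line: Word | Pronunciation | Second pronunciation (if any) | Part of Speech | Definition
FIELDS = ["word", "pronunciation", "alt_pronunciation", "part_of_speech", "definition"]

def parse_line(line: str) -> dict:
    # Split on '|' and strip surrounding whitespace from each field.
    parts = [p.strip() for p in line.split("|")]
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}: {line!r}")
    return dict(zip(FIELDS, parts))

# Hypothetical sample line in the described format.
entry = parse_line("abandon | a*ban'don | | v. t. | To give up absolutely.")
print(entry["part_of_speech"])  # → v. t.
```

Real definitions may themselves contain '|' in rare cases; if so, a `line.split("|", 4)` with a fixed field count is the safer variant.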
Dataset origin: https://www.kaggle.com/datasets/stephrouen/word-in-french
Context
Lexique v3.81 on www.lexique.org/
Content
Words of the French language with pronunciation, grouping and statistics.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains the most common Urdu stop words. These stop words are translated into Roman Urdu using Urdu phonetic alphabets. Each Roman Urdu version of a stop word is the most commonly used form of that word.
v1.0
2024-10-15
Material Type: Voice Collection and Annotation
Data Description (language category, etc.): Sichuan dialect
Total Data Volume: 800 hours
Collection Equipment Requirements (model, etc.): Telephone recording
Collection Environment Requirements: Indoor
Pronunciation Style (neutral news / emotional / reading / free topic / natural speech): Natural speech
Number of Collectors: 346 people
Recording Content: Audio scenes include customer service dialogue scenarios, the financial industry, and simulated production scenario data
Duration of Each Recording: 5-10 minutes
Annotation Content: Text, timestamp, gender, background noise, English, amplitude reduction
Annotation Format: Customer service agent and customer are annotated separately
Recording Format (follow reading / read aloud / dialogue, etc.; follow reading is only for children): Dialogue
Voice Ability Requirements (accent; regional distribution; language ability for minority languages; whether professional broadcasters are needed): Normal Sichuan dialect speech
Recording Environment (recording studio / SNR [office / shopping mall / outdoor / desk / mobile phone / car (model, speed, in-car environment)]): Office
Audio Format: 8 kHz, 16-bit WAV
Channels: Dual channel
Silence Retention: No retention
Voice Annotation: Segmented at the speaker's pauses, with the corresponding text marked
Acceptance Rate Requirement: 99%
## Directory Structure
root_directory/
├── audio/
│   ├── audio1.wav
├── annotations/
│   ├── text1.txt
├── labels/
│   ├── text1.txt
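Assuming the layout above, pairing each audio file with its annotation and label files can be sketched as follows (the index-based naming, audio1.wav matching text1.txt, is inferred from the example filenames):

```python
from pathlib import Path

def pair_files(root: Path):
    """Pair each audio/audioN.wav with annotations/textN.txt and labels/textN.txt,
    following the directory tree described above."""
    pairs = []
    for wav in sorted((root / "audio").glob("*.wav")):
        # audio1.wav -> index "1" -> annotations/text1.txt, labels/text1.txt
        idx = wav.stem.removeprefix("audio")
        pairs.append((wav,
                      root / "annotations" / f"text{idx}.txt",
                      root / "labels" / f"text{idx}.txt"))
    return pairs
```

This is a sketch under the stated naming assumption; if the corpus uses a different stem convention, only the `removeprefix` line needs changing.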
The TIMIT Acoustic-Phonetic Continuous Speech Corpus is a standard dataset used for evaluation of automatic speech recognition systems. It consists of recordings of 630 speakers of 8 dialects of American English each reading 10 phonetically-rich sentences. It also comes with the word and phone-level transcriptions of the speech.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains multimodal neurocognitive and behavioral data collected from 450 participants engaged in immersive virtual reality (VR) environments designed to improve English phonemic competence. Each participant interacted with authentic conversational VR tasks targeting 44 challenging English phonemes.
The dataset includes:
- EEG-derived PSD features (Alpha, Beta, Theta) for cognitive state monitoring
- Speech-derived MFCC features for phoneme articulation assessment
- Eye-tracking metrics, including fixation duration, for visual engagement
- GSR and ECG data for physiological arousal and heart rate analysis
- Audio file references corresponding to individual English phoneme pronunciations
- Two target variables:
  - Correct_Pronunciation (0/1), a binary classification label
  - Neurocognitive_Load, a continuous score
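A hedged sketch of separating the feature groups from the two targets described above. The exact feature column headers are assumptions for illustration; only the two target names come from the description.

```python
# Feature column names are hypothetical stand-ins for the groups listed above.
FEATURES = ["psd_alpha", "psd_beta", "psd_theta",  # EEG PSD features
            "mfcc_mean", "fixation_duration",       # speech / eye-tracking
            "gsr", "ecg_hr"]                        # physiological signals

def split_xy(row: dict):
    """Split one record into (features, classification target, regression target)."""
    x = [float(row[f]) for f in FEATURES]
    y_cls = int(row["Correct_Pronunciation"])   # binary label, 0/1
    y_reg = float(row["Neurocognitive_Load"])   # continuous load score
    return x, y_cls, y_reg
```

The split reflects the dual-task framing: the same feature vector feeds both a classifier and a regressor.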
A list of stop words in the Sindhi language, sorted in dictionary order.
They have been extracted and processed from the following research:
The TIMIT corpus of read speech has been designed to provide speech data for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of automatic speech recognition systems. TIMIT has resulted from the joint efforts of several sites under sponsorship from the Defense Advanced Research Projects Agency - Information Science and Technology Office (DARPA-ISTO). Text corpus design was a joint effort among the Massachusetts Institute of Technology (MIT), Stanford Research Institute (SRI), and Texas Instruments (TI). The speech was recorded at TI, transcribed at MIT, and has been maintained, verified, and prepared for CD-ROM production by the National Institute of Standards and Technology (NIST). This file contains a brief description of the TIMIT Speech Corpus. Additional information including the referenced material and some relevant reprints of articles may be found in the printed documentation which is also available from NTIS (NTIS# PB91-100354).
TIMIT contains a total of 6300 sentences, 10 sentences spoken by each of 630 speakers from 8 major dialect regions of the United States. Table 1 shows the number of speakers for the 8 dialect regions, broken down by sex. The percentages are given in parentheses. A speaker's dialect region is the geographical area of the U.S. where they lived during their childhood years. The geographical areas correspond with recognized dialect regions in U.S. (Language Files, Ohio State University Linguistics Dept., 1982), with the exception of the Western region (dr7) in which dialect boundaries are not known with any confidence and dialect region 8 where the speakers moved around a lot during their childhood.
Table 1: Dialect distribution of speakers
Dialect
Region(dr) #Male #Female Total
---------- --------- --------- ----------
  1        31 (63%)  18 (37%)   49 (8%)
2 71 (70%) 31 (30%) 102 (16%)
  3        79 (77%)  23 (23%)  102 (16%)
4 69 (69%) 31 (31%) 100 (16%)
5 62 (63%) 36 (37%) 98 (16%)
6 30 (65%) 16 (35%) 46 (7%)
7 74 (74%) 26 (26%) 100 (16%)
8 22 (67%) 11 (33%) 33 (5%)
------ --------- --------- ----------
8 438 (70%) 192 (30%) 630 (100%)
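The row sums in Table 1 can be verified mechanically:

```python
# (#male, #female) per dialect region, copied from Table 1.
rows = {1: (31, 18), 2: (71, 31), 3: (79, 23), 4: (69, 31),
        5: (62, 36), 6: (30, 16), 7: (74, 26), 8: (22, 11)}

males = sum(m for m, f in rows.values())
females = sum(f for m, f in rows.values())
print(males, females, males + females)  # → 438 192 630, matching the totals row
```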
The dialect regions are:
dr1: New England
dr2: Northern
dr3: North Midland
dr4: South Midland
dr5: Southern
dr6: New York City
dr7: Western
dr8: Army Brat (moved around)
The text material in the TIMIT prompts (found in the file "prompts.doc") consists of 2 dialect "shibboleth" sentences designed at SRI, 450 phonetically-compact sentences designed at MIT, and 1890 phonetically-diverse sentences selected at TI. The dialect sentences (the SA sentences) were meant to expose the dialectal variants of the speakers and were read by all 630 speakers. The phonetically-compact sentences were designed to provide a good coverage of pairs of phones, with extra occurrences of phonetic contexts thought to be either difficult or of particular interest. Each speaker read 5 of these sentences (the SX sentences) and each text was spoken by 7 different speakers. The phonetically-diverse sentences (the SI sentences) were selected from existing text sources - the Brown Corpus (Kuchera and Francis, 1967) and the Playwrights Dialog (Hultzen, et al., 1964) - so as to add diversity in sentence types and phonetic contexts. The selection criteria maximized the variety of allophonic contexts found in the texts. Each speaker read 3 of these sentences, with each sentence being read only by a single speaker. Table 2 summarizes the speech material in TIMIT.
Table 2: TIMIT speech material
Sentence Type #Sentences #Speakers Total #Sentences/Speaker
------------- ---------- --------- ----- ------------------
Dialect (SA) 2 630 1260 2
Compact (SX) 450 7 3150 5
Diverse (SI) 1890 1 1890 3
------------- ---------- --------- ----- ----------------
Total 2342 6300 10
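The totals in Table 2 follow directly from the per-type counts:

```python
# sentences = #texts * #speakers-per-text, from Table 2.
sa = 2 * 630     # every speaker reads both dialect sentences  -> 1260
sx = 450 * 7     # each compact text read by 7 speakers        -> 3150
si = 1890 * 1    # each diverse text read by a single speaker  -> 1890

print(sa + sx + si)           # → 6300 total sentences
print((sa + sx + si) // 630)  # → 10 sentences per speaker (2 SA + 5 SX + 3 SI)
```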
Suggested Training/Test Subdivision
The speech material has been subdivided into portions for training and testing. The criteria for the subdivision are described in the file "testset.doc". THIS SUBDIVISION HAS NO RELATION TO THE DATA DISTRIBUTED ON THE PROTOTYPE VERSION OF THE CDROM.
The test data has a core portion containing 24 speakers, 2 male and 1 female from each dialect region. The core test speakers are shown in Table 3. Each speaker read a different set of SX sentences. Thus the core test material contains 192 sentences, 5 SX and 3 SI for each speaker, each having a distinct text prompt.
Table 3: The core test set of 24 speakers
Dialect Male Female
------- ------ ------
1 DAB0, WBT0 ELC0
2 TAS1, WEW0 PAS0
3 JMP0, LNT0 PKT0
4 LLL0, TLS0 JLM0
5 BPM0, KLT0 NLP0
6 CMJ0, JDH0 MGD0
7 GRT0, NJM0 DHC0
8 JLN0, PAM0 MLD0
A more extensive test set was obtained by including the sentences from all speakers that read any of the SX texts included in the core test set. In doing so, no sentence text appears in both the training and test sets. This complete test set contains a total of 168 speakers and 1344 utterances, accounting for about 27% of the total speech material. The resulting dialect distribution of the 168-speaker test set is given in Table 4. The complete test material contains 624 distinct texts.
Table 4: Dialect distribution for complete test set
Dialect #Male #Female Total
------- ----- ------- -----
1 7 4 11
2 18 8 26
3 23 3 26
4 16 16 32
5 17 11 28
6 8 3 11
7 15 8 23
8 8 3 11
----- ----- ------- ------
Total 112 56 168
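The complete-test-set figures are internally consistent: 168 speakers each contribute their 5 SX and 3 SI utterances (the shared SA sentences are excluded), giving 1344 utterances.

```python
# (#male, #female) per dialect region, copied from Table 4.
test_rows = [(7, 4), (18, 8), (23, 3), (16, 16), (17, 11), (8, 3), (15, 8), (8, 3)]

speakers = sum(m + f for m, f in test_rows)
utterances = speakers * (5 + 3)   # 5 SX + 3 SI per speaker; SA excluded
print(speakers, utterances)       # → 168 1344
```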
CDROM TIMIT Directory and File Structure
The speech and associated data is organized on the CD-ROM according to the following hierarchy:
/