In 2023, around 43.37 million people in the United States spoke Spanish at home. In comparison, approximately 998,179 people were speaking Russian at home during the same year. The distribution of the U.S. population by ethnicity can be accessed here. A ranking of the most spoken languages across the world can be accessed here.
KsponSpeech is a large-scale spontaneous speech corpus of Korean conversations. This corpus contains 969 hours of general open-domain dialogue utterances, spoken by about 2,000 native Korean speakers in a clean environment. All data were constructed by recording the dialogue of two people freely conversing on a variety of topics and manually transcribing the utterances. Each utterance has a dual transcription consisting of orthography and pronunciation, along with disfluency tags for spontaneous-speech phenomena such as filler words, repeated words, and word fragments. KsponSpeech is publicly available on an open data hub site of the Korean government. (https://aihub.or.kr/aidata/105)
ID
King-ASR-217
Duration
511 hours
Recording Device
Telephone
Description
This dataset was recorded in a quiet office/home environment, with a total of 636 speakers participating, including 283 males and 353 females. All speakers involved in the recording were professionally selected to ensure standard pronunciation and clear articulation. The recorded text covers health, sports, travel, and other topics.
URL… See the full description on the dataset page: https://huggingface.co/datasets/DataoceanAI/Dolphin_Model_North-Korean-Conversational-Speech-Recognition-Corpus.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Second language (L2) learners often exhibit difficulty perceiving novel phonological contrasts and/or using them to distinguish similar-sounding words. The auditory lexical decision (LD) task has emerged as a promising method to elicit the asymmetries in lexical processing performance that help to identify the locus of learners’ difficulty. However, LD tasks have been implemented and interpreted variably in the literature, complicating their utility in distinguishing between cases where learners’ difficulty lies at the level of perceptual and/or lexical coding. Building on previous work, we elaborate a set of LD ordinal accuracy predictions associated with various logically possible scenarios concerning the locus of learner difficulty, and provide new LD data involving multiple contrasts and native language (L1) groups. The inclusion of a native speaker control group allows us to isolate which patterns are unique to L2 learners, and the combination of multiple contrasts and L1 groups allows us to elicit evidence of various scenarios. We present findings of an experiment where native English, Korean, and Mandarin speakers completed an LD task that probed the robustness of listeners’ phonological representations of the English /æ/-/ɛ/ and /l/-/ɹ/ contrasts. Words contained the target phonemes, and nonwords were created by replacing the target phoneme with its counterpart (e.g., lecture/*[ɹ]ecture, battle/*b[ɛ]ttle). For the /æ/-/ɛ/ contrast, all three groups exhibited the same pattern of accuracy: near-ceiling acceptance of words and an asymmetric pattern of responses to nonwords, with higher accuracy for nonwords containing [æ] than [ɛ]. 
For the /l/-/ɹ/ contrast, we found three distinct accuracy patterns: native English speakers’ performance was highly accurate and symmetric for words and nonwords, native Mandarin speakers exhibited asymmetries favoring [l] items for words and nonwords (interpreted as evidence that they experienced difficulty at the perceptual coding level), and native Korean speakers exhibited asymmetries in opposite directions for words (favoring [l]) and nonwords (favoring [ɹ]; evidence of difficulty at the lexical coding level). Our findings suggest that the auditory LD task holds promise for determining the locus of learners’ difficulty with L2 contrasts; however, we raise several issues requiring attention to maximize its utility in investigating L2 phonolexical processing.
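The nonword-construction procedure described above (replacing a target phoneme with its contrast counterpart, e.g., lecture/*[ɹ]ecture) can be sketched programmatically. The snippet below is a minimal illustration using ARPABet-style phone labels ("L"/"R"); the function name and phone sequences are hypothetical, not the authors' actual stimulus-generation code.

```python
# Hypothetical sketch: deriving lexical-decision nonword stimuli by swapping
# the two members of a target contrast (here /l/-/r/, ARPABet "L"/"R") in a
# word's phone sequence, as in lecture -> *[r]ecture.

def swap_contrast(phones, contrast=("L", "R")):
    """Return a nonword phone sequence with each member of the
    contrast replaced by its counterpart; other phones are unchanged."""
    a, b = contrast
    mapping = {a: b, b: a}
    return [mapping.get(p, p) for p in phones]

# "lecture" -> *[r]ecture
print(swap_contrast(["L", "EH1", "K", "CH", "ER0"]))
# ['R', 'EH1', 'K', 'CH', 'ER0']
```

The same function, with a vowel pair such as `("AE1", "EH1")`, would generate the /æ/-/ɛ/ items (battle/*b[ɛ]ttle).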
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for 3i4K
Dataset Summary
The 3i4K dataset is a set of frequently used Korean words (corpus provided by the Seoul National University Speech Language Processing Lab) and manually created questions/commands containing short utterances. The goal is to identify the speaker's intention in a spoken utterance from its transcript, and to determine whether, in some cases, auxiliary acoustic features are required. The classification system decides whether the utterance is a… See the full description on the dataset page: https://huggingface.co/datasets/wicho/kor_3i4k.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description of the original author
KSS Dataset: Korean Single speaker Speech Dataset
KSS Dataset is designed for the Korean text-to-speech task. It consists of audio files recorded by a professional female voice actress and their aligned text extracted from my books. As a copyright holder, by courtesy of the publishers, I release this dataset to the public. To the best of my knowledge, this is the first publicly available speech dataset for Korean.
File Format
Each… See the full description on the dataset page: https://huggingface.co/datasets/Bingsu/KSS_Dataset.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
L2-ARCTIC: a non-native English speech corpus
L2-ARCTIC contains English speech from 24 non-native speakers of Vietnamese, Korean, Mandarin, Spanish, Hindi, and Arabic backgrounds. It contains phonemic annotations using the sounds supported by ARPABet. It was compiled by researchers at Texas A&M University and Iowa State University. Read more on their official website.
This Processed Version
We have processed the dataset into an easily consumable Hugging Face dataset… See the full description on the dataset page: https://huggingface.co/datasets/KoelLabs/L2Arctic.
According to a study conducted in South Korea in 2023, over half of the respondents indicated that they would not send their children to study abroad, while about ** percent stated the opposite. Korean students studying abroad Many Korean students choose to study abroad because they can benefit from the advanced educational system and various opportunities available. The United States is the most preferred destination for Korean students due to its status as an English-speaking country. In the 2022/23 academic year, more than ****** South Korean students were enrolled in U.S. universities. Internationalization of Korean universities In response to this trend, the South Korean government is actively working to globalize universities. The aim is to provide diverse opportunities for Korean students without the need to study abroad. The Songdo commercial district has been designated to attract universities from other countries. Furthermore, many other higher education institutions have been signing contracts with foreign partners for credit exchanges, degree programs, research projects, and instructor exchanges. As a result, local undergraduate students can now easily take courses in English and interact with foreign professors and students from diverse cultural backgrounds.
In 2023, the value of sauces and sauce products exported from South Korea amounted to around ******** U.S. dollars, an increase from the previous year. In recent years, Korean food items, which include sauces, have experienced an increase in popularity. As a result, exports of sauces and pastes like the well-known red pepper paste gochujang have been increasing to keep up with the growing demand.
The rise of K-food
As access to Korean culture has become easy thanks to the Korean wave (Hallyu) and internet trends, more consumers have discovered Korean cuisine for themselves. While taste plays a major part in attracting a new food audience, Korean media as well as celebrities and influencers add further motivation. The power of media over food trends is visible in examples like the Korean hot noodle challenge and dalgona.
Consumers' access to Korean food
Especially in large Asian and English-speaking cities, consumers enjoyed visiting Korean restaurants. According to a survey, those who ate Korean food spent an average of around ** U.S. dollars per month on it. The cuisine is known for featuring a variety of vegetables and healthy cooking methods. Nonetheless, Korean-style fried chicken and instant noodles are the most popular K-food dishes globally, far ahead of traditional dishes.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
L2-ARCTIC Suitcase: a spontaneous non-native English speech corpus
L2-ARCTIC Suitcase Corpus contains English speech from 22 non-native speakers of Vietnamese, Korean, Mandarin, Spanish, Hindi, and Arabic backgrounds. It contains phonemic annotations using the sounds supported by ARPABet. It was compiled by researchers at Texas A&M University and Iowa State University. Read more on their official website.
This Processed Version
We have processed the dataset into an… See the full description on the dataset page: https://huggingface.co/datasets/KoelLabs/L2ArcticSpontaneousSplit.
https://www.wiseguyreports.com/pages/privacy-policy
BASE YEAR | 2024 |
HISTORICAL DATA | 2019 - 2024 |
REPORT COVERAGE | Revenue Forecast, Competitive Landscape, Growth Factors, and Trends |
MARKET SIZE 2023 | 15.4 (USD Billion) |
MARKET SIZE 2024 | 17.97 (USD Billion) |
MARKET SIZE 2032 | 61.9 (USD Billion) |
SEGMENTS COVERED | Device, End-User, Application, Language Support, Vertical, Regional |
COUNTRIES COVERED | North America, Europe, APAC, South America, MEA |
KEY MARKET DYNAMICS | AI Integration, Rising Demand, IoT Advancements, Privacy Concerns |
MARKET FORECAST UNITS | USD Billion |
KEY COMPANIES PROFILED | NICE Ltd., Apple Inc., Tencent Holdings Ltd., Google LLC, Genesys Telecommunications Laboratories, Inc., IBM Corporation, Alibaba Group Holding Ltd., Microsoft Corporation, Amazon Web Services, SoundHound Inc., Sensory Inc., Verint Systems Inc., Nuance Communications Inc., Samsung Electronics Co., Ltd., Baidu Inc. |
MARKET FORECAST PERIOD | 2024 - 2032 |
KEY MARKET OPPORTUNITIES | 1. Enhanced user experience, 2. Growing demand for smart homes, 3. Increased adoption in healthcare, 4. Integration with IoT devices, 5. Expansion into emerging markets |
COMPOUND ANNUAL GROWTH RATE (CAGR) | 16.71% (2024 - 2032) |
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
FLEURS
FLEURS is the speech version of the FLoRes machine translation benchmark. We use 2,009 n-way parallel sentences from the publicly available FLoRes dev and devtest sets, in 102 languages. Training sets have around 10 hours of supervision. Speakers in the train sets are different from speakers in the dev/test sets. Multilingual fine-tuning is used, and the "unit error rate" (characters, signs) is averaged over all languages. Languages and results are also grouped into seven… See the full description on the dataset page: https://huggingface.co/datasets/google/fleurs.
CALLFRIEND Russian Speech (LDC2023S08) was developed by the Linguistic Data Consortium (LDC) and consists of approximately 48 hours of telephone conversations (100 recordings) between native speakers of Russian. The calls were recorded in 1999 as part of the CALLFRIEND collection. One hundred native Russian speakers living in the continental United States each made a single phone call, lasting up to 30 minutes, to a family member or friend living in the United States. Corresponding transcripts and a lexicon are available in CALLFRIEND Russian Text (LDC2023T09). The CALLFRIEND series is a collection of telephone conversations in several languages conducted by LDC in support of language identification technology development. Languages covered in the collection include American English, Canadian French, Egyptian Arabic, Farsi, German, Hindi, Japanese, Korean, Mandarin Chinese, Russian, Spanish, Tamil and Vietnamese.

All recordings involved domestic calls routed through the automated telephone collection platform at LDC and were stored as 2-channel (4-wire) 8 kHz mu-law samples taken directly from a public telephone network via a T-1 circuit. Each audio file is a FLAC-compressed MS-WAV (RIFF) format audio file containing 2-channel, 8 kHz, 16-bit PCM sample data. This release includes call metadata, including speaker gender, the number of speakers on each channel and call duration.
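The conversion between the two formats mentioned above (8-bit mu-law capture, 16-bit linear PCM distribution) follows the standard ITU-T G.711 mu-law expansion. The sketch below illustrates that standard algorithm per sample; it is not LDC's distribution tooling.

```python
# Sketch of G.711 mu-law expansion: converting one 8-bit mu-law code, as
# captured from the telephone network, into a 16-bit linear PCM sample,
# as distributed in the FLAC-compressed WAV files.

def mulaw_decode(byte):
    """Expand an 8-bit G.711 mu-law code to a 16-bit linear PCM sample."""
    byte = ~byte & 0xFF            # mu-law bytes are stored bit-inverted
    sign = byte & 0x80             # top bit: sample polarity
    exponent = (byte >> 4) & 0x07  # 3-bit segment number
    mantissa = byte & 0x0F         # 4-bit position within the segment
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample

# 0xFF encodes digital silence; 0x00 and 0x80 encode the extreme amplitudes
print(mulaw_decode(0xFF), mulaw_decode(0x00), mulaw_decode(0x80))
# 0 -32124 32124
```

Note the non-uniform step size: mu-law companding gives finer resolution near zero amplitude, which suits the dynamic range of telephone speech.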
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
Large acoustic corpus of read text in Korean produced by Kaist Korterm. Native Korean speakers (males and females) uttered 36 geographical proper nouns. Information such as the size and the level of studies of the speakers is provided. The recordings took place in a soundproof room. The data are stored as 8-bit A-law speech files with a 16 kHz sampling rate. The standard in use is NIST.
Official data repository for "LLM-as-a-Judge & Reward Model: What They Can and Cannot Do". TL;DR: Automated evaluators (LLM-as-a-Judge, reward models) can be transferred to non-English settings without additional training (most of the time).
Dataset Description
To the best of our knowledge, KUDGE is the only non-English, human-annotated meta-evaluation dataset at this point. It consists of 5,012 human annotations from native Korean speakers, and we expect KUDGE to be widely used as a tool… See the full description on the dataset page: https://huggingface.co/datasets/HAERAE-HUB/KUDGE.
This dataset includes the primary language of newly Medi-Cal eligible individuals who identified their primary language as English, Spanish, Vietnamese, Mandarin, Cantonese, Arabic, Other Non-English, Armenian, Russian, Farsi, Korean, Tagalog, Other Chinese Languages, Hmong, Cambodian, Portuguese, Lao, French, Thai, Japanese, Samoan, Other Sign Language, American Sign Language (ASL), Turkish, Ilacano, Mien, Italian, Hebrew, and Polish, by reporting period. The primary language data is from the Medi-Cal Eligibility Data System (MEDS) and includes eligible individuals without prior Medi-Cal eligibility. This dataset is part of the public reporting requirements set forth in California Welfare and Institutions Code 14102.5.
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
The GlobalPhone pronunciation dictionaries, created within the framework of the multilingual speech and language corpus GlobalPhone, were developed in collaboration with the Karlsruhe Institute of Technology (KIT). The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The pronunciation dictionaries are currently available in 18 languages: Arabic (29230 entries/27059 words), Bulgarian (20193 entries), Croatian (23497 entries/20628 words), Czech (33049 entries/32942 words), French (36837 entries/20710 words), German (48979 entries/46035 words), Hausa (42662 entries/42079 words), Japanese (18094 entries), Polish (36484 entries), Portuguese (Brazilian) (54146 entries/54130 words), Russian (28818 entries/27667 words), Spanish (Latin American) (43264 entries/33960 words), Swedish (about 25000 entries), Turkish (31330 entries/31087 words), Vietnamese (38504 entries/29974 words), Chinese-Mandarin (73388 pronunciations), Korean (3500 syllables), and Thai (a small set with 12,420 pronunciation entries for 12,420 different words, without pronunciation variants, and a larger set with 25,570 pronunciation entries for 22,462 different word units, including 3,108 entries with up to four pronunciation variants). 1) Dictionary Encoding: The pronunciation dictionary entries consist of full word forms and are either given in the original script of that language, mostly in UTF-8 encoding (Bulgarian, Croatian, Czech, French, Polish, Russian, Spanish, Thai), corresponding to the trl-files of the GlobalPhone transcriptions, or in Romanized script (Arabic, German, Hausa, Japanese, Korean, Mandarin, Portuguese, Swedish, Turkish, Vietnamese), corresponding to the rmn-files of the GlobalPhone transcriptions, respectively. In the latter case the documentation mostly provides a mapping from the Romanized to the original script.
2) Dictionary Phone set: The phone sets for each language were derived individually from the literature following best practices for automatic speech processing. Each phone set is explained and described in the documentation using the international standards of the International Phonetic Alphabet (IPA). For most languages a mapping to the language independent GlobalPhone naming conventions (indicated by “M_”) is provided for the purpose of data sharing across languages to build multilingual acoustic models. 3) Dictionary Generation: Whenever the grapheme-to-phoneme relationship allowed, the dictionaries were created semi-automatically in a rule-based fashion using a set of grapheme-to-phoneme mapping rules. The number of rules highly depends on the language. After the automatic creation process, all dictionaries were manually cross-checked by native speakers, correcting potential errors of the automatic pronunciation generation process. Most of the dictionaries have been applied to large vocabulary speech recognition. In ...
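The rule-based, semi-automatic generation step described above can be sketched as a greedy longest-match application of grapheme-to-phoneme rules. The toy rules below are invented for illustration (a Spanish-like orthography) and are not GlobalPhone's actual rule sets.

```python
# Illustrative sketch of rule-based grapheme-to-phoneme dictionary generation:
# longest-match rules applied left to right across a word. Rules and phone
# symbols here are hypothetical, not GlobalPhone's.

RULES = {  # grapheme -> phone(s); multi-character graphemes take priority
    "ch": ["tS"],
    "ll": ["j"],
    "a": ["a"], "e": ["e"], "i": ["i"], "o": ["o"], "u": ["u"],
    "b": ["b"], "k": ["k"], "l": ["l"], "m": ["m"], "n": ["n"],
    "s": ["s"], "t": ["t"],
}

def g2p(word, rules=RULES):
    """Greedy longest-match application of grapheme-to-phoneme rules."""
    phones, i = [], 0
    max_len = max(len(g) for g in rules)
    while i < len(word):
        for n in range(max_len, 0, -1):   # try the longest grapheme first
            chunk = word[i:i + n]
            if chunk in rules:
                phones += rules[chunk]
                i += n
                break
        else:
            raise ValueError(f"no rule for {word[i]!r}")
    return phones

print(g2p("chile"))   # ['tS', 'i', 'l', 'e']
```

The longest-match loop is what lets "ch" and "ll" win over their single-letter components; in practice, as the text notes, such automatically generated entries are then cross-checked by native speakers.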
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
The Collins Multilingual database covers Real Life Daily vocabulary. It is composed of a multilingual lexicon in 32 languages (the WordBank, see ELRA-T0376) and a multilingual set of sentences in 28 languages (the PhraseBank, see ELRA-T0377). This version includes the corresponding audio files covering 26 of the 32 languages available in the Collins MLD WordBank: Arabic, Chinese, Croatian, Czech, Danish, Dutch, American English, British English, Finnish, French, German, Greek, Italian, Japanese, Korean, Norwegian, Polish, Portuguese (Iberian), Portuguese (Brazilian), Russian, Spanish (Iberian), Spanish (Latin American), Swedish, Thai, Turkish, Vietnamese. The WordBank contains 10,000 words for each language, XML-annotated for part-of-speech, gender, irregular forms and disambiguating information for homographs. An additional dataset of 10,000 headwords is included for 12 languages (Chinese, American and British English, French, German, Italian, Japanese, Korean, Iberian and Brazilian Portuguese, Iberian and Latin American Spanish). The full database contains 10,000 audio files for each language (26 languages), and 10,000 additional audio files corresponding to the 10,000 additional headwords in 12 languages. Audio was recorded by native speakers.