18 datasets found
  1. Ranking of languages spoken at home in the U.S. 2023

    • statista.com
    Updated Apr 14, 2025
    Cite
    Statista (2025). Ranking of languages spoken at home in the U.S. 2023 [Dataset]. https://www.statista.com/statistics/183483/ranking-of-languages-spoken-at-home-in-the-us-in-2008/
    Dataset updated
    Apr 14, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2023
    Area covered
    United States
    Description

    In 2023, around 43.37 million people in the United States spoke Spanish at home. In comparison, approximately 998,179 people spoke Russian at home in the same year. The distribution of the U.S. population by ethnicity can be accessed here. A ranking of the most spoken languages across the world can be accessed here.

  2. ksponspeech

    • huggingface.co
    Updated Jun 17, 2025
    + more versions
    Cite
    Cheul (2025). ksponspeech [Dataset]. https://huggingface.co/datasets/cheulyop/ksponspeech
    Dataset updated
    Jun 17, 2025
    Authors
    Cheul
    Description

    KsponSpeech is a large-scale spontaneous speech corpus of Korean conversations. It contains 969 hours of general open-domain dialog utterances, spoken by about 2,000 native Korean speakers in a clean environment. All data were constructed by recording the dialogue of two people freely conversing on a variety of topics and manually transcribing the utterances. Each utterance has a dual transcription consisting of orthography and pronunciation, along with disfluency tags for spontaneous speech phenomena such as filler words, repeated words, and word fragments. KsponSpeech is publicly available on an open data hub site of the Korean government (https://aihub.or.kr/aidata/105).
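
    Since this corpus is mirrored on the Hugging Face Hub, a minimal sketch of loading it with the Python datasets library is shown below. The split name and field layout are assumptions and should be checked against the dataset page.

        # Minimal sketch (assumptions: the repo id from the citation above and a "train" split exist).
        from datasets import load_dataset

        ksponspeech = load_dataset("cheulyop/ksponspeech", split="train")
        print(ksponspeech)       # lists the available columns and number of rows
        print(ksponspeech[0])    # first utterance, e.g. audio plus its transcription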

  3. Dolphin_Model_North-Korean-Conversational-Speech-Recognition-Corpus

    • huggingface.co
    Updated Apr 26, 2025
    + more versions
    Cite
    Dataocean AI (2025). Dolphin_Model_North-Korean-Conversational-Speech-Recognition-Corpus [Dataset]. https://huggingface.co/datasets/DataoceanAI/Dolphin_Model_North-Korean-Conversational-Speech-Recognition-Corpus
    Dataset updated
    Apr 26, 2025
    Authors
    Dataocean AI
    Area covered
    North Korea
    Description

    ID: King-ASR-217
    Duration: 511 hours
    Recording Device: Telephone
    Description: This dataset was recorded in a quiet office/home environment, with a total of 636 speakers participating, including 283 males and 353 females. All speakers involved in the recording were professionally selected to ensure standard pronunciation and clear articulation. The recorded text covers health, sports, travel, and other information.
    URL… See the full description on the dataset page: https://huggingface.co/datasets/DataoceanAI/Dolphin_Model_North-Korean-Conversational-Speech-Recognition-Corpus.
    
  4. Table2_L2 Processing of Words Containing English /æ/-/ɛ/ and /l/-/ɹ/...

    • frontiersin.figshare.com
    docx
    Updated Jun 9, 2023
    + more versions
    Cite
    Shannon Barrios; Rachel Hayes-Harb (2023). Table2_L2 Processing of Words Containing English /æ/-/ɛ/ and /l/-/ɹ/ Contrasts, and the Uses and Limits of the Auditory Lexical Decision Task for Understanding the Locus of Difficulty.DOCX [Dataset]. http://doi.org/10.3389/fcomm.2021.689470.s002
    Available download formats: docx
    Dataset updated
    Jun 9, 2023
    Dataset provided by
    Frontiers
    Authors
    Shannon Barrios; Rachel Hayes-Harb
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Second language (L2) learners often exhibit difficulty perceiving novel phonological contrasts and/or using them to distinguish similar-sounding words. The auditory lexical decision (LD) task has emerged as a promising method to elicit the asymmetries in lexical processing performance that help to identify the locus of learners’ difficulty. However, LD tasks have been implemented and interpreted variably in the literature, complicating their utility in distinguishing between cases where learners’ difficulty lies at the level of perceptual and/or lexical coding. Building on previous work, we elaborate a set of LD ordinal accuracy predictions associated with various logically possible scenarios concerning the locus of learner difficulty, and provide new LD data involving multiple contrasts and native language (L1) groups. The inclusion of a native speaker control group allows us to isolate which patterns are unique to L2 learners, and the combination of multiple contrasts and L1 groups allows us to elicit evidence of various scenarios. We present findings of an experiment where native English, Korean, and Mandarin speakers completed an LD task that probed the robustness of listeners’ phonological representations of the English /æ/-/ɛ/ and /l/-/ɹ/ contrasts. Words contained the target phonemes, and nonwords were created by replacing the target phoneme with its counterpart (e.g., lecture/*[ɹ]ecture, battle/*b[ɛ]ttle). For the /æ/-/ɛ/ contrast, all three groups exhibited the same pattern of accuracy: near-ceiling acceptance of words and an asymmetric pattern of responses to nonwords, with higher accuracy for nonwords containing [æ] than [ɛ]. For the /l/-/ɹ/ contrast, we found three distinct accuracy patterns: native English speakers’ performance was highly accurate and symmetric for words and nonwords, native Mandarin speakers exhibited asymmetries favoring [l] items for words and nonwords (interpreted as evidence that they experienced difficulty at the perceptual coding level), and native Korean speakers exhibited asymmetries in opposite directions for words (favoring [l]) and nonwords (favoring [ɹ]; evidence of difficulty at the lexical coding level). Our findings suggest that the auditory LD task holds promise for determining the locus of learners’ difficulty with L2 contrasts; however, we raise several issues requiring attention to maximize its utility in investigating L2 phonolexical processing.

  5. kor_3i4k

    • huggingface.co
    • opendatalab.com
    Updated Jan 13, 2021
    Cite
    Won Ik Cho (2021). kor_3i4k [Dataset]. https://huggingface.co/datasets/wicho/kor_3i4k
    Available formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 13, 2021
    Authors
    Won Ik Cho
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for 3i4K

      Dataset Summary
    

    The 3i4K dataset is a set of frequently used Korean words (corpus provided by the Seoul National University Speech Language Processing Lab) and manually created questions/commands consisting of short utterances. The goal is to identify the speaker intention of a spoken utterance based on its transcript and, in some cases, auxiliary acoustic features. The classification system decides whether the utterance is a… See the full description on the dataset page: https://huggingface.co/datasets/wicho/kor_3i4k.

  6. KSS_Dataset

    • huggingface.co
    Updated Apr 15, 2018
    Cite
    Dowon Hwang (2018). KSS_Dataset [Dataset]. https://huggingface.co/datasets/Bingsu/KSS_Dataset
    Available formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 15, 2018
    Authors
    Dowon Hwang
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Description of the original author

      KSS Dataset: Korean Single speaker Speech Dataset
    

    KSS Dataset is designed for the Korean text-to-speech task. It consists of audio files recorded by a professional female voice actress and their aligned text extracted from my books. As the copyright holder, by courtesy of the publishers, I release this dataset to the public. To the best of my knowledge, this is the first publicly available speech dataset for Korean.

      File Format
    

    Each… See the full description on the dataset page: https://huggingface.co/datasets/Bingsu/KSS_Dataset.

  7. L2Arctic

    • huggingface.co
    Updated Jul 11, 2025
    Cite
    Koel Labs (2025). L2Arctic [Dataset]. https://huggingface.co/datasets/KoelLabs/L2Arctic
    Dataset updated
    Jul 11, 2025
    Dataset authored and provided by
    Koel Labs
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    L2-ARCTIC: a non-native English speech corpus

    L2-ARCTIC contains English speech from 24 non-native English speakers with Vietnamese, Korean, Mandarin, Spanish, Hindi, and Arabic language backgrounds. It contains phonemic annotations using the sounds supported by ARPABet. It was compiled by researchers at Texas A&M University and Iowa State University. Read more on their official website.

      This Processed Version
    

    We have processed the dataset into an easily consumable Hugging Face dataset… See the full description on the dataset page: https://huggingface.co/datasets/KoelLabs/L2Arctic.

  8. Willingness to send children to study abroad South Korea 2023

    • statista.com
    Updated Aug 15, 2025
    Cite
    Statista (2025). Willingness to send children to study abroad South Korea 2023 [Dataset]. https://www.statista.com/statistics/1060268/south-korea-willingness-to-send-their-children-to-study-abroad/
    Dataset updated
    Aug 15, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Jul 31, 2023 - Aug 17, 2023
    Area covered
    South Korea
    Description

    According to a study conducted in South Korea in 2023, over half of the respondents indicated that they would not send their children to study abroad, while about ** percent stated the opposite.

    Korean students studying abroad: Many Korean students choose to study abroad because they can benefit from the advanced educational system and various opportunities available. The United States is the most preferred destination for Korean students due to its status as an English-speaking country. In the 2022/23 academic year, more than ****** South Korean students were enrolled in U.S. universities.

    Internationalization of Korean universities: In response to this trend, the South Korean government is actively working to globalize universities. The aim is to provide diverse opportunities for Korean students without the need to study abroad. The Songdo commercial district has been designated to attract universities from other countries. Furthermore, many other higher education institutions have been signing contracts with foreign partners for credit exchanges, degree programs, research projects, and instructor exchanges. As a result, local undergraduate students can now easily take courses in English and interact with foreign professors and students from diverse cultural backgrounds.

  9. Export value of sauces South Korea 2017-2023

    • statista.com
    Updated Jun 30, 2025
    Cite
    Statista (2025). Export value of sauces South Korea 2017-2023 [Dataset]. https://www.statista.com/statistics/1226934/south-korea-export-value-sauces/
    Dataset updated
    Jun 30, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    South Korea
    Description

    In 2023, the value of sauces and sauce products exported from South Korea amounted to around ******** U.S. dollars, an increase from the previous year. During recent years, Korean food items, which include sauces, have experienced an increase in popularity. Thanks to that, exports of sauces and pastes like the well-known red pepper paste gochujang have been increasing to keep up with the growing demand.

    The rise of K-food: As access to Korean culture is now easy thanks to the Korean wave, Hallyu, and internet trends, more consumers have discovered Korean cuisine for themselves. While taste plays a major part in attracting a new food audience, Korean media as well as celebrities and influencers add further motivation. The power of media for food trends is visible with examples like the Korean hot noodle challenge and dalgona.

    Consumers' access to Korean food: Especially in major Asian and English-speaking cities, consumers enjoyed visiting Korean restaurants. According to a survey, those who ate Korean food spent an average of around ** U.S. dollars per month on it. The cuisine is known for featuring a variety of vegetables and healthy cooking methods. Nonetheless, Korean-style fried chicken and instant noodles are the most popular K-food dishes globally, far ahead of traditional dishes.

  10. L2ArcticSpontaneousSplit

    • huggingface.co
    Updated Aug 17, 2025
    Cite
    Koel Labs (2025). L2ArcticSpontaneousSplit [Dataset]. https://huggingface.co/datasets/KoelLabs/L2ArcticSpontaneousSplit
    Dataset updated
    Aug 17, 2025
    Dataset authored and provided by
    Koel Labs
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    L2-ARCTIC Suitcase: a spontaneous non-native English speech corpus

    L2-ARCTIC Suitcase Corpus contains English speech from 22 non-native English speakers with Vietnamese, Korean, Mandarin, Spanish, Hindi, and Arabic language backgrounds. It contains phonemic annotations using the sounds supported by ARPABet. It was compiled by researchers at Texas A&M University and Iowa State University. Read more on their official website.

      This Processed Version
    

    We have processed the dataset into an… See the full description on the dataset page: https://huggingface.co/datasets/KoelLabs/L2ArcticSpontaneousSplit.

  11. Global Intelligent Voice Assistant Market Research Report: By Device...

    • wiseguyreports.com
    Updated Jul 23, 2024
    + more versions
    Cite
    Wiseguy Research Consultants Pvt Ltd (2024). Global Intelligent Voice Assistant Market Research Report: By Device (Smartphones, Smart Speakers, Smart Displays, Wearables), By End-User (Personal, Enterprise, Government), By Application (Control Devices, Information Retrieval, Communication, Entertainment, Shopping, Banking), By Language Support (English, Mandarin, Spanish, Hindi, Arabic, French, Japanese, Korean), By Vertical (Healthcare, Retail, Banking and Finance, Education, Manufacturing, Automotive) and By Regional (North America, Europe, South America, Asia Pacific, Middle East and Africa) - Forecast to 2032. [Dataset]. https://www.wiseguyreports.com/reports/intelligent-voice-assistant-market
    Dataset updated
    Jul 23, 2024
    Dataset authored and provided by
    Wiseguy Research Consultants Pvt Ltd
    License

    https://www.wiseguyreports.com/pages/privacy-policy

    Time period covered
    Jan 7, 2024
    Area covered
    Global
    Description
    BASE YEAR: 2024
    HISTORICAL DATA: 2019 - 2024
    REPORT COVERAGE: Revenue Forecast, Competitive Landscape, Growth Factors, and Trends
    MARKET SIZE 2023: 15.4 (USD Billion)
    MARKET SIZE 2024: 17.97 (USD Billion)
    MARKET SIZE 2032: 61.9 (USD Billion)
    SEGMENTS COVERED: Device, End-User, Application, Language Support, Vertical, Regional
    COUNTRIES COVERED: North America, Europe, APAC, South America, MEA
    KEY MARKET DYNAMICS: AI Integration, Rising Demand, IoT Advancements, Privacy Concerns
    MARKET FORECAST UNITS: USD Billion
    KEY COMPANIES PROFILED: NICE Ltd., Apple Inc., Tencent Holdings Ltd., Google LLC, Genesys Telecommunications Laboratories, Inc., IBM Corporation, Alibaba Group Holding Ltd., Microsoft Corporation, Amazon Web Services, SoundHound Inc., Sensory Inc., Verint Systems Inc., Nuance Communications Inc., Samsung Electronics Co., Ltd., Baidu Inc.
    MARKET FORECAST PERIOD: 2024 - 2032
    KEY MARKET OPPORTUNITIES: 1. Enhanced user experience; 2. Growing demand for smart homes; 3. Increased adoption in healthcare; 4. Integration with IoT devices; 5. Expansion into emerging markets
    COMPOUND ANNUAL GROWTH RATE (CAGR): 16.71% (2024 - 2032)
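
    As a quick sanity check on the figures above, the stated CAGR can be reproduced from the 2024 and 2032 market-size values, assuming an eight-year compounding span:

        # Does 17.97 -> 61.9 (USD billion) over 2024-2032 match the stated 16.71% CAGR?
        start, end, years = 17.97, 61.9, 2032 - 2024
        cagr = (end / start) ** (1 / years) - 1
        print(f"implied CAGR: {cagr:.2%}")   # roughly 16.7%, consistent with the report
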
  12. fleurs

    • huggingface.co
    • opendatalab.com
    Updated Jun 4, 2022
    + more versions
    Cite
    Google (2022). fleurs [Dataset]. https://huggingface.co/datasets/google/fleurs
    Dataset updated
    Jun 4, 2022
    Dataset authored and provided by
    Google (http://google.com/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    FLEURS

    Fleurs is the speech version of the FLoRes machine translation benchmark. It uses 2,009 n-way parallel sentences from the publicly available FLoRes dev and devtest sets, in 102 languages. Training sets have around 10 hours of supervision, and speakers of the train sets are different from the speakers in the dev/test sets. Multilingual fine-tuning is used, and the "unit error rate" (characters, signs) is averaged over all languages. Languages and results are also grouped into seven… See the full description on the dataset page: https://huggingface.co/datasets/google/fleurs.
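
    The "unit error rate" mentioned above is an edit-distance metric computed over characters or signs. The sketch below is a minimal character error rate shown purely to illustrate that idea; it is not the benchmark's official scoring code.

        # Illustrative character error rate (CER): Levenshtein distance between
        # reference and hypothesis, normalised by the reference length.
        def cer(reference: str, hypothesis: str) -> float:
            prev = list(range(len(hypothesis) + 1))
            for i, r in enumerate(reference, start=1):
                curr = [i]
                for j, h in enumerate(hypothesis, start=1):
                    curr.append(min(prev[j] + 1,              # deletion
                                    curr[j - 1] + 1,          # insertion
                                    prev[j - 1] + (r != h)))  # substitution
                prev = curr
            return prev[-1] / max(len(reference), 1)

        print(cer("speech", "speach"))   # 1 edit / 6 characters, approximately 0.17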

  13. CALLFRIEND Russian Speech

    • dvrs-applnxprd2.library.ubc.ca
    iso, txt
    Updated Oct 16, 2023
    + more versions
    Cite
    Abacus Data Network (2023). CALLFRIEND Russian Speech [Dataset]. https://dvrs-applnxprd2.library.ubc.ca/dataset.xhtml?persistentId=hdl:11272.1/AB2/NGRVVO
    Available download formats: txt (1308), iso (2403022848)
    Dataset updated
    Oct 16, 2023
    Dataset provided by
    Abacus Data Network
    Description

    Introduction: CALLFRIEND Russian Speech (LDC2023S08) was developed by the Linguistic Data Consortium (LDC) and consists of approximately 48 hours of telephone conversations (100 recordings) between native speakers of Russian. The calls were recorded in 1999 as part of the CALLFRIEND collection. One hundred native Russian speakers living in the continental United States each made a single phone call, lasting up to 30 minutes, to a family member or friend living in the United States. Corresponding transcripts and a lexicon are available in CALLFRIEND Russian Text (LDC2023T09). The CALLFRIEND series is a collection of telephone conversations in several languages conducted by LDC in support of language identification technology development. Languages covered in the collection include American English, Canadian French, Egyptian Arabic, Farsi, German, Hindi, Japanese, Korean, Mandarin Chinese, Russian, Spanish, Tamil and Vietnamese.

    Data: All recordings involved domestic calls routed through the automated telephone collection platform at LDC and were stored as 2-channel (4-wire) 8 kHz mu-law samples taken directly from a public telephone network via a T-1 circuit. Each audio file is a FLAC-compressed MS-WAV (RIFF) file containing 2-channel, 8 kHz, 16-bit PCM sample data. This release includes call metadata, including speaker gender, the number of speakers on each channel, and call duration.
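
    Given the stated format (FLAC-compressed two-channel audio at 8 kHz, one conversation side per channel), a minimal sketch for reading a recording and separating the two sides might look like the following; the file name is a hypothetical placeholder and the soundfile package must be installed.

        # Sketch: read one CALLFRIEND-style recording and split the two sides.
        # "call_0001.flac" is a hypothetical file name used only for illustration.
        import soundfile as sf

        audio, sample_rate = sf.read("call_0001.flac")   # array of shape (n_frames, 2)
        side_a, side_b = audio[:, 0], audio[:, 1]        # one conversation side per channel
        duration_min = len(audio) / sample_rate / 60
        print(f"{sample_rate} Hz, {duration_min:.1f} minutes")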

  14. Phonetically Balanced Words (2)

    • catalog.elra.info
    • live.european-language-grid.eu
    Updated May 10, 2005
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2005). Phonetically Balanced Words (2) [Dataset]. https://catalog.elra.info/en-us/repository/browse/ELRA-S0125/
    Dataset updated
    May 10, 2005
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    Large acoustic corpus of read text in Korean produced by KAIST Korterm. Native Korean speakers (males and females) uttered 36 geographical proper nouns. Information such as the size and the level of studies of the speakers is provided. The recordings took place in a soundproof room. The data are stored in 8-bit A-law speech files, with a 16 kHz sampling rate. The standard in use is NIST.

  15. KUDGE

    • huggingface.co
    Updated Sep 17, 2024
    Cite
    HAE-RAE (2024). KUDGE [Dataset]. https://huggingface.co/datasets/HAERAE-HUB/KUDGE
    Available formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 17, 2024
    Dataset authored and provided by
    HAE-RAE
    Description

    Official data repository for "LLM-as-a-Judge & Reward Model: What They Can and Cannot Do". TL;DR: Automated evaluators (LLM-as-a-Judge, reward models) can be transferred to non-English settings without additional training (most of the time).

      Dataset Description
    

    To the best of our knowledge, KUDGE is currently the only non-English, human-annotated meta-evaluation dataset. Consisting of 5,012 human annotations from native Korean speakers, we expect KUDGE to be widely used as a tool… See the full description on the dataset page: https://huggingface.co/datasets/HAERAE-HUB/KUDGE.

  16. Primary Language of Newly Medi-Cal Eligible Individuals

    • data.chhs.ca.gov
    • data.ca.gov
    • +3more
    csv, zip
    Updated Mar 19, 2025
    Cite
    Department of Health Care Services (2025). Primary Language of Newly Medi-Cal Eligible Individuals [Dataset]. https://data.chhs.ca.gov/dataset/primary-language-of-newly-medi-cal-eligible-individuals
    Available download formats: zip, csv (32459)
    Dataset updated
    Mar 19, 2025
    Dataset provided by
    California Department of Health Care Services (http://www.dhcs.ca.gov/)
    Authors
    Department of Health Care Services
    Description

    This dataset includes the primary language of newly Medi-Cal eligible individuals who identified their primary language as English, Spanish, Vietnamese, Mandarin, Cantonese, Arabic, Other Non-English, Armenian, Russian, Farsi, Korean, Tagalog, Other Chinese Languages, Hmong, Cambodian, Portuguese, Lao, French, Thai, Japanese, Samoan, Other Sign Language, American Sign Language (ASL), Turkish, Ilacano, Mien, Italian, Hebrew, and Polish, by reporting period. The primary language data is from the Medi-Cal Eligibility Data System (MEDS) and includes eligible individuals without prior Medi-Cal eligibility. This dataset is part of the public reporting requirements set forth in California Welfare and Institutions Code 14102.5.
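
    Since the data are published as CSV, a small sketch of tallying newly eligible individuals per primary language is given below. The file name and the column names ("Primary Language", "Count") are hypothetical placeholders; check the actual header on the portal before use.

        # Sketch with hypothetical file and column names.
        import pandas as pd

        df = pd.read_csv("primary-language-of-newly-medi-cal-eligible-individuals.csv")
        by_language = (df.groupby("Primary Language")["Count"]
                         .sum()
                         .sort_values(ascending=False))
        print(by_language.head(10))   # largest primary-language groups first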

  17. GlobalPhone Spanish (Latin American) Pronunciation Dictionary

    • catalog.elra.info
    • live.european-language-grid.eu
    Updated Nov 25, 2014
    + more versions
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2014). GlobalPhone Spanish (Latin American) Pronunciation Dictionary [Dataset]. https://catalog.elra.info/en-us/repository/browse/ELRA-S0360/
    Dataset updated
    Nov 25, 2014
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Area covered
    Latin America
    Description

    The GlobalPhone pronunciation dictionaries, created within the framework of the multilingual speech and language corpus GlobalPhone, were developed in collaboration with the Karlsruhe Institute of Technology (KIT). They contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The pronunciation dictionaries are currently available in 18 languages: Arabic (29,230 entries/27,059 words), Bulgarian (20,193 entries), Croatian (23,497 entries/20,628 words), Czech (33,049 entries/32,942 words), French (36,837 entries/20,710 words), German (48,979 entries/46,035 words), Hausa (42,662 entries/42,079 words), Japanese (18,094 entries), Polish (36,484 entries), Portuguese (Brazilian) (54,146 entries/54,130 words), Russian (28,818 entries/27,667 words), Spanish (Latin American) (43,264 entries/33,960 words), Swedish (about 25,000 entries), Turkish (31,330 entries/31,087 words), Vietnamese (38,504 entries/29,974 words), Chinese-Mandarin (73,388 pronunciations), Korean (3,500 syllables), and Thai (a small set with 12,420 pronunciation entries for 12,420 different words, without pronunciation variants, and a larger set with 25,570 pronunciation entries for 22,462 different word units, including 3,108 entries of up to four pronunciation variants).

    1) Dictionary Encoding: The pronunciation dictionary entries consist of full word forms and are either given in the original script of that language, mostly in UTF-8 encoding (Bulgarian, Croatian, Czech, French, Polish, Russian, Spanish, Thai), corresponding to the trl-files of the GlobalPhone transcriptions, or in Romanized script (Arabic, German, Hausa, Japanese, Korean, Mandarin, Portuguese, Swedish, Turkish, Vietnamese), corresponding to the rmn-files of the GlobalPhone transcriptions. In the latter case the documentation mostly provides a mapping from the Romanized to the original script.

    2) Dictionary Phone Set: The phone sets for each language were derived individually from the literature following best practices for automatic speech processing. Each phone set is explained and described in the documentation using the international standards of the International Phonetic Alphabet (IPA). For most languages a mapping to the language-independent GlobalPhone naming conventions (indicated by "M_") is provided for the purpose of data sharing across languages to build multilingual acoustic models.

    3) Dictionary Generation: Whenever the grapheme-to-phoneme relationship allowed, the dictionaries were created semi-automatically in a rule-based fashion using a set of grapheme-to-phoneme mapping rules. The number of rules depends heavily on the language. After the automatic creation process, all dictionaries were manually cross-checked by native speakers, correcting potential errors of the automatic pronunciation generation process. Most of the dictionaries have been applied to large vocabulary speech recognition. In ...
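
    The dictionaries were largely generated with grapheme-to-phoneme rules and then hand-checked by native speakers. The toy sketch below only illustrates the rule-based idea on a few simplified Spanish-like mappings; it is not the actual GlobalPhone rule set.

        # Toy rule-based grapheme-to-phoneme mapping (simplified example rules).
        RULES = [
            ("ch", "tS"),   # try multi-letter graphemes before single letters
            ("ll", "j"),
            ("qu", "k"),
            ("c", "k"),
            ("z", "s"),
            ("v", "b"),
        ]

        def g2p(word: str) -> list[str]:
            phones, i = [], 0
            while i < len(word):
                for grapheme, phone in RULES:
                    if word.startswith(grapheme, i):
                        phones.append(phone)
                        i += len(grapheme)
                        break
                else:                       # no rule matched: keep the letter as-is
                    phones.append(word[i])
                    i += 1
            return phones

        print(g2p("chavez"))   # ['tS', 'a', 'b', 'e', 's']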

  18. Collins Multilingual database (MLD) – WordBank with audio files

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Nov 18, 2016
    + more versions
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2016). Collins Multilingual database (MLD) – WordBank with audio files [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0382/
    Dataset updated
    Nov 18, 2016
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    The Collins Multilingual database covers Real Life Daily vocabulary. It is composed of a multilingual lexicon in 32 languages (the WordBank, see ELRA-T0376) and a multilingual set of sentences in 28 languages (the PhraseBank, see ELRA-T0377). This version includes the corresponding audio files covering 26 of the 32 languages available in the Collins MLD WordBank: Arabic, Chinese, Croatian, Czech, Danish, Dutch, American English, British English, Finnish, French, German, Greek, Italian, Japanese, Korean, Norwegian, Polish, Portuguese (Iberian), Portuguese (Brazilian), Russian, Spanish (Iberian), Spanish (Latin American), Swedish, Thai, Turkish, Vietnamese. The WordBank contains 10,000 words for each language, XML-annotated for part-of-speech, gender, irregular forms and disambiguating information for homographs. An additional dataset of 10,000 headwords is included for 12 languages (Chinese, American and British English, French, German, Italian, Japanese, Korean, Iberian and Brazilian Portuguese, Iberian and Latin American Spanish). The full database contains 10,000 audio files for each language (26 languages), and 10,000 additional audio files corresponding to the 10,000 additional headwords in 12 languages. Audio was recorded by native speakers.
