18 datasets found
  1. Ranking of languages spoken at home in the U.S. 2023

    • statista.com
    Updated Apr 14, 2025
    Cite
    Statista (2025). Ranking of languages spoken at home in the U.S. 2023 [Dataset]. https://www.statista.com/statistics/183483/ranking-of-languages-spoken-at-home-in-the-us-in-2008/
    Dataset updated
    Apr 14, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2023
    Area covered
    United States
    Description

    In 2023, around 43.37 million people in the United States spoke Spanish at home. In comparison, approximately 998,179 people spoke Russian at home in the same year. The distribution of the U.S. population by ethnicity can be accessed here. A ranking of the most spoken languages across the world can be accessed here.

  2. ksponspeech

    • huggingface.co
    Updated Jun 17, 2025
    + more versions
    Cite
    Cheul (2025). ksponspeech [Dataset]. https://huggingface.co/datasets/cheulyop/ksponspeech
    Dataset updated
    Jun 17, 2025
    Authors
    Cheul
    Description

    KsponSpeech is a large-scale spontaneous speech corpus of Korean conversations. It contains 969 hours of general open-domain dialog utterances, spoken by about 2,000 native Korean speakers in a clean environment. All data were constructed by recording the dialogue of two people freely conversing on a variety of topics and manually transcribing the utterances. Each utterance has a dual transcription consisting of orthography and pronunciation, along with disfluency tags for spontaneous speech phenomena such as filler words, repeated words, and word fragments. KsponSpeech is publicly available on an open data hub site of the Korean government (https://aihub.or.kr/aidata/105).
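
    Since this corpus is mirrored on the Hugging Face Hub, a minimal sketch of loading it with the Python datasets library is shown below. The split name and field layout are assumptions and should be checked against the dataset page.

        # Minimal sketch (assumptions: the repo id from the citation above and a "train" split exist).
        from datasets import load_dataset

        ksponspeech = load_dataset("cheulyop/ksponspeech", split="train")
        print(ksponspeech)       # lists the available columns and number of rows
        print(ksponspeech[0])    # first utterance, e.g. audio plus its transcription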

  3. Dolphin_Model_North-Korean-Conversational-Speech-Recognition-Corpus

    • huggingface.co
    Updated Apr 26, 2025
    + more versions
    Cite
    Dataocean AI (2025). Dolphin_Model_North-Korean-Conversational-Speech-Recognition-Corpus [Dataset]. https://huggingface.co/datasets/DataoceanAI/Dolphin_Model_North-Korean-Conversational-Speech-Recognition-Corpus
    Dataset updated
    Apr 26, 2025
    Authors
    Dataocean AI
    Area covered
    North Korea
    Description

    ID: King-ASR-217
    Duration: 511 hours
    Recording Device: Telephone
    Description: This dataset was recorded in a quiet office/home environment, with a total of 636 speakers participating, including 283 males and 353 females. All speakers involved in the recording were professionally selected to ensure standard pronunciation and clear articulation. The recorded text covers health, sports, travel, and other information.
    URL… See the full description on the dataset page: https://huggingface.co/datasets/DataoceanAI/Dolphin_Model_North-Korean-Conversational-Speech-Recognition-Corpus.
    
  4. Table2_L2 Processing of Words Containing English /æ/-/ɛ/ and /l/-/ɹ/...

    • frontiersin.figshare.com
    docx
    Updated Jun 9, 2023
    + more versions
    Cite
    Shannon Barrios; Rachel Hayes-Harb (2023). Table2_L2 Processing of Words Containing English /æ/-/ɛ/ and /l/-/ɹ/ Contrasts, and the Uses and Limits of the Auditory Lexical Decision Task for Understanding the Locus of Difficulty.DOCX [Dataset]. http://doi.org/10.3389/fcomm.2021.689470.s002
    Available download formats: docx
    Dataset updated
    Jun 9, 2023
    Dataset provided by
    Frontiers
    Authors
    Shannon Barrios; Rachel Hayes-Harb
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Second language (L2) learners often exhibit difficulty perceiving novel phonological contrasts and/or using them to distinguish similar-sounding words. The auditory lexical decision (LD) task has emerged as a promising method to elicit the asymmetries in lexical processing performance that help to identify the locus of learners’ difficulty. However, LD tasks have been implemented and interpreted variably in the literature, complicating their utility in distinguishing between cases where learners’ difficulty lies at the level of perceptual and/or lexical coding. Building on previous work, we elaborate a set of LD ordinal accuracy predictions associated with various logically possible scenarios concerning the locus of learner difficulty, and provide new LD data involving multiple contrasts and native language (L1) groups. The inclusion of a native speaker control group allows us to isolate which patterns are unique to L2 learners, and the combination of multiple contrasts and L1 groups allows us to elicit evidence of various scenarios. We present findings of an experiment where native English, Korean, and Mandarin speakers completed an LD task that probed the robustness of listeners’ phonological representations of the English /æ/-/ɛ/ and /l/-/ɹ/ contrasts. Words contained the target phonemes, and nonwords were created by replacing the target phoneme with its counterpart (e.g., lecture/*[ɹ]ecture, battle/*b[ɛ]ttle). For the /æ/-/ɛ/ contrast, all three groups exhibited the same pattern of accuracy: near-ceiling acceptance of words and an asymmetric pattern of responses to nonwords, with higher accuracy for nonwords containing [æ] than [ɛ]. For the /l/-/ɹ/ contrast, we found three distinct accuracy patterns: native English speakers’ performance was highly accurate and symmetric for words and nonwords, native Mandarin speakers exhibited asymmetries favoring [l] items for words and nonwords (interpreted as evidence that they experienced difficulty at the perceptual coding level), and native Korean speakers exhibited asymmetries in opposite directions for words (favoring [l]) and nonwords (favoring [ɹ]; evidence of difficulty at the lexical coding level). Our findings suggest that the auditory LD task holds promise for determining the locus of learners’ difficulty with L2 contrasts; however, we raise several issues requiring attention to maximize its utility in investigating L2 phonolexical processing.

  5. kor_3i4k

    • huggingface.co
    • opendatalab.com
    Updated Jan 13, 2021
    Cite
    Won Ik Cho (2021). kor_3i4k [Dataset]. https://huggingface.co/datasets/wicho/kor_3i4k
    Available formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 13, 2021
    Authors
    Won Ik Cho
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for 3i4K

      Dataset Summary
    

    The 3i4K dataset is a set of frequently used Korean words (corpus provided by the Seoul National University Speech Language Processing Lab) and manually created questions/commands consisting of short utterances. The goal is to identify the speaker intention of a spoken utterance based on its transcript and, in some cases, auxiliary acoustic features. The classification system decides whether the utterance is a… See the full description on the dataset page: https://huggingface.co/datasets/wicho/kor_3i4k.

  6. KSS_Dataset

    • huggingface.co
    Updated Apr 15, 2018
    Cite
    Dowon Hwang (2018). KSS_Dataset [Dataset]. https://huggingface.co/datasets/Bingsu/KSS_Dataset
    Available formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 15, 2018
    Authors
    Dowon Hwang
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Description of the original author

      KSS Dataset: Korean Single speaker Speech Dataset
    

    KSS Dataset is designed for the Korean text-to-speech task. It consists of audio files recorded by a professional female voice actress and their aligned text extracted from my books. As the copyright holder, by courtesy of the publishers, I release this dataset to the public. To the best of my knowledge, this is the first publicly available speech dataset for Korean.

      File Format
    

    Each… See the full description on the dataset page: https://huggingface.co/datasets/Bingsu/KSS_Dataset.

  7. L2Arctic

    • huggingface.co
    Updated Jul 11, 2025
    Cite
    Koel Labs (2025). L2Arctic [Dataset]. https://huggingface.co/datasets/KoelLabs/L2Arctic
    Dataset updated
    Jul 11, 2025
    Dataset authored and provided by
    Koel Labs
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    L2-ARCTIC: a non-native English speech corpus

    L2-ARCTIC contains English speech from 24 non-native English speakers with Vietnamese, Korean, Mandarin, Spanish, Hindi, and Arabic language backgrounds. It contains phonemic annotations using the sounds supported by ARPABet. It was compiled by researchers at Texas A&M University and Iowa State University. Read more on their official website.

      This Processed Version
    

    We have processed the dataset into an easily consumable Hugging Face dataset… See the full description on the dataset page: https://huggingface.co/datasets/KoelLabs/L2Arctic.

  8. Willingness to send children to study abroad South Korea 2023

    • statista.com
    Updated Aug 15, 2025
    Cite
    Statista (2025). Willingness to send children to study abroad South Korea 2023 [Dataset]. https://www.statista.com/statistics/1060268/south-korea-willingness-to-send-their-children-to-study-abroad/
    Dataset updated
    Aug 15, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Jul 31, 2023 - Aug 17, 2023
    Area covered
    South Korea
    Description

    According to a study conducted in South Korea in 2023, over half of the respondents indicated that they would not send their children to study abroad, while about ** percent stated the opposite.

    Korean students studying abroad: Many Korean students choose to study abroad because they can benefit from the advanced educational system and various opportunities available. The United States is the most preferred destination for Korean students due to its status as an English-speaking country. In the 2022/23 academic year, more than ****** South Korean students were enrolled in U.S. universities.

    Internationalization of Korean universities: In response to this trend, the South Korean government is actively working to globalize universities. The aim is to provide diverse opportunities for Korean students without the need to study abroad. The Songdo commercial district has been designated to attract universities from other countries. Furthermore, many other higher education institutions have been signing contracts with foreign partners for credit exchanges, degree programs, research projects, and instructor exchanges. As a result, local undergraduate students can now easily take courses in English and interact with foreign professors and students from diverse cultural backgrounds.

  9. Export value of sauces South Korea 2017-2023

    • statista.com
    Updated Jun 30, 2025
    Cite
    Statista (2025). Export value of sauces South Korea 2017-2023 [Dataset]. https://www.statista.com/statistics/1226934/south-korea-export-value-sauces/
    Dataset updated
    Jun 30, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    South Korea
    Description

    In 2023, the value of sauces and sauce products exported from South Korea amounted to around ******** U.S. dollars, an increase from the previous year. During recent years, Korean food items, which include sauces, have experienced an increase in popularity. Thanks to that, exports of sauces and pastes like the well-known red pepper paste gochujang have been increasing to keep up with the growing demand.

    The rise of K-food: As access to Korean culture is now easy thanks to the Korean wave, Hallyu, and internet trends, more consumers have discovered Korean cuisine for themselves. While taste plays a major part in attracting a new food audience, Korean media as well as celebrities and influencers add further motivation. The power of media for food trends is visible with examples like the Korean hot noodle challenge and dalgona.

    Consumers' access to Korean food: Especially in major Asian and English-speaking cities, consumers enjoyed visiting Korean restaurants. According to a survey, those who ate Korean food spent an average of around ** U.S. dollars per month on it. The cuisine is known for featuring a variety of vegetables and healthy cooking methods. Nonetheless, Korean-style fried chicken and instant noodles are the most popular K-food dishes globally, far ahead of traditional dishes.

  10. L2ArcticSpontaneousSplit

    • huggingface.co
    Updated Aug 17, 2025
    Cite
    Koel Labs (2025). L2ArcticSpontaneousSplit [Dataset]. https://huggingface.co/datasets/KoelLabs/L2ArcticSpontaneousSplit
    Dataset updated
    Aug 17, 2025
    Dataset authored and provided by
    Koel Labs
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    L2-ARCTIC Suitcase: a spontaneous non-native English speech corpus

    L2-ARCTIC Suitcase Corpus contains English speech from 22 non-native English speakers with Vietnamese, Korean, Mandarin, Spanish, Hindi, and Arabic language backgrounds. It contains phonemic annotations using the sounds supported by ARPABet. It was compiled by researchers at Texas A&M University and Iowa State University. Read more on their official website.

      This Processed Version
    

    We have processed the dataset into an… See the full description on the dataset page: https://huggingface.co/datasets/KoelLabs/L2ArcticSpontaneousSplit.

  11. Global Intelligent Voice Assistant Market Research Report: By Device...

    • wiseguyreports.com
    Updated Jul 23, 2024
    + more versions
    Cite
    Wiseguy Research Consultants Pvt Ltd (2024). Global Intelligent Voice Assistant Market Research Report: By Device (Smartphones, Smart Speakers, Smart Displays, Wearables), By End-User (Personal, Enterprise, Government), By Application (Control Devices, Information Retrieval, Communication, Entertainment, Shopping, Banking), By Language Support (English, Mandarin, Spanish, Hindi, Arabic, French, Japanese, Korean), By Vertical (Healthcare, Retail, Banking and Finance, Education, Manufacturing, Automotive) and By Regional (North America, Europe, South America, Asia Pacific, Middle East and Africa) - Forecast to 2032. [Dataset]. https://www.wiseguyreports.com/reports/intelligent-voice-assistant-market
    Dataset updated
    Jul 23, 2024
    Dataset authored and provided by
    Wiseguy Research Consultants Pvt Ltd
    License

    https://www.wiseguyreports.com/pages/privacy-policy

    Time period covered
    Jan 7, 2024
    Area covered
    Global
    Description
    BASE YEAR: 2024
    HISTORICAL DATA: 2019 - 2024
    REPORT COVERAGE: Revenue Forecast, Competitive Landscape, Growth Factors, and Trends
    MARKET SIZE 2023: 15.4 (USD Billion)
    MARKET SIZE 2024: 17.97 (USD Billion)
    MARKET SIZE 2032: 61.9 (USD Billion)
    SEGMENTS COVERED: Device, End-User, Application, Language Support, Vertical, Regional
    COUNTRIES COVERED: North America, Europe, APAC, South America, MEA
    KEY MARKET DYNAMICS: AI Integration, Rising Demand, IoT Advancements, Privacy Concerns
    MARKET FORECAST UNITS: USD Billion
    KEY COMPANIES PROFILED: NICE Ltd., Apple Inc., Tencent Holdings Ltd., Google LLC, Genesys Telecommunications Laboratories, Inc., IBM Corporation, Alibaba Group Holding Ltd., Microsoft Corporation, Amazon Web Services, SoundHound Inc., Sensory Inc., Verint Systems Inc., Nuance Communications Inc., Samsung Electronics Co., Ltd., Baidu Inc.
    MARKET FORECAST PERIOD: 2024 - 2032
    KEY MARKET OPPORTUNITIES: 1. Enhanced user experience; 2. Growing demand for smart homes; 3. Increased adoption in healthcare; 4. Integration with IoT devices; 5. Expansion into emerging markets
    COMPOUND ANNUAL GROWTH RATE (CAGR): 16.71% (2024 - 2032)
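
    As a quick sanity check on the figures above, the stated CAGR can be reproduced from the 2024 and 2032 market-size values, assuming an eight-year compounding span:

        # Does 17.97 -> 61.9 (USD billion) over 2024-2032 match the stated 16.71% CAGR?
        start, end, years = 17.97, 61.9, 2032 - 2024
        cagr = (end / start) ** (1 / years) - 1
        print(f"implied CAGR: {cagr:.2%}")   # roughly 16.7%, consistent with the report
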
  12. fleurs

    • huggingface.co
    • opendatalab.com
    Updated Jun 4, 2022
    + more versions
    Cite
    Google (2022). fleurs [Dataset]. https://huggingface.co/datasets/google/fleurs
    Dataset updated
    Jun 4, 2022
    Dataset authored and provided by
    Google (http://google.com/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    FLEURS

    Fleurs is the speech version of the FLoRes machine translation benchmark. It uses 2,009 n-way parallel sentences from the publicly available FLoRes dev and devtest sets, in 102 languages. Training sets have around 10 hours of supervision, and speakers of the train sets are different from the speakers in the dev/test sets. Multilingual fine-tuning is used, and the "unit error rate" (characters, signs) is averaged over all languages. Languages and results are also grouped into seven… See the full description on the dataset page: https://huggingface.co/datasets/google/fleurs.
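
    The "unit error rate" mentioned above is an edit-distance metric computed over characters or signs. The sketch below is a minimal character error rate shown purely to illustrate that idea; it is not the benchmark's official scoring code.

        # Illustrative character error rate (CER): Levenshtein distance between
        # reference and hypothesis, normalised by the reference length.
        def cer(reference: str, hypothesis: str) -> float:
            prev = list(range(len(hypothesis) + 1))
            for i, r in enumerate(reference, start=1):
                curr = [i]
                for j, h in enumerate(hypothesis, start=1):
                    curr.append(min(prev[j] + 1,              # deletion
                                    curr[j - 1] + 1,          # insertion
                                    prev[j - 1] + (r != h)))  # substitution
                prev = curr
            return prev[-1] / max(len(reference), 1)

        print(cer("speech", "speach"))   # 1 edit / 6 characters, approximately 0.17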

  13. CALLFRIEND Russian Speech

    • dvrs-applnxprd2.library.ubc.ca
    iso, txt
    Updated Oct 16, 2023
    + more versions
    Cite
    Abacus Data Network (2023). CALLFRIEND Russian Speech [Dataset]. https://dvrs-applnxprd2.library.ubc.ca/dataset.xhtml?persistentId=hdl:11272.1/AB2/NGRVVO
    Available download formats: txt (1308), iso (2403022848)
    Dataset updated
    Oct 16, 2023
    Dataset provided by
    Abacus Data Network
    Description

    Introduction: CALLFRIEND Russian Speech (LDC2023S08) was developed by the Linguistic Data Consortium (LDC) and consists of approximately 48 hours of telephone conversations (100 recordings) between native speakers of Russian. The calls were recorded in 1999 as part of the CALLFRIEND collection. One hundred native Russian speakers living in the continental United States each made a single phone call, lasting up to 30 minutes, to a family member or friend living in the United States. Corresponding transcripts and a lexicon are available in CALLFRIEND Russian Text (LDC2023T09). The CALLFRIEND series is a collection of telephone conversations in several languages conducted by LDC in support of language identification technology development. Languages covered in the collection include American English, Canadian French, Egyptian Arabic, Farsi, German, Hindi, Japanese, Korean, Mandarin Chinese, Russian, Spanish, Tamil and Vietnamese.

    Data: All recordings involved domestic calls routed through the automated telephone collection platform at LDC and were stored as 2-channel (4-wire) 8 kHz mu-law samples taken directly from a public telephone network via a T-1 circuit. Each audio file is a FLAC-compressed MS-WAV (RIFF) file containing 2-channel, 8 kHz, 16-bit PCM sample data. This release includes call metadata, including speaker gender, the number of speakers on each channel, and call duration.
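
    Given the stated format (FLAC-compressed two-channel audio at 8 kHz, one conversation side per channel), a minimal sketch for reading a recording and separating the two sides might look like the following; the file name is a hypothetical placeholder and the soundfile package must be installed.

        # Sketch: read one CALLFRIEND-style recording and split the two sides.
        # "call_0001.flac" is a hypothetical file name used only for illustration.
        import soundfile as sf

        audio, sample_rate = sf.read("call_0001.flac")   # array of shape (n_frames, 2)
        side_a, side_b = audio[:, 0], audio[:, 1]        # one conversation side per channel
        duration_min = len(audio) / sample_rate / 60
        print(f"{sample_rate} Hz, {duration_min:.1f} minutes")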

  14. Phonetically Balanced Words (2)

    • catalog.elra.info
    • live.european-language-grid.eu
    Updated May 10, 2005
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2005). Phonetically Balanced Words (2) [Dataset]. https://catalog.elra.info/en-us/repository/browse/ELRA-S0125/
    Dataset updated
    May 10, 2005
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    Large acoustic corpus of read text in Korean produced by KAIST Korterm. Native Korean speakers (males and females) uttered 36 geographical proper nouns. Information such as the size and the level of studies of the speakers is provided. The recordings took place in a soundproof room. The data are stored in 8-bit A-law speech files, with a 16 kHz sampling rate. The standard in use is NIST.

  15. KUDGE

    • huggingface.co
    Updated Sep 17, 2024
    Cite
    HAE-RAE (2024). KUDGE [Dataset]. https://huggingface.co/datasets/HAERAE-HUB/KUDGE
    Available formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 17, 2024
    Dataset authored and provided by
    HAE-RAE
    Description

    Official data repository for "LLM-as-a-Judge & Reward Model: What They Can and Cannot Do". TL;DR: Automated evaluators (LLM-as-a-Judge, reward models) can be transferred to non-English settings without additional training (most of the time).

      Dataset Description
    

    To the best of our knowledge, KUDGE is currently the only non-English, human-annotated meta-evaluation dataset. Consisting of 5,012 human annotations from native Korean speakers, we expect KUDGE to be widely used as a tool… See the full description on the dataset page: https://huggingface.co/datasets/HAERAE-HUB/KUDGE.

  16. Primary Language of Newly Medi-Cal Eligible Individuals

    • data.chhs.ca.gov
    • data.ca.gov
    • +3more
    csv, zip
    Updated Mar 19, 2025
    Cite
    Department of Health Care Services (2025). Primary Language of Newly Medi-Cal Eligible Individuals [Dataset]. https://data.chhs.ca.gov/dataset/primary-language-of-newly-medi-cal-eligible-individuals
    Available download formats: zip, csv (32459)
    Dataset updated
    Mar 19, 2025
    Dataset provided by
    California Department of Health Care Services (http://www.dhcs.ca.gov/)
    Authors
    Department of Health Care Services
    Description

    This dataset includes the primary language of newly Medi-Cal eligible individuals who identified their primary language as English, Spanish, Vietnamese, Mandarin, Cantonese, Arabic, Other Non-English, Armenian, Russian, Farsi, Korean, Tagalog, Other Chinese Languages, Hmong, Cambodian, Portuguese, Lao, French, Thai, Japanese, Samoan, Other Sign Language, American Sign Language (ASL), Turkish, Ilacano, Mien, Italian, Hebrew, and Polish, by reporting period. The primary language data is from the Medi-Cal Eligibility Data System (MEDS) and includes eligible individuals without prior Medi-Cal eligibility. This dataset is part of the public reporting requirements set forth in California Welfare and Institutions Code 14102.5.
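
    Since the data are published as CSV, a small sketch of tallying newly eligible individuals per primary language is given below. The file name and the column names ("Primary Language", "Count") are hypothetical placeholders; check the actual header on the portal before use.

        # Sketch with hypothetical file and column names.
        import pandas as pd

        df = pd.read_csv("primary-language-of-newly-medi-cal-eligible-individuals.csv")
        by_language = (df.groupby("Primary Language")["Count"]
                         .sum()
                         .sort_values(ascending=False))
        print(by_language.head(10))   # largest primary-language groups first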

  17. GlobalPhone Spanish (Latin American) Pronunciation Dictionary

    • catalog.elra.info
    • live.european-language-grid.eu
    Updated Nov 25, 2014
    + more versions
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2014). GlobalPhone Spanish (Latin American) Pronunciation Dictionary [Dataset]. https://catalog.elra.info/en-us/repository/browse/ELRA-S0360/
    Dataset updated
    Nov 25, 2014
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Area covered
    Latin America
    Description

    The GlobalPhone pronunciation dictionaries, created within the framework of the multilingual speech and language corpus GlobalPhone, were developed in collaboration with the Karlsruhe Institute of Technology (KIT). They contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The pronunciation dictionaries are currently available in 18 languages: Arabic (29,230 entries/27,059 words), Bulgarian (20,193 entries), Croatian (23,497 entries/20,628 words), Czech (33,049 entries/32,942 words), French (36,837 entries/20,710 words), German (48,979 entries/46,035 words), Hausa (42,662 entries/42,079 words), Japanese (18,094 entries), Polish (36,484 entries), Portuguese (Brazilian) (54,146 entries/54,130 words), Russian (28,818 entries/27,667 words), Spanish (Latin American) (43,264 entries/33,960 words), Swedish (about 25,000 entries), Turkish (31,330 entries/31,087 words), Vietnamese (38,504 entries/29,974 words), Chinese-Mandarin (73,388 pronunciations), Korean (3,500 syllables), and Thai (a small set with 12,420 pronunciation entries for 12,420 different words, without pronunciation variants, and a larger set with 25,570 pronunciation entries for 22,462 different word units, including 3,108 entries of up to four pronunciation variants).

    1) Dictionary Encoding: The pronunciation dictionary entries consist of full word forms and are either given in the original script of that language, mostly in UTF-8 encoding (Bulgarian, Croatian, Czech, French, Polish, Russian, Spanish, Thai), corresponding to the trl-files of the GlobalPhone transcriptions, or in Romanized script (Arabic, German, Hausa, Japanese, Korean, Mandarin, Portuguese, Swedish, Turkish, Vietnamese), corresponding to the rmn-files of the GlobalPhone transcriptions. In the latter case the documentation mostly provides a mapping from the Romanized to the original script.

    2) Dictionary Phone Set: The phone sets for each language were derived individually from the literature following best practices for automatic speech processing. Each phone set is explained and described in the documentation using the international standards of the International Phonetic Alphabet (IPA). For most languages a mapping to the language-independent GlobalPhone naming conventions (indicated by "M_") is provided for the purpose of data sharing across languages to build multilingual acoustic models.

    3) Dictionary Generation: Whenever the grapheme-to-phoneme relationship allowed, the dictionaries were created semi-automatically in a rule-based fashion using a set of grapheme-to-phoneme mapping rules. The number of rules depends heavily on the language. After the automatic creation process, all dictionaries were manually cross-checked by native speakers, correcting potential errors of the automatic pronunciation generation process. Most of the dictionaries have been applied to large vocabulary speech recognition. In ...
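
    The dictionaries were largely generated with grapheme-to-phoneme rules and then hand-checked by native speakers. The toy sketch below only illustrates the rule-based idea on a few simplified Spanish-like mappings; it is not the actual GlobalPhone rule set.

        # Toy rule-based grapheme-to-phoneme mapping (simplified example rules).
        RULES = [
            ("ch", "tS"),   # try multi-letter graphemes before single letters
            ("ll", "j"),
            ("qu", "k"),
            ("c", "k"),
            ("z", "s"),
            ("v", "b"),
        ]

        def g2p(word: str) -> list[str]:
            phones, i = [], 0
            while i < len(word):
                for grapheme, phone in RULES:
                    if word.startswith(grapheme, i):
                        phones.append(phone)
                        i += len(grapheme)
                        break
                else:                       # no rule matched: keep the letter as-is
                    phones.append(word[i])
                    i += 1
            return phones

        print(g2p("chavez"))   # ['tS', 'a', 'b', 'e', 's']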

  18. Collins Multilingual database (MLD) – WordBank with audio files

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Nov 18, 2016
    + more versions
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2016). Collins Multilingual database (MLD) – WordBank with audio files [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0382/
    Dataset updated
    Nov 18, 2016
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    The Collins Multilingual database covers Real Life Daily vocabulary. It is composed of a multilingual lexicon in 32 languages (the WordBank, see ELRA-T0376) and a multilingual set of sentences in 28 languages (the PhraseBank, see ELRA-T0377). This version includes the corresponding audio files covering 26 of the 32 languages available in the Collins MLD WordBank: Arabic, Chinese, Croatian, Czech, Danish, Dutch, American English, British English, Finnish, French, German, Greek, Italian, Japanese, Korean, Norwegian, Polish, Portuguese (Iberian), Portuguese (Brazilian), Russian, Spanish (Iberian), Spanish (Latin American), Swedish, Thai, Turkish, Vietnamese. The WordBank contains 10,000 words for each language, XML-annotated for part-of-speech, gender, irregular forms and disambiguating information for homographs. An additional dataset of 10,000 headwords is included for 12 languages (Chinese, American and British English, French, German, Italian, Japanese, Korean, Iberian and Brazilian Portuguese, Iberian and Latin American Spanish). The full database contains 10,000 audio files for each language (26 languages), and 10,000 additional audio files corresponding to the 10,000 additional headwords in 12 languages. Audio was recorded by native speakers.
