85 datasets found
  1. The most spoken languages worldwide 2025

    • statista.com
    Updated Apr 14, 2025
    Cite
    Statista (2025). The most spoken languages worldwide 2025 [Dataset]. https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/
    Dataset updated
    Apr 14, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2025
    Area covered
    World
    Description

    In 2025, there were around 1.53 billion people worldwide who spoke English either natively or as a second language, slightly more than the 1.18 billion Mandarin Chinese speakers at the time of survey. Hindi and Spanish were the third and fourth most widespread languages that year.

    Languages in the United States

    The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken in the United States vary as a result of the multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which more than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese), 1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year.

    Different languages at home

    The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage of its population speaking a language other than English is California. About 45 percent of its population spoke a language other than English at home in 2023.

  2. The Most Spoken Languages Around the World

    • kaggle.com
    Updated Nov 4, 2020
    Cite
    Narmelan Tharmalingam (2020). The Most Spoken Languages Around the World [Dataset]. https://www.kaggle.com/narmelan/100-most-spoken-languages-around-the-world/metadata
    Dataset updated
    Nov 4, 2020
    Dataset provided by
    Kaggle
    Authors
    Narmelan Tharmalingam
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Area covered
    World
    Description

    Context

    After going through quite the verbal loop when ordering foreign currency through the bank, which involved a discussion with an assigned financial advisor at the branch the following day to confirm details, I noticed that, despite our names hinting at similar backgrounds, communication by phone was much more difficult because of thick accents and the different speech patterns of a non-native speaker.

    It hit me then, coming from an extremely multicultural and welcoming city, what challenges others with completely different labels given to them in life must go through in their daily affairs when they face the communication barriers that I myself encountered, particularly when interacting with those outside their usual bubble. Now imagine this situation occurring every hour across the world in various sectors of business. How may this impede, help or create frustrations in minor or major ways as a result of increasing workplace diversity quota demands, customer satisfaction needs and process efficiencies?

    The data I was looking for to explore this phenomenon existed in the form of native and non-native speaker counts for the 100 most commonly spoken languages across the globe.

    Content

    The data in this database contains the following attributes:

    • Language - name of the language
    • Total Speakers - this assumes both native and non-native speakers
    • Native Speakers - native speakers of the language
    • Origin - family origin group of said language
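    The four attributes above can be modelled as a small record type. The following is only a sketch: the class name, field names, and the placeholder rows are assumptions made for illustration and are not values from the dataset. It relies on the statement that Total Speakers covers both native and non-native speakers, so the non-native count is the difference between the two columns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageRecord:
    """One row of the dataset (field names are hypothetical)."""
    language: str         # Language: name of the language
    total_speakers: int   # Total Speakers: native + non-native
    native_speakers: int  # Native Speakers
    origin: str           # Origin: language family

    @property
    def non_native_speakers(self) -> int:
        # Total is stated to include both groups, so subtract.
        return self.total_speakers - self.native_speakers

    @property
    def native_share(self) -> float:
        # Fraction of all speakers who are native speakers.
        return self.native_speakers / self.total_speakers


# Placeholder rows only; these are NOT figures from the dataset.
rows = [
    LanguageRecord("Placeholder A", 1000, 400, "Family X"),
    LanguageRecord("Placeholder B", 500, 450, "Family Y"),
]

# Rank languages by how many non-native speakers they have.
by_non_native = sorted(rows, key=lambda r: r.non_native_speakers, reverse=True)
```

    Sorting by the derived non-native count is one way to approach the question the author raises about communication between native and non-native speakers.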

    Acknowledgements

    The data was collected with the aid of WordTips visualization of the 22nd edition of Ethnologue - "a research center for language intelligence"

    https://www.ethnologue.com/world
    https://www.ethnologue.com/guides/ethnologue200
    https://word.tips/pictures/b684e98f-f512-4ac0-96a4-0efcf6decbc0_most-spoken-languages-world-5.png?auto=compress,format&rect=0,0,2001,7115&w=800&h=2845

    Inspiration

    As globalization no longer constrains us, what implications will this have in terms of organizational communications conducted moving forward? I believe this is something to be examined in careful context in order to make customer relationship processes meaningful rather than it being confined to a strictly detached transactional basis.

  3. Number of native Spanish speakers worldwide 2024, by country

    • statista.com
    Updated Jan 15, 2025
    Cite
    Statista (2025). Number of native Spanish speakers worldwide 2024, by country [Dataset]. https://www.statista.com/statistics/991020/number-native-spanish-speakers-country-worldwide/
    Dataset updated
    Jan 15, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    World
    Description

    Mexico is the country with the largest number of native Spanish speakers in the world. As of 2024, 132.5 million people in Mexico spoke Spanish with a native command of the language. Colombia was the nation with the second-highest number of native Spanish speakers, at around 52.7 million. Spain came in third, with 48 million, and Argentina fourth, with 46 million.

    Spanish, a world language

    As of 2023, Spanish ranked as the fourth most spoken language in the world, behind only English, Chinese, and Hindi, with over half a billion speakers. Spanish is the official language of over 20 countries, the majority on the American continent; it is also one of the official languages of Equatorial Guinea in Africa. Spanish also has a strong presence in other countries, such as the United States, Morocco, and Brazil, which are among the non-Hispanic countries with the highest numbers of Spanish speakers.

    The second most spoken language in the U.S.

    In the most recent data, Spanish ranked as the language, other than English, with the highest number of speakers in the U.S., with 12 times as many speakers as the language in second place. This comes as no surprise given the long history of migration from Latin American countries to the United States. Moreover, in fiscal year 2022 alone, 5 of the top 10 countries of origin of naturalized people in the U.S. were Spanish-speaking countries.

  4. MCB_languages_county

    • kaggle.com
    Updated Oct 1, 2019
    Cite
    Marisol Brewster (2019). MCB_languages_county [Dataset]. https://www.kaggle.com/mcbrewster/mcb-languages-county/discussion
    Dataset updated
    Oct 1, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Marisol Brewster
    Description

    Context

    This is a dataset I found online through the Google Dataset Search portal.

    Content

    The American Community Survey (ACS) 2009-2013 multi-year data are used to list all languages spoken in the United States that were reported during the sample period. These tables provide detailed counts of many more languages than the 39 languages and language groups that are published annually as a part of the routine ACS data release. This is the second tabulation beyond 39 languages since ACS began.

    The tables include all languages that were reported in each geography during the 2009 to 2013 sampling period. For the purpose of tabulation, reported languages are classified in one of 380 possible languages or language groups. Because the data are a sample of the total population, there may be languages spoken that are not reported, either because the ACS did not sample the households where those languages are spoken, or because the person filling out the survey did not report the language or reported another language instead.

    The tables also provide information about self-reported English-speaking ability. Respondents who reported speaking a language other than English were asked to indicate their ability to speak English in one of the following categories: "Very well," "Well," "Not well," or "Not at all." The data on ability to speak English represent the person’s own perception about his or her own ability or, because ACS questionnaires are usually completed by one household member, the responses may represent the perception of another household member.

    These tables are also available through the Census Bureau's application programming interface (API). Please see the developers page for additional details on how to use the API to access these data.

    Acknowledgements

    Sources:

    Google Dataset Search: https://toolbox.google.com/datasetsearch

    2009-2013 American Community Survey

    Original dataset: https://www.census.gov/data/tables/2013/demo/2009-2013-lang-tables.html

    Downloaded From: https://data.world/kvaughn/languages-county

    Banner and thumbnail photo by Farzad Mohsenvand on Unsplash

  5. GlobalPhone Hausa

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Jun 26, 2017
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2017). GlobalPhone Hausa [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0347/
    Dataset updated
    Jun 26, 2017
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).

    In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering and false starts, and for non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database.

    The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers. Data is shortened by means of the shorten program written by Tony Robinson. Alternatively, the data can be delivered unshortened.

    Hausa is a member of the Chadic language family, and belongs together with the Semitic and Cushitic languages to the Afroasiatic language family. With over 25 million speakers, it is widely spoken in West Africa. The collection of the Hausa speech and text corpus followed the GlobalPhone collection standards. First, a large text corpus was built by crawling websites that cover main Hausa newspaper sources. Hausa’s modern official orthography is a Latin-based alphabet called Boko, which was imposed in the 1930s by the British colonial administration. It consists of 22 characters of the English alphabet plus five special characters. The collection is based on five main newspapers written in Boko. After cleaning and normalization, these texts were used to build language models and to select prompts for the speech data recordings. Native speakers of Hausa were asked to read prompted sentences of newspaper articles. The entire collection...
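    As a rough guide to what the stated audio format implies for storage, the sketch below estimates the size of uncompressed PCM audio at 16 bits, 16 kHz, mono. It assumes raw samples with no file headers and no shorten compression, and it uses the 450-hour figure as a lower bound, so the result is only an order-of-magnitude estimate. The function and constant names are illustrative.

```python
SAMPLE_RATE_HZ = 16_000   # 16 kHz, as stated in the description
BITS_PER_SAMPLE = 16      # 16-bit samples
CHANNELS = 1              # mono


def pcm_bytes(duration_s: float) -> int:
    """Bytes of raw PCM audio for a given duration (no headers, no compression)."""
    bytes_per_second = SAMPLE_RATE_HZ * (BITS_PER_SAMPLE // 8) * CHANNELS
    return int(duration_s * bytes_per_second)


# One hour of speech, and the "over 450 hours" quoted for the whole corpus.
one_hour_bytes = pcm_bytes(3600)
corpus_bytes = pcm_bytes(450 * 3600)

print(f"1 hour    : {one_hour_bytes / 1e6:.1f} MB")
print(f"450 hours : {corpus_bytes / 1e9:.2f} GB (decimal units)")
```

    At 32,000 bytes per second, 450 hours comes to 51,840,000,000 bytes, which gives some context for why the data is distributed in shortened form.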

  6. GlobalPhone Portuguese (Brazilian)

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Jun 26, 2017
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2017). GlobalPhone Portuguese (Brazilian) [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0201/
    Dataset updated
    Jun 26, 2017
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Area covered
    Brazil
    Description

    The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).

    In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering and false starts, and for non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database.

    The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers. Data is shortened by means of the shorten program written by Tony Robinson. Alternatively, the data can be delivered unshortened.

    The Portuguese (Brazilian) corpus was produced using the Folha de Sao Paulo newspaper. It contains recordings of 102 speakers (54 males, 48 females) recorded in Porto Velho and Sao Paulo, Brazil. The following age distribution has been obtained: 6 speakers are below 19, 58 speakers are between 20 and 29, 27 speakers are between 30 and 39, 5 speakers are between 40 and 49, and 5 speakers are over 50 (1 speaker age is unknown).
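    The speaker counts quoted above can be cross-checked with a few lines of arithmetic. The sketch below simply re-enters the figures from the description (the dictionary keys are informal labels chosen here, not field names from the corpus) and confirms that the age groups and the gender split each add up to the stated 102 speakers.

```python
# Figures copied from the description; the labels are informal.
age_groups = {
    "below 19": 6,
    "20-29": 58,
    "30-39": 27,
    "40-49": 5,
    "over 50": 5,
    "unknown": 1,
}
gender = {"male": 54, "female": 48}

STATED_TOTAL = 102

age_total = sum(age_groups.values())
gender_total = sum(gender.values())

# Both breakdowns should account for every speaker.
consistent = age_total == STATED_TOTAL and gender_total == STATED_TOTAL

# Share of each age group, as a percentage of all speakers.
age_share = {k: 100 * v / STATED_TOTAL for k, v in age_groups.items()}

print(f"age total = {age_total}, gender total = {gender_total}, consistent = {consistent}")
```

    Both breakdowns sum to 102, so the description is internally consistent on this point.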

  7. 120 Million Word Spanish Corpus

    • marketplace.sshopencloud.eu
    Updated Apr 24, 2020
    Cite
    (2020). 120 Million Word Spanish Corpus [Dataset]. https://marketplace.sshopencloud.eu/dataset/XTUFXt
    Dataset updated
    Apr 24, 2020
    Description

    Spanish is the second most widely-spoken language on Earth; over one in 20 humans alive today is a native speaker of Spanish. This medium-sized corpus contains 120 million words of modern Spanish taken from the Spanish-Language Wikipedia in 2010. This dataset is made up of 57 text files. Each contains multiple Wikipedia articles in an XML format. The text of each article is surrounded by tags. The initial tag also contains metadata about the article, including the article’s id and the title of the article. The text “ENDOFARTICLE.” appears at the end of each article, before the closing tag.
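    The file layout described above could be read with a short parser along the following lines. This is a sketch under stated assumptions: the description does not name the tags, so the tag name `doc` and the attribute names `id` and `title` are hypothetical stand-ins, and the sample text is invented for the demonstration. Only the "ENDOFARTICLE." marker and the presence of id and title metadata in the opening tag come from the description.

```python
import re
from typing import Iterator, NamedTuple


class Article(NamedTuple):
    article_id: str
    title: str
    text: str


# ASSUMPTION: the tag name "doc" and the attributes "id"/"title" are
# hypothetical; the dataset description does not give the real names.
ARTICLE_RE = re.compile(
    r'<doc\s+id="(?P<id>[^"]*)"\s+title="(?P<title>[^"]*)"[^>]*>'
    r"(?P<body>.*?)</doc>",
    re.DOTALL,
)
END_MARKER = "ENDOFARTICLE."  # stated to precede each closing tag


def iter_articles(raw: str) -> Iterator[Article]:
    """Yield one Article per tagged block, with the end marker removed."""
    for match in ARTICLE_RE.finditer(raw):
        body = match.group("body").strip()
        if body.endswith(END_MARKER):
            body = body[: -len(END_MARKER)].rstrip()
        yield Article(match.group("id"), match.group("title"), body)


# Invented sample in the assumed layout, for demonstration only.
sample = (
    '<doc id="1" title="Madrid">\nMadrid es la capital de España.\n'
    "ENDOFARTICLE.\n</doc>\n"
    '<doc id="2" title="Lima">\nLima es la capital del Perú.\n'
    "ENDOFARTICLE.\n</doc>\n"
)
articles = list(iter_articles(sample))
```

    A regular expression is used here rather than an XML parser because a file holding many sibling article elements without a single root element is not well-formed XML on its own; whether that applies to the real files is also an assumption.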

  8. Share of the global language services market by region 2018

    • statista.com
    Updated Jul 10, 2025
    Cite
    Statista (2025). Share of the global language services market by region 2018 [Dataset]. https://www.statista.com/statistics/190486/global-language-services-market-share-by-continent/
    Dataset updated
    Jul 10, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2018
    Area covered
    World
    Description

    Given its diverse range of languages and high level of economic development, it is perhaps not surprising that Europe is home to the largest language services market in the world, comprising almost half of the global market.

    Language services globally

    The language services market covers a broad range of activities, from language instruction to professional translation services to localization and voice-over services for media such as film, television and video games. With the world becoming increasingly interconnected through technology, this market has more than doubled since 2009, with an expected global value of almost ** billion U.S. dollars in 2019. And, there is good reason to expect this market to continue growing – especially given that the market share of the Asia Pacific region is relatively low, yet the region is home to **** of the *** most commonly spoken languages in the world.

    Machine translation

    Technology is playing an increasingly important role in the language services industry. Machine translation, which is the process of using software to translate from one language to another, is a fast-growing field that is expected to more than triple in size from 2017 to 2024. Accordingly, the *** largest providers in the global language services market – Transperfect and Lionbridge – are investing heavily in this area, offering software based ‘artificial intelligence’ translation in conjunction with their more traditional translation services.

  9. Non-English language corpus.

    • plos.figshare.com
    xls
    Updated Jun 2, 2025
    Cite
    Rimsha Muzaffar; Syed Yasser Arafat; Junaid Rashid; Jungeun Kim; Usman Naseem (2025). Non-English language corpus. [Dataset]. http://doi.org/10.1371/journal.pone.0320701.t003
    Dataset updated
    Jun 2, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Rimsha Muzaffar; Syed Yasser Arafat; Junaid Rashid; Jungeun Kim; Usman Naseem
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Advancements in deep learning have revolutionized numerous real-world applications, including image recognition, visual question answering, and image captioning. Among these, image captioning has emerged as a critical area of research, with substantial progress achieved in Arabic, Chinese, Uyghur, Hindi, and predominantly English. However, despite Urdu being a morphologically rich and widely spoken language, research in Urdu image captioning remains underexplored due to a lack of resources. This study creates a new Urdu Image Captioning Dataset (UCID) called UC-23-RY to fill in the gaps in Urdu image captioning. The Flickr30k dataset inspired the 159,816 Urdu captions in the dataset. Additionally, the study proposes deep learning architectures designed especially for Urdu image captioning, including NASNetLarge-LSTM and ResNet-50-LSTM. The NASNetLarge-LSTM and ResNet-50-LSTM models achieved notable BLEU-1 scores of 0.86 and 0.84 respectively, as demonstrated through an evaluation assessing the models’ impact on caption quality. The study also provides useful datasets and shows how well suited sophisticated deep learning models are to improving automatic Urdu image captioning.

  10. Speech Across Dialects of English: Acoustic Measures from SPADE Project...

    • datacatalogue.ukdataservice.ac.uk
    Updated Feb 21, 2024
    Cite
    Stuart-Smith, J, University of Glasgow; Sonderegger, M, McGill University; Mielke, J, North Carolina State University (2024). Speech Across Dialects of English: Acoustic Measures from SPADE Project Corpora, 1949-2019 [Dataset]. http://doi.org/10.5255/UKDA-SN-854959
    Dataset updated
    Feb 21, 2024
    Authors
    Stuart-Smith, J, University of Glasgow; Sonderegger, M, McGill University; Mielke, J, North Carolina State University
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Time period covered
    Jan 1, 1949 - Jan 1, 2019
    Area covered
    United Kingdom, Ireland, Canada, United States
    Description

    The SPADE project aims to develop and apply user-friendly software for large-scale speech analysis of existing public and private English speech datasets, in order to understand more about English speech over space and time. To date, we have worked with 42 shared corpora comprising dialects from across the British Isles (England, Wales, Scotland, Ireland) and North America (US, Canada), with an effective time span of over 100 years. We make available here a link to our OSF repository (see below) which has acoustic measures datasets for sibilants and durations and static formants for vowels, for 39 corpora (~2200 hours of speech analysed from ~8600 speakers), with information about dataset generation. In addition, at the OSF site, we provide Praat TextGrids created by SPADE for some corpora. Reading passage text is provided when the measures are based on reading only. Datasets are in their raw form and will require cleaning (e.g. outlier removal) before analysis. In addition, we used whitelisting to anonymise measures datasets generated from non-public, restricted corpora.

    Obtaining a data visualization of a text search within seconds via generic, large-scale search algorithms, such as Google n-gram viewer, is available to anyone. By contrast, speech research is only now entering its own 'big data' revolution. Historically, linguistic research has tended to carry out fine-grained analysis of a few aspects of speech from one or a few languages or dialects. The current scale of speech research studies has shaped our understanding of spoken language and the kinds of questions that we ask. Today, massive digital collections of transcribed speech are available from many different languages, gathered for many different purposes: from oral histories, to large datasets for training speech recognition systems, to legal and political interactions. Sophisticated speech processing tools exist to analyze these data, but require substantial technical skill. Given this confluence of data and tools, linguists have a new opportunity to answer fundamental questions about the nature and development of spoken language.

    Our project seeks to establish the key tools to enable large-scale speech research to become as powerful and pervasive as large-scale text mining. It is based on a partnership of three teams based in Scotland, Canada and the US. Together we exploit methods from computing science and put them to work with tools and methods from speech science, linguistics and digital humanities, to discover how much the sounds of English across the Atlantic vary over space and time.

    We have developed innovative and user-friendly software which exploits the availability of existing speech data and speech processing tools to facilitate large-scale integrated speech corpus analysis across many datasets together. The gains of such an approach are substantial: linguists will be able to scale up answers to existing research questions from one to many varieties of a language, and ask new and different questions about spoken language within and across social, regional, and cultural contexts. Researchers in computational linguistics, speech technology, and forensic and clinical linguistics, who engage with variability in spoken language, will also benefit directly from our software. This project also opens up vast potential for those who already use digital scholarship for spoken language collections in the humanities and social sciences more broadly, e.g. literary scholars, sociologists, anthropologists, historians, political scientists. The possibility of ethically non-invasive inspection of speech and texts will allow analysts to uncover far more than is possible through textual analysis alone.

    Our project has developed and applied our new software to a global language, English, using existing public and private spoken datasets of Old World (British Isles) and New World (North American) English, across an effective time span of more than 100 years, spanning the entire 20th century. Much of what we know about spoken English comes from influential studies on a few specific aspects of speech from one or two dialects. This vast literature has established important research questions, which have been investigated for the first time on a much larger scale, through standardized data across many different varieties of English.

    Our large-scale study complements current-scale studies, by enabling us to consider stability and change in English across the 20th century on an unparalleled scale. The global nature of English means that our findings will be interesting and relevant to a large international non-academic audience; they have been made accessible through an innovative and dynamic visualization of linguistic variation via an interactive sound mapping website. In addition to new insights into spoken English, this project also lays the crucial groundwork for large-scale speech studies across many datasets from different languages, of different formats and structures.

  11. jampatoisnli

    • huggingface.co
    Updated Jul 21, 2023
    Cite
    Ruth-Ann Armstrong (2023). jampatoisnli [Dataset]. https://huggingface.co/datasets/Ruth-Ann/jampatoisnli
    Dataset updated
    Jul 21, 2023
    Authors
    Ruth-Ann Armstrong
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Summary

    JamPatoisNLI provides the first dataset for natural language inference in a creole language, Jamaican Patois. Many of the most-spoken low-resource languages are creoles. These languages commonly have a lexicon derived from a major world language and a distinctive grammar reflecting the languages of the original speakers and the process of language birth by creolization. This gives them a distinctive place in exploring the… See the full description on the dataset page: https://huggingface.co/datasets/Ruth-Ann/jampatoisnli.

  12. Ranking of languages spoken at home in the U.S. 2023

    • statista.com
    Updated Oct 24, 2024
    Cite
    Veera Korhonen (2024). Ranking of languages spoken at home in the U.S. 2023 [Dataset]. https://www.statista.com/topics/3806/hispanics-in-the-united-states/
    Dataset updated
    Oct 24, 2024
    Dataset provided by
    Statista (http://statista.com/)
    Authors
    Veera Korhonen
    Area covered
    United States
    Description

    In 2023, around 43.37 million people in the United States spoke Spanish at home. In comparison, approximately 998,179 people spoke Russian at home during the same year.

  13. Evidence of Universal Language Structure From Speakers Whose Language...

    • datacatalogue.ukdataservice.ac.uk
    Updated Oct 11, 2023
    Cite
    Culbertson, J, University of Edinburgh; Alexander, M, University of Groningen; Patrick, K, University of Edinburgh; Klaus, A, UCL; David, A, QMUL (2023). Evidence of Universal Language Structure From Speakers Whose Language Violates It, 2017-2022 [Dataset]. http://doi.org/10.5255/UKDA-SN-856694
    Dataset updated
    Oct 11, 2023
    Authors
    Culbertson, J, University of Edinburgh; Alexander, M, University of Groningen; Patrick, K, University of Edinburgh; Klaus, A, UCL; David, A, QMUL
    Area covered
    Kenya
    Description

    There is a longstanding debate in cognitive science surrounding the source of commonalities among languages of the world. Indeed, there are many potential explanations for such commonalities: accidents of history, common processes of language change, memory limitations, constraints on linguistic representations, etc. Recent research has used psycholinguistic experiments to provide empirical evidence linking common linguistic patterns to specific features of human cognition, but these experiments tend to use English speakers, who in many cases have direct experience with precisely the common patterns of interest. Here, we highlight the importance of testing populations whose languages go against cross-linguistic trends. We investigate whether the word order preferences of monolingual speakers of Kîîtharaka, which has an unusual way of ordering words, mirror those of English speakers. We find that they do, supporting the hypothesis that universal cognitive representations play a role in shaping word order.

    Languages can be very different from each other. For example, just focussing on the order of words, languages like English put adjectives before nouns ('red house') while languages like Thai put them afterwards ('house red'). Similarly, languages like Vietnamese put numerals before nouns ('three houses'), while others, like Kitharaka (spoken in Kenya), put numerals after ('houses three'). If word ordering were simply due to happenstance, we would expect to see all different orders appearing in equal proportion across languages, but we don't find that. In fact, some orders are very common, some are very rare, and some don't seem to appear at all. For example, many languages are ordered like English ('three red houses'), and many are also ordered like Thai, which is exactly the reverse ('houses red three'). But the Kitharaka order ('houses three red') is much rarer, and its mirror image ('red three houses') never seems to occur. Why is this?

    One of the major controversies in the language sciences is whether we need to appeal to the basic set-up of the human mind to explain the ways languages can vary, or whether these properties are instead a result of cultural differences in communication and social interaction. A great deal of recent work coming from the perspective of psychology assumes the latter: that the properties of language can be boiled down to communication, interaction and the vagaries of history, while most work in linguistics assumes the former: there must be biases in the human mind that allow us to learn languages of particular types more easily than others. This project seeks to resolve that issue.

    In order to do this, we test how well people learn languages of various types, to see whether their behaviour follows the general tendencies we see across real languages. Importantly, we use artificially constructed languages, rather than natural languages, in order to make sure that they only differ in the crucial respects. For example, we present English speakers with artificial languages that use word orders from Thai and Kitharaka. If Thai orders are more common across languages than Kitharaka ones because the former are easier to learn, then we should see this reflected in the behaviour of learners in our experiments. We can also see whether such patterns are always harder to learn, or if speaking a language which uses them, like Kitharaka, makes them easier to pick up in a new language. To do this, our experiments compare English, Thai, Vietnamese and Kitharaka speakers. If our learners all show the same kinds of patterns in how they learn our artificial languages that we find across real languages, that will suggest that the way languages vary is not random, nor is it entirely a product of historical facts. Rather, it would suggest that there are universal cognitive biases at play.

    We plan to look at not just the basic question of what orders appear, but also two other well-known cases where languages don't seem to vary randomly. The first relates to how words like adjectives and numbers are placed relative to the nouns they modify: most languages place them both before or both after (like English and Thai), rather than putting them on opposite sides (e.g., 'two houses red', like Vietnamese). We will test whether this type of pattern is always easier to learn in a new language. Second, we will look at whether people prefer to learn languages with suffixes (e.g., 'cat-s') rather than prefixes (e.g., 'un-happy'). Both types are present in English, but most languages have (more) suffixes. Our project will shed light on whether there are universal cognitive biases in language learning, if such biases are at play for the particular phenomena we look at, and how people's native languages affect these biases.

  14. Enrollment numbers in language training Spain 2005 to 2023

    • statista.com
    Updated Jan 22, 2025
    Cite
    Statista (2025). Enrollment numbers in language training Spain 2005 to 2023 [Dataset]. https://www.statista.com/statistics/459491/enrollment-numbers-in-language-training-spain/
    Explore at:
    Dataset updated
    Jan 22, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    Spain
    Description

    The number of enrollments in language schools in Spain reveals that Spaniards are well aware of the importance of foreign languages in modern times. During the 2022/23 academic year, almost 331,000 people were registered at Spanish language schools to add a new language to their curricula. In a globalized world, languages are playing a much more important role in the job market. The most studied and spoken languages in the world include English, Mandarin, Hindi and Spanish.

    The importance of language knowledge in the job market

    Enrollment numbers at language schools come as no surprise considering that foreign languages have become a vital asset for job seekers in recent years. English, par excellence the most used language for international affairs, unsurprisingly ranked first on the list of most valued languages on the Spanish job market, with approximately 65.2 percent of job openings that require foreign language skills demanding this one. French stood far behind, with 17.38 percent of the job openings.

    Languages in the Spanish multimedia scene

    Most of the best-selling albums in Spain during 2022 were recorded in the country's main language, Spanish, with 38 albums in the top 50. As for video games, 96 percent of the games produced in the country had English as a language option. Spanish was the second most used language, being present in 91 percent of productions.

  15. Language spoken at Home (Census 2016)

    • digital-earth-pacificcore.hub.arcgis.com
    • cacgeoportal.com
    Updated May 26, 2019
    Cite
    Esri Australia (2019). Language spoken at Home (Census 2016) [Dataset]. https://digital-earth-pacificcore.hub.arcgis.com/items/6c0488fd7bcb455fadc66e505cbd21a9
    Explore at:
    Dataset updated
    May 26, 2019
    Dataset provided by
    Esri (http://esri.com/)
    Esri Australia
    Authors
    Esri Australia
    Description

    Does the person speak a language other than English at home? This map takes a look at answers to this question from Census Night.

    Colour: For each SA1 geography, the colour indicates which language 'wins'. SA1 geographies not coloured are either tied between two languages or not enough data.

    Colour Intensity: The colour intensity compares the values of the winner to all other values and returns its dominance over other languages in the same geography.

    Notes: Only considers top 6 languages for VIC.

    Census 2016 DataPacks
    Predominance Visualisations
    Source Code

    Notice that while one language level appears to dominate certain geographies, it doesn't necessarily mean it represents the majority of the population. In fact, as you explore most areas, you will find the predominant language makes up just a fraction of the population due to the number of languages considered.
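
    The colouring rule described for this map can be sketched in a few lines of Python. This is only an illustration: the function name, the sample counts, and the definition of dominance as the winner's share of all counted speakers are assumptions, since the description does not give the exact formula the map uses.

```python
from typing import Optional


def predominance(counts: dict[str, int]) -> tuple[Optional[str], float]:
    """Return (winning language, dominance) for one geography.

    Dominance is taken to be the winner's share of all counted speakers.
    That definition is an assumption, not the map's documented formula.
    """
    total = sum(counts.values())
    if total == 0:
        return None, 0.0  # not enough data: the area stays uncoloured
    top = max(counts.values())
    winners = [lang for lang, n in counts.items() if n == top]
    if len(winners) > 1:
        return None, 0.0  # tie between languages: the area stays uncoloured
    return winners[0], top / total


# Hypothetical counts for a single SA1 geography.
sa1_counts = {"Mandarin": 120, "Greek": 45, "Italian": 35}
winner, dominance = predominance(sa1_counts)
print(winner, round(dominance, 2))
```

    A winner with a low dominance value matches the caveat in the description: the predominant language can still be spoken by only a fraction of the population.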

  16. BanglaNLP

    • huggingface.co
    Updated Jun 1, 2025
    Cite
    Likhon Sheikh (2025). BanglaNLP [Dataset]. https://huggingface.co/datasets/likhonsheikh/BanglaNLP
    Explore at:
    Dataset updated
    Jun 1, 2025
    Authors
    Likhon Sheikh
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    BanglaNLP: Bengali-English Parallel Dataset Tools

    BanglaNLP is a comprehensive toolkit for creating high-quality Bengali-English parallel datasets from news sources, designed to improve machine translation and other cross-lingual NLP tasks for the Bengali language. Our work addresses the critical shortage of high-quality parallel data for Bengali, the 7th most spoken language in the world with over 230 million speakers.

      🏆 Impact & Recognition
    

    120K+ Sentence Pairs:… See the full description on the dataset page: https://huggingface.co/datasets/likhonsheikh/BanglaNLP.

  17. Language Learning Apps Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Sep 30, 2025
    Cite
    Dataintelo (2025). Language Learning Apps Market Research Report 2033 [Dataset]. https://dataintelo.com/report/language-learning-apps-market
    Explore at:
    Available download formats
    pptx, pdf, csv
    Dataset updated
    Sep 30, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Language Learning Apps Market Outlook

    According to our latest research, the global Language Learning Apps market size reached USD 7.35 billion in 2024, reflecting a strong demand for digital language education solutions worldwide. The market is projected to grow at a CAGR of 18.2% during the forecast period from 2025 to 2033, reaching an estimated USD 34.85 billion by 2033. This robust growth is primarily driven by the increasing penetration of smartphones, the growing necessity for multilingual communication in an interconnected world, and the widespread adoption of e-learning methodologies across educational institutions and enterprises.
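
    The growth figures quoted above can be checked with a short compound-growth calculation. The sketch below assumes the 2024 value of USD 7.35 billion compounds annually for the nine years to 2033; the report does not state its base year explicitly, and the function names are illustrative.

```python
def project(value: float, rate: float, years: int) -> float:
    """Compound `value` at an annual `rate` for `years` years."""
    return value * (1 + rate) ** years


def implied_cagr(start: float, end: float, years: int) -> float:
    """Annual growth rate that takes `start` to `end` in `years` years."""
    return (end / start) ** (1 / years) - 1


base_2024 = 7.35     # USD billion, from the report summary
target_2033 = 34.85  # USD billion, from the report summary
years = 2033 - 2024  # assumed compounding horizon

projected_2033 = project(base_2024, 0.182, years)
implied = implied_cagr(base_2024, target_2033, years)

print(f"Projected 2033 value at 18.2%: {projected_2033:.2f} billion USD")
print(f"CAGR implied by the two endpoints: {implied:.1%}")
```

    Under this assumption the projection comes out somewhat below the quoted USD 34.85 billion, and the implied rate somewhat above 18.2%, so the report may be compounding from a different base value or period.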

    One of the most significant growth factors propelling the Language Learning Apps market is the rapid advancement of mobile technology and the proliferation of affordable smartphones and high-speed internet connectivity. With mobile devices becoming ubiquitous, users are increasingly seeking flexible, on-the-go educational solutions. Language learning apps are leveraging this trend by offering interactive, adaptive, and personalized learning experiences that cater to diverse learning styles and schedules. Furthermore, the integration of artificial intelligence, speech recognition, and gamification features has made these apps more engaging and effective, encouraging higher user retention rates and expanding the addressable market.

    Another critical driver is the globalization of business and education, which necessitates proficiency in multiple languages. Enterprises are investing in upskilling their workforce to enhance cross-border communication and collaboration, while educational institutions are incorporating digital language learning tools into their curricula. The COVID-19 pandemic further accelerated the shift towards digital learning, as remote and hybrid education models became the norm. Consequently, both individual learners and organizations are increasingly turning to language learning apps for their convenience, scalability, and cost-effectiveness, fueling sustained market growth.

    Additionally, the market is benefiting from the rising demand for English and other widely spoken languages such as Mandarin, Spanish, and French, especially in emerging economies. Governments and educational authorities are actively promoting language education to improve employability and global competitiveness. The increasing availability of regionally tailored content and the localization of apps to support less commonly taught languages are further broadening the user base. Strategic partnerships between app developers, educational institutions, and technology providers are fostering innovation and expanding the reach of language learning solutions to underserved populations.

    From a regional perspective, Asia Pacific is emerging as the fastest-growing market, driven by a large population of young learners, rapid digitalization, and strong government initiatives supporting education technology. North America and Europe continue to dominate in terms of market share, owing to high digital literacy, established educational infrastructure, and a strong presence of leading app developers. Meanwhile, Latin America and the Middle East & Africa are witnessing increasing adoption rates, supported by rising smartphone penetration and a growing emphasis on bilingual education. This regional diversification is expected to further enhance the global growth trajectory of the Language Learning Apps market throughout the forecast period.

    Product Type Analysis

    The Product Type segment of the Language Learning Apps market is primarily categorized into web-based and mobile-based solutions. Mobile-based language learning apps have witnessed a remarkable surge in popularity, owing to the widespread adoption of smartphones and tablets. These apps offer unparalleled convenience, allowing users to practice languages anytime and anywhere, which aligns perfectly with the modern learner’s lifestyle. The integration of push notifications, offline access, and interactive features such as voice recognition and gamification has further enhanced user engagement and learning outcomes. As a result, mobile-based solutions accounted for the largest share of the market in 2024, and this dominance is expected to continue throughout the forecast period.

    Web-based language learning platforms, while slightly lagging behind mobile apps in terms of u

  18. GlobalPhone Japanese

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Jun 26, 2017
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2017). GlobalPhone Japanese [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0199/
    Explore at:
    Dataset updated
    Jun 26, 2017
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).

    In each language about 100 sentences were read from each of the 100 speakers. The read texts were selected from national newspapers available via Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16bit, 16kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database.

    The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers. Data is shortened by means of the shorten program written by Tony Robinson. Alternatively, the data could be delivered unshortened.

    The Japanese corpus was produced using the Nikkei Shinbun newspaper. It contains recordings of 149 speakers (104 males, 44 females, 1 unspecified) recorded in Tokyo, Japan. The following age distribution has been obtained: 22 speakers are below 19, 90 speakers are between 20 and 29, 5 speakers are between 30 and 39, 2 speakers are between 40 and 49, and 1 speaker is over 50 (the age of 28 speakers is unknown).

  19. Global Mandarin Learning Market Overview and Outlook 2025-2032

    • statsndata.org
    excel, pdf
    Updated Sep 2025
    Cite
    Stats N Data (2025). Global Mandarin Learning Market Overview and Outlook 2025-2032 [Dataset]. https://www.statsndata.org/report/mandarin-learning-market-178140
    Explore at:
    Available download formats
    pdf, excel
    Dataset updated
    Sep 2025
    Dataset authored and provided by
    Stats N Data
    License

    https://www.statsndata.org/how-to-order

    Area covered
    Global
    Description

    The Mandarin Learning market has seen significant growth over the past decade, reflecting the increasing global interest in China as a major economic powerhouse and cultural influencer. With Mandarin being the most spoken language in the world, the demand for Mandarin language education has spiked, creating a divers

  20. Bangla Wikipedia Articles

    • kaggle.com
    Updated Jul 1, 2019
    Cite
    Rafid Abyaad (2019). Bangla Wikipedia Articles [Dataset]. https://www.kaggle.com/abyaadrafid/bnwiki/code
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 1, 2019
    Dataset provided by
    Kaggle
    Authors
    Rafid Abyaad
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Despite Bangla being the 7th most spoken language in the world, online resources for it are surprisingly scarce. This poses a huge problem for the up-and-coming NLP enthusiasts Bangladesh/West Bengal are producing nowadays. There are some datasets on Kaggle on Bangla literature, which hugely misrepresent the language structure as people don't talk like "Gitanjali". So I've compiled this dataset from scraped BNWiki articles in hopes of making things easier for newbies.

    Content

    I downloaded the bnwiki data dump from the official Wikipedia dump. Then I used wikiextractor to scrape the data into json format. I've included a kernel explaining how to make csv files out of it. The files contain all bnwiki articles (verified or not). So the standard for all articles cannot be guaranteed. But hey, we take what we can get at this point.
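
    For readers without access to the kernel, the json-to-csv step might look roughly like the sketch below. It is not the author's kernel: it assumes the extractor wrote one JSON object per line with `id`, `url`, `title` and `text` fields, and the function name and sample records are made up for illustration.

```python
import csv
import io
import json
from typing import Iterable, TextIO


def json_lines_to_csv(lines: Iterable[str], out: TextIO) -> int:
    """Write one CSV row per non-empty article and return the row count."""
    writer = csv.DictWriter(
        out,
        fieldnames=["id", "title", "text"],
        extrasaction="ignore",  # drop fields we do not export, such as "url"
    )
    writer.writeheader()
    written = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        article = json.loads(line)
        if not article.get("text", "").strip():
            continue  # skip pages with no body text
        writer.writerow(article)
        written += 1
    return written


# Hypothetical records in the assumed one-object-per-line layout.
sample = [
    '{"id": "1", "url": "https://bn.wikipedia.org/wiki?curid=1", '
    '"title": "Sample article", "text": "Body of the sample article."}',
    '{"id": "2", "url": "https://bn.wikipedia.org/wiki?curid=2", '
    '"title": "Empty page", "text": ""}',
]
buffer = io.StringIO()
count = json_lines_to_csv(sample, buffer)
print(count)
print(buffer.getvalue())
```

    To process real extractor output, pass an open file (`open(path, encoding="utf-8")`) as `lines` and a file opened with `newline=""` as `out`.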

    Acknowledgements

    I found this project in the wild and followed in their footsteps. Check the repo out, might be useful to you.

Cite
Statista (2025). The most spoken languages worldwide 2025 [Dataset]. https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/

The most spoken languages worldwide 2025

Explore at:
451 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Apr 14, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
2025
Area covered
World
