85 datasets found

The most spoken languages worldwide 2025
statista.com
Updated Apr 14, 2025
Cite
Statista (2025). The most spoken languages worldwide 2025 [Dataset]. https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/
Explore at:
Dataset updated
Apr 14, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
2025
Area covered
World
Description
In 2025, around 1.53 billion people worldwide spoke English either natively or as a second language, ahead of the 1.18 billion Mandarin Chinese speakers at the time of the survey. Hindi and Spanish were the third and fourth most widespread languages that year.

Languages in the United States

The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken there vary as a result of the multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which more than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese), 1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year.

Different languages at home

The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage of its population speaking a language other than English is California, where about 45 percent of the population spoke a language other than English at home in 2023.
The Most Spoken Languages Around the World
kaggle.com
Updated Nov 4, 2020
Cite
Narmelan Tharmalingam (2020). The Most Spoken Languages Around the World [Dataset]. https://www.kaggle.com/narmelan/100-most-spoken-languages-around-the-world/metadata
Explore at:
Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 4, 2020
Dataset provided by
Kaggle
Authors
Narmelan Tharmalingam
License
Open Database License (ODbL) v1.0 (https://www.opendatacommons.org/licenses/odbl/1.0/)
License information was derived automatically
Area covered
World
Description
Context

After going through quite the verbal loop when ordering foreign currency through the bank, which involved a discussion with an assigned financial advisor at the branch the following day to confirm details, I noticed that, although our names hinted at similar backgrounds, communication by phone was much more difficult because of thick accents and the different speech patterns of a non-native speaker.

Coming from an extremely multicultural and welcoming city, it hit me then what challenges people from completely different backgrounds must go through in their daily affairs when they face the communication barriers that I encountered, particularly when interacting with those outside their usual bubble. Now imagine this situation occurring every hour across the world in various sectors of business. How might this impede, help, or create frustrations, in minor or major ways, as a result of increasing workplace diversity quota demands, customer satisfaction needs and process efficiencies?

The data I was looking for to explore this phenomenon existed in the form of native and non-native speaker counts for the 100 most commonly spoken languages across the globe.

Content

The data in this database contains the following attributes:

Language - name of the language

Total Speakers - this assumes both native and non-native speakers

Native Speakers - native speakers of the language

Origin - family origin group of said language
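
The four attributes above are enough to derive a non-native speaker count per language (total minus native). The sketch below is only an illustration: it assumes the column headers match the attribute names exactly and that the counts are stored as plain integers, neither of which is confirmed by the description. The sample rows are placeholders, not real figures.

```python
import csv
import io

# Placeholder rows, NOT real figures: they only mirror the four attributes
# listed in the description (assumed to be the CSV column headers).
SAMPLE_CSV = """Language,Total Speakers,Native Speakers,Origin
Language A,1000,400,Family X
Language B,500,450,Family Y
"""

def non_native_counts(csv_text):
    """Return {language: total speakers - native speakers} for each row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    result = {}
    for row in reader:
        total = int(row["Total Speakers"])
        native = int(row["Native Speakers"])
        result[row["Language"]] = total - native
    return result

counts = non_native_counts(SAMPLE_CSV)
print(counts)
```

If the real file stores counts in another unit or as formatted strings, the `int(...)` conversions would need adjusting.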

Acknowledgements

The data was collected with the aid of the WordTips visualization of the 22nd edition of Ethnologue, "a research center for language intelligence".

https://www.ethnologue.com/world
https://www.ethnologue.com/guides/ethnologue200
https://word.tips/pictures/b684e98f-f512-4ac0-96a4-0efcf6decbc0_most-spoken-languages-world-5.png?auto=compress,format&rect=0,0,2001,7115&w=800&h=2845

Inspiration

As globalization no longer constrains us, what implications will this have in terms of organizational communications conducted moving forward? I believe this is something to be examined in careful context in order to make customer relationship processes meaningful rather than it being confined to a strictly detached transactional basis.
Number of native Spanish speakers worldwide 2024, by country
statista.com
Updated Jan 15, 2025
Cite
Statista (2025). Number of native Spanish speakers worldwide 2024, by country [Dataset]. https://www.statista.com/statistics/991020/number-native-spanish-speakers-country-worldwide/
Explore at:
Dataset updated
Jan 15, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Area covered
World
Description
Mexico is the country with the largest number of native Spanish speakers in the world. As of 2024, 132.5 million people in Mexico spoke Spanish with a native command of the language. Colombia was the nation with the second-highest number of native Spanish speakers, at around 52.7 million. Spain came in third, with 48 million, and Argentina fourth, with 46 million.

Spanish, a world language

As of 2023, Spanish ranked as the fourth most spoken language in the world, behind only English, Chinese, and Hindi, with over half a billion speakers. Spanish is the official language of over 20 countries, most of them on the American continent; it is also one of the official languages of Equatorial Guinea in Africa. Spanish also has a strong presence in other countries, such as the United States, Morocco, and Brazil, which are included in the list of non-Hispanic countries with the highest number of Spanish speakers.

The second most spoken language in the U.S.

In the most recent data, Spanish ranked as the language other than English with the highest number of speakers, with 12 times as many speakers as the second-place language. This comes as no surprise given the long history of migration from Latin American countries to the United States. Moreover, in fiscal year 2022 alone, 5 of the top 10 countries of origin of naturalized people in the U.S. were Spanish-speaking countries.
MCB_languages_county
kaggle.com
Updated Oct 1, 2019
Cite
Marisol Brewster (2019). MCB_languages_county [Dataset]. https://www.kaggle.com/mcbrewster/mcb-languages-county/discussion
Explore at:
Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Oct 1, 2019
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Marisol Brewster
Description
Context

This is a dataset I found online through the Google Dataset Search portal.

Content

The American Community Survey (ACS) 2009-2013 multi-year data are used to list all languages spoken in the United States that were reported during the sample period. These tables provide detailed counts of many more languages than the 39 languages and language groups that are published annually as a part of the routine ACS data release. This is the second tabulation beyond 39 languages since ACS began.

The tables include all languages that were reported in each geography during the 2009 to 2013 sampling period. For the purpose of tabulation, reported languages are classified in one of 380 possible languages or language groups. Because the data are a sample of the total population, there may be languages spoken that are not reported, either because the ACS did not sample the households where those languages are spoken, or because the person filling out the survey did not report the language or reported another language instead.

The tables also provide information about self-reported English-speaking ability. Respondents who reported speaking a language other than English were asked to indicate their ability to speak English in one of the following categories: "Very well," "Well," "Not well," or "Not at all." The data on ability to speak English represent the person’s own perception about his or her own ability or, because ACS questionnaires are usually completed by one household member, the responses may represent the perception of another household member.

These tables are also available through the Census Bureau's application programming interface (API). Please see the developers page for additional details on how to use the API to access these data.

Acknowledgements

Sources:

Google Dataset Search: https://toolbox.google.com/datasetsearch

2009-2013 American Community Survey

Original dataset: https://www.census.gov/data/tables/2013/demo/2009-2013-lang-tables.html

Downloaded From: https://data.world/kvaughn/languages-county

Banner and thumbnail photo by Farzad Mohsenvand on Unsplash
GlobalPhone Hausa
catalogue.elra.info
live.european-language-grid.eu
Updated Jun 26, 2017
Cite
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2017). GlobalPhone Hausa [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0347/
Explore at:
Dataset updated
Jun 26, 2017
Dataset provided by
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
ELRA (European Language Resources Association)
License
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
Description
The GlobalPhone corpus, developed in collaboration with the Karlsruhe Institute of Technology (KIT), was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322). In each language about 100 sentences were read from each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering and false starts, and for non-verbal effects like laughing and hesitations. Speaker information such as age, gender, and occupation, as well as information about the recording setup, complements the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers. Data is compressed by means of the shorten program written by Tony Robinson. Alternatively, the data can be delivered uncompressed.

Hausa is a member of the Chadic language family, which belongs, together with the Semitic and Cushitic languages, to the Afroasiatic language family. With over 25 million speakers, it is widely spoken in West Africa. The collection of the Hausa speech and text corpus followed the GlobalPhone collection standards. First, a large text corpus was built by crawling websites that cover main Hausa newspaper sources. Hausa's modern official orthography is a Latin-based alphabet called Boko, which was imposed in the 1930s by the British colonial administration. It consists of 22 characters of the English alphabet plus five special characters. The collection is based on five main newspapers written in Boko. After cleaning and normalization, these texts were used to build language models and to select prompts for the speech data recordings. Native speakers of Hausa were asked to read prompted sentences of newspaper articles. The entire collection...
GlobalPhone Portuguese (Brazilian)
catalogue.elra.info
live.european-language-grid.eu
Updated Jun 26, 2017
Cite
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2017). GlobalPhone Portuguese (Brazilian) [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0201/
Explore at:
Dataset updated
Jun 26, 2017
Dataset provided by
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
ELRA (European Language Resources Association)
License
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
Area covered
Brazil
Description
The GlobalPhone corpus, developed in collaboration with the Karlsruhe Institute of Technology (KIT), was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322). In each language about 100 sentences were read from each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering and false starts, and for non-verbal effects like laughing and hesitations. Speaker information such as age, gender, and occupation, as well as information about the recording setup, complements the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers. Data is compressed by means of the shorten program written by Tony Robinson. Alternatively, the data can be delivered uncompressed.

The Portuguese (Brazilian) corpus was produced using the Folha de Sao Paulo newspaper. It contains recordings of 102 speakers (54 males, 48 females) recorded in Porto Velho and Sao Paulo, Brazil. The age distribution is as follows: 6 speakers are below 19, 58 speakers are between 20 and 29, 27 speakers are between 30 and 39, 5 speakers are between 40 and 49, and 5 speakers are over 50 (the age of 1 speaker is unknown).
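
As a quick consistency check on the speaker figures quoted above, the age groups plus the one speaker of unknown age should add up to the same total as the gender split. The variable names in this small sketch are illustrative only; the numbers are taken directly from the description.

```python
# Figures copied from the corpus description; names are illustrative.
age_groups = {
    "below 19": 6,
    "20 to 29": 58,
    "30 to 39": 27,
    "40 to 49": 5,
    "over 50": 5,
}
unknown_age = 1
males, females = 54, 48

known_age_total = sum(age_groups.values())    # speakers with a reported age
total_by_age = known_age_total + unknown_age  # add the speaker of unknown age
total_by_gender = males + females

print(known_age_total, total_by_age, total_by_gender)
```

Both totals come to 102, matching the stated number of speakers.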
120 Million Word Spanish Corpus
marketplace.sshopencloud.eu
Updated Apr 24, 2020
Cite
(2020). 120 Million Word Spanish Corpus [Dataset]. https://marketplace.sshopencloud.eu/dataset/XTUFXt
Explore at:
Dataset updated
Apr 24, 2020
Description
Spanish is the second most widely-spoken language on Earth; over one in 20 humans alive today is a native speaker of Spanish. This medium-sized corpus contains 120 million words of modern Spanish taken from the Spanish-Language Wikipedia in 2010. This dataset is made up of 57 text files. Each contains multiple Wikipedia articles in an XML format. The text of each article is surrounded by tags. The initial tag also contains metadata about the article, including the article’s id and the title of the article. The text “ENDOFARTICLE.” appears at the end of each article, before the closing tag.
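
The file layout described above (articles wrapped in tags, metadata in the opening tag, and an "ENDOFARTICLE." marker before the closing tag) could be parsed along the following lines. This is a sketch under stated assumptions: the description does not show the actual tag, so the tag name `doc` and the attribute names `id` and `title` used here are hypothetical and would need to be checked against the real files.

```python
import re

# ASSUMPTION: the tag name "doc" and the attributes "id" and "title" are
# hypothetical; the description only says the opening tag carries the
# article's id and title.
ARTICLE_RE = re.compile(
    r'<doc\s+id="(?P<id>[^"]*)"\s+title="(?P<title>[^"]*)"[^>]*>'
    r'(?P<body>.*?)</doc>',
    re.DOTALL,
)
END_MARKER = "ENDOFARTICLE."  # marker named in the description

def parse_articles(text):
    """Return a list of {"id", "title", "text"} dicts, one per article."""
    articles = []
    for match in ARTICLE_RE.finditer(text):
        body = match.group("body").strip()
        # Drop the end-of-article marker if it closes the body.
        if body.endswith(END_MARKER):
            body = body[: -len(END_MARKER)].rstrip()
        articles.append(
            {"id": match.group("id"), "title": match.group("title"), "text": body}
        )
    return articles

# Tiny made-up sample in the assumed layout.
SAMPLE = '''<doc id="1" title="Ejemplo">
Texto de ejemplo.
ENDOFARTICLE.
</doc>
<doc id="2" title="Otro">
Otro texto.
ENDOFARTICLE.
</doc>'''

articles = parse_articles(SAMPLE)
print([a["title"] for a in articles])
```

A regular expression is used rather than an XML parser because a file holding several top-level article elements is not a single well-formed XML document.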
Share of the global language services market by region 2018
statista.com
Updated Jul 10, 2025
Cite
Statista (2025). Share of the global language services market by region 2018 [Dataset]. https://www.statista.com/statistics/190486/global-language-services-market-share-by-continent/
Explore at:
Dataset updated
Jul 10, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
2018
Area covered
World
Description
Given its diverse range of languages and high level of economic development, it is perhaps not surprising that Europe is home to the largest language services market in the world, comprising almost half of the global market.

Language services globally

The language services market covers a broad range of activities, from language instruction to professional translation services to localization and voice-over services for media such as film, television and video games. With the world becoming increasingly interconnected through technology, this market has more than doubled since 2009, with an expected global value of almost ** billion U.S. dollars in 2019. And there is good reason to expect this market to continue growing, especially given that the market share of the Asia Pacific region is relatively low, yet the region is home to **** of the *** most commonly spoken languages in the world.

Machine translation

Technology is playing an increasingly important role in the language services industry. Machine translation, which is the process of using software to translate from one language to another, is a fast-growing field that is expected to more than triple in size from 2017 to 2024. Accordingly, the *** largest providers in the global language services market, Transperfect and Lionbridge, are investing heavily in this area, offering software-based 'artificial intelligence' translation in conjunction with their more traditional translation services.
Non-English language corpus.
plos.figshare.com
xls
Updated Jun 2, 2025
Cite
Rimsha Muzaffar; Syed Yasser Arafat; Junaid Rashid; Jungeun Kim; Usman Naseem (2025). Non-English language corpus. [Dataset]. http://doi.org/10.1371/journal.pone.0320701.t003
Explore at:
xls (available download formats)
Unique identifier
https://doi.org/10.1371/journal.pone.0320701.t003
Dataset updated
Jun 2, 2025
Dataset provided by
PLOS ONE
Authors
Rimsha Muzaffar; Syed Yasser Arafat; Junaid Rashid; Jungeun Kim; Usman Naseem
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Advancements in deep learning have revolutionized numerous real-world applications, including image recognition, visual question answering, and image captioning. Among these, image captioning has emerged as a critical area of research, with substantial progress achieved in Arabic, Chinese, Uyghur, Hindi, and predominantly English. However, despite Urdu being a morphologically rich and widely spoken language, Urdu image captioning remains underexplored due to a lack of resources. This study creates a new Urdu Image Captioning Dataset (UCID) called UC-23-RY to fill this gap. The dataset contains 159,816 Urdu captions inspired by the Flickr30k dataset. The study also proposes deep learning architectures designed especially for Urdu image captioning, including NASNetLarge-LSTM and ResNet-50-LSTM. In an evaluation assessing the models' impact on caption quality, the NASNetLarge-LSTM and ResNet-50-LSTM models achieved notable BLEU-1 scores of 0.86 and 0.84, respectively. The study thus provides useful datasets and shows how well suited sophisticated deep learning models are to improving automatic Urdu image captioning.
Speech Across Dialects of English: Acoustic Measures from SPADE Project...
datacatalogue.ukdataservice.ac.uk
Updated Feb 21, 2024
Cite
Stuart-Smith, J, University of Glasgow; Sonderegger, M, McGill University; Mielke, J, North Carolina State University (2024). Speech Across Dialects of English: Acoustic Measures from SPADE Project Corpora, 1949-2019 [Dataset]. http://doi.org/10.5255/UKDA-SN-854959
Explore at:
Unique identifier
https://doi.org/10.5255/UKDA-SN-854959
Dataset updated
Feb 21, 2024
Authors
Stuart-Smith, J, University of Glasgow; Sonderegger, M, McGill University; Mielke, J, North Carolina State University
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Time period covered
Jan 1, 1949 - Jan 1, 2019
Area covered
United Kingdom, Ireland, Canada, United States
Description
The SPADE project aims to develop and apply user-friendly software for large-scale speech analysis of existing public and private English speech datasets, in order to understand more about English speech over space and time. To date, we have worked with 42 shared corpora comprising dialects from across the British Isles (England, Wales, Scotland, Ireland) and North America (US, Canada), with an effective time span of over 100 years. We make available here a link to our OSF repository (see below) which has acoustic measures datasets for sibilants and durations and static formants for vowels, for 39 corpora (~2200 hours of speech analysed from ~8600 speakers), with information about dataset generation. In addition, at the OSF site, we provide Praat TextGrids created by SPADE for some corpora. Reading passage text is provided when the measures are based on reading only. Datasets are in their raw form and will require cleaning (e.g. outlier removal) before analysis. In addition, we used whitelisting to anonymise measures datasets generated from non-public, restricted corpora.
Obtaining a data visualization of a text search within seconds via generic, large-scale search algorithms, such as Google n-gram viewer, is available to anyone. By contrast, speech research is only now entering its own 'big data' revolution. Historically, linguistic research has tended to carry out fine-grained analysis of a few aspects of speech from one or a few languages or dialects. The current scale of speech research studies has shaped our understanding of spoken language and the kinds of questions that we ask. Today, massive digital collections of transcribed speech are available from many different languages, gathered for many different purposes: from oral histories, to large datasets for training speech recognition systems, to legal and political interactions. Sophisticated speech processing tools exist to analyze these data, but require substantial technical skill. Given this confluence of data and tools, linguists have a new opportunity to answer fundamental questions about the nature and development of spoken language.

Our project seeks to establish the key tools to enable large-scale speech research to become as powerful and pervasive as large-scale text mining. It is based on a partnership of three teams based in Scotland, Canada and the US. Together we exploit methods from computing science and put them to work with tools and methods from speech science, linguistics and digital humanities, to discover how much the sounds of English across the Atlantic vary over space and time.

We have developed innovative and user-friendly software which exploits the availability of existing speech data and speech processing tools to facilitate large-scale integrated speech corpus analysis across many datasets together. The gains of such an approach are substantial: linguists will be able to scale up answers to existing research questions from one to many varieties of a language, and ask new and different questions about spoken language within and across social, regional, and cultural contexts. Researchers in computational linguistics, speech technology, and forensic and clinical linguistics, who engage with variability in spoken language, will also benefit directly from our software. This project also opens up vast potential for those who already use digital scholarship for spoken language collections in the humanities and social sciences more broadly, e.g. literary scholars, sociologists, anthropologists, historians, political scientists. The possibility of ethically non-invasive inspection of speech and texts will allow analysts to uncover far more than is possible through textual analysis alone.

Our project has developed and applied our new software to a global language, English, using existing public and private spoken datasets of Old World (British Isles) and New World (North American) English, across an effective time span of more than 100 years, spanning the entire 20th century. Much of what we know about spoken English comes from influential studies on a few specific aspects of speech from one or two dialects. This vast literature has established important research questions, which have been investigated for the first time on a much larger scale, through standardized data across many different varieties of English.

Our large-scale study complements current-scale studies, by enabling us to consider stability and change in English across the 20th century on an unparalleled scale. The global nature of English means that our findings will be interesting and relevant to a large international non-academic audience; they have been made accessible through an innovative and dynamic visualization of linguistic variation via an interactive sound mapping website. In addition to new insights into spoken English, this project also lays the crucial groundwork for large-scale speech studies across many datasets from different languages, of different formats and structures.
jampatoisnli
huggingface.co
Updated Jul 21, 2023
Cite
Ruth-Ann Armstrong (2023). jampatoisnli [Dataset]. https://huggingface.co/datasets/Ruth-Ann/jampatoisnli
Explore at:
Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 21, 2023
Authors
Ruth-Ann Armstrong
License
https://choosealicense.com/licenses/other/
Description
Dataset Card for [Dataset Name]

Dataset Summary

JamPatoisNLI provides the first dataset for natural language inference in a creole language, Jamaican Patois. Many of the most-spoken low-resource languages are creoles. These languages commonly have a lexicon derived from a major world language and a distinctive grammar reflecting the languages of the original speakers and the process of language birth by creolization. This gives them a distinctive place in exploring the… See the full description on the dataset page: https://huggingface.co/datasets/Ruth-Ann/jampatoisnli.
Ranking of languages spoken at home in the U.S. 2023
statista.com
Updated Oct 24, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Veera Korhonen (2024). Ranking of languages spoken at home in the U.S. 2023 [Dataset]. https://www.statista.com/topics/3806/hispanics-in-the-united-states/
Explore at:
Dataset updated
Oct 24, 2024
Dataset provided by
Statista (http://statista.com/)
Authors
Veera Korhonen
Area covered
United States
Description
In 2023, around 43.37 million people in the United States spoke Spanish at home. In comparison, approximately 998,179 people were speaking Russian at home during the same year. The distribution of the U.S. population by ethnicity can be accessed here. A ranking of the most spoken languages across the world can be accessed here.
Evidence of Universal Language Structure From Speakers Whose Language...
datacatalogue.ukdataservice.ac.uk
Updated Oct 11, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Culbertson, J, University of Edinburgh; Alexander, M, University of Groningen; Patrick, K, University of Edinburgh; Klaus, A, UCL; David, A, QMUL (2023). Evidence of Universal Language Structure From Speakers Whose Language Violates It, 2017-2022 [Dataset]. http://doi.org/10.5255/UKDA-SN-856694
Explore at:
Unique identifier
https://doi.org/10.5255/UKDA-SN-856694
Dataset updated
Oct 11, 2023
Authors
Culbertson, J, University of Edinburgh; Alexander, M, University of Groningen; Patrick, K, University of Edinburgh; Klaus, A, UCL; David, A, QMUL
Area covered
Kenya
Description
There is a longstanding debate in cognitive science surrounding the source of commonalities among languages of the world. Indeed, there are many potential explanations for such commonalities: accidents of history, common processes of language change, memory limitations, constraints on linguistic representations, etc. Recent research has used psycholinguistic experiments to provide empirical evidence linking common linguistic patterns to specific features of human cognition, but these experiments tend to use English speakers, who in many cases have direct experience with precisely the common patterns of interest. Here, we highlight the importance of testing populations whose languages go against cross-linguistic trends. We investigate whether the word-order preferences of monolingual speakers of Kîîtharaka, which has an unusual way of ordering words, mirror those of English speakers. We find that they do, supporting the hypothesis that universal cognitive representations play a role in shaping word order.
Languages can be very different from each other. For example, just focussing on the order of words, languages like English put adjectives before nouns ('red house') while languages like Thai put them afterwards ('house red'). Similarly, languages like Vietnamese put numerals before nouns ('three houses'), while others, like Kitharaka (spoken in Kenya), put numerals after ('houses three'). If word ordering were simply due to happenstance, we would expect to see all different orders appearing in equal proportion across languages, but we don't find that. In fact, some orders are very common, some are very rare, and some don't seem to appear at all. For example, many languages are ordered like English ('three red houses'), and many are also ordered like Thai, which is exactly the reverse ('houses red three'). But the Kitharaka order ('houses three red') is much rarer, and its mirror image ('red three houses') never seems to occur. Why is this?

One of the major controversies in the language sciences is whether we need to appeal to the basic set-up of the human mind to explain the ways languages can vary, or whether these properties are instead a result of cultural differences in communication and social interaction. A great deal of recent work coming from the perspective of psychology assumes the latter: that the properties of language can be boiled down to communication, interaction and the vagaries of history, while most work in linguistics assumes the former: there must be biases in the human mind that allow us to learn languages of particular types more easily than others. This project seeks to resolve that issue.

In order to do this, we test how well people learn languages of various types, to see whether their behaviour follows the general tendencies we see across real languages. Importantly, we use artificially constructed languages, rather than natural languages, in order to make sure that they only differ in the crucial respects. For example, we present English speakers with artificial languages that use word orders from Thai and Kitharaka. If Thai orders are more common across languages than Kitharaka ones because the former are easier to learn, then we should see this reflected in the behaviour of learners in our experiments. We can also see whether such patterns are always harder to learn, or whether speaking a language which uses them, like Kitharaka, makes them easier to pick up in a new language. To do this, our experiments compare English, Thai, Vietnamese and Kitharaka speakers. If our learners all show the same kinds of patterns in how they learn our artificial languages as we find across real languages, that will suggest that the way languages vary is not random, nor is it entirely a product of historical facts. Rather, it would suggest that there are universal cognitive biases at play.

We plan to look at not just the basic question of what orders appear, but also two other well-known cases where languages don't seem to vary randomly. The first relates to how words like adjectives and numbers are placed relative to the nouns they modify: most languages place them both before or both after (like English and Thai), rather than putting them on opposite sides (e.g., 'two houses red', like Vietnamese). We will test whether this type of pattern is always easier to learn in a new language. Second, we will look at whether people prefer to learn languages with suffixes (e.g., 'cat-s') rather than prefixes (e.g., 'un-happy'). Both types are present in English, but most languages have (more) suffixes. Our project will shed light on whether there are universal cognitive biases in language learning, whether such biases are at play for the particular phenomena we look at, and how people's native languages affect these biases.
Enrollment numbers in language training Spain 2005 to 2023
statista.com
Updated Jan 22, 2025
Statista (2025). Enrollment numbers in language training Spain 2005 to 2023 [Dataset]. https://www.statista.com/statistics/459491/enrollment-numbers-in-language-training-spain/
Dataset updated
Jan 22, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Area covered
Spain
Description
The number of enrollments in language schools in Spain reveals that Spaniards are well aware of the importance of foreign languages in modern times. During the 2022/23 academic year, almost 331,000 people were registered at Spanish language schools to add a new language to their curricula. In a globalized world, languages are taking on a much more important role in the job market. The most studied and spoken languages in the world include English, Mandarin, Hindi, and Spanish.

The importance of language knowledge in the job market

Enrollment numbers at language schools come as no surprise considering that foreign languages have become a vital asset for job seekers in recent years. English, par excellence the most used language for international affairs, unsurprisingly ranked first on the list of most valued languages on the Spanish job market, with approximately 65.2 percent of the job openings that require foreign language skills asking for it. French followed far behind, with 17.38 percent of those job openings.

Languages in the Spanish multimedia scene

Most of the best-selling albums in Spain during 2022 were recorded in the country's main language, Spanish, with 38 albums in the top 50. As for video games, 96 percent of the games produced in the country had English as a language option. Spanish was the second most used language, being present in 91 percent of productions.
Language spoken at Home (Census 2016)
digital-earth-pacificcore.hub.arcgis.com
cacgeoportal.com
Updated May 26, 2019
Esri Australia (2019). Language spoken at Home (Census 2016) [Dataset]. https://digital-earth-pacificcore.hub.arcgis.com/items/6c0488fd7bcb455fadc66e505cbd21a9
Dataset updated
May 26, 2019
Dataset provided by
Esri (http://esri.com/)
Esri Australia
Authors
Esri Australia
Description
Does the person speak a language other than English at home? This map takes a look at answers to this question from Census Night.

Colour: For each SA1 geography, the colour indicates which language 'wins'. SA1 geographies not coloured are either tied between two languages or not enough data.

Colour Intensity: The colour intensity compares the values of the winner to all other values and returns its dominance over other languages in the same geography.

Notes: Only considers top 6 languages for VIC.

Census 2016 DataPacks
Predominance Visualisations
Source Code

Notice that while one language level appears to dominate certain geographies, it doesn't necessarily mean it represents the majority of the population. In fact, as you explore most areas, you will find the predominant language makes up just a fraction of the population due to the number of languages considered.
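
The colouring rule described above can be sketched roughly as follows. This is only an illustration, not the map's actual source code: the function name `predominance`, the sample counts, and the choice of "winner's share of the total" as the intensity measure are all assumptions, since the description does not give the exact formula.

```python
from typing import Dict, Optional, Tuple


def predominance(counts: Dict[str, int]) -> Optional[Tuple[str, float]]:
    """Return (winning language, intensity) for one geography, or None.

    None stands for an uncoloured geography: no data, or a tie for first place.
    Intensity is ASSUMED here to be the winner's share of all counted speakers.
    """
    total = sum(counts.values())
    if total == 0:
        return None  # not enough data
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tied between the top languages
    winner, top_count = ranked[0]
    return winner, top_count / total


# Hypothetical counts for a single SA1 geography.
print(predominance({"Mandarin": 30, "Greek": 10, "Italian": 10}))
```

Under this assumed measure, a language can win a geography with a share well below one half, which matches the note that the predominant language is not necessarily the majority.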
BanglaNLP
huggingface.co
Updated Jun 1, 2025
Likhon Sheikh (2025). BanglaNLP [Dataset]. https://huggingface.co/datasets/likhonsheikh/BanglaNLP
Dataset updated
Jun 1, 2025
Authors
Likhon Sheikh
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
BanglaNLP: Bengali-English Parallel Dataset Tools

BanglaNLP is a comprehensive toolkit for creating high-quality Bengali-English parallel datasets from news sources, designed to improve machine translation and other cross-lingual NLP tasks for the Bengali language. Our work addresses the critical shortage of high-quality parallel data for Bengali, the 7th most spoken language in the world with over 230 million speakers.

🏆 Impact & Recognition

120K+ Sentence Pairs:… See the full description on the dataset page: https://huggingface.co/datasets/likhonsheikh/BanglaNLP.
Language Learning Apps Market Research Report 2033
dataintelo.com
csv, pdf, pptx
Updated Sep 30, 2025
Dataintelo (2025). Language Learning Apps Market Research Report 2033 [Dataset]. https://dataintelo.com/report/language-learning-apps-market
Available download formats: pptx, pdf, csv
Dataset updated
Sep 30, 2025
Dataset authored and provided by
Dataintelo
License
https://dataintelo.com/privacy-and-policy
Time period covered
2024 - 2032
Area covered
Global
Description
Language Learning Apps Market Outlook

According to our latest research, the global Language Learning Apps market size reached USD 7.35 billion in 2024, reflecting a strong demand for digital language education solutions worldwide. The market is projected to grow at a CAGR of 18.2% during the forecast period from 2025 to 2033, reaching an estimated USD 34.85 billion by 2033. This robust growth is primarily driven by the increasing penetration of smartphones, the growing necessity for multilingual communication in an interconnected world, and the widespread adoption of e-learning methodologies across educational institutions and enterprises.
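
As a rough check on how such a projection is built, the compound annual growth rate (CAGR) arithmetic can be sketched as below. The function names are illustrative, and the use of nine compounding years (2024 to 2033) is an assumption: the report does not state its base year or rounding, so the quoted figures need not reproduce exactly.

```python
def project_value(start_value: float, cagr: float, years: int) -> float:
    """Compound start_value at a constant annual growth rate for `years` years."""
    return start_value * (1.0 + cagr) ** years


def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Constant annual growth rate that turns start_value into end_value."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Figures quoted in the description (USD billion); 9 years is an assumption.
print(round(project_value(7.35, 0.182, 9), 2))
print(round(implied_cagr(7.35, 34.85, 9) * 100, 1))
```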

One of the most significant growth factors propelling the Language Learning Apps market is the rapid advancement of mobile technology and the proliferation of affordable smartphones and high-speed internet connectivity. With mobile devices becoming ubiquitous, users are increasingly seeking flexible, on-the-go educational solutions. Language learning apps are leveraging this trend by offering interactive, adaptive, and personalized learning experiences that cater to diverse learning styles and schedules. Furthermore, the integration of artificial intelligence, speech recognition, and gamification features has made these apps more engaging and effective, encouraging higher user retention rates and expanding the addressable market.

Another critical driver is the globalization of business and education, which necessitates proficiency in multiple languages. Enterprises are investing in upskilling their workforce to enhance cross-border communication and collaboration, while educational institutions are incorporating digital language learning tools into their curricula. The COVID-19 pandemic further accelerated the shift towards digital learning, as remote and hybrid education models became the norm. Consequently, both individual learners and organizations are increasingly turning to language learning apps for their convenience, scalability, and cost-effectiveness, fueling sustained market growth.

Additionally, the market is benefiting from the rising demand for English and other widely spoken languages such as Mandarin, Spanish, and French, especially in emerging economies. Governments and educational authorities are actively promoting language education to improve employability and global competitiveness. The increasing availability of regionally tailored content and the localization of apps to support less commonly taught languages are further broadening the user base. Strategic partnerships between app developers, educational institutions, and technology providers are fostering innovation and expanding the reach of language learning solutions to underserved populations.

From a regional perspective, Asia Pacific is emerging as the fastest-growing market, driven by a large population of young learners, rapid digitalization, and strong government initiatives supporting education technology. North America and Europe continue to dominate in terms of market share, owing to high digital literacy, established educational infrastructure, and a strong presence of leading app developers. Meanwhile, Latin America and the Middle East & Africa are witnessing increasing adoption rates, supported by rising smartphone penetration and a growing emphasis on bilingual education. This regional diversification is expected to further enhance the global growth trajectory of the Language Learning Apps market throughout the forecast period.

Product Type Analysis

The Product Type segment of the Language Learning Apps market is primarily categorized into web-based and mobile-based solutions. Mobile-based language learning apps have witnessed a remarkable surge in popularity, owing to the widespread adoption of smartphones and tablets. These apps offer unparalleled convenience, allowing users to practice languages anytime and anywhere, which aligns perfectly with the modern learner’s lifestyle. The integration of push notifications, offline access, and interactive features such as voice recognition and gamification has further enhanced user engagement and learning outcomes. As a result, mobile-based solutions accounted for the largest share of the market in 2024, and this dominance is expected to continue throughout the forecast period.

Web-based language learning platforms, while slightly lagging behind mobile apps in terms of u
GlobalPhone Japanese
catalogue.elra.info
live.european-language-grid.eu
Updated Jun 26, 2017
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2017). GlobalPhone Japanese [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0199/
Dataset updated
Jun 26, 2017
Dataset provided by
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
ELRA (European Language Resources Association)
License
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
Description
The GlobalPhone corpus, developed in collaboration with the Karlsruhe Institute of Technology (KIT), was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).
In each language about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering and false starts, and for non-verbal effects like laughing and hesitations. Speaker information such as age, gender, and occupation, as well as information about the recording setup, complements the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers. Data is shortened by means of the shorten program written by Tony Robinson. Alternatively, the data can be delivered unshortened.
The Japanese corpus was produced using the Nikkei Shinbun newspaper. It contains recordings of 149 speakers (104 males, 44 females, 1 unspecified) recorded in Tokyo, Japan. The following age distribution has been obtained: 22 speakers are below 19, 90 speakers are between 20 and 29, 5 speakers are between 30 and 39, 2 speakers are between 40 and 49, and 1 speaker is over 50 (the age of 28 speakers is unknown).
Global Mandarin Learning Market Overview and Outlook 2025-2032
statsndata.org
excel, pdf
Updated Sep 2025
Stats N Data (2025). Global Mandarin Learning Market Overview and Outlook 2025-2032 [Dataset]. https://www.statsndata.org/report/mandarin-learning-market-178140
Available download formats: pdf, excel
Dataset updated
Sep 2025
Dataset authored and provided by
Stats N Data
License
https://www.statsndata.org/how-to-order
Area covered
Global
Description
The Mandarin Learning market has seen significant growth over the past decade, reflecting the increasing global interest in China as a major economic powerhouse and cultural influencer. With Mandarin being the most spoken language in the world, the demand for Mandarin language education has spiked, creating a divers
Bangla Wikipedia Articles
kaggle.com
Updated Jul 1, 2019
Rafid Abyaad (2019). Bangla Wikipedia Articles [Dataset]. https://www.kaggle.com/abyaadrafid/bnwiki/code
Dataset updated
Jul 1, 2019
Dataset provided by
Kaggle
Authors
Rafid Abyaad
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

Despite Bangla being the 7th most spoken language in the world, online resources for it are surprisingly scarce. This poses a huge problem for the up-and-coming NLP enthusiasts Bangladesh/West Bengal are producing nowadays. There are some datasets on Kaggle on Bangla literature, which hugely misrepresent the language structure, as people don't talk like "Gitanjali". So I've compiled this dataset from scraped BNWiki articles in hopes of making things easier for newbies.

Content

I downloaded the bnwiki data dump from the official Wikipedia dumps, then used wikiextractor to extract the data into JSON format. I've included a kernel explaining how to make CSV files out of it. The files contain all bnwiki articles (verified or not), so the standard of all articles cannot be guaranteed. But hey, we take what we can get at this point.
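
For readers who want a starting point, a JSON-to-CSV conversion along those lines might look like the sketch below. It is not the author's kernel. It assumes the extracted files hold one JSON object per line with `id`, `title`, and `text` keys; the helper name `json_lines_to_csv` and the sample records are made up for illustration.

```python
import csv
import io
import json
from typing import Iterable, Sequence, TextIO


def json_lines_to_csv(
    lines: Iterable[str],
    out: TextIO,
    fields: Sequence[str] = ("id", "title", "text"),  # assumed field names
) -> int:
    """Write one CSV row per JSON line and return the number of rows written."""
    writer = csv.DictWriter(out, fieldnames=list(fields))
    writer.writeheader()
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        # Missing keys become empty cells; extra keys are dropped.
        writer.writerow({name: record.get(name, "") for name in fields})
        count += 1
    return count


# Hypothetical records standing in for lines read from an extracted file.
sample = [
    '{"id": "1", "url": "https://example.org/1", "title": "First", "text": "Body one."}',
    '{"id": "2", "title": "Second", "text": "Body, with a comma."}',
]
buffer = io.StringIO()
print(json_lines_to_csv(sample, buffer))
print(buffer.getvalue())
```

When writing to a real file instead of a buffer, opening it with `newline=""` and `encoding="utf-8"` keeps the Bangla text and the row endings intact.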

Acknowledgements

I found this project in the wild and followed in their footsteps. Check the repo out, might be useful to you.

Statista (2025). The most spoken languages worldwide 2025 [Dataset]. https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/

451 scholarly articles cite this dataset (View in Google Scholar)
