53 datasets found

The most spoken languages worldwide 2025
statista.com
Cite
Statista, The most spoken languages worldwide 2025 [Dataset]. https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
2025
Area covered
World
Description
In 2025, there were around 1.53 billion people worldwide who spoke English either natively or as a second language, slightly more than the 1.18 billion Mandarin Chinese speakers at the time of survey. Hindi and Spanish were the third and fourth most widely spoken languages that year.

Languages in the United States

The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken in the United States vary as a result of the multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which more than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese), 1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year.

Different languages at home

The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage of population speaking a language other than English is California. About 45 percent of its population spoke a language other than English at home in 2023.
Ranking of languages spoken at home in the U.S. 2024, by number of speakers
statista.com
Updated Nov 28, 2025
Cite
Statista (2025). Ranking of languages spoken at home in the U.S. 2024, by number of speakers [Dataset]. https://www.statista.com/statistics/183483/ranking-of-languages-spoken-at-home-in-the-us-in-2008/
Dataset updated
Nov 28, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
2024
Area covered
United States
Description
In 2024, some 45 million people in the United States spoke Spanish at home. In comparison, the second most spoken non-English language in households was Chinese, at just 3.7 million speakers.
Language - ACS 2019-2023 - Tempe Tracts
catalog.data.gov
performance.tempe.gov
+9 more
Updated Aug 23, 2025
Cite
City of Tempe (2025). Language - ACS 2019-2023 - Tempe Tracts [Dataset]. https://catalog.data.gov/dataset/language-acs-2019-2023-tempe-tracts
Dataset updated
Aug 23, 2025
Dataset provided by
City of Tempe
Area covered
Tempe
Description
This layer shows language group of language spoken at home by age. Data is from US Census American Community Survey (ACS) 5-year estimates.

This layer is symbolized to show the percentage of the population age 5+ who speak Spanish at home. To see the full list of attributes available in this service, go to the "Data" tab, and choose "Fields" at the top right. To view only the census tracts that are predominantly in Tempe, add the expression City is Tempe in the map filter settings.

A ‘Null’ entry in the estimate indicates that data for this geographic area cannot be displayed because the number of sample cases is too small (per the U.S. Census).

Vintage: 2019-2023
ACS Table(s): B16007 (Not all lines of these ACS tables are available in this feature layer.)
Data downloaded from: Census Bureau's API for American Community Survey
Data Preparation: Data curated from Esri Living Atlas clipped to Census Tract boundaries that are within or adjacent to the City of Tempe boundary
Date of Census update: December 12, 2024
National Figures: data.census.gov
LA County Language Spoken at Home (census tract)
data.lacounty.gov
egis-lacounty.hub.arcgis.com
+2 more
Updated Jul 28, 2025
Cite
County of Los Angeles (2025). LA County Language Spoken at Home (census tract) [Dataset]. https://data.lacounty.gov/datasets/la-county-language-spoken-at-home-census-tract
Dataset updated
Jul 28, 2025
Dataset authored and provided by
County of Los Angeles
Area covered

Description
US Census American Community Survey Custom Tabulation (ST542) by Census Tract. Language spoken at home for population 5 years and over by ability to speak English, summarized by census tract for 114 languages spoken across LA County, 5-year estimates 2019-2023.

See also source data tables:
Census Tracts: Language Spoken at Home LA County Census Tracts
LA County: Language Spoken at Home LA County

Headings:
GEOID: Geography identification
CT20: Census tract (2020)
Name: Census tract name
CSA: Countywide Statistical Area (city or community)
SPA: Service Planning Area
SD: Supervisorial District
total_pop: Population over 5 years old in census tract (universe)
total_limited_eng: Population that speaks English less than "very well"
total_limited_eng_pct: Percent of population that speaks English less than "very well"
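The relationship between the three population fields listed above can be sketched as follows. This is a minimal illustration, assuming that total_limited_eng_pct is expressed on a 0-100 scale and is derived from total_limited_eng and total_pop; the listing does not confirm either point, and the tract values below are hypothetical.

```python
def limited_english_pct(total_pop, total_limited_eng):
    """Percent of the universe population that speaks English less than "very well".

    Assumes a 0-100 scale; the dataset itself may store a fraction instead.
    """
    if not total_pop:
        # Empty tract or missing universe: no meaningful percentage.
        return None
    return 100.0 * total_limited_eng / total_pop


# Hypothetical tract values, for illustration only.
tract = {"total_pop": 4000, "total_limited_eng": 1000}
pct = limited_english_pct(tract["total_pop"], tract["total_limited_eng"])
print(pct)
```

Returning None for an empty universe avoids a division by zero and keeps such tracts distinguishable from tracts with a genuine 0 percent.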
Language spoken - ACS 2015-2019 - Tempe Tracts
catalog.data.gov
data.tempe.gov
+9 more
Updated Sep 20, 2024
Cite
City of Tempe (2024). Language spoken - ACS 2015-2019 - Tempe Tracts [Dataset]. https://catalog.data.gov/dataset/language-spoken-acs-2015-2019-tempe-tracts-28081
Dataset updated
Sep 20, 2024
Dataset provided by
City of Tempe
Area covered
Tempe
Description
Notice: The U.S. Census Bureau is delaying the release of the 2016-2020 ACS 5-year data until March 2022. For more information, please read the Census Bureau statement regarding this matter.

This layer shows language group of language spoken at home by age. This layer is Census data from Esri's Living Atlas and is clipped to only show Tempe census tracts. This layer is symbolized to show the percentage of the population age 5+ who speak Spanish at home. To see the full list of attributes available in this service, go to the "Data" tab, and choose "Fields" at the top right. Data is from US Census American Community Survey (ACS) 5-year estimates.

Vintage: 2015-2019
ACS Table(s): B16007 (Not all lines of these ACS tables are available in this feature layer.)
Data downloaded from: Census Bureau's API for American Community Survey
Date of Census update: December 10, 2020
National Figures: data.census.gov

Additional Census data notes and data processing notes are available at the Esri Living Atlas Layer: https://tempegov.maps.arcgis.com/home/item.html?id=527ea2b5ba814c8ca1c34a2945e1b751
Main languages spoken at home in Nigeria 2022
statista.com
Cite
Statista, Main languages spoken at home in Nigeria 2022 [Dataset]. https://www.statista.com/statistics/1268798/main-languages-spoken-at-home-in-nigeria/
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
Mar 5, 2022 - Mar 31, 2022
Area covered
Nigeria
Description
The primary languages spoken at home in Nigeria are Hausa, Yoruba, and Igbo. In a survey conducted in 2022, around 32 percent of respondents declared that they mainly spoke Hausa at home. Some 17 percent, on the other hand, had Yoruba as their main family language. Igbo followed, with 13 percent of the respondents indicating it. Some other languages spoken in Nigerian households are English, Ibibio, Fulani, Tiv, Nupe, Pidgin English, and Ijaw.

One of the most diverse countries

There are over 500 languages in Nigeria. The country has only one official language, English. According to estimates from 2018, Nigeria's major ethnic groups are Hausa, Yoruba, Igbo (Ibo), Fulani, Tiv, Kanuri, and Beriberi. Hausa, the largest population, is an ethnic group of people speaking the Hausa language. The Hausa are mainly present in West Africa, most of them living between Nigeria and Niger.

English is the main language at school

The main language of instruction at school is generally English. However, for the first years of education, an indigenous or local language is also taught. As of 2019, around 72 percent of young women and 78 percent of young men in Nigeria were literate in English. This means they could understand, read, and write a short and simple statement in English, for instance, on their everyday life.
Russian-speaking population share 2019, by geographic area
statista.com
Updated Jul 29, 2020
Cite
Statista (2020). Russian-speaking population share 2019, by geographic area [Dataset]. https://www.statista.com/statistics/1139302/russian-speaking-population-by-geographic-area/
Dataset updated
Jul 29, 2020
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
2019
Area covered
Russia
Description
Other than in Russia, the Russian language was widely spoken in CIS countries by over 79 million people in 2019. Furthermore, more than 13 million residents of Eastern European and Balkan countries were Russian speakers. Russian was the eighth most widely spoken language worldwide as of 2019.
ACS Language Spoken at Home Variables - Boundaries
hub.arcgis.com
atlas-connecteddmv.hub.arcgis.com
+3 more
Updated Oct 20, 2018
+ more versions
Cite
Esri (2018). ACS Language Spoken at Home Variables - Boundaries [Dataset]. https://hub.arcgis.com/maps/527ea2b5ba814c8ca1c34a2945e1b751
Dataset updated
Oct 20, 2018
Dataset authored and provided by
Esri (http://esri.com/)
Area covered

Description
This layer shows language group of language spoken at home by age. This is shown by tract, county, and state boundaries. This service is updated annually to contain the most currently released American Community Survey (ACS) 5-year data, and contains estimates and margins of error. There are also additional calculated attributes related to this topic, which can be mapped or used within analysis.

This layer is symbolized to show the percentage of the population age 5+ who speak Spanish at home. To see the full list of attributes available in this service, go to the "Data" tab, and choose "Fields" at the top right.

Current Vintage: 2019-2023
ACS Table(s): B16007
Data downloaded from: Census Bureau's API for American Community Survey
Date of API call: December 12, 2024
National Figures: data.census.gov

The United States Census Bureau's American Community Survey (ACS):
About the Survey
Geography & ACS
Technical Documentation
News & Updates

This ready-to-use layer can be used within ArcGIS Pro, ArcGIS Online, its configurable apps, dashboards, Story Maps, custom apps, and mobile apps. Data can also be exported for offline workflows. For more information about ACS layers, visit the FAQ. Please cite the Census and ACS when using this data.

Data Note from the Census:

Data are based on a sample and are subject to sampling variability. The degree of uncertainty for an estimate arising from sampling variability is represented through the use of a margin of error. The value shown here is the 90 percent margin of error. The margin of error can be interpreted as providing a 90 percent probability that the interval defined by the estimate minus the margin of error and the estimate plus the margin of error (the lower and upper confidence bounds) contains the true value. In addition to sampling variability, the ACS estimates are subject to nonsampling error (for a discussion of nonsampling variability, see Accuracy of the Data). The effect of nonsampling error is not represented in these tables.

Data Processing Notes:

This layer is updated automatically when the most current vintage of ACS data is released each year, usually in December. The layer always contains the latest available ACS 5-year estimates. It is updated annually within days of the Census Bureau's release schedule. Click here to learn more about ACS data releases.
Boundaries come from the US Census TIGER geodatabases, specifically, the National Sub-State Geography Database (named tlgdb_(year)_a_us_substategeo.gdb). Boundaries are updated at the same time as the data updates (annually), and the boundary vintage appropriately matches the data vintage as specified by the Census. These are Census boundaries with water and/or coastlines erased for cartographic and mapping purposes. For census tracts, the water cutouts are derived from a subset of the 2020 Areal Hydrography boundaries offered by TIGER. Water bodies and rivers which are 50 million square meters or larger (mid to large sized water bodies) are erased from the tract level boundaries, as well as additional important features. For state and county boundaries, the water and coastlines are derived from the coastlines of the 2023 500k TIGER Cartographic Boundary Shapefiles. These are erased to more accurately portray the coastlines and Great Lakes.
The original AWATER and ALAND fields are still available as attributes within the data table (units are square meters).
The States layer contains 52 records - all US states, Washington D.C., and Puerto Rico.
Census tracts with no population that occur in areas of water, such as oceans, are removed from this data service (Census Tracts beginning with 99).
Percentages and derived counts, and associated margins of error, are calculated values (that can be identified by the "_calc_" stub in the field name), and abide by the specifications defined by the American Community Survey.
Field alias names were created based on the Table Shells file available from the American Community Survey Summary File Documentation page.
Negative values (e.g., -4444...) have been set to null, with the exception of -5555... which has been set to zero. These negative values exist in the raw API data to indicate the following situations:
The margin of error column indicates that either no sample observations or too few sample observations were available to compute a standard error and thus the margin of error. A statistical test is not appropriate.
Either no sample observations or too few sample observations were available to compute an estimate, or a ratio of medians cannot be calculated because one or both of the median estimates falls in the lowest interval or upper interval of an open-ended distribution.
The median falls in the lowest interval of an open-ended distribution, or in the upper interval of an open-ended distribution. A statistical test is not appropriate.
The estimate is controlled. A statistical test for sampling variability is not appropriate.
The data for this geographic area cannot be displayed because the number of sample cases is too small.
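The Census data note in this description defines the confidence bounds as the estimate minus and plus the 90 percent margin of error. A minimal sketch of that arithmetic follows, assuming the standard z-score of 1.645 for a 90 percent margin of error when converting to a standard error; the function names and sample numbers are hypothetical and not taken from the layer.

```python
Z_90 = 1.645  # z-score corresponding to a 90 percent margin of error


def confidence_bounds(estimate, moe):
    """Return the (lower, upper) bounds defined by estimate -/+ margin of error."""
    return estimate - moe, estimate + moe


def standard_error(moe):
    """Convert a 90 percent margin of error into a standard error."""
    return moe / Z_90


# Hypothetical tract: 1,200 Spanish speakers with a margin of error of 150.
lower, upper = confidence_bounds(1200, 150)
print(lower, upper)
print(round(standard_error(150), 1))
```

The bounds cover sampling variability only; as the note says, nonsampling error is not represented in the margin of error.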
Language Spoken at Home - Counties 2015-2019
covid19.census.gov
Updated Mar 19, 2021
Cite
US Census Bureau (2021). Language Spoken at Home - Counties 2015-2019 [Dataset]. https://covid19.census.gov/datasets/language-spoken-at-home-counties-2015-2019/api
Dataset updated
Mar 19, 2021
Dataset authored and provided by
US Census Bureau
Area covered
Description
This layer shows Language Spoken at Home. This is shown by county boundaries. This service is updated annually to contain the most currently released American Community Survey (ACS) 5-year data, and contains estimates and margins of error. There are also additional calculated attributes related to this topic, which can be mapped or used within analysis.

This layer is symbolized to show the percentage of households with Limited English Speaking Status. To see the full list of attributes available in this service, go to the "Data" tab, and choose "Fields" at the top right.

Current Vintage: 2015-2019
ACS Table(s): B16004, DP02, S1601, S1602
Data downloaded from: Census Bureau's API for American Community Survey
Date of API call: February 10, 2021
National Figures: data.census.gov

The United States Census Bureau's American Community Survey (ACS):
About the Survey
Geography & ACS
Technical Documentation
News & Updates

This ready-to-use layer can be used within ArcGIS Pro, ArcGIS Online, its configurable apps, dashboards, Story Maps, custom apps, and mobile apps. Data can also be exported for offline workflows. Please cite the Census and ACS when using this data.

Data Note from the Census:

Data are based on a sample and are subject to sampling variability. The degree of uncertainty for an estimate arising from sampling variability is represented through the use of a margin of error. The value shown here is the 90 percent margin of error. The margin of error can be interpreted as providing a 90 percent probability that the interval defined by the estimate minus the margin of error and the estimate plus the margin of error (the lower and upper confidence bounds) contains the true value. In addition to sampling variability, the ACS estimates are subject to nonsampling error (for a discussion of nonsampling variability, see Accuracy of the Data). The effect of nonsampling error is not represented in these tables.

Data Processing Notes:

Boundaries come from the US Census TIGER geodatabases. Boundaries are updated at the same time as the data updates (annually), and the boundary vintage appropriately matches the data vintage as specified by the Census. These are Census boundaries with water and/or coastlines clipped for cartographic purposes. For census tracts, the water cutouts are derived from a subset of the 2010 AWATER (Area Water) boundaries offered by TIGER. For state and county boundaries, the water and coastlines are derived from the coastlines of the 500k TIGER Cartographic Boundary Shapefiles. The original AWATER and ALAND fields are still available as attributes within the data table (units are square meters).
The States layer contains 52 records - all US states, Washington D.C., and Puerto Rico.
Census tracts with no population that occur in areas of water, such as oceans, are removed from this data service (Census Tracts beginning with 99).
Percentages and derived counts, and associated margins of error, are calculated values (that can be identified by the "_calc_" stub in the field name), and abide by the specifications defined by the American Community Survey.
Field alias names were created based on the Table Shells file available from the American Community Survey Summary File Documentation page.
Margin of error (MOE) values of -555555555 in the API (or "*****" (five asterisks) on data.census.gov) are displayed as 0 in this dataset. The estimates associated with these MOEs have been controlled to independent counts in the ACS weighting and have zero sampling error. So, the MOEs are effectively zeroes, and are treated as zeroes in MOE calculations.
Other negative values on the API, such as -222222222, -666666666, -888888888, and -999999999, all represent estimates or MOEs that can't be calculated or can't be published, usually due to small sample sizes. All of these are rendered in this dataset as null (blank) values.
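The sentinel handling described in these processing notes can be sketched as a small cleaning function. This is an illustration only: the constant and function names are hypothetical, and the set of "unavailable" codes contains just the four values named in the description, which introduces them with "such as" and so may not be exhaustive.

```python
# Sentinel codes quoted in the description above.
CONTROLLED_MOE = -555555555  # controlled estimate: margin of error treated as zero
UNAVAILABLE = {-222222222, -666666666, -888888888, -999999999}


def clean_value(value):
    """Map raw API sentinel codes to 0 or None; pass other values through."""
    if value == CONTROLLED_MOE:
        return 0
    if value in UNAVAILABLE:
        return None
    return value


# Hypothetical raw margin-of-error values as they might arrive from the API.
raw_moes = [150, -555555555, -666666666, 42]
print([clean_value(v) for v in raw_moes])
```

Keeping zero and null distinct matters downstream: a zero margin of error can enter MOE calculations, while a null value should be excluded from them.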
ACS Population Characteristics: Spoken Languages
gis.data.alaska.gov
alaska-economic-data-dcced.hub.arcgis.com
+5 more
Updated Sep 4, 2019
Cite
Dept. of Commerce, Community, & Economic Development (2019). ACS Population Characteristics: Spoken Languages [Dataset]. https://gis.data.alaska.gov/datasets/DCCED::acs-population-characteristics-spoken-languages
Dataset updated
Sep 4, 2019
Dataset authored and provided by
Dept. of Commerce, Community, & Economic Development
Area covered

Description
Counts and breakdown of languages used, with margins of error, for Alaskan Communities/Places, with aggregation at Borough/CDA and State level, for recent 5-year American Community Survey (ACS) intervals. The 5-year interval data sets are published approximately 1/2 a period later than the End Year listed - for instance the interval ending in 2019 is published in mid-2021.

Source: US Census Bureau, American Community Survey

This data has been visualized in a Geographic Information Systems (GIS) format and is provided as a service in the DCRA Information Portal by the Alaska Department of Commerce, Community, and Economic Development Division of Community and Regional Affairs (SOA DCCED DCRA), Research and Analysis section. SOA DCCED DCRA Research and Analysis is not the authoritative source for this data. For more information and for questions about this data, see: US Census - Language Use

USE CONSTRAINTS: The Alaska Department of Commerce, Community, and Economic Development (DCCED) provides the data in this application as a service to the public. DCCED makes no warranty, representation, or guarantee as to the content, accuracy, timeliness, or completeness of any of the data provided on this site. DCCED shall not be liable to the user for damages of any kind arising out of the use of data or information provided. DCCED is not the authoritative source for American Community Survey data, and any data or information provided by DCCED is provided "as is". Data or information provided by DCCED shall be used and relied upon only at the user's sole risk. For information about the American Community Survey, click here.
Italian Negation Constructions - Tweets
kaggle.com
zip
Updated Feb 11, 2023
Cite
The Devastator (2023). Italian Negation Constructions - Tweets [Dataset]. https://www.kaggle.com/datasets/thedevastator/italian-negation-constructions-tweets
Available download formats
zip (303402 bytes)
Dataset updated
Feb 11, 2023
Authors
The Devastator
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Italian Negation Constructions - Tweets

Exploring Language Variation Across 10 Cities

By [source]

About this dataset

This dataset, the Twitter Italian Negation (TIN) Corpus, provides an interesting glimpse into language change in Romance languages with the emergence of non-standard uses of negations. This collection contains 10,000 tweets from ten different cities (Milan, Rome, Naples, Palermo, Bologna, Turin, Florence, Cagliari, Genoa, and New York City), each collected in August 2019. The data includes tokenized text and frequency measures for each tweet as well as a city column so users can explore regional differences. With this resource users can uncover how the language of these cities is changing over time or even how language usage between neighboring countries or states may differ. Get ready to dive deep into the fascinating shifts that occur between spoken and written languages!

How to use the dataset

This dataset contains 10,000 tweets in Italian gathered from ten different cities between August and December 2019. This collection of tweets provides an interesting insight into the language change phenomena in Romance languages, specifically with regard to non-standard uses of negations.

The dataset is described as having nine columns, of which five are documented here: token, absolute frequency, relative frequency, variation, and the city from which the tweet originated. Each row represents a single token in a particular tweet; each tweet can contain more than one token.

By using this dataset you can analyze and compare patterns of usage across different cities or even within a specific city. You can also compare variations within tokens between different cities to understand how certain constructions are used differently across regions or dialects. Additionally you could use this data to examine trends in literary works such as poetry by looking at the most commonly used words and phrases over time.

To use the data effectively, it is important first to understand what each column represents:

Tok (Tokenized text): This is text that has been broken down into individual words or tokens representing all of the words found in a particular tweet including punctuation marks like commas or exclamation points;

Abs (Absolute Frequency): This is the total number of times that a particular token appears within all tweets;

Rel (Relative Frequency): This is how often a particular token appears relative to other tokens;

Var (Variation): This indicates whether there have been any alterations made compared to standard usage such as “has” being replaced with “haz”;

City: The city from which each tweet originated, for example “Milan” or “Genoa”. This supports analysis of usage differences among locales, as well as across larger geographic areas such as “Italy” versus other countries like the “United States”.

Using these numeric values alongside thematic exploration allows users to understand not only usage but also trends across geographic populations, locally and globally, in how Twitter users employ language, especially non-standard dialectal constructs throughout Italy.
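A per-city comparison of the kind described above might be sketched as follows, using only the Python standard library. The column names tok and abs appear in the dataset's column listing; the city column name, the per-city normalization, and the sample rows are assumptions made for illustration and are not taken from the actual file.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows in the assumed layout (tok, abs, city); not real data.
SAMPLE = """tok,abs,city
non,3,Milan
mica,1,Milan
non,2,Rome
mica,2,Rome
"""


def relative_frequencies(rows):
    """Return {(city, token): share of that city's total token count}."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["city"]] += int(row["abs"])
    return {
        (row["city"], row["tok"]): int(row["abs"]) / totals[row["city"]]
        for row in rows
    }


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
rel = relative_frequencies(rows)
for (city, tok), share in sorted(rel.items()):
    print(city, tok, share)
```

Normalizing within each city makes cities with different tweet volumes comparable; the dataset's own rel column may be computed on a different basis, so it is worth checking before mixing the two.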

Research Ideas

Studying the regional variation of Italian negation constructions by comparing the frequency and variation between cities.

Investigating language change over time by tracking changes in relative and absolute frequencies of negation constructions across tweets.

Exploring how different socio-economic contexts or trends such as news, fashion, sports impacted the evolution of language use in tweets in each city

Acknowledgements

If you use this dataset in your research, please credit the original authors.

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: interessa+word1.csv

| Column name | Description |
|:--------------|:------------------------------------------------------|
| tok | Tokenized text of the tweet. (String) |
| abs | Absolute frequency of a token in the... |
Wales: A level European language candidates in 2019
statista.com
Updated Nov 28, 2025
Cite
Statista (2025). Wales: A level European language candidates in 2019 [Dataset]. https://www.statista.com/statistics/346488/wales-european-language-candidates-a-level/
Dataset updated
Nov 28, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
2019
Area covered
Wales, United Kingdom
Description
This statistic shows the number of A level candidates taking a European language subject in Wales in the year 2019, and highlights how few students there are. Welsh was the most popular European language. French was over * times more popular than German, and over * times more popular than Spanish.
Share of U.S. population speaking a language besides English at home 2023,...
statista.com
Updated Nov 28, 2025
Cite
Statista (2025). Share of U.S. population speaking a language besides English at home 2023, by state [Dataset]. https://www.statista.com/statistics/312940/share-of-us-population-speaking-a-language-other-than-english-at-home-by-state/
Dataset updated
Nov 28, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
2023
Area covered
United States
Description
As of 2023, more than ** percent of people in the United States spoke a language other than English at home. California had the highest share among all U.S. states, with ** percent of its population speaking a language other than English at home.
Speech Across Dialects of English: Acoustic Measures from SPADE Project...
datacatalogue.ukdataservice.ac.uk
Updated Feb 21, 2024
Cite
Stuart-Smith, J, University of Glasgow; Sonderegger, M, McGill University; Mielke, J, North Carolina State University (2024). Speech Across Dialects of English: Acoustic Measures from SPADE Project Corpora, 1949-2019 [Dataset]. http://doi.org/10.5255/UKDA-SN-854959
Unique identifier
https://doi.org/10.5255/UKDA-SN-854959
Dataset updated
Feb 21, 2024
Authors
Stuart-Smith, J, University of Glasgow; Sonderegger, M, McGill University; Mielke, J, North Carolina State University
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Time period covered
Jan 1, 1949 - Jan 1, 2019
Area covered
United Kingdom, Ireland, Canada, United States
Description
The SPADE project aims to develop and apply user-friendly software for large-scale speech analysis of existing public and private English speech datasets, in order to understand more about English speech over space and time. To date, we have worked with 42 shared corpora comprising dialects from across the British Isles (England, Wales, Scotland, Ireland) and North America (US, Canada), with an effective time span of over 100 years. We make available here a link to our OSF repository (see below) which has acoustic measures datasets for sibilants and durations and static formants for vowels, for 39 corpora (~2200 hours of speech analysed from ~8600 speakers), with information about dataset generation. In addition, at the OSF site, we provide Praat TextGrids created by SPADE for some corpora. Reading passage text is provided when the measures are based on reading only. Datasets are in their raw form and will require cleaning (e.g. outlier removal) before analysis. In addition, we used whitelisting to anonymise measures datasets generated from non-public, restricted corpora.
Obtaining a data visualization of a text search within seconds via generic, large-scale search algorithms, such as Google n-gram viewer, is available to anyone. By contrast, speech research is only now entering its own 'big data' revolution. Historically, linguistic research has tended to carry out fine-grained analysis of a few aspects of speech from one or a few languages or dialects. The current scale of speech research studies has shaped our understanding of spoken language and the kinds of questions that we ask. Today, massive digital collections of transcribed speech are available from many different languages, gathered for many different purposes: from oral histories, to large datasets for training speech recognition systems, to legal and political interactions. Sophisticated speech processing tools exist to analyze these data, but require substantial technical skill. Given this confluence of data and tools, linguists have a new opportunity to answer fundamental questions about the nature and development of spoken language.

Our project seeks to establish the key tools to enable large-scale speech research to become as powerful and pervasive as large-scale text mining. It is based on a partnership of three teams based in Scotland, Canada and the US. Together we exploit methods from computing science and put them to work with tools and methods from speech science, linguistics and digital humanities, to discover how much the sounds of English across the Atlantic vary over space and time.

We have developed innovative and user-friendly software which exploits the availability of existing speech data and speech processing tools to facilitate large-scale integrated speech corpus analysis across many datasets together. The gains of such an approach are substantial: linguists will be able to scale up answers to existing research questions from one to many varieties of a language, and ask new and different questions about spoken language within and across social, regional, and cultural, contexts. Computational linguistics, speech technology, forensic and clinical linguistics researchers, who engage with variability in spoken language, will also benefit directly from our software. This project also opens up vast potential for those who already use digital scholarship for spoken language collections in the humanities and social sciences more broadly, e.g. literary scholars, sociologists, anthropologists, historians, political scientists. The possibility of ethically non-invasive inspection of speech and texts will allow analysts to uncover far more than is possible through textual analysis alone.

Our project has developed and applied our new software to a global language, English, using existing public and private spoken datasets of Old World (British Isles) and New World (North American) English, across an effective time span of more than 100 years, spanning the entire 20th century. Much of what we know about spoken English comes from influential studies on a few specific aspects of speech from one or two dialects. This vast literature has established important research questions which have now been investigated for the first time on a much larger scale, through standardized data across many different varieties of English.

Our large-scale study complements current-scale studies, by enabling us to consider stability and change in English across the 20th century on an unparalleled scale. The global nature of English means that our findings will be interesting and relevant to a large international non-academic audience; they have been made accessible through an innovative and dynamic visualization of linguistic variation via an interactive sound mapping website. In addition to new insights into spoken English, this project also lays the crucial groundwork for large-scale speech studies across many datasets from different languages, of different formats and structures.
Table_1_Multi-Talker Speech Promotes Greater Knowledge-Based Spoken Mandarin...
frontiersin.figshare.com
docx
Updated May 31, 2023
Seth Wiener; Chao-Yang Lee (2023). Table_1_Multi-Talker Speech Promotes Greater Knowledge-Based Spoken Mandarin Word Recognition in First and Second Language Listeners.DOCX [Dataset]. http://doi.org/10.3389/fpsyg.2020.00214.s001
Explore at:
Available download formats: docx
Unique identifier
https://doi.org/10.3389/fpsyg.2020.00214.s001
Dataset updated
May 31, 2023
Dataset provided by
Frontiers
Authors
Seth Wiener; Chao-Yang Lee
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Spoken word recognition involves a perceptual tradeoff between the reliance on the incoming acoustic signal and knowledge about likely sound categories and their co-occurrences as words. This study examined how adult second language (L2) learners navigate between acoustic-based and knowledge-based spoken word recognition when listening to highly variable, multi-talker truncated speech, and whether this perceptual tradeoff changes as L2 listeners gradually become more proficient in their L2 after multiple months of structured classroom learning. First language (L1) Mandarin Chinese listeners and L1 English-L2 Mandarin adult listeners took part in a gating experiment. The L2 listeners were tested twice – once at the start of their intermediate/advanced L2 language class and again 2 months later. L1 listeners were only tested once. Participants were asked to identify syllable-tone words that varied in syllable token frequency (high/low according to a spoken word corpus) and syllable-conditioned tonal probability (most probable/least probable in speech given the syllable). The stimuli were recorded by 16 different talkers and presented at eight gates ranging from onset-only (gate 1) through onset +40 ms increments (gates 2 through 7) to the full word (gate 8). Mixed-effects regression modeling was used to compare performance to our previous study which used single-talker stimuli (Wiener et al., 2019). The results indicated that multi-talker speech caused both L1 and L2 listeners to rely more on knowledge-based processing of tone. L1 listeners were able to draw on distributional knowledge of syllable-tone probabilities in early gates and switch to predominantly acoustic-based processing when more of the signal was available. In contrast, L2 listeners, with their limited experience with talker range normalization, were less able to effectively transition from probability-based to acoustic-based processing.
Moreover, for the L2 listeners, the reliance on such distributional information for spoken word recognition appeared to be conditioned by the nature of the acoustic signal. Single-talker speech did not result in the same pattern of probability-based tone processing, suggesting that knowledge-based processing of L2 speech may only occur under certain acoustic conditions, such as multi-talker speech.
Enrolments of LBOTE government school students by largest language groups...
researchdata.edu.au
Updated Jun 10, 2024
NSW Department of Education (2024). Enrolments of LBOTE government school students by largest language groups (2017-2024) [Dataset]. https://researchdata.edu.au/enrolments-lbote-government-2017-2024/2968159
Explore at:
Dataset updated
Jun 10, 2024
Dataset provided by
data.nsw.gov.au
Authors
NSW Department of Education
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset captures the diversity of students with a language background other than English (LBOTE) who are enrolled in NSW government schools.

Data Notes:

* LBOTE students are those in whose home a language other than English is spoken by the student, parents, or other primary caregivers.

* LBOTE and total (headcount) enrolment figures are collected in March of each year. Most other collections use enrolment data that are collected as part of the Mid Year Census in August.

* The table is ordered by the largest language groups for language groups with 1000 or more students in the most recent year presented. Language groups with fewer than 1000 students are included in 'other language groups'.

* Indian and Chinese Languages are included as a combined total, and also as separate distinct languages. Therefore Indian and Chinese data appears twice in the table.

* Due to rounding issues, the total percentage for Indian and Chinese Language groups may be slightly different to the sum of the distinct languages.

* There can be minor changes in the categorization of less common languages and dialects over time. For example, these definitional variations account for the difference in the 2018 total as reported in the ‘Enrolments of LBOTE government school students by largest language groups’ table in the 2018 and 2019 LBOTE bulletins.

Data Source:

* Centre for Education Statistics and Evaluation, NSW Department of Education.
Moving from China to York: How do Changes in Language Environment Modulate...
datacatalogue.ukdataservice.ac.uk
Updated Jul 17, 2024
de Bruin, A, University of York (2024). Moving from China to York: How do Changes in Language Environment Modulate Bilingual Language Control, 2023 [Dataset]. http://doi.org/10.5255/UKDA-SN-857311
Explore at:
Unique identifier
https://doi.org/10.5255/UKDA-SN-857311
Dataset updated
Jul 17, 2024
Authors
de Bruin, A, University of York
Area covered
China, United Kingdom
Description
Within this project, Mandarin-English bilingual participants completed a series of language production and comprehension tasks. The data provided correspond to three manuscripts, with the abstract per manuscript provided below.

1. Coumel, M., Liu, C., Trenkic, D., & de Bruin, A. (in press). Do accent and input modality modulate processing of language switches in bilingual language comprehension? Journal of Experimental Psychology: Human Perception and Performance. We examined how bilinguals process language switches between their first (L1) and second language (L2). Language switching costs (slower responses to language switch than non-switch trials) appear to arise more systematically in production than in comprehension, possibly because the latter context might sometimes elicit less language co-activation (Declerck et al., 2019). This might reduce language competition and in turn the need for bilinguals to apply language control when processing language switches. Yet even in comprehension, language co-activation may vary depending on variables such as the accent of the speaker (for example, whether the L2 words are pronounced with an L1 or L2 accent) and input modality (spoken or written). In three experiments conducted in 2021-2022, we tested how unbalanced Mandarin-English bilinguals processed language switches during comprehension and the potential influence of a speaker’s accent and input modality. Overall, across settings, participants experienced significant language switching costs. In some conditions, switching costs were larger to L1-Mandarin than to L2-English, an asymmetry consistent with the participants’ dominance in L1-Mandarin and the application of language control. However, manipulating accent and input modality did not influence language switches, suggesting they did not impact language co-activation sufficiently to modulate language control.

2. Bilingual language control during single-language production: Does relocation to a new linguistic environment change it? (submitted) A bilingual’s two languages are simultaneously active and competing for selection, even when only one language is used. To manage this competition, bilinguals apply language control. First, we examined how bilinguals apply control in two single-language tasks differing in their demands on lexical selection. Second, we examined how this language control might adapt to the language environment bilinguals live in. We conducted a longitudinal study with Mandarin-English bilinguals who moved from China to the UK and a control group staying in China. Participants completed a picture-naming task in which they had to retrieve one word in response to a picture and a verbal-fluency task in which they generated words belonging to a semantic category. Both tasks were completed twice, approximately seven months apart. In both tasks, bilinguals proactively applied language control over the language they were not currently using to manage the anticipated language competition. However, this language control did not change after relocation to the UK, nor did it differ between the groups. This suggests that while language control is a core part of language production, the language environment a bilingual lives in might not have a defining impact on the exact way this language control is applied.

3. How do changes in language environment modulate bilingual language switching in production and comprehension? (submitted) In dual-language contexts, bilinguals often switch between their languages. How they do this, and how they control their languages during switching, can depend on the nature of the interactional context and the modality (comprehension or production). Here, we examined the influence of the interactional context on language control in two ways. First, we examined how language control differs between producing language switches in response to cues, producing switches voluntarily, and comprehending switches. Second, we examined whether language control changes when the general language environment a bilingual lives in changes. To do this, we conducted a longitudinal study with Mandarin-English bilinguals who moved from China (L1-dominant environment) to the UK (bilingual/L2-dominant environment) and with a control group staying in China. Participants completed three tasks twice (seven months apart): cued picture naming (cues indicating language choice), voluntary picture naming (free language choice), and comprehension of spoken words. Language control differed between the three tasks. Participants showed greater language-switching costs in cued production than during voluntary production and comprehension. Furthermore, only cued production showed that using two languages was more costly than using one (mixing costs). However, we found no evidence that a change in language environment resulted in changes in language control. This suggests a bilingual’s language control mechanisms adapt to the immediate context they are communicating in, but are perhaps not shaped as strongly by the overall language environment they live in.
Enrolments of LBOTE government school students by largest language groups...
data.nsw.gov.au
csv
Updated Nov 6, 2024
NSW Department of Education (2024). Enrolments of LBOTE government school students by largest language groups (2017-2024) [Dataset]. https://data.nsw.gov.au/data/dataset/nsw-education-enrolments-of-lbote-government-school-students-by-largest-language-groups
Explore at:
Available download formats: csv(1854), csv(1824), csv(7577), csv(1888), csv(2094), csv(2307), csv(2222)
Dataset updated
Nov 6, 2024
Dataset authored and provided by
NSW Department of Education https://education.nsw.gov.au/
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset captures the diversity of students with a language background other than English (LBOTE) who are enrolled in NSW government schools.

Data Notes:

LBOTE students are those in whose home a language other than English is spoken by the student, parents, or other primary caregivers.

LBOTE and total (headcount) enrolment figures are collected in March of each year. Most other collections use enrolment data that are collected as part of the Mid Year Census in August.

The table is ordered by the largest language groups for language groups with 1000 or more students in the most recent year presented. Language groups with fewer than 1000 students are included in 'other language groups'.

Indian and Chinese Languages are included as a combined total, and also as separate distinct languages. Therefore Indian and Chinese data appears twice in the table.

Due to rounding issues, the total percentage for Indian and Chinese Language groups may be slightly different to the sum of the distinct languages.

There can be minor changes in the categorization of less common languages and dialects over time. For example, these definitional variations account for the difference in the 2018 total as reported in the ‘Enrolments of LBOTE government school students by largest language groups’ table in the 2018 and 2019 LBOTE bulletins.

Data Source:

Centre for Education Statistics and Evaluation, NSW Department of Education.
Arabic and Mandarin interventions 2019.
plos.figshare.com
xls
Updated Nov 14, 2024
Anita Lal; Mohammadreza Mohebi; Kerryann Wyatt; Ayesha Ghosh; Kate Broun; Lan Gao; Nikki McCaffrey (2024). Arabic and Mandarin interventions 2019. [Dataset]. http://doi.org/10.1371/journal.pone.0313058.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0313058.t001
Dataset updated
Nov 14, 2024
Dataset provided by
PLOS http://plos.org/
Authors
Anita Lal; Mohammadreza Mohebi; Kerryann Wyatt; Ayesha Ghosh; Kate Broun; Lan Gao; Nikki McCaffrey
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Background: Effective bowel cancer screening is freely available in Australia; however, there are inequities in utilisation among people who do not speak English at home. This study estimates the health impacts and cost-effectiveness of recruitment interventions targeted at Arabic and Mandarin speaking populations in Victoria, Australia to increase bowel cancer screening participation.

Methods: A Markov microsimulation model simulated the development of bowel cancer, considering National Bowel Cancer Screening Program participation rates. Culturally specific recruitment interventions (e.g., community education and tailored paid media) for 50–74-year-olds were compared to usual practice. A cost-utility analysis was conducted over a 50-year time horizon from a healthcare perspective, to estimate the cost per quality-adjusted life year (QALY) based on plausible effectiveness levels. Costs are in 2019 Australian dollars.

Results: Intervention costs were $6.90 per person for the Arabic speaking group and $3.10 for Mandarin speakers. The estimated cost/QALY was $2,781 (95% uncertainty interval [UI]: $2,144–$3,277) when screening increased by 0.2% in the Arabic group, and an estimated 5–6 additional adenoma and cancer cases were detected. In the Mandarin group, the estimated cost/QALY was $1,024 (95% UI: $749–$1,272) when screening increased by 1.1%, and an estimated 18–23 additional adenoma and cancer cases were detected.

Conclusions: Culturally specific recruitment interventions to increase bowel cancer screening are inexpensive and likely to be cost-effective. Improvements in capturing language spoken at home by the National program would facilitate more precise estimates of the effectiveness and cost-effectiveness of these interventions.
Identities, Values and Attitudes among Russian-, Estonian-, Somali-, and...
services.fsd.tuni.fi
zip
Updated Jan 9, 2025
Finnish Social Science Data Archive (2025). Identities, Values and Attitudes among Russian-, Estonian-, Somali-, and Arabic-speaking People in the Helsinki Capital Region 2018-2019 [Dataset]. http://doi.org/10.60686/t-fsd3448
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.60686/t-fsd3448
Dataset updated
Jan 9, 2025
Dataset provided by
Finnish Social Science Data Archive
Area covered
Helsinki
Description
The survey studied the values and identities of the five main linguistic minorities, Russian-, Estonian-, Somali-, English- and Arabic-speaking people, in the Helsinki capital region. The main theme of the survey was the self-image of the respondents as an individual and as part of different groups. The study was funded by E2 Research, Finnish Cultural Foundation, Ministry of Justice, and the cities of Helsinki, Espoo and Vantaa. First, the respondents were asked how many years they had lived in Finland and where they or their family were originally from. National identity was surveyed with questions on how the respondents viewed their nationality and whether they felt they were a part of Finnish society. Questions also focused on the significance of various regional and social factors, such as the respondent's current area of residence, being a European citizen, level of education, and cultural traditions, as contributors to the respondent's identity. The respondents were also asked to describe their family's social class during their childhood (e.g. whether their family was a working or middle class family). Next, the respondents' views on several statements were surveyed. The statements included, for example, whether the respondent thought that being Finnish was connected to one's ethnic background, that the media portrayed representatives of their minority group too negatively, and that Finland needs strong leadership so that social problems can be fixed without compromises. The importance of several things for the respondents, such as power, wealth, equality, and forgiveness, was charted. The respondents were also asked to consider what was extremely important for them in their life (e.g. health, love, children, traditions, safety and security). Finally, the respondents' trust in various authorities and organisations in Finland, such as the President, the justice system, banks, and large corporations, was examined. 
The respondents' trust in individuals was also charted. Background variables included, among others, the respondent's age group, gender, economic activity and occupational status, number of completed school years, country where R completed their highest level of education, household composition, mother tongue, which language R spoke at home, and the main reason why R moved to Finland.


Statista, The most spoken languages worldwide 2025 [Dataset]. https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/

The most spoken languages worldwide 2025

Explore at:

464 scholarly articles cite this dataset (View in Google Scholar)

Dataset authored and provided by

Statista http://statista.com/

Time period covered

2025

Area covered

World

Description

In 2025, there were around 1.53 billion people worldwide who spoke English either natively or as a second language, slightly more than the 1.18 billion Mandarin Chinese speakers at the time of survey. Hindi and Spanish accounted for the third and fourth most widespread languages that year.

Languages in the United States

The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken in the United States vary as a result of the multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which more than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese), 1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year.

Different languages at home

The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage of population speaking a language other than English is California. About 45 percent of its population was speaking a language other than English at home in 2023.

