100+ datasets found

The most spoken languages worldwide 2025
statista.com
Updated Apr 14, 2025
Cite
Statista (2025). The most spoken languages worldwide 2025 [Dataset]. https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/
Explore at:
Dataset updated
Apr 14, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
2025
Area covered
World
Description
In 2025, there were around 1.53 billion people worldwide who spoke English either natively or as a second language, slightly more than the 1.18 billion Mandarin Chinese speakers at the time of survey. Hindi and Spanish accounted for the third and fourth most widespread languages that year.

Languages in the United States

The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken in the United States vary as a result of the multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which more than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese), 1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year.

Different languages at home

The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage of population speaking a language other than English is California. About 45 percent of its population spoke a language other than English at home in 2023.
Top Languages Spoken in the United States
kaggle.com
Updated Oct 22, 2022
Cite
The Devastator (2022). Top Languages Spoken in the United States [Dataset]. https://www.kaggle.com/datasets/thedevastator/top-languages-spoken-in-the-united-states/code
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Oct 22, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
The Devastator
Area covered
United States
Description
Top Languages Spoken in the United States

The Impact of linguistics on Community and Business in America

About this dataset

Languages are an important part of daily life in the USA. Here is a table that shows the most common languages spoken in the USA, as well as a big spreadsheet which shows each CBSA (Core-Based Statistical Area, or urban area).

Language usage varies widely throughout the United States. According to the latest census data, over 350 different languages are represented in homes across the country. The following table and spreadsheet provide more detailed information on language usage throughout the various states and cities in the US:

Columns:
- index: Index column for dataframe
- Table with column headers in row 5 and row headers in column A: Contains language data for each CBSA (Core Based Statistical Area)
- Unnamed: 1: Rank of CBSA by total number of speakers of all languages
- Unnamed: 2: Name of CBSA
- Unnamed: 3: Population of CBSA
- Unnamed: 4: Percent of population that speaks English very well
- Unnamed: 5 through Unnamed: 58: Languages spoken by at least 0.1% of the population, with corresponding percentages

How to use the dataset

This dataset can be used to understand the linguistic diversity of the United States, and to compare languages spoken across different states and cities.

This data can also be used to explore trends in language usage over time.

Businesses can use this dataset to identify which languages are most commonly spoken in the areas in which they operate and tailor their marketing or customer service accordingly.

Schools could use this dataset to plan language-learning programs based on the needs of their community.

Policymakers could use this data to better understand linguistic diversity in the United States and design programs to support bilingualism or multilingualism.

Research Ideas

Businesses can use this dataset to identify which languages are most commonly spoken in the areas in which they operate and cater their marketing or customer service accordingly.

Schools could use this data to plan language-learning programs based on the needs of their community.

Policymakers could use this dataset to better understand linguistic diversity in the United States and design programs to support bilingualism or multilingualism.

Acknowledgements

This dataset was created by Gary Hoover. The data was sourced from https://www.kaggle.com/garyhoov/us-languages

License

Unknown License - Please check the dataset description for more information.

Columns

File: Languages Spoken at Home by Urban Area = CBSA.csv

File: US Languages Spoken at Home 2014.csv

| Column name | Description |
|:---|:---|
| Table with column headers in row 5 and row headers in column A | |
Population and Languages of the Limited English Proficient (LEP) Speakers by...
catalog.data.gov
data.cityofnewyork.us
Updated Jan 19, 2024
Cite
data.cityofnewyork.us (2024). Population and Languages of the Limited English Proficient (LEP) Speakers by Community District [Dataset]. https://catalog.data.gov/dataset/population-and-languages-of-the-limited-english-proficient-lep-speakers-by-community-distr
Explore at:
Dataset updated
Jan 19, 2024
Dataset provided by
data.cityofnewyork.us
Description
Many residents of New York City speak more than one language; a number of them speak and understand non-English languages more fluently than English. This dataset, derived from the Census Bureau's American Community Survey (ACS), includes information on over 1.7 million limited English proficient (LEP) residents and a subset of that population called limited English proficient citizens of voting age (CVALEP) at the Community District level. There are 59 community districts throughout NYC, with each district being represented by a Community Board.
Main language spoken - Dataset - Data Place Plymouth
plymouth.thedata.place
Updated Oct 6, 2016
Cite
(2016). Main language spoken - Dataset - Data Place Plymouth [Dataset]. https://plymouth.thedata.place/dataset/main-language-spoken-detailed-plymouth
Explore at:
Dataset updated
Oct 6, 2016
License
Open Government Licence 3.0 (http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/)
License information was derived automatically
Area covered
Plymouth
Description
Data showing the main languages spoken in Plymouth by population numbers.
Caribbean Netherlands; Spoken languages and main language, characteristics
cloud.csiss.gmu.edu
ckan.mobidatalab.eu
+3more
Updated Jul 10, 2019
Cite
Netherlands (2019). Caribbean Netherlands; Spoken languages and main language, characteristics [Dataset]. https://cloud.csiss.gmu.edu/uddi/dataset/60050-caribbean-netherlands-spoken-languages-and-main-language-characteristics
Explore at:
Available download formats: http://publications.europa.eu/resource/authority/file-type/atom, http://publications.europa.eu/resource/authority/file-type/json
Dataset updated
Jul 10, 2019
Dataset provided by
Netherlands
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Area covered
Caribbean Netherlands
Description
This table provides data on languages spoken by the population of the Caribbean Netherlands aged 15 years and older in private households. Breakdowns by sex, age and level of education are presented. These aspects are shown for the Caribbean Netherlands and also for the islands Bonaire, St Eustatius and Saba separately. The research is a sample survey. This means that the figures shown are estimates for which reliability margins apply. These margins are also included in the table. The Omnibus survey was carried out for the first time on Bonaire, Saba and St. Eustatius in 2013 during the month of June and the first week of July. For the second time the Omnibus survey was carried out on Bonaire during the months of October and November 2017, and on Saba and St. Eustatius in the period January to March 2018.

Data available from: 2013

Status of the figures: The figures in this table are final.

Changes as of 4 April 2019: None, this is a new table.

When will new figures be published? New data will be published every four years.
Data from: Language Spoken at Home
linc.osbm.nc.gov
ncosbm.opendatasoft.com
csv, excel, geojson +1
Updated Oct 3, 2024
Cite
(2024). Language Spoken at Home [Dataset]. https://linc.osbm.nc.gov/explore/dataset/language-spoken-at-home/
Explore at:
Available download formats: geojson, csv, json, excel
Dataset updated
Oct 3, 2024
Description
Language spoken at home and the ability to speak English for the population age 5 and over, as reported in the US Census Bureau's American Community Survey (ACS) 5-year estimates, table C16001.
LANGUAGE SPOKEN AT HOME FOR THE POPULATION 5 YEARS AND OVER (C16001)
catalog.data.gov
Updated Jan 31, 2025
Cite
City of Seattle ArcGIS Online (2025). LANGUAGE SPOKEN AT HOME FOR THE POPULATION 5 YEARS AND OVER (C16001) [Dataset]. https://catalog.data.gov/dataset/language-spoken-at-home-for-the-population-5-years-and-over-c16001
Explore at:
Dataset updated
Jan 31, 2025
Dataset provided by
https://arcgis.com/
Description
Table from the American Community Survey (ACS) C16001 of language spoken at home for the population 5 years and over. These are multiple, nonoverlapping vintages of the 5-year ACS estimates of population and housing attributes starting in 2010 shown by the corresponding census tract vintage. Also includes the most recent release annually.

King County, Washington census tracts with nonoverlapping vintages of the 5-year American Community Survey (ACS) estimates starting in 2010. Vintage identified in the "ACS Vintage" field.

The census tract boundaries match the vintage of the ACS data (currently 2010 and 2020) so please note the geographic changes between the decades. Tracts have been coded as being within the City of Seattle as well as assigned to neighborhood groups called "Community Reporting Areas". These areas were created after the 2000 census to provide geographically consistent neighborhoods through time for reporting U.S. Census Bureau data. This is not an attempt to identify neighborhood boundaries as defined by neighborhoods themselves.

Vintages: 2010, 2015, 2020, 2021, 2022, 2023
ACS Table(s): C16001
Data downloaded from: https://data.census.gov/
UGSpeechData: A Multilingual Speech Dataset of Ghanaian Languages
scidb.cn
Updated Mar 26, 2025
Cite
Wiafe, Isaac; Abdulai, Jamal-Deen; Ekpezu, Akon Obu; Helegah, Raynard Dodzi; Atsakpo, Elikem Doe; Nutrokpor, Charles; Winful, Fiifi Baffoe Payin; Solaga, Kafui Kwashia (2025). UGSpeechData: A Multilingual Speech Dataset of Ghanaian Languages [Dataset]. http://doi.org/10.57760/sciencedb.22298
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.57760/sciencedb.22298
Dataset updated
Mar 26, 2025
Dataset provided by
Science Data Bank
Authors
Wiafe, Isaac; Abdulai, Jamal-Deen; Ekpezu, Akon Obu; Helegah, Raynard Dodzi; Atsakpo, Elikem Doe; Nutrokpor, Charles; Winful, Fiifi Baffoe Payin; Solaga, Kafui Kwashia
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) (https://creativecommons.org/licenses/by-nc-nd/4.0/)
License information was derived automatically
Area covered
Ghana
Description
The UGSpeechData is a collection of audio speech data of Akan, Ewe, Dagaare, Dagbani, and Ikposo. These languages are among the most spoken languages in Ghana. The uploaded dataset contains a total of 970148 audio files (5384.28 hours) and 93262 transcribed audio files (518 hours). The audio files are descriptions of 1000 culturally relevant images collected from indigenous speakers of each of the languages. Each audio file is between 15 and 30 seconds long. More specifically, the dataset contains five subfolders, one for each of the five languages. Each language has at least 1000 hours of speech data and 100 hours of transcribed speech data. Fig. 1 provides details of the transcribed audio corpus, including gender and recording environments for each language.

Fig. 1. Details of transcribed audio files
785 Million Language Translation Database for AI
kaggle.com
Updated Aug 28, 2023
Cite
Ramakrishnan Lakshmanan (2023). 785 Million Language Translation Database for AI [Dataset]. https://www.kaggle.com/datasets/ramakrishnan1984/785-million-language-translation-database-ai-ml
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 28, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Ramakrishnan Lakshmanan
License
http://www.gnu.org/licenses/lgpl-3.0.html
Description
Our groundbreaking translation dataset represents a monumental advancement in the field of natural language processing and machine translation. Comprising a staggering 785 million records, this corpus bridges language barriers by offering translations from English to an astonishing 548 languages. The dataset promises to be a cornerstone resource for researchers, engineers, and developers seeking to enhance their machine translation models, cross-lingual analysis, and linguistic investigations.

Size of the dataset: 41 GB uncompressed, 20 GB compressed.

Key Features:

Scope and Scale: With a comprehensive collection of 785 million records, this dataset provides an unparalleled wealth of translated text. Each record consists of an English sentence paired with its translation in one of the 548 target languages, enabling multi-directional translation applications.

Language Diversity: Encompassing translations into 548 languages, this dataset represents a diverse array of linguistic families, dialects, and scripts. From widely spoken languages to those with limited digital representation, the dataset bridges communication gaps on a global scale.

Quality and Authenticity: The translations have been meticulously curated, verified, and cross-referenced to ensure high quality and authenticity. This attention to detail guarantees that the dataset is not only extensive but also reliable, serving as a solid foundation for machine learning applications. The data was collected from various open datasets for my personal ML projects, and I am looking to share it with the team.

Use Case Versatility: Researchers and practitioners across a spectrum of domains can harness this dataset for a myriad of applications. It facilitates the training and evaluation of machine translation models, empowers cross-lingual sentiment analysis, aids in linguistic typology studies, and supports cultural and sociolinguistic investigations.

Machine Learning Advancement: Machine translation models, especially neural machine translation (NMT) systems, can leverage this dataset to enhance their training. The large-scale nature of the dataset allows for more robust and contextually accurate translation outputs.

Fine-tuning and Customization: Developers can fine-tune translation models using specific language pairs, offering a powerful tool for specialized translation tasks. This customization capability ensures that the dataset is adaptable to various industries and use cases.

Data Format: The dataset is provided in a structured JSON format, facilitating easy integration into existing machine learning pipelines. This structured approach expedites research and experimentation. Each JSON record contains an English word and its equivalent word in the target language. The data was exported from a MongoDB database to ensure the uniqueness of the records. Each record is unique, and the records are sorted.

Access: The dataset is available for academic and research purposes, enabling the global AI community to contribute to and benefit from its usage. A well-documented API and sample code are provided to expedite exploration and integration.

The English-to-548-languages translation dataset represents an incredible leap forward in advancing multilingual communication, breaking down barriers to understanding, and fostering collaboration on a global scale. It holds the potential to reshape how we approach cross-lingual communication, linguistic studies, and the development of cutting-edge translation technologies.

Dataset Composition: The dataset is a culmination of translations from English, a widely spoken and understood language, into 548 distinct languages. Each language represents a unique linguistic and cultural background, providing a rich array of translation contexts. This diverse range of languages spans across various language families, regions, and linguistic complexities, making the dataset a comprehensive repository for linguistic research.

Data Volume and Scale: With a staggering 785 million records, the dataset boasts an immense scale that captures a vast array of translations and linguistic nuances. Each translation entry consists of an English source text paired with its corresponding translation in one of the 548 target languages. This vast corpus allows researchers and practitioners to explore patterns, trends, and variations across languages, enabling the development of robust and adaptable translation models.

Linguistic Coverage: The dataset covers an extensive set of languages, including but not limited to Indo-European, Afroasiatic, Sino-Tibetan, Austronesian, Niger-Congo, and many more. This broad linguistic coverage ensures that languages with varying levels of grammatical complexity, vocabulary richness, and syntactic structures are included, enhancing the applicability of translation models across diverse linguistic landscapes.

Dataset Preparation: The translation ...

Data from: LaFresCat: a Catalan multi-accent speech dataset for...

zenodo.org

application/gzip, txt

Updated Feb 18, 2025

Cite

Zenodo (2025). LaFresCat: a Catalan multi-accent speech dataset for text-to-speech [Dataset]. http://doi.org/10.21437/iberspeech.2024-42

Explore at:

Available download formats: txt, application/gzip

Unique identifier

https://doi.org/10.21437/iberspeech.2024-42

Dataset updated

Feb 18, 2025

Dataset provided by

Zenodo

License

Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically

Description

LaFresCat Multiaccent

We present LaFresCat, the first Catalan multiaccented and multispeaker dataset.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Commercial use is only possible through licensing by the voice artists. For further information, contact langtech@bsc.es and lafrescaproduccions@gmail.com.

Dataset Details

Dataset Description

The audio files in this dataset were created from professional studio recordings by professional voice actors at Lafresca Creative Studio. This is the raw version of the dataset: no resampling or trimming has been applied to the audio. Files are stored in WAV format at a 48 kHz sampling rate.

In total, there are 4 different accents, with 2 speakers per accent (female and male). After trimming, the recordings total 3.75 h, divided by speaker ID as follows:

Balear
- olga -> 23.5 min
- quim -> 30.93 min
Central
- elia -> 33.14 min
- grau -> 37.86 min
Occidental (North-Western)
- emma -> 28.67 min
- pere -> 25.12 min
Valencia
- gina -> 22.25 min
- lluc -> 23.58 min
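
As a quick sanity check on the stated total, the per-speaker durations listed above can be summed. This is only an arithmetic sketch: the name `durations_min` is ours, and the values are copied from the list with decimal commas read as decimal points.

```python
# Per-speaker durations in minutes, copied from the list above
# (decimal commas in the source are read as decimal points).
durations_min = {
    "olga": 23.5, "quim": 30.93,   # Balear
    "elia": 33.14, "grau": 37.86,  # Central
    "emma": 28.67, "pere": 25.12,  # Occidental (North-Western)
    "gina": 22.25, "lluc": 23.58,  # Valencia
}

total_min = sum(durations_min.values())
total_hours = total_min / 60

print(f"{total_min:.2f} min = {total_hours:.2f} h")
```

The sum comes to roughly 225 minutes, which is in line with the stated total of about 3.75 hours.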

Uses

The purpose of this dataset is mainly for training text-to-speech and automatic speech recognition models in Catalan accents.

Languages

The dataset is in Catalan (ca-ES).

Dataset Structure

The dataset consists of 2858 audios and transcriptions in the following structure:

lafresca_multiaccent_raw
├── balear
│   ├── olga
│   ├── olga.txt
│   ├── quim
│   └── quim.txt
├── central
│   ├── elia
│   ├── elia.txt
│   ├── grau
│   └── grau.txt
├── full_filelist.txt
├── occidental
│   ├── emma
│   ├── emma.txt
│   ├── pere
│   └── pere.txt
└── valencia
    ├── gina
    ├── gina.txt
    ├── lluc
    └── lluc.txt

Metadata for the dataset can be found in the file `full_filelist.txt`. Each line represents one audio file and follows the format:

audio_path | speaker_id | transcription

The speaker ids have the following mapping:

"quim": 0,
"olga": 1,
"grau": 2,
"elia": 3,
"pere": 4,
"emma": 5,
"lluc": 6,
"gina": 7
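
The line format and speaker mapping above could be read with a small parser along these lines. This is a minimal sketch under several assumptions: that the `speaker_id` field holds the numeric id from the mapping, that fields are separated by `|` with optional surrounding spaces, and that the sample line (path and sentence) is invented for illustration. The names `Utterance` and `parse_line` are ours, not part of the dataset.

```python
from typing import NamedTuple

# Speaker-name-to-id mapping as given in the dataset card.
SPEAKER_IDS = {"quim": 0, "olga": 1, "grau": 2, "elia": 3,
               "pere": 4, "emma": 5, "lluc": 6, "gina": 7}
ID_TO_SPEAKER = {v: k for k, v in SPEAKER_IDS.items()}


class Utterance(NamedTuple):
    audio_path: str
    speaker: str
    transcription: str


def parse_line(line: str) -> Utterance:
    """Parse one 'audio_path | speaker_id | transcription' line."""
    # Split on the first two "|" only, so a "|" inside the
    # transcription is preserved.
    audio_path, speaker_id, transcription = (
        part.strip() for part in line.split("|", 2)
    )
    # Assumption: speaker_id is the numeric id from the mapping.
    return Utterance(audio_path, ID_TO_SPEAKER[int(speaker_id)], transcription)


# Hypothetical example line; real paths and sentences may differ.
sample = "central/grau/0001.wav | 2 | Bon dia a tothom."
utt = parse_line(sample)
print(utt.speaker, "->", utt.transcription)
```

Limiting the split to two separators is a defensive choice; if the real file never contains `|` inside a transcription, a plain split would work equally well.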

Dataset Creation

This dataset was created by members of the Language Technologies unit from the Life Sciences department of the Barcelona Supercomputing Center, except the Valencian sentences, which were created with the support of Cenid, the Digital Intelligence Center of the University of Alicante. The voices belong to professional voice actors and were recorded at Lafresca Creative Studio.

Source Data

The data presented in this dataset is the source data.

Data Collection and Processing

These are the technical details of the data collection and processing:

Microphone: Austrian Audio oc818
Preamp: Focusrite ISA Two
Audio Interface: Antelope Orion 32+
DAW: ProTools 2023.6.0

Processing:

Noise Gate: C1 Gate
Compression: BF-76
De-Esser: Renaissance
EQ: Maag EQ2
EQ: FabFilter Pro-Q3
Limiter: L1 Ultramaximizer

Here's the information about the speakers:

Dialect	Gender	County
Central	male	Barcelonès
Central	female	Barcelonès
Balear	female	Pla de Mallorca
Balear	male	Llevant
Occidental	male	Baix Ebre
Occidental	female	Baix Ebre
Valencian	female	Ribera Alta
Valencian	male	La Plana Baixa

Who are the source data producers?

The Language Technologies team from the Life Sciences department at the Barcelona Supercomputing Center developed this dataset. It features recordings by professional voice actors made at Lafresca Creative Studio.

Annotations

To check for errors in the transcriptions of the audio files, we created a Label Studio space. In that space, we manually listened to a subset of the dataset and compared what we heard with the transcription. If the transcription was mistaken, we corrected it.

Personal and Sensitive Information

The dataset consists of professional voice actors who have recorded their voice. You agree to not attempt to determine the identity of speakers in this dataset.

Bias, Risks, and Limitations

Training a Text-to-Speech (TTS) model by fine-tuning with a Catalan speaker who speaks a particular dialect presents significant limitations. Mostly, the challenge is in capturing the full range of variability inherent in that accent. Each dialect has its own unique phonetic, intonational, and prosodic characteristics that can vary greatly even within a single linguistic region. Consequently, a TTS model trained on a narrow dialect sample will struggle to generalize across different accents and sub-dialects, leading to reduced accuracy and naturalness. Additionally, achieving a standard representation is exceedingly difficult because linguistic features can differ markedly not only between dialects but also among individual speakers within the same dialect group. These variations encompass subtle nuances in pronunciation, rhythm, and speech patterns that are challenging to standardize in a model trained on a limited dataset.

Funding

This work has been promoted and financed by the Generalitat de Catalunya through the Aina project. In addition, the Valencian sentences were created within the framework of the NEL-VIVES project 2022/TL22/00215334.

Dataset Card Contact

langtech@bsc.es

ACS Population Characteristics: Spoken Languages
dcra-cdo-dcced.opendata.arcgis.com
gis.data.alaska.gov
+2more
Updated Sep 4, 2019
Cite
Dept. of Commerce, Community, & Economic Development (2019). ACS Population Characteristics: Spoken Languages [Dataset]. https://dcra-cdo-dcced.opendata.arcgis.com/datasets/acs-population-characteristics-spoken-languages
Explore at:
Dataset updated
Sep 4, 2019
Dataset authored and provided by
Dept. of Commerce, Community, & Economic Development
Area covered

Description
Counts and breakdown of languages used, with margins of error, for Alaskan Communities/Places and aggregation at Borough/CDA and State level for recent 5-year American Community Survey (ACS) intervals. The 5-year interval data sets are published approximately 1/2 a period later than the End Year listed - for instance the interval ending in 2019 is published in mid-2021.

Source: US Census Bureau, American Community Survey

This data has been visualized in a Geographic Information Systems (GIS) format and is provided as a service in the DCRA Information Portal by the Alaska Department of Commerce, Community, and Economic Development Division of Community and Regional Affairs (SOA DCCED DCRA), Research and Analysis section. SOA DCCED DCRA Research and Analysis is not the authoritative source for this data. For more information and for questions about this data, see: US Census - Language Use

USE CONSTRAINTS: The Alaska Department of Commerce, Community, and Economic Development (DCCED) provides the data in this application as a service to the public. DCCED makes no warranty, representation, or guarantee as to the content, accuracy, timeliness, or completeness of any of the data provided on this site. DCCED shall not be liable to the user for damages of any kind arising out of the use of data or information provided. DCCED is not the authoritative source for American Community Survey data, and any data or information provided by DCCED is provided "as is". Data or information provided by DCCED shall be used and relied upon only at the user's sole risk. For information about the American Community Survey, click here.
SLURP Dataset
paperswithcode.com
Updated Apr 12, 2023
Cite
Emanuele Bastianelli; Andrea Vanzo; Pawel Swietojanski; Verena Rieser (2023). SLURP Dataset [Dataset]. https://paperswithcode.com/dataset/slurp
Explore at:
Dataset updated
Apr 12, 2023
Authors
Emanuele Bastianelli; Andrea Vanzo; Pawel Swietojanski; Verena Rieser
Description
A new challenging dataset in English spanning 18 domains, which is substantially bigger and linguistically more diverse than existing datasets.
Top Languages Spoken in London Boroughs and MSOAs
data.europa.eu
unknown
Updated Jul 19, 2021
Cite
census2011@london.gov.uk (2021). Top Languages Spoken in London Boroughs and MSOAs [Dataset]. https://data.europa.eu/data/datasets/top-languages-spoken-in-london-boroughs-and-msoas?locale=ga
Explore at:
Available download formats: unknown
Dataset updated
Jul 19, 2021
Dataset authored and provided by
census2011@london.gov.uk
Area covered
London
Description
This dataset shows the most spoken languages by borough and MSOAs in London. It provides numbers of the population aged 3+ who speak specified languages as their main language.

Main language is from 2011 Census (detailed) - Census table QS204EW.

This data is presented alongside Annual Population Survey (APS) data showing the top nationalities of residents in January - December 2019 by borough. The top 3 non-British nationalities are at the far right of the table. This is to highlight areas which may now have other common non-British languages spoken compared to 2011 (the year in which the Census information was gathered). The top non-British nationalities in 2019, which did not feature in 2011 as one of the most spoken non-British languages, are highlighted in column AD.

The APS has a sample of around 320,000 people in the UK (around 28,000 in London). As such all figures must be treated with some caution. Estimates for non-British nationalities at borough level that are below 10,000 are considered too small to be reliable and should be treated with additional caution.

MSOA codes have now been linked to House of Commons MSOA names
Data from: Language structure is influenced by the number of speakers but...
datadryad.org
data.niaid.nih.gov
zip
Updated Jan 31, 2019
Cite
Alexander Koplenig (2019). Language structure is influenced by the number of speakers but seemingly not by the proportion of non-native speakers [Dataset]. http://doi.org/10.5061/dryad.g0m3b82
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.g0m3b82
Dataset updated
Jan 31, 2019
Dataset provided by
Dryad
Authors
Alexander Koplenig
Time period covered
2019
Description
Data table to reproduce all results
data_table.xlsx
Stata_files
Stata 14 code to reproduce all results
Languages spoken by tract, ACS
hub.arcgis.com
massachsuetts-environmental-justice-datasets-mass-eoeea.hub.arcgis.com
Updated May 19, 2021
Cite
MA Executive Office of Energy and Environmental Affairs (2021). Languages spoken by tract, ACS [Dataset]. https://hub.arcgis.com/datasets/Mass-EOEEA::languages-spoken-by-tract-acs-/about
Explore at:
Dataset updated
May 19, 2021
Dataset provided by
Massachusetts Executive Office of Energy and Environmental Affairs
Authors
MA Executive Office of Energy and Environmental Affairs
Area covered

Description
The American Community Survey, Table B16001, provided detailed individual-level language estimates at the tract level for 42 non-English language categories, tabulated by English-speaking ability. Two sets of language data are included here, with population counts and percentages for both:
- the tract population speaking languages other than English, regardless of English-speaking ability, identified by the language name, and
- the languages spoken other than English by the tract population who does not speak English 'very well', identified by the language name followed by "_Enw".

The default pop-up for this service presents the second of these data: languages spoken other than English by the tract population who does not speak English 'very well'.

In part because of privacy concerns with the very small counts in some categories in Table B16001, the Census changed the American Community Survey estimates of the languages spoken by individuals. In 2016, the number of categories previously presented in Table B16001 was reduced to reflect the most commonly spoken languages, and several languages spoken in Massachusetts were grouped into generalized (i.e., "Other...") categories. Table B16001 has been renamed Table C16001 with these generalized categories. Therefore, the information presented in this datalayer is not current, and these data cannot be updated.
Proportion of Population by Language Spoken Most Often at Home, Alberta...
ouvert.canada.ca
data.urbandatacentre.ca
+3more
csv, html, pdf
Updated Jul 24, 2024
Cite
Government of Alberta (2024). Proportion of Population by Language Spoken Most Often at Home, Alberta Economic Regions [Dataset]. https://ouvert.canada.ca/data/dataset/8d334793-ff24-42bc-8692-0bb86b5211d2
Explore at:
Available download formats: csv, html, pdf
Dataset updated
Jul 24, 2024
Dataset provided by
Government of Alberta
License
Open Government Licence - Canada 2.0 (https://open.canada.ca/en/open-government-licence-canada)
License information was derived automatically
Time period covered
Jun 10, 2006 - Jun 10, 2011
Area covered
Alberta
Description
This Alberta Official Statistic describes the proportion of the population by language spoken most often at home in each economic region, as reported in the 2011 population census. Alberta is divided into eight economic regions as follows: Lethbridge – Medicine Hat; Camrose – Drumheller; Calgary; Banff – Jasper – Rocky Mountain House; Red Deer; Edmonton; Athabasca – Grande Prairie – Peace River; and Wood Buffalo – Cold Lake.
Primary language spoken by the Medicaid and CHIP population
catalog.data.gov
data.virginia.gov
+2 more
Updated Feb 3, 2025
+ more versions
Cite
Centers for Medicare & Medicaid Services (2025). Primary language spoken by the Medicaid and CHIP population [Dataset]. https://catalog.data.gov/dataset/primary-language-spoken-by-the-medicaid-and-chip-population
Explore at:
Dataset updated
Feb 3, 2025
Dataset provided by
Centers for Medicare & Medicaid Services
Description
This data set includes annual counts and percentages of Medicaid and Children’s Health Insurance Program (CHIP) enrollees by primary language spoken (English, Spanish, and all other languages). Results are shown overall; by state; and by five subpopulation topics: race and ethnicity, age group, scope of Medicaid and CHIP benefits, urban or rural residence, and eligibility category. These results were generated using Transformed Medicaid Statistical Information System (T-MSIS) Analytic Files (TAF) Release 1 data and the Race/Ethnicity Imputation Companion File. This data set includes Medicaid and CHIP enrollees in all 50 states, the District of Columbia, Puerto Rico, and the U.S. Virgin Islands who were enrolled for at least one day in the calendar year, except where otherwise noted. Enrollees in Guam, American Samoa, the Northern Mariana Islands, and select states with data quality issues with the primary language variable in TAF are not included. Results shown for the race and ethnicity subpopulation topic exclude enrollees in the U.S. Virgin Islands. Results shown overall (where subpopulation topic is "Total enrollees") exclude enrollees younger than age 5 and enrollees in the U.S. Virgin Islands. Results for states with TAF data quality issues in the year have a value of "Unusable data." Some rows in the data set have a value of "DS," which indicates that data were suppressed according to the Centers for Medicare & Medicaid Services’ Cell Suppression Policy for values between 1 and 10. This data set is based on the brief: "Primary language spoken by the Medicaid and CHIP population in 2020." Enrollees are assigned to a primary language category based on their reported ISO language code in TAF (English/missing, Spanish, and all other language codes) (Primary Language). 
Enrollees are assigned to a race and ethnicity subpopulation using the state-reported race and ethnicity information in TAF when it is available and of good quality; if it is missing or unreliable, race and ethnicity is indirectly estimated using an enhanced version of Bayesian Improved Surname Geocoding (BISG) (Race and ethnicity of the national Medicaid and CHIP population in 2020). Enrollees are assigned to an age group subpopulation using age as of December 31st of the calendar year. Enrollees are assigned to the comprehensive benefits or limited benefits subpopulation according to the criteria in the "Identifying Beneficiaries with Full-Scope, Comprehensive, and Limited Benefits in the TAF" DQ Atlas brief. Enrollees are assigned to an urban or rural subpopulation based on the 2010 Rural-Urban Commuting Area (RUCA) code associated with their home or mailing address ZIP code in TAF (Rural Medicaid and CHIP enrollees in 2020). Enrollees are assigned to an eligibility category subpopulation using their latest reported eligibility group code, CHIP code, and age in the calendar year. Please refer to the full brief for additional context about the methodology and detailed findings. Future updates to this data set will include more recent data years as the TAF data become available.
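The description above says some cells hold "DS" (suppressed) or "Unusable data" instead of a number. A minimal sketch of how such cells might be handled when loading the counts follows; the function name `parse_count` is hypothetical, and the handling of thousands separators is an assumption about the file's formatting, not something the description states.

```python
SUPPRESSED = "DS"            # marker for suppressed cells, per the description
UNUSABLE = "Unusable data"   # marker for states with TAF data quality issues


def parse_count(raw):
    """Return an int for a usable count, or None for suppressed/unusable cells."""
    value = raw.strip()
    if value in (SUPPRESSED, UNUSABLE):
        return None
    # Assumption: counts may carry thousands separators such as "1,234".
    return int(value.replace(",", ""))
```

Returning `None` rather than 0 keeps suppressed cells from being mistaken for true zero counts in later sums or averages.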
Tamil (Tamizh) Wikipedia Text Dataset for NLP
kaggle.com
Updated Nov 12, 2024
Cite
Younus_Mohamed (2024). Tamil (Tamizh) Wikipedia Text Dataset for NLP [Dataset]. http://doi.org/10.34740/kaggle/dsv/9884525
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Unique identifier
https://doi.org/10.34740/kaggle/dsv/9884525
Dataset updated
Nov 12, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Younus_Mohamed
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) (https://creativecommons.org/licenses/by-sa/3.0/)
License information was derived automatically
Description
This dataset is part of a larger mission to transform Tamil into a high-resource language in the field of Natural Language Processing (NLP). As one of the oldest and most culturally rich languages, Tamil has a unique linguistic structure, yet it remains underrepresented in the NLP landscape. This dataset, extracted from Tamil Wikipedia, serves as a foundational resource to support Tamil language processing, text mining, and machine learning applications.

What’s Included

- Text Data: This dataset contains over 569,000 articles in raw text form, extracted from Tamil Wikipedia. The collection is ideal for language model training, word frequency analysis, and text mining.

- Scripts and Processing Tools: Code snippets are provided for processing .bz2 compressed files, generating word counts, and handling data for NLP applications.
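The dataset's own scripts are not reproduced in this listing. As a rough sketch of what processing a .bz2 file and generating word counts could look like with the Python standard library, assuming the compressed file holds UTF-8 plain text and that splitting on whitespace is an acceptable tokenization (both are assumptions, and `count_words` is a hypothetical name):

```python
import bz2
from collections import Counter


def count_words(path, encoding="utf-8"):
    """Count whitespace-separated tokens in a .bz2-compressed text file."""
    counts = Counter()
    # Text mode ("rt") decompresses on the fly, so the file is read line by line
    # instead of being loaded into memory at once.
    with bz2.open(path, mode="rt", encoding=encoding) as handle:
        for line in handle:
            counts.update(line.split())
    return counts
```

Calling `count_words(path).most_common(20)` would then list the twenty most frequent tokens. Whitespace splitting leaves punctuation attached to words, so a real pipeline would likely need a Tamil-aware tokenizer.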

Why This Dataset?

Although Tamil has a documented lexicon of over 100,000 words, only a fraction of these are actively used in everyday communication. The largest available Tamil treebank currently holds only 10,000 words, limiting the scope for training accurate language models. This dataset aims to bridge that gap by providing a robust, open-source corpus for researchers, developers, and linguists who want to work on Tamil language technologies.

How You Can Use This Dataset

- Language Modeling: Train or fine-tune models like BERT, GPT, or LSTM-based language models for Tamil.

- Linguistic Research: Analyze Tamil morphology, syntax, and vocabulary usage.

- Data Augmentation: Use the raw text to generate augmented data for multilingual NLP applications.

- Word Embeddings and Semantic Analysis: Create embeddings for Tamil words, useful in multilingual setups or standalone applications.

Let’s Collaborate!

I believe that advancing Tamil in NLP cannot be a solo effort. Contributions in the form of additional data, annotations, or even new tools for Tamil language processing are welcome! By working together, we can make Tamil a truly high-resource language in NLP.

License

This dataset is based on content from Tamil Wikipedia and is shared under the Creative Commons Attribution-ShareAlike 3.0 Unported License (CC BY-SA 3.0). Proper attribution to Wikipedia is required when using this data.
Population according to language, age and sex 1990-2017 - Datasets - This...
store.smartdatahub.io
Updated Feb 12, 2019
+ more versions
Cite
(2019). Population according to language, age and sex 1990-2017 - Datasets - This service has been deprecated - please visit https://www.smartdatahub.io/ to access data. See the About page for details. // [Dataset]. https://store.smartdatahub.io/dataset/fi_statistics_finland_population_according_to_language_age_and_sex_1990_2017
Explore at:
Dataset updated
Feb 12, 2019
Description
Population according to language, age and sex 1990-2017
Census 21 - Main Language MSOA
data.leicester.gov.uk
csv, excel, geojson +1
Updated Aug 22, 2023
+ more versions
Cite
(2023). Census 21 - Main Language MSOA [Dataset]. https://data.leicester.gov.uk/explore/dataset/census-21-main-language-msoa/
Explore at:
Available download formats: json, geojson, excel, csv
Dataset updated
Aug 22, 2023
License
Open Government Licence 3.0 (http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/)
License information was derived automatically
Description
The census is undertaken by the Office for National Statistics every 10 years and gives us a picture of all the people and households in England and Wales. The most recent census took place in March of 2021.

The census asks every household questions about the people who live there and the type of home they live in. In doing so, it helps to build a detailed snapshot of society. Information from the census helps the government and local authorities to plan and fund local services, such as education, doctors' surgeries and roads.

Key census statistics for Leicester are published on the open data platform to make information accessible to local services, voluntary and community groups, and residents. A dashboard is also published showcasing various datasets from the census, allowing users to view data for the MSOAs of Leicester and compare this with Leicester overall statistics.

Further information about the census and full datasets can be found on the ONS website - https://www.ons.gov.uk/census/aboutcensus/censusproducts

Main language

This dataset provides Census 2021 estimates that classify usual residents in England and Wales by their main language. The estimates are as at Census Day, 21 March 2021.

Main language is a person's first or preferred language. They may speak other languages as well. A main language is provided only for residents aged 3 and above. Residents aged below 3 years will appear as ‘Does not apply’. Please note that some organisations exclude those below 3 years when calculating percentages for this variable.

This dataset contains information for the MSOAs of Leicester City.
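Because residents under 3 are recorded as ‘Does not apply’, a main-language percentage depends on whether they stay in the denominator. The sketch below shows both conventions; the function name, parameter names, and the example figures are invented for illustration and do not come from the Leicester data.

```python
def main_language_share(speakers, all_residents, does_not_apply, exclude_under_3=True):
    """Percentage of residents whose main language is the one counted in `speakers`."""
    # Excluding 'Does not apply' limits the base to residents aged 3 and above.
    denominator = all_residents - does_not_apply if exclude_under_3 else all_residents
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return 100.0 * speakers / denominator
```

With 300 speakers among 1,000 residents, 40 of whom are under 3, the share is 31.25 percent when under-3s are excluded and 30.0 percent when they are included, so comparisons across publishers should first check which convention each one uses.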

Cite

Statista (2025). The most spoken languages worldwide 2025 [Dataset]. https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/

The most spoken languages worldwide 2025

Explore at:

435 scholarly articles cite this dataset (View in Google Scholar)

Dataset updated

Apr 14, 2025

Dataset authored and provided by

Statista (http://statista.com/)

Time period covered

2025

Area covered

World

Description

In 2025, around 1.53 billion people worldwide spoke English either natively or as a second language, slightly more than the 1.18 billion Mandarin Chinese speakers at the time of survey. Hindi and Spanish were the third and fourth most widespread languages that year.

Languages in the United States

The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken in the United States vary as a result of the multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which more than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese), 1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year.

Different languages at home

The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage of population speaking a language other than English is California. About 45 percent of its population spoke a language other than English at home in 2023.
