In 2025, around 1.53 billion people worldwide spoke English either natively or as a second language, slightly more than the 1.18 billion Mandarin Chinese speakers at the time of the survey. Hindi and Spanish were the third and fourth most widely spoken languages that year.
Languages in the United States
The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken there vary as a result of the multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which more than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese), 1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year.
Different languages at home
The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage is California, where about 45 percent of the population spoke a language other than English at home in 2023.
Languages are an important part of daily life in the USA. Here is a table that shows the most common languages spoken in the USA, as well as a big spreadsheet which shows each CBSA (Core-Based Statistical Area, or urban area).
Language usage varies widely throughout the United States. According to the latest census data, over 350 different languages are represented in homes across the country. The following table and spreadsheet provide more detailed information on language usage throughout the various states and cities in the US:
Columns:
- index: Index column for dataframe
- Table with column headers in row 5 and row headers in column A: Contains language data for each CBSA (Core Based Statistical Area)
- Unnamed: 1: Rank of CBSA by total number of speakers of all languages
- Unnamed: 2: Name of CBSA
- Unnamed: 3: Population of CBSA
- Unnamed: 4: Percent of population that speaks English very well
- Unnamed: 5 through Unnamed: 58: Languages spoken by at least 0.1% of the population, with corresponding percentages
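Because the column description says the real headers sit in row 5 (hence the generic "Unnamed: n" names), one way to read such a file is to skip the preamble rows and take the header from row 5. The following is a minimal sketch using only the standard library; the sample text and its header names are invented for illustration and do not reproduce the actual CSV.

```python
import csv
import io

# Hypothetical miniature of the layout: four preamble rows, then the real
# column headers in row 5. All names and values below are made up.
SAMPLE = """Languages Spoken at Home by Urban Area,,,
,,,
,,,
,,,
Rank,CBSA,Population,Pct English very well
1,Example Metro A,1000000,80.5
2,Example Metro B,500000,91.2
"""

def read_table(text, header_row=5):
    """Return a list of dicts keyed by the headers on `header_row` (1-indexed)."""
    rows = list(csv.reader(io.StringIO(text)))
    headers = rows[header_row - 1]
    # Keep only non-empty rows that follow the header row.
    return [dict(zip(headers, row)) for row in rows[header_row:] if any(row)]

records = read_table(SAMPLE)
print(records[0]["CBSA"])
```

All values come back as strings, so numeric columns such as population would still need converting before any arithmetic.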
This dataset was created by Gary Hoover. The data was sourced from https://www.kaggle.com/garyhoov/us-languages
Unknown License - Please check the dataset description for more information.
File: Languages Spoken at Home by Urban Area = CBSA.csv
File: US Languages Spoken at Home 2014.csv
| Column name | Description |
|:-------------------------------------------------------------------|:--------------|
| Table with column headers in row 5 and row headers in column A | |
Many residents of New York City speak more than one language; a number of them speak and understand non-English languages more fluently than English. This dataset, derived from the Census Bureau's American Community Survey (ACS), includes information on over 1.7 million limited English proficient (LEP) residents and a subset of that population called limited English proficient citizens of voting age (CVALEP) at the Community District level. There are 59 community districts throughout NYC, with each district being represented by a Community Board.
Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Data showing the main languages spoken in Plymouth by population numbers.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This table provides data on languages spoken by the population of the Caribbean Netherlands aged 15 years and older in private households. Breakdowns by sex, age and level of education are presented. These aspects are shown for the Caribbean Netherlands and also for the islands Bonaire, St Eustatius and Saba separately. The research is a sample survey. This means that the figures shown are estimates for which reliability margins apply. These margins are also included in the table. The Omnibus survey was carried out for the first time on Bonaire, Saba and St. Eustatius in 2013 during the month of June and the first week of July. For the second time the Omnibus survey was carried out on Bonaire during the months of October and November 2017, and on Saba and St. Eustatius in the period January to March 2018.
Data available from: 2013
Status of the figures: The figures in this table are final.
Changes as of 4 April 2019: None; this is a new table.
When will new figures be published? New data will be published every four years.
Language spoken at home and the ability to speak English for the population age 5 and over, as reported in table C16001 of the US Census Bureau's American Community Survey (ACS) 5-year estimates.
Table from the American Community Survey (ACS) C16001 of language spoken at home for the population 5 years and over. These are multiple, nonoverlapping vintages of the 5-year ACS estimates of population and housing attributes starting in 2010, shown by the corresponding census tract vintage. The most recent annual release is also included.
King County, Washington census tracts with nonoverlapping vintages of the 5-year American Community Survey (ACS) estimates starting in 2010. The vintage is identified in the "ACS Vintage" field.
The census tract boundaries match the vintage of the ACS data (currently 2010 and 2020), so please note the geographic changes between the decades. Tracts have been coded as being within the City of Seattle as well as assigned to neighborhood groups called "Community Reporting Areas". These areas were created after the 2000 census to provide geographically consistent neighborhoods through time for reporting U.S. Census Bureau data. This is not an attempt to identify neighborhood boundaries as defined by neighborhoods themselves.
Vintages: 2010, 2015, 2020, 2021, 2022, 2023
ACS Table(s): C16001
Data downloaded from: https://data.census.gov/
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
UGSpeechData is a collection of audio speech data in Akan, Ewe, Dagaare, Dagbani, and Ikposo. These languages are among the most widely spoken in Ghana. The uploaded dataset contains a total of 970148 audio files (5384.28 hours) and 93262 transcribed audio files (518 hours). The audio files are descriptions of 1000 culturally relevant images, collected from indigenous speakers of each of the languages. Each audio file is between 15 and 30 seconds long. More specifically, the dataset contains five subfolders, one for each of the five languages. Each language has at least 1000 hours of speech data and 100 hours of transcribed speech data. Fig. 1 provides details of the transcribed audio corpus, including gender and recording environments for each language.
Fig. 1. Details of transcribed audio files
http://www.gnu.org/licenses/lgpl-3.0.html
Our groundbreaking translation dataset represents a monumental advancement in the field of natural language processing and machine translation. Comprising a staggering 785 million records, this corpus bridges language barriers by offering translations from English to an astonishing 548 languages. The dataset promises to be a cornerstone resource for researchers, engineers, and developers seeking to enhance their machine translation models, cross-lingual analysis, and linguistic investigations.
Size of the dataset: 41 GB uncompressed, 20 GB compressed
Key Features:
Scope and Scale: With a comprehensive collection of 785 million records, this dataset provides an unparalleled wealth of translated text. Each record consists of an English sentence paired with its translation in one of the 548 target languages, enabling multi-directional translation applications.
Language Diversity: Encompassing translations into 548 languages, this dataset represents a diverse array of linguistic families, dialects, and scripts. From widely spoken languages to those with limited digital representation, the dataset bridges communication gaps on a global scale.
Quality and Authenticity: The translations have been meticulously curated, verified, and cross-referenced to ensure high quality and authenticity. This attention to detail guarantees that the dataset is not only extensive but also reliable, serving as a solid foundation for machine learning applications. The data was collected from various open datasets for my personal ML projects, and I am sharing it with the team.
Use Case Versatility: Researchers and practitioners across a spectrum of domains can harness this dataset for a myriad of applications. It facilitates the training and evaluation of machine translation models, empowers cross-lingual sentiment analysis, aids in linguistic typology studies, and supports cultural and sociolinguistic investigations.
Machine Learning Advancement: Machine translation models, especially neural machine translation (NMT) systems, can leverage this dataset to enhance their training. The large-scale nature of the dataset allows for more robust and contextually accurate translation outputs.
Fine-tuning and Customization: Developers can fine-tune translation models using specific language pairs, offering a powerful tool for specialized translation tasks. This customization capability ensures that the dataset is adaptable to various industries and use cases.
Data Format: The dataset is provided in a structured JSON format, facilitating easy integration into existing machine learning pipelines. This structured approach expedites research and experimentation. Each JSON record contains an English word and its equivalent in the target language. The data was exported from a MongoDB database to ensure that every record is unique; the records are also sorted.
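To illustrate how records of this kind might be narrowed to specific language pairs before fine-tuning, here is a minimal sketch. The description does not document the schema, so the field names `english`, `language` and `translation`, the one-record-per-line layout, and the sample records are all assumptions made for illustration.

```python
import json
from collections import defaultdict

# Hypothetical records: the field names "english", "language" and
# "translation" are assumptions, not the dataset's documented schema.
LINES = [
    '{"english": "water", "language": "es", "translation": "agua"}',
    '{"english": "water", "language": "fr", "translation": "eau"}',
    '{"english": "house", "language": "es", "translation": "casa"}',
]

def group_by_language(lines):
    """Map each target language code to a list of (english, translation) pairs."""
    pairs = defaultdict(list)
    for line in lines:
        record = json.loads(line)
        pairs[record["language"]].append((record["english"], record["translation"]))
    return dict(pairs)

pairs = group_by_language(LINES)
print(pairs["es"])
```

For a corpus of this size, the same loop would be run over a file handle line by line rather than over an in-memory list.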
Access: The dataset is available for academic and research purposes, enabling the global AI community to contribute to and benefit from its usage. A well-documented API and sample code are provided to expedite exploration and integration.
The English-to-548-languages translation dataset represents an incredible leap forward in advancing multilingual communication, breaking down barriers to understanding, and fostering collaboration on a global scale. It holds the potential to reshape how we approach cross-lingual communication, linguistic studies, and the development of cutting-edge translation technologies.
Dataset Composition: The dataset is a culmination of translations from English, a widely spoken and understood language, into 548 distinct languages. Each language represents a unique linguistic and cultural background, providing a rich array of translation contexts. This diverse range of languages spans across various language families, regions, and linguistic complexities, making the dataset a comprehensive repository for linguistic research.
Data Volume and Scale: With a staggering 785 million records, the dataset boasts an immense scale that captures a vast array of translations and linguistic nuances. Each translation entry consists of an English source text paired with its corresponding translation in one of the 548 target languages. This vast corpus allows researchers and practitioners to explore patterns, trends, and variations across languages, enabling the development of robust and adaptable translation models.
Linguistic Coverage: The dataset covers an extensive set of languages, including but not limited to Indo-European, Afroasiatic, Sino-Tibetan, Austronesian, Niger-Congo, and many more. This broad linguistic coverage ensures that languages with varying levels of grammatical complexity, vocabulary richness, and syntactic structures are included, enhancing the applicability of translation models across diverse linguistic landscapes.
Dataset Preparation: The translation ...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present LaFresCat, the first Catalan multiaccented and multispeaker dataset.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Commercial use is only possible through licensing by the voice artists. For further information, contact langtech@bsc.es and lafrescaproduccions@gmail.com.
The audio in this dataset was recorded in professional studio sessions by professional voice actors at Lafresca Creative Studio. This is the raw version of the dataset: no resampling or trimming has been applied to the audio files, which are stored in WAV format at a 48 kHz sampling rate.
In total, there are 4 different accents, with 2 speakers per accent (one female and one male). After trimming, the audio totals 3.75 hours, divided across the accents (by speaker ID) as follows:
Balear
Central
Occidental (North-Western)
Valencia
The purpose of this dataset is mainly for training text-to-speech and automatic speech recognition models in Catalan accents.
The dataset is in Catalan (ca-ES).
The dataset consists of 2858 audios and transcriptions in the following structure:
lafresca_multiaccent_raw
├── balear
│   ├── olga
│   ├── olga.txt
│   ├── quim
│   └── quim.txt
├── central
│   ├── elia
│   ├── elia.txt
│   ├── grau
│   └── grau.txt
├── full_filelist.txt
├── occidental
│   ├── emma
│   ├── emma.txt
│   ├── pere
│   └── pere.txt
└── valencia
    ├── gina
    ├── gina.txt
    ├── lluc
    └── lluc.txt
Metadata for the dataset can be found in the file `full_filelist.txt`. Each line represents one audio file and follows the format:
audio_path | speaker_id | transcription
The speaker ids have the following mapping:
"quim": 0,
"olga": 1,
"grau": 2,
"elia": 3,
"pere": 4,
"emma": 5,
"lluc": 6,
"gina": 7
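As a rough sketch of how a line of `full_filelist.txt` could be parsed and joined with the mapping above: the code assumes that the `speaker_id` field holds the numeric ID and that fields are separated by `|` with optional surrounding spaces. The example path and sentence are invented for illustration.

```python
# Speaker-ID mapping as listed in the dataset description.
SPEAKER_IDS = {"quim": 0, "olga": 1, "grau": 2, "elia": 3,
               "pere": 4, "emma": 5, "lluc": 6, "gina": 7}
ID_TO_SPEAKER = {v: k for k, v in SPEAKER_IDS.items()}

def parse_filelist_line(line):
    """Split one 'audio_path | speaker_id | transcription' line into its fields."""
    # maxsplit=2 keeps any '|' inside the transcription intact.
    audio_path, speaker_id, transcription = (
        part.strip() for part in line.split("|", 2)
    )
    speaker_id = int(speaker_id)  # assumption: the field is the numeric ID
    return {
        "audio_path": audio_path,
        "speaker_id": speaker_id,
        "speaker": ID_TO_SPEAKER[speaker_id],
        "transcription": transcription,
    }

# Hypothetical line: the path and sentence are invented for illustration.
example = parse_filelist_line("balear/olga/0001.wav | 1 | Bon dia a tothom.")
print(example["speaker"])
```

Splitting at most twice means a transcription that happens to contain a `|` character is not truncated.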
This dataset was created by members of the Language Technologies unit of the Life Sciences department at the Barcelona Supercomputing Center, except for the Valencian sentences, which were created with the support of Cenid, the Digital Intelligence Center of the University of Alicante. The voices belong to professional voice actors and were recorded at Lafresca Creative Studio.
The data presented in this dataset is the source data.
These are the technical details of the data collection and processing:
Microphone: Austrian Audio oc818
Preamp: Focusrite ISA Two
Audio Interface: Antelope Orion 32+
DAW: ProTools 2023.6.0
Processing:
Noise Gate: C1 Gate
Compression: BF-76
De-Esser: Renaissance
EQ: Maag EQ2
EQ: FabFilter Pro-Q3
Limiter: L1 Ultramaximizer
Here's the information about the speakers:
Dialect | Gender | County |
---|---|---|
Central | male | Barcelonès |
Central | female | Barcelonès |
Balear | female | Pla de Mallorca |
Balear | male | Llevant |
Occidental | male | Baix Ebre |
Occidental | female | Baix Ebre |
Valencian | female | Ribera Alta |
Valencian | male | La Plana Baixa |
The Language Technologies team from the Life Sciences department at the Barcelona Supercomputing Center developed this dataset. It features recordings by professional voice actors made at Lafresca Creative Studio.
To check for errors in the transcriptions, we created a Label Studio space. In that space, we manually listened to a subset of the dataset and compared what we heard with the transcription. If the transcription was wrong, we corrected it.
The dataset consists of voice recordings made by professional voice actors. You agree not to attempt to determine the identity of the speakers in this dataset.
Training a Text-to-Speech (TTS) model by fine-tuning with a Catalan speaker who speaks a particular dialect presents significant limitations. Mostly, the challenge is in capturing the full range of variability inherent in that accent. Each dialect has its own unique phonetic, intonational, and prosodic characteristics that can vary greatly even within a single linguistic region. Consequently, a TTS model trained on a narrow dialect sample will struggle to generalize across different accents and sub-dialects, leading to reduced accuracy and naturalness. Additionally, achieving a standard representation is exceedingly difficult because linguistic features can differ markedly not only between dialects but also among individual speakers within the same dialect group. These variations encompass subtle nuances in pronunciation, rhythm, and speech patterns that are challenging to standardize in a model trained on a limited dataset.
This work has been promoted and financed by the Generalitat de Catalunya through the Aina project. In addition, the Valencian sentences were created within the framework of the NEL-VIVES project 2022/TL22/00215334.
Counts and breakdown of languages used, with margins of error, for Alaskan Communities/Places, with aggregation at the Borough/CDA and State level for recent 5-year American Community Survey (ACS) intervals. The 5-year interval data sets are published approximately half a period later than the End Year listed; for instance, the interval ending in 2019 is published in mid-2021.
Source: US Census Bureau, American Community Survey
This data has been visualized in a Geographic Information Systems (GIS) format and is provided as a service in the DCRA Information Portal by the Alaska Department of Commerce, Community, and Economic Development Division of Community and Regional Affairs (SOA DCCED DCRA), Research and Analysis section. SOA DCCED DCRA Research and Analysis is not the authoritative source for this data. For more information and for questions about this data, see: US Census - Language Use
USE CONSTRAINTS: The Alaska Department of Commerce, Community, and Economic Development (DCCED) provides the data in this application as a service to the public. DCCED makes no warranty, representation, or guarantee as to the content, accuracy, timeliness, or completeness of any of the data provided on this site. DCCED shall not be liable to the user for damages of any kind arising out of the use of data or information provided. DCCED is not the authoritative source for American Community Survey data, and any data or information provided by DCCED is provided "as is". Data or information provided by DCCED shall be used and relied upon only at the user's sole risk.
A new challenging dataset in English spanning 18 domains, which is substantially bigger and linguistically more diverse than existing datasets.
This dataset shows the most spoken languages by borough and MSOAs in London. It provides numbers of the population aged 3+ who speak specified languages as their main language.
Main language is from 2011 Census (detailed) - Census table QS204EW.
This data is presented alongside Annual Population Survey (APS) data showing the top nationalities of residents in January - December 2019 by borough. The top 3 non-British nationalities are at the far right of the table. This is to highlight areas which may now have other common non-British languages spoken compared to 2011 (the year in which the Census information was gathered). The top non-British nationalities in 2019, which did not feature in 2011 as one of the most spoken non-British languages, are highlighted in column AD.
The APS has a sample of around 320,000 people in the UK (around 28,000 in London). As such all figures must be treated with some caution. Estimates for non-British nationalities at borough level that are below 10,000 are considered too small to be reliable and should be treated with additional caution.
MSOA codes have now been linked to House of Commons MSOA names
data_table.xlsx: Data table to reproduce all results
Stata_files: Stata 14 code to reproduce all results
The American Community Survey, Table B16001, provided detailed individual-level language estimates at the tract level for 42 non-English language categories, tabulated by English-speaking ability. Two sets of language data are included here, with population counts and percentages for both: the tract population speaking languages other than English, regardless of English-speaking ability, identified by the language name; and the languages spoken other than English by the tract population who do not speak English 'very well', identified by the language name followed by "_Enw".
The default pop-up for this service presents the second of these: languages spoken other than English by the tract population who do not speak English 'very well'.
In part because of privacy concerns with the very small counts in some categories in Table B16001, the Census changed the American Community Survey estimates of the languages spoken by individuals. In 2016, the number of categories previously presented in Table B16001 was reduced to reflect the most commonly spoken languages, and several languages spoken in Massachusetts were grouped into generalized (i.e., "Other...") categories. Table B16001 has been renamed Table C16001 with these generalized categories. Therefore, the information presented in this datalayer is not current, and these data cannot be updated.
Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
This Alberta Official Statistic describes the proportion of the population by language spoken most often at home in each economic region, as reported in the 2011 population census. Alberta is divided into eight economic regions as follows: Lethbridge – Medicine Hat; Camrose – Drumheller; Calgary; Banff – Jasper – Rocky Mountain House; Red Deer; Edmonton; Athabasca – Grande Prairie – Peace River; and Wood Buffalo – Cold Lake.
This data set includes annual counts and percentages of Medicaid and Children’s Health Insurance Program (CHIP) enrollees by primary language spoken (English, Spanish, and all other languages). Results are shown overall; by state; and by five subpopulation topics: race and ethnicity, age group, scope of Medicaid and CHIP benefits, urban or rural residence, and eligibility category. These results were generated using Transformed Medicaid Statistical Information System (T-MSIS) Analytic Files (TAF) Release 1 data and the Race/Ethnicity Imputation Companion File. This data set includes Medicaid and CHIP enrollees in all 50 states, the District of Columbia, Puerto Rico, and the U.S. Virgin Islands who were enrolled for at least one day in the calendar year, except where otherwise noted. Enrollees in Guam, American Samoa, the Northern Mariana Islands, and select states with data quality issues with the primary language variable in TAF are not included. Results shown for the race and ethnicity subpopulation topic exclude enrollees in the U.S. Virgin Islands. Results shown overall (where subpopulation topic is "Total enrollees") exclude enrollees younger than age 5 and enrollees in the U.S. Virgin Islands. Results for states with TAF data quality issues in the year have a value of "Unusable data." Some rows in the data set have a value of "DS," which indicates that data were suppressed according to the Centers for Medicare & Medicaid Services’ Cell Suppression Policy for values between 1 and 10. This data set is based on the brief: "Primary language spoken by the Medicaid and CHIP population in 2020." Enrollees are assigned to a primary language category based on their reported ISO language code in TAF (English/missing, Spanish, and all other language codes) (Primary Language). 
Enrollees are assigned to a race and ethnicity subpopulation using the state-reported race and ethnicity information in TAF when it is available and of good quality; if it is missing or unreliable, race and ethnicity is indirectly estimated using an enhanced version of Bayesian Improved Surname Geocoding (BISG) (Race and ethnicity of the national Medicaid and CHIP population in 2020). Enrollees are assigned to an age group subpopulation using age as of December 31st of the calendar year. Enrollees are assigned to the comprehensive benefits or limited benefits subpopulation according to the criteria in the "Identifying Beneficiaries with Full-Scope, Comprehensive, and Limited Benefits in the TAF" DQ Atlas brief. Enrollees are assigned to an urban or rural subpopulation based on the 2010 Rural-Urban Commuting Area (RUCA) code associated with their home or mailing address ZIP code in TAF (Rural Medicaid and CHIP enrollees in 2020). Enrollees are assigned to an eligibility category subpopulation using their latest reported eligibility group code, CHIP code, and age in the calendar year. Please refer to the full brief for additional context about the methodology and detailed findings. Future updates to this data set will include more recent data years as the TAF data become available.
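Because count cells in this data set can hold the markers "DS" (suppressed) or "Unusable data" instead of a number, any aggregation has to decide how to treat them. The sketch below shows one possible treatment that maps both markers to `None`. The sample cells are invented, and the comma-formatted number is an assumption about how counts might appear.

```python
# Marker strings named in the data set description.
NON_NUMERIC_MARKERS = {"DS", "Unusable data"}

def parse_count(cell):
    """Convert a count cell to int, or None for suppressed or unusable values."""
    value = cell.strip()
    if value in NON_NUMERIC_MARKERS:
        return None
    # Assumption: counts may be written with thousands separators.
    return int(value.replace(",", ""))

# Hypothetical cells, invented for illustration.
cells = ["1,234", "DS", "Unusable data", "56"]
parsed = [parse_count(c) for c in cells]
total = sum(v for v in parsed if v is not None)
print(parsed, total)
```

A total computed this way is a lower bound, since each suppressed cell stands for a value between 1 and 10 that is left out of the sum.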
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
This dataset is part of a larger mission to transform Tamil into a high-resource language in the field of Natural Language Processing (NLP). As one of the oldest and most culturally rich languages, Tamil has a unique linguistic structure, yet it remains underrepresented in the NLP landscape. This dataset, extracted from Tamil Wikipedia, serves as a foundational resource to support Tamil language processing, text mining, and machine learning applications.
- Text Data: This dataset contains over 569,000 articles in raw text form, extracted from Tamil Wikipedia. The collection is ideal for language model training, word frequency analysis, and text mining.
- Scripts and Processing Tools: Code snippets are provided for processing .bz2 compressed files, generating word counts, and handling data for NLP applications.
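The dataset's own scripts are not reproduced here. As a hedged illustration of the kind of processing described, the following sketch decompresses bz2 data and counts word frequencies with the standard library. The sample sentence is invented, and whitespace tokenization is a simplifying assumption.

```python
import bz2
from collections import Counter

def count_words(raw_bytes, top_n=3):
    """Decompress bz2 bytes, decode as UTF-8, and return the most common tokens."""
    text = bz2.decompress(raw_bytes).decode("utf-8")
    # Whitespace tokenization: a simplification that ignores punctuation.
    tokens = text.split()
    return Counter(tokens).most_common(top_n)

# Hypothetical input: a tiny compressed sample instead of the real dump.
sample = bz2.compress("தமிழ் மொழி தமிழ்".encode("utf-8"))
counts = count_words(sample)
print(counts[0])
```

For a file on disk, `bz2.open(path, "rt", encoding="utf-8")` allows the text to be read line by line, so the whole archive never has to be held in memory.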
Although Tamil has a documented lexicon of over 100,000 words, only a fraction of these are actively used in everyday communication. The largest available Tamil treebank currently holds only 10,000 words, limiting the scope for training accurate language models. This dataset aims to bridge that gap by providing a robust, open-source corpus for researchers, developers, and linguists who want to work on Tamil language technologies.
- Language Modeling: Train or fine-tune models like BERT, GPT, or LSTM-based language models for Tamil.
- Linguistic Research: Analyze Tamil morphology, syntax, and vocabulary usage.
- Data Augmentation: Use the raw text to generate augmented data for multilingual NLP applications.
- Word Embeddings and Semantic Analysis: Create embeddings for Tamil words, useful in multilingual setups or standalone applications.
I believe that advancing Tamil in NLP cannot be a solo effort. Contributions in the form of additional data, annotations, or even new tools for Tamil language processing are welcome! By working together, we can make Tamil a truly high-resource language in NLP.
This dataset is based on content from Tamil Wikipedia and is shared under the Creative Commons Attribution-ShareAlike 3.0 Unported License (CC BY-SA 3.0). Proper attribution to Wikipedia is required when using this data.
Population according to language, age and sex 1990-2017
Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
The census is undertaken by the Office for National Statistics every 10 years and gives us a picture of all the people and households in England and Wales. The most recent census took place in March of 2021.
The census asks every household questions about the people who live there and the type of home they live in. In doing so, it helps to build a detailed snapshot of society. Information from the census helps the government and local authorities to plan and fund local services, such as education, doctors' surgeries and roads.
Key census statistics for Leicester are published on the open data platform to make information accessible to local services, voluntary and community groups, and residents. There is also a dashboard showcasing various datasets from the census, allowing users to view data for the MSOAs of Leicester and compare it with statistics for Leicester overall.
Further information about the census and full datasets can be found on the ONS website - https://www.ons.gov.uk/census/aboutcensus/censusproducts
Main language
This dataset provides Census 2021 estimates that classify usual residents in England and Wales by their main language. The estimates are as at Census Day, 21 March 2021.
Main language is a person's first or preferred language. They may speak other languages as well. A main language is provided only for residents aged 3 and above. Residents aged below 3 years will appear as ‘Does not apply’. Please note that some organisations exclude those below 3 years when calculating percentages for this variable.
This dataset contains information for the MSOAs of Leicester City.