In 2021, 611,845 people in England and Wales spoke Polish as their main language, making it the most common main language other than English. It was followed by Romanian and Panjabi, with 471,945 and 290,745 speakers respectively.
This dataset shows the most spoken languages by borough and MSOA in London. It provides the number of people aged 3+ who speak specified languages as their main language.
Main language is taken from the 2011 Census (detailed), Census table QS204EW.
This data is presented alongside Annual Population Survey (APS) data showing the top nationalities of residents in January - December 2019 by borough. The top 3 non-British nationalities are at the far right of the table. This is to highlight areas which may now have other common non-British languages spoken compared to 2011 (the year in which the Census information was gathered). The top non-British nationalities in 2019, which did not feature in 2011 as one of the most spoken non-British languages, are highlighted in column AD.
The APS has a sample of around 320,000 people in the UK (around 28,000 in London). As such all figures must be treated with some caution. Estimates for non-British nationalities at borough level that are below 10,000 are considered too small to be reliable and should be treated with additional caution.
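The 10,000 threshold described above can be applied mechanically when reading the table. The following is a minimal sketch only: the borough names and estimate values are made-up placeholders rather than APS figures, and `flag_estimate` is a hypothetical helper, not part of any published tool.

```python
# Threshold taken from the note above: borough-level estimates below
# 10,000 should be treated with additional caution.
RELIABILITY_THRESHOLD = 10_000


def flag_estimate(estimate: int) -> str:
    """Return a caution label for a borough-level nationality estimate."""
    if estimate < RELIABILITY_THRESHOLD:
        return "additional caution"
    # All APS figures carry some caution because they are sample-based.
    return "some caution"


# Placeholder values for illustration only (not real APS figures).
sample_estimates = {"Borough A": 4_000, "Borough B": 10_000, "Borough C": 23_000}

for borough, estimate in sample_estimates.items():
    print(f"{borough}: {estimate:,} ({flag_estimate(estimate)})")
```

An estimate of exactly 10,000 is not flagged here, because the note refers to estimates "below 10,000".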
MSOA codes have now been linked to House of Commons MSOA names.
In 2025, around 1.53 billion people worldwide spoke English either natively or as a second language, slightly more than the 1.18 billion Mandarin Chinese speakers at the time of the survey. Hindi and Spanish were the third and fourth most widely spoken languages that year.
Languages in the United States
The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken there vary as a result of its multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which more than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese), 1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year.
Different languages at home
The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage is California, where about 45 percent of the population spoke a language other than English at home in 2023.
As of February 2025, English was the most popular language for web content, used by over 49.4 percent of websites. Spanish ranked second, with six percent of web content, followed by German, with 5.6 percent.
English as the leading online language
The United States and India, the countries with the most internet users after China, are also the world's biggest English-speaking markets. As of January 2023, the combined internet user base of the two countries was over a billion individuals. This has led to most online information being created in English. Consequently, even those who are not native speakers may use it for convenience.
Global internet usage by regions
As of October 2024, the number of internet users worldwide was 5.52 billion. In the same period, Northern Europe and North America led in internet penetration rates worldwide, with around 97 percent of their populations accessing the internet.
Open Government Licence 3.0 - http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
The census is undertaken by the Office for National Statistics every 10 years and gives us a picture of all the people and households in England and Wales. The most recent census took place in March 2021.
The census asks every household questions about the people who live there and the type of home they live in. In doing so, it helps to build a detailed snapshot of society. Information from the census helps the government and local authorities to plan and fund local services, such as education, doctors' surgeries and roads.
Key census statistics for Leicester are published on the open data platform to make information accessible to local services, voluntary and community groups, and residents. A dashboard is also published, showcasing various datasets from the census and allowing users to view data for Leicester and compare it with national statistics.
Further information about the census and full datasets can be found on the ONS website - https://www.ons.gov.uk/census/aboutcensus/censusproducts
Main language
This dataset provides Census 2021 estimates that classify usual residents in England and Wales by their main language. The estimates are as at Census Day, 21 March 2021.
Main language is a person's first or preferred language. They may speak other languages as well. A main language is provided only for residents aged 3 and above. Residents aged below 3 years will appear as ‘Does not apply’. Please note that some organisations exclude those below 3 years when calculating percentages for this variable.
This dataset contains information for Leicester City and England overall.
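Because residents below 3 years are recorded as ‘Does not apply’, the choice of denominator changes any percentage calculated for this variable. The sketch below illustrates the difference; the counts are placeholders chosen for easy arithmetic, not Census 2021 figures, and `main_language_share` is a hypothetical helper.

```python
def main_language_share(
    speakers: int, all_residents: int, under_3: int, exclude_under_3: bool
) -> float:
    """Percentage of residents reporting a given main language.

    When exclude_under_3 is True, residents recorded as 'Does not apply'
    (those below 3 years) are removed from the denominator.
    """
    denominator = all_residents - under_3 if exclude_under_3 else all_residents
    return 100 * speakers / denominator


# Placeholder counts for illustration only (not Census 2021 figures).
speakers = 20_000
all_residents = 100_000
under_3 = 4_000  # recorded as 'Does not apply'

share_all = main_language_share(speakers, all_residents, under_3, exclude_under_3=False)
share_3_plus = main_language_share(speakers, all_residents, under_3, exclude_under_3=True)

print(f"All usual residents as denominator: {share_all:.2f}%")
print(f"Residents aged 3+ as denominator:   {share_3_plus:.2f}%")
```

When comparing percentages from different publishers, it is worth checking which of the two denominators was used.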
Open Government Licence 3.0 - http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
This dataset provides Census 2021 estimates that classify Household Reference Persons in England and Wales by whether one or multiple languages are spoken, and by ethnic group. The estimates are as at Census Day, 21 March 2021.
Area type
Census 2021 statistics are published for a number of different geographies. These can be large, for example the whole of England, or small, for example an output area (OA), the lowest level of geography for which statistics are produced.
For higher levels of geography, more detailed statistics can be produced. When a lower level of geography is used, such as output areas (which have a minimum of 100 persons), the statistics produced have less detail. This is to protect the confidentiality of people and ensure that individuals or their characteristics cannot be identified.
Lower tier local authorities
Lower tier local authorities provide a range of local services. There are 309 lower tier local authorities in England made up of 181 non-metropolitan districts, 59 unitary authorities, 36 metropolitan districts and 33 London boroughs (including City of London). In Wales there are 22 local authorities made up of 22 unitary authorities.
Coverage
Census 2021 statistics are published for the whole of England and Wales. However, you can choose to filter areas by:
Multiple main languages in household
Classifies households by whether members speak the same or different main language. If multiple main languages are spoken, this identifies whether they differ between generations or partnerships within the household.
Ethnic group
The ethnic group that the person completing the census feels they belong to. This could be based on their culture, family background, identity or physical appearance.
Respondents could choose one out of 19 tick-box response categories, including write-in response options.
Open Government Licence 3.0 - http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
This dataset provides estimates of the percentage of usual residents aged 3 and over in England and Wales by their proficiency in English. The proficiency in English classification corresponds to the tick-box response options on the census questionnaire. The estimates help central government, local authorities and the NHS allocate resources and provide services for non-English speakers. They also help public service providers target the delivery of their services effectively, for example translation and interpretation services and material in alternative languages.
Statistical Disclosure Control - To protect against disclosure of personal information from the Census, records in the Census database have been swapped between different geographic areas, so some counts will be affected. In the main, the greatest effects will be at the lowest geographies, since the record swapping is targeted towards households with unusual characteristics in small areas.
Data is powered by LG Inform Plus and automatically checked for new data on the 3rd of each month.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the UK English General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of English speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world UK English communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade English speech models that understand and respond to authentic British accents and dialects.
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of UK English. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
The dataset comes with granular metadata for both speakers and recordings:
Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
This dataset is a versatile resource for multiple English speech and language AI applications:
In 2023, Duolingo led the language learning app market in the United Kingdom, achieving an aided brand awareness of 38.23 percent among consumers. Rosetta Stone followed with 29.84 percent, and Babbel reported 28.99 percent.
Hindi, with over *** million native speakers, was the most spoken language across Indian homes, followed by Bengali with ** million speakers, according to 2011 census data. Native English speakers numbered about *** thousand during the measured time period.
The colonial rule in India
One of the most remarkable and widespread legacies that British colonial rule left behind was the English language. Before independence, English was the sole language used for higher education and in government and administrative processes. After independence, however, and to this day, Hindi has been the language with official government patronage. This led to resistance from the southern states of India, where Hindi did not have prominence. Consequently, the Official Languages Act of 1963 was enacted by parliament, ensuring the continued use of English for official purposes in conjunction with Hindi.
Multi-linguistic cultures
India has approximately ** major languages that are written in about ** different scripts. While the country’s official languages are both English and Hindi, Hindi remains the most preferred language used online, especially in the northern rural areas. The use of English is becoming increasingly popular in urban areas. In addition, almost every state in India has its own official language that is studied in primary and secondary school as an obligatory second language. Among the most prominent are Bengali, Marathi, and Telugu.
This dataset shows the most spoken languages by borough and MSOA in London. It provides the number of people aged 3+ who speak specified languages as their main language. Main language is taken from the 2011 Census (detailed), Census table QS204EW.
These data are presented alongside Annual Population Survey (APS) data showing the top nationalities of residents in January to December 2019 by borough. The top three non-British nationalities are at the far right of the table. This is to highlight areas which may now have other common non-British languages spoken compared to 2011 (the year in which the Census information was gathered). The top non-British nationalities in 2019 which did not feature in 2011 as one of the most spoken non-British languages are highlighted in column AD.
The APS has a sample of around 320,000 people in the United Kingdom (around 28,000 in London). All figures must therefore be treated with caution. Borough-level estimates that are below 10,000 are considered too small to be reliable and should be treated with additional caution.
MSOA codes have now been linked to House of Commons MSOA names.
Our British English language datasets are meticulously curated and annotated by experienced linguists and language experts, ensuring exceptional accuracy, consistency, and linguistic depth. The following British English datasets are available for license:
Key Features (approximate numbers):
Our British English monolingual dataset delivers clear, reliable definitions and authentic usage examples, featuring a high volume of headwords and in-depth coverage of the British English variant of English. As one of the world’s most authoritative lexical resources, it’s trusted by leading academic, AI, and language technology organizations.
This British English language dataset offers a rich collection of synonyms and antonyms, accompanied by detailed definitions and part-of-speech (POS) annotations, making it a comprehensive resource for NLP tasks such as semantic search, word sense disambiguation, and language generation.
This dataset provides IPA transcriptions and mapped audio files for words in contemporary British English, with a focus on UK speaker usage. It includes syllabified transcriptions, variant spellings, part-of-speech tags, and pronunciation group identifiers. Audio files are supplied separately and linked where available – ideal for TTS, ASR, and pronunciation modeling.
Use Cases:
We consistently work with our clients on new use cases as language technology continues to evolve. These include Natural Language Processing (NLP) applications, TTS, dictionary display tools, games, translations, word embedding, and word sense disambiguation (WSD).
If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Growth.OL@oup.com to start the conversation.
Pricing:
Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.
Contact our team or email us at Growth.OL@oup.com to explore pricing options and discover how our language data can support your goals.
Open Government Licence 3.0 - http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
This dataset provides estimates of the percentage of usual residents aged 3 and over in England and Wales by their proficiency in English. The proficiency in English classification corresponds to the tick-box response options on the census questionnaire. The estimates help central government, local authorities and the NHS allocate resources and provide services for non-English speakers. They also help public service providers target the delivery of their services effectively, for example translation and interpretation services and material in alternative languages.
Statistical Disclosure Control - To protect against disclosure of personal information from the Census, records in the Census database have been swapped between different geographic areas, so some counts will be affected. In the main, the greatest effects will be at the lowest geographies, since the record swapping is targeted towards households with unusual characteristics in small areas.
Data is powered by LG Inform Plus and automatically checked for new data on the 4th of each month.
This statistic displays the languages of films released in the United Kingdom and Republic of Ireland in 2019. Following English language movies and movies that featured English alongside extensive use of another language, Hindi language movies were second most common, followed by Spanish and Polish. In 2019, ** Hindi movies and ** Spanish movies were released.
This dataset shows the most spoken languages by borough and MSOA in London. It provides the number of people aged 3+ who speak specified languages as their main language. Main language is taken from the 2011 Census (detailed), Census table QS204EW. These data are presented alongside Annual Population Survey (APS) data showing the top nationalities of residents in January to December 2019 by borough. The top 3 non-British nationalities are at the far right of the table. This is to highlight areas which may now have other common non-British languages spoken compared to 2011 (the year in which the Census information was gathered). The top non-British nationalities in 2019 which did not feature in 2011 as one of the most spoken non-British languages are highlighted in column AD. The APS has a sample of around 320,000 people in the UK (around 28,000 in London). As such, all figures must be treated with some caution. Estimates for non-British nationalities at borough level that are below 10,000 are considered too small to be reliable and should be treated with additional caution. MSOA codes have now been linked to House of Commons MSOA names.
Open Government Licence 3.0 - http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Census 2021 data on international student population of England and Wales by country of birth, passport held, age, sex and other characteristics.
These datasets are part of the release: The changing picture of long-term international migration, England and Wales: Census 2021. Figures may differ slightly in future releases because of the impact of removing rounding and applying further statistical processes.
Figures are based on geography boundaries as of 1 April 2022.
This release includes comparisons to the following 2011 Census data:
Quality notes can be found here
Quality information about demography and migration can be found here
Quality information about labour market can be found here
Usual resident
A usual resident is anyone who on Census Day, 21 March 2021 was in the UK and had stayed or intended to stay in the UK for a period of 12 months or more, or had a permanent UK address and was outside the UK and intended to be outside the UK for less than 12 months.
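The definition above combines two alternative conditions, which can be written as a simple predicate. This is a minimal sketch for illustration: the `Person` fields and the `is_usual_resident` function are hypothetical names introduced here, not part of any ONS tool or dataset schema.

```python
from dataclasses import dataclass


@dataclass
class Person:
    # All field names are hypothetical, chosen to mirror the wording above.
    in_uk_on_census_day: bool
    uk_stay_months: int            # months stayed or intended to stay in the UK
    has_permanent_uk_address: bool
    months_abroad_intended: int    # intended months outside the UK (0 if in the UK)


def is_usual_resident(p: Person) -> bool:
    """Apply the two-branch usual resident definition quoted above."""
    # Branch 1: in the UK on Census Day and staying 12 months or more.
    present_long_term = p.in_uk_on_census_day and p.uk_stay_months >= 12
    # Branch 2: permanent UK address, abroad on Census Day for under 12 months.
    temporarily_abroad = (
        p.has_permanent_uk_address
        and not p.in_uk_on_census_day
        and p.months_abroad_intended < 12
    )
    return present_long_term or temporarily_abroad


print(is_usual_resident(Person(True, 18, False, 0)))   # long-term stay in the UK
print(is_usual_resident(Person(False, 0, True, 3)))    # short absence abroad
print(is_usual_resident(Person(True, 6, False, 0)))    # short-term visitor
```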
International student
An international student is defined as someone who was a usual resident in England and Wales and meets all the following criteria:
Country of birth
The country in which a person was born. The following country of birth classifications are used in this dataset:
More information about country of birth classifications can be found here.
Passports held
The country or countries that a person holds, or is entitled to hold, a passport for. Where a person recorded having more than one passport, they were counted only once, categorised in the following priority order: 1. UK passport, 2. Irish passport, 3. Other passport. The following classifications were created for this dataset for comparability with other international migration releases:
More information can be found here
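The priority order for people holding more than one passport (UK first, then Irish, then any other) can be sketched as follows. The country labels `"UK"` and `"Ireland"` and the function name are assumptions made for this example. Returning `None` for a person with no passport is also an assumption, since the text above does not say how that case is categorised.

```python
from typing import Iterable, Optional


def passport_category(passports: Iterable[str]) -> Optional[str]:
    """Assign one category per person using the priority order above."""
    held = set(passports)
    if not held:
        # Assumption: people with no passport are handled separately.
        return None
    if "UK" in held:
        return "UK passport"
    if "Ireland" in held:
        return "Irish passport"
    return "Other passport"


# Each person is counted once, whatever combination they hold.
print(passport_category(["Poland", "UK"]))       # UK takes priority
print(passport_category(["Ireland", "France"]))  # Irish ahead of other
print(passport_category(["India"]))              # any other passport
```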
Economic activity status
The economic activity status of a person on Census Day, 21 March 2021. The following classification is used in this dataset:
Industry
The industry worked in for those in current employment. The following classification was used for this dataset:
Student accommodation
Student accommodation breaks down household type by typical households used by students. This includes communal establishments, all student households, households containing a single family, households containing multiple families, living with parents and living alone.
More information can be found here
Second address indicator
The second address indicator is used to define an address (in or out of the UK) a person stays at for more than 30 days per year that is not their place of usual residence. Second addresses typically include: armed forces bases, addresses used by people working away from home, a student’s home address, the address of another parent or guardian, a partner’s address, a holiday home. There are 3 categories in this classification.
Detailed description can be found here
Main language (detailed)
This is used to define a person's first or preferred language. This breaks down the responses given in the write-in option "Other, write in (including British Sign Language)". There are 95 categories in the primary classification.
More details can be found here
Proficiency in English language
Proficiency in English language is used to determine how well a person whose main language is not English (English or Welsh in Wales) feels they can speak English. There are 6 categories in this classification.
More details can be found here
In 2021, the London borough of Newham had the highest share of residents that spoke a language other than English as their main language. Brent had the second-highest share of residents that had a different main language, followed by Ealing and Harrow, all also London boroughs. Outside of London, Leicester had the highest share of people who reported a language other than English as their main one, at 30 percent.
CC0 1.0 Universal Public Domain Dedication - https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The Ukrainian web corpus MaCoCu-uk 1.0 was built by crawling the ".ua" and ".укр" internet top-level domains in 2022, extending the crawl dynamically to other domains as well. The crawler is available at https://github.com/macocu/MaCoCu-crawler. Considerable effort was devoted to cleaning the extracted text to provide a high-quality web corpus. This was achieved by removing boilerplate (https://corpus.tools/wiki/Justext) and near-duplicated paragraphs (https://corpus.tools/wiki/Onion), and by discarding very short texts as well as texts that are not in the target language. The dataset is characterized by extensive metadata which allows filtering the dataset based on text quality and other criteria (https://github.com/bitextor/monotextor), making the corpus highly useful for corpus linguistics studies, as well as for training language models and other language technologies.
In XML format, each document is accompanied by the following metadata: title, crawl date, url, domain, file type of the original document, distribution of languages inside the document, and a fluency score based on a language model. The text of each document is divided into paragraphs that are accompanied by metadata on whether the paragraph is a heading, on paragraph quality (labels such as “short” or “good”, assigned based on paragraph length, URL and stopword density via the jusText tool - https://corpus.tools/wiki/Justext) and fluency (a score between 0 and 1, assigned with the Monocleaner tool - https://github.com/bitextor/monocleaner), on the automatically identified language of the text in the paragraph, and on whether the paragraph contains sensitive information (identified via the Biroamer tool - https://github.com/bitextor/biroamer). The corpus can be easily read with the prevert parser (https://pypi.org/project/prevert/).
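The paragraph-level metadata described above lends itself to simple quality filtering. The sketch below shows the idea on plain Python dictionaries. The key names (`quality`, `fluency`, `lang`, `sensitive`), the paragraph ids and the 0.5 fluency cut-off are all assumptions made for this example. The corpus itself is distributed as XML, and the prevert parser exposes the metadata through its own interface, which is not shown here.

```python
from typing import Dict, List

# Hypothetical in-memory records standing in for parsed paragraph metadata.
# The values are placeholders, not taken from the corpus.
paragraphs: List[Dict] = [
    {"id": "p1", "quality": "good", "fluency": 0.93, "lang": "uk", "sensitive": False},
    {"id": "p2", "quality": "short", "fluency": 0.88, "lang": "uk", "sensitive": False},
    {"id": "p3", "quality": "good", "fluency": 0.21, "lang": "uk", "sensitive": False},
    {"id": "p4", "quality": "good", "fluency": 0.97, "lang": "uk", "sensitive": True},
]


def keep_paragraph(par: Dict, min_fluency: float = 0.5, target_lang: str = "uk") -> bool:
    """Keep paragraphs labelled 'good', fluent enough, in the target
    language, and not flagged as containing sensitive information."""
    return (
        par["quality"] == "good"
        and par["fluency"] >= min_fluency
        and par["lang"] == target_lang
        and not par["sensitive"]
    )


kept = [p["id"] for p in paragraphs if keep_paragraph(p)]
print(kept)
```

The threshold is a tuning choice: a stricter fluency cut-off trades corpus size for cleaner text.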
Notice and take down: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please: (1) Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted. (2) Clearly identify the copyrighted work claimed to be infringed. (3) Clearly identify the material that is claimed to be infringing and information reasonably sufficient in order to allow us to locate the material. (4) Please write to the contact person for this resource whose email is available in the full item record. We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
This action has received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341. This communication reflects only the author’s view. The Agency is not responsible for any use that may be made of the information it contains.
A newer version of the corpus is available as part of the MaCoCu-Genre corpora collection at http://hdl.handle.net/11356/1969. The main novelty of the MaCoCu-Genre version is that the texts have been automatically annotated with genre categories. Additionally, the corpus underwent additional post-processing and has been transformed to the JSONL format.