100+ datasets found

The most spoken languages worldwide 2025
statista.com
Updated Apr 14, 2025
Cite
Statista (2025). The most spoken languages worldwide 2025 [Dataset]. https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/
Explore at:
Dataset updated
Apr 14, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
2025
Area covered
World
Description
In 2025, there were around 1.53 billion people worldwide who spoke English either natively or as a second language, ahead of the 1.18 billion Mandarin Chinese speakers at the time of survey. Hindi and Spanish were the third and fourth most widespread languages that year.

Languages in the United States
The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken there vary as a result of the multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which more than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese), 1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year.

Different languages at home
The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage of its population speaking a language other than English is California, where about 45 percent of the population spoke a language other than English at home in 2023.
Number of native Spanish speakers worldwide 2024, by country
statista.com
Updated Jan 15, 2025
Cite
Statista (2025). Number of native Spanish speakers worldwide 2024, by country [Dataset]. https://www.statista.com/statistics/991020/number-native-spanish-speakers-country-worldwide/
Explore at:
Dataset updated
Jan 15, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Area covered
World
Description
Mexico is the country with the largest number of native Spanish speakers in the world. As of 2024, 132.5 million people in Mexico spoke Spanish with a native command of the language. Colombia was the nation with the second-highest number of native Spanish speakers, at around 52.7 million. Spain came in third, with 48 million, and Argentina fourth, with 46 million.

Spanish, a world language
As of 2023, Spanish ranked as the fourth most spoken language in the world, behind only English, Chinese, and Hindi, with over half a billion speakers. Spanish is the official language of over 20 countries, most of them on the American continent; it is also one of the official languages of Equatorial Guinea in Africa. Spanish also has a strong presence in other countries, such as the United States, Morocco, and Brazil, which are among the non-Hispanic countries with the highest number of Spanish speakers.

The second most spoken language in the U.S.
In the most recent data, Spanish ranked as the language other than English with the highest number of speakers in the U.S., with 12 times as many speakers as the second-placed language. This comes as no surprise given the long history of migration from Latin American countries to the United States. Moreover, in fiscal year 2022 alone, 5 of the top 10 countries of origin of naturalized people in the U.S. were Spanish-speaking countries.
Ranking of languages spoken at home in the U.S. 2023
statista.com
Updated Apr 14, 2025
Cite
Statista (2025). Ranking of languages spoken at home in the U.S. 2023 [Dataset]. https://www.statista.com/statistics/183483/ranking-of-languages-spoken-at-home-in-the-us-in-2008/
Explore at:
Dataset updated
Apr 14, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
2023
Area covered
United States
Description
In 2023, around 43.37 million people in the United States spoke Spanish at home. In comparison, approximately 998,179 people were speaking Russian at home during the same year. The distribution of the U.S. population by ethnicity can be accessed here. A ranking of the most spoken languages across the world can be accessed here.
2017 San Diego County Demographics - Language Spoken at Home for the...
data.sandiegocounty.gov
application/rdfxml +5
Updated Feb 22, 2020
Cite
County of San Diego (2020). 2017 San Diego County Demographics - Language Spoken at Home for the Population 5 Years and Ability to Speak English (Detailed) [Dataset]. https://data.sandiegocounty.gov/Demographics/2017-San-Diego-County-Demographics-Language-Spoken/b7iq-x9dz
Explore at:
Available download formats: csv, xml, application/rdfxml, application/rssxml, tsv, json
Dataset updated
Feb 22, 2020
Dataset authored and provided by
County of San Diego
License
U.S. Government Works (https://www.usa.gov/government-works)
License information was derived automatically
Area covered
San Diego County
Description
Language questions were only asked of persons 5 years and older. The language question is about current use of a non-English language at home, not about ability to speak another language or the use of such a language in the past or elsewhere. People who speak a language other than English outside of the home are not reported as speaking a language other than English. Respondents who spoke a language other than English at home were also asked whether they could speak English "very well" or less than "very well". For more information on how the Census Bureau measures language use, see https://www.census.gov/topics/population/language-use/about.html.

Source: U.S. Census Bureau; 2013-2017 American Community Survey 5-Year Estimates, Table C16001.
2013 American Community Survey - Table Packages: Detailed Language Spoken in...
catalog.data.gov
Updated Jul 19, 2023
Cite
U.S. Census Bureau (2023). 2013 American Community Survey - Table Packages: Detailed Language Spoken in the U.S. [Dataset]. https://catalog.data.gov/dataset/2013-american-community-survey-table-packages-detailed-language-spoken-in-the-u-s
Explore at:
Dataset updated
Jul 19, 2023
Dataset provided by
United States Census Bureau (http://census.gov/)
Area covered
United States
Description
This data set uses the 2009-2013 American Community Survey to tabulate the number of speakers of languages spoken at home and the number of speakers of each language who speak English less than very well. These tabulations are available for the following geographies: nation; each of the 50 states, plus Washington, D.C. and Puerto Rico; counties with 100,000 or more total population and 25,000 or more speakers of languages other than English and Spanish; core-based statistical areas (metropolitan statistical areas and micropolitan statistical areas) with 100,000 or more total population and 25,000 or more speakers of languages other than English and Spanish.
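The two size thresholds in the description can be expressed as a simple filter. The following is a minimal sketch only; the function name and the example figures are hypothetical and are not taken from the Census tabulations.

```python
# Thresholds quoted in the description for counties and core-based statistical areas.
MIN_TOTAL_POPULATION = 100_000
MIN_OTHER_LANGUAGE_SPEAKERS = 25_000  # speakers of languages other than English and Spanish


def meets_tabulation_threshold(total_population: int, other_language_speakers: int) -> bool:
    """Return True when a county or CBSA meets both thresholds from the description."""
    return (
        total_population >= MIN_TOTAL_POPULATION
        and other_language_speakers >= MIN_OTHER_LANGUAGE_SPEAKERS
    )


# Hypothetical figures, for illustration only.
print(meets_tabulation_threshold(250_000, 30_000))
print(meets_tabulation_threshold(250_000, 10_000))
```

Nation and state-level tabulations are listed without such thresholds, so a filter like this would apply only to counties and core-based statistical areas.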
Common languages used for web content 2025, by share of websites
ai-chatbox.pro
statista.com
Updated Feb 11, 2025
Cite
Statista (2025). Common languages used for web content 2025, by share of websites [Dataset]. https://www.ai-chatbox.pro/?_=%2Fstatistics%2F262946%2Fshare-of-the-most-common-languages-on-the-internet%2F%23XgboD02vawLKoDs%2BT%2BQLIV8B6B4Q9itA
Explore at:
Dataset updated
Feb 11, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
Feb 2025
Area covered
World
Description
As of February 2025, English was the most popular language for web content, with over 49.4 percent of websites using it. Spanish ranked second, with six percent of web content, followed by German, with 5.6 percent.

English as the leading online language
The United States and India, the countries with the most internet users after China, are also the world's biggest English-speaking markets. The internet user base in both countries combined, as of January 2023, was over a billion individuals. This has led to most online information being created in English. Consequently, even those who are not native speakers may use it for convenience.

Global internet usage by regions
As of October 2024, the number of internet users worldwide was 5.52 billion. In the same period, Northern Europe and North America were leading in terms of internet penetration rates worldwide, with around 97 percent of their populations accessing the internet.
Languages and English Ability - Seattle Neighborhoods
data-seattlecitygis.opendata.arcgis.com
data.seattle.gov
Updated Feb 22, 2024
Cite
City of Seattle ArcGIS Online (2024). Languages and English Ability - Seattle Neighborhoods [Dataset]. https://data-seattlecitygis.opendata.arcgis.com/datasets/SeattleCityGIS::languages-and-english-ability-seattle-neighborhoods
Explore at:
Dataset updated
Feb 22, 2024
Dataset authored and provided by
City of Seattle ArcGIS Online
Area covered
Seattle
Description
Table from the American Community Survey (ACS) 5-year series on languages spoken and English ability related topics for City of Seattle Council Districts, Comprehensive Plan Growth Areas and Community Reporting Areas. Table includes B16004 Age by Language Spoken at Home by Ability to Speak English, C16002 Household Language by Household Limited English-Speaking Status. Data is pulled from block group tables for the most recent ACS vintage and summarized to the neighborhoods based on block group assignment.

Table created for and used in the Neighborhood Profiles application.

Vintages: 2023
ACS Table(s): B16004, C16002
Data downloaded from: Census Bureau's Explore Census Data

The United States Census Bureau's American Community Survey (ACS): About the Survey; Geography & ACS; Technical Documentation; News & Updates

This ready-to-use layer can be used within ArcGIS Pro, ArcGIS Online, its configurable apps, dashboards, Story Maps, custom apps, and mobile apps. Data can also be exported for offline workflows. Please cite the Census and ACS when using this data.

Data Note from the Census:
Data are based on a sample and are subject to sampling variability. The degree of uncertainty for an estimate arising from sampling variability is represented through the use of a margin of error. The value shown here is the 90 percent margin of error. The margin of error can be interpreted as providing a 90 percent probability that the interval defined by the estimate minus the margin of error and the estimate plus the margin of error (the lower and upper confidence bounds) contains the true value. In addition to sampling variability, the ACS estimates are subject to nonsampling error (for a discussion of nonsampling variability, see Accuracy of the Data). The effect of nonsampling error is not represented in these tables.

Data Processing Notes:
Boundaries come from the US Census TIGER geodatabases, specifically, the National Sub-State Geography Database (named tlgdb_(year)_a_us_substategeo.gdb). Boundaries are updated at the same time as the data updates (annually), and the boundary vintage appropriately matches the data vintage as specified by the Census. These are Census boundaries with water and/or coastlines erased for cartographic and mapping purposes. For census tracts, the water cutouts are derived from a subset of the 2020 Areal Hydrography boundaries offered by TIGER. Water bodies and rivers which are 50 million square meters or larger (mid to large sized water bodies) are erased from the tract level boundaries, as well as additional important features. For state and county boundaries, the water and coastlines are derived from the coastlines of the 2020 500k TIGER Cartographic Boundary Shapefiles. These are erased to more accurately portray the coastlines and Great Lakes. The original AWATER and ALAND fields are still available as attributes within the data table (units are square meters).
The States layer contains 52 records - all US states, Washington D.C., and Puerto Rico.
Census tracts with no population that occur in areas of water, such as oceans, are removed from this data service (Census Tracts beginning with 99).
Percentages and derived counts, and associated margins of error, are calculated values (that can be identified by the "_calc_" stub in the field name), and abide by the specifications defined by the American Community Survey.
Field alias names were created based on the Table Shells file available from the American Community Survey Summary File Documentation page.
Negative values (e.g., -4444...) have been set to null, with the exception of -5555... which has been set to zero. These negative values exist in the raw API data to indicate the following situations:
The margin of error column indicates that either no sample observations or too few sample observations were available to compute a standard error and thus the margin of error. A statistical test is not appropriate.
Either no sample observations or too few sample observations were available to compute an estimate, or a ratio of medians cannot be calculated because one or both of the median estimates falls in the lowest interval or upper interval of an open-ended distribution.
The median falls in the lowest interval of an open-ended distribution, or in the upper interval of an open-ended distribution. A statistical test is not appropriate.
The estimate is controlled. A statistical test for sampling variability is not appropriate.
The data for this geographic area cannot be displayed because the number of sample cases is too small.
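The margin-of-error interpretation in the Census data note can be illustrated with a short calculation. This is a minimal sketch: the function names and the example numbers are hypothetical, and the 1.645 factor is the standard multiplier the Census Bureau documents for converting a 90 percent margin of error to a standard error.

```python
# Factor relating a 90 percent margin of error to a standard error.
Z_90 = 1.645


def confidence_bounds(estimate: float, moe: float) -> tuple:
    """Lower and upper 90 percent confidence bounds: estimate -/+ margin of error."""
    return estimate - moe, estimate + moe


def standard_error(moe: float) -> float:
    """Approximate standard error implied by a 90 percent margin of error."""
    return moe / Z_90


# Hypothetical estimate and margin of error, for illustration only.
lower, upper = confidence_bounds(1000, 150)
print(lower, upper)
print(standard_error(150))
```

A null margin of error (one of the negative flag values described above) should be handled before calling functions like these, since no interval can be formed in that case.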
Percent Spanish Speakers
king-snocoplanning.opendata.arcgis.com
hub.arcgis.com
Updated Aug 10, 2016
Cite
King County (2016). Percent Spanish Speakers [Dataset]. https://king-snocoplanning.opendata.arcgis.com/datasets/kingcounty::percent-spanish-speakers
Explore at:
Dataset updated
Aug 10, 2016
Dataset authored and provided by
King County
Area covered

Description
Languages: Percent Spanish Speakers: Basic demographics by census tracts in King County based on current American Community Survey 5 Year Average (ACS). Included demographics are: total population; foreign born; median household income; English language proficiency; languages spoken; race and ethnicity; sex; and age. Numbers and derived percentages are estimates based on the current year's ACS. GEO_ID_TRT is the key field and may be used to join to other demographic Census data tables.
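A join on the GEO_ID_TRT key might look like the sketch below. Only the GEO_ID_TRT field name comes from the description; the other column names, the tract identifiers, and the values are hypothetical placeholders.

```python
def join_on_tract(left_rows, right_rows, key="GEO_ID_TRT"):
    """Inner-join two lists of row dicts on the shared tract key."""
    right_by_key = {row[key]: row for row in right_rows}
    joined = []
    for row in left_rows:
        match = right_by_key.get(row[key])
        if match is not None:
            # Columns from the left table win if both tables share a name.
            joined.append({**match, **row})
    return joined


# Hypothetical rows, for illustration only.
language_rows = [
    {"GEO_ID_TRT": "TRACT_A", "pct_spanish": 12.5},
    {"GEO_ID_TRT": "TRACT_B", "pct_spanish": 3.1},
]
income_rows = [
    {"GEO_ID_TRT": "TRACT_A", "median_income": 72000},
]

print(join_on_tract(language_rows, income_rows))
```

Tracts present in only one table are dropped here; a left join would keep them with missing values instead.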
Data from: Language Spoken at Home
linc.osbm.nc.gov
ncosbm.opendatasoft.com
csv, excel, geojson +1
Updated Oct 3, 2024
Cite
(2024). Language Spoken at Home [Dataset]. https://linc.osbm.nc.gov/explore/dataset/language-spoken-at-home/
Explore at:
Available download formats: geojson, csv, json, excel
Dataset updated
Oct 3, 2024
Description
Language spoken at home and the ability to speak English for the population age 5 and over, as reported in the US Census Bureau's American Community Survey (ACS) 5-year estimates, table C16001.
Percent of Population with Limited Ability to Speak English
coronavirus-resources.esri.com
data.amerigeoss.org
Updated Jul 3, 2019
Cite
Urban Observatory by Esri (2019). Percent of Population with Limited Ability to Speak English [Dataset]. https://coronavirus-resources.esri.com/maps/78a668915cbc4bf983330608f3d687aa
Explore at:
Dataset updated
Jul 3, 2019
Dataset authored and provided by
Urban Observatory by Esri
Area covered

Description
This map shows the percent of population with a limited ability to speak English by census tract. Search to your community and investigate the top language needs in nearby census tracts.

*DATA AS OF 2011-2015*

Data Source: U.S. Census Bureau's American Community Survey 5-year estimates, 2011-2015, Table B16001.

Complete list of all languages available in this data set (29): Spanish or Spanish Creole; French (including Patois, Cajun); French Creole; Italian; Portuguese; German; Yiddish; Greek; Russian; Polish; Serbo-Croatian; Armenian; Persian; Gujarati; Hindi; Urdu; Chinese; Japanese; Korean; Mon-Khmer, Cambodian; Hmong; Thai; Laotian; Vietnamese; Tagalog; Navajo; Hungarian; Arabic; Hebrew.

Those who have limited English ability and speak other languages are included in the percentage depicted in the map, but other languages will not appear in the ranked list or in the table.

Accompanying feature layer and viewing app are also available.
Data_Sheet_1_Pilot study of a Spanish language measure of financial toxicity...
frontiersin.figshare.com
docx
Updated Jul 10, 2023
Cite
Julia J. Shi; Gwendolyn J. McGinnis; Susan K. Peterson; Nicolette Taku; Ying-Shiuan Chen; Robert K. Yu; Chi-Fang Wu; Tito R. Mendoza; Sanjay S. Shete; Hilary Ma; Robert J. Volk; Sharon H. Giordano; Ya-Chen T. Shih; Diem-Khanh Nguyen; Kelsey W. Kaiser; Grace L. Smith (2023). Data_Sheet_1_Pilot study of a Spanish language measure of financial toxicity in underserved Hispanic cancer patients with low English proficiency.docx [Dataset]. http://doi.org/10.3389/fpsyg.2023.1188783.s001
Explore at:
Available download formats: docx
Unique identifier
https://doi.org/10.3389/fpsyg.2023.1188783.s001
Dataset updated
Jul 10, 2023
Dataset provided by
Frontiers
Authors
Julia J. Shi; Gwendolyn J. McGinnis; Susan K. Peterson; Nicolette Taku; Ying-Shiuan Chen; Robert K. Yu; Chi-Fang Wu; Tito R. Mendoza; Sanjay S. Shete; Hilary Ma; Robert J. Volk; Sharon H. Giordano; Ya-Chen T. Shih; Diem-Khanh Nguyen; Kelsey W. Kaiser; Grace L. Smith
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Background
Financial toxicity (FT) reflects multi-dimensional personal economic hardships borne by cancer patients. It is unknown whether measures of FT, to date derived largely from English-speakers, adequately capture economic experiences and financial hardships of medically underserved low English proficiency US Hispanic cancer patients. We piloted a Spanish language FT instrument in this population.

Methods
We piloted a Spanish version of the Economic Strain and Resilience in Cancer (ENRICh) FT measure using qualitative cognitive interviews and surveys in un-/under-insured or medically underserved, low English proficiency, Spanish-speaking Hispanics (UN-Spanish, n = 23) receiving ambulatory oncology care at a public healthcare safety net hospital in the Houston metropolitan area. Exploratory analyses compared ENRICh FT scores amongst the UN-Spanish group to: (1) un-/under-insured English-speaking Hispanics (UN-English, n = 23) from the same public facility and (2) insured English-speaking Hispanics (INS-English, n = 31) from an academic comprehensive cancer center. Multivariable logistic models compared the outcome of severe FT (score > 6).

Results
UN-Spanish Hispanic participants reported high acceptability of the instrument (only 0% responded that the instrument was “very difficult to answer” and 4% that it was “very difficult to understand the questions”; 8% responded that it was “very difficult to remember resources used” and 8% that it was “very difficult to remember the burdens experienced”; and 4% responded that it was “very uncomfortable to respond”). Internal consistency of the FT measure was high (Cronbach’s α = 0.906). In qualitative responses, UN-Spanish Hispanics frequently identified a total lack of credit, savings, or income and food insecurity as aspects contributing to FT. UN-Spanish and UN-English Hispanic patients were younger, had lower education and income, resided in socioeconomically deprived neighborhoods and had more advanced cancer vs. INS-English Hispanics. There was a higher likelihood of severe FT in UN-Spanish (OR = 2.73, 95% CI 0.77–9.70; p = 0.12) and UN-English (OR = 4.13, 95% CI 1.13–15.12; p = 0.03) vs. INS-English Hispanics. A higher likelihood of severely depleted FT coping resources occurred in UN-Spanish (OR = 4.00, 95% CI 1.07–14.92; p = 0.04) and UN-English (OR = 5.73, 95% CI 1.49–22.1; p = 0.01) vs. INS-English. The likelihood of FT did not differ between UN-Spanish and UN-English in both models (p = 0.59 and p = 0.62 respectively).

Conclusion
In medically underserved, uninsured Hispanic patients with cancer, comprehensive Spanish-language FT assessment in low English proficiency participants was feasible, acceptable, and internally consistent. Future studies employing tailored FT assessment and intervention should encompass the key privations and hardships in this population.
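The reported odds ratios, confidence intervals, and p-values can be cross-checked against one another. The sketch below assumes the intervals are Wald-type intervals that are symmetric on the log-odds scale, which is common for logistic models but is not stated in the abstract; the function name is hypothetical.

```python
import math


def p_value_from_or_ci(odds_ratio, ci_low, ci_high, z_crit=1.96):
    """Approximate two-sided p-value implied by an odds ratio and its 95% CI.

    Assumes a Wald interval: log(OR) +/- z_crit * SE.
    """
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = math.log(odds_ratio) / se
    # erfc(|z| / sqrt(2)) equals the two-sided normal tail probability.
    return math.erfc(abs(z) / math.sqrt(2))


# Severe FT, UN-Spanish vs. INS-English (reported p = 0.12)
print(round(p_value_from_or_ci(2.73, 0.77, 9.70), 2))
# Severe FT, UN-English vs. INS-English (reported p = 0.03)
print(round(p_value_from_or_ci(4.13, 1.13, 15.12), 2))
```

Under that assumption, an interval that spans 1 (as in the UN-Spanish comparison for severe FT) corresponds to a p-value above 0.05.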
Linguistic Isolation (by Georgia House) 2017
opendata.atlantaregional.com
Updated Jun 26, 2019
Cite
Georgia Association of Regional Commissions (2019). Linguistic Isolation (by Georgia House) 2017 [Dataset]. https://opendata.atlantaregional.com/datasets/linguistic-isolation-by-georgia-house-2017
Explore at:
Dataset updated
Jun 26, 2019
Dataset provided by
The Georgia Association of Regional Commissions
Authors
Georgia Association of Regional Commissions
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Area covered

Description
This layer was developed by the Research & Analytics Group of the Atlanta Regional Commission, using data from the U.S. Census Bureau’s American Community Survey 5-year estimates for 2013-2017, to show the number and percentage of the U.S. population 5 years and older that speaks English less than "very well" and does not speak English at home, by Georgia House in the Atlanta region. The user should note that American Community Survey data represent estimates derived from a surveyed sample of the population, which creates some level of uncertainty, as opposed to an exact measure of the entire population (the full census count is only conducted once every 10 years and does not cover as many detailed characteristics of the population). Therefore, any measure reported by ACS should not be taken as an exact number, which is why a corresponding margin of error (MOE) is also given for ACS measures. The size of the MOE relative to its corresponding estimate value provides an indication of confidence in the accuracy of each estimate. Each MOE is expressed in the same units as its corresponding measure; for example, if the estimate value is expressed as a number, then its MOE will also be a number; if the estimate value is expressed as a percent, then its MOE will also be a percent. The user should also note that for relatively small geographic areas, such as census tracts shown here, ACS only releases combined 5-year estimates, meaning these estimates represent rolling averages of survey results that were collected over a 5-year span (in this case 2013-2017). Therefore, these data do not represent any one specific point in time or even one specific year. For geographic areas with larger populations, 3-year and 1-year estimates are also available. For further explanation of ACS estimates and margin of error, visit the Census ACS website.
Naming conventions:

Prefixes:
None: Count
p: Percent
r: Rate
m: Median
a: Mean (average)
t: Aggregate (total)
ch: Change in absolute terms (value in t2 - value in t1)
pch: Percent change ((value in t2 - value in t1) / value in t1)
chp: Change in percent (percent in t2 - percent in t1)

Suffixes:
None: Change over two periods
_e: Estimate from most recent ACS
_m: Margin of Error from most recent ACS
_00: Decennial 2000

Attributes:
SumLevel: Summary level of geographic unit (e.g., County, Tract, NSA, NPU, DSNI, SuperDistrict, etc)
GEOID: Census tract Federal Information Processing Series (FIPS) code
NAME: Name of geographic unit
Planning_Region: Planning region designation for ARC purposes
Acres: Total area within the tract (in acres)
SqMi: Total area within the tract (in square miles)
County: County identifier (combination of Federal Information Processing Series (FIPS) codes for state and county)
CountyName: County Name
Pop5P_e: # Population 5 years and over, 2017
Pop5P_m: # Population 5 years and over, 2017 (MOE)
EnglishOnly_e: # Speaks English only, 2017
EnglishOnly_m: # Speaks English only, 2017 (MOE)
pEnglishOnly_e: % Speaks English only, 2017
pEnglishOnly_m: % Speaks English only, 2017 (MOE)
NotEnglish_e: # Speaks language other than English at home, 2017
NotEnglish_m: # Speaks language other than English at home, 2017 (MOE)
pNotEnglish_e: % Speaks language other than English at home, 2017
pNotEnglish_m: % Speaks language other than English at home, 2017 (MOE)
EngLtVeryWell_e: # English not spoken at home, speaks English less than 'very well', 2017
EngLtVeryWell_m: # English not spoken at home, speaks English less than 'very well', 2017 (MOE)
pEngLtVeryWell_e: % English not spoken at home, speaks English less than 'very well', 2017
pEngLtVeryWell_m: % English not spoken at home, speaks English less than 'very well', 2017 (MOE)
Spanish_e: # Speaks Spanish at home, 2017
Spanish_m: # Speaks Spanish at home, 2017 (MOE)
pSpanish_e: % Speaks Spanish at home, 2017
pSpanish_m: % Speaks Spanish at home, 2017 (MOE)
SpanishEngLtVeryWell_e: # Speaks Spanish at home, speaks English less than 'very well', 2017
SpanishEngLtVeryWell_m: # Speaks Spanish at home, speaks English less than 'very well', 2017 (MOE)
pSpanishEngLtVeryWell_e: % Speaks Spanish at home, speaks English less than 'very well', 2017
pSpanishEngLtVeryWell_m: % Speaks Spanish at home, speaks English less than 'very well', 2017 (MOE)
IndoEurNotEnglish_e: # Speaks other Indo-European language at home, 2017
IndoEurNotEnglish_m: # Speaks other Indo-European language at home, 2017 (MOE)
pIndoEurNotEnglish_e: % Speaks other Indo-European language at home, 2017
pIndoEurNotEnglish_m: % Speaks other Indo-European language at home, 2017 (MOE)
IndoEurEngLtVeryWell_e: # Speaks other Indo-European language at home, speaks English less than 'very well', 2017
IndoEurEngLtVeryWell_m: # Speaks other Indo-European language at home, speaks English less than 'very well', 2017 (MOE)
pIndoEurEngLtVeryWell_e: % Speaks other Indo-European language at home, speaks English less than 'very well', 2017
pIndoEurEngLtVeryWell_m: % Speaks other Indo-European language at home, speaks English less than 'very well', 2017 (MOE)
AsianNotEnglish_e: # Speaks Asian language at home, 2017
AsianNotEnglish_m: # Speaks Asian language at home, 2017 (MOE)
pAsianNotEnglish_e: % Speaks Asian language at home, 2017
pAsianNotEnglish_m: % Speaks Asian language at home, 2017 (MOE)
AsianEngLtVeryWell_e: # Speaks Asian language at home, speaks English less than 'very well', 2017
AsianEngLtVeryWell_m: # Speaks Asian language at home, speaks English less than 'very well', 2017 (MOE)
pAsianEngLtVeryWell_e: % Speaks Asian language at home, speaks English less than 'very well', 2017
pAsianEngLtVeryWell_m: % Speaks Asian language at home, speaks English less than 'very well', 2017 (MOE)
OthLangNotEnglish_e: # Speaks other language at home, 2017
OthLangNotEnglish_m: # Speaks other language at home, 2017 (MOE)
pOthLangNotEnglish_e: % Speaks other language at home, 2017
pOthLangNotEnglish_m: % Speaks other language at home, 2017 (MOE)
OthLangEngLtVeryWell_e: # Speaks other language at home, speaks English less than 'very well', 2017
OthLangEngLtVeryWell_m: # Speaks other language at home, speaks English less than 'very well', 2017 (MOE)
pOthLangEngLtVeryWell_e: % Speaks other language at home, speaks English less than 'very well', 2017
pOthLangEngLtVeryWell_m: % Speaks other language at home, speaks English less than 'very well', 2017 (MOE)
last_edited_date: Last date the feature was edited by ARC

Source: U.S. Census Bureau, Atlanta Regional Commission
Date: 2013-2017
For additional information, please visit the Census ACS website.
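The prefix and suffix conventions listed above can be applied mechanically to a measure field name. The following is a minimal sketch; the function name and the regular expression are assumptions made for illustration, and the pattern relies on the observation that measure stems in this layer start with an uppercase letter. Descriptive attributes such as GEOID or Acres are not measures and should not be decoded this way.

```python
import re

# Meanings copied from the naming conventions in the description.
PREFIXES = {
    "": "Count", "p": "Percent", "r": "Rate", "m": "Median",
    "a": "Mean (average)", "t": "Aggregate (total)",
    "ch": "Change in absolute terms", "pch": "Percent change",
    "chp": "Change in percent",
}
SUFFIXES = {
    "": "Change over two periods",
    "_e": "Estimate from most recent ACS",
    "_m": "Margin of Error from most recent ACS",
    "_00": "Decennial 2000",
}

# Longer prefixes are listed first so "pch" is not read as "p" + "ch...".
FIELD_RE = re.compile(r"(pch|chp|ch|p|r|m|a|t)?([A-Z][A-Za-z0-9]*)(_e|_m|_00)?")


def decode_field(name):
    """Split a measure field name into (prefix meaning, stem, suffix meaning).

    Returns None when the name does not follow the measure pattern.
    """
    match = FIELD_RE.fullmatch(name)
    if match is None:
        return None
    prefix, stem, suffix = match.groups()
    return PREFIXES[prefix or ""], stem, SUFFIXES[suffix or ""]


print(decode_field("pSpanish_e"))
print(decode_field("Pop5P_m"))
print(decode_field("last_edited_date"))
```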
Data from: Hispanic-English Database
abacus.library.ubc.ca
iso, txt
Updated Nov 30, 2022
Cite
Abacus Data Network (2022). Hispanic-English Database [Dataset]. https://abacus.library.ubc.ca/dataset.xhtml;jsessionid=719087385d798ea8ac94b0e3997f?persistentId=hdl%3A11272.1%2FAB2%2FIIJZCH&version=&q=&fileTypeGroupFacet=%22Text%22&fileAccess=
Explore at:
Available download formats: txt(1308), iso(3087785984)
Dataset updated
Nov 30, 2022
Dataset provided by
Abacus Data Network
Description
Abstract

Introduction
Hispanic-English Database contains approximately 30 hours of English and Spanish conversational and read speech with transcripts (24 hours) and metadata collected from 22 non-native English speakers between 1996 and 1998. The corpus was developed by Entropic Research Laboratory, Inc., a developer of speech recognition and speech synthesis software toolkits that was acquired by Microsoft in 1999. Participants were adult native speakers of Spanish as spoken in Central America and South America who resided in the Palo Alto, California area, had lived in the United States for at least one year and demonstrated a basic ability to understand, read and speak English. They read a total of 2200 sentences, 50 each in Spanish and English per speaker. The Spanish sentence prompts were a subset of the materials in LATINO-40 Spanish Read News, and the English sentence prompts were taken from the TIMIT database. Conversations were task-oriented, drawing on exercises similar to those used in English second language instruction and designed to engage the speakers in collaborative, problem-solving activities.

Data
Read speech was recorded on two wideband channels with a Shure SM10A head-mounted microphone in a quiet laboratory environment. The conversational speech was simultaneously recorded on four channels, two of which were used to place phone calls to each subject in two separate offices and to record the incoming speech of the two channels into separate files. The audio was originally saved under the Entropic Audio (ESPS) format using a 16kHz sampling rate and 16 bit samples. Audio files were converted to flac compressed .wav files from the ESPS format. ESPS headers were removed and are presented in this release as *.hdr files that include demographic and technical data. Transcripts were developed with the Entropic Annotator tool and are time-aligned with speaker turns. The transcription conventions were based on those used in the LDC Switchboard and CALLHOME collections. Transcript files are denoted with a .lab extension. Data files and their corresponding label files are stored in subdirectories named using a speaker-pair id and session number. The first three letters identify the speaker on channel A. The last three letters identify the speaker on channel B. Wideband audio files contain *.wb.flac in their file name, and narrow band audio files are denoted with a *.nb.flac in the file name.
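The file-naming rules described above can be sketched as two small helpers. This is illustrative only: the function names, the example pair id "abcxyz", and the example file name are hypothetical, and the exact layout of the subdirectory names (how the session number is attached to the pair id) is not specified in the description, so the sketch takes the six-letter pair id as given.

```python
from typing import Optional, Tuple


def split_pair_id(pair_id: str) -> Tuple[str, str]:
    """Split a six-letter speaker-pair id into (channel A speaker, channel B speaker)."""
    if len(pair_id) != 6:
        raise ValueError("expected a six-letter speaker-pair id")
    return pair_id[:3], pair_id[3:]


def band_of(filename: str) -> Optional[str]:
    """Classify an audio file as wideband or narrow band from its name."""
    if filename.endswith(".wb.flac"):
        return "wideband"
    if filename.endswith(".nb.flac"):
        return "narrowband"
    return None  # e.g. .lab transcripts or .hdr header files


# Hypothetical values, for illustration only.
print(split_pair_id("abcxyz"))
print(band_of("example.wb.flac"))

# Read-speech total stated in the description: 22 speakers, 50 sentences in each of 2 languages.
print(22 * 50 * 2)
```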
Spanish Language Datasets | 1.8M+ Sentences | Translation Data | TTS |...
datarade.ai
Updated Jul 11, 2025
Cite
Oxford Languages (2025). Spanish Language Datasets | 1.8M+ Sentences | Translation Data | TTS | Dictionary Display | Translations | EU & LATAM Coverage [Dataset]. https://datarade.ai/data-products/spanish-language-datasets-1-8m-sentences-nlp-tts-dic-oxford-languages
Explore at:
Available download formats: .csv, .json, .mp3, .txt, .wav, .xls, .xml
Dataset updated
Jul 11, 2025
Dataset authored and provided by
Oxford Languages (https://www.lexico.com/)
Area covered
Panama, Ecuador, Costa Rica, Honduras, Colombia, Chile, Bolivia (Plurinational State of), Nicaragua, Paraguay, Cuba
Description
Our Spanish language datasets are carefully compiled and annotated by language and linguistic experts; you can find them available for licensing:

Spanish Monolingual Dictionary Data

Spanish Bilingual Dictionary Data

Spanish Sentences Data

Synonyms and Antonyms Data

Audio Data

Word list Data

Key Features (approximate numbers):

Spanish Monolingual Dictionary Data

Our Spanish monolingual dictionary data offers clear definitions and examples, a large volume of headwords, and comprehensive coverage of the Spanish language.

Headwords: 73,000

Senses: 123,000

Sentence examples: 104,000

Format: XML and JSON formats

Delivery: Email (link-based file sharing) and REST API

Update frequency: annually

Spanish Bilingual Dictionary Data

The bilingual data provides translations in both directions, from English to Spanish and from Spanish to English. It is reviewed and updated annually by our in-house team of language experts. It offers significant coverage of the language, providing a large volume of high-quality translated words.

Translations: 221,300

Senses: 103,500

Example sentences: 74,500

Example translations: 83,800

Format: XML and JSON formats

Delivery: Email (link-based file sharing) and REST API

Update frequency: annually

Spanish Sentences Data

Spanish sentences retrieved from the corpus are suited to NLP model training and total approximately 20 million words. The sentences cover a wide range of Spanish-speaking countries, and each is tagged with its country or dialect.

Sentences volume: 1,840,000

Format: XML and JSON formats

Delivery: Email (link-based file sharing) and REST API

Spanish Synonyms and Antonyms Data

This Spanish language dataset offers a rich collection of synonyms and antonyms, accompanied by detailed definitions and part-of-speech (POS) annotations, making it a comprehensive resource for building linguistically aware AI systems and language technologies.

Synonyms: 127,700

Antonyms: 9,500

Format: XML format

Delivery: Email (link-based file sharing)

Update frequency: annually

Spanish Audio Data (word-level)

Curated word-level audio data for the Spanish language, which covers all varieties of world Spanish, providing rich dialectal diversity in the Spanish language.

Audio files: 20,900

Format: XLSX (for index), MP3 and WAV (audio files)

Spanish Word List Data

This language data contains a carefully curated and comprehensive list of 450,000 Spanish words.

Wordforms: 450,000

Format: CSV and TXT formats

Delivery: Email (link-based file sharing)

Use Cases:

We consistently work with our clients on new use cases as language technology continues to evolve. These include NLP applications, TTS, dictionary display tools, games, translation, word embedding, and word sense disambiguation (WSD).

If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Oxford.Languages@oup.com to start the conversation.

Pricing:

Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.

Contact our team or email us at Oxford.Languages@oup.com to explore pricing options and discover how our language data can support your goals.
Spanish-English website parallel corpus
data.europa.eu
live.european-language-grid.eu
zip
Updated Dec 14, 2017
Directorate-General for Communications Networks, Content and Technology (2017). Spanish-English website parallel corpus [Dataset]. https://data.europa.eu/data/datasets/elrc_339?locale=en
Explore at:
Available download formats: zip
Dataset updated
Dec 14, 2017
Dataset authored and provided by
Directorate-General for Communications Networks, Content and Technology
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is a parallel corpus of bilingual texts crawled from multilingual websites, containing 21,007 translation units (TUs). Period of crawling: 15/11/2016 - 23/01/2017. A strict validation process was followed, which resulted in discarding:
- TUs from crawled websites that do not comply with the PSI directive,
- TUs with more than 99% misspelled tokens,
- TUs identified during the manual validation process, and all TUs from websites whose error rate in the sample extracted for manual validation is strictly above the following thresholds: 50% of TUs with language identification errors, 50% of TUs with alignment errors, 50% of TUs with tokenization errors, 20% of TUs identified as machine-translated content, 50% of TUs with translation errors.
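
The website-level rule above can be sketched as a simple threshold check. This is a minimal illustration, not the ELRC validation tooling: the names `THRESHOLDS` and `site_is_discarded` are hypothetical, error rates are assumed to be fractions between 0 and 1, and the sketch assumes that exceeding any single threshold is enough to discard a website, which the description does not state explicitly.

```python
# Thresholds taken from the description; the dictionary keys are hypothetical labels.
THRESHOLDS = {
    "language_identification": 0.50,
    "alignment": 0.50,
    "tokenization": 0.50,
    "machine_translated": 0.20,
    "translation": 0.50,
}


def site_is_discarded(sample_error_rates: dict) -> bool:
    """Return True if any sampled error rate is strictly above its threshold.

    Missing error types are treated as 0.0 (an assumption of this sketch).
    """
    return any(
        sample_error_rates.get(error_type, 0.0) > limit
        for error_type, limit in THRESHOLDS.items()
    )


# A rate exactly at the threshold is kept, because the rule says "strictly above".
print(site_is_discarded({"alignment": 0.50}))
print(site_is_discarded({"machine_translated": 0.21}))
```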

This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Europe Facility - Automated Translation (CEF.AT) actions SMART 2014/1074 and SMART 2015/1091. For further information on the project: http://lr-coordination.eu.
Code-switching in bilingual children with DLD (Gross & Castilla-Earls, 2023)...
asha.figshare.com
pdf
Updated Oct 4, 2023
Megan C. Gross; Anny Castilla-Earls (2023). Code-switching in bilingual children with DLD (Gross & Castilla-Earls, 2023) [Dataset]. http://doi.org/10.23641/asha.23479574.v1
Explore at:
Available download formats: pdf
Unique identifier
https://doi.org/10.23641/asha.23479574.v1
Dataset updated
Oct 4, 2023
Dataset provided by
American Speech–Language–Hearing Association
Authors
Megan C. Gross; Anny Castilla-Earls
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Description
Purpose: This study examined the frequency, direction, and structural characteristics of code-switching (CS) during narratives by Spanish–English bilingual children with and without developmental language disorder (DLD) to determine whether children with DLD exhibit unique features in their CS that may inform clinical decision-making.

Method: Spanish–English bilingual children, aged 4;0–6;11 (years;months), with DLD (n = 33) and with typical language development (TLD; n = 33) participated in narrative retell and story generation tasks in Spanish and English. Instances of CS were classified as between utterance or within utterance; within-utterance CS was coded for type of grammatical structure. Children completed the morphosyntax subtests of the Bilingual English-Spanish Assessment to assist in identifying DLD and to index Spanish and English morphosyntactic proficiency.

Results: In analyses examining the contributions of both DLD status and Spanish and English proficiency, the only significant effect of DLD was on the tendency to engage in between-utterance CS; children with DLD were more likely than TLD peers to produce whole utterances in English during the Spanish narrative task. Within-utterance CS was related to lower morphosyntax scores in the target language, but there was no effect of DLD. Both groups exhibited noun insertions as the most frequent type of within-utterance CS. However, children with DLD tended to exhibit more determiner and verb insertions than TLD peers and increased use of “congruent lexicalization,” that is, CS utterances that integrate content and function words from both languages.

Conclusions: These findings reinforce that use of CS, particularly within-utterance CS, is a typical bilingual behavior even during narrative samples collected in a single-language context. However, language difficulties associated with DLD may emerge in how children code-switch, including use of between-utterance CS and unique patterns during within-utterance CS. Therefore, analyzing CS patterns may contribute to a more complete profile of children’s dual-language skills during assessment.

Supplemental Material S1. Participant characteristics by site.
Supplemental Material S2. Spearman-rho correlation tables.
Supplemental Material S3. Examples of insertion types.

Gross, M. C., & Castilla-Earls, A. (2023). Code-switching during narratives by bilingual children with and without developmental language disorder. Language, Speech, and Hearing Services in Schools, 54(3), 996–1019. https://doi.org/10.1044/2023_LSHSS-22-00149
388 Hours - Spanish Speaking English Speech Data by Mobile Phone
nexdata.ai
Updated Oct 31, 2023
Nexdata (2023). 388 Hours - Spanish Speaking English Speech Data by Mobile Phone [Dataset]. https://www.nexdata.ai/datasets/speechrecog/990
Explore at:
Dataset updated
Oct 31, 2023
Dataset provided by
Nexdata
nexdata technology inc
Authors
Nexdata
Variables measured
Format, Country, Speaker, Language, Accuracy Rate, Content category, Recording device, Recording condition, Features of annotation
Description
English (Spain) scripted monologue smartphone speech dataset, collected from monologues based on given scripts and covering the generic domain, human-machine interaction, smart home commands, in-car commands, numbers, and other domains. Transcribed with text content and other attributes. The dataset was collected from a large and geographically diverse pool of speakers (891 people in total), which is intended to improve model performance on real and complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout data collection, storage, and usage; our datasets are all GDPR, CCPA, and PIPL compliant.
English-Spanish Parallel Corpus for the Medical Domain
futurebeeai.com
wav
Updated Aug 1, 2022
FutureBee AI (2022). English-Spanish Parallel Corpus for the Medical Domain [Dataset]. https://www.futurebeeai.com/dataset/parallel-corpora/spanish-english-translated-parallel-corpus-for-medical-domain
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The English-Spanish Medical Parallel Corpus is a professionally curated bilingual dataset designed to support the development of language models, translation systems, and NLP applications in the healthcare and medical sectors. With over 50,000 sentence pairs covering a wide range of medical topics, this dataset is a resource for improving multilingual AI systems in healthcare, one of the most critical domains.
Dataset Content
•Volume and Translator Diversity
•Sentence Count: 50,000+ parallel sentences
•Translator Base: Contributions from over 200 native Spanish translators with subject matter familiarity
•Data Origin: All content is purpose-built and translation-ready, developed specifically for machine learning applications
•Sentence Diversity
•Length Range: Sentences range from 7 to 25 words
•Structural Variety: Includes simple, compound, and complex sentence structures
•Form Types: Covers questions, commands, affirmations, and negations
•Voice: Balanced inclusion of both active and passive constructions
•Bi-directional Translation: Includes both English-to-Spanish and Spanish-to-English sentence sets to enhance model performance in both directions
•Domain-relevant metaphors, idioms, and phrases
•Logical flow supported by a rich use of discourse markers and connectors
Medical Domain Specifics
•Terminology Coverage
The dataset reflects real-world terminology from across the medical field, including:
•Anatomy and physiology
•Diseases and symptoms
•Diagnosis and treatment protocols
•Pharmaceutical and drug-related terminology
•Medical devices, procedures, and administrative documentation
•Real-World Contexts
This corpus features data drawn from various healthcare settings and content types such as:
•Patient-doctor dialogues and telehealth interactions
•Diagnosis summaries and treatment plans
•Clinical notes and discharge instructions
•Medical research abstracts and journal-style excerpts
•Drug descriptions, usage guidelines, and safety instructions
•Hospital policy and consent-related materials
•Informational content around wellness, supplements, and preventive care
•Cross-Domain Elements
In addition to core medical language, the dataset also includes related content from:
•Healthtech and medical devices
•Wellness and self-care
•Nutrition and lifestyle medicine
Format and Structure
•Available Formats: Delivered in Excel, with optional conversions to JSON, TMX, XML, XLIFF, or other localization-ready formats
•Fields Included:
•Serial Number
•Unique ID
•Source Sentence
•Source Word Count
•Target Sentence
•Target Word Count
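
The fields listed above map naturally onto a small record type. The sketch below is illustrative only: the class `SentencePair`, the helper `check_pair`, and the snake_case attribute names are hypothetical, and counting words by splitting on whitespace is an assumption, since the listing does not say how the word counts were computed. The example sentences are invented and are not taken from the dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SentencePair:
    """One row of the corpus, using the six fields named in the listing."""
    serial_number: int
    unique_id: str
    source_sentence: str
    source_word_count: int
    target_sentence: str
    target_word_count: int


def count_words(sentence: str) -> int:
    # Assumption: words are whitespace-separated tokens.
    return len(sentence.split())


def check_pair(pair: SentencePair) -> list:
    """Return a list of human-readable problems found in one row."""
    problems = []
    if count_words(pair.source_sentence) != pair.source_word_count:
        problems.append("source word count mismatch")
    if count_words(pair.target_sentence) != pair.target_word_count:
        problems.append("target word count mismatch")
    # The listing states that sentences range from 7 to 25 words.
    if not 7 <= pair.source_word_count <= 25:
        problems.append("source length outside 7-25 words")
    return problems


# Invented example row.
row = SentencePair(
    serial_number=1,
    unique_id="demo-0001",
    source_sentence="The patient should take one tablet every eight hours.",
    source_word_count=9,
    target_sentence="El paciente debe tomar una tableta cada ocho horas.",
    target_word_count=9,
)
print(check_pair(row))
```

A check like this is mostly useful after converting the Excel delivery to another format, where a column shift or encoding problem would show up as count mismatches.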
Applications and Use Cases
Preferred Language Spoken in California Facilities
healthdata.gov
data.chhs.ca.gov
+1more
application/rdfxml +5
Updated Apr 8, 2025
chhs.data.ca.gov (2025). Preferred Language Spoken in California Facilities [Dataset]. https://healthdata.gov/State/Preferred-Language-Spoken-in-California-Facilities/e5vw-v44b
Explore at:
Available download formats: csv, application/rssxml, json, application/rdfxml, xml, tsv
Dataset updated
Apr 8, 2025
Dataset provided by
chhs.data.ca.gov
Area covered
California
Description
The dataset contains combined counts for hospital discharges, emergency room encounters, and ambulatory surgeries by preferred language spoken at each facility. The nearly 100 languages collected in the patient-level data were combined into eight geographical or cultural groups: English Language, Spanish Language, Asian/Pacific Islander Languages, Middle Eastern Languages, European Languages, African Languages, Latin American Languages, Native American Languages, and Sign Language. See the Preferred Language Spoken Language List below to see the exact separation of languages.
Data from: English and Spanish Vowel Formants from DIAPIX-FL
data.niaid.nih.gov
research.science.eus
+2more
Updated Dec 12, 2024
Garcia Lecumberri, Maria Luisa (2024). English and Spanish Vowel Formants from DIAPIX-FL [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14411924
Explore at:
Dataset updated
Dec 12, 2024
Dataset provided by
Garcia Lecumberri, Maria Luisa
Cooke, Martin
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This data table contains acoustic parameters for each vowel (of length at least 30ms) from the DIAPIX-FL corpus of first and second language task-oriented speech spoken by Spanish and English talkers, available at https://datashare.ed.ac.uk/handle/10283/346

Each row corresponds to a single vowel and contains the following columns:

l1: L1 of the speaker (either En or Sp)

speaking: language that is being spoken (either En or Sp)

speaker: anonymised speaker identifier

frag: which DIAPIX-FL TCU this corresponds to

nwords: number of words in TCU

vowel: MRPA symbol corresponding to the vowel

word: word from which the vowel was extracted

length: duration of vowel in units of 10ms frames

en: Praat estimate of median energy

f0: estimated median fundamental frequency

f1: estimated median first formant frequency (f1)

f2: estimated median second formant frequency (f2)

f3: estimated median third formant frequency (f3)
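
As a rough illustration of how the columns above might be used, the following sketch converts the `length` column from 10 ms frames to milliseconds and averages `f1` per (L1, spoken language, vowel) group. It assumes the table is available as comma-separated text with a header row matching the column names, which the description does not confirm. The sample rows, speaker id, fragment id, vowel symbol, and all numeric values are invented placeholders, not data from DIAPIX-FL.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

FRAME_MS = 10  # the description states that length is in units of 10 ms frames

# Invented placeholder rows with the documented column names.
SAMPLE = """l1,speaking,speaker,frag,nwords,vowel,word,length,en,f0,f1,f2,f3
Sp,En,s01,t1,5,a,word1,8,60.0,120.0,300.0,2200.0,3000.0
Sp,En,s01,t1,5,a,word2,12,61.0,118.0,320.0,2100.0,2900.0
"""


def duration_ms(row: dict) -> int:
    """Convert the frame count in 'length' to milliseconds."""
    return int(row["length"]) * FRAME_MS


def mean_f1_by_group(rows: list) -> dict:
    """Average the median F1 estimates per (l1, speaking, vowel) group."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["l1"], row["speaking"], row["vowel"])].append(float(row["f1"]))
    return {key: mean(values) for key, values in groups.items()}


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print([duration_ms(r) for r in rows])
print(mean_f1_by_group(rows))
```

Grouping by both `l1` and `speaking` keeps first-language and second-language productions of the same vowel separate, which is the comparison the corpus is designed to support.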

Statista (2025). The most spoken languages worldwide 2025 [Dataset]. https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/

The most spoken languages worldwide 2025

Explore at:

419 scholarly articles cite this dataset (View in Google Scholar)

Dataset updated

Apr 14, 2025

Dataset authored and provided by

Statista (http://statista.com/)

Time period covered

2025

Area covered

World

Description

In 2025, there were around 1.53 billion people worldwide who spoke English either natively or as a second language, more than the 1.18 billion Mandarin Chinese speakers at the time of survey. Hindi and Spanish accounted for the third and fourth most widespread languages that year.

Languages in the United States

The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken in the United States vary as a result of the multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which more than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese), 1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year.

Different languages at home

The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage of population speaking a language other than English is California. About 45 percent of its population was speaking a language other than English at home in 2023.
