In 2025, there were around 1.53 billion people worldwide who spoke English either natively or as a second language, slightly more than the 1.18 billion Mandarin Chinese speakers at the time of survey. Hindi and Spanish accounted for the third and fourth most widespread languages that year. Languages in the United States The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken in the United States vary as a result of the multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which over than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese),1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year. Different languages at home The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage of population speaking a language other than English is California. About 45 percent of its population was speaking a language other than English at home in 2023.
Mexico is the country with the largest number of native Spanish speakers in the world. As of 2024, 132.5 million people in Mexico spoke Spanish with a native command of the language. Colombia was the nation with the second-highest number of native Spanish speakers, at around 52.7 million. Spain came in third, with 48 million, and Argentina fourth, with 46 million. Spanish, a world language As of 2023, Spanish ranked as the fourth most spoken language in the world, only behind English, Chinese, and Hindi, with over half a billion speakers. Spanish is the official language of over 20 countries, the majority on the American continent, nonetheless, it's also one of the official languages of Equatorial Guinea in Africa. Other countries have a strong influence, like the United States, Morocco, or Brazil, countries included in the list of non-Hispanic countries with the highest number of Spanish speakers. The second most spoken language in the U.S. In the most recent data, Spanish ranked as the language, other than English, with the highest number of speakers, with 12 times more speakers as the second place. Which comes to no surprise following the long history of migrations from Latin American countries to the Northern country. Moreover, only during the fiscal year 2022. 5 out of the top 10 countries of origin of naturalized people in the U.S. came from Spanish-speaking countries.
In 2023, around 43.37 million people in the United States spoke Spanish at home. In comparison, approximately 998,179 people were speaking Russian at home during the same year. The distribution of the U.S. population by ethnicity can be accessed here. A ranking of the most spoken languages across the world can be accessed here.
U.S. Government Workshttps://www.usa.gov/government-works
License information was derived automatically
Language questions were only asked of persons 5 years and older. The language question is about current use of a non-English language at home, not about ability to speak another language or the use of such a language in the past or elsewhere. People who speak a language other than English outside of the home are not reported as speaking a language other than English. Respondents that spoke a language other than English at home, where also asked whether they could speak English "very well" or less than "very well. See how the Census Bureau measures Language Use for more information at https://www.census.gov/topics/population/language-use/about.html.
Source: U.S. Census Bureau; 2013-2017 American Community Survey 5-Year Estimates, Table C16001.
This data set uses the 2009-2013 American Community Survey to tabulate the number of speakers of languages spoken at home and the number of speakers of each language who speak English less than very well. These tabulations are available for the following geographies: nation; each of the 50 states, plus Washington, D.C. and Puerto Rico; counties with 100,000 or more total population and 25,000 or more speakers of languages other than English and Spanish; core-based statistical areas (metropolitan statistical areas and micropolitan statistical areas) with 100,000 or more total population and 25,000 or more speakers of languages other than English and Spanish.
As of February 2025, English was the most popular language for web content, with over 49.4 percent of websites using it. Spanish ranked second, with six percent of web content, while the content in the German language followed, with 5.6 percent. English as the leading online language United States and India, the countries with the most internet users after China, are also the world's biggest English-speaking markets. The internet user base in both countries combined, as of January 2023, was over a billion individuals. This has led to most of the online information being created in English. Consequently, even those who are not native speakers may use it for convenience. Global internet usage by regions As of October 2024, the number of internet users worldwide was 5.52 billion. In the same period, Northern Europe and North America were leading in terms of internet penetration rates worldwide, with around 97 percent of its populations accessing the internet.
Table from the American Community Survey (ACS) 5-year series on languages spoken and English ability related topics for City of Seattle Council Districts, Comprehensive Plan Growth Areas and Community Reporting Areas. Table includes B16004 Age by Language Spoken at Home by Ability to Speak English, C16002 Household Language by Household Limited English-Speaking Status. Data is pulled from block group tables for the most recent ACS vintage and summarized to the neighborhoods based on block group assignment.Table created for and used in the Neighborhood Profiles application.Vintages: 2023ACS Table(s): B16004, C16002Data downloaded from: Census Bureau's Explore Census Data The United States Census Bureau's American Community Survey (ACS):About the SurveyGeography & ACSTechnical DocumentationNews & UpdatesThis ready-to-use layer can be used within ArcGIS Pro, ArcGIS Online, its configurable apps, dashboards, Story Maps, custom apps, and mobile apps. Data can also be exported for offline workflows. Please cite the Census and ACS when using this data.Data Note from the Census:Data are based on a sample and are subject to sampling variability. The degree of uncertainty for an estimate arising from sampling variability is represented through the use of a margin of error. The value shown here is the 90 percent margin of error. The margin of error can be interpreted as providing a 90 percent probability that the interval defined by the estimate minus the margin of error and the estimate plus the margin of error (the lower and upper confidence bounds) contains the true value. In addition to sampling variability, the ACS estimates are subject to nonsampling error (for a discussion of nonsampling variability, see Accuracy of the Data). The effect of nonsampling error is not represented in these tables.Data Processing Notes:Boundaries come from the US Census TIGER geodatabases, specifically, the National Sub-State Geography Database (named tlgdb_(year)_a_us_substategeo.gdb). Boundaries are updated at the same time as the data updates (annually), and the boundary vintage appropriately matches the data vintage as specified by the Census. These are Census boundaries with water and/or coastlines erased for cartographic and mapping purposes. For census tracts, the water cutouts are derived from a subset of the 2020 Areal Hydrography boundaries offered by TIGER. Water bodies and rivers which are 50 million square meters or larger (mid to large sized water bodies) are erased from the tract level boundaries, as well as additional important features. For state and county boundaries, the water and coastlines are derived from the coastlines of the 2020 500k TIGER Cartographic Boundary Shapefiles. These are erased to more accurately portray the coastlines and Great Lakes. The original AWATER and ALAND fields are still available as attributes within the data table (units are square meters). The States layer contains 52 records - all US states, Washington D.C., and Puerto RicoCensus tracts with no population that occur in areas of water, such as oceans, are removed from this data service (Census Tracts beginning with 99).Percentages and derived counts, and associated margins of error, are calculated values (that can be identified by the "_calc_" stub in the field name), and abide by the specifications defined by the American Community Survey.Field alias names were created based on the Table Shells file available from the American Community Survey Summary File Documentation page.Negative values (e.g., -4444...) have been set to null, with the exception of -5555... which has been set to zero. These negative values exist in the raw API data to indicate the following situations:The margin of error column indicates that either no sample observations or too few sample observations were available to compute a standard error and thus the margin of error. A statistical test is not appropriate.Either no sample observations or too few sample observations were available to compute an estimate, or a ratio of medians cannot be calculated because one or both of the median estimates falls in the lowest interval or upper interval of an open-ended distribution.The median falls in the lowest interval of an open-ended distribution, or in the upper interval of an open-ended distribution. A statistical test is not appropriate.The estimate is controlled. A statistical test for sampling variability is not appropriate.The data for this geographic area cannot be displayed because the number of sample cases is too small.
Languages:Percent Spanish Speakers: Basic demographics by census tracts in King County based on current American Community Survey 5 Year Average (ACS). Included demographics are: total population; foreign born; median household income; English language proficiency; languages spoken; race and ethnicity; sex; and age. Numbers and derived percentages are estimates based on the current year's ACS. GEO_ID_TRT is the key field and may be used to join to other demographic Census data tables.
Language spoken at home and the ability to speak English for the population age 5 and over as reported by the US Census Bureau's, American Community Survey (ACS) 5-year estimates table C16001.
This map shows the percent of population with a limited ability to speak English by census tract. Search to your community and investigate the top language needs in nearby census tracts.*DATA AS OF 2011-2015*Data Source: U.S. Census Bureau's American Community Survey 5-year estimates, 2011-2015, Table B16001.Complete list of all languages available in this data set (29):Spanish or Spanish Creole; French (including Patois, Cajun); French Creole; Italian; Portuguese; German; Yiddish; Greek; Russian; Polish; Serbo-Croatian; Armenian; Persian; Gujarati; Hindi; Urdu; Chinese; Japanese; Korean; Mon-Khmer, Cambodian; Hmong; Thai; Laotian; Vietnamese; Tagalog; Navajo; Hungarian; Arabic; Hebrew. Those who have limited English ability and speak other languages are included in the percentage depicted in the map, but other languages will not appear in the ranked list or in the table.Accompanying feature layer and viewing app are also available.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
BackgroundFinancial toxicity (FT) reflects multi-dimensional personal economic hardships borne by cancer patients. It is unknown whether measures of FT—to date derived largely from English-speakers—adequately capture economic experiences and financial hardships of medically underserved low English proficiency US Hispanic cancer patients. We piloted a Spanish language FT instrument in this population.MethodsWe piloted a Spanish version of the Economic Strain and Resilience in Cancer (ENRICh) FT measure using qualitative cognitive interviews and surveys in un-/under-insured or medically underserved, low English proficiency, Spanish-speaking Hispanics (UN-Spanish, n = 23) receiving ambulatory oncology care at a public healthcare safety net hospital in the Houston metropolitan area. Exploratory analyses compared ENRICh FT scores amongst the UN-Spanish group to: (1) un-/under-insured English-speaking Hispanics (UN-English, n = 23) from the same public facility and (2) insured English-speaking Hispanics (INS-English, n = 31) from an academic comprehensive cancer center. Multivariable logistic models compared the outcome of severe FT (score > 6).ResultsUN-Spanish Hispanic participants reported high acceptability of the instrument (only 0% responded that the instrument was “very difficult to answer” and 4% that it was “very difficult to understand the questions”; 8% responded that it was “very difficult to remember resources used” and 8% that it was “very difficult to remember the burdens experienced”; and 4% responded that it was “very uncomfortable to respond”). Internal consistency of the FT measure was high (Cronbach’s α = 0.906). In qualitative responses, UN-Spanish Hispanics frequently identified a total lack of credit, savings, or income and food insecurity as aspects contributing to FT. UN-Spanish and UN-English Hispanic patients were younger, had lower education and income, resided in socioeconomically deprived neighborhoods and had more advanced cancer vs. INS-English Hispanics. There was a higher likelihood of severe FT in UN-Spanish (OR = 2.73, 95% CI 0.77–9.70; p = 0.12) and UN-English (OR = 4.13, 95% CI 1.13–15.12; p = 0.03) vs. INS-English Hispanics. A higher likelihood of severely depleted FT coping resources occurred in UN-Spanish (OR = 4.00, 95% CI 1.07–14.92; p = 0.04) and UN-English (OR = 5.73, 95% CI 1.49–22.1; p = 0.01) vs. INS-English. The likelihood of FT did not differ between UN-Spanish and UN-English in both models (p = 0.59 and p = 0.62 respectively).ConclusionIn medically underserved, uninsured Hispanic patients with cancer, comprehensive Spanish-language FT assessment in low English proficiency participants was feasible, acceptable, and internally consistent. Future studies employing tailored FT assessment and intervention should encompass the key privations and hardships in this population.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This layer was developed by the Research & Analytics Group of the Atlanta Regional Commission, using data from the U.S. Census Bureau’s American Community Survey 5-year estimates for 2013-2017, to show number and percentage of U.S. population 5 years and older that speaks English less than "very well" and don’t speak English at home by Georgia House in the Atlanta region. The user should note that American Community Survey data represent estimates derived from a surveyed sample of the population, which creates some level of uncertainty, as opposed to an exact measure of the entire population (the full census count is only conducted once every 10 years and does not cover as many detailed characteristics of the population). Therefore, any measure reported by ACS should not be taken as an exact number – this is why a corresponding margin of error (MOE) is also given for ACS measures. The size of the MOE relative to its corresponding estimate value provides an indication of confidence in the accuracy of each estimate. Each MOE is expressed in the same units as its corresponding measure; for example, if the estimate value is expressed as a number, then its MOE will also be a number; if the estimate value is expressed as a percent, then its MOE will also be a percent. The user should also note that for relatively small geographic areas, such as census tracts shown here, ACS only releases combined 5-year estimates, meaning these estimates represent rolling averages of survey results that were collected over a 5-year span (in this case 2013-2017). Therefore, these data do not represent any one specific point in time or even one specific year. For geographic areas with larger populations, 3-year and 1-year estimates are also available. For further explanation of ACS estimates and margin of error, visit Census ACS website. Naming conventions: Prefixes:NoneCountpPercentrRatemMedianaMean (average)tAggregate (total)chChange in absolute terms (value in t2 - value in t1)pchPercent change ((value in t2 - value in t1) / value in t1)chpChange in percent (percent in t2 - percent in t1)Suffixes:NoneChange over two periods_eEstimate from most recent ACS_mMargin of Error from most recent ACS_00Decennial 2000 Attributes:SumLevelSummary level of geographic unit (e.g., County, Tract, NSA, NPU, DSNI, SuperDistrict, etc)GEOIDCensus tract Federal Information Processing Series (FIPS) code NAMEName of geographic unitPlanning_RegionPlanning region designation for ARC purposesAcresTotal area within the tract (in acres)SqMiTotal area within the tract (in square miles)CountyCounty identifier (combination of Federal Information Processing Series (FIPS) codes for state and county)CountyNameCounty NamePop5P_e# Population 5 years and over, 2017Pop5P_m# Population 5 years and over, 2017 (MOE)EnglishOnly_e# Speaks English only, 2017EnglishOnly_m# Speaks English only, 2017 (MOE)pEnglishOnly_e% Speaks English only, 2017pEnglishOnly_m% Speaks English only, 2017 (MOE)NotEnglish_e# Speaks language other than English at home, 2017NotEnglish_m# Speaks language other than English at home, 2017 (MOE)pNotEnglish_e% Speaks language other than English at home, 2017pNotEnglish_m% Speaks language other than English at home, 2017 (MOE)EngLtVeryWell_e# English not spoken at home, speaks English less than 'very well', 2017EngLtVeryWell_m# English not spoken at home, speaks English less than 'very well', 2017 (MOE)pEngLtVeryWell_e% English not spoken at home, speaks English less than 'very well', 2017pEngLtVeryWell_m% English not spoken at home, speaks English less than 'very well', 2017 (MOE)Spanish_e# Speaks Spanish at home, 2017Spanish_m# Speaks Spanish at home, 2017 (MOE)pSpanish_e% Speaks Spanish at home, 2017pSpanish_m% Speaks Spanish at home, 2017 (MOE)SpanishEngLtVeryWell_e# Speaks Spanish at home, speaks English less than 'very well', 2017SpanishEngLtVeryWell_m# Speaks Spanish at home, speaks English less than 'very well', 2017 (MOE)pSpanishEngLtVeryWell_e% Speaks Spanish at home, speaks English less than 'very well', 2017pSpanishEngLtVeryWell_m% Speaks Spanish at home, speaks English less than 'very well', 2017 (MOE)IndoEurNotEnglish_e# Speaks other Indo-European language at home, 2017IndoEurNotEnglish_m# Speaks other Indo-European language at home, 2017 (MOE)pIndoEurNotEnglish_e% Speaks other Indo-European language at home, 2017pIndoEurNotEnglish_m% Speaks other Indo-European language at home, 2017 (MOE)IndoEurEngLtVeryWell_e# Speaks other Indo-European language at home, speaks English less than 'very well', 2017IndoEurEngLtVeryWell_m# Speaks other Indo-European language at home, speaks English less than 'very well', 2017 (MOE)pIndoEurEngLtVeryWell_e% Speaks other Indo-European language at home, speaks English less than 'very well', 2017pIndoEurEngLtVeryWell_m% Speaks other Indo-European language at home, speaks English less than 'very well', 2017 (MOE)AsianNotEnglish_e# Speaks Asian language at home, 2017AsianNotEnglish_m# Speaks Asian language at home, 2017 (MOE)pAsianNotEnglish_e% Speaks Asian language at home, 2017pAsianNotEnglish_m% Speaks Asian language at home, 2017 (MOE)AsianEngLtVeryWell_e# Speaks Asian language at home, speaks English less than 'very well', 2017AsianEngLtVeryWell_m# Speaks Asian language at home, speaks English less than 'very well', 2017 (MOE)pAsianEngLtVeryWell_e% Speaks Asian language at home, speaks English less than 'very well', 2017pAsianEngLtVeryWell_m% Speaks Asian language at home, speaks English less than 'very well', 2017 (MOE)OthLangNotEnglish_e# Speaks other language at home, 2017OthLangNotEnglish_m# Speaks other language at home, 2017 (MOE)pOthLangNotEnglish_e% Speaks other language at home, 2017pOthLangNotEnglish_m% Speaks other language at home, 2017 (MOE)OthLangEngLtVeryWell_e# Speaks other language at home, speaks English less than 'very well', 2017OthLangEngLtVeryWell_m# Speaks other language at home, speaks English less than 'very well', 2017 (MOE)pOthLangEngLtVeryWell_e% Speaks other language at home, speaks English less than 'very well', 2017pOthLangEngLtVeryWell_m% Speaks other language at home, speaks English less than 'very well', 2017 (MOE)last_edited_dateLast date the feature was edited by ARC Source: U.S. Census Bureau, Atlanta Regional CommissionDate: 2013-2017 For additional information, please visit the Census ACS website.
AbstractIntroduction Hispanic-English Database contains approximately 30 hours of English and Spanish conversational and read speech with transcripts (24 hours) and metadata collected from 22 non-native English speakers between 1996 and 1998. The corpus was developed by Entropic Research Laboratory, Inc., a developer of speech recognition and speech synthesis software toolkits that was acquired by Microsoft in 1999. Participants were adult native speakers of Spanish as spoken in Central America and South America who resided in the Palo Alto, California area, had lived in the United States for at least one year and demonstrated a basic ability to understand, read and speak English. They read a total of 2200 sentences, 50 each in Spanish and English per speaker. The Spanish sentence prompts were a subset of the materials in LATINO-40 Spanish Read News, and the English sentence prompts were taken from the TIMIT database. Conversations were task-oriented, drawing on exercises similar to those used in English second language instruction and designed to engage the speakers in collaborative, problem-solving activities. Data Read speech was recorded on two wideband channels with a Shure SM10A head-mounted microphone in a quiet laboratory environment. The conversational speech was simultaneously recorded on four channels, two of which were used to place phone calls to each subject in two separate offices and to record the incoming speech of the two channels into separate files. The audio was originally saved under the Entropic Audio (ESPS) format using a 16kHz sampling rate and 16 bit samples. Audio files were converted to flac compressed .wav files from the ESPS format. ESPS headers were removed and are presented in this release as *.hdr files that include demographic and technical data. Transcripts were developed with the Entropic Annotator tool and are time-aligned with speaker turns. The transcription conventions were based on those used in the LDC Switchboard and CALLHOME collections. Transcript files are denoted with a .lab extension. Data files and their corresponding label files are stored in subdirectories named using a speaker-pair id and session number. The first three letters identify the speaker on channel A. The last three letters identify the speaker on channel B. Wideband audio files contain *.wb.flac in their file name, and narrow band audio files are denoted with a *.nb.flac in the file name.
Our Spanish language datasets are carefully compiled and annotated by language and linguistic experts; you can find them available for licensing:
Key Features (approximate numbers):
Our Spanish monolingual reliably offers clear definitions and examples, a large volume of headwords, and comprehensive coverage of the Spanish language.
The bilingual data provides translations in both directions, from English to Spanish and from Spanish to English. It is annually reviewed and updated by our in-house team of language experts. Offers significant coverage of the language, providing a large volume of translated words of excellent quality.
Spanish sentences retrieved from the corpus are ideal for NLP model training, presenting approximately 20 million words. The sentences provide a great coverage of Spanish-speaking countries and are accordingly tagged to a particular country or dialect.
This Spanish language dataset offers a rich collection of synonyms and antonyms, accompanied by detailed definitions and part-of-speech (POS) annotations, making it a comprehensive resource for building linguistically aware AI systems and language technologies.
Curated word-level audio data for the Spanish language, which covers all varieties of world Spanish, providing rich dialectal diversity in the Spanish language.
This language data contains a carefully curated and comprehensive list of 450,000 Spanish words.
Use Cases:
We consistently work with our clients on new use cases as language technology continues to evolve. These include NLP applications, TTS, dictionary display tools, games, translation, word embedding, and word sense disambiguation (WSD).
If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Oxford.Languages@oup.com to start the conversation.
Pricing:
Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.
Contact our team or email us at Oxford.Languages@oup.com to explore pricing options and discover how our language data can support your goals.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a parallel corpus of bilingual texts crawled from multilingual websites, which contains 21,007 TUs. Period of crawling : 15/11/2016 - 23/01/2017 A strict validation process has been followed, which resulted in discarding: - TUs from crawled websites that do not comply to the PSI directive, - TUs with more than 99% of mispelled tokens, - TUs identified during the manual validation process and all the TUs from websites which error rate in the sample extracted for manual validation is strictly above the following thresholds: 50% of TUs with language identification errors, 50% of TUs with alignment errors, 50% of TUs with tokenization errors, 20% of TUs identified as machine translated content, 50% of TUs with translation errors.
This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Europe Facility - Automated Translation (CEF.AT) actions SMART 2014/1074 and SMART 2015/1091. For further information on the project: http://lr-coordination.eu.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Purpose: This study examined the frequency, direction, and structural characteristics of code-switching (CS) during narratives by Spanish–English bilingual children with and without developmental language disorder (DLD) to determine whether children with DLD exhibit unique features in their CS that may inform clinical decision-making.Method: Spanish–English bilingual children, aged 4;0–6;11 (years;months), with DLD (n = 33) and with typical language development (TLD; n = 33) participated in narrative retell and story generation tasks in Spanish and English. Instances of CS were classified as between utterance or within utterance; within-utterance CS was coded for type of grammatical structure. Children completed the morphosyntax subtests of the Bilingual English-Spanish Assessment to assist in identifying DLD and to index Spanish and English morphosyntactic proficiency.Results: In analyses examining the contributions of both DLD status and Spanish and English proficiency, the only significant effect of DLD was on the tendency to engage in between-utterance CS; children with DLD were more likely than TLD peers to produce whole utterances in English during the Spanish narrative task. Within-utterance CS was related to lower morphosyntax scores in the target language, but there was no effect of DLD. Both groups exhibited noun insertions as the most frequent type of within-utterance CS. However, children with DLD tended to exhibit more determiner and verb insertions than TLD peers and increased use of “congruent lexicalization,” that is, CS utterances that integrate content and function words from both languages.Conclusions: These findings reinforce that use of CS, particularly within-utterance CS, is a typical bilingual behavior even during narrative samples collected in a single-language context. However, language difficulties associated with DLD may emerge in how children code-switch, including use of between-utterance CS and unique patterns during within-utterance CS. Therefore, analyzing CS patterns may contribute to a more complete profile of children’s dual-language skills during assessment.Supplemental Material S1. Participant characteristics by site.Supplemental Material S2. Spearman-rho correlation tables.Supplemental Material S3. Examples of insertion types.Gross, M. C., & Castilla-Earls, A. (2023). Code-switching during narratives by bilingual children with and without developmental language disorder. Language, Speech, and Hearing Services in Schools, 54(3), 996–1019. https://doi.org/10.1044/2023_LSHSS-22-00149
English(Spain) Scripted Monologue Smartphone speech dataset, collected from monologue based on given scripts, covering generic domain, human-machine interaction, smart home command and in-car command, numbers and other domains. Transcribed with text content and other attributes. Our dataset was collected from extensive and diversify speakers(891 people in total), geographicly speaking, enhancing model performance in real and complex tasks.Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring the maintenance of user privacy and legal rights throughout the data collection, storage, and usage processes, our datasets are all GDPR, CCPA, PIPL complied.
https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement
The English-Spanish Medical Parallel Corpus is a professionally curated bilingual dataset designed to support the development of language models, translation systems, and NLP applications in the healthcare and medical sectors. With over 50,000 sentence pairs covering a wide range of medical topics, this dataset serves as a powerful resource for improving multilingual AI systems in one of the most critical domains like healthcare.
The dataset reflects real-world terminology from across the medical field, including:
This corpus features data drawn from various healthcare settings and content types such as:
In addition to core medical language, the dataset also includes related content from:
The dataset contains combined counts for hospital discharges, emergency room encounters, and ambulatory surgeries by preferred language spoken at each facility. The nearly 100 languages collected in the patient-level data were combined into eight geographical or cultural groups: English Language, Spanish Language, Asian/Pacific Islander Languages, Middle Eastern Languages, European Languages, African Languages, Latin American Languages, Native American Languages, and Sign Language. See the Preferred Language Spoken Language List below to see the exact separation of languages.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data table contains acoustic parameters for each vowel (of length at least 30ms) from the DIAPIX-FL corpus of first and second language task-oriented speech spoken by Spanish and English talkers, available at https://datashare.ed.ac.uk/handle/10283/346
Each row corresponds to a single vowel and contains the following columns:
l1: L1 of the speaker (either En or Sp)
speaking: language that is being spoken (either En or Sp)
speaker: anonymised speaker identifier
frag: which DIAPIX-FL TCU this corresponds to
nwords: number of words in TCU
vowel: MRPA symbol corresponding to the vowel
word: word from which the vowel was extracted
length: duration of vowel in units of 10ms frames
en: Praat estimate of median energy
f0: estimated median fundamental frequency
f1: estimated median first formant frequency (f1)
f2: estimated median second formant frequency (f2)
f3: estimated median third formant frequency (f3)
In 2025, there were around 1.53 billion people worldwide who spoke English either natively or as a second language, slightly more than the 1.18 billion Mandarin Chinese speakers at the time of survey. Hindi and Spanish accounted for the third and fourth most widespread languages that year. Languages in the United States The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken in the United States vary as a result of the multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which over than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese),1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year. Different languages at home The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage of population speaking a language other than English is California. About 45 percent of its population was speaking a language other than English at home in 2023.