In 2023, around 43.37 million people in the United States spoke Spanish at home. In comparison, approximately 998,179 people were speaking Russian at home during the same year. The distribution of the U.S. population by ethnicity can be accessed here. A ranking of the most spoken languages across the world can be accessed here.
In 2025, around 1.53 billion people worldwide spoke English either natively or as a second language, slightly more than the 1.18 billion Mandarin Chinese speakers at the time of the survey. Hindi and Spanish were the third and fourth most widely spoken languages that year.

Languages in the United States
The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken in the country vary as a result of its multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which more than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese), 1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year.

Different languages at home
The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage of its population speaking a language other than English is California, where about 45 percent of the population spoke a language other than English at home in 2023.
https://qdr.syr.edu/policies/qdr-standard-access-conditions
Project Summary
The data for this study came from a more extensive mixed-methods study that focused on understanding educators’ perspectives on translanguaging and its use in bilingual classrooms across the United States. The overall purpose of this study was to determine how bilingual educators make sense of translanguaging, and how, if at all, they use it in their teaching practice. This study helped enhance understanding about how languages are used in bilingual programs and for what purposes. The survey helped to collect information from larger samples of educators across the country to elicit information about attitudes that may otherwise have been difficult to measure using observation. Furthermore, it also allowed educators to share their understanding and attitudes towards a relatively debated language teaching technique anonymously.

Data Description and Collection Overview
The study used an online survey to inquire about educators’ self-reported understanding and practices about translanguaging. (The authors intentionally did not define the critical term, since this was one of the questions and a focus of the survey itself.) The study targeted educators of emergent bilingual services (EBS) across the United States and was administered in 2019. Any educator working in a K-12 language program met the survey inclusion criteria. This included bilingual programs, English as a second language, or another language program in a K-12 school. This meant that participants could be classroom teachers, language or other specialists, school/district administrators, educational consultants, staff from institutes of higher learning, and other potential stakeholders involved or interested in bilingual education in K-12 schools.
Survey participants were recruited through convenience sampling, given its cost-effectiveness and the easy access to educators through the networks of the two organizations that sponsored the project: WIDA at University of Wisconsin–Madison and the Center for Applied Linguistics (CAL). The survey was shared through the websites of the two organizations over a period of eight months between March 25 and November 24, 2019. Participants were recruited through flyers at conferences, notices posted on the websites of organizations engaged in research related to bilingual education, and mailings to members of those organizations. The availability of the survey was further disseminated through a group of organizations dedicated to dual language and bilingual education in the U.S., called the National Dual Language Forum. This method was used to reach as many participants as possible and did not include purposeful sampling. A summary of participant demographics can be found in the Visualizations file included in the deposit. While most of the respondents taught in the United States, 36 respondents reported working in international schools outside of the United States. A total of 972 responses were collected, of which 447 were complete responses. The survey included a total of 30 Likert scale items with a five-point scale, organized into sections related to the reasons why EBs switch languages (4 items), educators’ beliefs about the appropriateness of translanguaging (3 items), educators’ perceptions of the benefits and limitations of translanguaging (8 items), and educators’ reported classroom practices (15 items).
The survey also included two open-ended questions: one at the beginning of the survey asking the participants to define translanguaging in their own words and one at the end of the survey asking them to share one or two activities they do with their students in which they use translanguaging. At the end of the survey, there were demographic questions about educators’ roles in their schools, the types of programs in which they worked, the grade levels they served, their years of experience in education, and their language abilities. The survey also asked for information about where the educators worked and how they heard about the survey. The Likert scale survey items were based on the literature on practices and understanding of translanguaging. The survey took 20-25 minutes to complete, and participants could take the survey anywhere and at their own convenience. No personally identifiable information was collected. The responses from the surveys were compiled and analyzed to identify frequency of responses and themes.

Selection and Organization of Shared Data
The informed consent, full survey questionnaire, and this data narrative serve as documentation files for this project. Additionally, all survey responses and visualizations from them are included as data files. The csv file containing the survey responses depicts the raw data. It contains both complete and incomplete responses and would best be analyzed side by side with the questionnaire. For example, column “BG” (Q19), corresponds to the question...
In 2019, about 12.08 million children in the United States spoke a language other than English at home. This number is fairly consistent with the previous year, when 12.13 million children spoke another language at home.
This dataset is a list of Department of Rehabilitation (DOR) offices and includes contact information, addresses, and languages spoken in each office. Note: In addition to the languages listed, the DOR has various bilingual language resources available in each office that allow us to serve members of the public who may speak a language other than English.
Dataset Title: A Gold Standard Corpus for Activity Information (GoSCAI)
Dataset Curators: The Epidemiology & Biostatistics Section of the NIH Clinical Center Rehabilitation Medicine Department
Dataset Version: 1.0 (May 16, 2025)
Dataset Citation and DOI: NIH CC RMD Epidemiology & Biostatistics Section. (2025). A Gold Standard Corpus for Activity Information (GoSCAI) [Data set]. Zenodo. doi: 10.5281/zenodo.15528545
This data statement is for a gold standard corpus of de-identified clinical notes that have been annotated for human functioning information based on the framework of the WHO's International Classification of Functioning, Disability and Health (ICF). The corpus includes 484 notes from a single institution within the United States written in English in a clinical setting. This dataset was curated for the purpose of training natural language processing models to automatically identify, extract, and classify information on human functioning at the whole-person, or activity, level.
This dataset is curated to be a publicly available resource for the development and evaluation of methods for the automatic extraction and classification of activity-level functioning information as defined in the ICF. The goals of data curation are to 1) create a corpus of a size that can be manually deidentified and annotated, 2) maximize the density and diversity of functioning information of interest, and 3) allow public dissemination of the data.
Language Region: en-US
Prose Description: English as written by native and bilingual English speakers in a clinical setting
The language users represented in this dataset are medical and clinical professionals who work in a research hospital setting. These individuals hold professional degrees corresponding to their respective specialties. Specific demographic characteristics of the language users such as age, gender, or race/ethnicity were not collected.
The annotator group consisted of five people, 33 to 76 years old, including four females and one male. Socioeconomically, they came from the middle and upper-middle income classes. Regarding first language, three annotators had English as their first language, one had Chinese, and one had Spanish. Proficiency in English, the language of the data being annotated, was native for three of the annotators and bilingual for the other two. The annotation team included clinical rehabilitation domain experts with backgrounds in occupational therapy and physical therapy, as well as individuals with public health and data science expertise. Prior to annotation, all annotators were trained on the specific annotation process using established guidelines for the given domain, and annotators were required to achieve a specified proficiency level prior to annotating notes in this corpus.
The notes in the dataset were written as part of clinical care within a U.S. research hospital between May 2008 and November 2019. These notes were written by health professionals asynchronously following the patient encounter to document the interaction and support continuity of care. The intended audience of these notes was clinicians involved in the patients' care. The included notes come from nine disciplines: neuropsychology, occupational therapy, physical medicine (physiatry), physical therapy, psychiatry, recreational therapy, social work, speech language pathology, and vocational rehabilitation. The notes were curated to support research on natural language processing for functioning information between 2018 and 2024.
The final corpus was derived from a set of clinical notes extracted from the hospital electronic medical record (EMR) for the purpose of clinical research. The original data consist of character-based digital content. We work in ASCII 8 or Unicode encoding, and therefore part of our pre-processing includes running encoding detection and transforming text from encodings such as Windows-1252 or ISO-8859 to our preferred format.
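The encoding step described above can be illustrated with a minimal sketch. The data statement does not name the detection tool that was used, so the try-in-order strategy, the function name `decode_note`, and the candidate list below are assumptions made for illustration, not the curators' actual pipeline.

```python
# Hypothetical helper: decode raw note bytes by trying candidate encodings
# in order. This is a simple fallback chain, not statistical detection.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252", "latin-1")  # assumed order


def decode_note(raw: bytes) -> str:
    """Return the note as a Python (Unicode) string."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            # This encoding cannot represent the byte sequence; try the next.
            continue
    # latin-1 maps every byte to a character, so this line is not expected
    # to be reached with the default candidate list.
    raise ValueError("could not decode note with any candidate encoding")


def to_utf8_bytes(raw: bytes) -> bytes:
    """Re-encode a note as UTF-8 regardless of its original encoding."""
    return decode_note(raw).encode("utf-8")
```

Strict UTF-8 is tried first because text in a single-byte encoding with accented or typographic characters usually fails UTF-8 validation, which lets the chain fall through to Windows-1252 (`cp1252`) and then ISO-8859-1 (`latin-1`).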
On the larger corpus, we applied sampling to match our curation rationale. Given the resource constraints of manual annotation, we set out to create a dataset of 500 clinical notes, which would exclude notes over 10,000 characters in length.
To promote density and diversity, we used five note characteristics as sampling criteria. The first was text length, expressed in number of characters. The second was discipline group, which is derived from note type metadata and describes which discipline a note originated from: occupational and vocational therapy (OT/VOC), physical therapy (PT), recreation therapy (RT), speech and language pathology (SLP), social work (SW), or miscellaneous (MISC, including psychiatry, neurology and physiatry). These disciplines were selected for collecting the larger corpus because their notes are likely to include functioning information. Existing information extraction tools were used to obtain annotation counts in four areas of functioning, which provided the remaining three characteristics: a note’s annotation count, annotation density (annotation count divided by text length), and domain count (number of domains with at least 1 annotation).
We used stratified sampling across the 6 discipline groups to ensure discipline diversity in the corpus. Because of low availability, 50 notes were sampled from SLP with relaxed criteria, and 90 notes each from the 5 other discipline groups with stricter criteria. Sampled SLP notes were those with the highest annotation density that had an annotation count of at least 5 and a domain count of at least 2. Other notes were sampled by highest annotation count and lowest text length, with a minimum annotation count of 15 and minimum domain count of 3.
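The sampling rules above can be sketched roughly as follows. The `Note` record, the function name `sample_group`, and the exact tie-breaking order (annotation count first, then text length) are assumptions made for illustration; the data statement does not say how the two ranking criteria were combined.

```python
from dataclasses import dataclass

MAX_TEXT_LENGTH = 10_000  # notes over 10,000 characters were excluded


@dataclass
class Note:
    """Hypothetical record holding the sampling characteristics of a note."""
    note_id: str
    group: str             # e.g. "OT/VOC", "PT", "RT", "SLP", "SW", "MISC"
    text_length: int       # number of characters, assumed to be > 0
    annotation_count: int
    domain_count: int

    @property
    def density(self) -> float:
        # Annotation density: annotation count divided by text length.
        return self.annotation_count / self.text_length


def sample_group(notes: list[Note], group: str) -> list[Note]:
    """Select notes for one discipline group using the stated thresholds."""
    pool = [n for n in notes
            if n.group == group and n.text_length <= MAX_TEXT_LENGTH]
    if group == "SLP":
        # Relaxed criteria: rank by density, keep the top 50.
        eligible = [n for n in pool
                    if n.annotation_count >= 5 and n.domain_count >= 2]
        eligible.sort(key=lambda n: n.density, reverse=True)
        return eligible[:50]
    # Stricter criteria: highest annotation count, then lowest text length
    # (assumed ordering), keep the top 90.
    eligible = [n for n in pool
                if n.annotation_count >= 15 and n.domain_count >= 3]
    eligible.sort(key=lambda n: (-n.annotation_count, n.text_length))
    return eligible[:90]
```

With 50 SLP notes and 90 notes from each of the other five groups, this scheme targets 500 notes in total.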
The notes in the resulting sample included certain types of PHI and PII. To prepare for public dissemination, all sensitive or potentially identifying information was manually annotated in the notes and replaced with substituted content to preserve readability and the context needed for machine learning without exposing any sensitive information. This de-identification effort was manually reviewed to ensure no PII or PHI exposure and to correct any resulting readability issues. Notes about pediatric patients were excluded. No attempt was made to sample multiple notes from the same patient. No metadata is provided to group notes other than by note type, discipline, or discipline group. The dataset is not organized beyond the provided metadata, but publications about models trained on this dataset should include information on the train/test splits used.
All notes were sentence-segmented and tokenized using the spaCy en_core_web_lg model with additional rules for sentence segmentation customized to the dataset. Notes are stored in an XML format readable by the GATE annotation software (https://gate.ac.uk/family/developer.html), which stores annotations separately in annotation sets.
As the clinical notes were extracted directly from the EMR in text format, the capture quality was determined to be high. The clinical notes did not have to be converted from other data formats, which means this dataset is free from noise introduced by conversion processes such as optical character recognition.
Because of the effort required to manually deidentify and annotate notes, this corpus is limited in terms of size and representation. The curation decisions skewed note selection towards specific disciplines and note types to increase the likelihood of encountering information on functioning. Some subtypes of functioning occur infrequently in the data, or not at all. The deidentification of notes was done in a manner to preserve natural language as it would occur in the notes, but some information is lost, e.g. on rare diseases.
Information on the manual annotation process is provided in the annotation guidelines for each of the four domains:
- Communication & Cognition (https://zenodo.org/records/13910167)
- Mobility (https://zenodo.org/records/11074838)
- Self-Care & Domestic Life (SCDL) (https://zenodo.org/records/11210183)
- Interpersonal Interactions & Relationships (IPIR) (https://zenodo.org/records/13774684)
Inter-annotator agreement was established on development datasets described in the annotation guidelines prior to the annotation of this gold standard corpus.
The gold standard corpus consists of 484 documents, which include 35,147 sentences in total. The distribution of annotated information is provided in the table below.
Domain | Number of Annotated Sentences | % of All Sentences | Mean Number of Annotated Sentences per Document
Communication & Cognition | 6033 | 17.2% |
This layer shows language or language groups spoken at home by English ability. This is shown by tract, county, and state centroids. This service is updated annually to contain the most currently released American Community Survey (ACS) 5-year data, and contains estimates and margins of error. There are also additional calculated attributes related to this topic, which can be mapped or used within analysis. This layer is symbolized to show the count and percent of individuals age 5+ who are bilingual in English and another language (speak English very well and speak another language at home). To see the full list of attributes available in this service, go to the "Data" tab, and choose "Fields" at the top right.

Current Vintage: 2019-2023
ACS Table(s): C16001
Data downloaded from: Census Bureau's API for American Community Survey
Date of API call: December 12, 2024
National Figures: data.census.gov

The United States Census Bureau's American Community Survey (ACS): About the Survey, Geography & ACS, Technical Documentation, News & Updates

This ready-to-use layer can be used within ArcGIS Pro, ArcGIS Online, its configurable apps, dashboards, Story Maps, custom apps, and mobile apps. Data can also be exported for offline workflows. For more information about ACS layers, visit the FAQ. Please cite the Census and ACS when using this data.

Data Note from the Census: Data are based on a sample and are subject to sampling variability. The degree of uncertainty for an estimate arising from sampling variability is represented through the use of a margin of error. The value shown here is the 90 percent margin of error. The margin of error can be interpreted as providing a 90 percent probability that the interval defined by the estimate minus the margin of error and the estimate plus the margin of error (the lower and upper confidence bounds) contains the true value.
In addition to sampling variability, the ACS estimates are subject to nonsampling error (for a discussion of nonsampling variability, see Accuracy of the Data). The effect of nonsampling error is not represented in these tables.

Data Processing Notes:
- This layer is updated automatically when the most current vintage of ACS data is released each year, usually in December. The layer always contains the latest available ACS 5-year estimates. It is updated annually within days of the Census Bureau's release schedule. Click here to learn more about ACS data releases.
- Boundaries come from the US Census TIGER geodatabases, specifically, the National Sub-State Geography Database (named tlgdb_(year)_a_us_substategeo.gdb). Boundaries are updated at the same time as the data updates (annually), and the boundary vintage appropriately matches the data vintage as specified by the Census. These are Census boundaries with water and/or coastlines erased for cartographic and mapping purposes. For census tracts, the water cutouts are derived from a subset of the 2020 Areal Hydrography boundaries offered by TIGER. Water bodies and rivers which are 50 million square meters or larger (mid to large sized water bodies) are erased from the tract level boundaries, as well as additional important features. For state and county boundaries, the water and coastlines are derived from the coastlines of the 2023 500k TIGER Cartographic Boundary Shapefiles. These are erased to more accurately portray the coastlines and Great Lakes. The original AWATER and ALAND fields are still available as attributes within the data table (units are square meters).
- The States layer contains 52 records: all US states, Washington D.C., and Puerto Rico.
- Census tracts with no population that occur in areas of water, such as oceans, are removed from this data service (Census Tracts beginning with 99).
- Percentages and derived counts, and associated margins of error, are calculated values (that can be identified by the "_calc_" stub in the field name), and abide by the specifications defined by the American Community Survey.
- Field alias names were created based on the Table Shells file available from the American Community Survey Summary File Documentation page.
- Negative values (e.g., -4444...) have been set to null, with the exception of -5555... which has been set to zero. These negative values exist in the raw API data to indicate the following situations:
  - The margin of error column indicates that either no sample observations or too few sample observations were available to compute a standard error and thus the margin of error. A statistical test is not appropriate.
  - Either no sample observations or too few sample observations were available to compute an estimate, or a ratio of medians cannot be calculated because one or both of the median estimates falls in the lowest interval or upper interval of an open-ended distribution.
  - The median falls in the lowest interval of an open-ended distribution, or in the upper interval of an open-ended distribution. A statistical test is not appropriate.
  - The estimate is controlled. A statistical test for sampling variability is not appropriate.
  - The data for this geographic area cannot be displayed because the number of sample cases is too small.
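As a small illustration of the Data Note from the Census, the sketch below turns an estimate and its 90 percent margin of error into lower and upper confidence bounds. The function names are hypothetical, and the 1.645 factor used to recover a standard error comes from the Census Bureau's general ACS guidance rather than from this layer's description. Null inputs are passed through because the layer sets some margins of error to null.

```python
from typing import Optional, Tuple

Z_90 = 1.645  # factor for a 90 percent margin of error (ACS guidance)


def confidence_bounds(estimate: Optional[float],
                      moe: Optional[float]) -> Optional[Tuple[float, float]]:
    """Return (lower, upper) bounds, or None when either input is null."""
    if estimate is None or moe is None:
        return None
    return (estimate - moe, estimate + moe)


def standard_error(moe: float) -> float:
    """Convert a 90 percent margin of error to a standard error."""
    return moe / Z_90
```

For example, an estimate of 1,000 bilingual individuals with a margin of error of 150 gives bounds of 850 and 1,150.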
Attribution 4.0 (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This paper describes the simultaneous co-development of Oral Health Behavior Social Support (OHBSS) scales in English and Spanish. OHBSS scales assess social support for toothbrushing, flossing, and dental care utilization, which are targets for interpersonal-level interventions to promote oral health among Hispanic/Latino adults. The focus was on Mexican-origin adults, who comprise the largest United States Hispanic/Latino subgroup and experience a high oral disease burden. All participants self-identified as Mexican-origin adults (ages 21–40 years old), living along the California-Arizona-Mexico border. Independent samples were recruited for each study in partnership with Federally Qualified Health Centers. First, we conducted semi-structured interviews about social support for oral health behaviors from August to November 2018 (Study 1, N = 72). Interviews were audio recorded and transcribed (in the original language, Spanish or English), and qualitative data were coded and analyzed in Dedoose following three topical codebooks; excerpts were used to co-create a large bilingual item bank (OHBSSv1). The item bank was pre-tested via 39 cognitive interviews between December 2019 and March 2020, reviewed by an expert panel with several bilingual members, reduced to 107 Spanish/109 English items (OHBSSv2), then pilot tested from January to December 2021 (Study 2, N = 309). Pilot survey data were analyzed through Exploratory Factor Analysis and Horn’s parallel analysis, overall and by language, to examine response patterns and inform item selection (OHBSSv3). The scales queried social support for toothbrushing, flossing, and dental care utilization across 39 items from three sources (family, health providers, others/friends), plus up to nine optional dental care-related items (Study 3, conducted April 2022 to February 2023, N = 502). Confirmatory Factor Analysis (CFA) assessed model fit, overall and by language (multiple group CFA).
Final OHBSS scales include 37 items, plus seven optional items. Acceptable model fit for three-factor structures for each oral health behavior was found, providing evidence of the scales’ construct validity. Cronbach’s alphas and McDonald’s omegas were tabulated; all were above 0.95, overall and by language, supporting scales’ internal consistency.
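For readers unfamiliar with the internal-consistency statistic mentioned above, the following is a minimal sketch of the standard Cronbach's alpha formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The function name and the toy data layout (one row per respondent, one column per item) are assumptions for illustration; this is not the authors' analysis code and it does not cover McDonald's omega.

```python
from statistics import variance


def cronbach_alpha(responses: list[list[float]]) -> float:
    """Cronbach's alpha for rows of respondents and columns of items.

    Uses sample variances. Requires at least two items, at least two
    respondents, and a non-zero variance of the total scores.
    """
    k = len(responses[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item across respondents.
    item_variances = [variance([row[i] for row in responses])
                      for i in range(k)]
    # Variance of each respondent's summed score.
    total_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)
```

When every item ranks respondents identically, alpha reaches 1; values above 0.95, as reported for the OHBSS scales, indicate very high agreement among items.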