Languages are an important part of daily life in the USA. Here is a table that shows the most common languages spoken in the USA, as well as a big spreadsheet which shows each CBSA (Core-Based Statistical Area, or urban area).
Language usage varies widely throughout the United States. According to the latest census data, over 350 different languages are represented in homes across the country. The following table and spreadsheet provide more detailed information on language usage throughout the various states and cities in the US:
Columns:
- index: Index column for dataframe
- Table with column headers in row 5 and row headers in column A: Contains language data for each CBSA (Core Based Statistical Area)
- Unnamed: 1: Rank of CBSA by total number of speakers of all languages
- Unnamed: 2: Name of CBSA
- Unnamed: 3: Population of CBSA
- Unnamed: 4: Percent of population that speaks English very well
- Unnamed: 5 through Unnamed: 58: Languages spoken by at least 0.1% of the population, with corresponding percentages
This dataset was created by Gary Hoover. The data was sourced from https://www.kaggle.com/garyhoov/us-languages
Unknown License - Please check the dataset description for more information.
File: Languages Spoken at Home by Urban Area = CBSA.csv
File: US Languages Spoken at Home 2014.csv

| Column name | Description |
|:---|:---|
| Table with column headers in row 5 and row headers in column A | |
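Because the column headers sit in row 5 rather than row 1, a naive CSV read would mislabel every column. The following is a minimal sketch of one way to handle that with Python's standard library. The function name `read_table_with_late_header`, the sample text, and the assumption that the rows above the header are free-form preamble are illustrative only and are not taken from the dataset itself.

```python
import csv
import io

def read_table_with_late_header(text, header_row=5):
    """Parse CSV text whose column headers sit on `header_row` (1-indexed)."""
    rows = list(csv.reader(io.StringIO(text)))
    header = rows[header_row - 1]
    records = []
    for row in rows[header_row:]:
        if not any(cell.strip() for cell in row):
            continue  # skip blank spacer rows
        records.append(dict(zip(header, row)))
    return records

# Made-up sample: four preamble rows, the header on row 5, then data rows.
SAMPLE = """Title line,,,
,,,
Source note,,,
,,,
Rank,CBSA,Population,Pct English very well
1,Example Metro A,1000000,91.5
2,Example Metro B,500000,88.0
"""

records = read_table_with_late_header(SAMPLE)
for record in records:
    print(record["Rank"], record["CBSA"], record["Population"])
```

All values come back as strings, so numeric columns would still need converting before any analysis.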
Background
In the US, people who don’t speak English well often have a lower quality of life than those who do [1]. They may also have limited access to health care, including mental health services, and may not be able to take part in key national health surveys like the Behavioral Risk Factor Surveillance System (BRFSS). Communities where many people have limited English skills tend to live closer to toxic chemicals. Limited English skills can also make it harder for community members to get involved in local decision-making, which can affect environmental policies and lead to health inequalities.

Data Source
Washington Office of the Superintendent of Public Instruction (OSPI) | Public Records Center

Methodology
The data was collected through a public records request from the OSPI data portal. It shows what languages students speak at home, organized by school district. OSPI collects and reports data by academic year. For example, the 2023 data comes from the 2022-2023 school year (August 1, 2022 to May 31, 2023). OSPI updates this information regularly.

Caveats
These figures only include households with children enrolled in public schools from pre-K through 12th grade. The data may change over time as new information becomes available.

Source
1. Shariff-Marco, S., Gee, G. C., Breen, N., Willis, G., Reeve, B. B., Grant, D., Ponce, N. A., Krieger, N., Landrine, H., Williams, D. R., Alegria, M., Mays, V. M., Johnson, T. P., & Brown, E. R. (2009). A mixed-methods approach to developing a self-reported racial/ethnic discrimination measure for use in multiethnic health surveys. Ethnicity & Disease, 19(4), 447–453.

Citation
Washington Tracking Network, Washington State Department of Health. Languages Spoken at Home. Data from the Washington Office of Superintendent of Public Instruction (OSPI). Published January 2026. Web.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the US English General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of English speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world US English communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade English speech models that understand and respond to authentic American accents and dialects.
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of US English. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
The dataset comes with granular metadata for both speakers and recordings:
Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
This dataset is a versatile resource for multiple English speech and language AI applications:
In 2025, there were around 1.53 billion people worldwide who spoke English either natively or as a second language, slightly more than the 1.18 billion Mandarin Chinese speakers at the time of the survey. Hindi and Spanish were the third and fourth most widespread languages that year.

Languages in the United States
The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken in the United States vary as a result of the multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which more than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese), 1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year.

Different languages at home
The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage of its population speaking a language other than English is California, where about 45 percent of the population spoke a language other than English at home in 2023.
US Census American Community Survey Custom Tabulation (ST542) by Census Tract. Language spoken at home for population 5 years and over by ability to speak English, summarized by census tract for 114 languages spoken across LA County, 5-year estimates 2019-2023.

See also source data tables:
- Census Tracts: Language Spoken at Home LA County Census Tracts
- LA County: Language Spoken at Home LA County

Headings:
- GEOID: Geography identification
- CT20: Census tract (2020)
- Name: Census tract name
- CSA: Countywide Statistical Area (city or community)
- SPA: Service Planning Area
- SD: Supervisorial District
- total_pop: Population over 5 years old in census tract (universe)
- total_limited_eng: Population that speaks English less than "very well"
- total_limited_eng_pct: Percent of population that speaks English less than "very well"
The Delaware Valley Regional Planning Commission (DVRPC) is committed to upholding the principles and intentions of the 1964 Civil Rights Act and related nondiscrimination statutes in all of the Commission’s work, including publications, products, communications, public input, and decision-making processes. Language barriers may prevent people with Limited English Proficiency (also known as LEP persons) from obtaining services or information, or from participating in public planning processes. To better identify LEP populations and thoroughly evaluate the Commission’s efforts to provide meaningful access, DVRPC has produced this Limited-English Proficiency Plan. This is the data that was used to make the maps for the upcoming plan.

Public Use Microdata Areas (PUMAs) are geographies of at least 100,000 people that are nested within states or equivalent entities. States are able to delineate PUMAs within their borders, or use PUMA Criteria provided by the Census Bureau.

Census table used: 2019-2023 American Community Survey (ACS) 5-Year Estimates, Table B16001: Language Spoken at Home by Ability to Speak English for the Population 5 Years and Over. ACS data are derived from a survey and are subject to sampling variability.
Vietnamese
Source of PUMA boundaries: US Census Bureau, TIGER/Line Files.
Please refer to U:_OngoingProjects\LEP\ACS_5YR_B16001_PUMAs_metadata.xlsx for the full attribute lookup and the fields used in making the DVRPC LEP Map Series. Please contact Chris Pollard (cpollard@dvrpc.org) should you have any questions about this dataset.
In 2022, about 21.4 percent of schoolchildren in the United States spoke a language other than English at home. This is a slight increase from 2021, when 21.3 percent of U.S. schoolchildren did not speak English at home.
This dataset contains estimates of the number of residents aged 5 years or older in Chicago who “speak English less than very well,” by the non-English language spoken at home and community area of residence, for the years 2008 – 2012. See the full dataset description for more information at: https://data.cityofchicago.org/api/views/fpup-mc9v/files/dK6ZKRQZJ7XEugvUavf5MNrGNW11AjdWw0vkpj9EGjg?download=true&filename=P:\EPI\OEPHI\MATERIALS\REFERENCES\ECONOMIC_INDICATORS\Dataset_Description_Languages_2012_FOR_PORTAL_ONLY.pdf
This map shows the predominant language(s) spoken by people who have limited English speaking ability. This is shown using American Community Survey data from the US Census Bureau by state, county, and tract.

There are 12 different language/language groupings:
- Spanish
- French, Haitian, or Cajun
- Korean
- Chinese (including Mandarin and Cantonese)
- Vietnamese
- Tagalog (including Filipino)
- Arabic
- German or other West Germanic
- Russian, Polish, or other Slavic
- Other Indo-European (such as Italian or Portuguese)
- Other Asian and Pacific Island (such as Japanese or Hmong)
- Other and unspecified (such as Navajo or Hebrew)

This map also uses a feature effect to identify the counties with either 10,000 or 5% of the population having limited English ability. According to the Voting Rights Act, "localities where there are more than 10,000 or over 5 percent of the total voting age citizens in a single political subdivision (usually a county, but a township or municipality in some states) who are members of a single language minority group, have depressed literacy rates, and do not speak English very well" are required to "provide [voting materials] in the language of the applicable minority group as well as in the English language".

This map uses these hosted feature layers containing the most recent American Community Survey data. These layers are part of ArcGIS Living Atlas, and are updated every year when the American Community Survey releases new estimates, so values in the map always reflect the newest data available.
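The "10,000 or 5%" filter described above amounts to a simple either/or test on each county. The sketch below shows that test in Python. The function name `exceeds_threshold` and its parameters are hypothetical, and the check covers only the map's simplified count-or-share rule, not the full Voting Rights Act determination (which also considers voting-age citizenship, a single language minority group, and literacy rates).

```python
def exceeds_threshold(limited_english_pop, total_pop,
                      count_cutoff=10_000, share_cutoff=0.05):
    """Return True if the count OR the share exceeds its cutoff.

    Simplified illustration of the map's filter; not a legal test.
    """
    if total_pop <= 0:
        return False  # no population, nothing to flag
    share = limited_english_pop / total_pop
    return limited_english_pop > count_cutoff or share > share_cutoff

# Hypothetical counties: (limited-English population, total population)
examples = {
    "large county, small share": (12_000, 1_000_000),
    "small county, large share": (600, 10_000),
    "below both cutoffs": (400, 10_000),
}
for name, (limited, total) in examples.items():
    print(name, exceeds_threshold(limited, total))
```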
This map shows the predominant language spoken at home by the US population aged 5+. This is shown by Census Tract and County centroids. The data values are from the 2012-2016 American Community Survey 5-year estimates in the S1601 Table for Language Spoken at Home. The popup in the map provides a breakdown of the population age 5+ by the language spoken at home. Data values for other age groups are also available within the data's table.

The color of the symbols represents the most common language spoken at home. This predominance map style compares the count of people age 5+ based on what language is spoken at home, and returns the value with the highest count. The census breaks down the population 5+ by the following language options:
- English Only
- Non-English - Spanish
- Non-English - Asian and Pacific Islander Languages
- Non-English - Indo European Languages
- Non-English - Other

The size of the symbols represents how many people are 5 years or older, which helps highlight the quantity of people that live within an area that were sampled for this language categorization. The strength of the color represents how predominant a language is within an area. If the symbol is a strong color, it makes up a larger portion of the population. This map is designed for a dark basemap such as the Human Geography Basemap or the Dark Gray Canvas Basemap.

See the web map to see the pattern at both the county and tract level. This map helps to show the most common language spoken at home at both a regional and local level. The tract pattern shows how distinct neighborhoods are clustered by which language they speak. The county pattern shows how language is used throughout the country. This pattern is shown by census tracts at large scales, and counties at smaller scales.

This data was downloaded from the United States Census Bureau American Fact Finder on January 16, 2018. It was then joined with 2016 vintage centroid points and hosted to ArcGIS Online and into the Living Atlas.

Nationally, the breakdown of language spoken at home for the population 5+ is as follows:

| | Total Estimate | Margin of Error | Percent Estimate | Margin of Error |
|:---|:---|:---|:---|:---|
| Population 5 years and over | 298,691,202 | +/-3,594 | (X) | (X) |
| Speak only English | 235,519,143 | +/-154,409 | 78.90% | +/-0.1 |
| Spanish | 39,145,066 | +/-94,571 | 13.10% | +/-0.1 |
| Asian and Pacific Island languages | 10,172,370 | +/-22,561 | 3.40% | +/-0.1 |
| Other Indo-European languages | 10,827,536 | +/-46,335 | 3.60% | +/-0.1 |
| Other languages | 3,027,087 | +/-23,302 | 1.00% | +/-0.1 |
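The predominance rule described for this map (pick the category with the highest count, then use its share of the population to set color strength) can be sketched in a few lines of Python. The counts are the national estimates quoted above. The function name `predominant_category` is hypothetical, and this is an illustration of the idea rather than the map's actual ArcGIS implementation.

```python
# National 2012-2016 ACS estimates quoted in the text (population age 5+).
COUNTS = {
    "Speak only English": 235_519_143,
    "Spanish": 39_145_066,
    "Asian and Pacific Island languages": 10_172_370,
    "Other Indo-European languages": 10_827_536,
    "Other languages": 3_027_087,
}

def predominant_category(counts):
    """Return (category, share) for the category with the highest count."""
    total = sum(counts.values())
    if total == 0:
        return None, 0.0
    winner = max(counts, key=counts.get)
    return winner, counts[winner] / total

category, share = predominant_category(COUNTS)
print(f"{category}: {share:.1%}")
```

Applied per tract or county instead of nationally, the returned category would drive the symbol color and the share would drive its strength.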
https://www.futurebeeai.com/policies/ai-data-license-agreement
This US English Call Center Speech Dataset for the Telecom industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for English-speaking telecom customers. Featuring over 30 hours of real-world, unscripted audio, it delivers authentic customer-agent interactions across key telecom support scenarios to help train robust ASR models.
Curated by FutureBeeAI, this dataset empowers voice AI engineers, telecom automation teams, and NLP researchers to build high-accuracy, production-ready models for telecom-specific use cases.
The dataset contains 30 hours of dual-channel call center recordings between native US English speakers. Captured in realistic customer support settings, these conversations span a wide range of telecom topics from network complaints to billing issues, offering a strong foundation for training and evaluating telecom voice AI solutions.
This speech corpus includes both inbound and outbound calls with varied conversational outcomes (positive, negative, and neutral), ensuring broad scenario coverage for telecom AI development.
This variety helps train telecom-specific models to manage real-world customer interactions and understand context-specific voice patterns.
All audio files are accompanied by manually curated, time-coded verbatim transcriptions in JSON format.
These transcriptions are production-ready, allowing for faster development of ASR and conversational AI systems in the Telecom domain.
Rich metadata is available for each participant and conversation:
2016-2020 ACS 5-Year estimates of demographic variables (see below) compiled at the State level. These variables include Sex By Age, Hispanic Or Latino Origin By Race, Household Type (Including Living Alone), Households By Presence Of People Under 18 Years By Household Type, Households By Presence Of People 60 Years And Over By Household Type, Nativity By Language Spoken At Home By Ability To Speak English For The Population 5 Years And Over, Average Household Size Of Occupied Housing Units By Tenure, and Sex by Educational Attainment for the Population 18 Years and Over.
We present Don’t Patronize Me!, an annotated dataset with Patronizing and Condescending Language (PCL) towards vulnerable communities. This annotated data is especially aimed at the NLP community in order to help improve the modelling and detection of PCL when referring to vulnerable communities, with the ultimate goal of producing and consuming a more responsible and inclusive communication. The Don’t Patronize Me! dataset (v.1.0) consists of 7,738 paragraphs about vulnerable communities extracted from news stories from the News on Web (NoW) corpus (https://www.english-corpora.org/now/, used with permission). This original corpus contains more than 18 million articles crawled from online media in 20 English-speaking countries from 2010 until 2018. In order to create our own dataset, we automatically selected from the NoW corpus just those articles where at least one word from a list of selected keywords was present. The articles were then divided per country and keyword and split into paragraphs. With the objective of ensuring a balanced representation of countries and keywords, we randomly selected 75 paragraphs per keyword and country using the scikit-learn library [11]. The final dataset will be a collection of 15,000 paragraphs with PCL annotations referring to vulnerable communities (1,500 per keyword; 750 per country).
Countries represented in the Don’t Patronize Me! dataset: Australia, Hong Kong, Sri Lanka, Pakistan, Bangladesh, Ireland, Malaysia, Singapore, Canada, India, Nigeria, Tanzania, United Kingdom, Jamaica, New Zealand, United States, Ghana, Kenya, Philippines, South Africa.
The keywords include seven potentially vulnerable groups which are widely referred to in general media and are potential recipients of condescending treatment. The remaining three keywords are concepts usually used to describe the former communities or the situations they live in.
Keywords: Disable, Homeless, Immigrant, Migrant, Poor families, Women, Hopeless, Vulnerable, In need, Refugee.
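The balanced selection described above (a fixed number of randomly chosen paragraphs for every keyword and country pair) is a form of stratified sampling. The source says scikit-learn was used. The sketch below swaps in Python's standard `random` module so that it stays self-contained. The function name `sample_per_group`, the tuple layout, and the toy pool are all hypothetical.

```python
import random
from collections import defaultdict

def sample_per_group(paragraphs, per_group=75, seed=0):
    """Pick up to `per_group` paragraphs for each (country, keyword) pair.

    `paragraphs` is an iterable of (country, keyword, text) tuples.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    groups = defaultdict(list)
    for country, keyword, text in paragraphs:
        groups[(country, keyword)].append(text)
    selected = {}
    for key, texts in groups.items():
        k = min(per_group, len(texts))  # small groups are taken whole
        selected[key] = rng.sample(texts, k)  # sampling without replacement
    return selected

# Toy pool: 2 countries x 2 keywords x 100 made-up paragraphs each.
pool = [
    (country, keyword, f"{country}-{keyword}-{i}")
    for country in ("US", "IE")
    for keyword in ("homeless", "refugee")
    for i in range(100)
]
picked = sample_per_group(pool, per_group=75, seed=1)
print({key: len(texts) for key, texts in picked.items()})
```

With the full design of 20 countries and 10 keywords, 75 paragraphs per pair gives 20 × 10 × 75 = 15,000 paragraphs, or 750 per country.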
The paragraphs included in the Don’t Patronize Me! dataset are written in English. Twenty English speaking countries are represented in the dataset, thus all their varieties of English are expected to be present in the corpus. See below for the codes of the English varieties as recommended in BCP-47. It is not possible for us to know either the regional varieties of English in each country if any, or if English is the speaker’s first language.
Language varieties in the dataset: en-AU, en-HK, en-LK, en-PK, en-BD, en-IE, en-MY, en-SG, en-CA, en-IN, en-NG, en-TZ, en-GB, en-JM, en-NZ, en-US, en-GH, en-KE, en-PH, en-ZA
As the paragraphs of our dataset are extracted from another corpus, we do not have the possibility to trace socio-demographic data of the speakers. Nevertheless, we can assume a) they are journalists, as they work in the media, so they are educated professionals; b) they speak English, although we do not know if this is their first language, and c) there is a wide representation of different races and ethnic origins, as we collect texts from 20 countries. In our dataset, each country contributes 750 articles, so this is the maximum number of different authors we could have per country. We have not observed any disorder of speech, as the texts are written, probably edited and reviewed before their publication.
The annotators who collaborated in this dataset are three white females, with ages between 25 and 35 years old. Their first language is Spanish, but they are proficient in English. They all have graduate and postgraduate studies in communication, computer science and data science.
The news stories from where the paragraphs of our dataset are extracted were published between 2010 and 2018 in 20 countries (see section A). The stories are asynchronous communication, written, edited and probably reviewed before publishing. The texts are published news articles; thus they are likely intended to reach a general audience, although the characteristics of the audience might vary depending on the country of publication.
The texts of the dataset belong to the journalism genre and the topics have been previously selected to cover the treatment of the media towards potentially vulnerable groups, as explained in section A.
https://search.gesis.org/research_data/datasearch-httpwww-da-ra-deoaip--oaioai-da-ra-de457443
Abstract (en): Summary File 4 (SF 4) from the United States 2000 Census contains the sample data, which is the information compiled from the questions asked of a sample of all people and housing units. Population items include basic population totals: urban and rural, households and families, marital status, grandparents as caregivers, language and ability to speak English, ancestry, place of birth, citizenship status, year of entry, migration, place of work, journey to work (commuting), school enrollment and educational attainment, veteran status, disability, employment status, industry, occupation, class of worker, income, and poverty status. Housing items include basic housing totals: urban and rural, number of rooms, number of bedrooms, year moved into unit, household size and occupants per room, units in structure, year structure built, heating fuel, telephone service, plumbing and kitchen facilities, vehicles available, value of home, monthly rent, and shelter costs. In Summary File 4, the sample data are presented in 213 population tables (matrices) and 110 housing tables, identified with "PCT" and "HCT" respectively. Each table is iterated for 336 population groups: the total population, 132 race groups, 78 American Indian and Alaska Native tribe categories (reflecting 39 individual tribes), 39 Hispanic or Latino groups, and 86 ancestry groups. The presentation of SF4 tables for any of the 336 population groups is subject to a population threshold. That is, if there are fewer than 100 people (100-percent count) in a specific population group in a specific geographic area, and there are fewer than 50 unweighted cases, their population and housing characteristics data are not available for that geographic area in SF4. For the ancestry iterations, only the 50 unweighted cases test can be performed. See Appendix H: Characteristic Iterations, for a complete list of characteristic iterations. 
ICPSR data undergo a confidentiality review and are altered when necessary to limit the risk of disclosure. ICPSR also routinely creates ready-to-go data files along with setups in the major statistical software formats as well as standard codebooks to accompany the data. In addition to these procedures, ICPSR performed the following processing steps for this data collection: Created variable labels and/or value labels.

All persons in housing units in Iowa in 2000.

2013-05-25 Multiple Census data file segments were repackaged for distribution into a single zip archive per dataset. No changes were made to the data or documentation.
2006-01-12 All files were removed from dataset 342 and flagged as study-level files, so that they will accompany all downloads.
2006-01-12 All files were removed from dataset 341 and flagged as study-level files, so that they will accompany all downloads.
2006-01-12 All files were removed from dataset 340 and flagged as study-level files, so that they will accompany all downloads.
2006-01-12 All files were removed from dataset 339 and flagged as study-level files, so that they will accompany all downloads.
2006-01-12 All files were removed from dataset 338 and flagged as study-level files, so that they will accompany all downloads.

Because of the number of files per state in Summary File 4, ICPSR has given each state its own ICPSR study number in the range ICPSR 13512-13563. The study number for the national file is 13570. Data for each state are being released as they become available.

The data are provided in 38 segments (files) per iteration. These segments are PCT1-PCT4, PCT5-PCT16, PCT17-PCT34, PCT35-PCT37, PCT38-PCT45, PCT46-PCT49, PCT50-PCT61, PCT62-PCT67, PCT68-PCT71, PCT72-PCT76, PCT77-PCT78, PCT79-PCT81, PCT82-PCT84, PCT85-PCT86 (partial), PCT86 (partial), PCT87-PCT103, PCT104-PCT120, PCT121-PCT131, PCT132-PCT137, PCT138-PCT143, PCT144, PCT145-PCT150, PCT151-PCT156, PCT157-PCT162, PCT163-PCT208, PCT209-PCT213, HCT1-HCT9, HCT10-HCT18, HCT19-HCT22, HCT23-HCT25, HCT26-HCT29, HCT30-HCT39, HCT40-HCT55, HCT56-HCT61, HCT62-HCT70, HCT71-HCT81, HCT82-HCT86, and HCT87-HCT110. The iterations are Parts 1-336, the Geographic Header File is Part 337. The Geographic Header File is in fixed-format ASCII and the table files are in comma-delimited ASCII format. A merged iteration will have 7,963 variables.

For Parts 251-336, the part names contain numbers within parentheses that refer to the Ancestry Code List (page G1 of the codebook).
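Finding which of the 38 segment files holds a given table is a small range lookup. The following sketch builds that lookup in Python from the segment labels listed above. The helper names `parse_label` and `segments_containing` are hypothetical, and the sketch assumes nothing about the actual file names in the ICPSR archive.

```python
import re

# Segment labels as listed in the text. Table PCT86 is split across two
# segments, which the source marks as "(partial)".
SEGMENTS = [
    "PCT1-PCT4", "PCT5-PCT16", "PCT17-PCT34", "PCT35-PCT37", "PCT38-PCT45",
    "PCT46-PCT49", "PCT50-PCT61", "PCT62-PCT67", "PCT68-PCT71", "PCT72-PCT76",
    "PCT77-PCT78", "PCT79-PCT81", "PCT82-PCT84",
    "PCT85-PCT86",  # PCT86 is partial here
    "PCT86",        # remainder of PCT86
    "PCT87-PCT103", "PCT104-PCT120", "PCT121-PCT131", "PCT132-PCT137",
    "PCT138-PCT143", "PCT144", "PCT145-PCT150", "PCT151-PCT156",
    "PCT157-PCT162", "PCT163-PCT208", "PCT209-PCT213",
    "HCT1-HCT9", "HCT10-HCT18", "HCT19-HCT22", "HCT23-HCT25", "HCT26-HCT29",
    "HCT30-HCT39", "HCT40-HCT55", "HCT56-HCT61", "HCT62-HCT70", "HCT71-HCT81",
    "HCT82-HCT86", "HCT87-HCT110",
]

def parse_label(label):
    """Split 'PCT5-PCT16' or 'PCT144' into (prefix, first, last)."""
    prefix = re.match(r"[A-Z]+", label).group()
    numbers = [int(n) for n in re.findall(r"\d+", label)]
    return prefix, numbers[0], numbers[-1]

def segments_containing(table_id):
    """Return every segment label whose range covers the given table."""
    prefix, number, _ = parse_label(table_id)
    matches = []
    for label in SEGMENTS:
        seg_prefix, first, last = parse_label(label)
        if seg_prefix == prefix and first <= number <= last:
            matches.append(label)
    return matches

print(segments_containing("PCT86"))
print(segments_containing("HCT20"))
```

The function returns a list rather than a single label because PCT86 appears in two segments.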
The Programme for International Student Assessment (PISA) is a test given every three years to 15-year-old students from around the world to evaluate their performance in mathematics, reading, and science. This test provides a quantitative way to compare the performance of students from different parts of the world. In this homework assignment, we will predict the reading scores of students from the United States of America on the 2009 PISA exam.
The datasets pisa2009train.csv and pisa2009test.csv contain information about the demographics and schools for American students taking the exam, derived from 2009 PISA Public-Use Data Files distributed by the United States National Center for Education Statistics (NCES). While the datasets are not supposed to contain identifying information about students taking the test, by using the data you are bound by the NCES data use agreement, which prohibits any attempt to determine the identity of any student in the datasets.
Each row in the datasets pisa2009train.csv and pisa2009test.csv represents one student taking the exam. The datasets have the following variables:
grade: The grade in school of the student (most 15-year-olds in America are in 10th grade)
male: Whether the student is male (1/0)
raceeth: The race/ethnicity composite of the student
preschool: Whether the student attended preschool (1/0)
expectBachelors: Whether the student expects to obtain a bachelor's degree (1/0)
motherHS: Whether the student's mother completed high school (1/0)
motherBachelors: Whether the student's mother obtained a bachelor's degree (1/0)
motherWork: Whether the student's mother has part-time or full-time work (1/0)
fatherHS: Whether the student's father completed high school (1/0)
fatherBachelors: Whether the student's father obtained a bachelor's degree (1/0)
fatherWork: Whether the student's father has part-time or full-time work (1/0)
selfBornUS: Whether the student was born in the United States of America (1/0)
motherBornUS: Whether the student's mother was born in the United States of America (1/0)
fatherBornUS: Whether the student's father was born in the United States of America (1/0)
englishAtHome: Whether the student speaks English at home (1/0)
computerForSchoolwork: Whether the student has access to a computer for schoolwork (1/0)
read30MinsADay: Whether the student reads for pleasure for 30 minutes/day (1/0)
minutesPerWeekEnglish: The number of minutes per week the student spends in English class
studentsInEnglish: The number of students in this student's English class at school
schoolHasLibrary: Whether this student's school has a library (1/0)
publicSchool: Whether this student attends a public school (1/0)
urban: Whether this student's school is in an urban area (1/0)
schoolSize: The number of students in this student's school
readingScore: The student's reading score, on a 1000-point scale
MITx ANALYTIX
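As a rough illustration of what predicting readingScore from these variables could look like, the sketch below fits a single-predictor least-squares line and reports its root-mean-square error. It is not the assignment's actual model. The helper names, the made-up sample rows, and the assumption that missing values appear as non-numeric text such as "NA" are all hypothetical. Only the column names come from the variable list above.

```python
import csv
import io
import math

def load_rows(text, columns):
    """Parse CSV text, keeping rows where every requested column is numeric."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        try:
            rows.append({c: float(record[c]) for c in columns})
        except (KeyError, TypeError, ValueError):
            continue  # missing column, short row, or non-numeric value
    return rows

def fit_line(rows, x_col, y_col):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(rows)
    mean_x = sum(r[x_col] for r in rows) / n
    mean_y = sum(r[y_col] for r in rows) / n
    sxx = sum((r[x_col] - mean_x) ** 2 for r in rows)
    sxy = sum((r[x_col] - mean_x) * (r[y_col] - mean_y) for r in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def rmse(rows, x_col, y_col, slope, intercept):
    """Root-mean-square error of the fitted line on the given rows."""
    squared = [(r[y_col] - (intercept + slope * r[x_col])) ** 2 for r in rows]
    return math.sqrt(sum(squared) / len(squared))

# Made-up rows for illustration only; not real PISA data.
SAMPLE = """grade,englishAtHome,readingScore
10,1,500
10,0,460
11,1,540
9,NA,400
11,0,500
"""

rows = load_rows(SAMPLE, ["grade", "englishAtHome", "readingScore"])
slope, intercept = fit_line(rows, "grade", "readingScore")
error = rmse(rows, "grade", "readingScore", slope, intercept)
print(len(rows), slope, intercept, error)
```

In practice the line would be fit on pisa2009train.csv and the error measured on pisa2009test.csv, typically with several predictors rather than one.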
2016-2020 ACS 5-Year estimates of demographic variables (see below) compiled at the tract level. The American Community Survey (ACS) 5 Year 2016-2020 demographic information is a subset of information available for download from the U.S. Census. Tables used in the development of this dataset include: B01001 - Sex By Age; B03002 - Hispanic Or Latino Origin By Race; B11001 - Household Type (Including Living Alone); B11005 - Households By Presence Of People Under 18 Years By Household Type; B11006 - Households By Presence Of People 60 Years And Over By Household Type; B16005 - Nativity By Language Spoken At Home By Ability To Speak English For The Population 5 Years And Over; B25010 - Average Household Size Of Occupied Housing Units By Tenure; and B15001 - Sex by Educational Attainment for the Population 18 Years and Over. To learn more about the American Community Survey (ACS) and associated datasets, visit: https://www.census.gov/programs-surveys/acs. For questions about the spatial attribution of this dataset, please reach out to us at GISHelpdesk@hud.gov.

Data Dictionary: DD_ACS 5-Year Demographic Estimate Data by Tract
Date of Coverage: 2016-2020
The State Legislative District Summary File (Sample) (SLDSAMPLE) contains the sample data, which is the information compiled from the questions asked of a sample of all people and housing units. Population items include basic population totals; urban and rural; households and families; marital status; grandparents as caregivers; language and ability to speak English; ancestry; place of birth, citizenship status, and year of entry; migration; place of work; journey to work (commuting); school enrollment and educational attainment; veteran status; disability; employment status; industry, occupation, and class of worker; income; and poverty status. Housing items include basic housing totals; urban and rural; number of rooms; number of bedrooms; year moved into unit; household size and occupants per room; units in structure; year structure built; heating fuel; telephone service; plumbing and kitchen facilities; vehicles available; value of home; monthly rent; and shelter costs. The file contains subject content identical to that shown in Summary File 3 (SF 3).
https://search.gesis.org/research_data/datasearch-httpwww-da-ra-deoaip--oaioai-da-ra-de457436
Abstract (en): Summary File 4 (SF 4) from the United States 2000 Census contains the sample data, which is the information compiled from the questions asked of a sample of all people and housing units. Population items include basic population totals: urban and rural, households and families, marital status, grandparents as caregivers, language and ability to speak English, ancestry, place of birth, citizenship status, year of entry, migration, place of work, journey to work (commuting), school enrollment and educational attainment, veteran status, disability, employment status, industry, occupation, class of worker, income, and poverty status. Housing items include basic housing totals: urban and rural, number of rooms, number of bedrooms, year moved into unit, household size and occupants per room, units in structure, year structure built, heating fuel, telephone service, plumbing and kitchen facilities, vehicles available, value of home, monthly rent, and shelter costs. In Summary File 4, the sample data are presented in 213 population tables (matrices) and 110 housing tables, identified with "PCT" and "HCT" respectively. Each table is iterated for 336 population groups: the total population, 132 race groups, 78 American Indian and Alaska Native tribe categories (reflecting 39 individual tribes), 39 Hispanic or Latino groups, and 86 ancestry groups. The presentation of SF4 tables for any of the 336 population groups is subject to a population threshold. That is, if there are fewer than 100 people (100-percent count) in a specific population group in a specific geographic area, and there are fewer than 50 unweighted cases, their population and housing characteristics data are not available for that geographic area in SF4. For the ancestry iterations, only the 50 unweighted cases test can be performed. See Appendix H: Characteristic Iterations, for a complete list of characteristic iterations. 
ICPSR data undergo a confidentiality review and are altered when necessary to limit the risk of disclosure. ICPSR also routinely creates ready-to-go data files along with setups in the major statistical software formats as well as standard codebooks to accompany the data. In addition to these procedures, ICPSR performed the following processing steps for this data collection: Created variable labels and/or value labels.
All persons in housing units in the District of Columbia in 2000.
2013-05-25 Multiple Census data file segments were repackaged for distribution into a single zip archive per dataset. No changes were made to the data or documentation.
2006-01-12 All files were removed from dataset 342 and flagged as study-level files, so that they will accompany all downloads.
2006-01-12 All files were removed from dataset 341 and flagged as study-level files, so that they will accompany all downloads.
2006-01-12 All files were removed from dataset 340 and flagged as study-level files, so that they will accompany all downloads.
2006-01-12 All files were removed from dataset 339 and flagged as study-level files, so that they will accompany all downloads.
2006-01-12 All files were removed from dataset 338 and flagged as study-level files, so that they will accompany all downloads.
Because of the number of files per state in Summary File 4, ICPSR has given each state its own ICPSR study number in the range ICPSR 13512-13563. The study number for the national file is 13570. Data for each state are being released as they become available. The data are provided in 38 segments (files) per iteration.
These segments are PCT1-PCT4, PCT5-PCT16, PCT17-PCT34, PCT35-PCT37, PCT38-PCT45, PCT46-PCT49, PCT50-PCT61, PCT62-PCT67, PCT68-PCT71, PCT72-PCT76, PCT77-PCT78, PCT79-PCT81, PCT82-PCT84, PCT85-PCT86 (partial), PCT86 (partial), PCT87-PCT103, PCT104-PCT120, PCT121-PCT131, PCT132-PCT137, PCT138-PCT143, PCT144, PCT145-PCT150, PCT151-PCT156, PCT157-PCT162, PCT163-PCT208, PCT209-PCT213, HCT1-HCT9, HCT10-HCT18, HCT19-HCT22, HCT23-HCT25, HCT26-HCT29, HCT30-HCT39, HCT40-HCT55, HCT56-HCT61, HCT62-HCT70, HCT71-HCT81, HCT82-HCT86, and HCT87-HCT110. The iterations are Parts 1-336, and the Geographic Header File is Part 337. The Geographic Header File is in fixed-format ASCII, and the table files are in comma-delimited ASCII format. A merged iteration will have 7,963 variables. For Parts 251-336, the part names contain numbers within parentheses that refer to the Ancestry Code List (page G1 of the codebook).
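As a rough illustration of how the comma-delimited segment files of one iteration might be combined into a single merged record set, here is a minimal Python sketch. It rests on assumptions not stated in the abstract: that each row begins with five identifier fields, that the fifth is a logical record number shared across segments, and that every segment lists the same records. The function names and the sample rows are hypothetical, so the codebook should be checked for the actual layout before relying on this.

```python
import csv
import io

# Assumption: each row starts with five identifier fields, the last of
# which is a logical record number shared by all segments of an iteration.
N_ID_FIELDS = 5
LOGRECNO_POS = 4


def read_segment(handle):
    """Map logical record number -> list of data fields for one segment."""
    rows = {}
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        rows[row[LOGRECNO_POS]] = row[N_ID_FIELDS:]
    return rows


def merge_segments(handles):
    """Join segments side by side on the logical record number.

    Assumes every segment contains the same set of records; a record
    missing from one segment would leave its columns misaligned.
    """
    merged = {}
    for handle in handles:
        for logrecno, data in read_segment(handle).items():
            merged.setdefault(logrecno, []).extend(data)
    return merged


# Made-up rows standing in for two segments of one iteration.
seg_a = io.StringIO("SF4,DC,001,01,0000001,10,20\nSF4,DC,001,01,0000002,30,40\n")
seg_b = io.StringIO("SF4,DC,001,02,0000001,5\nSF4,DC,001,02,0000002,6\n")
merged = merge_segments([seg_a, seg_b])
print(merged["0000001"])
```

With real files, the in-memory strings would be replaced by opened segment files, passed in table order so that the merged columns follow the PCT and HCT sequence listed above.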
https://search.gesis.org/research_data/datasearch-httpwww-da-ra-deoaip--oaioai-da-ra-de446517
Abstract (en): Summary File 4 (SF 4) from the United States 2000 Census contains the sample data, which is the information compiled from the questions asked of a sample of all people and housing units. Population items include basic population totals: urban and rural, households and families, marital status, grandparents as caregivers, language and ability to speak English, ancestry, place of birth, citizenship status, year of entry, migration, place of work, journey to work (commuting), school enrollment and educational attainment, veteran status, disability, employment status, industry, occupation, class of worker, income, and poverty status. Housing items include basic housing totals: urban and rural, number of rooms, number of bedrooms, year moved into unit, household size and occupants per room, units in structure, year structure built, heating fuel, telephone service, plumbing and kitchen facilities, vehicles available, value of home, monthly rent, and shelter costs. In Summary File 4, the sample data are presented in 213 population tables (matrices) and 110 housing tables, identified with "PCT" and "HCT" respectively. Each table is iterated for 336 population groups: the total population, 132 race groups, 78 American Indian and Alaska Native tribe categories (reflecting 39 individual tribes), 39 Hispanic or Latino groups, and 86 ancestry groups. The presentation of SF4 tables for any of the 336 population groups is subject to a population threshold. That is, if there are fewer than 100 people (100-percent count) in a specific population group in a specific geographic area, and there are fewer than 50 unweighted cases, their population and housing characteristics data are not available for that geographic area in SF4. For the ancestry iterations, only the 50 unweighted cases test can be performed. See Appendix H: Characteristic Iterations, for a complete list of characteristic iterations. 
ICPSR data undergo a confidentiality review and are altered when necessary to limit the risk of disclosure. ICPSR also routinely creates ready-to-go data files along with setups in the major statistical software formats as well as standard codebooks to accompany the data. In addition to these procedures, ICPSR performed the following processing steps for this data collection: Created variable labels and/or value labels.
All persons in housing units in West Virginia in 2000.
2013-05-25 Multiple Census data file segments were repackaged for distribution into a single zip archive per dataset. No changes were made to the data or documentation.
2006-01-12 All files were removed from dataset 342 and flagged as study-level files, so that they will accompany all downloads.
2006-01-12 All files were removed from dataset 341 and flagged as study-level files, so that they will accompany all downloads.
2006-01-12 All files were removed from dataset 340 and flagged as study-level files, so that they will accompany all downloads.
2006-01-12 All files were removed from dataset 339 and flagged as study-level files, so that they will accompany all downloads.
2006-01-12 All files were removed from dataset 338 and flagged as study-level files, so that they will accompany all downloads.
Because of the number of files per state in Summary File 4, ICPSR has given each state its own ICPSR study number in the range ICPSR 13512-13563. The study number for the national file is 13570. Data for each state are being released as they become available. The data are provided in 38 segments (files) per iteration.
These segments are PCT1-PCT4, PCT5-PCT16, PCT17-PCT34, PCT35-PCT37, PCT38-PCT45, PCT46-PCT49, PCT50-PCT61, PCT62-PCT67, PCT68-PCT71, PCT72-PCT76, PCT77-PCT78, PCT79-PCT81, PCT82-PCT84, PCT85-PCT86 (partial), PCT86 (partial), PCT87-PCT103, PCT104-PCT120, PCT121-PCT131, PCT132-PCT137, PCT138-PCT143, PCT144, PCT145-PCT150, PCT151-PCT156, PCT157-PCT162, PCT163-PCT208, PCT209-PCT213, HCT1-HCT9, HCT10-HCT18, HCT19-HCT22, HCT23-HCT25, HCT26-HCT29, HCT30-HCT39, HCT40-HCT55, HCT56-HCT61, HCT62-HCT70, HCT71-HCT81, HCT82-HCT86, and HCT87-HCT110. The iterations are Parts 1-336, and the Geographic Header File is Part 337. The Geographic Header File is in fixed-format ASCII, and the table files are in comma-delimited ASCII format. A merged iteration will have 7,963 variables. For Parts 251-336, the part names contain numbers within parentheses that refer to the Ancestry Code List (page G1 of the codebook).
Summary File 3 contains sample data, which is the information compiled from the questions asked of a sample of all people and housing units in the United States. Population items include basic population totals as well as counts for the following characteristics: urban and rural, households and families, marital status, grandparents as caregivers, language and ability to speak English, ancestry, place of birth, citizenship status, year of entry, migration, place of work, journey to work (commuting), school enrollment and educational attainment, veteran status, disability, employment status, industry, occupation, class of worker, income, and poverty status. Housing items include basic housing totals and counts for urban and rural, number of rooms, number of bedrooms, year moved into unit, household size and occupants per room, units in structure, year structure built, heating fuel, telephone service, plumbing and kitchen facilities, vehicles available, value of home, and monthly rent and shelter costs. The Summary File 3 population tables are identified with a "P" prefix and the housing tables are identified with an "H," followed by a sequential number. The "P" and "H" tables are shown for the block group and higher level geography, while the "PCT" and "HCT" tables are shown for the census tract and higher level geography. There are 16 "P" tables, 15 "PCT" tables, and 20 "HCT" tables that bear an alphabetic suffix on the table number, indicating that they are repeated for nine major race and Hispanic or Latino groups. There are 484 population tables and 329 housing tables for a total of 813 unique tables. (Source: downloaded from ICPSR 7/13/10)
Please Note: This dataset is part of the historical CISER Data Archive Collection and is also available at ICPSR at https://doi.org/10.3886/ICPSR13343.v1. We highly recommend using the ICPSR version as they may make this dataset available in multiple data formats in the future.