Median house prices for California districts derived from the 1990 census.
About Dataset
Context This is the dataset used in the second chapter of Aurélien Géron's recent book 'Hands-On Machine Learning with Scikit-Learn and TensorFlow'. It serves as an excellent introduction to implementing machine learning algorithms because it requires rudimentary data cleaning, has an easily understandable list of variables, and sits at an optimal size between being too toyish and too cumbersome.
The data contains information from the 1990 California census. So although it may not help you with predicting current housing prices like the Zillow Zestimate dataset, it does provide an accessible introductory dataset for teaching people about the basics of machine learning.
Content The data pertains to the houses found in a given California district and some summary statistics about them, based on the 1990 census data. Be warned: the data aren't cleaned, so some preprocessing steps are required! The columns are as follows; their names are fairly self-explanatory:
- longitude
- latitude
- housing_median_age
- total_rooms
- total_bedrooms
- population
- households
- median_income
- median_house_value
- ocean_proximity
Acknowledgements This data was initially featured in the following paper: Pace, R. Kelley, and Ronald Barry. "Sparse spatial autoregressions." Statistics & Probability Letters 33.3 (1997): 291-297.
I encountered it in 'Hands-On Machine Learning with Scikit-Learn and TensorFlow' by Aurélien Géron, who wrote: "This dataset is a modified version of the California Housing dataset available from Luís Torgo's page (University of Porto)."
Inspiration See my kernel on machine learning basics in R using this dataset, or venture over to the following link for a Python-based introductory tutorial: https://github.com/ageron/handson-ml/tree/master/datasets/housing
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
U.S. Census Bureau QuickFacts statistics for California City city, California. QuickFacts data are derived from: Population Estimates, American Community Survey, Census of Population and Housing, Current Population Survey, Small Area Health Insurance Estimates, Small Area Income and Poverty Estimates, State and County Housing Unit Estimates, County Business Patterns, Nonemployer Statistics, Economic Census, Survey of Business Owners, Building Permits.
https://www.icpsr.umich.edu/web/ICPSR/studies/6712/terms
This collection comprises census tract-level data for California from the 1970 Census. The data contain 20-, 15-, and 5-percent sample population and housing characteristics including education, occupation, income, citizenship, vocational training, and household equipment and facilities.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This data collection includes housing data by census tract for San Diego County in 1970. Sample counts (5%, 15% and 20%) are derived from the Census of Population and Housing, 1970: Summary Tape File 4A: Housing. Housing characteristics include housing value, number of housing units in structure, number of rooms in housing unit, year structure was built, occupancy/vacancy status, tenure, rent, type of heating fuel, source of water, and presence of an air conditioner and other home appliances. Counts are available for the total, Negro, and Spanish American populations. Negros are defined as a racial category by the Census Bureau. In California, Spanish Americans include "Persons of Spanish language or Spanish surname". The California state 4A data file was processed with PERL and SPSS by the Social Science Data Collection (SSDC) staff of the University of California, San Diego Library from Census Bureau data processed by DUALabs, Inc. and archived at the Odum Institute. PERL script concatenated record type codes and output six data files that match the record types described in the codebook: tables 001-040, tables 041-107, tables 108-119, tables 120-130, tables 131-152 and Spanish American tables 153-200. These six files were subsequently processed with SPSS to extract tracts for San Diego County and recompute some aggregate housing value data content. Users may browse a list of data variables (tables and cells) included in these six data files. The Census Bureau produced printed reports for 1970 Summary Tape File 4A. The UCSD Geisel Library maintains printed reports and census tract maps in 1970 Census of Population and Housing Census Tracts, San Diego, Calif. (SSH Docs US Stacks C 3.223/11:970/188).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘California Housing Data (1990)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/harrywang/housing on 12 November 2021.
--- Dataset description provided by original source is as follows ---
This is the dataset used in this book: https://github.com/ageron/handson-ml/tree/master/datasets/housing to illustrate a sample end-to-end ML project workflow (pipeline). This is a great book - I highly recommend!
The data is based on California Census in 1990.
"This dataset is a modified version of the California Housing dataset available from Luís Torgo's page (University of Porto). Luís Torgo obtained it from the StatLib repository (which is closed now). The dataset may also be downloaded from StatLib mirrors.
The following is the description from the book author:
This dataset appeared in a 1997 paper titled Sparse Spatial Autoregressions by Pace, R. Kelley and Ronald Barry, published in the Statistics and Probability Letters journal. They built it using the 1990 California census data. It contains one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).
The dataset in this directory is almost identical to the original, with two differences: 207 values were randomly removed from the total_bedrooms column, so we can discuss what to do with missing data. An additional categorical attribute called ocean_proximity was added, indicating (very roughly) whether each block group is near the ocean, near the Bay area, inland or on an island. This allows discussing what to do with categorical data. Note that the block groups are called "districts" in the Jupyter notebooks, simply because in some contexts the name "block group" was confusing."
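The two modifications described above (missing total_bedrooms values and the categorical ocean_proximity attribute) map onto two common preprocessing steps. The following is a minimal sketch in plain Python, not the book's own code: it assumes median imputation and one-hot encoding as the chosen strategies, and the sample rows, category labels, and helper names (impute_median, one_hot) are hypothetical.

```python
from statistics import median

# Hypothetical sample rows; the values are made up for illustration only.
rows = [
    {"total_bedrooms": 129.0, "ocean_proximity": "NEAR BAY"},
    {"total_bedrooms": None, "ocean_proximity": "INLAND"},
    {"total_bedrooms": 190.0, "ocean_proximity": "NEAR OCEAN"},
    {"total_bedrooms": 235.0, "ocean_proximity": "INLAND"},
]


def impute_median(rows, column):
    """Return copies of the rows with missing values in `column` set to the column median."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def one_hot(rows, column):
    """Return copies of the rows with `column` replaced by one 0/1 indicator per category."""
    categories = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if r[column] == category else 0
        encoded.append(new_row)
    return encoded


cleaned = one_hot(impute_median(rows, "total_bedrooms"), "ocean_proximity")
```

Other strategies (dropping incomplete rows, or dropping the column) are equally valid starting points; median imputation is used here only because it keeps every row.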
http://www.dcc.fc.up.pt/%7Eltorgo/Regression/cal_housing.html
This is a dataset obtained from the StatLib repository. Here is the included description:
"We collected information on the variables using all the block groups in California from the 1990 Cens us. In this sample a block group on average includes 1425.5 individuals living in a geographically co mpact area. Naturally, the geographical area included varies inversely with the population density. W e computed distances among the centroids of each block group as measured in latitude and longitude. W e excluded all the block groups reporting zero entries for the independent and dependent variables. T he final data contained 20,640 observations on 9 variables. The dependent variable is ln(median house value)."
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This table contains data on the percent of households paying more than 30% (or 50%) of monthly household income towards housing costs for California, its regions, counties, cities/towns, and census tracts. Data is from the U.S. Department of Housing and Urban Development (HUD), Consolidated Planning Comprehensive Housing Affordability Strategy (CHAS) and the U.S. Census Bureau, American Community Survey (ACS). The table is part of a series of indicators in the Healthy Communities Data and Indicators Project of the Office of Health Equity. Affordable, quality housing is central to health, conferring protection from the environment and supporting family life. Housing costs, typically the largest single expense in a family's budget, also impact decisions that affect health. As housing consumes larger proportions of household income, families have less income for nutrition, health care, transportation, education, etc. Severe cost burdens may induce poverty, which is associated with developmental and behavioral problems in children and accelerated cognitive and physical decline in adults. Low-income families and minority communities are disproportionately affected by the lack of affordable, quality housing. More information about the data table and a data dictionary can be found in the Attachments.
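The cost-burden indicator described above can be illustrated with a small calculation. This is a hedged sketch, not the HUD/CHAS methodology: the function name, the input layout (pairs of monthly housing cost and monthly income), and the choice to exclude households with zero or negative income are all assumptions made for the example.

```python
def percent_cost_burdened(households, threshold):
    """Percent of households whose housing cost exceeds `threshold` of income.

    `households` is a list of (monthly_housing_cost, monthly_income) pairs.
    Assumption: households with non-positive income are excluded, because
    their cost share is undefined.
    """
    eligible = [(cost, income) for cost, income in households if income > 0]
    if not eligible:
        return 0.0
    burdened = sum(1 for cost, income in eligible if cost / income > threshold)
    return 100.0 * burdened / len(eligible)


# Made-up sample: cost shares are roughly 33%, 53%, 25%, and 62.5%.
sample = [(1000, 3000), (1600, 3000), (1000, 4000), (2500, 4000)]
over_30 = percent_cost_burdened(sample, 0.30)
over_50 = percent_cost_burdened(sample, 0.50)
```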
VITAL SIGNS INDICATOR Housing Production (LU4)
FULL MEASURE NAME Produced housing units by unit type
LAST UPDATED October 2019
DESCRIPTION Housing production is measured in terms of the number of units that local jurisdictions produce throughout a given year. The annual production count captures housing units added by new construction and annexations, subtracts demolitions and destruction from natural disasters, and adjusts for units lost or gained by conversions.
DATA SOURCE California Department of Finance Form E-8 1990-2010 http://www.dof.ca.gov/Forecasting/Demographics/Estimates/E-8/
California Department of Finance Form E-5 2011-2018 http://www.dof.ca.gov/Forecasting/Demographics/Estimates/E-5/
U.S. Census Bureau Population Estimates 2000-2018 https://www.census.gov/programs-surveys/popest.html
CONTACT INFORMATION vitalsigns.info@bayareametro.gov
METHODOLOGY NOTES (across all datasets for this indicator) Single-family housing units include single detached units and single attached units. Multi-family housing includes two to four units and five plus or apartment units.
Housing production data for metropolitan areas for each year is the difference of annual housing unit estimates from the Census Bureau’s Population Estimates Program. Housing production data for the region, counties, and cities for each year is the difference of annual housing unit estimates from the California Department of Finance. Department of Finance data uses an annual cycle between January 1 and December 31, whereas U.S. Census Bureau data uses an annual cycle from April 1 to March 31 of the following year.
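The differencing step described above can be sketched as follows. This is a minimal illustration with made-up numbers. The function name is hypothetical, and each result is keyed by the pair of estimate years rather than a single year, since the two sources use different annual cycles.

```python
def annual_production(unit_estimates):
    """Difference consecutive annual housing unit estimates.

    `unit_estimates` maps an estimate year to a housing unit count.
    Returns a dict keyed by (earlier_year, later_year) with the net change.
    """
    years = sorted(unit_estimates)
    return {
        (earlier, later): unit_estimates[later] - unit_estimates[earlier]
        for earlier, later in zip(years, years[1:])
    }


# Made-up estimates for illustration only.
production = annual_production({2010: 1000, 2011: 1012, 2012: 1030})
```

A negative value would indicate a net loss of units between two estimates, for example after demolitions.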
Housing production data shows how many housing units have been produced over time. Like housing permit statistics, housing production numbers are an indicator of where the region is growing. However, since permitted units are sometimes not constructed or there can be a long lag time between permit approval and the start of construction, production data also reflects the effects of barriers to housing production. These range from a lack of builder confidence to high construction costs and limited financing. Data also differentiates the trends in multi-family, single-family and mobile home production.
https://fred.stlouisfed.org/legal/#copyright-public-domain
Graph and download economic data for New Private Housing Units Authorized by Building Permits: 1-Unit Structures for California (CABP1FH) from Jan 1988 to May 2025 about privately owned, 1-unit structures, permits, family, buildings, CA, housing, and USA.
The California Housing dataset is based on the 1990 US census and is widely used for machine learning and statistics. It was introduced in a 1997 paper by R. Kelley Pace and Ronald Barry, and can be found in the UCI Machine Learning Repository. The dataset describes economic and geographic characteristics of houses in California, as well as the economic status of the people living there.
The TIGER/Line Files are shapefiles and related database files (.dbf) that are an extract of selected geographic and cartographic information from the U.S. Census Bureau's Master Address File / Topologically Integrated Geographic Encoding and Referencing (MAF/TIGER) Database (MTDB). The purpose of this file is to provide the geography for the 2010 Census Blocks along with their 2010 housing unit count and population. Census Blocks are statistical areas bounded on all sides by visible features, such as streets, roads, streams, and railroad tracks, and/or by nonvisible boundaries such as city, town, township, and county limits, and short line-of-sight extensions of streets and roads. Blocks are the smallest geographic areas for which the Census Bureau publishes data from the decennial census. A block may consist of one or more faces.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
U.S. Census Bureau QuickFacts statistics for Bakersfield city, California. QuickFacts data are derived from: Population Estimates, American Community Survey, Census of Population and Housing, Current Population Survey, Small Area Health Insurance Estimates, Small Area Income and Poverty Estimates, State and County Housing Unit Estimates, County Business Patterns, Nonemployer Statistics, Economic Census, Survey of Business Owners, Building Permits.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset presents the mean household income for each of the five quintiles in Orange County, CA, as reported by the U.S. Census Bureau. The dataset highlights the variation in mean household income across quintiles, offering valuable insights into income distribution and inequality.
Key observations
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com to check the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Orange County median household income.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
U.S. Census Bureau QuickFacts statistics for Lafayette city, California. QuickFacts data are derived from: Population Estimates, American Community Survey, Census of Population and Housing, Current Population Survey, Small Area Health Insurance Estimates, Small Area Income and Poverty Estimates, State and County Housing Unit Estimates, County Business Patterns, Nonemployer Statistics, Economic Census, Survey of Business Owners, Building Permits.
https://fred.stlouisfed.org/legal/#copyright-public-domain
Graph and download economic data for New Private Housing Structures Authorized by Building Permits for Humboldt County, CA (BPPRIV006023) from 1990 to 2024 about Humboldt County, CA; permits; buildings; CA; private; housing; and USA.
https://fred.stlouisfed.org/legal/#copyright-public-domain
Graph and download economic data for Home Vacancy Rate for California (CAHVAC) from 1986 to 2024 about vacancy, CA, housing, rate, and USA.
VITAL SIGNS INDICATOR
Housing Permits (LU3)
FULL MEASURE NAME
Permitted housing units
LAST UPDATED
February 2023
DESCRIPTION
Housing growth is measured in terms of the number of units that local jurisdictions permit throughout a given year. A permitted unit is a unit that a city or county has authorized for construction.
DATA SOURCE
California Housing Foundation/Construction Industry Research Board (CIRB) - https://www.cirbreport.org/
Construction Review report (1967-2022)
Association of Bay Area Governments (ABAG) – Metropolitan Transportation Commission (MTC) - https://data.bayareametro.gov/Development/HCD-Annual-Progress-Report-Jurisdiction-Summary/nxbj-gfv7
Housing Permits Database (2014-2021)
Census Bureau Building Permit Survey - https://www2.census.gov/econ/bps/County/
Building permits by county (annual, monthly)
CONTACT INFORMATION
vitalsigns.info@bayareametro.gov
METHODOLOGY NOTES (across all datasets for this indicator)
Bay Area housing permits data by single/multi family come from the California Housing Foundation/Construction Industry Research Board (CIRB). Affordability breakdowns from 2014 to 2021 come from the Association of Bay Area Governments (ABAG) – Metropolitan Transportation Commission (MTC) Housing Permits Database.
Single-family housing units include detached, semi-detached, row house and town house units. Row houses and town houses are included as single-family units when each unit is separated from the adjacent unit by an unbroken ground-to-roof party or fire wall. Condominiums are included as single-family units when they are of zero-lot-line or zero-property-line construction; when units are separated by an air space; or, when units are separated by an unbroken ground-to-roof party or fire wall. Multi-family housing includes duplexes, three-to-four-unit structures and apartment-type structures with five units or more. Multi-family also includes condominium units in structures of more than one living unit that do not meet the single-family housing definition.
Each multi-family unit is counted separately even though they may be in the same building. Total units is the sum of single-family and multi-family units. County data is available from 1967 whereas city data is available from 1990. City data is only available for incorporated cities and towns. All permits in unincorporated cities and towns are included under their respective county’s unincorporated total. Permit data is not available for years when the city or town was not incorporated.
Affordable housing is the total number of permitted units affordable to low and very low income households. Very low income households make below 50% of the area median income. Low income households make between 50% and 80% of the area median income. Moderate income households make between 80% and 120% of the area median income. Above moderate income households make above 120% of the area median income.
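The income bands above can be expressed as a small classifier. This is a sketch under assumptions: it reads the moderate band as 80% to 120% of area median income, and it assigns each boundary value (exactly 50%, 80%, or 120%) to one band, which the text does not specify. The function name is hypothetical.

```python
def affordability_category(household_income, area_median_income):
    """Classify a household by its income as a share of area median income (AMI).

    Boundary handling is an assumption: exactly 50% counts as low,
    exactly 80% as low, and exactly 120% as moderate.
    """
    if area_median_income <= 0:
        raise ValueError("area_median_income must be positive")
    ratio = household_income / area_median_income
    if ratio < 0.5:
        return "very low"
    if ratio <= 0.8:
        return "low"
    if ratio <= 1.2:
        return "moderate"
    return "above moderate"
```

Under the definition in the text, only units in the "very low" and "low" bands count toward the affordable housing total.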
Permit data is missing for the following cities and years:
Clayton, 1990-2007
Lafayette, 1990-2007
Moraga, 1990-2007
Orinda, 1990-2007
San Ramon, 1990
Building permit data for metropolitan areas for each year is the sum of non-seasonally adjusted monthly estimates from the Census Building Permit Survey. The Bay Area values are the sum of the San Francisco-Oakland-Hayward MSA and the San Jose-Sunnyvale-Santa Clara MSA. The counties included in these areas are: San Francisco, Marin, Contra Costa, Alameda, San Mateo, Santa Clara, and San Benito.
Permit values reflect the number of units permitted in each respective year. Note that the data columns come from different sources. The columns (SFunits, MFunits, TOTALunits, SF_Share and MF_Share) are sourced from CIRB. The columns (VeryLowunits, Lowunits, Moderateunits, AboveModerateunits, VeryLow_Share, Low_Share, Moderate_Share, AboveModerate_Share, Affordableunits and Affordableunits_Share) are sourced from the ABAG Housing Permits Database. Due to the slightly different methodologies that exist within each of those datasets, the total units from each of the two sources might not be consistent with each other.
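As a rough illustration of how the CIRB-sourced columns relate to one another, the sketch below derives a total and the two shares from single-family and multi-family counts. The total follows the stated definition (sum of single-family and multi-family units). The share formula (units divided by total) is an assumption, since the methodology notes do not spell it out, and the function name is hypothetical.

```python
def permit_shares(sf_units, mf_units):
    """Derive TOTALunits, SF_Share and MF_Share from unit counts.

    Assumption: each share is that type's units divided by total units.
    Shares are None when no units were permitted.
    """
    total = sf_units + mf_units
    if total == 0:
        return {"TOTALunits": 0, "SF_Share": None, "MF_Share": None}
    return {
        "TOTALunits": total,
        "SF_Share": sf_units / total,
        "MF_Share": mf_units / total,
    }


# Made-up counts for illustration only.
example = permit_shares(30, 70)
```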
As shown, three different data sources are used for this analysis of housing permits issued in the Bay Area. Data from the Construction Industry Research Board (CIRB) represents the best available data source for examining housing permits issued over time in cities and counties across the Bay Area, dating back to 1967. In recent years, Annual Progress Report (APR) data collected by the California Department of Housing and Community Development has been available for analyzing housing permits issued by affordability levels. Since CIRB data is only available for California jurisdictions, the U.S. Census Bureau provides the best data source for comparing housing permits issued across different metropolitan areas. Notably, annual permit totals for the Bay Area differ across these three data sources, reflecting the limitations of needing to use different data sources for different purposes.
This dataset is a modified version of the California Housing dataset available from Luís Torgo's page (University of Porto). Luís Torgo obtained it from the StatLib repository (which is closed now). The dataset may also be downloaded from StatLib mirrors.
This dataset appeared in a 1997 paper titled Sparse Spatial Autoregressions by Pace, R. Kelley and Ronald Barry, published in the Statistics and Probability Letters journal. They built it using the 1990 California census data. It contains one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).
The dataset in this directory is almost identical to the original, with two differences:
207 values were randomly removed from the total_bedrooms column, so we can discuss what to do with missing data. An additional categorical attribute called ocean_proximity was added, indicating (very roughly) whether each block group is near the ocean, near the Bay area, inland or on an island. This allows discussing what to do with categorical data. Note that the block groups are called "districts" in the Jupyter notebooks, simply because in some contexts the name "block group" was confusing.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This layer shows census tracts that meet any of the following definitions:
- Census tracts with median household incomes at or below 80 percent of the statewide median income, or with median household incomes at or below the threshold designated as low income by the Department of Housing and Community Development's list of state income limits adopted under Health and Safety Code section 50093
- Census tracts receiving the highest 25 percent of overall scores in CalEnviroScreen 4.0
- Census tracts lacking overall scores in CalEnviroScreen 4.0 due to data gaps, but receiving the highest 5 percent of CalEnviroScreen 4.0 cumulative population burden scores
- Census tracts identified in the 2017 DAC designation as disadvantaged, regardless of their scores in CalEnviroScreen 4.0
- Lands under the control of federally recognized Tribes
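The criteria above combine with a logical "or", which can be sketched as a single predicate. This is an illustration only: the Tract fields, the use of percentiles (75 and 95) to represent "highest 25 percent" and "highest 5 percent", and the single low-income limit are all simplifying assumptions, not the layer's actual schema or method.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tract:
    # All field names are hypothetical.
    median_household_income: float
    ces_overall_percentile: Optional[float]  # None when no overall score exists
    ces_burden_percentile: Optional[float]   # cumulative population burden
    dac_2017: bool                           # disadvantaged in the 2017 designation
    tribal_land: bool


def qualifies(tract, statewide_median_income, low_income_limit):
    """Return True if the tract meets any one of the listed definitions."""
    low_income = (
        tract.median_household_income <= 0.8 * statewide_median_income
        or tract.median_household_income <= low_income_limit
    )
    high_score = (
        tract.ces_overall_percentile is not None
        and tract.ces_overall_percentile >= 75
    )
    data_gap_high_burden = (
        tract.ces_overall_percentile is None
        and tract.ces_burden_percentile is not None
        and tract.ces_burden_percentile >= 95
    )
    return (
        low_income
        or high_score
        or data_gap_high_burden
        or tract.dac_2017
        or tract.tribal_land
    )
```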
https://fred.stlouisfed.org/legal/#copyright-public-domain
Graph and download economic data for New Private Housing Structures Authorized by Building Permits for Sierra County, CA (BPPRIV006091) from 1990 to 2024 about Sierra County, CA; permits; buildings; CA; private; housing; and USA.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset presents the household distribution across 16 income brackets among four distinct age groups in California: Under 25 years, 25-44 years, 45-64 years, and over 65 years. The dataset highlights the variation in household income, offering valuable insights into economic trends and disparities within different age categories, aiding in data analysis and decision-making.
Key observations
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com to check the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for California median household income by age.