Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘California Housing Prices Data (5 new features!)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/fedesoriano/california-housing-prices-data-extra-features on 28 January 2022.
--- Dataset description provided by original source is as follows ---
This dataset is a modified version of the California Housing Data used in the paper Pace, R. Kelley, and Ronald Barry. "Sparse spatial autoregressions." Statistics & Probability Letters 33.3 (1997): 291-297. It serves as an excellent introduction to implementing machine learning algorithms because it requires rudimentary data cleaning, has an easily understandable list of variables and sits at an optimal size between being too toyish and too cumbersome.
The data contains information from the 1990 California census. So although it may not help you with predicting current housing prices like the Zillow Zestimate dataset, it does provide an accessible introductory dataset for teaching people about the basics of machine learning.
This dataset includes 5 extra features defined by me: "Distance to coast", "Distance to Los Angeles", "Distance to San Diego", "Distance to San Jose", and "Distance to San Francisco". These extra features try to account for the distance to the nearest coast and the distance to the centre of the largest cities in California.
The distances were calculated using the Haversine formula with the Longitude and Latitude:
$$d = 2r \arcsin\left(\sqrt{\sin^2\left(\frac{\phi_2 - \phi_1}{2}\right) + \cos\phi_1 \cos\phi_2 \sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right)}\right)$$
where:
$\phi_1$ and $\phi_2$ are the Latitudes of point 1 and point 2, respectively
$\lambda_1$ and $\lambda_2$ are the Longitudes of point 1 and point 2, respectively
$r$ is the radius of the Earth (6371 km)
The data pertains to the houses found in a given California district and some summary stats about them based on the 1990 census data. The columns are as follows; their names are pretty self-explanatory:
1) Median House Value: Median house value for households within a block (measured in US Dollars) [$]
2) Median Income: Median income for households within a block of houses (measured in tens of thousands of US Dollars) [10k$]
3) Median Age: Median age of a house within a block; a lower number is a newer building [years]
4) Total Rooms: Total number of rooms within a block
5) Total Bedrooms: Total number of bedrooms within a block
6) Population: Total number of people residing within a block
7) Households: Total number of households, a group of people residing within a home unit, for a block
8) Latitude: A measure of how far north a house is; a higher value is farther north [°]
9) Longitude: A measure of how far west a house is; a higher value is farther west [°]
10) Distance to coast: Distance to the nearest coast point [m]
11) Distance to Los Angeles: Distance to the centre of Los Angeles [m]
12) Distance to San Diego: Distance to the centre of San Diego [m]
13) Distance to San Jose: Distance to the centre of San Jose [m]
14) Distance to San Francisco: Distance to the centre of San Francisco [m]
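The Haversine calculation described above can be sketched as follows. This is a minimal illustration, not the dataset author's code: the function name `haversine_km` and its parameter names are assumptions, and only the Earth radius (6371 km) comes from the description.

```python
import math

EARTH_RADIUS_KM = 6371.0  # radius quoted in the dataset description


def haversine_km(lat1, lon1, lat2, lon2, radius_km=EARTH_RADIUS_KM):
    """Great-circle distance in km between two points given in decimal degrees.

    Hypothetical helper for illustration only.
    """
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    half_dphi = (phi2 - phi1) / 2
    half_dlambda = math.radians(lon2 - lon1) / 2
    # Term under the square root in the Haversine formula.
    a = (math.sin(half_dphi) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(half_dlambda) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))
```

The distance columns are listed in metres, so a value from this sketch would be multiplied by 1000 before comparing it with them.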
This data was entirely modified and cleaned by me. The original data (without the distance features) was initially featured in the following paper: Pace, R. Kelley, and Ronald Barry. "Sparse spatial autoregressions." Statistics & Probability Letters 33.3 (1997): 291-297.
The original dataset can be found under the following link: https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Note: In these datasets, a person is defined as up to date if they have received at least one dose of an updated COVID-19 vaccine. The Centers for Disease Control and Prevention (CDC) recommends that certain groups, including adults ages 65 years and older, receive additional doses.
Starting on July 13, 2022, the denominator for calculating vaccine coverage has been changed from age 5+ to all ages to reflect new vaccine eligibility criteria. Previously the denominator was changed from age 16+ to age 12+ on May 18, 2021, then changed from age 12+ to age 5+ on November 10, 2021, to reflect previous changes in vaccine eligibility criteria. The previous datasets based on age 12+ and age 5+ denominators have been uploaded as archived tables.
Starting June 30, 2021, the dataset has been reconfigured so that all updates are appended to one dataset to make it easier for API and other interfaces. In addition, historical data has been extended back to January 5, 2021.
This dataset shows full, partial, and at least 1 dose coverage rates by zip code tabulation area (ZCTA) for the state of California. Data sources include the California Immunization Registry and the American Community Survey’s 2015-2019 5-Year data.
This is the data table for the LHJ Vaccine Equity Performance dashboard. However, this data table also includes ZCTAs that do not have a VEM score.
This dataset also includes Vaccine Equity Metric score quartiles (when applicable), which combine the Public Health Alliance of Southern California’s Healthy Places Index (HPI) measure with CDPH-derived scores to estimate factors that impact health, like income, education, and access to health care. ZCTAs range from less healthy community conditions in Quartile 1 to more healthy community conditions in Quartile 4.
The Vaccine Equity Metric is for weekly vaccination allocation and reporting purposes only. CDPH-derived quartiles should not be considered as indicative of the HPI score for these zip codes. CDPH-derived quartiles were assigned to zip codes excluded from the HPI score produced by the Public Health Alliance of Southern California due to concerns with statistical reliability and validity in populations smaller than 1,500 or where more than 50% of the population resides in a group setting.
These data do not include doses administered by the following federal agencies who received vaccine allocated directly from CDC: Indian Health Service, Veterans Health Administration, Department of Defense, and the Federal Bureau of Prisons.
For some ZCTAs, vaccination coverage may exceed 100%. This may be a result of many people from outside the county coming to that ZCTA to get their vaccine and providers reporting the county of administration as the county of residence, and/or the DOF estimates of the population in that ZCTA being too low. Please note that population numbers provided by DOF are projections and so may not be accurate, especially given unprecedented shifts in population as a result of the pandemic.
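The coverage arithmetic described above, a count of vaccinated people divided by a population estimate, can be sketched as follows. The function names and the input values are hypothetical and are not taken from the dataset; the sketch only shows why a low population estimate pushes coverage above 100%.

```python
def coverage_percent(people_vaccinated, population_estimate):
    """Coverage as a percentage of the estimated population (hypothetical helper)."""
    if population_estimate <= 0:
        raise ValueError("population_estimate must be positive")
    return 100.0 * people_vaccinated / population_estimate


def exceeds_full_coverage(people_vaccinated, population_estimate):
    """True when reported doses outnumber the estimated population."""
    return coverage_percent(people_vaccinated, population_estimate) > 100.0
```

A ZCTA flagged by `exceeds_full_coverage` is not necessarily a data error; as the description notes, it can reflect out-of-area recipients or a population projection that is too low.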
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Relative concentration of the estimated number of people in the Southern California region that live in a household defined as "low income." There are multiple ways to define low income. These data apply the most common standard: low income population consists of all members of households that collectively have income less than twice the federal poverty threshold that applies to their household type. Household type refers to the household's resident composition: the number of independent adults plus dependents that can be of any age, from children to elderly. For example, a household with four people (one working adult parent and three dependent children) has a different poverty threshold than a household comprised of four unrelated independent adults.
Due to high estimate uncertainty for many block group estimates of the number of people living in low income households, some records cannot be reliably assigned a class and class code comparable to those assigned to race/ethnicity data from the decennial Census.
"Relative concentration" is a measure that compares the proportion of population within each Census block group data unit to the proportion of all people that live within the 13,312 block groups in the Southern California RRK region. See the "Data Units" description below for how these relative concentrations are broken into categories in this "low income" metric.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Relative concentration of the Southern California region's Asian American population. The variable ASIANALN records all individuals who select Asian as their SOLE racial identity in response to the Census questionnaire, regardless of their response to the Hispanic ethnicity question. Both Hispanic and non-Hispanic in the Census questionnaire are potentially associated with the Asian race alone.
"Relative concentration" is a measure that compares the proportion of population within each Census block group data unit that identify as ASIANALN alone to the proportion of all people that live within the 13,312 block groups in the Southern California RRK region that identify as ASIANALN alone. Example: if 5.2% of people in a block group identify as ASIANALN, the block group has twice the proportion of ASIANALN individuals compared to the Southern California RRK region (2.6%), and more than three times the proportion compared to the entire state of California (1.6%). If the local proportion is twice the regional proportion, then ASIANALN individuals are highly concentrated locally.
This dataset contains counts of deaths for California counties based on information entered on death certificates. Final counts are derived from static data and include out-of-state deaths to California residents, whereas provisional counts are derived from incomplete and dynamic data. Provisional counts are based on the records available when the data was retrieved and may not represent all deaths that occurred during the time period. Deaths involving injuries from external or environmental forces, such as accidents, homicide and suicide, often require additional investigation that tends to delay certification of the cause and manner of death. This can result in significant under-reporting of these deaths in provisional data.
The final data tables include both deaths that occurred in each California county regardless of the place of residence (by occurrence) and deaths to residents of each California county (by residence), whereas the provisional data table only includes deaths that occurred in each county regardless of the place of residence (by occurrence). The data are reported as totals, as well as stratified by age, gender, race-ethnicity, and death place type. Deaths due to all causes (ALL) and selected underlying cause of death categories are provided. See temporal coverage for more information on which combinations are available for which years.
The cause of death categories are based solely on the underlying cause of death as coded by the International Classification of Diseases. The underlying cause of death is defined by the World Health Organization (WHO) as "the disease or injury which initiated the train of events leading directly to death, or the circumstances of the accident or violence which produced the fatal injury." It is a single value assigned to each death based on the details as entered on the death certificate. When more than one cause is listed, the order in which they are listed can affect which cause is coded as the underlying cause. This means that similar events could be coded with different underlying causes of death depending on variations in how they were entered. Consequently, while underlying cause of death provides a convenient comparison between cause of death categories, it may not capture the full impact of each cause of death as it does not always take into account all conditions contributing to the death.
"The Social Security Administration (SSA) suggested to USC to survey members of the public around these topics: What do people know about Social Security? How do people learn about Social Security and how do they want to learn about Social Security? How do adults use financial products as they age? How do adults make their financial decisions and where do they turn for advice? What are adults' main sources of financial stress? The results of the survey are available at the USC website below after logging in and being granted access by USC."
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Relative concentration of the Southern California region's Black/African American population. The variable BLACKALN records all individuals who select black or African American as their SOLE racial identity in response to the Census questionnaire, regardless of their response to the Hispanic ethnicity question. Both Hispanic and non-Hispanic in the Census questionnaire are potentially associated with black race alone.
"Relative concentration" is a measure that compares the proportion of population within each Census block group data unit that identify as Black/African American alone to the proportion of all people that live within the 13,312 block groups in the Southern California RRK region that identify as Black/African American alone. Example: if 5.2% of people in a block group identify as BLACKALN, the block group has twice the proportion of BLACKALN individuals compared to the Southern California RRK region (2.6%), and more than three times the proportion compared to the entire state of California (1.6%). If the local proportion is twice the regional proportion, then BLACKALN individuals are highly concentrated locally.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset tabulates the population of South Gate by race. It includes the population of South Gate across racial categories (excluding ethnicity) as identified by the Census Bureau. The dataset can be utilized to understand the population distribution of South Gate across relevant racial categories.
Key observations
The percent distribution of South Gate population by race (across all racial categories recognized by the U.S. Census Bureau): 23.87% are white, 1.02% are Black or African American, 1.75% are American Indian and Alaska Native, 0.79% are Asian, 0.08% are Native Hawaiian and other Pacific Islander, 44.97% are some other race and 27.51% are multiracial.
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
Racial categories include:
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for any of your research projects, reports or presentations, you can contact our research staff at research@neilsberg.com to check the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for South Gate Population by Race & Ethnicity. You can refer the same here
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Relative concentration of the Southern California region's Hispanic/Latino population. The variable HISPANIC records all individuals who select Hispanic or Latino in response to the Census questionnaire, regardless of their response to the racial identity question.
"Relative concentration" is a measure that compares the proportion of population within each Census block group data unit that identify as HISPANIC to the proportion of all people that live within the 13,312 block groups in the Southern California RRK region that identify as HISPANIC. Example: if 5.2% of people in a block group identify as HISPANIC, the block group has twice the proportion of HISPANIC individuals compared to the Southern California RRK region (2.6%), and more than three times the proportion compared to the entire state of California (1.6%). If the local proportion is twice the regional proportion, then HISPANIC individuals are highly concentrated locally.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset tabulates South Pasadena household income by gender. The dataset can be utilized to understand the gender-based income distribution in South Pasadena.
The dataset will have the following datasets when applicable
Please note: The 2020 1-Year ACS estimates data was not reported by the Census Bureau due to the impact on survey collection and analysis caused by COVID-19. Consequently, median household income data for 2020 is unavailable for large cities (population 65,000 and above).
Explore our comprehensive data analysis and visual representations for a deeper understanding of South Pasadena income distribution by gender. You can refer the same here
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Relative concentration of the Southern California region's HSPBIPOC population. The variable HSPBIPOC is equivalent to all individuals who select a combination of racial and ethnic identity in response to the Census questionnaire EXCEPT those who select "not Hispanic" for the ethnic identity question, and "white race alone" for the racial identity question. This is the most encompassing possible definition of racial and ethnic identities that may be associated with historic underservice by agencies, or be more likely to express environmental justice concerns (as compared to predominantly non-Hispanic white communities). Until 2021, federal agency guidance for considering environmental justice impacts of proposed actions focused on how the actions affected "racial or ethnic minorities." "Racial minority" is an increasingly meaningless concept in the USA, and particularly so in California, where only about 3/8 of the state's population identifies as non-Hispanic and white race alone - a clear majority of Californians identify as Hispanic and/or not white. Because many federal and state map screening tools continue to rely on "minority population" as an indicator for flagging potentially vulnerable / disadvantaged / underserved populations, our analysis includes the variable HSPBIPOC, which is effectively the "all minority" population according to the now outdated federal environmental justice direction. A more meaningful analysis for the potential impact of forest management actions on specific populations considers racial or ethnic populations individually: e.g., all people identifying as Hispanic regardless of race; all people identifying as American Indian, regardless of Hispanic ethnicity; etc.
"Relative concentration" is a measure that compares the proportion of population within each Census block group data unit that identify as HSPBIPOC alone to the proportion of all people that live within the 13,312 block groups in the Southern California RRK region that identify as HSPBIPOC alone. Example: if 5.2% of people in a block group identify as HSPBIPOC, the block group has twice the proportion of HSPBIPOC individuals compared to the Southern California RRK region (2.6%), and more than three times the proportion compared to the entire state of California (1.6%). If the local proportion is twice the regional proportion, then HSPBIPOC individuals are highly concentrated locally.
This dataset contains counts of live births for California counties based on information entered on birth certificates. Final counts are derived from static data and include out of state births to California residents, whereas provisional counts are derived from incomplete and dynamic data. Provisional counts are based on the records available when the data was retrieved and may not represent all births that occurred during the time period.
The final data tables include both births that occurred in California regardless of the place of residence (by occurrence) and births to California residents (by residence), whereas the provisional data table only includes births that occurred in California regardless of the place of residence (by occurrence). The data are reported as totals, as well as stratified by parent giving birth's age, parent giving birth's race-ethnicity, and birth place type. See temporal coverage for more information on which strata are available for which years.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset presents the detailed breakdown of the count of individuals within distinct income brackets, categorizing them by gender (men and women) and employment type - full-time (FT) and part-time (PT), offering valuable insights into the diverse income landscapes within South El Monte. The dataset can be utilized to gain insights into gender-based income distribution within the South El Monte population, aiding in data analysis and decision-making.
Key observations
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
Income brackets:
Variables / Data Columns
Employment type classifications include:
This dataset is a part of the main dataset for South El Monte median household income by race. You can refer the same here
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset presents the detailed breakdown of the count of individuals within distinct income brackets, categorizing them by gender (men and women) and employment type - full-time (FT) and part-time (PT), offering valuable insights into the diverse income landscapes within South Pasadena. The dataset can be utilized to gain insights into gender-based income distribution within the South Pasadena population, aiding in data analysis and decision-making.
Key observations
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
Income brackets:
Variables / Data Columns
Employment type classifications include:
This dataset is a part of the main dataset for South Pasadena median household income by race. You can refer the same here
https://www.california-demographics.com/terms_and_conditions
A dataset listing California counties by population for 2024.
This study sought evidence that a subset of people with dementia (PwD) have reliable memory for emotional events in their own lives, and that they differ from PwD whose memory for emotional life events is less reliable or unreliable in respect to their own disease stage, confabulation and neuropsychiatric behaviors, and awareness of their cognitive impairment. A cross-sectional study of 93 people with mild or moderate dementia (aged 55 and older) and a comparison group of 50 older adults was conducted. Memories of recent autobiographical events that had both positive and negative emotional content were elicited during a structured interview, designed for consistency with accepted forensic interviewing techniques. Accurate recollection of these events was independently verified by a non-demented informant, usually a family member. In addition, both members of the dyad were interviewed independently to assess other characteristics of people with dementia (PwD): demographics, depressive symptoms, functional and cognitive abilities, medications, health conditions, behaviors and characteristics of the dyadic relationship. Researchers also assessed PwD for disease stage, awareness of cognitive impairment, and episodic memory. A validated test of emotionally-influenced memory was administered to qualified participants to verify the novel structured interviewing assessment developed for this study. Two researchers conducted the study assessments during home visits. The data file contains 945 cases and 732 variables.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Relative concentration of the Southern California region's American Indian population. The variable AIAN_ALN_AND_MULTIRACEAIANALN includes BOTH individuals who select American Indian or Alaska Native as their sole racial identity (they only identify as American Indian), AND individuals who select American Indian / Alaska Native as one of two or more racial identities (they partly identify as American Indian) in response to the Census questionnaire. IMPORTANT: self-reported ancestry and Tribal membership are distinct identities, and one does not automatically imply the other. These data should not be interpreted as a distribution of "Tribal people." Numerous Rancherias in the Southern California region account for the wide distribution of very to extremely high concentrations of American Indians.
"Relative concentration" is a measure that compares the proportion of population within each Census block group data unit that identify as American Indian / Alaska Native alone to the proportion of all people that live within the 13,312 block groups in the Southern California RRK region that identify as American Indian / Alaska native alone. Example: if 5.2% of people in a block group identify as AIANALN, the block group has twice the proportion of AIANALN individuals compared to the Southern California RRK region (2.6%), and more than three times the proportion compared to the entire state of California (1.6%). If the local proportion is twice the regional proportion, then AIANALN individuals are highly concentrated locally.
https://doi.org/10.5061/dryad.qz612jmjt
Data description:
Annual spatial estimates of above ground live, standing dead, litter, and below ground biomass (g/m2) for 2001-2023 for southern California.
These raster layers were created by modeling field plot biomass to covariates, including precipitation, remotely sensed NDVI, and geophysical (slope, aspect, elevation) data.
For a more complete description, visit https://doi.org/10.5061/dryad.qz612jmjt
The biomass raster layers are packaged in zip files for each year using the following naming structure:
WWETAC_UCD_SoCal_Biomass_XXXX.zip
Where XXXX is the year of the biomass estimates. Within each zip file are the following files:
WWETAC_UCD_
Where
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The relative concentration of the Southern California region's population that identifies as "Multiracial", EXCEPT those with part-American Indian identity, in response to the Census questionnaire. "Relative concentration" is a measure that compares the proportion of the population within each Census block group data unit that identifies as Multiracial to the proportion of all people living within the 13,312 census block groups in the Southern California RRK region who identify as Multiracial. People with part-American Indian identity are not included here but are included in the American Indian or Alaska Native Race Alone and Multirace Population, described above.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘California Housing Prices Data (5 new features!)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/fedesoriano/california-housing-prices-data-extra-features on 28 January 2022.
--- Dataset description provided by original source is as follows ---
This dataset is a modified version of the California Housing Data used in the paper Pace, R. Kelley, and Ronald Barry. "Sparse spatial autoregressions." Statistics & Probability Letters 33.3 (1997): 291-297. It serves as an excellent introduction to implementing machine learning algorithms because it requires rudimentary data cleaning, has an easily understandable list of variables, and sits at an optimal size between being too toyish and too cumbersome.
The data contains information from the 1990 California census. So although it may not help you with predicting current housing prices like the Zillow Zestimate dataset, it does provide an accessible introductory dataset for teaching people about the basics of machine learning.
This dataset includes 5 extra features defined by me: "Distance to coast", "Distance to Los Angeles", "Distance to San Diego", "Distance to San Jose", and "Distance to San Francisco". These extra features try to account for the distance to the nearest coast and the distance to the centre of the largest cities in California.
The distances were calculated using the Haversine formula with the Longitude and Latitude:
$$d = 2r \arcsin\left(\sqrt{\sin^2\left(\frac{\phi_2 - \phi_1}{2}\right) + \cos\phi_1 \cos\phi_2 \sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right)}\right)$$
where:
$\phi_1$ and $\phi_2$ are the latitudes of point 1 and point 2, respectively
$\lambda_1$ and $\lambda_2$ are the longitudes of point 1 and point 2, respectively
$r$ is the radius of the Earth (6371 km)
The data pertains to the houses found in a given California district and some summary statistics about them based on the 1990 census data. The columns are as follows; their names are largely self-explanatory:
1) Median House Value: Median house value for households within a block (measured in US Dollars) [$]
2) Median Income: Median income for households within a block of houses (measured in tens of thousands of US Dollars) [10k$]
3) Median Age: Median age of a house within a block; a lower number is a newer building [years]
4) Total Rooms: Total number of rooms within a block
5) Total Bedrooms: Total number of bedrooms within a block
6) Population: Total number of people residing within a block
7) Households: Total number of households, a group of people residing within a home unit, for a block
8) Latitude: A measure of how far north a house is; a higher value is farther north [°]
9) Longitude: A measure of how far west a house is; California longitudes are negative, so a lower (more negative) value is farther west [°]
10) Distance to coast: Distance to the nearest coast point [m]
11) Distance to Los Angeles: Distance to the centre of Los Angeles [m]
12) Distance to San Diego: Distance to the centre of San Diego [m]
13) Distance to San Jose: Distance to the centre of San Jose [m]
14) Distance to San Francisco: Distance to the centre of San Francisco [m]
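As a rough illustration of how the distance columns could be derived from Latitude and Longitude, here is a minimal Python sketch of the Haversine formula using the Earth radius stated above (6371 km), returning metres to match the [m] units of the columns. The function name `haversine_m` and the city-centre coordinates are assumptions made for this example; the source does not say which reference points were actually used.

```python
import math

EARTH_RADIUS_M = 6_371_000  # 6371 km from the description, expressed in metres


def haversine_m(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_M):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point rounding before asin.
    return 2 * radius * math.asin(math.sqrt(min(1.0, a)))


# Assumed (approximate) city-centre coordinates, for illustration only.
LOS_ANGELES = (34.0522, -118.2437)
SAN_FRANCISCO = (37.7749, -122.4194)

d = haversine_m(*LOS_ANGELES, *SAN_FRANCISCO)
print(f"Los Angeles to San Francisco: {d / 1000:.0f} km")
```

Applying the same function row by row, with a block's Latitude and Longitude as the first point, would give values comparable to columns 11 to 14; the distance to the coast would additionally need a set of coastline points, which the description does not provide.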
This data was entirely modified and cleaned by me. The original data (without the distance features) was initially featured in the following paper: Pace, R. Kelley, and Ronald Barry. "Sparse spatial autoregressions." Statistics & Probability Letters 33.3 (1997): 291-297.
The original dataset can be found under the following link: https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html
--- Original source retains full ownership of the source dataset ---