62 datasets found
  1. House Price Prediction Dataset

    • kaggle.com
    zip
    Updated Sep 21, 2024
    Cite
    Zafar (2024). House Price Prediction Dataset [Dataset]. https://www.kaggle.com/datasets/zafarali27/house-price-prediction-dataset
    Explore at:
    Available download formats: zip (29,372 bytes)
    Dataset updated
    Sep 21, 2024
    Authors
    Zafar
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    House Price Prediction Dataset.

    The dataset contains 2000 rows of house-related data, representing various features that could influence house prices. Below, we discuss key aspects of the dataset, which include its structure, the choice of features, and potential use cases for analysis.

    1. Dataset Features

    The dataset is designed to capture essential attributes for predicting house prices, including:

    • Area: Square footage of the house, generally one of the most important predictors of price.
    • Bedrooms & Bathrooms: The number of rooms significantly affects value; homes with more rooms tend to be priced higher.
    • Floors: The number of floors can indicate a larger, more luxurious home, potentially raising the price.
    • Year Built: The age of the house affects its condition and value; newly built houses are generally more expensive than older ones.
    • Location: Houses in desirable locations such as downtown or urban areas tend to be priced higher than those in suburban or rural areas.
    • Condition: The current condition of the house is critical; well-maintained houses (in 'Excellent' or 'Good' condition) attract higher prices than houses in 'Fair' or 'Poor' condition.
    • Garage: Availability of a garage can increase the price due to added convenience and space.
    • Price: The target variable, representing the sale price of the house, used to train machine learning models to predict prices from the other features.

    2. Feature Distributions

    • Area Distribution: Areas range from 500 to 5,000 square feet, allowing analysis across different types of homes, from smaller apartments to larger luxury houses.
    • Bedrooms and Bathrooms: Bedrooms vary from 1 to 5 and bathrooms from 1 to 4, enabling analysis of homes with different sizes and layouts.
    • Floors: Houses have between 1 and 3 floors, useful for identifying the influence of multi-level homes on prices.
    • Year Built: Houses were built between 1900 and 2023, giving a wide range of ages for analyzing new vs. older construction.
    • Location: A mix of urban, suburban, downtown, and rural locations; urban and downtown homes may command higher prices due to proximity to amenities.
    • Condition: Houses are labeled 'Excellent', 'Good', 'Fair', or 'Poor', which helps model price differences based on the current state of the house.
    • Price Distribution: Prices range from $50,000 to $1,000,000, a broad spectrum that makes the dataset appropriate for predicting everything from affordable homes to luxury properties.

    3. Correlation Between Features

    A key area of interest is the relationship between the features and house price; a quick pandas check is sketched below:
    • Area and Price: A strong positive correlation is typically expected between house size and price; larger homes are likely to be more expensive.
    • Location and Price: Houses in urban or downtown areas may show higher average prices than suburban and rural locations.
    • Condition and Price: Condition should correlate positively with price; houses in better condition should be priced higher, as they require less maintenance and repair.
    • Year Built and Price: Newer houses might command higher prices due to better construction standards, modern amenities, and less wear and tear, though some older homes in good condition may retain historical value.
    • Garage and Price: A house with a garage may be more expensive than one without, as it provides extra storage or parking space.
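
    The file and column names below are assumptions based on the feature list above, so treat this as a minimal sketch rather than the uploader's code:

      import pandas as pd

      df = pd.read_csv("house_price_prediction.csv")  # hypothetical file name

      # Pearson correlation of each numeric feature with Price.
      numeric_cols = ["Area", "Bedrooms", "Bathrooms", "Floors", "YearBuilt", "Price"]
      print(df[numeric_cols].corr()["Price"].sort_values(ascending=False))

      # Categorical features (Location, Condition, Garage) compared via group means.
      print(df.groupby("Location")["Price"].mean().sort_values(ascending=False))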

    4. Potential Use Cases

    The dataset is well-suited for various machine learning and data analysis applications, including:

    • House Price Prediction: Using regression techniques, the dataset can be used to build a model that predicts house prices from the available features (see the sketch after this list).
    • Feature Importance Analysis: Feature importance ranking can determine which features (e.g., location, area, or condition) have the greatest impact on house prices.
    • Clustering: Techniques like k-means could identify patterns in the data, such as grouping houses into segments based on their characteristics (e.g., luxury homes, affordable homes).
    • Market Segmentation: Segmentation by location, price range, or house type supports analysis of trends in specific sub-markets, like luxury vs. affordable housing.
    • Time-Based Analysis: Studying how prices vary with year built or house age yields insights into trends for older vs. newer homes.
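
    A hedged sketch of the first two use cases, again with assumed file and column names:

      import pandas as pd
      from sklearn.ensemble import RandomForestRegressor
      from sklearn.metrics import mean_absolute_error
      from sklearn.model_selection import train_test_split

      df = pd.read_csv("house_price_prediction.csv")  # hypothetical file name

      # One-hot encode the categorical features described above.
      X = pd.get_dummies(df.drop(columns=["Price"]),
                         columns=["Location", "Condition", "Garage"])
      y = df["Price"]
      X_train, X_test, y_train, y_test = train_test_split(
          X, y, test_size=0.2, random_state=42)

      model = RandomForestRegressor(n_estimators=200, random_state=42)
      model.fit(X_train, y_train)
      print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))

      # Feature importance ranking, as suggested above.
      print(pd.Series(model.feature_importances_, index=X.columns).nlargest(10))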

    5. Limitations and ...

  2. housing

    • kaggle.com
    zip
    Updated Sep 22, 2023
    Cite
    HappyRautela (2023). housing [Dataset]. https://www.kaggle.com/datasets/happyrautela/housing
    Explore at:
    Available download formats: zip (809,785 bytes)
    Dataset updated
    Sep 22, 2023
    Authors
    HappyRautela
    Description

    The exercise below contains questions based on the housing dataset.

    1. How many houses have a waterfront? a. 21000 b. 21450 c. 163 d. 173

    2. How many houses have 2 floors? a. 2692 b. 8241 c. 10680 d. 161

    3. How many houses built before 1960 have a waterfront? a. 80 b. 7309 c. 90 d. 92

    4. What is the price of the most expensive house having more than 4 bathrooms? a. 7700000 b. 187000 c. 290000 d. 399000

    5. If the ‘price’ column contains outliers, how can you clean the data and remove the redundant values? a. Calculate the IQR range and drop the values outside the range. b. Calculate the p-value and remove the values less than 0.05. c. Calculate the correlation coefficient of the price column and remove the values less than the correlation coefficient. d. Calculate the Z-score of the price column and remove the values less than the z-score.

    6. What are the various parameters that can be used to determine the dependent variables in the housing data to determine the price of the house? a. Correlation coefficients b. Z-score c. IQR Range d. Range of the Features

    7. If we get the r2 score as 0.38, what inferences can we make about the model and its efficiency? a. The model is 38% accurate, and shows poor efficiency. b. The model is showing 0.38% discrepancies in the outcomes. c. Low difference between observed and fitted values. d. High difference between observed and fitted values.

    8. If the metrics show that the p-value for the grade column is 0.092, what inferences can we make about the grade column? a. Significant in presence of other variables. b. Highly significant in presence of other variables c. Insignificant in presence of other variables d. None of the above

    9. If the Variance Inflation Factor value for a feature is considerably higher than the other features, what can we say about that column/feature? a. High multicollinearity b. Low multicollinearity c. Both A and B d. None of the above
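
    Question 5's accepted answer (a) corresponds to the standard 1.5 x IQR rule; a minimal pandas sketch (file and column names assumed):

      import pandas as pd

      def drop_iqr_outliers(df: pd.DataFrame, col: str) -> pd.DataFrame:
          # Drop rows whose value in `col` falls outside the 1.5*IQR fences.
          q1, q3 = df[col].quantile([0.25, 0.75])
          iqr = q3 - q1
          return df[df[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

      housing = pd.read_csv("housing.csv")         # hypothetical file name
      clean = drop_iqr_outliers(housing, "price")  # answer (a) from question 5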

  3. United States Housing Starts

    • tradingeconomics.com
    • zh.tradingeconomics.com
    • +13more
    csv, excel, json, xml
    Updated Sep 17, 2025
    Cite
    TRADING ECONOMICS (2025). United States Housing Starts [Dataset]. https://tradingeconomics.com/united-states/housing-starts
    Explore at:
    Available download formats: json, excel, csv, xml
    Dataset updated
    Sep 17, 2025
    Dataset authored and provided by
    TRADING ECONOMICS
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 31, 1959 - Aug 31, 2025
    Area covered
    United States
    Description

    Housing Starts in the United States decreased to 1307 Thousand units in August from 1429 Thousand units in July of 2025. This dataset provides the latest reported value for - United States Housing Starts - plus previous releases, historical high and low, short-term forecast and long-term prediction, economic calendar, survey consensus and news.

  4. Housing Prices Dataset

    • kaggle.com
    zip
    Updated Jan 12, 2022
    Cite
    M Yasser H (2022). Housing Prices Dataset [Dataset]. https://www.kaggle.com/datasets/yasserh/housing-prices-dataset
    Explore at:
    Available download formats: zip (4,740 bytes)
    Dataset updated
    Jan 12, 2022
    Authors
    M Yasser H
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description


    Description:

    A simple yet challenging project: predict the housing price based on factors such as house area, number of bedrooms, furnishing, and proximity to the main road. The dataset is small, yet its complexity arises from strong multicollinearity. Can you overcome these obstacles and build a decent predictive model?

    Acknowledgement:

    Harrison, D. and Rubinfeld, D.L. (1978). Hedonic prices and the demand for clean air. Journal of Environmental Economics and Management, 5, 81–102.
    Belsley, D.A., Kuh, E. and Welsch, R.E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: Wiley.

    Objective:

    • Understand the dataset and clean it up (if required).
    • Build regression models to predict the price with respect to a single feature and multiple features.
    • Evaluate the models and compare their respective scores (R2, RMSE, etc.); a minimal sketch follows.
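
    A minimal sketch of the stated objective, assuming the usual lowercase column names of this Kaggle file (an assumption, not the author's code):

      import numpy as np
      import pandas as pd
      from sklearn.linear_model import LinearRegression
      from sklearn.metrics import mean_squared_error, r2_score
      from sklearn.model_selection import train_test_split

      df = pd.read_csv("Housing.csv")  # hypothetical file name

      # Single-feature and multi-feature regression, evaluated with R2 and RMSE.
      y = df["price"]
      for features in (["area"], ["area", "bedrooms", "bathrooms"]):
          X_train, X_test, y_train, y_test = train_test_split(
              df[features], y, test_size=0.2, random_state=0)
          model = LinearRegression().fit(X_train, y_train)
          pred = model.predict(X_test)
          rmse = np.sqrt(mean_squared_error(y_test, pred))
          print(features, "R2:", round(r2_score(y_test, pred), 3),
                "RMSE:", round(float(rmse)))
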
  5. United States New Home Sales

    • tradingeconomics.com
    • it.tradingeconomics.com
    • +13more
    csv, excel, json, xml
    Updated Sep 24, 2025
    Cite
    TRADING ECONOMICS (2025). United States New Home Sales [Dataset]. https://tradingeconomics.com/united-states/new-home-sales
    Explore at:
    Available download formats: csv, json, excel, xml
    Dataset updated
    Sep 24, 2025
    Dataset authored and provided by
    TRADING ECONOMICS
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 31, 1963 - Aug 31, 2025
    Area covered
    United States
    Description

    New Home Sales in the United States increased to 800 Thousand units in August from 664 Thousand units in July of 2025. This dataset provides the latest reported value for - United States New Home Sales - plus previous releases, historical high and low, short-term forecast and long-term prediction, economic calendar, survey consensus and news.

  6. Housing Price Prediction using DT and RF in R

    • kaggle.com
    zip
    Updated Aug 31, 2023
    Cite
    vikram amin (2023). Housing Price Prediction using DT and RF in R [Dataset]. https://www.kaggle.com/datasets/vikramamin/housing-price-prediction-using-dt-and-rf-in-r
    Explore at:
    Available download formats: zip (629,100 bytes)
    Dataset updated
    Aug 31, 2023
    Authors
    vikram amin
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description
    • Objective: To predict the prices of houses in the City of Melbourne
    • Approach: Decision Tree and Random Forest (in R)
    • Data Cleaning:
    • The Date column is read as a character vector and converted to a date vector using the 'lubridate' library
    • We create a new column, age, since the age of a house can be a factor in its price. We extract the year from the 'Date' column and subtract the 'Year Built' column from it
    • We remove 11,566 records that have missing values
    • We drop columns that are not significant, such as 'X', 'suburb', 'address' (we keep zipcode, which serves the purpose of suburb and address), 'type', 'method', 'SellerG', 'date', 'Car', 'year built', 'Council Area', 'Region Name'
    • We split the data into 'train' and 'test' in an 80/20 ratio using the sample function
    • Load the libraries 'rpart', 'rpart.plot', 'rattle', 'RColorBrewer'
    • Run a decision tree using the rpart function, with 'Price' as the dependent variable
    • The average price across 5,464 houses is $1,084,349
    • Where building area is less than 200.5, the average price across 4,582 houses is $931,445; where building area is less than 200.5 and the building is less than 67.5 years old, the average price across 3,385 houses is $799,299.6
    • The highest average price, $4,801,538 across 13 houses, occurs where distance is lower than 5.35 and building area is greater than 280.5
    • We use the caret package to tune the complexity parameter; the optimal value is 0.01 with RMSE 445,197.9
    • Using the Metrics library, RMSE is $392,107, MAPE is 0.297 (a mean absolute percentage error of about 29.7%), and MAE is $272,015.4
    • 'Postcode', longitude, and building area are the most important variables
    • test$Price holds the actual price and test$predicted the predicted price for six sample houses
    • We apply random forest with default parameters to the train data
    • Variable importance indicates that 'Building Area', age of the house, and 'Distance' are the variables that most affect price
    • With the default parameters, RMSE is $250,426.2, MAPE is 0.147 (about 14.7% error), and MAE is $151,657.7
    • Error levels off between 100 and 200 trees, with almost no reduction thereafter, so ntree = 200 is a reasonable choice
    • Tuning shows mtry = 3 has the lowest out-of-bag error
    • Using the caret package with 5-fold cross-validation: RMSE is $252,216.10, MAPE is 0.146 (about 14.6% error), MAE is $151,669.4
    • We conclude that Random Forest gives more accurate results than Decision Tree
    • In Random Forest, the default ntree = 500 gives lower RMSE and MAPE than ntree = 200, so we proceed with the default parameters; a Python scikit-learn equivalent of this workflow is sketched below
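
    The original workflow is in R (rpart, randomForest, caret); the following is a rough Python/scikit-learn equivalent, with file and column names assumed from the Melbourne housing data rather than taken from the author's script:

      import pandas as pd
      from sklearn.ensemble import RandomForestRegressor
      from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error
      from sklearn.model_selection import train_test_split
      from sklearn.tree import DecisionTreeRegressor

      df = pd.read_csv("melb_data.csv")  # hypothetical file name

      # Age of the house = sale year - year built, as described above.
      df["Age"] = pd.to_datetime(df["Date"], dayfirst=True).dt.year - df["YearBuilt"]
      df = df.dropna(subset=["Price", "BuildingArea", "Age", "Distance", "Postcode"])

      X = df[["BuildingArea", "Age", "Distance", "Postcode"]]
      y = df["Price"]
      X_train, X_test, y_train, y_test = train_test_split(
          X, y, test_size=0.2, random_state=1)

      for name, model in [("decision tree", DecisionTreeRegressor(random_state=1)),
                          ("random forest", RandomForestRegressor(n_estimators=500,
                                                                  random_state=1))]:
          model.fit(X_train, y_train)
          pred = model.predict(X_test)
          print(name, "MAE:", round(mean_absolute_error(y_test, pred)),
                "MAPE:", round(mean_absolute_percentage_error(y_test, pred), 3))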
  7. Median house price (affordability ratios) - WMCA

    • cityobservatory.birmingham.gov.uk
    csv, excel, geojson +1
    Updated Dec 3, 2025
    Cite
    (2025). Median house price (affordability ratios) - WMCA [Dataset]. https://cityobservatory.birmingham.gov.uk/explore/dataset/median-house-price-affordability-ratios-wmca/
    Explore at:
    Available download formats: excel, geojson, json, csv
    Dataset updated
    Dec 3, 2025
    License

    Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
    License information was derived automatically

    Description

    This is the unadjusted median house price for residential property sales (transactions) in the area over a 12-month period with April in the middle (year ending September). These figures have been produced by the ONS (Office for National Statistics) using the Land Registry (LR) Price Paid data on residential dwelling transactions.

    The LR Price Paid data are comprehensive in that they capture changes of ownership for individual residential properties which have sold for full market value and covers both cash sales and those involving a mortgage.

    The median is the value determined by putting all the house sales for a given year, area and type in order of price and then selecting the price of the house sale which falls in the middle. The median is less susceptible to distortion by the presence of extreme values than is the mean. It is the most appropriate average to use because it best takes account of the skewed distribution of house prices.

    Note that a transaction occurs when a change of freeholder or leaseholder takes place regardless of the amount of money involved and a property can transact more than once in the time period.

    The LR records the actual price for which the property changed hands. This will usually be an accurate reflection of the market value for the individual property, but it is not always the case. In order to generate statistics that more accurately reflect market values, the LR has excluded records of houses that were not sold at market value from the dataset. The remaining data are considered a good reflection of market values at the time of the transaction. For full details of exclusions and more information on the methodology used to produce these statistics please see http://www.ons.gov.uk/peoplepopulationandcommunity/housing/qmis/housepricestatisticsforsmallareasqmi

    The LR Price Paid data are not adjusted to reflect the mix of houses in a given area. Fluctuations in the types of house that are sold in that area can cause differences between the median transactional value of houses and the overall market value of houses. Therefore these statistics differ to the new UK House Price Index (HPI) which reports mix-adjusted average house prices and house price indices.

    If, for a given year, for house type and area there were fewer than 5 sales records in the LR Price Paid data, the house price statistics are not reported. Data is Powered by LG Inform Plus and automatically checked for new data on the 3rd of each month.
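
    The median-with-suppression rule described above takes only a few lines of pandas (the transaction-level column names are assumptions):

      import pandas as pd

      # sales: one row per transaction, with columns year, area, house_type, price.
      def median_price(sales: pd.DataFrame) -> pd.Series:
          g = sales.groupby(["year", "area", "house_type"])["price"]
          # Suppress any cell with fewer than 5 sales, per the note above.
          return g.median().where(g.count() >= 5)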

  8. Integrated Building Health Management

    • catalog.data.gov
    • s.cnmilf.com
    Updated Apr 10, 2025
    Cite
    Dashlink (2025). Integrated Building Health Management [Dataset]. https://catalog.data.gov/dataset/integrated-building-health-management
    Explore at:
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Dashlink
    Description

    Abstract: Building health management is an important part of running an efficient and cost-effective building. Many problems in a building's systems can go undetected for long periods of time, leading to expensive repairs or wasted resources. This project aims to help detect and diagnose a building's health throughout the day with data-driven methods. Orca and IMS are two state-of-the-art algorithms that observe an array of building health sensors and provide feedback on the overall system's health, as well as localize a problem to one, or possibly two, components. With this level of feedback, the hope is to quickly identify problems and provide appropriate maintenance while reducing the number of complaints and service calls.

    Introduction: To prepare these technologies for the new installation, the proposed methods are being tested on a current system that behaves similarly to the future green building. Building 241 was determined to best resemble the proposed building 232 and was therefore chosen for this study. Building 241 is outfitted with 34 sensors that monitor the heating and cooling temperatures for the air and water systems, as well as various other subsystem states. The daily sensor recordings were logged and sent to the IDU group for analysis. The period of analysis was July 1st through August 10th, 2009.

    Methodology: The two algorithms used for analysis were Orca and IMS. Both look for anomalies using a distance-based scoring approach. Orca can take a single data set and find outliers within it; this tactic was applied to each day. After scoring each time sample throughout a given day, the Orca score profiles were compared by computing the correlation against all other days. Days with high overall correlations were considered normal, while days with lower overall correlations were more anomalous. IMS, on the other hand, needs a normal set of data to build a model, which can then be applied to a set of test data to assess how anomalous that data set is. The typical days identified by Orca were used as the reference/training set for IMS, while all other days were passed through IMS, resulting in an anomaly score profile for each day. The mean of the IMS score profile was then calculated for each day to produce a summary IMS score. These summary scores were ranked and the top outliers identified (see Figure 1). Once the anomalies were identified, the contributing parameters were ranked by the algorithm.

    Analysis: The contributing parameters identified by IMS were localized to the return air temperature duct system:
    • 7/03/09 (Figures 2 & 3): AHU-1 Return Air Temperature (RAT); Calculated Average Return Air Temperature
    • 7/19/09 (Figures 3 & 4): AHU-2 Return Air Temperature (RAT); Calculated Average Return Air Temperature
    IMS identified significantly higher temperatures compared to other days during July and August.

    Conclusion: Orca and IMS were able to pick up significant anomalies in the building system and diagnose them by identifying the sensor values that were anomalous. In the future these methods can be used on live streaming data to produce a real-time anomaly score, helping building maintenance with detection and diagnosis of problems.
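
    Orca and IMS themselves are not distributed with this dataset; the following is a simplified stand-in for the distance-based scoring and day-correlation idea described in the Methodology, not the actual algorithms:

      import numpy as np
      from sklearn.neighbors import NearestNeighbors

      def day_score_profile(day: np.ndarray, k: int = 5) -> np.ndarray:
          # Distance-based anomaly score per time sample:
          # mean distance to the k nearest neighbours (self excluded).
          nn = NearestNeighbors(n_neighbors=k + 1).fit(day)
          dist, _ = nn.kneighbors(day)
          return dist[:, 1:].mean(axis=1)

      def day_typicality(days: np.ndarray) -> np.ndarray:
          # days: shape (n_days, samples_per_day, n_sensors).
          profiles = np.array([day_score_profile(d) for d in days])
          corr = np.corrcoef(profiles)   # correlate each day's profile with the others
          return corr.mean(axis=1)       # low mean correlation => more anomalous day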

  9. Housing Affordability

    • catalog.dvrpc.org
    csv
    Updated Mar 17, 2025
    Cite
    DVRPC (2025). Housing Affordability [Dataset]. https://catalog.dvrpc.org/dataset/housing-affordability
    Explore at:
    Available download formats: csv (17,918 bytes), csv (11,692 bytes), csv (22,352 bytes), csv (8,938 bytes), csv (6,237 bytes), csv (4,449 bytes), csv (2,636 bytes), csv (4,792 bytes), csv (1,396 bytes), csv (1,368 bytes), csv (2,548 bytes)
    Dataset updated
    Mar 17, 2025
    Dataset authored and provided by
    DVRPC
    License

    https://catalog.dvrpc.org/dvrpc_data_license.html

    Description

    A commonly accepted threshold for affordable housing costs at the household level is 30% of a household's income. Accordingly, a household is considered cost burdened if it pays more than 30% of its income on housing. Households paying more than 50% are considered severely cost burdened. These thresholds apply to both homeowners and renters.

    The Housing Affordability indicator only measures cost burden among the region's households, and not the supply of affordable housing. The directionality of cost burden trends can be impacted by changes in both income and housing supply. If lower income households are priced out of a county or the region, it would create a downward trend in cost burden, but would not reflect a positive trend for an inclusive housing market.
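
    Flagging cost-burdened households from these thresholds is straightforward in pandas (column names are assumptions):

      import pandas as pd

      # households: assumed columns annual_income and annual_housing_cost.
      def burden_flags(households: pd.DataFrame) -> pd.DataFrame:
          share = households["annual_housing_cost"] / households["annual_income"]
          households["cost_burdened"] = share > 0.30
          households["severely_burdened"] = share > 0.50
          return households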

  10. Integrated Building Health Management - Dataset - NASA Open Data Portal

    • data.nasa.gov
    Updated Mar 31, 2025
    Cite
    nasa.gov (2025). Integrated Building Health Management - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/integrated-building-health-management
    Explore at:
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    Abstract, introduction, methodology, analysis, and conclusion are identical to dataset 8 (Integrated Building Health Management) above.

  11. Home For Everyone Tracker Open Data

    • opendata.cityofboise.org
    • housing-data-portal-boise.hub.arcgis.com
    • +1more
    Updated Jul 5, 2023
    Cite
    City of Boise, Idaho (2023). Home For Everyone Tracker Open Data [Dataset]. https://opendata.cityofboise.org/documents/ffead1f0bfc947dfad961a0fdedfab6a
    Explore at:
    Dataset updated
    Jul 5, 2023
    Dataset authored and provided by
    City of Boise, Idaho
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    A Home for Everyone is the City of Boise’s (“city”) initiative to address needs in the community by supporting the development and preservation of housing affordable to residents on Boise budgets. A Home for Everyone has three core goals: produce new homes affordable at 60% of area median income, create permanent supportive housing for households experiencing homelessness, and preserve homes affordable at 80% of area median income. This dataset includes information about all homes that count toward the city’s Home for Everyone goals.

    While the “produce affordable housing” and “create permanent supportive housing” goals focus on supporting the development of new housing, the preservation goal focuses on keeping existing housing affordable. As a result, many of the data fields related to new development are not relevant to preservation projects; for example, zoning incentives apply only to new construction projects.

    Data may be unavailable for some projects and details are subject to change until construction is complete. Addresses are excluded for projects with fewer than five homes for privacy reasons.

    The dataset includes details on the number of “homes”. We use the word "home" to refer to any single unit of housing regardless of size, type, or whether it is rented or owned. For example, a building with 40 apartments counts as 40 homes, and a single detached house counts as one home.

    The dataset includes details about the phase of each project when a project involves constructing new housing. The process for building a new development is as follows: First, one must receive approval from the city’s Planning Division, which is also known as being “entitled.” Next, one must apply for and receive a permit from the city’s Building Division before beginning construction. Finally, once construction is complete and all city inspections have been passed, the building can be occupied.

    To contribute to a city goal, homes must meet affordability requirements based on a standard called area median income. The city considers housing affordable if it is targeted to households earning at or below 80% of the area median income. For a three-person household in Boise, that equates to an annual income of $60,650 and a monthly housing cost of $1,516 (a quick arithmetic check follows the field list below). Deeply affordable housing sets the income limit at 60% of area median income, or even 30% of area median income. See Boise Income Guidelines for more details.

    The dataset contains the following fields:
    • Project Name – The name of each project. If a row is related to the Home Improvement Loan program, that row aggregates data for all homes that received a loan in that quarter or year.
    • Primary Address – The primary address for the development. Some developments encompass multiple addresses.
    • Project Address(es) – All addresses that are included as part of the development project.
    • Parcel Number(s) – The identification code for all parcels of land included in the development.
    • Acreage – The number of acres for the parcel(s) included in the project.
    • Planning Permit Number – The identification code for all permits the development has received from the Planning Division for the City of Boise. The number and types of permits required vary based on the location and type of development.
    • Date Entitled – The date a development was approved by the city’s Planning Division.
    • Building Permit Number – The identification code for all permits the development has received from the city’s Building Division.
    • Date Building Permit Issued – Building permits are required to begin construction on a development.
    • Date Final Certificate of Occupancy Issued – A certificate of occupancy is the final approval by the city for a development, once construction is complete. Not all developments require a certificate of occupancy.
    • Studio – The number of homes in the development that are classified as studios. A studio is typically defined as a home in which there is no separate bedroom; a single room serves as both bedroom and living room.
    • 1-Bedroom – The number of homes in a development that have exactly one bedroom.
    • 2-Bedroom – The number of homes in a development that have exactly two bedrooms.
    • 3-Bedroom – The number of homes in a development that have exactly three bedrooms.
    • 4+ Bedroom – The number of homes in a development that have four or more bedrooms.
    • # of Total Project Units – The total number of homes in the development.
    • # of units toward goals – The number of homes in a development that contribute to either the city’s goal to produce housing affordable at or under 60% of area median income, or the city’s goal to create permanent supportive housing for households experiencing homelessness.
    • Rent at or under 60% AMI – The number of homes in a development that are required to be rented at or below 60% of area median income.
    • Rent 61-80% AMI – The number of homes in a development that are required to be rented at between 61% and 80% of area median income.
    • Rent 81-120% AMI – The number of homes in a development that are required to be rented at between 81% and 120% of area median income.
    • Own at or under 60% AMI – The number of homes in a development that are required to be sold at or below 60% of area median income.
    For all AMI fields, see the explanation of area median income above or see Boise Income Guidelines for more details; Boise defines a home as “affordable” if it is rented or sold at or below 80% of area median income.
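
    As promised above, a quick check of the affordability arithmetic, assuming the standard rule that affordable housing costs at most 30% of gross income:

      income_80_ami = 60_650              # 80% AMI, three-person household (from the text)
      monthly_cap = income_80_ami * 0.30 / 12
      print(round(monthly_cap, 2))        # 1516.25, matching the $1,516 quoted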

  12. An Early Warning System to Predict Speculative House Price Bubbles [Dataset]...

    • datamed.org
    • dataverse.harvard.edu
    Cite
    An Early Warning System to Predict Speculative House Price Bubbles [Dataset] [Dataset]. https://datamed.org/display-item.php?repository=0012&idName=ID&id=56d4b855e4b0e644d3132896
    Explore at:
    Description

    In this paper, the authors construct country-specific chronologies of house price bubbles for 12 OECD countries over the period 1969:Q1–2009:Q4. These chronologies are obtained using a combination of a fundamental approach and a filter approach; the resulting speculative bubble chronology is the one that provides the highest concordance between the two techniques. In addition, the authors suggest an early warning system based on three alternative approaches: a signalling approach, a logit model, and a probit model. It is shown that the latter two models allow much more accurate predictions of house price bubbles than the signalling approach. Furthermore, the predictive accuracy of the logit and probit models is high enough to make them useful in forecasting future speculative bubbles in the housing market. Thus, this method can be used by policymakers in their attempts to quickly detect house price bubbles and attenuate their devastating effects on the domestic and world economy.
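
    As a rough illustration of the logit variant of such an early warning system (synthetic stand-in data; not the authors' specification):

      import numpy as np
      from sklearn.linear_model import LogisticRegression

      # X: fundamentals per country-quarter (e.g. price-to-rent, price-to-income,
      # interest rates); y: 1 if the quarter falls inside a bubble chronology.
      rng = np.random.default_rng(0)
      X = rng.normal(size=(2000, 3))
      y = (X @ np.array([1.2, 0.8, -0.5]) + rng.normal(size=2000) > 1.0).astype(int)

      logit = LogisticRegression().fit(X, y)
      bubble_prob = logit.predict_proba(X)[:, 1]  # early-warning signal: P(bubble)
      warning = bubble_prob > 0.5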

  13. Public Housing Buildings

    • data.lojic.org
    • impactmap-smudallas.hub.arcgis.com
    • +2more
    Updated Feb 24, 2016
    Cite
    Department of Housing and Urban Development (2016). Public Housing Buildings [Dataset]. https://data.lojic.org/maps/HUD::public-housing-buildings-2
    Explore at:
    Dataset updated
    Feb 24, 2016
    Dataset provided by
    United States Department of Housing and Urban Development (http://www.hud.gov/)
    Authors
    Department of Housing and Urban Development
    Area covered
    Description

    HUD administers Federal aid to local Housing Agencies (HAs) that manage housing for low-income residents at rents they can afford. Likewise, HUD furnishes technical and professional assistance in planning, developing, and managing the buildings that comprise low-income housing developments. This dataset provides the location and resident characteristics of public housing development buildings.

    Location data for HUD-related properties and facilities are derived from HUD’s enterprise geocoding service. While not all addresses can be geocoded and mapped to 100% accuracy, HUD is continuously working to improve address data quality and enhance coverage; please consider this issue when using any datasets provided by HUD. The field titled “LVL2KX” indicates the overall accuracy of the geocoded address using the following return codes:
    • ‘R’ – Interpolated rooftop (high degree of accuracy, symbolized as green)
    • ‘4’ – ZIP+4 centroid (high degree of accuracy, symbolized as green)
    • ‘B’ – Block group centroid (medium degree of accuracy, symbolized as yellow)
    • ‘T’ – Census tract centroid (low degree of accuracy, symbolized as red)
    • ‘2’ – ZIP+2 centroid (low degree of accuracy, symbolized as red)
    • ‘Z’ – ZIP5 centroid (low degree of accuracy, symbolized as red)
    • ‘5’ – ZIP5 centroid (same as above, low degree of accuracy, symbolized as red)
    • Null – Could not be geocoded (does not appear on the map)

    For the purposes of displaying the location of an address on a map, use only addresses and their associated lat/long coordinates where the LVL2KX field is coded ‘R’ or ‘4’; these codes ensure that the address is displayed on the correct street segment and in the correct census block. The remaining LVL2KX codes provide a cascading indication of the most granular level of geography at which an address can be confirmed. For example, if an address cannot be accurately interpolated to a rooftop (‘R’) or ZIP+4 centroid (‘4’), the address is mapped to the centroid of the next nearest confirmed geography: block group, tract, and so on. When performing any point-in-polygon analysis, note that points mapped to the centroids of larger geographies are less likely to map accurately to smaller geographies of the same area; for instance, a point coded ‘5’ in the correct ZIP Code is less likely to map to the correct block group or census tract for that address.

    To protect Personally Identifiable Information (PII), the characteristics for each building are suppressed with a -4 value when “Number_Reported” is equal to or less than 10. To learn more about Public Housing visit: https://www.hud.gov/program_offices/public_indian_housing/programs/ph/ and Development FAQs - IMS/PIC | HUD.gov / U.S. Department of Housing and Urban Development (HUD). For questions about the spatial attribution of this dataset, please reach out to us at GISHelpdesk@hud.gov. Data Dictionary: DD_Public Housing Buildings. Date Updated: Q2 2025
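
    Following the guidance above, a mapping workflow keeps only ‘R’/‘4’ geocodes and treats -4 as a suppression sentinel; a minimal pandas sketch (the CSV export name is hypothetical):

      import pandas as pd

      buildings = pd.read_csv("public_housing_buildings.csv")

      # Keep only rooftop / ZIP+4 geocodes for display on a map.
      mappable = buildings[buildings["LVL2KX"].isin(["R", "4"])]

      # Treat the -4 privacy-suppression value as missing before computing statistics.
      mappable = mappable.replace(-4, pd.NA)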

  14. Data from: A three-year building operational performance dataset for...

    • openenergyhub.ornl.gov
    • data.niaid.nih.gov
    • +2more
    Updated Jul 30, 2024
    Cite
    (2024). A three-year building operational performance dataset for informing energy efficiency [Dataset]. https://openenergyhub.ornl.gov/explore/dataset/a-three-year-building-operational-performance-dataset-for-informing-energy-effic/
    Explore at:
    Dataset updated
    Jul 30, 2024
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset was curated from an office building constructed in 2015 in Berkeley, California. It includes whole-building and end-use energy consumption, HVAC system operating conditions, indoor and outdoor environmental parameters, and occupant counts. The data were collected over three years from more than 300 sensors and meters covering two office floors (each 2,325 m2) of the building. A three-step data curation strategy transforms the raw data into research-grade data: (1) clean the raw data to detect and adjust outlier values and fill data gaps; (2) create a metadata model of the building systems and data points using the Brick schema; (3) describe the dataset’s metadata using a semantic JSON schema. The dataset can be used for various applications, including building energy benchmarking, load shape analysis, energy prediction, occupancy prediction and analytics, and HVAC controls, improving the understanding and efficiency of building operations to reduce energy use, energy costs, and carbon emissions.
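
    Curation step (1) might look like the following pandas sketch; the file and column names are assumptions, not the dataset’s actual schema:

      import pandas as pd

      ts = pd.read_csv("sensor.csv", parse_dates=["timestamp"],
                       index_col="timestamp")["value"]

      lo, hi = ts.quantile([0.01, 0.99])
      ts = ts.clip(lo, hi)                # adjust outlier values
      ts = ts.interpolate(method="time")  # fill data gaps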

  15. Housing Receiving Incentives Open Data

    • opendata.cityofboise.org
    • city-of-boise.opendata.arcgis.com
    Updated Jul 5, 2023
    Cite
    City of Boise, Idaho (2023). Housing Receiving Incentives Open Data [Dataset]. https://opendata.cityofboise.org/documents/1423afcc749646649c82d7cdc718e4f5
    Explore at:
    Dataset updated
    Jul 5, 2023
    Dataset authored and provided by
    City of Boise, Idaho
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Thumbnail image by Tony Moody. This dataset includes all housing developments approved by the City of Boise’s (“city”) Planning Division since 2020 that are known by the city to have received, or are expected to receive, support or incentives from a government entity. Each row represents one development. Data may be unavailable for some projects and details are subject to change until construction is complete. Addresses are excluded for projects with fewer than five homes for privacy reasons.

    The dataset includes details on the number of “homes” in a development. We use the word "home" to refer to any single unit of housing regardless of size, type, or whether it is rented or owned. For example, a building with 40 apartments counts as 40 homes, and a single detached house counts as one home.

    The dataset includes details about the phase of each project. The process for building a new development is as follows: first, one must receive approval from the city’s Planning Division, which is known as being “entitled.” Next, one must apply for and receive a permit from the city’s Building Division before beginning construction. Finally, once construction is complete and all city inspections have been passed, the building can be occupied.

    The dataset also includes data on the affordability level of each development. To receive a government incentive, a developer is typically required to rent or sell a specified number of homes to households with incomes below limits set by the government, and their housing cost must not exceed 30% of their income. The federal government determines income limits based on a standard called “area median income.” The city considers housing affordable if it is targeted to households earning at or below 80% of the area median income. For a three-person household in Boise, that equates to an annual income of $60,650 and a monthly rent or mortgage of $1,516. See Boise Income Guidelines for more details.

    The dataset contains the following fields:
    • Project Address(es) – All addresses that are included as part of the development project.
    • Address – The primary address for the development.
    • Parcel Number(s) – The identification code for all parcels of land included in the development.
    • Acreage – The number of acres for the parcel(s) included in the project.
    • Planning Permit Number – The identification code for all permits the development has received from the Planning Division for the City of Boise. The number and types of permits required vary based on the location and type of development.
    • Date Entitled – The date a development was approved by the city’s Planning Division.
    • Building Permit Number – The identification code for all permits the development has received from the city’s Building Division.
    • Date Building Permit Issued – Building permits are required to begin construction on a development.
    • Date Final Certificate of Occupancy Issued – A certificate of occupancy is the final approval by the city for a development, once construction is complete. Not all developments require a certificate of occupancy.
    • Studio – The number of homes in the development that are classified as studios. A studio is typically defined as a home in which there is no separate bedroom; a single room serves as both bedroom and living room.
    • 1-Bedroom – The number of homes in a development that have exactly one bedroom.
    • 2-Bedroom – The number of homes in a development that have exactly two bedrooms.
    • 3-Bedroom – The number of homes in a development that have exactly three bedrooms.
    • 4+ Bedroom – The number of homes in a development that have four or more bedrooms.
    • # of Total Project Units – The total number of homes in the development.
    • # of units toward goals – The number of homes in a development that contribute to either the city’s goal to produce housing affordable at or under 60% of area median income, or the city’s goal to create permanent supportive housing for households experiencing homelessness.
    • Rent at or under 60% AMI – The number of homes in a development that are required to be rented at or below 60% of area median income.
    • Rent 61-80% AMI – The number of homes in a development that are required to be rented at between 61% and 80% of area median income.
    • Rent 81-120% AMI – The number of homes in a development that are required to be rented at between 81% and 120% of area median income.
    • Own at or under 60% AMI – The number of homes in a development that are required to be sold at or below 60% of area median income.
    • Own 61-80% AMI – The number of homes in a development that are required to be sold at between 61% and 80% of area median income.
    • Own 81-120% AMI – The number of homes in a development that are required to be sold at between 81% and 120% of area median income.
    • Housing Land Trust – “Yes” if a development receives or is expected to receive this incentive. The Housing Land Trust is a model in which the city owns land that it leases to a developer to build affordable housing.
    • City Investment – “Yes” if the city invests funding or contributes land to an affordable development.
    • Zoning Incentive – The city’s zoning code provides incentives for developers to create affordable housing, such as the ability to build an extra floor or reduced parking requirements. “Yes” if a development receives or is expected to receive one of these incentives.
    • Project Management – The city provides a developer and their design team a single point of contact who works across city departments to simplify the permitting process and helps applicants understand the city’s requirements to avoid possible delays. “Yes” if a development receives or is expected to receive this incentive.
    • Low-Income Housing Tax Credit (LIHTC) – A federal tax credit available to some new affordable housing developments, administered by the Idaho Housing and Finance Association, a quasi-governmental agency. “Yes” if a development receives or is expected to receive this incentive.
    • CCDC Investment – The Capital City Development Corp (CCDC) is a public agency that financially supports some affordable housing development in Urban Renewal Districts. “Yes” if a development receives or is expected to receive this incentive; if “Yes,” the field identifies the Urban Renewal District associated with the development.
    • City Goal – The city has set goals to produce housing affordable to households at or below 60% of area median income and to create permanent supportive housing for households experiencing homelessness. This field identifies whether a development contributes to one of those goals.
    • Project Phase – Where the project stands in the entitlement, permitting, and occupancy process described above.
    For all AMI fields, see the explanation of area median income above or see Boise Income Guidelines for more details; Boise defines a home as “affordable” if it is rented or sold at or below 80% of area median income.

  16. HUD Insured Multifamily Properties

    • data.lojic.org
    • anrgeodata.vermont.gov
    • +3more
    Updated Jul 1, 2015
    Cite
    Department of Housing and Urban Development (2015). HUD Insured Multifamily Properties [Dataset]. https://data.lojic.org/maps/HUD::hud-insured-multifamily-properties-1
    Explore at:
    Dataset updated
    Jul 1, 2015
    Dataset provided by
    United States Department of Housing and Urban Development (http://www.hud.gov/)
    Authors
    Department of Housing and Urban Development
    Area covered
    Description

    The FHA-insured Multifamily Housing portfolio consists primarily of rental housing properties with five or more dwelling units, such as apartments or town houses, but can also include nursing homes, hospitals, elderly housing, mobile home parks, retirement service centers, and occasionally vacant land. Please note that this dataset overlaps the Multifamily Properties Assisted layer. The Multifamily property locations represent the approximate location of the property. Location data are derived from HUD’s enterprise geocoding service; the “LVL2KX” field indicates geocoding accuracy using the same return codes listed for the Public Housing Buildings dataset above (for mapping, use only records coded ‘R’ or ‘4’, and note the same caveats for point-in-polygon analysis). To protect Personally Identifiable Information (PII), the characteristics for each building are suppressed with a -4 value when “Number_Reported” is equal to or less than 10. To learn more about HUD Insured Multifamily Properties visit: https://www.hud.gov/program_offices/housing/mfh. Data Dictionary: DD_HUD Insured Multifamilly Properties. Date of Coverage: 02/2025

  17. Data from: Public Housing Developments

    • data.lojic.org
    • opendata.atlantaregional.com
    • +1more
    Updated Mar 2, 2016
    Cite
    Department of Housing and Urban Development (2016). Public Housing Developments [Dataset]. https://data.lojic.org/datasets/HUD::public-housing-developments-1
    Explore at:
    Dataset updated
    Mar 2, 2016
    Dataset provided by
    United States Department of Housing and Urban Development (http://www.hud.gov/)
    Authors
    Department of Housing and Urban Development
    Area covered
    Description

    HUD furnishes technical and professional assistance in planning, developing, and managing these developments. Public Housing Developments are depicted as a distinct address chosen to represent the general location of an entire development, which may comprise several buildings scattered across a community; the building with the largest number of units is selected to represent the location of the development. Location data are derived from HUD’s enterprise geocoding service; the “LVL2KX” field indicates geocoding accuracy using the same return codes listed for the Public Housing Buildings dataset above (for mapping, use only records coded ‘R’ or ‘4’, and note the same caveats for point-in-polygon analysis). To protect Personally Identifiable Information (PII), the characteristics for each building are suppressed with a -4 value when “Number_Reported” is equal to or less than 10. To learn more about Public Housing visit: https://www.hud.gov/program_offices/public_indian_housing/programs/ph/. For questions about the spatial attribution of this dataset, please reach out to us at GISHelpdesk@hud.gov. Data Dictionary: DD_Public Housing Developments. Date Updated: Q2 2025

  18. Ames Housing Engineered Dataset

    • kaggle.com
    Updated Sep 27, 2025
    Atefeh Amjadian (2025). Ames Housing Engineered Dataset [Dataset]. https://www.kaggle.com/datasets/atefehamjadian/ameshousing-engineered
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Sep 27, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Atefeh Amjadian
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Area covered
    Ames
    Description

    This dataset is an engineered version of the original Ames Housing dataset from the "House Prices: Advanced Regression Techniques" Kaggle competition. The goal of this engineering was to clean the data, handle missing values, encode categorical features, scale numeric features, manage outliers, reduce skewness, select useful features, and create new features to improve model performance for house price prediction.

    The original dataset contains information on 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, with the target variable being SalePrice. This engineered version has undergone several preprocessing steps to make it ready for machine learning models.

    Preprocessing Steps Applied

    1. Missing Value Handling: Missing values in categorical columns with meaningful absence (e.g., no pool for PoolQC) were filled with "None". Numeric columns were filled with median, and other categorical columns with mode.
    2. Correlation-based Feature Selection: Numeric features with absolute correlation < 0.1 with SalePrice were removed.
    3. Encoding Categorical Variables: Ordinal features (e.g., quality ratings) were encoded using OrdinalEncoder, and nominal features (e.g., neighborhoods) using OneHotEncoder.
    4. Outlier Handling: Outliers in numeric features were detected using IQR and capped (Winsorized) to IQR bounds to preserve data while reducing extreme values.
    5. Skewness Handling: Highly skewed numeric features (|skew| > 1) were transformed using Yeo-Johnson to make distributions more normal-like.
    6. Additional Feature Selection: Low-variance one-hot features (variance < 0.01) and highly collinear features (|corr| > 0.8) were removed.
    7. Feature Scaling: Numeric features were scaled using RobustScaler to handle outliers.
    8. Duplicate Removal: Duplicate rows were checked and removed if found (none in this dataset). (A sketch of these steps follows this list.)
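    The description names OrdinalEncoder, OneHotEncoder, Yeo-Johnson, and RobustScaler, so a pandas/scikit-learn reconstruction of the steps might look like the following. This is an assumed sketch: the file path, the specific columns touched, and the details are illustrative, not the author's actual notebook code.

      # Assumed reconstruction of the preprocessing steps; illustrative only.
      import numpy as np
      import pandas as pd
      from sklearn.preprocessing import OrdinalEncoder, PowerTransformer, RobustScaler

      df = pd.read_csv("train.csv")  # original competition file (assumed path)

      # Step 1: "meaningful absence" categoricals -> "None"; numerics -> median.
      df["PoolQC"] = df["PoolQC"].fillna("None")
      num_cols = df.select_dtypes(include=np.number).columns.drop(["Id", "SalePrice"])
      df[num_cols] = df[num_cols].fillna(df[num_cols].median())

      # Step 2: keep numeric features with |corr| >= 0.1 against SalePrice.
      keep = [c for c in num_cols if abs(df[c].corr(df["SalePrice"])) >= 0.1]

      # Step 3: ordinal-encode a quality rating (0=None ... 5=Ex); one-hot the
      # nominals (pd.get_dummies stands in for OneHotEncoder here).
      order = [["None", "Po", "Fa", "TA", "Gd", "Ex"]]
      df[["BsmtQual"]] = OrdinalEncoder(categories=order).fit_transform(
          df[["BsmtQual"]].fillna("None"))
      dummies = pd.get_dummies(df[["MSZoning", "Neighborhood"]], dtype=float)

      # Step 4: Winsorize outliers to the 1.5 * IQR bounds.
      q1, q3 = df[keep].quantile(0.25), df[keep].quantile(0.75)
      iqr = q3 - q1
      df[keep] = df[keep].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr, axis=1)

      # Step 5: Yeo-Johnson transform for highly skewed columns (|skew| > 1).
      skewed = [c for c in keep if abs(df[c].skew()) > 1]
      if skewed:
          df[skewed] = PowerTransformer(method="yeo-johnson").fit_transform(df[skewed])

      # Step 7: robust scaling of the retained numeric block.
      df[keep] = RobustScaler().fit_transform(df[keep])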

    Column counts change during processing: one-hot encoding first expands the original 81 columns to approximately 250, and the feature-selection steps then pare that count back down (see the column totals below), leaving a dataset with improved quality for modeling.

    New Features Created

    To add more predictive power, the following new features were created based on domain knowledge:

    1. HouseAge: Age of the house at the time of sale, calculated as YrSold - YearBuilt. This captures how old the house is, which can lower the price through depreciation. Example: a house built in 2000 and sold in 2008 has HouseAge = 8.
    2. Quality_x_Size: Interaction term between overall quality and living area, calculated as OverallQual * GrLivArea. This combines quality and size to capture the value of high-quality large homes. Example: a house with OverallQual = 7 and GrLivArea = 1500 has Quality_x_Size = 10500.
    3. TotalSF: Total square footage of the house, calculated as GrLivArea + TotalBsmtSF + 1stFlrSF + 2ndFlrSF (if available). This aggregates the area features into a single metric for better price prediction. Example: if GrLivArea = 1500 and TotalBsmtSF = 1000, TotalSF = 2500.
    4. Log_LotArea: Log-transformed lot area to reduce skewness, calculated as np.log1p(LotArea). This makes the distribution of lot sizes more normal, helping models handle extreme values. Example: a lot area of 10000 becomes Log_LotArea ≈ 9.21.

    These new features were created using the original (unscaled) values to maintain interpretability, then scaled with RobustScaler to match the rest of the dataset.
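    Assuming df is the raw (unscaled) DataFrame with the original Kaggle column names, a minimal sketch of these four features:

      # Engineered features computed from unscaled columns; the final
      # RobustScaler pass described above is applied afterwards.
      import numpy as np
      import pandas as pd

      def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
          df = df.copy()
          df["HouseAge"] = df["YrSold"] - df["YearBuilt"]             # 2008 - 2000 = 8
          df["Quality_x_Size"] = df["OverallQual"] * df["GrLivArea"]  # 7 * 1500 = 10500
          df["TotalSF"] = (df["GrLivArea"] + df["TotalBsmtSF"]
                           + df["1stFlrSF"] + df["2ndFlrSF"])
          df["Log_LotArea"] = np.log1p(df["LotArea"])                 # 10000 -> ~9.21
          return df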

    Data Dictionary

    • Original Numeric Features: Kept features with |corr| > 0.1 with SalePrice, such as:
      • OverallQual: Material and finish quality (scaled, 1-10).
      • GrLivArea: Above grade (ground) living area square feet (scaled).
      • GarageCars: Size of garage in car capacity (scaled).
      • TotalBsmtSF: Total square feet of basement area (scaled).
      • And others like FullBath, YearBuilt, etc. (see the code for the full list).
    • Ordinal Encoded Features: Quality and condition ratings, e.g.:
      • ExterQual: Exterior material quality (encoded as 0=Po to 4=Ex).
      • BsmtQual: Basement quality (encoded as 0=None to 5=Ex).
    • One-Hot Encoded Features: Nominal categorical features, e.g.:
      • MSZoning_RL: 1 if residential low density, 0 otherwise.
      • Neighborhood_NAmes: 1 if in NAmes neighborhood, 0 otherwise.
    • New Engineered Features (as described above):
      • HouseAge: Age of the house (scaled).
      • Quality_x_Size: Overall quality times living area (scaled).
      • TotalSF: Total square footage (scaled).
      • Log_LotArea: Log-transformed lot area (scaled).
    • Target: SalePrice - The property's sale price in dollars (not scaled, as it's the target).

    Total columns: Approximately 200-250 (after one-hot encoding and feature selection).
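    For interpretability, the ordinal codes above can be mapped back to their labels. A small hypothetical helper follows; the endpoints match the dictionary entries above, while the intermediate labels are assumptions based on the standard Ames quality scale.

      # Decode maps for two ordinal features (ExterQual 0=Po..4=Ex,
      # BsmtQual 0=None..5=Ex per the data dictionary above).
      EXTER_QUAL = {0: "Po", 1: "Fa", 2: "TA", 3: "Gd", 4: "Ex"}
      BSMT_QUAL = {0: "None", 1: "Po", 2: "Fa", 3: "TA", 4: "Gd", 5: "Ex"}

      def decode(value: float, mapping: dict) -> str:
          """Map an encoded ordinal back to its label for human inspection."""
          return mapping[int(value)]

      print(decode(4, EXTER_QUAL))  # -> "Ex"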

    License

    This dataset is derived from the Ames Housing...

  19. Rasterized building footprints for the USA

    • kaggle.com
    zip
    Updated Dec 13, 2021
    Clayton Miller (2021). Rasterized building footprints for the USA [Dataset]. https://www.kaggle.com/datasets/claytonmiller/rasterized-building-footprints-for-usa
    Explore at:
    zip (8005609403 bytes). Available download formats
    Dataset updated
    Dec 13, 2021
    Authors
    Clayton Miller
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    United States
    Description

    The data and description below are from the Scientific Data paper "A rasterized building footprint dataset for the United States", authored by Mehdi P. Heris, Nathan Leon Foks, Kenneth J. Bagstad, Austin Troy & Zachary H. Ancona.

    Abstract

    Microsoft released a U.S.-wide vector building dataset in 2018. Although the vector building layers provide relatively accurate geometries, their use in large-extent geospatial analysis comes at a high computational cost. We used High-Performance Computing (HPC) to develop an algorithm that calculates six summary values for each cell in a raster representation of each U.S. state, excluding Alaska and Hawaii: (1) total footprint coverage, (2) number of unique buildings intersecting each cell, (3) number of building centroids falling inside each cell, and the (4) average, (5) smallest, and (6) largest area of the buildings that intersect each cell. These values are represented as raster layers with 30 m cell size covering the 48 conterminous states. We also identify errors in the original building dataset. We evaluate precision and recall in the data for three large U.S. urban areas. Precision is high and comparable to results reported by Microsoft, while recall is high for buildings with footprints larger than 200 m2 but lower for progressively smaller buildings.

    Background and Summary

    Building footprints are a critical environmental descriptor. Microsoft produced a U.S.-wide vector building dataset in 2018 [1] that was generated from aerial images available to Bing Maps using deep learning methods for object classification [2]. The main goal of this product has been to increase the coverage of building footprints available for OpenStreetMap. Microsoft identified building footprints in two phases: first, using semantic segmentation to identify building pixels from aerial imagery with Deep Neural Networks, and second, converting building pixel blobs into polygons. The final dataset includes 125,192,184 building footprint polygon geometries in GeoJSON vector format, covering all 50 U.S. states, with data for each state distributed separately. These data have 99.3% precision and 93.5% pixel recall accuracy [2]. The temporal resolution of the data (i.e., the years of the aerial imagery used to derive the data) is not provided by Microsoft in the metadata.

    Using vector layers for large-extent (i.e., national or state-level) spatial analysis and modelling (e.g., mapping the Wildland-Urban Interface, flood and coastal hazards, or large-extent urban typology modelling) is challenging in practice. Although vector data provide accurate geometries, incorporating them in large-extent spatial analysis comes at a high computational cost. We used High Performance Computing (HPC) to develop an algorithm that calculates six summary statistics (described below) for buildings at 30-m cell size in the 48 conterminous U.S. states, to better support national-scale and multi-state modelling that requires building footprint data. To develop these six derived products from the Microsoft buildings dataset, we created an algorithm that took every single building, built a small meshgrid (a 2D array) for the building's bounding box, and calculated unique values for each cell of the meshgrid. This grid structure is aligned with National Land Cover Database (NLCD) products (projected using the Albers Equal Area Conic system), enabling researchers to combine or compare our products with standard national-scale datasets such as land cover, tree canopy cover, and urban imperviousness [3].
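    As a toy illustration of that per-building meshgrid idea (a deliberate simplification, not the authors' HPC implementation), the sketch below treats one footprint as an axis-aligned rectangle and accumulates metric (1), total footprint coverage, on a 30 m grid; real footprints are polygons.

      # Accumulate per-cell footprint coverage (m^2) for one rectangular
      # building on a 30 m grid aligned to an NLCD-style origin.
      import numpy as np

      CELL = 30.0  # cell size in metres

      def add_rect_footprint(coverage, x0, y0, x1, y1):
          """Add the overlap area of one axis-aligned rectangle to every
          grid cell it intersects (coverage is indexed [row, col])."""
          for r in range(int(y0 // CELL), int(y1 // CELL) + 1):
              for c in range(int(x0 // CELL), int(x1 // CELL) + 1):
                  ox = min(x1, (c + 1) * CELL) - max(x0, c * CELL)
                  oy = min(y1, (r + 1) * CELL) - max(y0, r * CELL)
                  if ox > 0 and oy > 0:
                      coverage[r, c] += ox * oy

      grid = np.zeros((4, 4))                           # meshgrid over one bounding box
      add_rect_footprint(grid, 10.0, 10.0, 70.0, 40.0)  # a 60 m x 30 m building
      print(grid.sum())                                 # 1800.0 == the building's area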

    Locations, shapes, and distribution patterns of structures in urban and rural areas are the subject of many studies. Buildings represent the density of built-up areas and serve as an indicator of the urban morphology or spatial structure of cities and metropolitan areas [4,5]. In local studies, the use of vector data types is easier [6,7]; however, in regional and national studies a raster dataset is preferable. For example, in measuring the spatial structure of metropolitan areas, a rasterized building layer is more useful than the original vector datasets [8].

    Our output raster products are: (1) total building footprint coverage per cell (m2 of building footprint per 900 m2 cell); (2) number of buildings that intersect each cell; (3) number of building centroids falling within each cell; (4) area of the largest building intersecting each cell (m2); (5) area of the smallest building intersecting each cell (m2); and (6) average area of all buildings intersecting each cell (m2). The last three area metrics include building area that falls outside the cell but where part of the building intersects the cell (Fig. 1). These values can be used to describe the intensity and typology of the built environment.

    Code

    Our software is available through U.S. Geological Survey code r...

  20. Replication Data for: Revisiting 'The Rise and Decline' in a Population of...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 22, 2023
    TeBlunthuis, Nathan; Aaron Shaw; Benjamin Mako Hill (2023). Replication Data for: Revisiting 'The Rise and Decline' in a Population of Peer Production Projects [Dataset]. http://doi.org/10.7910/DVN/SG3LP1
    Explore at:
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    TeBlunthuis, Nathan; Aaron Shaw; Benjamin Mako Hill
    Description

    This archive contains code and data for reproducing the analysis for "Replication Data for Revisiting 'The Rise and Decline' in a Population of Peer Production Projects". Depending on what you hope to do with the data, you probably do not want to download all of the files, and depending on your computational resources you may not be able to run all stages of the analysis. The code for all stages of the analysis, including typesetting the manuscript and running the analysis, is in code.tar. If you only want to run the final analysis or to play with the datasets used in the paper, you want intermediate_data.7z or the uncompressed tab and csv files.

    The data files are created in a four-stage process. The first stage uses the program "wikiq" to parse MediaWiki XML dumps and create tsv files that hold edit data for each wiki. The second stage generates all.edits.RDS, which combines these tsvs into a dataset of edits from all the wikis; this file is expensive to generate and, at 1.5 GB, is pretty big. The third stage builds smaller intermediate files that contain the analytical variables from these tsv files. The fourth stage uses the intermediate files to generate smaller RDS files that contain the results. Finally, knitr and LaTeX typeset the manuscript. A stage only runs if the outputs from the previous stages do not exist, so if the intermediate files exist they will not be regenerated and only the final analysis will run; the exception is stage 4 (fitting models and generating plots), which always runs. If you only want to replicate from the second stage onward, you want wikiq_tsvs.7z. If you want to replicate everything, you want wikia_mediawiki_xml_dumps.7z.001, wikia_mediawiki_xml_dumps.7z.002, and wikia_mediawiki_xml_dumps.7z.003. These instructions work backwards, from building the manuscript with knitr, through loading the datasets and running the analysis, to building the intermediate datasets.

    Building the manuscript using knitr: This requires working latex, latexmk, and knitr installations. Depending on your operating system you might install these packages in different ways; on Debian Linux you can run apt install r-cran-knitr latexmk texlive-latex-extra. Alternatively, you can upload the necessary files to a project on Overleaf.com. Download code.tar, which has everything you need to typeset the manuscript, and unpack it (on a unix system, tar xf code.tar). Navigate to code/paper_source and install the R dependencies: in R, run install.packages(c("data.table","scales","ggplot2","lubridate","texreg")). On a unix system you should then be able to run make to build the manuscript generalizable_wiki.pdf; otherwise, try uploading all of the files (including the tables, figure, and knitr folders) to a new project on Overleaf.com.

    Loading intermediate datasets: The intermediate datasets are found in the intermediate_data.7z archive, which can be extracted on a unix system using the command 7z x intermediate_data.7z; the files are 95 MB uncompressed. These are RDS (R data set) files and can be loaded in R using readRDS, for example newcomer.ds <- readRDS("newcomers.RDS"). If you wish to work with these datasets using a tool other than R, you might prefer to work with the .tab files.

    Running the analysis: Fitting the models may not work on machines with less than 32 GB of RAM. If you have trouble, the functions in lib-01-sample-datasets.R may be useful for creating stratified samples of data for fitting models; see line 89 of 02_model_newcomer_survival.R for an example. Download code.tar and intermediate_data.7z to your working folder and extract both archives (on a unix system, tar xf code.tar && 7z x intermediate_data.7z). Install the R dependencies: install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2")). On a unix system you can then simply run regen.all.sh to fit the models, build the plots, and create the RDS files.

    Generating datasets: The intermediate files are generated from all.edits.RDS; this process requires about 20 GB of memory. Download all.edits.RDS, userroles_data.7z, selected.wikis.csv, and code.tar. Unpack code.tar and userroles_data.7z (on a unix system, tar xf code.tar && 7z x userroles_data.7z), install the R dependencies listed above, and run 01_build_datasets.R. To replicate building all.edits.RDS itself, you only need to run 01_build_datasets.R when the int... Visit https://dataone.org/datasets/sha256%3Acfa4980c107154267d8eb6dc0753ed0fde655a73a062c0c2f5af33f237da3437 for complete metadata about this dataset.

House Price Prediction Dataset

4. Potential Use Cases

The dataset is well-suited for various machine learning and data analysis applications, including:

• House Price Prediction: Using regression techniques, this dataset can be used to build a model that predicts house prices from the available features (see the sketch below).
• Feature Importance Analysis: Techniques such as feature importance ranking can determine which features (e.g., location, area, or condition) have the greatest impact on house prices.
• Clustering: Clustering techniques like k-means could help identify patterns in the data, such as grouping houses into segments based on their characteristics (e.g., luxury homes, affordable homes).
• Market Segmentation: The dataset can be used to segment by location, price range, or house type to analyze trends in specific sub-markets, like luxury vs. affordable housing.
• Time-Based Analysis: By studying how house prices vary with the year built or the age of the house, analysts can derive insights into trends of older vs. newer homes.
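As a hedged sketch of the first two use cases, a baseline regression plus feature-importance ranking might look like this; the file name and the exact column spellings are assumptions based on the description above.

    # Baseline price regression and feature-importance ranking.
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("house_price_prediction.csv")  # assumed filename
    features = ["Area", "Bedrooms", "Bathrooms", "Floors",
                "YearBuilt", "Location", "Condition", "Garage"]  # assumed spellings
    X = pd.get_dummies(df[features])  # one-hot Location, Condition, Garage
    y = df["Price"]

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X_tr, y_tr)
    print("held-out R^2:", model.score(X_te, y_te))

    # Feature Importance Analysis: which columns drive predicted price?
    ranked = sorted(zip(model.feature_importances_, X.columns), reverse=True)
    print(ranked[:5])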

5. Limitations and ...
