This dataset is designed for beginners to practice regression problems, particularly in the context of predicting house prices. It contains 1000 rows, with each row representing a house and various attributes that influence its price. The dataset is well-suited for learning basic to intermediate-level regression modeling techniques.
Beginner Regression Projects: This dataset can be used to practice building regression models such as Linear Regression, Decision Trees, or Random Forests. The target variable (house price) is continuous, making this an ideal problem for supervised learning techniques.
Feature Engineering Practice: Learners can create new features by combining existing ones, such as the price per square foot or age of the house, providing an opportunity to experiment with feature transformations.
Exploratory Data Analysis (EDA): You can explore how different features (e.g., square footage, number of bedrooms) correlate with the target variable, making it a great dataset for learning about data visualization and summary statistics.
Model Evaluation: The dataset allows for various model evaluation techniques such as cross-validation, R-squared, and Mean Absolute Error (MAE). These metrics can be used to compare the effectiveness of different models.
The dataset is highly versatile for a range of machine learning tasks. You can apply simple linear models to predict house prices based on one or two features, or use more complex models like Random Forest or Gradient Boosting Machines to understand interactions between variables.
It can also be used for dimensionality reduction techniques like PCA or to practice handling categorical variables (e.g., neighborhood quality) through encoding techniques like one-hot encoding.
This dataset is ideal for anyone wanting to gain practical experience in building regression models while working with real-world features.
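The evaluation workflow described above (fit a simple model, then score it with cross-validation, MAE, and R-squared) can be sketched in plain Python. This is only an illustration: the `square_feet` and `prices` values are synthetic stand-ins generated in the script, not rows from the dataset, and the one-feature linear model is the simplest of the options mentioned.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def cross_validate(xs, ys, k=5):
    """Return one (MAE, R^2) pair per fold for a one-feature linear model."""
    scores = []
    for fold in range(k):
        # Every k-th row (offset by the fold index) is held out for testing.
        train_x = [x for i, x in enumerate(xs) if i % k != fold]
        train_y = [y for i, y in enumerate(ys) if i % k != fold]
        test_x = [x for i, x in enumerate(xs) if i % k == fold]
        test_y = [y for i, y in enumerate(ys) if i % k == fold]
        intercept, slope = fit_line(train_x, train_y)
        preds = [intercept + slope * x for x in test_x]
        scores.append((mean_absolute_error(test_y, preds), r_squared(test_y, preds)))
    return scores

# Synthetic stand-in data (assumption): price grows linearly with area plus noise.
rng = random.Random(0)
square_feet = [rng.uniform(600, 3500) for _ in range(1000)]
prices = [50_000 + 150 * s + rng.gauss(0, 20_000) for s in square_feet]

fold_scores = cross_validate(square_feet, prices, k=5)
mean_mae = sum(m for m, _ in fold_scores) / len(fold_scores)
mean_r2 = sum(r for _, r in fold_scores) / len(fold_scores)
print(f"mean MAE: {mean_mae:.0f}  mean R^2: {mean_r2:.3f}")
```

Averaging the per-fold scores gives a single number per metric, which is what makes it possible to compare this baseline against a tree-based model trained on the same folds.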
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
This data set contains example data for exploration of the theory of regression based regionalization. The 90th percentile of annual maximum streamflow is provided as an example response variable for 293 streamgages in the conterminous United States. Several explanatory variables are drawn from the GAGES-II data base in order to demonstrate how multiple linear regression is applied. Example scripts demonstrate how to collect the original streamflow data provided and how to recreate the figures from the associated Techniques and Methods chapter.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Despite recent papers on problems associated with full-model and stepwise regression, their use is still common throughout ecological and environmental disciplines. Alternative approaches, including generating multiple models and comparing them post-hoc using techniques such as Akaike's Information Criterion (AIC), are becoming more popular. However, these are problematic when there are numerous independent variables and interpretation is often difficult when competing models contain many different variables and combinations of variables. Here, we detail a new approach, REVS (Regression with Empirical Variable Selection), which uses all-subsets regression to quantify empirical support for every independent variable. A series of models is created; the first containing the variable with most empirical support, the second containing the first variable and the next most-supported, and so on. The comparatively small number of resultant models (n = the number of predictor variables) means that post-hoc comparison is comparatively quick and easy. When tested on a real dataset – habitat and offspring quality in the great tit (Parus major) – the optimal REVS model explained more variance (higher R²), was more parsimonious (lower AIC), and had greater significance (lower P values), than full, stepwise or all-subsets models; it also had higher predictive accuracy based on split-sample validation. Testing REVS on ten further datasets suggested that this is typical, with R² values being higher than full or stepwise models (mean improvement = 31% and 7%, respectively). Results are ecologically intuitive as even when there are several competing models, they share a set of “core” variables and differ only in presence/absence of one or two additional variables. We conclude that REVS is useful for analysing complex datasets, including those in ecology and environmental disciplines.
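The nested series of models described above can be illustrated with a short Python sketch. It covers only the ordering step: the support scores and variable names below are hypothetical placeholders, and how REVS derives empirical support from all-subsets regression is not reproduced here.

```python
def nested_model_series(support):
    """Rank predictors by support (highest first) and build the nested models.

    Model 1 holds the best-supported predictor, model 2 adds the next one,
    and so on, giving exactly one model per predictor.
    """
    ranked = sorted(support, key=support.get, reverse=True)
    return [ranked[: i + 1] for i in range(len(ranked))]

# Hypothetical support scores (assumption, not values from the study).
support = {
    "oak_cover": 0.91,
    "nest_height": 0.34,
    "canopy_gap": 0.62,
    "distance_to_edge": 0.12,
}

models = nested_model_series(support)
for number, predictors in enumerate(models, start=1):
    print(f"model {number}: " + " + ".join(predictors))
```

With four predictors the series has four candidate models, compared with the 15 non-empty subsets that an all-subsets comparison would have to consider.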
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Data Science
Released under CC0: Public Domain
This data release contains input data and programs (scripts) used to estimate monthly water demand for retail customers of Providence Water, located in Providence, Rhode Island. Explanatory data and model outputs are from July 2014 through June 2021. Models of per capita (for single-family residential customers) or per connection (for multi-family residential, commercial, and industrial customers) water use were developed using multiple linear regression. The dependent variables, provided by Providence Water, are the monthly number of connections and gallons of water delivered to single- and multi-family residential, commercial, and industrial connections. Potential independent variables (from online sources) are climate variables (temperature and precipitation), economic statistics, and a drought statistic. Not all independent variables were used in all of the models. The data are provided in data tables and model files. The data table RIWaterUseVariableExplanation.csv describes the explanatory variables and their data sources. The data table ProvModelInputData.csv provides the monthly water-use data that are the dependent variables and the monthly climatic and economic data that are the independent variables. The data table DroughtInputData.csv provides the weekly U.S. drought monitor index values that were processed to formulate a potential independent variable. The R script model_water_use.R runs the models that predict water use. The other two R scripts (load_preprocess_input_data.R and model_water_use_functions.R) are not run explicitly but are called from the primary script model_water_use.R. Regression equations produced by the models can be used to predict water demand throughout Rhode Island.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Age: Age of the primary household member (18 to 70 years).
Education Level: Highest education level attained (High School, Bachelor's, Master's, Doctorate).
Occupation: Type of occupation (Healthcare, Education, Technology, Finance, Others).
Number of Dependents: Number of dependents in the household (0 to 5).
Location: Residential location (Urban, Suburban, Rural).
Work Experience: Years of work experience (0 to 50 years).
Marital Status: Marital status of the primary household member (Single, Married, Divorced).
Employment Status: Employment status of the primary household member (Full-time, Part-time, Self-employed).
Household Size: Total number of individuals living in the household (1 to 7).
Homeownership Status: Homeownership status (Own, Rent).
Type of Housing: Type of housing (Apartment, Single-family home, Townhouse).
Gender: Gender of the primary household member (Male, Female).
Primary Mode of Transportation: Primary mode of transportation used by the household member (Car, Public transit, Biking, Walking).
Annual Household Income: Actual annual household income in USD, derived from a combination of features with added noise.
This dataset can be used by researchers, analysts, and data scientists to explore the impact of various demographic and socioeconomic factors on household income and to develop predictive models for income estimation.
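Most of the attributes listed above are categorical, so they need a numeric encoding before they can enter a regression model. The sketch below shows two common options in plain Python: one-hot encoding for an unordered attribute (Location) and an ordinal mapping for an ordered one (Education Level). The category levels come from the list above; the sample values and the choice of which attribute gets which encoding are assumptions made for illustration.

```python
LOCATION_LEVELS = ["Urban", "Suburban", "Rural"]
EDUCATION_LEVELS = ["High School", "Bachelor's", "Master's", "Doctorate"]

def one_hot(values, categories):
    """Return one 0/1 indicator column per category for each value."""
    return [[1 if value == category else 0 for category in categories]
            for value in values]

def ordinal(values, ordered_categories):
    """Map each value to its rank in an ordered list of categories."""
    rank = {category: i for i, category in enumerate(ordered_categories)}
    return [rank[value] for value in values]

# Hypothetical sample values, not rows from the dataset.
locations = ["Rural", "Urban", "Suburban"]
education = ["Master's", "High School", "Doctorate"]

encoded_locations = one_hot(locations, LOCATION_LEVELS)
encoded_education = ordinal(education, EDUCATION_LEVELS)
print(encoded_locations)
print(encoded_education)
```

In a linear model with an intercept, one indicator column per attribute is usually dropped so that the remaining columns are not perfectly collinear with the intercept.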
Students typically find linear regression analysis of data sets in a biology classroom challenging. These activities could be used in a Biology, Chemistry, Mathematics, or Statistics course. The collection provides student activity files with Excel instructions and Instructor Activity files with Excel instructions and solutions to problems.
Students will be able to perform linear regression analysis, find the correlation coefficient, create a scatter plot, and find the R-squared value using MS Excel 365. Students will be able to interpret data sets, describe the relationship between biological variables, and predict the value of an output variable based on the input of a predictor variable.
This dataset was created by Suraj Baraik
STAD-R is a set of R programs that computes descriptive statistics and produces boxplots and histograms. STAD-R was designed because, before anything else, it is necessary to check whether the dataset has the same number of repetitions, blocks, genotypes, and environments; whether there are missing values, and if so where and how many; and to review the distributions and outliers. It is important to be sure that the dataset is complete and has the correct structure before doing any other kind of analysis.
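The structural checks described above can be illustrated with a small sketch. It is written in Python rather than R and is not taken from STAD-R; the field names (`genotype`, `environment`, `rep`, `yield`) and the records are hypothetical.

```python
from collections import Counter

def check_structure(records, design_keys, trait):
    """Summarise replication and missing values before further analysis."""
    # Count how many rows fall in each combination of design factors.
    cell_counts = Counter(tuple(r[k] for k in design_keys) for r in records)
    # Positions of rows where the trait value is absent.
    missing_rows = [i for i, r in enumerate(records) if r.get(trait) is None]
    # Balanced here means every observed combination has the same row count;
    # combinations that never appear in the data are not detected.
    balanced = len(set(cell_counts.values())) == 1
    return cell_counts, missing_rows, balanced

# Hypothetical trial records (assumption).
records = [
    {"genotype": "G1", "environment": "E1", "rep": 1, "yield": 4.2},
    {"genotype": "G1", "environment": "E1", "rep": 2, "yield": 4.5},
    {"genotype": "G2", "environment": "E1", "rep": 1, "yield": None},
    {"genotype": "G2", "environment": "E1", "rep": 2, "yield": 3.9},
    {"genotype": "G1", "environment": "E2", "rep": 1, "yield": 5.1},
]

cell_counts, missing_rows, balanced = check_structure(
    records, ["genotype", "environment"], "yield"
)
for cell, count in sorted(cell_counts.items()):
    print(cell, count)
print("rows with missing trait:", missing_rows)
print("balanced:", balanced)
```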
Sandy ocean beaches are a popular recreational destination, often surrounded by communities containing valuable real estate. Development is on the rise despite the fact that coastal infrastructure is subjected to flooding and erosion. As a result, there is an increased demand for accurate information regarding past and present shoreline changes. To meet these national needs, the Coastal and Marine Geology Program of the U.S. Geological Survey (USGS) is compiling existing reliable historical shoreline data along open-ocean sandy shores of the conterminous United States and parts of Alaska and Hawaii under the National Assessment of Shoreline Change project. There is no widely accepted standard for analyzing shoreline change. Existing shoreline data measurements and rate calculation methods vary from study to study and prevent combining results into state-wide or regional assessments. The impetus behind the National Assessment project was to develop a standardized method of measuring changes in shoreline position that is consistent from coast to coast. The goal was to facilitate the process of periodically and systematically updating the results in an internally consistent manner.
This data set includes input data for the development of regression models to predict chloride from specific conductance (SC) data at 56 U.S. Geological Survey water quality monitoring stations in the eastern United States. Each site has 20 or more simultaneous observations of SC and chloride. Data were downloaded from the National Water Information System (NWIS) using the R package dataRetrieval. Datasets for each site were evaluated and outliers were removed prior to the development of the regression model. This file contains only the final input dataset for the regression models. Please refer to Moore and others (in review) for more details.
Reference: Moore, J., R. Fanelli, and A. Sekellick. In review. High-frequency data reveal deicing salts drive elevated conductivity and chloride along with pervasive and frequent exceedances of the EPA aquatic life criteria for chloride in urban streams. Submitted to Environmental Science and Technology.
This data release contains extended estimates of daily groundwater levels and monthly percentiles at 27 short-term monitoring wells in Massachusetts. The Maintenance of Variance Extension Type 1 (MOVE.1) regression method was used to extend short-term groundwater levels at wells with fewer than 10 years of continuous data. This method uses groundwater level data from a correlated long-term monitoring well (index well) to estimate the groundwater level record for the short-term monitoring well. MOVE.1 regressions are used widely throughout the hydrologic community to extend flow records from streamgaging stations but are less commonly used to extend groundwater records at wells. The data in this data release document the results of the MOVE.1 regressions to estimate groundwater levels and compute updated monthly percentiles for select wells used in the groundwater index in the Massachusetts Drought Management Plan (2019). The U.S. Geological Survey (USGS) groundwater identification site numbers and groundwater level data are available via the USGS National Water Information System (NWIS) database (available at https://waterdata.usgs.gov/nwis). Groundwater levels provided are in depth to water level, in feet below land surface datum. This data release accompanies a USGS scientific investigations report that describes the methods and results in detail (Ahearn and Crozier, 2024).
Reference: Massachusetts Executive Office of Energy and Environmental Affairs and Massachusetts Emergency Management Agency, 2019, Massachusetts drought management plan: Executive Office of Energy and Environmental Affairs, 115 p., accessed September 2022, at https://www.mass.gov/doc/massachusetts-drought-management-plan
The following are included in the data release:
(1) R input file that lists the final site pairings (R_Input_MOVE1_Site_List.csv)
(2) R script that performs the MOVE.1 and produces outputs for evaluation purposes (MOVE1_R_code.R)
(3) MOVE.1 model outputs (MOVE1_Models.zip)
(4) Estimates of daily groundwater levels using the MOVE.1 regression technique (MOVE1_Estimated_Record_Tables.zip)
(5) Plots showing time series of estimated daily groundwater levels from the MOVE.1 technique (MOVE1_Estimated_Record_Plots.zip)
(6) Plots showing time series of estimated daily groundwater levels from the MOVE.1 technique zoomed into the period of observed daily groundwater levels for the short-term site (Zoomed_MOVE1_Estimated_Record_Plots.zip)
(7) Plots showing residuals (Residuals_WL_Plots.zip)
(8) Monthly percentile table for 27 study wells (GWL_Percentiles_All_Study_Wells.csv)
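As a rough illustration of the MOVE.1 idea used in this release, the sketch below fits the standard MOVE.1 line, whose slope is the ratio of the two standard deviations (carrying the sign of the correlation), so that estimates keep the variance of the short-term record. It is written in Python and is not the release's MOVE1_R_code.R; the water levels are hypothetical, and any transformation or screening steps applied in the actual study are not represented.

```python
import statistics

def move1_fit(index_levels, short_levels):
    """Fit a MOVE.1 line from concurrent index-well and short-term-well data.

    Returns (intercept, slope) with slope = sign(r) * sd_y / sd_x, so that
    estimates reproduce the mean and standard deviation of the short record.
    """
    mean_x = statistics.mean(index_levels)
    mean_y = statistics.mean(short_levels)
    sd_x = statistics.stdev(index_levels)
    sd_y = statistics.stdev(short_levels)
    # The sign of the cross-product sum equals the sign of the correlation.
    cross = sum((x - mean_x) * (y - mean_y)
                for x, y in zip(index_levels, short_levels))
    sign = 1.0 if cross >= 0 else -1.0
    slope = sign * sd_y / sd_x
    return mean_y - slope * mean_x, slope

# Hypothetical concurrent depths to water, in feet below land surface.
index_well = [10.0, 12.0, 11.0, 13.0, 9.0]
short_well = [5.0, 6.5, 5.4, 7.1, 4.6]

intercept, slope = move1_fit(index_well, short_well)

# Extend the short-term record using index-well values from other dates.
index_only_period = [8.5, 14.0, 11.5]
estimated = [intercept + slope * x for x in index_only_period]
print([round(value, 2) for value in estimated])
```

Ordinary least squares would instead scale the slope by the correlation coefficient, which shrinks the variance of the estimates; avoiding that shrinkage is the reason for using a variance-maintaining line when extended records feed percentile calculations.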
The R markdown file "BayesianScript.pdf" contains the code required to run the Bayesian regression models found in the paper.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains supplementary materials related to the study "Survey on critical results management in Brazilian clinical laboratories: Profiling practices through multivariate analysis and a 'New Statistics' approach". The dataset, figures, exported results, and analysis scripts are included to ensure full transparency and reproducibility of the research findings.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The functional linear regression model with points of impact (PoI) is a recent augmentation of the classical functional linear model with many practically important applications. In this article, however, we demonstrate that the existing data-driven procedure for estimating the parameters of this regression model can be very unstable and inaccurate. The tendency to omit relevant PoI is a particularly problematic aspect resulting in omitted-variable biases. We explain the theoretical reason for this problem and propose a new sequential estimation algorithm that leads to significantly improved estimation results. Our estimation algorithm is compared with the existing estimation procedure using an in-depth simulation study. The applicability is demonstrated using data from Google AdWords, today’s most important platform for online advertisements. The R package FunRegPoI and additional R code are provided in the online supplementary materials.
This dataset was created by Gaurav B R
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 10.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Synthetic datasets were generated as benchmarks capturing the intrinsic characteristics of original data to investigate the performance of additive feature attribution methods for regression tasks. The synthetic datasets were generated based on 2, 6 and 8 clusters formed with the original data. The 6-cluster dataset was used for primary analysis and the other two were used for sensitivity analysis. The synthetic dataset was generated from the original data acquired from Aviation Data for Research Repository, which was collected and processed by EUROCONTROL from the Enhanced Tactical Flow Management System (ETFMS) flight data messages containing all flights in Europe throughout the year 2019, from May to October. The original dataset consisted of fundamental details of the flights, flight status, preceding flight legs, ATFM regulations, weather conditions, calendar information, etc. A brief description of the columns in the synthetic data files is presented in the file 'data_description.pdf' and a more detailed discussion on features can be found in the works of Koolen and Coliban [1] and Dalmau et al. [2].
References
[1] H. Koolen and I. Coliban, Flight Progress Messages Document, EUROCONTROL, Brussels, Belgium, Tech. Rep., 2020.
[2] R. Dalmau, F. Ballerini, H. Naessens, S. Belkoura, and S. Wangnick, An Explainable Machine Learning Approach to Improve Take-off Time Predictions, Journal of Air Transport Management, vol. 95, p. 102090, Aug. 2021. doi: 10.1016/j.jairtraman.2021.102090.
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
Data from an optical turbidity sensor deployed at the stream station were recorded at 15-minute intervals by a data logger and uploaded every hour to the USGS database (Anderson, 2005; Wagner, 2006). Suspended-sediment samples were collected using equal width increments or grab sampling techniques (Edwards, 1999). The use of an optical sensor to continuously monitor turbidity provided an accurate estimate of sediment fluctuations without the collection and analysis costs associated with intensive sampling (OSW policy 2016.07; Rasmussen et al., 2009). Turbidity was used as a surrogate for suspended-sediment concentration (SSC), which is a measure of sedimentation and siltation. Regression models were developed between SSC and turbidity using turbidity data from the optical sensor and the SSC data collected from the suspended-sediment samples. For the West Fork of White River East of Fayetteville instantaneous turbidity measurements began on October 11, 2014 and ranged from 0.3 to ...