This data set contains example data for exploring the theory of regression-based regionalization. The 90th percentile of annual maximum streamflow is provided as an example response variable for 293 streamgages in the conterminous United States. Several explanatory variables are drawn from the GAGES-II database to demonstrate how multiple linear regression is applied. Example scripts demonstrate how to collect the original streamflow data provided and how to recreate the figures from the associated Techniques and Methods chapter.
This dataset is a multiple regression project from an Applied Math Science graduate-level course at Stony Brook (AMS578 Spring 2020).
The class blackboard has a pdf file of a paper by Caspi et al. that reports a finding of a gene-environment interaction. This paper used multiple regression techniques as the methodology for its findings. You should read it for background, as it is the genesis of the models that you will be given. The data that you are analyzing is synthetic. That is, the TA used a model to generate the data. Your task is to find the model that the TA used for your data. For example, one possible model is
The class blackboard also contains a paper by Risch et al. that uses a larger collection of data to assess the findings in Caspi et al. These researchers confirmed that Caspi et al. calculated their results correctly but that no other dataset had the relation reported in Caspi et al. That is, Caspi et al. seem to have reported a false positive (Type I error). The class blackboard contains a recent paper about the genetics of mental illness and a technical appendix giving the specifics. Together these papers are an example of the response of the research community to studying the genetics of mental illness, which is a notoriously difficult research area.
One file contains the patient identifier and the dependent variable value. The second file contains the patient identifier and values of six environment variables called E1 to E6. The third file contains the patient identifier and the twenty independent indicator variables called G1 to G20. The records may not be in the same order in each file, and cases may be missing in one or more of the files. You can process the data with VLOOKUP or other data merging software.
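The merging step described above can be sketched in Python using only the standard library. This is a minimal illustration, not the course's prescribed method: the column headers (`ID`, `Y`, `E1`, `G1`, and so on) and the tiny inline tables are hypothetical stand-ins, since the real file names and headers are not given here. Records are matched on the patient identifier, so row order does not matter, and patients missing from any file are dropped.

```python
import csv
import io

# Hypothetical miniature versions of the three files. The real headers are
# not stated in the description, so "ID", "Y", "E1", "G1", ... are assumptions,
# and only two E and two G columns are shown for brevity.
response_csv = "ID,Y\n3,10.5\n1,7.2\n2,8.9\n"
environment_csv = "ID,E1,E2\n2,0.4,1.1\n1,0.9,0.3\n4,0.2,0.8\n"
gene_csv = "ID,G1,G2\n1,0,1\n2,1,1\n3,0,0\n"


def read_by_id(text):
    """Return {patient ID: row dict without the ID column} for one CSV file."""
    rows = {}
    for row in csv.DictReader(io.StringIO(text)):
        patient_id = row.pop("ID")
        rows[patient_id] = row
    return rows


def inner_merge(*tables):
    """Keep only patients present in every table; input row order is irrelevant."""
    shared_ids = set(tables[0])
    for table in tables[1:]:
        shared_ids &= set(table)
    merged = []
    for patient_id in sorted(shared_ids, key=int):
        record = {"ID": patient_id}
        for table in tables:
            record.update(table[patient_id])
        merged.append(record)
    return merged


merged = inner_merge(
    read_by_id(response_csv),
    read_by_id(environment_csv),
    read_by_id(gene_csv),
)
for record in merged:
    print(record)
```

An inner join is used here because a regression needs the response and every predictor for each case; whether to drop or impute incomplete cases is a modelling decision left to the analyst.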
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract: The valuation of real estate, which assists in the definition of market value, is an important discipline with a wide field of application, including tax collection, commercial transactions, insurance and judicial expertise. This study presents the construction of a linear regression model to determine the market value (dependent variable) of residential apartments in the city of Fortaleza-CE. The studied database comprises 17,493 apartments, divided into 227 plan types across 154 projects launched between 2011 and 2014. The model was obtained using Multiple Linear Regression combined with the Ridge Regression technique to address the existing multicollinearity problem. From an analysis of 30 variables (12 quantitative and 18 dummy-type qualitative variables), an equation with 6 variables was reached that meets the underlying theoretical assumptions.
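To illustrate why ridge regression helps with multicollinearity, the sketch below fits a closed-form ridge estimator to synthetic data with two nearly collinear predictors. Everything in it is an assumption made for illustration: the variable names (`area`, `rooms`, `price`), the generated values, and the penalty values `alpha` are hypothetical and are not taken from the Fortaleza study.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Ridge regression on standardized predictors, returned on the original scale."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, x_std = X.mean(axis=0), X.std(axis=0)
    Xs = (X - x_mean) / x_std            # standardize so the penalty treats columns alike
    yc = y - y.mean()                    # centering leaves the intercept unpenalized
    gram = Xs.T @ Xs + alpha * np.eye(Xs.shape[1])
    beta_std = np.linalg.solve(gram, Xs.T @ yc)
    beta = beta_std / x_std              # convert back to original units
    intercept = y.mean() - x_mean @ beta
    return intercept, beta


# Synthetic, hypothetical data: "rooms" is almost a linear function of "area".
rng = np.random.default_rng(0)
area = rng.uniform(50.0, 200.0, size=200)
rooms = area / 40.0 + rng.normal(0.0, 0.05, size=200)
price = 3.0 * area + 20.0 * rooms + rng.normal(0.0, 10.0, size=200)
X = np.column_stack([area, rooms])

for alpha in (0.0, 1.0, 10.0):
    intercept, beta = ridge_fit(X, price, alpha)
    print(f"alpha={alpha:5.1f}  intercept={intercept:8.2f}  coefficients={np.round(beta, 3)}")
```

With `alpha=0` the function reduces to ordinary least squares, whose coefficients on collinear columns are poorly determined; increasing `alpha` trades a small amount of bias for more stable coefficients.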
This dataset contains data on customers who buy clothes online. The store offers in-store style and clothing advice sessions. Customers come in to the store, have sessions with a personal stylist, and can then go home and order the clothes they want through either a mobile app or a website.
The company is trying to decide whether to focus their efforts on their mobile app experience or their website.
Open Government Licence 3.0 http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
The primary objective of this project was to acquire historical shoreline information for all of the Northern Ireland coastline. A detailed understanding of the coast's shoreline position and geometry over annual to decadal time periods is essential to any management of the coast.
The historical shoreline analysis was based on all available Ordnance Survey maps and aerial imagery. The analysis looked at position and geometry over annual to decadal time periods, providing a dynamic picture of how the coastline has changed since the early 1800s.
Once all datasets were collated, the data were interrogated using the Digital Shoreline Analysis System (DSAS), an ArcGIS package. DSAS enables a user to calculate rate-of-change statistics from multiple historical shoreline positions. Rates of change were calculated at 25 m intervals and displayed both statistically and spatially, allowing areas of retreat or accretion to be identified along any given stretch of coastline.
The DSAS software produces the following rate-of-change statistics:
Net Shoreline Movement (NSM) – the distance between the oldest and the youngest shorelines.
Shoreline Change Envelope (SCE) – a measure of the total change in shoreline movement considering all available shoreline positions and reporting their distances, without reference to their specific dates.
End Point Rate (EPR) – derived by dividing the distance of shoreline movement by the time elapsed between the oldest and the youngest shoreline positions.
Linear Regression Rate (LRR) – determines a rate-of-change statistic by fitting a least-squares regression line to all shorelines at a specific transect.
Weighted Linear Regression Rate (WLR) – calculates a weighted linear regression of shoreline change on each transect. It considers the shoreline uncertainty, giving more emphasis to shorelines with a smaller error.
The end product provided by Ulster University is an invaluable tool and digital asset that has helped to visualise shoreline change and assess approximate rates of historical change at any given coastal stretch on the Northern Ireland coast.
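The rate-of-change statistics described above can be sketched for a single transect as follows. This is an illustrative reimplementation of the definitions, not the DSAS code itself. The example measurements are hypothetical, distances are assumed to be measured from the baseline in metres (so negative movement means the shoreline moved toward the baseline), and the weight of 1/uncertainty² in the weighted regression is an assumption chosen so that shorelines with smaller error receive more emphasis.

```python
def weighted_slope(x, y, w):
    """Slope of the weighted least-squares line through (x, y) with weights w."""
    total = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total
    num = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    den = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    return num / den


def transect_stats(records):
    """records: (year, distance_m, uncertainty_m) tuples for one transect."""
    records = sorted(records)                    # oldest shoreline first
    years = [r[0] for r in records]
    dist = [r[1] for r in records]
    sigma = [r[2] for r in records]
    nsm = dist[-1] - dist[0]                     # youngest minus oldest position
    return {
        "NSM": nsm,
        "SCE": max(dist) - min(dist),            # envelope ignores dates
        "EPR": nsm / (years[-1] - years[0]),     # metres per year
        "LRR": weighted_slope(years, dist, [1.0] * len(dist)),
        "WLR": weighted_slope(years, dist, [1.0 / s ** 2 for s in sigma]),
    }


# Hypothetical shoreline positions along one transect.
example = [
    (1832, 100.0, 10.0),
    (1904, 90.0, 10.0),
    (1963, 95.0, 5.0),
    (2003, 70.0, 2.0),
    (2019, 60.0, 1.0),
]
for name, value in transect_stats(example).items():
    print(f"{name}: {value:.3f}")
```

EPR uses only the two end shorelines, while LRR and WLR use every shoreline, which is why the three rates can differ when intermediate positions do not lie on a straight line.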
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Sachin Gupta
Released under CC0: Public Domain
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains all the data used for the article "Estimating changes in air pollutant levels due to COVID-19 lockdown measures based on a business-as-usual prediction scenario using data mining models: A case-study for urban traffic sites in Spain", submitted to Environmental Software & Modelling by J. González-Pardo et al. (2022) published in Science of the Total Environment (STOTEN). For the sake of reproducibility, it includes Jupyter notebooks with worked examples that allow the results shown in that paper to be reproduced.
Contact: jaime.diez.gp@gmail.com
During the course of this research, the pyaemet Python library was developed to download daily meteorological observations from the Spanish Met Service (AEMET) via its OpenData REST API. It is needed to perform the data curation process.
This research was developed in the framework of the project “Contaminación atmosférica y COVID-19: ¿Qué podemos aprender de esta pandemia?”, selected in the Extraordinary BBVA Foundation grant call for SARS-CoV-2 and COVID-19 research proposals, within the area of ecology and veterinary science.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
It includes five different datasets. The first four datasets contain student projects collected from different offerings of two undergraduate-level courses – Object-Oriented Analysis and Design (OOAD) and Software Engineering (SE) – taught in a renowned private university in Lahore over a period of six years. The fifth dataset contains real-life industry projects collected from a renowned software house (i.e. member of Pakistan Software Houses Association for IT and ITeS (P@SHA)) in Lahore.
Dataset #1 consists of 31 C++ GUI-based desktop applications. Dataset #2 consists of 19 Java GUI-based desktop applications. Dataset #3 consists of 12 Java web applications. Dataset #4 consists of 31 Java applications spanning both of the preceding two categories (GUI-based desktop and web). Dataset #5 consists of 11 VB.NET GUI-based desktop applications.
Attributes are used as follows:
Project Code – Project ID for identification purposes
NOC – The total number of classes in a class diagram
NOA – The total number of attributes in a class diagram
NOM – The total number of methods/operations in a class diagram
NODep – The total number of dependency relationships in a class diagram
NOAss – The total number of association relationships in a class diagram
NOComp – The total number of composition relationships in a class diagram
NOAgg – The total number of aggregation relationships in a class diagram
NOGen – The total number of generalization relationships in a class diagram
NORR – The total number of realization relationships in a class diagram
NOOM – The total number of one-to-one multiplicity relationships in a class diagram
NOMM – The total number of one-to-many multiplicity relationships in a class diagram
NMMM – The total number of many-to-many multiplicity relationships in a class diagram
OCP – Objective class points
EOCP – Enhanced objective class points
WEOCP – Weighted enhanced objective class points
SLOC – Software size measured in source lines of code
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Effective communication skills, both written and oral, are considered core skills for statisticians. This article presents five small-scale writing projects that were developed for an applied regression course, including the specific writing skills emphasized in each project and what each project entails. We also present and discuss results from surveys on changes in writing attitudes throughout the course and student feedback on the projects. The results indicate improved attitudes toward writing and a positive experience for students. Recommendations for incorporating the writing projects based on our observations of implementing them and potential changes are also provided. Materials for all projects are available in the online supplemental materials.
A multivariate regression model was developed to predict zero-order oxygen reduction rates (mg/L/yr) in aquifers across the State of Wisconsin. The model used a combination of dissolved oxygen concentrations and mean groundwater ages estimated with sampled age tracers from wells in the U.S. Geological Survey National Water Information System and previously published project reports from state agencies and universities. The multivariate regression model was solved using the Microsoft Excel solver, with 461 wells used for training and 46 wells held out for validation. A total of 31 predictor variables were used for model development (56 were tested), including basic well characteristics, soil properties, aquifer properties, hydrologic position on the landscape, recharge and evapotranspiration rates, and land use characteristics. Model results indicate that the mean oxygen reduction rate for the training wells is 0.15 mg/L/yr (ranging from 0.07 to 0.59 mg/L/yr), with a root mean weighted square error of 3.13 mg/L/yr and a coefficient of determination (r^2) of 0.49 for the holdout validation data. This data release includes the Microsoft Excel file that represents the final solved regression model, as well as an Excel file that describes all of the predictor variables that were tested with the model.
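As background for the term "zero-order oxygen reduction rate", the sketch below shows the arithmetic that relates a rate to a dissolved oxygen concentration and a groundwater age under a zero-order assumption, DO(t) = DO_0 - k·t. It is not the regression model from the data release. The function names, the initial concentration of 10 mg/L, and the example well values are hypothetical.

```python
def zero_order_rate(do_initial, do_measured, age_years):
    """Zero-order rate k (mg/L/yr) implied by DO(t) = DO_0 - k * t."""
    if age_years <= 0:
        raise ValueError("age_years must be positive")
    return (do_initial - do_measured) / age_years


def predicted_do(do_initial, rate, age_years):
    """Dissolved oxygen after age_years, floored at zero once oxygen is exhausted."""
    return max(0.0, do_initial - rate * age_years)


# Hypothetical well: recharge water assumed to start at 10 mg/L, sampled at
# 7 mg/L with a mean groundwater age of 20 years.
rate = zero_order_rate(do_initial=10.0, do_measured=7.0, age_years=20.0)
print(f"rate = {rate:.3f} mg/L/yr")
print(f"DO after 50 years = {predicted_do(10.0, rate, 50.0):.2f} mg/L")
```

Under a zero-order assumption the rate does not depend on the remaining oxygen concentration, so the predicted concentration declines linearly with age until it reaches zero.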
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This project has 3 datasets. The training set and test set consist of the input data from the Swedish Motor Insurance dataset, divided in an 80%-20% ratio. The third dataset consists of our predictions for the sum of payments using linear regression.
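The 80%-20% split and the linear regression prediction described above can be sketched as follows. The rows are synthetic stand-ins generated in the code, not the actual insurance data, and the single predictor (number of claims), the coefficients used to generate the data, and the random seed are all assumptions made for illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares intercept and slope for one predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


rng = random.Random(0)

# Synthetic stand-in rows: (number of claims, sum of payments).
rows = []
for _ in range(50):
    claims = rng.randint(0, 100)
    rows.append((claims, 20.0 + 3.4 * claims + rng.gauss(0.0, 5.0)))

# Shuffle before splitting so the test set is a random 20% of the rows.
rng.shuffle(rows)
cut = int(0.8 * len(rows))
train, test = rows[:cut], rows[cut:]

# Fit on the training rows only, then predict the held-out rows.
intercept, slope = fit_line([c for c, _ in train], [p for _, p in train])
predictions = [intercept + slope * c for c, _ in test]
rmse = (sum((p - y) ** 2 for p, (_, y) in zip(predictions, test)) / len(test)) ** 0.5

print(f"intercept={intercept:.2f} slope={slope:.2f} test RMSE={rmse:.2f}")
```

Fitting on the training rows alone keeps the test-set error an honest estimate of how the model performs on data it has not seen.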
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository houses the code and data for simulations that apply multiple regression analysis models to biased occurrence data to detect thermophilization.
This R code simulates the application of a multiple regression analysis model to biased occurrence data to detect thermophilization.
Note: To reduce running time, we used parallel computation (run time of approximately 30 minutes). Since seven CPUs were used, an equal or greater number of CPUs is required to reproduce the same results.
Simulation-generated distribution data of fictitious biota species. The column names are explained below.
Column Names | Explanation |
IndID | Unique individual identification number |
SpeciesID | Unique identification number for the species to which the individual belongs. |
Step | Steps in which the individual exists. |
LTI | Local Temperature Index (LTI) of the location where the individual occurred. |
SpeciesLTICenter | Central value of the species-specific LTI at the time of its Step |
Prob.BiasToWarm | Value of weighting sampled when Bias to Warm is present. |
Prob.BiasToCold | Value of weighting sampled when Bias to Cold is present. |
The result of extracting 2,000 biased occurrence records from the distribution data.
Column Names | Explanation |
IndID | Unique identification number of the extracted individual. |
SpeciesID | Unique identification number for the species to which the individual belongs. |
Step | Steps in which the individual is extracted |
LTI | Local Temperature Index (LTI) of the location where the individual occurred. |
EstSTI | Species Temperature Index (STI) of the recorded species, calculated on the basis of the occurrence data. |
BiasType | The type of bias |
iter | The iteration number |
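As an illustration of how an estimated STI could be derived from occurrence records with the columns above, the sketch below averages the LTI of each species' occurrences. Taking the mean is an assumption made here for illustration; the description only states that EstSTI is calculated from the occurrence data and does not specify the estimator. The records are hypothetical.

```python
from collections import defaultdict

# Hypothetical occurrence records using the column names from the table above.
occurrences = [
    {"IndID": 1, "SpeciesID": 1, "Step": 1, "LTI": 10.0},
    {"IndID": 2, "SpeciesID": 1, "Step": 1, "LTI": 12.0},
    {"IndID": 3, "SpeciesID": 2, "Step": 1, "LTI": 18.0},
    {"IndID": 4, "SpeciesID": 2, "Step": 2, "LTI": 20.0},
    {"IndID": 5, "SpeciesID": 2, "Step": 2, "LTI": 22.0},
]


def estimate_sti(records):
    """Mean LTI per species (assumed estimator for EstSTI)."""
    lti_by_species = defaultdict(list)
    for rec in records:
        lti_by_species[rec["SpeciesID"]].append(rec["LTI"])
    return {sp: sum(vals) / len(vals) for sp, vals in lti_by_species.items()}


est_sti = estimate_sti(occurrences)

# Attach the species-level estimate to every occurrence record.
with_sti = [{**rec, "EstSTI": est_sti[rec["SpeciesID"]]} for rec in occurrences]
for rec in with_sti:
    print(rec)
```

Because the estimate is computed from the sampled occurrences, any sampling bias toward warm or cold locations carries over into EstSTI, which is the effect the simulation is designed to study.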
This simulation code uses the following packages.
{tidyverse} package
Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” _Journal of Open Source Software_, *4*(43), 1686. doi:10.21105/joss.01686 <https://doi.org/10.21105/joss.01686>.
{broom} package
Robinson D, Hayes A, Couch S (2024). _broom: Convert Statistical Objects into Tidy Tibbles_. R package version 1.0.7, <https://github.com/tidymodels/broom>.
{rlist} package
Ren K (2021). _rlist: A Toolbox for Non-Tabular Data Manipulation_. R package version 0.4.6.2, <https://CRAN.R-project.org/package=rlist>.
{data.table} package
Barrett T, Dowle M, Srinivasan A, Gorecki J, Chirico M, Hocking T (2024). _data.table: Extension of `data.frame`_. R package version 1.15.4, <https://CRAN.R-project.org/package=data.table>.
{snowfall} package
Knaus J (2023). _snowfall: Easier Cluster Computing (Based on 'snow')_. R package version 1.84-6.3, <https://CRAN.R-project.org/package=snowfall>.
{magrittr} package
Bache S, Wickham H (2022). _magrittr: A Forward-Pipe Operator for R_. R package version 2.0.3, <https://CRAN.R-project.org/package=magrittr>.
{ggpmisc} package
Aphalo P (2024). _ggpmisc: Miscellaneous Extensions to 'ggplot2'_. R package version 0.5.6, <https://CRAN.R-project.org/package=ggpmisc>.
{effsize} package
Torchiano M (2020). _effsize: Efficient Effect Size Computation_. doi:10.5281/zenodo.1480624 <https://doi.org/10.5281/zenodo.1480624>, R package version 0.8.1, <https://CRAN.R-project.org/package=effsize>.
{conflicted} package
Wickham H (2023). _conflicted: An Alternative Conflict Resolution Strategy_. R package version 1.2.0, <https://CRAN.R-project.org/package=conflicted>.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Multiple regression results for health outcomes—VADER.
Sandy ocean beaches are a popular recreational destination, often surrounded by communities containing valuable real estate. Development is on the rise despite the fact that coastal infrastructure is subjected to flooding and erosion. As a result, there is an increased demand for accurate information regarding past and present shoreline changes. To meet these national needs, the Coastal and Marine Geology Program of the U.S. Geological Survey (USGS) is compiling existing reliable historical shoreline data along open-ocean sandy shores of the conterminous United States and parts of Alaska and Hawaii under the National Assessment of Shoreline Change project. There is no widely accepted standard for analyzing shoreline change. Existing shoreline data measurements and rate calculation methods vary from study to study and prevent combining results into state-wide or regional assessments. The impetus behind the National Assessment project was to develop a standardized method of measuring changes in shoreline position that is consistent from coast to coast. The goal was to facilitate the process of periodically and systematically updating the results in an internally consistent manner.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In this article, we explore the use of two published datasets for teaching a wide range of students about regression models, with a particular focus on interaction terms. The two datasets come from recent psychology studies on beliefs about poverty and welfare, and on the dynamics of group projects. Both datasets (and their original research papers) are accessible to students, and because of their context, students can learn about data collection, measurement, and the use of statistics when studying complex social topics, while using the data to learn about regression analysis. We have used these data for a range of in-class activities, journal paper discussions, exams, and extended projects, at the undergraduate, master’s, and doctoral levels. Supplementary materials for this article are available online.
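For readers unfamiliar with interaction terms, the sketch below fits a regression of the form y = b0 + b1·x + b2·g + b3·(x·g) to synthetic data and shows how the interaction coefficient changes the slope between groups. The variables `x` and `g`, the coefficient values, and the data are hypothetical and are not drawn from the two psychology datasets.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)                  # continuous predictor (hypothetical)
g = rng.integers(0, 2, size=n)          # 0/1 group indicator (hypothetical)

# True coefficients: intercept, x, g, and the x*g interaction.
true_beta = np.array([1.0, 2.0, -0.5, 1.5])
design = np.column_stack([np.ones(n), x, g, x * g])
y = design @ true_beta + rng.normal(scale=0.1, size=n)

# Ordinary least squares on the design matrix that includes the product term.
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)

slope_group0 = beta_hat[1]                  # slope of x when g = 0
slope_group1 = beta_hat[1] + beta_hat[3]    # slope of x when g = 1

print("estimated coefficients:", np.round(beta_hat, 3))
print("slope in group 0:", round(float(slope_group0), 3))
print("slope in group 1:", round(float(slope_group1), 3))
```

The interaction coefficient is the difference between the two group slopes, which is why the coefficient on `x` alone can only be interpreted as the slope for the reference group.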
This dataset consists of short-term (1970-2009) linear regression shoreline change rates for the Boston region of Massachusetts. Rates of short-term shoreline change were computed within a GIS using the Digital Shoreline Analysis System (DSAS) version 4.3, an ArcGIS extension developed by the U.S. Geological Survey. The baseline is used as a reference line for the transects cast by the DSAS software. The transects intersect each shoreline at the measurement points, which are then used to calculate the short-term rates. Due to continued coastal population growth and increased threats of erosion, current data on trends and rates of shoreline movement are required to inform shoreline and floodplain management. The Massachusetts Office of Coastal Zone Management launched the Shoreline Change Project in 1989 to identify erosion-prone areas of the coast. In 2001, a 1994 shoreline was added to calculate both long- and short-term shoreline change rates at 40-meter intervals along ocean-facing sections of the Massachusetts coast. The Coastal and Marine Geology Program of the U.S. Geological Survey (USGS) in cooperation with the Massachusetts Office of Coastal Zone Management, has compiled reliable historical shoreline data along open-facing sections of the Massachusetts coast under the Massachusetts Shoreline Change Mapping and Analysis Project 2013 Update. Two oceanfront shorelines for Massachusetts (approximately 1,800 km) were (1) delineated using 2008/09 color aerial orthoimagery, and (2) extracted from topographic LIDAR datasets (2007) obtained from NOAA's Ocean Service, Coastal Services Center. The new shorelines were integrated with existing Massachusetts Office of Coastal Zone Management and USGS historical shoreline data in order to compute long- and short-term rates using the latest version of the Digital Shoreline Analysis System (DSAS).