License: CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
The Ice Cream Selling dataset is a simple dataset well suited to machine-learning beginners who want to practice polynomial regression. It consists of two columns: temperature and the corresponding number of ice cream units sold.
The dataset captures the relationship between temperature and ice cream sales. It serves as a practical example for understanding and implementing polynomial regression, a powerful technique for modeling nonlinear relationships in data.
The dataset is designed to be straightforward and easy to work with, making it ideal for beginners. The simplicity of the data allows beginners to focus on the fundamental concepts and steps involved in polynomial regression without overwhelming complexity.
By using this dataset, beginners can gain hands-on experience in preprocessing the data, splitting it into training and testing sets, selecting an appropriate degree for the polynomial regression model, training the model, and evaluating its performance. They can also explore techniques to address potential challenges such as overfitting.
With this dataset, beginners can practice making predictions of ice cream sales based on temperature inputs and visualize the polynomial regression curve that represents the relationship between temperature and ice cream sales.
Overall, the Ice Cream Selling dataset provides an accessible and practical learning resource for beginners to grasp the concepts and techniques of polynomial regression in the context of analyzing ice cream sales data.
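A minimal sketch of that workflow in R, assuming the file and column names ice_cream_selling.csv, temperature, and units_sold (none of which are specified by the dataset description):

```r
# Fit a polynomial regression on a train/test split and report test MSE.
# File and column names are assumptions, not given by the dataset.
df <- read.csv("ice_cream_selling.csv")
set.seed(42)
idx <- sample(nrow(df), size = floor(0.8 * nrow(df)))
train <- df[idx, ]
test <- df[-idx, ]
fit <- lm(units_sold ~ poly(temperature, 3), data = train)  # degree 3 as an example
pred <- predict(fit, newdata = test)
mean((test$units_sold - pred)^2)  # test-set mean squared error
```

Comparing this test error across degrees (2, 3, 4, ...) is one simple way to spot the overfitting the description mentions.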
License: Open Government Licence 3.0, http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
The primary objective of this project was to acquire historical shoreline information for all of the Northern Ireland coastline. Having this detailed understanding of the coast's shoreline position and geometry over annual to decadal time periods is essential in any management of the coast.
The historical shoreline analysis was based on all available Ordnance Survey maps and aerial imagery. The analysis looked at position and geometry over annual to decadal time periods, providing a dynamic picture of how the coastline has changed since the early 1800s.
Once all datasets were collated, the data were interrogated using the ArcGIS package Digital Shoreline Analysis System (DSAS). DSAS is a software package that enables a user to calculate rate-of-change statistics from multiple historical shoreline positions. Rate-of-change statistics were calculated at 25 m intervals and displayed both statistically and spatially, allowing areas of retreat/accretion to be identified along any given stretch of coastline.
The DSAS software produces the following rate-of-change statistics:
- Net Shoreline Movement (NSM): the distance between the oldest and the youngest shorelines.
- Shoreline Change Envelope (SCE): a measure of the total change in shoreline movement, considering all available shoreline positions and reporting their distances without reference to their specific dates.
- End Point Rate (EPR): derived by dividing the distance of shoreline movement by the time elapsed between the oldest and the youngest shoreline positions.
- Linear Regression Rate (LRR): determines a rate-of-change statistic by fitting a least-squares regression to all shorelines at specific transects.
- Weighted Linear Regression Rate (WLR): calculates a weighted linear regression of shoreline change on each transect, taking shoreline uncertainty into account by giving more emphasis to shorelines with a smaller error.
The end product provided by Ulster University is an invaluable tool and digital asset that has helped to visualise shoreline change and assess approximate rates of historical change at any given coastal stretch on the Northern Ireland coast.
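The EPR and LRR statistics are simple to reproduce from the definitions above; a minimal sketch in R for a single transect (positions and dates invented for illustration):

```r
# One transect: shoreline distance from a baseline (m) at each survey date.
pos <- data.frame(year = c(1834, 1905, 1957, 2006),
                  dist = c(120, 112, 103, 95))
# End Point Rate: movement between oldest and youngest shorelines / elapsed time
epr <- (pos$dist[nrow(pos)] - pos$dist[1]) / (pos$year[nrow(pos)] - pos$year[1])
# Linear Regression Rate: least-squares slope through all shoreline positions
lrr <- coef(lm(dist ~ year, data = pos))["year"]
c(EPR = epr, LRR = lrr)  # metres per year; negative values indicate retreat
```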
License: CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains a collection of 100 randomly generated data points representing the relationship between the number of hours a student spends studying and their corresponding performance, measured as a score. The data has been generated to simulate a real-world scenario where study hours are assumed to influence academic outcomes, making it an excellent resource for linear regression analysis and other machine learning tasks.
Each row in the dataset consists of:
- Hours: The number of hours a student dedicates to studying, ranging between 0 and 10 hours.
- Scores: The student's performance score, represented as a percentage, ranging from 0 to 100.
Use Cases: This dataset is particularly useful for:
- Linear Regression: Exploring how study hours influence student performance, fitting a regression line to predict scores based on study time.
- Data Science & Machine Learning: Practicing regression analysis, training models, and applying other predictive algorithms.
- Educational Research: Simulating data-driven insights into student behavior and performance metrics.
Features: 100 rows of data. Continuous numerical variables suitable for regression tasks. Generated for educational purposes, making it ideal for students, teachers, and beginners in machine learning and data science.
Potential Applications:
- Build a linear regression model to predict student scores.
- Investigate the correlation between study time and performance.
- Apply data visualization techniques to better understand the data.
- Use the dataset to experiment with model evaluation metrics like Mean Squared Error (MSE) and R-squared.
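A minimal sketch of that regression in R; the column names Hours and Scores come from the description above, while the file name is an assumption:

```r
# Simple linear regression of Scores on Hours, with MSE and R-squared.
df <- read.csv("study_scores.csv")  # file name is an assumption
fit <- lm(Scores ~ Hours, data = df)
pred <- predict(fit)
mean((df$Scores - pred)^2)   # Mean Squared Error
summary(fit)$r.squared       # R-squared
```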
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Goodness-of-fit measure for multiple linear regression model.
License: CC0 1.0, https://spdx.org/licenses/CC0-1.0.html
This dataset contains simulated datasets, empirical data, and R scripts described in the paper: "Li, Q. and Kou, X. (2021) WiBB: An integrated method for quantifying the relative importance of predictive variables. Ecography (DOI: 10.1111/ecog.05651)".
A fundamental goal of scientific research is to identify the underlying variables that govern crucial processes of a system. Here we proposed a new index, WiBB, which integrates the merits of several existing methods: a model-weighting method from information theory (Wi), a standardized regression coefficient method measured by β* (B), and the bootstrap resampling technique (B). We applied WiBB to simulated datasets with known correlation structures, for both linear models (LM) and generalized linear models (GLM), to evaluate its performance. We also applied two other methods, relative sum of weights (SWi) and standardized beta (β*), to evaluate their performance relative to the WiBB method in ranking predictor importance under various scenarios. We further applied it to an empirical dataset of the plant genus Mimulus to select bioclimatic predictors of species' presence across the landscape. Results on the simulated datasets showed that the WiBB method outperformed the β* and SWi methods in scenarios with small and large sample sizes, respectively, and that the bootstrap resampling technique significantly improved the discriminant ability. When testing WiBB on the empirical dataset with GLM, it sensibly identified four important predictors with high credibility out of six candidates in modeling the geographical distributions of 71 Mimulus species. This integrated index has great advantages in evaluating predictor importance, and hence in reducing the dimensionality of data, without losing interpretive power. The simplicity of calculating the new metric, compared with more sophisticated statistical procedures, makes it a handy addition to the statistical toolbox.
Methods: To simulate independent datasets (size = 1000), we adopted Galipaud et al.'s (2014) approach with custom modifications of the data.simulation function, which used the multivariate normal distribution function rmvnorm in the R package mvtnorm (v1.0-5, Genz et al. 2016). Each dataset was simulated with a preset correlation structure between a response variable (y) and four predictors (x1, x2, x3, x4). The first three (genuine) predictors were set to be strongly, moderately, and weakly correlated with the response variable, respectively (denoted by large, medium, and small Pearson correlation coefficients, r), while the correlation between the response and the last (spurious) predictor was set to zero. We simulated datasets with three levels of difference in the correlation coefficients of consecutive predictors, Δr = 0.1, 0.2, and 0.3. These three levels of Δr resulted in three correlation structures between the response and the four predictors: (0.3, 0.2, 0.1, 0.0), (0.6, 0.4, 0.2, 0.0), and (0.8, 0.6, 0.3, 0.0), respectively. We repeated the simulation procedure 200 times for each of the three preset correlation structures (600 datasets in total) for later LM fitting. For GLM fitting, we modified the simulation procedures with additional steps, in which we converted the continuous response into binary data O (e.g., occurrence data having 0 for absence and 1 for presence). We tested the WiBB method, along with two other methods, relative sum of weights (SWi) and standardized beta (β*), to evaluate the ability to correctly rank predictor importance under various scenarios. The empirical dataset of 71 Mimulus species was assembled from occurrence coordinates and corresponding values extracted from climatic layers of the WorldClim dataset (www.worldclim.org), and we applied the WiBB method to infer important predictors for their geographical distributions.
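A minimal sketch of one such simulated dataset in R, using the (0.6, 0.4, 0.2, 0.0) structure and, as a simplifying assumption not stated above, mutually uncorrelated predictors:

```r
# Simulate y and x1..x4 with preset correlations r between y and each predictor.
library(mvtnorm)
r <- c(0.6, 0.4, 0.2, 0.0)
sigma <- diag(5)          # correlation matrix, response in row/column 1
sigma[1, 2:5] <- r
sigma[2:5, 1] <- r
set.seed(1)
dat <- as.data.frame(rmvnorm(1000, mean = rep(0, 5), sigma = sigma))
names(dat) <- c("y", "x1", "x2", "x3", "x4")
# GLM case: convert the continuous response into binary occurrence data
dat$O <- as.integer(dat$y > 0)
```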
The dataset contains model coefficients and statistics for the 488 regression models used to estimate streamwater constituent loads for 13 watersheds in Gwinnett County, Georgia, for two calibration periods, water years 2003-2010 and 2010-2020. Model terms were selected from an 11-parameter equation, which was a function of discharge, base flow, season, turbidity, and time (trend), using a forward stepwise ordinary least squares regression approach. Model coefficients were fit using U.S. Geological Survey (USGS) LOADEST load estimation software. Models were fit both with and without turbidity explanatory variables for 12 water-quality constituents: total suspended solids, suspended sediment concentration, total nitrogen, total nitrate plus nitrite, total phosphorus, dissolved phosphorus, total organic carbon, total calcium, total magnesium, total lead, total zinc, and total dissolved solids. The dataset includes a summary of the sample concentrations used to calibrate the models (period of samples collected, number of concentrations, number of censored concentrations, and number of outliers removed), model coefficients, and selected model statistics (concentration and load model R-squared values, estimated residual variance, serial correlation in the model residuals, and the Turnbull-Weiss normality test statistic of the residuals). Portable document format files of LOADEST output are provided for each model in a "zip" file that contains model diagnostic statistics and plots for evaluating model fits.
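LOADEST models are rating-curve regressions of log load on functions of discharge, season, and time; a minimal illustrative sketch in R of that general form (a simplification of the 11-parameter equation described above, with hypothetical column names):

```r
# Simplified LOADEST-style load regression (the actual models add base-flow
# and turbidity terms). Assumed columns: load, q (discharge), dtime (decimal years).
fit <- lm(log(load) ~ log(q) + I(log(q)^2) +
            sin(2 * pi * dtime) + cos(2 * pi * dtime) + dtime,
          data = samples)
summary(fit)$r.squared  # load-model R-squared, one of the reported statistics
```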
License: Database Contents License (DbCL) 1.0, http://opendatacommons.org/licenses/dbcl/1.0/
Project Description:
In this project, I developed a linear regression model to predict car prices based on key features such as fuel tank capacity, width, length, and year of manufacture. The goal was to understand how these factors influence car prices and to assess the effectiveness of the model in making accurate predictions.
Key Features:
- Fuel Tank Capacity: The capacity of the car's fuel tank.
- Width: The width of the car.
- Length: The length of the car.
- Year: The year of manufacture of the car.
Target Variable:
Price: The price of the car, which is the primary variable being predicted.
Methodology:
Data Preparation:
Model Training:
Feature Scaling:
Evaluation:
Visualization:
Results:
Technologies Used:
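A minimal sketch in R of the workflow outlined above (the file and column names are assumptions):

```r
# Scale the features, fit a linear regression for price, and check the fit.
cars <- read.csv("car_prices.csv")  # fuel_tank, width, length, year, price assumed
feats <- as.data.frame(scale(cars[, c("fuel_tank", "width", "length", "year")]))
feats$price <- cars$price
fit <- lm(price ~ fuel_tank + width + length + year, data = feats)
summary(fit)$r.squared                 # goodness of fit
head(data.frame(actual = feats$price,  # actual vs. predicted prices
                predicted = predict(fit)))
```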
This dataset is designed for beginners to practice regression problems, particularly in the context of predicting house prices. It contains 1000 rows, with each row representing a house and various attributes that influence its price. The dataset is well-suited for learning basic to intermediate-level regression modeling techniques.
Beginner Regression Projects: This dataset can be used to practice building regression models such as Linear Regression, Decision Trees, or Random Forests. The target variable (house price) is continuous, making this an ideal problem for supervised learning techniques.
Feature Engineering Practice: Learners can create new features by combining existing ones, such as the price per square foot or age of the house, providing an opportunity to experiment with feature transformations.
Exploratory Data Analysis (EDA): You can explore how different features (e.g., square footage, number of bedrooms) correlate with the target variable, making it a great dataset for learning about data visualization and summary statistics.
Model Evaluation: The dataset allows for various model evaluation techniques such as cross-validation, R-squared, and Mean Absolute Error (MAE). These metrics can be used to compare the effectiveness of different models.
The dataset is highly versatile for a range of machine learning tasks. You can apply simple linear models to predict house prices based on one or two features, or use more complex models like Random Forest or Gradient Boosting Machines to understand interactions between variables.
It can also be used for dimensionality reduction techniques like PCA or to practice handling categorical variables (e.g., neighborhood quality) through encoding techniques like one-hot encoding.
This dataset is ideal for anyone wanting to gain practical experience in building regression models while working with real-world features.
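A minimal sketch in R combining those ideas, with hypothetical column names (sqft, bedrooms, neighborhood_quality, price):

```r
# Feature engineering, categorical encoding via factor(), and a random
# forest baseline evaluated with MAE. Column names are assumptions.
library(randomForest)
houses <- read.csv("house_prices.csv")
houses$price_per_sqft <- houses$price / houses$sqft  # EDA feature only:
                                                     # it contains the target
houses$neighborhood_quality <- factor(houses$neighborhood_quality)
rf <- randomForest(price ~ sqft + bedrooms + neighborhood_quality,
                   data = houses, ntree = 500)
mean(abs(houses$price - predict(rf)))  # out-of-bag Mean Absolute Error
```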
In 1991, the U.S. Geological Survey (USGS) began a study of more than 50 major river basins across the Nation as part of the National Water-Quality Assessment (NAWQA) project. One of the major goals of the NAWQA project was to determine how river water quality has changed over time. To support that goal, long-term consistent and comparable monitoring has been conducted by the USGS on streams and rivers throughout the Nation. Outside of the NAWQA project, the USGS and other Federal, State, and local agencies have also collected long-term water-quality data to support their own assessments of changing water quality. In 2017, data from these multiple sources were combined to support one of the most comprehensive assessments to date of water-quality trends in the United States (Oelsner and others, 2017; De Cicco and others, 2017). This data release updates those water-quality trends, which previously ended in 2012, with five more years of data; the trends now end in 2017. This USGS data release contains all the input and output files necessary to reproduce the results from the Weighted Regressions on Time, Discharge, and Season (WRTDS) models, using data preparation methods described in Oelsner and others, 2017. Models were calibrated for each combination of site and parameter using the screened input data. Models were run on Yeti, the USGS supercomputer, in 3 separate runs, using the scripts in the "Script.zip" folder. See readMe.txt for details on how the files in this data release are related and on the modeling process. "SiteTable.csv" gives information on the sites used in this analysis. Once calibrated, the WRTDS models were initially evaluated using a logistic regression equation that estimated a probability of acceptance for each model (e.g., "a good fit") based on a set of diagnostic metrics derived from the observed, estimated, and residual values from each model and dataset. Each WRTDS model was assigned to one of three categories: "auto-accept," "auto-reject," or "manual evaluation". Models assigned to the latter category were visually evaluated for appropriate model fit using residual and diagnostic plots. Models assigned to the first two categories were automatically included in or rejected from the final results, respectively. Twenty-two water-quality parameters were assessed, including nutrients (ammonia, nitrate, filtered orthophosphate, total nitrogen, total phosphorus, and unfiltered orthophosphate), major ions (calcium, bromide, fluoride, chloride, magnesium, potassium, sodium, and sulfate), salinity indicators (total dissolved solids and specific conductance), sediment (total suspended solids and suspended sediment concentration), carbon (dissolved organic carbon, total organic carbon, and particulate organic carbon), and alkalinity. Trends are reported for six periods: 1972-2017, 1982-2017, 1987-2017, 1992-2017, 2002-2017, and 2007-2017.
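WRTDS itself is implemented in the USGS EGRET R package; a minimal sketch of one site-parameter model, using the Choptank River example site and nitrate parameter code from the EGRET documentation rather than a site from this release:

```r
# Fit a WRTDS model for one site/parameter combination with EGRET.
library(EGRET)
Daily  <- readNWISDaily("01491000", "00060", "1979-10-01", "2011-09-30")
Sample <- readNWISSample("01491000", "00631", "1979-10-01", "2011-09-30")
INFO   <- readNWISInfo("01491000", "00631", interactive = FALSE)
eList  <- mergeReport(INFO, Daily, Sample)
eList  <- modelEstimation(eList)  # calibrate the WRTDS surfaces
plotConcHist(eList)               # history of flow-normalized concentration
```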
Greatest support for best-fit models is indicated by the lowest ΔAIC (Akaike's information criterion) values; all models with ΔAIC < 2 are indicated in bold. Coefficients from the most-supported model are provided, and terms with support for a significant relationship (positive or negative) with whale presence (p-value < 0.05) are noted in bold. Df = degrees of freedom. For Trawler Activity state, Towing was the reference level, and for Fishing Area, Flemish Cap was the reference level.
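A minimal sketch of that ΔAIC ranking in R (the model formulas and data frame are hypothetical placeholders):

```r
# Rank candidate GLMs of whale presence by delta-AIC; support is strongest
# for models within 2 units of the minimum.
models <- list(
  activity      = glm(presence ~ trawler_activity, family = binomial, data = d),
  area          = glm(presence ~ fishing_area, family = binomial, data = d),
  activity_area = glm(presence ~ trawler_activity + fishing_area,
                      family = binomial, data = d))
aic <- sapply(models, AIC)
sort(aic - min(aic))  # models with delta-AIC < 2 have greatest support
```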
This study was initiated to provide baseline data and to determine the utility of stable isotope analysis for evaluating the foraging strategies of an opportunistic reptile predator. Stable isotope ratios of carbon and nitrogen were evaluated from multiple tissues from terrapin populations to determine spatial or temporal variations in resource use within mangrove habitats in Southern Florida. We sampled Diamondback terrapin (Malaclemys terrapin) and potential resources within mainland and island habitats, and evaluated their δ13C and δ15N values. We fit linear regression models to determine the best predictors of isotopic values for both terrapins and their prey, and used SIBER analysis to examine terrapin isotopic niche space and overlap between groups. We identified differences in terrapin δ13C and δ15N values among all sites. Blood and scute tissues revealed different isotopic compositions and niche overlap between sites, suggesting diets or foraging locations may change over time and that the amount of variation is site-specific. Niche overlap between size classes was larger for blood (short-term) than for scute (long-term), suggesting greater variability in food resource use and/or in the isotopic signal of those food resources over short versus long timescales.
Ultra-high-performance liquid chromatography coupled to ion mobility separation and high-resolution mass spectrometry instruments has proven very valuable for screening of emerging contaminants in the aquatic environment. However, when applying suspect or nontarget approaches (i.e., when no reference standards are available), there is no information on retention time (RT) and collision cross-section (CCS) values to facilitate identification. In silico prediction tools for RT and CCS can therefore be of great utility in decreasing the number of candidates to investigate. In this work, Multivariate Adaptive Regression Splines (MARS) were evaluated for the prediction of both RT and CCS. MARS prediction models were developed and validated using a database of 477 protonated molecules, 169 deprotonated molecules, and 249 sodium adducts. Multivariate and univariate models were evaluated, showing a better fit of univariate models to the experimental data. The RT model (R2 = 0.855) showed a deviation between predicted and experimental data of ±2.32 min (95% confidence intervals). The deviation observed for CCS data of protonated molecules using the CCSH model (R2 = 0.966) was ±4.05% with 95% confidence intervals. The CCSH model was also tested for the prediction of deprotonated molecules, resulting in deviations below ±5.86% for 95% of the cases. Finally, a third model was developed for sodium adducts (CCSNa, R2 = 0.954), with deviations below ±5.25% for 95% of the cases. The developed models have been incorporated into an open-access and user-friendly online platform, which represents a great advantage for third-party research laboratories in predicting both RT and CCS data.
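In R, MARS models of this kind are commonly fit with the earth package; a minimal sketch (the descriptor data frames and the rt column are hypothetical):

```r
# Univariate-response MARS model predicting retention time from
# molecular descriptors.
library(earth)
mars_rt <- earth(rt ~ ., data = train_descriptors)  # fit the MARS model
summary(mars_rt)                                     # selected hinge terms, R^2
rt_pred <- predict(mars_rt, newdata = test_descriptors)
```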
These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way, while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis. This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual-level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: File format: R workspace file, "Simulated_Dataset.RData".
Metadata (including data dictionary):
• y: Vector of binary responses (1: adverse outcome, 0: control)
• x: Matrix of covariates; one row for each simulated individual
• z: Matrix of standardized pollution exposures
• n: Number of simulated individuals
• m: Number of exposure time periods (e.g., weeks of pregnancy)
• p: Number of columns in the covariate design matrix
• alpha_true: Vector of "true" critical window locations/magnitudes (i.e., the ground truth that we want to estimate)
Code Abstract: We provide R statistical software code ("CWVS_LMC.txt") to fit the linear model of coregionalization (LMC) version of the Critical Window Variable Selection (CWVS) method developed in the manuscript. We also provide R code ("Results_Summary.txt") to summarize/plot the estimated critical windows and posterior marginal inclusion probabilities.
Description: "CWVS_LMC.txt": This code is delivered to the user in the form of a .txt file that contains R statistical software code. Once the "Simulated_Dataset.RData" workspace has been loaded into R, the text in the file can be used to identify/estimate critical windows of susceptibility and posterior marginal inclusion probabilities. "Results_Summary.txt": This code is also delivered to the user in the form of a .txt file that contains R statistical software code. Once the "CWVS_LMC.txt" code has been applied to the simulated dataset and the program has completed, this code can be used to summarize and plot the identified/estimated critical windows and posterior marginal inclusion probabilities (similar to the plots shown in the manuscript).
Required R packages:
• For running "CWVS_LMC.txt": msm (sampling from the truncated normal distribution), mnormt (sampling from the multivariate normal distribution), and BayesLogit (sampling from the Polya-Gamma distribution)
• For running "Results_Summary.txt": plotrix (plotting the posterior means and credible intervals)
Instructions for Use / Reproducibility: What can be reproduced: The data and code can be used to identify/estimate critical windows from one of the actual simulated datasets generated under setting E4 from the presented simulation study.
How to use the information:
• Load the "Simulated_Dataset.RData" workspace.
• Run the code contained in "CWVS_LMC.txt".
• Once the "CWVS_LMC.txt" code is complete, run "Results_Summary.txt".
Format: Below is the replication procedure for the attached data set for the portion of the analyses using a simulated data set.
Data: The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women.
Availability: Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This also allows the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics, and requires an appropriate data use agreement.
This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, Oxford, UK, 1-30, (2019).
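The replication steps above, as they would be issued in R (the file names are those given in the data release):

```r
# Reproduce the simulated-data analysis from the provided files.
load("Simulated_Dataset.RData")  # loads y, x, z, n, m, p, alpha_true
source("CWVS_LMC.txt")           # fit the CWVS-LMC model (needs msm, mnormt, BayesLogit)
source("Results_Summary.txt")    # plot critical windows (needs plotrix)
```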
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This repository contains various resources related to the study on post-stroke recovery in a mouse model, focusing on the application of the Proportional Recovery Rule (PRR).
- code/: Contains all the code used for the analysis in this study. Detailed information is available in the README within the code folder.
- input/: This folder contains all datasets used in the publication.
- output/: This directory includes the final results generated for each dataset. Detailed information for each dataset's output can be found in the respective subfolders.
- docs/: Additional documentation related to this project, including extra resources in the form of a README file within this folder.
The Fugl-Meyer upper extremity score is a widely used assessment tool in clinical settings to evaluate motor function in stroke patients. With a maximum score of 66, higher values indicate better motor performance, while lower values signify greater deficits.
The Proportional Recovery Rule (PRR) suggests that the magnitude of recovery from nonsevere upper limb motor impairment after stroke is approximately 0.7 times the initial impairment. This rule, proposed in 2008, has been applied to various motor and nonmotor impairments, leading to inconsistencies in its formulation and application across studies.
In this study, we translated the Fugl-Meyer upper extremity score into a deficit score suitable for use in a mouse model. The PRR posits that the change in impairment can be predicted as 0.7 times the initial impairment, plus an error term. We adapted this rule by fitting a linear regression model without an intercept to relate the initial impairment to the change in impairment.
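A minimal sketch of that no-intercept fit in R (the data frame and column names are assumptions):

```r
# PRR fit: regress observed change in impairment on initial impairment,
# with the intercept suppressed. The PRR predicts a slope near 0.7.
prr_fit <- lm(change ~ 0 + initial_impairment, data = recovery)
coef(prr_fit)     # estimated proportional recovery coefficient
predict(prr_fit)  # predicted change in impairment for each animal
```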
Initial Impairment Calculation:
Change Observed and Predicted:
Cluster Analysis:
Outlier Removal:
Cluster Characteristics:
Statistical Analysis:
This structured dataset was created with reference to the following publication:
DOI:10.1038/s41597-023-02242-8
If you have any questions or require further assistance, please do not hesitate to reach out to us. Contact us via email at markus.aswendtATuk-koeln.de or aref.kalantari-sarcheshmehATuk-koeln.de.
The ascii grids represent regional probabilities that groundwater in a particular location will have dissolved oxygen (DO) concentrations less than selected threshold values representing anoxic groundwater conditions, or will have dissolved manganese (Mn) concentrations greater than selected threshold values representing secondary drinking water-quality contaminant levels (SMCL) and health-based screening levels (HBSL). The probability models were constrained by the alluvial boundary of the Central Valley to a depth of approximately 300 meters (m). We utilized prediction modeling methods, specifically boosted regression trees (BRT) with a Bernoulli error distribution, within a statistical learning framework in R (http://www.r-project.org/) to produce two-dimensional probability grids at selected depths throughout the modeling domain. The statistical learning framework seeks to maximize the predictive performance of machine learning methods through model tuning by cross-validation. Models were constructed using measured dissolved oxygen and manganese concentrations sampled from 2,767 wells within the alluvial boundary of the Central Valley and over 60 predictor variables from 7 sources (see metadata), assembled to develop a model that incorporates regional-scale soil properties, soil chemistry, land use, aquifer textures, and aquifer hydrology. Previously developed Central Valley model outputs of textures (Central Valley Textural Model, CVTM; Faunt and others, 2010) and MODFLOW-simulated vertical water fluxes and predicted depth to water table (Central Valley Hydrologic Model, CVHM; Faunt, 2009) were used to represent aquifer textures and groundwater hydraulics, respectively. The wells used in the BRT models described above were attributed to predictor variable values in ArcGIS using a 500-m buffer. The response variable data consisted of measured DO and Mn concentrations from 2,767 wells within the alluvial boundary of the Central Valley. The data were compiled from two sources: the U.S. Geological Survey (USGS) National Water Information System (NWIS) database (all data are publicly available from the USGS at http://waterdata.usgs.gov/ca/nwis/nwis) and the California State Water Resources Control Board Division of Drinking Water (SWRCB-DDW) database (water-quality data are publicly available from the SWRCB at http://geotracker.waterboards.ca.gov/gama/). Only wells with well depth data were selected, and for wells with multiple records, only the most recent sample in the period 1993-2014 that had the required water-quality data was used. Data were available for 932 wells in the NWIS dataset and 1,835 wells in the SWRCB-DDW dataset. Models were trained on the USGS NWIS dataset of 932 wells and evaluated on an independent hold-out dataset of 1,835 wells from the SWRCB-DDW. We used cross-validation to assess the predictive performance of models of varying complexity as a basis for selecting the final models used to create the prediction grids. Trained models were applied to cross-validation testing data and a separate hold-out dataset to evaluate model predictive performance, emphasizing three model fit metrics: Kappa, accuracy, and the area under the receiver operating characteristic (ROC) curve. The final trained models were used for mapping predictions at discrete depths to a depth of approximately 300 m. Trained DO and Mn models had accuracies of 86-100 percent, Kappa values of 0.69-0.99, and ROC values of 0.92-1.0.
Model accuracies for cross-validation testing datasets were 82-95 percent, and ROC values were 0.87-0.91, indicating good predictive performance. Kappa values for the cross-validation testing dataset were 0.30-0.69, indicating fair to substantial agreement between testing observations and model predictions. Hold-out data were available for the manganese model only and indicated accuracies of 89-97 percent, ROC values of 0.73-0.75, and Kappa values of 0.06-0.30. The predictive performance of both the DO and Mn models was reasonable, considering all three of these fit metrics and the low percentages of low-DO and high-Mn events in the data. See the associated journal article (Rosecrans and others, 2017) for a complete summary of BRT modeling methods, model fit metrics, and the relative influence of predictor variables for a given DO or Mn BRT model. The modeled response variables for the DO BRT models were based on measured DO values from wells at the following thresholds: <0.5 milligrams per liter (mg/L), <1.0 mg/L, and <2.0 mg/L; these threshold values were considered anoxic based on literature reviews. The modeled response variables for the Mn BRT models were based on measured Mn values from wells at the following exceedance thresholds: >50 micrograms per liter (µg/L), >150 µg/L, and >300 µg/L. (The 150 µg/L manganese threshold repres... Visit https://dataone.org/datasets/a905afa4-cdf2-4f19-ac0d-42423de2d684 for complete metadata about this dataset.
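A minimal sketch of a Bernoulli BRT of this kind using the gbm package in R (the well data frame, prediction grid, and column names are hypothetical; the response must be coded 0/1):

```r
# Boosted regression trees with a Bernoulli distribution, tuned by
# cross-validation, then used to map exceedance probabilities.
library(gbm)
brt <- gbm(low_do ~ ., data = training_wells, distribution = "bernoulli",
           n.trees = 2000, interaction.depth = 5, shrinkage = 0.01,
           cv.folds = 10)
best_iter <- gbm.perf(brt, method = "cv")  # number of trees chosen by CV
p_anoxic <- predict(brt, newdata = grid_cells, n.trees = best_iter,
                    type = "response")     # probability DO < threshold
```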
A geodatabase was developed to compile Curve Fit (Version 10.1; De Jager and Fox, 2013) regression tool adjusted R-squared outputs for wild celery (Vallisneria americana), wild rice (Zizania aquatica), and arrowhead (one raster for the sum of Sagittaria rigida and Sagittaria latifolia) for pools 4, 8, and 13 on the Upper Mississippi River system from 1998-2019, using mapped abundance raster datasets. Relative abundance, for submersed species and filamentous algae, represents the sum of rake scores across the six subsites divided by the maximum possible rake score (30) at each site, multiplied by 100 (0-100%). Percent cover, for emersed, rooted floating-leaved, and free-floating lifeforms, represents the maximum percent cover for each category (0, 20, 40, 60, 80, 100%). Each explanatory variable (year) was paired with the corresponding raster by pool. Curve Fit was used to estimate the linear relationship between year and pixel value (one relative abundance/percent cover value per year) and to create an output raster containing parameter estimates, model error, and R-squared. The adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model. Outputs were developed at two temporal scales: 1998-2019 and 2010-2019.
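For reference, the adjustment mentioned above has a simple closed form; a one-line R helper (written here for illustration, not taken from the Curve Fit tool):

```r
# Adjusted R-squared from R-squared, n observations, and p predictors:
# penalizes R-squared for each predictor added to the model.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)
adj_r2(0.80, n = 22, p = 1)  # e.g., 22 annual observations, one predictor (year)
```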