This data set contains example data for exploration of the theory of regression-based regionalization. The 90th percentile of annual maximum streamflow is provided as an example response variable for 293 streamgages in the conterminous United States. Several explanatory variables are drawn from the GAGES-II database in order to demonstrate how multiple linear regression is applied. Example scripts demonstrate how to collect the original streamflow data provided and how to recreate the figures from the associated Techniques and Methods chapter.
Site-specific multiple linear regression models were developed for eight sites in Ohio (six in the Western Lake Erie Basin and two in northeast Ohio on inland reservoirs) to quickly predict action-level exceedances for a cyanotoxin, microcystin, in recreational and drinking waters used by the public. Real-time models include easily or continuously measured factors that do not require that a sample be collected. Real-time models are presented in two categories: (1) six models with continuous monitor data, and (2) three models with on-site measurements. Real-time models commonly included variables such as phycocyanin, pH, specific conductance, and streamflow or gage height. Many of the real-time factors were averages over time periods antecedent to the time the microcystin sample was collected, including water-quality data compiled from continuous monitors. Comprehensive models use a combination of discrete sample-based measurements and real-time factors. Comprehensive models were useful at some sites with lagged variables (<2 weeks) for cyanobacterial toxin genes, dissolved nutrients, and (or) N to P ratios. Comprehensive models are presented in three categories: (1) three models with continuous monitor data and lagged comprehensive variables, (2) five models with no continuous monitor data and lagged comprehensive variables, and (3) one model with continuous monitor data and same-day comprehensive variables. Funding for this work was provided by the Ohio Water Development Authority and the U.S. Geological Survey Cooperative Water Program.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
The Ice Cream Selling dataset is a simple and well-suited dataset for beginners in machine learning who are looking to practice polynomial regression. It consists of two columns: temperature and the corresponding number of units of ice cream sold.
The dataset captures the relationship between temperature and ice cream sales. It serves as a practical example for understanding and implementing polynomial regression, a powerful technique for modeling nonlinear relationships in data.
The dataset is designed to be straightforward and easy to work with, making it ideal for beginners. The simplicity of the data allows beginners to focus on the fundamental concepts and steps involved in polynomial regression without overwhelming complexity.
By using this dataset, beginners can gain hands-on experience in preprocessing the data, splitting it into training and testing sets, selecting an appropriate degree for the polynomial regression model, training the model, and evaluating its performance. They can also explore techniques to address potential challenges such as overfitting.
With this dataset, beginners can practice making predictions of ice cream sales based on temperature inputs and visualize the polynomial regression curve that represents the relationship between temperature and ice cream sales.
Overall, the Ice Cream Selling dataset provides an accessible and practical learning resource for beginners to grasp the concepts and techniques of polynomial regression in the context of analyzing ice cream sales data.
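The workflow described above (train/test split, choosing a degree, training, evaluation) can be sketched in a few lines of Python. The example below uses simulated temperature and sales values as a stand-in for the actual CSV columns, and the quadratic relationship is invented for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
temperature = rng.uniform(-5, 35, 200).reshape(-1, 1)  # degrees Celsius
# invented nonlinear relation with noise, standing in for the real data
sales = 0.05 * temperature.ravel() ** 2 + temperature.ravel() + rng.normal(0, 3, 200)

X_tr, X_te, y_tr, y_te = train_test_split(temperature, sales, test_size=0.25, random_state=0)

# degree-2 polynomial regression: expand features, then fit ordinary least squares
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X_tr, y_tr)
print("test R^2:", round(r2_score(y_te, model.predict(X_te)), 3))
```

Raising the degree well beyond what the data support is a quick way to observe the overfitting issue mentioned above: training fit improves while test R^2 degrades.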
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
This dataset was created by Ananya Nayan
Released under Database: Open Database, Contents: © Original Authors
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary: Fuel demand is influenced by fuel prices, people's income, and motorization rates. Using this panel dataset, we explore the effect of electric vehicle motorization rates on gasoline demand.
Files: dataset.csv - Panel dimensions are the Brazilian state (i) and year (t). The other columns are: gasoline sales per capita (ln_Sg_pc), prices of gasoline (ln_Pg) and ethanol (ln_Pe) and their lags, motorization rates of combustion vehicles (ln_Mi_c) and electric vehicles (ln_Mi_e), and GDP per capita (ln_gdp_pc). All variables are natural-log transformed, since the log specification lets us calculate demand elasticities in a regression model.
adjacency.csv - The adjacency matrix used in interaction with electric vehicles' motorization rates to calculate spatial effects. At first, it follows a binary adjacency formula: for each pair of states i and j, the cell (i, j) is 0 if the states are not adjacent and 1 if they are. Then, each row is normalized to have sum equal to one.
regression.do - Series of Stata commands used to estimate the regression models of our study. dataset.csv must be imported to work, see comment section.
dataset_predictions.xlsx - Based on the estimations from Stata, we use this Excel file to make average predictions by year and by state. By including years beyond the last panel sample, we also forecast the model into the future and evaluate the effects of different policies that influence gasoline prices (taxation) and EV motorization rates (electrification). This file is primarily used to create images, but it can also help in understanding how the forecasting scenarios are set up.
Sources: Fuel prices and sales: ANP (https://www.gov.br/anp/en/access-information/what-is-anp/what-is-anp) State population, GDP and vehicle fleet: IBGE (https://www.ibge.gov.br/en/home-eng.html?lang=en-GB) State EV fleet: Anfavea (https://anfavea.com.br/en/site/anuarios/)
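The binary-then-row-normalized scheme described for adjacency.csv can be sketched in a few lines; the 3-state matrix below is a toy stand-in, not the real Brazilian adjacency structure.

```python
import numpy as np

# toy example: state 0 borders states 1 and 2; states 1 and 2 do not touch
# (1 where states i and j are adjacent, 0 otherwise; diagonal is 0)
W = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)

# row-normalize so each row sums to one (row-standardized spatial weights)
W_norm = W / W.sum(axis=1, keepdims=True)
print(W_norm)
```

Interacting such a row-standardized matrix with a variable yields, for each state, the average value of that variable over its neighbors, which is the usual way spatial effects enter a panel regression.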
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
New algorithms are continuously proposed in computational biology. Performance evaluation of novel methods is important in practice. Nonetheless, the field experiences a lack of rigorous methodology aimed to systematically and objectively evaluate competing approaches. Simulation studies are frequently used to show that a particular method outperforms another. Often, however, simulation studies are not well designed, and it is hard to characterize the particular conditions under which different methods perform better. In this paper we propose the adoption of well-established techniques in the design of computer and physical experiments for developing effective simulation studies. By following best practices in the planning of experiments we are better able to understand the strengths and weaknesses of competing algorithms, leading to more informed decisions about which method to use for a particular task. We illustrate the application of our proposed simulation framework with a detailed comparison of the ridge-regression, lasso and elastic-net algorithms in a large scale study investigating the effects on predictive performance of sample size, number of features, true model sparsity, signal-to-noise ratio, and feature correlation, in situations where the number of covariates is usually much larger than sample size. Analysis of data sets containing tens of thousands of features but only a few hundred samples is nowadays routine in computational biology, where "omics" features such as gene expression, copy number variation and sequence data are frequently used in the predictive modeling of complex phenotypes such as anticancer drug response. The penalized regression approaches investigated in this study are popular choices in this setting and our simulations corroborate well established results concerning the conditions under which each one of these methods is expected to perform best while providing several novel insights.
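A minimal sketch of this kind of p >> n comparison, illustrative only: the paper's full factorial design varies sample size, sparsity, signal-to-noise ratio, and feature correlation, whereas here a single sparse scenario is simulated.

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
n, p, k = 200, 1000, 10                 # far more features than samples, sparse truth
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:k] = 2.0                          # only k coefficients are truly nonzero
y = X @ beta + rng.normal(size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
results = {}
for name, est in [("ridge", RidgeCV()),
                  ("lasso", LassoCV(cv=5, random_state=1)),
                  ("elastic net", ElasticNetCV(cv=5, random_state=1))]:
    est.fit(X_tr, y_tr)
    results[name] = r2_score(y_te, est.predict(X_te))
    print(name, round(results[name], 3))
```

In this sparse, low-correlation setting the lasso and elastic net should outperform ridge, in line with the well-established results the abstract refers to; a dense or highly correlated design would shift the ranking.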
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The meta-learning method proposed in this paper addresses small-sample regression in engineering data analysis, a highly promising research direction. By integrating traditional regression models with optimization-based data augmentation from meta-learning, the proposed deep neural network demonstrates excellent performance in optimizing glass fiber reinforced plastic (GFRP) for wrapping concrete short columns. Compared with traditional regression models, such as Support Vector Regression (SVR), Gaussian Process Regression (GPR), and Radial Basis Function Neural Networks (RBFNN), the proposed meta-learning method performs better at modeling small data samples. The success of this approach illustrates the potential of deep learning in dealing with limited amounts of data, offering new opportunities in the field of material data analysis.
Students typically find linear regression analysis of data sets in a biology classroom challenging. These activities could be used in a Biology, Chemistry, Mathematics, or Statistics course. The collection provides student activity files with Excel instructions and instructor activity files with Excel instructions and solutions to problems.
Students will be able to perform linear regression analysis, find the correlation coefficient, create a scatter plot, and find the r-squared value using MS Excel 365. Students will be able to interpret data sets, describe the relationship between biological variables, and predict the value of an output variable based on the input of a predictor variable.
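The same quantities the Excel activities target (slope, intercept, correlation coefficient, r-squared, and a prediction) can be sketched in Python with scipy's linregress; the light/growth values below are invented for illustration.

```python
from scipy.stats import linregress

light = [2, 4, 6, 8, 10]               # predictor, e.g. light intensity (invented values)
growth = [3.1, 5.0, 6.8, 9.2, 10.9]    # response, e.g. plant growth (invented values)

fit = linregress(light, growth)
print("slope:", round(fit.slope, 3), "intercept:", round(fit.intercept, 3))
print("r:", round(fit.rvalue, 4), "r-squared:", round(fit.rvalue ** 2, 4))

# predict the response for a new predictor value using the fitted line
print("predicted growth at light = 7:", round(fit.intercept + fit.slope * 7, 2))
```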
Introduction: The capacity of higher education students to comprehend and act on health information is a pivotal factor in attaining favourable health outcomes and well-being. Assessing the health literacy of these students is essential in order to develop targeted interventions and provide informed health support. The aim of this study was to identify the level of health literacy and to analyse its relationship with determinants such as socio-demographic variables, chronic disease, perceived health status, and perceived availability of money for expenses among higher education students in the Alentejo region of southern Portugal.
Methodology: An observational, descriptive, and cross-sectional study was conducted between 22 June and 12 September 2023. Data were collected through an online structured questionnaire consisting of the Portuguese version of the European Health Literacy Survey Questionnaire—16 items (HLS-EU-PT-Q16), together with socio-demographic data, presence of chronic diseases, perceived health status, and availability of money for expenses. Data were analysed using independent samples t-test, one-way ANOVA, post-hoc Gabriel’s test, and multivariate logistic regression analyses at a significance level of 0.05. Regression models were used to investigate the relationship between health literacy and various determinants. The study protocol was approved by the Ethics Committee of the University of Évora, and all participants gave written informed consent.
Results: Analysis of the HLS-EU-PT-Q16 showed that 82.3% of the 1228 students sampled had limited health literacy. The mean health literacy score was 19.3 ± 12.8 on a scale of 0 to 50, with subscores of 19.4 ± 13.9 for health care, 19.1 ± 13.1 for disease prevention, and 19.0 ± 13.7 for health promotion. Significant associations were found between health literacy and several determinants. Higher health literacy was associated with the absence of chronic diseases. Regression analysis showed that lower health literacy was associated with not attending health-related courses, not living with a health professional, perceiving limited availability of money for expenses, and having an unsatisfactory health status.
Conclusion: This study improves the understanding of health literacy levels among higher education students in Alentejo, Portugal, and identifies key determinants. Higher education students in this region had relatively low levels of health literacy, which may have a negative impact on their health outcomes. These findings highlight the need for interventions to improve health literacy among higher education students and to address the specific needs of high-risk subgroups in the Alentejo.
This dataset was created by FayeJavad
The purpose of this report is to compare alternative methods of producing standard errors (SEs) for regression models fit to the MHSS clinical sample, with the goal of producing more accurate and potentially smaller SEs.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Source/Credit: Michael Grogan https://github.com/MGCodesandStats https://github.com/MGCodesandStats/datasets/blob/master/cars.csv
Sample dataset for regression analysis. Given 5 attributes (age, gender, miles driven per day, debt, and income) predict how much someone will spend on purchasing a car. All 5 of the input attributes have been scaled to be in 0 to 1 range. Training set has 723 training examples. Test set has 242 test examples.
This dataset will be used in an upcoming Galaxy Training Network tutorial (https://training.galaxyproject.org/training-material/topics/statistics/) on use of feedforward neural networks for regression analysis.
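A hedged sketch of the setup described: five inputs scaled to the 0-1 range, a 723/242 train/test split, and a small feedforward network. Synthetic stand-in data replaces cars.csv here, and the linear relation generating the target is invented.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# 965 rows = 723 train + 242 test; columns stand in for the scaled
# age, gender, miles-per-day, debt, and income attributes
X = rng.uniform(0, 1, size=(965, 5))
# invented spending relation with noise (not the real cars.csv target)
y = 2.0 * X[:, 4] + 0.5 * X[:, 2] - 0.3 * X[:, 3] + rng.normal(0, 0.05, 965)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=723, random_state=0)
net = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0)
net.fit(X_tr, y_tr)
print("test R^2:", round(net.score(X_te, y_te), 3))
```

Scaling all inputs to a common range, as this dataset already does, is what keeps a feedforward network like this easy to train.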
Sediment accumulation and transport negatively affect flood control, water supply, aquatic life, reclamation, and recreation (Angino and O’Brien, 1968) and are concerns of resource managers in the Kankakee River Basin of northern Indiana and throughout many regions of the United States. By relating continuously monitored water-quality data to discrete data collected from April 2016 through July 2020, linear regression was used to develop models for estimating concentrations of suspended sediment. Developed regression models indicated a strong correlation between continuous turbidity and suspended-sediment concentration (adjusted coefficient of determination equals 0.765, predicted residual error sum of squares equals 0.122). Daily loads of suspended sediment were computed from regression model concentrations and instantaneous streamflow. Monthly loads were then calculated to provide a clearer representation of seasonality. The estimated mean monthly suspended sediment load (April 2016 through July 2020) was 4726.5 tons per month; the estimated median monthly suspended sediment load was 4447.2 tons per month with a range in monthly loads from 741.2 to 9992.8 tons per month. The development of regression models for suspended sediment, total nitrogen, and total phosphorus relied on the collection of representative discrete water-quality samples and the operation of continuously deployed monitors throughout the range of hydrologic and seasonal conditions at the site. Regression models were developed following USGS protocols and methods (Helsel and others, 2020; Rasmussen and others, 2009). Each regression model relates laboratory-analyzed discrete water-quality sample data with continuously deployed water-quality monitor measurements.
Ordinary least squares regression analysis was done using the R statistical software programming language (R Core Team, 2021) to evaluate the relationship between the discrete concentrations of suspended sediment and the explanatory variables: continuously measured parameters (water temperature, specific conductance, pH, dissolved oxygen, turbidity, and streamflow) as well as seasonality and time over the study period. To improve potential models, explanatory and response variables were evaluated for transformations (log, square root, or square) that linearize the relation or change the distributional characteristics of the data, resulting in model residuals that are more symmetric, linear, and homoscedastic. Statistical models for all possible combinations of explanatory and response variables were evaluated using stepwise regression. To further evaluate potential models, diagnostic plots were created to assess how each model’s residuals varied as a function of (1) predicted values, (2) normal quantiles, (3) date, and (4) streamflow. Additional plots highlighted differences between predicted and observed values, residuals by season, and residuals by year. A variety of model statistics and diagnostics were used to determine the best predictors of each modeled constituent, including tests of significance, standard error, adjusted coefficient of determination (R2), and the predicted residual error sum of squares (PRESS) statistic. The PRESS statistic is a leave-one-out form of cross-validation that provides a measure of model fit for sample observations not used to develop the regression model. In general, the smaller the PRESS statistic, the better the model’s predictive ability (Helsel and Hirsch, 2002). The optimal models commonly used a mathematically transformed response variable. In those instances, a bias correction factor (BCF) was used to correct for the bias that occurs when back-transforming model results into their original units (Helsel and Hirsch, 2002).
Prediction intervals were computed for each model following methods from Helsel and Hirsch (2002), to define the range of values within which there is 90-percent certainty that the true value occurs.
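The transformed-response workflow above (log-transformed OLS, the leave-one-out PRESS statistic, and a bias correction factor for back-transformation) can be sketched numerically. The turbidity-to-suspended-sediment relation below is simulated, not the study's data, and Duan's smearing estimator is used as one common BCF choice consistent with Helsel and Hirsch (2002).

```python
import numpy as np

rng = np.random.default_rng(0)
turbidity = rng.uniform(5, 500, 60)                                   # simulated sensor values
log_ssc = 0.9 * np.log10(turbidity) + 0.3 + rng.normal(0, 0.1, 60)    # simulated log10 concentration

# ordinary least squares on the log10-transformed variables
X = np.column_stack([np.ones(60), np.log10(turbidity)])
beta, *_ = np.linalg.lstsq(X, log_ssc, rcond=None)
resid = log_ssc - X @ beta

# PRESS via the hat-matrix shortcut: leave-one-out residual is e_i / (1 - h_ii)
H = X @ np.linalg.inv(X.T @ X) @ X.T
press = np.sum((resid / (1 - np.diag(H))) ** 2)

# Duan smearing estimator as the BCF: mean of back-transformed residuals,
# applied when converting log10 predictions to concentration units
bcf = np.mean(10 ** resid)
ssc_pred = bcf * 10 ** (X @ beta)
print("PRESS:", round(press, 4), "BCF:", round(bcf, 4))
```

A smaller PRESS for a candidate model indicates better leave-one-out predictive ability, which is how the text describes the statistic being used for model selection.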
https://cubig.ai/store/terms-of-service
1) Data Introduction • The Student Performance (Multiple Linear Regression) Dataset is designed to analyze the relationship between students’ learning habits and academic performance. Each sample includes key indicators related to learning, such as study hours, sleep duration, previous test scores, and the number of practice exams completed.
2) Data Utilization (1) Characteristics of the Student Performance (Multiple Linear Regression) Dataset: • The target variable, Hours Studied, quantitatively represents the amount of time a student has invested in studying. The dataset is structured to allow modeling and inference of learning behaviors based on correlations with other variables.
(2) Applications of the Student Performance (Multiple Linear Regression) Dataset: • AI-Based Study Time Prediction Models: The dataset can be used to develop regression models that estimate a student’s expected study time based on inputs like academic performance, sleep habits, and engagement patterns. • Behavioral Analysis and Personalized Learning Strategies: It can be applied to identify students with insufficient study time and design personalized study interventions based on academic and lifestyle patterns.
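A minimal sketch of the multiple-linear-regression use case described above. The predictor names mirror the indicators listed (sleep duration, previous test scores, practice exams), but the values and coefficients are invented stand-ins for the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 150
sleep_hours = rng.uniform(4, 9, n)
previous_score = rng.uniform(40, 100, n)
practice_exams = rng.integers(0, 10, n).astype(float)
# invented linear relation with noise, mirroring the Hours Studied target
hours_studied = (0.05 * previous_score + 0.4 * practice_exams
                 - 0.2 * sleep_hours + rng.normal(0, 0.5, n))

X = np.column_stack([sleep_hours, previous_score, practice_exams])
model = LinearRegression().fit(X, hours_studied)
print("R^2:", round(model.score(X, hours_studied), 3))
print("coefficients:", np.round(model.coef_, 2))
```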
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Code used to generate the data:

import numpy as np
import pandas as pd

np.random.seed(42)
n_samples = 100

X1 = np.random.rand(n_samples, 1) * 10
X2 = X1 + np.random.randn(n_samples, 1) * 0.1  # almost the same as X1 -> high correlation
X3 = np.random.rand(n_samples, 1) * 10

X = np.hstack([X1, X2, X3])
y = 3*X1 + 2*X2 + 1.5*X3 + np.random.randn(n_samples, 1) * 2  # target with noise

df_X = pd.DataFrame(X, columns=['X1', 'X2', 'X3'])
df_y = pd.DataFrame(y, columns=['y'])
df = pd.concat([df_X, df_y], axis=1)
print(df.head())
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The rootogram is a graphical tool associated with the work of J. W. Tukey that was originally used for assessing goodness of fit of univariate distributions. Here, we extend the rootogram to regression models and show that this is particularly useful for diagnosing and treating issues such as overdispersion and/or excess zeros in count data models. We also introduce a weighted version of the rootogram that can be applied out of sample or to (weighted) subsets of the data, for example, in finite mixture models. An empirical illustration revisiting a well-known dataset from ethology is included, for which a negative binomial hurdle model is employed. Supplementary materials providing two further illustrations are available online: the first, using data from public health, employs a two-component finite mixture of negative binomial models; the second, using data from finance, involves underdispersion. An R implementation of our tools is available in the R package countreg. It also contains the data and replication code.
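The paper's implementation is the R package countreg; purely as an illustration of the idea, a hanging rootogram for a Poisson fit can be computed numerically in Python like this (plotting omitted).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.poisson(2.0, 500)              # observed counts
mu = y.mean()                          # maximum-likelihood Poisson mean

counts = np.arange(y.max() + 1)
observed = np.array([(y == c).sum() for c in counts])
expected = len(y) * stats.poisson.pmf(counts, mu)

# hanging rootogram: bars of height sqrt(observed) hang from the
# sqrt(expected) curve; systematic gaps to the zero line at count 0 or
# in the tails flag excess zeros or over/underdispersion
hang = np.sqrt(expected) - np.sqrt(observed)
for c, e, o in zip(counts, expected, observed):
    print(c, round(e, 1), o)
```

The square-root scale is what makes departures at rare counts visible, which is why the tool is effective for diagnosing the count-data issues mentioned above.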
Heterotrophic biofilm growth is typical in streams receiving airport deicer runoff; however, detailed studies of biofilms in this setting are rare. Sample collection for this study was done during and surrounding two deicer seasons (i.e., 2009-2010 and 2010-2011), with additional sample collection occurring in 2014. Field surveys were used to document biofilm prevalence and characteristics, as well as stream characteristics. Collected biofilm samples were analyzed via microscopy, quantitative real-time polymerase chain reaction (qPCR), microarray, gas chromatography (of a cultured isolate), as well as via sequencing (Sanger and massively parallel). Sequence data are provided elsewhere (as described in the larger citation). Water-quality and quantity data were also collected in an attempt to assess relevant environmental conditions. Water quality data included grab samples collected at the time of biofilm field surveys as well as flow-weighted composite samples that were collected throughout the study period at nearby stream gages. Continuous streamflow and temperature were also collected at these gaged sites. Additional sensors were deployed at non-gaged sites to measure water temperature at these sites. Continuous temperature data were used to calculate antecedent characteristics for various time windows. Dye tracer results allowed for the determination of flow-based times of travel between sites. Taken together with flow composite sample data (at the downstream gage closest to the airport), this allowed for the calculation of estimated water quality characteristics at downstream sites as well as the subsequent calculation of antecedent water quality characteristics at all downstream sites. Regression models were run to investigate the influence of environmental factors on biofilm volume and dissolved oxygen concentration. Models were developed by ordinary least-squares regression using the R project for statistical computing with core functionality.
Predictor variables for these models are included in the data file and input files provided. These include biofilm volumes, dissolved oxygen, COD concentration, water temperature, and monitoring site designation.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Many exciting results have been obtained on model selection for high-dimensional data, in both efficient algorithms and theoretical developments. Powerful penalized regression methods can give sparse representations of the data even when the number of predictors is much larger than the sample size. One important question then is: how do we know when a sparse pattern identified by such a method is reliable? In this work, besides investigating the instability of model selection methods in terms of variable selection, we propose variable selection deviation measures that give a proper sense of how many predictors in the selected set are likely trustworthy in certain aspects. A simulation and a real data example demonstrate the utility of these measures for application.
This data release supports the following publication: Mast, M. A., 2018, Estimating metal concentrations with regression analysis and water-quality surrogates at nine sites on the Animas and San Juan Rivers, Colorado, New Mexico, and Utah: U.S. Geological Survey Scientific Investigations Report 2018-5116. The U.S. Geological Survey (USGS), in cooperation with the U.S. Environmental Protection Agency (EPA), developed site-specific regression models to estimate concentrations of selected metals at nine USGS streamflow-gaging stations along the Animas and San Juan Rivers. Multiple linear-regression models were developed by relating metal concentrations in discrete water-quality samples to continuously monitored streamflow and surrogate parameters including specific conductance, pH, turbidity, and water temperature. Models were developed for dissolved and total concentrations of aluminum, arsenic, cadmium, iron, lead, manganese, and zinc using water-quality samples collected during 2005–17 by several agencies, using different collection methods and analytical laboratories. Calibration datasets in comma-separated (CSV) format include the variables of sampling date and time, metal concentrations (in micrograms per liter), stream discharge (in cubic feet per second), specific conductance (in microsiemens per centimeter at 25 degrees Celsius), pH, water temperature (in degrees Celsius), turbidity (in nephelometric turbidity units), and calculated seasonal terms based on Julian day. Surrogate parameters and discrete water-quality samples were used from nine sites: Cement Creek at Silverton, Colo. (USGS station 09358550); Animas River below Silverton, Colo. (USGS station 09359020); Animas River at Durango, Colo. (USGS station 09361500); Animas River near Cedar Hill, N. Mex. (USGS station 09363500); Animas River below Aztec, N. Mex. (USGS station 09364010); San Juan River at Farmington, N. Mex. (USGS station 09365000); San Juan River at Shiprock, N. Mex. (USGS station 09368000); San Juan River at Four Corners, Colo. (USGS station 09371010); and San Juan River near Bluff, Utah (USGS station 09379500). Model archive summaries in PDF format include model statistics, data, and plots, and were generated using an R script developed by the USGS Kansas Water Science Center, available at https://patrickeslick.github.io/ModelArchiveSummary/. A description of each USGS streamflow-gaging station, along with information about the calibration datasets, is also provided.
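Seasonal terms calculated from Julian day typically take the form of an annual sine/cosine pair; the report defines the exact terms used, so the sketch below shows only the conventional form as an assumption.

```python
import numpy as np

def seasonal_terms(julian_day):
    """Annual-cycle sine and cosine regressors for a Julian day (1-366).

    The pair lets a linear regression fit a smooth seasonal cycle with
    arbitrary phase, since a*sin(x) + b*cos(x) = R*sin(x + phi).
    """
    angle = 2.0 * np.pi * np.asarray(julian_day, dtype=float) / 365.25
    return np.sin(angle), np.cos(angle)

# roughly the start of each quarter of the year
s, c = seasonal_terms([1, 91, 182, 274])
print(np.round(s, 3), np.round(c, 3))
```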
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The datasets used in this research work refer to the aims of Sustainable Development Goal 7. These datasets were used to train and test machine learning models, based on an artificial neural network and other regression models, for predicting scores on the realization of SDG 7 aims. The training dataset was created from data for 2013 to 2021 and includes 261 samples; the test dataset includes 29 samples. Source data from 2013 to 2022 are available in 10 XLSX and CSV files. The training and test datasets are available in XLSX and CSV files, and a detailed description of the data is available in a PDF file.