Facebook
TwitterThis dataset has been created to demonstrate the use of a simple linear regression model. It includes two variables: an independent variable and a dependent variable. The data can be used for training, testing, and validating a simple linear regression model, making it ideal for educational purposes, tutorials, and basic predictive analysis projects. The dataset consists of 100 observations with no missing values, and it follows a linear relationship
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Here in This Dataset we have only 2 columns the first one is Age and the second one is Premium You can use this dataset in machine learning for Simple linear Regression and for Prediction Practices.
Facebook
TwitterSite-specific multiple linear regression models were developed for eight sites in Ohio—six in the Western Lake Erie Basin and two in northeast Ohio on inland reservoirs--to quickly predict action-level exceedances for a cyanotoxin, microcystin, in recreational and drinking waters used by the public. Real-time models include easily- or continuously-measured factors that do not require that a sample be collected. Real-time models are presented in two categories: (1) six models with continuous monitor data, and (2) three models with on-site measurements. Real-time models commonly included variables such as phycocyanin, pH, specific conductance, and streamflow or gage height. Many of the real-time factors were averages over time periods antecedent to the time the microcystin sample was collected, including water-quality data compiled from continuous monitors. Comprehensive models use a combination of discrete sample-based measurements and real-time factors. Comprehensive models were useful at some sites with lagged variables (< 2 weeks) for cyanobacterial toxin genes, dissolved nutrients, and (or) N to P ratios. Comprehensive models are presented in three categories: (1) three models with continuous monitor data and lagged comprehensive variables, (2) five models with no continuous monitor data and lagged comprehensive variables, and (3) one model with continuous monitor data and same-day comprehensive variables. Funding for this work was provided by the Ohio Water Development Authority and the U.S. Geological Survey Cooperative Water Program.
Facebook
TwitterThis data set contains example data for exploration of the theory of regression based regionalization. The 90th percentile of annual maximum streamflow is provided as an example response variable for 293 streamgages in the conterminous United States. Several explanatory variables are drawn from the GAGES-II data base in order to demonstrate how multiple linear regression is applied. Example scripts demonstrate how to collect the original streamflow data provided and how to recreate the figures from the associated Techniques and Methods chapter.
Facebook
TwitterSite-specific multiple linear regression models were developed for one beach in Ohio (three discrete sampling sites) and one beach in Pennsylvania to estimate concentrations of Escherichia coli (E. coli) or the probability of exceeding the bathing-water standard for E. coli in recreational waters used by the public. Traditional culture-based methods are commonly used to estimate concentrations of fecal indicator bacteria, such as E. coli; however, results are obtained 18 to 24 hours post sampling and do not accurately reflect current water-quality conditions. Beach-specific mathematical models use environmental and water-quality variables that are easily and quickly measured as surrogates to estimate concentrations of fecal-indicator bacteria or to provide the probability that a State recreational water-quality standard will be exceeded. When predictive models are used for beach closure or advisory decisions, they are referred to as “nowcasts”. Software designed for model development by the U.S. Environmental Protection Agency (Virtual Beach) was used. The selected model for each beach was based on a combination of explanatory variables including, most commonly, turbidity, water temperature, change in lake level over 24 hours, and antecedent rainfall. Model results are used by managers to report water-quality conditions to the public through the Great Lakes NowCast in 2019 (https://pa.water.usgs.gov/apps/nowcast/). Model performance in 2019 (sensitivity, specificity, and accuracy) was compared to using the previous day's E. coli concentration (persistence method).
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Salary Dataset in CSV for Simple linear regression. It has also been used in Machine Learning A to Z course of my series.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Linear regression is one of the most used statistical techniques in neuroscience, including the study of the neuropathology of Alzheimer’s disease (AD) dementia. However, the practical utility of this approach is often limited because dependent variables are often highly skewed and fail to meet the assumption of normality. Applying linear regression analyses to highly skewed datasets can generate imprecise results, which lead to erroneous estimates derived from statistical models. Furthermore, the presence of outliers can introduce unwanted bias, which affect estimates derived from linear regression models. Although a variety of data transformations can be utilized to mitigate these problems, these approaches are also associated with various caveats. By contrast, a robust regression approach does not impose distributional assumptions on data allowing for results to be interpreted in a similar manner to that derived using a linear regression analysis. Here, we demonstrate the utility of applying robust regression to the analysis of data derived from studies of human brain neurodegeneration where the error distribution of a dependent variable does not meet the assumption of normality. We show that the application of a robust regression approach to two independent published human clinical neuropathologic data sets provides reliable estimates of associations. We also demonstrate that results from a linear regression analysis can be biased if the dependent variable is significantly skewed, further indicating robust regression as a suitable alternate approach.
Facebook
Twitterhttps://cubig.ai/store/terms-of-servicehttps://cubig.ai/store/terms-of-service
1) Data Introduction • The Student Performance (Multiple Linear Regression) Dataset is designed to analyze the relationship between students’ learning habits and academic performance. Each sample includes key indicators related to learning, such as study hours, sleep duration, previous test scores, and the number of practice exams completed.
2) Data Utilization (1) Characteristics of the Student Performance (Multiple Linear Regression) Dataset: • The target variable, Hours Studied, quantitatively represents the amount of time a student has invested in studying. The dataset is structured to allow modeling and inference of learning behaviors based on correlations with other variables.
(2) Applications of the Student Performance (Multiple Linear Regression) Dataset: • AI-Based Study Time Prediction Models: The dataset can be used to develop regression models that estimate a student’s expected study time based on inputs like academic performance, sleep habits, and engagement patterns. • Behavioral Analysis and Personalized Learning Strategies: It can be applied to identify students with insufficient study time and design personalized study interventions based on academic and lifestyle patterns.
Facebook
Twitterhttps://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
The diamond is 58 times harder than any other mineral in the world, and its elegance as a jewel has long been appreciated. Forecasting diamond prices is challenging due to nonlinearity in important features such as carat, cut, clarity, table, and depth. Against this backdrop, the study conducted a comparative analysis of the performance of multiple supervised machine learning models (regressors and classifiers) in predicting diamond prices. Eight supervised machine learning algorithms were evaluated in this work including Multiple Linear Regression, Linear Discriminant Analysis, eXtreme Gradient Boosting, Random Forest, k-Nearest Neighbors, Support Vector Machines, Boosted Regression and Classification Trees, and Multi-Layer Perceptron. The analysis is based on data preprocessing, exploratory data analysis (EDA), training the aforementioned models, assessing their accuracy, and interpreting their results. Based on the performance metrics values and analysis, it was discovered that eXtreme Gradient Boosting was the most optimal algorithm in both classification and regression, with a R2 score of 97.45% and an Accuracy value of 74.28%. As a result, eXtreme Gradient Boosting was recommended as the optimal regressor and classifier for forecasting the price of a diamond specimen. Methods Kaggle, a data repository with thousands of datasets, was used in the investigation. It is an online community for machine learning practitioners and data scientists, as well as a robust, well-researched, and sufficient resource for analyzing various data sources. On Kaggle, users can search for and publish various datasets. In a web-based data-science environment, they can study datasets and construct models.
Facebook
TwitterApache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The file contain dataset with two variables (x & y). The dataset is for Linear regression ML Models. The dataset can be used for Testing purpose. The x variable is the independent variable, and y is the dependent variable. The dataset has a correlation of 0.9981 showing the dataset is best suited for linear models and can be used for the testing purpose.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary : Fuel demand is shown to be influenced by fuel prices, people's income and motorization rates. We explore the effects of electric vehicle's rates in gasoline demand using this panel dataset.
Files : dataset.csv - Panel dimensions are the Brazilian state ( i ) and year ( t ). The other columns are: gasoline sales per capita (ln_Sg_pc), prices of gasoline (ln_Pg) and ethanol (ln_Pe) and their lags, motorization rates of combustion vehicles (ln_Mi_c) and electric vehicles (ln_Mi_e) and GDP per capita (ln_gdp_pc). All variables are all under the natural log function, since we use this to calculate demand elasticities in a regression model.
adjacency.csv - The adjacency matrix used in interaction with electric vehicles' motorization rates to calculate spatial effects. At first, it follows a binary adjacency formula: for each pair of states i and j, the cell (i, j) is 0 if the states are not adjacent and 1 if they are. Then, each row is normalized to have sum equal to one.
regression.do - Series of Stata commands used to estimate the regression models of our study. dataset.csv must be imported to work, see comment section.
dataset_predictions.xlsx - Based on the estimations from Stata, we use this excel file to make average predictions by year and by state. Also, by including years beyond the last panel sample, we also forecast the model into the future and evaluate the effects of different policies that influence gasoline prices (taxation) and EV motorization rates (electrification). This file is primarily used to create images, but can be used to further understand how the forecasting scenarios are set up.
Sources: Fuel prices and sales: ANP (https://www.gov.br/anp/en/access-information/what-is-anp/what-is-anp) State population, GDP and vehicle fleet: IBGE (https://www.ibge.gov.br/en/home-eng.html?lang=en-GB) State EV fleet: Anfavea (https://anfavea.com.br/en/site/anuarios/)
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In this article, we explore the use of two published datasets for teaching a wide range of students about regression models, with a particular focus on interaction terms. The two datasets come from recent psychology studies on beliefs about poverty and welfare, and about the dynamics of groups projects. Both datasets (and their original research papers) are accessible to students, and because of their context, students can learn about data collection, measurement, and the use of statistics when studying complex social topics, while using the data to learn about regression analysis. We have used these data for a range of in-class activities, journal paper discussions, exams, and extended projects, at the undergraduate, master’s, and doctoral levels. Supplementary materials for this article are available online.
Facebook
TwitterAn increased blood pressure is a known comorbidity of both type 1 diabetes (T1DM) and obesity in children. Increasing evidence suggests a subtle interplay between epidermal growth factor (EGF) and renin along the juxtaglomerular system, regulating the impact of blood pressure on kidney health and the cardiovascular system. In this study, we investigated the relation between urinary EGF, serum renin and blood pressure in children with obesity or T1DM. 147 non-obese children with T1DM and 126 children with obesity, were included. Blood pressure was measured and mean arterial pressure (MAP) and the pulse pressure (PP) were calculated. Serum renin and urinary EGF levels were determined with a commercial ELISA kit. Partial Spearman rank correlation coefficients and multiple linear regression models were used to study the association between renin, the urinary EGF/urinary creatinine ratio and blood pressure parameters. The urinary EGF/urinary creatinine ratio is correlated with the SBP and the MAP in boys with obesity as well as in boys with T1DM. Multiple regression analysis showed that sex and pulse pressure in male subjects were found to be independently associated with renin. Sex, the presence of diabetes, age, the glomerular filtration rate and both pulse pressure and mean arterial pressure in male subjects were independently associated with urinary EGF/urinary creatinine. In conclusion, in boys with either obesity or diabetes, pulse pressure and mean arterial pressure are negatively associated with the functional integrity of the nephron, which is reflected by a decreased expression of urinary EGF.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The present work demonstrates how neural networks are used to do non-linear regressions. The technique is presented in a simple and didactic manner and applied to fit potential energy surfaces for the FeC molecule and for the reaction H + H2. It shows how to do the fitting for single- and multi-variable system providing examples and code that can be easily extended to many problems in chemistry. All the code used to perform the fitting and generate the results is available as a Jupyter Notebook, which can be used without neither installation nor configuration
Facebook
TwitterStudents typically find linear regression analysis of data sets in a biology classroom challenging. These activities could be used in a Biology, Chemistry, Mathematics, or Statistics course. The collection provides student activity files with Excel instructions and Instructor Activity files with Excel instructions and solutions to problems.
Students will be able to perform linear regression analysis, find correlation coefficient, create a scatter plot and find the r-square using MS Excel 365. Students will be able to interpret data sets, describe the relationship between biological variables, and predict the value of an output variable based on the input of an predictor variable.
Facebook
TwitterLinear regression analysis for commonly used function domains.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Generated datasets of multiple ideals distributions used in the research in linear regressions and machine learning algorithm for the thesis s 'Predicting the performance of Buchberger‘s algorithm' . Concatenated and concatenated_stats are the datasets with the ideals exponents and correspondent polynomial additions, these datasets were created specifically for RNN, features_dataset contains statistics regarding the ideals and polynomial_additions_dataset contains info regarding their polynomial additions created for multiple linear regression models and simple neural networks.
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains a collection of 100 randomly generated data points representing the relationship between the number of hours a student spends studying and their corresponding performance, measured as a score. The data has been generated to simulate a real-world scenario where study hours are assumed to influence academic outcomes, making it an excellent resource for linear regression analysis and other machine learning tasks.
Each row in the dataset consists of:
Hours: The number of hours a student dedicates to studying, ranging between 0 and 10 hours. Scores: The student's performance score, represented as a percentage, ranging from 0 to 100. Use Cases: This dataset is particularly useful for:
Linear Regression: Exploring how study hours influence student performance, fitting a regression line to predict scores based on study time. Data Science & Machine Learning: Practicing regression analysis, training models, and applying other predictive algorithms. Educational Research: Simulating data-driven insights into student behavior and performance metrics. Features: 100 rows of data. Continuous numerical variables suitable for regression tasks. Generated for educational purposes, making it ideal for students, teachers, and beginners in machine learning and data science. Potential Applications: Build a linear regression model to predict student scores. Investigate the correlation between study time and performance. Apply data visualization techniques to better understand the data. Use the dataset to experiment with model evaluation metrics like Mean Squared Error (MSE) and R-squared.
Facebook
TwitterThe dataset used in the paper is a large linear system of equations with a sparse solution. The authors used this dataset to test their Momentumized Iterative Shrinkage Thresholding (MIST) algorithm.
Facebook
TwitterThis data release contains input data and programs (scripts) used to estimate monthly water demand for retail customers of Providence Water, located in Providence, Rhode Island. Explanatory data and model outputs are from July 2014 through June 2021. Models of per capita (for single-family residential customers) or per connection (for multi-family residential, commercial, and industrial customers) water use were developed using multiple linear regression. The dependent variables, provided by Providence Water, are the monthly number of connections and gallons of water delivered to single- and multi-family residential, commercial, and industrial connections. Potential independent variables (from online sources) are climate variables (temperature and precipitation), economic statistics, and a drought statistic. Not all independent variables were used in all of the models. The data are provided in data tables and model files. The data table RIWaterUseVariableExplanation.csv describes the explanatory variables and their data sources. The data table ProvModelInputData.csv provides the monthly water-use data that are the independent variables and the monthly climatic and economic data that are the dependent variables. The data table DroughtInputData.csv provides the weekly U.S. drought monitor index values that were processed to formulate a potential independent variable. The R script model_water_use.R runs the models that predict water use. The other two R scripts (load_preprocess_input_data.R and model_water_use_functions.R) are not run explicitly but are called from the primary script model_water_use.R. Regression equations produced by the models can be used to predict water demand throughout Rhode Island.
Facebook
TwitterThis dataset has been created to demonstrate the use of a simple linear regression model. It includes two variables: an independent variable and a dependent variable. The data can be used for training, testing, and validating a simple linear regression model, making it ideal for educational purposes, tutorials, and basic predictive analysis projects. The dataset consists of 100 observations with no missing values, and it follows a linear relationship