100+ datasets found

Data from: An example data set for exploration of Multiple Linear Regression...
catalog.data.gov
data.usgs.gov
Updated Nov 20, 2025
Cite
U.S. Geological Survey (2025). An example data set for exploration of Multiple Linear Regression [Dataset]. https://catalog.data.gov/dataset/an-example-data-set-for-exploration-of-multiple-linear-regression
Explore at:
Dataset updated
Nov 20, 2025
Dataset provided by
United States Geological Survey (http://www.usgs.gov/)
Description
This data set contains example data for exploring the theory of regression-based regionalization. The 90th percentile of annual maximum streamflow is provided as an example response variable for 293 streamgages in the conterminous United States. Several explanatory variables are drawn from the GAGES-II database to demonstrate how multiple linear regression is applied. Example scripts demonstrate how to collect the original streamflow data provided and how to recreate the figures from the associated Techniques and Methods chapter.
Data from: Data for multiple linear regression models for predicting...
catalog.data.gov
data.usgs.gov
+2 more
Updated Nov 19, 2025
+ more versions
Cite
U.S. Geological Survey (2025). Data for multiple linear regression models for predicting microcystin concentration action-level exceedances in selected lakes in Ohio [Dataset]. https://catalog.data.gov/dataset/data-for-multiple-linear-regression-models-for-predicting-microcystin-concentration-action
Explore at:
Dataset updated
Nov 19, 2025
Dataset provided by
United States Geological Survey (http://www.usgs.gov/)
Area covered
Ohio
Description
Site-specific multiple linear regression models were developed for eight sites in Ohio (six in the Western Lake Erie Basin and two in northeast Ohio on inland reservoirs) to quickly predict action-level exceedances for a cyanotoxin, microcystin, in recreational and drinking waters used by the public. Real-time models include easily or continuously measured factors that do not require that a sample be collected. Real-time models are presented in two categories: (1) six models with continuous monitor data, and (2) three models with on-site measurements. Real-time models commonly included variables such as phycocyanin, pH, specific conductance, and streamflow or gage height. Many of the real-time factors were averages over time periods antecedent to the time the microcystin sample was collected, including water-quality data compiled from continuous monitors. Comprehensive models use a combination of discrete sample-based measurements and real-time factors. Comprehensive models were useful at some sites with lagged variables (< 2 weeks) for cyanobacterial toxin genes, dissolved nutrients, and (or) N to P ratios. Comprehensive models are presented in three categories: (1) three models with continuous monitor data and lagged comprehensive variables, (2) five models with no continuous monitor data and lagged comprehensive variables, and (3) one model with continuous monitor data and same-day comprehensive variables. Funding for this work was provided by the Ohio Water Development Authority and the U.S. Geological Survey Cooperative Water Program.
polynomial regression
kaggle.com
Updated Jul 5, 2023
+ more versions
Cite
Miraj Deep Bhandari (2023). polynomial regression [Dataset]. http://doi.org/10.34740/kaggle/ds/3482232
Explore at:
Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Unique identifier
https://doi.org/10.34740/kaggle/ds/3482232
Dataset updated
Jul 5, 2023
Dataset provided by
Kaggle
Authors
Miraj Deep Bhandari
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
The Ice Cream Selling dataset is a simple and well-suited dataset for beginners in machine learning who are looking to practice polynomial regression. It consists of two columns: temperature and the corresponding number of units of ice cream sold.

The dataset captures the relationship between temperature and ice cream sales. It serves as a practical example for understanding and implementing polynomial regression, a powerful technique for modeling nonlinear relationships in data.

The dataset is designed to be straightforward and easy to work with, making it ideal for beginners. The simplicity of the data allows beginners to focus on the fundamental concepts and steps involved in polynomial regression without overwhelming complexity.

By using this dataset, beginners can gain hands-on experience in preprocessing the data, splitting it into training and testing sets, selecting an appropriate degree for the polynomial regression model, training the model, and evaluating its performance. They can also explore techniques to address potential challenges such as overfitting.

With this dataset, beginners can practice making predictions of ice cream sales based on temperature inputs and visualize the polynomial regression curve that represents the relationship between temperature and ice cream sales.

Overall, the Ice Cream Selling dataset provides an accessible and practical learning resource for beginners to grasp the concepts and techniques of polynomial regression in the context of analyzing ice cream sales data.
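
The workflow described above (split the data, try several polynomial degrees, compare held-out error) can be sketched in a few lines. This is a minimal illustration using only the Python standard library, under stated assumptions: the data are synthetic stand-ins generated from a noisy quadratic, not the Kaggle file, and the helper names (`solve_linear_system`, `fit_polynomial`, `mse`) are hypothetical.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations update both sides together.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial coefficients, lowest power first."""
    size = degree + 1
    # Normal equations (V^T V) beta = V^T y, with V[i][k] = xs[i] ** k.
    gram = [[sum(x ** (j + k) for x in xs) for k in range(size)]
            for j in range(size)]
    rhs = [sum((x ** j) * y for x, y in zip(xs, ys)) for j in range(size)]
    return solve_linear_system(gram, rhs)


def predict(coeffs, x):
    return sum(c * x ** k for k, c in enumerate(coeffs))


def mse(coeffs, xs, ys):
    return sum((predict(coeffs, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic stand-in data (NOT the Kaggle file): a noisy quadratic.
rng = random.Random(0)
temps = [rng.uniform(-5.0, 5.0) for _ in range(60)]
sales = [2.0 * t * t - 1.0 * t + 3.0 + rng.gauss(0.0, 1.0) for t in temps]

# Hold out the last 15 points as a test set.
train_x, test_x = temps[:45], temps[45:]
train_y, test_y = sales[:45], sales[45:]

# Fit each candidate degree on the training split, score it on the test split.
test_errors = {}
for degree in (1, 2, 3):
    coeffs = fit_polynomial(train_x, train_y, degree)
    test_errors[degree] = mse(coeffs, test_x, test_y)

best_degree = min(test_errors, key=test_errors.get)
print(test_errors, best_degree)
```

On real data a library routine such as `numpy.polyfit` would normally replace the hand-written solver; the point here is only the compare-on-held-out-data loop, which is one simple way to guard against overfitting.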
Logistic Regression
kaggle.com
zip
Updated Dec 24, 2017
Cite
Ananya Nayan (2017). Logistic Regression [Dataset]. https://www.kaggle.com/datasets/dragonheir/logistic-regression
Explore at:
Available download formats: zip (3349 bytes)
Dataset updated
Dec 24, 2017
Authors
Ananya Nayan
License
Open Database License (ODbL) v1.0 (https://www.opendatacommons.org/licenses/odbl/1.0/)
License information was derived automatically
Description
Dataset

This dataset was created by Ananya Nayan

Released under Database: Open Database, Contents: © Original Authors

Panel dataset on Brazilian fuel demand
data.mendeley.com
Updated Oct 7, 2024
Cite
Sergio Prolo (2024). Panel dataset on Brazilian fuel demand [Dataset]. http://doi.org/10.17632/hzpwbp7j22.1
Explore at:
Unique identifier
https://doi.org/10.17632/hzpwbp7j22.1
Dataset updated
Oct 7, 2024
Authors
Sergio Prolo
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Area covered
Brazil
Description
Summary: Fuel demand is shown to be influenced by fuel prices, people's income and motorization rates. We explore the effects of electric vehicles' motorization rates on gasoline demand using this panel dataset.

Files: dataset.csv - Panel dimensions are the Brazilian state (i) and year (t). The other columns are: gasoline sales per capita (ln_Sg_pc), prices of gasoline (ln_Pg) and ethanol (ln_Pe) and their lags, motorization rates of combustion vehicles (ln_Mi_c) and electric vehicles (ln_Mi_e), and GDP per capita (ln_gdp_pc). All variables are in natural logarithms, because the regression model uses this log-log form to calculate demand elasticities.

adjacency.csv - The adjacency matrix used in interaction with electric vehicles' motorization rates to calculate spatial effects. At first, it follows a binary adjacency formula: for each pair of states i and j, the cell (i, j) is 0 if the states are not adjacent and 1 if they are. Then, each row is normalized to have sum equal to one.
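
The two-step construction described above (binary adjacency, then row normalization) can be sketched as follows. This is an illustrative sketch, not the authors' code: the 4-region matrix and the `ev_rate` values are made up, and the handling of a region with no neighbours is an assumption the source does not address.

```python
def row_normalize(adjacency):
    """Divide each row by its sum so every non-empty row sums to 1."""
    normalized = []
    for row in adjacency:
        total = sum(row)
        # Assumption: a region with no neighbours keeps an all-zero row.
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized


# Hypothetical 4-region binary adjacency (1 = regions share a border).
binary = [
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
]
weights = row_normalize(binary)

# Interacting the weights with a variable gives, for each region, the
# average of that variable over its neighbours (made-up values here).
ev_rate = [0.2, 0.4, 0.1, 0.3]
spatial_lag = [sum(w * v for w, v in zip(row, ev_rate)) for row in weights]
print(weights[0], spatial_lag)
```

Row normalization makes each region's spatial term a neighbour average rather than a neighbour sum, so regions with many neighbours are not weighted more heavily.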

regression.do - Series of Stata commands used to estimate the regression models of our study. dataset.csv must be imported for the commands to work; see the comment section.

dataset_predictions.xlsx - Based on the estimations from Stata, we use this Excel file to make average predictions by year and by state. By including years beyond the last panel sample, we also forecast the model into the future and evaluate the effects of different policies that influence gasoline prices (taxation) and EV motorization rates (electrification). This file is primarily used to create images, but can also be used to understand how the forecasting scenarios are set up.

Sources:
Fuel prices and sales: ANP (https://www.gov.br/anp/en/access-information/what-is-anp/what-is-anp)
State population, GDP and vehicle fleet: IBGE (https://www.ibge.gov.br/en/home-eng.html?lang=en-GB)
State EV fleet: Anfavea (https://anfavea.com.br/en/site/anuarios/)
Simulation Studies as Designed Experiments: The Comparison of Penalized...
figshare.com
ai
Updated May 31, 2023
Cite
Elias Chaibub Neto; J. Christopher Bare; Adam A. Margolin (2023). Simulation Studies as Designed Experiments: The Comparison of Penalized Regression Models in the “Large p, Small n” Setting [Dataset]. http://doi.org/10.1371/journal.pone.0107957
Explore at:
Available download formats: ai
Unique identifier
https://doi.org/10.1371/journal.pone.0107957
Dataset updated
May 31, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Elias Chaibub Neto; J. Christopher Bare; Adam A. Margolin
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
New algorithms are continuously proposed in computational biology. Performance evaluation of novel methods is important in practice. Nonetheless, the field experiences a lack of rigorous methodology aimed to systematically and objectively evaluate competing approaches. Simulation studies are frequently used to show that a particular method outperforms another. Often times, however, simulation studies are not well designed, and it is hard to characterize the particular conditions under which different methods perform better. In this paper we propose the adoption of well established techniques in the design of computer and physical experiments for developing effective simulation studies. By following best practices in planning of experiments we are better able to understand the strengths and weaknesses of competing algorithms leading to more informed decisions about which method to use for a particular task. We illustrate the application of our proposed simulation framework with a detailed comparison of the ridge-regression, lasso and elastic-net algorithms in a large scale study investigating the effects on predictive performance of sample size, number of features, true model sparsity, signal-to-noise ratio, and feature correlation, in situations where the number of covariates is usually much larger than sample size. Analysis of data sets containing tens of thousands of features but only a few hundred samples is nowadays routine in computational biology, where “omics” features such as gene expression, copy number variation and sequence data are frequently used in the predictive modeling of complex phenotypes such as anticancer drug response. The penalized regression approaches investigated in this study are popular choices in this setting and our simulations corroborate well established results concerning the conditions under which each one of these methods is expected to perform best while providing several novel insights.
Data from: S1 Dataset -
plos.figshare.com
xlsx
Updated Jul 10, 2024
Cite
Tianyi Deng; Chengqi Xue; Gengpei Zhang (2024). S1 Dataset - [Dataset]. http://doi.org/10.1371/journal.pone.0305038.s001
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.1371/journal.pone.0305038.s001
Dataset updated
Jul 10, 2024
Dataset provided by
PLOS (http://plos.org/)
Authors
Tianyi Deng; Chengqi Xue; Gengpei Zhang
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
The meta-learning method proposed in this paper addresses the issue of small-sample regression in the application of engineering data analysis, which is a highly promising direction for research. By integrating traditional regression models with optimization-based data augmentation from meta-learning, the proposed deep neural network demonstrates excellent performance in optimizing glass fiber reinforced plastic (GFRP) for wrapping concrete short columns. When compared with traditional regression models, such as Support Vector Regression (SVR), Gaussian Process Regression (GPR), and Radial Basis Function Neural Networks (RBFNN), the meta-learning method proposed here performs better in modeling small data samples. The success of this approach illustrates the potential of deep learning in dealing with limited amounts of data, offering new opportunities in the field of material data analysis.
Linear Regression (Excel) and Cellular Respiration for Biology, Chemistry...
qubeshub.org
Updated Jan 11, 2022
Cite
Irene Corriette; Beatriz Gonzalez; Daniela Kitanska; Henriette Mozsolits; Sheela Vemu (2022). Linear Regression (Excel) and Cellular Respiration for Biology, Chemistry and Mathematics [Dataset]. http://doi.org/10.25334/5PX5-H796
Explore at:
Unique identifier
https://doi.org/10.25334/5PX5-H796
Dataset updated
Jan 11, 2022
Dataset provided by
QUBES
Authors
Irene Corriette; Beatriz Gonzalez; Daniela Kitanska; Henriette Mozsolits; Sheela Vemu
Description
Students typically find linear regression analysis of data sets in a biology classroom challenging. These activities could be used in a Biology, Chemistry, Mathematics, or Statistics course. The collection provides student activity files with Excel instructions and Instructor Activity files with Excel instructions and solutions to problems.

Students will be able to perform linear regression analysis, find the correlation coefficient, create a scatter plot and find the r-square using MS Excel 365. Students will be able to interpret data sets, describe the relationship between biological variables, and predict the value of an output variable based on the input of a predictor variable.
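
For readers who want to check the spreadsheet results by hand, the quantities named above (slope, intercept, correlation coefficient, r-square, and a prediction) follow from a few sums. The sketch below is not part of the QUBES activity files; the measurements are made up for illustration and the function name is hypothetical.

```python
from math import sqrt


def simple_linear_regression(xs, ys):
    """Return (slope, intercept, r) for a least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)  # Pearson correlation coefficient
    return slope, intercept, r


# Made-up example: CO2 reading (ppm) against elapsed time (minutes).
time_min = [0, 5, 10, 15, 20]
co2_ppm = [400, 455, 510, 540, 600]

slope, intercept, r = simple_linear_regression(time_min, co2_ppm)
r_squared = r ** 2

# Predict the output variable for a new value of the predictor.
predicted_at_12 = intercept + slope * 12
print(slope, intercept, r, r_squared, predicted_at_12)
```

These are the same quantities that Excel's SLOPE, INTERCEPT, CORREL and RSQ worksheet functions report for a single predictor, so the two approaches can be cross-checked.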
Multiple linear regression model.
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Sep 24, 2024
Cite
Dias, Sara Simões; Pedro, Ana Rita; Rosário, Jorge; Dias, Sónia (2024). Multiple linear regression model. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001409589
Explore at:
Dataset updated
Sep 24, 2024
Authors
Dias, Sara Simões; Pedro, Ana Rita; Rosário, Jorge; Dias, Sónia
Description
Introduction: The capacity of higher education students to comprehend and act on health information is a pivotal factor in attaining favourable health outcomes and well-being. Assessing the health literacy of these students is essential in order to develop targeted interventions and provide informed health support. The aim of this study was to identify the level of health literacy and to analyse its relationship with determinants such as socio-demographic variables, chronic disease, perceived health status, and perceived availability of money for expenses among higher education students in the Alentejo region of southern Portugal.

Methodology: An observational, descriptive and cross-sectional study was conducted between 22 June and 12 September 2023. An online structured questionnaire was used, consisting of the Portuguese version of the European Health Literacy Survey Questionnaire, 16 items (HLS-EU-PT-Q16), and including socio-demographic data, presence of chronic diseases, perceived health status, and availability of money for expenses. Data were analysed using independent samples t-test, one-way ANOVA, post-hoc Gabriel’s test, and multivariate logistic regression analyses at a significance level of 0.05. Regression models were used to investigate the relationship between health literacy and various determinants. The study protocol was approved by the Ethics Committee of the University of Évora, and all participants gave written informed consent.

Results: Analysis of the HLS-EU-PT-Q16 showed that 82.3% of the 1228 students sampled had limited health literacy. The mean health literacy score was 19.3 ± 12.8 on a scale of 0 to 50, with subscores of 19.4 ± 13.9 for health care, 19.1 ± 13.1 for disease prevention, and 19.0 ± 13.7 for health promotion. Significant associations were found between health literacy and several determinants. Higher health literacy was associated with the absence of chronic diseases. Regression analysis showed that lower health literacy was associated with not attending health-related courses, not living with a health professional, perceiving limited availability of money for expenses, and having an unsatisfactory health status.

Conclusion: This study improves the understanding of health literacy levels among higher education students in Alentejo, Portugal, and identifies key determinants. Higher education students in this region had relatively low levels of health literacy, which may have a negative impact on their health outcomes. These findings highlight the need for interventions to improve health literacy among higher education students and to address the specific needs of high-risk subgroups in the Alentejo.
Marketing Linear Multiple Regression
kaggle.com
zip
Updated Apr 24, 2020
Cite
FayeJavad (2020). Marketing Linear Multiple Regression [Dataset]. https://www.kaggle.com/datasets/fayejavad/marketing-linear-multiple-regression
Explore at:
Available download formats: zip (1907 bytes)
Dataset updated
Apr 24, 2020
Authors
FayeJavad
Description
Dataset

This dataset was created by FayeJavad

A Comparison of Variance Estimation Methods for Regression Analyses with the...
catalog.data.gov
data.virginia.gov
+1 more
Updated Sep 7, 2025
Cite
Substance Abuse and Mental Health Services Administration (2025). A Comparison of Variance Estimation Methods for Regression Analyses with the Mental Health Surveillance Study Clinical Sample [Dataset]. https://catalog.data.gov/dataset/a-comparison-of-variance-estimation-methods-for-regression-analyses-with-the-mental-health
Explore at:
Dataset updated
Sep 7, 2025
Dataset provided by
Substance Abuse and Mental Health Services Administration (https://www.samhsa.gov/)
Description
The purpose of this report is to compare alternative methods for producing standard errors (SEs) for regression models fitted to the Mental Health Surveillance Study (MHSS) clinical sample, with the goal of producing more accurate and potentially smaller SEs.
Regression analysis in Galaxy with car purchase price prediction dataset
data.niaid.nih.gov
Updated Aug 4, 2022
Cite
Kaivan Kamali (2022). Regression analysis in Galaxy with car purchase price prediction dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4660496
Explore at:
Dataset updated
Aug 4, 2022
Dataset provided by
Penn State University
Authors
Kaivan Kamali
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Source/Credit: Michael Grogan https://github.com/MGCodesandStats https://github.com/MGCodesandStats/datasets/blob/master/cars.csv

Sample dataset for regression analysis. Given 5 attributes (age, gender, miles driven per day, debt, and income) predict how much someone will spend on purchasing a car. All 5 of the input attributes have been scaled to be in 0 to 1 range. Training set has 723 training examples. Test set has 242 test examples.
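
Scaling an attribute to the 0 to 1 range is usually done with min-max scaling. The sketch below shows one common way to do it; the source does not say how its scaling was computed, so fitting the range on the training split and reusing it on the test split is an assumption here, and the income figures and function names are made up.

```python
def fit_min_max(values):
    """Record the range of a training column."""
    return min(values), max(values)


def apply_min_max(values, lo, hi):
    """Map values linearly so that lo -> 0 and hi -> 1."""
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


# Made-up income values, not taken from cars.csv.
train_income = [20000, 35000, 50000, 80000]
test_income = [27500, 65000]

# Fit on the training split only, then reuse the same range on the test split.
lo, hi = fit_min_max(train_income)
train_scaled = apply_min_max(train_income, lo, hi)
test_scaled = apply_min_max(test_income, lo, hi)
print(train_scaled, test_scaled)
```

Putting all inputs on a comparable scale matters more for the neural-network tutorial mentioned below than for plain linear regression, since gradient-based training is sensitive to input magnitudes.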

This dataset will be used in an upcoming Galaxy Training Network tutorial (https://training.galaxyproject.org/training-material/topics/statistics/) on use of feedforward neural networks for regression analysis.
Data from: Model archive summary and suspended-sediment concentrations from...
catalog.data.gov
data.usgs.gov
Updated Sep 14, 2025
+ more versions
Cite
U.S. Geological Survey (2025). Model archive summary and suspended-sediment concentrations from a surrogate ordinary least square regression analysis for station 05517500, Kankakee River at Dunns Bridge, Indiana, April 2016 through July 2020 [Dataset]. https://catalog.data.gov/dataset/model-archive-summary-and-suspended-sediment-concentrations-from-a-surrogate-ordinary-leas
Explore at:
Dataset updated
Sep 14, 2025
Dataset provided by
United States Geological Survey (http://www.usgs.gov/)
Area covered
Indiana, Dunns Bridge, Kankakee River
Description
Sediment accumulation and transport negatively affect flood control, water supply, aquatic life, reclamation, and recreation (Angino and O’Brien, 1968) and are concerns of resource managers in the Kankakee River Basin of northern Indiana and throughout many regions of the United States. By relating continuously monitored water-quality data to discrete data collected from April 2016 through July 2020, linear regression was used to develop models for estimating concentrations of suspended sediment. Developed regression models indicated a strong correlation between continuous turbidity and suspended-sediment concentration (adjusted coefficient of determination equals 0.765, predicted residual error sum of squares equals 0.122). Daily loads of suspended sediment were computed from regression model concentrations and instantaneous streamflow. Monthly loads were then calculated to provide a clearer representation of seasonality. The estimated mean monthly suspended sediment load (April 2016 through July 2020) was 4726.5 tons per month; the estimated median monthly suspended sediment load was 4447.2 tons per month with a range in monthly loads from 741.2 to 9992.8 tons per month. The development of regression models for suspended sediment, total nitrogen, and total phosphorus relied on the collection of representative discrete water-quality samples and the operation of continuously deployed monitors throughout the range of hydrologic and seasonal conditions at the site. Regression models were developed following USGS protocols and methods (Helsel and others, 2020; Rasmussen and others, 2009). Each regression model relates laboratory-analyzed discrete water-quality sample data with continuously deployed water-quality monitor measurements. 
Ordinary least squares regression analysis was done using the R statistical software programming language (R Core Team, 2021) to evaluate the relationship between the discrete concentrations of suspended sediment and the explanatory variables: continuously measured parameters (water temperature, specific conductance, pH, dissolved oxygen, turbidity, and streamflow) as well as seasonality and time over the study period. To improve potential models, explanatory and response variables were evaluated for transformations (log, square root, or square) that linearize the relation or change the distributional characteristics of data resulting in model residuals that are more symmetric, linear, and homoscedastic. Statistical models for all possible combinations of explanatory and response variables were evaluated using stepwise regression. To further evaluate potential models, diagnostic plots were created to assess how each model’s residuals varied as a function of (1) predicted values, (2) normal quantiles, (3) date, and (4) streamflow. Additional plots highlighted differences among predicted and observed values, residuals by season, and residuals by year. A variety of model statistics and diagnostics were used to determine the best predictors of each modeled constituent including tests of significance, standard error, adjusted coefficient of determination (R²), and the predicted residual error sum of squares (PRESS) statistic. The PRESS statistic is a leave-one-out form of cross-validation that provides a measure of model fit for sample observations not used to develop the regression model. In general, the smaller the PRESS statistic, the better the model’s predictive ability (Helsel and Hirsch, 2002). The optimal models commonly used a mathematically transformed response variable. In those instances, a bias correcting factor (BCF) was used to correct for bias that occurs when back-transforming model results into base-10 units (Helsel and Hirsch, 2002).
Prediction intervals were computed for each model following methods from Helsel and Hirsch (2002), to define the range of values within which there is 90-percent certainty that the true value occurs.
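
The leave-one-out idea behind the PRESS statistic can be illustrated for a single-predictor model. The sketch below is not the USGS R workflow: the log-transformed turbidity and concentration values are made up, and the function names are hypothetical. It computes PRESS two ways, by refitting the line n times with one observation left out each time, and by the standard ordinary-least-squares shortcut in which each residual e_i is divided by (1 - h_ii), where h_ii = 1/n + (x_i - mean(x))^2 / Sxx is the leverage of observation i.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def press_by_refitting(xs, ys):
    """Sum of squared errors when each point is predicted without itself."""
    total = 0.0
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += (ys[i] - (intercept + slope * xs[i])) ** 2
    return total


def press_by_leverage(xs, ys):
    """Same quantity from one fit, using e_i / (1 - h_ii)."""
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope, intercept = fit_line(xs, ys)
    total = 0.0
    for x, y in zip(xs, ys):
        residual = y - (intercept + slope * x)
        leverage = 1.0 / n + (x - mean_x) ** 2 / sxx
        total += (residual / (1.0 - leverage)) ** 2
    return total


# Made-up values standing in for log10(turbidity) and log10(concentration).
log_turbidity = [0.5, 0.8, 1.1, 1.4, 1.7, 2.0]
log_ssc = [0.9, 1.1, 1.5, 1.6, 2.1, 2.2]

press_refit = press_by_refitting(log_turbidity, log_ssc)
press_shortcut = press_by_leverage(log_turbidity, log_ssc)
print(press_refit, press_shortcut)
```

The two values agree up to floating-point rounding, which is why PRESS can be reported from a single fitted model without actually refitting it once per observation.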
Student Performance (Multiple Linear Regression) Dataset
cubig.ai
zip
Updated May 29, 2025
Cite
CUBIG (2025). Student Performance (Multiple Linear Regression) Dataset [Dataset]. https://cubig.ai/store/products/392/student-performance-multiple-linear-regression-dataset
Explore at:
Available download formats: zip
Dataset updated
May 29, 2025
Dataset authored and provided by
CUBIG
License
https://cubig.ai/store/terms-of-service
Measurement technique
Privacy-preserving data transformation via differential privacy, Synthetic data generation using AI techniques for model training
Description
1) Data Introduction
• The Student Performance (Multiple Linear Regression) Dataset is designed to analyze the relationship between students’ learning habits and academic performance. Each sample includes key indicators related to learning, such as study hours, sleep duration, previous test scores, and the number of practice exams completed.

2) Data Utilization
(1) Characteristics of the Student Performance (Multiple Linear Regression) Dataset:
• The target variable, Hours Studied, quantitatively represents the amount of time a student has invested in studying. The dataset is structured to allow modeling and inference of learning behaviors based on correlations with other variables.

(2) Applications of the Student Performance (Multiple Linear Regression) Dataset:
• AI-Based Study Time Prediction Models: The dataset can be used to develop regression models that estimate a student’s expected study time based on inputs like academic performance, sleep habits, and engagement patterns.
• Behavioral Analysis and Personalized Learning Strategies: It can be applied to identify students with insufficient study time and design personalized study interventions based on academic and lifestyle patterns.
Ridge-regression-sample-dataset
kaggle.com
zip
Updated Oct 21, 2025
Cite
naresh1502 (2025). Ridge-regression-sample-dataset [Dataset]. https://www.kaggle.com/datasets/naresh1502/ridge-regression-sample-dataset
Explore at:
Available download formats: zip (4117 bytes)
Dataset updated
Oct 21, 2025
Authors
naresh1502
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
Code used to generate the data:

    import numpy as np
    import pandas as pd

    # 🎲 Create synthetic data with correlated features
    np.random.seed(42)
    n_samples = 100

    X1 = np.random.rand(n_samples, 1) * 10
    X2 = X1 + np.random.randn(n_samples, 1) * 0.1  # almost the same as X1 -> high correlation
    X3 = np.random.rand(n_samples, 1) * 10

    X = np.hstack([X1, X2, X3])
    y = 3*X1 + 2*X2 + 1.5*X3 + np.random.randn(n_samples, 1) * 2  # target with noise

    # Convert X to DataFrame with column names
    df_X = pd.DataFrame(X, columns=['X1', 'X2', 'X3'])

    # Convert y to DataFrame
    df_y = pd.DataFrame(y, columns=['y'])

    # Combine X and y into a single DataFrame
    df = pd.concat([df_X, df_y], axis=1)

    # Display first 5 rows
    print(df.head())
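
Because X2 is almost a copy of X1, ordinary least squares can pin down the sum of their two coefficients well but not the individual values, which is the situation ridge regression is meant for. The sketch below is an illustration under stated assumptions, not the dataset author's code: it regenerates data with the same recipe using the standard-library `random` module (so the numbers differ from the numpy version above), centres the data instead of fitting an intercept, and uses an arbitrary penalty of 10. The helper names are hypothetical.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ridge_coefficients(rows, ys, lam):
    """Ridge slopes on mean-centred data; lam = 0 gives ordinary least squares."""
    n, p = len(rows), len(rows[0])
    col_means = [sum(r[j] for r in rows) / n for j in range(p)]
    y_mean = sum(ys) / n
    xc = [[r[j] - col_means[j] for j in range(p)] for r in rows]
    yc = [y - y_mean for y in ys]
    # Solve (X^T X + lam * I) beta = X^T y.
    gram = [[sum(r[j] * r[k] for r in xc) + (lam if j == k else 0.0)
             for k in range(p)] for j in range(p)]
    rhs = [sum(r[j] * y for r, y in zip(xc, yc)) for j in range(p)]
    return solve_linear_system(gram, rhs)


# Same recipe as the description, different random number generator.
rng = random.Random(42)
rows, ys = [], []
for _ in range(100):
    x1 = rng.uniform(0.0, 10.0)
    x2 = x1 + rng.gauss(0.0, 0.1)  # nearly identical to x1
    x3 = rng.uniform(0.0, 10.0)
    rows.append([x1, x2, x3])
    ys.append(3 * x1 + 2 * x2 + 1.5 * x3 + rng.gauss(0.0, 2.0))

ols = ridge_coefficients(rows, ys, 0.0)
ridge = ridge_coefficients(rows, ys, 10.0)  # penalty value chosen arbitrarily
print("OLS:  ", [round(b, 2) for b in ols])
print("ridge:", [round(b, 2) for b in ridge])
```

Comparing the two printed coefficient vectors shows how the penalty pulls the X1 and X2 estimates toward each other while leaving their combined effect and the X3 coefficient largely intact; in practice the penalty would be chosen by cross-validation rather than fixed.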
Visualizing Count Data Regressions Using Rootograms
tandf.figshare.com
tar
Updated May 31, 2023
Cite
Christian Kleiber; Achim Zeileis (2023). Visualizing Count Data Regressions Using Rootograms [Dataset]. http://doi.org/10.6084/m9.figshare.3204181.v2
Explore at:
Available download formats: tar
Unique identifier
https://doi.org/10.6084/m9.figshare.3204181.v2
Dataset updated
May 31, 2023
Dataset provided by
Taylor & Francis (https://taylorandfrancis.com/)
Authors
Christian Kleiber; Achim Zeileis
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
The rootogram is a graphical tool associated with the work of J. W. Tukey that was originally used for assessing goodness of fit of univariate distributions. Here, we extend the rootogram to regression models and show that this is particularly useful for diagnosing and treating issues such as overdispersion and/or excess zeros in count data models. We also introduce a weighted version of the rootogram that can be applied out of sample or to (weighted) subsets of the data, for example, in finite mixture models. An empirical illustration revisiting a well-known dataset from ethology is included, for which a negative binomial hurdle model is employed. Supplementary materials providing two further illustrations are available online: the first, using data from public health, employs a two-component finite mixture of negative binomial models; the second, using data from finance, involves underdispersion. An R implementation of our tools is available in the R package countreg. It also contains the data and replication code.
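
The numbers behind a hanging rootogram are simple to compute: for each count value, compare the square root of the observed frequency with the square root of the frequency expected under the fitted model. The sketch below is an illustration in Python, not the authors' countreg implementation; it assumes a plain Poisson model fitted by the sample mean, and the count data are made up.

```python
from collections import Counter
from math import exp, factorial, sqrt


def poisson_pmf(k, rate):
    return exp(-rate) * rate ** k / factorial(k)


# Made-up count data with many zeros.
counts = [0, 0, 0, 0, 0, 0, 1, 1, 2, 3, 0, 0, 5, 1, 0, 2, 0, 0, 7, 1]
n = len(counts)
rate = sum(counts) / n  # Poisson maximum-likelihood estimate
observed = Counter(counts)

# One row per count value: (k, sqrt(observed), sqrt(expected), bar bottom).
# In a hanging rootogram each bar hangs from sqrt(expected), so its bottom
# sits at sqrt(expected) - sqrt(observed); a negative bottom means the
# model expects fewer cases of that count than were observed.
rows = []
for k in range(max(counts) + 1):
    root_obs = sqrt(observed.get(k, 0))
    root_exp = sqrt(n * poisson_pmf(k, rate))
    rows.append((k, root_obs, root_exp, root_exp - root_obs))

for k, root_obs, root_exp, bottom in rows:
    print(k, round(root_obs, 2), round(root_exp, 2), round(bottom, 2))
```

In this made-up sample the bar for zero drops below the axis, the pattern the abstract describes as a sign of excess zeros that a plain Poisson model does not capture.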
Data from: Data and regression models describing biofilms and water quality...
catalog.data.gov
data.usgs.gov
Updated Nov 20, 2025
+ more versions
Cite
U.S. Geological Survey (2025). Data and regression models describing biofilms and water quality in streams surrounding Milwaukee Mitchell International Airport, Milwaukee, Wisconsin (2009-2014) [Dataset]. https://catalog.data.gov/dataset/data-and-regression-models-describing-biofilms-and-water-quality-in-streams-surroundi-2009
Explore at:
Dataset updated
Nov 20, 2025
Dataset provided by
U.S. Geological Survey
Area covered
Milwaukee, Wisconsin
Description
Heterotrophic biofilm growth is typical in streams receiving airport deicer runoff; however, detailed studies of biofilms in this setting are rare. Sample collection for this study was done during and surrounding two deicer seasons (i.e., 2009-2010 and 2010-2011), with additional sample collection occurring in 2014. Field surveys were used to document biofilm prevalence and characteristics, as well as stream characteristics. Collected biofilm samples were analyzed via microscopy, quantitative real-time polymerase chain reaction (qPCR), microarray, gas chromatography (of a cultured isolate), as well as via sequencing (Sanger and massively parallel). Sequence data are provided elsewhere (as described in the larger citation). Water-quality and quantity data were also collected in an attempt to assess relevant environmental conditions. Water-quality data included grab samples collected at the time of biofilm field surveys as well as flow-weighted composite samples that were collected throughout the study period at nearby stream gages. Continuous streamflow and temperature were also collected at these gaged sites. Additional sensors were deployed at non-gaged sites to measure water temperature. Continuous temperature data were used to calculate antecedent characteristics for various time windows. Dye tracer results allowed for the determination of flow-based times of travel between sites. Taken together with flow composite sample data (at the downstream gage closest to the airport), this allowed for the calculation of estimated water-quality characteristics at downstream sites as well as the subsequent calculation of antecedent water-quality characteristics at all downstream sites. Regression models were run to investigate the influence of environmental factors on biofilm volume and dissolved oxygen concentration. Models were developed by ordinary least-squares regression using the R project for statistical computing with core functionality. Predictor variables for these models are included in the data file and input files provided. These include biofilm volumes, dissolved oxygen, COD concentration, water temperature, and monitoring site designation.
Data from: Variable Selection Diagnostics Measures for High-Dimensional...
figshare.com
bin
Updated May 30, 2023
Cite
Ying Nan; Yuhong Yang (2023). Variable Selection Diagnostics Measures for High-Dimensional Regression [Dataset]. http://doi.org/10.6084/m9.figshare.1067053.v2
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.6084/m9.figshare.1067053.v2
Dataset updated
May 30, 2023
Dataset provided by
Taylor & Francis
Authors
Ying Nan; Yuhong Yang
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Many exciting results have been obtained on model selection for high-dimensional data in both efficient algorithms and theoretical developments. The powerful penalized regression methods can give sparse representations of the data even when the number of predictors is much larger than the sample size. One important question then is: How do we know when a sparse pattern identified by such a method is reliable? In this work, besides investigating instability of model selection methods in terms of variable selection, we propose variable selection deviation measures that give one a proper sense on how many predictors in the selected set are likely trustworthy in certain aspects. Simulation and a real data example demonstrate the utility of these measures for application.
Calibration datasets and model archive summaries for regression models...
catalog.data.gov
data.usgs.gov
Updated Oct 22, 2025
Cite
U.S. Geological Survey (2025). Calibration datasets and model archive summaries for regression models developed to estimate metal concentrations at nine sites on the Animas and San Juan Rivers, Colorado, New Mexico, and Utah: U.S. Geological Survey data release, https://doi.org/10.5066/P9THSFE0 [Dataset]. https://catalog.data.gov/dataset/calibration-datasets-and-model-archive-summaries-for-regression-models-developed-to-estima
Explore at:
Dataset updated
Oct 22, 2025
Dataset provided by
United States Geological Survey http://www.usgs.gov/
Area covered
San Juan River, Utah, New Mexico, Colorado
Description
This data release supports the following publication: Mast, M. A., 2018, Estimating metal concentrations with regression analysis and water-quality surrogates at nine sites on the Animas and San Juan Rivers, Colorado, New Mexico, and Utah: U.S. Geological Survey Scientific Investigations Report 2018-5116. The U.S. Geological Survey (USGS), in cooperation with the U.S. Environmental Protection Agency (EPA), developed site-specific regression models to estimate concentrations of selected metals at nine USGS streamflow-gaging stations along the Animas and San Juan Rivers. Multiple linear-regression models were developed by relating metal concentrations in discrete water-quality samples to continuously monitored streamflow and surrogate parameters including specific conductance, pH, turbidity, and water temperature. Models were developed for dissolved and total concentrations of aluminum, arsenic, cadmium, iron, lead, manganese, and zinc using water-quality samples collected during 2005–17 by several agencies, using different collection methods and analytical laboratories. Calibration datasets in comma-separated format (CSV) include the variables of sampling date and time, metal concentrations (in micrograms per liter), stream discharge (in cubic feet per second), specific conductance (in microsiemens per centimeter at 25 degrees Celsius), pH, water temperature (in degrees Celsius), turbidity (in nephelometric turbidity units), and calculated seasonal terms based on Julian day. Surrogate parameters and discrete water-quality samples were used from nine sites including Cement Creek at Silverton, Colo. (USGS station 09358550); Animas River below Silverton, Colo. (USGS station 09359020); Animas River at Durango, Colo. (USGS station 09361500); Animas River near Cedar Hill, N. Mex. (USGS station 09363500); Animas River below Aztec, N. Mex. (USGS station 09364010); San Juan River at Farmington, N. Mex. (USGS station 09365000); San Juan River at Shiprock, N. Mex. (USGS station 09368000); San Juan River at Four Corners, Colo. (USGS station 09371010); and San Juan River near Bluff, Utah (USGS station 09379500). Model archive summaries in PDF format include model statistics, data, and plots and were generated using an R script developed by the USGS Kansas Water Science Center, available at https://patrickeslick.github.io/ModelArchiveSummary/. A description of each USGS streamflow-gaging station, along with information about the calibration datasets, is also provided.
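As a rough illustration of the approach described above, the sketch below fits a multiple linear regression with surrogate predictors and seasonal terms by ordinary least squares. It is not the method or code used in the data release: the sine/cosine form of the seasonal terms, the function names `seasonal_terms` and `fit_surrogate_model`, the choice of predictors, and all numeric values are assumptions made for this example, and the data are synthetic.

```python
import numpy as np


def seasonal_terms(day_of_year, period=365.25):
    """Return sine and cosine terms for a day-of-year value.

    Assumed form: sin(2*pi*d/period) and cos(2*pi*d/period). The data
    release only states that seasonal terms are based on Julian day.
    """
    angle = 2.0 * np.pi * np.asarray(day_of_year, dtype=float) / period
    return np.sin(angle), np.cos(angle)


def fit_surrogate_model(y, discharge, spec_cond, day_of_year):
    """Fit y = b0 + b1*discharge + b2*spec_cond + b3*sin + b4*cos by OLS."""
    sin_t, cos_t = seasonal_terms(day_of_year)
    # Design matrix: intercept, two surrogate predictors, two seasonal terms.
    X = np.column_stack(
        [np.ones_like(sin_t), discharge, spec_cond, sin_t, cos_t]
    )
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Synthetic, noise-free data (hypothetical values, not from the data release).
rng = np.random.default_rng(0)
n = 200
day = rng.integers(1, 366, size=n)
discharge = rng.normal(6.0, 1.0, size=n)
spec_cond = rng.normal(400.0, 50.0, size=n)

true_coef = np.array([1.0, 0.5, 0.002, 0.3, -0.2])
sin_t, cos_t = seasonal_terms(day)
y = (
    true_coef[0]
    + true_coef[1] * discharge
    + true_coef[2] * spec_cond
    + true_coef[3] * sin_t
    + true_coef[4] * cos_t
)

coef = fit_surrogate_model(y, discharge, spec_cond, day)
print(coef)
```

Because the synthetic response is built without noise, the fitted coefficients should match `true_coef` to within numerical precision; with real calibration data the fit would be approximate and would need the usual regression diagnostics.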
Datasets used to train and test prediction model to predict scores in terms...
data.mendeley.com
Updated Mar 5, 2025
Cite
Jarosław Wątróbski (2025). Datasets used to train and test prediction model to predict scores in terms of SDG 7 realization [Dataset]. http://doi.org/10.17632/6c8fm7s4y2.1
Explore at:
Unique identifier
https://doi.org/10.17632/6c8fm7s4y2.1
Dataset updated
Mar 5, 2025
Authors
Jarosław Wątróbski
License
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
The datasets used in this research work relate to the aims of Sustainable Development Goal 7 (SDG 7). They were used to train and test a machine learning model based on an artificial neural network, as well as other machine learning regression models, for predicting scores for the realization of SDG 7 aims. The training dataset was created from data for 2013 to 2021 and includes 261 samples. The test dataset includes 29 samples. Source data from 2013 to 2022 are available in 10 XLSX and CSV files. The training and test datasets are available in XLSX and CSV files. A detailed description of the data is available in a PDF file.

