85 datasets found
  1. Data for multiple linear regression models for predicting microcystin concentration action-level exceedances in selected lakes in Ohio

    • catalog.data.gov
    • data.usgs.gov
    Updated Jul 6, 2024
    Cite
    U.S. Geological Survey (2024). Data for multiple linear regression models for predicting microcystin concentration action-level exceedances in selected lakes in Ohio [Dataset]. https://catalog.data.gov/dataset/data-for-multiple-linear-regression-models-for-predicting-microcystin-concentration-action
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Ohio
    Description

    Site-specific multiple linear regression models were developed for eight sites in Ohio (six in the Western Lake Erie Basin and two in northeast Ohio on inland reservoirs) to quickly predict action-level exceedances for a cyanotoxin, microcystin, in recreational and drinking waters used by the public. Real-time models include easily- or continuously-measured factors that do not require that a sample be collected. Real-time models are presented in two categories: (1) six models with continuous monitor data, and (2) three models with on-site measurements. Real-time models commonly included variables such as phycocyanin, pH, specific conductance, and streamflow or gage height. Many of the real-time factors were averages over time periods antecedent to the time the microcystin sample was collected, including water-quality data compiled from continuous monitors. Comprehensive models use a combination of discrete sample-based measurements and real-time factors. Comprehensive models were useful at some sites with lagged variables (< 2 weeks) for cyanobacterial toxin genes, dissolved nutrients, and (or) N to P ratios. Comprehensive models are presented in three categories: (1) three models with continuous monitor data and lagged comprehensive variables, (2) five models with no continuous monitor data and lagged comprehensive variables, and (3) one model with continuous monitor data and same-day comprehensive variables. Funding for this work was provided by the Ohio Water Development Authority and the U.S. Geological Survey Cooperative Water Program.

  2. An example data set for exploration of Multiple Linear Regression

    • catalog.data.gov
    • data.usgs.gov
    Updated Jul 6, 2024
    Cite
    U.S. Geological Survey (2024). An example data set for exploration of Multiple Linear Regression [Dataset]. https://catalog.data.gov/dataset/an-example-data-set-for-exploration-of-multiple-linear-regression
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    This data set contains example data for exploration of the theory of regression-based regionalization. The 90th percentile of annual maximum streamflow is provided as an example response variable for 293 streamgages in the conterminous United States. Several explanatory variables are drawn from the GAGES-II database in order to demonstrate how multiple linear regression is applied. Example scripts demonstrate how to collect the original streamflow data provided and how to recreate the figures from the associated Techniques and Methods chapter.

  3. Student Performance (Multiple Linear Regression) Dataset

    • cubig.ai
    Updated May 29, 2025
    Cite
    CUBIG (2025). Student Performance (Multiple Linear Regression) Dataset [Dataset]. https://cubig.ai/store/products/392/student-performance-multiple-linear-regression-dataset
    Dataset updated
    May 29, 2025
    Dataset authored and provided by
    CUBIG
    License

    https://cubig.ai/store/terms-of-service

    Measurement technique
    Privacy-preserving data transformation via differential privacy, Synthetic data generation using AI techniques for model training
    Description

    1) Data Introduction • The Student Performance (Multiple Linear Regression) Dataset is designed to analyze the relationship between students’ learning habits and academic performance. Each sample includes key indicators related to learning, such as study hours, sleep duration, previous test scores, and the number of practice exams completed.

    2) Data Utilization (1) Characteristics of the Student Performance (Multiple Linear Regression) Dataset: • The target variable, Hours Studied, quantitatively represents the amount of time a student has invested in studying. The dataset is structured to allow modeling and inference of learning behaviors based on correlations with other variables.

    (2) Applications of the Student Performance (Multiple Linear Regression) Dataset: • AI-Based Study Time Prediction Models: The dataset can be used to develop regression models that estimate a student’s expected study time based on inputs like academic performance, sleep habits, and engagement patterns. • Behavioral Analysis and Personalized Learning Strategies: It can be applied to identify students with insufficient study time and design personalized study interventions based on academic and lifestyle patterns.
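    A minimal sketch of such a study-time regression model, assuming hypothetical column names (Previous Scores, Sleep Hours, Practice Exams, Hours Studied) in a CSV export of the dataset:

    ```python
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    # Hypothetical file and column names; adjust to the actual CSV schema.
    df = pd.read_csv("student_performance.csv")
    X = df[["Previous Scores", "Sleep Hours", "Practice Exams"]]
    y = df["Hours Studied"]  # target variable per the dataset description

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42)

    model = LinearRegression().fit(X_train, y_train)
    print("R^2 on held-out data:", model.score(X_test, y_test))
    print("Coefficients:", dict(zip(X.columns, model.coef_)))
    ```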

  4. Data from: Bike Sharing Dataset

    • kaggle.com
    Updated Sep 10, 2024
    Cite
    Ram Vishnu R (2024). Bike Sharing Dataset [Dataset]. https://www.kaggle.com/datasets/ramvishnur/bike-sharing-dataset
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Sep 10, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ram Vishnu R
    Description

    Problem Statement:

    A bike-sharing system is a service in which bikes are made available for shared use to individuals on a short-term basis, for a price or for free. Many bike-share systems let people borrow a bike from a computer-controlled "dock": the user enters payment information, the system unlocks a bike, and the bike can then be returned to another dock belonging to the same system.

    The US bike-sharing provider BoomBikes has recently suffered a considerable dip in revenue due to the corona pandemic. The company is finding it very difficult to sustain itself in the current market, so it has decided to come up with a mindful business plan to accelerate its revenue.

    As part of this effort, BoomBikes aspires to understand the demand for shared bikes so that it can prepare to meet people's needs once the situation improves, stand out from other service providers, and increase profits.

    They have contracted a consulting company to understand the factors on which the demand for these shared bikes depends. Specifically, they want to understand the factors affecting the demand for these shared bikes in the American market. The company wants to know:

    • Which variables are significant in predicting the demand for shared bikes.
    • How well those variables describe the bike demand.

    Based on various meteorological surveys and people's lifestyles, the service provider firm has gathered a large dataset on daily bike demand across the American market based on several factors.

    Business Goal:

    You are required to model the demand for shared bikes using the available independent variables. Management will use the model to understand how exactly demand varies with different features, so they can adjust business strategy to meet demand levels and customer expectations. Further, the model will help management understand the demand dynamics of a new market.

    Data Preparation:

    1. You can observe in the dataset that some of the variables, like 'weathersit' and 'season', have values 1, 2, 3, 4 with specific labels associated with them (as can be seen in the data dictionary). These numeric codes may suggest an ordering of the labels, which is actually not the case (check the data dictionary and think about why). So it is advisable to convert such feature values into categorical string values before proceeding with model building (see the pandas sketch after this list). Please refer to the data dictionary to get a better understanding of all the independent variables.
    2. You might notice the column 'yr' with two values, 0 and 1, indicating the years 2018 and 2019 respectively. Your first instinct might be to drop this column, since with only two values it seems unlikely to add value to the model. In reality, because these bike-sharing systems are steadily gaining popularity, demand for the bikes is increasing every year, so 'yr' might be a good variable for prediction. Think twice before dropping it.
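    A minimal pandas sketch of step 1, assuming the standard 'season' and 'weathersit' codes; the label strings below are illustrative, so check them against the data dictionary:

    ```python
    import pandas as pd

    df = pd.read_csv("day.csv")  # hypothetical file name for the daily data

    # Map the coded integers to categorical string labels so a linear model
    # does not treat them as ordered numeric values.
    season_labels = {1: "spring", 2: "summer", 3: "fall", 4: "winter"}
    weather_labels = {1: "clear", 2: "mist", 3: "light_precip", 4: "heavy_precip"}
    df["season"] = df["season"].map(season_labels)
    df["weathersit"] = df["weathersit"].map(weather_labels)

    # One-hot encode for regression; drop_first avoids the dummy-variable trap.
    df = pd.get_dummies(df, columns=["season", "weathersit"], drop_first=True)
    ```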

    Model Building:

    In the dataset provided, you will notice three columns named 'casual', 'registered', and 'cnt'. The variable 'casual' indicates the number of casual users who have made a rental, while 'registered' shows the number of registered users who have made a booking on a given day. Finally, 'cnt' indicates the total number of bike rentals, including both casual and registered. The model should be built with 'cnt' as the target variable.

    Model Evaluation:

    When you're done with model building and residual analysis and have made predictions on the test set, use the following two lines of code to calculate the R-squared score on the test set:

    ```python
    from sklearn.metrics import r2_score
    r2_score(y_test, y_pred)
    ```

    Here y_test is the test-set series of the target variable and y_pred holds the predicted values of the target variable on the test set. Please perform this step, as the R-squared score on the test set serves as the benchmark for your model.

  5. ‘homeprices-multiple-variables’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Feb 14, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘homeprices-multiple-variables’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-homeprices-multiple-variables-e06e/acea5a36/?iid=001-683&v=presentation
    Dataset updated
    Feb 14, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘homeprices-multiple-variables’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/pankeshpatel/homepricesmultiplevariables on 14 February 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    Sample data of housing prices. We used this small data set to create a tutorial, "Machine Learning for Absolute Beginners"; the topic is multivariate regression.

    Content

    It has the following four attributes describing a house:
    - area: area of the house in square feet
    - bedrooms: number of bedrooms in the house
    - age: age of the house
    - price: price of the house

    Area, bedrooms, and age are feature attributes; price is the target variable.
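    A minimal multivariate-regression sketch over these four columns, assuming a homeprices.csv file with headers area, bedrooms, age, and price:

    ```python
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    df = pd.read_csv("homeprices.csv")  # hypothetical file name
    X = df[["area", "bedrooms", "age"]]
    y = df["price"]

    model = LinearRegression().fit(X, y)
    print("Intercept:", model.intercept_)
    print("Coefficients:", dict(zip(X.columns, model.coef_)))

    # Predict the price of a 3000 sq ft, 3-bedroom, 40-year-old house.
    print(model.predict(pd.DataFrame([[3000, 3, 40]], columns=X.columns)))
    ```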

    Acknowledgements

    Source: codebasics : https://twitter.com/codebasicshub

    --- Original source retains full ownership of the source dataset ---

  6. Multiple linear regression analysis of socioeconomic factors.

    • plos.figshare.com
    xls
    Updated May 30, 2023
    Cite
    Bronislava Bashinskaya; Brian V. Nahed; Brian P. Walcott; Jean-Valery C. E. Coumans; Oyere K. Onuma (2023). Multiple linear regression analysis of socioeconomic factors. [Dataset]. http://doi.org/10.1371/journal.pone.0046314.t004
    Available download formats: xls
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Bronislava Bashinskaya; Brian V. Nahed; Brian P. Walcott; Jean-Valery C. E. Coumans; Oyere K. Onuma
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    **Significant. A multiple linear regression analysis was performed for each procedure (column 1) and three socioeconomic factors (column 2). Individual regression coefficients are identified (column 3), along with their respective 95% confidence intervals (column 4). The goodness of model fit (column 5) is the percent of the variation explained by the model. The P value (column 6) represents the significance of each regression model as a whole, incorporating education, income, and employment as variables. This model was significant in describing the relationship between the three socioeconomic variables and the prevalence of CABG and PTCA. No causal mechanism can be identified with any regression analysis technique.

  7. Canada Per Capita Income Single variable data set

    • kaggle.com
    zip
    Updated Sep 9, 2019
    Cite
    Gurdit Singh (2019). Canada Per Capita Income Single variable data set [Dataset]. https://www.kaggle.com/datasets/gurdit559/canada-per-capita-income-single-variable-data-set
    Available download formats: zip (637 bytes)
    Dataset updated
    Sep 9, 2019
    Authors
    Gurdit Singh
    Area covered
    Canada
    Description

    The dataset for predicting income per capita for Canada is taken from the website: data.worldbank.org

    I am using the data to practice single-variable linear regression as a beginner.


    Content

    The data set contains two columns, namely year and per capita income.

    Acknowledgements

    The dataset for predicting income per capita for Canada is taken from the website: data.worldbank.org

    Objective

    Predict Canada's per capita income for the year 2020 using linear regression (beginner level, just for practice).
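    A beginner-level sketch of that objective; the file and column names are assumptions, so normalize them to whatever the CSV actually uses:

    ```python
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    df = pd.read_csv("canada_per_capita_income.csv")  # hypothetical file name
    df.columns = ["year", "income"]  # normalize the two column headers

    # Fit income as a linear function of year.
    model = LinearRegression().fit(df[["year"]], df["income"])

    # Extrapolate the fitted trend line to 2020.
    pred_2020 = model.predict(pd.DataFrame({"year": [2020]}))[0]
    print(f"Predicted 2020 per capita income: {pred_2020:,.0f}")
    ```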

  8. Data from: Data for Multiple Linear Regresion social media on Cost of Representation of the European Groups of Interest

    • portaldelainvestigacion.uma.es
    Updated 2023
    Cite
    Castillo Esparcia, Antonio; Almansa Martinez, Ana; Gorostiza Cervino, Aritz (2023). Data for Multiple Linear Regresion social media on Cost of Representation of the European Groups of Interest [Dataset]. https://portaldelainvestigacion.uma.es/documentos/67a9c7ce19544708f8c731fe
    Dataset updated
    2023
    Authors
    Castillo Esparcia, Antonio; Almansa Martinez, Ana; Gorostiza Cervino, Aritz
    Description

    This dataset provides a critical nexus between social media engagement and the financial dimension of European interest groups' representation costs. Designed to facilitate multiple linear regression analysis, this dataset is a valuable tool for researchers, statisticians, and analysts seeking to unravel the intricate relationships between digital engagement and the financial commitments of these interest groups.

    The dataset offers a robust collection of data points that enables the exploration of potential correlations, dependencies, and predictive insights. By delving into the varying levels of social media presence across different platforms and their potential influence on the cost of representation, researchers can gain a deeper understanding of the interplay between virtual engagement and real-world financial investment.

    Accessible to the academic and research community, this dataset holds the promise of shedding light on the dynamic and evolving landscape of interest groups' communication strategies and their financial implications. With the potential to inform policy decisions and strategic planning, this dataset represents a stepping stone toward a more comprehensive understanding of the intricate web of relationships that shape the world of European interest groups. The variables included:

      1. Twitter_link (Dummy):
      1.1. Twitter_YES
      1.2. Twitter_NO
      2. Facebook_link (Dummy):
      2.1. Facebook_YES
      2.2. Facebook_NO
      3. Instagram_link (Dummy):
      3.1. Instagram_YES
      3.2. Instagram_NO
      4. Linkedin_link (Dummy):
      4.1. Linkedin_YES
      4.2. Linkedin_NO
      5. Youtube_link (Dummy):
      5.1. Youtube_YES
      5.2. Youtube_NO
      6. mean_cost (continuous): The mean of the range of estimated cost of representation.
  9. Estimating the LQAC model with I(2) variables (replication data)

    • journaldata.zbw.eu
    .data, txt
    Updated Dec 8, 2022
    Cite
    Tom Engsted; Niels Haldrup (2022). Estimating the LQAC model with I(2) variables (replication data) [Dataset]. http://doi.org/10.15456/jae.2022314.0706003384
    Available download formats: txt (922 bytes), .data (8,085 bytes)
    Dataset updated
    Dec 8, 2022
    Dataset provided by
    ZBW - Leibniz Informationszentrum Wirtschaft
    Authors
    Tom Engsted; Niels Haldrup
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This paper derives a method for estimating and testing the Linear Quadratic Adjustment Cost (LQAC) model when the target variable and some of the forcing variables follow I(2) processes. Based on a forward-looking error-correction formulation of the model, it is shown how to obtain strongly consistent estimates of the structural parameters from both a linear and a non-linear cointegrating regression where first-differences of the I(2) variables are included as regressors (multicointegration). Further, based on the estimated parameter values, it is shown how to test and evaluate the LQAC model using a VAR approach. A simple, easily interpretable metric for measuring model fit is suggested. In an empirical application using UK money demand data, the non-linear multicointegrating regression delivers an economically plausible estimate of the adjustment cost parameter. However, the restrictions implied by the exact LQAC model under rational expectations are strongly rejected, and the metric for model fit indicates a substantial noise component in the model.

  10. Data from: Red wine DataSet

    • kaggle.com
    Updated Aug 21, 2023
    Cite
    Suraj_kumar_Gupta (2023). Red wine DataSet [Dataset]. https://www.kaggle.com/datasets/soorajgupta7/red-wine-dataset
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 21, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Suraj_kumar_Gupta
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
    License information was derived automatically

    Description

    Datasets Description:

    The datasets under discussion pertain to the red and white variants of Portuguese "Vinho Verde" wine. Detailed information is available in the reference by Cortez et al. (2009). These datasets encompass physicochemical variables as inputs and sensory variables as outputs. Notably, specifics regarding grape types, wine brand, and selling prices are absent due to privacy and logistical concerns.

    Classification and Regression Tasks: These datasets can be treated as either classification or regression problems. The classes are ordered but imbalanced; for instance, the dataset contains many more normal wines than excellent or poor ones.

    Dataset Contents: For a comprehensive understanding, readers are encouraged to review the work by Cortez et al. (2009). The input variables, derived from physicochemical tests, are:
    1. Fixed acidity
    2. Volatile acidity
    3. Citric acid
    4. Residual sugar
    5. Chlorides
    6. Free sulfur dioxide
    7. Total sulfur dioxide
    8. Density
    9. pH
    10. Sulphates
    11. Alcohol

    The output variable, based on sensory data, is denoted by: 12. Quality (score ranging from 0 to 10)

    Usage Tips: A practical suggestion involves setting a threshold for the dependent variable, defining wines with a quality score of 7 or higher as 'good/1' and the rest as 'not good/0.' This facilitates meaningful experimentation with hyperparameter tuning using decision tree algorithms and analyzing ROC curves and AUC values.

    Operational Workflow: To make efficient use of the dataset, the following steps are recommended (a Python equivalent is sketched after the list):
    1. Connect a File Reader (for the CSV) to a Linear Correlation node and an interactive histogram for basic Exploratory Data Analysis (EDA).
    2. Connect the File Reader to a Rule Engine node to transform the 10-point scale into a dichotomous variable indicating 'good wine' and 'rest'.
    3. Feed the Rule Engine node output into a Column Filter node to filter out the original 10-point feature, thus preventing data leakage.
    4. Feed the Column Filter node output into a Partitioning node to execute a standard train/test split (e.g., 75%/25%, choosing 'random' or 'stratified').
    5. Feed the Partitioning node train data split into a Decision Tree Learner node.
    6. Connect the Partitioning node test data split to the data input of a Decision Tree Predictor node.
    7. Link the Decision Tree Learner node model output to the model input of the Decision Tree Predictor node.
    8. Finally, connect the Decision Tree Predictor output to a ROC node for model evaluation based on the AUC value.
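    For readers working outside KNIME, a rough scikit-learn equivalent of steps 2-8, assuming the usual semicolon-separated UCI winequality-red.csv layout:

    ```python
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import roc_auc_score

    df = pd.read_csv("winequality-red.csv", sep=";")  # assumed UCI file layout

    # Rule Engine equivalent: quality >= 7 -> 'good' (1), else 'rest' (0).
    df["good_wine"] = (df["quality"] >= 7).astype(int)

    # Column Filter equivalent: drop the 10-point score to prevent leakage.
    X = df.drop(columns=["quality", "good_wine"])
    y = df["good_wine"]

    # Partitioning equivalent: stratified 75%/25% train/test split.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=42)

    # Learner + Predictor equivalent.
    tree = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_train, y_train)

    # ROC node equivalent: evaluate by AUC.
    proba = tree.predict_proba(X_test)[:, 1]
    print("AUC:", roc_auc_score(y_test, proba))
    ```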

    Tools and Acknowledgments: For an efficient analysis, consider using KNIME, a valuable graphical user interface (GUI) tool. Additionally, the dataset is available on the UCI machine learning repository, and proper acknowledgment and citation of the dataset source by Cortez et al. (2009) are essential for use.

  11. Additional file 2: of Standardizing effect size from linear regression models with log-transformed variables for meta-analysis

    • springernature.figshare.com
    xlsx
    Updated Jun 4, 2023
    Cite
    Miguel Rodríguez-Barranco; Aurelio Tobías; Daniel Redondo; Elena Molina-Portillo; María Sánchez (2023). Additional file 2: of Standardizing effect size from linear regression models with log-transformed variables for meta-analysis [Dataset]. http://doi.org/10.6084/m9.figshare.c.3719716_D2.v1
    Available download formats: xlsx
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    figshare
    Authors
    Miguel Rodríguez-Barranco; Aurelio Tobías; Daniel Redondo; Elena Molina-Portillo; María Sánchez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Excel template to transform original effect size using the proposed formulae. (XLSX 19 kb)

  12. ‘Walmart Dataset (Retail)’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Apr 18, 2020
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2020). ‘Walmart Dataset (Retail)’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-walmart-dataset-retail-0283/e07567d8/?iid=003-947&v=presentation
    Dataset updated
    Apr 18, 2020
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Walmart Dataset (Retail)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/rutuspatel/walmart-dataset-retail on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Dataset Description:

    This is the historical data that covers sales from 2010-02-05 to 2012-11-01, in the file Walmart_Store_sales. Within this file you will find the following fields:

    Store - the store number

    Date - the week of sales

    Weekly_Sales - sales for the given store

    Holiday_Flag - whether the week is a special holiday week (1 = holiday week, 0 = non-holiday week)

    Temperature - Temperature on the day of sale

    Fuel_Price - Cost of fuel in the region

    CPI – Prevailing consumer price index

    Unemployment - Prevailing unemployment rate

    Holiday Events:
    Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13
    Labour Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13
    Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13
    Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13

    Analysis Tasks

    Basic Statistics tasks

    1) Which store has maximum sales

    2) Which store has the maximum standard deviation, i.e., sales that vary a lot? Also, find out the coefficient of variation (the ratio of standard deviation to mean)

    3) Which store(s) had a good quarterly growth rate in Q3 2012

    4) Some holidays have a negative impact on sales. Find out which holiday weeks have higher sales than the mean sales of the non-holiday season for all stores together

    5) Provide a monthly and semester view of sales in units and give insights

    Statistical Model

    For Store 1 – Build prediction models to forecast demand

    Linear Regression – Utilize variables like date, restructured as a running index: 1 for 5 Feb 2010, counting upward from the earliest date in order. Hypothesize whether CPI, unemployment, and fuel price have any impact on sales.

    Change dates into day indices by creating a new variable.

    Select the model which gives the best accuracy.
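    A rough sketch of that setup, assuming a Walmart_Store_sales.csv file with the fields listed above and day-first date strings:

    ```python
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("Walmart_Store_sales.csv")  # assumed file name
    store1 = df[df["Store"] == 1].copy()

    # Restructure dates as a running day index: 1 for 5 Feb 2010, counting up.
    store1["Date"] = pd.to_datetime(store1["Date"], dayfirst=True)
    store1 = store1.sort_values("Date")
    store1["Days"] = (store1["Date"] - store1["Date"].min()).dt.days + 1

    X = store1[["Days", "CPI", "Unemployment", "Fuel_Price", "Holiday_Flag"]]
    y = store1["Weekly_Sales"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42)
    model = LinearRegression().fit(X_train, y_train)
    print("Test R^2:", r2_score(y_test, model.predict(X_test)))
    ```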

    --- Original source retains full ownership of the source dataset ---

  13. Determinants of diabetes knowledge (raw scores) in multivariable linear regression models.

    • figshare.com
    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    Eva K. Fenwick; Jing Xie; Gwyn Rees; Robert P. Finger; Ecosse L. Lamoureux (2023). Determinants of diabetes knowledge (raw scores) in multivariable linear regression models. [Dataset]. http://doi.org/10.1371/journal.pone.0080593.t004
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Eva K. Fenwick; Jing Xie; Gwyn Rees; Robert P. Finger; Ecosse L. Lamoureux
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    CI = confidence interval; bolded values indicate significant results. NDSS = National Diabetes Service Scheme; SD = standard deviation. *Represents variables substantially different from the analyses using raw scores (Table 4). Model 1 had the smallest Bayesian information criterion (BIC). Model 2 had the smallest bias-corrected version of the AIC. Model 3 had the largest adjusted proportion of variation “explained” by the regression model.

  14. ‘Red and White Wine Quality Analysis’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Red and White Wine Quality Analysis’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-red-and-white-wine-quality-analysis-0938/d129fe93/?iid=005-791&v=presentation
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Red and White Wine Quality Analysis’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/saigeethac/red-and-white-wine-quality-datasets on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Wine Quality Data Set

    This data set is available in UCI at https://archive.ics.uci.edu/ml/datasets/Wine+Quality.

    Abstract: Two datasets are included, related to red and white vinho verde wine samples, from the north of Portugal. The goal is to model wine quality based on physicochemical tests.

    Data Set Information:

    The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

    These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant. So it could be interesting to test feature selection methods.

    Attribute Information:

    Input variables (based on physicochemical tests):

    1. fixed acidity
    2. volatile acidity
    3. citric acid
    4. residual sugar
    5. chlorides
    6. free sulfur dioxide
    7. total sulfur dioxide
    8. density
    9. pH
    10. sulphates
    11. alcohol

    Output variable (based on sensory data):

    1. quality (score between 0 and 10)

    These columns have been described in the Kaggle Data Explorer.

    Context

    The authors state "we are not sure if all input variables are relevant. So it could be interesting to test feature selection methods." We briefly explored this aspect and found that red wine quality prediction on the test and training datasets is almost the same (~88%) with just three features. Likewise, white wine quality prediction appears to depend on just one feature. This may be due to the privacy and logistics issues mentioned by the dataset authors.
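    One quick way to reproduce that kind of check is a univariate feature-selection pass; a sketch with scikit-learn, again assuming the semicolon-separated UCI CSV:

    ```python
    import pandas as pd
    from sklearn.feature_selection import SelectKBest, f_regression

    df = pd.read_csv("winequality-red.csv", sep=";")  # assumed UCI file layout
    X, y = df.drop(columns=["quality"]), df["quality"]

    # Score each physicochemical input against quality; keep the top three.
    selector = SelectKBest(score_func=f_regression, k=3).fit(X, y)
    print("Top features:", list(X.columns[selector.get_support()]))
    ```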

    Content

    Two datasets are included, related to red and white vinho verde wine samples, from the north of Portugal. Both these datasets are analyzed and linear regression models are developed in Python 3. The github link provided for the source code also includes a Flask web application for deployment on the local machine or on Heroku.

    Acknowledgements

    Datasets: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

    Banner Image: Photo by Roberta Sorge on Unsplash

    Github Link

    Complete code has been uploaded onto github at https://github.com/saigeethachandrashekar/wine_quality.

    Please clone the repo; it contains both datasets and the code required for building and saving the model on your local system. Code for a Flask app is provided for deploying the models on your local machine. The app can also be deployed on Heroku; the requirements.txt and Procfile are provided for this.

    Next Steps

    1. White wine quality prediction appears to depend on just one feature. This may be due to the privacy and logistics issues mentioned by the dataset authors (e.g. there is no data about grape types, wine brand, wine selling price, etc.) or it may be due to other factors that are not clear. This is an area that might be worth exploring further.

    2. Other ML techniques may be applied to improve the accuracy.

    --- Original source retains full ownership of the source dataset ---

  15. Replication data for: Linear Models with Outliers: Choosing between Conditional-Mean and Conditional-Median Methods

    • datasearch.gesis.org
    • dataverse.harvard.edu
    Updated Jan 22, 2020
    Cite
    Harden, Jeffrey; Desmarais, Bruce (2020). Replication data for: Linear Models with Outliers: Choosing between Conditional-Mean and Conditional-Median Methods [Dataset]. https://datasearch.gesis.org/dataset/httpsdataverse.unc.eduoai--hdl1902.2911608
    Dataset updated
    Jan 22, 2020
    Dataset provided by
    Odum Institute Dataverse Network
    Authors
    Harden, Jeffrey; Desmarais, Bruce
    Description

    State politics researchers commonly employ ordinary least squares (OLS) regression or one of its variants to test linear hypotheses. However, OLS is easily influenced by outliers and thus can produce misleading results when the error term distribution has heavy tails. Here we demonstrate that median regression (MR), an alternative to OLS that conditions the median of the dependent variable (rather than the mean) on the independent variables, can be a solution to this problem. Then we propose and validate a hypothesis test that applied researchers can use to select between OLS and MR in a given sample of data. Finally, we present two examples from state politics research in which (1) the test selects MR over OLS and (2) differences in results between the two methods could lead to different substantive inferences. We conclude that MR and the test we propose can improve linear models in state politics research.

  16. ‘California Housing Data (1990)’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Nov 12, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘California Housing Data (1990)’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-california-housing-data-1990-a0c5/b7389540/?iid=007-628&v=presentation
    Dataset updated
    Nov 12, 2021
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    California
    Description

    Analysis of ‘California Housing Data (1990)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/harrywang/housing on 12 November 2021.

    --- Dataset description provided by original source is as follows ---

    Source

    This is the dataset used in this book: https://github.com/ageron/handson-ml/tree/master/datasets/housing to illustrate a sample end-to-end ML project workflow (pipeline). This is a great book - I highly recommend it!

    The data is based on California Census in 1990.

    About the Data (from the book):

    "This dataset is a modified version of the California Housing dataset available from Luís Torgo's page (University of Porto). Luís Torgo obtained it from the StatLib repository (which is closed now). The dataset may also be downloaded from StatLib mirrors.

    The following is the description from the book author:

    This dataset appeared in a 1997 paper titled Sparse Spatial Autoregressions by Pace, R. Kelley and Ronald Barry, published in the Statistics and Probability Letters journal. They built it using the 1990 California census data. It contains one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).

    The dataset in this directory is almost identical to the original, with two differences: 207 values were randomly removed from the total_bedrooms column, so we can discuss what to do with missing data. An additional categorical attribute called ocean_proximity was added, indicating (very roughly) whether each block group is near the ocean, near the Bay area, inland or on an island. This allows discussing what to do with categorical data. Note that the block groups are called "districts" in the Jupyter notebooks, simply because in some contexts the name "block group" was confusing."

    About the Data (From Luís Torgo page):

    http://www.dcc.fc.up.pt/%7Eltorgo/Regression/cal_housing.html

    This is a dataset obtained from the StatLib repository. Here is the included description:

    "We collected information on the variables using all the block groups in California from the 1990 Cens us. In this sample a block group on average includes 1425.5 individuals living in a geographically co mpact area. Naturally, the geographical area included varies inversely with the population density. W e computed distances among the centroids of each block group as measured in latitude and longitude. W e excluded all the block groups reporting zero entries for the independent and dependent variables. T he final data contained 20,640 observations on 9 variables. The dependent variable is ln(median house value)."

    End-to-End ML Project Steps (Chapter 2 of the book)

    1. Look at the big picture
    2. Get the data
    3. Discover and visualize the data to gain insights
    4. Prepare the data for Machine Learning algorithms
    5. Select a model and train it
    6. Fine-tune your model
    7. Present your solution
    8. Launch, monitor, and maintain your system

    The 10-Step Machine Learning Project Workflow (My Version)

    1. Define business objective
    2. Make sense of the data from a high level
      • data types (number, text, object, etc.)
      • continuous/discrete
      • basic stats (min, max, std, median, etc.) using boxplot
      • frequency via histogram
      • scales and distributions of different features
    3. Create the training and test sets using proper sampling methods, e.g., random vs. stratified (see the sketch after this list)
    4. Correlation analysis (pair-wise and attribute combinations)
    5. Data cleaning (missing data, outliers, data errors)
    6. Data transformation via pipelines (categorical text to number using one hot encoding, feature scaling via normalization/standardization, feature combinations)
    7. Train and cross validate different models and select the most promising one (Linear Regression, Decision Tree, and Random Forest were tried in this tutorial)
    8. Fine-tune the model by trying different combinations of hyperparameters
    9. Evaluate the model with the best estimator on the test set
    10. Launch, monitor, and refresh the model and system
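    For step 3, a minimal sketch of a stratified split on a binned income attribute (the book uses StratifiedShuffleSplit; train_test_split with stratify= is an equivalent shortcut, and the bin edges follow the book's example):

    ```python
    import pandas as pd
    from sklearn.model_selection import train_test_split

    housing = pd.read_csv("housing.csv")  # file from the book's datasets/housing directory

    # Bin median_income so the split preserves the income distribution.
    housing["income_cat"] = pd.cut(housing["median_income"],
                                   bins=[0.0, 1.5, 3.0, 4.5, 6.0, float("inf")],
                                   labels=[1, 2, 3, 4, 5])

    train_set, test_set = train_test_split(
        housing, test_size=0.2, stratify=housing["income_cat"], random_state=42)

    # Drop the helper column once the split is done.
    for split in (train_set, test_set):
        split.drop(columns="income_cat", inplace=True)
    ```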

    --- Original source retains full ownership of the source dataset ---

  17. Questionnaire data on land use change of Industrial Heritage: Insights from Decision-Makers in Shiraz, Iran

    • data.mendeley.com
    Updated Jul 20, 2023
    Cite
    Arsalan Karimi (2023). Questionnaire data on land use change of Industrial Heritage: Insights from Decision-Makers in Shiraz, Iran [Dataset]. http://doi.org/10.17632/gk3z8gp7cp.2
    Dataset updated
    Jul 20, 2023
    Authors
    Arsalan Karimi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Iran, Shiraz
    Description

    The survey dataset for identifying a new use for the old Shiraz silo includes four components:

    1. The survey instrument used to collect the data, "SurveyInstrument_table.pdf". The survey instrument contains 18 main closed-ended questions in a table format. Two of these concern information on the silo's decision-makers and the proposed new use, following a short introduction to the questionnaire; the other 16 (each identifying 3 variables) address the level of appropriate opinions for ideal intervention in the facade, openings, materials, and floor heights of the building across four values: feasibility, reversibility, compatibility, and social benefits.

    2. The raw survey data, "SurveyData.rar". This file contains an Excel .xlsx and a SPSS .sav file. The survey data file contains 50 variables (12 for each of the four values, separated by colour) and data from each of the 632 respondents. Answering each question in the survey was mandatory, therefore there are no blanks or non-responses in the dataset. In the .sav file, all variables were assigned numeric type and nominal measurement level. More details about each variable can be found in the Variable View tab of this file. Additional variables were created by grouping or consolidating categories within each survey question for simpler analysis. These variables are listed in the last columns of the .xlsx file.

    3. The analysed survey data, "AnalysedData.rar". This file contains 6 SPSS Statistics output documents which demonstrate statistical tests and analyses such as mean, correlation, automatic linear regression, reliability, frequencies, and descriptives.

    4. The codebook, "Codebook.rar". The detailed SPSS "Codebook.pdf", alongside the simplified codebook "VariableInformation_table.pdf", provides a comprehensive guide to all 50 variables in the survey data, including numerical codes for survey questions and response options. They serve as valuable resources for understanding the dataset, presenting dictionary information, and providing descriptive statistics, such as counts and percentages for categorical variables.

  18. Data from: Predicting spatial-temporal patterns of diet quality and large herbivore performance using satellite time series

    • agdatacommons.nal.usda.gov
    docx
    Updated Feb 16, 2024
    Cite
    Sean Kearney; Lauren M. Porensky; David J. Augustine; Justin D. Derner; Feng Gao (2024). Data from: Predicting spatial-temporal patterns of diet quality and large herbivore performance using satellite time series [Dataset]. http://doi.org/10.15482/USDA.ADC/1522609
    Available download formats: docx
    Dataset updated
    Feb 16, 2024
    Dataset provided by
    Ag Data Commons
    Authors
    Sean Kearney; Lauren M. Porensky; David J. Augustine; Justin D. Derner; Feng Gao
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Analysis-ready tabular data from "Predicting spatial-temporal patterns of diet quality and large herbivore performance using satellite time series" in Ecological Applications, Kearney et al., 2021. Data is tabular data only, summarized to the pasture scale. Weight gain data for individual cattle and the STARFM-derived Landsat-MODIS fusion imagery can be made available upon request.

    Resources in this dataset:

    Resource Title: Metadata - CSV column names, units and descriptions. File Name: Kearney_et_al_ECOLAPPL_Patterns of herbivore - metada.docx. Resource Description: Column names, units and descriptions for all CSV files in this dataset.

    Resource Title: Fecal quality data. File Name: Kearney_etal2021_Patterns_of_herbivore_Data_FQ_cln.csv. Resource Description: Field-sampled fecal quality (CP = crude protein; DOM = digestible organic matter) data and phenology-related APAR metrics derived from 30 m daily Landsat-MODIS fusion satellite imagery. All data are paddock-scale averages; the paddock is the spatial scale of replication and the week is the temporal scale of replication. Fecal samples were collected by USDA-ARS staff from 3-5 animals per paddock (10% - 25% of animals in each herd) weekly during each grazing season from 2014 to 2019 across 10 different paddocks at the Central Plains Experimental Range (CPER) near Nunn, CO. Samples were analyzed at the Grazingland Animal Nutrition Lab (GANlab, https://cnrit.tamu.edu/index.php/ganlab/) using near infrared spectroscopy (see Lyons & Stuth, 1992; Lyons, Stuth, & Angerer, 1995). Not every herd was sampled every week or every year, resulting in a total of 199 samples. Samples represent all available data at the CPER during the study period and were collected for different research and adaptive management objectives, but following the basic protocol described above. APAR metrics were derived from the paddock-scale APAR daily time series (all paddock pixels averaged daily to create a single paddock-scale time series). All APAR metrics are calculated for the week in which the fecal quality samples were collected in the field. See Section 2.2.4 of the corresponding manuscript for a complete description of the APAR metrics.

    Resource Title: Monthly ADG. File Name: Kearney_etal2021_Patterns_of_herbivore_Data_ADG_monthly_cln.csv. Resource Description: Monthly average daily gain (ADG) of cattle weights at the paddock scale and the three satellite-derived metrics used to build the regression model to predict ADG: crude protein (CP), digestible organic matter (DOM) and aboveground net herbaceous production (ANHP). The data table also includes stocking rate (animal units per hectare), used as an interaction term in the ADG regression model, and all associated data needed to derive each of these variables (e.g., sampling start and end dates, 30 m daily Landsat-MODIS fusion satellite imagery-derived APAR metrics, cattle weights, etc.). We calculated paddock-scale average daily gain (ADG, kg hd-1 day-1) from 2000-2019 for yearlings weighed approximately every 28 days during the grazing season across 6 different paddocks with stocking densities of 0.08 - 0.27 animal units (AU) ha-1, where one AU is equivalent to a 454 kg animal. It is worth noting that AUs change as a function of both the number of cattle within a paddock and the size of individual animals, the latter of which changes within a single grazing season. This becomes important to consider when using sub-seasonal weight data for fast-growing yearlings. For paddock-scale ADG, we first calculated ADG for each individual yearling as the difference between the weights obtained at the end and beginning of each period, divided by the number of days in each period, and then averaged for all individuals in the paddock. We excluded data from 2013 due to data collection inconsistencies. We note that most of the monthly weight data (97%) is from 3 paddocks where cattle were weighed every year, whereas in the other 3 paddocks, monthly weights were only measured during 2017-2019. Apart from the 2013 data, which were not comparable to data from other years, the data represent all available weight gain data for CPER, to maximize spatial-temporal coverage and avoid potential bias from subjective decisions to subset the data. Data may have been collected for different projects at different times, but were collected in a consistent way. This resulted in 269 paddock-scale estimates of monthly ADG, with robust temporal, but limited spatial, coverage. CP and DOM were estimated from a random forest model trained on the five APAR metrics: rAPAR, dAPAR, tPeak, iAPAR and iAPAR-dry (see manuscript Section 2.3 for a description). APAR metrics were derived from the paddock-scale APAR daily time series (all paddock pixels averaged daily to create a single paddock-scale time series). All APAR metrics are calculated as the average of the approximately 28-day period that corresponds to the ADG calculation. See Section 2.2.4 of the manuscript for a complete description of the APAR metrics. ANHP was estimated from a linear regression model developed by Gaffney et al. (2018) to calculate net aboveground herbaceous productivity (ANHP; kg ha-1) from iAPAR. We averaged the coefficients of 4 spatial models (2013-2016) developed by Gaffney et al. (2018), resulting in the following equation: ANHP = -26.47 + 2.07(iAPAR). We first calculated ANHP for each day of the grazing season at the paddock scale, and then took the average ANHP for the 28-day period.

    Resource Title: Season-long ADG. File Name: Kearney_etal2021_Patterns_of_herbivore_Data_ADG_seasonal_cln.csv. Resource Description: Season-long observed and model-predicted average daily gain (ADG) of cattle weights at the paddock scale. Also includes two variables used to analyze patterns in model residuals: percent sand content and season-long aboveground net herbaceous production (ANHP). We calculated observed paddock-scale ADG for the entire grazing season from 2010-2019 (excluding 2013 due to data collection inconsistencies) by averaging the seasonal ADG of each yearling, determined as the difference between the end and starting weights divided by the number of days in the grazing season. This dataset was available for 40 paddocks spanning a range of soil types, plant communities, and topographic positions. Data may have been collected for different projects at different times, but were collected in a consistent way. We note that there was spatial overlap among a small number of paddock boundaries across different years, since some fence lines were moved in 2012 and 2014. Model-predicted paddock-scale ADG was derived using the monthly ADG regression model described in Sections 2.3.3 and 2.3.4 of the associated manuscript. In short, we predicted season-long cattle weight gains by first predicting daily weight gain for each day of the grazing season from the monthly regression model using a 28-day moving average of model inputs (CP, DOM and ANHP). We calculated the final ADG for the entire grazing season as the average predicted ADG, starting 28 days into the growing season. Percent sand content was obtained as the paddock-scale average of POLARIS sand content in the upper 0-30 cm. ANHP was calculated on the last day of the grazing season using the linear regression model developed by Gaffney et al. (2018) to calculate net aboveground herbaceous productivity (ANHP; kg ha-1) from satellite-derived integrated absorbed photosynthetically active radiation (iAPAR) (see Section 3.1.2 of the associated manuscript). We averaged the coefficients of 4 spatial models (2013-2016) developed by Gaffney et al. (2018), resulting in the following equation: ANHP = -26.47 + 2.07(iAPAR).

    REFERENCES: Gaffney, R., Porensky, L. M., Gao, F., Irisarri, J. G., Durante, M., Derner, J. D., & Augustine, D. J. (2018). Using APAR to predict aboveground plant productivity in semi-arid rangelands: Spatial and temporal relationships differ. Remote Sensing, 10(9). doi: 10.3390/rs10091474
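    As a small worked example of the stated equation, ANHP can be computed from a daily iAPAR series and averaged over a 28-day window (file and column names here are hypothetical):

    ```python
    import pandas as pd

    # Daily paddock-scale iAPAR time series; hypothetical file and column names.
    apar = pd.read_csv("paddock_iapar_daily.csv", parse_dates=["date"])

    # ANHP (kg/ha) from iAPAR via the averaged Gaffney et al. (2018) coefficients.
    apar["anhp"] = -26.47 + 2.07 * apar["iapar"]

    # 28-day mean ANHP, matching the weigh-period averaging described above.
    apar["anhp_28d"] = apar["anhp"].rolling(window=28).mean()
    print(apar.tail())
    ```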

  19. Data from: The relationship between learning orientation, firm performance and market dynamism in MSMEs operating in technology parks in Poland: an empirical analysis

    • repod.icm.edu.pl
    ods, odt
    Updated Feb 3, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Karpacz, Jarosław; Wójcik-Karpacz, Anna (2023). The relationship between learning orientation, firm performance and market dynamism in MSMEs operating in technology parks in Poland: an empirical analysis [Dataset]. http://doi.org/10.18150/IOUHRH
    Available download formats: ods (8,051 bytes), odt (7,204 bytes), ods (7,973 bytes)
    Dataset updated
    Feb 3, 2023
    Dataset provided by
    RepOD
    Authors
    Karpacz, Jarosław; Wójcik-Karpacz, Anna
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    In this study, we investigate the (in)direct relationship between learning orientation and firm performance. The study is guided by the DCs framework. We collected data from 182 MSMEs operating in TPs in Poland. We used two methods (PAPI, CAWI) in our quantitative empirical research.For the analysis of empirical data, we used the methods of description and statistical inference. The values obtained by means of Cronbach’s alpha values showed very good reliability of questionnaire. We have assumed that the coefficient deciding whether a tool is reliable should be at least 0.70. The results of the Kolmogorov-Smirnov tests indicate grounds for assuming that the variables are not normally distributed. We present the results of the Kolmogorov–Smirnov tests and Cronbach’s alpha coefficients in Table 1.In the next step, we applied the correlation analysis between the variables by using the rho-Spearman coefficient. We present the results of correlations among the analysed variables in Table 2. The analysis of data included in Table 2 indicated weak or very weak correlations among the variables in individual configurations. LO positively correlates with FP (rs = 0.197; p < 0.01). This means that the increase in LO is accompanied, on average, by a small increase in FP.There is also a positive, although very weak (rs = 0.151) correlation between MD and LO, which was statistically significant (p < 0.05). This means that the increase in MD is accompanied by, on average, a slight increase in LO.At the same time, the results of the correlation analysis indicated a weak but positive correlation between one of the dimensions of MD, i.e. speed of change in technology and competition and LO (rs = 0.0.236; p < 0.01). Relationships between the two remaining dimensions of MD were not statistically significant.In addition, the aforementioned MD dimension also positively correlates with FP. The correlation between the MD dimension called speed of change in technology and competition and FP is positive, weak and statistically significant (rs = 0.181; p < 0.05). This means that the increase in the speed of change in technology and competition is accompanied by, on average, a slight increase in FP. Relationships between the two remaining dimensions of MD and FP were not statistically significant.Correlation analysis encourages deeper recognition and understanding of LO-FP relationship in the context of MD. We used linear regression models in order to verify the hypotheses, which allowed for a global assessment of relationships among all analysed variables.The values of coefficients obtained for permanent effects in this model inform about how much the expected value of explanatory variable changes along with the unitary growth of a given predictor. The explanatory variable (predictor) is a variable in a statistical model (as well as in an econometric model) on the basis of which the response variable is calculated. In Model 1 there is one explanatory variable (LO); while in Model 2 there are two explanatory variables (LO, MD). The response variable is FP. The statistical significance of these coefficients was verified by a test based on the t statistics. 
For all the mentioned tests, p<0.05 indicated the statistical significance of the analysed relationships.The assessment of the impact of LO on FP is dictated by the H.1 hypothesis verification.While the assessment of the impact of dynamism of the market in which enterprises operate in explaining the impact of LO on FP is dictated by the H.2 hypothesis verification.H.1: Learning orientation is positively related to firm performance.H.2: Market dynamism moderates the learning orientation-firm performance relationship; the positive effect of learning orientation on firm performance is likely to be stronger under high market dynamism than under low market dynamism.The results of testing the H1 and H2 hypotheses are presented in Table 3.We estimated Models 1 and 2 in Table 3 by using the Akaike Information Criteria (AIC). The AIC for both models was similar, i.e. 568.28 for the first model and 571.12 for the second one. AIC levels for both models indicated acceptable matching levels. The lower the AIC value, the better the predictive values of the model. The model coefficient is a parameter determined by its most likely value. The confidence interval of the model coefficient indicates in which range its less probable but possible values may be. It also has a diagnostic value. If the value of the regression coefficient contains “0”, the coefficient has no substantive value for the model. Model 1 explained 13.5% of the data variation (R2 = 0.135), while Model 2 explained 14.0% of the data variation (R2 = 0.140), which is slightly more than Model 1. The analysis of the models presented in Table 3 leads to several findings. In the first model, only LO was positively related to FP and only slightly explained the variability of the dependent variable. It has a small but statistically significant impact on FP (coefficient: 0.38; p=0.00). The linear regression model (Model 1) confirms the thesis about the positive impact of LO on FP. It may be assumed that an increase in the assessment of LO by one point, with no change in the other parameters of the model, would result in an increase in average FP by 0.38. This model explains 13.5% of the data variability (R2 = 0.135). Secondly, the linear regression model (Model 2) did not confirm the thesis about the moderating role of MD on the LO-FP relationship. None of the predictors showed statistical significance (p<0.05) in Model 2. What is more, taking the MD variable into account affects the quality of the model, and MD itself adopts negative prediction indicators, which means that better FP in responding to changes in the level of MD deteriorates the overall FP. However, the research has not confirmed whether MD - a higher-order construct built of three first-order constructs, i.e. the speed of changes in technology and competition, unpredictability of changes in technology and competition, uncertainty of customer behaviour - increases the importance of LO for increasing FP, and thus achieving a competitive advantage. Thirdly, the control variables were insignificant in both models. This means that the control variables in the form of enterprise size do not have a statistically significant effect on the dependent variable. Therefore, the introduction of two control variables and a moderating variable reduced the impact of LO on FP to a statistically insignificant level.The results of the study show that firm performance benefits from LO-related behaviours. 
The results of the study show that firm performance benefits from LO-related behaviours. Learning orientation is an important stimulant of firm performance, while market dynamism was not confirmed as a moderator of the learning orientation-firm performance relationship.

  20. Z

    Uncertainty-aware Machine Learning Bias Correction and Filtering for OCO-2 |...

    • data.niaid.nih.gov
    Updated Mar 26, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    KEELY, WILLIAM (2025). Uncertainty-aware Machine Learning Bias Correction and Filtering for OCO-2 | 2014-2024 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_15085178
    Explore at:
    Dataset updated
    Mar 26, 2025
    Dataset provided by
    KEELY, WILLIAM
    Mauceri, Steffen
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

This dataset is intended for exploring the effect of applying a new bias correction and quality filtering approach designed to increase the accuracy of atmospheric CO2 measurements derived from the Orbiting Carbon Observatory-2 (OCO-2) satellite. This is not an official OCO-2 data product.

    Data Description

The dataset contains OCO-2 B11.2 XCO2 retrievals that have been corrected and filtered with a new machine learning approach, covering 2014 through the end of 2024, with one file per year. The following variables are contained in the files:

    sounding_id, xco2_ML*, xco2, xco2_x2019, xco2_quality_flag_ML*, xco2_quality_flag, bias_correction_uncert_ML*, xco2_uncertainty, latitude, longitude, time, land_water_indicator, operation_mode

• Variables marked with an asterisk (*) are new: they contain the new bias-corrected xco2, quality flag, and uncertainty, and are not contained in the official OCO-2 Lite Files.

xco2_ML: machine-learning bias-corrected XCO2 on the x2019 scale

    xco2_quality_flag_ML: XCO2 ternary quality flag: 0 = best quality data, 1 = good quality data for increasing sounding throughput if needed, 2 = poor quality data (do not use)

    bias_correction_uncert_ML: XCO2 bias correction uncertainty

For the full set of variables contained in the Lite Files and a description of each variable, please refer to the data user guide of the official OCO-2 Lite Files: https://disc.gsfc.nasa.gov/datasets/OCO2_L2_Lite_FP_11.2r/summary?keywords=oco2
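For orientation, below is a minimal sketch of how one might load a yearly file and keep only the best-quality soundings. The file name and netCDF layout are assumptions, not guaranteed by this archive; the variable names match the list above.

```python
import xarray as xr

# Hypothetical file name; the archive holds one file per year.
ds = xr.open_dataset("oco2_ML_corrected_2020.nc")

# Keep soundings with the best-quality ML flag (0); relax to <= 1 if
# higher sounding throughput is needed, per the flag definition above.
best = ds.where(ds["xco2_quality_flag_ML"] == 0, drop=True)

print(best["xco2_ML"].mean().item())  # bias-corrected XCO2, x2019 scale
print(best["bias_correction_uncert_ML"].mean().item())
```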

    Current Bias Correction Approach

The operational bias correction method uses a multiple-linear-regression-like approach to adjust errors in XCO2 relative to elements of the state vector derived from the ACOS retrieval. Adjustments are currently made manually, guided by success metrics such as agreement with TCCON observations, reduction of retrieval variability, and coherence with flux models.
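To make the regression-style correction concrete, here is a minimal synthetic sketch. The features are hypothetical stand-ins for state-vector elements and the truth proxy is simulated, so this illustrates the concept only, not the operational ACOS procedure.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 1000

# Hypothetical stand-ins for state-vector elements (e.g. dP, albedo, aerosol).
features = rng.normal(size=(n, 3))

# Simulated truth proxy and a retrieval error that depends on the features.
xco2_truth = 410.0 + rng.normal(0.0, 1.0, n)
error = features @ np.array([0.4, -0.2, 0.1]) + rng.normal(0.0, 0.1, n)
xco2_raw = xco2_truth + error

# Regress the apparent error on the features, then subtract the prediction.
model = LinearRegression().fit(features, xco2_raw - xco2_truth)
xco2_corrected = xco2_raw - model.predict(features)

print("error spread before:", np.std(xco2_raw - xco2_truth))
print("error spread after: ", np.std(xco2_corrected - xco2_truth))
```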

    New Bias Correction Approach

1. Computational Optimization: Replaces manual tuning with computational methods, enhancing transparency, traceability, and reproducibility.
2. Non-linear Error Modeling: Allows more flexibility in modeling retrieval errors, reducing biases, particularly in previously unusable data.
3. Independent Flux Inversion Models: Excludes flux inversion models from bias correction development, maintaining OCO-2 measurement independence.
4. Quantified Correction Uncertainties: Provides uncertainty quantification for each bias correction at the per-sounding level (points 2 and 4 are illustrated in the sketch below).
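As a hedged illustration of points 2 and 4, the sketch below pairs a non-linear error model with a per-sounding uncertainty via quantile gradient boosting. This is a generic stand-in technique, not the authors' actual method, and all inputs are synthetic.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))  # stand-in state-vector features
err = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(0.0, 0.1, 1000)

# Fit the median error (the correction) and two outer quantiles.
q50 = GradientBoostingRegressor(loss="quantile", alpha=0.50).fit(X, err)
q16 = GradientBoostingRegressor(loss="quantile", alpha=0.16).fit(X, err)
q84 = GradientBoostingRegressor(loss="quantile", alpha=0.84).fit(X, err)

correction = q50.predict(X)
# Half the central 68% interval as a rough 1-sigma per-sounding uncertainty.
uncert = 0.5 * (q84.predict(X) - q16.predict(X))
print(correction[:3], uncert[:3])
```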

    Data Usage

If you find anything unexpected in the data, please report your findings to Steffen.mauceri@jpl.nasa.gov and william.r.keely@jpl.nasa.gov so we can resolve any issues.

    Additional Resources and Citation

    Two preprints are currently available that describe the approach in detail and should be cited if the data is used. We will update the citations as soon as the papers are published:

https://doi.org/10.22541/essoar.174164198.80749970/v1
https://doi.org/10.22541/essoar.174164203.37422284/v1

    Copyright statement: © 2023 California Institute of Technology. Government sponsorship acknowledged.
