Site-specific multiple linear regression models were developed for eight sites in Ohio (six in the Western Lake Erie Basin and two in northeast Ohio on inland reservoirs) to quickly predict action-level exceedances for the cyanotoxin microcystin in recreational and drinking waters used by the public. Real-time models include easily or continuously measured factors that do not require that a sample be collected. Real-time models are presented in two categories: (1) six models with continuous monitor data, and (2) three models with on-site measurements. Real-time models commonly included variables such as phycocyanin, pH, specific conductance, and streamflow or gage height. Many of the real-time factors were averages over time periods antecedent to the time the microcystin sample was collected, including water-quality data compiled from continuous monitors. Comprehensive models use a combination of discrete sample-based measurements and real-time factors. Comprehensive models were useful at some sites with lagged variables (< 2 weeks) for cyanobacterial toxin genes, dissolved nutrients, and (or) N to P ratios. Comprehensive models are presented in three categories: (1) three models with continuous monitor data and lagged comprehensive variables, (2) five models with no continuous monitor data and lagged comprehensive variables, and (3) one model with continuous monitor data and same-day comprehensive variables. Funding for this work was provided by the Ohio Water Development Authority and the U.S. Geological Survey Cooperative Water Program.
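To make the model structure concrete, here is a hedged, generic sketch (not the published USGS models) of a real-time model: antecedent averages of continuous-monitor variables feeding a multiple linear regression. All file names, column names, and the action-level threshold below are placeholders.

```python
# Generic sketch: average continuous-monitor variables over an antecedent
# window, then fit a multiple linear regression for microcystin.
import pandas as pd
from sklearn.linear_model import LinearRegression

monitor = pd.read_csv("continuous_monitor.csv", parse_dates=["timestamp"])   # hypothetical file
samples = pd.read_csv("microcystin_samples.csv", parse_dates=["sample_time"])  # hypothetical file

def antecedent_mean(ts, hours=24):
    """24-hour antecedent averages of real-time factors before a sample time."""
    window = monitor[(monitor["timestamp"] > ts - pd.Timedelta(hours=hours))
                     & (monitor["timestamp"] <= ts)]
    return window[["phycocyanin", "pH", "specific_conductance"]].mean()

X = samples["sample_time"].apply(antecedent_mean)
y = samples["microcystin"]

model = LinearRegression().fit(X, y)
exceeds = model.predict(X) > 8.0  # placeholder threshold, not a site's actual action level
```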
This data set contains example data for exploring the theory of regression-based regionalization. The 90th percentile of annual maximum streamflow is provided as an example response variable for 293 streamgages in the conterminous United States. Several explanatory variables are drawn from the GAGES-II database to demonstrate how multiple linear regression is applied. Example scripts demonstrate how to collect the original streamflow data and how to recreate the figures from the associated Techniques and Methods chapter.
https://cubig.ai/store/terms-of-service
1) Data Introduction • The Student Performance (Multiple Linear Regression) Dataset is designed to analyze the relationship between students’ learning habits and academic performance. Each sample includes key indicators related to learning, such as study hours, sleep duration, previous test scores, and the number of practice exams completed.
2) Data Utilization (1) Characteristics of the Student Performance (Multiple Linear Regression) Dataset: • The target variable, Hours Studied, quantitatively represents the amount of time a student has invested in studying. The dataset is structured to allow modeling and inference of learning behaviors based on correlations with other variables.
(2) Applications of the Student Performance (Multiple Linear Regression) Dataset: • AI-Based Study Time Prediction Models: The dataset can be used to develop regression models that estimate a student’s expected study time based on inputs like academic performance, sleep habits, and engagement patterns. • Behavioral Analysis and Personalized Learning Strategies: It can be applied to identify students with insufficient study time and design personalized study interventions based on academic and lifestyle patterns.
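A minimal sketch of such a regression model, with file and column names assumed from the description above:

```python
# Hedged sketch: multiple linear regression for the stated target, Hours Studied.
# The file name and column names are assumptions based on the description.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("student_performance.csv")  # hypothetical file name
X = df[["Sleep Hours", "Previous Scores", "Sample Question Papers Practiced"]]
y = df["Hours Studied"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))  # R-squared on held-out data
```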
A bike-sharing system is a service in which bikes are made available for shared use to individuals on a short-term basis, for a price or for free. Many bike-share systems allow people to borrow a bike from a computer-controlled "dock": the user enters payment information, and the system unlocks a bike. The bike can then be returned to another dock belonging to the same system.
The US bike-sharing provider BoomBikes has recently suffered a considerable dip in revenue due to the corona pandemic. The company is finding it very difficult to sustain itself in the current market scenario, so it has decided to come up with a mindful business plan to accelerate its revenue.
In such an attempt, BoomBikes aspires to understand the demand for shared bikes among people. It plans to prepare itself to cater to people's needs once the situation improves, stand out from other service providers, and make sizable profits.
It has contracted a consulting company to understand the factors on which the demand for these shared bikes depends, specifically in the American market. The company wants to know:
Based on various meteorological surveys and people's lifestyles, the service-provider firm has gathered a large dataset of daily bike demand across the American market, along with several associated factors.
You are required to model the demand for shared bikes with the available independent variables. Management will use the model to understand how exactly demand varies with different features, adjust the business strategy accordingly to meet demand levels and customer expectations, and gauge the demand dynamics of a new market.
In the dataset provided, you will notice three columns named 'casual', 'registered', and 'cnt'. The variable 'casual' indicates the number of casual users who have made a rental. The variable 'registered', on the other hand, shows the total number of registered users who have made a booking on a given day. Finally, the 'cnt' variable indicates the total number of bike rentals, including both casual and registered. The model should be built taking 'cnt' as the target variable.
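As a starting point for model building, a minimal hedged sketch (the file name and the restriction to numeric columns are assumptions; 'casual' and 'registered' are dropped because they sum to 'cnt' and would leak the target):

```python
# Hedged sketch: build the model with 'cnt' as the target.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("day.csv")  # hypothetical file name
y = df["cnt"]
# Drop the target and its two components; keep numeric columns only
# (categorical encoding is omitted here for brevity).
X = df.drop(columns=["cnt", "casual", "registered"]).select_dtypes("number")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=100)
lm = LinearRegression().fit(X_train, y_train)
y_pred = lm.predict(X_test)  # used in the evaluation snippet below
```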
When you're done with model building and residual analysis and have made predictions on the test set, just make sure you use the following two lines of code to calculate the R-squared score on the test set.
```python
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)
```
- where y_test is the test-set values of the target variable, and y_pred is the variable containing the predicted values of the target variable on the test set.
- Please perform this step, as the R-squared score on the test set serves as the benchmark for your model.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘homeprices-multiple-variables’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/pankeshpatel/homepricesmultiplevariables on 14 February 2022.
--- Dataset description provided by original source is as follows ---
Sample data of housing price. We have used this small data set to create a tutorial -- Machine learning for absolute beginners. The topic is Multivariate Regression.
It has the following four attributes describing a house:
- **area**: area of the house in square feet
- **bedrooms**: number of bedrooms in the house
- **age**: age of the house
- **price**: price of the house

Area, bedrooms, and age are feature attributes; price is the target attribute/variable.
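A minimal sketch of the multivariate regression this tutorial builds, assuming a hypothetical file name and the four columns above:

```python
# Minimal multivariate regression on the four attributes described above.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("homeprices.csv")  # hypothetical file name
model = LinearRegression().fit(df[["area", "bedrooms", "age"]], df["price"])
print(model.coef_, model.intercept_)
# Predicted price for a 3000 sq ft, 3-bedroom, 40-year-old house:
print(model.predict(pd.DataFrame([[3000, 3, 40]], columns=["area", "bedrooms", "age"])))
```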
Source: codebasics : https://twitter.com/codebasicshub
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Significant: A multiple linear regression analysis was performed for each procedure (column 1) and three socioeconomic factors (column 2). Individual regression coefficients are reported (column 3), along with their respective 95% confidence intervals (column 4). The goodness of model fit (column 5) is the percentage of variation explained by the model. The P value (column 6) represents the significance of each regression model as a whole, incorporating education, income, and employment as variables. The model was significant in describing the relationship between the three socioeconomic variables and the prevalence of CABG and PTCA. No causal mechanism can be identified with any regression analysis technique.
I am just using this data to practice single-variable linear regression as a beginner.
The data set contains two columns: year and per capita income.
The dataset for predicting income per capita for Canada is taken from the website: data.worldbank.org
Predict Canada's per capita income for the year 2020 using linear regression (beginner level, just for practice).
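A beginner-level sketch of that exercise, assuming hypothetical file and column names:

```python
# Single-variable linear regression to extrapolate per capita income to 2020.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("canada_per_capita_income.csv")  # hypothetical file name
model = LinearRegression().fit(df[["year"]], df["per capita income"])  # assumed column names
print(model.predict(pd.DataFrame([[2020]], columns=["year"])))
```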
This dataset provides a critical nexus between social media engagement and the financial dimension of European interest groups' representation costs. Designed to facilitate multiple linear regression analysis, this dataset is a valuable tool for researchers, statisticians, and analysts seeking to unravel the intricate relationships between digital engagement and the financial commitments of these interest groups.
The dataset offers a robust collection of data points that enables the exploration of potential correlations, dependencies, and predictive insights. By delving into the varying levels of social media presence across different platforms and their potential influence on the cost of representation, researchers can gain a deeper understanding of the interplay between virtual engagement and real-world financial investment.
Accessible to the academic and research community, this dataset holds the promise of shedding light on the dynamic and evolving landscape of interest groups' communication strategies and their financial implications. With the potential to inform policy decisions and strategic planning, this dataset represents a stepping stone toward a more comprehensive understanding of the intricate web of relationships that shape the world of European interest groups. The variables included:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This paper derives a method for estimating and testing the Linear Quadratic Adjustment Cost (LQAC) model when the target variable and some of the forcing variables follow I(2) processes. Based on a forward-looking error-correction formulation of the model, it is shown how to obtain strongly consistent estimates of the structural parameters from both a linear and a non-linear cointegrating regression where first differences of the I(2) variables are included as regressors (multicointegration). Further, based on the estimated parameter values, it is shown how to test and evaluate the LQAC model using a VAR approach. A simple, easily interpretable metric for measuring model fit is suggested. In an empirical application using UK money demand data, the non-linear multicointegrating regression delivers an economically plausible estimate of the adjustment cost parameter. However, the restrictions implied by the exact LQAC model under rational expectations are strongly rejected, and the metric for model fit indicates a substantial noise component in the model.
ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
Datasets Description:
The datasets under discussion pertain to the red and white variants of Portuguese "Vinho Verde" wine. Detailed information is available in the reference by Cortez et al. (2009). These datasets encompass physicochemical variables as inputs and sensory variables as outputs. Notably, specifics regarding grape types, wine brand, and selling prices are absent due to privacy and logistical concerns.
Classification and Regression Tasks: One can interpret these datasets as being suitable for both classification and regression analyses. The classes are ordered, albeit imbalanced. For instance, the dataset contains a more significant number of normal wines compared to excellent or poor ones.
Dataset Contents: For a comprehensive understanding, readers are encouraged to review the work by Cortez et al. (2009). The input variables, derived from physicochemical tests, include:
1. Fixed acidity
2. Volatile acidity
3. Citric acid
4. Residual sugar
5. Chlorides
6. Free sulfur dioxide
7. Total sulfur dioxide
8. Density
9. pH
10. Sulphates
11. Alcohol

The output variable, based on sensory data, is:
12. Quality (score ranging from 0 to 10)
Usage Tips: A practical suggestion involves setting a threshold for the dependent variable, defining wines with a quality score of 7 or higher as 'good/1' and the rest as 'not good/0.' This facilitates meaningful experimentation with hyperparameter tuning using decision tree algorithms and analyzing ROC curves and AUC values.
Operational Workflow: To efficiently utilize the dataset, the following steps are recommended (a rough Python equivalent follows this list):
1. Connect a File Reader (for CSV) to a Linear Correlation node and an interactive histogram for basic Exploratory Data Analysis (EDA).
2. Connect the File Reader to a Rule Engine node to transform the 10-point scale into a dichotomous variable indicating 'good wine' and 'rest.'
3. Connect the Rule Engine node output to a Column Filter node to filter out the original 10-point feature, thus preventing data leakage.
4. Connect the Column Filter node output to a Partitioning node to execute a standard train/test split (e.g., 75%/25%, choosing 'random' or 'stratified').
5. Feed the Partitioning node's train split output into a Decision Tree Learner node.
6. Connect the Partitioning node's test split output to a Decision Tree Predictor node.
7. Link the Decision Tree Learner node's model output to the Decision Tree Predictor node.
8. Finally, connect the Decision Tree Predictor output to a ROC node for model evaluation based on the AUC value.
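For readers working outside KNIME, here is a hedged sketch of an equivalent workflow in Python, assuming the standard UCI file name and semicolon separator; node-for-node fidelity is approximate.

```python
# Rough Python equivalent of the KNIME workflow above.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

df = pd.read_csv("winequality-red.csv", sep=";")        # step 1: read the file
df["good_wine"] = (df["quality"] >= 7).astype(int)      # step 2: dichotomize at 7
X = df.drop(columns=["quality", "good_wine"])           # step 3: drop original score (no leakage)
y = df["good_wine"]
X_train, X_test, y_train, y_test = train_test_split(    # step 4: stratified 75/25 split
    X, y, test_size=0.25, stratify=y, random_state=0)
tree = DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)  # steps 5-7: learn and predict
auc = roc_auc_score(y_test, tree.predict_proba(X_test)[:, 1])     # step 8: ROC/AUC evaluation
print(auc)
```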
Tools and Acknowledgments: For an efficient analysis, consider using KNIME, a valuable graphical user interface (GUI) tool. Additionally, the dataset is available on the UCI machine learning repository, and proper acknowledgment and citation of the dataset source by Cortez et al. (2009) are essential for use.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Excel template to transform original effect size using the proposed formulae. (XLSX 19 kb)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Walmart Dataset (Retail)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/rutuspatel/walmart-dataset-retail on 28 January 2022.
--- Dataset description provided by original source is as follows ---
Dataset Description :
This is the historical data that covers sales from 2010-02-05 to 2012-11-01, in the file Walmart_Store_sales. Within this file you will find the following fields:
Store - the store number
Date - the week of sales
Weekly_Sales - sales for the given store
Holiday_Flag - whether the week is a special holiday week (1 = holiday week, 0 = non-holiday week)
Temperature - Temperature on the day of sale
Fuel_Price - Cost of fuel in the region
CPI – Prevailing consumer price index
Unemployment - Prevailing unemployment rate
Holiday Events:
Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13
Labour Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13
Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13
Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13
Analysis Tasks
Basic Statistics tasks
1) Which store has the maximum sales?
2) Which store has the maximum standard deviation, i.e., where do sales vary the most? Also find the coefficient of variation (standard deviation divided by mean). A sketch for tasks 1 and 2 follows this list.
3) Which store(s) had a good quarterly growth rate in Q3 2012?
4) Some holidays have a negative impact on sales. Find out which holidays have higher sales than the mean sales in the non-holiday season for all stores together.
5) Provide a monthly and semester (half-yearly) view of sales in units and give insights.
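A hedged pandas sketch for tasks 1 and 2, using the fields listed above (the CSV file name is assumed from the description):

```python
# Sketch for basic statistics tasks 1-2.
import pandas as pd

df = pd.read_csv("Walmart_Store_sales.csv")   # file name assumed from the description
totals = df.groupby("Store")["Weekly_Sales"].sum()
print(totals.idxmax())                         # 1) store with maximum total sales

stds = df.groupby("Store")["Weekly_Sales"].std()
means = df.groupby("Store")["Weekly_Sales"].mean()
print(stds.idxmax())                           # 2) store with maximum standard deviation
print((stds / means).loc[stds.idxmax()])       #    coefficient of variation for that store
```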
Statistical Model
For Store 1 – Build prediction models to forecast demand
Linear Regression - Use the date variable, restructuring dates as sequential day numbers (1 for 5 Feb 2010, counting up from the earliest date). Hypothesize whether CPI, unemployment, and fuel price have any impact on sales.
Change dates into day numbers by creating a new variable.
Select the model which gives the best accuracy; a sketch of this setup follows.
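A hedged sketch of that setup (the day-first date format and the CSV file name are assumptions):

```python
# Store 1: regress weekly sales on sequential day number, CPI, unemployment,
# and fuel price.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("Walmart_Store_sales.csv")    # file name assumed from the description
s1 = df[df["Store"] == 1].copy()
s1["Date"] = pd.to_datetime(s1["Date"], dayfirst=True)  # date format assumed day-first
s1 = s1.sort_values("Date")
s1["day_number"] = (s1["Date"] - s1["Date"].min()).dt.days + 1  # 1 = earliest date

X = s1[["day_number", "CPI", "Unemployment", "Fuel_Price"]]
y = s1["Weekly_Sales"]
model = LinearRegression().fit(X, y)
print(model.score(X, y))  # in-sample R-squared; compare candidate models and keep the best
```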
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
CI = confidence interval; NDSS = National Diabetes Service Scheme; SD = standard deviation. Bolded values indicate significant results. *Represents variables substantially different from the analyses using raw scores (Table 4). Model 1 had the smallest Bayesian information criterion (BIC). Model 2 had the smallest bias-corrected version of the AIC. Model 3 had the largest adjusted proportion of variation "explained" by the regression model.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Red and White Wine Quality Analysis’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/saigeethac/red-and-white-wine-quality-datasets on 28 January 2022.
--- Dataset description provided by original source is as follows ---
This data set is available in UCI at https://archive.ics.uci.edu/ml/datasets/Wine+Quality.
Abstract: Two datasets are included, related to red and white vinho verde wine samples, from the north of Portugal. The goal is to model wine quality based on physicochemical tests.
The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant. So it could be interesting to test feature selection methods.
Input variables (based on physicochemical tests):
Output variable (based on sensory data):
These columns have been described in the Kaggle Data Explorer.
The authors state "we are not sure if all input variables are relevant. So it could be interesting to test feature selection methods." We have briefly explored this aspect and see that Red wine quality prediction on the test and training datasets is almost the same (~88%) with just three features. Likewise White wine quality prediction appears to depend on just one feature. This may be due to the privacy and logistics issues mentioned by the dataset authors.
Two datasets are included, related to red and white vinho verde wine samples, from the north of Portugal. Both these datasets are analyzed and linear regression models are developed in Python 3. The github link provided for the source code also includes a Flask web application for deployment on the local machine or on Heroku.
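A minimal sketch of such a linear regression in Python 3, assuming the standard UCI file name and semicolon separator:

```python
# Model the quality score from the physicochemical inputs.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("winequality-white.csv", sep=";")
X, y = df.drop(columns=["quality"]), df["quality"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print(LinearRegression().fit(X_train, y_train).score(X_test, y_test))  # test R-squared
```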
Datasets: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
Banner Image: Photo by Roberta Sorge on Unsplash
Complete code has been uploaded onto github at https://github.com/saigeethachandrashekar/wine_quality.
Please clone the repo; it contains both datasets and the code required for building and saving the model on your local system. Code for a Flask app is provided for deploying the models on your local machine. The app can also be deployed on Heroku; the requirements.txt and Procfile are provided for this as well.
White wine quality prediction appears to depend on just one feature. This may be due to the privacy and logistics issues mentioned by the dataset authors (e.g. there is no data about grape types, wine brand, wine selling price, etc.) or it may be due to other factors that are not clear. This is an area that might be worth exploring further.
Other ML techniques may be applied to improve the accuracy.
--- Original source retains full ownership of the source dataset ---
State politics researchers commonly employ ordinary least squares (OLS) regression or one of its variants to test linear hypotheses. However, OLS is easily influenced by outliers and thus can produce misleading results when the error term distribution has heavy tails. Here we demonstrate that median regression (MR), an alternative to OLS that conditions the median of the dependent variable (rather than the mean) on the independent variables, can be a solution to this problem. Then we propose and validate a hypothesis test that applied researchers can use to select between OLS and MR in a given sample of data. Finally, we present two examples from state politics research in which (1) the test selects MR over OLS and (2) differences in results between the two methods could lead to different substantive inferences. We conclude that MR and the test we propose can improve linear models in state politics research.
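To illustrate the OLS-versus-MR contrast on synthetic heavy-tailed data (placeholder variables, not the paper's datasets), statsmodels provides both estimators:

```python
# OLS vs. median regression under heavy-tailed errors.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 1.0 + 2.0 * x + rng.standard_t(df=2, size=500)  # heavy-tailed error term
data = pd.DataFrame({"x": x, "y": y})

ols = smf.ols("y ~ x", data).fit()
mr = smf.quantreg("y ~ x", data).fit(q=0.5)  # median regression (quantile 0.5)
print(ols.params, mr.params)  # MR estimates are less influenced by outliers
```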
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘California Housing Data (1990)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/harrywang/housing on 12 November 2021.
--- Dataset description provided by original source is as follows ---
This is the dataset used in this book: https://github.com/ageron/handson-ml/tree/master/datasets/housing to illustrate a sample end-to-end ML project workflow (pipeline). This is a great book - I highly recommend!
The data is based on California Census in 1990.
"This dataset is a modified version of the California Housing dataset available from Luís Torgo's page (University of Porto). Luís Torgo obtained it from the StatLib repository (which is closed now). The dataset may also be downloaded from StatLib mirrors.
The following is the description from the book author:
This dataset appeared in a 1997 paper titled Sparse Spatial Autoregressions by Pace, R. Kelley and Ronald Barry, published in the Statistics and Probability Letters journal. They built it using the 1990 California census data. It contains one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).
The dataset in this directory is almost identical to the original, with two differences: 207 values were randomly removed from the total_bedrooms column, so we can discuss what to do with missing data. An additional categorical attribute called ocean_proximity was added, indicating (very roughly) whether each block group is near the ocean, near the Bay area, inland or on an island. This allows discussing what to do with categorical data. Note that the block groups are called "districts" in the Jupyter notebooks, simply because in some contexts the name "block group" was confusing."
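A brief sketch of how those two modifications are typically handled, assuming the housing.csv file from the book's repository and its standard column names:

```python
# Impute the missing total_bedrooms values and one-hot encode ocean_proximity.
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

housing = pd.read_csv("housing.csv")  # file name as in the book's repository
preprocess = ColumnTransformer(
    [("impute", SimpleImputer(strategy="median"), ["total_bedrooms"]),
     ("encode", OneHotEncoder(handle_unknown="ignore"), ["ocean_proximity"])],
    remainder="passthrough")  # pass the other numeric columns through unchanged
X = preprocess.fit_transform(housing.drop(columns=["median_house_value"]))
```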
http://www.dcc.fc.up.pt/%7Eltorgo/Regression/cal_housing.html
This is a dataset obtained from the StatLib repository. Here is the included description:
"We collected information on the variables using all the block groups in California from the 1990 Cens us. In this sample a block group on average includes 1425.5 individuals living in a geographically co mpact area. Naturally, the geographical area included varies inversely with the population density. W e computed distances among the centroids of each block group as measured in latitude and longitude. W e excluded all the block groups reporting zero entries for the independent and dependent variables. T he final data contained 20,640 observations on 9 variables. The dependent variable is ln(median house value)."
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The survey dataset for identifying Shiraz old silo's new use includes four components:
1. The survey instrument used to collect the data, "SurveyInstrument_table.pdf". The survey instrument contains 18 main closed-ended questions in a table format. Two of these concern information on the silo's decision-makers and proposed new use, following a short introduction to the questionnaire; the other 16 (each identifying 3 variables) concern the level of appropriate opinions for ideal intervention in the façade, openings, materials, and floor heights of the building across four values: feasibility, reversibility, compatibility, and social benefits.
2. The raw survey data, "SurveyData.rar". This file contains an Excel .xlsx and a SPSS .sav file. The survey data file contains 50 variables (12 for each of the four values, separated by colour) and data from each of the 632 respondents. Answering each question in the survey was mandatory, therefore there are no blanks or non-responses in the dataset. In the .sav file, all variables were assigned the numeric type and nominal measurement level. More details about each variable can be found in the Variable View tab of this file. Additional variables were created by grouping or consolidating categories within each survey question for simpler analysis. These variables are listed in the last columns of the .xlsx file.
3. The analysed survey data, "AnalysedData.rar". This file contains 6 SPSS Statistics output documents which demonstrate statistical tests and analyses such as means, correlations, automatic linear regression, reliability, frequencies, and descriptives.
4. The codebook, "Codebook.rar". The detailed SPSS "Codebook.pdf", alongside the simplified codebook "VariableInformation_table.pdf", provides a comprehensive guide to all 50 variables in the survey data, including numerical codes for survey questions and response options. They serve as valuable resources for understanding the dataset, presenting dictionary information, and providing descriptive statistics, such as counts and percentages for categorical variables.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Analysis-ready tabular data from "Predicting spatial-temporal patterns of diet quality and large herbivore performance using satellite time series" in Ecological Applications, Kearney et al., 2021. Data is tabular data only, summarized to the pasture scale. Weight gain data for individual cattle and the STARFM-derived Landsat-MODIS fusion imagery can be made available upon request. Resources in this dataset:

Resource Title: Metadata - CSV column names, units and descriptions. File Name: Kearney_et_al_ECOLAPPL_Patterns of herbivore - metada.docx. Resource Description: Column names, units and descriptions for all CSV files in this dataset.

Resource Title: Fecal quality data. File Name: Kearney_etal2021_Patterns_of_herbivore_Data_FQ_cln.csv. Resource Description: Field-sampled fecal quality (CP = crude protein; DOM = digestible organic matter) data and phenology-related APAR metrics derived from 30 m daily Landsat-MODIS fusion satellite imagery. All data are paddock-scale averages; the paddock is the spatial scale of replication and week is the temporal scale of replication. Fecal samples were collected by USDA-ARS staff from 3-5 animals per paddock (10% - 25% of animals in each herd) weekly during each grazing season from 2014 to 2019 across 10 different paddocks at the Central Plains Experimental Range (CPER) near Nunn, CO. Samples were analyzed at the Grazingland Animal Nutrition Lab (GANlab, https://cnrit.tamu.edu/index.php/ganlab/) using near infrared spectroscopy (see Lyons & Stuth, 1992; Lyons, Stuth, & Angerer, 1995). Not every herd was sampled every week or every year, resulting in a total of 199 samples. Samples represent all available data at the CPER during the study period and were collected for different research and adaptive management objectives, but following the basic protocol described above. APAR metrics were derived from the paddock-scale APAR daily time series (all paddock pixels averaged daily to create a single paddock-scale time series). All APAR metrics are calculated for the week that corresponds to the week that fecal quality samples were collected in the field. See Section 2.2.4 of the corresponding manuscript for a complete description of the APAR metrics.

Resource Title: Monthly ADG. File Name: Kearney_etal2021_Patterns_of_herbivore_Data_ADG_monthly_cln.csv. Resource Description: Monthly average daily gain (ADG) of cattle weights at the paddock scale and the three satellite-derived metrics used to build a regression model to predict ADG: crude protein (CP), digestible organic matter (DOM) and aboveground net herbaceous production (ANHP). The data table also includes stocking rate (animal units per hectare), used as an interaction term in the ADG regression model, and all associated data to derive each of these variables (e.g., sampling start and end dates, 30 m daily Landsat-MODIS fusion satellite imagery-derived APAR metrics, cattle weights, etc.). We calculated paddock-scale average daily gain (ADG, kg hd-1 day-1) from 2000-2019 for yearlings weighed approximately every 28 days during the grazing season across 6 different paddocks with stocking densities of 0.08 - 0.27 animal units (AU) ha-1, where one AU is equivalent to a 454 kg animal. It is worth noting that AUs change as a function of both the number of cattle within a paddock and the size of individual animals, the latter of which changes within a single grazing season. This becomes important to consider when using sub-seasonal weight data for fast-growing yearlings.
For paddock-scale ADG, we first calculated ADG for each individual yearling as the difference between the weights obtained at the end and beginning of each period, divided by the number of days in each period, and then averaged for all individuals in the paddock. We excluded data from 2013 due to data collection inconsistencies. We note that most of the monthly weight data (97%) is from 3 paddocks where cattle were weighed every year, whereas in the other 3 paddocks, monthly weights were only measured during 2017-2019. Apart from the 2013 data, which were not comparable to data from other years, the data represents all available weight gain data for CPER to maximize spatial-temporal coverage and avoid potential bias from subjective decisions to subset the data. Data may have been collected for different projects at different times, but was collected in a consistent way. This resulted in 269 paddock-scale estimates of monthly ADG, with robust temporal, but limited spatial, coverage. CP and DOM were estimated from a random forest model trained on the five APAR metrics: rAPAR, dAPAR, tPeak, iAPAR and iAPAR-dry (see manuscript Section 2.3 for a description). APAR metrics were derived from the paddock-scale APAR daily time series (all paddock pixels averaged daily to create a single paddock-scale time series). All APAR metrics are calculated as the average of the approximately 28-day period that corresponds to the ADG calculation. See Section 2.2.4 of the manuscript for a complete description of the APAR metrics. ANHP was estimated from a linear regression model developed by Gaffney et al. (2018) to calculate net aboveground herbaceous productivity (ANHP; kg ha-1) from iAPAR. We averaged the coefficients of 4 spatial models (2013-2016) developed by Gaffney et al. (2018), resulting in the following equation: ANHP = -26.47 + 2.07(iAPAR). We first calculated ANHP for each day of the grazing season at the paddock scale, and then took the average ANHP for the 28-day period. REFERENCES: Gaffney, R., Porensky, L. M., Gao, F., Irisarri, J. G., Durante, M., Derner, J. D., & Augustine, D. J. (2018). Using APAR to predict aboveground plant productivity in semi-arid rangelands: Spatial and temporal relationships differ. Remote Sensing, 10(9). doi: 10.3390/rs10091474

Resource Title: Season-long ADG. File Name: Kearney_etal2021_Patterns_of_herbivore_Data_ADG_seasonal_cln.csv. Resource Description: Season-long observed and model-predicted average daily gain (ADG) of cattle weights at the paddock scale. Also includes two variables used to analyze patterns in model residuals: percent sand content and season-long aboveground net herbaceous production (ANHP). We calculated observed paddock-scale ADG for the entire grazing season from 2010-2019 (excluding 2013 due to data collection inconsistencies) by averaging the seasonal ADG of each yearling, determined as the difference between the end and starting weights divided by the number of days in the grazing season. This dataset was available for 40 paddocks spanning a range of soil types, plant communities, and topographic positions. Data may have been collected for different projects at different times, but was collected in a consistent way. We note that there was spatial overlap among a small number of paddock boundaries across different years since some fence lines were moved in 2012 and 2014. Model-predicted paddock-scale ADG was derived using the monthly ADG regression model described in Sections 2.3.3 and 2.3.4 of the associated manuscript.

In short, we predicted season-long cattle weight gains by first predicting daily weight gain for each day of the grazing season from the monthly regression model using a 28-day moving average of model inputs (CP, DOM and ANHP). We calculated the final ADG for the entire grazing season as the average predicted ADG, starting 28 days into the growing season. Percent sand content was obtained as the paddock-scale average of POLARIS sand content in the upper 0-30 cm. ANHP was calculated on the last day of the grazing season using a linear regression model developed by Gaffney et al. (2018) to calculate net aboveground herbaceous productivity (ANHP; kg ha-1) from satellite-derived integrated absorbed photosynthetically active radiation (iAPAR) (see Section 3.1.2 of the associated manuscript). We averaged the coefficients of 4 spatial models (2013-2016) developed by Gaffney et al. (2018), resulting in the following equation: ANHP = -26.47 + 2.07(iAPAR). REFERENCES: Gaffney, R., Porensky, L. M., Gao, F., Irisarri, J. G., Durante, M., Derner, J. D., & Augustine, D. J. (2018). Using APAR to predict aboveground plant productivity in semi-arid rangelands: Spatial and temporal relationships differ. Remote Sensing, 10(9). doi: 10.3390/rs10091474
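For convenience, the quoted averaged-coefficient relationship can be written as a small helper (a direct transcription of the equation above):

```python
def anhp_from_iapar(iapar: float) -> float:
    """Aboveground net herbaceous productivity (kg/ha) from integrated APAR,
    using the averaged Gaffney et al. (2018) coefficients quoted above."""
    return -26.47 + 2.07 * iapar
```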
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
In this study, we investigate the (in)direct relationship between learning orientation and firm performance. The study is guided by the DCs framework. We collected data from 182 MSMEs operating in TPs in Poland. We used two methods (PAPI, CAWI) in our quantitative empirical research. For the analysis of empirical data, we used the methods of description and statistical inference. The Cronbach's alpha values showed very good reliability of the questionnaire; we have assumed that the coefficient deciding whether a tool is reliable should be at least 0.70. The results of the Kolmogorov-Smirnov tests indicate grounds for assuming that the variables are not normally distributed. We present the results of the Kolmogorov-Smirnov tests and Cronbach's alpha coefficients in Table 1. In the next step, we applied correlation analysis between the variables using Spearman's rho coefficient. We present the results of correlations among the analysed variables in Table 2. The analysis of data included in Table 2 indicated weak or very weak correlations among the variables in individual configurations. LO positively correlates with FP (rs = 0.197; p < 0.01). This means that an increase in LO is accompanied, on average, by a small increase in FP. There is also a positive, although very weak (rs = 0.151), correlation between MD and LO, which was statistically significant (p < 0.05). This means that an increase in MD is accompanied, on average, by a slight increase in LO. At the same time, the results of the correlation analysis indicated a weak but positive correlation between one of the dimensions of MD, i.e. speed of change in technology and competition, and LO (rs = 0.236; p < 0.01). Relationships between the two remaining dimensions of MD and LO were not statistically significant. In addition, the aforementioned MD dimension also positively correlates with FP. The correlation between the MD dimension called speed of change in technology and competition and FP is positive, weak and statistically significant (rs = 0.181; p < 0.05). This means that an increase in the speed of change in technology and competition is accompanied, on average, by a slight increase in FP. Relationships between the two remaining dimensions of MD and FP were not statistically significant. Correlation analysis encourages deeper recognition and understanding of the LO-FP relationship in the context of MD. We used linear regression models in order to verify the hypotheses, which allowed for a global assessment of relationships among all analysed variables. The values of coefficients obtained for permanent effects in this model inform how much the expected value of the response variable changes with a unit increase in a given predictor. The explanatory variable (predictor) is a variable in a statistical model (as well as in an econometric model) on the basis of which the response variable is calculated. In Model 1 there is one explanatory variable (LO), while in Model 2 there are two explanatory variables (LO, MD). The response variable is FP. The statistical significance of these coefficients was verified by a test based on the t statistic.
For all the mentioned tests, p < 0.05 indicated the statistical significance of the analysed relationships. The assessment of the impact of LO on FP is dictated by verification of hypothesis H.1, while the assessment of the role of the dynamism of the market in which enterprises operate in explaining the impact of LO on FP is dictated by verification of hypothesis H.2. H.1: Learning orientation is positively related to firm performance. H.2: Market dynamism moderates the learning orientation-firm performance relationship; the positive effect of learning orientation on firm performance is likely to be stronger under high market dynamism than under low market dynamism. The results of testing hypotheses H.1 and H.2 are presented in Table 3. We estimated Models 1 and 2 in Table 3 using the Akaike Information Criterion (AIC). The AIC for both models was similar, i.e. 568.28 for the first model and 571.12 for the second one. The AIC levels for both models indicated acceptable matching levels; the lower the AIC value, the better the predictive value of the model. A model coefficient is a parameter determined by its most likely value. The confidence interval of a model coefficient indicates the range in which its less probable but possible values may lie. It also has diagnostic value: if the confidence interval of a regression coefficient contains 0, the coefficient has no substantive value for the model. Model 1 explained 13.5% of the data variation (R2 = 0.135), while Model 2 explained 14.0% of the data variation (R2 = 0.140), which is slightly more than Model 1. The analysis of the models presented in Table 3 leads to several findings. First, in Model 1, only LO was positively related to FP and only slightly explained the variability of the dependent variable. It has a small but statistically significant impact on FP (coefficient: 0.38; p = 0.00). The linear regression model (Model 1) confirms the thesis about the positive impact of LO on FP. It may be assumed that an increase in the assessment of LO by one point, with no change in the other parameters of the model, would result in an increase in average FP by 0.38. This model explains 13.5% of the data variability (R2 = 0.135). Secondly, the linear regression model (Model 2) did not confirm the thesis about the moderating role of MD in the LO-FP relationship. None of the predictors showed statistical significance (p < 0.05) in Model 2. What is more, taking the MD variable into account affects the quality of the model, and MD itself adopts negative prediction indicators, which means that better FP in responding to changes in the level of MD deteriorates the overall FP. However, the research has not confirmed whether MD - a higher-order construct built of three first-order constructs, i.e. the speed of changes in technology and competition, the unpredictability of changes in technology and competition, and the uncertainty of customer behaviour - increases the importance of LO for increasing FP, and thus for achieving a competitive advantage. Thirdly, the control variables were insignificant in both models. This means that the control variables, in the form of enterprise size, do not have a statistically significant effect on the dependent variable. Therefore, the introduction of two control variables and a moderating variable reduced the impact of LO on FP to a statistically insignificant level. The results of the study show that firm performance benefits from LO-related behaviours.
Learning orientation is an important stimulant of firm performance, while market dynamism has not been classified as a moderator of the learning orientation-firm performance relationship.
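A hedged sketch of the two-model setup described above, with placeholder variable names for the study's constructs (FP, LO, MD, and an enterprise-size control):

```python
# Model 1 regresses FP on LO; Model 2 adds MD and the LO x MD interaction
# to test moderation (H.2). File and variable names are placeholders.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("survey_scores.csv")  # hypothetical file of construct scores
model1 = smf.ols("FP ~ LO", df).fit()
model2 = smf.ols("FP ~ LO * MD + firm_size", df).fit()  # LO*MD expands to LO + MD + LO:MD
print(model1.aic, model2.aic)  # compare models, as in the text's AIC comparison
```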
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
This is a dataset to explore the effect of applying a new bias correction and quality filtering approach to increase the accuracy of atmospheric CO2 measurements derived from the Orbiting Carbon Observatory-2 (OCO-2) satellite. This is not an official OCO-2 data product.
Data Description
The dataset contains OCO-2 retrieved XCO2 (build B11.2) that has been corrected and filtered with a new machine-learning approach, covering 2014 to the end of 2024. There is one file for each year. The following variables are contained in the files:
sounding_id, xco2_ML*, xco2, xco2_x2019, xco2_quality_flag_ML*, xco2_quality_flag, bias_correction_uncert_ML*, xco2_uncertainty, latitude, longitude, time, land_water_indicator, operation_mode
xco2_ML: XCO2 Machine Learning corrected XCO2 on x2019 scale
xco2_quality_flag_ML: XCO2 ternary quality flag: 0 = best quality data, 1 = good quality data for increasing sounding throughput if needed, 2 = poor quality data (do not use)
bias_correction_uncert_ML: XCO2 bias correction uncertainty
For the full set of variables contained in the Lite files and a description of each variable, please refer to the data user guide of the official OCO-2 Lite files: https://disc.gsfc.nasa.gov/datasets/OCO2_L2_Lite_FP_11.2r/summary?keywords=oco2
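A hedged sketch of reading one yearly file and filtering to best-quality soundings, assuming the files follow the netCDF layout of the official Lite files (the file name is a placeholder):

```python
# Keep only soundings flagged as best quality (flag 0, per the definition above).
import xarray as xr

ds = xr.open_dataset("oco2_ML_corrected_2020.nc")  # placeholder file name
best = ds.where(ds["xco2_quality_flag_ML"] == 0, drop=True)
print(best["xco2_ML"].mean().item())  # mean bias-corrected XCO2 on the x2019 scale
```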
Current Bias Correction Approach
The operational bias correction method uses a multiple linear regression-like approach to adjust errors in XCO2 relative to elements of the state vector derived from the ACOS retrieval. Adjustments are currently made manually, guided by various success metrics like TCCON observation agreement, retrieval variability reduction, and flux model coherence.
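As a generic illustration only (not the operational ACOS code), a regression-based bias correction of this kind can be sketched as fitting XCO2 error against state-vector elements and subtracting the fit; all names here are placeholders:

```python
# Generic regression-style bias correction sketch.
from sklearn.linear_model import LinearRegression

# state_features: (n_soundings, n_features) array of retrieval state-vector elements
# xco2_error: retrieved XCO2 minus a truth proxy (e.g., co-located reference data)
def correct_xco2(xco2_raw, state_features, xco2_error):
    bias_model = LinearRegression().fit(state_features, xco2_error)
    return xco2_raw - bias_model.predict(state_features)
```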
New Bias Correction Approach
Data Usage
If you find anything unexpected in the data please report your findings to Steffen.mauceri@jpl.nasa.gov and william.r.keely@jpl.nasa.gov so we can resolve any issues.
Additional Resources and Citation
Two preprints are currently available that describe the approach in detail and should be cited if the data is used. We will update the citations as soon as the papers are published:
Copyright statement: © 2023 California Institute of Technology. Government sponsorship acknowledged.