Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Ordinary least squares (OLS) estimation of a linear regression model is well-known to be highly sensitive to outliers. It is common practice to (1) identify and remove outliers by looking at the data and (2) fit OLS and form confidence intervals and p-values on the remaining data as if this were the original data collected. This standard “detect-and-forget” approach has been shown to be problematic, and in this article we highlight the fact that it can lead to invalid inference and show how recently developed tools in selective inference can be used to properly account for outlier detection and removal. Our inferential procedures apply to a general class of outlier removal procedures that includes several of the most commonly used approaches. We conduct simulations to corroborate the theoretical results, and we apply our method to three real datasets to illustrate how our inferential results can differ from the traditional detect-and-forget strategy. A companion R package, outference, implements these new procedures with an interface that matches the functions commonly used for inference with lm in R. Supplementary materials for this article are available online.
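The paper's own tooling is the outference R package; purely as an illustration of why detect-and-forget is problematic, the naive pipeline can be simulated under the null hypothesis in a few lines. This is a hedged sketch, not the paper's method: data, cutoff, and function names here are all made up for the demonstration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def detect_and_forget_pvalue(n=50, cutoff=2.0):
    """One run of the naive pipeline under the null (true slope = 0):
    fit OLS, drop points whose residual exceeds `cutoff` standard
    deviations, refit on the remaining data, and return the slope
    p-value as if the trimmed data were the original sample."""
    x = rng.normal(size=n)
    y = rng.normal(size=n)                      # y is unrelated to x
    fit = stats.linregress(x, y)
    resid = y - (fit.intercept + fit.slope * x)
    keep = np.abs(resid) <= cutoff * resid.std()
    refit = stats.linregress(x[keep], y[keep])  # "forget" the removal step
    return refit.pvalue

pvals = [detect_and_forget_pvalue() for _ in range(2000)]
# With valid inference, about 5% of null p-values would fall below 0.05;
# trimming residuals and refitting tends to reject more often than that.
print(sum(p < 0.05 for p in pvals) / len(pvals))
```

Because the residual trimming shrinks the estimated error variance, the refit's standard errors are too small, which is exactly the invalid-inference phenomenon the abstract describes.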
Preliminary investigation
(a) Carry out a shortened initial investigation (steps 1, 2, and 3) based on the matrix scatter plot and box plot. Do not remove outliers or transform the data. Indicate if you had to process the data file in any way. Explain any conclusions drawn from the evidence and back up your conclusions.
(b) Explain why using the correlation matrix for the factor analysis is indicated.
(c) Display the sample correlation matrix R. Does the matrix R suggest the number of factors to use?
(d) Perform a preliminary simplified principal component analysis using R.
i. List the eigenvalues and describe the percent contributions to the variance.
ii. Determine the number of principal components to retain and justify your answer by considering at least three methods. Note and comment if there is any disagreement between the methods.
(e) Include your code.
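Parts (c) and (d) can be sketched as follows. The assignment itself uses R; this is a minimal Python illustration on synthetic data (the data set, dimensions, and the 80% cutoff are stand-in assumptions, not part of the assignment):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                     # stand-in for the course data
X[:, 4] = X[:, 0] + 0.1 * rng.normal(size=100)    # induce some correlation

# (c) sample correlation matrix R
R = np.corrcoef(X, rowvar=False)

# (d) PCA via the eigendecomposition of R
evals, evecs = np.linalg.eigh(R)
order = np.argsort(evals)[::-1]                   # sort eigenvalues descending
evals, evecs = evals[order], evecs[:, order]

# i. eigenvalues and percent contributions to total variance
# (for a correlation matrix, total variance = trace(R) = number of variables)
pct = 100 * evals / evals.sum()

# ii. three common retention criteria:
kaiser = int(np.sum(evals > 1))                       # eigenvalue > 1 rule
cum80 = int(np.searchsorted(np.cumsum(pct), 80) + 1)  # cumulative variance >= 80%
# scree: inspect a plot of `evals` for an elbow (visual, not computed here)
print(evals, pct, kaiser, cum80)
```

Comparing the counts the Kaiser rule, the cumulative-variance cutoff, and the scree plot suggest is exactly the kind of cross-method check part (d)(ii) asks for.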
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Measurement Configuration Dataset
This is the anonymous reviewing version; the source code repository will be added after the review.
This dataset provides reproduction data for performance measurement configuration at source code level in Java. The measurement data can be obtained yourself using the precision-experiments repository, https://anonymous.4open.science/r/precision-experiments-C613/ (Examining Different Repetition Counts). The data contained here are the data we obtained from execution on an i7-4770 CPU @ 3.40 GHz.
The analysis was tested on Ubuntu 20.04 and gnuplot 5.2.8. It will not work with older gnuplot versions.
To execute the analysis, extract the data by
tar -xvf basic-parameter-comparison.tar
tar -xvf parallel-sequential-comparison.tar
and afterwards build the precision-experiments repo and execute the analysis by
cd precision-experiments/precision-analysis/
../gradlew fatJar
cd scripts/configuration-analysis/
./executeCompleteAnalysis.sh ../../../../basic-parameter-comparison ../../../../parallel-sequential-comparison
Afterwards, the following files will be present:
precision-experiments/precision-analysis/scripts/configuration-analysis/repetitionHeatmaps/heatmap_all_en.pdf (Heatmaps for different repetition counts)
precision-experiments/precision-analysis/scripts/configuration-analysis/repetitionHeatmaps/heatmap_outlierRemoval_en.pdf (Heatmap with and without outlier removal for 1000 repetitions)
precision-experiments/precision-analysis/scripts/configuration-analysis/histogram_outliers_en.pdf (Histogram of the outliers)
precision-experiments/precision-analysis/scripts/configuration-analysis/heatmap_parallel_en.pdf (Heatmap with sequential and parallel execution)
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
The objective behind attempting this dataset was to understand the predictors that contribute to life expectancy around the world. I have used Linear Regression, Decision Tree, and Random Forest for this purpose.
Steps involved:
1) Read the CSV file.
2) Data cleaning:
- The variables Country and Status had character data types and had to be converted to factors.
- 2,563 missing values were encountered, with the Population variable having the most missing values (652).
- Rows with missing values were dropped before running the analysis.
3) Run linear regression:
- Before running the linear regression, 3 variables were dropped because they did not have much effect on the dependent variable, Life Expectancy. These 3 variables were Country, Year, and Status. This left 19 variables (1 dependent and 18 independent).
- We run the linear regression. Multiple R-squared is 83%, which means the independent variables explain 83% of the variance in the dependent variable.
- OUTLIER DETECTION: We check for outliers using the IQR rule and find 54 outliers. These outliers are removed before we run the regression analysis again. Multiple R-squared increases from 83% to 86%.
- MULTICOLLINEARITY: We check for multicollinearity using the VIF (Variance Inflation Factor), which flags cases where two or more independent variables are highly correlated. The rule of thumb is that variables with absolute VIF values above 5 should be removed. We find 6 such variables: Infant.deaths, percentage.expenditure, Under.five.deaths, GDP, thinness1.19, and thinness5.9. Infant deaths and under-five deaths are strongly collinear, so we drop Infant.deaths (which has the higher VIF value).
- When we run the linear regression model again, the VIF value of Under.five.deaths drops from 211.46 to 2.74, while the other variables' VIF values decrease only slightly.
- The variable thinness1.19 is dropped next and the regression is run once more. The VIF value of thinness5.9, previously 7.61, drops to 1.95. GDP and Population still have VIF values above 5, but I decided against dropping them because I consider them important independent variables.
- SET THE SEED AND SPLIT THE DATA INTO TRAIN AND TEST DATA: On the training data we get a multiple R-squared of 86% and a p-value less than alpha, which means the model is statistically significant. We use the trained model to predict the test data and compute RMSE, MAPE, and MAE, loading library(Metrics) for this purpose.
- RMSE (Root Mean Squared Error) is 3.2. This indicates that, on average, the predicted values are off by 3.2 years from the actual life expectancy values.
- MAPE (Mean Absolute Percentage Error) is 0.037. This indicates a prediction accuracy of about 96.3% (1 - 0.037).
- MAE (Mean Absolute Error) is 2.55. This indicates that, on average, the predicted values deviate by approximately 2.55 years from the actual values.
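The write-up above uses R (with library(Metrics)). Purely as an illustration, the three key computations it relies on, IQR-based outlier flagging, VIF, and the error metrics, can be sketched in Python; all data in this sketch is synthetic and the function names are my own, not from the analysis:

```python
import numpy as np

def iqr_outliers(x, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

def vif(X):
    """VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    column j on all the other columns (plus an intercept)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - resid.var() / y.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

def rmse(y, yhat):
    return float(np.sqrt(np.mean((np.asarray(y) - np.asarray(yhat)) ** 2)))

def mae(y, yhat):
    return float(np.mean(np.abs(np.asarray(y) - np.asarray(yhat))))

def mape(y, yhat):
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return float(np.mean(np.abs((y - yhat) / y)))
```

Two collinear predictors (like infant deaths and under-five deaths above) both show large VIFs; dropping one brings the other's VIF back toward 1, which is the pattern the walkthrough observes with Under.five.deaths.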
Conclusion: Random Forest is the best model for predicting life expectancy, as it has the lowest RMSE, MAPE, and MAE.
The project is to conduct a principal components analysis of the paper mill response data (paper_mill_data_response.txt; Aldrin, M., "Moderate projection pursuit regression for multivariate response data", Computational Statistics and Data Analysis, 21 (1996), pp. 501-531).
(a) Label the variables r1, ..., r13. Carry out an initial investigation. Do not remove outliers or transform the data. Indicate if you had to process the data file in any way. Explain any conclusions drawn from the evidence and back up your conclusions.
(b) Display the sample correlation matrix R.
(c) Perform a principal component analysis using R.
i. List the eigenvalues and describe the percent contributions to the variance.
ii. Determine the number of principal components to retain and justify your answer by considering at least three methods. Note and comment if there is any disagreement between the methods.
iii. Give the eigenvectors for the first two principal components and write out the principal components.
iv. Considering the coefficients of the principal components, describe dependencies of the principal components on the variables.
v. Display a scatter plot of the first two principal components. Make observations about the plots.
(d) Include your code.