10 datasets found

Data from: Valid Inference Corrected for Outlier Removal
tandf.figshare.com
pdf
Updated Jun 4, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shuxiao Chen; Jacob Bien (2023). Valid Inference Corrected for Outlier Removal [Dataset]. http://doi.org/10.6084/m9.figshare.9762731.v4
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.9762731.v4
Dataset updated
Jun 4, 2023
Dataset provided by
Taylor & Francishttps://taylorandfrancis.com/
Authors
Shuxiao Chen; Jacob Bien
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Ordinary least square (OLS) estimation of a linear regression model is well-known to be highly sensitive to outliers. It is common practice to (1) identify and remove outliers by looking at the data and (2) to fit OLS and form confidence intervals and p-values on the remaining data as if this were the original data collected. This standard “detect-and-forget” approach has been shown to be problematic, and in this article we highlight the fact that it can lead to invalid inference and show how recently developed tools in selective inference can be used to properly account for outlier detection and removal. Our inferential procedures apply to a general class of outlier removal procedures that includes several of the most commonly used approaches. We conduct simulations to corroborate the theoretical results, and we apply our method to three real datasets to illustrate how our inferential results can differ from the traditional detect-and-forget strategy. A companion R package, outference, implements these new procedures with an interface that matches the functions commonly used for inference with lm in R. Supplementary materials for this article are available online.
KC_House Dataset -Linear Regression of Home Prices
kaggle.com
zip
Updated May 15, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
vikram amin (2023). KC_House Dataset -Linear Regression of Home Prices [Dataset]. https://www.kaggle.com/datasets/vikramamin/kc-house-dataset-home-prices
Explore at:
zip(776807 bytes)Available download formats
Dataset updated
May 15, 2023
Authors
vikram amin
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Dataset: House pricing dataset containing 21 columns and 21613 rows.

Programming Language : R

Objective : To predict house prices by creating a model

Steps : A) Import the dataset B) Install and run libraries C) Data Cleaning - Remove Null Values , Change Data Types , Dropping of Columns which are not important D) Data Analysis - (i)Linear Regression Model was used to establish the relationship between the dependent variable (price) and other independent variable (ii) Outliers were identified and removed (iii) Regression model was run once again after removing the outliers (iv) Multiple R- squared was calculated which indicated the independent variables can explain 73% change/ variation in the dependent variable (v) P value was less than that of alpha 0.05 which shows it is statistically significant. (vi) Interpreting the meaning of the results of the coefficients (vii) Checked the assumption of multicollinearity (viii) VIF(Variance inflation factor) was calculated for all the independent variables and their absolute value was found to be less than 5. Hence, there is not threat of multicollinearity and that we can proceed with the independent variables specified.
Life Expectancy WHO
kaggle.com
zip
Updated Jun 19, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
vikram amin (2023). Life Expectancy WHO [Dataset]. https://www.kaggle.com/datasets/vikramamin/life-expectancy-who
Explore at:
zip(121472 bytes)Available download formats
Dataset updated
Jun 19, 2023
Authors
vikram amin
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
The objective behind attempting this dataset was to understand the predictors that contribute to the life expectancy around the world. I have used Linear Regression, Decision Tree and Random Forest for this purpose. Steps Involved: - Read the csv file - Data Cleaning: - Variables Country and Status were showing as having character data types. These had to be converted to factor - 2563 missing values were encountered with Population variable having the most of the missing values i.e 652 - Missing rows were dropped before we could run the analysis. 3) Run Linear Regression - Before running linear regression, 3 variables were dropped as they were not found to be having that much of an effect on the dependent variable i.e Life Expectancy. These 3 variables were Country, Year & Status. This meant we are now working with 19 variables (1 dependent and 18 independent variables) - We run the linear regression. Multiple R squared is 83% which means that independent variables can explain 83% change or variance in the dependent variable. - OULTLIER DETECTION. We check for outliers using IQR and find 54 outliers. These outliers are then removed before we run the regression analysis once again. Multiple R squared increased from 83% to 86%. - MULTICOLLINEARITY. We check for multicollinearity using the VIF model(Variance Inflation Factor). This is being done in case when two or more independent variables showing high correlation. The thumb rule is that absolute VIF values above 5 should be removed. We find 6 variables that have a VIF value higher than 5 namely Infant.deaths, percentage.expenditure,Under.five.deaths,GDP,thinness1.19,thinness5.9. Infant deaths and Under Five deaths have strong collinearity so we drop infant deaths(which has the higher VIF value). - When we run the linear regression model again, VIF value of Under.Five.Deaths goes down from 211.46 to 2.74 while the other variable's VIF values reduce very less. Variable thinness1.19 is now dropped and we run the regression once more. - Variable thinness5.9 whose absolute VIF value was 7.61 has now dropped to 1.95. GDP and Population are still having VIF value more than 5 but I decided against dropping these as I consider them to be important independent variables. - SET THE SEED AND SPLIT THE DATA INTO TRAIN AND TEST DATA. We run the train data and get multiple R squared of 86% and p value less than that of alpha which states that it is statistically significant. We use the train data to predict the test data to find out the RMSE and MAPE. We run the library(Metrics) for this purpose. - In Linear Regression, RMSE (Root Mean Squared Error) is 3.2. This indicates that on an average, the predicted values have an error of 3.2 years as compared to the actual life expectancy values. - MAPE (Mean Absolute Percentage Error) is 0.037. This indicates an accuracy prediction of 96.20% (1-0.037). - MAE (Mean Absolute Error) is 2.55. This indicates that on an average, the predicted values deviate by approximately 2.83 years from the actual values.

We use DECISION TREE MODEL for the analysis.

Run the required libraries (rpart, rpart.plot, RColorBrewer, rattle).

We run the decision tree analysis using rpart and plot the tree. We use fancyRpartPlot.

We use 5 fold cross validation method with CP (complexity parameter) being 0.01.

In Decision Tree , RMSE (Root Mean Squared Error) is 3.06. This indicates that on an average, the predicted values have an error of 3.06 years as compared to the actual life expectancy values.

MAPE (Mean Absolute Percentage Error) is 0.035. This indicates an accuracy prediction of 96.45% (1-0.035).

MAE (Mean Absolute Error) is 2.35. This indicates that on an average, the predicted values deviate by approximately 2.35 years from the actual values.

We use RANDOM FOREST for the analysis.

Run library(randomForest)

We use varImpPlot to find out which variables are most significant and least significant. Income composition is the most important followed by adult mortality and the least relevant independent variable is Population.

Predict Life expectancy through random forest model.

In Random Forest , RMSE (Root Mean Squared Error) is 1.73. This indicates that on an average, the predicted values have an error of 1.73 years as compared to the actual life expectancy values.

MAPE (Mean Absolute Percentage Error) is 0.01. This indicates an accuracy prediction of 98.27% (1-0.01).

MAE (Mean Absolute Error) is 1.14. This indicates that on an average, the predicted values deviate by approximately 1.14 years from the actual values.

Conclusion: Random Forest is the best model for predicting the life expectancy values as it has the lowest RMSE, MAPE and MAE.
Z
Occurrence Record Dataset from "Depth Matters for Marine Biodiversity"
datasetcatalog.nlm.nih.gov
Updated Aug 14, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rahbek, Carsten; Owens, Hannah (2024). Occurrence Record Dataset from "Depth Matters for Marine Biodiversity" [Dataset]. http://doi.org/10.5281/zenodo.13318673
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.13318673
Dataset updated
Aug 14, 2024
Authors
Rahbek, Carsten; Owens, Hannah
Description
This is the final occurrence record dataset produced for the manuscript "Depth Matters for Marine Biodiversity". Detailed methods for the creation of the dataset, below, have been excerpted from Appendix I: Extended Methods. Detailed citations for the occurrence datasets from which these data were derived can also be foud in Appedix I of the manuscript. We assembled a list of all recognized species of fishes from the orders Scombiformes (sensu Betancur-R et al., 2017), Gadiformes, and Beloniformes by accessing FishBase (Boettiger et al., 2012; Froese & Pauly, 2017) and the Ocean Biodiversity Information System (OBIS; OBIS, 2022; Provoost & Bosch, 2019) through queries in R (R Core Team, 2021). Species were considered Atlantic if their FishBase distribution or occurrence records on OBIS included any area within the Atlantic or Mediterranean major fishing regions as defined by the Food and Agriculture Organization of the United Nations (FAO Regions 21, 27, 31, 34, 37, 41, 47, and 48; FAO, 2020) The database query script can be found on the project code repository (https://github.com/hannahlowens/3DFishRichness/blob/main/1_OccurrenceSearch.R). We then curated the list of names to resolve discrepancies in taxonomy and known distributions through comparison with the Eschmeyer Catalog of Fishes (Eschmeyer & Fricke, 2015), accessed in September of 2020, as our ultimate taxonomic authority. The resulting list of species was then mapped onto the Global Biodiversity Information Facility’s backbone taxonomy (Chamberlain et al., 2021; GBIF, 2020a) to ensure taxonomic concurrence across databases (Appendix I Table 1). The final taxonomic list was used to download occurrence records from OBIS (OBIS, 2022) and GBIF (GBIF, 2020b) in R through robis and occCite (Chamberlain et al., 2020; Provoost & Bosch, 2019; Owens et al., 2021). Once the resulting data were mapped and curated to remove records with putatively spurious coordinates, under-sampled regions and species were augmented with data from publicly available digital museum collection databases not served through OBIS or GBIF, as well as a literature search. For each species, duplicate points were removed from two- and three-dimensional species occurrence datasets separately, and inaccurate depth records were removed from 3D datasets. Inaccuracy was determined based on extreme statistical outliers (values greater than 2 or less than -2 when occurrence depths were centered and scaled), depth ranges that exceeded bathymetry at occurrence coordinates, and occurrence far outside known depth ranges compared to information from FishBase, Eschmeyer’s Catalog of Fishes, and congeneric depth ranges in the dataset. Finally, for datasets with more than 20 points remaining after cleaning, occurrence data were downsampled to the resolution of the environmental data; that is, to 1 point per 1 degree grid cell in the 2D dataset, and to one point per depth slice per 1 degree grid cell in the 3D dataset. Counts of raw and cleaned records for each species can be found in Appendix 1 Table 1. References: Betancur-R, R., Wiley, E. O., Arratia, G., Acero, A., Bailly, N., Miya, M., Lecointre, G., & Ortí, G. (2017). Phylogenetic classification of bony fishes. BMC Evolutionary Biology, 17(1), 162. https://doi.org/10.1186/s12862-017-0958-3 Boettiger, C., Lang, D. T., & Wainwright, P. C. (2012). rfishbase: exploring, manipulating and visualizing FishBase data from R. Journal of Fish Biology, 81(6), 2030–2039. https://doi.org/10.1111/j.1095-8649.2012.03464.x Chamberlain, S., Barve, V., McGlinn, D., Oldoni, D., Desmet, P., Geffert, L., & Ram, K. (2021). rgbif: Interface to the Global Biodiversity Information Facility API. https://CRAN.R-project.org/package=rgbif Eschmeyer, & Fricke, W. N. &. (2015). Taxonomic checklist of fish species listed in the CITES Appendices and EC Regulation 338/97 (Elasmobranchii, Actinopteri, Coelacanthi, and Dipneusti, except the genus Hippocampus). Catalog of Fishes, Electronic Version. Accessed September, 2020. https://www.calacademy.org/scientists/projects/eschmeyers-catalog-of-fishes FAO. (2020). FAO Major Fishing Areas. United Nations Fisheries and Aquaculture Division. https://www.fao.org/fishery/en/collection/area Froese, R., & Pauly, D. (2017). FishBase. Accessed September, 2022. www.fishbase.org GBIF.org. (2020a). GBIF Backbone Taxonomy. Accessed September, 2020. GBIF.org GBIF.org. (2020b). GBIF Occurrence Download. Accessed November, 2020. https://doi.org/10.15468 OBIS. (2020). Ocean Biodiversity Information System. Intergovernmental Oceanographic Commission of UNESCO. Accessed November, 2020. www.obis.org Owens, H. L., Merow, C., Maitner, B. S., Kass, J. M., Barve, V., & Guralnick, R. P. (2021). occCite: Tools for querying and managing large biodiversity occurrence datasets. Ecography, 44(8), 1228–1235. https://doi.org/10.1111/ecog.05618 Provoost, P., & Bosch, S. (2019). robis: R Client to access data from the OBIS API. https://cran.r-project.org/package=robis R Core Team. (2021). R: A Language and Environment for Statistical Computing. https://www.R-project.org/
Z
Identification of Performance Changes at Code Level (Measurement...
data-staging.niaid.nih.gov
data.niaid.nih.gov
Updated Aug 8, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anonymous for Reviewing (2022). Identification of Performance Changes at Code Level (Measurement Configuration Dataset) [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_6300863
Explore at:
Dataset updated
Aug 8, 2022
Authors
Anonymous for Reviewing
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Measurement Configuration Dataset

This is the anonymous reviewing version; the source code repository will be added after the review.

This dataset provides reproduction data for performance measurement configuration at source code level in Java. The measurement data can be obtained using the precision-experiments repository https://anonymous.4open.science/r/precision-experiments-C613/ (Examining Different Repetition Counts) yourself. These data conatained here are the data we obtained from execution on i7-4770 CPU @ 3.40GHz.

The analysis was tested on Ubuntu 20.04 and gnuplot 5.2.8. It will not work with older gnuplot versions.

To execute the analysis, extract the data by

tar -xvf basic-parameter-comparison.tar tar -xvf parallel-sequential-comparison.tar

and afterwards build the precision-experiments repo and execute the analysis by

cd precision-experiments/precision-analysis/ ../gradlew fatJar cd scripts/configuration-analysis/ ./executeCompleteAnalysis.sh ../../../../basic-parameter-comparison ../../../../parallel-sequential-comparison

Afterwards, the following files will be present:

precision-experiments/precision-analysis/scripts/configuration-analysis/repetitionHeatmaps/heatmap_all_en.pdf (Heatmaps for different repetition counts)

precision-experiments/precision-analysis/scripts/configuration-analysis/repetitionHeatmaps/heatmap_outlierRemoval_en.pdf (Heatmap with and without outlier removal for 1000 repetitions)

precision-experiments/precision-analysis/scripts/configuration-analysis/histogram_outliers_en.pdf (Histogram of the outliers)

precision-experiments/precision-analysis/scripts/configuration-analysis/heatmap_parallel_en.pdf (Heatmap with sequential and parallel execution)
t
Wenau, Stefan, Spieß, Volkhard, Zabel, Matthias (2021). Dataset: Multibeam...
service.tib.eu
Updated Nov 29, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). Wenau, Stefan, Spieß, Volkhard, Zabel, Matthias (2021). Dataset: Multibeam bathymetry processed data (EM 120 echosounder dataset compilation) of RV METEOR & RV MARIA S. MERIAN during cruise M76/1 & MSM19/1c, Namibian continental slope. https://doi.org/10.1594/PANGAEA.932434 [Dataset]. https://service.tib.eu/ldmservice/dataset/png-doi-10-1594-pangaea-932434
Explore at:
Dataset updated
Nov 29, 2024
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Namibia
Description
The data contain bathymetric data from the Namibia continental slope. The data were acquired on R/V Meteor research expeditions M76/1 in 2008, and R/V Maria S. Merian expedition MSM19/1c in 2011. The purpose of the data was the exploration of the Namibian continental slope and espressially the investigation of large seafloor depressions. The bathymetric data were acquired with the 191-beam 12 kHz Kongsberg EM120 system. The data were processed using the public software package MBSystems. The loaded data were cleaned semi-automatically and manually, removing outliers and other erroneous data. Initial velocity fields were adjusted to remove artifacts from the data. Gridding was done in 10x10 m grid cells for the MSM19-1c dataset and 50x50 m for the M76 dataset using the Gaussian Weighted Mean algorithm.
Sales data for bulls
kaggle.com
zip
Updated Apr 6, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yerkin Mudebayev (2023). Sales data for bulls [Dataset]. https://www.kaggle.com/datasets/yerkinmudebayev/sales-data-for-bulls
Explore at:
zip(413774 bytes)Available download formats
Dataset updated
Apr 6, 2023
Authors
Yerkin Mudebayev
Description
Preliminary investigation (a) Carry out a shortened initial investigation (steps 1, 2 and 3) based on the matrix scatter plot and box plot. Do not remove outliers or transform the data. Indicate if you had to process the data file in anyway. Explain any conclusions drawn from the evidence and backup your conclusions. (b) Explain why using the correlation matrix for the factor analysis is indicated. (c) Display the sample correlation matrix R. Does the matrix R suggest the number of factors to use? (d) Perform a preliminary simplified principal component analysis using R. i. List the eigenvalues and describe the percent contributions to the variance. ii. Determine the number of principal components to retain and justify your an- swer by considering at least three methods. Note and comment if there is any disagreement between the methods. (e) Include your code
g
DC Estimated Tree Root Zone (Structural) | gimi9.com
gimi9.com
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
DC Estimated Tree Root Zone (Structural) | gimi9.com [Dataset]. https://gimi9.com/dataset/data-gov_dc-estimated-tree-root-zone-structural/
Explore at:
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Washington
Description
DC 2022 LiDAR was used and processed using the “Extract Trees using Cluster Analysis” script which is included as part of Esri’s 3D Basemap solution. The extracted tree data set was merged with the UFA tree inventory data, with preference given to the UFA tree inventory data. All LiDAR-derived trees within 2 meters of a UFA tree were removed as being duplicates.
g
DC Tree Structure and Benefits | gimi9.com
gimi9.com
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
DC Tree Structure and Benefits | gimi9.com [Dataset]. https://gimi9.com/dataset/data-gov_dc-tree-structure-and-benefits/
Explore at:
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Washington
Description
🇺🇸 미국 English DC 2022 LiDAR was used and processed using the “Extract Trees using Cluster Analysis” script which is included as part of Esri’s 3D Basemap solution. All LiDAR-derived trees within 2 meters of a Urban Forestry Division tree were removed as being duplicates.Tree diameter (DBH, in inches) was estimated for the LiDAR-derived trees from calculated tree height (in feet) based on the equation: DBH = 0.4003*height - 1.9557. This equation was derived from a statistical analysis of a detailed park inventory tree data set and has an R^2 = 0.7418.Extreme outliers were also modified, with any DBH larger than 80 inches being converted to a DBH of 80 inches.The combined data set was processed using the USDA Forest Service i-Tree eco software, where structure and environmental benefits were estimated.
paper mill response data
kaggle.com
zip
Updated Apr 6, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yerkin Mudebayev (2023). paper mill response data [Dataset]. https://www.kaggle.com/datasets/yerkinmudebayev/paper-mill-response-data/discussion
Explore at:
zip(689197 bytes)Available download formats
Dataset updated
Apr 6, 2023
Authors
Yerkin Mudebayev
Description
The project is to conduct a principal components analysis of the paper mill response data (paper_mill_data_response.txt, Aldrin, M., "Moderate projection pursuit regression for multivariate response data", Computational Statistics and Data Analysis, 21 (1996), p. 501-531). (a) Label the variables r1,···, r13. Carry out an initial investigation. Do not remove outliers or transform the data. Indicate if you had to process the data file in anyway. Explain any conclusions drawn from the evidence and backup your conclusions. (b) Display the sample correlation matrix R. (c) Perform a principal component analysis using R. i. List the eigenvalues and describe the percent contributions to the variance. ii. Determine the number of principal components to retain and justify your an- swer by considering at least three methods. Note and comment if there is any disagreement between the methods. iii. Give the eigenvectors for the first two principal components and write out the principal components. iv. Considering the coefficients of the principal components, describe dependencies of the principal components on the variables. v. Display a scatter plot of the first two principal components. Make observations about the plots. (d) Include your code.
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

Shuxiao Chen; Jacob Bien (2023). Valid Inference Corrected for Outlier Removal [Dataset]. http://doi.org/10.6084/m9.figshare.9762731.v4

Data from: Valid Inference Corrected for Outlier Removal

Explore at:

pdfAvailable download formats

Unique identifier

https://doi.org/10.6084/m9.figshare.9762731.v4

Dataset updated

Jun 4, 2023

Dataset provided by

Taylor & Francishttps://taylorandfrancis.com/

Authors

Shuxiao Chen; Jacob Bien

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Ordinary least square (OLS) estimation of a linear regression model is well-known to be highly sensitive to outliers. It is common practice to (1) identify and remove outliers by looking at the data and (2) to fit OLS and form confidence intervals and p-values on the remaining data as if this were the original data collected. This standard “detect-and-forget” approach has been shown to be problematic, and in this article we highlight the fact that it can lead to invalid inference and show how recently developed tools in selective inference can be used to properly account for outlier detection and removal. Our inferential procedures apply to a general class of outlier removal procedures that includes several of the most commonly used approaches. We conduct simulations to corroborate the theoretical results, and we apply our method to three real datasets to illustrate how our inferential results can differ from the traditional detect-and-forget strategy. A companion R package, outference, implements these new procedures with an interface that matches the functions commonly used for inference with lm in R. Supplementary materials for this article are available online.

Clear search

Close search

Google apps

Main menu

Data from: Valid Inference Corrected for Outlier Removal

KC_House Dataset -Linear Regression of Home Prices

Life Expectancy WHO

We use DECISION TREE MODEL for the analysis.

We use RANDOM FOREST for the analysis.

Occurrence Record Dataset from "Depth Matters for Marine Biodiversity"

Identification of Performance Changes at Code Level (Measurement...

Wenau, Stefan, Spieß, Volkhard, Zabel, Matthias (2021). Dataset: Multibeam...

Sales data for bulls

DC Estimated Tree Root Zone (Structural) | gimi9.com

DC Tree Structure and Benefits | gimi9.com

paper mill response data

Data from: Valid Inference Corrected for Outlier Removal