14 datasets found

Data from: Water-quality data imputation with a high percentage of missing...
zenodo.org
data.niaid.nih.gov
csv
Updated Jun 8, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rafael Rodríguez; Rafael Rodríguez; Marcos Pastorini; Marcos Pastorini; Lorena Etcheverry; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Alberto Castro; Angela Gorgoglione; Angela Gorgoglione; Christian Chreties; Mónica Fossati (2021). Water-quality data imputation with a high percentage of missing values: a machine learning approach [Dataset]. http://doi.org/10.5281/zenodo.4731169
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.4731169
Dataset updated
Jun 8, 2021
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Rafael Rodríguez; Rafael Rodríguez; Marcos Pastorini; Marcos Pastorini; Lorena Etcheverry; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Alberto Castro; Angela Gorgoglione; Angela Gorgoglione; Christian Chreties; Mónica Fossati
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The monitoring of surface-water quality followed by water-quality modeling and analysis is essential for generating effective strategies in water resource management. However, water-quality studies are limited by the lack of complete and reliable data sets on surface-water-quality variables. These deficiencies are particularly noticeable in developing countries.

This work focuses on surface-water-quality data from Santa Lucía Chico river (Uruguay), a mixed lotic and lentic river system. Data collected at six monitoring stations are publicly available at https://www.dinama.gub.uy/oan/datos-abiertos/calidad-agua/. The high temporal and spatial variability that characterizes water-quality variables and the high rate of missing values (between 50% and 70%) raises significant challenges.

To deal with missing values, we applied several statistical and machine-learning imputation methods. The competing algorithms implemented belonged to both univariate and multivariate imputation methods (inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR), and K-nearest neighbors Regressor (KNNR)).

IDW outperformed the others, achieving a very good performance (NSE greater than 0.8) in most cases.

In this dataset, we include the original and imputed values for the following variables:

Water temperature (Tw)

Dissolved oxygen (DO)

Electrical conductivity (EC)

pH

Turbidity (Turb)

Nitrite (NO2-)

Nitrate (NO3-)

Total Nitrogen (TN)

Each variable is identified as [STATION] VARIABLE FULL NAME (VARIABLE SHORT NAME) [UNIT METRIC].

More details about the study area, the original datasets, and the methodology adopted can be found in our paper https://www.mdpi.com/2071-1050/13/11/6318.

If you use this dataset in your work, please cite our paper:
Rodríguez, R.; Pastorini, M.; Etcheverry, L.; Chreties, C.; Fossati, M.; Castro, A.; Gorgoglione, A. Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach. Sustainability 2021, 13, 6318. https://doi.org/10.3390/su13116318
f
Data from: Pay Attention to the Ignorable Missing Data Mechanisms! An...
tandf.figshare.com
pdf
Updated Dec 23, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lihan Chen; Victoria Savalei; Mijke Rhemtulla (2023). Pay Attention to the Ignorable Missing Data Mechanisms! An Exploration of Their Impact on the Efficiency of Regression Coefficients [Dataset]. http://doi.org/10.6084/m9.figshare.22587018.v1
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.22587018.v1
Dataset updated
Dec 23, 2023
Dataset provided by
Taylor & Francis
Authors
Lihan Chen; Victoria Savalei; Mijke Rhemtulla
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The use of modern missing data techniques has become more prevalent with their increasing accessibility in statistical software. These techniques focus on handling data that are missing at random (MAR). Although all MAR mechanisms are routinely treated as the same, they are not equal. The impact of missing data on the efficiency of parameter estimates can differ for different MAR variations, even when the amount of missing data is held constant; yet, in current practice, only the rate of missing data is reported. The impact of MAR on the loss of efficiency can instead be more directly measured by the fraction of missing information (FMI). In this article, we explore this impact using FMIs in regression models with one and two predictors. With the help of a Shiny application, we demonstrate that efficiency loss due to missing data can be highly complex and is not always intuitive. We recommend substantive researchers who work with missing data report estimates of FMIs in addition to the rate of missingness. We also encourage methodologists to examine FMIs when designing simulation studies with missing data, and to explore the behavior of efficiency loss under MAR using FMIs in more complex models.
f
Data_Sheet_1_Cognitive Diagnosis Modeling Incorporating Item-Level Missing...
frontiersin.figshare.com
pdf
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Na Shan; Xiaofei Wang (2023). Data_Sheet_1_Cognitive Diagnosis Modeling Incorporating Item-Level Missing Data Mechanism.PDF [Dataset]. http://doi.org/10.3389/fpsyg.2020.564707.s001
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.3389/fpsyg.2020.564707.s001
Dataset updated
May 30, 2023
Dataset provided by
Frontiers
Authors
Na Shan; Xiaofei Wang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The aim of cognitive diagnosis is to classify respondents' mastery status of latent attributes from their responses on multiple items. Since respondents may answer some but not all items, item-level missing data often occur. Even if the primary interest is to provide diagnostic classification of respondents, misspecification of missing data mechanism may lead to biased conclusions. This paper proposes a joint cognitive diagnosis modeling of item responses and item-level missing data mechanism. A Bayesian Markov chain Monte Carlo (MCMC) method is developed for model parameter estimation. Our simulation studies examine the parameter recovery under different missing data mechanisms. The parameters could be recovered well with correct use of missing data mechanism for model fit, and missing that is not at random is less sensitive to incorrect use. The Program for International Student Assessment (PISA) 2015 computer-based mathematics data are applied to demonstrate the practical value of the proposed method.
Data from: Testing of the effect of missing data estimation and distribution...
zenodo.org
data.niaid.nih.gov
+1more
Updated May 29, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Caleb Marshall Brown; Jessica H. Arbour; Donald A. Jackson; Caleb Marshall Brown; Jessica H. Arbour; Donald A. Jackson (2022). Data from: Testing of the effect of missing data estimation and distribution in morphometric multivariate data analyses [Dataset]. http://doi.org/10.5061/dryad.m01st7p0
Explore at:
Unique identifier
https://doi.org/10.5061/dryad.m01st7p0
Dataset updated
May 29, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Caleb Marshall Brown; Jessica H. Arbour; Donald A. Jackson; Caleb Marshall Brown; Jessica H. Arbour; Donald A. Jackson
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Missing data are an unavoidable problem in biological datasets and the performance of missing data deletion and estimation techniques in morphometric datasets are poorly understood. Here a novel method is used to measure the introduced error of multiple techniques on a representative sample. A large sample of extant crocodilian skulls was measured and analyzed with principal components analysis (PCA). Twenty-three different proportions of missing data were introduced into the dataset, estimated, analyzed, and compared to the original result using Procrustes superimposition. Previous work investigating the effects of missing data input missing values randomly, a non-biological phenomenon. Here, missing data were introduced into the dataset using three methodologies: purely at random, as a function of the Euclidean distance between respective measurements (simulating anatomical regions), and as a function of the portion of the sample occupied by each taxon (simulating unequal missing data in rare taxa). Gower's distance was found to be the best performing non-estimation method, and Bayesian PCA the best performing estimation method. Specimens of the taxa with small sample sizes and those most morphologically disparate had the highest estimation error. Distribution of missing data had a significant effect on the estimation error for almost all methods and proportions. Taxonomically biased missing data tended to show similar trends to random, but with higher error rates. Anatomically biased missing data showed a much greater deviation from random than the taxonomic bias, and with magnitudes dependent on the estimation method.
f
Data from: Integrating Multisource Block-Wise Missing Data in Model...
tandf.figshare.com
pdf
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Fei Xue; Annie Qu (2023). Integrating Multisource Block-Wise Missing Data in Model Selection [Dataset]. http://doi.org/10.6084/m9.figshare.12100701.v2
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.12100701.v2
Dataset updated
May 31, 2023
Dataset provided by
Taylor & Francis
Authors
Fei Xue; Annie Qu
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
For multisource data, blocks of variable information from certain sources are likely missing. Existing methods for handling missing data do not take structures of block-wise missing data into consideration. In this article, we propose a multiple block-wise imputation (MBI) approach, which incorporates imputations based on both complete and incomplete observations. Specifically, for a given missing pattern group, the imputations in MBI incorporate more samples from groups with fewer observed variables in addition to the group with complete observations. We propose to construct estimating equations based on all available information, and integrate informative estimating functions to achieve efficient estimators. We show that the proposed method has estimation and model selection consistency under both fixed-dimensional and high-dimensional settings. Moreover, the proposed estimator is asymptotically more efficient than the estimator based on a single imputation from complete observations only. In addition, the proposed method is not restricted to missing completely at random. Numerical studies and ADNI data application confirm that the proposed method outperforms existing variable selection methods under various missing mechanisms. Supplementary materials for this article are available online.
Z
Data from: A real data-driven simulation strategy to select an imputation...
data.niaid.nih.gov
Updated Feb 16, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Adamowicz, Sarah J. (2023). Data from: A real data-driven simulation strategy to select an imputation method for mixed-type trait data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7644193
Explore at:
Dataset updated
Feb 16, 2023
Dataset provided by
Feng, Zeny
Adamowicz, Sarah J.
May, Jacqueline A.
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Missing observations in trait datasets pose an obstacle for analyses in myriad biological disciplines. Considering the mixed results of imputation, the wide variety of available methods, and the varied structure of real trait datasets, a framework for selecting a suitable imputation method is advantageous. We invoked a real data-driven simulation strategy to select an imputation method for a given mixed-type (categorical, count, continuous) target dataset. Candidate methods included mean/mode imputation, k-nearest neighbour, random forests, and multivariate imputation by chained equations (MICE). Using a trait dataset of squamates (lizards and amphisbaenians; order: Squamata) as a target dataset, a complete-case dataset consisting of species with nearly completed information was formed for the imputation method selection. Missing data were induced by removing values from this dataset under different missingness mechanisms: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). For each method, combinations with and without phylogenetic information from single gene (nuclear and mitochondrial) or multigene trees were used to impute the missing values for five numerical and two categorical traits. The performances of the methods were evaluated under each missing mechanism by determining the mean squared error and proportion falsely classified rates for numerical and categorical traits, respectively. A random forest method supplemented with a nuclear-derived phylogeny resulted in the lowest error rates for the majority of traits, and this method was used to impute missing values in the original dataset. Data with imputed values better reflected the characteristics and distributions of the original data compared to complete-case data. However, caution should be taken when imputing trait data as phylogeny did not always improve performance for every trait and in every scenario. Ultimately, these results support the use of a real data-driven simulation strategy for selecting a suitable imputation method for a given mixed-type trait dataset.
f
Table_1_Comparison of machine learning and logistic regression as predictive...
frontiersin.figshare.com
xlsx
Updated Jun 13, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dongying Zheng; Xinyu Hao; Muhanmmad Khan; Lixia Wang; Fan Li; Ning Xiang; Fuli Kang; Timo Hamalainen; Fengyu Cong; Kedong Song; Chong Qiao (2023). Table_1_Comparison of machine learning and logistic regression as predictive models for adverse maternal and neonatal outcomes of preeclampsia: A retrospective study.XLSX [Dataset]. http://doi.org/10.3389/fcvm.2022.959649.s003
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.3389/fcvm.2022.959649.s003
Dataset updated
Jun 13, 2023
Dataset provided by
Frontiers
Authors
Dongying Zheng; Xinyu Hao; Muhanmmad Khan; Lixia Wang; Fan Li; Ning Xiang; Fuli Kang; Timo Hamalainen; Fengyu Cong; Kedong Song; Chong Qiao
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
IntroductionPreeclampsia, one of the leading causes of maternal and fetal morbidity and mortality, demands accurate predictive models for the lack of effective treatment. Predictive models based on machine learning algorithms demonstrate promising potential, while there is a controversial discussion about whether machine learning methods should be recommended preferably, compared to traditional statistical models.MethodsWe employed both logistic regression and six machine learning methods as binary predictive models for a dataset containing 733 women diagnosed with preeclampsia. Participants were grouped by four different pregnancy outcomes. After the imputation of missing values, statistical description and comparison were conducted preliminarily to explore the characteristics of documented 73 variables. Sequentially, correlation analysis and feature selection were performed as preprocessing steps to filter contributing variables for developing models. The models were evaluated by multiple criteria.ResultsWe first figured out that the influential variables screened by preprocessing steps did not overlap with those determined by statistical differences. Secondly, the most accurate imputation method is K-Nearest Neighbor, and the imputation process did not affect the performance of the developed models much. Finally, the performance of models was investigated. The random forest classifier, multi-layer perceptron, and support vector machine demonstrated better discriminative power for prediction evaluated by the area under the receiver operating characteristic curve, while the decision tree classifier, random forest, and logistic regression yielded better calibration ability verified, as by the calibration curve.ConclusionMachine learning algorithms can accomplish prediction modeling and demonstrate superior discrimination, while Logistic Regression can be calibrated well. Statistical analysis and machine learning are two scientific domains sharing similar themes. The predictive abilities of such developed models vary according to the characteristics of datasets, which still need larger sample sizes and more influential predictors to accumulate evidence.
Data from: A new method for handling missing species in diversification...
zenodo.org
data.niaid.nih.gov
+1more
zip
Updated May 30, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Natalie Cusimano; Tanja Stadler; Susanne S. Renner; Natalie Cusimano; Tanja Stadler; Susanne S. Renner (2022). Data from: A new method for handling missing species in diversification analysis applicable to randomly or non-randomly sampled phylogenies [Dataset]. http://doi.org/10.5061/dryad.r8f04fk2
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.r8f04fk2
Dataset updated
May 30, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Natalie Cusimano; Tanja Stadler; Susanne S. Renner; Natalie Cusimano; Tanja Stadler; Susanne S. Renner
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Chronograms from molecular dating are increasingly being used to infer rates of diversification and their change over time. A major limitation in such analyses is incomplete species sampling that moreover is usually non-random. While the widely used γ statistic with the MCCR test or the birth-death likelihood analysis with the ∆AICrc test statistic are appropriate for comparing the fit of different diversification models in phylogenies with random species sampling, no objective, automated method has been developed for fitting diversification models to non-randomly sampled phylogenies. Here we introduce a novel approach, CorSiM, which involves simulating missing splits under a constant-rate birth-death model and allows the user to specify whether species sampling in the phylogeny being analyzed is random or non-random. The completed trees can be used in subsequent model-fitting analyses. This is fundamentally different from previous diversification rate estimation methods, which were based on null distributions derived from the incomplete trees. CorSiM is automated in an R package and can easily be applied to large data sets. We illustrate the approach in two Araceae clades, one with a random species sampling of 52% and one with a non-random sampling of 55%. In the latter clade, the CorSiM approach detects and quantifies an increase in diversification rate while classic approaches prefer a constant rate model, whereas in the former clade, results do not differ among methods (as indeed expected since the classic approaches are valid only for randomly sampled phylogenies). The CorSiM method greatly reduces the type I error in diversification analysis, but type II error remains a methodological problem.
Data from: A hierarchical Bayesian approach for handling missing...
zenodo.org
data.niaid.nih.gov
+1more
bin, zip
Updated May 31, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alison C. Ketz; Therese L. Johnson; Mevin B. Hooten; M. Thompson Hobbs; Alison C. Ketz; Therese L. Johnson; Mevin B. Hooten; M. Thompson Hobbs (2022). Data from: A hierarchical Bayesian approach for handling missing classification data [Dataset]. http://doi.org/10.5061/dryad.8h36t01
Explore at:
bin, zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.8h36t01
Dataset updated
May 31, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Alison C. Ketz; Therese L. Johnson; Mevin B. Hooten; M. Thompson Hobbs; Alison C. Ketz; Therese L. Johnson; Mevin B. Hooten; M. Thompson Hobbs
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Ecologists use classifications of individuals in categories to understand composition of populations and communities. These categories might be defined by demographics, functional traits, or species. Assignment of categories is often imperfect, but frequently treated as observations without error. When individuals are observed but not classified, these "partial" observations must be modified to include the missing data mechanism to avoid spurious inference.

We developed two hierarchical Bayesian models to overcome the assumption of perfect assignment to mutually exclusive categories in the multinomial distribution of categorical counts, when classifications are missing. These models incorporate auxiliary information to adjust the posterior distributions of the proportions of membership in categories. In one model, we use an empirical Bayes approach, where a subset of data from one year serves as a prior for the missing data the next. In the other approach, we use a small random sample of data within a year to inform the distribution of the missing data.

We performed a simulation to show the bias that occurs when partial observations were ignored and demonstrated the altered inference for the estimation of demographic ratios. We applied our models to demographic classifications of elk (Cervus elaphus nelsoni) to demonstrate improved inference for the proportions of sex and stage classes.

We developed multiple modeling approaches using a generalizable nested multinomial structure to account for partially observed data that were missing not at random for classification counts. Accounting for classification uncertainty is important to accurately understand the composition of populations and communities in ecological studies.
Data from: Investigating the contributors to hit-and-run crashes using...
figshare.com
xlsx
Updated Oct 7, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gen Li (2024). Investigating the contributors to hit-and-run crashes using gradient boosting decision trees [Dataset]. http://doi.org/10.6084/m9.figshare.27178305.v1
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.27178305.v1
Dataset updated
Oct 7, 2024
Dataset provided by
figshare
Authors
Gen Li
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This paper uses the 2021 traffic crash data from the NHTSA CRSS as a sample for model training and validation. The CRSS data collects crash report data provided by police departments from all 50 states in the United States. It details various factors of each traffic crash, including crash information, driver information, vehicle information, road information, and environmental information.The crash accident data provided by CRSS include crash-related details such as the location, time, cause, type of crash, driver’s age, gender, attention level, injury status, risky driving behavior, vehicle type, usage, damage, and hit-and-run situations. However, due to the separate recording of the dataset and the presence of systematic errors and redundant information, the CRSS 2021 data undergo the following merging and filtering processes:1) Match and merge separately recorded data based on the unique case number "CASENUM" in the dataset.2) Records with missing values in critical variables (e.g., whether the crash involved a hit-and-run) were removed to avoid bias in the analysis. For non-critical variables, missing values were imputed using the mean or mode depending on the variable type. For continuous variables, such as speed limits, we used mean imputation. For categorical variables (e.g., weather, road surface conditions), mode imputation was applied.3) Noise in the dataset arises from both human error in crash reporting and random fluctuations in recorded variables. We used z-scores to detect and remove extreme outliers in numerical variables (e.g., speed limits, crash angle). Data points with a z-score beyond ±3 standard deviations were considered outliers and were excluded from the analysis. To handle noisy fluctuations in continuous variables (e.g., speed limits), we applied a symmetrical exponential moving average (EMA) filter.After processing, the CRSS 2021 data include a total of 54,187 crash accidents, among which there are 5,944 hit-and-run accidents, accounting for 10.97% of crash accidents. The hit-and-run and non-hit-and-run categories face a serious class imbalance issue, and data balancing processing is applied to the target variable during parameter calibration. Hit-and-run crashes constitute a relatively small proportion of total crashes in the dataset, leading to class imbalance in the binary classification target. To address this issue, we utilized the resampling techniques available in the data mining software. Specifically, random undersampling was applied to the majority class (non-hit-and-run crashes), while Synthetic Minority Over-sampling Technique (SMOTE) was used for the minority class. This ensured balanced class distribution in the training set, improving model performance and preventing the classifier from being biased toward the majority class.
d
Variable Terrestrial GPS Telemetry Detection Rates: Parts 1 - 7—Data
datasets.ai
data.usgs.gov
+1more
55
Updated Sep 11, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Department of the Interior (2024). Variable Terrestrial GPS Telemetry Detection Rates: Parts 1 - 7—Data [Dataset]. https://datasets.ai/datasets/variable-terrestrial-gps-telemetry-detection-rates-parts-1-7data
Explore at:
55Available download formats
Dataset updated
Sep 11, 2024
Dataset authored and provided by
Department of the Interior
Description
Studies utilizing Global Positioning System (GPS) telemetry rarely result in 100% fix success rates (FSR). Many assessments of wildlife resource use do not account for missing data, either assuming data loss is random or because a lack of practical treatment for systematic data loss. Several studies have explored how the environment, technological features, and animal behavior influence rates of missing data in GPS telemetry, but previous spatially explicit models developed to correct for sampling bias have been specified to small study areas, on a small range of data loss, or to be species-specific, limiting their general utility. Here we explore environmental effects on GPS fix acquisition rates across a wide range of environmental conditions and detection rates for bias correction of terrestrial GPS-derived, large mammal habitat use. We also evaluate patterns in missing data that relate to potential animal activities that change the orientation of the antennae and characterize home-range probability of GPS detection for 4 focal species; cougars (Puma concolor), desert bighorn sheep (Ovis canadensis nelsoni), Rocky Mountain elk (Cervus elaphus ssp. nelsoni) and mule deer (Odocoileus hemionus). Part 1, Positive Openness Raster (raster dataset): Openness is an angular measure of the relationship between surface relief and horizontal distance. For angles less than 90 degrees it is equivalent to the internal angle of a cone with its apex at a DEM location, and is constrained by neighboring elevations within a specified radial distance. 480 meter search radius was used for this calculation of positive openness. Openness incorporates the terrain line-of-sight or viewshed concept and is calculated from multiple zenith and nadir angles-here along eight azimuths. Positive openness measures openness above the surface, with high values for convex forms and low values for concave forms (Yokoyama et al. 2002). We calculated positive openness using a custom python script, following the methods of Yokoyama et. al (2002) using a USGS National Elevation Dataset as input. Part 2, Northern Arizona GPS Test Collar (csv): Bias correction in GPS telemetry data-sets requires a strong understanding of the mechanisms that result in missing data. We tested wildlife GPS collars in a variety of environmental conditions to derive a predictive model of fix acquisition. We found terrain exposure and tall over-story vegetation are the primary environmental features that affect GPS performance. Model evaluation showed a strong correlation (0.924) between observed and predicted fix success rates (FSR) and showed little bias in predictions. The model's predictive ability was evaluated using two independent data-sets from stationary test collars of different make/model, fix interval programming, and placed at different study sites. No statistically significant differences (95% CI) between predicted and observed FSRs, suggest changes in technological factors have minor influence on the models ability to predict FSR in new study areas in the southwestern US. The model training data are provided here for fix attempts by hour. This table can be linked with the site location shapefile using the site field. Part 3, Probability Raster (raster dataset): Bias correction in GPS telemetry datasets requires a strong understanding of the mechanisms that result in missing data. We tested wildlife GPS collars in a variety of environmental conditions to derive a predictive model of fix aquistion. We found terrain exposure and tall overstory vegetation are the primary environmental features that affect GPS performance. Model evaluation showed a strong correlation (0.924) between observed and predicted fix success rates (FSR) and showed little bias in predictions. The models predictive ability was evaluated using two independent datasets from stationary test collars of different make/model, fix interval programing, and placed at different study sites. No statistically significant differences (95% CI) between predicted and observed FSRs, suggest changes in technological factors have minor influence on the models ability to predict FSR in new study areas in the southwestern US. We evaluated GPS telemetry datasets by comparing the mean probability of a successful GPS fix across study animals home-ranges, to the actual observed FSR of GPS downloaded deployed collars on cougars (Puma concolor), desert bighorn sheep (Ovis canadensis nelsoni), Rocky Mountain elk (Cervus elaphus ssp. nelsoni) and mule deer (Odocoileus hemionus). Comparing the mean probability of acquisition within study animals home-ranges and observed FSRs of GPS downloaded collars resulted in a approximatly 1:1 linear relationship with an r-sq= 0.68. Part 4, GPS Test Collar Sites (shapefile): Bias correction in GPS telemetry data-sets requires a strong understanding of the mechanisms that result in missing data. We tested wildlife GPS collars in a variety of environmental conditions to derive a predictive model of fix acquisition. We found terrain exposure and tall over-story vegetation are the primary environmental features that affect GPS performance. Model evaluation showed a strong correlation (0.924) between observed and predicted fix success rates (FSR) and showed little bias in predictions. The model's predictive ability was evaluated using two independent data-sets from stationary test collars of different make/model, fix interval programming, and placed at different study sites. No statistically significant differences (95% CI) between predicted and observed FSRs, suggest changes in technological factors have minor influence on the models ability to predict FSR in new study areas in the southwestern US. Part 5, Cougar Home Ranges (shapefile): Cougar home-ranges were calculated to compare the mean probability of a GPS fix acquisition across the home-range to the actual fix success rate (FSR) of the collar as a means for evaluating if characteristics of an animal’s home-range have an effect on observed FSR. We estimated home-ranges using the Local Convex Hull (LoCoH) method using the 90th isopleth. Data obtained from GPS download of retrieved units were only used. Satellite delivered data was omitted from the analysis for animals where the collar was lost or damaged because satellite delivery tends to lose as additional 10% of data. Comparisons with home-range mean probability of fix were also used as a reference for assessing if the frequency animals use areas of low GPS acquisition rates may play a role in observed FSRs. Part 6, Cougar Fix Success Rate by Hour (csv): Cougar GPS collar fix success varied by hour-of-day suggesting circadian rhythms with bouts of rest during daylight hours may change the orientation of the GPS receiver affecting the ability to acquire fixes. Raw data of overall fix success rates (FSR) and FSR by hour were used to predict relative reductions in FSR. Data only includes direct GPS download datasets. Satellite delivered data was omitted from the analysis for animals where the collar was lost or damaged because satellite delivery tends to lose approximately an additional 10% of data. Part 7, Openness Python Script version 2.0: This python script was used to calculate positive openness using a 30 meter digital elevation model for a large geographic area in Arizona, California, Nevada and Utah. A scientific research project used the script to explore environmental effects on GPS fix acquisition rates across a wide range of environmental conditions and detection rates for bias correction of terrestrial GPS-derived, large mammal habitat use.
New Insights into Handling Missing Values in Environmental Epidemiological...
plos.figshare.com
pdf
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Célina Roda; Ioannis Nicolis; Isabelle Momas; Chantal Guihenneuc (2023). New Insights into Handling Missing Values in Environmental Epidemiological Studies [Dataset]. http://doi.org/10.1371/journal.pone.0104254
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0104254
Dataset updated
May 31, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
Célina Roda; Ioannis Nicolis; Isabelle Momas; Chantal Guihenneuc
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Missing data are unavoidable in environmental epidemiologic surveys. The aim of this study was to compare methods for handling large amounts of missing values: omission of missing values, single and multiple imputations (through linear regression or partial least squares regression), and a fully Bayesian approach. These methods were applied to the PARIS birth cohort, where indoor domestic pollutant measurements were performed in a random sample of babies' dwellings. A simulation study was conducted to assess performances of different approaches with a high proportion of missing values (from 50% to 95%). Different simulation scenarios were carried out, controlling the true value of the association (odds ratio of 1.0, 1.2, and 1.4), and varying the health outcome prevalence. When a large amount of data is missing, omitting these missing data reduced statistical power and inflated standard errors, which affected the significance of the association. Single imputation underestimated the variability, and considerably increased risk of type I error. All approaches were conservative, except the Bayesian joint model. In the case of a common health outcome, the fully Bayesian approach is the most efficient approach (low root mean square error, reasonable type I error, and high statistical power). Nevertheless for a less prevalent event, the type I error is increased and the statistical power is reduced. The estimated posterior distribution of the OR is useful to refine the conclusion. Among the methods handling missing values, no approach is absolutely the best but when usual approaches (e.g. single imputation) are not sufficient, joint modelling approach of missing process and health association is more efficient when large amounts of data are missing.
f
A Simple Optimization Workflow to Enable Precise and Accurate Imputation of...
acs.figshare.com
figshare.com
xlsx
Updated Jun 11, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kruttika Dabke; Simion Kreimer; Michelle R. Jones; Sarah J. Parker (2023). A Simple Optimization Workflow to Enable Precise and Accurate Imputation of Missing Values in Proteomic Data Sets [Dataset]. http://doi.org/10.1021/acs.jproteome.1c00070.s003
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.1021/acs.jproteome.1c00070.s003
Dataset updated
Jun 11, 2023
Dataset provided by
ACS Publications
Authors
Kruttika Dabke; Simion Kreimer; Michelle R. Jones; Sarah J. Parker
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Missing values in proteomic data sets have real consequences on downstream data analysis and reproducibility. Although several imputation methods exist to handle missing values, no single imputation method is best suited for a diverse range of data sets, and no clear strategy exists for evaluating imputation methods for clinical DIA-MS data sets, especially at different levels of protein quantification. To navigate through the different imputation strategies available in the literature, we have established a strategy to assess imputation methods on clinical label-free DIA-MS data sets. We used three DIA-MS data sets with real missing values to evaluate eight imputation methods with multiple parameters at different levels of protein quantification: a dilution series data set, a small pilot data set, and a clinical proteomic data set comparing paired tumor and stroma tissue. We found that imputation methods based on local structures within the data, like local least-squares (LLS) and random forest (RF), worked well in our dilution series data set, whereas imputation methods based on global structures within the data, like BPCA, performed well in the other two data sets. We also found that imputation at the most basic protein quantification levelfragment levelimproved accuracy and the number of proteins quantified. With this analytical framework, we quickly and cost-effectively evaluated different imputation methods using two smaller complementary data sets to narrow down to the larger proteomic data set’s most accurate methods. This acquisition strategy allowed us to provide reproducible evidence of the accuracy of the imputation method, even in the absence of a ground truth. Overall, this study indicates that the most suitable imputation method relies on the overall structure of the data set and provides an example of an analytic framework that may assist in identifying the most appropriate imputation strategies for the differential analysis of proteins.
f
Validity of using multiple imputation for "unknown" stage at diagnosis in...
plos.figshare.com
tiff
Updated Jun 3, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Qingwei Luo; Sam Egger; Xue Qin Yu; David P. Smith; Dianne L. O’Connell (2023). Validity of using multiple imputation for "unknown" stage at diagnosis in population-based cancer registry data [Dataset]. http://doi.org/10.1371/journal.pone.0180033
Explore at:
tiffAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0180033
Dataset updated
Jun 3, 2023
Dataset provided by
PLOS ONE
Authors
Qingwei Luo; Sam Egger; Xue Qin Yu; David P. Smith; Dianne L. O’Connell
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
BackgroundThe multiple imputation approach to missing data has been validated by a number of simulation studies by artificially inducing missingness on fully observed stage data under a pre-specified missing data mechanism. However, the validity of multiple imputation has not yet been assessed using real data. The objective of this study was to assess the validity of using multiple imputation for “unknown” prostate cancer stage recorded in the New South Wales Cancer Registry (NSWCR) in real-world conditions.MethodsData from the population-based cohort study NSW Prostate Cancer Care and Outcomes Study (PCOS) were linked to 2000–2002 NSWCR data. For cases with “unknown” NSWCR stage, PCOS-stage was extracted from clinical notes. Logistic regression was used to evaluate the missing at random assumption adjusted for variables from two imputation models: a basic model including NSWCR variables only and an enhanced model including the same NSWCR variables together with PCOS primary treatment. Cox regression was used to evaluate the performance of MI.ResultsOf the 1864 prostate cancer cases 32.7% were recorded as having “unknown” NSWCR stage. The missing at random assumption was satisfied when the logistic regression included the variables included in the enhanced model, but not those in the basic model only. The Cox models using data with imputed stage from either imputation model provided generally similar estimated hazard ratios but with wider confidence intervals compared with those derived from analysis of the data with PCOS-stage. However, the complete-case analysis of the data provided a considerably higher estimated hazard ratio for the low socio-economic status group and rural areas in comparison with those obtained from all other datasets.ConclusionsUsing MI to deal with “unknown” stage data recorded in a population-based cancer registry appears to provide valid estimates. We would recommend a cautious approach to the use of this method elsewhere.
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

Rafael Rodríguez; Rafael Rodríguez; Marcos Pastorini; Marcos Pastorini; Lorena Etcheverry; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Alberto Castro; Angela Gorgoglione; Angela Gorgoglione; Christian Chreties; Mónica Fossati (2021). Water-quality data imputation with a high percentage of missing values: a machine learning approach [Dataset]. http://doi.org/10.5281/zenodo.4731169

Data from: Water-quality data imputation with a high percentage of missing values: a machine learning approach

Explore at:

csvAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.4731169

Dataset updated

Jun 8, 2021

Dataset provided by

Zenodohttp://zenodo.org/

Authors

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

The monitoring of surface-water quality followed by water-quality modeling and analysis is essential for generating effective strategies in water resource management. However, water-quality studies are limited by the lack of complete and reliable data sets on surface-water-quality variables. These deficiencies are particularly noticeable in developing countries.

This work focuses on surface-water-quality data from Santa Lucía Chico river (Uruguay), a mixed lotic and lentic river system. Data collected at six monitoring stations are publicly available at https://www.dinama.gub.uy/oan/datos-abiertos/calidad-agua/. The high temporal and spatial variability that characterizes water-quality variables and the high rate of missing values (between 50% and 70%) raises significant challenges.

To deal with missing values, we applied several statistical and machine-learning imputation methods. The competing algorithms implemented belonged to both univariate and multivariate imputation methods (inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR), and K-nearest neighbors Regressor (KNNR)).

IDW outperformed the others, achieving a very good performance (NSE greater than 0.8) in most cases.

In this dataset, we include the original and imputed values for the following variables:

Water temperature (Tw)
Dissolved oxygen (DO)
Electrical conductivity (EC)
pH
Turbidity (Turb)
Nitrite (NO2-)
Nitrate (NO3-)
Total Nitrogen (TN)

Each variable is identified as [STATION] VARIABLE FULL NAME (VARIABLE SHORT NAME) [UNIT METRIC].

More details about the study area, the original datasets, and the methodology adopted can be found in our paper https://www.mdpi.com/2071-1050/13/11/6318.

If you use this dataset in your work, please cite our paper:
Rodríguez, R.; Pastorini, M.; Etcheverry, L.; Chreties, C.; Fossati, M.; Castro, A.; Gorgoglione, A. Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach. Sustainability 2021, 13, 6318. https://doi.org/10.3390/su13116318

Clear search

Close search

Google apps

Main menu

Data from: Water-quality data imputation with a high percentage of missing...

Data from: Pay Attention to the Ignorable Missing Data Mechanisms! An...

Data_Sheet_1_Cognitive Diagnosis Modeling Incorporating Item-Level Missing...

Data from: Testing of the effect of missing data estimation and distribution...

Data from: Integrating Multisource Block-Wise Missing Data in Model...

Data from: A real data-driven simulation strategy to select an imputation...

Table_1_Comparison of machine learning and logistic regression as predictive...

Data from: A new method for handling missing species in diversification...

Data from: A hierarchical Bayesian approach for handling missing...

Data from: Investigating the contributors to hit-and-run crashes using...

Variable Terrestrial GPS Telemetry Detection Rates: Parts 1 - 7—Data

New Insights into Handling Missing Values in Environmental Epidemiological...

A Simple Optimization Workflow to Enable Precise and Accurate Imputation of...

Validity of using multiple imputation for "unknown" stage at diagnosis in...

Data from: Water-quality data imputation with a high percentage of missing values: a machine learning approachSee More Versions

Data from: Water-quality data imputation with a high percentage of missing values: a machine learning approach