79 datasets found

Understanding and Managing Missing Data.pdf
figshare.com
pdf
Updated Jun 9, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ibrahim Denis Fofanah (2025). Understanding and Managing Missing Data.pdf [Dataset]. http://doi.org/10.6084/m9.figshare.29265155.v1
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.29265155.v1
Dataset updated
Jun 9, 2025
Dataset provided by
Figsharehttp://figshare.com/
Authors
Ibrahim Denis Fofanah
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This document provides a clear and practical guide to understanding missing data mechanisms, including Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Through real-world scenarios and examples, it explains how different types of missingness impact data analysis and decision-making. It also outlines common strategies for handling missing data, including deletion techniques and imputation methods such as mean imputation, regression, and stochastic modeling.Designed for researchers, analysts, and students working with real-world datasets, this guide helps ensure statistical validity, reduce bias, and improve the overall quality of analysis in fields like public health, behavioral science, social research, and machine learning.
Water-quality data imputation with a high percentage of missing values: a...
zenodo.org
explore.openaire.eu
+1more
csv
Updated Jun 8, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rafael Rodríguez; Rafael Rodríguez; Marcos Pastorini; Marcos Pastorini; Lorena Etcheverry; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Alberto Castro; Angela Gorgoglione; Angela Gorgoglione; Christian Chreties; Mónica Fossati (2021). Water-quality data imputation with a high percentage of missing values: a machine learning approach [Dataset]. http://doi.org/10.5281/zenodo.4731169
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.4731169
Dataset updated
Jun 8, 2021
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Rafael Rodríguez; Rafael Rodríguez; Marcos Pastorini; Marcos Pastorini; Lorena Etcheverry; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Alberto Castro; Angela Gorgoglione; Angela Gorgoglione; Christian Chreties; Mónica Fossati
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The monitoring of surface-water quality followed by water-quality modeling and analysis is essential for generating effective strategies in water resource management. However, water-quality studies are limited by the lack of complete and reliable data sets on surface-water-quality variables. These deficiencies are particularly noticeable in developing countries.

This work focuses on surface-water-quality data from Santa Lucía Chico river (Uruguay), a mixed lotic and lentic river system. Data collected at six monitoring stations are publicly available at https://www.dinama.gub.uy/oan/datos-abiertos/calidad-agua/. The high temporal and spatial variability that characterizes water-quality variables and the high rate of missing values (between 50% and 70%) raises significant challenges.

To deal with missing values, we applied several statistical and machine-learning imputation methods. The competing algorithms implemented belonged to both univariate and multivariate imputation methods (inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR), and K-nearest neighbors Regressor (KNNR)).

IDW outperformed the others, achieving a very good performance (NSE greater than 0.8) in most cases.

In this dataset, we include the original and imputed values for the following variables:

Water temperature (Tw)

Dissolved oxygen (DO)

Electrical conductivity (EC)

pH

Turbidity (Turb)

Nitrite (NO2-)

Nitrate (NO3-)

Total Nitrogen (TN)

Each variable is identified as [STATION] VARIABLE FULL NAME (VARIABLE SHORT NAME) [UNIT METRIC].

More details about the study area, the original datasets, and the methodology adopted can be found in our paper https://www.mdpi.com/2071-1050/13/11/6318.

If you use this dataset in your work, please cite our paper:
Rodríguez, R.; Pastorini, M.; Etcheverry, L.; Chreties, C.; Fossati, M.; Castro, A.; Gorgoglione, A. Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach. Sustainability 2021, 13, 6318. https://doi.org/10.3390/su13116318
h
Restricted Boltzmann Machine for Missing Data Imputation in Biomedical...
datahub.hku.hk
Updated Aug 13, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Wen Ma (2020). Restricted Boltzmann Machine for Missing Data Imputation in Biomedical Datasets [Dataset]. http://doi.org/10.25442/hku.12752549.v1
Explore at:
Unique identifier
https://doi.org/10.25442/hku.12752549.v1
Dataset updated
Aug 13, 2020
Dataset provided by
HKU Data Repository
Authors
Wen Ma
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
NCCTG Lung cancer datasetSurvival in patients with advanced lung cancer from the North Central Cancer Treatment Group. Performance scores rate how well the patient can perform usual daily activities.2.CNV measurements of CNV of GBM This dataset records the information about copy number variation of Glioblastoma (GBM).Abstract:In biology and medicine, conservative patient and data collection malpractice can lead to missing or incorrect values in patient registries, which can affect both diagnosis and prognosis. Insufficient or biased patient information significantly impedes the sensitivity and accuracy of predicting cancer survival. In bioinformatics, making a best guess of the missing values and identifying the incorrect values are collectively called “imputation”. Existing imputation methods work by establishing a model based on the data mechanism of the missing values. Existing imputation methods work well under two assumptions: 1) the data is missing completely at random, and 2) the percentage of missing values is not high. These are not cases found in biomedical datasets, such as the Cancer Genome Atlas Glioblastoma Copy-Number Variant dataset (TCGA: 108 columns), or the North Central Cancer Treatment Group Lung Cancer (NCCTG) dataset (NCCTG: 9 columns). We tested six existing imputation methods, but only two of them worked with these datasets: The Last Observation Carried Forward (LOCF) and K-nearest Algorithm (KNN). Predictive Mean Matching (PMM) and Classification and Regression Trees (CART) worked only with the NCCTG lung cancer dataset with fewer columns, except when the dataset contains 45% missing data. The quality of the imputed values using existing methods is bad because they do not meet the two assumptions.In our study, we propose a Restricted Boltzmann Machine (RBM)-based imputation method to cope with low randomness and the high percentage of the missing values. RBM is an undirected, probabilistic and parameterized two-layer neural network model, which is often used for extracting abstract information from data, especially for high-dimensional data with unknown or non-standard distributions. In our benchmarks, we applied our method to two cancer datasets: 1) NCCTG, and 2) TCGA. The running time, root mean squared error (RMSE) of the different methods were gauged. The benchmarks for the NCCTG dataset show that our method performs better than other methods when there is 5% missing data in the dataset, with 4.64 RMSE lower than the best KNN. For the TCGA dataset, our method achieved 0.78 RMSE lower than the best KNN.In addition to imputation, RBM can achieve simultaneous predictions. We compared the RBM model with four traditional prediction methods. The running time and area under the curve (AUC) were measured to evaluate the performance. Our RBM-based approach outperformed traditional methods. Specifically, the AUC was up to 19.8% higher than the multivariate logistic regression model in the NCCTG lung cancer dataset, and the AUC was higher than the Cox proportional hazard regression model, with 28.1% in the TCGA dataset.Apart from imputation and prediction, RBM models can detect outliers in one pass by allowing the reconstruction of all the inputs in the visible layer with in a single backward pass. Our results show that RBM models have achieved higher precision and recall on detecting outliers than other methods.
Additional file 4 of Heckman imputation models for binary or continuous MNAR...
springernature.figshare.com
txt
Updated Jun 1, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jacques-Emmanuel Galimard; Sylvie Chevret; Emmanuel Curis; Matthieu Resche-Rigon (2023). Additional file 4 of Heckman imputation models for binary or continuous MNAR outcomes and MAR predictors [Dataset]. http://doi.org/10.6084/m9.figshare.7038104.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.7038104.v1
Dataset updated
Jun 1, 2023
Dataset provided by
Figsharehttp://figshare.com/
Authors
Jacques-Emmanuel Galimard; Sylvie Chevret; Emmanuel Curis; Matthieu Resche-Rigon
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
R code to impute binary outcome. (R 1 kb)
f
Data_Sheet_1_The Optimal Machine Learning-Based Missing Data Imputation for...
frontiersin.figshare.com
docx
Updated Jun 3, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Chao-Yu Guo; Ying-Chen Yang; Yi-Hau Chen (2023). Data_Sheet_1_The Optimal Machine Learning-Based Missing Data Imputation for the Cox Proportional Hazard Model.docx [Dataset]. http://doi.org/10.3389/fpubh.2021.680054.s001
Explore at:
docxAvailable download formats
Unique identifier
https://doi.org/10.3389/fpubh.2021.680054.s001
Dataset updated
Jun 3, 2023
Dataset provided by
Frontiers
Authors
Chao-Yu Guo; Ying-Chen Yang; Yi-Hau Chen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
An adequate imputation of missing data would significantly preserve the statistical power and avoid erroneous conclusions. In the era of big data, machine learning is a great tool to infer the missing values. The root means square error (RMSE) and the proportion of falsely classified entries (PFC) are two standard statistics to evaluate imputation accuracy. However, the Cox proportional hazards model using various types requires deliberate study, and the validity under different missing mechanisms is unknown. In this research, we propose supervised and unsupervised imputations and examine four machine learning-based imputation strategies. We conducted a simulation study under various scenarios with several parameters, such as sample size, missing rate, and different missing mechanisms. The results revealed the type-I errors according to different imputation techniques in the survival data. The simulation results show that the non-parametric “missForest” based on the unsupervised imputation is the only robust method without inflated type-I errors under all missing mechanisms. In contrast, other methods are not valid to test when the missing pattern is informative. Statistical analysis, which is improperly conducted, with missing data may lead to erroneous conclusions. This research provides a clear guideline for a valid survival analysis using the Cox proportional hazard model with machine learning-based imputations.
Sensitivity analysis for missing data in cost-effectiveness analysis: Stata...
figshare.com
bin
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Baptiste Leurent; Manuel Gomes; Rita Faria; Stephen Morris; Richard Grieve; James R Carpenter (2023). Sensitivity analysis for missing data in cost-effectiveness analysis: Stata code [Dataset]. http://doi.org/10.6084/m9.figshare.6714206.v1
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.6714206.v1
Dataset updated
May 31, 2023
Dataset provided by
Figsharehttp://figshare.com/
Authors
Baptiste Leurent; Manuel Gomes; Rita Faria; Stephen Morris; Richard Grieve; James R Carpenter
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Stata do-files and data to support tutorial "Sensitivity Analysis for Not-at-Random Missing Data in Trial-Based Cost-Effectiveness Analysis" (Leurent, B. et al. PharmacoEconomics (2018) 36: 889).Do-files should be similar to the code provided in the article's supplementary material.Dataset based on 10 Top Tips trial, but modified to preserve confidentiality. Results will differ from those published.
f
Data_Sheet_2_A Random Shuffle Method to Expand a Narrow Dataset and Overcome...
frontiersin.figshare.com
pdf
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lorenzo Fassina; Alessandro Faragli; Francesco Paolo Lo Muzio; Sebastian Kelle; Carlo Campana; Burkert Pieske; Frank Edelmann; Alessio Alogna (2023). Data_Sheet_2_A Random Shuffle Method to Expand a Narrow Dataset and Overcome the Associated Challenges in a Clinical Study: A Heart Failure Cohort Example.PDF [Dataset]. http://doi.org/10.3389/fcvm.2020.599923.s001
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.3389/fcvm.2020.599923.s001
Dataset updated
May 31, 2023
Dataset provided by
Frontiers
Authors
Lorenzo Fassina; Alessandro Faragli; Francesco Paolo Lo Muzio; Sebastian Kelle; Carlo Campana; Burkert Pieske; Frank Edelmann; Alessio Alogna
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Heart failure (HF) affects at least 26 million people worldwide, so predicting adverse events in HF patients represents a major target of clinical data science. However, achieving large sample sizes sometimes represents a challenge due to difficulties in patient recruiting and long follow-up times, increasing the problem of missing data. To overcome the issue of a narrow dataset cardinality (in a clinical dataset, the cardinality is the number of patients in that dataset), population-enhancing algorithms are therefore crucial. The aim of this study was to design a random shuffle method to enhance the cardinality of an HF dataset while it is statistically legitimate, without the need of specific hypotheses and regression models. The cardinality enhancement was validated against an established random repeated-measures method with regard to the correctness in predicting clinical conditions and endpoints. In particular, machine learning and regression models were employed to highlight the benefits of the enhanced datasets. The proposed random shuffle method was able to enhance the HF dataset cardinality (711 patients before dataset preprocessing) circa 10 times and circa 21 times when followed by a random repeated-measures approach. We believe that the random shuffle method could be used in the cardiovascular field and in other data science problems when missing data and the narrow dataset cardinality represent an issue.
f
Data from: Fast tipping point sensitivity analyses in clinical trials with...
tandf.figshare.com
application/gzip
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anders Gorst-Rasmussen; Mads Jeppe Tarp-Johansen (2023). Fast tipping point sensitivity analyses in clinical trials with missing continuous outcomes under multiple imputation [Dataset]. http://doi.org/10.6084/m9.figshare.19967496.v1
Explore at:
application/gzipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.19967496.v1
Dataset updated
Jun 1, 2023
Dataset provided by
Taylor & Francis
Authors
Anders Gorst-Rasmussen; Mads Jeppe Tarp-Johansen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
When dealing with missing data in clinical trials, it is often convenient to work under simplifying assumptions, such as missing at random (MAR), and follow up with sensitivity analyses to address unverifiable missing data assumptions. One such sensitivity analysis, routinely requested by regulatory agencies, is the so-called tipping point analysis, in which the treatment effect is re-evaluated after adding a successively more extreme shift parameter to the predicted values among subjects with missing data. If the shift parameter needed to overturn the conclusion is so extreme that it is considered clinically implausible, then this indicates robustness to missing data assumptions. Tipping point analyses are frequently used in the context of continuous outcome data under multiple imputation. While simple to implement, computation can be cumbersome in the two-way setting where both comparator and active arms are shifted, essentially requiring the evaluation of a two-dimensional grid of models. We describe a computationally efficient approach to performing two-way tipping point analysis in the setting of continuous outcome data with multiple imputation. We show how geometric properties can lead to further simplification when exploring the impact of missing data. Lastly, we propose a novel extension to a multi-way setting which yields simple and general sufficient conditions for robustness to missing data assumptions.
H
Replication Data for: Comparative investigation of time series missing data...
dataverse.harvard.edu
Updated Jul 24, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
LEIZHEN ZANG; Feng XIONG (2020). Replication Data for: Comparative investigation of time series missing data imputation in political science: Different methods, different results [Dataset]. http://doi.org/10.7910/DVN/GQHURF
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/GQHURF
Dataset updated
Jul 24, 2020
Dataset provided by
Harvard Dataverse
Authors
LEIZHEN ZANG; Feng XIONG
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Missing data is a growing concern in social science research. This paper introduces novel machine-learning methods to explore imputation efficiency and its effect on missing data. The authors used Internet and public service data as the test examples. The empirical results show that the method not only verified the robustness of the positive impact of Internet penetration on the public service, but also further ensured that the machine-learning imputation method was better than random and multiple imputation, greatly improving the model’s explanatory power. The panel data after machine-learning imputation with better continuity in the time trend is feasibly analyzed, which can also be analyzed using the dynamic panel model. The long-term effects of the Internet on public services were found to be significantly stronger than the short-term effects. Finally, some mechanisms in the empirical analysis are discussed.
Z
Multi-Label Datasets with Missing Values
data.niaid.nih.gov
Updated Mar 19, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Fabrício A. do Carmo (2023). Multi-Label Datasets with Missing Values [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7748932
Explore at:
Dataset updated
Mar 19, 2023
Dataset provided by
Ádamo L. de Santana
Fabrício A. do Carmo
Antonio F. L. Jacob Jr.
Ewaldo Santana
Fábio M. F. Lobato
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Consisting of six multi-label datasets from the UCI Machine Learning repository.

Each dataset contains missing values which have been artificially added at the following rates: 5, 10, 15, 20, 25, and 30%. The “amputation” was performed using the “Missing Completely at Random” mechanism.

File names are represented as follows:

amp_DB_MR.arff

where:

DB = original dataset; MR = missing rate.

For more details, please read:

IEEE Access article (in review process)
H
Replication Data for: Choosing Imputation Models
dataverse.harvard.edu
application/gzip, txt
Updated Sep 21, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Harvard Dataverse (2021). Replication Data for: Choosing Imputation Models [Dataset]. http://doi.org/10.7910/DVN/IIXGBM
Explore at:
application/gzip(8929325), txt(4447)Available download formats
Unique identifier
https://doi.org/10.7910/DVN/IIXGBM
Dataset updated
Sep 21, 2021
Dataset provided by
Harvard Dataverse
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Imputing missing values is an important preprocessing step in data analysis, but the literature offers little guidance on how to choose between imputation models. This letter suggests adopting the imputation model that generates a density of imputed values most similar to those of the observed values for an incomplete variable after balancing all other covariates. We recommend stable balancing weights as a practical approach to balance covariates whose distribution is expected to differ if the values are not missing completely at random. After balancing, discrepancy statistics can be used to compare the density of imputed and observed values. We illustrate the application of the suggested approach using simulated and real-world survey data from the American National Election Study, comparing popular imputation approaches including random forests, hot-deck, predictive mean matching, and multivariate normal imputation. An R package implementing the suggested approach accompanies this letter.
f
Data from: A multiple imputation method using population information
tandf.figshare.com
pdf
Updated Apr 30, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Tadayoshi Fushiki (2025). A multiple imputation method using population information [Dataset]. http://doi.org/10.6084/m9.figshare.28900017.v1
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.28900017.v1
Dataset updated
Apr 30, 2025
Dataset provided by
Taylor & Francis
Authors
Tadayoshi Fushiki
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Multiple imputation (MI) is effectively used to deal with missing data when the missing mechanism is missing at random. However, MI may not be effective when the missing mechanism is not missing at random (NMAR). In such cases, additional information is required to obtain an appropriate imputation. Pham et al. (2019) proposed the calibrated-δ adjustment method, which is a multiple imputation method using population information. It provides appropriate imputation in two NMAR settings. However, the calibrated-δ adjustment method has two problems. First, it can be used only when one variable has missing values. Second, the theoretical properties of the variance estimator have not been provided. This article proposes a multiple imputation method using population information that can be applied when several variables have missing values. The proposed method is proven to include the calibrated-δ adjustment method. It is shown that the proposed method provides a consistent estimator for the parameter of the imputation model in an NMAR situation. The asymptotic variance of the estimator obtained by the proposed method and its estimator are also given.
o
ComBat HarmonizR enables the integrated analysis of independently generated...
omicsdi.org
ebi.ac.uk
xml
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Hannah Voß, ComBat HarmonizR enables the integrated analysis of independently generated proteomic datasets through data harmonization with appropriate handling of missing values [Dataset]. https://www.omicsdi.org/dataset/pride/PXD027467
Explore at:
xmlAvailable download formats
Authors
Hannah Voß
Variables measured
Proteomics
Description
The integration of proteomic datasets, generated by non-cooperating laboratories using different LC-MS/MS setups can overcome limitations in statistically underpowered sample cohorts but has not been demonstrated to this day. In proteomics, differences in sample preservation and preparation strategies, chromatography and mass spectrometry approaches and the used quantification strategy distort protein abundance distributions in integrated datasets. The Removal of these technical batch effects requires setup-specific normalization and strategies that can deal with missing at random (MAR) and missing not at random (MNAR) type values at a time. Algorithms for batch effect removal, such as the ComBat-algorithm, commonly used for other omics types, disregard proteins with MNAR missing values and reduce the informational yield and the effect size for combined datasets significantly. Here, we present a strategy for data harmonization across different tissue preservation techniques, LC-MS/MS instrumentation setups and quantification approaches. To enable batch effect removal without the need for data reduction or error-prone imputation we developed an extension to the ComBat algorithm, ´ComBat HarmonizR, that performs data harmonization with appropriate handling of MAR and MNAR missing values by matrix dissection The ComBat HarmonizR based strategy enables the combined analysis of independently generated proteomic datasets for the first time. Furthermore, we found ComBat HarmonizR to be superior for removing batch effects between different Tandem Mass Tag (TMT)-plexes, compared to commonly used internal reference scaling (iRS). Due to the matrix dissection approach without the need of data imputation, the HarmonizR algorithm can be applied to any type of -omics data while assuring minimal data loss
S
Penalized Empirical Likelihood of High-Dimensional Semiparametric Varying...
scidb.cn
Updated Feb 17, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
wang long (2025). Penalized Empirical Likelihood of High-Dimensional Semiparametric Varying Coefficient Errors-in-Variables Model Under Missing Data - Appendix [Dataset]. http://doi.org/10.57760/sciencedb.j00206.00050
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.57760/sciencedb.j00206.00050
Dataset updated
Feb 17, 2025
Dataset provided by
Science Data Bank
Authors
wang long
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Description
The auxiliary random vector of the parameter part is mainly constructed through inverse probability weighting and local correction methods, and its asymptotic normality is proved for the mixed sequence by combining the random error term Based on the constructed parameter part auxiliary random vector, the empirical logarithmic likelihood ratio function of the parameter part is obtained. At the same time, it is recommended to use penalty empirical likelihood (PEL) for variable selection. Under appropriate conditions, it is proved that the proposed penalty empirical estimation has Oracle characteristics and follows an asymptotic standard chi square distribution
f
Data from: Integrating Multisource Block-Wise Missing Data in Model...
tandf.figshare.com
pdf
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Fei Xue; Annie Qu (2023). Integrating Multisource Block-Wise Missing Data in Model Selection [Dataset]. http://doi.org/10.6084/m9.figshare.12100701.v2
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.12100701.v2
Dataset updated
May 31, 2023
Dataset provided by
Taylor & Francis
Authors
Fei Xue; Annie Qu
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
For multisource data, blocks of variable information from certain sources are likely missing. Existing methods for handling missing data do not take structures of block-wise missing data into consideration. In this article, we propose a multiple block-wise imputation (MBI) approach, which incorporates imputations based on both complete and incomplete observations. Specifically, for a given missing pattern group, the imputations in MBI incorporate more samples from groups with fewer observed variables in addition to the group with complete observations. We propose to construct estimating equations based on all available information, and integrate informative estimating functions to achieve efficient estimators. We show that the proposed method has estimation and model selection consistency under both fixed-dimensional and high-dimensional settings. Moreover, the proposed estimator is asymptotically more efficient than the estimator based on a single imputation from complete observations only. In addition, the proposed method is not restricted to missing completely at random. Numerical studies and ADNI data application confirm that the proposed method outperforms existing variable selection methods under various missing mechanisms. Supplementary materials for this article are available online.
Data from: A hierarchical Bayesian approach for handling missing...
zenodo.org
data.niaid.nih.gov
+1more
bin, zip
Updated May 31, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alison C. Ketz; Therese L. Johnson; Mevin B. Hooten; M. Thompson Hobbs; Alison C. Ketz; Therese L. Johnson; Mevin B. Hooten; M. Thompson Hobbs (2022). Data from: A hierarchical Bayesian approach for handling missing classification data [Dataset]. http://doi.org/10.5061/dryad.8h36t01
Explore at:
bin, zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.8h36t01
Dataset updated
May 31, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Alison C. Ketz; Therese L. Johnson; Mevin B. Hooten; M. Thompson Hobbs; Alison C. Ketz; Therese L. Johnson; Mevin B. Hooten; M. Thompson Hobbs
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Ecologists use classifications of individuals in categories to understand composition of populations and communities. These categories might be defined by demographics, functional traits, or species. Assignment of categories is often imperfect, but frequently treated as observations without error. When individuals are observed but not classified, these "partial" observations must be modified to include the missing data mechanism to avoid spurious inference.

We developed two hierarchical Bayesian models to overcome the assumption of perfect assignment to mutually exclusive categories in the multinomial distribution of categorical counts, when classifications are missing. These models incorporate auxiliary information to adjust the posterior distributions of the proportions of membership in categories. In one model, we use an empirical Bayes approach, where a subset of data from one year serves as a prior for the missing data the next. In the other approach, we use a small random sample of data within a year to inform the distribution of the missing data.

We performed a simulation to show the bias that occurs when partial observations were ignored and demonstrated the altered inference for the estimation of demographic ratios. We applied our models to demographic classifications of elk (Cervus elaphus nelsoni) to demonstrate improved inference for the proportions of sex and stage classes.

We developed multiple modeling approaches using a generalizable nested multinomial structure to account for partially observed data that were missing not at random for classification counts. Accounting for classification uncertainty is important to accurately understand the composition of populations and communities in ecological studies.
d
Data from: A real data-driven simulation strategy to select an imputation...
datadryad.org
zenodo.org
zip
Updated Feb 15, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jacqueline A. May; Zeny Feng; Sarah J. Adamowicz (2023). A real data-driven simulation strategy to select an imputation method for mixed-type trait data [Dataset]. http://doi.org/10.5061/dryad.crjdfn37m
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.crjdfn37m
Dataset updated
Feb 15, 2023
Dataset provided by
Dryad
Authors
Jacqueline A. May; Zeny Feng; Sarah J. Adamowicz
Time period covered
Feb 10, 2023
Description
Alignment and phylogenetic trees may be opened and visualized by software capable of handling Newick and FASTA file formats.
Data from: Fossilization processes have little impact on tip-calibrated...
zenodo.org
datadryad.org
txt, zip
Updated Jun 4, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Joseph O'Reilly; Joseph O'Reilly (2022). Fossilization processes have little impact on tip-calibrated divergence time analyses [Dataset]. http://doi.org/10.5061/dryad.b2rbnzsd4
Explore at:
zip, txtAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.b2rbnzsd4
Dataset updated
Jun 4, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Joseph O'Reilly; Joseph O'Reilly
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
The importance of palaeontological data in divergence time estimation has increased with the introduction of Bayesian Total-Evidence Dating methods which utilise fossil taxa directly for calibration, facilitated by the joint analysis of morphological and molecular data. Fossil taxa are invariably incompletely known as a consequence of taphonomic processes, resulting in the decidedly non-random distribution of missing data. The impact of non-random missing data on the accuracy and precision of clade age estimation is unknown. In an attempt to constrain the impact of taphonomy on tip-calibrated dating analyses, we compared clade ages estimated from a very complete morphological matrix to ages estimated from the same matrix permuted to simulate the progressive loss of anatomical information resulting from taphonomic processes. We demonstrate that systematically distributed missing data negatively influence clade age estimates, but that successive stages within the taphonomic process introduce greater differences in age estimates, when compared to estimates obtained from untreated data. Despite these effects, the general influence of missing data is weak, presumably due to the compensatory effect of extensive morphological data from extant taxa. We suggest that, in the absence of models that can explicitly account for taphonomic processes, morphological datasets should be constructed to minimise the impact of taphonomy on divergence time estimation.
Replication Data for: Democratization and Gini index: Panel data analysis...
search.datacite.org
Updated 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
LEIZHEN ZANG; Xiong Feng (2019). Replication Data for: Democratization and Gini index: Panel data analysis based on random forest method [Dataset]. http://doi.org/10.7910/dvn/w2cxvu
Explore at:
Unique identifier
https://doi.org/10.7910/dvn/w2cxvu
Dataset updated
2019
Dataset provided by
DataCitehttps://www.datacite.org/
Harvard Dataverse
Authors
LEIZHEN ZANG; Xiong Feng
Description
The mechanism for the association between democratic development and the wealth gap has always been the focus of political and economic research, yet with no consistent conclusion. The reasons for that often are, 1) challenges to generalize the results obtained from analyzing a single country’s time series studies or multinational cross-section data analysis, and 2) deviations in research results caused by missing values or variable selection in panel data analysis. When it comes to the latter one, there are two factors contribute to it. One is that the accuracy of estimation is interfered with the presence of missing values in variables, another is that subjective discretion that must be exercised to select suitable proxies amongst many candidates, which are likely to cause variable selection bias. In order to solve these problems, this study is the pioneeringly research to utilize the machine learning method to interpolate missing values efficiently through the random forest model in this topic, and effectively analyzed cross-country data from 151 countries covering the period 1993–2017. Since this paper measures the importance of different variables to the dependent variable, more appropriate and important variables could be selected to construct a complete regression model. Results from different models come to a consensus that the promotion of democracy can significantly narrow the gap between the rich and the poor, with marginally decreasing effect with respect to wealth. In addition, the study finds out that this mechanism exists only in non-colonial nations or presidential states. Finally, this paper discusses the potential theoretical and policy implications of results.
m
Data from: Optimization of Imputation Strategies for High-Resolution Gas...
metabolomicsworkbench.org
zip
Updated Apr 27, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Isaac Ampong (2022). Optimization of Imputation Strategies for High-Resolution Gas Chromatography-Mass Spectrometry (HR GC-MS) Metabolomics Data [Dataset]. https://www.metabolomicsworkbench.org/data/DRCCMetadata.php?Mode=Study&StudyID=ST002132&DataMode=AllData&ResultType=1
Explore at:
zipAvailable download formats
Dataset updated
Apr 27, 2022
Dataset provided by
Wake Forest School of Medicine
Authors
Isaac Ampong
Description
Gas chromatography-coupled mass spectrometry (GC-MS) has been used in biomedical research to analyze volatile, non-polar, and polar metabolites in a wide array of sample types. Despite advances in technology, missing values are still common in metabolomics datasets and must be properly handled. We evaluated the performance of ten commonly used missing value imputa-tion methods with metabolites analyzed on an HR GC-MS instrument. By introducing missing values into the complete (i.e., data without any missing values) NIST plasma dataset we demon-strate that Random Forest (RF), Glmnet Ridge Regression (GRR), and Bayesian Principal Com-ponent Analysis (BPCA) shared the lowest Root Mean Squared Error (RMSE) in technical repli-cate data. Further examination of these three methods in data from baboon plasma and liver samples demonstrated they all maintained high accuracy. Overall, our analysis suggests that any of the three imputation methods can be applied effectively to untargeted metabolomics datasets with high accuracy. However, it is important to note that imputation will alter the correlation structure of the dataset, and bias downstream regression coefficients and p-values.

Facebook

Twitter

Click to copy link

Link copied

Cite

Ibrahim Denis Fofanah (2025). Understanding and Managing Missing Data.pdf [Dataset]. http://doi.org/10.6084/m9.figshare.29265155.v1

Understanding and Managing Missing Data.pdf

Explore at:

pdfAvailable download formats

Unique identifier

https://doi.org/10.6084/m9.figshare.29265155.v1

Dataset updated

Jun 9, 2025

Dataset provided by

Figsharehttp://figshare.com/

Authors

Ibrahim Denis Fofanah

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This document provides a clear and practical guide to understanding missing data mechanisms, including Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Through real-world scenarios and examples, it explains how different types of missingness impact data analysis and decision-making. It also outlines common strategies for handling missing data, including deletion techniques and imputation methods such as mean imputation, regression, and stochastic modeling.Designed for researchers, analysts, and students working with real-world datasets, this guide helps ensure statistical validity, reduce bias, and improve the overall quality of analysis in fields like public health, behavioral science, social research, and machine learning.

Clear search

Close search

Google apps

Main menu

Understanding and Managing Missing Data.pdf

Water-quality data imputation with a high percentage of missing values: a...

Restricted Boltzmann Machine for Missing Data Imputation in Biomedical...

Additional file 4 of Heckman imputation models for binary or continuous MNAR...

Data_Sheet_1_The Optimal Machine Learning-Based Missing Data Imputation for...

Sensitivity analysis for missing data in cost-effectiveness analysis: Stata...

Data_Sheet_2_A Random Shuffle Method to Expand a Narrow Dataset and Overcome...

Data from: Fast tipping point sensitivity analyses in clinical trials with...

Replication Data for: Comparative investigation of time series missing data...

Multi-Label Datasets with Missing Values

Replication Data for: Choosing Imputation Models

Data from: A multiple imputation method using population information

ComBat HarmonizR enables the integrated analysis of independently generated...

Penalized Empirical Likelihood of High-Dimensional Semiparametric Varying...

Data from: Integrating Multisource Block-Wise Missing Data in Model...

Data from: A hierarchical Bayesian approach for handling missing...

Data from: A real data-driven simulation strategy to select an imputation...

Data from: Fossilization processes have little impact on tip-calibrated...

Replication Data for: Democratization and Gini index: Panel data analysis...

Data from: Optimization of Imputation Strategies for High-Resolution Gas...

Understanding and Managing Missing Data.pdf