17 datasets found

f
Understanding and Managing Missing Data.pdf
figshare.com
pdf
Updated Jun 9, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ibrahim Denis Fofanah (2025). Understanding and Managing Missing Data.pdf [Dataset]. http://doi.org/10.6084/m9.figshare.29265155.v1
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.29265155.v1
Dataset updated
Jun 9, 2025
Dataset provided by
figshare
Authors
Ibrahim Denis Fofanah
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This document provides a clear and practical guide to understanding missing data mechanisms, including Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Through real-world scenarios and examples, it explains how different types of missingness impact data analysis and decision-making. It also outlines common strategies for handling missing data, including deletion techniques and imputation methods such as mean imputation, regression, and stochastic modeling.Designed for researchers, analysts, and students working with real-world datasets, this guide helps ensure statistical validity, reduce bias, and improve the overall quality of analysis in fields like public health, behavioral science, social research, and machine learning.
h
Restricted Boltzmann Machine for Missing Data Imputation in Biomedical...
datahub.hku.hk
Updated Aug 13, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Wen Ma (2020). Restricted Boltzmann Machine for Missing Data Imputation in Biomedical Datasets [Dataset]. http://doi.org/10.25442/hku.12752549.v1
Explore at:
Unique identifier
https://doi.org/10.25442/hku.12752549.v1
Dataset updated
Aug 13, 2020
Dataset provided by
HKU Data Repository
Authors
Wen Ma
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
NCCTG Lung cancer datasetSurvival in patients with advanced lung cancer from the North Central Cancer Treatment Group. Performance scores rate how well the patient can perform usual daily activities.2.CNV measurements of CNV of GBM This dataset records the information about copy number variation of Glioblastoma (GBM).Abstract:In biology and medicine, conservative patient and data collection malpractice can lead to missing or incorrect values in patient registries, which can affect both diagnosis and prognosis. Insufficient or biased patient information significantly impedes the sensitivity and accuracy of predicting cancer survival. In bioinformatics, making a best guess of the missing values and identifying the incorrect values are collectively called “imputation”. Existing imputation methods work by establishing a model based on the data mechanism of the missing values. Existing imputation methods work well under two assumptions: 1) the data is missing completely at random, and 2) the percentage of missing values is not high. These are not cases found in biomedical datasets, such as the Cancer Genome Atlas Glioblastoma Copy-Number Variant dataset (TCGA: 108 columns), or the North Central Cancer Treatment Group Lung Cancer (NCCTG) dataset (NCCTG: 9 columns). We tested six existing imputation methods, but only two of them worked with these datasets: The Last Observation Carried Forward (LOCF) and K-nearest Algorithm (KNN). Predictive Mean Matching (PMM) and Classification and Regression Trees (CART) worked only with the NCCTG lung cancer dataset with fewer columns, except when the dataset contains 45% missing data. The quality of the imputed values using existing methods is bad because they do not meet the two assumptions.In our study, we propose a Restricted Boltzmann Machine (RBM)-based imputation method to cope with low randomness and the high percentage of the missing values. RBM is an undirected, probabilistic and parameterized two-layer neural network model, which is often used for extracting abstract information from data, especially for high-dimensional data with unknown or non-standard distributions. In our benchmarks, we applied our method to two cancer datasets: 1) NCCTG, and 2) TCGA. The running time, root mean squared error (RMSE) of the different methods were gauged. The benchmarks for the NCCTG dataset show that our method performs better than other methods when there is 5% missing data in the dataset, with 4.64 RMSE lower than the best KNN. For the TCGA dataset, our method achieved 0.78 RMSE lower than the best KNN.In addition to imputation, RBM can achieve simultaneous predictions. We compared the RBM model with four traditional prediction methods. The running time and area under the curve (AUC) were measured to evaluate the performance. Our RBM-based approach outperformed traditional methods. Specifically, the AUC was up to 19.8% higher than the multivariate logistic regression model in the NCCTG lung cancer dataset, and the AUC was higher than the Cox proportional hazard regression model, with 28.1% in the TCGA dataset.Apart from imputation and prediction, RBM models can detect outliers in one pass by allowing the reconstruction of all the inputs in the visible layer with in a single backward pass. Our results show that RBM models have achieved higher precision and recall on detecting outliers than other methods.
f
Missing completely at random test.
plos.figshare.com
xls
Updated Apr 30, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ayman Omar Baniamer (2025). Missing completely at random test. [Dataset]. http://doi.org/10.1371/journal.pone.0321344.t001
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0321344.t001
Dataset updated
Apr 30, 2025
Dataset provided by
PLOS ONE
Authors
Ayman Omar Baniamer
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Statistical models are essential tools in data analysis. However, missing data plays a pivotal role in impacting the assumptions and effectiveness of statistical models, especially when there is a significant amount of missing data. This study addresses one of the core assumptions supporting many statistical models, the assumption of unidimensionality. It examines the impact of missing data rates and imputation methods on fulfilling this assumption. The study employs three imputation methods: Corrected Item Mean, multiple imputation, and expectation maximization, assessing their performance across nineteen levels of missing data rates, and examining their impact on the assumption of unidimensionality using several indicators (Cronbach’s alpha, corrected correlation coefficients, factor analysis (Eigenvalues (, , and cumulative variance, and communalities). The study concluded that all imputation methods used effectively provided data that maintained the unidimensionality assumption, regardless of missing data rates. Additionally, it was found that most of the unidimensionality indicators increased in value as missing data rates rose.
Water-quality data imputation with a high percentage of missing values: a...
zenodo.org
explore.openaire.eu
+1more
csv
Updated Jun 8, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rafael Rodríguez; Rafael Rodríguez; Marcos Pastorini; Marcos Pastorini; Lorena Etcheverry; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Alberto Castro; Angela Gorgoglione; Angela Gorgoglione; Christian Chreties; Mónica Fossati (2021). Water-quality data imputation with a high percentage of missing values: a machine learning approach [Dataset]. http://doi.org/10.5281/zenodo.4731169
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.4731169
Dataset updated
Jun 8, 2021
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Rafael Rodríguez; Rafael Rodríguez; Marcos Pastorini; Marcos Pastorini; Lorena Etcheverry; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Alberto Castro; Angela Gorgoglione; Angela Gorgoglione; Christian Chreties; Mónica Fossati
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The monitoring of surface-water quality followed by water-quality modeling and analysis is essential for generating effective strategies in water resource management. However, water-quality studies are limited by the lack of complete and reliable data sets on surface-water-quality variables. These deficiencies are particularly noticeable in developing countries.

This work focuses on surface-water-quality data from Santa Lucía Chico river (Uruguay), a mixed lotic and lentic river system. Data collected at six monitoring stations are publicly available at https://www.dinama.gub.uy/oan/datos-abiertos/calidad-agua/. The high temporal and spatial variability that characterizes water-quality variables and the high rate of missing values (between 50% and 70%) raises significant challenges.

To deal with missing values, we applied several statistical and machine-learning imputation methods. The competing algorithms implemented belonged to both univariate and multivariate imputation methods (inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR), and K-nearest neighbors Regressor (KNNR)).

IDW outperformed the others, achieving a very good performance (NSE greater than 0.8) in most cases.

In this dataset, we include the original and imputed values for the following variables:

Water temperature (Tw)

Dissolved oxygen (DO)

Electrical conductivity (EC)

pH

Turbidity (Turb)

Nitrite (NO2-)

Nitrate (NO3-)

Total Nitrogen (TN)

Each variable is identified as [STATION] VARIABLE FULL NAME (VARIABLE SHORT NAME) [UNIT METRIC].

More details about the study area, the original datasets, and the methodology adopted can be found in our paper https://www.mdpi.com/2071-1050/13/11/6318.

If you use this dataset in your work, please cite our paper:
Rodríguez, R.; Pastorini, M.; Etcheverry, L.; Chreties, C.; Fossati, M.; Castro, A.; Gorgoglione, A. Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach. Sustainability 2021, 13, 6318. https://doi.org/10.3390/su13116318
f
Data from: Integrating Multisource Block-Wise Missing Data in Model...
tandf.figshare.com
pdf
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Fei Xue; Annie Qu (2023). Integrating Multisource Block-Wise Missing Data in Model Selection [Dataset]. http://doi.org/10.6084/m9.figshare.12100701.v2
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.12100701.v2
Dataset updated
May 31, 2023
Dataset provided by
Taylor & Francis
Authors
Fei Xue; Annie Qu
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
For multisource data, blocks of variable information from certain sources are likely missing. Existing methods for handling missing data do not take structures of block-wise missing data into consideration. In this article, we propose a multiple block-wise imputation (MBI) approach, which incorporates imputations based on both complete and incomplete observations. Specifically, for a given missing pattern group, the imputations in MBI incorporate more samples from groups with fewer observed variables in addition to the group with complete observations. We propose to construct estimating equations based on all available information, and integrate informative estimating functions to achieve efficient estimators. We show that the proposed method has estimation and model selection consistency under both fixed-dimensional and high-dimensional settings. Moreover, the proposed estimator is asymptotically more efficient than the estimator based on a single imputation from complete observations only. In addition, the proposed method is not restricted to missing completely at random. Numerical studies and ADNI data application confirm that the proposed method outperforms existing variable selection methods under various missing mechanisms. Supplementary materials for this article are available online.
Z
Multi-Label Datasets with Missing Values
data.niaid.nih.gov
Updated Mar 19, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Fabrício A. do Carmo (2023). Multi-Label Datasets with Missing Values [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7748932
Explore at:
Dataset updated
Mar 19, 2023
Dataset provided by
Ewaldo Santana
Antonio F. L. Jacob Jr.
Fabrício A. do Carmo
Ádamo L. de Santana
Fábio M. F. Lobato
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Consisting of six multi-label datasets from the UCI Machine Learning repository.

Each dataset contains missing values which have been artificially added at the following rates: 5, 10, 15, 20, 25, and 30%. The “amputation” was performed using the “Missing Completely at Random” mechanism.

File names are represented as follows:

amp_DB_MR.arff

where:

DB = original dataset; MR = missing rate.

For more details, please read:

IEEE Access article (in review process)
f
Data from: Power for balanced linear mixed models with complex missing data...
tandf.figshare.com
pdf
Updated Jun 6, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kevin P. Josey; Brandy M. Ringham; Anna E. Barón; Margaret Schenkman; Katherine A. Sauder; Keith E. Muller; Dana Dabelea; Deborah H. Glueck (2023). Power for balanced linear mixed models with complex missing data processes [Dataset]. http://doi.org/10.6084/m9.figshare.14374261.v1
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.14374261.v1
Dataset updated
Jun 6, 2023
Dataset provided by
Taylor & Francis
Authors
Kevin P. Josey; Brandy M. Ringham; Anna E. Barón; Margaret Schenkman; Katherine A. Sauder; Keith E. Muller; Dana Dabelea; Deborah H. Glueck
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
When designing repeated measures studies, both the amount and the pattern of missing outcome data can affect power. The chance that an observation is missing may vary across measurements, and missingness may be correlated across measurements. For example, in a physiotherapy study of patients with Parkinson’s disease, increasing intermittent dropout over time yielded missing measurements of physical function. In this example, we assume data are missing completely at random, since the chance that a data point was missing appears to be unrelated to either outcomes or covariates. For data missing completely at random, we propose noncentral F power approximations for the Wald test for balanced linear mixed models with Gaussian responses. The power approximations are based on moments of missing data summary statistics. The moments were derived assuming a conditional linear missingness process. The approach provides approximate power for both complete-case analyses, which include independent sampling units where all measurements are present, and observed-case analyses, which include all independent sampling units with at least one measurement. Monte Carlo simulations demonstrate the accuracy of the method in small samples. We illustrate the utility of the method by computing power for proposed replications of the Parkinson’s study.
d
Data from: A real data-driven simulation strategy to select an imputation...
datadryad.org
zip
Updated Feb 15, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jacqueline A. May; Zeny Feng; Sarah J. Adamowicz (2023). A real data-driven simulation strategy to select an imputation method for mixed-type trait data [Dataset]. http://doi.org/10.5061/dryad.crjdfn37m
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.crjdfn37m
Dataset updated
Feb 15, 2023
Dataset provided by
Dryad
Authors
Jacqueline A. May; Zeny Feng; Sarah J. Adamowicz
Time period covered
Feb 10, 2023
Description
Alignment and phylogenetic trees may be opened and visualized by software capable of handling Newick and FASTA file formats.
f
Missing data pattern and mechanism by age and wear time criteria.
plos.figshare.com
figshare.com
xls
Updated Jun 2, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Eric E. Wickel (2023). Missing data pattern and mechanism by age and wear time criteria. [Dataset]. http://doi.org/10.1371/journal.pone.0114402.t001
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0114402.t001
Dataset updated
Jun 2, 2023
Dataset provided by
PLOS ONE
Authors
Eric E. Wickel
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
MAR, missing at random; MCAR, missing completely at random; NMAR, not missing at random; np, number of participants.aLittle's test, p
Data imputation: an application on wind speed data for Entebbe International...
figshare.com
xlsx
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ronald Wesonga (2023). Data imputation: an application on wind speed data for Entebbe International Airport [Dataset]. http://doi.org/10.6084/m9.figshare.3804357.v2
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.3804357.v2
Dataset updated
May 31, 2023
Dataset provided by
Figsharehttp://figshare.com/
Authors
Ronald Wesonga
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Entebbe
Description
The windspeed data are daily aggregated values for wind speed data for Entebbe International Airport, located at Entebbe, Uganda. This data has undergone multivariate imputations given that the initial dataset for the period 1995 through 2008 had over 10% missing completely at random (MCAR).
f
Parameters of age mixing in sexual partnerships of infected individuals at...
plos.figshare.com
xls
Updated Jun 11, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
David Niyukuri; Peter Nyasulu; Wim Delva (2023). Parameters of age mixing in sexual partnerships of infected individuals at different sampling coverage (%) when missing individuals were missing completely at random (MCAR), missing at random (MAR) with at most 30%, 50%, and 70% women in the sample. [Dataset]. http://doi.org/10.1371/journal.pone.0249013.t002
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0249013.t002
Dataset updated
Jun 11, 2023
Dataset provided by
PLOS ONE
Authors
David Niyukuri; Peter Nyasulu; Wim Delva
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Parameters of age mixing in sexual partnerships of infected individuals at different sampling coverage (%) when missing individuals were missing completely at random (MCAR), missing at random (MAR) with at most 30%, 50%, and 70% women in the sample.
Jane Street is_missing feature labels
kaggle.com
Updated Dec 3, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
tpmeli (2020). Jane Street is_missing feature labels [Dataset]. https://www.kaggle.com/tpmeli/jane-street-is-missing-feature-labels/activity
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Dec 3, 2020
Dataset provided by
Kagglehttp://kaggle.com/
Authors
tpmeli
Description
The purpose of this dataset is to prevent consistent procesing times. One can add it to train.csv and experiment with a very simple df.join(this.csv). Since the missing rows seem to have some values that are completely missing not at random, they should have some meaning attached to them.
f
Supplementary Material for: Phenotypically Enriched Genotypic Imputation in...
karger.figshare.com
pdf
Updated Jun 3, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zhuang W.V.; Murabito J.M. (2023). Supplementary Material for: Phenotypically Enriched Genotypic Imputation in Genetic Association Tests [Dataset]. http://doi.org/10.6084/m9.figshare.3803124.v1
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.3803124.v1
Dataset updated
Jun 3, 2023
Dataset provided by
Karger Publishers
Authors
Zhuang W.V.; Murabito J.M.
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Background: In longitudinal epidemiological studies there may be individuals with rich phenotype data who die or are lost to follow-up before providing DNA for genetic studies. Often, the genotypic and phenotypic data of the relatives are available. Two strategies for analyzing the incomplete data are to exclude ungenotyped subjects from analysis (the complete-case method, CC) and to include phenotyped but ungenotyped individuals in analysis by using relatives' genotypes for genotype imputation (GI). In both strategies, the information in the phenotypic data was not used to handle the missing-genotype problem. Methods: We propose a phenotypically enriched genotypic imputation (PEGI) method that uses the EM (expectation-maximization)-based maximum likelihood method to incorporate observed phenotypes into genotype imputation. Results: Our simulations with genotypes missing completely at random show that, for a single-nucleotide polymorphism (SNP) with moderate to strong effect on a phenotype, PEGI improves power more than GI without excess type I errors. Using the Framingham Heart Study data set, we compare the ability of the PEGI, GI, and CC to detect the associations between 5 SNPs and age at natural menopause. Conclusion: The PEGI method may improve power to detect an association over both CC and GI under many circumstances.
f
Empirical Bayesian Random Censoring Threshold Model Improves Detection of...
figshare.com
zip
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Frank Koopmans; L. Niels Cornelisse; Tom Heskes; Tjeerd M. H. Dijkstra (2023). Empirical Bayesian Random Censoring Threshold Model Improves Detection of Differentially Abundant Proteins [Dataset]. http://doi.org/10.1021/pr500171u.s002
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.1021/pr500171u.s002
Dataset updated
May 31, 2023
Dataset provided by
ACS Publications
Authors
Frank Koopmans; L. Niels Cornelisse; Tom Heskes; Tjeerd M. H. Dijkstra
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
A challenge in proteomics is that many observations are missing with the probability of missingness increasing as abundance decreases. Adjusting for this informative missingness is required to assess accurately which proteins are differentially abundant. We propose an empirical Bayesian random censoring threshold (EBRCT) model that takes the pattern of missingness in account in the identification of differential abundance. We compare our model with four alternatives, one that considers the missing values as missing completely at random (MCAR model), one with a fixed censoring threshold for each protein species (fixed censoring model) and two imputation models, k-nearest neighbors (IKNN) and singular value thresholding (SVTI). We demonstrate that the EBRCT model bests all alternative models when applied to the CPTAC study 6 benchmark data set. The model is applicable to any label-free peptide or protein quantification pipeline and is provided as an R script.
f
Data from: Small sample adjustment for inference without assuming...
tandf.figshare.com
zip
Updated Oct 30, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kazushi Maruo; Ryota Ishii; Yusuke Yamaguchi; Tomohiro Ohigashi; Masahiko Gosho (2024). Small sample adjustment for inference without assuming orthogonality in a mixed model for repeated measures analysis [Dataset]. http://doi.org/10.6084/m9.figshare.27330045.v1
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.27330045.v1
Dataset updated
Oct 30, 2024
Dataset provided by
Taylor & Francis
Authors
Kazushi Maruo; Ryota Ishii; Yusuke Yamaguchi; Tomohiro Ohigashi; Masahiko Gosho
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The mixed model for repeated measures (MMRM) analysis is sometimes used as a primary statistical analysis for a longitudinal randomized clinical trial. When the MMRM analysis is implemented in ordinary statistical software, the standard error of the treatment effect is estimated by assuming orthogonality between the fixed effects and covariance parameters, based on the characteristics of the normal distribution. However, orthogonality does not hold unless the normality assumption of the error distribution holds, and/or the missing data are derived from the missing completely at random structure. Therefore, assuming orthogonality in the MMRM analysis is not preferable. However, without the assumption of orthogonality, the small-sample bias in the standard error of the treatment effect is significant. Nonetheless, there is no method to improve small-sample performance. Furthermore, there is no software that can easily implement inferences on treatment effects without assuming orthogonality. Hence, we propose two small-sample adjustment methods inflating standard errors that are reasonable in ideal situations and achieve empirical conservatism even in general situations. We also provide an R package to implement these inference processes. The simulation results show that one of the proposed small-sample adjustment methods performs particularly well in terms of underestimation bias of standard errors; consequently, the proposed method is recommended. When using the MMRM analysis, our proposed method is recommended if the sample size is not large and between-group heteroscedasticity is expected.
Additional file 2 of Analyzing pre-service biology teachers’ intention to...
springernature.figshare.com
jar
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Helena Aptyka; Jörg Großschedl (2023). Additional file 2 of Analyzing pre-service biology teachers’ intention to teach evolution using the theory of planned behavior [Dataset]. http://doi.org/10.6084/m9.figshare.21602040.v1
Explore at:
jarAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.21602040.v1
Dataset updated
May 31, 2023
Dataset provided by
Figsharehttp://figshare.com/
Authors
Helena Aptyka; Jörg Großschedl
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Additional file 2. This file comprises the output of the Little’s test of missing completely at random (MCAR) and the multiple imputing by using the expectation-maximum (EM) algorithm in SPSS.
f
Data from: Estimation of Conditional Prevalence From Group Testing Data With...
tandf.figshare.com
zip
Updated Feb 16, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Aurore Delaigle; Wei Huang; Shaoke Lei (2024). Estimation of Conditional Prevalence From Group Testing Data With Missing Covariates [Dataset]. http://doi.org/10.6084/m9.figshare.7639889.v3
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.7639889.v3
Dataset updated
Feb 16, 2024
Dataset provided by
Taylor & Francis
Authors
Aurore Delaigle; Wei Huang; Shaoke Lei
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We consider estimating the conditional prevalence of a disease from data pooled according to the group testing mechanism. Consistent estimators have been proposed in the literature, but they rely on the data being available for all individuals. In infectious disease studies where group testing is frequently applied, the covariate is often missing for some individuals. There, unless the missing mechanism occurs completely at random, applying the existing techniques to the complete cases without adjusting for missingness does not generally provide consistent estimators, and finding appropriate modifications is challenging. We develop a consistent spline estimator, derive its theoretical properties, and show how to adapt local polynomial and likelihood estimators to the missing data problem. We illustrate the numerical performance of our methods on simulated and real examples. Supplementary materials for this article are available online.
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

Ibrahim Denis Fofanah (2025). Understanding and Managing Missing Data.pdf [Dataset]. http://doi.org/10.6084/m9.figshare.29265155.v1

Understanding and Managing Missing Data.pdf

Explore at:

pdfAvailable download formats

Unique identifier

https://doi.org/10.6084/m9.figshare.29265155.v1

Dataset updated

Jun 9, 2025

Dataset provided by

figshare

Authors

Ibrahim Denis Fofanah

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This document provides a clear and practical guide to understanding missing data mechanisms, including Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Through real-world scenarios and examples, it explains how different types of missingness impact data analysis and decision-making. It also outlines common strategies for handling missing data, including deletion techniques and imputation methods such as mean imputation, regression, and stochastic modeling.Designed for researchers, analysts, and students working with real-world datasets, this guide helps ensure statistical validity, reduce bias, and improve the overall quality of analysis in fields like public health, behavioral science, social research, and machine learning.

Clear search

Close search

Google apps

Main menu

Understanding and Managing Missing Data.pdf

Restricted Boltzmann Machine for Missing Data Imputation in Biomedical...

Missing completely at random test.

Water-quality data imputation with a high percentage of missing values: a...

Data from: Integrating Multisource Block-Wise Missing Data in Model...

Multi-Label Datasets with Missing Values

Data from: Power for balanced linear mixed models with complex missing data...

Data from: A real data-driven simulation strategy to select an imputation...

Missing data pattern and mechanism by age and wear time criteria.

Data imputation: an application on wind speed data for Entebbe International...

Parameters of age mixing in sexual partnerships of infected individuals at...

Jane Street is_missing feature labels

Supplementary Material for: Phenotypically Enriched Genotypic Imputation in...

Empirical Bayesian Random Censoring Threshold Model Improves Detection of...

Data from: Small sample adjustment for inference without assuming...

Additional file 2 of Analyzing pre-service biology teachers’ intention to...

Data from: Estimation of Conditional Prevalence From Group Testing Data With...

Understanding and Managing Missing Data.pdf