17 datasets found
  1. Understanding and Managing Missing Data.pdf

    • figshare.com
    pdf
    Updated Jun 9, 2025
    Cite
    Ibrahim Denis Fofanah (2025). Understanding and Managing Missing Data.pdf [Dataset]. http://doi.org/10.6084/m9.figshare.29265155.v1
    Available download formats: pdf
    Dataset updated: Jun 9, 2025
    Dataset provided by: figshare
    Authors: Ibrahim Denis Fofanah
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This document provides a clear and practical guide to understanding missing data mechanisms, including Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Through real-world scenarios and examples, it explains how different types of missingness impact data analysis and decision-making. It also outlines common strategies for handling missing data, including deletion techniques and imputation methods such as mean imputation, regression, and stochastic modeling. Designed for researchers, analysts, and students working with real-world datasets, this guide helps ensure statistical validity, reduce bias, and improve the overall quality of analysis in fields like public health, behavioral science, social research, and machine learning.

  2. Restricted Boltzmann Machine for Missing Data Imputation in Biomedical...

    • datahub.hku.hk
    Updated Aug 13, 2020
    Cite
    Wen Ma (2020). Restricted Boltzmann Machine for Missing Data Imputation in Biomedical Datasets [Dataset]. http://doi.org/10.25442/hku.12752549.v1
    Dataset updated: Aug 13, 2020
    Dataset provided by: HKU Data Repository
    Authors: Wen Ma
    License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description
    1. NCCTG Lung cancer dataset: Survival in patients with advanced lung cancer from the North Central Cancer Treatment Group. Performance scores rate how well the patient can perform usual daily activities.

    2. CNV measurements of GBM: This dataset records information about copy number variation (CNV) of Glioblastoma (GBM).

    Abstract: In biology and medicine, conservative patient and data collection malpractice can lead to missing or incorrect values in patient registries, which can affect both diagnosis and prognosis. Insufficient or biased patient information significantly impedes the sensitivity and accuracy of predicting cancer survival. In bioinformatics, making a best guess of the missing values and identifying the incorrect values are collectively called “imputation”. Existing imputation methods work by establishing a model based on the data mechanism of the missing values. They work well under two assumptions: 1) the data is missing completely at random, and 2) the percentage of missing values is not high. These assumptions do not hold for biomedical datasets such as the Cancer Genome Atlas Glioblastoma Copy-Number Variant dataset (TCGA: 108 columns) or the North Central Cancer Treatment Group Lung Cancer (NCCTG) dataset (NCCTG: 9 columns). We tested six existing imputation methods, but only two of them worked with these datasets: the Last Observation Carried Forward (LOCF) and K-Nearest Neighbors (KNN). Predictive Mean Matching (PMM) and Classification and Regression Trees (CART) worked only with the NCCTG lung cancer dataset, which has fewer columns, except when the dataset contains 45% missing data. The quality of the values imputed by existing methods is poor because the datasets do not meet the two assumptions.

    In our study, we propose a Restricted Boltzmann Machine (RBM)-based imputation method to cope with low randomness and a high percentage of missing values. RBM is an undirected, probabilistic and parameterized two-layer neural network model, which is often used for extracting abstract information from data, especially for high-dimensional data with unknown or non-standard distributions. In our benchmarks, we applied our method to two cancer datasets: 1) NCCTG, and 2) TCGA. The running time and root mean squared error (RMSE) of the different methods were measured. The benchmarks for the NCCTG dataset show that our method performs better than other methods when there is 5% missing data in the dataset, with an RMSE 4.64 lower than the best KNN result. For the TCGA dataset, our method achieved an RMSE 0.78 lower than the best KNN result.

    In addition to imputation, RBM can achieve simultaneous predictions. We compared the RBM model with four traditional prediction methods. The running time and area under the curve (AUC) were measured to evaluate the performance. Our RBM-based approach outperformed traditional methods. Specifically, the AUC was up to 19.8% higher than that of the multivariate logistic regression model on the NCCTG lung cancer dataset, and 28.1% higher than that of the Cox proportional hazards regression model on the TCGA dataset.

    Apart from imputation and prediction, RBM models can detect outliers in one pass by allowing the reconstruction of all the inputs in the visible layer within a single backward pass. Our results show that RBM models have achieved higher precision and recall on detecting outliers than other methods.
  3. Missing completely at random test.

    • plos.figshare.com
    xls
    Updated Apr 30, 2025
    Cite
    Ayman Omar Baniamer (2025). Missing completely at random test. [Dataset]. http://doi.org/10.1371/journal.pone.0321344.t001
    Available download formats: xls
    Dataset updated: Apr 30, 2025
    Dataset provided by: PLOS ONE
    Authors: Ayman Omar Baniamer
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Statistical models are essential tools in data analysis. However, missing data can undermine the assumptions and effectiveness of statistical models, especially when there is a significant amount of missing data. This study addresses one of the core assumptions supporting many statistical models, the assumption of unidimensionality. It examines the impact of missing data rates and imputation methods on fulfilling this assumption. The study employs three imputation methods: Corrected Item Mean, multiple imputation, and expectation maximization, assessing their performance across nineteen levels of missing data rates, and examining their impact on the assumption of unidimensionality using several indicators (Cronbach’s alpha, corrected correlation coefficients, and factor analysis: eigenvalues, cumulative variance, and communalities). The study concluded that all imputation methods used effectively provided data that maintained the unidimensionality assumption, regardless of missing data rates. Additionally, it was found that most of the unidimensionality indicators increased in value as missing data rates rose.

  4. Water-quality data imputation with a high percentage of missing values: a...

    • zenodo.org
    • explore.openaire.eu
    • +1more
    csv
    Updated Jun 8, 2021
    Cite
    Rafael Rodríguez; Marcos Pastorini; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Angela Gorgoglione (2021). Water-quality data imputation with a high percentage of missing values: a machine learning approach [Dataset]. http://doi.org/10.5281/zenodo.4731169
    Available download formats: csv
    Dataset updated: Jun 8, 2021
    Dataset provided by: Zenodo (http://zenodo.org/)
    Authors: Rafael Rodríguez; Marcos Pastorini; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Angela Gorgoglione
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The monitoring of surface-water quality followed by water-quality modeling and analysis is essential for generating effective strategies in water resource management. However, water-quality studies are limited by the lack of complete and reliable data sets on surface-water-quality variables. These deficiencies are particularly noticeable in developing countries.

    This work focuses on surface-water-quality data from Santa Lucía Chico river (Uruguay), a mixed lotic and lentic river system. Data collected at six monitoring stations are publicly available at https://www.dinama.gub.uy/oan/datos-abiertos/calidad-agua/. The high temporal and spatial variability that characterizes water-quality variables and the high rate of missing values (between 50% and 70%) raise significant challenges.

    To deal with missing values, we applied several statistical and machine-learning imputation methods. The competing algorithms implemented belonged to both univariate and multivariate imputation methods (inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR), and K-nearest neighbors Regressor (KNNR)).

    IDW outperformed the others, achieving a very good performance (NSE greater than 0.8) in most cases.
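
    For readers unfamiliar with the metric cited above, the Nash-Sutcliffe efficiency (NSE) is conventionally defined as one minus the ratio of the squared imputation error to the squared deviation of the observations from their mean. The following is a minimal sketch of that standard definition, not the authors' code; the function name `nse` and the toy values are hypothetical.

```python
import numpy as np

def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 - SSE / squared deviation of observations from their mean."""
    observed = np.asarray(observed, dtype=float)
    simulated = np.asarray(simulated, dtype=float)
    sse = np.sum((observed - simulated) ** 2)
    spread = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - sse / spread

# Toy values for illustration only (not taken from the dataset).
observed = np.array([2.0, 4.0, 6.0, 8.0])
perfect = nse(observed, observed)                                  # exact reconstruction
baseline = nse(observed, np.full_like(observed, observed.mean()))  # always predict the mean
close = nse(observed, [2.5, 4.0, 6.0, 7.5])                        # small errors
```

    A value of 1 means the imputed series matches the observations exactly, and 0 means it does no better than predicting the observed mean, so the "greater than 0.8" threshold quoted above sits near the top of that range.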

    In this dataset, we include the original and imputed values for the following variables:

    • Water temperature (Tw)

    • Dissolved oxygen (DO)

    • Electrical conductivity (EC)

    • pH

    • Turbidity (Turb)

    • Nitrite (NO2-)

    • Nitrate (NO3-)

    • Total Nitrogen (TN)

    Each variable is identified as [STATION] VARIABLE FULL NAME (VARIABLE SHORT NAME) [UNIT METRIC].

    More details about the study area, the original datasets, and the methodology adopted can be found in our paper https://www.mdpi.com/2071-1050/13/11/6318.

    If you use this dataset in your work, please cite our paper:
    Rodríguez, R.; Pastorini, M.; Etcheverry, L.; Chreties, C.; Fossati, M.; Castro, A.; Gorgoglione, A. Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach. Sustainability 2021, 13, 6318. https://doi.org/10.3390/su13116318

  5. Data from: Integrating Multisource Block-Wise Missing Data in Model...

    • tandf.figshare.com
    pdf
    Updated May 31, 2023
    Cite
    Fei Xue; Annie Qu (2023). Integrating Multisource Block-Wise Missing Data in Model Selection [Dataset]. http://doi.org/10.6084/m9.figshare.12100701.v2
    Available download formats: pdf
    Dataset updated: May 31, 2023
    Dataset provided by: Taylor & Francis
    Authors: Fei Xue; Annie Qu
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    For multisource data, blocks of variable information from certain sources are likely missing. Existing methods for handling missing data do not take structures of block-wise missing data into consideration. In this article, we propose a multiple block-wise imputation (MBI) approach, which incorporates imputations based on both complete and incomplete observations. Specifically, for a given missing pattern group, the imputations in MBI incorporate more samples from groups with fewer observed variables in addition to the group with complete observations. We propose to construct estimating equations based on all available information, and integrate informative estimating functions to achieve efficient estimators. We show that the proposed method has estimation and model selection consistency under both fixed-dimensional and high-dimensional settings. Moreover, the proposed estimator is asymptotically more efficient than the estimator based on a single imputation from complete observations only. In addition, the proposed method is not restricted to missing completely at random. Numerical studies and ADNI data application confirm that the proposed method outperforms existing variable selection methods under various missing mechanisms. Supplementary materials for this article are available online.

  6. Multi-Label Datasets with Missing Values

    • data.niaid.nih.gov
    Updated Mar 19, 2023
    Cite
    Fabrício A. do Carmo (2023). Multi-Label Datasets with Missing Values [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7748932
    Dataset updated: Mar 19, 2023
    Dataset provided by: Ewaldo Santana; Antonio F. L. Jacob Jr.; Fabrício A. do Carmo; Ádamo L. de Santana; Fábio M. F. Lobato
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This collection consists of six multi-label datasets from the UCI Machine Learning Repository.

    Each dataset contains missing values which have been artificially added at the following rates: 5, 10, 15, 20, 25, and 30%. The “amputation” was performed using the “Missing Completely at Random” mechanism.

    File names are represented as follows:

       amp_DB_MR.arff
    

    where:

       DB = original dataset;
    
    
       MR = missing rate.
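
    The amputation and naming scheme described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' procedure: the helper `ampute_mcar`, the placeholder dataset name `toy`, the random toy matrix, and the way the rate is written in the file name (a bare integer percentage) are all hypothetical. MCAR is implemented here by blanking a uniformly random subset of cells, independent of their values.

```python
import numpy as np

def ampute_mcar(X, rate, rng):
    """Return a copy of X with a fraction `rate` of cells set to NaN, chosen uniformly at random."""
    X = np.asarray(X, dtype=float).copy()
    n_missing = int(round(rate * X.size))
    # Pick cell positions without looking at the values: this is what makes it MCAR.
    idx = rng.choice(X.size, size=n_missing, replace=False)
    X.flat[idx] = np.nan
    return X

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))      # toy stand-in for one original dataset
rates = [5, 10, 15, 20, 25, 30]     # missing rates in percent, as listed above

amputed = {}
for mr in rates:
    name = f"amp_toy_{mr}.arff"     # follows amp_DB_MR.arff with DB = "toy" (hypothetical)
    amputed[name] = ampute_mcar(X, mr / 100, rng)
```

    Writing the arrays out as ARFF files is left out of the sketch; the dictionary keys only show how a file name would encode the dataset and the missing rate.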
    

    For more details, please read:

    IEEE Access article (in review process)

  7. Data from: Power for balanced linear mixed models with complex missing data...

    • tandf.figshare.com
    pdf
    Updated Jun 6, 2023
    Cite
    Kevin P. Josey; Brandy M. Ringham; Anna E. Barón; Margaret Schenkman; Katherine A. Sauder; Keith E. Muller; Dana Dabelea; Deborah H. Glueck (2023). Power for balanced linear mixed models with complex missing data processes [Dataset]. http://doi.org/10.6084/m9.figshare.14374261.v1
    Available download formats: pdf
    Dataset updated: Jun 6, 2023
    Dataset provided by: Taylor & Francis
    Authors: Kevin P. Josey; Brandy M. Ringham; Anna E. Barón; Margaret Schenkman; Katherine A. Sauder; Keith E. Muller; Dana Dabelea; Deborah H. Glueck
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    When designing repeated measures studies, both the amount and the pattern of missing outcome data can affect power. The chance that an observation is missing may vary across measurements, and missingness may be correlated across measurements. For example, in a physiotherapy study of patients with Parkinson’s disease, increasing intermittent dropout over time yielded missing measurements of physical function. In this example, we assume data are missing completely at random, since the chance that a data point was missing appears to be unrelated to either outcomes or covariates. For data missing completely at random, we propose noncentral F power approximations for the Wald test for balanced linear mixed models with Gaussian responses. The power approximations are based on moments of missing data summary statistics. The moments were derived assuming a conditional linear missingness process. The approach provides approximate power for both complete-case analyses, which include independent sampling units where all measurements are present, and observed-case analyses, which include all independent sampling units with at least one measurement. Monte Carlo simulations demonstrate the accuracy of the method in small samples. We illustrate the utility of the method by computing power for proposed replications of the Parkinson’s study.

  8. Data from: A real data-driven simulation strategy to select an imputation...

    • datadryad.org
    zip
    Updated Feb 15, 2023
    Cite
    Jacqueline A. May; Zeny Feng; Sarah J. Adamowicz (2023). A real data-driven simulation strategy to select an imputation method for mixed-type trait data [Dataset]. http://doi.org/10.5061/dryad.crjdfn37m
    Available download formats: zip
    Dataset updated: Feb 15, 2023
    Dataset provided by: Dryad
    Authors: Jacqueline A. May; Zeny Feng; Sarah J. Adamowicz
    Time period covered: Feb 10, 2023
    Description

    Alignment and phylogenetic trees may be opened and visualized by software capable of handling Newick and FASTA file formats.

  9. Missing data pattern and mechanism by age and wear time criteria.

    • plos.figshare.com
    • figshare.com
    xls
    Updated Jun 2, 2023
    Cite
    Eric E. Wickel (2023). Missing data pattern and mechanism by age and wear time criteria. [Dataset]. http://doi.org/10.1371/journal.pone.0114402.t001
    Available download formats: xls
    Dataset updated: Jun 2, 2023
    Dataset provided by: PLOS ONE
    Authors: Eric E. Wickel
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    MAR, missing at random; MCAR, missing completely at random; NMAR, not missing at random; np, number of participants. (a) Little's test, p

  10. Data imputation: an application on wind speed data for Entebbe International...

    • figshare.com
    xlsx
    Updated May 31, 2023
    Cite
    Ronald Wesonga (2023). Data imputation: an application on wind speed data for Entebbe International Airport [Dataset]. http://doi.org/10.6084/m9.figshare.3804357.v2
    Available download formats: xlsx
    Dataset updated: May 31, 2023
    Dataset provided by: Figshare (http://figshare.com/)
    Authors: Ronald Wesonga
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Entebbe
    Description

    The wind speed data are daily aggregated values for Entebbe International Airport, located at Entebbe, Uganda. The data have undergone multivariate imputation, given that the initial dataset for the period 1995 through 2008 had over 10% of values missing completely at random (MCAR).

  11. Parameters of age mixing in sexual partnerships of infected individuals at...

    • plos.figshare.com
    xls
    Updated Jun 11, 2023
    Cite
    David Niyukuri; Peter Nyasulu; Wim Delva (2023). Parameters of age mixing in sexual partnerships of infected individuals at different sampling coverage (%) when missing individuals were missing completely at random (MCAR), missing at random (MAR) with at most 30%, 50%, and 70% women in the sample. [Dataset]. http://doi.org/10.1371/journal.pone.0249013.t002
    Available download formats: xls
    Dataset updated: Jun 11, 2023
    Dataset provided by: PLOS ONE
    Authors: David Niyukuri; Peter Nyasulu; Wim Delva
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Parameters of age mixing in sexual partnerships of infected individuals at different sampling coverage (%) when missing individuals were missing completely at random (MCAR), missing at random (MAR) with at most 30%, 50%, and 70% women in the sample.

  12. Jane Street is_missing feature labels

    • kaggle.com
    Updated Dec 3, 2020
    Cite
    tpmeli (2020). Jane Street is_missing feature labels [Dataset]. https://www.kaggle.com/tpmeli/jane-street-is-missing-feature-labels/activity
    Available metadata format: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated: Dec 3, 2020
    Dataset provided by: Kaggle (http://kaggle.com/)
    Authors: tpmeli
    Description

    The purpose of this dataset is to prevent consistent processing times. One can add it to train.csv and experiment with a very simple df.join(this.csv). Since the missing rows seem to have some values that are completely missing not at random, they should have some meaning attached to them.
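
    The join mentioned above can be sketched with pandas as follows. This is a minimal illustration under assumptions: the toy `train` frame, the `feature_0` and `feature_1` column names, and the `_is_missing` suffix are hypothetical, and the indicator columns are recomputed here with `isna()` instead of being loaded from the dataset's CSV file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv; the real file would be loaded with pd.read_csv.
train = pd.DataFrame({
    "feature_0": [1.0, np.nan, 3.0],
    "feature_1": [np.nan, 0.5, 0.7],
})

# One 0/1 indicator column per feature, marking which cells were missing.
is_missing = train.isna().astype("int8").add_suffix("_is_missing")

# DataFrame.join aligns on the row index, so both frames must share it.
joined = train.join(is_missing)
```

    Because `join` aligns on the index, a precomputed indicator file only lines up with the training rows if both are read in the same row order or share an explicit key column.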

  13. Supplementary Material for: Phenotypically Enriched Genotypic Imputation in...

    • karger.figshare.com
    pdf
    Updated Jun 3, 2023
    Cite
    Zhuang W.V.; Murabito J.M. (2023). Supplementary Material for: Phenotypically Enriched Genotypic Imputation in Genetic Association Tests [Dataset]. http://doi.org/10.6084/m9.figshare.3803124.v1
    Available download formats: pdf
    Dataset updated: Jun 3, 2023
    Dataset provided by: Karger Publishers
    Authors: Zhuang W.V.; Murabito J.M.
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: In longitudinal epidemiological studies there may be individuals with rich phenotype data who die or are lost to follow-up before providing DNA for genetic studies. Often, the genotypic and phenotypic data of the relatives are available. Two strategies for analyzing the incomplete data are to exclude ungenotyped subjects from analysis (the complete-case method, CC) and to include phenotyped but ungenotyped individuals in analysis by using relatives' genotypes for genotype imputation (GI). In both strategies, the information in the phenotypic data was not used to handle the missing-genotype problem. Methods: We propose a phenotypically enriched genotypic imputation (PEGI) method that uses the EM (expectation-maximization)-based maximum likelihood method to incorporate observed phenotypes into genotype imputation. Results: Our simulations with genotypes missing completely at random show that, for a single-nucleotide polymorphism (SNP) with moderate to strong effect on a phenotype, PEGI improves power more than GI without excess type I errors. Using the Framingham Heart Study data set, we compare the ability of the PEGI, GI, and CC to detect the associations between 5 SNPs and age at natural menopause. Conclusion: The PEGI method may improve power to detect an association over both CC and GI under many circumstances.

  14. Empirical Bayesian Random Censoring Threshold Model Improves Detection of...

    • figshare.com
    zip
    Updated May 31, 2023
    Cite
    Frank Koopmans; L. Niels Cornelisse; Tom Heskes; Tjeerd M. H. Dijkstra (2023). Empirical Bayesian Random Censoring Threshold Model Improves Detection of Differentially Abundant Proteins [Dataset]. http://doi.org/10.1021/pr500171u.s002
    Available download formats: zip
    Dataset updated: May 31, 2023
    Dataset provided by: ACS Publications
    Authors: Frank Koopmans; L. Niels Cornelisse; Tom Heskes; Tjeerd M. H. Dijkstra
    License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    A challenge in proteomics is that many observations are missing, with the probability of missingness increasing as abundance decreases. Adjusting for this informative missingness is required to assess accurately which proteins are differentially abundant. We propose an empirical Bayesian random censoring threshold (EBRCT) model that takes the pattern of missingness into account in the identification of differential abundance. We compare our model with four alternatives: one that considers the missing values as missing completely at random (MCAR model), one with a fixed censoring threshold for each protein species (fixed censoring model), and two imputation models, k-nearest neighbors (IKNN) and singular value thresholding (SVTI). We demonstrate that the EBRCT model bests all alternative models when applied to the CPTAC study 6 benchmark data set. The model is applicable to any label-free peptide or protein quantification pipeline and is provided as an R script.

  15. Data from: Small sample adjustment for inference without assuming...

    • tandf.figshare.com
    zip
    Updated Oct 30, 2024
    Cite
    Kazushi Maruo; Ryota Ishii; Yusuke Yamaguchi; Tomohiro Ohigashi; Masahiko Gosho (2024). Small sample adjustment for inference without assuming orthogonality in a mixed model for repeated measures analysis [Dataset]. http://doi.org/10.6084/m9.figshare.27330045.v1
    Available download formats: zip
    Dataset updated: Oct 30, 2024
    Dataset provided by: Taylor & Francis
    Authors: Kazushi Maruo; Ryota Ishii; Yusuke Yamaguchi; Tomohiro Ohigashi; Masahiko Gosho
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The mixed model for repeated measures (MMRM) analysis is sometimes used as a primary statistical analysis for a longitudinal randomized clinical trial. When the MMRM analysis is implemented in ordinary statistical software, the standard error of the treatment effect is estimated by assuming orthogonality between the fixed effects and covariance parameters, based on the characteristics of the normal distribution. However, orthogonality does not hold unless the normality assumption of the error distribution holds, and/or the missing data are derived from the missing completely at random structure. Therefore, assuming orthogonality in the MMRM analysis is not preferable. However, without the assumption of orthogonality, the small-sample bias in the standard error of the treatment effect is significant. Nonetheless, there is no method to improve small-sample performance. Furthermore, there is no software that can easily implement inferences on treatment effects without assuming orthogonality. Hence, we propose two small-sample adjustment methods inflating standard errors that are reasonable in ideal situations and achieve empirical conservatism even in general situations. We also provide an R package to implement these inference processes. The simulation results show that one of the proposed small-sample adjustment methods performs particularly well in terms of underestimation bias of standard errors; consequently, the proposed method is recommended. When using the MMRM analysis, our proposed method is recommended if the sample size is not large and between-group heteroscedasticity is expected.

  16. Additional file 2 of Analyzing pre-service biology teachers’ intention to...

    • springernature.figshare.com
    jar
    Updated May 31, 2023
    Cite
    Helena Aptyka; Jörg Großschedl (2023). Additional file 2 of Analyzing pre-service biology teachers’ intention to teach evolution using the theory of planned behavior [Dataset]. http://doi.org/10.6084/m9.figshare.21602040.v1
    Available download formats: jar
    Dataset updated: May 31, 2023
    Dataset provided by: Figshare (http://figshare.com/)
    Authors: Helena Aptyka; Jörg Großschedl
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 2. This file comprises the output of Little’s test of missing completely at random (MCAR) and of the multiple imputation using the expectation-maximization (EM) algorithm in SPSS.

  17. Data from: Estimation of Conditional Prevalence From Group Testing Data With...

    • tandf.figshare.com
    zip
    Updated Feb 16, 2024
    Cite
    Aurore Delaigle; Wei Huang; Shaoke Lei (2024). Estimation of Conditional Prevalence From Group Testing Data With Missing Covariates [Dataset]. http://doi.org/10.6084/m9.figshare.7639889.v3
    Available download formats: zip
    Dataset updated: Feb 16, 2024
    Dataset provided by: Taylor & Francis
    Authors: Aurore Delaigle; Wei Huang; Shaoke Lei
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We consider estimating the conditional prevalence of a disease from data pooled according to the group testing mechanism. Consistent estimators have been proposed in the literature, but they rely on the data being available for all individuals. In infectious disease studies where group testing is frequently applied, the covariate is often missing for some individuals. There, unless the missing mechanism occurs completely at random, applying the existing techniques to the complete cases without adjusting for missingness does not generally provide consistent estimators, and finding appropriate modifications is challenging. We develop a consistent spline estimator, derive its theoretical properties, and show how to adapt local polynomial and likelihood estimators to the missing data problem. We illustrate the numerical performance of our methods on simulated and real examples. Supplementary materials for this article are available online.
