100+ datasets found
  1. Understanding and Managing Missing Data.pdf

    • figshare.com
    pdf
    Updated Jun 9, 2025
    Cite
    Ibrahim Denis Fofanah (2025). Understanding and Managing Missing Data.pdf [Dataset]. http://doi.org/10.6084/m9.figshare.29265155.v1
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 9, 2025
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    Ibrahim Denis Fofanah
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This document provides a clear and practical guide to understanding missing data mechanisms, including Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Through real-world scenarios and examples, it explains how different types of missingness impact data analysis and decision-making. It also outlines common strategies for handling missing data, including deletion techniques and imputation methods such as mean imputation, regression, and stochastic modeling. Designed for researchers, analysts, and students working with real-world datasets, this guide helps ensure statistical validity, reduce bias, and improve the overall quality of analysis in fields like public health, behavioral science, social research, and machine learning.
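
    As a rough illustration of the three mechanisms named in this description, the Python sketch below removes values from a simulated variable under MCAR, MAR, and MNAR. It is not taken from the guide itself: the function name, the 30% rate, and the threshold-based masks are assumptions chosen for brevity.

```python
import numpy as np

def simulate_missingness(x, y, mechanism, rate=0.3, seed=0):
    """Return a copy of y with NaNs inserted under the given mechanism.

    x is a fully observed covariate; y is the variable that loses values.
    (Hypothetical helper; the masks below are one simple choice per mechanism.)
    """
    rng = np.random.default_rng(seed)
    y_obs = y.astype(float).copy()
    if mechanism == "MCAR":
        # Missingness is independent of both x and y.
        mask = rng.random(len(y)) < rate
    elif mechanism == "MAR":
        # Missingness depends only on the observed covariate x.
        mask = x > np.quantile(x, 1 - rate)
    elif mechanism == "MNAR":
        # Missingness depends on the unobserved value of y itself.
        mask = y > np.quantile(y, 1 - rate)
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    y_obs[mask] = np.nan
    return y_obs

# Simulated data: y is correlated with x and has a true mean near 0.
rng = np.random.default_rng(42)
x = rng.normal(size=5000)
y = 2.0 * x + rng.normal(size=5000)

# Mean of the values that remain observed under each mechanism.
observed_means = {
    m: float(np.nanmean(simulate_missingness(x, y, m)))
    for m in ("MCAR", "MAR", "MNAR")
}
```

    In this setup the observed mean stays close to the full-data mean under MCAR and is pulled downward under MAR and MNAR. Under MAR the bias could in principle be corrected using x; under MNAR it could not without further assumptions.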

  2. Water-quality data imputation with a high percentage of missing values: a...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 8, 2021
    + more versions
    Cite
    Rafael Rodríguez; Marcos Pastorini; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Angela Gorgoglione (2021). Water-quality data imputation with a high percentage of missing values: a machine learning approach [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4731168
    Explore at:
    Dataset updated
    Jun 8, 2021
    Dataset provided by
    Department of Fluid Mechanics and Environmental Engineering (IMFIA), School of Engineering, Universidad de la República, Uruguay
    Department of Computer Science (InCo), School of Engineering, Universidad de la República, Uruguay
    Authors
    Rafael Rodríguez; Marcos Pastorini; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Angela Gorgoglione
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The monitoring of surface-water quality followed by water-quality modeling and analysis is essential for generating effective strategies in water resource management. However, water-quality studies are limited by the lack of complete and reliable data sets on surface-water-quality variables. These deficiencies are particularly noticeable in developing countries.

    This work focuses on surface-water-quality data from Santa Lucía Chico river (Uruguay), a mixed lotic and lentic river system. Data collected at six monitoring stations are publicly available at https://www.dinama.gub.uy/oan/datos-abiertos/calidad-agua/. The high temporal and spatial variability that characterizes water-quality variables and the high rate of missing values (between 50% and 70%) raise significant challenges.

    To deal with missing values, we applied several statistical and machine-learning imputation methods. The competing algorithms implemented belonged to both univariate and multivariate imputation methods (inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR), and K-nearest neighbors Regressor (KNNR)).

    IDW outperformed the others, achieving a very good performance (NSE greater than 0.8) in most cases.
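
    To make the two quantities mentioned above concrete, here is a minimal Python sketch of inverse distance weighting and of the Nash-Sutcliffe efficiency (NSE). It is not the authors' implementation: the function names, the power of 2, and the sample readings and distances are assumptions for illustration only.

```python
import numpy as np

def idw_impute(values, distances, power=2.0):
    """Estimate one missing value from neighbouring stations.

    values    : readings at the neighbouring stations (NaN where also missing)
    distances : distance from the target station to each neighbour (> 0)
    power     : distance exponent (2 is a common default, assumed here)
    """
    values = np.asarray(values, dtype=float)
    distances = np.asarray(distances, dtype=float)
    usable = ~np.isnan(values)
    if not usable.any():
        return np.nan  # nothing to interpolate from
    weights = 1.0 / distances[usable] ** power
    return float(np.sum(weights * values[usable]) / np.sum(weights))

def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the observed mean."""
    observed = np.asarray(observed, dtype=float)
    simulated = np.asarray(simulated, dtype=float)
    residual = np.sum((observed - simulated) ** 2)
    spread = np.sum((observed - observed.mean()) ** 2)
    return float(1.0 - residual / spread)

# Hypothetical dissolved-oxygen readings (mg/L) at three neighbouring stations;
# the second neighbour is itself missing and is skipped.
estimate = idw_impute([8.0, np.nan, 6.0], distances=[1.0, 2.0, 2.0])
```

    Closer stations dominate the estimate because their weights grow as the distance shrinks, which is why the result here sits nearer to the reading at distance 1.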

    In this dataset, we include the original and imputed values for the following variables:

    Water temperature (Tw)

    Dissolved oxygen (DO)

    Electrical conductivity (EC)

    pH

    Turbidity (Turb)

    Nitrite (NO2-)

    Nitrate (NO3-)

    Total Nitrogen (TN)

    Each variable is identified as [STATION] VARIABLE FULL NAME (VARIABLE SHORT NAME) [UNIT METRIC].

    More details about the study area, the original datasets, and the methodology adopted can be found in our paper https://www.mdpi.com/2071-1050/13/11/6318.

    If you use this dataset in your work, please cite our paper: Rodríguez, R.; Pastorini, M.; Etcheverry, L.; Chreties, C.; Fossati, M.; Castro, A.; Gorgoglione, A. Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach. Sustainability 2021, 13, 6318. https://doi.org/10.3390/su13116318

  3. Data from: A multiple imputation method using population information

    • tandf.figshare.com
    pdf
    Updated Apr 30, 2025
    Cite
    Tadayoshi Fushiki (2025). A multiple imputation method using population information [Dataset]. http://doi.org/10.6084/m9.figshare.28900017.v1
    Explore at:
    Available download formats: pdf
    Dataset updated
    Apr 30, 2025
    Dataset provided by
    Taylor & Francis: https://taylorandfrancis.com/
    Authors
    Tadayoshi Fushiki
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Multiple imputation (MI) is effectively used to deal with missing data when the missing mechanism is missing at random. However, MI may not be effective when the missing mechanism is not missing at random (NMAR). In such cases, additional information is required to obtain an appropriate imputation. Pham et al. (2019) proposed the calibrated-δ adjustment method, which is a multiple imputation method using population information. It provides appropriate imputation in two NMAR settings. However, the calibrated-δ adjustment method has two problems. First, it can be used only when one variable has missing values. Second, the theoretical properties of the variance estimator have not been provided. This article proposes a multiple imputation method using population information that can be applied when several variables have missing values. The proposed method is proven to include the calibrated-δ adjustment method. It is shown that the proposed method provides a consistent estimator for the parameter of the imputation model in an NMAR situation. The asymptotic variance of the estimator obtained by the proposed method and its estimator are also given.

  4. Data from: Imputation of Missing Covariates in Randomized Controlled Trials...

    • tandf.figshare.com
    docx
    Updated May 31, 2023
    Cite
    Mutamba T. Kayembe; Shahab Jolani; Frans E.S. Tan; Gerard J.P. van Breukelen (2023). Imputation of Missing Covariates in Randomized Controlled Trials with Continuous Outcomes: Simple, Unbiased and Efficient Methods [Dataset]. http://doi.org/10.6084/m9.figshare.18637732.v1
    Explore at:
    Available download formats: docx
    Dataset updated
    May 31, 2023
    Dataset provided by
    Taylor & Francis: https://taylorandfrancis.com/
    Authors
    Mutamba T. Kayembe; Shahab Jolani; Frans E.S. Tan; Gerard J.P. van Breukelen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The literature on dealing with missing covariates in nonrandomized studies advocates the use of sophisticated methods like multiple imputation (MI) and maximum likelihood (ML)-based approaches over simple methods. However, these methods are not necessarily optimal in terms of bias and efficiency of treatment effect estimation in randomized studies, where the covariate of interest (treatment group) is independent of all baseline (pre-randomization) covariates due to randomization. This has been shown in the literature, but only for missingness on a single baseline covariate. Here, we extend the situation to multiple baseline covariates with missingness and evaluate the performance of MI and ML compared with simple alternative methods under various missingness scenarios in RCTs with a quantitative outcome. We first derive asymptotic relative efficiencies of the simple methods under the missing completely at random (MCAR) scenario and then perform a simulation study for non-MCAR scenarios. Finally, a trial on chronic low back pain is used to illustrate the implementation of the methods. The results show that all simple methods give unbiased treatment effect estimation but with increased mean squared residual. It also turns out that mean imputation and the missing-indicator method are most efficient under all covariate missingness scenarios and perform at least as well as MI and LM in each scenario.

  5. Handling Missing Data Example Dataset

    • kaggle.com
    zip
    Updated Aug 21, 2025
    Cite
    PRINCE1204 (2025). Handling Missing Data Example Dataset [Dataset]. https://www.kaggle.com/prince1204/handling-missing-data-example-dataset
    Explore at:
    Available download formats: zip (10211 bytes)
    Dataset updated
    Aug 21, 2025
    Authors
    PRINCE1204
    Description

    📊 Dataset Description – Handling Missing Data

    This dataset contains 1,000 employee records across different departments and cities, designed for practicing data cleaning, preprocessing, and handling missing values in real-world scenarios.

    🔹 Features (Columns)

    • ID (Integer): Unique identifier for each employee.
    • Age (Float): Age of the employee (some values are missing).
    • Salary (Float): Annual salary of the employee in USD (some values are missing).
    • Experience (Float): Total years of professional experience (some values are missing).
    • Department (Categorical): Department of the employee (e.g., IT, Sales, Finance, Admin) – contains missing values.
    • City (Categorical): Work location of the employee (e.g., London, Berlin, New York) – contains missing values.

    🔹 Missing Data Information

    • Columns Age, Salary, Experience, Department, and City contain around 100 missing values each.
    • The dataset is ideal for testing different missing data handling techniques, such as:
      • Mean / Median / Mode imputation
      • Random sampling imputation
      • Forward / Backward filling
      • Predictive modeling approaches
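
    The simpler techniques in the list above can be sketched with pandas as follows. The tiny frame is a made-up stand-in that reuses three of the column names from the description; it is not the Kaggle file, and the values are invented.

```python
import numpy as np
import pandas as pd

# Made-up stand-in frame using column names from the description above.
df = pd.DataFrame({
    "Age": [25.0, np.nan, 35.0, 45.0],
    "Salary": [50000.0, 60000.0, np.nan, 90000.0],
    "Department": ["IT", None, "IT", "Sales"],
})

# Mean / median / mode imputation, one column each.
imputed = df.copy()
imputed["Age"] = df["Age"].fillna(df["Age"].mean())
imputed["Salary"] = df["Salary"].fillna(df["Salary"].median())
imputed["Department"] = df["Department"].fillna(df["Department"].mode()[0])

# Forward fill, then backward fill to cover any gaps at the start.
filled = df.ffill().bfill()

# Random sampling imputation: draw replacements from the observed values.
rng = np.random.default_rng(0)
sampled = df["Age"].copy()
missing = sampled.isna()
sampled[missing] = rng.choice(df["Age"].dropna().to_numpy(), size=missing.sum())
```

    Forward and backward filling only make sense when row order carries meaning (for example, time-ordered records), so for an unordered employee table they serve mainly as a practice exercise.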

    🔹 Use Cases

    • 🧹 Practice data cleaning & preprocessing for ML projects.
    • 🔧 Explore imputation techniques for both numerical and categorical data.
    • 🤖 Build predictive models while handling incomplete datasets.
    • 🎓 Great for educational purposes, tutorials, and workshops on missing data handling.
  6. Additional file 4 of Heckman imputation models for binary or continuous MNAR...

    • springernature.figshare.com
    txt
    Updated Jun 1, 2023
    + more versions
    Cite
    Jacques-Emmanuel Galimard; Sylvie Chevret; Emmanuel Curis; Matthieu Resche-Rigon (2023). Additional file 4 of Heckman imputation models for binary or continuous MNAR outcomes and MAR predictors [Dataset]. http://doi.org/10.6084/m9.figshare.7038104.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    Jacques-Emmanuel Galimard; Sylvie Chevret; Emmanuel Curis; Matthieu Resche-Rigon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    R code to impute binary outcome. (R 1 kb)

  7. Random Imputer for Missing Data

    • kaggle.com
    zip
    Updated Jun 17, 2024
    Cite
    SakshiRahangdale (2024). Random Imputer for Missing Data [Dataset]. https://www.kaggle.com/datasets/sakshirahangdale/random-imputer-for-missing-data
    Explore at:
    Available download formats: zip (231998 bytes)
    Dataset updated
    Jun 17, 2024
    Authors
    SakshiRahangdale
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by SakshiRahangdale

    Released under Apache 2.0


  8. Assessment of the missing at random assumption–the associations between...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Jun 27, 2017
    Cite
    Egger, Sam; Luo, Qingwei; O’Connell, Dianne L.; Smith, David P.; Yu, Xue Qin (2017). Assessment of the missing at random assumption–the associations between “unknown” stage prostate cancer recorded in the NSWCR and PCOS-stage, after adjusting for variables included in the imputation models (n = 1864). [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001805427
    Explore at:
    Dataset updated
    Jun 27, 2017
    Authors
    Egger, Sam; Luo, Qingwei; O’Connell, Dianne L.; Smith, David P.; Yu, Xue Qin
    Description

    Assessment of the missing at random assumption–the associations between “unknown” stage prostate cancer recorded in the NSWCR and PCOS-stage, after adjusting for variables included in the imputation models (n = 1864).

  9. Data from: Fast tipping point sensitivity analyses in clinical trials with...

    • tandf.figshare.com
    application/gzip
    Updated Jun 1, 2023
    Cite
    Anders Gorst-Rasmussen; Mads Jeppe Tarp-Johansen (2023). Fast tipping point sensitivity analyses in clinical trials with missing continuous outcomes under multiple imputation [Dataset]. http://doi.org/10.6084/m9.figshare.19967496.v1
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Taylor & Francis: https://taylorandfrancis.com/
    Authors
    Anders Gorst-Rasmussen; Mads Jeppe Tarp-Johansen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    When dealing with missing data in clinical trials, it is often convenient to work under simplifying assumptions, such as missing at random (MAR), and follow up with sensitivity analyses to address unverifiable missing data assumptions. One such sensitivity analysis, routinely requested by regulatory agencies, is the so-called tipping point analysis, in which the treatment effect is re-evaluated after adding a successively more extreme shift parameter to the predicted values among subjects with missing data. If the shift parameter needed to overturn the conclusion is so extreme that it is considered clinically implausible, then this indicates robustness to missing data assumptions. Tipping point analyses are frequently used in the context of continuous outcome data under multiple imputation. While simple to implement, computation can be cumbersome in the two-way setting where both comparator and active arms are shifted, essentially requiring the evaluation of a two-dimensional grid of models. We describe a computationally efficient approach to performing two-way tipping point analysis in the setting of continuous outcome data with multiple imputation. We show how geometric properties can lead to further simplification when exploring the impact of missing data. Lastly, we propose a novel extension to a multi-way setting which yields simple and general sufficient conditions for robustness to missing data assumptions.
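
    The basic idea of a tipping point analysis can be sketched in Python as below. This is a deliberately simplified one-way version on a single completed dataset with a normal-approximation test. It is not the authors' efficient two-way method and omits multiple imputation; the function name, the simulated arms, and the delta grid are all assumptions.

```python
import numpy as np

def tipping_point(active, control, active_imputed, deltas, z_crit=1.96):
    """Return the first shift at which the effect stops being significant.

    active, control : outcome arrays with missing values already imputed
    active_imputed  : boolean mask marking imputed entries in the active arm
    deltas          : candidate shifts, ordered from mild to extreme
    """
    for delta in deltas:
        shifted = active + delta * active_imputed  # shift imputed values only
        diff = shifted.mean() - control.mean()
        se = np.sqrt(shifted.var(ddof=1) / len(shifted)
                     + control.var(ddof=1) / len(control))
        if abs(diff / se) < z_crit:
            return float(delta)
    return None  # conclusion never overturned on this grid

# Simulated trial: the active arm is better by about 1 unit, and the first
# 30 of its 100 values are treated as imputed (hypothetical setup).
rng = np.random.default_rng(1)
active = rng.normal(1.0, 1.0, size=100)
control = rng.normal(0.0, 1.0, size=100)
active_imputed = np.zeros(100, dtype=bool)
active_imputed[:30] = True

tip = tipping_point(active, control, active_imputed, np.arange(0, -5.01, -0.25))
```

    Whether the returned shift counts as clinically implausible is a subject-matter judgment that the code cannot make.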

  10. Multi-Label Datasets with Missing Values

    • data.niaid.nih.gov
    Updated Mar 19, 2023
    Cite
    Antonio F. L. Jacob Jr.; Fabrício A. do Carmo; Ádamo L. de Santana; Ewaldo Santana; Fábio M. F. Lobato (2023). Multi-Label Datasets with Missing Values [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7748932
    Explore at:
    Dataset updated
    Mar 19, 2023
    Dataset provided by
    UEMA
    UFOPA
    Fuji Electric Co. Ltd.
    Authors
    Antonio F. L. Jacob Jr.; Fabrício A. do Carmo; Ádamo L. de Santana; Ewaldo Santana; Fábio M. F. Lobato
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Consisting of six multi-label datasets from the UCI Machine Learning repository.

    Each dataset contains missing values which have been artificially added at the following rates: 5, 10, 15, 20, 25, and 30%. The “amputation” was performed using the “Missing Completely at Random” mechanism.

    File names are represented as follows:

       amp_DB_MR.arff

    where:

       DB = original dataset;
       MR = missing rate.
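
    A minimal Python sketch of MCAR "amputation" and of the naming pattern above follows. The helper names are hypothetical, "emotions" is a placeholder dataset name, and writing MR as a whole-number percentage is an assumption, since the description does not show a concrete file name.

```python
import numpy as np

def ampute_mcar(data, rate, seed=0):
    """Return a float copy of data with roughly `rate` of entries set to NaN."""
    rng = np.random.default_rng(seed)
    out = np.array(data, dtype=float)
    mask = rng.random(out.shape) < rate  # each cell is dropped independently
    out[mask] = np.nan
    return out

def amputed_name(dataset, rate_percent):
    """Build a file name following the amp_DB_MR.arff pattern (MR format assumed)."""
    return f"amp_{dataset}_{rate_percent}.arff"

# Missing rates listed in the description, in percent.
rates = [5, 10, 15, 20, 25, 30]
names = [amputed_name("emotions", r) for r in rates]  # placeholder dataset name

# Example: remove about 20% of the cells from a 200 x 50 matrix.
amputed = ampute_mcar(np.ones((200, 50)), rate=0.2)
```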
    

    For more details, please read:

    IEEE Access article (in review process)

  11. Data from: Validity of using multiple imputation for "unknown" stage at...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Jun 27, 2017
    Cite
    Luo, Qingwei; Egger, Sam; Yu, Xue Qin; Smith, David P.; O’Connell, Dianne L. (2017). Validity of using multiple imputation for "unknown" stage at diagnosis in population-based cancer registry data [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001781541
    Explore at:
    Dataset updated
    Jun 27, 2017
    Authors
    Luo, Qingwei; Egger, Sam; Yu, Xue Qin; Smith, David P.; O’Connell, Dianne L.
    Description

    Background: The multiple imputation approach to missing data has been validated by a number of simulation studies by artificially inducing missingness on fully observed stage data under a pre-specified missing data mechanism. However, the validity of multiple imputation has not yet been assessed using real data. The objective of this study was to assess the validity of using multiple imputation for “unknown” prostate cancer stage recorded in the New South Wales Cancer Registry (NSWCR) in real-world conditions.

    Methods: Data from the population-based cohort study NSW Prostate Cancer Care and Outcomes Study (PCOS) were linked to 2000–2002 NSWCR data. For cases with “unknown” NSWCR stage, PCOS-stage was extracted from clinical notes. Logistic regression was used to evaluate the missing at random assumption adjusted for variables from two imputation models: a basic model including NSWCR variables only and an enhanced model including the same NSWCR variables together with PCOS primary treatment. Cox regression was used to evaluate the performance of MI.

    Results: Of the 1864 prostate cancer cases 32.7% were recorded as having “unknown” NSWCR stage. The missing at random assumption was satisfied when the logistic regression included the variables included in the enhanced model, but not those in the basic model only. The Cox models using data with imputed stage from either imputation model provided generally similar estimated hazard ratios but with wider confidence intervals compared with those derived from analysis of the data with PCOS-stage. However, the complete-case analysis of the data provided a considerably higher estimated hazard ratio for the low socio-economic status group and rural areas in comparison with those obtained from all other datasets.

    Conclusions: Using MI to deal with “unknown” stage data recorded in a population-based cancer registry appears to provide valid estimates. We would recommend a cautious approach to the use of this method elsewhere.

  12. ssa-breast-missing-data-patterns

    • huggingface.co
    Updated Nov 26, 2025
    Cite
    Electric Sheep (2025). ssa-breast-missing-data-patterns [Dataset]. https://huggingface.co/datasets/electricsheepafrica/ssa-breast-missing-data-patterns
    Explore at:
    Dataset updated
    Nov 26, 2025
    Dataset authored and provided by
    Electric Sheep
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    SSA Breast Missing Data Patterns (Synthetic)

      Dataset summary
    

    This module provides a synthetic missing-data sandbox for oncology care in African healthcare contexts, focusing on:

    • Realistic loss-to-follow-up (LTFU) and retention patterns over 0–24 months.
    • Incomplete diagnostic and laboratory test results (ordered vs completed vs available in records).
    • Non-random missingness driven by facility type, distance, socioeconomic status (SES), and insurance.

    The dataset is… See the full description on the dataset page: https://huggingface.co/datasets/electricsheepafrica/ssa-breast-missing-data-patterns.

  13. Pre and post imputation descriptives of all study variables.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    • +1 more
    Updated Nov 20, 2013
    Cite
    Huber, Stella Maria; Heumann, Christian; Schomaker, Michael; Jenni, Oskar G.; Caflisch, Jon; Radon, Katja; Muñoz, Daniel Moraga; von Ehrenstein, Ondine S.; Michalke, Bernhard; Schierl, Rudolf; Ohlander, Johan (2013). Pre and post imputation descriptives of all study variables. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001624778
    Explore at:
    Dataset updated
    Nov 20, 2013
    Authors
    Huber, Stella Maria; Heumann, Christian; Schomaker, Michael; Jenni, Oskar G.; Caflisch, Jon; Radon, Katja; Muñoz, Daniel Moraga; von Ehrenstein, Ondine S.; Michalke, Bernhard; Schierl, Rudolf; Ohlander, Johan
    Description

    (1) Descriptives for variables post imputation were calculated using Rubin’s rules.
    (2) NA = missing value. Column displays percentage of missing values in variable.
    (3) Variable additionally included in imputation model to improve missing at random assumption.
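
    For readers unfamiliar with Rubin's rules, the Python sketch below pools a point estimate and its variance across imputed datasets. The function name and the three example estimates are illustrative assumptions and are not values from this study.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool m per-imputation estimates and their squared standard errors."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    q_bar = q.mean()         # pooled point estimate
    within = u.mean()        # average within-imputation variance
    between = q.var(ddof=1)  # between-imputation variance
    total = within + (1 + 1 / m) * between
    return float(q_bar), float(total)

# Made-up estimates of a mean from m = 3 imputed datasets, each with variance 1.
pooled_mean, pooled_var = pool_rubin([10.0, 12.0, 11.0], [1.0, 1.0, 1.0])
```

    The total variance exceeds the average within-imputation variance whenever the estimates differ across imputations, which is how the extra uncertainty due to missing data enters the pooled result.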

  14. Penalized Empirical Likelihood of High-Dimensional Semiparametric Varying...

    • scidb.cn
    Updated Feb 17, 2025
    Cite
    wang long (2025). Penalized Empirical Likelihood of High-Dimensional Semiparametric Varying Coefficient Errors-in-Variables Model Under Missing Data - Appendix [Dataset]. http://doi.org/10.57760/sciencedb.j00206.00050
    Explore at:
    Dataset updated
    Feb 17, 2025
    Dataset provided by
    Science Data Bank
    Authors
    wang long
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    The auxiliary random vector of the parameter part is mainly constructed through inverse probability weighting and local correction methods, and its asymptotic normality is proved for the mixed sequence by combining the random error term. Based on the constructed parameter part auxiliary random vector, the empirical log-likelihood ratio function of the parameter part is obtained. At the same time, it is recommended to use penalized empirical likelihood (PEL) for variable selection. Under appropriate conditions, it is proved that the proposed penalized empirical likelihood estimation has the oracle property and follows an asymptotic standard chi-square distribution.

  15. Replication Data for: Comparative investigation of time series missing data...

    • dataverse.harvard.edu
    • dataone.org
    Updated Jul 24, 2020
    Cite
    LEIZHEN ZANG; Feng XIONG (2020). Replication Data for: Comparative investigation of time series missing data imputation in political science: Different methods, different results [Dataset]. http://doi.org/10.7910/DVN/GQHURF
    Explore at:
    Dataset updated
    Jul 24, 2020
    Dataset provided by
    Harvard Dataverse
    Authors
    LEIZHEN ZANG; Feng XIONG
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Missing data is a growing concern in social science research. This paper introduces novel machine-learning methods to explore imputation efficiency and its effect on missing data. The authors used Internet and public service data as the test examples. The empirical results show that the method not only verified the robustness of the positive impact of Internet penetration on the public service, but also further ensured that the machine-learning imputation method was better than random and multiple imputation, greatly improving the model’s explanatory power. The panel data after machine-learning imputation with better continuity in the time trend is feasibly analyzed, which can also be analyzed using the dynamic panel model. The long-term effects of the Internet on public services were found to be significantly stronger than the short-term effects. Finally, some mechanisms in the empirical analysis are discussed.

  16. Data from: Bias and sensitivity in the placement of fossil taxa resulting...

    • datadryad.org
    • datasetcatalog.nlm.nih.gov
    • +2 more
    zip
    Updated Nov 21, 2014
    Cite
    Robert S. Sansom (2014). Bias and sensitivity in the placement of fossil taxa resulting from interpretations of missing data [Dataset]. http://doi.org/10.5061/dryad.7tq20
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 21, 2014
    Dataset provided by
    Dryad
    Authors
    Robert S. Sansom
    Time period covered
    Aug 21, 2014
    Description

    supplmentaryscript: TNT script for introduction of random absences and assessment of effect on taxon placement. Also includes TNT script used to generate simulated datasets.

  17. Table 1_A random forest dynamic threshold imputation method for handling...

    • frontiersin.figshare.com
    pdf
    Updated Aug 5, 2025
    Cite
    Xiaofeng You; Jianqin Yang; Xinai Xu (2025). Table 1_A random forest dynamic threshold imputation method for handling missing data in cognitive diagnosis assessments.pdf [Dataset]. http://doi.org/10.3389/fpsyg.2025.1487111.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    Aug 5, 2025
    Dataset provided by
    Frontiers Media: http://www.frontiersin.org/
    Authors
    Xiaofeng You; Jianqin Yang; Xinai Xu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The handling of missing data in cognitive diagnostic assessment is an important issue. The Random Forest Threshold Imputation (RFTI) method proposed by You et al. in 2023 is specifically designed for cognitive diagnostic models (CDMs) and built on random forest imputation. However, in RFTI, the threshold for determining imputed values to be 0 is fixed at 0.5, which may result in uncertainty in this imputation. To address this issue, we proposed an improved method, Random Forest Dynamic Threshold Imputation (RFDTI), which possesses two dynamic thresholds for dichotomous imputed values. A simulation study showed that the classification of attribute profiles when using RFDTI to impute missing data was always better than the four commonly used traditional methods (i.e., person mean imputation, two-way imputation, expectation–maximization algorithm, and multiple imputation). Compared with RFTI, RFDTI was slightly better for MAR or MCAR data, but slightly worse for MNAR or MIXED data, especially with a larger missingness proportion. An empirical example with MNAR data demonstrates the applicability of RFDTI, which performed similarly to RFTI and much better than the other four traditional methods. An R package is provided to facilitate the application of the proposed method.

  18. A Simple Optimization Workflow to Enable Precise and Accurate Imputation of...

    • datasetcatalog.nlm.nih.gov
    • acs.figshare.com
    • +1 more
    Updated May 3, 2021
    Cite
    Dabke, Kruttika; Jones, Michelle R.; Kreimer, Simion; Parker, Sarah J. (2021). A Simple Optimization Workflow to Enable Precise and Accurate Imputation of Missing Values in Proteomic Data Sets [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000907442
    Explore at:
    Dataset updated
    May 3, 2021
    Authors
    Dabke, Kruttika; Jones, Michelle R.; Kreimer, Simion; Parker, Sarah J.
    Description

    Missing values in proteomic data sets have real consequences on downstream data analysis and reproducibility. Although several imputation methods exist to handle missing values, no single imputation method is best suited for a diverse range of data sets, and no clear strategy exists for evaluating imputation methods for clinical DIA-MS data sets, especially at different levels of protein quantification. To navigate through the different imputation strategies available in the literature, we have established a strategy to assess imputation methods on clinical label-free DIA-MS data sets. We used three DIA-MS data sets with real missing values to evaluate eight imputation methods with multiple parameters at different levels of protein quantification: a dilution series data set, a small pilot data set, and a clinical proteomic data set comparing paired tumor and stroma tissue. We found that imputation methods based on local structures within the data, like local least-squares (LLS) and random forest (RF), worked well in our dilution series data set, whereas imputation methods based on global structures within the data, like BPCA, performed well in the other two data sets. We also found that imputation at the most basic protein quantification level (fragment level) improved accuracy and the number of proteins quantified. With this analytical framework, we quickly and cost-effectively evaluated different imputation methods using two smaller complementary data sets to narrow down to the larger proteomic data set’s most accurate methods. This acquisition strategy allowed us to provide reproducible evidence of the accuracy of the imputation method, even in the absence of a ground truth. Overall, this study indicates that the most suitable imputation method relies on the overall structure of the data set and provides an example of an analytic framework that may assist in identifying the most appropriate imputation strategies for the differential analysis of proteins.

  19. Data from: Matrix Completion When Missing Is Not at Random and Its...

    • tandf.figshare.com
    zip
    Updated Sep 20, 2024
    Jungjun Choi; Ming Yuan (2024). Matrix Completion When Missing Is Not at Random and Its Applications in Causal Panel Data Models [Dataset]. http://doi.org/10.6084/m9.figshare.26319010.v2
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Jungjun Choi; Ming Yuan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This article develops an inferential framework for matrix completion when missing is not at random and without the requirement of strong signals. Our development is based on the observation that if the number of missing entries is small enough compared to the panel size, then they can be estimated well even when missing is not at random. Taking advantage of this fact, we divide the missing entries into smaller groups and estimate each group via nuclear norm regularization. In addition, we show that with appropriate debiasing, our proposed estimate is asymptotically normal even for fairly weak signals. Our work is motivated by recent research on the Tick Size Pilot Program, an experiment conducted by the Securities and Exchange Commission (SEC) to evaluate the impact of widening the tick size on the market quality of stocks from 2016 to 2018. While previous studies were based on traditional regression or difference-in-difference methods by assuming that the treatment effect is invariant with respect to time and unit, our analyses suggest significant heterogeneity across units and intriguing dynamics over time during the pilot program. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.

  20. Dataset for: Latent trait shared parameter mixed-models for missing...

    • datasetcatalog.nlm.nih.gov
    • wiley.figshare.com
    Updated Oct 31, 2018
    Cursio, John; Mermelstein, Robin J.; Hedeker, Donald (2018). Dataset for: Latent trait shared parameter mixed-models for missing ecological momentary assessment data [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000688132
    Authors
    Cursio, John; Mermelstein, Robin J.; Hedeker, Donald
    Description

    Latent trait shared-parameter mixed-models (LTSPMM) for ecological momentary assessment (EMA) data containing missing values are developed in which data are collected in an intermittent manner. In such studies, data are often missing due to unanswered prompts. Using item response theory (IRT) models, a latent trait is used to represent the missing prompts and modeled jointly with a mixed-model for bivariate longitudinal outcomes. Both one- and two-parameter LTSPMMs are presented. These new models offer a unique way to analyze missing EMA data with many response patterns. Here, the proposed models represent missingness via a latent trait that corresponds to the students' "ability" to respond to the prompting device. Data containing more than 10,300 observations from an EMA study involving high-school students' positive and negative affect are presented. The latent trait representing missingness was a significant predictor of both positive affect and negative affect outcomes. The models are compared to a missing at random (MAR) mixed-model. A simulation study indicates that the proposed models can provide lower bias and increased efficiency compared to the standard MAR approach commonly used with intermittently missing longitudinal data.
