Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This document provides a clear and practical guide to understanding missing data mechanisms, including Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Through real-world scenarios and examples, it explains how different types of missingness impact data analysis and decision-making. It also outlines common strategies for handling missing data, including deletion techniques and imputation methods such as mean imputation, regression, and stochastic modeling.
Designed for researchers, analysts, and students working with real-world datasets, this guide helps ensure statistical validity, reduce bias, and improve the overall quality of analysis in fields like public health, behavioral science, social research, and machine learning.
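As a quick illustration of why the mechanism matters, the hedged sketch below simulates MCAR, MAR, and MNAR missingness on a toy variable and shows how complete-case and mean-imputed estimates drift under the latter two mechanisms; all names and parameters are illustrative and not taken from the guide itself.

```python
# Illustrative only: simulate the three missingness mechanisms and
# compare the observed-case and mean-imputed means with the true mean.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
z = rng.normal(size=n)             # fully observed covariate
x = 2.0 + z + rng.normal(size=n)   # variable that will go missing

mcar = rng.random(n) < 0.3                          # independent of everything
mar = rng.random(n) < 1 / (1 + np.exp(-z))          # depends on observed z only
mnar = rng.random(n) < 1 / (1 + np.exp(-(x - 2.0))) # depends on x itself

for name, mask in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    obs_mean = x[~mask].mean()
    # Mean imputation: replace every missing value with the observed mean.
    imp_mean = np.where(mask, obs_mean, x).mean()
    print(f"{name}: observed mean {obs_mean:.3f}, "
          f"mean-imputed mean {imp_mean:.3f}, true mean {x.mean():.3f}")
```

Under MCAR the observed-case mean stays close to the truth, while under MAR and MNAR both estimates are biased, which is exactly the distinction the guide draws.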
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The monitoring of surface-water quality followed by water-quality modeling and analysis is essential for generating effective strategies in water resource management. However, water-quality studies are limited by the lack of complete and reliable data sets on surface-water-quality variables. These deficiencies are particularly noticeable in developing countries.
This work focuses on surface-water-quality data from the Santa Lucía Chico river (Uruguay), a mixed lotic and lentic river system. Data collected at six monitoring stations are publicly available at https://www.dinama.gub.uy/oan/datos-abiertos/calidad-agua/. The high temporal and spatial variability that characterizes water-quality variables and the high rate of missing values (between 50% and 70%) raise significant challenges.
To deal with missing values, we applied several statistical and machine-learning imputation methods. The competing algorithms included both univariate and multivariate imputation methods: inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR), and K-Nearest Neighbors Regressor (KNNR).
IDW outperformed the others, achieving a very good performance (NSE greater than 0.8) in most cases.
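For readers who want to see the mechanics, here is a minimal sketch of IDW imputation across stations for a single time step; the station coordinates, the power parameter p = 2, and the data layout are illustrative assumptions, not details from the study.

```python
# Minimal IDW sketch: fill missing station values from observed stations,
# weighting each donor by inverse distance raised to the power p.
import numpy as np

def idw_impute_timestep(values, coords, p=2.0):
    """Fill NaNs in one time step of station measurements via IDW.

    values: (n_stations,) array with NaNs at missing stations.
    coords: (n_stations, 2) station coordinates (e.g. projected x/y).
    """
    filled = values.copy()
    observed = ~np.isnan(values)
    for i in np.where(~observed)[0]:
        d = np.linalg.norm(coords[observed] - coords[i], axis=1)
        w = 1.0 / d**p                                  # inverse-distance weights
        filled[i] = np.sum(w * values[observed]) / np.sum(w)
    return filled

# Example: six stations, one missing water-temperature (Tw) measurement.
coords = np.array([[0, 0], [1, 0], [2, 1], [0, 2], [3, 3], [1, 4]], dtype=float)
tw = np.array([18.2, 18.5, np.nan, 17.9, 19.1, 18.0])
print(idw_impute_timestep(tw, coords))
```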
In this dataset, we include the original and imputed values for the following variables:
Water temperature (Tw)
Dissolved oxygen (DO)
Electrical conductivity (EC)
pH
Turbidity (Turb)
Nitrite (NO2-)
Nitrate (NO3-)
Total Nitrogen (TN)
Each variable is identified as [STATION] VARIABLE FULL NAME (VARIABLE SHORT NAME) [UNIT METRIC].
More details about the study area, the original datasets, and the methodology adopted can be found in our paper https://www.mdpi.com/2071-1050/13/11/6318.
If you use this dataset in your work, please cite our paper: Rodríguez, R.; Pastorini, M.; Etcheverry, L.; Chreties, C.; Fossati, M.; Castro, A.; Gorgoglione, A. Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach. Sustainability 2021, 13, 6318. https://doi.org/10.3390/su13116318
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Multiple imputation (MI) is effectively used to deal with missing data when the missing mechanism is missing at random. However, MI may not be effective when the missing mechanism is not missing at random (NMAR). In such cases, additional information is required to obtain an appropriate imputation. Pham et al. (2019) proposed the calibrated-δ adjustment method, which is a multiple imputation method using population information. It provides appropriate imputation in two NMAR settings. However, the calibrated-δ adjustment method has two problems. First, it can be used only when one variable has missing values. Second, the theoretical properties of the variance estimator have not been provided. This article proposes a multiple imputation method using population information that can be applied when several variables have missing values. The proposed method is proven to include the calibrated-δ adjustment method. It is shown that the proposed method provides a consistent estimator for the parameter of the imputation model in an NMAR situation. The asymptotic variance of the estimator obtained by the proposed method and its estimator are also given.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The literature on dealing with missing covariates in nonrandomized studies advocates the use of sophisticated methods like multiple imputation (MI) and maximum likelihood (ML)-based approaches over simple methods. However, these methods are not necessarily optimal in terms of bias and efficiency of treatment effect estimation in randomized studies, where the covariate of interest (treatment group) is independent of all baseline (pre-randomization) covariates due to randomization. This has been shown in the literature, but only for missingness on a single baseline covariate. Here, we extend the situation to multiple baseline covariates with missingness and evaluate the performance of MI and ML compared with simple alternative methods under various missingness scenarios in RCTs with a quantitative outcome. We first derive asymptotic relative efficiencies of the simple methods under the missing completely at random (MCAR) scenario and then perform a simulation study for non-MCAR scenarios. Finally, a trial on chronic low back pain is used to illustrate the implementation of the methods. The results show that all simple methods give unbiased treatment effect estimation but with increased mean squared residual. It also turns out that mean imputation and the missing-indicator method are the most efficient under all covariate missingness scenarios and perform at least as well as MI and ML in each scenario.
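A minimal sketch of the missing-indicator method discussed above, assuming a quantitative outcome and a single partially missing baseline covariate; the simulated data, column names, and the statsmodels OLS backend are illustrative assumptions.

```python
# Missing-indicator sketch: fill the covariate with a constant (here the
# observed mean) and add a 0/1 missingness indicator as an extra regressor.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "treat": rng.integers(0, 2, n),       # randomized treatment arm
    "baseline": rng.normal(size=n),       # baseline covariate
})
df["y"] = 1.5 * df["treat"] + 0.8 * df["baseline"] + rng.normal(size=n)
df.loc[rng.random(n) < 0.3, "baseline"] = np.nan   # MCAR missingness

df["baseline_missing"] = df["baseline"].isna().astype(int)
df["baseline_filled"] = df["baseline"].fillna(df["baseline"].mean())

X = sm.add_constant(df[["treat", "baseline_filled", "baseline_missing"]])
fit = sm.OLS(df["y"], X).fit()
print(fit.params["treat"])   # treatment effect estimate (close to 1.5)
```

Because randomization makes the treatment independent of the baseline covariate, this simple fill-plus-indicator approach leaves the treatment effect estimate unbiased, which is the result the abstract reports.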
This dataset contains 1,000 employee records across different departments and cities, designed for practicing data cleaning, preprocessing, and handling missing values in real-world scenarios.
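As a hedged starting point for practice, the snippet below shows a typical first pass over such a file; the file name employees.csv and the fill rules are hypothetical, not part of the dataset description.

```python
# Hypothetical first pass over the employee records (file name assumed).
import pandas as pd

df = pd.read_csv("employees.csv")
print(df.isna().mean().sort_values(ascending=False))  # missing share per column

# Simple baseline fills: median for numeric columns, a label for categoricals.
num = df.select_dtypes("number").columns
df[num] = df[num].fillna(df[num].median())
obj = df.select_dtypes("object").columns
df[obj] = df[obj].fillna("Unknown")
```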
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
R code to impute a binary outcome. (R, 1 kB)
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by SakshiRahangdale
Released under Apache 2.0
Assessment of the missing at random assumption: the associations between “unknown” stage prostate cancer recorded in the NSWCR and PCOS-stage, after adjusting for variables included in the imputation models (n = 1864).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
When dealing with missing data in clinical trials, it is often convenient to work under simplifying assumptions, such as missing at random (MAR), and follow up with sensitivity analyses to address unverifiable missing data assumptions. One such sensitivity analysis, routinely requested by regulatory agencies, is the so-called tipping point analysis, in which the treatment effect is re-evaluated after adding a successively more extreme shift parameter to the predicted values among subjects with missing data. If the shift parameter needed to overturn the conclusion is so extreme that it is considered clinically implausible, then this indicates robustness to missing data assumptions. Tipping point analyses are frequently used in the context of continuous outcome data under multiple imputation. While simple to implement, computation can be cumbersome in the two-way setting where both comparator and active arms are shifted, essentially requiring the evaluation of a two-dimensional grid of models. We describe a computationally efficient approach to performing two-way tipping point analysis in the setting of continuous outcome data with multiple imputation. We show how geometric properties can lead to further simplification when exploring the impact of missing data. Lastly, we propose a novel extension to a multi-way setting which yields simple and general sufficient conditions for robustness to missing data assumptions.
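To make the idea concrete, here is a hedged, brute-force sketch of a two-way tipping point scan on simulated continuous data: imputed values in each arm are shifted over a grid of deltas and the treatment effect is re-tested at each grid point. A single stochastic imputation stands in for full multiple imputation, and the grid ranges, the t-test, and all variable names are illustrative assumptions; this is the naive grid evaluation, not the computationally efficient approach the abstract describes.

```python
# Naive two-way tipping point scan on toy data (not the paper's method).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 200
arm = rng.integers(0, 2, n)              # 0 = comparator, 1 = active
y = 0.5 * arm + rng.normal(size=n)       # continuous outcome
missing = rng.random(n) < 0.25

# One stochastic imputation per arm (a stand-in for proper MI).
y_imp = y.copy()
for a in (0, 1):
    obs = y[(arm == a) & ~missing]
    idx = np.where((arm == a) & missing)[0]
    y_imp[idx] = rng.normal(obs.mean(), obs.std(), size=idx.size)

# Two-way grid: shift imputed values in each arm separately and re-test.
deltas = np.linspace(-2.0, 2.0, 9)
tipped = []
for d_act in deltas:
    for d_cmp in deltas:
        y_s = y_imp.copy()
        y_s[(arm == 1) & missing] += d_act   # shift imputed active-arm values
        y_s[(arm == 0) & missing] += d_cmp   # shift imputed comparator values
        p = stats.ttest_ind(y_s[arm == 1], y_s[arm == 0]).pvalue
        if p >= 0.05:                        # significance overturned
            tipped.append((d_act, d_cmp))
print(f"{len(tipped)} of {deltas.size**2} grid points overturn significance")
```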
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This collection consists of six multi-label datasets from the UCI Machine Learning Repository.
Each dataset contains missing values which have been artificially added at the following rates: 5, 10, 15, 20, 25, and 30%. The “amputation” was performed using the “Missing Completely at Random” mechanism.
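A minimal sketch of this MCAR amputation, assuming a complete numeric feature matrix; the rates follow the listing, everything else is illustrative.

```python
# MCAR amputation sketch: mask a fixed fraction of entries uniformly at random.
import numpy as np

def ampute_mcar(X, rate, seed=0):
    """Return a copy of X with `rate` of its entries set to NaN (MCAR)."""
    rng = np.random.default_rng(seed)
    X_amp = X.astype(float).copy()
    mask = rng.random(X.shape) < rate     # every cell equally likely to go missing
    X_amp[mask] = np.nan
    return X_amp

X = np.arange(20, dtype=float).reshape(5, 4)
for rate in (0.05, 0.10, 0.15, 0.20, 0.25, 0.30):
    X_amp = ampute_mcar(X, rate, seed=42)
    print(rate, np.isnan(X_amp).mean())   # realized missing fraction
```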
File names are represented as follows:
amp_DB_MR.arff
where:
DB = original dataset;
MR = missing rate.
For more details, please read:
IEEE Access article (under review)
Background: The multiple imputation approach to missing data has been validated by a number of simulation studies by artificially inducing missingness on fully observed stage data under a pre-specified missing data mechanism. However, the validity of multiple imputation has not yet been assessed using real data. The objective of this study was to assess the validity of using multiple imputation for “unknown” prostate cancer stage recorded in the New South Wales Cancer Registry (NSWCR) in real-world conditions.
Methods: Data from the population-based cohort study NSW Prostate Cancer Care and Outcomes Study (PCOS) were linked to 2000–2002 NSWCR data. For cases with “unknown” NSWCR stage, PCOS-stage was extracted from clinical notes. Logistic regression was used to evaluate the missing at random assumption, adjusted for variables from two imputation models: a basic model including NSWCR variables only and an enhanced model including the same NSWCR variables together with PCOS primary treatment. Cox regression was used to evaluate the performance of MI.
Results: Of the 1864 prostate cancer cases, 32.7% were recorded as having “unknown” NSWCR stage. The missing at random assumption was satisfied when the logistic regression included the variables of the enhanced model, but not those of the basic model only. The Cox models using data with imputed stage from either imputation model provided generally similar estimated hazard ratios, but with wider confidence intervals, compared with those derived from analysis of the data with PCOS-stage. However, the complete-case analysis of the data provided a considerably higher estimated hazard ratio for the low socio-economic status group and rural areas in comparison with those obtained from all other datasets.
Conclusions: Using MI to deal with “unknown” stage data recorded in a population-based cancer registry appears to provide valid estimates. We would recommend a cautious approach to the use of this method elsewhere.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
SSA Breast Missing Data Patterns (Synthetic)
Dataset summary
This module provides a synthetic missing-data sandbox for oncology care in African healthcare contexts, focusing on:
Realistic loss-to-follow-up (LTFU) and retention patterns over 0–24 months.
Incomplete diagnostic and laboratory test results (ordered vs. completed vs. available in records).
Non-random missingness driven by facility type, distance, socioeconomic status (SES), and insurance.
The dataset is… See the full description on the dataset page: https://huggingface.co/datasets/electricsheepafrica/ssa-breast-missing-data-patterns.
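A hedged usage sketch with the Hugging Face datasets library, assuming the repository id from the URL above, that the repo is public, and that a default "train" split exists:

```python
# Load the module and inspect per-column missingness (split name assumed).
from datasets import load_dataset

ds = load_dataset("electricsheepafrica/ssa-breast-missing-data-patterns")
print(ds)                                 # available splits and columns

df = ds["train"].to_pandas()              # assumes a "train" split exists
print(df.isna().mean().sort_values(ascending=False).head())
```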
1. Descriptives for variables post-imputation were calculated using Rubin’s rules.
2. NA = missing value. The column displays the percentage of missing values in the variable.
3. Variable additionally included in the imputation model to improve the missing at random assumption.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The auxiliary random vector for the parametric part is constructed mainly through inverse probability weighting and local correction methods, and its asymptotic normality is proved for the mixing sequence by incorporating the random error term. Based on this auxiliary random vector, the empirical log-likelihood ratio function for the parametric part is obtained. Penalized empirical likelihood (PEL) is then recommended for variable selection. Under appropriate conditions, the proposed penalized empirical likelihood estimator is shown to possess the oracle property and to follow an asymptotic standard chi-square distribution.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Missing data are a growing concern in social science research. This paper introduces novel machine-learning methods to explore imputation efficiency and its effect on missing data, using Internet and public service data as test examples. The empirical results show that the method not only verified the robustness of the positive impact of Internet penetration on public services, but also demonstrated that machine-learning imputation outperformed random and multiple imputation, greatly improving the model’s explanatory power. After machine-learning imputation, the panel data show better continuity in the time trend and can feasibly be analyzed, including with a dynamic panel model. The long-term effects of the Internet on public services were found to be significantly stronger than the short-term effects. Finally, some mechanisms behind the empirical findings are discussed.
Supplementary script: TNT script for the introduction of random absences and assessment of their effect on taxon placement. Also includes the TNT script used to generate simulated datasets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The handling of missing data in cognitive diagnostic assessment is an important issue. The Random Forest Threshold Imputation (RFTI) method proposed by You et al. in 2023 is specifically designed for cognitive diagnostic models (CDMs) and built on random forest imputation. However, in RFTI the threshold for determining whether an imputed value is 0 is fixed at 0.5, which may introduce uncertainty into the imputation. To address this issue, we propose an improved method, Random Forest Dynamic Threshold Imputation (RFDTI), which possesses two dynamic thresholds for dichotomous imputed values. A simulation study showed that the classification of attribute profiles when using RFDTI to impute missing data was always better than with four commonly used traditional methods (i.e., person mean imputation, two-way imputation, the expectation-maximization algorithm, and multiple imputation). Compared with RFTI, RFDTI was slightly better for MAR or MCAR data but slightly worse for MNAR or MIXED data, especially with a larger missingness proportion. An empirical example with MNAR data demonstrates the applicability of RFDTI, which performed similarly to RFTI and much better than the other four traditional methods. An R package is provided to facilitate the application of the proposed method.
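To convey the fixed-threshold idea that RFDTI relaxes, here is a hedged sketch in the spirit of RFTI (not the authors' implementation or their R package): each missing dichotomous response is predicted with a random forest from the other items, and the predicted probability is dichotomized at the fixed 0.5 threshold. The dynamic-threshold rule of RFDTI is not reproduced, and the toy response matrix is simulated.

```python
# Fixed-threshold random forest imputation sketch for 0/1 item responses.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
n, j = 300, 8
R = (rng.random((n, j)) < 0.6).astype(float)   # toy 0/1 response matrix
R[rng.random((n, j)) < 0.15] = np.nan          # inject missing responses

R_imp = R.copy()
for item in range(j):
    miss = np.isnan(R[:, item])
    if not miss.any():
        continue
    other = np.delete(R, item, axis=1)
    other = np.nan_to_num(other, nan=0.5)      # crude fill for predictor items
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(other[~miss], R[~miss, item].astype(int))
    p1 = rf.predict_proba(other[miss])[:, 1]   # P(response = 1)
    R_imp[miss, item] = (p1 >= 0.5).astype(float)  # fixed 0.5 threshold
```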
Missing values in proteomic data sets have real consequences on downstream data analysis and reproducibility. Although several imputation methods exist to handle missing values, no single imputation method is best suited for a diverse range of data sets, and no clear strategy exists for evaluating imputation methods for clinical DIA-MS data sets, especially at different levels of protein quantification. To navigate through the different imputation strategies available in the literature, we have established a strategy to assess imputation methods on clinical label-free DIA-MS data sets. We used three DIA-MS data sets with real missing values to evaluate eight imputation methods with multiple parameters at different levels of protein quantification: a dilution series data set, a small pilot data set, and a clinical proteomic data set comparing paired tumor and stroma tissue. We found that imputation methods based on local structures within the data, like local least-squares (LLS) and random forest (RF), worked well in our dilution series data set, whereas imputation methods based on global structures within the data, like BPCA, performed well in the other two data sets. We also found that imputation at the most basic protein quantification level, the fragment level, improved accuracy and the number of proteins quantified. With this analytical framework, we quickly and cost-effectively evaluated different imputation methods using two smaller complementary data sets to narrow down to the most accurate methods for the larger proteomic data set. This acquisition strategy allowed us to provide reproducible evidence of the accuracy of the imputation method, even in the absence of a ground truth. Overall, this study indicates that the most suitable imputation method depends on the overall structure of the data set and provides an example of an analytic framework that may assist in identifying the most appropriate imputation strategies for the differential analysis of proteins.
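The general masking idea behind such evaluations can be sketched as follows: hide a subset of observed intensities, impute, and score recovery on the held-out entries. The imputers below (KNN as a local method, column mean as a baseline) are illustrative stand-ins for the LLS, RF, and BPCA methods named in the abstract, and the toy intensity matrix is simulated.

```python
# Masking-based evaluation sketch: hold out known entries, impute, score.
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

rng = np.random.default_rng(4)
X = rng.normal(20, 2, size=(100, 30))          # toy log-intensity matrix
X[rng.random(X.shape) < 0.2] = np.nan          # "real" missing values

obs = np.argwhere(~np.isnan(X))                # indices of observed entries
held = obs[rng.choice(len(obs), size=200, replace=False)]
X_mask = X.copy()
X_mask[held[:, 0], held[:, 1]] = np.nan        # hide known entries

for name, imp in [("knn", KNNImputer(n_neighbors=5)),
                  ("mean", SimpleImputer(strategy="mean"))]:
    X_hat = imp.fit_transform(X_mask)
    err = X_hat[held[:, 0], held[:, 1]] - X[held[:, 0], held[:, 1]]
    print(name, "RMSE on held-out entries:", np.sqrt(np.mean(err**2)))
```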
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This article develops an inferential framework for matrix completion when missing is not at random and without the requirement of strong signals. Our development is based on the observation that if the number of missing entries is small enough compared to the panel size, then they can be estimated well even when missing is not at random. Taking advantage of this fact, we divide the missing entries into smaller groups and estimate each group via nuclear norm regularization. In addition, we show that with appropriate debiasing, our proposed estimate is asymptotically normal even for fairly weak signals. Our work is motivated by recent research on the Tick Size Pilot Program, an experiment conducted by the Security and Exchange Commission (SEC) to evaluate the impact of widening the tick size on the market quality of stocks from 2016 to 2018. While previous studies were based on traditional regression or difference-in-difference methods by assuming that the treatment effect is invariant with respect to time and unit, our analyses suggest significant heterogeneity across units and intriguing dynamics over time during the pilot program. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.
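As background, here is a hedged sketch of the standard soft-impute iteration for nuclear-norm-regularized matrix completion, on which such procedures build; it is not the paper's grouped, debiased estimator, and the panel size, rank, and regularization level lam are illustrative.

```python
# Soft-impute sketch: iteratively SVD-soft-threshold the filled-in matrix.
import numpy as np

rng = np.random.default_rng(5)
n, m, r = 60, 40, 3
M = rng.normal(size=(n, r)) @ rng.normal(size=(r, m))   # low-rank panel
mask = rng.random((n, m)) < 0.85                        # observed entries

def soft_impute(M, mask, lam=1.0, iters=200):
    X = np.where(mask, M, 0.0)
    for _ in range(iters):
        # Keep observed entries, use current estimate for missing ones.
        U, s, Vt = np.linalg.svd(np.where(mask, M, X), full_matrices=False)
        s = np.maximum(s - lam, 0.0)                    # soft-threshold spectrum
        X = (U * s) @ Vt
    return X

X_hat = soft_impute(M, mask)
err = X_hat[~mask] - M[~mask]
print("RMSE on missing entries:", np.sqrt(np.mean(err**2)))
```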
Latent trait shared-parameter mixed models (LTSPMMs) are developed for ecological momentary assessment (EMA) data containing missing values, where data are collected in an intermittent manner. In such studies, data are often missing due to unanswered prompts. Using item response theory (IRT) models, a latent trait is used to represent the missing prompts and is modeled jointly with a mixed model for bivariate longitudinal outcomes. Both one- and two-parameter LTSPMMs are presented. These new models offer a unique way to analyze missing EMA data with many response patterns. Here, the proposed models represent missingness via a latent trait that corresponds to the students' "ability" to respond to the prompting device. Data containing more than 10,300 observations from an EMA study involving high-school students' positive and negative affect are presented. The latent trait representing missingness was a significant predictor of both positive affect and negative affect outcomes. The models are compared to a missing at random (MAR) mixed model. A simulation study indicates that the proposed models can provide lower bias and increased efficiency compared to the standard MAR approach commonly used with intermittently missing longitudinal data.