100+ datasets found

Understanding and Managing Missing Data.pdf
figshare.com
pdf
Updated Jun 9, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ibrahim Denis Fofanah (2025). Understanding and Managing Missing Data.pdf [Dataset]. http://doi.org/10.6084/m9.figshare.29265155.v1
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.29265155.v1
Dataset updated
Jun 9, 2025
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Ibrahim Denis Fofanah
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This document provides a clear and practical guide to understanding missing data mechanisms, including Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Through real-world scenarios and examples, it explains how different types of missingness impact data analysis and decision-making. It also outlines common strategies for handling missing data, including deletion techniques and imputation methods such as mean imputation, regression, and stochastic modeling.Designed for researchers, analysts, and students working with real-world datasets, this guide helps ensure statistical validity, reduce bias, and improve the overall quality of analysis in fields like public health, behavioral science, social research, and machine learning.
d
Replication Data for: The MIDAS Touch: Accurate and Scalable Missing-Data...
search.dataone.org
dataverse.harvard.edu
Updated Nov 23, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lall, Ranjit; Robinson, Thomas (2023). Replication Data for: The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning [Dataset]. http://doi.org/10.7910/DVN/UPL4TT
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/UPL4TT
Dataset updated
Nov 23, 2023
Dataset provided by
Harvard Dataverse
Authors
Lall, Ranjit; Robinson, Thomas
Description
Replication and simulation reproduction materials for the article "The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning." Please see the README file for a summary of the contents and the Replication Guide for a more detailed description. Article abstract: Principled methods for analyzing missing values, based chiefly on multiple imputation, have become increasingly popular yet can struggle to handle the kinds of large and complex data that are also becoming common. We propose an accurate, fast, and scalable approach to multiple imputation, which we call MIDAS (Multiple Imputation with Denoising Autoencoders). MIDAS employs a class of unsupervised neural networks known as denoising autoencoders, which are designed to reduce dimensionality by corrupting and attempting to reconstruct a subset of data. We repurpose denoising autoencoders for multiple imputation by treating missing values as an additional portion of corrupted data and drawing imputations from a model trained to minimize the reconstruction error on the originally observed portion. Systematic tests on simulated as well as real social science data, together with an applied example involving a large-scale electoral survey, illustrate MIDAS's accuracy and efficiency across a range of settings. We provide open-source software for implementing MIDAS.
Z
Missing data in the analysis of multilevel and dependent data (Examples)
data.niaid.nih.gov
Updated Jul 20, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Simon Grund; Oliver Lüdtke; Alexander Robitzsch (2023). Missing data in the analysis of multilevel and dependent data (Examples) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7773613
Explore at:
Dataset updated
Jul 20, 2023
Dataset provided by
University of Hamburg
IPN - Leibniz Institute for Science and Mathematics Education
Authors
Simon Grund; Oliver Lüdtke; Alexander Robitzsch
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Example data sets and computer code for the book chapter titled "Missing Data in the Analysis of Multilevel and Dependent Data" submitted for publication in the second edition of "Dependent Data in Social Science Research" (Stemmler et al., 2015). This repository includes the computer code (".R") and the data sets from both example analyses (Examples 1 and 2). The data sets are available in two file formats (binary ".rda" for use in R; plain-text ".dat").

The data sets contain simulated data from 23,376 (Example 1) and 23,072 (Example 2) individuals from 2,000 groups on four variables:

ID = group identifier (1-2000) x = numeric (Level 1) y = numeric (Level 1) w = binary (Level 2)

In all data sets, missing values are coded as "NA".
d
Data from: Problems in dealing with missing data and informative censoring...
catalog.data.gov
data.virginia.gov
Updated Sep 7, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Institutes of Health (2025). Problems in dealing with missing data and informative censoring in clinical trials [Dataset]. https://catalog.data.gov/dataset/problems-in-dealing-with-missing-data-and-informative-censoring-in-clinical-trials
Explore at:
Dataset updated
Sep 7, 2025
Dataset provided by
National Institutes of Health
Description
A common problem in clinical trials is the missing data that occurs when patients do not complete the study and drop out without further measurements. Missing data cause the usual statistical analysis of complete or all available data to be subject to bias. There are no universally applicable methods for handling missing data. We recommend the following: (1) Report reasons for dropouts and proportions for each treatment group; (2) Conduct sensitivity analyses to encompass different scenarios of assumptions and discuss consistency or discrepancy among them; (3) Pay attention to minimize the chance of dropouts at the design stage and during trial monitoring; (4) Collect post-dropout data on the primary endpoints, if at all possible; and (5) Consider the dropout event itself an important endpoint in studies with many.
Spaceship Titanic | No missing values
kaggle.com
zip
Updated Mar 12, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sardor Abdirayimov (2022). Spaceship Titanic | No missing values [Dataset]. https://www.kaggle.com/datasets/sardorabdirayimov/spaceship-titanic-no-missing-values
Explore at:
zip(284931 bytes)Available download formats
Dataset updated
Mar 12, 2022
Authors
Sardor Abdirayimov
Description
Context

Dataset is final solution for dealing with missing values in the Spaceship Titanic competition. Kaggle Notebook: https://www.kaggle.com/sardorabdirayimov/best-way-of-dealing-with-missing-values-titanic-2/
Methods for Handling Missing Item Values in Regression Models Using the...
catalog.data.gov
data.virginia.gov
+1more
Updated Sep 7, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Substance Abuse and Mental Health Services Administration (2025). Methods for Handling Missing Item Values in Regression Models Using the National Survey on Drug Use and Health (NSDUH) [Dataset]. https://catalog.data.gov/dataset/methods-for-handling-missing-item-values-in-regression-models-using-the-national-survey-on
Explore at:
Dataset updated
Sep 7, 2025
Dataset provided by
Substance Abuse and Mental Health Services Administrationhttps://www.samhsa.gov/
Description
The purpose of this report is to guide analysts interested in fitting regression models using data from the National Survey on Drug Use and Health (NSDUH) by providing them with methods for handling missing item values in regression analyses (MIVRA). The report includes a theoretical review of existing MIVRA methods, a simulation study that evaluates several of the more promising methods using existing NSDUH datasets, and a final chapter where the results of both the theoretical review and the simulation study are synthesized into guidance for analysts via decision trees.
a guide to handle missing values for ML Model
kaggle.com
zip
Updated Feb 10, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Feroz Shinwari (2025). a guide to handle missing values for ML Model [Dataset]. https://www.kaggle.com/datasets/ferozshahshinwari/a-guide-to-handle-missing-values-for-ml-model/code
Explore at:
zip(36646 bytes)Available download formats
Dataset updated
Feb 10, 2025
Authors
Feroz Shinwari
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset

This dataset was created by Feroz Shinwari

Released under Apache 2.0

Contents
n
Data from: Missing data estimation in morphometrics: how much is too much?
data.niaid.nih.gov
datadryad.org
+1more
zip
Updated Dec 5, 2013
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Julien Clavel; Gildas Merceron; Gilles Escarguel (2013). Missing data estimation in morphometrics: how much is too much? [Dataset]. http://doi.org/10.5061/dryad.f0b50
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.f0b50
Dataset updated
Dec 5, 2013
Dataset provided by
Centre National de la Recherche Scientifique
Authors
Julien Clavel; Gildas Merceron; Gilles Escarguel
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
Fossil-based estimates of diversity and evolutionary dynamics mainly rely on the study of morphological variation. Unfortunately, organism remains are often altered by post-mortem taphonomic processes such as weathering or distortion. Such a loss of information often prevents quantitative multivariate description and statistically controlled comparisons of extinct species based on morphometric data. A common way to deal with missing data involves imputation methods that directly fill the missing cases with model estimates. Over the last several years, several empirically determined thresholds for the maximum acceptable proportion of missing values have been proposed in the literature, whereas other studies showed that this limit actually depends on several properties of the study dataset and of the selected imputation method, and is by no way generalizable. We evaluate the relative performances of seven multiple imputation techniques through a simulation-based analysis under three distinct patterns of missing data distribution. Overall, Fully Conditional Specification and Expectation-Maximization algorithms provide the best compromises between imputation accuracy and coverage probability. Multiple imputation (MI) techniques appear remarkably robust to the violation of basic assumptions such as the occurrence of taxonomically or anatomically biased patterns of missing data distribution, making differences in simulation results between the three patterns of missing data distribution much smaller than differences between the individual MI techniques. Based on these results, rather than proposing a new (set of) threshold value(s), we develop an approach combining the use of multiple imputations with procrustean superimposition of principal component analysis results, in order to directly visualize the effect of individual missing data imputation on an ordinated space. We provide an R function for users to implement the proposed procedure.
S
Deep learning based Missing Data Imputation
scidb.cn
Updated Mar 4, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mahjabeen Tahir (2024). Deep learning based Missing Data Imputation [Dataset]. http://doi.org/10.57760/sciencedb.16599
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.57760/sciencedb.16599
Dataset updated
Mar 4, 2024
Dataset provided by
Science Data Bank
Authors
Mahjabeen Tahir
Description
The code provided is related to training an autoencoder, evaluating its performance, and using it for imputing missing values in a dataset. Let's break down each part:Training the Autoencoder (train_autoencoder function):This function takes an autoencoder model and the input features as input.It trains the autoencoder using the input features as both input and target output (hence features, features).The autoencoder is trained for a specified number of epochs (epochs) with a given batch size (batch_size).The shuffle=True argument ensures that the data is shuffled before each epoch to prevent the model from memorizing the input order.After training, it returns the trained autoencoder model and the training history.Evaluating the Autoencoder (evaluate_autoencoder function):This function takes a trained autoencoder model and the input features as input.It uses the trained autoencoder to predict the reconstructed features from the input features.It calculates Mean Squared Error (MSE), Mean Absolute Error (MAE), and R-squared (R2) scores between the original and reconstructed features.These metrics provide insights into how well the autoencoder is able to reconstruct the input features.Imputing with the Autoencoder (impute_with_autoencoder function):This function takes a trained autoencoder model and the input features as input.It identifies missing values (e.g., -9999) in the input features.For each row with missing values, it predicts the missing values using the trained autoencoder.It replaces the missing values with the predicted values.The imputed features are returned as output.To reuse this code:Load your dataset and preprocess it as necessary.Build an autoencoder model using the build_autoencoder function.Train the autoencoder using the train_autoencoder function with your input features.Evaluate the performance of the autoencoder using the evaluate_autoencoder function.If your dataset contains missing values, use the impute_with_autoencoder function to impute them with the trained autoencoder.Use the trained autoencoder for any other relevant tasks, such as feature extraction or anomaly detection.
Missing Values Data
kaggle.com
zip
Updated Aug 23, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ahmed F. ElTantawy (2024). Missing Values Data [Dataset]. https://www.kaggle.com/ahmedgt/missing-values-data
Explore at:
zip(442 bytes)Available download formats
Dataset updated
Aug 23, 2024
Authors
Ahmed F. ElTantawy
License
https://www.licenses.ai/ai-licenseshttps://www.licenses.ai/ai-licenses
Description
Dataset

This dataset was created by Ahmed F. ElTantawy

Released under RAIL (specified in description)

Contents
Data from: Imputation of Missing Covariates in Randomized Controlled Trials...
tandf.figshare.com
docx
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mutamba T. Kayembe; Shahab Jolani; Frans E.S. Tan; Gerard J.P. van Breukelen (2023). Imputation of Missing Covariates in Randomized Controlled Trials with Continuous Outcomes: Simple, Unbiased and Efficient Methods [Dataset]. http://doi.org/10.6084/m9.figshare.18637732.v1
Explore at:
docxAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.18637732.v1
Dataset updated
May 31, 2023
Dataset provided by
Taylor & Francishttps://taylorandfrancis.com/
Authors
Mutamba T. Kayembe; Shahab Jolani; Frans E.S. Tan; Gerard J.P. van Breukelen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The literature on dealing with missing covariates in nonrandomized studies advocates the use of sophisticated methods like multiple imputation (MI) and maximum likelihood (ML)-based approaches over simple methods. However, these methods are not necessarily optimal in terms of bias and efficiency of treatment effect estimation in randomized studies, where the covariate of interest (treatment group) is independent of all baseline (pre-randomization) covariates due to randomization. This has been shown in the literature, but only for missingness on a single baseline covariate. Here, we extend the situation to multiple baseline covariates with missingness and evaluate the performance of MI and ML compared with simple alternative methods under various missingness scenarios in RCTs with a quantitative outcome. We first derive asymptotic relative efficiencies of the simple methods under the missing completely at random (MCAR) scenario and then perform a simulation study for non-MCAR scenarios. Finally, a trial on chronic low back pain is used to illustrate the implementation of the methods. The results show that all simple methods give unbiased treatment effect estimation but with increased mean squared residual. It also turns out that mean imputation and the missing-indicator method are most efficient under all covariate missingness scenarios and perform at least as well as MI and LM in each scenario.
Z
NN5 Daily Dataset (without Missing Values)
data-staging.niaid.nih.gov
data.niaid.nih.gov
+1more
Updated Apr 1, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Godahewa, Rakshitha; Bergmeir, Christoph; Webb, Geoff (2021). NN5 Daily Dataset (without Missing Values) [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_3889739
Explore at:
Dataset updated
Apr 1, 2021
Dataset provided by
Professor at Monash University
PhD Student at Monash University
Lecturer at Monash University
Authors
Godahewa, Rakshitha; Bergmeir, Christoph; Webb, Geoff
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset was used in the NN5 forecasting competition. It contains 111 time series from the banking domain. The goal is predicting the daily cash withdrawals from ATMs in UK.

The original dataset contains missing values. A missing value on a particular day is replaced by the median across all the same days of the week along the whole series.
n
Data from: Using multiple imputation to estimate missing data in...
narcis.nl
data.niaid.nih.gov
+1more
Updated Dec 10, 2014
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ellington, E. Hance; Bastille-Rousseau, Guillaume; Austin, Cayla; Landolt, Kristen N.; Pond, Bruce A.; Rees, Erin E.; Robar, Nicholas; Murray, Dennis L. (2014). Data from: Using multiple imputation to estimate missing data in meta-regression [Dataset]. http://doi.org/10.5061/dryad.m2v4m
Explore at:
Unique identifier
https://doi.org/10.5061/dryad.m2v4m
Dataset updated
Dec 10, 2014
Dataset provided by
Data Archiving and Networked Services (DANS)
Authors
Ellington, E. Hance; Bastille-Rousseau, Guillaume; Austin, Cayla; Landolt, Kristen N.; Pond, Bruce A.; Rees, Erin E.; Robar, Nicholas; Murray, Dennis L.
Description
There is a growing need for scientific synthesis in ecology and evolution. In many cases, meta-analytic techniques can be used to complement such synthesis. However, missing data is a serious problem for any synthetic efforts and can compromise the integrity of meta-analyses in these and other disciplines. Currently, the prevalence of missing data in meta-analytic datasets in ecology and the efficacy of different remedies for this problem have not been adequately quantified. 2. We generated meta-analytic datasets based on literature reviews of experimental and observational data and found that missing data were prevalent in meta-analytic ecological datasets. We then tested the performance of complete case removal (a widely used method when data are missing) and multiple imputation (an alternative method for data recovery) and assessed model bias, precision, and multi-model rankings under a variety of simulated conditions using published meta-regression datasets. 3. We found that complete case removal led to biased and imprecise coefficient estimates and yielded poorly specified models. In contrast, multiple imputation provided unbiased parameter estimates with only a small loss in precision. The performance of multiple imputation, however, was dependent on the type of data missing. It performed best when missing values were weighting variables, but performance was mixed when missing values were predictor variables. Multiple imputation performed poorly when imputing raw data which was then used to calculate effect size and the weighting variable. 4. We conclude that complete case removal should not be used in meta-regression, and that multiple imputation has the potential to be an indispensable tool for meta-regression in ecology and evolution. However, we recommend that users assess the performance of multiple imputation by simulating missing data on a subset of their data before implementing it to recover actual missing data.
c
Bitcoin Dataset without Missing Values - Dataset - CryptoData Hub
cryptodata.center
Updated Dec 4, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). Bitcoin Dataset without Missing Values - Dataset - CryptoData Hub [Dataset]. https://cryptodata.center/dataset/bitcoin-dataset-without-missing-values
Explore at:
Dataset updated
Dec 4, 2024
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains the potential influencers of the bitcoin price. There are a total of 18 daily time series including hash rate, block size, mining difficulty etc. It also encompasses public opinion in the form of tweets and google searches mentioning the keyword bitcoin. The data is scraped from the interactive web-graphs available at https://bitinfocharts.com. The original dataset contains missing values and they have been replaced by carrying forward the corresponding last seen observations (LOCF method).
f
Data from: Hybrid imputation of missing values using KNN on MEWMA-based...
tandf.figshare.com
png
Updated Nov 21, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yijun Jiang; Tingting He; Miaomiao Yu; Yong Zhou (2025). Hybrid imputation of missing values using KNN on MEWMA-based adaptive process control [Dataset]. http://doi.org/10.6084/m9.figshare.30675585.v1
Explore at:
pngAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.30675585.v1
Dataset updated
Nov 21, 2025
Dataset provided by
Taylor & Francis
Authors
Yijun Jiang; Tingting He; Miaomiao Yu; Yong Zhou
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Missing data, a common issue in production processes due to factors like sample contamination and equipment malfunctions, can lead to a decrease in the recognition accuracy of control charts, especially in cases of shifting. To address this, we introduce an online adaptive weighted imputation technique that combines the strengths of K-Nearest Neighbor (KNN) and Exponentially Weighted Moving Average (EWMA) imputations. It utilizes an adaptive weight matrix for weighting both methods and an adaptive covariance matrix to optimize for missing structures. When dealing with data fluctuation, we assign a higher weight to the KNN method for its sensitivity, while the EWMA method is preferred for stationary data. This approach does not require data stacking; thus, the imputation process for missing data is conducted online. Consequently, based on the online Multivariate EWMA (MEWMA) control chart, real-time process monitoring can be achieved. To optimize the use of available information, we also adjust the covariance matrix with a weight matrix to emphasize complete data. The proposed technique outperforms traditional methods in performance monitoring by avoiding false alarms and quickly detecting anomalies during process shifts.
f
R scripts used for Monte Carlo simulations and data analyses.
plos.figshare.com
zip
Updated Jan 19, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lateef Babatunde Amusa; Twinomurinzi Hossana (2024). R scripts used for Monte Carlo simulations and data analyses. [Dataset]. http://doi.org/10.1371/journal.pone.0297037.s001
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0297037.s001
Dataset updated
Jan 19, 2024
Dataset provided by
PLOS ONE
Authors
Lateef Babatunde Amusa; Twinomurinzi Hossana
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
R scripts used for Monte Carlo simulations and data analyses.
f
Results of model performance in handling missing data.
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Sep 9, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bradley, Alison; Van der Meer, Robert; McKay, Colin J. (2019). Results of model performance in handling missing data. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000098498
Explore at:
Dataset updated
Sep 9, 2019
Authors
Bradley, Alison; Van der Meer, Robert; McKay, Colin J.
Description
Results of model performance in handling missing data.
d
Replication Data for: A GMM Approach for Dealing with Missing Data on...
search.dataone.org
dataverse.harvard.edu
Updated Nov 21, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Donald, Stephen; Abrevaya, Jason (2023). Replication Data for: A GMM Approach for Dealing with Missing Data on Regressors [Dataset]. http://doi.org/10.7910/DVN/JMWMWW
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/JMWMWW
Dataset updated
Nov 21, 2023
Dataset provided by
Harvard Dataverse
Authors
Donald, Stephen; Abrevaya, Jason
Description
Replication Data for: A GMM Approach for Dealing with Missing Data on Regressors
Fill The Cell that has Missing Values
kaggle.com
zip
Updated Feb 28, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
CEMİL BAYHAN (2021). Fill The Cell that has Missing Values [Dataset]. https://www.kaggle.com/cemlbayhan/fill-cell-has-missing-values
Explore at:
zip(2469 bytes)Available download formats
Dataset updated
Feb 28, 2021
Authors
CEMİL BAYHAN
Description
Dataset

This dataset was created by CEMİL BAYHAN

Contents
S
Prediction of radionuclide diffusion enabled by missing data imputation and...
scidb.cn
Updated May 6, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jun-Lei Tian; Jia-Xing Feng; Jia-Cong Shen; Lei Yao; Jing-Yan Wang; Tao Wu; Yao-Lin Zhao (2025). Prediction of radionuclide diffusion enabled by missing data imputation and ensemble machine learning [Dataset]. http://doi.org/10.57760/sciencedb.j00186.00710
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.57760/sciencedb.j00186.00710
Dataset updated
May 6, 2025
Dataset provided by
Science Data Bank
Authors
Jun-Lei Tian; Jia-Xing Feng; Jia-Cong Shen; Lei Yao; Jing-Yan Wang; Tao Wu; Yao-Lin Zhao
Description
Missing values in radionuclide diffusion datasets can undermine the predictive accuracy and robustness of machine learning models. A regression-based missing data imputation method using light gradient boosting machine algorithm was employed to impute over 60% of the missing data.

Facebook

Twitter

Click to copy link

Link copied

Cite

Ibrahim Denis Fofanah (2025). Understanding and Managing Missing Data.pdf [Dataset]. http://doi.org/10.6084/m9.figshare.29265155.v1

Understanding and Managing Missing Data.pdf

Explore at:

pdfAvailable download formats

Unique identifier

https://doi.org/10.6084/m9.figshare.29265155.v1

Dataset updated

Jun 9, 2025

Dataset provided by

figshare
Figsharehttp://figshare.com/

Authors

Ibrahim Denis Fofanah

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This document provides a clear and practical guide to understanding missing data mechanisms, including Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Through real-world scenarios and examples, it explains how different types of missingness impact data analysis and decision-making. It also outlines common strategies for handling missing data, including deletion techniques and imputation methods such as mean imputation, regression, and stochastic modeling.Designed for researchers, analysts, and students working with real-world datasets, this guide helps ensure statistical validity, reduce bias, and improve the overall quality of analysis in fields like public health, behavioral science, social research, and machine learning.

Clear search

Close search

Google apps

Main menu

Understanding and Managing Missing Data.pdf

Replication Data for: The MIDAS Touch: Accurate and Scalable Missing-Data...

Missing data in the analysis of multilevel and dependent data (Examples)

Data from: Problems in dealing with missing data and informative censoring...

Spaceship Titanic | No missing values

Context

Methods for Handling Missing Item Values in Regression Models Using the...

a guide to handle missing values for ML Model

Dataset

Contents

Data from: Missing data estimation in morphometrics: how much is too much?

Deep learning based Missing Data Imputation

Missing Values Data

Dataset

Contents

Data from: Imputation of Missing Covariates in Randomized Controlled Trials...

NN5 Daily Dataset (without Missing Values)

Data from: Using multiple imputation to estimate missing data in...

Bitcoin Dataset without Missing Values - Dataset - CryptoData Hub

Data from: Hybrid imputation of missing values using KNN on MEWMA-based...

R scripts used for Monte Carlo simulations and data analyses.

Results of model performance in handling missing data.

Replication Data for: A GMM Approach for Dealing with Missing Data on...

Fill The Cell that has Missing Values

Dataset

Contents

Prediction of radionuclide diffusion enabled by missing data imputation and...

Understanding and Managing Missing Data.pdf