97 datasets found

Understanding and Managing Missing Data.pdf
figshare.com
pdf
Updated Jun 9, 2025
Cite
Ibrahim Denis Fofanah (2025). Understanding and Managing Missing Data.pdf [Dataset]. http://doi.org/10.6084/m9.figshare.29265155.v1
Available download formats: pdf
Unique identifier
https://doi.org/10.6084/m9.figshare.29265155.v1
Dataset updated
Jun 9, 2025
Dataset provided by
Figshare (http://figshare.com/)
Authors
Ibrahim Denis Fofanah
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This document provides a clear and practical guide to understanding missing data mechanisms, including Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Through real-world scenarios and examples, it explains how different types of missingness impact data analysis and decision-making. It also outlines common strategies for handling missing data, including deletion techniques and imputation methods such as mean imputation, regression, and stochastic modeling. Designed for researchers, analysts, and students working with real-world datasets, this guide helps ensure statistical validity, reduce bias, and improve the overall quality of analysis in fields like public health, behavioral science, social research, and machine learning.
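As a rough illustration of the three mechanisms named in this description, the sketch below simulates a variable y whose missingness is unrelated to the data (MCAR), depends on an observed covariate x (MAR), or depends on y itself (MNAR). It is not taken from the guide; the function name, the missingness probabilities, and the data-generating model are all assumptions chosen for illustration.

```python
import random
import statistics


def simulate(mechanism, n=20000, seed=0):
    """Return (true_mean, observed_mean) of y under one missingness mechanism.

    Hypothetical setup: x is always observed, y = x + noise may be missing.
    """
    rng = random.Random(seed)
    all_y, observed_y = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)        # fully observed covariate
        y = x + rng.gauss(0, 1)    # variable subject to missingness
        if mechanism == "MCAR":
            p_missing = 0.3                      # unrelated to x or y
        elif mechanism == "MAR":
            p_missing = 0.6 if x > 0 else 0.0    # depends only on observed x
        elif mechanism == "MNAR":
            p_missing = 0.6 if y > 0 else 0.0    # depends on the unseen y itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        all_y.append(y)
        if rng.random() >= p_missing:
            observed_y.append(y)
    return statistics.fmean(all_y), statistics.fmean(observed_y)


for name in ("MCAR", "MAR", "MNAR"):
    true_mean, observed_mean = simulate(name)
    print(f"{name}: true mean {true_mean:.3f}, complete-case mean {observed_mean:.3f}")
```

Under these assumptions the complete-case mean stays close to the true mean only for MCAR. Filling the gaps with the observed mean would reproduce the same bias under MAR and MNAR, which is one reason regression-based or stochastic imputation is often preferred.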
Replication Data for: The MIDAS Touch: Accurate and Scalable Missing-Data...
search.dataone.org
dataverse.harvard.edu
Updated Nov 23, 2023
Cite
Lall, Ranjit; Robinson, Thomas (2023). Replication Data for: The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning [Dataset]. http://doi.org/10.7910/DVN/UPL4TT
Unique identifier
https://doi.org/10.7910/DVN/UPL4TT
Dataset updated
Nov 23, 2023
Dataset provided by
Harvard Dataverse
Authors
Lall, Ranjit; Robinson, Thomas
Description
Replication and simulation reproduction materials for the article "The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning." Please see the README file for a summary of the contents and the Replication Guide for a more detailed description. Article abstract: Principled methods for analyzing missing values, based chiefly on multiple imputation, have become increasingly popular yet can struggle to handle the kinds of large and complex data that are also becoming common. We propose an accurate, fast, and scalable approach to multiple imputation, which we call MIDAS (Multiple Imputation with Denoising Autoencoders). MIDAS employs a class of unsupervised neural networks known as denoising autoencoders, which are designed to reduce dimensionality by corrupting and attempting to reconstruct a subset of data. We repurpose denoising autoencoders for multiple imputation by treating missing values as an additional portion of corrupted data and drawing imputations from a model trained to minimize the reconstruction error on the originally observed portion. Systematic tests on simulated as well as real social science data, together with an applied example involving a large-scale electoral survey, illustrate MIDAS's accuracy and efficiency across a range of settings. We provide open-source software for implementing MIDAS.
Missing data in the analysis of multilevel and dependent data (Examples)
data.niaid.nih.gov
Updated Jul 20, 2023
Cite
Simon Grund; Oliver Lüdtke; Alexander Robitzsch (2023). Missing data in the analysis of multilevel and dependent data (Examples) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7773613
Dataset updated
Jul 20, 2023
Dataset provided by
University of Hamburg
IPN - Leibniz Institute for Science and Mathematics Education
Authors
Simon Grund; Oliver Lüdtke; Alexander Robitzsch
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Example data sets and computer code for the book chapter titled "Missing Data in the Analysis of Multilevel and Dependent Data" submitted for publication in the second edition of "Dependent Data in Social Science Research" (Stemmler et al., 2015). This repository includes the computer code (".R") and the data sets from both example analyses (Examples 1 and 2). The data sets are available in two file formats (binary ".rda" for use in R; plain-text ".dat").

The data sets contain simulated data from 23,376 (Example 1) and 23,072 (Example 2) individuals from 2,000 groups on four variables:

ID = group identifier (1-2000)
x = numeric (Level 1)
y = numeric (Level 1)
w = binary (Level 2)

In all data sets, missing values are coded as "NA".
Data from: Problems in dealing with missing data and informative censoring...
catalog.data.gov
data.virginia.gov
Updated Sep 7, 2025
Cite
National Institutes of Health (2025). Problems in dealing with missing data and informative censoring in clinical trials [Dataset]. https://catalog.data.gov/dataset/problems-in-dealing-with-missing-data-and-informative-censoring-in-clinical-trials
Dataset updated
Sep 7, 2025
Dataset provided by
National Institutes of Health
Description
A common problem in clinical trials is the missing data that occurs when patients do not complete the study and drop out without further measurements. Missing data cause the usual statistical analysis of complete or all available data to be subject to bias. There are no universally applicable methods for handling missing data. We recommend the following: (1) Report reasons for dropouts and proportions for each treatment group; (2) Conduct sensitivity analyses to encompass different scenarios of assumptions and discuss consistency or discrepancy among them; (3) Pay attention to minimize the chance of dropouts at the design stage and during trial monitoring; (4) Collect post-dropout data on the primary endpoints, if at all possible; and (5) Consider the dropout event itself an important endpoint in studies with many.
Spaceship Titanic | No missing values
kaggle.com
zip
Updated Mar 12, 2022
Cite
Sardor Abdirayimov (2022). Spaceship Titanic | No missing values [Dataset]. https://www.kaggle.com/datasets/sardorabdirayimov/spaceship-titanic-no-missing-values
Available download formats: zip (284931 bytes)
Dataset updated
Mar 12, 2022
Authors
Sardor Abdirayimov
Description
Context

This dataset is the final solution for dealing with missing values in the Spaceship Titanic competition. Kaggle Notebook: https://www.kaggle.com/sardorabdirayimov/best-way-of-dealing-with-missing-values-titanic-2/
Methods for Handling Missing Item Values in Regression Models Using the...
catalog.data.gov
data.virginia.gov
Updated Sep 7, 2025
Cite
Substance Abuse and Mental Health Services Administration (2025). Methods for Handling Missing Item Values in Regression Models Using the National Survey on Drug Use and Health (NSDUH) [Dataset]. https://catalog.data.gov/dataset/methods-for-handling-missing-item-values-in-regression-models-using-the-national-survey-on
Dataset updated
Sep 7, 2025
Dataset provided by
Substance Abuse and Mental Health Services Administration (https://www.samhsa.gov/)
Description
The purpose of this report is to guide analysts interested in fitting regression models using data from the National Survey on Drug Use and Health (NSDUH) by providing them with methods for handling missing item values in regression analyses (MIVRA). The report includes a theoretical review of existing MIVRA methods, a simulation study that evaluates several of the more promising methods using existing NSDUH datasets, and a final chapter where the results of both the theoretical review and the simulation study are synthesized into guidance for analysts via decision trees.
a guide to handle missing values for ML Model
kaggle.com
zip
Updated Feb 10, 2025
Cite
Feroz Shinwari (2025). a guide to handle missing values for ML Model [Dataset]. https://www.kaggle.com/datasets/ferozshahshinwari/a-guide-to-handle-missing-values-for-ml-model/code
Available download formats: zip (36646 bytes)
Dataset updated
Feb 10, 2025
Authors
Feroz Shinwari
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset

This dataset was created by Feroz Shinwari

Released under Apache 2.0

Contents
Deep learning based Missing Data Imputation
scidb.cn
Updated Mar 4, 2024
Cite
Mahjabeen Tahir (2024). Deep learning based Missing Data Imputation [Dataset]. http://doi.org/10.57760/sciencedb.16599
Available formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Unique identifier
https://doi.org/10.57760/sciencedb.16599
Dataset updated
Mar 4, 2024
Dataset provided by
Science Data Bank
Authors
Mahjabeen Tahir
Description
The code provided is related to training an autoencoder, evaluating its performance, and using it for imputing missing values in a dataset. Let's break down each part:

Training the Autoencoder (train_autoencoder function):
This function takes an autoencoder model and the input features as input. It trains the autoencoder using the input features as both input and target output (hence features, features). The autoencoder is trained for a specified number of epochs (epochs) with a given batch size (batch_size). The shuffle=True argument ensures that the data is shuffled before each epoch to prevent the model from memorizing the input order. After training, it returns the trained autoencoder model and the training history.

Evaluating the Autoencoder (evaluate_autoencoder function):
This function takes a trained autoencoder model and the input features as input. It uses the trained autoencoder to predict the reconstructed features from the input features. It calculates Mean Squared Error (MSE), Mean Absolute Error (MAE), and R-squared (R2) scores between the original and reconstructed features. These metrics provide insights into how well the autoencoder is able to reconstruct the input features.

Imputing with the Autoencoder (impute_with_autoencoder function):
This function takes a trained autoencoder model and the input features as input. It identifies missing values (e.g., -9999) in the input features. For each row with missing values, it predicts the missing values using the trained autoencoder. It replaces the missing values with the predicted values. The imputed features are returned as output.

To reuse this code:
1. Load your dataset and preprocess it as necessary.
2. Build an autoencoder model using the build_autoencoder function.
3. Train the autoencoder using the train_autoencoder function with your input features.
4. Evaluate the performance of the autoencoder using the evaluate_autoencoder function.
5. If your dataset contains missing values, use the impute_with_autoencoder function to impute them with the trained autoencoder.
6. Use the trained autoencoder for any other relevant tasks, such as feature extraction or anomaly detection.
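The dataset's own code is not reproduced in this listing, so the following is only a sketch of the imputation step as the description words it: find sentinel-coded entries, get a reconstruction for each affected row, and overwrite only the missing positions. The function names are hypothetical, and the `reconstruct` argument is a stand-in for a trained autoencoder's prediction call, which is assumed rather than shown.

```python
SENTINEL = -9999  # missing-value code given as an example in the description


def column_means(rows, sentinel=SENTINEL):
    """Mean of each column, ignoring sentinel-coded entries."""
    means = []
    for column in zip(*rows):
        seen = [v for v in column if v != sentinel]
        means.append(sum(seen) / len(seen) if seen else 0.0)
    return means


def impute_with_model(rows, reconstruct, sentinel=SENTINEL):
    """Replace sentinel entries with values from reconstruct(row).

    `reconstruct` is a hypothetical callable standing in for a trained
    autoencoder; it maps one complete feature row to its reconstruction.
    """
    means = column_means(rows, sentinel)
    imputed = []
    for row in rows:
        mask = [v == sentinel for v in row]
        if not any(mask):
            imputed.append(list(row))
            continue
        # The model needs a complete input, so start from column means.
        filled = [means[j] if m else v for j, (v, m) in enumerate(zip(row, mask))]
        recon = reconstruct(filled)
        # Keep observed values; take the reconstruction only where data was missing.
        imputed.append([recon[j] if m else v for j, (v, m) in enumerate(zip(row, mask))])
    return imputed


rows = [[1.0, 2.0], [3.0, SENTINEL], [5.0, 6.0]]
print(impute_with_model(rows, reconstruct=lambda r: list(r)))
```

With the identity function as `reconstruct`, this reduces to column-mean imputation. Swapping in a real model's prediction function is what would make it autoencoder-based.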
Data from: Missing data estimation in morphometrics: how much is too much?
data.niaid.nih.gov
datadryad.org
zip
Updated Dec 5, 2013
Cite
Julien Clavel; Gildas Merceron; Gilles Escarguel (2013). Missing data estimation in morphometrics: how much is too much? [Dataset]. http://doi.org/10.5061/dryad.f0b50
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.f0b50
Dataset updated
Dec 5, 2013
Dataset provided by
Centre National de la Recherche Scientifique
Authors
Julien Clavel; Gildas Merceron; Gilles Escarguel
License
https://spdx.org/licenses/CC0-1.0.html
Description
Fossil-based estimates of diversity and evolutionary dynamics mainly rely on the study of morphological variation. Unfortunately, organism remains are often altered by post-mortem taphonomic processes such as weathering or distortion. Such a loss of information often prevents quantitative multivariate description and statistically controlled comparisons of extinct species based on morphometric data. A common way to deal with missing data involves imputation methods that directly fill the missing cases with model estimates. Over the last several years, several empirically determined thresholds for the maximum acceptable proportion of missing values have been proposed in the literature, whereas other studies showed that this limit actually depends on several properties of the study dataset and of the selected imputation method, and is by no way generalizable. We evaluate the relative performances of seven multiple imputation techniques through a simulation-based analysis under three distinct patterns of missing data distribution. Overall, Fully Conditional Specification and Expectation-Maximization algorithms provide the best compromises between imputation accuracy and coverage probability. Multiple imputation (MI) techniques appear remarkably robust to the violation of basic assumptions such as the occurrence of taxonomically or anatomically biased patterns of missing data distribution, making differences in simulation results between the three patterns of missing data distribution much smaller than differences between the individual MI techniques. Based on these results, rather than proposing a new (set of) threshold value(s), we develop an approach combining the use of multiple imputations with procrustean superimposition of principal component analysis results, in order to directly visualize the effect of individual missing data imputation on an ordinated space. We provide an R function for users to implement the proposed procedure.
Description of the dataset used in this study.
figshare.com
datasetcatalog.nlm.nih.gov
xls
Updated Jan 3, 2024
Cite
Turki Aljrees (2024). Description of the dataset used in this study. [Dataset]. http://doi.org/10.1371/journal.pone.0295632.t001
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0295632.t001
Dataset updated
Jan 3, 2024
Dataset provided by
PLOS ONE
Authors
Turki Aljrees
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Cervical cancer is a leading cause of women’s mortality, emphasizing the need for early diagnosis and effective treatment. In line with the imperative of early intervention, the automated identification of cervical cancer has emerged as a promising avenue, leveraging machine learning techniques to enhance both the speed and accuracy of diagnosis. However, an inherent challenge in the development of these automated systems is the presence of missing values in the datasets commonly used for cervical cancer detection. Missing data can significantly impact the performance of machine learning models, potentially leading to inaccurate or unreliable results. This study addresses a critical challenge in automated cervical cancer identification: handling missing data in datasets. The study presents a novel approach that combines three machine learning models into a stacked ensemble voting classifier, complemented by the use of a KNN Imputer to manage missing values. The proposed model achieves remarkable results with an accuracy of 0.9941, precision of 0.98, recall of 0.96, and an F1 score of 0.97. This study examines three distinct scenarios: one involving the deletion of missing values, another utilizing KNN imputation, and a third employing PCA for imputing missing values. This research has significant implications for the medical field, offering medical experts a powerful tool for more accurate cervical cancer therapy and enhancing the overall effectiveness of testing procedures. By addressing missing data challenges and achieving high accuracy, this work represents a valuable contribution to cervical cancer detection, ultimately aiming to reduce the impact of this disease on women’s health and healthcare systems.
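To make the KNN imputation idea concrete, here is a simplified pure-Python sketch. It is not the study's pipeline and not a library implementation: the function name, the use of None for missing values, and the distance (root mean squared difference over features both rows have observed) are assumptions made for illustration.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of that feature over the k nearest rows.

    Simplified sketch: distance uses only features observed in both rows,
    and donors must have the target feature observed.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        # Average over shared features so rows with fewer overlaps stay comparable.
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    result = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            nearest = [v for d, v in donors[:k] if d != math.inf]
            if nearest:  # leave the gap if no usable neighbour exists
                result[i][j] = sum(nearest) / len(nearest)
    return result


data = [[1.0, 10.0], [1.1, 12.0], [9.0, 50.0], [1.05, None]]
print(knn_impute(data, k=2))
```

In this toy input the last row borrows from the two rows closest to it on the first feature, so the distant third row does not influence the imputed value.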
Data from: Evaluating Supplemental Samples in Longitudinal Research:...
tandf.figshare.com
txt
Updated Feb 9, 2024
Cite
Laura K. Taylor; Xin Tong; Scott E. Maxwell (2024). Evaluating Supplemental Samples in Longitudinal Research: Replacement and Refreshment Approaches [Dataset]. http://doi.org/10.6084/m9.figshare.12162072.v1
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.12162072.v1
Dataset updated
Feb 9, 2024
Dataset provided by
Taylor & Francis (https://taylorandfrancis.com/)
Authors
Laura K. Taylor; Xin Tong; Scott E. Maxwell
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Despite the wide application of longitudinal studies, they are often plagued by missing data and attrition. The majority of methodological approaches focus on participant retention or modern missing data analysis procedures. This paper, however, takes a new approach by examining how researchers may supplement the sample with additional participants. First, refreshment samples use the same selection criteria as the initial study. Second, replacement samples identify auxiliary variables that may help explain patterns of missingness and select new participants based on those characteristics. A simulation study compares these two strategies for a linear growth model with five measurement occasions. Overall, the results suggest that refreshment samples lead to less relative bias, greater relative efficiency, and more acceptable coverage rates than replacement samples or not supplementing the missing participants in any way. Refreshment samples also have high statistical power. The comparative strengths of the refreshment approach are further illustrated through a real data example. These findings have implications for assessing change over time when researching at-risk samples with high levels of permanent attrition.
Results of model performance in handling missing data.
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Sep 9, 2019
Cite
Bradley, Alison; Van der Meer, Robert; McKay, Colin J. (2019). Results of model performance in handling missing data. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000098498
Dataset updated
Sep 9, 2019
Authors
Bradley, Alison; Van der Meer, Robert; McKay, Colin J.
Description
Results of model performance in handling missing data.
drug-reviews
huggingface.co
Cite
Mouwiya S. A. Al-Qaisieh, drug-reviews [Dataset]. https://huggingface.co/datasets/Mouwiya/drug-reviews
Available formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Authors
Mouwiya S. A. Al-Qaisieh
License
https://choosealicense.com/licenses/odbl/
Description
Dataset Details

1.Dataset Loading:

Initially, we load the Drug Review Dataset from the UC Irvine Machine Learning Repository. This dataset contains patient reviews of different drugs, along with the medical condition being treated and the patients' satisfaction ratings.

2.Data Preprocessing:

The dataset is preprocessed to ensure data integrity and consistency. We handle missing values and ensure that each patient ID is unique across the dataset.

3.Text… See the full description on the dataset page: https://huggingface.co/datasets/Mouwiya/drug-reviews.
Data from: Hybrid imputation of missing values using KNN on MEWMA-based...
tandf.figshare.com
png
Updated Nov 21, 2025
Cite
Yijun Jiang; Tingting He; Miaomiao Yu; Yong Zhou (2025). Hybrid imputation of missing values using KNN on MEWMA-based adaptive process control [Dataset]. http://doi.org/10.6084/m9.figshare.30675585.v1
Available download formats: png
Unique identifier
https://doi.org/10.6084/m9.figshare.30675585.v1
Dataset updated
Nov 21, 2025
Dataset provided by
Taylor & Francis
Authors
Yijun Jiang; Tingting He; Miaomiao Yu; Yong Zhou
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Missing data, a common issue in production processes due to factors like sample contamination and equipment malfunctions, can lead to a decrease in the recognition accuracy of control charts, especially in cases of shifting. To address this, we introduce an online adaptive weighted imputation technique that combines the strengths of K-Nearest Neighbor (KNN) and Exponentially Weighted Moving Average (EWMA) imputations. It utilizes an adaptive weight matrix for weighting both methods and an adaptive covariance matrix to optimize for missing structures. When dealing with data fluctuation, we assign a higher weight to the KNN method for its sensitivity, while the EWMA method is preferred for stationary data. This approach does not require data stacking; thus, the imputation process for missing data is conducted online. Consequently, based on the online Multivariate EWMA (MEWMA) control chart, real-time process monitoring can be achieved. To optimize the use of available information, we also adjust the covariance matrix with a weight matrix to emphasize complete data. The proposed technique outperforms traditional methods in performance monitoring by avoiding false alarms and quickly detecting anomalies during process shifts.
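The online character of the EWMA component can be sketched in a few lines. This covers only a univariate EWMA fill and is not the authors' method: the adaptive weight matrix, the KNN component, the covariance adjustment, and the MEWMA chart are not reproduced, and the function name and smoothing constant are assumptions.

```python
def ewma_impute(stream, alpha=0.3):
    """Replace None values in a sequence with the running EWMA estimate.

    The estimate is updated only from observed values, one at a time,
    so no batch of past data needs to be stored.
    """
    estimate = None
    output = []
    for value in stream:
        if value is None:
            # Stays None if nothing has been observed yet.
            output.append(estimate)
            continue
        if estimate is None:
            estimate = value
        else:
            estimate = alpha * value + (1 - alpha) * estimate
        output.append(value)
    return output


print(ewma_impute([10.0, None, 20.0, None], alpha=0.5))
```

A larger alpha makes the fill value track recent observations more closely, while a smaller alpha suits the stationary case the description associates with EWMA.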
Replication Data for: A GMM Approach for Dealing with Missing Data on...
search.dataone.org
dataverse.harvard.edu
Updated Nov 21, 2023
Cite
Donald, Stephen; Abrevaya, Jason (2023). Replication Data for: A GMM Approach for Dealing with Missing Data on Regressors [Dataset]. http://doi.org/10.7910/DVN/JMWMWW
Unique identifier
https://doi.org/10.7910/DVN/JMWMWW
Dataset updated
Nov 21, 2023
Dataset provided by
Harvard Dataverse
Authors
Donald, Stephen; Abrevaya, Jason
Description
Replication Data for: A GMM Approach for Dealing with Missing Data on Regressors
Cdd Dataset
universe.roboflow.com
zip
Updated Sep 5, 2023
Cite
hakuna matata (2023). Cdd Dataset [Dataset]. https://universe.roboflow.com/hakuna-matata/cdd-g8a6g/model/3
Available download formats: zip
Dataset updated
Sep 5, 2023
Dataset authored and provided by
hakuna matata
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Cucumber Disease Detection Bounding Boxes
Description
Project Documentation: Cucumber Disease Detection

Title and Introduction Title: Cucumber Disease Detection

Introduction: A machine learning model for the automatic detection of diseases in cucumber plants is to be developed as part of the "Cucumber Disease Detection" project. This research is crucial because it tackles the issue of early disease identification in agriculture, which can increase crop yield and cut down on financial losses. To train and test the model, we use a dataset of pictures of cucumber plants.

Problem Statement Problem Definition: The research uses image analysis methods to address the issue of automating the identification of diseases, including Downy Mildew, in cucumber plants. Effective disease management in agriculture depends on early illness identification.

Importance: Early disease diagnosis helps minimize crop losses, stop the spread of diseases, and better allocate resources in farming. Agriculture is a real-world application of this concept.

Goals and Objectives: Develop a machine learning model to classify cucumber plant images into healthy and diseased categories. Achieve a high level of accuracy in disease detection. Provide a tool for farmers to detect diseases early and take appropriate action.

Data Collection and Preprocessing Data Sources: The dataset comprises pictures of cucumber plants from various sources, including both healthy and damaged specimens.

Data Collection: Using cameras and smartphones, images from agricultural areas were gathered.

Data Preprocessing: Data cleaning to remove irrelevant or corrupted images. Handling missing values, if any, in the dataset. Removing outliers that may negatively impact model training. Data augmentation techniques applied to increase dataset diversity.

Exploratory Data Analysis (EDA) The dataset was examined using visuals like scatter plots and histograms. The data was examined for patterns, trends, and correlations. Understanding the distribution of photos of healthy and ill plants was made easier by EDA.

Methodology Machine Learning Algorithms:

Convolutional Neural Networks (CNNs) were chosen for image classification due to their effectiveness in handling image data. Transfer learning using pre-trained models such as ResNet or MobileNet may be considered. Train-Test Split:

The dataset was split into training and testing sets with a suitable ratio. Cross-validation may be used to assess model performance robustly.

Model Development The CNN model's architecture consists of layers, units, and activation operations. On the basis of experimentation, hyperparameters including learning rate, batch size, and optimizer were chosen. To avoid overfitting, regularization methods like dropout and L2 regularization were used.

Model Training During training, the model was fed the prepared dataset across a number of epochs. The loss function was minimized using an optimization method. To ensure convergence, early stopping and model checkpoints were used.

Model Evaluation Evaluation Metrics:

Accuracy, precision, recall, F1-score, and confusion matrix were used to assess model performance. Results were computed for both training and test datasets. Performance Discussion:

The model's performance was analyzed in the context of disease detection in cucumber plants. Strengths and weaknesses of the model were identified.

Results and Discussion Key project findings include model performance and disease detection precision. A comparison of the models employed shows the benefits and drawbacks of each. Challenges faced throughout the project and the methods used to solve them are also discussed.

Conclusion Recap of the project's key learnings. The project's importance to early disease detection in agriculture should be highlighted. Future enhancements and potential research directions are suggested.

References Library: Pillow, Roboflow, YOLO, Sklearn, matplotlib Datasets: https://data.mendeley.com/datasets/y6d3z6f8z9/1

Code Repository https://universe.roboflow.com/hakuna-matata/cdd-g8a6g

Rafiur Rahman Rafit EWU 2018-3-60-111
Updated Ljubljana Breast Cancer Data Set: reduced and cleaned version
data.mendeley.com
Updated Oct 25, 2023
Cite
Gennady Chuiko (2023). Updated Ljubljana Breast Cancer Data Set: reduced and cleaned version [Dataset]. http://doi.org/10.17632/fgs9pyfv2z.2
Unique identifier
https://doi.org/10.17632/fgs9pyfv2z.2
Dataset updated
Oct 25, 2023
Authors
Gennady Chuiko
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains information for Machine Learning algorithms to forecast recurrence events (RE) for patients with breast cancer stages I to III. The dataset contains 252 instances and six attributes, including a binary class indicating whether RE occurred. This dataset has been reduced and denoised from the original Ljubljana Breast Cancer Dataset (LBCD), which holds 286 instances with ten attributes each (Zwitter M. and Soklic M. (1988). Breast Cancer. UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/14/breast+cancer). Ranking the attributes with eight different Machine Learning algorithms, and statistically processing the resulting 8-component ranking vectors, made it possible to reduce the ten features to the six most relevant ones. The most pertinent features were the following five: {deg_malig, irradiat, node_caps, tumor_size, inv_nodes}. Four attributes were found to be less relevant: {age, breast_quad, breast, menopause}. The CAIRAD filter (Co-appearance based Analysis for Incorrect Records and Attribute-values Detection; Rahman MG, Islam MZ, Bossomaier T, Gao J. CAIRAD: A co-appearance based analysis for incorrect records and attribute-values detection. Proc Int Jt Conf Neural Networks. 2012;(June). https://doi.org/10.1109/IJCNN.2012.6252669) was used to detect noise in the attributes and the class feature. Per the filtering results, 34 instances of LBCD had noise in half (or even more than half) of their features. Those were removed from the data. Noise in the class is known to be riskier and more troublesome than noise in the attributes. Meanwhile, the class attribute had 35 (14%) missing values out of 252 after CAIRAD filtering. This was unacceptable, considering the comparable number (only 85 cases) of recurrence events in the class of the initial LBCD. The imputation (reconstruction, "cure") of the missing values was performed via the algorithm offered in:
Bai BM, Mangathayaru N, Rani BP. An approach to find missing values in medical datasets. In: ACM International Conference Proceeding Series. Vol 24-26-Sept.; 2015. https://doi.org/10.1145/2832987.2833083. The noise present in the remaining attributes, ranging from 1% to 14%, was neglected. There are 252 instances in the dataset, of which 206 do not have RE, and the remaining 46 have RE. Six attributes, including the class, define each instance. This dataset is obtained from the initial version of the improved LBCD, and it provides a significant advantage in performance over the original LBCD for most Machine Learning classification algorithms. However, the dataset is slightly more imbalanced than the LBCD, which is a drawback.
Datasets used in experiments.
plos.figshare.com
xls
Updated Jan 19, 2024
Cite
Antonio Fernando Lavareda Jacob Junior; Fabricio Almeida do Carmo; Adamo Lima de Santana; Ewaldo Eder Carvalho Santana; Fabio Manoel Franca Lobato (2024). Datasets used in experiments. [Dataset]. http://doi.org/10.1371/journal.pone.0297147.t003
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0297147.t003
Dataset updated
Jan 19, 2024
Dataset provided by
PLOS ONE
Authors
Antonio Fernando Lavareda Jacob Junior; Fabricio Almeida do Carmo; Adamo Lima de Santana; Ewaldo Eder Carvalho Santana; Fabio Manoel Franca Lobato
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Missing data is a prevalent problem that requires attention, as most data analysis techniques are unable to handle it. This is particularly critical in Multi-Label Classification (MLC), where only a few studies have investigated missing data in this application domain. MLC differs from Single-Label Classification (SLC) by allowing an instance to be associated with multiple classes. Movie classification is a didactic example since it can be “drama” and “bibliography” simultaneously. One of the most usual missing data treatment methods is data imputation, which seeks plausible values to fill in the missing ones. In this scenario, we propose a novel imputation method based on a multi-objective genetic algorithm for optimizing multiple data imputations called Multiple Imputation of Multi-label Classification data with a genetic algorithm, or simply EvoImp. We applied the proposed method in multi-label learning and evaluated its performance using six synthetic databases, considering various missing values distribution scenarios. The method was compared with other state-of-the-art imputation strategies, such as K-Means Imputation (KMI) and weighted K-Nearest Neighbors Imputation (WKNNI). The results proved that the proposed method outperformed the baseline in all the scenarios by achieving the best evaluation measures considering the Exact Match, Accuracy, and Hamming Loss. The superior results were constant in different dataset domains and sizes, demonstrating the EvoImp robustness. Thus, EvoImp represents a feasible solution to missing data treatment for multi-label learning.
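Two of the evaluation measures named in this description, Exact Match and Hamming Loss, have simple standard definitions for multi-label predictions. The sketch below assumes labels are encoded as equal-length 0/1 lists per instance; the function names and the toy data are illustrative and not taken from the study.

```python
def exact_match(y_true, y_pred):
    """Fraction of instances whose whole label vector is predicted correctly."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return hits / len(y_true)


def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are predicted incorrectly."""
    wrong = sum(a != b for t, p in zip(y_true, y_pred) for a, b in zip(t, p))
    return wrong / (len(y_true) * len(y_true[0]))


# Toy example: two movies, three possible genres each (hypothetical labels).
truth = [[1, 0, 1], [0, 1, 0]]
preds = [[1, 0, 1], [0, 0, 0]]
print(exact_match(truth, preds), hamming_loss(truth, preds))
```

Exact Match is the stricter measure, since a single wrong label makes the whole instance count as a miss, whereas Hamming Loss gives partial credit per label.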
Empathy dataset
data.niaid.nih.gov
data-staging.niaid.nih.gov
Updated Dec 18, 2024
Cite
Mathematical Research Data Initiative (2024). Empathy dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7683906
Explore at:
Dataset updated
Dec 18, 2024
Authors
Mathematical Research Data Initiative
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
The database for this study (Briganti et al. 2018; the same for the Braun study analysis) comprised 1973 French-speaking students from several universities or schools for higher education in the following fields: engineering (31%), medicine (18%), nursing school (16%), economic sciences (15%), physiotherapy (4%), psychology (11%), law school (4%) and dietetics (1%). The subjects were 17 to 25 years old (M = 19.6 years, SD = 1.6 years); 57% were female and 43% were male. Although the full dataset comprised 1973 participants, only 1270 answered the full questionnaire. Missing data are handled using pairwise complete observations when estimating a Gaussian Graphical Model, meaning that all available information from every subject is used.
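To make "pairwise complete observations" concrete, here is a minimal sketch contrasting it with listwise deletion when correlating two items. It covers only the correlation step, not the Gaussian Graphical Model estimation itself. The function names, the use of `None` for an unanswered item, and the toy responses are assumptions made for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def pairwise_complete_corr(rows, i, j):
    """Correlate items i and j using every respondent who answered both."""
    pairs = [(r[i], r[j]) for r in rows if r[i] is not None and r[j] is not None]
    xs, ys = zip(*pairs)
    return pearson(xs, ys), len(pairs)


def listwise_corr(rows, i, j):
    """Correlate items i and j using only respondents with no missing item."""
    complete = [r for r in rows if all(v is not None for v in r)]
    return pearson([r[i] for r in complete], [r[j] for r in complete]), len(complete)


# Made-up responses to 3 items; None marks an unanswered item.
rows = [
    [0, 1, 2],
    [1, 2, None],
    [2, 2, 3],
    [3, 4, None],
    [4, 3, 4],
]

r_pair, n_pair = pairwise_complete_corr(rows, 0, 1)
r_list, n_list = listwise_corr(rows, 0, 1)
```

In this toy case the pairwise estimate for items 0 and 1 uses all five respondents, whereas listwise deletion keeps only the three who answered everything. The trade-off is that each pair of items may then be estimated from a different subset of respondents.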

The feature set is composed of 28 items meant to assess the four following components: fantasy, perspective taking, empathic concern and personal distress. In the questionnaire, the items are mixed; reversed items (items 3, 4, 7, 12, 13, 14, 15, 18, 19) are present. Items are scored from 0 to 4, where “0” means “Doesn’t describe me very well” and “4” means “Describes me very well”; reverse-scoring is calculated afterwards. The questionnaires were anonymized. The reanalysis of the database in this retrospective study was approved by the ethical committee of the Erasmus Hospital.
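As a small illustration of the reverse-scoring step, the sketch below flips the reversed items on the 0 to 4 scale by computing 4 minus the raw score. The list of reversed items comes from the paragraph above; the function name, the list-of-28-scores layout, and the use of `None` for unanswered items are assumptions.

```python
REVERSED_ITEMS = {3, 4, 7, 12, 13, 14, 15, 18, 19}  # item numbers listed in the description
MAX_SCORE = 4  # items are scored from 0 to 4


def reverse_score(responses):
    """Return a copy of the 28 raw scores (item 1 first) with reversed items flipped.

    Unanswered items (None) are left as None.
    """
    scored = []
    for item_number, raw in enumerate(responses, start=1):
        if raw is not None and item_number in REVERSED_ITEMS:
            scored.append(MAX_SCORE - raw)
        else:
            scored.append(raw)
    return scored


# Made-up respondent: all zeros, except item 3 answered 4 and item 5 unanswered.
raw = [0] * 28
raw[2] = 4
raw[4] = None
scored = reverse_score(raw)
```

After this step a higher value means more of the measured component on every item, so items within a component can be summed or compared directly.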

Size: A dataset of size 1973*28

Number of features: 28

Ground truth: No

Type of Graph: Mixed graph

The following gives the description of the variables:

Feature FeatureLabel Domain Item meaning from Davis 1980

001 1FS Green I daydream and fantasize, with some regularity, about things that might happen to me.

002 2EC Purple I often have tender, concerned feelings for people less fortunate than me.

003 3PT_R Yellow I sometimes find it difficult to see things from the “other guy’s” point of view. (Reversed)

004 4EC_R Purple Sometimes I don’t feel very sorry for other people when they are having problems. (Reversed)

005 5FS Green I really get involved with the feelings of the characters in a novel.

006 6PD Red In emergency situations, I feel apprehensive and ill-at-ease.

007 7FS_R Green I am usually objective when I watch a movie or play, and I don’t often get completely caught up in it. (Reversed)

008 8PT Yellow I try to look at everybody’s side of a disagreement before I make a decision.

009 9EC Purple When I see someone being taken advantage of, I feel kind of protective towards them.

010 10PD Red I sometimes feel helpless when I am in the middle of a very emotional situation.

011 11PT Yellow I sometimes try to understand my friends better by imagining how things look from their perspective.

012 12FS_R Green Becoming extremely involved in a good book or movie is somewhat rare for me. (Reversed)

013 13PD_R Red When I see someone get hurt, I tend to remain calm. (Reversed)

014 14EC_R Purple Other people’s misfortunes do not usually disturb me a great deal. (Reversed)

015 15PT_R Yellow If I’m sure I’m right about something, I don’t waste much time listening to other people’s arguments. (Reversed)

016 16FS Green After seeing a play or movie, I have felt as though I were one of the characters.

017 17PD Red Being in a tense emotional situation scares me.

018 18EC_R Purple When I see someone being treated unfairly, I sometimes don’t feel very much pity for them. (Reversed)

019 19PD_R Red I am usually pretty effective in dealing with emergencies. (Reversed)

020 20FS Green I am often quite touched by things that I see happen.

021 21PT Yellow I believe that there are two sides to every question and try to look at them both.

022 22EC Purple I would describe myself as a pretty soft-hearted person.

023 23FS Green When I watch a good movie, I can very easily put myself in the place of a leading character.

024 24PD Red I tend to lose control during emergencies.

025 25PT Yellow When I’m upset at someone, I usually try to “put myself in his shoes” for a while.

026 26FS Green When I am reading an interesting story or novel, I imagine how I would feel if the events in the story were happening to me.

027 27PD Red When I see someone who badly needs help in an emergency, I go to pieces.

028 28PT Yellow Before criticizing somebody, I try to imagine how I would feel if I were in their place.

More information about the dataset is contained in empathy_description.html file.
Data from: Learning to see the wood for the trees: machine learning,...
datadryad.org
search.dataone.org
zip
Updated Sep 14, 2020
Cite
Simon Wills; Charlie J. Underwood; Paul M. Barrett (2020). Learning to see the wood for the trees: machine learning, decision trees and the classification of isolated theropod teeth [Dataset]. http://doi.org/10.5061/dryad.1zcrjdfq9
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.1zcrjdfq9
Dataset updated
Sep 14, 2020
Dataset provided by
Dryad
Authors
Simon Wills; Charlie J. Underwood; Paul M. Barrett
Time period covered
Sep 12, 2020
Description
Data to test the models was sourced from:

HENDRICKX, C., MATEUS, O. and ARAÚJO, R. 2015. The dentition of megalosaurid theropods. Acta Palaeontologica Polonica, 60, 627–642.

LARSON, DEREK W., BROWN, CALEB M. and EVANS, DAVID C. 2016. Dental Disparity and Ecological Stability in Bird-like Dinosaurs prior to the End-Cretaceous Mass Extinction. Current Biology, 26, 1325–1333.
