Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This document provides a clear and practical guide to understanding missing data mechanisms, including Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Through real-world scenarios and examples, it explains how different types of missingness impact data analysis and decision-making. It also outlines common strategies for handling missing data, including deletion techniques and imputation methods such as mean imputation, regression, and stochastic modeling.Designed for researchers, analysts, and students working with real-world datasets, this guide helps ensure statistical validity, reduce bias, and improve the overall quality of analysis in fields like public health, behavioral science, social research, and machine learning.
Facebook
TwitterReplication and simulation reproduction materials for the article "The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning." Please see the README file for a summary of the contents and the Replication Guide for a more detailed description. Article abstract: Principled methods for analyzing missing values, based chiefly on multiple imputation, have become increasingly popular yet can struggle to handle the kinds of large and complex data that are also becoming common. We propose an accurate, fast, and scalable approach to multiple imputation, which we call MIDAS (Multiple Imputation with Denoising Autoencoders). MIDAS employs a class of unsupervised neural networks known as denoising autoencoders, which are designed to reduce dimensionality by corrupting and attempting to reconstruct a subset of data. We repurpose denoising autoencoders for multiple imputation by treating missing values as an additional portion of corrupted data and drawing imputations from a model trained to minimize the reconstruction error on the originally observed portion. Systematic tests on simulated as well as real social science data, together with an applied example involving a large-scale electoral survey, illustrate MIDAS's accuracy and efficiency across a range of settings. We provide open-source software for implementing MIDAS.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Example data sets and computer code for the book chapter titled "Missing Data in the Analysis of Multilevel and Dependent Data" submitted for publication in the second edition of "Dependent Data in Social Science Research" (Stemmler et al., 2015). This repository includes the computer code (".R") and the data sets from both example analyses (Examples 1 and 2). The data sets are available in two file formats (binary ".rda" for use in R; plain-text ".dat").
The data sets contain simulated data from 23,376 (Example 1) and 23,072 (Example 2) individuals from 2,000 groups on four variables:
ID = group identifier (1-2000) x = numeric (Level 1) y = numeric (Level 1) w = binary (Level 2)
In all data sets, missing values are coded as "NA".
Facebook
TwitterA common problem in clinical trials is the missing data that occurs when patients do not complete the study and drop out without further measurements. Missing data cause the usual statistical analysis of complete or all available data to be subject to bias. There are no universally applicable methods for handling missing data. We recommend the following: (1) Report reasons for dropouts and proportions for each treatment group; (2) Conduct sensitivity analyses to encompass different scenarios of assumptions and discuss consistency or discrepancy among them; (3) Pay attention to minimize the chance of dropouts at the design stage and during trial monitoring; (4) Collect post-dropout data on the primary endpoints, if at all possible; and (5) Consider the dropout event itself an important endpoint in studies with many.
Facebook
TwitterDataset is final solution for dealing with missing values in the Spaceship Titanic competition. Kaggle Notebook: https://www.kaggle.com/sardorabdirayimov/best-way-of-dealing-with-missing-values-titanic-2/
Facebook
TwitterThe purpose of this report is to guide analysts interested in fitting regression models using data from the National Survey on Drug Use and Health (NSDUH) by providing them with methods for handling missing item values in regression analyses (MIVRA). The report includes a theoretical review of existing MIVRA methods, a simulation study that evaluates several of the more promising methods using existing NSDUH datasets, and a final chapter where the results of both the theoretical review and the simulation study are synthesized into guidance for analysts via decision trees.
Facebook
TwitterApache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Feroz Shinwari
Released under Apache 2.0
Facebook
TwitterThe code provided is related to training an autoencoder, evaluating its performance, and using it for imputing missing values in a dataset. Let's break down each part:Training the Autoencoder (train_autoencoder function):This function takes an autoencoder model and the input features as input.It trains the autoencoder using the input features as both input and target output (hence features, features).The autoencoder is trained for a specified number of epochs (epochs) with a given batch size (batch_size).The shuffle=True argument ensures that the data is shuffled before each epoch to prevent the model from memorizing the input order.After training, it returns the trained autoencoder model and the training history.Evaluating the Autoencoder (evaluate_autoencoder function):This function takes a trained autoencoder model and the input features as input.It uses the trained autoencoder to predict the reconstructed features from the input features.It calculates Mean Squared Error (MSE), Mean Absolute Error (MAE), and R-squared (R2) scores between the original and reconstructed features.These metrics provide insights into how well the autoencoder is able to reconstruct the input features.Imputing with the Autoencoder (impute_with_autoencoder function):This function takes a trained autoencoder model and the input features as input.It identifies missing values (e.g., -9999) in the input features.For each row with missing values, it predicts the missing values using the trained autoencoder.It replaces the missing values with the predicted values.The imputed features are returned as output.To reuse this code:Load your dataset and preprocess it as necessary.Build an autoencoder model using the build_autoencoder function.Train the autoencoder using the train_autoencoder function with your input features.Evaluate the performance of the autoencoder using the evaluate_autoencoder function.If your dataset contains missing values, use the impute_with_autoencoder function to impute them with the trained autoencoder.Use the trained autoencoder for any other relevant tasks, such as feature extraction or anomaly detection.
Facebook
Twitterhttps://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Fossil-based estimates of diversity and evolutionary dynamics mainly rely on the study of morphological variation. Unfortunately, organism remains are often altered by post-mortem taphonomic processes such as weathering or distortion. Such a loss of information often prevents quantitative multivariate description and statistically controlled comparisons of extinct species based on morphometric data. A common way to deal with missing data involves imputation methods that directly fill the missing cases with model estimates. Over the last several years, several empirically determined thresholds for the maximum acceptable proportion of missing values have been proposed in the literature, whereas other studies showed that this limit actually depends on several properties of the study dataset and of the selected imputation method, and is by no way generalizable. We evaluate the relative performances of seven multiple imputation techniques through a simulation-based analysis under three distinct patterns of missing data distribution. Overall, Fully Conditional Specification and Expectation-Maximization algorithms provide the best compromises between imputation accuracy and coverage probability. Multiple imputation (MI) techniques appear remarkably robust to the violation of basic assumptions such as the occurrence of taxonomically or anatomically biased patterns of missing data distribution, making differences in simulation results between the three patterns of missing data distribution much smaller than differences between the individual MI techniques. Based on these results, rather than proposing a new (set of) threshold value(s), we develop an approach combining the use of multiple imputations with procrustean superimposition of principal component analysis results, in order to directly visualize the effect of individual missing data imputation on an ordinated space. We provide an R function for users to implement the proposed procedure.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Cervical cancer is a leading cause of women’s mortality, emphasizing the need for early diagnosis and effective treatment. In line with the imperative of early intervention, the automated identification of cervical cancer has emerged as a promising avenue, leveraging machine learning techniques to enhance both the speed and accuracy of diagnosis. However, an inherent challenge in the development of these automated systems is the presence of missing values in the datasets commonly used for cervical cancer detection. Missing data can significantly impact the performance of machine learning models, potentially leading to inaccurate or unreliable results. This study addresses a critical challenge in automated cervical cancer identification—handling missing data in datasets. The study present a novel approach that combines three machine learning models into a stacked ensemble voting classifier, complemented by the use of a KNN Imputer to manage missing values. The proposed model achieves remarkable results with an accuracy of 0.9941, precision of 0.98, recall of 0.96, and an F1 score of 0.97. This study examines three distinct scenarios: one involving the deletion of missing values, another utilizing KNN imputation, and a third employing PCA for imputing missing values. This research has significant implications for the medical field, offering medical experts a powerful tool for more accurate cervical cancer therapy and enhancing the overall effectiveness of testing procedures. By addressing missing data challenges and achieving high accuracy, this work represents a valuable contribution to cervical cancer detection, ultimately aiming to reduce the impact of this disease on women’s health and healthcare systems.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Despite the wide application of longitudinal studies, they are often plagued by missing data and attrition. The majority of methodological approaches focus on participant retention or modern missing data analysis procedures. This paper, however, takes a new approach by examining how researchers may supplement the sample with additional participants. First, refreshment samples use the same selection criteria as the initial study. Second, replacement samples identify auxiliary variables that may help explain patterns of missingness and select new participants based on those characteristics. A simulation study compares these two strategies for a linear growth model with five measurement occasions. Overall, the results suggest that refreshment samples lead to less relative bias, greater relative efficiency, and more acceptable coverage rates than replacement samples or not supplementing the missing participants in any way. Refreshment samples also have high statistical power. The comparative strengths of the refreshment approach are further illustrated through a real data example. These findings have implications for assessing change over time when researching at-risk samples with high levels of permanent attrition.
Facebook
TwitterResults of model performance in handling missing data.
Facebook
Twitterhttps://choosealicense.com/licenses/odbl/https://choosealicense.com/licenses/odbl/
Dataset Details
1.Dataset Loading:
Initially, we load the Drug Review Dataset from the UC Irvine Machine Learning Repository. This dataset contains patient reviews of different drugs, along with the medical condition being treated and the patients' satisfaction ratings.
2.Data Preprocessing:
The dataset is preprocessed to ensure data integrity and consistency. We handle missing values and ensure that each patient ID is unique across the dataset.
3.Text… See the full description on the dataset page: https://huggingface.co/datasets/Mouwiya/drug-reviews.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Missing data, a common issue in production processes due to factors like sample contamination and equipment malfunctions, can lead to a decrease in the recognition accuracy of control charts, especially in cases of shifting. To address this, we introduce an online adaptive weighted imputation technique that combines the strengths of K-Nearest Neighbor (KNN) and Exponentially Weighted Moving Average (EWMA) imputations. It utilizes an adaptive weight matrix for weighting both methods and an adaptive covariance matrix to optimize for missing structures. When dealing with data fluctuation, we assign a higher weight to the KNN method for its sensitivity, while the EWMA method is preferred for stationary data. This approach does not require data stacking; thus, the imputation process for missing data is conducted online. Consequently, based on the online Multivariate EWMA (MEWMA) control chart, real-time process monitoring can be achieved. To optimize the use of available information, we also adjust the covariance matrix with a weight matrix to emphasize complete data. The proposed technique outperforms traditional methods in performance monitoring by avoiding false alarms and quickly detecting anomalies during process shifts.
Facebook
TwitterReplication Data for: A GMM Approach for Dealing with Missing Data on Regressors
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Project Documentation: Cucumber Disease Detection
Introduction: A machine learning model for the automatic detection of diseases in cucumber plants is to be developed as part of the "Cucumber Disease Detection" project. This research is crucial because it tackles the issue of early disease identification in agriculture, which can increase crop yield and cut down on financial losses. To train and test the model, we use a dataset of pictures of cucumber plants.
Importance: Early disease diagnosis helps minimize crop losses, stop the spread of diseases, and better allocate resources in farming. Agriculture is a real-world application of this concept.
Goals and Objectives: Develop a machine learning model to classify cucumber plant images into healthy and diseased categories. Achieve a high level of accuracy in disease detection. Provide a tool for farmers to detect diseases early and take appropriate action.
Data Collection: Using cameras and smartphones, images from agricultural areas were gathered.
Data Preprocessing: Data cleaning to remove irrelevant or corrupted images. Handling missing values, if any, in the dataset. Removing outliers that may negatively impact model training. Data augmentation techniques applied to increase dataset diversity.
Exploratory Data Analysis (EDA) The dataset was examined using visuals like scatter plots and histograms. The data was examined for patterns, trends, and correlations. Understanding the distribution of photos of healthy and ill plants was made easier by EDA.
Methodology Machine Learning Algorithms:
Convolutional Neural Networks (CNNs) were chosen for image classification due to their effectiveness in handling image data. Transfer learning using pre-trained models such as ResNet or MobileNet may be considered. Train-Test Split:
The dataset was split into training and testing sets with a suitable ratio. Cross-validation may be used to assess model performance robustly.
Model Development The CNN model's architecture consists of layers, units, and activation operations. On the basis of experimentation, hyperparameters including learning rate, batch size, and optimizer were chosen. To avoid overfitting, regularization methods like dropout and L2 regularization were used.
Model Training During training, the model was fed the prepared dataset across a number of epochs. The loss function was minimized using an optimization method. To ensure convergence, early halting and model checkpoints were used.
Model Evaluation Evaluation Metrics:
Accuracy, precision, recall, F1-score, and confusion matrix were used to assess model performance. Results were computed for both training and test datasets. Performance Discussion:
The model's performance was analyzed in the context of disease detection in cucumber plants. Strengths and weaknesses of the model were identified.
Results and Discussion Key project findings include model performance and disease detection precision. a comparison of the many models employed, showing the benefits and drawbacks of each. challenges that were faced throughout the project and the methods used to solve them.
Conclusion recap of the project's key learnings. the project's importance to early disease detection in agriculture should be highlighted. Future enhancements and potential research directions are suggested.
References Library: Pillow,Roboflow,YELO,Sklearn,matplotlib Datasets:https://data.mendeley.com/datasets/y6d3z6f8z9/1
Code Repository https://universe.roboflow.com/hakuna-matata/cdd-g8a6g
Rafiur Rahman Rafit EWU 2018-3-60-111
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains information for Machine Learning algorithms to forecast recurrence events (RE) for patients with breast cancer stages I to III. The dataset contains 252 instances and six attributes, including a binary class indicating whether RE occurred. This dataset has been reduced and denoised from the original Ljubljana, which holds 286 instances with ten attributes each (LBCD, Zwitter M. and Soklic M. (1988). Breast Cancer. UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/14/breast+cancer).
The ranking results by eight different Machine learning algorithms and statistical handling of the ranking 8-component vectors for attributes allow one to reduce ten features to six of the most relevant ones. The most pertinent features were the following five: {deg_malig, irradiat, node_caps, tumor_size, inv_nodes}. Less relevant found four attributes: {age, breast_quad, breast, menopause}.
The CAIRAD: Co-appearance based Analysis for Incorrect Records and Attribute-values Detection ( Rahman MG, Islam MZ, Bossomaier T, Gao J. CAIRAD: A co-appearance based analysis for incorrect records and attribute-values detection. Proc Int Jt Conf Neural Networks. 2012;(June). https://doi.org/10.1109/IJCNN.2012.6252669) filter has been determined the noises in attributes and class features. Per the filtering results, 34 instances of LBCD had noises in half (or even more than half) of their features. Those were removed from the data.
It is known that the noises in the class are riskier and teasing than those of attributes. Meantime, the class attribute had 35 (14%) missed values from 252 after COIRAD filtering. It was unacceptable, considering the comparable number (only 85 cases) of recurrence events in the class of initial LBCD. The imputation (reconstruction, "cure") of missed values was performed via the algorithm offered in:
Bai BM, Mangathayaru N, Rani BP. An approach to find missing values in medical datasets. In: ACM International Conference Proceeding Series. Vol 24-26-Sept. ; 2015. https://doi.org/10.1145/2832987.2833083. The noises presented in the remaining attributes, ranging from 1% to 14%, were neglected.
There are 252 instances in the dataset, of which 206 do not have RE, and the remaining 46 have RE. Six attributes, including its class, define each instance. This dataset is obtained from the initial version of the LBCD betterment, and it provides a significant advantage in the performance over the original LBCD for most classifying algorithms of Machine Learning. However, the dataset is slightly more imbalanced than the LBCD, which is a minus.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Missing data is a prevalent problem that requires attention, as most data analysis techniques are unable to handle it. This is particularly critical in Multi-Label Classification (MLC), where only a few studies have investigated missing data in this application domain. MLC differs from Single-Label Classification (SLC) by allowing an instance to be associated with multiple classes. Movie classification is a didactic example since it can be “drama” and “bibliography” simultaneously. One of the most usual missing data treatment methods is data imputation, which seeks plausible values to fill in the missing ones. In this scenario, we propose a novel imputation method based on a multi-objective genetic algorithm for optimizing multiple data imputations called Multiple Imputation of Multi-label Classification data with a genetic algorithm, or simply EvoImp. We applied the proposed method in multi-label learning and evaluated its performance using six synthetic databases, considering various missing values distribution scenarios. The method was compared with other state-of-the-art imputation strategies, such as K-Means Imputation (KMI) and weighted K-Nearest Neighbors Imputation (WKNNI). The results proved that the proposed method outperformed the baseline in all the scenarios by achieving the best evaluation measures considering the Exact Match, Accuracy, and Hamming Loss. The superior results were constant in different dataset domains and sizes, demonstrating the EvoImp robustness. Thus, EvoImp represents a feasible solution to missing data treatment for multi-label learning.
Facebook
TwitterAttribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The database for this study (Briganti et al. 2018; the same for the Braun study analysis) was composed of 1973 French-speaking students in several universities or schools for higher education in the following fields: engineering (31%), medicine (18%), nursing school (16%), economic sciences (15%), physiotherapy, (4%), psychology (11%), law school (4%) and dietetics (1%). The subjects were 17 to 25 years old (M = 19.6 years, SD = 1.6 years), 57% were females and 43% were males. Even though the full dataset was composed of 1973 participants, only 1270 answered the full questionnaire: missing data are handled using pairwise complete observations in estimating a Gaussian Graphical Model, meaning that all available information from every subject are used.
The feature set is composed of 28 items meant to assess the four following components: fantasy, perspective taking, empathic concern and personal distress. In the questionnaire, the items are mixed; reversed items (items 3, 4, 7, 12, 13, 14, 15, 18, 19) are present. Items are scored from 0 to 4, where “0” means “Doesn’t describe me very well” and “4” means “Describes me very well”; reverse-scoring is calculated afterwards. The questionnaires were anonymized. The reanalysis of the database in this retrospective study was approved by the ethical committee of the Erasmus Hospital.
Size: A dataset of size 1973*28
Number of features: 28
Ground truth: No
Type of Graph: Mixed graph
The following gives the description of the variables:
Feature FeatureLabel Domain Item meaning from Davis 1980
001 1FS Green I daydream and fantasize, with some regularity, about things that might happen to me.
002 2EC Purple I often have tender, concerned feelings for people less fortunate than me.
003 3PT_R Yellow I sometimes find it difficult to see things from the “other guy’s” point of view.
004 4EC_R Purple Sometimes I don’t feel very sorry for other people when they are having problems.
005 5FS Green I really get involved with the feelings of the characters in a novel.
006 6PD Red In emergency situations, I feel apprehensive and ill-at-ease.
007 7FS_R Green I am usually objective when I watch a movie or play, and I don’t often get completely caught up in it.(Reversed)
008 8PT Yellow I try to look at everybody’s side of a disagreement before I make a decision.
009 9EC Purple When I see someone being taken advantage of, I feel kind of protective towards them.
010 10PD Red I sometimes feel helpless when I am in the middle of a very emotional situation.
011 11PT Yellow sometimes try to understand my friends better by imagining how things look from their perspective
012 12FS_R Green Becoming extremely involved in a good book or movie is somewhat rare for me. (Reversed)
013 13PD_R Red When I see someone get hurt, I tend to remain calm. (Reversed)
014 14EC_R Purple Other people’s misfortunes do not usually disturb me a great deal. (Reversed)
015 15PT_R Yellow If I’m sure I’m right about something, I don’t waste much time listening to other people’s arguments. (Reversed)
016 16FS Green After seeing a play or movie, I have felt as though I were one of the characters.
017 17PD Red Being in a tense emotional situation scares me.
018 18EC_R Purple When I see someone being treated unfairly, I sometimes don’t feel very much pity for them. (Reversed)
019 19PD_R Red I am usually pretty effective in dealing with emergencies. (Reversed)
020 20FS Green I am often quite touched by things that I see happen.
021 21PT Yellow I believe that there are two sides to every question and try to look at them both.
022 22EC Purple I would describe myself as a pretty soft-hearted person.
023 23FS Green When I watch a good movie, I can very easily put myself in the place of a leading character.
024 24PD Red I tend to lose control during emergencies.
025 25PT Yellow When I’m upset at someone, I usually try to “put myself in his shoes” for a while.
026 26FS Green When I am reading an interesting story or novel, I imagine how I would feel if the events in the story were happening to me.
027 27PD Red When I see someone who badly needs help in an emergency, I go to pieces.
028 28PT Yellow Before criticizing somebody, I try to imagine how I would feel if I were in their place
More information about the dataset is contained in empathy_description.html file.
Facebook
TwitterData to test the models was sourced from:
HENDRICKX, C., MATEUS, O. and ARAÚJO, R. 2015. The dentition of megalosaurid theropods. Acta Palaeontologica Polonica, 60, 627–642.
LARSON, DEREK W., BROWN, CALEB M. and EVANS, DAVID C. 2016. Dental Disparity and Ecological Stability in Bird-like Dinosaurs prior to the End-Cretaceous Mass Extinction. Current Biology, 26, 1325–1333.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This document provides a clear and practical guide to understanding missing data mechanisms, including Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Through real-world scenarios and examples, it explains how different types of missingness impact data analysis and decision-making. It also outlines common strategies for handling missing data, including deletion techniques and imputation methods such as mean imputation, regression, and stochastic modeling.Designed for researchers, analysts, and students working with real-world datasets, this guide helps ensure statistical validity, reduce bias, and improve the overall quality of analysis in fields like public health, behavioral science, social research, and machine learning.