100+ datasets found

Data Driven Estimation of Imputation Error—A Strategy for Imputation with a...
plos.figshare.com
figshare.com
pdf
Updated Jun 1, 2023
Cite
Nikolaj Bak; Lars K. Hansen (2023). Data Driven Estimation of Imputation Error—A Strategy for Imputation with a Reject Option [Dataset]. http://doi.org/10.1371/journal.pone.0164464
Explore at:
Available download formats: pdf
Unique identifier
https://doi.org/10.1371/journal.pone.0164464
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS ONE
Authors
Nikolaj Bak; Lars K. Hansen
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Missing data is a common problem in many research fields and is a challenge that always needs careful consideration. One approach is to impute the missing values, i.e., replace missing values with estimates. When imputation is applied, it is typically applied to all records with missing values indiscriminately. We note that the effects of imputation can be strongly dependent on what is missing. To help make decisions about which records should be imputed, we propose to use a machine learning approach to estimate the imputation error for each case with missing data. The method is intended as a practical aid for users once the informed choice to impute the missing data has been made. To do this, all patterns of missing values are simulated in all complete cases, enabling calculation of the “true error” in each of these new cases. The error is then estimated for each case with missing values by weighting the “true errors” by similarity. The method can also be used to test the performance of different imputation methods. A universal numerical threshold of acceptable error cannot be set since this will differ according to the data, research question, and analysis method. The effect of the threshold can be estimated using the complete cases. The user can set an a priori relevant threshold for what is acceptable or use cross validation with the final analysis to choose the threshold. The choice can be presented along with the argumentation for it rather than holding to conventions that might not be warranted in the specific dataset.
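The procedure described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: mean imputation stands in for the imputation method, a Gaussian kernel on the observed columns stands in for "similarity", and the names `mean_impute`, `estimate_imputation_error`, and `bandwidth` are hypothetical. It assumes the incomplete row has at least one missing and one observed value.

```python
import math


def mean_impute(row, col_means):
    """Replace None entries with the column mean (a stand-in imputation method)."""
    return [col_means[j] if v is None else v for j, v in enumerate(row)]


def estimate_imputation_error(complete, incomplete_row, bandwidth=1.0):
    """Estimate the imputation error for one incomplete row.

    The row's missingness pattern is simulated in every complete case, the
    "true error" of imputing that case is computed, and the errors are
    averaged with weights that decay with distance on the observed columns.
    """
    n_cols = len(incomplete_row)
    col_means = [sum(r[j] for r in complete) / len(complete) for j in range(n_cols)]
    missing = [j for j, v in enumerate(incomplete_row) if v is None]
    observed = [j for j, v in enumerate(incomplete_row) if v is not None]

    weighted_error, total_weight = 0.0, 0.0
    for row in complete:
        # Simulate the same missingness pattern in this complete case.
        masked = [None if j in missing else v for j, v in enumerate(row)]
        imputed = mean_impute(masked, col_means)
        # "True error": RMSE between the imputed and the held-back values.
        true_error = math.sqrt(
            sum((imputed[j] - row[j]) ** 2 for j in missing) / len(missing)
        )
        # Similarity weight: Gaussian kernel on observed columns (an assumption).
        dist2 = sum((row[j] - incomplete_row[j]) ** 2 for j in observed)
        weight = math.exp(-dist2 / (2 * bandwidth ** 2))
        weighted_error += weight * true_error
        total_weight += weight
    return weighted_error / total_weight
```

A record whose estimated error exceeds a user-chosen threshold would then be rejected rather than imputed; the threshold itself is left to the analyst, as the description notes.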
Conjunto de datos de: Water-quality data imputation with a high percentage...
redata.anii.org.uy
tsv
Updated Mar 6, 2024
+ more versions
Cite
Rafael Rodríguez; Marcos Pastorini; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Angela Gorgoglione (2024). Conjunto de datos de: Water-quality data imputation with a high percentage of missing values: a machine learning approach [Dataset]. http://doi.org/10.60895/redata/TNRT8Q
Explore at:
Available download formats: tsv (11154), tsv (47413)
Unique identifier
https://doi.org/10.60895/redata/TNRT8Q
Dataset updated
Mar 6, 2024
Dataset provided by
Repositorio de datos abiertos de investigación de Uruguay
Authors
Rafael Rodríguez; Marcos Pastorini; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Angela Gorgoglione
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset funded by
ANII
Description
The monitoring of surface-water quality followed by water-quality modeling and analysis is essential for generating effective strategies in water resource management. However, water-quality studies are limited by the lack of complete and reliable data sets on surface-water-quality variables. These deficiencies are particularly noticeable in developing countries. This work focuses on surface-water-quality data from Santa Lucía Chico river (Uruguay), a mixed lotic and lentic river system. Data collected at six monitoring stations are publicly available at https://www.dinama.gub.uy/oan/datos-abiertos/calidad-agua/. The high temporal and spatial variability that characterizes water-quality variables and the high rate of missing values (between 50% and 70%) raise significant challenges. To deal with missing values, we applied several statistical and machine-learning imputation methods. The competing algorithms included both univariate and multivariate imputation methods: inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR), and K-nearest neighbors Regressor (KNNR). IDW outperformed the others, achieving a very good performance (NSE greater than 0.8) in most cases.

In this dataset, we include the original and imputed values for the following variables:
- Water temperature (Tw)
- Dissolved oxygen (DO)
- Electrical conductivity (EC)
- pH
- Turbidity (Turb)
- Nitrite (NO2-)
- Nitrate (NO3-)
- Total Nitrogen (TN)

Each variable is identified as [STATION] VARIABLE FULL NAME (VARIABLE SHORT NAME) [UNIT METRIC]. More details about the study area, the original datasets, and the methodology adopted can be found in our paper https://www.mdpi.com/2071-1050/13/11/6318.

If you use this dataset in your work, please cite our paper: Rodríguez, R.; Pastorini, M.; Etcheverry, L.; Chreties, C.; Fossati, M.; Castro, A.; Gorgoglione, A. Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach. Sustainability 2021, 13, 6318. https://doi.org/10.3390/su13116318
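The description names IDW as the best-performing method and NSE (Nash-Sutcliffe efficiency) as the score. A minimal sketch of both follows, assuming IDW is applied along the time axis of a single series with distinct time stamps; the paper may define distance differently, and the names `idw_impute`, `nse`, and `power` are illustrative.

```python
def idw_impute(times, values, power=2.0):
    """Fill None entries by inverse-distance weighting over the time axis.

    Each missing value becomes a weighted mean of all observed values,
    with weight 1 / |t - t_k| ** power. Assumes distinct time stamps.
    """
    known = [(t, v) for t, v in zip(times, values) if v is not None]
    filled = []
    for t, v in zip(times, values):
        if v is not None:
            filled.append(v)
            continue
        weights = [1.0 / abs(t - tk) ** power for tk, _ in known]
        weighted_sum = sum(w * vk for w, (_, vk) in zip(weights, known))
        filled.append(weighted_sum / sum(weights))
    return filled


def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 - SSE / variance of observations about their mean."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    spread = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / spread
```

An NSE of 1 means a perfect match, and 0 means the imputation does no better than the mean of the observations, which is why a value above 0.8 is read as very good performance.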
Replication Data for: The MIDAS Touch: Accurate and Scalable Missing-Data...
dataverse.harvard.edu
Updated Sep 29, 2022
Cite
Ranjit Lall; Thomas Robinson (2022). Replication Data for: The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning [Dataset]. http://doi.org/10.7910/DVN/UPL4TT
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Unique identifier
https://doi.org/10.7910/DVN/UPL4TT
Dataset updated
Sep 29, 2022
Dataset provided by
Harvard Dataverse
Authors
Ranjit Lall; Thomas Robinson
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Replication and simulation reproduction materials for the article "The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning." Please see the README file for a summary of the contents and the Replication Guide for a more detailed description. Article abstract: Principled methods for analyzing missing values, based chiefly on multiple imputation, have become increasingly popular yet can struggle to handle the kinds of large and complex data that are also becoming common. We propose an accurate, fast, and scalable approach to multiple imputation, which we call MIDAS (Multiple Imputation with Denoising Autoencoders). MIDAS employs a class of unsupervised neural networks known as denoising autoencoders, which are designed to reduce dimensionality by corrupting and attempting to reconstruct a subset of data. We repurpose denoising autoencoders for multiple imputation by treating missing values as an additional portion of corrupted data and drawing imputations from a model trained to minimize the reconstruction error on the originally observed portion. Systematic tests on simulated as well as real social science data, together with an applied example involving a large-scale electoral survey, illustrate MIDAS's accuracy and efficiency across a range of settings. We provide open-source software for implementing MIDAS.
Data from: Multiple Imputation for the Supplementary Homicide Reports:...
catalog.data.gov
datasets.ai
+2 more
Updated Mar 12, 2025
+ more versions
Cite
National Institute of Justice (2025). Multiple Imputation for the Supplementary Homicide Reports: Evaluation in Unique Test Data, 1990-1995, Chicago, Philadelphia, Phoenix and St. Louis [Dataset]. https://catalog.data.gov/dataset/multiple-imputation-for-the-supplementary-homicide-reports-evaluation-in-unique-test-data-
Dataset updated
Mar 12, 2025
Dataset provided by
National Institute of Justice
Area covered
St. Louis, Chicago
Description
This study was an evaluation of multiple imputation strategies to address missing data using the New Approach to Evaluating Supplementary Homicide Report (SHR) Data Imputation, 1990-1995 (ICPSR 20060) dataset.
Data_Sheet_1_ImputEHR: A Visualization Tool of Imputation for the Prediction...
figshare.com
frontiersin.figshare.com
pdf
Updated Jun 1, 2023
Cite
Yi-Hui Zhou; Ehsan Saghapour (2023). Data_Sheet_1_ImputEHR: A Visualization Tool of Imputation for the Prediction of Biomedical Data.PDF [Dataset]. http://doi.org/10.3389/fgene.2021.691274.s001
Explore at:
Available download formats: pdf
Unique identifier
https://doi.org/10.3389/fgene.2021.691274.s001
Dataset updated
Jun 1, 2023
Dataset provided by
Frontiers
Authors
Yi-Hui Zhou; Ehsan Saghapour
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Electronic health records (EHRs) have been widely adopted in recent years, but often include a high proportion of missing data, which can create difficulties in implementing machine learning and other tools of personalized medicine. Completed datasets are preferred for a number of analysis methods, and successful imputation of missing EHR data can improve interpretation and increase our power to predict health outcomes. However, use of the most popular imputation methods mainly requires scripting skills, and the methods are implemented using various packages and syntax. Thus, the implementation of a full suite of methods is generally out of reach to all except experienced data scientists. Moreover, imputation is often considered as a separate exercise from exploratory data analysis, but should be considered as part of the data exploration process. We have created a new graphical tool, ImputEHR, that is based on Python and allows implementation of a range of simple and sophisticated (e.g., gradient-boosted tree-based and neural network) data imputation approaches. In addition to imputation, the tool enables data exploration for informed decision-making, as well as implementing machine learning prediction tools for response data selected by the user. Although the approach works for any missing data problem, the tool is primarily motivated by problems encountered for EHR and other biomedical data. We illustrate the tool using multiple real datasets, providing performance measures of imputation and downstream predictive analysis.
Data from: Missing data estimation in morphometrics: how much is too much?
zenodo.org
data.niaid.nih.gov
+2 more
Updated Jun 1, 2022
Cite
Julien Clavel; Gildas Merceron; Gilles Escarguel (2022). Data from: Missing data estimation in morphometrics: how much is too much? [Dataset]. http://doi.org/10.5061/dryad.f0b50
Unique identifier
https://doi.org/10.5061/dryad.f0b50
Dataset updated
Jun 1, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Julien Clavel; Gildas Merceron; Gilles Escarguel
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Fossil-based estimates of diversity and evolutionary dynamics mainly rely on the study of morphological variation. Unfortunately, organism remains are often altered by post-mortem taphonomic processes such as weathering or distortion. Such a loss of information often prevents quantitative multivariate description and statistically controlled comparisons of extinct species based on morphometric data. A common way to deal with missing data involves imputation methods that directly fill the missing cases with model estimates. Over the last several years, several empirically determined thresholds for the maximum acceptable proportion of missing values have been proposed in the literature, whereas other studies showed that this limit actually depends on several properties of the study dataset and of the selected imputation method, and is in no way generalizable. We evaluate the relative performances of seven multiple imputation techniques through a simulation-based analysis under three distinct patterns of missing data distribution. Overall, Fully Conditional Specification and Expectation-Maximization algorithms provide the best compromises between imputation accuracy and coverage probability. Multiple imputation (MI) techniques appear remarkably robust to the violation of basic assumptions such as the occurrence of taxonomically or anatomically biased patterns of missing data distribution, making differences in simulation results between the three patterns of missing data distribution much smaller than differences between the individual MI techniques. Based on these results, rather than proposing a new (set of) threshold value(s), we develop an approach combining the use of multiple imputations with procrustean superimposition of principal component analysis results, in order to directly visualize the effect of individual missing data imputation on an ordinated space. We provide an R function for users to implement the proposed procedure.
Data from: Multiple imputation for harmonizing longitudinal non-commensurate...
wiley.figshare.com
pdf
Updated Jun 2, 2023
Cite
Dr. Juned Siddique; Dr. Jerome Reiter; Dr. Ahnalee Brincks; Dr. Robert D. Gibbons; Prof. Catherine M. Crespi; Prof. C. Hendricks Brown (2023). Multiple imputation for harmonizing longitudinal non-commensurate measures in individual participant data meta-analysis [Dataset]. http://doi.org/10.6084/m9.figshare.1466878.v1
Explore at:
Available download formats: pdf
Unique identifier
https://doi.org/10.6084/m9.figshare.1466878.v1
Dataset updated
Jun 2, 2023
Dataset provided by
Wiley
Authors
Dr. Juned Siddique; Dr. Jerome Reiter; Dr. Ahnalee Brincks; Dr. Robert D. Gibbons; Prof. Catherine M. Crespi; Prof. C. Hendricks Brown
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
There are many advantages to individual participant data meta-analysis for combining data from multiple studies. These advantages include greater power to detect effects, increased sample heterogeneity, and the ability to perform more sophisticated analyses than meta-analyses that rely on published results. However, a fundamental challenge is that it is unlikely that variables of interest are measured the same way in all of the studies to be combined. We propose that this situation can be viewed as a missing data problem in which some outcomes are entirely missing within some trials, and use multiple imputation to fill in missing measurements. We apply our method to 5 longitudinal adolescent depression trials where 4 studies used one depression measure and the fifth study used a different depression measure. None of the 5 studies contained both depression measures. We describe a multiple imputation approach for filling in missing depression measures that makes use of external calibration studies in which both depression measures were used. We discuss some practical issues in developing the imputation model including taking into account treatment group and study. We present diagnostics for checking the fit of the imputation model and investigating whether external information is appropriately incorporated into the imputed values.
Data from: Evaluating the Use of Uncertainty Visualisations for Imputations...
osf.io
Updated Aug 26, 2024
Cite
Abhraneel Sarma (2024). Evaluating the Use of Uncertainty Visualisations for Imputations of Data Missing At Random in Scatterplots [Dataset]. https://osf.io/q4y5r
Dataset updated
Aug 26, 2024
Dataset provided by
Center for Open Science (https://cos.io/)
Authors
Abhraneel Sarma
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains supplementary materials for the paper, Evaluating the Use of Uncertainty Visualisations for Imputations of Data Missing At Random in Scatterplots

Abstract: Most real-world datasets contain missing values, yet most exploratory data analysis (EDA) systems only support visualising data points with complete cases. This omission may potentially lead the user to biased analyses and insights. Imputation techniques can help estimate the value of a missing data point, but introduce additional uncertainty. In this work, we investigate the effects of visualising imputed values in charts using different types of uncertainty visualisation techniques: no imputation, mean, 95% confidence intervals, probability density plots, gradient intervals, and hypothetical outcome plots. We focus on scatterplots, which are a commonly used chart type, and conduct a crowdsourced study with 202 participants. We measure users’ bias and precision in performing two tasks (estimating average and detecting trend) and their self-reported confidence in performing these tasks. Our results suggest that, when estimating averages, uncertainty representations may reduce bias but at the cost of decreasing precision. When estimating trend, only hypothetical outcome plots may lead to a small probability of reducing bias while increasing precision. Participants in every uncertainty representation were less certain about their response when compared to the baseline. The findings point towards potential trade-offs in using uncertainty encodings for datasets with a large number of missing values.
New Approach to Evaluating Supplementary Homicide Report (SHR) Data...
catalog.data.gov
icpsr.umich.edu
Updated Mar 12, 2025
Cite
National Institute of Justice (2025). New Approach to Evaluating Supplementary Homicide Report (SHR) Data Imputation, 1990-1995 [Dataset]. https://catalog.data.gov/dataset/new-approach-to-evaluating-supplementary-homicide-report-shr-data-imputation-1990-1995-ff769
Dataset updated
Mar 12, 2025
Dataset provided by
National Institute of Justice
Description
The purpose of the project was to learn more about patterns of homicide in the United States by strengthening the ability to make imputations for Supplementary Homicide Report (SHR) data with missing values. Supplementary Homicide Reports (SHR) and local police data from Chicago, Illinois, St. Louis, Missouri, Philadelphia, Pennsylvania, and Phoenix, Arizona, for 1990 to 1995 were merged to create a master file by linking on overlapping information on victim and incident characteristics. Through this process, 96 percent of the cases in the SHR were matched with cases in the police files. The data contain variables for three types of cases: complete in SHR, missing offender and incident information in SHR but known in police report, and missing offender and incident information in both. The merged file allows estimation of similarities and differences between the cases with known offender characteristics in the SHR and those in the other two categories. The accuracy of existing data imputation methods can be assessed by comparing imputed values in an "incomplete" dataset (the SHR), generated by the three imputation strategies discussed in the literature, with the actual values in a known "complete" dataset (combined SHR and police data). Variables from both the Supplemental Homicide Reports and the additional police report offense data include incident date, victim characteristics, offender characteristics, incident details, geographic information, as well as variables regarding the matching procedure.
Table_2_AdImpute: An Imputation Method for Single-Cell RNA-Seq Data Based on...
frontiersin.figshare.com
docx
Updated Jun 8, 2023
+ more versions
Cite
Li Xu; Yin Xu; Tong Xue; Xinyu Zhang; Jin Li (2023). Table_2_AdImpute: An Imputation Method for Single-Cell RNA-Seq Data Based on Semi-Supervised Autoencoders.DOCX [Dataset]. http://doi.org/10.3389/fgene.2021.739677.s002
Explore at:
Available download formats: docx
Unique identifier
https://doi.org/10.3389/fgene.2021.739677.s002
Dataset updated
Jun 8, 2023
Dataset provided by
Frontiers
Authors
Li Xu; Yin Xu; Tong Xue; Xinyu Zhang; Jin Li
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Motivation: The emergence of single-cell RNA sequencing (scRNA-seq) technology has paved the way for measuring RNA levels at single-cell resolution to study precise biological functions. However, the presence of a large number of missing values in its data will affect downstream analysis. This paper presents AdImpute: an imputation method based on semi-supervised autoencoders. The method uses the output of another imputation method (DrImpute is used as an example) as imputation weights for the autoencoder, and applies a cost function with these imputation weights to learn the latent information in the data and achieve more accurate imputation.

Results: As shown in clustering experiments with simulated and real data sets, AdImpute is more accurate than four other publicly available scRNA-seq imputation methods, and minimally modifies the biologically silent genes. Overall, AdImpute is an accurate and robust imputation method.
Missing data in the analysis of multilevel and dependent data (Example data...
zenodo.org
bin
Updated Jul 20, 2023
+ more versions
Cite
Simon Grund; Oliver Lüdtke; Alexander Robitzsch (2023). Missing data in the analysis of multilevel and dependent data (Example data sets) [Dataset]. http://doi.org/10.5281/zenodo.7773614
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.7773614
Dataset updated
Jul 20, 2023
Dataset provided by
Springer
Authors
Simon Grund; Oliver Lüdtke; Alexander Robitzsch
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Example data sets for the book chapter titled "Missing Data in the Analysis of Multilevel and Dependent Data" submitted for publication in the second edition of "Dependent Data in Social Science Research" (Stemmler et al., 2015). This repository includes the data sets used in both example analyses (Examples 1 and 2) in two file formats (binary ".rda" for use in R; plain-text ".dat").

The data sets contain simulated data from 23,376 (Example 1) and 23,072 (Example 2) individuals from 2,000 groups on four variables:

ID = group identifier (1-2000)
x = numeric (Level 1)
y = numeric (Level 1)
w = binary (Level 2)

In all data sets, missing values are coded as "NA".
Table_1_Comparison of machine learning and logistic regression as predictive...
frontiersin.figshare.com
xlsx
Updated Jun 13, 2023
+ more versions
Cite
Dongying Zheng; Xinyu Hao; Muhanmmad Khan; Lixia Wang; Fan Li; Ning Xiang; Fuli Kang; Timo Hamalainen; Fengyu Cong; Kedong Song; Chong Qiao (2023). Table_1_Comparison of machine learning and logistic regression as predictive models for adverse maternal and neonatal outcomes of preeclampsia: A retrospective study.XLSX [Dataset]. http://doi.org/10.3389/fcvm.2022.959649.s003
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.3389/fcvm.2022.959649.s003
Dataset updated
Jun 13, 2023
Dataset provided by
Frontiers
Authors
Dongying Zheng; Xinyu Hao; Muhanmmad Khan; Lixia Wang; Fan Li; Ning Xiang; Fuli Kang; Timo Hamalainen; Fengyu Cong; Kedong Song; Chong Qiao
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Introduction: Preeclampsia, one of the leading causes of maternal and fetal morbidity and mortality, demands accurate predictive models given the lack of effective treatment. Predictive models based on machine learning algorithms demonstrate promising potential, while there is a controversial discussion about whether machine learning methods should be preferred over traditional statistical models.

Methods: We employed both logistic regression and six machine learning methods as binary predictive models for a dataset containing 733 women diagnosed with preeclampsia. Participants were grouped by four different pregnancy outcomes. After the imputation of missing values, statistical description and comparison were conducted preliminarily to explore the characteristics of the 73 documented variables. Subsequently, correlation analysis and feature selection were performed as preprocessing steps to filter contributing variables for developing models. The models were evaluated by multiple criteria.

Results: We first found that the influential variables screened by preprocessing steps did not overlap with those determined by statistical differences. Secondly, the most accurate imputation method was K-Nearest Neighbor, and the imputation process did not affect the performance of the developed models much. Finally, the performance of models was investigated. The random forest classifier, multi-layer perceptron, and support vector machine demonstrated better discriminative power for prediction, evaluated by the area under the receiver operating characteristic curve, while the decision tree classifier, random forest, and logistic regression yielded better calibration ability, as verified by the calibration curve.

Conclusion: Machine learning algorithms can accomplish prediction modeling and demonstrate superior discrimination, while Logistic Regression can be calibrated well. Statistical analysis and machine learning are two scientific domains sharing similar themes. The predictive abilities of such developed models vary according to the characteristics of datasets, which still need larger sample sizes and more influential predictors to accumulate evidence.
Replication Data for: Qualitative Imputation of Missing Potential Outcomes
search.dataone.org
dataverse.harvard.edu
Updated Nov 9, 2023
Cite
Coppock, Alexander; Kaur, Dipin (2023). Replication Data for: Qualitative Imputation of Missing Potential Outcomes [Dataset]. http://doi.org/10.7910/DVN/2IVKXD
Unique identifier
https://doi.org/10.7910/DVN/2IVKXD
Dataset updated
Nov 9, 2023
Dataset provided by
Harvard Dataverse
Authors
Coppock, Alexander; Kaur, Dipin
Description
We propose a framework for meta-analysis of qualitative causal inferences. We integrate qualitative counterfactual inquiry with an approach from the quantitative causal inference literature called extreme value bounds. Qualitative counterfactual analysis uses the observed outcome and auxiliary information to infer what would have happened had the treatment been set to a different level. Imputing missing potential outcomes is hard and when it fails, we can fill them in under best- and worst-case scenarios. We apply our approach to 63 cases that could have experienced transitional truth commissions upon democratization, 8 of which did. Prior to any analysis, the extreme value bounds around the average treatment effect on authoritarian resumption are 100 percentage points wide; imputation shrinks the width of these bounds to 51 points. We further demonstrate our method by aggregating specialists' beliefs about causal effects gathered through an expert survey, shrinking the width of the bounds to 44 points.
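The bounds arithmetic mentioned above can be illustrated with a small sketch. It assumes a binary treatment and a binary outcome, so each unit's missing potential outcome is either 0 or 1; the function name `ate_bounds` and the `imputed` mapping are hypothetical and are not taken from the replication materials.

```python
def ate_bounds(treated, outcome, imputed=None):
    """Return (lower, upper) extreme value bounds on the average treatment effect.

    treated[i] : 1 if unit i received the treatment, else 0
    outcome[i] : the observed binary outcome (0 or 1) for unit i
    imputed    : optional dict {unit index: imputed value of the missing
                 potential outcome}; units not listed get worst/best case
    """
    imputed = imputed or {}
    lower = upper = 0.0
    for i, (d, y) in enumerate(zip(treated, outcome)):
        if i in imputed:
            low_cf = high_cf = imputed[i]
        else:
            low_cf, high_cf = 0, 1  # the unobserved outcome could be either
        if d:
            # Observed Y(1) = y; the counterfactual Y(0) is missing.
            lower += y - high_cf
            upper += y - low_cf
        else:
            # Observed Y(0) = y; the counterfactual Y(1) is missing.
            lower += low_cf - y
            upper += high_cf - y
    n = len(treated)
    return lower / n, upper / n
```

With no imputations every unit contributes a range of one, so the bounds are always 100 percentage points wide, as in the description; each successfully imputed unit narrows the width by 1/n.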
Replication data for: What To Do about Missing Data in Time-Series...
dataverse.harvard.edu
search.dataone.org
Updated Sep 20, 2024
Cite
James Honaker; Gary King (2024). Replication data for: What To Do about Missing Data in Time-Series Cross-Sectional Data [Dataset]. http://doi.org/10.7910/DVN/GGUR0P
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Unique identifier
https://doi.org/10.7910/DVN/GGUR0P
Dataset updated
Sep 20, 2024
Dataset provided by
Harvard Dataverse
Authors
James Honaker; Gary King
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Applications of modern methods for analyzing data with missing values, based primarily on multiple imputation, have in the last half-decade become common in American politics and political behavior. Scholars in these fields have thus increasingly avoided the biases and inefficiencies caused by ad hoc methods like listwise deletion and best guess imputation. However, researchers in much of comparative politics and international relations, and others with similar data, have been unable to do the same because the best available imputation methods work poorly with the time-series cross-section data structures common in these fields. We attempt to rectify this situation. First, we build a multiple imputation model that allows smooth time trends, shifts across cross-sectional units, and correlations over time and space, resulting in far more accurate imputations. Second, we build nonignorable missingness models by enabling analysts to incorporate knowledge from area studies experts via priors on individual missing cell values, rather than on difficult-to-interpret model parameters. Third, since these tasks could not be accomplished within existing imputation algorithms, in that they cannot handle as many variables as needed even in the simpler cross-sectional data for which they were designed, we also develop a new algorithm that substantially expands the range of computationally feasible data types and sizes for which multiple imputation can be used. These developments also made it possible to implement the methods introduced here in freely available open source software, Amelia II: A Program for Missing Data, that is considerably more reliable than existing strategies. See also: Missing Data
Baseline Characteristics of Strong Heart Study Participants with Complete...
plos.figshare.com
xls
Updated Jun 1, 2023
Cite
Nawar Shara; Sayf A. Yassin; Eduardas Valaitis; Hong Wang; Barbara V. Howard; Wenyu Wang; Elisa T. Lee; Jason G. Umans (2023). Baseline Characteristics of Strong Heart Study Participants with Complete Scr Data at All Three Exams (N = 2,264). [Dataset]. http://doi.org/10.1371/journal.pone.0138923.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0138923.t001
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS ONE
Authors
Nawar Shara; Sayf A. Yassin; Eduardas Valaitis; Hong Wang; Barbara V. Howard; Wenyu Wang; Elisa T. Lee; Jason G. Umans
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Abbreviations: CVD = cardiovascular disease; Scr = serum creatinine. Baseline Characteristics of Strong Heart Study Participants with Complete Scr Data at All Three Exams (N = 2,264).
Results of the ML models using PCA imputer.
plos.figshare.com
figshare.com
xls
Updated Jan 3, 2024
Cite
Turki Aljrees (2024). Results of the ML models using PCA imputer. [Dataset]. http://doi.org/10.1371/journal.pone.0295632.t006
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0295632.t006
Dataset updated
Jan 3, 2024
Dataset provided by
PLOS ONE
Authors
Turki Aljrees
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Cervical cancer is a leading cause of women’s mortality, emphasizing the need for early diagnosis and effective treatment. In line with the imperative of early intervention, the automated identification of cervical cancer has emerged as a promising avenue, leveraging machine learning techniques to enhance both the speed and accuracy of diagnosis. However, an inherent challenge in the development of these automated systems is the presence of missing values in the datasets commonly used for cervical cancer detection. Missing data can significantly impact the performance of machine learning models, potentially leading to inaccurate or unreliable results. This study addresses a critical challenge in automated cervical cancer identification: handling missing data in datasets. The study presents a novel approach that combines three machine learning models into a stacked ensemble voting classifier, complemented by the use of a KNN Imputer to manage missing values. The proposed model achieves an accuracy of 0.9941, precision of 0.98, recall of 0.96, and an F1 score of 0.97. This study examines three distinct scenarios: one involving the deletion of missing values, another utilizing KNN imputation, and a third employing PCA for imputing missing values. This research has significant implications for the medical field, offering medical experts a powerful tool for more accurate cervical cancer therapy and enhancing the overall effectiveness of testing procedures. By addressing missing data challenges and achieving high accuracy, this work represents a valuable contribution to cervical cancer detection, ultimately aiming to reduce the impact of this disease on women’s health and healthcare systems.
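To illustrate the kind of KNN imputation the description above mentions, here is a minimal standard-library sketch. It is not the study's code: the function name `knn_impute`, the toy data, and the simplification of drawing neighbours only from fully observed rows are all assumptions made for illustration.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of the k nearest complete rows.

    Hypothetical helper: neighbours are taken only from rows without
    missing values, and distance uses the features the row has observed.
    """
    complete = [r for r in rows if None not in r]
    if not complete:
        raise ValueError("need at least one complete row")

    imputed = []
    for row in rows:
        missing = [j for j, v in enumerate(row) if v is None]
        if not missing:
            imputed.append(list(row))
            continue
        observed = [j for j, v in enumerate(row) if v is not None]

        def dist(other, row=row, observed=observed):
            # Euclidean distance restricted to the observed features.
            return math.sqrt(sum((row[j] - other[j]) ** 2 for j in observed))

        neighbours = sorted(complete, key=dist)[:k]
        filled = list(row)
        for j in missing:
            filled[j] = sum(n[j] for n in neighbours) / len(neighbours)
        imputed.append(filled)
    return imputed


# Toy data (invented): the last row is missing its second feature.
data = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 3.2],
    [9.0, 8.0, 7.0],
    [1.05, None, 3.1],
]
result = knn_impute(data, k=2)
```

In this toy case the two closest complete rows are the first two, so the gap is filled with the average of their second feature. Library implementations typically also allow partially observed rows to act as neighbours, which this sketch deliberately leaves out.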
Data from: Imputation of Missing Genotypes within LD-Blocks Relying on the...
heidata.uni-heidelberg.de
Updated Oct 15, 2018
Cite
Maria Kabisch; Ute Hamann; Justo Lorenzo Bermejo (2018). Imputation of Missing Genotypes within LD-Blocks Relying on the Basic Coalescent and Beyond [Source Code] [Dataset]. http://doi.org/10.11588/DATA/X9UEHB
Available download formats: application/x-sas-syntax (18304), application/x-sas-syntax (2641), application/x-sas-syntax (5292), application/x-sas-syntax (11567), application/x-sas-syntax (2109), pdf (244085)
Unique identifier
https://doi.org/10.11588/DATA/X9UEHB
Dataset updated
Oct 15, 2018
Dataset provided by
heiDATA
Authors
Maria Kabisch; Ute Hamann; Justo Lorenzo Bermejo
License
https://heidata.uni-heidelberg.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.11588/DATA/X9UEHB
Description
Background: Genotypes not directly measured in genetic studies are often imputed to improve statistical power and to increase mapping resolution. The accuracy of standard imputation techniques strongly depends on the similarity of linkage disequilibrium (LD) patterns in the study and reference populations. Here we develop a novel approach for genotype imputation in low-recombination regions that relies on the coalescent and makes it possible to account explicitly for population demographic factors. To test the new method, study and reference haplotypes were simulated and gene trees were inferred under the basic coalescent and also considering population growth and structure. The reference haplotypes that first coalesced with study haplotypes were used as templates for genotype imputation. Computer simulations were complemented with the analysis of real data. Genotype concordance rates were used to compare the accuracies of coalescent-based and standard (IMPUTE2) imputation.

Results: Simulations revealed that, in LD-blocks, imputation accuracy relying on the basic coalescent was higher and less variable than with IMPUTE2. Explicit consideration of population growth and structure, even if present, did not practically improve accuracy. The advantage of coalescent-based over standard imputation increased with the minor allele frequency and decreased with population stratification. Results based on real data indicated that, even in low-recombination regions, further research is needed to incorporate recombination in coalescence inference, in particular for studies with genetically diverse and admixed individuals.

Conclusions: To exploit the full potential of coalescent-based methods for the imputation of missing genotypes in genetic studies, further methodological research is needed to reduce computer time, to take into account recombination, and to implement these methods in user-friendly computer programs.

Here we provide reproducible code which takes advantage of publicly available software to facilitate further developments in the field.
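The genotype concordance rate used above as the accuracy measure can be sketched in a few lines. This is an illustrative assumption rather than the authors' SAS code: genotypes are assumed to be coded 0, 1 or 2 (copies of the minor allele), `None` marks a genotype with no known truth, and the function name `concordance_rate` is hypothetical.

```python
def concordance_rate(true_genotypes, imputed_genotypes):
    """Share of imputed genotypes that equal the known genotypes.

    Positions where the true genotype is None (unknown) are skipped.
    """
    if len(true_genotypes) != len(imputed_genotypes):
        raise ValueError("genotype lists must have the same length")
    compared = [
        (t, i) for t, i in zip(true_genotypes, imputed_genotypes) if t is not None
    ]
    if not compared:
        raise ValueError("no genotypes with known truth to compare")
    matches = sum(1 for t, i in compared if t == i)
    return matches / len(compared)


# Invented example: one of five masked genotypes was imputed incorrectly.
truth = [0, 1, 2, 1, 0]
imputed = [0, 1, 1, 1, 0]
rate = concordance_rate(truth, imputed)  # 4 matches out of 5 comparisons
```

Comparing two imputation methods then amounts to computing this rate for each method on the same masked genotypes.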
State Profiles: FY 2014 Public Libraries Survey (Data)
data.wu.ac.at
csv, json, rdf, xml
Updated Dec 7, 2016
Cite
Institute of Museum and Library Services (2016). State Profiles: FY 2014 Public Libraries Survey (Data) [Dataset]. https://data.wu.ac.at/schema/data_gov/N2RhMjA3OTctZDFiMS00NjA4LWFjMzYtMjA0ZTA5Zjg4NDIw
Available download formats: json, rdf, xml, csv
Dataset updated
Dec 7, 2016
Dataset provided by
Institute of Museum and Library Services, https://www.imls.gov/
Description
Pull up a state's profile to find state-level totals on key data such as numbers of libraries and librarians, revenue and expenditures, and collection sizes.

These data include imputed values for libraries that did not submit information in the FY 2014 data collection. Imputation is a procedure for estimating a value for a specific data item where the response is missing.

Download PLS data files to see imputation flag variables or learn more on the imputation methods used in FY 2014 at https://www.imls.gov/research-evaluation/data-collection/public-libraries-survey/explore-pls-data/pls-data
Library Systems: FY 2014 Public Libraries Survey (Administrative Entity...
data.wu.ac.at
csv, json, rdf, xml
Updated Aug 30, 2016
Cite
Institute of Museum and Library Services (2016). Library Systems: FY 2014 Public Libraries Survey (Administrative Entity Data) [Dataset]. https://data.wu.ac.at/schema/data_gov/MzRlMGNmYjEtNjBjOC00OTQ0LThhMGItNTJiNzFkMTYyZDhm
Available download formats: rdf, json, csv, xml
Dataset updated
Aug 30, 2016
Dataset provided by
Institute of Museum and Library Services, https://www.imls.gov/
Description
Find key information on library systems around the United States.

These data include imputed values for libraries that did not submit information in the FY 2014 data collection. Imputation is a procedure for estimating a value for a specific data item where the response is missing.

Download PLS data files to see imputation flag variables or learn more on the imputation methods used in FY 2014 at https://www.imls.gov/research-evaluation/data-collection/public-libraries-survey/explore-pls-data/pls-data
Overview of the performance of the undiagnosed diabetes risk prediction...
esango.cput.ac.za
figshare.com
xls
Updated Jun 4, 2023
Cite
Katya L. Masconi; Tandi Edith Matsha-Erasmus; Rajiv T. Erasmus; Andre P. Kengne (2023). Overview of the performance of the undiagnosed diabetes risk prediction models across the five multiple imputation datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0139210.t005
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0139210.t005
Dataset updated
Jun 4, 2023
Dataset provided by
PLOS ONE
Authors
Katya L. Masconi; Tandi Edith Matsha-Erasmus; Rajiv T. Erasmus; Andre P. Kengne
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Overview of the performance of the undiagnosed diabetes risk prediction models across the five multiple imputation datasets.

Nikolaj Bak; Lars K. Hansen (2023). Data Driven Estimation of Imputation Error—A Strategy for Imputation with a Reject Option [Dataset]. http://doi.org/10.1371/journal.pone.0164464

Data Driven Estimation of Imputation Error—A Strategy for Imputation with a Reject Option

5 scholarly articles cite this dataset.

Available download formats: pdf

Unique identifier

https://doi.org/10.1371/journal.pone.0164464

Dataset updated

Jun 1, 2023

Dataset provided by

PLOS ONE

Authors

Nikolaj Bak; Lars K. Hansen

License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Missing data is a common problem in many research fields and is a challenge that always needs careful considerations. One approach is to impute the missing values, i.e., replace missing values with estimates. When imputation is applied, it is typically applied to all records with missing values indiscriminately. We note that the effects of imputation can be strongly dependent on what is missing. To help make decisions about which records should be imputed, we propose to use a machine learning approach to estimate the imputation error for each case with missing data. The method is thought to be a practical approach to help users using imputation after the informed choice to impute the missing data has been made. To do this all patterns of missing values are simulated in all complete cases, enabling calculation of the “true error” in each of these new cases. The error is then estimated for each case with missing values by weighing the “true errors” by similarity. The method can also be used to test the performance of different imputation methods. A universal numerical threshold of acceptable error cannot be set since this will differ according to the data, research question, and analysis method. The effect of threshold can be estimated using the complete cases. The user can set an a priori relevant threshold for what is acceptable or use cross validation with the final analysis to choose the threshold. The choice can be presented along with argumentation for the choice rather than holding to conventions that might not be warranted in the specific dataset.
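The procedure described above (simulate missingness in complete cases, measure the "true error", then weigh those errors by similarity) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: mean imputation stands in for whatever imputer is being evaluated, similarity is a Gaussian kernel on the observed features, only the missingness pattern of each incomplete case is simulated (the description simulates all patterns), and all function names, the `bandwidth` parameter, and the toy data are hypothetical.

```python
import math


def mean_impute(row, col_means):
    """Stand-in imputer (assumption): replace each None with the column mean."""
    return [col_means[j] if v is None else v for j, v in enumerate(row)]


def estimate_imputation_errors(rows, bandwidth=1.0):
    """Estimate, per incomplete row, the error that imputation would incur."""
    complete = [r for r in rows if None not in r]
    if not complete:
        raise ValueError("need at least one complete case")
    n_cols = len(rows[0])
    col_means = [sum(r[j] for r in complete) / len(complete) for j in range(n_cols)]

    estimates = {}
    for idx, row in enumerate(rows):
        missing = [j for j in range(n_cols) if row[j] is None]
        if not missing:
            continue
        observed = [j for j in range(n_cols) if row[j] is not None]
        weights, true_errors = [], []
        for case in complete:
            # Simulate this row's missingness pattern in a complete case.
            masked = [None if j in missing else v for j, v in enumerate(case)]
            filled = mean_impute(masked, col_means)
            # "True error": RMSE between imputed and known values.
            rmse = math.sqrt(
                sum((filled[j] - case[j]) ** 2 for j in missing) / len(missing)
            )
            # Similarity of the incomplete row to this complete case.
            d2 = sum((row[j] - case[j]) ** 2 for j in observed)
            weights.append(math.exp(-d2 / (2 * bandwidth ** 2)))
            true_errors.append(rmse)
        total = sum(weights)
        if total == 0.0:
            # All weights underflowed: fall back to a plain average.
            weights = [1.0] * len(complete)
            total = float(len(complete))
        estimates[idx] = sum(w * e for w, e in zip(weights, true_errors)) / total
    return estimates


def accept_for_imputation(estimates, threshold):
    """Reject option: impute only rows whose estimated error is acceptable."""
    return {idx: err <= threshold for idx, err in estimates.items()}


# Toy data (invented): three complete cases, two cases missing feature 1.
rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [2.0, None], [1.0, None]]
estimates = estimate_imputation_errors(rows, bandwidth=1.0)
decisions = accept_for_imputation(estimates, threshold=6.0)
```

With this toy data, the row closest to the column mean gets the lower estimated error and is accepted at a threshold of 6.0, while the other is rejected. As the description notes, the threshold itself is a choice that depends on the data and the analysis; the value used here is arbitrary.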

Data Driven Estimation of Imputation Error—A Strategy for Imputation with a...

Conjunto de datos de: Water-quality data imputation with a high percentage...

Replication Data for: The MIDAS Touch: Accurate and Scalable Missing-Data...

Data from: Multiple Imputation for the Supplementary Homicide Reports:...

Data_Sheet_1_ImputEHR: A Visualization Tool of Imputation for the Prediction...

Data from: Missing data estimation in morphometrics: how much is too much?

Data from: Multiple imputation for harmonizing longitudinal non-commensurate...

Data from: Evaluating the Use of Uncertainty Visualisations for Imputations...

New Approach to Evaluating Supplementary Homicide Report (SHR) Data...

Table_2_AdImpute: An Imputation Method for Single-Cell RNA-Seq Data Based on...

Missing data in the analysis of multilevel and dependent data (Example data...

Table_1_Comparison of machine learning and logistic regression as predictive...

Replication Data for: Qualitative Imputation of Missing Potential Outcomes

Replication data for: What To Do about Missing Data in Time-Series...

Baseline Characteristics of Strong Heart Study Participants with Complete...

Results of the ML models using PCA imputer.

Data from: Imputation of Missing Genotypes within LD-Blocks Relying on the...

State Profiles: FY 2014 Public Libraries Survey (Data)

Library Systems: FY 2014 Public Libraries Survey (Administrative Entity...

Overview of the performance of the undiagnosed diabetes risk prediction...

Data Driven Estimation of Imputation Error—A Strategy for Imputation with a Reject Option