Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Missing data is a common problem in many research fields and always requires careful consideration. One approach is to impute the missing values, i.e., to replace them with estimates. When imputation is applied, it is typically applied indiscriminately to all records with missing values, yet the effects of imputation can depend strongly on what is missing. To help decide which records should be imputed, we propose a machine learning approach that estimates the imputation error for each case with missing data. The method is intended as a practical aid for users once the informed choice to impute has been made. All patterns of missing values are simulated in all complete cases, enabling calculation of the “true error” in each of these new cases. The error for each case with missing values is then estimated by weighting the “true errors” by similarity. The method can also be used to compare the performance of different imputation methods. No universal numerical threshold of acceptable error can be set, since this differs with the data, research question, and analysis method; however, the effect of a threshold can be estimated using the complete cases. The user can set an a priori threshold for what is acceptable, or choose one by cross-validation with the final analysis, and present that choice with supporting argumentation rather than holding to conventions that may not be warranted for the specific dataset.
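A minimal sketch of the idea in Python, assuming a numeric data frame, mean imputation as the plug-in method, and inverse-distance similarity weights (the function names and the weighting scheme are illustrative, not the authors' implementation):

```python
import numpy as np
import pandas as pd

def estimate_imputation_errors(df, impute):
    """For every incomplete case, estimate its imputation error from the
    'true errors' observed when its missingness pattern is simulated in
    the complete cases (hypothetical helper, sketch only)."""
    complete = df.dropna()
    estimates = {}
    for idx, row in df[df.isna().any(axis=1)].iterrows():
        miss = df.columns[row.isna()]
        obs = df.columns[~row.isna()]
        # simulate this case's missingness pattern in every complete case
        masked = complete.copy()
        masked[miss] = np.nan
        imputed = impute(masked)
        true_err = ((imputed[miss] - complete[miss]) ** 2).mean(axis=1)
        # weight the true errors by similarity on the observed variables
        dist = np.linalg.norm(complete[obs].to_numpy()
                              - row[obs].to_numpy(dtype=float), axis=1)
        estimates[idx] = float(np.average(true_err, weights=1.0 / (1.0 + dist)))
    return pd.Series(estimates)

def make_mean_imputer(df):
    # plug-in imputer for the sketch: column means from the full data;
    # any method under evaluation could be swapped in here
    means = df.mean()
    return lambda d: d.fillna(means)
```

Cases whose estimated error exceeds the chosen threshold would then be left unimputed or handled separately.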
https://spdx.org/licenses/CC0-1.0.html
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Missing values in proteomic data sets have real consequences for downstream data analysis and reproducibility. Although several imputation methods exist to handle missing values, no single imputation method is best suited for a diverse range of data sets, and no clear strategy exists for evaluating imputation methods for clinical DIA-MS data sets, especially at different levels of protein quantification. To navigate the different imputation strategies available in the literature, we established a strategy to assess imputation methods on clinical label-free DIA-MS data sets. We used three DIA-MS data sets with real missing values to evaluate eight imputation methods with multiple parameters at different levels of protein quantification: a dilution-series data set, a small pilot data set, and a clinical proteomic data set comparing paired tumor and stroma tissue. We found that imputation methods based on local structures within the data, like local least-squares (LLS) and random forest (RF), worked well in our dilution-series data set, whereas imputation methods based on global structures within the data, like BPCA, performed well in the other two data sets. We also found that imputation at the most basic protein quantification level, the fragment level, improved accuracy and the number of proteins quantified. With this analytical framework, we quickly and cost-effectively evaluated different imputation methods using two smaller complementary data sets to narrow down the most accurate methods for the larger proteomic data set. This acquisition strategy allowed us to provide reproducible evidence of the accuracy of the imputation method, even in the absence of a ground truth. Overall, this study indicates that the most suitable imputation method depends on the overall structure of the data set, and it provides an example of an analytic framework that may assist in identifying the most appropriate imputation strategies for the differential analysis of proteins.
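The evaluation protocol can be imitated on any intensity matrix by hiding a fraction of the observed values and scoring each imputer on the held-out entries. A sketch with scikit-learn stand-ins (LLS, BPCA, and RF imputers are not in scikit-learn, so mean, k-NN, and iterative imputation appear here purely for illustration):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

rng = np.random.default_rng(0)

def nrmse(truth, imputed, mask):
    err = truth[mask] - imputed[mask]
    return float(np.sqrt(np.mean(err ** 2)) / np.std(truth[mask]))

def benchmark(X, frac=0.1):
    # hide a fraction of the observed intensities, then score each method
    # on the hidden entries; assumes every column keeps some observed values
    observed = ~np.isnan(X)
    mask = observed & (rng.random(X.shape) < frac)
    X_holdout = X.copy()
    X_holdout[mask] = np.nan
    methods = {
        "mean": SimpleImputer(strategy="mean"),
        "knn": KNNImputer(n_neighbors=5),
        "iterative": IterativeImputer(random_state=0),
    }
    return {name: nrmse(X, m.fit_transform(X_holdout), mask)
            for name, m in methods.items()}
```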
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Missing data is a growing concern in social science research. This paper introduces novel machine-learning methods to explore imputation efficiency and its effect on missing data, using Internet and public service data as test examples. The empirical results show that the method not only verified the robustness of the positive impact of Internet penetration on public services, but also showed that machine-learning imputation outperformed random and multiple imputation, greatly improving the model’s explanatory power. Because panel data imputed by machine learning have better continuity in the time trend, they can feasibly be analyzed with a dynamic panel model. The long-term effects of the Internet on public services were found to be significantly stronger than the short-term effects. Finally, some mechanisms behind the empirical findings are discussed.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Global health indicators such as infant and maternal mortality are important for informing priorities for health research, policy development, and resource allocation. However, due to inconsistent reporting within and across nations, construction of comparable indicators often requires extensive data imputation and complex modeling from limited observed data. We draw on Ahmed et al.’s 2012 paper – an analysis of maternal deaths averted by contraceptive use for 172 countries in 2008 – as an exemplary case of the challenge of building reliable models with scarce observations. The authors employ a counterfactual modeling approach using regression imputation on the independent variable, which assumes no estimation uncertainty in the final model and does not address the potential for scattered missingness in the predictor variables. We replicate their results and test the sensitivity of their published estimates to the use of an alternative method for imputing missing data: multiple imputation. We also calculate alternative estimates of standard errors for the model estimates that more appropriately account for the uncertainty introduced through data imputation of multiple predictor variables. Based on our results, we discuss the risks associated with the missing data practices employed and evaluate the appropriateness of multiple imputation as an alternative for data imputation and uncertainty estimation for models of global health indicators.
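The pooling step that separates multiple imputation from single regression imputation is compact enough to show directly. A sketch of Rubin's rules with illustrative numbers (not the paper's estimates):

```python
import numpy as np

def rubin_pool(estimates, variances):
    """Pool point estimates and their variances from M imputed datasets:
    total variance = within-imputation + (1 + 1/M) * between-imputation."""
    estimates, variances = np.asarray(estimates), np.asarray(variances)
    m = len(estimates)
    qbar = estimates.mean()            # pooled point estimate
    within = variances.mean()          # average sampling variance
    between = estimates.var(ddof=1)    # variance across imputations
    return qbar, float(np.sqrt(within + (1 + 1 / m) * between))

# coefficients and squared standard errors from refitting the same model
# on M = 5 imputed datasets (toy numbers)
est, se = rubin_pool([0.42, 0.39, 0.45, 0.40, 0.44],
                     [0.020, 0.021, 0.019, 0.020, 0.022])
```

The between-imputation term is exactly the uncertainty that single regression imputation assumes away.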
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Penalized regression methods are used in many biomedical applications for variable selection and simultaneous coefficient estimation. However, missing data complicates the implementation of these methods, particularly when missingness is handled using multiple imputation. Applying a variable selection algorithm on each imputed dataset will likely lead to different sets of selected predictors. This article considers a general class of penalized objective functions which, by construction, force selection of the same variables across imputed datasets. By pooling objective functions across imputations, optimization is then performed jointly over all imputed datasets rather than separately for each dataset. We consider two objective function formulations that exist in the literature, which we will refer to as “stacked” and “grouped” objective functions. Building on existing work, we (i) derive and implement efficient cyclic coordinate descent and majorization-minimization optimization algorithms for continuous and binary outcome data, (ii) incorporate adaptive shrinkage penalties, (iii) compare these methods through simulation, and (iv) develop an R package miselect. Simulations demonstrate that the “stacked” approaches are more computationally efficient and have better estimation and selection properties. We apply these methods to data from the University of Michigan ALS Patients Biorepository aiming to identify the association between environmental pollutants and ALS risk. Supplementary materials for this article are available online.
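A rough Python analogue of the “stacked” idea (the authors' implementation is the R package miselect; here a single lasso is fit to the vertically stacked imputations with observation weights 1/M, so one common set of variables is selected):

```python
import numpy as np
from sklearn.linear_model import Lasso

def stacked_lasso(imputed_datasets, y, alpha=0.1):
    """Fit one penalized model on all M imputed datasets stacked row-wise,
    weighting each row by 1/M so every subject carries unit total weight."""
    M = len(imputed_datasets)
    X = np.vstack(imputed_datasets)
    yy = np.tile(np.asarray(y), M)
    weights = np.full(X.shape[0], 1.0 / M)
    model = Lasso(alpha=alpha).fit(X, yy, sample_weight=weights)
    return model.coef_  # zero coefficients are dropped in every imputation
```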
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Fossil-based estimates of diversity and evolutionary dynamics mainly rely on the study of morphological variation. Unfortunately, organism remains are often altered by post-mortem taphonomic processes such as weathering or distortion. Such a loss of information often prevents quantitative multivariate description and statistically controlled comparisons of extinct species based on morphometric data. A common way to deal with missing data involves imputation methods that directly fill the missing cases with model estimates. Over the last several years, various empirically determined thresholds for the maximum acceptable proportion of missing values have been proposed in the literature, whereas other studies have shown that this limit actually depends on several properties of the study dataset and of the selected imputation method, and is in no way generalizable. We evaluate the relative performances of seven multiple imputation techniques through a simulation-based analysis under three distinct patterns of missing data distribution. Overall, Fully Conditional Specification and Expectation-Maximization algorithms provide the best compromises between imputation accuracy and coverage probability. Multiple imputation (MI) techniques appear remarkably robust to the violation of basic assumptions, such as the occurrence of taxonomically or anatomically biased patterns of missing data distribution, with differences in simulation results between the three patterns of missing data distribution much smaller than differences between the individual MI techniques. Based on these results, rather than proposing a new (set of) threshold value(s), we develop an approach combining multiple imputation with Procrustean superimposition of principal component analysis results, in order to directly visualize the effect of individual missing data imputation in an ordinated space. We provide an R function for users to implement the proposed procedure.
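The visualization step can be sketched in a few lines: run PCA on each imputed dataset and Procrustes-superimpose the score matrices onto a common reference. The authors provide an R function; this Python version only illustrates the idea:

```python
import numpy as np
from scipy.spatial import procrustes
from sklearn.decomposition import PCA

def aligned_pca_scores(imputed_datasets, n_components=2):
    """PCA scores for each multiply-imputed dataset, Procrustes-aligned
    to the first dataset's scores so all imputations share one
    ordinated space."""
    reference = PCA(n_components).fit_transform(imputed_datasets[0])
    aligned = []
    for X in imputed_datasets:
        scores = PCA(n_components).fit_transform(X)
        _, fitted, _ = procrustes(reference, scores)  # translate/scale/rotate
        aligned.append(fitted)
    # each specimen now has one point per imputation; the spread of that
    # cloud visualizes how much its missing data affect the ordination
    return np.stack(aligned)
```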
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Electronic health records (EHRs) have been widely adopted in recent years, but often include a high proportion of missing data, which can create difficulties in implementing machine learning and other tools of personalized medicine. Completed datasets are preferred for a number of analysis methods, and successful imputation of missing EHR data can improve interpretation and increase our power to predict health outcomes. However, use of the most popular imputation methods generally requires scripting skills, with implementations spread across various packages and syntaxes, so a full suite of methods is out of reach to all except experienced data scientists. Moreover, imputation is often treated as an exercise separate from exploratory data analysis, when it should be considered part of the data exploration process. We have created a new Python-based graphical tool, ImputEHR, that allows implementation of a range of simple and sophisticated (e.g., gradient-boosted tree-based and neural network) data imputation approaches. In addition to imputation, the tool enables data exploration for informed decision-making, as well as implementing machine learning prediction tools for response data selected by the user. Although the approach works for any missing data problem, the tool is primarily motivated by problems encountered for EHR and other biomedical data. We illustrate the tool using multiple real datasets, providing performance measures of imputation and downstream predictive analysis.
Although social scientists devote considerable effort to mitigating measurement error during data collection, they often ignore the issue during data analysis. And although many statistical methods have been proposed for reducing measurement error-induced biases, few have been widely used because of implausible assumptions, high levels of model dependence, difficult computation, or inapplicability with multiple mismeasured variables. We develop an easy-to-use alternative without these problems; it generalizes the popular multiple imputation (MI) framework by treating missing data problems as a limiting special case of extreme measurement error, and corrects for both. Like MI, the proposed framework is a simple two-step procedure, so that in the second step researchers can use whatever statistical method they would have if there had been no problem in the first place. We also offer empirical illustrations, open source software that implements all the methods described herein, and a companion paper with technical details and extensions (Blackwell, Honaker, and King, 2014b). Notes: This is the first of two articles to appear in the same issue of the same journal by the same authors. The second is “A Unified Approach to Measurement Error and Missing Data: Details and Extensions.” See also: Missing Data
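In miniature, the generalization replaces “impute the missing cell” with “redraw the mismeasured cell”. A sketch of the spirit of this overimputation step (not the authors' actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def overimpute(x, error_sd, m=20):
    """Draw m completed versions of a mismeasured variable by redrawing
    each value around its observation; a fully missing value is the
    limiting case of infinite error, where the draw comes from a
    predictive model instead."""
    x = np.asarray(x, dtype=float)
    return [x + rng.normal(0.0, error_sd, size=x.shape) for _ in range(m)]

# step two is unchanged: run the intended analysis on each of the m
# completed datasets and combine the results exactly as in standard MI
```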
U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
The purpose of the project was to learn more about patterns of homicide in the United States by strengthening the ability to make imputations for Supplementary Homicide Report (SHR) data with missing values. Supplementary Homicide Reports (SHR) and local police data from Chicago, Illinois, St. Louis, Missouri, Philadelphia, Pennsylvania, and Phoenix, Arizona, for 1990 to 1995 were merged to create a master file by linking on overlapping information on victim and incident characteristics. Through this process, 96 percent of the cases in the SHR were matched with cases in the police files. The data contain variables for three types of cases: complete in SHR, missing offender and incident information in SHR but known in police report, and missing offender and incident information in both. The merged file allows estimation of similarities and differences between the cases with known offender characteristics in the SHR and those in the other two categories. The accuracy of existing data imputation methods can be assessed by comparing imputed values in an "incomplete" dataset (the SHR), generated by the three imputation strategies discussed in the literature, with the actual values in a known "complete" dataset (combined SHR and police data). Variables from both the Supplemental Homicide Reports and the additional police report offense data include incident date, victim characteristics, offender characteristics, incident details, geographic information, as well as variables regarding the matching procedure.
We propose a framework for meta-analysis of qualitative causal inferences. We integrate qualitative counterfactual inquiry with an approach from the quantitative causal inference literature called extreme value bounds. Qualitative counterfactual analysis uses the observed outcome and auxiliary information to infer what would have happened had the treatment been set to a different level. Imputing missing potential outcomes is hard, and when it fails, we can fill them in under best- and worst-case scenarios. We apply our approach to 63 cases that could have experienced transitional truth commissions upon democratization, 8 of which did. Prior to any analysis, the extreme value bounds around the average treatment effect on authoritarian resumption are 100 percentage points wide; imputation shrinks the width of these bounds to 51 points. We further demonstrate our method by aggregating specialists' beliefs about causal effects gathered through an expert survey, shrinking the width of the bounds to 44 points.
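The bounds themselves take one line per potential outcome. A sketch with a binary outcome (authoritarian resumption coded 0/1), where the 100-point prior width falls out of the algebra; the toy outcome arrays below are invented:

```python
import numpy as np

def extreme_value_bounds(y_treated, y_control, y_min=0.0, y_max=1.0):
    """Manski-style bounds on the ATE: every unobserved potential outcome
    is filled in with the best and worst possible values."""
    n1, n0 = len(y_treated), len(y_control)
    p1 = n1 / (n1 + n0)
    # E[Y(1)]: observed for treated cases, bounded for controls
    ey1_lo = p1 * np.mean(y_treated) + (1 - p1) * y_min
    ey1_hi = p1 * np.mean(y_treated) + (1 - p1) * y_max
    # E[Y(0)]: observed for controls, bounded for treated cases
    ey0_lo = (1 - p1) * np.mean(y_control) + p1 * y_min
    ey0_hi = (1 - p1) * np.mean(y_control) + p1 * y_max
    # width is always y_max - y_min, i.e. 100 percentage points here
    return ey1_lo - ey0_hi, ey1_hi - ey0_lo

resumed_t = np.array([0, 1, 0, 0, 0, 0, 0, 1])           # 8 commission cases (toy)
resumed_c = np.random.default_rng(0).integers(0, 2, 55)  # 55 comparison cases (toy)
lo, hi = extreme_value_bounds(resumed_t, resumed_c)
```

Qualitative imputation then narrows the bounds by replacing y_min/y_max with case-specific values wherever the counterfactual can be inferred.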
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Missing data is a prevalent problem that requires attention, as most data analysis techniques are unable to handle it. This is particularly critical in Multi-Label Classification (MLC), where only a few studies have investigated missing data in this application domain. MLC differs from Single-Label Classification (SLC) by allowing an instance to be associated with multiple classes; movie classification is a didactic example, since a film can be “drama” and “biography” simultaneously. One of the most common missing data treatments is data imputation, which seeks plausible values to fill in the missing ones. In this scenario, we propose a novel imputation method based on a multi-objective genetic algorithm for optimizing multiple data imputations, called Multiple Imputation of Multi-label Classification data with a genetic algorithm, or simply EvoImp. We applied the proposed method in multi-label learning and evaluated its performance using six synthetic databases, considering various missing-value distribution scenarios. The method was compared with other state-of-the-art imputation strategies, such as K-Means Imputation (KMI) and weighted K-Nearest Neighbors Imputation (WKNNI). The results showed that the proposed method outperformed the baselines in all scenarios, achieving the best evaluation measures for Exact Match, Accuracy, and Hamming Loss. The superior results were consistent across dataset domains and sizes, demonstrating EvoImp's robustness. Thus, EvoImp represents a feasible solution to missing data treatment for multi-label learning.
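The evaluation measures are standard multi-label metrics; for reference, in scikit-learn (the label matrices below are toy stand-ins for a classifier's output on imputed data):

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

Y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])  # e.g. drama/biography/comedy
Y_pred = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0]])

exact_match = accuracy_score(Y_true, Y_pred)  # every label of an instance must match
h_loss = hamming_loss(Y_true, Y_pred)         # fraction of individual label errors
```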
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Missing observations in trait datasets pose an obstacle for analyses in myriad biological disciplines. Considering the mixed results of imputation, the wide variety of available methods, and the varied structure of real trait datasets, a framework for selecting a suitable imputation method is advantageous. We invoked a real data-driven simulation strategy to select an imputation method for a given mixed-type (categorical, count, continuous) target dataset. Candidate methods included mean/mode imputation, k-nearest neighbour, random forests, and multivariate imputation by chained equations (MICE). Using a trait dataset of squamates (lizards and amphisbaenians; order: Squamata) as a target dataset, a complete-case dataset consisting of species with nearly completed information was formed for the imputation method selection. Missing data were induced by removing values from this dataset under different missingness mechanisms: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). For each method, combinations with and without phylogenetic information from single gene (nuclear and mitochondrial) or multigene trees were used to impute the missing values for five numerical and two categorical traits. The performances of the methods were evaluated under each missing mechanism by determining the mean squared error and proportion falsely classified rates for numerical and categorical traits, respectively. A random forest method supplemented with a nuclear-derived phylogeny resulted in the lowest error rates for the majority of traits, and this method was used to impute missing values in the original dataset. Data with imputed values better reflected the characteristics and distributions of the original data compared to complete-case data. However, caution should be taken when imputing trait data as phylogeny did not always improve performance for every trait and in every scenario. Ultimately, these results support the use of a real data-driven simulation strategy for selecting a suitable imputation method for a given mixed-type trait dataset.
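The core of such a data-driven simulation is easy to state in code: knock values out of the complete-case dataset, impute, and score numeric traits by MSE and categorical traits by the proportion falsely classified. A sketch for the MCAR case (MAR and MNAR would condition the deletion probability on other observed values or on the deleted value itself):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def induce_mcar(df, frac=0.2):
    """Delete a fraction of entries completely at random."""
    mask = rng.random(df.shape) < frac
    holed = df.mask(pd.DataFrame(mask, index=df.index, columns=df.columns))
    return holed, mask

def score(truth, imputed, mask):
    """MSE for numeric traits, proportion falsely classified (PFC) for
    categorical traits, computed only on the deleted entries."""
    out = {}
    for j, col in enumerate(truth.columns):
        t, p = truth[col][mask[:, j]], imputed[col][mask[:, j]]
        if pd.api.types.is_numeric_dtype(truth[col]):
            out[col] = float(((t - p) ** 2).mean())  # MSE
        else:
            out[col] = float((t != p).mean())        # PFC
    return out
```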
Given the methodological sophistication of the debate over the “political resource curse”—the purported negative relationship between natural resource wealth (in particular oil wealth) and democracy—it is surprising that scholars have not paid more attention to the basic statistical issue of how to deal with missing data. This article highlights the problems caused by the most common strategy for analyzing missing data in the political resource curse literature—listwise deletion—and investigates how addressing such problems through the best-practice technique of multiple imputation affects empirical results. I find that multiple imputation causes the results of a number of influential recent studies to converge on a key common finding: A political resource curse does exist, but only since the widespread nationalization of petroleum industries in the 1970s. This striking finding suggests that much of the controversy over the political resource curse has been caused by a neglect of missing-data issues.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Biologists are increasingly using curated, public data sets to conduct phylogenetic comparative analyses. Unfortunately, there is often a mismatch between species for which there is phylogenetic data and those for which other data are available. As a result, researchers are commonly forced to either drop species from analyses entirely or else impute the missing data. A simple strategy to improve the overlap of phylogenetic and comparative data is to swap species in the tree that lack data with ‘phylogenetically equivalent’ species that have data. While this procedure is logically straightforward, it quickly becomes very challenging to do by hand. Here, we present algorithms that use topological and taxonomic information to maximize the number of swaps without altering the structure of the phylogeny. We have implemented our method in a new R package phyndr, which will allow researchers to apply our algorithm to empirical data sets. It is relatively efficient such that taxon swaps can be quickly computed, even for large trees. To facilitate the use of taxonomic knowledge, we created a separate data package taxonlookup; it contains a curated, versioned taxonomic lookup for land plants and is interoperable with phyndr. Emerging online data bases and statistical advances are making it possible for researchers to investigate evolutionary questions at unprecedented scales. However, in this effort species mismatch among data sources will increasingly be a problem; evolutionary informatics tools, such as phyndr and taxonlookup, can help alleviate this issue.
Usage Notes: Land plant taxonomic lookup table. This dataset is a stable version (version 1.0.1) of the dataset contained in the taxonlookup R package (see https://github.com/traitecoevo/taxonlookup for the most recent version). It contains a taxonomic reference table for 16,913 genera of land plants along with the number of recognized species in each genus. File: plant_lookup.csv
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The regression discontinuity design (RDD) is one of the most credible methods for causal inference that is often regarded as a missing data problem in the potential outcomes framework. However, the methods for missing data such as multiple imputation are rarely used as a method for causal inference. This article proposes multiple imputation regression discontinuity designs (MIRDDs), an alternative way of estimating the local average treatment effect at the cutoff point by multiply-imputing potential outcomes. To assess the performance of the proposed method, Monte Carlo simulations are conducted under 112 different settings, each repeated 5,000 times. The simulation results show that MIRDDs perform well in terms of bias, root mean squared error, coverage, and interval length compared to the standard RDD method. Also, additional simulations exhibit promising results compared to the state-of-the-art RDD methods. Finally, this article proposes to use MIRDDs as a graphical diagnostic tool for RDDs. We illustrate the proposed method with data on the incumbency advantage in U.S. House elections. To implement the proposed method, an easy-to-use software program is also provided.
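A deliberately simplified sketch of the idea: fit a linear model on each side of the cutoff, multiply-impute each unit's missing potential outcome with residual noise, and average the imputed unit-level effects near the cutoff (the paper's MIRDD procedure and its pooling are more careful than this):

```python
import numpy as np

rng = np.random.default_rng(0)

def mirdd(x, y, cutoff=0.0, m=50, bandwidth=1.0):
    near = np.abs(x - cutoff) <= bandwidth
    x, y = x[near], y[near]
    treated = x >= cutoff
    effects = []
    for _ in range(m):
        imputed = {}
        for side, fit_on in ((True, ~treated), (False, treated)):
            # fit on the observed side, predict the other side's
            # missing potential outcome, adding residual noise
            b, a = np.polyfit(x[fit_on], y[fit_on], 1)
            sd = np.std(y[fit_on] - (a + b * x[fit_on]))
            target = treated == side
            imputed[side] = a + b * x[target] + rng.normal(0, sd, target.sum())
        # treated units: observed Y(1) minus imputed Y(0);
        # control units: imputed Y(1) minus observed Y(0)
        tau = np.concatenate([y[treated] - imputed[True],
                              imputed[False] - y[~treated]]).mean()
        effects.append(tau)
    return float(np.mean(effects))
```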
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
OBJECTIVES: Where trial recruitment is staggered over time, patients may have different lengths of follow-up, meaning that the dataset is an unbalanced panel with a considerable amount of missing data. This study presents a method for estimating the difference in total costs and total Quality-Adjusted Life Years (QALYs) over a given time horizon using a repeated-measures mixed model (RMMM). To the authors' knowledge, this is the first time this method has been exploited in the context of economic evaluation within clinical trials. METHODS: An example (EVLA trial, NIHR HTA project 11/129/197) is used where patients have between 1 and 5 years of follow-up. Early treatment is compared with delayed treatment. Coefficients at each time point from the repeated-measures mixed model were aggregated to estimate total mean cost and total mean QALYs over 3 years. Results were compared with other methods for handling missing data: complete-case analysis (CCA), multiple imputation using linear regression (MILR), multiple imputation using predictive mean matching (MIPMM), and a Bayesian parametric approach (BPA). RESULTS: Mean differences in costs varied among the approaches. CCA, MIPMM, and MILR showed greater mean costs in delayed treatment: £216 (95% CI -£1413 to £1845), £36 (95% CI -£581 to £652), and £30 (95% CI -£617 to £679), respectively. RMMM and BPA showed greater costs in early intervention: -£67 (95% CI -£1069 to £855) and -£162 (95% CI -£728 to £402), respectively. Early intervention was associated with greater QALYs across all methods at year 3, with RMMM showing the highest gain: 0.073 (95% CI -0.06 to 0.2). CONCLUSION: MIPMM gave the most efficient results in our cost-effectiveness analysis. By contrast, when the percentage of missing data is high, the RMMM gives results similar to MIPMM. Hence, we conclude that the RMMM is a flexible and robust alternative for modelling continuous outcome data that can be considered missing at random.
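A minimal sketch of the RMMM step with statsmodels on synthetic data (the column names and effect sizes are invented; the point is that the per-year treatment contrasts are summed to obtain the total difference over the horizon):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_patients = 40
data = pd.DataFrame({
    "patient_id": np.repeat(np.arange(n_patients), 3),
    "year": np.tile([1, 2, 3], n_patients),
    "arm": np.repeat(rng.integers(0, 2, n_patients), 3),
})
data["cost"] = (1000 + 200 * data["arm"] + 50 * data["year"]
                + rng.normal(0, 100, len(data)))

# random intercept per patient; an unbalanced panel (dropout) is handled
# by simply omitting the missing patient-years from 'data'
fit = smf.mixedlm("cost ~ C(year) * arm", data, groups=data["patient_id"]).fit()
params = fit.params
# arm effect in year t = main effect + that year's interaction (year 1 = reference);
# total 3-year mean cost difference = sum of the three yearly effects
yearly = [params["arm"] + params.get(f"C(year)[T.{t}]:arm", 0.0) for t in (1, 2, 3)]
total_difference = sum(yearly)
```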
This is a Jupyter notebook demonstrating how to approach working with missing wage records. The notebook explores different imputation methods on the variable of interest (missing wages): listwise deletion, zero imputation, mean imputation, and regression imputation. This notebook was developed for the Spring 2021 Applied Data Analytics training facilitated by the John J. Heldrich Center for Workforce Development and the Coleridge Initiative.
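For reference, the four approaches in miniature on a toy wage frame (the column names are invented; the notebook itself works with real wage records):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"age":    [25, 34, 41, 29, 52, 38],
                   "tenure": [1, 5, 9, 2, 20, 7],
                   "wage":   [31000, None, 58000, 35000, None, 46000]})

listwise = df.dropna(subset=["wage"])                            # listwise deletion
zero_imp = df.assign(wage=df["wage"].fillna(0))                  # zero imputation
mean_imp = df.assign(wage=df["wage"].fillna(df["wage"].mean()))  # mean imputation

# regression imputation: predict the missing wages from observed covariates
obs, mis = df["wage"].notna(), df["wage"].isna()
model = LinearRegression().fit(df.loc[obs, ["age", "tenure"]], df.loc[obs, "wage"])
reg_imp = df.copy()
reg_imp.loc[mis, "wage"] = model.predict(df.loc[mis, ["age", "tenure"]])
```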
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically