100+ datasets found
  1. f

    Data Driven Estimation of Imputation Error—A Strategy for Imputation with a...

    • plos.figshare.com
    • figshare.com
    pdf
    Updated Jun 1, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Nikolaj Bak; Lars K. Hansen (2023). Data Driven Estimation of Imputation Error—A Strategy for Imputation with a Reject Option [Dataset]. http://doi.org/10.1371/journal.pone.0164464
    Explore at:
    pdfAvailable download formats
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Nikolaj Bak; Lars K. Hansen
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Missing data is a common problem in many research fields and is a challenge that always needs careful considerations. One approach is to impute the missing values, i.e., replace missing values with estimates. When imputation is applied, it is typically applied to all records with missing values indiscriminately. We note that the effects of imputation can be strongly dependent on what is missing. To help make decisions about which records should be imputed, we propose to use a machine learning approach to estimate the imputation error for each case with missing data. The method is thought to be a practical approach to help users using imputation after the informed choice to impute the missing data has been made. To do this all patterns of missing values are simulated in all complete cases, enabling calculation of the “true error” in each of these new cases. The error is then estimated for each case with missing values by weighing the “true errors” by similarity. The method can also be used to test the performance of different imputation methods. A universal numerical threshold of acceptable error cannot be set since this will differ according to the data, research question, and analysis method. The effect of threshold can be estimated using the complete cases. The user can set an a priori relevant threshold for what is acceptable or use cross validation with the final analysis to choose the threshold. The choice can be presented along with argumentation for the choice rather than holding to conventions that might not be warranted in the specific dataset.

  2. R

    Conjunto de datos de: Water-quality data imputation with a high percentage...

    • redata.anii.org.uy
    tsv
    Updated Mar 6, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Rafael Rodríguez; Marcos Pastorini; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Angela Gorgoglione; Rafael Rodríguez; Marcos Pastorini; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Angela Gorgoglione (2024). Conjunto de datos de: Water-quality data imputation with a high percentage of missing values: a machine learning approach [Dataset]. http://doi.org/10.60895/redata/TNRT8Q
    Explore at:
    tsv(11154), tsv(47413)Available download formats
    Dataset updated
    Mar 6, 2024
    Dataset provided by
    Repositorio de datos abiertos de investigación de Uruguay
    Authors
    Rafael Rodríguez; Marcos Pastorini; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Angela Gorgoglione; Rafael Rodríguez; Marcos Pastorini; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Angela Gorgoglione
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Dataset funded by
    ANII
    Description

    The monitoring of surface-water quality followed by water-quality modeling and analysis is essential for generating effective strategies in water resource management. However, water-quality studies are limited by the lack of complete and reliable data sets on surface-water-quality variables. These deficiencies are particularly noticeable in developing countries. This work focuses on surface-water-quality data from Santa Lucía Chico river (Uruguay), a mixed lotic and lentic river system. Data collected at six monitoring stations are publicly available at https://www.dinama.gub.uy/oan/datos-abiertos/calidad-agua/. The high temporal and spatial variability that characterizes water-quality variables and the high rate of missing values (between 50% and 70%) raises significant challenges. To deal with missing values, we applied several statistical and machine-learning imputation methods. The competing algorithms implemented belonged to both univariate and multivariate imputation methods (inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR), and K-nearest neighbors Regressor (KNNR)). IDW outperformed the others, achieving a very good performance (NSE greater than 0.8) in most cases. In this dataset, we include the original and imputed values for the following variables: - Water temperature (Tw) - Dissolved oxygen (DO) - Electrical conductivity (EC) - pH - Turbidity (Turb) - Nitrite (NO2-) - Nitrate (NO3-) - Total Nitrogen (TN) Each variable is identified as [STATION] VARIABLE FULL NAME (VARIABLE SHORT NAME) [UNIT METRIC]. More details about the study area, the original datasets, and the methodology adopted can be found in our paper https://www.mdpi.com/2071-1050/13/11/6318. If you use this dataset in your work, please cite our paper: Rodríguez, R.; Pastorini, M.; Etcheverry, L.; Chreties, C.; Fossati, M.; Castro, A.; Gorgoglione, A. Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach. Sustainability 2021, 13, 6318. https://doi.org/10.3390/su13116318

  3. Data from: Evaluating the Use of Uncertainty Visualisations for Imputations...

    • osf.io
    Updated Aug 26, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Abhraneel Sarma (2024). Evaluating the Use of Uncertainty Visualisations for Imputations of Data Missing At Random in Scatterplots [Dataset]. https://osf.io/q4y5r
    Explore at:
    Dataset updated
    Aug 26, 2024
    Dataset provided by
    Center for Open Sciencehttps://cos.io/
    Authors
    Abhraneel Sarma
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains supplementary materials for the paper, Evaluating the Use of Uncertainty Visualisations for Imputations of Data Missing At Random in Scatterplots

    Abstract: Most real-world datasets contain missing values yet most exploratory data analysis (EDA) systems only support visualising data points with complete cases. This omission may potentially lead the user to biased analyses and insights. Imputation techniques can help estimate the value of a missing data point, but introduces additional uncertainty. In this work, we investigate the effects of visualising imputed values in charts using different types of uncertainty visualisation techniques—no imputation, mean, 95% confidence intervals, probability density plots, gradient intervals, and hypothetical outcome plots. We focus on scatterplots, which is a commonly used chart type, and conduct a crowdsourced study with 202 participants. We measure users’ bias and precision in performing two tasks—estimating average and detecting trend—and their self-reported confidence in performing these tasks. Our results suggest that, when estimating averages, uncertainty representations may reduce bias but at the cost of decreasing precision. When estimating trend, only hypothetical outcome plots may lead to a small probability of reducing bias while increasing precision. Participants in every uncertainty representation were less certain about their response when compared to the baseline. The findings point towards potential trade-offs in using uncertainty encodings for datasets with a large number of missing values.

  4. z

    Missing data in the analysis of multilevel and dependent data (Example data...

    • zenodo.org
    bin
    Updated Jul 20, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Simon Grund; Simon Grund; Oliver Lüdtke; Oliver Lüdtke; Alexander Robitzsch; Alexander Robitzsch (2023). Missing data in the analysis of multilevel and dependent data (Example data sets) [Dataset]. http://doi.org/10.5281/zenodo.7773614
    Explore at:
    binAvailable download formats
    Dataset updated
    Jul 20, 2023
    Dataset provided by
    Springer
    Authors
    Simon Grund; Simon Grund; Oliver Lüdtke; Oliver Lüdtke; Alexander Robitzsch; Alexander Robitzsch
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Example data sets for the book chapter titled "Missing Data in the Analysis of Multilevel and Dependent Data" submitted for publication in the second edition of "Dependent Data in Social Science Research" (Stemmler et al., 2015). This repository includes the data sets used in both example analyses (Examples 1 and 2) in two file formats (binary ".rda" for use in R; plain-text ".dat").

    The data sets contain simulated data from 23,376 (Example 1) and 23,072 (Example 2) individuals from 2,000 groups on four variables:

    ID = group identifier (1-2000)
    x = numeric (Level 1)
    y = numeric (Level 1)
    w = binary (Level 2)

    In all data sets, missing values are coded as "NA".

  5. d

    Data from: Multiple Imputation for the Supplementary Homicide Reports:...

    • catalog.data.gov
    • datasets.ai
    • +2more
    Updated Mar 12, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    National Institute of Justice (2025). Multiple Imputation for the Supplementary Homicide Reports: Evaluation in Unique Test Data, 1990-1995, Chicago, Philadelphia, Phoenix and St. Louis [Dataset]. https://catalog.data.gov/dataset/multiple-imputation-for-the-supplementary-homicide-reports-evaluation-in-unique-test-data-
    Explore at:
    Dataset updated
    Mar 12, 2025
    Dataset provided by
    National Institute of Justice
    Area covered
    St. Louis, Chicago
    Description

    This study was an evaluation of multiple imputation strategies to address missing data using the New Approach to Evaluating Supplementary Homicide Report (SHR) Data Imputation, 1990-1995 (ICPSR 20060) dataset.

  6. H

    Replication data for: What To Do about Missing Data in Time-Series...

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Sep 20, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    James Honaker; Gary King (2024). Replication data for: What To Do about Missing Data in Time-Series Cross-Sectional Data [Dataset]. http://doi.org/10.7910/DVN/GGUR0P
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 20, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    James Honaker; Gary King
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Applications of modern methods for analyzing data with missing values, based primarily on multiple imputation, have in the last half-decade become common in American politics and political behavior. Scholars in these fields have thus increasingly avoided the biases and inefficiencies caused by ad hoc methods like listwise deletion and best guess imputation. However, researchers in much of comparative politics and international relations, and others with similar data, have been unable to do the same because the best available imputation methods work poorly with the time-series cross-section data structures common in these fields. We attempt to rectify this situation. First, we build a multiple i mputation model that allows smooth time trends, shifts across cross-sectional units, and correlations over time and space, resulting in far more accurate imputations. Second, we build nonignorable missingness models by enabling analysts to incorporate knowledge from area studies experts via priors on individual missing cell values, rather than on difficult-to-interpret model parameters. Third, since these tasks could not be accomplished within existing imputation algorithms, in that they cannot handle as many variables as needed even in the simpler cross-sectional data for which they were designed, we also develop a new algorithm that substantially expands the range of computationally feasible data types and sizes for which multiple imputation can be used. These developments also made it possible to implement the methods introduced here in freely available open source software that is considerably more reliable than existing strategies. These developments also made it possible to implement the methods introduced here in freely available open source software, Amelia II: A Program for Missing Data, that is considerably more reliable than existing strategies. See also: Missing Data

  7. H

    Replication Data for: The MIDAS Touch: Accurate and Scalable Missing-Data...

    • dataverse.harvard.edu
    Updated Sep 29, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ranjit Lall; Thomas Robinson (2022). Replication Data for: The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning [Dataset]. http://doi.org/10.7910/DVN/UPL4TT
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 29, 2022
    Dataset provided by
    Harvard Dataverse
    Authors
    Ranjit Lall; Thomas Robinson
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Replication and simulation reproduction materials for the article "The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning." Please see the README file for a summary of the contents and the Replication Guide for a more detailed description. Article abstract: Principled methods for analyzing missing values, based chiefly on multiple imputation, have become increasingly popular yet can struggle to handle the kinds of large and complex data that are also becoming common. We propose an accurate, fast, and scalable approach to multiple imputation, which we call MIDAS (Multiple Imputation with Denoising Autoencoders). MIDAS employs a class of unsupervised neural networks known as denoising autoencoders, which are designed to reduce dimensionality by corrupting and attempting to reconstruct a subset of data. We repurpose denoising autoencoders for multiple imputation by treating missing values as an additional portion of corrupted data and drawing imputations from a model trained to minimize the reconstruction error on the originally observed portion. Systematic tests on simulated as well as real social science data, together with an applied example involving a large-scale electoral survey, illustrate MIDAS's accuracy and efficiency across a range of settings. We provide open-source software for implementing MIDAS.

  8. n

    Data from: Using multiple imputation to estimate missing data in...

    • narcis.nl
    • data.niaid.nih.gov
    • +2more
    Updated Dec 10, 2014
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ellington, E. Hance; Bastille-Rousseau, Guillaume; Austin, Cayla; Landolt, Kristen N.; Pond, Bruce A.; Rees, Erin E.; Robar, Nicholas; Murray, Dennis L. (2014). Data from: Using multiple imputation to estimate missing data in meta-regression [Dataset]. http://doi.org/10.5061/dryad.m2v4m
    Explore at:
    Dataset updated
    Dec 10, 2014
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    Ellington, E. Hance; Bastille-Rousseau, Guillaume; Austin, Cayla; Landolt, Kristen N.; Pond, Bruce A.; Rees, Erin E.; Robar, Nicholas; Murray, Dennis L.
    Description
    1. There is a growing need for scientific synthesis in ecology and evolution. In many cases, meta-analytic techniques can be used to complement such synthesis. However, missing data is a serious problem for any synthetic efforts and can compromise the integrity of meta-analyses in these and other disciplines. Currently, the prevalence of missing data in meta-analytic datasets in ecology and the efficacy of different remedies for this problem have not been adequately quantified. 2. We generated meta-analytic datasets based on literature reviews of experimental and observational data and found that missing data were prevalent in meta-analytic ecological datasets. We then tested the performance of complete case removal (a widely used method when data are missing) and multiple imputation (an alternative method for data recovery) and assessed model bias, precision, and multi-model rankings under a variety of simulated conditions using published meta-regression datasets. 3. We found that complete case removal led to biased and imprecise coefficient estimates and yielded poorly specified models. In contrast, multiple imputation provided unbiased parameter estimates with only a small loss in precision. The performance of multiple imputation, however, was dependent on the type of data missing. It performed best when missing values were weighting variables, but performance was mixed when missing values were predictor variables. Multiple imputation performed poorly when imputing raw data which was then used to calculate effect size and the weighting variable. 4. We conclude that complete case removal should not be used in meta-regression, and that multiple imputation has the potential to be an indispensable tool for meta-regression in ecology and evolution. However, we recommend that users assess the performance of multiple imputation by simulating missing data on a subset of their data before implementing it to recover actual missing data.
  9. d

    Data Matching Imputation System

    • catalog.data.gov
    • s.cnmilf.com
    • +3more
    Updated Oct 19, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (Point of Contact, Custodian) (2024). Data Matching Imputation System [Dataset]. https://catalog.data.gov/dataset/data-matching-imputation-system1
    Explore at:
    Dataset updated
    Oct 19, 2024
    Dataset provided by
    (Point of Contact, Custodian)
    Description

    The DMIS dataset is a flat file record of the matching of several data set collections. Primarily it consists of VTRs, dealer records, Observer data in conjunction with vessel permit information for the purpose of supporting North East Regional quota monitoring projects.

  10. d

    New Approach to Evaluating Supplementary Homicide Report (SHR) Data...

    • catalog.data.gov
    • icpsr.umich.edu
    Updated Mar 12, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    National Institute of Justice (2025). New Approach to Evaluating Supplementary Homicide Report (SHR) Data Imputation, 1990-1995 [Dataset]. https://catalog.data.gov/dataset/new-approach-to-evaluating-supplementary-homicide-report-shr-data-imputation-1990-1995-ff769
    Explore at:
    Dataset updated
    Mar 12, 2025
    Dataset provided by
    National Institute of Justice
    Description

    The purpose of the project was to learn more about patterns of homicide in the United States by strengthening the ability to make imputations for Supplementary Homicide Report (SHR) data with missing values. Supplementary Homicide Reports (SHR) and local police data from Chicago, Illinois, St. Louis, Missouri, Philadelphia, Pennsylvania, and Phoenix, Arizona, for 1990 to 1995 were merged to create a master file by linking on overlapping information on victim and incident characteristics. Through this process, 96 percent of the cases in the SHR were matched with cases in the police files. The data contain variables for three types of cases: complete in SHR, missing offender and incident information in SHR but known in police report, and missing offender and incident information in both. The merged file allows estimation of similarities and differences between the cases with known offender characteristics in the SHR and those in the other two categories. The accuracy of existing data imputation methods can be assessed by comparing imputed values in an "incomplete" dataset (the SHR), generated by the three imputation strategies discussed in the literature, with the actual values in a known "complete" dataset (combined SHR and police data). Variables from both the Supplemental Homicide Reports and the additional police report offense data include incident date, victim characteristics, offender characteristics, incident details, geographic information, as well as variables regarding the matching procedure.

  11. f

    Data_Sheet_1_ImputEHR: A Visualization Tool of Imputation for the Prediction...

    • figshare.com
    • frontiersin.figshare.com
    pdf
    Updated Jun 1, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Yi-Hui Zhou; Ehsan Saghapour (2023). Data_Sheet_1_ImputEHR: A Visualization Tool of Imputation for the Prediction of Biomedical Data.PDF [Dataset]. http://doi.org/10.3389/fgene.2021.691274.s001
    Explore at:
    pdfAvailable download formats
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers
    Authors
    Yi-Hui Zhou; Ehsan Saghapour
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Electronic health records (EHRs) have been widely adopted in recent years, but often include a high proportion of missing data, which can create difficulties in implementing machine learning and other tools of personalized medicine. Completed datasets are preferred for a number of analysis methods, and successful imputation of missing EHR data can improve interpretation and increase our power to predict health outcomes. However, use of the most popular imputation methods mainly require scripting skills, and are implemented using various packages and syntax. Thus, the implementation of a full suite of methods is generally out of reach to all except experienced data scientists. Moreover, imputation is often considered as a separate exercise from exploratory data analysis, but should be considered as art of the data exploration process. We have created a new graphical tool, ImputEHR, that is based on a Python base and allows implementation of a range of simple and sophisticated (e.g., gradient-boosted tree-based and neural network) data imputation approaches. In addition to imputation, the tool enables data exploration for informed decision-making, as well as implementing machine learning prediction tools for response data selected by the user. Although the approach works for any missing data problem, the tool is primarily motivated by problems encountered for EHR and other biomedical data. We illustrate the tool using multiple real datasets, providing performance measures of imputation and downstream predictive analysis.

  12. Z

    Data from: A real data-driven simulation strategy to select an imputation...

    • data.niaid.nih.gov
    Updated Feb 16, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Adamowicz, Sarah J. (2023). Data from: A real data-driven simulation strategy to select an imputation method for mixed-type trait data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7644193
    Explore at:
    Dataset updated
    Feb 16, 2023
    Dataset provided by
    Adamowicz, Sarah J.
    May, Jacqueline A.
    Feng, Zeny
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Missing observations in trait datasets pose an obstacle for analyses in myriad biological disciplines. Considering the mixed results of imputation, the wide variety of available methods, and the varied structure of real trait datasets, a framework for selecting a suitable imputation method is advantageous. We invoked a real data-driven simulation strategy to select an imputation method for a given mixed-type (categorical, count, continuous) target dataset. Candidate methods included mean/mode imputation, k-nearest neighbour, random forests, and multivariate imputation by chained equations (MICE). Using a trait dataset of squamates (lizards and amphisbaenians; order: Squamata) as a target dataset, a complete-case dataset consisting of species with nearly completed information was formed for the imputation method selection. Missing data were induced by removing values from this dataset under different missingness mechanisms: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). For each method, combinations with and without phylogenetic information from single gene (nuclear and mitochondrial) or multigene trees were used to impute the missing values for five numerical and two categorical traits. The performances of the methods were evaluated under each missing mechanism by determining the mean squared error and proportion falsely classified rates for numerical and categorical traits, respectively. A random forest method supplemented with a nuclear-derived phylogeny resulted in the lowest error rates for the majority of traits, and this method was used to impute missing values in the original dataset. Data with imputed values better reflected the characteristics and distributions of the original data compared to complete-case data. However, caution should be taken when imputing trait data as phylogeny did not always improve performance for every trait and in every scenario. Ultimately, these results support the use of a real data-driven simulation strategy for selecting a suitable imputation method for a given mixed-type trait dataset.

  13. Data from: Missing data estimation in morphometrics: how much is too much?

    • zenodo.org
    • data.niaid.nih.gov
    • +2more
    Updated Jun 1, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Julien Clavel; Gildas Merceron; Gilles Escarguel; Julien Clavel; Gildas Merceron; Gilles Escarguel (2022). Data from: Missing data estimation in morphometrics: how much is too much? [Dataset]. http://doi.org/10.5061/dryad.f0b50
    Explore at:
    Dataset updated
    Jun 1, 2022
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Julien Clavel; Gildas Merceron; Gilles Escarguel; Julien Clavel; Gildas Merceron; Gilles Escarguel
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Fossil-based estimates of diversity and evolutionary dynamics mainly rely on the study of morphological variation. Unfortunately, organism remains are often altered by post-mortem taphonomic processes such as weathering or distortion. Such a loss of information often prevents quantitative multivariate description and statistically controlled comparisons of extinct species based on morphometric data. A common way to deal with missing data involves imputation methods that directly fill the missing cases with model estimates. Over the last several years, several empirically determined thresholds for the maximum acceptable proportion of missing values have been proposed in the literature, whereas other studies showed that this limit actually depends on several properties of the study dataset and of the selected imputation method, and is by no way generalizable. We evaluate the relative performances of seven multiple imputation techniques through a simulation-based analysis under three distinct patterns of missing data distribution. Overall, Fully Conditional Specification and Expectation-Maximization algorithms provide the best compromises between imputation accuracy and coverage probability. Multiple imputation (MI) techniques appear remarkably robust to the violation of basic assumptions such as the occurrence of taxonomically or anatomically biased patterns of missing data distribution, making differences in simulation results between the three patterns of missing data distribution much smaller than differences between the individual MI techniques. Based on these results, rather than proposing a new (set of) threshold value(s), we develop an approach combining the use of multiple imputations with procrustean superimposition of principal component analysis results, in order to directly visualize the effect of individual missing data imputation on an ordinated space. We provide an R function for users to implement the proposed procedure.

  14. D

    Data from: Estimating and imputing missing tax loss carryforward data to...

    • dataverse.nl
    • test.dataverse.nl
    txt
    Updated Nov 30, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    M.M. Max; J.L. Wielhouwer; E. Wiersma; M.M. Max; J.L. Wielhouwer; E. Wiersma (2023). Estimating and imputing missing tax loss carryforward data to reduce measurement error [Dataset]. http://doi.org/10.34894/N9J1WE
    Explore at:
    txt(3617825), txt(3728396), txt(3690658)Available download formats
    Dataset updated
    Nov 30, 2023
    Dataset provided by
    DataverseNL
    Authors
    M.M. Max; J.L. Wielhouwer; E. Wiersma; M.M. Max; J.L. Wielhouwer; E. Wiersma
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    On this page, you find the imputed tax loss carryforward (TLCF) values based on the algorithm presented in Max, Wielhouwer & Wiersma (2023). Estimating and imputing missing tax loss carryforward data to reduce measurement error. European Accounting Review 32(1), 55-84. https://doi.org/10.1080/09638180.2021.1924812 . Note that the dataset contains only the imputations for the missing values on Compustat. If the Compustat TLCF is available, we have not included it in the dataset here. We download all observations from Compustat from the period 1982 up until the most recent year. Missing values on the input variables for the Shevlin (1990) taxable income measure are replaced by zero. We drop firms from the sample if years are missing in their time series. Note: We seek to update the dataset each year with the latest imputations, so pay attention to the 'Versions' tab of this dataset. Each update will be posted in a new version.

  15. f

    Results of the ML models using PCA imputer.

    • plos.figshare.com
    • figshare.com
    xls
    Updated Jan 3, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Turki Aljrees (2024). Results of the ML models using PCA imputer. [Dataset]. http://doi.org/10.1371/journal.pone.0295632.t006
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Jan 3, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Turki Aljrees
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Cervical cancer is a leading cause of women’s mortality, emphasizing the need for early diagnosis and effective treatment. In line with the imperative of early intervention, the automated identification of cervical cancer has emerged as a promising avenue, leveraging machine learning techniques to enhance both the speed and accuracy of diagnosis. However, an inherent challenge in the development of these automated systems is the presence of missing values in the datasets commonly used for cervical cancer detection. Missing data can significantly impact the performance of machine learning models, potentially leading to inaccurate or unreliable results. This study addresses a critical challenge in automated cervical cancer identification—handling missing data in datasets. The study present a novel approach that combines three machine learning models into a stacked ensemble voting classifier, complemented by the use of a KNN Imputer to manage missing values. The proposed model achieves remarkable results with an accuracy of 0.9941, precision of 0.98, recall of 0.96, and an F1 score of 0.97. This study examines three distinct scenarios: one involving the deletion of missing values, another utilizing KNN imputation, and a third employing PCA for imputing missing values. This research has significant implications for the medical field, offering medical experts a powerful tool for more accurate cervical cancer therapy and enhancing the overall effectiveness of testing procedures. By addressing missing data challenges and achieving high accuracy, this work represents a valuable contribution to cervical cancer detection, ultimately aiming to reduce the impact of this disease on women’s health and healthcare systems.

  16. f

    Summary statistics for the complete-case data, original data, and original...

    • plos.figshare.com
    • figshare.com
    xls
    Updated Jun 4, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jacqueline A. May; Zeny Feng; Sarah J. Adamowicz (2023). Summary statistics for the complete-case data, original data, and original data with imputed values. [Dataset]. http://doi.org/10.1371/journal.pcbi.1010154.t001
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    PLOS Computational Biology
    Authors
    Jacqueline A. May; Zeny Feng; Sarah J. Adamowicz
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Summary statistics for the complete-case data, original data, and original data with imputed values.

  17. f

    Class variables of the chosen datasets.

    • plos.figshare.com
    xls
    Updated Aug 29, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Kritanat Chungnoy; Tanatorn Tanantong; Pokpong Songmuang (2024). Class variables of the chosen datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0305492.t004
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Aug 29, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Kritanat Chungnoy; Tanatorn Tanantong; Pokpong Songmuang
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Existing missing value imputation methods focused on imputing the data regarding actual values towards a completion of datasets as an input for machine learning tasks. This work proposes an imputation of missing values towards improvement of accuracy performance for classification. The proposed method was based on bee algorithm and the use of k-nearest neighborhood with linear regression to guide on finding the appropriate solution in prevention of randomness. Among the processes, GINI importance score was utilized in selecting values for imputation. The imputed values thus reflected on improving a discriminative power in classification tasks instead of replicating the actual values from the original dataset. In this study, we evaluated the proposed method against frequently used imputation methods such as k-nearest neighborhood, principal components analysis, nonlinear principal, and component analysis to compare root mean square error results and accuracy of using imputed datasets in a classification task. The experimental results indicated that our proposed method obtained the best accuracy results from all datasets comparing to other methods. In comparison to original dataset, the classification model from imputed datasets yielded 15-25% higher accuracy in class prediction. From analysis, the results showed that feature ranking used in a classification process was affected and lead to noticeably change in informativeness as the imputed data from the proposed method played the role to boost a discriminating power.

  18. Iterative Imputation of Jane St train.csv

    • kaggle.com
    Updated Nov 29, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    tpmeli (2020). Iterative Imputation of Jane St train.csv [Dataset]. https://www.kaggle.com/tpmeli/iterative-imputation-of-jane-st-traincsv/code
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 29, 2020
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    tpmeli
    Description

    I will be sharing all of my missing data exploration here:

    https://www.kaggle.com/tpmeli/missing-data-exploration-mean-iterative-more

  19. Additional file 3 of Methylation data imputation performances under...

    • zenodo.org
    • explore.openaire.eu
    pdf
    Updated Jul 17, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Pietro Di Lena; Pietro Di Lena; Claudia Sala; Andrea Prodi; Christine Nardini; Claudia Sala; Andrea Prodi; Christine Nardini (2024). Additional file 3 of Methylation data imputation performances under different representations and missingness patterns [Dataset]. http://doi.org/10.5281/zenodo.5869127
    Explore at:
    pdfAvailable download formats
    Dataset updated
    Jul 17, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Pietro Di Lena; Pietro Di Lena; Claudia Sala; Andrea Prodi; Christine Nardini; Claudia Sala; Andrea Prodi; Christine Nardini
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 3 Performance comparison between complete (450k) and restricted (21k) datasets.

    Additional file 3 of Methylation data imputation performances under different representations and missingness patterns (figshare.com)

  20. RMSE score for the imputation of 269 DE genes on the split data.

    • plos.figshare.com
    xls
    Updated Nov 10, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Tong Si; Zackary Hopkins; John Yanev; Jie Hou; Haijun Gong (2023). RMSE score for the imputation of 269 DE genes on the split data. [Dataset]. http://doi.org/10.1371/journal.pone.0292792.t005
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Nov 10, 2023
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Tong Si; Zackary Hopkins; John Yanev; Jie Hou; Haijun Gong
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    RMSE score for the imputation of 269 DE genes on the split data.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Nikolaj Bak; Lars K. Hansen (2023). Data Driven Estimation of Imputation Error—A Strategy for Imputation with a Reject Option [Dataset]. http://doi.org/10.1371/journal.pone.0164464

Data Driven Estimation of Imputation Error—A Strategy for Imputation with a Reject Option

Explore at:
5 scholarly articles cite this dataset (View in Google Scholar)
pdfAvailable download formats
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS ONE
Authors
Nikolaj Bak; Lars K. Hansen
License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Missing data is a common problem in many research fields and is a challenge that always needs careful considerations. One approach is to impute the missing values, i.e., replace missing values with estimates. When imputation is applied, it is typically applied to all records with missing values indiscriminately. We note that the effects of imputation can be strongly dependent on what is missing. To help make decisions about which records should be imputed, we propose to use a machine learning approach to estimate the imputation error for each case with missing data. The method is thought to be a practical approach to help users using imputation after the informed choice to impute the missing data has been made. To do this all patterns of missing values are simulated in all complete cases, enabling calculation of the “true error” in each of these new cases. The error is then estimated for each case with missing values by weighing the “true errors” by similarity. The method can also be used to test the performance of different imputation methods. A universal numerical threshold of acceptable error cannot be set since this will differ according to the data, research question, and analysis method. The effect of threshold can be estimated using the complete cases. The user can set an a priori relevant threshold for what is acceptable or use cross validation with the final analysis to choose the threshold. The choice can be presented along with argumentation for the choice rather than holding to conventions that might not be warranted in the specific dataset.

Search
Clear search
Close search
Google apps
Main menu