100+ datasets found
  1. Water-quality data imputation with a high percentage of missing values: a...

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Jun 8, 2021
    + more versions
    Cite
    Rafael Rodríguez; Marcos Pastorini; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Angela Gorgoglione (2021). Water-quality data imputation with a high percentage of missing values: a machine learning approach [Dataset]. http://doi.org/10.5281/zenodo.4731169
    Explore at:
    Available download formats: csv
    Dataset updated
    Jun 8, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Rafael Rodríguez; Marcos Pastorini; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Angela Gorgoglione
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The monitoring of surface-water quality followed by water-quality modeling and analysis is essential for generating effective strategies in water resource management. However, water-quality studies are limited by the lack of complete and reliable data sets on surface-water-quality variables. These deficiencies are particularly noticeable in developing countries.

    This work focuses on surface-water-quality data from the Santa Lucía Chico river (Uruguay), a mixed lotic and lentic river system. Data collected at six monitoring stations are publicly available at https://www.dinama.gub.uy/oan/datos-abiertos/calidad-agua/. The high temporal and spatial variability that characterizes water-quality variables and the high rate of missing values (between 50% and 70%) raise significant challenges.

    To deal with missing values, we applied several statistical and machine-learning imputation methods. The competing algorithms implemented belonged to both univariate and multivariate imputation methods (inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR), and K-nearest neighbors Regressor (KNNR)).

    IDW outperformed the others, achieving a very good performance (NSE greater than 0.8) in most cases.
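    The authors' own code is not part of this entry, so below is only a minimal Python sketch of the winning approach: inverse distance weighting for a single gap, together with the NSE score used to judge performance. The station values, distances, and power parameter are made-up assumptions.

        import numpy as np

        def idw_impute(values, distances, power=2.0):
            # Estimate one missing value from the same variable observed at
            # the other stations, weighted by inverse distance.
            mask = ~np.isnan(values)
            weights = 1.0 / distances[mask] ** power
            return np.sum(weights * values[mask]) / np.sum(weights)

        def nse(observed, simulated):
            # Nash-Sutcliffe efficiency: 1 is a perfect match; values above
            # 0.8 are commonly read as very good performance.
            observed = np.asarray(observed, dtype=float)
            simulated = np.asarray(simulated, dtype=float)
            return 1.0 - np.sum((simulated - observed) ** 2) / np.sum(
                (observed - observed.mean()) ** 2)

        # Toy example: water temperature (Tw) at five stations, one gap.
        tw = np.array([18.2, np.nan, 17.9, 18.5, 18.1])
        dist = np.array([3.0, 0.0, 5.5, 8.0, 12.0])  # km, hypothetical
        tw[1] = idw_impute(tw, dist)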

    In this dataset, we include the original and imputed values for the following variables:

    • Water temperature (Tw)

    • Dissolved oxygen (DO)

    • Electrical conductivity (EC)

    • pH

    • Turbidity (Turb)

    • Nitrite (NO2-)

    • Nitrate (NO3-)

    • Total Nitrogen (TN)

    Each variable is identified as [STATION] VARIABLE FULL NAME (VARIABLE SHORT NAME) [UNIT METRIC].

    More details about the study area, the original datasets, and the methodology adopted can be found in our paper https://www.mdpi.com/2071-1050/13/11/6318.

    If you use this dataset in your work, please cite our paper:
    Rodríguez, R.; Pastorini, M.; Etcheverry, L.; Chreties, C.; Fossati, M.; Castro, A.; Gorgoglione, A. Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach. Sustainability 2021, 13, 6318. https://doi.org/10.3390/su13116318

  2. Identifying Missing Data Handling Methods with Text Mining

    • openicpsr.org
    delimited
    Updated Mar 8, 2023
    Cite
    Krisztián Boros; Zoltán Kmetty (2023). Identifying Missing Data Handling Methods with Text Mining [Dataset]. http://doi.org/10.3886/E185961V1
    Explore at:
    Available download formats: delimited
    Dataset updated
    Mar 8, 2023
    Dataset provided by
    Hungarian Academy of Sciences
    Authors
    Krisztián Boros; Zoltán Kmetty
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 1999 - Dec 31, 2016
    Description

    Missing data are an inevitable aspect of empirical research. Researchers have developed several techniques for handling missing data to avoid information loss and bias. Over the past 50 years, these methods have become both more efficient and more complex. Building on previous review studies, this paper analyzes which missing-data handling methods are used across scientific disciplines. For the analysis, we used nearly 50,000 scientific articles published between 1999 and 2016; JSTOR provided the data in text format. We utilized a text-mining approach to extract the necessary information from our corpus. Our results show that the usage of advanced missing-data handling methods such as Multiple Imputation or Full Information Maximum Likelihood estimation grew steadily over the examination period. At the same time, simpler methods, like listwise and pairwise deletion, remain in widespread use.
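    The paper's pipeline is not distributed with this entry; as a rough illustration of the core step (flagging which handling methods an article mentions), a keyword-based Python sketch could look like the following. The pattern dictionary is illustrative, not the authors' actual search phrases.

        import re

        # Illustrative patterns only; not the study's real dictionary.
        METHOD_PATTERNS = {
            "listwise deletion": r"listwise deletion|complete[- ]case analysis",
            "pairwise deletion": r"pairwise deletion",
            "multiple imputation": r"multiple imputation",
            "FIML": r"full information maximum likelihood",
        }

        def detect_methods(article_text):
            # Return the set of handling methods mentioned in one article.
            text = article_text.lower()
            return {name for name, pattern in METHOD_PATTERNS.items()
                    if re.search(pattern, text)}

        print(detect_methods("We handle missingness via multiple imputation."))
        # -> {'multiple imputation'}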

  3. Data from: A multiple imputation method using population information

    • tandf.figshare.com
    pdf
    Updated Apr 30, 2025
    Cite
    Tadayoshi Fushiki (2025). A multiple imputation method using population information [Dataset]. http://doi.org/10.6084/m9.figshare.28900017.v1
    Explore at:
    Available download formats: pdf
    Dataset updated
    Apr 30, 2025
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Tadayoshi Fushiki
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Multiple imputation (MI) is effective for dealing with missing data when the missingness mechanism is missing at random (MAR). However, MI may not be effective when the mechanism is not missing at random (NMAR). In such cases, additional information is required to obtain appropriate imputations. Pham et al. (2019) proposed the calibrated-δ adjustment method, a multiple imputation method that uses population information; it provides appropriate imputations in two NMAR settings. However, the calibrated-δ adjustment method has two limitations: it can be used only when a single variable has missing values, and the theoretical properties of its variance estimator have not been provided. This article proposes a multiple imputation method using population information that can be applied when several variables have missing values. The proposed method is proven to include the calibrated-δ adjustment method as a special case, and it is shown to provide a consistent estimator for the parameters of the imputation model in an NMAR situation. The asymptotic variance of the estimator obtained by the proposed method, and an estimator of that variance, are also given.

  4. Understanding and Managing Missing Data.pdf

    • figshare.com
    pdf
    Updated Jun 9, 2025
    Cite
    Ibrahim Denis Fofanah (2025). Understanding and Managing Missing Data.pdf [Dataset]. http://doi.org/10.6084/m9.figshare.29265155.v1
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 9, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Ibrahim Denis Fofanah
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This document provides a clear and practical guide to understanding missing data mechanisms, including Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Through real-world scenarios and examples, it explains how different types of missingness impact data analysis and decision-making. It also outlines common strategies for handling missing data, including deletion techniques and imputation methods such as mean imputation, regression imputation, and stochastic modeling. Designed for researchers, analysts, and students working with real-world datasets, this guide helps ensure statistical validity, reduce bias, and improve the overall quality of analysis in fields like public health, behavioral science, social research, and machine learning.
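    As a quick companion to the strategies the guide covers, here is a hedged scikit-learn sketch contrasting mean imputation with regression-style (stochastic) imputation; the toy data frame and column names are invented.

        import numpy as np
        import pandas as pd
        from sklearn.experimental import enable_iterative_imputer  # noqa: F401
        from sklearn.impute import SimpleImputer, IterativeImputer

        df = pd.DataFrame({"age": [23, 35, np.nan, 52, 41],
                           "income": [31000, np.nan, 45000, 88000, np.nan]})

        # Mean imputation: every gap gets its column's observed mean.
        mean_filled = pd.DataFrame(
            SimpleImputer(strategy="mean").fit_transform(df),
            columns=df.columns)

        # Regression-style imputation: each column is modeled on the others;
        # sample_posterior=True draws from the fit, approximating stochastic
        # imputation rather than a deterministic prediction.
        reg_filled = pd.DataFrame(
            IterativeImputer(sample_posterior=True,
                             random_state=0).fit_transform(df),
            columns=df.columns)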

  5. Replication Data for: The MIDAS Touch: Accurate and Scalable Missing-Data...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 23, 2023
    Cite
    Lall, Ranjit; Robinson, Thomas (2023). Replication Data for: The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning [Dataset]. http://doi.org/10.7910/DVN/UPL4TT
    Explore at:
    Dataset updated
    Nov 23, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Lall, Ranjit; Robinson, Thomas
    Description

    Replication and simulation reproduction materials for the article "The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning." Please see the README file for a summary of the contents and the Replication Guide for a more detailed description. Article abstract: Principled methods for analyzing missing values, based chiefly on multiple imputation, have become increasingly popular yet can struggle to handle the kinds of large and complex data that are also becoming common. We propose an accurate, fast, and scalable approach to multiple imputation, which we call MIDAS (Multiple Imputation with Denoising Autoencoders). MIDAS employs a class of unsupervised neural networks known as denoising autoencoders, which are designed to reduce dimensionality by corrupting and attempting to reconstruct a subset of data. We repurpose denoising autoencoders for multiple imputation by treating missing values as an additional portion of corrupted data and drawing imputations from a model trained to minimize the reconstruction error on the originally observed portion. Systematic tests on simulated as well as real social science data, together with an applied example involving a large-scale electoral survey, illustrate MIDAS's accuracy and efficiency across a range of settings. We provide open-source software for implementing MIDAS.
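    The authors provide open-source MIDAS software; the sketch below is not that implementation, just a minimal PyTorch illustration of the underlying idea: corrupt a subset of the observed cells, train to reconstruct the observed portion, then read imputations off the model's output at the originally missing positions.

        import torch
        import torch.nn as nn

        def dae_impute(X, hidden=16, epochs=500, corrupt_p=0.2, lr=1e-2):
            # X: 2-D float tensor with NaN marking missing entries.
            observed = ~torch.isnan(X)
            X0 = torch.nan_to_num(X, nan=0.0)  # zero-fill gaps for the input
            model = nn.Sequential(nn.Linear(X.shape[1], hidden), nn.ReLU(),
                                  nn.Linear(hidden, X.shape[1]))
            opt = torch.optim.Adam(model.parameters(), lr=lr)
            for _ in range(epochs):
                # Corrupt a random subset of the observed cells...
                drop = (torch.rand_like(X0) < corrupt_p) & observed
                output = model(X0 * ~drop)
                # ...and minimize reconstruction error on observed data only.
                loss = ((output - X0)[observed] ** 2).mean()
                opt.zero_grad(); loss.backward(); opt.step()
            with torch.no_grad():
                return torch.where(observed, X, model(X0))  # fill gaps only

        # Calling dae_impute repeatedly with different seeds yields multiple
        # imputations, in the spirit of (but not identical to) MIDAS.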

  6. Missing data in the analysis of multilevel and dependent data (Examples)

    • data.niaid.nih.gov
    Updated Jul 20, 2023
    + more versions
    Cite
    Simon Grund; Oliver Lüdtke; Alexander Robitzsch (2023). Missing data in the analysis of multilevel and dependent data (Examples) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7773613
    Explore at:
    Dataset updated
    Jul 20, 2023
    Dataset provided by
    University of Hamburg
    IPN - Leibniz Institute for Science and Mathematics Education
    Authors
    Simon Grund; Oliver Lüdtke; Alexander Robitzsch
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Example data sets and computer code for the book chapter titled "Missing Data in the Analysis of Multilevel and Dependent Data" submitted for publication in the second edition of "Dependent Data in Social Science Research" (Stemmler et al., 2015). This repository includes the computer code (".R") and the data sets from both example analyses (Examples 1 and 2). The data sets are available in two file formats (binary ".rda" for use in R; plain-text ".dat").

    The data sets contain simulated data from 23,376 (Example 1) and 23,072 (Example 2) individuals from 2,000 groups on four variables:

    • ID = group identifier (1-2000)

    • x = numeric (Level 1)

    • y = numeric (Level 1)

    • w = binary (Level 2)

    In all data sets, missing values are coded as "NA".
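    The plain-text files can therefore be loaded directly; a minimal pandas sketch (the file name is hypothetical, the NA coding is as stated above):

        import pandas as pd

        example1 = pd.read_csv("example1.dat", sep=r"\s+", na_values="NA")
        print(example1[["x", "y", "w"]].isna().mean())  # fraction missing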

  7. Data from: Missing data estimation in morphometrics: how much is too much?

    • narcis.nl
    • data.niaid.nih.gov
    • +1more
    Updated Dec 5, 2013
    Cite
    Clavel, Julien; Merceron, Gildas; Escarguel, Gilles (2013). Data from: Missing data estimation in morphometrics: how much is too much? [Dataset]. http://doi.org/10.5061/dryad.f0b50
    Explore at:
    Dataset updated
    Dec 5, 2013
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    Clavel, Julien; Merceron, Gildas; Escarguel, Gilles
    Description

    Fossil-based estimates of diversity and evolutionary dynamics mainly rely on the study of morphological variation. Unfortunately, organism remains are often altered by post-mortem taphonomic processes such as weathering or distortion. Such loss of information often prevents quantitative multivariate description and statistically controlled comparison of extinct species based on morphometric data. A common way to deal with missing data involves imputation methods that directly fill in the missing cases with model estimates. In recent years, several empirically determined thresholds for the maximum acceptable proportion of missing values have been proposed in the literature, whereas other studies have shown that this limit actually depends on several properties of the study dataset and of the selected imputation method, and is in no way generalizable. We evaluate the relative performances of seven multiple imputation techniques through a simulation-based analysis under three distinct patterns of missing data distribution. Overall, Fully Conditional Specification and Expectation-Maximization algorithms provide the best compromises between imputation accuracy and coverage probability. Multiple imputation (MI) techniques appear remarkably robust to the violation of basic assumptions, such as the occurrence of taxonomically or anatomically biased patterns of missing data distribution, making differences in simulation results between the three patterns of missing data distribution much smaller than differences between the individual MI techniques. Based on these results, rather than proposing a new (set of) threshold value(s), we develop an approach combining the use of multiple imputations with procrustean superimposition of principal component analysis results, in order to directly visualize the effect of individual missing data imputation on an ordinated space. We provide an R function for users to implement the proposed procedure.

  8. Retail Product Dataset with Missing Values

    • kaggle.com
    zip
    Updated Feb 17, 2025
    Cite
    Himel Sarder (2025). Retail Product Dataset with Missing Values [Dataset]. https://www.kaggle.com/datasets/himelsarder/retail-product-dataset-with-missing-values
    Explore at:
    Available download formats: zip (47,826 bytes)
    Dataset updated
    Feb 17, 2025
    Authors
    Himel Sarder
    License

    CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This synthetic dataset contains 4,362 rows and five columns, including both numerical and categorical data. It is designed for data cleaning, imputation, and analysis tasks, featuring structured missing values at varying percentages (63%, 4%, 47%, 31%, and 9%).

    The dataset includes:
    - Category (Categorical): Product category (A, B, C, D)
    - Price (Numerical): Randomized product prices
    - Rating (Numerical): Ratings between 1 and 5
    - Stock (Categorical): Availability status (In Stock, Out of Stock)
    - Discount (Numerical): Discount percentage

    This dataset is ideal for practicing missing data handling, exploratory data analysis (EDA), and machine learning preprocessing.
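    A typical first step with a practice dataset like this is simply to audit the missingness per column; a short pandas sketch (the local file name is an assumption):

        import pandas as pd

        df = pd.read_csv("retail_products.csv")  # hypothetical local path

        print(df.isna().mean().mul(100).round(1))      # percent missing per column
        print(df["Stock"].value_counts(dropna=False))  # categories incl. NaN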

  9. Data from: loan Prediction

    • kaggle.com
    zip
    Updated Jan 12, 2022
    Cite
    Deep Jani (2022). loan Prediction [Dataset]. https://www.kaggle.com/deepjani/ipl-matches
    Explore at:
    Available download formats: zip (5,197 bytes)
    Dataset updated
    Jan 12, 2022
    Authors
    Deep Jani
    Description

    Dataset

    This dataset was created by Deep Jani

    Released under Data files © Original Authors


  10. Data from: Revisiting the Thorny Issue of Missing Values in Single-Cell...

    • acs.figshare.com
    zip
    Updated Aug 2, 2023
    Cite
    Christophe Vanderaa; Laurent Gatto (2023). Revisiting the Thorny Issue of Missing Values in Single-Cell Proteomics [Dataset]. http://doi.org/10.1021/acs.jproteome.3c00227.s001
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 2, 2023
    Dataset provided by
    ACS Publications
    Authors
    Christophe Vanderaa; Laurent Gatto
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Missing values are a notable challenge when analyzing mass spectrometry-based proteomics data. While the field is still actively debating best practices, the challenge has grown with the emergence of mass spectrometry-based single-cell proteomics and its dramatic increase in missing values. A popular approach to deal with missing values is to perform imputation. Imputation has several drawbacks for which alternatives exist, but it is currently still a practical solution widely adopted in single-cell proteomics data analysis. This perspective discusses the advantages and drawbacks of imputation. We also highlight five main challenges linked to missing-value management in single-cell proteomics. Future developments should aim to solve these challenges, whether through imputation or data modeling. The perspective concludes with recommendations for reporting missing values, for reporting methods that deal with missing values, and for proper encoding of missing values.

  11. LTMP analysis 11-year versus 25-year with missing data

    • figshare.com
    txt
    Updated Nov 6, 2025
    Cite
    Alfonso Ruiz Moreno (2025). LTMP analysis 11-year versus 25-year with missing data [Dataset]. http://doi.org/10.6084/m9.figshare.28785908.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Nov 6, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Alfonso Ruiz Moreno
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains all the scripts and data used in the analysis of the LTMP data presented in the manuscript "Longer time series with missing data improve parameter estimation in State-Space models in coral reef fish communities". There are 22 files in total.

    All model fits were run on the HPC cluster at James Cook University. The model fit to the 11-year time series took approximately 3-5 days, and the model fit to the 25-year time series took approximately 10-12 days. The model fits themselves are not included because they are large files (~12-30 GB), but they can be obtained by running the corresponding scripts.

    LTMP data and data wrangling

    • LTMP_data_1995_2005_prop_zero_40sp.RData: file containing 45 columns. The first column, Year, contains the year of each observation in the dataset. The second column, Reef, contains the reef name, while latitude and longitude are stored in the third and fourth columns, Reef_lat and Reef_long. The fifth column, Shelf, codes the reef shelf position as I (inner shelf), M (middle shelf), or O (outer shelf). The remaining columns contain the counts of the 40 species with the lowest proportion of zeros in the LTMP data. Covers 1995 to 2005.

    • LTMP_data_1995_2019_prop_zero_40sp.RData: same structure as above, but for the time series from 1995 to 2019 (includes NAs in some of the abundance counts).

    • dw_11y_Pomacentrids.R and dw_25yNA_Pomacentrids.R: scripts that order species into pomacentrids and non-pomacentrids so the models can be fitted to the data. They produce the data files LTMP_data_1995_2005_prop_zero_40sp_Pomacentrids.RData and LTMP_data_1995_2019_prop_zero_40sp_PomacentridsNA.RData.

    Model fitting

    • LTMP_fit_40sp.R: fits the model to the 11-year time series. Input: LTMP_data_1995_2005_prop_zero_40sp_Pomacentrids.RData; output: LTMP_fit_40sp.RData.

    • LTMP_fit_40sp_NA.R: fits the model to the 25-year time series with missing data. Input: LTMP_data_1995_2019_prop_zero_40sp_PomacentridsNA.RData; output: LTMP_fit_40sp_NA.RData.

    Stan models

    • MARPLN_LV_Pomacentrids.stan: Stan code for the multivariate autoregressive Poisson-Lognormal model with latent variables.

    • MARPLN_LV_Pomacentrids_NA.stan: Stan code for the same model, extended to handle missing data.

    Figures

    • Figure 1 A and B.R and Figure 4.R produce the corresponding figures in the main text. Note that Figure 1 A and B.R requires several files to produce the GBR and Australia maps: Great_Barrier_Reef_Features.cpg, Great_Barrier_Reef_Features.dbf, Great_Barrier_Reef_Features.lyr, Great_Barrier_Reef_Features.shp.xml, Great_Barrier_Reef_Features.prj, Great_Barrier_Reef_Features.sbn, Great_Barrier_Reef_Features.sbx, Great_Barrier_Reef_Features.shp, Great_Barrier_Reef_Features.shx, and Reef_lat_long.csv.

  12. Replication Data for: "The Missing Dimension of the Political Resource...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 22, 2023
    Cite
    Ranjit, Lall (2023). Replication Data for: "The Missing Dimension of the Political Resource Curse Debate" (Comparative Political Studies) [Dataset]. http://doi.org/10.7910/DVN/UHABC6
    Explore at:
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Ranjit, Lall
    Description

    Abstract: Given the methodological sophistication of the debate over the “political resource curse”—the purported negative relationship between natural resource wealth (in particular oil wealth) and democracy—it is surprising that scholars have not paid more attention to the basic statistical issue of how to deal with missing data. This article highlights the problems caused by the most common strategy for analyzing missing data in the political resource curse literature—listwise deletion—and investigates how addressing such problems through the best-practice technique of multiple imputation affects empirical results. I find that multiple imputation causes the results of a number of influential recent studies to converge on a key common finding: A political resource curse does exist, but only since the widespread nationalization of petroleum industries in the 1970s. This striking finding suggests that much of the controversy over the political resource curse has been caused by a neglect of missing-data issues.

  13. Data from: Sample-wise Combined Missing Effect Model with Penalization

    • tandf.figshare.com
    bin
    Updated Feb 14, 2024
    Cite
    Jialu Li; Guan Yu; Qizhai Li; Yufeng Liu (2024). Sample-wise Combined Missing Effect Model with Penalization [Dataset]. http://doi.org/10.6084/m9.figshare.19651419.v1
    Explore at:
    Available download formats: bin
    Dataset updated
    Feb 14, 2024
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Jialu Li; Guan Yu; Qizhai Li; Yufeng Liu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Modern high-dimensional statistical inference often faces the problem of missing data. In recent decades, many studies have focused on this topic and provided strategies including complete-sample analysis and imputation procedures. However, complete-sample analysis discards the information in incomplete samples, while imputation procedures accumulate errors from each single imputation. In this paper, we propose a new method, the Sample-wise COmbined missing effect Model with penalization (SCOM), to deal with missing data occurring in predictors. Instead of imputing the predictors, SCOM estimates the combined effect caused by all missing data for each incomplete sample. SCOM makes full use of all available data and is robust with respect to various missing mechanisms. Theoretical studies show an oracle inequality for the proposed estimator, and the consistency of variable selection and combined-missing-effect selection. Simulation studies and an application to the Residential Building Data also illustrate the effectiveness of the proposed SCOM.

  14. MacroPCA: An All-in-One PCA Method Allowing for Missing Values as Well as...

    • tandf.figshare.com
    pdf
    Updated Jun 2, 2023
    Cite
    Mia Hubert; Peter J. Rousseeuw; Wannes Van den Bossche (2023). MacroPCA: An All-in-One PCA Method Allowing for Missing Values as Well as Cellwise and Rowwise Outliers [Dataset]. http://doi.org/10.6084/m9.figshare.7624424.v2
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Mia Hubert; Peter J. Rousseeuw; Wannes Van den Bossche
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Multivariate data are typically represented by a rectangular matrix (table) in which the rows are the objects (cases) and the columns are the variables (measurements). When there are many variables one often reduces the dimension by principal component analysis (PCA), which in its basic form is not robust to outliers. Much research has focused on handling rowwise outliers, that is, rows that deviate from the majority of the rows in the data (e.g., they might belong to a different population). In recent years also cellwise outliers are receiving attention. These are suspicious cells (entries) that can occur anywhere in the table. Even a relatively small proportion of outlying cells can contaminate over half the rows, which causes rowwise robust methods to break down. In this article, a new PCA method is constructed which combines the strengths of two existing robust methods to be robust against both cellwise and rowwise outliers. At the same time, the algorithm can cope with missing values. As of yet it is the only PCA method that can deal with all three problems simultaneously. Its name MacroPCA stands for PCA allowing for Missingness And Cellwise & Rowwise Outliers. Several simulations and real datasets illustrate its robustness. New residual maps are introduced, which help to determine which variables are responsible for the outlying behavior. The method is well-suited for online process control.

  15. Handling of Missing Data Induced by Time-Varying Covariates in Comparative...

    • icpsr.umich.edu
    Updated Oct 9, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Desai, Manisha (2025). Handling of Missing Data Induced by Time-Varying Covariates in Comparative Effectiveness Research HIV Patients [Methods Study], 2013-2018 [Dataset]. http://doi.org/10.3886/ICPSR39528.v1
    Explore at:
    Dataset updated
    Oct 9, 2025
    Dataset provided by
    Inter-university Consortium for Political and Social Research (https://www.icpsr.umich.edu/web/pages/)
    Authors
    Desai, Manisha
    License

    https://www.icpsr.umich.edu/web/ICPSR/studies/39528/terms

    Time period covered
    2013 - 2018
    Description

    Researchers can use data from health registries or electronic health records to compare two or more treatments. Registries store data about patients with a specific health problem. These data include how well those patients respond to treatments and information about patient traits, such as age, weight, or blood pressure. But sometimes data about patient traits are missing. Missing data about patient traits can lead to incorrect study results, especially when traits change over time. For example, weight can change over time, and the patient may not report their weight at some points along the way. Researchers use statistical methods to fill in these missing data. In this study, the research team compared a new statistical method to fill in missing data with traditional methods. Traditional methods remove patients with missing data or fill in each missing number with a single estimate. The new method creates multiple possible estimates to fill in each missing number. To access the methods, software, and R package, please visit the SimulateCER GitHub and SimTimeVar CRAN website.

  16. Handle Missing Values

    • kaggle.com
    zip
    Updated Oct 24, 2020
    Cite
    Safacan Metin (2020). Handle Missing Values [Dataset]. https://www.kaggle.com/safacanmetin/handle-missing-values
    Explore at:
    Available download formats: zip (1,806 bytes)
    Dataset updated
    Oct 24, 2020
    Authors
    Safacan Metin
    Description

    Dataset

    This dataset was created by Safacan Metin


  17. Models and predictions for "How to deal w_ missing input data"

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    • +1more
    Updated Mar 15, 2025
    Cite
    Gauch, Martin (2025). Models and predictions for "How to deal w_ missing input data" [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_15008460
    Explore at:
    Dataset updated
    Mar 15, 2025
    Dataset provided by
    Google Research
    Authors
    Gauch, Martin
    Description

    How to deal w_ missing input data

    This repository contains the models, configs, and results files for the paper Gauch et al., "How to deal w_ missing input data".

    The corresponding analysis code is available on GitHub: https://github.com/gauchm/missing-inputs.

    Contents of this repository

    missing-inputs.ipynb -- Jupyter notebook to reproduce figures from the paper.

    results/ -- Folder with model weights, configs, and predictions used in missing-inputs.ipynb.

    patches/ -- Contains patches for local modifications to reproduce experiments from the paper.

    Required setup

    1. Clone neuralhydrology: git clone https://github.com/neuralhydrology/neuralhydrology.git

    2. Install an editable version of neuralhydrology: cd neuralhydrology && pip install -e .

    3. Download the following data:

    • the CAMELS US dataset (CAMELS time series meteorology, observed flow, metadata, version 1.2) from NCAR into a data directory (this has to match data_dir in the config files);

    • the extended Maurer and NLDAS forcings available on HydroShare (Maurer, NLDAS);

    • the models, results, and config files from this paper, available in this Zenodo repository.

    Note that to reproduce the experiments, local modifications to NeuralHydrology are necessary. To do so, apply the patches in the patches/ directory: git apply patches/experiment-N.patch

  18. Effect of missing data on topological inference using a total evidence...

    • figshare.com
    bin
    Updated May 31, 2023
    Cite
    Thomas Guillerme; Natalie Cooper (2023). Effect of missing data on topological inference using a total evidence approach [Dataset]. http://doi.org/10.6084/m9.figshare.1306861.v1
    Explore at:
    Available download formats: bin
    Dataset updated
    May 31, 2023
    Dataset provided by
    figshare
    Authors
    Thomas Guillerme; Natalie Cooper
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    To fully understand macroevolutionary patterns and processes, we need to include both extant and extinct species in our models. This requires phylogenetic trees with both living and fossil taxa at the tips. One way to infer such phylogenies is the Total Evidence approach, which uses molecular data from living taxa and morphological data from living and fossil taxa.

    Although the Total Evidence approach is very promising, it requires a great deal of data that can be hard to collect. Therefore this method is likely to suffer from missing data issues that may affect its ability to infer correct phylogenies.

    Here we use simulations to assess the effects of missing data on tree topologies inferred from Total Evidence matrices. We investigate three major factors that directly affect the completeness and the size of the morphological part of the matrix: the proportion of living taxa with no morphological data, the amount of missing data in the fossil record, and the overall number of morphological characters in the matrix. We infer phylogenies from complete matrices and from matrices with various amounts of missing data, and then compare missing data topologies to the "best" tree topology inferred using the complete matrix.

    We find that the number of living taxa with morphological characters and the overall number of morphological characters in the matrix are more important than the amount of missing data in the fossil record for recovering the "best" tree topology. Therefore, we suggest that sampling effort should be focused on morphological data collection for living species to increase the accuracy of topological inference in a Total Evidence framework. Additionally, we find that Bayesian methods consistently outperform other tree inference methods. We therefore recommend using Bayesian consensus trees to fix the tree topology prior to further analyses.

  19. a guide to handle missing values for ML Model

    • kaggle.com
    zip
    Updated Feb 10, 2025
    Cite
    Feroz Shinwari (2025). a guide to handle missing values for ML Model [Dataset]. https://www.kaggle.com/datasets/ferozshahshinwari/a-guide-to-handle-missing-values-for-ml-model/code
    Explore at:
    Available download formats: zip (36,646 bytes)
    Dataset updated
    Feb 10, 2025
    Authors
    Feroz Shinwari
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Feroz Shinwari

    Released under Apache 2.0


  20. Additional file 1 of Enabling network inference methods to handle missing...

    • springernature.figshare.com
    txt
    Updated Jun 1, 2023
    Cite
    Abel Folch-Fortuny; Alejandro Villaverde; Alberto Ferrer; Julio Banga (2023). Additional file 1 of Enabling network inference methods to handle missing data and outliers [Dataset]. http://doi.org/10.6084/m9.figshare.c.3611252_D1.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Abel Folch-Fortuny; Alejandro Villaverde; Alberto Ferrer; Julio Banga
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    TSR: the source code of the trimmed scores regression (TSR) algorithm for missing data imputation, provided ready to use in MATLAB. (M, 2 kb)
