70 datasets found
  1. Water-quality data imputation with a high percentage of missing values: a...

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Jun 8, 2021
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Rafael Rodríguez; Rafael Rodríguez; Marcos Pastorini; Marcos Pastorini; Lorena Etcheverry; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Alberto Castro; Angela Gorgoglione; Angela Gorgoglione; Christian Chreties; Mónica Fossati (2021). Water-quality data imputation with a high percentage of missing values: a machine learning approach [Dataset]. http://doi.org/10.5281/zenodo.4731169
    Explore at:
    csvAvailable download formats
    Dataset updated
    Jun 8, 2021
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Rafael Rodríguez; Rafael Rodríguez; Marcos Pastorini; Marcos Pastorini; Lorena Etcheverry; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Alberto Castro; Angela Gorgoglione; Angela Gorgoglione; Christian Chreties; Mónica Fossati
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The monitoring of surface-water quality followed by water-quality modeling and analysis is essential for generating effective strategies in water resource management. However, water-quality studies are limited by the lack of complete and reliable data sets on surface-water-quality variables. These deficiencies are particularly noticeable in developing countries.

    This work focuses on surface-water-quality data from Santa Lucía Chico river (Uruguay), a mixed lotic and lentic river system. Data collected at six monitoring stations are publicly available at https://www.dinama.gub.uy/oan/datos-abiertos/calidad-agua/. The high temporal and spatial variability that characterizes water-quality variables and the high rate of missing values (between 50% and 70%) raises significant challenges.

    To deal with missing values, we applied several statistical and machine-learning imputation methods. The competing algorithms implemented belonged to both univariate and multivariate imputation methods (inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR), and K-nearest neighbors Regressor (KNNR)).

    IDW outperformed the others, achieving a very good performance (NSE greater than 0.8) in most cases.

    In this dataset, we include the original and imputed values for the following variables:

    • Water temperature (Tw)

    • Dissolved oxygen (DO)

    • Electrical conductivity (EC)

    • pH

    • Turbidity (Turb)

    • Nitrite (NO2-)

    • Nitrate (NO3-)

    • Total Nitrogen (TN)

    Each variable is identified as [STATION] VARIABLE FULL NAME (VARIABLE SHORT NAME) [UNIT METRIC].

    More details about the study area, the original datasets, and the methodology adopted can be found in our paper https://www.mdpi.com/2071-1050/13/11/6318.

    If you use this dataset in your work, please cite our paper:
    Rodríguez, R.; Pastorini, M.; Etcheverry, L.; Chreties, C.; Fossati, M.; Castro, A.; Gorgoglione, A. Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach. Sustainability 2021, 13, 6318. https://doi.org/10.3390/su13116318

  2. z

    Missing data in the analysis of multilevel and dependent data (Example data...

    • zenodo.org
    bin
    Updated Jul 20, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Simon Grund; Simon Grund; Oliver Lüdtke; Oliver Lüdtke; Alexander Robitzsch; Alexander Robitzsch (2023). Missing data in the analysis of multilevel and dependent data (Example data sets) [Dataset]. http://doi.org/10.5281/zenodo.7773614
    Explore at:
    binAvailable download formats
    Dataset updated
    Jul 20, 2023
    Dataset provided by
    Springer
    Authors
    Simon Grund; Simon Grund; Oliver Lüdtke; Oliver Lüdtke; Alexander Robitzsch; Alexander Robitzsch
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Example data sets for the book chapter titled "Missing Data in the Analysis of Multilevel and Dependent Data" submitted for publication in the second edition of "Dependent Data in Social Science Research" (Stemmler et al., 2015). This repository includes the data sets used in both example analyses (Examples 1 and 2) in two file formats (binary ".rda" for use in R; plain-text ".dat").

    The data sets contain simulated data from 23,376 (Example 1) and 23,072 (Example 2) individuals from 2,000 groups on four variables:

    ID = group identifier (1-2000)
    x = numeric (Level 1)
    y = numeric (Level 1)
    w = binary (Level 2)

    In all data sets, missing values are coded as "NA".

  3. H

    A Complete Aerosol Optical Depth Dataset with High Spatiotemporal Resolution...

    • dataverse.harvard.edu
    Updated Jan 19, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Lianfa, Li; Jiajie, Wu (2021). A Complete Aerosol Optical Depth Dataset with High Spatiotemporal Resolution for Mainland China [Dataset]. http://doi.org/10.7910/DVN/RNSWRH
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 19, 2021
    Dataset provided by
    Harvard Dataverse
    Authors
    Lianfa, Li; Jiajie, Wu
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2015 - Dec 31, 2018
    Area covered
    China
    Description

    We share the complete aerosol optical depth dataset with high spatial (1x1km^2) and temporal (daily) resolution and the Beijing 1954 projection (https://epsg.io/2412) for mainland China (2015-2018). The original aerosol optical depth images are from Multi-Angle Implementation of Atmospheric Correction Aerosol Optical Depth (MAIAC AOD) (https://lpdaac.usgs.gov/products/mcd19a2v006/) with the similar spatiotemporal resolution and the sinusoidal projection (https://en.wikipedia.org/wiki/Sinusoidal_projection). After projection conversion, eighteen tiles of MAIAC AOD were merged to obtain a large image of AOD covering the entire area of mainland China. Due to the conditions of clouds and high surface reflectance, each original MAIAC AOD image usually has many missing values, and the average missing percentage of each AOD image may exceed 60%. Such a high percentage of missing values severely limits applicability of the original MAIAC AOD dataset product. We used the sophisticated method of full residual deep networks (Li et al, 2020, https://ieeexplore.ieee.org/document/9186306) to impute the daily missing MAIAC AOD, thus obtaining the complete (no missing values) high-resolution AOD data product covering mainland China. The covariates used in imputation included coordinates, elevation, MERRA2 coarse-resolution PBLH and AOD variables, cloud fraction, high-resolution meteorological variables (air pressure, air temperature, relative humidity and wind speed) and/or time index etc. Ground monitoring data were used to generate high-resolution meteorological variables to ensure the reliability of interpolation. Overall, our daily imputation models achieved an average training R^2 of 0.90 with a range of 0.75 to 0.97 (average RMSE: 0.075, with a range of 0.026 to 0.32) and an average test R^2 of 0.90 with a range of 0.75 to 0.97 (average RMSE: 0.075 with a range of 0.026 to 0.32). With almost no difference between training metrics and test metrics, the high test R^2 and low test RMSE show the reliability of AOD imputation. In the evaluation using the ground AOD data from the monitoring stations of the Aerosol Robot Network (AERONET) in mainland China, our method obtained a R^2 of 0.78 and RMSE of 0.27, which further illustrated the reliability of the method. This database contains four datasets: - Daily complete high-resolution AOD image dataset for mainland China from January 1, 2015 to December 31, 2018. The archived resources contain 1461 images stored in 1461 files, and 3 summary Excel files. The table “CHN_AOD_INFO.xlsx” describing the properties of the 1461 images, including projection, training R^2 and RMSE, testing R^2 and RMSE, minmum, mean, median and maximum AOD that we predicted. - The table “Model_and_Accuracy_of_Meteorological_Elements.xlsx” describing the statistics of performance metrics in interpolation of high-resolution meteorological dataset. - The table “Evaluation_Using_AERONET_AOD.xlsx” showing the evaluation result of AERONET, including R^2, RMSE, and monitoring information used in this study.

  4. S

    Deep learning based Missing Data Imputation

    • scidb.cn
    Updated Mar 4, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Mahjabeen Tahir (2024). Deep learning based Missing Data Imputation [Dataset]. http://doi.org/10.57760/sciencedb.16599
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 4, 2024
    Dataset provided by
    Science Data Bank
    Authors
    Mahjabeen Tahir
    Description

    The code provided is related to training an autoencoder, evaluating its performance, and using it for imputing missing values in a dataset. Let's break down each part:Training the Autoencoder (train_autoencoder function):This function takes an autoencoder model and the input features as input.It trains the autoencoder using the input features as both input and target output (hence features, features).The autoencoder is trained for a specified number of epochs (epochs) with a given batch size (batch_size).The shuffle=True argument ensures that the data is shuffled before each epoch to prevent the model from memorizing the input order.After training, it returns the trained autoencoder model and the training history.Evaluating the Autoencoder (evaluate_autoencoder function):This function takes a trained autoencoder model and the input features as input.It uses the trained autoencoder to predict the reconstructed features from the input features.It calculates Mean Squared Error (MSE), Mean Absolute Error (MAE), and R-squared (R2) scores between the original and reconstructed features.These metrics provide insights into how well the autoencoder is able to reconstruct the input features.Imputing with the Autoencoder (impute_with_autoencoder function):This function takes a trained autoencoder model and the input features as input.It identifies missing values (e.g., -9999) in the input features.For each row with missing values, it predicts the missing values using the trained autoencoder.It replaces the missing values with the predicted values.The imputed features are returned as output.To reuse this code:Load your dataset and preprocess it as necessary.Build an autoencoder model using the build_autoencoder function.Train the autoencoder using the train_autoencoder function with your input features.Evaluate the performance of the autoencoder using the evaluate_autoencoder function.If your dataset contains missing values, use the impute_with_autoencoder function to impute them with the trained autoencoder.Use the trained autoencoder for any other relevant tasks, such as feature extraction or anomaly detection.

  5. Data from: Benchmarking imputation methods for categorical biological data

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Mar 10, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Matthieu Gendre; Torsten Hauffe; Torsten Hauffe; Catalina Pimiento; Catalina Pimiento; Daniele Silvestro; Daniele Silvestro; Matthieu Gendre (2024). Benchmarking imputation methods for categorical biological data [Dataset]. http://doi.org/10.5281/zenodo.10800016
    Explore at:
    zipAvailable download formats
    Dataset updated
    Mar 10, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Matthieu Gendre; Torsten Hauffe; Torsten Hauffe; Catalina Pimiento; Catalina Pimiento; Daniele Silvestro; Daniele Silvestro; Matthieu Gendre
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Mar 9, 2024
    Description

    Description:

    Welcome to the Zenodo repository for Publication Benchmarking imputation methods for categorical biological data, a comprehensive collection of datasets and scripts utilized in our research endeavors. This repository serves as a vital resource for researchers interested in exploring the empirical and simulated analyses conducted in our study.

    Contents:

    1. empirical_analysis:

      • Trait Dataset of Elasmobranchs: A collection of trait data for elasmobranch species obtained from FishBase , stored as RDS file.
      • Phylogenetic Tree: A phylogenetic tree stored as a TRE file.
      • Imputations Replicates (Imputation): Replicated imputations of missing data in the trait dataset, stored as RData files.
      • Error Calculation (Results): Error calculation results derived from imputed datasets, stored as RData files.
      • Scripts: Collection of R scripts used for the implementation of empirical analysis.
    2. simulation_analysis:

      • Input Files: Input files utilized for simulation analyses as CSV files
      • Data Distribution PDFs: PDF files displaying the distribution of simulated data and the missingness.
      • Output Files: Simulated trait datasets, trait datasets with missing data, and trait imputed datasets with imputation errors calculated as RData files.
      • Scripts: Collection of R scripts used for the simulation analysis.
    3. TDIP_package:

      • Scripts of the TDIP Package: All scripts related to the Trait Data Imputation with Phylogeny (TDIP) R package used in the analyses.

    Purpose:

    This repository aims to provide transparency and reproducibility to our research findings by making the datasets and scripts publicly accessible. Researchers interested in understanding our methodologies, replicating our analyses, or building upon our work can utilize this repository as a valuable reference.

    Citation:

    When using the datasets or scripts from this repository, we kindly request citing Publication Benchmarking imputation methods for categorical biological data and acknowledging the use of this Zenodo repository.

    Thank you for your interest in our research, and we hope this repository serves as a valuable resource in your scholarly pursuits.

  6. f

    R scripts used for Monte Carlo simulations and data analyses.

    • plos.figshare.com
    zip
    Updated Jan 19, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Lateef Babatunde Amusa; Twinomurinzi Hossana (2024). R scripts used for Monte Carlo simulations and data analyses. [Dataset]. http://doi.org/10.1371/journal.pone.0297037.s001
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jan 19, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Lateef Babatunde Amusa; Twinomurinzi Hossana
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    R scripts used for Monte Carlo simulations and data analyses.

  7. n

    Data from: Missing data estimation in morphometrics: how much is too much?

    • narcis.nl
    • data.niaid.nih.gov
    • +1more
    Updated Dec 5, 2013
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Clavel, Julien; Merceron, Gildas; Escarguel, Gilles (2013). Data from: Missing data estimation in morphometrics: how much is too much? [Dataset]. http://doi.org/10.5061/dryad.f0b50
    Explore at:
    Dataset updated
    Dec 5, 2013
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    Clavel, Julien; Merceron, Gildas; Escarguel, Gilles
    Description

    Fossil-based estimates of diversity and evolutionary dynamics mainly rely on the study of morphological variation. Unfortunately, organism remains are often altered by post-mortem taphonomic processes such as weathering or distortion. Such a loss of information often prevents quantitative multivariate description and statistically controlled comparisons of extinct species based on morphometric data. A common way to deal with missing data involves imputation methods that directly fill the missing cases with model estimates. Over the last several years, several empirically determined thresholds for the maximum acceptable proportion of missing values have been proposed in the literature, whereas other studies showed that this limit actually depends on several properties of the study dataset and of the selected imputation method, and is by no way generalizable. We evaluate the relative performances of seven multiple imputation techniques through a simulation-based analysis under three distinct patterns of missing data distribution. Overall, Fully Conditional Specification and Expectation-Maximization algorithms provide the best compromises between imputation accuracy and coverage probability. Multiple imputation (MI) techniques appear remarkably robust to the violation of basic assumptions such as the occurrence of taxonomically or anatomically biased patterns of missing data distribution, making differences in simulation results between the three patterns of missing data distribution much smaller than differences between the individual MI techniques. Based on these results, rather than proposing a new (set of) threshold value(s), we develop an approach combining the use of multiple imputations with procrustean superimposition of principal component analysis results, in order to directly visualize the effect of individual missing data imputation on an ordinated space. We provide an R function for users to implement the proposed procedure.

  8. TidY_PracticE_DatasetS

    • kaggle.com
    zip
    Updated Jun 24, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    DEBALINA MITRA (2023). TidY_PracticE_DatasetS [Dataset]. https://www.kaggle.com/datasets/debalinamitra/tidy-practice-datasets
    Explore at:
    zip(139335 bytes)Available download formats
    Dataset updated
    Jun 24, 2023
    Authors
    DEBALINA MITRA
    Description

    Original dataset that is shared on Github can be found here. These are hands on practice datasets that were linked through the Coursera Guided Project Certificate Course for Handling Missing Values in R, a part of Coursera Project Network. The datasets links were shared by the original author and instructor of the course Arimoro Olayinka Imisioluwa.

    Things you could do with this dataset: As a beginner in R, these datasets helped me to get a hang over making data clean and tidy and handling missing values(only numeric) using R. Good for anyone looking for a beginner to intermediate level understanding in these subjects.

    Here are my notebooks as kernels using these datasets and using a few more preloaded datasets in R, as suggested by the instructor. TidY DatA Practice MissinG DatA HandlinG - NumeriC

  9. Additional file 5 of Heckman imputation models for binary or continuous MNAR...

    • springernature.figshare.com
    • datasetcatalog.nlm.nih.gov
    txt
    Updated May 30, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jacques-Emmanuel Galimard; Sylvie Chevret; Emmanuel Curis; Matthieu Resche-Rigon (2023). Additional file 5 of Heckman imputation models for binary or continuous MNAR outcomes and MAR predictors [Dataset]. http://doi.org/10.6084/m9.figshare.7038107.v1
    Explore at:
    txtAvailable download formats
    Dataset updated
    May 30, 2023
    Dataset provided by
    figshare
    Figsharehttp://figshare.com/
    Authors
    Jacques-Emmanuel Galimard; Sylvie Chevret; Emmanuel Curis; Matthieu Resche-Rigon
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    R code to impute continuous outcome. (R 1 kb)

  10. Description of scenario analyses to test assumptions related to missing data...

    • plos.figshare.com
    • figshare.com
    xls
    Updated Jun 8, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Emilie D. Duchesneau; Shahar Shmuel; Keturah R. Faurot; Allison Musty; Jihye Park; Til Stürmer; Alan C. Kinlaw; Yang Claire Yang; Jennifer L. Lund (2023). Description of scenario analyses to test assumptions related to missing data and loss-to-follow-up. [Dataset]. http://doi.org/10.1371/journal.pone.0286984.t001
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Emilie D. Duchesneau; Shahar Shmuel; Keturah R. Faurot; Allison Musty; Jihye Park; Til Stürmer; Alan C. Kinlaw; Yang Claire Yang; Jennifer L. Lund
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description of scenario analyses to test assumptions related to missing data and loss-to-follow-up.

  11. f

    A Simple Optimization Workflow to Enable Precise and Accurate Imputation of...

    • datasetcatalog.nlm.nih.gov
    • acs.figshare.com
    • +1more
    Updated May 3, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dabke, Kruttika; Jones, Michelle R.; Kreimer, Simion; Parker, Sarah J. (2021). A Simple Optimization Workflow to Enable Precise and Accurate Imputation of Missing Values in Proteomic Data Sets [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000907442
    Explore at:
    Dataset updated
    May 3, 2021
    Authors
    Dabke, Kruttika; Jones, Michelle R.; Kreimer, Simion; Parker, Sarah J.
    Description

    Missing values in proteomic data sets have real consequences on downstream data analysis and reproducibility. Although several imputation methods exist to handle missing values, no single imputation method is best suited for a diverse range of data sets, and no clear strategy exists for evaluating imputation methods for clinical DIA-MS data sets, especially at different levels of protein quantification. To navigate through the different imputation strategies available in the literature, we have established a strategy to assess imputation methods on clinical label-free DIA-MS data sets. We used three DIA-MS data sets with real missing values to evaluate eight imputation methods with multiple parameters at different levels of protein quantification: a dilution series data set, a small pilot data set, and a clinical proteomic data set comparing paired tumor and stroma tissue. We found that imputation methods based on local structures within the data, like local least-squares (LLS) and random forest (RF), worked well in our dilution series data set, whereas imputation methods based on global structures within the data, like BPCA, performed well in the other two data sets. We also found that imputation at the most basic protein quantification levelfragment levelimproved accuracy and the number of proteins quantified. With this analytical framework, we quickly and cost-effectively evaluated different imputation methods using two smaller complementary data sets to narrow down to the larger proteomic data set’s most accurate methods. This acquisition strategy allowed us to provide reproducible evidence of the accuracy of the imputation method, even in the absence of a ground truth. Overall, this study indicates that the most suitable imputation method relies on the overall structure of the data set and provides an example of an analytic framework that may assist in identifying the most appropriate imputation strategies for the differential analysis of proteins.

  12. Handling of Missing Data Induced by Time-Varying Covariates in Comparative...

    • icpsr.umich.edu
    Updated Oct 9, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Desai, Manisha (2025). Handling of Missing Data Induced by Time-Varying Covariates in Comparative Effectiveness Research HIV Patients [Methods Study], 2013-2018 [Dataset]. http://doi.org/10.3886/ICPSR39528.v1
    Explore at:
    Dataset updated
    Oct 9, 2025
    Dataset provided by
    Inter-university Consortium for Political and Social Researchhttps://www.icpsr.umich.edu/web/pages/
    Authors
    Desai, Manisha
    License

    https://www.icpsr.umich.edu/web/ICPSR/studies/39528/termshttps://www.icpsr.umich.edu/web/ICPSR/studies/39528/terms

    Time period covered
    2013 - 2018
    Description

    Researchers can use data from health registries or electronic health records to compare two or more treatments. Registries store data about patients with a specific health problem. These data include how well those patients respond to treatments and information about patient traits, such as age, weight, or blood pressure. But sometimes data about patient traits are missing. Missing data about patient traits can lead to incorrect study results, especially when traits change over time. For example, weight can change over time, and the patient may not report their weight at some points along the way. Researchers use statistical methods to fill in these missing data. In this study, the research team compared a new statistical method to fill in missing data with traditional methods. Traditional methods remove patients with missing data or fill in each missing number with a single estimate. The new method creates multiple possible estimates to fill in each missing number. To access the methods, software, and R package, please visit the SimulateCER GitHub and SimTimeVar CRAN website.

  13. Z

    Car Parts Dataset (without Missing Values)

    • data.niaid.nih.gov
    Updated Apr 1, 2021
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Godahewa, Rakshitha; Bergmeir, Christoph; Webb, Geoff (2021). Car Parts Dataset (without Missing Values) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3994910
    Explore at:
    Dataset updated
    Apr 1, 2021
    Dataset provided by
    PhD Student at Monash University
    Lecturer at Monash University
    Professor at Monash University
    Authors
    Godahewa, Rakshitha; Bergmeir, Christoph; Webb, Geoff
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains 2674 intermittent monthly time series that represent car parts sales from January 1998 to March 2002. It was extracted from R expsmooth package.

    The original dataset contains missing values and they have been replaced by zeros.

  14. Data from: Main Effects and Interactions in Mixed and Incomplete Data Frames...

    • tandf.figshare.com
    zip
    Updated Jun 1, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Geneviève Robin; Olga Klopp; Julie Josse; Éric Moulines; Robert Tibshirani (2023). Main Effects and Interactions in Mixed and Incomplete Data Frames [Dataset]. http://doi.org/10.6084/m9.figshare.8191850.v3
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Taylor & Francishttps://taylorandfrancis.com/
    Authors
    Geneviève Robin; Olga Klopp; Julie Josse; Éric Moulines; Robert Tibshirani
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A mixed data frame (MDF) is a table collecting categorical, numerical, and count observations. The use of MDF is widespread in statistics and the applications are numerous from abundance data in ecology to recommender systems. In many cases, an MDF exhibits simultaneously main effects, such as row, column, or group effects and interactions, for which a low-rank model has often been suggested. Although the literature on low-rank approximations is very substantial, with few exceptions, existing methods do not allow to incorporate main effects and interactions while providing statistical guarantees. The present work fills this gap. We propose an estimation method which allows to recover simultaneously the main effects and the interactions. We show that our method is near optimal under conditions which are met in our targeted applications. We also propose an optimization algorithm which provably converges to an optimal solution. Numerical experiments reveal that our method, mimi, performs well when the main effects are sparse and the interaction matrix has low-rank. We also show that mimi compares favorably to existing methods, in particular when the main effects are significantly large compared to the interactions, and when the proportion of missing entries is large. The method is available as an R package on the Comprehensive R Archive Network. Supplementary materials for this article are available online.

  15. t

    ESA CCI SM GAPFILLED Long-term Climate Data Record of Surface Soil Moisture...

    • researchdata.tuwien.ac.at
    • researchdata.tuwien.at
    zip
    Updated Sep 5, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Wolfgang Preimesberger; Wolfgang Preimesberger; Pietro Stradiotti; Pietro Stradiotti; Wouter Arnoud Dorigo; Wouter Arnoud Dorigo (2025). ESA CCI SM GAPFILLED Long-term Climate Data Record of Surface Soil Moisture from merged multi-satellite observations [Dataset]. http://doi.org/10.48436/3fcxr-cde10
    Explore at:
    zipAvailable download formats
    Dataset updated
    Sep 5, 2025
    Dataset provided by
    TU Wien
    Authors
    Wolfgang Preimesberger; Wolfgang Preimesberger; Pietro Stradiotti; Pietro Stradiotti; Wouter Arnoud Dorigo; Wouter Arnoud Dorigo
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    This dataset was produced with funding from the European Space Agency (ESA) Climate Change Initiative (CCI) Plus Soil Moisture Project (CCN 3 to ESRIN Contract No: 4000126684/19/I-NB "ESA CCI+ Phase 1 New R&D on CCI ECVS Soil Moisture"). Project website: https://climate.esa.int/en/projects/soil-moisture/

    This dataset contains information on the Surface Soil Moisture (SM) content derived from satellite observations in the microwave domain.

    Dataset Paper (Open Access)

    A description of this dataset, including the methodology and validation results, is available at:

    Preimesberger, W., Stradiotti, P., and Dorigo, W.: ESA CCI Soil Moisture GAPFILLED: an independent global gap-free satellite climate data record with uncertainty estimates, Earth Syst. Sci. Data, 17, 4305–4329, https://doi.org/10.5194/essd-17-4305-2025, 2025.

    Abstract

    ESA CCI Soil Moisture is a multi-satellite climate data record that consists of harmonized, daily observations coming from 19 satellites (as of v09.1) operating in the microwave domain. The wealth of satellite information, particularly over the last decade, facilitates the creation of a data record with the highest possible data consistency and coverage.
    However, data gaps are still found in the record. This is particularly notable in earlier periods when a limited number of satellites were in operation, but can also arise from various retrieval issues, such as frozen soils, dense vegetation, and radio frequency interference (RFI). These data gaps present a challenge for many users, as they have the potential to obscure relevant events within a study area or are incompatible with (machine learning) software that often relies on gap-free inputs.
    Since the requirement of a gap-free ESA CCI SM product was identified, various studies have demonstrated the suitability of different statistical methods to achieve this goal. A fundamental feature of such gap-filling method is to rely only on the original observational record, without need for ancillary variable or model-based information. Due to the intrinsic challenge, there was until present no global, long-term univariate gap-filled product available. In this version of the record, data gaps due to missing satellite overpasses and invalid measurements are filled using the Discrete Cosine Transform (DCT) Penalized Least Squares (PLS) algorithm (Garcia, 2010). A linear interpolation is applied over periods of (potentially) frozen soils with little to no variability in (frozen) soil moisture content. Uncertainty estimates are based on models calibrated in experiments to fill satellite-like gaps introduced to GLDAS Noah reanalysis soil moisture (Rodell et al., 2004), and consider the gap size and local vegetation conditions as parameters that affect the gapfilling performance.

    Summary

    • Gap-filled global estimates of volumetric surface soil moisture from 1991-2023 at 0.25° sampling
    • Fields of application (partial): climate variability and change, land-atmosphere interactions, global biogeochemical cycles and ecology, hydrological and land surface modelling, drought applications, and meteorology
    • Method: Modified version of DCT-PLS (Garcia, 2010) interpolation/smoothing algorithm, linear interpolation over periods of frozen soils. Uncertainty estimates are provided for all data points.
    • More information: See Preimesberger et al. (2025) and https://doi.org/10.5281/zenodo.8320869" target="_blank" rel="noopener">ESA CCI SM Algorithm Theoretical Baseline Document [Chapter 7.2.9] (Dorigo et al., 2023)

    Programmatic Download

    You can use command line tools such as wget or curl to download (and extract) data for multiple years. The following command will download and extract the complete data set to the local directory ~/Download on Linux or macOS systems.

    #!/bin/bash

    # Set download directory
    DOWNLOAD_DIR=~/Downloads

    base_url="https://researchdata.tuwien.at/records/3fcxr-cde10/files"

    # Loop through years 1991 to 2023 and download & extract data
    for year in {1991..2023}; do
    echo "Downloading $year.zip..."
    wget -q -P "$DOWNLOAD_DIR" "$base_url/$year.zip"
    unzip -o "$DOWNLOAD_DIR/$year.zip" -d $DOWNLOAD_DIR
    rm "$DOWNLOAD_DIR/$year.zip"
    done

    Data details

    The dataset provides global daily estimates for the 1991-2023 period at 0.25° (~25 km) horizontal grid resolution. Daily images are grouped by year (YYYY), each subdirectory containing one netCDF image file for a specific day (DD), month (MM) in a 2-dimensional (longitude, latitude) grid system (CRS: WGS84). The file name has the following convention:

    ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-YYYYMMDD000000-fv09.1r1.nc

    Data Variables

    Each netCDF file contains 3 coordinate variables (WGS84 longitude, latitude and time stamp), as well as the following data variables:

    • sm: (float) The Soil Moisture variable reflects estimates of daily average volumetric soil moisture content (m3/m3) in the soil surface layer (~0-5 cm) over a whole grid cell (0.25 degree).
    • sm_uncertainty: (float) The Soil Moisture Uncertainty variable reflects the uncertainty (random error) of the original satellite observations and of the predictions used to fill observation data gaps.
    • sm_anomaly: Soil moisture anomalies (reference period 1991-2020) derived from the gap-filled values (`sm`)
    • sm_smoothed: Contains DCT-PLS predictions used to fill data gaps in the original soil moisture field. These values are also provided for cases where an observation was initially available (compare `gapmask`). In this case, they provided a smoothed version of the original data.
    • gapmask: (0 | 1) Indicates grid cells where a satellite observation is available (1), and where the interpolated (smoothed) values are used instead (0) in the 'sm' field.
    • frozenmask: (0 | 1) Indicates grid cells where ERA5 soil temperature is <0 °C. In this case, a linear interpolation over time is applied.

    Additional information for each variable is given in the netCDF attributes.

    Version Changelog

    Changes in v9.1r1 (previous version was v09.1):

    • This version uses a novel uncertainty estimation scheme as described in Preimesberger et al. (2025).

    Software to open netCDF files

    These data can be read by any software that supports Climate and Forecast (CF) conform metadata standards for netCDF files, such as:

    References

    • Preimesberger, W., Stradiotti, P., and Dorigo, W.: ESA CCI Soil Moisture GAPFILLED: an independent global gap-free satellite climate data record with uncertainty estimates, Earth Syst. Sci. Data, 17, 4305–4329, https://doi.org/10.5194/essd-17-4305-2025, 2025.
    • Dorigo, W., Preimesberger, W., Stradiotti, P., Kidd, R., van der Schalie, R., van der Vliet, M., Rodriguez-Fernandez, N., Madelon, R., & Baghdadi, N. (2023). ESA Climate Change Initiative Plus - Soil Moisture Algorithm Theoretical Baseline Document (ATBD) Supporting Product Version 08.1 (version 1.1). Zenodo. https://doi.org/10.5281/zenodo.8320869
    • Garcia, D., 2010. Robust smoothing of gridded data in one and higher dimensions with missing values. Computational Statistics & Data Analysis, 54(4), pp.1167-1178. Available at: https://doi.org/10.1016/j.csda.2009.09.020
    • Rodell, M., Houser, P. R., Jambor, U., Gottschalck, J., Mitchell, K., Meng, C.-J., Arsenault, K., Cosgrove, B., Radakovich, J., Bosilovich, M., Entin, J. K., Walker, J. P., Lohmann, D., and Toll, D.: The Global Land Data Assimilation System, Bulletin of the American Meteorological Society, 85, 381 – 394, https://doi.org/10.1175/BAMS-85-3-381, 2004.

    Related Records

    The following records are all part of the ESA CCI Soil Moisture science data records community

    1

    ESA CCI SM MODELFREE Surface Soil Moisture Record

    <a href="https://doi.org/10.48436/svr1r-27j77" target="_blank"

  16. Sensitivity analysis for missing data in cost-effectiveness analysis: Stata...

    • figshare.com
    bin
    Updated May 31, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Baptiste Leurent; Manuel Gomes; Rita Faria; Stephen Morris; Richard Grieve; James R Carpenter (2023). Sensitivity analysis for missing data in cost-effectiveness analysis: Stata code [Dataset]. http://doi.org/10.6084/m9.figshare.6714206.v1
    Explore at:
    binAvailable download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Baptiste Leurent; Manuel Gomes; Rita Faria; Stephen Morris; Richard Grieve; James R Carpenter
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Stata do-files and data to support tutorial "Sensitivity Analysis for Not-at-Random Missing Data in Trial-Based Cost-Effectiveness Analysis" (Leurent, B. et al. PharmacoEconomics (2018) 36: 889).Do-files should be similar to the code provided in the article's supplementary material.Dataset based on 10 Top Tips trial, but modified to preserve confidentiality. Results will differ from those published.

  17. Developed R-codes

    • figshare.com
    pdf
    Updated Nov 17, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Muhammad Asif (2025). Developed R-codes [Dataset]. http://doi.org/10.6084/m9.figshare.30636971.v1
    Explore at:
    pdfAvailable download formats
    Dataset updated
    Nov 17, 2025
    Dataset provided by
    Figsharehttp://figshare.com/
    figshare
    Authors
    Muhammad Asif
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Missing data imputation

  18. d

    Slave Routes Datasets, 1650s - 1860s

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 22, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Manning, Patrick; Liu, Yu (2023). Slave Routes Datasets, 1650s - 1860s [Dataset]. http://doi.org/10.7910/DVN/6HLXO3
    Explore at:
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Manning, Patrick; Liu, Yu
    Time period covered
    Jan 1, 1650 - Jan 1, 1870
    Description

    Estimates of captives carried in the Atlantic slave trade by decade, 1650s to 1860s. Data: routes of voyages and recorded numbers of captives (10 variables and 33,345 cases of slave voyages). Data are organized into 40 routes linking African regions to overseas regions. Purpose: estimation of missing data and totals of captive flows. Method: techniques of Bayesian statistics to estimate missing data on routes and flows of captives. Also included is R-language code for simulating routes and populations

  19. f

    Data from: A Bayesian Approach to Parameter Estimation in the Presence of...

    • tandf.figshare.com
    txt
    Updated Jun 1, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Domenica Panzera; Roberto Benedetti; Paolo Postiglione (2023). A Bayesian Approach to Parameter Estimation in the Presence of Spatial Missing Data [Dataset]. http://doi.org/10.6084/m9.figshare.3688599.v1
    Explore at:
    txtAvailable download formats
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Domenica Panzera; Roberto Benedetti; Paolo Postiglione
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    The missing data problem has been widely addressed in the literature. The traditional methods for handling missing data may be not suited to spatial data, which can exhibit distinctive structures of dependence and/or heterogeneity. As a possible solution to the spatial missing data problem, this paper proposes an approach that combines the Bayesian Interpolation method [Benedetti, R. & Palma, D. (1994) Markov random field-based image subsampling method, Journal of Applied Statistics, 21(5), 495–509] with a multiple imputation procedure. The method is developed in a univariate and a multivariate framework, and its performance is evaluated through an empirical illustration based on data related to labour productivity in European regions.

  20. E

    Data from: STAD-R Descriptive statistics for experimental designs

    • data.moa.gov.et
    html
    Updated Jan 20, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    CIMMYT Ethiopia (2025). STAD-R Descriptive statistics for experimental designs [Dataset]. https://data.moa.gov.et/dataset/hdl-11529-10853
    Explore at:
    htmlAvailable download formats
    Dataset updated
    Jan 20, 2025
    Dataset provided by
    CIMMYT Ethiopia
    Description

    STAD-R is a set of R programs that performs descriptive statistics, in order to make boxplots and histograms. STAD-R was designed because is necessary before than the thing, check if the dataset have the same number of repetitions, blocks, genotypes, environments, if we have missing values, where and how many, review the distributions and outliers, because is important to be sure that the dataset is complete and have the correct structure for do and other kind of analysis.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Rafael Rodríguez; Rafael Rodríguez; Marcos Pastorini; Marcos Pastorini; Lorena Etcheverry; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Alberto Castro; Angela Gorgoglione; Angela Gorgoglione; Christian Chreties; Mónica Fossati (2021). Water-quality data imputation with a high percentage of missing values: a machine learning approach [Dataset]. http://doi.org/10.5281/zenodo.4731169
Organization logo

Water-quality data imputation with a high percentage of missing values: a machine learning approach

Explore at:
csvAvailable download formats
Dataset updated
Jun 8, 2021
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Rafael Rodríguez; Rafael Rodríguez; Marcos Pastorini; Marcos Pastorini; Lorena Etcheverry; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Alberto Castro; Angela Gorgoglione; Angela Gorgoglione; Christian Chreties; Mónica Fossati
License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

The monitoring of surface-water quality followed by water-quality modeling and analysis is essential for generating effective strategies in water resource management. However, water-quality studies are limited by the lack of complete and reliable data sets on surface-water-quality variables. These deficiencies are particularly noticeable in developing countries.

This work focuses on surface-water-quality data from Santa Lucía Chico river (Uruguay), a mixed lotic and lentic river system. Data collected at six monitoring stations are publicly available at https://www.dinama.gub.uy/oan/datos-abiertos/calidad-agua/. The high temporal and spatial variability that characterizes water-quality variables and the high rate of missing values (between 50% and 70%) raises significant challenges.

To deal with missing values, we applied several statistical and machine-learning imputation methods. The competing algorithms implemented belonged to both univariate and multivariate imputation methods (inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR), and K-nearest neighbors Regressor (KNNR)).

IDW outperformed the others, achieving a very good performance (NSE greater than 0.8) in most cases.

In this dataset, we include the original and imputed values for the following variables:

  • Water temperature (Tw)

  • Dissolved oxygen (DO)

  • Electrical conductivity (EC)

  • pH

  • Turbidity (Turb)

  • Nitrite (NO2-)

  • Nitrate (NO3-)

  • Total Nitrogen (TN)

Each variable is identified as [STATION] VARIABLE FULL NAME (VARIABLE SHORT NAME) [UNIT METRIC].

More details about the study area, the original datasets, and the methodology adopted can be found in our paper https://www.mdpi.com/2071-1050/13/11/6318.

If you use this dataset in your work, please cite our paper:
Rodríguez, R.; Pastorini, M.; Etcheverry, L.; Chreties, C.; Fossati, M.; Castro, A.; Gorgoglione, A. Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach. Sustainability 2021, 13, 6318. https://doi.org/10.3390/su13116318

Search
Clear search
Close search
Google apps
Main menu