31 datasets found (first 20 shown)
  1. Additional file 5 of Heckman imputation models for binary or continuous MNAR...

    • springernature.figshare.com
    • datasetcatalog.nlm.nih.gov
    Format: txt
    Updated: May 30, 2023 (more versions available)
    Provided by: Figshare (http://figshare.com/)
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
    Cite: Jacques-Emmanuel Galimard; Sylvie Chevret; Emmanuel Curis; Matthieu Resche-Rigon (2023). Additional file 5 of Heckman imputation models for binary or continuous MNAR outcomes and MAR predictors [Dataset]. http://doi.org/10.6084/m9.figshare.7038107.v1

    Description

    R code to impute continuous outcome. (R 1 kb)

  2. Missing data in the analysis of multilevel and dependent data (Examples)

    • data.niaid.nih.gov
    Updated: Jul 20, 2023 (more versions available)
    Provided by: IPN - Leibniz Institute for Science and Mathematics Education; University of Hamburg
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
    Cite: Simon Grund; Oliver Lüdtke; Alexander Robitzsch (2023). Missing data in the analysis of multilevel and dependent data (Examples) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7773613

    Description

    Example data sets and computer code for the book chapter titled "Missing Data in the Analysis of Multilevel and Dependent Data" submitted for publication in the second edition of "Dependent Data in Social Science Research" (Stemmler et al., 2015). This repository includes the computer code (".R") and the data sets from both example analyses (Examples 1 and 2). The data sets are available in two file formats (binary ".rda" for use in R; plain-text ".dat").

    The data sets contain simulated data from 23,376 (Example 1) and 23,072 (Example 2) individuals from 2,000 groups on four variables:

    ID = group identifier (1-2000)
    x = numeric (Level 1)
    y = numeric (Level 1)
    w = binary (Level 2)

    In all data sets, missing values are coded as "NA".
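
    A minimal sketch of how a two-level data set of this form could be imputed with the mice package (an illustration, not code from the repository; the file name "example1.rda" and the object name "dat" are assumed):

      library(mice)                          # two-level methods "2l.pan" / "2lonly.pmm" also need the pan package
      load("example1.rda")                   # assumed to provide a data frame `dat` with ID, x, y, w
      pred <- make.predictorMatrix(dat)
      pred[, "ID"] <- -2                     # -2 marks ID as the cluster (group) indicator
      pred["ID", ] <- 0                      # the identifier itself is never imputed
      meth <- make.method(dat)
      meth[c("x", "y")] <- "2l.pan"          # two-level method for the Level-1 variables
      meth["w"] <- "2lonly.pmm"              # group-level method for the Level-2 variable
      imp <- mice(dat, method = meth, predictorMatrix = pred, m = 5, seed = 1)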

  3. Details on imputation methods in R or python.

    • plos.figshare.com
    Format: xls
    Updated: Nov 7, 2025
    Provided by: PLOS (http://plos.org/)
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
    Cite: Elena Albu; Shan Gao; Laure Wynants; Ben Van Calster (2025). Details on imputation methods in R or python. [Dataset]. http://doi.org/10.1371/journal.pone.0334125.t005

    Description

    Prediction models are used to predict an outcome based on input variables. Missing data in input variables often occur at model development and at prediction time. The missForestPredict R package proposes an adaptation of the missForest imputation algorithm that is fast, user-friendly and tailored for prediction settings. The algorithm iteratively imputes variables using random forests until a convergence criterion, unified for continuous and categorical variables, is met. The imputation models are saved for each variable and iteration and can be applied later to new observations at prediction time. The missForestPredict package offers extended error monitoring, control over variables used in the imputation and custom initialization. This allows users to tailor the imputation to their specific needs. The missForestPredict algorithm is compared to mean/mode imputation, linear regression imputation, mice, k-nearest neighbours, bagging, miceRanger and IterativeImputer on eight simulated datasets with simulated missingness (48 scenarios) and eight large public datasets using different prediction models. missForestPredict provides competitive results in prediction settings within short computation times.
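
    A generic sketch of the same train-then-apply workflow using the mice package's "ignore" argument (a stand-in illustration only, not the missForestPredict API; the objects train and test are assumed data frames with identical columns):

      library(mice)
      dat    <- rbind(train, test)                      # development data plus new observations
      is_new <- c(rep(FALSE, nrow(train)), rep(TRUE, nrow(test)))
      imp    <- mice(dat, m = 1, method = "rf",         # random-forest imputation
                     ignore = is_new, printFlag = FALSE, seed = 1)
      test_imputed <- complete(imp)[is_new, ]           # new rows imputed with models fitted on training rows only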

  4. Data from: Multiple Imputation Through XGBoost

    • tandf.figshare.com
    Format: txt
    Updated: Oct 23, 2023
    Provided by: Taylor & Francis (https://taylorandfrancis.com/)
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
    Cite: Yongshi Deng; Thomas Lumley (2023). Multiple Imputation Through XGBoost [Dataset]. http://doi.org/10.6084/m9.figshare.24073156.v3

    Description

    The use of multiple imputation (MI) is becoming increasingly popular for addressing missing data. Although some conventional MI approaches have been well studied and have shown empirical validity, they have limitations when processing large datasets with complex data structures. Their imputation performances usually rely on the proper specification of imputation models, and this requires expert knowledge of the inherent relations among variables. Moreover, these standard approaches tend to be computationally inefficient for medium and large datasets. In this article, we propose a scalable MI framework mixgb, which is based on XGBoost, subsampling, and predictive mean matching. Our approach leverages the power of XGBoost, a fast implementation of gradient boosted trees, to automatically capture interactions and nonlinear relations while achieving high computational efficiency. In addition, we incorporate subsampling and predictive mean matching to reduce bias and to better account for appropriate imputation variability. The proposed framework is implemented in an R package mixgb. Supplementary materials for this article are available online.
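
    A minimal, hedged sketch of calling the mixgb package as described above (the entry point and argument names reflect our reading of the package documentation and should be checked against it; df is an assumed data frame with missing values):

      library(mixgb)
      imputed_sets <- mixgb(data = df, m = 5)   # assumed to return a list of m imputed copies of df
      head(imputed_sets[[1]])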

  5. Data from: Missing data estimation in morphometrics: how much is too much?

    • data.niaid.nih.gov
    • datadryad.org
    • +1 more
    Format: zip
    Updated: Dec 5, 2013
    Provided by: Centre National de la Recherche Scientifique
    License: CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)
    Cite: Julien Clavel; Gildas Merceron; Gilles Escarguel (2013). Missing data estimation in morphometrics: how much is too much? [Dataset]. http://doi.org/10.5061/dryad.f0b50

    Description

    Fossil-based estimates of diversity and evolutionary dynamics mainly rely on the study of morphological variation. Unfortunately, organism remains are often altered by post-mortem taphonomic processes such as weathering or distortion. Such a loss of information often prevents quantitative multivariate description and statistically controlled comparisons of extinct species based on morphometric data. A common way to deal with missing data involves imputation methods that directly fill the missing cases with model estimates. Over the last several years, several empirically determined thresholds for the maximum acceptable proportion of missing values have been proposed in the literature, while other studies have shown that this limit actually depends on several properties of the study dataset and of the selected imputation method, and is by no means generalizable. We evaluate the relative performances of seven multiple imputation techniques through a simulation-based analysis under three distinct patterns of missing data distribution. Overall, Fully Conditional Specification and Expectation-Maximization algorithms provide the best compromises between imputation accuracy and coverage probability. Multiple imputation (MI) techniques appear remarkably robust to the violation of basic assumptions such as the occurrence of taxonomically or anatomically biased patterns of missing data distribution, making differences in simulation results between the three patterns of missing data distribution much smaller than differences between the individual MI techniques. Based on these results, rather than proposing a new (set of) threshold value(s), we develop an approach combining the use of multiple imputations with procrustean superimposition of principal component analysis results, in order to directly visualize the effect of individual missing data imputation on an ordinated space. We provide an R function for users to implement the proposed procedure.
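
    A hedged sketch of the general workflow described above (multiple imputation followed by Procrustes superimposition of PCA ordinations); this is an illustration, not the authors' published function, and morpho is an assumed numeric matrix of measurements with missing values:

      library(mice)
      library(vegan)                                   # provides procrustes()
      imp  <- mice(as.data.frame(morpho), m = 20, method = "pmm", printFlag = FALSE, seed = 1)
      ref  <- prcomp(complete(imp, 1), scale. = TRUE)$x[, 1:2]     # reference ordination
      fits <- lapply(2:20, function(i) {
        scores <- prcomp(complete(imp, i), scale. = TRUE)$x[, 1:2]
        procrustes(ref, scores)                        # align each imputed ordination to the reference
      })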

  6. R code and data for "Multiple imputation and direct estimation for qPCR data...

    • zenodo.org
    Format: application/gzip
    Updated: Jan 24, 2020
    Provided by: Zenodo (http://zenodo.org/)
    License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/ (license information derived automatically)
    Cite: Valeriia Sherina; Matthew N. McCall (2020). R code and data for "Multiple imputation and direct estimation for qPCR data with non-detects" [Dataset]. http://doi.org/10.5281/zenodo.2554418

    Description

    R code and data to reproduce figures and tables in the manuscript: Multiple imputation and direct estimation for qPCR data with non-detects.

  7. Hazardous Times

    • dataverse.harvard.edu
    Format: Croissant (a machine-learning dataset format; see mlcommons.org/croissant)
    Updated: Aug 27, 2025
    Provided by: Harvard Dataverse
    License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/ (license information derived automatically)
    Cite: George Robert Lefter (2025). Hazardous Times [Dataset]. http://doi.org/10.7910/DVN/MP6L5B

    Description

    This repository contains the replication package for the article "Hazardous Times: Animal Spirits and U.S. Recession Probabilities." It includes all necessary R code, raw data, and processed data in the start-stop (counting process) format required to reproduce the empirical results, tables, and figures from the study.

    Project description: The study assembles monthly U.S. macroeconomic time series from the Federal Reserve Economic Data (FRED) and related sources, covering labor market conditions, consumer sentiment, term spreads, and credit spreads, and implements a novel "high water mark" methodology to measure the lead times with which these indicators signal NBER-dated recessions.

    Contents:
    • Code: R scripts for data cleaning, multiple imputation, survival analysis, and figure/table generation. A top-level master script (run_all.R) executes the entire analytical pipeline end-to-end.
    • Data: Raw/ holds the original data pulls from primary sources. Analysis_Ready/ holds the cleaned series, constructed cycle-specific extremes (high water marks), lead time variables, the final start-stop dataset for survival analysis, and the final curated Excel workbooks used as direct inputs for the replication code. (Note: these Excel sheets must be saved as separate .xlsx files in the designated directory before running the R code.)
    • Documentation: this README file and detailed comments within the code.

    Key details:
    • Software requirements: the replication code is written in R; a list of required R packages (with versions) is provided in the reference list of the article.
    • Missing data: addressed via Multiple Imputation by Chained Equations (MICE).
    • License: the original raw data from FRED is subject to its own terms of use, which require citation. The R code is released under the MIT License. All processed data, constructed variables, and analysis-ready datasets created by the author are dedicated to the public domain under CC0 1.0.

    Instructions: Download the entire repository, install the required R packages, and save the sheets of the workbook "Hazardous_Times_Data.xlsx" as separate .xlsx files in the designated directory. Then run the master script run_all.R to replicate the study's analysis from the provided Analysis_Ready data; this regenerates all tables and figures. Users should consult the main publication for full context, theoretical motivation, and series-specific citations.
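
    A hedged illustration of fitting a hazard model on start-stop (counting process) data with the survival package (a generic sketch, not the repository's code; the data frame and variable names are assumed):

      library(survival)
      fit <- coxph(Surv(start, stop, recession) ~ sentiment + term_spread,
                   data = analysis_ready)              # one row per indicator interval in start-stop format
      summary(fit)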

  8. Diallel analysis reveals Mx1-dependent and Mx1-independent effects on...

    • zenodo.org
    • data.niaid.nih.gov
    Formats: application/gzip, csv, pdf
    Updated: Aug 3, 2024
    Provided by: Zenodo (http://zenodo.org/)
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
    Cite: Paul L Maurizio; Martin T Ferris; Gregory R Keele; Darla R Miller; Ginger D Shaw; Alan C Whitmore; Ande West; Clayton R Morrison; Kelsey E Noll; Kenneth S Plante; Adam S Cockrell; David W Threadgill; Fernando Pardo-Manuel de Villena; Ralph S Baric; Mark T Heise; William Valdar (2024). Diallel analysis reveals Mx1-dependent and Mx1-independent effects on response to influenza A virus in mice [Dataset]. http://doi.org/10.5281/zenodo.293015

    Description

    Data and analysis files for diallel analysis of weight loss in 8-12 week old male and female mice (n=1,043), mock treated or infected with influenza A virus (H1N1, PR8) across 4 days post-infection, as well as founder haplotype effect analysis at Mx1 for pre-CC and CC-RIX.

  9. Simulated datasets description.

    • plos.figshare.com
    Format: xls
    Updated: Nov 7, 2025 (more versions available)
    Provided by: PLOS (http://plos.org/)
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
    Cite: Elena Albu; Shan Gao; Laure Wynants; Ben Van Calster (2025). Simulated datasets description. [Dataset]. http://doi.org/10.1371/journal.pone.0334125.t003

    Description

    Prediction models are used to predict an outcome based on input variables. Missing data in input variables often occur at model development and at prediction time. The missForestPredict R package proposes an adaptation of the missForest imputation algorithm that is fast, user-friendly and tailored for prediction settings. The algorithm iteratively imputes variables using random forests until a convergence criterion, unified for continuous and categorical variables, is met. The imputation models are saved for each variable and iteration and can be applied later to new observations at prediction time. The missForestPredict package offers extended error monitoring, control over variables used in the imputation and custom initialization. This allows users to tailor the imputation to their specific needs. The missForestPredict algorithm is compared to mean/mode imputation, linear regression imputation, mice, k-nearest neighbours, bagging, miceRanger and IterativeImputer on eight simulated datasets with simulated missingness (48 scenarios) and eight large public datasets using different prediction models. missForestPredict provides competitive results in prediction settings within short computation times.

  10. Data from: Quantifying the impacts of management and herbicide resistance on...

    • search.dataone.org
    • datadryad.org
    Updated: Nov 29, 2023
    Provided by: Dryad Digital Repository
    Time period covered: Jan 1, 2022
    Cite: Robert Goodsell; David Comont; Helen Hicks; James Lambert; Richard Hull; Laura Crook; Paolo Fraccaro; Katharina Reusch; Robert Freckleton; Dylan Childs (2023). Quantifying the impacts of management and herbicide resistance on regional plant population dynamics in the face of missing data [Dataset]. http://doi.org/10.5061/dryad.9cnp5hqn5
    Description

    A key challenge in the management of populations is to quantify the impact of interventions in the face of environmental and phenotypic variability. However, accurate estimation of the effects of management and environment in large-scale ecological research is often limited by the expense of data collection, the inherent trade-off between quality and quantity, and missing data. In this paper we develop a novel modelling framework, and a demographically informed imputation scheme, to comprehensively account for the uncertainty generated by missing population, management, and herbicide resistance data. Using this framework and a large dataset (178 sites over 3 years) on the densities of a destructive arable weed (Alopecurus myosuroides), we investigate the effects of environment, management, and evolved herbicide resistance on weed population dynamics. In this study we quantify the marginal effects of a suite of common management practices, including cropping, cultivation, and herbici...

    Data were collected from a network of UK farms using a density structured survey method outlined in Queensborough 2011.

    Contained are the datasets and code required to replicate the analyses in Goodsell et al (2023), Quantifying the impacts of management and herbicide resistance on regional plant population dynamics in the face of missing data.

    Description of the data and file structure

    Data: Contains data required to run all stages in the analysis.

    Many files contain the same variable names, important variables have been described in the first object they appear in.

    all_imputation_data.rds - The data required to run the imputation scheme; this is an R list containing the following:

    $Management - data frame containing missing and observed values for management imputation

    FF & FFY: the specific field, and field year.

    year: the year.

    crop: crop

    cult_cat: cultivation category

    a_gly: number of autumn (post September 1st) glyphosate applicatio...

  11. PhD Thesis Supplementary Materials: The effect of Health System, Health Risk...

    • data.mendeley.com
    Updated: Aug 5, 2025
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
    Cite: KP Junaid (2025). PhD Thesis Supplementary Materials: The effect of Health System, Health Risk factors and Health Service Coverage on Fertility, Morbidity and Mortality in HDI countries: An Econometric analysis [Dataset]. http://doi.org/10.17632/53hy5btx6t.1

    Description

    This repository accompanies the doctoral thesis titled "The Effect of Health System, Health Risk Factors and Health Service Coverage on Fertility, Morbidity and Mortality in HDI Countries: An Econometric Analysis." Given the complexity of the data and methodological procedures, key supplementary materials detailing the data sources, processing techniques, analytical scripts, and extended results are provided in the present Mendeley Data repository. These materials are intended to promote transparency, reproducibility, and further research. The repository includes the following supplementary files:

    • Supplementary File 1: Detailed information on all indicators used in the study, including those from the Global Reference List (GRL) and control variables. It specifies the definition, unit of measurement, data source, missing data proportions, inclusion status, and whether the indicator is positively or negatively associated with the outcome.
    • Supplementary File 2: The R script used to perform Multiple Imputation by Chained Equations (MICE) to handle missing data across indicators.
    • Supplementary File 3: Description of the imputed dataset generated using the MICE method for both pre-COVID (2015–2019) and post-COVID (2020–2021) periods.
    • Supplementary File 4: The R script used to construct the composite and sub-indices for Health System, Health Risk Factors, Service Coverage, and Health Status.
    • Supplementary File 5: The R script used to compute Compound Annual Growth Rates (CAGR) for all indices and component indicators.
    • Supplementary File 6: The Stata do-file used to run panel data regression models estimating the impact of Health System, Health Risk Factors, and Service Coverage on fertility, morbidity, and mortality.
    • Supplementary File 7: The Stata do-file used for the Phillips and Sul convergence analysis assessing convergence/divergence trends among countries toward selected health-related SDG targets.
    • Supplementary File 8: Descriptive statistics (mean, standard deviation, and coefficient of variation) for selected health indicators across 100 HDI countries during the study period (2015–2021).
    • Supplementary File 9: CAGR estimates of all constructed indices, reported separately for the pre-COVID (2015–2019) and post-COVID (2020–2021) phases.
    • Supplementary File 10: Forecasted values for 57 indicators across 100 countries up to the year 2025, supporting the study's predictive analysis.
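
    For reference, a minimal sketch of the standard CAGR formula (an illustration only, not the repository's Supplementary File 5 script):

      # CAGR = (end value / start value)^(1 / number of years) - 1
      cagr <- function(start_value, end_value, n_years) {
        (end_value / start_value)^(1 / n_years) - 1
      }
      cagr(80, 92, 4)   # e.g. an index growing from 80 to 92 over 4 years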

  12. Pseudocode for the missForestPredict algorithm.

    • plos.figshare.com
    Format: xls
    Updated: Nov 7, 2025
    Provided by: PLOS (http://plos.org/)
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
    Cite: Elena Albu; Shan Gao; Laure Wynants; Ben Van Calster (2025). Pseudocode for the missForestPredict algorithm. [Dataset]. http://doi.org/10.1371/journal.pone.0334125.t001

    Description

    Prediction models are used to predict an outcome based on input variables. Missing data in input variables often occur at model development and at prediction time. The missForestPredict R package proposes an adaptation of the missForest imputation algorithm that is fast, user-friendly and tailored for prediction settings. The algorithm iteratively imputes variables using random forests until a convergence criterion, unified for continuous and categorical variables, is met. The imputation models are saved for each variable and iteration and can be applied later to new observations at prediction time. The missForestPredict package offers extended error monitoring, control over variables used in the imputation and custom initialization. This allows users to tailor the imputation to their specific needs. The missForestPredict algorithm is compared to mean/mode imputation, linear regression imputation, mice, k-nearest neighbours, bagging, miceRanger and IterativeImputer on eight simulated datasets with simulated missingness (48 scenarios) and eight large public datasets using different prediction models. missForestPredict provides competitive results in prediction settings within short computation times.

  13. Multiple Imputation by Ordered Monotone Blocks With Application to the...

    • tandf.figshare.com
    Format: application/x-dosexec
    Updated: Jun 2, 2023
    Provided by: Taylor & Francis (https://taylorandfrancis.com/)
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
    Cite: Fan Li; Michela Baccini; Fabrizia Mealli; Elizabeth R. Zell; Constantine E. Frangakis; Donald B. Rubin (2023). Multiple Imputation by Ordered Monotone Blocks With Application to the Anthrax Vaccine Research Program [Dataset]. http://doi.org/10.6084/m9.figshare.1067056.v2

    Description

    Multiple imputation (MI) has become a standard statistical technique for dealing with missing values. The CDC Anthrax Vaccine Research Program (AVRP) dataset created new challenges for MI due to the large number of variables of different types and the limited sample size. A common method for imputing missing data in such complex studies is to specify, for each of J variables with missing values, a univariate conditional distribution given all other variables, and then to draw imputations by iterating over the J conditional distributions. Such fully conditional imputation strategies have the theoretical drawback that the conditional distributions may be incompatible. When the missingness pattern is monotone, a theoretically valid approach is to specify, for each variable with missing values, a conditional distribution given the variables with fewer or the same number of missing values and sequentially draw from these distributions. In this article, we propose the “multiple imputation by ordered monotone blocks” approach, which combines these two basic approaches by decomposing any missingness pattern into a collection of smaller “constructed” monotone missingness patterns, and iterating. We apply this strategy to impute the missing data in the AVRP interim data. Supplemental materials, including all source code and a synthetic example dataset, are available online.
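
    A related, hedged illustration (not the authors' ordered-monotone-blocks implementation): the mice package can already visit variables in monotone order, from least to most missing, which is the basic ingredient the approach above extends by decomposing an arbitrary missingness pattern into constructed monotone blocks:

      library(mice)
      imp <- mice(nhanes, visitSequence = "monotone", m = 5, printFlag = FALSE, seed = 1)
      imp$visitSequence          # order in which the variables were imputed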

  14. Iterative Multiple Imputation: A Framework to Determine the Number of...

    • tandf.figshare.com
    Format: zip
    Updated: May 31, 2023
    Provided by: Taylor & Francis (https://taylorandfrancis.com/)
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
    Cite: Vahid Nassiri; Geert Molenberghs; Geert Verbeke; João Barbosa-Breda (2023). Iterative Multiple Imputation: A Framework to Determine the Number of Imputed Datasets [Dataset]. http://doi.org/10.6084/m9.figshare.7445375.v2

    Description

    We consider multiple imputation as a procedure iterating over a set of imputed datasets. Based on an appropriate stopping rule the number of imputed datasets is determined. Simulations and real-data analyses indicate that the sufficient number of imputed datasets may in some cases be substantially larger than the very small numbers that are usually recommended. For an easier use in various applications, the proposed method is implemented in the R package imi.
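
    A generic illustration of the underlying idea (not the imi package API): increase the number of imputed datasets and check whether the pooled estimates have stabilised. The built-in nhanes data from mice stand in for a real application:

      library(mice)
      for (m in c(5, 20, 50)) {
        imp <- mice(nhanes, m = m, printFlag = FALSE, seed = 1)
        fit <- pool(with(imp, lm(bmi ~ age + chl)))
        cat("m =", m, "\n")
        print(summary(fit)[, c("term", "estimate", "std.error")])
      }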

  15. Variable Selection with Multiply-Imputed Datasets: Choosing Between Stacked...

    • tandf.figshare.com
    Format: pdf
    Updated: Jun 3, 2023
    Provided by: Taylor & Francis (https://taylorandfrancis.com/)
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
    Cite: Jiacong Du; Jonathan Boss; Peisong Han; Lauren J. Beesley; Michael Kleinsasser; Stephen A. Goutman; Stuart Batterman; Eva L. Feldman; Bhramar Mukherjee (2023). Variable Selection with Multiply-Imputed Datasets: Choosing Between Stacked and Grouped Methods [Dataset]. http://doi.org/10.6084/m9.figshare.19111441.v2

    Description

    Penalized regression methods are used in many biomedical applications for variable selection and simultaneous coefficient estimation. However, missing data complicates the implementation of these methods, particularly when missingness is handled using multiple imputation. Applying a variable selection algorithm on each imputed dataset will likely lead to different sets of selected predictors. This article considers a general class of penalized objective functions which, by construction, force selection of the same variables across imputed datasets. By pooling objective functions across imputations, optimization is then performed jointly over all imputed datasets rather than separately for each dataset. We consider two objective function formulations that exist in the literature, which we will refer to as “stacked” and “grouped” objective functions. Building on existing work, we (i) derive and implement efficient cyclic coordinate descent and majorization-minimization optimization algorithms for continuous and binary outcome data, (ii) incorporate adaptive shrinkage penalties, (iii) compare these methods through simulation, and (iv) develop an R package miselect. Simulations demonstrate that the “stacked” approaches are more computationally efficient and have better estimation and selection properties. We apply these methods to data from the University of Michigan ALS Patients Biorepository aiming to identify the association between environmental pollutants and ALS risk. Supplementary materials for this article are available online.
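
    A generic sketch of the "stacked" idea (not the miselect package API): stack the completed datasets from mice, weight each row by 1/m, and fit a single LASSO so the same variables are selected across imputations. The nhanes example data from mice stand in for a real application:

      library(mice)
      library(glmnet)
      m       <- 5
      imp     <- mice(nhanes, m = m, printFlag = FALSE, seed = 1)
      stacked <- complete(imp, action = "long")             # all m completed datasets, stacked
      x <- as.matrix(stacked[, c("age", "hyp", "chl")])
      y <- stacked$bmi
      fit <- cv.glmnet(x, y, weights = rep(1 / m, nrow(stacked)), alpha = 1)
      coef(fit, s = "lambda.min")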

  16. Data from: Analysis of Interval Censored Competing Risk Data via...

    • tandf.figshare.com
    Format: txt
    Updated: May 30, 2023
    Provided by: Taylor & Francis (https://taylorandfrancis.com/)
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
    Cite: Hyung Eun Lee; Yang-Jin Kim (2023). Analysis of Interval Censored Competing Risk Data via Nonparametric Multiple Imputation [Dataset]. http://doi.org/10.6084/m9.figshare.11988687.v2

    Description

    In many clinical studies, the time to the event of interest may involve several causes of failure. Furthermore, when the failure times are not completely observed and are instead only known to lie somewhere between two observation times, interval censored competing risk data occur. For estimating regression coefficients with right censored competing risk data, Fine and Gray introduced the concept of censoring complete data and derived an estimating equation using an inverse probability censoring weight technique to reflect the probability of being censored. As an alternative way to achieve censoring complete data, Ruan and Gray considered directly imputing a potential censoring time for subjects who experienced the competing event. In this work, we extend Ruan and Gray’s approach to interval censored competing risk data by applying a multiple imputation technique. The suggested method has the advantage of being easily implemented using several R functions developed for analyzing interval censored failure time data without competing risks. Simulation studies are conducted under diverse schemes to evaluate size and power and to estimate regression coefficients. A dataset from an AIDS cohort study is analyzed as a real data example.

  17. Complete dataset obtained by imputation.

    • plos.figshare.com
    Format: xls
    Updated: Sep 24, 2025 (more versions available)
    Provided by: PLOS (http://plos.org/)
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
    Cite: Kai Zhang; Po-Chung Chen; YiYang Huang; Shiow-Jyu Tzou; Sheng-Tang Wu; Ta-Wei Chu; Chung-Che Wang; Jyh-Shing Roger Jang (2025). Complete dataset obtained by imputation. [Dataset]. http://doi.org/10.1371/journal.pone.0330184.t007

    Description

    In response to Taiwan’s rapidly aging population and the rising demand for personalized health care, accurately assessing individual physiological aging has become an essential area of study. This research utilizes health examination data to propose a machine learning-based biological age prediction model that quantifies physiological age through residual life estimation. The model leverages LightGBM, which shows an 11.40% improvement in predictive performance (R-squared) compared to the XGBoost model. In the experiments, the use of MICE imputation for missing data significantly enhanced prediction accuracy, resulting in a 23.35% improvement in predictive performance. Kaplan-Meier (K-M) estimator survival analysis revealed that the model effectively differentiates between groups with varying health levels, underscoring the validity of biological age as a health status indicator. Additionally, the model identified the top ten biomarkers most influential in aging for both men and women, with a 69.23% overlap with Taiwan’s leading causes of death and previously identified top health-impact factors, further validating its practical relevance. Through multidimensional health recommendations based on SHAP and PCC interpretations, if the health recommendations provided by the model are implemented, 64.58% of individuals could potentially extend their life expectancy. This study provides new methodological support and data backing for precision health interventions and life extension.

  18. Cross cultural data for multivariate analysis of subsistence strategies

    • figshare.com
    Format: txt
    Updated: Jun 2, 2023
    Provided by: Figshare (http://figshare.com/)
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
    Cite: Isaac Ullah (2023). Cross cultural data for multivariate analysis of subsistence strategies [Dataset]. http://doi.org/10.6084/m9.figshare.1404233.v1

    Description

    These are datasets of human subsistence, mobility, demographic, and environmental variables for the 186 cultures of the Standard Cross Cultural Sample. Missing values have been filled by Multiple Imputation methods in the R statistical package. These data are used in an analysis about human subsistence transitions that is submitted to PNAS. The R script used to complete those analyses is also included.

  19. R-code of the simulation study and use case investigation.

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    Format: zip
    Updated: Nov 2, 2023
    Provided by: PLOS (http://plos.org/)
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
    Cite: Tanja Bülow; Ralf-Dieter Hilgers; Nicole Heussen (2023). R-code of the simulation study and use case investigation. [Dataset]. http://doi.org/10.1371/journal.pone.0293640.s007

    Description

    This code can be used to define the functions, to create the datasets, to generate the figures and tables for the simulation study and to generate the results from the use case. (ZIP)

  20. Data from: LASSO-based Survival Prediction Modelling with Multiply Imputed...

    • figshare.com
    Format: pdf
    Updated: Aug 4, 2025 (more versions available)
    Provided by: Taylor & Francis
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
    Cite: Md. Belal Hossain; Mohsen Sadatsafavi; James C. Johnston; Hubert Wong; Victoria J. Cook; Mohammad Ehsanul Karim (2025). LASSO-based Survival Prediction Modelling with Multiply Imputed Data: A Case Study in Tuberculosis Mortality Prediction [Dataset]. http://doi.org/10.6084/m9.figshare.29447083.v1

    Description

    Utilizing health administrative datasets for developing prediction models is always challenging due to missing values in key predictors. Multiple imputation has been recommended to deal with missing predictor values. However, predicting survival outcomes using regularized regression, e.g., Cox-LASSO, faces limitations because these methods are incompatible with pooling model outputs from multiply imputed data using Rubin’s rule. In this study, we explored the performance of three statistical methods for developing prediction models with Cox-LASSO on multiply imputed data: prediction average, performance average, and stacked. We considered two hyperparameter selection techniques: minimum-lambda, which gives the minimum cross-validated prediction error, and 1SE-lambda, which selects more parsimonious models. We also conducted plasmode simulations varying the events per parameter. The stacked approach provided the most robust predictions in our case study of predicting tuberculosis mortality and in the simulations, producing a time-dependent c-statistic of 0.93 and a well-calibrated calibration plot. The 1SE-lambda technique resulted in underfitting of the models in most scenarios, both in the case study and in the simulations. Our findings advocate the stacked method with minimum-lambda as an effective technique for combining LASSO-based prediction outputs from multiply imputed data. We share reproducible R code to help future researchers adopt these methodologies in their research.
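
    A hedged illustration of the two hyperparameter choices discussed above, using glmnet's Cox-LASSO on a single completed dataset (a generic sketch, not the authors' full stacked pipeline; x, time, and status are assumed objects):

      library(glmnet)
      library(survival)
      cvfit <- cv.glmnet(x, Surv(time, status), family = "cox", alpha = 1)
      coef(cvfit, s = "lambda.min")   # lambda minimising the cross-validated deviance
      coef(cvfit, s = "lambda.1se")   # more parsimonious model within one SE of the minimum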
