License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
R code to impute continuous outcome. (R 1 kb)
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Example data sets and computer code for the book chapter titled "Missing Data in the Analysis of Multilevel and Dependent Data" submitted for publication in the second edition of "Dependent Data in Social Science Research" (Stemmler et al., 2015). This repository includes the computer code (".R") and the data sets from both example analyses (Examples 1 and 2). The data sets are available in two file formats (binary ".rda" for use in R; plain-text ".dat").
The data sets contain simulated data from 23,376 (Example 1) and 23,072 (Example 2) individuals from 2,000 groups on four variables:
ID = group identifier (1-2000)
x = numeric (Level 1)
y = numeric (Level 1)
w = binary (Level 2)
In all data sets, missing values are coded as "NA".
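As a quick check of the format described above, the plain-text files can be read into R with read.table(). The snippet below fabricates a tiny file in the same layout; the whitespace-separated columns and header row are assumptions for illustration, so consult the repository documentation for the actual format:

```r
# Sketch: reading a plain-text ".dat" file whose missing values are
# coded as "NA". The column layout (ID, x, y, w) follows the dataset
# description; the whitespace-separated layout is assumed.
dat_file <- tempfile(fileext = ".dat")
writeLines(c("ID x y w",
             "1 0.52 1.10 0",
             "1 NA 0.87 0",
             "2 -0.31 NA 1"), dat_file)

d <- read.table(dat_file, header = TRUE, na.strings = "NA")
str(d)             # 3 observations of 4 variables
colSums(is.na(d))  # one NA each in x and y
```

The binary ".rda" versions load directly via load(), which restores the stored objects into the workspace.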
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Prediction models are used to predict an outcome based on input variables. Missing data in input variables often occur at model development and at prediction time. The missForestPredict R package proposes an adaptation of the missForest imputation algorithm that is fast, user-friendly and tailored for prediction settings. The algorithm iteratively imputes variables using random forests until a convergence criterion, unified for continuous and categorical variables, is met. The imputation models are saved for each variable and iteration and can be applied later to new observations at prediction time. The missForestPredict package offers extended error monitoring, control over variables used in the imputation and custom initialization. This allows users to tailor the imputation to their specific needs. The missForestPredict algorithm is compared to mean/mode imputation, linear regression imputation, mice, k-nearest neighbours, bagging, miceRanger and IterativeImputer on eight simulated datasets with simulated missingness (48 scenarios) and eight large public datasets using different prediction models. missForestPredict provides competitive results in prediction settings within short computation times.
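The package's central idea, saving per-variable imputation models at development time and reapplying them to new observations at prediction time, can be sketched in base R. The snippet uses a single lm() as a stand-in for the per-variable random forests and is not the missForestPredict API:

```r
# Sketch of "save imputation models at development, reuse at prediction".
# A single lm() stands in for the package's per-variable random forests;
# variable names here are illustrative, not the missForestPredict API.
set.seed(1)
train <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
train$x2[sample(100, 20)] <- NA

# Development time: fit an imputation model for the incomplete variable
# on its complete cases, keep the fitted object, and impute with it.
fit_x2 <- lm(x2 ~ x1, data = train[!is.na(train$x2), ])
train$x2[is.na(train$x2)] <- predict(fit_x2, train[is.na(train$x2), ])

# Prediction time: apply the saved model to a new incomplete observation.
new_obs <- data.frame(x1 = 0.5, x2 = NA)
new_obs$x2 <- predict(fit_x2, new_obs)
```

The real algorithm repeats this over all incomplete variables and iterations until its convergence criterion is met, storing the model for each variable and iteration.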
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The use of multiple imputation (MI) is becoming increasingly popular for addressing missing data. Although some conventional MI approaches have been well studied and have shown empirical validity, they have limitations when processing large datasets with complex data structures. Their imputation performance usually relies on proper specification of the imputation models, which requires expert knowledge of the inherent relations among variables. Moreover, these standard approaches tend to be computationally inefficient for medium and large datasets. In this article, we propose a scalable MI framework mixgb, which is based on XGBoost, subsampling, and predictive mean matching. Our approach leverages the power of XGBoost, a fast implementation of gradient boosted trees, to automatically capture interactions and nonlinear relations while achieving high computational efficiency. In addition, we incorporate subsampling and predictive mean matching to reduce bias and to better account for appropriate imputation variability. The proposed framework is implemented in an R package mixgb. Supplementary materials for this article are available online.
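One of the ingredients named above, predictive mean matching, can be illustrated in a few lines of base R. This is a generic PMM sketch, in which missing cases borrow observed values from donors with similar predicted means, not mixgb's XGBoost-based implementation:

```r
# Generic predictive mean matching (PMM) sketch, not mixgb's code:
# impute each missing y with an *observed* y whose predicted mean is
# among the k closest to the missing case's predicted mean.
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 2 * x + rnorm(n)
y[sample(n, 40)] <- NA

obs  <- !is.na(y)
fit  <- lm(y ~ x, subset = obs)
pred <- predict(fit, data.frame(x = x))   # predicted means for all cases

k <- 5
for (i in which(!obs)) {
  donors <- order(abs(pred[obs] - pred[i]))[1:k]  # k nearest donors
  y[i]   <- sample(y[obs][donors], 1)             # draw one donor value
}
```

Because imputations are drawn from observed values, PMM keeps imputed values on the support of the data, which is part of why it reduces bias relative to plugging in model predictions directly.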
License: CC0 1.0, https://spdx.org/licenses/CC0-1.0.html
Fossil-based estimates of diversity and evolutionary dynamics mainly rely on the study of morphological variation. Unfortunately, organism remains are often altered by post-mortem taphonomic processes such as weathering or distortion. Such a loss of information often prevents quantitative multivariate description and statistically controlled comparisons of extinct species based on morphometric data. A common way to deal with missing data involves imputation methods that directly fill the missing cases with model estimates. In recent years, several empirically determined thresholds for the maximum acceptable proportion of missing values have been proposed in the literature, whereas other studies have shown that this limit actually depends on several properties of the study dataset and of the selected imputation method, and is in no way generalizable. We evaluate the relative performances of seven multiple imputation techniques through a simulation-based analysis under three distinct patterns of missing data distribution. Overall, Fully Conditional Specification and Expectation-Maximization algorithms provide the best compromises between imputation accuracy and coverage probability. Multiple imputation (MI) techniques appear remarkably robust to the violation of basic assumptions such as the occurrence of taxonomically or anatomically biased patterns of missing data distribution, making differences in simulation results between the three patterns of missing data distribution much smaller than differences between the individual MI techniques. Based on these results, rather than proposing a new (set of) threshold value(s), we develop an approach combining the use of multiple imputations with Procrustean superimposition of principal component analysis results, in order to directly visualize the effect of individual missing data imputation on an ordinated space. We provide an R function for users to implement the proposed procedure.
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
R code and data to reproduce figures and tables in the manuscript: Multiple imputation and direct estimation for qPCR data with non-detects.
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This repository contains the replication package for the article "Hazardous Times: Animal Spirits and U.S. Recession Probabilities." It includes all necessary R code, raw data, and processed data in the start-stop (counting process) format required to reproduce the empirical results, tables, and figures from the study.
Project Description: The study assembles monthly U.S. macroeconomic time series from the Federal Reserve Economic Data (FRED) and related sources, covering labor market conditions, consumer sentiment, term spreads, and credit spreads, and implements a novel "high water mark" methodology to measure the lead times with which these indicators signal NBER-dated recessions.
Contents:
Code: R scripts for data cleaning, multiple imputation, survival analysis, and figure/table generation. A top-level master script (run_all.R) executes the entire analytical pipeline end-to-end.
Data:
Raw/: Original data pulls from primary sources.
Analysis_Ready/: Cleaned series, constructed cycle-specific extremes (high water marks), lead time variables, and the final start-stop dataset for survival analysis, along with the final curated Excel workbooks used as direct inputs for the replication code. (Note: these Excel sheets must be saved as separate .xlsx files in the designated directory before running the R code.)
Documentation: This README file and detailed comments within the code.
Key Details:
Software Requirements: The replication code is written in R. A list of required R packages (with versions) is provided in the reference list of the article.
Missing Data: Addressed via Multiple Imputation by Chained Equations (MICE).
License: The original raw data from FRED is subject to its own terms of use, which require citation. The R code is released under the MIT License. All processed data, constructed variables, and analysis-ready datasets created by the author are dedicated to the public domain under the CC0 1.0 Universal Public Domain Dedication.
Instructions:
1. Download the entire repository.
2. Install the required R packages.
3. Save the sheets from the workbook "Hazardous_Times_Data.xlsx" as separate .xlsx files in the designated directory before running the R code in step 4.
4. Run the master script run_all.R to fully replicate the study's analysis from the provided Analysis_Ready data. This script will regenerate all tables and figures.
Users should consult the main publication for full context, theoretical motivation, and series-specific citations.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data and analysis files for diallel analysis of weight loss in 8-12 week old male and female mice (n=1,043), mock treated or infected with influenza A virus (H1N1, PR8) across 4 days post-infection, as well as founder haplotype effect analysis at Mx1 for pre-CC and CC-RIX.
A key challenge in the management of populations is to quantify the impact of interventions in the face of environmental and phenotypic variability. However, accurate estimation of the effects of management and environment in large-scale ecological research is often limited by the expense of data collection, the inherent trade-off between quality and quantity, and missing data. In this paper we develop a novel modelling framework, and demographically informed imputation scheme, to comprehensively account for the uncertainty generated by missing population, management, and herbicide resistance data. Using this framework and a large dataset (178 sites over 3 years) on the densities of a destructive arable weed (Alopecurus myosuroides) we investigate the effects of environment, management, and evolved herbicide resistance on weed population dynamics. In this study we quantify the marginal effects of a suite of common management practices, including cropping, cultivation, and herbici...
Data were collected from a network of UK farms using a density structured survey method outlined in Queensborough 2011.
Quantifying the impacts of management and herbicide resistance on regional plant population dynamics in the face of missing data
Contained are the datasets and code required to replicate the analyses in Goodsell et al. (2023), "Quantifying the impacts of management and herbicide resistance on regional plant population dynamics in the face of missing data."
Data: Contains data required to run all stages in the analysis.
Many files contain the same variable names; important variables are described in the first object in which they appear.
all_imputation_data.rds - The data required to run the imputation scheme; this is an R list containing the following:
$Management - data frame containing missing and observed values for management imputation
  FF & FFY: the specific field, and field year
  year: the year
  crop: crop
  cult_cat: cultivation category
  a_gly: number of autumn (post September 1st) glyphosate applicatio...
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository accompanies the doctoral thesis titled "The Effect of Health System, Health Risk Factors and Health Service Coverage on Fertility, Morbidity and Mortality in HDI Countries: An Econometric Analysis." Given the complexity of the data and methodological procedures, key supplementary materials detailing the data sources, processing techniques, analytical scripts, and extended results are provided in the present Mendeley Data Repository. These materials are intended to promote transparency, reproducibility, and further research. The repository includes the following supplementary files:
Supplementary File 1: Contains detailed information on all indicators used in the study, including those from the Global Reference List (GRL) and control variables. It specifies the definition, unit of measurement, data source, missing data proportions, inclusion status, and whether the indicator is positively or negatively associated with the outcome.
Supplementary File 2: Provides the R script used to perform Multiple Imputation by Chained Equations (MICE) to handle missing data across indicators.
Supplementary File 3: Describes the imputed dataset generated using the MICE method for both pre-COVID (2015–2019) and post-COVID (2020–2021) periods.
Supplementary File 4: Contains the R script used to construct the composite and sub-indices for Health System, Health Risk Factors, Service Coverage, and Health Status.
Supplementary File 5: Provides the R script used to compute Compound Annual Growth Rates (CAGR) for all indices and component indicators.
Supplementary File 6: Includes the Stata Do-file used to run panel data regression models, estimating the impact of Health System, Health Risk Factors, and Service Coverage on fertility, morbidity, and mortality.
Supplementary File 7: Contains the Stata Do-file used for conducting the Phillips and Sul Convergence Analysis to assess convergence/divergence trends among countries toward selected health-related SDG targets.
Supplementary File 8: Provides descriptive statistics, including mean, standard deviation, and coefficient of variation, for selected health indicators across 100 HDI countries during the study period (2015–2021).
Supplementary File 9: Presents the CAGR estimates of all constructed indices, separately reported for pre-COVID (2015–2019) and post-COVID (2020–2021) phases.
Supplementary File 10: Provides the forecasted values for 57 indicators across 100 countries up to the year 2025, supporting the study's predictive analysis.
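For readers unfamiliar with the method used in Supplementary File 2, the chained-equations idea behind MICE can be sketched in base R. This simplified single-dataset loop omits the posterior draws and the multiple completed datasets that the real mice package produces:

```r
# Simplified chained-equations loop (the idea behind MICE): initialize
# missing cells, then repeatedly regress each incomplete variable on the
# others and refresh its imputations.
set.seed(1)
d <- data.frame(a = rnorm(150), b = rnorm(150))
d$b <- d$a + d$b                         # make a and b correlated
d$a[sample(150, 30)] <- NA
d$b[sample(150, 30)] <- NA

miss <- lapply(d, is.na)                 # remember where the NAs were
d[] <- lapply(d, function(v) {           # crude start: mean-initialize
  v[is.na(v)] <- mean(v, na.rm = TRUE); v
})

for (iter in 1:5) {                      # cycle until stable in practice
  for (v in c("a", "b")) {
    other <- setdiff(names(d), v)
    fit <- lm(reformulate(other, v), data = d[!miss[[v]], ])
    d[miss[[v]], v] <- predict(fit, d[miss[[v]], , drop = FALSE])
  }
}
```

The mice package additionally draws imputations from predictive distributions and repeats the whole procedure m times, so that between-imputation variability can be propagated into the final estimates.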
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Multiple imputation (MI) has become a standard statistical technique for dealing with missing values. The CDC Anthrax Vaccine Research Program (AVRP) dataset created new challenges for MI due to the large number of variables of different types and the limited sample size. A common method for imputing missing data in such complex studies is to specify, for each of J variables with missing values, a univariate conditional distribution given all other variables, and then to draw imputations by iterating over the J conditional distributions. Such fully conditional imputation strategies have the theoretical drawback that the conditional distributions may be incompatible. When the missingness pattern is monotone, a theoretically valid approach is to specify, for each variable with missing values, a conditional distribution given the variables with fewer or the same number of missing values and sequentially draw from these distributions. In this article, we propose the “multiple imputation by ordered monotone blocks” approach, which combines these two basic approaches by decomposing any missingness pattern into a collection of smaller “constructed” monotone missingness patterns, and iterating. We apply this strategy to impute the missing data in the AVRP interim data. Supplemental materials, including all source code and a synthetic example dataset, are available online.
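The building block of this strategy, ordering variables by their amount of missingness and checking whether the resulting pattern is monotone, can be sketched in base R (an illustration of the concept only, not the article's implementation):

```r
# Order variables by missingness and test for a monotone pattern:
# after sorting columns from fewest to most NAs, the pattern is monotone
# if an NA in a column implies NAs in all later columns for that row.
is_monotone <- function(df) {
  df <- df[, order(colSums(is.na(df)))]          # fewest NAs first
  m  <- is.na(df)
  for (j in seq_len(ncol(m) - 1)) {
    if (any(m[, j] & !m[, j + 1])) return(FALSE) # NA then observed: no
  }
  TRUE
}

mono <- data.frame(x = c(1, 2, 3, 4),
                   y = c(1, 2, 3, NA),
                   z = c(1, 2, NA, NA))
nonmono <- data.frame(x = c(1, NA), y = c(NA, 1))
is_monotone(mono)      # TRUE
is_monotone(nonmono)   # FALSE
```

Monotone patterns are convenient because each incomplete variable can then be imputed sequentially, conditioning only on variables with fewer or the same number of missing values, which is what the "ordered monotone blocks" construction exploits.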
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We consider multiple imputation as a procedure iterating over a set of imputed datasets. Based on an appropriate stopping rule, the number of imputed datasets is determined. Simulations and real-data analyses indicate that the sufficient number of imputed datasets may in some cases be substantially larger than the very small numbers that are usually recommended. For easier use in various applications, the proposed method is implemented in the R package imi.
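The quantities such stopping rules monitor are the pooled estimate and its variance across the m imputed datasets, combined by Rubin's rules. A minimal base-R sketch (illustrative, not the imi package):

```r
# Rubin's rules for pooling one estimate across m imputed datasets:
# total variance = within-imputation W + (1 + 1/m) * between-imputation B.
pool_rubin <- function(est, se) {
  m    <- length(est)
  qbar <- mean(est)      # pooled point estimate
  W    <- mean(se^2)     # average within-imputation variance
  B    <- var(est)       # between-imputation variance
  list(estimate = qbar, variance = W + (1 + 1/m) * B)
}

# Hypothetical estimates and standard errors from m = 5 imputed datasets.
p <- pool_rubin(est = c(1.02, 0.97, 1.05, 0.99, 1.01),
                se  = c(0.11, 0.10, 0.12, 0.11, 0.10))
```

A stopping rule of the kind described above can keep increasing m until the pooled estimate and variance stabilize within a chosen tolerance.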
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Penalized regression methods are used in many biomedical applications for variable selection and simultaneous coefficient estimation. However, missing data complicates the implementation of these methods, particularly when missingness is handled using multiple imputation. Applying a variable selection algorithm on each imputed dataset will likely lead to different sets of selected predictors. This article considers a general class of penalized objective functions which, by construction, force selection of the same variables across imputed datasets. By pooling objective functions across imputations, optimization is then performed jointly over all imputed datasets rather than separately for each dataset. We consider two objective function formulations that exist in the literature, which we will refer to as “stacked” and “grouped” objective functions. Building on existing work, we (i) derive and implement efficient cyclic coordinate descent and majorization-minimization optimization algorithms for continuous and binary outcome data, (ii) incorporate adaptive shrinkage penalties, (iii) compare these methods through simulation, and (iv) develop an R package miselect. Simulations demonstrate that the “stacked” approaches are more computationally efficient and have better estimation and selection properties. We apply these methods to data from the University of Michigan ALS Patients Biorepository aiming to identify the association between environmental pollutants and ALS risk. Supplementary materials for this article are available online.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In many clinical studies, the time to event of interest may involve several causes of failure. Furthermore, when the failure times are not completely observed, and instead are only known to lie somewhere between two observation times, interval censored competing risk data occur. For estimating regression coefficients with right censored competing risk data, Fine and Gray introduced the concept of censoring complete data and derived an estimating equation using an inverse probability of censoring weighting technique to reflect the probability of being censored. As an alternative way to achieve censoring complete data, Ruan and Gray proposed directly imputing a potential censoring time for subjects who experienced the competing event. In this work, we extend Ruan and Gray's approach to interval censored competing risk data by applying a multiple imputation technique. The suggested method has the advantage of being easily implemented using several R functions developed for analyzing interval censored failure time data without competing risks. Simulation studies are conducted under diverse schemes to evaluate sizes and powers and to estimate regression coefficients. A dataset from an AIDS cohort study is analyzed as a real data example.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In response to Taiwan’s rapidly aging population and the rising demand for personalized health care, accurately assessing individual physiological aging has become an essential area of study. This research utilizes health examination data to propose a machine learning-based biological age prediction model that quantifies physiological age through residual life estimation. The model leverages LightGBM, which shows an 11.40% improvement in predictive performance (R-squared) compared to the XGBoost model. In the experiments, the use of MICE imputation for missing data significantly enhanced prediction accuracy, resulting in a 23.35% improvement in predictive performance. Kaplan-Meier (K-M) estimator survival analysis revealed that the model effectively differentiates between groups with varying health levels, underscoring the validity of biological age as a health status indicator. Additionally, the model identified the top ten biomarkers most influential in aging for both men and women, with a 69.23% overlap with Taiwan’s leading causes of death and previously identified top health-impact factors, further validating its practical relevance. Through multidimensional health recommendations based on SHAP and PCC interpretations, if the health recommendations provided by the model are implemented, 64.58% of individuals could potentially extend their life expectancy. This study provides new methodological support and data backing for precision health interventions and life extension.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These are datasets of human subsistence, mobility, demographic, and environmental variables for the 186 cultures of the Standard Cross Cultural Sample. Missing values have been filled by Multiple Imputation methods in the R statistical package. These data are used in an analysis about human subsistence transitions that is submitted to PNAS. The R script used to complete those analyses is also included.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This code can be used to define the functions, create the datasets, generate the figures and tables for the simulation study, and generate the results from the use case. (ZIP)
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Utilizing health administrative datasets for developing prediction models is always challenging due to missing values in key predictors. Multiple imputation has been recommended to deal with missing predictor values. However, predicting survival outcomes using regularized regression, e.g., Cox-LASSO, faces limitations as these methods are incompatible with pooling model outputs from multiply imputed data using Rubin's rules. In this study, we explored the performance of three statistical methods for developing prediction models with Cox-LASSO on multiply imputed data: prediction average, performance average, and stacked. We considered two hyperparameter selection techniques: minimum-lambda, which gives the minimum cross-validated prediction error, and 1SE-lambda, which selects more parsimonious models. We also conducted plasmode simulations varying the number of events per parameter. The stacked approach provided the most robust predictions in our case study of predicting tuberculosis mortality and in simulations, producing a time-dependent c-statistic of 0.93 and a well-calibrated calibration plot. The 1SE-lambda technique resulted in underfitting of the models in most scenarios, both in the case study and in simulations. Our findings advocate the stacked method with minimum-lambda as an effective technique for combining LASSO-based prediction outputs from multiply imputed data. We share reproducible R code to facilitate the adoption of these methodologies by future researchers.
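In the stacked approach described above, a single model is fit to the m imputed datasets stacked into one long dataset, with each row down-weighted by 1/m so the effective sample size is preserved. A base-R sketch of the stacking step, with lm() standing in for the Cox-LASSO fit and fabricated data purely for illustration:

```r
# "Stacked" fitting sketch: rbind the m imputed datasets and fit one
# weighted model, each row weighted 1/m. lm() stands in for Cox-LASSO;
# the data frames below stand in for m completed (imputed) datasets.
set.seed(1)
m <- 3
imputed <- lapply(1:m, function(i) {
  data.frame(x = rnorm(50), y = rnorm(50, mean = 2))
})

stacked   <- do.call(rbind, imputed)
stacked$w <- 1 / m                        # down-weight duplicated rows

fit <- lm(y ~ x, data = stacked, weights = w)
coef(fit)                                 # one pooled set of coefficients
```

Because a single model is fit, the same variables are selected and one lambda is tuned across all imputations, which is what makes the approach compatible with penalized estimators that Rubin's rules cannot pool directly.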