License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
R code to impute continuous outcome. (R 1 kb)
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Example data sets and computer code for the book chapter titled "Missing Data in the Analysis of Multilevel and Dependent Data" submitted for publication in the second edition of "Dependent Data in Social Science Research" (Stemmler et al., 2015). This repository includes the computer code (".R") and the data sets from both example analyses (Examples 1 and 2). The data sets are available in two file formats (binary ".rda" for use in R; plain-text ".dat").
The data sets contain simulated data from 23,376 (Example 1) and 23,072 (Example 2) individuals from 2,000 groups on four variables:
ID = group identifier (1-2000)
x = numeric (Level 1)
y = numeric (Level 1)
w = binary (Level 2)
In all data sets, missing values are coded as "NA".
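As a quick check of the format described above, the plain-text files can be read into R with read.table(). The snippet below fabricates a tiny file in the same layout; the whitespace-separated columns and header row are assumptions for illustration, so consult the repository documentation for the actual format:

```r
# Sketch: reading a plain-text ".dat" file whose missing values are
# coded as "NA". The column layout (ID, x, y, w) follows the dataset
# description; the whitespace-separated layout is assumed.
dat_file <- tempfile(fileext = ".dat")
writeLines(c("ID x y w",
             "1 0.52 1.10 0",
             "1 NA 0.87 0",
             "2 -0.31 NA 1"), dat_file)

d <- read.table(dat_file, header = TRUE, na.strings = "NA")
str(d)             # 3 observations of 4 variables
colSums(is.na(d))  # one NA each in x and y
```

The binary ".rda" versions load directly via load(), which restores the stored objects into the workspace.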
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Prediction models are used to predict an outcome based on input variables. Missing data in input variables often occur at model development and at prediction time. The missForestPredict R package proposes an adaptation of the missForest imputation algorithm that is fast, user-friendly and tailored for prediction settings. The algorithm iteratively imputes variables using random forests until a convergence criterion, unified for continuous and categorical variables, is met. The imputation models are saved for each variable and iteration and can be applied later to new observations at prediction time. The missForestPredict package offers extended error monitoring, control over variables used in the imputation and custom initialization. This allows users to tailor the imputation to their specific needs. The missForestPredict algorithm is compared to mean/mode imputation, linear regression imputation, mice, k-nearest neighbours, bagging, miceRanger and IterativeImputer on eight simulated datasets with simulated missingness (48 scenarios) and eight large public datasets using different prediction models. missForestPredict provides competitive results in prediction settings within short computation times.
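The package's central idea, saving per-variable imputation models at development time and reapplying them to new observations at prediction time, can be sketched in base R. The snippet uses a single lm() as a stand-in for the per-variable random forests and is not the missForestPredict API:

```r
# Sketch of "save imputation models at development, reuse at prediction".
# A single lm() stands in for the package's per-variable random forests;
# variable names here are illustrative, not the missForestPredict API.
set.seed(1)
train <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
train$x2[sample(100, 20)] <- NA

# Development time: fit an imputation model for the incomplete variable
# on its complete cases, keep the fitted object, and impute with it.
fit_x2 <- lm(x2 ~ x1, data = train[!is.na(train$x2), ])
train$x2[is.na(train$x2)] <- predict(fit_x2, train[is.na(train$x2), ])

# Prediction time: apply the saved model to a new incomplete observation.
new_obs <- data.frame(x1 = 0.5, x2 = NA)
new_obs$x2 <- predict(fit_x2, new_obs)
```

The real algorithm repeats this over all incomplete variables and iterations until its convergence criterion is met, storing the model for each variable and iteration.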
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The use of multiple imputation (MI) is becoming increasingly popular for addressing missing data. Although some conventional MI approaches have been well studied and have shown empirical validity, they have limitations when processing large datasets with complex data structures. Their imputation performance usually relies on proper specification of the imputation models, which requires expert knowledge of the inherent relations among variables. Moreover, these standard approaches tend to be computationally inefficient for medium and large datasets. In this article, we propose a scalable MI framework mixgb, which is based on XGBoost, subsampling, and predictive mean matching. Our approach leverages the power of XGBoost, a fast implementation of gradient boosted trees, to automatically capture interactions and nonlinear relations while achieving high computational efficiency. In addition, we incorporate subsampling and predictive mean matching to reduce bias and to better account for appropriate imputation variability. The proposed framework is implemented in an R package mixgb. Supplementary materials for this article are available online.
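One of the ingredients named above, predictive mean matching, can be illustrated in a few lines of base R. This is a generic PMM sketch, in which missing cases borrow observed values from donors with similar predicted means, not mixgb's XGBoost-based implementation:

```r
# Generic predictive mean matching (PMM) sketch, not mixgb's code:
# impute each missing y with an *observed* y whose predicted mean is
# among the k closest to the missing case's predicted mean.
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 2 * x + rnorm(n)
y[sample(n, 40)] <- NA

obs  <- !is.na(y)
fit  <- lm(y ~ x, subset = obs)
pred <- predict(fit, data.frame(x = x))   # predicted means for all cases

k <- 5
for (i in which(!obs)) {
  donors <- order(abs(pred[obs] - pred[i]))[1:k]  # k nearest donors
  y[i]   <- sample(y[obs][donors], 1)             # draw one donor value
}
```

Because imputations are drawn from observed values, PMM keeps imputed values on the support of the data, which is part of why it reduces bias relative to plugging in model predictions directly.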
License: CC0 1.0, https://spdx.org/licenses/CC0-1.0.html
Fossil-based estimates of diversity and evolutionary dynamics mainly rely on the study of morphological variation. Unfortunately, organism remains are often altered by post-mortem taphonomic processes such as weathering or distortion. Such a loss of information often prevents quantitative multivariate description and statistically controlled comparisons of extinct species based on morphometric data. A common way to deal with missing data involves imputation methods that directly fill the missing cases with model estimates. In recent years, several empirically determined thresholds for the maximum acceptable proportion of missing values have been proposed in the literature, whereas other studies have shown that this limit actually depends on several properties of the study dataset and of the selected imputation method, and is in no way generalizable. We evaluate the relative performances of seven multiple imputation techniques through a simulation-based analysis under three distinct patterns of missing data distribution. Overall, Fully Conditional Specification and Expectation-Maximization algorithms provide the best compromises between imputation accuracy and coverage probability. Multiple imputation (MI) techniques appear remarkably robust to the violation of basic assumptions such as the occurrence of taxonomically or anatomically biased patterns of missing data distribution, making differences in simulation results between the three patterns of missing data distribution much smaller than differences between the individual MI techniques. Based on these results, rather than proposing a new (set of) threshold value(s), we develop an approach combining the use of multiple imputations with Procrustean superimposition of principal component analysis results, in order to directly visualize the effect of individual missing data imputation on an ordinated space. We provide an R function for users to implement the proposed procedure.
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
R code and data to reproduce figures and tables in the manuscript: Multiple imputation and direct estimation for qPCR data with non-detects.
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This repository contains the replication package for the article "Hazardous Times: Animal Spirits and U.S. Recession Probabilities." It includes all necessary R code, raw data, and processed data in the start-stop (counting process) format required to reproduce the empirical results, tables, and figures from the study.
Project Description: The study assembles monthly U.S. macroeconomic time series from the Federal Reserve Economic Data (FRED) and related sources, covering labor market conditions, consumer sentiment, term spreads, and credit spreads, and implements a novel "high water mark" methodology to measure the lead times with which these indicators signal NBER-dated recessions.
Contents:
Code: R scripts for data cleaning, multiple imputation, survival analysis, and figure/table generation. A top-level master script (run_all.R) executes the entire analytical pipeline end-to-end.
Data:
Raw/: Original data pulls from primary sources.
Analysis_Ready/: Cleaned series, constructed cycle-specific extremes (high water marks), lead time variables, and the final start-stop dataset for survival analysis, along with the final curated Excel workbooks used as direct inputs for the replication code. (Note: these Excel sheets must be saved as separate .xlsx files in the designated directory before running the R code.)
Documentation: This README file and detailed comments within the code.
Key Details:
Software Requirements: The replication code is written in R. A list of required R packages (with versions) is provided in the reference list of the article.
Missing Data: Addressed via Multiple Imputation by Chained Equations (MICE).
License: The original raw data from FRED is subject to its own terms of use, which require citation. The R code is released under the MIT License. All processed data, constructed variables, and analysis-ready datasets created by the author are dedicated to the public domain under the CC0 1.0 Universal Public Domain Dedication.
Instructions:
1. Download the entire repository.
2. Install the required R packages.
3. Save the sheets from the workbook "Hazardous_Times_Data.xlsx" as separate .xlsx files in the designated directory before running the R code in step 4.
4. Run the master script run_all.R to fully replicate the study's analysis from the provided Analysis_Ready data. This script will regenerate all tables and figures.
Users should consult the main publication for full context, theoretical motivation, and series-specific citations.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data and analysis files for diallel analysis of weight loss in 8-12 week old male and female mice (n=1,043), mock treated or infected with influenza A virus (H1N1, PR8) across 4 days post-infection, as well as founder haplotype effect analysis at Mx1 for pre-CC and CC-RIX.
A key challenge in the management of populations is to quantify the impact of interventions in the face of environmental and phenotypic variability. However, accurate estimation of the effects of management and environment in large-scale ecological research is often limited by the expense of data collection, the inherent trade-off between quality and quantity, and missing data. In this paper we develop a novel modelling framework, and demographically informed imputation scheme, to comprehensively account for the uncertainty generated by missing population, management, and herbicide resistance data. Using this framework and a large dataset (178 sites over 3 years) on the densities of a destructive arable weed (Alopecurus myosuroides) we investigate the effects of environment, management, and evolved herbicide resistance on weed population dynamics. In this study we quantify the marginal effects of a suite of common management practices, including cropping, cultivation, and herbici...
Data were collected from a network of UK farms using a density structured survey method outlined in Queensborough 2011.
Quantifying the impacts of management and herbicide resistance on regional plant population dynamics in the face of missing data
Contained are the datasets and code required to replicate the analyses in Goodsell et al. (2023), "Quantifying the impacts of management and herbicide resistance on regional plant population dynamics in the face of missing data."
Data: Contains data required to run all stages in the analysis.
Many files contain the same variable names; important variables are described in the first object in which they appear.
all_imputation_data.rds - The data required to run the imputation scheme; this is an R list containing the following:
$Management - data frame containing missing and observed values for management imputation
  FF & FFY: the specific field, and field year
  year: the year
  crop: crop
  cult_cat: cultivation category
  a_gly: number of autumn (post September 1st) glyphosate applicatio...
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository accompanies the doctoral thesis titled "The Effect of Health System, Health Risk Factors and Health Service Coverage on Fertility, Morbidity and Mortality in HDI Countries: An Econometric Analysis." Given the complexity of the data and methodological procedures, key supplementary materials detailing the data sources, processing techniques, analytical scripts, and extended results are provided in the present Mendeley Data Repository. These materials are intended to promote transparency, reproducibility, and further research. The repository includes the following supplementary files:
Supplementary File 1: Contains detailed information on all indicators used in the study, including those from the Global Reference List (GRL) and control variables. It specifies the definition, unit of measurement, data source, missing data proportions, inclusion status, and whether the indicator is positively or negatively associated with the outcome.
Supplementary File 2: Provides the R script used to perform Multiple Imputation by Chained Equations (MICE) to handle missing data across indicators.
Supplementary File 3: Describes the imputed dataset generated using the MICE method for both pre-COVID (2015–2019) and post-COVID (2020–2021) periods.
Supplementary File 4: Contains the R script used to construct the composite and sub-indices for Health System, Health Risk Factors, Service Coverage, and Health Status.
Supplementary File 5: Provides the R script used to compute Compound Annual Growth Rates (CAGR) for all indices and component indicators.
Supplementary File 6: Includes the Stata Do-file used to run panel data regression models, estimating the impact of Health System, Health Risk Factors, and Service Coverage on fertility, morbidity, and mortality.
Supplementary File 7: Contains the Stata Do-file used for conducting the Phillips and Sul Convergence Analysis to assess convergence/divergence trends among countries toward selected health-related SDG targets.
Supplementary File 8: Provides descriptive statistics, including mean, standard deviation, and coefficient of variation, for selected health indicators across 100 HDI countries during the study period (2015–2021).
Supplementary File 9: Presents the CAGR estimates of all constructed indices, separately reported for pre-COVID (2015–2019) and post-COVID (2020–2021) phases.
Supplementary File 10: Provides the forecasted values for 57 indicators across 100 countries up to the year 2025, supporting the study's predictive analysis.
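For readers unfamiliar with the method used in Supplementary File 2, the chained-equations idea behind MICE can be sketched in base R. This simplified single-dataset loop omits the posterior draws and the multiple completed datasets that the real mice package produces:

```r
# Simplified chained-equations loop (the idea behind MICE): initialize
# missing cells, then repeatedly regress each incomplete variable on the
# others and refresh its imputations.
set.seed(1)
d <- data.frame(a = rnorm(150), b = rnorm(150))
d$b <- d$a + d$b                         # make a and b correlated
d$a[sample(150, 30)] <- NA
d$b[sample(150, 30)] <- NA

miss <- lapply(d, is.na)                 # remember where the NAs were
d[] <- lapply(d, function(v) {           # crude start: mean-initialize
  v[is.na(v)] <- mean(v, na.rm = TRUE); v
})

for (iter in 1:5) {                      # cycle until stable in practice
  for (v in c("a", "b")) {
    other <- setdiff(names(d), v)
    fit <- lm(reformulate(other, v), data = d[!miss[[v]], ])
    d[miss[[v]], v] <- predict(fit, d[miss[[v]], , drop = FALSE])
  }
}
```

The mice package additionally draws imputations from predictive distributions and repeats the whole procedure m times, so that between-imputation variability can be propagated into the final estimates.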
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Multiple imputation (MI) has become a standard statistical technique for dealing with missing values. The CDC Anthrax Vaccine Research Program (AVRP) dataset created new challenges for MI due to the large number of variables of different types and the limited sample size. A common method for imputing missing data in such complex studies is to specify, for each of J variables with missing values, a univariate conditional distribution given all other variables, and then to draw imputations by iterating over the J conditional distributions. Such fully conditional imputation strategies have the theoretical drawback that the conditional distributions may be incompatible. When the missingness pattern is monotone, a theoretically valid approach is to specify, for each variable with missing values, a conditional distribution given the variables with fewer or the same number of missing values and sequentially draw from these distributions. In this article, we propose the “multiple imputation by ordered monotone blocks” approach, which combines these two basic approaches by decomposing any missingness pattern into a collection of smaller “constructed” monotone missingness patterns, and iterating. We apply this strategy to impute the missing data in the AVRP interim data. Supplemental materials, including all source code and a synthetic example dataset, are available online.
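The building block of this strategy, ordering variables by their amount of missingness and checking whether the resulting pattern is monotone, can be sketched in base R (an illustration of the concept only, not the article's implementation):

```r
# Order variables by missingness and test for a monotone pattern:
# after sorting columns from fewest to most NAs, the pattern is monotone
# if an NA in a column implies NAs in all later columns for that row.
is_monotone <- function(df) {
  df <- df[, order(colSums(is.na(df)))]          # fewest NAs first
  m  <- is.na(df)
  for (j in seq_len(ncol(m) - 1)) {
    if (any(m[, j] & !m[, j + 1])) return(FALSE) # NA then observed: no
  }
  TRUE
}

mono <- data.frame(x = c(1, 2, 3, 4),
                   y = c(1, 2, 3, NA),
                   z = c(1, 2, NA, NA))
nonmono <- data.frame(x = c(1, NA), y = c(NA, 1))
is_monotone(mono)      # TRUE
is_monotone(nonmono)   # FALSE
```

Monotone patterns are convenient because each incomplete variable can then be imputed sequentially, conditioning only on variables with fewer or the same number of missing values, which is what the "ordered monotone blocks" construction exploits.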
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We consider multiple imputation as a procedure iterating over a set of imputed datasets. Based on an appropriate stopping rule, the number of imputed datasets is determined. Simulations and real-data analyses indicate that the sufficient number of imputed datasets may in some cases be substantially larger than the very small numbers that are usually recommended. For easier use in various applications, the proposed method is implemented in the R package imi.
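The quantities such stopping rules monitor are the pooled estimate and its variance across the m imputed datasets, combined by Rubin's rules. A minimal base-R sketch (illustrative, not the imi package):

```r
# Rubin's rules for pooling one estimate across m imputed datasets:
# total variance = within-imputation W + (1 + 1/m) * between-imputation B.
pool_rubin <- function(est, se) {
  m    <- length(est)
  qbar <- mean(est)      # pooled point estimate
  W    <- mean(se^2)     # average within-imputation variance
  B    <- var(est)       # between-imputation variance
  list(estimate = qbar, variance = W + (1 + 1/m) * B)
}

# Hypothetical estimates and standard errors from m = 5 imputed datasets.
p <- pool_rubin(est = c(1.02, 0.97, 1.05, 0.99, 1.01),
                se  = c(0.11, 0.10, 0.12, 0.11, 0.10))
```

A stopping rule of the kind described above can keep increasing m until the pooled estimate and variance stabilize within a chosen tolerance.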
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Penalized regression methods are used in many biomedical applications for variable selection and simultaneous coefficient estimation. However, missing data complicates the implementation of these methods, particularly when missingness is handled using multiple imputation. Applying a variable selection algorithm on each imputed dataset will likely lead to different sets of selected predictors. This article considers a general class of penalized objective functions which, by construction, force selection of the same variables across imputed datasets. By pooling objective functions across imputations, optimization is then performed jointly over all imputed datasets rather than separately for each dataset. We consider two objective function formulations that exist in the literature, which we will refer to as “stacked” and “grouped” objective functions. Building on existing work, we (i) derive and implement efficient cyclic coordinate descent and majorization-minimization optimization algorithms for continuous and binary outcome data, (ii) incorporate adaptive shrinkage penalties, (iii) compare these methods through simulation, and (iv) develop an R package miselect. Simulations demonstrate that the “stacked” approaches are more computationally efficient and have better estimation and selection properties. We apply these methods to data from the University of Michigan ALS Patients Biorepository aiming to identify the association between environmental pollutants and ALS risk. Supplementary materials for this article are available online.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In many clinical studies, the time to event of interest may involve several causes of failure. Furthermore, when the failure times are not completely observed, and instead are only known to lie somewhere between two observation times, interval censored competing risk data occur. For estimating regression coefficients with right censored competing risk data, Fine and Gray introduced the concept of censoring complete data and derived an estimating equation using an inverse probability of censoring weighting technique to reflect the probability of being censored. As an alternative way to achieve censoring complete data, Ruan and Gray proposed directly imputing a potential censoring time for subjects who experienced the competing event. In this work, we extend Ruan and Gray's approach to interval censored competing risk data by applying a multiple imputation technique. The suggested method has the advantage of being easily implemented using several R functions developed for analyzing interval censored failure time data without competing risks. Simulation studies are conducted under diverse schemes to evaluate sizes and powers and to estimate regression coefficients. A dataset from an AIDS cohort study is analyzed as a real data example.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In response to Taiwan’s rapidly aging population and the rising demand for personalized health care, accurately assessing individual physiological aging has become an essential area of study. This research utilizes health examination data to propose a machine learning-based biological age prediction model that quantifies physiological age through residual life estimation. The model leverages LightGBM, which shows an 11.40% improvement in predictive performance (R-squared) compared to the XGBoost model. In the experiments, the use of MICE imputation for missing data significantly enhanced prediction accuracy, resulting in a 23.35% improvement in predictive performance. Kaplan-Meier (K-M) estimator survival analysis revealed that the model effectively differentiates between groups with varying health levels, underscoring the validity of biological age as a health status indicator. Additionally, the model identified the top ten biomarkers most influential in aging for both men and women, with a 69.23% overlap with Taiwan’s leading causes of death and previously identified top health-impact factors, further validating its practical relevance. Through multidimensional health recommendations based on SHAP and PCC interpretations, if the health recommendations provided by the model are implemented, 64.58% of individuals could potentially extend their life expectancy. This study provides new methodological support and data backing for precision health interventions and life extension.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These are datasets of human subsistence, mobility, demographic, and environmental variables for the 186 cultures of the Standard Cross Cultural Sample. Missing values have been filled by Multiple Imputation methods in the R statistical package. These data are used in an analysis about human subsistence transitions that is submitted to PNAS. The R script used to complete those analyses is also included.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This code can be used to define the functions, create the datasets, generate the figures and tables for the simulation study, and generate the results from the use case. (ZIP)
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Utilizing health administrative datasets for developing prediction models is always challenging due to missing values in key predictors. Multiple imputation has been recommended to deal with missing predictor values. However, predicting survival outcomes using regularized regression, e.g., Cox-LASSO, faces limitations as these methods are incompatible with pooling model outputs from multiply imputed data using Rubin's rules. In this study, we explored the performance of three statistical methods for developing prediction models with Cox-LASSO on multiply imputed data: prediction average, performance average, and stacked. We considered two hyperparameter selection techniques: minimum-lambda, which gives the minimum cross-validated prediction error, and 1SE-lambda, which selects more parsimonious models. We also conducted plasmode simulations varying the number of events per parameter. The stacked approach provided the most robust predictions in our case study of predicting tuberculosis mortality and in simulations, producing a time-dependent c-statistic of 0.93 and a well-calibrated calibration plot. The 1SE-lambda technique resulted in underfitting of the models in most scenarios, both in the case study and in simulations. Our findings advocate the stacked method with minimum-lambda as an effective technique for combining LASSO-based prediction outputs from multiply imputed data. We share reproducible R code to facilitate the adoption of these methodologies by future researchers.
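In the stacked approach described above, a single model is fit to the m imputed datasets stacked into one long dataset, with each row down-weighted by 1/m so the effective sample size is preserved. A base-R sketch of the stacking step, with lm() standing in for the Cox-LASSO fit and fabricated data purely for illustration:

```r
# "Stacked" fitting sketch: rbind the m imputed datasets and fit one
# weighted model, each row weighted 1/m. lm() stands in for Cox-LASSO;
# the data frames below stand in for m completed (imputed) datasets.
set.seed(1)
m <- 3
imputed <- lapply(1:m, function(i) {
  data.frame(x = rnorm(50), y = rnorm(50, mean = 2))
})

stacked   <- do.call(rbind, imputed)
stacked$w <- 1 / m                        # down-weight duplicated rows

fit <- lm(y ~ x, data = stacked, weights = w)
coef(fit)                                 # one pooled set of coefficients
```

Because a single model is fit, the same variables are selected and one lambda is tuned across all imputations, which is what makes the approach compatible with penalized estimators that Rubin's rules cannot pool directly.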