44 datasets found
  1. Iterative Multiple Imputation: A Framework to Determine the Number of...

    • tandf.figshare.com
    zip
    Updated May 31, 2023
    Cite
    Vahid Nassiri; Geert Molenberghs; Geert Verbeke; João Barbosa-Breda (2023). Iterative Multiple Imputation: A Framework to Determine the Number of Imputed Datasets [Dataset]. http://doi.org/10.6084/m9.figshare.7445375.v2
    Available download formats: zip
    Dataset updated
    May 31, 2023
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Vahid Nassiri; Geert Molenberghs; Geert Verbeke; João Barbosa-Breda
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We consider multiple imputation as a procedure iterating over a set of imputed datasets. Based on an appropriate stopping rule the number of imputed datasets is determined. Simulations and real-data analyses indicate that the sufficient number of imputed datasets may in some cases be substantially larger than the very small numbers that are usually recommended. For an easier use in various applications, the proposed method is implemented in the R package imi.

  2. Variable Selection with Multiply-Imputed Datasets: Choosing Between Stacked...

    • tandf.figshare.com
    pdf
    Updated Jun 3, 2023
    Cite
    Jiacong Du; Jonathan Boss; Peisong Han; Lauren J. Beesley; Michael Kleinsasser; Stephen A. Goutman; Stuart Batterman; Eva L. Feldman; Bhramar Mukherjee (2023). Variable Selection with Multiply-Imputed Datasets: Choosing Between Stacked and Grouped Methods [Dataset]. http://doi.org/10.6084/m9.figshare.19111441.v2
    Available download formats: pdf
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Jiacong Du; Jonathan Boss; Peisong Han; Lauren J. Beesley; Michael Kleinsasser; Stephen A. Goutman; Stuart Batterman; Eva L. Feldman; Bhramar Mukherjee
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Penalized regression methods are used in many biomedical applications for variable selection and simultaneous coefficient estimation. However, missing data complicates the implementation of these methods, particularly when missingness is handled using multiple imputation. Applying a variable selection algorithm on each imputed dataset will likely lead to different sets of selected predictors. This article considers a general class of penalized objective functions which, by construction, force selection of the same variables across imputed datasets. By pooling objective functions across imputations, optimization is then performed jointly over all imputed datasets rather than separately for each dataset. We consider two objective function formulations that exist in the literature, which we will refer to as “stacked” and “grouped” objective functions. Building on existing work, we (i) derive and implement efficient cyclic coordinate descent and majorization-minimization optimization algorithms for continuous and binary outcome data, (ii) incorporate adaptive shrinkage penalties, (iii) compare these methods through simulation, and (iv) develop an R package miselect. Simulations demonstrate that the “stacked” approaches are more computationally efficient and have better estimation and selection properties. We apply these methods to data from the University of Michigan ALS Patients Biorepository aiming to identify the association between environmental pollutants and ALS risk. Supplementary materials for this article are available online.

  3. Data from: Validity of using multiple imputation for "unknown" stage at...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Jun 27, 2017
    Cite
    Luo, Qingwei; Egger, Sam; Yu, Xue Qin; Smith, David P.; O’Connell, Dianne L. (2017). Validity of using multiple imputation for "unknown" stage at diagnosis in population-based cancer registry data [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001781541
    Dataset updated
    Jun 27, 2017
    Authors
    Luo, Qingwei; Egger, Sam; Yu, Xue Qin; Smith, David P.; O’Connell, Dianne L.
    Description

    Background: The multiple imputation approach to missing data has been validated by a number of simulation studies by artificially inducing missingness on fully observed stage data under a pre-specified missing data mechanism. However, the validity of multiple imputation has not yet been assessed using real data. The objective of this study was to assess the validity of using multiple imputation for “unknown” prostate cancer stage recorded in the New South Wales Cancer Registry (NSWCR) in real-world conditions.

    Methods: Data from the population-based cohort study NSW Prostate Cancer Care and Outcomes Study (PCOS) were linked to 2000–2002 NSWCR data. For cases with “unknown” NSWCR stage, PCOS-stage was extracted from clinical notes. Logistic regression was used to evaluate the missing at random assumption adjusted for variables from two imputation models: a basic model including NSWCR variables only and an enhanced model including the same NSWCR variables together with PCOS primary treatment. Cox regression was used to evaluate the performance of MI.

    Results: Of the 1864 prostate cancer cases 32.7% were recorded as having “unknown” NSWCR stage. The missing at random assumption was satisfied when the logistic regression included the variables included in the enhanced model, but not those in the basic model only. The Cox models using data with imputed stage from either imputation model provided generally similar estimated hazard ratios but with wider confidence intervals compared with those derived from analysis of the data with PCOS-stage. However, the complete-case analysis of the data provided a considerably higher estimated hazard ratio for the low socio-economic status group and rural areas in comparison with those obtained from all other datasets.

    Conclusions: Using MI to deal with “unknown” stage data recorded in a population-based cancer registry appears to provide valid estimates. We would recommend a cautious approach to the use of this method elsewhere.

  4. Replication Data for: Prediction, Proxies, and Power

    • dataverse.harvard.edu
    Updated Nov 25, 2019
    Cite
    Brenton Kenkel; Robert Carroll (2019). Replication Data for: Prediction, Proxies, and Power [Dataset]. http://doi.org/10.7910/DVN/FPYKTP
    Dataset updated
    Nov 25, 2019
    Dataset provided by
    Harvard Dataverse
    Authors
    Brenton Kenkel; Robert Carroll
    License

    https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.7910/DVN/FPYKTP

    Description

    Many enduring questions in international relations theory focus on power relations, so it is important that scholars have a good measure of relative power. The standard measure of relative military power, the capability ratio, is barely better than random guessing at predicting militarized dispute outcomes. We use machine learning to build a superior proxy, the Dispute Outcome Expectations score, from the same underlying data. Our measure is an order of magnitude better than the capability ratio at predicting dispute outcomes. We replicate Reed et al. (2008) and find, contrary to the original conclusions, that the probability of conflict is always highest when the state with the least benefits has a preponderance of power. In replications of 18 other dyadic analyses that use power as a control, we find that replacing the standard measure with DOE scores usually improves both in-sample and out-of-sample goodness of fit.

    Note: This analysis involves many layers of computation: multiple imputation of the underlying data, creation of an ensemble of machine learning models on the imputed datasets, predictions from that ensemble, and replications of previous studies using those predictions. Our replication code sets seeds in any script where random numbers are drawn, and runs in a Docker environment to ensure identical package versions across machines. Nevertheless, because of differences in machine precision and floating point computations across CPUs, the replication code may not produce results identical to those in the paper. Any differences should be small in magnitude and should not affect any substantive conclusions of the analysis.

  5. Pooling ANOVA Results from Multiply Imputed Datasets - Supplemental...

    • figshare.com
    txt
    Updated Jan 19, 2016
    Cite
    Simon Grund; Oliver Lüdtke; Alexander Robitzsch (2016). Pooling ANOVA Results from Multiply Imputed Datasets - Supplemental Materials - Example Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.1267542.v1
    Available download formats: txt
    Dataset updated
    Jan 19, 2016
    Dataset provided by
    Figshare (http://figshare.com/)
    figshare
    Authors
    Simon Grund; Oliver Lüdtke; Alexander Robitzsch
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Example dataset for the analysis example provided in the online supplemental material of the corresponding article.

  6. Data from: Multiple Imputation of Missing or Faulty Values Under Linear...

    • datasetcatalog.nlm.nih.gov
    • tandf.figshare.com
    Updated Aug 19, 2014
    Cite
    Cox, Lawrence H.; Karr, Alan F.; Reiter, Jerome P.; Wang, Quanli; Kim, Hang J. (2014). Multiple Imputation of Missing or Faulty Values Under Linear Constraints [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001187823
    Dataset updated
    Aug 19, 2014
    Authors
    Cox, Lawrence H.; Karr, Alan F.; Reiter, Jerome P.; Wang, Quanli; Kim, Hang J.
    Description

    Many statistical agencies, survey organizations, and research centers collect data that suffer from item nonresponse and erroneous or inconsistent values. These data may be required to satisfy linear constraints, for example, bounds on individual variables and inequalities for ratios or sums of variables. Often these constraints are designed to identify faulty values, which then are blanked and imputed. The data also may exhibit complex distributional features, including nonlinear relationships and highly nonnormal distributions. We present a fully Bayesian, joint model for modeling or imputing data with missing/blanked values under linear constraints that (i) automatically incorporates the constraints in inferences and imputations, and (ii) uses a flexible Dirichlet process mixture of multivariate normal distributions to reflect complex distributional features. Our strategy for estimation is to augment the observed data with draws from a hypothetical population in which the constraints are not present, thereby taking advantage of computationally expedient methods for fitting mixture models. Missing/blanked items are sampled from their posterior distribution using the Hit-and-Run sampler, which guarantees that all imputations satisfy the constraints. We illustrate the approach using manufacturing data from Colombia, examining the potential to preserve joint distributions and a regression from the plant productivity literature. Supplementary materials for this article are available online.

  7. Table_2_Evaluating the Accuracy of Imputation Methods in a Five-Way Admixed...

    • datasetcatalog.nlm.nih.gov
    • frontiersin.figshare.com
    Updated Feb 5, 2019
    + more versions
    Cite
    Hoal, Eileen G.; Kinnear, Craig J.; Schurz, Haiko; van Helden, Paul David; Möller, Marlo; Müller, Stephanie J.; Tromp, Gerard (2019). Table_2_Evaluating the Accuracy of Imputation Methods in a Five-Way Admixed Population.DOCX [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000154552
    Dataset updated
    Feb 5, 2019
    Authors
    Hoal, Eileen G.; Kinnear, Craig J.; Schurz, Haiko; van Helden, Paul David; Möller, Marlo; Müller, Stephanie J.; Tromp, Gerard
    Description

    Genotype imputation is a powerful tool for increasing statistical power in an association analysis. Meta-analysis of multiple study datasets also requires a substantial overlap of SNPs for a successful association analysis, which can be achieved by imputation. Quality of imputed datasets is largely dependent on the software used, as well as the reference populations chosen. The accuracy of imputation of available reference populations has not been tested for the five-way admixed South African Colored (SAC) population. In this study, imputation results obtained using three freely-accessible methods were evaluated for accuracy and quality. We show that the African Genome Resource is the best reference panel for imputation of missing genotypes in samples from the SAC population, implemented via the freely accessible Sanger Imputation Server.

  8. Additional file 1: of Outcome-sensitive multiple imputation: a simulation...

    • springernature.figshare.com
    • datasetcatalog.nlm.nih.gov
    zip
    Updated May 31, 2023
    Cite
    Evangelos Kontopantelis; Ian White; Matthew Sperrin; Iain Buchan (2023). Additional file 1: of Outcome-sensitive multiple imputation: a simulation study [Dataset]. http://doi.org/10.6084/m9.figshare.c.3661877_D2.v1
    Available download formats: zip
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    figshare
    Authors
    Evangelos Kontopantelis; Ian White; Matthew Sperrin; Iain Buchan
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Simulation code file 1 of 4. Generate data and obtain true estimates (making sure the simulations work as they should before incorporating the missing data mechanisms). Simulation code file 2 of 4. Main data generation file across missingness mechanisms (1 of 2). Simulation code file 3 of 4. Main data generation file across missingness mechanisms (2 of 2). Simulation code file 4 of 4. Summarise the simulation results in a data file. (ZIP 10 kb)

  9. Comparison of observed and imputed data from 100 multiple imputation models,...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Jan 16, 2020
    Cite
    Pant, Dikshya; Shrestha, Shrijana; Basnyat,; Pollard, Andrew J.; Colin-Jones, Rachel; Smith, Nicola; Shakya, Mila; Voysey, Merryn; Pitzer, Virginia E.; Theiss-Nyland, Katherine; Liu, Xinxue (2020). Comparison of observed and imputed data from 100 multiple imputation models, by age group. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000546027
    Dataset updated
    Jan 16, 2020
    Authors
    Pant, Dikshya; Shrestha, Shrijana; Basnyat,; Pollard, Andrew J.; Colin-Jones, Rachel; Smith, Nicola; Shakya, Mila; Voysey, Merryn; Pitzer, Virginia E.; Theiss-Nyland, Katherine; Liu, Xinxue
    Description

    Comparison of observed and imputed data from 100 multiple imputation models, by age group.

  10. Replication Data for: "Subnational views on multilevel governance"

    • search.dataone.org
    Updated Dec 16, 2023
    Cite
    Yasar, Rusen (2023). Replication Data for: "Subnational views on multilevel governance" [Dataset]. http://doi.org/10.7910/DVN/S4Z0U0
    Dataset updated
    Dec 16, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Yasar, Rusen
    Description

    The main dataset contains data from a cross-national survey conducted with local and regional politicians in several European countries in 2015. The data were collected by the researcher individually. The main dataset is provided in .csv format. The dataset was primarily used for drafting the paper "Subnational views on multilevel governance". An accompanying R script shows data preparation, analysis, simulations and plotting. Multiple imputation was used to mend item non-response: results may vary slightly in different iterations. For the replication of exact results as reported in the paper, 5 imputed datasets are also provided.

  11. Forecasting potential invaders to prevent future biological invasions...

    • researchdata.edu.au
    • bridges.monash.edu
    Updated Oct 16, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Arman Pili (2024). Forecasting potential invaders to prevent future biological invasions worldwide, Supporting Information S1 [Dataset]. http://doi.org/10.26180/24080646.V5
    Dataset updated
    Oct 16, 2024
    Dataset provided by
    Monash University
    Authors
    Arman Pili
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Forecasting potential invaders to prevent future biological invasions worldwide, Supporting Information S1

    ## Research rationale

    The ever-increasing and expanding globalisation of trade and transport underpins the escalating global problem of biological invasions. Developing biosecurity infrastructures is crucial to anticipate and prevent the transport and introduction of invasive alien species, but robust and defensible forecasts of potential invaders, especially species worldwide with no invasion history, are rare.


    ## The tool

    Here, we aim to support decision-making by developing a quantitative invasion risk assessment tool based on invasion syndromes (i.e. attributes of a typical invasive alien species). We implemented a multiple imputation by chained equations workflow to estimate invasion syndromes from imputed datasets of species’ life-history and ecological traits (e.g., body size, reproductive traits, microhabitat) and macroecological patterns (e.g., geographic range size, commonness, habitat generalism, tolerance to disturbance).


    The tool runs in R. This repository contains the R scripts and sample files needed to run it.


    The description and application of the tool can be read in full in Pili et al. (2024, Global Change Biology). The project repository containing the R code to run our quantitative invasion risk assessment tool can be accessed at: https://github.com/armanpili/ForecastingInvaders

    Contained herein are the associated data of the project.


    Tabl_S1_1.csv — Unintentionally transported and introduced amphibians and reptiles

    TableS1_2_raw.xslx — Full raw and harmonised life-history and ecological traits of global amphibians and reptiles.

    TableS1_2_1.csv — Life-history and ecological traits and macroecological patterns of frogs used in multiple imputation. This is a subset of the global amphibian dataset, reduced to contain a maximum of 60% data missingness in columns (variables) and rows (species).

    TableS1_2_2.csv — Life-history and ecological traits and macroecological patterns of lizards used in multiple imputation. This is a subset of the global saurian reptile dataset, reduced to contain a maximum of 60% data missingness in columns (variables) and rows (species).

    TableS1_2_3.csv — Life-history and ecological traits and macroecological patterns of snakes used in multiple imputation. This is a subset of the global serpentine reptile dataset, reduced to contain a maximum of 60% data missingness in columns (variables) and rows (species).

    TableS1_3.csv — Evaluation scores of random forest models fitted with life-history and ecological traits, macroecological patterns, life-history and ecological traits and macroecological patterns, and optimal subset of life-history and ecological traits and macroecological patterns.

    Table S1_4.csv — Predicted risk scores of unintentional transport, introduction, and establishment of frogs, lizards, and snakes.
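    The 60% missingness cap described for the frog, lizard, and snake subsets can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' R code: the function name `filter_missingness`, the toy trait values, and the order of filtering (columns first, then rows) are all hypothetical.

```python
def filter_missingness(traits, max_missing=0.6):
    """Drop variables, then species, whose share of missing values exceeds max_missing.

    traits maps a species name to a dict of trait values; None marks a missing value.
    Filtering columns before rows is an assumption made for this sketch.
    """
    species = list(traits)
    columns = list(next(iter(traits.values())))

    # Keep variables with at most max_missing of their values missing.
    kept_cols = [
        c for c in columns
        if sum(traits[s][c] is None for s in species) / len(species) <= max_missing
    ]
    if not kept_cols:
        return {}

    # Keep species with at most max_missing of the remaining variables missing.
    filtered = {}
    for s in species:
        n_missing = sum(traits[s][c] is None for c in kept_cols)
        if n_missing / len(kept_cols) <= max_missing:
            filtered[s] = {c: traits[s][c] for c in kept_cols}
    return filtered


# Hypothetical toy data, not taken from the dataset.
toy = {
    "frog_a": {"body_size": 30.0, "clutch": 200, "microhabitat": None},
    "frog_b": {"body_size": None, "clutch": None, "microhabitat": None},
    "frog_c": {"body_size": 45.0, "clutch": None, "microhabitat": None},
}
print(filter_missingness(toy))
```

    In this toy input only `body_size` survives the column filter, after which the species with no body size is dropped.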




  12. Data from: Biological traits of seabirds predict extinction risk and...

    • data.niaid.nih.gov
    • zenodo.org
    • +1 more
    zip
    Updated Mar 16, 2021
    Cite
    Cerren Richards; Robert Cooke; Amanda Bates (2021). Biological traits of seabirds predict extinction risk and vulnerability to anthropogenic threats [Dataset]. http://doi.org/10.5061/dryad.x69p8czhd
    Available download formats: zip
    Dataset updated
    Mar 16, 2021
    Dataset provided by
    University of Gothenburg
    Memorial University of Newfoundland
    Authors
    Cerren Richards; Robert Cooke; Amanda Bates
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Aim

    Seabirds are heavily threatened by anthropogenic activities and their conservation status is deteriorating rapidly. Yet, these pressures are unlikely to uniformly impact all species. It remains an open question if seabirds with similar ecological roles are responding similarly to human pressures. Here we aim to: 1) test whether threatened vs non-threatened seabirds are separated in trait space; 2) quantify the similarity of species’ roles (redundancy) per IUCN Red List Category; and 3) identify traits that render species vulnerable to anthropogenic threats.

    Location

    Global

    Time period

    Contemporary

    Major taxa studied

    Seabirds

    Methods

    We compile and impute eight traits that relate to species’ vulnerabilities and ecosystem functioning across 341 seabird species. Using these traits, we build a mixed-data PCA of species’ trait space. We quantify trait redundancy using the unique trait combinations (UTCs) approach. Finally, we employ a SIMPER analysis to identify which traits explain the greatest difference between threat groups.

    Results

    We find seabirds segregate in trait space based on threat status, indicating anthropogenic impacts are selectively removing large, long-lived, pelagic surface feeders with narrow habitat breadths. We further find that threatened species have higher trait redundancy, while non-threatened species have relatively limited redundancy. Finally, we find that species with narrow habitat breadths, fast reproductive speeds, and varied diets are more likely to be threatened by habitat-modifying processes (e.g., pollution and natural system modifications); whereas pelagic specialists with slow reproductive speeds and varied diets are vulnerable to threats that directly impact survival and fecundity (e.g., invasive species and biological resource use) and climate change. Species with no threats are non-pelagic specialists with invertebrate diets and fast reproductive speeds.

    Main conclusions

    Our results suggest both threatened and non-threatened species contribute unique ecological strategies. Consequently, conserving both threat groups, but with contrasting approaches may avoid potential changes in ecosystem functioning and stability.

    Methods

    Trait Selection and Data

    We compiled data from multiple databases for eight traits across all 341 extant species of seabirds. Here we recognise seabirds as those that feed at sea, either nearshore or offshore, but excluding marine ducks. These traits encompass the varying ecological and life history strategies of seabirds, and relate to ecosystem functioning and species’ vulnerabilities. We first extracted the trait data for body mass, clutch size, habitat breadth and diet guild from a recently compiled trait database for birds (Cooke, Bates, et al., 2019). Generation length and migration status were compiled from BirdLife International (datazone.birdlife.org), and pelagic specialism and foraging guild from Wilman et al. (2014). We further compiled clutch size information for 84 species through a literature search.

    Foraging and diet guild describe the most dominant foraging strategy and diet of the species. Wilman et al. (2014) assigned species a score from 0 to 100% for each foraging and diet guild based on their relative usage of a given category. Using these scores, species were classified into four foraging guild categories (diver, surface, ground, and generalist foragers) and three diet guild categories (omnivore, invertebrate, and vertebrate & scavenger diets). Each was assigned to a guild based on the predominant foraging strategy or diet (score > 50%). Species with category scores < 50% were classified as generalists for the foraging guild trait and omnivores for the diet guild trait. Body mass was measured in grams and was the median across multiple databases. Habitat breadth is the number of habitats listed as suitable by the International Union for Conservation of Nature (IUCN, iucnredlist.org). Generation length describes the mean age in years at which a species produces offspring. Clutch size is the number of eggs per clutch (the central tendency was recorded as the mean or mode). Migration status describes whether a species undertakes full migration (regular or seasonal cyclical movements beyond the breeding range, with predictable timing and destinations) or not. Pelagic specialism describes whether foraging is predominantly pelagic. To improve normality of the data, continuous traits, except clutch size, were log10 transformed.
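    The guild assignment rule above can be sketched in Python as follows. This is a minimal illustration rather than the authors' code: the function name `assign_guild` and the example scores are hypothetical, and a score of exactly 50% is treated as falling back to the generalist or omnivore label, a case the description does not specify.

```python
def assign_guild(scores, fallback):
    """Return the category scoring above 50 (percent), otherwise the fallback label."""
    top_category, top_score = max(scores.items(), key=lambda item: item[1])
    return top_category if top_score > 50 else fallback


# Hypothetical usage scores (percent) for one species.
foraging_scores = {"diver": 70, "surface": 20, "ground": 10}
diet_scores = {"invertebrate": 40, "vertebrate & scavenger": 40, "other": 20}

print(assign_guild(foraging_scores, fallback="generalist"))
print(assign_guild(diet_scores, fallback="omnivore"))
```

    The first species is assigned to the diver guild because one category dominates; the second has no category above 50% and so falls back to omnivore.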

    Multiple Imputation

    All traits had more than 80% coverage for our list of 341 seabird species, and body mass and habitat breadth had complete species coverage. To achieve complete species trait coverage, we imputed missing data for clutch size (4 species), generation length (1 species), diet guild (60 species), foraging guild (60 species), pelagic specialism (60 species) and migration status (3 species). The imputation approach has the advantage of increasing the sample size and consequently the statistical power of any analysis whilst reducing bias and error (Kim, Blomberg, & Pandolfi, 2018; Penone et al., 2014; Taugourdeau, Villerd, Plantureux, Huguenin-Elie, & Amiaud, 2014).

    We estimated missing values using random forest regression trees, a non-parametric imputation method, based on the ecological and phylogenetic relationships between species (Breiman, 2001; Stekhoven & Bühlmann, 2012). This method has high predictive accuracy and the capacity to deal with complexity in relationships including non-linearities and interactions (Cutler et al., 2007). To perform the random forest multiple imputations, we used the missForest function from package “missForest” (Stekhoven & Bühlmann, 2012). We imputed missing values based on the ecological (the trait data) and phylogenetic (the first 10 phylogenetic eigenvectors, detailed below) relationships between species. We generated 1,000 trees - a cautiously large number to increase predictive accuracy and prevent overfitting (Stekhoven & Bühlmann, 2012). We set the number of variables randomly sampled at each split (mtry) as the square-root of the number of variables included (10 phylogenetic eigenvectors, 8 traits; mtry = 4); a useful compromise between imputation error and computation time (Stekhoven & Bühlmann, 2012). We used a maximum of 20 iterations (maxiter = 20), to ensure the imputations finished due to the stopping criterion and not due to the limit of iterations (the imputed datasets generally finished after 4 – 10 iterations).
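    The mtry arithmetic and the role of the iteration cap can be sketched in Python as follows. This is an illustration only and does not call missForest: rounding the square root down is an assumption that happens to reproduce mtry = 4, the helper `iterations_until_stop` is hypothetical, and it assumes a stopping rule that halts the first time the change between successive imputations increases. The difference values are made up.

```python
import math

n_traits = 8
n_eigenvectors = 10

# Square root of 18 is about 4.24; rounding down (an assumption) gives 4.
mtry = math.isqrt(n_traits + n_eigenvectors)
print(mtry)


def iterations_until_stop(differences, maxiter=20):
    """Count iterations run before the change between imputations first increases.

    differences holds one hypothetical change measure per iteration.
    The count is capped at maxiter, mirroring an iteration limit.
    """
    previous = float("inf")
    for i, diff in enumerate(differences[:maxiter], start=1):
        if diff > previous:
            return i  # stopping criterion met before reaching the cap
        previous = diff
    return min(len(differences), maxiter)


# Made-up sequence: the change shrinks for four iterations, then grows.
print(iterations_until_stop([0.5, 0.2, 0.1, 0.05, 0.06]))
```

    With a cap of 20 and runs that stop well before it, the cap never determines the result, which is the point made in the description.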

    Due to the stochastic nature of the regression tree imputation approach, the estimated values will differ slightly each time. To capture this imputation uncertainty and to converge on a reliable result, we repeated the process 15 times, resulting in 15 trait datasets, which is suggested to be sufficient (González-Suárez, Zanchetta Ferreira, & Grilo, 2018; van Buuren & Groothuis-Oudshoorn, 2011). We took the mean values for continuous traits and modal values for categorical traits across the 15 datasets for subsequent analyses.
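    Combining the repeated imputations by mean and mode can be sketched in Python as follows. This is a minimal sketch, not the authors' R workflow: the function name `pool_imputations`, the species name, and the toy values are hypothetical, and only three imputed datasets are shown instead of 15.

```python
from statistics import mean, mode


def pool_imputations(datasets, continuous, categorical):
    """Average continuous traits and take the modal category across imputed datasets.

    datasets is a list of dicts mapping species -> {trait: value}.
    """
    pooled = {}
    for species in datasets[0]:
        pooled[species] = {}
        for trait in continuous:
            pooled[species][trait] = mean([d[species][trait] for d in datasets])
        for trait in categorical:
            pooled[species][trait] = mode([d[species][trait] for d in datasets])
    return pooled


# Hypothetical imputed values for a single species.
imputed = [
    {"species_x": {"clutch_size": 2.0, "diet_guild": "omnivore"}},
    {"species_x": {"clutch_size": 3.0, "diet_guild": "invertebrate"}},
    {"species_x": {"clutch_size": 4.0, "diet_guild": "omnivore"}},
]
pooled = pool_imputations(imputed, ["clutch_size"], ["diet_guild"])
print(pooled)
```

    Taking a single pooled value per trait yields one complete dataset for downstream analysis, at the cost of no longer carrying the between-imputation variability forward.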

    Phylogenetic data can improve the estimation of missing trait values in the imputation process (Kim et al., 2018; Swenson, 2014), because closely related species tend to be more similar to each other (Pagel, 1999) and many traits display high degrees of phylogenetic signal (Blomberg, Garland, & Ives, 2003). Phylogenetic information was summarised by eigenvectors extracted from a principal coordinate analysis, representing the variation in the phylogenetic distances among species (Jose Alexandre F. Diniz-Filho et al., 2012; José Alexandre Felizola Diniz-Filho, Rangel, Santos, & Bini, 2012). Bird phylogenetic distance data (Prum et al., 2015) were decomposed into a set of orthogonal phylogenetic eigenvectors using the Phylo2DirectedGraph and PEM.build functions from the “MPSEM” package (Guenard & Legendre, 2018). Here, we used the first 10 phylogenetic eigenvectors, which have previously been shown to minimise imputation error (Penone et al., 2014). These phylogenetic eigenvectors summarise major phylogenetic differences between species (Diniz-Filho et al., 2012) and captured 61% of the variation in the phylogenetic distances among seabirds. Still, these eigenvectors do not include fine-scale differences between species (Diniz-Filho et al., 2012), however the inclusion of many phylogenetic eigenvectors would dilute the ecological information contained in the traits, and could lead to excessive noise (Diniz-Filho et al., 2012; Peres‐Neto & Legendre, 2010). Thus, including the first 10 phylogenetic eigenvectors reduces imputation error and ensures a balance between including detailed phylogenetic information and diluting the information contained in the other traits.

    To quantify the average error in random forest predictions across the imputed datasets (out-of-bag error), we calculated the mean normalized root squared error and associated standard deviation across the 15 datasets for continuous traits (clutch size = 13.3 ± 0.35 %, generation length = 0.6 ± 0.02 %). For categorical data, we quantified the mean percentage of traits falsely classified (diet guild = 28.6 ± 0.97 %, foraging guild = 18.0 ± 1.05 %, pelagic specialism = 11.2 ± 0.66 %, migration status = 18.8 ± 0.58 %). Since body mass and habitat breadth have complete trait coverage, they did not require imputation. Low imputation accuracy is reflected in high out-of-bag error values where diet guild had the lowest imputation accuracy with 28.6% wrongly classified on average. Diet is generally difficult to predict (Gainsbury, Tallowin, & Meiri, 2018), potentially due to species’ high dietary plasticity (Gaglio, Cook, McInnes, Sherley, & Ryan, 2018) and/or the low phylogenetic conservatism of diet (Gainsbury et al., 2018). With this caveat in mind, we chose dietary guild, as more coarse dietary classifications are more

  13. Data from: Safety of the anterior approach versus the lateral approach for...

    • zenodo.org
    • datadryad.org
    csv
    Updated Jun 3, 2022
    Cite
    Akihiro Shiroshita; Akihiro Shiroshita (2022). Safety of the anterior approach versus the lateral approach for chest tube insertion by residents treating spontaneous pneumothorax: a propensity score weighted analysis [Dataset]. http://doi.org/10.5061/dryad.v15dv41t6
    Available download formats: csv
    Dataset updated
    Jun 3, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Akihiro Shiroshita; Akihiro Shiroshita
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Background: Chest tube malposition is the most common complication during chest tube insertion. This study aimed to compare the risk of chest tube malposition between the anterior and lateral approaches for thoracostomy performed by junior and senior residents.

    Methods: This retrospective study included patients aged ≥ 20 years who exhibited primary or secondary spontaneous pneumothorax without pleural adhesion and underwent chest tube drainage performed by junior or senior residents. The study exposure involved the insertion of the chest tube in the midclavicular line (anterior approach) or the anterior or midaxillary line (lateral approach). The primary outcome was the number of malpositioned chest tubes. Multiple imputation was used for missing data. The propensity score within each imputed dataset was calculated by using the collected variables. The inverse probability of treatment weighting (IPTW) method was used to adjust for baseline confounders.

    Results: IPTW analysis revealed that the estimated odds ratio for chest tube malposition in the anterior approach group (n = 34) versus the lateral approach group (n = 219) was 0.61 (95% confidence interval, 0.17–2.11).

    Conclusion: In patients treated for primary or secondary pneumothorax by junior or senior residents, the risk of chest tube malposition was not significantly different between the anterior and lateral approach for thoracostomy.
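
The weighting step named in the Methods can be sketched in Python as follows. This is an illustration only, not the study's code: it assumes the propensity scores have already been estimated within one imputed dataset, and the function name `iptw_weights` and the toy values are hypothetical.

```python
def iptw_weights(treated, propensity):
    """Return inverse probability of treatment weights.

    treated    -- iterable of 0/1 indicators (1 = exposed group)
    propensity -- iterable of estimated P(treated = 1 | covariates)

    Exposed units get weight 1/e, unexposed units get 1/(1 - e).
    """
    weights = []
    for t, e in zip(treated, propensity):
        if not 0.0 < e < 1.0:
            raise ValueError("propensity scores must lie strictly between 0 and 1")
        weights.append(1.0 / e if t else 1.0 / (1.0 - e))
    return weights


# Toy values (hypothetical): one exposed and one unexposed patient.
example = iptw_weights([1, 0], [0.25, 0.25])
```

With multiply imputed data, the weights would presumably be recomputed in each imputed dataset and the weighted estimates combined afterwards.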

  14. Data from: Testing hypotheses of marsupial brain size variation using...

    • dataone.org
    • datadryad.org
    Updated Apr 21, 2025
    Cite
    Orlin S. Todorov (2025). Testing hypotheses of marsupial brain size variation using phylogenetic multiple imputations and a Bayesian comparative framework [Dataset]. http://doi.org/10.5061/dryad.jh9w0vt9h
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Orlin S. Todorov
    Time period covered
    Aug 9, 2022
    Description

    Considerable controversy exists about which hypotheses and variables best explain mammalian brain size variation. We use a new, high-coverage dataset of marsupial brain and body sizes, and the first phylogenetically imputed full datasets of 16 predictor variables, to model the prevalent hypotheses explaining brain size evolution using phylogenetically corrected Bayesian generalised linear mixed-effects modelling. Despite this comprehensive analysis, litter size emerges as the only significant predictor. Marsupials differ from the more frequently studied placentals in displaying much lower diversity of reproductive traits, which are known to interact extensively with many behavioural and ecological predictors of brain size. Our results therefore suggest that studies of relative brain size evolution in placental mammals may require targeted co-analysis or adjustment of reproductive parameters like litter size, weaning age, or gestation length. This supports suggestions that significant as...

  15. Tests of attitude valence.

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated Nov 10, 2015
    Cite
    Schofield, Timothy P.; Butterworth, Peter (2015). Tests of attitude valence. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001858094
    Dataset updated
    Nov 10, 2015
    Authors
    Schofield, Timothy P.; Butterworth, Peter
    Description

    Outcomes of sample-weighted one-sample t-tests against a value of 3 (indicating a neutral response). Cohen’s d is reported as a measure of effect size. Comparison confidence intervals are presented for an imputed data set (assumption of missing at random) using sample-weighted multiple imputation. Imputation was performed in SPSS v22 using MCMC with seed set to 3319607, a maximum of 10 iterations, and 5 imputed datasets. *** denotes that the effect is significant at p < .001; ** denotes that the effect is significant at p < .01.
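
To make the test in the description concrete, here is a simplified Python sketch of a one-sample t statistic and Cohen's d against a neutral value of 3. It is unweighted, whereas the dataset's analysis used sample weights in SPSS. The function name and the toy ratings are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t_and_d(values, mu=3.0):
    """Return (t statistic, Cohen's d) for a one-sample test against mu.

    Unweighted simplification: d = (mean - mu) / sd, t = d * sqrt(n).
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    sd = stdev(values)  # sample standard deviation (n - 1 denominator)
    d = (mean(values) - mu) / sd
    return d * sqrt(n), d


# Toy attitude ratings on a 1-5 scale (hypothetical data).
t_stat, cohen_d = one_sample_t_and_d([4, 4, 5, 3, 4])
```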

  16. Data from: Elucidating Age and Sex-Dependent Association Between Frontal EEG...

    • tandf.figshare.com
    pdf
    Updated Jun 3, 2023
    Cite
    Adam Ciarleglio; Eva Petkova; Ofer Harel (2023). Elucidating Age and Sex-Dependent Association Between Frontal EEG Asymmetry and Depression: An Application of Multiple Imputation in Functional Regression [Dataset]. http://doi.org/10.6084/m9.figshare.14779490.v2
    Available download formats: pdf
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Adam Ciarleglio; Eva Petkova; Ofer Harel
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Frontal power asymmetry (FA), a measure of brain function derived from electroencephalography, is a potential biomarker for major depressive disorder (MDD). Though FA is functional in nature, it is typically reduced to a scalar value prior to analysis, possibly obscuring its relationship with MDD and leading to a number of studies that have provided contradictory results. To overcome this issue, we sought to fit a functional regression model to characterize the association between FA and MDD status, adjusting for age, sex, cognitive ability, and handedness using data from a large clinical study that included both MDD and healthy control (HC) subjects. Since nearly 40% of the observations are missing data on either FA or cognitive ability, we propose an extension of multiple imputation (MI) by chained equations that allows for the imputation of both scalar and functional data. We also propose an extension of Rubin’s Rules for conducting valid inference in this setting. The proposed methods are evaluated in a simulation and applied to our FA data. For our FA data, a pooled analysis from the imputed datasets yielded similar results to those of the complete case analysis. We found that, among young females, HCs tended to have higher FA over the θ, α, and β frequency bands, but that the difference between HC and MDD subjects diminishes and ultimately reverses with age. For males, HCs tended to have higher FA in the β frequency band, regardless of age. Young male HCs had higher FA in the θ and α bands, but this difference diminishes with increasing age in the α band and ultimately reverses with increasing age in the θ band. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
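
For orientation, the standard scalar form of Rubin's Rules, which the article above extends, can be sketched in Python as below. The extension for functional data proposed by the authors is not shown. The function name `pool_rubin` and the toy inputs are hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors.

    Returns (pooled estimate, total variance), where
    total variance = within + (1 + 1/m) * between.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching lists from at least two imputed datasets")
    pooled = mean(estimates)
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (m - 1 denominator)
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total


# Toy inputs (hypothetical): three imputed datasets.
pooled_estimate, total_variance = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```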

  17. Replication Data for: Pitfalls in the Study of Democratization: Testing the...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Welzel, Christian; Inglehart, Ronald; Kruse, Stefan (2023). Replication Data for: Pitfalls in the Study of Democratization: Testing the Emancipatory Theory of Democratization [Dataset]. http://doi.org/10.7910/DVN/TSJXGH
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Welzel, Christian; Inglehart, Ronald; Kruse, Stefan
    Description

    The Multiple Imputation dataset contains the original dataset and five imputed rectangular datasets.

  18. Data from: On Combining Reference Data to Improve Imputation Accuracy

    • datasetcatalog.nlm.nih.gov
    Updated Jan 30, 2013
    Cite
    Pei, Yu-Fang; Li, Jian; Deng, Hong-Wen; Zhang, Ji-Gang; Chen, Jun (2013). On Combining Reference Data to Improve Imputation Accuracy [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001725356
    Dataset updated
    Jan 30, 2013
    Authors
    Pei, Yu-Fang; Li, Jian; Deng, Hong-Wen; Zhang, Ji-Gang; Chen, Jun
    Description

    Genotype imputation is an important tool in human genetics studies. It uses reference sets with known genotypes, together with prior knowledge of linkage disequilibrium and recombination rates, to infer untyped alleles for human genetic variations at low cost. The reference sets used by current imputation approaches are based on HapMap data and/or on recently available next-generation sequencing (NGS) data, such as data generated by the 1000 Genomes Project. However, because coverage and call rates differ across NGS data sets, integrating NGS data sets of different accuracy with previously available reference data as references in imputation is not an easy task and has not been systematically investigated. In this study, we performed a comprehensive assessment of three strategies for using NGS data and previously available reference data in genotype imputation, for both simulated and empirical data, in order to obtain guidelines for optimal reference set construction. Briefly, we considered three strategies: strategy 1 uses one NGS data set as a reference; strategy 2 imputes samples by using multiple individual data sets of different accuracy as independent references and then combines the imputed samples with samples based on the high accuracy reference selected when overlapping occurs; and strategy 3 combines multiple available data sets as a single reference after imputing each other. We used three software packages (MACH, IMPUTE2 and BEAGLE) to assess the performance of these three strategies. Our results show that strategy 2 and strategy 3 have higher imputation accuracy than strategy 1. In particular, strategy 2 is the best strategy across all the conditions that we investigated, producing the best imputation accuracy for rare variants. Our study is helpful in guiding the application of imputation methods in next-generation association analyses.

  19. Data from: Evaluation of Gender Violence and Harassment Prevention Programs...

    • catalog.data.gov
    • icpsr.umich.edu
    Updated Nov 14, 2025
    + more versions
    Cite
    National Institute of Justice (2025). Evaluation of Gender Violence and Harassment Prevention Programs in Middle Schools in Cleveland, Ohio, 2006-2007 [United States] [Dataset]. https://catalog.data.gov/dataset/evaluation-of-gender-violence-and-harassment-prevention-programs-in-middle-schools-in-clev-0e0df
    Dataset updated
    Nov 14, 2025
    Dataset provided by
    National Institute of Justice
    Area covered
    Cleveland, Ohio, United States
    Description

    The study was designed to help increase the capacity of programs to prevent gender violence and harassment (GV/H) among middle school youth. The long-term goal of the study was to help prevent intimate partner violence, sexual violence, and sexual harassment by employing rigorous methods to evaluate strategies for altering violence-supportive attitudes and norms of youth. Specifically, the study was structured to evaluate the relative effectiveness of common approaches to youth GV/H prevention programming (in terms of knowledge, attitudes, intended behavior, behavior, and emotional safety of youth participants) for one of the youngest populations ever studied in this area. In a longitudinal randomized controlled trial study, two five-lesson curricula were created to address gender violence and harassment (GV/H) in middle schools, and classrooms were assigned randomly to treatment and control groups. Treatment 1 was an interaction-based curriculum focused on the setting and communication of boundaries in relationships, the determination of wanted and unwanted behaviors, and the role of the bystander as intervener. Treatment 2 was a law and justice curriculum focused on laws, definitions, information, and data about penalties for sexual assault and sexual harassment. The control group did not receive either treatment. Pencil-and-paper surveys were designed for students to complete, and were administered either by a member of the research team or by teachers who were trained by a member of the research team in proper administration processes. Data were collected from three inner-ring suburbs of Cleveland, Ohio, from November 2006 to May 2007. Surveys were distributed at three different times: immediately before the assignment to one of the three study conditions, immediately after the treatment (or control condition) was completed, and 5-6 months after their assignment to one of the three study conditions. 
    The data contain responses for 1,507 students over 3 waves. Additionally, researchers used multiple imputation for this dataset, which produced 5 imputed datasets for each record, for a total of 7,535 cases in the data file. The data have 697 variables, including responses to questions such as whether someone had ever, or in the past 6 months, done something to the respondent such as slapped or scratched the respondent, hit the respondent, or threatened the respondent. Additionally, respondents were asked if they had done these same actions to someone else. Respondents were also asked a series of questions regarding whether they had ever been sexually harassed by someone or if they had sexually harassed someone themselves. Next, respondents were asked to rate whether they agreed with a series of statements such as "It is all right for a girl to ask a boy out on a date", "If you ignore sexual harassment, more than likely it will stop", and "Making sexual comments to a girl is wrong". Students were then asked to indicate whether a series of statements were true or false, such as "If two kids who are both under the age of 16 have sex, it is not against the law" and "If a person is not physically harming someone, then they are not really abusive". Respondents were then asked to read three scenarios and indicate how they would respond in each scenario. Also, students indicated how likely they would be to react in specified ways to a prepared statement. Data also provide demographic information such as age, gender, and ethnic/racial background, as well as variables to generically identify school district, school, and class period.

  20. Open dataset for: "Reflecting on existential threats elicits negative affect...

    • data.mendeley.com
    Updated Dec 11, 2019
    + more versions
    Cite
    Eefje Poppelaars (2019). Open dataset for: "Reflecting on existential threats elicits negative affect but no physiological arousal" [Dataset]. http://doi.org/10.17632/s4fh846kjk.3
    Dataset updated
    Dec 11, 2019
    Authors
    Eefje Poppelaars
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Open data and R analysis scripts for the paper as submitted for publication: "Poppelaars, E. S., Klackl, J., Scheepers, DT, Mühlberger, C., & Jonas, E. (2019). Reflecting on existential threats elicits negative affect but no physiological arousal."

    A sample of 171 undergraduate students was randomly allocated to one of four existential threat conditions (mortality salience, freedom restriction, uncontrollability, or uncertainty), to the non-existential threat condition (social-evaluative threat), or to a control condition (TV salience). Three facets of arousal were measured: positive and negative affect before and after reflection, subjective arousal during baseline and reflection, and physiological activation during baseline and reflection (electrodermal, cardiovascular, and respiratory). Personality traits (e.g. trait avoidance and approach, self-esteem) were also measured.

    Description of files:
    - File 'README.txt' contains the description of the files (metadata).
    - File '20191024_IJMData_brief.sav' contains the raw data.
    - File 'EXI.outl.del.RData' contains the complete dataset with missing values, with extra variables calculated, and with outliers deleted.
    - File 'Codebook_EXI.outl.del.csv' contains a description of all variables in the 'EXI.outl.del.RData' file (metadata).
    - Files 'EXI.outl.del.imp.RData' and 'EXI.outl.del.imp.extra.RData' contain multiple imputed datasets (without missing values) that can be used to reproduce results from the paper.
    - File '01_CalculationOfData.R' is an R analysis script that imports the raw data, calculates new variables, and imputes missing data via multiple imputation using the 'predictorMatrixAdj.xlsx' file.
    - File '02_AnalysisOfImputedData.R' is an R analysis script that calculates descriptive statistics, creates plots, and tests hypotheses using t-tests, Bayesian statistics, and multiple linear regressions. It also uses the custom functions 'BF.evidence.R', 'cohen.d.magnitude.R' and 'p.value.sig.R'.

Vahid Nassiri; Geert Molenberghs; Geert Verbeke; João Barbosa-Breda (2023). Iterative Multiple Imputation: A Framework to Determine the Number of Imputed Datasets [Dataset]. http://doi.org/10.6084/m9.figshare.7445375.v2
21 scholarly articles cite this dataset (View in Google Scholar)
