CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Overcoming bias due to confounding and missing data is challenging when analysing observational data. Propensity scores are commonly used to account for the first problem and multiple imputation for the latter. Unfortunately, it is not known how best to proceed when both techniques are required. We investigate whether two different approaches to combining propensity scores and multiple imputation (Across and Within) lead to differences in the accuracy or precision of exposure effect estimates. Both approaches start by imputing missing values multiple times. Propensity scores are then estimated for each resulting dataset. Using the Across approach, the mean propensity score across imputations for each subject is used in a single subsequent analysis. Alternatively, the Within approach uses propensity scores individually to obtain exposure effect estimates in each imputation, which are combined to produce an overall estimate. These approaches were compared in a series of Monte Carlo simulations and applied to data from the British Society for Rheumatology Biologics Register. Results indicated that the Within approach produced unbiased estimates with appropriate confidence intervals, whereas the Across approach produced biased results and unrealistic confidence intervals. Researchers are encouraged to implement the Within approach when conducting propensity score analyses with incomplete data.
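The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the propensity scores are taken as already estimated in each imputed dataset, exposure and outcome are assumed fully observed, and an inverse-probability-weighted estimator stands in for whichever propensity score analysis is actually used (the abstract does not specify one). All function names and data values are hypothetical.

```python
from statistics import mean

def ipw_effect(treated, outcome, ps):
    """Inverse-probability-weighted difference in mean outcome (assumed estimator)."""
    w1 = [t / p for t, p in zip(treated, ps)]
    w0 = [(1 - t) / (1 - p) for t, p in zip(treated, ps)]
    mu1 = sum(w * y for w, y in zip(w1, outcome)) / sum(w1)
    mu0 = sum(w * y for w, y in zip(w0, outcome)) / sum(w0)
    return mu1 - mu0

def within_estimate(treated, outcome, ps_by_imputation):
    # Within: one effect estimate per imputed dataset, then pool the estimates.
    return mean(ipw_effect(treated, outcome, ps) for ps in ps_by_imputation)

def across_estimate(treated, outcome, ps_by_imputation):
    # Across: average each subject's propensity score over the imputations,
    # then run a single analysis with the averaged scores.
    mean_ps = [mean(scores) for scores in zip(*ps_by_imputation)]
    return ipw_effect(treated, outcome, mean_ps)

# Hypothetical data: 6 subjects, propensity scores from 3 imputed datasets.
treated = [1, 1, 0, 0, 1, 0]
outcome = [3.0, 2.5, 1.0, 1.5, 4.0, 2.0]
ps_by_imputation = [
    [0.6, 0.7, 0.3, 0.4, 0.8, 0.2],
    [0.5, 0.8, 0.4, 0.3, 0.7, 0.3],
    [0.7, 0.6, 0.2, 0.5, 0.9, 0.2],
]

print("Within:", within_estimate(treated, outcome, ps_by_imputation))
print("Across:", across_estimate(treated, outcome, ps_by_imputation))
```

The sketch pools only point estimates. A full Within analysis would also combine the per-imputation variances to obtain a confidence interval.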
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Penalized regression methods are used in many biomedical applications for variable selection and simultaneous coefficient estimation. However, missing data complicates the implementation of these methods, particularly when missingness is handled using multiple imputation. Applying a variable selection algorithm on each imputed dataset will likely lead to different sets of selected predictors. This article considers a general class of penalized objective functions which, by construction, force selection of the same variables across imputed datasets. By pooling objective functions across imputations, optimization is then performed jointly over all imputed datasets rather than separately for each dataset. We consider two objective function formulations that exist in the literature, which we will refer to as “stacked” and “grouped” objective functions. Building on existing work, we (i) derive and implement efficient cyclic coordinate descent and majorization-minimization optimization algorithms for continuous and binary outcome data, (ii) incorporate adaptive shrinkage penalties, (iii) compare these methods through simulation, and (iv) develop an R package miselect. Simulations demonstrate that the “stacked” approaches are more computationally efficient and have better estimation and selection properties. We apply these methods to data from the University of Michigan ALS Patients Biorepository aiming to identify the association between environmental pollutants and ALS risk. Supplementary materials for this article are available online.
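A minimal sketch of the stacked idea for a continuous outcome follows. It is not the miselect implementation: it assumes centered data with no intercept, gives every stacked row a weight of 1/(M·n), and uses a plain lasso penalty with a hand-written coordinate descent. The function names and the toy data are hypothetical. Because a single coefficient vector is fitted to all imputed datasets at once, a variable is either selected in all of them or in none.

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator used by the lasso coordinate update."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def stacked_lasso(imputed_X, y, lam, n_sweeps=50):
    """Lasso fitted jointly to M imputed copies of the covariates.

    imputed_X: list of M datasets, each a list of n rows with p features.
    y: list of n outcomes (assumed fully observed).
    Minimizes 0.5 * sum_rows w * (y - x.beta)^2 + lam * ||beta||_1 over the
    stacked rows, with w = 1 / (M * n).
    """
    M, n, p = len(imputed_X), len(y), len(imputed_X[0][0])
    rows = [(x, yi) for X in imputed_X for x, yi in zip(X, y)]
    w = 1.0 / (M * n)
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            rho, curv = 0.0, 0.0
            for x, yi in rows:
                # Prediction that leaves out coordinate j (partial residual).
                partial = sum(x[k] * beta[k] for k in range(p) if k != j)
                rho += w * x[j] * (yi - partial)
                curv += w * x[j] * x[j]
            beta[j] = soft_threshold(rho, lam) / curv if curv > 0 else 0.0
    return beta

# Toy data: the outcome depends only on the first covariate; the last
# subject's second covariate is "missing" and imputed as -1.0 and -0.5.
y = [-2.0, 2.0, -2.0, 2.0]
imputed_X = [
    [[-1.0, 1.0], [1.0, 1.0], [-1.0, -1.0], [1.0, -1.0]],
    [[-1.0, 1.0], [1.0, 1.0], [-1.0, -1.0], [1.0, -0.5]],
]
beta = stacked_lasso(imputed_X, y, lam=0.1)
print(beta)  # one shared coefficient vector for both imputed datasets
```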
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Empirical data analyses often require complete data sets. Therefore, when data sets are incompletely observed, methods that generate plausible values (imputations) for the unobserved data are attractive. The idea is to then analyze the completed data set in an easy way. Thus, various imputation techniques have been proposed and evaluated. Popular measures used for evaluating these techniques are based on distances between true and imputed values applied in simulation studies. In this paper we show through a theoretical example and a simulation study that these measures may be misleading: From the fact that they are zero if all the imputed values were equal to the true but unobserved values and are usually larger than zero otherwise, it does not follow that the smaller the value of such a measure, the 'closer' the inference based on the imputed data set to the inference based on the complete data set without missing values. Moreover, since these measures are usually only applied in simulations, corresponding findings cannot be generalized.
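A small simulation can illustrate the point. It is not the paper's own example: the standard normal data, the 50% missing-completely-at-random mechanism, and the two imputation methods compared (filling in the observed mean versus drawing from a normal distribution fitted to the observed values) are all assumptions chosen for simplicity.

```python
import random
import statistics

random.seed(1)
n = 20000
truth = [random.gauss(0.0, 1.0) for _ in range(n)]
missing = [random.random() < 0.5 for _ in range(n)]  # assumed MCAR, 50%

observed = [x for x, m in zip(truth, missing) if not m]
mu = statistics.mean(observed)
sd = statistics.stdev(observed)

# Method A: replace every missing value with the observed mean.
mean_imputed = [mu if m else x for x, m in zip(truth, missing)]
# Method B: replace every missing value with a random draw.
draw_imputed = [random.gauss(mu, sd) if m else x for x, m in zip(truth, missing)]

def rmse(completed):
    """Root mean squared distance between imputed and true values."""
    sq = [(c - t) ** 2 for c, t, m in zip(completed, truth, missing) if m]
    return (sum(sq) / len(sq)) ** 0.5

rmse_mean, rmse_draw = rmse(mean_imputed), rmse(draw_imputed)
var_true = statistics.variance(truth)
var_mean = statistics.variance(mean_imputed)
var_draw = statistics.variance(draw_imputed)

print("RMSE      mean fill:", rmse_mean, " random draw:", rmse_draw)
print("variance  truth:", var_true, " mean fill:", var_mean, " random draw:", var_draw)
```

In this setup mean filling has the smaller distance to the true values, yet the variance estimated from its completed data is further from the complete-data variance than the one obtained with random draws.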
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.7910/DVN/HRM8EQ
The variables in these datasets are described in the paper. Note that the "base" file is the file that served as the base file for multiple imputation. The five imputed files are labeled with imp as the file stub. These imputed datasets were created with Amelia, as described in the paper. Variable names in the imputed datasets correspond to those in the base file. The exceptions are variables ending in _cent and variables that have an "X" in the name. The "_cent" indicates centered versions of variables created within the imputed dataset. Variables with an "X" are multiplicative interactions. There are a couple of variables that might be unclear, since they are factor scores. However, the creation of those variables and their meaning should be clear in the context of the paper and the datasets. If you have questions or need help with using these datasets, feel free to email me at Nathan.J.Kelly@gmail.com. I'll do my best to respond quickly. Also, since NES is the originator of these data, please cite the NES and its funding sources if you use these data.
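For readers who fit the same model to each of the five imputed files, the results are usually combined with Rubin's rules. The sketch below is a generic illustration rather than the paper's code; the coefficient estimates and standard errors are made-up numbers.

```python
from statistics import mean, variance

def pool_rubin(estimates, std_errors):
    """Combine one coefficient across m imputed datasets with Rubin's rules."""
    m = len(estimates)
    pooled = mean(estimates)                 # pooled point estimate
    within = mean(se ** 2 for se in std_errors)  # average within-imputation variance
    between = variance(estimates)            # between-imputation variance
    total = within + (1 + 1 / m) * between   # total variance
    return pooled, total ** 0.5

# Hypothetical results for one coefficient from the five imputed files.
estimates = [0.42, 0.45, 0.40, 0.44, 0.43]
std_errors = [0.10, 0.11, 0.10, 0.12, 0.10]
pooled, pooled_se = pool_rubin(estimates, std_errors)
print(pooled, pooled_se)
```

The pooled standard error is never smaller than the square root of the average within-imputation variance, since the between-imputation term only adds uncertainty.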
Many enduring questions in international relations theory focus on power relations, so it is important that scholars have a good measure of relative power. The standard measure of relative military power, the capability ratio, is barely better than random guessing at predicting militarized dispute outcomes. We use machine learning to build a superior proxy, the Dispute Outcome Expectations score, from the same underlying data. Our measure is an order of magnitude better than the capability ratio at predicting dispute outcomes. We replicate Reed et al. (2008) and find, contrary to the original conclusions, that the probability of conflict is always highest when the state with the least benefits has a preponderance of power. In replications of 18 other dyadic analyses that use power as a control, we find that replacing the standard measure with DOE scores usually improves both in-sample and out-of-sample goodness of fit. Note: This analysis involves many layers of computation: multiple imputation of the underlying data, creation of an ensemble of machine learning models on the imputed datasets, predictions from that ensemble, and replications of previous studies using those predictions. Our replication code sets seeds in any script where random numbers are drawn, and runs in a Docker environment to ensure identical package versions across machines. Nevertheless, because of differences in machine precision and floating point computations across CPUs, the replication code may not produce results identical to those in the paper. Any differences should be small in magnitude and should not affect any substantive conclusions of the analysis.
The main dataset contains data from a cross-national survey conducted with local and regional politicians in several European countries in 2015. The data were collected by the researcher individually. The main dataset is provided in .csv format. The dataset was primarily used for drafting the paper "Subnational views on multilevel governance". An accompanying R script shows data preparation, analysis, simulations and plotting. Multiple imputation was used to mend item non-response: results may vary slightly in different iterations. For the replication of exact results as reported in the paper, 5 imputed datasets are also provided.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
analyze the national health interview survey (nhis) with r
the national health interview survey (nhis) is a household survey about health status and utilization. each annual data set can be used to examine the disease burden and access to care that individuals and families are currently experiencing across the country. check out the wikipedia article (ohh hayy i wrote that) for more detail about its current and potential uses. if you're cooking up a health-related analysis that doesn't need medical expenditures or monthly health insurance coverage, look at nhis before the medical expenditure panel survey (its sample is twice as big). the centers for disease control and prevention (cdc) has been keeping nhis real since 1957, and the scripts below automate the download, importation, and analysis of every file back to 1963. what happened in 1997, you ask? scientists cloned dolly the sheep, clinton started his second term, and the national health interview survey underwent its most recent major questionnaire re-design.
here's how all the moving parts work:
a person-level file (personsx) that merges onto other files using unique household (hhx), family (fmx), and person (fpx) identifiers. [note to data historians: prior to 2004, person number was (px) and unique within each household.] this file includes the complex sample survey variables needed to construct a taylor-series linearization design, and should be used if your analysis doesn't require variables from the sample adult or sample child files. this survey setup generalizes to the noninstitutional, non-active duty military population.
a family-level file that merges onto other files using unique household (hhx) and family (fmx) identifiers.
a household-level file that merges onto other files using the unique household (hhx) identifier.
a sample adult file that includes questions asked of only one adult within each household (selected at random) - a subset of the main person-level file. hhx, fmx, and fpx identifiers will merge with each of the files above, but since not every adult gets asked these questions, this file contains its own set of weights: wtfa_sa instead of wtfa. you can merge on whatever other variables you need from the three files above, but if your analysis requires any variables from the sample adult questionnaire, you can't use records in the person-level file that aren't also in the sample adult file (a big sample size cut). this survey setup generalizes to the noninstitutional, non-active duty military adult population.
a sample child file that includes questions asked of only one child within each household (if available, and also selected at random) - another subset of the main person-level file. same deal as the sample adult description, except use wtfa_sc instead of wtfa. oh yeah, and this one generalizes to the child population.
five imputed income files. if you want income and/or poverty variables incorporated into any part of your analysis, you'll need these puppies. the replication example below uses these, but if that's impenetrable, post in the comments describing where you get stuck.
some injury stuff and other miscellanea that varies by year. if anyone uses this, please share your experience.
if you use anything more than the personsx file alone, you'll need to merge some tables together. make sure you understand the difference between setting the parameter all = TRUE versus all = FALSE -- not everyone in the personsx file has a record in the samadult and samchild files.
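to see what the all = TRUE versus all = FALSE choice does, here is a tiny python sketch (the scripts themselves are in r, so this only mirrors the idea). the records and weights are invented, and the merge function is a hypothetical stand-in for r's merge: all = TRUE keeps every record from either file, all = FALSE keeps only records found in both.

```python
# hypothetical records keyed by (hhx, fmx, fpx)
personsx = {
    ("0001", "01", "01"): {"wtfa": 3200},
    ("0001", "01", "02"): {"wtfa": 2900},
    ("0002", "01", "01"): {"wtfa": 4100},
}
samadult = {
    ("0001", "01", "01"): {"wtfa_sa": 7800},
    ("0002", "01", "01"): {"wtfa_sa": 9100},
}

def merge(left, right, all_records):
    """stand-in for r's merge(): all_records=True is an outer join,
    all_records=False is an inner join on the shared key."""
    keys = set(left) | set(right) if all_records else set(left) & set(right)
    return {k: {**left.get(k, {}), **right.get(k, {})} for k in sorted(keys)}

inner = merge(personsx, samadult, all_records=False)
outer = merge(personsx, samadult, all_records=True)

print(len(inner), "records with all = FALSE")
print(len(outer), "records with all = TRUE")
# the person who was not the sample adult has no sample adult weight
print(outer[("0001", "01", "02")])
```

with all = FALSE the second person in household 0001 drops out, which is the sample size cut described above. with all = TRUE that person stays but has no wtfa_sa.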
this new github repository contains four scripts:
1963-2011 - download all microdata.R
  loop through every year and download every file hosted on the cdc's nhis ftp site
  import each file into r with SAScii
  save each file as an r data file (.rda)
  download all the documentation into the year-specific directory
2011 personsx - analyze.R
  load the r data file (.rda) created by the download script (above)
  set up a taylor-series linearization survey design outlined on page 6 of this survey document
  perform a smattering of analysis examples
2011 personsx plus samadult with multiple imputation - analyze.R
  load the personsx and samadult r data files (.rda) created by the download script (above)
  merge the personsx and samadult files, highlighting how to conduct analyses that need both
  create tandem survey designs for both personsx-only and merged personsx-samadult files
  perform just a touch of analysis examples
  load and loop through the five imputed income files, tack them onto the personsx-samadult file
  conduct a poverty recode or two
  analyze the multiply-imputed survey design object, just like mom used to
replicate cdc tecdoc - 2000 multiple imputation.R
  download and import the nhis 2000 personsx and imputed income files, using SAScii and this imputed income sas importation script (no longer hosted on the cdc's nhis ftp site).
  loop through each of the five imputed income files, merging each to the personsx file and performing the same set of...
https://www.icpsr.umich.edu/web/ICPSR/studies/36275/terms
The Consumer Expenditure Survey (CE) program provides a continuous and comprehensive flow of data on the buying habits of American consumers, including data on their expenditures, income, and consumer unit (families and single consumers) characteristics. These data are used widely in economic research and analysis, and in support of revisions of the Consumer Price Index. The CE program is comprised of two separate components (each with its own survey questionnaire and independent sample), the Diary Survey and the quarterly Interview Survey (ICPSR 36237). This data collection contains the Diary Survey component, which was designed to obtain data on frequently purchased smaller items, including food, housing, apparel and services, transportation, entertainment, and out-of-pocket health care costs. Each consumer unit (CU) recorded its expenditures in a diary for two consecutive 1-week periods. Although the diary was designed to collect information on expenditures that could not be easily recalled over time, respondents were asked to report all expenses (except overnight travel) that the CU incurred during the survey week. The 2013 Diary Survey release contains five sets of data files (FMLD, MEMD, EXPD, DTBD, DTID), and one processing file (DSTUB). The FMLD, MEMD, EXPD, DTBD, and DTID files are organized by the quarter of the calendar year in which the data were collected. There are four quarterly datasets for each of these files. The FMLD files contain CU characteristics, income, and summary level expenditures; the MEMD files contain member characteristics and income data; the EXPD files contain detailed weekly expenditures at the Universal Classification Code (UCC) level; the DTBD files contain the CU's reported annual income values or the mean of the five imputed income values in the multiple imputation method; and the DTID files contain the five imputed income values. 
Please note that the summary level expenditure and income information on the FMLD files permit the data user to link consumer spending, by general expenditure category, and household characteristics and demographics on one set of files. The DSTUB file provides the aggregation scheme used in the published consumer expenditure tables. The DSTUB file is further explained in Section III.F.6. "Processing Files" of the Diary Survey Users' Guide. A second documentation guide, the "Users' Guide to Income Imputation," includes information on how to appropriately use the imputed income data. Demographic and family characteristics data include age, sex, race, marital status, and CU relationships for each CU member. Income information was also collected, such as wage, salary, unemployment compensation, child support, and alimony, as well as information on the employment of each CU member age 14 and over. The unpublished integrated CE data tables produced by the BLS are available to download through NADAC (click on "Other" in the Dataset(s) section). The tables show average and percentile expenditures for detailed items, as well as the standard error and coefficient of variation (CV) for each spending estimate. The BLS unpublished integrated CE data tables are provided as an easy-to-use tool for obtaining spending estimates. However, users are cautioned to read the BLS explanatory letter accompanying the tables. The letter explains that estimates of average expenditures on detailed spending items (such as leisure and art-related categories) may be unreliable due to so few reports of expenditures for those items.
The study was designed to help increase the capacity of programs to prevent gender violence and harassment (GV/H) among middle school youth. The long-term goal of the study was to help prevent intimate partner violence, sexual violence, and sexual harassment by employing rigorous methods to evaluate strategies for altering violence-supportive attitudes and norms of youth. Specifically, the study was structured to evaluate the relative effectiveness of common approaches to youth GV/H prevention programming (in terms of knowledge, attitudes, intended behavior, behavior, and emotional safety of youth participants) for one of the youngest populations ever studied in this area. In a longitudinal randomized controlled trial study, two five-lesson curricula were created to address gender violence and harassment (GV/H) in middle schools, and classrooms were assigned randomly to treatment and control groups. Treatment 1 was an interaction-based curriculum focused on the setting and communication of boundaries in relationships, the determination of wanted and unwanted behaviors, and the role of the bystander as intervener. Treatment 2 was a law and justice curriculum focused on laws, definitions, information, and data about penalties for sexual assault and sexual harassment. The control group did not receive either treatment. Pencil-and-paper surveys were designed for students to complete, and were administered either by a member of the research team or by teachers who were trained by a member of the research team in proper administration processes. Data were collected from three inner-ring suburbs of Cleveland, Ohio, from November 2006 to May 2007. Surveys were distributed at three different times: immediately before the assignment to one of the three study conditions, immediately after the treatment (or control condition) was completed, and 5-6 months after their assignment to one of the three study conditions. The data contain responses for 1,507 students over 3 waves. 
Additionally, the researchers used multiple imputation, which produced 5 imputed records for each student, for a total of 7,535 cases in the data file (1,507 × 5). The data have 697 variables, including responses to questions such as whether someone had ever, or in the past 6 months, slapped or scratched the respondent, hit the respondent, or threatened the respondent. Additionally, respondents were asked if they had done these same actions to someone else. Respondents were also asked a series of questions regarding whether they had ever been sexually harassed by someone or if they had sexually harassed someone themselves. Next, respondents were asked to rate whether they agreed with a series of statements such as "It is all right for a girl to ask a boy out on a date", "If you ignore sexual harassment, more than likely it will stop", and "Making sexual comments to a girl is wrong". Students were then asked to indicate whether a series of statements were true or false, such as "If two kids who are both under the age of 16 have sex, it is not against the law" and "If a person is not physically harming someone, then they are not really abusive". Respondents were then asked to read three scenarios and indicate how they would respond in each scenario. Also, students indicated how likely they would be to react in specified ways to a prepared statement. The data also provide demographic information such as age, gender, and ethnic/racial background, as well as variables to generically identify school district, school, and class period.
Considerable controversy exists about which hypotheses and variables best explain mammalian brain size variation. We use a new, high-coverage dataset of marsupial brain and body sizes, and the first phylogenetically imputed full datasets of 16 predictor variables, to model the prevalent hypotheses explaining brain size evolution using phylogenetically corrected Bayesian generalised linear mixed-effects modelling. Despite this comprehensive analysis, litter size emerges as the only significant predictor. Marsupials differ from the more frequently studied placentals in displaying much lower diversity of reproductive traits, which are known to interact extensively with many behavioural and ecological predictors of brain size. Our results therefore suggest that studies of relative brain size evolution in placental mammals may require targeted co-analysis or adjustment of reproductive parameters like litter size, weaning age, or gestation length. This supports suggestions that significant as...
https://spdx.org/licenses/CC0-1.0.html
Aim
Seabirds are heavily threatened by anthropogenic activities and their conservation status is deteriorating rapidly. Yet, these pressures are unlikely to uniformly impact all species. It remains an open question if seabirds with similar ecological roles are responding similarly to human pressures. Here we aim to: 1) test whether threatened vs non-threatened seabirds are separated in trait space; 2) quantify the similarity of species’ roles (redundancy) per IUCN Red List Category; and 3) identify traits that render species vulnerable to anthropogenic threats.
Location
Global
Time period
Contemporary
Major taxa studied
Seabirds
Methods
We compile and impute eight traits that relate to species’ vulnerabilities and ecosystem functioning across 341 seabird species. Using these traits, we build a mixed-data PCA of species’ trait space. We quantify trait redundancy using the unique trait combinations (UTCs) approach. Finally, we employ a SIMPER analysis to identify which traits explain the greatest difference between threat groups.
Results
We find seabirds segregate in trait space based on threat status, indicating anthropogenic impacts are selectively removing large, long-lived, pelagic surface feeders with narrow habitat breadths. We further find that threatened species have higher trait redundancy, while non-threatened species have relatively limited redundancy. Finally, we find that species with narrow habitat breadths, fast reproductive speeds, and varied diets are more likely to be threatened by habitat-modifying processes (e.g., pollution and natural system modifications); whereas pelagic specialists with slow reproductive speeds and varied diets are vulnerable to threats that directly impact survival and fecundity (e.g., invasive species and biological resource use) and climate change. Species with no threats are non-pelagic specialists with invertebrate diets and fast reproductive speeds.
Main conclusions
Our results suggest both threatened and non-threatened species contribute unique ecological strategies. Consequently, conserving both threat groups, but with contrasting approaches may avoid potential changes in ecosystem functioning and stability.
Methods
Trait Selection and Data
We compiled data from multiple databases for eight traits across all 341 extant species of seabirds. Here we recognise seabirds as those that feed at sea, either nearshore or offshore, but excluding marine ducks. These traits encompass the varying ecological and life history strategies of seabirds, and relate to ecosystem functioning and species’ vulnerabilities. We first extracted the trait data for body mass, clutch size, habitat breadth and diet guild from a recently compiled trait database for birds (Cooke, Bates, et al., 2019). Generation length and migration status were compiled from BirdLife International (datazone.birdlife.org), and pelagic specialism and foraging guild from Wilman et al. (2014). We further compiled clutch size information for 84 species through a literature search.
Foraging and diet guild describe the most dominant foraging strategy and diet of the species. Wilman et al. (2014) assigned species a score from 0 to 100% for each foraging and diet guild based on their relative usage of a given category. Using these scores, species were classified into four foraging guild categories (diver, surface, ground, and generalist foragers) and three diet guild categories (omnivore, invertebrate, and vertebrate & scavenger diets). Each was assigned to a guild based on the predominant foraging strategy or diet (score > 50%). Species with category scores < 50% were classified as generalists for the foraging guild trait and omnivores for the diet guild trait. Body mass was measured in grams and was the median across multiple databases. Habitat breadth is the number of habitats listed as suitable by the International Union for Conservation of Nature (IUCN, iucnredlist.org). Generation length describes the mean age in years at which a species produces offspring. Clutch size is the number of eggs per clutch (the central tendency was recorded as the mean or mode). Migration status describes whether a species undertakes full migration (regular or seasonal cyclical movements beyond the breeding range, with predictable timing and destinations) or not. Pelagic specialism describes whether foraging is predominantly pelagic. To improve normality of the data, continuous traits, except clutch size, were log10 transformed.
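The guild assignment rule and the log10 transformation described above can be sketched as follows. This is an illustrative reading of the text, not the authors' code: the function name and example scores are hypothetical, and a score of exactly 50% is assumed to fall back to the generalist or omnivore category because the text does not say how that boundary case was handled.

```python
import math

def classify_guild(scores, fallback):
    """Assign the category whose usage score exceeds 50%, else the fallback.

    scores: dict mapping category name to percentage usage (0 to 100).
    fallback: "generalist" for foraging guild, "omnivore" for diet guild.
    """
    top, top_score = max(scores.items(), key=lambda item: item[1])
    return top if top_score > 50 else fallback

# Hypothetical foraging scores for two species.
print(classify_guild({"diver": 70, "surface": 20, "ground": 10}, "generalist"))
print(classify_guild({"diver": 40, "surface": 40, "ground": 20}, "generalist"))

# Continuous traits other than clutch size are log10 transformed,
# e.g. a hypothetical body mass of 1000 g.
print(math.log10(1000.0))
```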
Multiple Imputation
All traits had more than 80% coverage for our list of 341 seabird species, and body mass and habitat breadth had complete species coverage. To achieve complete species trait coverage, we imputed missing data for clutch size (4 species), generation length (1 species), diet guild (60 species), foraging guild (60 species), pelagic specialism (60 species) and migration status (3 species). The imputation approach has the advantage of increasing the sample size and consequently the statistical power of any analysis whilst reducing bias and error (Kim, Blomberg, & Pandolfi, 2018; Penone et al., 2014; Taugourdeau, Villerd, Plantureux, Huguenin-Elie, & Amiaud, 2014).
We estimated missing values using random forest regression trees, a non-parametric imputation method, based on the ecological and phylogenetic relationships between species (Breiman, 2001; Stekhoven & Bühlmann, 2012). This method has high predictive accuracy and the capacity to deal with complexity in relationships including non-linearities and interactions (Cutler et al., 2007). To perform the random forest multiple imputations, we used the missForest function from package “missForest” (Stekhoven & Bühlmann, 2012). We imputed missing values based on the ecological (the trait data) and phylogenetic (the first 10 phylogenetic eigenvectors, detailed below) relationships between species. We generated 1,000 trees - a cautiously large number to increase predictive accuracy and prevent overfitting (Stekhoven & Bühlmann, 2012). We set the number of variables randomly sampled at each split (mtry) as the square root of the number of variables included (10 phylogenetic eigenvectors plus 8 traits gives 18 variables; √18 ≈ 4.2, so mtry = 4); a useful compromise between imputation error and computation time (Stekhoven & Bühlmann, 2012). We used a maximum of 20 iterations (maxiter = 20), to ensure the imputations finished due to the stopping criterion and not due to the limit of iterations (the imputed datasets generally finished after 4 – 10 iterations).
Due to the stochastic nature of the regression tree imputation approach, the estimated values will differ slightly each time. To capture this imputation uncertainty and to converge on a reliable result, we repeated the process 15 times, resulting in 15 trait datasets, which is suggested to be sufficient (González-Suárez, Zanchetta Ferreira, & Grilo, 2018; van Buuren & Groothuis-Oudshoorn, 2011). We took the mean values for continuous traits and modal values for categorical traits across the 15 datasets for subsequent analyses.
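The pooling step can be sketched as below: the mean for a continuous trait and the mode for a categorical trait. This is a simplified illustration with a hypothetical function name and invented values, shown for 5 imputed datasets rather than 15 to keep it short.

```python
from statistics import mean, mode

def pool_trait(values):
    """Pool one species' imputed values for one trait across datasets."""
    if all(isinstance(v, (int, float)) for v in values):
        return mean(values)   # continuous trait: mean across imputations
    return mode(values)       # categorical trait: most frequent value

# Hypothetical imputed values for one species across 5 imputed datasets.
clutch_size = [1.8, 2.1, 2.0, 1.9, 2.2]
diet_guild = ["invertebrate", "omnivore", "invertebrate", "invertebrate", "omnivore"]

print(pool_trait(clutch_size))
print(pool_trait(diet_guild))
```

With an odd number of datasets such as 15, a two-category trait cannot tie. For traits with more categories a tie is possible, and `statistics.mode` then returns the first of the tied values it encounters (Python 3.8 and later).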
Phylogenetic data can improve the estimation of missing trait values in the imputation process (Kim et al., 2018; Swenson, 2014), because closely related species tend to be more similar to each other (Pagel, 1999) and many traits display high degrees of phylogenetic signal (Blomberg, Garland, & Ives, 2003). Phylogenetic information was summarised by eigenvectors extracted from a principal coordinate analysis, representing the variation in the phylogenetic distances among species (Diniz-Filho et al., 2012; Diniz-Filho, Rangel, Santos, & Bini, 2012). Bird phylogenetic distance data (Prum et al., 2015) were decomposed into a set of orthogonal phylogenetic eigenvectors using the Phylo2DirectedGraph and PEM.build functions from the “MPSEM” package (Guenard & Legendre, 2018). Here, we used the first 10 phylogenetic eigenvectors, which have previously been shown to minimise imputation error (Penone et al., 2014). These phylogenetic eigenvectors summarise major phylogenetic differences between species (Diniz-Filho et al., 2012) and captured 61% of the variation in the phylogenetic distances among seabirds. These eigenvectors do not include fine-scale differences between species (Diniz-Filho et al., 2012). However, the inclusion of many phylogenetic eigenvectors would dilute the ecological information contained in the traits, and could lead to excessive noise (Diniz-Filho et al., 2012; Peres‐Neto & Legendre, 2010). Thus, including the first 10 phylogenetic eigenvectors reduces imputation error and ensures a balance between including detailed phylogenetic information and diluting the information contained in the other traits.
To quantify the average error in random forest predictions across the imputed datasets (out-of-bag error), we calculated the mean normalized root squared error and associated standard deviation across the 15 datasets for continuous traits (clutch size = 13.3 ± 0.35 %, generation length = 0.6 ± 0.02 %). For categorical data, we quantified the mean percentage of traits falsely classified (diet guild = 28.6 ± 0.97 %, foraging guild = 18.0 ± 1.05 %, pelagic specialism = 11.2 ± 0.66 %, migration status = 18.8 ± 0.58 %). Since body mass and habitat breadth have complete trait coverage, they did not require imputation. Low imputation accuracy is reflected in high out-of-bag error values where diet guild had the lowest imputation accuracy with 28.6% wrongly classified on average. Diet is generally difficult to predict (Gainsbury, Tallowin, & Meiri, 2018), potentially due to species’ high dietary plasticity (Gaglio, Cook, McInnes, Sherley, & Ryan, 2018) and/or the low phylogenetic conservatism of diet (Gainsbury et al., 2018). With this caveat in mind, we chose dietary guild, as more coarse dietary classifications are more
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Abstract
Tetrapods (amphibians, reptiles, birds and mammals) are model systems for global biodiversity science, but continuing data gaps, limited data standardisation, and ongoing flux in taxonomic nomenclature constrain integrative research on this group and potentially cause biassed inference. We combined and harmonised taxonomic, spatial, phylogenetic, and attribute data with phylogeny-based multiple imputation to provide a comprehensive data resource (TetrapodTraits 1.0.0) that includes values, predictions, and sources for body size, activity time, micro- and macrohabitat, ecosystem, threat status, biogeography, insularity, environmental preferences and human influence, for all 33,281 tetrapod species covered in recent fully sampled phylogenies. We assess gaps and biases across taxa and space, finding that shared data missing in attribute values increased with taxon-level completeness and richness across clades. Prediction of missing attribute values using multiple imputation revealed substantial changes in estimated macroecological patterns. These results highlight biases incurred by non-random missingness and strategies to best address them. While there is an obvious need for further data collection and updates, our phylogeny-informed database of tetrapod traits can support a more comprehensive representation of tetrapod species and their attributes in ecology, evolution, and conservation research.
Additional Information: This work is an output of the VertLife project. To flag errors, provide updates, or leave other comments, please go to vertlife.org. We aim to develop the database into a living resource at vertlife.org, and your feedback is essential to improve data quality and support community use.
Version 1.0.1 (25 May 2024). This minor release addresses a spelling error in the file Tetrapod_360.csv. The correction replaces white-space characters with underscore characters in the field Scientific.Name to match the spelling used in the file TetrapodTraits_1.0.0.csv. These corrections affect only 102 species considered extinct and 13 domestic species (Bos_frontalis, Bos_grunniens, Bos_indicus, Bos_taurus, Camelus_bactrianus, Camelus_dromedarius, Capra_hircus, Cavia_porcellus, Equus_caballus, Felis_catus, Lama_glama, Ovis_aries, Vicugna_pacos). All extinct and domestic species in TetrapodTraits have their binomial names separated by underscore symbols instead of white space. Additionally, we have added the file GridCellShapefile.zip, which contains the shapefile required to map species presence across the 110 × 110 km equal area grid cells (this file was previously provided through an external source).
Version 1.0.0 (19 April 2024). TetrapodTraits, the full phylogenetically coherent database we developed, is being made publicly available to support a range of research applications in ecology, evolution, and conservation and to help minimise the impacts of biassed data in this model system. The database includes 24 species-level attributes linked to their respective sources across 33,281 tetrapod species. Specific fields clearly label data sources and imputations in TetrapodTraits, while additional tables record the 10K values per missing entry per species.
Taxonomy – includes 8 attributes that report scientific names and respective higher-level taxonomic ranks, authority name, and year of species description. Field names: Scientific.Name, Genus, Family, Suborder, Order, Class, Authority, and YearOfDescription.
Phylogenetic tree – includes 2 attributes that indicate which fully-sampled phylogeny contains the species, along with whether the species placement was imputed or not in the phylogeny. Field names: TreeTaxon, TreeImputed.
Body size – includes 7 attributes that report length, mass, and data sources on species sizes, and details on the imputation of species length or mass. Field names: BodyLength_mm, LengthMeasure, ImputedLength, SourceBodyLength, BodyMass_g, ImputedMass, SourceBodyMass.
Activity time – includes 5 attributes that describe period of activity (e.g., diurnal, fossorial) as dummy (binary) variables, data sources, details on the imputation of species activity time, and a nocturnality score. Field names: Diu, Noc, ImputedActTime, SourceActTime, Nocturnality.
Microhabitat – includes 8 attributes covering habitat use (e.g., fossorial, terrestrial, aquatic, arboreal, aerial) as dummy (binary) variables, data sources, details on the imputation of microhabitat, and a verticality score. Field names: Fos, Ter, Aqu, Arb, Aer, ImputedHabitat, SourceHabitat, Verticality.
Macrohabitat – includes 19 attributes that reflect major habitat types according to the IUCN classification, the sum of major habitats, data source, and details on the imputation of macrohabitat. Field names: MajorHabitat_1 to MajorHabitat_10, MajorHabitat_12 to MajorHabitat_17, MajorHabitatSum, ImputedMajorHabitat, SourceMajorHabitat. MajorHabitat_11, representing the marine deep ocean floor (unoccupied by any species in our database), is not included here.
Ecosystem – includes 6 attributes covering species ecosystem (e.g., terrestrial, freshwater, marine) as dummy (binary) variables, the sum of ecosystem types, data sources, and details on the imputation of ecosystem. Field names: EcoTer, EcoFresh, EcoMar, EcosystemSum, ImputedEcosystem, SourceEcosystem.
Threat status – includes 3 attributes that report the assessed threat statuses according to the IUCN Red List and related literature. Field names: IUCN_Binomial, AssessedStatus, SourceStatus.
RangeSize – the number of 110 × 110 km grid cells covered by the species range map. Data derived from MOL.
Latitude – coordinate centroid of the species range map.
Longitude – coordinate centroid of the species range map.
Biogeography – includes 8 attributes that present the proportion of species range within each WWF biogeographical realm. Field names: Afrotropic, Australasia, IndoMalay, Nearctic, Neotropic, Oceania, Palearctic, Antarctic.
Insularity – includes 2 attributes that indicate whether a species is an insular endemic (binary, 1 = yes, 0 = no), followed by the respective data source. Field names: Insularity, SourceInsularity.
AnnuMeanTemp – Average within-range annual mean temperature (degrees Celsius). Data derived from CHELSA v. 1.2.
AnnuPrecip – Average within-range annual precipitation (mm). Data derived from CHELSA v. 1.2.
TempSeasonality – Average within-range temperature seasonality (Standard deviation × 100). Data derived from CHELSA v. 1.2.
PrecipSeasonality – Average within-range precipitation seasonality (Coefficient of Variation). Data derived from CHELSA v. 1.2.
Elevation – Average within-range elevation (metres). Data derived from topographic layers in EarthEnv.
ETA50K – Average within-range estimated time to travel to cities with a population >50K in the year 2015. Data from Nelson et al. (2019).
HumanDensity – Average within-range human population density in 2017. Data derived from HYDE v. 3.2.
PropUrbanArea – Proportion of species range map covered by built-up area, such as towns, cities, etc. at year 2017. Data derived from HYDE v. 3.2.
PropCroplandArea – Proportion of species range map covered by cropland area, identical to FAO's category 'Arable land and permanent crops' at year 2017. Data derived from HYDE v. 3.2.
PropPastureArea – Proportion of species range map covered by pasture, defined as Grazing land with an aridity index > 0.5, assumed to be more intensively managed (converted in climate models) at year 2017. Data derived from HYDE v. 3.2.
PropRangelandArea – Proportion of species range map covered by rangeland, defined as Grazing land with an aridity index < 0.5, assumed to be less or not managed (not converted in climate models) at year 2017. Data derived from HYDE v. 3.2.
File content
All files use UTF-8 encoding.
ImputedSets.zip – the phylogenetic multiple imputation framework applied to the TetrapodTraits database produced 10,000 imputed values per missing data entry (= 100 phylogenetic trees x 10 validation-folds x 10 multiple imputations). These imputations were specifically developed for four fundamental natural history traits: Body length, Body mass, Activity time, and Microhabitat. To facilitate the evaluation of each imputed value in a user-friendly format, we offer 10,000 tables containing both observed and imputed data for the 33,281 species in the TetrapodTraits database. Each table encompasses information about the four targeted natural history traits, along with designated fields (e.g., ImputedMass) that clearly indicate whether the trait value provided (e.g., BodyMass_g) corresponds to observed (e.g., ImputedMass = 0) or imputed (e.g., ImputedMass = 1) data. Given that the complete set of 10,000 tables necessitates nearly 17GB of storage space, we have organized sets of 1,000 tables into separate zip files to streamline the download process.
ImputedSets_1K.zip, imputations for trees 1 to 10.
ImputedSets_2K.zip, imputations for trees 11 to 20.
ImputedSets_3K.zip, imputations for trees 21 to 30.
ImputedSets_4K.zip, imputations for trees 31 to 40.
ImputedSets_5K.zip, imputations for trees 41 to 50.
ImputedSets_6K.zip, imputations for trees 51 to 60.
ImputedSets_7K.zip, imputations for trees 61 to 70.
ImputedSets_8K.zip, imputations for trees 71 to 80.
ImputedSets_9K.zip, imputations for trees 81 to 90.
ImputedSets_10K.zip, imputations for trees 91 to 100.
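The archive layout above (ten trees per zip file, with flag fields such as ImputedMass marking imputed entries) can be sketched in a few lines. This is only an illustration: the function names `archive_for_tree` and `split_observed_imputed` are hypothetical, and the sketch assumes each table has been loaded as a list of dictionaries (for example with `csv.DictReader`) in which the flag field parses as 0 or 1.

```python
import math

TREES_PER_ARCHIVE = 10  # each zip file covers 10 trees, per the file list above


def archive_for_tree(tree: int) -> str:
    """Return the name of the zip archive holding imputations for a tree (1-100)."""
    if not 1 <= tree <= 100:
        raise ValueError("tree must be between 1 and 100")
    block = math.ceil(tree / TREES_PER_ARCHIVE)  # trees 1-10 -> 1, 11-20 -> 2, ...
    return f"ImputedSets_{block}K.zip"


def split_observed_imputed(rows, value_field="BodyMass_g", flag_field="ImputedMass"):
    """Split trait values into observed (flag = 0) and imputed (flag = 1) lists.

    Assumption: `rows` is an iterable of dicts and the flag is stored as 0/1.
    """
    observed, imputed = [], []
    for row in rows:
        is_imputed = int(float(row[flag_field])) == 1
        (imputed if is_imputed else observed).append(float(row[value_field]))
    return observed, imputed
```

Keeping observed and imputed values apart in this way makes it straightforward to check how much a downstream result depends on imputed entries.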
TetrapodTraits_1.0.0.csv – the complete TetrapodTraits database, with missing data entries in natural history traits (body length, body mass, activity time, and microhabitat) replaced by the average across the 10K imputed values obtained through phylogenetic multiple imputation. Please note that imputed microhabitat (attribute fields: Fos, Ter, Aqu, Arb, Aer) and imputed activity time (attribute fields: Diu, Noc) are continuous variables within the 0-1 range interval. At the user's
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background
Many datasets in medicine and other branches of science are incomplete. In this article we compare various imputation algorithms for missing data.
Objectives
We take the point of view that it has already been decided that the imputation should be carried out using multiple imputation by chained equations and the only decision left is that of a subroutine for the one-dimensional imputations. The subroutines to be compared are predictive mean matching, weighted predictive mean matching, sampling, classification or regression trees and random forests.
Methods
We compare these subroutines on real data and on simulated data. We consider the estimation of expected values, variances and coefficients of linear regression models, logistic regression models and Cox regression models. As real data we use data of the survival times after the diagnosis of an obstructive coronary artery disease with systolic blood pressure, LDL, diabetes, smoking behavior and family history of premature heart diseases as variables for which values have to be imputed. While we are mainly interested in statistical properties like biases, mean squared errors or coverage probabilities of confidence intervals, we also have an eye on the computation time.
Results
Weighted predictive mean matching had to be excluded from the statistical comparison due to its enormous computation time. Among the remaining algorithms, in most situations we tested, predictive mean matching performed best.
Novelty
This is by far the largest comparison study for subroutines of multiple imputation by chained equations that has been performed up to now.
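To make the idea of a one-dimensional imputation subroutine concrete, the following is a deliberately simplified sketch of predictive mean matching for a single variable with one predictor. It is not the implementation compared in the study: the names `fit_line` and `pmm_impute` are hypothetical, and implementations used in practice typically also draw the regression parameters at random before matching, a step omitted here.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def pmm_impute(xs, ys, k=3, rng=None):
    """Fill None entries in ys by simplified predictive mean matching.

    For each missing entry, the k observed cases whose predicted values are
    closest to the missing case's prediction act as donors, and one donor's
    observed value is copied at random.
    """
    rng = rng or random.Random()
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    intercept, slope = fit_line([x for x, _ in observed], [y for _, y in observed])

    def predict(x):
        return intercept + slope * x

    completed = []
    for x, y in zip(xs, ys):
        if y is not None:
            completed.append(y)
            continue
        donors = sorted(observed, key=lambda d: abs(predict(d[0]) - predict(x)))[:k]
        completed.append(rng.choice(donors)[1])
    return completed
```

Because imputed values are always copied from observed cases, this approach never produces values outside the observed range.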
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Background
To date, no consensus exists on the effects of steroid use on pneumonic chronic obstructive pulmonary disease (COPD) owing to trial design issues in previous trials involving these conditions. Therefore, we aimed to evaluate steroid effectiveness in pneumonic COPD exacerbation patients.
Methods
This multi-centred, retrospective, observational study was conducted across five acute general hospitals in Japan. We analysed the association between parenteral/oral steroid therapy and time to clinical stability in pneumonic COPD exacerbation.
We used a validated algorithm derived from the 10th revision of the International Classification of Diseases and Related Health Problems (ICD-10) to include pneumonic COPD exacerbation patients. We excluded patients with other hypoxia causes (asthma exacerbation, pneumothorax, heart failure) and complicated pneumonia (obstructive pneumonia, empyema), those who required tracheal intubation/vasopressors, and those who were clinically stable on the admission day.
The primary outcome was time to clinical stability. Multiple imputation was used for missing data. Propensity scores within each imputed dataset were calculated using potential confounding factors. The Fine and Gray model was used within each dataset to account for the competing risk of death and hospital discharge without clinical stability, and we combined the results.
Results
Altogether, 1237 patients were included. The pooled estimated subdistribution hazard ratio of time to clinical stability in steroid versus non-steroid users was 0.89 (95% confidence interval, 0.78 to 1.03). However, there were potentially unmeasured confounders, and we could not assess longer-term outcomes.
Conclusions
The current study recommends that steroid therapy should not be used routinely for pneumonic COPD exacerbation.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In 2011 we published a revision of the Dibrachys cavus complex in Zootaxa (Peters and Baur, 2011; open access). It was our wish to also publish two versions of the dataset (one with missing values, one with missing values imputed) as supplementary files. Unfortunately, the data files seem to be no longer available on the publisher's webpage. Hence, we publish the data files again here in CSV format.
We take the opportunity to also publish the R-scripts that we used for calculating multivariate analyses, tests, and the multiple imputation of missing values.
All files are available individually, each with its own link. For convenience, we have also compiled all files in a ZIP file.
Baur (2020) used the dataset for further exploration in a Multivariate Ratio Analysis (MRA).
The papers cited above can be found in the References section of the Zenodo package.
Citation of this package
Peters, Ralph S., & Baur, Hannes (2020, November 9) Datasets and R-scripts used for the revision of the Dibrachys cavus complex by Peters & Baur, 2011, Zootaxa 2937.1. Zenodo. https://doi.org/10.5281/zenodo.4264539 (directs to the newest version of the package).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background
The multiple imputation approach to missing data has been validated by a number of simulation studies by artificially inducing missingness on fully observed stage data under a pre-specified missing data mechanism. However, the validity of multiple imputation has not yet been assessed using real data. The objective of this study was to assess the validity of using multiple imputation for “unknown” prostate cancer stage recorded in the New South Wales Cancer Registry (NSWCR) in real-world conditions.
Methods
Data from the population-based cohort study NSW Prostate Cancer Care and Outcomes Study (PCOS) were linked to 2000–2002 NSWCR data. For cases with “unknown” NSWCR stage, PCOS-stage was extracted from clinical notes. Logistic regression was used to evaluate the missing at random assumption adjusted for variables from two imputation models: a basic model including NSWCR variables only and an enhanced model including the same NSWCR variables together with PCOS primary treatment. Cox regression was used to evaluate the performance of MI.
Results
Of the 1864 prostate cancer cases 32.7% were recorded as having “unknown” NSWCR stage. The missing at random assumption was satisfied when the logistic regression included the variables included in the enhanced model, but not those in the basic model only. The Cox models using data with imputed stage from either imputation model provided generally similar estimated hazard ratios but with wider confidence intervals compared with those derived from analysis of the data with PCOS-stage. However, the complete-case analysis of the data provided a considerably higher estimated hazard ratio for the low socio-economic status group and rural areas in comparison with those obtained from all other datasets.
Conclusions
Using MI to deal with “unknown” stage data recorded in a population-based cancer registry appears to provide valid estimates. We would recommend a cautious approach to the use of this method elsewhere.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Genotype imputation is a powerful tool for increasing statistical power in an association analysis. Meta-analysis of multiple study datasets also requires a substantial overlap of SNPs for a successful association analysis, which can be achieved by imputation. Quality of imputed datasets is largely dependent on the software used, as well as the reference populations chosen. The accuracy of imputation of available reference populations has not been tested for the five-way admixed South African Colored (SAC) population. In this study, imputation results obtained using three freely-accessible methods were evaluated for accuracy and quality. We show that the African Genome Resource is the best reference panel for imputation of missing genotypes in samples from the SAC population, implemented via the freely accessible Sanger Imputation Server.
TIMSS measures trends in mathematics and science achievement at the fourth and eighth grades in participating countries around the world, as well as monitoring curricular implementation and identifying promising instructional practices. Conducted on a regular 4-year cycle, TIMSS has assessed mathematics and science in 1995, 1999, 2003, and 2007, with planning underway for 2011. TIMSS collects a rich array of background information to provide comparative perspectives on trends in achievement in the context of different educational systems, school organizational approaches, and instructional practices. To support and promote secondary analyses aimed at improving mathematics and science education at the fourth and eighth grades, the TIMSS 2007 international database makes available to researchers, analysts, and other users the data collected and processed by the TIMSS project. This database comprises student achievement data as well as student, teacher, school, and curricular background data for 59 countries and 8 benchmarking participants. Across both grades, the database includes data from 433,785 students, 46,770 teachers, 14,753 school principals, and the National Research Coordinators of each country. All participating countries gave the IEA permission to release their national data.
The survey had national coverage
Units of analysis in the study include documents, schools and individuals
The TIMSS target populations are all fourth and eighth graders in each participating country. The teachers in the TIMSS 2007 international database do not constitute representative samples of teachers in the participating countries. Rather, they are the teachers of nationally representative samples of students. Therefore, analyses with teacher data should be made with students as the units of analysis and reported in terms of students who are taught by teachers with a particular attribute. Teacher data are analyzed by linking the students to their teachers. The student-teacher linkage data files are used for this purpose.
Sample survey data [ssd]
The TIMSS target populations are all fourth and eighth graders in each participating country. To obtain accurate and representative samples, TIMSS used a two-stage sampling procedure whereby a random sample of schools is selected at the first stage and one or two intact fourth or eighth grade classes are sampled at the second stage. This is a very effective and efficient sampling approach, but the resulting student sample has a complex structure that must be taken into consideration when analyzing the data. In particular, sampling weights need to be applied and a re-sampling technique such as the jackknife employed to estimate sampling variances correctly.
In addition, TIMSS 2007 uses Item Response Theory (IRT) scaling to summarize student achievement on the assessment and to provide accurate measures of trends from previous assessments. The TIMSS IRT scaling approach used multiple imputation-or "plausible values"-methodology to obtain proficiency scores in mathematics and science for all students. Each student record in the TIMSS 2007 international database contains imputed scores in mathematics and science overall, as well as for each of the content domain subscales and cognitive domain subscales. Because each imputed score is a prediction based on limited information, it almost certainly includes some small amount of error. To allow analysts to incorporate this error into analyses of the TIMSS achievement data, the TIMSS database provides five separate imputed scores for each scale. Each analysis should be replicated five times, using a different plausible value each time, and the results combined into a single result that includes information on standard errors that incorporate both sampling and imputation error.
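The replicate-and-combine procedure described above can be sketched with the standard multiple-imputation combining rules. The function name `combine_plausible_values` is hypothetical, and the sketch assumes that the analysis has already been run once per plausible value, yielding one estimate and one sampling variance (for example from a jackknife) each time. Averaging the sampling variances is the generic convention; the TIMSS documentation may specify a different one.

```python
import math
from statistics import mean, variance


def combine_plausible_values(estimates, sampling_variances):
    """Pool per-plausible-value results into one estimate and standard error.

    estimates: one point estimate per plausible value.
    sampling_variances: the matching sampling variance of each estimate.
    """
    m = len(estimates)
    if m < 2 or m != len(sampling_variances):
        raise ValueError("need at least two estimates with matching variances")
    pooled = mean(estimates)
    within = mean(sampling_variances)          # average sampling variance (assumed convention)
    between = variance(estimates)              # imputation variance, divisor m - 1
    total = within + (1 + 1 / m) * between     # total variance of the pooled estimate
    return pooled, math.sqrt(total)
```

For example, five estimates of 500, 502, 498, 501 and 499, each with a sampling variance of 4.0, pool to an estimate of 500 with a total variance of 4 + 1.2 × 2.5 = 7.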
Face-to-face [f2f]
The study used the following questionnaires: Fourth Grade Student Questionnaire, Fourth Grade Teacher Questionnaire, Fourth Grade School Questionnaire, Eighth Grade Student Questionnaire, Eighth Grade Mathematics Teacher Questionnaire, Eighth Grade Science Teacher Questionnaire, and Eighth Grade School Questionnaire. Information on the variables obtained or derived from questions in the survey is available in the TIMSS 2007 user guide for the international database: Data Supplement 3: Variables derived from the Student, Teacher, and School Questionnaire data.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Code and data to accompany: Cooke et al., 2019 - Global trade-offs of functional redundancy and functional dispersion for birds and mammals. Global Ecology and Biogeography https://onlinelibrary.wiley.com/doi/pdf/10.1111/geb.12869
Six traits are here made available for all 15,485 terrestrial extant bird and mammal species: body mass, litter/clutch size, diel activity, diet, volancy and habitat breadth.
The trait data were compiled from four main sources: PanTHERIA (Jones et al., 2009), Pacifici database (Pacifici et al., 2013), EltonTraits database (Wilman et al., 2014) and Amniote database (Myhrvold et al., 2015), which should be cited, along with this paper, when using this trait data.
Where computing power allows, analyses should be performed across the 25 imputed datasets (trait_25_mi_data.csv) or imputation can be performed by the user of the data for the compiled trait data (trait_data.csv).
Fileset:
functional-tradeoffs.R - Code for the analysis
species_site.csv - Ecoregion scale species-by-site matrix
trait_data.csv - Compiled trait data (includes missing-data species, i.e., excludes imputed data; see below for details on the data compilation process)
trait_25_mi_data.csv - Compiled and imputed trait data (25 imputed trait datasets for the 4 imputed traits [litter_clutch_size, activity, hab_breadth, diet_5cat]; body_mass_median and volant were complete for all species - see trait_data.csv; see Supporting Information of paper for multiple imputation process)
trait_mi_selected_data.csv - Single, randomly selected, compiled and imputed trait data (single imputed dataset used in the analyses; see Supporting Information of paper for multiple imputation process)
functional_tradeoffs_results_combined.csv - Outputs from the analyses for birds and mammals combined (empirical and null values for functional dispersion and functional redundancy per ecoregion)
functional_tradeoffs_results_birds.csv - Outputs from the analyses for birds (empirical and null values for functional dispersion and functional redundancy per ecoregion for birds)
functional_tradeoffs_results_mammals.csv - Outputs from the analyses for mammals (empirical and null values for functional dispersion and functional redundancy per ecoregion for mammals)
Metadata:
'eco' - ecoregion code (see https://www.lib.ncsu.edu/gis/esridm/2004/help/world/wwf_terr.sdc.htm)
'binomial' - species scientific names
'class' - species Class (Aves or Mammalia)
'body_mass_median' - median adult body mass (g) across databases (see Trait data compilation)
'litter_clutch_size' - litter size for mammals and clutch size for birds
'activity' - diel activity (1 = diurnal, 2 = nocturnal)
'hab_breadth' - habitat breadth, number of IUCN habitats occupied by species (see Trait data compilation)
'volant' - flight capability (1 = non-volant, 2 = volant)
'diet_5cat' - 5 diet categories (1 = plant/seed, 2 = fruit/nectar, 3 = vertebrates, including carrion, 4 = invertebrates and 5 = omnivore, score of ≤ 50 in the four other diet categories) (see Wilman et al., 2014 and Trait data compilation)
'spp' - species richness
'fred_emp' - empirical values for functional redundancy
'fred_null' - mean functional redundancy across 999 null model runs (global species pool)
'fred_sd' - standard deviation of functional redundancy across null model runs
'fred_ses' - standardized effect size for functional redundancy across null model runs
'fred_p_g' - p-value for one-tailed permutation test (greater)
'fred_p_l' - p-value for one-tailed permutation test (less)
'fdisp_emp' - empirical values for functional dispersion
'fdisp_null' - mean functional dispersion across 999 null model runs (global species pool)
'fdisp_sd' - standard deviation of functional dispersion across null model runs
'fdisp_ses' - standardized effect size for functional dispersion across null model runs
'fdisp_p_g' - p-value for one-tailed permutation test (greater)
'fdisp_p_l' - p-value for one-tailed permutation test (less)
'tradeoff' - trade-off between functional redundancy and functional dispersion (fdisp_ses - fred_ses)
Trait data compilation:
Specifically, body mass data were sourced from three databases for mammals: Pacifici, EltonTraits and Amniote. The Pacifici database builds on PanTHERIA, but for species that lacked body mass data (1,047 species) they calculated the average body mass of congeneric or confamilial species; we extended this for the 11 species from our global list that were missing data. We took the median across these databases, with 84% of species having values from all three datasets and all species having at least one value (this was required so that all species overlapped in at least one trait dimension). For birds, we calculated the median across two databases: Amniote and EltonTraits. We sourced estimates of body mass values for 573 birds that were missing data, using the average from congeners, as we recognized body mass as a key trait that is strongly related to many aspects of a species’ ecology and therefore their contribution to various functions (see Supporting Information of paper). Therefore all species had at least one estimate of body mass and 72% received the average from two estimates.
Diel activity was obtained from the EltonTraits database for both mammals and birds.
Diet information was available either as semi-quantitative records (percentage use of different dietary categories) or as an aggregated score (assignment to the dominant diet category based on the summed scores of constituent individual diets). We chose to use the more coarse representation, as the semi-quantitative diet data have been shown to differ between databases (Olalla-Tárraga et al., 2016). Thus species were classified into five groups according to their primary diet (Wilman et al., 2014): plant/seed, fruit/nectar, invertebrates, vertebrates (including carrion), and omnivore (score of ≤ 50 in the four other diet categories).
Habitat breadth was coded using the IUCN Habitats Classification Scheme (http://www.iucnredlist.org/technical-documents/classification-schemes/habitats-classification-scheme-ver3) and was quantified as the number of habitats listed for each species. These habitat affinities were extracted via the IUCN Red List Application Programming Interface (API) using the rl_habitats function (rredlist package (Chamberlain 2016)).
Litter size was calculated as the median across the Amniote and PanTHERIA databases (46% had values from both databases), whereas clutch size was only sourced from the Amniote database.
Data on volancy (flight capability) were compiled from the literature, where it is established that bats (Chiroptera) are the only true flying mammals and that most extant birds - apart from ratites, penguins and some flightless rails and waterfowl - are volant (Findley, 1993; Healy et al., 2014). The volancy of birds was validated using two main sources (del Hoyo et al., 2013; BirdLife International, 2017).
References:
BirdLife International. (2017) IUCN Red List for birds.
Chamberlain, S. (2016) rredlist: ‘IUCN’ Red List Client. R Package. version 0.1.0.
Findley, J. S. (1993) Bats: A community perspective. Cambridge Studies in Ecology. Cambridge University Press. doi:10.2307/1382335
Healy, K. et al. (2014) Ecology and mode-of-life explain lifespan variation in birds and mammals. Proc. Biol. Sci. 281, 20140298.
del Hoyo, J., Elliott, A., Sargatal, J., Christie, D. A. & de Juana, E. eds. (2013) Handbook of the Birds of the World Alive, Lynx Edicions, Barcelona, Spain.
Jones, K. E. et al. (2009) PanTHERIA: a species-level database of life history, ecology, and geography of extant and recently extinct mammals. Ecology, 90, 2648–2648. Data available at: http://esapubs.org/archive/ecol/e090/184/
Myhrvold, N. P., Baldridge, E., Chan, B., Freeman, D. L. & Ernest, S. K. M. (2015) An amniote life-history database to perform comparative analyses with birds, mammals, and reptiles. Ecology, 96, 3109. Data available at: http://www.esapubs.org/archive/ecol/E096/269/default.php
Olalla-Tárraga, M. Á., González-Suárez, M., Bernardo-Madrid, R., Revilla, E. & Villalobos, F. (2016) Contrasting evidence of phylogenetic trophic niche conservatism in mammals worldwide. Journal of Biogeography 44, 99–110.
Pacifici, M. et al. (2013) Generation length for mammals. Nature Conservation, 5, 87–94. Data available at: http://datadryad.org/resource/doi:10.5061/dryad.gd0m3
Wilman, H. et al. (2014) EltonTraits 1.0: Species-level foraging attributes of the world’s birds and mammals. Ecology, 95, 2027. Data available at: https://figshare.com/articles/Data_Paper_Data_Paper/3559887
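The relationship between the result fields described in the metadata (empirical value, null-model mean and standard deviation, standardized effect size, and trade-off) can be sketched as follows. The trade-off formula (fdisp_ses - fred_ses) is stated in the metadata; the standardized effect size is assumed here to follow the conventional definition, (empirical - null mean) / null standard deviation, and the use of the sample standard deviation is likewise an assumption. The function names are hypothetical and are not taken from functional-tradeoffs.R.

```python
from statistics import mean, stdev


def standardized_effect_size(empirical, null_values):
    """Conventional SES: (empirical - mean of null runs) / SD of null runs.

    Assumption: the sample standard deviation (divisor n - 1) is used.
    """
    null_sd = stdev(null_values)
    if null_sd == 0:
        raise ValueError("null model values have zero variance")
    return (empirical - mean(null_values)) / null_sd


def tradeoff(fdisp_ses, fred_ses):
    """Trade-off as defined in the metadata: fdisp_ses - fred_ses."""
    return fdisp_ses - fred_ses
```

Under this definition, a positive trade-off value means functional dispersion is higher relative to its null expectation than functional redundancy is relative to its own.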
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Genotype imputation methods are now being widely used in the analysis of genome-wide association studies. Most imputation analyses to date have used the HapMap as a reference dataset, but new reference panels (such as controls genotyped on multiple SNP chips and densely typed samples from the 1,000 Genomes Project) will soon allow a broader range of SNPs to be imputed with higher accuracy, thereby increasing power. We describe a genotype imputation method (IMPUTE version 2) that is designed to address the challenges presented by these new datasets. The main innovation of our approach is a flexible modelling framework that increases accuracy and combines information across multiple reference panels while remaining computationally feasible. We find that IMPUTE v2 attains higher accuracy than other methods when the HapMap provides the sole reference panel, but that the size of the panel constrains the improvements that can be made. We also find that imputation accuracy can be greatly enhanced by expanding the reference panel to contain thousands of chromosomes and that IMPUTE v2 outperforms other methods in this setting at both rare and common SNPs, with overall error rates that are 15%–20% lower than those of the closest competing method. One particularly challenging aspect of next-generation association studies is to integrate information across multiple reference panels genotyped on different sets of SNPs; we show that our approach to this problem has practical advantages over other suggested solutions.