Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Sheet 1 (Raw-Data): The raw data of the study is provided, presenting the tagging results for the used measures described in the paper. For each subject, it includes multiple columns: A. a sequential student ID B an ID that defines a random group label and the notation C. the used notation: user Story or use Cases D. the case they were assigned to: IFA, Sim, or Hos E. the subject's exam grade (total points out of 100). Empty cells mean that the subject did not take the first exam F. a categorical representation of the grade L/M/H, where H is greater or equal to 80, M is between 65 included and 80 excluded, L otherwise G. the total number of classes in the student's conceptual model H. the total number of relationships in the student's conceptual model I. the total number of classes in the expert's conceptual model J. the total number of relationships in the expert's conceptual model K-O. the total number of encountered situations of alignment, wrong representation, system-oriented, omitted, missing (see tagging scheme below) P. the researchers' judgement on how well the derivation process explanation was explained by the student: well explained (a systematic mapping that can be easily reproduced), partially explained (vague indication of the mapping ), or not present.
Tagging scheme:
Aligned (AL) - A concept is represented as a class in both models, either
with the same name or using synonyms or clearly linkable names;
Wrongly represented (WR) - A class in the domain expert model is
incorrectly represented in the student model, either (i) via an attribute,
method, or relationship rather than class, or
(ii) using a generic term (e.g., user'' instead of
urban
planner'');
System-oriented (SO) - A class in CM-Stud that denotes a technical
implementation aspect, e.g., access control. Classes that represent legacy
system or the system under design (portal, simulator) are legitimate;
Omitted (OM) - A class in CM-Expert that does not appear in any way in
CM-Stud;
Missing (MI) - A class in CM-Stud that does not appear in any way in
CM-Expert.
All the calculations and information provided in the following sheets
originate from that raw data.
Sheet 2 (Descriptive-Stats): Shows a summary of statistics from the data collection,
including the number of subjects per case, per notation, per process derivation rigor category, and per exam grade category.
Sheet 3 (Size-Ratio):
The number of classes within the student model divided by the number of classes within the expert model is calculated (describing the size ratio). We provide box plots to allow a visual comparison of the shape of the distribution, its central value, and its variability for each group (by case, notation, process, and exam grade) . The primary focus in this study is on the number of classes. However, we also provided the size ratio for the number of relationships between student and expert model.
Sheet 4 (Overall):
Provides an overview of all subjects regarding the encountered situations, completeness, and correctness, respectively. Correctness is defined as the ratio of classes in a student model that is fully aligned with the classes in the corresponding expert model. It is calculated by dividing the number of aligned concepts (AL) by the sum of the number of aligned concepts (AL), omitted concepts (OM), system-oriented concepts (SO), and wrong representations (WR). Completeness on the other hand, is defined as the ratio of classes in a student model that are correctly or incorrectly represented over the number of classes in the expert model. Completeness is calculated by dividing the sum of aligned concepts (AL) and wrong representations (WR) by the sum of the number of aligned concepts (AL), wrong representations (WR) and omitted concepts (OM). The overview is complemented with general diverging stacked bar charts that illustrate correctness and completeness.
For sheet 4 as well as for the following four sheets, diverging stacked bar
charts are provided to visualize the effect of each of the independent and mediated variables. The charts are based on the relative numbers of encountered situations for each student. In addition, a "Buffer" is calculated witch solely serves the purpose of constructing the diverging stacked bar charts in Excel. Finally, at the bottom of each sheet, the significance (T-test) and effect size (Hedges' g) for both completeness and correctness are provided. Hedges' g was calculated with an online tool: https://www.psychometrica.de/effect_size.html. The independent and moderating variables can be found as follows:
Sheet 5 (By-Notation):
Model correctness and model completeness is compared by notation - UC, US.
Sheet 6 (By-Case):
Model correctness and model completeness is compared by case - SIM, HOS, IFA.
Sheet 7 (By-Process):
Model correctness and model completeness is compared by how well the derivation process is explained - well explained, partially explained, not present.
Sheet 8 (By-Grade):
Model correctness and model completeness is compared by the exam grades, converted to categorical values High, Low , and Medium.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Rotroff Lab 2022-07-01 This repository contains simulated datasets related to Identification of Robust Deep Neural Network Models of Longitudinal Clinical Measurements article published in npj Digital Medicine (2022). These datasets were simulated by manipulating shapes and magnitudes of clinical longitudinal measurements of random blood glucose tests, systolic blood pressure and BMI. Per clinical measurement type, 3168 datasets were generated. In half of these datasets, classes were defined by differences in shape (50%, N=1584) and in the other half, classes were defined by differences in magnitudes (50%, N=1584). The complete list of parameters manipulated is as follows:
Effect Size: 0.25-2.75 Dispersion: 0.25-2.00 Noise: 0%, 10%, 25%, 50% Irregularity: Moderate (0-3 missing measurements) or High (2-5 missing measurements)
For more information on the simulations, please refer to the Methods section of the paper. Please note that with the inclusion of class imbalance, the total number of cohorts increased to 12,672 in the paper. Parameter annotations The parameters are annotated in the file names as follows: _Simulated_Cohort_EffectSize_Dispersion_Noise_Irregularity_OverlapBetweenClasses.csv When IrrNMP is used for Irregularity, no irregularity is applied. OverlapBetweenClasses specifies the overlap between the two classes. Directory Structure The directory structure for each clinical measurement type is as follows:
bmi_simulation (N=1056)mag_cohorts
shape_cohorts
Simulation_Cohort_IDRandomNumber: Random trajectory base 4
irregularity_cohorts
other_cohorts
Simulation_Cohort_ID_RandomNumber: Random trajectory base 5
irregularity_cohorts
other_cohorts
Simulation_Cohort_ID_RandomNumber: Random trajectory base 6
irregularity_cohorts
other_cohorts
glucose_simulation (1056)mag_cohorts
shape_cohorts
sbp_simulation (N=1056)mag_cohorts
shape_cohorts Note: irregularity_cohorts: Effect sizes, dispersions and irregularity were manipulated in these files. other_cohorts: Effect sizes, dispersions and noise were manipulated in these files.
Please cite the following paper and this dataset in your work:
H. Javidi, et al., Identification of Robust Deep Neural Network Models of Longitudinal Clinical Measurements, Nature Digital Medicine, 2022.
This data release provides two example groundwater-level datasets used to benchmark the Automated Regional Correlation Analysis for Hydrologic Record Imputation (ARCHI) software package (Levy and others, 2024). The first dataset contains groundwater-level records and site metadata for wells located on Long Island, New York (NY) and some surrounding mainland sites in New York and Connecticut. The second dataset contains groundwater-level records and site metadata for wells located in the southeastern San Joaquin Valley of the Central Valley, California (CA). For ease of exposition these are referred to as NY and CA datasets, respectively. Both datasets are formatted with column headers that can be read by the ARCHI software package within the R computing environment. These datasets were used to benchmark the imputation accuracy of three ARCHI model settings (OLS, ridge, and MOVE.1) against the widely used imputation program missForest (Stekhoven and Bühlmann, 2012). The ARCHI program was used to process the NY and CA datasets on monthly and annual timesteps, respectively, filter out sites with insufficient data for imputation, and create 200 test datasets from each of the example datasets with 5 percent of observations removed at random (herein, referred to as "holdouts"). Imputation accuracy for test datasets was assessed using normalized root mean square error (NRMSE), which is the root mean square error divided by the standard deviation of the observed holdout values. ARCHI produces prediction intervals (PIs) using a non-parametric bootstrapping routine, which were assessed by computing a coverage rate (CR) defined as the proportion of holdout observations falling within the estimated PI. The multiple regression models included with the ARCHI package (OLS and ridge) were further tested on all test datasets at eleven different levels of the p_per_n input parameter, which limits the maximum ratio of regression model predictors (p) per observations (n) as a decimal fraction greater than zero and less than or equal to one. This data release contains ten tables formatted as tab-delimited text files. The “CA_data.txt” and “NY_data.txt” tables contain 243,094 and 89,997 depth-to-groundwater measurement values (value, in feet below land surface) indexed by site identifier (site_no) and measurement date (date) for CA and NY datasets, respectively. The “CA_sites.txt” and “NY_sites.txt” tables contain site metadata for the 4,380 and 476 unique sites included in the CA and NY datasets, respectively. The “CA_NRMSE.txt” and “NY_NRMSE.txt” tables contain NRMSE values computed by imputing 200 test datasets with 5 percent random holdouts to assess imputation accuracy for three different ARCHI model settings and missForest using CA and NY datasets, respectively. The “CA_CR.txt” and “NY_CR.txt” tables contain CR values used to evaluate non-parametric PIs generated by bootstrapping regressions with three different ARCHI model settings using the CA and NY test datasets, respectively. The “CA_p_per_n.txt” and “NY_p_per_n.txt” tables contain mean NRMSE values computed for 200 test datasets with 5 percent random holdouts at 11 different levels of p_per_n for OLS and ridge models compared to training error for the same models on the entire CA and NY datasets, respectively. References Cited Levy, Z.F., Stagnitta, T.J., and Glas, R.L., 2024, ARCHI: Automated Regional Correlation Analysis for Hydrologic Record Imputation, v1.0.0: U.S. Geological Survey software release, https://doi.org/10.5066/P1VVHWKE. Stekhoven, D.J., and Bühlmann, P., 2012, MissForest—non-parametric missing value imputation for mixed-type data: Bioinformatics 28(1), 112-118. https://doi.org/10.1093/bioinformatics/btr597.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract
Tetrapods (amphibians, reptiles, birds and mammals) are model systems for global biodiversity science, but continuing data gaps, limited data standardisation, and ongoing flux in taxonomic nomenclature constrain integrative research on this group and potentially cause biassed inference. We combined and harmonised taxonomic, spatial, phylogenetic, and attribute data with phylogeny-based multiple imputation to provide a comprehensive data resource (TetrapodTraits 1.0.0) that includes values, predictions, and sources for body size, activity time, micro- and macrohabitat, ecosystem, threat status, biogeography, insularity, environmental preferences and human influence, for all 33,281 tetrapod species covered in recent fully sampled phylogenies. We assess gaps and biases across taxa and space, finding that shared data missing in attribute values increased with taxon-level completeness and richness across clades. Prediction of missing attribute values using multiple imputation revealed substantial changes in estimated macroecological patterns. These results highlight biases incurred by non-random missingness and strategies to best address them. While there is an obvious need for further data collection and updates, our phylogeny-informed database of tetrapod traits can support a more comprehensive representation of tetrapod species and their attributes in ecology, evolution, and conservation research.
Additional Information: This work is output of the VertLife project. To flag erros, provide updates, or leave other comments, please go to vertlife.org. We aim to develop the database into a living resource at vertlife.org and your feedback is essential to improve data quality and support community use.
Version 1.0.1 (25 May 2024). This minor release addresses a spelling error in the file Tetrapod_360.csv. The error involves replacing white-space characters with underscore characters in the field Scientific.Name to match the spelling used in the file TetrapodTraits_1.0.0.csv. These corrections affect only 102 species considered extinct and 13 domestic species (Bos_frontalis, Bos_grunniens, Bos_indicus, Bos_taurus, Camelus_bactrianus, Camelus_dromedarius, Capra_hircus, Cavia_porcellus, Equus_caballus, Felis_catus, Lama_glama, Ovis_aries, Vicugna_pacos). All extinct and domestic species in TetrapodTraits have their binomial names separated by underscore symbols instead of white space. Additionally, we have added the file GridCellShapefile.zip, which contains the shapefile required to map species presence across the 110 × 110 km equal area grid cells (this file was previously provided through an External Source here).
Version 1.0.0 (19 April 2024). TetrapodTraits, the full phylogenetically coherent database we developed, is being made publicly available to support a range of research applications in ecology, evolution, and conservation and to help minimise the impacts of biassed data in this model system. The database includes 24 species-level attributes linked to their respective sources across 33,281 tetrapod species. Specific fields clearly label data sources and imputations in the TetrapodTraits, while additional tables record the 10K values per missing entry per species.
Taxonomy – includes 8 attributes that inform scientific names and respective higher-level taxonomic ranks, authority name, and year of species description. Field names: Scientific.Name, Genus, Family, Suborder, Order, Class, Authority, and YearOfDescription.
Phylogenetic tree – includes 2 attributes that notify which fully-sampled phylogeny contains the species, along with whether the species placement was imputed or not in the phylogeny. Field names: TreeTaxon, TreeImputed.
Body size – includes 7 attributes that inform length, mass, and data sources on species sizes, and details on the imputation of species length or mass. Field names: BodyLength_mm, LengthMeasure, ImputedLength, SourceBodyLength, BodyMass_g, ImputedMass, SourceBodyMass.
Activity time – includes 5 attributes that describe period of activity (e.g., diurnal, fossorial) as dummy (binary) variables, data sources, details on the imputation of species activity time, and a nocturnality score. Field names: Diu, Noc, ImputedActTime, SourceActTime, Nocturnality.
Microhabitat – includes 8 attributes covering habitat use (e.g., fossorial, terrestrial, aquatic, arboreal, aerial) as dummy (binary) variables, data sources, details on the imputation of microhabitat, and a verticality score. Field names: Fos, Ter, Aqu, Arb, Aer, ImputedHabitat, SourceHabitat, Verticality.
Macrohabitat – includes 19 attributes that reflect major habitat types according to the IUCN classification, the sum of major habitats, data source, and details on the imputation of macrohabitat. Field names: MajorHabitat_1 to MajorHabitat_10, MajorHabitat_12 to MajorHabitat_17, MajorHabitatSum, ImputedMajorHabitat, SourceMajorHabitat. MajorHabitat_11, representing the marine deep ocean floor (unoccupied by any species in our database), is not included here.
Ecosystem – includes 6 attributes covering species ecosystem (e.g., terrestrial, freshwater, marine) as dummy (binary) variables, the sum of ecosystem types, data sources, and details on the imputation of ecosystem. Field names: EcoTer, EcoFresh, EcoMar, EcosystemSum, ImputedEcosystem, SourceEcosystem.
Threat status – includes 3 attributes that inform the assessed threat statuses according to IUCN red list and related literature. Field names: IUCN_Binomial, AssessedStatus, SourceStatus.
RangeSize – the number of 110×110 grid cells covered by the species range map. Data derived from MOL.
Latitude – coordinate centroid of the species range map.
Longitude – coordinate centroid of the species range map.
Biogeography – includes 8 attributes that present the proportion of species range within each WWF biogeographical realm. Field names: Afrotropic, Australasia, IndoMalay, Nearctic, Neotropic, Oceania, Palearctic, Antarctic.
Insularity – includes 2 attributes that notify if a species is insular endemic (binary, 1 = yes, 0 = no), followed by the respective data source. Field names: Insularity, SourceInsularity.
AnnuMeanTemp – Average within-range annual mean temperature (Celsius degree). Data derived from CHELSA v. 1.2.
AnnuPrecip – Average within-range annual precipitation (mm). Data derived from CHELSA v. 1.2.
TempSeasonality – Average within-range temperature seasonality (Standard deviation × 100). Data derived from CHELSA v. 1.2.
PrecipSeasonality – Average within-range precipitation seasonality (Coefficient of Variation). Data derived from CHELSA v. 1.2.
Elevation – Average within-range elevation (metres). Data derived from topographic layers in EarthEnv.
ETA50K – Average within-range estimated time to travel to cities with a population >50K in the year 2015. Data from Nelson et al. (2019).
HumanDensity – Average within-range human population density in 2017. Data derived from HYDE v. 3.2.
PropUrbanArea – Proportion of species range map covered by built-up area, such as towns, cities, etc. at year 2017. Data derived from HYDE v. 3.2.
PropCroplandArea – Proportion of species range map covered by cropland area, identical to FAO's category 'Arable land and permanent crops' at year 2017. Data derived from HYDE v. 3.2.
PropPastureArea – Proportion of species range map covered by cropland, defined as Grazing land with an aridity index > 0.5, assumed to be more intensively managed (converted in climate models) at year 2017. Data derived from HYDE v. 3.2.
PropRangelandArea – Proportion of species range map covered by rangeland, defined as Grazing land with an aridity index < 0.5, assumed to be less or not managed (not converted in climate models) at year 2017. Data derived from HYDE v. 3.2.
File content
All files use UTF-8 encoding.
ImputedSets.zip – the phylogenetic multiple imputation framework applied to the TetrapodTraits database produced 10,000 imputed values per missing data entry (= 100 phylogenetic trees x 10 validation-folds x 10 multiple imputations). These imputations were specifically developed for four fundamental natural history traits: Body length, Body mass, Activity time, and Microhabitat. To facilitate the evaluation of each imputed value in a user-friendly format, we offer 10,000 tables containing both observed and imputed data for the 33,281 species in the TetrapodTraits database. Each table encompasses information about the four targeted natural history traits, along with designated fields (e.g., ImputedMass) that clearly indicate whether the trait value provided (e.g., BodyMass_g) corresponds to observed (e.g., ImputedMass = 0) or imputed (e.g., ImputedMass = 1) data. Given that the complete set of 10,000 tables necessitates nearly 17GB of storage space, we have organized sets of 1,000 tables into separate zip files to streamline the download process.
ImputedSets_1K.zip, imputations for trees 1 to 10.
ImputedSets_2K.zip, imputations for trees 11 to 20.
ImputedSets_3K.zip, imputations for trees 21 to 30.
ImputedSets_4K.zip, imputations for trees 31 to 40.
ImputedSets_5K.zip, imputations for trees 41 to 50.
ImputedSets_6K.zip, imputations for trees 51 to 60.
ImputedSets_7K.zip, imputations for trees 61 to 70.
ImputedSets_8K.zip, imputations for trees 71 to 80.
ImputedSets_9K.zip, imputations for trees 81 to 90.
ImputedSets_10K.zip, imputations for trees 91 to 100.
TetrapodTraits_1.0.0.csv – the complete TetrapodTraits database, with missing data entries in natural history traits (body length, body mass, activity time, and microhabitat) replaced by the average across the 10K imputed values obtained through phylogenetic multiple imputation. Please note that imputed microhabitat (attribute fields: Fos, Ter, Aqu, Arb, Aer) and imputed activity time (attribute fields: Diu, Noc) are continuous variables within the 0-1 range interval. At the user's
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Aim
Seabirds are heavily threatened by anthropogenic activities and their conservation status is deteriorating rapidly. Yet, these pressures are unlikely to uniformly impact all species. It remains an open question if seabirds with similar ecological roles are responding similarly to human pressures. Here we aim to: 1) test whether threatened vs non-threatened seabirds are separated in trait space; 2) quantify the similarity of species’ roles (redundancy) per IUCN Red List Category; and 3) identify traits that render species vulnerable to anthropogenic threats.
Location
Global
Time period
Contemporary
Major taxa studied
Seabirds
Methods
We compile and impute eight traits that relate to species’ vulnerabilities and ecosystem functioning across 341 seabird species. Using these traits, we build a mixed-data PCA of species’ trait space. We quantify trait redundancy using the unique trait combinations (UTCs) approach. Finally, we employ a SIMPER analysis to identify which traits explain the greatest difference between threat groups.
Results
We find seabirds segregate in trait space based on threat status, indicating anthropogenic impacts are selectively removing large, long-lived, pelagic surface feeders with narrow habitat breadths. We further find that threatened species have higher trait redundancy, while non-threatened species have relatively limited redundancy. Finally, we find that species with narrow habitat breadths, fast reproductive speeds, and varied diets are more likely to be threatened by habitat-modifying processes (e.g., pollution and natural system modifications); whereas pelagic specialists with slow reproductive speeds and varied diets are vulnerable to threats that directly impact survival and fecundity (e.g., invasive species and biological resource use) and climate change. Species with no threats are non-pelagic specialists with invertebrate diets and fast reproductive speeds.
Main conclusions
Our results suggest both threatened and non-threatened species contribute unique ecological strategies. Consequently, conserving both threat groups, but with contrasting approaches may avoid potential changes in ecosystem functioning and stability.
Methods Trait Selection and Data
We compiled data from multiple databases for eight traits across all 341 extant species of seabirds. Here we recognise seabirds as those that feed at sea, either nearshore or offshore, but excluding marine ducks. These traits encompass the varying ecological and life history strategies of seabirds, and relate to ecosystem functioning and species’ vulnerabilities. We first extracted the trait data for body mass, clutch size, habitat breadth and diet guild from a recently compiled trait database for birds (Cooke, Bates, et al., 2019). Generation length and migration status were compiled from BirdLife International (datazone.birdlife.org), and pelagic specialism and foraging guild from Wilman et al. (2014). We further compiled clutch size information for 84 species through a literature search.
Foraging and diet guild describe the most dominant foraging strategy and diet of the species. Wilman et al. (2014) assigned species a score from 0 to 100% for each foraging and diet guild based on their relative usage of a given category. Using these scores, species were classified into four foraging guild categories (diver, surface, ground, and generalist foragers) and three diet guild categories (omnivore, invertebrate, and vertebrate & scavenger diets). Each was assigned to a guild based on the predominant foraging strategy or diet (score > 50%). Species with category scores < 50% were classified as generalists for the foraging guild trait and omnivores for the diet guild trait. Body mass was measured in grams and was the median across multiple databases. Habitat breadth is the number of habitats listed as suitable by the International Union for Conservation of Nature (IUCN, iucnredlist.org). Generation length describes the mean age in years at which a species produces offspring. Clutch size is the number of eggs per clutch (the central tendency was recorded as the mean or mode). Migration status describes whether a species undertakes full migration (regular or seasonal cyclical movements beyond the breeding range, with predictable timing and destinations) or not. Pelagic specialism describes whether foraging is predominantly pelagic. To improve normality of the data, continuous traits, except clutch size, were log10 transformed.
Multiple Imputation
All traits had more than 80% coverage for our list of 341 seabird species, and body mass and habitat breadth had complete species coverage. To achieve complete species trait coverage, we imputed missing data for clutch size (4 species), generation length (1 species), diet guild (60 species), foraging guild (60 species), pelagic specialism (60 species) and migration status (3 species). The imputation approach has the advantage of increasing the sample size and consequently the statistical power of any analysis whilst reducing bias and error (Kim, Blomberg, & Pandolfi, 2018; Penone et al., 2014; Taugourdeau, Villerd, Plantureux, Huguenin-Elie, & Amiaud, 2014).
We estimated missing values using random forest regression trees, a non-parametric imputation method, based on the ecological and phylogenetic relationships between species (Breiman, 2001; Stekhoven & Bühlmann, 2012). This method has high predictive accuracy and the capacity to deal with complexity in relationships including non-linearities and interactions (Cutler et al., 2007). To perform the random forest multiple imputations, we used the missForest function from package “missForest” (Stekhoven & Bühlmann, 2012). We imputed missing values based on the ecological (the trait data) and phylogenetic (the first 10 phylogenetic eigenvectors, detailed below) relationships between species. We generated 1,000 trees - a cautiously large number to increase predictive accuracy and prevent overfitting (Stekhoven & Bühlmann, 2012). We set the number of variables randomly sampled at each split (mtry) as the square-root of the number variables included (10 phylogenetic eigenvectors, 8 traits; mtry = 4); a useful compromise between imputation error and computation time (Stekhoven & Bühlmann, 2012). We used a maximum of 20 iterations (maxiter = 20), to ensure the imputations finished due to the stopping criterion and not due to the limit of iterations (the imputed datasets generally finished after 4 – 10 iterations).
Due to the stochastic nature of the regression tree imputation approach, the estimated values will differ slightly each time. To capture this imputation uncertainty and to converge on a reliable result, we repeated the process 15 times, resulting in 15 trait datasets, which is suggested to be sufficient (González-Suárez, Zanchetta Ferreira, & Grilo, 2018; van Buuren & Groothuis-Oudshoorn, 2011). We took the mean values for continuous traits and modal values for categorical traits across the 15 datasets for subsequent analyses.
Phylogenetic data can improve the estimation of missing trait values in the imputation process (Kim et al., 2018; Swenson, 2014), because closely related species tend to be more similar to each other (Pagel, 1999) and many traits display high degrees of phylogenetic signal (Blomberg, Garland, & Ives, 2003). Phylogenetic information was summarised by eigenvectors extracted from a principal coordinate analysis, representing the variation in the phylogenetic distances among species (Jose Alexandre F. Diniz-Filho et al., 2012; José Alexandre Felizola Diniz-Filho, Rangel, Santos, & Bini, 2012). Bird phylogenetic distance data (Prum et al., 2015) were decomposed into a set of orthogonal phylogenetic eigenvectors using the Phylo2DirectedGraph and PEM.build functions from the “MPSEM” package (Guenard & Legendre, 2018). Here, we used the first 10 phylogenetic eigenvectors, which have previously been shown to minimise imputation error (Penone et al., 2014). These phylogenetic eigenvectors summarise major phylogenetic differences between species (Diniz-Filho et al., 2012) and captured 61% of the variation in the phylogenetic distances among seabirds. Still, these eigenvectors do not include fine-scale differences between species (Diniz-Filho et al., 2012), however the inclusion of many phylogenetic eigenvectors would dilute the ecological information contained in the traits, and could lead to excessive noise (Diniz-Filho et al., 2012; Peres‐Neto & Legendre, 2010). Thus, including the first 10 phylogenetic eigenvectors reduces imputation error and ensures a balance between including detailed phylogenetic information and diluting the information contained in the other traits.
To quantify the average error in random forest predictions across the imputed datasets (out-of-bag error), we calculated the mean normalized root squared error and associated standard deviation across the 15 datasets for continuous traits (clutch size = 13.3 ± 0.35 %, generation length = 0.6 ± 0.02 %). For categorical data, we quantified the mean percentage of traits falsely classified (diet guild = 28.6 ± 0.97 %, foraging guild = 18.0 ± 1.05 %, pelagic specialism = 11.2 ± 0.66 %, migration status = 18.8 ± 0.58 %). Since body mass and habitat breadth have complete trait coverage, they did not require imputation. Low imputation accuracy is reflected in high out-of-bag error values where diet guild had the lowest imputation accuracy with 28.6% wrongly classified on average. Diet is generally difficult to predict (Gainsbury, Tallowin, & Meiri, 2018), potentially due to species’ high dietary plasticity (Gaglio, Cook, McInnes, Sherley, & Ryan, 2018) and/or the low phylogenetic conservatism of diet (Gainsbury et al., 2018). With this caveat in mind, we chose dietary guild, as more coarse dietary classifications are more
Not seeing a result you expected?
Learn how you can add new datasets to our index.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Sheet 1 (Raw-Data): The raw data of the study is provided, presenting the tagging results for the used measures described in the paper. For each subject, it includes multiple columns: A. a sequential student ID B an ID that defines a random group label and the notation C. the used notation: user Story or use Cases D. the case they were assigned to: IFA, Sim, or Hos E. the subject's exam grade (total points out of 100). Empty cells mean that the subject did not take the first exam F. a categorical representation of the grade L/M/H, where H is greater or equal to 80, M is between 65 included and 80 excluded, L otherwise G. the total number of classes in the student's conceptual model H. the total number of relationships in the student's conceptual model I. the total number of classes in the expert's conceptual model J. the total number of relationships in the expert's conceptual model K-O. the total number of encountered situations of alignment, wrong representation, system-oriented, omitted, missing (see tagging scheme below) P. the researchers' judgement on how well the derivation process explanation was explained by the student: well explained (a systematic mapping that can be easily reproduced), partially explained (vague indication of the mapping ), or not present.
Tagging scheme:
Aligned (AL) - A concept is represented as a class in both models, either
with the same name or using synonyms or clearly linkable names;
Wrongly represented (WR) - A class in the domain expert model is
incorrectly represented in the student model, either (i) via an attribute,
method, or relationship rather than class, or
(ii) using a generic term (e.g., user'' instead of
urban
planner'');
System-oriented (SO) - A class in CM-Stud that denotes a technical
implementation aspect, e.g., access control. Classes that represent legacy
system or the system under design (portal, simulator) are legitimate;
Omitted (OM) - A class in CM-Expert that does not appear in any way in
CM-Stud;
Missing (MI) - A class in CM-Stud that does not appear in any way in
CM-Expert.
All the calculations and information provided in the following sheets
originate from that raw data.
Sheet 2 (Descriptive-Stats): Shows a summary of statistics from the data collection,
including the number of subjects per case, per notation, per process derivation rigor category, and per exam grade category.
Sheet 3 (Size-Ratio):
The number of classes within the student model divided by the number of classes within the expert model is calculated (describing the size ratio). We provide box plots to allow a visual comparison of the shape of the distribution, its central value, and its variability for each group (by case, notation, process, and exam grade) . The primary focus in this study is on the number of classes. However, we also provided the size ratio for the number of relationships between student and expert model.
Sheet 4 (Overall):
Provides an overview of all subjects regarding the encountered situations, completeness, and correctness, respectively. Correctness is defined as the ratio of classes in a student model that is fully aligned with the classes in the corresponding expert model. It is calculated by dividing the number of aligned concepts (AL) by the sum of the number of aligned concepts (AL), omitted concepts (OM), system-oriented concepts (SO), and wrong representations (WR). Completeness on the other hand, is defined as the ratio of classes in a student model that are correctly or incorrectly represented over the number of classes in the expert model. Completeness is calculated by dividing the sum of aligned concepts (AL) and wrong representations (WR) by the sum of the number of aligned concepts (AL), wrong representations (WR) and omitted concepts (OM). The overview is complemented with general diverging stacked bar charts that illustrate correctness and completeness.
For sheet 4 as well as for the following four sheets, diverging stacked bar
charts are provided to visualize the effect of each of the independent and mediated variables. The charts are based on the relative numbers of encountered situations for each student. In addition, a "Buffer" is calculated witch solely serves the purpose of constructing the diverging stacked bar charts in Excel. Finally, at the bottom of each sheet, the significance (T-test) and effect size (Hedges' g) for both completeness and correctness are provided. Hedges' g was calculated with an online tool: https://www.psychometrica.de/effect_size.html. The independent and moderating variables can be found as follows:
Sheet 5 (By-Notation):
Model correctness and model completeness is compared by notation - UC, US.
Sheet 6 (By-Case):
Model correctness and model completeness is compared by case - SIM, HOS, IFA.
Sheet 7 (By-Process):
Model correctness and model completeness is compared by how well the derivation process is explained - well explained, partially explained, not present.
Sheet 8 (By-Grade):
Model correctness and model completeness is compared by the exam grades, converted to categorical values High, Low , and Medium.