Descriptive statistics of all metric predictor and outcome variables.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In the last decade, a plethora of algorithms has been developed for spatial ecology studies. In our case, we use some of these tools for underwater research in applied ecology, analyzing threatened endemic fishes and their natural habitat. For this, we developed scripts in the RStudio® environment to run spatial and statistical analyses for ecological response and spatial distribution models (e.g., Hijmans & Elith, 2017; Den Burg et al., 2020). The R packages employed are as follows: caret (Kuhn et al., 2020), corrplot (Wei & Simko, 2017), devtools (Wickham, 2015), dismo (Hijmans & Elith, 2017), gbm (Freund & Schapire, 1997; Friedman, 2002), ggplot2 (Wickham et al., 2019), lattice (Sarkar, 2008), maptools (Hijmans & Elith, 2017), ModelMetrics (Hvitfeldt & Silge, 2021), pander (Wickham, 2015), plyr (Wickham & Wickham, 2015), pROC (Robin et al., 2011), raster (Hijmans & Elith, 2017), RColorBrewer (Neuwirth, 2014), Rcpp (Eddelbuettel & Balamuta, 2018), rgdal (Verzani, 2011), sdm (Naimi & Araujo, 2016), sf (e.g., Zainuddin, 2023), sp (Pebesma, 2020) and usethis (Gladstone, 2022).
It is important to follow all the codes in order to obtain results from the ecological response and spatial distribution models. In particular, for the ecological scenario we selected the Generalized Linear Model (GLM), and for the geographic scenario we selected DOMAIN, also known as Gower's metric (Carpenter et al., 1993). We selected this regression method and this distance similarity metric because of their adequacy and robustness for studies of endemic or threatened species (e.g., Naoki et al., 2006). Next, we explain the statistical parameterization of the codes involved in running the GLM and DOMAIN:
First, we generated the background points and extracted the values of the variables (Code2_Extract_values_DWp_SC.R). Barbet-Massin et al. (2012) recommend using 10,000 background points with regression methods (e.g., Generalized Linear Model) or distance-based models (e.g., DOMAIN). However, we consider factors such as the extent of the study area and the type of study species important for correctly choosing the number of points (pers. obs.). We then extracted the values of the predictor variables (e.g., bioclimatic, topographic, demographic, habitat) at the presence and background points (e.g., Hijmans & Elith, 2017).
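The background-sampling and extraction step described above can be sketched with the document's own dismo and raster packages. This is a minimal illustration, not the authors' Code2: the predictor stack, layer names and presence points below are toy stand-ins generated on the fly.

```r
# Hedged sketch of background generation and value extraction (dismo/raster).
# 'preds' and 'presence' are toy stand-ins, not the authors' real data.
library(raster)
library(dismo)

set.seed(42)
preds <- stack(raster(nrows = 200, ncols = 200, vals = runif(40000)),
               raster(nrows = 200, ncols = 200, vals = runif(40000)))
names(preds) <- c("bio1", "elev")              # illustrative layer names
presence <- randomPoints(preds, 30)            # stand-in for real presences

bg <- randomPoints(preds, n = 10000)           # 10,000 background points

pres_vals <- data.frame(extract(preds, presence))  # values at presences
bg_vals   <- data.frame(extract(preds, bg))        # values at backgrounds

# One table with a presence (1) / background (0) indicator
sdm_data <- data.frame(pa = c(rep(1, nrow(pres_vals)), rep(0, nrow(bg_vals))),
                       rbind(pres_vals, bg_vals))
```

With real data, `preds` would be a RasterStack of the bioclimatic, topographic, demographic and habitat layers, and `presence` the georeferenced records.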
Subsequently, we subdivided both the presence and the background point groups into 75% training data and 25% test data, following Soberón & Nakamura (2009) and Hijmans & Elith (2017). For training control, the 10-fold cross-validation method was selected, with the response variable (presence) treated as a factor. If another variable is important for the study species, it should also be coded as a factor (Kim, 2009).
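The 75/25 split can be sketched in base R; `sdm_data` and its `pa` column (1 = presence, 0 = background) are illustrative names, not the scripts' actual objects.

```r
# Minimal base-R sketch of the 75% training / 25% test split described above.
# 'sdm_data' is a toy stand-in for the presence/background table.
set.seed(42)
sdm_data <- data.frame(pa = rep(c(1, 0), each = 100),
                       bio1 = rnorm(200), elev = rnorm(200))

train_idx <- sample(nrow(sdm_data), size = round(0.75 * nrow(sdm_data)))
train_set <- sdm_data[train_idx, ]             # 75% training data
test_set  <- sdm_data[-train_idx, ]            # 25% test data

# The response is handled as a factor for classification
train_set$pa <- factor(train_set$pa)
```

With caret, the training control described above is `trainControl(method = "cv", number = 10)`.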
After that, we ran the code for the GBM method (Gradient Boosting Machine; Code3_GBM_Relative_contribution.R and Code4_Relative_contribution.R), which yields the relative contribution of the variables used in the model. We parameterized the code with a Gaussian distribution and 5,000 iterations (e.g., Friedman, 2002; Kim, 2009; Hijmans & Elith, 2017). In addition, we selected a validation interval of 4 random training points (pers. obs.). The resulting plots show the partial dependence on each predictor variable.
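A hedged sketch of this step with the gbm package follows. The Gaussian distribution and 5,000 iterations come from the text; the data are toy stand-ins, and the "validation interval of 4 random training points" is read here, as an assumption, as 4-fold cross-validation (`cv.folds = 4`).

```r
# Sketch of the relative-contribution step with gbm; toy data, assumed names.
library(gbm)

set.seed(42)
train_set <- data.frame(bio1 = rnorm(200), elev = rnorm(200))
train_set$pa <- as.numeric(train_set$bio1 + rnorm(200, sd = 0.5) > 0)

gbm_fit <- gbm(pa ~ bio1 + elev,
               data = train_set,
               distribution = "gaussian",   # Gaussian distribution (see text)
               n.trees = 5000,              # 5,000 iterations (see text)
               cv.folds = 4)                # assumption: 4-fold validation

# Relative contribution (influence) of each predictor
rel_inf <- summary(gbm_fit)

# Partial dependence plot for one predictor
plot(gbm_fit, i.var = "bio1")
```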
Subsequently, the Pearson correlation among variables is computed (Code5_Pearson_Correlation.R) to evaluate multicollinearity (Guisan & Hofer, 2003). A bivariate correlation threshold of |r| ≥ 0.70 is recommended for discarding highly correlated variables (e.g., Awan et al., 2021).
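The screening above can be sketched in base R with toy predictor data; the |r| ≥ 0.70 cutoff follows the text, and all object names are illustrative.

```r
# Base-R sketch of the Pearson multicollinearity screen; toy data.
set.seed(42)
vals <- data.frame(bio1 = rnorm(100))
vals$bio2 <- vals$bio1 + rnorm(100, sd = 0.1)   # strongly correlated pair
vals$elev <- rnorm(100)                         # uncorrelated variable

cmat <- cor(vals, method = "pearson")

# Flag pairs at or above the cutoff (upper triangle avoids duplicates)
high <- which(abs(cmat) >= 0.70 & upper.tri(cmat), arr.ind = TRUE)
flagged <- data.frame(var1 = rownames(cmat)[high[, 1]],
                      var2 = colnames(cmat)[high[, 2]],
                      r = cmat[high])
```

With the document's packages, `corrplot::corrplot(cmat)` visualizes the matrix and `caret::findCorrelation(cmat, cutoff = 0.70)` automates the removal of one variable from each flagged pair.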
Once the above codes were run, we loaded the same subgroups (i.e., presence and background groups with 75% training and 25% test data) (Code6_Presence&backgrounds.R) into the GLM code (Code7_GLM_model.R). Here, we first ran a GLM per variable to obtain each variable's significance (p ≤ 0.05), with the value one (i.e., presence) as the modelled outcome. The models are of polynomial degree, capturing linear and quadratic responses (e.g., Fielding & Bell, 1997; Allouche et al., 2006). From these results, we generated ecological response curves, plotting the probability of occurrence against the values of continuous variables or the categories of discrete variables; the presence and background training points are also plotted.
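The per-variable polynomial GLM step can be sketched in base R as follows; the data, names and quadratic signal are invented for illustration, with presence (1) as the modelled outcome.

```r
# Base-R sketch of a per-variable polynomial (linear + quadratic) GLM.
set.seed(42)
train_set <- data.frame(bio1 = rnorm(200))
p_true <- plogis(1.5 * train_set$bio1 - 0.5 * train_set$bio1^2)  # toy signal
train_set$pa <- rbinom(200, 1, p_true)         # 1 = presence, 0 = background

fit <- glm(pa ~ poly(bio1, 2), data = train_set, family = binomial)
coefs <- summary(fit)$coefficients             # per-term tests (p <= 0.05)

# Ecological response curve: occurrence probability across the variable range
xs <- seq(min(train_set$bio1), max(train_set$bio1), length.out = 100)
curve_pred <- predict(fit, newdata = data.frame(bio1 = xs), type = "response")
```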
In addition, a global GLM was run, and the generalized model is evaluated by means of a 2 × 2 contingency matrix of observed versus predicted records, as shown in Table 1 (adapted from Allouche et al., 2006). In this process we selected an arbitrary threshold of 0.5 to obtain better modeling performance and to limit type I (omission) and type II (commission) errors (e.g., Carpenter et al., 1993; Fielding & Bell, 1997; Allouche et al., 2006; Kim, 2009; Hijmans & Elith, 2017).
Table 1. Example of 2 x 2 contingency matrix for calculating performance metrics for GLM models. A represents true presence records (true positives), B represents false presence records (false positives - error of commission), C represents true background points (true negatives) and D represents false backgrounds (false negatives - errors of omission).
              Validation set
Model         True          False
Presence      A             B
Background    C             D
We then calculated the Overall accuracy and True Skill Statistic (TSS) metrics. The first assesses the proportion of correctly predicted cases, while the second assesses the prevalence of correctly predicted cases (Olden & Jackson, 2002). The TSS also corrects for performance expected by chance, giving equal weight to the prediction of presences and of backgrounds (Fielding & Bell, 1997; Allouche et al., 2006).
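Using the Table 1 cells (A = true positives, B = false positives, C = true negatives, D = false negatives), the two metrics reduce to simple arithmetic; the counts below are invented for illustration.

```r
# Worked example of Overall accuracy and TSS from the 2 x 2 contingency
# matrix of Table 1; the cell counts are invented for illustration.
A <- 80; B <- 20; C <- 70; D <- 30

overall     <- (A + C) / (A + B + C + D)       # proportion correctly predicted
sensitivity <- A / (A + D)                     # true positive rate
specificity <- C / (C + B)                     # true negative rate
tss         <- sensitivity + specificity - 1   # Allouche et al. (2006)
```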
The last code (i.e., Code8_DOMAIN_SuitHab_model.R) is for species distribution modelling using the DOMAIN algorithm (Carpenter et al., 1993). Here, we loaded the variable stack and the presence and background groups, each subdivided into 75% training and 25% test data. Only the presence training subset and the predictor variable stack were included in the calculation of the DOMAIN metric, as well as in the evaluation and validation of the model.
Regarding the model evaluation and estimation, we selected the following estimators:
1) Partial ROC, which evaluates the separation between the curves of positive (i.e., correctly predicted presence) and negative (i.e., correctly predicted absence) cases. The farther apart these curves are, the better the model's prediction performance for the correct spatial distribution of the species (Manzanilla-Quiñones, 2020).
2) The ROC/AUC curve for model validation, where an optimal performance threshold is estimated to achieve an expected confidence of 75% to 99% probability (DeLong et al., 1988).
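The DOMAIN fitting and AUC-based validation can be sketched with dismo. The toy raster stack and point sets below are invented; `dismo::evaluate()` gives the standard ROC/AUC, while the partial ROC mentioned above would need a dedicated tool not shown here.

```r
# Hedged sketch of DOMAIN (Gower's metric) fitting and evaluation with dismo.
# Toy predictor stack and points; names are illustrative only.
library(raster)
library(dismo)

set.seed(42)
preds <- stack(raster(nrows = 20, ncols = 20, vals = runif(400)),
               raster(nrows = 20, ncols = 20, vals = runif(400)))
names(preds) <- c("bio1", "elev")
pres_train <- randomPoints(preds, 15)        # presence training subset
pres_test  <- randomPoints(preds, 5)         # presence test subset
bg_test    <- randomPoints(preds, 50)        # background test subset

dm <- domain(preds, pres_train)              # DOMAIN fit on presences only
suitability <- predict(dm, preds)            # Gower similarity surface

# ROC/AUC validation with the test points
ev <- evaluate(p = pres_test, a = bg_test, model = dm, x = preds)
ev@auc
```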
To validate the robustness of controls in each plate, QC metrics were computed for each variable using different controls depending on the biological readout (i.e., leishmanicidal activity or toxicity on host cells): C− (64 and 24 wells for P1–P4 and P5–P9, respectively)/C+ (n = 20) for the (Am) and (PV/HM) variables, and C− (n = 24)/C† (n = 20) for (VI). Note that the QC metrics for the VI variable were only available in P5–P9, where the C† control was included (NA: not applicable).
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
We studied vegetation metric robustness to environmental (season, interannual, and regional) and methodological (observer) variables, as well as adequate sample size for vegetation metrics across four regions of the United States.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Note: For each categorical variable, one category was chosen as a reference category (RC, e.g., RC = Social Sciences for the categorical variable discipline). For categorical variables, the effect of each predictor variable (a dummy variable representing one of the categories) is a regression coefficient (Coeff) that should be interpreted in relation to its standard error (SE) and the effect of the reference category. Variance components for level 1 are derived from the data, but variance components at levels 2 and 3 indicate the amount of variance that can be explained by differences between studies (level 3) and differences between single reliability coefficients nested within studies (level 2). The log-likelihood test provided by SAS PROC MIXED (−2LL) can be used to compare different models, as can the Bayesian Information Criterion (BIC). The smaller the BIC, the better the model.*p
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We identified and documented 137 datasets and databases on European biodiversity, ecosystem services, the drivers and pressures affecting them, and the mechanisms put in place to address these. These datasets represent nearly 2,000 variables and metrics that can be used directly by researchers, land managers and decision-makers, for example for spatial planning in conservation, or be further integrated into biodiversity and ecosystem services models.
This metadatabase and its associated tables support Deliverable 3.1 of the NaturaConnect Horizon Europe project (D3.1 Report and data on the biodiversity, protected areas and environmental and socioeconomic data available for the project, including data gap analysis).
Content
1. Typology.xlsx - Table presenting the typology used to classify and document the datasets and databases within the metadatabase. The typology used to classify those datasets and the variables and metrics within them is built on the DPSIR framework (Drivers, Pressures, State, Impact, Response), the Threats Classification Scheme (version 3.3) of the International Union for Conservation of Nature (IUCN), as well as the Essential Biodiversity Variables and Essential Ecosystem Services Variables frameworks (EBVs and EESVs respectively).
2. MetaDatabase.xlsx - MetaDatabase documenting the datasets and databases identified in the context of the NaturaConnect project. This metadatabase documents for each dataset or database:
General information on each entry, that is, its name and the corresponding component of the data typology (for instance, whether the data concern biodiversity or pressures on biodiversity). This section also documents the type of information or metrics contained in the entry and their units, as well as the realm (Terrestrial or Freshwater) covered by the data. In many cases an entry contains data on more than one variable or product, in which case we labelled it as "multiple" in the general information and list all individual metrics and their units in a separate table.
Biological information: if the entry relates to data on biodiversity or ecosystem services, this section is used to inform about the biological entity and taxonomic resolution of the data (e.g. species), the coverage of the biological entity (e.g. amphibians), and the coverage of Essential Variables (EBV or EESV – e.g. species traits).
Non-biological information: for entries that provide data on drivers, pressures or responses, we document the entity (e.g. type of pressure) and the coverage or scope of the entity.
Temporal information: we describe the temporal extent of each entry and their temporal resolution for those that are repeated measurements in time.
Spatial information: for spatially explicit entries, this section of the metadatabase documents the spatial scope (e.g. global, national), the spatial extent (e.g. EU28, Spain), and the spatial resolution of the data.
Method: for each entry, we document whether the data is modelled, interpreted or raw, as well as the dependencies with other datasets. Specifically, we identify if the data is also shared or used in another dataset (either documented in the metadatabase or not).
Accessibility: this last part of the metadatabase documents the links to (and references of) the data, and, when appropriate, the scientific publication accompanying them. We also keep track of the curator and contact person as well as the last update of the entry. This section is also used to document the data format (e.g. NetCDF, csv), licensing and whether the data can be accessed via an Application Programming Interface (API) or other tool.
3. DetailedMetrics.xlsx - Table containing all the metrics and variables from the datasets documented in the metadatabase. The metrics are mapped to the data typology, and when appropriate to the corresponding Essential Biodiversity Variable or Essential Ecosystem Service Variable. This table documents the name of the metric, or field, as given in the source material, its type (e.g. number, categorical, characters) and when appropriate, its unit. When the information is provided in the source material, the table also contains a definition of the metric as well as the different options given in the case of categorical data.
Method - Databases and Datasets identification
The entries of the metadatabase were identified through three main approaches.
First, a list of online catalogues and repositories was produced and scoped for relevant datasets or databases: European Environment Agency Datahub, European Environment Agency EIONET Central Data Repository, COPERNICUS Land Monitoring Service, Essential Biodiversity Variables - EBV data portal of the Group on Earth Observations Biodiversity Observation Network, Open Traits Network Catalogue, Open Environmental Data Cube Europe, NASA’s Earth Data, NASA’s SEDAC (Socioeconomic data and application center), Euro-Lex (access to European Union Law), JRC - ESDAC (European Soil Data Center), Database of European Vegetation Habitats and Flora, ESA (European Space Agency) Climate Office.
Second, a survey was sent out to all NaturaConnect consortium members in the third quarter of 2022 to identify both their needs and uses of data across the data typology. This allowed us to identify (and document) additional datasets either used or produced within the consortium.
Lastly, the research team occasionally added scientific publications of large-scale datasets, although it is important to highlight that this did not result from a systematic survey of the literature.
This study developed interval-level measurement scales for evaluating police officer performance during real or simulated deadly force situations. Through a two-day concept mapping focus group, statements were identified to describe two sets of dynamics: the difficulty (D) of a deadly force situation and the performance (P) of a police officer in that situation. These statements were then operationalized into measurable Likert-scale items that were scored by 291 use of force instructors from more than 100 agencies across the United States using an online survey instrument. The dataset resulting from this process contains a total of 685 variables, comprised of 312 difficulty statement items, 278 performance statement items, and 94 variables that measure the demographic characteristics of the scorers.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The aim of this study is to synthesise the strategies implemented in the teaching of the Decimal Metric System in the Escuela Nueva pedagogical model, through a systematic review of the literature. This literature review focuses on scientific productions found in high-impact indexed journals, endorsed by H-index. To this end, a qualitative methodology was used, following the methodological guidelines established by the Preferred Reporting Items for Systematic Reviews and Meta-Analyses. Inclusion criteria, exclusion criteria, bibliometric variables, and variables of interest on the content were taken into account. Likewise, search strategies based on Boolean operators and key terms were used to find research articles. One of the results obtained is the use of contexts that are close or real to the students, so that they can internalise the mathematical object addressed in this research.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The 30-year average values of the climate variables for the first day, and the length in days, of episodes that meet certain conditions. Ordered by climate variable.
The reliability and convergent validity metrics associated with the variables that govern consumer behavior in the realm of foreign films and TV series.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The 30-year average values of various climate variables per month, season or year. Ordered by climate variable.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Predictor variables, sources, names and metrics for each county.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is compiled from the program and output data used in Nishizawa (2024).
Nishizawa, 2024: Extracting latent variables from forecast ensembles and advancements in similarity metric utilizing optimal transport. submitted to JGR: Machine Learning and Computation.
This dataset is a compilation of 13 thermal response metrics for 834 freshwater fish species across the conterminous United States (CONUS). The data were extracted from six published sources, many of which are compilations of data from other sources. The data were harmonized for comparison, and additional variables were added to summarize the metrics. The dataset is presented as a spreadsheet containing 17 sheets. The first sheet (datasets) describes the data sources. Other sheets describe the source and compilation variables in detail.
Dynamic community metrics definitions for a group and an individual. Variables used in the analysis here are italicized.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Selecting from or ranking a set of candidates variables in terms of their capacity for predicting an outcome of interest is an important task in many scientific fields. A variety of methods for variable selection and ranking have been proposed in the literature. In practice, it can be challenging to know which method is most appropriate for a given dataset. In this article, we propose methods of comparing variable selection and ranking algorithms. We first introduce measures of the quality of variable selection and ranking algorithms. We then define estimators of our proposed measures, and establish asymptotic results for our estimators. We use our results to conduct large-sample inference for our measures, and we propose a computationally efficient partial bootstrap procedure to potentially improve finite-sample inference. We assess the properties of our proposed methods using numerical studies, and we illustrate our methods with an analysis of data for predicting wine quality from its physicochemical properties.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The 30-year average values of various climate variables per 10-day period. Ordered by climate variable. Traditionally, each month is divided into two 10-day periods plus the remainder of the month.
We extracted the variables at a resolution of 20 meters, so that the grid cell area is approximately equal to the sample plot area. The basal area prediction is based on a linear regression using field data and LiDAR metrics.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Previous research shows that class size can influence the associations between object-oriented (OO) metrics and fault-proneness and therefore proposes that it should be controlled as a confounding variable when validating OO metrics on fault-proneness; otherwise, their true associations may be distorted. However, it has not been determined whether this practice is equally applicable to other external quality attributes. In this paper, we use three size metrics, two of which are available during the high-level design phase, to examine the potentially confounding effect of class size on the associations between OO metrics and change-proneness. The OO metrics that are investigated include cohesion, coupling, and inheritance metrics. Our results, based on Eclipse, indicate that (1) the confounding effect of class size on the associations between OO metrics and change-proneness generally exists, regardless of which size metric is used; that (2) the confounding effect of class size generally leads to an overestimate of the associations between OO metrics and change-proneness; and that (3) for many OO metrics, the confounding effect of class size completely accounts for their associations with change-proneness or changes the direction of the associations. These results strongly suggest that studies validating OO metrics on change-proneness should also consider class size as a confounding variable.
Population metrics are provided at the census tract, planning district, and citywide levels of geography. You can find related vital statistics tables that contain aggregate metrics on natality (births) and mortality (deaths) of Philadelphia residents as well as social determinants of health metrics at the city and planning district levels of geography. Please refer to the metadata links below for variable definitions and the technical notes document to access detailed technical notes and variable definitions.