Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Model transformation languages are special-purpose languages designed to make defining transformations as convenient as possible, often in a declarative way. With the increasing use of transformations in various domains, the complexity and size of input models are also increasing. However, developers often lack suitable models for performance testing. We have therefore conducted experiments in which we predict the performance of model transformations based on characteristics of input models using machine learning approaches. This dataset contains our raw and processed input data, the scripts necessary to repeat our experiments, and the results we obtained.
Our input data consists of the time measurements for six different transformations defined in the Atlas Transformation Language (ATL), as well as the collected characteristics of the real-world input models that were transformed. We provide the script that implements our experiments. We predict the execution time of ATL transformations using the machine learning approaches linear regression, random forests, and support vector regression with a radial basis function kernel. We also investigate different sets of characteristics of input models as input for the machine learning approaches. These are described in detail in the provided documentation.pdf. The results of the experiments are provided as raw data in individual csv files. Additionally, we calculated the mean absolute percentage error (in %) and the 95th percentile of the absolute percentage error (in %) for each experiment and provide these results. Furthermore, we provide our Eclipse plugin, which collects the characteristics for a set of given models, the Java projects used to measure the execution time of the transformations, and other supporting scripts, e.g. for the analysis of the results.
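As an illustration of this setup, here is a minimal R sketch that fits the three learners named above and reports the two error metrics; the file name measurements.csv and the column exec_time are placeholders, not names used in the dataset.

```r
# Hedged sketch: predict transformation execution time from model characteristics
# and report MAPE and the 95th percentile of the absolute percentage error.
library(randomForest)
library(e1071)

runs <- read.csv("measurements.csv")           # placeholder: one row per measured run
train_idx <- sample(nrow(runs), 0.8 * nrow(runs))
train <- runs[train_idx, ]
test  <- runs[-train_idx, ]

fits <- list(
  lm  = lm(exec_time ~ ., data = train),
  rf  = randomForest(exec_time ~ ., data = train),
  svr = svm(exec_time ~ ., data = train, kernel = "radial")   # RBF kernel
)

ape <- function(actual, predicted) abs(actual - predicted) / actual * 100

for (name in names(fits)) {
  e <- ape(test$exec_time, predict(fits[[name]], newdata = test))
  cat(name, "- MAPE:", round(mean(e), 2), "% | 95th percentile APE:",
      round(quantile(e, 0.95), 2), "%\n")
}
```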
A short introduction with a quick start guide can be found in README.md and detailed documentation in documentation.pdf.
This data package is associated with the publication “Prediction of Distributed River Sediment Respiration Rates using Community-Generated Data and Machine Learning” submitted to the Journal of Geophysical Research: Machine Learning and Computation (Scheibe et al. 2024). River sediment respiration observations are expensive and labor-intensive to obtain, and there is no physical model for predicting this quantity. The Worldwide Hydrobiogeochemistry Observation Network for Dynamic River Systems (WHONDRS) observational data set (Goldman et al., 2020) is used to train machine learning (ML) models to predict respiration rates at unsampled sites. This repository archives training data, ML models, predictions, and model evaluation results for the purposes of reproducibility of the results in the associated manuscript and community reuse of the ML models trained in this project.

One of the key challenges in this work was to find an optimum configuration for machine learning models to work with this feature-rich (i.e., 100+ possible input variables) data set. Here, we used a two-tiered approach to managing the analysis of this complex data set: 1) a stacked ensemble of ML models that can automatically optimize hyperparameters to accelerate the process of model selection and tuning, and 2) feature permutation importance to iteratively select the most important features (i.e., inputs) to the ML models. The major elements of this ML workflow are modular, portable, open, and cloud-based, thus making this implementation a potential template for other applications.

This data package is associated with the GitHub repository found at https://github.com/parallelworks/sl-archive-whondrs. A static copy of the GitHub repository is included in this data package as an archived version at the time of publishing this data package (March 2023). However, we recommend accessing these files via GitHub for full functionality. Please see the file-level metadata (flmd; “sl-archive-whondrs_flmd.csv”) for a list of all files contained in this data package and descriptions for each. Please see the data dictionary (dd; “sl-archive-whondrs_dd.csv”) for a list of all column headers contained within comma-separated value (csv) files in this data package and descriptions for each.

The GitHub repository is organized into five top-level directories: (1) “input_data” holds the training data for the ML models; (2) “ml_models” holds machine learning models trained on the data in “input_data”; (3) “scripts” contains data preprocessing and postprocessing scripts and intermediate results specific to this data set that bookend the ML workflow; (4) “examples” contains the visualization of the results in this repository, including plotting scripts for the manuscript (e.g., model evaluation, FPI results) and scripts for running predictions with the ML models (i.e., reusing the trained ML models); (5) “output_data” holds the overall results of the ML model on that branch. Each trained ML model resides on its own branch in the repository; this means that inputs and outputs can differ branch-to-branch. Furthermore, depending on the number of features used to train the ML models, the preprocessing and postprocessing scripts, and their intermediate results, can also differ branch-to-branch. The “main-*” branches are meant to be starting points (i.e., trunks) for each model branch (i.e., sprouts). Please see the Branch Navigation section in the top-level README.md in the GitHub repository for more details.
There is also one hidden directory, “.github/workflows”. This hidden directory contains information on how to run the ML workflow as an end-to-end automated GitHub Action, but it is not needed for reusing the ML models archived here. Please see the top-level README.md in the GitHub repository for more details on the automation.
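To make the second tier of the workflow concrete, below is a minimal, generic R sketch of feature permutation importance (FPI): each feature is shuffled in turn and the resulting increase in prediction error is taken as its importance. The model, data frame, and column names here are placeholders; the archived stacked-ensemble models and their exact FPI procedure live in the GitHub repository.

```r
# Hedged sketch of feature permutation importance (FPI) with a stand-in model.
library(randomForest)

permutation_importance <- function(fit, dat, response, n_repeats = 10) {
  rmse     <- function(obs, pred) sqrt(mean((obs - pred)^2))
  baseline <- rmse(dat[[response]], predict(fit, dat))
  features <- setdiff(names(dat), response)
  sapply(features, function(f) {
    mean(replicate(n_repeats, {
      shuffled      <- dat
      shuffled[[f]] <- sample(shuffled[[f]])   # break the feature-response link
      rmse(dat[[response]], predict(fit, shuffled)) - baseline
    }))
  })
}

# Example usage with placeholder names:
# fit <- randomForest(respiration ~ ., data = whondrs_training)
# sort(permutation_importance(fit, whondrs_training, "respiration"), decreasing = TRUE)
```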
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In the last decade, a plethora of algorithms have been developed for spatial ecology studies. In our case, we use some of these codes in underwater research on the applied ecology of threatened endemic fishes and their natural habitat. For this, we developed scripts in the RStudio® environment to run spatial and statistical analyses for ecological response and spatial distribution models (e.g., Hijmans & Elith, 2017; Den Burg et al., 2020). The employed R packages are as follows: caret (Kuhn et al., 2020), corrplot (Wei & Simko, 2017), devtools (Wickham, 2015), dismo (Hijmans & Elith, 2017), gbm (Freund & Schapire, 1997; Friedman, 2002), ggplot2 (Wickham et al., 2019), lattice (Sarkar, 2008), lattice (Musa & Mansor, 2021), maptools (Hijmans & Elith, 2017), ModelMetrics (Hvitfeldt & Silge, 2021), pander (Wickham, 2015), plyr (Wickham & Wickham, 2015), pROC (Robin et al., 2011), raster (Hijmans & Elith, 2017), RColorBrewer (Neuwirth, 2014), Rcpp (Eddelbuettel & Balamuta, 2018), rgdal (Verzani, 2011), sdm (Naimi & Araujo, 2016), sf (e.g., Zainuddin, 2023), sp (Pebesma, 2020) and usethis (Gladstone, 2022).
It is important to follow all the codes in order to obtain results from the ecological response and spatial distribution models. In particular, for the ecological scenario we selected the Generalized Linear Model (GLM), and for the geographic scenario we selected DOMAIN, also known as Gower's metric (Carpenter et al., 1993). We selected this regression method and this distance-similarity metric because of their adequacy and robustness for studies with endemic or threatened species (e.g., Naoki et al., 2006). Next, we explain the statistical parameterization of the code used to run the GLM and DOMAIN models:
In the first instance, we generated the background points and extracted the values of the variables (Code2_Extract_values_DWp_SC.R). Barbet-Massin et al. (2012) recommend the use of 10,000 background points when using regression methods (e.g., Generalized Linear Model) or distance-based models (e.g., DOMAIN). However, we considered factors such as the extent of the area and the type of study species to be important for the correct selection of the number of points (pers. obs.). We then extracted the values of the predictor variables (e.g., bioclimatic, topographic, demographic, habitat) at the presence and background points (e.g., Hijmans and Elith, 2017).
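A minimal R sketch of this step (not the archived Code2_Extract_values_DWp_SC.R itself): generating background points with dismo and extracting predictor values at presence and background locations. File, folder, and column names are assumptions.

```r
# Hedged sketch: background point generation and value extraction.
library(dismo)
library(raster)

predictors <- stack(list.files("predictors", pattern = "\\.tif$", full.names = TRUE))
presence   <- read.csv("presence_points.csv")        # assumed columns: lon, lat

set.seed(1)
background <- randomPoints(predictors, n = 10000)    # cf. Barbet-Massin et al. (2012)

pres_vals <- extract(predictors, presence[, c("lon", "lat")])
back_vals <- extract(predictors, background)

model_data <- rbind(
  data.frame(presence = 1, pres_vals),
  data.frame(presence = 0, back_vals)
)
```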
Subsequently, we subdivided both the presence and background point groups into 75% training data and 25% test data, following the method of Soberón & Nakamura (2009) and Hijmans & Elith (2017). For training control, the 10-fold cross-validation method was selected, with the response variable presence assigned as a factor. If some other variable is important for the study species, it should also be assigned as a factor (Kim, 2009).
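For example, the split and the training control could look as follows with caret, reusing the model_data object assumed in the previous sketch:

```r
# Hedged sketch: 75/25 partition and 10-fold cross-validation control.
library(caret)

model_data$presence <- factor(model_data$presence, labels = c("background", "presence"))

set.seed(1)
in_train  <- createDataPartition(model_data$presence, p = 0.75, list = FALSE)
train_set <- model_data[in_train, ]
test_set  <- model_data[-in_train, ]

ctrl <- trainControl(method = "cv", number = 10)     # 10-fold cross-validation
```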
After that, we ran the code for the GBM method (Gradient Boosting Machine; Code3_GBM_Relative_contribution.R and Code4_Relative_contribution.R), from which we obtained the relative contribution of the variables used in the model. We parameterized the code with a Gaussian distribution and 5,000 iterations (e.g., Friedman, 2002; Kim, 2009; Hijmans and Elith, 2017). In addition, we selected a validation interval of 4 random training points (personal test). The resulting plots were the partial dependence blocks, as a function of each predictor variable.
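A condensed R sketch of this step with the gbm package, under the parameterisation described above (Gaussian distribution, 5,000 iterations); the cv.folds value and the train_set object are assumptions carried over from the previous sketches.

```r
# Hedged sketch: GBM relative contribution and partial dependence plots.
library(gbm)

train_gbm <- train_set
train_gbm$presence <- as.numeric(train_gbm$presence == "presence")   # 0/1 response

gbm_fit <- gbm(
  presence ~ .,
  data         = train_gbm,
  distribution = "gaussian",
  n.trees      = 5000,
  cv.folds     = 4            # assumption, mirroring the 4-point validation mentioned above
)

rel_contrib <- summary(gbm_fit)   # relative contribution of each predictor
plot(gbm_fit, i.var = 1)          # partial dependence on the first predictor
```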
Subsequently, the correlation between variables was computed using Pearson's method (Code5_Pearson_Correlation.R) to evaluate multicollinearity (Guisan & Hofer, 2003). It is recommended to use a bivariate correlation threshold of ±0.70 to discard highly correlated variables (e.g., Awan et al., 2021).
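In R, this check can be sketched as follows (corrplot for visualisation and caret::findCorrelation for the |r| ≥ 0.70 screen); object names follow the earlier sketches.

```r
# Hedged sketch: Pearson correlation matrix and collinearity screening.
library(corrplot)
library(caret)

pred_cols <- setdiff(names(train_set), "presence")
cor_mat   <- cor(train_set[, pred_cols], method = "pearson", use = "complete.obs")

corrplot(cor_mat, method = "number")

to_drop <- findCorrelation(cor_mat, cutoff = 0.70, names = TRUE)  # candidates to discard
```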
Once the above codes were run, we loaded the same subgroups (i.e., presence and background groups with 75% training and 25% testing) (Code6_Presence&backgrounds.R) for the GLM method code (Code7_GLM_model.R). Here, we first ran the GLM models per variable to obtain the p-value of each variable (alpha ≤ 0.05); we selected the value one (i.e., presence) as the likelihood factor. The generated models are of polynomial degree two, to obtain linear and quadratic responses (e.g., Fielding and Bell, 1997; Allouche et al., 2006). From these results, we ran ecological response curve models, where the resulting plots include the probability of occurrence and the values for continuous variables or the categories for discrete variables. The points of the presence and background training groups are also included.
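A minimal R sketch of the per-variable GLMs and a response curve (not the archived Code7_GLM_model.R itself); train_set is the assumed object from the earlier sketches.

```r
# Hedged sketch: per-variable binomial GLMs with linear and quadratic terms,
# followed by an ecological response curve for one predictor.
pred_cols <- setdiff(names(train_set), "presence")

single_var_glms <- lapply(pred_cols, function(v) {
  glm(reformulate(sprintf("poly(%s, 2)", v), response = "presence"),
      data = train_set, family = binomial)
})
names(single_var_glms) <- pred_cols

v   <- pred_cols[1]
new <- setNames(data.frame(seq(min(train_set[[v]]), max(train_set[[v]]),
                               length.out = 200)), v)
plot(new[[v]], predict(single_var_glms[[v]], newdata = new, type = "response"),
     type = "l", xlab = v, ylab = "Probability of occurrence")
```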
On the other hand, a global GLM was also run, and the generalized model was evaluated by means of a 2 x 2 contingency matrix including both observed and predicted records. A representation of this is shown in Table 1 (adapted from Allouche et al., 2006). In this process we selected a threshold of 0.5 to obtain better modeling performance and to avoid a high percentage of type I (omission) or type II (commission) errors (e.g., Carpenter et al., 1993; Fielding and Bell, 1997; Allouche et al., 2006; Kim, 2009; Hijmans and Elith, 2017).
Table 1. Example of 2 x 2 contingency matrix for calculating performance metrics for GLM models. A represents true presence records (true positives), B represents false presence records (false positives - error of commission), C represents true background points (true negatives) and D represents false backgrounds (false negatives - errors of omission).
             Validation set
Model        True         False
Presence     A            B
Background   C            D
We then calculated the overall accuracy and the True Skill Statistic (TSS). The first assesses the proportion of correctly predicted cases, while the second assesses the prevalence of correctly predicted cases (Olden and Jackson, 2002); the TSS also gives equal weight to the prevalence of presence predictions and to the correction for random performance (Fielding and Bell, 1997; Allouche et al., 2006).
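Using the letters of Table 1 (A = true presences, B = false presences, C = true backgrounds, D = false backgrounds) and the 0.5 threshold chosen above, the two metrics can be sketched in R as follows; global_glm, train_set, and test_set are assumed objects from the earlier sketches.

```r
# Hedged sketch: overall accuracy and True Skill Statistic from the 2 x 2 matrix.
global_glm <- glm(presence ~ ., data = train_set, family = binomial)

prob <- predict(global_glm, newdata = test_set, type = "response")
pred <- prob >= 0.5
obs  <- test_set$presence == "presence"

A <- sum(pred & obs);   B <- sum(pred & !obs)    # true / false presences
C <- sum(!pred & !obs); D <- sum(!pred & obs)    # true / false backgrounds

overall     <- (A + C) / (A + B + C + D)         # proportion correctly predicted
sensitivity <- A / (A + D)
specificity <- C / (C + B)
tss         <- sensitivity + specificity - 1     # Allouche et al. (2006)
```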
The last code (i.e., Code8_DOMAIN_SuitHab_model.R) is for species distribution modelling using the DOMAIN algorithm (Carpenter et al., 1993). Here, we loaded the variable stack and the presence and background groups, each subdivided into 75% training and 25% test data. Only the presence training subset and the predictor variable stack were included in the calculation of the DOMAIN metric, as well as in the evaluation and validation of the model.
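A minimal dismo sketch of this step (not the archived Code8_DOMAIN_SuitHab_model.R): the DOMAIN model is fitted with the presence training coordinates and the predictor stack, predicted over the stack, and evaluated against the withheld test points. The coordinate objects are assumptions.

```r
# Hedged sketch: DOMAIN (Gower similarity) suitability model with dismo.
library(dismo)
library(raster)

dom <- domain(predictors, pres_train_xy)          # presence training coordinates only

suitability <- predict(predictors, dom)           # Gower-similarity surface

eval_dom <- evaluate(p = pres_test_xy, a = back_test_xy, model = dom, x = predictors)
eval_dom                                          # AUC and related statistics
```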
Regarding the model evaluation and estimation, we selected the following estimators:
1) partial ROC, which evaluates the separation between the curves of positive (i.e., correctly predicted presence) and negative (i.e., correctly predicted absence) cases. The farther apart these curves are, the better the model's prediction performance for the correct spatial distribution of the species (Manzanilla-Quiñones, 2020).
2) the ROC/AUC curve for model validation, where an optimal performance threshold is estimated to achieve an expected confidence of 75% to 99% (DeLong et al., 1988).
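For the ROC/AUC part, a small pROC sketch could look like this, scoring the DOMAIN suitability surface at the withheld presence and background points (object names are assumptions; partial ROC itself is usually computed with dedicated tools rather than pROC):

```r
# Hedged sketch: ROC/AUC validation and an optimal threshold via Youden's J.
library(pROC)
library(raster)

scores <- c(extract(suitability, pres_test_xy), extract(suitability, back_test_xy))
labels <- c(rep(1, nrow(pres_test_xy)), rep(0, nrow(back_test_xy)))

roc_obj <- roc(labels, scores)
auc(roc_obj)
coords(roc_obj, x = "best", best.method = "youden")  # threshold, specificity, sensitivity
```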
Global change is impacting biodiversity across all habitats on earth. New selection pressures from changing climatic conditions and other anthropogenic activities are creating heterogeneous ecological and evolutionary responses across many species' geographic ranges. Yet we currently lack standardised and reproducible tools to effectively predict the resulting patterns in species vulnerability to declines or range changes. We developed an informatic toolbox that integrates ecological, environmental and genomic data and analyses (environmental dissimilarity, species distribution models, landscape connectivity, neutral and adaptive genetic diversity, genotype-environment associations and genomic offset) to estimate population vulnerability. In our toolbox, functions and data structures are coded in a standardised way so that it is applicable to any species or geographic region where appropriate data are available, for example individual or population sampling and genomic datasets (e.g. RA...

Raw sequence data is available at the European Nucleotide Archive (ENA): Myotis escalerai and M. crypticus (PRJEB29086), and the NCBI Short Read Archive (SRA): Afrixalus fornasini (SRP150605). Input data (processed genomic data and spatial-environmental data prior to running the toolbox) are available as part of this repository.

Methods: see the methods text of the manuscript and the tutorials for setting up and running the LotE toolbox: https://cd-barratt.github.io/Life_on_the_edge.github.io/Vignette

This software is intended for HPC use. Please make sure the software below is installed and functional in your HPC environment before proceeding:
Life on the edge data and scripts (also available here: https://github.com/cd-barratt/Life_on_the_edge)
Singularity (3.5) and a Bioconductor container with the correct R version: https://cloud.sylabs.io/library/sinwood/bioconductor/bioconductor_3.14
R (4.1.3). Dependencies for the toolbox are installed within the R version in the Singularity container upon setup (you specify your R libraries in the script where annotated)
Julia (1.7.2)
Additionally, you need to download the following and place it in the correct directories to be sure the toolbox will function properly:
* Environmental predictor data - please download and place the environmental layers used for SDMs, GEAs etc. in separate folders for current and future environmental conditions. These f...
# Life on the edge: a new toolbox for population-level climate change vulnerability assessments
The dataset contains input files needed to run Life on the edge for an example dataset (Afrixalus fornasini). You may run data for your focal species following the structure and content of the example files provided.
First you need to download the following and place in the correct directories to be sure the toolbox will function properly:
Full setup and how to run the LotE toolbox - https://cd-barratt.github.io/Life_on_the_edge.github.io/Vignette
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background and objective: Multistate models, which allow the prediction of complex multistate survival processes such as multimorbidity, or recovery, relapse and death following treatment for cancer, are being used for clinical prediction. It is paramount to evaluate the calibration (as well as other metrics) of a risk prediction model before implementation of the model. While there are a number of software applications available for developing multistate models, currently no software exists to aid in assessing the calibration of a multistate model, and as a result evaluation of model performance is uncommon. calibmsm has been developed to fill this gap.

Methods: Assessing the calibration of predicted transition probabilities between any two states is made possible through three approaches. The first two utilise calibration techniques for binary and multinomial logistic regression models in combination with inverse probability of censoring weights, whereas the third utilises pseudo-values. All methods are implemented in conjunction with landmarking to allow calibration assessment of predictions made at any time beyond the start of follow-up. This study focuses on calibration curves, but the methodological framework also allows estimation of calibration slopes and intercepts.

Results: This article serves as a guide on how to use calibmsm to assess the calibration of any multistate model, via a comprehensive example evaluating a model developed to predict recovery, adverse events, relapse and survival in patients with blood cancer after a transplantation. The calibration plots indicate that predictions of relapse made at the time of transplant are poorly calibrated; however, predictions of death are well calibrated. The calibration of all predictions made at 100 days post transplant appears to be poor, although a larger validation sample is required to draw stronger conclusions.

Conclusions: calibmsm is an R package which allows users to assess the calibration of predicted transition probabilities from a multistate model. Evaluation of model performance is a key step in the pathway to model implementation, yet evaluation of the performance of predictions from multistate models is not common. We hope the availability of this software will help model developers evaluate the calibration of models being developed.
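For orientation, the core idea behind the BLR-IPCW approach can be sketched in a few lines of generic R (this is not the calibmsm API; the package wraps these steps together with landmarking and the multistate bookkeeping). Column names p_k, in_k, and ipcw are placeholders for the predicted probability of occupying state k, the observed occupancy indicator, and the inverse probability of censoring weight.

```r
# Hedged sketch: a weighted logistic calibration curve for one transition probability.
library(splines)

# dat: one row per landmarked individual (placeholder object)
calib_fit <- glm(in_k ~ ns(qlogis(p_k), df = 3), data = dat,
                 family = quasibinomial, weights = ipcw)   # quasibinomial: non-integer weights

dat$obs_k <- predict(calib_fit, type = "response")
ord <- order(dat$p_k)
plot(dat$p_k[ord], dat$obs_k[ord], type = "l",
     xlab = "Predicted probability", ylab = "Observed probability")
abline(0, 1, lty = 2)                                      # perfect calibration
```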
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary of heritability and prediction performance in rats. The table shows the number of rats used in the prediction, number of genes predicted per model (R2>0.01), the average prediction performance R2 (after filtering R2
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description:
This replication package contains all datasets and scripts used in the study From App Features to Explanation Needs: Analyzing Correlations and Predictive Potential. The study investigates the relationships between app features and users' explanation needs, combining correlation analysis and predictive modeling techniques such as logistic regression. The dataset comprises 4,495 user reviews from the Google Play Store and the Apple App Store, each annotated for explanation needs and enriched with detailed app metadata (e.g., genre, ratings, age restriction, in-app purchases, and more). The original annotated gold-standard dataset is available here: https://doi.org/10.5281/zenodo.11522828
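As a rough illustration of the predictive-modelling part, a logistic regression relating the explanation-need label to app metadata could be sketched in R as below; the file and column names (reviews_with_metadata.csv, need_explanation, genre, rating, age_restriction, in_app_purchases) are assumptions, not the package's actual schema.

```r
# Hedged sketch: logistic regression of explanation needs on app features.
reviews <- read.csv("reviews_with_metadata.csv")

fit <- glm(need_explanation ~ genre + rating + age_restriction + in_app_purchases,
           data = reviews, family = binomial)

summary(fit)      # which app features are associated with explanation needs
exp(coef(fit))    # odds ratios for easier interpretation
```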
The package includes:
A detailed README file is provided, explaining the folder structure, dataset contents, and the purpose of each script to ensure reproducibility.
This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Grant No.: 470146331, project softXplain (2022–2025).
If you use this resource, please cite the following publication:
Obaidi, M., Qengaj, K., Droste, J., Deters, H., Herrmann, M., Klünder, J., Schmid, E., Schneider, K. (2025). From App Features to Explanation Needs: Analyzing Correlations and Predictive Potential. 2025 IEEE 33rd International Requirements Engineering Workshop (REW).
Unless otherwise stated, this dataset and all associated resources are provided under the Creative Commons Attribution 4.0 International License (CC BY 4.0).
For questions or further information, please contact Martin Obaidi (martin.obaidi@inf.uni-hannover.de) or the corresponding authors listed in the publication.
Lake temperature is an important environmental metric for understanding habitat suitability for many freshwater species and is especially useful when temperatures are predicted throughout the water column (known as temperature profiles). In this data release, multiple modeling approaches were used to generate predictions of daily temperature profiles for thousands of lakes in the Midwest.
Predictions were generated using two modeling frameworks: a machine learning model (specifically an entity-aware long short-term memory or EA-LSTM model; Kratzert et al., 2019) and a process-based model (specifically the General Lake Model or GLM; Hipsey et al., 2019). Both the EA-LSTM and GLM frameworks were used to generate lake temperature predictions in the contemporary period (1979-04-12 to 2022-04-11 for EA-LSTM and 1980-01-01 to 2021-12-31 for GLM; times differ due to modeling spin-up/spin-down configurations) using the North American Land Data Assimilation System [NLDAS; Mitchell et al., 2004] as meteorological drivers. In addition, GLM was used to generate lake temperature predictions under future climate scenarios (covering 1981-2000, 2040-2059, and 2080-2099) using six dynamically downscaled Global Climate Models (GCM; Notaro et al., 2018) as meteorological drivers. Appropriate application of the six GCMs is dependent on the use-case and will be up to the user to determine. For an example of a similar analysis in the Midwest and Great Lakes region using 31 GCMs, see Byun and Hamlet, 2018.
The modeling frameworks and driver datasets have slightly different footprints and input data requirements. This means that some of the lakes do not meet the criteria to be included in all three modeling approaches, which results in different numbers of lakes in the output (noted in the file descriptions below). The input data requirements for lakes to be included in the EA-LSTM predictions are lake latitude, longitude, elevation, and surface area, plus NLDAS drivers at the lake's location. All 62,966 lakes included in this data release met these requirements. The input data requirements for lakes to be included in the contemporary GLM NLDAS-driven predictions are lake location (within one of the following 11 states: North Dakota, South Dakota, Iowa, Michigan, Indiana, Illinois, Wisconsin, Minnesota, Missouri, Arkansas, and Ohio), latitude, longitude, maximum depth (though more detailed hypsography was used where available), surface area, and a clarity estimate, plus NLDAS drivers at the lake's location. 12,688 lakes included in this data release met these requirements. The input data requirements for lakes to be included in the future climate scenario GCM-driven predictions were the same as for the contemporary GLM predictions, except GCM drivers at the lake's location were required in place of NLDAS drivers. 11,715 lakes included in this data release met these requirements.
This data release includes the following files:
This work was completed with funding support from the Midwest Climate Adaptation Science Center (MW CASC) and as part of the USGS project on Predictive Understanding of Multiscale Processes (PUMP), an element of the Integrated Water Prediction Program, supported by the Water Availability and Use Science Program to advance multi-scale, integrated modeling capabilities to address water resource issues. Access to computing facilities was provided by USGS Advanced Research Computing, USGS Tallgrass Supercomputer (doi.org/10.5066/F7D798MJ).
Open Government Licence - Canada 2.0https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
Description: Data on recreational boating are needed for marine spatial planning initiatives in British Columbia (BC). Vessel traffic data are typically obtained by analyzing automatic identification system (AIS) vessel tracking data, but recreational vessels are often omitted or underrepresented in AIS data because they are not required to carry AIS tracking devices. Transport Canada's National Aerial Surveillance Program (NASP) conducted aerial surveys to collect information on recreational vessels along several sections of the BC coast between 2018 and 2022. Recreational vessel sightings were modeled against predictor variables (e.g., distance to shore, water depth, and distance to and density of marinas) to predict the number of recreational vessels along coastal waters of BC.

The files included here are:
--A Geodatabase (‘Recreational_Boating_Data_Model’), which includes: (1) recreational vessel sightings data collected by NASP in BC and used in the recreational vessel traffic model (‘Recreational_Vessels_PointData_BC’); (2) an aerial survey effort (number of aerial surveys) raster dataset (‘surveyeffort’); and (3) a vector grid dataset (2.5 km resolution) containing the predicted number of recreational vessels per cell and the predictor variables (‘Recreational_Boating_Model_Results_BC’).
--A Scripts folder, which includes an R Markdown file with R code to run the modelling analysis (‘Recreational_Boating_Model_R_Script’) and the data used to run the code.

Methods: Data on recreational vessels were collected by NASP during planned aerial surveys along pre-determined routes along the BC coast from 2018 to 2022. Data on non-AIS recreational vessels were collected using video cameras onboard the aircraft, and data on AIS recreational vessels using an AIS receiver also onboard the aircraft. Recreational boating predictors explored were: water depth, distance to shore, distance to marinas, density of marinas, latitude, and longitude. Recreational vessel traffic models were fitted using Generalized Linear Models (GLM). R packages and libraries used here include AED (Roman Lustrik, 2021), MASS (Venables & Ripley, 2002), and pscl (Zeileis, Kleiber, and Jackman, 2008) for the zeroinfl() and hurdle() functions. The final model was selected based on Akaike's information criterion (AIC) and the Bayesian information criterion (BIC). An R Markdown file with the code used to run this analysis is included in the data package in a folder called Script.

Spatial Predictive Model: The selected model, ZINB, consists of two parts: one with a binomial process that predicts the probability of encountering a recreational vessel, and a second part that predicts the number of recreational vessels via a count model. The closer to shore and to marinas, and the higher the density of marinas, the higher the predicted number of recreational vessels. The probability of encountering recreational vessels is driven by water depth and distance to shore. For more information on methodology, consult the metadata pdf available with the Open Data record.

References: Serra-Sogas, N. et al. 2021. Using aerial surveys to fill gaps in AIS vessel traffic data to inform threat assessments, vessel management and planning. Marine Policy 133: 104765. https://doi.org/10.1016/j.marpol.2021.104765

Data Sources: Recreational vessel sightings and survey effort: Data collected by NASP and analyzed by Norma Serra to extract vessel information and survey effort (for more information on how these data were analyzed, see Serra-Sogas et al., 2021).
Bathymetry data for the whole BC coast (waters within the Canadian EEZ only) were provided by DFO – Science (Selina Agbayani); the data layer was provided as a raster file of 100 m resolution. The coastline dataset used to estimate distance to shore and to clip the grid was provided by DFO – Science (Selina Agbayani) and created by David Williams and Yuriko Hashimoto (DFO – Oceans). The marinas dataset was provided by DFO – Science (Selina Agbayani) and created by Josie Iacarella (DFO – Science); it includes large and medium size marinas and fishing lodges. The data can be downloaded from here: Floating Structures in the Pacific Northwest - Open Government Portal (https://open.canada.ca/data/en/dataset/049770ef-6cb3-44ee-afc8-5d77d6200a12)

Uncertainties: Model results are based on recreational vessels sighted by NASP and their related predictor variables and might not always reflect real-world vessel distributions. Any biases caused by the opportunistic nature of the NASP surveys were minimized by using survey effort as an offset variable.
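The ZINB structure described above can be sketched with pscl::zeroinfl() as follows; the data frame and variable names are assumptions based on the predictors listed in the methods, with survey effort entering the count part as an offset.

```r
# Hedged sketch: zero-inflated negative binomial model of vessel counts per grid cell.
library(pscl)

zinb_fit <- zeroinfl(
  n_vessels ~ dist_shore + dist_marina + marina_density + offset(log(survey_effort)) |
              depth + dist_shore,                 # count part | zero (binomial) part
  data = grid_cells,
  dist = "negbin"
)

summary(zinb_fit)
grid_cells$predicted <- predict(zinb_fit, type = "response")
```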
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Example data for the paper: Machine learning reveals the diversity of human 3D chromatin contact patterns
GitHub: https://github.com/erin-n-gilbertson/3DGenome-diversity/tree/main
biorXiv: https://www.biorxiv.org/content/10.1101/2023.12.22.573104v1.full
Manuscript accepted at Molecular Biology and Evolution
Of primary interest are the example genome-wide predictions for the hg38 reference, the human-archaic hominin ancestor, and the most divergent 1KG individual per genome, along with the Jupyter notebook tutorial for making your own Akita predictions given any input 1 Mb sequence.
We conducted a macroscale study of 2,210 shallow lakes (mean depth ≤ 3 m or maximum depth ≤ 5 m) in the Upper Midwestern and Northeastern U.S. We asked: What are the patterns and drivers of shallow lake total phosphorus (TP), chlorophyll a (CHLa), and TP–CHLa relationships at the macroscale, how do these differ from those for 4,360 non-shallow lakes, and do results differ by hydrologic connectivity class? To answer this question, we assembled the LAGOS-NE Shallow Lakes dataset described herein, a dataset derived from the existing LAGOS-NE, LAGOS-DEPTH, and LAGOS-CLIMATE datasets. Response variables were the median of available summer (i.e., 15 June to 15 September) values of total phosphorus (TP) and chlorophyll a (CHLa). Predictor variables were assembled at two spatial scales for incorporation into hierarchical models. At the local or lake-specific scale (including the individual lake and its inter-lake watershed [iws] or corresponding HU12 watershed), variables included those representing land use/cover, hydrology, climate, morphometry, and acid deposition. At the regional scale (e.g., HU4 watershed), variables included a smaller set of predictor variables for hydrology and land use/cover. The dataset also includes the unique identifier assigned by LAGOS-NE (lagoslakeid); the latitude and longitude of the study lakes; their maximum and mean depths, along with a depth classification of Shallow or non-Shallow; connectivity class (i.e., whether a lake was classified as connected (with inlets and outlets) or unconnected (lacking inlets)); and the zone id for the HU4 to which each lake belongs. Along with the database, we provide the R scripts for the hierarchical models predicting TP or CHLa (TPorCHL_predictive_model.R) and the TP–CHLa relationship (TP_CHL_CSI_Model.R) for depth and connectivity subsets of the study lakes.
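As a hedged illustration of the kind of two-level hierarchical TP–CHLa model described above (lake-level relationships varying by HU4 region), a mixed-model sketch with lme4 is shown below; the archived TPorCHL_predictive_model.R and TP_CHL_CSI_Model.R scripts are the authoritative implementations, and the variable names chla, tp, and hu4_zoneid are assumptions.

```r
# Hedged sketch: TP-CHLa relationship with region-specific intercepts and slopes.
library(lme4)

fit <- lmer(log(chla) ~ log(tp) + (1 + log(tp) | hu4_zoneid), data = shallow_lakes)

summary(fit)
coef(fit)$hu4_zoneid   # HU4-specific intercepts and TP-CHLa slopes
```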
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Script outputs for the paper "Evidence-based guidelines for developing automated conservation assessment methods".
The code used to generate these outputs can be found on GitHub.
To use these outputs, download the code, download this dataset and extract the dataset in the project folder. Some outputs are in the RData format, including all of the trained models. To view these files in R you may need to install the packages listed in the README of the GitHub project.
The outputs are arranged in this file structure:
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
README.txt
Maintenance example belonging to:
The MANTIS Book: Cyber Physical System Based Proactive Collaborative Maintenance
Chapter 9, The Future of Maintenance (2019).
Lambert Schomaker, Michele Albano, Erkki Jantunen, Luis Lino Ferreira
River Publishers (DK)
ISBN: 9788793609853, e-ISBN: 9788793609846, https://doi.org/10.13052/rp-9788793609846
The figure .pdf did not make it into the book. Here are the raw data, processed logs, and the .gnu script to produce it.
Data: event logs on disk failure in two racks of a huge RAID disk system (2009-2016).
disks1.raw
disks2.raw
Event logs to RC-filtered time series:
RC-filt-disks-log.c
do-RC-filter-to-make-spikes-more-visible (bash script)
-->
disks1.log
disks2.log
Constant (horizontal line) indicating the level where users experienced system-down time
Disrupted-operations-threshold
disk-replacement-log.gnu
disk-replacement-log.pdf
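For readers without a C compiler at hand, the RC-filtering idea implemented in RC-filt-disks-log.c can be sketched in a few lines of R as a first-order recursive (exponential) filter; the smoothing constant used here is illustrative, not the one used for the book figure.

```r
# Hedged sketch: RC (exponential) filtering of a 0/1 event series to make
# clusters of disk failures visible as decaying spikes.
rc_filter <- function(events, alpha = 0.1) {
  # y[t] = alpha * x[t] + (1 - alpha) * y[t-1]
  as.numeric(stats::filter(alpha * events, filter = 1 - alpha, method = "recursive"))
}

# Synthetic example: a daily failure-event series
events <- rep(0, 365)
events[c(40, 42, 43, 200, 300, 301)] <- 1
plot(rc_filter(events), type = "l", xlab = "Day", ylab = "RC-filtered failure signal")
```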
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Modeling Face Recognition in the Predictive Coding Framework: A Combined Computational Modeling and Functional Imaging Study by Zaragoza, Niehaus, et al. 2022 (under review)
This repository contains ONLY the fMRI raw data and the associated code scripts for both the level 1 and level 2 SPM analyses of the connected publication. Note that this code was run on Windows 10 using MATLAB. For additional links see below:
In-principle-accepted Stage 1 Pre-registration: https://doi.org/10.17605/OSF.IO/A8VU7
In-principle-accepted Stage 2 modeling scripts, behavioral data, and preprint: https://osf.io/tye24/
This is the fMRI Dataset for the FRAPPS study (bids format).
The study consists of two tasks: an identity learning task ('FRAPPS') and a face localizer task ('localizer')
defacing of anatomical images was performed with pydeface 2.0.0!!!
Functional data are stored as a 4D file including all scans (e.g. sub-01_task-frapps_bold.nii.gz) as well as without the 4 dummy scans (e.g. sub-01_task-frapps_bold-dummy-scans.nii). The version without dummy scans was used for all further analyses.
Exclusion of participants: due to extensive motion (mriqc FD parameter > 0.3) two subjects (sub-16 and sub-26) were excluded. Additionally, we excluded sub-21 because of low performance in the FRAPPS task (< 60%). We also excluded sub-4, sub-18 and sub-28 from all fMRI analyses assessing the influence of the parametric modulators on brain activation.
onsets and durations of both tasks are described in files called subID_spm_time.csv (for FRAPPS task) and subID-fp_no_Eyelink_onsets.mat (for the localizer task). The directory is subID/func/onsets_and_durations/. In this folder, there are also the multiple condition files for the first-level analysis with and without parametric modulations. They are called for example: sub-01_multi_cond_VIEW-INDxCONTEXT_pe_VI.mat
Computational model parameters for the parametric modulation can be found in folder subID/func/model_parameters/
Notes on preprocessing (FRAPPS): For slice timing, TA is not relevant and is not used when slice timings are entered; TA was therefore set to 0, and the reference slice timing was set to TR/2 (1.53/2).
Notes on the first level (FRAPPS): Slice time correction was performed; thus, for the first-level model the microtime resolution was changed to the number of slices divided by two because of the multiband factor of 2 (= 24), and the microtime onset needs to match the chosen reference slice, so it was set to 12.
Notes on the mean beta calculation (FRAPPS): sub-4, sub-18, and sub-28 have extreme beta values (around -1000) for the history VI pmod. This is most probably caused by a history VI parameter that changes only very slightly.
Localizer coordinates at group level (conjunction) (given as a reference only; of note, for OFA, FFA, and pSTS we used the group coordinates derived from Thome et al., 2022, NeuroImage, to locate the individual maxima)
lOFA -42 -84 -10 t = 1.82 rOFA 48 -74 -6 t = 3.86
lFFA -42 -46 -18 t = 7.32 rFFA 42 -46 -16 t = 6.80
lpSTS -58 -54 8 t = 5.55 rpSTS 46 -54 14 t = 6.25
lPPA -22 -46 -8 t = 17.57 rPPA 24 -46 -10 t = 16.95
Modelling the spread of introduced ecosystem engineers is a conservation priority due to their potential to cause irreversible ecosystem-level changes. While existing models predict potential distributions and spread capacities, new approaches that simulate the trajectory of a species' spread over time are needed. We developed novel simulations that predict spatial and temporal spread, capturing both continuous diffusion-dispersal and occasional long-distance leaps. We focused on the introduced population of Superb Lyrebird (Menura novaehollandiae) in Tasmania, Australia. Initially introduced as an insurance population, lyrebirds have become novel bioturbators, spreading across key natural areas and becoming "unwanted but challenging to eradicate". Using multi-scale ecological data, our research (1) identified broad and fine-scale correlates of lyrebird occupation and (2) developed a spread simulation guided by a pattern-oriented framework. This occurrence-based modelling framework is u...

This dataset integrates fine- and broad-scale ecological data collected between 1970 and 2023. Fine-scale data was obtained from 210 camera trap sites in Tasmania, recording vegetation structure and lyrebird detections. Broad-scale data was compiled from citizen science records via the Atlas of Living Australia, combined with environmental predictors such as climatic variables, elevation, and land-use data. Fine-scale habitat data was derived from camera trap detections, with vegetation classified into dense or sparse categories using a 3×3 grid overlay method. Broad-scale occurrence records were filtered for spatial and temporal accuracy, and pseudo-absence data was generated based on effort-controlled absence criteria. Predictor variables for Species Distribution Models (SDMs) were z-transformed, and categorical variables (e.g., vegetation, land-use types) were aggregated into broader classes. Spread simulations used a stochastic grid-cell model with parameters calibrated via a patter...

# A pattern-oriented simulation for forecasting species spread through time and space: A case study on an ecosystem engineer on the move
https://doi.org/10.5061/dryad.xsj3tx9rg
This dataset supports the study on habitat suitability and spread modelling of the Superb Lyrebird (Menura novaehollandiae) in Tasmania. The study applies a sequential framework:
The dataset includes fine-scale vegetation structure, occurrence records, environmental predictors, SDM performance outputs, and simulation scripts.
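To illustrate the kind of stochastic grid-cell spread simulation described above (local diffusion plus occasional long-distance leaps), a heavily simplified R sketch is given below; the parameters and the suitability surface are illustrative and do not reproduce the calibrated simulation in the archived scripts.

```r
# Hedged sketch: stochastic spread over a gridded suitability surface.
simulate_spread <- function(suitability, start_cell, n_steps = 50,
                            p_local = 0.3, p_leap = 0.01, leap_max = 10) {
  occ <- matrix(FALSE, nrow(suitability), ncol(suitability))
  occ[start_cell[1], start_cell[2]] <- TRUE
  for (step in seq_len(n_steps)) {
    cells <- which(occ, arr.ind = TRUE)
    for (i in seq_len(nrow(cells))) {
      # Local diffusion into neighbouring cells, weighted by habitat suitability
      for (dr in -1:1) for (dc in -1:1) {
        r <- cells[i, 1] + dr; c <- cells[i, 2] + dc
        if (r >= 1 && r <= nrow(occ) && c >= 1 && c <= ncol(occ) &&
            runif(1) < p_local * suitability[r, c]) occ[r, c] <- TRUE
      }
      # Occasional long-distance leap
      if (runif(1) < p_leap) {
        r <- cells[i, 1] + sample(-leap_max:leap_max, 1)
        c <- cells[i, 2] + sample(-leap_max:leap_max, 1)
        if (r >= 1 && r <= nrow(occ) && c >= 1 && c <= ncol(occ) &&
            runif(1) < suitability[r, c]) occ[r, c] <- TRUE
      }
    }
  }
  occ
}

# Example on a random 100 x 100 suitability surface
suit     <- matrix(runif(100 * 100), 100, 100)
occupied <- simulate_spread(suit, start_cell = c(50, 50))
```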
https://spdx.org/licenses/CC0-1.0.html
This dataset contains simulated datasets, empirical data, and R scripts described in the paper: “Li, Q. and Kou, X. (2021) WiBB: An integrated method for quantifying the relative importance of predictive variables. Ecography (DOI: 10.1111/ecog.05651)”.
A fundamental goal of scientific research is to identify the underlying variables that govern crucial processes of a system. Here we proposed a new index, WiBB, which integrates the merits of several existing methods: a model-weighting method from information theory (Wi), a standardized regression coefficient method measured by β* (B), and the bootstrap resampling technique (B). We applied the WiBB to simulated datasets with known correlation structures, for both linear models (LM) and generalized linear models (GLM), to evaluate its performance. We also applied two other methods, the relative sum of weights (SWi) and the standardized beta (β*), to evaluate their performance in ranking predictor importance under various scenarios, in comparison with the WiBB method. We further applied it to an empirical dataset of the plant genus Mimulus to select bioclimatic predictors of species' presence across the landscape. Results on the simulated datasets showed that the WiBB method outperformed the β* and SWi methods in scenarios with small and large sample sizes, respectively, and that the bootstrap resampling technique significantly improved the discriminant ability. When testing WiBB on the empirical dataset with GLM, it sensibly identified four important predictors with high credibility out of six candidates in modeling the geographical distributions of 71 Mimulus species. This integrated index has great advantages in evaluating predictor importance and hence reducing the dimensionality of data, without losing interpretive power. The simplicity of calculating the new metric, compared with more sophisticated statistical procedures, makes it a handy method in the statistical toolbox.
Methods: To simulate independent datasets (size = 1000), we adopted Galipaud et al.'s (2014) approach with custom modifications of the data.simulation function, which used the multivariate normal distribution function rmvnorm in the R package mvtnorm (v1.0-5; Genz et al. 2016). Each dataset was simulated with a preset correlation structure between a response variable (y) and four predictors (x1, x2, x3, x4). The first three (genuine) predictors were set to be strongly, moderately, and weakly correlated with the response variable, respectively (denoted by large, medium, and small Pearson correlation coefficients, r), while the correlation between the response and the last (spurious) predictor was set to be zero. We simulated datasets with three levels of differences between the correlation coefficients of consecutive predictors, where ∆r = 0.1, 0.2, 0.3, respectively. These three levels of ∆r resulted in three correlation structures between the response and the four predictors: (0.3, 0.2, 0.1, 0.0), (0.6, 0.4, 0.2, 0.0), and (0.8, 0.6, 0.3, 0.0), respectively. We repeated the simulation procedure 200 times for each of the three preset correlation structures (600 datasets in total) for later LM fitting. For GLM fitting, we modified the simulation procedure with additional steps, in which we converted the continuous response into binary data O (e.g., occurrence data having 0 for absence and 1 for presence). We tested the WiBB method, along with two other methods, the relative sum of weights (SWi) and the standardized beta (β*), to evaluate the ability to correctly rank predictor importance under various scenarios. The empirical dataset of 71 Mimulus species was compiled from occurrence coordinates and the corresponding values extracted from climatic layers of the WorldClim dataset (www.worldclim.org), and we applied the WiBB method to infer important predictors of their geographical distributions.
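The simulation and the two baseline importance measures can be sketched in R as follows; this illustrates the ingredients (a simulated correlation structure, standardized betas, and Akaike weights summed per predictor, i.e., SWi) rather than the exact WiBB computation, which follows the paper and the archived scripts.

```r
# Hedged sketch: simulate one dataset with the (0.6, 0.4, 0.2, 0.0) structure and
# compute standardized betas and the relative sum of Akaike weights (SWi).
library(mvtnorm)

r     <- c(0.6, 0.4, 0.2, 0.0)
sigma <- diag(5)
sigma[1, 2:5] <- sigma[2:5, 1] <- r                  # correlations of y with x1..x4
dat <- as.data.frame(rmvnorm(1000, mean = rep(0, 5), sigma = sigma))
names(dat) <- c("y", "x1", "x2", "x3", "x4")

b_star <- coef(lm(y ~ ., data = dat))[-1]            # standardized betas (unit-scale data)

preds  <- c("x1", "x2", "x3", "x4")
combos <- c(list(character(0)),
            unlist(lapply(1:4, function(m) combn(preds, m, simplify = FALSE)),
                   recursive = FALSE))
fits <- lapply(combos, function(v)
  lm(reformulate(if (length(v)) v else "1", response = "y"), data = dat))
aic  <- sapply(fits, AIC)
w    <- exp(-0.5 * (aic - min(aic))); w <- w / sum(w)  # Akaike weights
swi  <- sapply(preds, function(p) sum(w[sapply(combos, function(v) p %in% v)]))
```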
Climate change threatens biodiversity by compromising the ability to balance energy and water, influencing animal behaviour, species interactions, distribution, and ultimately survival. Predicting climate change effects on thermal physiology is complicated by interspecific variation in thermal tolerance limits, thermoregulatory behaviour, and heterogeneous thermal landscapes. We develop an approach for assessing thermal vulnerability for endotherms by incorporating behaviour and microsite data into a biophysical model. We parameterised the model using species-specific functional traits and published behavioural data on hotter (maximum daily temperature, Tmax > 35 °C) and cooler days (Tmax < 35 °C). Incorporating continuous time-activity focal observations of behaviour into the biophysical approach reveals that the three insectivorous birds modelled here are at greater risk of lethal hyperthermia than dehydration under climate change, contrary to previous thermal risk assessments. S...

# Data from: Parameters used in the endotherm biophysical model for each species
https://doi.org/10.5061/dryad.zgmsbccnh
We provide a novel approach for assessing thermal vulnerability by incorporating detailed behaviour and microsite data for endotherms exposed to recent and future climates (RCP 8.5, 4.5) into a biophysical model. The model was parameterised using species-specific functional traits and published behavioural data on hotter (maximum daily temperature, *T*max > 35 °C) and cooler days (*T*max < 35 °C). Below we have provided the biophysical traits used in the parameterisation of the model, the customized endotherm model, the operative temperature model, and an example script for predicting thermoregulation in an exposed-on-ground microsite. We have noted below the sequence in which these scripts need to be run.
**Des...
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Example code script showing the parameters used in all AlphaFold2 predictions within the module.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains the following:
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Kaggle’s March Machine Learning Mania competition challenged data scientists to predict winners and losers of the men's 2016 NCAA basketball tournament. This dataset contains the 1070 selected predictions of all Kaggle participants. These predictions were collected and locked in prior to the start of the tournament.
How can this data be used? You can pivot it to look at both Kaggle and NCAA teams alike. You can look at who will win games, which games will be close, which games are hardest to forecast, or which Kaggle teams are gambling vs. sticking to the data.
The NCAA tournament is a single-elimination tournament that begins with 68 teams. There are four games, usually called the “play-in round,” before the traditional bracket action starts. Due to competition timing, these games are included in the prediction files but should not be used in analysis, as it’s possible that the prediction was submitted after the play-in round games were over.
Each Kaggle team could submit up to two prediction files. The prediction files in the dataset are in the 'predictions' folder and named according to:
TeamName_TeamId_SubmissionId.csv
The file format contains a probability prediction for every possible game between the 68 teams. This is necessary to cover every possible tournament outcome. Each team has a unique numerical Id (given in Teams.csv). Each game has a unique Id column created by concatenating the year and the two team Ids. The format is the following:
Id,Pred
2016_1112_1114,0.6
2016_1112_1122,0
...
The team with the lower numerical Id is always listed first. “Pred” represents the probability that the team with the lower Id beats the team with the higher Id. For example, "2016_1112_1114,0.6" indicates team 1112 has a 0.6 probability of beating team 1114.
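As a small usage sketch, the following R snippet reads one submission file and looks up the predicted probability for a matchup in the format described above; the file name is illustrative.

```r
# Hedged sketch: parse a submission and query a matchup probability.
preds <- read.csv("predictions/TeamName_TeamId_SubmissionId.csv",
                  stringsAsFactors = FALSE)

lookup <- function(preds, year, team_a, team_b) {
  lo <- min(team_a, team_b); hi <- max(team_a, team_b)
  # Pred is the probability that the lower-Id team beats the higher-Id team
  preds$Pred[preds$Id == paste(year, lo, hi, sep = "_")]
}

lookup(preds, 2016, 1114, 1112)   # returns 0.6 for the example row "2016_1112_1114,0.6"
```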
For convenience, we have included the data files from the 2016 March Mania competition dataset in the Scripts environment (you may find TourneySlots.csv and TourneySeeds.csv useful for determining matchups, see the documentation). However, the focus of this dataset is on Kagglers' predictions.