Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Model transformation languages are special-purpose languages designed to make defining transformations as convenient as possible, often in a declarative way. With the increasing use of transformations in various domains, the complexity and size of input models are also increasing. However, developers often lack suitable models for performance testing. We have therefore conducted experiments in which we predict the performance of model transformations based on characteristics of input models using machine learning approaches. This dataset contains our raw and processed input data, the scripts necessary to repeat our experiments, and the results we obtained.
Our input data consists of the time measurements for six different transformations defined in the Atlas Transformation Language (ATL), as well as the collected characteristics of the real-world input models that were transformed. We provide the script that implements our experiments. We predict the execution time of ATL transformations using the machine learning approaches linear regression, random forests, and support vector regression with a radial basis function kernel. We also investigate different sets of characteristics of input models as input for the machine learning approaches. These are described in detail in the provided documentation.pdf. The results of the experiments are provided as raw data in individual csv files. Additionally, we calculated the mean absolute percentage error (in %) and the 95th percentile of the absolute percentage error (in %) for each experiment and provide these results. Furthermore, we provide our Eclipse plugin, which collects the characteristics for a set of given models, the Java projects used to measure the execution time of the transformations, and other supporting scripts, e.g. for the analysis of the results.
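As an illustration of this setup, here is a minimal R sketch that fits the three learners named above and reports the two error metrics; the file name measurements.csv and the column exec_time are placeholders, not names used in the dataset.

```r
# Hedged sketch: predict transformation execution time from model characteristics
# and report MAPE and the 95th percentile of the absolute percentage error.
library(randomForest)
library(e1071)

runs <- read.csv("measurements.csv")           # placeholder: one row per measured run
train_idx <- sample(nrow(runs), 0.8 * nrow(runs))
train <- runs[train_idx, ]
test  <- runs[-train_idx, ]

fits <- list(
  lm  = lm(exec_time ~ ., data = train),
  rf  = randomForest(exec_time ~ ., data = train),
  svr = svm(exec_time ~ ., data = train, kernel = "radial")   # RBF kernel
)

ape <- function(actual, predicted) abs(actual - predicted) / actual * 100

for (name in names(fits)) {
  e <- ape(test$exec_time, predict(fits[[name]], newdata = test))
  cat(name, "- MAPE:", round(mean(e), 2), "% | 95th percentile APE:",
      round(quantile(e, 0.95), 2), "%\n")
}
```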
A short introduction with a quick start guide can be found in README.md and detailed documentation in documentation.pdf.
This data package is associated with the publication “Prediction of Distributed River Sediment Respiration Rates using Community-Generated Data and Machine Learning” submitted to the Journal of Geophysical Research: Machine Learning and Computation (Scheibe et al. 2024). River sediment respiration observations are expensive and labor-intensive to obtain, and there is no physical model for predicting this quantity. The Worldwide Hydrobiogeochemistry Observation Network for Dynamic River Systems (WHONDRS) observational data set (Goldman et al., 2020) is used to train machine learning (ML) models to predict respiration rates at unsampled sites. This repository archives training data, ML models, predictions, and model evaluation results for the purposes of reproducibility of the results in the associated manuscript and community reuse of the ML models trained in this project.

One of the key challenges in this work was to find an optimum configuration for machine learning models to work with this feature-rich (i.e., 100+ possible input variables) data set. Here, we used a two-tiered approach to managing the analysis of this complex data set: 1) a stacked ensemble of ML models that can automatically optimize hyperparameters to accelerate the process of model selection and tuning, and 2) feature permutation importance to iteratively select the most important features (i.e., inputs) to the ML models. The major elements of this ML workflow are modular, portable, open, and cloud-based, thus making this implementation a potential template for other applications.

This data package is associated with the GitHub repository found at https://github.com/parallelworks/sl-archive-whondrs. A static copy of the GitHub repository is included in this data package as an archived version at the time of publishing this data package (March 2023). However, we recommend accessing these files via GitHub for full functionality. Please see the file-level metadata (flmd; “sl-archive-whondrs_flmd.csv”) for a list of all files contained in this data package and descriptions for each. Please see the data dictionary (dd; “sl-archive-whondrs_dd.csv”) for a list of all column headers contained within comma-separated value (csv) files in this data package and descriptions for each.

The GitHub repository is organized into five top-level directories: (1) “input_data” holds the training data for the ML models; (2) “ml_models” holds machine learning models trained on the data in “input_data”; (3) “scripts” contains data preprocessing and postprocessing scripts and intermediate results specific to this data set that bookend the ML workflow; (4) “examples” contains the visualization of the results in this repository, including plotting scripts for the manuscript (e.g., model evaluation, FPI results) and scripts for running predictions with the ML models (i.e., reusing the trained ML models); (5) “output_data” holds the overall results of the ML model on that branch. Each trained ML model resides on its own branch in the repository; this means that inputs and outputs can differ branch-to-branch. Furthermore, depending on the number of features used to train the ML models, the preprocessing and postprocessing scripts, and their intermediate results, can also differ branch-to-branch. The “main-*” branches are meant to be starting points (i.e., trunks) for each model branch (i.e., sprouts). Please see the Branch Navigation section in the top-level README.md in the GitHub repository for more details.
There is also one hidden directory, “.github/workflows”. This hidden directory contains information on how to run the ML workflow as an end-to-end automated GitHub Action, but it is not needed for reusing the ML models archived here. Please see the top-level README.md in the GitHub repository for more details on the automation.
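To make the second tier of the workflow concrete, below is a minimal, generic R sketch of feature permutation importance (FPI): each feature is shuffled in turn and the resulting increase in prediction error is taken as its importance. The model, data frame, and column names here are placeholders; the archived stacked-ensemble models and their exact FPI procedure live in the GitHub repository.

```r
# Hedged sketch of feature permutation importance (FPI) with a stand-in model.
library(randomForest)

permutation_importance <- function(fit, dat, response, n_repeats = 10) {
  rmse     <- function(obs, pred) sqrt(mean((obs - pred)^2))
  baseline <- rmse(dat[[response]], predict(fit, dat))
  features <- setdiff(names(dat), response)
  sapply(features, function(f) {
    mean(replicate(n_repeats, {
      shuffled      <- dat
      shuffled[[f]] <- sample(shuffled[[f]])   # break the feature-response link
      rmse(dat[[response]], predict(fit, shuffled)) - baseline
    }))
  })
}

# Example usage with placeholder names:
# fit <- randomForest(respiration ~ ., data = whondrs_training)
# sort(permutation_importance(fit, whondrs_training, "respiration"), decreasing = TRUE)
```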
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In the last decade, a plethora of algorithms have been developed for spatial ecology studies. In our case, we use some of these codes in underwater research on the applied ecology of threatened endemic fishes and their natural habitat. For this, we developed scripts in the RStudio® environment to run spatial and statistical analyses for ecological response and spatial distribution models (e.g., Hijmans & Elith, 2017; Den Burg et al., 2020). The employed R packages are as follows: caret (Kuhn et al., 2020), corrplot (Wei & Simko, 2017), devtools (Wickham, 2015), dismo (Hijmans & Elith, 2017), gbm (Freund & Schapire, 1997; Friedman, 2002), ggplot2 (Wickham et al., 2019), lattice (Sarkar, 2008), lattice (Musa & Mansor, 2021), maptools (Hijmans & Elith, 2017), ModelMetrics (Hvitfeldt & Silge, 2021), pander (Wickham, 2015), plyr (Wickham & Wickham, 2015), pROC (Robin et al., 2011), raster (Hijmans & Elith, 2017), RColorBrewer (Neuwirth, 2014), Rcpp (Eddelbuettel & Balamuta, 2018), rgdal (Verzani, 2011), sdm (Naimi & Araujo, 2016), sf (e.g., Zainuddin, 2023), sp (Pebesma, 2020) and usethis (Gladstone, 2022).
It is important to follow all the codes in order to obtain results from the ecological response and spatial distribution models. In particular, for the ecological scenario we selected the Generalized Linear Model (GLM), and for the geographic scenario we selected DOMAIN, also known as Gower's metric (Carpenter et al., 1993). We selected this regression method and this distance-similarity metric because of their adequacy and robustness for studies with endemic or threatened species (e.g., Naoki et al., 2006). Next, we explain the statistical parameterization of the code used to run the GLM and DOMAIN models:
In the first instance, we generated the background points and extracted the values of the variables (Code2_Extract_values_DWp_SC.R). Barbet-Massin et al. (2012) recommend the use of 10,000 background points when using regression methods (e.g., Generalized Linear Model) or distance-based models (e.g., DOMAIN). However, we considered factors such as the extent of the area and the type of study species to be important for the correct selection of the number of points (pers. obs.). We then extracted the values of the predictor variables (e.g., bioclimatic, topographic, demographic, habitat) at the presence and background points (e.g., Hijmans and Elith, 2017).
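A minimal R sketch of this step (not the archived Code2_Extract_values_DWp_SC.R itself): generating background points with dismo and extracting predictor values at presence and background locations. File, folder, and column names are assumptions.

```r
# Hedged sketch: background point generation and value extraction.
library(dismo)
library(raster)

predictors <- stack(list.files("predictors", pattern = "\\.tif$", full.names = TRUE))
presence   <- read.csv("presence_points.csv")        # assumed columns: lon, lat

set.seed(1)
background <- randomPoints(predictors, n = 10000)    # cf. Barbet-Massin et al. (2012)

pres_vals <- extract(predictors, presence[, c("lon", "lat")])
back_vals <- extract(predictors, background)

model_data <- rbind(
  data.frame(presence = 1, pres_vals),
  data.frame(presence = 0, back_vals)
)
```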
Subsequently, we subdivided both the presence and background point groups into 75% training data and 25% test data, following the method of Soberón & Nakamura (2009) and Hijmans & Elith (2017). For training control, the 10-fold cross-validation method was selected, with the response variable presence assigned as a factor. If some other variable is important for the study species, it should also be assigned as a factor (Kim, 2009).
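For example, the split and the training control could look as follows with caret, reusing the model_data object assumed in the previous sketch:

```r
# Hedged sketch: 75/25 partition and 10-fold cross-validation control.
library(caret)

model_data$presence <- factor(model_data$presence, labels = c("background", "presence"))

set.seed(1)
in_train  <- createDataPartition(model_data$presence, p = 0.75, list = FALSE)
train_set <- model_data[in_train, ]
test_set  <- model_data[-in_train, ]

ctrl <- trainControl(method = "cv", number = 10)     # 10-fold cross-validation
```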
After that, we ran the code for the GBM method (Gradient Boosting Machine; Code3_GBM_Relative_contribution.R and Code4_Relative_contribution.R), from which we obtained the relative contribution of the variables used in the model. We parameterized the code with a Gaussian distribution and 5,000 iterations (e.g., Friedman, 2002; Kim, 2009; Hijmans and Elith, 2017). In addition, we selected a validation interval of 4 random training points (personal test). The resulting plots were the partial dependence blocks, as a function of each predictor variable.
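A condensed R sketch of this step with the gbm package, under the parameterisation described above (Gaussian distribution, 5,000 iterations); the cv.folds value and the train_set object are assumptions carried over from the previous sketches.

```r
# Hedged sketch: GBM relative contribution and partial dependence plots.
library(gbm)

train_gbm <- train_set
train_gbm$presence <- as.numeric(train_gbm$presence == "presence")   # 0/1 response

gbm_fit <- gbm(
  presence ~ .,
  data         = train_gbm,
  distribution = "gaussian",
  n.trees      = 5000,
  cv.folds     = 4            # assumption, mirroring the 4-point validation mentioned above
)

rel_contrib <- summary(gbm_fit)   # relative contribution of each predictor
plot(gbm_fit, i.var = 1)          # partial dependence on the first predictor
```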
Subsequently, the correlation between variables was computed using Pearson's method (Code5_Pearson_Correlation.R) to evaluate multicollinearity (Guisan & Hofer, 2003). It is recommended to use a bivariate correlation threshold of ±0.70 to discard highly correlated variables (e.g., Awan et al., 2021).
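In R, this check can be sketched as follows (corrplot for visualisation and caret::findCorrelation for the |r| ≥ 0.70 screen); object names follow the earlier sketches.

```r
# Hedged sketch: Pearson correlation matrix and collinearity screening.
library(corrplot)
library(caret)

pred_cols <- setdiff(names(train_set), "presence")
cor_mat   <- cor(train_set[, pred_cols], method = "pearson", use = "complete.obs")

corrplot(cor_mat, method = "number")

to_drop <- findCorrelation(cor_mat, cutoff = 0.70, names = TRUE)  # candidates to discard
```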
Once the above codes were run, we loaded the same subgroups (i.e., presence and background groups with 75% training and 25% testing) (Code6_Presence&backgrounds.R) for the GLM method code (Code7_GLM_model.R). Here, we first ran the GLM models per variable to obtain the p-value of each variable (alpha ≤ 0.05); we selected the value one (i.e., presence) as the likelihood factor. The generated models are of polynomial degree two, to obtain linear and quadratic responses (e.g., Fielding and Bell, 1997; Allouche et al., 2006). From these results, we ran ecological response curve models, where the resulting plots include the probability of occurrence and the values for continuous variables or the categories for discrete variables. The points of the presence and background training groups are also included.
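A minimal R sketch of the per-variable GLMs and a response curve (not the archived Code7_GLM_model.R itself); train_set is the assumed object from the earlier sketches.

```r
# Hedged sketch: per-variable binomial GLMs with linear and quadratic terms,
# followed by an ecological response curve for one predictor.
pred_cols <- setdiff(names(train_set), "presence")

single_var_glms <- lapply(pred_cols, function(v) {
  glm(reformulate(sprintf("poly(%s, 2)", v), response = "presence"),
      data = train_set, family = binomial)
})
names(single_var_glms) <- pred_cols

v   <- pred_cols[1]
new <- setNames(data.frame(seq(min(train_set[[v]]), max(train_set[[v]]),
                               length.out = 200)), v)
plot(new[[v]], predict(single_var_glms[[v]], newdata = new, type = "response"),
     type = "l", xlab = v, ylab = "Probability of occurrence")
```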
On the other hand, a global GLM was also run, and the generalized model was evaluated by means of a 2 x 2 contingency matrix including both observed and predicted records. A representation of this is shown in Table 1 (adapted from Allouche et al., 2006). In this process we selected a threshold of 0.5 to obtain better modeling performance and to avoid a high percentage of type I (omission) or type II (commission) errors (e.g., Carpenter et al., 1993; Fielding and Bell, 1997; Allouche et al., 2006; Kim, 2009; Hijmans and Elith, 2017).
Table 1. Example of 2 x 2 contingency matrix for calculating performance metrics for GLM models. A represents true presence records (true positives), B represents false presence records (false positives - error of commission), C represents true background points (true negatives) and D represents false backgrounds (false negatives - errors of omission).
             Validation set
Model        True         False
Presence     A            B
Background   C            D
We then calculated the overall accuracy and the True Skill Statistic (TSS). The first assesses the proportion of correctly predicted cases, while the second assesses the prevalence of correctly predicted cases (Olden and Jackson, 2002); the TSS also gives equal weight to the prevalence of presence predictions and to the correction for random performance (Fielding and Bell, 1997; Allouche et al., 2006).
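Using the letters of Table 1 (A = true presences, B = false presences, C = true backgrounds, D = false backgrounds) and the 0.5 threshold chosen above, the two metrics can be sketched in R as follows; global_glm, train_set, and test_set are assumed objects from the earlier sketches.

```r
# Hedged sketch: overall accuracy and True Skill Statistic from the 2 x 2 matrix.
global_glm <- glm(presence ~ ., data = train_set, family = binomial)

prob <- predict(global_glm, newdata = test_set, type = "response")
pred <- prob >= 0.5
obs  <- test_set$presence == "presence"

A <- sum(pred & obs);   B <- sum(pred & !obs)    # true / false presences
C <- sum(!pred & !obs); D <- sum(!pred & obs)    # true / false backgrounds

overall     <- (A + C) / (A + B + C + D)         # proportion correctly predicted
sensitivity <- A / (A + D)
specificity <- C / (C + B)
tss         <- sensitivity + specificity - 1     # Allouche et al. (2006)
```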
The last code (i.e., Code8_DOMAIN_SuitHab_model.R) is for species distribution modelling using the DOMAIN algorithm (Carpenter et al., 1993). Here, we loaded the variable stack and the presence and background groups, each subdivided into 75% training and 25% test data. Only the presence training subset and the predictor variable stack were included in the calculation of the DOMAIN metric, as well as in the evaluation and validation of the model.
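A minimal dismo sketch of this step (not the archived Code8_DOMAIN_SuitHab_model.R): the DOMAIN model is fitted with the presence training coordinates and the predictor stack, predicted over the stack, and evaluated against the withheld test points. The coordinate objects are assumptions.

```r
# Hedged sketch: DOMAIN (Gower similarity) suitability model with dismo.
library(dismo)
library(raster)

dom <- domain(predictors, pres_train_xy)          # presence training coordinates only

suitability <- predict(predictors, dom)           # Gower-similarity surface

eval_dom <- evaluate(p = pres_test_xy, a = back_test_xy, model = dom, x = predictors)
eval_dom                                          # AUC and related statistics
```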
Regarding the model evaluation and estimation, we selected the following estimators:
1) partial ROC, which evaluates the separation between the curves of positive (i.e., correctly predicted presence) and negative (i.e., correctly predicted absence) cases. The farther apart these curves are, the better the model's prediction performance for the correct spatial distribution of the species (Manzanilla-Quiñones, 2020).
2) the ROC/AUC curve for model validation, where an optimal performance threshold is estimated to achieve an expected confidence of 75% to 99% (DeLong et al., 1988).
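For the ROC/AUC part, a small pROC sketch could look like this, scoring the DOMAIN suitability surface at the withheld presence and background points (object names are assumptions; partial ROC itself is usually computed with dedicated tools rather than pROC):

```r
# Hedged sketch: ROC/AUC validation and an optimal threshold via Youden's J.
library(pROC)
library(raster)

scores <- c(extract(suitability, pres_test_xy), extract(suitability, back_test_xy))
labels <- c(rep(1, nrow(pres_test_xy)), rep(0, nrow(back_test_xy)))

roc_obj <- roc(labels, scores)
auc(roc_obj)
coords(roc_obj, x = "best", best.method = "youden")  # threshold, specificity, sensitivity
```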
Global change is impacting biodiversity across all habitats on earth. New selection pressures from changing climatic conditions and other anthropogenic activities are creating heterogeneous ecological and evolutionary responses across many species' geographic ranges. Yet we currently lack standardised and reproducible tools to effectively predict the resulting patterns in species vulnerability to declines or range changes. We developed an informatic toolbox that integrates ecological, environmental and genomic data and analyses (environmental dissimilarity, species distribution models, landscape connectivity, neutral and adaptive genetic diversity, genotype-environment associations and genomic offset) to estimate population vulnerability. In our toolbox, functions and data structures are coded in a standardised way so that it is applicable to any species or geographic region where appropriate data are available, for example individual or population sampling and genomic datasets (e.g. RA...

Raw sequence data is available at the European Nucleotide Archive (ENA): Myotis escalerai and M. crypticus (PRJEB29086), and the NCBI Short Read Archive (SRA): Afrixalus fornasini (SRP150605). Input data (processed genomic data and spatial-environmental data prior to running the toolbox) are available as part of this repository.

Methods: see the methods text of the manuscript and the tutorials for setting up and running the LotE toolbox: https://cd-barratt.github.io/Life_on_the_edge.github.io/Vignette

This software is intended for HPC use. Please make sure the software below is installed and functional in your HPC environment before proceeding:
Life on the edge data and scripts (also available here: https://github.com/cd-barratt/Life_on_the_edge)
Singularity (3.5) and a Bioconductor container with the correct R version: https://cloud.sylabs.io/library/sinwood/bioconductor/bioconductor_3.14
R (4.1.3). Dependencies for the toolbox are installed within the R version in the Singularity container upon setup (you specify your R libraries in the script where annotated)
Julia (1.7.2)
Additionally, you need to download the following and place it in the correct directories to be sure the toolbox will function properly:
* Environmental predictor data - please download and place the environmental layers used for SDMs, GEAs etc. in separate folders for current and future environmental conditions. These f...
# Life on the edge: a new toolbox for population-level climate change vulnerability assessments
The dataset contains input files needed to run Life on the edge for an example dataset (Afrixalus fornasini). You may run data for your focal species following the structure and content of the example files provided.
First you need to download the following and place in the correct directories to be sure the toolbox will function properly:
Full setup and how to run the LotE toolbox - https://cd-barratt.github.io/Life_on_the_edge.github.io/Vignette
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background and objective: Multistate models, which allow the prediction of complex multistate survival processes such as multimorbidity, or recovery, relapse and death following treatment for cancer, are being used for clinical prediction. It is paramount to evaluate the calibration (as well as other metrics) of a risk prediction model before implementation of the model. While there are a number of software applications available for developing multistate models, currently no software exists to aid in assessing the calibration of a multistate model, and as a result evaluation of model performance is uncommon. calibmsm has been developed to fill this gap.

Methods: Assessing the calibration of predicted transition probabilities between any two states is made possible through three approaches. The first two utilise calibration techniques for binary and multinomial logistic regression models in combination with inverse probability of censoring weights, whereas the third utilises pseudo-values. All methods are implemented in conjunction with landmarking to allow calibration assessment of predictions made at any time beyond the start of follow-up. This study focuses on calibration curves, but the methodological framework also allows estimation of calibration slopes and intercepts.

Results: This article serves as a guide on how to use calibmsm to assess the calibration of any multistate model, via a comprehensive example evaluating a model developed to predict recovery, adverse events, relapse and survival in patients with blood cancer after a transplantation. The calibration plots indicate that predictions of relapse made at the time of transplant are poorly calibrated; however, predictions of death are well calibrated. The calibration of all predictions made at 100 days post transplant appears to be poor, although a larger validation sample is required to draw stronger conclusions.

Conclusions: calibmsm is an R package which allows users to assess the calibration of predicted transition probabilities from a multistate model. Evaluation of model performance is a key step in the pathway to model implementation, yet evaluation of the performance of predictions from multistate models is not common. We hope the availability of this software will help model developers evaluate the calibration of models being developed.
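For orientation, the core idea behind the BLR-IPCW approach can be sketched in a few lines of generic R (this is not the calibmsm API; the package wraps these steps together with landmarking and the multistate bookkeeping). Column names p_k, in_k, and ipcw are placeholders for the predicted probability of occupying state k, the observed occupancy indicator, and the inverse probability of censoring weight.

```r
# Hedged sketch: a weighted logistic calibration curve for one transition probability.
library(splines)

# dat: one row per landmarked individual (placeholder object)
calib_fit <- glm(in_k ~ ns(qlogis(p_k), df = 3), data = dat,
                 family = quasibinomial, weights = ipcw)   # quasibinomial: non-integer weights

dat$obs_k <- predict(calib_fit, type = "response")
ord <- order(dat$p_k)
plot(dat$p_k[ord], dat$obs_k[ord], type = "l",
     xlab = "Predicted probability", ylab = "Observed probability")
abline(0, 1, lty = 2)                                      # perfect calibration
```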
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary of heritability and prediction performance in rats. The table shows the number of rats used in the prediction, number of genes predicted per model (R2>0.01), the average prediction performance R2 (after filtering R2
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description:
This replication package contains all datasets and scripts used in the study From App Features to Explanation Needs: Analyzing Correlations and Predictive Potential. The study investigates the relationships between app features and users' explanation needs, combining correlation analysis and predictive modeling techniques such as logistic regression. The dataset comprises 4,495 user reviews from the Google Play Store and the Apple App Store, each annotated for explanation needs and enriched with detailed app metadata (e.g., genre, ratings, age restriction, in-app purchases, and more). The original annotated gold-standard dataset is available here: https://doi.org/10.5281/zenodo.11522828
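As a rough illustration of the predictive-modelling part, a logistic regression relating the explanation-need label to app metadata could be sketched in R as below; the file and column names (reviews_with_metadata.csv, need_explanation, genre, rating, age_restriction, in_app_purchases) are assumptions, not the package's actual schema.

```r
# Hedged sketch: logistic regression of explanation needs on app features.
reviews <- read.csv("reviews_with_metadata.csv")

fit <- glm(need_explanation ~ genre + rating + age_restriction + in_app_purchases,
           data = reviews, family = binomial)

summary(fit)      # which app features are associated with explanation needs
exp(coef(fit))    # odds ratios for easier interpretation
```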
The package includes:
A detailed README file is provided, explaining the folder structure, dataset contents, and the purpose of each script to ensure reproducibility.
This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Grant No.: 470146331, project softXplain (2022–2025).
If you use this resource, please cite the following publication:
Obaidi, M., Qengaj, K., Droste, J., Deters, H., Herrmann, M., Klünder, J., Schmid, E., Schneider, K. (2025). From App Features to Explanation Needs: Analyzing Correlations and Predictive Potential. 2025 IEEE 33rd International Requirements Engineering Workshop (REW).
Unless otherwise stated, this dataset and all associated resources are provided under the Creative Commons Attribution 4.0 International License (CC BY 4.0).
For questions or further information, please contact Martin Obaidi (martin.obaidi@inf.uni-hannover.de) or the corresponding authors listed in the publication.
Lake temperature is an important environmental metric for understanding habitat suitability for many freshwater species and is especially useful when temperatures are predicted throughout the water column (known as temperature profiles). In this data release, multiple modeling approaches were used to generate predictions of daily temperature profiles for thousands of lakes in the Midwest.
Predictions were generated using two modeling frameworks: a machine learning model (specifically an entity-aware long short-term memory or EA-LSTM model; Kratzert et al., 2019) and a process-based model (specifically the General Lake Model or GLM; Hipsey et al., 2019). Both the EA-LSTM and GLM frameworks were used to generate lake temperature predictions in the contemporary period (1979-04-12 to 2022-04-11 for EA-LSTM and 1980-01-01 to 2021-12-31 for GLM; times differ due to modeling spin-up/spin-down configurations) using the North American Land Data Assimilation System [NLDAS; Mitchell et al., 2004] as meteorological drivers. In addition, GLM was used to generate lake temperature predictions under future climate scenarios (covering 1981-2000, 2040-2059, and 2080-2099) using six dynamically downscaled Global Climate Models (GCM; Notaro et al., 2018) as meteorological drivers. Appropriate application of the six GCMs is dependent on the use-case and will be up to the user to determine. For an example of a similar analysis in the Midwest and Great Lakes region using 31 GCMs, see Byun and Hamlet, 2018.
The modeling frameworks and driver datasets have slightly different footprints and input data requirements. This means that some of the lakes do not meet the criteria to be included in all three modeling approaches, which results in different numbers of lakes in the output (noted in the file descriptions below). The input data requirements for lakes to be included in the EA-LSTM predictions are lake latitude, longitude, elevation, and surface area, plus NLDAS drivers at the lake's location. All 62,966 lakes included in this data release met these requirements. The input data requirements for lakes to be included in the contemporary GLM NLDAS-driven predictions are lake location (within one of the following 11 states: North Dakota, South Dakota, Iowa, Michigan, Indiana, Illinois, Wisconsin, Minnesota, Missouri, Arkansas, and Ohio), latitude, longitude, maximum depth (though more detailed hypsography was used where available), surface area, and a clarity estimate, plus NLDAS drivers at the lake's location. 12,688 lakes included in this data release met these requirements. The input data requirements for lakes to be included in the future climate scenario GCM-driven predictions were the same as for the contemporary GLM predictions, except GCM drivers at the lake's location were required in place of NLDAS drivers. 11,715 lakes included in this data release met these requirements.
This data release includes the following files:
This work was completed with funding support from the Midwest Climate Adaptation Science Center (MW CASC) and as part of the USGS project on Predictive Understanding of Multiscale Processes (PUMP), an element of the Integrated Water Prediction Program, supported by the Water Availability and Use Science Program to advance multi-scale, integrated modeling capabilities to address water resource issues. Access to computing facilities was provided by USGS Advanced Research Computing, USGS Tallgrass Supercomputer (doi.org/10.5066/F7D798MJ).
Open Government Licence - Canada 2.0https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
Description: Data on recreational boating are needed for marine spatial planning initiatives in British Columbia (BC). Vessel traffic data are typically obtained by analyzing automatic identification system (AIS) vessel tracking data, but recreational vessels are often omitted or underrepresented in AIS data because they are not required to carry AIS tracking devices. Transport Canada's National Aerial Surveillance Program (NASP) conducted aerial surveys to collect information on recreational vessels along several sections of the BC coast between 2018 and 2022. Recreational vessel sightings were modeled against predictor variables (e.g., distance to shore, water depth, and distance to and density of marinas) to predict the number of recreational vessels along coastal waters of BC.

The files included here are:
--A Geodatabase (‘Recreational_Boating_Data_Model’), which includes: (1) recreational vessel sightings data collected by NASP in BC and used in the recreational vessel traffic model (‘Recreational_Vessels_PointData_BC’); (2) an aerial survey effort (number of aerial surveys) raster dataset (‘surveyeffort’); and (3) a vector grid dataset (2.5 km resolution) containing the predicted number of recreational vessels per cell and the predictor variables (‘Recreational_Boating_Model_Results_BC’).
--A Scripts folder, which includes an R Markdown file with R code to run the modelling analysis (‘Recreational_Boating_Model_R_Script’) and the data used to run the code.

Methods: Data on recreational vessels were collected by NASP during planned aerial surveys along pre-determined routes along the BC coast from 2018 to 2022. Data on non-AIS recreational vessels were collected using video cameras onboard the aircraft, and data on AIS recreational vessels using an AIS receiver also onboard the aircraft. Recreational boating predictors explored were: water depth, distance to shore, distance to marinas, density of marinas, latitude, and longitude. Recreational vessel traffic models were fitted using Generalized Linear Models (GLM). R packages and libraries used here include AED (Roman Lustrik, 2021), MASS (Venables & Ripley, 2002), and pscl (Zeileis, Kleiber, and Jackman, 2008) for the zeroinfl() and hurdle() functions. The final model was selected based on Akaike's information criterion (AIC) and the Bayesian information criterion (BIC). An R Markdown file with the code used to run this analysis is included in the data package in a folder called Script.

Spatial Predictive Model: The selected model, ZINB, consists of two parts: one with a binomial process that predicts the probability of encountering a recreational vessel, and a second part that predicts the number of recreational vessels via a count model. The closer to shore and to marinas, and the higher the density of marinas, the higher the predicted number of recreational vessels. The probability of encountering recreational vessels is driven by water depth and distance to shore. For more information on methodology, consult the metadata pdf available with the Open Data record.

References: Serra-Sogas, N. et al. 2021. Using aerial surveys to fill gaps in AIS vessel traffic data to inform threat assessments, vessel management and planning. Marine Policy 133: 104765. https://doi.org/10.1016/j.marpol.2021.104765

Data Sources: Recreational vessel sightings and survey effort: Data collected by NASP and analyzed by Norma Serra to extract vessel information and survey effort (for more information on how these data were analyzed, see Serra-Sogas et al., 2021).
Bathymetry data for the whole BC coast (waters within the Canadian EEZ only) were provided by DFO – Science (Selina Agbayani); the data layer was provided as a raster file of 100 m resolution. The coastline dataset used to estimate distance to shore and to clip the grid was provided by DFO – Science (Selina Agbayani) and created by David Williams and Yuriko Hashimoto (DFO – Oceans). The marinas dataset was provided by DFO – Science (Selina Agbayani) and created by Josie Iacarella (DFO – Science); it includes large and medium size marinas and fishing lodges. The data can be downloaded from here: Floating Structures in the Pacific Northwest - Open Government Portal (https://open.canada.ca/data/en/dataset/049770ef-6cb3-44ee-afc8-5d77d6200a12)

Uncertainties: Model results are based on recreational vessels sighted by NASP and their related predictor variables and might not always reflect real-world vessel distributions. Any biases caused by the opportunistic nature of the NASP surveys were minimized by using survey effort as an offset variable.
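The ZINB structure described above can be sketched with pscl::zeroinfl() as follows; the data frame and variable names are assumptions based on the predictors listed in the methods, with survey effort entering the count part as an offset.

```r
# Hedged sketch: zero-inflated negative binomial model of vessel counts per grid cell.
library(pscl)

zinb_fit <- zeroinfl(
  n_vessels ~ dist_shore + dist_marina + marina_density + offset(log(survey_effort)) |
              depth + dist_shore,                 # count part | zero (binomial) part
  data = grid_cells,
  dist = "negbin"
)

summary(zinb_fit)
grid_cells$predicted <- predict(zinb_fit, type = "response")
```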
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Example data for the paper: Machine learning reveals the diversity of human 3D chromatin contact patterns
GitHub: https://github.com/erin-n-gilbertson/3DGenome-diversity/tree/main
biorXiv: https://www.biorxiv.org/content/10.1101/2023.12.22.573104v1.full
Manuscript accepted at Molecular Biology and Evolution
Of primary interest are the example genome-wide predictions for the hg38 reference, the human-archaic hominin ancestor, and the most divergent 1KG individual per genome, along with the Jupyter notebook tutorial for making your own Akita predictions given any input 1 Mb sequence.
We conducted a macroscale study of 2,210 shallow lakes (mean depth ≤ 3 m or maximum depth ≤ 5 m) in the Upper Midwestern and Northeastern U.S. We asked: What are the patterns and drivers of shallow lake total phosphorus (TP), chlorophyll a (CHLa), and TP–CHLa relationships at the macroscale, how do these differ from those for 4,360 non-shallow lakes, and do results differ by hydrologic connectivity class? To answer this question, we assembled the LAGOS-NE Shallow Lakes dataset described herein, a dataset derived from the existing LAGOS-NE, LAGOS-DEPTH, and LAGOS-CLIMATE datasets. Response variables were the median of available summer (i.e., 15 June to 15 September) values of total phosphorus (TP) and chlorophyll a (CHLa). Predictor variables were assembled at two spatial scales for incorporation into hierarchical models. At the local or lake-specific scale (including the individual lake and its inter-lake watershed [iws] or corresponding HU12 watershed), variables included those representing land use/cover, hydrology, climate, morphometry, and acid deposition. At the regional scale (e.g., HU4 watershed), variables included a smaller set of predictor variables for hydrology and land use/cover. The dataset also includes the unique identifier assigned by LAGOS-NE (lagoslakeid); the latitude and longitude of the study lakes; their maximum and mean depths, along with a depth classification of Shallow or non-Shallow; connectivity class (i.e., whether a lake was classified as connected (with inlets and outlets) or unconnected (lacking inlets)); and the zone id for the HU4 to which each lake belongs. Along with the database, we provide the R scripts for the hierarchical models predicting TP or CHLa (TPorCHL_predictive_model.R) and the TP–CHLa relationship (TP_CHL_CSI_Model.R) for depth and connectivity subsets of the study lakes.
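As a hedged illustration of the kind of two-level hierarchical TP–CHLa model described above (lake-level relationships varying by HU4 region), a mixed-model sketch with lme4 is shown below; the archived TPorCHL_predictive_model.R and TP_CHL_CSI_Model.R scripts are the authoritative implementations, and the variable names chla, tp, and hu4_zoneid are assumptions.

```r
# Hedged sketch: TP-CHLa relationship with region-specific intercepts and slopes.
library(lme4)

fit <- lmer(log(chla) ~ log(tp) + (1 + log(tp) | hu4_zoneid), data = shallow_lakes)

summary(fit)
coef(fit)$hu4_zoneid   # HU4-specific intercepts and TP-CHLa slopes
```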
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Script outputs for the paper "Evidence-based guidelines for developing automated conservation assessment methods".
The code used to generate these outputs can be found on GitHub.
To use these outputs, download the code, download this dataset and extract the dataset in the project folder. Some outputs are in the RData format, including all of the trained models. To view these files in R you may need to install the packages listed in the README of the GitHub project.
The outputs are arranged in this file structure:
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
README.txt
Maintenance example belonging to:
The MANTIS Book: Cyber Physical System Based Proactive Collaborative Maintenance
Chapter 9, The Future of Maintenance (2019).
Lambert Schomaker, Michele Albano, Erkki Jantunen, Luis Lino Ferreira
River Publishers (DK)
ISBN: 9788793609853, e-ISBN: 9788793609846, https://doi.org/10.13052/rp-9788793609846
The figure .pdf did not make it into the book. Here are the raw data, processed logs, and the .gnu script to produce it.
Data: event logs on disk failure in two racks of a huge RAID disk system (2009-2016).
disks1.raw
disks2.raw
Event logs to RC-filtered time series:
RC-filt-disks-log.c
do-RC-filter-to-make-spikes-more-visible (bash script)
-->
disks1.log
disks2.log
Constant (horizontal line) indicating the level where users experienced system-down time
Disrupted-operations-threshold
disk-replacement-log.gnu
disk-replacement-log.pdf
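For readers without a C compiler at hand, the RC-filtering idea implemented in RC-filt-disks-log.c can be sketched in a few lines of R as a first-order recursive (exponential) filter; the smoothing constant used here is illustrative, not the one used for the book figure.

```r
# Hedged sketch: RC (exponential) filtering of a 0/1 event series to make
# clusters of disk failures visible as decaying spikes.
rc_filter <- function(events, alpha = 0.1) {
  # y[t] = alpha * x[t] + (1 - alpha) * y[t-1]
  as.numeric(stats::filter(alpha * events, filter = 1 - alpha, method = "recursive"))
}

# Synthetic example: a daily failure-event series
events <- rep(0, 365)
events[c(40, 42, 43, 200, 300, 301)] <- 1
plot(rc_filter(events), type = "l", xlab = "Day", ylab = "RC-filtered failure signal")
```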
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Modeling Face Recognition in the Predictive Coding Framework: A Combined Computational Modeling and Functional Imaging Study by Zaragoza, Niehaus, et al. 2022 (under review)
This repository contains ONLY the fMRI raw data and the associated code scripts for both the level 1 and level 2 SPM analyses of the connected publication. Note that this code was run on Windows 10 using MATLAB. For additional links see below:
In-principle-accepted Stage 1 Pre-registration: https://doi.org/10.17605/OSF.IO/A8VU7
In-principle-accepted Stage 2 modeling scripts, behavioral data, and preprint: https://osf.io/tye24/
This is the fMRI Dataset for the FRAPPS study (bids format).
The study consists of two tasks: an identity learning task ('FRAPPS') and a face localizer task ('localizer')
defacing of anatomical images was performed with pydeface 2.0.0!!!
Functional data are stored as a 4D file including all scans (e.g. sub-01_task-frapps_bold.nii.gz) as well as without the 4 dummy scans (e.g. sub-01_task-frapps_bold-dummy-scans.nii). The version without dummy scans was used for all further analyses.
Exclusion of participants: due to extensive motion (mriqc FD parameter > 0.3) two subjects (sub-16 and sub-26) were excluded. Additionally, we excluded sub-21 because of low performance in the FRAPPS task (< 60%). We also excluded sub-4, sub-18 and sub-28 from all fMRI analyses assessing the influence of the parametric modulators on brain activation.
onsets and durations of both tasks are described in files called subID_spm_time.csv (for FRAPPS task) and subID-fp_no_Eyelink_onsets.mat (for the localizer task). The directory is subID/func/onsets_and_durations/. In this folder, there are also the multiple condition files for the first-level analysis with and without parametric modulations. They are called for example: sub-01_multi_cond_VIEW-INDxCONTEXT_pe_VI.mat
Computational model parameters for the parametric modulation can be found in folder subID/func/model_parameters/
Notes on preprocessing (FRAPPS): For slice timing, TA is not relevant and is not used when slice timings are entered; TA was therefore set to 0, and the reference slice timing was set to TR/2 (1.53/2).
Notes on the first level (FRAPPS): Slice time correction was performed; thus, for the first-level model the microtime resolution was changed to the number of slices divided by two because of the multiband factor of 2 (= 24), and the microtime onset needs to match the chosen reference slice, so it was set to 12.
Notes on the mean beta calculation (FRAPPS): sub-4, sub-18, and sub-28 have extreme beta values (around -1000) for the history VI pmod. This is most probably caused by a history VI parameter that changes only very slightly.
Localizer coordinates at group level (conjunction) (given as a reference only; of note, for OFA, FFA, and pSTS we used the group coordinates derived from Thome et al., 2022, NeuroImage, to locate the individual maxima)
lOFA -42 -84 -10 t = 1.82 rOFA 48 -74 -6 t = 3.86
lFFA -42 -46 -18 t = 7.32 rFFA 42 -46 -16 t = 6.80
lpSTS -58 -54 8 t = 5.55 rpSTS 46 -54 14 t = 6.25
lPPA -22 -46 -8 t = 17.57 rPPA 24 -46 -10 t = 16.95
Modelling the spread of introduced ecosystem engineers is a conservation priority due to their potential to cause irreversible ecosystem-level changes. While existing models predict potential distributions and spread capacities, new approaches that simulate the trajectory of a species' spread over time are needed. We developed novel simulations that predict spatial and temporal spread, capturing both continuous diffusion-dispersal and occasional long-distance leaps. We focused on the introduced population of Superb Lyrebird (Menura novaehollandiae) in Tasmania, Australia. Initially introduced as an insurance population, lyrebirds have become novel bioturbators, spreading across key natural areas and becoming "unwanted but challenging to eradicate". Using multi-scale ecological data, our research (1) identified broad and fine-scale correlates of lyrebird occupation and (2) developed a spread simulation guided by a pattern-oriented framework. This occurrence-based modelling framework is u...

This dataset integrates fine- and broad-scale ecological data collected between 1970 and 2023. Fine-scale data was obtained from 210 camera trap sites in Tasmania, recording vegetation structure and lyrebird detections. Broad-scale data was compiled from citizen science records via the Atlas of Living Australia, combined with environmental predictors such as climatic variables, elevation, and land-use data. Fine-scale habitat data was derived from camera trap detections, with vegetation classified into dense or sparse categories using a 3×3 grid overlay method. Broad-scale occurrence records were filtered for spatial and temporal accuracy, and pseudo-absence data was generated based on effort-controlled absence criteria. Predictor variables for Species Distribution Models (SDMs) were z-transformed, and categorical variables (e.g., vegetation, land-use types) were aggregated into broader classes. Spread simulations used a stochastic grid-cell model with parameters calibrated via a patter...

# A pattern-oriented simulation for forecasting species spread through time and space: A case study on an ecosystem engineer on the move
https://doi.org/10.5061/dryad.xsj3tx9rg
This dataset supports the study on habitat suitability and spread modelling of the Superb Lyrebird (Menura novaehollandiae) in Tasmania. The study applies a sequential framework:
The dataset includes fine-scale vegetation structure, occurrence records, environmental predictors, SDM performance outputs, and simulation scripts.
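To illustrate the kind of stochastic grid-cell spread simulation described above (local diffusion plus occasional long-distance leaps), a heavily simplified R sketch is given below; the parameters and the suitability surface are illustrative and do not reproduce the calibrated simulation in the archived scripts.

```r
# Hedged sketch: stochastic spread over a gridded suitability surface.
simulate_spread <- function(suitability, start_cell, n_steps = 50,
                            p_local = 0.3, p_leap = 0.01, leap_max = 10) {
  occ <- matrix(FALSE, nrow(suitability), ncol(suitability))
  occ[start_cell[1], start_cell[2]] <- TRUE
  for (step in seq_len(n_steps)) {
    cells <- which(occ, arr.ind = TRUE)
    for (i in seq_len(nrow(cells))) {
      # Local diffusion into neighbouring cells, weighted by habitat suitability
      for (dr in -1:1) for (dc in -1:1) {
        r <- cells[i, 1] + dr; c <- cells[i, 2] + dc
        if (r >= 1 && r <= nrow(occ) && c >= 1 && c <= ncol(occ) &&
            runif(1) < p_local * suitability[r, c]) occ[r, c] <- TRUE
      }
      # Occasional long-distance leap
      if (runif(1) < p_leap) {
        r <- cells[i, 1] + sample(-leap_max:leap_max, 1)
        c <- cells[i, 2] + sample(-leap_max:leap_max, 1)
        if (r >= 1 && r <= nrow(occ) && c >= 1 && c <= ncol(occ) &&
            runif(1) < suitability[r, c]) occ[r, c] <- TRUE
      }
    }
  }
  occ
}

# Example on a random 100 x 100 suitability surface
suit     <- matrix(runif(100 * 100), 100, 100)
occupied <- simulate_spread(suit, start_cell = c(50, 50))
```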
https://spdx.org/licenses/CC0-1.0.html
This dataset contains simulated datasets, empirical data, and R scripts described in the paper: “Li, Q. and Kou, X. (2021) WiBB: An integrated method for quantifying the relative importance of predictive variables. Ecography (DOI: 10.1111/ecog.05651)”.
A fundamental goal of scientific research is to identify the underlying variables that govern crucial processes of a system. Here we proposed a new index, WiBB, which integrates the merits of several existing methods: a model-weighting method from information theory (Wi), a standardized regression coefficient method measured by β* (B), and the bootstrap resampling technique (B). We applied the WiBB to simulated datasets with known correlation structures, for both linear models (LM) and generalized linear models (GLM), to evaluate its performance. We also applied two other methods, the relative sum of weights (SWi) and the standardized beta (β*), to evaluate their performance in ranking predictor importance under various scenarios, in comparison with the WiBB method. We further applied it to an empirical dataset of the plant genus Mimulus to select bioclimatic predictors of species' presence across the landscape. Results on the simulated datasets showed that the WiBB method outperformed the β* and SWi methods in scenarios with small and large sample sizes, respectively, and that the bootstrap resampling technique significantly improved the discriminant ability. When testing WiBB on the empirical dataset with GLM, it sensibly identified four important predictors with high credibility out of six candidates in modeling the geographical distributions of 71 Mimulus species. This integrated index has great advantages in evaluating predictor importance and hence reducing the dimensionality of data, without losing interpretive power. The simplicity of calculating the new metric, compared with more sophisticated statistical procedures, makes it a handy method in the statistical toolbox.
Methods: To simulate independent datasets (size = 1000), we adopted Galipaud et al.'s (2014) approach with custom modifications of the data.simulation function, which used the multivariate normal distribution function rmvnorm in the R package mvtnorm (v1.0-5; Genz et al. 2016). Each dataset was simulated with a preset correlation structure between a response variable (y) and four predictors (x1, x2, x3, x4). The first three (genuine) predictors were set to be strongly, moderately, and weakly correlated with the response variable, respectively (denoted by large, medium, and small Pearson correlation coefficients, r), while the correlation between the response and the last (spurious) predictor was set to be zero. We simulated datasets with three levels of differences between the correlation coefficients of consecutive predictors, where ∆r = 0.1, 0.2, 0.3, respectively. These three levels of ∆r resulted in three correlation structures between the response and the four predictors: (0.3, 0.2, 0.1, 0.0), (0.6, 0.4, 0.2, 0.0), and (0.8, 0.6, 0.3, 0.0), respectively. We repeated the simulation procedure 200 times for each of the three preset correlation structures (600 datasets in total) for later LM fitting. For GLM fitting, we modified the simulation procedure with additional steps, in which we converted the continuous response into binary data O (e.g., occurrence data having 0 for absence and 1 for presence). We tested the WiBB method, along with two other methods, the relative sum of weights (SWi) and the standardized beta (β*), to evaluate the ability to correctly rank predictor importance under various scenarios. The empirical dataset of 71 Mimulus species was compiled from occurrence coordinates and the corresponding values extracted from climatic layers of the WorldClim dataset (www.worldclim.org), and we applied the WiBB method to infer important predictors of their geographical distributions.
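The simulation and the two baseline importance measures can be sketched in R as follows; this illustrates the ingredients (a simulated correlation structure, standardized betas, and Akaike weights summed per predictor, i.e., SWi) rather than the exact WiBB computation, which follows the paper and the archived scripts.

```r
# Hedged sketch: simulate one dataset with the (0.6, 0.4, 0.2, 0.0) structure and
# compute standardized betas and the relative sum of Akaike weights (SWi).
library(mvtnorm)

r     <- c(0.6, 0.4, 0.2, 0.0)
sigma <- diag(5)
sigma[1, 2:5] <- sigma[2:5, 1] <- r                  # correlations of y with x1..x4
dat <- as.data.frame(rmvnorm(1000, mean = rep(0, 5), sigma = sigma))
names(dat) <- c("y", "x1", "x2", "x3", "x4")

b_star <- coef(lm(y ~ ., data = dat))[-1]            # standardized betas (unit-scale data)

preds  <- c("x1", "x2", "x3", "x4")
combos <- c(list(character(0)),
            unlist(lapply(1:4, function(m) combn(preds, m, simplify = FALSE)),
                   recursive = FALSE))
fits <- lapply(combos, function(v)
  lm(reformulate(if (length(v)) v else "1", response = "y"), data = dat))
aic  <- sapply(fits, AIC)
w    <- exp(-0.5 * (aic - min(aic))); w <- w / sum(w)  # Akaike weights
swi  <- sapply(preds, function(p) sum(w[sapply(combos, function(v) p %in% v)]))
```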
Climate change threatens biodiversity by compromising the ability to balance energy and water, influencing animal behaviour, species interactions, distribution, and ultimately survival. Predicting climate change effects on thermal physiology is complicated by interspecific variation in thermal tolerance limits, thermoregulatory behaviour, and heterogeneous thermal landscapes. We develop an approach for assessing thermal vulnerability for endotherms by incorporating behaviour and microsite data into a biophysical model. We parameterised the model using species-specific functional traits and published behavioural data on hotter (maximum daily temperature, Tmax > 35 °C) and cooler days (Tmax < 35 °C). Incorporating continuous time-activity focal observations of behaviour into the biophysical approach reveals that the three insectivorous birds modelled here are at greater risk of lethal hyperthermia than dehydration under climate change, contrary to previous thermal risk assessments. S...

# Data from: Parameters used in the endotherm biophysical model for each species
https://doi.org/10.5061/dryad.zgmsbccnh
We provide a novel approach for assessing thermal vulnerability by incorporating detailed behaviour and microsite data for endotherms exposed to recent and future climates (RCP 8.5, 4.5) into a biophysical model. The model was parameterised using species-specific functional traits and published behavioural data on hotter (maximum daily temperature, *T*max > 35 °C) and cooler days (*T*max < 35 °C). Below we have provided the biophysical traits used in the parameterisation of the model, the customized endotherm model, the operative temperature model, and an example script for predicting thermoregulation in an exposed-on-ground microsite. We have noted below the sequence in which these scripts need to be run.
**Des...
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Example code script showing the parameters used in all AlphaFold2 predictions within the module.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains the following:
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Kaggle’s March Machine Learning Mania competition challenged data scientists to predict winners and losers of the men's 2016 NCAA basketball tournament. This dataset contains the 1070 selected predictions of all Kaggle participants. These predictions were collected and locked in prior to the start of the tournament.
How can this data be used? You can pivot it to look at both Kaggle and NCAA teams alike. You can look at who will win games, which games will be close, which games are hardest to forecast, or which Kaggle teams are gambling vs. sticking to the data.
The NCAA tournament is a single-elimination tournament that begins with 68 teams. There are four games, usually called the “play-in round,” before the traditional bracket action starts. Due to competition timing, these games are included in the prediction files but should not be used in analysis, as it’s possible that the prediction was submitted after the play-in round games were over.
Each Kaggle team could submit up to two prediction files. The prediction files in the dataset are in the 'predictions' folder and named according to:
TeamName_TeamId_SubmissionId.csv
The file format contains a probability prediction for every possible game between the 68 teams. This is necessary to cover every possible tournament outcome. Each team has a unique numerical Id (given in Teams.csv). Each game has a unique Id column created by concatenating the year and the two team Ids. The format is the following:
Id,Pred
2016_1112_1114,0.6
2016_1112_1122,0
...
The team with the lower numerical Id is always listed first. “Pred” represents the probability that the team with the lower Id beats the team with the higher Id. For example, "2016_1112_1114,0.6" indicates team 1112 has a 0.6 probability of beating team 1114.
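As a small usage sketch, the following R snippet reads one submission file and looks up the predicted probability for a matchup in the format described above; the file name is illustrative.

```r
# Hedged sketch: parse a submission and query a matchup probability.
preds <- read.csv("predictions/TeamName_TeamId_SubmissionId.csv",
                  stringsAsFactors = FALSE)

lookup <- function(preds, year, team_a, team_b) {
  lo <- min(team_a, team_b); hi <- max(team_a, team_b)
  # Pred is the probability that the lower-Id team beats the higher-Id team
  preds$Pred[preds$Id == paste(year, lo, hi, sep = "_")]
}

lookup(preds, 2016, 1114, 1112)   # returns 0.6 for the example row "2016_1112_1114,0.6"
```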
For convenience, we have included the data files from the 2016 March Mania competition dataset in the Scripts environment (you may find TourneySlots.csv and TourneySeeds.csv useful for determining matchups, see the documentation). However, the focus of this dataset is on Kagglers' predictions.