Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset consists of two curated subsets designed for the classification of alteration types using geochemical and proxy variables. The traditional dataset (Trad_Train.csv and Trad_Test.csv) is derived directly from the original complete geochemical dataset (alldata.csv), contains no missing values, and retains the original geochemical features, serving as a baseline for model training and evaluation. In contrast, the simulated dataset (proxies_alldata.csv) was generated through custom MATLAB scripts that transform the original geochemical features into proxy variables based on multiple geostatistical realizations. These proxies, expressed on a Gaussian scale, may include negative values due to normalization. The target variable, Alteration, was originally encoded as integers using the mapping: 1 = AAA, 2 = IAA, 3 = PHY, 4 = PRO, 5 = PTS, and 6 = UAL. The simulated proxy data were split into the simulated train and test files (Simu_Train.csv and Simu_Test.csv) using an encoded flag marking training rows (=1) and testing rows (=2). All supporting files (including datasets, intermediate outputs such as PNGs and variograms, proxy outputs, and an executable for the confidence analysis routines) are included in the repository; the source code is hosted in a GitHub repository. Specifically, the FinalMatlabFiles.zip archive contains the raw input file alldata.csv used to generate proxies_alldata.csv, as well as Analysis1.csv and Analysis2.csv for performing confidence analysis. To run the executable files in place of the .m scripts in MATLAB, users must install the MATLAB Runtime 2023b for Windows 64-bit, available at: https://ssd.mathworks.com/supportfiles/downloads/R2023b/Release/10/deployment_files/installer/complete/win64/MATLAB_Runtime_R2023b_Update_10_win64.zip.
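For illustration, a minimal pandas sketch of working with the encoded target and the simulated train/test flag described above follows. The target column name Alteration is taken from the description; the name of the train/test flag column is not specified, so Split below is a hypothetical placeholder.

```python
# Minimal sketch (assumes pandas is available and that the target column is
# literally named "Alteration"; the name of the train/test flag column in
# proxies_alldata.csv is NOT given in the description, so "Split" is a
# hypothetical placeholder).
import pandas as pd

ALTERATION_LABELS = {1: "AAA", 2: "IAA", 3: "PHY", 4: "PRO", 5: "PTS", 6: "UAL"}

# Decode the integer-encoded target in the traditional training subset.
trad_train = pd.read_csv("Trad_Train.csv")
trad_train["Alteration_label"] = trad_train["Alteration"].map(ALTERATION_LABELS)

# Reproduce the simulated train/test split from the encoded flag described
# above (=1 training, =2 testing); "Split" is an assumed column name.
proxies = pd.read_csv("proxies_alldata.csv")
simu_train = proxies[proxies["Split"] == 1]
simu_test = proxies[proxies["Split"] == 2]
```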
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
In the pharmaceutical industry it is common to generate many QSAR models from training sets containing a large number of molecules and a large number of descriptors. The best QSAR methods are those that can generate the most accurate predictions but that are not overly expensive computationally. In this paper we compare eXtreme Gradient Boosting (XGBoost) to random forest and single-task deep neural nets on 30 in-house data sets. While XGBoost has many adjustable parameters, we can define a set of standard parameters at which XGBoost makes predictions, on the average, better than those of random forest and almost as good as those of deep neural nets. The biggest strength of XGBoost is its speed. Whereas efficient use of random forest requires generating each tree in parallel on a cluster, and deep neural nets are usually run on GPUs, XGBoost can be run on a single CPU in less than a third of the wall-clock time of either of the other methods.
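As a rough illustration of such a comparison, the sketch below fits XGBoost and a random forest on synthetic regression data. The parameter values are generic placeholders chosen for demonstration, not the "standard" XGBoost settings defined in the paper, and the in-house data sets are not reproduced.

```python
# Illustrative sketch only: the paper's in-house data sets and its exact
# "standard" XGBoost parameters are not given here, so the values below are
# generic defaults chosen for demonstration, not the authors' settings.
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=2000, n_features=200, noise=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

xgb_model = xgb.XGBRegressor(
    n_estimators=500, learning_rate=0.05, max_depth=6, subsample=0.8,
    colsample_bytree=0.8, n_jobs=1,  # single CPU, as emphasized in the abstract
)
rf_model = RandomForestRegressor(n_estimators=500, n_jobs=-1, random_state=0)

for name, model in [("XGBoost", xgb_model), ("Random forest", rf_model)]:
    model.fit(X_tr, y_tr)
    print(name, "R2:", round(r2_score(y_te, model.predict(X_te)), 3))
```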
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Semi-flexible docking was performed using AutoDock Vina 1.2.2 software on the SARS-CoV-2 main protease Mpro (PDB ID: 6WQF).
Two data sets are provided in the xyz format containing the AutoDock Vina docking scores. These files were used as input and/or reference in the machine learning models using TensorFlow, XGBoost, and SchNetPack to study their docking-score prediction capability. The first data set originally contained 60,411 in-vivo labeled compounds selected for the training of ML models. The second data set, denoted as in-vitro-only, originally contained 175,696 compounds active or assumed to be active at 10 μM or less in a direct binding assay. These sets were downloaded on the 10th of December 2021 from the ZINC15 database. Four compounds in the in-vivo set and 12 in the in-vitro-only set were left out of consideration due to the presence of Si atoms. Compounds with no charges assigned in mol2 files were excluded as well (523 compounds in the in-vivo set and 1,666 in the in-vitro-only set). Gasteiger charges were reassigned to the remaining compounds using OpenBabel. In addition, four in-vitro-only compounds with docking scores greater than 1 kcal/mol were rejected.
The provided in-vivo and the in-vitro-only sets contain 59,884 (in-vivo.xyz) and 174,014 (in-vitro-only.xyz) compounds, respectively. Compounds in both sets contain the following elements: H, C, N, O, F, P, S, Cl, Br, and I. The in-vivo compound set was used as the primary data set for the training of the ML models in the referencing study.
The file in-vivo-splits-data.csv contains the exact composition of all (random) 80-5-15 train-validation-test splits used in the study, labeled I, II, III, IV, and V. Eight additional random subsets in each of the in-vivo 80-5-15 splits were created to monitor the convergence of the training process. These subsets were constructed in such a manner that each subset contains all compounds from the previous subset (starting with the 10-5-15 subset) and was enlarged by one eighth of the entire (80-5-15) train set of a given split. These subsets are further referred to as in_vivo_10_(I, II, ..., V), in_vivo_20_(I, II, ..., V), ..., in_vivo_80_(I, II, ..., V).
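For illustration only, the sketch below shows how nested train subsets of the kind described above could be constructed from a random permutation of compound indices. The actual subset compositions are recorded in in-vivo-splits-data.csv; the index-only construction here is a stand-in under stated assumptions.

```python
# Sketch of the nested-subset construction described above (assumptions:
# compounds are identified by index only; the actual in-vivo-splits-data.csv
# column layout is not specified here).
import numpy as np

rng = np.random.default_rng(seed=0)
n_compounds = 59_884                      # size of the provided in-vivo set
indices = rng.permutation(n_compounds)

n_train = int(0.80 * n_compounds)         # 80-5-15 split: 80% train
n_val = int(0.05 * n_compounds)
train_idx = indices[:n_train]
val_idx = indices[n_train:n_train + n_val]
test_idx = indices[n_train + n_val:]

# Nested subsets: the 10% subset is contained in the 20% subset, and so on,
# each enlarged by one eighth of the full 80% train set.
step = n_train // 8
nested_subsets = {f"in_vivo_{10 * (k + 1)}": train_idx[: step * (k + 1)]
                  for k in range(8)}
```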
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository includes the RNA-seq dataset from 27 GBM samples, as published in this manuscript:
Topographic mapping of the glioblastoma proteome reveals a triple axis model of intra-tumoral heterogeneity
Lam KHB, Leon AJ, Hui W, Lee SCE, Batruch I, Faust K, Koritzinsky M, Richer M, Djuric U, Diamandis P (under review)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Evaluation of parameters for the XGBoost models of different training and test sets for COVID-19 deaths.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data package presents forcing data, model code, and model output for classical machine learning models that predict monthly stream water temperature, as presented in the manuscript 'Stream Temperature Predictions for River Basin Management in the Pacific Northwest and Mid-Atlantic Regions Using Machine Learning', Water (Weierbach et al., 2022). Specifically, for input forcing datasets we include two files, each generated using the BASIN-3D data integration tool (Varadharajan et al., 2022), for stations in the Pacific Northwest and Mid-Atlantic hydrologic regions. Model code (written in Python using Jupyter notebooks) includes notebooks for data preprocessing, for training Multiple Linear Regression, Support Vector Regression, and Extreme Gradient Boosted Tree models, and for analysis of model output. We also include specific model output files, provided in HDF5 format, that represent the modeling configurations presented in the manuscript. Together, these data make up the workflow for predictions across three scenarios (single station, regional, and predictions in unmonitored basins) presented in the manuscript and allow for reproducibility of the modeling procedures.
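A minimal sketch of this train-and-export workflow on synthetic stand-in data is given below; the features, file names, and model settings are placeholders rather than the manuscript's actual configuration.

```python
# Minimal sketch (synthetic stand-in data; the actual BASIN-3D forcing files,
# feature names, and tuned configurations from the manuscript are not
# reproduced here).
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))            # e.g. air temperature, precipitation, ...
y = 10 + 0.5 * X[:, 0] + rng.normal(scale=0.5, size=500)  # placeholder stream temperature
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "MLR": LinearRegression(),
    "SVR": SVR(kernel="rbf"),
    "XGB": XGBRegressor(n_estimators=300, learning_rate=0.05),
}
predictions = pd.DataFrame({name: m.fit(X_tr, y_tr).predict(X_te)
                            for name, m in models.items()})

# Store model output in HDF5, matching the format mentioned above
# (pandas' HDF5 writer requires the optional PyTables dependency).
predictions.to_hdf("model_output.h5", key="predictions", mode="w")
```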
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is part of a machine learning project focused on predicting rainfall, a critical task for sectors like agriculture, water resource management, and disaster prevention. The project employs machine learning algorithms to forecast rainfall occurrences based on historical weather data, including features like temperature, humidity, and pressure.
The primary goal of the dataset is to train multiple machine learning models to predict rainfall and compare their performances. The insights gained will help identify the most accurate models for real-world predictions of rainfall events.
The dataset is derived from various historical weather observations, including temperature, humidity, wind speed, and pressure, collected by weather stations across Australia. These observations are used as inputs for training machine learning models. The dataset is publicly available on platforms like Kaggle and is often used in competitions and research to advance predictive analytics in meteorology.
The dataset consists of weather data from multiple Australian weather stations, spanning various time periods. Key features include:
Temperature
Humidity
Wind Speed
Pressure
Rainfall (target variable)
These features are tracked for each weather station over different times, with the goal of predicting rainfall.
Python: The primary programming language for data analysis and machine learning.
scikit-learn: For implementing machine learning models.
XGBoost, LightGBM, and CatBoost: Popular libraries for building more advanced ensemble models.
Matplotlib/Seaborn: For data visualization.
These libraries and tools help in data manipulation, modeling, evaluation, and visualization of results.
DBRepo Authorization: Required to access datasets via the DBRepo API for dataset retrieval.
Model Comparison Charts: The project includes output charts comparing the performance of seven popular machine learning models.
Trained Models (.pkl files): Pre-trained models are saved as .pkl files for reuse without retraining (a minimal training-and-saving sketch follows this list).
Documentation and Code: A Jupyter notebook guides through the process of data analysis, model training, and evaluation.
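Tying together the tools and outputs listed above, the following minimal sketch trains one model on synthetic stand-in data and saves it as a .pkl file; the features and settings are placeholders, not the project's actual pipeline (which retrieves data via the DBRepo API).

```python
# Minimal sketch of training one model and saving it as a .pkl file for reuse
# (synthetic stand-in data; the actual Australian weather features and the
# DBRepo retrieval step are not reproduced here).
import pickle
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 4))            # e.g. temperature, humidity, wind speed, pressure
y = (X[:, 1] + rng.normal(size=2000) > 0).astype(int)   # placeholder rain / no-rain label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
model = XGBClassifier(n_estimators=300, learning_rate=0.1, eval_metric="logloss")
model.fit(X_tr, y_tr)
print("Accuracy:", accuracy_score(y_te, model.predict(X_te)))

# Persist the trained model as a .pkl file so it can be reloaded without retraining.
with open("xgboost_rainfall.pkl", "wb") as fh:
    pickle.dump(model, fh)
```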
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Fastest training times for CNN and XGBoost on CPU and GPU (all features).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Models and Predictions
This dataset contains the trained XGBoost and EA-LSTM models and the models' predictions for the paper The Proper Care and Feeding of CAMELS: How Limited Training Data Affects Streamflow Prediction.
For each combination of model (XGBoost, EA-LSTM), training years (3, 6, 9), number of basins (13, 26, 53, 265, 531), and seed (111-888), there are five folders. Each corresponds to a random basin sample (for 531 basins there's only one folder, since it's all basins).
In each folder, there are two files:
In addition to each folder, there is a SLURM submission script that was used to create and evaluate the model in the folder.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The use of deep learning (DL) is steadily gaining traction in scientific challenges such as cancer research. Advances in enhanced data generation, machine learning algorithms, and compute infrastructure have led to an acceleration in the use of deep learning in various domains of cancer research, such as drug response problems. In our study, we explored tree-based models to improve the accuracy of a single drug response model and demonstrate that tree-based models such as XGBoost (eXtreme Gradient Boosting) have advantages over deep learning models, such as a convolutional neural network (CNN), for single drug response problems. However, comparing models is not a trivial task. To make training and comparing CNNs and XGBoost more accessible to users, we developed an open-source library called UNNT (A novel Utility for comparing Neural Net and Tree-based models). The case studies in this manuscript focus on cancer drug response datasets; however, the application can be used on datasets from other domains, such as chemistry.
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
Aim This study uses a novel modeling approach to understand global trophic structure transformations under 21st-century climate changes. The goal is to project and understand the impacts of climate change on trophic dynamics, guiding future research and conservation efforts. Location 14,520 terrestrial grid cells of 1° x 1° globally. Taxon Trophic structures were assessed for 15,265 species, including 9,993 non-marine birds and 5,272 terrestrial mammals, across 9 predefined trophic guilds. Methods A spatially explicit community trophic structure model, based on an extreme gradient boosting algorithm (Xgboost), was used. The model was trained with 1961-1990 climatic data and projected changes according to three Shared Socioeconomic Pathways: SSP2-45, SSP3-70, and SSP5-85. Results The Xgboost model showed high predictive accuracy (86%, kappa=0.91). Projections indicated many global regions are transitioning in their trophic structures due to climate changes from 1990 to 2018, with decreases in species carrying capacity in 5.5% of cells and increases in 9.8%. Predictions for mid- and late-21st century under climate scenarios suggest significant reorganization, with notable impacts in regions such as the Amazon Basin, Central Africa, and Southeast Asia. Under SSP5-85, 17.1% of cells may face reductions in carrying capacity, while 41.1% could see increases, affecting thousands of species. Main conclusions Climate change is profoundly reorganizing global trophic communities, with significant shifts in species carrying capacity across different guilds. Tropical regions and high northern latitudes are most affected, with some species facing collapses and others finding new opportunities. These changes highlight the need to integrate community trophic structure models into biodiversity conservation strategies, offering a comprehensive view of climate change impacts on trophic networks. Methods Data Collection Species Distribution Data Geographical data were garnered from two primary sources and subsequently plotted on a global terrestrial grid, with each cell measuring 1 × 1°. These sources included the global distribution ranges of terrestrial mammals and non-marine birds. The distributions of species, specifically 9,993 non-marine birds and 5,272 terrestrial mammals, totaling 15,265 species, were informed by the IUCN Global Assessment's data on native ranges (IUCN, 2014). To enable analysis, a presence/absence matrix was created. In this matrix, the species were aligned as columns, each named, against 14,498 terrestrial grid cells, each cell measuring 1 × 1°, as rows. These include all the non-coastal cells of the world, excluding Antarctica and some northern regions, such as most of Greenland, for which some data are lacking. This approach provided a clear, granular view of species distribution across the globe. Bioclimatic Variables The bioclimatic variables were divided into two datasets: historical (1961-2018) and future (2021-2100). Historical bioclimatic variables were not obtained directly but derived from three monthly meteorological variables: mean minimum temperature (°C), mean maximum temperature (°C), and total precipitation (mm). These variables were downscaled from CRU-TS-4.03 (Harris et al., 2014) with WorldClim 2.1 (Fick & Hijmans, 2017) for bias correction. The nineteen WorldClim variables were calculated from these three monthly meteorological variables using the "biovars" function of the R dismo package (Hijmans et al., 2011). 
Unlike the historical data, pre-processed bioclimatic variables for the future could be accessed directly. We used a multimodel ensemble approach, which tends to perform better than any individual model (Pierce et al., 2009; Araújo & New, 2007). The ensemble integrates mean outputs from 25 global climate models (GCMs) corresponding to an array of twelve different future climate change scenarios (Harris et al., 2014; Fick & Hijmans, 2017). These scenarios emerge from the interplay of four specific timeframes (2021-2040, 2041-2060, 2061-2080, and 2081-2100) and three Shared Socio-economic Pathways (ssp2-45, ssp3-70, and ssp5-85) (Gidden et al., 2019). Feeding Habits Data The feeding habits of bird and mammal species were obtained from the global species-level compilation of key trophic attributes, known as Elton traits 1.0 (Wilman et al., 2014). This dataset provided essential information on the trophic roles of species, which is crucial for understanding their ecological interactions and energy flow within ecosystems. Trophic profile of the cells and structure identification Trophic profile of the cells We assigned each of the 15,265 terrestrial mammal and non-marine bird species to one of 9 trophic guilds and then counted the number of species in each guild within each cell, following a previous analysis (Mendoza & Araújo, 2022). The result is a matrix with the 9 trophic guilds as columns, 14,498 cells as rows, and values representing numbers of species. The trophic profile of every community is thus a point in a 9-dimensional ‘trophic space' defined by the number of species from each trophic guild (a vector of dimension 9). Selection of training samples From the initial set of 14,498 terrestrial grid cells, each measuring 1°×1°, a specific subset of 6,610 continental cells was selected. This subset was defined by their overlap, either partial or complete, with designated protected areas. This subset was crucial for two analytical steps: first, to decipher the community trophic structures; and second, to model the interaction between the prevailing climate and the trophic structure. Given the nature of these cells — designated as "continental protected area cells" — we assume they experience reduced human activity compared to the surrounding matrix; an assumption that may not align with reality globally, considering evidence of reduced effectiveness of protected areas in ensuring tangible protection in various parts of the tropics (Geldmann et al., 2019). Nevertheless, a working assumption is made that the trophic structures displayed within these areas likely present a closer reflection of what might be expected from an undisturbed, stable energy network (Mendoza & Araújo, 2022). Identification of the six basic trophic structures through AMD analysis We utilized AMD analysis to explore the previously described 9-dimensional 'community trophic space', defined by the number of species within each trophic guild. This analysis is rooted in computing the Average Membership Degree (AMD) of cluster elements based on their Euclidean distance to the geometric center. The primary aim of AMD analysis is to discern the presence of distinct groups within multidimensional spaces, while concurrently assessing their degree of definition and compactness. The emergence of well-defined community groups within this trophic space allows for the consideration of the identified basic trophic structures as qualitatively distinct entities (Mendoza & Araújo, 2022). 
We applied AMD analysis to the 6,610 continental protected area cells to confirm that the same six basic trophic structures (TS1 to TS6) identified by Mendoza & Araújo (2022) are present within this curated subset. For a more comprehensive understanding of the AMD method and its application to our dataset, readers are directed to the supplementary information of Mendoza & Araújo (2022), accessible via the following link: https://nsojournals.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1111%2Fecog.06289&file=ecog12872-sup-0001-AppendixS1.pdf Climate modelling of community trophic structures Data preparation We modelled the relationship between climate and trophic structures, utilizing 19 predictors derived from historical bioclimatic data encompassing the years 1961-1990. Denoted as pre-1990 period, this phase marks a time before the significant uptick in temperatures attributable to human-induced greenhouse gas emissions. The trophic profile data, systematically assembled from faunal lists gathered over numerous decades, also hail from an era prior to this pronounced temperature increase. Therefore, these records present a fitting basis for examining the interplay between the trophic structure and the climatic conditions prevalent during the pre-1990 period. The bioclimatic variables represent conditions over specific time periods, and the corresponding trophic structure type (TS1 to TS6) is inferred as the one expected at the end of these periods. Model Implementation Using Xgboost We employed the Extreme Gradient Boosting algorithm (Xgboost) (Chen & Guestrin, 2016), using the xgboost package (Chen et al., 2023), a state-of-the-art machine learning technique known for its superior performance over traditional models such as random forests (e.g., Shao et al., 2024). The target variable in our analysis was the basic type of trophic structure (TS1 to TS6), identified in the previous step (with the AMD analysis) in the 6,610 continental protected area cells. Hyperparameter optimization Before training the model, we optimized the hyperparameters of the Xgboost algorithm to enhance its performance. Specifically, we focused on six parameters: learning rate, maximum tree depth, gamma, lambda, alpha, and the number of trees. Due to the enormous number of possible parameter combinations, we employed a Bayesian optimization approach, which provided a more efficient search over the hyperparameter space compared to traditional grid search. As an optimization criterion, we used the xgb.cv cross-validation function within the Xgboost package, based on k-fold cross-validation. Spatial cross-validation by blocks In order to thoroughly assess the predictive accuracy of our model and address the spatial autocorrelation inherent in ecological data, we employed a rigorous Spatial Cross-Validation by Blocks method. This approach entailed partitioning the 6,610 continental protected area cells into 3,848 validation blocks,
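As a rough Python analogue of the cross-validated scoring step described above (the study used the R xgboost package), the sketch below evaluates one candidate hyperparameter set with xgb.cv on placeholder data; the Bayesian optimiser that proposes candidates is omitted, and the values shown are not the study's tuned settings.

```python
# Illustrative Python analogue of the xgb.cv-based evaluation described above
# (placeholder data and hyperparameter values; the Bayesian optimiser that
# proposes candidate settings is omitted).
import numpy as np
import xgboost as xgb

# Stand-in data: 19 bioclimatic predictors, 6 trophic-structure classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(6610, 19))
y = rng.integers(0, 6, size=6610)           # classes TS1..TS6 encoded as 0..5
dtrain = xgb.DMatrix(X, label=y)

candidate = {                                # one candidate hyperparameter set
    "objective": "multi:softmax",
    "num_class": 6,
    "eta": 0.1,                              # learning rate
    "max_depth": 6,
    "gamma": 0.5,
    "lambda": 1.0,
    "alpha": 0.0,
}

cv = xgb.cv(candidate, dtrain, num_boost_round=200, nfold=5,
            metrics="mlogloss", early_stopping_rounds=20, seed=0)
score = cv["test-mlogloss-mean"].min()       # criterion an optimiser would minimise
print("CV multi-class log-loss:", round(float(score), 4))
```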
INTRODUCTION: Sepsis is intricately linked to intestinal damage and barrier dysfunction. At present, there is growing interest in metabolite-based therapy for multiple diseases. METHODS: Serum samples from septic patients and healthy individuals were collected and their metabonomic profiles assessed using Ultra-Performance Liquid Chromatography-Time of Flight Mass Spectrometry (UPLC-TOFMS). The eXtreme Gradient Boosting (XGBoost) method was used to screen essential metabolites associated with sepsis, and five machine learning models, including logistic regression, XGBoost, Gaussian naive Bayes (GNB), support vector machines (SVM), and random forest, were constructed to distinguish sepsis using a training set (75%) and a validation set (25%). The area under the receiver-operating characteristic curve (AUROC) and Brier scores were used to compare the prediction performances of the different models. Pearson analysis was used to analyse the relationship between the metabolites and the severity of sepsis. Both cellular and animal models were used to assess the function of the metabolites. RESULTS: The occurrence of sepsis involves metabolite dysregulation. The metabolites mannose-6-phosphate and sphinganine were identified as the optimal sepsis-related variables screened by the XGBoost algorithm. Among the five machine learning methods, the XGBoost model (AUROC = 0.956) showed the most stable performance for establishing a diagnostic model. The SHapley Additive exPlanations (SHAP) package was used to interpret the XGBoost model. Pearson analysis confirmed that the expression of sphinganine and mannose-6-phosphate was positively associated with APACHE-II, PCT, WBC, CRP, and IL-6. We also demonstrated that sphinganine strongly diminished the LDH content in LPS-treated Caco-2 cells. In addition, using both in vitro and in vivo examination, we revealed that sphinganine strongly protects against sepsis-induced intestinal barrier injury. DISCUSSION: These findings highlight the potential diagnostic value of the ML models and also provide new insight into enhanced therapy and/or preventative measures against sepsis.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset reconstructs the annual mass balance of glaciers larger than 0.1 km² in the Tien Shan and Pamir regions from 1950 to 2022. The dataset is derived using a nonlinear relationship between glacier mass balance and meteorological and topographical variables. The reconstruction method employs the XGBoost algorithm. Initially, XGBoost is trained on the complete training dataset, followed by incremental training for each sub-region to tailor models to specific regional characteristics. The final training results yield an average coefficient of determination (R²) of 0.87.
All code used in this dataset is publicly available and organized into the following five sections:
Data Processing
Model Training
Result Analysis
Result Evaluation
SHAP Analysis
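As an illustration of the two-stage training described above (a global model followed by continued training for a sub-region), a minimal sketch using the Python xgboost API on placeholder data might look like the following; it is not the dataset's actual code or configuration.

```python
# Sketch of the two-stage training described above: a global XGBoost model
# trained on the full dataset, then continued (incremental) training on one
# sub-region's data via the xgb_model argument (placeholder data and
# parameters, not the dataset's actual configuration).
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(1)
X_all, y_all = rng.normal(size=(5000, 12)), rng.normal(size=5000)
params = {"objective": "reg:squarederror", "eta": 0.05, "max_depth": 6}

# Stage 1: train on the complete training dataset.
global_booster = xgb.train(params, xgb.DMatrix(X_all, label=y_all),
                           num_boost_round=300)

# Stage 2: incremental training for one sub-region, starting from the global model.
X_sub, y_sub = rng.normal(size=(800, 12)), rng.normal(size=800)
regional_booster = xgb.train(params, xgb.DMatrix(X_sub, label=y_sub),
                             num_boost_round=100, xgb_model=global_booster)
```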
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Accurate prediction of water inrush volumes is essential for safeguarding tunnel construction operations. This study proposes a method for predicting tunnel water inrush volumes, leveraging the eXtreme Gradient Boosting (XGBoost) model optimized with Bayesian techniques. To maximize the utility of available data, 654 datasets with missing values were imputed and augmented, forming a robust dataset for the training and validation of the Bayesian-optimized XGBoost (BO-XGBoost) model. Furthermore, the SHapley Additive exPlanations (SHAP) method was employed to elucidate the contribution of each input feature to the predictive outcomes. The results indicate that: (1) the constructed BO-XGBoost model exhibited exceptionally high predictive accuracy on the test set, with a root mean square error (RMSE) of 7.5603, mean absolute error (MAE) of 3.2940, mean absolute percentage error (MAPE) of 4.51%, and coefficient of determination (R2) of 0.9755; (2) compared to the predictive performance of support vector machine (SVR), decision tree (DT), and random forest (RF) models, the BO-XGBoost model demonstrates the highest R2 value and the smallest prediction error; (3) the input feature importance yielded by SHAP is groundwater level (h) > water-producing characteristics (W) > tunnel burial depth (H) > rock mass quality index (RQD). The proposed BO-XGBoost model exhibited exceptionally high predictive accuracy on the tunnel water inrush volume prediction dataset, thereby aiding managers in making informed decisions to mitigate water inrush risks and ensuring the safe and efficient advancement of tunnel projects.
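A minimal sketch of the SHAP attribution step on synthetic stand-in data is shown below; the feature names follow the abstract, but the data, model settings, and Bayesian optimisation step are placeholders rather than the study's implementation.

```python
# Minimal sketch of SHAP-based feature attribution for an XGBoost regressor
# (synthetic stand-in data named after the abstract's inputs; not the study's
# data or its Bayesian-optimised hyperparameters).
import numpy as np
import pandas as pd
import shap
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(654, 4)),
                 columns=["h", "W", "H", "RQD"])   # feature names from the abstract
y = rng.normal(size=654)                            # placeholder water-inrush volumes

model = XGBRegressor(n_estimators=200, learning_rate=0.1).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
importance = np.abs(shap_values).mean(axis=0)       # mean |SHAP| per feature
print(dict(zip(X.columns, importance.round(3))))
```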
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
XGBTree achieved the best performance in most of the evaluation metrics (PrePro—Pre-processing type (B—Balanced (ENUS), O—Original data); VarRem—Variable removal (Y—Yes, N—No)).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Times for threads represent model runs on CPU with the corresponding number of threads used for speedup. The last row corresponds to running the same model on a single NVIDIA V100 GPU.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: The ability to assess adverse outcomes in patients with community-acquired pneumonia (CAP) could improve clinical decision-making and enhance clinical practice, but studies remain insufficient and, similarly, few machine learning (ML) models have been developed. Objective: We aimed to explore the effectiveness of predicting adverse outcomes in CAP through ML models. Methods: A total of 2,302 adults with CAP, prospectively recruited between January 2012 and March 2015 across three cities in South America, were extracted from DryadData. After a 70:30 training/test split of the data, nine ML algorithms were executed and their diagnostic accuracy was measured mainly by the area under the curve (AUC). The nine ML algorithms included decision trees, random forests, extreme gradient boosting (XGBoost), support vector machines, Naïve Bayes, K-nearest neighbors, ridge regression, logistic regression without regularization, and neural networks. The adverse outcomes included hospital admission, mortality, ICU admission, and one-year post-enrollment status. Results: The XGBoost algorithm had the best performance in predicting hospital admission. Its AUC reached 0.921, and its accuracy, precision, recall, and F1-score were better than those of the other models. In predicting ICU admission, a model trained with the XGBoost algorithm showed the best performance with an AUC of 0.801. The XGBoost algorithm also performed well at predicting one-year post-enrollment status; the AUC, accuracy, precision, recall, and F1-score indicated that the algorithm had high accuracy and precision. In addition, the best performance in predicting death was achieved by the neural network algorithm (AUC 0.831). Conclusions: ML algorithms, particularly XGBoost, were feasible and effective in predicting adverse outcomes of CAP patients. ML models based on available common clinical features have great potential to guide individual treatment and subsequent clinical decisions.
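For illustration only, the sketch below applies a 70:30 split and an AUC-based comparison to a few of the listed algorithms on synthetic data; it does not reproduce the study's clinical features or tuned models.

```python
# Sketch of a 70:30 split and AUC comparison as described above (synthetic
# stand-in data; the study's clinical features are not reproduced here).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2302, 20))
y = rng.integers(0, 2, size=2302)                   # e.g. hospital admission yes/no

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)

models = {
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=300, random_state=0),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "XGBoost": XGBClassifier(n_estimators=300, eval_metric="logloss"),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")
```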
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains data, results, and processing material from the application of GEOBIA-based, Spatially Partitioned Segmentation Parameter Optimization (SPUSPO) in the city of Ouagadougou. In detail, it contains:
Labels:
2 : Artificial Ground Surface
0 : Building
5 : Low Vegetation
4 : Tree
1 : Swimming Pool
3 : Bare Ground
7 : Shadow
6 : Inland Water
The data are given in a csv format.
Python code calling GRASS GIS functions for automating the procedure.
Segmentation rasters for each approach.
A csv file with the data used to compute the Area Fit Index for each approach.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Carbon Capture and Storage (CCS) relevant Reactive Transport Modelling (RTM) of microfractures in basaltic rock, emulated using Gradient Boosted Decision Trees (GBDT) and subsequently optimised using a Bayesian Optimisation (BO) framework. This project's code is hosted on GitHub at https://github.com/ThomasDodd97/CCS-RTM-GBDT-BO. This upload on Zenodo contains the dataset used to train four XGBoost GBDT surrogate models, whose model files are also uploaded here.
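A hypothetical sketch of loading one of the uploaded surrogate models with the Python xgboost API follows; the file name surrogate_model_1.json is a placeholder, since the actual model file names in the upload are not listed above.

```python
# Hypothetical sketch of loading an uploaded XGBoost surrogate model and
# querying it; "surrogate_model_1.json" is a placeholder filename.
import numpy as np
import xgboost as xgb

booster = xgb.Booster()
booster.load_model("surrogate_model_1.json")          # placeholder filename

# Evaluate the surrogate at a new point in the RTM input space
# (the number of input features is read from the loaded model).
x_new = np.zeros((1, booster.num_features()))
print(booster.predict(xgb.DMatrix(x_new)))
```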
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Accurate real-time icing grid fields are critical for preventing ice-related disasters during winter and protecting property. These fields are essential both for mapping ice distribution and for predicting icing using physical models combined with numerical weather prediction systems. However, developing precise real-time icing grids is challenging due to the uneven distribution of monitoring stations, data confidentiality restrictions, and the limitations of existing interpolation methods. In this study, we propose a new approach for constructing real-time icing grid fields using 1,339 online terminal monitoring datasets provided by the China Southern Power Grid Research Institute Co., Ltd. (CSPGRI) during the winter of 2023. Our method integrates static geographic information, dynamic meteorological factors, and ice_kriging values derived from parameter-optimized Empirical Bayesian Kriging Interpolation (EBKI) to create a spatiotemporally matched, multi-source fused icing thickness grid dataset. We applied five machine learning algorithms—Random Forest, XGBoost, LightGBM, Stacking, and Convolutional Neural Network Transformers (CNNT)—and evaluated their performance using six metrics: R, RMSE, CSI, MAR, FAR, and fbias, on both validation and testing sets. The stacking model performed best, achieving an R value of 0.634 (0.893), RMSE of 3.424 mm (2.834 mm), CSI of 0.514 (0.774), MAR of 0.309 (0.091), FAR of 0.332 (0.161), and fbias of 1.034 (1.084), respectively, when comparing predicted icing values with actual measurements on pylons. Additionally, we employed the SHAP model to provide a physical interpretation of the stacking model, confirming the independence of selected features. Meteorological factors such as relative humidity (RH), 10-meter wind speed (WS10), 2-meter temperature (T2), and precipitation (PRE) demonstrated a range of positive and negative contributions consistent with the observed growth of icing. Thus, our multi-source remote sensing data fusion approach, combined with the stacking model, offers a highly accurate and interpretable solution for generating real-time icing grid fields.
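For illustration, a minimal stacking sketch on synthetic stand-in data follows; only two of the five base learners are shown, and none of the study's features, fused data sources, or tuning are reproduced.

```python
# Minimal sketch of a stacking ensemble like the one described above
# (synthetic stand-in data; only two base learners are shown, and this is not
# the study's actual feature set or tuned configuration).
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1339, 10))       # e.g. static geographic + meteorological factors
y = rng.gamma(2.0, 2.0, size=1339)    # placeholder icing thickness (mm)

stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=300, random_state=0)),
        ("xgb", XGBRegressor(n_estimators=300, learning_rate=0.05)),
    ],
    final_estimator=Ridge(),           # meta-learner combining base predictions
    cv=5,
)
stack.fit(X, y)
print("Predicted icing thickness (mm):", stack.predict(X[:3]).round(2))
```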