Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset consists of two curated subsets designed for the classification of alteration types using geochemical and proxy variables. The traditional dataset (Trad_Train.csv and Trad_Test.csv) is derived directly from the original complete geochemical dataset (alldata.csv), contains no missing values, and retains the original geochemical features, serving as a baseline for model training and evaluation. In contrast, the simulated dataset (proxies_alldata.csv) was generated through custom MATLAB scripts that transform the original geochemical features into proxy variables based on multiple geostatistical realizations. These proxies, expressed on a Gaussian scale, may include negative values due to normalization. The target variable, Alteration, was originally encoded as integers using the mapping: 1 = AAA, 2 = IAA, 3 = PHY, 4 = PRO, 5 = PTS, and 6 = UAL. The simulated proxy data were split into the simulated train and test files (Simu_Train.csv and Simu_Test.csv) using an encoded flag marking training rows (=1) and testing rows (=2). All supporting files (including datasets, intermediate outputs such as PNGs and variograms, proxy outputs, and an executable for the confidence analysis routines) are included in the repository; the source code is hosted in a GitHub repository. Specifically, the FinalMatlabFiles.zip archive contains the raw input file alldata.csv used to generate proxies_alldata.csv, as well as Analysis1.csv and Analysis2.csv for performing confidence analysis. To run the executable files in place of the .m scripts in MATLAB, users must install the MATLAB Runtime 2023b for Windows 64-bit, available at: https://ssd.mathworks.com/supportfiles/downloads/R2023b/Release/10/deployment_files/installer/complete/win64/MATLAB_Runtime_R2023b_Update_10_win64.zip.
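For illustration, a minimal pandas sketch of working with the encoded target and the simulated train/test flag described above follows. The target column name Alteration is taken from the description; the name of the train/test flag column is not specified, so Split below is a hypothetical placeholder.

```python
# Minimal sketch (assumes pandas is available and that the target column is
# literally named "Alteration"; the name of the train/test flag column in
# proxies_alldata.csv is NOT given in the description, so "Split" is a
# hypothetical placeholder).
import pandas as pd

ALTERATION_LABELS = {1: "AAA", 2: "IAA", 3: "PHY", 4: "PRO", 5: "PTS", 6: "UAL"}

# Decode the integer-encoded target in the traditional training subset.
trad_train = pd.read_csv("Trad_Train.csv")
trad_train["Alteration_label"] = trad_train["Alteration"].map(ALTERATION_LABELS)

# Reproduce the simulated train/test split from the encoded flag described
# above (=1 training, =2 testing); "Split" is an assumed column name.
proxies = pd.read_csv("proxies_alldata.csv")
simu_train = proxies[proxies["Split"] == 1]
simu_test = proxies[proxies["Split"] == 2]
```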
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
In the pharmaceutical industry it is common to generate many QSAR models from training sets containing a large number of molecules and a large number of descriptors. The best QSAR methods are those that can generate the most accurate predictions but that are not overly expensive computationally. In this paper we compare eXtreme Gradient Boosting (XGBoost) to random forest and single-task deep neural nets on 30 in-house data sets. While XGBoost has many adjustable parameters, we can define a set of standard parameters at which XGBoost makes predictions, on the average, better than those of random forest and almost as good as those of deep neural nets. The biggest strength of XGBoost is its speed. Whereas efficient use of random forest requires generating each tree in parallel on a cluster, and deep neural nets are usually run on GPUs, XGBoost can be run on a single CPU in less than a third of the wall-clock time of either of the other methods.
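As a rough illustration of such a comparison, the sketch below fits XGBoost and a random forest on synthetic regression data. The parameter values are generic placeholders chosen for demonstration, not the "standard" XGBoost settings defined in the paper, and the in-house data sets are not reproduced.

```python
# Illustrative sketch only: the paper's in-house data sets and its exact
# "standard" XGBoost parameters are not given here, so the values below are
# generic defaults chosen for demonstration, not the authors' settings.
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=2000, n_features=200, noise=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

xgb_model = xgb.XGBRegressor(
    n_estimators=500, learning_rate=0.05, max_depth=6, subsample=0.8,
    colsample_bytree=0.8, n_jobs=1,  # single CPU, as emphasized in the abstract
)
rf_model = RandomForestRegressor(n_estimators=500, n_jobs=-1, random_state=0)

for name, model in [("XGBoost", xgb_model), ("Random forest", rf_model)]:
    model.fit(X_tr, y_tr)
    print(name, "R2:", round(r2_score(y_te, model.predict(X_te)), 3))
```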
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Semi-flexible docking was performed using AutoDock Vina 1.2.2 software on the SARS-CoV-2 main protease Mpro (PDB ID: 6WQF).
Two data sets are provided in the xyz format containing the AutoDock Vina docking scores. These files were used as input and/or reference in the machine learning models using TensorFlow, XGBoost, and SchNetPack to study their docking-score prediction capability. The first data set originally contained 60,411 in-vivo labeled compounds selected for the training of ML models. The second data set, denoted as in-vitro-only, originally contained 175,696 compounds active or assumed to be active at 10 μM or less in a direct binding assay. These sets were downloaded on the 10th of December 2021 from the ZINC15 database. Four compounds in the in-vivo set and 12 in the in-vitro-only set were left out of consideration due to the presence of Si atoms. Compounds with no charges assigned in mol2 files were excluded as well (523 compounds in the in-vivo set and 1,666 in the in-vitro-only set). Gasteiger charges were reassigned to the remaining compounds using OpenBabel. In addition, four in-vitro-only compounds with docking scores greater than 1 kcal/mol were rejected.
The provided in-vivo and the in-vitro-only sets contain 59,884 (in-vivo.xyz) and 174,014 (in-vitro-only.xyz) compounds, respectively. Compounds in both sets contain the following elements: H, C, N, O, F, P, S, Cl, Br, and I. The in-vivo compound set was used as the primary data set for the training of the ML models in the referencing study.
The file in-vivo-splits-data.csv contains the exact composition of all (random) 80-5-15 train-validation-test splits used in the study, labeled I, II, III, IV, and V. Eight additional random subsets in each of the in-vivo 80-5-15 splits were created to monitor the convergence of the training process. These subsets were constructed in such a manner that each subset contains all compounds from the previous subset (starting with the 10-5-15 subset) and was enlarged by one eighth of the entire (80-5-15) train set of a given split. These subsets are further referred to as in_vivo_10_(I, II, ..., V), in_vivo_20_(I, II, ..., V), ..., in_vivo_80_(I, II, ..., V).
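For illustration only, the sketch below shows how nested train subsets of the kind described above could be constructed from a random permutation of compound indices. The actual subset compositions are recorded in in-vivo-splits-data.csv; the index-only construction here is a stand-in under stated assumptions.

```python
# Sketch of the nested-subset construction described above (assumptions:
# compounds are identified by index only; the actual in-vivo-splits-data.csv
# column layout is not specified here).
import numpy as np

rng = np.random.default_rng(seed=0)
n_compounds = 59_884                      # size of the provided in-vivo set
indices = rng.permutation(n_compounds)

n_train = int(0.80 * n_compounds)         # 80-5-15 split: 80% train
n_val = int(0.05 * n_compounds)
train_idx = indices[:n_train]
val_idx = indices[n_train:n_train + n_val]
test_idx = indices[n_train + n_val:]

# Nested subsets: the 10% subset is contained in the 20% subset, and so on,
# each enlarged by one eighth of the full 80% train set.
step = n_train // 8
nested_subsets = {f"in_vivo_{10 * (k + 1)}": train_idx[: step * (k + 1)]
                  for k in range(8)}
```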
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository includes the RNA-seq dataset from 27 GBM samples, as published in this manuscript:
Topographic mapping of the glioblastoma proteome reveals a triple axis model of intra-tumoral heterogeneity
Lam KHB, Leon AJ, Hui W, Lee SCE, Batruch I, Faust K, Koritzinsky M, Richer M, Djuric U, Diamandis P (under review)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Evaluation of parameters for the XGBoost models of different training and test sets for COVID-19 deaths.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data package presents forcing data, model code, and model output for classical machine learning models that predict monthly stream water temperature, as presented in the manuscript 'Stream Temperature Predictions for River Basin Management in the Pacific Northwest and Mid-Atlantic Regions Using Machine Learning', Water (Weierbach et al., 2022). Specifically, for input forcing datasets we include two files, each generated using the BASIN-3D data integration tool (Varadharajan et al., 2022), for stations in the Pacific Northwest and Mid-Atlantic hydrologic regions. Model code (written in Python using Jupyter notebooks) includes notebooks for data preprocessing, for training Multiple Linear Regression, Support Vector Regression, and Extreme Gradient Boosted Tree models, and for analysis of model output. We also include specific model output files, provided in HDF5 format, that represent the modeling configurations presented in the manuscript. Together, these data make up the workflow for predictions across three scenarios (single station, regional, and predictions in unmonitored basins) presented in the manuscript and allow for reproducibility of the modeling procedures.
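A minimal sketch of this train-and-export workflow on synthetic stand-in data is given below; the features, file names, and model settings are placeholders rather than the manuscript's actual configuration.

```python
# Minimal sketch (synthetic stand-in data; the actual BASIN-3D forcing files,
# feature names, and tuned configurations from the manuscript are not
# reproduced here).
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))            # e.g. air temperature, precipitation, ...
y = 10 + 0.5 * X[:, 0] + rng.normal(scale=0.5, size=500)  # placeholder stream temperature
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "MLR": LinearRegression(),
    "SVR": SVR(kernel="rbf"),
    "XGB": XGBRegressor(n_estimators=300, learning_rate=0.05),
}
predictions = pd.DataFrame({name: m.fit(X_tr, y_tr).predict(X_te)
                            for name, m in models.items()})

# Store model output in HDF5, matching the format mentioned above
# (pandas' HDF5 writer requires the optional PyTables dependency).
predictions.to_hdf("model_output.h5", key="predictions", mode="w")
```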
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is part of a machine learning project focused on predicting rainfall, a critical task for sectors like agriculture, water resource management, and disaster prevention. The project employs machine learning algorithms to forecast rainfall occurrences based on historical weather data, including features like temperature, humidity, and pressure.
The primary goal of the dataset is to train multiple machine learning models to predict rainfall and compare their performances. The insights gained will help identify the most accurate models for real-world predictions of rainfall events.
The dataset is derived from various historical weather observations, including temperature, humidity, wind speed, and pressure, collected by weather stations across Australia. These observations are used as inputs for training machine learning models. The dataset is publicly available on platforms like Kaggle and is often used in competitions and research to advance predictive analytics in meteorology.
The dataset consists of weather data from multiple Australian weather stations, spanning various time periods. Key features include:
Temperature
Humidity
Wind Speed
Pressure
Rainfall (target variable)
These features are tracked for each weather station over different times, with the goal of predicting rainfall.
Python: The primary programming language for data analysis and machine learning.
scikit-learn: For implementing machine learning models.
XGBoost, LightGBM, and CatBoost: Popular libraries for building more advanced ensemble models.
Matplotlib/Seaborn: For data visualization.
These libraries and tools help in data manipulation, modeling, evaluation, and visualization of results.
DBRepo Authorization: Required to access datasets via the DBRepo API for dataset retrieval.
Model Comparison Charts: The project includes output charts comparing the performance of seven popular machine learning models.
Trained Models (.pkl files): Pre-trained models are saved as .pkl files for reuse without retraining (a minimal training-and-saving sketch follows this list).
Documentation and Code: A Jupyter notebook guides through the process of data analysis, model training, and evaluation.
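Tying together the tools and outputs listed above, the following minimal sketch trains one model on synthetic stand-in data and saves it as a .pkl file; the features and settings are placeholders, not the project's actual pipeline (which retrieves data via the DBRepo API).

```python
# Minimal sketch of training one model and saving it as a .pkl file for reuse
# (synthetic stand-in data; the actual Australian weather features and the
# DBRepo retrieval step are not reproduced here).
import pickle
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 4))            # e.g. temperature, humidity, wind speed, pressure
y = (X[:, 1] + rng.normal(size=2000) > 0).astype(int)   # placeholder rain / no-rain label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
model = XGBClassifier(n_estimators=300, learning_rate=0.1, eval_metric="logloss")
model.fit(X_tr, y_tr)
print("Accuracy:", accuracy_score(y_te, model.predict(X_te)))

# Persist the trained model as a .pkl file so it can be reloaded without retraining.
with open("xgboost_rainfall.pkl", "wb") as fh:
    pickle.dump(model, fh)
```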
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Fastest training times for CNN and XGBoost on CPU and GPU (all features).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Models and Predictions
This dataset contains the trained XGBoost and EA-LSTM models and the models' predictions for the paper The Proper Care and Feeding of CAMELS: How Limited Training Data Affects Streamflow Prediction.
For each combination of model (XGBoost, EA-LSTM), training years (3, 6, 9), number of basins (13, 26, 53, 265, 531), and seed (111-888), there are five folders. Each corresponds to a random basin sample (for 531 basins there's only one folder, since it's all basins).
In each folder, there are two files:
In addition to each folder, there is a SLURM submission script that was used to create and evaluate the model in the folder.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The use of deep learning (DL) is steadily gaining traction in scientific challenges such as cancer research. Advances in enhanced data generation, machine learning algorithms, and compute infrastructure have led to an acceleration in the use of deep learning in various domains of cancer research, such as drug response problems. In our study, we explored tree-based models to improve the accuracy of a single drug response model and demonstrate that tree-based models such as XGBoost (eXtreme Gradient Boosting) have advantages over deep learning models, such as a convolutional neural network (CNN), for single drug response problems. However, comparing models is not a trivial task. To make training and comparing CNNs and XGBoost more accessible to users, we developed an open-source library called UNNT (A novel Utility for comparing Neural Net and Tree-based models). The case studies in this manuscript focus on cancer drug response datasets; however, the application can be used on datasets from other domains, such as chemistry.
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
Aim This study uses a novel modeling approach to understand global trophic structure transformations under 21st-century climate changes. The goal is to project and understand the impacts of climate change on trophic dynamics, guiding future research and conservation efforts. Location 14,520 terrestrial grid cells of 1° x 1° globally. Taxon Trophic structures were assessed for 15,265 species, including 9,993 non-marine birds and 5,272 terrestrial mammals, across 9 predefined trophic guilds. Methods A spatially explicit community trophic structure model, based on an extreme gradient boosting algorithm (Xgboost), was used. The model was trained with 1961-1990 climatic data and projected changes according to three Shared Socioeconomic Pathways: SSP2-45, SSP3-70, and SSP5-85. Results The Xgboost model showed high predictive accuracy (86%, kappa=0.91). Projections indicated many global regions are transitioning in their trophic structures due to climate changes from 1990 to 2018, with decreases in species carrying capacity in 5.5% of cells and increases in 9.8%. Predictions for mid- and late-21st century under climate scenarios suggest significant reorganization, with notable impacts in regions such as the Amazon Basin, Central Africa, and Southeast Asia. Under SSP5-85, 17.1% of cells may face reductions in carrying capacity, while 41.1% could see increases, affecting thousands of species. Main conclusions Climate change is profoundly reorganizing global trophic communities, with significant shifts in species carrying capacity across different guilds. Tropical regions and high northern latitudes are most affected, with some species facing collapses and others finding new opportunities. These changes highlight the need to integrate community trophic structure models into biodiversity conservation strategies, offering a comprehensive view of climate change impacts on trophic networks. Methods Data Collection Species Distribution Data Geographical data were garnered from two primary sources and subsequently plotted on a global terrestrial grid, with each cell measuring 1 × 1°. These sources included the global distribution ranges of terrestrial mammals and non-marine birds. The distributions of species, specifically 9,993 non-marine birds and 5,272 terrestrial mammals, totaling 15,265 species, were informed by the IUCN Global Assessment's data on native ranges (IUCN, 2014). To enable analysis, a presence/absence matrix was created. In this matrix, the species were aligned as columns, each named, against 14,498 terrestrial grid cells, each cell measuring 1 × 1°, as rows. These include all the non-coastal cells of the world, excluding Antarctica and some northern regions, such as most of Greenland, for which some data are lacking. This approach provided a clear, granular view of species distribution across the globe. Bioclimatic Variables The bioclimatic variables were divided into two datasets: historical (1961-2018) and future (2021-2100). Historical bioclimatic variables were not obtained directly but derived from three monthly meteorological variables: mean minimum temperature (°C), mean maximum temperature (°C), and total precipitation (mm). These variables were downscaled from CRU-TS-4.03 (Harris et al., 2014) with WorldClim 2.1 (Fick & Hijmans, 2017) for bias correction. The nineteen WorldClim variables were calculated from these three monthly meteorological variables using the "biovars" function of the R dismo package (Hijmans et al., 2011). 
Unlike the historical data, pre-processed bioclimatic variables for the future could be accessed directly. We used a multimodel ensemble approach, which tends to perform better than any individual model (Pierce et al., 2009; Araújo & New, 2007). The ensemble integrates mean outputs from 25 global climate models (GCMs) corresponding to an array of twelve different future climate change scenarios (Harris et al., 2014; Fick & Hijmans, 2017). These scenarios emerge from the interplay of four specific timeframes (2021-2040, 2041-2060, 2061-2080, and 2081-2100) and three Shared Socio-economic Pathways (ssp2-45, ssp3-70, and ssp5-85) (Gidden et al., 2019). Feeding Habits Data The feeding habits of bird and mammal species were obtained from the global species-level compilation of key trophic attributes, known as Elton traits 1.0 (Wilman et al., 2014). This dataset provided essential information on the trophic roles of species, which is crucial for understanding their ecological interactions and energy flow within ecosystems. Trophic profile of the cells and structure identification Trophic profile of the cells We assigned each of the 15,265 terrestrial mammal and non-marine bird species to one of 9 trophic guilds and then counted the number of species in each guild within each cell, following a previous analysis (Mendoza & Araújo, 2022). The result is a matrix with the 9 trophic guilds as columns, 14,498 cells as rows, and values representing numbers of species. The trophic profile of every community is thus a point in a 9-dimensional ‘trophic space' defined by the number of species from each trophic guild (a vector of dimension 9). Selection of training samples From the initial set of 14,498 terrestrial grid cells, each measuring 1°×1°, a specific subset of 6,610 continental cells was selected. This subset was defined by their overlap, either partial or complete, with designated protected areas. This subset was crucial for two analytical steps: first, to decipher the community trophic structures; and second, to model the interaction between the prevailing climate and the trophic structure. Given the nature of these cells — designated as "continental protected area cells" — we assume they experience reduced human activity compared to the surrounding matrix; an assumption that may not align with reality globally, considering evidence of reduced effectiveness of protected areas in ensuring tangible protection in various parts of the tropics (Geldmann et al., 2019). Nevertheless, a working assumption is made that the trophic structures displayed within these areas likely present a closer reflection of what might be expected from an undisturbed, stable energy network (Mendoza & Araújo, 2022). Identification of the six basic trophic structures through AMD analysis We utilized AMD analysis to explore the previously described 9-dimensional 'community trophic space', defined by the number of species within each trophic guild. This analysis is rooted in computing the Average Membership Degree (AMD) of cluster elements based on their Euclidean distance to the geometric center. The primary aim of AMD analysis is to discern the presence of distinct groups within multidimensional spaces, while concurrently assessing their degree of definition and compactness. The emergence of well-defined community groups within this trophic space allows for the consideration of the identified basic trophic structures as qualitatively distinct entities (Mendoza & Araújo, 2022). 
We applied AMD analysis to the 6,610 continental protected area cells to confirm that the same six basic trophic structures (TS1 to TS6) identified by Mendoza & Araújo (2022) are present within this curated subset. For a more comprehensive understanding of the AMD method and its application to our dataset, readers are directed to the supplementary information of Mendoza & Araújo (2022), accessible via the following link: https://nsojournals.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1111%2Fecog.06289&file=ecog12872-sup-0001-AppendixS1.pdf Climate modelling of community trophic structures Data preparation We modelled the relationship between climate and trophic structures, utilizing 19 predictors derived from historical bioclimatic data encompassing the years 1961-1990. Denoted as pre-1990 period, this phase marks a time before the significant uptick in temperatures attributable to human-induced greenhouse gas emissions. The trophic profile data, systematically assembled from faunal lists gathered over numerous decades, also hail from an era prior to this pronounced temperature increase. Therefore, these records present a fitting basis for examining the interplay between the trophic structure and the climatic conditions prevalent during the pre-1990 period. The bioclimatic variables represent conditions over specific time periods, and the corresponding trophic structure type (TS1 to TS6) is inferred as the one expected at the end of these periods. Model Implementation Using Xgboost We employed the Extreme Gradient Boosting algorithm (Xgboost) (Chen & Guestrin, 2016), using the xgboost package (Chen et al., 2023), a state-of-the-art machine learning technique known for its superior performance over traditional models such as random forests (e.g., Shao et al., 2024). The target variable in our analysis was the basic type of trophic structure (TS1 to TS6), identified in the previous step (with the AMD analysis) in the 6,610 continental protected area cells. Hyperparameter optimization Before training the model, we optimized the hyperparameters of the Xgboost algorithm to enhance its performance. Specifically, we focused on six parameters: learning rate, maximum tree depth, gamma, lambda, alpha, and the number of trees. Due to the enormous number of possible parameter combinations, we employed a Bayesian optimization approach, which provided a more efficient search over the hyperparameter space compared to traditional grid search. As an optimization criterion, we used the xgb.cv cross-validation function within the Xgboost package, based on k-fold cross-validation. Spatial cross-validation by blocks In order to thoroughly assess the predictive accuracy of our model and address the spatial autocorrelation inherent in ecological data, we employed a rigorous Spatial Cross-Validation by Blocks method. This approach entailed partitioning the 6,610 continental protected area cells into 3,848 validation blocks,
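As a rough Python analogue of the cross-validated scoring step described above (the study used the R xgboost package), the sketch below evaluates one candidate hyperparameter set with xgb.cv on placeholder data; the Bayesian optimiser that proposes candidates is omitted, and the values shown are not the study's tuned settings.

```python
# Illustrative Python analogue of the xgb.cv-based evaluation described above
# (placeholder data and hyperparameter values; the Bayesian optimiser that
# proposes candidate settings is omitted).
import numpy as np
import xgboost as xgb

# Stand-in data: 19 bioclimatic predictors, 6 trophic-structure classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(6610, 19))
y = rng.integers(0, 6, size=6610)           # classes TS1..TS6 encoded as 0..5
dtrain = xgb.DMatrix(X, label=y)

candidate = {                                # one candidate hyperparameter set
    "objective": "multi:softmax",
    "num_class": 6,
    "eta": 0.1,                              # learning rate
    "max_depth": 6,
    "gamma": 0.5,
    "lambda": 1.0,
    "alpha": 0.0,
}

cv = xgb.cv(candidate, dtrain, num_boost_round=200, nfold=5,
            metrics="mlogloss", early_stopping_rounds=20, seed=0)
score = cv["test-mlogloss-mean"].min()       # criterion an optimiser would minimise
print("CV multi-class log-loss:", round(float(score), 4))
```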
INTRODUCTION: Sepsis is intricately linked to intestinal damage and barrier dysfunction. At present, there is growing interest in metabolite-based therapy for multiple diseases. METHODS: Serum samples from septic patients and healthy individuals were collected and their metabonomic profiles assessed using Ultra-Performance Liquid Chromatography-Time of Flight Mass Spectrometry (UPLC-TOFMS). The eXtreme Gradient Boosting (XGBoost) method was used to screen essential metabolites associated with sepsis, and five machine learning models, including logistic regression, XGBoost, Gaussian naive Bayes (GNB), support vector machines (SVM), and random forest, were constructed to distinguish sepsis using a training set (75%) and a validation set (25%). The area under the receiver-operating characteristic curve (AUROC) and Brier scores were used to compare the prediction performances of the different models. Pearson analysis was used to analyse the relationship between the metabolites and the severity of sepsis. Both cellular and animal models were used to assess the function of the metabolites. RESULTS: The occurrence of sepsis involves metabolite dysregulation. The metabolites mannose-6-phosphate and sphinganine were identified as the optimal sepsis-related variables screened by the XGBoost algorithm. Among the five machine learning methods, the XGBoost model (AUROC = 0.956) showed the most stable performance for establishing a diagnostic model. The SHapley Additive exPlanations (SHAP) package was used to interpret the XGBoost model. Pearson analysis confirmed that the expression of sphinganine and mannose-6-phosphate was positively associated with APACHE-II, PCT, WBC, CRP, and IL-6. We also demonstrated that sphinganine strongly diminished the LDH content in LPS-treated Caco-2 cells. In addition, using both in vitro and in vivo examination, we revealed that sphinganine strongly protects against sepsis-induced intestinal barrier injury. DISCUSSION: These findings highlight the potential diagnostic value of the ML models and also provide new insight into enhanced therapy and/or preventative measures against sepsis.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset reconstructs the annual mass balance of glaciers larger than 0.1 km² in the Tien Shan and Pamir regions from 1950 to 2022. The dataset is derived using a nonlinear relationship between glacier mass balance and meteorological and topographical variables. The reconstruction method employs the XGBoost algorithm. Initially, XGBoost is trained on the complete training dataset, followed by incremental training for each sub-region to tailor models to specific regional characteristics. The final training results yield an average coefficient of determination (R²) of 0.87.
All code used in this dataset is publicly available and organized into the following five sections:
Data Processing
Model Training
Result Analysis
Result Evaluation
SHAP Analysis
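As an illustration of the two-stage training described above (a global model followed by continued training for a sub-region), a minimal sketch using the Python xgboost API on placeholder data might look like the following; it is not the dataset's actual code or configuration.

```python
# Sketch of the two-stage training described above: a global XGBoost model
# trained on the full dataset, then continued (incremental) training on one
# sub-region's data via the xgb_model argument (placeholder data and
# parameters, not the dataset's actual configuration).
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(1)
X_all, y_all = rng.normal(size=(5000, 12)), rng.normal(size=5000)
params = {"objective": "reg:squarederror", "eta": 0.05, "max_depth": 6}

# Stage 1: train on the complete training dataset.
global_booster = xgb.train(params, xgb.DMatrix(X_all, label=y_all),
                           num_boost_round=300)

# Stage 2: incremental training for one sub-region, starting from the global model.
X_sub, y_sub = rng.normal(size=(800, 12)), rng.normal(size=800)
regional_booster = xgb.train(params, xgb.DMatrix(X_sub, label=y_sub),
                             num_boost_round=100, xgb_model=global_booster)
```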
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Accurate prediction of water inrush volumes is essential for safeguarding tunnel construction operations. This study proposes a method for predicting tunnel water inrush volumes, leveraging the eXtreme Gradient Boosting (XGBoost) model optimized with Bayesian techniques. To maximize the utility of available data, 654 datasets with missing values were imputed and augmented, forming a robust dataset for the training and validation of the Bayesian-optimized XGBoost (BO-XGBoost) model. Furthermore, the SHapley Additive exPlanations (SHAP) method was employed to elucidate the contribution of each input feature to the predictive outcomes. The results indicate that: (1) the constructed BO-XGBoost model exhibited exceptionally high predictive accuracy on the test set, with a root mean square error (RMSE) of 7.5603, mean absolute error (MAE) of 3.2940, mean absolute percentage error (MAPE) of 4.51%, and coefficient of determination (R2) of 0.9755; (2) compared to the predictive performance of support vector machine (SVR), decision tree (DT), and random forest (RF) models, the BO-XGBoost model demonstrates the highest R2 value and the smallest prediction error; (3) the input feature importance yielded by SHAP is groundwater level (h) > water-producing characteristics (W) > tunnel burial depth (H) > rock mass quality index (RQD). The proposed BO-XGBoost model exhibited exceptionally high predictive accuracy on the tunnel water inrush volume prediction dataset, thereby aiding managers in making informed decisions to mitigate water inrush risks and ensuring the safe and efficient advancement of tunnel projects.
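A minimal sketch of the SHAP attribution step on synthetic stand-in data is shown below; the feature names follow the abstract, but the data, model settings, and Bayesian optimisation step are placeholders rather than the study's implementation.

```python
# Minimal sketch of SHAP-based feature attribution for an XGBoost regressor
# (synthetic stand-in data named after the abstract's inputs; not the study's
# data or its Bayesian-optimised hyperparameters).
import numpy as np
import pandas as pd
import shap
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(654, 4)),
                 columns=["h", "W", "H", "RQD"])   # feature names from the abstract
y = rng.normal(size=654)                            # placeholder water-inrush volumes

model = XGBRegressor(n_estimators=200, learning_rate=0.1).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
importance = np.abs(shap_values).mean(axis=0)       # mean |SHAP| per feature
print(dict(zip(X.columns, importance.round(3))))
```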
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
XGBTree achieved the best performance in most of the evaluation metrics (PrePro—Pre-processing type (B—Balanced (ENUS), O—Original data); VarRem—Variable removal (Y—Yes, N—No)).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Times for threads represent model runs on CPU with the corresponding number of threads used for speedup. The last row corresponds to running the same model on a single NVIDIA V100 GPU.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: The ability to assess adverse outcomes in patients with community-acquired pneumonia (CAP) could improve clinical decision-making and enhance clinical practice, but studies remain insufficient and, similarly, few machine learning (ML) models have been developed. Objective: We aimed to explore the effectiveness of predicting adverse outcomes in CAP through ML models. Methods: A total of 2,302 adults with CAP, prospectively recruited between January 2012 and March 2015 across three cities in South America, were extracted from DryadData. After a 70:30 training/test split of the data, nine ML algorithms were executed and their diagnostic accuracy was measured mainly by the area under the curve (AUC). The nine ML algorithms included decision trees, random forests, extreme gradient boosting (XGBoost), support vector machines, Naïve Bayes, K-nearest neighbors, ridge regression, logistic regression without regularization, and neural networks. The adverse outcomes included hospital admission, mortality, ICU admission, and one-year post-enrollment status. Results: The XGBoost algorithm had the best performance in predicting hospital admission. Its AUC reached 0.921, and its accuracy, precision, recall, and F1-score were better than those of the other models. In predicting ICU admission, a model trained with the XGBoost algorithm showed the best performance with an AUC of 0.801. The XGBoost algorithm also performed well at predicting one-year post-enrollment status; the AUC, accuracy, precision, recall, and F1-score indicated that the algorithm had high accuracy and precision. In addition, the best performance in predicting death was achieved by the neural network algorithm (AUC 0.831). Conclusions: ML algorithms, particularly XGBoost, were feasible and effective in predicting adverse outcomes of CAP patients. ML models based on available common clinical features have great potential to guide individual treatment and subsequent clinical decisions.
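For illustration only, the sketch below applies a 70:30 split and an AUC-based comparison to a few of the listed algorithms on synthetic data; it does not reproduce the study's clinical features or tuned models.

```python
# Sketch of a 70:30 split and AUC comparison as described above (synthetic
# stand-in data; the study's clinical features are not reproduced here).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2302, 20))
y = rng.integers(0, 2, size=2302)                   # e.g. hospital admission yes/no

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)

models = {
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=300, random_state=0),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "XGBoost": XGBClassifier(n_estimators=300, eval_metric="logloss"),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")
```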
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains data, results, and processing material from the application of GEOBIA-based, Spatially Partitioned Segmentation Parameter Optimization (SPUSPO) in the city of Ouagadougou. In detail, it contains:
Labels:
2 : Artificial Ground Surface
0 : Building
5 : Low Vegetation
4 : Tree
1 : Swimming Pool
3 : Bare Ground
7 : Shadow
6 : Inland Water
The data are given in a csv format.
Python code calling GRASS GIS functions for automating the procedure.
Segmentation rasters for each approach.
A csv file with the data used to compute the Area Fit Index for each approach.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Carbon Capture and Storage (CCS) relevant Reactive Transport Modelling (RTM) of microfractures in basaltic rock, emulated using Gradient Boosted Decision Trees (GBDT) and subsequently optimised using a Bayesian Optimisation (BO) framework. This project's code is hosted on GitHub at https://github.com/ThomasDodd97/CCS-RTM-GBDT-BO. This upload on Zenodo contains the dataset used to train four XGBoost GBDT surrogate models, whose model files are also uploaded here.
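A hypothetical sketch of loading one of the uploaded surrogate models with the Python xgboost API follows; the file name surrogate_model_1.json is a placeholder, since the actual model file names in the upload are not listed above.

```python
# Hypothetical sketch of loading an uploaded XGBoost surrogate model and
# querying it; "surrogate_model_1.json" is a placeholder filename.
import numpy as np
import xgboost as xgb

booster = xgb.Booster()
booster.load_model("surrogate_model_1.json")          # placeholder filename

# Evaluate the surrogate at a new point in the RTM input space
# (the number of input features is read from the loaded model).
x_new = np.zeros((1, booster.num_features()))
print(booster.predict(xgb.DMatrix(x_new)))
```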
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Accurate real-time icing grid fields are critical for preventing ice-related disasters during winter and protecting property. These fields are essential both for mapping ice distribution and for predicting icing using physical models combined with numerical weather prediction systems. However, developing precise real-time icing grids is challenging due to the uneven distribution of monitoring stations, data confidentiality restrictions, and the limitations of existing interpolation methods. In this study, we propose a new approach for constructing real-time icing grid fields using 1,339 online terminal monitoring datasets provided by the China Southern Power Grid Research Institute Co., Ltd. (CSPGRI) during the winter of 2023. Our method integrates static geographic information, dynamic meteorological factors, and ice_kriging values derived from parameter-optimized Empirical Bayesian Kriging Interpolation (EBKI) to create a spatiotemporally matched, multi-source fused icing thickness grid dataset. We applied five machine learning algorithms—Random Forest, XGBoost, LightGBM, Stacking, and Convolutional Neural Network Transformers (CNNT)—and evaluated their performance using six metrics: R, RMSE, CSI, MAR, FAR, and fbias, on both validation and testing sets. The stacking model performed best, achieving an R value of 0.634 (0.893), RMSE of 3.424 mm (2.834 mm), CSI of 0.514 (0.774), MAR of 0.309 (0.091), FAR of 0.332 (0.161), and fbias of 1.034 (1.084), respectively, when comparing predicted icing values with actual measurements on pylons. Additionally, we employed the SHAP model to provide a physical interpretation of the stacking model, confirming the independence of selected features. Meteorological factors such as relative humidity (RH), 10-meter wind speed (WS10), 2-meter temperature (T2), and precipitation (PRE) demonstrated a range of positive and negative contributions consistent with the observed growth of icing. Thus, our multi-source remote sensing data fusion approach, combined with the stacking model, offers a highly accurate and interpretable solution for generating real-time icing grid fields.
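For illustration, a minimal stacking sketch on synthetic stand-in data follows; only two of the five base learners are shown, and none of the study's features, fused data sources, or tuning are reproduced.

```python
# Minimal sketch of a stacking ensemble like the one described above
# (synthetic stand-in data; only two base learners are shown, and this is not
# the study's actual feature set or tuned configuration).
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1339, 10))       # e.g. static geographic + meteorological factors
y = rng.gamma(2.0, 2.0, size=1339)    # placeholder icing thickness (mm)

stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=300, random_state=0)),
        ("xgb", XGBRegressor(n_estimators=300, learning_rate=0.05)),
    ],
    final_estimator=Ridge(),           # meta-learner combining base predictions
    cv=5,
)
stack.fit(X, y)
print("Predicted icing thickness (mm):", stack.predict(X[:3]).round(2))
```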