The data distribution and details of datasets used to train XGBoost models.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository includes the RNA-seq dataset from 27 GBM samples, as published in this manuscript:
Topographic mapping of the glioblastoma proteome reveals a triple axis model of intra-tumoral heterogeneity
Lam KHB, Leon AJ, Hui W, Lee SCE, Batruch I, Faust K, Koritzinsky M, Richer M, Djuric U, Diamandis P (under review)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Hyperparameters tuned for PLS-DA, random forest, and XGBoost for the train set of the ST000369 dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Hyperparameters tuned for PLS-DA, random forest, and XGBoost for the train set of the MTBLS547 dataset.
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
Semi-flexible docking was performed using AutoDock Vina 1.2.2 software on the SARS-CoV-2 main protease Mpro (PDB ID: 6WQF). Two data sets are provided in the xyz format containing the AutoDock Vina docking scores. These files were used as input and/or reference in machine learning models built with TensorFlow, XGBoost, and SchNetPack to study their capability to predict docking scores. The first data set originally contained 60,411 in-vivo labeled compounds selected for the training of ML models. The second data set, denoted as in-vitro-only, originally contained 175,696 compounds active or assumed to be active at 10 μM or less in a direct binding assay. These sets were downloaded on the 10th of December 2021 from the ZINC15 database. Four compounds in the in-vivo set and 12 in the in-vitro-only set were left out of consideration due to the presence of Si atoms. Compounds with no charges assigned in mol2 files were excluded as well (523 compounds in the in-vivo and 1,666 in the in-vitro-only set). Gasteiger charges were reassigned to the remaining compounds using OpenBabel. In addition, four in-vitro-only compounds with docking scores greater than 1 kcal/mol were rejected. The provided in-vivo and in-vitro-only sets contain 59,884 (in-vivo.xyz) and 174,014 (in-vitro-only.xyz) compounds, respectively. Compounds in both sets contain the following elements: H, C, N, O, F, P, S, Cl, Br, and I. The in-vivo compound set was used as the primary data set for the training of the ML models in the referenced study. The file in-vivo-splits-data.csv contains the exact composition of all (random) 80-5-15 train-validation-test splits used in the study, labeled I, II, III, IV, and V. Eight additional random subsets in each of the in-vivo 80-5-15 splits were created to monitor the training process convergence.
These subsets were constructed in such a way that each subset contains all compounds from the previous subset (starting with the 10-5-15 subset) and was enlarged by one eighth of the entire (80-5-15) train set of a given split. These subsets are further referred to as in_vivo_10_(I, II, ..., V), in_vivo_20_(I, II, ..., V), ..., in_vivo_80_(I, II, ..., V).
Methods: Molecular docking calculations and the machine learning approaches are described in the Computational details section of [1].
Reference: [1] Lukas Bucinsky, Marián Gall, Ján Matúška, Michal Pitoňák, Marek Štekláč. Advances and critical assessment of machine learning techniques for prediction of docking scores. Int. J. Quantum Chem. (2023). DOI: 10.1002/qua.27110.
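The 80-5-15 splitting and the eight nested convergence subsets described above can be sketched as follows. This is a minimal illustration only: the exact split compositions are those recorded in in-vivo-splits-data.csv, which this code does not reproduce.

```python
import random

def split_80_5_15(ids, seed=0):
    """Random 80-5-15 train/validation/test split of compound IDs."""
    rng = random.Random(seed)
    ids = list(ids)
    rng.shuffle(ids)
    n = len(ids)
    n_train, n_val = int(0.80 * n), int(0.05 * n)
    return (ids[:n_train],                      # 80% train
            ids[n_train:n_train + n_val],       # 5% validation
            ids[n_train + n_val:])              # 15% test

def nested_train_subsets(train):
    """Eight nested subsets of the train set; each contains the previous
    one and is enlarged by one eighth of the full (80%) train set."""
    step = len(train) // 8
    return [train[:step * k] for k in range(1, 9)]

train, val, test = split_80_5_15(range(1000), seed=42)
subsets = nested_train_subsets(train)
```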
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Evaluation of parameters for the XGBoost models of different training and test sets for COVID-19 deaths.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is part of a machine learning project focused on predicting rainfall, a critical task for sectors like agriculture, water resource management, and disaster prevention. The project employs machine learning algorithms to forecast rainfall occurrences based on historical weather data, including features like temperature, humidity, and pressure.
The primary goal of the dataset is to train multiple machine learning models to predict rainfall and compare their performances. The insights gained will help identify the most accurate models for real-world predictions of rainfall events.
The dataset is derived from various historical weather observations, including temperature, humidity, wind speed, and pressure, collected by weather stations across Australia. These observations are used as inputs for training machine learning models. The dataset is publicly available on platforms like Kaggle and is often used in competitions and research to advance predictive analytics in meteorology.
The dataset consists of weather data from multiple Australian weather stations, spanning various time periods. Key features include:
Temperature
Humidity
Wind Speed
Pressure
Rainfall (target variable)
These features are tracked for each weather station over different times, with the goal of predicting rainfall.
Python: The primary programming language for data analysis and machine learning.
scikit-learn: For implementing machine learning models.
XGBoost, LightGBM, and CatBoost: Popular libraries for building more advanced ensemble models.
Matplotlib/Seaborn: For data visualization.
These libraries and tools help in data manipulation, modeling, evaluation, and visualization of results.
DBRepo Authorization: Required to retrieve datasets via the DBRepo API.
Model Comparison Charts: The project includes output charts comparing the performance of seven popular machine learning models.
Trained Models (.pkl files): Pre-trained models are saved as .pkl files for reuse without retraining.
Documentation and Code: A Jupyter notebook guides through the process of data analysis, model training, and evaluation.
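The train-evaluate-persist workflow described above can be sketched with scikit-learn and the standard pickle module. The data here are a synthetic stand-in (the feature names, target definition, and file name are illustrative, not taken from the actual Australian weather dataset):

```python
import pickle
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the weather features: temperature, humidity,
# wind speed, pressure (columns), with a binary "rain" target driven
# mostly by the humidity column.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
acc = accuracy_score(y_te, model.predict(X_te))

# Persist the trained model as a .pkl file for reuse without retraining,
# then reload it, as the project's saved models are meant to be used.
with open("rain_model.pkl", "wb") as f:
    pickle.dump(model, f)
with open("rain_model.pkl", "rb") as f:
    reloaded = pickle.load(f)
```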
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset and code package supports the reproducible evaluation of structured Large Language Model (LLM)-generated event messaging using multiple machine learning classifiers, including BERT (via TensorFlow/Keras), XGBoost, and ensemble methods.
The materials accompany the study presented in [Lynch, Christopher, Erik Jensen, Ross Gore, et al. "AI-Generated Messaging for Life Events Using Structured Prompts: A Comparative Study of GPT With Human Experts and Machine Learning." TechRxiv (2025), DOI: https://doi.org/10.36227/techrxiv.174123588.85605769/v1], where Structured Narrative Prompting was applied to generate life-event messages from LLMs, followed by human annotation and machine learning validation. This release provides complete transparency for reproducing reported metrics and facilitates further benchmarking in multilingual or domain-specific contexts.
Value of the Data:
* Enables direct replication of published results across BERT, Keras-based models, XGBoost, and ensemble classifiers.
* Provides clean, human-tagged datasets suitable for training, evaluation, and bias analysis.
* Offers untagged datasets for new annotation or domain adaptation.
* Contains full preprocessing, training, and visualization code in Python and R for flexibility across workflows.
* Facilitates extension into other domains (e.g., multilingual LLM messaging validation).
Data Description:
* /data/tagged/*.csv – Human-labeled datasets with schema defined in data_dictionary.csv.
* /data/untagged/*.csv – Clean datasets without labels for inference or annotation.
* /code/python/ – Python scripts for preprocessing, model training (BERT, Keras DNN, XGBoost), ensembling, evaluation metrics, and plotting.
* /code/r/ – R scripts for exploratory data analysis, statistical testing, and replication of key figures/tables.
File Formats:
* Data: CSV (UTF-8, RFC 4180)
* Code: .py, .R, .Rproj
Ethics & Licensing:
* All data are de-identified and contain no PII.
* Released under CC BY 4.0 (data) and MIT License (code).
Limitations:
* Labels reflect annotator interpretations and may encode bias.
* Models trained on English text; generalization to other languages requires adaptation.
Funding Note:
* Funding sources provided time in support of human taggers annotating the data sets.
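A classifier-validation pipeline of the kind described above can be sketched in a few lines of scikit-learn. This is a toy stand-in, not the released code: the texts and labels are invented for illustration, TF-IDF replaces BERT embeddings, and a soft-voting ensemble of logistic regression and gradient boosting stands in for the BERT/XGBoost/ensemble classifiers.

```python
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented labeled messages standing in for the tagged CSVs
# (1 = celebratory life event, 0 = loss event; labels are illustrative).
texts = ["congratulations on your wedding", "sorry for your loss",
         "happy birthday to you", "our condolences to the family",
         "best wishes on your graduation", "deepest sympathy in this hard time"] * 10
labels = [1, 0, 1, 0, 1, 0] * 10

# TF-IDF features feeding a soft-voting ensemble of two classifiers.
clf = make_pipeline(
    TfidfVectorizer(),
    VotingClassifier(
        estimators=[("lr", LogisticRegression(max_iter=1000)),
                    ("gb", GradientBoostingClassifier(random_state=0))],
        voting="soft"),
)
clf.fit(texts, labels)
pred = clf.predict(["warmest wishes on your wedding day"])
```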
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Models and Predictions
This dataset contains the trained XGBoost and EA-LSTM models and the models' predictions for the paper "The Proper Care and Feeding of CAMELS: How Limited Training Data Affects Streamflow Prediction."
For each combination of model (XGBoost, EA-LSTM), training years (3, 6, 9), number of basins (13, 26, 53, 265, 531), and seed (111-888), there are five folders. Each corresponds to a random basin sample (for 531 basins there's only one folder, since it's all basins). In each folder, there are three files:
model.pkl (XGBoost) or model_epoch30.pt (EA-LSTM), which stores the pickled trained model
xgboost_seedNNN.p or ealstm_seedNNN.p, which stores a pickled dictionary that maps each basin to the DataFrame of predicted and actual daily streamflow
attributes.db, which stores static catchment attributes needed for inference
In addition to each folder, there is a SLURM submission script (.sbatch) that was used to create and evaluate the model in the folder.
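Loading one of the pickled prediction dictionaries and scoring each basin can be sketched as below. The dictionary here is a synthetic stand-in, and the column names "qobs"/"qsim" and the basin ID are assumptions for illustration; consult the actual files for the real schema.

```python
import pickle
import numpy as np
import pandas as pd

def nse(df):
    """Nash-Sutcliffe efficiency of predicted vs. observed streamflow."""
    obs, sim = df["qobs"].to_numpy(), df["qsim"].to_numpy()
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

# Synthetic stand-in for one pickled results dictionary
# (basin id -> DataFrame), mimicking xgboost_seedNNN.p / ealstm_seedNNN.p.
rng = np.random.default_rng(0)
obs = rng.gamma(2.0, 1.5, size=365)
results = {"01013500": pd.DataFrame({"qobs": obs,
                                     "qsim": obs + rng.normal(0, 0.1, 365)})}

# Round-trip through pickle, as one would with the released .p files,
# then compute a per-basin score from the loaded dictionary.
loaded = pickle.loads(pickle.dumps(results))
scores = {basin: nse(df) for basin, df in loaded.items()}
```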
Objectives: This study constructed and validated a machine learning model to predict CD8+ tumor-infiltrating lymphocyte expression levels in patients with pancreatic ductal adenocarcinoma (PDAC) using computed tomography (CT) radiomic features.
Materials and Methods: In this retrospective study, 184 PDAC patients were randomly assigned to a training dataset (n = 137) and validation dataset (n = 47). All patients were divided into CD8+ T-high and -low groups using X-tile plots. A total of 1409 radiomics features were extracted from the segmentation of regions of interest, based on preoperative CT images of each patient. The LASSO algorithm was applied to reduce the dimensionality of the data and select features. The extreme gradient boosting classifier (XGBoost) was developed using a training set consisting of 137 consecutive patients admitted between January 2017 and December 2017. The model was validated in 47 consecutive patients admitted between January 2018 and April 2018. The performance of the XGBoost classifier was determined by its discriminative ability, calibration, and clinical usefulness.
Results: The cut-off value of the CD8+ T-cell level was 18.69%, as determined by the X-tile program. A Kaplan−Meier analysis indicated a correlation between higher CD8+ T-cell levels and better overall survival (p = 0.001). The XGBoost classifier showed good discrimination in the training set (area under the curve [AUC], 0.75; 95% confidence interval [CI]: 0.67–0.83) and validation set (AUC, 0.67; 95% CI: 0.51–0.83). Moreover, it showed good calibration. The sensitivity, specificity, accuracy, and positive and negative predictive values were 80.65%, 60.00%, 0.69, 0.63, and 0.79, respectively, for the training set, and 80.95%, 57.69%, 0.68, 0.61, and 0.79, respectively, for the validation set.
Conclusions: We developed a CT-based XGBoost classifier to extrapolate the infiltration levels of CD8+ T-cells in patients with PDAC. This method could be useful in identifying potential patients who can benefit from immunotherapies.
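The LASSO-then-boosting pipeline described above (high-dimensional features, coefficient-based selection, gradient-boosted classifier) can be sketched on synthetic data. This is an illustrative stand-in under stated assumptions: scikit-learn's GradientBoostingClassifier substitutes for XGBoost, and the data are generated, not the radiomics features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LassoCV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 184 patients x 1409 "radiomics" features,
# split into training (137) and validation (47) sets as in the study design.
X, y = make_classification(n_samples=184, n_features=1409,
                           n_informative=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=47, random_state=0)

# LASSO-based dimensionality reduction: keep features whose
# cross-validated LASSO coefficients are nonzero.
lasso = LassoCV(cv=5, random_state=0).fit(X_tr, y_tr)
keep = np.flatnonzero(lasso.coef_)

# Gradient-boosted classifier on the selected features, scored by AUC.
clf = GradientBoostingClassifier(random_state=0).fit(X_tr[:, keep], y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te[:, keep])[:, 1])
```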
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Hyperparameters tuned for PLS-DA, random forest, and XGBoost for the train set of the MTBLS404 dataset.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
In the pharmaceutical industry it is common to generate many QSAR models from training sets containing a large number of molecules and a large number of descriptors. The best QSAR methods are those that can generate the most accurate predictions but that are not overly expensive computationally. In this paper we compare eXtreme Gradient Boosting (XGBoost) to random forest and single-task deep neural nets on 30 in-house data sets. While XGBoost has many adjustable parameters, we can define a set of standard parameters at which XGBoost makes predictions that are, on average, better than those of random forest and almost as good as those of deep neural nets. The biggest strength of XGBoost is its speed. Whereas efficient use of random forest requires generating each tree in parallel on a cluster, and deep neural nets are usually run on GPUs, XGBoost can be run on a single CPU in less than a third of the wall-clock time of either of the other methods.
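The "one fixed standard parameter set, applied across data sets" comparison can be sketched on a synthetic QSAR-like regression task. The parameter values below are illustrative, not the paper's, and scikit-learn's GradientBoostingRegressor stands in for XGBoost.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: molecules (rows) x descriptors (columns).
X, y = make_regression(n_samples=600, n_features=50, n_informative=20,
                       noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# One fixed "standard" boosting configuration reused as-is, in the
# spirit of the paper's standard-parameter comparison.
gbt = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05,
                                max_depth=4, random_state=0).fit(X_tr, y_tr)
rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)

r2_gbt = r2_score(y_te, gbt.predict(X_te))
r2_rf = r2_score(y_te, rf.predict(X_te))
```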
Introduction: There is a cumulative risk of 20–40% of developing brain metastases (BM) in solid cancers. Stereotactic radiotherapy (SRT) enables the application of high focal doses of radiation to a volume and is often used for BM treatment. However, SRT can cause adverse radiation effects (ARE), such as radiation necrosis, which sometimes cause irreversible damage to the brain. It is therefore of clinical interest to identify patients at a high risk of developing ARE. We hypothesized that models trained with radiomics features, deep learning (DL) features, and patient characteristics, or their combination, can predict ARE risk in patients with BM before SRT.
Methods: Gadolinium-enhanced T1-weighted MRIs and characteristics from patients treated with SRT for BM were collected for a training and testing cohort (N = 1,404) and a validation cohort (N = 237) from a separate institute. From each lesion in the training set, radiomics features were extracted and used to train an extreme gradient boosting (XGBoost) model. A DL model was trained on the same cohort to make a separate prediction and to extract the last layer of features. Different XGBoost models were built using only radiomics features, DL features, and patient characteristics, or a combination of them. Evaluation was performed using the area under the curve (AUC) of the receiver operating characteristic curve on the external dataset. Predictions for individual lesions and per patient developing ARE were investigated.
Results: The best-performing XGBoost model on a lesion level was trained on a combination of radiomics features and DL features (AUC of 0.71 and recall of 0.80). On a patient level, a combination of radiomics features, DL features, and patient characteristics obtained the best performance (AUC of 0.72 and recall of 0.84). The DL model achieved an AUC of 0.64 and recall of 0.85 per lesion and an AUC of 0.70 and recall of 0.60 per patient.
Conclusion: Machine learning models built on radiomics features and DL features extracted from BM, combined with patient characteristics, show potential to predict ARE at the patient and lesion levels. These models could be used in clinical decision making, informing patients of their risk of ARE and allowing physicians to opt for different therapies.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Carbon Capture Storage (CCS)-relevant Reactive Transport Modelling (RTM) of microfractures in basaltic rock, emulated using Gradient Boosted Decision Trees (GBDT) and subsequently optimised using a Bayesian Optimisation (BO) framework. This project's code is hosted on GitHub at https://github.com/ThomasDodd97/CCS-RTM-GBDT-BO. This upload on Zenodo contains the dataset used to train four XGBoost GBDT surrogate models, whose model files are also uploaded here.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset reconstructs the annual mass balance of glaciers larger than 0.1 km² in the Tien Shan and Pamir regions from 1950 to 2022. The dataset is derived using a nonlinear relationship between glacier mass balance and meteorological and topographical variables. The reconstruction method employs the XGBoost algorithm. Initially, XGBoost is trained on the complete training dataset, followed by incremental training for each sub-region to tailor models to specific regional characteristics. The final training results yield an average coefficient of determination (R²) of 0.87.
All code used in this dataset is publicly available and organized into the following five sections:
Data Processing
Model Training
Result Analysis
Result Evaluation
SHAP Analysis
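The global-then-regional incremental training step can be sketched as follows. As a stand-in for XGBoost's continued-training mechanism (the `xgb_model` argument of `xgb.train`), this sketch uses scikit-learn's `warm_start` to add trees on a sub-region's data after an initial fit on pooled data; the features, responses, and parameter values are synthetic assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

def make_region(slope, n=400):
    """Synthetic (meteorological/topographic features -> mass balance) data."""
    X = rng.normal(size=(n, 5))
    y = slope * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 0.1, n)
    return X, y

X_all, y_all = make_region(1.0, n=1200)   # pooled training data, all regions
X_sub, y_sub = make_region(1.6)           # one sub-region with its own response

# Stage 1: train a global model on the complete training dataset.
model = GradientBoostingRegressor(n_estimators=200, warm_start=True,
                                  random_state=0).fit(X_all, y_all)

# Stage 2: continue training (append 200 more trees) on the
# sub-region's data only, tailoring the model to that region.
model.n_estimators = 400
model.fit(X_sub, y_sub)

r2_sub = r2_score(y_sub, model.predict(X_sub))
```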
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Herein, we report machine learning algorithms trained on data sets comprising both successful and failed experiments for studying the crystallization propensity of metal–organic nanocapsules (MONCs). Among the machine learning algorithms studied, XGBoost affords the highest prediction accuracy, of >90%. The chemical feature scores derived from the XGBoost model, which rank the importance of the reaction parameters, help identify synthesis parameters for successfully synthesizing new hierarchical structures of MONCs, showing performance superior to that of a well-trained chemist. This work demonstrates that machine learning algorithms can help chemists search faster for optimal reaction parameters among many experimental variables, whose features are usually hidden in high-dimensional space.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Fastest training times for CNN and XGBoost on CPU and GPU (all features).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Colon cancer recurrence is a common adverse outcome for patients after complete mesocolic excision (CME) and greatly affects the near-term and long-term prognosis of patients. This study aimed to develop a machine learning model that can identify high-risk factors before, during, and after surgery, and predict the occurrence of postoperative colon cancer recurrence.
Methods: The study included 1187 patients with colon cancer, including 110 patients who had recurrent colon cancer. The researchers collected 44 characteristic variables, including patient demographic characteristics, basic medical history, preoperative examination information, type of surgery, and intraoperative information. Four machine learning algorithms, namely extreme gradient boosting (XGBoost), random forest (RF), support vector machine (SVM), and the k-nearest neighbor algorithm (KNN), were used to construct the model. The researchers evaluated the model using the k-fold cross-validation method, ROC curve, calibration curve, decision curve analysis (DCA), and external validation.
Results: Among the four prediction models, the XGBoost algorithm performed the best. The ROC curve results showed that the AUC value of XGBoost was 0.962 in the training set and 0.952 in the validation set, indicating high prediction accuracy. The XGBoost model was stable during internal validation using the k-fold cross-validation method. The calibration curve demonstrated high predictive ability of the XGBoost model. The DCA curve showed that patients who received interventional treatment had a higher benefit rate under the XGBoost model. The external validation set's AUC value was 0.91, indicating good extrapolation of the XGBoost prediction model.
Conclusion: The XGBoost machine learning algorithm-based prediction model for colon cancer recurrence has high prediction accuracy and clinical utility.
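The four-model, k-fold cross-validated comparison described above can be sketched with scikit-learn. The data are a synthetic, imbalanced stand-in for the 44-variable recurrence dataset, and GradientBoostingClassifier substitutes for XGBoost:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: 44 characteristic variables, ~10% recurrence rate.
X, y = make_classification(n_samples=600, n_features=44, n_informative=10,
                           weights=[0.9, 0.1], random_state=0)

models = {
    "GBT (XGBoost stand-in)": GradientBoostingClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC(random_state=0)),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

# Stratified 5-fold cross-validation, comparing models by mean ROC AUC.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc = {name: cross_val_score(m, X, y, cv=cv, scoring="roc_auc").mean()
       for name, m in models.items()}
```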
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Objective: Non-puerperal mastitis (NPM) is an inflammatory breast disease affecting women during non-lactation periods, and it is prone to relapse after being cured. Accurate prediction of its recurrence is crucial for personalized adjuvant therapy, and pathological examination is the primary basis for the classification, diagnosis, and confirmation of NPM. Currently, there is a lack of recurrence models for NPM. The aim of this research is to create and validate a recurrence model using machine learning for patients with NPM.
Methods: We retrospectively collected laboratory data from 120 NPM patients, dividing them into a non-recurrence group (n = 59) and a recurrence group (n = 61). These individuals were randomly split into a training cohort and a testing cohort in a 90%:10% ratio for model building. Additionally, data from 25 NPM patients from another center were collected to serve as an external validation cohort for the model. Univariate analysis was used to examine differential indicators, and variable selection was conducted through LASSO regression. Four machine learning algorithms (XGBoost, logistic regression, random forest, AdaBoost) were employed to predict NPM recurrence, and the model with the highest area under the curve (AUC) in the test set was selected as the best model. The selected model was interpreted and evaluated using receiver operating characteristic (ROC) curves, calibration curves, decision curve analysis (DCA), and Shapley Additive Explanations (SHAP) plots.
Results: The logistic regression model emerged as the optimal model for predicting recurrence of NPM with machine learning, primarily utilizing three variables: FIB, bacterial infection, and CD4+ T cell count. The model showed an AUC of 0.846 in the training cohort and 0.833 in the testing cohort. The calibration curve indicated excellent calibration of the model. DCA revealed that the model possessed favorable clinical utility. Furthermore, the model performed effectively in the external validation cohort, with an AUC of 0.825.
Conclusion: The machine learning model developed in this study, serving as an effective tool for predicting NPM recurrence, helps doctors make more individualized treatment decisions, thereby enhancing therapeutic efficacy and reducing the risk of recurrence.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Evaluation of parameters for the ARIMA model of different training and test sets for COVID-19 confirmed cases.
The data distribution and details of datasets used to train XGBoost models.