40 datasets found
  1. The data distribution and details of datasets used to train XGBoost models.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Oct 7, 2024
    Cite
    Sun, Yan; Liu, Qian; Hu, Pingzhao; Huang, Zi Huai; Chen, Lianghong; Domaratzki, Mike (2024). The data distribution and details of datasets used to train XGBoost models. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001285960
    Dataset updated
    Oct 7, 2024
    Authors
    Sun, Yan; Liu, Qian; Hu, Pingzhao; Huang, Zi Huai; Chen, Lianghong; Domaratzki, Mike
    Description

    The data distribution and details of datasets used to train XGBoost models.

  2. RNA dataset to train XGBoost model

    • zenodo.org
    bin, txt
    Updated Nov 3, 2021
    Cite
    Brian Lam; Alberto Leon; Ugljesa Djuric; Phedias Diamandis (2021). RNA dataset to train XGBoost model [Dataset]. http://doi.org/10.5281/zenodo.5639569
    Available download formats: txt, bin
    Dataset updated
    Nov 3, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Brian Lam; Alberto Leon; Ugljesa Djuric; Phedias Diamandis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository includes the RNA-seq dataset from 27 GBM samples, as published in this manuscript:

    Topographic mapping of the glioblastoma proteome reveals a triple axis model of intra-tumoral heterogeneity
    Lam KHB, Leon AJ, Hui W, Lee SCE, Batruch I, Faust K, Koritzinsky M, Richer M, Djuric U, Diamandis P (under review)

  3. Hyperparameters tuned for PLS-DA, random forest, and XGBoost for the train set of the ST000369 dataset.

    • plos.figshare.com
    xlsx
    Updated Jun 2, 2023
    Cite
    Olatomiwa O. Bifarin (2023). Hyperparameters tuned for PLS-DA, random forest, and XGBoost for the train set of the ST000369 dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0284315.s008
    Available download formats: xlsx
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Olatomiwa O. Bifarin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Hyperparameters tuned for PLS-DA, random forest, and XGBoost for the train set of the ST000369 dataset.

  4. Hyperparameters tuned for PLS-DA, random forest, and XGBoost for the train set of the MTBLS547 dataset.

    • plos.figshare.com
    xlsx
    Updated Jun 2, 2023
    + more versions
    Cite
    Olatomiwa O. Bifarin (2023). Hyperparameters tuned for PLS-DA, random forest, and XGBoost for the train set of the MTBLS547 dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0284315.s007
    Available download formats: xlsx
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Olatomiwa O. Bifarin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Hyperparameters tuned for PLS-DA, random forest, and XGBoost for the train set of the MTBLS547 dataset.

  5. Data for: Advances and critical assessment of machine learning techniques for prediction of docking scores

    • data.niaid.nih.gov
    • search.dataone.org
    • +2 more
    zip
    Updated Mar 3, 2023
    Cite
    Lukas Bucinsky; Marián Gall; Ján Matúška; Michal Pitoňák; Marek Štekláč (2023). Data for: Advances and critical assessment of machine learning techniques for prediction of docking scores [Dataset]. http://doi.org/10.5061/dryad.zgmsbccg7
    Available download formats: zip
    Dataset updated
    Mar 3, 2023
    Dataset provided by
    Slovak University of Technology in Bratislava
    Comenius University Bratislava
    Authors
    Lukas Bucinsky; Marián Gall; Ján Matúška; Michal Pitoňák; Marek Štekláč
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    Semi-flexible docking was performed using AutoDock Vina 1.2.2 software on the SARS-CoV-2 main protease Mpro (PDB ID: 6WQF). Two data sets are provided in the xyz format containing the AutoDock Vina docking scores. These files were used as input and/or reference in the machine learning models using TensorFlow, XGBoost, and SchNetPack to study their docking score prediction capability. The first data set originally contained 60,411 in-vivo labeled compounds selected for the training of ML models. The second data set, denoted as in-vitro-only, originally contained 175,696 compounds active or assumed to be active at 10 μM or less in a direct binding assay. These sets were downloaded on the 10th of December 2021 from the ZINC15 database. Four compounds in the in-vivo set and 12 in the in-vitro-only set were left out of consideration due to the presence of Si atoms. Compounds with no charges assigned in mol2 files were excluded as well (523 compounds in the in-vivo and 1,666 in the in-vitro-only set). Gasteiger charges were reassigned to the remaining compounds using OpenBabel. In addition, four in-vitro-only compounds with docking scores greater than 1 kcal/mol were rejected. The provided in-vivo and in-vitro-only sets contain 59,884 (in-vivo.xyz) and 174,014 (in-vitro-only.xyz) compounds, respectively. Compounds in both sets contain the following elements: H, C, N, O, F, P, S, Cl, Br, and I. The in-vivo compound set was used as the primary data set for the training of the ML models in the referencing study. The file in-vivo-splits-data.csv contains the exact composition of all (random) 80-5-15 train-validation-test splits used in the study, labeled I, II, III, IV, and V. Eight additional random subsets in each of the in-vivo 80-5-15 splits were created to monitor the training process convergence. These subsets were constructed in such a manner that each subset contains all compounds from the previous subset (starting with the 10-5-15 subset) and was enlarged by one eighth of the entire (80-5-15) train set of a given split. These subsets are further referred to as in_vivo_10_(I, II, ..., V), in_vivo_20_(I, II, ..., V), ..., in_vivo_80_(I, II, ..., V).

    Methods: Molecular docking calculations and the machine learning approaches are described in the Computational details section of [1].

    Reference: [1] Lukas Bucinsky, Marián Gall, Ján Matúška, Michal Pitoňák, Marek Štekláč. Advances and critical assessment of machine learning techniques for prediction of docking scores. Int. J. Quantum Chem. (2023). DOI: 10.1002/qua.27110.
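
    To make the splits concrete, here is a minimal sketch of reading such a concatenated xyz file and drawing one 80-5-15 train-validation-test split; the assumption that the Vina score sits on the comment (second) line of each xyz block is ours, not documented above.

      import random

      def read_xyz_scores(path):
          """Yield one docking score per compound from a concatenated .xyz file."""
          with open(path) as fh:
              while True:
                  header = fh.readline()
                  if not header.strip():
                      break
                  n_atoms = int(header)
                  comment = fh.readline()            # assumed to hold the Vina score
                  yield float(comment.split()[0])    # assumption about the line layout
                  for _ in range(n_atoms):           # skip the coordinate lines
                      fh.readline()

      scores = list(read_xyz_scores("in-vivo.xyz"))  # 59,884 compounds per the description
      idx = list(range(len(scores)))
      random.seed(111)                               # seeds 111-888 are used in the study
      random.shuffle(idx)
      n = len(idx)
      train = idx[: int(0.80 * n)]                   # 80%
      valid = idx[int(0.80 * n) : int(0.85 * n)]     # 5%
      test = idx[int(0.85 * n) :]                    # 15%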

  6. Evaluation of parameters for the XGBoost models of different training and test sets for COVID-19 deaths.

    • plos.figshare.com
    xls
    Updated Jun 10, 2023
    + more versions
    Cite
    Md. Siddikur Rahman; Arman Hossain Chowdhury; Miftahuzzannat Amrin (2023). Evaluation of parameters for the XGBoost models of different training and test sets for COVID-19 deaths. [Dataset]. http://doi.org/10.1371/journal.pgph.0000495.t005
    Available download formats: xls
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    PLOS Global Public Health
    Authors
    Md. Siddikur Rahman; Arman Hossain Chowdhury; Miftahuzzannat Amrin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Evaluation of parameters for the XGBoost models of different training and test sets for COVID-19 deaths.

  7. Rainfall Prediction: Comparison of 7 Popular Models

    • test.researchdata.tuwien.ac.at
    bin, png +1
    Updated Apr 28, 2025
    Cite
    Kaya Ali Kus (2025). Rainfall Prediction: Comparison of 7 Popular Models [Dataset]. http://doi.org/10.70124/p7rh4-0g783
    Available download formats: png, text/markdown, bin
    Dataset updated
    Apr 28, 2025
    Dataset provided by
    TU Wien
    Authors
    Kaya Ali Kus
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 28, 2025
    Description

    Rainfall Prediction using 7 Popular Models

    Context and Methodology

    Research Domain/Project:

    This dataset is part of a machine learning project focused on predicting rainfall, a critical task for sectors like agriculture, water resource management, and disaster prevention. The project employs machine learning algorithms to forecast rainfall occurrences based on historical weather data, including features like temperature, humidity, and pressure.

    Purpose:

    The primary goal of the dataset is to train multiple machine learning models to predict rainfall and compare their performances. The insights gained will help identify the most accurate models for real-world predictions of rainfall events.

    Creation Process:

    The dataset is derived from various historical weather observations, including temperature, humidity, wind speed, and pressure, collected by weather stations across Australia. These observations are used as inputs for training machine learning models. The dataset is publicly available on platforms like Kaggle and is often used in competitions and research to advance predictive analytics in meteorology.

    Technical Details


    Dataset Structure:

    The dataset consists of weather data from multiple Australian weather stations, spanning various time periods. Key features include:

    Temperature
    Humidity
    Wind Speed
    Pressure
    Rainfall (target variable)
    These features are tracked for each weather station over different times, with the goal of predicting rainfall.

    Software Requirements:

    Python: The primary programming language for data analysis and machine learning.
    scikit-learn: For implementing machine learning models.
    XGBoost, LightGBM, and CatBoost: Popular libraries for building more advanced ensemble models.
    Matplotlib/Seaborn: For data visualization.
    These libraries and tools help in data manipulation, modeling, evaluation, and visualization of results.
    DBRepo Authorization: Required to access datasets via the DBRepo API for dataset retrieval.
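
    As a rough illustration of that stack, the sketch below fits a few of the listed model families on tabular weather features and compares test accuracy; the file name and column names are placeholders rather than the dataset's actual schema.

      import pandas as pd
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.metrics import accuracy_score
      from sklearn.model_selection import train_test_split
      from xgboost import XGBClassifier
      from lightgbm import LGBMClassifier
      from catboost import CatBoostClassifier

      df = pd.read_csv("weather.csv")                 # placeholder file name
      X = df[["Temperature", "Humidity", "WindSpeed", "Pressure"]]  # placeholder columns
      y = df["Rainfall"]                              # binary target (placeholder name)

      X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

      models = {
          "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
          "xgboost": XGBClassifier(eval_metric="logloss"),
          "lightgbm": LGBMClassifier(),
          "catboost": CatBoostClassifier(verbose=0),
      }
      for name, model in models.items():
          model.fit(X_tr, y_tr)
          print(name, accuracy_score(y_te, model.predict(X_te)))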

    Additional Resources

    Model Comparison Charts: The project includes output charts comparing the performance of seven popular machine learning models.
    Trained Models (.pkl files): Pre-trained models are saved as .pkl files for reuse without retraining.
    Documentation and Code: A Jupyter notebook guides through the process of data analysis, model training, and evaluation.

  8. ML Classifiers, Human-Tagged Datasets, and Validation Code for Structured LLM-Generated Event Messaging: BERT, Keras, XGBoost, and Ensemble Methods

    • data.mendeley.com
    Updated Aug 15, 2025
    Cite
    Christopher Lynch (2025). ML Classifiers, Human-Tagged Datasets, and Validation Code for Structured LLM-Generated Event Messaging: BERT, Keras, XGBoost, and Ensemble Methods [Dataset]. http://doi.org/10.17632/g2sdzmssgh.1
    Dataset updated
    Aug 15, 2025
    Authors
    Christopher Lynch
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset and code package supports the reproducible evaluation of Structured Large Language Model (LLM)-generated event messaging using multiple machine learning classifiers, including BERT (via TensorFlow/Keras), XGBoost, and ensemble methods. It includes:

    • Tagged datasets (.csv): human-tagged gold labels for evaluation
    • Untagged datasets (.csv): raw data with Prompt matched to corresponding LLM-generated narrative
      • Suitable for inference, semi-automatic labeling, or transfer learning
    • Python and R code for preprocessing, model training, evaluation, and visualization
    • Configuration files and environment specifications to enable end-to-end reproducibility

    The materials accompany the study presented in [Lynch, Christopher, Erik Jensen, Ross Gore, et al. "AI-Generated Messaging for Life Events Using Structured Prompts: A Comparative Study of GPT With Human Experts and Machine Learning." TechRxiv (2025), DOI: https://doi.org/10.36227/techrxiv.174123588.85605769/v1], where Structured Narrative Prompting was applied to generate life-event messages from LLMs, followed by human annotation and machine learning validation. This release provides complete transparency for reproducing reported metrics and facilitates further benchmarking in multilingual or domain-specific contexts.

    Value of the Data:

    • Enables direct replication of published results across BERT, Keras-based models, XGBoost, and ensemble classifiers.
    • Provides clean, human-tagged datasets suitable for training, evaluation, and bias analysis.
    • Offers untagged datasets for new annotation or domain adaptation.
    • Contains full preprocessing, training, and visualization code in Python and R for flexibility across workflows.
    • Facilitates extension into other domains (e.g., multilingual LLM messaging validation).

    Data Description:

    • /data/tagged/*.csv – Human-labeled datasets with schema defined in data_dictionary.csv.
    • /data/untagged/*.csv – Clean datasets without labels for inference or annotation.
    • /code/python/ – Python scripts for preprocessing, model training (BERT, Keras DNN, XGBoost), ensembling, evaluation metrics, and plotting.
    • /code/r/ – R scripts for exploratory data analysis, statistical testing, and replication of key figures/tables.

    File Formats:

    • Data: CSV (UTF-8, RFC 4180)
    • Code: .py, .R, .Rproj

    Ethics & Licensing:

    • All data are de-identified and contain no PII.
    • Released under CC BY 4.0 (data) and MIT License (code).

    Limitations:

    • Labels reflect annotator interpretations and may encode bias.
    • Models trained on English text; generalization to other languages requires adaptation.

    Funding Note:

    • Funding sources provided time in support of human taggers annotating the data sets.
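
    As a minimal sketch of reusing the tagged data for classifier validation (the file name and the "text"/"label" column names are assumptions; the real schema is defined in data_dictionary.csv):

      import pandas as pd
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.metrics import classification_report
      from sklearn.model_selection import train_test_split
      from sklearn.preprocessing import LabelEncoder
      from xgboost import XGBClassifier

      df = pd.read_csv("data/tagged/events.csv")      # hypothetical file under /data/tagged/
      X = TfidfVectorizer(max_features=5000).fit_transform(df["text"])  # assumed column
      y = LabelEncoder().fit_transform(df["label"])                     # assumed column

      X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
      clf = XGBClassifier(eval_metric="logloss").fit(X_tr, y_tr)
      print(classification_report(y_te, clf.predict(X_te)))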

  9. Models and Predictions for "The Proper Care and Feeding of CAMELS: How Limited Training Data Affects Streamflow Prediction"

    • data.niaid.nih.gov
    Updated Feb 6, 2020
    + more versions
    Cite
    Lin, Jimmy (2020). Models and Predictions for "The Proper Care and Feeding of CAMELS: How Limited Training Data Affects Streamflow Prediction" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3543548
    Dataset updated
    Feb 6, 2020
    Dataset provided by
    Gauch, Martin
    Mai, Juliane
    Lin, Jimmy
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Models and Predictions

    This dataset contains the trained XGBoost and EA-LSTM models and the models' predictions for the paper The Proper Care and Feeding of CAMELS: How Limited Training Data Affects Streamflow Prediction.

    For each combination of model (XGBoost, EA-LSTM), training years (3, 6, 9), number of basins (13, 26, 53, 265, 531), and seed (111-888), there are five folders. Each corresponds to a random basin sample (for 531 basins there is only one folder, since that is all basins). Each folder contains three files:

    model.pkl (XGBoost) or model_epoch30.pt (EA-LSTM), which stores the pickled trained model.

    xgboost_seedNNN.p or ealstm_seedNNN.p, which stores a pickled dictionary that maps each basin to the DataFrame of predicted and actual daily streamflow.

    attributes.db, which stores static catchment attributes needed for inference.

    In addition to each folder, there is a SLURM submission script (.sbatch) that was used to create and evaluate the model in the folder.
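
    Loading one folder's artifacts is then a matter of unpickling, roughly as sketched below (paths are relative to a single basin-sample folder; seed 111 is one of the seeds named above):

      import pickle

      with open("model.pkl", "rb") as fh:             # pickled trained XGBoost model
          model = pickle.load(fh)

      with open("xgboost_seed111.p", "rb") as fh:     # dict: basin -> DataFrame of
          predictions = pickle.load(fh)               # predicted and actual streamflow

      basin, df = next(iter(predictions.items()))
      print(basin, list(df.columns))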

  10. DataSheet_1_XGBoost Classifier Based on Computed Tomography Radiomics for Prediction of Tumor-Infiltrating CD8+ T-Cells in Patients With Pancreatic Ductal Adenocarcinoma.docx

    • datasetcatalog.nlm.nih.gov
    • frontiersin.figshare.com
    Updated May 19, 2021
    Cite
    Jiang, Hui; Wang, Li; Yu, Jieyu; Li, Jing; Lu, Jianping; Liu, Yanfang; Feng, Xiaochen; Li, Qi; Shi, Zhang; Bian, Yun; Cao, Kai; Liu, Fang; Fang, Xu; Shao, Chengwei; Meng, Yinghao; Zhang, Hao (2021). DataSheet_1_XGBoost Classifier Based on Computed Tomography Radiomics for Prediction of Tumor-Infiltrating CD8+ T-Cells in Patients With Pancreatic Ductal Adenocarcinoma.docx [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000824484
    Dataset updated
    May 19, 2021
    Authors
    Jiang, Hui; Wang, Li; Yu, Jieyu; Li, Jing; Lu, Jianping; Liu, Yanfang; Feng, Xiaochen; Li, Qi; Shi, Zhang; Bian, Yun; Cao, Kai; Liu, Fang; Fang, Xu; Shao, Chengwei; Meng, Yinghao; Zhang, Hao
    Description

    Objectives: This study constructed and validated a machine learning model to predict CD8+ tumor-infiltrating lymphocyte expression levels in patients with pancreatic ductal adenocarcinoma (PDAC) using computed tomography (CT) radiomic features.

    Materials and Methods: In this retrospective study, 184 PDAC patients were randomly assigned to a training dataset (n = 137) and validation dataset (n = 47). All patients were divided into CD8+ T-high and -low groups using X-tile plots. A total of 1409 radiomics features were extracted from the segmentation of regions of interest, based on preoperative CT images of each patient. The LASSO algorithm was applied to reduce the dimensionality of the data and select features. The extreme gradient boosting classifier (XGBoost) was developed using a training set consisting of 137 consecutive patients admitted between January 2017 and December 2017. The model was validated in 47 consecutive patients admitted between January 2018 and April 2018. The performance of the XGBoost classifier was determined by its discriminative ability, calibration, and clinical usefulness.

    Results: The cut-off value of the CD8+ T-cell level was 18.69%, as determined by the X-tile program. A Kaplan–Meier analysis indicated a correlation between higher CD8+ T-cell levels and better overall survival (p = 0.001). The XGBoost classifier showed good discrimination in the training set (area under curve [AUC], 0.75; 95% confidence interval [CI]: 0.67–0.83) and validation set (AUC, 0.67; 95% CI: 0.51–0.83). Moreover, it showed a good calibration. The sensitivity, specificity, accuracy, positive and negative predictive values were 80.65%, 60.00%, 0.69, 0.63, and 0.79, respectively, for the training set, and 80.95%, 57.69%, 0.68, 0.61, and 0.79, respectively, for the validation set.

    Conclusions: We developed a CT-based XGBoost classifier to extrapolate the infiltration levels of CD8+ T-cells in patients with PDAC. This method could be useful in identifying potential patients who can benefit from immunotherapies.
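
    A minimal sketch of the pipeline described above, LASSO feature selection feeding an XGBoost classifier, might look as follows; the arrays are synthetic stand-ins, not the study's radiomics features.

      import numpy as np
      from sklearn.feature_selection import SelectFromModel
      from sklearn.linear_model import LassoCV
      from sklearn.metrics import roc_auc_score
      from xgboost import XGBClassifier

      rng = np.random.default_rng(0)
      X = rng.normal(size=(184, 1409))        # 184 patients x 1409 radiomics features
      y = (X[:, 0] + X[:, 1] + rng.normal(size=184) > 0).astype(int)  # synthetic labels

      X_tr, y_tr = X[:137], y[:137]           # training set (n = 137)
      X_va, y_va = X[137:], y[137:]           # validation set (n = 47)

      selector = SelectFromModel(LassoCV(cv=5)).fit(X_tr, y_tr)        # LASSO selection
      clf = XGBClassifier(eval_metric="logloss").fit(selector.transform(X_tr), y_tr)
      auc = roc_auc_score(y_va, clf.predict_proba(selector.transform(X_va))[:, 1])
      print(f"validation AUC: {auc:.2f}")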

  11. Hyperparameters tuned for PLS-DA, random forest, and XGBoost for the train set of the MTBLS404 dataset.

    • plos.figshare.com
    xlsx
    Updated Jun 2, 2023
    + more versions
    Cite
    Olatomiwa O. Bifarin (2023). Hyperparameters tuned for PLS-DA, random forest, and XGBoost for the train set of the MTBLS404 dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0284315.s006
    Available download formats: xlsx
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Olatomiwa O. Bifarin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Hyperparameters tuned for PLS-DA, random forest, and XGBoost for the train set of the MTBLS404 dataset.

  12. Data from: Extreme Gradient Boosting as a Method for Quantitative Structure–Activity Relationships

    • figshare.com
    • acs.figshare.com
    zip
    Updated May 31, 2023
    Cite
    Robert P. Sheridan; Wei Min Wang; Andy Liaw; Junshui Ma; Eric M. Gifford (2023). Extreme Gradient Boosting as a Method for Quantitative Structure–Activity Relationships [Dataset]. http://doi.org/10.1021/acs.jcim.6b00591.s031
    Available download formats: zip
    Dataset updated
    May 31, 2023
    Dataset provided by
    ACS Publications
    Authors
    Robert P. Sheridan; Wei Min Wang; Andy Liaw; Junshui Ma; Eric M. Gifford
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    In the pharmaceutical industry it is common to generate many QSAR models from training sets containing a large number of molecules and a large number of descriptors. The best QSAR methods are those that can generate the most accurate predictions but that are not overly expensive computationally. In this paper we compare eXtreme Gradient Boosting (XGBoost) to random forest and single-task deep neural nets on 30 in-house data sets. While XGBoost has many adjustable parameters, we can define a set of standard parameters at which XGBoost makes predictions, on the average, better than those of random forest and almost as good as those of deep neural nets. The biggest strength of XGBoost is its speed. Whereas efficient use of random forest requires generating each tree in parallel on a cluster, and deep neural nets are usually run on GPUs, XGBoost can be run on a single CPU in less than a third of the wall-clock time of either of the other methods.
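
    In that spirit, here is a sketch of fitting an XGBoost regressor with one fixed parameter set on a descriptor matrix; these particular values are illustrative defaults, not the paper's published standard set.

      import numpy as np
      from xgboost import XGBRegressor

      rng = np.random.default_rng(1)
      X = rng.normal(size=(2000, 256))        # descriptor matrix (synthetic stand-in)
      y = X[:, :8].sum(axis=1) + rng.normal(scale=0.1, size=2000)  # activity (stand-in)

      model = XGBRegressor(
          n_estimators=1000,                  # many boosted trees
          max_depth=6,
          learning_rate=0.05,
          subsample=0.8,
          colsample_bytree=0.8,
          n_jobs=1,                           # the point above: competitive on one CPU
      )
      model.fit(X, y)
      print(model.predict(X[:5]))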

  13. Table_1_Predicting Adverse Radiation Effects in Brain Tumors After Stereotactic Radiotherapy With Deep Learning and Handcrafted Radiomics.docx

    • datasetcatalog.nlm.nih.gov
    • frontiersin.figshare.com
    Updated Jul 13, 2022
    + more versions
    Cite
    Andratschke, Nicolaus; Lambin, Philippe; Morin, Olivier; van Timmeren, Janita E.; Keek, Simon A.; Hendriks, Lizza E. L.; Woodruff, Henry C.; Primakov, Sergey; Chatterjee, Avishek; Vallières, Martin; Kraft, Johannes; Beuque, Manon; Braunstein, Steve E. (2022). Table_1_Predicting Adverse Radiation Effects in Brain Tumors After Stereotactic Radiotherapy With Deep Learning and Handcrafted Radiomics.docx [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000326928
    Dataset updated
    Jul 13, 2022
    Authors
    Andratschke, Nicolaus; Lambin, Philippe; Morin, Olivier; van Timmeren, Janita E.; Keek, Simon A.; Hendriks, Lizza E. L.; Woodruff, Henry C.; Primakov, Sergey; Chatterjee, Avishek; Vallières, Martin; Kraft, Johannes; Beuque, Manon; Braunstein, Steve E.
    Description

    Introduction: There is a cumulative risk of 20–40% of developing brain metastases (BM) in solid cancers. Stereotactic radiotherapy (SRT) enables the application of high focal doses of radiation to a volume and is often used for BM treatment. However, SRT can cause adverse radiation effects (ARE), such as radiation necrosis, which sometimes cause irreversible damage to the brain. It is therefore of clinical interest to identify patients at a high risk of developing ARE. We hypothesized that models trained with radiomics features, deep learning (DL) features, and patient characteristics or their combination can predict ARE risk in patients with BM before SRT.

    Methods: Gadolinium-enhanced T1-weighted MRIs and characteristics from patients treated with SRT for BM were collected for a training and testing cohort (N = 1,404) and a validation cohort (N = 237) from a separate institute. From each lesion in the training set, radiomics features were extracted and used to train an extreme gradient boosting (XGBoost) model. A DL model was trained on the same cohort to make a separate prediction and to extract the last layer of features. Different models using XGBoost were built using only radiomics features, DL features, and patient characteristics or a combination of them. Evaluation was performed using the area under the curve (AUC) of the receiver operating characteristic curve on the external dataset. Predictions for individual lesions and per patient developing ARE were investigated.

    Results: The best-performing XGBoost model on a lesion level was trained on a combination of radiomics features and DL features (AUC of 0.71 and recall of 0.80). On a patient level, a combination of radiomics features, DL features, and patient characteristics obtained the best performance (AUC of 0.72 and recall of 0.84). The DL model achieved an AUC of 0.64 and recall of 0.85 per lesion and an AUC of 0.70 and recall of 0.60 per patient.

    Conclusion: Machine learning models built on radiomics features and DL features extracted from BM combined with patient characteristics show potential to predict ARE at the patient and lesion levels. These models could be used in clinical decision making, informing patients on their risk of ARE and allowing physicians to opt for different therapies.
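
    The feature-combination step reduces to concatenating the handcrafted and DL feature matrices before boosting, roughly as sketched below; the arrays and their widths are synthetic stand-ins.

      import numpy as np
      from xgboost import XGBClassifier

      rng = np.random.default_rng(2)
      radiomics = rng.normal(size=(1404, 100))   # handcrafted features per lesion (stand-in)
      dl_feats = rng.normal(size=(1404, 64))     # last-layer DL features (stand-in)
      y = rng.integers(0, 2, size=1404)          # ARE vs. no ARE (synthetic labels)

      X = np.hstack([radiomics, dl_feats])       # joint feature matrix
      clf = XGBClassifier(eval_metric="logloss").fit(X, y)
      print(clf.predict_proba(X[:3])[:, 1])      # per-lesion ARE risk scores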

  14. CCS-RTM-GBDT-BO

    • zenodo.org
    bin, json
    Updated Jun 29, 2022
    Cite
    T. Højlund-Dodd (2022). CCS-RTM-GBDT-BO [Dataset]. http://doi.org/10.5281/zenodo.6774384
    Available download formats: json, bin
    Dataset updated
    Jun 29, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    T. Højlund-Dodd
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Carbon Capture Storage (CCS)-relevant Reactive Transport Modelling (RTM) of microfractures in basaltic rock, emulated using Gradient Boosted Decision Trees (GBDT) and subsequently optimised within a Bayesian Optimisation (BO) framework. The project's code is hosted on GitHub at https://github.com/ThomasDodd97/CCS-RTM-GBDT-BO. This Zenodo upload contains the dataset used to train four XGBoost GBDT surrogate models, whose model files are also uploaded here.
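
    The emulate-then-optimise idea can be sketched as follows: fit an XGBoost surrogate to simulator outputs, then search the cheap surrogate instead of the expensive RTM. A random search stands in for the project's BO framework here, and all data are synthetic.

      import numpy as np
      from xgboost import XGBRegressor

      rng = np.random.default_rng(3)
      X_sim = rng.uniform(size=(500, 4))           # RTM input parameters (stand-in)
      y_sim = np.sin(3 * X_sim).sum(axis=1)        # RTM output of interest (stand-in)

      surrogate = XGBRegressor(n_estimators=300).fit(X_sim, y_sim)  # the GBDT emulator

      candidates = rng.uniform(size=(100_000, 4))  # cheap to score on the surrogate
      best = candidates[np.argmax(surrogate.predict(candidates))]
      print("surrogate optimum at:", best)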

  15. Spatio-temporal reconstruction of annual glacier mass balance in Central Asia (1950–2020) using machine learning method

    • zenodo.org
    csv, zip
    Updated Dec 23, 2024
    + more versions
    Cite
    Yanfei Peng; Bolch Tobias; Yuan Qiangqiang; Baldacchino Francesca; Yang Qianqian (2024). Spatio-temporal reconstruction of annual glacier mass balance in Central Asia (1950–2020) using machine learning method [Dataset]. http://doi.org/10.5281/zenodo.14546263
    Available download formats: zip, csv
    Dataset updated
    Dec 23, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Yanfei Peng; Bolch Tobias; Yuan Qiangqiang; Baldacchino Francesca; Yang Qianqian
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Central Asia
    Description

    This dataset reconstructs the annual mass balance of glaciers larger than 0.1 km² in the Tien Shan and Pamir regions from 1950 to 2022. The dataset is derived using a nonlinear relationship between glacier mass balance and meteorological and topographical variables. The reconstruction method employs the XGBoost algorithm. Initially, XGBoost is trained on the complete training dataset, followed by incremental training for each sub-region to tailor models to specific regional characteristics. The final training results yield an average coefficient of determination (R²) of 0.87.
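
    The two-step scheme, a global fit followed by per-region continuation, maps onto XGBoost's training-continuation hook roughly as sketched below; the arrays are synthetic stand-ins for the meteorological and topographical variables.

      import numpy as np
      from xgboost import XGBRegressor

      rng = np.random.default_rng(4)
      X_all, y_all = rng.normal(size=(5000, 12)), rng.normal(size=5000)  # all glaciers
      X_sub, y_sub = rng.normal(size=(400, 12)), rng.normal(size=400)    # one sub-region

      global_model = XGBRegressor(n_estimators=500).fit(X_all, y_all)    # step 1: global fit

      regional_model = XGBRegressor(n_estimators=200)                    # step 2: extra trees
      regional_model.fit(X_sub, y_sub, xgb_model=global_model.get_booster())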

    All code used in this dataset is publicly available and organized into the following five sections:

    1. Data Processing

      • Code for extracting monthly meteorological variables.
      • Combines meteorological and topographical variables for each glacier.
    2. Model Training

      • Implements the two-step training process for all ensemble learning methods tested in this study.
    3. Result Analysis

      • Pie charts of mass balance distribution for clustered glaciers.
      • Line graphs of annual mass balance for each sub-region.
    4. Result Evaluation

      • Extracts glacier mass balance data from previous studies.
      • Compares these data with the results of this study.
    5. SHAP Analysis

      • Provides scripts to generate SHAP (SHapley Additive exPlanations) value-related figures, highlighting the contribution of different variables to model predictions.

  16. Machine Learning Assisted Synthesis of Metal–Organic Nanocapsules

    • acs.figshare.com
    • datasetcatalog.nlm.nih.gov
    txt
    Updated Jun 1, 2023
    Cite
    Yunchao Xie; Chen Zhang; Xiangquan Hu; Chi Zhang; Steven P. Kelley; Jerry L. Atwood; Jian Lin (2023). Machine Learning Assisted Synthesis of Metal–Organic Nanocapsules [Dataset]. http://doi.org/10.1021/jacs.9b11569.s001
    Available download formats: txt
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    ACS Publications
    Authors
    Yunchao Xie; Chen Zhang; Xiangquan Hu; Chi Zhang; Steven P. Kelley; Jerry L. Atwood; Jian Lin
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Herein, we report machine learning algorithms trained on data sets from both successful and failed experiments to study the crystallization propensity of metal–organic nanocapsules (MONCs). Among the machine learning algorithms studied, XGBoost affords the highest prediction accuracy, >90%. The chemical feature scores derived from the XGBoost model, which rank the importance of the reaction parameters, helped identify synthesis parameters for successfully synthesizing new hierarchical structures of MONCs, showing performance superior to that of a well-trained chemist. This work demonstrates that machine learning algorithms can help chemists search faster for optimal reaction parameters among many experimental variables whose effects are usually hidden in a high-dimensional space.
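
    Feature scores of that kind can be read off a fitted model, as in the sketch below; the parameter names and data are stand-ins, not the study's descriptors.

      import numpy as np
      import pandas as pd
      from xgboost import XGBClassifier

      rng = np.random.default_rng(5)
      params = ["metal_salt", "solvent", "temperature", "time", "concentration"]  # stand-ins
      X = pd.DataFrame(rng.normal(size=(200, len(params))), columns=params)
      y = rng.integers(0, 2, size=200)            # crystallised vs. not (synthetic)

      clf = XGBClassifier(eval_metric="logloss").fit(X, y)
      scores = pd.Series(clf.feature_importances_, index=params)
      print(scores.sort_values(ascending=False))  # which reaction parameters matter most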

  17. Fastest training times for CNN and XGBoost on CPU and GPU (all features).

    • plos.figshare.com
    xls
    Updated May 10, 2024
    Cite
    Vineeth Gutta; Satish Ranganathan Ganakammal; Sara Jones; Matthew Beyers; Sunita Chandrasekaran (2024). Fastest training times for CNN and XGBoost on CPU and GPU (all features). [Dataset]. http://doi.org/10.1371/journal.pcbi.1011504.t007
    Available download formats: xls
    Dataset updated
    May 10, 2024
    Dataset provided by
    PLOS Computational Biology
    Authors
    Vineeth Gutta; Satish Ranganathan Ganakammal; Sara Jones; Matthew Beyers; Sunita Chandrasekaran
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Fastest training times for CNN and XGBoost on CPU and GPU (all features).

  18. Raw data.

    • figshare.com
    bin
    Updated Aug 11, 2023
    + more versions
    Cite
    Yuan Liu; Wenyi Du; Yi Guo; Zhiqiang Tian; Wei Shen (2023). Raw data. [Dataset]. http://doi.org/10.1371/journal.pone.0289621.s002
    Available download formats: bin
    Dataset updated
    Aug 11, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Yuan Liu; Wenyi Du; Yi Guo; Zhiqiang Tian; Wei Shen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Colon cancer recurrence is a common adverse outcome for patients after complete mesocolic excision (CME) and greatly affects the near-term and long-term prognosis of patients. This study aimed to develop a machine learning model that can identify high-risk factors before, during, and after surgery, and predict the occurrence of postoperative colon cancer recurrence.

    Methods: The study included 1187 patients with colon cancer, including 110 patients who had recurrent colon cancer. The researchers collected 44 characteristic variables, including patient demographic characteristics, basic medical history, preoperative examination information, type of surgery, and intraoperative information. Four machine learning algorithms, namely extreme gradient boosting (XGBoost), random forest (RF), support vector machine (SVM), and the k-nearest neighbor algorithm (KNN), were used to construct the model. The researchers evaluated the model using the k-fold cross-validation method, ROC curve, calibration curve, decision curve analysis (DCA), and external validation.

    Results: Among the four prediction models, the XGBoost algorithm performed the best. The ROC curve results showed that the AUC value of XGBoost was 0.962 in the training set and 0.952 in the validation set, indicating high prediction accuracy. The XGBoost model was stable during internal validation using the k-fold cross-validation method. The calibration curve demonstrated high predictive ability of the XGBoost model. The DCA curve showed that patients who received interventional treatment had a higher benefit rate under the XGBoost model. The external validation set's AUC value was 0.91, indicating good extrapolation of the XGBoost prediction model.

    Conclusion: The XGBoost machine learning algorithm-based prediction model for colon cancer recurrence has high prediction accuracy and clinical utility.
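
    The internal-validation step amounts to k-fold cross-validated AUC for the boosted model, roughly as below; the arrays are synthetic stand-ins for the 44 clinical variables.

      import numpy as np
      from sklearn.model_selection import StratifiedKFold, cross_val_score
      from xgboost import XGBClassifier

      rng = np.random.default_rng(6)
      X = rng.normal(size=(1187, 44))                     # 1187 patients, 44 variables
      y = (rng.random(1187) < 110 / 1187).astype(int)     # ~110 recurrences (synthetic)

      cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
      aucs = cross_val_score(XGBClassifier(eval_metric="logloss"), X, y,
                             cv=cv, scoring="roc_auc")
      print(f"mean AUC: {aucs.mean():.3f} +/- {aucs.std():.3f}")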

  19. S1 Code

    • plos.figshare.com
    bin
    Updated Jan 16, 2025
    + more versions
    Cite
    Gaosha Li; Qian Yu; Feng Dong; Zhaoxia Wu; Xijing Fan; Lingling Zhang; Ying Yu (2025). S1 Code - [Dataset]. http://doi.org/10.1371/journal.pone.0315406.s001
    Available download formats: bin
    Dataset updated
    Jan 16, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Gaosha Li; Qian Yu; Feng Dong; Zhaoxia Wu; Xijing Fan; Lingling Zhang; Ying Yu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Objective: Non-puerperal mastitis (NPM) is an inflammatory breast disease affecting women during non-lactation periods, and it is prone to relapse after being cured. Accurate prediction of its recurrence is crucial for personalized adjuvant therapy, and pathological examination is the primary basis for the classification, diagnosis, and confirmation of non-puerperal mastitis. Currently, there is a lack of recurrence models for non-puerperal mastitis. The aim of this research is to create and validate a recurrence model using machine learning for patients with non-puerperal mastitis.

    Methods: We retrospectively collected laboratory data from 120 NPM patients, dividing them into a non-recurrence group (n = 59) and a recurrence group (n = 61). Through random allocation, these individuals were split into a training cohort and a testing cohort in a 90%:10% ratio for the purpose of building the model. Additionally, data from 25 NPM patients from another center were collected to serve as an external validation cohort for the model. Univariate analysis was used to examine differential indicators, and variable selection was conducted through LASSO regression. A combination of four machine learning algorithms (XGBoost, Logistic Regression, Random Forest, AdaBoost) was employed to predict NPM recurrence, and the model with the highest Area Under the Curve (AUC) in the test set was selected as the best model. The finally selected model was interpreted and evaluated using Receiver Operating Characteristic (ROC) curves, calibration curves, decision curve analysis (DCA), and Shapley Additive Explanations (SHAP) plots.

    Results: The logistic regression model emerged as the optimal model for predicting recurrence of NPM with machine learning, primarily utilizing three variables: FIB, bacterial infection, and CD4+ T cell count. The model showed an AUC of 0.846 in the training cohort and 0.833 in the testing cohort. The calibration curve indicated excellent calibration of the model. DCA revealed that the model possessed favorable clinical utility. Furthermore, the model performed effectively in the external validation group, with an AUC of 0.825.

    Conclusion: The machine learning model developed in this study, serving as an effective tool for predicting NPM recurrence, aids doctors in making more individualized treatment decisions, thereby enhancing therapeutic efficacy and reducing the risk of recurrence.
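
    The winning pipeline, LASSO variable selection feeding logistic regression on a 90%:10% split, can be sketched as below; the data are synthetic stand-ins for the laboratory indicators.

      import numpy as np
      from sklearn.feature_selection import SelectFromModel
      from sklearn.linear_model import LassoCV, LogisticRegression
      from sklearn.metrics import roc_auc_score
      from sklearn.model_selection import train_test_split

      rng = np.random.default_rng(7)
      X = rng.normal(size=(120, 20))           # 120 NPM patients, 20 lab indicators
      y = (X[:, 0] - X[:, 1] + rng.normal(size=120) > 0).astype(int)  # synthetic recurrence

      X_tr, X_te, y_tr, y_te = train_test_split(
          X, y, test_size=0.10, stratify=y, random_state=0)           # 90%:10% split
      selector = SelectFromModel(LassoCV(cv=5)).fit(X_tr, y_tr)       # LASSO selection
      clf = LogisticRegression().fit(selector.transform(X_tr), y_tr)
      auc = roc_auc_score(y_te, clf.predict_proba(selector.transform(X_te))[:, 1])
      print(f"test AUC: {auc:.2f}")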

  20. Evaluation of parameters for the ARIMA model of different training and test sets for COVID-19 confirmed cases.

    • figshare.com
    xls
    Updated Jun 10, 2023
    + more versions
    Cite
    Md. Siddikur Rahman; Arman Hossain Chowdhury; Miftahuzzannat Amrin (2023). Evaluation of parameters for the ARIMA model of different training and test sets for COVID-19 confirmed cases. [Dataset]. http://doi.org/10.1371/journal.pgph.0000495.t002
    Available download formats: xls
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    PLOS Global Public Health
    Authors
    Md. Siddikur Rahman; Arman Hossain Chowdhury; Miftahuzzannat Amrin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Evaluation of parameters for the ARIMA model of different training and test sets for COVID-19 confirmed cases.
