100+ datasets found
  1. jars xgboost example files

    • kaggle.com
    zip
    Updated Apr 16, 2022
    Cite
    Edwin Hauwert M.Sc. (2022). jars xgboost example files [Dataset]. https://www.kaggle.com/develuse/jars-xgboost-old
    Explore at:
    zip (615742866 bytes). Available download formats
    Dataset updated
    Apr 16, 2022
    Authors
    Edwin Hauwert M.Sc.
    Description

    Dataset

    This dataset was created by Edwin Hauwert M.Sc.


  2. Hyperparameters for the XGBoost model.

    • plos.figshare.com
    xls
    Updated Nov 27, 2024
    Cite
    Hoa Thi Trinh; Tuan Anh Pham; Vu Dinh Tho; Duy Hung Nguyen (2024). Hyperparameters for the XGBoost model. [Dataset]. http://doi.org/10.1371/journal.pone.0312531.t002
    Explore at:
    xls. Available download formats
    Dataset updated
    Nov 27, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Hoa Thi Trinh; Tuan Anh Pham; Vu Dinh Tho; Duy Hung Nguyen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Structurally, the lateral load-bearing capacity mainly depends on reinforced concrete (RC) walls. Determining flexural strength and shear strength is mandatory when designing RC walls. Typically, these strengths are determined through theoretical formulas and verified experimentally. However, theoretical formulas often have large errors, and testing is costly and time-consuming. Therefore, this study exploits machine learning techniques, specifically a hybrid XGBoost model combined with optimization algorithms, to predict the shear strength of RC walls, trained on available experimental results. The study used the largest database of RC walls to date, consisting of 1057 samples with various cross-sectional shapes. Bayesian optimization (BO) algorithms, including BO-Gaussian Process and BO-Random Forest, as well as the Random Search method, were used to refine the XGBoost model architecture. The results show that the Gaussian Process emerged as the most efficient solution among the optimization algorithms, providing the lowest Mean Square Error and achieving prediction R2 values of 0.998 for the training set, 0.972 for the validation set, and 0.984 for the test set. BO-Random Forest and Random Search performed as well as the Gaussian Process on the training and test sets but significantly worse on the validation set, with validation R2 values of 0.970 and 0.969, respectively, over the entire dataset including all cross-sectional shapes of the RC walls. The SHAP (Shapley Additive Explanations) technique was used to clarify the predictive ability of the model and the importance of the input variables. Furthermore, the performance of the model was validated through comparative analysis with benchmark models and current standards. Notably, the coefficient of variation (COV %) of the XGBoost model is 13.27%, while traditional models often have COV % exceeding 50%.
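As a deliberately simplified illustration of the hyperparameter-tuning loop this abstract describes, the sketch below runs a random search over gradient-boosting parameters on mock data; sklearn's GradientBoostingRegressor stands in for XGBoost, and the features, target, and search ranges are all hypothetical (the paper itself uses Bayesian optimization):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))                             # mock wall features
y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=300)   # mock shear strength

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

best = None
for _ in range(10):  # random search; BO would propose params more cleverly
    params = {"n_estimators": int(rng.integers(50, 300)),
              "max_depth": int(rng.integers(2, 6)),
              "learning_rate": float(rng.uniform(0.02, 0.3))}
    model = GradientBoostingRegressor(random_state=0, **params).fit(X_tr, y_tr)
    mse = mean_squared_error(y_te, model.predict(X_te))
    if best is None or mse < best[0]:
        best = (mse, params)        # keep the lowest-MSE configuration
print(best)
```

A Bayesian optimizer would replace the random draws with a surrogate model that proposes promising parameter sets, but the evaluate-and-keep-the-best loop is the same.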

  3. Data from: Representative sample size for estimating saturated hydraulic conductivity via machine learning

    • beta.hydroshare.org
    • hydroshare.org
    • +1more
    zip
    Updated May 21, 2024
    Cite
    Amin Ahmadisharaf; Reza Nematirad; Sadra Sabouri; Yakov Pachepsky; Behzad Ghanbarian (2024). Representative sample size for estimating saturated hydraulic conductivity via machine learning [Dataset]. https://beta.hydroshare.org/resource/4c33179a77834634969bb9787c41e71a/
    Explore at:
    zip (5.9 MB). Available download formats
    Dataset updated
    May 21, 2024
    Dataset provided by
    HydroShare
    Authors
    Amin Ahmadisharaf; Reza Nematirad; Sadra Sabouri; Yakov Pachepsky; Behzad Ghanbarian
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This database includes saturated hydraulic conductivity data from the USKSAT database as well as the associated Python code used to analyze learning curves and to train and test the developed machine learning models.

  4. Table_7_Five-Feature Model for Developing the Classifier for Synergistic vs. Antagonistic Drug Combinations Built by XGBoost.XLSX

    • datasetcatalog.nlm.nih.gov
    • frontiersin.figshare.com
    Updated Jul 9, 2019
    + more versions
    Cite
    Shi, Tieliu; Ji, Xiangjun; Liu, Zhichao; Tong, Weida (2019). Table_7_Five-Feature Model for Developing the Classifier for Synergistic vs. Antagonistic Drug Combinations Built by XGBoost.XLSX [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000134259
    Explore at:
    Dataset updated
    Jul 9, 2019
    Authors
    Shi, Tieliu; Ji, Xiangjun; Liu, Zhichao; Tong, Weida
    Description

    Combinatorial drug therapy can improve the therapeutic effect and reduce the corresponding adverse events. In silico strategies to classify synergistic vs. antagonistic drug pairs are more efficient than experimental strategies. However, most of the developed methods have been applied only to cancer therapies. In this study, we introduce a novel method based on XGBoost, using five features of drugs and the biomolecular networks of their targets, to classify synergistic vs. antagonistic drug combinations from different drug categories. We found that XGBoost outperformed other classifiers in both stratified fivefold cross-validation (CV) and independent validation. For example, XGBoost achieved higher predictive accuracy than other models (0.86, 0.78, 0.78, and 0.83 for XGBoost, logistic regression, naïve Bayesian, and random forest, respectively) for an independent validation set. We also found that the five-feature XGBoost model is much more effective at predicting combinatorial therapies that have synergistic effects than those with antagonistic effects. The five-feature XGBoost model was also validated on TCGA data with an accuracy of 0.79 among the 61 tested drug pairs, which is comparable to that of DeepSynergy. Among the 14 main anatomical/pharmacological groups classified according to the WHO Anatomic Therapeutic Class, for drugs belonging to five groups, their prediction accuracy was significantly increased (odds ratio < 1) or reduced (odds ratio > 1) (Fisher's exact test, p < 0.05). This study concludes that our five-feature XGBoost model has significant benefits for classifying synergistic vs. antagonistic drug combinations.

  5. Data from: A Deep Learning and XGBoost-based Method for Predicting Protein-protein Interaction Sites

    • narcis.nl
    • data.mendeley.com
    Updated Aug 3, 2021
    + more versions
    Cite
    wang, P (via Mendeley Data) (2021). A Deep Learning and XGBoost-based Method for Predicting Protein-protein Interaction Sites [Dataset]. http://doi.org/10.17632/9tft3vz5tm.2
    Explore at:
    Dataset updated
    Aug 3, 2021
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    wang, P (via Mendeley Data)
    Description

    local_feature_training_set.csv: preprocessed feature-extractor data with 65869 rows and 344 columns; each row is a sample, the first 343 columns are features, and the last column is the label.

    local_feature_testing_set.csv: preprocessed feature-extractor data with 11791 rows and 344 columns; each row is a sample, the first 343 columns are features, and the last column is the label.

    global&local_feature_training_set.csv: preprocessed feature-extractor data with 65869 rows and 1028 columns; each row is a sample, the first 1027 columns are features, and the last column is the label.

    global&local_feature_testing_set.csv: preprocessed feature-extractor data with 11791 rows and 1028 columns; each row is a sample, the first 1027 columns are features, and the last column is the label.
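A minimal sketch of how a file with this layout could be split into features and label, assuming only what the description states (last column is the label); the dataframe below is a tiny mock, not the real 65869-row file:

```python
import numpy as np
import pandas as pd

# Mock stand-in for local_feature_training_set.csv (real file: 65869 x 344).
df = pd.DataFrame(np.random.rand(5, 4), columns=["f1", "f2", "f3", "label"])

X = df.iloc[:, :-1].to_numpy()   # all but the last column: features
y = df.iloc[:, -1].to_numpy()    # last column: label
print(X.shape, y.shape)
```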

  6. MetaCost XGBoost Training and Evaluation Dataset with MATBLAB Codes and files for generating proxies

    • zenodo.org
    csv, zip
    Updated Jun 18, 2025
    Cite
    Abhishek Borah; Xavier Emery; Xavier Emery; Parag Jyoti Dutta; Parag Jyoti Dutta; Abhishek Borah (2025). MetaCost XGBoost Training and Evaluation Dataset with MATBLAB Codes and files for generating proxies [Dataset]. http://doi.org/10.5281/zenodo.15666484
    Explore at:
    csv, zip. Available download formats
    Dataset updated
    Jun 18, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Abhishek Borah; Xavier Emery; Xavier Emery; Parag Jyoti Dutta; Parag Jyoti Dutta; Abhishek Borah
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jun 14, 2025
    Description

    The dataset consists of two curated subsets designed for the classification of alteration types using geochemical and proxy variables. The traditional dataset (Trad_Train.csv and Trad_Test.csv) is derived directly from the original complete geochemical dataset (alldata.csv) without any missing values and includes the original geochemical features, serving as a baseline for model training and evaluation. In contrast, the simulated dataset (proxies_alldata.csv) was generated through custom MATLAB scripts that transform the original geochemical features into proxy variables based on multiple geostatistical realizations. These proxies, expressed on a Gaussian scale, may include negative values due to normalization. The target variable, Alteration, was originally encoded as integers using the mapping: 1 = AAA, 2 = IAA, 3 = PHY, 4 = PRO, 5 = PTS, and 6 = UAL. The simulated proxy data was split into the simulated train and test files (Simu_Train.csv and Simu_Test.csv) based on an encoded flag (1 = training, 2 = testing). All supporting files (datasets, intermediate outputs such as PNGs and variograms, proxy outputs, and an executable for the confidence analysis routines) are included in the repository except the source code, which is hosted in a GitHub repository. Specifically, the FinalMatlabFiles.zip archive contains the raw input file alldata.csv used to generate proxies_alldata.csv; it also contains Analysis1.csv and Analysis2.csv for performing confidence analysis. To run the executable files in place of the .m scripts in MATLAB, users must install the MATLAB Runtime 2023b for Windows 64-bit, available at: https://ssd.mathworks.com/supportfiles/downloads/R2023b/Release/10/deployment_files/installer/complete/win64/MATLAB_Runtime_R2023b_Update_10_win64.zip.

    Details on the input files for confidence analysis: Analysis1.csv and Analysis2.csv
    These files contain two columns for the test data: column 1 indicates whether the predicted and true alterations match, and column 2 gives the probability of a correct classification, according to bootstrapped samples (Analysis1.csv) or to simulated proxies (Analysis2.csv).
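Given that two-column layout, a confidence analysis can compare observed accuracy against mean predicted confidence; the rows below are mock stand-ins for Analysis1.csv, and the column names are illustrative:

```python
import pandas as pd

# Mock rows: column 1 = 0/1 match flag, column 2 = predicted P(correct).
df = pd.DataFrame({"match": [1, 0, 1, 1],
                   "p_correct": [0.9, 0.4, 0.8, 0.7]})

accuracy = df["match"].mean()      # observed fraction of correct predictions
expected = df["p_correct"].mean()  # mean predicted confidence
print(accuracy, expected)
```

If the two numbers diverge strongly, the classifier's confidence estimates are miscalibrated.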
  7. Fraud Detection Transactions Dataset

    • kaggle.com
    zip
    Updated Feb 21, 2025
    Cite
    Samay Ashar (2025). Fraud Detection Transactions Dataset [Dataset]. https://www.kaggle.com/datasets/samayashar/fraud-detection-transactions-dataset
    Explore at:
    zip (2104444 bytes). Available download formats
    Dataset updated
    Feb 21, 2025
    Authors
    Samay Ashar
    License

    CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is designed to help data scientists and machine learning enthusiasts develop robust fraud detection models. It contains realistic synthetic transaction data, including user information, transaction types, risk scores, and more, making it ideal for binary classification tasks with models like XGBoost and LightGBM.

    📌 Key Features

    1. 21 features capturing various aspects of a financial transaction
    2. Realistic structure with numerical, categorical, and temporal data
    3. Binary fraud labels (0 = Not Fraud, 1 = Fraud)
    4. Designed for high accuracy with XGBoost and other ML models
    5. Useful for anomaly detection, risk analysis, and security research

    📌 Columns in the Dataset

    Transaction_ID: Unique identifier for each transaction
    User_ID: Unique identifier for the user
    Transaction_Amount: Amount of money involved in the transaction
    Transaction_Type: Type of transaction (Online, In-Store, ATM, etc.)
    Timestamp: Date and time of the transaction
    Account_Balance: User's current account balance before the transaction
    Device_Type: Type of device used (Mobile, Desktop, etc.)
    Location: Geographical location of the transaction
    Merchant_Category: Type of merchant (Retail, Food, Travel, etc.)
    IP_Address_Flag: Whether the IP address was flagged as suspicious (0 or 1)
    Previous_Fraudulent_Activity: Number of past fraudulent activities by the user
    Daily_Transaction_Count: Number of transactions made by the user that day
    Avg_Transaction_Amount_7d: User's average transaction amount in the past 7 days
    Failed_Transaction_Count_7d: Count of failed transactions in the past 7 days
    Card_Type: Type of payment card used (Credit, Debit, Prepaid, etc.)
    Card_Age: Age of the card in months
    Transaction_Distance: Distance between the user's usual location and the transaction location
    Authentication_Method: How the user authenticated (PIN, Biometric, etc.)
    Risk_Score: Fraud risk score computed for the transaction
    Is_Weekend: Whether the transaction occurred on a weekend (0 or 1)
    Fraud_Label: Target variable (0 = Not Fraud, 1 = Fraud)

    📌 Potential Use Cases

    1. Fraud detection model training
    2. Anomaly detection in financial transactions
    3. Risk scoring systems for banks and fintech companies
    4. Feature engineering and model explainability research
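The binary-classification task described above can be sketched on a few of the listed columns; the data here are mock draws (not the actual dataset), and sklearn's GradientBoostingClassifier stands in for XGBoost:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "Transaction_Amount": rng.uniform(1, 500, 200),
    "Risk_Score": rng.uniform(0, 1, 200),
    "Is_Weekend": rng.integers(0, 2, 200),
})
# Mock target: in the real dataset Fraud_Label is given, not derived.
df["Fraud_Label"] = (df["Risk_Score"] > 0.8).astype(int)

X, y = df.drop(columns="Fraud_Label"), df["Fraud_Label"]
clf = GradientBoostingClassifier(random_state=0).fit(X, y)
print(clf.score(X, y))   # training accuracy on the mock data
```

A real workflow would hold out a test split and encode the categorical columns (Transaction_Type, Card_Type, etc.) before training.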
  8. Tox24 challenge data

    • kaggle.com
    zip
    Updated Sep 18, 2024
    Cite
    Antonina Dolgorukova (2024). Tox24 challenge data [Dataset]. https://www.kaggle.com/datasets/antoninadolgorukova/tox24-challenge-data/suggestions
    Explore at:
    zip (19160575 bytes). Available download formats
    Dataset updated
    Sep 18, 2024
    Authors
    Antonina Dolgorukova
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset and the associated notebooks were created to solve the Tox24 Challenge and to provide real-world data and a working example of how machine learning can be used to predict binding activity to a target protein such as Transthyretin (TTR), from retrieving and preprocessing SMILES to ensembling the obtained predictions.

    SMILES: The file all_smiles_data.csv contains various SMILES strings for the 1512 competition chemicals (retrieved from PubChem, cleaned, and with isolated-atom SMILES removed), generated in this notebook. Here I also evaluated the performance of XGBoost using molecular descriptors computed from different SMILES representations of the chemicals.

    FEATURES: The 'features' folder contains features calculated with OCHEM and those computed in the feature engineering notebook. All feature sets were evaluated with XGBoost here.

    Feature selection notebooks: - https://www.kaggle.com/code/antoninadolgorukova/tox24-feature-selection-for-xgboost - https://www.kaggle.com/code/antoninadolgorukova/tox24-feature-selection-by-clusters-for-xgboost - https://www.kaggle.com/code/antoninadolgorukova/tox24-feature-selection-by-clusters-for-lightgbm

    MODELS: The 'submits' folder contains the predictions for the 500 test chemicals of the Tox24 Challenge, made with the XGBoost and LightGBM models and used for the final ensemble.

    DATA: The TTR Supplemental Tables are taken from the article that accompanied the Tox24 Challenge, and include:

    • Tables outlining the components of the assay reactions and lists of autofluorescent chemicals,
    • chemicals excluded from the analysis due to interference,
    • and chemicals screened in single concentration and concentration response testing.

    This dataset can be used for drug design research, protein-ligand interaction studies, and machine learning model development focused on chemical binding activities.
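The final ensembling step mentioned above reduces, in its simplest form, to averaging the per-chemical predictions of the individual models; the prediction vectors below are mock values, not the actual submissions:

```python
import numpy as np

pred_xgb = np.array([0.2, 0.8, 0.5])    # mock XGBoost test predictions
pred_lgbm = np.array([0.4, 0.6, 0.7])   # mock LightGBM test predictions

ensemble = (pred_xgb + pred_lgbm) / 2   # simple mean ensemble
print(ensemble)
```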

  9. Data from: Robust Data-driven Metallicities for 175 Million Stars from Gaia XP Spectra

    • zenodo.org
    application/gzip
    Updated May 17, 2023
    Cite
    René Andrae; René Andrae; Hans-Walter Rix; Hans-Walter Rix; Vedant Chandra; Vedant Chandra (2023). Robust Data-driven Metallicities for 175 Million Stars from Gaia XP Spectra [Dataset]. http://doi.org/10.5281/zenodo.7925612
    Explore at:
    application/gzip. Available download formats
    Dataset updated
    May 17, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    René Andrae; René Andrae; Hans-Walter Rix; Hans-Walter Rix; Vedant Chandra; Vedant Chandra
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the dataset accompanying Andrae et al. (2023), "Robust Data-driven Metallicities for 175 Million Stars from Gaia XP Spectra".

    Table 1 (174,922,161 rows) contains XGBoost parameters (temperature, surface gravity, and metallicity) for all stars in the sample.

    Table 2 (17,558,141 rows) contains Gaia DR3 parameters and XGBoost parameters for a vetted sample of RGB stars with reliable measurements.

    The tables are provided in compressed CSV format, and the full data model is described in the accompanying paper.
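Compressed CSV tables like these can be read by pandas without manual decompression; the bytes, file layout, and column names below are illustrative mocks, not the dataset's actual schema:

```python
import gzip
import io
import pandas as pd

# Mock gzip-compressed CSV standing in for one of the dataset's tables.
raw = b"source_id,teff_xgboost,logg_xgboost,mh_xgboost\n1,5700,4.4,-0.1\n"
buf = io.BytesIO(gzip.compress(raw))

df = pd.read_csv(buf, compression="gzip")   # for a file path, pandas infers this
print(df.shape)
```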

  10. Compare errors made by different models

    • kaggle.com
    zip
    Updated Jan 16, 2020
    Cite
    Afshan Nabi (2020). Compare errors made by different models [Dataset]. https://www.kaggle.com/afshannabi/compare-errors-made-by-different-models
    Explore at:
    zip (5583 bytes). Available download formats
    Dataset updated
    Jan 16, 2020
    Authors
    Afshan Nabi
    Description

    Context

    I had 3 different models trained on the same data. Gauging model accuracy was easy, but I was curious about whether all models mis-classify the same samples or not. So, I generated the predictions made by each model and tried to visualize them in a useful manner.

    Content

    The dataframe contains the True label for each sample as well as the predictions made by 3 different models: SVM, XGBoost and MLP.
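With a dataframe of that shape, the "do the models miss the same samples?" question reduces to comparing each prediction column against the true label; the rows below are mock values matching the described columns:

```python
import pandas as pd

df = pd.DataFrame({
    "True": [0, 1, 1, 0, 1],
    "SVM": [0, 1, 0, 0, 1],
    "XGBoost": [0, 1, 1, 1, 1],
    "MLP": [0, 0, 0, 0, 1],
})

# True where a model's prediction disagrees with the true label.
errors = df[["SVM", "XGBoost", "MLP"]].ne(df["True"], axis=0)
print(errors.sum())               # error count per model
print(errors.all(axis=1).sum())   # samples that every model gets wrong
```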

  11. TPS-Mar-2025-Rain-Prediction-Data

    • kaggle.com
    Updated Apr 14, 2025
    Cite
    Eren Ata (2025). TPS-Mar-2025-Rain-Prediction-Data [Dataset]. https://www.kaggle.com/datasets/erenata/tps-mar-2025-rain-prediction-data
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Apr 14, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Eren Ata
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Rain Prediction Model - Kaggle Competition

    Project overview: This project is a machine learning solution for the Tabular Playground Series - March 2025 Kaggle competition, focusing on rain prediction using various weather-related features.

    Features:

    • Advanced feature engineering with weather interactions and rolling statistics
    • Ensemble learning with XGBoost, LightGBM, and Logistic Regression
    • Hyperparameter optimization using Optuna
    • Cross-validation with GroupKFold
    • Feature importance analysis and visualization

    Model performance:

    • XGBoost CV score: 0.8957 ± 0.0192 AUC
    • Hyperparameters optimized through 20 trials
    • Feature importance visualization available in 'feature_importance.png'
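The GroupKFold cross-validation listed above keeps all rows of a group in the same fold; the sketch below shows the mechanic on mock data, with a logistic-regression stand-in and hypothetical group ids rather than the competition pipeline:

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = (X[:, 0] > 0).astype(int)
groups = np.repeat(np.arange(12), 10)   # e.g. 12 hypothetical station/year groups

scores = []
for tr, te in GroupKFold(n_splits=4).split(X, y, groups):
    model = LogisticRegression().fit(X[tr], y[tr])
    scores.append(model.score(X[te], y[te]))
print(np.mean(scores))
```

Because whole groups are held out, the CV score is not inflated by near-duplicate rows leaking across the train/test boundary.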

  12. Large-scale proteomics in the first trimester of pregnancy predict psychopathology and temperament in preschool children

    • immport.org
    • data.niaid.nih.gov
    url
    Cite
    Large-scale proteomics in the first trimester of pregnancy predict psychopathology and temperament in preschool children [Dataset]. http://doi.org/10.21430/m34sax8sjb
    Explore at:
    url. Available download formats
    License

    https://www.immport.org/agreement

    Description

    This study investigates how prenatal inflammation might affect childhood psychopathology by analyzing over 1,000 proteins in first-trimester blood samples using an XGBoost machine learning model. The results show that these proteins predict 5-10% of the variance in early childhood behaviors such as sadness and attention issues, highlighting immune and nervous system development as key factors. The findings suggest that a broader range of proteins than previously considered could influence future mental health outcomes in children.

  13. A machine learning based prediction model for life expectancy

    • data.niaid.nih.gov
    • datasetcatalog.nlm.nih.gov
    • +1more
    zip
    Updated Nov 14, 2022
    Cite
    Evans Omondi; Brian Lipesa; Elphas Okango; Bernard Omolo (2022). A machine learning based prediction model for life expectancy [Dataset]. http://doi.org/10.5061/dryad.z612jm6fv
    Explore at:
    zip. Available download formats
    Dataset updated
    Nov 14, 2022
    Dataset provided by
    University of South Carolina Upstate
    Strathmore University
    Authors
    Evans Omondi; Brian Lipesa; Elphas Okango; Bernard Omolo
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    The social and financial systems of many nations throughout the world are significantly impacted by life expectancy (LE) models. Numerous studies have pointed out the crucial effects that life expectancy projections will have on societal issues and the administration of the global healthcare system. The computation of life expectancy has primarily entailed building an ordinary life table. However, the life table is limited by its long duration, the assumption of homogeneity of cohorts, and censoring. As a result, a robust and more accurate approach is inevitable. In this study, a supervised machine learning model for estimating life expectancy rates is developed. The model takes into consideration health, socioeconomic, and behavioral characteristics by applying the eXtreme Gradient Boosting (XGBoost) algorithm to data from 193 UN member states. The effectiveness of the model's prediction is compared to that of the Random Forest (RF) and Artificial Neural Network (ANN) regressors utilized in earlier research. XGBoost attains an MAE and an RMSE of 1.554 and 2.402, respectively, outperforming the RF and ANN models, which achieved MAE and RMSE values of 7.938 and 11.304, and 3.86 and 5.002, respectively. The overall results of this study support XGBoost as a reliable and efficient model for estimating life expectancy.

    Methods: Secondary data were used, from which a sample of 2832 observations of 21 variables was sourced from the World Health Organization (WHO) and the United Nations (UN) databases. The data covered 193 UN member states for the years 2000-2015, with the LE health-related factors drawn from the Global Health Observatory data repository.
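The MAE and RMSE figures quoted above are both simple functions of the residuals; the sketch below computes them on four mock life-expectancy predictions (the values are illustrative, not from the study):

```python
import numpy as np

y_true = np.array([72.0, 65.0, 80.0, 58.0])   # mock observed LE (years)
y_pred = np.array([70.5, 66.0, 78.0, 60.0])   # mock model predictions

mae = np.mean(np.abs(y_true - y_pred))             # mean absolute error
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))    # root mean squared error
print(mae, rmse)
```

RMSE weights large residuals more heavily than MAE, which is why the two are usually reported together, as in this abstract.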

  14. Data_Sheet_2_Machine Learning Prediction Models for Mechanically Ventilated Patients: Analyses of the MIMIC-III Database.PDF

    • datasetcatalog.nlm.nih.gov
    • frontiersin.figshare.com
    Updated Jul 1, 2021
    + more versions
    Cite
    Yao, Renqi; Li, Lin; Du, Bin; Li, Yang; Zhang, Jin; Wang, Guowei; Chen, Yan; Li, Wei; Chen, Ge; Xi, Xiuming; Jin, Xin; Liu, Shi; Ren, Chao; Huang, Huibin; Guo, Junyang; Guo, Qianqian; Yu, Qian; Zhu, Yibing; Zheng, Hua (2021). Data_Sheet_2_Machine Learning Prediction Models for Mechanically Ventilated Patients: Analyses of the MIMIC-III Database.PDF [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000884346
    Explore at:
    Dataset updated
    Jul 1, 2021
    Authors
    Yao, Renqi; Li, Lin; Du, Bin; Li, Yang; Zhang, Jin; Wang, Guowei; Chen, Yan; Li, Wei; Chen, Ge; Xi, Xiuming; Jin, Xin; Liu, Shi; Ren, Chao; Huang, Huibin; Guo, Junyang; Guo, Qianqian; Yu, Qian; Zhu, Yibing; Zheng, Hua
    Description

    Background: Mechanically ventilated patients in the intensive care unit (ICU) have high mortality rates. There are multiple prediction scores, such as the Simplified Acute Physiology Score II (SAPS II), Oxford Acute Severity of Illness Score (OASIS), and Sequential Organ Failure Assessment (SOFA), widely used in the general ICU population. We aimed to establish prediction scores for mechanically ventilated patients by combining these disease severity scores with other features available on the first day of admission.

    Methods: A retrospective administrative database study from the Medical Information Mart for Intensive Care (MIMIC-III) database was conducted. The exposures of interest consisted of the demographics, pre-ICU comorbidity, ICU diagnosis, disease severity scores, vital signs, and laboratory test results on the first day of ICU admission. Hospital mortality was used as the outcome. We used the machine learning methods of k-nearest neighbors (KNN), logistic regression, bagging, decision tree, random forest, Extreme Gradient Boosting (XGBoost), and neural network for model establishment. A sample of 70% of the cohort was used for the training set; the remaining 30% was applied for testing. Areas under the receiver operating characteristic curves (AUCs) and calibration plots were constructed for the evaluation and comparison of the models' performance. The significance of the risk factors was identified through the models and the top factors were reported.

    Results: A total of 28,530 subjects were enrolled through the screening of the MIMIC-III database. After data preprocessing, 25,659 adult patients with 66 predictors were included in the model analyses. With the training set, the models of KNN, logistic regression, decision tree, random forest, neural network, bagging, and XGBoost were established, and on the testing set they obtained AUCs of 0.806, 0.818, 0.743, 0.819, 0.780, 0.803, and 0.821, respectively. The calibration curves of all the models, except for the neural network, performed well. The XGBoost model performed best among the seven models. The top five predictors were age, respiratory dysfunction, SAPS II score, maximum hemoglobin, and minimum lactate.

    Conclusion: The current study indicates that models based on risk factors from the first day of admission can be successfully established for predicting mortality in ventilated patients. The XGBoost model performs best among the seven machine learning models.
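The AUC comparison underlying those results can be sketched with sklearn's roc_auc_score; the labels and the two model score vectors below are mock values, not the study's predictions:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 1, 0]                   # mock hold-out mortality labels
p_model_a = [0.1, 0.3, 0.8, 0.7, 0.9, 0.2]   # mock scores, e.g. a boosted model
p_model_b = [0.4, 0.3, 0.6, 0.2, 0.7, 0.5]   # mock scores, e.g. a decision tree

# Higher AUC = better ranking of positives above negatives.
print(roc_auc_score(y_true, p_model_a), roc_auc_score(y_true, p_model_b))
```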

  15. Table_1_Automatic text classification of drug-induced liver injury using document-term matrix and XGBoost.pdf

    • datasetcatalog.nlm.nih.gov
    Updated Jun 3, 2024
    Cite
    Bao, Wenjun; Chen, Minjun; Thakkar, Shraddha; Wu, Yue; Tong, Weida; Wingerd, Byron; Liu, Zhichao; Mann, Nicholas; Wolfinger, Russell D.; Donnelly, Tom; Xu, Joshua; Pedersen, Thomas J. (2024). Table_1_Automatic text classification of drug-induced liver injury using document-term matrix and XGBoost.pdf [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001354925
    Explore at:
    Dataset updated
    Jun 3, 2024
    Authors
    Bao, Wenjun; Chen, Minjun; Thakkar, Shraddha; Wu, Yue; Tong, Weida; Wingerd, Byron; Liu, Zhichao; Mann, Nicholas; Wolfinger, Russell D.; Donnelly, Tom; Xu, Joshua; Pedersen, Thomas J.
    Description

    Introduction: Regulatory agencies generate a vast amount of textual data in the review process. For example, drug labeling serves as a valuable resource for regulatory agencies, such as the U.S. Food and Drug Administration (FDA) and the European Medicines Agency (EMA), to communicate drug safety and effectiveness information to healthcare professionals and patients. Drug labeling also serves as a resource for pharmacovigilance and drug safety research. Automated text classification would significantly improve the analysis of drug labeling documents and conserve reviewer resources.

    Methods: We utilized artificial intelligence in this study to classify drug-induced liver injury (DILI)-related content from drug labeling documents based on FDA's DILIrank dataset. We employed text mining and XGBoost models and utilized the Preferred Terms of medical queries for adverse event standards to simplify the elimination of common words and phrases while retaining medical standard terms for the FDA and EMA drug label datasets. We then constructed a document-term matrix using weights computed by Term Frequency-Inverse Document Frequency (TF-IDF) for each included word/term/token.

    Results: The automatic text classification model exhibited robust performance in predicting DILI, achieving cross-validation AUC scores exceeding 0.90 both for drug labels from the FDA and EMA and for literature abstracts from the Critical Assessment of Massive Data Analysis (CAMDA).

    Discussion: Moreover, the text mining and XGBoost functions demonstrated in this study can be applied to other text processing and classification tasks.
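The pipeline described (TF-IDF document-term matrix fed to a boosted-tree classifier) can be sketched as below; the four documents and their DILI labels are mock examples, and sklearn's GradientBoostingClassifier stands in for XGBoost:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier

docs = ["hepatotoxicity observed in trials",
        "no liver findings reported",
        "elevated transaminases and jaundice",
        "well tolerated with no adverse liver events"]
labels = [1, 0, 1, 0]   # mock DILI / non-DILI labels

# TF-IDF weights build the document-term matrix described in the abstract.
X = TfidfVectorizer().fit_transform(docs)
clf = GradientBoostingClassifier(random_state=0).fit(X.toarray(), labels)
print(clf.score(X.toarray(), labels))
```

The real study additionally filters the vocabulary down to standardized medical terms before building the matrix.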

  16. Data from: Assessing individual genetic susceptibility to metabolic syndrome: interpretable machine learning method

    • datasetcatalog.nlm.nih.gov
    • tandf.figshare.com
    Updated Jun 22, 2025
    Cite
    Huang, Tao; Zheng, Xiujuan; Huang, Xirui; Xiong, Wenhui; Yang, Menghan; Wang, Simin; Li, Yuanyuan; Gao, Bizhen; Qiao, Shijie (2025). Assessing individual genetic susceptibility to metabolic syndrome: interpretable machine learning method [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0002050385
    Explore at:
    Dataset updated
    Jun 22, 2025
    Authors
    Huang, Tao; Zheng, Xiujuan; Huang, Xirui; Xiong, Wenhui; Yang, Menghan; Wang, Simin; Li, Yuanyuan; Gao, Bizhen; Qiao, Shijie
    Description

    Genome-wide association studies have provided profound insights into the genetic aetiology of metabolic syndrome (MetS). However, there is a lack of machine-learning (ML)-based predictive models to assess individual genetic susceptibility to MetS. This study utilized single-nucleotide polymorphisms (SNPs) as variables and employed ML-based genetic risk score (GRS) models to predict the occurrence of MetS, bringing it closer to clinical application. Feature selection was performed using the Least Absolute Shrinkage and Selection Operator. Six ML algorithms were employed to construct GRS models. Fivefold cross-validation was utilized to aid in the internal validation of the models. The receiver operating characteristic (ROC) curve was used to select the better-performing GRS model. SHapley Additive exPlanations (SHAP) was then applied to interpret the model. After extracting the GRS, stratified analyses by BMI, age, and gender were performed. Finally, these conventional risk factors and the GRS were integrated through multivariate logistic regression to establish a combined model. A total of 17 SNPs were selected for analysis. Among the GRS models, the extreme gradient boosting (XGBoost) model demonstrated superior discriminative performance (AUC = 0.837). The XGBoost model's robustness was also validated through fivefold cross-validation (mean ROC-AUC = 0.706). The XGBoost-based SHAP algorithm not only elucidated the global effects of the 17 SNPs across all samples, but also described the interactions between SNPs, providing a visual representation of how SNPs impact the prediction of MetS for an individual. There was a strong correlation between GRS and MetS risk, particularly among young individuals, males, and overweight individuals. Furthermore, the model combining conventional risk factors and GRS exhibited excellent discriminative performance (AUC = 0.962) and outstanding robustness (mean ROC-AUC = 0.959).
    This study established a reliable XGBoost-based GRS model and a GRS prediction platform (https://metabolicsyndromeapps.shinyapps.io/geneticriskscore/) to assess individual genetic susceptibility to MetS. The model is highly interpretable and can provide a personalized reference for determining the necessity of primary prevention measures for MetS. Additionally, there may be interactions between traditional risk factors and the GRS, and integrating both in a comprehensive model is useful for predicting MetS occurrence.

  17. TCOM-N2O: TOMCAT CTM and Occultation Measurements based daily zonal...

    • zenodo.org
    nc, pdf
    Updated Jul 15, 2024
    + more versions
    Cite
    Sandip Dhomse; Sandip Dhomse (2024). TCOM-N2O: TOMCAT CTM and Occultation Measurements based daily zonal stratospheric nitrous oxide profile dataset [1991-2021] constructed using machine-learning [Dataset]. http://doi.org/10.5281/zenodo.7386001
    Explore at:
    nc, pdfAvailable download formats
    Dataset updated
    Jul 15, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Sandip Dhomse; Sandip Dhomse
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Methodology: The TOMCAT simulation is performed at T64L32 resolution, similar to the setup used in Dhomse et al. (2021, 2022), for the 1991-2021 period. Model profiles are sampled at the ACE-FTS (2004-present) measurement collocations, so that model output is taken at the nearest latitude/longitude and time. The collocated N2O profiles are then divided into five latitude bins: SH polar (90S-50S), SH mid-lat (70S-20S), tropics (40S-40N), NH mid-lat (20N-70N) and NH polar (50N-90N). Corrections in the overlapping latitudes are averaged to ensure that the mean correction terms do not have sharp edges.

    Initially, differences are calculated for each zonal bin at 51 height levels (10 km to 60 km). Separate XGBoost regression models are then trained on the N2O differences between TOMCAT and the measurements at each level for a given latitude bin. The same models are applied to all day- and night-time TOMCAT output (2 × 11323 days), sampled at 1.30 am and 1.30 pm local time at the equator. Bias corrections for a given model grid point are calculated with XGBoost and added to the original TOMCAT day- and night-time profiles. The height-resolved data are then interpolated onto 28 pressure levels (300-0.1 hPa). For overlapping latitude bins, we average the corrections and then calculate daily zonal-mean values. For more details, see the attached presentation.

    The dataset also includes two files containing daily zonal-mean N2O profiles on height (15-60 km) and pressure (300-0.1 hPa) levels:

    zmn2o_TCOM_hlev_T2Dz_1991_2021.nc – height level data (15 to 60 km)

    zmn2o_TCOM_plev_T2Dz_1991_2021.nc – pressure level data (300 to 0.1 hPa)

    Note that there is no observational constraint for the 1991-2003 period; the correction terms therefore assume that there are no significant discontinuities in the ERA5 reanalysis fields used to drive TOMCAT transport.

    Dhomse_TCOM-N2O.pdf provides a brief description of the dataset.

  18. Lifestyle and Health Risk Prediction

    • kaggle.com
    zip
    Updated Oct 19, 2025
    Cite
    Arif Miah (2025). Lifestyle and Health Risk Prediction [Dataset]. https://www.kaggle.com/datasets/miadul/lifestyle-and-health-risk-prediction
    Explore at:
    zip(61139 bytes)Available download formats
    Dataset updated
    Oct 19, 2025
    Authors
    Arif Miah
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    📘 Description:

    This synthetic health dataset simulates real-world lifestyle and wellness data for individuals. It is designed to help data scientists, machine learning engineers, and students build and test health risk prediction models safely — without using sensitive medical data.

    The dataset includes features such as age, weight, height, exercise habits, sleep hours, sugar intake, smoking, alcohol consumption, marital status, and profession, along with a synthetic health_risk label generated using a heuristic rule-based algorithm that mimics realistic risk behavior patterns.

    🧾 Columns Description:

    • age: Age of the person in years (Numeric, e.g. 35)
    • weight: Body weight in kilograms (Numeric, e.g. 70)
    • height: Height in centimeters (Numeric, e.g. 172)
    • exercise: Exercise frequency level (Categorical: none, low, medium, high; e.g. medium)
    • sleep: Average hours of sleep per night (Numeric, e.g. 7)
    • sugar_intake: Level of sugar consumption (Categorical: low, medium, high; e.g. high)
    • smoking: Smoking habit (Categorical: yes, no; e.g. no)
    • alcohol: Alcohol consumption habit (Categorical: yes, no; e.g. yes)
    • married: Marital status (Categorical: yes, no; e.g. yes)
    • profession: Type of work or profession (Categorical: office_worker, teacher, doctor, engineer, etc.; e.g. teacher)
    • bmi: Body Mass Index calculated as weight / (height²) (Numeric, e.g. 24.5)
    • health_risk: Target label showing overall health risk (Categorical: low, high; e.g. high)
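    The bmi column and a heuristic health_risk label of this kind can be reproduced in a few lines. The risk rules below are hypothetical (the dataset's actual rule set is not published); they are shown only to illustrate the rule-based labelling approach.

```python
def bmi(weight_kg: float, height_cm: float) -> float:
    """BMI = weight / height^2, with height converted from cm to metres."""
    h = height_cm / 100.0
    return round(weight_kg / (h * h), 1)

def health_risk(age, weight_kg, height_cm, smoking, sleep_hours):
    """Hypothetical rule-based label in the spirit of the dataset's
    heuristic generator: accumulate risk points, then threshold."""
    score = 0
    if bmi(weight_kg, height_cm) >= 30:
        score += 2  # obese BMI
    if smoking == "yes":
        score += 2
    if sleep_hours < 6:
        score += 1
    if age >= 55:
        score += 1
    return "high" if score >= 3 else "low"

print(bmi(70, 172))                       # 23.7
print(health_risk(35, 70, 172, "no", 7))  # low
```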

    🧩 Use Cases:

    1. Health Risk Prediction: Train classification models (Logistic Regression, RandomForest, XGBoost, CatBoost) to predict health risk (low / high).

    2. Feature Importance Analysis: Identify which lifestyle factors most influence health risk.

    3. Data Preprocessing & EDA Practice: Use this dataset for data cleaning, encoding, and visualization practice.

    4. Model Explainability Projects: Use SHAP or LIME to explain how different lifestyle habits affect predictions.

    5. Streamlit or Flask Web App Development: Build a real-time web app that predicts health risk from user input.

    💡 Case Study Example:

    Imagine you are a data scientist building a Health Risk Prediction App for a wellness startup. You want to analyze how exercise, sleep, and sugar intake affect overall health risk. This dataset helps you simulate those relationships without handling sensitive medical data.

    You could:

    • Perform EDA to find correlations between age, BMI, and health risk.
    • Train a model using Random Forest to predict health_risk.
    • Deploy a Streamlit app where users can input their lifestyle information and get a risk score instantly.
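    A minimal version of the model-training step on this mixed numeric/categorical schema might look like the following sketch. The frame is generated in-line with an invented labelling rule rather than loaded from the Kaggle file, and only a subset of the columns is used.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 1000
# Small synthetic frame mimicking the dataset's mixed schema
# (column names from the table above; values invented here).
df = pd.DataFrame({
    "age": rng.integers(18, 80, n),
    "bmi": rng.normal(25, 4, n).round(1),
    "smoking": rng.choice(["yes", "no"], n),
    "exercise": rng.choice(["none", "low", "medium", "high"], n),
})
df["health_risk"] = np.where(
    (df["bmi"] > 30) | (df["smoking"] == "yes"), "high", "low")

# One-hot encode the categorical columns, pass numerics through unchanged.
pre = ColumnTransformer(
    [("cat", OneHotEncoder(), ["smoking", "exercise"])],
    remainder="passthrough")
clf = Pipeline([("prep", pre), ("rf", RandomForestClassifier(random_state=0))])

X_tr, X_te, y_tr, y_te = train_test_split(
    df.drop(columns="health_risk"), df["health_risk"], random_state=0)
acc = clf.fit(X_tr, y_tr).score(X_te, y_te)
print(f"test accuracy = {acc:.2f}")
```

    Wrapping the encoder and the forest in one Pipeline means the same preprocessing is applied at prediction time, which is what a Streamlit or Flask front end would call on user input.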

    ⚙️ Technical Information:

    • Rows: 5,000 (adjustable, you can create more)
    • Columns: 12
    • Target variable: health_risk
    • Data type: Mixed (Numeric + Categorical)
    • Source: Fully synthetic, generated using Python (NumPy, Faker)

    📈 License:

    CC0: Public Domain. You are free to use this dataset for research, learning, or commercial projects.

    🌍 Author:

    Created by Arif Miah (Machine Learning Engineer | Kaggle Expert | Data Scientist). 📧 arifmiahcse@gmail.com

  19. Wine Quality Classification

    • kaggle.com
    zip
    Updated Apr 1, 2025
    Cite
    🇹🇷 Şahide Şeker, MSc (2025). Wine Quality Classification [Dataset]. https://www.kaggle.com/datasets/sahideseker/wine-quality-classification/data
    Explore at:
    zip(7606 bytes)Available download formats
    Dataset updated
    Apr 1, 2025
    Authors
    🇹🇷 Şahide Şeker, MSc
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    🇬🇧 English:

    This synthetic dataset is designed for classification tasks involving wine quality. It includes 1,000 samples with key chemical attributes such as acidity, sugar level, alcohol content, and density. Each sample is labeled with a wine quality class: low, medium, or high.

    Use this dataset to:

    • Train classification models like SVM, XGBoost, and Logistic Regression
    • Explore the impact of chemical features on wine quality
    • Practice ML tasks in a food and beverage context

    Features:

    • fixed_acidity: Level of fixed acidity
    • residual_sugar: Sugar level after fermentation
    • alcohol: Alcohol content (%)
    • density: Liquid density
    • quality_label: Wine quality class (low / medium / high)
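    A quick baseline for the classification task, assuming the four numeric features above: SVMs are scale-sensitive, so the sketch standardises the features before fitting. The synthetic data and the alcohol-driven labelling rule are invented for illustration, not taken from the actual file.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 1000
# Synthetic stand-in with the dataset's four numeric features.
X = np.column_stack([
    rng.normal(7, 1.5, n),        # fixed_acidity
    rng.normal(5, 2, n),          # residual_sugar
    rng.normal(11, 1.5, n),       # alcohol
    rng.normal(0.996, 0.002, n),  # density
])
# Three-class label (0=low, 1=medium, 2=high) driven by alcohol
# thresholds; a hypothetical rule for demonstration only.
y = np.digitize(X[:, 2], [10.0, 12.0])

# Standardise, then fit an RBF SVM; score with five-fold cross-validation.
clf = make_pipeline(StandardScaler(), SVC())
scores = cross_val_score(clf, X, y, cv=5)
print(f"mean CV accuracy = {scores.mean():.2f}")
```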

    🇹🇷 Türkçe:

    This synthetic dataset was designed for machine-learning applications that classify wine quality. It contains 1,000 samples with chemical features such as acidity, sugar content, alcohol percentage and density, and each sample is labelled as low, medium or high quality.

    With this dataset you can:

    • Apply classification algorithms such as SVM, XGBoost and Logistic Regression
    • Analyse the effect of the chemical features on wine quality
    • Develop ML applications for the food and beverage sector

    Variables:

    • fixed_acidity: Fixed acidity level
    • residual_sugar: Sugar remaining after fermentation
    • alcohol: Alcohol content (%)
    • density: Density value
    • quality_label: Quality class (low / medium / high)

  20. Data from: Predicting metallicities and carbon abundances from Gaia XP...

    • zenodo.org
    csv
    Updated Jan 18, 2025
    Cite
    Anke Ardern-Arentsen; Anke Ardern-Arentsen (2025). Predicting metallicities and carbon abundances from Gaia XP spectra for (carbon-enhanced) metal-poor stars [Dataset]. http://doi.org/10.5281/zenodo.14651678
    Explore at:
    csvAvailable download formats
    Dataset updated
    Jan 18, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Anke Ardern-Arentsen; Anke Ardern-Arentsen
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset described in Ardern-Arentsen et al. (2025), "Predicting metallicities and carbon abundances from Gaia XP spectra for (carbon-enhanced) metal-poor stars". There are three tables:

    • reference: spectroscopic parameters used to train and test the neural network plus predictions
    • L23: predictions for Lucey et al. (2023), XGBoost C-rich candidates from XP
    • A23: predictions for the "vetted RGB sample" (those with radial velocities only) from Andrae et al. (2023), XGBoost metallicities from XP

    All three tables are as described in the paper, with the data models presented in the appendix.
