100+ datasets found
  1. jars xgboost example files

    • kaggle.com
    zip
    Updated Apr 16, 2022
    Cite
    Edwin Hauwert M.Sc. (2022). jars xgboost example files [Dataset]. https://www.kaggle.com/develuse/jars-xgboost-old
    Explore at:
    zip (615742866 bytes). Available download formats
    Dataset updated
    Apr 16, 2022
    Authors
    Edwin Hauwert M.Sc.
    Description

    Dataset

    This dataset was created by Edwin Hauwert M.Sc.


  2. Hyperparameters for the XGBoost model.

    • plos.figshare.com
    xls
    Updated Nov 27, 2024
    Cite
    Hoa Thi Trinh; Tuan Anh Pham; Vu Dinh Tho; Duy Hung Nguyen (2024). Hyperparameters for the XGBoost model. [Dataset]. http://doi.org/10.1371/journal.pone.0312531.t002
    Explore at:
    xls. Available download formats
    Dataset updated
    Nov 27, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Hoa Thi Trinh; Tuan Anh Pham; Vu Dinh Tho; Duy Hung Nguyen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Structurally, the lateral load-bearing capacity mainly depends on reinforced concrete (RC) walls. Determining flexural strength and shear strength is mandatory when designing RC walls. Typically, these strengths are determined through theoretical formulas and verified experimentally. However, theoretical formulas often have large errors, and testing is costly and time-consuming. Therefore, this study exploits machine learning techniques, specifically a hybrid XGBoost model combined with optimization algorithms, to predict the shear strength of RC walls, trained on available experimental results. The study used the largest database of RC walls to date, consisting of 1057 samples with various cross-sectional shapes. Bayesian optimization (BO) algorithms, including BO-Gaussian Process and BO-Random Forest, as well as the Random Search method, were used to refine the XGBoost model architecture. The results show that the Gaussian Process emerged as the most efficient solution among the optimization algorithms, providing the lowest Mean Square Error and achieving prediction R2 values of 0.998 for the training set, 0.972 for the validation set, and 0.984 for the test set. BO-Random Forest and Random Search performed as well as the Gaussian Process on the training and test sets but significantly worse on the validation set, with validation R2 values of 0.970 and 0.969, respectively, over the entire dataset including all cross-sectional shapes of the RC walls. The SHAP (Shapley Additive Explanations) technique was used to clarify the predictive ability of the model and the importance of the input variables. Furthermore, the performance of the model was validated through comparative analysis with benchmark models and current standards. Notably, the coefficient of variation (COV %) of the XGBoost model is 13.27%, while traditional models often have COV % exceeding 50%.
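As a deliberately simplified illustration of the hyperparameter-tuning loop this abstract describes, the sketch below runs a random search over gradient-boosting parameters on mock data; sklearn's GradientBoostingRegressor stands in for XGBoost, and the features, target, and search ranges are all hypothetical (the paper itself uses Bayesian optimization):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))                             # mock wall features
y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=300)   # mock shear strength

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

best = None
for _ in range(10):  # random search; BO would propose params more cleverly
    params = {"n_estimators": int(rng.integers(50, 300)),
              "max_depth": int(rng.integers(2, 6)),
              "learning_rate": float(rng.uniform(0.02, 0.3))}
    model = GradientBoostingRegressor(random_state=0, **params).fit(X_tr, y_tr)
    mse = mean_squared_error(y_te, model.predict(X_te))
    if best is None or mse < best[0]:
        best = (mse, params)        # keep the lowest-MSE configuration
print(best)
```

A Bayesian optimizer would replace the random draws with a surrogate model that proposes promising parameter sets, but the evaluate-and-keep-the-best loop is the same.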

  3. Data from: Representative sample size for estimating saturated hydraulic conductivity via machine learning

    • beta.hydroshare.org
    • hydroshare.org
    • +1more
    zip
    Updated May 21, 2024
    Cite
    Amin Ahmadisharaf; Reza Nematirad; Sadra Sabouri; Yakov Pachepsky; Behzad Ghanbarian (2024). Representative sample size for estimating saturated hydraulic conductivity via machine learning [Dataset]. https://beta.hydroshare.org/resource/4c33179a77834634969bb9787c41e71a/
    Explore at:
    zip (5.9 MB). Available download formats
    Dataset updated
    May 21, 2024
    Dataset provided by
    HydroShare
    Authors
    Amin Ahmadisharaf; Reza Nematirad; Sadra Sabouri; Yakov Pachepsky; Behzad Ghanbarian
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This database includes saturated hydraulic conductivity data from the USKSAT database as well as the associated Python code used to analyze learning curves and to train and test the developed machine learning models.

  4. Table_7_Five-Feature Model for Developing the Classifier for Synergistic vs. Antagonistic Drug Combinations Built by XGBoost.XLSX

    • datasetcatalog.nlm.nih.gov
    • frontiersin.figshare.com
    Updated Jul 9, 2019
    + more versions
    Cite
    Shi, Tieliu; Ji, Xiangjun; Liu, Zhichao; Tong, Weida (2019). Table_7_Five-Feature Model for Developing the Classifier for Synergistic vs. Antagonistic Drug Combinations Built by XGBoost.XLSX [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000134259
    Explore at:
    Dataset updated
    Jul 9, 2019
    Authors
    Shi, Tieliu; Ji, Xiangjun; Liu, Zhichao; Tong, Weida
    Description

    Combinatorial drug therapy can improve the therapeutic effect and reduce the corresponding adverse events. In silico strategies to classify synergistic vs. antagonistic drug pairs are more efficient than experimental strategies. However, most of the developed methods have been applied only to cancer therapies. In this study, we introduce a novel method based on XGBoost, using five features of drugs and the biomolecular networks of their targets, to classify synergistic vs. antagonistic drug combinations from different drug categories. We found that XGBoost outperformed other classifiers in both stratified fivefold cross-validation (CV) and independent validation. For example, XGBoost achieved higher predictive accuracy than other models (0.86, 0.78, 0.78, and 0.83 for XGBoost, logistic regression, naïve Bayesian, and random forest, respectively) for an independent validation set. We also found that the five-feature XGBoost model is much more effective at predicting combinatorial therapies that have synergistic effects than those with antagonistic effects. The five-feature XGBoost model was also validated on TCGA data with an accuracy of 0.79 among the 61 tested drug pairs, which is comparable to that of DeepSynergy. Among the 14 main anatomical/pharmacological groups classified according to the WHO Anatomic Therapeutic Class, for drugs belonging to five groups, their prediction accuracy was significantly increased (odds ratio < 1) or reduced (odds ratio > 1) (Fisher's exact test, p < 0.05). This study concludes that our five-feature XGBoost model has significant benefits for classifying synergistic vs. antagonistic drug combinations.

  5. Data from: A Deep Learning and XGBoost-based Method for Predicting Protein-protein Interaction Sites

    • narcis.nl
    • data.mendeley.com
    Updated Aug 3, 2021
    + more versions
    Cite
    wang, P (via Mendeley Data) (2021). A Deep Learning and XGBoost-based Method for Predicting Protein-protein Interaction Sites [Dataset]. http://doi.org/10.17632/9tft3vz5tm.2
    Explore at:
    Dataset updated
    Aug 3, 2021
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    wang, P (via Mendeley Data)
    Description

    local_feature_training_set.csv: preprocessed feature-extractor data with 65869 rows and 344 columns; each row is a sample, the first 343 columns are features, and the last column is the label.

    local_feature_testing_set.csv: preprocessed feature-extractor data with 11791 rows and 344 columns; each row is a sample, the first 343 columns are features, and the last column is the label.

    global&local_feature_training_set.csv: preprocessed feature-extractor data with 65869 rows and 1028 columns; each row is a sample, the first 1027 columns are features, and the last column is the label.

    global&local_feature_testing_set.csv: preprocessed feature-extractor data with 11791 rows and 1028 columns; each row is a sample, the first 1027 columns are features, and the last column is the label.
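A minimal sketch of how a file with this layout could be split into features and label, assuming only what the description states (last column is the label); the dataframe below is a tiny mock, not the real 65869-row file:

```python
import numpy as np
import pandas as pd

# Mock stand-in for local_feature_training_set.csv (real file: 65869 x 344).
df = pd.DataFrame(np.random.rand(5, 4), columns=["f1", "f2", "f3", "label"])

X = df.iloc[:, :-1].to_numpy()   # all but the last column: features
y = df.iloc[:, -1].to_numpy()    # last column: label
print(X.shape, y.shape)
```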

  6. MetaCost XGBoost Training and Evaluation Dataset with MATBLAB Codes and files for generating proxies

    • zenodo.org
    csv, zip
    Updated Jun 18, 2025
    Cite
    Abhishek Borah; Xavier Emery; Xavier Emery; Parag Jyoti Dutta; Parag Jyoti Dutta; Abhishek Borah (2025). MetaCost XGBoost Training and Evaluation Dataset with MATBLAB Codes and files for generating proxies [Dataset]. http://doi.org/10.5281/zenodo.15666484
    Explore at:
    csv, zip. Available download formats
    Dataset updated
    Jun 18, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Abhishek Borah; Xavier Emery; Xavier Emery; Parag Jyoti Dutta; Parag Jyoti Dutta; Abhishek Borah
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jun 14, 2025
    Description

    The dataset consists of two curated subsets designed for the classification of alteration types using geochemical and proxy variables. The traditional dataset (Trad_Train.csv and Trad_Test.csv) is derived directly from the original complete geochemical dataset (alldata.csv) without any missing values and includes the original geochemical features, serving as a baseline for model training and evaluation. In contrast, the simulated dataset (proxies_alldata.csv) was generated through custom MATLAB scripts that transform the original geochemical features into proxy variables based on multiple geostatistical realizations. These proxies, expressed on a Gaussian scale, may include negative values due to normalization. The target variable, Alteration, was originally encoded as integers using the mapping: 1 = AAA, 2 = IAA, 3 = PHY, 4 = PRO, 5 = PTS, and 6 = UAL. The simulated proxy data was split into the simulated train and test files (Simu_Train.csv and Simu_Test.csv) based on an encoded flag (1 = training, 2 = testing). All supporting files (datasets, intermediate outputs such as PNGs and variograms, proxy outputs, and an executable for the confidence analysis routines) are included in the repository except the source code, which is hosted in a GitHub repository. Specifically, the FinalMatlabFiles.zip archive contains the raw input file alldata.csv used to generate proxies_alldata.csv; it also contains Analysis1.csv and Analysis2.csv for performing confidence analysis. To run the executable files in place of the .m scripts in MATLAB, users must install the MATLAB Runtime 2023b for Windows 64-bit, available at: https://ssd.mathworks.com/supportfiles/downloads/R2023b/Release/10/deployment_files/installer/complete/win64/MATLAB_Runtime_R2023b_Update_10_win64.zip.

    Details on the input files for confidence analysis: Analysis1.csv and Analysis2.csv
    These files contain two columns for the test data: column 1 indicates whether the predicted and true alterations match, and column 2 gives the probability of a correct classification, according to bootstrapped samples (Analysis1.csv) or to simulated proxies (Analysis2.csv).
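Given that two-column layout, a confidence analysis can compare observed accuracy against mean predicted confidence; the rows below are mock stand-ins for Analysis1.csv, and the column names are illustrative:

```python
import pandas as pd

# Mock rows: column 1 = 0/1 match flag, column 2 = predicted P(correct).
df = pd.DataFrame({"match": [1, 0, 1, 1],
                   "p_correct": [0.9, 0.4, 0.8, 0.7]})

accuracy = df["match"].mean()      # observed fraction of correct predictions
expected = df["p_correct"].mean()  # mean predicted confidence
print(accuracy, expected)
```

If the two numbers diverge strongly, the classifier's confidence estimates are miscalibrated.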
  7. Fraud Detection Transactions Dataset

    • kaggle.com
    zip
    Updated Feb 21, 2025
    Cite
    Samay Ashar (2025). Fraud Detection Transactions Dataset [Dataset]. https://www.kaggle.com/datasets/samayashar/fraud-detection-transactions-dataset
    Explore at:
    zip (2104444 bytes). Available download formats
    Dataset updated
    Feb 21, 2025
    Authors
    Samay Ashar
    License

    CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is designed to help data scientists and machine learning enthusiasts develop robust fraud detection models. It contains realistic synthetic transaction data, including user information, transaction types, risk scores, and more, making it ideal for binary classification tasks with models like XGBoost and LightGBM.

    📌 Key Features

    1. 21 features capturing various aspects of a financial transaction
    2. Realistic structure with numerical, categorical, and temporal data
    3. Binary fraud labels (0 = Not Fraud, 1 = Fraud)
    4. Designed for high accuracy with XGBoost and other ML models
    5. Useful for anomaly detection, risk analysis, and security research

    📌 Columns in the Dataset

    Transaction_ID: Unique identifier for each transaction
    User_ID: Unique identifier for the user
    Transaction_Amount: Amount of money involved in the transaction
    Transaction_Type: Type of transaction (Online, In-Store, ATM, etc.)
    Timestamp: Date and time of the transaction
    Account_Balance: User's current account balance before the transaction
    Device_Type: Type of device used (Mobile, Desktop, etc.)
    Location: Geographical location of the transaction
    Merchant_Category: Type of merchant (Retail, Food, Travel, etc.)
    IP_Address_Flag: Whether the IP address was flagged as suspicious (0 or 1)
    Previous_Fraudulent_Activity: Number of past fraudulent activities by the user
    Daily_Transaction_Count: Number of transactions made by the user that day
    Avg_Transaction_Amount_7d: User's average transaction amount in the past 7 days
    Failed_Transaction_Count_7d: Count of failed transactions in the past 7 days
    Card_Type: Type of payment card used (Credit, Debit, Prepaid, etc.)
    Card_Age: Age of the card in months
    Transaction_Distance: Distance between the user's usual location and the transaction location
    Authentication_Method: How the user authenticated (PIN, Biometric, etc.)
    Risk_Score: Fraud risk score computed for the transaction
    Is_Weekend: Whether the transaction occurred on a weekend (0 or 1)
    Fraud_Label: Target variable (0 = Not Fraud, 1 = Fraud)

    📌 Potential Use Cases

    1. Fraud detection model training
    2. Anomaly detection in financial transactions
    3. Risk scoring systems for banks and fintech companies
    4. Feature engineering and model explainability research
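The binary-classification task described above can be sketched on a few of the listed columns; the data here are mock draws (not the actual dataset), and sklearn's GradientBoostingClassifier stands in for XGBoost:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "Transaction_Amount": rng.uniform(1, 500, 200),
    "Risk_Score": rng.uniform(0, 1, 200),
    "Is_Weekend": rng.integers(0, 2, 200),
})
# Mock target: in the real dataset Fraud_Label is given, not derived.
df["Fraud_Label"] = (df["Risk_Score"] > 0.8).astype(int)

X, y = df.drop(columns="Fraud_Label"), df["Fraud_Label"]
clf = GradientBoostingClassifier(random_state=0).fit(X, y)
print(clf.score(X, y))   # training accuracy on the mock data
```

A real workflow would hold out a test split and encode the categorical columns (Transaction_Type, Card_Type, etc.) before training.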
  8. Tox24 challenge data

    • kaggle.com
    zip
    Updated Sep 18, 2024
    Cite
    Antonina Dolgorukova (2024). Tox24 challenge data [Dataset]. https://www.kaggle.com/datasets/antoninadolgorukova/tox24-challenge-data/suggestions
    Explore at:
    zip (19160575 bytes). Available download formats
    Dataset updated
    Sep 18, 2024
    Authors
    Antonina Dolgorukova
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset and the associated notebooks were created to solve the Tox24 Challenge and to provide real-world data and a working example of how machine learning can be used to predict binding activity to a target protein such as Transthyretin (TTR), from retrieving and preprocessing SMILES to ensembling the obtained predictions.

    SMILES: The file all_smiles_data.csv contains various SMILES strings for the 1512 competition chemicals (retrieved from PubChem, cleaned, and with isolated-atom SMILES removed), generated in this notebook. Here I also evaluated the performance of XGBoost using molecular descriptors computed from different SMILES representations of the chemicals.

    FEATURES: The 'features' folder contains features calculated with OCHEM and those computed in the feature engineering notebook. All feature sets were evaluated with XGBoost here.

    Feature selection notebooks: - https://www.kaggle.com/code/antoninadolgorukova/tox24-feature-selection-for-xgboost - https://www.kaggle.com/code/antoninadolgorukova/tox24-feature-selection-by-clusters-for-xgboost - https://www.kaggle.com/code/antoninadolgorukova/tox24-feature-selection-by-clusters-for-lightgbm

    MODELS: The 'submits' folder contains the predictions for the 500 test chemicals of the Tox24 Challenge, made with the XGBoost and LightGBM models and used for the final ensemble.

    DATA: The TTR Supplemental Tables are taken from the article that accompanied the Tox24 Challenge, and include:

    • Tables outlining the components of the assay reactions and lists of autofluorescent chemicals,
    • chemicals excluded from the analysis due to interference,
    • and chemicals screened in single concentration and concentration response testing.

    This dataset can be used for drug design research, protein-ligand interaction studies, and machine learning model development focused on chemical binding activities.
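The final ensembling step mentioned above reduces, in its simplest form, to averaging the per-chemical predictions of the individual models; the prediction vectors below are mock values, not the actual submissions:

```python
import numpy as np

pred_xgb = np.array([0.2, 0.8, 0.5])    # mock XGBoost test predictions
pred_lgbm = np.array([0.4, 0.6, 0.7])   # mock LightGBM test predictions

ensemble = (pred_xgb + pred_lgbm) / 2   # simple mean ensemble
print(ensemble)
```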

  9. Data from: Robust Data-driven Metallicities for 175 Million Stars from Gaia XP Spectra

    • zenodo.org
    application/gzip
    Updated May 17, 2023
    Cite
    René Andrae; René Andrae; Hans-Walter Rix; Hans-Walter Rix; Vedant Chandra; Vedant Chandra (2023). Robust Data-driven Metallicities for 175 Million Stars from Gaia XP Spectra [Dataset]. http://doi.org/10.5281/zenodo.7925612
    Explore at:
    application/gzip. Available download formats
    Dataset updated
    May 17, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    René Andrae; René Andrae; Hans-Walter Rix; Hans-Walter Rix; Vedant Chandra; Vedant Chandra
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the dataset accompanying Andrae et al. (2023), "Robust Data-driven Metallicities for 175 Million Stars from Gaia XP Spectra".

    Table 1 (174,922,161 rows) contains XGBoost parameters (temperature, surface gravity, and metallicity) for all stars in the sample.

    Table 2 (17,558,141 rows) contains Gaia DR3 parameters and XGBoost parameters for a vetted sample of RGB stars with reliable measurements.

    The tables are provided in compressed CSV format, and the full data model is described in the accompanying paper.
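Compressed CSV tables like these can be read by pandas without manual decompression; the bytes, file layout, and column names below are illustrative mocks, not the dataset's actual schema:

```python
import gzip
import io
import pandas as pd

# Mock gzip-compressed CSV standing in for one of the dataset's tables.
raw = b"source_id,teff_xgboost,logg_xgboost,mh_xgboost\n1,5700,4.4,-0.1\n"
buf = io.BytesIO(gzip.compress(raw))

df = pd.read_csv(buf, compression="gzip")   # for a file path, pandas infers this
print(df.shape)
```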

  10. Compare errors made by different models

    • kaggle.com
    zip
    Updated Jan 16, 2020
    Cite
    Afshan Nabi (2020). Compare errors made by different models [Dataset]. https://www.kaggle.com/afshannabi/compare-errors-made-by-different-models
    Explore at:
    zip (5583 bytes). Available download formats
    Dataset updated
    Jan 16, 2020
    Authors
    Afshan Nabi
    Description

    Context

    I had 3 different models trained on the same data. Gauging model accuracy was easy, but I was curious about whether all models mis-classify the same samples or not. So, I generated the predictions made by each model and tried to visualize them in a useful manner.

    Content

    The dataframe contains the True label for each sample as well as the predictions made by 3 different models: SVM, XGBoost and MLP.
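With a dataframe of that shape, the "do the models miss the same samples?" question reduces to comparing each prediction column against the true label; the rows below are mock values matching the described columns:

```python
import pandas as pd

df = pd.DataFrame({
    "True": [0, 1, 1, 0, 1],
    "SVM": [0, 1, 0, 0, 1],
    "XGBoost": [0, 1, 1, 1, 1],
    "MLP": [0, 0, 0, 0, 1],
})

# True where a model's prediction disagrees with the true label.
errors = df[["SVM", "XGBoost", "MLP"]].ne(df["True"], axis=0)
print(errors.sum())               # error count per model
print(errors.all(axis=1).sum())   # samples that every model gets wrong
```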

  11. TPS-Mar-2025-Rain-Prediction-Data

    • kaggle.com
    Updated Apr 14, 2025
    Cite
    Eren Ata (2025). TPS-Mar-2025-Rain-Prediction-Data [Dataset]. https://www.kaggle.com/datasets/erenata/tps-mar-2025-rain-prediction-data
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Apr 14, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Eren Ata
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Rain Prediction Model - Kaggle Competition

    Project overview: This project is a machine learning solution for the Tabular Playground Series - March 2025 Kaggle competition, focusing on rain prediction using various weather-related features.

    Features:

    • Advanced feature engineering with weather interactions and rolling statistics
    • Ensemble learning with XGBoost, LightGBM, and Logistic Regression
    • Hyperparameter optimization using Optuna
    • Cross-validation with GroupKFold
    • Feature importance analysis and visualization

    Model performance:

    • XGBoost CV score: 0.8957 ± 0.0192 AUC
    • Hyperparameters optimized through 20 trials
    • Feature importance visualization available in 'feature_importance.png'
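The GroupKFold cross-validation listed above keeps all rows of a group in the same fold; the sketch below shows the mechanic on mock data, with a logistic-regression stand-in and hypothetical group ids rather than the competition pipeline:

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = (X[:, 0] > 0).astype(int)
groups = np.repeat(np.arange(12), 10)   # e.g. 12 hypothetical station/year groups

scores = []
for tr, te in GroupKFold(n_splits=4).split(X, y, groups):
    model = LogisticRegression().fit(X[tr], y[tr])
    scores.append(model.score(X[te], y[te]))
print(np.mean(scores))
```

Because whole groups are held out, the CV score is not inflated by near-duplicate rows leaking across the train/test boundary.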

  12. Large-scale proteomics in the first trimester of pregnancy predict psychopathology and temperament in preschool children

    • immport.org
    • data.niaid.nih.gov
    url
    Cite
    Large-scale proteomics in the first trimester of pregnancy predict psychopathology and temperament in preschool children [Dataset]. http://doi.org/10.21430/m34sax8sjb
    Explore at:
    url. Available download formats
    License

    https://www.immport.org/agreement

    Description

    This study investigates how prenatal inflammation might affect childhood psychopathology by analyzing over 1,000 proteins in first-trimester blood samples using an XGBoost machine learning model. The results show that these proteins predict 5-10% of the variance in early childhood behaviors such as sadness and attention issues, highlighting immune and nervous system development as key factors. The findings suggest that a broader range of proteins than previously considered could influence future mental health outcomes in children.

  13. A machine learning based prediction model for life expectancy

    • data.niaid.nih.gov
    • datasetcatalog.nlm.nih.gov
    • +1more
    zip
    Updated Nov 14, 2022
    Cite
    Evans Omondi; Brian Lipesa; Elphas Okango; Bernard Omolo (2022). A machine learning based prediction model for life expectancy [Dataset]. http://doi.org/10.5061/dryad.z612jm6fv
    Explore at:
    zip. Available download formats
    Dataset updated
    Nov 14, 2022
    Dataset provided by
    University of South Carolina Upstate
    Strathmore University
    Authors
    Evans Omondi; Brian Lipesa; Elphas Okango; Bernard Omolo
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    The social and financial systems of many nations throughout the world are significantly impacted by life expectancy (LE) models. Numerous studies have pointed out the crucial effects that life expectancy projections will have on societal issues and the administration of the global healthcare system. The computation of life expectancy has primarily entailed building an ordinary life table. However, the life table is limited by its long duration, the assumption of homogeneity of cohorts, and censoring. As a result, a robust and more accurate approach is inevitable. In this study, a supervised machine learning model for estimating life expectancy rates is developed. The model takes into consideration health, socioeconomic, and behavioral characteristics by applying the eXtreme Gradient Boosting (XGBoost) algorithm to data from 193 UN member states. The effectiveness of the model's prediction is compared to that of the Random Forest (RF) and Artificial Neural Network (ANN) regressors utilized in earlier research. XGBoost attains an MAE and an RMSE of 1.554 and 2.402, respectively, outperforming the RF and ANN models, which achieved MAE and RMSE values of 7.938 and 11.304, and 3.86 and 5.002, respectively. The overall results of this study support XGBoost as a reliable and efficient model for estimating life expectancy.

    Methods: Secondary data were used, from which a sample of 2832 observations of 21 variables was sourced from the World Health Organization (WHO) and the United Nations (UN) databases. The data covered 193 UN member states for the years 2000-2015, with the LE health-related factors drawn from the Global Health Observatory data repository.
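The MAE and RMSE figures quoted above are both simple functions of the residuals; the sketch below computes them on four mock life-expectancy predictions (the values are illustrative, not from the study):

```python
import numpy as np

y_true = np.array([72.0, 65.0, 80.0, 58.0])   # mock observed LE (years)
y_pred = np.array([70.5, 66.0, 78.0, 60.0])   # mock model predictions

mae = np.mean(np.abs(y_true - y_pred))             # mean absolute error
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))    # root mean squared error
print(mae, rmse)
```

RMSE weights large residuals more heavily than MAE, which is why the two are usually reported together, as in this abstract.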

  14. Data_Sheet_2_Machine Learning Prediction Models for Mechanically Ventilated Patients: Analyses of the MIMIC-III Database.PDF

    • datasetcatalog.nlm.nih.gov
    • frontiersin.figshare.com
    Updated Jul 1, 2021
    + more versions
    Cite
    Yao, Renqi; Li, Lin; Du, Bin; Li, Yang; Zhang, Jin; Wang, Guowei; Chen, Yan; Li, Wei; Chen, Ge; Xi, Xiuming; Jin, Xin; Liu, Shi; Ren, Chao; Huang, Huibin; Guo, Junyang; Guo, Qianqian; Yu, Qian; Zhu, Yibing; Zheng, Hua (2021). Data_Sheet_2_Machine Learning Prediction Models for Mechanically Ventilated Patients: Analyses of the MIMIC-III Database.PDF [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000884346
    Explore at:
    Dataset updated
    Jul 1, 2021
    Authors
    Yao, Renqi; Li, Lin; Du, Bin; Li, Yang; Zhang, Jin; Wang, Guowei; Chen, Yan; Li, Wei; Chen, Ge; Xi, Xiuming; Jin, Xin; Liu, Shi; Ren, Chao; Huang, Huibin; Guo, Junyang; Guo, Qianqian; Yu, Qian; Zhu, Yibing; Zheng, Hua
    Description

    Background: Mechanically ventilated patients in the intensive care unit (ICU) have high mortality rates. There are multiple prediction scores, such as the Simplified Acute Physiology Score II (SAPS II), Oxford Acute Severity of Illness Score (OASIS), and Sequential Organ Failure Assessment (SOFA), widely used in the general ICU population. We aimed to establish prediction scores for mechanically ventilated patients by combining these disease severity scores with other features available on the first day of admission.

    Methods: A retrospective administrative database study from the Medical Information Mart for Intensive Care (MIMIC-III) database was conducted. The exposures of interest consisted of the demographics, pre-ICU comorbidity, ICU diagnosis, disease severity scores, vital signs, and laboratory test results on the first day of ICU admission. Hospital mortality was used as the outcome. We used the machine learning methods of k-nearest neighbors (KNN), logistic regression, bagging, decision tree, random forest, Extreme Gradient Boosting (XGBoost), and neural network for model establishment. A sample of 70% of the cohort was used for the training set; the remaining 30% was applied for testing. Areas under the receiver operating characteristic curves (AUCs) and calibration plots were constructed for the evaluation and comparison of the models' performance. The significance of the risk factors was identified through the models and the top factors were reported.

    Results: A total of 28,530 subjects were enrolled through the screening of the MIMIC-III database. After data preprocessing, 25,659 adult patients with 66 predictors were included in the model analyses. With the training set, the models of KNN, logistic regression, decision tree, random forest, neural network, bagging, and XGBoost were established, and on the testing set they obtained AUCs of 0.806, 0.818, 0.743, 0.819, 0.780, 0.803, and 0.821, respectively. The calibration curves of all the models, except for the neural network, performed well. The XGBoost model performed best among the seven models. The top five predictors were age, respiratory dysfunction, SAPS II score, maximum hemoglobin, and minimum lactate.

    Conclusion: The current study indicates that models based on risk factors from the first day of admission can be successfully established for predicting mortality in ventilated patients. The XGBoost model performs best among the seven machine learning models.
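The AUC comparison underlying those results can be sketched with sklearn's roc_auc_score; the labels and the two model score vectors below are mock values, not the study's predictions:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 1, 0]                   # mock hold-out mortality labels
p_model_a = [0.1, 0.3, 0.8, 0.7, 0.9, 0.2]   # mock scores, e.g. a boosted model
p_model_b = [0.4, 0.3, 0.6, 0.2, 0.7, 0.5]   # mock scores, e.g. a decision tree

# Higher AUC = better ranking of positives above negatives.
print(roc_auc_score(y_true, p_model_a), roc_auc_score(y_true, p_model_b))
```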

  15. Table_1_Automatic text classification of drug-induced liver injury using document-term matrix and XGBoost.pdf

    • datasetcatalog.nlm.nih.gov
    Updated Jun 3, 2024
    Cite
    Bao, Wenjun; Chen, Minjun; Thakkar, Shraddha; Wu, Yue; Tong, Weida; Wingerd, Byron; Liu, Zhichao; Mann, Nicholas; Wolfinger, Russell D.; Donnelly, Tom; Xu, Joshua; Pedersen, Thomas J. (2024). Table_1_Automatic text classification of drug-induced liver injury using document-term matrix and XGBoost.pdf [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001354925
    Explore at:
    Dataset updated
    Jun 3, 2024
    Authors
    Bao, Wenjun; Chen, Minjun; Thakkar, Shraddha; Wu, Yue; Tong, Weida; Wingerd, Byron; Liu, Zhichao; Mann, Nicholas; Wolfinger, Russell D.; Donnelly, Tom; Xu, Joshua; Pedersen, Thomas J.
    Description

    Introduction: Regulatory agencies generate a vast amount of textual data in the review process. For example, drug labeling serves as a valuable resource for regulatory agencies, such as the U.S. Food and Drug Administration (FDA) and the European Medicines Agency (EMA), to communicate drug safety and effectiveness information to healthcare professionals and patients. Drug labeling also serves as a resource for pharmacovigilance and drug safety research. Automated text classification would significantly improve the analysis of drug labeling documents and conserve reviewer resources.

    Methods: We utilized artificial intelligence in this study to classify drug-induced liver injury (DILI)-related content from drug labeling documents based on FDA's DILIrank dataset. We employed text mining and XGBoost models and utilized the Preferred Terms of medical queries for adverse event standards to simplify the elimination of common words and phrases while retaining medical standard terms for the FDA and EMA drug label datasets. We then constructed a document-term matrix using weights computed by Term Frequency-Inverse Document Frequency (TF-IDF) for each included word/term/token.

    Results: The automatic text classification model exhibited robust performance in predicting DILI, achieving cross-validation AUC scores exceeding 0.90 both for drug labels from the FDA and EMA and for literature abstracts from the Critical Assessment of Massive Data Analysis (CAMDA).

    Discussion: Moreover, the text mining and XGBoost functions demonstrated in this study can be applied to other text processing and classification tasks.
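The pipeline described (TF-IDF document-term matrix fed to a boosted-tree classifier) can be sketched as below; the four documents and their DILI labels are mock examples, and sklearn's GradientBoostingClassifier stands in for XGBoost:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier

docs = ["hepatotoxicity observed in trials",
        "no liver findings reported",
        "elevated transaminases and jaundice",
        "well tolerated with no adverse liver events"]
labels = [1, 0, 1, 0]   # mock DILI / non-DILI labels

# TF-IDF weights build the document-term matrix described in the abstract.
X = TfidfVectorizer().fit_transform(docs)
clf = GradientBoostingClassifier(random_state=0).fit(X.toarray(), labels)
print(clf.score(X.toarray(), labels))
```

The real study additionally filters the vocabulary down to standardized medical terms before building the matrix.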

  16. Data from: Assessing individual genetic susceptibility to metabolic syndrome: interpretable machine learning method

    • datasetcatalog.nlm.nih.gov
    • tandf.figshare.com
    Updated Jun 22, 2025
    Cite
    Huang, Tao; Zheng, Xiujuan; Huang, Xirui; Xiong, Wenhui; Yang, Menghan; Wang, Simin; Li, Yuanyuan; Gao, Bizhen; Qiao, Shijie (2025). Assessing individual genetic susceptibility to metabolic syndrome: interpretable machine learning method [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0002050385
    Explore at:
    Dataset updated
    Jun 22, 2025
    Authors
    Huang, Tao; Zheng, Xiujuan; Huang, Xirui; Xiong, Wenhui; Yang, Menghan; Wang, Simin; Li, Yuanyuan; Gao, Bizhen; Qiao, Shijie
    Description

    Genome-wide association studies have provided profound insights into the genetic aetiology of metabolic syndrome (MetS). However, there is a lack of machine-learning (ML)-based predictive models to assess individual genetic susceptibility to MetS. This study utilized single-nucleotide polymorphisms (SNPs) as variables and employed ML-based genetic risk score (GRS) models to predict the occurrence of MetS, bringing it closer to clinical application. Feature selection was performed using the Least Absolute Shrinkage and Selection Operator. Six ML algorithms were employed to construct GRS models. Fivefold cross-validation was utilized to aid in the internal validation of the models. The receiver operating characteristic (ROC) curve was used to select the better-performing GRS model. SHapley Additive exPlanations (SHAP) was then applied to interpret the model. After extracting the GRS, stratified analyses by BMI, age, and gender were performed. Finally, these conventional risk factors and the GRS were integrated through multivariate logistic regression to establish a combined model. A total of 17 SNPs were selected for analysis. Among the GRS models, the extreme gradient boosting (XGBoost) model demonstrated superior discriminative performance (AUC = 0.837). The XGBoost model's robustness was also validated through fivefold cross-validation (mean ROC-AUC = 0.706). The XGBoost-based SHAP algorithm not only elucidated the global effects of the 17 SNPs across all samples, but also described the interactions between SNPs, providing a visual representation of how SNPs impact the prediction of MetS for an individual. There was a strong correlation between GRS and MetS risk, particularly among young individuals, males, and overweight individuals. Furthermore, the model combining conventional risk factors and GRS exhibited excellent discriminative performance (AUC = 0.962) and outstanding robustness (mean ROC-AUC = 0.959).
    This study established a reliable XGBoost-based GRS model and a GRS prediction platform (https://metabolicsyndromeapps.shinyapps.io/geneticriskscore/) to assess individual genetic susceptibility to MetS. The model is highly interpretable and can provide a personalized reference for determining the necessity of primary prevention measures for MetS. Additionally, there may be interactions between traditional risk factors and the GRS, and integrating both in a comprehensive model is useful for predicting MetS occurrence.

  17. TCOM-N2O: TOMCAT CTM and Occultation Measurements based daily zonal...

    • zenodo.org
    nc, pdf
    Updated Jul 15, 2024
    + more versions
    Cite
    Sandip Dhomse; Sandip Dhomse (2024). TCOM-N2O: TOMCAT CTM and Occultation Measurements based daily zonal stratospheric nitrous oxide profile dataset [1991-2021] constructed using machine-learning [Dataset]. http://doi.org/10.5281/zenodo.7386001
    Explore at:
    nc, pdfAvailable download formats
    Dataset updated
    Jul 15, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Sandip Dhomse; Sandip Dhomse
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Methodology: The TOMCAT simulation is performed at T64L32 resolution, similar to the setup used in Dhomse et al. (2021, 2022), for the 1991-2021 period. Model profiles are sampled at the ACE-FTS (2004-present) measurement collocations, so that model output is taken at the nearest latitude/longitude and time. The collocated N2O profiles are then divided into five latitude bins: SH polar (90S-50S), SH mid-lat (70S-20S), tropics (40S-40N), NH mid-lat (20N-70N) and NH polar (50N-90N). Corrections in the overlapping latitudes are averaged to ensure that the mean correction terms do not have sharp edges.

    Initially, differences are calculated for each zonal bin at 51 height levels (10 km to 60 km). Separate XGBoost regression models are then trained on the N2O differences between TOMCAT and the measurements at each level for a given latitude bin. The same models are applied to all day- and night-time TOMCAT output (2 × 11323 days), sampled at 1.30 am and 1.30 pm local time at the equator. Bias corrections for a given model grid point are calculated with XGBoost and added to the original TOMCAT day- and night-time profiles. The height-resolved data are then interpolated onto 28 pressure levels (300-0.1 hPa). For overlapping latitude bins, we average the corrections and then calculate daily zonal-mean values. For more details, see the attached presentation.

    The dataset also includes two files containing daily zonal-mean N2O profiles on height (15-60 km) and pressure (300-0.1 hPa) levels:

    zmn2o_TCOM_hlev_T2Dz_1991_2021.nc – height level data (15 to 60 km)

    zmn2o_TCOM_plev_T2Dz_1991_2021.nc – pressure level data (300 to 0.1 hPa)

    Note that there is no observational constraint for the 1991-2003 period; the correction terms therefore assume that there are no significant discontinuities in the ERA5 reanalysis fields used to drive TOMCAT transport.

    Dhomse_TCOM-N2O.pdf provides a brief description of the dataset.

  18. Lifestyle and Health Risk Prediction

    • kaggle.com
    zip
    Updated Oct 19, 2025
    Cite
    Arif Miah (2025). Lifestyle and Health Risk Prediction [Dataset]. https://www.kaggle.com/datasets/miadul/lifestyle-and-health-risk-prediction
    Explore at:
    zip(61139 bytes)Available download formats
    Dataset updated
    Oct 19, 2025
    Authors
    Arif Miah
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    📘 Description:

    This synthetic health dataset simulates real-world lifestyle and wellness data for individuals. It is designed to help data scientists, machine learning engineers, and students build and test health risk prediction models safely — without using sensitive medical data.

    The dataset includes features such as age, weight, height, exercise habits, sleep hours, sugar intake, smoking, alcohol consumption, marital status, and profession, along with a synthetic health_risk label generated using a heuristic rule-based algorithm that mimics realistic risk behavior patterns.

    🧾 Columns Description:

    • age: Age of the person in years (Numeric, e.g. 35)
    • weight: Body weight in kilograms (Numeric, e.g. 70)
    • height: Height in centimeters (Numeric, e.g. 172)
    • exercise: Exercise frequency level (Categorical: none, low, medium, high; e.g. medium)
    • sleep: Average hours of sleep per night (Numeric, e.g. 7)
    • sugar_intake: Level of sugar consumption (Categorical: low, medium, high; e.g. high)
    • smoking: Smoking habit (Categorical: yes, no; e.g. no)
    • alcohol: Alcohol consumption habit (Categorical: yes, no; e.g. yes)
    • married: Marital status (Categorical: yes, no; e.g. yes)
    • profession: Type of work or profession (Categorical: office_worker, teacher, doctor, engineer, etc.; e.g. teacher)
    • bmi: Body Mass Index calculated as weight / (height²) (Numeric, e.g. 24.5)
    • health_risk: Target label showing overall health risk (Categorical: low, high; e.g. high)
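    The bmi column and a heuristic health_risk label of this kind can be reproduced in a few lines. The risk rules below are hypothetical (the dataset's actual rule set is not published); they are shown only to illustrate the rule-based labelling approach.

```python
def bmi(weight_kg: float, height_cm: float) -> float:
    """BMI = weight / height^2, with height converted from cm to metres."""
    h = height_cm / 100.0
    return round(weight_kg / (h * h), 1)

def health_risk(age, weight_kg, height_cm, smoking, sleep_hours):
    """Hypothetical rule-based label in the spirit of the dataset's
    heuristic generator: accumulate risk points, then threshold."""
    score = 0
    if bmi(weight_kg, height_cm) >= 30:
        score += 2  # obese BMI
    if smoking == "yes":
        score += 2
    if sleep_hours < 6:
        score += 1
    if age >= 55:
        score += 1
    return "high" if score >= 3 else "low"

print(bmi(70, 172))                       # 23.7
print(health_risk(35, 70, 172, "no", 7))  # low
```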

    🧩 Use Cases:

    1. Health Risk Prediction: Train classification models (Logistic Regression, RandomForest, XGBoost, CatBoost) to predict health risk (low / high).

    2. Feature Importance Analysis: Identify which lifestyle factors most influence health risk.

    3. Data Preprocessing & EDA Practice: Use this dataset for data cleaning, encoding, and visualization practice.

    4. Model Explainability Projects: Use SHAP or LIME to explain how different lifestyle habits affect predictions.

    5. Streamlit or Flask Web App Development: Build a real-time web app that predicts health risk from user input.

    💡 Case Study Example:

    Imagine you are a data scientist building a Health Risk Prediction App for a wellness startup. You want to analyze how exercise, sleep, and sugar intake affect overall health risk. This dataset helps you simulate those relationships without handling sensitive medical data.

    You could:

    • Perform EDA to find correlations between age, BMI, and health risk.
    • Train a model using Random Forest to predict health_risk.
    • Deploy a Streamlit app where users can input their lifestyle information and get a risk score instantly.
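    A minimal version of the model-training step on this mixed numeric/categorical schema might look like the following sketch. The frame is generated in-line with an invented labelling rule rather than loaded from the Kaggle file, and only a subset of the columns is used.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 1000
# Small synthetic frame mimicking the dataset's mixed schema
# (column names from the table above; values invented here).
df = pd.DataFrame({
    "age": rng.integers(18, 80, n),
    "bmi": rng.normal(25, 4, n).round(1),
    "smoking": rng.choice(["yes", "no"], n),
    "exercise": rng.choice(["none", "low", "medium", "high"], n),
})
df["health_risk"] = np.where(
    (df["bmi"] > 30) | (df["smoking"] == "yes"), "high", "low")

# One-hot encode the categorical columns, pass numerics through unchanged.
pre = ColumnTransformer(
    [("cat", OneHotEncoder(), ["smoking", "exercise"])],
    remainder="passthrough")
clf = Pipeline([("prep", pre), ("rf", RandomForestClassifier(random_state=0))])

X_tr, X_te, y_tr, y_te = train_test_split(
    df.drop(columns="health_risk"), df["health_risk"], random_state=0)
acc = clf.fit(X_tr, y_tr).score(X_te, y_te)
print(f"test accuracy = {acc:.2f}")
```

    Wrapping the encoder and the forest in one Pipeline means the same preprocessing is applied at prediction time, which is what a Streamlit or Flask front end would call on user input.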

    ⚙️ Technical Information:

    • Rows: 5,000 (adjustable, you can create more)
    • Columns: 12
    • Target variable: health_risk
    • Data type: Mixed (Numeric + Categorical)
    • Source: Fully synthetic, generated using Python (NumPy, Faker)

    📈 License:

    CC0: Public Domain. You are free to use this dataset for research, learning, or commercial projects.

    🌍 Author:

    Created by Arif Miah (Machine Learning Engineer | Kaggle Expert | Data Scientist). 📧 arifmiahcse@gmail.com

  19. Wine Quality Classification

    • kaggle.com
    zip
    Updated Apr 1, 2025
    Cite
    🇹🇷 Şahide Şeker, MSc (2025). Wine Quality Classification [Dataset]. https://www.kaggle.com/datasets/sahideseker/wine-quality-classification/data
    Explore at:
    zip(7606 bytes)Available download formats
    Dataset updated
    Apr 1, 2025
    Authors
    🇹🇷 Şahide Şeker, MSc
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    🇬🇧 English:

    This synthetic dataset is designed for classification tasks involving wine quality. It includes 1,000 samples with key chemical attributes such as acidity, sugar level, alcohol content, and density. Each sample is labeled with a wine quality class: low, medium, or high.

    Use this dataset to:

    • Train classification models like SVM, XGBoost, and Logistic Regression
    • Explore the impact of chemical features on wine quality
    • Practice ML tasks in a food and beverage context

    Features:

    • fixed_acidity: Level of fixed acidity
    • residual_sugar: Sugar level after fermentation
    • alcohol: Alcohol content (%)
    • density: Liquid density
    • quality_label: Wine quality class (low / medium / high)
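    A quick baseline for the classification task, assuming the four numeric features above: SVMs are scale-sensitive, so the sketch standardises the features before fitting. The synthetic data and the alcohol-driven labelling rule are invented for illustration, not taken from the actual file.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 1000
# Synthetic stand-in with the dataset's four numeric features.
X = np.column_stack([
    rng.normal(7, 1.5, n),        # fixed_acidity
    rng.normal(5, 2, n),          # residual_sugar
    rng.normal(11, 1.5, n),       # alcohol
    rng.normal(0.996, 0.002, n),  # density
])
# Three-class label (0=low, 1=medium, 2=high) driven by alcohol
# thresholds; a hypothetical rule for demonstration only.
y = np.digitize(X[:, 2], [10.0, 12.0])

# Standardise, then fit an RBF SVM; score with five-fold cross-validation.
clf = make_pipeline(StandardScaler(), SVC())
scores = cross_val_score(clf, X, y, cv=5)
print(f"mean CV accuracy = {scores.mean():.2f}")
```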

    🇹🇷 Türkçe:

    This synthetic dataset was designed for machine-learning applications that classify wine quality. It contains 1,000 samples with chemical features such as acidity, sugar content, alcohol percentage and density, and each sample is labelled as low, medium or high quality.

    With this dataset you can:

    • Apply classification algorithms such as SVM, XGBoost and Logistic Regression
    • Analyse the effect of the chemical features on wine quality
    • Develop ML applications for the food and beverage sector

    Variables:

    • fixed_acidity: Fixed acidity level
    • residual_sugar: Sugar remaining after fermentation
    • alcohol: Alcohol content (%)
    • density: Density value
    • quality_label: Quality class (low / medium / high)

  20. Data from: Predicting metallicities and carbon abundances from Gaia XP...

    • zenodo.org
    csv
    Updated Jan 18, 2025
    Cite
    Anke Ardern-Arentsen; Anke Ardern-Arentsen (2025). Predicting metallicities and carbon abundances from Gaia XP spectra for (carbon-enhanced) metal-poor stars [Dataset]. http://doi.org/10.5281/zenodo.14651678
    Explore at:
    csvAvailable download formats
    Dataset updated
    Jan 18, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Anke Ardern-Arentsen; Anke Ardern-Arentsen
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset described in Ardern-Arentsen et al. (2025), "Predicting metallicities and carbon abundances from Gaia XP spectra for (carbon-enhanced) metal-poor stars". There are three tables:

    • reference: spectroscopic parameters used to train and test the neural network plus predictions
    • L23: predictions for Lucey et al. (2023), XGBoost C-rich candidates from XP
    • A23: predictions for the "vetted RGB sample" (those with radial velocities only) from Andrae et al. (2023), XGBoost metallicities from XP

    All three tables are as described in the paper, with the data models presented in the appendix.
