61 datasets found
  1. jars xgboost example files

    • kaggle.com
    zip
    Updated Apr 16, 2022
    Cite
    Edwin Hauwert M.Sc. (2022). jars xgboost example files [Dataset]. https://www.kaggle.com/develuse/jars-xgboost-old
    Explore at:
    zip (615742866 bytes)
    Available download formats
    Dataset updated
    Apr 16, 2022
    Authors
    Edwin Hauwert M.Sc.
    Description

    Dataset

    This dataset was created by Edwin Hauwert M.Sc.

    Contents

  2. Hyperparameters for the XGBoost model.

    • plos.figshare.com
    xls
    Updated Nov 27, 2024
    Cite
    Hoa Thi Trinh; Tuan Anh Pham; Vu Dinh Tho; Duy Hung Nguyen (2024). Hyperparameters for the XGBoost model. [Dataset]. http://doi.org/10.1371/journal.pone.0312531.t002
    Explore at:
    xls
    Available download formats
    Dataset updated
    Nov 27, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Hoa Thi Trinh; Tuan Anh Pham; Vu Dinh Tho; Duy Hung Nguyen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Structurally, the lateral load-bearing capacity mainly depends on reinforced concrete (RC) walls. Determination of flexural strength and shear strength is mandatory when designing reinforced concrete walls. Typically, these strengths are determined through theoretical formulas and verified experimentally. However, theoretical formulas often have large errors and testing is costly and time-consuming. Therefore, this study exploits machine learning techniques, specifically the hybrid XGBoost model combined with optimization algorithms, to predict the shear strength of RC walls based on model training from available experimental results. The study used the largest database of RC walls to date, consisting of 1057 samples with various cross-sectional shapes. Bayesian optimization (BO) algorithms, including BO—Gaussian Process, BO—Random Forest, and Random Search methods, were used to refine the XGBoost model architecture. The results show that Gaussian Process emerged as the most efficient solution compared to other optimization algorithms, providing the lowest Mean Square Error and achieving a prediction R2 of 0.998 for the training set, 0.972 for the validation set and 0.984 for the test set, while BO—Random Forest and Random Search performed as well on the training and test sets as Gaussian Process but significantly worse on the validation set, specifically R2 on the validation set of BO—Random Forest and Random Search were 0.970 and 0.969 respectively over the entire dataset including all cross-sectional shapes of the RC wall. SHAP (Shapley Additive Explanations) technique was used to clarify the predictive ability of the model and the importance of input variables. Furthermore, the performance of the model was validated through comparative analysis with benchmark models and current standards. Notably, the coefficient of variation (COV %) of the XGBoost model is 13.27%, while traditional models often have COV % exceeding 50%.
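As a rough illustration of the hyperparameter-refinement step described above, here is a minimal sketch. It substitutes scikit-learn's GradientBoostingRegressor and RandomizedSearchCV for the study's XGBoost model and Bayesian optimizers, and the data and search grids are assumptions, not the paper's settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Toy stand-in for the RC-wall data: 300 samples, 8 numeric features.
X, y = make_regression(n_samples=300, n_features=8, n_informative=8,
                       noise=10.0, random_state=0)

# Hypothetical search space; the study's actual grids are not given here.
param_dist = {
    "n_estimators": [100, 200, 400],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.03, 0.1, 0.3],
}
search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions=param_dist,
    n_iter=8, cv=3, scoring="r2", random_state=0,
)
search.fit(X, y)  # best_params_ then holds the refined configuration
```

In the study, Bayesian optimization (Gaussian Process or Random Forest surrogates) replaces this random sampling, steering each trial toward promising regions of the search space.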

  3. MetaCost XGBoost Training and Evaluation Dataset with MATLAB Codes and...

    • zenodo.org
    csv, zip
    Updated Jun 18, 2025
    Cite
    Abhishek Borah; Xavier Emery; Parag Jyoti Dutta (2025). MetaCost XGBoost Training and Evaluation Dataset with MATLAB Codes and files for generating proxies [Dataset]. http://doi.org/10.5281/zenodo.15666484
    Explore at:
    csv, zip
    Available download formats
    Dataset updated
    Jun 18, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Abhishek Borah; Xavier Emery; Parag Jyoti Dutta
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jun 14, 2025
    Description

    The dataset consists of two curated subsets designed for the classification of alteration types using geochemical and proxy variables. The traditional dataset (Trad_Train.csv and Trad_Test.csv) is derived directly from the original complete geochemical dataset (alldata.csv) without any missing values and includes the original geochemical features, serving as a baseline for model training and evaluation. In contrast, the simulated dataset (proxies_alldata.csv) was generated through custom MATLAB scripts that transform the original geochemical features into proxy variables based on multiple geostatistical realizations. These proxies, expressed on a Gaussian scale, may include negative values due to normalization. The target variable, Alteration, was originally encoded as integers using the mapping: 1 = AAA, 2 = IAA, 3 = PHY, 4 = PRO, 5 = PTS, and 6 = UAL. The simulated proxy data was split into simulated train and test files (Simu_Train.csv and Simu_Test.csv) based on an encoded flag marking training (=1) and testing (=2) rows. All supporting files (datasets, intermediate outputs such as PNGs and variograms, proxy outputs, and an executable for the confidence analysis routines) are included in the repository except the source code, which is hosted in a GitHub repository. Specifically, the FinalMatlabFiles.zip archive contains the raw input file alldata.csv used to generate proxies_alldata.csv, as well as Analysis1.csv and Analysis2.csv for performing confidence analysis. To run the executable files in place of the .m scripts in MATLAB, users must install the MATLAB Runtime 2023b for Windows 64-bit, available at: https://ssd.mathworks.com/supportfiles/downloads/R2023b/Release/10/deployment_files/installer/complete/win64/MATLAB_Runtime_R2023b_Update_10_win64.zip.

    Details on the input files for confidence analysis (Analysis1.csv and Analysis2.csv):
    These files contain two columns for the test data: column 1 indicates whether the predicted and true alterations match; column 2 gives the probability of a correct classification, according to bootstrapped samples (Analysis1.csv) or to simulated proxies (Analysis2.csv).
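
The two-column layout described above can be summarized with a few lines of pandas. This is a minimal sketch on hypothetical miniature values, not the actual contents of Analysis1.csv.

```python
import io
import pandas as pd

# Hypothetical stand-in for Analysis1.csv: column 1 = 1 if the predicted
# alteration matched the true one, column 2 = estimated probability of a
# correct classification.
csv_text = """match,prob_correct
1,0.95
1,0.80
0,0.40
1,0.70
0,0.55
"""
df = pd.read_csv(io.StringIO(csv_text))

# Overall accuracy on the test rows, and mean stated confidence split by
# whether the prediction was actually correct.
accuracy = df["match"].mean()
mean_conf = df.groupby("match")["prob_correct"].mean()
```

If the model's confidence estimates are informative, the mean probability for correct rows should exceed that for incorrect rows.
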
  4. Fraud Detection Transactions Dataset

    • kaggle.com
    zip
    Updated Feb 21, 2025
    Cite
    Samay Ashar (2025). Fraud Detection Transactions Dataset [Dataset]. https://www.kaggle.com/datasets/samayashar/fraud-detection-transactions-dataset
    Explore at:
    zip (2104444 bytes)
    Available download formats
    Dataset updated
    Feb 21, 2025
    Authors
    Samay Ashar
    License

    CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Description

    This dataset is designed to help data scientists and machine learning enthusiasts develop robust fraud detection models. It contains realistic synthetic transaction data, including user information, transaction types, risk scores, and more, making it ideal for binary classification tasks with models like XGBoost and LightGBM.

    📌 Key Features

    1. 21 features capturing various aspects of a financial transaction
    2. Realistic structure with numerical, categorical, and temporal data
    3. Binary fraud labels (0 = Not Fraud, 1 = Fraud)
    4. Designed for high accuracy with XGBoost and other ML models
    5. Useful for anomaly detection, risk analysis, and security research

    📌 Columns in the Dataset

    Transaction_ID: Unique identifier for each transaction
    User_ID: Unique identifier for the user
    Transaction_Amount: Amount of money involved in the transaction
    Transaction_Type: Type of transaction (Online, In-Store, ATM, etc.)
    Timestamp: Date and time of the transaction
    Account_Balance: User's current account balance before the transaction
    Device_Type: Type of device used (Mobile, Desktop, etc.)
    Location: Geographical location of the transaction
    Merchant_Category: Type of merchant (Retail, Food, Travel, etc.)
    IP_Address_Flag: Whether the IP address was flagged as suspicious (0 or 1)
    Previous_Fraudulent_Activity: Number of past fraudulent activities by the user
    Daily_Transaction_Count: Number of transactions made by the user that day
    Avg_Transaction_Amount_7d: User's average transaction amount in the past 7 days
    Failed_Transaction_Count_7d: Count of failed transactions in the past 7 days
    Card_Type: Type of payment card used (Credit, Debit, Prepaid, etc.)
    Card_Age: Age of the card in months
    Transaction_Distance: Distance between the user's usual location and the transaction location
    Authentication_Method: How the user authenticated (PIN, Biometric, etc.)
    Risk_Score: Fraud risk score computed for the transaction
    Is_Weekend: Whether the transaction occurred on a weekend (0 or 1)
    Fraud_Label: Target variable (0 = Not Fraud, 1 = Fraud)

    📌 Potential Use Cases

    1. Fraud detection model training
    2. Anomaly detection in financial transactions
    3. Risk scoring systems for banks and fintech companies
    4. Feature engineering and model explainability research
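
A minimal modeling sketch for a dataset with this column mix (numeric, categorical, and binary features plus a binary label) might look as follows. The data, labeling rule, and use of scikit-learn's GradientBoostingClassifier in place of XGBoost are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 400
# Tiny synthetic stand-in using a few of the listed columns.
df = pd.DataFrame({
    "Transaction_Amount": rng.exponential(100.0, n),
    "Transaction_Type": rng.choice(["Online", "In-Store", "ATM"], n),
    "Is_Weekend": rng.integers(0, 2, n),
})
# Illustrative labeling rule only: flag unusually large transactions.
df["Fraud_Label"] = (df["Transaction_Amount"] > 200).astype(int)

# One-hot encode the categorical column, then fit a boosted-tree classifier
# (GradientBoostingClassifier stands in for XGBoost/LightGBM here).
X = pd.get_dummies(df.drop(columns="Fraud_Label"), columns=["Transaction_Type"])
y = df["Fraud_Label"]
clf = GradientBoostingClassifier(random_state=0).fit(X, y)
train_acc = clf.score(X, y)
```

With the real dataset, the same pattern applies after encoding Timestamp into temporal features and handling the remaining categorical columns.
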
  5. Table_1_Five-Feature Model for Developing the Classifier for Synergistic vs....

    • frontiersin.figshare.com
    • datasetcatalog.nlm.nih.gov
    xlsx
    Updated Jun 13, 2023
    + more versions
    Cite
    Xiangjun Ji; Weida Tong; Zhichao Liu; Tieliu Shi (2023). Table_1_Five-Feature Model for Developing the Classifier for Synergistic vs. Antagonistic Drug Combinations Built by XGBoost.XLSX [Dataset]. http://doi.org/10.3389/fgene.2019.00600.s002
    Explore at:
    xlsx
    Available download formats
    Dataset updated
    Jun 13, 2023
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Xiangjun Ji; Weida Tong; Zhichao Liu; Tieliu Shi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Combinatorial drug therapy can improve the therapeutic effect and reduce the corresponding adverse events. In silico strategies to classify synergistic vs. antagonistic drug pairs is more efficient than experimental strategies. However, most of the developed methods have been applied only to cancer therapies. In this study, we introduce a novel method, XGBoost, based on five features of drugs and biomolecular networks of their targets, to classify synergistic vs. antagonistic drug combinations from different drug categories. We found that XGBoost outperformed other classifiers in both stratified fivefold cross-validation (CV) and independent validation. For example, XGBoost achieved higher predictive accuracy than other models (0.86, 0.78, 0.78, and 0.83 for XGBoost, logistic regression, naïve Bayesian, and random forest, respectively) for an independent validation set. We also found that the five-feature XGBoost model is much more effective at predicting combinatorial therapies that have synergistic effects than those with antagonistic effects. The five-feature XGBoost model was also validated on TCGA data with accuracy of 0.79 among the 61 tested drug pairs, which is comparable to that of DeepSynergy. Among the 14 main anatomical/pharmacological groups classified according to WHO Anatomic Therapeutic Class, for drugs belonging to five groups, their prediction accuracy was significantly increased (odds ratio < 1) or reduced (odds ratio > 1) (Fisher’s exact test, p < 0.05). This study concludes that our five-feature XGBoost model has significant benefits for classifying synergistic vs. antagonistic drug combinations.
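The stratified fivefold cross-validation mentioned above keeps the synergistic/antagonistic class ratio constant across folds. A minimal sketch with scikit-learn, on toy labels chosen for illustration:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: 10 "synergistic" (1) and 5 "antagonistic" (0) drug pairs.
y = np.array([1] * 10 + [0] * 5)
X = np.zeros((15, 1))  # placeholder features

# Stratified 5-fold CV: each fold gets 2 positives and 1 negative,
# so the class ratio (2/3) is identical in every validation fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_ratios = [y[val].mean() for _, val in skf.split(X, y)]
```

This matters for imbalanced synergy data: a plain KFold split could leave some folds with no antagonistic pairs at all.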

  6. Internal evaluation of the XGBoost model on different datasets and...

    • plos.figshare.com
    xls
    Updated Feb 4, 2025
    Cite
    Ali Safdari; Chanda Sai Keshav; Deepanshu Mody; Kshitij Verma; Utsav Kaushal; Vaadeendra Kumar Burra; Sibnath Ray; Debashree Bandyopadhyay (2025). Internal evaluation of the XGBoost model on different datasets and comparison with published datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0316467.t002
    Explore at:
    xls
    Available download formats
    Dataset updated
    Feb 4, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Ali Safdari; Chanda Sai Keshav; Deepanshu Mody; Kshitij Verma; Utsav Kaushal; Vaadeendra Kumar Burra; Sibnath Ray; Debashree Bandyopadhyay
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Internal evaluation of the XGBoost model on different datasets and comparison with published datasets.

  7. Data from: A Deep Learning and XGBoost-based Method for Predicting...

    • narcis.nl
    • data.mendeley.com
    Updated Aug 3, 2021
    + more versions
    Cite
    wang, P (via Mendeley Data) (2021). A Deep Learning and XGBoost-based Method for Predicting Protein-protein Interaction Sites [Dataset]. http://doi.org/10.17632/9tft3vz5tm.2
    Explore at:
    Dataset updated
    Aug 3, 2021
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    wang, P (via Mendeley Data)
    Description

    local_feature_training_set.csv: preprocessed feature-extractor data with 65869 rows and 344 columns; each row is a sample, the first 343 columns are features, and the last column is the label.

    local_feature_testing_set.csv: preprocessed feature-extractor data with 11791 rows and 344 columns; each row is a sample, the first 343 columns are features, and the last column is the label.

    global&local_feature_training_set.csv: preprocessed feature-extractor data with 65869 rows and 1028 columns; each row is a sample, the first 1027 columns are features, and the last column is the label.

    global&local_feature_testing_set.csv: preprocessed feature-extractor data with 11791 rows and 1028 columns; each row is a sample, the first 1027 columns are features, and the last column is the label.
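Files in this features-then-label layout split cleanly by column position. A minimal sketch on a hypothetical miniature frame standing in for local_feature_training_set.csv:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in: every column except the last holds a
# feature, the last column holds the 0/1 interaction-site label.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(10, 5)))
data["label"] = rng.integers(0, 2, 10)

X = data.iloc[:, :-1].to_numpy()   # feature columns
y = data.iloc[:, -1].to_numpy()    # label column
```

For the real files, `pd.read_csv("local_feature_training_set.csv")` followed by the same positional slicing yields a (65869, 343) feature matrix and a 65869-element label vector.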

  8. Tox24 challenge data

    • kaggle.com
    zip
    Updated Sep 18, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Antonina Dolgorukova (2024). Tox24 challenge data [Dataset]. https://www.kaggle.com/datasets/antoninadolgorukova/tox24-challenge-data/suggestions
    Explore at:
    zip (19160575 bytes)
    Available download formats
    Dataset updated
    Sep 18, 2024
    Authors
    Antonina Dolgorukova
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset and associated notebooks were created to solve the Tox24 Challenge and provide real-world data and a working example of how machine learning can be used to predict binding activity to a target protein such as transthyretin (TTR), from retrieving and preprocessing SMILES to ensembling the obtained predictions.

    SMILES: The file all_smiles_data.csv contains various SMILES strings for the 1512 competition chemicals (retrieved from PubChem, cleaned, and with SMILES containing isolated atoms removed), generated in this notebook. Also, here I evaluated the performance of XGBoost using molecular descriptors computed from different SMILES representations of the chemicals.

    FEATURES: The 'features' folder contains features calculated with OCHEM and those computed in the feature engineering notebook. All feature sets were evaluated with XGBoost here.

    Feature selection notebooks: - https://www.kaggle.com/code/antoninadolgorukova/tox24-feature-selection-for-xgboost - https://www.kaggle.com/code/antoninadolgorukova/tox24-feature-selection-by-clusters-for-xgboost - https://www.kaggle.com/code/antoninadolgorukova/tox24-feature-selection-by-clusters-for-lightgbm

    MODELS: The 'submits' folder contains the predictions for the 500 test chemicals of the Tox24 Challenge, made with the XGBoost and LightGBM models and used for the final ensemble.

    DATA: The TTR Supplemental Tables are taken from the article that accompanied the Tox24 Challenge, and include:

    • Tables outlining the components of the assay reactions and lists of autofluorescent chemicals,
    • chemicals excluded from the analysis due to interference,
    • and chemicals screened in single concentration and concentration response testing.

    This dataset can be used for drug design research, protein-ligand interaction studies, and machine learning model development focused on chemical binding activities.
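The ensembling step mentioned above typically reduces to blending per-model predictions. A minimal sketch with a weighted mean; the prediction values and weights below are illustrative, not taken from the submissions.

```python
import numpy as np

# Hypothetical held-out predictions from two models (e.g. the XGBoost and
# LightGBM submissions described above), blended by a simple weighted mean.
pred_xgb = np.array([0.1, 0.4, 0.9])
pred_lgb = np.array([0.2, 0.5, 0.7])

weights = (0.6, 0.4)  # assumed weights, not from the source
ensemble = weights[0] * pred_xgb + weights[1] * pred_lgb
```

Weights are usually chosen on a validation set; equal weights (a plain average) are a common, robust default.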

  9. Heart Disease Risk Prediction Dataset

    • kaggle.com
    zip
    Updated Feb 7, 2025
    Cite
    Mahatir Ahmed Tusher (2025). Heart Disease Risk Prediction Dataset [Dataset]. https://www.kaggle.com/datasets/mahatiratusher/heart-disease-risk-prediction-dataset
    Explore at:
    zip (1448235 bytes)
    Available download formats
    Dataset updated
    Feb 7, 2025
    Authors
    Mahatir Ahmed Tusher
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Heart Disease Risk Prediction Dataset

    Overview

    This synthetic dataset is designed to predict the risk of heart disease based on a combination of symptoms, lifestyle factors, and medical history. Each row in the dataset represents a patient, with binary (Yes/No) indicators for symptoms and risk factors, along with a computed risk label indicating whether the patient is at high or low risk of developing heart disease.

    The dataset contains 70,000 samples, making it suitable for training machine learning models for classification tasks. The goal is to provide researchers, data scientists, and healthcare professionals with a clean and structured dataset to explore predictive modeling for cardiovascular health.

    This dataset is a side project of EarlyMed, developed by students of Vellore Institute of Technology (VIT-AP). EarlyMed aims to leverage data science and machine learning for early detection and prevention of chronic diseases.

    Dataset Features

    Input Features

    Symptoms (Binary - Yes/No)

    1. Chest Pain (chest_pain): Presence of chest pain, a common symptom of heart disease.
    2. Shortness of Breath (shortness_of_breath): Difficulty breathing, often associated with heart conditions.
    3. Unexplained Fatigue (fatigue): Persistent tiredness without an obvious cause.
    4. Palpitations (palpitations): Irregular or rapid heartbeat.
    5. Dizziness/Fainting (dizziness): Episodes of lightheadedness or fainting.
    6. Swelling in Legs/Ankles (swelling): Swelling due to fluid retention, often linked to heart failure.
    7. Pain in Arm/Jaw/Neck/Back (radiating_pain): Radiating pain, a hallmark of angina or heart attacks.
    8. Cold Sweats & Nausea (cold_sweats): Symptoms commonly associated with acute cardiac events.

    Risk Factors (Binary - Yes/No or Continuous)

    1. Age (age): Patient's age in years (continuous variable).
    2. High Blood Pressure (hypertension): History of hypertension (Yes/No).
    3. High Cholesterol (cholesterol_high): Elevated cholesterol levels (Yes/No).
    4. Diabetes (diabetes): Diagnosis of diabetes (Yes/No).
    5. Smoking History (smoker): Whether the patient is a smoker (Yes/No).
    6. Obesity (obesity): Obesity status (Yes/No).
    7. Family History of Heart Disease (family_history): Family history of cardiovascular conditions (Yes/No).

    Output Label

    • Heart Disease Risk (risk_label): Binary label indicating the risk of heart disease:
      • 0: Low risk
      • 1: High risk

    Data Generation Process

    This dataset was synthetically generated using Python libraries such as numpy and pandas. The generation process ensured a balanced distribution of high-risk and low-risk cases while maintaining realistic correlations between features. For example:

    • Patients with multiple risk factors (e.g., smoking, hypertension, and diabetes) were more likely to be labeled as high risk.
    • Symptom patterns were modeled after clinical guidelines and research studies on heart disease.
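A generation process of this kind can be sketched in a few lines of numpy/pandas. The columns, sample size, and labeling rule below are illustrative stand-ins, not the dataset's actual generator.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000
# A few of the listed binary symptom/risk columns plus continuous age.
df = pd.DataFrame({
    "age": rng.integers(20, 90, n),
    "hypertension": rng.integers(0, 2, n),
    "diabetes": rng.integers(0, 2, n),
    "smoker": rng.integers(0, 2, n),
    "chest_pain": rng.integers(0, 2, n),
})
# Illustrative labeling rule: high risk when several factors co-occur
# or age is advanced, mirroring the correlations described above.
score = df[["hypertension", "diabetes", "smoker", "chest_pain"]].sum(axis=1)
df["risk_label"] = ((score >= 2) | (df["age"] > 70)).astype(int)
```

The real dataset refines this idea with 70,000 samples, more features, and symptom patterns calibrated to clinical guidelines.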

    Sources of Inspiration

    The design of this dataset was inspired by the following resources:

    Books

    • "Harrison's Principles of Internal Medicine" by J. Larry Jameson et al.: A comprehensive resource on cardiovascular diseases and their symptoms.
    • "Mayo Clinic Cardiology" by Joseph G. Murphy et al.: Provides insights into heart disease risk factors and diagnostic criteria.

    Research Papers

    • Framingham Heart Study: A landmark study identifying key risk factors for cardiovascular disease.
    • American Heart Association (AHA) Guidelines: Recommendations for diagnosing and managing heart disease.

    Existing Datasets

    • UCI Heart Disease Dataset: A widely used dataset for heart disease prediction.
    • Kaggle’s Heart Disease datasets: Various datasets contributed by the community.

    Clinical Guidelines

    • Centers for Disease Control and Prevention (CDC): Information on heart disease symptoms and risk factors.
    • World Health Organization (WHO): Global statistics and risk factor analysis for cardiovascular diseases.

    Applications

    This dataset can be used for a variety of purposes:

    1. Machine Learning Research:

      • Train classification models (e.g., Logistic Regression, Random Forest, XGBoost) to predict heart disease risk.
      • Experiment with feature engineering, model tuning, and evaluation metrics like Accuracy, Precision, Recall, and ROC-AUC.
    2. Healthcare Analytics:

      • Identify key risk factors contributing to heart disease.
      • Develop decision support systems for early detection of cardiovascular risks.
    3. Educational Purposes:

      • Teach students and practitioners about predictive modeling in healthcare.
      • Demonstrate the importance of feature selection...
  10. Additional file 1 of Classification of tumor types using XGBoost machine...

    • springernature.figshare.com
    zip
    Updated Aug 14, 2024
    Cite
    Veronica Zelli; Andrea Manno; Chiara Compagnoni; Rasheed Oyewole Ibraheem; Francesca Zazzeroni; Edoardo Alesse; Fabrizio Rossi; Claudio Arbib; Alessandra Tessitore (2024). Additional file 1 of Classification of tumor types using XGBoost machine learning model: a vector space transformation of genomic alterations [Dataset]. http://doi.org/10.6084/m9.figshare.26643322.v1
    Explore at:
    zip
    Available download formats
    Dataset updated
    Aug 14, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Veronica Zelli; Andrea Manno; Chiara Compagnoni; Rasheed Oyewole Ibraheem; Francesca Zazzeroni; Edoardo Alesse; Fabrizio Rossi; Claudio Arbib; Alessandra Tessitore
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 1: Figure S1. Example of a SPM[t] dataset for a generic tumor type t. Figure S2. Example of a CNV[t] dataset for a generic tumor type t. Figure S3. Pseudocode of the VSM data transformation procedure. Figure S4. Charts showing the size, in terms of total count and percentage, of each random group in the newly created dataset with groups as targets, and confusion matrix showing the performance [accuracy (ACC), balanced accuracy (BACC) and AUC score] of the model; hyperparameters are also reported. Of note, accuracy values obtained from the random grouping experiments reported here were significantly lower than those obtained by performing grouping experiments based on biological criteria and characterized by the same numerical complexity (similar group sizes).

  11. ai-detector-dataset

    • huggingface.co
    Updated Oct 19, 2025
    Cite
    Maaz (2025). ai-detector-dataset [Dataset]. https://huggingface.co/datasets/mhb-maaz/ai-detector-dataset
    Explore at:
    Dataset updated
    Oct 19, 2025
    Authors
    Maaz
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    AI vs Human Code Detection Dataset

    This dataset is designed for binary classification of AI-generated vs. human-written source code. It was used to train and evaluate multiple baseline models, including TF-IDF, XGBoost, and CodeBERT.

    Dataset Overview

    Split  Samples  Human  AI   Format
    Train  500,000  50%    50%  Parquet
    Dev    100,000  50%    50%  Parquet
    Test   10,000   50%    50%  Parquet

    Each row in the dataset contains:

    code: the code snippet as text.
    label: 0 for… See the full description on the dataset page: https://huggingface.co/datasets/mhb-maaz/ai-detector-dataset.
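A TF-IDF baseline of the kind listed above can be sketched with scikit-learn. The snippets and labels below are invented for illustration; the real training set holds 500,000 rows.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny illustrative corpus with hypothetical labels
# (1 = AI-generated, 0 = human-written).
snippets = [
    "def add(a, b):\n    return a + b",
    "for i in range(10): print(i)",
    "# TODO refactor this quick hack",
    "result = [x * x for x in data]",
]
labels = [1, 0, 0, 1]

# Whitespace-delimited tokens keep code punctuation intact.
vec = TfidfVectorizer(token_pattern=r"\S+")
X = vec.fit_transform(snippets)
clf = LogisticRegression().fit(X, labels)
train_acc = clf.score(X, labels)
```

On the real data, the same vectorizer would be fit on the training split only and reused to transform dev and test.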

  12. TPS-Mar-2025-Rain-Prediction-Data

    • kaggle.com
    Updated Apr 14, 2025
    Cite
    Eren Ata (2025). TPS-Mar-2025-Rain-Prediction-Data [Dataset]. https://www.kaggle.com/datasets/erenata/tps-mar-2025-rain-prediction-data
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 14, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Eren Ata
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Rain Prediction Model - Kaggle Competition Project

    Overview

    This project is a machine learning solution for the Tabular Playground Series - March 2025 Kaggle competition, focusing on rain prediction using various weather-related features.

    Features

    • Advanced feature engineering with weather interactions and rolling statistics
    • Ensemble learning with XGBoost, LightGBM, and Logistic Regression
    • Hyperparameter optimization using Optuna
    • Cross-validation with GroupKFold
    • Feature importance analysis and visualization

    Model Performance

    • XGBoost CV Score: 0.8957 ± 0.0192 AUC
    • Optimized hyperparameters through 20 trials
    • Feature importance visualization available in 'feature_importance.png'
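The GroupKFold cross-validation used above keeps correlated rows together. A minimal sketch with scikit-learn; the group assignments below are illustrative only.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# GroupKFold keeps every row of a group (e.g. one weather station or one
# time block) in the same fold, preventing leakage between train and
# validation splits.
X = np.arange(12).reshape(12, 1)
y = np.tile([0, 1], 6)
groups = np.repeat([0, 1, 2, 3], 3)

gkf = GroupKFold(n_splits=4)
for train_idx, val_idx in gkf.split(X, y, groups):
    # No group index appears on both sides of any split.
    assert set(groups[train_idx]).isdisjoint(groups[val_idx])
```

For temporally correlated weather data, this grouping gives a more honest estimate of generalization than a random row-wise split.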

  13. A machine learning based prediction model for life expectancy

    • data.niaid.nih.gov
    • datasetcatalog.nlm.nih.gov
    • +1more
    zip
    Updated Nov 14, 2022
    Cite
    Evans Omondi; Brian Lipesa; Elphas Okango; Bernard Omolo (2022). A machine learning based prediction model for life expectancy [Dataset]. http://doi.org/10.5061/dryad.z612jm6fv
    Explore at:
    zip
    Available download formats
    Dataset updated
    Nov 14, 2022
    Dataset provided by
    University of South Carolina Upstate
    Strathmore University
    Authors
    Evans Omondi; Brian Lipesa; Elphas Okango; Bernard Omolo
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    The social and financial systems of many nations throughout the world are significantly impacted by life expectancy (LE) models. Numerous studies have pointed out the crucial effects that life expectancy projections will have on societal issues and the administration of the global healthcare system. The computation of life expectancy has primarily entailed building an ordinary life table. However, the life table is limited by its long duration, the assumption of homogeneity of cohorts, and censoring. As a result, a robust and more accurate approach is inevitable. In this study, a supervised machine learning model for estimating life expectancy rates is developed. The model takes into consideration health, socioeconomic, and behavioral characteristics by applying the eXtreme Gradient Boosting (XGBoost) algorithm to data from 193 UN member states. The effectiveness of the model's prediction is compared to that of the Random Forest (RF) and Artificial Neural Network (ANN) regressors utilized in earlier research. XGBoost attains an MAE and an RMSE of 1.554 and 2.402, respectively, outperforming the RF and ANN models, which achieved MAE and RMSE values of 7.938 and 11.304, and 3.86 and 5.002, respectively. The overall results of this study support XGBoost as a reliable and efficient model for estimating life expectancy.

    Methods: Secondary data were used, from which a sample of 2832 observations of 21 variables was sourced from the World Health Organization (WHO) and the United Nations (UN) databases. The data cover 193 UN member states from 2000–2015, with the LE health-related factors drawn from the Global Health Observatory data repository.
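The MAE and RMSE figures quoted above are computed as follows; the true/predicted values here are hypothetical, chosen only to show the arithmetic.

```python
import numpy as np

# Hypothetical true vs. predicted life expectancies for four countries.
y_true = np.array([70.0, 65.0, 80.0, 75.0])
y_pred = np.array([71.0, 63.0, 79.0, 77.0])

mae = np.mean(np.abs(y_true - y_pred))           # mean absolute error
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # root mean squared error
```

RMSE penalizes large errors more heavily than MAE, which is why the two metrics are usually reported together, as in the study's 1.554 (MAE) vs. 2.402 (RMSE) for XGBoost.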

  14. Egypt Fake Tweets Detection Dataset Labeled

    • kaggle.com
    zip
    Updated Apr 25, 2025
    Cite
    Mahmoud Elgendy68 (2025). Egypt Fake Tweets Detection Dataset Labeled [Dataset]. https://www.kaggle.com/datasets/mahmoudelgendy68/egypt-fake-tweets-detection-dataset-labeled/data
    Explore at:
    zip (1348136 bytes)
    Available download formats
    Dataset updated
    Apr 25, 2025
    Authors
    Mahmoud Elgendy68
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Area covered
    Egypt
    Description

    This dataset is part of a project focused on detecting fake news and misleading content in Egyptian Arabic text from Twitter and Facebook. It contains 22,906 labeled text samples, with labels representing:

    f → Fake or misleading content

    r → Real or factual content

    idk → Unclear or ambiguous content

    🔍 Sources & Labeling The dataset is based on manually labeled samples and semi-supervised labeling using an XGBoost classifier trained on a small seed set. Over 20,000 examples were confidently pseudo-labeled using probability thresholds.
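Threshold-based pseudo-labeling of the kind described above can be sketched in a few lines. The probability scores and thresholds below are assumptions for illustration, not the project's actual values.

```python
import numpy as np

# probs: hypothetical P(fake) scores from the seed classifier.
probs = np.array([0.97, 0.55, 0.03, 0.88, 0.45, 0.99])
hi, lo = 0.9, 0.1  # assumed confidence thresholds

pseudo_fake = np.where(probs >= hi)[0]                 # confidently fake
pseudo_real = np.where(probs <= lo)[0]                 # confidently real
uncertain = np.where((probs > lo) & (probs < hi))[0]   # left unlabeled
```

Only the confident rows are added to the training set; uncertain rows (like the "idk" class here) are held back for manual review or later iterations.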

    The original texts are in Arabic, with content reflecting real social media discourse in Egypt, making this dataset particularly useful for research on:

    Arabic NLP

    Fake news detection

    Misinformation studies

    Social media analysis

    🧠 Applications This dataset can be used for training and evaluating:

    Text classification models

    Fake news detectors

    Sentiment analysis pipelines

    Arabic language models

    📌 Notes The dataset will be continuously refined, and future updates will include more manually verified labels. Please cite appropriately and reach out if using it in academic work.

  15. Table_1_Automatic text classification of drug-induced liver injury using...

    • datasetcatalog.nlm.nih.gov
    Updated Jun 3, 2024
    Cite
    Bao, Wenjun; Chen, Minjun; Thakkar, Shraddha; Wu, Yue; Tong, Weida; Wingerd, Byron; Liu, Zhichao; Mann, Nicholas; Wolfinger, Russell D.; Donnelly, Tom; Xu, Joshua; Pedersen, Thomas J. (2024). Table_1_Automatic text classification of drug-induced liver injury using document-term matrix and XGBoost.pdf [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001354925
    Explore at:
    Dataset updated
    Jun 3, 2024
    Authors
    Bao, Wenjun; Chen, Minjun; Thakkar, Shraddha; Wu, Yue; Tong, Weida; Wingerd, Byron; Liu, Zhichao; Mann, Nicholas; Wolfinger, Russell D.; Donnelly, Tom; Xu, Joshua; Pedersen, Thomas J.
    Description

    Introduction: Regulatory agencies generate a vast amount of textual data in the review process. For example, drug labeling serves as a valuable resource for regulatory agencies, such as the U.S. Food and Drug Administration (FDA) and the European Medicines Agency (EMA), to communicate drug safety and effectiveness information to healthcare professionals and patients. Drug labeling also serves as a resource for pharmacovigilance and drug safety research. Automated text classification would significantly improve the analysis of drug labeling documents and conserve reviewer resources.

    Methods: We utilized artificial intelligence in this study to classify drug-induced liver injury (DILI)-related content from drug labeling documents based on FDA's DILIrank dataset. We employed text mining and XGBoost models, using the Preferred Terms of medical queries for adverse-event standards to simplify the elimination of common words and phrases while retaining standard medical terms in the FDA and EMA drug label datasets. We then constructed a document-term matrix using weights computed by Term Frequency-Inverse Document Frequency (TF-IDF) for each included word/term/token.

    Results: The automatic text classification model exhibited robust performance in predicting DILI, achieving cross-validation AUC scores exceeding 0.90 both for drug labels from the FDA and EMA and for literature abstracts from the Critical Assessment of Massive Data Analysis (CAMDA).

    Discussion: Moreover, the text mining and XGBoost functions demonstrated in this study can be applied to other text processing and classification tasks.
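The TF-IDF document-term-matrix pipeline described above can be sketched in a few lines. This is a hedged illustration, not the authors' code: the example texts and labels are invented, and scikit-learn's GradientBoostingClassifier stands in for XGBoost.

```python
# TF-IDF document-term matrix feeding a boosted-tree text classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier

docs = [
    "hepatotoxicity reported with elevated transaminases",
    "no liver injury observed in clinical trials",
    "jaundice and hepatic failure in postmarketing reports",
    "well tolerated with mild headache only",
]
labels = [1, 0, 1, 0]  # 1 = DILI-related, 0 = not (invented labels)

vec = TfidfVectorizer(stop_words="english")   # drops common English words
X = vec.fit_transform(docs)                   # the document-term matrix

clf = GradientBoostingClassifier(random_state=0).fit(X.toarray(), labels)
print(clf.predict(vec.transform(["hepatic failure reported"]).toarray()))
```

Each matrix cell holds the TF-IDF weight of a term in a document, which is what the Methods paragraph refers to as the computed weights.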

  16.

    Data_Sheet_2_Machine Learning Prediction Models for Mechanically Ventilated...

    • datasetcatalog.nlm.nih.gov
    • frontiersin.figshare.com
    Updated Jul 1, 2021
    + more versions
    Cite
    Yao, Renqi; Li, Lin; Du, Bin; Li, Yang; Zhang, Jin; Wang, Guowei; Chen, Yan; Li, Wei; Chen, Ge; Xi, Xiuming; Jin, Xin; Liu, Shi; Ren, Chao; Huang, Huibin; Guo, Junyang; Guo, Qianqian; Yu, Qian; Zhu, Yibing; Zheng, Hua (2021). Data_Sheet_2_Machine Learning Prediction Models for Mechanically Ventilated Patients: Analyses of the MIMIC-III Database.PDF [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000884346
    Explore at:
    Dataset updated
    Jul 1, 2021
    Authors
    Yao, Renqi; Li, Lin; Du, Bin; Li, Yang; Zhang, Jin; Wang, Guowei; Chen, Yan; Li, Wei; Chen, Ge; Xi, Xiuming; Jin, Xin; Liu, Shi; Ren, Chao; Huang, Huibin; Guo, Junyang; Guo, Qianqian; Yu, Qian; Zhu, Yibing; Zheng, Hua
    Description

    Background: Mechanically ventilated patients in the intensive care unit (ICU) have high mortality rates. There are multiple prediction scores, such as the Simplified Acute Physiology Score II (SAPS II), Oxford Acute Severity of Illness Score (OASIS), and Sequential Organ Failure Assessment (SOFA), widely used in the general ICU population. We aimed to establish prediction scores for mechanically ventilated patients by combining these disease severity scores with other features available on the first day of admission.

    Methods: A retrospective administrative database study of the Medical Information Mart for Intensive Care (MIMIC-III) database was conducted. The exposures of interest consisted of demographics, pre-ICU comorbidity, ICU diagnosis, disease severity scores, vital signs, and laboratory test results on the first day of ICU admission. Hospital mortality was used as the outcome. We used the machine learning methods of k-nearest neighbors (KNN), logistic regression, bagging, decision tree, random forest, Extreme Gradient Boosting (XGBoost), and neural network for model establishment. A sample of 70% of the cohort was used for the training set; the remaining 30% was applied for testing. Areas under the receiver operating characteristic curves (AUCs) and calibration plots were constructed for the evaluation and comparison of the models' performance. The significance of the risk factors was identified through the models and the top factors were reported.

    Results: A total of 28,530 subjects were enrolled through screening of the MIMIC-III database. After data preprocessing, 25,659 adult patients with 66 predictors were included in the model analyses. With the training set, the KNN, logistic regression, decision tree, random forest, neural network, bagging, and XGBoost models were established; on the testing set they obtained AUCs of 0.806, 0.818, 0.743, 0.819, 0.780, 0.803, and 0.821, respectively. The calibration curves of all the models, except for the neural network, performed well. The XGBoost model performed best among the seven models. The top five predictors were age, respiratory dysfunction, SAPS II score, maximum hemoglobin, and minimum lactate.

    Conclusion: The current study indicates that models using risk factors available on the first day can be successfully established for predicting mortality in ventilated patients. The XGBoost model performs best among the seven machine learning models.
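The evaluation setup described above (70/30 split, AUC per model) can be sketched as follows. This is a hedged stand-in on synthetic data, not MIMIC-III: GradientBoostingClassifier substitutes for XGBoost, and the feature count and signal are invented.

```python
# 70/30 train/test split with AUC comparison across several classifiers.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=2000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
    "boosting": GradientBoostingClassifier(random_state=0),  # XGBoost stand-in
}
aucs = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # AUC uses the predicted probability of the positive class.
    aucs[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(aucs)
```

In the study, the same loop would run over all seven model families, with calibration plots produced alongside the AUCs.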

  17. S1 File -

    • plos.figshare.com
    zip
    Updated Feb 6, 2025
    + more versions
    Cite
    Jiyu Wang; Niaz Muhammad Shahani; Xigui Zheng; Jiang Hongwei; Xin Wei (2025). S1 File - [Dataset]. http://doi.org/10.1371/journal.pone.0314977.s001
    Explore at:
    zip
    Available download formats
    Dataset updated
    Feb 6, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Jiyu Wang; Niaz Muhammad Shahani; Xigui Zheng; Jiang Hongwei; Xin Wei
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Accurately evaluating earthquake-induced slope displacement is a key factor for designing slopes that can effectively respond to seismic activity. This study evaluates the capabilities of various machine learning models, including artificial neural network (ANN), support vector machine (SVM), random forest (RF), and extreme gradient boosting (XGBoost), in analyzing earthquake-induced slope displacement. A dataset of 45 samples was used, with 70% allocated for training and 30% for testing. To improve model robustness, repeated 5-fold cross-validation was applied. Among the models, XGBoost demonstrated superior predictive accuracy, with an R2 value of 0.99 on both the train and test data, outperforming ANN (train/test R2 of 0.63/0.80), SVM (0.87/0.86), and RF (0.94/0.87). Sensitivity analysis identified maximum horizontal acceleration (kmax = 0.714) as the most influential factor in slope displacement. The findings suggest that the XGBoost model developed in this study is highly effective in predicting earthquake-induced slope displacement, offering valuable insights for early warning systems and slope stability management.
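Repeated k-fold cross-validation, as used above to stabilize estimates on a small 45-sample dataset, can be sketched like this. The data here are synthetic and GradientBoostingRegressor stands in for XGBoost; only the sample count mirrors the study.

```python
# Repeated 5-fold cross-validation for a small regression dataset:
# each repeat reshuffles the fold assignment, averaging out split luck.
import numpy as np
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(45, 4))            # 45 samples, as in the study
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.05, size=45)

cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(GradientBoostingRegressor(random_state=0), X, y,
                         cv=cv, scoring="r2")
print(f"mean R2 over {len(scores)} folds: {scores.mean():.3f}")
```

With only 45 samples, a single 5-fold split leaves 9 test points per fold, so repeating the splits is what makes the reported R2 values trustworthy.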

  18.

    ASV tables of Myasthenia gravis (MG) and non-Myasthenia gravis

    • datadryad.org
    • data-staging.niaid.nih.gov
    • +2 more
    zip
    Updated Sep 22, 2023
    Cite
    Che-Cheng Chang; Hou-Chang Chiu; Wei-Ning Lin (2023). ASV tables of Myasthenia gravis (MG) and non-Myasthenia gravis [Dataset]. http://doi.org/10.5061/dryad.73n5tb32m
    Explore at:
    zip
    Available download formats
    Dataset updated
    Sep 22, 2023
    Dataset provided by
    Dryad
    Authors
    Che-Cheng Chang; Hou-Chang Chiu; Wei-Ning Lin
    Time period covered
    Jul 8, 2023
    Description

    In this prospective study, 19 individuals with MG and 10 individuals without were consecutively recruited from Fu-Jen Catholic University Hospital. Individuals were enrolled in the MG group if they 1) were given a diagnosis of MG on the basis of having the combination of symptoms and signs that are characteristic of muscle weakness with diurnal changes and either 2a) had a positive test result for specific autoantibodies or 2b) had a positive electrophysiological diagnosis obtained using single-fiber electromyography and repetitive nerve stimulation (Rousseff, 2021). None of the participants had received any abdominal chirurgic intervention; consumed antibiotics, probiotics, or antacids during the previous 6 months; or reported gastrointestinal symptoms during the previous year. This study was approved by the Regional Ethics Committee of Fu-Jen Catholic University Hospital and written informed consent was obtained from each participant (No. FJUH109043). All experiments were completed in...

  19.

    Data from: Assessing individual genetic susceptibility to metabolic...

    • datasetcatalog.nlm.nih.gov
    • tandf.figshare.com
    Updated Jun 22, 2025
    Cite
    Huang, Tao; Zheng, Xiujuan; Huang, Xirui; Xiong, Wenhui; Yang, Menghan; Wang, Simin; Li, Yuanyuan; Gao, Bizhen; Qiao, Shijie (2025). Assessing individual genetic susceptibility to metabolic syndrome: interpretable machine learning method [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0002050385
    Explore at:
    Dataset updated
    Jun 22, 2025
    Authors
    Huang, Tao; Zheng, Xiujuan; Huang, Xirui; Xiong, Wenhui; Yang, Menghan; Wang, Simin; Li, Yuanyuan; Gao, Bizhen; Qiao, Shijie
    Description

    Genome-wide association studies have provided profound insights into the genetic aetiology of metabolic syndrome (MetS). However, there is a lack of machine-learning (ML)-based predictive models to assess individual genetic susceptibility to MetS. This study utilized single-nucleotide polymorphisms (SNPs) as variables and employed ML-based genetic risk score (GRS) models to predict the occurrence of MetS, bringing it closer to clinical application. Feature selection was performed using the Least Absolute Shrinkage and Selection Operator (LASSO). Six ML algorithms were employed to construct GRS models. Fivefold cross-validation was utilized to aid in the internal validation of the models. The receiver operating characteristic (ROC) curve was used to select the better-performing GRS model. SHapley Additive exPlanations (SHAP) was then applied to interpret the model. After extracting the GRS, stratified analysis of BMI, age and gender was performed. Finally, these conventional risk factors and the GRS were integrated through multivariate logistic regression to establish a combined model. A total of 17 SNPs were selected for analysis. Among the GRS models, the extreme gradient boosting (XGBoost) model demonstrated superior discriminative performance (AUC = 0.837). The robustness of the XGBoost model was also validated through fivefold cross-validation (mean ROC-AUC = 0.706). The XGBoost-based SHAP algorithm not only elucidated the global effects of the 17 SNPs across all samples, but also described the interactions between SNPs, providing a visual representation of how SNPs impact the prediction of MetS in an individual. There was a strong correlation between GRS and MetS risk, particularly among young individuals, males and overweight individuals. Furthermore, the model combining conventional risk factors and the GRS exhibited excellent discriminative performance (AUC = 0.962) and outstanding robustness (mean ROC-AUC = 0.959). This study established a reliable XGBoost-based GRS model and a GRS prediction platform (https://metabolicsyndromeapps.shinyapps.io/geneticriskscore/) to assess individual genetic susceptibility to MetS. The model has high interpretability and can provide a personalized reference for determining the necessity of primary prevention measures for MetS. Additionally, there may be interactions between traditional risk factors and the GRS, and integrating both in a comprehensive model is useful for predicting MetS occurrence.
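The LASSO-then-boosting pipeline described above can be sketched as follows. This is a hedged illustration on synthetic genotype-like data (0/1/2 allele counts): LassoCV screens features by zeroing uninformative coefficients, and GradientBoostingClassifier stands in for XGBoost on the selected subset.

```python
# LASSO feature selection followed by a boosted-tree classifier.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(300, 40)).astype(float)  # SNP-like 0/1/2 codes
y = (X[:, 0] + X[:, 1] - X[:, 2] + rng.normal(size=300) > 1.0).astype(int)

# LASSO keeps only features with nonzero coefficients.
lasso = LassoCV(cv=5, random_state=0).fit(X, y)
selected = np.flatnonzero(lasso.coef_)
print(f"selected {len(selected)} of {X.shape[1]} features")

# Fit the boosted model on the reduced feature set.
clf = GradientBoostingClassifier(random_state=0).fit(X[:, selected], y)
```

In the study, this screening step reduced the SNP panel to the 17 variants used to build the GRS models.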

  20. Fake News Detection Dataset

    • kaggle.com
    zip
    Updated Apr 27, 2025
    Cite
    Mahdi Mashayekhi (2025). Fake News Detection Dataset [Dataset]. https://www.kaggle.com/datasets/mahdimashayekhi/fake-news-detection-dataset
    Explore at:
    zip (11735585 bytes)
    Available download formats
    Dataset updated
    Apr 27, 2025
    Authors
    Mahdi Mashayekhi
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    📚 Fake News Detection Dataset

    Overview

    This dataset is designed for practicing fake news detection using machine learning and natural language processing (NLP) techniques. It includes a rich collection of 20,000 news articles, carefully generated to simulate real-world data scenarios. Each record contains metadata about the article and a label indicating whether the news is real or fake.

    The dataset also intentionally includes around 5% missing values in some fields to simulate the challenges of handling incomplete data in real-life projects.

    Columns Description

    title: A short headline summarizing the article (around 6 words).

    text: The body of the news article (200–300 words on average).

    date: The publication date of the article, randomly selected over the past 3 years.

    source: The media source that published the article (e.g., BBC, CNN, Al Jazeera). May contain missing values (~5%).

    author: The author's full name. Some entries are missing (~5%) to simulate real-world incomplete data.

    category: The general category of the article (e.g., Politics, Health, Sports, Technology).

    label: The target label: real or fake news.

    Why Use This Dataset?

    Fake News Detection Practice: Perfect for binary classification tasks.

    NLP Preprocessing: Allows users to practice text cleaning, tokenization, vectorization, etc.

    Handling Missing Data: Some fields are incomplete to simulate real-world data challenges.

    Feature Engineering: Encourages creating new features from text and metadata.

    Balanced Labels: Realistic distribution of real and fake news for fair model training.

    Potential Use Cases

    Building and evaluating text classification models (e.g., Logistic Regression, Random Forests, XGBoost).

    Practicing NLP techniques like TF-IDF, Word2Vec, BERT embeddings.

    Performing exploratory data analysis (EDA) on news data.

    Developing pipelines for dealing with missing values and feature extraction.

    A Note on Data Quality

    This dataset has been synthetically generated to closely resemble real news articles. The diversity in titles, text, sources, and categories ensures that models trained on this dataset can generalize well to unseen, real-world data. However, since it is synthetic, it should not be used for production models or decision-making without careful validation.

    File Info

    Filename: fake_news_dataset.csv

    Size: 20,000 rows × 7 columns

    Missing Data: ~5% missing values in the source and author columns.
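A minimal sketch of loading the CSV and handling the ~5% missing source/author values described above. The filename and column names come from the card; since the file itself is not bundled here, a tiny stand-in frame is built inline, and the fill strategy ("unknown" sentinel) is one illustrative choice among several.

```python
# Load the dataset and fill missing source/author values before modeling.
import pandas as pd

# Stand-in for: df = pd.read_csv("fake_news_dataset.csv")
df = pd.DataFrame({
    "title": ["Markets rally on jobs data", "Miracle cure found"],
    "text": ["Stocks rose sharply...", "Doctors hate this trick..."],
    "source": ["BBC", None],        # ~5% missing in the real file
    "author": [None, "Jane Doe"],   # ~5% missing in the real file
    "label": ["real", "fake"],
})

print(df[["source", "author"]].isna().mean())  # fraction missing per column
df = df.fillna({"source": "unknown", "author": "unknown"})
```

Alternatives include dropping the incomplete rows or treating missingness itself as a feature, which is part of what the dataset is designed to let you practice.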
