61 datasets found
  1. jars xgboost example files

    • kaggle.com
    zip
    Updated Apr 16, 2022
    Cite
    Edwin Hauwert M.Sc. (2022). jars xgboost example files [Dataset]. https://www.kaggle.com/develuse/jars-xgboost-old
    Explore at:
    zip (615742866 bytes)
    Available download formats
    Dataset updated
    Apr 16, 2022
    Authors
    Edwin Hauwert M.Sc.
    Description

    Dataset

    This dataset was created by Edwin Hauwert M.Sc.

    Contents

  2. Hyperparameters for the XGBoost model.

    • plos.figshare.com
    xls
    Updated Nov 27, 2024
    Cite
    Hoa Thi Trinh; Tuan Anh Pham; Vu Dinh Tho; Duy Hung Nguyen (2024). Hyperparameters for the XGBoost model. [Dataset]. http://doi.org/10.1371/journal.pone.0312531.t002
    Explore at:
    xls
    Available download formats
    Dataset updated
    Nov 27, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Hoa Thi Trinh; Tuan Anh Pham; Vu Dinh Tho; Duy Hung Nguyen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Structurally, the lateral load-bearing capacity mainly depends on reinforced concrete (RC) walls. Determination of flexural strength and shear strength is mandatory when designing reinforced concrete walls. Typically, these strengths are determined through theoretical formulas and verified experimentally. However, theoretical formulas often have large errors and testing is costly and time-consuming. Therefore, this study exploits machine learning techniques, specifically the hybrid XGBoost model combined with optimization algorithms, to predict the shear strength of RC walls based on model training from available experimental results. The study used the largest database of RC walls to date, consisting of 1057 samples with various cross-sectional shapes. Bayesian optimization (BO) algorithms, including BO—Gaussian Process, BO—Random Forest, and Random Search methods, were used to refine the XGBoost model architecture. The results show that Gaussian Process emerged as the most efficient solution compared to other optimization algorithms, providing the lowest Mean Square Error and achieving a prediction R2 of 0.998 for the training set, 0.972 for the validation set and 0.984 for the test set, while BO—Random Forest and Random Search performed as well on the training and test sets as Gaussian Process but significantly worse on the validation set, specifically R2 on the validation set of BO—Random Forest and Random Search were 0.970 and 0.969 respectively over the entire dataset including all cross-sectional shapes of the RC wall. SHAP (Shapley Additive Explanations) technique was used to clarify the predictive ability of the model and the importance of input variables. Furthermore, the performance of the model was validated through comparative analysis with benchmark models and current standards. Notably, the coefficient of variation (COV %) of the XGBoost model is 13.27%, while traditional models often have COV % exceeding 50%.
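As a rough illustration of the hyperparameter-refinement step described above, here is a minimal sketch. It substitutes scikit-learn's GradientBoostingRegressor and RandomizedSearchCV for the study's XGBoost model and Bayesian optimizers, and the data and search grids are assumptions, not the paper's settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Toy stand-in for the RC-wall data: 300 samples, 8 numeric features.
X, y = make_regression(n_samples=300, n_features=8, n_informative=8,
                       noise=10.0, random_state=0)

# Hypothetical search space; the study's actual grids are not given here.
param_dist = {
    "n_estimators": [100, 200, 400],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.03, 0.1, 0.3],
}
search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions=param_dist,
    n_iter=8, cv=3, scoring="r2", random_state=0,
)
search.fit(X, y)  # best_params_ then holds the refined configuration
```

In the study, Bayesian optimization (Gaussian Process or Random Forest surrogates) replaces this random sampling, steering each trial toward promising regions of the search space.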

  3. MetaCost XGBoost Training and Evaluation Dataset with MATLAB Codes and...

    • zenodo.org
    csv, zip
    Updated Jun 18, 2025
    Cite
    Abhishek Borah; Xavier Emery; Parag Jyoti Dutta (2025). MetaCost XGBoost Training and Evaluation Dataset with MATLAB Codes and files for generating proxies [Dataset]. http://doi.org/10.5281/zenodo.15666484
    Explore at:
    csv, zip
    Available download formats
    Dataset updated
    Jun 18, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Abhishek Borah; Xavier Emery; Parag Jyoti Dutta
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jun 14, 2025
    Description

    The dataset consists of two curated subsets designed for the classification of alteration types using geochemical and proxy variables. The traditional dataset (Trad_Train.csv and Trad_Test.csv) is derived directly from the original complete geochemical dataset (alldata.csv) without any missing values and includes the original geochemical features, serving as a baseline for model training and evaluation. In contrast, the simulated dataset (proxies_alldata.csv) was generated through custom MATLAB scripts that transform the original geochemical features into proxy variables based on multiple geostatistical realizations. These proxies, expressed on a Gaussian scale, may include negative values due to normalization. The target variable, Alteration, was originally encoded as integers using the mapping: 1 = AAA, 2 = IAA, 3 = PHY, 4 = PRO, 5 = PTS, and 6 = UAL. The simulated proxy data was split into simulated train and test files (Simu_Train.csv and Simu_Test.csv) based on an encoded flag marking training (=1) and testing (=2) rows. All supporting files (datasets, intermediate outputs such as PNGs and variograms, proxy outputs, and an executable for the confidence analysis routines) are included in the repository except the source code, which is hosted in a GitHub repository. Specifically, the FinalMatlabFiles.zip archive contains the raw input file alldata.csv used to generate proxies_alldata.csv, as well as Analysis1.csv and Analysis2.csv for performing confidence analysis. To run the executable files in place of the .m scripts in MATLAB, users must install the MATLAB Runtime 2023b for Windows 64-bit, available at: https://ssd.mathworks.com/supportfiles/downloads/R2023b/Release/10/deployment_files/installer/complete/win64/MATLAB_Runtime_R2023b_Update_10_win64.zip.

    Details on the input files for confidence analysis (Analysis1.csv and Analysis2.csv):
    These files contain two columns for the test data: column 1 indicates whether the predicted and true alterations match; column 2 gives the probability of a correct classification, according to bootstrapped samples (Analysis1.csv) or to simulated proxies (Analysis2.csv).
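
The two-column layout described above can be summarized with a few lines of pandas. This is a minimal sketch on hypothetical miniature values, not the actual contents of Analysis1.csv.

```python
import io
import pandas as pd

# Hypothetical stand-in for Analysis1.csv: column 1 = 1 if the predicted
# alteration matched the true one, column 2 = estimated probability of a
# correct classification.
csv_text = """match,prob_correct
1,0.95
1,0.80
0,0.40
1,0.70
0,0.55
"""
df = pd.read_csv(io.StringIO(csv_text))

# Overall accuracy on the test rows, and mean stated confidence split by
# whether the prediction was actually correct.
accuracy = df["match"].mean()
mean_conf = df.groupby("match")["prob_correct"].mean()
```

If the model's confidence estimates are informative, the mean probability for correct rows should exceed that for incorrect rows.
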
  4. Fraud Detection Transactions Dataset

    • kaggle.com
    zip
    Updated Feb 21, 2025
    Cite
    Samay Ashar (2025). Fraud Detection Transactions Dataset [Dataset]. https://www.kaggle.com/datasets/samayashar/fraud-detection-transactions-dataset
    Explore at:
    zip (2104444 bytes)
    Available download formats
    Dataset updated
    Feb 21, 2025
    Authors
    Samay Ashar
    License

    CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Description

    This dataset is designed to help data scientists and machine learning enthusiasts develop robust fraud detection models. It contains realistic synthetic transaction data, including user information, transaction types, risk scores, and more, making it ideal for binary classification tasks with models like XGBoost and LightGBM.

    📌 Key Features

    1. 21 features capturing various aspects of a financial transaction
    2. Realistic structure with numerical, categorical, and temporal data
    3. Binary fraud labels (0 = Not Fraud, 1 = Fraud)
    4. Designed for high accuracy with XGBoost and other ML models
    5. Useful for anomaly detection, risk analysis, and security research

    📌 Columns in the Dataset

    Transaction_ID: Unique identifier for each transaction
    User_ID: Unique identifier for the user
    Transaction_Amount: Amount of money involved in the transaction
    Transaction_Type: Type of transaction (Online, In-Store, ATM, etc.)
    Timestamp: Date and time of the transaction
    Account_Balance: User's current account balance before the transaction
    Device_Type: Type of device used (Mobile, Desktop, etc.)
    Location: Geographical location of the transaction
    Merchant_Category: Type of merchant (Retail, Food, Travel, etc.)
    IP_Address_Flag: Whether the IP address was flagged as suspicious (0 or 1)
    Previous_Fraudulent_Activity: Number of past fraudulent activities by the user
    Daily_Transaction_Count: Number of transactions made by the user that day
    Avg_Transaction_Amount_7d: User's average transaction amount in the past 7 days
    Failed_Transaction_Count_7d: Count of failed transactions in the past 7 days
    Card_Type: Type of payment card used (Credit, Debit, Prepaid, etc.)
    Card_Age: Age of the card in months
    Transaction_Distance: Distance between the user's usual location and the transaction location
    Authentication_Method: How the user authenticated (PIN, Biometric, etc.)
    Risk_Score: Fraud risk score computed for the transaction
    Is_Weekend: Whether the transaction occurred on a weekend (0 or 1)
    Fraud_Label: Target variable (0 = Not Fraud, 1 = Fraud)

    📌 Potential Use Cases

    1. Fraud detection model training
    2. Anomaly detection in financial transactions
    3. Risk scoring systems for banks and fintech companies
    4. Feature engineering and model explainability research
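
A minimal modeling sketch for a dataset with this column mix (numeric, categorical, and binary features plus a binary label) might look as follows. The data, labeling rule, and use of scikit-learn's GradientBoostingClassifier in place of XGBoost are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 400
# Tiny synthetic stand-in using a few of the listed columns.
df = pd.DataFrame({
    "Transaction_Amount": rng.exponential(100.0, n),
    "Transaction_Type": rng.choice(["Online", "In-Store", "ATM"], n),
    "Is_Weekend": rng.integers(0, 2, n),
})
# Illustrative labeling rule only: flag unusually large transactions.
df["Fraud_Label"] = (df["Transaction_Amount"] > 200).astype(int)

# One-hot encode the categorical column, then fit a boosted-tree classifier
# (GradientBoostingClassifier stands in for XGBoost/LightGBM here).
X = pd.get_dummies(df.drop(columns="Fraud_Label"), columns=["Transaction_Type"])
y = df["Fraud_Label"]
clf = GradientBoostingClassifier(random_state=0).fit(X, y)
train_acc = clf.score(X, y)
```

With the real dataset, the same pattern applies after encoding Timestamp into temporal features and handling the remaining categorical columns.
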
  5. Table_1_Five-Feature Model for Developing the Classifier for Synergistic vs....

    • frontiersin.figshare.com
    • datasetcatalog.nlm.nih.gov
    xlsx
    Updated Jun 13, 2023
    + more versions
    Cite
    Xiangjun Ji; Weida Tong; Zhichao Liu; Tieliu Shi (2023). Table_1_Five-Feature Model for Developing the Classifier for Synergistic vs. Antagonistic Drug Combinations Built by XGBoost.XLSX [Dataset]. http://doi.org/10.3389/fgene.2019.00600.s002
    Explore at:
    xlsx
    Available download formats
    Dataset updated
    Jun 13, 2023
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Xiangjun Ji; Weida Tong; Zhichao Liu; Tieliu Shi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Combinatorial drug therapy can improve the therapeutic effect and reduce the corresponding adverse events. In silico strategies to classify synergistic vs. antagonistic drug pairs is more efficient than experimental strategies. However, most of the developed methods have been applied only to cancer therapies. In this study, we introduce a novel method, XGBoost, based on five features of drugs and biomolecular networks of their targets, to classify synergistic vs. antagonistic drug combinations from different drug categories. We found that XGBoost outperformed other classifiers in both stratified fivefold cross-validation (CV) and independent validation. For example, XGBoost achieved higher predictive accuracy than other models (0.86, 0.78, 0.78, and 0.83 for XGBoost, logistic regression, naïve Bayesian, and random forest, respectively) for an independent validation set. We also found that the five-feature XGBoost model is much more effective at predicting combinatorial therapies that have synergistic effects than those with antagonistic effects. The five-feature XGBoost model was also validated on TCGA data with accuracy of 0.79 among the 61 tested drug pairs, which is comparable to that of DeepSynergy. Among the 14 main anatomical/pharmacological groups classified according to WHO Anatomic Therapeutic Class, for drugs belonging to five groups, their prediction accuracy was significantly increased (odds ratio < 1) or reduced (odds ratio > 1) (Fisher’s exact test, p < 0.05). This study concludes that our five-feature XGBoost model has significant benefits for classifying synergistic vs. antagonistic drug combinations.
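The stratified fivefold cross-validation mentioned above keeps the synergistic/antagonistic class ratio constant across folds. A minimal sketch with scikit-learn, on toy labels chosen for illustration:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: 10 "synergistic" (1) and 5 "antagonistic" (0) drug pairs.
y = np.array([1] * 10 + [0] * 5)
X = np.zeros((15, 1))  # placeholder features

# Stratified 5-fold CV: each fold gets 2 positives and 1 negative,
# so the class ratio (2/3) is identical in every validation fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_ratios = [y[val].mean() for _, val in skf.split(X, y)]
```

This matters for imbalanced synergy data: a plain KFold split could leave some folds with no antagonistic pairs at all.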

  6. Internal evaluation of the XGBoost model on different datasets and...

    • plos.figshare.com
    xls
    Updated Feb 4, 2025
    Cite
    Ali Safdari; Chanda Sai Keshav; Deepanshu Mody; Kshitij Verma; Utsav Kaushal; Vaadeendra Kumar Burra; Sibnath Ray; Debashree Bandyopadhyay (2025). Internal evaluation of the XGBoost model on different datasets and comparison with published datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0316467.t002
    Explore at:
    xls
    Available download formats
    Dataset updated
    Feb 4, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Ali Safdari; Chanda Sai Keshav; Deepanshu Mody; Kshitij Verma; Utsav Kaushal; Vaadeendra Kumar Burra; Sibnath Ray; Debashree Bandyopadhyay
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Internal evaluation of the XGBoost model on different datasets and comparison with published datasets.

  7. Data from: A Deep Learning and XGBoost-based Method for Predicting...

    • narcis.nl
    • data.mendeley.com
    Updated Aug 3, 2021
    + more versions
    Cite
    wang, P (via Mendeley Data) (2021). A Deep Learning and XGBoost-based Method for Predicting Protein-protein Interaction Sites [Dataset]. http://doi.org/10.17632/9tft3vz5tm.2
    Explore at:
    Dataset updated
    Aug 3, 2021
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    wang, P (via Mendeley Data)
    Description

    local_feature_training_set.csv: preprocessed feature-extractor data with 65869 rows and 344 columns; each row is a sample, the first 343 columns are features, and the last column is the label.

    local_feature_testing_set.csv: preprocessed feature-extractor data with 11791 rows and 344 columns; each row is a sample, the first 343 columns are features, and the last column is the label.

    global&local_feature_training_set.csv: preprocessed feature-extractor data with 65869 rows and 1028 columns; each row is a sample, the first 1027 columns are features, and the last column is the label.

    global&local_feature_testing_set.csv: preprocessed feature-extractor data with 11791 rows and 1028 columns; each row is a sample, the first 1027 columns are features, and the last column is the label.
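Files in this features-then-label layout split cleanly by column position. A minimal sketch on a hypothetical miniature frame standing in for local_feature_training_set.csv:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in: every column except the last holds a
# feature, the last column holds the 0/1 interaction-site label.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(10, 5)))
data["label"] = rng.integers(0, 2, 10)

X = data.iloc[:, :-1].to_numpy()   # feature columns
y = data.iloc[:, -1].to_numpy()    # label column
```

For the real files, `pd.read_csv("local_feature_training_set.csv")` followed by the same positional slicing yields a (65869, 343) feature matrix and a 65869-element label vector.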

  8. Tox24 challenge data

    • kaggle.com
    zip
    Updated Sep 18, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Antonina Dolgorukova (2024). Tox24 challenge data [Dataset]. https://www.kaggle.com/datasets/antoninadolgorukova/tox24-challenge-data/suggestions
    Explore at:
    zip (19160575 bytes)
    Available download formats
    Dataset updated
    Sep 18, 2024
    Authors
    Antonina Dolgorukova
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset and associated notebooks were created to solve the Tox24 Challenge and provide real-world data and a working example of how machine learning can be used to predict binding activity to a target protein such as transthyretin (TTR), from retrieving and preprocessing SMILES to ensembling the obtained predictions.

    SMILES: The file all_smiles_data.csv contains various SMILES strings for the 1512 competition chemicals (retrieved from PubChem, cleaned, and with SMILES containing isolated atoms removed), generated in this notebook. Also, here I evaluated the performance of XGBoost using molecular descriptors computed from different SMILES representations of the chemicals.

    FEATURES: The 'features' folder contains features calculated with OCHEM and those computed in the feature engineering notebook. All feature sets were evaluated with XGBoost here.

    Feature selection notebooks: - https://www.kaggle.com/code/antoninadolgorukova/tox24-feature-selection-for-xgboost - https://www.kaggle.com/code/antoninadolgorukova/tox24-feature-selection-by-clusters-for-xgboost - https://www.kaggle.com/code/antoninadolgorukova/tox24-feature-selection-by-clusters-for-lightgbm

    MODELS: The 'submits' folder contains the predictions for the 500 test chemicals of the Tox24 Challenge, made with the XGBoost and LightGBM models and used for the final ensemble.

    DATA: The TTR Supplemental Tables are taken from the article that accompanied the Tox24 Challenge, and include:

    • Tables outlining the components of the assay reactions and lists of autofluorescent chemicals,
    • chemicals excluded from the analysis due to interference,
    • and chemicals screened in single concentration and concentration response testing.

    This dataset can be used for drug design research, protein-ligand interaction studies, and machine learning model development focused on chemical binding activities.
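The ensembling step mentioned above typically reduces to blending per-model predictions. A minimal sketch with a weighted mean; the prediction values and weights below are illustrative, not taken from the submissions.

```python
import numpy as np

# Hypothetical held-out predictions from two models (e.g. the XGBoost and
# LightGBM submissions described above), blended by a simple weighted mean.
pred_xgb = np.array([0.1, 0.4, 0.9])
pred_lgb = np.array([0.2, 0.5, 0.7])

weights = (0.6, 0.4)  # assumed weights, not from the source
ensemble = weights[0] * pred_xgb + weights[1] * pred_lgb
```

Weights are usually chosen on a validation set; equal weights (a plain average) are a common, robust default.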

  9. Heart Disease Risk Prediction Dataset

    • kaggle.com
    zip
    Updated Feb 7, 2025
    Cite
    Mahatir Ahmed Tusher (2025). Heart Disease Risk Prediction Dataset [Dataset]. https://www.kaggle.com/datasets/mahatiratusher/heart-disease-risk-prediction-dataset
    Explore at:
    zip (1448235 bytes)
    Available download formats
    Dataset updated
    Feb 7, 2025
    Authors
    Mahatir Ahmed Tusher
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Heart Disease Risk Prediction Dataset

    Overview

    This synthetic dataset is designed to predict the risk of heart disease based on a combination of symptoms, lifestyle factors, and medical history. Each row in the dataset represents a patient, with binary (Yes/No) indicators for symptoms and risk factors, along with a computed risk label indicating whether the patient is at high or low risk of developing heart disease.

    The dataset contains 70,000 samples, making it suitable for training machine learning models for classification tasks. The goal is to provide researchers, data scientists, and healthcare professionals with a clean and structured dataset to explore predictive modeling for cardiovascular health.

    This dataset is a side project of EarlyMed, developed by students of Vellore Institute of Technology (VIT-AP). EarlyMed aims to leverage data science and machine learning for early detection and prevention of chronic diseases.

    Dataset Features

    Input Features

    Symptoms (Binary - Yes/No)

    1. Chest Pain (chest_pain): Presence of chest pain, a common symptom of heart disease.
    2. Shortness of Breath (shortness_of_breath): Difficulty breathing, often associated with heart conditions.
    3. Unexplained Fatigue (fatigue): Persistent tiredness without an obvious cause.
    4. Palpitations (palpitations): Irregular or rapid heartbeat.
    5. Dizziness/Fainting (dizziness): Episodes of lightheadedness or fainting.
    6. Swelling in Legs/Ankles (swelling): Swelling due to fluid retention, often linked to heart failure.
    7. Pain in Arm/Jaw/Neck/Back (radiating_pain): Radiating pain, a hallmark of angina or heart attacks.
    8. Cold Sweats & Nausea (cold_sweats): Symptoms commonly associated with acute cardiac events.

    Risk Factors (Binary - Yes/No or Continuous)

    1. Age (age): Patient's age in years (continuous variable).
    2. High Blood Pressure (hypertension): History of hypertension (Yes/No).
    3. High Cholesterol (cholesterol_high): Elevated cholesterol levels (Yes/No).
    4. Diabetes (diabetes): Diagnosis of diabetes (Yes/No).
    5. Smoking History (smoker): Whether the patient is a smoker (Yes/No).
    6. Obesity (obesity): Obesity status (Yes/No).
    7. Family History of Heart Disease (family_history): Family history of cardiovascular conditions (Yes/No).

    Output Label

    • Heart Disease Risk (risk_label): Binary label indicating the risk of heart disease:
      • 0: Low risk
      • 1: High risk

    Data Generation Process

    This dataset was synthetically generated using Python libraries such as numpy and pandas. The generation process ensured a balanced distribution of high-risk and low-risk cases while maintaining realistic correlations between features. For example:

    • Patients with multiple risk factors (e.g., smoking, hypertension, and diabetes) were more likely to be labeled as high risk.
    • Symptom patterns were modeled after clinical guidelines and research studies on heart disease.
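A generation process of this kind can be sketched in a few lines of numpy/pandas. The columns, sample size, and labeling rule below are illustrative stand-ins, not the dataset's actual generator.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000
# A few of the listed binary symptom/risk columns plus continuous age.
df = pd.DataFrame({
    "age": rng.integers(20, 90, n),
    "hypertension": rng.integers(0, 2, n),
    "diabetes": rng.integers(0, 2, n),
    "smoker": rng.integers(0, 2, n),
    "chest_pain": rng.integers(0, 2, n),
})
# Illustrative labeling rule: high risk when several factors co-occur
# or age is advanced, mirroring the correlations described above.
score = df[["hypertension", "diabetes", "smoker", "chest_pain"]].sum(axis=1)
df["risk_label"] = ((score >= 2) | (df["age"] > 70)).astype(int)
```

The real dataset refines this idea with 70,000 samples, more features, and symptom patterns calibrated to clinical guidelines.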

    Sources of Inspiration

    The design of this dataset was inspired by the following resources:

    Books

    • "Harrison's Principles of Internal Medicine" by J. Larry Jameson et al.: A comprehensive resource on cardiovascular diseases and their symptoms.
    • "Mayo Clinic Cardiology" by Joseph G. Murphy et al.: Provides insights into heart disease risk factors and diagnostic criteria.

    Research Papers

    • Framingham Heart Study: A landmark study identifying key risk factors for cardiovascular disease.
    • American Heart Association (AHA) Guidelines: Recommendations for diagnosing and managing heart disease.

    Existing Datasets

    • UCI Heart Disease Dataset: A widely used dataset for heart disease prediction.
    • Kaggle’s Heart Disease datasets: Various datasets contributed by the community.

    Clinical Guidelines

    • Centers for Disease Control and Prevention (CDC): Information on heart disease symptoms and risk factors.
    • World Health Organization (WHO): Global statistics and risk factor analysis for cardiovascular diseases.

    Applications

    This dataset can be used for a variety of purposes:

    1. Machine Learning Research:

      • Train classification models (e.g., Logistic Regression, Random Forest, XGBoost) to predict heart disease risk.
      • Experiment with feature engineering, model tuning, and evaluation metrics like Accuracy, Precision, Recall, and ROC-AUC.
    2. Healthcare Analytics:

      • Identify key risk factors contributing to heart disease.
      • Develop decision support systems for early detection of cardiovascular risks.
    3. Educational Purposes:

      • Teach students and practitioners about predictive modeling in healthcare.
      • Demonstrate the importance of feature selection...
  10. Additional file 1 of Classification of tumor types using XGBoost machine...

    • springernature.figshare.com
    zip
    Updated Aug 14, 2024
    Cite
    Veronica Zelli; Andrea Manno; Chiara Compagnoni; Rasheed Oyewole Ibraheem; Francesca Zazzeroni; Edoardo Alesse; Fabrizio Rossi; Claudio Arbib; Alessandra Tessitore (2024). Additional file 1 of Classification of tumor types using XGBoost machine learning model: a vector space transformation of genomic alterations [Dataset]. http://doi.org/10.6084/m9.figshare.26643322.v1
    Explore at:
    zip
    Available download formats
    Dataset updated
    Aug 14, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Veronica Zelli; Andrea Manno; Chiara Compagnoni; Rasheed Oyewole Ibraheem; Francesca Zazzeroni; Edoardo Alesse; Fabrizio Rossi; Claudio Arbib; Alessandra Tessitore
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 1: Figure S1. Example of a SPM[t] dataset for a generic tumor type t. Figure S2. Example of a CNV[t] dataset for a generic tumor type t. Figure S3. Pseudocode of the VSM data transformation procedure. Figure S4. Charts showing the size, in terms of total count and percentage, of each random group in the newly created dataset with groups as targets, and confusion matrix showing the performance [accuracy (ACC), balanced accuracy (BACC) and AUC score] of the model; hyperparameters are also reported. Of note, accuracy values obtained from the random grouping experiments reported here were significantly lower than those obtained by performing grouping experiments based on biological criteria and characterized by the same numerical complexity (similar group sizes).

  11. ai-detector-dataset

    • huggingface.co
    Updated Oct 19, 2025
    Cite
    Maaz (2025). ai-detector-dataset [Dataset]. https://huggingface.co/datasets/mhb-maaz/ai-detector-dataset
    Explore at:
    Dataset updated
    Oct 19, 2025
    Authors
    Maaz
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    AI vs Human Code Detection Dataset

    This dataset is designed for binary classification of AI-generated vs. human-written source code. It was used to train and evaluate multiple baseline models, including TF-IDF, XGBoost, and CodeBERT.

    Dataset Overview

    Split  Samples  Human  AI   Format
    Train  500,000  50%    50%  Parquet
    Dev    100,000  50%    50%  Parquet
    Test   10,000   50%    50%  Parquet

    Each row in the dataset contains:

    code: the code snippet as text.
    label: 0 for… See the full description on the dataset page: https://huggingface.co/datasets/mhb-maaz/ai-detector-dataset.
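A TF-IDF baseline of the kind listed above can be sketched with scikit-learn. The snippets and labels below are invented for illustration; the real training set holds 500,000 rows.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny illustrative corpus with hypothetical labels
# (1 = AI-generated, 0 = human-written).
snippets = [
    "def add(a, b):\n    return a + b",
    "for i in range(10): print(i)",
    "# TODO refactor this quick hack",
    "result = [x * x for x in data]",
]
labels = [1, 0, 0, 1]

# Whitespace-delimited tokens keep code punctuation intact.
vec = TfidfVectorizer(token_pattern=r"\S+")
X = vec.fit_transform(snippets)
clf = LogisticRegression().fit(X, labels)
train_acc = clf.score(X, labels)
```

On the real data, the same vectorizer would be fit on the training split only and reused to transform dev and test.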

  12. TPS-Mar-2025-Rain-Prediction-Data

    • kaggle.com
    Updated Apr 14, 2025
    Cite
    Eren Ata (2025). TPS-Mar-2025-Rain-Prediction-Data [Dataset]. https://www.kaggle.com/datasets/erenata/tps-mar-2025-rain-prediction-data
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 14, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Eren Ata
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Rain Prediction Model - Kaggle Competition Project

    Overview

    This project is a machine learning solution for the Tabular Playground Series - March 2025 Kaggle competition, focusing on rain prediction using various weather-related features.

    Features

    • Advanced feature engineering with weather interactions and rolling statistics
    • Ensemble learning with XGBoost, LightGBM, and Logistic Regression
    • Hyperparameter optimization using Optuna
    • Cross-validation with GroupKFold
    • Feature importance analysis and visualization

    Model Performance

    • XGBoost CV Score: 0.8957 ± 0.0192 AUC
    • Optimized hyperparameters through 20 trials
    • Feature importance visualization available in 'feature_importance.png'
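The GroupKFold cross-validation used above keeps correlated rows together. A minimal sketch with scikit-learn; the group assignments below are illustrative only.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# GroupKFold keeps every row of a group (e.g. one weather station or one
# time block) in the same fold, preventing leakage between train and
# validation splits.
X = np.arange(12).reshape(12, 1)
y = np.tile([0, 1], 6)
groups = np.repeat([0, 1, 2, 3], 3)

gkf = GroupKFold(n_splits=4)
for train_idx, val_idx in gkf.split(X, y, groups):
    # No group index appears on both sides of any split.
    assert set(groups[train_idx]).isdisjoint(groups[val_idx])
```

For temporally correlated weather data, this grouping gives a more honest estimate of generalization than a random row-wise split.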

  13. A machine learning based prediction model for life expectancy

    • data.niaid.nih.gov
    • datasetcatalog.nlm.nih.gov
    • +1more
    zip
    Updated Nov 14, 2022
    Cite
    Evans Omondi; Brian Lipesa; Elphas Okango; Bernard Omolo (2022). A machine learning based prediction model for life expectancy [Dataset]. http://doi.org/10.5061/dryad.z612jm6fv
    Explore at:
    zip
    Available download formats
    Dataset updated
    Nov 14, 2022
    Dataset provided by
    University of South Carolina Upstate
    Strathmore University
    Authors
    Evans Omondi; Brian Lipesa; Elphas Okango; Bernard Omolo
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    The social and financial systems of many nations throughout the world are significantly impacted by life expectancy (LE) models. Numerous studies have pointed out the crucial effects that life expectancy projections will have on societal issues and the administration of the global healthcare system. The computation of life expectancy has primarily entailed building an ordinary life table. However, the life table is limited by its long duration, the assumption of homogeneity of cohorts, and censoring. As a result, a robust and more accurate approach is inevitable. In this study, a supervised machine learning model for estimating life expectancy rates is developed. The model takes into consideration health, socioeconomic, and behavioral characteristics by applying the eXtreme Gradient Boosting (XGBoost) algorithm to data from 193 UN member states. The effectiveness of the model's prediction is compared to that of the Random Forest (RF) and Artificial Neural Network (ANN) regressors utilized in earlier research. XGBoost attains an MAE and an RMSE of 1.554 and 2.402, respectively, outperforming the RF and ANN models, which achieved MAE and RMSE values of 7.938 and 11.304, and 3.86 and 5.002, respectively. The overall results of this study support XGBoost as a reliable and efficient model for estimating life expectancy.

    Methods: Secondary data were used, from which a sample of 2832 observations of 21 variables was sourced from the World Health Organization (WHO) and the United Nations (UN) databases. The data cover 193 UN member states from 2000–2015, with the LE health-related factors drawn from the Global Health Observatory data repository.
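The MAE and RMSE figures quoted above are computed as follows; the true/predicted values here are hypothetical, chosen only to show the arithmetic.

```python
import numpy as np

# Hypothetical true vs. predicted life expectancies for four countries.
y_true = np.array([70.0, 65.0, 80.0, 75.0])
y_pred = np.array([71.0, 63.0, 79.0, 77.0])

mae = np.mean(np.abs(y_true - y_pred))           # mean absolute error
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # root mean squared error
```

RMSE penalizes large errors more heavily than MAE, which is why the two metrics are usually reported together, as in the study's 1.554 (MAE) vs. 2.402 (RMSE) for XGBoost.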

  14. Egypt Fake Tweets Detection Dataset Labeled

    • kaggle.com
    zip
    Updated Apr 25, 2025
    Cite
    Mahmoud Elgendy68 (2025). Egypt Fake Tweets Detection Dataset Labeled [Dataset]. https://www.kaggle.com/datasets/mahmoudelgendy68/egypt-fake-tweets-detection-dataset-labeled/data
    Explore at:
    zip (1348136 bytes)
    Available download formats
    Dataset updated
    Apr 25, 2025
    Authors
    Mahmoud Elgendy68
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Area covered
    Egypt
    Description

    This dataset is part of a project focused on detecting fake news and misleading content in Egyptian Arabic text from Twitter and Facebook. It contains 22,906 labeled text samples, with labels representing:

    f → Fake or misleading content

    r → Real or factual content

    idk → Unclear or ambiguous content

    🔍 Sources & Labeling The dataset is based on manually labeled samples and semi-supervised labeling using an XGBoost classifier trained on a small seed set. Over 20,000 examples were confidently pseudo-labeled using probability thresholds.
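Threshold-based pseudo-labeling of the kind described above can be sketched in a few lines. The probability scores and thresholds below are assumptions for illustration, not the project's actual values.

```python
import numpy as np

# probs: hypothetical P(fake) scores from the seed classifier.
probs = np.array([0.97, 0.55, 0.03, 0.88, 0.45, 0.99])
hi, lo = 0.9, 0.1  # assumed confidence thresholds

pseudo_fake = np.where(probs >= hi)[0]                 # confidently fake
pseudo_real = np.where(probs <= lo)[0]                 # confidently real
uncertain = np.where((probs > lo) & (probs < hi))[0]   # left unlabeled
```

Only the confident rows are added to the training set; uncertain rows (like the "idk" class here) are held back for manual review or later iterations.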

    The original texts are in Arabic, with content reflecting real social media discourse in Egypt, making this dataset particularly useful for research on:

    Arabic NLP

    Fake news detection

    Misinformation studies

    Social media analysis

    🧠 Applications This dataset can be used for training and evaluating:

    Text classification models

    Fake news detectors

    Sentiment analysis pipelines

    Arabic language models

    📌 Notes The dataset will be continuously refined, and future updates will include more manually verified labels. Please cite appropriately and reach out if using it in academic work.

  15. Table_1_Automatic text classification of drug-induced liver injury using...

    • datasetcatalog.nlm.nih.gov
    Updated Jun 3, 2024
    Cite
    Bao, Wenjun; Chen, Minjun; Thakkar, Shraddha; Wu, Yue; Tong, Weida; Wingerd, Byron; Liu, Zhichao; Mann, Nicholas; Wolfinger, Russell D.; Donnelly, Tom; Xu, Joshua; Pedersen, Thomas J. (2024). Table_1_Automatic text classification of drug-induced liver injury using document-term matrix and XGBoost.pdf [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001354925
    Explore at:
    Dataset updated
    Jun 3, 2024
    Authors
    Bao, Wenjun; Chen, Minjun; Thakkar, Shraddha; Wu, Yue; Tong, Weida; Wingerd, Byron; Liu, Zhichao; Mann, Nicholas; Wolfinger, Russell D.; Donnelly, Tom; Xu, Joshua; Pedersen, Thomas J.
    Description

    Introduction: Regulatory agencies generate a vast amount of textual data in the review process. For example, drug labeling serves as a valuable resource for regulatory agencies, such as the U.S. Food and Drug Administration (FDA) and the European Medicines Agency (EMA), to communicate drug safety and effectiveness information to healthcare professionals and patients. Drug labeling also serves as a resource for pharmacovigilance and drug safety research. Automated text classification would significantly improve the analysis of drug labeling documents and conserve reviewer resources.

    Methods: We utilized artificial intelligence in this study to classify drug-induced liver injury (DILI)-related content from drug labeling documents based on FDA's DILIrank dataset. We employed text mining and XGBoost models, using the Preferred Terms of medical queries for adverse-event standards to simplify the elimination of common words and phrases while retaining standard medical terms in the FDA and EMA drug label datasets. We then constructed a document-term matrix using weights computed by Term Frequency-Inverse Document Frequency (TF-IDF) for each included word/term/token.

    Results: The automatic text classification model exhibited robust performance in predicting DILI, achieving cross-validation AUC scores exceeding 0.90 both for drug labels from the FDA and EMA and for literature abstracts from the Critical Assessment of Massive Data Analysis (CAMDA).

    Discussion: Moreover, the text mining and XGBoost functions demonstrated in this study can be applied to other text processing and classification tasks.
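The TF-IDF document-term-matrix pipeline described above can be sketched in a few lines. This is a hedged illustration, not the authors' code: the example texts and labels are invented, and scikit-learn's GradientBoostingClassifier stands in for XGBoost.

```python
# TF-IDF document-term matrix feeding a boosted-tree text classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier

docs = [
    "hepatotoxicity reported with elevated transaminases",
    "no liver injury observed in clinical trials",
    "jaundice and hepatic failure in postmarketing reports",
    "well tolerated with mild headache only",
]
labels = [1, 0, 1, 0]  # 1 = DILI-related, 0 = not (invented labels)

vec = TfidfVectorizer(stop_words="english")   # drops common English words
X = vec.fit_transform(docs)                   # the document-term matrix

clf = GradientBoostingClassifier(random_state=0).fit(X.toarray(), labels)
print(clf.predict(vec.transform(["hepatic failure reported"]).toarray()))
```

Each matrix cell holds the TF-IDF weight of a term in a document, which is what the Methods paragraph refers to as the computed weights.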

  16.

    Data_Sheet_2_Machine Learning Prediction Models for Mechanically Ventilated...

    • datasetcatalog.nlm.nih.gov
    • frontiersin.figshare.com
    Updated Jul 1, 2021
    + more versions
    Cite
    Yao, Renqi; Li, Lin; Du, Bin; Li, Yang; Zhang, Jin; Wang, Guowei; Chen, Yan; Li, Wei; Chen, Ge; Xi, Xiuming; Jin, Xin; Liu, Shi; Ren, Chao; Huang, Huibin; Guo, Junyang; Guo, Qianqian; Yu, Qian; Zhu, Yibing; Zheng, Hua (2021). Data_Sheet_2_Machine Learning Prediction Models for Mechanically Ventilated Patients: Analyses of the MIMIC-III Database.PDF [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000884346
    Explore at:
    Dataset updated
    Jul 1, 2021
    Authors
    Yao, Renqi; Li, Lin; Du, Bin; Li, Yang; Zhang, Jin; Wang, Guowei; Chen, Yan; Li, Wei; Chen, Ge; Xi, Xiuming; Jin, Xin; Liu, Shi; Ren, Chao; Huang, Huibin; Guo, Junyang; Guo, Qianqian; Yu, Qian; Zhu, Yibing; Zheng, Hua
    Description

    Background: Mechanically ventilated patients in the intensive care unit (ICU) have high mortality rates. There are multiple prediction scores, such as the Simplified Acute Physiology Score II (SAPS II), Oxford Acute Severity of Illness Score (OASIS), and Sequential Organ Failure Assessment (SOFA), widely used in the general ICU population. We aimed to establish prediction scores for mechanically ventilated patients by combining these disease severity scores with other features available on the first day of admission.

    Methods: A retrospective administrative database study of the Medical Information Mart for Intensive Care (MIMIC-III) database was conducted. The exposures of interest consisted of demographics, pre-ICU comorbidity, ICU diagnosis, disease severity scores, vital signs, and laboratory test results on the first day of ICU admission. Hospital mortality was used as the outcome. We used the machine learning methods of k-nearest neighbors (KNN), logistic regression, bagging, decision tree, random forest, Extreme Gradient Boosting (XGBoost), and neural network for model establishment. A sample of 70% of the cohort was used for the training set; the remaining 30% was applied for testing. Areas under the receiver operating characteristic curves (AUCs) and calibration plots were constructed for the evaluation and comparison of the models' performance. The significance of the risk factors was identified through the models and the top factors were reported.

    Results: A total of 28,530 subjects were enrolled through screening of the MIMIC-III database. After data preprocessing, 25,659 adult patients with 66 predictors were included in the model analyses. With the training set, the KNN, logistic regression, decision tree, random forest, neural network, bagging, and XGBoost models were established; on the testing set they obtained AUCs of 0.806, 0.818, 0.743, 0.819, 0.780, 0.803, and 0.821, respectively. The calibration curves of all the models, except for the neural network, performed well. The XGBoost model performed best among the seven models. The top five predictors were age, respiratory dysfunction, SAPS II score, maximum hemoglobin, and minimum lactate.

    Conclusion: The current study indicates that models using risk factors available on the first day can be successfully established for predicting mortality in ventilated patients. The XGBoost model performs best among the seven machine learning models.
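The evaluation setup described above (70/30 split, AUC per model) can be sketched as follows. This is a hedged stand-in on synthetic data, not MIMIC-III: GradientBoostingClassifier substitutes for XGBoost, and the feature count and signal are invented.

```python
# 70/30 train/test split with AUC comparison across several classifiers.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=2000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
    "boosting": GradientBoostingClassifier(random_state=0),  # XGBoost stand-in
}
aucs = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # AUC uses the predicted probability of the positive class.
    aucs[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(aucs)
```

In the study, the same loop would run over all seven model families, with calibration plots produced alongside the AUCs.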

  17. S1 File -

    • plos.figshare.com
    zip
    Updated Feb 6, 2025
    + more versions
    Cite
    Jiyu Wang; Niaz Muhammad Shahani; Xigui Zheng; Jiang Hongwei; Xin Wei (2025). S1 File - [Dataset]. http://doi.org/10.1371/journal.pone.0314977.s001
    Explore at:
    zip
    Available download formats
    Dataset updated
    Feb 6, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Jiyu Wang; Niaz Muhammad Shahani; Xigui Zheng; Jiang Hongwei; Xin Wei
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Accurately evaluating earthquake-induced slope displacement is a key factor for designing slopes that can effectively respond to seismic activity. This study evaluates the capabilities of various machine learning models, including artificial neural network (ANN), support vector machine (SVM), random forest (RF), and extreme gradient boosting (XGBoost), in analyzing earthquake-induced slope displacement. A dataset of 45 samples was used, with 70% allocated for training and 30% for testing. To improve model robustness, repeated 5-fold cross-validation was applied. Among the models, XGBoost demonstrated superior predictive accuracy, with an R2 value of 0.99 on both the train and test data, outperforming ANN (train/test R2 of 0.63/0.80), SVM (0.87/0.86), and RF (0.94/0.87). Sensitivity analysis identified maximum horizontal acceleration (kmax = 0.714) as the most influential factor in slope displacement. The findings suggest that the XGBoost model developed in this study is highly effective in predicting earthquake-induced slope displacement, offering valuable insights for early warning systems and slope stability management.
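Repeated k-fold cross-validation, as used above to stabilize estimates on a small 45-sample dataset, can be sketched like this. The data here are synthetic and GradientBoostingRegressor stands in for XGBoost; only the sample count mirrors the study.

```python
# Repeated 5-fold cross-validation for a small regression dataset:
# each repeat reshuffles the fold assignment, averaging out split luck.
import numpy as np
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(45, 4))            # 45 samples, as in the study
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.05, size=45)

cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(GradientBoostingRegressor(random_state=0), X, y,
                         cv=cv, scoring="r2")
print(f"mean R2 over {len(scores)} folds: {scores.mean():.3f}")
```

With only 45 samples, a single 5-fold split leaves 9 test points per fold, so repeating the splits is what makes the reported R2 values trustworthy.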

  18.

    ASV tables of Myasthenia gravis (MG) and non-Myasthenia gravis

    • datadryad.org
    • data-staging.niaid.nih.gov
    • +2 more
    zip
    Updated Sep 22, 2023
    Cite
    Che-Cheng Chang; Hou-Chang Chiu; Wei-Ning Lin (2023). ASV tables of Myasthenia gravis (MG) and non-Myasthenia gravis [Dataset]. http://doi.org/10.5061/dryad.73n5tb32m
    Explore at:
    zip
    Available download formats
    Dataset updated
    Sep 22, 2023
    Dataset provided by
    Dryad
    Authors
    Che-Cheng Chang; Hou-Chang Chiu; Wei-Ning Lin
    Time period covered
    Jul 8, 2023
    Description

    In this prospective study, 19 individuals with MG and 10 individuals without were consecutively recruited from Fu-Jen Catholic University Hospital. Individuals were enrolled in the MG group if they 1) were given a diagnosis of MG on the basis of having the combination of symptoms and signs that are characteristic of muscle weakness with diurnal changes and either 2a) had a positive test result for specific autoantibodies or 2b) had a positive electrophysiological diagnosis obtained using single-fiber electromyography and repetitive nerve stimulation (Rousseff, 2021). None of the participants had received any abdominal chirurgic intervention; consumed antibiotics, probiotics, or antacids during the previous 6 months; or reported gastrointestinal symptoms during the previous year. This study was approved by the Regional Ethics Committee of Fu-Jen Catholic University Hospital and written informed consent was obtained from each participant (No. FJUH109043). All experiments were completed in...

  19.

    Data from: Assessing individual genetic susceptibility to metabolic...

    • datasetcatalog.nlm.nih.gov
    • tandf.figshare.com
    Updated Jun 22, 2025
    Cite
    Huang, Tao; Zheng, Xiujuan; Huang, Xirui; Xiong, Wenhui; Yang, Menghan; Wang, Simin; Li, Yuanyuan; Gao, Bizhen; Qiao, Shijie (2025). Assessing individual genetic susceptibility to metabolic syndrome: interpretable machine learning method [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0002050385
    Explore at:
    Dataset updated
    Jun 22, 2025
    Authors
    Huang, Tao; Zheng, Xiujuan; Huang, Xirui; Xiong, Wenhui; Yang, Menghan; Wang, Simin; Li, Yuanyuan; Gao, Bizhen; Qiao, Shijie
    Description

    Genome-wide association studies have provided profound insights into the genetic aetiology of metabolic syndrome (MetS). However, there is a lack of machine-learning (ML)-based predictive models to assess individual genetic susceptibility to MetS. This study utilized single-nucleotide polymorphisms (SNPs) as variables and employed ML-based genetic risk score (GRS) models to predict the occurrence of MetS, bringing it closer to clinical application. Feature selection was performed using the Least Absolute Shrinkage and Selection Operator (LASSO). Six ML algorithms were employed to construct GRS models. Fivefold cross-validation was utilized to aid in the internal validation of the models. The receiver operating characteristic (ROC) curve was used to select the better-performing GRS model. SHapley Additive exPlanations (SHAP) was then applied to interpret the model. After extracting the GRS, stratified analysis of BMI, age and gender was performed. Finally, these conventional risk factors and the GRS were integrated through multivariate logistic regression to establish a combined model. A total of 17 SNPs were selected for analysis. Among the GRS models, the extreme gradient boosting (XGBoost) model demonstrated superior discriminative performance (AUC = 0.837). The robustness of the XGBoost model was also validated through fivefold cross-validation (mean ROC-AUC = 0.706). The XGBoost-based SHAP algorithm not only elucidated the global effects of the 17 SNPs across all samples, but also described the interactions between SNPs, providing a visual representation of how SNPs impact the prediction of MetS in an individual. There was a strong correlation between GRS and MetS risk, particularly among young individuals, males and overweight individuals. Furthermore, the model combining conventional risk factors and the GRS exhibited excellent discriminative performance (AUC = 0.962) and outstanding robustness (mean ROC-AUC = 0.959). This study established a reliable XGBoost-based GRS model and a GRS prediction platform (https://metabolicsyndromeapps.shinyapps.io/geneticriskscore/) to assess individual genetic susceptibility to MetS. The model has high interpretability and can provide a personalized reference for determining the necessity of primary prevention measures for MetS. Additionally, there may be interactions between traditional risk factors and the GRS, and integrating both in a comprehensive model is useful for predicting MetS occurrence.
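The LASSO-then-boosting pipeline described above can be sketched as follows. This is a hedged illustration on synthetic genotype-like data (0/1/2 allele counts): LassoCV screens features by zeroing uninformative coefficients, and GradientBoostingClassifier stands in for XGBoost on the selected subset.

```python
# LASSO feature selection followed by a boosted-tree classifier.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(300, 40)).astype(float)  # SNP-like 0/1/2 codes
y = (X[:, 0] + X[:, 1] - X[:, 2] + rng.normal(size=300) > 1.0).astype(int)

# LASSO keeps only features with nonzero coefficients.
lasso = LassoCV(cv=5, random_state=0).fit(X, y)
selected = np.flatnonzero(lasso.coef_)
print(f"selected {len(selected)} of {X.shape[1]} features")

# Fit the boosted model on the reduced feature set.
clf = GradientBoostingClassifier(random_state=0).fit(X[:, selected], y)
```

In the study, this screening step reduced the SNP panel to the 17 variants used to build the GRS models.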

  20. Fake News Detection Dataset

    • kaggle.com
    zip
    Updated Apr 27, 2025
    Cite
    Mahdi Mashayekhi (2025). Fake News Detection Dataset [Dataset]. https://www.kaggle.com/datasets/mahdimashayekhi/fake-news-detection-dataset
    Explore at:
    zip (11735585 bytes)
    Available download formats
    Dataset updated
    Apr 27, 2025
    Authors
    Mahdi Mashayekhi
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    📚 Fake News Detection Dataset

    Overview

    This dataset is designed for practicing fake news detection using machine learning and natural language processing (NLP) techniques. It includes a rich collection of 20,000 news articles, carefully generated to simulate real-world data scenarios. Each record contains metadata about the article and a label indicating whether the news is real or fake.

    The dataset also intentionally includes around 5% missing values in some fields to simulate the challenges of handling incomplete data in real-life projects.

    Columns Description

    title: A short headline summarizing the article (around 6 words).

    text: The body of the news article (200–300 words on average).

    date: The publication date of the article, randomly selected over the past 3 years.

    source: The media source that published the article (e.g., BBC, CNN, Al Jazeera). May contain missing values (~5%).

    author: The author's full name. Some entries are missing (~5%) to simulate real-world incomplete data.

    category: The general category of the article (e.g., Politics, Health, Sports, Technology).

    label: The target label: real or fake news.

    Why Use This Dataset?

    Fake News Detection Practice: Perfect for binary classification tasks.

    NLP Preprocessing: Allows users to practice text cleaning, tokenization, vectorization, etc.

    Handling Missing Data: Some fields are incomplete to simulate real-world data challenges.

    Feature Engineering: Encourages creating new features from text and metadata.

    Balanced Labels: Realistic distribution of real and fake news for fair model training.

    Potential Use Cases

    Building and evaluating text classification models (e.g., Logistic Regression, Random Forests, XGBoost).

    Practicing NLP techniques like TF-IDF, Word2Vec, BERT embeddings.

    Performing exploratory data analysis (EDA) on news data.

    Developing pipelines for dealing with missing values and feature extraction.

    A Note on Data Quality

    This dataset has been synthetically generated to closely resemble real news articles. The diversity in titles, text, sources, and categories ensures that models trained on this dataset can generalize well to unseen, real-world data. However, since it is synthetic, it should not be used for production models or decision-making without careful validation.

    File Info

    Filename: fake_news_dataset.csv

    Size: 20,000 rows × 7 columns

    Missing Data: ~5% missing values in the source and author columns.
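A minimal sketch of loading the CSV and handling the ~5% missing source/author values described above. The filename and column names come from the card; since the file itself is not bundled here, a tiny stand-in frame is built inline, and the fill strategy ("unknown" sentinel) is one illustrative choice among several.

```python
# Load the dataset and fill missing source/author values before modeling.
import pandas as pd

# Stand-in for: df = pd.read_csv("fake_news_dataset.csv")
df = pd.DataFrame({
    "title": ["Markets rally on jobs data", "Miracle cure found"],
    "text": ["Stocks rose sharply...", "Doctors hate this trick..."],
    "source": ["BBC", None],        # ~5% missing in the real file
    "author": [None, "Jane Doe"],   # ~5% missing in the real file
    "label": ["real", "fake"],
})

print(df[["source", "author"]].isna().mean())  # fraction missing per column
df = df.fillna({"source": "unknown", "author": "unknown"})
```

Alternatives include dropping the incomplete rows or treating missingness itself as a feature, which is part of what the dataset is designed to let you practice.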
