61 datasets found
  1. MetaCost XGBoost Training and Evaluation Dataset with MATBLAB Codes and...

    • zenodo.org
    csv, zip
    Updated Jun 18, 2025
    + more versions
    Cite
    Abhishek Borah; Xavier Emery; Parag Jyoti Dutta (2025). MetaCost XGBoost Training and Evaluation Dataset with MATBLAB Codes and files for generating proxies [Dataset]. http://doi.org/10.5281/zenodo.15666484
    Available download formats
    csv, zip
    Dataset updated
    Jun 18, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Abhishek Borah; Xavier Emery; Parag Jyoti Dutta
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jun 14, 2025
    Description

    The dataset consists of two curated subsets designed for classifying alteration types from geochemical and proxy variables. The traditional dataset (Trad_Train.csv and Trad_Test.csv) is derived directly from the original complete geochemical dataset (alldata.csv), without any missing values, and includes the original geochemical features; it serves as a baseline for model training and evaluation. In contrast, the simulated dataset (proxies_alldata.csv) was generated with custom MATLAB scripts that transform the original geochemical features into proxy variables based on multiple geostatistical realizations. These proxies are expressed on a Gaussian scale and may include negative values due to normalization. The target variable, Alteration, was originally encoded as integers using the mapping 1 = AAA, 2 = IAA, 3 = PHY, 4 = PRO, 5 = PTS, and 6 = UAL. The simulated proxy data were split into the simulated train and test files (Simu_Train.csv and Simu_Test.csv) according to an encoded flag marking training (= 1) and testing (= 2) data. All supporting files, including datasets, intermediate outputs (e.g., PNGs, variograms), proxy outputs, and an executable for the confidence analysis routines, are included in the repository; the source code is hosted separately in a GitHub repository. Specifically, the FinalMatlabFiles.zip archive contains the raw input file alldata.csv used to generate proxies_alldata.csv, as well as Analysis1.csv and Analysis2.csv for performing the confidence analysis. To run the executable files in place of the .m scripts in MATLAB, users must install the MATLAB Runtime 2023b for Windows 64-bit, available at: https://ssd.mathworks.com/supportfiles/downloads/R2023b/Release/10/deployment_files/installer/complete/win64/MATLAB_Runtime_R2023b_Update_10_win64.zip.
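
    The label mapping and the train/test flag described above could be applied roughly as follows. This is a minimal sketch, not the authors' code: the column names `proxy_1`, `Alteration`, and `split_flag`, and the inline sample rows, are hypothetical stand-ins for the real contents of proxies_alldata.csv.

```python
import csv
import io

# Integer-to-label mapping stated in the dataset description.
ALTERATION_LABELS = {1: "AAA", 2: "IAA", 3: "PHY", 4: "PRO", 5: "PTS", 6: "UAL"}

# Tiny inline stand-in for proxies_alldata.csv. The column names and
# values are assumptions made for illustration only.
SAMPLE_CSV = """proxy_1,Alteration,split_flag
-0.42,1,1
1.10,4,2
0.07,6,1
"""


def split_by_flag(rows, flag_column="split_flag"):
    """Decode the Alteration code and split rows into train (flag 1) and test (flag 2)."""
    train, test = [], []
    for row in rows:
        record = dict(row)
        record["Alteration"] = ALTERATION_LABELS[int(row["Alteration"])]
        flag = int(row[flag_column])
        if flag == 1:
            train.append(record)
        elif flag == 2:
            test.append(record)
        else:
            raise ValueError(f"unexpected split flag: {flag}")
    return train, test


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
train_rows, test_rows = split_by_flag(rows)
print(len(train_rows), "training rows;", len(test_rows), "test rows")
```

    Raising on an unknown flag, rather than silently dropping the row, makes it obvious when a file uses a different encoding than the one assumed here.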

    Details on the input files for confidence analysis (Analysis1.csv and Analysis2.csv):
    These files contain two columns for the test data. Column 1 indicates a match or mismatch between the predicted and true alterations. Column 2 gives the probability of a correct classification, according to bootstrapped samples (Analysis1.csv) or to simulated proxies (Analysis2.csv).
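
    One way to use a two-column file of this kind is a simple reliability check: group the test samples by their stated probability and compare it with the observed match rate. The sketch below is an assumption about how such an analysis might look, not the routine shipped with the dataset; the sample records and the encoding of column 1 as a boolean are hypothetical.

```python
def reliability_table(records, n_bins=5):
    """Bin (matched, probability) pairs and compare stated vs. observed accuracy."""
    counts = [0] * n_bins
    matches = [0] * n_bins
    prob_sums = [0.0] * n_bins
    for matched, probability in records:
        # Probabilities equal to 1.0 fall into the last bin.
        index = min(int(probability * n_bins), n_bins - 1)
        counts[index] += 1
        matches[index] += 1 if matched else 0
        prob_sums[index] += probability
    table = []
    for index in range(n_bins):
        if counts[index] == 0:
            continue  # skip empty bins
        table.append({
            "bin": index,
            "n": counts[index],
            "mean_probability": prob_sums[index] / counts[index],
            "observed_match_rate": matches[index] / counts[index],
        })
    return table


# Hypothetical rows: (column 1 as match flag, column 2 as probability).
records = [(True, 0.9), (True, 0.95), (False, 0.85), (False, 0.3), (True, 0.35)]
table = reliability_table(records)
for row in table:
    print(row)
```

    If the probabilities are well calibrated, `mean_probability` and `observed_match_rate` should be close within each bin once enough samples are available.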
  2. Training dataset portioning results using XGBoost.

    • plos.figshare.com
    xls
    Updated Aug 26, 2024
    + more versions
    Cite
    Pavithra Mahesh; Rajkumar Soundrapandiyan (2024). Training dataset portioning results using XGBoost. [Dataset]. http://doi.org/10.1371/journal.pone.0291928.t003
    Available download formats
    xls
    Dataset updated
    Aug 26, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Pavithra Mahesh; Rajkumar Soundrapandiyan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Training dataset portioning results using XGBoost.

  3. RNA dataset to train XGBoost model

    • zenodo.org
    txt
    Updated Nov 2, 2021
    Cite
    Brian Lam; Alberto Leon; Ugljesa Djuric; Phedias Diamandis (2021). RNA dataset to train XGBoost model [Dataset]. http://doi.org/10.5281/zenodo.5593517
    Available download formats
    txt
    Dataset updated
    Nov 2, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Brian Lam; Alberto Leon; Ugljesa Djuric; Phedias Diamandis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository includes the RNA-seq dataset from 27 GBM samples, as published in this manuscript:

    Topographic mapping of the glioblastoma proteome reveals a triple axis model of intra-tumoral heterogeneity
    Lam KHB, Leon AJ, Hui W, Lee SCE, Batruch I, Faust K, Koritzinsky M, Richer M, Djuric U, Diamandis P (under review)

  4. Data Sheet 1_Tunnel water inflow prediction using explainable machine...

    • frontiersin.figshare.com
    zip
    Updated Apr 25, 2025
    Cite
    Shengdong Ju; Guangzhao Ou; Tao Peng; Yanning Wang; Quanlin Song; Peng Guan (2025). Data Sheet 1_Tunnel water inflow prediction using explainable machine learning and augmented partially missing dataset.zip [Dataset]. http://doi.org/10.3389/feart.2025.1590203.s001
    Available download formats
    zip
    Dataset updated
    Apr 25, 2025
    Dataset provided by
    Frontiers
    Authors
    Shengdong Ju; Guangzhao Ou; Tao Peng; Yanning Wang; Quanlin Song; Peng Guan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Accurate prediction of water inrush volumes is essential for safeguarding tunnel construction operations. This study proposes a method for predicting tunnel water inrush volumes, leveraging the eXtreme Gradient Boosting (XGBoost) model optimized with Bayesian techniques. To maximize the utility of available data, 654 datasets with missing values were imputed and augmented, forming a robust dataset for the training and validation of the Bayesian optimized XGBoost (BO-XGBoost) model. Furthermore, the SHapley Additive exPlanations (SHAP) method was employed to elucidate the contribution of each input feature to the predictive outcomes. The results indicate that: (1) The constructed BO-XGBoost model exhibited exceptionally high predictive accuracy on the test set, with a root mean square error (RMSE) of 7.5603, mean absolute error (MAE) of 3.2940, mean absolute percentage error (MAPE) of 4.51%, and coefficient of determination (R²) of 0.9755; (2) Compared to the predictive performance of support vector machine (SVR), decision tree (DT), and random forest (RF) models, the BO-XGBoost model demonstrates the highest R² values and the smallest prediction error; (3) The input feature importance yielded by SHAP is groundwater level (h) > water-producing characteristics (W) > tunnel burial depth (H) > rock mass quality index (RQD). The proposed BO-XGBoost model exhibited exceptionally high predictive accuracy on the tunnel water inrush volume prediction dataset, thereby aiding managers in making informed decisions to mitigate water inrush risks and ensuring the safe and efficient advancement of tunnel projects.
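
    For readers who want to reproduce the four reported error measures on their own predictions, the standard definitions of RMSE, MAE, MAPE, and R² can be sketched as below. The sample values are hypothetical and are not taken from the tunnel dataset.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, MAPE (in percent), and R² for paired observations."""
    n = len(y_true)
    errors = [pred - true for true, pred in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((true - mean_true) ** 2 for true in y_true)  # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        # MAPE is undefined when a true value is zero; the sample avoids that.
        "mape": 100.0 * sum(abs(e / true) for e, true in zip(errors, y_true)) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }


# Hypothetical observed and predicted inflow volumes.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
metrics = regression_metrics(observed, predicted)
print(metrics)
```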

  5. Table_5_Preliminary prediction of semen quality based on modifiable...

    • frontiersin.figshare.com
    docx
    Updated Jun 16, 2023
    + more versions
    Cite
    Mingjuan Zhou; Tianci Yao; Jian Li; Hui Hui; Weimin Fan; Yunfeng Guan; Aijun Zhang; Bufang Xu (2023). Table_5_Preliminary prediction of semen quality based on modifiable lifestyle factors by using the XGBoost algorithm.docx [Dataset]. http://doi.org/10.3389/fmed.2022.811890.s006
    Available download formats
    docx
    Dataset updated
    Jun 16, 2023
    Dataset provided by
    Frontiers
    Authors
    Mingjuan Zhou; Tianci Yao; Jian Li; Hui Hui; Weimin Fan; Yunfeng Guan; Aijun Zhang; Bufang Xu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    Semen quality has decreased gradually in recent years, and lifestyle changes are among the primary causes for this issue. Thus far, the specific lifestyle factors affecting semen quality remain to be elucidated.

    Materials and methods

    In this study, data on the following factors were collected from 5,109 men examined at our reproductive medicine center: 10 lifestyle factors that potentially affect semen quality (smoking status, alcohol consumption, staying up late, sleeplessness, consumption of pungent food, intensity of sports activity, sedentary lifestyle, working in hot conditions, sauna use in the last 3 months, and exposure to radioactivity); general factors including age, abstinence period, and season of semen examination; and comprehensive semen parameters [semen volume, sperm concentration, progressive and total sperm motility, sperm morphology, and DNA fragmentation index (DFI)]. Then, machine learning with the XGBoost algorithm was applied to establish a primary prediction model by using the collected data. Furthermore, the accuracy of the model was verified via multiple logistic regression following k-fold cross-validation analyses.

    Results

    The results indicated that for semen volume, sperm concentration, progressive and total sperm motility, and DFI, the area under the curve (AUC) values ranged from 0.648 to 0.697, while the AUC for sperm morphology was only 0.506. Among the 13 factors, smoking status was the major factor affecting semen volume, sperm concentration, and progressive and total sperm motility. Age was the most important factor affecting DFI. Logistic regression combined with cross-validation analysis revealed similar results. Furthermore, it showed that heavy smoking (>20 cigarettes/day) had an overall negative effect on semen volume, sperm concentration, and progressive and total sperm motility (OR = 4.69, 6.97, 11.16, and 10.35, respectively), while age of >35 years was associated with increased DFI (OR = 5.47).

    Conclusion

    The preliminary lifestyle-based model developed for semen quality prediction by using the XGBoost algorithm showed potential for clinical application and further optimization with larger training datasets.
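
    To put the reported AUC values in context, the AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, so 0.5 corresponds to chance and the morphology result of 0.506 is close to it. The following is a minimal sketch of that pairwise definition with hypothetical labels and scores; it is not the study's evaluation code.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical outcome labels (1 = abnormal parameter) and model scores.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.7, 0.4]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

    The pairwise loop is quadratic in the number of samples; it is used here only because it mirrors the definition directly.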

  6. Rainfall Prediction: Comparison of 7 Popular Models

    • test.researchdata.tuwien.ac.at
    bin, png +1
    Updated Apr 28, 2025
    Cite
    Kaya Ali Kus (2025). Rainfall Prediction: Comparison of 7 Popular Models [Dataset]. http://doi.org/10.70124/p7rh4-0g783
    Available download formats
    png, text/markdown, bin
    Dataset updated
    Apr 28, 2025
    Dataset provided by
    TU Wien
    Authors
    Kaya Ali Kus
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 28, 2025
    Description

    Rainfall Prediction using 7 Popular Models

    Context and Methodology

    Research Domain/Project:

    This dataset is part of a machine learning project focused on predicting rainfall, a critical task for sectors like agriculture, water resource management, and disaster prevention. The project employs machine learning algorithms to forecast rainfall occurrences based on historical weather data, including features like temperature, humidity, and pressure.

    Purpose:

    The primary goal of the dataset is to train multiple machine learning models to predict rainfall and compare their performances. The insights gained will help identify the most accurate models for real-world predictions of rainfall events.

    Creation Process:

    The dataset is derived from various historical weather observations, including temperature, humidity, wind speed, and pressure, collected by weather stations across Australia. These observations are used as inputs for training machine learning models. The dataset is publicly available on platforms like Kaggle and is often used in competitions and research to advance predictive analytics in meteorology.

    Technical Details


    Dataset Structure:

    The dataset consists of weather data from multiple Australian weather stations, spanning various time periods. Key features include:

    Temperature
    Humidity
    Wind Speed
    Pressure
    Rainfall (target variable)
    These features are tracked for each weather station over different times, with the goal of predicting rainfall.

    Software Requirements:

    Python: The primary programming language for data analysis and machine learning.
    scikit-learn: For implementing machine learning models.
    XGBoost, LightGBM, and CatBoost: Popular libraries for building more advanced ensemble models.
    Matplotlib/Seaborn: For data visualization.
    These libraries and tools help in data manipulation, modeling, evaluation, and visualization of results.
    DBRepo Authorization: Required to access datasets via the DBRepo API for dataset retrieval.

    Additional Resources

    Model Comparison Charts: The project includes output charts comparing the performance of seven popular machine learning models.
    Trained Models (.pkl files): Pre-trained models are saved as .pkl files for reuse without retraining.
    Documentation and Code: A Jupyter notebook guides through the process of data analysis, model training, and evaluation.
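
    Reusing a model stored as a .pkl file generally means a save and load round trip with Python's `pickle` module. The sketch below illustrates only that mechanism: the file name `example_model.pkl` and the dictionary used as a stand-in "model" are hypothetical, and the real files in this record presumably hold fitted estimator objects that need their original libraries installed.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a trained model: a plain dict of parameters.
model = {"name": "humidity_threshold_rule", "humidity_threshold": 70.0}


def predict_rain(model, humidity):
    """Toy rule: predict rain when humidity reaches the stored threshold."""
    return humidity >= model["humidity_threshold"]


with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "example_model.pkl")
    # Save once after training...
    with open(model_path, "wb") as handle:
        pickle.dump(model, handle)
    # ...and load later without retraining.
    with open(model_path, "rb") as handle:
        restored = pickle.load(handle)

print(predict_rain(restored, 85.0))
```

    Unpickling can execute arbitrary code, so .pkl files should only be loaded from sources you trust.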

  7. Data for: Advances and critical assessment of machine learning techniques...

    • datadryad.org
    • search.dataone.org
    • +1more
    zip
    Updated Mar 3, 2023
    Cite
    Lukas Bucinsky; Marián Gall; Ján Matúška; Michal Pitoňák; Marek Štekláč (2023). Data for: Advances and critical assessment of machine learning techniques for prediction of docking scores [Dataset]. http://doi.org/10.5061/dryad.zgmsbccg7
    Available download formats
    zip
    Dataset updated
    Mar 3, 2023
    Dataset provided by
    Dryad
    Authors
    Lukas Bucinsky; Marián Gall; Ján Matúška; Michal Pitoňák; Marek Štekláč
    Time period covered
    Feb 24, 2023
    Description

    Semi-flexible docking was performed using AutoDock Vina 1.2.2 software on the SARS-CoV-2 main protease Mpro (PDB ID: 6WQF). Two data sets are provided in the xyz format containing the AutoDock Vina docking scores. These files were used as input and/or reference in the machine learning models using TensorFlow, XGBoost, and SchNetPack to study their docking score prediction capability. The first data set originally contained 60,411 in-vivo labeled compounds selected for the training of ML models. The second data set, denoted as in-vitro-only, originally contained 175,696 compounds active or assumed to be active at 10 μM or less in a direct binding assay. These sets were downloaded on the 10th of December 2021 from the ZINC15 database. Four compounds in the in-vivo set and 12 in the in-vitro-only set were left out of consideration due to the presence of Si atoms. Compounds with no charges assigned in mol2 files were excluded as well (523 compounds in the in-vivo and 1,666 in the in-vitro-only...

  8. Machine learning predicted AGNs in HSC-Wide region

    • b2find.eudat.eu
    Updated Sep 4, 2020
    Cite
    The citation is currently not available for this dataset.
    Dataset updated
    Sep 4, 2020
    Description

    We investigate the performance of machine-learning techniques in classifying active galactic nuclei (AGNs), including X-ray-selected AGNs (XAGNs), infrared-selected AGNs (IRAGNs), and radio-selected AGNs (RAGNs). Using the known physical parameters in the Cosmic Evolution Survey (COSMOS) field, we are able to create quality training samples in the region of the Hyper Suprime-Cam (HSC) survey. We compare several Python packages (e.g., scikit-learn, Keras, and XGBoost) and use XGBoost to identify AGNs and show the performance (e.g., accuracy, precision, recall, F1 score, and AUROC). Our results indicate that the performance is high for bright XAGN and IRAGN host galaxies. The combination of the HSC (optical) information with the Wide-field Infrared Survey Explorer band 1 and band 2 (near-infrared) information performs well to identify AGN hosts. For both type 1 (broad-line) XAGNs and type 1 (unobscured) IRAGNs, the performance is very good by using optical-to-infrared information. These results can apply to the five-band data from the wide regions of the HSC survey and future all-sky surveys. Cone search capability for table J/ApJ/920/68/table7 (AGN candidates in HSC-Wide region for 112609 objects)
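
    A cone search selects catalogue objects whose angular separation from a given sky position is within a chosen radius. The sketch below shows the idea on an in-memory list using the haversine formula; it does not call the catalogue service itself, and the object names and coordinates are hypothetical.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two (RA, Dec) positions in degrees."""
    ra1, dec1, ra2, dec2 = (math.radians(v) for v in (ra1, dec1, ra2, dec2))
    d_ra = ra2 - ra1
    d_dec = dec2 - dec1
    # Haversine formula, numerically stable for small separations.
    a = math.sin(d_dec / 2) ** 2 + math.cos(dec1) * math.cos(dec2) * math.sin(d_ra / 2) ** 2
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def cone_search(catalog, ra0, dec0, radius_deg):
    """Return catalogue entries within radius_deg of the position (ra0, dec0)."""
    return [
        obj for obj in catalog
        if angular_separation_deg(obj["ra"], obj["dec"], ra0, dec0) <= radius_deg
    ]


# Hypothetical catalogue entries (coordinates in degrees).
catalog = [
    {"name": "A", "ra": 150.10, "dec": 2.20},
    {"name": "B", "ra": 150.15, "dec": 2.25},
    {"name": "C", "ra": 151.50, "dec": 2.20},
]
hits = cone_search(catalog, ra0=150.10, dec0=2.20, radius_deg=0.1)
print([obj["name"] for obj in hits])
```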

  9. Vocalizations in the plains zebra (Equus quagga)

    • data.niaid.nih.gov
    • search.dataone.org
    • +1more
    zip
    Updated Jun 21, 2024
    Cite
    Bing Xie; Virgile Daunay; Troels Petersen; Elodie Briefer (2024). Vocalizations in the plains zebra (Equus quagga) [Dataset]. http://doi.org/10.5061/dryad.v9s4mw73w
    Available download formats
    zip
    Dataset updated
    Jun 21, 2024
    Dataset provided by
    Université Lumière Lyon 2
    University of Copenhagen
    Authors
    Bing Xie; Virgile Daunay; Troels Petersen; Elodie Briefer
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Acoustic signals are vital in animal communication, and quantifying them is fundamental for understanding animal behaviour and ecology. Vocalisations can be classified into acoustically and functionally or contextually distinct categories, but establishing these categories can be challenging. Newly developed methods, such as machine learning, can provide solutions for classification tasks. The plains zebra is known for its loud and specific vocalisations, yet limited knowledge exists on the structure and information content of its vocalisations. In this study, we employed both feature-based and spectrogram-based algorithms, incorporating supervised and unsupervised machine learning methods to enhance robustness in categorising zebra vocalisation types. Additionally, we implemented a permuted discriminant function analysis (pDFA) to examine the individual identity information contained in the identified vocalisation types. The findings revealed at least four distinct vocalisation types, the “snort”, the “soft snort”, the “squeal”, and the “quagga quagga”, with individual differences observed mostly in snorts, and to a lesser extent in squeals. Analyses based on acoustic features outperformed those based on spectrograms, but each excelled in characterising different vocalisation types. We thus recommend the combined use of these two approaches. This study offers valuable insights into plains zebra vocalisation, with implications for future comprehensive explorations in animal communication.

    Methods

    Data collection and sampling

    We collected data in three locations, in Denmark and South Africa: 1) 10 months between December 2020 and July 2021 and between September and December 2021, at Pilanesberg National Park (hereafter “PNP”), South Africa, covering both the dry season (i.e. from May to September) and the wet season (i.e. from October to April) (1); 2) 16 days between May and June 2019, and 33 days between February and May 2022, at Knuthenborg Safari Park (hereafter “KSP”), Denmark, covering both periods before the park’s opening for tourists (i.e. from November to March) and after (i.e. from April to October); 3) 4 days in August 2019 at Givskud Zoo (hereafter “GKZ”), Denmark. For all places and periods, three types of data were collected as follows: 1) Pictures were taken of each individual from both sides using a camera (Nikon COOLPIX P950); 2) Contexts of vocal production were recorded either through notes (in the first period of KSP and in GKZ) or videos (in the second period of KSP and in PNP) filmed by a video camera recorder (Sony HDRPJ410 HD); 3) Audio recordings were collected using a directional microphone (Sennheiser MKH-70 P48, with a frequency response of 50 - 20000 Hz (+/- 2.5 dB)) linked to an audio recorder (Marantz PMD661 MKIII). Six zebras housed in GKZ were recorded while being separated from one another into three enclosures (the stable, the small enclosure and the savannah) manually by the zookeeper for management purposes, which triggered vocalisations. These vocalisations, along with other types of data, were recorded at distances of 5 - 30 m. In KSP, 15 - 18 zebras (population changed due to newborns, deaths, or removal of adult males) were living with other herbivores in a 0.14 km² savannah. There, we approached the zebras by driving down the road until approximately 7 - 40 m, at which point spontaneous vocalisations and other information were collected. This distance allowed us to collect good quality recordings without eliciting any obvious reactions from the zebras to our presence. Finally, PNP is a 580 km² national park, with approximately 800 - 2000 zebras (2). In this park, we drove on the road and parked at distances of 10 - 80 m when encountering zebras, where all data, including spontaneous vocalisations, were recorded.

    Data processing

    Individual zebras were manually identified based on the pictures collected from KSP and GKZ (15-18 and 6 zebras, respectively). In PNP, the animals present in the pictures were individually identified using WildMe (https://zebra.wildme.org/), a web-based machine learning platform facilitating individual recognition. All zebra pictures were uploaded to the platform for a full comparison through the algorithm. The resulting matching candidates were then determined by manually reviewing the output. Audio files (sampling rate: 44100 Hz) were saved at 16-bit amplitude resolution in WAV format. We annotated zebra vocalisations, along with context and individuals emitting the vocalisations, using Audacity software (version 3.3.3) (3). Vocalisations were first subjectively labelled as five vocalisation types based on both audio and spectrogram examinations (i.e. visual inspection) (Table 1 and Figure 1). Among these types, the “squeal-snort” was excluded from further analysis, as the focus of this study was on individual vocalisation types instead of combinations.

    Acoustic analysis

    We extracted vocalisations of good quality, defined as vocalisations with clear spectrograms, low background noise, and no overlap with other sounds, and saved them as distinct audio files. For the individual distinctiveness analysis, we excluded individuals with fewer than 5 vocalisations of each type, to avoid strong imbalance, resulting in 359 snorts from 28 individuals and 138 squeals from 14 individuals (Table S3 and S4) (4, 5). The individuality content of quagga quagga and soft snorts could not be explored, due to insufficient individual data. For vocal repertoire analysis, we excluded vocalisations longer than 1.25 s to improve spectrogram-based analysis, following Thomas et al. (6). In total, we gathered 678 vocalisations for the spectrogram-based vocal repertoire analysis, including 117 quagga quagga, 204 snorts, 161 squeals and 196 soft snorts (Table S2). Among these vocalisations, six squeals were excluded in the acoustic feature-based vocal repertoire analysis, due to missing data for one of the features (amplitude modulation extent). All calls were first high-pass filtered above 30 Hz for snorts and soft snorts, above 500 Hz for squeals and above 600 Hz for quagga quagga (i.e. above the average minimum fundamental frequency of these vocalisations; Table S1). We then extracted 12 acoustic features from vocalisations for the individual distinctiveness analysis (Table 2), using a custom script (7-10) in Praat software (11). Eight of these features were also extracted for the vocal repertoire analysis (i.e. all features except those related to the fundamental frequency, which were not available for soft snorts that are not tonal). Additionally, to explore the vocal repertoire, mel-spectrograms were generated from audio files using STFT, following Thomas et al. (6). Spectrograms were padded with zeros according to the length of the longest audio file to ensure uniform length for all audio files, and time-shift adjustments were implemented to align the starting points of vocalisations (6).

    Statistical analyses

    a. Vocal repertoire

    We applied both supervised and unsupervised machine learning to both acoustic features and spectrograms using Python (version 3.9.7) (12).

    Supervised method. To define the vocal repertoire via an acoustic feature-based approach, we deployed feature importance analysis by SHapley Additive exPlanation (SHAP) (13), using the shap library (version 0.40.0) (14). Six features with SHAP value > 1 were selected (Figure S1). We split the selected features with vocalisation type labels into a training dataset (70%) and a testing dataset (30%) using the Scikit-learn library (function: train_test_split, version 0.24.2) (15). Subsequently, we employed a supervised approach, the eXtreme Gradient Boosting (XGBoost) classifier in the xgboost library (version 1.6.0) (16), to train the model. Three hyperparameters were tuned on the training dataset to reach maximum accuracy using the optuna library (direction = minimize, n_trials = 200, version 2.10.0) (17), incorporating cross validation (five folds), which resulted in the best model (Table S5). To define the vocal repertoire via a spectrogram-based approach, we split the dataset into a training set (49%), a validation set (21%), and a test dataset (30%), using the Scikit-learn library (function: train_test_split, version 0.24.2) (15). We implemented a Convolutional Neural Network (CNN) architecture using the tensorflow library (version 2.8.0) (18). The architecture was constructed (Table S6) and seven hyperparameters were tuned to reach maximum accuracy on the training and validation dataset using the optuna library (direction = minimize, n_trials = 50, version 2.10.0) (17), which resulted in the best model (Table S6). We evaluated model performance for both feature-based and spectrogram-based classification models through predictions on each test dataset, including the test accuracy across all call types (number of correct predictions / total number of predictions), and three metrics for each call type: precision (true positives / (true positives + false positives)), recall (true positives / (true positives + false negatives)), and the harmonic mean of precision and recall, the f1-score (2 × (precision × recall) / (precision + recall)) (19). We also plotted the confusion matrix between true classes and predicted classes.

    Unsupervised method. For both acoustic feature-based and spectrogram-based analyses, we applied Uniform Manifold Approximation and Projection (UMAP) in the umap library (function: umap.UMAP, n_neighbors = 200 and local_connectivity = 150 for acoustic feature-based analysis, and metric = calc_timeshift_pad and min_dist = 0 for spectrogram-based analysis, version 0.1.1) (20), to reduce variables into a 2-dimensional latent space. We also implemented the k-means clustering algorithm for both analyses from the Scikit-learn library (function: kmeans.fit, version 0.24.2) (15), to identify distinct clusters using the elbow method (21). The
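
    The evaluation metrics defined in the supervised-method paragraph (overall accuracy, plus per-type precision, recall, and f1-score) can be computed from paired true and predicted labels as sketched below. The label sequences are hypothetical and only reuse the four vocalisation type names; this is not the study's evaluation script.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return overall accuracy and per-class precision, recall, and f1-score."""
    labels = sorted(set(y_true) | set(y_pred))
    confusion = Counter(zip(y_true, y_pred))  # (true, predicted) -> count
    accuracy = sum(confusion[(label, label)] for label in labels) / len(y_true)
    report = {}
    for label in labels:
        tp = confusion[(label, label)]
        fp = sum(confusion[(other, label)] for other in labels if other != label)
        fn = sum(confusion[(label, other)] for other in labels if other != label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return accuracy, report


# Hypothetical true and predicted vocalisation types for eight test calls.
y_true = ["snort", "snort", "snort", "squeal", "squeal",
          "soft snort", "soft snort", "quagga quagga"]
y_pred = ["snort", "snort", "soft snort", "squeal", "snort",
          "soft snort", "soft snort", "quagga quagga"]
accuracy, report = per_class_metrics(y_true, y_pred)
print(accuracy)
for label, scores in report.items():
    print(label, scores)
```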

  10. Heart Disease Risk Prediction Dataset

    • kaggle.com
    Updated Apr 3, 2025
    Cite
    Şahide ŞEKER (2025). Heart Disease Risk Prediction Dataset [Dataset]. https://www.kaggle.com/datasets/sahideseker/heart-disease-risk-prediction-dataset/discussion
    Available metadata format
    Croissant, a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 3, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Şahide ŞEKER
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    🇬🇧 English:

    This synthetic dataset helps build machine learning models to predict whether a patient is at risk of heart disease. It includes patient attributes such as age, cholesterol, blood pressure, sex, and diabetes history.

    Use this dataset to:

    • Train classification models (e.g., XGBoost, Decision Tree)
    • Analyze the relationship between health metrics and heart disease
    • Practice healthcare-related ML without privacy concerns

    Features:

    • age: Age of the patient
    • cholesterol: Cholesterol level (mg/dL)
    • bp: Blood pressure (mmHg)
    • sex: Biological sex (Male/Female)
    • diabetes: Diabetes status (Yes/No)
    • heart_disease: Presence of heart disease (1 = Yes, 0 = No)
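
    Before training a classifier on these columns, the text fields need to be converted to numbers. The sketch below shows one plausible encoding based on the feature list above; the sample row and the choice of 1.0 for "Male" and "Yes" are assumptions, not part of the dataset documentation.

```python
def parse_patient(row):
    """Convert one CSV-style row of strings into typed values."""
    return {
        "age": int(row["age"]),
        "cholesterol": float(row["cholesterol"]),  # mg/dL
        "bp": float(row["bp"]),                    # mmHg
        "sex": row["sex"],                         # "Male" or "Female"
        "diabetes": row["diabetes"] == "Yes",
        "heart_disease": int(row["heart_disease"]),  # 1 = Yes, 0 = No
    }


def to_features(patient):
    """Encode a parsed patient as a numeric feature vector (target excluded)."""
    return [
        float(patient["age"]),
        patient["cholesterol"],
        patient["bp"],
        1.0 if patient["sex"] == "Male" else 0.0,  # assumed encoding
        1.0 if patient["diabetes"] else 0.0,
    ]


# Hypothetical row, shaped like the documented columns.
raw_row = {"age": "54", "cholesterol": "233.0", "bp": "130",
           "sex": "Female", "diabetes": "Yes", "heart_disease": "1"}
patient = parse_patient(raw_row)
features = to_features(patient)
target = patient["heart_disease"]
print(features, target)
```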

    🇹🇷 Turkish (translated):

    This synthetic dataset is designed for developing machine learning models that predict whether patients are at risk of heart disease. It includes features such as age, cholesterol, blood pressure, sex, and diabetes information.

    With this dataset:

    • Classification models such as XGBoost and Decision Tree can be trained
    • Risk analysis can be performed on health data
    • Health-focused projects can be developed without privacy concerns
  11. Predicted occurrence probability for ticks in Great Britain (2014 to 2021)...

    • data.europa.eu
    • data.niaid.nih.gov
    • +1more
    unknown
    Updated Feb 9, 2023
    Cite
    Zenodo (2023). Predicted occurrence probability for ticks in Great Britain (2014 to 2021) at 1 km spatial resolution [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-7625175?locale=en
    Available download formats
    unknown (5810)
    Dataset updated
    Feb 9, 2023
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United Kingdom
    Description

    The dataset contains predictions of the occurrence probability of ticks in Great Britain (2014 to 2021) at 1 km spatial resolution, plus all covariate layers used for modeling. Over seven million electronic health records (EHRs), of which 11,741 reported tick attachment, were used to evaluate the climate, environmental and animal host factors affecting the risk of tick attachment in cats and dogs in Great Britain (GB). The tick presence/absence EHRs for dogs and cats were overlaid with spatiotemporal time series of climatic, vegetation, human influence, hydrological and terrain variables (slope, wetness index) to produce a spatiotemporal regression matrix. An Ensemble Machine Learning framework was used to fine-tune hyperparameters for the Random Forest (classif.ranger), gradient boosting (classif.xgboost) and GLM-net (classif.glmnet) algorithms, which were then combined into a final ensemble meta-learner that predicts the probability of occurrence of ticks across GB at monthly intervals.

    Files:
    • gb1km_covariates.zip contains all covariate layers used for modeling tick dynamics, as GeoTIFFs (time series);
    • data_1km_2014_M01.rds contains all covariates for January 2014, prepared as a SpatialGridDataFrame (R data object).

    File name codes indicate, e.g.:
    • "monthly.tick.prob_savsnet.mar_p_1km_s_2014_2021" = monthly occurrence probability for January based on the training data from 2014 to 2021;
    • "monthly.tick.prob_savsnet.oct_md_1km_s_20211001_20211031" = monthly prediction (model) error derived as the standard deviation from multiple base learners.

    The dataset is described in detail in the following publication: Arsevska, E., Hengl, T., Singelton, D. et al. (2023?) Risk factors for tick attachment in companion animals in Great Britain: a spatiotemporal analysis covering 2014–2021. Submitted to Parasites & Vectors (in review).

    The model summary shows:

    Call:
    stats::glm(formula = f, family = "binomial", data = getTaskData(.task, .subset), weights = .weights, model = FALSE)

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -1.4749  -0.0557  -0.0471  -0.0430   3.7611

    Coefficients:
                      Estimate  Std. Error   z value  Pr(>|z|)
    (Intercept)       -7.64495     0.02095  -364.957   < 2e-16 ***
    classif.ranger     4.95061     0.63615     7.782  7.13e-15 ***
    classif.xgboost  189.75543     5.53109    34.307   < 2e-16 ***
    classif.glmnet   140.24208     5.05375    27.750   < 2e-16 ***
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for binomial family taken to be 1)

    Null deviance: 170604 on 7303013 degrees of freedom
    Residual deviance: 162571 on 7303010 degrees of freedom
    AIC: 162579

    Number of Fisher Scoring iterations: 9

    Acknowledgements: We are grateful to data providers in veterinary practice (VetSolutions, Teleos, CVS, and other practitioners). We are grateful to the INRAE MIGALE bioinformatics facility (MIGALE, INRAE, 2020. Migale Bioinformatics Facility, doi: 10.15454/1.5572390655343293E12) for providing computing resources. We are also grateful for the help and support provided by SAVSNET team members Bethaney Brant, Susan Bolan and Steven Smyth. This study was funded mainly by a grant from the Biotechnology and Biological Sciences Research Council, BB/NO19547/1, and the British Small Animal Veterinary Association (BSAVA). The research was partly funded by the National Institute for Health Research Health Protection Research Unit (NIHR HPRU) in Emerging and Zoonotic Infections at the University of Liverpool in partnership with Public Health England (PHE) and Liverpool School of Tropical Medicine (LSTM). This work has been partially funded by the "Monitoring outbreak events for disease surveillance in a data science context" (MOOD) project from the European Union's Horizon 2020 research and innovation program under grant agreement No. 874850 (https://mood-h2020.eu/).

    The views expressed are those of the authors and not necessarily those of the NHS, the NIHR, the Department of Health or Public Health England.
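The meta-learner reported in the model summary is a binomial GLM, so its prediction is a logistic function of a weighted sum of the three base-learner outputs. The sketch below reproduces that calculation using the coefficients quoted in the summary. It assumes that the three predictors are the base learners' predicted probabilities of tick presence; the example input values are hypothetical and are not taken from the dataset.

```python
import math

# Coefficients copied from the GLM summary quoted in the description.
INTERCEPT = -7.64495
COEFFICIENTS = {
    "classif.ranger": 4.95061,
    "classif.xgboost": 189.75543,
    "classif.glmnet": 140.24208,
}


def ensemble_probability(base_probs):
    """Combine base-learner probabilities with the logistic meta-learner.

    Assumption: each value in base_probs is that learner's predicted
    probability of tick presence for one pixel and month.
    """
    linear = INTERCEPT + sum(
        COEFFICIENTS[name] * prob for name, prob in base_probs.items()
    )
    # Inverse logit link of the binomial GLM.
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical base-learner outputs for a single 1 km pixel.
example = {
    "classif.ranger": 0.01,
    "classif.xgboost": 0.02,
    "classif.glmnet": 0.015,
}
p = ensemble_probability(example)
print(f"ensemble probability: {p:.4f}")
```

The large xgboost and glmnet coefficients suggest that those learners output small probabilities for this rare event, so even small changes in their outputs move the ensemble prediction substantially.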

  12. Data from: Spuspo: Spatially Partitioned Unsupervised Segmentation Parameter...

    • explore.openaire.eu
    • zenodo.org
    Updated Aug 6, 2018
    Cite
    Stefanos Georganos; Tais Grippa; Moritz Lennert; Brian Alan Johnson; Sabine Vanhuysse; Eléonore Wolff (2018). Spuspo: Spatially Partitioned Unsupervised Segmentation Parameter Optimization For Efficiently Segmenting Large Heterogeneous Areas [Dataset]. http://doi.org/10.5281/zenodo.1341116
    Dataset updated
    Aug 6, 2018
    Authors
    Stefanos Georganos; Tais Grippa; Moritz Lennert; Brian Alan Johnson; Sabine Vanhuysse; Eléonore Wolff
    Description

    This dataset contains data, results and processing material from the application of GEOBIA-based, Spatially Partitioned Segmentation Parameter Optimization (SPUSPO) in the city of Ouagadougou. In detail, it contains:

    • A Land Use - Land Cover map of Ouagadougou derived through SPUSPO. The classifier used was Extreme Gradient Boosting (XGBoost). Labels: 0 = Building, 1 = Swimming Pool, 2 = Artificial Ground Surface, 3 = Bare Ground, 4 = Tree, 5 = Low Vegetation, 6 = Inland Water, 7 = Shadow.
    • The training and test data used in the study (SPUSPO and benchmark approach), given in CSV format.
    • The Jupyter notebook code, which uses Python and GRASS GIS to automate and efficiently perform SPUSPO on a large dataset (Python code calling GRASS GIS functions to automate the procedure).
    • The segmentation layers from SPUSPO and the benchmark approaches: segmentation rasters for each approach (in raster format due to data limitations).
    • The R code for optimization of XGBoost, feature selection with VSURF and classification of the whole dataset.
    • Segmentation evaluation metrics: a CSV file with the data used to compute the Area Fit Index for each approach.
    • Morphological zones of Ouagadougou as created by Grippa et al. 2017, in shp format.

  13. Benchmark Dataset for Impact4Cast: Forecasting high-impact research topics...

    • zenodo.org
    zip
    Updated Dec 19, 2024
    Cite
    Xuemei Gu (2024). Benchmark Dataset for Impact4Cast: Forecasting high-impact research topics via machine learning on evolving knowledge graphs [Dataset]. http://doi.org/10.5281/zenodo.14527306
    Available download formats
    zip
    Dataset updated
    Dec 19, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Xuemei Gu
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset, from the paper 'Forecasting high-impact research topics via machine learning on evolving knowledge graphs' by Xuemei Gu and Mario Krenn, includes benchmark data for evaluating fully connected NNs, Transformers, Random Forest, and XGBoost on prediction tasks with 2-4 year training intervals and 1-5 year evaluation intervals across two impact ranges (IR). It also provides example data of about 10M evaluation samples (2019-2022), with raw outputs from a neural network trained on 2016-2019 data.

  14. urban forest

    • data.mendeley.com
    Updated Jul 28, 2022
    Cite
    Fatwa Ramdani (2022). urban forest [Dataset]. http://doi.org/10.17632/j739yc6cgc.1
    Dataset updated
    Jul 28, 2022
    Authors
    Fatwa Ramdani
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains:

    1. PlanetScope satellite imagery of the University of Brawijaya at 3 m spatial resolution.
    2. Training and testing data in CSV format.
    3. R scripts for four different algorithms (XGBoost, Random Forest, Support Vector Machine, and Neural Networks).

    The manuscript that uses this dataset has been submitted to F1000 Research (https://f1000research.com/).

  15. Spatio-temporal reconstruction of annual glacier mass balance in the Central...

    • zenodo.org
    csv, zip
    Updated Dec 23, 2024
    + more versions
    Cite
    Yanfei Peng; Bolch Tobias; Yuan Qiangqiang; Baldacchino Francesca; Yang Qianqian (2024). Spatio-temporal reconstruction of annual glacier mass balance in the Central Asia (1950- 2020) using machine learning method [Dataset]. http://doi.org/10.5281/zenodo.14546263
    Available download formats
    zip, csv
    Dataset updated
    Dec 23, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Yanfei Peng; Bolch Tobias; Yuan Qiangqiang; Baldacchino Francesca; Yang Qianqian
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Central Asia
    Description

    This dataset reconstructs the annual mass balance of glaciers larger than 0.1 km² in the Tien Shan and Pamir regions from 1950 to 2022. The dataset is derived using a nonlinear relationship between glacier mass balance and meteorological and topographical variables. The reconstruction method employs the XGBoost algorithm. Initially, XGBoost is trained on the complete training dataset, followed by incremental training for each sub-region to tailor models to specific regional characteristics. The final training results yield an average coefficient of determination (R²) of 0.87.

    All code used in this dataset is publicly available and organized into the following five sections:

    1. Data Processing

      • Code for extracting monthly meteorological variables.
      • Combines meteorological and topographical variables for each glacier.
    2. Model Training

      • Implements the two-step training process for all ensemble learning methods tested in this study.
    3. Result Analysis

      • Pie charts of mass balance distribution for clustered glaciers.
      • Line graphs of annual mass balance for each sub-region.
    4. Result Evaluation

      • Extracts glacier mass balance data from previous studies.
      • Compares these data with the results of this study.
    5. SHAP Analysis

      • Provides scripts to generate SHAP (SHapley Additive exPlanations) value-related figures, highlighting the contribution of different variables to model predictions.
  16. Models and Predictions for "The Proper Care and Feeding of CAMELS: How...

    • zenodo.org
    application/gzip, bin
    Updated Feb 6, 2020
    + more versions
    Cite
    Martin Gauch; Juliane Mai; Jimmy Lin (2020). Models and Predictions for "The Proper Care and Feeding of CAMELS: How Limited Training Data Affects Streamflow Prediction" [Dataset]. http://doi.org/10.5281/zenodo.3543549
    Available download formats
    application/gzip, bin
    Dataset updated
    Feb 6, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Martin Gauch; Juliane Mai; Jimmy Lin
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Models and Predictions

    This dataset contains the trained XGBoost and EA-LSTM models and the models' predictions for the paper The Proper Care and Feeding of CAMELS: How Limited Training Data Affects Streamflow Prediction.

    For each combination of model (XGBoost, EA-LSTM), training years (3, 6, 9), number of basins (13, 26, 53, 265, 531), and seed (111-888), there are five folders. Each corresponds to a random basin sample (for 531 basins there's only one folder, since it's all basins).
    In each folder, there are two files:

    • model.pkl (XGBoost) or model_epoch30.pt (EA-LSTM), which stores the pickled trained model
    • xgboost_seedNNN.p or ealstm_seedNNN.p, which stores a pickled dictionary that maps each basin to the DataFrame of predicted and actual daily streamflow.

    In addition to each folder, there is a SLURM submission script that was used to create and evaluate the model in the folder.
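The folder layout described above can be enumerated programmatically. The sketch below counts the run folders implied by the description and shows how a prediction file might be read. It assumes that "seed (111-888)" means the eight seeds 111, 222, ..., 888, which the description does not state explicitly; the helper names are hypothetical, and the actual directory naming inside the archive is not given here, so no paths are constructed.

```python
import itertools
import pickle

MODELS = ("xgboost", "ealstm")
TRAIN_YEARS = (3, 6, 9)
BASIN_COUNTS = (13, 26, 53, 265, 531)
# Assumption: the seed range 111-888 means 111, 222, ..., 888.
SEEDS = tuple(range(111, 889, 111))


def n_basin_samples(n_basins):
    """Five random basin samples per setting, but one when all 531 basins are used."""
    return 1 if n_basins == 531 else 5


def count_run_folders():
    """Count the folders implied by the description under the seed assumption."""
    combos = itertools.product(MODELS, TRAIN_YEARS, BASIN_COUNTS, SEEDS)
    return sum(n_basin_samples(basins) for _, _, basins, _ in combos)


def prediction_file_name(model, seed):
    """File name pattern from the description: <model>_seedNNN.p."""
    return f"{model}_seed{seed}.p"


def load_predictions(path):
    """Load the pickled dictionary mapping each basin to its DataFrame.

    Unpickling needs pandas installed, and pickle files should only be
    loaded from sources you trust.
    """
    with open(path, "rb") as handle:
        return pickle.load(handle)


print(count_run_folders())
print(prediction_file_name("xgboost", 111))
```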

  17. Table_2_A Machine Learning Method to Trace Cancer Primary Lesion Using...

    • frontiersin.figshare.com
    xlsx
    Updated Jun 6, 2023
    + more versions
    Cite
    Qingfeng Lu; Fengxia Chen; Qianyue Li; Lihong Chen; Ling Tong; Geng Tian; Xiaohong Zhou (2023). Table_2_A Machine Learning Method to Trace Cancer Primary Lesion Using Microarray-Based Gene Expression Data.xlsx [Dataset]. http://doi.org/10.3389/fonc.2022.832567.s002
    Available download formats
    xlsx
    Dataset updated
    Jun 6, 2023
    Dataset provided by
    Frontiers
    Authors
    Qingfeng Lu; Fengxia Chen; Qianyue Li; Lihong Chen; Ling Tong; Geng Tian; Xiaohong Zhou
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Cancer of unknown primary site (CUP) is a heterogeneous group of cancers whose tissue of origin remains unknown after detailed investigation by conventional clinical methods. CUP accounts for roughly 3%–5% of all human malignancies. CUP patients are usually treated with broad-spectrum chemotherapy, which often leads to a poor prognosis. Recent studies suggest that treatment targeting the primary lesion of CUP will significantly improve the prognosis of the patient. Therefore, it is urgent to develop an efficient method to accurately detect the tissue of origin of CUP in clinical cancer research. In this work, we developed a novel framework that uses Extreme Gradient Boosting (XGBoost) to trace the primary site of CUP based on microarray-based gene expression data. First, we downloaded the microarray-based gene expression profiles of 59,385 genes for 57,08 samples from The Cancer Genome Atlas (TCGA) and 6,364 genes for 3,101 samples from the Gene Expression Omnibus (GEO). Both datasets were divided into training and independent testing data at a ratio of 4:1. Then, from the training data, we obtained 200 and 290 genes for the TCGA and GEO datasets, respectively, to train XGBoost models for the identification of the primary site of CUP. The overall 5-fold cross-validation accuracies of our methods were 96.9% and 95.3% on the TCGA and GEO training datasets, respectively. Meanwhile, the macro-precision on the independent testing data reached 96.75% and 98.8% for TCGA and GEO, respectively. Experimental results demonstrated that the XGBoost framework can not only reduce the cost of clinical cancer traceability but is also highly efficient, which might be useful in clinical practice.
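The abstract reports macro-precision, which is the unweighted mean of the per-class precision values, so that rare primary sites count as much as common ones. A minimal sketch of that metric follows. The class names and predictions are hypothetical, and scoring a class that is never predicted as 0 is one common convention that the abstract does not confirm.

```python
from collections import Counter


def macro_precision(y_true, y_pred):
    """Unweighted mean of per-class precision.

    Assumption: a class that is never predicted contributes a precision of 0.
    """
    predicted = Counter(y_pred)
    correct = Counter(pred for true, pred in zip(y_true, y_pred) if true == pred)
    classes = sorted(set(y_true) | set(y_pred))
    per_class = [
        correct[c] / predicted[c] if predicted[c] else 0.0 for c in classes
    ]
    return sum(per_class) / len(per_class)


# Hypothetical tissue-of-origin labels for six test samples.
y_true = ["lung", "lung", "breast", "breast", "colon", "colon"]
y_pred = ["lung", "breast", "breast", "breast", "colon", "lung"]
score = macro_precision(y_true, y_pred)
print(f"macro-precision: {score:.4f}")
```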

  18. Bilos - Gross Primary Production (2020-2023) - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Apr 9, 2024
    Cite
    (2024). Bilos - Gross Primary Production (2020-2023) - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/04ff50f2-456b-5229-9f7f-5fbed12134b7
    Dataset updated
    Apr 9, 2024
    Description

    Gross Primary Production (GPP) represents the total amount of carbon fixed by plants through photosynthesis in an ecosystem over a specific period. GPP data products are derived using a data-driven approach that integrates Earth observation data with in-situ carbon flux measurements. Specifically, GPP estimations combine Sentinel-2 multispectral imagery with carbon flux data from eddy covariance towers, employing the XGBoost machine learning algorithm for prediction. The resulting GPP maps are generated at a 10-meter spatial resolution and a temporal frequency of up to 5 days, covering the period from March 2017 to December 2023. The temporal resolution is contingent upon 50% cloud-free conditions in the area of interest, with lower frequencies occurring during periods of high cloud coverage. The spatial extent of the GPP maps corresponds to the boundaries of long-term observation sites as recorded in the DEIMS-SDR registry (e.g., https://deims.org/6f716444-c0bd-4a04-b72b-add3e302eef1). For sites where the boundary area is smaller than 1 km², or if only point coordinates are available in DEIMS-SDR, the maps are constrained to a 1 km x 1 km bounding box.

    The methodology integrates multiple data sources via machine learning techniques to estimate GPP across different ecosystems. The process begins with data pre-processing, including the selection of sites based on criteria such as available vegetation information and at least three full years of eddy covariance flux data. GPP and environmental data (e.g., air temperature, vapor pressure deficit) are extracted from the ICOS database across different ecosystem types. Then, different remote sensing (RS) indices (e.g., NDVI, EVI) are estimated in GEE using Sentinel-2 MSI data as the mean value of the pixels found inside the climatological footprint 70 (the area from which 70% of the GPP measurements come). Both the ICOS data and the RS indices are used as predictors for the model. The data is split, with 70% used for training and 30% for testing. In the model setup, an XGBoost model is trained using the selected environmental and RS-based indices, and the model parameters are fine-tuned to improve accuracy. The 30% of data held out for testing is used to evaluate the model's performance by comparing its predictions against in-situ GPP data. Error metrics such as Mean Absolute Error (MAE), Root Mean Absolute Error (RMAE), and R² are provided. The map computation phase applies the trained model to ecosystem boundaries from the DEIMS website to generate 5-day GPP maps.

    Acknowledgement: This work on the AGAME Gross Primary Production data product is funded by the European Space Agency (ESA, contract no. 4000143740/24/I-AG) in the frame of the GEOSS Platform Plus project (Horizon Europe, GA No. 101039118). The work is based on requirements from eLTER and contributes, in addition, to the eLTER Site Information Cluster. In-situ data for model calibration and validation has been derived from the ICOS Carbon Portal.
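Two of the quantities mentioned above have standard definitions that are easy to state in code: NDVI, computed from near-infrared and red reflectance, and the MAE and R² metrics used to compare predictions with in-situ GPP. The sketch below uses those textbook formulas. It is not the project's own processing code, and all numeric values are hypothetical.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    return (nir - red) / (nir + red)


def mean_absolute_error(observed, predicted):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical in-situ GPP values and model predictions for the test split.
observed_gpp = [2.0, 4.0, 6.0, 8.0]
predicted_gpp = [2.5, 3.5, 6.5, 7.5]

print(f"NDVI: {ndvi(0.5, 0.1):.3f}")
print(f"MAE:  {mean_absolute_error(observed_gpp, predicted_gpp):.3f}")
print(f"R^2:  {r_squared(observed_gpp, predicted_gpp):.3f}")
```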

  19. Model performance results based on random forest, gradient boosting,...

    • figshare.com
    xls
    Updated Mar 28, 2024
    + more versions
    Cite
    Junying Wang; David D. Wu; Christine DeLorenzo; Jie Yang (2024). Model performance results based on random forest, gradient boosting, penalized logistic regression, XGBoost, SVM, neural network, and stacking for EMBARC data as training set and APAT data as testing set after multiple imputation for 10 times. [Dataset]. http://doi.org/10.1371/journal.pone.0299625.t006
    Available download formats
    xls
    Dataset updated
    Mar 28, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Junying Wang; David D. Wu; Christine DeLorenzo; Jie Yang
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Model performance results based on random forest, gradient boosting, penalized logistic regression, XGBoost, SVM, neural network, and stacking for EMBARC data as training set and APAT data as testing set after multiple imputation for 10 times.

  20. Data Sheet 9_Prediction of outpatient rehabilitation patient preferences and...

    • frontiersin.figshare.com
    xlsx
    Updated Jan 15, 2025
    + more versions
    Cite
    Xuehui Fan; Ruixue Ye; Yan Gao; Kaiwen Xue; Zeyu Zhang; Jing Xu; Jingpu Zhao; Jun Feng; Yulong Wang (2025). Data Sheet 9_Prediction of outpatient rehabilitation patient preferences and optimization of graded diagnosis and treatment based on XGBoost machine learning algorithm.xlsx [Dataset]. http://doi.org/10.3389/frai.2024.1473837.s010
    Available download formats
    xlsx
    Dataset updated
    Jan 15, 2025
    Dataset provided by
    Frontiers
    Authors
    Xuehui Fan; Ruixue Ye; Yan Gao; Kaiwen Xue; Zeyu Zhang; Jing Xu; Jingpu Zhao; Jun Feng; Yulong Wang
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: The Department of Rehabilitation Medicine is key to improving patients' quality of life. Driven by chronic diseases and an aging population, there is a need to enhance the efficiency and resource allocation of outpatient facilities. This study aims to analyze the treatment preferences of outpatient rehabilitation patients by using data and a grading tool to establish predictive models. The goal is to improve patient visit efficiency and optimize resource allocation through these predictive models.

    Methods: Data were collected from 38 Chinese institutions, including 4,244 patients visiting outpatient rehabilitation clinics. Data processing was conducted using Python software. The pandas library was used for data cleaning and preprocessing, involving 68 categorical and 12 continuous variables. The steps included handling missing values, data normalization, and encoding conversion. The data were divided into 80% training and 20% test sets using the Scikit-learn library to ensure model independence and prevent overfitting. Performance comparisons among XGBoost, random forest, and logistic regression were conducted using metrics including accuracy and receiver operating characteristic (ROC) curves. The imbalanced learning library's SMOTE technique was used to address the sample imbalance during model training. The model was optimized using a confusion matrix and feature importance analysis, and partial dependence plots (PDP) were used to analyze the key influencing factors.

    Results: XGBoost achieved the highest overall accuracy of 80.21%, with high precision and recall in Category 1. Random forest showed a similar overall accuracy. Logistic regression had a significantly lower accuracy, indicating difficulties with nonlinear data. The key influencing factors identified include distance to medical institutions, arrival time, length of hospital stay, and specific diseases, such as cardiovascular, pulmonary, oncological, and orthopedic conditions. The tiered diagnosis and treatment tool effectively helped doctors assess patients' conditions and recommend suitable medical institutions based on rehabilitation grading.

    Conclusion: This study confirmed that ensemble learning methods, particularly XGBoost, outperform single models in classification tasks involving complex datasets. Addressing class imbalance and enhancing feature engineering can further improve model performance. Understanding patient preferences and the factors influencing medical institution selection can guide healthcare policies to optimize resource allocation, improve service quality, and enhance patient satisfaction. Tiered diagnosis and treatment tools play a crucial role in helping doctors evaluate patient conditions and make informed recommendations for appropriate medical care.

Cite
Abhishek Borah; Xavier Emery; Parag Jyoti Dutta (2025). MetaCost XGBoost Training and Evaluation Dataset with MATBLAB Codes and files for generating proxies [Dataset]. http://doi.org/10.5281/zenodo.15666484

MetaCost XGBoost Training and Evaluation Dataset with MATBLAB Codes and files for generating proxies

Available download formats
csv, zip
Dataset updated
Jun 18, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Abhishek Borah; Xavier Emery; Parag Jyoti Dutta
License

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Time period covered
Jun 14, 2025
Description

The dataset consists of two curated subsets designed for the classification of alteration types using geochemical and proxy variables. The traditional dataset (Trad_Train.csv and Trad_Test.csv) is derived directly from the original complete geochemical dataset (alldata.csv) without any missing values and includes the original geochemical features, serving as a baseline for model training and evaluation. In contrast, the simulated dataset (proxies_alldata.csv) was generated through custom MATLAB scripts that transform the original geochemical features into proxy variables based on multiple geostatistical realizations. These proxies, expressed on a Gaussian scale, may include negative values due to normalization. The target variable, Alteration, was originally encoded as integers using the mapping: 1 = AAA, 2 = IAA, 3 = PHY, 4 = PRO, 5 = PTS, and 6 = UAL. The simulated proxy data was split into the simulated train and test files (Simu_Train.csv and Simu_Test.csv) based on a flag that encodes the training (=1) and testing (=2) data. All supporting files, including datasets, intermediate outputs (e.g., PNGs, variograms), proxy outputs, and an executable for the confidence analysis routines, are included in the repository, except the source code, which is hosted in a GitHub repository. Specifically, the FinalMatlabFiles.zip archive contains the raw input file alldata.csv used to generate proxies_alldata.csv; it also contains Analysis1.csv and Analysis2.csv for performing the confidence analysis. To run the executable files in place of the .m scripts in MATLAB, users must install the MATLAB Runtime 2023b for Windows 64-bit, available at: https://ssd.mathworks.com/supportfiles/downloads/R2023b/Release/10/deployment_files/installer/complete/win64/MATLAB_Runtime_R2023b_Update_10_win64.zip.
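The integer label mapping and the train/test flag described above can be applied with a few lines of code. The sketch below decodes the Alteration codes and separates rows by flag. The label column name "Alteration" comes from the description, but the flag column name "Set" is a hypothetical placeholder, since the description does not name that column, and the example rows are invented for illustration.

```python
# Mapping quoted in the dataset description.
ALTERATION_LABELS = {1: "AAA", 2: "IAA", 3: "PHY", 4: "PRO", 5: "PTS", 6: "UAL"}
TRAIN_FLAG, TEST_FLAG = 1, 2


def split_and_decode(rows, label_col="Alteration", flag_col="Set"):
    """Decode integer alteration codes and split rows by the train/test flag.

    Assumption: flag_col is a hypothetical column name for the 1/2 flag.
    """
    train_rows, test_rows = [], []
    for row in rows:
        decoded = dict(row)
        decoded[label_col] = ALTERATION_LABELS[int(row[label_col])]
        flag = int(row[flag_col])
        if flag == TRAIN_FLAG:
            train_rows.append(decoded)
        elif flag == TEST_FLAG:
            test_rows.append(decoded)
    return train_rows, test_rows


# Invented rows, shaped like the output of csv.DictReader (string values).
example_rows = [
    {"Alteration": "1", "Set": "1"},
    {"Alteration": "6", "Set": "2"},
    {"Alteration": "3", "Set": "1"},
]
train_rows, test_rows = split_and_decode(example_rows)
print([row["Alteration"] for row in train_rows])
print([row["Alteration"] for row in test_rows])
```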

Details on the input files for confidence analysis (Analysis1.csv and Analysis2.csv)
These files contain two columns for the test data:
• column 1 = match or mismatch between the predicted and true alterations;
• column 2 = probability of a correct classification, according to bootstrapped samples (Analysis1.csv) or to simulated proxies (Analysis2.csv).
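One plausible way to use these two columns is a reliability check: group the test samples by their stated probability of a correct classification and compare the mean probability in each group with the observed match rate. The sketch below illustrates that idea only. It is not the authors' confidence analysis routine, it assumes that matches are coded as 1 and mismatches as 0, and the example values are invented.

```python
def reliability_table(matches, probabilities, n_bins=5):
    """Compare stated probabilities with observed match rates per bin.

    Assumption: matches are coded 1 (match) or 0 (mismatch).
    Returns tuples of (bin_low, bin_high, count, mean_probability, match_rate).
    """
    bins = [[] for _ in range(n_bins)]
    for match, prob in zip(matches, probabilities):
        # A probability of exactly 1.0 goes into the last bin.
        index = min(int(prob * n_bins), n_bins - 1)
        bins[index].append((match, prob))

    table = []
    for index, members in enumerate(bins):
        if not members:
            continue
        mean_prob = sum(prob for _, prob in members) / len(members)
        match_rate = sum(match for match, _ in members) / len(members)
        table.append(
            (index / n_bins, (index + 1) / n_bins, len(members), mean_prob, match_rate)
        )
    return table


# Invented values standing in for the two columns of an analysis file.
matches = [1, 1, 0, 1, 0, 0]
probabilities = [0.95, 0.85, 0.9, 0.55, 0.45, 0.1]
table = reliability_table(matches, probabilities, n_bins=2)
for low, high, count, mean_prob, match_rate in table:
    print(f"[{low:.1f}, {high:.1f}): n={count}, mean p={mean_prob:.3f}, matches={match_rate:.2f}")
```

If the stated probabilities are well calibrated, the match rate in each bin should be close to the mean probability of that bin.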