Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Orphan genes are associated with regulatory patterns, but experimental methods for identifying them are both time-consuming and expensive. Designing an accurate and robust classification model that detects orphan and non-orphan genes in datasets with unbalanced class distributions is particularly challenging. The synthetic minority over-sampling technique (SMOTE) was selected in a preliminary step to deal with the unbalanced gene datasets. To identify orphan genes in balanced and unbalanced Arabidopsis thaliana gene datasets, SMOTE was then combined with traditional and advanced ensemble classification algorithms, namely Support Vector Machine (SVM), Random Forest (RF), AdaBoost (adaptive boosting), GBDT (gradient boosting decision tree), and XGBoost (extreme gradient boosting). Comparing the performance of these ensemble models, SMOTE with XGBoost achieved an F1 score of 0.94 on the balanced A. thaliana gene datasets, but a lower score on the unbalanced datasets. The proposed ensemble method therefore combines different data-balancing algorithms, including Borderline-SMOTE (BSMOTE), Adaptive Synthetic Sampling (ADASYN), SMOTE-Tomek, and SMOTE-ENN, with the XGBoost model separately. The SMOTE-ENN-XGBoost model, which combines over-sampling and under-sampling with XGBoost, achieved higher predictive accuracy than the other balancing algorithms paired with XGBoost. Thus, SMOTE-ENN-XGBoost provides a theoretical basis for developing evaluation criteria for identifying orphan genes in unbalanced biological datasets.
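The abstract above relies on SMOTE to balance the gene datasets. As a rough illustration of what SMOTE does, here is a minimal sketch in pure Python, not the authors' implementation; the `smote` function and its toy minority points are hypothetical:

```python
import random

def smote(minority, k=3, n_new=4, seed=0):
    """Minimal SMOTE sketch: pick a minority point, pick one of its k
    nearest minority neighbours, and interpolate a new synthetic point
    at a random position on the segment between them."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours by squared Euclidean distance
        neighbours = sorted(
            (p for j, p in enumerate(minority) if j != i),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neigh = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neigh)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority)
print(len(new_points))  # 4 synthetic minority samples
```

Because every synthetic point lies between two real minority points, SMOTE densifies the minority region rather than duplicating samples; variants such as SMOTE-ENN additionally remove noisy majority samples afterwards.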
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
BackgroundThe Department of Rehabilitation Medicine is key to improving patients’ quality of life. Driven by chronic diseases and an aging population, there is a need to enhance the efficiency and resource allocation of outpatient facilities. This study aims to analyze the treatment preferences of outpatient rehabilitation patients by using data and a grading tool to establish predictive models. The goal is to improve patient visit efficiency and optimize resource allocation through these predictive models.MethodsData were collected from 38 Chinese institutions, covering 4,244 patients who visited outpatient rehabilitation clinics. Data processing was conducted in Python. The pandas library was used for data cleaning and preprocessing, involving 68 categorical and 12 continuous variables; the steps included handling missing values, data normalization, and encoding conversion. The data were divided into an 80% training set and a 20% test set using the scikit-learn library to ensure model independence and prevent overfitting. Performance comparisons among XGBoost, random forest, and logistic regression were conducted using metrics including accuracy and receiver operating characteristic (ROC) curves. The imbalanced-learn library’s SMOTE technique was used to address sample imbalance during model training. The model was optimized using a confusion matrix and feature importance analysis, and partial dependence plots (PDP) were used to analyze the key influencing factors.ResultsXGBoost achieved the highest overall accuracy of 80.21%, with high precision and recall in Category 1. Random forest showed a similar overall accuracy. Logistic regression had a significantly lower accuracy, indicating difficulties with nonlinear data. The key influencing factors identified include distance to medical institutions, arrival time, length of hospital stay, and specific diseases, such as cardiovascular, pulmonary, oncological, and orthopedic conditions.
The tiered diagnosis and treatment tool effectively helped doctors assess patients’ conditions and recommend suitable medical institutions based on rehabilitation grading.ConclusionThis study confirmed that ensemble learning methods, particularly XGBoost, outperform single models in classification tasks involving complex datasets. Addressing class imbalance and enhancing feature engineering can further improve model performance. Understanding patient preferences and the factors influencing medical institution selection can guide healthcare policies to optimize resource allocation, improve service quality, and enhance patient satisfaction. Tiered diagnosis and treatment tools play a crucial role in helping doctors evaluate patient conditions and make informed recommendations for appropriate medical care.
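The accuracy, precision, and recall comparisons reported above all derive from a confusion matrix. A minimal sketch of those metrics for a binary task (the labels below are toy values, not the study's data):

```python
def confusion_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision and recall from raw binary labels."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

acc, prec, rec = confusion_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
print(acc, prec, rec)  # accuracy 0.6, precision and recall both 2/3
```

Precision and recall matter here precisely because, under class imbalance, plain accuracy can look high while a rare class is predicted poorly.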
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Landslide susceptibility represents the potential of slope failure for given geo-environmental conditions. Existing landslide susceptibility maps suffer from several limitations, such as being based on limited data, heuristic methodologies, low spatial resolution, and small areas of interest. In this study, we overcome these limitations by developing a probabilistic framework that combines imbalance handling and ensemble machine learning for landslide susceptibility mapping. We employ a combination of One-Sided Selection and the Support Vector Machine Synthetic Minority Oversampling Technique (SVMSMOTE) to eliminate class imbalance and to derive smaller, representative training data from big data. A blending ensemble of hyperparameter-tuned Artificial Neural Networks, Random Forests, and Support Vector Machines is employed to reduce the uncertainty associated with a single model. The methodology provides both a landslide susceptibility probability and a landslide susceptibility class. A thorough evaluation of the framework is performed using receiver operating characteristic curves, confusion matrices, and the derivatives of confusion matrices. This framework is used to develop India's first national-scale machine-learning-based landslide susceptibility map. The landslide database is carefully curated from global and local inventories, and the landslide conditioning factors are selected from a multitude of geophysical and climatological variables. The Indian Landslide Susceptibility Map (ILSM) is developed at a resolution of 0.001° (∼100 m) and is classified into five classes: very low, low, medium, high, and very high. We report an accuracy of 95.73%, a sensitivity of 97.08%, and a Matthews correlation coefficient (MCC) of 0.915 on test data, demonstrating the accuracy, robustness, and generalizability of the framework for landslide identification.
The model classified 4.75% of India's area as very highly susceptible to landslides and detected new landslide-susceptible zones in the Eastern Ghats, hitherto unreported in government landslide records. The ILSM is expected to aid policymaking in disaster risk reduction and in developing landslide prediction models.
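The Matthews correlation coefficient reported above is computed from the four confusion-matrix counts and stays informative under class imbalance. A small sketch (the counts below are illustrative, not the paper's actual confusion matrix):

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts;
    returns 0.0 when any marginal is empty (undefined denominator)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Illustrative counts per 200 test cells: sensitivity 0.97, specificity 0.94
score = mcc(tp=97, fp=6, fn=3, tn=94)
print(round(score, 3))  # 0.91
```

Unlike accuracy, MCC only approaches 1 when the classifier does well on both classes, which is why it is a common companion metric for imbalanced geospatial classification.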
Public Domain (CC0 1.0) https://creativecommons.org/publicdomain/zero/1.0/
🔍 Dataset Description: Credit Card Fraud Detection This dataset is designed for building and evaluating machine learning models for credit card fraud detection. It contains anonymized transaction records where the goal is to classify transactions as fraudulent (1) or non-fraudulent (0) based on several features.
📁 Dataset Overview: Each row represents a single credit card transaction.
Features include a mix of numerical and transformed variables (e.g., V1 to V28) derived from PCA for confidentiality.
The Amount and Hour_of_Day features represent the transaction value and time, respectively.
The Class column is the target variable:
0 → Legitimate transaction
1 → Fraudulent transaction
✅ Key Highlights: The dataset contains both classes (0 and 1), so binary classifiers can be trained and evaluated end to end.
Suitable for testing anomaly detection, binary classification, and imbalanced dataset handling techniques like SMOTE or under-sampling.
Ideal for learners, researchers, and practitioners working on fraud detection in real-world scenarios.
🧠 Suggested Use Cases: Model evaluation with metrics like precision, recall, F1-score (due to class imbalance).
Experimentation with algorithms such as Logistic Regression, Random Forest, XGBoost, and Neural Networks.
Feature engineering and explainability techniques (e.g., SHAP values).
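One of the imbalance-handling techniques suggested above, random under-sampling, can be sketched in a few lines (the helper and toy data are hypothetical; in a real workflow you would under-sample only the training split, never the test set):

```python
import random

def undersample(rows, labels, majority=0, seed=42):
    """Random under-sampling sketch: keep all minority rows plus an
    equal-size random subset of majority rows, yielding a 1:1 ratio."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y != majority]
    majority_idx = [i for i, y in enumerate(labels) if y == majority]
    kept = minority_idx + rng.sample(majority_idx, len(minority_idx))
    rng.shuffle(kept)
    return [rows[i] for i in kept], [labels[i] for i in kept]

X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2          # 8 legitimate, 2 fraudulent
Xb, yb = undersample(X, y)
print(sum(yb), len(yb))        # 2 frauds out of 4 rows: balanced
```

Under-sampling discards majority information, which is why it is often contrasted with (or combined with) SMOTE-style over-sampling on datasets like this one.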
Public Domain (CC0 1.0) https://creativecommons.org/publicdomain/zero/1.0/
Overview
Synthetic tabular dataset of 50,000 support tickets from 25 companies used to study priority classification (low, medium, high). Companies differ by size and industry; large companies operate across multiple regions. Features mix numeric and categorical signals commonly available at ticket intake. Data is fully artificial—no real users, systems, or proprietary logs.
Intended use: benchmarking supervised learning for tabular classification (e.g., Gradient Boosting, XGBoost, LightGBM, AdaBoost, SVM, Naive Bayes), feature engineering, handling mixed types, class imbalance, and mild label noise.
File & schema
Identifiers & time
-ticket_id (int64): unique ticket identifier (randomized order)
-day_of_week (Mon–Sun), day_of_week_num (1–7; Mon=1)
Company profile (replicated per row)
-company_id (int), company_size (Small/Medium/Large + _cat),
-industry (7 categories + _cat),
-customer_tier (Basic/Plus/Enterprise + _cat),
-org_users (int): active user seats (Large up to ~10,000)
Context
-region (AMER/EMEA/APAC + _cat)
-past_30d_tickets (int), past_90d_incidents (int)
Product & channel
-product_area (auth, billing, mobile, data_pipeline, analytics, notifications + _cat)
-booking_channel (web, email, chat, phone + _cat)
-reported_by_role (support, devops, product_manager, finance, c_level + _cat)
Impact & flags
-customers_affected (int, heavy-tailed)
-error_rate_pct (float, 0–100; sometimes 0.0 as “unmeasured”)
-downtime_min (int, 0 when only degraded)
-payment_impact_flag, security_incident_flag, data_loss_flag, has_runbook (0/1)
Text proxy
-customer_sentiment (negative/neutral/positive + _cat with 0 = missing)
-description_length (int, 20–2000)
Target
-priority (low/medium/high + priority_cat = 1/2/3)
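A minimal sketch of the `_cat` encoding convention described in this schema (the `encode` helper is hypothetical; it assumes codes are assigned in the listed category order, with 0 reserved for missing values as in `customer_sentiment`):

```python
def encode(values, categories, missing_code=0):
    """Map categorical strings to 1-based integer codes, mirroring the
    `_cat` columns above; unknown/missing values get code 0."""
    codes = {c: i + 1 for i, c in enumerate(categories)}
    return [codes.get(v, missing_code) for v in values]

sentiment_cat = encode(["negative", None, "positive"],
                       ["negative", "neutral", "positive"])
priority_cat = encode(["low", "high", "medium"],
                      ["low", "medium", "high"])  # priority_cat = 1/2/3
print(sentiment_cat, priority_cat)  # [1, 0, 3] [1, 3, 2]
```

Tree ensembles such as XGBoost or LightGBM consume these integer codes directly, while linear or distance-based models (SVM, Naive Bayes) usually need one-hot encoding instead.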
Notes & limitations
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
BackgroundLaparoscopic total mesorectal excision (LaTME) is the standard surgical method for rectal cancer, and the LaTME operation is a challenging procedure. This study is intended to use machine learning to develop and validate prediction models for the surgical difficulty of LaTME in patients with rectal cancer and to compare these models’ performance.MethodsWe retrospectively collected the preoperative clinical and MRI pelvimetry parameters of rectal cancer patients who underwent laparoscopic total mesorectal excision from 2017 to 2022. The difficulty of LaTME was defined according to the scoring criteria reported by Escal. Patients were randomly divided into a training group (80%) and a test group (20%). We selected independent influencing features using the least absolute shrinkage and selection operator (LASSO) and a multivariate logistic regression method. The synthetic minority oversampling technique (SMOTE) was adopted to alleviate the class imbalance problem. Six machine learning models were developed: light gradient boosting machine (LGBM), categorical boosting (CatBoost), extreme gradient boosting (XGBoost), logistic regression (LR), random forests (RF), and multilayer perceptron (MLP). The area under the receiver operating characteristic curve (AUROC), accuracy, sensitivity, specificity, and F1 score were used to evaluate the performance of the models. Shapley Additive Explanations (SHAP) analysis provided interpretation for the best machine learning model, and decision curve analysis (DCA) was further used to evaluate the clinical utility of the model.ResultsA total of 626 patients were included. LASSO regression analysis shows that tumor height, prognostic nutrition index (PNI), pelvic inlet, pelvic outlet, sacrococcygeal distance, mesorectal fat area, and angle 5 (the angle between the apex of the sacral angle and the lower edge of the pubic bone) are the predictor variables of the machine learning model.
In addition, the correlation heatmap shows that there is no significant correlation between these seven variables. When predicting the difficulty of LaTME surgery, the XGBoost model performed best among the six machine learning models (AUROC=0.855). Based on the decision curve analysis (DCA) results, the XGBoost model is also superior, and feature importance analysis shows that tumor height is the most important variable among the seven factors.ConclusionsThis study developed an XGBoost model to predict the difficulty of LaTME surgery. This model can help clinicians quickly and accurately predict the difficulty of surgery and adopt individualized surgical methods.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Credit card fraud is a significant problem that costs billions of dollars annually. Detecting fraudulent transactions is challenging due to the imbalance in class distribution, where the majority of transactions are legitimate. While pre-processing techniques such as oversampling of minority classes are commonly used to address this issue, they often generate unrealistic or overgeneralized samples. This paper proposes a method called autoencoder with probabilistic XGBoost based on SMOTE and CGAN (AE-XGB-SMOTE-CGAN) for detecting credit card fraud. AE-XGB-SMOTE-CGAN is a novel method proposed for credit card fraud detection problems. The credit card fraud dataset comes from a real dataset anonymized by a bank and is highly imbalanced, with normal data far outnumbering fraud data. An autoencoder (AE) is used to extract relevant features from the dataset, enhancing feature representation learning; these features are then fed into XGBoost for classification according to a threshold. Additionally, in this study we propose a novel approach that hybridizes the Generative Adversarial Network (GAN) and the Synthetic Minority Over-Sampling Technique (SMOTE) to tackle class imbalance problems. Our two-phase oversampling approach involves knowledge transfer and leverages the synergies of SMOTE and GAN. Specifically, the GAN transforms the unrealistic or overgeneralized samples generated by SMOTE into realistic data distributions in cases where there is not enough minority-class data for the GAN to work effectively on its own. SMOTE is used to address class imbalance, and the CGAN is used to generate new, realistic data to supplement the original dataset. The AE-XGB-SMOTE-CGAN algorithm is also compared to other commonly used machine learning algorithms, such as KNN and LightGBM, and shows an overall improvement of 2% in the ACC index over these algorithms.
The AE-XGB-SMOTE-CGAN algorithm also outperforms KNN in terms of the MCC index by 30% when the threshold is set to 0.35. This indicates that the AE-XGB-SMOTE-CGAN algorithm has higher accuracy, true positive rate, true negative rate, and Matthew’s correlation coefficient, making it a promising method for detecting credit card fraud.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The promise of machine learning successfully exploiting digital phenotyping data to forecast mental states in psychiatric populations could greatly improve clinical practice. Previous research focused on binary classification and continuous regression, disregarding the often ordinal nature of prediction targets derived from clinical rating scales. In addition, mental health ratings typically show important class imbalance or skewness that needs to be accounted for when evaluating predictive performance. Moreover, it remains unclear which machine learning algorithm is best suited for forecast tasks, with eXtreme Gradient Boosting (XGBoost) and long short-term memory (LSTM) being 2 popular choices in digital phenotyping studies. The CrossCheck dataset includes 6,364 mental state surveys using 4-point ordinal rating scales and 23,551 days of smartphone sensor data contributed by patients with schizophrenia. We trained 120 machine learning models to forecast 10 mental states (e.g., Calm, Depressed, Seeing things) from passive sensor data on 2 predictive tasks (ordinal regression, binary classification) with 2 learning algorithms (XGBoost, LSTM) over 3 forecast horizons (same day, next day, next week). A majority of ordinal regression and binary classification models performed significantly above baseline, with macro-averaged mean absolute error values between 1.19 and 0.77, and balanced accuracy between 58% and 73%, which corresponds to similar levels of performance when these metrics are scaled. Results also showed that metrics that do not account for imbalance (mean absolute error, accuracy) systematically overestimated performance, XGBoost models performed on par with or better than LSTM models, and a significant yet very small decrease in performance was observed as the forecast horizon expanded.
In conclusion, when using performance metrics that properly account for class imbalance, ordinal forecast models demonstrated comparable performance to the prevalent binary classification approach without losing valuable clinical information from self-reports, thus providing richer and easier to interpret predictions.
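The macro-averaged mean absolute error used above averages the per-class MAE so that rare rating levels count as much as common ones. A minimal sketch on toy 4-point ratings (hypothetical values, not the CrossCheck data):

```python
def macro_mae(y_true, y_pred, classes):
    """Macro-averaged MAE: compute MAE separately within each true
    class, then average across the classes that actually occur."""
    per_class = []
    for c in classes:
        errs = [abs(t - p) for t, p in zip(y_true, y_pred) if t == c]
        if errs:
            per_class.append(sum(errs) / len(errs))
    return sum(per_class) / len(per_class)

# Imbalanced ratings: plain MAE looks good because a constant prediction
# matches the dominant class, but macro MAE exposes the missed rare rating.
y_true = [1, 1, 1, 1, 1, 4]
y_pred = [1, 1, 1, 1, 1, 1]
plain = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain, macro_mae(y_true, y_pred, [1, 2, 3, 4]))  # plain 0.5, macro 1.5
```

This is the same imbalance-correction idea as balanced accuracy, transferred to an ordinal target where prediction errors have magnitudes rather than just being right or wrong.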
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Kawasaki Disease (KD) is a rare febrile illness affecting infants and young children, potentially leading to coronary artery complications and, in severe cases, mortality if untreated. However, KD is frequently misdiagnosed as a common fever in clinical settings, and the inherent data imbalance further complicates accurate prediction when using traditional machine learning and statistical methods. This paper introduces two advanced approaches to address these challenges, enhancing prediction accuracy and generalizability. The first approach proposes a stacking model termed the Disease Classifier (DC), specifically designed to recognize minority class samples within imbalanced datasets, thereby mitigating the bias commonly observed in traditional models toward the majority class. Secondly, we introduce a combined model, the Disease Classifier with CTGAN (CTGAN-DC), which integrates DC with Conditional Tabular Generative Adversarial Network (CTGAN) technology to improve data balance and predictive performance further. Utilizing CTGAN-based oversampling techniques, this model retains the original data characteristics of KD while expanding data diversity. This effectively balances positive and negative KD samples, significantly reducing model bias toward the majority class and enhancing both predictive accuracy and generalizability. Experimental evaluations indicate substantial performance gains, with the DC and CTGAN-DC models achieving notably higher predictive accuracy than individual machine learning models. Specifically, the DC model achieves sensitivity and specificity rates of 95%, while the CTGAN-DC model achieves 95% sensitivity and 97% specificity, demonstrating superior recognition capability. 
Furthermore, both models exhibit strong generalizability across diverse KD datasets, particularly the CTGAN-DC model, which surpasses the JAMA model with a 3% increase in sensitivity and a 95% improvement in generalization sensitivity and specificity, effectively resolving the model collapse issue observed in the JAMA model. In sum, the proposed DC and CTGAN-DC architectures demonstrate robust generalizability across multiple KD datasets from various healthcare institutions and significantly outperform other models, including XGBoost. These findings lay a solid foundation for advancing disease prediction in the context of imbalanced medical data.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A comparative analysis of DC, CTGAN-DC, XGBoost, CTGAN-XG, and TVAE-XG models in Kawasaki Disease experiments.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Diabetes Mellitus is a global health concern, characterized by high blood sugar levels over a prolonged period, leading to severe complications if left unmanaged. The early identification of individuals at risk is critical for effective intervention and treatment. Traditional diagnostic methods rely heavily on clinical symptoms and biochemical tests, which may not capture the underlying genetic predispositions. With the advent of genomics, DNA sequence analysis has emerged as a promising approach to uncover the genetic markers associated with Diabetes Mellitus. However, the challenge lies in accurately classifying DNA sequences to predict susceptibility to the disease, given the complex nature of genetic data. This study addresses this challenge by employing two advanced machine learning models, NuSVC (Nu-Support Vector Classification) and XGBoost (Extreme Gradient Boosting), to classify DNA sequences related to Diabetes Mellitus. The dataset, obtained from reputable sources like NCBI, was preprocessed using Natural Language Processing (NLP) techniques, where DNA sequences were treated as textual data and transformed into numerical features using TF-IDF (Term Frequency-Inverse Document Frequency). To handle the class imbalance in the dataset, SMOTE (Synthetic Minority Over-sampling Technique) was applied. The models were trained and validated using 10-fold cross-validation. XGBoost was trained with up to 300 boosting rounds, and performance was evaluated using accuracy, precision, recall, F1-score, ROC-AUC, and log loss. The results demonstrate that XGBoost outperformed NuSVC across all metrics, achieving an accuracy of 98%, a log loss of 0.0650, and an AUC of 1.00, compared to NuSVC’s accuracy of 87%, log loss of 0.2649, and AUC of 0.95. The superior performance of XGBoost indicates its robustness in handling complex genetic data and its potential utility in clinical applications for early diagnosis of Diabetes Mellitus. 
The findings of this study underscore the importance of advanced machine learning techniques in genomics and suggest that integrating such models into healthcare systems could significantly enhance predictive diagnostics.
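Treating DNA as text, as described above, typically means splitting sequences into overlapping k-mers before applying TF-IDF. A minimal sketch in pure Python (k=3 and the toy sequences are illustrative choices, not the study's actual setup):

```python
import math
from collections import Counter

def kmers(seq, k=3):
    """Split a DNA sequence into overlapping k-mers, the usual way to
    turn sequences into 'words' for text-style vectorization."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def tfidf(docs):
    """Tiny TF-IDF over k-mer 'documents' (smoothed IDF)."""
    n = len(docs)
    df = Counter(kmer for doc in docs for kmer in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1)
                        for t, c in tf.items()})
    return weights

docs = [kmers("ATGCGATG"), kmers("GGCATGCA")]
w = tfidf(docs)
print(sorted(w[0], key=w[0].get, reverse=True)[0])  # highest-weighted 3-mer: ATG
```

The resulting sparse weight vectors are what a classifier such as XGBoost or NuSVC would consume; a production pipeline would typically use scikit-learn's `TfidfVectorizer` with a k-mer tokenizer instead of hand-rolled code.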
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
There is a substantial increase in sexually transmitted infections (STIs) among men who have sex with men (MSM) globally. Unprotected sexual practices, multiple sex partners, criminalization, stigmatization, fear of discrimination, substance use, poor access to care, and the lack of early STI screening tools are among the contributing factors. Therefore, this study applied multilayer perceptron (MLP), extremely randomized trees (ExtraTrees), and XGBoost machine learning models to predict STIs among MSM using bio-behavioural survey (BBS) data in Zimbabwe. Data were collected from 1538 MSM in Zimbabwe. The dataset was split into training and testing sets using a ratio of 80% to 20%, respectively. The synthetic minority oversampling technique (SMOTE) was applied to address class imbalance. Using a stepwise logistic regression model, the study revealed several predictors of STIs among MSM, such as age, cohabitation with sex partners, education status, and employment status. The results show that MLP performed better than the other STI predictive models (XGBoost and ExtraTrees), achieving an accuracy of 87.54%, recall of 97.29%, precision of 89.64%, F1-score of 93.31%, and AUC of 66.78%. XGBoost achieved an accuracy of 86.51%, recall of 96.51%, precision of 89.25%, F1-score of 92.74%, and AUC of 54.83%. ExtraTrees recorded an accuracy of 85.47%, recall of 95.35%, precision of 89.13%, F1-score of 92.13%, and AUC of 60.21%. These models can be effectively used to identify highly at-risk MSM, for STI surveillance, and to further develop STI screening tools to improve the health outcomes of MSM.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Results from the logistic regression model and STI risk factors among MSM.