Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Orphan genes are associated with regulatory patterns, but experimental methods for identifying them are both time-consuming and expensive. Designing an accurate and robust classification model that detects orphan and non-orphan genes in datasets with unbalanced class distributions is particularly challenging. The synthetic minority over-sampling technique (SMOTE) was selected in a preliminary step to deal with the unbalanced gene datasets. To identify orphan genes in balanced and unbalanced Arabidopsis thaliana gene datasets, SMOTE was then combined with traditional and advanced ensemble classification algorithms, namely Support Vector Machine (SVM), Random Forest (RF), AdaBoost (adaptive boosting), GBDT (gradient boosting decision tree), and XGBoost (extreme gradient boosting). Comparing the performance of these ensemble models, SMOTE with XGBoost achieved an F1 score of 0.94 on the balanced A. thaliana gene datasets, but a lower score on the unbalanced datasets. The proposed ensemble method therefore combines different data-balancing algorithms, including Borderline-SMOTE (BSMOTE), Adaptive Synthetic Sampling (ADASYN), SMOTE-Tomek, and SMOTE-ENN, with the XGBoost model separately. The SMOTE-ENN-XGBoost model, which combines over-sampling and under-sampling with XGBoost, achieved higher predictive accuracy than the other balancing algorithms paired with XGBoost. Thus, SMOTE-ENN-XGBoost provides a theoretical basis for developing evaluation criteria for identifying orphan genes in unbalanced biological datasets.
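The abstract above relies on SMOTE to balance the gene datasets. As a rough illustration of what SMOTE does, here is a minimal sketch in pure Python, not the authors' implementation; the `smote` function and its toy minority points are hypothetical:

```python
import random

def smote(minority, k=3, n_new=4, seed=0):
    """Minimal SMOTE sketch: pick a minority point, pick one of its k
    nearest minority neighbours, and interpolate a new synthetic point
    at a random position on the segment between them."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours by squared Euclidean distance
        neighbours = sorted(
            (p for j, p in enumerate(minority) if j != i),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neigh = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neigh)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority)
print(len(new_points))  # 4 synthetic minority samples
```

Because every synthetic point lies between two real minority points, SMOTE densifies the minority region rather than duplicating samples; variants such as SMOTE-ENN additionally remove noisy majority samples afterwards.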
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
BackgroundThe Department of Rehabilitation Medicine is key to improving patients’ quality of life. Driven by chronic diseases and an aging population, there is a need to enhance the efficiency and resource allocation of outpatient facilities. This study aims to analyze the treatment preferences of outpatient rehabilitation patients by using data and a grading tool to establish predictive models. The goal is to improve patient visit efficiency and optimize resource allocation through these predictive models.MethodsData were collected from 38 Chinese institutions, covering 4,244 patients who visited outpatient rehabilitation clinics. Data processing was conducted in Python. The pandas library was used for data cleaning and preprocessing, involving 68 categorical and 12 continuous variables; the steps included handling missing values, data normalization, and encoding conversion. The data were divided into an 80% training set and a 20% test set using the scikit-learn library to ensure model independence and prevent overfitting. Performance comparisons among XGBoost, random forest, and logistic regression were conducted using metrics including accuracy and receiver operating characteristic (ROC) curves. The imbalanced-learn library’s SMOTE technique was used to address sample imbalance during model training. The model was optimized using a confusion matrix and feature importance analysis, and partial dependence plots (PDP) were used to analyze the key influencing factors.ResultsXGBoost achieved the highest overall accuracy of 80.21%, with high precision and recall in Category 1. Random forest showed a similar overall accuracy. Logistic regression had a significantly lower accuracy, indicating difficulties with nonlinear data. The key influencing factors identified include distance to medical institutions, arrival time, length of hospital stay, and specific diseases, such as cardiovascular, pulmonary, oncological, and orthopedic conditions.
The tiered diagnosis and treatment tool effectively helped doctors assess patients’ conditions and recommend suitable medical institutions based on rehabilitation grading.ConclusionThis study confirmed that ensemble learning methods, particularly XGBoost, outperform single models in classification tasks involving complex datasets. Addressing class imbalance and enhancing feature engineering can further improve model performance. Understanding patient preferences and the factors influencing medical institution selection can guide healthcare policies to optimize resource allocation, improve service quality, and enhance patient satisfaction. Tiered diagnosis and treatment tools play a crucial role in helping doctors evaluate patient conditions and make informed recommendations for appropriate medical care.
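The accuracy, precision, and recall comparisons reported above all derive from a confusion matrix. A minimal sketch of those metrics for a binary task (the labels below are toy values, not the study's data):

```python
def confusion_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision and recall from raw binary labels."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

acc, prec, rec = confusion_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
print(acc, prec, rec)  # accuracy 0.6, precision and recall both 2/3
```

Precision and recall matter here precisely because, under class imbalance, plain accuracy can look high while a rare class is predicted poorly.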
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Landslide susceptibility represents the potential of slope failure for given geo-environmental conditions. Existing landslide susceptibility maps suffer from several limitations, such as being based on limited data, heuristic methodologies, low spatial resolution, and small areas of interest. In this study, we overcome these limitations by developing a probabilistic framework that combines imbalance handling and ensemble machine learning for landslide susceptibility mapping. We employ a combination of One-Sided Selection and the Support Vector Machine Synthetic Minority Oversampling Technique (SVMSMOTE) to eliminate class imbalance and to derive smaller, representative training data from big data. A blending ensemble of hyperparameter-tuned Artificial Neural Networks, Random Forests, and Support Vector Machines is employed to reduce the uncertainty associated with a single model. The methodology provides both a landslide susceptibility probability and a landslide susceptibility class. A thorough evaluation of the framework is performed using receiver operating characteristic curves, confusion matrices, and the derivatives of confusion matrices. This framework is used to develop India's first national-scale machine-learning-based landslide susceptibility map. The landslide database is carefully curated from global and local inventories, and the landslide conditioning factors are selected from a multitude of geophysical and climatological variables. The Indian Landslide Susceptibility Map (ILSM) is developed at a resolution of 0.001° (∼100 m) and is classified into five classes: very low, low, medium, high, and very high. We report an accuracy of 95.73%, a sensitivity of 97.08%, and a Matthews correlation coefficient (MCC) of 0.915 on test data, demonstrating the accuracy, robustness, and generalizability of the framework for landslide identification.
The model classified 4.75% of India's area as very highly susceptible to landslides and detected new landslide-susceptible zones in the Eastern Ghats, hitherto unreported in government landslide records. The ILSM is expected to aid policymaking in disaster risk reduction and in developing landslide prediction models.
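The Matthews correlation coefficient reported above is computed from the four confusion-matrix counts and stays informative under class imbalance. A small sketch (the counts below are illustrative, not the paper's actual confusion matrix):

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts;
    returns 0.0 when any marginal is empty (undefined denominator)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Illustrative counts per 200 test cells: sensitivity 0.97, specificity 0.94
score = mcc(tp=97, fp=6, fn=3, tn=94)
print(round(score, 3))  # 0.91
```

Unlike accuracy, MCC only approaches 1 when the classifier does well on both classes, which is why it is a common companion metric for imbalanced geospatial classification.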
Public Domain (CC0 1.0) https://creativecommons.org/publicdomain/zero/1.0/
🔍 Dataset Description: Credit Card Fraud Detection This dataset is designed for building and evaluating machine learning models for credit card fraud detection. It contains anonymized transaction records where the goal is to classify transactions as fraudulent (1) or non-fraudulent (0) based on several features.
📁 Dataset Overview: Each row represents a single credit card transaction.
Features include a mix of numerical and transformed variables (e.g., V1 to V28) derived from PCA for confidentiality.
The Amount and Hour_of_Day features represent the transaction value and time, respectively.
The Class column is the target variable:
0 → Legitimate transaction
1 → Fraudulent transaction
✅ Key Highlights: The dataset contains both classes (0 and 1), so binary classifiers can be trained and evaluated end to end.
Suitable for testing anomaly detection, binary classification, and imbalanced dataset handling techniques like SMOTE or under-sampling.
Ideal for learners, researchers, and practitioners working on fraud detection in real-world scenarios.
🧠 Suggested Use Cases: Model evaluation with metrics like precision, recall, F1-score (due to class imbalance).
Experimentation with algorithms such as Logistic Regression, Random Forest, XGBoost, and Neural Networks.
Feature engineering and explainability techniques (e.g., SHAP values).
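One of the imbalance-handling techniques suggested above, random under-sampling, can be sketched in a few lines (the helper and toy data are hypothetical; in a real workflow you would under-sample only the training split, never the test set):

```python
import random

def undersample(rows, labels, majority=0, seed=42):
    """Random under-sampling sketch: keep all minority rows plus an
    equal-size random subset of majority rows, yielding a 1:1 ratio."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y != majority]
    majority_idx = [i for i, y in enumerate(labels) if y == majority]
    kept = minority_idx + rng.sample(majority_idx, len(minority_idx))
    rng.shuffle(kept)
    return [rows[i] for i in kept], [labels[i] for i in kept]

X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2          # 8 legitimate, 2 fraudulent
Xb, yb = undersample(X, y)
print(sum(yb), len(yb))        # 2 frauds out of 4 rows: balanced
```

Under-sampling discards majority information, which is why it is often contrasted with (or combined with) SMOTE-style over-sampling on datasets like this one.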
Public Domain (CC0 1.0) https://creativecommons.org/publicdomain/zero/1.0/
Overview
Synthetic tabular dataset of 50,000 support tickets from 25 companies used to study priority classification (low, medium, high). Companies differ by size and industry; large companies operate across multiple regions. Features mix numeric and categorical signals commonly available at ticket intake. Data is fully artificial—no real users, systems, or proprietary logs.
Intended use: benchmarking supervised learning for tabular classification (e.g., Gradient Boosting, XGBoost, LightGBM, AdaBoost, SVM, Naive Bayes), feature engineering, handling mixed types, class imbalance, and mild label noise.
File & schema
Identifiers & time
-ticket_id (int64): unique ticket identifier (randomized order)
-day_of_week (Mon–Sun), day_of_week_num (1–7; Mon=1)
Company profile (replicated per row)
-company_id (int), company_size (Small/Medium/Large + _cat),
-industry (7 categories + _cat),
-customer_tier (Basic/Plus/Enterprise + _cat),
-org_users (int): active user seats (Large up to ~10,000)
Context
-region (AMER/EMEA/APAC + _cat)
-past_30d_tickets (int), past_90d_incidents (int)
Product & channel
-product_area (auth, billing, mobile, data_pipeline, analytics, notifications + _cat)
-booking_channel (web, email, chat, phone + _cat)
-reported_by_role (support, devops, product_manager, finance, c_level + _cat)
Impact & flags
-customers_affected (int, heavy-tailed)
-error_rate_pct (float, 0–100; sometimes 0.0 as “unmeasured”)
-downtime_min (int, 0 when only degraded)
-payment_impact_flag, security_incident_flag, data_loss_flag, has_runbook (0/1)
Text proxy
-customer_sentiment (negative/neutral/positive + _cat with 0 = missing)
-description_length (int, 20–2000)
Target
-priority (low/medium/high + priority_cat = 1/2/3)
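A minimal sketch of the `_cat` encoding convention described in this schema (the `encode` helper is hypothetical; it assumes codes are assigned in the listed category order, with 0 reserved for missing values as in `customer_sentiment`):

```python
def encode(values, categories, missing_code=0):
    """Map categorical strings to 1-based integer codes, mirroring the
    `_cat` columns above; unknown/missing values get code 0."""
    codes = {c: i + 1 for i, c in enumerate(categories)}
    return [codes.get(v, missing_code) for v in values]

sentiment_cat = encode(["negative", None, "positive"],
                       ["negative", "neutral", "positive"])
priority_cat = encode(["low", "high", "medium"],
                      ["low", "medium", "high"])  # priority_cat = 1/2/3
print(sentiment_cat, priority_cat)  # [1, 0, 3] [1, 3, 2]
```

Tree ensembles such as XGBoost or LightGBM consume these integer codes directly, while linear or distance-based models (SVM, Naive Bayes) usually need one-hot encoding instead.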
Notes & limitations
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
BackgroundLaparoscopic total mesorectal excision (LaTME) is the standard surgical method for rectal cancer, and the LaTME operation is a challenging procedure. This study is intended to use machine learning to develop and validate prediction models for the surgical difficulty of LaTME in patients with rectal cancer and to compare these models’ performance.MethodsWe retrospectively collected the preoperative clinical and MRI pelvimetry parameters of rectal cancer patients who underwent laparoscopic total mesorectal excision from 2017 to 2022. The difficulty of LaTME was defined according to the scoring criteria reported by Escal. Patients were randomly divided into a training group (80%) and a test group (20%). We selected independent influencing features using the least absolute shrinkage and selection operator (LASSO) and a multivariate logistic regression method. The synthetic minority oversampling technique (SMOTE) was adopted to alleviate the class imbalance problem. Six machine learning models were developed: light gradient boosting machine (LGBM), categorical boosting (CatBoost), extreme gradient boosting (XGBoost), logistic regression (LR), random forests (RF), and multilayer perceptron (MLP). The area under the receiver operating characteristic curve (AUROC), accuracy, sensitivity, specificity, and F1 score were used to evaluate the performance of the models. Shapley Additive Explanations (SHAP) analysis provided interpretation for the best machine learning model, and decision curve analysis (DCA) was further used to evaluate the clinical utility of the model.ResultsA total of 626 patients were included. LASSO regression analysis shows that tumor height, prognostic nutrition index (PNI), pelvic inlet, pelvic outlet, sacrococcygeal distance, mesorectal fat area, and angle 5 (the angle between the apex of the sacral angle and the lower edge of the pubic bone) are the predictor variables of the machine learning model.
In addition, the correlation heatmap shows that there is no significant correlation between these seven variables. When predicting the difficulty of LaTME surgery, the XGBoost model performed best among the six machine learning models (AUROC=0.855). Based on the decision curve analysis (DCA) results, the XGBoost model is also superior, and feature importance analysis shows that tumor height is the most important variable among the seven factors.ConclusionsThis study developed an XGBoost model to predict the difficulty of LaTME surgery. This model can help clinicians quickly and accurately predict the difficulty of surgery and adopt individualized surgical methods.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Credit card fraud is a significant problem that costs billions of dollars annually. Detecting fraudulent transactions is challenging due to the imbalance in class distribution, where the majority of transactions are legitimate. While pre-processing techniques such as oversampling of minority classes are commonly used to address this issue, they often generate unrealistic or overgeneralized samples. This paper proposes a method called autoencoder with probabilistic XGBoost based on SMOTE and CGAN (AE-XGB-SMOTE-CGAN) for detecting credit card fraud. AE-XGB-SMOTE-CGAN is a novel method proposed for credit card fraud detection problems. The credit card fraud dataset comes from a real dataset anonymized by a bank and is highly imbalanced, with normal data far outnumbering fraud data. An autoencoder (AE) is used to extract relevant features from the dataset, enhancing feature representation learning; these features are then fed into XGBoost for classification according to a threshold. Additionally, in this study we propose a novel approach that hybridizes the Generative Adversarial Network (GAN) and the Synthetic Minority Over-Sampling Technique (SMOTE) to tackle class imbalance problems. Our two-phase oversampling approach involves knowledge transfer and leverages the synergies of SMOTE and GAN. Specifically, the GAN transforms the unrealistic or overgeneralized samples generated by SMOTE into realistic data distributions in cases where there is not enough minority-class data for the GAN to work effectively on its own. SMOTE is used to address class imbalance, and the CGAN is used to generate new, realistic data to supplement the original dataset. The AE-XGB-SMOTE-CGAN algorithm is also compared to other commonly used machine learning algorithms, such as KNN and LightGBM, and shows an overall improvement of 2% in the ACC index over these algorithms.
The AE-XGB-SMOTE-CGAN algorithm also outperforms KNN in terms of the MCC index by 30% when the threshold is set to 0.35. This indicates that the AE-XGB-SMOTE-CGAN algorithm has higher accuracy, true positive rate, true negative rate, and Matthew’s correlation coefficient, making it a promising method for detecting credit card fraud.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The promise of machine learning successfully exploiting digital phenotyping data to forecast mental states in psychiatric populations could greatly improve clinical practice. Previous research focused on binary classification and continuous regression, disregarding the often ordinal nature of prediction targets derived from clinical rating scales. In addition, mental health ratings typically show important class imbalance or skewness that needs to be accounted for when evaluating predictive performance. Moreover, it remains unclear which machine learning algorithm is best suited for forecast tasks, with eXtreme Gradient Boosting (XGBoost) and long short-term memory (LSTM) being 2 popular choices in digital phenotyping studies. The CrossCheck dataset includes 6,364 mental state surveys using 4-point ordinal rating scales and 23,551 days of smartphone sensor data contributed by patients with schizophrenia. We trained 120 machine learning models to forecast 10 mental states (e.g., Calm, Depressed, Seeing things) from passive sensor data on 2 predictive tasks (ordinal regression, binary classification) with 2 learning algorithms (XGBoost, LSTM) over 3 forecast horizons (same day, next day, next week). A majority of ordinal regression and binary classification models performed significantly above baseline, with macro-averaged mean absolute error values between 1.19 and 0.77, and balanced accuracy between 58% and 73%, which corresponds to similar levels of performance when these metrics are scaled. Results also showed that metrics that do not account for imbalance (mean absolute error, accuracy) systematically overestimated performance, XGBoost models performed on par with or better than LSTM models, and a significant yet very small decrease in performance was observed as the forecast horizon expanded.
In conclusion, when using performance metrics that properly account for class imbalance, ordinal forecast models demonstrated comparable performance to the prevalent binary classification approach without losing valuable clinical information from self-reports, thus providing richer and easier to interpret predictions.
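The macro-averaged mean absolute error used above averages the per-class MAE so that rare rating levels count as much as common ones. A minimal sketch on toy 4-point ratings (hypothetical values, not the CrossCheck data):

```python
def macro_mae(y_true, y_pred, classes):
    """Macro-averaged MAE: compute MAE separately within each true
    class, then average across the classes that actually occur."""
    per_class = []
    for c in classes:
        errs = [abs(t - p) for t, p in zip(y_true, y_pred) if t == c]
        if errs:
            per_class.append(sum(errs) / len(errs))
    return sum(per_class) / len(per_class)

# Imbalanced ratings: plain MAE looks good because a constant prediction
# matches the dominant class, but macro MAE exposes the missed rare rating.
y_true = [1, 1, 1, 1, 1, 4]
y_pred = [1, 1, 1, 1, 1, 1]
plain = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain, macro_mae(y_true, y_pred, [1, 2, 3, 4]))  # plain 0.5, macro 1.5
```

This is the same imbalance-correction idea as balanced accuracy, transferred to an ordinal target where prediction errors have magnitudes rather than just being right or wrong.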
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Kawasaki Disease (KD) is a rare febrile illness affecting infants and young children, potentially leading to coronary artery complications and, in severe cases, mortality if untreated. However, KD is frequently misdiagnosed as a common fever in clinical settings, and the inherent data imbalance further complicates accurate prediction when using traditional machine learning and statistical methods. This paper introduces two advanced approaches to address these challenges, enhancing prediction accuracy and generalizability. The first approach proposes a stacking model termed the Disease Classifier (DC), specifically designed to recognize minority class samples within imbalanced datasets, thereby mitigating the bias commonly observed in traditional models toward the majority class. Secondly, we introduce a combined model, the Disease Classifier with CTGAN (CTGAN-DC), which integrates DC with Conditional Tabular Generative Adversarial Network (CTGAN) technology to improve data balance and predictive performance further. Utilizing CTGAN-based oversampling techniques, this model retains the original data characteristics of KD while expanding data diversity. This effectively balances positive and negative KD samples, significantly reducing model bias toward the majority class and enhancing both predictive accuracy and generalizability. Experimental evaluations indicate substantial performance gains, with the DC and CTGAN-DC models achieving notably higher predictive accuracy than individual machine learning models. Specifically, the DC model achieves sensitivity and specificity rates of 95%, while the CTGAN-DC model achieves 95% sensitivity and 97% specificity, demonstrating superior recognition capability. 
Furthermore, both models exhibit strong generalizability across diverse KD datasets, particularly the CTGAN-DC model, which surpasses the JAMA model with a 3% increase in sensitivity and a 95% improvement in generalization sensitivity and specificity, effectively resolving the model collapse issue observed in the JAMA model. In sum, the proposed DC and CTGAN-DC architectures demonstrate robust generalizability across multiple KD datasets from various healthcare institutions and significantly outperform other models, including XGBoost. These findings lay a solid foundation for advancing disease prediction in the context of imbalanced medical data.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A comparative analysis of DC, CTGAN-DC, XGBoost, CTGAN-XG, and TVAE-XG models in Kawasaki Disease experiments.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Diabetes Mellitus is a global health concern, characterized by high blood sugar levels over a prolonged period, leading to severe complications if left unmanaged. The early identification of individuals at risk is critical for effective intervention and treatment. Traditional diagnostic methods rely heavily on clinical symptoms and biochemical tests, which may not capture the underlying genetic predispositions. With the advent of genomics, DNA sequence analysis has emerged as a promising approach to uncover the genetic markers associated with Diabetes Mellitus. However, the challenge lies in accurately classifying DNA sequences to predict susceptibility to the disease, given the complex nature of genetic data. This study addresses this challenge by employing two advanced machine learning models, NuSVC (Nu-Support Vector Classification) and XGBoost (Extreme Gradient Boosting), to classify DNA sequences related to Diabetes Mellitus. The dataset, obtained from reputable sources like NCBI, was preprocessed using Natural Language Processing (NLP) techniques, where DNA sequences were treated as textual data and transformed into numerical features using TF-IDF (Term Frequency-Inverse Document Frequency). To handle the class imbalance in the dataset, SMOTE (Synthetic Minority Over-sampling Technique) was applied. The models were trained and validated using 10-fold cross-validation. XGBoost was trained with up to 300 boosting rounds, and performance was evaluated using accuracy, precision, recall, F1-score, ROC-AUC, and log loss. The results demonstrate that XGBoost outperformed NuSVC across all metrics, achieving an accuracy of 98%, a log loss of 0.0650, and an AUC of 1.00, compared to NuSVC’s accuracy of 87%, log loss of 0.2649, and AUC of 0.95. The superior performance of XGBoost indicates its robustness in handling complex genetic data and its potential utility in clinical applications for early diagnosis of Diabetes Mellitus. 
The findings of this study underscore the importance of advanced machine learning techniques in genomics and suggest that integrating such models into healthcare systems could significantly enhance predictive diagnostics.
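Treating DNA as text, as described above, typically means splitting sequences into overlapping k-mers before applying TF-IDF. A minimal sketch in pure Python (k=3 and the toy sequences are illustrative choices, not the study's actual setup):

```python
import math
from collections import Counter

def kmers(seq, k=3):
    """Split a DNA sequence into overlapping k-mers, the usual way to
    turn sequences into 'words' for text-style vectorization."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def tfidf(docs):
    """Tiny TF-IDF over k-mer 'documents' (smoothed IDF)."""
    n = len(docs)
    df = Counter(kmer for doc in docs for kmer in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1)
                        for t, c in tf.items()})
    return weights

docs = [kmers("ATGCGATG"), kmers("GGCATGCA")]
w = tfidf(docs)
print(sorted(w[0], key=w[0].get, reverse=True)[0])  # highest-weighted 3-mer: ATG
```

The resulting sparse weight vectors are what a classifier such as XGBoost or NuSVC would consume; a production pipeline would typically use scikit-learn's `TfidfVectorizer` with a k-mer tokenizer instead of hand-rolled code.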
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
There is a substantial increase in sexually transmitted infections (STIs) among men who have sex with men (MSM) globally. Unprotected sexual practices, multiple sex partners, criminalization, stigmatization, fear of discrimination, substance use, poor access to care, and the lack of early STI screening tools are among the contributing factors. Therefore, this study applied multilayer perceptron (MLP), extremely randomized trees (ExtraTrees), and XGBoost machine learning models to predict STIs among MSM using bio-behavioural survey (BBS) data in Zimbabwe. Data were collected from 1538 MSM in Zimbabwe. The dataset was split into training and testing sets using a ratio of 80% to 20%, respectively. The synthetic minority oversampling technique (SMOTE) was applied to address class imbalance. Using a stepwise logistic regression model, the study revealed several predictors of STIs among MSM, such as age, cohabitation with sex partners, education status, and employment status. The results show that MLP performed better than the other STI predictive models (XGBoost and ExtraTrees), achieving an accuracy of 87.54%, recall of 97.29%, precision of 89.64%, F1-score of 93.31%, and AUC of 66.78%. XGBoost achieved an accuracy of 86.51%, recall of 96.51%, precision of 89.25%, F1-score of 92.74%, and AUC of 54.83%. ExtraTrees recorded an accuracy of 85.47%, recall of 95.35%, precision of 89.13%, F1-score of 92.13%, and AUC of 60.21%. These models can be effectively used to identify highly at-risk MSM, for STI surveillance, and to further develop STI screening tools to improve the health outcomes of MSM.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Results from the logistic regression model and STI risk factors among MSM.