25 datasets found

Performance comparison of machine learning models across accuracy, AUC, MCC,...
plos.figshare.com
xls
Updated Dec 31, 2024
Cite
Seongil Han; Haemin Jung (2024). Performance comparison of machine learning models across accuracy, AUC, MCC, and F1 score on GMSC dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0316454.t005
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0316454.t005
Dataset updated
Dec 31, 2024
Dataset provided by
PLOS ONE
Authors
Seongil Han; Haemin Jung
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Performance comparison of machine learning models across accuracy, AUC, MCC, and F1 score on GMSC dataset.
Data from: Addressing Imbalanced Classification Problems in Drug Discovery...
acs.figshare.com
zip
Updated Apr 15, 2025
Cite
Ayush Garg; Narayanan Ramamurthi; Shyam Sundar Das (2025). Addressing Imbalanced Classification Problems in Drug Discovery and Development Using Random Forest, Support Vector Machine, AutoGluon-Tabular, and H2O AutoML [Dataset]. http://doi.org/10.1021/acs.jcim.5c00023.s001
Available download formats: zip
Unique identifier
https://doi.org/10.1021/acs.jcim.5c00023.s001
Dataset updated
Apr 15, 2025
Dataset provided by
ACS Publications
Authors
Ayush Garg; Narayanan Ramamurthi; Shyam Sundar Das
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Classification models built on class-imbalanced data sets tend to prioritize the accuracy of the majority class, and thus the minority class generally has a higher misclassification rate. Different techniques are available to address class imbalance in classification models and can be categorized as data-level, algorithm-level, and hybrid methods. But to the best of our knowledge, an in-depth analysis of the performance of these techniques against the class ratio is not available in the literature. We have addressed this shortcoming in this study and have performed a detailed analysis of the performance of four different techniques to address imbalanced class distribution using machine learning (ML) methods and AutoML tools. To carry out our study, we have selected four such techniques: (a) threshold optimization using (i) GHOST and (ii) the area under the precision–recall curve (AUPR), (b) the internal balancing method of AutoML and class-weight of machine learning methods, and (c) data balancing using SMOTETomek. We generated 27 data sets considering nine different class ratios (i.e., the ratio of the positive class to total samples) from three data sets that belong to the drug discovery and development field. We have employed random forest (RF) and support vector machine (SVM) as representatives of ML classifiers and AutoGluon-Tabular (version 0.6.1) and H2O AutoML (version 3.40.0.4) as representatives of AutoML tools.
The important findings of our study are as follows: (i) threshold optimization has no effect on ranking metrics such as AUC and AUPR, but AUC and AUPR are affected by class-weighting and SMOTETomek; (ii) for the ML methods RF and SVM, significant percentage improvements of up to 375, 33.33, and 450 over all the data sets can be achieved for F1 score, MCC, and balanced accuracy, respectively, which are suitable for performance evaluation of imbalanced data sets; (iii) for the AutoML libraries AutoGluon-Tabular and H2O AutoML, significant percentage improvements of up to 383.33, 37.25, and 533.33 over all the data sets can be achieved for F1 score, MCC, and balanced accuracy, respectively; (iv) the general pattern for balanced accuracy is that the percentage improvement increases when the class ratio is systematically decreased from 0.5 to 0.1; in the case of F1 score and MCC, maximum improvement is achieved at the class ratio of 0.3; (v) for both ML and AutoML with balancing, no individual class-balancing technique outperforms all other methods on a significantly higher number of data sets based on F1 score; (vi) the three external balancing techniques combined outperformed the internal balancing methods of the ML and AutoML; (vii) AutoML tools perform as well as the ML models, and in some cases even better, for handling imbalanced classification when applied with imbalance handling techniques. In summary, exploration of multiple data balancing techniques is recommended for classifying imbalanced data sets to achieve optimal performance, as neither the external techniques nor the internal techniques significantly outperform the others. The results are specific to the ML methods and AutoML libraries used in this study, and for generalization, a study can be carried out considering a sizable number of ML methods and AutoML libraries.
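Finding (i) above can be illustrated with a small sketch: moving the decision threshold changes F1 score, MCC, and balanced accuracy, while AUC depends only on how the scores rank the samples. The toy labels, scores, and the two thresholds below are hypothetical and are not taken from the study; the sketch does not implement GHOST or the AUPR-based procedure, only a manual threshold change.

```python
from math import sqrt

def confusion(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def f1_score(tp, tn, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mcc(tp, tn, fp, fn):
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def balanced_accuracy(tp, tn, fp, fn):
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return (recall + specificity) / 2

def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def metrics_at(y_true, scores, threshold):
    """Threshold-dependent metrics for a given decision threshold."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    cm = confusion(y_true, y_pred)
    return {"f1": f1_score(*cm), "mcc": mcc(*cm), "bal_acc": balanced_accuracy(*cm)}

# Hypothetical toy data: class ratio 0.2 (2 positives out of 10 samples).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.45, 0.40, 0.60]

default = metrics_at(y_true, scores, 0.5)  # conventional threshold
tuned = metrics_at(y_true, scores, 0.4)    # hypothetical lowered threshold
ranking_auc = auc(y_true, scores)          # no threshold involved

print(default, tuned, ranking_auc)
```

The `auc` function never sees a threshold, which is why threshold optimization cannot change it; resampling or class weighting retrains the model and changes the scores themselves, so ranking metrics can move.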
Over-sampled dataset.
figshare.com
xls
Updated Dec 31, 2024
Cite
Seongil Han; Haemin Jung (2024). Over-sampled dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0316454.t004
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0316454.t004
Dataset updated
Dec 31, 2024
Dataset provided by
PLOS ONE
Authors
Seongil Han; Haemin Jung
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Credit scoring models play a crucial role for financial institutions in evaluating borrower risk and sustaining profitability. Logistic regression is widely used in credit scoring due to its robustness, interpretability, and computational efficiency; however, its predictive power decreases when applied to complex or non-linear datasets, resulting in reduced accuracy. In contrast, tree-based machine learning models often provide enhanced predictive performance but struggle with interpretability. Furthermore, imbalanced class distributions, which are prevalent in credit scoring, can adversely impact model accuracy and robustness, as the majority class tends to dominate. Despite these challenges, research that comprehensively addresses both the predictive performance and explainability aspects within the credit scoring domain remains limited. This paper introduces the Non-pArameTric oversampling approach for Explainable credit scoring (NATE), a framework designed to address these challenges by combining oversampling techniques with tree-based classifiers to enhance model performance and interpretability. NATE incorporates class balancing methods to mitigate the impact of imbalanced data distributions and integrates interpretability features to elucidate the model’s decision-making process. Experimental results show that NATE substantially outperforms traditional logistic regression in credit risk classification, with improvements of 19.33% in AUC, 71.56% in MCC, and 85.33% in F1 Score. Oversampling approaches, particularly when used with gradient boosting, demonstrated superior effectiveness compared to undersampling, achieving optimal metrics of AUC: 0.9649, MCC: 0.8104, and F1 Score: 0.9072. Moreover, NATE enhances interpretability by providing detailed insights into feature contributions, aiding in understanding individual predictions. 
These findings highlight NATE’s capability in managing class imbalance, improving predictive performance, and enhancing model interpretability, demonstrating its potential as a reliable and transparent tool for credit scoring applications.
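The contrast between oversampling and undersampling mentioned above can be sketched with plain random resampling. This is not the NATE procedure, whose details are not given in this description; the function names, the seed, and the toy data are hypothetical and only show how the two strategies change class counts.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority samples until all classes match the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lbl in zip(X, y) if lbl == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out

def random_undersample(X, y, seed=0):
    """Keep a random subset of each class so all classes match the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        pool = [x for x, lbl in zip(X, y) if lbl == label]
        for x in rng.sample(pool, target):
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out

# Hypothetical toy data: four majority samples (label 0), one minority sample (label 1).
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
print(Counter(y_over), Counter(y_under))
```

Oversampling keeps every original row and grows the data set, whereas undersampling discards majority rows, which is one commonly cited reason it can lose information.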
Reviewed literature descriptions.
plos.figshare.com
xls
Updated Nov 30, 2023
Cite
Michael Owusu-Adjei; James Ben Hayfron-Acquah; Twum Frimpong; Gaddafi Abdul-Salaam (2023). Reviewed literature descriptions. [Dataset]. http://doi.org/10.1371/journal.pdig.0000290.t001
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pdig.0000290.t001
Dataset updated
Nov 30, 2023
Dataset provided by
PLOS Digital Health
Authors
Michael Owusu-Adjei; James Ben Hayfron-Acquah; Twum Frimpong; Gaddafi Abdul-Salaam
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Most research studies focus extensively on predictive algorithms and their performance evaluation to determine the best or most appropriate predictive model, with the optimum prediction solution indicated by prediction accuracy score, precision, recall, F1 score, etc. The prediction accuracy score has been used extensively as the main metric for performance recommendation. It is one of the most widely used metrics for identifying the optimal prediction solution, irrespective of the dataset class distribution context, the nature of the dataset, or the output class distribution between the minority and majority classes. The key research question, however, is the impact of class inequality on the prediction accuracy score in datasets with output class imbalance, as compared to the balanced accuracy score, in determining model performance in healthcare and other real-world application systems. Answering this question requires an appraisal of the current state of knowledge on the use of both the prediction accuracy score and the balanced accuracy score in real-world applications where class distribution is unequal. A review of related works that highlight the use of imbalanced class distribution datasets with evaluation metrics will assist in contextualizing this systematic review.
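A short worked example may make the research question concrete. Assume a hypothetical data set with 95 negative and 5 positive cases and a trivial classifier that always predicts the majority class; neither the data nor the classifier comes from the reviewed studies.

```python
def accuracy(y_true, y_pred):
    """Fraction of all predictions that are correct."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so each class counts equally."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Hypothetical imbalanced data: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
# A classifier that ignores the minority class entirely.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
print(acc)      # 0.95
print(bal_acc)  # 0.5
```

The accuracy of 0.95 looks strong even though no positive case is detected, while the balanced accuracy of 0.5 matches what a chance-level classifier would score.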
GMSC dataset (IR: Imbalance Ratio).
plos.figshare.com
xls
Updated Dec 31, 2024
Cite
Seongil Han; Haemin Jung (2024). GMSC dataset (IR: Imbalance Ratio). [Dataset]. http://doi.org/10.1371/journal.pone.0316454.t001
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0316454.t001
Dataset updated
Dec 31, 2024
Dataset provided by
PLOS ONE
Authors
Seongil Han; Haemin Jung
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Credit scoring models play a crucial role for financial institutions in evaluating borrower risk and sustaining profitability. Logistic regression is widely used in credit scoring due to its robustness, interpretability, and computational efficiency; however, its predictive power decreases when applied to complex or non-linear datasets, resulting in reduced accuracy. In contrast, tree-based machine learning models often provide enhanced predictive performance but struggle with interpretability. Furthermore, imbalanced class distributions, which are prevalent in credit scoring, can adversely impact model accuracy and robustness, as the majority class tends to dominate. Despite these challenges, research that comprehensively addresses both the predictive performance and explainability aspects within the credit scoring domain remains limited. This paper introduces the Non-pArameTric oversampling approach for Explainable credit scoring (NATE), a framework designed to address these challenges by combining oversampling techniques with tree-based classifiers to enhance model performance and interpretability. NATE incorporates class balancing methods to mitigate the impact of imbalanced data distributions and integrates interpretability features to elucidate the model’s decision-making process. Experimental results show that NATE substantially outperforms traditional logistic regression in credit risk classification, with improvements of 19.33% in AUC, 71.56% in MCC, and 85.33% in F1 Score. Oversampling approaches, particularly when used with gradient boosting, demonstrated superior effectiveness compared to undersampling, achieving optimal metrics of AUC: 0.9649, MCC: 0.8104, and F1 Score: 0.9072. Moreover, NATE enhances interpretability by providing detailed insights into feature contributions, aiding in understanding individual predictions. 
These findings highlight NATE’s capability in managing class imbalance, improving predictive performance, and enhancing model interpretability, demonstrating its potential as a reliable and transparent tool for credit scoring applications.
Data from: Machine Learning Model for Screening Thyroid Stimulating Hormone...
figshare.com
xlsx
Updated Jun 2, 2023
Cite
Wenjia Liu; Zhongyu Wang; Jingwen Chen; Weihao Tang; Haobo Wang (2023). Machine Learning Model for Screening Thyroid Stimulating Hormone Receptor Agonists Based on Updated Datasets and Improved Applicability Domain Metrics [Dataset]. http://doi.org/10.1021/acs.chemrestox.3c00074.s002
Available download formats: xlsx
Unique identifier
https://doi.org/10.1021/acs.chemrestox.3c00074.s002
Dataset updated
Jun 2, 2023
Dataset provided by
ACS Publications
Authors
Wenjia Liu; Zhongyu Wang; Jingwen Chen; Weihao Tang; Haobo Wang
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Machine learning (ML) models for screening endocrine-disrupting chemicals (EDCs), such as thyroid stimulating hormone receptor (TSHR) agonists, are essential for sound management of chemicals. Previous models for screening TSHR agonists were built on imbalanced datasets and lacked applicability domain (AD) characterization essential for regulatory application. Herein, an updated TSHR agonist dataset was built, for which the ratio of active to inactive compounds greatly increased to 1:2.6, and chemical spaces of structure–activity landscapes (SALs) were enhanced. Resulting models based on 7 molecular representations and 4 ML algorithms were proven to outperform previous ones. Weighted similarity density (ρs) and weighted inconsistency of activities (IA) were proposed to characterize the SALs, and a state-of-the-art AD characterization methodology ADSAL{ρs, IA} was established. An optimal classifier developed with PubChem fingerprints and the random forest algorithm, coupled with ADSAL{ρs ≥ 0.15, IA ≤ 0.65}, exhibited good performance on the validation set with the area under the receiver operating characteristic curve being 0.984 and balanced accuracy being 0.941 and identified 90 TSHR agonist classes that could not be found previously. The classifier together with the ADSAL{ρs, IA} may serve as efficient tools for screening EDCs, and the AD characterization methodology may be applied to other ML models.
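The applicability-domain thresholds quoted above (ρs ≥ 0.15, IA ≤ 0.65) amount to a simple filter once ρs and IA have been computed for a query compound. The sketch below assumes those two values are already available; how they are calculated is not described here, and the function name, dictionary keys, and compound values are hypothetical.

```python
def in_applicability_domain(rho_s, ia, rho_min=0.15, ia_max=0.65):
    """Return True when a compound meets both thresholds quoted in the description."""
    return rho_s >= rho_min and ia <= ia_max

# Hypothetical precomputed values for three query compounds.
compounds = [
    {"id": "cmpd-1", "rho_s": 0.40, "ia": 0.20},  # dense neighbourhood, consistent activities
    {"id": "cmpd-2", "rho_s": 0.10, "ia": 0.20},  # too few similar training compounds
    {"id": "cmpd-3", "rho_s": 0.50, "ia": 0.80},  # neighbours disagree on activity
]

reliable = [c["id"] for c in compounds if in_applicability_domain(c["rho_s"], c["ia"])]
print(reliable)
```

Predictions for compounds outside the domain would typically be flagged rather than reported as reliable.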
Increase in AUC, MCC, and F1 between oversampling and undersampling.
plos.figshare.com
xls
Updated Dec 31, 2024
Cite
Seongil Han; Haemin Jung (2024). Increase in AUC, MCC, and F1 between oversampling and undersampling. [Dataset]. http://doi.org/10.1371/journal.pone.0316454.t009
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0316454.t009
Dataset updated
Dec 31, 2024
Dataset provided by
PLOS ONE
Authors
Seongil Han; Haemin Jung
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Increase in AUC, MCC, and F1 between oversampling and undersampling.
Mean and standard deviation of accuracy and recall of different classifiers...
plos.figshare.com
xls
Updated Jul 8, 2025
Cite
Wangyouchen Zhang; Zhenhua Xia; Guoqing Cai; Junhao Wang; Xutao Dong (2025). Mean and standard deviation of accuracy and recall of different classifiers in cross validation. Mean: average of various metrics; SD (×10⁻³): standard deviation of various metrics. [Dataset]. http://doi.org/10.1371/journal.pone.0327120.t005
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0327120.t005
Dataset updated
Jul 8, 2025
Dataset provided by
PLOS ONE
Authors
Wangyouchen Zhang; Zhenhua Xia; Guoqing Cai; Junhao Wang; Xutao Dong
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Mean and standard deviation of accuracy and recall of different classifiers in cross validation. Mean: average of various metrics; SD (×10⁻³): standard deviation of various metrics.
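As a rough illustration of how such a table row could be produced, the sketch below averages per-fold scores and reports the standard deviation in units of 10⁻³. The fold values are hypothetical, and the use of the sample standard deviation (n − 1 denominator) is an assumption; the source does not say which estimator was used.

```python
from statistics import mean, stdev

# Hypothetical accuracy values from five cross-validation folds.
fold_accuracy = [0.971, 0.975, 0.973, 0.976, 0.970]

mean_acc = mean(fold_accuracy)
# Assumption: sample standard deviation; statistics.pstdev would give the population version.
sd_acc = stdev(fold_accuracy)

# Report the SD scaled by 10^3, matching the table's "SD (x10^-3)" convention.
print(f"Mean: {mean_acc:.4f}  SD (x10^-3): {sd_acc * 1e3:.2f}")
```

Scaling the SD by 10³ keeps small fold-to-fold variation readable next to means that are close to 1.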
XGBTree achieved best performance in most of the evaluation metrics...
plos.figshare.com
xls
Updated Jun 11, 2023
Cite
Tuan Tran; Uyen Le; Yihui Shi (2023). XGBTree achieved best performance in most of the evaluation metrics (PrePro—Pre-processing type (B—Balanced (ENUS), O—Original data); VarRem—Variable removal (Y—Yes, N—No)). [Dataset]. http://doi.org/10.1371/journal.pone.0269135.t005
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0269135.t005
Dataset updated
Jun 11, 2023
Dataset provided by
PLOS ONE
Authors
Tuan Tran; Uyen Le; Yihui Shi
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
XGBTree achieved best performance in most of the evaluation metrics (PrePro—Pre-processing type (B—Balanced (ENUS), O—Original data); VarRem—Variable removal (Y—Yes, N—No)).
Diabetes-Prediction-Analysis dataset (8).
plos.figshare.com
xls
Updated Jul 8, 2025
Cite
Wangyouchen Zhang; Zhenhua Xia; Guoqing Cai; Junhao Wang; Xutao Dong (2025). Diabetes-Prediction-Analysis dataset (8). [Dataset]. http://doi.org/10.1371/journal.pone.0327120.t001
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0327120.t001
Dataset updated
Jul 8, 2025
Dataset provided by
PLOS ONE
Authors
Wangyouchen Zhang; Zhenhua Xia; Guoqing Cai; Junhao Wang; Xutao Dong
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
To improve the effectiveness of diabetes risk prediction, this study proposes a novel method based on focal active learning strategies combined with machine learning models. Existing machine learning models often suffer from poor performance on imbalanced medical datasets, where minority class instances such as diabetic cases are underrepresented. Our proposed Focal Active Learning method selectively samples informative instances to mitigate this imbalance, leading to better prediction outcomes with fewer labeled samples. The method integrates SHAP (SHapley Additive Explanations) to quantify feature importance and applies attention mechanisms to dynamically adjust feature weights, enhancing model interpretability and performance in predicting diabetes risk. To address the issue of imbalanced classification in diabetes datasets, we employed a clustering-based method to identify representative data points (called foci), and iteratively constructed a smaller labeled dataset (sub-pool) around them using similarity-based sampling. This method aims to overcome common challenges, such as poor performance on minority classes and limited generalization, by enabling more efficient data utilization and reducing labeling costs. The experimental results demonstrated that our approach significantly improved the evaluation metrics for diabetes risk prediction, achieving an accuracy of 97.41% and a recall rate of 94.70%, clearly outperforming traditional models that typically achieve 95% accuracy and 92% recall. Additionally, the model’s generalization ability was further validated on the public PIMA Indians Diabetes DataBase, outperforming traditional models in both accuracy and recall. This approach can enhance early diabetes screening in clinical settings, helping healthcare professionals reduce diagnostic errors and optimize resource allocation.
Evaluation of benchmark and optimal model performance with resampling...
plos.figshare.com
xls
Updated Dec 31, 2024
Cite
Seongil Han; Haemin Jung (2024). Evaluation of benchmark and optimal model performance with resampling techniques. [Dataset]. http://doi.org/10.1371/journal.pone.0316454.t008
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0316454.t008
Dataset updated
Dec 31, 2024
Dataset provided by
PLOS ONE
Authors
Seongil Han; Haemin Jung
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Evaluation of benchmark and optimal model performance with resampling techniques.
Searching space for hyperparameters in Table 7.
plos.figshare.com
xls
Updated Dec 31, 2024
Cite
Seongil Han; Haemin Jung (2024). Searching space for hyperparameters in Table 7. [Dataset]. http://doi.org/10.1371/journal.pone.0316454.t006
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0316454.t006
Dataset updated
Dec 31, 2024
Dataset provided by
PLOS ONE
Authors
Seongil Han; Haemin Jung
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Credit scoring models play a crucial role for financial institutions in evaluating borrower risk and sustaining profitability. Logistic regression is widely used in credit scoring due to its robustness, interpretability, and computational efficiency; however, its predictive power decreases when applied to complex or non-linear datasets, resulting in reduced accuracy. In contrast, tree-based machine learning models often provide enhanced predictive performance but struggle with interpretability. Furthermore, imbalanced class distributions, which are prevalent in credit scoring, can adversely impact model accuracy and robustness, as the majority class tends to dominate. Despite these challenges, research that comprehensively addresses both the predictive performance and explainability aspects within the credit scoring domain remains limited. This paper introduces the Non-pArameTric oversampling approach for Explainable credit scoring (NATE), a framework designed to address these challenges by combining oversampling techniques with tree-based classifiers to enhance model performance and interpretability. NATE incorporates class balancing methods to mitigate the impact of imbalanced data distributions and integrates interpretability features to elucidate the model’s decision-making process. Experimental results show that NATE substantially outperforms traditional logistic regression in credit risk classification, with improvements of 19.33% in AUC, 71.56% in MCC, and 85.33% in F1 Score. Oversampling approaches, particularly when used with gradient boosting, demonstrated superior effectiveness compared to undersampling, achieving optimal metrics of AUC: 0.9649, MCC: 0.8104, and F1 Score: 0.9072. Moreover, NATE enhances interpretability by providing detailed insights into feature contributions, aiding in understanding individual predictions. 
These findings highlight NATE’s capability in managing class imbalance, improving predictive performance, and enhancing model interpretability, demonstrating its potential as a reliable and transparent tool for credit scoring applications.
Model performance metrics on imbalanced dataset.
plos.figshare.com
xls
Updated Nov 30, 2023
Cite
Nelson Kimeli Kemboi Yego; Joseph Nkurunziza; Juma Kasozi (2023). Model performance metrics on imbalanced dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0294166.t002
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0294166.t002
Dataset updated
Nov 30, 2023
Dataset provided by
PLOS ONE
Authors
Nelson Kimeli Kemboi Yego; Joseph Nkurunziza; Juma Kasozi
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Universal Health Coverage (UHC) is a global objective aimed at providing equitable access to essential and cost-effective healthcare services, irrespective of individuals’ financial circumstances. Despite efforts to promote UHC through health insurance programs, the uptake in Kenya remains low. This study aimed to explore the factors influencing health insurance uptake and offer insights for effective policy development and outreach programs. The study utilized machine learning techniques on data from the 2021 FinAccess Survey. Among the models examined, the Random Forest model demonstrated the highest performance with notable metrics, including a high Kappa score of 0.9273, Recall score of 0.9640, F1 score of 0.9636, and Accuracy of 0.9636. The study identified several crucial predictors of health insurance uptake, ranked in ascending order of importance by the optimal model, including poverty vulnerability, social security usage, income, education, and marital status. The results suggest that affordability is a significant barrier to health insurance uptake. The study highlights the need to address affordability challenges and implement targeted interventions to improve health insurance uptake in Kenya, thereby advancing progress towards achieving Universal Health Coverage (UHC) and ensuring universal access to quality healthcare services.
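The Kappa score reported above is Cohen's kappa, which corrects observed agreement for the agreement expected by chance. A minimal sketch of the standard formula follows; the labels are hypothetical and are not the study's data, and returning 0.0 in the degenerate case is a convention chosen for this sketch.

```python
def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    # Observed agreement: fraction of samples where prediction equals the label.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Expected agreement if labels and predictions were independent.
    p_e = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in labels)
    # Sketch convention: kappa is undefined when p_e == 1, so return 0.0 there.
    return (p_o - p_e) / (1 - p_e) if p_e != 1 else 0.0

# Hypothetical balanced example: 50 positives and 50 negatives, 2 errors in each class.
y_true = [1] * 50 + [0] * 50
y_pred = [1] * 48 + [0] * 2 + [0] * 48 + [1] * 2

kappa = cohen_kappa(y_true, y_pred)
print(round(kappa, 4))
```

Here the observed agreement is 0.96 and the chance agreement is 0.5, so kappa is lower than accuracy because half of the agreement could be expected by chance alone.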
Confusion matrix.
plos.figshare.com
xls
Updated May 16, 2024
Cite
Md Abdus Sahid; Mozaddid Ul Hoque Babar; Md Palash Uddin (2024). Confusion matrix. [Dataset]. http://doi.org/10.1371/journal.pone.0300785.t003
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0300785.t003
Dataset updated
May 16, 2024
Dataset provided by
PLOS ONE
Authors
Md Abdus Sahid; Mozaddid Ul Hoque Babar; Md Palash Uddin
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Diabetes is a persistent metabolic disorder linked to elevated levels of blood glucose, commonly referred to as blood sugar. This condition can have detrimental effects on the heart, blood vessels, eyes, kidneys, and nerves as time passes. It is a chronic ailment that arises when the body fails to produce enough insulin or is unable to effectively use the insulin it produces. When diabetes is not properly managed, it often leads to hyperglycemia, a condition characterized by elevated blood sugar levels or impaired glucose tolerance. This can result in significant harm to various body systems, including the nerves and blood vessels. In this paper, we propose a multiclass diabetes mellitus detection and classification approach using an extremely imbalanced Laboratory of Medical City Hospital data dynamics. We also formulate a new dataset that is moderately imbalanced based on the Laboratory of Medical City Hospital data dynamics. To correctly identify the multiclass diabetes mellitus, we employ three machine learning classifiers, namely support vector machine, logistic regression, and k-nearest neighbor. We also focus on dimensionality reduction (feature selection: filter, wrapper, and embedded methods) to prune the unnecessary features and to improve the classification performance. To optimize the classification performance of the classifiers, we tune the models by hyperparameter optimization with 10-fold grid search cross-validation. In the case of the original extremely imbalanced dataset with a 70:30 partition and the support vector machine classifier, we achieved a maximum accuracy of 0.964, precision of 0.968, recall of 0.964, F1-score of 0.962, Cohen kappa of 0.835, and AUC of 0.99 by using the top 4 features according to the filter method. By using the top 9 features according to wrapper-based sequential feature selection, the k-nearest neighbor classifier provides an accuracy of 0.935 and 1.0 for the other performance metrics.
For our created moderately imbalanced dataset with an 80:20 partition, the SVM classifier achieves a maximum accuracy of 0.938, and 1.0 for the other performance metrics. For multiclass diabetes mellitus detection and classification, our experiments outperformed previously conducted research based on the Laboratory of Medical City Hospital data dynamics.
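The "10-fold grid search cross-validation" step can be sketched generically: every parameter combination is scored on each of the k held-out folds and the combination with the best mean score is kept. The sketch below is an assumption-laden illustration, not the authors' pipeline: the fold splitter, the tiny one-dimensional k-nearest-neighbor scorer, the grid, and the toy data are all hypothetical, and real pipelines usually shuffle or stratify before splitting.

```python
from itertools import product

def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds whose sizes differ by at most one."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def grid_search_cv(grid, n, k, score_fn):
    """Return (best_params, best_mean_score) over the Cartesian product of the grid."""
    folds = k_fold_indices(n, k)
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = []
        for i, test_idx in enumerate(folds):
            # Train on every fold except the one currently held out.
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            scores.append(score_fn(params, train_idx, test_idx))
        mean_score = sum(scores) / k
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

# Hypothetical one-feature binary data, interleaved so each fold holds both classes.
X = [0.1, 0.9, 0.2, 0.8, 0.15, 0.85, 0.3, 0.7, 0.25, 0.75,
     0.05, 0.95, 0.35, 0.65, 0.12, 0.88, 0.22, 0.78, 0.32, 0.68]
y = [0, 1] * 10

def knn_accuracy(params, train_idx, test_idx):
    """Accuracy of a majority-vote k-nearest-neighbor rule on the held-out fold."""
    k = params["n_neighbors"]
    correct = 0
    for i in test_idx:
        nearest = sorted(train_idx, key=lambda j: abs(X[j] - X[i]))[:k]
        votes = sum(y[j] for j in nearest)
        pred = 1 if 2 * votes > len(nearest) else 0
        correct += pred == y[i]
    return correct / len(test_idx)

best_params, best_score = grid_search_cv({"n_neighbors": [1, 3, 5]}, len(X), 10, knn_accuracy)
print(best_params, best_score)
```

Ties between parameter combinations are resolved in favor of the first one evaluated, which is a design choice of this sketch.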
Data Sheet 3_Prediction of outpatient rehabilitation patient preferences and...
frontiersin.figshare.com
docx
Updated Jan 15, 2025
Cite
Xuehui Fan; Ruixue Ye; Yan Gao; Kaiwen Xue; Zeyu Zhang; Jing Xu; Jingpu Zhao; Jun Feng; Yulong Wang (2025). Data Sheet 3_Prediction of outpatient rehabilitation patient preferences and optimization of graded diagnosis and treatment based on XGBoost machine learning algorithm.docx [Dataset]. http://doi.org/10.3389/frai.2024.1473837.s003
Available download formats: docx
Unique identifier
https://doi.org/10.3389/frai.2024.1473837.s003
Dataset updated
Jan 15, 2025
Dataset provided by
Frontiers
Authors
Xuehui Fan; Ruixue Ye; Yan Gao; Kaiwen Xue; Zeyu Zhang; Jing Xu; Jingpu Zhao; Jun Feng; Yulong Wang
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Background: The Department of Rehabilitation Medicine is key to improving patients’ quality of life. Driven by chronic diseases and an aging population, there is a need to enhance the efficiency and resource allocation of outpatient facilities. This study aims to analyze the treatment preferences of outpatient rehabilitation patients by using data and a grading tool to establish predictive models. The goal is to improve patient visit efficiency and optimize resource allocation through these predictive models.
Methods: Data were collected from 38 Chinese institutions, including 4,244 patients visiting outpatient rehabilitation clinics. Data processing was conducted using Python software. The pandas library was used for data cleaning and preprocessing, involving 68 categorical and 12 continuous variables. The steps included handling missing values, data normalization, and encoding conversion. The data were divided into 80% training and 20% test sets using the Scikit-learn library to ensure model independence and prevent overfitting. Performance comparisons among XGBoost, random forest, and logistic regression were conducted using metrics including accuracy and receiver operating characteristic (ROC) curves. The imbalanced learning library’s SMOTE technique was used to address the sample imbalance during model training. The model was optimized using a confusion matrix and feature importance analysis, and partial dependence plots (PDP) were used to analyze the key influencing factors.
Results: XGBoost achieved the highest overall accuracy of 80.21%, with high precision and recall in Category 1. Random forest showed a similar overall accuracy. Logistic regression had a significantly lower accuracy, indicating difficulties with nonlinear data. The key influencing factors identified include distance to medical institutions, arrival time, length of hospital stay, and specific diseases, such as cardiovascular, pulmonary, oncological, and orthopedic conditions. The tiered diagnosis and treatment tool effectively helped doctors assess patients’ conditions and recommend suitable medical institutions based on rehabilitation grading.
Conclusion: This study confirmed that ensemble learning methods, particularly XGBoost, outperform single models in classification tasks involving complex datasets. Addressing class imbalance and enhancing feature engineering can further improve model performance. Understanding patient preferences and the factors influencing medical institution selection can guide healthcare policies to optimize resource allocation, improve service quality, and enhance patient satisfaction. Tiered diagnosis and treatment tools play a crucial role in helping doctors evaluate patient conditions and make informed recommendations for appropriate medical care.
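The SMOTE step mentioned in the methods can be pictured as interpolation between a minority sample and one of its nearest minority neighbors. The sketch below is a simplified stand-in and not the imbalanced-learn implementation used in the study; the function name, the neighbor count, the seed, and the toy points are hypothetical. It would be applied to the training portion only, so the 20% test set keeps its original class distribution.

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours.

    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest other minority samples by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic

# Hypothetical minority-class training samples with two features each.
minority_train = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1]]
new_points = smote_like(minority_train, n_new=4)
print(new_points)
```

Because each synthetic point lies on a segment between two real minority samples, it stays inside the range already covered by the minority class instead of duplicating rows exactly.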
Hyperparameters used in Scikit-learn package in Python [56], including both...
figshare.com
xls
Updated Jun 2, 2023
Cite
Nivedita Bhadra; Shre Kumar Chatterjee; Saptarshi Das (2023). Hyperparameters used in Scikit-learn package in Python [56], including both the default and customized values yielding robust classification on both the 15D and 7D feature space. [Dataset]. http://doi.org/10.1371/journal.pone.0285321.t003
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0285321.t003
Dataset updated
Jun 2, 2023
Dataset provided by
PLOS ONE
Authors
Nivedita Bhadra; Shre Kumar Chatterjee; Saptarshi Das
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Hyperparameters used in Scikit-learn package in Python [56], including both the default and customized values yielding robust classification on both the 15D and 7D feature space.
Hyperparameter Search Space for KNN.
plos.figshare.com
xls
Updated May 16, 2024
Cite
Md Abdus Sahid; Mozaddid Ul Hoque Babar; Md Palash Uddin (2024). Hyperparameter Search Space for KNN. [Dataset]. http://doi.org/10.1371/journal.pone.0300785.t006
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0300785.t006
Dataset updated
May 16, 2024
Dataset provided by
PLOS ONE
Authors
Md Abdus Sahid; Mozaddid Ul Hoque Babar; Md Palash Uddin
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Diabetes is a persistent metabolic disorder linked to elevated levels of blood glucose, commonly referred to as blood sugar. Over time, this condition can have detrimental effects on the heart, blood vessels, eyes, kidneys, and nerves. It is a chronic ailment that arises when the body fails to produce enough insulin or is unable to effectively use the insulin it produces. When diabetes is not properly managed, it often leads to hyperglycemia, a condition characterized by elevated blood sugar levels or impaired glucose tolerance. This can result in significant harm to various body systems, including the nerves and blood vessels. In this paper, we propose a multiclass diabetes mellitus detection and classification approach using an extremely imbalanced Laboratory of Medical City Hospital data dynamics. We also formulate a new dataset that is moderately imbalanced based on the Laboratory of Medical City Hospital data dynamics. To correctly identify the multiclass diabetes mellitus, we employ three machine learning classifiers, namely support vector machine, logistic regression, and k-nearest neighbor. We also focus on dimensionality reduction (feature selection: filter, wrapper, and embedded methods) to prune unnecessary features and improve classification performance. To optimize the classifiers, we tune their hyperparameters using grid search with 10-fold cross-validation. On the original extremely imbalanced dataset with a 70:30 partition, the support vector machine classifier achieved a maximum accuracy of 0.964, precision of 0.968, recall of 0.964, F1-score of 0.962, Cohen kappa of 0.835, and AUC of 0.99 using the top 4 features selected by the filter method. Using the top 9 features selected by wrapper-based sequential feature selection, the k-nearest neighbor classifier provides an accuracy of 0.935 and 1.0 for the other performance metrics. For our created moderately imbalanced dataset with an 80:20 partition, the SVM classifier achieves a maximum accuracy of 0.938, and 1.0 for other performance metrics. For multiclass diabetes mellitus detection and classification, our experiments outperformed previously conducted research based on the Laboratory of Medical City Hospital data dynamics.
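The description reports Cohen kappa alongside accuracy. As a small illustration of why the two can differ on imbalanced data, the sketch below computes both from a multiclass confusion matrix using only the standard library. The function name accuracy_and_kappa and the toy matrix are hypothetical and are not taken from the paper's results.

```python
def accuracy_and_kappa(confusion):
    """Return (accuracy, Cohen's kappa) for a square confusion matrix.

    confusion[i][j] counts samples whose true class is i and predicted class is j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of samples on the diagonal (this is accuracy).
    observed = sum(confusion[i][i] for i in range(n)) / total
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(n)) for j in range(n)]
    # Agreement expected by chance from the true and predicted class frequencies.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical three-class example in which one class dominates.
toy_confusion = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 0, 80],
]
acc, kappa = accuracy_and_kappa(toy_confusion)
```

In this toy case the accuracy is 0.94 while kappa is noticeably lower, because much of the agreement on the dominant class would also be expected by chance.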
Hyperparameter search space for LR.
plos.figshare.com
xls
Updated May 16, 2024
Cite
Md Abdus Sahid; Mozaddid Ul Hoque Babar; Md Palash Uddin (2024). Hyperparameter search space for LR. [Dataset]. http://doi.org/10.1371/journal.pone.0300785.t005
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0300785.t005
Dataset updated
May 16, 2024
Dataset provided by
PLOS ONE
Authors
Md Abdus Sahid; Mozaddid Ul Hoque Babar; Md Palash Uddin
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Diabetes is a persistent metabolic disorder linked to elevated levels of blood glucose, commonly referred to as blood sugar. Over time, this condition can have detrimental effects on the heart, blood vessels, eyes, kidneys, and nerves. It is a chronic ailment that arises when the body fails to produce enough insulin or is unable to effectively use the insulin it produces. When diabetes is not properly managed, it often leads to hyperglycemia, a condition characterized by elevated blood sugar levels or impaired glucose tolerance. This can result in significant harm to various body systems, including the nerves and blood vessels. In this paper, we propose a multiclass diabetes mellitus detection and classification approach using an extremely imbalanced Laboratory of Medical City Hospital data dynamics. We also formulate a new dataset that is moderately imbalanced based on the Laboratory of Medical City Hospital data dynamics. To correctly identify the multiclass diabetes mellitus, we employ three machine learning classifiers, namely support vector machine, logistic regression, and k-nearest neighbor. We also focus on dimensionality reduction (feature selection: filter, wrapper, and embedded methods) to prune unnecessary features and improve classification performance. To optimize the classifiers, we tune their hyperparameters using grid search with 10-fold cross-validation. On the original extremely imbalanced dataset with a 70:30 partition, the support vector machine classifier achieved a maximum accuracy of 0.964, precision of 0.968, recall of 0.964, F1-score of 0.962, Cohen kappa of 0.835, and AUC of 0.99 using the top 4 features selected by the filter method. Using the top 9 features selected by wrapper-based sequential feature selection, the k-nearest neighbor classifier provides an accuracy of 0.935 and 1.0 for the other performance metrics. For our created moderately imbalanced dataset with an 80:20 partition, the SVM classifier achieves a maximum accuracy of 0.938, and 1.0 for other performance metrics. For multiclass diabetes mellitus detection and classification, our experiments outperformed previously conducted research based on the Laboratory of Medical City Hospital data dynamics.
Hyperparameter search space for SVM.
plos.figshare.com
xls
Updated May 16, 2024
Cite
Md Abdus Sahid; Mozaddid Ul Hoque Babar; Md Palash Uddin (2024). Hyperparameter search space for SVM. [Dataset]. http://doi.org/10.1371/journal.pone.0300785.t004
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0300785.t004
Dataset updated
May 16, 2024
Dataset provided by
PLOS ONE
Authors
Md Abdus Sahid; Mozaddid Ul Hoque Babar; Md Palash Uddin
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Diabetes is a persistent metabolic disorder linked to elevated levels of blood glucose, commonly referred to as blood sugar. Over time, this condition can have detrimental effects on the heart, blood vessels, eyes, kidneys, and nerves. It is a chronic ailment that arises when the body fails to produce enough insulin or is unable to effectively use the insulin it produces. When diabetes is not properly managed, it often leads to hyperglycemia, a condition characterized by elevated blood sugar levels or impaired glucose tolerance. This can result in significant harm to various body systems, including the nerves and blood vessels. In this paper, we propose a multiclass diabetes mellitus detection and classification approach using an extremely imbalanced Laboratory of Medical City Hospital data dynamics. We also formulate a new dataset that is moderately imbalanced based on the Laboratory of Medical City Hospital data dynamics. To correctly identify the multiclass diabetes mellitus, we employ three machine learning classifiers, namely support vector machine, logistic regression, and k-nearest neighbor. We also focus on dimensionality reduction (feature selection: filter, wrapper, and embedded methods) to prune unnecessary features and improve classification performance. To optimize the classifiers, we tune their hyperparameters using grid search with 10-fold cross-validation. On the original extremely imbalanced dataset with a 70:30 partition, the support vector machine classifier achieved a maximum accuracy of 0.964, precision of 0.968, recall of 0.964, F1-score of 0.962, Cohen kappa of 0.835, and AUC of 0.99 using the top 4 features selected by the filter method. Using the top 9 features selected by wrapper-based sequential feature selection, the k-nearest neighbor classifier provides an accuracy of 0.935 and 1.0 for the other performance metrics. For our created moderately imbalanced dataset with an 80:20 partition, the SVM classifier achieves a maximum accuracy of 0.938, and 1.0 for other performance metrics. For multiclass diabetes mellitus detection and classification, our experiments outperformed previously conducted research based on the Laboratory of Medical City Hospital data dynamics.
CDR dataset features information.
plos.figshare.com
xls
Updated May 30, 2025
Cite
An Tong; Bochao Chen; Zhe Wang; Jiawei Gao; Chi Kin Lam (2025). CDR dataset features information. [Dataset]. http://doi.org/10.1371/journal.pone.0322004.t002
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0322004.t002
Dataset updated
May 30, 2025
Dataset provided by
PLOS ONE
Authors
An Tong; Bochao Chen; Zhe Wang; Jiawei Gao; Chi Kin Lam
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In recent years, telecom fraud cases have increased significantly, causing substantial losses in people’s daily lives. As technology has advanced, telecom fraud methods have also become more sophisticated, and fraudsters are harder to detect because they often imitate normal users and exhibit highly similar features. Traditional graph neural network (GNN) methods aggregate the features of neighboring nodes, which makes it difficult to distinguish fraudsters from normal users when their features are highly similar. To address this issue, we propose a spatio-temporal graph attention network (GDFGAT) with feature difference-based weight updates. We conducted comprehensive experiments with our method on a real telecom fraud dataset. It obtained an accuracy of 93.28%, an F1 score of 92.08%, a precision of 93.51%, a recall of 90.97%, and an AUC of 94.53%. The results show that GDFGAT outperforms the classical method, the latest methods, and the baseline model on many metrics, with each metric improving by nearly 2%. We also conducted experiments on the imbalanced datasets Amazon and YelpChi, where GDFGAT performed better than the baseline model on some metrics.
