29 datasets found

A comparison of the CRN-SMOTE and RN-SMOTE methods on the health risk...
plos.figshare.com
xls
Updated Feb 10, 2025
Cite
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). A comparison of the CRN-SMOTE and RN-SMOTE methods on the health risk dataset based on different classification metrics using the Random Forest classifier. [Dataset]. http://doi.org/10.1371/journal.pone.0317396.t009
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0317396.t009
Dataset updated
Feb 10, 2025
Dataset provided by
PLOS ONE
Authors
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A comparison of the CRN-SMOTE and RN-SMOTE methods on the health risk dataset based on different classification metrics using the Random Forest classifier.
A comparison of the RN-SMOTE, SMOTE-Tomek Link, SMOTE-ENN, and the proposed...
plos.figshare.com
xls
Updated Feb 10, 2025
Cite
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). A comparison of the RN-SMOTE, SMOTE-Tomek Link, SMOTE-ENN, and the proposed 1CRN-SMOTE methods on the ILPD and QSAR datasets is presented, based on various classification metrics using the Random Forest classifier. [Dataset]. http://doi.org/10.1371/journal.pone.0317396.t007
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0317396.t007
Dataset updated
Feb 10, 2025
Dataset provided by
PLOS ONE
Authors
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A comparison of the RN-SMOTE, SMOTE-Tomek Link, SMOTE-ENN, and the proposed 1CRN-SMOTE methods on the ILPD and QSAR datasets is presented, based on various classification metrics using the Random Forest classifier.
Additional file 2 of Implementation of ensemble machine learning algorithms...
springernature.figshare.com
txt
Updated Jun 20, 2023
Cite
Abdu Rehaman Pasha Syed; Rahul Anbalagan; Anagha S. Setlur; Chandrashekar Karunakaran; Jyoti Shetty; Jitendra Kumar; Vidya Niranjan (2023). Additional file 2 of Implementation of ensemble machine learning algorithms on exome datasets for predicting early diagnosis of cancers [Dataset]. http://doi.org/10.6084/m9.figshare.21592787.v1
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.21592787.v1
Dataset updated
Jun 20, 2023
Dataset provided by
figshare
Authors
Abdu Rehaman Pasha Syed; Rahul Anbalagan; Anagha S. Setlur; Chandrashekar Karunakaran; Jyoti Shetty; Jitendra Kumar; Vidya Niranjan
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Additional file 2. The synthetic dataset generated through TVAE method.
The average values of evaluation metrics on ILDP, QSAR, Blood and Health...
plos.figshare.com
xls
Updated Feb 10, 2025
Cite
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). The average values of evaluation metrics on ILDP, QSAR, Blood and Health risk imbalanced datasets using ADA classifiers and 10-fold cross validation methodology. [Dataset]. http://doi.org/10.1371/journal.pone.0317396.t005
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0317396.t005
Dataset updated
Feb 10, 2025
Dataset provided by
PLOS ONE
Authors
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The average values of evaluation metrics on the ILPD, QSAR, Blood, and Health risk imbalanced datasets using ADA classifiers and a 10-fold cross-validation methodology.
Performance of machine learning models on test set using the SMOTE-adjusted...
plos.figshare.com
xls
Updated Dec 7, 2023
Cite
Performance of machine learning models on test set using the SMOTE-adjusted balanced training set. [Dataset]. https://plos.figshare.com/articles/dataset/Performance_of_machine_learning_models_on_test_set_using_the_SMOTE-adjusted_balanced_training_set_/24767487
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0295427.t006
Dataset updated
Dec 7, 2023
Dataset provided by
PLOS ONE
Authors
Nirajan Budhathoki; Ramesh Bhandari; Suraj Bashyal; Carl Lee
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Performance of machine learning models on test set using the SMOTE-adjusted balanced training set.
Acronym table with description.
plos.figshare.com
xls
Updated Nov 8, 2023
Cite
Nihal Abuzinadah; Muhammad Umer; Abid Ishaq; Abdullah Al Hejaili; Shtwai Alsubai; Ala’ Abdulmajid Eshmawi; Abdullah Mohamed; Imran Ashraf (2023). Acronym table with description. [Dataset]. http://doi.org/10.1371/journal.pone.0293061.t007
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0293061.t007
Dataset updated
Nov 8, 2023
Dataset provided by
PLOS ONE
Authors
Nihal Abuzinadah; Muhammad Umer; Abid Ishaq; Abdullah Al Hejaili; Shtwai Alsubai; Ala’ Abdulmajid Eshmawi; Abdullah Mohamed; Imran Ashraf
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Predicting student performance automatically is of utmost importance, due to the substantial volume of data within educational databases. Educational data mining (EDM) devises techniques to uncover insights from data originating in educational settings. Artificial intelligence (AI) can mine educational data to predict student performance and provide measures to help students avoid failing and learn better. Learning platforms complement traditional learning settings by analyzing student performance, which can help reduce the chance of student failure. Existing methods for student performance prediction in educational data mining faced challenges such as limited accuracy, imbalanced data, and difficulties in feature engineering. These issues hindered effective adaptability and generalization across diverse educational contexts. This study proposes a machine learning-based system with deep convoluted features for the prediction of students’ academic performance. The proposed framework is used to predict student academic performance on both imbalanced datasets and datasets balanced with the synthetic minority oversampling technique (SMOTE). In addition, the performance is evaluated using both the original and the deep convoluted features. Experimental results indicate that the deep convoluted features provide better prediction accuracy than the original features. Results obtained using the extra tree classifier with convoluted features show the highest classification accuracy of 99.9%. In comparison with the state-of-the-art approaches, the proposed approach achieved higher performance. This research introduces a powerful AI-driven system for student performance prediction, offering substantial advancements in accuracy compared to existing approaches.
DataSheet1_Using Image Recognition to Process Unbalanced Data in Genetic...
frontiersin.figshare.com
xlsx
Updated Jun 4, 2023
Cite
Ai-Ru Hsieh; Yi-Mei Aimee Li (2023). DataSheet1_Using Image Recognition to Process Unbalanced Data in Genetic Diseases From Biobanks.xlsx [Dataset]. http://doi.org/10.3389/fgene.2022.822117.s001
Available download formats: xlsx
Unique identifier
https://doi.org/10.3389/fgene.2022.822117.s001
Dataset updated
Jun 4, 2023
Dataset provided by
Frontiers
Authors
Ai-Ru Hsieh; Yi-Mei Aimee Li
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
With precision medicine as the goal, the human biobank of each country should be analyzed to determine the complete research results related to genetic diseases. In addition, with the increase in medical imaging data, automatic image processing with image recognition has been widely studied and applied in biomedicine. However, case–control data imbalance often occurs in human biobanks, which is usually solved by the statistical method SAIGE. Due to the huge amount of genetic data in human biobanks, the direct use of the SAIGE method often faces the problem of insufficient computer memory to support calculations and excessive calculation time. The other method is to use sampling to adjust the data to balance the case–control ratio, which is called Synthetic Minority Oversampling Technique (SMOTE). Our study employed the Manhattan plot and genetic disease information from the Taiwan Biobank to adjust the imbalance in the case–control ratio by SMOTE, called “TW-SMOTE.” We further used a deep learning image recognition system to identify the TW-SMOTE. We found that TW-SMOTE can achieve the same results as that of SAIGE and the UK Biobank (UKB). The processing of the technical data can be equivalent to the use of data plots with a relatively large UKB sample size and achieve the same effect as that of SAIGE in addressing data imbalance.
Classification result classifiers using TF-IDF with SMOTE.
plos.figshare.com
xls
Updated May 28, 2024
Cite
Khaled Alnowaiser (2024). Classification result classifiers using TF-IDF with SMOTE. [Dataset]. http://doi.org/10.1371/journal.pone.0302304.t007
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0302304.t007
Dataset updated
May 28, 2024
Dataset provided by
PLOS http://plos.org/
Authors
Khaled Alnowaiser
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Classification result classifiers using TF-IDF with SMOTE.
Training testing accuracy result of TF and TF-IDF features with SMOTE.
figshare.com
xls
Updated May 28, 2024
Cite
Khaled Alnowaiser (2024). Training testing accuracy result of TF and TF-IDF features with SMOTE. [Dataset]. http://doi.org/10.1371/journal.pone.0302304.t010
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0302304.t010
Dataset updated
May 28, 2024
Dataset provided by
PLOS ONE
Authors
Khaled Alnowaiser
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Training testing accuracy result of TF and TF-IDF features with SMOTE.
The performance of the Different Machine Learning Models evaluated using the...
plos.figshare.com
xls
Updated Jun 7, 2023
Cite
Sherif Sakr; Radwa Elshawi; Amjad Ahmed; Waqas T. Qureshi; Clinton Brawner; Steven Keteyian; Michael J. Blaha; Mouaz H. Al-Mallah (2023). The performance of the Different Machine Learning Models evaluated using the Hold Out method (80/20) using SMOTE. [Dataset]. http://doi.org/10.1371/journal.pone.0195344.t007
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0195344.t007
Dataset updated
Jun 7, 2023
Dataset provided by
PLOS ONE
Authors
Sherif Sakr; Radwa Elshawi; Amjad Ahmed; Waqas T. Qureshi; Clinton Brawner; Steven Keteyian; Michael J. Blaha; Mouaz H. Al-Mallah
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The RTF model achieves the highest AUC (0.89), Sensitivity (75%), Precision (73%) and F-Score (74%). The SVM model achieves the highest Specificity (88.9%).
The selected explanatory variables.
plos.figshare.com
xls
Updated Jun 21, 2023
Cite
Seyed Iman Mohammadpour; Majid Khedmati; Mohammad Javad Hassan Zada (2023). The selected explanatory variables. [Dataset]. http://doi.org/10.1371/journal.pone.0281901.t002
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0281901.t002
Dataset updated
Jun 21, 2023
Dataset provided by
PLOS ONE
Authors
Seyed Iman Mohammadpour; Majid Khedmati; Mohammad Javad Hassan Zada
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
While the cost of road traffic fatalities in the U.S. surpasses $240 billion a year, the availability of high-resolution datasets allows meticulous investigation of the contributing factors to crash severity. In this paper, the dataset for Trucks Involved in Fatal Accidents in 2010 (TIFA 2010) is utilized to classify the truck-involved crash severity where there exist different issues including missing values, imbalanced classes, and high dimensionality. First, a decision tree-based algorithm, the Synthetic Minority Oversampling Technique (SMOTE), and the Random Forest (RF) feature importance approach are employed for missing value imputation, minority class oversampling, and dimensionality reduction, respectively. Afterward, a variety of classification algorithms, including RF, K-Nearest Neighbors (KNN), Multi-Layer Perceptron (MLP), Gradient-Boosted Decision Trees (GBDT), and Support Vector Machine (SVM) are developed to reveal the influence of the introduced data preprocessing framework on the output quality of ML classifiers. The results show that the GBDT model outperforms all the other competing algorithms for the non-preprocessed crash data based on the G-mean performance measure, but the RF makes the most accurate prediction for the treated dataset. This finding indicates that after the feature selection is conducted to alleviate the computational cost of the machine learning algorithms, bagging (bootstrap aggregating) of decision trees in RF leads to a better model rather than boosting them via GBDT. Besides, the adopted feature importance approach decreases the overall accuracy by only up to 5% in most of the estimated models. Moreover, the worst class recall value of the RF algorithm without prior oversampling is only 34.4% compared to the corresponding value of 90.3% in the up-sampled model which validates the proposed multi-step preprocessing scheme. 
This study also identifies the temporal and spatial (roadway) attributes, as well as crash characteristics, and Emergency Medical Service (EMS) as the most critical factors in truck crash severity.
Data from: S5 File -
plos.figshare.com
zip
Updated Sep 18, 2023
Cite
Youzhi Zhang; Sijie Yao; Peng Chen (2023). S5 File - [Dataset]. http://doi.org/10.1371/journal.pone.0290899.s005
Available download formats: zip
Unique identifier
https://doi.org/10.1371/journal.pone.0290899.s005
Dataset updated
Sep 18, 2023
Dataset provided by
PLOS ONE
Authors
Youzhi Zhang; Sijie Yao; Peng Chen
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Protein hotspot residues are key sites that mediate protein-protein interactions. Accurate identification of these residues is essential for understanding the mechanism from protein to function and for designing drug targets. Current research has mostly focused on using machine learning methods to predict hot spots from known interface residues, which artificially extract the corresponding features of amino acid residues from sequence, structure, evolution, energy, and other information to train and test machine learning models. The process is cumbersome, time-consuming and laborious to some extent. This paper proposes a novel idea that develops a pre-trained protein sequence embedding model combined with a one-dimensional convolutional neural network, called Embed-1dCNN, to predict protein hotspot residues. In order to obtain large data samples, this work integrates and extracts data from the datasets of ASEdb, BID, SKEMPI and dbMPIKT to generate a new dataset, and adopts the SMOTE algorithm to expand positive samples to form the training set. The experimental results show that the method achieves an F1 score of 0.82 on the test set. Compared with other hot spot prediction methods, our model achieved better prediction performance.
Performance measure of our scheme using K-means+SMOTE+ENN.
figshare.com
xls
Updated May 31, 2024
Cite
Sumya Akter; Hossen A. Mustafa (2024). Performance measure of our scheme using K-means+SMOTE+ENN. [Dataset]. http://doi.org/10.1371/journal.pone.0300670.t005
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0300670.t005
Dataset updated
May 31, 2024
Dataset provided by
PLOS ONE
Authors
Sumya Akter; Hossen A. Mustafa
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Performance measure of our scheme using K-means+SMOTE+ENN.
Data_Sheet_1_Developing a machine-learning model for real-time prediction of...
frontiersin.figshare.com
docx
Updated Jun 2, 2023
Cite
Kuo-Yang Huang; Ying-Lin Hsu; Huang-Chi Chen; Ming-Hwarng Horng; Che-Liang Chung; Ching-Hsiung Lin; Jia-Lang Xu; Ming-Hon Hou (2023). Data_Sheet_1_Developing a machine-learning model for real-time prediction of successful extubation in mechanically ventilated patients using time-series ventilator-derived parameters.docx [Dataset]. http://doi.org/10.3389/fmed.2023.1167445.s001
Available download formats: docx
Unique identifier
https://doi.org/10.3389/fmed.2023.1167445.s001
Dataset updated
Jun 2, 2023
Dataset provided by
Frontiers
Authors
Kuo-Yang Huang; Ying-Lin Hsu; Huang-Chi Chen; Ming-Hwarng Horng; Che-Liang Chung; Ching-Hsiung Lin; Jia-Lang Xu; Ming-Hon Hou
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Background: Successful weaning from mechanical ventilation is important for patients admitted to intensive care units. However, models for predicting real-time weaning outcomes remain inadequate. Therefore, this study aimed to develop a machine-learning model that predicts successful extubation with good accuracy using only time-series ventilator-derived parameters. Methods: Patients with mechanical ventilation admitted to the Yuanlin Christian Hospital in Taiwan between August 2015 and November 2020 were retrospectively included. A dataset with ventilator-derived parameters was obtained before extubation. Recursive feature elimination was applied to select the most important features. Machine-learning models of logistic regression, random forest (RF), and support vector machine were adopted to predict extubation outcomes. In addition, the synthetic minority oversampling technique (SMOTE) was employed to address the data imbalance problem. The area under the receiver operating characteristic curve (AUC), F1 score, and accuracy, along with 10-fold cross-validation, were used to evaluate prediction performance. Results: In this study, 233 patients were included, of whom 28 (12.0%) failed extubation. The six ventilatory variables per 180 s dataset had optimal feature importance. RF exhibited better performance than the others, with an AUC value of 0.976 (95% confidence interval [CI], 0.975–0.976), accuracy of 94.0% (95% CI, 93.8–94.3%), and an F1 score of 95.8% (95% CI, 95.7–96.0%). The difference in RF performance between the original and SMOTE datasets was small. Conclusion: The RF model demonstrated a good performance in predicting successful extubation in mechanically ventilated patients. This algorithm made a precise real-time extubation outcome prediction for patients at different time points.
Data Sheet 7_Prediction of outpatient rehabilitation patient preferences and...
frontiersin.figshare.com
docx
Updated Jan 15, 2025
Cite
Xuehui Fan; Ruixue Ye; Yan Gao; Kaiwen Xue; Zeyu Zhang; Jing Xu; Jingpu Zhao; Jun Feng; Yulong Wang (2025). Data Sheet 7_Prediction of outpatient rehabilitation patient preferences and optimization of graded diagnosis and treatment based on XGBoost machine learning algorithm.docx [Dataset]. http://doi.org/10.3389/frai.2024.1473837.s008
Available download formats: docx
Unique identifier
https://doi.org/10.3389/frai.2024.1473837.s008
Dataset updated
Jan 15, 2025
Dataset provided by
Frontiers
Authors
Xuehui Fan; Ruixue Ye; Yan Gao; Kaiwen Xue; Zeyu Zhang; Jing Xu; Jingpu Zhao; Jun Feng; Yulong Wang
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Background: The Department of Rehabilitation Medicine is key to improving patients’ quality of life. Driven by chronic diseases and an aging population, there is a need to enhance the efficiency and resource allocation of outpatient facilities. This study aims to analyze the treatment preferences of outpatient rehabilitation patients by using data and a grading tool to establish predictive models. The goal is to improve patient visit efficiency and optimize resource allocation through these predictive models. Methods: Data were collected from 38 Chinese institutions, including 4,244 patients visiting outpatient rehabilitation clinics. Data processing was conducted using Python software. The pandas library was used for data cleaning and preprocessing, involving 68 categorical and 12 continuous variables. The steps included handling missing values, data normalization, and encoding conversion. The data were divided into 80% training and 20% test sets using the Scikit-learn library to ensure model independence and prevent overfitting. Performance comparisons among XGBoost, random forest, and logistic regression were conducted using metrics including accuracy and receiver operating characteristic (ROC) curves. The SMOTE technique from the imbalanced-learn library was used to address the sample imbalance during model training. The model was optimized using a confusion matrix and feature importance analysis, and partial dependence plots (PDP) were used to analyze the key influencing factors. Results: XGBoost achieved the highest overall accuracy of 80.21% with high precision and recall in Category 1. Random forest showed a similar overall accuracy. Logistic regression had a significantly lower accuracy, indicating difficulties with nonlinear data. The key influencing factors identified include distance to medical institutions, arrival time, length of hospital stay, and specific diseases, such as cardiovascular, pulmonary, oncological, and orthopedic conditions.
The tiered diagnosis and treatment tool effectively helped doctors assess patients’ conditions and recommend suitable medical institutions based on rehabilitation grading. Conclusion: This study confirmed that ensemble learning methods, particularly XGBoost, outperform single models in classification tasks involving complex datasets. Addressing class imbalance and enhancing feature engineering can further improve model performance. Understanding patient preferences and the factors influencing medical institution selection can guide healthcare policies to optimize resource allocation, improve service quality, and enhance patient satisfaction. Tiered diagnosis and treatment tools play a crucial role in helping doctors evaluate patient conditions and make informed recommendations for appropriate medical care.
Performance Evaluation of Three Models Before and After SMOTE Resampling.
figshare.com
xls
Updated Mar 18, 2025
Cite
Zhe Wang; Ni Jia (2025). Performance Evaluation of Three Models Before and After SMOTE Resampling. [Dataset]. http://doi.org/10.1371/journal.pone.0319232.t001
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0319232.t001
Dataset updated
Mar 18, 2025
Dataset provided by
PLOS ONE
Authors
Zhe Wang; Ni Jia
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Performance Evaluation of Three Models Before and After SMOTE Resampling.
Detailed overview of cohort characteristics for train and test cohort.
plos.figshare.com
figshare.com
xls
Updated Sep 4, 2024
Cite
Hexin Li; Negin Ashrafi; Chris Kang; Guanlan Zhao; Yubing Chen; Maryam Pishgar (2024). Detailed overview of cohort characteristics for train and test cohort. [Dataset]. http://doi.org/10.1371/journal.pone.0309383.t002
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0309383.t002
Dataset updated
Sep 4, 2024
Dataset provided by
PLOS ONE
Authors
Hexin Li; Negin Ashrafi; Chris Kang; Guanlan Zhao; Yubing Chen; Maryam Pishgar
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Values are presented as means with the standard deviations in parentheses.
Performance measure after applying SMOTE+ENN.
figshare.com
plos.figshare.com
xls
Updated May 31, 2024
Cite
Sumya Akter; Hossen A. Mustafa (2024). Performance measure after applying SMOTE+ENN. [Dataset]. http://doi.org/10.1371/journal.pone.0300670.t011
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0300670.t011
Dataset updated
May 31, 2024
Dataset provided by
PLOS ONE
Authors
Sumya Akter; Hossen A. Mustafa
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Thyroid disease classification plays a crucial role in early diagnosis and effective treatment of thyroid disorders. Machine learning (ML) techniques have demonstrated remarkable potential in this domain, offering accurate and efficient diagnostic tools. Most of the real-life datasets have imbalanced characteristics that hamper the overall performance of the classifiers. Existing data balancing techniques process the whole dataset at a time that sometimes causes overfitting and underfitting. However, the complexity of some ML models, often referred to as “black boxes,” raises concerns about their interpretability and clinical applicability. This paper presents a comprehensive study focused on the analysis and interpretability of various ML models for classifying thyroid diseases. In our work, we first applied a new data-balancing mechanism using a clustering technique and then analyzed the performance of different ML algorithms. To address the interpretability challenge, we explored techniques for model explanation and feature importance analysis using eXplainable Artificial Intelligence (XAI) tools globally as well as locally. Finally, the XAI results are validated with the domain experts. Experimental results have shown that our proposed mechanism is efficient in diagnosing thyroid disease and can explain the models effectively. The findings can contribute to bridging the gap between adopting advanced ML techniques and the clinical requirements of transparency and accountability in diagnostic decision-making.
Data set presentation.
figshare.com
plos.figshare.com
xls
Updated Sep 30, 2024
Cite
Wenguang Li; Yan Peng; Ke Peng (2024). Data set presentation. [Dataset]. http://doi.org/10.1371/journal.pone.0311222.t001
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0311222.t001
Dataset updated
Sep 30, 2024
Dataset provided by
PLOS ONE
Authors
Wenguang Li; Yan Peng; Ke Peng
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Diabetes, as an incurable lifelong chronic disease, has profound and far-reaching effects on patients. Given this, early intervention is particularly crucial, as it can not only significantly improve the prognosis of patients but also provide valuable reference information for clinical treatment. This study selected the BRFSS (Behavioral Risk Factor Surveillance System) dataset, which is publicly available on the Kaggle platform, as the research object, aiming to provide a scientific basis for the early diagnosis and treatment of diabetes through advanced machine learning techniques. Firstly, the dataset was balanced using various sampling methods; secondly, a Stacking model based on GA-XGBoost (XGBoost model optimized by genetic algorithm) was constructed for the risk prediction of diabetes; finally, the interpretability of the model was deeply analyzed using Shapley values. The results show: (1) Random oversampling, ADASYN, SMOTE, and SMOTEENN were used for data balance processing, among which SMOTEENN showed better efficiency and effect in dealing with data imbalance. (2) The GA-XGBoost model optimized the hyperparameters of the XGBoost model through a genetic algorithm to improve the model’s predictive accuracy. Combined with the better-performing LightGBM model and random forest model, a two-layer Stacking model was constructed. This model not only outperforms single machine learning models in predictive effect but also provides a new idea and method in the field of model integration. (3) Shapley value analysis identified features that have a significant impact on the prediction of diabetes, such as age and body mass index. This analysis not only enhances the transparency of the model but also provides more precise treatment decision support for doctors and patients. 
In summary, this study has not only improved the accuracy of predicting the risk of diabetes by adopting advanced machine learning techniques and model integration strategies but also provided a powerful tool for the early diagnosis and personalized treatment of diabetes.
Table_1_Development and evaluation of a model for predicting the risk of...
frontiersin.figshare.com
docx
Updated Sep 12, 2024
Cite
Jin Wang; Gan Wang; Yujie Wang; Yun Wang (2024). Table_1_Development and evaluation of a model for predicting the risk of healthcare-associated infections in patients admitted to intensive care units.DOCX [Dataset]. http://doi.org/10.3389/fpubh.2024.1444176.s003
Available download formats: docx
Unique identifier
https://doi.org/10.3389/fpubh.2024.1444176.s003
Dataset updated
Sep 12, 2024
Dataset provided by
Frontiers
Authors
Jin Wang; Gan Wang; Yujie Wang; Yun Wang
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This retrospective study used 10 machine learning algorithms to predict the risk of healthcare-associated infections (HAIs) in patients admitted to intensive care units (ICUs). A total of 2,517 patients treated in the ICU of a tertiary hospital in China from January 2019 to December 2023 were included, of whom 455 (18.1%) developed an HAI. Data on 32 potential risk factors for infection were considered, of which 18 factors that were statistically significant on single-factor analysis were used to develop a machine learning prediction model using the synthetic minority oversampling technique (SMOTE). The main HAIs were respiratory tract infections (28.7%) and ventilator-associated pneumonia (25.0%), and were predominantly caused by gram-negative bacteria (78.8%). The CatBoost model showed good predictive performance (area under the curve: 0.944, and sensitivity 0.872). The 10 most important predictors of HAIs in this model were the Penetration Aspiration Scale score, Braden score, high total bilirubin level, female, high white blood cell count, Caprini Risk Score, Nutritional Risk Screening 2002 score, low eosinophil count, medium white blood cell count, and the Glasgow Coma Scale score. The CatBoost model accurately predicted the occurrence of HAIs and could be used in clinical practice.
