41 datasets found

f
Data from: Isometric Stratified Ensembles: A Partial and Incremental...
acs.figshare.com
xlsx
Updated Jun 7, 2023
Cite
Christophe Molina; Lilia Ait-Ouarab; Hervé Minoux (2023). Isometric Stratified Ensembles: A Partial and Incremental Adaptive Applicability Domain and Consensus-Based Classification Strategy for Highly Imbalanced Data Sets with Application to Colloidal Aggregation [Dataset]. http://doi.org/10.1021/acs.jcim.2c00293.s004
Available download formats: xlsx
Unique identifier
https://doi.org/10.1021/acs.jcim.2c00293.s004
Dataset updated
Jun 7, 2023
Dataset provided by
ACS Publications
Authors
Christophe Molina; Lilia Ait-Ouarab; Hervé Minoux
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Partial and incremental stratification analysis of a quantitative structure-interference relationship (QSIR) is a novel strategy intended to categorize the classifications provided by machine learning techniques. It is based on a 2D mapping of classification statistics onto two categorical axes: the degree of consensus and the level of applicability domain. An internal cross-validation set makes it possible to determine the statistical performance of the ensemble at every 2D map stratum and hence to define isometric local performance regions, with the aim of better hit ranking and selection. During training, isometric stratified ensembles (ISE) applies a recursive decorrelated variable selection and considers the cardinal ratio of classes to balance training sets and thus avoid bias due to possible class imbalance. To exemplify the interest of this strategy, three different highly imbalanced PubChem pairs of AmpC β-lactamase and cruzain inhibition assay campaigns of colloidal aggregators, together with the complementary aggregators data set available at the AGGREGATOR ADVISOR predictor web page, were employed. Statistics obtained using this new strategy outperform those of previously published tools, with and without a classical applicability domain. ISE performance in classifying colloidal aggregators ranges from a global AUC of 0.82, when the whole test data set is considered, up to a maximum AUC of 0.88, when only its highest-confidence isometric stratum is retained.
f
DataSheet1_Comparison of Resampling Algorithms to Address Class Imbalance...
frontiersin.figshare.com
docx
Updated May 31, 2023
Cite
Daniel Lowell Weller; Tanzy M. T. Love; Martin Wiedmann (2023). DataSheet1_Comparison of Resampling Algorithms to Address Class Imbalance when Developing Machine Learning Models to Predict Foodborne Pathogen Presence in Agricultural Water.docx [Dataset]. http://doi.org/10.3389/fenvs.2021.701288.s001
Available download formats: docx
Unique identifier
https://doi.org/10.3389/fenvs.2021.701288.s001
Dataset updated
May 31, 2023
Dataset provided by
Frontiers
Authors
Daniel Lowell Weller; Tanzy M. T. Love; Martin Wiedmann
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Recent studies have shown that predictive models can supplement or provide alternatives to E. coli-testing for assessing the potential presence of food safety hazards in water used for produce production. However, these studies used balanced training data and focused on enteric pathogens. As such, research is needed to determine 1) if predictive models can be used to assess Listeria contamination of agricultural water, and 2) how resampling (to deal with imbalanced data) affects performance of these models. To address these knowledge gaps, this study developed models that predict nonpathogenic Listeria spp. (excluding L. monocytogenes) and L. monocytogenes presence in agricultural water using various combinations of learner (e.g., random forest, regression), feature type, and resampling method (none, oversampling, SMOTE). Four feature types were used in model training: microbial, physicochemical, spatial, and weather. “Full models” were trained using all four feature types, while “nested models” used between one and three types. In total, 45 full (15 learners*3 resampling approaches) and 108 nested (5 learners*9 feature sets*3 resampling approaches) models were trained per outcome. Model performance was compared against baseline models where E. coli concentration was the sole predictor. Overall, the machine learning models outperformed the baseline E. coli models, with random forests outperforming models built using other learners (e.g., rule-based learners). Resampling produced more accurate models than not resampling, with SMOTE models outperforming, on average, oversampling models. Regardless of resampling method, spatial and physicochemical water quality features drove accurate predictions for the nonpathogenic Listeria spp. and L. monocytogenes models, respectively. Overall, these findings 1) illustrate the need for alternatives to existing E. coli-based monitoring programs for assessing agricultural water for the presence of potential food safety hazards, and 2) suggest that predictive models may be one such alternative. Moreover, these findings provide a conceptual framework for how such models can be developed in the future with the ultimate aim of developing models that can be integrated into on-farm risk management programs. For example, future studies should consider using random forest learners, SMOTE resampling, and spatial features to develop models to predict the presence of foodborne pathogens, such as L. monocytogenes, in agricultural water when the training data is imbalanced.
f
S2 Dataset -
plos.figshare.com
xlsx
Updated Dec 13, 2024
Cite
JiaMing Gong; MingGang Dong (2024). S2 Dataset - [Dataset]. http://doi.org/10.1371/journal.pone.0311133.s002
Available download formats: xlsx
Unique identifier
https://doi.org/10.1371/journal.pone.0311133.s002
Dataset updated
Dec 13, 2024
Dataset provided by
PLOS ONE
Authors
JiaMing Gong; MingGang Dong
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Online imbalanced learning is an emerging topic that combines the challenges of class imbalance and concept drift. However, most current works address class imbalance and concept drift separately, and only a few have considered these issues simultaneously. To this end, this paper proposes an entropy-based dynamic ensemble classification algorithm (EDAC) to handle data streams with class imbalance and concept drift simultaneously. First, to address the problem of imbalanced learning in training data chunks arriving at different times, EDAC adopts an entropy-based balancing strategy. It divides the data chunks into multiple balanced sample pairs based on the differences in information entropy between classes in the sample data chunk. Additionally, we propose a density-based sampling method that improves accuracy by classifying minority class samples into high-quality samples and common samples via the density of similar samples. In this manner, high-quality and common samples are randomly selected for training the classifier. Finally, to solve the issue of concept drift, EDAC designs and implements an ensemble classifier that uses a self-feedback strategy to determine the initial weight of the classifier by adjusting the weight of the sub-classifier according to its performance on the arrived data chunks. The experimental results demonstrate that EDAC outperforms five state-of-the-art algorithms on four synthetic data streams and one real-world data stream.
m
Data from: Mental issues, internet addiction and quality of life predict...
data.mendeley.com
Updated Jul 12, 2024
Cite
Andras Matuz (2024). Mental issues, internet addiction and quality of life predict burnout among Hungarian teachers: a machine learning analysis [Dataset]. http://doi.org/10.17632/2yy4j7rgvg.1
Unique identifier
https://doi.org/10.17632/2yy4j7rgvg.1
Dataset updated
Jul 12, 2024
Authors
Andras Matuz
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Background: Burnout is usually defined as a state of emotional, physical, and mental exhaustion that affects people in various professions (e.g. physicians, nurses, teachers). The consequences of burnout involve decreased motivation, productivity, and overall diminished well-being. The machine learning-based prediction of burnout has therefore become the focus of recent research. In this study, the aim was to detect burnout using machine learning and to identify its most important predictors in a sample of Hungarian high-school teachers. Methods: The final sample consisted of 1,576 high-school teachers (522 male), who completed a survey including various sociodemographic and health-related questions and psychological questionnaires. Specifically, depression, insomnia, internet habits (e.g. when and why one uses the internet) and problematic internet usage were among the most important predictors tested in this study. Supervised classification algorithms were trained to detect burnout assessed by two well-known burnout questionnaires. Feature selection was conducted using recursive feature elimination. Hyperparameters were tuned via grid search with 5-fold cross-validation. Due to class imbalance, class weights (i.e. cost-sensitive learning), downsampling and a hybrid method (SMOTE-ENN) were applied in separate analyses. The final model evaluation was carried out on a previously unseen holdout test sample. Results: Burnout was detected in 19.7% of the teachers included in the final dataset. The best predictive performance on the holdout test sample was achieved by support vector machine with SMOTE-ENN (AUC = .942; balanced accuracy = .868; sensitivity = .898; specificity = .837). The best predictors of burnout were Beck’s Depression Inventory scores, Athens Insomnia Scale scores, subscales of the Problematic Internet Use Questionnaire and self-reported current health status.
Conclusions: The performances of the algorithms were comparable with previous studies; however, it is important to note that we tested our models on previously unseen holdout samples suggesting higher levels of generalizability. Another remarkable finding is that besides depression and insomnia, other variables such as problematic internet use and time spent online also turned out to be important predictors of burnout.
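As a quick arithmetic check on the metrics reported above: balanced accuracy is conventionally the mean of sensitivity and specificity. The sketch below assumes that standard definition applies here (the description does not state the formula) and recomputes the value from the reported sensitivity and specificity. The function name is illustrative, not taken from the dataset.

```python
def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """Mean of sensitivity and specificity (standard definition, assumed here)."""
    return (sensitivity + specificity) / 2.0


# Values reported in the description for the SVM with SMOTE-ENN on the holdout sample.
reported_sensitivity = 0.898
reported_specificity = 0.837

recomputed = balanced_accuracy(reported_sensitivity, reported_specificity)
print(f"{recomputed:.4f}")
```

The recomputed value, about 0.8675, is consistent with the reported balanced accuracy of .868 once rounded to three decimals.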
f
Table1_A comparative study in class imbalance mitigation when working with...
frontiersin.figshare.com
pdf
Updated Mar 26, 2024
Cite
Rawan S. Abdulsadig; Esther Rodriguez-Villegas (2024). Table1_A comparative study in class imbalance mitigation when working with physiological signals.pdf [Dataset]. http://doi.org/10.3389/fdgth.2024.1377165.s001
Available download formats: pdf
Unique identifier
https://doi.org/10.3389/fdgth.2024.1377165.s001
Dataset updated
Mar 26, 2024
Dataset provided by
Frontiers
Authors
Rawan S. Abdulsadig; Esther Rodriguez-Villegas
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Class imbalance is a common challenge that is often faced when dealing with classification tasks aiming to detect medical events that are particularly infrequent. Apnoea is an example of such events. This challenge can however be mitigated using class rebalancing algorithms. This work investigated 10 widely used data-level class imbalance mitigation methods aiming towards building a random forest (RF) model that attempts to detect apnoea events from photoplethysmography (PPG) signals acquired from the neck. Those methods are random undersampling (RandUS), random oversampling (RandOS), condensed nearest-neighbors (CNNUS), edited nearest-neighbors (ENNUS), Tomek’s links (TomekUS), synthetic minority oversampling technique (SMOTE), Borderline-SMOTE (BLSMOTE), adaptive synthetic oversampling (ADASYN), SMOTE with TomekUS (SMOTETomek) and SMOTE with ENNUS (SMOTEENN). Feature-space transformation using PCA and KernelPCA was also examined as a potential way of providing better representations of the data for the class rebalancing methods to operate. This work showed that RandUS is the best option for improving the sensitivity score (up to 11%). However, it could hinder the overall accuracy due to the reduced amount of training data. On the other hand, augmenting the data with new artificial data points was shown to be a non-trivial task that needs further development, especially in the presence of subject dependencies, as was the case in this work.
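To make the trade-off described above concrete, here is a minimal sketch of random undersampling in plain Python. It is not the authors' implementation: the function name, the toy data, and the choice to shrink every class to the size of the smallest one are assumptions made for illustration.

```python
import random


def random_undersample(samples, labels, seed=0):
    """Randomly drop samples so every class keeps as many items as the smallest class."""
    rng = random.Random(seed)

    # Group samples by class label.
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    # Size of the smallest (minority) class.
    n_min = min(len(group) for group in by_class.values())

    kept_samples, kept_labels = [], []
    for label, group in by_class.items():
        chosen = rng.sample(group, n_min)  # sampling without replacement
        kept_samples.extend(chosen)
        kept_labels.extend([label] * n_min)
    return kept_samples, kept_labels


# Hypothetical toy data: 2 event samples (label 1) and 8 non-event samples (label 0).
samples = list(range(10))
labels = [1, 1] + [0] * 8

xs, ys = random_undersample(samples, labels)
print(len(xs), ys.count(0), ys.count(1))
```

In this toy case the balanced set keeps only 4 of the 10 original samples, which illustrates why undersampling can reduce the amount of training data available to the model.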
f
Values of the evaluation measures for the reference model derived from the...
plos.figshare.com
xls
Updated Apr 10, 2025
Cite
Barbara Więckowska; Katarzyna B. Kubiak; Przemysław Guzik (2025). Values of the evaluation measures for the reference model derived from the training and test datasets across imbalance ranging from 1% to 99% of the event class. [Dataset]. http://doi.org/10.1371/journal.pone.0321661.t002
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0321661.t002
Dataset updated
Apr 10, 2025
Dataset provided by
PLOS ONE
Authors
Barbara Więckowska; Katarzyna B. Kubiak; Przemysław Guzik
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Values of the evaluation measures for the reference model derived from the training and test datasets across imbalance ranging from 1% to 99% of the event class.
f
p-values by Wilcoson rank sum test comparing MW-RDS with feature selection...
figshare.com
plos.figshare.com
xls
Updated Jun 10, 2025
Cite
Sheema Gul; Dost Muhammad Khan; Saeed Aldahmani; Zardad Khan (2025). p-values by Wilcoson rank sum test comparing MW-RDS with feature selection methods across 9 datasets in terms classification accuracy. Statistically significance p-value (*p< 0.05, **p< ***p [Dataset]. http://doi.org/10.1371/journal.pone.0325147.t011
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0325147.t011
Dataset updated
Jun 10, 2025
Dataset provided by
PLOS ONE
Authors
Sheema Gul; Dost Muhammad Khan; Saeed Aldahmani; Zardad Khan
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
p-values by Wilcoson rank sum test comparing MW-RDS with feature selection methods across 9 datasets in terms classification accuracy. Statistically significance p-value (*p< 0.05, **p< ***p
f
Level 2: Values of the class-specific net BA-RB-I coefficients for models...
figshare.com
xlsx
Updated Apr 10, 2025
Cite
Barbara Więckowska; Katarzyna B. Kubiak; Przemysław Guzik (2025). Level 2: Values of the class-specific net BA-RB-I coefficients for models derived from the test dataset across imbalance ranging from 1% to 99% of the event class. [Dataset]. http://doi.org/10.1371/journal.pone.0321661.s002
Available download formats: xlsx
Unique identifier
https://doi.org/10.1371/journal.pone.0321661.s002
Dataset updated
Apr 10, 2025
Dataset provided by
PLOS ONE
Authors
Barbara Więckowska; Katarzyna B. Kubiak; Przemysław Guzik
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Level 2: Values of the class-specific net BA-RB-I coefficients for models derived from the test dataset across imbalance ranging from 1% to 99% of the event class.
f
The confusion matrix shows a cross-tabulation of the actual class with the...
plos.figshare.com
xls
Updated Apr 10, 2025
Cite
Barbara Więckowska; Katarzyna B. Kubiak; Przemysław Guzik (2025). The confusion matrix shows a cross-tabulation of the actual class with the model’s predicted class (based on the conventional probability threshold of 0.5). [Dataset]. http://doi.org/10.1371/journal.pone.0321661.t001
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0321661.t001
Dataset updated
Apr 10, 2025
Dataset provided by
PLOS ONE
Authors
Barbara Więckowska; Katarzyna B. Kubiak; Przemysław Guzik
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The confusion matrix shows a cross-tabulation of the actual class with the model’s predicted class (based on the conventional probability threshold of 0.5).
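For readers unfamiliar with the construction, the following sketch shows how a binary confusion matrix can be tabulated from predicted probabilities at the 0.5 threshold. The labels and probabilities are made-up illustrative values, not data from this table. Treating a probability of exactly 0.5 as a positive prediction is an assumption, since the description does not say how ties are handled.

```python
def confusion_matrix_at_threshold(y_true, y_prob, threshold=0.5):
    """Return (tp, fp, fn, tn) for binary labels (1 = event) and predicted probabilities."""
    tp = fp = fn = tn = 0
    for actual, prob in zip(y_true, y_prob):
        # Assumption: a probability equal to the threshold counts as a positive prediction.
        predicted = 1 if prob >= threshold else 0
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Hypothetical example: 3 events and 5 non-events.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
y_prob = [0.9, 0.2, 0.4, 0.6, 0.1, 0.7, 0.3, 0.05]

tp, fp, fn, tn = confusion_matrix_at_threshold(y_true, y_prob)
print("predicted 1 | predicted 0")
print(f"actual 1: {tp} | {fn}")
print(f"actual 0: {fp} | {tn}")
```

With these toy values the counts are 2 true positives, 1 false negative, 1 false positive and 4 true negatives.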
f
Level 1: Values of the subclass-specific BA-RB-I coefficients for new models...
plos.figshare.com
xlsx
Updated Apr 10, 2025
Cite
Barbara Więckowska; Katarzyna B. Kubiak; Przemysław Guzik (2025). Level 1: Values of the subclass-specific BA-RB-I coefficients for new models derived from the training and test datasets across imbalance ranging from 1% to 99% of the event class. [Dataset]. http://doi.org/10.1371/journal.pone.0321661.s001
Available download formats: xlsx
Unique identifier
https://doi.org/10.1371/journal.pone.0321661.s001
Dataset updated
Apr 10, 2025
Dataset provided by
PLOS ONE
Authors
Barbara Więckowska; Katarzyna B. Kubiak; Przemysław Guzik
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Level 1: Values of the subclass-specific BA-RB-I coefficients for new models derived from the training and test datasets across imbalance ranging from 1% to 99% of the event class.
f
A comparison of methods for variable selection.
figshare.com
plos.figshare.com
xlsx
Updated Apr 10, 2025
Cite
Barbara Więckowska; Katarzyna B. Kubiak; Przemysław Guzik (2025). A comparison of methods for variable selection. [Dataset]. http://doi.org/10.1371/journal.pone.0321661.s004
Available download formats: xlsx
Unique identifier
https://doi.org/10.1371/journal.pone.0321661.s004
Dataset updated
Apr 10, 2025
Dataset provided by
PLOS ONE
Authors
Barbara Więckowska; Katarzyna B. Kubiak; Przemysław Guzik
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Criteria such as interpretability, effectiveness in imbalanced datasets, computational complexity, dependence on classification threshold, dedicated applicability and origin and graphical representation were used for the comparison. (XLSX)
f
Performance of trained models.
plos.figshare.com
xls
Updated Jun 20, 2025
Cite
Shimels Derso Kebede; Agmasie Damtew Walle; Daniel Niguse Mamo; Ermias Bekele Enyew; Jibril Bashir Adem; Meron Asmamaw Alemayehu (2025). Performance of trained models. [Dataset]. http://doi.org/10.1371/journal.pgph.0004787.t002
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pgph.0004787.t002
Dataset updated
Jun 20, 2025
Dataset provided by
PLOS Global Public Health
Authors
Shimels Derso Kebede; Agmasie Damtew Walle; Daniel Niguse Mamo; Ermias Bekele Enyew; Jibril Bashir Adem; Meron Asmamaw Alemayehu
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Ensuring complete utilization of the maternal continuum of care is essential for reducing maternal and neonatal mortality. In Ethiopia, significant gaps remain in maternal healthcare utilization, particularly among women who do not engage in any stage of the maternal care continuum. This study aims to identify the determinants of zero utilization in the maternal continuum of care among Ethiopian women using machine learning techniques, with insights provided by SHAP (SHapley Additive exPlanations) analysis. This study analyzed data from the 2019 Ethiopian Mini Demographic and Health Survey, using a cross-sectional design. The dataset was preprocessed and modeled using various machine learning algorithms through the PyCaret library, with LightGBM emerging as the best model after various models were trained and evaluated based on classification performance metrics. The Synthetic Minority Over-sampling Technique (SMOTE) was applied to address class imbalance. SHAP analysis was used to interpret model predictions and identify key predictors. LightGBM demonstrated robust performance with an accuracy of 84.47%, an AUC of 0.93, a recall of 0.80, a precision of 0.95, and an F1-score of 0.87 on test data. SHAP analysis revealed that residence in rural areas, the Somali region, being a daughter in the household, and Protestant religion were positively associated with zero maternal care utilization. Conversely, secondary or higher education, being married, higher wealth status, and having multiple children were associated with lower likelihoods of zero care utilization. The findings highlight the critical role of socioeconomic, demographic, and regional factors in maternal care utilization in Ethiopia. Targeted interventions, particularly in rural and underserved areas, are necessary to reduce barriers and promote equitable access to maternal healthcare services across Ethiopia.
These insights can inform policies aimed at expanding female education, strengthening community-based maternal health programs, and prioritizing resource allocation to regions such as Somali where zero utilization is highest.
f
Sample size (n) of the full dataset generated under each class-imbalance...
plos.figshare.com
xls
Updated Jun 21, 2023
Cite
Khurram Nadeem; Mehdi-Abderrahman Jabri (2023). Sample size (n) of the full dataset generated under each class-imbalance ratio (IR) to achieve a target balanced sample size (nb). [Dataset]. http://doi.org/10.1371/journal.pone.0280258.t002
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0280258.t002
Dataset updated
Jun 21, 2023
Dataset provided by
PLOS ONE
Authors
Khurram Nadeem; Mehdi-Abderrahman Jabri
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Sample size (n) of the full dataset generated under each class-imbalance ratio (IR) to achieve a target balanced sample size (nb).
f
Level 3: Values of the weighted overall BA-RB-I coefficients and traditional...
plos.figshare.com
xls
Updated Apr 10, 2025
Cite
Barbara Więckowska; Katarzyna B. Kubiak; Przemysław Guzik (2025). Level 3: Values of the weighted overall BA-RB-I coefficients and traditional performance measures for models derived from the training dataset across imbalance ranging from 1% to 99% of the event class. [Dataset]. http://doi.org/10.1371/journal.pone.0321661.t004
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0321661.t004
Dataset updated
Apr 10, 2025
Dataset provided by
PLOS ONE
Authors
Barbara Więckowska; Katarzyna B. Kubiak; Przemysław Guzik
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Level 3: Values of the weighted overall BA-RB-I coefficients and traditional performance measures for models derived from the training dataset across imbalance ranging from 1% to 99% of the event class.
f
Experimental data sets.
plos.figshare.com
xls
Updated Jan 26, 2024
Cite
Yansong Liu; Shuang Wang; He Sui; Li Zhu (2024). Experimental data sets. [Dataset]. http://doi.org/10.1371/journal.pone.0292140.t001
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0292140.t001
Dataset updated
Jan 26, 2024
Dataset provided by
PLOS ONE
Authors
Yansong Liu; Shuang Wang; He Sui; Li Zhu
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A challenge in many real-world data streams is class imbalance combined with concept drift, and handling it is one of the most critical tasks in anomaly detection. Learning from nonstationary data streams for anomaly detection has been well studied in recent years. However, most studies assume that the classes in the data stream are relatively balanced. Only a few approaches tackle the joint issue of imbalance and concept drift. To overcome this joint issue, we propose an ensemble learning method with generative adversarial network-based sampling and consistency check (EGSCC) in this paper. First, we design a comprehensive anomaly detection framework that includes an oversampling module based on a generative adversarial network (GAN), an ensemble classifier, and a consistency check module. Next, we introduce double encoders into the GAN to better capture the distribution characteristics of imbalanced data for oversampling. Then, we apply stacking ensemble learning to deal with concept drift. Four base classifiers (SVM, KNN, DT and RF) are used in the first layer, and LR is used as the meta-classifier in the second layer. We then perform a consistency check between the incremental instance and the check set to determine, by statistical learning rather than a threshold-based method, whether the instance is anomalous, and the validation set is dynamically updated according to the consistency check result. Finally, three artificial data sets obtained from the Massive Online Analysis platform and two real data sets are used to verify the performance of the proposed method from four aspects: detection performance, parameter sensitivity, algorithm cost and anti-noise ability. Experimental results show that the proposed method has significant advantages in anomaly detection of imbalanced data streams with concept drift.
f
Comparison of results before and after RFE.
plos.figshare.com
xls
Updated Dec 31, 2024
Cite
Li Wang; Yu zhang; Feng li; Caiyun Li; Hongzeng Xu (2024). Comparison of results before and after RFE. [Dataset]. http://doi.org/10.1371/journal.pone.0312448.t006
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0312448.t006
Dataset updated
Dec 31, 2024
Dataset provided by
PLOS ONE
Authors
Li Wang; Yu zhang; Feng li; Caiyun Li; Hongzeng Xu
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Background: Acute myocardial infarction (AMI) remains a leading cause of hospitalization and death in China. Accurate prediction of inpatient mortality is crucial for clinical decision-making for non-ST-segment elevation myocardial infarction (NSTEMI) patients. Methods: A total of 3061 patients diagnosed with NSTEMI between January 1, 2017 and December 31, 2022 were enrolled in this study. A new method based on a Stacking ensemble model is proposed to predict the in-hospital mortality risk of NSTEMI using clinical data. This method mainly consists of three parts. Firstly, an oversampling technique was used to alleviate the class imbalance problem. Secondly, Recursive Feature Elimination (RFE) was used for feature selection. Finally, a unique double-layer stacking model was designed to improve the performance of the algorithm. Seven classical artificial intelligence methods, Logistic Regression (LR), Decision Tree (DT), Support Vector Machine (SVM), Random Forest (RF), Adaptive Boosting (ADB), Extra Tree (ET), and Gradient Boosting Decision Tree (GBDT), were selected as candidate base models for the first layer, and extreme gradient boosting (XGBoost) was selected as the meta-model for the second layer. Results: Patients were divided into a surviving group and a death group; a total of 57 clinical features differed significantly between the two groups and were included in the subsequent model. The Area Under Curve (AUC) of the proposed Stacking model is 0.987, which is higher than that of LR (0.934), DT (0.946), SVM (0.942), RF (0.948), ADB (0.949), ET (0.938) and GBDT (0.920). The proposed Stacking model also performs better than each single model in terms of Accuracy, Precision, Recall and F1. Conclusions: The Stacking model proposed in this paper can integrate the advantages of the LR, DT, SVM, RF, ADB, ET and GBDT models to achieve better prediction performance. This model can provide valuable insights for physicians to identify high-risk patients more precisely and in a timely manner, thereby maximizing the potential for early clinical interventions to reduce the mortality rate.
f
Using the ID3 dataset, results of the 3 classifiers for the given feature...
plos.figshare.com
xls
Updated Jun 10, 2025
Cite
Sheema Gul; Dost Muhammad Khan; Saeed Aldahmani; Zardad Khan (2025). Using the ID3 dataset, results of the 3 classifiers for the given feature selection methods. [Dataset]. http://doi.org/10.1371/journal.pone.0325147.t004
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0325147.t004
Dataset updated
Jun 10, 2025
Dataset provided by
PLOS ONE
Authors
Sheema Gul; Dost Muhammad Khan; Saeed Aldahmani; Zardad Khan
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Using the ID3 dataset, results of the 3 classifiers for the given feature selection methods.
f
Accuracy comparison with existing approaches for Binary Classification with...
plos.figshare.com
xls
Updated May 23, 2024
Cite
Arshad Hashmi; Omar M. Barukab; Ahmad Hamza Osman (2024). Accuracy comparison with existing approaches for Binary Classification with state of art on UNSW-NB15 and NSL-KDD. [Dataset]. http://doi.org/10.1371/journal.pone.0302294.t008
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0302294.t008
Dataset updated
May 23, 2024
Dataset provided by
PLOS ONE
Authors
Arshad Hashmi; Omar M. Barukab; Ahmad Hamza Osman
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Accuracy comparison with existing approaches for Binary Classification with state of art on UNSW-NB15 and NSL-KDD.
f
Classification performance (accuracy, sensitivity, specificity, F1-score,...
plos.figshare.com
figshare.com
xls
Updated Jun 10, 2025
Cite
Sheema Gul; Dost Muhammad Khan; Saeed Aldahmani; Zardad Khan (2025). Classification performance (accuracy, sensitivity, specificity, F1-score, and precision) based on 50 selected features, reported as over 500 runs. [Dataset]. http://doi.org/10.1371/journal.pone.0325147.t013
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0325147.t013
Dataset updated
Jun 10, 2025
Dataset provided by
PLOS ONE
Authors
Sheema Gul; Dost Muhammad Khan; Saeed Aldahmani; Zardad Khan
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Classification performance (accuracy, sensitivity, specificity, F1-score, and precision) based on 50 selected features, reported as over 500 runs.
Summary of the gene expression datasets. Number of samples, number of...
plos.figshare.com
xls
Updated Jun 10, 2025
Sheema Gul; Dost Muhammad Khan; Saeed Aldahmani; Zardad Khan (2025). Summary of the gene expression datasets. Number of samples, number of features, and class-wise frequency distribution are shown against each dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0325147.t001
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0325147.t001
Dataset updated
Jun 10, 2025
Dataset provided by
PLOS ONE
Authors
Sheema Gul; Dost Muhammad Khan; Saeed Aldahmani; Zardad Khan
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Summary of the gene expression datasets. Number of samples, number of features, and class-wise frequency distribution are shown against each dataset.

