Attribution 4.0 (CC BY 4.0) — https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In recent years, the challenge of imbalanced data has become increasingly prominent in machine learning, affecting the performance of classification algorithms. This study proposes a novel data-level oversampling method called Cluster-Based Reduced Noise SMOTE (CRN-SMOTE) to address this issue. CRN-SMOTE combines SMOTE for oversampling minority classes with a novel cluster-based noise reduction technique. In this cluster-based noise reduction approach, it is crucial that samples from each category form one or two clusters, a property that conventional noise reduction methods do not guarantee. The proposed method is evaluated on four imbalanced datasets (ILPD, QSAR, Blood, and Maternal Health Risk) using five metrics: Cohen’s kappa, Matthews correlation coefficient (MCC), F1-score, precision, and recall. Results demonstrate that CRN-SMOTE consistently outperformed the state-of-the-art Reduced Noise SMOTE (RN-SMOTE), SMOTE-Tomek Link, and SMOTE-ENN methods across all datasets, with particularly notable improvements on the QSAR and Maternal Health Risk datasets, indicating its effectiveness in enhancing imbalanced classification performance. Overall, the experimental findings indicate that CRN-SMOTE outperformed RN-SMOTE in 100% of the cases, achieving average improvements of 6.6% in kappa, 4.01% in MCC, 1.87% in F1-score, 1.7% in precision, and 2.05% in recall, with SMOTE’s number of nearest neighbors set to 5.
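The abstract above does not include an implementation of CRN-SMOTE, but the vanilla SMOTE step it builds on (interpolating between a minority sample and one of its k = 5 nearest minority-class neighbours) can be sketched in a few lines of NumPy. All names here are illustrative, not from the paper, and the clustering-based noise reduction step is deliberately omitted:

```python
import numpy as np

def smote_oversample(X_min, n_synthetic, k=5, rng=None):
    """Generate synthetic minority samples by interpolating between each
    sample and one of its k nearest minority-class neighbours (vanilla SMOTE)."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)               # a sample is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]         # k nearest neighbours per sample
    base = rng.integers(0, n, size=n_synthetic)              # anchor samples
    neigh = nn[base, rng.integers(0, k, size=n_synthetic)]   # one of their k neighbours
    gap = rng.random((n_synthetic, 1))        # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[neigh] - X_min[base])

# Toy minority class: 20 samples in 2-D.
X_min = np.random.default_rng(0).normal(size=(20, 2))
X_new = smote_oversample(X_min, n_synthetic=30, k=5, rng=1)
```

Because every synthetic point is a convex combination of two real minority samples, the generated data never leaves the minority class's bounding box.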
Summary table: Oversampling techniques using SMOTE, ADASYN, and weighted rare classes.
Classification results of classifiers using TF-IDF with SMOTE.
Clinical data analysis and forecasting have made substantial contributions to disease control, prevention, and detection. However, such data usually suffer from highly imbalanced class distributions. In this paper, we aim to formulate effective methods to rebalance binary imbalanced datasets in which positive samples constitute only a small minority. We investigate two meta-heuristic algorithms, particle swarm optimization and the bat algorithm, and apply them to strengthen the effects of the synthetic minority over-sampling technique (SMOTE) when pre-processing the datasets. One approach processes the full dataset as a whole; the other splits the dataset and adaptively processes it one segment at a time. The experimental results reported in this paper reveal that the performance improvements obtained by the former approach do not scale to larger datasets. The latter methods, which we call Adaptive Swarm Balancing Algorithms, yield significant efficiency and effectiveness improvements on large datasets where the first approach fails, and they are more consistent with the characteristics of typical large imbalanced medical datasets. We further use the meta-heuristic algorithms to optimize two key parameters of SMOTE. The proposed methods lead to more credible classifier performance and shorter run times than a brute-force search.
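The abstract above couples SMOTE with particle swarm optimization to tune SMOTE's two key parameters (typically the number of nearest neighbours and the amount of oversampling). As a rough, self-contained sketch, here is a minimal PSO loop over a box-bounded 2-D search space; the quadratic toy objective stands in for a cross-validated classifier score and is purely an assumption for illustration, as are all names:

```python
import numpy as np

def pso(objective, bounds, n_particles=20, iters=50, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal particle swarm optimizer (minimization) over a box-bounded
    space, e.g. (k_neighbours, oversampling %) for SMOTE."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds).T                 # bounds: [(lo1, hi1), (lo2, hi2)]
    pos = rng.uniform(lo, hi, size=(n_particles, len(lo)))
    vel = np.zeros_like(pos)
    pbest = pos.copy()                          # per-particle best positions
    pbest_val = np.array([objective(p) for p in pos])
    gbest = pbest[pbest_val.argmin()].copy()    # swarm-wide best position
    for _ in range(iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lo, hi)        # keep particles inside the box
        vals = np.array([objective(p) for p in pos])
        better = vals < pbest_val
        pbest[better], pbest_val[better] = pos[better], vals[better]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest

# Toy stand-in objective: pretend the classifier scores best near
# k = 5 neighbours and 200% oversampling.
best = pso(lambda p: (p[0] - 5) ** 2 + (p[1] - 200) ** 2,
           bounds=[(1, 15), (50, 400)])
```

In the paper's setting the objective would instead retrain a classifier on SMOTE-resampled data for each candidate parameter pair, which is exactly why a meta-heuristic beats brute-force enumeration in run time.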
A comparison of the RN-SMOTE, SMOTE-Tomek Link, SMOTE-ENN, and the proposed CRN-SMOTE methods on the Blood and Health-risk datasets is presented, based on various classification metrics using the Random Forest classifier.
Recent studies have shown that predictive models can supplement or provide alternatives to E. coli testing for assessing the potential presence of food safety hazards in water used for produce production. However, these studies used balanced training data and focused on enteric pathogens. As such, research is needed to determine 1) whether predictive models can be used to assess Listeria contamination of agricultural water, and 2) how resampling (to deal with imbalanced data) affects the performance of these models. To address these knowledge gaps, this study developed models that predict nonpathogenic Listeria spp. (excluding L. monocytogenes) and L. monocytogenes presence in agricultural water using various combinations of learner (e.g., random forest, regression), feature type, and resampling method (none, oversampling, SMOTE). Four feature types were used in model training: microbial, physicochemical, spatial, and weather. “Full models” were trained using all four feature types, while “nested models” used between one and three types. In total, 45 full (15 learners * 3 resampling approaches) and 108 nested (5 learners * 9 feature sets * 3 resampling approaches) models were trained per outcome. Model performance was compared against baseline models in which E. coli concentration was the sole predictor. Overall, the machine learning models outperformed the baseline E. coli models, with random forests outperforming models built using other learners (e.g., rule-based learners). Resampling produced more accurate models than not resampling, with SMOTE models outperforming, on average, oversampling models. Regardless of resampling method, spatial and physicochemical water quality features drove accurate predictions for the nonpathogenic Listeria spp. and L. monocytogenes models, respectively. Overall, these findings 1) illustrate the need for alternatives to existing E. coli-based monitoring programs for assessing agricultural water for the presence of potential food safety hazards, and 2) suggest that predictive models may be one such alternative. Moreover, these findings provide a conceptual framework for how such models can be developed in the future, with the ultimate aim of developing models that can be integrated into on-farm risk management programs. For example, future studies should consider using random forest learners, SMOTE resampling, and spatial features to develop models to predict the presence of foodborne pathogens, such as L. monocytogenes, in agricultural water when the training data are imbalanced.
Top 10 performing oversamplers for DTS2 versus baseline (no oversampling and SMOTE) averaged across four classifiers.
Class imbalance is a common challenge in classification tasks that aim to detect medical events that are particularly infrequent; apnoea is one example. This challenge can, however, be mitigated using class rebalancing algorithms. This work investigated 10 widely used data-level class imbalance mitigation methods with the aim of building a random forest (RF) model that detects apnoea events from photoplethysmography (PPG) signals acquired from the neck. Those methods are random undersampling (RandUS), random oversampling (RandOS), condensed nearest-neighbors (CNNUS), edited nearest-neighbors (ENNUS), Tomek’s links (TomekUS), the synthetic minority oversampling technique (SMOTE), Borderline-SMOTE (BLSMOTE), adaptive synthetic oversampling (ADASYN), SMOTE with TomekUS (SMOTETomek), and SMOTE with ENNUS (SMOTEENN). Feature-space transformation using PCA and KernelPCA was also examined as a potential way of providing better representations of the data for the class rebalancing methods to operate on. This work showed that RandUS is the best option for improving the sensitivity score (by up to 11%); however, it can hinder overall accuracy due to the reduced amount of training data. On the other hand, augmenting the data with new artificial data points was shown to be a non-trivial task that needs further development, especially in the presence of subject dependencies, as was the case in this work.
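Of the ten methods listed, the two random baselines are simple enough to sketch directly. A minimal NumPy illustration of RandUS (drop majority samples) and RandOS (duplicate minority samples) on a toy 90/10 label vector; variable names are mine, not from the study:

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 90 + [1] * 10)               # 90 majority, 10 minority labels
maj, mino = np.where(y == 0)[0], np.where(y == 1)[0]

# Random undersampling (RandUS): drop majority samples down to the minority size.
keep = rng.choice(maj, size=len(mino), replace=False)
y_under = y[np.concatenate([keep, mino])]

# Random oversampling (RandOS): duplicate minority samples up to the majority size.
extra = rng.choice(mino, size=len(maj) - len(mino), replace=True)
y_over = y[np.concatenate([maj, mino, extra])]
```

In practice the same index arrays would also be applied to the feature matrix, so features and labels stay aligned; the sensitivity/accuracy trade-off the abstract describes follows directly from RandUS discarding most of the majority-class training data.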
The increase in the number of new chemicals synthesized in recent decades has resulted in constant growth in the development and application of computational models for predicting the activity and safety profiles of chemicals. Most of the time, such computational models and their applications must deal with imbalanced chemical data. It is indeed a challenge to construct a classifier using an imbalanced data set. In this study, we analyzed and validated the importance of different sampling methods over the non-sampling approach for achieving well-balanced sensitivity and specificity in a machine learning model trained on imbalanced chemical data. Additionally, this study achieved an accuracy of 93.00%, an AUC of 0.94, an F1 measure of 0.90, a sensitivity of 96.00%, and a specificity of 91.00% using SMOTE sampling and a Random Forest classifier for the prediction of drug-induced liver injury (DILI). Our results suggest that, irrespective of the data set used, sampling methods can have a major influence on reducing the gap between the sensitivity and specificity of a model. This study demonstrates the efficacy of different sampling methods for the class imbalance problem using binary chemical data sets.
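Since the study's central claim concerns closing the gap between sensitivity and specificity, the two metrics are worth stating precisely: sensitivity is the true-positive rate TP / (TP + FN) and specificity is the true-negative rate TN / (TN + FP). A small helper, with hypothetical confusion-matrix counts chosen only to mirror the reported 96%/91% figures:

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical counts loosely in line with the reported figures above.
sens, spec = sensitivity_specificity(tp=96, fn=4, tn=91, fp=9)
# sens = 0.96, spec = 0.91
```

On imbalanced data a classifier can score high accuracy while one of these two rates collapses, which is exactly the gap that resampling methods such as SMOTE aim to close.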
Orphan genes are associated with regulatory patterns, but experimental methods for identifying orphan genes are both time-consuming and expensive. Designing an accurate and robust classification model to detect orphan and non-orphan genes in datasets with unbalanced distributions poses a particular challenge. The synthetic minority over-sampling technique (SMOTE) is applied in a preliminary step to deal with unbalanced gene datasets. To identify orphan genes in balanced and unbalanced Arabidopsis thaliana gene datasets, SMOTE was then combined with traditional and advanced ensemble classification algorithms, namely Support Vector Machine, Random Forest (RF), AdaBoost (adaptive boosting), GBDT (gradient boosting decision tree), and XGBoost (extreme gradient boosting). After comparing the performance of these ensemble models, SMOTE with XGBoost achieved an F1 score of 0.94 on the balanced A. thaliana gene datasets, but a lower score on the unbalanced datasets. The proposed ensemble method combines different data balancing algorithms, including Borderline SMOTE (BSMOTE), Adaptive Synthetic Sampling (ADASYN), SMOTE-Tomek, and SMOTE-ENN, with the XGBoost model separately. The SMOTE-ENN-XGBoost model, which combines over-sampling and under-sampling algorithms with XGBoost, achieved higher predictive accuracy than the other balancing algorithms combined with XGBoost. Thus, SMOTE-ENN-XGBoost provides a theoretical basis for developing evaluation criteria for identifying orphan genes in unbalanced biological datasets.
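The ENN half of SMOTE-ENN (the under-sampling step in the best-performing combination above) removes every sample whose label disagrees with the majority label of its k nearest neighbours, cleaning noisy points near the class boundary after SMOTE has oversampled. A minimal NumPy sketch on synthetic 2-D data; the data and all names are illustrative, not the gene datasets from the study:

```python
import numpy as np

def enn_clean(X, y, k=3):
    """Edited nearest-neighbours: drop every sample whose binary label disagrees
    with the majority label of its k nearest neighbours (the cleaning half of
    SMOTE-ENN)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # exclude self from neighbours
    nn = np.argsort(d, axis=1)[:, :k]           # k nearest neighbours per sample
    neigh_labels = y[nn]                        # (n, k) neighbour labels
    majority = (neigh_labels.mean(axis=1) >= 0.5).astype(int)  # majority vote
    keep = majority == y                        # keep only label-consistent samples
    return X[keep], y[keep]

# Two overlapping Gaussian clusters as a stand-in for two classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(3, 1, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
X_clean, y_clean = enn_clean(X, y, k=3)
```

With k = 3 and binary labels, ties cannot occur; points deep inside either cluster survive, while points surrounded by the other class are discarded.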
Classification results of classifier models using TF without SMOTE.
The diagnosis of human knee abnormalities from surface electromyography (sEMG) signals obtained from lower limb muscles with machine learning is a major challenge, owing to the noisy nature of the sEMG signal and the imbalance between data from healthy and knee-abnormal subjects. To address this challenge, a combination of wavelet decomposition (WD), ensemble empirical mode decomposition (EEMD), and the synthetic minority oversampling technique (S-WD-EEMD) is proposed. In this study, the hybrid WD-EEMD is used to minimize the noise introduced into the sEMG signal during collection, while SMOTE is used to balance the data by increasing the number of minority-class samples during the training of machine learning techniques. The findings indicate that the hybrid WD-EEMD with SMOTE oversampling enhances the efficacy of the examined classifiers on the imbalanced sEMG data. The F-score of the Extra Trees classifier with WD-EEMD signal processing and SMOTE oversampling is 98.4%, whereas without SMOTE oversampling it is 95.1%.
Performance comparisons between SOS and ROS, SMOTE, and ADASYN for ATP168 and ATP227 over five-fold cross-validation under MaxMCC Evaluation.
The average values of evaluation metrics on the ILPD, QSAR, Blood, and Health-risk imbalanced datasets using the RF classifier and 10-fold cross-validation.
Protein-DNA interactions play a crucial role in cellular biology and are essential for maintaining life processes and regulating cellular functions. We propose a method called iProtDNA-SMOTE, which utilizes non-equilibrium graph neural networks along with pre-trained protein language models to predict DNA-binding residues. This approach effectively addresses the class imbalance issue in predicting protein-DNA binding sites by leveraging unbalanced graph data, thus enhancing the model’s generalization and specificity. We trained the model on two datasets, TR646 and TR573, and conducted a series of experiments to evaluate its performance. The model achieved AUC values of 0.850, 0.896, and 0.858 on the independent test datasets TE46, TE129, and TE181, respectively. These results indicate that iProtDNA-SMOTE outperforms existing methods in terms of accuracy and generalization for predicting DNA-binding sites, offering reliable and effective predictions that minimize errors. The model has been thoroughly validated for its ability to predict protein-DNA binding sites with high reliability and precision. For the convenience of the scientific community, the benchmark datasets and code are publicly available at https://github.com/primrosehry/iProtDNA-SMOTE.
The Synthetic Minority Over-sampling Technique (SMOTE) is a machine learning approach to address class imbalance in datasets, and it is beneficial for identifying antimicrobial resistance (AMR) patterns. In AMR studies, datasets often contain more susceptible isolates than resistant ones, leading to biased model performance. SMOTE overcomes this issue by generating synthetic samples of the minority class (resistant isolates) through interpolation rather than simple duplication, thereby improving model generalization.

When applied to AMR prediction, SMOTE enhances the ability of classification models to accurately identify resistant Escherichia coli strains by balancing the dataset, ensuring that machine learning algorithms do not overlook rare resistance patterns. It is commonly used with classifiers such as decision trees, support vector machines (SVM), and deep learning models to improve predictive accuracy. By mitigating class imbalance, SMOTE enables robust AMR detection, aiding in the early identification of drug-resistant bacteria and informing antibiotic stewardship efforts.

Supervised machine learning is widely used in Escherichia coli genomic analysis to predict antimicrobial resistance, virulence factors, and strain classification. By training models on labeled genomic data (e.g., the presence or absence of resistance genes, SNP profiles, or MLST types), these classifiers help identify patterns and make accurate predictions. Ten supervised machine learning classifiers for E. coli genome analysis:

Logistic regression (LR): a simple yet effective statistical model for binary classification, such as predicting antibiotic resistance or susceptibility in E. coli.
Linear support vector machine (Linear SVM): finds the optimal hyperplane to separate E. coli strains based on genomic features such as gene presence or sequence variations.
Radial basis function kernel support vector machine (RBF-SVM): a more flexible version of SVM that uses a non-linear kernel to capture complex relationships in genomic data, improving classification accuracy.
Extra trees classifier: a tree-based ensemble method that enhances classification by randomly selecting features and thresholds, improving robustness in E. coli strain differentiation.
Random forest (RF): an ensemble learning method that constructs multiple decision trees, reducing overfitting and improving prediction accuracy for resistance genes and virulence factors.
AdaBoost: a boosting algorithm that combines weak classifiers iteratively, refining predictions and improving the identification of antimicrobial resistance patterns.
XGBoost: an optimized gradient boosting algorithm that efficiently handles large genomic datasets, commonly used for high-accuracy predictions in E. coli classification.
Naïve Bayes (NB): a probabilistic classifier based on Bayes' theorem, suitable for predicting resistance phenotypes from genomic features.
Linear discriminant analysis (LDA): a statistical approach that maximizes class separability, helping distinguish between resistant and susceptible E. coli strains.
Quadratic discriminant analysis (QDA): a variation of LDA that allows for non-linear decision boundaries, improving classification in datasets with complex genomic structures.

When applied to E. coli genomes, these classifiers help predict antibiotic resistance, track outbreak strains, and understand genomic adaptations. Combining them with feature selection and optimization techniques enhances accuracy, making them valuable tools in bacterial genomics and clinical research.
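As a minimal illustration of two of the ten classifiers listed above, the sketch below fits logistic regression and a random forest with scikit-learn on a small synthetic imbalanced dataset. The Gaussian features are an assumed stand-in for real genomic features (gene presence/absence, SNP profiles), which this example does not model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Hypothetical imbalanced two-class data: 180 susceptible vs 20 resistant.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (180, 5)), rng.normal(1.5, 1, (20, 5))])
y = np.array([0] * 180 + [1] * 20)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}
# Training-set accuracy only, for illustration; a real study would use
# held-out data and imbalance-aware metrics (sensitivity, MCC, AUC).
scores = {name: m.fit(X, y).score(X, y) for name, m in models.items()}
```

Note that with a 90/10 split, plain accuracy is a weak yardstick (always predicting the majority class already scores 0.90), which is why SMOTE-style rebalancing and metrics such as sensitivity matter in the AMR setting described above.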
Landslides represent one of the most destructive geological hazards worldwide, and susceptibility assessment serves as a critical component of regional landslide risk management. To address the limitations of conventional methods in spatial representation, class imbalance handling, and temporal feature extraction, this study proposes a Buffer-SMOTE-Transformer (BST) comprehensive optimization framework. The framework integrates geospatial buffer sampling techniques to refine negative sample selection, employs the SMOTE algorithm to resolve class imbalance, and incorporates a weighted hybrid Transformer network to enhance the modeling of complex geographical features. An empirical analysis conducted in China's Guangdong Province demonstrates how sample selection, dataset construction, and model choice each affect assessment outcomes. The framework achieves significant superiority over conventional machine learning methods (Random Forest, LGB) in key metrics, with AUC reaching 0.964 and recall attaining 0.953. These findings not only elucidate the cascading amplification effects of comprehensive optimization in susceptibility modeling but also establish a novel technical pathway for large-regional-scale geological hazard risk assessment.
Area under the ROC and PR curves for medicated patients, using the original and SMOTE sample data.
There is a substantial increase in sexually transmitted infections (STIs) among men who have sex with men (MSM) globally. Unprotected sexual practices, multiple sex partners, criminalization, stigmatisation, fear of discrimination, substance use, poor access to care, and a lack of early STI screening tools are among the contributing factors. Therefore, this study applied multilayer perceptron (MLP), extremely randomized trees (ExtraTrees), and XGBoost machine learning models to predict STIs among MSM using bio-behavioural survey (BBS) data in Zimbabwe. Data were collected from 1538 MSM in Zimbabwe. The dataset was split into training and testing sets in a ratio of 80% to 20%, respectively. The synthetic minority oversampling technique (SMOTE) was applied to address class imbalance. Using a stepwise logistic regression model, the study revealed several predictors of STIs among MSM, such as age, cohabitation with sex partners, education status, and employment status. The results show that MLP performed better than the other STI predictive models (XGBoost and ExtraTrees), achieving an accuracy of 87.54%, recall of 97.29%, precision of 89.64%, F1-score of 93.31%, and AUC of 66.78%. XGBoost achieved an accuracy of 86.51%, recall of 96.51%, precision of 89.25%, F1-score of 92.74%, and AUC of 54.83%. ExtraTrees recorded an accuracy of 85.47%, recall of 95.35%, precision of 89.13%, F1-score of 92.13%, and AUC of 60.21%. These models can be effectively used to identify highly at-risk MSM, for STI surveillance, and to further develop STI screening tools to improve the health outcomes of MSM.
Performance comparisons of iProtDNA-SMOTE and 4 competing predictors on TE181 under independent validation.