91 datasets found

Performance of machine learning models using SMOTE-balanced dataset.
plos.figshare.com
xls
Updated Nov 8, 2023
Cite
Nihal Abuzinadah; Muhammad Umer; Abid Ishaq; Abdullah Al Hejaili; Shtwai Alsubai; Ala’ Abdulmajid Eshmawi; Abdullah Mohamed; Imran Ashraf (2023). Performance of machine learning models using SMOTE-balanced dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0293061.t004
Explore at:
xls (available download formats)
Unique identifier
https://doi.org/10.1371/journal.pone.0293061.t004
Dataset updated
Nov 8, 2023
Dataset provided by
PLOS ONE
Authors
Nihal Abuzinadah; Muhammad Umer; Abid Ishaq; Abdullah Al Hejaili; Shtwai Alsubai; Ala’ Abdulmajid Eshmawi; Abdullah Mohamed; Imran Ashraf
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Performance of machine learning models using SMOTE-balanced dataset.
The definition of a confusion matrix.
plos.figshare.com
xls
Updated Feb 10, 2025
Cite
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). The definition of a confusion matrix. [Dataset]. http://doi.org/10.1371/journal.pone.0317396.t002
Explore at:
xls (available download formats)
Unique identifier
https://doi.org/10.1371/journal.pone.0317396.t002
Dataset updated
Feb 10, 2025
Dataset provided by
PLOS ONE
Authors
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In recent years, the challenge of imbalanced data has become increasingly prominent in machine learning, affecting the performance of classification algorithms. This study proposes a novel data-level oversampling method called Cluster-Based Reduced Noise SMOTE (CRN-SMOTE) to address this issue. CRN-SMOTE combines SMOTE for oversampling minority classes with a novel cluster-based noise reduction technique. In this cluster-based noise reduction approach, it is crucial that samples from each category form one or two clusters, a feature that conventional noise reduction methods do not achieve. The proposed method is evaluated on four imbalanced datasets (ILPD, QSAR, Blood, and Maternal Health Risk) using five metrics: Cohen’s kappa, Matthews correlation coefficient (MCC), F1-score, precision, and recall. Results demonstrate that CRN-SMOTE consistently outperformed the state-of-the-art Reduced Noise SMOTE (RN-SMOTE), SMOTE-Tomek Link, and SMOTE-ENN methods across all datasets, with particularly notable improvements observed in the QSAR and Maternal Health Risk datasets, indicating its effectiveness in enhancing imbalanced classification performance. Overall, the experimental findings indicate that CRN-SMOTE outperformed RN-SMOTE in 100% of the cases, achieving average improvements of 6.6% in Kappa, 4.01% in MCC, 1.87% in F1-score, 1.7% in precision, and 2.05% in recall, with SMOTE’s number of neighbors set to 5.
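The five metrics named in this description can all be derived from a binary confusion matrix. The following is a minimal sketch of the standard formulas, not the study's own code; the function name `binary_metrics` and the example counts are illustrative assumptions and are not taken from the paper.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Compute common metrics from binary confusion-matrix counts (hypothetical helper)."""
    n = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # Matthews correlation coefficient
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    # Cohen's kappa: observed agreement corrected for chance agreement
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    kappa = (observed - expected) / (1 - expected)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
        "kappa": kappa,
    }


# Made-up counts for illustration only.
metrics = binary_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The sketch assumes every denominator is non-zero; a classifier that never predicts the positive class would need explicit handling.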
Classification result classifiers using TF-IDF with SMOTE.
plos.figshare.com
xls
Updated May 28, 2024
Cite
Khaled Alnowaiser (2024). Classification result classifiers using TF-IDF with SMOTE. [Dataset]. http://doi.org/10.1371/journal.pone.0302304.t007
Explore at:
xls (available download formats)
Unique identifier
https://doi.org/10.1371/journal.pone.0302304.t007
Dataset updated
May 28, 2024
Dataset provided by
PLOS ONE
Authors
Khaled Alnowaiser
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Classification result classifiers using TF-IDF with SMOTE.
Data from: Enhancing automatic early arteriosclerosis prediction: an...
zenodo.org
Updated Dec 25, 2024
Cite
Eka Miranda (2024). Enhancing automatic early arteriosclerosis prediction: an explainable machine learning evidence [Dataset]. http://doi.org/10.5281/zenodo.14554016
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.14554016
Dataset updated
Dec 25, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Eka Miranda
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the dataset from our research. A research paper has already been published and can be accessed at https://www.sciencedirect.com/science/article/pii/S2588914124000169.
ml_smote
kaggle.com
zip
Updated May 20, 2021
Cite
Alexis Moraga (2021). ml_smote [Dataset]. https://www.kaggle.com/senoratiramisu/ml-smote
Explore at:
zip, 1428 bytes (available download formats)
Dataset updated
May 20, 2021
Authors
Alexis Moraga
Description
Dataset

This dataset was created by Alexis Moraga

Contents
Data from: S1 Datasets -
plos.figshare.com
bin
Updated Feb 10, 2025
Cite
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). S1 Datasets - [Dataset]. http://doi.org/10.1371/journal.pone.0317396.s001
Explore at:
bin (available download formats)
Unique identifier
https://doi.org/10.1371/journal.pone.0317396.s001
Dataset updated
Feb 10, 2025
Dataset provided by
PLOS ONE
Authors
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In recent years, the challenge of imbalanced data has become increasingly prominent in machine learning, affecting the performance of classification algorithms. This study proposes a novel data-level oversampling method called Cluster-Based Reduced Noise SMOTE (CRN-SMOTE) to address this issue. CRN-SMOTE combines SMOTE for oversampling minority classes with a novel cluster-based noise reduction technique. In this cluster-based noise reduction approach, it is crucial that samples from each category form one or two clusters, a feature that conventional noise reduction methods do not achieve. The proposed method is evaluated on four imbalanced datasets (ILPD, QSAR, Blood, and Maternal Health Risk) using five metrics: Cohen’s kappa, Matthews correlation coefficient (MCC), F1-score, precision, and recall. Results demonstrate that CRN-SMOTE consistently outperformed the state-of-the-art Reduced Noise SMOTE (RN-SMOTE), SMOTE-Tomek Link, and SMOTE-ENN methods across all datasets, with particularly notable improvements observed in the QSAR and Maternal Health Risk datasets, indicating its effectiveness in enhancing imbalanced classification performance. Overall, the experimental findings indicate that CRN-SMOTE outperformed RN-SMOTE in 100% of the cases, achieving average improvements of 6.6% in Kappa, 4.01% in MCC, 1.87% in F1-score, 1.7% in precision, and 2.05% in recall, with SMOTE’s number of neighbors set to 5.
Confusion matrix.
plos.figshare.com
xls
Updated May 31, 2024
+ more versions
Cite
Ankit Vijayvargiya; Aparna Sinha; Naveen Gehlot; Ashutosh Jena; Rajesh Kumar; Kieran Moran (2024). Confusion matrix. [Dataset]. http://doi.org/10.1371/journal.pone.0301263.t001
Explore at:
xls (available download formats)
Unique identifier
https://doi.org/10.1371/journal.pone.0301263.t001
Dataset updated
May 31, 2024
Dataset provided by
PLOS ONE
Authors
Ankit Vijayvargiya; Aparna Sinha; Naveen Gehlot; Ashutosh Jena; Rajesh Kumar; Kieran Moran
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The diagnosis of human knee abnormalities using the surface electromyography (sEMG) signal obtained from lower limb muscles with machine learning is a major problem due to the noisy nature of the sEMG signal and the imbalance in data between healthy subjects and subjects with knee abnormalities. To address this challenge, a combination of wavelet decomposition (WD) with ensemble empirical mode decomposition (EEMD) and the Synthetic Minority Oversampling Technique (S-WD-EEMD) is proposed. In this study, a hybrid WD-EEMD is used to minimize the noise introduced into the sEMG signal during collection, while the Synthetic Minority Oversampling Technique (SMOTE) is used to balance the data by increasing the minority class samples during the training of machine learning techniques. The findings indicate that the hybrid WD-EEMD with SMOTE oversampling technique enhances the efficacy of the examined classifiers when employed on the imbalanced sEMG data. The F-Score of the Extra Tree Classifier, when utilizing WD-EEMD signal processing with SMOTE oversampling, is 98.4%, whereas, without the SMOTE oversampling technique, it is 95.1%.
Data from: Dataset for classification of signaling proteins based on...
figshare.com
portalcientifico.sergas.es
txt
Updated Jan 19, 2016
Cite
Carlos Fernandez-Lozano; Cristian Robert Munteanu (2016). Dataset for classification of signaling proteins based on molecular star graph descriptors using machine-learning models [Dataset]. http://doi.org/10.6084/m9.figshare.1330132.v1
Explore at:
txt (available download formats)
Unique identifier
https://doi.org/10.6084/m9.figshare.1330132.v1
Dataset updated
Jan 19, 2016
Dataset provided by
figshare
Authors
Carlos Fernandez-Lozano; Cristian Robert Munteanu
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The positive group of 608 signaling protein sequences was downloaded in FASTA format from Protein Databank (Berman et al., 2000) by using the “Molecular Function Browser” in the “Advanced Search Interface” (“Signaling (GO ID23052)”, protein identity cut-off = 30%). The negative group of 2077 non-signaling proteins was downloaded as the PISCES CulledPDB (http://dunbrack.fccc.edu/PISCES.php) (Wang & R. L. Dunbrack, 2003) (November 19th, 2012) using identity (degree of correspondence between two sequences) less than 20%, resolution of 1.6 Å and R-factor 0.25. The full dataset contains 2685 FASTA sequences of protein chains from the PDB databank: 608 are signaling proteins and 2077 are non-signaling peptides. This kind of unbalanced data is not the most suitable input for learning algorithms because the results would present a high sensitivity and low specificity; learning algorithms would tend to classify most samples as part of the most common group. To avoid this situation, a pre-processing stage is needed in order to get a more balanced dataset, in this case by means of the synthetic minority oversampling technique (SMOTE). In short, SMOTE provides a more balanced dataset by expanding the smaller class with new samples that interpolate between existing minority-class samples. After this pre-processing, the final dataset is composed of 1824 positive samples (signaling protein chains) and 2432 negative cases (non-signaling protein chains). Paper is available at: http://dx.doi.org/10.1016/j.jtbi.2015.07.038
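The interpolation idea described above can be sketched in a few lines. This is a simplified illustration of the general SMOTE principle using only the Python standard library, not the authors' pipeline; the function name `smote_sketch`, its parameters, and the toy points are assumptions made for the example.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Hypothetical helper: `minority` is a list of equal-length numeric tuples.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy two-dimensional minority class, for illustration only.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=6, k=2)
print(len(new_points), "synthetic samples generated")
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the region the minority class already occupies.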

Please cite: Carlos Fernandez-Lozano, Rubén F. Cuiñas, José A. Seoane, Enrique Fernández-Blanco, Julian Dorado, Cristian R. Munteanu, Classification of signaling proteins based on molecular star graph descriptors using Machine Learning models, Journal of Theoretical Biology, Volume 384, 7 November 2015, Pages 50-58, ISSN 0022-5193, http://dx.doi.org/10.1016/j.jtbi.2015.07.038 (http://www.sciencedirect.com/science/article/pii/S0022519315003999)
Data from: Prediction of 35 Target Per- and Polyfluoroalkyl Substances...
acs.figshare.com
figshare.com
txt
Updated Aug 18, 2023
Cite
Jialin Dong; Gabriel Tsai; Christopher I. Olivares (2023). Prediction of 35 Target Per- and Polyfluoroalkyl Substances (PFASs) in California Groundwater Using Multilabel Semisupervised Machine Learning [Dataset]. http://doi.org/10.1021/acsestwater.3c00134.s002
Explore at:
txt (available download formats)
Unique identifier
https://doi.org/10.1021/acsestwater.3c00134.s002
Dataset updated
Aug 18, 2023
Dataset provided by
ACS Publications
Authors
Jialin Dong; Gabriel Tsai; Christopher I. Olivares
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Comprehensive monitoring of perfluoroalkyl and polyfluoroalkyl substances (PFASs) is challenging because of the high analytical cost and an increasing number of analytes. We developed a machine learning pipeline to understand environmental features influencing PFAS profiles in groundwater. By examining 23 public data sets (2016–2022) in California, we built a state-wide groundwater database (25,000 observations across 4200 wells) encompassing contamination sources, weather, air quality, soil, hydrology, and groundwater quality (PFASs and cocontaminants). We used supervised learning to prescreen total PFAS concentrations above 70 ng/L and multilabel semisupervised learning to predict 35 individual PFAS concentrations above 2 ng/L. Random forest with ADASYN oversampling performed the best for total PFASs (AUROC 99%). XGBoost with SMOTE oversampling achieved the AUROC of 73–100% for individual PFAS prediction. Contamination sources and soil variables contributed the most to accuracy. Individual PFASs were strongly correlated within each PFAS’s subfamily (i.e., short- vs long-chain PFCAs, sulfonamides). These associations improved prediction performance using classifier chains, which predicts a PFAS based on previously predicted species. We applied the model to reconstruct PFAS profiles in groundwater wells with missing data in previous years. Our approach can complement monitoring programs of environmental agencies to validate previous investigation results and prioritize sites for future PFAS sampling.
Montreal Road Collision Dataset (2012-2021)
data.mendeley.com
Updated Aug 14, 2024
Cite
Bappa Muktar (2024). Montreal Road Collision Dataset (2012-2021) [Dataset]. http://doi.org/10.17632/gg8c7t3v54.1
Explore at:
Unique identifier
https://doi.org/10.17632/gg8c7t3v54.1
Dataset updated
Aug 14, 2024
Authors
Bappa Muktar
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is derived from the public dataset of road collisions that occurred in Montreal, which is accessible at https://www.donneesquebec.ca/recherche/dataset/vmtl-collisions-routieres. Unlike the original dataset, this dataset has been preprocessed (handling of missing data, data rebalancing via the SMOTE-ENN algorithm, etc.), and categorical variables have been encoded, making it ready for machine learning and other tasks. The .pkl file containing the encoding and the notebook demonstrating how to use the .pkl file are provided. For more details, please refer to the table below, which represents the data dictionary of this dataset. This dataset is shared under the Attribution License (CC-BY 4.0).

If you use this dataset for publication, please cite the following reference: Muktar, B.; Fono, V. Toward Safer Roads: Predicting the Severity of Traffic Accidents in Montreal Using Machine Learning. Electronics 2024, 13, 3036. https://doi.org/10.3390/electronics13153036
Data from: Mental issues, internet addiction and quality of life predict...
data.mendeley.com
Updated Jul 12, 2024
+ more versions
Cite
Andras Matuz (2024). Mental issues, internet addiction and quality of life predict burnout among Hungarian teachers: a machine learning analysis [Dataset]. http://doi.org/10.17632/2yy4j7rgvg.1
Explore at:
Unique identifier
https://doi.org/10.17632/2yy4j7rgvg.1
Dataset updated
Jul 12, 2024
Authors
Andras Matuz
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Background: Burnout is usually defined as a state of emotional, physical, and mental exhaustion that affects people in various professions (e.g. physicians, nurses, teachers). The consequences of burnout involve decreased motivation, productivity, and overall diminished well-being. The machine learning-based prediction of burnout has therefore become the focus of recent research. In this study, the aim was to detect burnout using machine learning and to identify its most important predictors in a sample of Hungarian high-school teachers. Methods: The final sample consisted of 1,576 high-school teachers (522 male), who completed a survey including various sociodemographic and health-related questions and psychological questionnaires. Specifically, depression, insomnia, internet habits (e.g. when and why one uses the internet) and problematic internet usage were among the most important predictors tested in this study. Supervised classification algorithms were trained to detect burnout assessed by two well-known burnout questionnaires. Feature selection was conducted using recursive feature elimination. Hyperparameters were tuned via grid search with 5-fold cross-validation. Due to class imbalance, class weights (i.e. cost-sensitive learning), downsampling and a hybrid method (SMOTE-ENN) were applied in separate analyses. The final model evaluation was carried out on a previously unseen holdout test sample. Results: Burnout was detected in 19.7% of the teachers included in the final dataset. The best predictive performance on the holdout test sample was achieved by support vector machine with SMOTE-ENN (AUC = .942; balanced accuracy = .868, sensitivity = .898; specificity = .837). The best predictors of burnout were Beck’s Depression Inventory scores, Athens Insomnia Scale scores, subscales of the Problematic Internet Use Questionnaire and self-reported current health status.
Conclusions: The performances of the algorithms were comparable with previous studies; however, it is important to note that we tested our models on previously unseen holdout samples suggesting higher levels of generalizability. Another remarkable finding is that besides depression and insomnia, other variables such as problematic internet use and time spent online also turned out to be important predictors of burnout.
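The reported figures for this entry are internally consistent: balanced accuracy is the mean of sensitivity and specificity, and (0.898 + 0.837) / 2 = 0.8675, which agrees with the reported .868 after rounding. A small sketch of that check follows; the helper name `balanced_accuracy` is an assumption made for illustration.

```python
def balanced_accuracy(sensitivity, specificity):
    """Balanced accuracy is the mean of sensitivity and specificity."""
    return (sensitivity + specificity) / 2


# Values reported in the description for SVM with SMOTE-ENN.
reported_sensitivity = 0.898
reported_specificity = 0.837
ba = balanced_accuracy(reported_sensitivity, reported_specificity)
print(f"balanced accuracy = {ba:.4f}")
```

Unlike plain accuracy, this average weights both classes equally, which is why it is commonly reported when only 19.7% of the sample belongs to the positive class.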
DoH Attack and Malware Detection using ML/DL
kaggle.com
Updated May 21, 2024
Cite
BCCC Datasets (2024). DoH Attack and Malware Detection using ML/DL [Dataset]. https://www.kaggle.com/datasets/bcccdatasets/bccc-cira-cic-dohbrw-2020/data
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
May 21, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
BCCC Datasets
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
The 'BCCC-CIRA-CIC-DoHBrw-2020' dataset was created to address the imbalance in the 'CIRA-CIC-DoHBrw-2020' dataset. Unlike the 'CIRA-CIC-DoHBrw-2020' dataset, which is skewed with about 90% malicious and only 10% benign DNS over HTTPS (DoH) network traffic, the 'BCCC-CIRA-CIC-DoHBrw-2020' dataset offers a more balanced composition. It includes equal numbers of malicious and benign DoH network traffic instances, with 249,836 instances in each category. This balance was achieved using the Synthetic Minority Over-sampling Technique (SMOTE). The 'BCCC-CIRA-CIC-DoHBrw-2020' dataset comprises three CSV files: one for malicious DoH traffic, one for benign DoH traffic, and a third that combines both types.

The full research paper outlining the details of the dataset and its underlying principles: “Unveiling DoH Tunnel: Toward Generating a Balanced DoH Encrypted Traffic Dataset and Profiling malicious Behaviour using Inherently Interpretable Machine Learning”, Sepideh Niktabe, Arash Habibi Lashkari, Arousha Haghighian Roudsari, Peer-to-Peer Networking and Applications, Vol. 17, 2023
Data from: Image-based automated species identification: Can virtual data...
zenodo.org
data.niaid.nih.gov
+1 more
zip
Updated Jun 4, 2022
Cite
Morris Klasen; Jonas Eberle; Dirk Ahrens; Volker Steinhage (2022). Image-based automated species identification: Can virtual data augmentation overcome problems of insufficient sampling? [Dataset]. http://doi.org/10.5061/dryad.f1vhhmgx9
Explore at:
zip (available download formats)
Unique identifier
https://doi.org/10.5061/dryad.f1vhhmgx9
Dataset updated
Jun 4, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Morris Klasen; Jonas Eberle; Dirk Ahrens; Volker Steinhage
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Automated species identification and delimitation is challenging, particularly in rare and thus often scarcely sampled species, which do not allow sufficient discrimination of infraspecific versus interspecific variation. Typical problems arising from either low or exaggerated interspecific morphological differentiation are best met by automated methods of machine learning that learn efficient and effective species identification from training samples. However, limited infraspecific sampling remains a key challenge also in machine learning.

In this study, we assessed whether a data augmentation approach may help to overcome the problem of scarce training data in automated visual species identification. The stepwise augmentation of data comprised image rotation as well as visual and virtual augmentation. The visual data augmentation applies classic approaches of data augmentation and generation of artificial images using a Generative Adversarial Network (GAN) approach. Descriptive feature vectors are derived from bottleneck features of a VGG-16 convolutional neural network (CNN) that are then stepwise reduced in dimensionality using Global Average Pooling and PCA to prevent overfitting. Finally, data augmentation employs synthetic additional sampling in feature space by an oversampling algorithm in vector space (SMOTE). Applied to four different image datasets, which include scarab beetle genitalia (Pleophylla, Schizonycha) as well as wing patterns of bees (Osmia) and cattleheart butterflies (Parides), our augmentation approach outperformed, in terms of identification accuracy, both a deep learning baseline approach with non-augmented data and a traditional 2D morphometric approach (Procrustes analysis of scarab beetle genitalia).
DataSheet1_Comparison of Resampling Algorithms to Address Class Imbalance...
frontiersin.figshare.com
docx
Updated May 31, 2023
Cite
Daniel Lowell Weller; Tanzy M. T. Love; Martin Wiedmann (2023). DataSheet1_Comparison of Resampling Algorithms to Address Class Imbalance when Developing Machine Learning Models to Predict Foodborne Pathogen Presence in Agricultural Water.docx [Dataset]. http://doi.org/10.3389/fenvs.2021.701288.s001
Explore at:
docx (available download formats)
Unique identifier
https://doi.org/10.3389/fenvs.2021.701288.s001
Dataset updated
May 31, 2023
Dataset provided by
Frontiers
Authors
Daniel Lowell Weller; Tanzy M. T. Love; Martin Wiedmann
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Recent studies have shown that predictive models can supplement or provide alternatives to E. coli-testing for assessing the potential presence of food safety hazards in water used for produce production. However, these studies used balanced training data and focused on enteric pathogens. As such, research is needed to determine 1) if predictive models can be used to assess Listeria contamination of agricultural water, and 2) how resampling (to deal with imbalanced data) affects performance of these models. To address these knowledge gaps, this study developed models that predict nonpathogenic Listeria spp. (excluding L. monocytogenes) and L. monocytogenes presence in agricultural water using various combinations of learner (e.g., random forest, regression), feature type, and resampling method (none, oversampling, SMOTE). Four feature types were used in model training: microbial, physicochemical, spatial, and weather. “Full models” were trained using all four feature types, while “nested models” used between one and three types. In total, 45 full (15 learners*3 resampling approaches) and 108 nested (5 learners*9 feature sets*3 resampling approaches) models were trained per outcome. Model performance was compared against baseline models where E. coli concentration was the sole predictor. Overall, the machine learning models outperformed the baseline E. coli models, with random forests outperforming models built using other learners (e.g., rule-based learners). Resampling produced more accurate models than not resampling, with SMOTE models outperforming, on average, oversampling models. Regardless of resampling method, spatial and physicochemical water quality features drove accurate predictions for the nonpathogenic Listeria spp. and L. monocytogenes models, respectively. Overall, these findings 1) illustrate the need for alternatives to existing E. coli-based monitoring programs for assessing agricultural water for the presence of potential food safety hazards, and 2) suggest that predictive models may be one such alternative. Moreover, these findings provide a conceptual framework for how such models can be developed in the future with the ultimate aim of developing models that can be integrated into on-farm risk management programs. For example, future studies should consider using random forest learners, SMOTE resampling, and spatial features to develop models to predict the presence of foodborne pathogens, such as L. monocytogenes, in agricultural water when the training data is imbalanced.
Additional file 2 of Implementation of ensemble machine learning algorithms...
springernature.figshare.com
txt
Updated Jun 20, 2023
Cite
Abdu Rehaman Pasha Syed; Rahul Anbalagan; Anagha S. Setlur; Chandrashekar Karunakaran; Jyoti Shetty; Jitendra Kumar; Vidya Niranjan (2023). Additional file 2 of Implementation of ensemble machine learning algorithms on exome datasets for predicting early diagnosis of cancers [Dataset]. http://doi.org/10.6084/m9.figshare.21592787.v1
Explore at:
txt (available download formats)
Unique identifier
https://doi.org/10.6084/m9.figshare.21592787.v1
Dataset updated
Jun 20, 2023
Dataset provided by
figshare
Authors
Abdu Rehaman Pasha Syed; Rahul Anbalagan; Anagha S. Setlur; Chandrashekar Karunakaran; Jyoti Shetty; Jitendra Kumar; Vidya Niranjan
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Additional file 2. The synthetic dataset generated through TVAE method.
ml_data_test_detection_bank_transaction_frauds_unbalanced
huggingface.co
Cite
Roberto Armas, ml_data_test_detection_bank_transaction_frauds_unbalanced [Dataset]. https://huggingface.co/datasets/roberto-armas/ml_data_test_detection_bank_transaction_frauds_unbalanced
Explore at:
Authors
Roberto Armas
Description
ML Data Test Detection Bank Transaction Frauds Unbalanced

The project provides a quick and accessible dataset designed for learning and experimenting with machine learning algorithms, specifically in the context of detecting fraudulent bank transactions. It is intended for practicing and applying concepts such as Random Forest, Support Vector Machines (SVM), and Synthetic Minority Over-sampling Technique (SMOTE) to address unbalanced classification problems. Note: This dataset is… See the full description on the dataset page: https://huggingface.co/datasets/roberto-armas/ml_data_test_detection_bank_transaction_frauds_unbalanced.
The average values of evaluation metrics on ILDP, QSAR, Blood and Health...
plos.figshare.com
xls
Updated Feb 10, 2025
Cite
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). The average values of evaluation metrics on ILDP, QSAR, Blood and Health risk imbalanced datasets using RF classifiers and 10-fold cross validation methodology. [Dataset]. http://doi.org/10.1371/journal.pone.0317396.t006
Explore at:
xls (available download formats)
Unique identifier
https://doi.org/10.1371/journal.pone.0317396.t006
Dataset updated
Feb 10, 2025
Dataset provided by
PLOS ONE
Authors
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The average values of evaluation metrics on ILDP, QSAR, Blood and Health risk imbalanced datasets using RF classifiers and 10-fold cross validation methodology.
Data_Sheet 1_Prediction Is a Balancing Act: Importance of Sampling Methods...
frontiersin.figshare.com
pdf
Updated May 30, 2023
Cite
Priyanka Banerjee; Frederic O. Dehnbostel; Robert Preissner (2023). Data_Sheet 1_Prediction Is a Balancing Act: Importance of Sampling Methods to Balance Sensitivity and Specificity of Predictive Models Based on Imbalanced Chemical Data Sets.PDF [Dataset]. http://doi.org/10.3389/fchem.2018.00362.s001
Explore at:
pdf (available download formats)
Unique identifier
https://doi.org/10.3389/fchem.2018.00362.s001
Dataset updated
May 30, 2023
Dataset provided by
Frontiers
Authors
Priyanka Banerjee; Frederic O. Dehnbostel; Robert Preissner
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The increase in the number of new chemicals synthesized in past decades has resulted in constant growth in the development and application of computational models for prediction of activity as well as safety profiles of the chemicals. Most of the time, such computational models and their applications must deal with imbalanced chemical data. It is indeed a challenge to construct a classifier using an imbalanced data set. In this study, we analyzed and validated the importance of different sampling methods over a non-sampling approach, to achieve a well-balanced sensitivity and specificity of a machine learning model trained on imbalanced chemical data. Additionally, this study has achieved an accuracy of 93.00%, an AUC of 0.94, F1 measure of 0.90, sensitivity of 96.00% and specificity of 91.00% using SMOTE sampling and Random Forest classifier for the prediction of Drug Induced Liver Injury (DILI). Our results suggest that, irrespective of data set used, sampling methods can have major influence on reducing the gap between sensitivity and specificity of a model. This study demonstrates the efficacy of different sampling methods for the class-imbalance problem using binary chemical data sets.
Data Sheet 1_The classification method of donkey breeds based on SNPs data...
frontiersin.figshare.com
csv
Updated Apr 9, 2025
Cite
Dekui Li; Xiaolong Hu; Yongdong Peng (2025). Data Sheet 1_The classification method of donkey breeds based on SNPs data and machine learning.csv [Dataset]. http://doi.org/10.3389/fgene.2025.1496246.s001
Explore at:
Available download formats: csv
Unique identifier
https://doi.org/10.3389/fgene.2025.1496246.s001
Dataset updated
Apr 9, 2025
Dataset provided by
Frontiers
Authors
Dekui Li; Xiaolong Hu; Yongdong Peng
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A method for accurately classifying donkey breeds has been developed by integrating single nucleotide polymorphism (SNP) data with machine learning algorithms. The approach includes preprocessing donkey genomic sequencing data, addressing data imbalance with the Synthetic Minority Over-sampling Technique (SMOTE), and using an improved Leave-One-Out Cross-Validation (LOOCV) for dataset partitioning. Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Random Forest (RF) models were constructed and evaluated. The results demonstrated that the choice of chromosome significantly influences classifier performance. For instance, chromosome Chr2 showed the highest classification accuracy with KNN, while chromosome Chr19 performed best with the SVM and RF models. After enhancing data quality and addressing imbalances, classification performance improved substantially, with accuracy, precision, recall, and F1 score increasing by up to 15% in certain models, particularly on key chromosomes. This method offers an effective solution for donkey breed classification and provides technical support for the conservation and development of donkey genetic resources.
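The core idea of SMOTE named in this description, interpolating between a minority sample and one of its nearest minority neighbours, can be sketched as follows. This is a simplified illustration in plain Python, not the authors' pipeline; the function name, the parameter defaults, and the toy points are assumptions.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority points."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by Euclidean distance to the base point.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Place the new point at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy minority class: the four corners of the unit square (made-up data).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(points, n_new=6, k=2)
```

When oversampling is combined with cross-validation such as LOOCV, it is generally applied to the training portion of each split only, so that synthetic points derived from a held-out sample do not leak into training. Whether the improved LOOCV described here works this way is not stated in the description.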
Table1_A comparative study in class imbalance mitigation when working with...
frontiersin.figshare.com
pdf
Updated Mar 26, 2024
Cite
Rawan S. Abdulsadig; Esther Rodriguez-Villegas (2024). Table1_A comparative study in class imbalance mitigation when working with physiological signals.pdf [Dataset]. http://doi.org/10.3389/fdgth.2024.1377165.s001
Explore at:
Available download formats: pdf
Unique identifier
https://doi.org/10.3389/fdgth.2024.1377165.s001
Dataset updated
Mar 26, 2024
Dataset provided by
Frontiers
Authors
Rawan S. Abdulsadig; Esther Rodriguez-Villegas
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Class imbalance is a common challenge in classification tasks that aim to detect particularly infrequent medical events. Apnoea is an example of such an event. This challenge can, however, be mitigated using class rebalancing algorithms. This work investigated 10 widely used data-level class imbalance mitigation methods with the aim of building a random forest (RF) model that attempts to detect apnoea events from photoplethysmography (PPG) signals acquired from the neck. Those methods are random undersampling (RandUS), random oversampling (RandOS), condensed nearest-neighbors (CNNUS), edited nearest-neighbors (ENNUS), Tomek’s links (TomekUS), synthetic minority oversampling technique (SMOTE), Borderline-SMOTE (BLSMOTE), adaptive synthetic oversampling (ADASYN), SMOTE with TomekUS (SMOTETomek) and SMOTE with ENNUS (SMOTEENN). Feature-space transformation using PCA and KernelPCA was also examined as a potential way of providing better representations of the data for the class rebalancing methods to operate on. This work showed that RandUS is the best option for improving the sensitivity score (up to 11%). However, it could hinder the overall accuracy due to the reduced amount of training data. On the other hand, augmenting the data with new artificial data points was shown to be a non-trivial task that needs further development, especially in the presence of subject dependencies, as was the case in this work.
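Random undersampling (RandUS), the method this description reports as best for sensitivity, can be sketched in a few lines. This is a generic illustration under stated assumptions, not the authors' implementation; the function name and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def random_undersample(samples, labels, seed=0):
    """Randomly drop samples so every class matches the smallest class size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    # Target size per class is the size of the rarest class.
    n_min = min(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Sample without replacement, so no point is duplicated.
        for x in rng.sample(xs, n_min):
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


# Toy, made-up data: 2 event samples (label 1) and 8 non-event samples (label 0).
samples = list(range(10))
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
balanced_x, balanced_y = random_undersample(samples, labels)
```

The sketch also makes the trade-off noted in the description visible: balancing 2 against 8 leaves only 4 of the 10 samples for training.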

