39 datasets found

A comparison of the CRN-SMOTE and RN-SMOTE methods on the health risk...
plos.figshare.com
xls
Updated Feb 10, 2025
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). A comparison of the CRN-SMOTE and RN-SMOTE methods on the health risk dataset based on different classification metrics using the Random Forest classifier. [Dataset]. http://doi.org/10.1371/journal.pone.0317396.t009
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0317396.t009
Dataset updated
Feb 10, 2025
Dataset provided by
PLOS ONE
Authors
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A comparison of the CRN-SMOTE and RN-SMOTE methods on the health risk dataset based on different classification metrics using the Random Forest classifier.
Data from: S1 Datasets -
figshare.com
plos.figshare.com
bin
Updated Feb 10, 2025
+ more versions
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). S1 Datasets - [Dataset]. http://doi.org/10.1371/journal.pone.0317396.s001
Available download formats
bin
Unique identifier
https://doi.org/10.1371/journal.pone.0317396.s001
Dataset updated
Feb 10, 2025
Dataset provided by
PLOS ONE
Authors
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In recent years, the challenge of imbalanced data has become increasingly prominent in machine learning, affecting the performance of classification algorithms. This study proposes a novel data-level oversampling method called Cluster-Based Reduced Noise SMOTE (CRN-SMOTE) to address this issue. CRN-SMOTE combines SMOTE for oversampling minority classes with a novel cluster-based noise reduction technique. In this cluster-based noise reduction approach, it is crucial that samples from each category form one or two clusters, a feature that conventional noise reduction methods do not achieve. The proposed method is evaluated on four imbalanced datasets (ILPD, QSAR, Blood, and Maternal Health Risk) using five metrics: Cohen’s kappa, Matthews correlation coefficient (MCC), F1-score, precision, and recall. Results demonstrate that CRN-SMOTE consistently outperformed the state-of-the-art Reduced Noise SMOTE (RN-SMOTE), SMOTE-Tomek Link, and SMOTE-ENN methods across all datasets, with particularly notable improvements observed in the QSAR and Maternal Health Risk datasets, indicating its effectiveness in enhancing imbalanced classification performance. Overall, the experimental findings indicate that CRN-SMOTE outperformed RN-SMOTE in 100% of the cases, achieving average improvements of 6.6% in Kappa, 4.01% in MCC, 1.87% in F1-score, 1.7% in precision, and 2.05% in recall, with SMOTE’s number of neighbors set to 5.
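As a rough illustration of the five metrics named in the description above, the following sketch computes them from a binary confusion matrix. The function name `binary_metrics` and the example counts are hypothetical and are not taken from the dataset or the paper; the formulas are the standard textbook definitions.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    n = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # Cohen's kappa: observed agreement corrected for chance agreement.
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    kappa = (observed - expected) / (1 - expected)
    # Matthews correlation coefficient (MCC).
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
        "mcc": mcc,
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
scores = binary_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Kappa and MCC both account for the class distribution, which is presumably why they appear alongside precision and recall in an imbalanced-data comparison.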
A comparison of the RN-SMOTE, SMOTE-Tomek Link, SMOTE-ENN, and the proposed...
plos.figshare.com
xls
Updated Feb 10, 2025
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). A comparison of the RN-SMOTE, SMOTE-Tomek Link, SMOTE-ENN, and the proposed CRN-SMOTE methods on the Blood and Health-risk datasets is presented, based on various classification metrics using the Random Forest classifier. [Dataset]. http://doi.org/10.1371/journal.pone.0317396.t008
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0317396.t008
Dataset updated
Feb 10, 2025
Dataset provided by
PLOS ONE
Authors
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A comparison of the RN-SMOTE, SMOTE-Tomek Link, SMOTE-ENN, and the proposed CRN-SMOTE methods on the Blood and Health-risk datasets is presented, based on various classification metrics using the Random Forest classifier.
A comparison of the RN-SMOTE, SMOTE-Tomek Link, SMOTE-ENN, and the proposed...
plos.figshare.com
xls
Updated Feb 10, 2025
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). A comparison of the RN-SMOTE, SMOTE-Tomek Link, SMOTE-ENN, and the proposed CRN-SMOTE methods on the ILPD and QSAR datasets is presented, based on various classification metrics using the Random Forest classifier. [Dataset]. http://doi.org/10.1371/journal.pone.0317396.t007
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0317396.t007
Dataset updated
Feb 10, 2025
Dataset provided by
PLOS ONE
Authors
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A comparison of the RN-SMOTE, SMOTE-Tomek Link, SMOTE-ENN, and the proposed CRN-SMOTE methods on the ILPD and QSAR datasets is presented, based on various classification metrics using the Random Forest classifier.
Data from: Image-based automated species identification: Can virtual data...
zenodo.org
data.niaid.nih.gov
+1 more
zip
Updated Jun 4, 2022
Morris Klasen; Jonas Eberle; Dirk Ahrens; Volker Steinhage (2022). Image-based automated species identification: Can virtual data augmentation overcome problems of insufficient sampling? [Dataset]. http://doi.org/10.5061/dryad.f1vhhmgx9
Available download formats
zip
Unique identifier
https://doi.org/10.5061/dryad.f1vhhmgx9
Dataset updated
Jun 4, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Morris Klasen; Jonas Eberle; Dirk Ahrens; Volker Steinhage
License
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Automated species identification and delimitation is challenging, particularly in rare and thus often scarcely sampled species, which do not allow sufficient discrimination of infraspecific versus interspecific variation. Typical problems arising from either low or exaggerated interspecific morphological differentiation are best met by automated methods of machine learning that learn efficient and effective species identification from training samples. However, limited infraspecific sampling remains a key challenge also in machine learning.

In this study, we assessed whether a data augmentation approach may help to overcome the problem of scarce training data in automated visual species identification. The stepwise augmentation of data comprised image rotation as well as visual and virtual augmentation. The visual data augmentation applies classic approaches of data augmentation and generation of artificial images using a Generative Adversarial Networks (GAN) approach. Descriptive feature vectors are derived from bottleneck features of a VGG-16 convolutional neural network (CNN) that are then stepwise reduced in dimensionality using Global Average Pooling and PCA to prevent overfitting. Finally, data augmentation employs synthetic additional sampling in feature space by an oversampling algorithm in vector space (SMOTE). Applied on four different image datasets, which include scarab beetle genitalia (Pleophylla, Schizonycha) as well as wing patterns of bees (Osmia) and cattleheart butterflies (Parides), our augmentation approach outperformed a deep learning baseline approach by means of resulting identification accuracy with non-augmented data as well as a traditional 2D morphometric approach (Procrustes analysis of scarab beetle genitalia).
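The final augmentation step described above, synthetic sampling in feature space with SMOTE, can be sketched in plain Python as follows. This is a minimal illustration of the interpolation idea only, not the authors' pipeline: the function name `smote_sketch`, the toy two-dimensional points, and the fixed seed are all assumptions made for the example.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Place the new point at a random position on the connecting segment.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority-class feature vectors (hypothetical values).
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5), (1.8, 2.1), (1.4, 2.3)]
new_points = smote_sketch(minority, n_new=4)
print(new_points)
```

Because each synthetic point lies on a segment between two existing minority samples, it stays inside the region already occupied by that class, which is why the description applies the step after dimensionality reduction rather than on raw images.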
The average values of evaluation metrics on ILDP, QSAR, Blood and Health...
plos.figshare.com
xls
Updated Feb 10, 2025
+ more versions
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). The average values of evaluation metrics on ILDP, QSAR, Blood and Health risk imbalanced datasets using ADA classifiers and 10-fold cross validation methodology. [Dataset]. http://doi.org/10.1371/journal.pone.0317396.t005
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0317396.t005
Dataset updated
Feb 10, 2025
Dataset provided by
PLOS ONE
Authors
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The average values of evaluation metrics on ILDP, QSAR, Blood and Health risk imbalanced datasets using ADA classifiers and 10-fold cross validation methodology.
Top 10 performing oversamplers for DTS2 versus baseline (no oversampling and...
plos.figshare.com
xls
Updated Jun 3, 2023
+ more versions
Kevin Teh; Paul Armitage; Solomon Tesfaye; Dinesh Selvarajah; Iain D. Wilkinson (2023). Top 10 performing oversamplers for DTS2 versus baseline (no oversampling and SMOTE) averaged across four classifiers. [Dataset]. http://doi.org/10.1371/journal.pone.0243907.t003
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0243907.t003
Dataset updated
Jun 3, 2023
Dataset provided by
PLOS ONE
Authors
Kevin Teh; Paul Armitage; Solomon Tesfaye; Dinesh Selvarajah; Iain D. Wilkinson
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Top 10 performing oversamplers for DTS2 versus baseline (no oversampling and SMOTE) averaged across four classifiers.
Acronym table with description.
plos.figshare.com
xls
Updated Nov 8, 2023
+ more versions
Nihal Abuzinadah; Muhammad Umer; Abid Ishaq; Abdullah Al Hejaili; Shtwai Alsubai; Ala’ Abdulmajid Eshmawi; Abdullah Mohamed; Imran Ashraf (2023). Acronym table with description. [Dataset]. http://doi.org/10.1371/journal.pone.0293061.t007
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0293061.t007
Dataset updated
Nov 8, 2023
Dataset provided by
PLOS ONE
Authors
Nihal Abuzinadah; Muhammad Umer; Abid Ishaq; Abdullah Al Hejaili; Shtwai Alsubai; Ala’ Abdulmajid Eshmawi; Abdullah Mohamed; Imran Ashraf
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Predicting student performance automatically is of utmost importance, due to the substantial volume of data within educational databases. Educational data mining (EDM) devises techniques to uncover insights from data originating in educational settings. Artificial intelligence (AI) can mine educational data to predict student performance and provide measures to help students avoid failing and learn better. Learning platforms complement traditional learning settings by analyzing student performance, which can help reduce the chance of student failure. Existing methods for student performance prediction in educational data mining faced challenges such as limited accuracy, imbalanced data, and difficulties in feature engineering. These issues hindered effective adaptability and generalization across diverse educational contexts. This study proposes a machine learning-based system with deep convoluted features for the prediction of students’ academic performance. The proposed framework is employed to predict student academic performance on imbalanced datasets as well as on datasets balanced using the synthetic minority oversampling technique (SMOTE). In addition, the performance is also evaluated using the original and deep convoluted features. Experimental results indicate that the use of deep convoluted features provides improved prediction accuracy compared to original features. Results obtained using the extra tree classifier with convoluted features show the highest classification accuracy of 99.9%. In comparison with the state-of-the-art approaches, the proposed approach achieved higher performance. This research introduces a powerful AI-driven system for student performance prediction, offering substantial advancements in accuracy compared to existing approaches.
Summary table: Oversampling techniques using SMOTE, ADASYN, and weighted...
plos.figshare.com
xls
Updated Nov 16, 2023
Alaa Alomari; Hossam Faris; Pedro A. Castillo (2023). Summary table: Oversampling techniques using SMOTE, ADASYN, and weighted rare classes. [Dataset]. http://doi.org/10.1371/journal.pone.0290581.t007
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0290581.t007
Dataset updated
Nov 16, 2023
Dataset provided by
PLOS ONE
Authors
Alaa Alomari; Hossam Faris; Pedro A. Castillo
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Summary table: Oversampling techniques using SMOTE, ADASYN, and weighted rare classes.
BILSTM using SMOTE and ADASYN oversampling techniques.
plos.figshare.com
xls
Updated Nov 16, 2023
Alaa Alomari; Hossam Faris; Pedro A. Castillo (2023). BILSTM using SMOTE and ADASYN oversampling techniques. [Dataset]. http://doi.org/10.1371/journal.pone.0290581.t002
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0290581.t002
Dataset updated
Nov 16, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Alaa Alomari; Hossam Faris; Pedro A. Castillo
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
BILSTM using SMOTE and ADASYN oversampling techniques.
Predicting epileptic seizures using nonnegative matrix factorization
plos.figshare.com
pdf
Updated May 31, 2023
Olivera Stojanović; Levin Kuhlmann; Gordon Pipa (2023). Predicting epileptic seizures using nonnegative matrix factorization [Dataset]. http://doi.org/10.1371/journal.pone.0228025
Available download formats
pdf
Unique identifier
https://doi.org/10.1371/journal.pone.0228025
Dataset updated
May 31, 2023
Dataset provided by
PLOS ONE
Authors
Olivera Stojanović; Levin Kuhlmann; Gordon Pipa
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This paper presents a procedure for the patient-specific prediction of epileptic seizures. To this end, a combination of nonnegative matrix factorization (NMF) and smooth basis functions with robust regression is applied to power spectra of intracranial electroencephalographic (iEEG) signals. The resulting time and frequency components capture the dominant information from power spectra, while removing outliers and noise. This makes it possible to detect structure in preictal states, which is used for classification. Linear support vector machines (SVM) with L1 regularization are used to select and weight the contributions of channels, whose number and informativeness differ among patients. Due to class imbalance in the data, the synthetic minority over-sampling technique (SMOTE) is applied. The resulting method yields a computationally and conceptually simple, interpretable model of EEG signals of preictal and interictal states, which shows good performance for the task of seizure prediction on two datasets (EPILEPSIAE and the public Epilepsyecosystem dataset).
The selected explanatory variables.
plos.figshare.com
xls
Updated Jun 21, 2023
+ more versions
Seyed Iman Mohammadpour; Majid Khedmati; Mohammad Javad Hassan Zada (2023). The selected explanatory variables. [Dataset]. http://doi.org/10.1371/journal.pone.0281901.t002
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0281901.t002
Dataset updated
Jun 21, 2023
Dataset provided by
PLOS ONE
Authors
Seyed Iman Mohammadpour; Majid Khedmati; Mohammad Javad Hassan Zada
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
While the cost of road traffic fatalities in the U.S. surpasses $240 billion a year, the availability of high-resolution datasets allows meticulous investigation of the factors contributing to crash severity. In this paper, the dataset for Trucks Involved in Fatal Accidents in 2010 (TIFA 2010) is used to classify truck-involved crash severity in the presence of several issues, including missing values, imbalanced classes, and high dimensionality. First, a decision tree-based algorithm, the Synthetic Minority Oversampling Technique (SMOTE), and the Random Forest (RF) feature importance approach are employed for missing value imputation, minority class oversampling, and dimensionality reduction, respectively. Afterward, a variety of classification algorithms, including RF, K-Nearest Neighbors (KNN), Multi-Layer Perceptron (MLP), Gradient-Boosted Decision Trees (GBDT), and Support Vector Machine (SVM), are developed to reveal the influence of the introduced data preprocessing framework on the output quality of ML classifiers. The results show that the GBDT model outperforms all the other competing algorithms for the non-preprocessed crash data based on the G-mean performance measure, but the RF makes the most accurate prediction for the treated dataset. This finding indicates that after feature selection is conducted to alleviate the computational cost of the machine learning algorithms, bagging (bootstrap aggregating) of decision trees in RF leads to a better model than boosting them via GBDT. Besides, the adopted feature importance approach decreases the overall accuracy by only up to 5% in most of the estimated models. Moreover, the worst class recall value of the RF algorithm without prior oversampling is only 34.4%, compared to the corresponding value of 90.3% in the up-sampled model, which validates the proposed multi-step preprocessing scheme.
This study also identifies the temporal and spatial (roadway) attributes, as well as crash characteristics, and Emergency Medical Service (EMS) as the most critical factors in truck crash severity.
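The description above reports a G-mean measure and a worst-class recall. The sketch below shows one common way to compute both, assuming G-mean is the geometric mean of the per-class recalls; the paper's exact definition is not given here, and the helper names and toy labels are hypothetical.

```python
import math


def per_class_recall(y_true, y_pred):
    """Recall for each class label that occurs in y_true."""
    recalls = {}
    for label in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == label)
        hits = sum(
            1 for t, p in zip(y_true, y_pred) if t == label and p == label
        )
        recalls[label] = hits / total
    return recalls


def g_mean(recalls):
    """Geometric mean of per-class recalls (assumed definition)."""
    values = list(recalls.values())
    return math.prod(values) ** (1 / len(values))


# Toy imbalanced example: 8 majority samples (0) and 2 minority samples (1).
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [1, 0]

recalls = per_class_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
worst_recall = min(recalls.values())
score = g_mean(recalls)
print(accuracy, worst_recall, round(score, 4))
```

In this toy case accuracy is high while the minority-class recall is only one half, which is the kind of gap that a G-mean or worst-class recall is meant to expose.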
Results of Bioassay 456 dataset in experiment 2.
plos.figshare.com
xls
Updated Jun 18, 2023
Jinyan Li; Lian-sheng Liu; Simon Fong; Raymond K. Wong; Sabah Mohammed; Jinan Fiaidhi; Yunsick Sung; Kelvin K. L. Wong (2023). Results of Bioassay 456 dataset in experiment 2. [Dataset]. http://doi.org/10.1371/journal.pone.0180830.t009
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0180830.t009
Dataset updated
Jun 18, 2023
Dataset provided by
PLOS ONE
Authors
Jinyan Li; Lian-sheng Liu; Simon Fong; Raymond K. Wong; Sabah Mohammed; Jinan Fiaidhi; Yunsick Sung; Kelvin K. L. Wong
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Results of Bioassay 456 dataset in experiment 2.
Results of Bioassay 362 dataset in experiment 2.
plos.figshare.com
figshare.com
xls
Updated May 30, 2023
+ more versions
Jinyan Li; Lian-sheng Liu; Simon Fong; Raymond K. Wong; Sabah Mohammed; Jinan Fiaidhi; Yunsick Sung; Kelvin K. L. Wong (2023). Results of Bioassay 362 dataset in experiment 2. [Dataset]. http://doi.org/10.1371/journal.pone.0180830.t010
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0180830.t010
Dataset updated
May 30, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Jinyan Li; Lian-sheng Liu; Simon Fong; Raymond K. Wong; Sabah Mohammed; Jinan Fiaidhi; Yunsick Sung; Kelvin K. L. Wong
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Results of Bioassay 362 dataset in experiment 2.
Results of bioassay 1284 dataset in experiment 1.
figshare.com
xls
Updated Jun 1, 2023
Jinyan Li; Lian-sheng Liu; Simon Fong; Raymond K. Wong; Sabah Mohammed; Jinan Fiaidhi; Yunsick Sung; Kelvin K. L. Wong (2023). Results of bioassay 1284 dataset in experiment 1. [Dataset]. http://doi.org/10.1371/journal.pone.0180830.t006
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0180830.t006
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS ONE
Authors
Jinyan Li; Lian-sheng Liu; Simon Fong; Raymond K. Wong; Sabah Mohammed; Jinan Fiaidhi; Yunsick Sung; Kelvin K. L. Wong
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Results of bioassay 1284 dataset in experiment 1.
Results of Bioassay 1608 dataset in experiment 2.
plos.figshare.com
xls
Updated May 31, 2023
Jinyan Li; Lian-sheng Liu; Simon Fong; Raymond K. Wong; Sabah Mohammed; Jinan Fiaidhi; Yunsick Sung; Kelvin K. L. Wong (2023). Results of Bioassay 1608 dataset in experiment 2. [Dataset]. http://doi.org/10.1371/journal.pone.0180830.t011
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0180830.t011
Dataset updated
May 31, 2023
Dataset provided by
PLOS ONE
Authors
Jinyan Li; Lian-sheng Liu; Simon Fong; Raymond K. Wong; Sabah Mohammed; Jinan Fiaidhi; Yunsick Sung; Kelvin K. L. Wong
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Results of Bioassay 1608 dataset in experiment 2.
Ranking of the dataset attributes based on their Information Gain (IG).
plos.figshare.com
xls
Updated Jun 3, 2023
Ranking of the dataset attributes based on their Information Gain (IG). [Dataset]. https://plos.figshare.com/articles/dataset/Ranking_of_the_dataset_attributes_based_on_their_Information_Gain_IG_/5236687
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0179805.t001
Dataset updated
Jun 3, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Manal Alghamdi; Mouaz Al-Mallah; Steven Keteyian; Clinton Brawner; Jonathan Ehrman; Sherif Sakr
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Ranking of the dataset attributes based on their Information Gain (IG).
Results of Bioassay 746 dataset in experiment 2.
plos.figshare.com
figshare.com
xls
Updated Jun 2, 2023
Jinyan Li; Lian-sheng Liu; Simon Fong; Raymond K. Wong; Sabah Mohammed; Jinan Fiaidhi; Yunsick Sung; Kelvin K. L. Wong (2023). Results of Bioassay 746 dataset in experiment 2. [Dataset]. http://doi.org/10.1371/journal.pone.0180830.t014
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0180830.t014
Dataset updated
Jun 2, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Jinyan Li; Lian-sheng Liu; Simon Fong; Raymond K. Wong; Sabah Mohammed; Jinan Fiaidhi; Yunsick Sung; Kelvin K. L. Wong
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Results of Bioassay 746 dataset in experiment 2.
The environment and parmeters of PSO and BA.
plos.figshare.com
xls
Updated Jun 1, 2023
Jinyan Li; Lian-sheng Liu; Simon Fong; Raymond K. Wong; Sabah Mohammed; Jinan Fiaidhi; Yunsick Sung; Kelvin K. L. Wong (2023). The environment and parmeters of PSO and BA. [Dataset]. http://doi.org/10.1371/journal.pone.0180830.t001
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0180830.t001
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS ONE
Authors
Jinyan Li; Lian-sheng Liu; Simon Fong; Raymond K. Wong; Sabah Mohammed; Jinan Fiaidhi; Yunsick Sung; Kelvin K. L. Wong
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The environment and parmeters of PSO and BA.
Data set presentation.
figshare.com
plos.figshare.com
xls
Updated Sep 30, 2024
+ more versions
Wenguang Li; Yan Peng; Ke Peng (2024). Data set presentation. [Dataset]. http://doi.org/10.1371/journal.pone.0311222.t001
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0311222.t001
Dataset updated
Sep 30, 2024
Dataset provided by
PLOS ONE
Authors
Wenguang Li; Yan Peng; Ke Peng
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Diabetes, as an incurable lifelong chronic disease, has profound and far-reaching effects on patients. Given this, early intervention is particularly crucial, as it can not only significantly improve the prognosis of patients but also provide valuable reference information for clinical treatment. This study selected the BRFSS (Behavioral Risk Factor Surveillance System) dataset, which is publicly available on the Kaggle platform, as the research object, aiming to provide a scientific basis for the early diagnosis and treatment of diabetes through advanced machine learning techniques. Firstly, the dataset was balanced using various sampling methods; secondly, a Stacking model based on GA-XGBoost (XGBoost model optimized by genetic algorithm) was constructed for the risk prediction of diabetes; finally, the interpretability of the model was deeply analyzed using Shapley values. The results show: (1) Random oversampling, ADASYN, SMOTE, and SMOTEENN were used for data balance processing, among which SMOTEENN showed better efficiency and effect in dealing with data imbalance. (2) The GA-XGBoost model optimized the hyperparameters of the XGBoost model through a genetic algorithm to improve the model’s predictive accuracy. Combined with the better-performing LightGBM model and random forest model, a two-layer Stacking model was constructed. This model not only outperforms single machine learning models in predictive effect but also provides a new idea and method in the field of model integration. (3) Shapley value analysis identified features that have a significant impact on the prediction of diabetes, such as age and body mass index. This analysis not only enhances the transparency of the model but also provides more precise treatment decision support for doctors and patients. 
In summary, this study has not only improved the accuracy of predicting the risk of diabetes by adopting advanced machine learning techniques and model integration strategies but also provided a powerful tool for the early diagnosis and personalized treatment of diabetes.
