100+ datasets found

A comparison of the RN-SMOTE, SMOTE-Tomek Link, SMOTE-ENN, and the proposed...
plos.figshare.com
xls
Updated Feb 10, 2025
Cite
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). A comparison of the RN-SMOTE, SMOTE-Tomek Link, SMOTE-ENN, and the proposed 1CRN-SMOTE methods on the Blood and Health-risk datasets is presented, based on various classification metrics using the Random Forest classifier. [Dataset]. http://doi.org/10.1371/journal.pone.0317396.t008
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0317396.t008
Dataset updated
Feb 10, 2025
Dataset provided by
PLOS (http://plos.org/)
Authors
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A comparison of the RN-SMOTE, SMOTE-Tomek Link, SMOTE-ENN, and the proposed 1CRN-SMOTE methods on the Blood and Health-risk datasets is presented, based on various classification metrics using the Random Forest classifier.
Data from: S1 Datasets -
plos.figshare.com
bin
Updated Feb 10, 2025
Cite
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). S1 Datasets - [Dataset]. http://doi.org/10.1371/journal.pone.0317396.s001
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.1371/journal.pone.0317396.s001
Dataset updated
Feb 10, 2025
Dataset provided by
PLOS ONE
Authors
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In recent years, the challenge of imbalanced data has become increasingly prominent in machine learning, affecting the performance of classification algorithms. This study proposes a novel data-level oversampling method called Cluster-Based Reduced Noise SMOTE (CRN-SMOTE) to address this issue. CRN-SMOTE combines SMOTE for oversampling minority classes with a novel cluster-based noise reduction technique. In this cluster-based noise reduction approach, it is crucial that samples from each category form one or two clusters, a feature that conventional noise reduction methods do not achieve. The proposed method is evaluated on four imbalanced datasets (ILPD, QSAR, Blood, and Maternal Health Risk) using five metrics: Cohen’s kappa, Matthews correlation coefficient (MCC), F1-score, precision, and recall. Results demonstrate that CRN-SMOTE consistently outperformed the state-of-the-art Reduced Noise SMOTE (RN-SMOTE), SMOTE-Tomek Link, and SMOTE-ENN methods across all datasets, with particularly notable improvements observed in the QSAR and Maternal Health Risk datasets, indicating its effectiveness in enhancing imbalanced classification performance. Overall, the experimental findings indicate that CRN-SMOTE outperformed RN-SMOTE in 100% of the cases, achieving average improvements of 6.6% in kappa, 4.01% in MCC, 1.87% in F1-score, 1.7% in precision, and 2.05% in recall, with SMOTE’s number of neighbors set to 5.
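As a rough illustration of the five metrics named in this description, the sketch below computes them from the four cells of a binary confusion matrix. The function name and the example counts are made up for illustration and are not taken from the study.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1, MCC and Cohen's kappa from confusion counts."""
    n = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # Matthews correlation coefficient; defined as 0 here when a margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = (tp + tn) / n
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (p_observed - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0

    return {"precision": precision, "recall": recall, "f1": f1,
            "mcc": mcc, "kappa": kappa}


# Hypothetical counts: 60 minority and 140 majority samples.
print(binary_metrics(tp=40, fp=10, fn=20, tn=130))
# A classifier that always predicts the majority class scores 0 on every metric
# here, even though its plain accuracy would be 140 / 200 = 0.7.
print(binary_metrics(tp=0, fp=0, fn=60, tn=140))
```

The second call hints at why such metrics tend to be preferred over plain accuracy on imbalanced data: kappa and MCC stay at zero for a majority-only predictor.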
Animals Imbalance + Smote
kaggle.com
Updated Nov 21, 2021
Cite
Taro_pan (2021). Animals Imbalance + Smote [Dataset]. https://www.kaggle.com/stgkrtua/animals-imbalance-smote/activity
Explore at:
Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 21, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Taro_pan
Description
Dataset

This dataset was created by Taro_pan

Contents
Citation Trends for "Medical data classification scheme based on hybridized...
shibatadb.com
Updated Apr 15, 2017
Cite
Yubetsu (2017). Citation Trends for "Medical data classification scheme based on hybridized SMOTE technique (HST) and Rough Set technique (RST)" [Dataset]. https://www.shibatadb.com/article/6SEJEsMa
Explore at:
Dataset updated
Apr 15, 2017
Dataset authored and provided by
Yubetsu
License
https://www.shibatadb.com/license/data/proprietary/v1.0/license.txt
Time period covered
2019 - 2025
Variables measured
New Citations per Year
Description
Yearly citation counts for the publication titled "Medical data classification scheme based on hybridized SMOTE technique (HST) and Rough Set technique (RST)".
Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, W Philip Kegelmeyer...
service.tib.eu
Updated Dec 3, 2024
Cite
Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, W Philip Kegelmeyer (2024). Dataset: SMOTE: Synthetic Minority Over-Sampling Technique. https://doi.org/10.57702/tq0zp0i3 [Dataset]. https://service.tib.eu/ldmservice/dataset/smote--synthetic-minority-over-sampling-technique
Explore at:
Dataset updated
Dec 3, 2024
Description
SMOTE: synthetic minority over-sampling technique.
Data from: High impact bug report identification with imbalanced learning...
researchdata.smu.edu.sg
zip
Updated Jun 1, 2023
Cite
YANG Xinli; David LO; Xin XIA; Qiao HUANG; Jianling SUN (2023). Data from: High impact bug report identification with imbalanced learning strategies [Dataset]. http://doi.org/10.25440/smu.12062763.v1
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.25440/smu.12062763.v1
Dataset updated
Jun 1, 2023
Dataset provided by
SMU Research Data Repository (RDR)
Authors
YANG Xinli; David LO; Xin XIA; Qiao HUANG; Jianling SUN
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This record contains the underlying research data for the publication "High impact bug report identification with imbalanced learning strategies" and the full-text is available from: https://ink.library.smu.edu.sg/sis_research/3702 In practice, some bugs have more impact than others and thus deserve more immediate attention. Due to tight schedules and limited human resources, developers may not have enough time to inspect all bugs. Thus, they often concentrate on bugs that are highly impactful. In the literature, high-impact bugs are used to refer to the bugs which appear at unexpected times or locations and bring more unexpected effects (i.e., surprise bugs), or break pre-existing functionalities and destroy the user experience (i.e., breakage bugs). Unfortunately, identifying high-impact bugs from thousands of bug reports in a bug tracking system is not an easy feat. Thus, an automated technique that can identify high-impact bug reports can help developers to be aware of them early, rectify them quickly, and minimize the damages they cause. Considering that only a small proportion of bugs are high-impact bugs, the identification of high-impact bug reports is a difficult task. In this paper, we propose an approach to identify high-impact bug reports by leveraging imbalanced learning strategies. We investigate the effectiveness of various variants, each of which combines one particular imbalanced learning strategy and one particular classification algorithm. In particular, we choose four widely used strategies for dealing with imbalanced data and four state-of-the-art text classification algorithms to conduct experiments on four datasets from four different open source projects. We mainly perform an analytical study on two types of high-impact bugs, i.e., surprise bugs and breakage bugs.
The results show that different variants have different performances, and the best performing variants SMOTE (synthetic minority over-sampling technique) + KNN (K-nearest neighbours) for surprise bug identification and RUS (random under-sampling) + NB (naive Bayes) for breakage bug identification outperform the F1-scores of the two state-of-the-art approaches by Thung et al. and Garcia and Shihab. Supplementary code and data available from GitHub:
Adaptive Swarm Balancing Algorithms for rare-event prediction in imbalanced...
plos.figshare.com
txt
Updated Jun 18, 2023
Cite
Jinyan Li; Lian-sheng Liu; Simon Fong; Raymond K. Wong; Sabah Mohammed; Jinan Fiaidhi; Yunsick Sung; Kelvin K. L. Wong (2023). Adaptive Swarm Balancing Algorithms for rare-event prediction in imbalanced healthcare data [Dataset]. http://doi.org/10.1371/journal.pone.0180830
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.1371/journal.pone.0180830
Dataset updated
Jun 18, 2023
Dataset provided by
PLOS ONE
Authors
Jinyan Li; Lian-sheng Liu; Simon Fong; Raymond K. Wong; Sabah Mohammed; Jinan Fiaidhi; Yunsick Sung; Kelvin K. L. Wong
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Clinical data analysis and forecasting have made substantial contributions to disease control, prevention and detection. However, such data usually suffer from highly imbalanced samples in class distributions. In this paper, we aim to formulate effective methods to rebalance binary imbalanced datasets, where the positive samples make up only the minority. We investigate two different meta-heuristic algorithms, particle swarm optimization and bat algorithm, and apply them to enhance the effects of the synthetic minority over-sampling technique (SMOTE) for pre-processing the datasets. One approach is to process the full dataset as a whole. The other is to split up the dataset and adaptively process it one segment at a time. The experimental results reported in this paper reveal that the performance improvements obtained by the former approach do not scale to larger datasets. The latter methods, which we call Adaptive Swarm Balancing Algorithms, lead to significant efficiency and effectiveness improvements on large datasets while the first method is invalid. We also find it more consistent with the practice of the typical large imbalanced medical datasets. We further use the meta-heuristic algorithms to optimize two key parameters of SMOTE. The proposed methods lead to more credible classifier performance and a shorter run time compared to the brute-force method.
Summary table: Oversampling techniques using SMOTE, ADASYN, and weighted...
plos.figshare.com
xls
Updated Nov 16, 2023
Cite
Alaa Alomari; Hossam Faris; Pedro A. Castillo (2023). Summary table: Oversampling techniques using SMOTE, ADASYN, and weighted rare classes. [Dataset]. http://doi.org/10.1371/journal.pone.0290581.t007
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0290581.t007
Dataset updated
Nov 16, 2023
Dataset provided by
PLOS ONE
Authors
Alaa Alomari; Hossam Faris; Pedro A. Castillo
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Summary table: Oversampling techniques using SMOTE, ADASYN, and weighted rare classes.
Synthetic oversampling for credit card default prediction
data.mendeley.com
Updated Mar 8, 2023
Cite
Fransiscus Pratikto (2023). Synthetic oversampling for credit card default prediction [Dataset]. http://doi.org/10.17632/jrss9jdjz9.1
Explore at:
Unique identifier
https://doi.org/10.17632/jrss9jdjz9.1
Dataset updated
Mar 8, 2023
Authors
Fransiscus Pratikto
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains more than 17,000 records of credit card holders with 20 predictor variables and 1 binary target variable. The corresponding R code for comparing several proposed (density-based) and existing (SMOTE-based) synthetic oversampling methods is also provided.
The selected explanatory variables.
plos.figshare.com
xls
Updated Jun 21, 2023
Cite
Seyed Iman Mohammadpour; Majid Khedmati; Mohammad Javad Hassan Zada (2023). The selected explanatory variables. [Dataset]. http://doi.org/10.1371/journal.pone.0281901.t002
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0281901.t002
Dataset updated
Jun 21, 2023
Dataset provided by
PLOS ONE
Authors
Seyed Iman Mohammadpour; Majid Khedmati; Mohammad Javad Hassan Zada
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
While the cost of road traffic fatalities in the U.S. surpasses $240 billion a year, the availability of high-resolution datasets allows meticulous investigation of the contributing factors to crash severity. In this paper, the dataset for Trucks Involved in Fatal Accidents in 2010 (TIFA 2010) is utilized to classify the truck-involved crash severity where there exist different issues including missing values, imbalanced classes, and high dimensionality. First, a decision tree-based algorithm, the Synthetic Minority Oversampling Technique (SMOTE), and the Random Forest (RF) feature importance approach are employed for missing value imputation, minority class oversampling, and dimensionality reduction, respectively. Afterward, a variety of classification algorithms, including RF, K-Nearest Neighbors (KNN), Multi-Layer Perceptron (MLP), Gradient-Boosted Decision Trees (GBDT), and Support Vector Machine (SVM) are developed to reveal the influence of the introduced data preprocessing framework on the output quality of ML classifiers. The results show that the GBDT model outperforms all the other competing algorithms for the non-preprocessed crash data based on the G-mean performance measure, but the RF makes the most accurate prediction for the treated dataset. This finding indicates that after the feature selection is conducted to alleviate the computational cost of the machine learning algorithms, bagging (bootstrap aggregating) of decision trees in RF leads to a better model rather than boosting them via GBDT. Besides, the adopted feature importance approach decreases the overall accuracy by only up to 5% in most of the estimated models. Moreover, the worst class recall value of the RF algorithm without prior oversampling is only 34.4% compared to the corresponding value of 90.3% in the up-sampled model which validates the proposed multi-step preprocessing scheme. 
This study also identifies the temporal and spatial (roadway) attributes, as well as crash characteristics, and Emergency Medical Service (EMS) as the most critical factors in truck crash severity.
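The description above relies on the G-mean measure and on the worst per-class recall without defining them. The sketch below uses the common definition of G-mean as the geometric mean of per-class recalls; whether the study uses exactly this form is an assumption, and the labels and function names are invented for illustration.

```python
import math


def per_class_recall(y_true, y_pred):
    """Recall of each class: correctly predicted samples / samples of that class."""
    recalls = {}
    for label in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == label)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        recalls[label] = hits / total
    return recalls


def g_mean(recalls):
    """Geometric mean of the per-class recalls (assumed definition)."""
    values = list(recalls.values())
    return math.prod(values) ** (1 / len(values))


# Hypothetical imbalanced labels: eight samples of class 0, two of class 1.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]  # one of the two minority samples is missed

recalls = per_class_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(recalls, min(recalls.values()), g_mean(recalls), accuracy)
```

In this toy case accuracy is 0.9 while the worst class recall is only 0.5, which is the kind of gap the G-mean is meant to expose.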
Area ratio and historical landslide numbers in different susceptibility...
datasetcatalog.nlm.nih.gov
Updated May 21, 2025
Cite
Gao, Jia-jun; Xu, Hui; Mao, Jun; Li, Kun-lun; Lv, Ming-zhou; Cai, Jia-zeng (2025). Area ratio and historical landslide numbers in different susceptibility categories for different models using the SMOTE-Tomek sampling method. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0002091040
Explore at:
Dataset updated
May 21, 2025
Authors
Gao, Jia-jun; Xu, Hui; Mao, Jun; Li, Kun-lun; Lv, Ming-zhou; Cai, Jia-zeng
Description
Area ratio and historical landslide numbers in different susceptibility categories for different models using the SMOTE-Tomek sampling method.
Area ratio and historical landslide numbers in different susceptibility...
plos.figshare.com
datasetcatalog.nlm.nih.gov
xls
Updated May 21, 2025
Cite
Ming-zhou Lv; Kun-lun Li; Jia-zeng Cai; Jun Mao; Jia-jun Gao; Hui Xu (2025). Area ratio and historical landslide numbers in different susceptibility categories for different models using the SMOTE sampling method. [Dataset]. http://doi.org/10.1371/journal.pone.0323487.t007
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0323487.t007
Dataset updated
May 21, 2025
Dataset provided by
PLOS (http://plos.org/)
Authors
Ming-zhou Lv; Kun-lun Li; Jia-zeng Cai; Jun Mao; Jia-jun Gao; Hui Xu
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Area ratio and historical landslide numbers in different susceptibility categories for different models using the SMOTE sampling method.
Comparison of model evaluation indicators.
plos.figshare.com
datasetcatalog.nlm.nih.gov
xls
Updated May 21, 2025
Cite
Ming-zhou Lv; Kun-lun Li; Jia-zeng Cai; Jun Mao; Jia-jun Gao; Hui Xu (2025). Comparison of model evaluation indicators. [Dataset]. http://doi.org/10.1371/journal.pone.0323487.t005
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0323487.t005
Dataset updated
May 21, 2025
Dataset provided by
PLOS ONE
Authors
Ming-zhou Lv; Kun-lun Li; Jia-zeng Cai; Jun Mao; Jia-jun Gao; Hui Xu
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Landslides are frequent and hazardous geological disasters, posing significant risks to human safety and infrastructure. Accurate assessments of landslide susceptibility are crucial for risk management and mitigation. However, geological surveys of landslide areas are typically conducted at the township level, have low sample sizes, and rely on experience. This study proposes a framework for assessing landslide susceptibility in Taiping Township, Zhejiang Province, China, using data balancing, machine learning, and data from 1,325 slope units with nine slope characteristics. The dataset was balanced using the Synthetic Minority Oversampling Technique and the Tomek link undersampling method (SMOTE-Tomek). A comparative analysis of six machine learning models was performed, and the SHapley Additive exPlanation (SHAP) method was used to assess the influencing factors. The results indicate that the machine learning algorithms provide high accuracy, and the random forest (RF) algorithm achieves the optimum model accuracy (0.791, F1 = 0.723). The very low, low, medium, and high sensitivity zones account for 92.27%, 5.12%, 1.78%, and 0.83% of the area, respectively. The height of cut slopes has the most significant impact on landslide sensitivity, whereas the altitude has a minor impact. The proposed model accurately assesses landslide susceptibility at the township scale, providing valuable insights for risk management and mitigation.
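The Tomek link step mentioned above can be sketched as follows. A Tomek link is a pair of samples from different classes that are each other's nearest neighbour. This is a minimal pure-Python illustration, not the study's code: the toy data, the function names, and the choice to drop both members of each link (some variants drop only the majority-class member) are assumptions.

```python
def nearest_index(i, points):
    """Index of the point closest to points[i] (squared Euclidean distance)."""
    best, best_dist = None, float("inf")
    for j, other in enumerate(points):
        if j == i:
            continue
        dist = sum((a - b) ** 2 for a, b in zip(points[i], other))
        if dist < best_dist:
            best, best_dist = j, dist
    return best


def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbours
    and carry different class labels."""
    links = []
    for i in range(len(points)):
        j = nearest_index(i, points)
        if (j is not None and j > i and labels[i] != labels[j]
                and nearest_index(j, points) == i):
            links.append((i, j))
    return links


def remove_tomek_links(points, labels):
    """Drop both members of every Tomek link (one possible cleaning variant)."""
    drop = {index for pair in tomek_links(points, labels) for index in pair}
    kept = [(p, l) for k, (p, l) in enumerate(zip(points, labels)) if k not in drop]
    return [p for p, _ in kept], [l for _, l in kept]


# Toy one-dimensional data: the points at 2.0 and 2.2 straddle the class boundary.
points = [[0.0], [1.1], [2.0], [2.2], [5.0], [6.0]]
labels = [0, 0, 0, 1, 1, 1]
print(tomek_links(points, labels))
print(remove_tomek_links(points, labels))
```

In a SMOTE-Tomek pipeline this cleaning would typically run after the oversampling step, so that synthetic points sitting on the class boundary are also candidates for removal.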
Citation Network Graph
shibatadb.com
Updated Apr 15, 2017
Cite
Yubetsu (2017). Citation Network Graph [Dataset]. https://www.shibatadb.com/article/6SEJEsMa
Explore at:
Dataset updated
Apr 15, 2017
Dataset authored and provided by
Yubetsu
License
https://www.shibatadb.com/license/data/proprietary/v1.0/license.txt
Description
Network of 43 papers and 62 citation links related to "Medical data classification scheme based on hybridized SMOTE technique (HST) and Rough Set technique (RST)".
A comparative analysis of earlier studies.
plos.figshare.com
xls
Updated Jan 18, 2024
Cite
Praveen Talari; Bharathiraja N; Gaganpreet Kaur; Hani Alshahrani; Mana Saleh Al Reshan; Adel Sulaiman; Asadullah Shaikh (2024). A comparative analysis of earlier studies. [Dataset]. http://doi.org/10.1371/journal.pone.0292100.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0292100.t001
Dataset updated
Jan 18, 2024
Dataset provided by
PLOS ONE
Authors
Praveen Talari; Bharathiraja N; Gaganpreet Kaur; Hani Alshahrani; Mana Saleh Al Reshan; Adel Sulaiman; Asadullah Shaikh
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Diabetes prediction is an ongoing research topic in which medical specialists are attempting to forecast the condition with greater precision. Diabetes often remains dormant, and if patients are diagnosed with another illness, such as damage to the kidney vessels, problems with the retina of the eye, or a heart condition, it can cause metabolic problems and various complications in the body. Various ensemble learning techniques, including voting, boosting, and bagging, have been applied in this study. The Synthetic Minority Oversampling Technique (SMOTE), along with the K-fold cross-validation approach, was used to achieve class balancing and validate the findings. The Pima Indian Diabetes (PID) dataset, obtained from the UCI Machine Learning (UCI ML) repository, was chosen for this study. A feature engineering technique was used to calculate the influence of lifestyle factors. A two-phase classification model has been developed to predict insulin resistance using the Sequential Minimal Optimisation (SMO) and SMOTE approaches together. The SMOTE technique is used to preprocess data in the model’s first phase, while SMO classes are used in the second phase. All other classification techniques were outperformed by bagging decision trees in terms of Misclassification Error rate, Accuracy, Specificity, Precision, Recall, F1 measures, and ROC curve. The model was created using a combined SMOTE and SMO strategy, which achieved 99.07% correction with 0.1 ms of runtime. The suggested system’s result is to enhance the classifier’s performance in spotting illness early.
Data from: Dataset for classification of signaling proteins based on...
portalinvestigacion.udc.gal
portalcientifico.sergas.es
Updated 2015
Cite
Fernandez-Lozano, Carlos; Munteanu, Cristian Robert (2015). Dataset for classification of signaling proteins based on molecular star graph descriptors using machine-learning models [Dataset]. https://portalinvestigacion.udc.gal/documentos/668fc447b9e7c03b01bd8975
Explore at:
Dataset updated
2015
Authors
Fernandez-Lozano, Carlos; Munteanu, Cristian Robert
Description
The positive group of 608 signaling protein sequences was downloaded in FASTA format from the Protein Databank (Berman et al., 2000) by using the “Molecular Function Browser” in the “Advanced Search Interface” (“Signaling (GO ID23052)”, protein identity cut-off = 30%). The negative group of 2077 non-signaling proteins was downloaded as the PISCES CulledPDB (http://dunbrack.fccc.edu/PISCES.php) (Wang & R. L. Dunbrack, 2003) (November 19th, 2012) using identity (degree of correspondence between two sequences) less than 20%, resolution of 1.6 Å and R-factor 0.25. The full dataset contains 2685 FASTA sequences of protein chains from the PDB databank: 608 are signaling proteins and 2077 are non-signaling peptides. This kind of unbalanced data is not the most suitable input for learning algorithms because the results would present a high sensitivity and low specificity; learning algorithms would tend to classify most samples as part of the most common group. To avoid this situation, a pre-processing stage is needed in order to get a more balanced dataset, in this case by means of the synthetic minority oversampling technique (SMOTE). In short, SMOTE provides a more balanced dataset by expanding the smaller class with new samples created by interpolating between existing minority-class samples. After this pre-processing, the final dataset is composed of 1824 positive samples (signaling protein chains) and 2432 negative cases (non-signaling protein chains). Paper is available at: http://dx.doi.org/10.1016/j.jtbi.2015.07.038 Please cite: Carlos Fernandez-Lozano, Rubén F. Cuiñas, José A. Seoane, Enrique Fernández-Blanco, Julian Dorado, Cristian R. Munteanu, Classification of signaling proteins based on molecular star graph descriptors using Machine Learning models, Journal of Theoretical Biology, Volume 384, 7 November 2015, Pages 50-58, ISSN 0022-5193, http://dx.doi.org/10.1016/j.jtbi.2015.07.038. (http://www.sciencedirect.com/science/article/pii/S0022519315003999)
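The interpolation idea described above can be sketched in a few lines. This is a simplified pure-Python illustration of SMOTE-style sampling, not the code used to build the dataset: the function name, the random choice of base samples, and the toy points are assumptions. Going from 608 to 1824 positive samples, as reported, would correspond to 1216 synthetic samples.

```python
import random


def smote_like(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each on the line segment between a
    randomly chosen minority sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Sort the other minority samples by squared distance to the base sample.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy minority class: the four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for point in smote_like(minority, n_new=3, k=2, seed=1):
    print(point)
```

Because every synthetic point lies between two existing minority samples, the new points stay inside the region already covered by the minority class rather than being exact duplicates.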
Performance of machine learning models using SMOTE-balanced dataset.
plos.figshare.com
datasetcatalog.nlm.nih.gov
xls
Updated Nov 8, 2023
Cite
Nihal Abuzinadah; Muhammad Umer; Abid Ishaq; Abdullah Al Hejaili; Shtwai Alsubai; Ala’ Abdulmajid Eshmawi; Abdullah Mohamed; Imran Ashraf (2023). Performance of machine learning models using SMOTE-balanced dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0293061.t004
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0293061.t004
Dataset updated
Nov 8, 2023
Dataset provided by
PLOS ONE
Authors
Nihal Abuzinadah; Muhammad Umer; Abid Ishaq; Abdullah Al Hejaili; Shtwai Alsubai; Ala’ Abdulmajid Eshmawi; Abdullah Mohamed; Imran Ashraf
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Performance of machine learning models using SMOTE-balanced dataset.
Data from: Skin Cancer Diagnostics with an All-Inclusive Smartphone...
search.dataone.org
dataverse.harvard.edu
Updated Nov 8, 2023
Cite
Pandey, Santosh (2023). Skin Cancer Diagnostics with an All-Inclusive Smartphone Application [Dataset]. http://doi.org/10.7910/DVN/HUQK9R
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/HUQK9R
Dataset updated
Nov 8, 2023
Dataset provided by
Harvard Dataverse
Authors
Pandey, Santosh
Description
Among the different types of skin cancer, melanoma is considered to be the deadliest and is difficult to treat at advanced stages. Detection of melanoma at earlier stages can lead to reduced mortality rates. Desktop-based computer-aided systems have been developed to assist dermatologists with early diagnosis. However, there is significant interest in developing portable, at-home melanoma diagnostic systems which can assess the risk of cancerous skin lesions. Here, we present a smartphone application that combines image capture capabilities with preprocessing and segmentation to extract the Asymmetry, Border irregularity, Color variegation, and Diameter (ABCD) features of a skin lesion. Using the feature sets, classification of malignancy is achieved through support vector machine classifiers. By using adaptive algorithms in the individual data-processing stages, our approach is made computationally light, user friendly, and reliable in discriminating melanoma cases from benign ones. Images of skin lesions are either captured with the smartphone camera or imported from public datasets. The entire process from image capture to classification runs on an Android smartphone equipped with a detachable 10x lens, and processes an image in less than a second. The overall performance metrics are evaluated on a public database of 200 images with Synthetic Minority Over-sampling Technique (SMOTE) (80% sensitivity, 90% specificity, 88% accuracy, and 0.85 area under curve (AUC)) and without SMOTE (55% sensitivity, 95% specificity, 90% accuracy, and 0.75 AUC). The evaluated performance metrics and computation times are comparable or better than previous methods. This all-inclusive smartphone application is designed to be easy-to-download and easy-to-navigate for the end user, which is imperative for the eventual democratization of such medical diagnostic systems.
Performance of models using CNN features.
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Nov 8, 2023
Cite
Umer, Muhammad; Mohamed, Abdullah; Abuzinadah, Nihal; Ishaq, Abid; Eshmawi, Ala’ Abdulmajid; Alsubai, Shtwai; Ashraf, Imran; Al Hejaili, Abdullah (2023). Performance of models using CNN features. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000971153
Explore at:
Dataset updated
Nov 8, 2023
Authors
Umer, Muhammad; Mohamed, Abdullah; Abuzinadah, Nihal; Ishaq, Abid; Eshmawi, Ala’ Abdulmajid; Alsubai, Shtwai; Ashraf, Imran; Al Hejaili, Abdullah
Description
Predicting student performance automatically is of utmost importance, due to the substantial volume of data within educational databases. Educational data mining (EDM) devises techniques to uncover insights from data originating in educational settings. Artificial intelligence (AI) can mine educational data to predict student performance and provide measures to help students avoid failing and learn better. Learning platforms complement traditional learning settings by analyzing student performance, which can help reduce the chance of student failure. Existing methods for student performance prediction in educational data mining faced challenges such as limited accuracy, imbalanced data, and difficulties in feature engineering. These issues hindered effective adaptability and generalization across diverse educational contexts. This study proposes a machine learning-based system with deep convoluted features for the prediction of students’ academic performance. The proposed framework is employed to predict student academic performance using both imbalanced datasets and datasets balanced with the synthetic minority oversampling technique (SMOTE). In addition, performance is evaluated using the original and deep convoluted features. Experimental results indicate that the use of deep convoluted features provides improved prediction accuracy compared to original features. Results obtained using the extra tree classifier with convoluted features show the highest classification accuracy of 99.9%. In comparison with the state-of-the-art approaches, the proposed approach achieved higher performance. This research introduces a powerful AI-driven system for student performance prediction, offering substantial advancements in accuracy compared to existing approaches.
The generated graph dataset.
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Apr 25, 2025
Cite
Zhang, Yibo; Ru, Renxin; Ding, Cheng; Zhang, Jiasheng; Lan, Yao; Niu, Dongge (2025). The generated graph dataset. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0002050657
Explore at:
Dataset updated
Apr 25, 2025
Authors
Zhang, Yibo; Ru, Renxin; Ding, Cheng; Zhang, Jiasheng; Lan, Yao; Niu, Dongge
Description
Anesthesia plays a pivotal role in modern surgery by facilitating controlled states of unconsciousness. Precise control is crucial for safe and pain-free surgeries. Monitoring anesthesia depth accurately is essential to guide anesthesiologists, optimize drug usage, and mitigate postoperative complications. This study focuses on enhancing the classification performance of anesthesia-induced transitions between wakefulness and deep sleep into eight classes by leveraging an advanced graph neural network (GNN). The research combines seven datasets into a single dataset comprising 290 samples and investigates key brain regions to develop a robust classification framework. Initially, the dataset is augmented using the Synthetic Minority Over-sampling Technique (SMOTE) to expand the sample size to 1197. A graph-based approach is employed to capture the intricate relationships between features, constructing a graph dataset with 1197 nodes and 714,610 edges, where nodes represent data samples and edges are the connections between the nodes. The connection (edge weight) is calculated using the Spearman correlation coefficient matrix. An optimized GNN model is developed through an ablation study of eight hyperparameters, achieving an accuracy of 92.8%. The model’s performance is further evaluated against a one-dimensional (1D) CNN and six machine learning models, demonstrating superior classification capabilities for small and imbalanced datasets. Additionally, we evaluated the proposed model on six different anesthesia datasets, observing no decline in performance. This work advances the understanding and classification of anesthesia states, providing a valuable tool for improved anesthesia management.
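A graph of the kind described above, with samples as nodes and Spearman correlations as edge weights, might be built along the following lines. This is a hedged pure-Python sketch: the function names and toy samples are invented, and connecting every pair of samples is an assumption. The description does not say how edges were selected; a fully connected graph on 1197 nodes would have 1197 × 1196 / 2 = 715,806 edges, slightly more than the 714,610 reported.

```python
import math
from itertools import combinations


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average 1-based rank of the tied run
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y) if norm_x and norm_y else 0.0


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def correlation_graph(samples):
    """Weighted edge list {(i, j): rho} over all sample pairs with i < j."""
    return {(i, j): spearman(samples[i], samples[j])
            for i, j in combinations(range(len(samples)), 2)}


# Three toy samples with four features each.
samples = [[1, 2, 3, 4], [10, 20, 30, 40], [4, 3, 2, 1]]
for edge, weight in correlation_graph(samples).items():
    print(edge, round(weight, 3))
```

For real data of this size a vectorised implementation would be preferable; the nested loops here are only meant to make the construction explicit.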
