100+ datasets found

Adaptive Swarm Balancing Algorithms for rare-event prediction in imbalanced...
plos.figshare.com
txt
Updated Jun 18, 2023
Jinyan Li; Lian-sheng Liu; Simon Fong; Raymond K. Wong; Sabah Mohammed; Jinan Fiaidhi; Yunsick Sung; Kelvin K. L. Wong (2023). Adaptive Swarm Balancing Algorithms for rare-event prediction in imbalanced healthcare data [Dataset]. http://doi.org/10.1371/journal.pone.0180830
Available download formats: txt
Unique identifier
https://doi.org/10.1371/journal.pone.0180830
Dataset updated
Jun 18, 2023
Dataset provided by
PLOS ONE
Authors
Jinyan Li; Lian-sheng Liu; Simon Fong; Raymond K. Wong; Sabah Mohammed; Jinan Fiaidhi; Yunsick Sung; Kelvin K. L. Wong
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Clinical data analysis and forecasting have made substantial contributions to disease control, prevention and detection. However, such data usually suffer from highly imbalanced class distributions. In this paper, we aim to formulate effective methods to rebalance binary imbalanced datasets, where the positive samples make up only a minority. We investigate two different meta-heuristic algorithms, particle swarm optimization and the bat algorithm, and apply them to strengthen the effects of the synthetic minority over-sampling technique (SMOTE) for pre-processing the datasets. One approach is to process the full dataset as a whole. The other is to split up the dataset and adaptively process it one segment at a time. The experimental results reported in this paper reveal that the performance improvements obtained by the former approach do not scale to larger datasets. The latter methods, which we call Adaptive Swarm Balancing Algorithms, lead to significant efficiency and effectiveness improvements on large datasets, while the first approach is invalid. We also find them more consistent with the practice of typical large imbalanced medical datasets. We further use the meta-heuristic algorithms to optimize two key parameters of SMOTE. The proposed methods lead to more credible classifier performance and shorter run times compared to the brute-force method.
Data from: S1 Datasets -
plos.figshare.com
bin
Updated Feb 10, 2025
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). S1 Datasets - [Dataset]. http://doi.org/10.1371/journal.pone.0317396.s001
Available download formats: bin
Unique identifier
https://doi.org/10.1371/journal.pone.0317396.s001
Dataset updated
Feb 10, 2025
Dataset provided by
PLOS ONE
Authors
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In recent years, the challenge of imbalanced data has become increasingly prominent in machine learning, affecting the performance of classification algorithms. This study proposes a novel data-level oversampling method called Cluster-Based Reduced Noise SMOTE (CRN-SMOTE) to address this issue. CRN-SMOTE combines SMOTE for oversampling minority classes with a novel cluster-based noise reduction technique. In this cluster-based noise reduction approach, it is crucial that samples from each category form one or two clusters, a feature that conventional noise reduction methods do not achieve. The proposed method is evaluated on four imbalanced datasets (ILPD, QSAR, Blood, and Maternal Health Risk) using five metrics: Cohen's kappa, Matthews correlation coefficient (MCC), F1-score, precision, and recall. Results demonstrate that CRN-SMOTE consistently outperformed the state-of-the-art Reduced Noise SMOTE (RN-SMOTE), SMOTE-Tomek Link, and SMOTE-ENN methods across all datasets, with particularly notable improvements on the QSAR and Maternal Health Risk datasets, indicating its effectiveness in enhancing imbalanced classification performance. Overall, the experimental findings indicate that CRN-SMOTE outperformed RN-SMOTE in 100% of the cases, achieving average improvements of 6.6% in kappa, 4.01% in MCC, 1.87% in F1-score, 1.7% in precision, and 2.05% in recall, with SMOTE's number of neighbors set to 5.
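For readers less familiar with two of the metrics named above, the following is a minimal sketch of how Cohen's kappa and MCC can be computed from a binary confusion matrix. The function name `kappa_and_mcc` and the example counts are assumptions made for illustration; they are not taken from the study or its data.

```python
import math

def kappa_and_mcc(tp, fp, fn, tn):
    """Cohen's kappa and Matthews correlation coefficient for a 2x2 confusion matrix.

    Hypothetical helper for illustration only; not the authors' evaluation code.
    """
    n = tp + fp + fn + tn
    # Observed agreement: fraction of samples on the diagonal.
    p_observed = (tp + tn) / n
    # Agreement expected by chance, from the row and column marginals.
    p_expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (p_observed - p_expected) / (1 - p_expected) if p_expected != 1 else 0.0
    # MCC: correlation between predicted and true binary labels.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return kappa, mcc

# Made-up counts for an imbalanced problem (60 positives, 140 negatives).
kappa, mcc = kappa_and_mcc(tp=40, fp=10, fn=20, tn=130)
print(f"kappa={kappa:.3f}, mcc={mcc:.3f}")
```

Both metrics account for the class marginals, which is one reason they are often preferred over plain accuracy when classes are imbalanced.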
Summary table: Oversampling techniques using SMOTE, ADASYN, and weighted...
plos.figshare.com
xls
Updated Nov 16, 2023
Alaa Alomari; Hossam Faris; Pedro A. Castillo (2023). Summary table: Oversampling techniques using SMOTE, ADASYN, and weighted rare classes. [Dataset]. http://doi.org/10.1371/journal.pone.0290581.t007
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0290581.t007
Dataset updated
Nov 16, 2023
Dataset provided by
PLOS ONE
Authors
Alaa Alomari; Hossam Faris; Pedro A. Castillo
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Summary table: Oversampling techniques using SMOTE, ADASYN, and weighted rare classes.
A comparison of the RN-SMOTE, SMOTE-Tomek Link, SMOTE-ENN, and the proposed...
plos.figshare.com
xls
Updated Feb 10, 2025
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). A comparison of the RN-SMOTE, SMOTE-Tomek Link, SMOTE-ENN, and the proposed 1CRN-SMOTE methods on the Blood and Health-risk datasets is presented, based on various classification metrics using the Random Forest classifier. [Dataset]. http://doi.org/10.1371/journal.pone.0317396.t008
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0317396.t008
Dataset updated
Feb 10, 2025
Dataset provided by
PLOS ONE
Authors
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A comparison of the RN-SMOTE, SMOTE-Tomek Link, SMOTE-ENN, and the proposed 1CRN-SMOTE methods on the Blood and Health-risk datasets is presented, based on various classification metrics using the Random Forest classifier.
Performance of machine learning models using SMOTE-balanced dataset.
plos.figshare.com
xls
Updated Nov 8, 2023
Nihal Abuzinadah; Muhammad Umer; Abid Ishaq; Abdullah Al Hejaili; Shtwai Alsubai; Ala’ Abdulmajid Eshmawi; Abdullah Mohamed; Imran Ashraf (2023). Performance of machine learning models using SMOTE-balanced dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0293061.t004
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0293061.t004
Dataset updated
Nov 8, 2023
Dataset provided by
PLOS ONE
Authors
Nihal Abuzinadah; Muhammad Umer; Abid Ishaq; Abdullah Al Hejaili; Shtwai Alsubai; Ala’ Abdulmajid Eshmawi; Abdullah Mohamed; Imran Ashraf
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Performance of machine learning models using SMOTE-balanced dataset.
Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, W Philip Kegelmeyer...
service.tib.eu
Updated Dec 3, 2024
Nitesh V Chawla; Kevin W Bowyer; Lawrence O Hall; W Philip Kegelmeyer (2024). SMOTE: Synthetic Minority Over-Sampling Technique [Dataset]. https://doi.org/10.57702/tq0zp0i3 (listed at https://service.tib.eu/ldmservice/dataset/smote--synthetic-minority-over-sampling-technique)
Dataset updated
Dec 3, 2024
Description
SMOTE: synthetic minority over-sampling technique.
Data from: High impact bug report identification with imbalanced learning...
researchdata.smu.edu.sg
zip
Updated Jun 1, 2023
YANG Xinli; David LO; Xin XIA; Qiao HUANG; Jianling SUN (2023). Data from: High impact bug report identification with imbalanced learning strategies [Dataset]. http://doi.org/10.25440/smu.12062763.v1
Available download formats: zip
Unique identifier
https://doi.org/10.25440/smu.12062763.v1
Dataset updated
Jun 1, 2023
Dataset provided by
SMU Research Data Repository (RDR)
Authors
YANG Xinli; David LO; Xin XIA; Qiao HUANG; Jianling SUN
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This record contains the underlying research data for the publication "High impact bug report identification with imbalanced learning strategies". The full text is available from: https://ink.library.smu.edu.sg/sis_research/3702
In practice, some bugs have more impact than others and thus deserve more immediate attention. Due to tight schedules and limited human resources, developers may not have enough time to inspect all bugs. Thus, they often concentrate on bugs that are highly impactful. In the literature, the term high-impact bugs refers to bugs that appear at unexpected times or locations and bring more unexpected effects (i.e., surprise bugs), or that break pre-existing functionality and destroy the user experience (i.e., breakage bugs). Unfortunately, identifying high-impact bugs among thousands of bug reports in a bug tracking system is not an easy feat. Thus, an automated technique that can identify high-impact bug reports can help developers become aware of them early, rectify them quickly, and minimize the damage they cause. Considering that only a small proportion of bugs are high-impact bugs, the identification of high-impact bug reports is a difficult task. In this paper, we propose an approach to identify high-impact bug reports by leveraging imbalanced learning strategies. We investigate the effectiveness of various variants, each of which combines one particular imbalanced learning strategy and one particular classification algorithm. In particular, we choose four widely used strategies for dealing with imbalanced data and four state-of-the-art text classification algorithms to conduct experiments on four datasets from four different open source projects. We mainly perform an analytical study on two types of high-impact bugs, i.e., surprise bugs and breakage bugs. The results show that different variants perform differently, and that the best performing variants, SMOTE (synthetic minority over-sampling technique) + KNN (K-nearest neighbours) for surprise bug identification and RUS (random under-sampling) + NB (naive Bayes) for breakage bug identification, outperform the F1-scores of the two state-of-the-art approaches by Thung et al. and Garcia and Shihab.
Supplementary code and data available from GitHub:
Synthetic oversampling for credit card default prediction
data.mendeley.com
Updated Mar 8, 2023
Fransiscus Pratikto (2023). Synthetic oversampling for credit card default prediction [Dataset]. http://doi.org/10.17632/jrss9jdjz9.1
Unique identifier
https://doi.org/10.17632/jrss9jdjz9.1
Dataset updated
Mar 8, 2023
Authors
Fransiscus Pratikto
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains more than 17,000 records of credit card holders, with 20 predictor variables and 1 binary target variable. The corresponding R code for comparing several proposed (density-based) and existing (SMOTE-based) synthetic oversampling methods is also provided.
Korean Voice Phishing Detection Dataset with Multilingual Back-Translation...
ieee-dataport.org
Updated Nov 11, 2024
MILANDU KEITH MOUSSAVOU BOUSSOUGOU (2024). Korean Voice Phishing Detection Dataset with Multilingual Back-Translation and SMOTE Augmentations [Dataset]. https://ieee-dataport.org/documents/korean-voice-phishing-detection-dataset-multilingual-back-translation-and-smote
Dataset updated
Nov 11, 2024
Authors
MILANDU KEITH MOUSSAVOU BOUSSOUGOU
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Chinese
Data from: Dataset for classification of signaling proteins based on...
figshare.com
portalcientifico.sergas.es
txt
Updated Jan 19, 2016
Carlos Fernandez-Lozano; Cristian Robert Munteanu (2016). Dataset for classification of signaling proteins based on molecular star graph descriptors using machine-learning models [Dataset]. http://doi.org/10.6084/m9.figshare.1330132.v1
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.1330132.v1
Dataset updated
Jan 19, 2016
Dataset provided by
figshare
Authors
Carlos Fernandez-Lozano; Cristian Robert Munteanu
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The positive group of 608 signaling protein sequences was downloaded in FASTA format from the Protein Data Bank (Berman et al., 2000) by using the “Molecular Function Browser” in the “Advanced Search Interface” (“Signaling (GO ID23052)”, protein identity cut-off = 30%). The negative group of 2077 non-signaling proteins was downloaded as the PISCES CulledPDB (http://dunbrack.fccc.edu/PISCES.php) (Wang & R. L. Dunbrack, 2003) (November 19th, 2012) using identity (degree of correspondence between two sequences) less than 20%, resolution of 1.6 Å and R-factor 0.25. The full dataset contains 2685 FASTA sequences of protein chains from the PDB databank: 608 are signaling proteins and 2077 are non-signaling peptides. This kind of unbalanced data is not the most suitable input for learning algorithms because the results would present a high sensitivity and low specificity; learning algorithms would tend to classify most samples as part of the most common group. To avoid this situation, a pre-processing stage is needed in order to get a more balanced dataset, in this case by means of the synthetic minority oversampling technique (SMOTE). In short, SMOTE provides a more balanced dataset by expanding the smaller class with new samples created by interpolating between existing minority-class samples. After this pre-processing, the final dataset is composed of 1824 positive samples (signaling protein chains) and 2432 negative cases (non-signaling protein chains). Paper is available at: http://dx.doi.org/10.1016/j.jtbi.2015.07.038
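The interpolation step described above can be sketched as follows. This is a simplified illustration of the general SMOTE idea, assuming numeric feature vectors and Euclidean distance; the function name `smote_sketch`, its parameters, and the toy data are hypothetical and do not reproduce the pre-processing actually used for this dataset.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating between neighbours.

    Simplified illustration of the SMOTE idea; needs at least two minority samples.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise Euclidean distances between minority samples.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a sample is never its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    # Pick a random base sample and one of its k nearest neighbours.
    base = rng.integers(0, n, size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    # The new point lies at a random position on the segment base -> neighbour.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

# Toy minority class: the four corners of the unit square.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(minority, n_new=10, k=2)
print(synthetic.shape)
```

Because each synthetic point lies on a line segment between two existing minority samples, it always falls inside the region already spanned by the minority class.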

Please cite: Carlos Fernandez-Lozano, Rubén F. Cuiñas, José A. Seoane, Enrique Fernández-Blanco, Julian Dorado, Cristian R. Munteanu, Classification of signaling proteins based on molecular star graph descriptors using Machine Learning models, Journal of Theoretical Biology, Volume 384, 7 November 2015, Pages 50-58, ISSN 0022-5193, http://dx.doi.org/10.1016/j.jtbi.2015.07.038 (http://www.sciencedirect.com/science/article/pii/S0022519315003999)
Data from: Skin Cancer Diagnostics with an All-Inclusive Smartphone...
search.dataone.org
Updated Nov 8, 2023
Pandey, Santosh (2023). Skin Cancer Diagnostics with an All-Inclusive Smartphone Application [Dataset]. http://doi.org/10.7910/DVN/HUQK9R
Unique identifier
https://doi.org/10.7910/DVN/HUQK9R
Dataset updated
Nov 8, 2023
Dataset provided by
Harvard Dataverse
Authors
Pandey, Santosh
Description
Among the different types of skin cancer, melanoma is considered to be the deadliest and is difficult to treat at advanced stages. Detection of melanoma at earlier stages can lead to reduced mortality rates. Desktop-based computer-aided systems have been developed to assist dermatologists with early diagnosis. However, there is significant interest in developing portable, at-home melanoma diagnostic systems which can assess the risk of cancerous skin lesions. Here, we present a smartphone application that combines image capture capabilities with preprocessing and segmentation to extract the Asymmetry, Border irregularity, Color variegation, and Diameter (ABCD) features of a skin lesion. Using the feature sets, classification of malignancy is achieved through support vector machine classifiers. By using adaptive algorithms in the individual data-processing stages, our approach is made computationally light, user friendly, and reliable in discriminating melanoma cases from benign ones. Images of skin lesions are either captured with the smartphone camera or imported from public datasets. The entire process from image capture to classification runs on an Android smartphone equipped with a detachable 10x lens, and processes an image in less than a second. The overall performance metrics are evaluated on a public database of 200 images with Synthetic Minority Over-sampling Technique (SMOTE) (80% sensitivity, 90% specificity, 88% accuracy, and 0.85 area under curve (AUC)) and without SMOTE (55% sensitivity, 95% specificity, 90% accuracy, and 0.75 AUC). The evaluated performance metrics and computation times are comparable or better than previous methods. This all-inclusive smartphone application is designed to be easy-to-download and easy-to-navigate for the end user, which is imperative for the eventual democratization of such medical diagnostic systems.
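To make the reported metrics concrete, here is a minimal sketch of how sensitivity, specificity, and accuracy follow from confusion-matrix counts. The function name `screening_metrics` and the counts are hypothetical, chosen only for illustration; they are not the study's actual confusion matrix.

```python
def screening_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity and accuracy from binary confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # share of malignant cases that are detected
    specificity = tn / (tn + fp)  # share of benign cases that are correctly cleared
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy

# Hypothetical counts for 200 images with 40 positives (illustrative only).
sens, spec, acc = screening_metrics(tp=22, fn=18, tn=152, fp=8)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}, accuracy={acc:.2f}")
```

With counts like these, accuracy stays high even though nearly half of the positive cases are missed, which is why sensitivity is worth reporting separately on imbalanced data.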
Data from: CreditCardTransactions Dataset
cubig.ai
Updated Jul 7, 2025
CUBIG (2025). CreditCardTransactions Dataset [Dataset]. https://cubig.ai/store/products/554/creditcardtransactions-dataset
Dataset updated
Jul 7, 2025
Dataset authored and provided by
CUBIG
License
https://cubig.ai/store/terms-of-service
Measurement technique
Privacy-preserving data transformation via differential privacy, Synthetic data generation using AI techniques for model training
Description
1) Data Introduction
• The Credit_Card_Transactions Dataset is a representative sample data for building fraud detection models, including anonymized real-world transaction data such as financial transaction type, amount, sender/receiver account balance, and fraud indicators.

2) Data Utilization
(1) Credit_Card_Transactions Dataset has characteristics that:
• This dataset provides individual transaction records on a row-by-row basis, reflecting the real-world class imbalance problem with the extremely low percentage of fraudulent transactions (isFraud=1).
• It is an unprocessed raw data structure that allows you to directly utilize key variables such as transaction time, amount, and account change.
(2) Credit_Card_Transactions Dataset can be used to:
• Binary classification modeling: Fraud transaction detection models can be developed by applying imbalance processing techniques such as SMOTE and undersampling, and appropriate evaluation indicators such as F1-score and ROC-AUC.
• Real-time anomaly detection: It can be used to build a real-time anomaly signal detection system through analysis of transaction patterns (amount, frequency, account change).
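As a rough illustration of the undersampling and F1-score ideas mentioned above, the sketch below balances a binary label array by randomly dropping majority-class rows and computes F1 from predictions. It assumes labels encoded as 0/1 in the style of the `isFraud` column; the helper names `random_undersample` and `f1_score` and the toy data are hypothetical and not part of the dataset.

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Keep every minority row and an equally sized random subset of the majority."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    fraud = np.flatnonzero(y == 1)
    normal = np.flatnonzero(y == 0)
    minority, majority = (fraud, normal) if len(fraud) <= len(normal) else (normal, fraud)
    kept = rng.choice(majority, size=len(minority), replace=False)
    idx = np.sort(np.concatenate([minority, kept]))
    return X[idx], y[idx]

def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN) for the positive class (label 1)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy data: 95 normal rows and 5 fraudulent rows.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 95 + [1] * 5)
X_bal, y_bal = random_undersample(X, y)
print(len(y_bal), int(y_bal.sum()))
```

Undersampling discards majority-class information, so on small datasets oversampling methods such as SMOTE are often considered as an alternative.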
Number of instances increased by SMOTE technique.
plos.figshare.com
xls
Updated Jun 2, 2023
Manal Alghamdi; Mouaz Al-Mallah; Steven Keteyian; Clinton Brawner; Jonathan Ehrman; Sherif Sakr (2023). Number of instances increased by SMOTE technique. [Dataset]. http://doi.org/10.1371/journal.pone.0179805.t003
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0179805.t003
Dataset updated
Jun 2, 2023
Dataset provided by
PLOS ONE
Authors
Manal Alghamdi; Mouaz Al-Mallah; Steven Keteyian; Clinton Brawner; Jonathan Ehrman; Sherif Sakr
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Number of instances increased by SMOTE technique.
MQTTEEB-D: A Real-World IoT Cybersecurity Dataset for AI-Powered Threat...
data.mendeley.com
Updated Mar 20, 2025
ABDERRAHMANE AQACHTOUL (2025). MQTTEEB-D: A Real-World IoT Cybersecurity Dataset for AI-Powered Threat Detection in MQTT Networks [Dataset]. http://doi.org/10.17632/jfttfjn6tr.1
Unique identifier
https://doi.org/10.17632/jfttfjn6tr.1
Dataset updated
Mar 20, 2025
Authors
ABDERRAHMANE AQACHTOUL
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset accompanies the research article on MQTTEEB-D and is intended for public use in cybersecurity research. The MQTTEEB-D dataset is a practical real-world dataset for improving intrusion detection in Message Queuing Telemetry Transport (MQTT)-based Internet of Things (IoT) networks. In contrast to existing datasets that are constructed from simulated network traffic, MQTTEEB-D is obtained from a real-time IoT deployment at the International University of Rabat (UIR), Morocco. Using MySignals IoT health sensors, Raspberry Pi 4, and an MQTT broker server, this dataset represents the actual complexity of the active IoT communication process, which synthetic data fails to offer. To narrow the gap between simulated and real-world attack scenarios, various cyberattacks including Denial of Service (DoS), Slow DoS against Internet of Things Environments (SlowITe), Malformed Data Injection, Brute Force, and MQTT publish flooding were carried out in real time, permitting close monitoring of network traffic anomalies. The data was captured using the Python wrapper for tshark (PyShark) and organized into multiple Comma-Separated Values (CSV) files. To ensure high data quality, we performed pre-processing steps such as outlier removal, normalization, standardization, and class balancing. Several processed forms of this dataset (raw, cleaned, normalized, standardized, and with the Synthetic Minority Over-sampling Technique (SMOTE) applied) are provided, along with detailed metadata to facilitate ease of use in cybersecurity research. This dataset provides an opportunity for researchers to develop and validate intrusion detection models in a real-world MQTT environment, a critical ingredient in Artificial Intelligence (AI)-driven cybersecurity solutions for IoT networks. The dataset will support future research in the IoT security and anomaly detection domains.
Data from: Date Fruit classification using a wide range of classifiers
data.niaid.nih.gov
zenodo.org
Updated May 16, 2023
Yandre M. G. Costa (2023). Date Fruit classification using a wide range of classifiers [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7938712
Dataset updated
May 16, 2023
Dataset provided by
Yandre M. G. Costa
André F. R. Cordeiro
Edson OliveiraJr
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Datasets generated by the techniques RUS and SMOTE. These datasets were used in the paper Date Fruit classification using a wide range of classifiers, accepted for publication in the International Conference on Systems, Signals and Image Processing (IWSSIP) 2023.
Language_Indentification_v2
huggingface.co
Updated Mar 18, 2025
ProcessVenue (2025). Language_Indentification_v2 [Dataset]. https://huggingface.co/datasets/Process-Venue/Language_Indentification_v2
Dataset updated
Mar 18, 2025
Dataset authored and provided by
ProcessVenue
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Dataset Card for Language Identification Dataset

Sample Notebook:

https://www.kaggle.com/code/rishabhbhartiya/indian-language-classification-smote-resampled

Kaggle Dataset link:

https://www.kaggle.com/datasets/processvenue/indian-language-identification

Dataset Summary

A comprehensive dataset for Indian language identification and text classification. The dataset contains text samples across 18 major Indian languages, making it suitable for… See the full description on the dataset page: https://huggingface.co/datasets/Process-Venue/Language_Indentification_v2.
ml_data_test_detection_bank_transaction_frauds_unbalanced
huggingface.co
Updated Jun 19, 2023
Roberto Armas (2023). ml_data_test_detection_bank_transaction_frauds_unbalanced [Dataset]. https://huggingface.co/datasets/roberto-armas/ml_data_test_detection_bank_transaction_frauds_unbalanced
Dataset updated
Jun 19, 2023
Authors
Roberto Armas
Description
ML Data Test Detection Bank Transaction Frauds Unbalanced

The project provides a quick and accessible dataset designed for learning and experimenting with machine learning algorithms, specifically in the context of detecting fraudulent bank transactions. It is intended for practicing and applying concepts such as Random Forest, Support Vector Machines (SVM), and Synthetic Minority Over-sampling Technique (SMOTE) to address unbalanced classification problems. Note: This dataset is… See the full description on the dataset page: https://huggingface.co/datasets/roberto-armas/ml_data_test_detection_bank_transaction_frauds_unbalanced.
A comparative analysis of earlier studies.
plos.figshare.com
xls
Updated Jan 18, 2024
Praveen Talari; Bharathiraja N; Gaganpreet Kaur; Hani Alshahrani; Mana Saleh Al Reshan; Adel Sulaiman; Asadullah Shaikh (2024). A comparative analysis of earlier studies. [Dataset]. http://doi.org/10.1371/journal.pone.0292100.t001
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0292100.t001
Dataset updated
Jan 18, 2024
Dataset provided by
PLOS ONE
Authors
Praveen Talari; Bharathiraja N; Gaganpreet Kaur; Hani Alshahrani; Mana Saleh Al Reshan; Adel Sulaiman; Asadullah Shaikh
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Diabetes prediction is an ongoing research topic in which medical specialists are attempting to forecast the condition with greater precision. Diabetes typically stays dormant, and if patients are diagnosed with another illness, such as damage to the kidney vessels, problems with the retina of the eye, or a heart condition, it can cause metabolic problems and various complications in the body. Various ensemble learning procedures, including voting, boosting, and bagging, have been applied in this study. The Synthetic Minority Oversampling Technique (SMOTE), along with the K-fold cross-validation approach, was used to achieve class balancing and validate the findings. The Pima Indian Diabetes (PID) dataset from the UCI Machine Learning (UCI ML) repository was chosen for this study. A feature engineering technique was used to calculate the influence of lifestyle factors. A two-phase classification model has been developed to predict insulin resistance using the Sequential Minimal Optimisation (SMO) and SMOTE approaches together. The SMOTE technique is used to preprocess data in the model’s first phase, while SMO classes are used in the second phase. Bagging decision trees outperformed all other classification techniques in terms of misclassification error rate, accuracy, specificity, precision, recall, F1 measure, and ROC curve. The model was created using a combined SMOTE and SMO strategy, which achieved 99.07% accuracy with 0.1 ms of runtime. The suggested system is intended to enhance the classifier’s performance in spotting illness early.
Classification result classifiers using TF-IDF with SMOTE.
plos.figshare.com
xls
Updated May 28, 2024
Khaled Alnowaiser (2024). Classification result classifiers using TF-IDF with SMOTE. [Dataset]. http://doi.org/10.1371/journal.pone.0302304.t007
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0302304.t007
Dataset updated
May 28, 2024
Dataset provided by
PLOS ONE
Authors
Khaled Alnowaiser
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Classification result classifiers using TF-IDF with SMOTE.
Results of Bioassay 1608 dataset in experiment 2.
plos.figshare.com
xls
Updated May 31, 2023
Jinyan Li; Lian-sheng Liu; Simon Fong; Raymond K. Wong; Sabah Mohammed; Jinan Fiaidhi; Yunsick Sung; Kelvin K. L. Wong (2023). Results of Bioassay 1608 dataset in experiment 2. [Dataset]. http://doi.org/10.1371/journal.pone.0180830.t011
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0180830.t011
Dataset updated
May 31, 2023
Dataset provided by
PLOS ONE
Authors
Jinyan Li; Lian-sheng Liu; Simon Fong; Raymond K. Wong; Sabah Mohammed; Jinan Fiaidhi; Yunsick Sung; Kelvin K. L. Wong
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Results of Bioassay 1608 dataset in experiment 2.

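25 scholarly articles cite the dataset "Adaptive Swarm Balancing Algorithms for rare-event prediction in imbalanced healthcare data" (https://doi.org/10.1371/journal.pone.0180830).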
