100+ datasets found

i
Imbalanced Data
ieee-dataport.org
Updated Aug 23, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Blessa Binolin M (2023). Imbalanced Data [Dataset]. https://ieee-dataport.org/documents/imbalanced-data-0
Explore at:
Dataset updated
Aug 23, 2023
Authors
Blessa Binolin M
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Classification learning on non-stationary data may face dynamic changes from time to time. The major problem in it is the class imbalance and high cost of labeling instances despite drifts. Imbalance is due to lower number of samples in the minority class than the majority class. Imbalanced data results in the misclassification of data points.
f
Performance comparison of machine learning models across accuracy, AUC, MCC,...
plos.figshare.com
xls
Updated Dec 31, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Seongil Han; Haemin Jung (2024). Performance comparison of machine learning models across accuracy, AUC, MCC, and F1 score on GMSC dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0316454.t005
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0316454.t005
Dataset updated
Dec 31, 2024
Dataset provided by
PLOS ONE
Authors
Seongil Han; Haemin Jung
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Performance comparison of machine learning models across accuracy, AUC, MCC, and F1 score on GMSC dataset.
i
Tackling Class Imbalance with Ranking - Dataset - CKAN
rdm.inesctec.pt
Updated Feb 20, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2017). Tackling Class Imbalance with Ranking - Dataset - CKAN [Dataset]. https://rdm.inesctec.pt/dataset/nis-2017-002
Explore at:
Dataset updated
Feb 20, 2017
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
The dataset comes originally from UCI Machine Learning. The multiclass datasets were transformed in binary classification as mentioned in the paper. Ranking methods were applied to improve class imbalance. The datasets are divided in 30 folds so that other class imbalance methods can be compared to the methods in the paper. The code used in the paper is also provided.
f
Over-sampled dataset.
figshare.com
xls
Updated Dec 31, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Seongil Han; Haemin Jung (2024). Over-sampled dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0316454.t004
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0316454.t004
Dataset updated
Dec 31, 2024
Dataset provided by
PLOS ONE
Authors
Seongil Han; Haemin Jung
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Credit scoring models play a crucial role for financial institutions in evaluating borrower risk and sustaining profitability. Logistic regression is widely used in credit scoring due to its robustness, interpretability, and computational efficiency; however, its predictive power decreases when applied to complex or non-linear datasets, resulting in reduced accuracy. In contrast, tree-based machine learning models often provide enhanced predictive performance but struggle with interpretability. Furthermore, imbalanced class distributions, which are prevalent in credit scoring, can adversely impact model accuracy and robustness, as the majority class tends to dominate. Despite these challenges, research that comprehensively addresses both the predictive performance and explainability aspects within the credit scoring domain remains limited. This paper introduces the Non-pArameTric oversampling approach for Explainable credit scoring (NATE), a framework designed to address these challenges by combining oversampling techniques with tree-based classifiers to enhance model performance and interpretability. NATE incorporates class balancing methods to mitigate the impact of imbalanced data distributions and integrates interpretability features to elucidate the model’s decision-making process. Experimental results show that NATE substantially outperforms traditional logistic regression in credit risk classification, with improvements of 19.33% in AUC, 71.56% in MCC, and 85.33% in F1 Score. Oversampling approaches, particularly when used with gradient boosting, demonstrated superior effectiveness compared to undersampling, achieving optimal metrics of AUC: 0.9649, MCC: 0.8104, and F1 Score: 0.9072. Moreover, NATE enhances interpretability by providing detailed insights into feature contributions, aiding in understanding individual predictions. These findings highlight NATE’s capability in managing class imbalance, improving predictive performance, and enhancing model interpretability, demonstrating its potential as a reliable and transparent tool for credit scoring applications.
Dataset for Imbalance Class Problem
kaggle.com
Updated Feb 12, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Akalya Subramanian (2021). Dataset for Imbalance Class Problem [Dataset]. https://www.kaggle.com/akalyasubramanian/dataset-for-imbalance-class-problem/activity
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 12, 2021
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Akalya Subramanian
Description
Dataset

This dataset was created by Akalya Subramanian

Contents
f
Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in...
frontiersin.figshare.com
docx
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Giulia Varotto; Gianluca Susi; Laura Tassi; Francesca Gozzo; Silvana Franceschetti; Ferruccio Panzica (2023). Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in Machine Learning: Application to Epileptogenic Zone Localization From Interictal Intracranial EEG Recordings in Patients With Focal Epilepsy.DOCX [Dataset]. http://doi.org/10.3389/fninf.2021.715421.s002
Explore at:
docxAvailable download formats
Unique identifier
https://doi.org/10.3389/fninf.2021.715421.s002
Dataset updated
Jun 1, 2023
Dataset provided by
Frontiers
Authors
Giulia Varotto; Gianluca Susi; Laura Tassi; Francesca Gozzo; Silvana Franceschetti; Ferruccio Panzica
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Aim: In neuroscience research, data are quite often characterized by an imbalanced distribution between the majority and minority classes, an issue that can limit or even worsen the prediction performance of machine learning methods. Different resampling procedures have been developed to face this problem and a lot of work has been done in comparing their effectiveness in different scenarios. Notably, the robustness of such techniques has been tested among a wide variety of different datasets, without considering the performance of each specific dataset. In this study, we compare the performances of different resampling procedures for the imbalanced domain in stereo-electroencephalography (SEEG) recordings of the patients with focal epilepsies who underwent surgery.Methods: We considered data obtained by network analysis of interictal SEEG recorded from 10 patients with drug-resistant focal epilepsies, for a supervised classification problem aimed at distinguishing between the epileptogenic and non-epileptogenic brain regions in interictal conditions. We investigated the effectiveness of five oversampling and five undersampling procedures, using 10 different machine learning classifiers. Moreover, six specific ensemble methods for the imbalanced domain were also tested. To compare the performances, Area under the ROC curve (AUC), F-measure, Geometric Mean, and Balanced Accuracy were considered.Results: Both the resampling procedures showed improved performances with respect to the original dataset. The oversampling procedure was found to be more sensitive to the type of classification method employed, with Adaptive Synthetic Sampling (ADASYN) exhibiting the best performances. All the undersampling approaches were more robust than the oversampling among the different classifiers, with Random Undersampling (RUS) exhibiting the best performance despite being the simplest and most basic classification method.Conclusions: The application of machine learning techniques that take into consideration the balance of features by resampling is beneficial and leads to more accurate localization of the epileptogenic zone from interictal periods. In addition, our results highlight the importance of the type of classification method that must be used together with the resampling to maximize the benefit to the outcome.
f
Data from: Addressing Imbalanced Classification Problems in Drug Discovery...
acs.figshare.com
zip
Updated Apr 15, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ayush Garg; Narayanan Ramamurthi; Shyam Sundar Das (2025). Addressing Imbalanced Classification Problems in Drug Discovery and Development Using Random Forest, Support Vector Machine, AutoGluon-Tabular, and H2O AutoML [Dataset]. http://doi.org/10.1021/acs.jcim.5c00023.s001
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.1021/acs.jcim.5c00023.s001
Dataset updated
Apr 15, 2025
Dataset provided by
ACS Publications
Authors
Ayush Garg; Narayanan Ramamurthi; Shyam Sundar Das
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
The classification models built on class imbalanced data sets tend to prioritize the accuracy of the majority class, and thus, the minority class generally has a higher misclassification rate. Different techniques are available to address the class imbalance in classification models and can be categorized as data-level, algorithm-level, and hybrid methods. But to the best of our knowledge, an in-depth analysis of the performance of these techniques against the class ratio is not available in the literature. We have addressed these shortcomings in this study and have performed a detailed analysis of the performance of four different techniques to address imbalanced class distribution using machine learning (ML) methods and AutoML tools. To carry out our study, we have selected four such techniques(a) threshold optimization using (i) GHOST and (ii) the area under the precision–recall curve (AUPR) curve, (b) internal balancing method of AutoML and class-weight of machine learning methods, and (c) data balancing using SMOTETomekand generated 27 data sets considering nine different class ratios (i.e., the ratio of the positive class and total samples) from three data sets that belong to the drug discovery and development field. We have employed random forest (RF) and support vector machine (SVM) as representatives of ML classifier and AutoGluon-Tabular (version 0.6.1) and H2O AutoML (version 3.40.0.4) as representatives of AutoML tools. The important findings of our studies are as follows: (i) there is no effect of threshold optimization on ranking metrics such as AUC and AUPR, but AUC and AUPR get affected by class-weighting and SMOTTomek; (ii) for ML methods RF and SVM, significant percentage improvement up to 375, 33.33, and 450 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy, which are suitable for performance evaluation of imbalanced data sets; (iii) for AutoML libraries AutoGluon-Tabular and H2O AutoML, significant percentage improvement up to 383.33, 37.25, and 533.33 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy; (iv) the general pattern of percentage improvement in balanced accuracy is that the percentage improvement increases when the class ratio is systematically decreased from 0.5 to 0.1; in the case of F1 score and MCC, maximum improvement is achieved at the class ratio of 0.3; (v) for both ML and AutoML with balancing, it is observed that any individual class-balancing technique does not outperform all other methods on a significantly higher number of data sets based on F1 score; (vi) the three external balancing techniques combined outperformed the internal balancing methods of the ML and AutoML; (vii) AutoML tools perform as good as the ML models and in some cases perform even better for handling imbalanced classification when applied with imbalance handling techniques. In summary, exploration of multiple data balancing techniques is recommended for classifying imbalanced data sets to achieve optimal performance as neither of the external techniques nor the internal techniques outperform others significantly. The results are specific to the ML methods and AutoML libraries used in this study, and for generalization, a study can be carried out considering a sizable number of ML methods and AutoML libraries.
Dataset: The effects of class balance on the training energy consumption of...
zenodo.org
data.niaid.nih.gov
csv
Updated Mar 18, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Maria Gutierrez; Maria Gutierrez; Coral Calero; Coral Calero; Félix García; Félix García; Mª Ángeles Moraga; Mª Ángeles Moraga (2024). Dataset: The effects of class balance on the training energy consumption of logistic regression models [Dataset]. http://doi.org/10.5281/zenodo.10823624
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.10823624
Dataset updated
Mar 18, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Maria Gutierrez; Maria Gutierrez; Coral Calero; Coral Calero; Félix García; Félix García; Mª Ángeles Moraga; Mª Ángeles Moraga
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
2024
Description
Two synthetic datasets for binary classification, generated with the Random Radial Basis Function generator from WEKA. They are the same shape and size (104.952 instances, 185 attributes), but the "balanced" dataset has 52,13% of its instances belonging to class c0, while the "unbalanced" one only has 4,04% of its instances belonging to class c0. Therefore, this set of datasets is primarily meant to study how class balance influences the behaviour of a machine learning model.
UVP5 data sorted with EcoTaxa and MorphoCluster
seanoe.org
image/*
Updated 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rainer Kiko; Simon-Martin Schröder (2020). UVP5 data sorted with EcoTaxa and MorphoCluster [Dataset]. http://doi.org/10.17882/73002
Explore at:
image/*Available download formats
Unique identifier
https://doi.org/10.17882/73002
Dataset updated
2020
Dataset provided by
SEANOE
Authors
Rainer Kiko; Simon-Martin Schröder
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Time period covered
Oct 23, 2012 - Aug 7, 2017
Area covered
Description
here, we provide plankton image data that was sorted with the web applications ecotaxa and morphocluster. the data set was used for image classification tasks as described in schröder et. al (in preparation) and does not include any geospatial or temporal meta-data.plankton was imaged using the underwater vision profiler 5 (picheral et al. 2010) in various regions of the world's oceans between 2012-10-24 and 2017-08-08.this data publication consists of an archive containing "training.csv" (list of 392k training images for classification, validated using ecotaxa), "validation.csv" (list of 196k validation images for classification, validated using ecotaxa), "unlabeld.csv" (list of 1m unlabeled images), "morphocluster.csv" (1.2m objects validated using morphocluster, a subset of "unlabeled.csv" and "validation.csv") and the image files themselves. the csv files each contain the columns "object_id" (a unique id), "image_fn" (the relative filename), and "label" (the assigned name).the training and validation sets were sorted into 65 classes using the web application ecotaxa (http://ecotaxa.obs-vlfr.fr). this data shows a severe class imbalance; the 10% most populated classes contain more than 80% of the objects and the class sizes span four orders of magnitude. the validation set and a set of additional 1m unlabeled images were sorted during the first trial of morphocluster (https://github.com/morphocluster).the images in this data set were sampled during rv meteor cruises m92, m93, m96, m97, m98, m105, m106, m107, m108, m116, m119, m121, m130, m131, m135, m136, m137 and m138, during rv maria s merian cruises msm22, msm23, msm40 and msm49, during the rv polarstern cruise ps88b and during the fluxes1 experiment with rv sarmiento de gamboa.the following people have contributed to the sorting of the image data on ecotaxa:rainer kiko, tristan biard, benjamin blanc, svenja christiansen, justine courboules, charlotte eich, jannik faustmann, christine gawinski, augustin lafond, aakash panchal, marc picheral, akanksha singh and helena haussin schröder et al. (in preparation), the training set serves as a source for knowledge transfer in the training of the feature extractor. the classification using morphocluster was conducted by rainer kiko. used labels are operational and not yet matched to respective ecotaxa classes.
m
Safety dataset
data.mendeley.com
Updated Jul 16, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kai lai Sun (2025). Safety dataset [Dataset]. http://doi.org/10.17632/m8rwjx67bk.1
Explore at:
Unique identifier
https://doi.org/10.17632/m8rwjx67bk.1
Dataset updated
Jul 16, 2025
Authors
Kai lai Sun
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Three imbalanced safety datasets are established and shared: a 9-year nationwide construction accident dataset in Singapore, an accident and safety management dataset in a major development project in Singapore, and a US truck driver safety climate survey dataset. The paper link: https://doi.org/10.1016/j.knosys.2025.114120.
Credit Card Fraud Detection Dataset
kaggle.com
Updated May 15, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ghanshyam Saini (2025). Credit Card Fraud Detection Dataset [Dataset]. https://www.kaggle.com/datasets/ghnshymsaini/credit-card-fraud-detection-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 15, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Ghanshyam Saini
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Credit Card Fraud Detection Dataset (European Cardholders, September 2013)

As a data contributor, I'm sharing this crucial dataset focused on the detection of fraudulent credit card transactions. Recognizing these illicit activities is paramount for protecting customers and the integrity of financial systems.

About the Dataset:

This dataset encompasses credit card transactions made by European cardholders during a two-day period in September 2013. It presents a real-world scenario with a significant class imbalance, where fraudulent transactions are considerably less frequent than legitimate ones. Out of a total of 284,807 transactions, only 492 are instances of fraud, representing a mere 0.172% of the entire dataset.

Content of the Data:

Due to confidentiality concerns, the majority of the input features in this dataset have undergone a Principal Component Analysis (PCA) transformation. This means the original meaning and context of features V1, V2, ..., V28 are not directly provided. However, these principal components capture the variance in the underlying transaction data.

The only features that have not been transformed by PCA are:

Time: Numerical. Represents the number of seconds elapsed between each transaction and the very first transaction recorded in the dataset.

Amount: Numerical. The transaction amount in Euros (€). This feature could be valuable for cost-sensitive learning approaches.

The target variable for this classification task is:

Class: Integer. Takes the value 1 in the case of a fraudulent transaction and 0 otherwise.

Important Note on Evaluation:

Given the substantial class imbalance (far more legitimate transactions than fraudulent ones), traditional accuracy metrics based on the confusion matrix can be misleading. It is strongly recommended to evaluate models using the Area Under the Precision-Recall Curve (AUPRC), as this metric is more sensitive to the performance on the minority class (fraudulent transactions).

How to Use This Dataset:

Download the dataset file (likely in CSV format).

Load the data using libraries like Pandas.

Understand the class imbalance: Be aware that fraudulent transactions are rare.

Explore the features: Analyze the distributions of 'Time', 'Amount', and the PCA-transformed features (V1-V28).

Address the class imbalance: Consider using techniques like oversampling the minority class, undersampling the majority class, or using specialized algorithms designed for imbalanced datasets.

Build and train binary classification models to predict the 'Class' variable.

Evaluate your models using AUPRC to get a meaningful assessment of performance in detecting fraud.

Acknowledgements and Citation:

This dataset has been collected and analyzed through a research collaboration between Worldline and the Machine Learning Group (MLG) of ULB (Université Libre de Bruxelles).

When using this dataset in your research or projects, please cite the following works as appropriate:

Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015.

Dal Pozzolo, Andrea; Caelen, Olivier; Le Borgne, Yann-Ael; Waterschoot, Serge; Bontempi, Gianluca. Learned lessons in credit card fraud detection from a practitioner perspective, Expert systems with applications,41,10,4915-4928,2014, Pergamon.

Dal Pozzolo, Andrea; Boracchi, Giacomo; Caelen, Olivier; Alippi, Cesare; Bontempi, Gianluca. Credit card fraud detection: a realistic modeling and a novel learning strategy, IEEE transactions on neural networks and learning systems,29,8,3784-3797,2018,IEEE.

Andrea Dal Pozzolo. Adaptive Machine learning for credit card fraud detection ULB MLG PhD thesis (supervised by G. Bontempi).

Fabrizio Carcillo, Andrea Dal Pozzolo, Yann-Aël Le Borgne, Olivier Caelen, Yannis Mazzer, Gianluca Bontempi. Scarff: a scalable framework for streaming credit card fraud detection with Spark, Information fusion,41, 182-194,2018,Elsevier.

Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen, Gianluca Bontempi. Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization, International Journal of Data Science and Analytics, 5,4,285-300,2018,Springer International Publishing.

Bertrand Lebichot, Yann-Aël Le Borgne, Liyun He, Frederic Oblé, Gianluca Bontempi Deep-Learning Domain Adaptation Techniques for Credit Cards Fraud Detection, INNSBDDL 2019: Recent Advances in Big Data and Deep Learning, pp 78-88, 2019.

Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen, Frederic Oblé, Gianluca Bontempi *Combining Unsupervised and Supervised...
f
GMSC dataset (IR: Imbalance Ratio).
plos.figshare.com
xls
Updated Dec 31, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Seongil Han; Haemin Jung (2024). GMSC dataset (IR: Imbalance Ratio). [Dataset]. http://doi.org/10.1371/journal.pone.0316454.t001
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0316454.t001
Dataset updated
Dec 31, 2024
Dataset provided by
PLOS ONE
Authors
Seongil Han; Haemin Jung
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Credit scoring models play a crucial role for financial institutions in evaluating borrower risk and sustaining profitability. Logistic regression is widely used in credit scoring due to its robustness, interpretability, and computational efficiency; however, its predictive power decreases when applied to complex or non-linear datasets, resulting in reduced accuracy. In contrast, tree-based machine learning models often provide enhanced predictive performance but struggle with interpretability. Furthermore, imbalanced class distributions, which are prevalent in credit scoring, can adversely impact model accuracy and robustness, as the majority class tends to dominate. Despite these challenges, research that comprehensively addresses both the predictive performance and explainability aspects within the credit scoring domain remains limited. This paper introduces the Non-pArameTric oversampling approach for Explainable credit scoring (NATE), a framework designed to address these challenges by combining oversampling techniques with tree-based classifiers to enhance model performance and interpretability. NATE incorporates class balancing methods to mitigate the impact of imbalanced data distributions and integrates interpretability features to elucidate the model’s decision-making process. Experimental results show that NATE substantially outperforms traditional logistic regression in credit risk classification, with improvements of 19.33% in AUC, 71.56% in MCC, and 85.33% in F1 Score. Oversampling approaches, particularly when used with gradient boosting, demonstrated superior effectiveness compared to undersampling, achieving optimal metrics of AUC: 0.9649, MCC: 0.8104, and F1 Score: 0.9072. Moreover, NATE enhances interpretability by providing detailed insights into feature contributions, aiding in understanding individual predictions. These findings highlight NATE’s capability in managing class imbalance, improving predictive performance, and enhancing model interpretability, demonstrating its potential as a reliable and transparent tool for credit scoring applications.
f
The definition of a confusion matrix.
plos.figshare.com
xls
Updated Feb 10, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). The definition of a confusion matrix. [Dataset]. http://doi.org/10.1371/journal.pone.0317396.t002
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0317396.t002
Dataset updated
Feb 10, 2025
Dataset provided by
PLOS ONE
Authors
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In recent years, the challenge of imbalanced data has become increasingly prominent in machine learning, affecting the performance of classification algorithms. This study proposes a novel data-level oversampling method called Cluster-Based Reduced Noise SMOTE (CRN-SMOTE) to address this issue. CRN-SMOTE combines SMOTE for oversampling minority classes with a novel cluster-based noise reduction technique. In this cluster-based noise reduction approach, it is crucial that samples from each category form one or two clusters, a feature that conventional noise reduction methods do not achieve. The proposed method is evaluated on four imbalanced datasets (ILPD, QSAR, Blood, and Maternal Health Risk) using five metrics: Cohen’s kappa, Matthew’s correlation coefficient (MCC), F1-score, precision, and recall. Results demonstrate that CRN-SMOTE consistently outperformed the state-of-the-art Reduced Noise SMOTE (RN-SMOTE), SMOTE-Tomek Link, and SMOTE-ENN methods across all datasets, with particularly notable improvements observed in the QSAR and Maternal Health Risk datasets, indicating its effectiveness in enhancing imbalanced classification performance. Overall, the experimental findings indicate that CRN-SMOTE outperformed RN-SMOTE in 100% of the cases, achieving average improvements of 6.6% in Kappa, 4.01% in MCC, 1.87% in F1-score, 1.7% in precision, and 2.05% in recall, with setting SMOTE’s neighbors’ number to 5.
R
Balance Class Dataset
universe.roboflow.com
zip
Updated Mar 23, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rinat Gil (2022). Balance Class Dataset [Dataset]. https://universe.roboflow.com/rinat-gil/balance-class/dataset/1
Explore at:
zipAvailable download formats
Dataset updated
Mar 23, 2022
Dataset authored and provided by
Rinat Gil
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Masks Bounding Boxes
Description
Balance Class

## Overview Balance Class is a dataset for object detection tasks - it contains Masks annotations for 1,420 images. ## Getting Started You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model. ## License This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
Credit scoring with class imbalance data: An out-of-sample and out-of-time...
zenodo.org
Updated Oct 6, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jonah Mushava; Mike Murray; Jonah Mushava; Mike Murray (2023). Credit scoring with class imbalance data: An out-of-sample and out-of-time perspective [Dataset]. http://doi.org/10.5281/zenodo.8401978
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.8401978
Dataset updated
Oct 6, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Jonah Mushava; Mike Murray; Jonah Mushava; Mike Murray
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The raw datasets provided here are intended for use in a Data in Brief article. These comprehensive files, sourced from the Freddie Mac website, offer quarterly snapshots of mortgage loans that have been originated in the USA since 1999, along with details of their subsequent repayment behaviours. This data remains current and is updated every three months. Specifically, the loan origination data present here encompasses amortized fixed-rate mortgage loans from 1999 up to June 2022. In contrast, the performance data is presented on a monthly basis, detailing loan repayment profiles from 1999 until September 30, 2022. Both the origination and performance datasets feature a unique loan ID, which can be utilized to integrate the data on loan originations with that of loan repayments.
Imbalance Classes(Sentiment Analysis) Dataset
kaggle.com
Updated Sep 29, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zohaib Arshid (2022). Imbalance Classes(Sentiment Analysis) Dataset [Dataset]. https://www.kaggle.com/datasets/zohaibarshid/imbalance-classessentiment-analysis/discussion
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 29, 2022
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Zohaib Arshid
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
We have 3 classes(Negative, Positive, Neutral) in initial stage we have only 620 review sentences (150, 406, 64) respectively. By using Semi-Supervised Learning Techniques we generate dataset for Imbalance classes Problem.
R
Class Balance Class 2 Dataset
universe.roboflow.com
zip
Updated Dec 19, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FPT (2021). Class Balance Class 2 Dataset [Dataset]. https://universe.roboflow.com/fpt/class-balance-class-2
Explore at:
zipAvailable download formats
Dataset updated
Dec 19, 2021
Dataset authored and provided by
FPT
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Mask Bounding Boxes
Description
Class Balance Class 2

## Overview Class Balance Class 2 is a dataset for object detection tasks - it contains Mask annotations for 323 images. ## Getting Started You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model. ## License This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
Replication package for "An Empirical Assessment of Best-Answer Prediction...
zenodo.org
data.niaid.nih.gov
zip
Updated Jan 24, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Fabio Calefato; Fabio Calefato; Filippo Lanubile; Nicole Novielli; Filippo Lanubile; Nicole Novielli (2020). Replication package for "An Empirical Assessment of Best-Answer Prediction Models in Technical Q&A Sites" (EMSE 2018) [Dataset]. http://doi.org/10.5281/zenodo.2575593
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.2575593
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Fabio Calefato; Fabio Calefato; Filippo Lanubile; Nicole Novielli; Filippo Lanubile; Nicole Novielli
Description
Replication package for the paper:

F. Calefato, F. Lanubile, and N. Novielli (2018) “An Empirical Assessment of Best-Answer Prediction Models in Technical Q&A Sites.” Empirical Software Engineering Journal, DOI: 10.1007/s10664-018-9642-5.
f
Data_Sheet 1_Prediction Is a Balancing Act: Importance of Sampling Methods...
frontiersin.figshare.com
datasetcatalog.nlm.nih.gov
pdf
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Priyanka Banerjee; Frederic O. Dehnbostel; Robert Preissner (2023). Data_Sheet 1_Prediction Is a Balancing Act: Importance of Sampling Methods to Balance Sensitivity and Specificity of Predictive Models Based on Imbalanced Chemical Data Sets.PDF [Dataset]. http://doi.org/10.3389/fchem.2018.00362.s001
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.3389/fchem.2018.00362.s001
Dataset updated
May 30, 2023
Dataset provided by
Frontiers
Authors
Priyanka Banerjee; Frederic O. Dehnbostel; Robert Preissner
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Increase in the number of new chemicals synthesized in past decades has resulted in constant growth in the development and application of computational models for prediction of activity as well as safety profiles of the chemicals. Most of the time, such computational models and its application must deal with imbalanced chemical data. It is indeed a challenge to construct a classifier using imbalanced data set. In this study, we analyzed and validated the importance of different sampling methods over non-sampling method, to achieve a well-balanced sensitivity and specificity of a machine learning model trained on imbalanced chemical data. Additionally, this study has achieved an accuracy of 93.00%, an AUC of 0.94, F1 measure of 0.90, sensitivity of 96.00% and specificity of 91.00% using SMOTE sampling and Random Forest classifier for the prediction of Drug Induced Liver Injury (DILI). Our results suggest that, irrespective of data set used, sampling methods can have major influence on reducing the gap between sensitivity and specificity of a model. This study demonstrates the efficacy of different sampling methods for class imbalanced problem using binary chemical data sets.
CerditCard fraud dataset
kaggle.com
Updated Aug 2, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Wasiq Ali (2025). CerditCard fraud dataset [Dataset]. https://www.kaggle.com/datasets/wasiqaliyasir/cerditcard-fraud-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 2, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Wasiq Ali
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Credit Card Fraud Detection Dataset

Uncover fraudulent transactions with this anonymized, PCA-transformed dataset. Perfect for building and testing fraud detection algorithms!

Dataset Overview

Objective: Detect fraudulent credit card transactions using anonymized features- - - -

Samples: 1,000 transactions

Features: 7 columns (5 PCA components + Transaction Amount + Target)

Class Distribution:

Legit (Class 0): 993 transactions (~99.3%)

Fraud (Class 1): 7 transactions (~0.7%)

Key Challenge: Extreme class imbalance – realistic representation of fraud patterns

Features Description

Feature Description Characteristics

V1-V5 Anonymized principal components PCA-transformed numerical features; preserves >transaction patterns while hiding sensitive details Amount Transaction value Highly variable (min: $0.20, max: $1,916.06); critical for fraud analysis Class Target variable Binary labels: • 0 = Legitimate transaction • 1 = Fraudulent transaction Key Insights & Patterns

Fraud Indicators:

Fraudulent transactions occur across diverse amounts (low: $1.83 → high: $1,916)

No obvious amount threshold for fraud – requires nuanced modeling

Sample fraud cases:

V1:0.579, V2:-0.384, Amount:1916.06

V1:1.023, V2:-0.638, Amount:1094.42

Data Characteristics:

V1-V5 Distributions:

V1: Concentrated near zero (mean ≈ -0.1)

V2: Wider spread (mean ≈ 0.05)

V3-V5: Asymmetric distributions

Amount Distribution:

Right-skewed – most transactions < $500

2.Fraud cases span low and high values

Class Imbalance:

- Severe skew: 993:7 legit-to-fraud ratio - Models must optimize for recall/precision over accuracy

Analysis Challenges

⚠️ Class Imbalance: Standard accuracy metrics misleading

🔍 Feature Interpretation: PCA components lack real-world context

📊 Non-linear Patterns: Complex interactions between V1-V5

⚡ High Stakes: False negatives (missed fraud) costlier than false positives

Recommended Applications Fraud Detection Models:

Logistic Regression (with class weighting)

Random Forests / XGBoost (handle non-linearities)

Isolation Forests (anomaly detection)

Evaluation Focus:

Precision-Recall Curves > ROC-AUC

F2-Score (prioritize recall)

Confusion matrix analysis

Advanced Techniques:

SMOTE/ADASYN for oversampling

Autoencoders for anomaly detection

Feature engineering: Amount-to-Var ratios

Dataset Source & Ethics Origin: Synthetic dataset mirroring real-world financial patterns

Anonymization: Original features transformed via PCA for privacy compliance

Bias Consideration: Geographic/cultural biases possible in source data

Potential Use Cases

🏦 Banking: Real-time transaction monitoring systems

📱 FinTech Apps: Fraud detection APIs for payment gateways

🎓 Education: Imbalanced classification tutorials

🏆 Kaggle Competitions: Lightweight fraud detection challenge

Example Project Idea "Minimalist Fraud Detector":

# python from imblearn.pipeline import make_pipeline from sklearn.ensemble import RandomForestClassifier model = make_pipeline( RobustScaler(), SMOTE(sampling_strategy=0.3), RandomForestClassifier(class_weight={0:1, 1:15}) ) Optimize for: Recall @ Precision > 0.85

Dataset Summary markdown | Feature | Mean | Std | Min | Max | |----------|----------|----------|-----------|-----------| | V1 | -0.11 | 1.02 | -3.24 | 3.85 | | V2 | 0.05 | 1.01 | -2.94 | 2.60 | | V3 | 0.02 | 0.98 | -3.02 | 2.95 |
| Amount | 250.32 | 190.19 | 0.20 | 1916.06 |

Facebook

Twitter

Click to copy link

Link copied

Cite

Blessa Binolin M (2023). Imbalanced Data [Dataset]. https://ieee-dataport.org/documents/imbalanced-data-0

Imbalanced Data

Explore at:

Dataset updated

Aug 23, 2023

Authors

Blessa Binolin M

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Classification learning on non-stationary data may face dynamic changes from time to time. The major problem in it is the class imbalance and high cost of labeling instances despite drifts. Imbalance is due to lower number of samples in the minority class than the majority class. Imbalanced data results in the misclassification of data points.

Clear search

Close search

Google apps

Main menu

Imbalanced Data

Performance comparison of machine learning models across accuracy, AUC, MCC,...

Tackling Class Imbalance with Ranking - Dataset - CKAN

Over-sampled dataset.

Dataset for Imbalance Class Problem

Dataset

Contents

Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in...

Data from: Addressing Imbalanced Classification Problems in Drug Discovery...

Dataset: The effects of class balance on the training energy consumption of...

UVP5 data sorted with EcoTaxa and MorphoCluster

Safety dataset

Credit Card Fraud Detection Dataset

Credit Card Fraud Detection Dataset (European Cardholders, September 2013)

GMSC dataset (IR: Imbalance Ratio).

The definition of a confusion matrix.

Balance Class Dataset

Balance Class

Credit scoring with class imbalance data: An out-of-sample and out-of-time...

Imbalance Classes(Sentiment Analysis) Dataset

Class Balance Class 2 Dataset

Class Balance Class 2

Replication package for "An Empirical Assessment of Best-Answer Prediction...

Data_Sheet 1_Prediction Is a Balancing Act: Importance of Sampling Methods...

CerditCard fraud dataset

Dataset Overview

Features Description

Fraud Indicators:

Sample fraud cases:

Data Characteristics:

Analysis Challenges

Imbalanced Data