Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Classification learning on non-stationary data may face dynamic changes from time to time. The major problem in it is the class imbalance and high cost of labeling instances despite drifts. Imbalance is due to lower number of samples in the minority class than the majority class. Imbalanced data results in the misclassification of data points.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Performance comparison of machine learning models across accuracy, AUC, MCC, and F1 score on GMSC dataset.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The dataset comes originally from UCI Machine Learning. The multiclass datasets were transformed in binary classification as mentioned in the paper. Ranking methods were applied to improve class imbalance. The datasets are divided in 30 folds so that other class imbalance methods can be compared to the methods in the paper. The code used in the paper is also provided.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Credit scoring models play a crucial role for financial institutions in evaluating borrower risk and sustaining profitability. Logistic regression is widely used in credit scoring due to its robustness, interpretability, and computational efficiency; however, its predictive power decreases when applied to complex or non-linear datasets, resulting in reduced accuracy. In contrast, tree-based machine learning models often provide enhanced predictive performance but struggle with interpretability. Furthermore, imbalanced class distributions, which are prevalent in credit scoring, can adversely impact model accuracy and robustness, as the majority class tends to dominate. Despite these challenges, research that comprehensively addresses both the predictive performance and explainability aspects within the credit scoring domain remains limited. This paper introduces the Non-pArameTric oversampling approach for Explainable credit scoring (NATE), a framework designed to address these challenges by combining oversampling techniques with tree-based classifiers to enhance model performance and interpretability. NATE incorporates class balancing methods to mitigate the impact of imbalanced data distributions and integrates interpretability features to elucidate the model’s decision-making process. Experimental results show that NATE substantially outperforms traditional logistic regression in credit risk classification, with improvements of 19.33% in AUC, 71.56% in MCC, and 85.33% in F1 Score. Oversampling approaches, particularly when used with gradient boosting, demonstrated superior effectiveness compared to undersampling, achieving optimal metrics of AUC: 0.9649, MCC: 0.8104, and F1 Score: 0.9072. Moreover, NATE enhances interpretability by providing detailed insights into feature contributions, aiding in understanding individual predictions. These findings highlight NATE’s capability in managing class imbalance, improving predictive performance, and enhancing model interpretability, demonstrating its potential as a reliable and transparent tool for credit scoring applications.
This dataset was created by Akalya Subramanian
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Aim: In neuroscience research, data are quite often characterized by an imbalanced distribution between the majority and minority classes, an issue that can limit or even worsen the prediction performance of machine learning methods. Different resampling procedures have been developed to face this problem and a lot of work has been done in comparing their effectiveness in different scenarios. Notably, the robustness of such techniques has been tested among a wide variety of different datasets, without considering the performance of each specific dataset. In this study, we compare the performances of different resampling procedures for the imbalanced domain in stereo-electroencephalography (SEEG) recordings of the patients with focal epilepsies who underwent surgery.Methods: We considered data obtained by network analysis of interictal SEEG recorded from 10 patients with drug-resistant focal epilepsies, for a supervised classification problem aimed at distinguishing between the epileptogenic and non-epileptogenic brain regions in interictal conditions. We investigated the effectiveness of five oversampling and five undersampling procedures, using 10 different machine learning classifiers. Moreover, six specific ensemble methods for the imbalanced domain were also tested. To compare the performances, Area under the ROC curve (AUC), F-measure, Geometric Mean, and Balanced Accuracy were considered.Results: Both the resampling procedures showed improved performances with respect to the original dataset. The oversampling procedure was found to be more sensitive to the type of classification method employed, with Adaptive Synthetic Sampling (ADASYN) exhibiting the best performances. All the undersampling approaches were more robust than the oversampling among the different classifiers, with Random Undersampling (RUS) exhibiting the best performance despite being the simplest and most basic classification method.Conclusions: The application of machine learning techniques that take into consideration the balance of features by resampling is beneficial and leads to more accurate localization of the epileptogenic zone from interictal periods. In addition, our results highlight the importance of the type of classification method that must be used together with the resampling to maximize the benefit to the outcome.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The classification models built on class imbalanced data sets tend to prioritize the accuracy of the majority class, and thus, the minority class generally has a higher misclassification rate. Different techniques are available to address the class imbalance in classification models and can be categorized as data-level, algorithm-level, and hybrid methods. But to the best of our knowledge, an in-depth analysis of the performance of these techniques against the class ratio is not available in the literature. We have addressed these shortcomings in this study and have performed a detailed analysis of the performance of four different techniques to address imbalanced class distribution using machine learning (ML) methods and AutoML tools. To carry out our study, we have selected four such techniques(a) threshold optimization using (i) GHOST and (ii) the area under the precision–recall curve (AUPR) curve, (b) internal balancing method of AutoML and class-weight of machine learning methods, and (c) data balancing using SMOTETomekand generated 27 data sets considering nine different class ratios (i.e., the ratio of the positive class and total samples) from three data sets that belong to the drug discovery and development field. We have employed random forest (RF) and support vector machine (SVM) as representatives of ML classifier and AutoGluon-Tabular (version 0.6.1) and H2O AutoML (version 3.40.0.4) as representatives of AutoML tools. The important findings of our studies are as follows: (i) there is no effect of threshold optimization on ranking metrics such as AUC and AUPR, but AUC and AUPR get affected by class-weighting and SMOTTomek; (ii) for ML methods RF and SVM, significant percentage improvement up to 375, 33.33, and 450 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy, which are suitable for performance evaluation of imbalanced data sets; (iii) for AutoML libraries AutoGluon-Tabular and H2O AutoML, significant percentage improvement up to 383.33, 37.25, and 533.33 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy; (iv) the general pattern of percentage improvement in balanced accuracy is that the percentage improvement increases when the class ratio is systematically decreased from 0.5 to 0.1; in the case of F1 score and MCC, maximum improvement is achieved at the class ratio of 0.3; (v) for both ML and AutoML with balancing, it is observed that any individual class-balancing technique does not outperform all other methods on a significantly higher number of data sets based on F1 score; (vi) the three external balancing techniques combined outperformed the internal balancing methods of the ML and AutoML; (vii) AutoML tools perform as good as the ML models and in some cases perform even better for handling imbalanced classification when applied with imbalance handling techniques. In summary, exploration of multiple data balancing techniques is recommended for classifying imbalanced data sets to achieve optimal performance as neither of the external techniques nor the internal techniques outperform others significantly. The results are specific to the ML methods and AutoML libraries used in this study, and for generalization, a study can be carried out considering a sizable number of ML methods and AutoML libraries.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Two synthetic datasets for binary classification, generated with the Random Radial Basis Function generator from WEKA. They are the same shape and size (104.952 instances, 185 attributes), but the "balanced" dataset has 52,13% of its instances belonging to class c0, while the "unbalanced" one only has 4,04% of its instances belonging to class c0. Therefore, this set of datasets is primarily meant to study how class balance influences the behaviour of a machine learning model.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
here, we provide plankton image data that was sorted with the web applications ecotaxa and morphocluster. the data set was used for image classification tasks as described in schröder et. al (in preparation) and does not include any geospatial or temporal meta-data.plankton was imaged using the underwater vision profiler 5 (picheral et al. 2010) in various regions of the world's oceans between 2012-10-24 and 2017-08-08.this data publication consists of an archive containing "training.csv" (list of 392k training images for classification, validated using ecotaxa), "validation.csv" (list of 196k validation images for classification, validated using ecotaxa), "unlabeld.csv" (list of 1m unlabeled images), "morphocluster.csv" (1.2m objects validated using morphocluster, a subset of "unlabeled.csv" and "validation.csv") and the image files themselves. the csv files each contain the columns "object_id" (a unique id), "image_fn" (the relative filename), and "label" (the assigned name).the training and validation sets were sorted into 65 classes using the web application ecotaxa (http://ecotaxa.obs-vlfr.fr). this data shows a severe class imbalance; the 10% most populated classes contain more than 80% of the objects and the class sizes span four orders of magnitude. the validation set and a set of additional 1m unlabeled images were sorted during the first trial of morphocluster (https://github.com/morphocluster).the images in this data set were sampled during rv meteor cruises m92, m93, m96, m97, m98, m105, m106, m107, m108, m116, m119, m121, m130, m131, m135, m136, m137 and m138, during rv maria s merian cruises msm22, msm23, msm40 and msm49, during the rv polarstern cruise ps88b and during the fluxes1 experiment with rv sarmiento de gamboa.the following people have contributed to the sorting of the image data on ecotaxa:rainer kiko, tristan biard, benjamin blanc, svenja christiansen, justine courboules, charlotte eich, jannik faustmann, christine gawinski, augustin lafond, aakash panchal, marc picheral, akanksha singh and helena haussin schröder et al. (in preparation), the training set serves as a source for knowledge transfer in the training of the feature extractor. the classification using morphocluster was conducted by rainer kiko. used labels are operational and not yet matched to respective ecotaxa classes.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Three imbalanced safety datasets are established and shared: a 9-year nationwide construction accident dataset in Singapore, an accident and safety management dataset in a major development project in Singapore, and a US truck driver safety climate survey dataset. The paper link: https://doi.org/10.1016/j.knosys.2025.114120.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
As a data contributor, I'm sharing this crucial dataset focused on the detection of fraudulent credit card transactions. Recognizing these illicit activities is paramount for protecting customers and the integrity of financial systems.
About the Dataset:
This dataset encompasses credit card transactions made by European cardholders during a two-day period in September 2013. It presents a real-world scenario with a significant class imbalance, where fraudulent transactions are considerably less frequent than legitimate ones. Out of a total of 284,807 transactions, only 492 are instances of fraud, representing a mere 0.172% of the entire dataset.
Content of the Data:
Due to confidentiality concerns, the majority of the input features in this dataset have undergone a Principal Component Analysis (PCA) transformation. This means the original meaning and context of features V1, V2, ..., V28 are not directly provided. However, these principal components capture the variance in the underlying transaction data.
The only features that have not been transformed by PCA are:
The target variable for this classification task is:
Important Note on Evaluation:
Given the substantial class imbalance (far more legitimate transactions than fraudulent ones), traditional accuracy metrics based on the confusion matrix can be misleading. It is strongly recommended to evaluate models using the Area Under the Precision-Recall Curve (AUPRC), as this metric is more sensitive to the performance on the minority class (fraudulent transactions).
How to Use This Dataset:
Acknowledgements and Citation:
This dataset has been collected and analyzed through a research collaboration between Worldline and the Machine Learning Group (MLG) of ULB (Université Libre de Bruxelles).
When using this dataset in your research or projects, please cite the following works as appropriate:
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Credit scoring models play a crucial role for financial institutions in evaluating borrower risk and sustaining profitability. Logistic regression is widely used in credit scoring due to its robustness, interpretability, and computational efficiency; however, its predictive power decreases when applied to complex or non-linear datasets, resulting in reduced accuracy. In contrast, tree-based machine learning models often provide enhanced predictive performance but struggle with interpretability. Furthermore, imbalanced class distributions, which are prevalent in credit scoring, can adversely impact model accuracy and robustness, as the majority class tends to dominate. Despite these challenges, research that comprehensively addresses both the predictive performance and explainability aspects within the credit scoring domain remains limited. This paper introduces the Non-pArameTric oversampling approach for Explainable credit scoring (NATE), a framework designed to address these challenges by combining oversampling techniques with tree-based classifiers to enhance model performance and interpretability. NATE incorporates class balancing methods to mitigate the impact of imbalanced data distributions and integrates interpretability features to elucidate the model’s decision-making process. Experimental results show that NATE substantially outperforms traditional logistic regression in credit risk classification, with improvements of 19.33% in AUC, 71.56% in MCC, and 85.33% in F1 Score. Oversampling approaches, particularly when used with gradient boosting, demonstrated superior effectiveness compared to undersampling, achieving optimal metrics of AUC: 0.9649, MCC: 0.8104, and F1 Score: 0.9072. Moreover, NATE enhances interpretability by providing detailed insights into feature contributions, aiding in understanding individual predictions. These findings highlight NATE’s capability in managing class imbalance, improving predictive performance, and enhancing model interpretability, demonstrating its potential as a reliable and transparent tool for credit scoring applications.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In recent years, the challenge of imbalanced data has become increasingly prominent in machine learning, affecting the performance of classification algorithms. This study proposes a novel data-level oversampling method called Cluster-Based Reduced Noise SMOTE (CRN-SMOTE) to address this issue. CRN-SMOTE combines SMOTE for oversampling minority classes with a novel cluster-based noise reduction technique. In this cluster-based noise reduction approach, it is crucial that samples from each category form one or two clusters, a feature that conventional noise reduction methods do not achieve. The proposed method is evaluated on four imbalanced datasets (ILPD, QSAR, Blood, and Maternal Health Risk) using five metrics: Cohen’s kappa, Matthew’s correlation coefficient (MCC), F1-score, precision, and recall. Results demonstrate that CRN-SMOTE consistently outperformed the state-of-the-art Reduced Noise SMOTE (RN-SMOTE), SMOTE-Tomek Link, and SMOTE-ENN methods across all datasets, with particularly notable improvements observed in the QSAR and Maternal Health Risk datasets, indicating its effectiveness in enhancing imbalanced classification performance. Overall, the experimental findings indicate that CRN-SMOTE outperformed RN-SMOTE in 100% of the cases, achieving average improvements of 6.6% in Kappa, 4.01% in MCC, 1.87% in F1-score, 1.7% in precision, and 2.05% in recall, with setting SMOTE’s neighbors’ number to 5.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Balance Class is a dataset for object detection tasks - it contains Masks annotations for 1,420 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The raw datasets provided here are intended for use in a Data in Brief article. These comprehensive files, sourced from the Freddie Mac website, offer quarterly snapshots of mortgage loans that have been originated in the USA since 1999, along with details of their subsequent repayment behaviours. This data remains current and is updated every three months. Specifically, the loan origination data present here encompasses amortized fixed-rate mortgage loans from 1999 up to June 2022. In contrast, the performance data is presented on a monthly basis, detailing loan repayment profiles from 1999 until September 30, 2022. Both the origination and performance datasets feature a unique loan ID, which can be utilized to integrate the data on loan originations with that of loan repayments.
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
We have 3 classes(Negative, Positive, Neutral) in initial stage we have only 620 review sentences (150, 406, 64) respectively. By using Semi-Supervised Learning Techniques we generate dataset for Imbalance classes Problem.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Class Balance Class 2 is a dataset for object detection tasks - it contains Mask annotations for 323 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
Replication package for the paper:
F. Calefato, F. Lanubile, and N. Novielli (2018) “An Empirical Assessment of Best-Answer Prediction Models in Technical Q&A Sites.” Empirical Software Engineering Journal, DOI: 10.1007/s10664-018-9642-5.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Increase in the number of new chemicals synthesized in past decades has resulted in constant growth in the development and application of computational models for prediction of activity as well as safety profiles of the chemicals. Most of the time, such computational models and its application must deal with imbalanced chemical data. It is indeed a challenge to construct a classifier using imbalanced data set. In this study, we analyzed and validated the importance of different sampling methods over non-sampling method, to achieve a well-balanced sensitivity and specificity of a machine learning model trained on imbalanced chemical data. Additionally, this study has achieved an accuracy of 93.00%, an AUC of 0.94, F1 measure of 0.90, sensitivity of 96.00% and specificity of 91.00% using SMOTE sampling and Random Forest classifier for the prediction of Drug Induced Liver Injury (DILI). Our results suggest that, irrespective of data set used, sampling methods can have major influence on reducing the gap between sensitivity and specificity of a model. This study demonstrates the efficacy of different sampling methods for class imbalanced problem using binary chemical data sets.
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Credit Card Fraud Detection Dataset
Uncover fraudulent transactions with this anonymized, PCA-transformed dataset. Perfect for building and testing fraud detection algorithms!
Objective: Detect fraudulent credit card transactions using anonymized features- - - -
Samples: 1,000 transactions
Features: 7 columns (5 PCA components + Transaction Amount + Target)
Class Distribution:
Legit (Class 0): 993 transactions (~99.3%)
Fraud (Class 1): 7 transactions (~0.7%)
Key Challenge: Extreme class imbalance – realistic representation of fraud patterns
Feature Description Characteristics
V1-V5 Anonymized principal components PCA-transformed numerical features; preserves >transaction patterns while hiding sensitive details Amount Transaction value Highly variable (min: $0.20, max: $1,916.06); critical for fraud analysis Class Target variable Binary labels: • 0 = Legitimate transaction • 1 = Fraudulent transaction Key Insights & Patterns
Fraud Indicators:
Fraudulent transactions occur across diverse amounts (low: $1.83 → high: $1,916)
No obvious amount threshold for fraud – requires nuanced modeling
V1:0.579, V2:-0.384, Amount:1916.06
V1:1.023, V2:-0.638, Amount:1094.42
V1-V5 Distributions:
V1: Concentrated near zero (mean ≈ -0.1)
V2: Wider spread (mean ≈ 0.05)
V3-V5: Asymmetric distributions
Amount Distribution:
2.Fraud cases span low and high values
Class Imbalance:
- Severe skew: 993:7 legit-to-fraud ratio
- Models must optimize for recall/precision over accuracy
⚠️ Class Imbalance: Standard accuracy metrics misleading
🔍 Feature Interpretation: PCA components lack real-world context
📊 Non-linear Patterns: Complex interactions between V1-V5
⚡ High Stakes: False negatives (missed fraud) costlier than false positives
Recommended Applications Fraud Detection Models:
Logistic Regression (with class weighting)
Random Forests / XGBoost (handle non-linearities)
Isolation Forests (anomaly detection)
Evaluation Focus:
Precision-Recall Curves > ROC-AUC
F2-Score (prioritize recall)
Confusion matrix analysis
Advanced Techniques:
SMOTE/ADASYN for oversampling
Autoencoders for anomaly detection
Feature engineering: Amount-to-Var ratios
Dataset Source & Ethics Origin: Synthetic dataset mirroring real-world financial patterns
Anonymization: Original features transformed via PCA for privacy compliance
Bias Consideration: Geographic/cultural biases possible in source data
Potential Use Cases
🏦 Banking: Real-time transaction monitoring systems
📱 FinTech Apps: Fraud detection APIs for payment gateways
🎓 Education: Imbalanced classification tutorials
🏆 Kaggle Competitions: Lightweight fraud detection challenge
Example Project Idea "Minimalist Fraud Detector":
# python
from imblearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
model = make_pipeline(
RobustScaler(),
SMOTE(sampling_strategy=0.3),
RandomForestClassifier(class_weight={0:1, 1:15})
)
Optimize for: Recall @ Precision > 0.85
Dataset Summary
markdown
| Feature | Mean | Std | Min | Max |
|----------|----------|----------|-----------|-----------|
| V1 | -0.11 | 1.02 | -3.24 | 3.85 |
| V2 | 0.05 | 1.01 | -2.94 | 2.60 |
| V3 | 0.02 | 0.98 | -3.02 | 2.95 |
| Amount | 250.32 | 190.19 | 0.20 | 1916.06 |
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Classification learning on non-stationary data may face dynamic changes from time to time. The major problem in it is the class imbalance and high cost of labeling instances despite drifts. Imbalance is due to lower number of samples in the minority class than the majority class. Imbalanced data results in the misclassification of data points.