Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Classification learning on non-stationary data may face dynamic changes from time to time. The major problem in it is the class imbalance and high cost of labeling instances despite drifts. Imbalance is due to lower number of samples in the minority class than the majority class. Imbalanced data results in the misclassification of data points.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset file is used for the study of imbalanced data and contains 6 imbalanced datasets
Open Database License (ODbL) v1.0https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Imbalanced dataset for benchmarking
=======================
The different algorithms of the `imbalanced-learn` toolbox are evaluated on a set of common dataset, which are more or less balanced. These benchmark have been proposed in [1]. The following section presents the main characteristics of this benchmark.
Characteristics
-------------------
|ID |Name |Repository & Target |Ratio |# samples| # features |
|:---:|:----------------------:|--------------------------------------|:------:|:-------------:|:--------------:|
|1 |Ecoli |UCI, target: imU |8.6:1 |336 |7 |
|2 |Optical Digits |UCI, target: 8 |9.1:1 |5,620 |64 |
|3 |SatImage |UCI, target: 4 |9.3:1 |6,435 |36 |
|4 |Pen Digits |UCI, target: 5 |9.4:1 |10,992 |16 |
|5 |Abalone |UCI, target: 7 |9.7:1 |4,177 |8 |
|6 |Sick Euthyroid |UCI, target: sick euthyroid |9.8:1 |3,163 |25 |
|7 |Spectrometer |UCI, target: >=44 |11:1 |531 |93 |
|8 |Car_Eval_34 |UCI, target: good, v good |12:1 |1,728 |6 |
|9 |ISOLET |UCI, target: A, B |12:1 |7,797 |617 |
|10 |US Crime |UCI, target: >0.65 |12:1 |1,994 |122 |
|11 |Yeast_ML8 |LIBSVM, target: 8 |13:1 |2,417 |103 |
|12 |Scene |LIBSVM, target: >one label |13:1 |2,407 |294 |
|13 |Libras Move |UCI, target: 1 |14:1 |360 |90 |
|14 |Thyroid Sick |UCI, target: sick |15:1 |3,772 |28 |
|15 |Coil_2000 |KDD, CoIL, target: minority |16:1 |9,822 |85 |
|16 |Arrhythmia |UCI, target: 06 |17:1 |452 |279 |
|17 |Solar Flare M0 |UCI, target: M->0 |19:1 |1,389 |10 |
|18 |OIL |UCI, target: minority |22:1 |937 |49 |
|19 |Car_Eval_4 |UCI, target: vgood |26:1 |1,728 |6 |
|20 |Wine Quality |UCI, wine, target: <=4 |26:1 |4,898 |11 |
|21 |Letter Img |UCI, target: Z |26:1 |20,000 |16 |
|22 |Yeast _ME2 |UCI, target: ME2 |28:1 |1,484 |8 |
|23 |Webpage |LIBSVM, w7a, target: minority|33:1 |49,749 |300 |
|24 |Ozone Level |UCI, ozone, data |34:1 |2,536 |72 |
|25 |Mammography |UCI, target: minority |42:1 |11,183 |6 |
|26 |Protein homo. |KDD CUP 2004, minority |111:1|145,751 |74 |
|27 |Abalone_19 |UCI, target: 19 |130:1|4,177 |8 |
References
----------
[1] Ding, Zejin, "Diversified Ensemble Classifiers for H
ighly Imbalanced Data Learning and their Application in Bioinformatics." Dissertation, Georgia State University, (2011).
[2] Blake, Catherine, and Christopher J. Merz. "UCI Repository of machine learning databases." (1998).
[3] Chang, Chih-Chung, and Chih-Jen Lin. "LIBSVM: a library for support vector machines." ACM Transactions on Intelligent Systems and Technology (TIST) 2.3 (2011): 27.
[4] Caruana, Rich, Thorsten Joachims, and Lars Backstrom. "KDD-Cup 2004: results and analysis." ACM SIGKDD Explorations Newsletter 6.2 (2004): 95-108.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Imbalanced Data is a dataset for object detection tasks - it contains Lego Bricks annotations for 888 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Performance comparison of machine learning models across accuracy, AUC, MCC, and F1 score on GMSC dataset.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In recent years, the challenge of imbalanced data has become increasingly prominent in machine learning, affecting the performance of classification algorithms. This study proposes a novel data-level oversampling method called Cluster-Based Reduced Noise SMOTE (CRN-SMOTE) to address this issue. CRN-SMOTE combines SMOTE for oversampling minority classes with a novel cluster-based noise reduction technique. In this cluster-based noise reduction approach, it is crucial that samples from each category form one or two clusters, a feature that conventional noise reduction methods do not achieve. The proposed method is evaluated on four imbalanced datasets (ILPD, QSAR, Blood, and Maternal Health Risk) using five metrics: Cohen’s kappa, Matthew’s correlation coefficient (MCC), F1-score, precision, and recall. Results demonstrate that CRN-SMOTE consistently outperformed the state-of-the-art Reduced Noise SMOTE (RN-SMOTE), SMOTE-Tomek Link, and SMOTE-ENN methods across all datasets, with particularly notable improvements observed in the QSAR and Maternal Health Risk datasets, indicating its effectiveness in enhancing imbalanced classification performance. Overall, the experimental findings indicate that CRN-SMOTE outperformed RN-SMOTE in 100% of the cases, achieving average improvements of 6.6% in Kappa, 4.01% in MCC, 1.87% in F1-score, 1.7% in precision, and 2.05% in recall, with setting SMOTE’s neighbors’ number to 5.
A dataset created from a higher education institution (acquired from several disjoint databases) related to students enrolled in different undergraduate degrees, such as agronomy, design, education, nursing, journalism, management, social service, and technologies.
The dataset includes information known at the time of student enrollment (academic path, demographics, and social-economic factors) and the students' academic performance at the end of the first and second semesters.
The data is used to build classification models to predict students' dropout and academic success. The problem is formulated as a three category classification task (dropout, enrolled, and graduate) at the end of the normal duration of the course.
Funding
We acknowledge support of this work by the program "SATDAP - Capacitação da Administração Pública under grant POCI-05-5762-FSE-000191, Portugal"
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The dataset comes originally from UCI Machine Learning. The multiclass datasets were transformed in binary classification as mentioned in the paper. Ranking methods were applied to improve class imbalance. The datasets are divided in 30 folds so that other class imbalance methods can be compared to the methods in the paper. The code used in the paper is also provided.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Credit scoring models play a crucial role for financial institutions in evaluating borrower risk and sustaining profitability. Logistic regression is widely used in credit scoring due to its robustness, interpretability, and computational efficiency; however, its predictive power decreases when applied to complex or non-linear datasets, resulting in reduced accuracy. In contrast, tree-based machine learning models often provide enhanced predictive performance but struggle with interpretability. Furthermore, imbalanced class distributions, which are prevalent in credit scoring, can adversely impact model accuracy and robustness, as the majority class tends to dominate. Despite these challenges, research that comprehensively addresses both the predictive performance and explainability aspects within the credit scoring domain remains limited. This paper introduces the Non-pArameTric oversampling approach for Explainable credit scoring (NATE), a framework designed to address these challenges by combining oversampling techniques with tree-based classifiers to enhance model performance and interpretability. NATE incorporates class balancing methods to mitigate the impact of imbalanced data distributions and integrates interpretability features to elucidate the model’s decision-making process. Experimental results show that NATE substantially outperforms traditional logistic regression in credit risk classification, with improvements of 19.33% in AUC, 71.56% in MCC, and 85.33% in F1 Score. Oversampling approaches, particularly when used with gradient boosting, demonstrated superior effectiveness compared to undersampling, achieving optimal metrics of AUC: 0.9649, MCC: 0.8104, and F1 Score: 0.9072. Moreover, NATE enhances interpretability by providing detailed insights into feature contributions, aiding in understanding individual predictions. These findings highlight NATE’s capability in managing class imbalance, improving predictive performance, and enhancing model interpretability, demonstrating its potential as a reliable and transparent tool for credit scoring applications.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Derived from public unbalanced data sets
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data sets supporting the results reported in the paper: Hellinger Distance Trees for Imbalanced Streams, R. J. Lyon, J.M. Brooke, J.D. Knowles, B.W Stappers, 22nd International Conference on Pattern Recognition (ICPR), p.1969 - 1974, 2014. DOI: 10.1109/ICPR.2014.344 Contained in this distribution are results of stream classifier perfromance on four different data sets. Also included are the test results from our attempt at reproducing the outcome of the paper, Learning Decision Trees for Un-balanced Data, D. A. Cieslak and N. V. Chawla, in Machine Learning and Knowledge Discovery in Databases (W. Daelemans, B. Goethals, and K. Morik, eds.), vol. 5211 of LNCS, pp. 241-256, 2008. The data sets used for these experiments include, MAGIC Gamma Telescope Data Set : https://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+TelescopeMiniBooNE particle identification Data Set : https://archive.ics.uci.edu/ml/datasets/MiniBooNE+particle+identificationSkin Segmentation Data Set : https://archive.ics.uci.edu/ml/datasets/Skin+SegmentationLetter Recognition Data Set : https://archive.ics.uci.edu/ml/datasets/Letter+RecognitionPen-Based Recognition of Handwritten Digits Data Set : https://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+DigitsStatlog (Landsat Satellite) Data Set : https://archive.ics.uci.edu/ml/datasets/Statlog+(Landsat+Satellite)Statlog (Image Segmentation) Data Set : https://archive.ics.uci.edu/ml/datasets/Statlog+(Image+Segmentation) A further data set used is not publicly available at present. However we are in the process of releasing it for public use. Please get in touch if you'd like to use it.
A readme file accompanies the data describing it in more detail.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The classification models built on class imbalanced data sets tend to prioritize the accuracy of the majority class, and thus, the minority class generally has a higher misclassification rate. Different techniques are available to address the class imbalance in classification models and can be categorized as data-level, algorithm-level, and hybrid methods. But to the best of our knowledge, an in-depth analysis of the performance of these techniques against the class ratio is not available in the literature. We have addressed these shortcomings in this study and have performed a detailed analysis of the performance of four different techniques to address imbalanced class distribution using machine learning (ML) methods and AutoML tools. To carry out our study, we have selected four such techniques(a) threshold optimization using (i) GHOST and (ii) the area under the precision–recall curve (AUPR) curve, (b) internal balancing method of AutoML and class-weight of machine learning methods, and (c) data balancing using SMOTETomekand generated 27 data sets considering nine different class ratios (i.e., the ratio of the positive class and total samples) from three data sets that belong to the drug discovery and development field. We have employed random forest (RF) and support vector machine (SVM) as representatives of ML classifier and AutoGluon-Tabular (version 0.6.1) and H2O AutoML (version 3.40.0.4) as representatives of AutoML tools. The important findings of our studies are as follows: (i) there is no effect of threshold optimization on ranking metrics such as AUC and AUPR, but AUC and AUPR get affected by class-weighting and SMOTTomek; (ii) for ML methods RF and SVM, significant percentage improvement up to 375, 33.33, and 450 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy, which are suitable for performance evaluation of imbalanced data sets; (iii) for AutoML libraries AutoGluon-Tabular and H2O AutoML, significant percentage improvement up to 383.33, 37.25, and 533.33 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy; (iv) the general pattern of percentage improvement in balanced accuracy is that the percentage improvement increases when the class ratio is systematically decreased from 0.5 to 0.1; in the case of F1 score and MCC, maximum improvement is achieved at the class ratio of 0.3; (v) for both ML and AutoML with balancing, it is observed that any individual class-balancing technique does not outperform all other methods on a significantly higher number of data sets based on F1 score; (vi) the three external balancing techniques combined outperformed the internal balancing methods of the ML and AutoML; (vii) AutoML tools perform as good as the ML models and in some cases perform even better for handling imbalanced classification when applied with imbalance handling techniques. In summary, exploration of multiple data balancing techniques is recommended for classifying imbalanced data sets to achieve optimal performance as neither of the external techniques nor the internal techniques outperform others significantly. The results are specific to the ML methods and AutoML libraries used in this study, and for generalization, a study can be carried out considering a sizable number of ML methods and AutoML libraries.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Imbalanced Class
This dataset was created by Akalya Subramanian
The different algorithms of the "imbalanced-learn" toolbox are evaluated on a set of common dataset, which are more or less balanced. These benchmark have been proposed in Ding, Zejin, "Diversified Ensemble Classifiers for H ighly Imbalanced Data Learning and their Application in Bioinformatics." Dissertation, Georgia State University, (2011)
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This record contains the underlying research data for the publication "High impact bug report identification with imbalanced learning strategies" and the full-text is available from: https://ink.library.smu.edu.sg/sis_research/3702In practice, some bugs have more impact than others and thus deserve more immediate attention. Due to tight schedule and limited human resources, developers may not have enough time to inspect all bugs. Thus, they often concentrate on bugs that are highly impactful. In the literature, high-impact bugs are used to refer to the bugs which appear at unexpected time or locations and bring more unexpected effects (i.e., surprise bugs), or break pre-existing functionalities and destroy the user experience (i.e., breakage bugs). Unfortunately, identifying high-impact bugs from thousands of bug reports in a bug tracking system is not an easy feat. Thus, an automated technique that can identify high-impact bug reports can help developers to be aware of them early, rectify them quickly, and minimize the damages they cause. Considering that only a small proportion of bugs are high-impact bugs, the identification of high-impact bug reports is a difficult task. In this paper, we propose an approach to identify high-impact bug reports by leveraging imbalanced learning strategies. We investigate the effectiveness of various variants, each of which combines one particular imbalanced learning strategy and one particular classification algorithm. In particular, we choose four widely used strategies for dealing with imbalanced data and four state-of-the-art text classification algorithms to conduct experiments on four datasets from four different open source projects. We mainly perform an analytical study on two types of high-impact bugs, i.e., surprise bugs and breakage bugs. The results show that different variants have different performances, and the best performing variants SMOTE (synthetic minority over-sampling technique) + KNN (K-nearest neighbours) for surprise bug identification and RUS (random under-sampling) + NB (naive Bayes) for breakage bug identification outperform the F1-scores of the two state-of-the-art approaches by Thung et al. and Garcia and Shihab.Supplementary code and data available from GitHub:
https://choosealicense.com/licenses/cc0-1.0/https://choosealicense.com/licenses/cc0-1.0/
codesignal/imbalanced-data-practice dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset contains all the simulation results on the effect of ensemble models in dealing with data imbalance. The simulations are performed with sample size n=2000, number of variables p=200, and number of groups k=20 under six imbalanced scenarios. It shows the result of ensemble models with threshold from [0, 0.05, 0.1, ..., 0.95, 1.0], in terms of the overall AP/AR and discrete (continuous) specific AP/AR. This dataset serves as a reference for practitioners to find the appropriate ensemble threshold that fits their business needs the best.
Comparison of the classification accuracy of different class imbalance models with classifiers on LIHC dataset.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary table: Oversampling techniques using SMOTE, ADASYN, and weighted rare classes.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Classification learning on non-stationary data may face dynamic changes from time to time. The major problem in it is the class imbalance and high cost of labeling instances despite drifts. Imbalance is due to lower number of samples in the minority class than the majority class. Imbalanced data results in the misclassification of data points.