100+ datasets found

f
Data from: GHOST: Adjusting the Decision Threshold to Handle Imbalanced Data...
acs.figshare.com
zip
Updated Jun 2, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Carmen Esposito; Gregory A. Landrum; Nadine Schneider; Nikolaus Stiefl; Sereina Riniker (2023). GHOST: Adjusting the Decision Threshold to Handle Imbalanced Data in Machine Learning [Dataset]. http://doi.org/10.1021/acs.jcim.1c00160.s002
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.1021/acs.jcim.1c00160.s002
Dataset updated
Jun 2, 2023
Dataset provided by
ACS Publications
Authors
Carmen Esposito; Gregory A. Landrum; Nadine Schneider; Nikolaus Stiefl; Sereina Riniker
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Machine learning classifiers trained on class imbalanced data are prone to overpredict the majority class. This leads to a larger misclassification rate for the minority class, which in many real-world applications is the class of interest. For binary data, the classification threshold is set by default to 0.5 which, however, is often not ideal for imbalanced data. Adjusting the decision threshold is a good strategy to deal with the class imbalance problem. In this work, we present two different automated procedures for the selection of the optimal decision threshold for imbalanced classification. A major advantage of our procedures is that they do not require retraining of the machine learning models or resampling of the training data. The first approach is specific for random forest (RF), while the second approach, named GHOST, can be potentially applied to any machine learning classifier. We tested these procedures on 138 public drug discovery data sets containing structure–activity data for a variety of pharmaceutical targets. We show that both thresholding methods improve significantly the performance of RF. We tested the use of GHOST with four different classifiers in combination with two molecular descriptors, and we found that most classifiers benefit from threshold optimization. GHOST also outperformed other strategies, including random undersampling and conformal prediction. Finally, we show that our thresholding procedures can be effectively applied to real-world drug discovery projects, where the imbalance and characteristics of the data vary greatly between the training and test sets.
Data from: A virtual multi-label approach to imbalanced data classification
tandf.figshare.com
text/x-tex
Updated Feb 28, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Elizabeth P. Chou; Shan-Ping Yang (2024). A virtual multi-label approach to imbalanced data classification [Dataset]. http://doi.org/10.6084/m9.figshare.19390561.v1
Explore at:
text/x-texAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.19390561.v1
Dataset updated
Feb 28, 2024
Dataset provided by
Taylor & Francishttps://taylorandfrancis.com/
Authors
Elizabeth P. Chou; Shan-Ping Yang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
One of the most challenging issues in machine learning is imbalanced data analysis. Usually, in this type of research, correctly predicting minority labels is more critical than correctly predicting majority labels. However, traditional machine learning techniques easily lead to learning bias. Traditional classifiers tend to place all subjects in the majority group, resulting in biased predictions. Machine learning studies are typically conducted from one of two perspectives: a data-based perspective or a model-based perspective. Oversampling and undersampling are examples of data-based approaches, while the addition of costs, penalties, or weights to optimize the algorithm is typical of a model-based approach. Some ensemble methods have been studied recently. These methods cause various problems, such as overfitting, the omission of some information, and long computation times. In addition, these methods do not apply to all kinds of datasets. Based on this problem, the virtual labels (ViLa) approach for the majority label is proposed to solve the imbalanced problem. A new multiclass classification approach with the equal K-means clustering method is demonstrated in the study. The proposed method is compared with commonly used imbalance problem methods, such as sampling methods (oversampling, undersampling, and SMOTE) and classifier methods (SVM and one-class SVM). The results show that the proposed method performs better when the degree of data imbalance increases and will gradually outperform other methods.
Results of BILSTM for rare classes for the imbalanced dataset with different...
plos.figshare.com
xls
Updated Nov 16, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alaa Alomari; Hossam Faris; Pedro A. Castillo (2023). Results of BILSTM for rare classes for the imbalanced dataset with different reweighting factors. [Dataset]. http://doi.org/10.1371/journal.pone.0290581.t006
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0290581.t006
Dataset updated
Nov 16, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
Alaa Alomari; Hossam Faris; Pedro A. Castillo
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Results of BILSTM for rare classes for the imbalanced dataset with different reweighting factors.
Real Time Bidding
kaggle.com
zip
Updated Feb 27, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ricky (2017). Real Time Bidding [Dataset]. https://www.kaggle.com/zurfer/rtb
Explore at:
zip(144371473 bytes)Available download formats
Dataset updated
Feb 27, 2017
Authors
Ricky
Description
Context

This is real real-time bidding data that is used to predict if an advertiser should bid for a marketing slot e.g. a banner on a webpage. Explanatory variables are things like browser, operation system or time of the day the user is online, marketplace his identifiers were traded on earlier, etc. The column 'convert' is 1, when the person clicked on the ad, and 0 if this is not the case.

Content

Unfortunately, the data had to be anonymized, so you basically can't do a lot of feature engineering. I just applied PCA and kept 0.99 of the linear explanatory power. However, I think it's still really interesting data to just test your general algorithms on imbalanced data. ;)

Inspiration

Since it's heavily imbalanced data, it doesn't make sense to train for accuracy, but rather try to get obtain a good AUC, F1Score, MCC or recall rate, by cross-validating your data. It's interesting to compare different models (logistic regression, decision trees, svms, ...) over these metrics and see the impact that your split in train:test data has on the data.

It might be good strategy to follow these Tactics to combat imbalanced classes.
Imbalanced Cifar-10
kaggle.com
zip
Updated Jun 17, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Akhil Theerthala (2023). Imbalanced Cifar-10 [Dataset]. https://www.kaggle.com/datasets/akhiltheerthala/imbalanced-cifar-10
Explore at:
zip(807146485 bytes)Available download formats
Dataset updated
Jun 17, 2023
Authors
Akhil Theerthala
Description
This dataset is a modified version of the classic CIFAR 10, deliberately designed to be imbalanced across its classes. CIFAR 10 typically consists of 60,000 32x32 color images in 10 classes, with 5000 images per class in the training set. However, this dataset skews these distributions to create a more challenging environment for developing and testing machine learning algorithms. The distribution can be visualized as follows,

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F7862887%2Fae7643fe0e58a489901ce121dc2e8262%2FCifar_Imbalanced_data.png?generation=1686732867580792&alt=media" alt="">

The primary purpose of this dataset is to offer researchers and practitioners a platform to develop, test, and enhance algorithms' robustness when faced with class imbalances. It is especially suited for those interested in binary and multi-class imbalance learning, anomaly detection, and other relevant fields.

The imbalance was created synthetically, maintaining the same quality and diversity of the original CIFAR 10 dataset, but with varying degrees of representation for each class. Details of the class distributions are included in the dataset's metadata.

This dataset is beneficial for: - Developing and testing strategies for handling imbalanced datasets. - Investigating the effects of class imbalance on model performance. - Comparing different machine learning algorithms' performance under class imbalance.

Usage Information:

The dataset maintains the same format as the original CIFAR 10 dataset, making it easy to incorporate into existing projects. It is organised in a way such that the dataset can be integrated into PyTorch ImageFolder directly. You can load the dataset in Python using popular libraries like NumPy and PyTorch.

License: This dataset follows the same license terms as the original CIFAR 10 dataset. Please refer to the official CIFAR 10 website for details.

Acknowledgments: We want to acknowledge the creators of the CIFAR 10 dataset. Without their work and willingness to share data, this synthetic imbalanced dataset wouldn't be possible.
The definition of a confusion matrix.
plos.figshare.com
xls
Updated Feb 10, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). The definition of a confusion matrix. [Dataset]. http://doi.org/10.1371/journal.pone.0317396.t002
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0317396.t002
Dataset updated
Feb 10, 2025
Dataset provided by
PLOShttp://plos.org/
Authors
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In recent years, the challenge of imbalanced data has become increasingly prominent in machine learning, affecting the performance of classification algorithms. This study proposes a novel data-level oversampling method called Cluster-Based Reduced Noise SMOTE (CRN-SMOTE) to address this issue. CRN-SMOTE combines SMOTE for oversampling minority classes with a novel cluster-based noise reduction technique. In this cluster-based noise reduction approach, it is crucial that samples from each category form one or two clusters, a feature that conventional noise reduction methods do not achieve. The proposed method is evaluated on four imbalanced datasets (ILPD, QSAR, Blood, and Maternal Health Risk) using five metrics: Cohen’s kappa, Matthew’s correlation coefficient (MCC), F1-score, precision, and recall. Results demonstrate that CRN-SMOTE consistently outperformed the state-of-the-art Reduced Noise SMOTE (RN-SMOTE), SMOTE-Tomek Link, and SMOTE-ENN methods across all datasets, with particularly notable improvements observed in the QSAR and Maternal Health Risk datasets, indicating its effectiveness in enhancing imbalanced classification performance. Overall, the experimental findings indicate that CRN-SMOTE outperformed RN-SMOTE in 100% of the cases, achieving average improvements of 6.6% in Kappa, 4.01% in MCC, 1.87% in F1-score, 1.7% in precision, and 2.05% in recall, with setting SMOTE’s neighbors’ number to 5.

Data from: Imbalanced dataset for benchmarking

data.niaid.nih.gov

Updated Jan 24, 2020

Facebook

Twitter

Click to copy link

Link copied

Cite

Lemaitre, Guillaume; Nogueira, Fernando; Aridas, Christos K.; Oliveira, Dayvid V. R. (2020). Imbalanced dataset for benchmarking [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_61452

Explore at:

Dataset updated

Jan 24, 2020

Dataset provided by

Universite de Bourgogne, Universitat de Girona
ShoppeAI
University of Patras
Universidade Federal de Pernambuco

Authors

Lemaitre, Guillaume; Nogueira, Fernando; Aridas, Christos K.; Oliveira, Dayvid V. R.

License

Open Database License (ODbL) v1.0https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically

Description

Imbalanced dataset for benchmarking

The different algorithms of the imbalanced-learn toolbox are evaluated on a set of common dataset, which are more or less balanced. These benchmark have been proposed in [1]. The following section presents the main characteristics of this benchmark.

Characteristics

ID	Name	Repository & Target	Ratio	# samples	# features
1	Ecoli	UCI, target: imU	8.6:1	336	7
2	Optical Digits	UCI, target: 8	9.1:1	5,620	64
3	SatImage	UCI, target: 4	9.3:1	6,435	36
4	Pen Digits	UCI, target: 5	9.4:1	10,992	16
5	Abalone	UCI, target: 7	9.7:1	4,177	8
6	Sick Euthyroid	UCI, target: sick euthyroid	9.8:1	3,163	25
7	Spectrometer	UCI, target: >=44	11:1	531	93
8	Car_Eval_34	UCI, target: good, v good	12:1	1,728	6
9	ISOLET	UCI, target: A, B	12:1	7,797	617
10	US Crime	UCI, target: >0.65	12:1	1,994	122
11	Yeast_ML8	LIBSVM, target: 8	13:1	2,417	103
12	Scene	LIBSVM, target: >one label	13:1	2,407	294
13	Libras Move	UCI, target: 1	14:1	360	90
14	Thyroid Sick	UCI, target: sick	15:1	3,772	28
15	Coil_2000	KDD, CoIL, target: minority	16:1	9,822	85
16	Arrhythmia	UCI, target: 06	17:1	452	279
17	Solar Flare M0	UCI, target: M->0	19:1	1,389	10
18	OIL	UCI, target: minority	22:1	937	49
19	Car_Eval_4	UCI, target: vgood	26:1	1,728	6
20	Wine Quality	UCI, wine, target: <=4	26:1	4,898	11
21	Letter Img	UCI, target: Z	26:1	20,000	16
22	Yeast _ME2	UCI, target: ME2	28:1	1,484	8
23	Webpage	LIBSVM, w7a, target: minority	33:1	49,749	300
24	Ozone Level	UCI, ozone, data	34:1	2,536	72
25	Mammography	UCI, target: minority	42:1	11,183	6
26	Protein homo.	KDD CUP 2004, minority	111:1	145,751	74
27	Abalone_19	UCI, target: 19	130:1	4,177	8

References

[1] Ding, Zejin, "Diversified Ensemble Classifiers for H ighly Imbalanced Data Learning and their Application in Bioinformatics." Dissertation, Georgia State University, (2011).

[2] Blake, Catherine, and Christopher J. Merz. "UCI Repository of machine learning databases." (1998).

[3] Chang, Chih-Chung, and Chih-Jen Lin. "LIBSVM: a library for support vector machines." ACM Transactions on Intelligent Systems and Technology (TIST) 2.3 (2011): 27.

[4] Caruana, Rich, Thorsten Joachims, and Lars Backstrom. "KDD-Cup 2004: results and analysis." ACM SIGKDD Explorations Newsletter 6.2 (2004): 95-108.

f
Data_Sheet 1_Prediction Is a Balancing Act: Importance of Sampling Methods...
datasetcatalog.nlm.nih.gov
frontiersin.figshare.com
Updated Aug 28, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dehnbostel, Frederic O.; Banerjee, Priyanka; Preissner, Robert (2018). Data_Sheet 1_Prediction Is a Balancing Act: Importance of Sampling Methods to Balance Sensitivity and Specificity of Predictive Models Based on Imbalanced Chemical Data Sets.PDF [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000631456
Explore at:
Dataset updated
Aug 28, 2018
Authors
Dehnbostel, Frederic O.; Banerjee, Priyanka; Preissner, Robert
Description
Increase in the number of new chemicals synthesized in past decades has resulted in constant growth in the development and application of computational models for prediction of activity as well as safety profiles of the chemicals. Most of the time, such computational models and its application must deal with imbalanced chemical data. It is indeed a challenge to construct a classifier using imbalanced data set. In this study, we analyzed and validated the importance of different sampling methods over non-sampling method, to achieve a well-balanced sensitivity and specificity of a machine learning model trained on imbalanced chemical data. Additionally, this study has achieved an accuracy of 93.00%, an AUC of 0.94, F1 measure of 0.90, sensitivity of 96.00% and specificity of 91.00% using SMOTE sampling and Random Forest classifier for the prediction of Drug Induced Liver Injury (DILI). Our results suggest that, irrespective of data set used, sampling methods can have major influence on reducing the gap between sensitivity and specificity of a model. This study demonstrates the efficacy of different sampling methods for class imbalanced problem using binary chemical data sets.
Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in...
frontiersin.figshare.com
datasetcatalog.nlm.nih.gov
docx
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Giulia Varotto; Gianluca Susi; Laura Tassi; Francesca Gozzo; Silvana Franceschetti; Ferruccio Panzica (2023). Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in Machine Learning: Application to Epileptogenic Zone Localization From Interictal Intracranial EEG Recordings in Patients With Focal Epilepsy.DOCX [Dataset]. http://doi.org/10.3389/fninf.2021.715421.s002
Explore at:
docxAvailable download formats
Unique identifier
https://doi.org/10.3389/fninf.2021.715421.s002
Dataset updated
Jun 1, 2023
Dataset provided by
Frontiers Mediahttp://www.frontiersin.org/
Authors
Giulia Varotto; Gianluca Susi; Laura Tassi; Francesca Gozzo; Silvana Franceschetti; Ferruccio Panzica
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Aim: In neuroscience research, data are quite often characterized by an imbalanced distribution between the majority and minority classes, an issue that can limit or even worsen the prediction performance of machine learning methods. Different resampling procedures have been developed to face this problem and a lot of work has been done in comparing their effectiveness in different scenarios. Notably, the robustness of such techniques has been tested among a wide variety of different datasets, without considering the performance of each specific dataset. In this study, we compare the performances of different resampling procedures for the imbalanced domain in stereo-electroencephalography (SEEG) recordings of the patients with focal epilepsies who underwent surgery.Methods: We considered data obtained by network analysis of interictal SEEG recorded from 10 patients with drug-resistant focal epilepsies, for a supervised classification problem aimed at distinguishing between the epileptogenic and non-epileptogenic brain regions in interictal conditions. We investigated the effectiveness of five oversampling and five undersampling procedures, using 10 different machine learning classifiers. Moreover, six specific ensemble methods for the imbalanced domain were also tested. To compare the performances, Area under the ROC curve (AUC), F-measure, Geometric Mean, and Balanced Accuracy were considered.Results: Both the resampling procedures showed improved performances with respect to the original dataset. The oversampling procedure was found to be more sensitive to the type of classification method employed, with Adaptive Synthetic Sampling (ADASYN) exhibiting the best performances. All the undersampling approaches were more robust than the oversampling among the different classifiers, with Random Undersampling (RUS) exhibiting the best performance despite being the simplest and most basic classification method.Conclusions: The application of machine learning techniques that take into consideration the balance of features by resampling is beneficial and leads to more accurate localization of the epileptogenic zone from interictal periods. In addition, our results highlight the importance of the type of classification method that must be used together with the resampling to maximize the benefit to the outcome.
G
Data Balancing for Model Training Market Research Report 2033
growthmarketreports.com
csv, pdf, pptx
Updated Oct 3, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Growth Market Reports (2025). Data Balancing for Model Training Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/data-balancing-for-model-training-market
Explore at:
pptx, pdf, csvAvailable download formats
Dataset updated
Oct 3, 2025
Dataset authored and provided by
Growth Market Reports
Time period covered
2024 - 2032
Area covered
Global
Description
Data Balancing for Model Training Market Outlook

According to our latest research, the global Data Balancing for Model Training market size in 2024 is valued at USD 1.37 billion, with a robust CAGR of 19.8% expected during the forecast period. By 2033, the market is forecasted to reach USD 6.59 billion. The primary growth factor driving this market is the exponential increase in demand for high-quality, unbiased machine learning models across industries, fueled by the rapid digital transformation and adoption of artificial intelligence.

One of the most significant growth drivers for the Data Balancing for Model Training market is the surging need for accurate and reliable AI models in critical sectors such as healthcare, finance, and retail. As organizations increasingly leverage AI and machine learning for decision-making, the importance of balanced datasets becomes paramount to ensure model fairness, accuracy, and compliance. Data imbalance, if not addressed, can lead to biased predictions and suboptimal business outcomes, making data balancing solutions essential for organizations aiming to deploy trustworthy and high-performing models. Furthermore, regulatory pressures and ethical considerations are compelling enterprises to adopt advanced data balancing techniques, further accelerating market growth.

Another key factor propelling the market is the proliferation of big data and the complexity of modern datasets. With the explosion of data sources and the diversity of data types, organizations are facing unprecedented challenges in managing and processing imbalanced datasets. This complexity necessitates the adoption of sophisticated data balancing solutions such as oversampling, undersampling, hybrid methods, and synthetic data generation. These solutions not only enhance model performance but also streamline the data preparation process, enabling faster and more efficient model training cycles. The growing integration of automated machine learning (AutoML) platforms is also contributing to the adoption of data balancing tools, as these platforms increasingly embed balancing techniques to democratize AI development.

The ongoing digital transformation across industries, coupled with the rise of Industry 4.0, is further boosting the demand for data balancing solutions. Enterprises in manufacturing, IT & telecommunications, and retail are deploying AI-powered applications at scale, which rely heavily on balanced training data to deliver accurate insights and automation. The expanding use of Internet of Things (IoT) devices and connected systems is generating vast volumes of imbalanced data, necessitating robust data balancing frameworks. Additionally, advancements in synthetic data generation are opening new avenues for addressing data scarcity and imbalance, especially in sensitive domains like healthcare where data privacy is a concern.

From a regional perspective, North America leads the Data Balancing for Model Training market, driven by early adoption of AI technologies, strong presence of tech giants, and significant investments in AI research and development. Europe follows closely, supported by stringent regulatory frameworks and a growing focus on ethical AI. The Asia Pacific region is witnessing the fastest growth, propelled by rapid digitalization, expanding IT infrastructure, and increasing adoption of AI in emerging economies such as China and India. Meanwhile, Latin America and the Middle East & Africa are gradually catching up, with increasing awareness and investments in AI-driven solutions.

Solution Type Analysis

The Solution Type segment of the Data Balancing for Model Training market encompasses Oversampling, Undersampling, Hybrid Methods, Synthetic Data Generation, and Others. Oversampling remains one of the most widely adopted techniques, particularly in scenarios where minority class data is scarce but critical for accurate model predictions. Techniques such as SMOTE (Synthetic Minority Over-sampling Technique) and its variants are extensively used to generate synthetic samples, thereby improv
Cerebral Stroke Prediction-Imbalanced Dataset
kaggle.com
zip
Updated Aug 22, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shashwat Tiwari (2021). Cerebral Stroke Prediction-Imbalanced Dataset [Dataset]. https://www.kaggle.com/shashwatwork/cerebral-stroke-predictionimbalaced-dataset
Explore at:
zip(573312 bytes)Available download formats
Dataset updated
Aug 22, 2021
Authors
Shashwat Tiwari
License
Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0)https://creativecommons.org/licenses/by-nc-sa/3.0/
License information was derived automatically
Description
Context

A stroke, also known as a cerebrovascular accident or CVA is when part of the brain loses its blood supply and the part of the body that the blood-deprived brain cells control stops working. This loss of blood supply can be ischemic because of lack of blood flow, or hemorrhagic because of bleeding into brain tissue. A stroke is a medical emergency because strokes can lead to death or permanent disability. There are opportunities to treat ischemic strokes but that treatment needs to be started in the first few hours after the signs of a stroke begin.

Content

The cerebral Stroke dataset consists of 12 features including the target column which is imbalanced.

Acknowledgements

Liu, Tianyu; Fan, Wenhui; Wu, Cheng (2019), “Data for A hybrid machine learning approach to cerebral stroke prediction based on imbalanced medical-datasets”, Mendeley Data, V1, doi: 10.17632/x8ygrw87jw.1 Dataset is sourced from here.
Is this a good customer?
kaggle.com
zip
Updated Apr 16, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
podsyp (2020). Is this a good customer? [Dataset]. https://www.kaggle.com/podsyp/is-this-a-good-customer
Explore at:
zip(19523 bytes)Available download formats
Dataset updated
Apr 16, 2020
Authors
podsyp
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

Imbalanced classes put “accuracy” out of business. This is a surprisingly common problem in machine learning (specifically in classification), occurring in datasets with a disproportionate ratio of observations in each class.

Content

Standard accuracy no longer reliably measures performance, which makes model training much trickier. Imbalanced classes appear in many domains, including: - Antifraud - Antispam - ...

Inspiration

5 tactics for handling imbalanced classes in machine learning: - Up-sample the minority class - Down-sample the majority class - Change your performance metric - Penalize algorithms (cost-sensitive training) - Use tree-based algorithms
Predict students' dropout and academic success
zenodo.org
data-staging.niaid.nih.gov
+1more
Updated Mar 14, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Valentim Realinho; Valentim Realinho; Jorge Machado; Jorge Machado; Luís Baptista; Luís Baptista; Mónica V. Martins; Mónica V. Martins (2023). Predict students' dropout and academic success [Dataset]. http://doi.org/10.5281/zenodo.5777340
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.5777340
Dataset updated
Mar 14, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Valentim Realinho; Valentim Realinho; Jorge Machado; Jorge Machado; Luís Baptista; Luís Baptista; Mónica V. Martins; Mónica V. Martins
Description
A dataset created from a higher education institution (acquired from several disjoint databases) related to students enrolled in different undergraduate degrees, such as agronomy, design, education, nursing, journalism, management, social service, and technologies.

The dataset includes information known at the time of student enrollment (academic path, demographics, and social-economic factors) and the students' academic performance at the end of the first and second semesters.

The data is used to build classification models to predict students' dropout and academic success. The problem is formulated as a three category classification task (dropout, enrolled, and graduate) at the end of the normal duration of the course.

Funding
We acknowledge support of this work by the program "SATDAP - Capacitação da Administração Pública under grant POCI-05-5762-FSE-000191, Portugal"
Financial Transaction Fraud Detection
kaggle.com
zip
Updated Aug 20, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Abhi pratap (2025). Financial Transaction Fraud Detection [Dataset]. https://www.kaggle.com/datasets/abhipratapsingh/fraud-detection
Explore at:
zip(186385507 bytes)Available download formats
Dataset updated
Aug 20, 2025
Authors
Abhi pratap
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
This dataset is a valuable resource for building and evaluating machine learning models to predict fraudulent transactions in an e-commerce environment. With 6.3 million rows, it provides a rich, real-world scenario for data science tasks.

The data is an excellent case study for several key challenges in machine learning, including:

Handling Imbalanced Data: The dataset is highly imbalanced, as legitimate transactions vastly outnumber fraudulent ones. This necessitates the use of specialized techniques like SMOTE or advanced models like XGBoost that can handle class imbalance effectively.

Feature Engineering: The raw data provides an opportunity to create new, more powerful features, such as transaction velocity or the ratio of account balances, which can improve model performance.

Model Evaluation: Traditional metrics like accuracy are misleading for this type of dataset. The project requires a deeper analysis using metrics such as Precision, Recall, F1-Score, and the Precision-Recall AUC to truly understand the model's effectiveness.

Key Features: The dataset includes a variety of anonymized transaction details:

amount: The value of the transaction.

type: The type of transaction (e.g., TRANSFER, CASH_OUT).

oldbalance & newbalance: The balances of the origin and destination accounts before and after the transaction.

isFraud: The target variable, a binary flag indicating a fraudulent transaction.
D
Data Balance Optimization AI Market Research Report 2033
dataintelo.com
csv, pdf, pptx
Updated Sep 30, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dataintelo (2025). Data Balance Optimization AI Market Research Report 2033 [Dataset]. https://dataintelo.com/report/data-balance-optimization-ai-market
Explore at:
pptx, csv, pdfAvailable download formats
Dataset updated
Sep 30, 2025
Dataset authored and provided by
Dataintelo
License
https://dataintelo.com/privacy-and-policyhttps://dataintelo.com/privacy-and-policy
Time period covered
2024 - 2032
Area covered
Global
Description
Data Balance Optimization AI Market Outlook

According to our latest research, the global Data Balance Optimization AI market size in 2024 stands at USD 2.18 billion, with a robust compound annual growth rate (CAGR) of 23.7% projected from 2025 to 2033. By the end of 2033, the market is forecasted to reach an impressive USD 17.3 billion. This substantial growth is driven by the surging demand for AI-powered analytics and increasing adoption of data-intensive applications across industries, establishing Data Balance Optimization AI as a critical enabler for enterprise digital transformation.

One of the primary growth factors fueling the Data Balance Optimization AI market is the exponential surge in data generation across various sectors. Organizations are increasingly leveraging digital technologies, IoT devices, and cloud platforms, resulting in vast, complex, and often imbalanced datasets. The need for advanced AI solutions that can optimize, balance, and manage these datasets has become paramount to ensure high-quality analytics, accurate machine learning models, and improved business decision-making. Enterprises recognize that imbalanced data can severely skew AI outcomes, leading to biases and reduced operational efficiency. Consequently, the demand for Data Balance Optimization AI tools is accelerating as businesses strive to extract actionable insights from diverse and voluminous data sources.

Another critical driver is the rapid evolution of AI and machine learning algorithms, which require balanced and high-integrity datasets for optimal performance. As industries such as healthcare, finance, and retail increasingly rely on predictive analytics and automation, the integrity of underlying data becomes a focal point. Data Balance Optimization AI technologies are being integrated into data pipelines to automatically detect and correct imbalances, ensuring that AI models are trained on representative and unbiased data. This not only enhances model accuracy but also helps organizations comply with stringent regulatory requirements related to data fairness and transparency, further reinforcing the market’s upward trajectory.

The proliferation of cloud computing and the shift toward hybrid IT infrastructures are also significant contributors to market growth. Cloud-based Data Balance Optimization AI solutions offer scalability, flexibility, and cost-effectiveness, making them attractive to both large enterprises and small and medium-sized businesses. These solutions facilitate seamless integration with existing data management systems, enabling real-time optimization and balancing of data across distributed environments. Furthermore, the rise of data-centric business models in sectors such as e-commerce, telecommunications, and manufacturing is amplifying the need for robust data optimization frameworks, propelling further adoption of Data Balance Optimization AI technologies globally.

From a regional perspective, North America currently dominates the Data Balance Optimization AI market, accounting for the largest share due to its advanced technological infrastructure, high investment in AI research, and the presence of leading technology firms. However, the Asia Pacific region is poised to experience the fastest growth during the forecast period, driven by rapid digitalization, expanding IT ecosystems, and increasing adoption of AI-powered solutions in emerging economies such as China, India, and Southeast Asia. Europe also presents significant opportunities, particularly in regulated industries such as finance and healthcare, where data integrity and compliance are paramount. Collectively, these regional trends underscore the global momentum behind Data Balance Optimization AI adoption.

Component Analysis

The Data Balance Optimization AI market by component is segmented into software, hardware, and services, each playing a pivotal role in the overall ecosystem. The software segment commands the largest market share, driven by the continuous evolution of AI algorithms, data preprocessing tools, and machine learning frameworks designed to address data imbalance challenges. Organizations are increasingly investing in advanced software solutions that automate data balancing, cleansing, and augmentation processes, ensuring the reliability of AI-driven analytics. These software platforms often integrate seamlessly with existing data management systems, providing us

Stroke Risk Synthetic 2025

kaggle.com

zip

Updated Sep 26, 2025

Facebook

Twitter

Click to copy link

Link copied

Cite

Imaad Mahmood (2025). Stroke Risk Synthetic 2025 [Dataset]. https://www.kaggle.com/datasets/imaadmahmood/stroke-risk-synthetic-2025

Explore at:

zip(2288 bytes)Available download formats

Dataset updated

Sep 26, 2025

Authors

Imaad Mahmood

License

Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically

Description

StrokeRiskSynthetic2025 Dataset

Overview

The StrokeRiskSynthetic2025 dataset is a synthetically generated dataset designed for machine learning and data analysis tasks focused on predicting stroke risk. Created in September 2025, it simulates realistic patient profiles based on established stroke risk factors, drawing inspiration from medical literature and existing healthcare datasets. With 1,000 records, it provides a balanced yet imbalanced target (approximately 5% stroke cases) to reflect real-world stroke prevalence, making it ideal for binary classification, feature engineering, and handling imbalanced data in educational and research settings.

Data Description

Rows: 1,000
Columns: 12
Target Variable: stroke (binary: 0 = No stroke, 1 = Stroke)
File Format: CSV
Size: Approximately 60 KB

Columns

Column Name	Type	Description
`id`	Integer	Unique identifier for each record (1 to 1,000).
`gender`	Categorical	Patient gender: Male, Female, Other.
`age`	Integer	Patient age in years (0 to 100, skewed toward older adults).
`hypertension`	Binary	Hypertension status: 0 = No, 1 = Yes (~30% prevalence).
`heart_disease`	Binary	Heart disease status: 0 = No, 1 = Yes (~5-10% prevalence).
`ever_married`	Categorical	Marital status: Yes, No (correlated with age).
`work_type`	Categorical	Employment type: children, Govt_job, Never_worked, Private, Self-employed.
`Residence_type`	Categorical	Residence: Urban, Rural (balanced distribution).
`avg_glucose_level`	Float	Average blood glucose level in mg/dL (50 to 300, mean ~100).
`bmi`	Float	Body Mass Index (10 to 60, mean ~25).
`smoking_status`	Categorical	Smoking history: formerly smoked, never smoked, smokes, Unknown.
`stroke`	Binary	Target variable: 0 = No stroke, 1 = Stroke (~5% positive cases).

Key Features

Realistic Distributions: Reflects real-world stroke risk factors (e.g., age, hypertension, glucose levels) based on 2025 medical data, with ~5% stroke prevalence to mimic imbalanced healthcare datasets.
Synthetic Data: Generated to avoid privacy concerns, ensuring ethical use for research and education.
Versatility: Suitable for binary classification, feature importance analysis (e.g., SHAP), data preprocessing (e.g., imputation, scaling), and handling imbalanced data (e.g., SMOTE).
No Missing Values: Clean dataset for straightforward analysis, though users can introduce missingness for preprocessing practice.

Use Cases

Machine Learning: Train models like Logistic Regression, Random Forest, or XGBoost for stroke prediction.
Data Analysis: Explore correlations between risk factors (e.g., age, hypertension) and stroke outcomes.
Educational Projects: Ideal for learning EDA, feature engineering, and model deployment (e.g., Flask apps).
Healthcare Research: Simulate clinical scenarios for studying stroke risk without real patient data.

Source and Inspiration

This dataset is inspired by stroke risk factors outlined in medical literature (e.g., CDC, WHO) and existing datasets like the Kaggle Stroke Prediction Dataset (2021) and Mendeley’s Synthetic Stroke Prediction Dataset (2025). It incorporates 2025 trends in healthcare ML, such as handling imbalanced data and feature importance analysis.

Usage Notes

Preprocessing: Numerical features (age, avg_glucose_level, bmi) may require scaling; categorical features (gender, work_type, etc.) need encoding (e.g., one-hot, label).
Imbalanced Data: The ~5% stroke prevalence requires techniques like SMOTE, oversampling, or class weighting for effective modeling.
Scalability: Contact the creator to generate larger datasets (e.g., 10,000+ rows) if needed.

License

This dataset is provided for educational and research purposes under a Creative Commons Attribution 4.0 International License (CC BY 4.0).

Contact

For questions or to request expanded datasets, contact the creator via the platform where this dataset is hosted.

Lending Club Loan Data
kaggle.com
zip
Updated Nov 8, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sweta Shetye (2020). Lending Club Loan Data [Dataset]. https://www.kaggle.com/swetashetye/lending-club-loan-data-imbalance-dataset
Explore at:
zip(218250 bytes)Available download formats
Dataset updated
Nov 8, 2020
Authors
Sweta Shetye
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

I wanted a highly imbalanced dataset to share with others. It has the perfect one for us.

Imbalanced data typically refers to a classification problem where the number of observations per class is not equally distributed; often you'll have a large amount of data/observations for one class (referred to as the majority class), and much fewer observations for one or more other classes (referred to as the minority classes).

For example, In this dataset, There are way more samples of fully paid borrowers versus not fully paid borrowers.

Full LendingClub data available from their site.

Content

For companies like Lending Club correctly predicting whether or not a loan will be default is very important. This dataset contains historical data from 2007 to 2015, you can to build a deep learning model to predict the chance of default for future loans. As you will see this dataset is highly imbalanced and includes a lot of features that make this problem more challenging.
Fraudulent Financial Transaction Prediction
kaggle.com
zip
Updated Feb 15, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Younus_Mohamed (2025). Fraudulent Financial Transaction Prediction [Dataset]. https://www.kaggle.com/datasets/younusmohamed/fraudulent-financial-transaction-prediction
Explore at:
zip(41695207 bytes)Available download formats
Dataset updated
Feb 15, 2025
Authors
Younus_Mohamed
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
Fraud Detection with Imbalanced Data

Overview
This dataset is designed to help build, train, and evaluate machine learning models that detect fraudulent transactions. We have included additional CSV files containing location-based scores, proprietary weights for grouping, network turn-around times, and vulnerability scores.

Key Points
- Severe Class Imbalance: Only a tiny fraction (less than 1%) of transactions are fraud.
- Multiple Feature Files: Combine them by matching on id or Group.
- Target: The Target column in train.csv indicates fraud (1) vs. clean (0).
- Goal: Predict which transactions in test_share.csv might be fraudulent.

Files in this Dataset

train.csv

Rows: 227,845 (example size)

Columns: 28

Description: Contains historical transaction data for training a fraud detection model.

Important: The Target column (0 = Clean, 1 = Fraud).

test_share.csv

Rows: 56,962 (example size)

Columns: 27

Description: Test dataset, with the same structure as train.csv but without the Target column.

Geo_scores.csv

Columns: (id, geo_score)

Description: Location-based geospatial scores for each transaction.

Lambda_wts.csv

Columns: (Group, lambda_wt)

Description: Proprietary “lambda” weights associated with each Group.

Qset_tats.csv

Columns: (id, qsets_normalized_tat)

Description: Network turn-around times (TAT) for each transaction.

instance_scores.csv

Columns: (id, instance_scores)

Description: Vulnerability or risk qualification scores for each transaction.

Suggested Usage

Load all CSVs into dataframes.

Merge additional files (Geo_scores.csv, Lambda_wts.csv, etc.) by matching id or Group.

Explore the severe class imbalance in train.csv (Target ~1% is fraud).

Train any suitable classification model (Random Forest, XGBoost, etc.) on train.csv.

Predict on test_share.csv or your own external data.

Possible Tools:
- Python: pandas, NumPy, scikit-learn
- Imbalance Handling: SMOTE, Random Oversampler, or class weights
- Metrics: Precision, Recall, F1-score, ROC-AUC, etc.

Beginner Tip: Check how these extra CSVs (Geo, lambda, instance scores, TAT) might improve fraud detection performance!

Tags

fraud-detection

classification

imbalanced-data

financial-transactions

machine-learning

python

beginner-friendly

License: CC BY-NC-SA 4.0
Summary table: Oversampling techniques using SMOTE, ADASYN, and weighted...
plos.figshare.com
xls
Updated Nov 16, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alaa Alomari; Hossam Faris; Pedro A. Castillo (2023). Summary table: Oversampling techniques using SMOTE, ADASYN, and weighted rare classes. [Dataset]. http://doi.org/10.1371/journal.pone.0290581.t007
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0290581.t007
Dataset updated
Nov 16, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
Alaa Alomari; Hossam Faris; Pedro A. Castillo
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Summary table: Oversampling techniques using SMOTE, ADASYN, and weighted rare classes.
f
Additional file 1 of Prediction of low Apgar score at five minutes following...
datasetcatalog.nlm.nih.gov
springernature.figshare.com
Updated Apr 2, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Tarimo, Clifford Silver; Li, Quanman; Wang, Yuhui; Zhao, Yizhen; Mohammed, Akram; Gardner, Marilyn; Ren, Weicun; Wu, Jian; Bhuyan, Soumitra S.; Mahande, Michael Johnson (2022). Additional file 1 of Prediction of low Apgar score at five minutes following labor induction intervention in vaginal deliveries: machine learning approach for imbalanced data at a tertiary hospital in North Tanzania [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000421685
Explore at:
Dataset updated
Apr 2, 2022
Authors
Tarimo, Clifford Silver; Li, Quanman; Wang, Yuhui; Zhao, Yizhen; Mohammed, Akram; Gardner, Marilyn; Ren, Weicun; Wu, Jian; Bhuyan, Soumitra S.; Mahande, Michael Johnson
Description
Additional file 1.

Facebook

Twitter

Click to copy link

Link copied

Cite

Carmen Esposito; Gregory A. Landrum; Nadine Schneider; Nikolaus Stiefl; Sereina Riniker (2023). GHOST: Adjusting the Decision Threshold to Handle Imbalanced Data in Machine Learning [Dataset]. http://doi.org/10.1021/acs.jcim.1c00160.s002

Data from: GHOST: Adjusting the Decision Threshold to Handle Imbalanced Data in Machine Learning

Explore at:

zipAvailable download formats

Unique identifier

https://doi.org/10.1021/acs.jcim.1c00160.s002

Dataset updated

Jun 2, 2023

Dataset provided by

ACS Publications

Authors

Carmen Esposito; Gregory A. Landrum; Nadine Schneider; Nikolaus Stiefl; Sereina Riniker

License

Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically

Description

Machine learning classifiers trained on class imbalanced data are prone to overpredict the majority class. This leads to a larger misclassification rate for the minority class, which in many real-world applications is the class of interest. For binary data, the classification threshold is set by default to 0.5 which, however, is often not ideal for imbalanced data. Adjusting the decision threshold is a good strategy to deal with the class imbalance problem. In this work, we present two different automated procedures for the selection of the optimal decision threshold for imbalanced classification. A major advantage of our procedures is that they do not require retraining of the machine learning models or resampling of the training data. The first approach is specific for random forest (RF), while the second approach, named GHOST, can be potentially applied to any machine learning classifier. We tested these procedures on 138 public drug discovery data sets containing structure–activity data for a variety of pharmaceutical targets. We show that both thresholding methods improve significantly the performance of RF. We tested the use of GHOST with four different classifiers in combination with two molecular descriptors, and we found that most classifiers benefit from threshold optimization. GHOST also outperformed other strategies, including random undersampling and conformal prediction. Finally, we show that our thresholding procedures can be effectively applied to real-world drug discovery projects, where the imbalance and characteristics of the data vary greatly between the training and test sets.

Clear search

Close search

Google apps

Main menu

Data from: GHOST: Adjusting the Decision Threshold to Handle Imbalanced Data...

Data from: A virtual multi-label approach to imbalanced data classification

Results of BILSTM for rare classes for the imbalanced dataset with different...

Real Time Bidding

Context

Content

Inspiration

Imbalanced Cifar-10

The definition of a confusion matrix.

Data from: Imbalanced dataset for benchmarking

Imbalanced dataset for benchmarking

Characteristics

References

Data_Sheet 1_Prediction Is a Balancing Act: Importance of Sampling Methods...

Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in...

Data Balancing for Model Training Market Research Report 2033

Data Balancing for Model Training Market Outlook

Solution Type Analysis

Cerebral Stroke Prediction-Imbalanced Dataset

Context

Content

Acknowledgements

Is this a good customer?

Context

Content

Inspiration

Predict students' dropout and academic success

Financial Transaction Fraud Detection

Data Balance Optimization AI Market Research Report 2033

Data Balance Optimization AI Market Outlook

Component Analysis

Stroke Risk Synthetic 2025

StrokeRiskSynthetic2025 Dataset

Overview

Data Description

Columns

Key Features

Use Cases

Source and Inspiration

Usage Notes

License

Contact

Lending Club Loan Data

Context

Content

Fraudulent Financial Transaction Prediction

Fraud Detection with Imbalanced Data

Files in this Dataset

Suggested Usage

Tags

Summary table: Oversampling techniques using SMOTE, ADASYN, and weighted...

Additional file 1 of Prediction of low Apgar score at five minutes following...

Data from: GHOST: Adjusting the Decision Threshold to Handle Imbalanced Data in Machine Learning