100+ datasets found

Data from: A virtual multi-label approach to imbalanced data classification
tandf.figshare.com
text/x-tex
Updated Feb 28, 2024
Cite
Elizabeth P. Chou; Shan-Ping Yang (2024). A virtual multi-label approach to imbalanced data classification [Dataset]. http://doi.org/10.6084/m9.figshare.19390561.v1
Explore at:
Available download formats: text/x-tex
Unique identifier
https://doi.org/10.6084/m9.figshare.19390561.v1
Dataset updated
Feb 28, 2024
Dataset provided by
Taylor & Francis (https://taylorandfrancis.com/)
Authors
Elizabeth P. Chou; Shan-Ping Yang
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
One of the most challenging issues in machine learning is imbalanced data analysis. Usually, in this type of research, correctly predicting minority labels is more critical than correctly predicting majority labels. However, traditional machine learning techniques easily lead to learning bias. Traditional classifiers tend to place all subjects in the majority group, resulting in biased predictions. Machine learning studies are typically conducted from one of two perspectives: a data-based perspective or a model-based perspective. Oversampling and undersampling are examples of data-based approaches, while the addition of costs, penalties, or weights to optimize the algorithm is typical of a model-based approach. Some ensemble methods have been studied recently. These methods cause various problems, such as overfitting, the omission of some information, and long computation times. In addition, these methods do not apply to all kinds of datasets. Based on this problem, the virtual labels (ViLa) approach for the majority label is proposed to solve the imbalanced problem. A new multiclass classification approach with the equal K-means clustering method is demonstrated in the study. The proposed method is compared with commonly used imbalance problem methods, such as sampling methods (oversampling, undersampling, and SMOTE) and classifier methods (SVM and one-class SVM). The results show that the proposed method performs better when the degree of data imbalance increases and will gradually outperform other methods.
Data from: GHOST: Adjusting the Decision Threshold to Handle Imbalanced Data...
acs.figshare.com
zip
Updated Jun 2, 2023
Cite
Carmen Esposito; Gregory A. Landrum; Nadine Schneider; Nikolaus Stiefl; Sereina Riniker (2023). GHOST: Adjusting the Decision Threshold to Handle Imbalanced Data in Machine Learning [Dataset]. http://doi.org/10.1021/acs.jcim.1c00160.s002
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.1021/acs.jcim.1c00160.s002
Dataset updated
Jun 2, 2023
Dataset provided by
ACS Publications
Authors
Carmen Esposito; Gregory A. Landrum; Nadine Schneider; Nikolaus Stiefl; Sereina Riniker
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Machine learning classifiers trained on class imbalanced data are prone to overpredict the majority class. This leads to a larger misclassification rate for the minority class, which in many real-world applications is the class of interest. For binary data, the classification threshold is set by default to 0.5 which, however, is often not ideal for imbalanced data. Adjusting the decision threshold is a good strategy to deal with the class imbalance problem. In this work, we present two different automated procedures for the selection of the optimal decision threshold for imbalanced classification. A major advantage of our procedures is that they do not require retraining of the machine learning models or resampling of the training data. The first approach is specific for random forest (RF), while the second approach, named GHOST, can be potentially applied to any machine learning classifier. We tested these procedures on 138 public drug discovery data sets containing structure–activity data for a variety of pharmaceutical targets. We show that both thresholding methods improve significantly the performance of RF. We tested the use of GHOST with four different classifiers in combination with two molecular descriptors, and we found that most classifiers benefit from threshold optimization. GHOST also outperformed other strategies, including random undersampling and conformal prediction. Finally, we show that our thresholding procedures can be effectively applied to real-world drug discovery projects, where the imbalance and characteristics of the data vary greatly between the training and test sets.
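As a rough illustration of adjusting the decision threshold without retraining, the sketch below scans candidate thresholds over already-predicted probabilities and keeps the one with the highest Cohen's kappa. It is not the GHOST procedure itself; the function names, the candidate grid, and the toy labels and probabilities are assumptions made for illustration.

```python
def cohen_kappa(y_true, y_pred):
    """Cohen's kappa for binary labels encoded as 0/1."""
    n = len(y_true)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    observed = (tp + tn) / n
    # Agreement expected by chance from the row and column totals
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    if expected == 1.0:
        return 0.0
    return (observed - expected) / (1.0 - expected)


def pick_threshold(y_true, proba, candidates=None):
    """Return (threshold, kappa) for the candidate with the highest kappa."""
    if candidates is None:
        candidates = [i / 20 for i in range(1, 20)]  # 0.05 .. 0.95 (assumed grid)
    best_t, best_k = 0.5, float("-inf")
    for t in candidates:
        y_pred = [1 if p >= t else 0 for p in proba]
        k = cohen_kappa(y_true, y_pred)
        if k > best_k:  # strict comparison keeps the first best threshold
            best_t, best_k = t, k
    return best_t, best_k


# Toy data (hypothetical): 8 majority samples, 2 minority samples whose
# predicted probabilities never reach the default threshold of 0.5.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
proba = [0.05, 0.10, 0.12, 0.15, 0.20, 0.22, 0.25, 0.31, 0.36, 0.46]

default_kappa = cohen_kappa(y_true, [1 if p >= 0.5 else 0 for p in proba])
threshold, kappa = pick_threshold(y_true, proba)
print(default_kappa, threshold, kappa)
```

On this toy data the default threshold predicts only the majority class, while a lower threshold separates the two classes. In practice the threshold should be chosen on training or validation data, never on the test set.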
Imbalanced Cifar-10
kaggle.com
zip
Updated Jun 17, 2023
Cite
Akhil Theerthala (2023). Imbalanced Cifar-10 [Dataset]. https://www.kaggle.com/datasets/akhiltheerthala/imbalanced-cifar-10
Explore at:
Available download formats: zip (807146485 bytes)
Dataset updated
Jun 17, 2023
Authors
Akhil Theerthala
Description
This dataset is a modified version of the classic CIFAR 10, deliberately designed to be imbalanced across its classes. CIFAR 10 typically consists of 60,000 32x32 color images in 10 classes, with 5000 images per class in the training set. However, this dataset skews these distributions to create a more challenging environment for developing and testing machine learning algorithms. The distribution can be visualized as follows,

[Figure: class distribution of the imbalanced CIFAR-10 dataset (Cifar_Imbalanced_data.png)]

The primary purpose of this dataset is to offer researchers and practitioners a platform to develop, test, and enhance algorithms' robustness when faced with class imbalances. It is especially suited for those interested in binary and multi-class imbalance learning, anomaly detection, and other relevant fields.

The imbalance was created synthetically, maintaining the same quality and diversity of the original CIFAR 10 dataset, but with varying degrees of representation for each class. Details of the class distributions are included in the dataset's metadata.

This dataset is beneficial for:
- Developing and testing strategies for handling imbalanced datasets.
- Investigating the effects of class imbalance on model performance.
- Comparing different machine learning algorithms' performance under class imbalance.

Usage Information:

The dataset maintains the same format as the original CIFAR 10 dataset, making it easy to incorporate into existing projects. It is organised in a way such that the dataset can be integrated into PyTorch ImageFolder directly. You can load the dataset in Python using popular libraries like NumPy and PyTorch.

License: This dataset follows the same license terms as the original CIFAR 10 dataset. Please refer to the official CIFAR 10 website for details.

Acknowledgments: We want to acknowledge the creators of the CIFAR 10 dataset. Without their work and willingness to share data, this synthetic imbalanced dataset wouldn't be possible.
Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in...
frontiersin.figshare.com
datasetcatalog.nlm.nih.gov
docx
Updated Jun 1, 2023
Cite
Giulia Varotto; Gianluca Susi; Laura Tassi; Francesca Gozzo; Silvana Franceschetti; Ferruccio Panzica (2023). Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in Machine Learning: Application to Epileptogenic Zone Localization From Interictal Intracranial EEG Recordings in Patients With Focal Epilepsy.DOCX [Dataset]. http://doi.org/10.3389/fninf.2021.715421.s002
Explore at:
Available download formats: docx
Unique identifier
https://doi.org/10.3389/fninf.2021.715421.s002
Dataset updated
Jun 1, 2023
Dataset provided by
Frontiers Media (http://www.frontiersin.org/)
Authors
Giulia Varotto; Gianluca Susi; Laura Tassi; Francesca Gozzo; Silvana Franceschetti; Ferruccio Panzica
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Aim: In neuroscience research, data are quite often characterized by an imbalanced distribution between the majority and minority classes, an issue that can limit or even worsen the prediction performance of machine learning methods. Different resampling procedures have been developed to face this problem and a lot of work has been done in comparing their effectiveness in different scenarios. Notably, the robustness of such techniques has been tested among a wide variety of different datasets, without considering the performance of each specific dataset. In this study, we compare the performances of different resampling procedures for the imbalanced domain in stereo-electroencephalography (SEEG) recordings of the patients with focal epilepsies who underwent surgery.

Methods: We considered data obtained by network analysis of interictal SEEG recorded from 10 patients with drug-resistant focal epilepsies, for a supervised classification problem aimed at distinguishing between the epileptogenic and non-epileptogenic brain regions in interictal conditions. We investigated the effectiveness of five oversampling and five undersampling procedures, using 10 different machine learning classifiers. Moreover, six specific ensemble methods for the imbalanced domain were also tested. To compare the performances, Area under the ROC curve (AUC), F-measure, Geometric Mean, and Balanced Accuracy were considered.

Results: Both the resampling procedures showed improved performances with respect to the original dataset. The oversampling procedure was found to be more sensitive to the type of classification method employed, with Adaptive Synthetic Sampling (ADASYN) exhibiting the best performances. All the undersampling approaches were more robust than the oversampling among the different classifiers, with Random Undersampling (RUS) exhibiting the best performance despite being the simplest and most basic classification method.

Conclusions: The application of machine learning techniques that take into consideration the balance of features by resampling is beneficial and leads to more accurate localization of the epileptogenic zone from interictal periods. In addition, our results highlight the importance of the type of classification method that must be used together with the resampling to maximize the benefit to the outcome.
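Two of the metrics named above, Balanced Accuracy and Geometric Mean, can be computed directly from confusion-matrix counts. The following is a minimal sketch; the function name and the counts are hypothetical and are not taken from the study.

```python
import math


def imbalance_metrics(tp, fn, tn, fp):
    """Balanced accuracy, geometric mean, and plain accuracy from counts."""
    sensitivity = tp / (tp + fn)  # recall on the positive (minority) class
    specificity = tn / (tn + fp)  # recall on the negative (majority) class
    return {
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "geometric_mean": math.sqrt(sensitivity * specificity),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Hypothetical counts: 20 minority samples and 180 majority samples
scores = imbalance_metrics(tp=5, fn=15, tn=170, fp=10)
print(scores)
```

With these made-up counts, plain accuracy stays high because the majority class dominates, while both imbalance-aware metrics drop because only a quarter of the minority samples are found.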
Results of BILSTM for rare classes for the imbalanced dataset with different...
plos.figshare.com
xls
Updated Nov 16, 2023
Cite
Alaa Alomari; Hossam Faris; Pedro A. Castillo (2023). Results of BILSTM for rare classes for the imbalanced dataset with different reweighting factors. [Dataset]. http://doi.org/10.1371/journal.pone.0290581.t006
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0290581.t006
Dataset updated
Nov 16, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Alaa Alomari; Hossam Faris; Pedro A. Castillo
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Results of BILSTM for rare classes for the imbalanced dataset with different reweighting factors.
The definition of a confusion matrix.
plos.figshare.com
xls
Updated Feb 10, 2025
Cite
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). The definition of a confusion matrix. [Dataset]. http://doi.org/10.1371/journal.pone.0317396.t002
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0317396.t002
Dataset updated
Feb 10, 2025
Dataset provided by
PLOS (http://plos.org/)
Authors
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In recent years, the challenge of imbalanced data has become increasingly prominent in machine learning, affecting the performance of classification algorithms. This study proposes a novel data-level oversampling method called Cluster-Based Reduced Noise SMOTE (CRN-SMOTE) to address this issue. CRN-SMOTE combines SMOTE for oversampling minority classes with a novel cluster-based noise reduction technique. In this cluster-based noise reduction approach, it is crucial that samples from each category form one or two clusters, a feature that conventional noise reduction methods do not achieve. The proposed method is evaluated on four imbalanced datasets (ILPD, QSAR, Blood, and Maternal Health Risk) using five metrics: Cohen’s kappa, Matthew’s correlation coefficient (MCC), F1-score, precision, and recall. Results demonstrate that CRN-SMOTE consistently outperformed the state-of-the-art Reduced Noise SMOTE (RN-SMOTE), SMOTE-Tomek Link, and SMOTE-ENN methods across all datasets, with particularly notable improvements observed in the QSAR and Maternal Health Risk datasets, indicating its effectiveness in enhancing imbalanced classification performance. Overall, the experimental findings indicate that CRN-SMOTE outperformed RN-SMOTE in 100% of the cases, achieving average improvements of 6.6% in Kappa, 4.01% in MCC, 1.87% in F1-score, 1.7% in precision, and 2.05% in recall, with setting SMOTE’s neighbors’ number to 5.
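For orientation, the interpolation step that SMOTE-style oversamplers build on can be sketched as follows. This shows only plain SMOTE-like interpolation with k = 5 neighbours; it does not include the cluster-based noise reduction of CRN-SMOTE, and the function name and toy points are assumptions.

```python
import random


def smote_like_samples(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        # Rank the other minority points by squared Euclidean distance to base
        others = [p for i, p in enumerate(minority) if i != base_idx]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])  # one of the k nearest neighbours
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Toy minority class in two dimensions (hypothetical values)
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (0.9, 0.7), (1.4, 1.0)]
new_points = smote_like_samples(minority, n_new=4, k=5)
print(new_points)
```

Each synthetic point lies on a line segment between two existing minority samples, so it always stays inside the bounding box of the minority class.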
Predict students' dropout and academic success
zenodo.org
data-staging.niaid.nih.gov
Updated Mar 14, 2023
Cite
Valentim Realinho; Jorge Machado; Luís Baptista; Mónica V. Martins (2023). Predict students' dropout and academic success [Dataset]. http://doi.org/10.5281/zenodo.5777340
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.5777340
Dataset updated
Mar 14, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Valentim Realinho; Jorge Machado; Luís Baptista; Mónica V. Martins
Description
A dataset created from a higher education institution (acquired from several disjoint databases) related to students enrolled in different undergraduate degrees, such as agronomy, design, education, nursing, journalism, management, social service, and technologies.

The dataset includes information known at the time of student enrollment (academic path, demographics, and social-economic factors) and the students' academic performance at the end of the first and second semesters.

The data is used to build classification models to predict students' dropout and academic success. The problem is formulated as a three category classification task (dropout, enrolled, and graduate) at the end of the normal duration of the course.

Funding
We acknowledge support of this work by the program "SATDAP - Capacitação da Administração Pública" under grant POCI-05-5762-FSE-000191, Portugal.
Is this a good customer?
kaggle.com
zip
Updated Apr 16, 2020
Cite
podsyp (2020). Is this a good customer? [Dataset]. https://www.kaggle.com/podsyp/is-this-a-good-customer
Explore at:
Available download formats: zip (19523 bytes)
Dataset updated
Apr 16, 2020
Authors
podsyp
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

Imbalanced classes put “accuracy” out of business. This is a surprisingly common problem in machine learning (specifically in classification), occurring in datasets with a disproportionate ratio of observations in each class.

Content

Standard accuracy no longer reliably measures performance, which makes model training much trickier. Imbalanced classes appear in many domains, including:
- Antifraud
- Antispam
- ...

Inspiration

5 tactics for handling imbalanced classes in machine learning:
- Up-sample the minority class
- Down-sample the majority class
- Change your performance metric
- Penalize algorithms (cost-sensitive training)
- Use tree-based algorithms
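The first tactic, up-sampling the minority class, can be sketched with the standard library alone. The helper name, the row layout, and the toy rows below are assumptions for illustration.

```python
import random
from collections import Counter


def upsample_minority(rows, label_of, seed=0):
    """Resample smaller classes with replacement until all match the largest."""
    rng = random.Random(seed)
    counts = Counter(label_of(r) for r in rows)
    target = max(counts.values())
    balanced = list(rows)
    for label, count in counts.items():
        pool = [r for r in rows if label_of(r) == label]
        # Draw the missing rows at random (none for the largest class)
        balanced.extend(rng.choices(pool, k=target - count))
    return balanced


# Toy rows (hypothetical): (feature, label) with 8 majority and 2 minority rows
rows = [("a", 0)] * 8 + [("b", 1)] * 2
balanced = upsample_minority(rows, label_of=lambda r: r[1])
counts = Counter(label for _, label in balanced)
print(counts)
```

Resampling should be applied to the training split only; duplicating rows before splitting leaks copies of the same sample into the validation data.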
Learning from Imbalanced Insurance Data
kaggle.com
zip
Updated Nov 23, 2020
Cite
Möbius (2020). Learning from Imbalanced Insurance Data [Dataset]. https://www.kaggle.com/arashnic/imbalanced-data-practice
Explore at:
Available download formats: zip (7004103 bytes)
Dataset updated
Nov 23, 2020
Authors
Möbius
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

Insurance companies that sell life, health, and property and casualty insurance are using machine learning (ML) to drive improvements in customer service, fraud detection, and operational efficiency. The data here are provided by an insurance company that, like other companies, wants to take advantage of ML. This company provides health insurance to its customers. We can build a model to predict whether the policyholders (customers) from the past year will also be interested in the vehicle insurance provided by the company.

An insurance policy is an arrangement by which a company undertakes to provide a guarantee of compensation for specified loss, damage, illness, or death in return for the payment of a specified premium. A premium is a sum of money that the customer needs to pay regularly to an insurance company for this guarantee.

For example, you may pay a premium of Rs. 5,000 each year for a health insurance cover of Rs. 200,000 so that if, God forbid, you fall ill and need to be hospitalized in that year, the insurance company will bear the cost of hospitalization etc. up to Rs. 200,000. If you are wondering how the company can bear such a high hospitalization cost when it charges a premium of only Rs. 5,000, that is where the concept of probabilities comes into the picture. For example, like you, there may be 100 customers paying a premium of Rs. 5,000 every year, but only a few of them (say 2-3) would be hospitalized that year. This way everyone shares the risk of everyone else.

Just like medical insurance, there is vehicle insurance, where every year the customer pays a premium of a certain amount to the insurance company so that, in case of an unfortunate accident involving the vehicle, the insurance company will provide compensation (called ‘sum assured’) to the customer.

Content

Building a model to predict whether a customer would be interested in Vehicle Insurance is extremely helpful for the company because it can then accordingly plan its communication strategy to reach out to those customers and optimize its business model and revenue.

We have information about:
- Demographics (gender, age, region code type)
- Vehicles (Vehicle Age, Damage)
- Policy (Premium, sourcing channel) etc.

Update: Test data target values have been added. To evaluate your models more precisely, you can use: https://www.kaggle.com/arashnic/answer

Moreover, the supplemental goal is to practice learning from imbalanced data and to verify how the results can help in a real operational process. The Response feature (target) is highly imbalanced.

0: 319594 1: 62531 Name: Response, dtype: int64
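To put these counts in perspective, the sketch below derives the class shares, the imbalance ratio, and one common set of "balanced" class weights. Only the two counts come from the listing; the variable names and the weighting convention (total / (number of classes × class count)) are assumptions.

```python
# Response value counts quoted above
counts = {0: 319594, 1: 62531}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# How many majority samples exist per minority sample
imbalance_ratio = counts[0] / counts[1]

# Accuracy of a model that always predicts the majority class
majority_baseline_accuracy = shares[0]

# Assumed "balanced" weighting convention: rarer classes get larger weights
class_weights = {label: total / (len(counts) * n) for label, n in counts.items()}

print(total, shares, imbalance_ratio, class_weights)
```

A classifier that always answers 0 already scores roughly the majority share in accuracy, which is why accuracy alone says little on this target.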

Practicing techniques such as resampling is useful for checking their effect on validation results and the confusion matrix.

[Figure: under-sampling with Tomek links]

Starter Kernel(s)

Quick EDA and LGB, XGB

Handling Imbalanced: Resampling the right way

Inspiration

Predict whether a customer would be interested in Vehicle Insurance

Cerebral Stroke Dataset
kaggle.com
zip
Updated Sep 25, 2025
Cite
dailydaisy2 (2025). Cerebral Stroke Dataset [Dataset]. https://www.kaggle.com/datasets/viviansam/cerebral-stroke-dataset
Explore at:
Available download formats: zip (573312 bytes)
Dataset updated
Sep 25, 2025
Authors
dailydaisy2
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
Retrieved from Mendeley Data on 16-Dec-2024: https://data.mendeley.com/datasets/x8ygrw87jw/1

This dataset comprises vital information on potential cerebral stroke patients, including personal data (e.g., age, gender, etc.), and disease history (e.g. hypertension, heart disease, etc.), which was collected from HealthData.gov by Liu, Fan & Wu (2019) during their study titled 'A hybrid machine learning approach to cerebral stroke prediction based on an imbalanced medical dataset'. The data collection prioritized physiological indicators over complex medical monitoring to minimize diagnosis expenses.

This cerebral stroke dataset records information from 43400 potential patients, comprising 12 attributes with various data types.

id - Unique identifier of each patient

gender - Gender of the patient: male, female, other

age - Age of the patient: ranged from 0.08 to 82

hypertension - If the patient has hypertension: 0, 1 (no, yes, respectively)

heart_disease - If the patient has heart disease: 0, 1 (no, yes, respectively)

ever_married - Marital status of patient: No, Yes

work_type - Occupation type of patient: children, private sector, self-employed, government sector, never worked

Residence_type - Residency type of patient: rural, urban

avg_glucose_level - Average glucose level in blood: ranged from 55 to 279.66

bmi - Body mass index: ranged from 10.1 to 97.6

smoking_status - Smoking status: formerly smoked, never smoked, smokes

stroke - If the patient has stroke: 0, 1 (no, yes, respectively)

The target variable, ‘stroke’, is categorized into ‘0’ and ‘1’, representing ‘no stroke’ and ‘have stroke’ respectively. It is a categorical variable, making the problem a binary classification task. This dataset includes 783 occurrences of stroke, which account for about 1.8% of the 43400 records (783 / 43400), resulting in a highly imbalanced dataset. This imbalance reflects actual clinical practice, where most medical datasets suffer from class imbalance by nature.
Lending Club Loan Data
kaggle.com
zip
Updated Nov 8, 2020
Cite
Sweta Shetye (2020). Lending Club Loan Data [Dataset]. https://www.kaggle.com/swetashetye/lending-club-loan-data-imbalance-dataset
Explore at:
Available download formats: zip (218250 bytes)
Dataset updated
Nov 8, 2020
Authors
Sweta Shetye
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

I wanted a highly imbalanced dataset to share with others, and this one is perfect for the purpose.

Imbalanced data typically refers to a classification problem where the number of observations per class is not equally distributed; often you'll have a large amount of data/observations for one class (referred to as the majority class), and much fewer observations for one or more other classes (referred to as the minority classes).

For example, in this dataset there are far more samples of fully paid borrowers than of not fully paid borrowers.

Full LendingClub data available from their site.

Content

For companies like Lending Club, correctly predicting whether or not a loan will default is very important. This dataset contains historical data from 2007 to 2015; you can use it to build a deep learning model to predict the chance of default for future loans. As you will see, this dataset is highly imbalanced and includes a lot of features that make this problem more challenging.
Data from: WikiChurches – A Fine-Grained Dataset of Architectural Styles...
zenodo.org
explore.openaire.eu
bin, json, pdf, txt +1
Updated Jul 18, 2024
Cite
Björn Barz; Joachim Denzler (2024). WikiChurches – A Fine-Grained Dataset of Architectural Styles with Real-World Challenges [Dataset]. http://doi.org/10.5281/zenodo.5166987
Explore at:
Available download formats: pdf, txt, json, bin, zip
Unique identifier
https://doi.org/10.5281/zenodo.5166987
Dataset updated
Jul 18, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Björn Barz; Joachim Denzler
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
WikiChurches is a dataset for architectural style classification, consisting of 9,485 images of church buildings. Both images and style labels were sourced from Wikipedia. The dataset can serve as a benchmark for various research fields, as it combines numerous real-world challenges: fine-grained distinctions between classes based on subtle visual features, a comparatively small sample size, a highly imbalanced class distribution, a high variance of viewpoints, and a hierarchical organization of labels, where only some images are labeled at the most precise level. In addition, we provide 631 bounding box annotations of characteristic visual features for 139 churches from four major categories. These annotations can, for example, be useful for research on fine-grained classification, where additional expert knowledge about distinctive object parts is often available.

Please refer to the README.md file for information about the different files contained in this dataset.
Performance comparison of machine learning models across accuracy, AUC, MCC,...
plos.figshare.com
xls
Updated Dec 31, 2024
Cite
Seongil Han; Haemin Jung (2024). Performance comparison of machine learning models across accuracy, AUC, MCC, and F1 score on GMSC dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0316454.t005
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0316454.t005
Dataset updated
Dec 31, 2024
Dataset provided by
PLOS (http://plos.org/)
Authors
Seongil Han; Haemin Jung
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Performance comparison of machine learning models across accuracy, AUC, MCC, and F1 score on GMSC dataset.
Financial Transaction Fraud Detection
kaggle.com
zip
Updated Aug 20, 2025
Cite
Abhi pratap (2025). Financial Transaction Fraud Detection [Dataset]. https://www.kaggle.com/datasets/abhipratapsingh/fraud-detection
Explore at:
Available download formats: zip (186385507 bytes)
Dataset updated
Aug 20, 2025
Authors
Abhi pratap
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
This dataset is a valuable resource for building and evaluating machine learning models to predict fraudulent transactions in an e-commerce environment. With 6.3 million rows, it provides a rich, real-world scenario for data science tasks.

The data is an excellent case study for several key challenges in machine learning, including:

Handling Imbalanced Data: The dataset is highly imbalanced, as legitimate transactions vastly outnumber fraudulent ones. This necessitates the use of specialized techniques like SMOTE or advanced models like XGBoost that can handle class imbalance effectively.

Feature Engineering: The raw data provides an opportunity to create new, more powerful features, such as transaction velocity or the ratio of account balances, which can improve model performance.

Model Evaluation: Traditional metrics like accuracy are misleading for this type of dataset. The project requires a deeper analysis using metrics such as Precision, Recall, F1-Score, and the Precision-Recall AUC to truly understand the model's effectiveness.
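A small worked example makes the point about accuracy concrete. The confusion counts below are hypothetical and are not taken from this dataset; the function name is likewise an assumption.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 for the positive (fraud) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical result on 10,000 transactions, 100 of them fraudulent
tp, fp, fn, tn = 60, 20, 40, 9880

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(accuracy, precision, recall, f1)
```

Here accuracy exceeds 99% even though 40 of the 100 fraudulent transactions are missed, which only the recall and F1 values reveal.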

Key Features: The dataset includes a variety of anonymized transaction details:

amount: The value of the transaction.

type: The type of transaction (e.g., TRANSFER, CASH_OUT).

oldbalance & newbalance: The balances of the origin and destination accounts before and after the transaction.

isFraud: The target variable, a binary flag indicating a fraudulent transaction.
Data from: Handling Imbalanced Classification Problems by Weighted...
tandf.figshare.com
ai
Updated Dec 16, 2024
Cite
Chen Dou; Yan Lv; Zhen Wang; Lan Bai (2024). Handling Imbalanced Classification Problems by Weighted Generalization Memorization Machine [Dataset]. http://doi.org/10.6084/m9.figshare.25858505.v1
Explore at:
Available download formats: ai
Unique identifier
https://doi.org/10.6084/m9.figshare.25858505.v1
Dataset updated
Dec 16, 2024
Dataset provided by
Taylor & Francis (https://taylorandfrancis.com/)
Authors
Chen Dou; Yan Lv; Zhen Wang; Lan Bai
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Imbalanced classification problems are of great significance in life, and there have been many methods to deal with them, e.g. eXtreme Gradient Boosting (XGBoost), Logistic Regression (LR), Decision Trees (DT), and Support Vector Machine (SVM). Recently, a novel Generalization-Memorization Machine (GMM) was proposed to maintain good generalization ability with zero empirical risk for binary classification. This paper proposes a Weighted Generalization Memorization Machine (WGMM) for imbalanced classification. By improving the memory cost function and memory influence function of GMM, our WGMM also maintains zero empirical risk with good generalization ability for imbalanced classification learning. The new adaptive memory influence function in our WGMM ensures that samples are described individually and are not affected by training samples from other categories. We conduct experiments on 31 datasets and compare the WGMM with some other classification methods. The results exhibit the effectiveness of the WGMM.

Bank Telemarketing

kaggle.com

zip

Updated Jun 1, 2025

Cite

Younus_Mohamed (2025). Bank Telemarketing [Dataset]. https://www.kaggle.com/datasets/younusmohamed/bank-telemarketing

Explore at:

Available download formats: zip (3248401 bytes)

Dataset updated

Jun 1, 2025

Authors

Younus_Mohamed

License

Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically

Description

📞 Bank Marketing (Term Deposit Subscription) Dataset

Source : UCI Machine Learning Repository – Bank Marketing (#222)

A Portuguese retail bank’s phone-based marketing campaigns (May 2008 → Nov 2010).
The task is to predict whether a client will subscribe to a term deposit (target y).

1 · Background

Each row records the outcome of the last phone call (plus client history).
Multiple calls to the same client may appear across campaigns.
The original authors showed that data-driven targeting boosts campaign ROI – see the reference paper below.

2 · Files in this Kaggle release

| File | Rows | Columns | Notes |
|---|---|---|---|
| `bank_marketing.xlsx` | 45 211 | 17 | Classic “bank-full” version (all examples, 16 predictors + target) |

Need the enriched “bank-additional” version with 20 predictors? Grab it from the UCI link.

3 · Data Dictionary (16 predictors + target)

| Column | Type | Description |
|---|---|---|
| `age` | int | Age of the client |
| `job` | cat | Job type (admin., blue-collar, …) |
| `marital` | cat | Marital status (married / single / divorced) |
| `education` | cat | Education level (primary / secondary / tertiary / unknown) |
| `default` | bin | Has credit in default? |
| `balance` | int | Average yearly balance (EUR) |
| `housing` | bin | Has housing loan? |
| `loan` | bin | Has personal loan? |
| `contact` | cat | Contact channel (cellular / telephone / unknown) |
| `day` | int | Day of month of last contact |
| `month` | cat | Month of last contact (`jan`-`dec`) |
| `duration` | int | Call duration (secs)* |
| `campaign` | int | Contacts made in this campaign (incl. last) |
| `pdays` | int | Days since last contact (-1 ⇒ never) |
| `previous` | int | Previous contacts before this campaign |
| `poutcome` | cat | Outcome of previous campaign (failure / success / nonexistent) |
| `y` | bin | Target – subscribed to term deposit? (`yes`/`no`) |

*⚠️ duration is only known after the call ends; include it only for benchmarking, not for live prediction.
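
The two caveats in the data dictionary (the `-1` sentinel in `pdays` and the post-call nature of `duration`) can be handled before modelling. The following is a minimal sketch on a small hypothetical sample that only reuses the column names listed above; the derived column names `contacted_before` and `pdays_clean` are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the data dictionary.
df = pd.DataFrame({
    "campaign": [1, 3, 2, 5],
    "pdays":    [-1, 92, -1, 180],
    "previous": [0, 2, 0, 1],
    "duration": [261, 151, 76, 92],
    "y":        ["no", "yes", "no", "no"],
})

# -1 is a sentinel for "never contacted", not a real day count,
# so expose it as a flag and blank it out in the numeric column.
df["contacted_before"] = (df["pdays"] != -1).astype(int)
df["pdays_clean"] = df["pdays"].where(df["pdays"] != -1)  # NaN where never contacted

# duration is only known after the call, so drop it for live prediction.
X = df.drop(columns=["duration", "pdays", "y"])
y = (df["y"] == "yes").astype(int)

print(X.columns.tolist())
```

Whether to impute `pdays_clean` or leave it missing depends on the model; tree-based learners often accept missing values, while linear models need an explicit fill value.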

4 · Quick Start in Python

```python
import pandas as pd

# Reading .xlsx files with pandas requires the openpyxl package.
df = pd.read_excel('/kaggle/input/bank-marketing/bank_marketing.xlsx')
print(df.shape)     # (45211, 17)
df.head()
```

Prefer pip? Fetch directly from ucimlrepo:

```python
# Notebook cell syntax; from a shell, run `pip install ucimlrepo` instead.
!pip install ucimlrepo
from ucimlrepo import fetch_ucirepo
bm = fetch_ucirepo(id=222)
X, y = bm.data.features, bm.data.targets
```

## 5 · Use-Cases & Ideas 

| 🛠️ ML Task       | Why it’s interesting                                              |
|--------------------------|----------------------------------------------------------------------------------------------------------------|
| Binary classification  | Classic imbalanced dataset – try **SMOTE**, cost-sensitive learning, threshold tuning             |
| Feature engineering   | Combine `pdays`, `campaign`, `previous` into a **contact-intensity score**                   |
| Model interpretability  | Use **SHAP** / **LIME** to explain “yes” predictions                              |
| Time-aware validation  | Data are date-ordered → split train/test chronologically to avoid leakage                   |

---

## 6 · Credits & Citations 

> **Creators :** **Sérgio Moro, Paulo Rita, Paulo Cortez** 
> **Original paper :** 
> Moro S., Cortez P., Rita P. (2014). 
> *A data-driven approach to predict the success of bank telemarketing campaigns.* 
> *Decision Support Systems.* [[PDF]](https://www.semanticscholar.org/paper/cab86052882d126d43f72108c6cb41b295cc8a9e)

If you use this dataset, please cite:

Moro, S., Rita, P., & Cortez, P. (2014). Bank Marketing [Dataset].
UCI Machine Learning Repository. https://doi.org/10.24432/C5K306


---

## 7 · License 

This dataset is released under the **Creative Commons Attribution 4.0 International (CC BY 4.0)**. 
You are free to share & adapt, **provided you credit the original creators**.

Hyperparameter search space for LR.
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated May 16, 2024
Sahid, Abdus; Uddin, Palash; Babar, Mozaddid Ul Hoque (2024). Hyperparameter search space for LR. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001273267
Explore at:
Dataset updated
May 16, 2024
Authors
Sahid, Abdus; Uddin, Palash; Babar, Mozaddid Ul Hoque
Description
Diabetes is a persistent metabolic disorder linked to elevated levels of blood glucose, commonly referred to as blood sugar. Over time, this condition can have detrimental effects on the heart, blood vessels, eyes, kidneys, and nerves. It is a chronic ailment that arises when the body fails to produce enough insulin or is unable to effectively use the insulin it produces. When diabetes is not properly managed, it often leads to hyperglycemia, a condition characterized by elevated blood sugar levels or impaired glucose tolerance. This can result in significant harm to various body systems, including the nerves and blood vessels. In this paper, we propose a multiclass diabetes mellitus detection and classification approach using the extremely imbalanced Laboratory of Medical City Hospital data dynamics. We also formulate a new, moderately imbalanced dataset based on the Laboratory of Medical City Hospital data dynamics. To correctly identify multiclass diabetes mellitus, we employ three machine learning classifiers, namely support vector machine, logistic regression, and k-nearest neighbor. We also apply dimensionality reduction (feature selection by filter, wrapper, and embedded methods) to prune unnecessary features and improve classification performance. To optimize the performance of the classifiers, we tune the models by hyperparameter optimization with 10-fold grid search cross-validation. For the original extremely imbalanced dataset with a 70:30 partition and the support vector machine classifier, we achieved a maximum accuracy of 0.964, precision of 0.968, recall of 0.964, F1-score of 0.962, Cohen kappa of 0.835, and AUC of 0.99 by using the top 4 features according to the filter method. Using the top 9 features according to wrapper-based sequential feature selection, the k-nearest neighbor classifier provides an accuracy of 0.935 and 1.0 for the other performance metrics.
For our created moderately imbalanced dataset with an 80:20 partition, the SVM classifier achieves a maximum accuracy of 0.938, and 1.0 for the other performance metrics. For multiclass diabetes mellitus detection and classification, our experiments outperformed existing research based on the Laboratory of Medical City Hospital data dynamics.
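
The tuning procedure described in the abstract (10-fold grid search cross-validation) can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the authors' configuration: the data are synthetic, and the grid values, the `f1_macro` scoring, and the scaling step are placeholders, since the actual search space is not reproduced in this listing.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced three-class data standing in for the real dataset.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4,
    n_classes=3, weights=[0.8, 0.1, 0.1], random_state=0,
)

pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Assumed search space for logistic regression (illustrative values only).
param_grid = {
    "logisticregression__C": [0.01, 0.1, 1.0, 10.0],
    "logisticregression__class_weight": [None, "balanced"],
}

# Stratified folds keep the minority classes represented in every split.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, scoring="f1_macro", cv=cv)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Putting the scaler inside the pipeline means it is refit on each training fold, which avoids leaking statistics from the held-out fold into the search.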

Stroke Risk Synthetic 2025

kaggle.com

zip

Updated Sep 26, 2025

Imaad Mahmood (2025). Stroke Risk Synthetic 2025 [Dataset]. https://www.kaggle.com/datasets/imaadmahmood/stroke-risk-synthetic-2025

Explore at:

Available download formats: zip (2288 bytes)

Dataset updated

Sep 26, 2025

Authors

Imaad Mahmood

License

Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically

Description

StrokeRiskSynthetic2025 Dataset

Overview

The StrokeRiskSynthetic2025 dataset is a synthetically generated dataset designed for machine learning and data analysis tasks focused on predicting stroke risk. Created in September 2025, it simulates realistic patient profiles based on established stroke risk factors, drawing inspiration from medical literature and existing healthcare datasets. With 1,000 records, it has a deliberately imbalanced target (approximately 5% stroke cases) to reflect real-world stroke prevalence, making it suitable for binary classification, feature engineering, and handling imbalanced data in educational and research settings.

Data Description

Rows: 1,000
Columns: 12
Target Variable: stroke (binary: 0 = No stroke, 1 = Stroke)
File Format: CSV
Size: Approximately 60 KB

Columns

| Column Name | Type | Description |
|---|---|---|
| `id` | Integer | Unique identifier for each record (1 to 1,000). |
| `gender` | Categorical | Patient gender: Male, Female, Other. |
| `age` | Integer | Patient age in years (0 to 100, skewed toward older adults). |
| `hypertension` | Binary | Hypertension status: 0 = No, 1 = Yes (~30% prevalence). |
| `heart_disease` | Binary | Heart disease status: 0 = No, 1 = Yes (~5-10% prevalence). |
| `ever_married` | Categorical | Marital status: Yes, No (correlated with age). |
| `work_type` | Categorical | Employment type: children, Govt_job, Never_worked, Private, Self-employed. |
| `Residence_type` | Categorical | Residence: Urban, Rural (balanced distribution). |
| `avg_glucose_level` | Float | Average blood glucose level in mg/dL (50 to 300, mean ~100). |
| `bmi` | Float | Body Mass Index (10 to 60, mean ~25). |
| `smoking_status` | Categorical | Smoking history: formerly smoked, never smoked, smokes, Unknown. |
| `stroke` | Binary | Target variable: 0 = No stroke, 1 = Stroke (~5% positive cases). |

Key Features

Realistic Distributions: Reflects real-world stroke risk factors (e.g., age, hypertension, glucose levels) based on 2025 medical data, with ~5% stroke prevalence to mimic imbalanced healthcare datasets.
Synthetic Data: Generated to avoid privacy concerns, ensuring ethical use for research and education.
Versatility: Suitable for binary classification, feature importance analysis (e.g., SHAP), data preprocessing (e.g., imputation, scaling), and handling imbalanced data (e.g., SMOTE).
No Missing Values: Clean dataset for straightforward analysis, though users can introduce missingness for preprocessing practice.

Use Cases

Machine Learning: Train models like Logistic Regression, Random Forest, or XGBoost for stroke prediction.
Data Analysis: Explore correlations between risk factors (e.g., age, hypertension) and stroke outcomes.
Educational Projects: Ideal for learning EDA, feature engineering, and model deployment (e.g., Flask apps).
Healthcare Research: Simulate clinical scenarios for studying stroke risk without real patient data.

Source and Inspiration

This dataset is inspired by stroke risk factors outlined in medical literature (e.g., CDC, WHO) and existing datasets like the Kaggle Stroke Prediction Dataset (2021) and Mendeley’s Synthetic Stroke Prediction Dataset (2025). It incorporates 2025 trends in healthcare ML, such as handling imbalanced data and feature importance analysis.

Usage Notes

Preprocessing: Numerical features (age, avg_glucose_level, bmi) may require scaling; categorical features (gender, work_type, etc.) need encoding (e.g., one-hot, label).
Imbalanced Data: The ~5% stroke prevalence requires techniques like SMOTE, oversampling, or class weighting for effective modeling.
Scalability: Contact the creator to generate larger datasets (e.g., 10,000+ rows) if needed.
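
The preprocessing and class-weighting notes above can be combined in a single scikit-learn pipeline. The following is a minimal sketch on a handful of hypothetical rows that reuse a subset of the documented column names; the choice of logistic regression and of `class_weight="balanced"` is an assumption for illustration, not a recommendation from the dataset author.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical rows; only the column names come from the dataset description.
df = pd.DataFrame({
    "age": [67, 45, 80, 23, 59, 71],
    "avg_glucose_level": [228.7, 95.1, 105.9, 88.0, 140.2, 190.4],
    "bmi": [36.6, 24.1, 32.5, 21.0, 28.3, 30.9],
    "gender": ["Male", "Female", "Female", "Male", "Other", "Female"],
    "work_type": ["Private", "Govt_job", "Self-employed",
                  "Never_worked", "Private", "Self-employed"],
    "stroke": [1, 0, 0, 0, 0, 1],
})

numeric = ["age", "avg_glucose_level", "bmi"]
categorical = ["gender", "work_type"]

# Scale numeric columns and one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

# class_weight="balanced" up-weights the rare positive class during fitting.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

X, y = df[numeric + categorical], df["stroke"]
model.fit(X, y)
print(model.predict(X))
```

Resampling methods such as SMOTE are an alternative to class weighting; if used, they should be applied to the training folds only, never to validation or test data.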

License

This dataset is provided for educational and research purposes under a Creative Commons Attribution 4.0 International License (CC BY 4.0).

Contact

For questions or to request expanded datasets, contact the creator via the platform where this dataset is hosted.

lungcancer
huggingface.co
Updated Nov 17, 2025
Harsh Verma (2025). lungcancer [Dataset]. https://huggingface.co/datasets/Harsh6388/lungcancer
Explore at:
Dataset updated
Nov 17, 2025
Authors
Harsh Verma
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
Synthetic Imbalanced Lung Cancer Dataset for Machine Learning

Overview

This is a synthetic (AI-generated) lung cancer prediction dataset created for academic and research purposes. The dataset contains an imbalanced distribution of cancer vs. non-cancer cases, reflecting real-world medical datasets.

Key Features

Fully synthetic (no real patient data)
Suitable for ML model training & testing
Highly imbalanced target variable
Useful for classification, feature…

See the full description on the dataset page: https://huggingface.co/datasets/Harsh6388/lungcancer.

CSIRO Sentinel-1 SAR image dataset of oil- and non-oil features for machine...
data.csiro.au
researchdata.edu.au
Updated Dec 15, 2022
David Blondeau-Patissier; Thomas Schroeder; Foivos Diakogiannis; Zhibin Li (2022). CSIRO Sentinel-1 SAR image dataset of oil- and non-oil features for machine learning (Deep Learning) [Dataset]. http://doi.org/10.25919/4v55-dn16
Explore at:
Unique identifier
https://doi.org/10.25919/4v55-dn16
Dataset updated
Dec 15, 2022
Dataset provided by
CSIRO: http://www.csiro.au/
Authors
David Blondeau-Patissier; Thomas Schroeder; Foivos Diakogiannis; Zhibin Li
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Time period covered
May 1, 2015 - Aug 31, 2022
Area covered

Dataset funded by
CSIRO: http://www.csiro.au/
ESA
Description
What this collection is: A curated, binary-classified dataset of grayscale (1-band) images of 400 x 400 pixels, referred to as image chips, in JPEG format. The chips were extracted from processed Sentinel-1 Synthetic Aperture Radar (SAR) satellite scenes acquired over various regions of the world, and feature clear open-ocean chips, look-alikes (wind or biogenic features), and oil slick chips.

This binary dataset contains chips labelled as:
- "0" for chips not containing any oil features (look-alikes or clean seas)
- "1" for those containing oil features.

This binary dataset is imbalanced, and biased towards "0" labelled chips (i.e., no oil features), which correspond to 66% of the dataset. Chips containing oil features, labelled "1", correspond to 34% of the dataset.

Why: This dataset can be used for training, validation and/or testing of machine learning, including deep learning, algorithms for the detection of oil features in SAR imagery. Directly applicable for algorithm development for the European Space Agency Sentinel-1 SAR mission (https://sentinel.esa.int/web/sentinel/missions/sentinel-1 ), it may be suitable for the development of detection algorithms for other SAR satellite sensors.

Overview of this dataset: Total number of chips (both classes) is N=5,630.

| Class | 0 | 1 |
|---|---|---|
| Total | 3,725 | 1,905 |

Further information and description is found in the ReadMe file provided (ReadMe_Sentinel1_SAR_OilNoOil_20221215.txt)
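
As a worked example of the class imbalance stated above, the class counts can be turned into proportions and per-class loss weights. The sketch assumes the common "balanced" heuristic, weight = N / (number of classes × class count), which is the same formula scikit-learn applies for `class_weight="balanced"`; the dataset authors do not prescribe any particular weighting.

```python
# Class counts as stated in the dataset description.
counts = {0: 3725, 1: 1905}

total = sum(counts.values())
n_classes = len(counts)

# Share of each class in the dataset.
shares = {label: n / total for label, n in counts.items()}

# "Balanced" heuristic: rarer classes receive proportionally larger weights.
weights = {label: total / (n_classes * n) for label, n in counts.items()}

for label in counts:
    print(label, f"{shares[label]:.1%}", round(weights[label], 3))
```

With these weights, each class contributes the same total weight to a weighted loss, since the count times the weight equals N / 2 for both classes.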

