100+ datasets found
  1. Data from: A virtual multi-label approach to imbalanced data classification

    • tandf.figshare.com
    text/x-tex
    Updated Feb 28, 2024
    Cite
    Elizabeth P. Chou; Shan-Ping Yang (2024). A virtual multi-label approach to imbalanced data classification [Dataset]. http://doi.org/10.6084/m9.figshare.19390561.v1
    Available download formats: text/x-tex
    Dataset updated
    Feb 28, 2024
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Elizabeth P. Chou; Shan-Ping Yang
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    One of the most challenging issues in machine learning is imbalanced data analysis. Usually, in this type of research, correctly predicting minority labels is more critical than correctly predicting majority labels. However, traditional machine learning techniques easily lead to learning bias. Traditional classifiers tend to place all subjects in the majority group, resulting in biased predictions. Machine learning studies are typically conducted from one of two perspectives: a data-based perspective or a model-based perspective. Oversampling and undersampling are examples of data-based approaches, while the addition of costs, penalties, or weights to optimize the algorithm is typical of a model-based approach. Some ensemble methods have been studied recently. These methods cause various problems, such as overfitting, the omission of some information, and long computation times. In addition, these methods do not apply to all kinds of datasets. Based on this problem, the virtual labels (ViLa) approach for the majority label is proposed to solve the imbalanced problem. A new multiclass classification approach with the equal K-means clustering method is demonstrated in the study. The proposed method is compared with commonly used imbalance problem methods, such as sampling methods (oversampling, undersampling, and SMOTE) and classifier methods (SVM and one-class SVM). The results show that the proposed method performs better when the degree of data imbalance increases and will gradually outperform other methods.

  2. Data from: GHOST: Adjusting the Decision Threshold to Handle Imbalanced Data...

    • acs.figshare.com
    zip
    Updated Jun 2, 2023
    Cite
    Carmen Esposito; Gregory A. Landrum; Nadine Schneider; Nikolaus Stiefl; Sereina Riniker (2023). GHOST: Adjusting the Decision Threshold to Handle Imbalanced Data in Machine Learning [Dataset]. http://doi.org/10.1021/acs.jcim.1c00160.s002
    Available download formats: zip
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    ACS Publications
    Authors
    Carmen Esposito; Gregory A. Landrum; Nadine Schneider; Nikolaus Stiefl; Sereina Riniker
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Machine learning classifiers trained on class imbalanced data are prone to overpredict the majority class. This leads to a larger misclassification rate for the minority class, which in many real-world applications is the class of interest. For binary data, the classification threshold is set by default to 0.5 which, however, is often not ideal for imbalanced data. Adjusting the decision threshold is a good strategy to deal with the class imbalance problem. In this work, we present two different automated procedures for the selection of the optimal decision threshold for imbalanced classification. A major advantage of our procedures is that they do not require retraining of the machine learning models or resampling of the training data. The first approach is specific for random forest (RF), while the second approach, named GHOST, can be potentially applied to any machine learning classifier. We tested these procedures on 138 public drug discovery data sets containing structure–activity data for a variety of pharmaceutical targets. We show that both thresholding methods improve significantly the performance of RF. We tested the use of GHOST with four different classifiers in combination with two molecular descriptors, and we found that most classifiers benefit from threshold optimization. GHOST also outperformed other strategies, including random undersampling and conformal prediction. Finally, we show that our thresholding procedures can be effectively applied to real-world drug discovery projects, where the imbalance and characteristics of the data vary greatly between the training and test sets.
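
    For readers who want to experiment with the idea, the core of threshold adjustment (scan candidate thresholds on out-of-bag predictions and keep the best one, with no retraining or resampling) can be sketched generically as below. The synthetic data, the kappa selection metric, and the threshold grid are illustrative assumptions, not the authors' exact GHOST procedure.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import cohen_kappa_score
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for an imbalanced activity dataset (~5% positives).
    X, y = make_classification(n_samples=5000, weights=[0.95], flip_y=0, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    rf = RandomForestClassifier(oob_score=True, random_state=0).fit(X_tr, y_tr)
    oob_probs = rf.oob_decision_function_[:, 1]   # out-of-bag scores, no retraining needed

    # Scan candidate thresholds and keep the one with the best kappa on the OOB predictions.
    thresholds = np.arange(0.05, 0.55, 0.05)
    best_t = max(thresholds, key=lambda t: cohen_kappa_score(y_tr, (oob_probs >= t).astype(int)))

    test_scores = rf.predict_proba(X_te)[:, 1]
    print("chosen threshold:", best_t)
    print("test kappa at 0.5:   ", cohen_kappa_score(y_te, (test_scores >= 0.5).astype(int)))
    print("test kappa at best_t:", cohen_kappa_score(y_te, (test_scores >= best_t).astype(int)))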

  3. Imbalanced Cifar-10

    • kaggle.com
    zip
    Updated Jun 17, 2023
    Cite
    Akhil Theerthala (2023). Imbalanced Cifar-10 [Dataset]. https://www.kaggle.com/datasets/akhiltheerthala/imbalanced-cifar-10
    Available download formats: zip (807146485 bytes)
    Dataset updated
    Jun 17, 2023
    Authors
    Akhil Theerthala
    Description

    This dataset is a modified version of the classic CIFAR 10, deliberately designed to be imbalanced across its classes. CIFAR 10 typically consists of 60,000 32x32 color images in 10 classes, with 5000 images per class in the training set. However, this dataset skews these distributions to create a more challenging environment for developing and testing machine learning algorithms. The distribution can be visualized as follows,

    [Figure: class distribution of the imbalanced CIFAR-10 training set]

    The primary purpose of this dataset is to offer researchers and practitioners a platform to develop, test, and enhance algorithms' robustness when faced with class imbalances. It is especially suited for those interested in binary and multi-class imbalance learning, anomaly detection, and other relevant fields.

    The imbalance was created synthetically, maintaining the same quality and diversity of the original CIFAR 10 dataset, but with varying degrees of representation for each class. Details of the class distributions are included in the dataset's metadata.

    This dataset is beneficial for:
    - Developing and testing strategies for handling imbalanced datasets.
    - Investigating the effects of class imbalance on model performance.
    - Comparing different machine learning algorithms' performance under class imbalance.

    Usage Information:

    The dataset maintains the same format as the original CIFAR 10 dataset, making it easy to incorporate into existing projects. It is organised so that it can be loaded directly with PyTorch's ImageFolder, and you can work with it in Python using popular libraries like NumPy and PyTorch.
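
    Assuming the extracted archive keeps one sub-folder per class (the layout ImageFolder expects), a minimal loading sketch might look like the following. The local folder name is an assumption, and the normalization constants are the commonly used CIFAR-10 statistics rather than values documented for this dataset.

    from collections import Counter
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
    ])
    # Path assumed; point it at the extracted training folder with one sub-folder per class.
    train_set = datasets.ImageFolder("imbalanced-cifar-10/train", transform=transform)
    train_loader = DataLoader(train_set, batch_size=128, shuffle=True)

    # Inspect the per-class counts to see the imbalance.
    print(Counter(train_set.targets))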

    License: This dataset follows the same license terms as the original CIFAR 10 dataset. Please refer to the official CIFAR 10 website for details.

    Acknowledgments: We want to acknowledge the creators of the CIFAR 10 dataset. Without their work and willingness to share data, this synthetic imbalanced dataset wouldn't be possible.

  4. Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in...

    • frontiersin.figshare.com
    • datasetcatalog.nlm.nih.gov
    docx
    Updated Jun 1, 2023
    Cite
    Giulia Varotto; Gianluca Susi; Laura Tassi; Francesca Gozzo; Silvana Franceschetti; Ferruccio Panzica (2023). Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in Machine Learning: Application to Epileptogenic Zone Localization From Interictal Intracranial EEG Recordings in Patients With Focal Epilepsy.DOCX [Dataset]. http://doi.org/10.3389/fninf.2021.715421.s002
    Available download formats: docx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Giulia Varotto; Gianluca Susi; Laura Tassi; Francesca Gozzo; Silvana Franceschetti; Ferruccio Panzica
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Aim: In neuroscience research, data are quite often characterized by an imbalanced distribution between the majority and minority classes, an issue that can limit or even worsen the prediction performance of machine learning methods. Different resampling procedures have been developed to face this problem and a lot of work has been done in comparing their effectiveness in different scenarios. Notably, the robustness of such techniques has been tested among a wide variety of different datasets, without considering the performance of each specific dataset. In this study, we compare the performances of different resampling procedures for the imbalanced domain in stereo-electroencephalography (SEEG) recordings of the patients with focal epilepsies who underwent surgery.

    Methods: We considered data obtained by network analysis of interictal SEEG recorded from 10 patients with drug-resistant focal epilepsies, for a supervised classification problem aimed at distinguishing between the epileptogenic and non-epileptogenic brain regions in interictal conditions. We investigated the effectiveness of five oversampling and five undersampling procedures, using 10 different machine learning classifiers. Moreover, six specific ensemble methods for the imbalanced domain were also tested. To compare the performances, Area under the ROC curve (AUC), F-measure, Geometric Mean, and Balanced Accuracy were considered.

    Results: Both the resampling procedures showed improved performances with respect to the original dataset. The oversampling procedure was found to be more sensitive to the type of classification method employed, with Adaptive Synthetic Sampling (ADASYN) exhibiting the best performances. All the undersampling approaches were more robust than the oversampling among the different classifiers, with Random Undersampling (RUS) exhibiting the best performance despite being the simplest and most basic classification method.

    Conclusions: The application of machine learning techniques that take into consideration the balance of features by resampling is beneficial and leads to more accurate localization of the epileptogenic zone from interictal periods. In addition, our results highlight the importance of the type of classification method that must be used together with the resampling to maximize the benefit to the outcome.
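
    The resampling procedures named in the results (ADASYN and Random Undersampling) are available in the imbalanced-learn package; a minimal sketch on synthetic data, not on the SEEG features used in the study:

    from collections import Counter
    from imblearn.over_sampling import ADASYN
    from imblearn.under_sampling import RandomUnderSampler
    from sklearn.datasets import make_classification

    # Synthetic imbalanced data standing in for the epileptogenic / non-epileptogenic labels.
    X, y = make_classification(n_samples=2000, weights=[0.9], flip_y=0, random_state=0)
    print("original:", Counter(y))

    X_over, y_over = ADASYN(random_state=0).fit_resample(X, y)
    print("ADASYN:  ", Counter(y_over))

    X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
    print("RUS:     ", Counter(y_under))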

  5. Results of BILSTM for rare classes for the imbalanced dataset with different...

    • plos.figshare.com
    xls
    Updated Nov 16, 2023
    Cite
    Alaa Alomari; Hossam Faris; Pedro A. Castillo (2023). Results of BILSTM for rare classes for the imbalanced dataset with different reweighting factors. [Dataset]. http://doi.org/10.1371/journal.pone.0290581.t006
    Available download formats: xls
    Dataset updated
    Nov 16, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Alaa Alomari; Hossam Faris; Pedro A. Castillo
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Results of BILSTM for rare classes for the imbalanced dataset with different reweighting factors.

  6. The definition of a confusion matrix.

    • plos.figshare.com
    xls
    Updated Feb 10, 2025
    Cite
    Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). The definition of a confusion matrix. [Dataset]. http://doi.org/10.1371/journal.pone.0317396.t002
    Available download formats: xls
    Dataset updated
    Feb 10, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In recent years, the challenge of imbalanced data has become increasingly prominent in machine learning, affecting the performance of classification algorithms. This study proposes a novel data-level oversampling method called Cluster-Based Reduced Noise SMOTE (CRN-SMOTE) to address this issue. CRN-SMOTE combines SMOTE for oversampling minority classes with a novel cluster-based noise reduction technique. In this cluster-based noise reduction approach, it is crucial that samples from each category form one or two clusters, a feature that conventional noise reduction methods do not achieve. The proposed method is evaluated on four imbalanced datasets (ILPD, QSAR, Blood, and Maternal Health Risk) using five metrics: Cohen’s kappa, Matthew’s correlation coefficient (MCC), F1-score, precision, and recall. Results demonstrate that CRN-SMOTE consistently outperformed the state-of-the-art Reduced Noise SMOTE (RN-SMOTE), SMOTE-Tomek Link, and SMOTE-ENN methods across all datasets, with particularly notable improvements observed in the QSAR and Maternal Health Risk datasets, indicating its effectiveness in enhancing imbalanced classification performance. Overall, the experimental findings indicate that CRN-SMOTE outperformed RN-SMOTE in 100% of the cases, achieving average improvements of 6.6% in Kappa, 4.01% in MCC, 1.87% in F1-score, 1.7% in precision, and 2.05% in recall, with setting SMOTE’s neighbors’ number to 5.
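
    CRN-SMOTE itself is the authors' contribution, but the baselines it is compared against (SMOTE-Tomek Link and SMOTE-ENN) and the reported metrics are all available off the shelf; a small illustrative sketch on synthetic data, not a reproduction of the paper's experiments:

    from imblearn.combine import SMOTEENN, SMOTETomek
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import cohen_kappa_score, matthews_corrcoef
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, weights=[0.85], flip_y=0, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    for sampler in (SMOTETomek(random_state=0), SMOTEENN(random_state=0)):
        X_rs, y_rs = sampler.fit_resample(X_tr, y_tr)
        pred = LogisticRegression(max_iter=1000).fit(X_rs, y_rs).predict(X_te)
        print(type(sampler).__name__,
              "kappa:", round(cohen_kappa_score(y_te, pred), 3),
              "MCC:", round(matthews_corrcoef(y_te, pred), 3))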

  7. Predict students' dropout and academic success

    • zenodo.org
    • data-staging.niaid.nih.gov
    Updated Mar 14, 2023
    Cite
    Valentim Realinho; Jorge Machado; Luís Baptista; Mónica V. Martins (2023). Predict students' dropout and academic success [Dataset]. http://doi.org/10.5281/zenodo.5777340
    Dataset updated
    Mar 14, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Valentim Realinho; Jorge Machado; Luís Baptista; Mónica V. Martins
    Description

    A dataset created from a higher education institution (acquired from several disjoint databases) related to students enrolled in different undergraduate degrees, such as agronomy, design, education, nursing, journalism, management, social service, and technologies.

    The dataset includes information known at the time of student enrollment (academic path, demographics, and social-economic factors) and the students' academic performance at the end of the first and second semesters.

    The data is used to build classification models to predict students' dropout and academic success. The problem is formulated as a three-category classification task (dropout, enrolled, and graduate) at the end of the normal duration of the course.

    Funding
    We acknowledge support of this work by the program "SATDAP - Capacitação da Administração Pública under grant POCI-05-5762-FSE-000191, Portugal"

  8. Is this a good customer?

    • kaggle.com
    zip
    Updated Apr 16, 2020
    Cite
    podsyp (2020). Is this a good customer? [Dataset]. https://www.kaggle.com/podsyp/is-this-a-good-customer
    Available download formats: zip (19523 bytes)
    Dataset updated
    Apr 16, 2020
    Authors
    podsyp
    License

    CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Imbalanced classes put “accuracy” out of business. This is a surprisingly common problem in machine learning (specifically in classification), occurring in datasets with a disproportionate ratio of observations in each class.

    Content

    Standard accuracy no longer reliably measures performance, which makes model training much trickier. Imbalanced classes appear in many domains, including:
    - Antifraud
    - Antispam
    - ...

    Inspiration

    5 tactics for handling imbalanced classes in machine learning:
    - Up-sample the minority class
    - Down-sample the majority class
    - Change your performance metric
    - Penalize algorithms (cost-sensitive training)
    - Use tree-based algorithms
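
    The first and fourth tactics can be tried in a few lines with scikit-learn; in the sketch below the CSV file name and the target column are placeholders, not this dataset's documented schema.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.utils import resample

    df = pd.read_csv("clients.csv")        # placeholder file name
    target = "bad_client_target"           # placeholder target column

    majority = df[df[target] == 0]
    minority = df[df[target] == 1]

    # Tactic 1: up-sample the minority class until both classes are the same size.
    minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)
    balanced = pd.concat([majority, minority_up])

    # Tactic 4: cost-sensitive training via class weights, with no resampling at all.
    clf = LogisticRegression(max_iter=1000, class_weight="balanced")
    # clf.fit(X, y) after encoding/scaling the feature columns.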

  9. Learning from Imbalanced Insurance Data

    • kaggle.com
    zip
    Updated Nov 23, 2020
    Cite
    Möbius (2020). Learning from Imbalanced Insurance Data [Dataset]. https://www.kaggle.com/arashnic/imbalanced-data-practice
    Available download formats: zip (7004103 bytes)
    Dataset updated
    Nov 23, 2020
    Authors
    Möbius
    License

    CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Insurance companies that sell life, health, and property and casualty insurance are using machine learning (ML) to drive improvements in customer service, fraud detection, and operational efficiency. The data was provided by an insurance company that, like many others, wants to take advantage of ML. This company provides health insurance to its customers, and the goal is to build a model to predict whether the policyholders (customers) from the past year will also be interested in the vehicle insurance provided by the company.

    An insurance policy is an arrangement by which a company undertakes to provide a guarantee of compensation for specified loss, damage, illness, or death in return for the payment of a specified premium. A premium is a sum of money that the customer needs to pay regularly to an insurance company for this guarantee.

    For example, you may pay a premium of Rs. 5000 each year for a health insurance cover of Rs. 200,000/- so that if, God forbid, you fall ill and need to be hospitalized in that year, the insurance provider company will bear the cost of hospitalization etc. for up to Rs. 200,000. Now if you are wondering how the company can bear such a high hospitalization cost when it charges a premium of only Rs. 5000/-, that is where the concept of probabilities comes into the picture. For example, like you, there may be 100 customers who would be paying a premium of Rs. 5000 every year, but only a few of them (say 2-3) would get hospitalized that year, and not everyone. This way everyone shares the risk of everyone else.

    Just like medical insurance, there is vehicle insurance, where every year the customer needs to pay a premium of a certain amount to the insurance provider company so that, in case of an unfortunate accident involving the vehicle, the insurance provider company will provide compensation (called ‘sum assured’) to the customer.

    Content

    Building a model to predict whether a customer would be interested in Vehicle Insurance is extremely helpful for the company because it can then accordingly plan its communication strategy to reach out to those customers and optimize its business model and revenue.

    We have information about:
    - Demographics (gender, age, region code type)
    - Vehicles (Vehicle Age, Damage)
    - Policy (Premium, sourcing channel), etc.

    Update: Test data target values have been added. To evaluate your models more precisely you can use: https://www.kaggle.com/arashnic/answer


    Moreover, a supplementary goal is to practice learning from imbalanced data and to verify how the results can help in a real operational process. The Response feature (target) is highly imbalanced.

    The class distribution of Response is:

    0    319594
    1     62531
    Name: Response, dtype: int64

    Practicing techniques such as resampling is useful for verifying their impact on validation results and on the confusion matrix.

    [Figure: under-sampling with Tomek links]
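
    A quick way to reproduce the counts above and to try a simple random under-sampling baseline; the column name Response comes from the description, while the CSV file name is an assumption about the Kaggle archive layout.

    import pandas as pd
    from imblearn.under_sampling import RandomUnderSampler

    df = pd.read_csv("train.csv")                    # assumed file name
    print(df["Response"].value_counts())

    X, y = df.drop(columns="Response"), df["Response"]
    X_rus, y_rus = RandomUnderSampler(random_state=0).fit_resample(X, y)
    print(y_rus.value_counts())                      # now a 50/50 split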


    Inspiration

    Predict whether a customer would be interested in Vehicle Insurance



  10. Cerebral Stroke Dataset

    • kaggle.com
    zip
    Updated Sep 25, 2025
    Cite
    dailydaisy2 (2025). Cerebral Stroke Dataset [Dataset]. https://www.kaggle.com/datasets/viviansam/cerebral-stroke-dataset
    Available download formats: zip (573312 bytes)
    Dataset updated
    Sep 25, 2025
    Authors
    dailydaisy2
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Retrieved from Mendeley Data on 16-Dec-2024: https://data.mendeley.com/datasets/x8ygrw87jw/1

    This dataset comprises vital information on potential cerebral stroke patients, including personal data (e.g., age, gender, etc.), and disease history (e.g. hypertension, heart disease, etc.), which was collected from HealthData.gov by Liu, Fan & Wu (2019) during their study titled 'A hybrid machine learning approach to cerebral stroke prediction based on an imbalanced medical dataset'. The data collection prioritized physiological indicators over complex medical monitoring to minimize diagnosis expenses.

    This cerebral stroke dataset records information from 43400 potential patients, comprising 12 attributes with various data types.

    1. id - Unique identifier of each patient
    2. gender - Gender of the patient: male, female, other
    3. age - Age of the patient: ranged from 0.08 to 82
    4. hypertension - If the patient has hypertension: 0, 1 (no, yes, respectively)
    5. heart_disease - If the patient has heart disease: 0, 1 (no, yes, respectively)
    6. ever_married - Marital status of patient: No, Yes
    7. work_type - Occupation type of patient: children, private sector, self-employed, government sector, never worked
    8. Residence_type - Residency type of patient: rural, urban
    9. avg_glucose_level - Average glucose level in blood: ranged from 55 to 279.66
    10. bmi - Body mass index: ranged from 10.1 to 97.6
    11. smoking_status - Smoking status: formerly smoked, never smoked, smokes
    12. stroke - If the patient has stroke: 0, 1 (no, yes, respectively)

    The target variable, ‘stroke’, is categorized into ‘0’ and ‘1’, representing ‘no stroke’ and ‘have stroke’ respectively. It is a categorical variable, making the problem a binary classification task. This dataset includes 783 occurrences of stroke, which account for about 1.8% of the 43,400 records, resulting in a highly imbalanced dataset. This imbalance reflects actual clinical practice, where most medical datasets suffer from class imbalance by nature.
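
    With under 2% positive cases, a stratified split is a sensible first step so that stroke cases appear in both the training and test sets. A minimal sketch follows; the column name stroke comes from the list above, while the CSV file name is an assumption.

    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("cerebral_stroke.csv")                 # assumed file name
    print(df["stroke"].value_counts(normalize=True))        # shows the class imbalance

    train_df, test_df = train_test_split(
        df, test_size=0.2, stratify=df["stroke"], random_state=42
    )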

  11. Lending Club Loan Data

    • kaggle.com
    zip
    Updated Nov 8, 2020
    Cite
    Sweta Shetye (2020). Lending Club Loan Data [Dataset]. https://www.kaggle.com/swetashetye/lending-club-loan-data-imbalance-dataset
    Available download formats: zip (218250 bytes)
    Dataset updated
    Nov 8, 2020
    Authors
    Sweta Shetye
    License

    CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    I wanted a highly imbalanced dataset to share with others, and LendingClub's loan data is a perfect fit.

    Imbalanced data typically refers to a classification problem where the number of observations per class is not equally distributed; often you'll have a large amount of data/observations for one class (referred to as the majority class), and much fewer observations for one or more other classes (referred to as the minority classes).

    For example, in this dataset there are way more samples of fully paid borrowers than of borrowers who did not fully pay.

    Full LendingClub data available from their site.

    Content

    For companies like Lending Club, correctly predicting whether or not a loan will default is very important. This dataset contains historical data from 2007 to 2015 that you can use to build a deep learning model to predict the chance of default for future loans. As you will see, this dataset is highly imbalanced and includes a lot of features that make this problem more challenging.

  12. Data from: WikiChurches – A Fine-Grained Dataset of Architectural Styles...

    • zenodo.org
    • explore.openaire.eu
    bin, json, pdf, txt +1
    Updated Jul 18, 2024
    Cite
    Björn Barz; Joachim Denzler (2024). WikiChurches – A Fine-Grained Dataset of Architectural Styles with Real-World Challenges [Dataset]. http://doi.org/10.5281/zenodo.5166987
    Available download formats: pdf, txt, json, bin, zip
    Dataset updated
    Jul 18, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Björn Barz; Joachim Denzler
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    WikiChurches is a dataset for architectural style classification, consisting of 9,485 images of church buildings. Both images and style labels were sourced from Wikipedia. The dataset can serve as a benchmark for various research fields, as it combines numerous real-world challenges: fine-grained distinctions between classes based on subtle visual features, a comparatively small sample size, a highly imbalanced class distribution, a high variance of viewpoints, and a hierarchical organization of labels, where only some images are labeled at the most precise level. In addition, we provide 631 bounding box annotations of characteristic visual features for 139 churches from four major categories. These annotations can, for example, be useful for research on fine-grained classification, where additional expert knowledge about distinctive object parts is often available.

    Please refer to the README.md file for information about the different files contained in this dataset.

  13. Performance comparison of machine learning models across accuracy, AUC, MCC,...

    • plos.figshare.com
    xls
    Updated Dec 31, 2024
    Cite
    Seongil Han; Haemin Jung (2024). Performance comparison of machine learning models across accuracy, AUC, MCC, and F1 score on GMSC dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0316454.t005
    Available download formats: xls
    Dataset updated
    Dec 31, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Seongil Han; Haemin Jung
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Performance comparison of machine learning models across accuracy, AUC, MCC, and F1 score on GMSC dataset.

  14. Financial Transaction Fraud Detection

    • kaggle.com
    zip
    Updated Aug 20, 2025
    Cite
    Abhi pratap (2025). Financial Transaction Fraud Detection [Dataset]. https://www.kaggle.com/datasets/abhipratapsingh/fraud-detection
    Available download formats: zip (186385507 bytes)
    Dataset updated
    Aug 20, 2025
    Authors
    Abhi pratap
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset is a valuable resource for building and evaluating machine learning models to predict fraudulent transactions in an e-commerce environment. With 6.3 million rows, it provides a rich, real-world scenario for data science tasks.

    The data is an excellent case study for several key challenges in machine learning, including:

    • Handling Imbalanced Data: The dataset is highly imbalanced, as legitimate transactions vastly outnumber fraudulent ones. This necessitates the use of specialized techniques like SMOTE or advanced models like XGBoost that can handle class imbalance effectively.

    • Feature Engineering: The raw data provides an opportunity to create new, more powerful features, such as transaction velocity or the ratio of account balances, which can improve model performance.

    • Model Evaluation: Traditional metrics like accuracy are misleading for this type of dataset. The project requires a deeper analysis using metrics such as Precision, Recall, F1-Score, and the Precision-Recall AUC to truly understand the model's effectiveness.
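
    For the evaluation point above, a minimal sketch of a precision-recall assessment on synthetic, similarly imbalanced data; the model choice and the generated features are illustrative only, not this dataset's columns.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import HistGradientBoostingClassifier
    from sklearn.metrics import auc, precision_recall_curve
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in: roughly 1% "fraudulent" samples.
    X, y = make_classification(n_samples=20000, weights=[0.99], flip_y=0, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    clf = HistGradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
    scores = clf.predict_proba(X_te)[:, 1]

    precision, recall, _ = precision_recall_curve(y_te, scores)
    print("PR-AUC:", auc(recall, precision))   # far more informative than accuracy here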

    Key Features: The dataset includes a variety of anonymized transaction details:

    • amount: The value of the transaction.

    • type: The type of transaction (e.g., TRANSFER, CASH_OUT).

    • oldbalance & newbalance: The balances of the origin and destination accounts before and after the transaction.

    • isFraud: The target variable, a binary flag indicating a fraudulent transaction.

  15. Data from: Handling Imbalanced Classification Problems by Weighted...

    • tandf.figshare.com
    ai
    Updated Dec 16, 2024
    Cite
    Chen Dou; Yan Lv; Zhen Wang; Lan Bai (2024). Handling Imbalanced Classification Problems by Weighted Generalization Memorization Machine [Dataset]. http://doi.org/10.6084/m9.figshare.25858505.v1
    Available download formats: ai
    Dataset updated
    Dec 16, 2024
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Chen Dou; Yan Lv; Zhen Wang; Lan Bai
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Imbalanced classification problems are of great significance in life, and there have been many methods to deal with them, e.g. eXtreme Gradient Boosting (XGBoost), Logistic Regression (LR), Decision Trees (DT), and Support Vector Machine (SVM). Recently, a novel Generalization-Memorization Machine (GMM) was proposed to maintain good generalization ability with zero empirical risk for binary classification. This paper proposes a Weighted Generalization Memorization Machine (WGMM) for imbalanced classification. By improving the memory cost function and memory influence function of GMM, our WGMM also maintains zero empirical risk with good generalization ability for imbalanced classification learning. The new adaptive memory influence function in our WGMM ensures that samples are described individually and are not affected by training samples from a different category. We conduct experiments on 31 datasets and compare the WGMM with some other classification methods. The results exhibit the effectiveness of the WGMM.

  16. Bank Telemarketing

    • kaggle.com
    zip
    Updated Jun 1, 2025
    Cite
    Younus_Mohamed (2025). Bank Telemarketing [Dataset]. https://www.kaggle.com/datasets/younusmohamed/bank-telemarketing
    Available download formats: zip (3248401 bytes)
    Dataset updated
    Jun 1, 2025
    Authors
    Younus_Mohamed
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    📞 Bank Marketing (Term Deposit Subscription) Dataset

    Source : UCI Machine Learning Repository – Bank Marketing (#222)

    A Portuguese retail bank’s phone-based marketing campaigns (May 2008 → Nov 2010).
    The task is to predict whether a client will subscribe to a term deposit (target y).

    1 · Background

    • Each row records the outcome of the last phone call (plus client history).
    • Multiple calls to the same client may appear across campaigns.
    • The original authors showed that data-driven targeting boosts campaign ROI – see the reference paper below.

    2 · Files in this Kaggle release

    | File | Rows | Columns | Notes |
    |------|------|---------|-------|
    | bank_marketing.xlsx | 45,211 | 17 | Classic “bank-full” version (all examples, 16 predictors + target) |

    Need the enriched “bank-additional” version with 20 predictors? Grab it from the UCI link.

    3 · Data Dictionary (16 predictors + target)

    | Column | Type | Description |
    |--------|------|-------------|
    | age | int | Age of the client |
    | job | cat | Job type (admin., blue-collar, …) |
    | marital | cat | Marital status (married / single / divorced) |
    | education | cat | Education level (primary / secondary / tertiary / unknown) |
    | default | bin | Has credit in default? |
    | balance | int | Average yearly balance (EUR) |
    | housing | bin | Has housing loan? |
    | loan | bin | Has personal loan? |
    | contact | cat | Contact channel (cellular / telephone / unknown) |
    | day | int | Day of month of last contact |
    | month | cat | Month of last contact (jan-dec) |
    | duration | int | Call duration (secs)* |
    | campaign | int | Contacts made in this campaign (incl. last) |
    | pdays | int | Days since last contact (-1 ⇒ never) |
    | previous | int | Previous contacts before this campaign |
    | poutcome | cat | Outcome of previous campaign (failure / success / nonexistent) |
    | y | bin | Target – subscribed to term deposit? (yes/no) |

    *⚠️ duration is only known after the call ends; include it only for benchmarking, not for live prediction.

    4 · Quick Start in Python

    import pandas as pd
    
    df = pd.read_excel('/kaggle/input/bank-marketing/bank_marketing.xlsx')
    print(df.shape)     # (45211, 17)
    df.head()
    
    Prefer pip? Fetch directly from ucimlrepo:

    !pip install ucimlrepo
    from ucimlrepo import fetch_ucirepo
    bm = fetch_ucirepo(id=222)
    X, y = bm.data.features, bm.data.targets

    5 · Use-Cases & Ideas

    | 🛠️ ML Task | Why it’s interesting |
    |------------|----------------------|
    | Binary classification | Classic imbalanced dataset – try SMOTE, cost-sensitive learning, threshold tuning |
    | Feature engineering | Combine pdays, campaign, previous into a contact-intensity score |
    | Model interpretability | Use SHAP / LIME to explain “yes” predictions |
    | Time-aware validation | Data are date-ordered → split train/test chronologically to avoid leakage |

    6 · Credits & Citations

    Creators: Sérgio Moro, Paulo Rita, Paulo Cortez
    Original paper: Moro S., Cortez P., Rita P. (2014). A data-driven approach to predict the success of bank telemarketing campaigns. Decision Support Systems. [PDF](https://www.semanticscholar.org/paper/cab86052882d126d43f72108c6cb41b295cc8a9e)

    If you use this dataset, please cite:

    Moro, S., Rita, P., & Cortez, P. (2014). Bank Marketing [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5K306

    7 · License

    This dataset is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. You are free to share & adapt, provided you credit the original creators.
    
  17. Hyperparameter search space for LR.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated May 16, 2024
    Cite
    Sahid, Abdus; Uddin, Palash; Babar, Mozaddid Ul Hoque (2024). Hyperparameter search space for LR. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001273267
    Dataset updated
    May 16, 2024
    Authors
    Sahid, Abdus; Uddin, Palash; Babar, Mozaddid Ul Hoque
    Description

    Diabetes is a persistent metabolic disorder linked to elevated levels of blood glucose, commonly referred to as blood sugar. This condition can have detrimental effects on the heart, blood vessels, eyes, kidneys, and nerves as time passes. It is a chronic ailment that arises when the body fails to produce enough insulin or is unable to effectively use the insulin it produces. When diabetes is not properly managed, it often leads to hyperglycemia, a condition characterized by elevated blood sugar levels or impaired glucose tolerance. This can result in significant harm to various body systems, including the nerves and blood vessels. In this paper, we propose a multiclass diabetes mellitus detection and classification approach using an extremely imbalanced Laboratory of Medical City Hospital data dynamics. We also formulate a new dataset that is moderately imbalanced based on the Laboratory of Medical City Hospital data dynamics. To correctly identify the multiclass diabetes mellitus, we employ three machine learning classifiers namely support vector machine, logistic regression, and k-nearest neighbor. We also focus on dimensionality reduction (feature selection—filter, wrapper, and embedded method) to prune the unnecessary features and to scale up the classification performance. To optimize the classification performance of classifiers, we tune the model by hyperparameter optimization with 10-fold grid search cross-validation. In the case of the original extremely imbalanced dataset with 70:30 partition and support vector machine classifier, we achieved maximum accuracy of 0.964, precision of 0.968, recall of 0.964, F1-score of 0.962, Cohen kappa of 0.835, and AUC of 0.99 by using top 4 feature according to filter method. By using the top 9 features according to wrapper-based sequential feature selection, the k-nearest neighbor provides an accuracy of 0.935 and 1.0 for the other performance metrics. For our created moderately imbalanced dataset with an 80:20 partition, the SVM classifier achieves a maximum accuracy of 0.938, and 1.0 for other performance metrics. For the multiclass diabetes mellitus detection and classification, our experiments outperformed conducted research based on the Laboratory of Medical City Hospital data dynamics.
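
    As a rough illustration of the tuning step described above, a 10-fold grid search over an SVM on synthetic multiclass data might look as follows. The parameter grid, the scoring choice, and the generated data are placeholders, not the authors' exact search space or the hospital data.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Synthetic three-class stand-in for the multiclass diabetes labels.
    X, y = make_classification(n_samples=1000, n_classes=3, n_informative=6, random_state=0)

    param_grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": ["scale", 0.01, 0.001]}
    search = GridSearchCV(make_pipeline(StandardScaler(), SVC()),
                          param_grid, cv=10, scoring="f1_macro")
    search.fit(X, y)
    print(search.best_params_, round(search.best_score_, 3))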

  18. Stroke Risk Synthetic 2025

    • kaggle.com
    zip
    Updated Sep 26, 2025
    Cite
    Imaad Mahmood (2025). Stroke Risk Synthetic 2025 [Dataset]. https://www.kaggle.com/datasets/imaadmahmood/stroke-risk-synthetic-2025
    Available download formats: zip (2288 bytes)
    Dataset updated
    Sep 26, 2025
    Authors
    Imaad Mahmood
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    StrokeRiskSynthetic2025 Dataset

    Overview

    The StrokeRiskSynthetic2025 dataset is a synthetically generated dataset designed for machine learning and data analysis tasks focused on predicting stroke risk. Created in September 2025, it simulates realistic patient profiles based on established stroke risk factors, drawing inspiration from medical literature and existing healthcare datasets. With 1,000 records, it provides a deliberately imbalanced target (approximately 5% stroke cases) to reflect real-world stroke prevalence, making it ideal for binary classification, feature engineering, and handling imbalanced data in educational and research settings.

    Data Description

    • Rows: 1,000
    • Columns: 12
    • Target Variable: stroke (binary: 0 = No stroke, 1 = Stroke)
    • File Format: CSV
    • Size: Approximately 60 KB

    Columns

    | Column Name | Type | Description |
    |-------------|------|-------------|
    | id | Integer | Unique identifier for each record (1 to 1,000). |
    | gender | Categorical | Patient gender: Male, Female, Other. |
    | age | Integer | Patient age in years (0 to 100, skewed toward older adults). |
    | hypertension | Binary | Hypertension status: 0 = No, 1 = Yes (~30% prevalence). |
    | heart_disease | Binary | Heart disease status: 0 = No, 1 = Yes (~5-10% prevalence). |
    | ever_married | Categorical | Marital status: Yes, No (correlated with age). |
    | work_type | Categorical | Employment type: children, Govt_job, Never_worked, Private, Self-employed. |
    | Residence_type | Categorical | Residence: Urban, Rural (balanced distribution). |
    | avg_glucose_level | Float | Average blood glucose level in mg/dL (50 to 300, mean ~100). |
    | bmi | Float | Body Mass Index (10 to 60, mean ~25). |
    | smoking_status | Categorical | Smoking history: formerly smoked, never smoked, smokes, Unknown. |
    | stroke | Binary | Target variable: 0 = No stroke, 1 = Stroke (~5% positive cases). |

    Key Features

    • Realistic Distributions: Reflects real-world stroke risk factors (e.g., age, hypertension, glucose levels) based on 2025 medical data, with ~5% stroke prevalence to mimic imbalanced healthcare datasets.
    • Synthetic Data: Generated to avoid privacy concerns, ensuring ethical use for research and education.
    • Versatility: Suitable for binary classification, feature importance analysis (e.g., SHAP), data preprocessing (e.g., imputation, scaling), and handling imbalanced data (e.g., SMOTE).
    • No Missing Values: Clean dataset for straightforward analysis, though users can introduce missingness for preprocessing practice.

    Use Cases

    • Machine Learning: Train models like Logistic Regression, Random Forest, or XGBoost for stroke prediction.
    • Data Analysis: Explore correlations between risk factors (e.g., age, hypertension) and stroke outcomes.
    • Educational Projects: Ideal for learning EDA, feature engineering, and model deployment (e.g., Flask apps).
    • Healthcare Research: Simulate clinical scenarios for studying stroke risk without real patient data.

    Source and Inspiration

    This dataset is inspired by stroke risk factors outlined in medical literature (e.g., CDC, WHO) and existing datasets like the Kaggle Stroke Prediction Dataset (2021) and Mendeley’s Synthetic Stroke Prediction Dataset (2025). It incorporates 2025 trends in healthcare ML, such as handling imbalanced data and feature importance analysis.

    Usage Notes

    • Preprocessing: Numerical features (age, avg_glucose_level, bmi) may require scaling; categorical features (gender, work_type, etc.) need encoding (e.g., one-hot, label).
    • Imbalanced Data: The ~5% stroke prevalence requires techniques like SMOTE, oversampling, or class weighting for effective modeling.
    • Scalability: Contact the creator to generate larger datasets (e.g., 10,000+ rows) if needed.
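
    Putting the preprocessing and imbalance notes above together, one possible pipeline sketch; the column names come from the table above, the file name is assumed, and class_weight="balanced" is used as a simple alternative to SMOTE.

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    df = pd.read_csv("stroke_risk_synthetic_2025.csv")   # assumed file name
    num_cols = ["age", "avg_glucose_level", "bmi"]
    cat_cols = ["gender", "ever_married", "work_type", "Residence_type", "smoking_status"]

    pre = ColumnTransformer(
        [("num", StandardScaler(), num_cols),
         ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols)],
        remainder="passthrough",   # keeps hypertension / heart_disease as-is
    )
    model = Pipeline([
        ("pre", pre),
        ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
    ])
    model.fit(df.drop(columns=["id", "stroke"]), df["stroke"])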

    License

    This dataset is provided for educational and research purposes under a Creative Commons Attribution 4.0 International License (CC BY 4.0).

    Contact

    For questions or to request expanded datasets, contact the creator via the platform where this dataset is hosted.

  19. lungcancer

    • huggingface.co
    Updated Nov 17, 2025
    Cite
    Harsh Verma (2025). lungcancer [Dataset]. https://huggingface.co/datasets/Harsh6388/lungcancer
    Dataset updated
    Nov 17, 2025
    Authors
    Harsh Verma
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Synthetic Imbalanced Lung Cancer Dataset for Machine Learning

      Overview
    

    This is a synthetic (AI-generated) lung cancer prediction dataset created for academic and research purposes. The dataset contains an imbalanced distribution of cancer vs. non-cancer cases, reflecting real-world medical datasets.

      Key Features
    

    - Fully synthetic (no real patient data)
    - Suitable for ML model training & testing
    - Highly imbalanced target variable
    - Useful for classification, feature…

    See the full description on the dataset page: https://huggingface.co/datasets/Harsh6388/lungcancer.
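
    A minimal way to pull the data with the Hugging Face datasets library; the split name and the position of the target column are assumptions, since the card above does not state them.

    from datasets import load_dataset

    ds = load_dataset("Harsh6388/lungcancer", split="train")   # assumes the default "train" split
    df = ds.to_pandas()
    print(df.shape)
    print(df.iloc[:, -1].value_counts())   # assumes the target is the last column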

  20. CSIRO Sentinel-1 SAR image dataset of oil- and non-oil features for machine...

    • data.csiro.au
    • researchdata.edu.au
    Updated Dec 15, 2022
    Cite
    David Blondeau-Patissier; Thomas Schroeder; Foivos Diakogiannis; Zhibin Li (2022). CSIRO Sentinel-1 SAR image dataset of oil- and non-oil features for machine learning ( Deep Learning ) [Dataset]. http://doi.org/10.25919/4v55-dn16
    Dataset updated
    Dec 15, 2022
    Dataset provided by
    CSIRO (http://www.csiro.au/)
    Authors
    David Blondeau-Patissier; Thomas Schroeder; Foivos Diakogiannis; Zhibin Li
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Time period covered
    May 1, 2015 - Aug 31, 2022
    Dataset funded by
CSIRO (http://www.csiro.au/)
    ESA
    Description

    What this collection is: A curated, binary-classified image dataset of grayscale (1-band), 400 x 400-pixel image chips in JPEG format, extracted from processed Sentinel-1 Synthetic Aperture Radar (SAR) satellite scenes acquired over various regions of the world, and featuring clear open-ocean chips, look-alikes (wind or biogenic features) and oil slick chips.

    This binary dataset contains chips labelled as:
    - "0" for chips not containing any oil features (look-alikes or clean seas)
    - "1" for those containing oil features.

    This binary dataset is imbalanced, and biased towards "0" labelled chips (i.e., no oil features), which correspond to 66% of the dataset. Chips containing oil features, labelled "1", correspond to 34% of the dataset.

    Why: This dataset can be used for training, validation and/or testing of machine learning, including deep learning, algorithms for the detection of oil features in SAR imagery. Directly applicable for algorithm development for the European Space Agency Sentinel-1 SAR mission (https://sentinel.esa.int/web/sentinel/missions/sentinel-1 ), it may be suitable for the development of detection algorithms for other SAR satellite sensors.

    Overview of this dataset: the total number of chips (both classes) is N = 5,630, of which 3,725 are labelled "0" and 1,905 are labelled "1".

    Further information and description is found in the ReadMe file provided (ReadMe_Sentinel1_SAR_OilNoOil_20221215.txt)
