100+ datasets found
  1. Imbalanced Data

    • ieee-dataport.org
    Updated Aug 23, 2023
    Cite
    Blessa Binolin M (2023). Imbalanced Data [Dataset]. https://ieee-dataport.org/documents/imbalanced-data-0
    Explore at:
    Dataset updated
    Aug 23, 2023
    Authors
    Blessa Binolin M
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Classification learning on non-stationary data may face dynamic changes from time to time. The major problems are class imbalance and the high cost of labeling instances despite drifts. Imbalance arises when the minority class has fewer samples than the majority class. Imbalanced data results in the misclassification of data points.

  2. Performance comparison of machine learning models across accuracy, AUC, MCC,...

    • plos.figshare.com
    xls
    Updated Dec 31, 2024
    Cite
    Seongil Han; Haemin Jung (2024). Performance comparison of machine learning models across accuracy, AUC, MCC, and F1 score on GMSC dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0316454.t005
    Explore at:
    Available download formats: xls
    Dataset updated
    Dec 31, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Seongil Han; Haemin Jung
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Performance comparison of machine learning models across accuracy, AUC, MCC, and F1 score on GMSC dataset.

  3. Tackling Class Imbalance with Ranking - Dataset - CKAN

    • rdm.inesctec.pt
    Updated Feb 20, 2017
    Cite
    (2017). Tackling Class Imbalance with Ranking - Dataset - CKAN [Dataset]. https://rdm.inesctec.pt/dataset/nis-2017-002
    Explore at:
    Dataset updated
    Feb 20, 2017
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The dataset comes originally from UCI Machine Learning. The multiclass datasets were transformed into binary classification problems as mentioned in the paper. Ranking methods were applied to improve class imbalance. The datasets are divided into 30 folds so that other class imbalance methods can be compared to the methods in the paper. The code used in the paper is also provided.

  4. Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in...

    • frontiersin.figshare.com
    docx
    Updated Jun 1, 2023
    Cite
    Giulia Varotto; Gianluca Susi; Laura Tassi; Francesca Gozzo; Silvana Franceschetti; Ferruccio Panzica (2023). Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in Machine Learning: Application to Epileptogenic Zone Localization From Interictal Intracranial EEG Recordings in Patients With Focal Epilepsy.DOCX [Dataset]. http://doi.org/10.3389/fninf.2021.715421.s002
    Explore at:
    Available download formats: docx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers
    Authors
    Giulia Varotto; Gianluca Susi; Laura Tassi; Francesca Gozzo; Silvana Franceschetti; Ferruccio Panzica
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Aim: In neuroscience research, data are quite often characterized by an imbalanced distribution between the majority and minority classes, an issue that can limit or even worsen the prediction performance of machine learning methods. Different resampling procedures have been developed to face this problem and a lot of work has been done in comparing their effectiveness in different scenarios. Notably, the robustness of such techniques has been tested among a wide variety of different datasets, without considering the performance of each specific dataset. In this study, we compare the performances of different resampling procedures for the imbalanced domain in stereo-electroencephalography (SEEG) recordings of the patients with focal epilepsies who underwent surgery.

    Methods: We considered data obtained by network analysis of interictal SEEG recorded from 10 patients with drug-resistant focal epilepsies, for a supervised classification problem aimed at distinguishing between the epileptogenic and non-epileptogenic brain regions in interictal conditions. We investigated the effectiveness of five oversampling and five undersampling procedures, using 10 different machine learning classifiers. Moreover, six specific ensemble methods for the imbalanced domain were also tested. To compare the performances, Area under the ROC curve (AUC), F-measure, Geometric Mean, and Balanced Accuracy were considered.

    Results: Both the resampling procedures showed improved performances with respect to the original dataset. The oversampling procedure was found to be more sensitive to the type of classification method employed, with Adaptive Synthetic Sampling (ADASYN) exhibiting the best performances. All the undersampling approaches were more robust than the oversampling among the different classifiers, with Random Undersampling (RUS) exhibiting the best performance despite being the simplest and most basic classification method.

    Conclusions: The application of machine learning techniques that take into consideration the balance of features by resampling is beneficial and leads to more accurate localization of the epileptogenic zone from interictal periods. In addition, our results highlight the importance of the type of classification method that must be used together with the resampling to maximize the benefit to the outcome.
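
    Random Undersampling, named in the abstract above, can be illustrated in a few lines. The following is a generic sketch of the technique, not the authors' implementation; the function name `random_undersample`, the `seed` parameter, and the toy data are assumptions made for illustration.

```python
import random
from collections import defaultdict

def random_undersample(samples, labels, seed=0):
    """Randomly drop samples until every class is as small as the rarest one."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    # Every class is cut down to the size of the smallest class.
    target = min(len(xs) for xs in by_class.values())
    resampled = []
    for y, xs in by_class.items():
        for x in rng.sample(xs, target):  # sampling without replacement
            resampled.append((x, y))
    rng.shuffle(resampled)
    return resampled

# Toy data (hypothetical): 10 minority samples, 90 majority samples.
toy_samples = list(range(100))
toy_labels = [1] * 10 + [0] * 90
balanced_pairs = random_undersample(toy_samples, toy_labels)
print(len(balanced_pairs))
```

    The method discards majority-class samples, so it trades information for balance; resampling should be applied to the training split only.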

  5. Dataset for Class Imbalance Classification Problem

    • kaggle.com
    Updated Jan 26, 2021
    Cite
    Akalya Subramanian (2021). Dataset for Class Imbalance Classification Problem [Dataset]. https://www.kaggle.com/akalyasubramanian/dataset-for-class-imbalance-classification-problem/activity
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 26, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Akalya Subramanian
    Description

    Dataset

    This dataset was created by Akalya Subramanian

    Contents

  6. MMR maternal deaths and LTR MMEIG-trends from 2000-2017 from Improving...

    • rs.figshare.com
    bin
    Updated Aug 23, 2023
    + more versions
    Cite
    Sharon Awuor Okello; Evans Otieno Omondi; Collins O. Odhiambo (2023). MMR maternal deaths and LTR MMEIG-trends from 2000-2017 from Improving performance of hurdle models using rare-event weighted logistic regression: an application to maternal mortality data [Dataset]. http://doi.org/10.6084/m9.figshare.23974115.v1
    Explore at:
    Available download formats: bin
    Dataset updated
    Aug 23, 2023
    Dataset provided by
    The Royal Society
    Authors
    Sharon Awuor Okello; Evans Otieno Omondi; Collins O. Odhiambo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In this paper, performance of hurdle models in rare events data is improved by modifying their binary component. Rare-event weighted logistic regression model is adopted in place of logistic regression to deal with class imbalance due to rare events. Poisson Hurdle Rare Event Weighted Logistic Regression (REWLR) and Negative Binomial Hurdle (NBH) REWLR are developed as two-part models which use the REWLR model to estimate the probability of a positive count and a Poisson or NB zero-truncated count model to estimate non-zero counts. This research aimed to develop and assess the performance of the Poisson and Negative Binomial (NB) Hurdle Rare Event Weighted Logistic Regression (REWLR) models, applied to simulated data with various degrees of zero inflation and to Nairobi county’s maternal mortality data. The study data on maternal mortality were pulled from JPHES. The data contain the number of maternal deaths, which is the outcome variable, and other obstetric and demographic factors recorded in MNCH facilities in Nairobi between October 2021 and January 2022. The models were also fit and evaluated based on simulated data with varying degrees of zero inflation. The obtained results are numerically validated and then discussed from both the mathematical and the maternal mortality perspective. Numerical simulations are also presented to give a more complete representation of the model dynamics. Results obtained suggest that NB Hurdle REWLR is the best performing model for zero inflated count data due to rare events.

  7. Data from: Effective Feature Selection Method for Class-Imbalance Datasets...

    • acs.figshare.com
    xlsx
    Updated Jun 1, 2023
    Cite
    Aurelio Antelo-Collado; Ramón Carrasco-Velar; Nicolás García-Pedrajas; Gonzalo Cerruela-García (2023). Effective Feature Selection Method for Class-Imbalance Datasets Applied to Chemical Toxicity Prediction [Dataset]. http://doi.org/10.1021/acs.jcim.0c00908.s002
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    ACS Publications
    Authors
    Aurelio Antelo-Collado; Ramón Carrasco-Velar; Nicolás García-Pedrajas; Gonzalo Cerruela-García
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    During the drug development process, it is common to carry out toxicity tests and adverse effect studies, which are essential to guarantee patient safety and the success of the research. The use of in silico quantitative structure–activity relationship (QSAR) approaches for this task involves processing a huge amount of data that, in many cases, have an imbalanced distribution of active and inactive samples. This is usually termed the class-imbalance problem and may have a significant negative effect on the performance of the learned models. The performance of feature selection (FS) for QSAR models is usually damaged by the class-imbalance nature of the involved datasets. This paper proposes the use of an FS method focused on dealing with the class-imbalance problems. The method is based on the use of FS ensembles constructed by boosting and using two well-known FS methods, fast clustering-based FS and the fast correlation-based filter. The experimental results demonstrate the efficiency of the proposal in terms of the classification performance compared to standard methods. The proposal can be extended to other FS methods and applied to other problems in cheminformatics.

  8. Dataset: The effects of class balance on the training energy consumption of...

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Mar 18, 2024
    Cite
    Maria Gutierrez; Coral Calero; Félix García; Mª Ángeles Moraga (2024). Dataset: The effects of class balance on the training energy consumption of logistic regression models [Dataset]. http://doi.org/10.5281/zenodo.10823624
    Explore at:
    Available download formats: csv
    Dataset updated
    Mar 18, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Maria Gutierrez; Coral Calero; Félix García; Mª Ángeles Moraga
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2024
    Description

    Two synthetic datasets for binary classification, generated with the Random Radial Basis Function generator from WEKA. They are the same shape and size (104.952 instances, 185 attributes), but the "balanced" dataset has 52,13% of its instances belonging to class c0, while the "unbalanced" one only has 4,04% of its instances belonging to class c0. Therefore, this set of datasets is primarily meant to study how class balance influences the behaviour of a machine learning model.
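
    To make the two percentages concrete, the sketch below converts them into approximate instance counts and a majority-to-minority ratio. It assumes that "104.952" uses a period as a thousands separator (104,952 instances) and that the second class holds all remaining instances; the helper names are hypothetical.

```python
# Approximate class counts implied by the percentages quoted above.
# Assumption: "104.952" means 104,952 instances (period as thousands separator).
TOTAL_INSTANCES = 104_952

def class_counts(total: int, share_c0_percent: float) -> tuple[int, int]:
    """Return (c0_count, other_count), rounding c0 to the nearest instance."""
    c0 = round(total * share_c0_percent / 100)
    return c0, total - c0

def imbalance_ratio(counts: tuple[int, int]) -> float:
    """Majority-to-minority ratio; 1.0 means perfectly balanced."""
    return max(counts) / min(counts)

balanced = class_counts(TOTAL_INSTANCES, 52.13)
unbalanced = class_counts(TOTAL_INSTANCES, 4.04)
print(balanced, round(imbalance_ratio(balanced), 2))
print(unbalanced, round(imbalance_ratio(unbalanced), 2))
```

    The counts are estimates derived from rounded percentages, so they may differ by a few instances from the published files.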

  9. HDSNE Chest X-ray Dataset Dataset

    • paperswithcode.com
    Updated Feb 25, 2025
    + more versions
    Cite
    (2025). HDSNE Chest X-ray Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/hdsne-chest-x-ray-dataset
    Explore at:
    Dataset updated
    Feb 25, 2025
    Description

    The continuous release of medical image databases, often featuring overlapping or identical categories, poses a significant challenge for the development of autonomous Computer-Aided Diagnostics (CAD) systems. These systems are essential for creating truly comprehensive medical diagnostics. However, one of the main obstacles lies in the frequent bulk release of datasets, which commonly suffer from two critical issues: image duplication and data corruption.

    The Problem of Dataset Redundancy

    Repeated releases of the same categories often fail to integrate or deduplicate similar images across databases, which can severely impact the effectiveness of machine learning models. Data duplication not only reduces the efficiency of learning models but also leads to overfitting, wastes computational resources, and increases the carbon footprint due to the energy required for training complex models.

    Proposed Solution: Global Data Aggregation Model

    In response to these challenges, we introduce a global data aggregation model that intelligently combines data from six distinct and reputable medical imaging databases. Each database was carefully curated to ensure the elimination of redundancies while preserving data diversity. Two robust algorithms were employed:

    Hash MD5 Algorithm: This algorithm generates unique hash values for each image, helping in the effective detection and elimination of duplicate images.

    t-SNE Algorithm: This technique is used for dimensionality reduction, with a tunable perplexity parameter to ensure accurate representation of high-dimensional data.
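
    The hash-based deduplication step can be sketched as follows. This is a minimal illustration using Python's standard `hashlib`, not the dataset authors' pipeline; the function name `dedupe_by_md5` and the in-memory toy files are assumptions. It only catches byte-identical files, so resized or re-encoded copies of the same image would not be detected.

```python
import hashlib

def dedupe_by_md5(images: dict[str, bytes]) -> list[str]:
    """Return the names of images to keep, dropping byte-identical repeats."""
    seen_digests = set()
    kept = []
    for name, data in images.items():
        digest = hashlib.md5(data).hexdigest()
        if digest not in seen_digests:  # first time this exact content appears
            seen_digests.add(digest)
            kept.append(name)
    return kept

# Toy example (hypothetical file names and contents).
toy_images = {
    "db1/normal_001.png": b"pixels-A",
    "db2/normal_917.png": b"pixels-A",  # same bytes as the first file
    "db2/covid_004.png": b"pixels-B",
}
kept_names = dedupe_by_md5(toy_images)
print(kept_names)
```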

    Dataset Categories

    The final dataset includes an equal number of samples from three key categories of chest X-ray images:

    • Normal
    • Pneumonia
    • COVID-19

    This uniform distribution ensures that the dataset is balanced, avoiding class imbalance—a common issue that can skew results in medical image analysis.

    Dataset Application & Model Evaluation

    The dataset was applied to the Inception V3 pre-trained model, a leading convolutional neural network (CNN) architecture known for its excellence in image classification tasks. The evaluation was conducted using the following performance metrics:

    Accuracy: An exceptional accuracy rate of 98.48% was achieved.

    Precision, Recall, and F1-score: The dataset showed strong performance across these metrics, reducing both false positives and false negatives.

    Statistical Validation: A t-test was conducted to validate the results, and the t-values and p-values confirm the statistical significance of the model’s performance.

    Conclusion

    The HDSNE Chest X-ray Dataset offers a novel and effective approach to data aggregation, tackling the issues of redundancy and data duplication that have long plagued the field of medical imaging. By maintaining a balanced class distribution and eliminating unnecessary data, this dataset provides a cleaner and more efficient resource for training machine learning models.

    This dataset is sourced from Kaggle.

  10. UVP5 data sorted with EcoTaxa and MorphoCluster

    • seanoe.org
    image/*
    Updated 2020
    Cite
    Rainer Kiko; Simon-Martin Schröder (2020). UVP5 data sorted with EcoTaxa and MorphoCluster [Dataset]. http://doi.org/10.17882/73002
    Explore at:
    Available download formats: image/*
    Dataset updated
    2020
    Dataset provided by
    SEANOE
    Authors
    Rainer Kiko; Simon-Martin Schröder
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Time period covered
    Oct 23, 2012 - Aug 7, 2017
    Area covered
    Description

    Here, we provide plankton image data that was sorted with the web applications EcoTaxa and MorphoCluster. The data set was used for image classification tasks as described in Schröder et al. (in preparation) and does not include any geospatial or temporal meta-data.

    Plankton was imaged using the Underwater Vision Profiler 5 (Picheral et al. 2010) in various regions of the world's oceans between 2012-10-24 and 2017-08-08.

    This data publication consists of an archive containing "training.csv" (list of 392k training images for classification, validated using EcoTaxa), "validation.csv" (list of 196k validation images for classification, validated using EcoTaxa), "unlabeld.csv" (list of 1M unlabeled images), "morphocluster.csv" (1.2M objects validated using MorphoCluster, a subset of "unlabeled.csv" and "validation.csv") and the image files themselves. The CSV files each contain the columns "object_id" (a unique ID), "image_fn" (the relative filename), and "label" (the assigned name).

    The training and validation sets were sorted into 65 classes using the web application EcoTaxa (http://ecotaxa.obs-vlfr.fr). This data shows a severe class imbalance; the 10% most populated classes contain more than 80% of the objects and the class sizes span four orders of magnitude. The validation set and a set of additional 1M unlabeled images were sorted during the first trial of MorphoCluster (https://github.com/morphocluster).

    The images in this data set were sampled during RV Meteor cruises M92, M93, M96, M97, M98, M105, M106, M107, M108, M116, M119, M121, M130, M131, M135, M136, M137 and M138, during RV Maria S. Merian cruises MSM22, MSM23, MSM40 and MSM49, during the RV Polarstern cruise PS88b and during the FLUXES1 experiment with RV Sarmiento de Gamboa.

    The following people have contributed to the sorting of the image data on EcoTaxa: Rainer Kiko, Tristan Biard, Benjamin Blanc, Svenja Christiansen, Justine Courboules, Charlotte Eich, Jannik Faustmann, Christine Gawinski, Augustin Lafond, Aakash Panchal, Marc Picheral, Akanksha Singh and Helena Hauss.

    In Schröder et al. (in preparation), the training set serves as a source for knowledge transfer in the training of the feature extractor. The classification using MorphoCluster was conducted by Rainer Kiko. Used labels are operational and not yet matched to respective EcoTaxa classes.
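
    The two imbalance statistics quoted in this description (the share of objects held by the 10% most populated classes, and the span of class sizes in orders of magnitude) can be computed from per-class counts. The sketch below uses made-up counts, not the real UVP5 labels; the function names, and the choice to round the number of "top" classes upward, are assumptions.

```python
import math

def head_share(class_counts: dict[str, int], percent: int = 10) -> float:
    """Fraction of all objects that fall in the top `percent`% largest classes."""
    sizes = sorted(class_counts.values(), reverse=True)
    # Ceiling division in integer arithmetic; at least one class is kept.
    k = max(1, -(-len(sizes) * percent // 100))
    return sum(sizes[:k]) / sum(sizes)

def size_span_orders(class_counts: dict[str, int]) -> float:
    """Orders of magnitude between largest and smallest class (sizes must be > 0)."""
    sizes = class_counts.values()
    return math.log10(max(sizes) / min(sizes))

# Toy label counts (hypothetical class names and sizes).
toy_counts = {
    "detritus": 820, "copepod": 40, "fiber": 30, "bubble": 25, "shrimp": 20,
    "jelly": 20, "larvae": 15, "worm": 15, "egg": 10, "fish": 5,
}
print(head_share(toy_counts), size_span_orders(toy_counts))
```

    With the real archive, the counts would come from the "label" column of "training.csv" and "validation.csv".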

  11. Classification result classifiers using TF-IDF with SMOTE.

    • plos.figshare.com
    xls
    Updated May 28, 2024
    Cite
    Khaled Alnowaiser (2024). Classification result classifiers using TF-IDF with SMOTE. [Dataset]. http://doi.org/10.1371/journal.pone.0302304.t007
    Explore at:
    Available download formats: xls
    Dataset updated
    May 28, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Khaled Alnowaiser
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Classification result classifiers using TF-IDF with SMOTE.

  12. Credit scoring with class imbalance data: An out-of-sample and out-of-time...

    • zenodo.org
    Updated Oct 6, 2023
    Cite
    Jonah Mushava; Mike Murray (2023). Credit scoring with class imbalance data: An out-of-sample and out-of-time perspective [Dataset]. http://doi.org/10.5281/zenodo.8401978
    Explore at:
    Dataset updated
    Oct 6, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jonah Mushava; Mike Murray
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The raw datasets provided here are intended for use in a Data in Brief article. These comprehensive files, sourced from the Freddie Mac website, offer quarterly snapshots of mortgage loans that have been originated in the USA since 1999, along with details of their subsequent repayment behaviours. This data remains current and is updated every three months. Specifically, the loan origination data presented here encompasses amortized fixed-rate mortgage loans from 1999 up to June 2022. In contrast, the performance data is presented on a monthly basis, detailing loan repayment profiles from 1999 until September 30, 2022. Both the origination and performance datasets feature a unique loan ID, which can be utilized to integrate the data on loan originations with that of loan repayments.

  13. Data from: Addressing Imbalanced Classification Problems in Drug Discovery...

    • acs.figshare.com
    zip
    Updated Apr 15, 2025
    Cite
    Ayush Garg; Narayanan Ramamurthi; Shyam Sundar Das (2025). Addressing Imbalanced Classification Problems in Drug Discovery and Development Using Random Forest, Support Vector Machine, AutoGluon-Tabular, and H2O AutoML [Dataset]. http://doi.org/10.1021/acs.jcim.5c00023.s001
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 15, 2025
    Dataset provided by
    ACS Publications
    Authors
    Ayush Garg; Narayanan Ramamurthi; Shyam Sundar Das
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The classification models built on class imbalanced data sets tend to prioritize the accuracy of the majority class, and thus, the minority class generally has a higher misclassification rate. Different techniques are available to address the class imbalance in classification models and can be categorized as data-level, algorithm-level, and hybrid methods. But to the best of our knowledge, an in-depth analysis of the performance of these techniques against the class ratio is not available in the literature. We have addressed these shortcomings in this study and have performed a detailed analysis of the performance of four different techniques to address imbalanced class distribution using machine learning (ML) methods and AutoML tools. To carry out our study, we have selected four such techniques: (a) threshold optimization using (i) GHOST and (ii) the area under the precision–recall curve (AUPR) curve, (b) internal balancing method of AutoML and class-weight of machine learning methods, and (c) data balancing using SMOTETomek, and generated 27 data sets considering nine different class ratios (i.e., the ratio of the positive class and total samples) from three data sets that belong to the drug discovery and development field. We have employed random forest (RF) and support vector machine (SVM) as representatives of ML classifier and AutoGluon-Tabular (version 0.6.1) and H2O AutoML (version 3.40.0.4) as representatives of AutoML tools. The important findings of our studies are as follows: (i) there is no effect of threshold optimization on ranking metrics such as AUC and AUPR, but AUC and AUPR get affected by class-weighting and SMOTETomek; (ii) for ML methods RF and SVM, significant percentage improvement up to 375, 33.33, and 450 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy, which are suitable for performance evaluation of imbalanced data sets; (iii) for AutoML libraries AutoGluon-Tabular and H2O AutoML, significant percentage improvement up to 383.33, 37.25, and 533.33 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy; (iv) the general pattern of percentage improvement in balanced accuracy is that the percentage improvement increases when the class ratio is systematically decreased from 0.5 to 0.1; in the case of F1 score and MCC, maximum improvement is achieved at the class ratio of 0.3; (v) for both ML and AutoML with balancing, it is observed that any individual class-balancing technique does not outperform all other methods on a significantly higher number of data sets based on F1 score; (vi) the three external balancing techniques combined outperformed the internal balancing methods of the ML and AutoML; (vii) AutoML tools perform as good as the ML models and in some cases perform even better for handling imbalanced classification when applied with imbalance handling techniques. In summary, exploration of multiple data balancing techniques is recommended for classifying imbalanced data sets to achieve optimal performance as neither of the external techniques nor the internal techniques outperform others significantly. The results are specific to the ML methods and AutoML libraries used in this study, and for generalization, a study can be carried out considering a sizable number of ML methods and AutoML libraries.
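
    The three evaluation metrics highlighted in this abstract (F1 score, MCC, and balanced accuracy) follow from the standard confusion-matrix formulas. The sketch below applies those textbook definitions to a made-up confusion matrix; it is not code from the study, and the function name `imbalance_metrics` and the toy counts are assumptions. It expects both classes to be present in the data.

```python
import math

def imbalance_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Accuracy, F1, MCC, and balanced accuracy from binary confusion counts."""
    total = tp + fp + fn + tn
    recall = tp / (tp + fn)        # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "f1": 2 * tp / (2 * tp + fp + fn),
        # MCC is conventionally set to 0 when its denominator vanishes.
        "mcc": (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0,
        "balanced_accuracy": (recall + specificity) / 2,
    }

# Toy confusion matrix (hypothetical): 10 positives among 1,000 samples.
toy_metrics = imbalance_metrics(tp=5, fp=5, fn=5, tn=985)
for name, value in toy_metrics.items():
    print(f"{name}: {value:.3f}")
```

    In the toy case, accuracy stays high even though only half of the positives are found, while the other three metrics reflect the weak minority-class performance.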

  14. Credit Card Fraud Detection Dataset

    • kaggle.com
    Updated May 15, 2025
    Cite
    Ghanshyam Saini (2025). Credit Card Fraud Detection Dataset [Dataset]. https://www.kaggle.com/datasets/ghnshymsaini/credit-card-fraud-detection-dataset
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 15, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ghanshyam Saini
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Credit Card Fraud Detection Dataset (European Cardholders, September 2013)

    As a data contributor, I'm sharing this crucial dataset focused on the detection of fraudulent credit card transactions. Recognizing these illicit activities is paramount for protecting customers and the integrity of financial systems.

    About the Dataset:

    This dataset encompasses credit card transactions made by European cardholders during a two-day period in September 2013. It presents a real-world scenario with a significant class imbalance, where fraudulent transactions are considerably less frequent than legitimate ones. Out of a total of 284,807 transactions, only 492 are instances of fraud, representing a mere 0.172% of the entire dataset.

    Content of the Data:

    Due to confidentiality concerns, the majority of the input features in this dataset have undergone a Principal Component Analysis (PCA) transformation. This means the original meaning and context of features V1, V2, ..., V28 are not directly provided. However, these principal components capture the variance in the underlying transaction data.

    The only features that have not been transformed by PCA are:

    • Time: Numerical. Represents the number of seconds elapsed between each transaction and the very first transaction recorded in the dataset.
    • Amount: Numerical. The transaction amount in Euros (€). This feature could be valuable for cost-sensitive learning approaches.

    The target variable for this classification task is:

    • Class: Integer. Takes the value 1 in the case of a fraudulent transaction and 0 otherwise.

    Important Note on Evaluation:

    Given the substantial class imbalance (far more legitimate transactions than fraudulent ones), traditional accuracy metrics based on the confusion matrix can be misleading. It is strongly recommended to evaluate models using the Area Under the Precision-Recall Curve (AUPRC), as this metric is more sensitive to the performance on the minority class (fraudulent transactions).
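
    A short calculation shows why plain accuracy is misleading here. The sketch uses only the two counts stated above (284,807 transactions, 492 frauds) and a hypothetical baseline that labels every transaction as legitimate; the variable names are illustrative.

```python
# Counts taken from the dataset description above.
TOTAL_TRANSACTIONS = 284_807
FRAUD_TRANSACTIONS = 492

# Share of transactions that are fraudulent.
fraud_rate = FRAUD_TRANSACTIONS / TOTAL_TRANSACTIONS

# Hypothetical baseline: predict "legitimate" (Class = 0) for every transaction.
baseline_accuracy = (TOTAL_TRANSACTIONS - FRAUD_TRANSACTIONS) / TOTAL_TRANSACTIONS
baseline_recall = 0 / FRAUD_TRANSACTIONS  # no fraud case is ever flagged

print(f"fraud rate:        {fraud_rate:.4%}")
print(f"baseline accuracy: {baseline_accuracy:.4%}")
print(f"baseline recall:   {baseline_recall:.0%}")
```

    A model that never detects fraud would still score above 99.8% accuracy, which is why a metric focused on the minority class is recommended.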

    How to Use This Dataset:

    1. Download the dataset file (likely in CSV format).
    2. Load the data using libraries like Pandas.
    3. Understand the class imbalance: Be aware that fraudulent transactions are rare.
    4. Explore the features: Analyze the distributions of 'Time', 'Amount', and the PCA-transformed features (V1-V28).
    5. Address the class imbalance: Consider using techniques like oversampling the minority class, undersampling the majority class, or using specialized algorithms designed for imbalanced datasets.
    6. Build and train binary classification models to predict the 'Class' variable.
    7. Evaluate your models using AUPRC to get a meaningful assessment of performance in detecting fraud.
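
    The evaluation step (step 7) can be sketched without any external library. The function below computes average precision, a common step-wise estimate of the area under the precision-recall curve, from true labels and model scores. The function name and toy scores are assumptions, and the sketch assumes no tied scores; in practice an established implementation from a machine-learning library would normally be used instead.

```python
def average_precision(labels: list[int], scores: list[float]) -> float:
    """Mean of the precision values measured at the rank of each positive."""
    # Rank transactions from most to least suspicious.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:  # Class = 1 marks a fraudulent transaction
            hits += 1
            precisions.append(hits / rank)
    if not precisions:
        raise ValueError("need at least one positive label")
    return sum(precisions) / len(precisions)

# Toy example (hypothetical scores): two frauds among four transactions.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
print(average_precision(toy_labels, toy_scores))
```

    Unlike accuracy, this score drops whenever a legitimate transaction is ranked above a fraudulent one, so it stays informative when positives are rare.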

    Acknowledgements and Citation:

    This dataset has been collected and analyzed through a research collaboration between Worldline and the Machine Learning Group (MLG) of ULB (Université Libre de Bruxelles).

    When using this dataset in your research or projects, please cite the following works as appropriate:

    • Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015.
    • Dal Pozzolo, Andrea; Caelen, Olivier; Le Borgne, Yann-Ael; Waterschoot, Serge; Bontempi, Gianluca. Learned lessons in credit card fraud detection from a practitioner perspective, Expert systems with applications,41,10,4915-4928,2014, Pergamon.
    • Dal Pozzolo, Andrea; Boracchi, Giacomo; Caelen, Olivier; Alippi, Cesare; Bontempi, Gianluca. Credit card fraud detection: a realistic modeling and a novel learning strategy, IEEE transactions on neural networks and learning systems,29,8,3784-3797,2018,IEEE.
    • Andrea Dal Pozzolo. Adaptive Machine learning for credit card fraud detection ULB MLG PhD thesis (supervised by G. Bontempi).
    • Fabrizio Carcillo, Andrea Dal Pozzolo, Yann-Aël Le Borgne, Olivier Caelen, Yannis Mazzer, Gianluca Bontempi. Scarff: a scalable framework for streaming credit card fraud detection with Spark, Information fusion,41, 182-194,2018,Elsevier.
    • Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen, Gianluca Bontempi. Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization, International Journal of Data Science and Analytics, 5,4,285-300,2018,Springer International Publishing.
    • Bertrand Lebichot, Yann-Aël Le Borgne, Liyun He, Frederic Oblé, Gianluca Bontempi Deep-Learning Domain Adaptation Techniques for Credit Cards Fraud Detection, INNSBDDL 2019: Recent Advances in Big Data and Deep Learning, pp 78-88, 2019.
    • Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen, Frederic Oblé, Gianluca Bontempi *Combining Unsupervised and Supervised...
  15. Data from: Financial Fraud Detection Dataset

    • opendatabay.com
    Updated Jun 25, 2025
    Review Nexus (2025). Financial Fraud Detection Dataset [Dataset]. https://www.opendatabay.com/data/financial/d226c56e-5929-4059-a30d-13632e07b344
    Dataset updated
    Jun 25, 2025
    Dataset authored and provided by
    Review Nexus
    License

    Open Database License (ODbL) v1.0 https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Area covered
    Fraud Detection & Risk Management
    Description

    This dataset is designed to support research and model development in the area of fraud detection. It consists of real-world credit card transactions made by European cardholders over a two-day period in September 2013. Out of 284,807 transactions, 492 are labeled as fraudulent (positive class), making this a highly imbalanced classification problem.

    Performance Note:

    Due to the extreme class imbalance, standard accuracy metrics are not informative. We recommend using the Area Under the Precision-Recall Curve (AUPRC) or F1-score for model evaluation.
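
As a rough illustration of the two recommended metrics, the sketch below computes a step-wise AUPRC (average precision) and an F1-score in plain Python. The function names, the toy labels, the scores, and the 0.5 threshold are assumptions made for this example and are not part of the dataset; ties between scores are broken by input order, which a library implementation may handle differently.

```python
def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve (average precision)."""
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank items from most to least suspicious (ties keep input order).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    true_pos = 0
    ap = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            true_pos += 1
            ap += true_pos / rank  # precision at the point recall increases
    return ap / total_pos


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data (hypothetical): 2 frauds among 10 transactions.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.90, 0.80, 0.60]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

ap = average_precision(y_true, scores)
f1 = f1_score(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"AUPRC={ap:.3f}  F1={f1:.3f}  accuracy={accuracy:.3f}")
```

On the toy data, accuracy looks high even though one of the three flagged transactions is a false alarm; the two recommended metrics reflect that error more directly.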

    Features:

    • Time Series Data: Each row represents a transaction, with the Time feature indicating the number of seconds elapsed since the first transaction.
    • Dimensionality Reduction Applied: Features V1 through V28 are anonymized principal components derived from a PCA transformation due to confidentiality constraints.
    • Raw Transaction Amount: The Amount field reflects the transaction value, useful for cost-sensitive modeling.
    • Binary Classification Target: The Class label is 1 for fraud and 0 for legitimate transactions.

    Usage:

    • Machine learning model training for fraud detection.
    • Evaluation of anomaly detection and imbalanced classification methods.
    • Development of cost-sensitive learning approaches using the Amount variable.
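
One way to read the cost-sensitive use of Amount is sketched below: a missed fraud costs the transaction amount, and every flagged transaction incurs a fixed review cost. The cost model, the `review_cost` value, and the sample rows are assumptions for illustration only; the dataset description does not prescribe a cost function.

```python
def total_cost(y_true, y_pred, amounts, review_cost=5.0):
    """Sum a simple, hypothetical cost over all transactions."""
    cost = 0.0
    for truth, pred, amount in zip(y_true, y_pred, amounts):
        if pred == 1:
            cost += review_cost  # any flagged transaction is reviewed
        elif truth == 1:
            cost += amount       # missed fraud: the amount is lost
    return cost


# Hypothetical rows: Class labels, model decisions, and Amount values.
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 1]
amounts = [20.0, 150.0, 900.0, 35.0]

print(total_cost(y_true, y_pred, amounts))
```

Under this assumed model, missing one large fraud outweighs several unnecessary reviews, which is the kind of trade-off a plain error count hides.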

    Data Summary:

    • Total Records: 284,807
    • Fraud Cases: 492
    • Imbalance Ratio: Fraudulent transactions account for just 0.172% of the dataset.
    • Columns: 31 total (28 PCA features, plus Time, Amount, and Class)
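
The figures in this summary can be checked with a few lines of arithmetic. The sketch below uses only the two counts quoted above; the variable names are chosen for this example. It also shows the accuracy of a trivial classifier that labels every transaction as legitimate, which is why accuracy says little on this data.

```python
TOTAL_RECORDS = 284_807
FRAUD_CASES = 492

legit_cases = TOTAL_RECORDS - FRAUD_CASES
fraud_share = FRAUD_CASES / TOTAL_RECORDS       # fraction of fraudulent rows
legit_per_fraud = legit_cases / FRAUD_CASES     # legitimate rows per fraud
baseline_accuracy = legit_cases / TOTAL_RECORDS  # always predict Class = 0

print(f"fraud share:        {fraud_share:.4%}")
print(f"legit per fraud:    {legit_per_fraud:.1f}")
print(f"all-legit accuracy: {baseline_accuracy:.4%}")
```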

    License:

    The dataset is provided under the CC0 (Public Domain) license, allowing users to freely use, modify, and distribute the data without any restrictions.

    Acknowledgements

    The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available on https://www.researchgate.net/project/Fraud-detection-5 and the page of the DefeatFraud project.

    Please cite the following works:

    Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015

    Dal Pozzolo, Andrea; Caelen, Olivier; Le Borgne, Yann-Ael; Waterschoot, Serge; Bontempi, Gianluca. Learned lessons in credit card fraud detection from a practitioner perspective, Expert systems with applications,41,10,4915-4928,2014, Pergamon

    Dal Pozzolo, Andrea; Boracchi, Giacomo; Caelen, Olivier; Alippi, Cesare; Bontempi, Gianluca. Credit card fraud detection: a realistic modeling and a novel learning strategy, IEEE transactions on neural networks and learning systems,29,8,3784-3797,2018,IEEE

    Dal Pozzolo, Andrea. Adaptive Machine learning for credit card fraud detection, ULB MLG PhD thesis (supervised by G. Bontempi)

    Carcillo, Fabrizio; Dal Pozzolo, Andrea; Le Borgne, Yann-Aël; Caelen, Olivier; Mazzer, Yannis; Bontempi, Gianluca. Scarff: a scalable framework for streaming credit card fraud detection with Spark, Information fusion,41, 182-194,2018,Elsevier

    Carcillo, Fabrizio; Le Borgne, Yann-Aël; Caelen, Olivier; Bontempi, Gianluca. Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization, International Journal of Data Science and Analytics, 5,4,285-300,2018,Springer International Publishing

    Bertrand Lebichot, Yann-Aël Le Borgne, Liyun He, Frederic Oblé, Gianluca Bontempi. Deep-Learning Domain Adaptation Techniques for Credit Cards Fraud Detection, INNSBDDL 2019: Recent Advances in Big Data and Deep Learning, pp 78-88, 2019

    Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen, Frederic Oblé, Gianluca Bontempi. Combining Unsupervised and Supervised Learning in Credit Card Fraud Detection, Information Sciences, 2019

    Yann-Aël Le Borgne, Gianluca Bontempi. Reproducible machine Learning for Credit Card Fraud Detection - Practical Handbook

    Bertrand Lebichot, Gianmarco Paldino, Wissam Siblini, Liyun He, Frederic Oblé, Gianluca Bontempi. Incremental learning strategies for credit cards fraud detection, International Journal of Data Science and Analytics

  16. Class Balance 1 Dataset

    • universe.roboflow.com
    zip
    Updated Dec 15, 2021
    FPT (2021). Class Balance 1 Dataset [Dataset]. https://universe.roboflow.com/fpt/class-balance-1
    Dataset updated
    Dec 15, 2021
    Dataset authored and provided by
    FPT
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Mask Bounding Boxes
    Description

    Class Balance 1

    ## Overview

    Class Balance 1 is a dataset for object detection tasks. It contains Mask annotations for 500 images.

    ## Getting Started

    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

    ## License

    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  17. Datasets for: A Study on Machine Vision Techniques...

    • zenodo.org
    Updated Jan 21, 2020
    Alice Stazio; Juan G Victores; David Estevez; Carlos Balaguer (2020). Datasets for: A Study on Machine Vision Techniques... [Dataset]. http://doi.org/10.5281/zenodo.3251898
    Dataset updated
    Jan 21, 2020
    Dataset provided by
    Zenodo http://zenodo.org/
    Authors
    Alice Stazio; Juan G Victores; David Estevez; Carlos Balaguer
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Datasets for: A Study on Machine Vision Techniques...

  18. Additional file 2 of Hellinger distance-based stable sparse feature...

    • springernature.figshare.com
    zip
    Updated Jun 6, 2023
    Guang-Hui Fu; Yuan-Jiao Wu; Min-Jie Zong; Jianxin Pan (2023). Additional file 2 of Hellinger distance-based stable sparse feature selection for high-dimensional class-imbalanced data [Dataset]. http://doi.org/10.6084/m9.figshare.12023622.v1
    Dataset updated
    Jun 6, 2023
    Dataset provided by
    Figshare http://figshare.com/
    Authors
    Guang-Hui Fu; Yuan-Jiao Wu; Min-Jie Zong; Jianxin Pan
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 2. Five datasets used in this study are given as a .txt file.

  19. Phishing URL Classifier Dataset

    • opendatabay.com
    Updated Jul 3, 2025
    Datasimple (2025). Phishing URL Classifier Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/705b35a9-e638-462d-a5e1-d9f70ff4234a
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Website Analytics & User Experience
    Description

    This dataset is a curated collection of over 800,000 URLs, designed to represent a variety of online domains. Approximately 52% of these domains are identified as legitimate entities, while the remaining 47% are categorised as phishing domains, indicating potential online threats. The dataset consists of two key columns: "url" and "status". The "status" column uses binary encoding, where 0 signifies phishing domains and 1 indicates legitimate domains. This balanced distribution between phishing and legitimate instances helps ensure the dataset's robustness for analysis and model development.

    Columns

    • url: This field contains the Uniform Resource Locators (URLs) for each domain, including both legitimate and phishing entries.
    • status: This field denotes the classification of the URL. A value of 0 represents a phishing domain, indicating a potential risk, while a value of 1 signifies a legitimate domain, offering assurance.

    Distribution

    The dataset is provided in a CSV file format. It contains 808,042 unique entries. The distribution of statuses is approximately 394,982 entries flagged as phishing (0) and 427,028 entries flagged as legitimate (1). This offers an almost equal balance across the two categories.
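
A quick way to check how balanced the two status values are is to turn the quoted counts into shares. The sketch below is a minimal example; the helper name `class_shares` is hypothetical, and the shares are taken relative to the sum of the two quoted counts rather than the quoted number of unique entries.

```python
def class_shares(counts):
    """Return each label's fraction of the summed counts."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("counts must not sum to zero")
    return {label: n / total for label, n in counts.items()}


# Counts as quoted in the description: 0 = phishing, 1 = legitimate.
status_counts = {0: 394_982, 1: 427_028}

shares = class_shares(status_counts)
imbalance_ratio = max(status_counts.values()) / min(status_counts.values())

for label, share in shares.items():
    print(f"status={label}: {share:.2%}")
print(f"majority/minority ratio: {imbalance_ratio:.3f}")
```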

    Usage

    This dataset is ideal for applications aimed at understanding, combating, and mitigating online threats. It can be used for developing models related to phishing detection, binary classification, and website analytics. It is also suitable for data cleaning exercises and projects involving Natural Language Processing (NLP) and Deep Learning.

    Coverage

    The data collection for this dataset is global in scope. While a specific time range for data collection is not provided, the dataset was listed on 05/06/2025.

    License

    CC0

    Who Can Use It

    This dataset is particularly valuable for researchers and practitioners working in the fields of AI and Machine Learning. Intended users include those looking to:

    • Develop and train models for identifying malicious URLs.
    • Analyse patterns distinguishing legitimate websites from phishing attempts.
    • Enhance cybersecurity measures and protect users from online threats.

    Dataset Name Suggestions

    • URL Phishing Detection
    • Legitimate and Malicious URLs
    • Online Threat URL Dataset
    • Phishing URL Classifier Data
    • Web Security URL Collection

    Attributes

    Original Data Source: Phishing and Legitimate URLS

  20. Wafer UCR Archive Dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Jul 31, 2024
    Zenodo (2024). Wafer UCR Archive Dataset [Dataset]. http://doi.org/10.5281/zenodo.11198387
    Dataset updated
    Jul 31, 2024
    Dataset provided by
    Zenodo http://zenodo.org/
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is part of the UCR Archive maintained by University of Southampton researchers. Please cite a relevant or the latest full archive release if you use the datasets. See http://www.timeseriesclassification.com/.

    This dataset was formatted by R. Olszewski as part of his thesis, Generalized feature extraction for structural pattern recognition in time-series data, at Carnegie Mellon University, 2001. Wafer data relates to semiconductor microelectronics fabrication. The wafer database is a collection of inline process control measurements recorded from various sensors during the processing of silicon wafers for semiconductor fabrication; each data set in the wafer database contains the measurements recorded by one sensor during the processing of one wafer by one tool. The two classes are normal and abnormal. There is a large class imbalance between normal and abnormal (10.7% of the training set is abnormal, 12.1% of the test set).

    Donator: R. Olszewski
