100+ datasets found
  1. Imbalanced Data

    • ieee-dataport.org
    Updated Aug 23, 2023
    Cite
    Blessa Binolin M (2023). Imbalanced Data [Dataset]. https://ieee-dataport.org/documents/imbalanced-data-0
    Dataset updated
    Aug 23, 2023
    Authors
    Blessa Binolin M
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Classification learning on non-stationary data must cope with changes in the data over time (drift). The major problems are class imbalance and the high cost of labeling instances in the presence of drift. Imbalance arises when the minority class has fewer samples than the majority class, and it leads to misclassification of data points.

  2. Performance comparison of machine learning models across accuracy, AUC, MCC,...

    • plos.figshare.com
    xls
    Updated Dec 31, 2024
    Cite
    Seongil Han; Haemin Jung (2024). Performance comparison of machine learning models across accuracy, AUC, MCC, and F1 score on GMSC dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0316454.t005
    Available download formats: xls
    Dataset updated
    Dec 31, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Seongil Han; Haemin Jung
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Performance comparison of machine learning models across accuracy, AUC, MCC, and F1 score on GMSC dataset.

  3. Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in...

    • frontiersin.figshare.com
    docx
    Updated Jun 1, 2023
    Cite
    Giulia Varotto; Gianluca Susi; Laura Tassi; Francesca Gozzo; Silvana Franceschetti; Ferruccio Panzica (2023). Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in Machine Learning: Application to Epileptogenic Zone Localization From Interictal Intracranial EEG Recordings in Patients With Focal Epilepsy.DOCX [Dataset]. http://doi.org/10.3389/fninf.2021.715421.s002
    Available download formats: docx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers
    Authors
    Giulia Varotto; Gianluca Susi; Laura Tassi; Francesca Gozzo; Silvana Franceschetti; Ferruccio Panzica
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Aim: In neuroscience research, data are quite often characterized by an imbalanced distribution between the majority and minority classes, an issue that can limit or even worsen the prediction performance of machine learning methods. Different resampling procedures have been developed to face this problem and a lot of work has been done in comparing their effectiveness in different scenarios. Notably, the robustness of such techniques has been tested among a wide variety of different datasets, without considering the performance of each specific dataset. In this study, we compare the performances of different resampling procedures for the imbalanced domain in stereo-electroencephalography (SEEG) recordings of the patients with focal epilepsies who underwent surgery.

    Methods: We considered data obtained by network analysis of interictal SEEG recorded from 10 patients with drug-resistant focal epilepsies, for a supervised classification problem aimed at distinguishing between the epileptogenic and non-epileptogenic brain regions in interictal conditions. We investigated the effectiveness of five oversampling and five undersampling procedures, using 10 different machine learning classifiers. Moreover, six specific ensemble methods for the imbalanced domain were also tested. To compare the performances, Area under the ROC curve (AUC), F-measure, Geometric Mean, and Balanced Accuracy were considered.

    Results: Both the resampling procedures showed improved performances with respect to the original dataset. The oversampling procedure was found to be more sensitive to the type of classification method employed, with Adaptive Synthetic Sampling (ADASYN) exhibiting the best performances. All the undersampling approaches were more robust than the oversampling among the different classifiers, with Random Undersampling (RUS) exhibiting the best performance despite being the simplest and most basic classification method.

    Conclusions: The application of machine learning techniques that take into consideration the balance of features by resampling is beneficial and leads to more accurate localization of the epileptogenic zone from interictal periods. In addition, our results highlight the importance of the type of classification method that must be used together with the resampling to maximize the benefit to the outcome.
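
    To make two of the ingredients named in this abstract concrete, here is a minimal sketch of Random Undersampling (RUS) together with the Balanced Accuracy and Geometric Mean metrics. It is not the authors' code: the function names, the toy data, and the choice to shrink every class to the size of the rarest one are assumptions made for illustration.

```python
import math
import random
from collections import defaultdict


def random_undersample(X, y, seed=0):
    """Randomly drop rows until every class is as small as the rarest class."""
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for i, label in enumerate(y):
        indices_by_class[label].append(i)
    n_min = min(len(idx) for idx in indices_by_class.values())
    keep = []
    for idx in indices_by_class.values():
        keep.extend(rng.sample(idx, n_min))  # sample without replacement
    keep.sort()  # preserve the original row order
    return [X[i] for i in keep], [y[i] for i in keep]


def balanced_accuracy_and_gmean(y_true, y_pred, positive=1):
    """Return (balanced accuracy, geometric mean) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return (sensitivity + specificity) / 2, math.sqrt(sensitivity * specificity)


# Toy data (hypothetical): 90 majority rows and 10 minority rows.
X = [[i] for i in range(100)]
y = [0] * 90 + [1] * 10
X_res, y_res = random_undersample(X, y)
print(y_res.count(0), y_res.count(1))
```

    Both metrics average over the per-class rates rather than over samples, which is why they are preferred to plain accuracy when one class dominates.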

  4. Tackling Class Imbalance with Ranking - Dataset - CKAN

    • rdm.inesctec.pt
    Updated Feb 20, 2017
    Cite
    (2017). Tackling Class Imbalance with Ranking - Dataset - CKAN [Dataset]. https://rdm.inesctec.pt/dataset/nis-2017-002
    Dataset updated
    Feb 20, 2017
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The datasets come originally from the UCI Machine Learning Repository. The multiclass datasets were transformed into binary classification problems as described in the paper. Ranking methods were applied to address class imbalance. The datasets are divided into 30 folds so that other class imbalance methods can be compared to the methods in the paper. The code used in the paper is also provided.

  5. Under-sampled dataset.

    • plos.figshare.com
    xls
    Updated Dec 31, 2024
    Cite
    Seongil Han; Haemin Jung (2024). Under-sampled dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0316454.t003
    Available download formats: xls
    Dataset updated
    Dec 31, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Seongil Han; Haemin Jung
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Credit scoring models play a crucial role for financial institutions in evaluating borrower risk and sustaining profitability. Logistic regression is widely used in credit scoring due to its robustness, interpretability, and computational efficiency; however, its predictive power decreases when applied to complex or non-linear datasets, resulting in reduced accuracy. In contrast, tree-based machine learning models often provide enhanced predictive performance but struggle with interpretability. Furthermore, imbalanced class distributions, which are prevalent in credit scoring, can adversely impact model accuracy and robustness, as the majority class tends to dominate. Despite these challenges, research that comprehensively addresses both the predictive performance and explainability aspects within the credit scoring domain remains limited. This paper introduces the Non-pArameTric oversampling approach for Explainable credit scoring (NATE), a framework designed to address these challenges by combining oversampling techniques with tree-based classifiers to enhance model performance and interpretability. NATE incorporates class balancing methods to mitigate the impact of imbalanced data distributions and integrates interpretability features to elucidate the model’s decision-making process. Experimental results show that NATE substantially outperforms traditional logistic regression in credit risk classification, with improvements of 19.33% in AUC, 71.56% in MCC, and 85.33% in F1 Score. Oversampling approaches, particularly when used with gradient boosting, demonstrated superior effectiveness compared to undersampling, achieving optimal metrics of AUC: 0.9649, MCC: 0.8104, and F1 Score: 0.9072. Moreover, NATE enhances interpretability by providing detailed insights into feature contributions, aiding in understanding individual predictions. 
    These findings highlight NATE’s capability in managing class imbalance, improving predictive performance, and enhancing model interpretability, demonstrating its potential as a reliable and transparent tool for credit scoring applications.

  6. Dataset: The effects of class balance on the training energy consumption of...

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Mar 18, 2024
    Cite
    Maria Gutierrez; Coral Calero; Félix García; Mª Ángeles Moraga (2024). Dataset: The effects of class balance on the training energy consumption of logistic regression models [Dataset]. http://doi.org/10.5281/zenodo.10823624
    Available download formats: csv
    Dataset updated
    Mar 18, 2024
    Dataset provided by
    Zenodo, http://zenodo.org/
    Authors
    Maria Gutierrez; Coral Calero; Félix García; Mª Ángeles Moraga
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2024
    Description

    Two synthetic datasets for binary classification, generated with the Random Radial Basis Function generator from WEKA. They are the same shape and size (104,952 instances, 185 attributes), but the "balanced" dataset has 52.13% of its instances belonging to class c0, while the "unbalanced" one only has 4.04% of its instances belonging to class c0. Therefore, this set of datasets is primarily meant to study how class balance influences the behaviour of a machine learning model.
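
    The class proportions quoted above can be turned into approximate class counts and a majority-to-minority ratio. The sketch below does that arithmetic; it assumes the instance count is 104,952 (reading the dot as a thousands separator) and that the percentages are rounded, so the counts are approximations. The function names are illustrative only.

```python
def class_counts(n_instances, c0_percent):
    """Approximate (c0, other) counts from a total and a rounded c0 percentage."""
    c0 = round(n_instances * c0_percent / 100)
    return c0, n_instances - c0


def imbalance_ratio(counts):
    """Majority-to-minority ratio; 1.0 means perfectly balanced."""
    return max(counts) / min(counts)


N_INSTANCES = 104_952  # assumption: the dot in the description is a thousands separator

balanced = class_counts(N_INSTANCES, 52.13)
unbalanced = class_counts(N_INSTANCES, 4.04)

print("balanced:", balanced, round(imbalance_ratio(balanced), 2))
print("unbalanced:", unbalanced, round(imbalance_ratio(unbalanced), 2))
```

    Under these assumptions the "unbalanced" dataset has roughly 24 majority instances for every c0 instance, while the "balanced" one stays close to 1:1.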

  7. YouTube Content Classification Dataset

    • opendatabay.com
    Updated Jul 3, 2025
    Cite
    Datasimple (2025). YouTube Content Classification Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/fef9b558-dda7-42c6-83e3-048d99e5135b
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    YouTube, Social Media and Networking
    Description

    This dataset provides YouTube video metadata, suitable for practising text classification using Natural Language Processing (NLP) techniques. It includes video IDs, titles, descriptions, and categories, making it a valuable resource for those looking to apply and refine their NLP skills. The dataset was generated by scraping YouTube, offering a real-world scenario for data cleaning and analysis, including challenges such as missing values and class imbalance.

    Columns

    • Video ID: A unique identifier for each YouTube video. Note that this column contains some missing data.
    • title: The title of the YouTube video.
    • description: The textual description associated with the YouTube video.
    • category: The category under which the video was classified when scraped.
    • link: A direct URL to the YouTube video.

    Distribution

    The dataset is typically provided in a CSV file format. It contains approximately 3,400 video records, derived from an initial scrape of 3,600 videos. The dataset is known to be untidy, featuring missing values and imbalanced classes across its categories, presenting an opportunity for data cleaning and preprocessing exercises.

    Usage

    This dataset is ideally suited for:

    • Practising basic text classification using various NLP techniques.
    • Learning how to handle common data issues such as missing values and imbalanced classes.
    • Developing and applying data cleaning and preprocessing methods.
    • Experimenting with different machine learning algorithms for text analysis.

    Coverage

    The dataset has a global reach, as it comprises YouTube videos accessible worldwide. It was listed on 08/06/2025. The video categories included in the dataset were specifically queried across four main areas: Travel Vlogs, Food, Art and Music, and History. Users should be aware that the data includes missing values and exhibits class imbalance across these categories.

    License

    CC0

    Who Can Use It

    This dataset is intended for individuals and researchers, particularly those at an intermediate skill level, who wish to practise and improve their text classification and NLP capabilities. It is also highly beneficial for anyone looking to gain practical experience in data cleaning, handling missing data, and addressing class imbalance in real-world datasets.

    Dataset Name Suggestions

    • YouTube Video Classification Data
    • NLP YouTube Metadata Dataset
    • YouTube Content Classification Dataset
    • Video Description Text Analysis Dataset

    Attributes

    Original Data Source: Youtube Videos Dataset (~3400 videos)

  8. HDSNE Chest X-ray Dataset Dataset

    • paperswithcode.com
    Updated Feb 25, 2025
    Cite
    (2025). HDSNE Chest X-ray Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/hdsne-chest-x-ray-dataset
    Dataset updated
    Feb 25, 2025
    Description

    The continuous release of medical image databases, often featuring overlapping or identical categories, poses a significant challenge for the development of autonomous Computer-Aided Diagnostics (CAD) systems. These systems are essential for creating truly comprehensive medical diagnostics. However, one of the main obstacles lies in the frequent bulk release of datasets, which commonly suffer from two critical issues: image duplication and data corruption.

    The Problem of Dataset Redundancy

    Repeated releases of the same categories often fail to integrate or deduplicate similar images across databases, which can severely impact the effectiveness of machine learning models. Data duplication not only reduces the efficiency of learning models but also leads to overfitting, wastes computational resources, and increases the carbon footprint due to the energy required for training complex models.

    Proposed Solution: Global Data Aggregation Model

    In response to these challenges, we introduce a global data aggregation model that intelligently combines data from six distinct and reputable medical imaging databases. Each database was carefully curated to ensure the elimination of redundancies while preserving data diversity. Two robust algorithms were employed:

    Hash MD5 Algorithm: This algorithm generates unique hash values for each image, helping in the effective detection and elimination of duplicate images.

    t-SNE Algorithm: This technique is used for dimensionality reduction, with a tunable perplexity parameter to ensure accurate representation of high-dimensional data.
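
    The MD5 step described above can be sketched as follows. This is an illustrative reconstruction, not the dataset authors' pipeline: the function names are hypothetical, and hashing file bytes only catches byte-identical copies, so re-encoded or resized duplicates would need a different technique.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large images need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths):
    """Split paths into (unique, duplicates), keeping the first file seen per hash."""
    seen = set()
    unique, duplicates = [], []
    for path in paths:
        key = md5_of_file(path)
        if key in seen:
            duplicates.append(path)
        else:
            seen.add(key)
            unique.append(path)
    return unique, duplicates
```

    Calling find_duplicates on the merged file list of several source databases returns the files to keep and the files that repeat an earlier image byte for byte.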

    Dataset Categories

    The final dataset includes an equal number of samples from three key categories of chest X-ray images:

    • Normal
    • Pneumonia
    • COVID-19

    This uniform distribution ensures that the dataset is balanced, avoiding class imbalance—a common issue that can skew results in medical image analysis.

    Dataset Application & Model Evaluation

    The dataset was applied to the Inception V3 pre-trained model, a leading convolutional neural network (CNN) architecture known for its excellence in image classification tasks. The evaluation was conducted using the following performance metrics:

    Accuracy: An exceptional accuracy rate of 98.48% was achieved.

    Precision, Recall, and F1-score: The dataset showed strong performance across these metrics, reducing both false positives and false negatives.

    Statistical Validation: A t-test was conducted to validate the results, and the t-values and p-values confirm the statistical significance of the model’s performance.

    Conclusion

    The HDSNE Chest X-ray Dataset offers a novel and effective approach to data aggregation, tackling the issues of redundancy and data duplication that have long plagued the field of medical imaging. By maintaining a balanced class distribution and eliminating unnecessary data, this dataset provides a cleaner and more efficient resource for training machine learning models.

    This dataset is sourced from Kaggle.

  9. Data from: Addressing Imbalanced Classification Problems in Drug Discovery...

    • acs.figshare.com
    zip
    Updated Apr 15, 2025
    Cite
    Ayush Garg; Narayanan Ramamurthi; Shyam Sundar Das (2025). Addressing Imbalanced Classification Problems in Drug Discovery and Development Using Random Forest, Support Vector Machine, AutoGluon-Tabular, and H2O AutoML [Dataset]. http://doi.org/10.1021/acs.jcim.5c00023.s001
    Available download formats: zip
    Dataset updated
    Apr 15, 2025
    Dataset provided by
    ACS Publications
    Authors
    Ayush Garg; Narayanan Ramamurthi; Shyam Sundar Das
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The classification models built on class imbalanced data sets tend to prioritize the accuracy of the majority class, and thus, the minority class generally has a higher misclassification rate. Different techniques are available to address the class imbalance in classification models and can be categorized as data-level, algorithm-level, and hybrid methods. But to the best of our knowledge, an in-depth analysis of the performance of these techniques against the class ratio is not available in the literature. We have addressed these shortcomings in this study and have performed a detailed analysis of the performance of four different techniques to address imbalanced class distribution using machine learning (ML) methods and AutoML tools. To carry out our study, we have selected four such techniques: (a) threshold optimization using (i) GHOST and (ii) the area under the precision–recall curve (AUPR), (b) internal balancing method of AutoML and class-weight of machine learning methods, and (c) data balancing using SMOTETomek, and generated 27 data sets considering nine different class ratios (i.e., the ratio of the positive class and total samples) from three data sets that belong to the drug discovery and development field. We have employed random forest (RF) and support vector machine (SVM) as representatives of ML classifier and AutoGluon-Tabular (version 0.6.1) and H2O AutoML (version 3.40.0.4) as representatives of AutoML tools.

    The important findings of our studies are as follows: (i) there is no effect of threshold optimization on ranking metrics such as AUC and AUPR, but AUC and AUPR get affected by class-weighting and SMOTETomek; (ii) for ML methods RF and SVM, significant percentage improvement up to 375, 33.33, and 450 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy, which are suitable for performance evaluation of imbalanced data sets; (iii) for AutoML libraries AutoGluon-Tabular and H2O AutoML, significant percentage improvement up to 383.33, 37.25, and 533.33 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy; (iv) the general pattern of percentage improvement in balanced accuracy is that the percentage improvement increases when the class ratio is systematically decreased from 0.5 to 0.1; in the case of F1 score and MCC, maximum improvement is achieved at the class ratio of 0.3; (v) for both ML and AutoML with balancing, it is observed that any individual class-balancing technique does not outperform all other methods on a significantly higher number of data sets based on F1 score; (vi) the three external balancing techniques combined outperformed the internal balancing methods of the ML and AutoML; (vii) AutoML tools perform as good as the ML models and in some cases perform even better for handling imbalanced classification when applied with imbalance handling techniques. In summary, exploration of multiple data balancing techniques is recommended for classifying imbalanced data sets to achieve optimal performance as neither of the external techniques nor the internal techniques outperform others significantly. The results are specific to the ML methods and AutoML libraries used in this study, and for generalization, a study can be carried out considering a sizable number of ML methods and AutoML libraries.
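
    As a rough illustration of what threshold optimization means here, the sketch below sweeps candidate decision thresholds over predicted probabilities and keeps the one with the highest F1 score. It is a generic grid search, not the GHOST procedure or the AUPR-based method used in the study; the function names, the candidate grid, and the toy scores are assumptions.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 score of the positive class when predicting 1 for score >= threshold."""
    tp = fp = fn = 0
    for truth, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 1:
            fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def best_threshold(y_true, scores, candidates=None):
    """Return the candidate threshold with the highest F1 (first one on ties)."""
    if candidates is None:
        candidates = [i / 100 for i in range(5, 100, 5)]  # 0.05, 0.10, ..., 0.95
    return max(candidates, key=lambda th: f1_at_threshold(y_true, scores, th))


# Toy scores (hypothetical): the two positives never reach the default 0.5 cut-off.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.10, 0.15, 0.20, 0.20, 0.30, 0.45, 0.35, 0.40]
chosen = best_threshold(y_true, scores)
print(chosen, f1_at_threshold(y_true, scores, chosen))
```

    Moving the threshold changes which samples are labelled positive but not the order of the scores, which is consistent with finding (i): ranking metrics such as AUC and AUPR are unaffected by it. In practice the threshold should be chosen on held-out data, not on the test set.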

  10. Dataset for Transient Stability Assessment of IEEE 39-Bus System

    • data.mendeley.com
    Updated Dec 20, 2024
    Cite
    Živko Sokolović (2024). Dataset for Transient Stability Assessment of IEEE 39-Bus System [Dataset]. http://doi.org/10.17632/p992nhb8ss.1
    Dataset updated
    Dec 20, 2024
    Authors
    Živko Sokolović
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains 50 features and was generated through 12,852 time-domain simulations performed on the IEEE New England 39 bus system test case using DIgSILENT PowerFactory and Python automation. The simulations span diverse operating conditions by varying the generation/load profile from 80% to 120% in 5% increments. For each condition, three-phase short-circuit faults were applied at seven distinct locations (0%, 10%, 20%, 50%, 80%, 90%, 100%) along all transmission lines, with fault clearing times ranging from 0.1s to 0.3s.

    Key features captured for each of the 10 generators (G02 is the reference machine) include:

    • P in MW - Active Power
    • ut in p.u. - Terminal Voltage
    • ie in p.u. - Excitation Current
    • xspeed in p.u. - Rotor Speed
    • firel in deg - Rotor Angle (relative to G02)

    Simulations lasted 10 seconds to ensure accurate transient stability assessment. Post-fault data was sampled every 0.01s from fault clearance up to 0.6s afterward, labeling the stability state as 1 (stable) or 0 (unstable). The dataset generation process took 5,840 seconds. The dataset exhibits a class imbalance, with 42% of cases belonging to the unstable class. All simulation data were exported to .csv files and subsequently unified into a single pickle file (tsa_data.pkl).

    Helper scripts are provided:

    • dataset_loader.py: Includes the load_tsa_data function to load the dataset.
    • usage.py: Demonstrates how to use the loader module.

    This dataset serves as a comprehensive foundation for machine learning applications in transient stability assessment (TSA), offering insights into system behavior under dynamic conditions.

  11. The definition of a confusion matrix.

    • plos.figshare.com
    xls
    Updated Feb 10, 2025
    Cite
    Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). The definition of a confusion matrix. [Dataset]. http://doi.org/10.1371/journal.pone.0317396.t002
    Available download formats: xls
    Dataset updated
    Feb 10, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In recent years, the challenge of imbalanced data has become increasingly prominent in machine learning, affecting the performance of classification algorithms. This study proposes a novel data-level oversampling method called Cluster-Based Reduced Noise SMOTE (CRN-SMOTE) to address this issue. CRN-SMOTE combines SMOTE for oversampling minority classes with a novel cluster-based noise reduction technique. In this cluster-based noise reduction approach, it is crucial that samples from each category form one or two clusters, a feature that conventional noise reduction methods do not achieve. The proposed method is evaluated on four imbalanced datasets (ILPD, QSAR, Blood, and Maternal Health Risk) using five metrics: Cohen’s kappa, Matthews correlation coefficient (MCC), F1-score, precision, and recall. Results demonstrate that CRN-SMOTE consistently outperformed the state-of-the-art Reduced Noise SMOTE (RN-SMOTE), SMOTE-Tomek Link, and SMOTE-ENN methods across all datasets, with particularly notable improvements observed in the QSAR and Maternal Health Risk datasets, indicating its effectiveness in enhancing imbalanced classification performance. Overall, the experimental findings indicate that CRN-SMOTE outperformed RN-SMOTE in 100% of the cases, achieving average improvements of 6.6% in Kappa, 4.01% in MCC, 1.87% in F1-score, 1.7% in precision, and 2.05% in recall, with SMOTE’s number of neighbors set to 5.
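
    Since this entry is the paper's confusion-matrix table, a short sketch of how the five metrics named above follow from a binary confusion matrix may help. The formulas are the standard definitions; the function name and the example counts are made up for illustration and are not taken from the paper.

```python
import math


def metrics_from_confusion(tp, fp, fn, tn):
    """Precision, recall, F1, MCC and Cohen's kappa from binary confusion counts."""
    n = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0

    # Matthews correlation coefficient: correlation between truth and prediction.
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = (tp + tn) / n
    p_expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (p_observed - p_expected) / (1 - p_expected) if p_expected != 1 else 0.0

    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc, "kappa": kappa}


# Hypothetical counts: 12 positives and 88 negatives.
for name, value in metrics_from_confusion(tp=8, fp=2, fn=4, tn=86).items():
    print(f"{name}: {value:.4f}")
```

    With these counts plain accuracy would be 0.94, while kappa and MCC come out noticeably lower because they discount agreement that the majority class produces by chance.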

  12. Balanced Affectnet Dataset (75×75, RGB)

    • kaggle.com
    Updated Apr 30, 2025
    Cite
    dolly prajapati 182 (2025). Balanced Affectnet Dataset (75×75, RGB) [Dataset]. https://www.kaggle.com/datasets/dollyprajapati182/balanced-affectnet
    Dataset updated
    Apr 30, 2025
    Dataset provided by
    Kaggle, http://kaggle.com/
    Authors
    dolly prajapati 182
    Description

    The Balanced Affectnet Dataset is a uniformly processed, class-balanced, and augmented version of the affect-fer composite dataset. This curated version is tailored for deep learning and machine learning applications in Facial Emotion Recognition (FER). It addresses class imbalance and standardizes input dimensions to enhance model performance and comparability.

    🎯 Purpose

    The goal of this dataset is to balance the representation of seven basic emotions, enabling the training of fairer and more robust FER models. Each emotion class contains an equal number of images, facilitating consistent model learning and evaluation across all classes.

    🧾 Dataset Characteristics

    Source: Based on the Affectnet dataset

    Image Format: RGB .png

    Image Size: 75 × 75 pixels

    Emotion Classes (8): anger, contempt, disgust, fear, happy, neutral, sad, surprise

    Total Images: 41,008

    Images per Class: 5,126

    ⚙️ Preprocessing Pipeline

    Each image in the dataset has been preprocessed using the following steps:

    ✅ Converted to grayscale

    ✅ Resized to 75×75 pixels

    ✅ Augmented using:

    Random rotation

    Horizontal flip

    Brightness adjustment

    Contrast enhancement

    Sharpness modification

    This results in a clean, uniform, and diverse dataset ideal for FER tasks.

    Testing (10%): 4,100 images

    Training (80% of remainder): 29,526 images

    Validation (20% of remainder): 7,382 images
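
    The split sizes above can be reproduced from the total of 41,008 images. The sketch below is one way to do it; rounding the test and training sizes down and giving the leftover to validation is an assumption that happens to match the listed counts, and the function name is illustrative.

```python
def split_sizes(total, test_pct=10, train_pct_of_rest=80):
    """Return (train, validation, test) sizes using integer (floor) arithmetic."""
    test = total * test_pct // 100
    rest = total - test
    train = rest * train_pct_of_rest // 100
    validation = rest - train  # everything not used for training or testing
    return train, validation, test


TOTAL_IMAGES = 41_008
IMAGES_PER_CLASS = 5_126

train, validation, test = split_sizes(TOTAL_IMAGES)
print(train, validation, test)
print(TOTAL_IMAGES // IMAGES_PER_CLASS, "classes")
```

    The three sizes add back up to 41,008, and dividing the total by the 5,126 images per class gives the eight classes named in the class list.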

    ✅ Advantages

    ⚖️ Balanced Classes: Equal images across all seven emotions

    🧠 Model-Friendly: Grayscale, resized format reduces preprocessing overhead

    🚀 Augmented: Improves model generalization and robustness

    📦 Split Ready: Train/Val/Test folders structured per class

    📊 Great for Benchmarking: Ideal for training CNNs, Transformers, and ensemble models for FER

  13. Learning Privacy from Visual Entities - Curated data sets and pre-computed...

    • zenodo.org
    zip
    Updated May 7, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Alessio Xompero; Andrea Cavallaro (2025). Learning Privacy from Visual Entities - Curated data sets and pre-computed visual entities [Dataset]. http://doi.org/10.5281/zenodo.15348506
    Available download formats: zip
    Dataset updated
    May 7, 2025
    Dataset provided by
    Zenodo, http://zenodo.org/
    Authors
    Alessio Xompero; Andrea Cavallaro
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    This repository contains the curated image privacy datasets and pre-computed visual entities used in the publication Learning Privacy from Visual Entities by A. Xompero and A. Cavallaro.
    [arxiv][code]

    Curated image privacy data sets

    In the article, we trained and evaluated models on the Image Privacy Dataset (IPD) and the PrivacyAlert dataset. The datasets are originally provided by other sources and have been re-organised and curated for this work.

    Our curation organises the datasets in a common structure. We updated the annotations and labelled the splits of the data in the annotation file. This avoids having separated folders of images for each data split (training, validation, testing) and allows a flexible handling of new splits, e.g. created with a stratified K-Fold cross-validation procedure. As for the original datasets (PicAlert and PrivacyAlert), we provide the link to the images in bash scripts to download the images. Another bash script re-organises the images in sub-folders with maximum 1000 images in each folder.
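
    Keeping the split label in the annotation file makes it easy to generate new splits. As a sketch of the stratified K-fold idea mentioned above, the code below assigns each image a fold index while keeping the public/private ratio roughly equal across folds. The function name, label strings, and toy class counts are hypothetical and do not reflect the actual annotation format of the repository.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=5, seed=0):
    """Return one fold index per sample, spreading each class evenly over k folds."""
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        indices_by_class[label].append(idx)

    fold_of = [None] * len(labels)
    for indices in indices_by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            fold_of[idx] = position % k  # deal samples round-robin into folds
    return fold_of


# Toy annotations (hypothetical): a strong imbalance towards the public class.
labels = ["public"] * 90 + ["private"] * 10
folds = stratified_kfold(labels, k=5)
for fold in range(5):
    members = [labels[i] for i, f in enumerate(folds) if f == fold]
    print(fold, members.count("public"), members.count("private"))
```

    The resulting fold index could be stored as one more column in an annotation file, so a new split needs no change to the image folders.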

    Both datasets refer to images publicly available on Flickr. These images have a large variety of content, including sensitive content, seminude people, vehicle plates, documents, private events. Images were annotated with a binary label denoting if the content was deemed to be public or private. As the images are publicly available, their label is mostly public. These datasets have therefore a high imbalance towards the public class. Note that IPD combines two other existing datasets, PicAlert and part of VISPR, to increase the number of private images already limited in PicAlert. Further details are in our corresponding publication (https://doi.org/10.48550/arXiv.2503.12464).

    List of datasets and their original source:

    Notes:

    • For PicAlert and PrivacyAlert, only urls to the original locations in Flickr are available in the Zenodo record
    • Collector and authors of the PrivacyAlert dataset selected the images from Flickr under Public Domain license
    • Owners of the photos on Flickr could have removed the photos from the social media platform
    • Running the bash scripts to download the images can trigger the "429 Too Many Requests" status code

    Pre-computed visual entitities

    Some of the models run their pipeline end-to-end with the images as input, whereas other models require different or additional inputs. These inputs include the pre-computed visual entities (scene types and object types) represented in a graph format, e.g. for a Graph Neural Network. Re-using these pre-computed visual entities allows other researcher to build new models based on these features while avoiding re-computing the same on their own or for each epoch during the training of a model (faster training).

    For each image of each dataset, namely PrivacyAlert, PicAlert, and VISPR, we provide the predicted scene probabilities as a .csv file, the detected objects as a .json file in COCO data format, and the node features (visual entities already organised in graph format with their features) as a .json file. For consistency, all the files are already organised in batches following the structure of the images in the datasets folder. For each dataset, we also provide the pre-computed adjacency matrix for the graph data.

    Note: IPD is based on PicAlert and VISPR and therefore IPD refers to the scene probabilities and object detections of the other two datasets. Both PicAlert and VISPR must be downloaded and prepared to use IPD for training and testing.

    Further details on downloading and organising data can be found in our GitHub repository: https://github.com/graphnex/privacy-from-visual-entities (see ARTIFACT-EVALUATION.md#pre-computed-visual-entitities-)

    Enquiries, questions and comments

    If you have any enquiries, questions, or comments, or you would like to file a bug report or a feature request, use the issue tracker of our GitHub repository.

  14. f

    This table compares the performance of RF, BR, CC, and CSMLP algorithms...

    • plos.figshare.com
    xls
    Updated May 28, 2025
    Cite
    Durgesh Ameta; Surendra Kumar; Rishav Mishra; Laxmidhar Behera; Aniruddha Chakraborty; Tushar Sandhan (2025). This table compares the performance of RF, BR, CC, and CSMLP algorithms across various datasets and features. The evaluation uses Multi-Label Random Under-Sampling (MLRUS) and Multi-Label Random Over-Sampling (MLROS) techniques with sampling ratios of 10%, 20%, and 30%. Micro-averaged F1-scores, Precision and Recall are reported for each algorithm-dataset pair; each cell has F1 score on the top, then Precision in the middle and Recall at the bottom. The last column shows percentage increases from the baseline compared to our best result to provide insights into handling class imbalance and improving classification accuracy. Additionally, the RF, BR, and CC results are compared with the findings from Saini et al. [8] on the IGD_FP dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0322514.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    May 28, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Durgesh Ameta; Surendra Kumar; Rishav Mishra; Laxmidhar Behera; Aniruddha Chakraborty; Tushar Sandhan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This table compares the performance of RF, BR, CC, and CSMLP algorithms across various datasets and features. The evaluation uses Multi-Label Random Under-Sampling (MLRUS) and Multi-Label Random Over-Sampling (MLROS) techniques with sampling ratios of 10%, 20%, and 30%. Micro-averaged F1-scores, Precision and Recall are reported for each algorithm-dataset pair; each cell has F1 score on the top, then Precision in the middle and Recall at the bottom. The last column shows percentage increases from the baseline compared to our best result to provide insights into handling class imbalance and improving classification accuracy. Additionally, the RF, BR, and CC results are compared with the findings from Saini et al. [8] on the IGD_FP dataset.
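
    For readers unfamiliar with micro-averaging, the sketch below shows how micro-averaged Precision, Recall and F1 can be computed for multi-label predictions: true positives, false positives and false negatives are pooled over every label of every sample before the ratios are taken. The function name and the toy label matrices are assumptions for illustration and do not come from the paper.

```python
def micro_scores(true_rows, pred_rows):
    """Micro-averaged precision, recall and F1 for multi-label 0/1 indicator rows."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(true_rows, pred_rows):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: 3 samples, 3 labels each (hypothetical values).
true_rows = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
pred_rows = [[1, 0, 0], [0, 1, 1], [1, 0, 0]]
micro_precision, micro_recall, micro_f1 = micro_scores(true_rows, pred_rows)
```

    Because the counts are pooled, frequent labels dominate the micro-average, which is worth keeping in mind when reading results on imbalanced label sets.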

  15. Dataset for Classification of Suspicious Financial Transactions

    • zenodo.org
    bin, csv
    Updated Jun 9, 2025
    Cite
    Edho Dwi Jayanto (2025). Dataset for Classification of Suspicious Financial Transactions [Dataset]. http://doi.org/10.5281/zenodo.15493392
    Explore at:
    Available download formats: bin, csv
    Dataset updated
    Jun 9, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Edho Dwi Jayanto
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract: This study investigates the application of machine learning models for detecting suspicious financial transactions. Using a dataset of 12,571 transactions from PT Bank ABC, the research covers data preprocessing, feature selection, and the handling of class imbalance. The models evaluated are Random Forest, XGBoost, and SVM, which were assessed through cross-validation with StratifiedKFold and optimized using RandomizedSearchCV.

  16. u

    Data from: Voxelized fragment dataset for machine learning

    • investigacion.ujaen.es
    • zenodo.org
    Updated 2024
    Cite
    López Ruiz, Alfonso; Rueda Ruiz, Antonio Jesús; Segura, Rafael; Ogayar Anguita, Carlos Javier; Navarro, Pablo; Fuertes García, José Manuel (2024). Voxelized fragment dataset for machine learning [Dataset]. https://investigacion.ujaen.es/documentos/67321f1daea56d4af04863a7
    Explore at:
    Dataset updated
    2024
    Authors
    López Ruiz, Alfonso; Rueda Ruiz, Antonio Jesús; Segura, Rafael; Ogayar Anguita, Carlos Javier; Navarro, Pablo; Fuertes García, José Manuel
    Description

    One of the primary challenges in using deep learning models is the scarcity of, and difficulty of access to, datasets large enough to train these networks effectively. This is particularly significant in object detection, shape completion, and fracture assembly. Instead of scanning a large number of real-world fragments, it is possible to generate massive datasets with synthetic pieces. However, realistic fragmentation is computationally intensive in both preparation (e.g., pre-fractured models) and generation. Alternatively, simpler algorithms such as Voronoi diagrams are faster but less realistic. Hence, computational efficiency and realism must be balanced when generating large datasets for machine learning.

    We proposed a GPU-based fragmentation method to improve the baseline Discrete Voronoi Chain aimed at completing this dataset generation task. The dataset in this repository includes voxelized fragments from high-resolution 3D models, curated to be used as training sets for machine learning models. More specifically, these models come from an archaeological dataset, which led to more than 1M fragments from 1,052 Iberian vessels. In this dataset, fragments are not stored individually; instead, the fragmented voxelizations are provided in a compressed binary file (.rle.zip). Once uncompressed, each fragment is represented by a different number in the grid. The class to which each vessel belongs is also included in class.csv. The GPU-based pipeline that generated this dataset is explained at https://doi.org/10.1016/j.cag.2024.104104.
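
    To make the idea of "one number per fragment in the grid" concrete, here is a minimal sketch that expands a run-length encoding and counts voxels per fragment. The exact layout of the .rle file is not described in this record, so the (value, run length) pair format, the use of 0 for empty space, and all function names are assumptions for illustration only.

```python
from collections import Counter


def decode_runs(runs):
    """Expand (value, run_length) pairs into a flat list of voxel labels."""
    flat_grid = []
    for value, length in runs:
        flat_grid.extend([value] * length)
    return flat_grid


def voxels_per_fragment(flat_grid, empty_value=0):
    """Count voxels for every fragment id, skipping the assumed empty value."""
    return Counter(v for v in flat_grid if v != empty_value)


# Hypothetical run-length data: 0 = empty, 1 and 2 = fragment ids.
runs = [(0, 3), (1, 4), (2, 2), (0, 1), (2, 3)]
flat_grid = decode_runs(runs)
fragment_sizes = voxels_per_fragment(flat_grid)
```

    With the real files, the decoding step would need to follow the format documented by the authors; only the per-fragment counting is independent of that format.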

    Please note that this dataset originally provided voxel data, point clouds and triangle meshes. However, we opted to include only voxel data because 1) the original dataset is too large to be uploaded to Zenodo and 2) the original intent of our paper is to generate implicit data in the form of voxels. If interested in the whole dataset (450GB), please visit the web page of our research institute.

  17. BIOSCAN-1M Insect Dataset

    • zenodo.org
    bin, tsv, zip
    Updated Jan 24, 2025
    Cite
    Zahra Gharaee; ZeMing Gong; Nicholas Pellegrino; Iuliia Zarubiieva; Joakim Bruslund Haurum; Scott C. Lowe; Jaclyn T.A. McKeown; Chris C.Y. Ho; Joschka McLeod; Yi-Yun C Wei; Jireh Agda; Sujeevan Ratnasingham; Dirk Steinke; Angel X. Chang; Graham W. Taylor; Paul Fieguth (2025). BIOSCAN-1M Insect Dataset [Dataset]. http://doi.org/10.5281/zenodo.8030065
    Explore at:
    Available download formats: bin, zip, tsv
    Dataset updated
    Jan 24, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Zahra Gharaee; ZeMing Gong; Nicholas Pellegrino; Iuliia Zarubiieva; Joakim Bruslund Haurum; Scott C. Lowe; Jaclyn T.A. McKeown; Chris C.Y. Ho; Joschka McLeod; Yi-Yun C Wei; Jireh Agda; Sujeevan Ratnasingham; Dirk Steinke; Angel X. Chang; Graham W. Taylor; Paul Fieguth
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    Overview

    In an effort to catalog insect biodiversity, we propose a new large dataset of hand-labelled insect images, the BIOSCAN-1M Insect Dataset. Each record is taxonomically classified by an expert, and also has associated genetic information including raw nucleotide barcode sequences and assigned barcode index numbers, which are genetically-based proxies for species classification. This is a curated million-image dataset, intended primarily to train computer-vision models capable of image-based taxonomic assessment. The dataset also has characteristics whose study would interest the broader machine learning community. Owing to its biological nature, it exhibits a characteristic long-tailed class imbalance. Furthermore, taxonomic labelling is a hierarchical classification scheme, presenting a highly fine-grained classification problem at lower levels. Beyond spurring interest in biodiversity research within the machine learning community, progress on creating an image-based taxonomic classifier will also further the ultimate goal of all BIOSCAN research: to lay the foundation for a comprehensive survey of global biodiversity.

  18. Credit Card Fraud Detection Dataset

    • kaggle.com
    Updated May 15, 2025
    Cite
    Ghanshyam Saini (2025). Credit Card Fraud Detection Dataset [Dataset]. https://www.kaggle.com/datasets/ghnshymsaini/credit-card-fraud-detection-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    May 15, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ghanshyam Saini
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Credit Card Fraud Detection Dataset (European Cardholders, September 2013)

    As a data contributor, I'm sharing this crucial dataset focused on the detection of fraudulent credit card transactions. Recognizing these illicit activities is paramount for protecting customers and the integrity of financial systems.

    About the Dataset:

    This dataset encompasses credit card transactions made by European cardholders during a two-day period in September 2013. It presents a real-world scenario with a significant class imbalance, where fraudulent transactions are considerably less frequent than legitimate ones. Out of a total of 284,807 transactions, only 492 are instances of fraud, representing a mere 0.172% of the entire dataset.
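
    The imbalance figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only the two counts given in the description; the variable names are chosen for this example.

```python
total_transactions = 284_807
fraud_transactions = 492

# Share of fraudulent transactions in the whole dataset.
fraud_rate = fraud_transactions / total_transactions

# Number of legitimate transactions for every fraudulent one.
legit_per_fraud = (total_transactions - fraud_transactions) / fraud_transactions

print(f"fraud rate: {fraud_rate:.4%}")
print(f"legitimate transactions per fraud: {legit_per_fraud:.1f}")
```

    The rate comes out at roughly 0.1727%, which the description reports truncated as 0.172%, and corresponds to nearly 578 legitimate transactions for each fraudulent one.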

    Content of the Data:

    Due to confidentiality concerns, the majority of the input features in this dataset have undergone a Principal Component Analysis (PCA) transformation. This means the original meaning and context of features V1, V2, ..., V28 are not directly provided. However, these principal components capture the variance in the underlying transaction data.

    The only features that have not been transformed by PCA are:

    • Time: Numerical. Represents the number of seconds elapsed between each transaction and the very first transaction recorded in the dataset.
    • Amount: Numerical. The transaction amount in Euros (€). This feature could be valuable for cost-sensitive learning approaches.

    The target variable for this classification task is:

    • Class: Integer. Takes the value 1 in the case of a fraudulent transaction and 0 otherwise.

    Important Note on Evaluation:

    Given the substantial class imbalance (far more legitimate transactions than fraudulent ones), traditional accuracy metrics based on the confusion matrix can be misleading. It is strongly recommended to evaluate models using the Area Under the Precision-Recall Curve (AUPRC), as this metric is more sensitive to the performance on the minority class (fraudulent transactions).
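
    As a rough illustration of what AUPRC measures, the sketch below computes average precision, a common step-wise estimate of the area under the precision-recall curve, in plain Python. The function name and the toy scores are assumptions for this example; tied scores are not treated specially, and libraries such as scikit-learn provide tested implementations.

```python
def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve (average precision)."""
    total_positives = sum(y_true)
    if total_positives == 0:
        raise ValueError("average precision is undefined without positive samples")

    # Rank samples from the highest to the lowest predicted score.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)

    true_positives = 0
    area = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            # Precision at this rank, added once per recovered positive.
            area += true_positives / rank
    return area / total_positives


# Toy example: 2 frauds among 8 transactions (hypothetical scores).
fraud_labels = [0, 0, 1, 0, 1, 0, 0, 0]
fraud_scores = [0.10, 0.40, 0.80, 0.20, 0.35, 0.05, 0.30, 0.15]
ap = average_precision(fraud_labels, fraud_scores)
```

    Unlike accuracy, this score only rewards ranking the rare positive samples ahead of the negatives, which is why it is preferred for heavily imbalanced data.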

    How to Use This Dataset:

    1. Download the dataset file (likely in CSV format).
    2. Load the data using libraries like Pandas.
    3. Understand the class imbalance: Be aware that fraudulent transactions are rare.
    4. Explore the features: Analyze the distributions of 'Time', 'Amount', and the PCA-transformed features (V1-V28).
    5. Address the class imbalance: Consider using techniques like oversampling the minority class, undersampling the majority class, or using specialized algorithms designed for imbalanced datasets.
    6. Build and train binary classification models to predict the 'Class' variable.
    7. Evaluate your models using AUPRC to get a meaningful assessment of performance in detecting fraud.
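
    Step 5 can be sketched with random oversampling, one of the simplest ways to rebalance classes. The helper below is illustrative only: its name, the dictionary row layout and the toy data are assumptions, and resampling should be applied to the training split only so that duplicated rows never leak into the evaluation data.

```python
import random


def random_oversample(rows, label_of, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)

    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap up to the largest class size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Hypothetical training rows: 3 frauds (Class 1) among 50 transactions.
train_rows = [{"Amount": float(i), "Class": 1 if i < 3 else 0} for i in range(50)]
balanced_rows = random_oversample(train_rows, label_of=lambda row: row["Class"])
```

    Oversampling by duplication adds no new information, so it is usually compared against undersampling or class weighting on a held-out set scored with AUPRC.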

    Acknowledgements and Citation:

    This dataset has been collected and analyzed through a research collaboration between Worldline and the Machine Learning Group (MLG) of ULB (Université Libre de Bruxelles).

    When using this dataset in your research or projects, please cite the following works as appropriate:

    • Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015.
    • Dal Pozzolo, Andrea; Caelen, Olivier; Le Borgne, Yann-Ael; Waterschoot, Serge; Bontempi, Gianluca. Learned lessons in credit card fraud detection from a practitioner perspective, Expert systems with applications,41,10,4915-4928,2014, Pergamon.
    • Dal Pozzolo, Andrea; Boracchi, Giacomo; Caelen, Olivier; Alippi, Cesare; Bontempi, Gianluca. Credit card fraud detection: a realistic modeling and a novel learning strategy, IEEE transactions on neural networks and learning systems,29,8,3784-3797,2018,IEEE.
    • Andrea Dal Pozzolo. Adaptive Machine learning for credit card fraud detection ULB MLG PhD thesis (supervised by G. Bontempi).
    • Fabrizio Carcillo, Andrea Dal Pozzolo, Yann-Aël Le Borgne, Olivier Caelen, Yannis Mazzer, Gianluca Bontempi. Scarff: a scalable framework for streaming credit card fraud detection with Spark, Information fusion,41, 182-194,2018,Elsevier.
    • Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen, Gianluca Bontempi. Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization, International Journal of Data Science and Analytics, 5,4,285-300,2018,Springer International Publishing.
    • Bertrand Lebichot, Yann-Aël Le Borgne, Liyun He, Frederic Oblé, Gianluca Bontempi Deep-Learning Domain Adaptation Techniques for Credit Cards Fraud Detection, INNSBDDL 2019: Recent Advances in Big Data and Deep Learning, pp 78-88, 2019.
    • Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen, Frederic Oblé, Gianluca Bontempi *Combining Unsupervised and Supervised...
  19. Z

    The Turku UAS DeepSeaSalama - GAN dataset 1 (TDSS-G1)

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 7, 2024
    Cite
    Auranen, Jani (2024). The Turku UAS DeepSeaSalama - GAN dataset 1 (TDSS-G1) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10714822
    Explore at:
    Dataset updated
    Jul 7, 2024
    Dataset provided by
    Asadi, Mehdi
    Turku University of Applied Sciences
    Auranen, Jani
    Majd, Amin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Turku
    Description

    The Turku UAS DeepSeaSalama-GAN dataset 1 (TDSS-G1) is a comprehensive image dataset obtained from a maritime environment. This dataset was assembled in the southwest Finnish archipelago area at Taalintehdas, using two stationary RGB fisheye cameras in the month of August 2022. The technical setup is described in the section “Sensor Platform design” in report “Development of Applied Research Platforms for Autonomous and Remotely Operated Systems” (https://www.theseus.fi/handle/10024/815628).

    The data collection and annotation process was carried out in the Autonomous and Intelligent Systems laboratory at Turku University of Applied Sciences. The dataset is a blend of original images captured by our cameras and synthetic data generated by a Generative Adversarial Network (GAN), simulating 18 distinct weather conditions.

    The TDSS-G1 dataset comprises 199 original images and a substantial addition of 3582 synthetic images, culminating in a total of 3781 annotated images. These images provide a diverse representation of various maritime objects, including motorboats, sailing boats, and seamarks.

    The creation of TDSS-G1 involved extracting images from videos recorded in MPEG format, with a resolution of 720p at 30 frames per second (FPS). An image was extracted every 100 milliseconds.

    The distribution of labels within TDSS-G1 is as follows: motorboats (62.1%), sailing boats (16.8%), and seamarks (21.1%).

    This distribution highlights a class imbalance, with motorboats being the most represented class and sailing boats the least. This imbalance is an important factor to consider during model training, as it could reduce the model's ability to recognize underrepresented classes accurately. In future synthetic datasets, vision Transformers will be used to tackle this problem.
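
    One common way to account for such an imbalance during training, distinct from the approach the authors plan, is to weight each class inversely to its frequency. The sketch below derives such weights from the label shares reported above using the "balanced" heuristic weight = 1 / (number of classes × class share); the variable names are assumptions for this example.

```python
# Label shares reported for TDSS-G1.
label_shares = {
    "motor_boat": 0.621,
    "sailing_boat": 0.168,
    "seamark": 0.211,
}

n_classes = len(label_shares)

# Inverse-frequency weights: rarer classes receive larger weights,
# and a perfectly balanced dataset would give every class a weight of 1.
class_weights = {
    name: 1.0 / (n_classes * share) for name, share in label_shares.items()
}

for name, weight in sorted(class_weights.items(), key=lambda item: item[1]):
    print(f"{name}: {weight:.3f}")
```

    With these shares, sailing boats receive the largest weight and motorboats the smallest, so errors on the rarest class cost the most in a weighted loss.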

    The TDSS-G1 dataset is organized into three distinct subsets for the purpose of training and evaluating machine learning models. These subsets are as follows:

    Training Set: Located in dataset/train/images, this set is used to train the model. It learns to recognize the different classes of maritime objects from this data.

    Validation Set: Stored in dataset/valid/images, this set is used to tune the model parameters and to prevent overfitting during the training process.

    Test Set: Found in dataset/test/images, this set is used to evaluate the final performance of the model. It provides an unbiased assessment of how the model will perform on unseen data.

    The dataset comprises three classes (nc: 3), each representing a different type of maritime object. The classes are as follows:

    Motor Boat (motor_boat)

    Sailing Boat (sailing_boat)

    Seamark (seamark)

    These labels correspond to the annotated objects in the images. The model trained on this dataset will be capable of identifying these three types of maritime objects. As mentioned earlier, the distribution of these classes is imbalanced, which is an important factor to consider during the training process.

  20. o

    Data from: Financial Fraud Detection Dataset

    • opendatabay.com
    Updated Jun 25, 2025
    Cite
    Review Nexus (2025). Financial Fraud Detection Dataset [Dataset]. https://www.opendatabay.com/data/financial/d226c56e-5929-4059-a30d-13632e07b344
    Explore at:
    Dataset updated
    Jun 25, 2025
    Dataset authored and provided by
    Review Nexus
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Area covered
    Fraud Detection & Risk Management
    Description

    This dataset is designed to support research and model development in the area of fraud detection. It consists of real-world credit card transactions made by European cardholders over a two-day period in September 2013. Out of 284,807 transactions, 492 are labeled as fraudulent (positive class), making this a highly imbalanced classification problem.

    Performance Note:

    Due to the extreme class imbalance, standard accuracy metrics are not informative. We recommend using the Area Under the Precision-Recall Curve (AUPRC) or F1-score for model evaluation.
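
    A short worked example shows why accuracy is uninformative here. The sketch below scores a trivial classifier that labels every transaction as legitimate, using the counts from this dataset; the function name is an assumption for this example.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Accuracy and F1-score (for the positive class) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1


total_records = 284_807
fraud_cases = 492

# A classifier that always predicts "legitimate" misses every fraud:
# no true or false positives, all frauds are false negatives.
baseline_accuracy, baseline_f1 = accuracy_and_f1(
    tp=0, fp=0, fn=fraud_cases, tn=total_records - fraud_cases
)
```

    The baseline reaches an accuracy above 99.8% while its F1-score is 0, since it never detects a single fraudulent transaction.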

    Features:

    • Time Series Data: Each row represents a transaction, with the Time feature indicating the number of seconds elapsed since the first transaction.
    • Dimensionality Reduction Applied: Features V1 through V28 are anonymized principal components derived from a PCA transformation due to confidentiality constraints.
    • Raw Transaction Amount: The Amount field reflects the transaction value, useful for cost-sensitive modeling.
    • Binary Classification Target: The Class label is 1 for fraud and 0 for legitimate transactions.

    Usage:

    • Machine learning model training for fraud detection.
    • Evaluation of anomaly detection and imbalanced classification methods.
    • Development of cost-sensitive learning approaches using the Amount variable.

    Data Summary:

    • Total Records: 284,807
    • Fraud Cases: 492
    • Imbalance Ratio: Fraudulent transactions account for just 0.172% of the dataset.
    • Columns: 31 total (28 PCA features, plus Time, Amount, and Class)

    License:

    The dataset is provided under the CC0 (Public Domain) license, allowing users to freely use, modify, and distribute the data without any restrictions.

    Acknowledgements

    The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available on https://www.researchgate.net/project/Fraud-detection-5 and the page of the DefeatFraud project

    Please cite the following works:

    Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015

    Dal Pozzolo, Andrea; Caelen, Olivier; Le Borgne, Yann-Ael; Waterschoot, Serge; Bontempi, Gianluca. Learned lessons in credit card fraud detection from a practitioner perspective, Expert systems with applications,41,10,4915-4928,2014, Pergamon

    Dal Pozzolo, Andrea; Boracchi, Giacomo; Caelen, Olivier; Alippi, Cesare; Bontempi, Gianluca. Credit card fraud detection: a realistic modeling and a novel learning strategy, IEEE transactions on neural networks and learning systems,29,8,3784-3797,2018,IEEE

    Dal Pozzolo, Andrea Adaptive Machine learning for credit card fraud detection ULB MLG PhD thesis (supervised by G. Bontempi)

    Carcillo, Fabrizio; Dal Pozzolo, Andrea; Le Borgne, Yann-Aël; Caelen, Olivier; Mazzer, Yannis; Bontempi, Gianluca. Scarff: a scalable framework for streaming credit card fraud detection with Spark, Information fusion,41, 182-194,2018,Elsevier

    Carcillo, Fabrizio; Le Borgne, Yann-Aël; Caelen, Olivier; Bontempi, Gianluca. Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization, International Journal of Data Science and Analytics, 5,4,285-300,2018,Springer International Publishing

    Bertrand Lebichot, Yann-Aël Le Borgne, Liyun He, Frederic Oblé, Gianluca Bontempi Deep-Learning Domain Adaptation Techniques for Credit Cards Fraud Detection, INNSBDDL 2019: Recent Advances in Big Data and Deep Learning, pp 78-88, 2019

    Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen, Frederic Oblé, Gianluca Bontempi Combining Unsupervised and Supervised Learning in Credit Card Fraud Detection Information Sciences, 2019

    Yann-Aël Le Borgne, Gianluca Bontempi Reproducible machine Learning for Credit Card Fraud Detection - Practical Handbook

    Bertrand Lebichot, Gianmarco Paldino, Wissam Siblini, Liyun He, Frederic Oblé, Gianluca Bontempi Incremental learning strategies for credit cards fraud detection, International Journal of Data Science and Analytics
