77 datasets found
  1. Data from: IsolationForest

    • kaggle.com
    zip
    Updated Feb 2, 2022
    Cite
    Lionel Bottan (2022). IsolationForest [Dataset]. https://www.kaggle.com/lionelbottan/isolationforest
    Available download formats: zip (904242 bytes)
    Dataset updated
    Feb 2, 2022
    Authors
    Lionel Bottan
    Description

    Dataset

    This dataset was created by Lionel Bottan

    Contents

  2. Isolation Forest - Dataset - LDM

    • service.tib.eu
    Updated Dec 2, 2024
    Cite
    (2024). Isolation Forest - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/isolation-forest
    Dataset updated
    Dec 2, 2024
    Description

    Isolation Forest

  3. Comparative analysis with unsupervised anomaly detection algorithms.

    • plos.figshare.com
    xls
    Updated Jun 8, 2023
    Cite
    Kenichiro Nagata; Toshikazu Tsuji; Kimitaka Suetsugu; Kayoko Muraoka; Hiroyuki Watanabe; Akiko Kanaya; Nobuaki Egashira; Ichiro Ieiri (2023). Comparative analysis with unsupervised anomaly detection algorithms. [Dataset]. http://doi.org/10.1371/journal.pone.0260315.t005
    Available download formats: xls
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Kenichiro Nagata; Toshikazu Tsuji; Kimitaka Suetsugu; Kayoko Muraoka; Hiroyuki Watanabe; Akiko Kanaya; Nobuaki Egashira; Ichiro Ieiri
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Comparative analysis with unsupervised anomaly detection algorithms.

  4. SAP FI Anomaly Detection - Prepared Data & Models

    • kaggle.com
    zip
    Updated Apr 30, 2025
    Cite
    aidsmlProjects (2025). SAP FI Anomaly Detection - Prepared Data & Models [Dataset]. https://www.kaggle.com/datasets/aidsmlprojects/sap-fi-anomaly-detection-prepared-data-and-models
    Available download formats: zip (9285 bytes)
    Dataset updated
    Apr 30, 2025
    Authors
    aidsmlProjects
    Description

    Intelligent SAP Financial Integrity Monitor

    Project Status: Proof-of-Concept (POC) - Capstone Project

    Overview

    This project demonstrates a proof-of-concept system for detecting financial document anomalies within core SAP FI/CO data, specifically leveraging the New General Ledger table (FAGLFLEXA) and document headers (BKPF). It addresses the challenge that standard SAP reporting and rule-based checks often struggle to identify subtle, complex, or novel irregularities in high-volume financial postings.

    The solution employs a Hybrid Anomaly Detection strategy, combining unsupervised Machine Learning models with expert-defined SAP business rules. Findings are prioritized using a multi-faceted scoring system and presented via an interactive dashboard built with Streamlit for efficient investigation.

    This project was developed as a capstone, showcasing the application of AI/ML techniques to enhance financial controls within an SAP context, bridging deep SAP domain knowledge with modern data science practices.

    Author: Anitha R (https://www.linkedin.com/in/anithaswamy)

    Dataset Origin: Kaggle SAP Dataset by Sunitha Siva. License: Other (specified in description); no further description available.

    Motivation

    Financial integrity is critical. Undetected anomalies in SAP FI/CO postings can lead to:

    • Inaccurate financial reporting
    • Significant reconciliation efforts
    • Potential audit failures or compliance issues
    • Masking of operational errors or fraud

    Standard SAP tools may not catch all types of anomalies, especially complex or novel patterns. This project explores how AI/ML can augment traditional methods to provide more robust and efficient financial monitoring.

    Key Features

    • Data Cleansing & Preparation: Rigorous process to handle common SAP data extract issues (duplicates, financial imbalance), prioritizing FAGLFLEXA for reliability.
    • Exploratory Data Analysis (EDA): Uncovered baseline patterns in posting times, user activity, amounts, and process context.
    • Feature Engineering: Created 16 context-aware features (FE_...) to quantify potential deviations from normalcy based on EDA and SAP knowledge.
    • Hybrid Anomaly Detection:
      • Ensemble ML: Utilized unsupervised models: Isolation Forest (IF), Local Outlier Factor (LOF) (via Scikit-learn), and an Autoencoder (AE) (via TensorFlow/Keras).
      • Expert Rules (HRFs): Implemented highly customizable High-Risk Flags based on percentile thresholds and SAP logic (e.g., weekend posting, missing cost center).
    • Multi-Faceted Prioritization: Combined ML model consensus (Model_Anomaly_Count) and HRF counts (HRF_Count) into a Priority_Tier for focusing investigation efforts.
    • Contextual Anomaly Reason: Generated a Review_Focus text description summarizing why an item was flagged.
    • Interactive Dashboard (Streamlit):
      • File upload for anomaly/feature data.
      • Overview KPIs (including multi-currency "Value at Risk by CoCode").
      • Comprehensive filtering capabilities.
      • Dynamic visualizations (User/Doc Type/HRF frequency, Time Trends).
      • Interactive AgGrid table for anomaly list investigation.
      • Detailed drill-down view for selected anomalies.

    Methodology Overview

    The project followed a structured approach:

    1. Phase 1: Data Quality Assessment & Preparation: Cleaned and validated raw BKPF and FAGLFLEXA data extracts. Discarded BSEG due to imbalances. Removed duplicates.
    2. Phase 2: Exploratory Data Analysis & Feature Engineering: Analyzed cleaned data patterns and engineered 16 features quantifying anomaly indicators. Resulted in sap_engineered_features.csv.
    3. Phase 3: Baseline Anomaly Detection & Evaluation: Scaled features, applied IF and LOF models, evaluated initial results.
    4. Phase 4: Advanced Modeling & Prioritization: Trained Autoencoder model, combined all model outputs and HRFs, implemented prioritization logic, generated context, and created the final anomaly list.
    5. Phase 5: UI Development: Built the Streamlit dashboard for interactive analysis and investigation.
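    The model-consensus step in Phases 3 and 4 can be sketched roughly as follows. This is an illustrative sketch on synthetic data, not the project's actual code: the feature matrix, model settings, and tier thresholds are assumptions standing in for the real FE_ features and Priority_Tier logic.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))   # stand-in for the 16 engineered FE_ features
X[:5] += 6                       # a few clearly deviant postings

X_scaled = StandardScaler().fit_transform(X)

# Each unsupervised model votes: -1 = anomaly, 1 = normal (scikit-learn convention)
if_votes = IsolationForest(random_state=0).fit_predict(X_scaled)
lof_votes = LocalOutlierFactor(n_neighbors=20).fit_predict(X_scaled)

# Consensus count across models (an Autoencoder vote would be added the same way)
model_anomaly_count = (if_votes == -1).astype(int) + (lof_votes == -1).astype(int)

# Illustrative tiering: items flagged by every model get top priority
priority_tier = np.where(model_anomaly_count == 2, "High",
                np.where(model_anomaly_count == 1, "Medium", "Low"))
```

    In the project itself, the HRF_Count from the expert rules is combined with this model consensus before the final Priority_Tier is assigned.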

    (For detailed methodology, refer to Comprehensive_Project_Report.pdf in the /docs folder, if included.)

    Technology Stack

    • Core Language: Python 3.x
    • Data Manipulation & Analysis: Pandas, NumPy
    • Machine Learning: Scikit-learn (IsolationForest, LocalOutlierFactor, StandardScaler), TensorFlow/Keras (Autoencoder)
    • Visualization: Matplotlib, Seaborn, Plotly Express
    • Dashboard: Streamlit, streamlit-aggrid
    • Utilities: Joblib (for saving scaler)

    Libraries:

    • Model/Scaler Saving: joblib==1.4.2
    • Data I/O Efficiency (optional, but good practice if used): pyarrow==19.0.1
    • Machine L...

  5. Feature importance calculated by Random Forest classifier considering the 80...

    • plos.figshare.com
    xls
    Updated Jun 21, 2023
    Cite
    Marinara Marcato; Salvatore Tedesco; Conor O’Mahony; Brendan O’Flynn; Paul Galvin (2023). Feature importance calculated by Random Forest classifier considering the 80 features previously selected by Select K Best. [Dataset]. http://doi.org/10.1371/journal.pone.0286311.t010
    Available download formats: xls
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Marinara Marcato; Salvatore Tedesco; Conor O’Mahony; Brendan O’Flynn; Paul Galvin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Feature importance calculated by Random Forest classifier considering the 80 features previously selected by Select K Best.

  6. Data from: Boundary peeling: An outlier detection method

    • tandf.figshare.com
    pdf
    Updated Oct 1, 2025
    Cite
    Sheikh Arafat; Na Sun; Maria L. Weese; Waldyn G. Martinez (2025). Boundary peeling: An outlier detection method [Dataset]. http://doi.org/10.6084/m9.figshare.28776694.v1
    Available download formats: pdf
    Dataset updated
    Oct 1, 2025
    Dataset provided by
    Taylor & Francis
    Authors
    Sheikh Arafat; Na Sun; Maria L. Weese; Waldyn G. Martinez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Unsupervised outlier detection constitutes a crucial phase within data analysis and remains an open area of research. A good outlier detection algorithm should be computationally efficient, robust to tuning parameter selection, and perform consistently well across diverse underlying data distributions. We introduce Boundary Peeling, an unsupervised outlier detection algorithm. Boundary Peeling uses the average signed distance from iteratively peeled, flexible boundaries generated by one-class support vector machines to flag outliers. The method is similar to convex hull peeling but well suited for high-dimensional data and has flexibility to adapt to different distributions. Boundary Peeling has robust hyperparameter settings and, for increased flexibility, can be cast as an ensemble method. In unimodal and multimodal synthetic data simulations Boundary Peeling outperforms all state of the art methods when no outliers are present while maintaining comparable or superior performance in the presence of outliers. Boundary Peeling performs competitively or better in terms of correct classification, AUC, and processing time using semantically meaningful benchmark datasets.
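    The mechanism described above can be loosely sketched with scikit-learn's OneClassSVM. This is an illustration of the idea only, not the authors' implementation: the number of peeling rounds, the nu parameter, and the peeling rule are all assumptions.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
# 200 inliers near the origin plus 5 planted outliers far away
X = np.vstack([rng.normal(size=(200, 2)), rng.normal(8.0, 0.5, size=(5, 2))])

scores = np.zeros(len(X))
rounds = 0
active = np.arange(len(X))          # indices still inside the current boundary
for _ in range(3):                  # a few peeling iterations (assumed count)
    if len(active) < 10:
        break
    svm = OneClassSVM(gamma="scale", nu=0.1).fit(X[active])
    scores += svm.decision_function(X)      # signed distance to this boundary
    rounds += 1
    # peel: keep only points strictly inside the fitted boundary
    active = active[svm.decision_function(X[active]) > 0]

# average signed distance over successive boundaries; lower values suggest outliers
avg_signed_distance = scores / rounds
```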

  7. Cybersecurity 🪪 Intrusion 🦠 Detection Dataset

    • kaggle.com
    Updated Feb 10, 2025
    Cite
    Dinesh Naveen Kumar Samudrala (2025). Cybersecurity 🪪 Intrusion 🦠 Detection Dataset [Dataset]. https://www.kaggle.com/datasets/dnkumars/cybersecurity-intrusion-detection-dataset
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 10, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Dinesh Naveen Kumar Samudrala
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This Cybersecurity Intrusion Detection Dataset is designed for detecting cyber intrusions based on network traffic and user behavior. Below, I’ll explain each aspect in detail, including the dataset structure, feature importance, possible analysis approaches, and how it can be used for machine learning.

    1. Understanding the Features

    The dataset consists of network-based and user behavior-based features. Each feature provides valuable information about potential cyber threats.

    A. Network-Based Features

    These features describe network-level information such as packet size, protocol type, and encryption methods.

    1. network_packet_size (Packet Size in Bytes)

      • Represents the size of network packets, ranging between 64 to 1500 bytes.
      • Packets on the lower end (~64 bytes) may indicate control messages, while larger packets (~1500 bytes) often carry bulk data.
      • Attackers may use abnormally small or large packets for reconnaissance or exploitation attempts.
    2. protocol_type (Communication Protocol)

      • The protocol used in the session: TCP, UDP, or ICMP.
      • TCP (Transmission Control Protocol): Reliable, connection-oriented (common for HTTP, HTTPS, SSH).
      • UDP (User Datagram Protocol): Faster but less reliable (used for VoIP, streaming).
      • ICMP (Internet Control Message Protocol): Used for network diagnostics (ping); often abused in Denial-of-Service (DoS) attacks.
    3. encryption_used (Encryption Protocol)

      • Values: AES, DES, None.
      • AES (Advanced Encryption Standard): Strong encryption, commonly used.
      • DES (Data Encryption Standard): Older encryption, weaker security.
      • None: Indicates unencrypted communication, which can be risky.
      • Attackers might use no encryption to avoid detection or weak encryption to exploit vulnerabilities.

    B. User Behavior-Based Features

    These features track user activities, such as login attempts and session duration.

    1. login_attempts (Number of Logins)

      • High values might indicate brute-force attacks (repeated login attempts).
      • Typical users have 1–3 login attempts, while an attack may have hundreds or thousands.
    2. session_duration (Session Length in Seconds)

      • A very long session might indicate unauthorized access or persistence by an attacker.
      • Attackers may try to stay connected to maintain access.
    3. failed_logins (Failed Login Attempts)

      • High failed login counts indicate credential stuffing or dictionary attacks.
      • Many failed attempts followed by a successful login could suggest an account was compromised.
    4. unusual_time_access (Login Time Anomaly)

      • A binary flag (0 or 1) indicating whether access happened at an unusual time.
      • Attackers often operate outside normal business hours to evade detection.
    5. ip_reputation_score (Trustworthiness of IP Address)

      • A score from 0 to 1, where higher values indicate suspicious activity.
      • IP addresses associated with botnets, spam, or previous attacks tend to have higher scores.
    6. browser_type (User’s Browser)

      • Common browsers: Chrome, Firefox, Edge, Safari.
      • Unknown: Could be an indicator of automated scripts or bots.

    2. Target Variable (attack_detected)

    • Binary classification: 1 means an attack was detected, 0 means normal activity.
    • The dataset is useful for supervised machine learning, where a model learns from labeled attack patterns.

    3. Possible Use Cases

    This dataset can be used for intrusion detection systems (IDS) and cybersecurity research. Some key applications include:

    A. Machine Learning-Based Intrusion Detection

    1. Supervised Learning Approaches

      • Classification Models (Logistic Regression, Decision Trees, Random Forest, XGBoost, SVM)
      • Train the model using labeled data (attack_detected as the target).
      • Evaluate using accuracy, precision, recall, F1-score.
    2. Deep Learning Approaches

      • Use Neural Networks (DNN, LSTM, CNN) for pattern recognition.
      • LSTMs work well for time-series-based network traffic analysis.

    B. Anomaly Detection (Unsupervised Learning)

    If attack labels are missing, anomaly detection can be used:

    • Autoencoders: Learn normal traffic and flag anomalies.
    • Isolation Forest: Detects outliers based on feature isolation.
    • One-Class SVM: Learns normal behavior and detects deviations.
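    A minimal sketch of the Isolation Forest option, on synthetic stand-ins for two of the features above (login_attempts and session_duration); the distributions and the contamination value are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(2)
# synthetic stand-ins: normal sessions vs. brute-force-like sessions
normal = np.column_stack([rng.poisson(2, 950), rng.exponential(300.0, 950)])
attacks = np.column_stack([rng.poisson(40, 50), rng.exponential(5000.0, 50)])
X = np.vstack([normal, attacks]).astype(float)

# contamination = assumed share of intrusions in the traffic
iso = IsolationForest(contamination=0.05, random_state=0).fit(X)
flags = iso.predict(X)              # -1 = anomaly, 1 = normal
anomaly_rate = float((flags == -1).mean())
```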

    C. Rule-Based Detection

    • If certain thresholds are met (e.g., failed_logins > 10 & ip_reputation_score > 0.8), an alert is triggered.
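    The threshold rule above translates directly into a vectorized check, sketched here on a few invented rows:

```python
import pandas as pd

# toy rows using the dataset's feature names (values are illustrative)
df = pd.DataFrame({
    "failed_logins":       [0, 2, 14, 25],
    "ip_reputation_score": [0.1, 0.9, 0.85, 0.95],
})

# alert when both thresholds are exceeded, as in the rule above
df["alert"] = (df["failed_logins"] > 10) & (df["ip_reputation_score"] > 0.8)
# df["alert"]: [False, False, True, True]
```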

    4. Challenges & Considerations

    • Adversarial Attacks: Attackers may modify traffic to evade detection.
    • Concept Drift: Cyber threats...
  8. SKAB - Skoltech Anomaly Benchmark

    • kaggle.com
    zip
    Updated Nov 28, 2020
    Cite
    Iurii Katser (2020). SKAB - Skoltech Anomaly Benchmark [Dataset]. https://www.kaggle.com/datasets/yuriykatser/skoltech-anomaly-benchmark-skab/code
    Available download formats: zip (1300142 bytes)
    Dataset updated
    Nov 28, 2020
    Authors
    Iurii Katser
    License

    GNU AGPL v3.0: http://www.gnu.org/licenses/agpl-3.0.html

    Description

    ❗️❗️❗️**The current version of SKAB (v0.9) contains 34 datasets with collective anomalies. But the upcoming update to v1.0 (probably up to the summer of 2021) will contain 300+ additional files with point and collective anomalies. It will make SKAB one of the largest changepoint-containing benchmarks, especially in the technical field.**


    We propose the Skoltech Anomaly Benchmark (SKAB), designed for evaluating anomaly detection algorithms. SKAB supports two main problems (there are two markups for anomalies):

    • Outlier detection (anomalies considered and marked up as single-point anomalies)
    • Changepoint detection (anomalies considered and marked up as collective anomalies)

    SKAB consists of the following artifacts:

    • Datasets
    • Leaderboard (scoreboard)
    • Python modules for algorithms' evaluation
    • Notebooks: Python notebooks with anomaly detection algorithms

    The IIoT testbed is located at the Skolkovo Institute of Science and Technology (Skoltech). All details regarding the testbed and the experimental process are presented in the following artifacts:

    • Position paper (currently submitted for publication)
    • Slides about the project

    Datasets

    The SKAB v0.9 corpus contains 35 individual data files in .csv format. Each file represents a single experiment and contains a single anomaly. The dataset is a multivariate time series collected from the sensors installed on the testbed. The data folder contains the datasets from the benchmark; its layout is described in the structure file. Columns in each data file are as follows:

    • datetime - Date and time when the value was written to the database (YYYY-MM-DD hh:mm:ss)
    • Accelerometer1RMS - Vibration acceleration (g units)
    • Accelerometer2RMS - Vibration acceleration (g units)
    • Current - Amperage on the electric motor (Ampere)
    • Pressure - Pressure in the loop after the water pump (Bar)
    • Temperature - Temperature of the engine body (degrees Celsius)
    • Thermocouple - Temperature of the fluid in the circulation loop (degrees Celsius)
    • Voltage - Voltage on the electric motor (Volt)
    • RateRMS - Circulation flow rate of the fluid inside the loop (litres per minute)
    • anomaly - Whether the point is anomalous (0 or 1)
    • changepoint - Whether the point is a changepoint for collective anomalies (0 or 1)
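    A file with the columns above can be loaded and its two markups summed as sketched below. The two inline rows are invented stand-ins; real SKAB files should be read from the data folder instead.

```python
import io
import pandas as pd

# a two-row stand-in for one SKAB experiment file (column set as listed above)
raw = io.StringIO(
    "datetime,Accelerometer1RMS,Accelerometer2RMS,Current,Pressure,"
    "Temperature,Thermocouple,Voltage,RateRMS,anomaly,changepoint\n"
    "2020-03-09 12:00:00,0.027,0.040,1.33,0.41,89.0,25.7,219.0,32.0,0,0\n"
    "2020-03-09 12:00:01,0.120,0.180,1.90,0.38,89.2,25.8,218.5,31.0,1,1\n"
)
df = pd.read_csv(raw, parse_dates=["datetime"]).set_index("datetime")

# the two anomaly markups: single-point outliers and collective changepoints
n_anomalies = int(df["anomaly"].sum())
n_changepoints = int(df["changepoint"].sum())
```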

    Leaderboard (Scoreboard)

    Here we propose the leaderboard for SKAB v0.9 both for outlier and changepoint detection problems. You can also present and evaluate your algorithm using SKAB on kaggle. The results in the tables are calculated in the python notebooks from the notebooks folder.

    Outlier detection problem

    Sorted by F1 (for F1, bigger is better; for both FAR and MAR, less is better).

    | Algorithm | F1 | FAR, % | MAR, % |
    |---|---|---|---|
    | Perfect detector | 1 | 0 | 0 |
    | T-squared+Q (PCA) | 0.67 | 13.95 | 36.32 |
    | LSTM | 0.64 | 15.4 | 39.93 |
    | MSCRED | 0.64 | 13.56 | 41.16 |
    | T-squared | 0.56 | 12.14 | 52.56 |
    | Autoencoder | 0.45 | 7.56 | 66.57 |
    | Isolation forest | 0.4 | 6.86 | 72.09 |
    | Null detector | 0 | 0 | 100 |

    Changepoint detection problem

    Sorted by NAB (standard); for all metrics, bigger is better.

    | Algorithm | NAB (standard) | NAB (lowFP) | NAB (lowFN) |
    |---|---|---|---|
    | Perfect detector | 100 | 100 | 100 |
    | Isolation forest | 37.53 | 17.09 | 45.02 |
    | MSCRED | 28.74 | 23.43 | 31.21 |
    | LSTM | 27.09 | 11.06 | 32.68 |
    | T-squared+Q (PCA) | 26.71 | 22.42 | 28.32 |
    | T-squared | 17.87 | 3.44 | 23.2 |
    | ArimaFD | 16.06 | 14.03 | 17.12 |
    | Autoencoder | 15.59 | 0.78 | 20.91 |
    | Null detector | 0 | 0 | 0 |

    Notebooks

    The notebooks folder contains Python notebooks with the code for reproducing the proposed leaderboard results.

    We have calculated the results for five common anomaly detection algorithms:

    • Hotelling's T-squared statistics
    • Hotelling's T-squared statistics + Q statistics based on PCA
    • Isolation forest
    • LSTM-based NN
    • Feed-Forward Autoencoder

    Additionally, results for the following algorithms were added to the repository:

    • ArimaFD
    • MSCRED

    Citat...

  9. IsoFMiR: An unsupervised anomaly detection framework for biomarker discovery...

    • repod.icm.edu.pl
    zip
    Updated Oct 14, 2025
    Cite
    Dey, Mritunjoy (2025). IsoFMiR: An unsupervised anomaly detection framework for biomarker discovery in rare cancers [Dataset]. http://doi.org/10.18150/UAKCCS
    Available download formats: zip (11578)
    Dataset updated
    Oct 14, 2025
    Dataset provided by
    RepOD
    Authors
    Dey, Mritunjoy
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Dataset funded by
    Maria Sklodowska-Curie National Research Institute of Oncology
    Description

  10. Systematic review of validation of supervised machine learning models in...

    • search.dataone.org
    Updated Jun 25, 2025
    Cite
    Oakleigh Wilson (2025). Systematic review of validation of supervised machine learning models in accelerometer-based animal behaviour classification literature [Dataset]. http://doi.org/10.5061/dryad.fxpnvx14d
    Dataset updated
    Jun 25, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Oakleigh Wilson
    Description

    Supervised machine learning has been used to detect fine-scale animal behaviour from accelerometer data, but a standardised protocol for implementing this workflow is currently lacking. As the application of machine learning to ecological problems expands, it is essential to establish technical protocols and validation standards that align with those in other "big data" fields. Overfitting is a prevalent and often misunderstood challenge in machine learning. Overfit models overly adapt to the training data, memorising specific instances rather than discerning the underlying signal. Associated results can indicate high performance on the training set, yet these models are unlikely to generalise to new data. Overfitting can be detected through rigorous validation using independent test sets. Our systematic review of 119 studies using accelerometer-based supervised machine learning to classify animal behaviour reveals that 79% (94 papers) did not validate their models sufficiently wel...

    We defined eligibility criteria as 'peer-reviewed primary research papers published 2013-present that use supervised machine learning to identify specific behaviours from raw, non-livestock animal accelerometer data'. We elected to ignore analysis of livestock behaviour, as agricultural methods often operate within different constraints to the analyses conducted on wild animals, and this body of literature has mostly developed in isolation from wild animal research. Our search was conducted on 27/09/2024. An initial keyword search across 3 databases (Google Scholar, PubMed, and Scopus) yielded 249 unique papers. Papers outside of the search criteria (including hardware and software advances, non-ML analysis, insufficient accelerometry application (e.g., research focused on other sensors with accelerometry providing minimal support), unsupervised methods, and research limited to activity intensity or active and inactive states) were excluded, resulting in 119 papers.

    Systematic review of validation of supervised machine learning models in accelerometer-based animal behaviour classification literature

    https://doi.org/10.5061/dryad.fxpnvx14d

    Description of the data and file structure

    Files and variables

    File: Systematic_Review_Supplementary.xlsx

    Description: Methods information from animal accelerometer-based behaviour classification literature utilising supervised machine learning techniques.

    Variables

    • Citation: Citation information for paper
    • Title: Extracted title from citation information
    • Year: Year of publication
    • ModelCategory: General category of the supervised machine learning model used (e.g., all Support Vector Machines are listed as SVM)
      • DT — Decision Tree
      • EM — Expectation Maximisation
      • Ensemble — Ensemble methods (e.g., boosting, bagging)
      • HMM — Hidden Markov Model
      • Isolation Forest — Anomaly detection using Isolation Forest ...,
  11. Credit Card Fraud Dataset

    • kaggle.com
    zip
    Updated Sep 11, 2025
    Cite
    Waqas Ishtiaq (2025). Credit Card Fraud Dataset [Dataset]. https://www.kaggle.com/datasets/waqasishtiaq/credit-card-fraud-dataset
    Available download formats: zip (69155672 bytes)
    Dataset updated
    Sep 11, 2025
    Authors
    Waqas Ishtiaq
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset, commonly known as creditcard.csv, contains anonymized credit card transactions made by European cardholders in September 2013. It includes 284,807 transactions, with 492 labeled as fraudulent. Due to confidentiality constraints, features have been transformed using PCA, except for 'Time' and 'Amount'.

    This dataset was used in the research article titled "A Hybrid Anomaly Detection Framework Combining Supervised and Unsupervised Learning for Credit Card Fraud Detection". The study proposes an ensemble model integrating techniques such as Autoencoders, Isolation Forest, Local Outlier Factor, and supervised classifiers including XGBoost and Random Forest, aiming to improve the detection of rare fraudulent patterns while maintaining efficiency and scalability.

    Key Features:

    • 30 numerical input features (V1–V28, Time, Amount)
    • Class label indicating fraud (1) or normal (0)
    • Imbalanced class distribution typical of real-world fraud detection

    Use Case: Ideal for benchmarking and evaluating anomaly detection and classification algorithms in highly imbalanced data scenarios.
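    With 492 frauds among 284,807 transactions, plain accuracy is uninformative, so precision and recall on the fraud class are the metrics to watch. The sketch below illustrates this on synthetic stand-in features; the generated data, the contamination value, and the degree of separability are assumptions, not properties of creditcard.csv.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(3)
# synthetic stand-ins for the 28 PCA components plus Amount (29 columns)
X_norm = rng.normal(0.0, 1.0, size=(2000, 29))
X_fraud = rng.normal(3.0, 1.0, size=(4, 29))   # ~0.2% positives, as in the data
X = np.vstack([X_norm, X_fraud])
y = np.r_[np.zeros(2000), np.ones(4)]

# contamination set to the assumed fraud prevalence (~0.2%)
iso = IsolationForest(contamination=0.002, random_state=0).fit(X)
pred = (iso.predict(X) == -1).astype(int)

# with ~99.8% negatives, accuracy is misleading; report fraud-class metrics
precision = precision_score(y, pred, zero_division=0)
recall = recall_score(y, pred)
```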

    Source: Originally published by the Machine Learning Group at Université Libre de Bruxelles.

  12. Feature importance calculated by Random Forest classifier considering the 80...

    • plos.figshare.com
    xls
    Updated Jun 21, 2023
    Cite
    Marinara Marcato; Salvatore Tedesco; Conor O’Mahony; Brendan O’Flynn; Paul Galvin (2023). Feature importance calculated by Random Forest classifier considering the 80 features selected by Select K Best by domain. [Dataset]. http://doi.org/10.1371/journal.pone.0286311.t011
    Available download formats: xls
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Marinara Marcato; Salvatore Tedesco; Conor O’Mahony; Brendan O’Flynn; Paul Galvin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Feature importance calculated by Random Forest classifier considering the 80 features selected by Select K Best by domain.

  13. Fraud Guard Synthetic 2025

    • kaggle.com
    zip
    Updated Sep 26, 2025
    Cite
    Imaad Mahmood (2025). Fraud Guard Synthetic 2025 [Dataset]. https://www.kaggle.com/datasets/imaadmahmood/fraud-guard-synthetic-2025
    Available download formats: zip (1586 bytes)
    Dataset updated
    Sep 26, 2025
    Authors
    Imaad Mahmood
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    FraudGuardSynthetic2025 Dataset

    Overview

    The FraudGuardSynthetic2025 dataset is a synthetically generated dataset designed for machine learning and data analysis tasks focused on detecting credit card fraud. Created in September 2025, it simulates realistic transaction profiles based on established fraud detection patterns, drawing inspiration from financial data studies and existing datasets. With 1,000 records, it features a highly imbalanced target (~0.2% fraud cases) to mirror real-world fraud prevalence, making it ideal for anomaly detection, classification, and preprocessing practice in educational and research settings.

    Data Description

    • Rows: 1,000
    • Columns: 11
    • Target Variable: is_fraud (binary: 0 = Non-fraud, 1 = Fraud)
    • File Format: CSV
    • Size: Approximately 50 KB

    Columns

    | Column Name | Type | Description |
    |---|---|---|
    | transaction_id | Integer | Unique identifier for each transaction (1 to 1,000). |
    | amount | Float | Transaction amount in USD (0.01 to 10,000, skewed toward smaller values). |
    | time | Integer | Seconds since first transaction (0 to 86,400, simulating one day). |
    | merchant_category | Categorical | Merchant type: Retail, Online, Travel, Food, Other. |
    | cardholder_age | Integer | Cardholder age in years (18 to 90, skewed toward 25-50). |
    | cardholder_zip | Integer | Cardholder ZIP code (10000 to 99999, synthetic US-based). |
    | distance | Float | Distance (km) between cardholder and merchant (0 to 500, mean ~50). |
    | is_online | Binary | Online transaction: 0 = In-person, 1 = Online (~30% online). |
    | card_type | Categorical | Card type: Credit, Debit (~60% credit). |
    | transaction_hour | Integer | Hour of transaction (0 to 23, skewed toward daytime). |
    | is_fraud | Binary | Target variable: 0 = Non-fraud, 1 = Fraud (~0.2% positive cases). |

    Key Features

    • Realistic Distributions: Reflects 2025 financial fraud patterns (e.g., low fraud prevalence, higher fraud in online/high-amount/unusual-location transactions) based on recent studies like FinGraphFL federated learning and GDPR-compliant synthetic data.
    • Synthetic Data: Generated to avoid privacy concerns, ensuring ethical use for research and education, inspired by 2025 generative ML techniques for imbalanced datasets.
    • Versatility: Suitable for anomaly detection (e.g., Isolation Forest), binary classification (e.g., XGBoost, Random Forest), and advanced DL (e.g., Deep Belief Networks).
    • No Missing Values: Clean dataset for straightforward analysis, though users can introduce missingness for practice.

    Use Cases

    • Machine Learning: Train models like Random Forest with SMOTE, Isolation Forest, or ensemble stacking for fraud detection.
    • Data Analysis: Explore correlations between features (e.g., amount, distance, is_online) and fraud outcomes.
    • Educational Projects: Ideal for learning EDA, feature engineering, and handling imbalanced data (e.g., SMOTE/ENN).
    • Financial Research: Simulate fraud scenarios for studying detection algorithms without real transaction data.

    Source and Inspiration

    This dataset is inspired by fraud detection patterns from 2025 financial literature (e.g., Federal Reserve, IEEE studies on Nigerian datasets) and existing datasets like Kaggle Credit Card Fraud Detection (2018, with 2021 simulator), Kartik2112 simulated transactions (2020), and new 2025 releases (e.g., Figshare creditcard.csv, Synthesized GDPR-compliant synthetic data). It incorporates trends like hybrid feature selection and explainable AI.

    Usage Notes

    • Preprocessing: Numerical features (amount, distance, time) require scaling; categorical features (merchant_category, card_type) need encoding (e.g., one-hot).
    • Imbalanced Data: The ~0.2% fraud prevalence requires techniques like SMOTE, ENN undersampling, or anomaly detection algorithms.
    • Scalability: Contact the creator to generate larger datasets (e.g., 10,000+ rows) if needed.
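    The scaling and encoding steps above can be sketched with plain pandas. The column names follow the schema described for this dataset; the row values below are invented for illustration and are not drawn from the actual data.

```python
import pandas as pd

# Toy rows using the documented schema (values are illustrative only)
df = pd.DataFrame({
    "amount": [12.5, 830.0, 54.2],
    "distance": [1.2, 420.7, 8.9],
    "transaction_hour": [9, 2, 14],
    "merchant_category": ["grocery", "travel", "grocery"],
    "card_type": ["Credit", "Debit", "Credit"],
    "is_online": [0, 1, 0],
})

# Standardize numerical features to zero mean / unit variance
num_cols = ["amount", "distance", "transaction_hour"]
df[num_cols] = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std()

# One-hot encode the categorical features
df = pd.get_dummies(df, columns=["merchant_category", "card_type"])
print(df.columns.tolist())
```

    In a real pipeline, scikit-learn's ColumnTransformer would let you fit these transforms on the training split only and reuse them at prediction time.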

    License

    This dataset is provided for educational and research purposes under a Creative Commons Attribution 4.0 International License (CC BY 4.0).

    Contact

    For questions or to request expanded datasets, contact the creator via the platform where this dataset is hosted.

  14. Imputed-VAE IIoT-2021

    • ieee-dataport.org
    Updated Nov 26, 2025
    Kowshik Balasubramanian (2025). Imputed-VAE IIoT-2021 [Dataset]. https://ieee-dataport.org/documents/imputed-vae-iiot-2021
    Explore at:
    Dataset updated
    Nov 26, 2025
    Authors
    Kowshik Balasubramanian
    Description

    neural network imputation

  15. Data from: Wine Quality

    • kaggle.com
    zip
    Updated Jul 14, 2024
    Abdelaziz Sami (2024). Wine Quality [Dataset]. https://www.kaggle.com/datasets/abdelazizsami/wine-quality
    Explore at:
    zip(99409 bytes)Available download formats
    Dataset updated
    Jul 14, 2024
    Authors
    Abdelaziz Sami
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Overview

    • Input Variables: physicochemical properties (e.g., pH, alcohol content, acidity).
    • Output Variable: sensory rating (quality), an ordered categorical score.

    Tasks

    • Classification or Regression: treat the output as a categorical variable (classification) or as a continuous score (regression).
    • Outlier Detection: identify outliers (e.g., exceptionally good or poor wines) using techniques like Isolation Forest or Local Outlier Factor (LOF).
    • Feature Selection: apply methods such as Recursive Feature Elimination (RFE), LASSO, or tree-based feature importance to identify the most relevant features.
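    The outlier-detection task can be sketched with scikit-learn on mock data. The dense cluster and the handful of extreme points below are synthetic stand-ins for ordinary and exceptional wines; both detectors use the same -1/1 labeling convention.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# Mock physicochemical features: a dense cluster plus a few extreme samples
X = np.vstack([rng.normal(0, 1, size=(200, 4)),
               rng.normal(8, 1, size=(5, 4))])

iso = IsolationForest(contamination=0.05, random_state=0).fit(X)
iso_labels = iso.predict(X)            # -1 = outlier, 1 = inlier

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
lof_labels = lof.fit_predict(X)        # same -1/1 convention

print("Isolation Forest flagged:", int((iso_labels == -1).sum()))
print("LOF flagged:", int((lof_labels == -1).sum()))
```

    The contamination parameter encodes how many outliers you expect; on real wine data it would be tuned, not assumed.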

    Suggested Analysis Steps

    Data Preprocessing:

    • Handle missing values if any.
    • Normalize or standardize input features for better model performance.

    Exploratory Data Analysis (EDA):

    • Visualize the distribution of quality ratings.
    • Use pair plots or correlation heatmaps to understand relationships between features.

    Modeling:

    • For classification: try models like Logistic Regression, Decision Trees, Random Forest, or Gradient Boosting.
    • For regression: use Linear Regression, SVR, or tree-based models like Random Forest Regressor.

    Evaluation:

    • Use metrics like accuracy, F1-score, or ROC-AUC for classification.
    • For regression, consider MAE, MSE, or R².

    Feature Importance:

    Analyze which features contribute the most to the predictions to aid in understanding the data.
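    A minimal sketch of tree-based feature importance on mock data: the feature names are assumed stand-ins for wine attributes, and the synthetic target is driven by the first feature by construction, so it should dominate the importance ranking.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
# Mock data: the target depends (by construction) only on the first feature
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.1 * rng.normal(size=300) > 0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ sums to 1 across features
for name, imp in zip(["alcohol", "pH", "acidity", "sulphates"],
                     clf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```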

  16. Credit Card Fraud Dataset

    • kaggle.com
    zip
    Updated Jun 22, 2024
    Dylan Moraes (2024). Credit Card Fraud Dataset [Dataset]. https://www.kaggle.com/datasets/dylanmoraes/credit-card-fraud-dataset/discussion
    Explore at:
    zip(186385507 bytes)Available download formats
    Dataset updated
    Jun 22, 2024
    Authors
    Dylan Moraes
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Overview

    This dataset contains synthetic credit card transaction data designed for fraud detection and machine learning research. With over 6.3 million transactions, it provides a realistic simulation of financial transaction patterns including both legitimate and fraudulent activities.

    Source

    This is a synthetic dataset generated to simulate credit card transaction behavior. The data represents financial transactions over a 30-day period (743 hours) with various transaction types including payments, transfers, cash-outs, debits, and cash-ins.

    Purpose

    The dataset is specifically designed for:

    • Training and testing fraud detection models
    • Anomaly detection research
    • Binary classification tasks
    • Imbalanced learning scenarios
    • Financial machine learning applications

    Column Descriptions

    • step: Maps a unit of time in the real world. 1 step represents 1 hour of time. Range: 1 to 743
    • type: Type of transaction (PAYMENT, TRANSFER, CASH_OUT, DEBIT, CASH_IN)
    • amount: Amount of the transaction in local currency
    • nameOrig: Customer ID who initiated the transaction
    • oldbalanceOrg: Initial balance before the transaction (origin account)
    • newbalanceOrig: New balance after the transaction (origin account)
    • nameDest: Recipient ID of the transaction
    • oldbalanceDest: Initial recipient balance before the transaction
    • newbalanceDest: New recipient balance after the transaction
    • isFraud: Binary flag indicating fraud (1 = fraud, 0 = legitimate)
    • isFlaggedFraud: Flag for illegal attempts to transfer more than 200,000 in a single transaction

    Dataset Statistics

    • Total Transactions: 6,362,620
    • Fraudulent Transactions: 8,213 (~0.13%)
    • Legitimate Transactions: 6,354,407 (~99.87%)
    • Time Period: 30 days (743 hours)
    • File Size: 493.53 MB

    Class Imbalance Note

    This dataset exhibits significant class imbalance, with only 0.13% fraudulent transactions. This mirrors real-world fraud detection scenarios, where fraudulent transactions are rare. Consider using techniques such as:

    • SMOTE (Synthetic Minority Over-sampling Technique)
    • Undersampling of the majority class
    • Cost-sensitive learning
    • Ensemble methods
    • Anomaly detection algorithms
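    The imbalanced-learn library provides the standard SMOTE implementation; purely to illustrate the core idea, here is a minimal numpy sketch that generates synthetic minority samples by interpolating toward random nearest neighbours.

```python
import numpy as np

def smote_like(X_minority, n_new, k=5, seed=0):
    """Minimal sketch of SMOTE's core idea: interpolate each sampled
    minority point toward one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))
        # distances from sample i to all minority samples
        d = np.linalg.norm(X_minority - X_minority[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # skip self at index 0
        j = rng.choice(neighbours)
        gap = rng.random()                    # random point on the segment
        synthetic.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.array(synthetic)

rng = np.random.default_rng(0)
X_fraud = rng.normal(5, 1, size=(20, 3))      # mock rare-class samples
X_new = smote_like(X_fraud, n_new=80)
print(X_new.shape)
```

    Production code should oversample only the training split, never the test set, to avoid leaking synthetic copies into evaluation.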

    Model Suitability

    This dataset is well-suited for:

    • Logistic Regression
    • Random Forest
    • Gradient Boosting (XGBoost, LightGBM, CatBoost)
    • Neural Networks
    • Isolation Forest
    • Autoencoders
    • Support Vector Machines

    Quick Start Example

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    # Load the dataset
    df = pd.read_csv('/kaggle/input/credit-card-fraud-dataset/Fraud.csv')
    
    # Display basic information
    print(df.info())
    print(df.head())
    
    # Check fraud distribution
    print(df['isFraud'].value_counts())
    
    # Visualize fraud distribution
    plt.figure(figsize=(8, 5))
    sns.countplot(data=df, x='isFraud')
    plt.title('Distribution of Fraud vs Legitimate Transactions')
    plt.xlabel('Is Fraud (0=No, 1=Yes)')
    plt.ylabel('Count')
    plt.show()
    
    # Transaction type distribution
    plt.figure(figsize=(10, 6))
    sns.countplot(data=df, x='type', hue='isFraud')
    plt.title('Transaction Types by Fraud Status')
    plt.xticks(rotation=45)
    plt.show()
    

    Usage Tips

    1. Handle Class Imbalance: Use appropriate sampling techniques or algorithms designed for imbalanced data
    2. Feature Engineering: Consider creating features like transaction velocity, time-based patterns, and balance differences
    3. Evaluation Metrics: Use precision, recall, F1-score, and AUC-ROC rather than accuracy due to class imbalance
    4. Cross-validation: Use stratified k-fold to maintain class distribution across folds
    5. Transaction Patterns: Analyze transaction types - TRANSFER and CASH_OUT are more associated with fraud
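    Tip 4 can be sketched as follows. The labels are mock data with a rare positive class (10 positives in 1,000 rows, standing in for the dataset's 0.13% fraud rate); stratified splitting keeps the positives spread evenly across folds.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.zeros(1000, dtype=int)
y[:10] = 1                      # rare positive class

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_pos = []
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    n_pos = int(y[test_idx].sum())
    fold_pos.append(n_pos)
    print(f"fold {fold}: {n_pos} positives out of {len(test_idx)} test rows")
```

    A plain KFold could leave some folds with no positives at all, making recall on those folds undefined; stratification avoids that.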

    Update Frequency

    This is a static dataset with no planned future updates. It serves as a benchmark for fraud detection research and model development.

    Acknowledgments

    This dataset is made available under the MIT License for educational and research purposes in the field of fraud detection and financial machine learning.

  17. Credit Card Fraud Detection

    • kaggle.com
    zip
    Updated Feb 28, 2025
    Tushar Bhadouria (2025). Credit Card Fraud Detection [Dataset]. https://www.kaggle.com/datasets/tusharbhadouria/credit-card-fraud-detection/code
    Explore at:
    zip(211766662 bytes)Available download formats
    Dataset updated
    Feb 28, 2025
    Authors
    Tushar Bhadouria
    Description

    📌 Overview This dataset provides a real-world representation of credit card transactions, labeled as fraudulent or legitimate. It is designed to aid in the development of machine learning models for fraud detection and financial security applications. Given the rising cases of online fraud, detecting suspicious transactions is crucial for financial institutions.

    This dataset allows users to experiment with various fraud detection techniques, such as supervised and unsupervised learning models, anomaly detection, and pattern recognition.

    📊 Dataset Details

    • Number of Transactions: 1,852,394
    • Number of Features: 23
    • Fraud Labels: is_fraud = 1 marks fraudulent transactions; is_fraud = 0 marks legitimate payments.

    📁 Columns Explained Transaction Information:

    • trans_date_trans_time – Timestamp of the transaction
    • cc_num – Unique (anonymized) credit card number
    • merchant – Merchant where the transaction occurred
    • category – Type of transaction (e.g., travel, food, personal care)
    • amt – Transaction amount

    Cardholder Details:

    • first, last – First and last name of the cardholder
    • gender – Gender of the cardholder
    • street, city, state, zip – Address of the cardholder
    • lat, long – Geographical location of the cardholder
    • city_pop – Population of the cardholder's city
    • job – Profession of the cardholder
    • dob – Date of birth of the cardholder

    Transaction Identifiers & Timing:

    • trans_num – Unique transaction identifier
    • unix_time – Transaction timestamp in Unix format

    Merchant Details:

    • merch_lat, merch_long – Merchant's location (latitude & longitude)

    Fraud Indicator:

    • is_fraud – Target variable (1 = Fraud, 0 = Legitimate)

    🎯 Usage

    This dataset is ideal for:

    ✅ Fraud detection research
    ✅ Machine learning model development
    ✅ Anomaly detection projects
    ✅ Financial analytics

    🛠️ Suggested Machine Learning Approaches

    Supervised Learning:

    • Logistic Regression
    • Decision Trees / Random Forest
    • XGBoost / LightGBM
    • Deep Learning (Neural Networks)

    Unsupervised Learning:

    • Autoencoders
    • Isolation Forest
    • DBSCAN for anomaly detection

    Feature Engineering Ideas:

    • Creating transaction frequency features
    • Aggregating spending behavior per merchant/category
    • Analyzing location-based fraud patterns
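    The frequency and aggregation ideas can be sketched with pandas groupby, using the documented cc_num, category, and amt columns. The toy rows below are illustrative, not drawn from the dataset.

```python
import pandas as pd

# Toy transactions using the documented columns (values are illustrative)
df = pd.DataFrame({
    "cc_num": [111, 111, 111, 222, 222],
    "category": ["travel", "food", "travel", "food", "food"],
    "amt": [120.0, 15.5, 300.0, 9.9, 12.1],
})

# Transaction frequency and average spend per card
freq = df.groupby("cc_num")["amt"].agg(tx_count="count", avg_amt="mean")

# Spending per card and category, useful for spotting unusual category mixes
per_cat = df.groupby(["cc_num", "category"])["amt"].sum().unstack(fill_value=0)
print(freq)
print(per_cat)
```

    On the full dataset, these per-card aggregates can be merged back onto each transaction as engineered features before model training.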

    ⚠️ Disclaimer This dataset has been anonymized and should be used strictly for research and educational purposes. It does not contain any real-world personal information, and the credit card numbers have been randomly generated for simulation purposes.

  18. Classification metrics per posture achieved using the best models selected...

    • plos.figshare.com
    xls
    Updated Jun 21, 2023
    Marinara Marcato; Salvatore Tedesco; Conor O’Mahony; Brendan O’Flynn; Paul Galvin (2023). Classification metrics per posture achieved using the best models selected by grid search in Classifier 1 on the test set. [Dataset]. http://doi.org/10.1371/journal.pone.0286311.t012
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Marinara Marcato; Salvatore Tedesco; Conor O’Mahony; Brendan O’Flynn; Paul Galvin
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Classification metrics per posture achieved using the best models selected by grid search in Classifier 1 on the test set.

  19. Data from: hvac-fault-detection

    • huggingface.co
    Updated Dec 1, 2025
    Shahab Salehi (2025). hvac-fault-detection [Dataset]. https://huggingface.co/datasets/shahabsalehi/hvac-fault-detection
    Explore at:
    Dataset updated
    Dec 1, 2025
    Authors
    Shahab Salehi
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    HVAC Fault Detection Dataset

    ⚠️ Synthetic Data Disclaimer: This dataset contains synthetically generated data for demonstration and testing purposes. It does not represent real equipment faults or actual building systems.

    Overview

    Anomaly detection results from HVAC equipment monitoring using Isolation Forest. This dataset includes detected faults, anomaly scores, and equipment status.

    Schema

    { "pipeline": "hvac_fault_detection_anomaly", "generated_at":… See the full description on the dataset page: https://huggingface.co/datasets/shahabsalehi/hvac-fault-detection.

  20. Data from: INNOVATIVE MACHINE LEARNING APPROACHES TO FOSTER FINANCIAL...

    • data-staging.niaid.nih.gov
    Updated Nov 6, 2024
    Md Shujan Shak; Aftab Uddin; Md Habibur Rahman; Nafis Anjum; Md Nad Vi Al Bony; Murshida Alam; Mohammad Helal; Afrina Khan; Pritom Das; Tamanna Pervin (2024). INNOVATIVE MACHINE LEARNING APPROACHES TO FOSTER FINANCIAL INCLUSION IN MICROFINANCE [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_14044728
    Explore at:
    Dataset updated
    Nov 6, 2024
    Dataset provided by
    Department of Business Administration, International American University, Los Angeles, CA
    Department of Business Administration, Westcliff University, Irvine, California, USA
    Master of Science in Information Technology, Washington University of Science and Technology, USA
    Department of Business Administration, International American University, Los Angeles, California, USA
    Master's in business administration, BRAC Business School, Dhaka Bangladesh
    Fox School of Business & Management, Temple University, USA
    Department of Management Science, Adelphi University, Garden city, New York, USA
    College of Computer Science, Pacific States University, Los Angeles, CA
    College of Technology and Engineering, Westcliff University, Irvine, CA
    Department of Business Administration, International American University, Los Angeles, California
    Authors
    Md Shujan Shak; Aftab Uddin; Md Habibur Rahman; Nafis Anjum; Md Nad Vi Al Bony; Murshida Alam; Mohammad Helal; Afrina Khan; Pritom Das; Tamanna Pervin
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This study examines the application of machine learning algorithms to enhance financial inclusion in microfinance, focusing on credit scoring, risk and fraud detection, and customer segmentation. We performed feature engineering and employed models such as Logistic Regression, Decision Trees, Random Forests, Gradient Boosting Machines (XGBoost and LightGBM), Support Vector Machines (SVM), Autoencoders, Isolation Forests, and K-means Clustering. LightGBM achieved the highest accuracy (89.6%) and AUC (0.92) in credit scoring, while Random Forests demonstrated strong performance in both loan approval (86.7% accuracy) and fraud detection (87.6% accuracy, AUC of 0.88). SVM also performed competitively, and unsupervised methods like Autoencoders and Isolation Forests showed potential for anomaly detection but required further refinement. K-means Clustering excelled in customer segmentation with a silhouette score of 0.72, enabling tailored services based on client demographics. Our findings highlight the significant impact of machine learning on improving credit scoring accuracy, reducing fraud risks, and enhancing customer service delivery in microfinance, thereby promoting financial inclusion for underserved populations. Ethical considerations and model interpretability are crucial, particularly for smaller institutions. This study advocates for the broader adoption of machine learning in the microfinance sector.
