77 datasets found
  1. Data from: IsolationForest

    • kaggle.com
    zip
    Updated Feb 2, 2022
    Cite
    Lionel Bottan (2022). IsolationForest [Dataset]. https://www.kaggle.com/lionelbottan/isolationforest
    Available download formats: zip (904242 bytes)
    Dataset updated
    Feb 2, 2022
    Authors
    Lionel Bottan
    Description

    Dataset

    This dataset was created by Lionel Bottan

    Contents

  2. Isolation Forest - Dataset - LDM

    • service.tib.eu
    Updated Dec 2, 2024
    Cite
    (2024). Isolation Forest - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/isolation-forest
    Dataset updated
    Dec 2, 2024
    Description

    Isolation Forest

  3. Comparative analysis with unsupervised anomaly detection algorithms.

    • plos.figshare.com
    xls
    Updated Jun 8, 2023
    Cite
    Kenichiro Nagata; Toshikazu Tsuji; Kimitaka Suetsugu; Kayoko Muraoka; Hiroyuki Watanabe; Akiko Kanaya; Nobuaki Egashira; Ichiro Ieiri (2023). Comparative analysis with unsupervised anomaly detection algorithms. [Dataset]. http://doi.org/10.1371/journal.pone.0260315.t005
    Available download formats: xls
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Kenichiro Nagata; Toshikazu Tsuji; Kimitaka Suetsugu; Kayoko Muraoka; Hiroyuki Watanabe; Akiko Kanaya; Nobuaki Egashira; Ichiro Ieiri
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Comparative analysis with unsupervised anomaly detection algorithms.

  4. SAP FI Anomaly Detection - Prepared Data & Models

    • kaggle.com
    zip
    Updated Apr 30, 2025
    Cite
    aidsmlProjects (2025). SAP FI Anomaly Detection - Prepared Data & Models [Dataset]. https://www.kaggle.com/datasets/aidsmlprojects/sap-fi-anomaly-detection-prepared-data-and-models
    Available download formats: zip (9285 bytes)
    Dataset updated
    Apr 30, 2025
    Authors
    aidsmlProjects
    Description

    Intelligent SAP Financial Integrity Monitor

    Project Status: Proof-of-Concept (POC) - Capstone Project

    Overview

    This project demonstrates a proof-of-concept system for detecting financial document anomalies within core SAP FI/CO data, specifically leveraging the New General Ledger table (FAGLFLEXA) and document headers (BKPF). It addresses the challenge that standard SAP reporting and rule-based checks often struggle to identify subtle, complex, or novel irregularities in high-volume financial postings.

    The solution employs a Hybrid Anomaly Detection strategy, combining unsupervised Machine Learning models with expert-defined SAP business rules. Findings are prioritized using a multi-faceted scoring system and presented via an interactive dashboard built with Streamlit for efficient investigation.

    This project was developed as a capstone, showcasing the application of AI/ML techniques to enhance financial controls within an SAP context, bridging deep SAP domain knowledge with modern data science practices.

    Author: Anitha R (https://www.linkedin.com/in/anithaswamy)

    Dataset Origin: Kaggle SAP Dataset by Sunitha Siva. License: Other (specified in description); no further description available.

    Motivation

    Financial integrity is critical. Undetected anomalies in SAP FI/CO postings can lead to:

    • Inaccurate financial reporting
    • Significant reconciliation efforts
    • Potential audit failures or compliance issues
    • Masking of operational errors or fraud

    Standard SAP tools may not catch all types of anomalies, especially complex or novel patterns. This project explores how AI/ML can augment traditional methods to provide more robust and efficient financial monitoring.

    Key Features

    • Data Cleansing & Preparation: Rigorous process to handle common SAP data extract issues (duplicates, financial imbalance), prioritizing FAGLFLEXA for reliability.
    • Exploratory Data Analysis (EDA): Uncovered baseline patterns in posting times, user activity, amounts, and process context.
    • Feature Engineering: Created 16 context-aware features (FE_...) to quantify potential deviations from normalcy based on EDA and SAP knowledge.
    • Hybrid Anomaly Detection:
      • Ensemble ML: Utilized unsupervised models: Isolation Forest (IF), Local Outlier Factor (LOF) (via Scikit-learn), and an Autoencoder (AE) (via TensorFlow/Keras).
      • Expert Rules (HRFs): Implemented highly customizable High-Risk Flags based on percentile thresholds and SAP logic (e.g., weekend posting, missing cost center).
    • Multi-Faceted Prioritization: Combined ML model consensus (Model_Anomaly_Count) and HRF counts (HRF_Count) into a Priority_Tier for focusing investigation efforts.
    • Contextual Anomaly Reason: Generated a Review_Focus text description summarizing why an item was flagged.
    • Interactive Dashboard (Streamlit):
      • File upload for anomaly/feature data.
      • Overview KPIs (including multi-currency "Value at Risk by CoCode").
      • Comprehensive filtering capabilities.
      • Dynamic visualizations (User/Doc Type/HRF frequency, Time Trends).
      • Interactive AgGrid table for anomaly list investigation.
      • Detailed drill-down view for selected anomalies.

    Methodology Overview

    The project followed a structured approach:

    1. Phase 1: Data Quality Assessment & Preparation: Cleaned and validated raw BKPF and FAGLFLEXA data extracts. Discarded BSEG due to imbalances. Removed duplicates.
    2. Phase 2: Exploratory Data Analysis & Feature Engineering: Analyzed cleaned data patterns and engineered 16 features quantifying anomaly indicators. Resulted in sap_engineered_features.csv.
    3. Phase 3: Baseline Anomaly Detection & Evaluation: Scaled features, applied IF and LOF models, evaluated initial results.
    4. Phase 4: Advanced Modeling & Prioritization: Trained Autoencoder model, combined all model outputs and HRFs, implemented prioritization logic, generated context, and created the final anomaly list.
    5. Phase 5: UI Development: Built the Streamlit dashboard for interactive analysis and investigation.
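    The model-consensus step in Phases 3 and 4 can be sketched roughly as follows. This is an illustrative sketch on synthetic data, not the project's actual code: the feature matrix, model settings, and tier thresholds are assumptions standing in for the real FE_ features and Priority_Tier logic.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))   # stand-in for the 16 engineered FE_ features
X[:5] += 6                       # a few clearly deviant postings

X_scaled = StandardScaler().fit_transform(X)

# Each unsupervised model votes: -1 = anomaly, 1 = normal (scikit-learn convention)
if_votes = IsolationForest(random_state=0).fit_predict(X_scaled)
lof_votes = LocalOutlierFactor(n_neighbors=20).fit_predict(X_scaled)

# Consensus count across models (an Autoencoder vote would be added the same way)
model_anomaly_count = (if_votes == -1).astype(int) + (lof_votes == -1).astype(int)

# Illustrative tiering: items flagged by every model get top priority
priority_tier = np.where(model_anomaly_count == 2, "High",
                np.where(model_anomaly_count == 1, "Medium", "Low"))
```

    In the project itself, the HRF_Count from the expert rules is combined with this model consensus before the final Priority_Tier is assigned.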

    (For detailed methodology, refer to Comprehensive_Project_Report.pdf in the /docs folder, if included.)

    Technology Stack

    • Core Language: Python 3.x
    • Data Manipulation & Analysis: Pandas, NumPy
    • Machine Learning: Scikit-learn (IsolationForest, LocalOutlierFactor, StandardScaler), TensorFlow/Keras (Autoencoder)
    • Visualization: Matplotlib, Seaborn, Plotly Express
    • Dashboard: Streamlit, streamlit-aggrid
    • Utilities: Joblib (for saving scaler)

    Libraries:

    • Model/Scaler Saving: joblib==1.4.2
    • Data I/O Efficiency (optional, but good practice if used): pyarrow==19.0.1
    • Machine L...

  5. Feature importance calculated by Random Forest classifier considering the 80...

    • plos.figshare.com
    xls
    Updated Jun 21, 2023
    Cite
    Marinara Marcato; Salvatore Tedesco; Conor O’Mahony; Brendan O’Flynn; Paul Galvin (2023). Feature importance calculated by Random Forest classifier considering the 80 features previously selected by Select K Best. [Dataset]. http://doi.org/10.1371/journal.pone.0286311.t010
    Available download formats: xls
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Marinara Marcato; Salvatore Tedesco; Conor O’Mahony; Brendan O’Flynn; Paul Galvin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Feature importance calculated by Random Forest classifier considering the 80 features previously selected by Select K Best.

  6. Data from: Boundary peeling: An outlier detection method

    • tandf.figshare.com
    pdf
    Updated Oct 1, 2025
    Cite
    Sheikh Arafat; Na Sun; Maria L. Weese; Waldyn G. Martinez (2025). Boundary peeling: An outlier detection method [Dataset]. http://doi.org/10.6084/m9.figshare.28776694.v1
    Available download formats: pdf
    Dataset updated
    Oct 1, 2025
    Dataset provided by
    Taylor & Francis
    Authors
    Sheikh Arafat; Na Sun; Maria L. Weese; Waldyn G. Martinez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Unsupervised outlier detection constitutes a crucial phase within data analysis and remains an open area of research. A good outlier detection algorithm should be computationally efficient, robust to tuning parameter selection, and perform consistently well across diverse underlying data distributions. We introduce Boundary Peeling, an unsupervised outlier detection algorithm. Boundary Peeling uses the average signed distance from iteratively peeled, flexible boundaries generated by one-class support vector machines to flag outliers. The method is similar to convex hull peeling but well suited for high-dimensional data and has flexibility to adapt to different distributions. Boundary Peeling has robust hyperparameter settings and, for increased flexibility, can be cast as an ensemble method. In unimodal and multimodal synthetic data simulations Boundary Peeling outperforms all state of the art methods when no outliers are present while maintaining comparable or superior performance in the presence of outliers. Boundary Peeling performs competitively or better in terms of correct classification, AUC, and processing time using semantically meaningful benchmark datasets.
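    The mechanism described above can be loosely sketched with scikit-learn's OneClassSVM. This is an illustration of the idea only, not the authors' implementation: the number of peeling rounds, the nu parameter, and the peeling rule are all assumptions.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
# 200 inliers near the origin plus 5 planted outliers far away
X = np.vstack([rng.normal(size=(200, 2)), rng.normal(8.0, 0.5, size=(5, 2))])

scores = np.zeros(len(X))
rounds = 0
active = np.arange(len(X))          # indices still inside the current boundary
for _ in range(3):                  # a few peeling iterations (assumed count)
    if len(active) < 10:
        break
    svm = OneClassSVM(gamma="scale", nu=0.1).fit(X[active])
    scores += svm.decision_function(X)      # signed distance to this boundary
    rounds += 1
    # peel: keep only points strictly inside the fitted boundary
    active = active[svm.decision_function(X[active]) > 0]

# average signed distance over successive boundaries; lower values suggest outliers
avg_signed_distance = scores / rounds
```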

  7. Cybersecurity 🪪 Intrusion 🦠 Detection Dataset

    • kaggle.com
    Updated Feb 10, 2025
    Cite
    Dinesh Naveen Kumar Samudrala (2025). Cybersecurity 🪪 Intrusion 🦠 Detection Dataset [Dataset]. https://www.kaggle.com/datasets/dnkumars/cybersecurity-intrusion-detection-dataset
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 10, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Dinesh Naveen Kumar Samudrala
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This Cybersecurity Intrusion Detection Dataset is designed for detecting cyber intrusions based on network traffic and user behavior. Below, I’ll explain each aspect in detail, including the dataset structure, feature importance, possible analysis approaches, and how it can be used for machine learning.

    1. Understanding the Features

    The dataset consists of network-based and user behavior-based features. Each feature provides valuable information about potential cyber threats.

    A. Network-Based Features

    These features describe network-level information such as packet size, protocol type, and encryption methods.

    1. network_packet_size (Packet Size in Bytes)

      • Represents the size of network packets, ranging between 64 to 1500 bytes.
      • Packets on the lower end (~64 bytes) may indicate control messages, while larger packets (~1500 bytes) often carry bulk data.
      • Attackers may use abnormally small or large packets for reconnaissance or exploitation attempts.
    2. protocol_type (Communication Protocol)

      • The protocol used in the session: TCP, UDP, or ICMP.
      • TCP (Transmission Control Protocol): Reliable, connection-oriented (common for HTTP, HTTPS, SSH).
      • UDP (User Datagram Protocol): Faster but less reliable (used for VoIP, streaming).
      • ICMP (Internet Control Message Protocol): Used for network diagnostics (ping); often abused in Denial-of-Service (DoS) attacks.
    3. encryption_used (Encryption Protocol)

      • Values: AES, DES, None.
      • AES (Advanced Encryption Standard): Strong encryption, commonly used.
      • DES (Data Encryption Standard): Older encryption, weaker security.
      • None: Indicates unencrypted communication, which can be risky.
      • Attackers might use no encryption to avoid detection or weak encryption to exploit vulnerabilities.

    B. User Behavior-Based Features

    These features track user activities, such as login attempts and session duration.

    1. login_attempts (Number of Logins)

      • High values might indicate brute-force attacks (repeated login attempts).
      • Typical users have 1–3 login attempts, while an attack may have hundreds or thousands.
    2. session_duration (Session Length in Seconds)

      • A very long session might indicate unauthorized access or persistence by an attacker.
      • Attackers may try to stay connected to maintain access.
    3. failed_logins (Failed Login Attempts)

      • High failed login counts indicate credential stuffing or dictionary attacks.
      • Many failed attempts followed by a successful login could suggest an account was compromised.
    4. unusual_time_access (Login Time Anomaly)

      • A binary flag (0 or 1) indicating whether access happened at an unusual time.
      • Attackers often operate outside normal business hours to evade detection.
    5. ip_reputation_score (Trustworthiness of IP Address)

      • A score from 0 to 1, where higher values indicate suspicious activity.
      • IP addresses associated with botnets, spam, or previous attacks tend to have higher scores.
    6. browser_type (User’s Browser)

      • Common browsers: Chrome, Firefox, Edge, Safari.
      • Unknown: Could be an indicator of automated scripts or bots.

    2. Target Variable (attack_detected)

    • Binary classification: 1 means an attack was detected, 0 means normal activity.
    • The dataset is useful for supervised machine learning, where a model learns from labeled attack patterns.

    3. Possible Use Cases

    This dataset can be used for intrusion detection systems (IDS) and cybersecurity research. Some key applications include:

    A. Machine Learning-Based Intrusion Detection

    1. Supervised Learning Approaches

      • Classification Models (Logistic Regression, Decision Trees, Random Forest, XGBoost, SVM)
      • Train the model using labeled data (attack_detected as the target).
      • Evaluate using accuracy, precision, recall, F1-score.
    2. Deep Learning Approaches

      • Use Neural Networks (DNN, LSTM, CNN) for pattern recognition.
      • LSTMs work well for time-series-based network traffic analysis.

    B. Anomaly Detection (Unsupervised Learning)

    If attack labels are missing, anomaly detection can be used:

    • Autoencoders: Learn normal traffic and flag anomalies.
    • Isolation Forest: Detects outliers based on feature isolation.
    • One-Class SVM: Learns normal behavior and detects deviations.
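    A minimal sketch of the Isolation Forest option, on synthetic stand-ins for two of the features above (login_attempts and session_duration); the distributions and the contamination value are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(2)
# synthetic stand-ins: normal sessions vs. brute-force-like sessions
normal = np.column_stack([rng.poisson(2, 950), rng.exponential(300.0, 950)])
attacks = np.column_stack([rng.poisson(40, 50), rng.exponential(5000.0, 50)])
X = np.vstack([normal, attacks]).astype(float)

# contamination = assumed share of intrusions in the traffic
iso = IsolationForest(contamination=0.05, random_state=0).fit(X)
flags = iso.predict(X)              # -1 = anomaly, 1 = normal
anomaly_rate = float((flags == -1).mean())
```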

    C. Rule-Based Detection

    • If certain thresholds are met (e.g., failed_logins > 10 & ip_reputation_score > 0.8), an alert is triggered.
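    The threshold rule above translates directly into a vectorized check, sketched here on a few invented rows:

```python
import pandas as pd

# toy rows using the dataset's feature names (values are illustrative)
df = pd.DataFrame({
    "failed_logins":       [0, 2, 14, 25],
    "ip_reputation_score": [0.1, 0.9, 0.85, 0.95],
})

# alert when both thresholds are exceeded, as in the rule above
df["alert"] = (df["failed_logins"] > 10) & (df["ip_reputation_score"] > 0.8)
# df["alert"]: [False, False, True, True]
```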

    4. Challenges & Considerations

    • Adversarial Attacks: Attackers may modify traffic to evade detection.
    • Concept Drift: Cyber threats...
  8. SKAB - Skoltech Anomaly Benchmark

    • kaggle.com
    zip
    Updated Nov 28, 2020
    Cite
    Iurii Katser (2020). SKAB - Skoltech Anomaly Benchmark [Dataset]. https://www.kaggle.com/datasets/yuriykatser/skoltech-anomaly-benchmark-skab/code
    Available download formats: zip (1300142 bytes)
    Dataset updated
    Nov 28, 2020
    Authors
    Iurii Katser
    License

    GNU AGPL v3.0: http://www.gnu.org/licenses/agpl-3.0.html

    Description

    ❗️❗️❗️**The current version of SKAB (v0.9) contains 34 datasets with collective anomalies. But the upcoming update to v1.0 (probably up to the summer of 2021) will contain 300+ additional files with point and collective anomalies. It will make SKAB one of the largest changepoint-containing benchmarks, especially in the technical field.**


    We propose the Skoltech Anomaly Benchmark (SKAB), designed for evaluating anomaly detection algorithms. SKAB supports two main problems (there are two markups for anomalies):

    • Outlier detection (anomalies considered and marked up as single-point anomalies)
    • Changepoint detection (anomalies considered and marked up as collective anomalies)

    SKAB consists of the following artifacts:

    • Datasets
    • Leaderboard (scoreboard)
    • Python modules for algorithms' evaluation
    • Notebooks: Python notebooks with anomaly detection algorithms

    The IIoT testbed is located at the Skolkovo Institute of Science and Technology (Skoltech). All details regarding the testbed and the experimental process are presented in the following artifacts:

    • Position paper (currently submitted for publication)
    • Slides about the project

    Datasets

    The SKAB v0.9 corpus contains 35 individual data files in .csv format. Each file represents a single experiment and contains a single anomaly. The dataset is a multivariate time series collected from the sensors installed on the testbed. The data folder contains the datasets from the benchmark; its layout is described in the structure file. Columns in each data file are as follows:

    • datetime - Date and time when the value was written to the database (YYYY-MM-DD hh:mm:ss)
    • Accelerometer1RMS - Vibration acceleration (g units)
    • Accelerometer2RMS - Vibration acceleration (g units)
    • Current - Amperage on the electric motor (Ampere)
    • Pressure - Pressure in the loop after the water pump (Bar)
    • Temperature - Temperature of the engine body (degrees Celsius)
    • Thermocouple - Temperature of the fluid in the circulation loop (degrees Celsius)
    • Voltage - Voltage on the electric motor (Volt)
    • RateRMS - Circulation flow rate of the fluid inside the loop (litres per minute)
    • anomaly - Whether the point is anomalous (0 or 1)
    • changepoint - Whether the point is a changepoint for collective anomalies (0 or 1)
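    A file with the columns above can be loaded and its two markups summed as sketched below. The two inline rows are invented stand-ins; real SKAB files should be read from the data folder instead.

```python
import io
import pandas as pd

# a two-row stand-in for one SKAB experiment file (column set as listed above)
raw = io.StringIO(
    "datetime,Accelerometer1RMS,Accelerometer2RMS,Current,Pressure,"
    "Temperature,Thermocouple,Voltage,RateRMS,anomaly,changepoint\n"
    "2020-03-09 12:00:00,0.027,0.040,1.33,0.41,89.0,25.7,219.0,32.0,0,0\n"
    "2020-03-09 12:00:01,0.120,0.180,1.90,0.38,89.2,25.8,218.5,31.0,1,1\n"
)
df = pd.read_csv(raw, parse_dates=["datetime"]).set_index("datetime")

# the two anomaly markups: single-point outliers and collective changepoints
n_anomalies = int(df["anomaly"].sum())
n_changepoints = int(df["changepoint"].sum())
```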

    Leaderboard (Scoreboard)

    Here we propose the leaderboard for SKAB v0.9 both for outlier and changepoint detection problems. You can also present and evaluate your algorithm using SKAB on kaggle. The results in the tables are calculated in the python notebooks from the notebooks folder.

    Outlier detection problem

    Sorted by F1 (for F1, bigger is better; for both FAR and MAR, less is better).

    | Algorithm | F1 | FAR, % | MAR, % |
    |---|---|---|---|
    | Perfect detector | 1 | 0 | 0 |
    | T-squared+Q (PCA) | 0.67 | 13.95 | 36.32 |
    | LSTM | 0.64 | 15.4 | 39.93 |
    | MSCRED | 0.64 | 13.56 | 41.16 |
    | T-squared | 0.56 | 12.14 | 52.56 |
    | Autoencoder | 0.45 | 7.56 | 66.57 |
    | Isolation forest | 0.4 | 6.86 | 72.09 |
    | Null detector | 0 | 0 | 100 |

    Changepoint detection problem

    Sorted by NAB (standard); for all metrics, bigger is better.

    | Algorithm | NAB (standard) | NAB (lowFP) | NAB (lowFN) |
    |---|---|---|---|
    | Perfect detector | 100 | 100 | 100 |
    | Isolation forest | 37.53 | 17.09 | 45.02 |
    | MSCRED | 28.74 | 23.43 | 31.21 |
    | LSTM | 27.09 | 11.06 | 32.68 |
    | T-squared+Q (PCA) | 26.71 | 22.42 | 28.32 |
    | T-squared | 17.87 | 3.44 | 23.2 |
    | ArimaFD | 16.06 | 14.03 | 17.12 |
    | Autoencoder | 15.59 | 0.78 | 20.91 |
    | Null detector | 0 | 0 | 0 |

    Notebooks

    The notebooks folder contains Python notebooks with the code for reproducing the proposed leaderboard results.

    We have calculated the results for five common anomaly detection algorithms:

    • Hotelling's T-squared statistics
    • Hotelling's T-squared statistics + Q statistics based on PCA
    • Isolation forest
    • LSTM-based NN
    • Feed-Forward Autoencoder

    Additionally, results for the following algorithms were added to the repository:

    • ArimaFD
    • MSCRED

    Citat...

  9. IsoFMiR: An unsupervised anomaly detection framework for biomarker discovery...

    • repod.icm.edu.pl
    zip
    Updated Oct 14, 2025
    Cite
    Dey, Mritunjoy (2025). IsoFMiR: An unsupervised anomaly detection framework for biomarker discovery in rare cancers [Dataset]. http://doi.org/10.18150/UAKCCS
    Available download formats: zip (11578)
    Dataset updated
    Oct 14, 2025
    Dataset provided by
    RepOD
    Authors
    Dey, Mritunjoy
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Dataset funded by
    Maria Sklodowska-Curie National Research Institute of Oncology
    Description

  10. Systematic review of validation of supervised machine learning models in...

    • search.dataone.org
    Updated Jun 25, 2025
    Cite
    Oakleigh Wilson (2025). Systematic review of validation of supervised machine learning models in accelerometer-based animal behaviour classification literature [Dataset]. http://doi.org/10.5061/dryad.fxpnvx14d
    Dataset updated
    Jun 25, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Oakleigh Wilson
    Description

    Supervised machine learning has been used to detect fine-scale animal behaviour from accelerometer data, but a standardised protocol for implementing this workflow is currently lacking. As the application of machine learning to ecological problems expands, it is essential to establish technical protocols and validation standards that align with those in other "big data" fields. Overfitting is a prevalent and often misunderstood challenge in machine learning. Overfit models overly adapt to the training data, memorising specific instances rather than discerning the underlying signal. Associated results can indicate high performance on the training set, yet these models are unlikely to generalise to new data. Overfitting can be detected through rigorous validation using independent test sets. Our systematic review of 119 studies using accelerometer-based supervised machine learning to classify animal behaviour reveals that 79% (94 papers) did not validate their models sufficiently wel...

    We defined eligibility criteria as 'peer-reviewed primary research papers published 2013-present that use supervised machine learning to identify specific behaviours from raw, non-livestock animal accelerometer data'. We elected to ignore analysis of livestock behaviour, as agricultural methods often operate within different constraints to the analyses conducted on wild animals, and this body of literature has mostly developed in isolation from wild animal research. Our search was conducted on 27/09/2024. An initial keyword search across 3 databases (Google Scholar, PubMed, and Scopus) yielded 249 unique papers. Papers outside of the search criteria (including hardware and software advances, non-ML analysis, insufficient accelerometry application (e.g., research focused on other sensors with accelerometry providing minimal support), unsupervised methods, and research limited to activity intensity or active and inactive states) were excluded, resulting in 119 papers.

    Systematic review of validation of supervised machine learning models in accelerometer-based animal behaviour classification literature

    https://doi.org/10.5061/dryad.fxpnvx14d

    Description of the data and file structure

    Files and variables

    File: Systematic_Review_Supplementary.xlsx

    Description: Methods information from animal accelerometer-based behaviour classification literature utilising supervised machine learning techniques.

    Variables

    • Citation: Citation information for paper
    • Title: Extracted title from citation information
    • Year: Year of publication
    • ModelCategory: General category of the supervised machine learning model used (e.g., all Support Vector Machines are listed as SVM)
      • DT — Decision Tree
      • EM — Expectation Maximisation
      • Ensemble — Ensemble methods (e.g., boosting, bagging)
      • HMM — Hidden Markov Model
      • Isolation Forest — Anomaly detection using Isolation Forest ...,
  11. Credit Card Fraud Dataset

    • kaggle.com
    zip
    Updated Sep 11, 2025
    Cite
    Waqas Ishtiaq (2025). Credit Card Fraud Dataset [Dataset]. https://www.kaggle.com/datasets/waqasishtiaq/credit-card-fraud-dataset
    Available download formats: zip (69155672 bytes)
    Dataset updated
    Sep 11, 2025
    Authors
    Waqas Ishtiaq
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset, commonly known as creditcard.csv, contains anonymized credit card transactions made by European cardholders in September 2013. It includes 284,807 transactions, with 492 labeled as fraudulent. Due to confidentiality constraints, features have been transformed using PCA, except for 'Time' and 'Amount'.

    This dataset was used in the research article titled "A Hybrid Anomaly Detection Framework Combining Supervised and Unsupervised Learning for Credit Card Fraud Detection". The study proposes an ensemble model integrating techniques such as Autoencoders, Isolation Forest, Local Outlier Factor, and supervised classifiers including XGBoost and Random Forest, aiming to improve the detection of rare fraudulent patterns while maintaining efficiency and scalability.

    Key Features:

    • 30 numerical input features (V1–V28, Time, Amount)
    • Class label indicating fraud (1) or normal (0)
    • Imbalanced class distribution typical of real-world fraud detection

    Use Case: Ideal for benchmarking and evaluating anomaly detection and classification algorithms in highly imbalanced data scenarios.
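    With 492 frauds among 284,807 transactions, plain accuracy is uninformative, so precision and recall on the fraud class are the metrics to watch. The sketch below illustrates this on synthetic stand-in features; the generated data, the contamination value, and the degree of separability are assumptions, not properties of creditcard.csv.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(3)
# synthetic stand-ins for the 28 PCA components plus Amount (29 columns)
X_norm = rng.normal(0.0, 1.0, size=(2000, 29))
X_fraud = rng.normal(3.0, 1.0, size=(4, 29))   # ~0.2% positives, as in the data
X = np.vstack([X_norm, X_fraud])
y = np.r_[np.zeros(2000), np.ones(4)]

# contamination set to the assumed fraud prevalence (~0.2%)
iso = IsolationForest(contamination=0.002, random_state=0).fit(X)
pred = (iso.predict(X) == -1).astype(int)

# with ~99.8% negatives, accuracy is misleading; report fraud-class metrics
precision = precision_score(y, pred, zero_division=0)
recall = recall_score(y, pred)
```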

    Source: Originally published by the Machine Learning Group at Université Libre de Bruxelles.

  12. Feature importance calculated by Random Forest classifier considering the 80...

    • plos.figshare.com
    xls
    Updated Jun 21, 2023
    Cite
    Marinara Marcato; Salvatore Tedesco; Conor O’Mahony; Brendan O’Flynn; Paul Galvin (2023). Feature importance calculated by Random Forest classifier considering the 80 features selected by Select K Best by domain. [Dataset]. http://doi.org/10.1371/journal.pone.0286311.t011
    Available download formats: xls
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Marinara Marcato; Salvatore Tedesco; Conor O’Mahony; Brendan O’Flynn; Paul Galvin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Feature importance calculated by Random Forest classifier considering the 80 features selected by Select K Best by domain.

  13. Fraud Guard Synthetic 2025

    • kaggle.com
    zip
    Updated Sep 26, 2025
    Cite
    Imaad Mahmood (2025). Fraud Guard Synthetic 2025 [Dataset]. https://www.kaggle.com/datasets/imaadmahmood/fraud-guard-synthetic-2025
    Available download formats: zip (1586 bytes)
    Dataset updated
    Sep 26, 2025
    Authors
    Imaad Mahmood
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    FraudGuardSynthetic2025 Dataset

    Overview

    The FraudGuardSynthetic2025 dataset is a synthetically generated dataset designed for machine learning and data analysis tasks focused on detecting credit card fraud. Created in September 2025, it simulates realistic transaction profiles based on established fraud detection patterns, drawing inspiration from financial data studies and existing datasets. With 1,000 records, it features a highly imbalanced target (~0.2% fraud cases) to mirror real-world fraud prevalence, making it ideal for anomaly detection, classification, and preprocessing practice in educational and research settings.

    Data Description

    • Rows: 1,000
    • Columns: 11
    • Target Variable: is_fraud (binary: 0 = Non-fraud, 1 = Fraud)
    • File Format: CSV
    • Size: Approximately 50 KB

    Columns

    | Column Name | Type | Description |
    |---|---|---|
    | transaction_id | Integer | Unique identifier for each transaction (1 to 1,000). |
    | amount | Float | Transaction amount in USD (0.01 to 10,000, skewed toward smaller values). |
    | time | Integer | Seconds since first transaction (0 to 86,400, simulating one day). |
    | merchant_category | Categorical | Merchant type: Retail, Online, Travel, Food, Other. |
    | cardholder_age | Integer | Cardholder age in years (18 to 90, skewed toward 25-50). |
    | cardholder_zip | Integer | Cardholder ZIP code (10000 to 99999, synthetic US-based). |
    | distance | Float | Distance (km) between cardholder and merchant (0 to 500, mean ~50). |
    | is_online | Binary | Online transaction: 0 = In-person, 1 = Online (~30% online). |
    | card_type | Categorical | Card type: Credit, Debit (~60% credit). |
    | transaction_hour | Integer | Hour of transaction (0 to 23, skewed toward daytime). |
    | is_fraud | Binary | Target variable: 0 = Non-fraud, 1 = Fraud (~0.2% positive cases). |

    Key Features

    • Realistic Distributions: Reflects 2025 financial fraud patterns (e.g., low fraud prevalence, higher fraud in online/high-amount/unusual-location transactions) based on recent studies like FinGraphFL federated learning and GDPR-compliant synthetic data.
    • Synthetic Data: Generated to avoid privacy concerns, ensuring ethical use for research and education, inspired by 2025 generative ML techniques for imbalanced datasets.
    • Versatility: Suitable for anomaly detection (e.g., Isolation Forest), binary classification (e.g., XGBoost, Random Forest), and advanced DL (e.g., Deep Belief Networks).
    • No Missing Values: Clean dataset for straightforward analysis, though users can introduce missingness for practice.

    Use Cases

    • Machine Learning: Train models like Random Forest with SMOTE, Isolation Forest, or ensemble stacking for fraud detection.
    • Data Analysis: Explore correlations between features (e.g., amount, distance, is_online) and fraud outcomes.
    • Educational Projects: Ideal for learning EDA, feature engineering, and handling imbalanced data (e.g., SMOTE/ENN).
    • Financial Research: Simulate fraud scenarios for studying detection algorithms without real transaction data.

    Source and Inspiration

    This dataset is inspired by fraud detection patterns from 2025 financial literature (e.g., Federal Reserve, IEEE studies on Nigerian datasets) and existing datasets like Kaggle Credit Card Fraud Detection (2018, with 2021 simulator), Kartik2112 simulated transactions (2020), and new 2025 releases (e.g., Figshare creditcard.csv, Synthesized GDPR-compliant synthetic data). It incorporates trends like hybrid feature selection and explainable AI.

    Usage Notes

    • Preprocessing: Numerical features (amount, distance, time) require scaling; categorical features (merchant_category, card_type) need encoding (e.g., one-hot).
    • Imbalanced Data: The ~0.2% fraud prevalence requires techniques like SMOTE, ENN undersampling, or anomaly detection algorithms.
    • Scalability: Contact the creator to generate larger datasets (e.g., 10,000+ rows) if needed.
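    The scaling and encoding steps above can be sketched with plain pandas. The column names follow the schema described for this dataset; the row values below are invented for illustration and are not drawn from the actual data.

```python
import pandas as pd

# Toy rows using the documented schema (values are illustrative only)
df = pd.DataFrame({
    "amount": [12.5, 830.0, 54.2],
    "distance": [1.2, 420.7, 8.9],
    "transaction_hour": [9, 2, 14],
    "merchant_category": ["grocery", "travel", "grocery"],
    "card_type": ["Credit", "Debit", "Credit"],
    "is_online": [0, 1, 0],
})

# Standardize numerical features to zero mean / unit variance
num_cols = ["amount", "distance", "transaction_hour"]
df[num_cols] = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std()

# One-hot encode the categorical features
df = pd.get_dummies(df, columns=["merchant_category", "card_type"])
print(df.columns.tolist())
```

    In a real pipeline, scikit-learn's ColumnTransformer would let you fit these transforms on the training split only and reuse them at prediction time.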

    License

    This dataset is provided for educational and research purposes under a Creative Commons Attribution 4.0 International License (CC BY 4.0).

    Contact

    For questions or to request expanded datasets, contact the creator via the platform where this dataset is hosted.

  14. Imputed-VAE IIoT-2021

    • ieee-dataport.org
    Updated Nov 26, 2025
    Kowshik Balasubramanian (2025). Imputed-VAE IIoT-2021 [Dataset]. https://ieee-dataport.org/documents/imputed-vae-iiot-2021
    Explore at:
    Dataset updated
    Nov 26, 2025
    Authors
    Kowshik Balasubramanian
    Description

    neural network imputation

  15. Data from: Wine Quality

    • kaggle.com
    zip
    Updated Jul 14, 2024
    Abdelaziz Sami (2024). Wine Quality [Dataset]. https://www.kaggle.com/datasets/abdelazizsami/wine-quality
    Explore at:
    zip(99409 bytes)Available download formats
    Dataset updated
    Jul 14, 2024
    Authors
    Abdelaziz Sami
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Overview

    • Input Variables: physicochemical properties (e.g., pH, alcohol content, acidity).
    • Output Variable: sensory rating (quality), an ordered categorical score.

    Tasks

    • Classification or Regression: treat the output as a categorical variable (classification) or as a continuous score (regression).
    • Outlier Detection: identify outliers (e.g., exceptionally good or poor wines) using techniques like Isolation Forest or Local Outlier Factor (LOF).
    • Feature Selection: apply methods such as Recursive Feature Elimination (RFE), LASSO, or tree-based feature importance to identify the most relevant features.
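    The outlier-detection task can be sketched with scikit-learn on mock data. The dense cluster and the handful of extreme points below are synthetic stand-ins for ordinary and exceptional wines; both detectors use the same -1/1 labeling convention.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# Mock physicochemical features: a dense cluster plus a few extreme samples
X = np.vstack([rng.normal(0, 1, size=(200, 4)),
               rng.normal(8, 1, size=(5, 4))])

iso = IsolationForest(contamination=0.05, random_state=0).fit(X)
iso_labels = iso.predict(X)            # -1 = outlier, 1 = inlier

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
lof_labels = lof.fit_predict(X)        # same -1/1 convention

print("Isolation Forest flagged:", int((iso_labels == -1).sum()))
print("LOF flagged:", int((lof_labels == -1).sum()))
```

    The contamination parameter encodes how many outliers you expect; on real wine data it would be tuned, not assumed.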

    Suggested Analysis Steps

    Data Preprocessing:

    • Handle missing values if any.
    • Normalize or standardize input features for better model performance.

    Exploratory Data Analysis (EDA):

    • Visualize the distribution of quality ratings.
    • Use pair plots or correlation heatmaps to understand relationships between features.

    Modeling:

    • For classification: try models like Logistic Regression, Decision Trees, Random Forest, or Gradient Boosting.
    • For regression: use Linear Regression, SVR, or tree-based models like Random Forest Regressor.

    Evaluation:

    • Use metrics like accuracy, F1-score, or ROC-AUC for classification.
    • For regression, consider MAE, MSE, or R².

    Feature Importance:

    Analyze which features contribute the most to the predictions to aid in understanding the data.
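    A minimal sketch of tree-based feature importance on mock data: the feature names are assumed stand-ins for wine attributes, and the synthetic target is driven by the first feature by construction, so it should dominate the importance ranking.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
# Mock data: the target depends (by construction) only on the first feature
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.1 * rng.normal(size=300) > 0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ sums to 1 across features
for name, imp in zip(["alcohol", "pH", "acidity", "sulphates"],
                     clf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```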

  16. Credit Card Fraud Dataset

    • kaggle.com
    zip
    Updated Jun 22, 2024
    Dylan Moraes (2024). Credit Card Fraud Dataset [Dataset]. https://www.kaggle.com/datasets/dylanmoraes/credit-card-fraud-dataset/discussion
    Explore at:
    zip(186385507 bytes)Available download formats
    Dataset updated
    Jun 22, 2024
    Authors
    Dylan Moraes
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Overview

    This dataset contains synthetic credit card transaction data designed for fraud detection and machine learning research. With over 6.3 million transactions, it provides a realistic simulation of financial transaction patterns including both legitimate and fraudulent activities.

    Source

    This is a synthetic dataset generated to simulate credit card transaction behavior. The data represents financial transactions over a 30-day period (743 hours) with various transaction types including payments, transfers, cash-outs, debits, and cash-ins.

    Purpose

    The dataset is specifically designed for:

    • Training and testing fraud detection models
    • Anomaly detection research
    • Binary classification tasks
    • Imbalanced learning scenarios
    • Financial machine learning applications

    Column Descriptions

    • step: Maps a unit of time in the real world. 1 step represents 1 hour of time. Range: 1 to 743
    • type: Type of transaction (PAYMENT, TRANSFER, CASH_OUT, DEBIT, CASH_IN)
    • amount: Amount of the transaction in local currency
    • nameOrig: Customer ID who initiated the transaction
    • oldbalanceOrg: Initial balance before the transaction (origin account)
    • newbalanceOrig: New balance after the transaction (origin account)
    • nameDest: Recipient ID of the transaction
    • oldbalanceDest: Initial recipient balance before the transaction
    • newbalanceDest: New recipient balance after the transaction
    • isFraud: Binary flag indicating fraud (1 = fraud, 0 = legitimate)
    • isFlaggedFraud: Flag for illegal attempts to transfer more than 200,000 in a single transaction

    Dataset Statistics

    • Total Transactions: 6,362,620
    • Fraudulent Transactions: 8,213 (~0.13%)
    • Legitimate Transactions: 6,354,407 (~99.87%)
    • Time Period: 30 days (743 hours)
    • File Size: 493.53 MB

    Class Imbalance Note

    This dataset exhibits significant class imbalance, with only 0.13% fraudulent transactions. This mirrors real-world fraud detection scenarios, where fraudulent transactions are rare. Consider using techniques such as:

    • SMOTE (Synthetic Minority Over-sampling Technique)
    • Undersampling of the majority class
    • Cost-sensitive learning
    • Ensemble methods
    • Anomaly detection algorithms
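    The imbalanced-learn library provides the standard SMOTE implementation; purely to illustrate the core idea, here is a minimal numpy sketch that generates synthetic minority samples by interpolating toward random nearest neighbours.

```python
import numpy as np

def smote_like(X_minority, n_new, k=5, seed=0):
    """Minimal sketch of SMOTE's core idea: interpolate each sampled
    minority point toward one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))
        # distances from sample i to all minority samples
        d = np.linalg.norm(X_minority - X_minority[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # skip self at index 0
        j = rng.choice(neighbours)
        gap = rng.random()                    # random point on the segment
        synthetic.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.array(synthetic)

rng = np.random.default_rng(0)
X_fraud = rng.normal(5, 1, size=(20, 3))      # mock rare-class samples
X_new = smote_like(X_fraud, n_new=80)
print(X_new.shape)
```

    Production code should oversample only the training split, never the test set, to avoid leaking synthetic copies into evaluation.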

    Model Suitability

    This dataset is well-suited for:

    • Logistic Regression
    • Random Forest
    • Gradient Boosting (XGBoost, LightGBM, CatBoost)
    • Neural Networks
    • Isolation Forest
    • Autoencoders
    • Support Vector Machines

    Quick Start Example

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    # Load the dataset
    df = pd.read_csv('/kaggle/input/credit-card-fraud-dataset/Fraud.csv')
    
    # Display basic information
    print(df.info())
    print(df.head())
    
    # Check fraud distribution
    print(df['isFraud'].value_counts())
    
    # Visualize fraud distribution
    plt.figure(figsize=(8, 5))
    sns.countplot(data=df, x='isFraud')
    plt.title('Distribution of Fraud vs Legitimate Transactions')
    plt.xlabel('Is Fraud (0=No, 1=Yes)')
    plt.ylabel('Count')
    plt.show()
    
    # Transaction type distribution
    plt.figure(figsize=(10, 6))
    sns.countplot(data=df, x='type', hue='isFraud')
    plt.title('Transaction Types by Fraud Status')
    plt.xticks(rotation=45)
    plt.show()
    

    Usage Tips

    1. Handle Class Imbalance: Use appropriate sampling techniques or algorithms designed for imbalanced data
    2. Feature Engineering: Consider creating features like transaction velocity, time-based patterns, and balance differences
    3. Evaluation Metrics: Use precision, recall, F1-score, and AUC-ROC rather than accuracy due to class imbalance
    4. Cross-validation: Use stratified k-fold to maintain class distribution across folds
    5. Transaction Patterns: Analyze transaction types - TRANSFER and CASH_OUT are more associated with fraud
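    Tip 4 can be sketched as follows. The labels are mock data with a rare positive class (10 positives in 1,000 rows, standing in for the dataset's 0.13% fraud rate); stratified splitting keeps the positives spread evenly across folds.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.zeros(1000, dtype=int)
y[:10] = 1                      # rare positive class

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_pos = []
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    n_pos = int(y[test_idx].sum())
    fold_pos.append(n_pos)
    print(f"fold {fold}: {n_pos} positives out of {len(test_idx)} test rows")
```

    A plain KFold could leave some folds with no positives at all, making recall on those folds undefined; stratification avoids that.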

    Update Frequency

    This is a static dataset with no planned future updates. It serves as a benchmark for fraud detection research and model development.

    Acknowledgments

    This dataset is made available under the MIT License for educational and research purposes in the field of fraud detection and financial machine learning.

  17. Credit Card Fraud Detection

    • kaggle.com
    zip
    Updated Feb 28, 2025
    Tushar Bhadouria (2025). Credit Card Fraud Detection [Dataset]. https://www.kaggle.com/datasets/tusharbhadouria/credit-card-fraud-detection/code
    Explore at:
    zip(211766662 bytes)Available download formats
    Dataset updated
    Feb 28, 2025
    Authors
    Tushar Bhadouria
    Description

    📌 Overview This dataset provides a real-world representation of credit card transactions, labeled as fraudulent or legitimate. It is designed to aid in the development of machine learning models for fraud detection and financial security applications. Given the rising cases of online fraud, detecting suspicious transactions is crucial for financial institutions.

    This dataset allows users to experiment with various fraud detection techniques, such as supervised and unsupervised learning models, anomaly detection, and pattern recognition.

    📊 Dataset Details

    • Number of Transactions: 1,852,394
    • Number of Features: 23
    • Fraud Labels: is_fraud = 1 marks fraudulent transactions; is_fraud = 0 marks legitimate payments.

    📁 Columns Explained Transaction Information:

    • trans_date_trans_time – Timestamp of the transaction
    • cc_num – Unique (anonymized) credit card number
    • merchant – Merchant where the transaction occurred
    • category – Type of transaction (e.g., travel, food, personal care)
    • amt – Transaction amount

    Cardholder Details:

    • first, last – First and last name of the cardholder
    • gender – Gender of the cardholder
    • street, city, state, zip – Address of the cardholder
    • lat, long – Geographical location of the cardholder
    • city_pop – Population of the cardholder's city
    • job – Profession of the cardholder
    • dob – Date of birth of the cardholder

    Transaction Identifiers & Timing:

    • trans_num – Unique transaction identifier
    • unix_time – Transaction timestamp in Unix format

    Merchant Details:

    • merch_lat, merch_long – Merchant's location (latitude & longitude)

    Fraud Indicator:

    • is_fraud – Target variable (1 = Fraud, 0 = Legitimate)

    🎯 Usage

    This dataset is ideal for:

    ✅ Fraud detection research
    ✅ Machine learning model development
    ✅ Anomaly detection projects
    ✅ Financial analytics

    🛠️ Suggested Machine Learning Approaches

    Supervised Learning:

    • Logistic Regression
    • Decision Trees / Random Forest
    • XGBoost / LightGBM
    • Deep Learning (Neural Networks)

    Unsupervised Learning:

    • Autoencoders
    • Isolation Forest
    • DBSCAN for anomaly detection

    Feature Engineering Ideas:

    • Creating transaction frequency features
    • Aggregating spending behavior per merchant/category
    • Analyzing location-based fraud patterns
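    The frequency and aggregation ideas can be sketched with pandas groupby, using the documented cc_num, category, and amt columns. The toy rows below are illustrative, not drawn from the dataset.

```python
import pandas as pd

# Toy transactions using the documented columns (values are illustrative)
df = pd.DataFrame({
    "cc_num": [111, 111, 111, 222, 222],
    "category": ["travel", "food", "travel", "food", "food"],
    "amt": [120.0, 15.5, 300.0, 9.9, 12.1],
})

# Transaction frequency and average spend per card
freq = df.groupby("cc_num")["amt"].agg(tx_count="count", avg_amt="mean")

# Spending per card and category, useful for spotting unusual category mixes
per_cat = df.groupby(["cc_num", "category"])["amt"].sum().unstack(fill_value=0)
print(freq)
print(per_cat)
```

    On the full dataset, these per-card aggregates can be merged back onto each transaction as engineered features before model training.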

    ⚠️ Disclaimer This dataset has been anonymized and should be used strictly for research and educational purposes. It does not contain any real-world personal information, and the credit card numbers have been randomly generated for simulation purposes.

  18. Classification metrics per posture achieved using the best models selected...

    • plos.figshare.com
    xls
    Updated Jun 21, 2023
    Marinara Marcato; Salvatore Tedesco; Conor O’Mahony; Brendan O’Flynn; Paul Galvin (2023). Classification metrics per posture achieved using the best models selected by grid search in Classifier 1 on the test set. [Dataset]. http://doi.org/10.1371/journal.pone.0286311.t012
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Marinara Marcato; Salvatore Tedesco; Conor O’Mahony; Brendan O’Flynn; Paul Galvin
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Classification metrics per posture achieved using the best models selected by grid search in Classifier 1 on the test set.

  19. Data from: hvac-fault-detection

    • huggingface.co
    Updated Dec 1, 2025
    Shahab Salehi (2025). hvac-fault-detection [Dataset]. https://huggingface.co/datasets/shahabsalehi/hvac-fault-detection
    Explore at:
    Dataset updated
    Dec 1, 2025
    Authors
    Shahab Salehi
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    HVAC Fault Detection Dataset

    ⚠️ Synthetic Data Disclaimer: This dataset contains synthetically generated data for demonstration and testing purposes. It does not represent real equipment faults or actual building systems.

    Overview

    Anomaly detection results from HVAC equipment monitoring using Isolation Forest. This dataset includes detected faults, anomaly scores, and equipment status.

    Schema

    { "pipeline": "hvac_fault_detection_anomaly", "generated_at":… See the full description on the dataset page: https://huggingface.co/datasets/shahabsalehi/hvac-fault-detection.

  20. Data from: INNOVATIVE MACHINE LEARNING APPROACHES TO FOSTER FINANCIAL...

    • data-staging.niaid.nih.gov
    Updated Nov 6, 2024
    Md Shujan Shak; Aftab Uddin; Md Habibur Rahman; Nafis Anjum; Md Nad Vi Al Bony; Murshida Alam; Mohammad Helal; Afrina Khan; Pritom Das; Tamanna Pervin (2024). INNOVATIVE MACHINE LEARNING APPROACHES TO FOSTER FINANCIAL INCLUSION IN MICROFINANCE [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_14044728
    Explore at:
    Dataset updated
    Nov 6, 2024
    Dataset provided by
    Department of Business Administration, International American University, Los Angeles, CA
    Department of Business Administration, Westcliff University, Irvine, California, USA
    Master of Science in Information Technology, Washington University of Science and Technology, USA
    Department of Business Administration, International American University, Los Angeles, California, USA
    Master's in business administration, BRAC Business School, Dhaka Bangladesh
    Fox School of Business & Management, Temple University, USA
    Department of Management Science, Adelphi University, Garden city, New York, USA
    College of Computer Science, Pacific States University, Los Angeles, CA
    College of Technology and Engineering, Westcliff University, Irvine, CA
    Department of Business Administration, International American University, Los Angeles, California
    Authors
    Md Shujan Shak; Aftab Uddin; Md Habibur Rahman; Nafis Anjum; Md Nad Vi Al Bony; Murshida Alam; Mohammad Helal; Afrina Khan; Pritom Das; Tamanna Pervin
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This study examines the application of machine learning algorithms to enhance financial inclusion in microfinance, focusing on credit scoring, risk and fraud detection, and customer segmentation. We performed feature engineering and employed models such as Logistic Regression, Decision Trees, Random Forests, Gradient Boosting Machines (XGBoost and LightGBM), Support Vector Machines (SVM), Autoencoders, Isolation Forests, and K-means Clustering. LightGBM achieved the highest accuracy (89.6%) and AUC (0.92) in credit scoring, while Random Forests demonstrated strong performance in both loan approval (86.7% accuracy) and fraud detection (87.6% accuracy, AUC of 0.88). SVM also performed competitively, and unsupervised methods like Autoencoders and Isolation Forests showed potential for anomaly detection but required further refinement. K-means Clustering excelled in customer segmentation with a silhouette score of 0.72, enabling tailored services based on client demographics. Our findings highlight the significant impact of machine learning on improving credit scoring accuracy, reducing fraud risks, and enhancing customer service delivery in microfinance, thereby promoting financial inclusion for underserved populations. Ethical considerations and model interpretability are crucial, particularly for smaller institutions. This study advocates for the broader adoption of machine learning in the microfinance sector.
