This dataset was created by Lionel Bottan
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparative analysis with unsupervised anomaly detection algorithms.
Project Status: Proof-of-Concept (POC) - Capstone Project
This project demonstrates a proof-of-concept system for detecting financial document anomalies within core SAP FI/CO data, specifically leveraging the New General Ledger table (FAGLFLEXA) and document headers (BKPF). It addresses the challenge that standard SAP reporting and rule-based checks often struggle to identify subtle, complex, or novel irregularities in high-volume financial postings.
The solution employs a Hybrid Anomaly Detection strategy, combining unsupervised Machine Learning models with expert-defined SAP business rules. Findings are prioritized using a multi-faceted scoring system and presented via an interactive dashboard built with Streamlit for efficient investigation.
This project was developed as a capstone, showcasing the application of AI/ML techniques to enhance financial controls within an SAP context, bridging deep SAP domain knowledge with modern data science practices.
Author: Anitha R (https://www.linkedin.com/in/anithaswamy)
Dataset Origin: Kaggle SAP Dataset by Sunitha Siva. License: Other (specified in description); no description available.
Financial integrity is critical. Undetected anomalies in SAP FI/CO postings can lead to:
* Inaccurate financial reporting
* Significant reconciliation efforts
* Potential audit failures or compliance issues
* Masking of operational errors or fraud
Standard SAP tools may not catch all types of anomalies, especially complex or novel patterns. This project explores how AI/ML can augment traditional methods to provide more robust and efficient financial monitoring.
FAGLFLEXA for reliability.
FE_...) to quantify potential deviations from normalcy based on EDA and SAP knowledge.
Model_Anomaly_Count) and HRF counts (HRF_Count) into a Priority_Tier for focusing investigation efforts.
Review_Focus text description summarizing why an item was flagged.

The project followed a structured approach:
BKPF and FAGLFLEXA data extracts. Discarded BSEG due to imbalances. Removed duplicates.
sap_engineered_features.csv.

(For detailed methodology, please refer to the Comprehensive_Project_Report.pdf in the /docs folder - if you include it).
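The description does not spell out how model votes and rule hits are combined into a tier. The following is only a minimal sketch, assuming each document carries a Model_Anomaly_Count and an HRF_Count as named above; the function name, the tier labels and every threshold are hypothetical and are not taken from the project.

```python
def assign_priority_tier(model_anomaly_count: int, hrf_count: int) -> str:
    """Map model votes and HRF hits to a review tier.

    All thresholds and labels are illustrative assumptions.
    """
    if model_anomaly_count < 0 or hrf_count < 0:
        raise ValueError("counts must be non-negative")
    # Both signal families agree: review first.
    if model_anomaly_count >= 2 and hrf_count >= 1:
        return "Tier 1 - models and rules agree"
    # One signal family is strong on its own.
    if model_anomaly_count >= 2 or hrf_count >= 2:
        return "Tier 2 - strong single signal"
    # A single model vote or a single rule hit.
    if model_anomaly_count == 1 or hrf_count == 1:
        return "Tier 3 - weak signal"
    return "Not flagged"


# Hypothetical documents, for illustration only.
documents = [
    {"doc": "A", "Model_Anomaly_Count": 3, "HRF_Count": 2},
    {"doc": "B", "Model_Anomaly_Count": 0, "HRF_Count": 2},
    {"doc": "C", "Model_Anomaly_Count": 1, "HRF_Count": 0},
    {"doc": "D", "Model_Anomaly_Count": 0, "HRF_Count": 0},
]
for d in documents:
    d["Priority_Tier"] = assign_priority_tier(
        d["Model_Anomaly_Count"], d["HRF_Count"]
    )
    print(d["doc"], d["Priority_Tier"])
```

Keeping the tier logic in one small function makes the cut-offs easy to review and adjust with finance domain experts.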
Libraries:
joblib==1.4.2
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Feature importance calculated by Random Forest classifier considering the 80 features previously selected by Select K Best.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Unsupervised outlier detection constitutes a crucial phase within data analysis and remains an open area of research. A good outlier detection algorithm should be computationally efficient, robust to tuning parameter selection, and perform consistently well across diverse underlying data distributions. We introduce Boundary Peeling, an unsupervised outlier detection algorithm. Boundary Peeling uses the average signed distance from iteratively peeled, flexible boundaries generated by one-class support vector machines to flag outliers. The method is similar to convex hull peeling but well suited for high-dimensional data and has flexibility to adapt to different distributions. Boundary Peeling has robust hyperparameter settings and, for increased flexibility, can be cast as an ensemble method. In unimodal and multimodal synthetic data simulations Boundary Peeling outperforms all state of the art methods when no outliers are present while maintaining comparable or superior performance in the presence of outliers. Boundary Peeling performs competitively or better in terms of correct classification, AUC, and processing time using semantically meaningful benchmark datasets.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This Cybersecurity Intrusion Detection Dataset is designed for detecting cyber intrusions based on network traffic and user behavior. Below, I’ll explain each aspect in detail, including the dataset structure, feature importance, possible analysis approaches, and how it can be used for machine learning.
The dataset consists of network-based and user behavior-based features. Each feature provides valuable information about potential cyber threats.
These features describe network-level information such as packet size, protocol type, and encryption methods.
network_packet_size (Packet Size in Bytes)
protocol_type (Communication Protocol)
encryption_used (Encryption Protocol)
These features track user activities, such as login attempts and session duration.
login_attempts (Number of Logins)
session_duration (Session Length in Seconds)
failed_logins (Failed Login Attempts)
unusual_time_access (Login Time Anomaly)
0 or 1) indicating whether access happened at an unusual time.
ip_reputation_score (Trustworthiness of IP Address)
browser_type (User’s Browser)
attack_detected) 1 means an attack was detected, 0 means normal activity.
This dataset can be used for intrusion detection systems (IDS) and cybersecurity research. Some key applications include:
Supervised Learning Approaches
attack_detected as the target).
Deep Learning Approaches
If attack labels are missing, anomaly detection can be used:
- Autoencoders: Learn normal traffic and flag anomalies.
- Isolation Forest: Detects outliers based on feature isolation.
- One-Class SVM: Learns normal behavior and detects deviations.
http://www.gnu.org/licenses/agpl-3.0.html
❗️❗️❗️**The current version of SKAB (v0.9) contains 34 datasets with collective anomalies. But the upcoming update to v1.0 (probably up to the summer of 2021) will contain 300+ additional files with point and collective anomalies. It will make SKAB one of the largest changepoint-containing benchmarks, especially in the technical field.**
We propose the Skoltech Anomaly Benchmark (SKAB), designed for evaluating anomaly detection algorithms. SKAB allows working with two main problems (there are two markups for anomalies):
* Outlier detection (anomalies considered and marked up as single-point anomalies)
* Changepoint detection (anomalies considered and marked up as collective anomalies)

SKAB consists of the following artifacts:
* Datasets.
* Leaderboard (scoreboard).
* Python modules for algorithms’ evaluation.
* Notebooks: Python notebooks with anomaly detection algorithms.

The IIoT testbed system is located at the Skolkovo Institute of Science and Technology (Skoltech). All the details regarding the testbed and the experimenting process are presented in the following artifacts:
- Position paper (currently submitted for publication)
- Slides about the project
The SKAB v0.9 corpus contains 35 individual data files in .csv format. Each file represents a single experiment and contains a single anomaly. The dataset represents a multivariate time series collected from the sensors installed on the testbed. The data folder contains datasets from the benchmark. The structure of the data folder is presented in the structure file. The columns in each data file are as follows:
- datetime - Represents dates and times of the moment when the value is written to the database (YYYY-MM-DD hh:mm:ss)
- Accelerometer1RMS - Shows a vibration acceleration (Amount of g units)
- Accelerometer2RMS - Shows a vibration acceleration (Amount of g units)
- Current - Shows the amperage on the electric motor (Ampere)
- Pressure - Represents the pressure in the loop after the water pump (Bar)
- Temperature - Shows the temperature of the engine body (degrees Celsius)
- Thermocouple - Represents the temperature of the fluid in the circulation loop (degrees Celsius)
- Voltage - Shows the voltage on the electric motor (Volt)
- RateRMS - Represents the circulation flow rate of the fluid inside the loop (Liter per minute)
- anomaly - Shows if the point is anomalous (0 or 1)
- changepoint - Shows if the point is a changepoint for collective anomalies (0 or 1)
Here we propose the leaderboard for SKAB v0.9 both for outlier and changepoint detection problems. You can also present and evaluate your algorithm using SKAB on kaggle. The results in the tables are calculated in the python notebooks from the notebooks folder.
Sorted by F1; for F1, higher is better; for both FAR and MAR, lower is better.
| Algorithm | F1 | FAR, % | MAR, % |
|---|---|---|---|
| Perfect detector | 1 | 0 | 0 |
| T-squared+Q (PCA) | 0.67 | 13.95 | 36.32 |
| LSTM | 0.64 | 15.4 | 39.93 |
| MSCRED | 0.64 | 13.56 | 41.16 |
| T-squared | 0.56 | 12.14 | 52.56 |
| Autoencoder | 0.45 | 7.56 | 66.57 |
| Isolation forest | 0.4 | 6.86 | 72.09 |
| Null detector | 0 | 0 | 100 |
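To make the three columns concrete, here is a minimal sketch that computes F1, FAR and MAR from point-wise binary labels. It assumes the usual definitions, FAR = FP / (FP + TN) and MAR = FN / (FN + TP), both expressed in percent; the benchmark's own evaluation modules may differ in detail, and the function name and sample labels are illustrative.

```python
def detection_metrics(y_true, y_pred):
    """Return (F1, FAR %, MAR %) for binary point labels, where 1 = anomaly."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    far = 100.0 * fp / (fp + tn) if (fp + tn) else 0.0  # false alarm rate
    mar = 100.0 * fn / (fn + tp) if (fn + tp) else 0.0  # missing alarm rate
    return f1, far, mar


# Illustrative labels: three anomalous points out of ten.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 0, 0]
some_detector = [0, 1, 0, 0, 1, 1, 0, 0, 0, 0]
null_detector = [0] * len(y_true)  # never raises an alarm

print(detection_metrics(y_true, some_detector))
print(detection_metrics(y_true, y_true))         # perfect detector
print(detection_metrics(y_true, null_detector))  # null detector
```

Under these definitions the perfect and null detectors reproduce the boundary rows of the table: (1, 0, 0) and (0, 0, 100).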
Sorted by NAB (standard); for all metrics, higher is better.
| Algorithm | NAB (standard) | NAB (lowFP) | NAB (lowFN) |
|---|---|---|---|
| Perfect detector | 100 | 100 | 100 |
| Isolation forest | 37.53 | 17.09 | 45.02 |
| MSCRED | 28.74 | 23.43 | 31.21 |
| LSTM | 27.09 | 11.06 | 32.68 |
| T-squared+Q (PCA) | 26.71 | 22.42 | 28.32 |
| T-squared | 17.87 | 3.44 | 23.2 |
| ArimaFD | 16.06 | 14.03 | 17.12 |
| Autoencoder | 15.59 | 0.78 | 20.91 |
| Null detector | 0 | 0 | 0 |
The notebooks folder contains Python notebooks with the code for reproducing the leaderboard results.
We have calculated the results for five common anomaly detection algorithms:
- Hotelling's T-squared statistics;
- Hotelling's T-squared statistics + Q statistics based on PCA;
- Isolation forest;
- LSTM-based NN;
- Feed-Forward Autoencoder.

Additionally, results for the following algorithms were added to the repository:
- ArimaFD;
- MSCRED.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Supervised machine learning has been used to detect fine-scale animal behaviour from accelerometer data, but a standardised protocol for implementing this workflow is currently lacking. As the application of machine learning to ecological problems expands, it is essential to establish technical protocols and validation standards that align with those in other "big data" fields. Overfitting is a prevalent and often misunderstood challenge in machine learning. Overfit models overly adapt to the training data to memorise specific instances rather than to discern the underlying signal. Associated results can indicate high performance on the training set, yet these models are unlikely to generalise to new data. Overfitting can be detected through rigorous validation using independent test sets. Our systematic review of 119 studies using accelerometer-based supervised machine learning to classify animal behaviour reveals that 79% (94 papers) did not validate their models sufficiently wel...

We defined eligibility criteria as 'peer-reviewed primary research papers published 2013-present that use supervised machine learning to identify specific behaviours from raw, non-livestock animal accelerometer data'. We elected to ignore analysis of livestock behaviour as agricultural methods often operate within different constraints to the analyses conducted on wild animals and this body of literature has mostly developed in isolation to wild animal research. Our search was conducted on 27/09/2024. Initial keyword search across 3 databases (Google Scholar, PubMed, and Scopus) yielded 249 unique papers. Papers outside of the search criteria were excluded, including hardware and software advances, non-ML analysis, insufficient accelerometry application (e.g., research focused on other sensors with accelerometry providing minimal support), unsupervised methods, and research limited to activity intensity or active and inactive states; this resulted in 119 papers.

# Systematic review of validation of supervised machine learning models in accelerometer-based animal behaviour classification literature
https://doi.org/10.5061/dryad.fxpnvx14d
Description: Methods information from animal accelerometer-based behaviour classification literature utilising supervised machine learning techniques.
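The overfitting described above can be illustrated with a toy example. This is a sketch on synthetic data, not a method from the review: a 1-nearest-neighbour classifier memorises random labels, so it scores perfectly on its own training set while an independent test set shows that nothing generalisable was learned. All names and sizes are illustrative.

```python
import random


def nearest_neighbour_predict(train, x):
    """Return the label of the training point closest to x (1-nearest neighbour)."""
    return min(train, key=lambda item: abs(item[0] - x))[1]


def accuracy(train, data):
    """Fraction of points in `data` whose label the memorised model reproduces."""
    hits = sum(1 for x, y in data if nearest_neighbour_predict(train, x) == y)
    return hits / len(data)


rng = random.Random(0)
# Labels are pure noise: there is no feature-label signal to learn.
data = [(rng.random(), rng.randint(0, 1)) for _ in range(400)]
train, test = data[:300], data[300:]

train_acc = accuracy(train, train)  # each training point is its own nearest neighbour
test_acc = accuracy(train, test)    # held-out points, never seen during "training"
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

Reporting only the training accuracy here would suggest a perfect model; the held-out score is what exposes the problem.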
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset, commonly known as creditcard.csv, contains anonymized credit card transactions made by European cardholders in September 2013. It includes 284,807 transactions, with 492 labeled as fraudulent. Due to confidentiality constraints, features have been transformed using PCA, except for 'Time' and 'Amount'.
This dataset was used in the research article titled "A Hybrid Anomaly Detection Framework Combining Supervised and Unsupervised Learning for Credit Card Fraud Detection". The study proposes an ensemble model integrating techniques such as Autoencoders, Isolation Forest, Local Outlier Factor, and supervised classifiers including XGBoost and Random Forest, aiming to improve the detection of rare fraudulent patterns while maintaining efficiency and scalability.
Key Features:
- 30 numerical input features (V1–V28, Time, Amount)
- Class label indicating fraud (1) or normal (0)
- Imbalanced class distribution typical in real-world fraud detection

Use Case: Ideal for benchmarking and evaluating anomaly detection and classification algorithms in highly imbalanced data scenarios.
Source: Originally published by the Machine Learning Group at Université Libre de Bruxelles.
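To make the imbalance concrete, the short calculation below uses only the two counts quoted above (284,807 transactions, 492 fraudulent). It shows why plain accuracy is a misleading metric here: a hypothetical classifier that labels every transaction as normal is almost always "right" yet finds no fraud.

```python
TOTAL_TRANSACTIONS = 284_807
FRAUD_TRANSACTIONS = 492

fraud_rate = FRAUD_TRANSACTIONS / TOTAL_TRANSACTIONS

# A trivial baseline that predicts "normal" for every transaction.
baseline_accuracy = (TOTAL_TRANSACTIONS - FRAUD_TRANSACTIONS) / TOTAL_TRANSACTIONS
baseline_recall = 0 / FRAUD_TRANSACTIONS  # it never flags a fraudulent transaction

print(f"fraud rate:        {fraud_rate:.4%}")
print(f"baseline accuracy: {baseline_accuracy:.4%}")
print(f"baseline recall:   {baseline_recall:.0%}")
```

Metrics that focus on the positive class, such as precision, recall or the area under the precision-recall curve, are usually more informative for data this skewed.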
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Feature importance calculated by Random Forest classifier considering the 80 features selected by Select K Best by domain.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The FraudGuardSynthetic2025 dataset is a synthetically generated dataset designed for machine learning and data analysis tasks focused on detecting credit card fraud. Created in September 2025, it simulates realistic transaction profiles based on established fraud detection patterns, drawing inspiration from financial data studies and existing datasets. With 1,000 records, it features a highly imbalanced target (~0.2% fraud cases) to mirror real-world fraud prevalence, making it ideal for anomaly detection, classification, and preprocessing practice in educational and research settings.
is_fraud (binary: 0 = Non-fraud, 1 = Fraud)

| Column Name | Type | Description |
|---|---|---|
| transaction_id | Integer | Unique identifier for each transaction (1 to 1,000). |
| amount | Float | Transaction amount in USD (0.01 to 10,000, skewed toward smaller values). |
| time | Integer | Seconds since first transaction (0 to 86,400, simulating one day). |
| merchant_category | Categorical | Merchant type: Retail, Online, Travel, Food, Other. |
| cardholder_age | Integer | Cardholder age in years (18 to 90, skewed toward 25-50). |
| cardholder_zip | Integer | Cardholder ZIP code (10000 to 99999, synthetic US-based). |
| distance | Float | Distance (km) between cardholder and merchant (0 to 500, mean ~50). |
| is_online | Binary | Online transaction: 0 = In-person, 1 = Online (~30% online). |
| card_type | Categorical | Card type: Credit, Debit (~60% credit). |
| transaction_hour | Integer | Hour of transaction (0 to 23, skewed toward daytime). |
| is_fraud | Binary | Target variable: 0 = Non-fraud, 1 = Fraud (~0.2% positive cases). |
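Columns of different types in the table above usually need different preparation before modelling. The sketch below applies min-max scaling to two numeric columns and one-hot encoding to the two categorical columns, using only the standard library. The sample rows and helper names are hypothetical; the category lists come from the table.

```python
MERCHANT_CATEGORIES = ["Retail", "Online", "Travel", "Food", "Other"]
CARD_TYPES = ["Credit", "Debit"]


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values, categories):
    """Encode each value as a 0/1 vector with one position per category."""
    return [[1 if v == c else 0 for c in categories] for v in values]


# Hypothetical rows, for illustration only.
rows = [
    {"amount": 12.5, "distance": 3.0, "merchant_category": "Food", "card_type": "Credit"},
    {"amount": 250.0, "distance": 40.0, "merchant_category": "Online", "card_type": "Debit"},
    {"amount": 9800.0, "distance": 480.0, "merchant_category": "Travel", "card_type": "Credit"},
]

amount = min_max_scale([r["amount"] for r in rows])
distance = min_max_scale([r["distance"] for r in rows])
merchant = one_hot([r["merchant_category"] for r in rows], MERCHANT_CATEGORIES)
card = one_hot([r["card_type"] for r in rows], CARD_TYPES)

# One numeric feature vector per transaction.
features = [[a, d] + m + c for a, d, m, c in zip(amount, distance, merchant, card)]
for vector in features:
    print(vector)
```

In practice the scaling bounds should be learned from the training split only and then reused on the test split, so that no information leaks from held-out data.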
This dataset is inspired by fraud detection patterns from 2025 financial literature (e.g., Federal Reserve, IEEE studies on Nigerian datasets) and existing datasets like Kaggle Credit Card Fraud Detection (2018, with 2021 simulator), Kartik2112 simulated transactions (2020), and new 2025 releases (e.g., Figshare creditcard.csv, Synthesized GDPR-compliant synthetic data). It incorporates trends like hybrid feature selection and explainable AI.
amount, distance, time) require scaling; categorical features (merchant_category, card_type) need encoding (e.g., one-hot).

This dataset is provided for educational and research purposes under a Creative Commons Attribution 4.0 International License (CC BY 4.0).
For questions or to request expanded datasets, contact the creator via the platform where this dataset is hosted.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Input Variables: Physicochemical properties (e.g., pH, alcohol content, acidity).
Output Variable: Sensory ratings (quality), which are ordered categories.

Classification or Regression:
Treat the output as a categorical variable (classification) or as a continuous score (regression).

Outlier Detection:
Identify outliers (e.g., excellent or poor wines) using techniques like Isolation Forest or Local Outlier Factor (LOF).

Feature Selection:
Apply methods such as Recursive Feature Elimination (RFE), LASSO, or tree-based feature importance to identify relevant features.
Try models like Logistic Regression, Decision Trees, Random Forest, or Gradient Boosting.
Use Linear Regression, SVR, or Tree-based models like Random Forest Regressor.
Analyze which features contribute the most to the predictions to aid in understanding the data.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains synthetic credit card transaction data designed for fraud detection and machine learning research. With over 6.3 million transactions, it provides a realistic simulation of financial transaction patterns including both legitimate and fraudulent activities.
This is a synthetic dataset generated to simulate credit card transaction behavior. The data represents financial transactions over a 30-day period (743 hours) with various transaction types including payments, transfers, cash-outs, debits, and cash-ins.
The dataset is specifically designed for:
- Training and testing fraud detection models
- Anomaly detection research
- Binary classification tasks
- Imbalanced learning scenarios
- Financial machine learning applications

This dataset exhibits significant class imbalance with only 0.13% fraudulent transactions. This mirrors real-world fraud detection scenarios where fraudulent transactions are rare. Consider using techniques such as:
- SMOTE (Synthetic Minority Over-sampling Technique)
- Undersampling of majority class
- Cost-sensitive learning
- Ensemble methods
- Anomaly detection algorithms
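Two of the techniques listed above, cost-sensitive learning and undersampling of the majority class, can be sketched with the standard library alone. This is an illustration under assumptions: the toy label vector simply reproduces the stated 0.13% fraud rate, the function names are hypothetical, and the weighting rule is the common "balanced" heuristic n_samples / (n_classes * class_count).

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency ('balanced' heuristic)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


def random_undersample(rows, labels, seed=0):
    """Keep every minority row and an equally sized random sample of other classes."""
    rng = random.Random(seed)
    by_class = {}
    for row, y in zip(rows, labels):
        by_class.setdefault(y, []).append(row)
    target = min(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for y, members in by_class.items():
        for row in rng.sample(members, target):
            out_rows.append(row)
            out_labels.append(y)
    return out_rows, out_labels


# Toy data: 13 fraud cases in 10,000 rows, i.e. 0.13% positives.
labels = [1] * 13 + [0] * 9_987
rows = list(range(len(labels)))  # stand-ins for real feature rows

weights = balanced_class_weights(labels)
sampled_rows, sampled_labels = random_undersample(rows, labels)

print("class weights:", weights)
print("class counts after undersampling:", Counter(sampled_labels))
```

Class weights keep all the data and change the loss instead, whereas undersampling discards most legitimate transactions; which trade-off is preferable depends on the model and the dataset size.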
This dataset is well-suited for:
- Logistic Regression
- Random Forest
- Gradient Boosting (XGBoost, LightGBM, CatBoost)
- Neural Networks
- Isolation Forest
- Autoencoders
- Support Vector Machines
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the dataset
df = pd.read_csv('/kaggle/input/credit-card-fraud-dataset/Fraud.csv')
# Display basic information
print(df.info())
print(df.head())
# Check fraud distribution
print(df['isFraud'].value_counts())
# Visualize fraud distribution
plt.figure(figsize=(8, 5))
sns.countplot(data=df, x='isFraud')
plt.title('Distribution of Fraud vs Legitimate Transactions')
plt.xlabel('Is Fraud (0=No, 1=Yes)')
plt.ylabel('Count')
plt.show()
# Transaction type distribution
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='type', hue='isFraud')
plt.title('Transaction Types by Fraud Status')
plt.xticks(rotation=45)
plt.show()
This is a static dataset with no planned future updates. It serves as a benchmark for fraud detection research and model development.
This dataset is made available under the MIT License for educational and research purposes in the field of fraud detection and financial machine learning.
📌 Overview
This dataset provides a real-world representation of credit card transactions, labeled as fraudulent or legitimate. It is designed to aid in the development of machine learning models for fraud detection and financial security applications. Given the rising cases of online fraud, detecting suspicious transactions is crucial for financial institutions.
This dataset allows users to experiment with various fraud detection techniques, such as supervised and unsupervised learning models, anomaly detection, and pattern recognition.
📊 Dataset Details
- Number of Transactions: 1852394
- Number of Features: 23
- Fraudulent Transactions: Contains transactions labeled as is_fraud = 1 for fraud and is_fraud = 0 for legitimate payments.

📁 Columns Explained

Transaction Information:
- trans_date_trans_time – Timestamp of the transaction
- cc_num – Unique (anonymized) credit card number
- merchant – Merchant where the transaction occurred
- category – Type of transaction (e.g., travel, food, personal care)
- amt – Transaction amount

Cardholder Details:
- first, last – First and last name of the cardholder
- gender – Gender of the cardholder
- street, city, state, zip – Address of the cardholder
- lat, long – Geographical location of the cardholder
- city_pop – Population of the cardholder’s city
- job – Profession of the cardholder
- dob – Date of birth of the cardholder

Transaction Identifiers & Timing:
- trans_num – Unique transaction identifier
- unix_time – Timestamp of transaction in Unix format

Merchant Details:
- merch_lat, merch_long – Merchant's location (latitude & longitude)

Fraud Indicator:
- is_fraud – Target variable (1 = Fraud, 0 = Legitimate)

🎯 Usage
This dataset is ideal for:
✅ Fraud detection research
✅ Machine learning model development
✅ Anomaly detection projects
✅ Financial analytics

🛠️ Suggested Machine Learning Approaches

Supervised Learning:
- Logistic Regression
- Decision Trees / Random Forest
- XGBoost / LightGBM
- Deep Learning (Neural Networks)

Unsupervised Learning:
- Autoencoders
- Isolation Forest
- DBSCAN for anomaly detection

Feature Engineering Ideas:
- Creating transaction frequency features
- Aggregating spending behavior per merchant/category
- Analyzing location-based fraud patterns
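Two of the feature engineering ideas above can be sketched as follows: a per-card transaction frequency over the preceding 24 hours, and the great-circle distance between cardholder and merchant. The column names (cc_num, unix_time, lat, long, merch_lat, merch_long) are the ones listed in this description; the function names, the 24-hour window, the new feature names and the sample rows are assumptions made for illustration.

```python
import math
from collections import defaultdict


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def add_features(transactions, window_seconds=86_400):
    """Add a per-card count of earlier transactions inside the window, plus distance."""
    history = defaultdict(list)  # cc_num -> unix_time values already processed
    enriched = []
    for tx in sorted(transactions, key=lambda t: t["unix_time"]):
        earlier = history[tx["cc_num"]]
        recent = sum(1 for t in earlier if tx["unix_time"] - t <= window_seconds)
        enriched.append({
            **tx,
            "tx_count_prev_24h": recent,
            "distance_km": haversine_km(tx["lat"], tx["long"], tx["merch_lat"], tx["merch_long"]),
        })
        earlier.append(tx["unix_time"])
    return enriched


# Hypothetical transactions, for illustration only.
sample = [
    {"cc_num": 1, "unix_time": 0, "lat": 0.0, "long": 0.0, "merch_lat": 0.0, "merch_long": 1.0},
    {"cc_num": 1, "unix_time": 3_600, "lat": 0.0, "long": 0.0, "merch_lat": 0.0, "merch_long": 0.0},
    {"cc_num": 2, "unix_time": 4_000, "lat": 10.0, "long": 10.0, "merch_lat": 10.0, "merch_long": 10.0},
    {"cc_num": 1, "unix_time": 200_000, "lat": 0.0, "long": 0.0, "merch_lat": 0.0, "merch_long": 0.0},
]
for row in add_features(sample):
    print(row["cc_num"], row["tx_count_prev_24h"], round(row["distance_km"], 1))
```

Because each count only looks at transactions that happened earlier, the feature can be computed at prediction time without leaking future information.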
⚠️ Disclaimer
This dataset has been anonymized and should be used strictly for research and educational purposes. It does not contain any real-world personal information, and the credit card numbers have been randomly generated for simulation purposes.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Classification metrics per posture achieved using the best models selected by grid search in Classifier 1 on the test set.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
HVAC Fault Detection Dataset
⚠️ Synthetic Data Disclaimer: This dataset contains synthetically generated data for demonstration and testing purposes. It does not represent real equipment faults or actual building systems.
Overview
Anomaly detection results from HVAC equipment monitoring using Isolation Forest. This dataset includes detected faults, anomaly scores, and equipment status.
Schema
{ "pipeline": "hvac_fault_detection_anomaly", "generated_at":… See the full description on the dataset page: https://huggingface.co/datasets/shahabsalehi/hvac-fault-detection.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This study examines the application of machine learning algorithms to enhance financial inclusion in microfinance, focusing on credit scoring, risk and fraud detection, and customer segmentation. We performed feature engineering and employed models such as Logistic Regression, Decision Trees, Random Forests, Gradient Boosting Machines (XGBoost and LightGBM), Support Vector Machines (SVM), Autoencoders, Isolation Forests, and K-means Clustering. LightGBM achieved the highest accuracy (89.6%) and AUC (0.92) in credit scoring, while Random Forests demonstrated strong performance in both loan approval (86.7% accuracy) and fraud detection (87.6% accuracy, AUC of 0.88). SVM also performed competitively, and unsupervised methods like Autoencoders and Isolation Forests showed potential for anomaly detection but required further refinement. K-means Clustering excelled in customer segmentation with a silhouette score of 0.72, enabling tailored services based on client demographics. Our findings highlight the significant impact of machine learning on improving credit scoring accuracy, reducing fraud risks, and enhancing customer service delivery in microfinance, thereby promoting financial inclusion for underserved populations. Ethical considerations and model interpretability are crucial, particularly for smaller institutions. This study advocates for the broader adoption of machine learning in the microfinance sector.