The dataset contains transactions made by credit cards in September 2013 by European cardholders. It presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features or more background information about the data. Features V1, V2, … V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. The feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. The feature 'Class' is the response variable and takes the value 1 in case of fraud and 0 otherwise.
Given the class imbalance ratio, we recommend measuring performance using the Area Under the Precision-Recall Curve (AUPRC). Confusion-matrix accuracy is not meaningful for unbalanced classification.
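As a quick illustration of the recommended evaluation, the sketch below computes the AUPRC with scikit-learn; the local file name creditcard.csv, the choice of classifier, and the split parameters are illustrative assumptions, not part of the dataset description.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Load the Kaggle CSV (assumed to be saved locally as "creditcard.csv").
df = pd.read_csv("creditcard.csv")
X = df.drop(columns=["Class"])
y = df["Class"]

# Stratified split so the 0.172% fraud rate is preserved in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=42)
clf.fit(X_train, y_train)

# AUPRC ("average precision") is computed from the predicted fraud probabilities.
scores = clf.predict_proba(X_test)[:, 1]
print("AUPRC:", average_precision_score(y_test, scores))
```

One rough way to use 'Amount' for cost-sensitive training is to pass it as sample_weight to fit(), so that missing a large fraudulent transaction is penalized more heavily during training.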
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
This dataset was created by Rupeswara Babu Sangoju
Released under Apache 2.0
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
The following is a draft DMP-style description of the credit-card fraud detection experiment:
Research Domain
This work resides in the domain of financial fraud detection and applied machine learning. We focus on detecting anomalous credit‐card transactions in real time to reduce financial losses and improve trust in digital payment systems.
Purpose
The goal is to train and evaluate a binary classification model that flags potentially fraudulent transactions. By publishing both the code and data splits via FAIR repositories, we enable reproducible benchmarking of fraud‐detection algorithms and support future research on anomaly detection in transaction data.
Data Sources
We used the publicly available credit‐card transaction dataset from Kaggle (original source: https://www.kaggle.com/mlg-ulb/creditcardfraud), which contains anonymized transactions made by European cardholders over two days in September 2013. The dataset includes 284,807 transactions, of which 492 are fraudulent.
Method of Dataset Preparation
Schema validation: Renamed columns to snake_case (e.g. transaction_amount, is_declined) so they conform to DBRepo's requirements.
Data import: Uploaded the full CSV into DBRepo and assigned persistent identifiers (PIDs).
Splitting: Programmatically derived three subsets: training (70%), validation (15%), and test (15%), using range-based filters on the primary key actionnr. Each subset was materialized in DBRepo and assigned its own PID for precise citation.
Cleaning: Converted the categorical flags (is_declined, isforeigntransaction, ishighriskcountry, isfradulent) from "Y"/"N" to 1/0 and dropped non-feature identifiers (actionnr, merchant_id).
Modeling: Trained a RandomForest classifier on the training split, tuned on validation, and evaluated on the held‐out test set.
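A minimal sketch of the cleaning, splitting, and modeling steps just described is given below, assuming a local CSV export with the column names listed under Dataset Structure; the DBRepo upload and PID assignment are omitted, and the file name is hypothetical.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical local copy of the raw data (the canonical copy lives in DBRepo).
df = pd.read_csv("creditcard_fraud.csv")

# Cleaning: map the "Y"/"N" flags to 1/0 and drop the merchant identifier.
flag_cols = ["is_declined", "isforeigntransaction", "ishighriskcountry", "isfradulent"]
df[flag_cols] = (df[flag_cols] == "Y").astype(int)
df = df.sort_values("actionnr").drop(columns=["merchant_id"])

# Splitting: 70/15/15 by ranges of the primary key, mirroring the range-based filters.
n = len(df)
train = df.iloc[: int(0.70 * n)]
valid = df.iloc[int(0.70 * n): int(0.85 * n)]
test = df.iloc[int(0.85 * n):]

def features_and_label(part):
    # actionnr is only used for splitting, not as a model feature.
    return part.drop(columns=["actionnr", "isfradulent"]), part["isfradulent"]

X_train, y_train = features_and_label(train)
X_valid, y_valid = features_and_label(valid)
X_test, y_test = features_and_label(test)

# Modeling: RandomForest on the training split; the validation split would be
# used for hyperparameter tuning before the final test evaluation.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
print("validation accuracy:", clf.score(X_valid, y_valid))
print("test accuracy:", clf.score(X_test, y_test))
```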
Dataset Structure
The raw data is a single CSV with columns:
actionnr (integer transaction ID)
merchant_id (string)
average_amount_transaction_day (float)
transaction_amount (float)
is_declined, isforeigntransaction, ishighriskcountry, isfradulent (binary flags)
total_number_of_declines_day, daily_chargeback_avg_amt, sixmonth_avg_chbk_amt, sixmonth_chbk_freq (numeric features)
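For reference, a small sketch of loading this CSV with explicit dtypes matching the column list above; the file name and the use of float64 for the numeric features are assumptions.

```python
import pandas as pd

# Explicit dtypes for the listed columns; the "Y"/"N" flags are read as strings
# here and only converted to 1/0 during the cleaning step.
dtypes = {
    "actionnr": "int64",
    "merchant_id": "string",
    "average_amount_transaction_day": "float64",
    "transaction_amount": "float64",
    "is_declined": "string",
    "isforeigntransaction": "string",
    "ishighriskcountry": "string",
    "isfradulent": "string",
    "total_number_of_declines_day": "float64",
    "daily_chargeback_avg_amt": "float64",
    "sixmonth_avg_chbk_amt": "float64",
    "sixmonth_chbk_freq": "float64",
}
df = pd.read_csv("creditcard_fraud.csv", dtype=dtypes)  # hypothetical file name
assert set(df.columns) == set(dtypes), "unexpected column set"
```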
Naming Conventions
All columns use lowercase snake_case.
Subsets are named creditcard_training, creditcard_validation, and creditcard_test in DBRepo.
Files in the code repo follow a clear structure:
├── data/ # local copies only; raw data lives in DBRepo
├── notebooks/Task.ipynb
├── models/rf_model_v1.joblib
├── outputs/ # confusion_matrix.png, roc_curve.png, predictions.csv
├── README.md
├── requirements.txt
└── codemeta.json
Required Software
Python 3.9+
pandas, numpy (data handling)
scikit-learn (modeling, metrics)
matplotlib (visualizations)
dbrepo_client.py (DBRepo API)
requests (TU WRD API)
Additional Resources
Original dataset: https://www.kaggle.com/mlg-ulb/creditcardfraud
Scikit-learn docs: https://scikit-learn.org/stable
DBRepo API guide: via the starter notebook's dbrepo_client.py template
TU WRD REST API spec: https://test.researchdata.tuwien.ac.at/api/docs
Data Limitations
Highly imbalanced: only ~0.17% of transactions are fraudulent.
Anonymized PCA features (V1–V28) are not available; we extended the data with domain features but cannot reverse-engineer the raw variables.
Time‐bounded: the data covers only two days of transactions and may not capture seasonal patterns.
Licensing and Attribution
Raw data: CC-0 (per Kaggle terms)
Code & notebooks: MIT License
Model artifacts & outputs: CC-BY 4.0
TU WRD records include ORCID identifiers for the author.
Recommended Uses
Benchmarking new fraud‐detection algorithms on a standard imbalanced dataset.
Educational purposes: demonstrating model‐training pipelines, FAIR data practices.
Extension: adding time‐series or deep‐learning models.
Known Issues
Possible temporal leakage if date/time features are not handled correctly.
Model performance may degrade on live data due to concept drift.
Binary flags may oversimplify nuanced transaction outcomes.
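To illustrate the temporal-leakage point above, here is a minimal sketch of a time-ordered split in which the model only ever trains on earlier transactions; it uses the ULB dataset's 'Time' column as the ordering key and an 80/20 cut-off, both of which are illustrative assumptions.

```python
import pandas as pd

# Order transactions by time and cut once, so the model never sees "future" data.
df = pd.read_csv("creditcard.csv").sort_values("Time")

cutoff = int(0.8 * len(df))          # 80/20 time-ordered split (illustrative)
train, test = df.iloc[:cutoff], df.iloc[cutoff:]

# Every test transaction happens at or after the last training transaction,
# so no future information leaks into model fitting.
assert train["Time"].max() <= test["Time"].min()
```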
MIT License (https://opensource.org/licenses/MIT)
As a data contributor, I'm sharing this crucial dataset focused on the detection of fraudulent credit card transactions. Recognizing these illicit activities is paramount for protecting customers and the integrity of financial systems.
About the Dataset:
This dataset encompasses credit card transactions made by European cardholders during a two-day period in September 2013. It presents a real-world scenario with a significant class imbalance, where fraudulent transactions are considerably less frequent than legitimate ones. Out of a total of 284,807 transactions, only 492 are instances of fraud, representing a mere 0.172% of the entire dataset.
Content of the Data:
Due to confidentiality concerns, the majority of the input features in this dataset have undergone a Principal Component Analysis (PCA) transformation. This means the original meaning and context of features V1, V2, ..., V28 are not directly provided. However, these principal components capture the variance in the underlying transaction data.
The only features that have not been transformed by PCA are 'Time' (the seconds elapsed between each transaction and the first transaction in the dataset) and 'Amount' (the transaction amount).
The target variable for this classification task is 'Class', which takes the value 1 in case of fraud and 0 otherwise.
Important Note on Evaluation:
Given the substantial class imbalance (far more legitimate transactions than fraudulent ones), traditional accuracy metrics based on the confusion matrix can be misleading. It is strongly recommended to evaluate models using the Area Under the Precision-Recall Curve (AUPRC), as this metric is more sensitive to the performance on the minority class (fraudulent transactions).
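The sketch below makes this concrete: a classifier that never flags fraud reaches roughly 99.8% accuracy on this data, while its AUPRC collapses to the positive-class rate; the local file name is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, average_precision_score

df = pd.read_csv("creditcard.csv")   # assumed local copy of the dataset
y = df["Class"].to_numpy()

# A "model" that never flags fraud: hard labels all 0, scores all identical.
never_fraud_labels = np.zeros_like(y)
uninformative_scores = np.zeros(len(y))

print("accuracy:", accuracy_score(y, never_fraud_labels))            # ~0.998
print("AUPRC:", average_precision_score(y, uninformative_scores))    # ~0.0017, the fraud rate
```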
How to Use This Dataset:
Acknowledgements and Citation:
This dataset has been collected and analyzed through a research collaboration between Worldline and the Machine Learning Group (MLG) of ULB (Université Libre de Bruxelles).
When using this dataset in your research or projects, please cite the following works as appropriate:
This dataset was created by Shuvom Dhar
This dataset was created by Emily Smith
Released under Data files © Original Authors
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
The dataset was released by [1] and was collected and analysed during a research collaboration between Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of Université Libre de Bruxelles (ULB) on big data mining and fraud detection. [1] Dal Pozzolo, A., Caelen, O., Johnson, R. A., and Bontempi, G. (2015). Calibrating Probability with Undersampling for Unbalanced Classification. 2015 IEEE Symposium Series on Computational Intelligence, pp. 159-166, doi: 10.1109/SSCI.2015.33. Available on Kaggle: https://www.kaggle.com/mlg-ulb/creditcardfraud
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Title: Credit Card Transactions Dataset for Fraud Detection (Used in: A Hybrid Anomaly Detection Framework Combining Supervised and Unsupervised Learning)
Description: This dataset, commonly known as creditcard.csv, contains anonymized credit card transactions made by European cardholders in September 2013. It includes 284,807 transactions, with 492 labeled as fraudulent. Due to confidentiality constraints, features have been transformed using PCA, except for 'Time' and 'Amount'. This dataset was used in the research article titled "A Hybrid Anomaly Detection Framework Combining Supervised and Unsupervised Learning for Credit Card Fraud Detection". The study proposes an ensemble model integrating techniques such as Autoencoders, Isolation Forest, Local Outlier Factor, and supervised classifiers including XGBoost and Random Forest, aiming to improve the detection of rare fraudulent patterns while maintaining efficiency and scalability.
Key Features: 30 numerical input features (V1–V28, Time, Amount); a Class label indicating fraud (1) or normal (0); an imbalanced class distribution typical of real-world fraud detection.
Use Case: Ideal for benchmarking and evaluating anomaly detection and classification algorithms in highly imbalanced data scenarios.
Source: Originally published by the Machine Learning Group at Université Libre de Bruxelles. https://www.kaggle.com/mlg-ulb/creditcardfraud
License: This dataset is distributed for academic and research purposes only. Please cite the original source when using the dataset.
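This is not the article's implementation, but a minimal sketch of the general idea of feeding an unsupervised anomaly score into a supervised classifier, using scikit-learn's IsolationForest and RandomForest; the file name, split, and hyperparameters are illustrative assumptions.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("creditcard.csv")
X, y = df.drop(columns=["Class"]), df["Class"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# Unsupervised stage: fit on the training data only, then expose the anomaly
# score (higher = more anomalous) as an extra feature.
iso = IsolationForest(n_estimators=100, random_state=1).fit(X_train)
X_train = X_train.assign(anomaly_score=-iso.score_samples(X_train))
X_test = X_test.assign(anomaly_score=-iso.score_samples(X_test))

# Supervised stage: a classifier trained on the augmented feature set.
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=1)
clf.fit(X_train, y_train)
print("AUPRC:", average_precision_score(y_test, clf.predict_proba(X_test)[:, 1]))
```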
This dataset was created by Karthik
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
This dataset was created by gowtham battineedi
Released under Apache 2.0
We provide a data set in CSV format. It contains 2 lakh+ (200,000+) training instances and 56 thousand test instances. There are 31 input features, labeled V1 to V28 and Amount.
The target variable is labeled Class.
Create a Classification model to predict the target variable Class.
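A minimal baseline sketch for this task follows; the file names train.csv and test.csv are assumptions, since the actual file names are not specified above.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

train = pd.read_csv("train.csv")   # assumed file names
test = pd.read_csv("test.csv")

X_train, y_train = train.drop(columns=["Class"]), train["Class"]

# class_weight="balanced" counteracts the heavy class imbalance.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
model.fit(X_train, y_train)

# Predict the test instances; drop "Class" if the test file happens to include it.
X_test = test.drop(columns=["Class"], errors="ignore")
pd.DataFrame({"Class": model.predict(X_test)}).to_csv("predictions.csv", index=False)
```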
This dataset was created by BELLAS MUGOH
This dataset was created by Thiyagarajan Palaniyappan
This dataset was created by Kunal.Manore.
This dataset was created by Rahul Aggarwal
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
This dataset was created by Md Feroz Ahmed
Released under Apache 2.0
Open Database License (ODbL) v1.0 (https://www.opendatacommons.org/licenses/odbl/1.0/)
This dataset was created by Binayatosh Panigrahi
Released under Database: Open Database, Contents: © Original Authors
This dataset was created by Manish Kumar
It contains the following files:
This dataset was created by Sinoth Mabasa
This dataset was created by Narendra Parmanand Bariha
Released under Other (specified in description)