Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Below is a draft DMP-style description of your credit-card fraud detection experiment, modeled on the antiquities example:
Research Domain
This work resides in the domain of financial fraud detection and applied machine learning. We focus on detecting anomalous credit-card transactions in real time to reduce financial losses and improve trust in digital payment systems.
Purpose
The goal is to train and evaluate a binary classification model that flags potentially fraudulent transactions. By publishing both the code and data splits via FAIR repositories, we enable reproducible benchmarking of fraud-detection algorithms and support future research on anomaly detection in transaction data.
Data Sources
We used the publicly available credit-card transaction dataset from Kaggle (original source: https://www.kaggle.com/mlg-ulb/creditcardfraud), which contains anonymized transactions made by European cardholders over two days in September 2013. The dataset includes 284,807 transactions, of which 492 are fraudulent.
Method of Dataset Preparation
Schema validation: Renamed columns to snake_case (e.g. transaction_amount, is_declined) so they conform to DBRepo's requirements.
Data import: Uploaded the full CSV into DBRepo and assigned persistent identifiers (PIDs).
Splitting: Programmatically derived three subsets (training 70%, validation 15%, test 15%) using range-based filters on the primary key actionnr. Each subset was materialized in DBRepo and assigned its own PID for precise citation.
Cleaning: Converted the categorical flags (is_declined, isforeigntransaction, ishighriskcountry, isfradulent) from 'Y'/'N' to 1/0 and dropped non-feature identifiers (actionnr, merchant_id).
Modeling: Trained a RandomForest classifier on the training split, tuned it on the validation split, and evaluated it on the held-out test set.
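The splitting and cleaning steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the notebook's actual code: the helper names `split_bounds`, `range_filter`, and `encode_flags` are hypothetical, the SQL-style filter string is only a guess at how a range-based subset might be expressed, and the sketch assumes `actionnr` values are contiguous integers.

```python
FLAG_COLUMNS = ("is_declined", "isforeigntransaction", "ishighriskcountry", "isfradulent")
ID_COLUMNS = ("actionnr", "merchant_id")
YES_NO = {"Y": 1, "N": 0}


def split_bounds(min_id, max_id, train_pct=70, val_pct=15):
    """Return inclusive (low, high) actionnr ranges for the three subsets.

    Assumes contiguous integer keys; the test subset takes whatever remains.
    """
    n = max_id - min_id + 1
    train_end = min_id + n * train_pct // 100 - 1
    val_end = train_end + n * val_pct // 100
    return {
        "creditcard_training": (min_id, train_end),
        "creditcard_validation": (train_end + 1, val_end),
        "creditcard_test": (val_end + 1, max_id),
    }


def range_filter(low, high, key="actionnr"):
    """Build a range-based filter expression (hypothetical SQL-style syntax)."""
    return f"{key} BETWEEN {low} AND {high}"


def encode_flags(row):
    """Drop identifier columns and map 'Y'/'N' flags to 1/0.

    Raises KeyError on any flag value other than 'Y' or 'N'.
    """
    cleaned = {k: v for k, v in row.items() if k not in ID_COLUMNS}
    for col in FLAG_COLUMNS:
        cleaned[col] = YES_NO[cleaned[col]]
    return cleaned


bounds = split_bounds(1, 100)
filters = {name: range_filter(*rng) for name, rng in bounds.items()}
```

Splitting on key ranges rather than random sampling keeps each subset reproducible from a single filter expression, which is what makes a per-subset PID meaningful; the trade-off is that the class balance of each range is not controlled.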
Dataset Structure
The raw data is a single CSV with columns:
actionnr (integer transaction ID)
merchant_id (string)
average_amount_transaction_day (float)
transaction_amount (float)
is_declined, isforeigntransaction, ishighriskcountry, isfradulent (binary flags)
total_number_of_declines_day, daily_chargeback_avg_amt, sixmonth_avg_chbk_amt, sixmonth_chbk_freq (numeric features)
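A column layout like the one above can be checked before upload. The following is a rough sketch using only the standard library; `SCHEMA`, `yes_no`, and `validate_csv` are hypothetical names, and treating the four numeric features as floats and the raw flags as 'Y'/'N' strings is an assumption drawn from the description rather than from the actual file.

```python
import csv
import io


def yes_no(value):
    """Accept only the raw flag encodings 'Y' and 'N'."""
    if value not in ("Y", "N"):
        raise ValueError(f"expected 'Y' or 'N', got {value!r}")
    return value


# Column name -> converter; numeric feature types are assumed to be float.
SCHEMA = {
    "actionnr": int,
    "merchant_id": str,
    "average_amount_transaction_day": float,
    "transaction_amount": float,
    "is_declined": yes_no,
    "isforeigntransaction": yes_no,
    "ishighriskcountry": yes_no,
    "isfradulent": yes_no,
    "total_number_of_declines_day": float,
    "daily_chargeback_avg_amt": float,
    "sixmonth_avg_chbk_amt": float,
    "sixmonth_chbk_freq": float,
}


def validate_csv(text):
    """Parse CSV text and convert each column, failing loudly on mismatches."""
    reader = csv.DictReader(io.StringIO(text))
    missing = set(SCHEMA) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    for line_no, raw in enumerate(reader, start=2):  # line 1 is the header
        row = {}
        for name, cast in SCHEMA.items():
            try:
                row[name] = cast(raw[name])
            except ValueError as exc:
                raise ValueError(f"line {line_no}, column {name}: {exc}") from exc
        rows.append(row)
    return rows
```

Calling `validate_csv` on the text of the raw file returns a list of typed dictionaries, or raises `ValueError` naming the first offending line and column.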
Naming Conventions
All columns use lowercase snake_case.
Subsets are named creditcard_training, creditcard_validation, and creditcard_test in DBRepo.
Files in the code repo follow a clear structure:
├── data/ # local copies only; raw data lives in DBRepo
├── notebooks/Task.ipynb
├── models/rf_model_v1.joblib
├── outputs/ # confusion_matrix.png, roc_curve.png, predictions.csv
├── README.md
├── requirements.txt
└── codemeta.json
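As a rough illustration of how the model artifact and evaluation outputs in this layout might be produced, the sketch below trains a RandomForest on a training split, picks a setting on validation, and scores the held-out test split. Everything specific here is an assumption: the data are synthetic stand-ins generated by the hypothetical `make_split` helper, the candidate `n_estimators` values are arbitrary, and `class_weight="balanced"` is one common way to handle imbalance, not necessarily what the notebook uses.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score

rng = np.random.default_rng(0)


def make_split(n):
    """Synthetic stand-in for one DBRepo subset (not the real data)."""
    X = rng.normal(size=(n, 4))
    y = (X[:, 0] + 0.5 * rng.normal(size=n) > 1.5).astype(int)
    return X, y


X_train, y_train = make_split(2000)
X_val, y_val = make_split(500)
X_test, y_test = make_split(500)

# Tune on the validation split only; the test split stays untouched until the end.
best_model, best_auc = None, -1.0
for n_estimators in (50, 100):
    model = RandomForestClassifier(
        n_estimators=n_estimators, class_weight="balanced", random_state=0
    )
    model.fit(X_train, y_train)
    val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    if val_auc > best_auc:
        best_model, best_auc = model, val_auc

# Final, one-time evaluation on the held-out test split.
cm = confusion_matrix(y_test, best_model.predict(X_test), labels=[0, 1])
test_auc = roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1])

# Persisting the model would look like this (path taken from the layout above):
# import joblib; joblib.dump(best_model, "models/rf_model_v1.joblib")
```

With real data, `cm` and the predicted probabilities are the inputs for plots such as confusion_matrix.png and roc_curve.png.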
Required Software
Python 3.9+
pandas, numpy (data handling)
scikit-learn (modeling, metrics)
matplotlib (visualizations)
dbrepo-client.py (DBRepo API)
requests (TU WRD API)
Additional Resources
Original dataset: https://www.kaggle.com/mlg-ulb/creditcardfraud
Scikit-learn docs: https://scikit-learn.org/stable
DBRepo API guide: via the starter notebook's dbrepo_client.py template
TU WRD REST API spec: https://test.researchdata.tuwien.ac.at/api/docs
Data Limitations
Highly imbalanced: only ~0.17% of transactions are fraudulent.
Anonymized PCA features (V1 to V28) are hidden; we extended the data with domain features but cannot reverse-engineer the raw variables.
Time-bounded: only covers two days of transactions, so it may not capture seasonal patterns.
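The imbalance figure follows directly from the counts given under Data Sources. The short calculation below reproduces it and also shows, as an illustration only, the class weights that the common "balanced" heuristic (n_samples / (n_classes * class_count), the formula behind scikit-learn's `class_weight="balanced"`) would assign. Whether the experiment used such weights is not stated, so treat that part as an assumption.

```python
TOTAL = 284_807   # transactions in the dataset
FRAUD = 492       # fraudulent transactions
LEGIT = TOTAL - FRAUD

fraud_rate = FRAUD / TOTAL
print(f"fraud rate: {fraud_rate:.4%}")  # prints "fraud rate: 0.1727%"


def balanced_weight(class_count, total=TOTAL, n_classes=2):
    """Weight under the 'balanced' heuristic: rarer classes get larger weights."""
    return total / (n_classes * class_count)


w_fraud = balanced_weight(FRAUD)
w_legit = balanced_weight(LEGIT)

# A model that always predicts "not fraud" already reaches this accuracy,
# which is why accuracy alone is a misleading metric for this dataset.
baseline_accuracy = LEGIT / TOTAL
```

Because the trivial baseline is correct for more than 99.8% of transactions, metrics such as precision, recall, or ROC AUC are more informative here than raw accuracy.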
Licensing and Attribution
Raw data: CC-0 (per Kaggle terms)
Code & notebooks: MIT License
Model artifacts & outputs: CC-BY 4.0
TU WRD records include ORCID identifiers for the author.
Recommended Uses
Benchmarking new fraud-detection algorithms on a standard imbalanced dataset.
Educational purposes: demonstrating model-training pipelines and FAIR data practices.
Extension: adding time-series or deep-learning models.
Known Issues
Possible temporal leakage if date/time features are not handled correctly.
Model performance may degrade on live data due to concept drift.
Binary flags may oversimplify nuanced transaction outcomes.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset presents detailed energy consumption records from various households over the month. With 90,000 rows and multiple features such as temperature, household size, air conditioning usage, and peak hour consumption, this dataset is perfect for performing time-series analysis, machine learning, and sustainability research.
Column Name | Data Type Category | Description |
---|---|---|
Household_ID | Categorical (Nominal) | Unique identifier for each household |
Date | Datetime | The date of the energy usage record |
Energy_Consumption_kWh | Numerical (Continuous) | Total energy consumed by the household in kWh |
Household_Size | Numerical (Discrete) | Number of individuals living in the household |
Avg_Temperature_C | Numerical (Continuous) | Average daily temperature in degrees Celsius |
Has_AC | Categorical (Binary) | Indicates if the household has air conditioning (Yes/No) |
Peak_Hours_Usage_kWh | Numerical (Continuous) | Energy consumed during peak hours in kWh |
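To make the column types concrete, here is a small sketch that parses records with these columns and derives each household's peak-hour share of consumption. The two sample rows are invented for illustration, the ISO date format is an assumption (the listing does not state the format), and `parse_record` is a hypothetical helper.

```python
import csv
import io
from datetime import date

# Invented sample rows; real values and the date format may differ.
SAMPLE = """Household_ID,Date,Energy_Consumption_kWh,Household_Size,Avg_Temperature_C,Has_AC,Peak_Hours_Usage_kWh
H001,2025-04-01,12.0,3,18.5,Yes,3.0
H002,2025-04-01,8.0,2,18.5,No,4.0
"""


def parse_record(raw):
    """Convert one CSV row of strings into typed values per the column table."""
    return {
        "Household_ID": raw["Household_ID"],
        "Date": date.fromisoformat(raw["Date"]),
        "Energy_Consumption_kWh": float(raw["Energy_Consumption_kWh"]),
        "Household_Size": int(raw["Household_Size"]),
        "Avg_Temperature_C": float(raw["Avg_Temperature_C"]),
        "Has_AC": raw["Has_AC"] == "Yes",
        "Peak_Hours_Usage_kWh": float(raw["Peak_Hours_Usage_kWh"]),
    }


records = [parse_record(r) for r in csv.DictReader(io.StringIO(SAMPLE))]

# Fraction of each household's daily consumption that falls in peak hours.
peak_share = {
    r["Household_ID"]: r["Peak_Hours_Usage_kWh"] / r["Energy_Consumption_kWh"]
    for r in records
}
```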
Library | Purpose |
---|---|
pandas | Reading, cleaning, and transforming tabular data |
numpy | Numerical operations, working with arrays |
matplotlib | Creating static plots (line, bar, histograms, etc.) |
seaborn | Statistical visualizations, heatmaps, boxplots, etc. |
plotly | Interactive charts (time series, pie, bar, scatter, etc.) |
scikit-learn | Preprocessing, regression, classification, clustering |
xgboost / lightgbm | Gradient boosting models for better accuracy |
sklearn.preprocessing | Encoding categorical features, scaling, normalization |
datetime / pandas | Date-time conversion and manipulation |
sklearn.metrics | Accuracy, MAE, RMSE, R² score, confusion matrix, etc. |
These libraries provide a complete toolkit for performing data analysis, modeling, and visualization tasks efficiently.
This dataset is ideal for a wide variety of analytics and machine learning projects.