Credit Card Fraud Detection
test.researchdata.tuwien.at
zenodo.org
+1more
csv, json, pdf +2
Updated Apr 28, 2025
Cite
Ajdina Grizhja (2025). Credit Card Fraud Detection [Dataset]. http://doi.org/10.82556/yvxj-9t22
Available download formats
csv, pdf, text/markdown, txt, json
Unique identifier
https://doi.org/10.82556/yvxj-9t22
Dataset updated
Apr 28, 2025
Dataset provided by
TU Wien
Authors
Ajdina Grizhja
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Apr 28, 2025
Description
1. Dataset Description

Research Domain
This work resides in the domain of financial fraud detection and applied machine learning. We focus on detecting anomalous credit‐card transactions in real time to reduce financial losses and improve trust in digital payment systems.

Purpose
The goal is to train and evaluate a binary classification model that flags potentially fraudulent transactions. By publishing both the code and data splits via FAIR repositories, we enable reproducible benchmarking of fraud‐detection algorithms and support future research on anomaly detection in transaction data.

Data Sources
We used the publicly available credit‐card transaction dataset from Kaggle (original source: https://www.kaggle.com/mlg-ulb/creditcardfraud), which contains anonymized transactions made by European cardholders over two days in September 2013. The dataset includes 284 807 transactions, of which 492 are fraudulent.

Method of Dataset Preparation

Schema validation: Renamed columns to snake_case (e.g. transaction_amount, is_declined) so they conform to DBRepo’s requirements.

Data import: Uploaded the full CSV into DBRepo and assigned persistent identifiers (PIDs).

Splitting: Programmatically derived three subsets—training (70%), validation (15%), test (15%)—using range‐based filters on the primary key actionnr. Each subset was materialized in DBRepo and assigned its own PID for precise citation.

Cleaning: Converted the categorical flags (is_declined, isforeigntransaction, ishighriskcountry, isfradulent) from “Y”/“N” to 1/0 and dropped non‐feature identifiers (actionnr, merchant_id).

Modeling: Trained a RandomForest classifier on the training split, tuned on validation, and evaluated on the held‐out test set.
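
The range-based split described under Splitting can be sketched as follows. This is only an illustration under stated assumptions, not the author's actual DBRepo query: the helpers `split_by_key_range` and `rows_in_range` are hypothetical names, and the toy data assumes `actionnr` values are contiguous integers so that cutting the sorted keys yields 70/15/15 subsets.

```python
from typing import Dict, List, Tuple


def split_by_key_range(
    keys: List[int], train_frac: float = 0.70, val_frac: float = 0.15
) -> Dict[str, Tuple[int, int]]:
    """Return inclusive (low, high) key ranges for the three subsets.

    Hypothetical helper: sorts the primary keys and cuts them at the
    70% and 85% positions, so each subset is one contiguous key range.
    """
    ordered = sorted(keys)
    n = len(ordered)
    train_end = round(n * train_frac)            # index of first validation key
    val_end = train_end + round(n * val_frac)    # index of first test key
    return {
        "training": (ordered[0], ordered[train_end - 1]),
        "validation": (ordered[train_end], ordered[val_end - 1]),
        "test": (ordered[val_end], ordered[-1]),
    }


def rows_in_range(rows: List[dict], low: int, high: int) -> List[dict]:
    """Keep rows whose actionnr lies in the inclusive range [low, high]."""
    return [r for r in rows if low <= r["actionnr"] <= high]


# Toy data: 100 rows with actionnr 1..100 (assumed contiguous keys).
rows = [{"actionnr": i} for i in range(1, 101)]
ranges = split_by_key_range([r["actionnr"] for r in rows])
subsets = {name: rows_in_range(rows, lo, hi) for name, (lo, hi) in ranges.items()}
```

Because each subset is defined by a key range rather than a random draw, the same filter can be re-run later and should return the same rows, which is what makes a per-subset PID meaningful.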

2. Technical Details

Dataset Structure

The raw data is a single CSV with columns:

actionnr (integer transaction ID)

merchant_id (string)

average_amount_transaction_day (float)

transaction_amount (float)

is_declined, isforeigntransaction, ishighriskcountry, isfradulent (binary flags)

total_number_of_declines_day, daily_chargeback_avg_amt, sixmonth_avg_chbk_amt, sixmonth_chbk_freq (numeric features)
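
Given the columns listed above, the cleaning step (mapping "Y"/"N" flags to 1/0 and dropping identifiers) might look like the sketch below. The function name `clean_row` and the sample values are made up for illustration; only the column names come from the description.

```python
FLAG_COLUMNS = ("is_declined", "isforeigntransaction", "ishighriskcountry", "isfradulent")
ID_COLUMNS = ("actionnr", "merchant_id")


def clean_row(row: dict) -> dict:
    """Map "Y"/"N" flags to 1/0 and drop non-feature identifiers."""
    cleaned = {}
    for name, value in row.items():
        if name in ID_COLUMNS:
            continue  # identifiers are dropped before modeling
        if name in FLAG_COLUMNS:
            if value not in ("Y", "N"):
                raise ValueError(f"unexpected flag value {value!r} in {name}")
            cleaned[name] = 1 if value == "Y" else 0
        else:
            cleaned[name] = value  # numeric features pass through unchanged
    return cleaned


# Illustrative row; these values are invented for the example.
raw = {
    "actionnr": 1,
    "merchant_id": "m-001",
    "transaction_amount": 120.5,
    "is_declined": "N",
    "isforeigntransaction": "Y",
    "ishighriskcountry": "N",
    "isfradulent": "Y",
}
cleaned = clean_row(raw)
```

Raising on unexpected flag values is a design choice of this sketch: it surfaces dirty data early instead of silently coding it as 0.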

Naming Conventions

All columns use lowercase snake_case.

Subsets are named creditcard_training, creditcard_validation, creditcard_test in DBRepo.

Files in the code repo follow a clear structure:

├── data/         # local copies only; raw data lives in DBRepo
├── notebooks/Task.ipynb
├── models/rf_model_v1.joblib
├── outputs/        # confusion_matrix.png, roc_curve.png, predictions.csv
├── README.md
├── requirements.txt
└── codemeta.json

Required Software

Python 3.9+

pandas, numpy (data handling)

scikit-learn (modeling, metrics)

matplotlib (visualizations)

dbrepo‐client.py (DBRepo API)

requests (TU WRD API)

Additional Resources

Original dataset: https://www.kaggle.com/mlg-ulb/creditcardfraud

Scikit-learn docs: https://scikit-learn.org/stable

DBRepo API guide: via the starter notebook’s dbrepo_client.py template

TU WRD REST API spec: https://test.researchdata.tuwien.ac.at/api/docs

3. Further Details

Data Limitations

Highly imbalanced: only ~0.17% of transactions are fraudulent.

Anonymized: the original variables behind the PCA features (V1–V28) are hidden; we extended the data with domain features but cannot reverse-engineer the raw variables.

Time-bounded: covers only two days of transactions, so it may not capture seasonal patterns.
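
To make the imbalance concrete, the short calculation below uses the two counts reported in the description (284 807 transactions, 492 fraudulent). The class-weight formula is the common "balanced" heuristic, n_samples / (n_classes * n_samples_in_class); whether the author used any class weighting is not stated, so treat that part as an assumption.

```python
TOTAL = 284_807   # transactions reported in the description
FRAUD = 492       # fraudulent transactions reported in the description
LEGIT = TOTAL - FRAUD

fraud_rate = FRAUD / TOTAL

# Accuracy of a trivial model that labels every transaction as legitimate.
baseline_accuracy = LEGIT / TOTAL

# "Balanced" class weights: n_samples / (n_classes * n_samples_in_class).
weights = {
    "legit": TOTAL / (2 * LEGIT),
    "fraud": TOTAL / (2 * FRAUD),
}

print(f"fraud rate: {fraud_rate:.4%}")
print(f"baseline accuracy: {baseline_accuracy:.4%}")
print(f"class weights: {weights}")
```

The baseline shows why plain accuracy is a weak metric here: a model that never flags fraud is still right for all but about 0.17% of transactions.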

Licensing and Attribution

Raw data: CC-0 (per Kaggle terms)

Code & notebooks: MIT License

Model artifacts & outputs: CC-BY 4.0

TU WRD records include an ORCID identifier for the author.

Recommended Uses

Benchmarking new fraud‐detection algorithms on a standard imbalanced dataset.

Educational purposes: demonstrating model‐training pipelines, FAIR data practices.

Extension: adding time‐series or deep‐learning models.

Known Issues

Possible temporal leakage if date/time features are not handled correctly.

Model performance may degrade on live data due to concept drift.

Binary flags may oversimplify nuanced transaction outcomes.
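
One common way to guard against the temporal leakage mentioned above is to split chronologically, so that every training row precedes every test row. The sketch below assumes a hypothetical `timestamp` field, which is not among the columns listed for this dataset, and uses invented rows.

```python
from datetime import datetime
from typing import List, Tuple


def chronological_split(
    rows: List[dict], test_frac: float = 0.15
) -> Tuple[List[dict], List[dict]]:
    """Split rows so that all training rows come before all test rows in time."""
    ordered = sorted(rows, key=lambda r: r["timestamp"])
    cut = len(ordered) - round(len(ordered) * test_frac)
    return ordered[:cut], ordered[cut:]


# Ten invented rows, deliberately out of order; "timestamp" is a hypothetical field.
rows = [
    {"timestamp": datetime(2013, 9, 1, hour), "transaction_amount": 10.0 * hour}
    for hour in (5, 1, 3, 2, 4, 0, 9, 7, 8, 6)
]
train, test = chronological_split(rows, test_frac=0.2)

# Leakage check: the latest training row must not be later than the earliest test row.
no_leakage = max(r["timestamp"] for r in train) <= min(r["timestamp"] for r in test)
```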

Household Energy Consumption

kaggle.com

Updated Apr 5, 2025

Cite

Samx_sam (2025). Household Energy Consumption [Dataset]. https://www.kaggle.com/datasets/samxsam/household-energy-consumption

Dataset updated

Apr 5, 2025

Dataset provided by

Kaggle

Authors

Samx_sam

License

https://creativecommons.org/publicdomain/zero/1.0/

Description

🏡 Household Energy Consumption - April 2025 (90,000 Records)

📌 Overview

This dataset presents daily energy consumption records from various households over one month (April 1–30, 2025). With 90,000 rows and features such as temperature, household size, air conditioning usage, and peak hour consumption, it is suited to time-series analysis, machine learning, and sustainability research.

Column Name	Data Type Category	Description
Household_ID	Categorical (Nominal)	Unique identifier for each household
Date	Datetime	The date of the energy usage record
Energy_Consumption_kWh	Numerical (Continuous)	Total energy consumed by the household in kWh
Household_Size	Numerical (Discrete)	Number of individuals living in the household
Avg_Temperature_C	Numerical (Continuous)	Average daily temperature in degrees Celsius
Has_AC	Categorical (Binary)	Indicates if the household has air conditioning (Yes/No)
Peak_Hours_Usage_kWh	Numerical (Continuous)	Energy consumed during peak hours in kWh
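
The column types in the table above can be applied while reading the CSV. The sketch below uses only the standard library; the two sample rows are invented, and the ISO date format (`YYYY-MM-DD`) is an assumption, since the description does not state how `Date` is encoded.

```python
import csv
import io
from datetime import date, datetime

# Two made-up rows in the documented column layout (not real records).
SAMPLE = """Household_ID,Date,Energy_Consumption_kWh,Household_Size,Avg_Temperature_C,Has_AC,Peak_Hours_Usage_kWh
H00001,2025-04-01,8.4,4,17.8,No,3.7
H00002,2025-04-01,15.2,2,24.1,Yes,6.9
"""


def parse_row(row: dict) -> dict:
    """Convert the string fields of one CSV row to the documented types."""
    return {
        "Household_ID": row["Household_ID"],
        # Assumed ISO format; adjust the pattern if the file differs.
        "Date": datetime.strptime(row["Date"], "%Y-%m-%d").date(),
        "Energy_Consumption_kWh": float(row["Energy_Consumption_kWh"]),
        "Household_Size": int(row["Household_Size"]),
        "Avg_Temperature_C": float(row["Avg_Temperature_C"]),
        "Has_AC": row["Has_AC"] == "Yes",
        "Peak_Hours_Usage_kWh": float(row["Peak_Hours_Usage_kWh"]),
    }


records = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
```

To read the real file, replace `io.StringIO(SAMPLE)` with an open file handle for household_energy_consumption_2025.csv.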

📂 Dataset Summary

Rows: 90,000
Time Range: April 1, 2025 – April 30, 2025
Data Granularity: Daily per household
Location: Simulated global coverage
Format: CSV (Comma-Separated Values)

📚 Libraries Used for Working with household_energy_consumption_2025.csv

🔍 1. Data Manipulation & Analysis

Library	Purpose
`pandas`	Reading, cleaning, and transforming tabular data
`numpy`	Numerical operations, working with arrays

📊 2. Data Visualization

Library	Purpose
`matplotlib`	Creating static plots (line, bar, histograms, etc.)
`seaborn`	Statistical visualizations, heatmaps, boxplots, etc.
`plotly`	Interactive charts (time series, pie, bar, scatter, etc.)

📈 3. Machine Learning / Modeling

Library	Purpose
`scikit-learn`	Preprocessing, regression, classification, clustering
`xgboost` / `lightgbm`	Gradient boosting models for better accuracy

🧹 4. Data Preprocessing

Library	Purpose
`sklearn.preprocessing`	Encoding categorical features, scaling, normalization
`datetime` / `pandas`	Date-time conversion and manipulation

🧪 5. Model Evaluation

Library	Purpose
`sklearn.metrics`	Accuracy, MAE, RMSE, R² score, confusion matrix, etc.

✅ These libraries provide a complete toolkit for performing data analysis, modeling, and visualization tasks efficiently.

📈 Potential Use Cases

This dataset is ideal for a wide variety of analytics and machine learning projects:

🔮 Forecasting & Time Series Analysis

Predict future household energy consumption based on previous trends and weather conditions.
Identify seasonal and daily consumption patterns.

💡 Energy Efficiency Analysis

Analyze differences in energy consumption between households with and without air conditioning.
Compare energy usage efficiency across varying household sizes.
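
As a minimal sketch of the AC versus non-AC comparison, the snippet below groups consumption by the `Has_AC` flag and compares the means. The values are invented for illustration and `mean_by_ac` is a hypothetical helper name.

```python
from statistics import mean

# Made-up (Has_AC, Energy_Consumption_kWh) pairs for illustration only.
records = [(True, 15.0), (False, 8.0), (True, 17.0), (False, 10.0), (True, 16.0)]


def mean_by_ac(rows):
    """Return mean consumption keyed by the Has_AC flag."""
    groups = {True: [], False: []}
    for has_ac, kwh in rows:
        groups[has_ac].append(kwh)
    # Skip empty groups so mean() is never called on an empty list.
    return {flag: mean(values) for flag, values in groups.items() if values}


means = mean_by_ac(records)
difference = means[True] - means[False]
print(f"AC: {means[True]} kWh, no AC: {means[False]} kWh, difference: {difference} kWh")
```

A raw difference in means does not control for temperature or household size, so it is a starting point rather than an efficiency estimate.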

🌡️ Climate Impact Studies

Investigate how temperature affects electricity usage across households.
Model the potential impact of climate change on residential energy demand.
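
A first look at how temperature relates to usage could use the Pearson correlation and a least-squares slope, sketched below with the standard library only. The five data points are invented; on the real data, `Avg_Temperature_C` and `Energy_Consumption_kWh` would take their place.

```python
from math import sqrt

# Invented example values, not taken from the dataset.
temps_c = [10.0, 15.0, 20.0, 25.0, 30.0]
energy_kwh = [8.0, 9.0, 11.0, 14.0, 18.0]


def _centered_sums(xs, ys):
    """Return (sum of cross-products, sum of squares of x, sum of squares of y)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy, sxx, syy


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    sxy, sxx, syy = _centered_sums(xs, ys)
    return sxy / sqrt(sxx * syy)


def ols_slope(xs, ys):
    """Least-squares slope of y on x (here: kWh per degree Celsius)."""
    sxy, sxx, _ = _centered_sums(xs, ys)
    return sxy / sxx


r = pearson(temps_c, energy_kwh)
slope = ols_slope(temps_c, energy_kwh)
print(f"r = {r:.3f}, slope = {slope:.2f} kWh per degree C")
```

A linear fit is only a rough summary: heating and cooling demand often produce a U-shaped relation with temperature, which a single slope would miss.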

🔌 Peak Load Management

Build models to predict and manage energy demand during peak hours.
Support research on smart grid technologies and dynamic pricing.

🧠 Machine Learning Projects

Supervised learning (regression/classification) to predict energy consumption.
Clustering households by usage patterns for targeted energy programs.
Anomaly detection in energy usage for fault detection.
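
For the anomaly-detection idea, a simple baseline is a z-score rule: flag readings that lie more than a chosen number of standard deviations from the mean. The sketch below is one such baseline with invented readings and an arbitrary threshold of 2.0; it is not a method prescribed by the dataset author.

```python
from statistics import mean, stdev


def flag_anomalies(values, threshold=2.0):
    """Return indices of values whose absolute z-score exceeds the threshold."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []  # all values identical, nothing to flag
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Invented daily readings for one household; index 8 is an obvious spike.
daily_kwh = [10.0, 11.0, 9.5, 10.5, 10.0, 9.0, 11.5, 10.0, 30.0, 10.5]
anomalous_days = flag_anomalies(daily_kwh)
```

Because the spike itself inflates the mean and standard deviation, robust alternatives such as median-based scores may behave better on short series.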

🛠️ Example Starter Projects

Time-series forecasting using Facebook Prophet or ARIMA
Regression modeling using XGBoost or LightGBM
Classification of AC vs. non-AC household behavior
Energy-saving recommendation systems
Heatmaps of temperature vs. energy usage
