3 datasets found
  1. Credit Card Fraud Detection

    • test.researchdata.tuwien.at
    • zenodo.org
    • +1 more
    csv, json, pdf +2
    Updated Apr 28, 2025
    Cite
    Ajdina Grizhja; Ajdina Grizhja; Ajdina Grizhja; Ajdina Grizhja (2025). Credit Card Fraud Detection [Dataset]. http://doi.org/10.82556/yvxj-9t22
    Available download formats: csv, pdf, text/markdown, txt, json
    Dataset updated
    Apr 28, 2025
    Dataset provided by
    TU Wien
    Authors
    Ajdina Grizhja; Ajdina Grizhja; Ajdina Grizhja; Ajdina Grizhja
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 28, 2025
    Description

    Below is a draft DMP–style description of your credit‐card fraud detection experiment, modeled on the antiquities example:

    1. Dataset Description

    Research Domain
    This work resides in the domain of financial fraud detection and applied machine learning. We focus on detecting anomalous credit‐card transactions in real time to reduce financial losses and improve trust in digital payment systems.

    Purpose
    The goal is to train and evaluate a binary classification model that flags potentially fraudulent transactions. By publishing both the code and data splits via FAIR repositories, we enable reproducible benchmarking of fraud‐detection algorithms and support future research on anomaly detection in transaction data.

    Data Sources
    We used the publicly available credit‐card transaction dataset from Kaggle (original source: https://www.kaggle.com/mlg-ulb/creditcardfraud), which contains anonymized transactions made by European cardholders over two days in September 2013. The dataset includes 284 807 transactions, of which 492 are fraudulent.

    Method of Dataset Preparation

    1. Schema validation: Renamed columns to snake_case (e.g. transaction_amount, is_declined) so they conform to DBRepo’s requirements.

    2. Data import: Uploaded the full CSV into DBRepo, assigned persistent identifiers (PIDs).

    3. Splitting: Programmatically derived three subsets—training (70%), validation (15%), test (15%)—using range‐based filters on the primary key actionnr. Each subset was materialized in DBRepo and assigned its own PID for precise citation.

    4. Cleaning: Converted the categorical flags (is_declined, isforeigntransaction, ishighriskcountry, isfradulent) from “Y”/“N” to 1/0 and dropped non‐feature identifiers (actionnr, merchant_id).

    5. Modeling: Trained a RandomForest classifier on the training split, tuned on validation, and evaluated on the held‐out test set.
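
The splitting and cleaning steps above could be reproduced locally along these lines. This is a minimal pandas sketch, not the DBRepo workflow itself: the function names, the integer-percentage cutoffs on the sorted key, and the synthetic rows are assumptions made for illustration; only the column names come from the description.

```python
import pandas as pd

# Column names as listed in the dataset description.
FLAG_COLUMNS = ["is_declined", "isforeigntransaction", "ishighriskcountry", "isfradulent"]
ID_COLUMNS = ["actionnr", "merchant_id"]


def split_by_key_range(df, key="actionnr", train_pct=70, val_pct=15):
    """Split rows into train/validation/test by contiguous ranges of the sorted key.

    Assumption: cutoffs are taken by row position after sorting on the key,
    using integer arithmetic so the subset sizes are deterministic.
    """
    ordered = df.sort_values(key).reset_index(drop=True)
    n = len(ordered)
    train_end = n * train_pct // 100
    val_end = train_end + n * val_pct // 100
    return ordered.iloc[:train_end], ordered.iloc[train_end:val_end], ordered.iloc[val_end:]


def clean(df):
    """Map Y/N flags to 1/0 and drop identifier columns that are not features."""
    out = df.copy()
    for col in FLAG_COLUMNS:
        out[col] = out[col].map({"Y": 1, "N": 0})
    return out.drop(columns=ID_COLUMNS)


# Synthetic stand-in rows (made up for illustration; not real transactions).
demo = pd.DataFrame({
    "actionnr": range(1, 21),
    "merchant_id": [f"m{i}" for i in range(1, 21)],
    "transaction_amount": [10.0 * i for i in range(1, 21)],
    "is_declined": ["N"] * 20,
    "isforeigntransaction": ["Y", "N"] * 10,
    "ishighriskcountry": ["N"] * 20,
    "isfradulent": ["N"] * 19 + ["Y"],
})

raw_train, raw_val, raw_test = split_by_key_range(demo)
train, val, test = clean(raw_train), clean(raw_val), clean(raw_test)
print(len(train), len(val), len(test))
```

Because the ranges are contiguous in actionnr, the three subsets are not stratified by isfradulent; checking the fraud share in each subset is a sensible follow-up before training.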

    2. Technical Details

    Dataset Structure

    • The raw data is a single CSV with columns:

      • actionnr (integer transaction ID)

      • merchant_id (string)

      • average_amount_transaction_day (float)

      • transaction_amount (float)

      • is_declined, isforeigntransaction, ishighriskcountry, isfradulent (binary flags)

      • total_number_of_declines_day, daily_chargeback_avg_amt, sixmonth_avg_chbk_amt, sixmonth_chbk_freq (numeric features)

    Naming Conventions

    • All columns use lowercase snake_case.

    • Subsets are named creditcard_training, creditcard_validation, creditcard_test in DBRepo.

    • Files in the code repo follow a clear structure:

      ├── data/         # local copies only; raw data lives in DBRepo
      ├── notebooks/Task.ipynb
      ├── models/rf_model_v1.joblib
      ├── outputs/        # confusion_matrix.png, roc_curve.png, predictions.csv
      ├── README.md
      ├── requirements.txt
      └── codemeta.json
      

    Required Software

    • Python 3.9+

    • pandas, numpy (data handling)

    • scikit-learn (modeling, metrics)

    • matplotlib (visualizations)

    • dbrepo‐client.py (DBRepo API)

    • requests (TU WRD API)

    Additional Resources

    3. Further Details

    Data Limitations

    • Highly imbalanced: only ~0.17% of transactions are fraudulent.

    • The anonymized PCA features (V1–V28) hide the original variables; we extended the data with domain features but cannot reverse‐engineer the raw variables.

    • Time‐bounded: the data covers only two days of transactions and may not capture seasonal patterns.
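
The imbalance figure follows directly from the counts quoted under Data Sources. The short calculation below reproduces it and, as an illustration only, derives inverse-frequency class weights using the formula scikit-learn documents for class_weight="balanced"; the description does not say whether such weighting was applied, and the variable names are assumptions.

```python
# Counts quoted in the Data Sources paragraph.
TOTAL_TRANSACTIONS = 284_807
FRAUD_TRANSACTIONS = 492
LEGIT_TRANSACTIONS = TOTAL_TRANSACTIONS - FRAUD_TRANSACTIONS

# Share of fraudulent transactions (the "~0.17%" in the limitations list).
fraud_rate = FRAUD_TRANSACTIONS / TOTAL_TRANSACTIONS

# Accuracy of a trivial model that labels everything as legitimate.
# It is very high, which is why accuracy alone is misleading here.
always_legit_accuracy = LEGIT_TRANSACTIONS / TOTAL_TRANSACTIONS

# Inverse-frequency weights: n_samples / (n_classes * class_count).
N_CLASSES = 2
class_weights = {
    0: TOTAL_TRANSACTIONS / (N_CLASSES * LEGIT_TRANSACTIONS),
    1: TOTAL_TRANSACTIONS / (N_CLASSES * FRAUD_TRANSACTIONS),
}

print(f"fraud rate: {fraud_rate:.2%}")
print(f"always-legit accuracy: {always_legit_accuracy:.4f}")
print(f"class weights: {class_weights}")
```

With a baseline this high, precision, recall, or the area under the precision-recall curve are usually more informative than plain accuracy.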

    Licensing and Attribution

    • Raw data: CC-0 (per Kaggle terms)

    • Code & notebooks: MIT License

    • Model artifacts & outputs: CC-BY 4.0

    • TU WRD records include ORCID identifiers for the author.

    Recommended Uses

    • Benchmarking new fraud‐detection algorithms on a standard imbalanced dataset.

    • Educational purposes: demonstrating model‐training pipelines, FAIR data practices.

    • Extension: adding time‐series or deep‐learning models.

    Known Issues

    • Possible temporal leakage if date/time features are not handled correctly.

    • Model performance may degrade on live data due to concept drift.

    • Binary flags may oversimplify nuanced transaction outcomes.

  2. Credit Card Fraud Detection

    • test.researchdata.tuwien.ac.at
    csv, json, pdf +2
    Updated Apr 28, 2025
    Cite
    Ajdina Grizhja; Ajdina Grizhja; Ajdina Grizhja; Ajdina Grizhja (2025). Credit Card Fraud Detection [Dataset]. http://doi.org/10.82556/yvxj-9t22
    Available download formats: text/markdown, csv, pdf, txt, json
    Dataset updated
    Apr 28, 2025
    Dataset provided by
    TU Wien
    Authors
    Ajdina Grizhja; Ajdina Grizhja; Ajdina Grizhja; Ajdina Grizhja
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 28, 2025

  3. Household Energy Consumption

    • kaggle.com
    Updated Apr 5, 2025
    Cite
    Samx_sam (2025). Household Energy Consumption [Dataset]. https://www.kaggle.com/datasets/samxsam/household-energy-consumption
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Apr 5, 2025
    Dataset provided by
    Kaggle
    Authors
    Samx_sam
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    šŸ” Household Energy Consumption - April 2025 (90,000 Records)

    šŸ“Œ Overview

    This dataset presents daily energy consumption records from various households over one month (April 2025). With 90,000 rows and features such as temperature, household size, air conditioning availability, and peak-hour consumption, it is suited to time-series analysis, machine learning, and sustainability research.

    Column Name | Data Type Category | Description
    Household_ID | Categorical (Nominal) | Unique identifier for each household
    Date | Datetime | The date of the energy usage record
    Energy_Consumption_kWh | Numerical (Continuous) | Total energy consumed by the household in kWh
    Household_Size | Numerical (Discrete) | Number of individuals living in the household
    Avg_Temperature_C | Numerical (Continuous) | Average daily temperature in degrees Celsius
    Has_AC | Categorical (Binary) | Indicates if the household has air conditioning (Yes/No)
    Peak_Hours_Usage_kWh | Numerical (Continuous) | Energy consumed during peak hours in kWh
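
A file with the columns listed above could be loaded roughly as follows. This is a sketch under assumptions: the sample rows, the ID format, and the ISO date format are invented for illustration, and Peak_Share is a derived helper column that is not part of the dataset.

```python
import io

import pandas as pd

# Made-up sample rows using the documented column names.
SAMPLE_CSV = """Household_ID,Date,Energy_Consumption_kWh,Household_Size,Avg_Temperature_C,Has_AC,Peak_Hours_Usage_kWh
H0001,2025-04-01,12.5,3,18.2,Yes,4.1
H0001,2025-04-02,14.0,3,21.0,Yes,5.0
H0002,2025-04-01,8.3,2,18.2,No,2.2
"""


def load_energy(source):
    """Read the CSV, parse dates, and convert the Yes/No flag to booleans."""
    df = pd.read_csv(source, parse_dates=["Date"], dtype={"Household_ID": str})
    df["Has_AC"] = df["Has_AC"].map({"Yes": True, "No": False})
    # Hypothetical derived feature: fraction of daily energy used in peak hours.
    df["Peak_Share"] = df["Peak_Hours_Usage_kWh"] / df["Energy_Consumption_kWh"]
    return df


energy = load_energy(io.StringIO(SAMPLE_CSV))
print(energy.dtypes)
```

To work with the real file, pass its path (for example household_energy_consumption_2025.csv) to load_energy instead of the in-memory sample.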

    Dataset Summary

    • Rows: 90,000
    • Time Range: April 1, 2025 – April 30, 2025
    • Data Granularity: Daily per household
    • Location: Simulated global coverage
    • Format: CSV (Comma-Separated Values)

    Libraries Used for Working with household_energy_consumption_2025.csv

    1. Data Manipulation & Analysis

    Library | Purpose
    pandas | Reading, cleaning, and transforming tabular data
    numpy | Numerical operations, working with arrays

    2. Data Visualization

    Library | Purpose
    matplotlib | Creating static plots (line, bar, histograms, etc.)
    seaborn | Statistical visualizations, heatmaps, boxplots, etc.
    plotly | Interactive charts (time series, pie, bar, scatter, etc.)

    3. Machine Learning / Modeling

    Library | Purpose
    scikit-learn | Preprocessing, regression, classification, clustering
    xgboost / lightgbm | Gradient boosting models for better accuracy

    4. Data Preprocessing

    Library | Purpose
    sklearn.preprocessing | Encoding categorical features, scaling, normalization
    datetime / pandas | Date-time conversion and manipulation

    5. Model Evaluation

    Library | Purpose
    sklearn.metrics | Accuracy, MAE, RMSE, R² score, confusion matrix, etc.
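
For readers unfamiliar with the regression metrics named above, the sketch below computes MAE, RMSE, and R² from their definitions in plain Python. It is meant to show the formulas, not to replace sklearn.metrics; the function name and the sample values are made up for illustration.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R² for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    # Mean absolute error: average size of the errors.
    mae = sum(abs(e) for e in errors) / n
    # Root mean squared error: penalizes large errors more strongly.
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    # R²: 1 minus residual variation over total variation around the mean.
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}


# Invented daily consumption values in kWh.
actual = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 12.0, 13.0, 18.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```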

    These libraries provide a complete toolkit for performing data analysis, modeling, and visualization tasks efficiently.

    Potential Use Cases

    This dataset is ideal for a wide variety of analytics and machine learning projects:

    Forecasting & Time Series Analysis

    • Predict future household energy consumption based on previous trends and weather conditions.
    • Identify seasonal and daily consumption patterns.

    Energy Efficiency Analysis

    • Analyze differences in energy consumption between households with and without air conditioning.
    • Compare energy usage efficiency across varying household sizes.
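
Both comparisons above reduce to a grouped aggregation. The following is a minimal sketch with invented values; kWh_per_person is an assumed normalization for comparing households of different sizes, not a column in the dataset.

```python
import pandas as pd

# Invented records using the documented column names.
records = pd.DataFrame({
    "Has_AC": ["Yes", "Yes", "No", "No"],
    "Household_Size": [4, 2, 4, 1],
    "Energy_Consumption_kWh": [20.0, 12.0, 12.0, 5.0],
})

# Normalize by household size so large and small households are comparable.
records["kWh_per_person"] = records["Energy_Consumption_kWh"] / records["Household_Size"]

# Mean total and per-person consumption for AC and non-AC households.
summary = records.groupby("Has_AC")[["Energy_Consumption_kWh", "kWh_per_person"]].mean()
print(summary)
```

On the real data, differences between the groups may also reflect temperature or household size, so a raw group mean is a starting point rather than a causal estimate.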

    šŸŒ”ļø Climate Impact Studies

    • Investigate how temperature affects electricity usage across households.
    • Model the potential impact of climate change on residential energy demand.

    Peak Load Management

    • Build models to predict and manage energy demand during peak hours.
    • Support research on smart grid technologies and dynamic pricing.

    Machine Learning Projects

    • Supervised learning (regression/classification) to predict energy consumption.
    • Clustering households by usage patterns for targeted energy programs.
    • Anomaly detection in energy usage for fault detection.

    šŸ› ļø Example Starter Projects

    • Time-series forecasting using Facebook Prophet or ARIMA
    • Regression modeling using XGBoost or LightGBM
    • Classification of AC vs. non-AC household behavior
    • Energy-saving recommendation systems
    • Heatmaps of temperature vs. energy usage