https://creativecommons.org/publicdomain/zero/1.0/
E-commerce has become a new channel to support business development. Through e-commerce, businesses can gain access to and establish a wider market presence by providing cheaper and more efficient distribution channels for their products or services. E-commerce has also changed the way people shop and consume products and services. Many people now turn to their computers or smart devices to order goods, which can easily be delivered to their homes.
This is one year of sales transaction data from a UK-based e-commerce (online retail) business. This London-based shop has been selling gifts and homewares for adults and children through its website since 2007. Its customers come from all over the world and usually make direct purchases for themselves. There are also small businesses that buy in bulk and sell to other customers through retail outlet channels.
The data set contains 500K rows and 8 columns. The following is a description of each column.
1. TransactionNo (categorical): a six-digit unique number that defines each transaction. The letter “C” in the code indicates a cancellation.
2. Date (numeric): the date when each transaction was generated.
3. ProductNo (categorical): a five- or six-digit unique code used to identify a specific product.
4. Product (categorical): product/item name.
5. Price (numeric): the price of each product per unit in pound sterling (£).
6. Quantity (numeric): the quantity of each product per transaction. Negative values correspond to cancelled transactions.
7. CustomerNo (categorical): a five-digit unique number that defines each customer.
8. Country (categorical): name of the country where the customer resides.
There is a small percentage of order cancellations in the data set. Most of these cancellations were due to out-of-stock conditions on some products. In these situations, customers tend to cancel the order because they want all products delivered at once.
Information is a key asset of businesses nowadays. The success of a business in a competitive environment depends on its ability to acquire, store, and utilize information. Data is one of the main sources of information. Therefore, data analysis is an important activity for acquiring new and useful information. Analyze this dataset and try to answer the following questions.
1. How was the sales trend over the months?
2. What are the most frequently purchased products?
3. How many products does the customer purchase in each transaction?
4. Which customer segments are the most profitable?
5. Based on your findings, what strategy could you recommend to the business to gain more profit?
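For instance, the monthly sales trend (question 1) can be sketched with pandas. The rows below are made-up stand-ins following the column description above, and cancellations are excluded via the “C” prefix on TransactionNo:

```python
import pandas as pd

# Tiny stand-in for the real file (columns follow the description above).
df = pd.DataFrame({
    "TransactionNo": ["536365", "C536379", "536520", "537000"],
    "Date": pd.to_datetime(["2019-01-05", "2019-01-07", "2019-01-20", "2019-02-03"]),
    "Price": [6.04, 6.04, 11.94, 7.24],
    "Quantity": [12, -12, 5, 10],
})

# Exclude cancellations: TransactionNo prefixed with "C".
sales = df[~df["TransactionNo"].str.startswith("C")].copy()

# Revenue per line item, aggregated by calendar month -> the monthly trend.
sales["Revenue"] = sales["Price"] * sales["Quantity"]
monthly = sales.groupby(sales["Date"].dt.to_period("M"))["Revenue"].sum()
print(monthly)
```

The same grouped frame can feed the other questions, e.g. grouping by product name for question 2.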
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains 5 million synthetically generated financial transactions designed to simulate real-world behavior for fraud detection research and machine learning applications. Each transaction record includes fields such as:
Transaction Details: ID, timestamp, sender/receiver accounts, amount, type (deposit, transfer, etc.)
Behavioral Features: time since last transaction, spending deviation score, velocity score, geo-anomaly score
Metadata: location, device used, payment channel, IP address, device hash
Fraud Indicators: binary fraud label (is_fraud) and type of fraud (e.g., money laundering, account takeover)
The dataset follows realistic fraud patterns and behavioral anomalies, making it suitable for:
Binary and multiclass classification models
Fraud detection systems
Time-series anomaly detection
Feature engineering and model explainability
The dataset contains transactions made by credit cards in September 2013 by European cardholders. It presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features or more background information about the data. Features V1, V2, … V28 are the principal components obtained with PCA; the only features not transformed with PCA are 'Time' and 'Amount'. The feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount; this feature can be used for example-dependent, cost-sensitive learning. The feature 'Class' is the response variable; it takes value 1 in case of fraud and 0 otherwise.
Given the class imbalance ratio, we recommend measuring performance with the Area Under the Precision-Recall Curve (AUPRC). Plain accuracy from the confusion matrix is not meaningful for unbalanced classification.
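AUPRC can be computed with scikit-learn's `average_precision_score`; the labels and scores below are toy stand-ins for real model output:

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Toy labels/scores standing in for a classifier's output on imbalanced data:
# eight legitimate transactions, two frauds.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.05, 0.1, 0.3, 0.2, 0.15, 0.4, 0.8, 0.25])

# Area under the precision-recall curve (average precision).
auprc = average_precision_score(y_true, y_score)
print(f"AUPRC: {auprc:.3f}")
```

Unlike accuracy, this score is driven entirely by how well the rare positive class is ranked.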
By UCI [source]
Comprehensive Dataset on Online Retail Sales and Customer Data
Welcome to this comprehensive dataset offering a wide array of information related to online retail sales. This data set provides an in-depth look at transactions, product details, and customer information documented by an online retail company based in the UK. The scope of the data is vast, ranging from granular details about each product sold to extensive customer data from different countries.
This transnational data set is a treasure trove of vital business insights as it meticulously catalogues all the transactions that happened during its span. It houses rich transactional records curated by a renowned non-store online retail company based in the UK known for selling unique all-occasion gifts. A considerable portion of its clientele includes wholesalers; ergo, this dataset can prove instrumental for companies looking for patterns or studying purchasing trends among such businesses.
The available attributes within this dataset offer valuable pieces of information:
InvoiceNo: This attribute refers to invoice numbers that are six-digit integral numbers uniquely assigned to every transaction logged in this system. Transactions marked with 'c' at the beginning signify cancellations - adding yet another dimension for purchase pattern analysis.
StockCode: Stock Code corresponds with specific items as they're represented within the inventory system via 5-digit integral numbers; these allow easy identification and distinction between products.
Description: This refers to product names, giving users qualitative knowledge about what kind of items are being bought and sold frequently.
Quantity: These figures capture the volume of each product per transaction – important figures that can help understand buying trends better.
InvoiceDate: Invoice Dates detail when each transaction was generated down to precise timestamps – invaluable when conducting time-based trend analysis or segmentation studies.
UnitPrice: Unit prices represent how much each unit retails at — crucial for revenue calculations or cost-related analyses.
Country: This locational attribute shows where each customer hails from, adding geographical segmentation to your data investigation toolkit.
This dataset was originally collated by Dr Daqing Chen, Director of the Public Analytics group based at the School of Engineering, London South Bank University. His research studies and business cases with this dataset have been published in various papers contributing to establishing a solid theoretical basis for direct, data and digital marketing strategies.
Access to such records can enrich explorations or help formulate insightful hypotheses about consumer behavior patterns among wholesalers. Whether it's managing inventory, studying transactional trends over time, or spotting cancellation patterns, this dataset is apt for multiple forms of retail analysis.
1. Sales Analysis:
Sales data forms the backbone of this dataset, and it allows users to delve into various aspects of sales performance. You can use the Quantity and UnitPrice fields to calculate metrics like revenue, and further combine it with InvoiceNo information to understand sales over individual transactions.
2. Product Analysis:
Each product in this dataset comes with its unique identifier (StockCode) and its name (Description). You could analyse which products are most popular based on Quantity sold or look at popularity per transaction by considering both Quantity and InvoiceNo.
3. Customer Segmentation:
If you apply business logic to the transactions (such as calculating total amounts), you can use standard machine learning methods or RFM (Recency, Frequency, Monetary) segmentation techniques, combined with 'CustomerID', to understand your customer base better. Counting invoice numbers (which represent separate transactions) per client will give insights about your clients as well.
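A minimal RFM sketch with pandas, using made-up transaction lines in place of the real file:

```python
import pandas as pd

# Stand-in transaction lines; the real data has CustomerID, InvoiceNo,
# InvoiceDate, Quantity and UnitPrice as described above.
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 2, 3],
    "InvoiceNo": ["A1", "A2", "B1", "B1", "B2", "C1"],
    "InvoiceDate": pd.to_datetime(
        ["2011-11-01", "2011-12-05", "2011-10-10",
         "2011-10-10", "2011-12-08", "2011-06-01"]),
    "Quantity": [10, 5, 2, 1, 4, 20],
    "UnitPrice": [2.0, 3.0, 5.0, 7.0, 1.5, 0.5],
})
tx["Amount"] = tx["Quantity"] * tx["UnitPrice"]

# Recency is measured from the day after the last observed invoice.
snapshot = tx["InvoiceDate"].max() + pd.Timedelta(days=1)
rfm = tx.groupby("CustomerID").agg(
    recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),
    frequency=("InvoiceNo", "nunique"),   # distinct invoices = transactions
    monetary=("Amount", "sum"),
)
print(rfm)
```

The resulting per-customer table can then be binned into R/F/M score groups or fed to a clustering algorithm.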
4. Geographical Analysis:
The Country column enables analysts to study purchase patterns across different geographical locations.
Practical applications
Understand what products sell best where – this can help drive tailored marketing strategies. Anomaly detection – identify unusual behaviors that might indicate fraud.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Retail Transaction Data’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/michalfr/retail-transaction-data on 13 February 2022.
--- Dataset description provided by original source is as follows ---
This dataset contains transactions and the products they contain, which were obtained by scanning receipts from retail establishments by numerous users. Products were categorized by our proprietary NLP model.
Data was collected over a one-year period and contains product information from purchases made within that period, product category inferred from product name, information about organization, transaction to which products belong to and user that uploaded receipt.
The total user count is 22. The total retail organization count is 179. The total transaction count is 805. The total product count is 7477.
@kserno
Product categorization, User Behaviour Analysis, Product Analysis, Product Price Comparison between Various Retail Stores, Prediction of Next Transaction
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Below is a draft DMP-style description of the credit-card fraud detection experiment, modeled on the antiquities example:
Research Domain
This work resides in the domain of financial fraud detection and applied machine learning. We focus on detecting anomalous credit‐card transactions in real time to reduce financial losses and improve trust in digital payment systems.
Purpose
The goal is to train and evaluate a binary classification model that flags potentially fraudulent transactions. By publishing both the code and data splits via FAIR repositories, we enable reproducible benchmarking of fraud‐detection algorithms and support future research on anomaly detection in transaction data.
Data Sources
We used the publicly available credit‐card transaction dataset from Kaggle (original source: https://www.kaggle.com/mlg-ulb/creditcardfraud), which contains anonymized transactions made by European cardholders over two days in September 2013. The dataset includes 284 807 transactions, of which 492 are fraudulent.
Method of Dataset Preparation
Schema validation: Renamed columns to snake_case (e.g. transaction_amount, is_declined) so they conform to DBRepo’s requirements.
Data import: Uploaded the full CSV into DBRepo and assigned persistent identifiers (PIDs).
Splitting: Programmatically derived three subsets (training 70%, validation 15%, test 15%) using range-based filters on the primary key actionnr. Each subset was materialized in DBRepo and assigned its own PID for precise citation.
Cleaning: Converted the categorical flags (is_declined, isforeigntransaction, ishighriskcountry, isfradulent) from “Y”/“N” to 1/0 and dropped non-feature identifiers (actionnr, merchant_id).
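The cleaning and splitting steps above can be sketched in pandas on a tiny stand-in table (the actual subsets are derived by DBRepo filters, not shown here):

```python
import pandas as pd

# Stand-in rows; the real table uses actionnr as primary key and
# "Y"/"N" categorical flags as described above.
df = pd.DataFrame({
    "actionnr": range(1, 11),
    "is_declined": ["N", "N", "Y", "N", "N", "Y", "N", "N", "N", "Y"],
    "transaction_amount": [10.0, 25.0, 5.0, 40.0, 12.0, 7.0, 30.0, 22.0, 18.0, 9.0],
})

# Cleaning: "Y"/"N" -> 1/0.
df["is_declined"] = (df["is_declined"] == "Y").astype(int)

# Splitting: deterministic 70/15/15 range-based split on the primary key,
# so each subset is reproducible and citable.
n = len(df)
train = df[df["actionnr"] <= int(n * 0.70)]
val = df[(df["actionnr"] > int(n * 0.70)) & (df["actionnr"] <= int(n * 0.85))]
test = df[df["actionnr"] > int(n * 0.85)]
print(len(train), len(val), len(test))
```

Range-based splits on a key are stable across reruns, which is what makes per-subset PIDs meaningful.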
Modeling: Trained a RandomForest classifier on the training split, tuned on validation, and evaluated on the held‐out test set.
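The modeling step might look roughly like this; the features and labels below are synthetic stand-ins, since the real splits live in DBRepo:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)

# Synthetic stand-in features/labels with a simple linear signal.
X = rng.normal(size=(300, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=300) > 1.2).astype(int)

X_train, y_train = X[:210], y[:210]   # ~70% train
X_test, y_test = X[210:], y[210:]     # held-out evaluation

# Fit on the training split, then score the held-out set.
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]
print(f"test AUPRC: {average_precision_score(y_test, scores):.3f}")
```

In the real pipeline, hyperparameters would be tuned against the validation subset before the single final evaluation on the test subset.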
Dataset Structure
The raw data is a single CSV with columns:
actionnr (integer transaction ID)
merchant_id (string)
average_amount_transaction_day (float)
transaction_amount (float)
is_declined, isforeigntransaction, ishighriskcountry, isfradulent (binary flags)
total_number_of_declines_day, daily_chargeback_avg_amt, sixmonth_avg_chbk_amt, sixmonth_chbk_freq (numeric features)
Naming Conventions
All columns use lowercase snake_case.
Subsets are named creditcard_training, creditcard_validation, creditcard_test in DBRepo.
Files in the code repo follow a clear structure:
├── data/ # local copies only; raw data lives in DBRepo
├── notebooks/Task.ipynb
├── models/rf_model_v1.joblib
├── outputs/ # confusion_matrix.png, roc_curve.png, predictions.csv
├── README.md
├── requirements.txt
└── codemeta.json
Required Software
Python 3.9+
pandas, numpy (data handling)
scikit-learn (modeling, metrics)
matplotlib (visualizations)
dbrepo‐client.py (DBRepo API)
requests (TU WRD API)
Additional Resources
Original dataset: https://www.kaggle.com/mlg-ulb/creditcardfraud
Scikit-learn docs: https://scikit-learn.org/stable
DBRepo API guide: via the starter notebook’s dbrepo_client.py template
TU WRD REST API spec: https://test.researchdata.tuwien.ac.at/api/docs
Data Limitations
Highly imbalanced: only ~0.17% of transactions are fraudulent.
Anonymized PCA features (V1–V28) are hidden; we extended with domain features but cannot reverse-engineer the raw variables.
Time‐bounded: only covers two days of transactions, may not capture seasonal patterns.
Licensing and Attribution
Raw data: CC-0 (per Kaggle terms)
Code & notebooks: MIT License
Model artifacts & outputs: CC-BY 4.0
TU WRD records include ORCID identifiers for the author.
Recommended Uses
Benchmarking new fraud‐detection algorithms on a standard imbalanced dataset.
Educational purposes: demonstrating model‐training pipelines, FAIR data practices.
Extension: adding time‐series or deep‐learning models.
Known Issues
Possible temporal leakage if date/time features not handled correctly.
Model performance may degrade on live data due to concept drift.
Binary flags may oversimplify nuanced transaction outcomes.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Many projects require datasets about bank transactions to test their systems. Unfortunately, it is hard to find a dataset that would have transaction product categorization which is important for many analytical projects.
Here you have 4 datasets. Clients - basic information about bank users. Categories - standard transaction categories used by many banks worldwide. Transactions - the core of the dataset: basic information about transactions, such as the counterparty account, category, amount, etc. Subscriptions - information about subscriptions, in other words, transactions which are made automatically.
Synthetic transactional data with labels for fraud detection. For more information, see: https://www.kaggle.com/ntnu-testimon/paysim1/version/2
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
🔍 Online Payment Fraud Detection Dataset | Real-World Transactions 💳
Fraudulent transactions are a growing challenge for fintech companies. This dataset captures 51,000+ transactions, each labeled as fraudulent or legitimate, based on real-world patterns.
It includes transaction details, user behavior, payment methods, and device usage, making it ideal for:
✅ Fraud detection modeling (classification)
✅ Feature engineering & anomaly detection
✅ Exploratory data analysis (EDA) & pattern recognition
Transaction Details: Amount, type, time, and payment method 💰
User Behavior: Past fraud history, account age, recent activity 📊
Device & Location: Device used, transaction location 🌍
🚀 Train machine learning models to detect fraud
📉 Analyze patterns of fraud in financial transactions
🔎 Optimize fraud prevention strategies
Ready to fight fraud? Let’s dive in! 🔥
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Retail Transaction Data’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/regivm/retailtransactiondata on 28 January 2022.
--- Dataset description provided by original source is as follows ---
The data provides customer- and date-level transactions for a few years. It can be used to demonstrate any analysis that requires transaction information, such as RFM. The data also provides customers' responses to a promotion campaign.
A highlight of this dataset is that you can evaluate the effectiveness of RFM groups by checking one of the business metrics: the response of customers.
The transaction data provides customer_id, transaction date and amount of purchase. The response data provides the response of each customer: a binary variable indicating whether the customer responded to a campaign or not.
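A minimal sketch of that evaluation idea, using made-up per-customer rows (a frequency measure plus the binary response) in place of a full RFM scoring:

```python
import pandas as pd

# Stand-in: per-customer frequency (derived from the transaction file)
# joined with the binary campaign response (from the response file).
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "frequency": [12, 3, 8, 1, 15, 2],
    "response": [1, 0, 1, 0, 1, 0],
})

# Bucket customers by frequency and compare response rates across buckets --
# the business-metric check of RFM group effectiveness described above.
customers["freq_group"] = pd.qcut(customers["frequency"], q=2, labels=["low", "high"])
rate = customers.groupby("freq_group", observed=True)["response"].mean()
print(rate)
```

With full RFM, the same comparison runs over R/F/M score combinations instead of a single frequency split.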
Extremely thankful to the numerous kernel and data publishers on Kaggle and GitHub. Learnt a lot from these communities.
More innovative approaches for handling RFM Analysis.
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Store Transaction data’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/iamprateek/store-transaction-data on 14 February 2022.
--- Dataset description provided by original source is as follows ---
Nielsen receives transaction level scanning data (POS Data) from its partner stores on a regular basis. Stores sharing POS data include bigger format store types such as supermarkets, hypermarkets as well as smaller traditional trade grocery stores (Kirana stores), medical stores etc. using a POS machine.
While in a bigger-format store all items in all transactions are scanned using a POS machine, smaller and more localized shops do not have a 100% compliance rate in scanning and entering information into the POS machine for every transaction.
A transaction involving a single packet of chips or a single piece of candy may not be scanned and recorded, either to spare the customer the inconvenience or during rush hours when the store is crowded with customers.
Thus, the data received from such stores is often incomplete and lacks complete information of all transactions completed within a day.
Additionally, apart from incomplete transaction data in a day, it is observed that certain stores do not share data for all active days. Stores share data ranging from 2 to 28 days in a month. While it is possible to impute/extrapolate data for 2 days of a month using 28 days of actual historical data, the vice versa is not recommended.
Nielsen encourages you to create a model which can help impute/extrapolate data to fill in the missing data gaps in the store level POS data currently received.
You are provided with the dataset that contains store level data by brands and categories for select stores-
Hackathon_Ideal_Data - The file contains brand level data for 10 stores for the last 3 months. This can be referred to as the ideal data.
Hackathon_Working_Data - This contains data for selected stores which are missing and/or incomplete.
Hackathon_Mapping_File - This file is provided to help understand the column names in the data set.
Hackathon_Validation_Data - This file contains the data stores and product groups for which you have to predict the Total_VALUE.
Sample Submission - This file shows what the candidate needs to upload as output, in the same format. Sample data is provided in the file to help understand the columns and values required.
Nielsen Holdings plc (NYSE: NLSN) is a global measurement and data analytics company that provides the most complete and trusted view available of consumers and markets worldwide. Nielsen is divided into two business units. Nielsen Global Media, the arbiter of truth for media markets, provides media and advertising industries with unbiased and reliable metrics that create a shared understanding of the industry required for markets to function. Nielsen Global Connect provides consumer packaged goods manufacturers and retailers with accurate, actionable information and insights and a complete picture of the complex and changing marketplace that companies need to innovate and grow. Our approach marries proprietary Nielsen data with other data sources to help clients around the world understand what’s happening now, what’s happening next, and how to best act on this knowledge. An S&P 500 company, Nielsen has operations in over 100 countries, covering more than 90% of the world’s population.
Know more: https://www.nielsen.com/us/en/
Build an imputation and/or extrapolation model to fill the missing data gaps for select stores by analyzing the data and determine which factors/variables/features can help best predict the store sales.
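A naive extrapolation baseline, on hypothetical store-day values, simply scales the observed daily average up to the full month; a real submission would add brand/category features and seasonality:

```python
import pandas as pd

# Stand-in POS data: one store reporting only 4 days of a 30-day month.
pos = pd.DataFrame({
    "store": ["S1"] * 4,
    "day": [1, 2, 3, 4],
    "value": [100.0, 120.0, 80.0, 100.0],
})

# Baseline: estimated monthly total = observed daily average * days in month.
days_in_month = 30
observed = pos.groupby("store")["value"].agg(["sum", "count"])
observed["est_month_value"] = observed["sum"] / observed["count"] * days_in_month
print(observed)
```

Baselines like this give a floor to beat when judging whether a learned imputation model actually helps.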
--- Original source retains full ownership of the source dataset ---
https://creativecommons.org/publicdomain/zero/1.0/
This dataset is designed to help data scientists and machine learning enthusiasts develop robust fraud detection models. It contains realistic synthetic transaction data, including user information, transaction types, risk scores, and more, making it ideal for binary classification tasks with models like XGBoost and LightGBM.
| Column Name | Description |
|---|---|
| Transaction_ID | Unique identifier for each transaction |
| User_ID | Unique identifier for the user |
| Transaction_Amount | Amount of money involved in the transaction |
| Transaction_Type | Type of transaction (Online, In-Store, ATM, etc.) |
| Timestamp | Date and time of the transaction |
| Account_Balance | User's current account balance before the transaction |
| Device_Type | Type of device used (Mobile, Desktop, etc.) |
| Location | Geographical location of the transaction |
| Merchant_Category | Type of merchant (Retail, Food, Travel, etc.) |
| IP_Address_Flag | Whether the IP address was flagged as suspicious (0 or 1) |
| Previous_Fraudulent_Activity | Number of past fraudulent activities by the user |
| Daily_Transaction_Count | Number of transactions made by the user that day |
| Avg_Transaction_Amount_7d | User's average transaction amount in the past 7 days |
| Failed_Transaction_Count_7d | Count of failed transactions in the past 7 days |
| Card_Type | Type of payment card used (Credit, Debit, Prepaid, etc.) |
| Card_Age | Age of the card in months |
| Transaction_Distance | Distance between the user's usual location and the transaction location |
| Authentication_Method | How the user authenticated (PIN, Biometric, etc.) |
| Risk_Score | Fraud risk score computed for the transaction |
| Is_Weekend | Whether the transaction occurred on a weekend (0 or 1) |
| Fraud_Label | Target variable (0 = Not Fraud, 1 = Fraud) |
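A training sketch for the binary task described above, using scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost/LightGBM and synthetic stand-in columns:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n = 400

# Synthetic stand-ins for a few of the columns in the table above.
X = np.column_stack([
    rng.lognormal(3.0, 1.0, n),        # Transaction_Amount
    rng.integers(0, 2, n),             # IP_Address_Flag
    rng.integers(0, 5, n),             # Previous_Fraudulent_Activity
    rng.uniform(0, 100, n),            # Transaction_Distance
])
# Made-up label leaning on the flag and past-fraud columns, plus noise.
y = ((X[:, 1] + X[:, 2] / 4 + rng.normal(0, 0.5, n)) > 1.5).astype(int)

# Train on the first 300 rows, evaluate on the last 100.
clf = GradientBoostingClassifier(random_state=0).fit(X[:300], y[:300])
auc = roc_auc_score(y[300:], clf.predict_proba(X[300:])[:, 1])
print(f"holdout ROC-AUC: {auc:.3f}")
```

Swapping in XGBoost or LightGBM is a one-line change to the estimator; the fit/score interface is the same.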
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains 30,000 unique retail transactions, each representing a customer's shopping basket in a simulated grocery store environment. The data was generated with realistic product combinations and purchase patterns, suitable for association rule mining, recommendation systems and market basket analysis.
Each row corresponds to a single transaction, listing:
A unique transaction ID
A customer ID
The full list of products bought in that transaction
The time of the transaction
The dataset includes products across various categories such as beverages, snacks, dairy, household items, fruits, vegetables and frozen foods.
This data is entirely synthetic and does not contain any real user information.
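As a starting point for market basket analysis, pairwise co-occurrence and support can be counted directly; the baskets below are made-up stand-ins for the transaction rows:

```python
from collections import Counter
from itertools import combinations

# Stand-in baskets: each inner list is the product list of one transaction.
baskets = [
    ["milk", "bread", "eggs"],
    ["milk", "bread"],
    ["bread", "butter"],
    ["milk", "eggs"],
]

# Count how often each unordered product pair is bought together --
# the raw counts behind association-rule metrics like support.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(set(basket)), 2):
        pair_counts[pair] += 1

# Support = fraction of transactions containing the pair.
support = {pair: c / len(baskets) for pair, c in pair_counts.items()}
print(support[("bread", "milk")])
```

Confidence and lift follow from these same counts, and libraries such as mlxtend automate the full Apriori pipeline.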
Original Data Source: Retail Transaction Dataset
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Priyanshu Gautam
Released under Apache 2.0
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset from https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud
The dataset contains transactions made by credit cards in September 2013 by European cardholders.
This dataset presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Fraud detection bank dataset 20K records binary’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/volodymyrgavrysh/fraud-detection-bank-dataset-20k-records-binary on 28 January 2022.
--- Dataset description provided by original source is as follows ---
Banks are often exposed to fraudulent transactions and constantly improve systems to track them.
A bank dataset containing 20k+ transactions with 112 numerical features.
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘OpenSea Daily Ethereum Transactions’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/ankanhore545/opensea-daily-transactions on 13 February 2022.
--- Dataset description provided by original source is as follows ---
This all-time data represents the raw on-chain activity of the tracked smart contracts.
I am thankful that we could collect the data from the DappRadar platform: https://dappradar.com/ethereum/marketplaces/opensea These are for 5 ETH smart contracts, as mentioned on the above site.
--- Original source retains full ownership of the source dataset ---
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains detailed sales transactions, including order details, revenue, profit, and customer information. It can be used for sales analysis, trend forecasting, and business intelligence insights. The data covers multiple product categories and is structured to facilitate easy analysis of sales performance across different locations and time periods.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘OpenSea Daily Polygon Transactions’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/ankanhore545/opensea-daily-polygon-transactions on 13 February 2022.
--- Dataset description provided by original source is as follows ---
This all-time data represents the raw on-chain activity of the tracked smart contracts.
I am thankful that we could collect the data from the DappRadar platform: https://dappradar.com/polygon/marketplaces/opensea These are for 1 Polygon smart contract, as mentioned on the above site.
--- Original source retains full ownership of the source dataset ---
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Jashandeep Kaur
Released under Apache 2.0