Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset provides a detailed look into transactional behavior and financial activity patterns, ideal for exploring fraud detection and anomaly identification. It contains 2,512 samples of transaction data, covering various transaction attributes, customer demographics, and usage patterns. Each entry offers comprehensive insights into transaction behavior, enabling analysis for financial security and fraud detection applications.
Key Features:
This dataset is ideal for data scientists, financial analysts, and researchers looking to analyze transactional patterns, detect fraud, and build predictive models for financial security applications. The dataset was designed for machine learning and pattern analysis tasks and is not intended as a primary data source for academic publications.
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by AnjaliGupta
Released under CC0: Public Domain
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
To identify online payment fraud with machine learning, we need to train a model that classifies payments as fraudulent or non-fraudulent. For this, we need a dataset containing information about online payment fraud, so that we can understand what types of transactions lead to fraud. For this task, I collected a dataset from Kaggle containing historical records of fraudulent transactions, which can be used to detect fraud in online payments. Below are all the columns from the dataset I'm using here:
* step: represents a unit of time, where 1 step equals 1 hour
* type: type of online transaction
* amount: the amount of the transaction
* nameOrig: customer starting the transaction
* oldbalanceOrg: sender's balance before the transaction
* newbalanceOrig: sender's balance after the transaction
* nameDest: recipient of the transaction
* oldbalanceDest: recipient's balance before the transaction
* newbalanceDest: recipient's balance after the transaction
* isFraud: whether the transaction is fraudulent
I hope you now know about the data I am using for the online payment fraud detection task. Now in the section below, I’ll explain how we can use machine learning to detect online payment fraud using Python.
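The workflow described above can be sketched as follows. This is a minimal illustration, assuming pandas and scikit-learn are available; the tiny inline frame stands in for the Kaggle CSV, and the decision tree is just one reasonable model choice, not a prescribed one.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# In practice you would load the Kaggle CSV, e.g.:
#   data = pd.read_csv("onlinefraud.csv")
# A tiny hand-made frame stands in here so the sketch runs on its own.
data = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "TRANSFER", "PAYMENT", "CASH_OUT"],
    "amount": [9839.64, 181.0, 181.0, 215310.3, 7107.77, 311685.89],
    "oldbalanceOrg": [170136.0, 181.0, 181.0, 705.0, 183195.0, 10835.0],
    "newbalanceOrig": [160296.36, 0.0, 0.0, 0.0, 176087.23, 0.0],
    "isFraud": [0, 1, 1, 1, 0, 0],
})

# The transaction type is categorical, so map it to integer codes
# (the code assignment is arbitrary, chosen here for illustration).
data["type"] = data["type"].map(
    {"CASH_OUT": 1, "PAYMENT": 2, "CASH_IN": 3, "TRANSFER": 4, "DEBIT": 5}
)

features = data[["type", "amount", "oldbalanceOrg", "newbalanceOrig"]]
labels = data["isFraud"]

model = DecisionTreeClassifier(random_state=0)
model.fit(features, labels)

# Score a new transaction: [type, amount, oldbalanceOrg, newbalanceOrig]
prediction = model.predict([[4, 9000.60, 9000.60, 0.0]])
```

On the real dataset you would also split into train/test sets and evaluate on held-out data rather than scoring on the training frame.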
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Mohamed NIANG
Released under CC0: Public Domain
This dataset is designed to support the creation and detection of fake reviews for online products. It comprises a collection of 40,000 product reviews, equally split between 20,000 authentic, human-generated reviews and 20,000 computer-generated fake reviews. The dataset includes information on review content, categorisation, and associated ratings, making it a valuable resource for developing and testing review integrity solutions within e-commerce and other online platforms.
The dataset contains 40,412 unique entries in total, roughly balanced between fake and real product reviews (about 20,000 each). Data is provided in CSV file format.
The distribution of ratings is as follows:
* 1.00 - 1.20: 2,155 entries
* 2.00 - 2.20: 1,967 entries
* 3.00 - 3.20: 3,786 entries
* 4.00 - 4.20: 7,965 entries
* 4.80 - 5.00: 24,559 entries
The dataset categorisation includes:
* Kindle_Store_5: 12%
* Books_5: 11%
* Other: 77% (31,332 entries)
This dataset is ideal for training machine learning models to identify and flag fraudulent or computer-generated product reviews. It can be utilised for:
* Developing Natural Language Processing (NLP) models for sentiment analysis and text classification.
* Building AI & Machine Learning solutions for fraud detection in online marketplaces.
* Researching the characteristics and patterns of authentic versus fabricated consumer feedback.
* Enhancing the trustworthiness and reliability of online review systems.
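A text-classification baseline of the kind described above could be sketched like this, assuming scikit-learn. The toy reviews, the labels (1 = computer-generated, 0 = human-written), and the choice of TF-IDF plus logistic regression are all illustrative assumptions, not details taken from the dataset itself.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the dataset's review text.
reviews = [
    "Great product, works exactly as described and arrived quickly.",
    "I love this item it is the best item I have ever bought best item.",
    "The battery life is shorter than advertised but overall decent.",
    "Amazing amazing amazing product best purchase amazing quality.",
]
labels = [0, 1, 0, 1]  # 1 = computer-generated, 0 = human-written (assumed labels)

# TF-IDF features feeding a linear classifier -- a common baseline for
# detecting machine-generated text; the model choice is not prescribed.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(reviews, labels)

pred = clf.predict(["best item best item amazing amazing"])
```

With the full 40,000-review corpus, a held-out test split and stronger models (e.g., gradient boosting or transformer encoders) would be the natural next steps.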
The dataset has global coverage, making it applicable for systems and research worldwide. While specific time ranges for the reviews themselves are not explicitly detailed, the data's utility is broad across various product categories and review contexts within e-commerce.
CC-BY
This dataset is suitable for:
* Data Scientists and Machine Learning Engineers: to develop and fine-tune models for fake review detection and NLP tasks.
* Researchers: studying consumer behaviour, online trust, and adversarial attacks in digital platforms.
* E-commerce Businesses: to implement internal systems for maintaining review authenticity and improving customer trust.
* Academics and Students: for educational purposes, projects, and academic studies in AI, NLP, and data science.
Original Data Source: 🚨 Fake Reviews Dataset
This demonstration used a fraud detection dataset and kernel, referenced below, to showcase the accuracy and safety of using the products of the Kymera fabrication machine.
The original dataset we used is the Synthetic Financial Datasets For Fraud Detection. This file accurately mimics the original dataset's features while in fact generating the entire dataset from scratch.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. The data has been made public and presents a diverse set of email information ranging from internal, marketing emails to spam and fraud attempts.
In the early 2000s, Leslie Kaelbling at MIT purchased the dataset and noted that, although it contained scam emails, it also had several integrity problems. The dataset was later updated, but it remains essential to ensure privacy in the data when it is used to train a deep neural network model.
Though the Enron Email Dataset contains over 500K emails, one problem with it is the scarcity of labeled fraud examples. Label annotation is needed to detect the full umbrella of fraud emails accurately. Since fraud emails fall into several types, such as phishing, financial, romance, subscription, and Nigerian Prince scams, multiple heuristics are required to label all types of fraudulent emails effectively.
To tackle this problem, heuristics have been used to label the Enron data corpus using email signals, and automated labeling has been performed using simple ML models on other smaller email datasets available online. These fraud annotation techniques are discussed in detail below.
To perform fraud annotation on the Enron dataset, as well as to provide more fraud examples for modeling, two additional fraud data sources have been used:
* Phishing Email Dataset: https://www.kaggle.com/dsv/6090437
* Social Engineering Dataset: http://aclweb.org/aclwiki
To label the Enron email dataset, two signals are used to filter suspicious emails and sort them into fraud and non-fraud classes:
* Automated ML labeling
* Email signals
The following heuristics are used to annotate labels for Enron email data using the other two data sources,
Phishing Model Annotation: a high-precision SVM model trained on the Phishing Email dataset, used to annotate the Phishing label on the Enron dataset.
Social Engineering Model Annotation: a high-precision SVM model trained on the Social Engineering dataset, used to annotate the Social Engineering label on the Enron dataset.
The two ML Annotator models use Term Frequency Inverse Document Frequency (TF-IDF) to embed the input text and make use of SVM models with Gaussian Kernel.
If either model predicted that an email was fraudulent, the email's metadata was checked for several email signals. If these heuristics met the requirements of a high-probability fraud email, it was labeled as a fraud email.
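The annotator design described above (TF-IDF embedding feeding an SVM with a Gaussian kernel) can be sketched with scikit-learn. The tiny corpus below is invented for illustration; in the actual pipeline the model would be trained on the external Phishing or Social Engineering dataset and then applied to Enron emails.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Tiny stand-in training corpus; 1 = phishing, 0 = legitimate.
emails = [
    "Your account has been suspended, verify your password at this link now",
    "Urgent: confirm your bank details to release your pending transfer",
    "Attached are the meeting notes from Tuesday's budget review",
    "Can you forward the Q3 revenue spreadsheet before the call?",
]
labels = [1, 1, 0, 0]

# TF-IDF embedding + SVM with a Gaussian (RBF) kernel, matching the
# annotator design described in the text.
annotator = make_pipeline(TfidfVectorizer(), SVC(kernel="rbf", probability=True))
annotator.fit(emails, labels)

# Annotate an unseen Enron-style email with a phishing probability.
proba = annotator.predict_proba(
    ["Please verify your account password immediately via this link"]
)[0][1]
```

A high-precision setup like the one described would additionally tune the decision threshold on a validation set so that only confident predictions feed the labeling step.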
Email signal-based heuristics are used to filter and specifically target suspicious emails for fraud labeling. The signals used were:
Person Of Interest: There is a publicly available list of email addresses of employees who were liable for the massive data leak at Enron. These user mailboxes have a higher chance of containing quality fraud emails.
Suspicious Folders: The Enron data is dumped into several folders for every employee, such as inbox, deleted_items, junk, and calendar. Folders with a higher chance of containing fraud emails, such as deleted_items and junk, were targeted.
Sender Type: The sender type was categorized as ‘Internal’ and ‘External’ based on their email address.
Low Communication: A threshold of 4 emails, based on the table below, was used to define low communication. A user qualifies as a low-comm sender if they sent fewer emails than this threshold. Mails from low-comm senders were assigned a high probability of being fraud.
Contains Replies and Forwards: If an email contains forwards or replies, it was assigned a low probability of being a fraud email.
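One hypothetical way to combine the signals above into a single filter is a simple score, as sketched below. The weights, the score cutoff, and the field names are all illustrative assumptions; the text does not specify how the authors combined the signals.

```python
# Assumed constants: the threshold comes from the text; the folder set and
# score cutoff are illustrative choices, not the authors' exact values.
LOW_COMM_THRESHOLD = 4
SUSPICIOUS_FOLDERS = {"deleted_items", "junk"}

def is_high_probability_fraud(email):
    """Return True when the signal heuristics mark an email as likely fraud."""
    score = 0
    if email["folder"] in SUSPICIOUS_FOLDERS:
        score += 1                      # suspicious folder signal
    if email["sender_type"] == "External":
        score += 1                      # external senders are more suspect
    if email["sender_email_count"] < LOW_COMM_THRESHOLD:
        score += 1                      # low-communication sender
    if email["in_poi_mailbox"]:
        score += 1                      # mailbox of a person of interest
    if email["has_replies_or_forwards"]:
        score -= 1                      # replies/forwards lower the probability
    return score >= 3

suspect = {
    "folder": "junk",
    "sender_type": "External",
    "sender_email_count": 1,
    "has_replies_or_forwards": False,
    "in_poi_mailbox": True,
}
```

An email like `suspect` above trips four of the positive signals and would be passed on for fraud labeling.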
To ensure high-quality labels, the mismatch examples from ML Annotation have been manually inspected for Enron dataset relabeling.
| Fraud | Non-Fraud |
|-------|-----------|
| 2,327 | 445,090   |
* Enron Email Dataset. URL: https://www.cs.cmu.edu/~enron/. Publisher: MIT, CMU. Authors: Leslie Kaelbling, William W. Cohen. Year: 2015.
* Phishing Email Detection. URL: https://www.kaggle.com/dsv/6090437. DOI: 10.34740/KAGGLE/DSV/6090437. Publisher: Kaggle. Author: Subhadeep Chakraborty. Year: 2023.
* CLAIR Collection of Fraud Email. URL: http://aclweb.org/aclwiki. Author: Radev, D. Year: 2008.
This dataset was created by Arturo Garcia
GPL v2.0: http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
The purpose of this dataset is to find fraudulent credit cards by analyzing their features. Detection of fraudulent credit cards can be done using ML or DL.
The data was collected from the Weka Repository: https://weka.8497.n7.nabble.com/file/n23121/credit_fruad.arff
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This synthetic dataset was specifically designed to support machine learning research and development in counterfeit product detection and anti-fraud systems. The dataset mimics real-world patterns found in e-commerce platforms while containing no actual sensitive or proprietary information, making it ideal for educational purposes, algorithm development, and public research.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Overview:
* Total Records: 749
* Original Records: 700
* Duplicate Records: 49 (7% of total)
* File Name: synthetic_claims_with_duplicates.csv

Key Features:
* Claim Information: unique claim IDs (CLAIM000001 to CLAIM000700), employee IDs (EMP0001 to EMP0700), realistic employee names
* Financial Data: amounts ranging from 100.00 to 20,000.00; service codes SVC001, SVC002, SVC003, SVC004; departments: Finance, HR, IT, Marketing, Operations
* Transaction Details: dates within the last 2 years, timestamps for submission, statuses (Submitted, Approved, Paid), random UUIDs for submitter IDs
* Fraud Detection: 49 exact duplicates (7%), randomly distributed throughout the dataset, with a boolean is_duplicate flag for identification

Purpose: The dataset is designed to test fraud detection systems, particularly for identifying duplicate transactions. It simulates real-world scenarios where duplicate entries might occur due to fraud or data entry errors.
Usage:
* Testing duplicate transaction detection
* Training fraud detection models
* Data validation and cleaning
* Algorithm benchmarking

The dataset is now ready for analysis in your fraud detection system.
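Exact-duplicate detection of the kind this dataset targets can be sketched with pandas. The small frame below is a made-up stand-in for synthetic_claims_with_duplicates.csv, using column names inferred from the overview; note that `duplicated(keep=False)` flags both copies of a repeated row, whereas the dataset's own is_duplicate flag may mark only the injected copy.

```python
import pandas as pd

# Stand-in for synthetic_claims_with_duplicates.csv; values are made up.
claims = pd.DataFrame({
    "claim_id": ["CLAIM000001", "CLAIM000002", "CLAIM000002", "CLAIM000003"],
    "employee_id": ["EMP0001", "EMP0002", "EMP0002", "EMP0003"],
    "amount": [1500.00, 250.75, 250.75, 19999.99],
    "service_code": ["SVC001", "SVC003", "SVC003", "SVC002"],
})

# Flag every row that exactly repeats another row across all columns --
# the same kind of duplicate the dataset injects.
claims["flagged_duplicate"] = claims.duplicated(keep=False)

duplicates = claims[claims["flagged_duplicate"]]
```

On the real file, comparing `flagged_duplicate` against the provided is_duplicate column gives a quick sanity check of a detector's recall.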
This dataset is a sample from the TalkingData AdTracking competition. I kept all the positive examples (where is_attributed == 1), while discarding 99% of the negative samples. The sample has roughly 20% positive examples.
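The sampling step described above (keep all positives, keep 1% of negatives) can be sketched with pandas. The toy clicks frame is invented for the sketch; on the real competition data you would read the full training CSV instead.

```python
import pandas as pd

# Toy clicks frame standing in for the full TalkingData training data:
# 10 positive clicks and 1,000 negatives.
clicks = pd.DataFrame({
    "ip": range(1010),
    "is_attributed": [1] * 10 + [0] * 1000,
})

# Keep every positive example and a random 1% of the negatives,
# then shuffle the combined sample.
positives = clicks[clicks["is_attributed"] == 1]
negatives = clicks[clicks["is_attributed"] == 0].sample(frac=0.01, random_state=0)
sample = pd.concat([positives, negatives]).sample(frac=1, random_state=0)

positive_share = sample["is_attributed"].mean()
```

Because the real data is far more imbalanced than this toy frame, the same procedure there yields a sample that is only roughly 20% positive rather than balanced.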
For this competition, your objective was to predict whether a user will download an app after clicking a mobile app advertisement.
train_sample.csv
- Sampled data
Each row of the training data contains a click record, with the following features.
* ip: IP address of the click
* app: app id for marketing
* device: device type id of the user's mobile phone (e.g., iphone 6 plus, iphone 7, huawei mate 7, etc.)
* os: os version id of the user's mobile phone
* channel: channel id of the mobile ad publisher
* click_time: timestamp of the click (UTC)
* attributed_time: if the user downloaded the app after clicking an ad, the time of the download
* is_attributed: the target to be predicted, indicating whether the app was downloaded

Note that ip, app, device, os, and channel are encoded.
I'm also including Parquet files with various features for use within the course.