46 datasets found

Phishing Awareness Dataset for security breaches
kaggle.com
Updated Apr 7, 2025
Cite
Rasika Ekanayaka @ devLK (2025). Phishing Awareness Dataset for security breaches [Dataset]. https://www.kaggle.com/datasets/rasikaekanayakadevlk/phishing-awareness-dataset-for-security-breaches/versions/1
Dataset updated
Apr 7, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Rasika Ekanayaka @ devLK
Description
🛡️ Simulated Phishing Interaction Dataset

Overview:
This dataset captures user interactions with potentially malicious emails, simulating scenarios relevant to phishing detection and human-centric security analysis. Each row represents a unique email event, enriched with behavioral, technical, and contextual metadata.

🔍 Use Cases

Phishing Click Prediction
Predict if a user will click a link based on hover time, device type, and email domain.

User Risk Profiling
Build behavior models: e.g., do mobile users report threats less often?

Language & Localization Patterns
Evaluate phishing success rates by language and region.

Realistic Red Teaming Simulations
Use as a training or benchmarking set for phishing email simulations.

📉 Example Insights

Users with hover_time_ms < 1000 are 60% more likely to click malicious links.

Emails in Japanese and German had a higher click-through rate, especially on mobile.

Edge and Opera browsers had a lower phishing report rate compared to Firefox and Chrome.

Sample Code Snippet

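import pandas as pd

# Load the simulated phishing interaction data.
df = pd.read_csv("phishing_email_behavior.csv")

# Share of clicked vs. not-clicked events within each device type.
clicked_ratio = df.groupby("device_type")["clicked_link"].value_counts(normalize=True).unstack()
print(clicked_ratio)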
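The first example insight compares click rates for short and long hover times. The following is a minimal sketch of how such a comparison might be computed. It uses a small made-up DataFrame in place of the real CSV, takes the column names hover_time_ms and clicked_link from the description above, and assumes clicked_link is encoded as 0/1. The group labels are hypothetical.

```python
import pandas as pd

# Tiny made-up sample standing in for phishing_email_behavior.csv.
# Assumption: clicked_link is encoded as 0/1.
df = pd.DataFrame(
    {
        "hover_time_ms": [300, 800, 950, 1200, 2500, 4000],
        "clicked_link": [1, 1, 0, 0, 1, 0],
    }
)

# Label each event by whether the user hovered for less than one second.
df["hover_group"] = (df["hover_time_ms"] < 1000).map(
    {True: "under_1s", False: "1s_or_more"}
)

# The mean of a 0/1 column is the click rate within each group.
click_rate = df.groupby("hover_group")["clicked_link"].mean()

# Relative lift of short-hover events over the rest.
lift = click_rate["under_1s"] / click_rate["1s_or_more"] - 1
print(click_rate)
print(f"relative lift: {lift:.0%}")
```

On the real file, the same grouping could be applied directly to the loaded DataFrame; the numbers here only illustrate the calculation and say nothing about the dataset itself.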
Don't Take the Bait: Recognize and Avoid Phishing Attacks
data.urbandatacentre.ca
datasets.ai
+2more
Updated Oct 1, 2024
Cite
(2024). Don't Take the Bait: Recognize and Avoid Phishing Attacks [Dataset]. https://data.urbandatacentre.ca/dataset/gov-canada-2bbfd0ea-1757-488e-89bf-8ad90c521a52
Dataset updated
Oct 1, 2024
License
Open Government Licence - Canada 2.0 (https://open.canada.ca/en/open-government-licence-canada)
License information was derived automatically
Description
Phishing is an attack where a scammer calls you, texts or emails you, or uses social media to trick you into clicking a malicious link, downloading malware, or sharing sensitive information. Phishing attempts are often generic mass messages, but the message appears to be legitimate and from a trusted source (e.g. from a bank, courier company).
Three common types of phishing scams - Get Cyber Safe 2021
datasets.ai
ouvert.canada.ca
+1more
21
Updated Sep 11, 2024
Cite
Communications Security Establishment Canada | Centre de la sécurité des télécommunications Canada (2024). Three common types of phishing scams - Get Cyber Safe 2021 [Dataset]. https://datasets.ai/datasets/46785d5c-21e4-4b4a-a499-ed3445fac760
21Available download formats
Dataset updated
Sep 11, 2024
Dataset provided by
Communications Security Establishment Canada (https://cyber.gc.ca/en/)
Authors
Communications Security Establishment Canada | Centre de la sécurité des télécommunications Canada
Description
Be aware of the common types of phishing scams that are out there.
What are the most common forms of phishing? An overview - Get Cyber Safe...
beta.data.urbandatacentre.ca
data.urbandatacentre.ca
Updated Oct 22, 2024
Cite
(2024). What are the most common forms of phishing? An overview - Get Cyber Safe 2021 - Catalogue - Canadian Urban Data Catalogue (CUDC) [Dataset]. https://beta.data.urbandatacentre.ca/dataset/gov-canada-65e7c86c-77a2-4283-b493-ea56f36ea36b
Dataset updated
Oct 22, 2024
License
Open Government Licence - Canada 2.0 (https://open.canada.ca/en/open-government-licence-canada)
License information was derived automatically
Area covered
Canada
Description
A quick overview of the most common types of phishing campaigns that cyber criminals use to steal your information.
Enron Fraud Email Dataset
kaggle.com
Updated Dec 28, 2023
Cite
Advaith S Rao (2023). Enron Fraud Email Dataset [Dataset]. https://www.kaggle.com/datasets/advaithsrao/enron-fraud-email-dataset/versions/1
Dataset updated
Dec 28, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Advaith S Rao
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. The data has been made public and presents a diverse set of email information ranging from internal, marketing emails to spam and fraud attempts.

In the early 2000s, Leslie Kaelbling at MIT purchased the dataset and noted that, though the dataset contained scam emails, it also had several integrity problems. The dataset was later updated, but preserving privacy remains essential when the data is used to train a deep neural network model.

Though the Enron Email Dataset contains over 500K emails, one of its problems is the limited availability of labeled fraud emails. Label annotation is done so that a broad umbrella of fraud emails can be detected accurately. Because fraud emails fall into several types, such as Phishing, Financial, Romance, Subscription, and Nigerian Prince scams, multiple heuristics have to be used to label all types of fraudulent emails effectively.

To tackle this problem, heuristics have been used to label the Enron data corpus using email signals, and automated labeling has been performed using simple ML models on other smaller email datasets available online. These fraud annotation techniques are discussed in detail below.

To perform fraud annotation on the Enron dataset as well as provide more fraud examples for modeling, two more fraud data sources have been used:

Phishing Email Dataset: https://www.kaggle.com/dsv/6090437

Social Engineering Dataset: http://aclweb.org/aclwiki

Label Annotation

To label the Enron email dataset, two signals are used to filter suspicious emails and label them into fraud and non-fraud classes: automated ML labeling and email signals.

Automated ML Labeling

The following heuristics are used to annotate labels for Enron email data using the other two data sources:

Phishing Model Annotation: A high-precision SVM model trained on the Phishing mails dataset, which is used to annotate the Phishing Label on the Enron Dataset.

Social Engineering Model Annotation: A high-precision SVM model trained on the Social Engineering mails dataset, which is used to annotate the Social Engineering Label on the Enron Dataset.

The two ML Annotator models use Term Frequency Inverse Document Frequency (TF-IDF) to embed the input text and make use of SVM models with Gaussian Kernel.

If either of the models predicted that an email was a fraud, the mail metadata was checked for several email signals. If these heuristics meet the requirements of a high-probability fraud email, we label it as a fraud email.
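The annotator setup described above (TF-IDF features, an SVM with a Gaussian kernel, and a preference for precision) could be sketched roughly as follows. This assumes scikit-learn is available. The training texts, the unlabeled emails, and the THRESHOLD value are all made up for illustration, and this is not the dataset author's actual code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Made-up training texts standing in for the external phishing dataset.
train_texts = [
    "verify your account password immediately by clicking this link",
    "your account is suspended, confirm your bank details now",
    "urgent: update your password to avoid account closure",
    "meeting moved to 3pm, see the attached agenda",
    "please review the quarterly gas trading report",
    "lunch on friday? let me know what works",
]
train_labels = [1, 1, 1, 0, 0, 0]  # 1 = phishing, 0 = legitimate

# TF-IDF features fed into an SVM with a Gaussian (RBF) kernel.
annotator = make_pipeline(TfidfVectorizer(), SVC(kernel="rbf", gamma="scale"))
annotator.fit(train_texts, train_labels)

# Made-up unlabeled emails standing in for the Enron corpus.
enron_texts = [
    "confirm your password and bank account details now",
    "agenda for the trading meeting attached",
]

# Signed distance from the decision boundary; larger means more phishing-like.
scores = annotator.decision_function(enron_texts)

# Hypothetical cutoff: raising it trades recall for precision.
THRESHOLD = 0.5
candidates = [text for text, s in zip(enron_texts, scores) if s > THRESHOLD]
print(scores, candidates)
```

Thresholding the decision score rather than using the default prediction is one simple way to favor precision; emails that pass would then still go through the email-signal checks described in the text.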

Email Signals

Email signal-based heuristics are used to filter and target suspicious emails specifically for fraud labeling. The signals used were:

Person Of Interest: There is a publicly available list of email addresses of employees who were liable for the massive data leak at Enron. These user mailboxes have a higher chance of containing quality fraud emails.

Suspicious Folders: The Enron data is dumped into several folders for every employee. Folders consist of inbox, deleted_items, junk, calendar, etc. A set of folders with a higher chance of containing fraud emails, such as Deleted Items and Junk, is treated as suspicious.

Sender Type: The sender type was categorized as ‘Internal’ and ‘External’ based on their email address.

Low Communication: A threshold of 4 emails, based on the table below, was used to define Low Communication. A user qualifies as a Low-Comm sender if the number of emails they sent is below this threshold. Mails sent from Low-Comm senders have been assigned a high probability of being fraud.

Contains Replies and Forwards: If an email contains forwards or replies, a low probability was assigned for it to be a fraud email.
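The five signals above can be expressed as simple boolean checks. The sketch below is illustrative only: the threshold of 4 and the suspicious folders come from the description, while the Email fields, the internal domain, the subject-prefix test, the treatment of external senders as a risk signal, and the min_hits rule are all assumptions.

```python
from dataclasses import dataclass

LOW_COMM_THRESHOLD = 4  # from the description above
SUSPICIOUS_FOLDERS = {"deleted_items", "junk"}  # from the description above


@dataclass
class Email:
    mailbox_owner: str
    folder: str
    sender: str
    sender_email_count: int  # total emails seen from this sender
    subject: str


def email_signals(email, persons_of_interest, internal_domain="enron.com"):
    """Return the five signals as booleans for one email."""
    subject = email.subject.lower()
    return {
        "person_of_interest": email.mailbox_owner in persons_of_interest,
        "suspicious_folder": email.folder.lower() in SUSPICIOUS_FOLDERS,
        "external_sender": not email.sender.lower().endswith("@" + internal_domain),
        "low_communication": email.sender_email_count < LOW_COMM_THRESHOLD,
        # Assumption: replies and forwards are detected from the subject prefix.
        "reply_or_forward": subject.startswith(("re:", "fw:", "fwd:")),
    }


def looks_suspicious(signals, min_hits=3):
    """Hypothetical rule: enough risk signals and not a reply/forward thread."""
    risk = ["person_of_interest", "suspicious_folder", "external_sender", "low_communication"]
    hits = sum(signals[name] for name in risk)
    return hits >= min_hits and not signals["reply_or_forward"]


# Made-up addresses and emails for illustration.
poi = {"jane.doe@enron.com"}
spam = Email("jane.doe@enron.com", "junk", "promo@example.org", 1, "You have won")
reply = Email("jane.doe@enron.com", "inbox", "bob@enron.com", 40, "RE: schedule")
print(looks_suspicious(email_signals(spam, poi)))
print(looks_suspicious(email_signals(reply, poi)))
```

Keeping the signals as named booleans, rather than collapsing them into one score straight away, makes it easier to inspect which heuristic flagged an email during manual review.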

Manual Inspection

To ensure high-quality labels, the mismatch examples from ML Annotation have been manually inspected for Enron dataset relabeling.

Dataset Breakdown

Fraud Non-Fraud
2327 445090

Citations

Enron Dataset Title: Enron Email Dataset URL: https://www.cs.cmu.edu/~enron/ Publisher: MIT, CMU Author: Leslie Kaelbling, William W. Cohen Year: 2015

Phishing Email Detection Dataset Title: Phishing Email Detection URL: https://www.kaggle.com/dsv/6090437 DOI: 10.34740/KAGGLE/DSV/6090437 Publisher: Kaggle Author: Subhadeep Chakraborty Year: 2023

CLAIR Fraud Email Collection Title: CLAIR collection of fraud email URL: http://aclweb.org/aclwiki Author: Radev, D. Year: 2008
Three common types of phishing scams - Get Cyber Safe 2021 - Catalogue -...
data.urbandatacentre.ca
beta.data.urbandatacentre.ca
Updated Oct 1, 2024
Cite
(2024). Three common types of phishing scams - Get Cyber Safe 2021 - Catalogue - Canadian Urban Data Catalogue (CUDC) [Dataset]. https://data.urbandatacentre.ca/dataset/gov-canada-46785d5c-21e4-4b4a-a499-ed3445fac760
Dataset updated
Oct 1, 2024
License
Open Government Licence - Canada 2.0 (https://open.canada.ca/en/open-government-licence-canada)
License information was derived automatically
Area covered
Canada
Description
Be aware of the common types of phishing scams that are out there.
Phishing_Link_Pattern_Dataset
huggingface.co
Updated Jul 26, 2025
Cite
Sunny thakur (2025). Phishing_Link_Pattern_Dataset [Dataset]. https://huggingface.co/datasets/darkknight25/Phishing_Link_Pattern_Dataset
Dataset updated
Jul 26, 2025
Authors
Sunny thakur
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Phishing Link Pattern Dataset

Overview

This dataset provides a comprehensive collection of URLs labeled as either legitimate or phishing, designed for machine learning, cybersecurity analysis, and penetration testing. It includes 1000 entries (IDs 1–1000) covering popular brands across multiple top-level domains (TLDs) such as .es, .de, and .co.uk. The dataset captures advanced features like domain entropy, subdomain count, and suspicious keywords to aid in phishing… See the full description on the dataset page: https://huggingface.co/datasets/darkknight25/Phishing_Link_Pattern_Dataset.
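Features such as domain entropy, subdomain count, and suspicious keywords can be derived from a URL with the standard library alone. The sketch below shows one possible way to compute them; the keyword list, the example URL, and the naive subdomain rule are assumptions and are not taken from the dataset.

```python
import math
from collections import Counter
from urllib.parse import urlparse

# Hypothetical keyword list; the dataset's own list is not given here.
SUSPICIOUS_KEYWORDS = ("login", "verify", "secure", "update", "account")


def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def url_features(url):
    """Compute three illustrative phishing features for one URL."""
    host = urlparse(url).hostname or ""
    labels = host.split(".") if host else []
    return {
        "domain_entropy": shannon_entropy(host),
        # Naive assumption: the last two labels are the registered domain.
        # This miscounts multi-part suffixes such as .co.uk.
        "subdomain_count": max(len(labels) - 2, 0),
        "suspicious_keywords": [k for k in SUSPICIOUS_KEYWORDS if k in url.lower()],
    }


# Made-up URL for illustration.
features = url_features("http://secure-login.paypa1.example.com/verify")
print(features)
```

A production feature extractor would normally resolve the registered domain against a public suffix list instead of assuming two labels.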
Data from: Password Reset Dataset
kaggle.com
Updated Oct 3, 2023
Cite
HariSellowpay (2023). Password Reset Dataset [Dataset]. https://www.kaggle.com/datasets/harisellowpay/password-reset-dataset
Dataset updated
Oct 3, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
HariSellowpay
Description
The dataset is designed to simulate password-related events, creating a synthetic representation of actions related to password management. It includes fields like timestamp, action, event type, location, IP address, password, hour, and time difference.

The dataset comprises 50,000 records representing a variety of password-related events.

A list of commonly used passwords is incorporated to mimic real-world scenarios.

Timestamps are spread throughout the current year.

Features like 'hour' and 'time_difference' are derived to provide additional insights into the temporal aspects of the events.

This synthetic dataset can be used for training and testing machine learning models related to cyber security, anomaly detection, or password management. It allows researchers and practitioners to experiment with data resembling real-world scenarios without compromising actual user information.
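The derived 'hour' and 'time_difference' features mentioned above could be computed along these lines. The events, the timestamp format, and the action values are made up, and the sketch assumes that time_difference means the number of seconds since the previous event, which the description does not state.

```python
from datetime import datetime

# Made-up events; only the timestamp field matters for this sketch.
events = [
    {"timestamp": "2023-03-01 09:15:00", "action": "reset_request"},
    {"timestamp": "2023-03-01 09:15:30", "action": "reset_success"},
    {"timestamp": "2023-03-01 23:40:00", "action": "reset_request"},
]

# Strings in this fixed-width format sort chronologically.
events.sort(key=lambda e: e["timestamp"])

previous = None
for event in events:
    ts = datetime.strptime(event["timestamp"], "%Y-%m-%d %H:%M:%S")
    event["hour"] = ts.hour
    # Assumption: time_difference is seconds since the previous event (0 for the first).
    if previous is None:
        event["time_difference"] = 0.0
    else:
        event["time_difference"] = (ts - previous).total_seconds()
    previous = ts

for event in events:
    print(event["hour"], event["time_difference"])
```

Very short gaps between password events, or events at unusual hours, are the kind of temporal pattern an anomaly detector could pick up from these two features.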
What are the most common forms of phishing? An overview - Get Cyber Safe...
open.canada.ca
ouvert.canada.ca
html
Updated Mar 8, 2023
Cite
Communications Security Establishment Canada (2023). What are the most common forms of phishing? An overview - Get Cyber Safe 2021 [Dataset]. https://open.canada.ca/data/info/65e7c86c-77a2-4283-b493-ea56f36ea36b
Available download formats: html
Dataset updated
Mar 8, 2023
Dataset provided by
Communications Security Establishment Canada (https://cyber.gc.ca/en/)
License
Open Government Licence - Canada 2.0 (https://open.canada.ca/en/open-government-licence-canada)
License information was derived automatically
Description
A quick overview of the most common types of phishing campaigns that cyber criminals use to steal your information.
Scam Survivors Sextortion Reports - Datasets - data.bris
data.bris.ac.uk
Updated Dec 19, 2023
+ more versions
Cite
(2023). Scam Survivors Sextortion Reports - Datasets - data.bris [Dataset]. https://data.bris.ac.uk/data/dataset/mmtun4gufpdb2tmmrcpos4shq
Dataset updated
Dec 19, 2023
Description
This dataset contains over 41,000 posts from the sextortion reporting board of scamsurvivors.com, as collected on the 14th of July, 2023. The data was collected and is shared with the approval of the Scam Survivors administrator. Of these posts, 23,705 were automatically identified as following a common structured report format, and the reported answers to specific questions were extracted into a tabular CSV format, which was then further processed to clean and standardise responses. The data does not contain identifiable or demographic victim information, as the reports are anonymous at source, but does include details of sextortion offenders' (purported) names and ages, as well as their online presence, meeting locations, conversation platforms, interaction dynamics, payment demands and some victim reflection on incidents. This dataset has been created as part of the REPHRAIN project (https://www.rephrain.ac.uk/).
creditcard Dataset
figshare.com
csv
Updated Jun 9, 2025
Cite
Mohammad Shanaa; Sherief Abdallah (2025). creditcard Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.29270873.v1
Available download formats: csv
Unique identifier
https://doi.org/10.6084/m9.figshare.29270873.v1
Dataset updated
Jun 9, 2025
Dataset provided by
figshare
Authors
Mohammad Shanaa; Sherief Abdallah
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Title: Credit Card Transactions Dataset for Fraud Detection (Used in: A Hybrid Anomaly Detection Framework Combining Supervised and Unsupervised Learning)

Description: This dataset, commonly known as creditcard.csv, contains anonymized credit card transactions made by European cardholders in September 2013. It includes 284,807 transactions, with 492 labeled as fraudulent. Due to confidentiality constraints, features have been transformed using PCA, except for 'Time' and 'Amount'.

This dataset was used in the research article titled "A Hybrid Anomaly Detection Framework Combining Supervised and Unsupervised Learning for Credit Card Fraud Detection". The study proposes an ensemble model integrating techniques such as Autoencoders, Isolation Forest, Local Outlier Factor, and supervised classifiers including XGBoost and Random Forest, aiming to improve the detection of rare fraudulent patterns while maintaining efficiency and scalability.

Key Features:

30 numerical input features (V1–V28, Time, Amount)

Class label indicating fraud (1) or normal (0)

Imbalanced class distribution typical in real-world fraud detection

Use Case: Ideal for benchmarking and evaluating anomaly detection and classification algorithms in highly imbalanced data scenarios.

Source: Originally published by the Machine Learning Group at Université Libre de Bruxelles. https://www.kaggle.com/mlg-ulb/creditcardfraud

License: This dataset is distributed for academic and research purposes only. Please cite the original source when using the dataset.
Data from: Set of obfuscated spam dataset by using LeetSpeak transformations...
data.niaid.nih.gov
portalcientifico.uvigo.gal
+1more
Updated Mar 22, 2022
Cite
José Ramón Méndez (2022). Set of obfuscated spam dataset by using LeetSpeak transformations [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6373652
Dataset updated
Mar 22, 2022
Dataset provided by
José Ramón Méndez
Enaitz Ezpeleta
Vitor Basto Fernandes
Urko Zurutuza
Xabier Vidriales
Iñaki Velez de Mendizabal
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
LeetSpeak and other text-hiding tricks are often used by spammers to distribute unsolicited content. To evaluate deobfuscation techniques and their impact on spam content classification, we preprocessed several popular public datasets to partially obfuscate the text. The datasets transformed are:

YouTube Spam Collection [2, 3] which is available on https://www.dt.fee.unicamp.br/~tiago/youtubespamcollection/.

a subset of YouTube Comments [4, 5] which is available on http://mlg.ucd.ie/yt/.

CSDMC2010 which is available on http://csmining.org/index.php/spam-email-datasets-.html.

TREC2007 which is available on https://plg.uwaterloo.ca/~gvcormac/treccorpus07/
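To illustrate what partial LeetSpeak obfuscation of a text might look like, here is a small sketch. The substitution table, the rate parameter, and the naive inverse are hypothetical; the transformations actually applied to these datasets may differ.

```python
import random

# Hypothetical substitution table; the dataset's actual transformations may differ.
LEET_MAP = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"}


def partially_obfuscate(text, rate=0.5, seed=0):
    """Replace each mappable character with its LeetSpeak form with probability `rate`."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        sub = LEET_MAP.get(ch.lower())
        out.append(sub if sub is not None and rng.random() < rate else ch)
    return "".join(out)


def deobfuscate(text):
    """Naive inverse: map digits back to letters (also rewrites genuine digits)."""
    reverse = {v: k for k, v in LEET_MAP.items()}
    return "".join(reverse.get(ch, ch) for ch in text)


message = "free prize"
print(partially_obfuscate(message, rate=1.0))
print(partially_obfuscate(message, rate=0.5))
print(deobfuscate(partially_obfuscate(message, rate=1.0)))
```

The naive inverse shows why deobfuscation is hard to evaluate: it cannot tell an obfuscated letter from a real digit, so a realistic approach needs context.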
Ai Based Fraud Detection Tools Market Report | Global Forecast From 2025 To...
dataintelo.com
csv, pdf, pptx
Updated Oct 16, 2024
Cite
Dataintelo (2024). Ai Based Fraud Detection Tools Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/ai-based-fraud-detection-tools-market
Available download formats: pptx, pdf, csv
Dataset updated
Oct 16, 2024
Dataset authored and provided by
Dataintelo
License
https://dataintelo.com/privacy-and-policy
Time period covered
2024 - 2032
Area covered
Global
Description
AI-Based Fraud Detection Tools Market Outlook

The global AI-based fraud detection tools market size was valued at approximately USD 6.5 billion in 2023 and is projected to reach USD 22.8 billion by 2032, growing at a robust CAGR of 15.1% during the forecast period. The significant growth factors driving this market include the increasing sophistication of fraudulent activities, the growing adoption of AI and machine learning technologies in various sectors, and the heightened demand for real-time fraud detection solutions.
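The growth rate implied by the two market-size figures can be checked with the standard compound annual growth rate formula. The sketch below assumes the nine-year span from 2023 to 2032; the function name is illustrative.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values `years` apart."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures from the paragraph above, in USD billions.
implied = cagr(6.5, 22.8, 2032 - 2023)
print(f"implied CAGR: {implied:.1%}")
```

The result comes out at roughly 15%, in line with the stated figure; a small difference could arise if the report measures the forecast period from a different base year.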

One of the primary growth factors for the AI-based fraud detection tools market is the rising complexity of fraudulent activities. In today's digital age, fraudsters are employing increasingly sophisticated techniques to breach security systems, making traditional detection methods inadequate. AI-based solutions, which leverage advanced algorithms and machine learning, are capable of analyzing large volumes of data to identify patterns and anomalies indicative of fraud. This capability is crucial for organizations seeking to protect their assets and maintain customer trust in an environment where cyber threats are continually evolving.

Another significant growth driver is the widespread adoption of AI and machine learning technologies across various industries. Businesses are recognizing the potential of these technologies to enhance their fraud detection capabilities, leading to increased investments in AI-driven solutions. The banking and financial services sector, in particular, has been at the forefront of adopting AI-based fraud detection tools to combat financial crimes such as identity theft, credit card fraud, and money laundering. Furthermore, the retail and e-commerce sectors are increasingly implementing these tools to safeguard against fraudulent transactions and account takeovers.

The growing demand for real-time fraud detection solutions is also propelling the market forward. Traditional fraud detection systems often rely on rule-based approaches that can be slow and reactive, allowing fraudulent activities to go undetected until significant damage has been done. In contrast, AI-based solutions can process and analyze data in real-time, enabling organizations to identify and respond to threats rapidly. This real-time capability is essential for minimizing losses and mitigating risks, particularly in sectors where the speed of transactions is critical, such as online retail and financial services.

Regionally, North America currently dominates the AI-based fraud detection tools market, owing to the high adoption rate of advanced technologies and the presence of major industry players. However, other regions like Asia Pacific and Europe are also experiencing significant growth. Asia Pacific, in particular, is expected to exhibit the highest CAGR during the forecast period, driven by the increasing digitization of economies, rising internet penetration, and the growing awareness of cybersecurity threats. Europe is also witnessing substantial growth due to stringent regulatory requirements and the increasing focus on data privacy and security.

Component Analysis

The AI-based fraud detection tools market can be segmented by component into software, hardware, and services. The software segment is expected to hold the largest market share during the forecast period. This dominance can be attributed to the continuous advancements in AI algorithms and machine learning models, which enhance the accuracy and efficiency of fraud detection systems. Furthermore, the software solutions are designed to be scalable and easily integrated into existing systems, making them an attractive option for organizations of all sizes.

Hardware components, though not as dominant as software, play a crucial role in the deployment of AI-based fraud detection systems. High-performance computing hardware, including GPUs and specialized AI processors, are essential for handling the large datasets and complex computations required for real-time fraud detection. As the demand for more powerful and efficient hardware grows, this segment is expected to see steady growth, particularly in large enterprises that require robust infrastructure to support their AI initiatives.

The services segment, encompassing consulting, integration, and maintenance services, is also poised for significant growth. Organizations often lack the in-house expertise required to develop and implement AI-based fraud detection systems, leading to an increased reliance on external service providers. These services help organizations to customize and opti
Spam Email Detection Model
kaggle.com
Updated Aug 10, 2024
Cite
Usamakhanswati (2024). Spam Email Detection Model [Dataset]. http://doi.org/10.34740/kaggle/ds/5524456
Unique identifier
https://doi.org/10.34740/kaggle/ds/5524456
Dataset updated
Aug 10, 2024
Dataset provided by
Kaggle
Authors
Usamakhanswati
Description
In a bustling digital landscape where businesses and individuals alike rely on email communication, a hidden threat lurks—spam emails. These unsolicited and often malicious messages clog inboxes, steal valuable time, and even endanger sensitive data. The need for a powerful shield against this growing menace has never been more urgent.

Enter the Spam Email Detection Model—a cutting-edge creation designed to bring order to the chaos of modern email communication. Imagine a business owner named Sarah, whose company relies heavily on email for client communication, order processing, and customer support. Every day, her inbox is flooded with hundreds of emails, many of which are nothing but spam. These emails not only waste her time but also pose a risk to her company's security. The Spam Email Detection Model is a state-of-the-art solution designed to combat the ever-growing threat of spam emails with unparalleled accuracy and efficiency. Leveraging advanced machine learning algorithms, this model achieves a remarkable 99.9% accuracy rate, far surpassing the industry standard of 50%. It intelligently distinguishes between legitimate emails and spam, learning and adapting to new patterns to ensure ongoing protection.

Designed for seamless integration, the model can be easily implemented into any existing email system, providing businesses with a robust defense against unsolicited messages and potential security threats. Its user-friendly interface allows for effortless control and customization, making it a versatile tool for businesses of all sizes.

By dramatically reducing the time wasted on managing spam and enhancing email security, the Spam Email Detection Model empowers businesses to focus on what truly matters, offering peace of mind in a world where digital communication is vital.
Data_Sheet_1_Lumen: A machine learning framework to expose influence cues in...
figshare.com
pdf
Updated Jun 16, 2023
Cite
Hanyu Shi; Mirela Silva; Luiz Giovanini; Daniel Capecci; Lauren Czech; Juliana Fernandes; Daniela Oliveira (2023). Data_Sheet_1_Lumen: A machine learning framework to expose influence cues in texts.PDF [Dataset]. http://doi.org/10.3389/fcomp.2022.929515.s001
Available download formats: pdf
Unique identifier
https://doi.org/10.3389/fcomp.2022.929515.s001
Dataset updated
Jun 16, 2023
Dataset provided by
Frontiers
Authors
Hanyu Shi; Mirela Silva; Luiz Giovanini; Daniel Capecci; Lauren Czech; Juliana Fernandes; Daniela Oliveira
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Phishing and disinformation are popular social engineering attacks with attackers invariably applying influence cues in texts to make them more appealing to users. We introduce Lumen, a learning-based framework that exposes influence cues in text: (i) persuasion, (ii) framing, (iii) emotion, (iv) objectivity/subjectivity, (v) guilt/blame, and (vi) use of emphasis. Lumen was trained with a newly developed dataset of 3K texts comprised of disinformation, phishing, hyperpartisan news, and mainstream news. Evaluation of Lumen in comparison to other learning models showed that Lumen and LSTM presented the best F1-micro score, but Lumen yielded better interpretability. Our results highlight the promise of ML to expose influence cues in text, toward the goal of application in automatic labeling tools to improve the accuracy of human-based detection and reduce the likelihood of users falling for deceptive online content.
Credit Card Fraud Detection Dataset
kaggle.com
Updated May 15, 2025
Cite
Ghanshyam Saini (2025). Credit Card Fraud Detection Dataset [Dataset]. https://www.kaggle.com/datasets/ghnshymsaini/credit-card-fraud-detection-dataset
Dataset updated
May 15, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Ghanshyam Saini
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Credit Card Fraud Detection Dataset (European Cardholders, September 2013)

As a data contributor, I'm sharing this crucial dataset focused on the detection of fraudulent credit card transactions. Recognizing these illicit activities is paramount for protecting customers and the integrity of financial systems.

About the Dataset:

This dataset encompasses credit card transactions made by European cardholders during a two-day period in September 2013. It presents a real-world scenario with a significant class imbalance, where fraudulent transactions are considerably less frequent than legitimate ones. Out of a total of 284,807 transactions, only 492 are instances of fraud, representing a mere 0.172% of the entire dataset.

Content of the Data:

Due to confidentiality concerns, the majority of the input features in this dataset have undergone a Principal Component Analysis (PCA) transformation. This means the original meaning and context of features V1, V2, ..., V28 are not directly provided. However, these principal components capture the variance in the underlying transaction data.

The only features that have not been transformed by PCA are:

Time: Numerical. Represents the number of seconds elapsed between each transaction and the very first transaction recorded in the dataset.

Amount: Numerical. The transaction amount in Euros (€). This feature could be valuable for cost-sensitive learning approaches.

The target variable for this classification task is:

Class: Integer. Takes the value 1 in the case of a fraudulent transaction and 0 otherwise.

Important Note on Evaluation:

Given the substantial class imbalance (far more legitimate transactions than fraudulent ones), traditional accuracy metrics based on the confusion matrix can be misleading. It is strongly recommended to evaluate models using the Area Under the Precision-Recall Curve (AUPRC), as this metric is more sensitive to the performance on the minority class (fraudulent transactions).
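A toy example makes the point about accuracy concrete. The sketch below uses made-up labels and scores and a plain-Python average precision, which is one common step-wise estimate of the area under the precision-recall curve. It ignores tied scores and is not meant to replace a library implementation.

```python
def average_precision(y_true, scores):
    """Average precision: a step-wise estimate of the area under the PR curve."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    positives = sum(y_true)
    if positives == 0:
        raise ValueError("need at least one positive example")
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / positives


# Made-up labels: 2 frauds among 10 transactions.
y_true = [0, 0, 0, 0, 1, 0, 0, 0, 1, 0]

# A model that predicts "legitimate" for everything still looks accurate.
all_legit = [0] * len(y_true)
accuracy = sum(p == t for p, t in zip(all_legit, y_true)) / len(y_true)

# A model that ranks both frauds above every legitimate transaction.
good_scores = [0.1, 0.2, 0.1, 0.3, 0.9, 0.2, 0.1, 0.4, 0.8, 0.3]
ap = average_precision(y_true, good_scores)

print(f"accuracy of all-legitimate model: {accuracy:.2f}")
print(f"average precision of ranking model: {ap:.2f}")
```

On the real data, where fraud is far rarer than in this toy sample, the all-legitimate model would score even higher on accuracy while detecting nothing, which is why a ranking-based metric is recommended.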

How to Use This Dataset:

Download the dataset file (likely in CSV format).

Load the data using libraries like Pandas.

Understand the class imbalance: Be aware that fraudulent transactions are rare.

Explore the features: Analyze the distributions of 'Time', 'Amount', and the PCA-transformed features (V1-V28).

Address the class imbalance: Consider using techniques like oversampling the minority class, undersampling the majority class, or using specialized algorithms designed for imbalanced datasets.

Build and train binary classification models to predict the 'Class' variable.

Evaluate your models using AUPRC to get a meaningful assessment of performance in detecting fraud.
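Of the imbalance-handling options listed in the steps above, random undersampling of the majority class is the simplest to sketch. The rows below are made up, and the ratio and seed parameters are illustrative; only the 'Class' and 'Amount' column names come from the description.

```python
import random


def undersample(rows, label_key="Class", ratio=1.0, seed=0):
    """Keep all minority rows (label 1) and a random sample of majority rows (label 0)."""
    rng = random.Random(seed)
    fraud = [r for r in rows if r[label_key] == 1]
    legit = [r for r in rows if r[label_key] == 0]
    keep = min(len(legit), int(len(fraud) * ratio))
    return fraud + rng.sample(legit, keep)


# Made-up rows mimicking the imbalance: 5 frauds among 1,000 transactions.
rows = [{"Amount": float(i), "Class": 1 if i % 200 == 0 else 0} for i in range(1000)]

# Keep two legitimate rows for every fraud row.
balanced = undersample(rows, ratio=2.0)
print(len(balanced), sum(r["Class"] for r in balanced))
```

Resampling is normally applied to the training split only, so that the evaluation data keeps the original class distribution.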

Acknowledgements and Citation:

This dataset has been collected and analyzed through a research collaboration between Worldline and the Machine Learning Group (MLG) of ULB (Université Libre de Bruxelles).

When using this dataset in your research or projects, please cite the following works as appropriate:

Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015.

Dal Pozzolo, Andrea; Caelen, Olivier; Le Borgne, Yann-Ael; Waterschoot, Serge; Bontempi, Gianluca. Learned lessons in credit card fraud detection from a practitioner perspective, Expert systems with applications, 41, 10, 4915-4928, 2014, Pergamon.

Dal Pozzolo, Andrea; Boracchi, Giacomo; Caelen, Olivier; Alippi, Cesare; Bontempi, Gianluca. Credit card fraud detection: a realistic modeling and a novel learning strategy, IEEE transactions on neural networks and learning systems, 29, 8, 3784-3797, 2018, IEEE.

Andrea Dal Pozzolo. Adaptive Machine learning for credit card fraud detection. ULB MLG PhD thesis (supervised by G. Bontempi).

Fabrizio Carcillo, Andrea Dal Pozzolo, Yann-Aël Le Borgne, Olivier Caelen, Yannis Mazzer, Gianluca Bontempi. Scarff: a scalable framework for streaming credit card fraud detection with Spark, Information fusion, 41, 182-194, 2018, Elsevier.

Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen, Gianluca Bontempi. Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization, International Journal of Data Science and Analytics, 5, 4, 285-300, 2018, Springer International Publishing.

Bertrand Lebichot, Yann-Aël Le Borgne, Liyun He, Frederic Oblé, Gianluca Bontempi. Deep-Learning Domain Adaptation Techniques for Credit Cards Fraud Detection, INNSBDDL 2019: Recent Advances in Big Data and Deep Learning, pp 78-88, 2019.

Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen, Frederic Oblé, Gianluca Bontempi *Combining Unsupervised and Supervised...
Most common scams in Singapore 2023
statista.com
Updated Jun 27, 2025
Cite
Statista (2025). Most common scams in Singapore 2023 [Dataset]. https://www.statista.com/statistics/981340/leading-types-of-scams-singapore/
Dataset updated
Jun 27, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
2023
Area covered
Singapore
Description
In 2023, job scams were the most common type of scam in Singapore, with around ***** cases reported. E-commerce scams also represented a prevalent form of fraud in the country, with over ***** cases reported.

Phishing threat in Singapore

In Singapore, around *********** different phishing URLs with a .SG domain were detected in 2022. The highest number of phishing URLs was recorded the previous year, with around ***********. Phishing attacks can take many forms, such as corporate e-mail compromise (CEC), mass phishing, or smishing. These phishing e-mails represent a crucial risk for businesses. They can also lead to ransomware infections, which have also increased in recent years.

Data breaches

Companies and governments are increasingly relying on technology to collect, analyze, and store personal data. This can lead to potential risks when such data is affected by cyber incidents. In Singapore, the number of exposed data points per thousand people reached ** in 2022. Over the same period, around ************ data sets were reported as leaked in the country.
Data in Figures
plos.figshare.com
figshare.com
zip
Updated Jul 17, 2025
Cite
Yicheng Long (2025). Data in Figures [Dataset]. http://doi.org/10.1371/journal.pone.0327476.s001
Available download formats: zip
Unique identifier
https://doi.org/10.1371/journal.pone.0327476.s001
Dataset updated
Jul 17, 2025
Dataset provided by
PLOS ONE
Authors
Yicheng Long
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Amid substantial capital influx and the rapid evolution of online user groups, the increasing complexity of user behavior poses significant challenges to cybersecurity, particularly in the domain of vulnerability prediction. This study aims to enhance the accuracy and practical applicability of cyberspace vulnerability prediction. By incorporating the dynamics of user behavioral changes and the logic of platform scaling driven by investment, two representative cybersecurity datasets are selected for analysis: the Canadian Institute for Cybersecurity Intrusion Detection System 2017 and the Network-Based Intrusion Detection Evaluation Dataset 2015. A standardized data preprocessing pipeline is constructed, including redundancy elimination, feature selection, and sample balancing, to ensure data representativeness and compatibility. To address the limited adaptability of traditional support vector machine (SVM) models in identifying nonlinear attacks, this study introduces a distribution-driven, dynamically adaptive kernel optimization approach. This method adjusts kernel parameters or switches kernel functions in real time according to the statistical characteristics of input data, thereby improving the model’s generalization capability and responsiveness in complex attack scenarios. Performance evaluations are conducted on both datasets using cross-validation. The results show that, compared to traditional models, the improved SVM achieves an 11.2% increase in prediction accuracy. Furthermore, the model demonstrates a 22.2% improvement in computational efficiency, measured as the ratio of prediction count to processing time. It also exhibits lower false positive rates and greater stability in detecting common cyberattacks such as distributed denial of service, phishing, and malware. In addition, this study analyzes user behavioral variations under different levels of attack pressure based on network access activity. 
Findings indicate that during periods of high platform load, attack frequency is positively correlated with users’ defensive behavior, confirming a potential causal sequence of “capital influx—user expansion—increased attack exposure.” This study offers a practical modeling framework and empirical foundation for improving predictive performance and enhancing users’ sense of cybersecurity.
Medical Claims Fraud Scenarios
gomask.ai
csv
Updated Jul 12, 2025
Cite
GoMask.ai (2025). Medical Claims Fraud Scenarios [Dataset]. https://gomask.ai/marketplace/datasets/medical-claims-fraud-scenarios
Available download formats: csv (Unknown)
Dataset updated
Jul 12, 2025
Dataset provided by
GoMask.ai
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
notes, claim_id, claim_date, fraud_flag, patient_id, fraud_score, patient_age, provider_id, review_date, reviewer_id, and 10 more
Description
This dataset contains simulated health insurance claims with detailed patient, provider, and service information, including flags and scores for a variety of fraud scenarios. It is designed to support the development and evaluation of fraud detection algorithms, audit workflows, and compliance monitoring in healthcare insurance. The dataset enables analysis of both common and rare fraudulent patterns for improved anomaly detection.
‘Fraud detection bank dataset 20K records binary ’ analyzed by Analyst-2
analyst-2.ai
Updated Jan 28, 2022
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Fraud detection bank dataset 20K records binary ’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-fraud-detection-bank-dataset-20k-records-binary-6287/e0c752fd/?iid=019-351&v=presentation
Dataset updated
Jan 28, 2022
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘Fraud detection bank dataset 20K records binary ’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/volodymyrgavrysh/fraud-detection-bank-dataset-20k-records-binary on 28 January 2022.

--- Dataset description provided by original source is as follows ---

Context

Banks are often exposed to fraud transactions and constantly improve systems to track them.

Content

Bank dataset that contains 20k+ transactions with 112 features (numerical)

--- Original source retains full ownership of the source dataset ---

Fraud	Non-Fraud
2327	445090
