Overview:
This dataset captures user interactions with potentially malicious emails, simulating scenarios relevant to phishing detection and human-centric security analysis. Each row represents a unique email event, enriched with behavioral, technical, and contextual metadata.
Phishing Click Prediction
Predict if a user will click a link based on hover time, device type, and email domain.
User Risk Profiling
Build behavior models: e.g., do mobile users report threats less often?
Language & Localization Patterns
Evaluate phishing success rates by language and region.
Realistic Red Teaming Simulations
Use as a training or benchmarking set for phishing email simulations.
hover_time_ms < 1000
are 60% more likely to click malicious links.import pandas as pd
df = pd.read_csv("phishing_email_behavior.csv")
clicked_ratio = df.groupby("device_type")["clicked_link"].value_counts(normalize=True).unstack()
print(clicked_ratio)
Open Government Licence - Canada 2.0https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
Phishing is an attack where a scammer calls you, texts or emails you, or uses social media to trick you into clicking a malicious link, downloading malware, or sharing sensitive information. Phishing attempts are often generic mass messages, but the message appears to be legitimate and from a trusted source (e.g. from a bank, courier company).
Be aware of the common types of phishing scams that are out there.
Open Government Licence - Canada 2.0https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
A quick overview of the most common types of phishing campaigns that cyber criminals use to steal your information.
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. The data has been made public and presents a diverse set of email information ranging from internal, marketing emails to spam and fraud attempts.
In the early 2000s, Leslie Kaelbling at MIT purchased the dataset and noted that, though the dataset contained scam emails, it also had several integrity problems. The dataset was updated later, but it becomes key to ensure privacy in the data while it is used to train a deep neural network model.
Though the Enron Email Dataset contains over 500K emails, one of the problems with the dataset is the availability of labeled frauds in the dataset. Label annotation is done to detect an umbrella of fraud emails accurately. Since, fraud emails fall into several types such as Phishing, Financial, Romance, Subscription, and Nigerian Prince scams, there have to be multiple heuristics used to label all types of fraudulent emails effectively.
To tackle this problem, heuristics have been used to label the Enron data corpus using email signals, and automated labeling has been performed using simple ML models on other smaller email datasets available online. These fraud annotation techniques are discussed in detail below.
To perform fraud annotation on the Enron dataset as well as provide more fraud examples for modeling, two more fraud data sources have been used, Phishing Email Dataset: https://www.kaggle.com/dsv/6090437 Social Engineering Dataset: http://aclweb.org/aclwiki
To label the Enron email dataset two signals are used to filter suspicious emails and label them into fraud and non-fraud classes. Automated ML labeling Email Signals
The following heuristics are used to annotate labels for Enron email data using the other two data sources,
Phishing Model Annotation: A high-precision SVM model trained on the Phishing mails dataset, which is used to annotate the Phishing Label on the Enron Dataset.
Social Engineering Model Annotation: A high-precision SVM model trained on the Social Engineering mails dataset, which is used to annotate the Social Engineering Label on the Enron Dataset.
The two ML Annotator models use Term Frequency Inverse Document Frequency (TF-IDF) to embed the input text and make use of SVM models with Gaussian Kernel.
If either of the models predicted that an email was a fraud, the mail metadata was checked for several email signals. If these heuristics meet the requirements of a high-probability fraud email, we label it as a fraud email.
Email Signal-based heuristics are used to filter and target suspicious emails for fraud labeling specifically. The signals used were,
Person Of Interest: There is a publicly available list of email addresses of employees who were liable for the massive data leak at Enron. These user mailboxes have a higher chance of containing quality fraud emails.
Suspicious Folders: The Enron data is dumped into several folders for every employee. Folders consist of inbox, deleted_items, junk, calendar, etc. A set of folders with a higher chance of containing fraud emails, such as Deleted Items and Junk.
Sender Type: The sender type was categorized as ‘Internal’ and ‘External’ based on their email address.
Low Communication: A threshold of 4 emails based on the table below was used to define Low Communication. A user qualifies as a Low-Comm sender if their emails are below this threshold. Mails sent from low-comm senders have been assigned with a high probability of being a fraud.
Contains Replies and Forwards: If an email contains forwards or replies, a low probability was assigned for it to be a fraud email.
To ensure high-quality labels, the mismatch examples from ML Annotation have been manually inspected for Enron dataset relabeling.
Fraud | Non-Fraud |
---|---|
2327 | 445090 |
Enron Dataset Title: Enron Email Dataset URL: https://www.cs.cmu.edu/~enron/ Publisher: MIT, CMU Author: Leslie Kaelbling, William W. Cohen Year: 2015
Phishing Email Detection Dataset Title: Phishing Email Detection URL: https://www.kaggle.com/dsv/6090437 DOI: 10.34740/KAGGLE/DSV/6090437 Publisher: Kaggle Author: Subhadeep Chakraborty Year: 2023
CLAIR Fraud Email Collection Title: CLAIR collection of fraud email URL: http://aclweb.org/aclwiki Author: Radev, D. Year: 2008
Open Government Licence - Canada 2.0https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
Be aware of the common types of phishing scams that are out there.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Phishing Link Pattern Dataset
Overview
This dataset provides a comprehensive collection of URLs labeled as either legitimate or phishing, designed for machine learning, cybersecurity analysis, and penetration testing. It includes 1000 entries (IDs 1–1000) covering popular brands across multiple top-level domains (TLDs) such as .es, .de, and .co.uk. The dataset captures advanced features like domain entropy, subdomain count, and suspicious keywords to aid in phishing… See the full description on the dataset page: https://huggingface.co/datasets/darkknight25/Phishing_Link_Pattern_Dataset.
The dataset is designed to simulate password-related events, creating a synthetic representation of actions related to password management. It includes fields like timestamp, action, event type, location, IP address, password, hour, and time difference.
This synthetic dataset can be used for training and testing machine learning models related to cyber security, anomaly detection, or password management. It allows researchers and practitioners to experiment with data resembling real-world scenarios without compromising actual user information.
Open Government Licence - Canada 2.0https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
A quick overview of the most common types of phishing campaigns that cyber criminals use to steal your information.
This dataset contains over 41,000 posts from the sextortion reporting board of scamsurvivors.com, as collected on the 14th of July, 2023. The data was collected and is shared with the approval of the Scam Survivors administrator. Of these posts, 23,705 were automatically identified as following a common structured report format, and the reported answers to specific questions were extracted into a tabular CSV format, which was then further processed to clean and standardise responses. The data does not contain identifiable or demographic victim information, as the reports are anonymous at source, but does include details of sextortion offenders' (purported) names and ages, as well as their online presence, meeting locations, conversation platforms, interaction dynamics, payment demands and some victim reflection on incidents. This dataset has been created as part of the REPHRAIN project (https://www.rephrain.ac.uk/).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Title: Credit Card Transactions Dataset for Fraud Detection (Used in: A Hybrid Anomaly Detection Framework Combining Supervised and Unsupervised Learning)Description:This dataset, commonly known as creditcard.csv, contains anonymized credit card transactions made by European cardholders in September 2013. It includes 284,807 transactions, with 492 labeled as fraudulent. Due to confidentiality constraints, features have been transformed using PCA, except for 'Time' and 'Amount'.This dataset was used in the research article titled "A Hybrid Anomaly Detection Framework Combining Supervised and Unsupervised Learning for Credit Card Fraud Detection". The study proposes an ensemble model integrating techniques such as Autoencoders, Isolation Forest, Local Outlier Factor, and supervised classifiers including XGBoost and Random Forest, aiming to improve the detection of rare fraudulent patterns while maintaining efficiency and scalability.Key Features:30 numerical input features (V1–V28, Time, Amount)Class label indicating fraud (1) or normal (0)Imbalanced class distribution typical in real-world fraud detectionUse Case:Ideal for benchmarking and evaluating anomaly detection and classification algorithms in highly imbalanced data scenarios.Source:Originally published by the Machine Learning Group at Université Libre de Bruxelles.https://www.kaggle.com/mlg-ulb/creditcardfraudLicense:This dataset is distributed for academic and research purposes only. Please cite the original source when using the dataset.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The usage of LeetSpeak and other text hiding tricks is often used by spammers in the distribution of unsolicited contents. To evaluate deobfuscation techniques and their impact on spam content classification, we preprocessed several popular public datasets to partially obfuscate the text. The datasets transformed are:
YouTube Spam Collection [2, 3] which is available on https://www.dt.fee.unicamp.br/~tiago/youtubespamcollection/.
a subset of YouTube Comments [4, 5] which is available on http://mlg.ucd.ie/yt/.
CSDMC2010 which is available on http://csmining.org/index.php/spam-email-datasets-.html.
TREC2007 which is available on https://plg.uwaterloo.ca/~gvcormac/treccorpus07/
https://dataintelo.com/privacy-and-policyhttps://dataintelo.com/privacy-and-policy
The global AI-based fraud detection tools market size was valued at approximately USD 6.5 billion in 2023 and is projected to reach USD 22.8 billion by 2032, growing at a robust CAGR of 15.1% during the forecast period. The significant growth factors driving this market include the increasing sophistication of fraudulent activities, the growing adoption of AI and machine learning technologies in various sectors, and the heightened demand for real-time fraud detection solutions.
One of the primary growth factors for the AI-based fraud detection tools market is the rising complexity of fraudulent activities. In today's digital age, fraudsters are employing increasingly sophisticated techniques to breach security systems, making traditional detection methods inadequate. AI-based solutions, which leverage advanced algorithms and machine learning, are capable of analyzing large volumes of data to identify patterns and anomalies indicative of fraud. This capability is crucial for organizations seeking to protect their assets and maintain customer trust in an environment where cyber threats are continually evolving.
Another significant growth driver is the widespread adoption of AI and machine learning technologies across various industries. Businesses are recognizing the potential of these technologies to enhance their fraud detection capabilities, leading to increased investments in AI-driven solutions. The banking and financial services sector, in particular, has been at the forefront of adopting AI-based fraud detection tools to combat financial crimes such as identity theft, credit card fraud, and money laundering. Furthermore, the retail and e-commerce sectors are increasingly implementing these tools to safeguard against fraudulent transactions and account takeovers.
The growing demand for real-time fraud detection solutions is also propelling the market forward. Traditional fraud detection systems often rely on rule-based approaches that can be slow and reactive, allowing fraudulent activities to go undetected until significant damage has been done. In contrast, AI-based solutions can process and analyze data in real-time, enabling organizations to identify and respond to threats rapidly. This real-time capability is essential for minimizing losses and mitigating risks, particularly in sectors where the speed of transactions is critical, such as online retail and financial services.
Regionally, North America currently dominates the AI-based fraud detection tools market, owing to the high adoption rate of advanced technologies and the presence of major industry players. However, other regions like Asia Pacific and Europe are also experiencing significant growth. Asia Pacific, in particular, is expected to exhibit the highest CAGR during the forecast period, driven by the increasing digitization of economies, rising internet penetration, and the growing awareness of cybersecurity threats. Europe is also witnessing substantial growth due to stringent regulatory requirements and the increasing focus on data privacy and security.
The AI-based fraud detection tools market can be segmented by component into software, hardware, and services. The software segment is expected to hold the largest market share during the forecast period. This dominance can be attributed to the continuous advancements in AI algorithms and machine learning models, which enhance the accuracy and efficiency of fraud detection systems. Furthermore, the software solutions are designed to be scalable and easily integrated into existing systems, making them an attractive option for organizations of all sizes.
Hardware components, though not as dominant as software, play a crucial role in the deployment of AI-based fraud detection systems. High-performance computing hardware, including GPUs and specialized AI processors, are essential for handling the large datasets and complex computations required for real-time fraud detection. As the demand for more powerful and efficient hardware grows, this segment is expected to see steady growth, particularly in large enterprises that require robust infrastructure to support their AI initiatives.
The services segment, encompassing consulting, integration, and maintenance services, is also poised for significant growth. Organizations often lack the in-house expertise required to develop and implement AI-based fraud detection systems, leading to an increased reliance on external service providers. These services help organizations to customize and opti
In a bustling digital landscape where businesses and individuals alike rely on email communication, a hidden threat lurks—spam emails. These unsolicited and often malicious messages clog inboxes, steal valuable time, and even endanger sensitive data. The need for a powerful shield against this growing menace has never been more urgent.
Enter the Spam Email Detection Model—a cutting-edge creation designed to bring order to the chaos of modern email communication. Imagine a business owner named Sarah, whose company relies heavily on email for client communication, order processing, and customer support. Every day, her inbox is flooded with hundreds of emails, many of which are nothing but spam. These emails not only waste her time but also pose a risk to her company's security. The Spam Email Detection Model is a state-of-the-art solution designed to combat the ever-growing threat of spam emails with unparalleled accuracy and efficiency. Leveraging advanced machine learning algorithms, this model achieves a remarkable 99.9% accuracy rate, far surpassing the industry standard of 50%. It intelligently distinguishes between legitimate emails and spam, learning and adapting to new patterns to ensure ongoing protection.
Designed for seamless integration, the model can be easily implemented into any existing email system, providing businesses with a robust defense against unsolicited messages and potential security threats. Its user-friendly interface allows for effortless control and customization, making it a versatile tool for businesses of all sizes.
By dramatically reducing the time wasted on managing spam and enhancing email security, the Spam Email Detection Model empowers businesses to focus on what truly matters, offering peace of mind in a world where digital communication is vital.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Phishing and disinformation are popular social engineering attacks with attackers invariably applying influence cues in texts to make them more appealing to users. We introduce Lumen, a learning-based framework that exposes influence cues in text: (i) persuasion, (ii) framing, (iii) emotion, (iv) objectivity/subjectivity, (v) guilt/blame, and (vi) use of emphasis. Lumen was trained with a newly developed dataset of 3K texts comprised of disinformation, phishing, hyperpartisan news, and mainstream news. Evaluation of Lumen in comparison to other learning models showed that Lumen and LSTM presented the best F1-micro score, but Lumen yielded better interpretability. Our results highlight the promise of ML to expose influence cues in text, toward the goal of application in automatic labeling tools to improve the accuracy of human-based detection and reduce the likelihood of users falling for deceptive online content.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
As a data contributor, I'm sharing this crucial dataset focused on the detection of fraudulent credit card transactions. Recognizing these illicit activities is paramount for protecting customers and the integrity of financial systems.
About the Dataset:
This dataset encompasses credit card transactions made by European cardholders during a two-day period in September 2013. It presents a real-world scenario with a significant class imbalance, where fraudulent transactions are considerably less frequent than legitimate ones. Out of a total of 284,807 transactions, only 492 are instances of fraud, representing a mere 0.172% of the entire dataset.
Content of the Data:
Due to confidentiality concerns, the majority of the input features in this dataset have undergone a Principal Component Analysis (PCA) transformation. This means the original meaning and context of features V1, V2, ..., V28 are not directly provided. However, these principal components capture the variance in the underlying transaction data.
The only features that have not been transformed by PCA are:
The target variable for this classification task is:
Important Note on Evaluation:
Given the substantial class imbalance (far more legitimate transactions than fraudulent ones), traditional accuracy metrics based on the confusion matrix can be misleading. It is strongly recommended to evaluate models using the Area Under the Precision-Recall Curve (AUPRC), as this metric is more sensitive to the performance on the minority class (fraudulent transactions).
How to Use This Dataset:
Acknowledgements and Citation:
This dataset has been collected and analyzed through a research collaboration between Worldline and the Machine Learning Group (MLG) of ULB (Université Libre de Bruxelles).
When using this dataset in your research or projects, please cite the following works as appropriate:
In 2023, job scams were the most common type of scam in Singapore, with around ***** cases reported. E-commerce scams also represented a prevalent form of fraud in the country, with over ***** cases reported.
Phishing threat in Singapore In Singapore, around *********** different phishing URLs with a .SG domain were detected in 2022. The highest number of phishing URLs was recorded the previous year, with around ***********. Phishing attacks can take many forms, such as corporate e-mail compromise (CEC), mass phishing, or smishing. These phishing e-mails represent a crucial risk for businesses. They can also lead to ransomware infections, which have also increased in recent years.
Data breaches Companies and governments are increasingly relying on technology to collect, analyze, and store personal data. This can lead to potential risks when such data is affected by cyber incidents. In Singapore, the number of exposed data points per thousand people reached ** in 2022. Over the same period, around ************ data sets were reported as leaked in the country.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Amid substantial capital influx and the rapid evolution of online user groups, the increasing complexity of user behavior poses significant challenges to cybersecurity, particularly in the domain of vulnerability prediction. This study aims to enhance the accuracy and practical applicability of cyberspace vulnerability prediction. By incorporating the dynamics of user behavioral changes and the logic of platform scaling driven by investment, two representative cybersecurity datasets are selected for analysis: the Canadian Institute for Cybersecurity Intrusion Detection System 2017 and the Network-Based Intrusion Detection Evaluation Dataset 2015. A standardized data preprocessing pipeline is constructed, including redundancy elimination, feature selection, and sample balancing, to ensure data representativeness and compatibility. To address the limited adaptability of traditional support vector machine (SVM) models in identifying nonlinear attacks, this study introduces a distribution-driven, dynamically adaptive kernel optimization approach. This method adjusts kernel parameters or switches kernel functions in real time according to the statistical characteristics of input data, thereby improving the model’s generalization capability and responsiveness in complex attack scenarios. Performance evaluations are conducted on both datasets using cross-validation. The results show that, compared to traditional models, the improved SVM achieves an 11.2% increase in prediction accuracy. Furthermore, the model demonstrates a 22.2% improvement in computational efficiency, measured as the ratio of prediction count to processing time. It also exhibits lower false positive rates and greater stability in detecting common cyberattacks such as distributed denial of service, phishing, and malware. In addition, this study analyzes user behavioral variations under different levels of attack pressure based on network access activity. Findings indicate that during periods of high platform load, attack frequency is positively correlated with users’ defensive behavior, confirming a potential causal sequence of “capital influx—user expansion—increased attack exposure.” This study offers a practical modeling framework and empirical foundation for improving predictive performance and enhancing users’ sense of cybersecurity.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains simulated health insurance claims with detailed patient, provider, and service information, including flags and scores for a variety of fraud scenarios. It is designed to support the development and evaluation of fraud detection algorithms, audit workflows, and compliance monitoring in healthcare insurance. The dataset enables analysis of both common and rare fraudulent patterns for improved anomaly detection.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Fraud detection bank dataset 20K records binary ’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/volodymyrgavrysh/fraud-detection-bank-dataset-20k-records-binary on 28 January 2022.
--- Dataset description provided by original source is as follows ---
Banks are often exposed to fraud transactions and constantly improve systems to track them.
Bank dataset that contains 20k+ transactions with 112 features (numerical)
--- Original source retains full ownership of the source dataset ---
Overview:
This dataset captures user interactions with potentially malicious emails, simulating scenarios relevant to phishing detection and human-centric security analysis. Each row represents a unique email event, enriched with behavioral, technical, and contextual metadata.
Phishing Click Prediction
Predict if a user will click a link based on hover time, device type, and email domain.
User Risk Profiling
Build behavior models: e.g., do mobile users report threats less often?
Language & Localization Patterns
Evaluate phishing success rates by language and region.
Realistic Red Teaming Simulations
Use as a training or benchmarking set for phishing email simulations.
hover_time_ms < 1000
are 60% more likely to click malicious links.import pandas as pd
df = pd.read_csv("phishing_email_behavior.csv")
clicked_ratio = df.groupby("device_type")["clicked_link"].value_counts(normalize=True).unstack()
print(clicked_ratio)