http://www.gnu.org/licenses/lgpl-3.0.html
Dataset Name: Spam Email Dataset
Description: This dataset contains a collection of email text messages, labeled as either spam or not spam. Each email message is associated with a binary label, where "1" indicates that the email is spam, and "0" indicates that it is not spam. The dataset is intended for use in training and evaluating spam email classification models.
Columns:
text (Text): This column contains the text content of the email messages. It includes the body of the emails along with any associated subject lines or headers.
spam_or_not (Binary): This column contains binary labels to indicate whether an email is spam or not. "1" represents spam, while "0" represents not spam.
Usage: This dataset can be used for various Natural Language Processing (NLP) tasks, such as text classification and spam detection. Researchers and data scientists can train and evaluate machine learning models using this dataset to build effective spam email filters.
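As a rough illustration of the usage described above, the sketch below trains a tiny multinomial Naive Bayes classifier on rows shaped like this dataset (`text`, `spam_or_not`). The four sample rows, the tokenizer, and the function names are made up for illustration and are not taken from the dataset; a real experiment would load the actual file and hold out a test split.

```python
import math
import re
from collections import Counter

# Hypothetical rows in the dataset's two-column shape: (text, spam_or_not).
ROWS = [
    ("Win a free prize now, click here", 1),
    ("Free money, claim your prize today", 1),
    ("Meeting moved to Monday at noon", 0),
    ("Please review the attached report before Monday", 0),
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def train(rows):
    """Count tokens per class for a multinomial Naive Bayes model."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter()
    for text, label in rows:
        class_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab

def predict(model, text):
    """Return the label (0 or 1) with the higher log-posterior."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in (0, 1):
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token in vocab:  # ignore words never seen in training
                # Laplace (add-one) smoothing avoids log(0).
                count = word_counts[label][token] + 1
                score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

model = train(ROWS)
print(predict(model, "claim your free prize"))          # spam-like wording
print(predict(model, "report for the Monday meeting"))  # ham-like wording
```

Naive Bayes is used here only because it fits in a few lines of standard-library Python; the dataset description does not prescribe any particular model.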
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We have curated 7 repositories. The Ling and Enron datasets have just two features: ‘Subject’ and ‘Body’. The other datasets consist of six features, namely ‘Sender’, ‘Receiver’, ‘Date’, ‘Subject’, ‘Body’, and ‘Urls’.
Please cite this dataset:
A. I. Champa, M. F. Rabbi, and M. F. Zibran, “Curated datasets and feature analysis for phishing email detection with machine learning,” in 3rd IEEE International Conference on Computing and Machine Intelligence (ICMI), 2024, pp. 1–7 (to appear).
or
@inproceedings{champa2024curated,
  title={Curated Datasets and Feature Analysis for Phishing Email Detection with Machine Learning},
  author={Champa, Arifa I and Rabbi, Md Fazle and Zibran, Minhaz F},
  booktitle={3rd IEEE International Conference on Computing and Machine Intelligence (ICMI)},
  pages={1--7 (to appear)},
  year={2024}
}
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Cite the paper if you use this dataset:
A. I. Champa, M. F. Rabbi, and M. F. Zibran, “Curated datasets and feature analysis for phishing email detection with machine learning,” in 3rd IEEE International Conference on Computing and Machine Intelligence (ICMI), 2024, pp. 1–7.
BibTeX:
@inproceedings{champa2024curated,
  title={Curated Datasets and Feature Analysis for Phishing Email Detection with Machine Learning},
  author={Champa, Arifa I and Rabbi, Md Fazle and Zibran, Minhaz F},
  booktitle={3rd IEEE International Conference on Computing and Machine Intelligence (ICMI)},
  pages={1--7},
  year={2024}
}
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Overview: This dataset is designed for phishing email detection using machine learning. It combines:
- ~500,000 non-phishing ("safe") emails from the Enron Email Dataset
- ~20,000 phishing and safe emails from the Phishing Email Dataset
Every email was cleaned and passed through a custom NLP feature extraction pipeline that focuses on phishing indicators. The goal is to provide a ready-to-use dataset for classification tasks with minimal preprocessing.
num_words - Total number of words in the email body
num_unique_words - Count of unique words used
num_stopwords - Count of common stopwords (e.g., "the", "and", "in")
num_links - Number of hyperlinks detected
num_unique_domains - Number of unique domains in links (e.g., "paypal.com")
num_email_addresses - Count of email addresses found in the text
num_spelling_errors - Count of misspelled words
num_urgent_keywords - Number of urgent words (e.g., "urgent", "verify", "update")
label - Target variable: 0 = Safe Email, 1 = Phishing Email
Notes:
- This dataset does not contain raw text or headers, only engineered features for training/testing models.
- Spell checking used pyspellchecker on filtered tokens.
- Stopwords were a fixed English list.
- No personal or PII information is included.
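To make the feature definitions concrete, here is a minimal sketch of how most of these counts could be computed with the Python standard library. The regular expressions, the abbreviated stopword list, and the urgent-keyword list are assumptions chosen for illustration; the dataset's actual pipeline is not published in this description. `num_spelling_errors` is left out because it depends on pyspellchecker.

```python
import re

# Assumed, abbreviated word lists; the dataset's real lists are not given here.
STOPWORDS = {"the", "and", "in", "a", "an", "to", "of", "is", "your"}
URGENT_KEYWORDS = {"urgent", "verify", "update"}

# Assumed patterns for links, email addresses, and words.
URL_RE = re.compile(r"https?://([^/\s]+)\S*", re.IGNORECASE)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
WORD_RE = re.compile(r"[A-Za-z']+")

def extract_features(body):
    """Compute count-based features for one email body."""
    domains = [m.group(1).lower() for m in URL_RE.finditer(body)]
    emails = EMAIL_RE.findall(body)
    # Remove links and addresses so they are not counted as words.
    text = EMAIL_RE.sub(" ", URL_RE.sub(" ", body))
    words = [w.lower() for w in WORD_RE.findall(text)]
    return {
        "num_words": len(words),
        "num_unique_words": len(set(words)),
        "num_stopwords": sum(w in STOPWORDS for w in words),
        "num_links": len(domains),
        "num_unique_domains": len(set(domains)),
        "num_email_addresses": len(emails),
        "num_urgent_keywords": sum(w in URGENT_KEYWORDS for w in words),
    }

sample = ("URGENT: verify your account at http://paypal.com/login or "
          "http://paypal.com/help and email support@example.com today.")
features = extract_features(sample)
print(features)
```

Whether links and addresses should count toward `num_words` is a design choice; this sketch excludes them, which may differ from how the published features were produced.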
data-phishing-detection
A dataset to test methods to detect phishing emails. The file data.parquet contains the dataset: 400 emails, of which 200 are synthetic phishing attempts and 200 are synthetic regular emails.
Schema
input - an email, synthesized by an LLM, that is either a phishing attempt or a regular email.
output - 'Yes' if the email is a phishing attempt, 'No' otherwise.
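Given this schema, a detector's answers can be scored by mapping 'Yes'/'No' to 1/0 and comparing against the `output` column. The sketch below uses four invented rows and invented predictions purely to show the bookkeeping; loading `data.parquet` itself would need a Parquet reader, which is omitted here.

```python
# Hypothetical rows in the documented schema: 'input' is the email text,
# 'output' is 'Yes' (phishing) or 'No' (regular).
rows = [
    {"input": "Your account is locked. Confirm your password here.", "output": "Yes"},
    {"input": "Lunch is at 12:30 in the usual place.", "output": "No"},
    {"input": "Final notice: pay the invoice via this link.", "output": "Yes"},
    {"input": "Minutes from yesterday's meeting are attached.", "output": "No"},
]
# Hypothetical model answers, in the same 'Yes'/'No' format.
predictions = ["Yes", "No", "No", "No"]

def to_binary(answer):
    """Map 'Yes'/'No' (any casing, surrounding whitespace allowed) to 1/0."""
    return {"yes": 1, "no": 0}[answer.strip().lower()]

def score(rows, predictions):
    """Return accuracy, precision, and recall for the phishing class."""
    y_true = [to_binary(r["output"]) for r in rows]
    y_pred = [to_binary(p) for p in predictions]
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

metrics = score(rows, predictions)
print(metrics)
```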
Prompt
The prompt.md file contains a prompt that can be used with an LLM as a starting… See the full description on the dataset page: https://huggingface.co/datasets/RevaHQ/data-phishing-detection.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The tested scenarios were run on small_dataset. The most successful configuration identified in that analysis was then applied to big_dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Cite the paper if you use this dataset:
1. A. I. Champa, M. F. Rabbi, and M. F. Zibran, “Why phishing emails escape detection: A closer look at the failure points,” in 12th International Symposium on Digital Forensics and Security (ISDFS), 2024, pp. 1–6.
2. A. I. Champa, M. F. Rabbi, and M. F. Zibran, “Curated datasets and feature analysis for phishing email detection with machine learning,” in 3rd IEEE International Conference on Computing and Machine Intelligence (ICMI), 2024, pp. 1–7.
BibTeX:
1. @inproceedings{champa2024phishing,
  title={Why Phishing Emails Escape Detection: A Closer Look at the Failure Points},
  author={Champa, Arifa I and Rabbi, Fazle and Zibran, Minhaz F},
  booktitle={2024 12th International Symposium on Digital Forensics and Security (ISDFS)},
  pages={1--6},
  year={2024},
  organization={IEEE}
}
2. @inproceedings{champa2024curated,
  title={Curated Datasets and Feature Analysis for Phishing Email Detection with Machine Learning},
  author={Champa, Arifa I and Rabbi, Md Fazle and Zibran, Minhaz F},
  booktitle={3rd IEEE International Conference on Computing and Machine Intelligence (ICMI)},
  pages={1--7},
  year={2024}
}
This dataset was created by Dhruv Agarwal
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset was compiled by researchers to study phishing email tactics. It combines emails from a variety of sources to create a comprehensive resource for analysis.
Enron and Ling Datasets: These datasets focus on the core content of phishing emails, containing subject lines, email body text, and labels indicating whether the email is spam (phishing) or legitimate.
CEAS, Nazario, Nigerian Fraud, and SpamAssassin Datasets: These datasets provide broader context for the emails, including sender information, recipient information, date, and labels for spam/legitimate classification.
The final dataset combines the information from the initial datasets into a single resource for analysis. This dataset contains:
This dataset allows researchers to study the content of phishing emails and the context in which they are sent to improve detection methods.
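Combining sources with different columns into one table usually means projecting every record onto a shared schema and leaving missing fields empty. The sketch below shows one way to do that; the lowercase field names, the `source` column, and the two sample records are assumptions for illustration, not the published column names.

```python
# Assumed unified schema covering both the content-only and the
# context-rich source datasets described above.
FIELDS = ["sender", "receiver", "date", "subject", "body", "label"]

def harmonize(record, source):
    """Project a record onto the shared schema; absent fields become None."""
    row = {field: record.get(field) for field in FIELDS}
    row["source"] = source  # hypothetical provenance column
    return row

# Hypothetical records: one content-only, one with sender context.
enron_record = {"subject": "Q3 numbers", "body": "See attached.", "label": 0}
ceas_record = {
    "sender": "alerts@example.com",
    "receiver": "user@example.org",
    "date": "2008-08-05",
    "subject": "Verify your account",
    "body": "Click here to confirm.",
    "label": 1,
}

combined = [harmonize(enron_record, "Enron"), harmonize(ceas_record, "CEAS")]
for row in combined:
    print(row)
```

Keeping a provenance column makes it possible to check later whether a model has learned source-specific quirks rather than phishing cues.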
Please cite the following two articles if you are using this dataset:
Phishing Email Detection Dataset
A comprehensive dataset combining email messages and URLs for phishing detection.
Dataset Overview
Quick Facts
Task Type: Multi-class Classification
Languages: English
Total Samples: 200,000 entries
Size Split:
- Email samples: 22,644
- URL samples: 177,356
Label Distribution: Four classes (0, 1, 2, 3)
Format: Two columns - content and labels
Dataset Structure
Features
{ 'content':… See the full description on the dataset page: https://huggingface.co/datasets/cybersectony/PhishingEmailDetectionv2.0.
lleratodev/ai-powered-phishing-email-detection-system dataset hosted on Hugging Face and contributed by the HF Datasets community
https://creativecommons.org/publicdomain/zero/1.0/
This is a synthetic dataset for training and testing spam detection models. It contains 20,000 email samples, and each sample is described by five features and one label.
- num_links: … (λ) of 1.5
- num_words
- has_offer
- sender_score
- all_caps
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset contains 7,500+ Turkish phishing and legitimate emails, making it a valuable resource for phishing detection, cybersecurity, and NLP research.
Phishing (Oltalama) & Legitimate (Güvenilir)
Column | Description |
---|---|
ID | Unique identifier for each email |
Konu (Subject) | The email’s subject line |
Gönderen (Sender) | The sender's email address (often spoofed) |
İçerik (Content) | The body text of the email |
Kategori (Category) | Oltalama (Phishing) or Güvenilir (Legitimate) |
import re

import pandas as pd

# Load the dataset (path as used on Kaggle)
df = pd.read_csv("/kaggle/input/turkish-phishing-email-dataset/turkish_phishing_dataset.csv")
print(df.head())

# Show a random sample of phishing emails
phishing_emails = df[df["Kategori"] == "Oltalama"]
print(phishing_emails.sample(5))

def clean_text(text):
    text = re.sub(r'\W+', ' ', text)  # Replace runs of non-word characters with a space
    text = text.lower()  # Convert to lowercase
    return text

# Add a cleaned copy of the email body and compare it with the original
df["Cleaned_Content"] = df["İçerik"].apply(clean_text)
print(df[["İçerik", "Cleaned_Content"]].head())
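One caveat with `clean_text` on Turkish text: Python's `str.lower()` follows the default Unicode casing rules, so "I" becomes "i" rather than the dotless "ı", and "İ" becomes "i" followed by a combining dot (U+0307). The sketch below is one possible Turkish-aware variant; the function names `turkish_lower` and `clean_text_tr` are hypothetical and not part of the dataset's own code.

```python
import re

def turkish_lower(text):
    """Lowercase text using Turkish rules for dotted and dotless I."""
    # Handle the two special letters first: U+0130 (dotted capital I) -> "i",
    # and plain "I" -> U+0131 (dotless small i). Then lowercase the rest.
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()

def clean_text_tr(text):
    """Replace non-word characters with spaces, then lowercase for Turkish."""
    text = re.sub(r"\W+", " ", text)
    return turkish_lower(text).strip()

# Made-up subject line, written with escapes so the special letters are explicit.
print(clean_text_tr("\u0130PTAL: KARGONUZ DA\u011eITIMDA!"))
```

This matters mainly for bag-of-words features, where the default casing would split one Turkish word into several distinct tokens.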
This dataset is released under the CC BY 4.0 License, meaning you can use, modify, and distribute it as long as you provide proper credit.
More details: Creative Commons License
If you have new phishing examples or improvements, feel free to contribute!
For questions or collaborations, reach out via osmancancetlenbik@gmail.com.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The replication package consists of the questionnaire and related materials:
- Questionnaire.pdf: Includes demographic questions to gather information about participants and contains 20 Emails Folder: Contains 20 emails (10 phishing and 10 non-phishing) used for the phishing identification.
- responses.xlsx: Contains the actual responses from the participants in the user study.
- Impact_of_email_category_analysis_code.py: Contains our analysis of participants' responses.
https://www.datainsightsmarket.com/privacy-policy
The global phishing protection and prevention solutions market is experiencing robust growth, driven by the escalating sophistication and frequency of phishing attacks targeting both large enterprises and SMEs. The increasing reliance on cloud-based services and the expanding attack surface created by remote work and digital transformation initiatives significantly fuel market expansion. An assumed Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033, based on industry averages for cybersecurity solutions, would imply substantial market expansion. This growth is further fueled by the increasing adoption of advanced technologies like AI and machine learning in phishing detection and prevention systems. While on-premises solutions still hold a significant market share, the cloud-based segment is rapidly gaining traction due to its scalability, cost-effectiveness, and ease of deployment. The market is segmented geographically, with North America currently holding the largest market share due to high technological adoption and a strong regulatory environment, followed by Europe and Asia-Pacific. However, the Asia-Pacific region is expected to exhibit the highest growth rate during the forecast period, driven by increasing internet penetration and rising cybersecurity awareness.
Market restraints include the high cost of implementation and maintenance of advanced phishing protection solutions, especially for SMEs. Furthermore, the constant evolution of phishing techniques requires continuous updates and improvements to these solutions, posing a challenge for vendors and users alike. Despite these challenges, the ever-increasing financial and reputational damage caused by successful phishing attacks creates a compelling need for robust protection, ensuring sustained market growth.
Key players in the market, including Cofense, Phish Protection, Check Point, Mimecast, Microsoft, and others, are constantly innovating and expanding their product portfolios to address emerging threats and cater to the diverse needs of different user segments. The competitive landscape is dynamic, characterized by strategic partnerships, acquisitions, and technological advancements.
Turkish Phishing Email Dataset
📌 Overview
This dataset contains 7,500+ Turkish phishing and legitimate emails, making it a valuable resource for phishing detection, natural language processing (NLP), and cybersecurity research. It includes various phishing email types, such as:
Fake cargo delivery alerts Market discount scams Bank fraud emails Government agency impersonation Social media and account takeover phishing
📂 Dataset Details
Total Records: 7… See the full description on the dataset page: https://huggingface.co/datasets/OsmanCan/turkish_phishing_dataset.
https://www.marketresearchforecast.com/privacy-policy
The global Email Threat Detection System market is experiencing robust growth, driven by the escalating sophistication and frequency of email-borne cyberattacks targeting both businesses and governments. The market, currently estimated at $15 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 12% from 2025 to 2033, reaching an estimated $45 billion by 2033. This expansion is fueled by several key factors, including the increasing adoption of cloud-based email services, the rise in remote work and associated security vulnerabilities, and the growing awareness of the financial and reputational damage caused by successful email phishing and malware attacks. Stringent government regulations concerning data privacy and cybersecurity are also driving demand for robust email threat detection solutions. The market segmentation reveals a significant share held by the software segment, reflecting the preference for automated and scalable solutions. Geographically, North America currently dominates the market, owing to advanced technological infrastructure and high cybersecurity awareness. However, the Asia-Pacific region is poised for significant growth, fueled by rapid digitalization and increasing internet penetration across countries like China and India. Competition in the Email Threat Detection System market is intense, with a mix of established players like Proofpoint, Cisco, Symantec, and emerging vendors vying for market share. The market is characterized by continuous innovation, with vendors investing heavily in advanced threat detection technologies, including artificial intelligence (AI) and machine learning (ML) to enhance accuracy and speed of threat identification. While market growth is substantial, challenges remain, including the rising complexity of cyberattacks and the emergence of novel attack vectors, such as sophisticated phishing techniques and polymorphic malware. 
The ongoing battle between threat actors and security providers fuels the need for continuous adaptation and improvement of email threat detection systems. Furthermore, the high cost of implementation and maintenance, along with the need for skilled personnel to manage these systems, can pose barriers to entry for smaller organizations.
https://dataintelo.com/privacy-and-policy
The global email anti-spam software market size was valued at approximately USD 1.8 billion in 2023 and is projected to reach nearly USD 4.2 billion by 2032, growing at a compound annual growth rate (CAGR) of 9.7% during the forecast period. The significant growth factor driving this market is the increasing volume of spam emails, which has heightened the demand for robust email security solutions.
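The relationship between the start value, the CAGR, and the projected value is plain compound growth, which can be checked in a few lines. The sketch below assumes nine compounding years (2023 to 2032); the report does not state the exact base year of its forecast period, so that count is an assumption.

```python
def project(start_value, cagr, years):
    """Compound a starting value at a constant annual growth rate."""
    return start_value * (1 + cagr) ** years

def implied_cagr(start_value, end_value, years):
    """Solve the compound-growth formula for the annual rate."""
    return (end_value / start_value) ** (1 / years) - 1

YEARS = 2032 - 2023  # assumed number of compounding periods

projected = project(1.8, 0.097, YEARS)   # USD billion
rate = implied_cagr(1.8, 4.2, YEARS)     # fraction per year

print(f"Projected 2032 value: {projected:.2f} billion USD")
print(f"CAGR implied by 1.8 -> 4.2: {rate:.2%}")
```

Under that assumption, 9.7% annual growth takes USD 1.8 billion to roughly USD 4.1 billion, which is in line with the "nearly USD 4.2 billion" quoted above.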
One of the primary growth factors for the email anti-spam software market is the proliferation of spam and phishing attacks. As email remains a critical communication tool for both individuals and businesses, the rise in cyber threats has led to a greater need for advanced spam filtering solutions. Organizations are seeking sophisticated software capable of detecting and blocking malicious emails, thereby safeguarding sensitive information and protecting against data breaches. This demand is further fueled by regulatory requirements mandating stringent data protection measures.
Another key growth factor is the increasing adoption of cloud-based solutions. Cloud deployment offers numerous advantages, including scalability, ease of integration, and cost-effectiveness. As more businesses migrate their operations to the cloud, the demand for cloud-based email anti-spam solutions is surging. These solutions are particularly appealing to small and medium enterprises (SMEs), which may lack the resources to invest in extensive on-premises infrastructure. Cloud solutions provide these organizations with robust security features, ensuring their email systems remain secure and compliant.
Technological advancements in artificial intelligence (AI) and machine learning (ML) are also propelling market growth. Modern email anti-spam software leverages AI and ML algorithms to enhance the accuracy and efficiency of spam detection. These technologies enable the software to learn from patterns and behaviors, improving its ability to identify new and sophisticated spam tactics. The continuous evolution of AI and ML technologies promises to further strengthen the capabilities of email anti-spam solutions, driving their adoption across various sectors.
The rise of Cloud-based Email Security solutions is revolutionizing the way organizations approach email protection. By leveraging cloud infrastructure, these solutions offer enhanced flexibility and scalability, allowing businesses to adapt quickly to changing security landscapes. Cloud-based systems are particularly advantageous for organizations with distributed teams, as they provide seamless access to security features from any location. Furthermore, they reduce the burden of maintaining on-premises hardware, enabling IT teams to focus on strategic initiatives rather than routine maintenance. As cyber threats evolve, cloud-based email security solutions continuously update to provide the latest protection, ensuring that organizations remain one step ahead of potential attacks. This adaptability and ease of use are driving more companies to transition to cloud-based models, aligning with broader digital transformation trends.
Regionally, North America holds a substantial share of the email anti-spam software market. The presence of leading market players, coupled with high adoption rates of advanced cybersecurity solutions, drives this dominance. Additionally, stringent regulatory frameworks in the United States and Canada emphasize the need for robust email security, further boosting market growth in the region. Europe follows closely, with the General Data Protection Regulation (GDPR) playing a pivotal role in ensuring data security and privacy, thereby driving the demand for email anti-spam software.
The email anti-spam software market is segmented by components into software and services. The software segment dominates the market, driven by the continuous need for effective spam detection and email security solutions. The software is designed to identify and block spam emails before they reach the user's inbox, leveraging a combination of filters, algorithms, and databases. This segment is witnessing continuous innovation, with vendors incorporating advanced AI and ML features to enhance detection accuracy and efficiency.
Software solutions are further categorized into standalone and integrated solutions. Standalone software is specifically designed to target spam emails, while integrated solutions are
https://www.marketresearchforecast.com/privacy-policy
The Email Threat Detection System (ETDS) market is experiencing robust growth, driven by the escalating sophistication and volume of email-borne threats targeting businesses and governments globally. The increasing reliance on email for communication and data exchange, coupled with the rise of phishing, malware, and ransomware attacks, fuels the demand for advanced ETDS solutions. While precise market sizing data isn't provided, a reasonable estimation, considering the prevalent use of email and the substantial investments in cybersecurity, would place the 2025 market value at approximately $5 billion USD, with a Compound Annual Growth Rate (CAGR) of 15% projected through 2033. This growth is fueled by several key drivers: the expanding adoption of cloud-based email security solutions, the integration of artificial intelligence (AI) and machine learning (ML) for improved threat detection, and the increasing regulatory pressure on organizations to bolster their email security posture. The market is segmented by software, service, and application, with significant growth anticipated across government, finance, and corporate sectors. Major players like Proofpoint, Cisco, and Microsoft are actively competing, while regional variations exist, with North America and Europe currently holding the largest market shares due to high levels of digitalization and stringent security regulations. However, the Asia-Pacific region is predicted to witness rapid growth due to increasing internet penetration and rising cybersecurity awareness. Despite the positive outlook, the market faces certain challenges. These include the ever-evolving nature of cyber threats, requiring constant updates and adaptations of ETDS solutions. The high cost of implementation and maintenance, especially for advanced features like AI-powered threat intelligence, can also act as a restraint, particularly for smaller businesses. 
The increasing complexity of integrating ETDS with existing IT infrastructure further poses a hurdle. Nonetheless, the overall market trajectory remains positive, driven by the critical need for robust email security in today's interconnected world. The market's future growth will likely be shaped by continued technological advancements, evolving threat landscapes, and government regulations aimed at enhancing cybersecurity.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Phishing and Benign Email Dataset
This dataset contains a curated collection of phishing and legitimate (benign) emails for use in cybersecurity training, phishing detection models, and email classification systems. Each entry is structured with subject, body, intent, technique, target, and classification label.
📁 Dataset Format
The dataset is stored in .jsonl (JSON Lines) format. Each line is a standalone JSON object.
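Because every line is an independent JSON object, the file can be read one line at a time with the standard `json` module. In the sketch below the two sample lines and their key names (`id`, `subject`, `body`, `label`) are guesses based on the field list in the description; the real file may use different keys and label values.

```python
import io
import json

# Hypothetical two-line sample standing in for the real .jsonl file.
SAMPLE = io.StringIO(
    '{"id": 1, "subject": "Reset your password", "body": "Click the link.", "label": "phishing"}\n'
    '{"id": 2, "subject": "Team lunch", "body": "See you at noon.", "label": "benign"}\n'
)

def read_jsonl(handle):
    """Parse a JSON Lines stream into a list of dicts, skipping blank lines."""
    records = []
    for line in handle:
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records

records = read_jsonl(SAMPLE)
for record in records:
    print(record["id"], record["label"], record["subject"])
```

For an actual file, `read_jsonl(open(path, encoding="utf-8"))` would work the same way, since the function only iterates over lines.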
Fields:
Field Description
id… See the full description on the dataset page: https://huggingface.co/datasets/darkknight25/phishing_benign_email_dataset.