Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset is designed to aid in the analysis and detection of phishing websites. It contains various features that help distinguish between legitimate and phishing websites based on their structural, security, and behavioral attributes.
Result (Indicates whether a website is phishing or legitimate)
Prefix_Suffix – Checks if the URL contains a hyphen (-), which is commonly used in phishing domains.
double_slash_redirecting – Detects if the URL redirects using //, which may indicate a phishing attempt.
having_At_Symbol – Identifies the presence of @ in the URL, which can be used to deceive users.
Shortining_Service – Indicates whether the URL uses a shortening service (e.g., bit.ly, tinyurl).
URL_Length – Measures the length of the URL; phishing URLs tend to be longer.
having_IP_Address – Checks if an IP address is used in place of a domain name, which is suspicious.
having_Sub_Domain – Evaluates the number of subdomains; phishing sites often have excessive subdomains.
SSLfinal_State – Indicates whether the website has a valid SSL certificate (secure connection).
Domain_registeration_length – Measures the duration of domain registration; phishing sites often have short lifespans.
age_of_domain – The age of the domain in days; older domains are usually more trustworthy.
DNSRecord – Checks if the domain has valid DNS records; phishing domains may lack these.
Favicon – Determines if the website uses an external favicon (which can be a sign of phishing).
port – Identifies if the site is using suspicious or non-standard ports.
HTTPS_token – Checks if "HTTPS" is included in the URL but is used deceptively.
Request_URL – Measures the percentage of external resources loaded from different domains.
URL_of_Anchor – Analyzes anchor tags (<a> links) and their trustworthiness.
Links_in_tags – Examines <meta>, <script>, and <link> tags for external links.
SFH (Server Form Handler) – Determines if form actions are handled suspiciously.
Submitting_to_email – Checks if forms submit data directly to an email instead of a web server.
Abnormal_URL – Identifies if the website’s URL structure is inconsistent with common patterns.
Redirect – Counts the number of redirects; phishing websites may have excessive redirects.
on_mouseover – Checks if the website changes content when hovered over (used in deceptive techniques).
RightClick – Detects if right-click functionality is disabled (phishing sites may disable it).
popUpWindow – Identifies the presence of pop-ups, which can be used to trick users.
Iframe – Checks if the website uses <iframe> tags, often used in phishing attacks.
web_traffic – Measures the website’s Alexa ranking; phishing sites tend to have low traffic.
Page_Rank – Google PageRank score; phishing sites usually have a low PageRank.
Google_Index – Checks if the website is indexed by Google (phishing sites may not be indexed).
Links_pointing_to_page – Counts the number of backlinks pointing to the website.
Statistical_report – Uses external sources to verify if the website has been reported for phishing.
Result – The classification label (1: Legitimate, -1: Phishing)
This dataset is valuable for:
✅ Machine Learning Models – Developing classifiers for phishing detection.
✅ Cybersecurity Research – Understanding patterns in phishing attacks.
✅ Browser Security Extensions – Enhancing anti-phishing tools.
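As a rough illustration of how a few of the URL-based features described above could be derived, here is a minimal sketch using only the Python standard library. The function name `url_features` and the choice to return raw lengths and booleans are assumptions made for this example; the dataset's own extraction rules and value encoding are not described here and may differ.

```python
import ipaddress
from urllib.parse import urlsplit


def url_features(url: str) -> dict:
    """Compute a handful of URL-only indicators (illustrative, not the dataset's exact rules)."""
    parts = urlsplit(url)
    host = parts.hostname or ""

    # having_IP_Address: the host is a literal IPv4/IPv6 address instead of a domain name.
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False

    # double_slash_redirecting: look for '//' anywhere after the scheme separator.
    after_scheme = url.split("://", 1)[-1]

    return {
        "URL_Length": len(url),                 # raw length, no threshold applied
        "Prefix_Suffix": "-" in host,           # hyphen in the host name
        "having_At_Symbol": "@" in url,         # '@' anywhere in the URL
        "double_slash_redirecting": "//" in after_scheme,
        "having_IP_Address": host_is_ip,
    }


if __name__ == "__main__":
    print(url_features("http://user@evil-site.com/a"))
    print(url_features("http://192.168.0.1//login"))
```

Features that depend on page content or external services (for example SFH, web_traffic or Statistical_report) cannot be computed from the URL string alone and are left out of this sketch.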
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These data consist of a collection of legitimate and phishing website instances. Each website is represented by a set of features that denote whether the website is legitimate or not. The data can serve as input for a machine learning process.
Here, the two variants of the Phishing Dataset are presented.
Full variant - dataset_full.csv
Small variant - dataset_small.csv
This API provides information on press releases issued by authorized institutions, and similar press releases issued by the HKMA in the past, regarding fraudulent bank websites, phishing e-mails and similar scams.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The provided dataset includes 11430 URLs with 87 extracted features. The dataset is designed to be used as a benchmark for machine learning-based phishing detection systems. Features are from three different classes: 56 extracted from the structure and syntax of URLs, 24 extracted from the content of their corresponding pages, and 7 extracted by querying external services. The dataset is balanced: it contains exactly 50% phishing and 50% legitimate URLs. Along with the dataset, we provide the Python scripts used for feature extraction, for potential replication or extension. The datasets were constructed in May 2020.
dataset_A: contains a list of URLs together with their DOM tree objects, which can be used for replication and for experimenting with new URL- and content-based features, overcoming the short lifetime of phishing web pages.
dataset_B: contains the extracted feature values, which can be used directly as input to classifiers for examination. Note that the data in this dataset are indexed by URL, so the index needs to be removed before experimentation.
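The note about removing the URL index can be sketched as follows, assuming dataset_B is available as CSV text. The column names `url` and `status` and the feature columns in the toy sample are hypothetical placeholders chosen for this example, not names confirmed by the dataset description.

```python
import csv
import io


def split_features(csv_text: str, index_col: str = "url", label_col: str = "status"):
    """Drop the URL index column and separate numeric features from labels."""
    reader = csv.DictReader(io.StringIO(csv_text))
    features, labels = [], []
    for row in reader:
        row.pop(index_col)                # the URL is an identifier, not a feature
        labels.append(row.pop(label_col))
        features.append([float(value) for value in row.values()])
    return features, labels


# Toy rows for illustration only; column names are assumptions.
sample = (
    "url,length_url,nb_dots,status\n"
    "http://a.example/,18,1,legitimate\n"
    "http://b.example/x,19,1,phishing\n"
)

X, y = split_features(sample)
print(X, y)
```

Keeping the URL out of the feature matrix avoids a classifier memorising individual addresses instead of learning from the extracted features.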
data-phishing-detection
A dataset to test methods to detect phishing emails. The file data.parquet contains the dataset: 400 emails, of which 200 are synthetic phishing attempts and 200 are synthetic regular emails.
Schema
input - an email, synthesized by an LLM, that is either a phishing attempt or a regular email.
output - 'Yes' if the email is a phishing attempt, 'No' otherwise.
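Given the input/output schema above, a detection method can be scored by comparing its predictions with the 'Yes'/'No' labels. The sketch below assumes the rows have already been loaded into dictionaries with `input` and `output` keys; the two toy rows and the `keyword_baseline` predictor are placeholders invented for illustration and are not part of the dataset.

```python
def to_bool(label: str) -> bool:
    """Map the dataset's 'Yes'/'No' output to a boolean."""
    if label not in ("Yes", "No"):
        raise ValueError(f"unexpected label: {label!r}")
    return label == "Yes"


def accuracy(rows, predict) -> float:
    """Fraction of rows where predict(input) matches the labelled output."""
    correct = sum(predict(row["input"]) == to_bool(row["output"]) for row in rows)
    return correct / len(rows)


def keyword_baseline(email: str) -> bool:
    # Deliberately naive placeholder; a real method would replace this.
    return "verify your account" in email.lower()


# Toy rows for illustration only.
rows = [
    {"input": "Please verify your account within 24 hours.", "output": "Yes"},
    {"input": "Agenda for Thursday's team meeting attached.", "output": "No"},
]

print(accuracy(rows, keyword_baseline))
```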
Prompt
The prompt.md file contains a prompt that can be used with an LLM as a starting… See the full description on the dataset page: https://huggingface.co/datasets/RevaHQ/data-phishing-detection.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
A partial dataset and document-term matrix of phishing emails targeting an institution of higher education and an associated script used for data analysis.
Surveys of working adults and IT security professionals worldwide conducted in 2021 and 2023 found that the share of organizations experiencing severe consequences due to a successful cyber attack had declined. In 2023, the share of enterprises experiencing a breach of customer or client data was 29 percent, down from 44 percent in 2022. Ransomware infections that occurred through e-mail were common for 32 percent of the respondents in 2023. Cases of a credential or account compromise occurred in 27 percent of the organizations in 2023, a decrease of 25 percent compared to the year prior.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset comprises phishing and legitimate web pages, which have been used for experiments on early phishing detection.
Detailed information on the dataset and data collection is available at
Bram van Dooremaal, Pavlo Burda, Luca Allodi, and Nicola Zannone. 2021. Combining Text and Visual Features to Improve the Identification of Cloned Webpages for Early Phishing Detection. In ARES '21: Proceedings of the 16th International Conference on Availability, Reliability and Security. ACM.
The phishing simulation market is experiencing robust growth, driven by the escalating sophistication of phishing attacks and the increasing regulatory pressure on organizations to enhance their cybersecurity posture. The market, currently valued at approximately $1.5 billion in 2025 (estimated based on typical market sizes for cybersecurity segments with similar growth rates), is projected to experience a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033. This expansion is fueled by several key factors. Firstly, the rising frequency and success rate of phishing campaigns targeting both large enterprises and SMEs necessitate proactive security measures like simulation training. Secondly, evolving attack vectors and techniques demand continuous adaptation and improvement in security awareness programs, creating a sustained demand for advanced phishing simulation solutions. Thirdly, stringent data privacy regulations like GDPR and CCPA are imposing significant penalties for data breaches resulting from successful phishing attacks, motivating organizations to invest heavily in preventative measures including simulation-based training. The market segmentation reveals a significant share held by software-based solutions, owing to their scalability, ease of deployment, and cost-effectiveness. However, the service segment is also experiencing strong growth due to the increasing need for expert guidance and managed services in designing and implementing effective phishing simulation programs. Geographically, North America currently dominates the market, followed by Europe, reflecting the high level of cybersecurity awareness and regulatory compliance in these regions. However, the Asia-Pacific region is expected to exhibit the highest growth rate over the forecast period, driven by increasing digital adoption and rising awareness of cybersecurity threats in developing economies. 
While the market faces certain restraints, such as the need for specialized expertise and the potential for high implementation costs, the overall growth trajectory remains positive, driven by the overwhelming need to combat the ever-evolving threat landscape of phishing attacks.
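To make the growth figures concrete, the compound growth formula value × (1 + CAGR)^years can be applied to the numbers quoted above. This is plain arithmetic on the stated estimate and growth rate; the resulting 2033 figure is implied by them and is not a number given by the report.

```python
def project_value(start_value: float, cagr: float, years: int) -> float:
    """Compound a starting value at a constant annual growth rate."""
    return start_value * (1 + cagr) ** years


# 1.5 (billion USD, 2025 estimate) growing at 15% per year for 2033 - 2025 = 8 years.
projected_2033 = project_value(1.5, 0.15, 2033 - 2025)
print(f"Implied 2033 market size: ${projected_2033:.2f} billion")
```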
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The tested scenarios were run on the small_dataset. The most successful configuration selected from the analysis on small_dataset was then applied to big_dataset.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PhiUSIIL Phishing URL Dataset is a substantial dataset comprising 134,850 legitimate and 100,945 phishing URLs. Most of the URLs we analyzed while constructing the dataset are the latest URLs. Features are extracted from the source code of the webpage and URL. Features such as CharContinuationRate, URLTitleMatchScore, URLCharProb, and TLDLegitimateProb are derived from existing features.
Citation: Prasad, A., & Chandra, S. (2023). PhiUSIIL: A diverse security profile empowered phishing URL detection framework based on similarity index and incremental learning. Computers & Security, 103545. doi: https://doi.org/10.1016/j.cose.2023.103545
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains DNS records, IP-related features, WHOIS/RDAP information, information from TLS handshakes and certificates, and GeoIP information for 368,956 benign domains from Cisco Umbrella, 461,338 benign domains from the actual CESNET network traffic, 164,425 phishing domains from PhishTank and OpenPhish services, and 100,809 malware domains from various sources like ThreatFox, The Firebog, MISP threat intelligence platform, and other sources. The ground truth for the phishing dataset was double-checked with the VirusTotal (VT) service. Domain names not considered malicious by VT have been removed from the phishing and malware datasets. Similarly, benign domain names that were considered risky by VT have been removed from the benign datasets. The data was collected between March 2023 and July 2024. The final assessment of the data was conducted in August 2024.
The dataset is useful for cybersecurity research, e.g. statistical analysis of domain data or feature extraction for training machine learning-based classifiers, e.g. for phishing and malware website detection.
The data is located in the following individual files:
Both files contain a JSON array of records generated using mongoexport. The following table documents the structure of a record. Please note that:
Field name | Field type | Nullable | Description
domain_name | String | No | The evaluated domain name
url | String | No | The source URL for the domain name
evaluated_on | Date | No | Date of last collection attempt
source | String | No | An identifier of the source
sourced_on | Date | No | Date of ingestion of the domain name
dns | Object | Yes | Data from DNS scan
rdap | Object | Yes | Data from RDAP or WHOIS
tls | Object | Yes | Data from TLS handshake
ip_data | Array of Objects | Yes | Array of data objects capturing the IP addresses related to the domain name
DNS data (dns field)
A | Array of Strings | No | Array of IPv4 addresses
AAAA | Array of Strings | No | Array of IPv6 addresses
TXT | Array of Strings | No | Array of raw TXT values
CNAME | Object | No | The CNAME target and related IPs
MX | Array of Objects | No | Array of objects with the MX target hostname, priority and related IPs
NS | Array of Objects | No | Array of objects with the NS target hostname and related IPs
SOA | Object | No | All the SOA fields, present if found at the target domain name
zone_SOA | Object | No | The SOA fields of the target’s zone (closest point of delegation), present if found and not a record in the target domain directly
dnssec | Object | No | Flags describing the DNSSEC validation result for each record type
ttls | Object | No | The TTL values for each record type
remarks | Object | No | The zone domain name and DNSSEC flags
RDAP data (rdap field)
copyright_notice | String | No | RDAP/WHOIS data usage copyright notice
dnssec | Bool | No | DNSSEC presence flag
entitites | Object | No | An object with various arrays representing the found related entity types (e.g. abuse, admin, registrant). The arrays contain objects describing the individual entities.
expiration_date | Date | Yes | The current date of expiration
handle | String | No | RDAP handle
last_changed_date | Date | Yes | The date when the domain was last changed
name | String | No | The target domain name for which the data in this object are stored
nameservers | Array of Strings | No | Nameserver hostnames provided by RDAP or WHOIS
registration_date | Date | Yes | First registration date
status | Array of Strings |
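Because the top-level dns, rdap, tls and ip_data fields are nullable, code that reads these records should check for missing values before descending into them. The following is a minimal sketch using Python's standard json module; the two sample records are invented placeholders that only mimic the documented field names, and the helper names are assumptions made for this example.

```python
import json


def load_records(text: str) -> list:
    """Parse a JSON array of domain records."""
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of records")
    return records


def ipv4_addresses(record: dict) -> list:
    """Return the A records of a domain, or an empty list when no DNS data exists."""
    dns = record.get("dns")  # nullable according to the field table
    if dns is None:
        return []
    return dns.get("A") or []


# Invented sample in the documented shape (not real dataset content).
sample = """[
  {"domain_name": "example.com", "dns": {"A": ["192.0.2.1"]},
   "rdap": null, "tls": null, "ip_data": null},
  {"domain_name": "example.org", "dns": null,
   "rdap": null, "tls": null, "ip_data": null}
]"""

records = load_records(sample)
for record in records:
    print(record["domain_name"], ipv4_addresses(record))
```

The same guard pattern applies to the rdap, tls and ip_data fields before reading their nested values.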
http://www.gnu.org/licenses/lgpl-3.0.html
Dataset Name: Spam Email Dataset
Description: This dataset contains a collection of email text messages, labeled as either spam or not spam. Each email message is associated with a binary label, where "1" indicates that the email is spam, and "0" indicates that it is not spam. The dataset is intended for use in training and evaluating spam email classification models.
Columns:
text (Text): This column contains the text content of the email messages. It includes the body of the emails along with any associated subject lines or headers.
spam_or_not (Binary): This column contains binary labels to indicate whether an email is spam or not. "1" represents spam, while "0" represents not spam.
Usage: This dataset can be used for various Natural Language Processing (NLP) tasks, such as text classification and spam detection. Researchers and data scientists can train and evaluate machine learning models using this dataset to build effective spam email filters.
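Before training a classifier on the text and spam_or_not columns, it is usually worth checking the class balance. The sketch below assumes the data is available as CSV text with exactly those two column names; the three rows in the sample are invented for illustration.

```python
import csv
import io
from collections import Counter


def label_counts(csv_text: str, label_col: str = "spam_or_not") -> dict:
    """Count spam ("1") and non-spam ("0") rows in CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row[label_col] for row in reader)
    return {"spam": counts.get("1", 0), "not_spam": counts.get("0", 0)}


# Invented rows for illustration only.
sample = (
    "text,spam_or_not\n"
    '"Win a prize now, click here",1\n'
    '"Meeting moved to 3pm",0\n'
    '"Lunch tomorrow?",0\n'
)

print(label_counts(sample))
```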
The spear phishing market is experiencing robust growth, driven by the increasing sophistication of cyberattacks and the expanding digital landscape. While precise market sizing data is unavailable, considering the substantial investments in cybersecurity and the consistent rise in reported phishing incidents, a reasonable estimate for the 2025 market size would be in the range of $5-7 billion. This figure reflects the rising costs associated with data breaches, regulatory fines, and the increasing demand for advanced threat detection and response solutions. A Compound Annual Growth Rate (CAGR) of 12-15% over the forecast period (2025-2033) is plausible, considering ongoing technological advancements in spear phishing techniques and the corresponding need for robust countermeasures. Key drivers include the growth of remote work, increasing reliance on cloud services, and the evolving tactics employed by cybercriminals to target specific individuals and organizations. Trends point towards a greater focus on artificial intelligence (AI) and machine learning (ML) in threat detection, as well as a shift towards proactive security measures and employee training programs to mitigate the impact of spear phishing attacks. However, restraints include the ever-evolving nature of spear phishing techniques, the persistent skills gap in cybersecurity professionals, and the potential for false positives in automated detection systems. Segmentation within the market is likely to exist based on solution type (e.g., email security, security awareness training), deployment model (cloud, on-premises), and target industry (financial services, healthcare, government). Companies like BAE Systems, Check Point Software Technologies, Cisco Systems, and Proofpoint are key players actively innovating and competing within this dynamic market. The significant market expansion is further fueled by the high financial stakes involved in successful spear phishing campaigns. 
The impact of successful attacks, including data breaches, financial losses, and reputational damage, encourages organizations to invest heavily in comprehensive security solutions. The proliferation of sophisticated spear phishing techniques, such as personalized phishing emails and the use of social engineering, necessitates advanced detection and prevention technologies. The market's competitive landscape is characterized by both established cybersecurity vendors and emerging players who are constantly developing new solutions to combat the threat of spear phishing. The competitive dynamics will likely lead to further innovation and drive market growth in the coming years, enhancing the overall sophistication of spear phishing detection and prevention solutions.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We have curated 11 datasets. The Nazario and Nigerian Fraud datasets contain only phishing emails.
Cite this dataset: A. I. Champa, M. F. Rabbi, and M. F. Zibran, “Why phishing emails escape detection: A closer look at the failure points,” in 12th International Symposium on Digital Forensics and Security (ISDFS), 2024, pp. 1–6 (to appear).
or
@inproceedings{champa2024why, title={Why Phishing Emails Escape Detection: A Closer Look at the Failure Points}, author={Champa, Arifa I and Rabbi, Md Fazle and Zibran, Minhaz F}, booktitle={12th International Symposium on Digital Forensics and Security (ISDFS)}, pages = {1--6 (to appear)}, year={2024}}
Korkmaz, M., Kocyigit, E., Sahingoz, O. K., & Diri, B. (2022). A Hybrid Phishing Detection System Using Deep Learning-based URL and Content Analysis. Elektronika Ir Elektrotechnika, 28(5), 80-89. https://doi.org/10.5755/j02.eie.31197
About
All data is collected from PhishTank.com.
The website of this large-scale organization operates as follows:
- Users submit URLs to the URL pool to be queried.
- URLs in this list, which is open to all guests, are checked by users and classified as phishing or legitimate.
- A URL is tagged according to the number of votes it receives.
Thus, three categories of URLs are listed: Phishing, Legitimate and Unrated.
If a URL is inactive and no user has moderated it, it is tagged as UNRATED. These can be regarded as neutral elements in the URL list. URLs that were inspected and found to be harmful while live are labeled PHISHING, and they make up the phishing part of the dataset: URLs listed under Online and Valid Phish on the PhishTank website whose page content was available were added to the phishing section, with both the URL and the content. URLs that were inspected and found not to be harmful while live are labeled LEGITIMATE. These URLs, which are labelled Invalid in PhishTank and have content, form the legitimate part of the dataset. The legitimate part therefore consists of RISKY legitimate data: URLs that users submitted as suspicious and that were later labelled legitimate.
The dataset was created with 51,316 legitimate URLs and contents, 36,173 phishing URLs and contents, listed between 2006 and 2021.
The file "Dataset_Distribution.xlsx" shows the distribution of the data in the two categories by year and also records the date on which each URL was published. Approximately 91% of the phishing data was obtained in 2021. This rate confirms that the dataset contains data usable as zero-day attacks. The legitimate data, by contrast, is more evenly distributed over the years, which also gives an indication of how accurately the legitimate data is labelled.
URLs and contents were collected with a script written in Python. URLs whose content had a size of 0 KB were excluded from the dataset. In addition, each content was checked, and contents with an Error 403 code were removed from the dataset.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
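The two cleaning rules described above (dropping empty content and dropping Error 403 responses) could be expressed roughly as follows. The record layout with `url`, `content` and `status_code` keys is a hypothetical one chosen for this sketch; the authors' actual collection script and its data structures are not shown in the description.

```python
def filter_pages(pages: list) -> list:
    """Keep only pages with non-empty content that did not return HTTP 403."""
    kept = []
    for page in pages:
        if len(page["content"]) == 0:    # 0 KB content is excluded
            continue
        if page["status_code"] == 403:   # Error 403 pages are removed
            continue
        kept.append(page)
    return kept


# Invented records for illustration only.
pages = [
    {"url": "http://a.example/", "content": "<html>ok</html>", "status_code": 200},
    {"url": "http://b.example/", "content": "", "status_code": 200},
    {"url": "http://c.example/", "content": "<html>Forbidden</html>", "status_code": 403},
]

print([page["url"] for page in filter_pages(pages)])
```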
License information was derived automatically
This is a large corpus of 42,619 preprocessed text messages and emails sent by humans in 43 languages. is_spam=1 means spam and is_spam=0 means ham. 1040 rows of balanced data, consisting of casual conversations and scam emails in ≈10 languages, were manually collected and annotated by me, with some help from ChatGPT.
Some preprocessing algorithms
spam_assassin.js, followed by spam_assassin.py enron_spam.py
Data composition
Description
To make the text… See the full description on the dataset page: https://huggingface.co/datasets/FredZhang7/all-scam-spam.
https://academictorrents.com/nolicensespecified
A BitTorrent file to download data with the title 'Phishing corpus'
In December 2023, around 9.45 million phishing e-mails were detected worldwide, up from 5.59 million in September 2023. This figure has seen a continuous increase since January 2022. It is partially associated with the launch of ChatGPT in November 2022.