100+ datasets found

🕵️ Phishing Websites Data
kaggle.com
Updated Feb 24, 2025
Cite
Sairaj Adhav (2025). 🕵️ Phishing Websites Data [Dataset]. https://www.kaggle.com/datasets/sai10py/phishing-websites-data
Dataset updated
Feb 24, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Sairaj Adhav
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
Phishing Websites Dataset

Overview

This dataset is designed to aid in the analysis and detection of phishing websites. It contains various features that help distinguish between legitimate and phishing websites based on their structural, security, and behavioral attributes.

Dataset Information

Total Columns: 31 (30 Features + 1 Target)

Target Variable: Result (Indicates whether a website is phishing or legitimate)

Features Description

URL-Based Features

Prefix_Suffix – Checks if the URL contains a hyphen (-), which is commonly used in phishing domains.

double_slash_redirecting – Detects if the URL redirects using //, which may indicate a phishing attempt.

having_At_Symbol – Identifies the presence of @ in the URL, which can be used to deceive users.

Shortining_Service – Indicates whether the URL uses a shortening service (e.g., bit.ly, tinyurl).

URL_Length – Measures the length of the URL; phishing URLs tend to be longer.

having_IP_Address – Checks if an IP address is used in place of a domain name, which is suspicious.
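
As a rough illustration of the URL-based features above, the following sketch derives comparable values from a raw URL. It is not the dataset author's extraction code: the function name `url_features`, the shortener list `SHORTENER_HOSTS`, and the choice to check the hyphen in the host name only are assumptions, and the dataset's own encoding of these columns is not described here.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical shortener list; the dataset does not say which services it checks.
SHORTENER_HOSTS = {"bit.ly", "tinyurl.com"}


def url_features(url: str) -> dict:
    """Compute rough versions of the URL-based features for one URL."""
    host = urlsplit(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False
    return {
        # Assumption: the hyphen check is applied to the host name only.
        "Prefix_Suffix": "-" in host,
        # Any "//" after the scheme separator is treated as a redirect hint.
        "double_slash_redirecting": "//" in url.split("://", 1)[-1],
        "having_At_Symbol": "@" in url,
        "Shortining_Service": host in SHORTENER_HOSTS,
        "URL_Length": len(url),
        "having_IP_Address": host_is_ip,
    }


print(url_features("http://my-bank.example.com//redirect@evil"))
```

The keys reuse the dataset's column names (including their original spelling) so the output can be compared against the corresponding columns.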

Domain-Based Features

having_Sub_Domain – Evaluates the number of subdomains; phishing sites often have excessive subdomains.

SSLfinal_State – Indicates whether the website has a valid SSL certificate (secure connection).

Domain_registeration_length – Measures the duration of domain registration; phishing sites often have short lifespans.

age_of_domain – The age of the domain in days; older domains are usually more trustworthy.

DNSRecord – Checks if the domain has valid DNS records; phishing domains may lack these.

Webpage-Based Features

Favicon – Determines if the website uses an external favicon (which can be a sign of phishing).

port – Identifies if the site is using suspicious or non-standard ports.

HTTPS_token – Checks if "HTTPS" is included in the URL but is used deceptively.

Request_URL – Measures the percentage of external resources loaded from different domains.

URL_of_Anchor – Analyzes anchor tags (<a> links) and their trustworthiness.

Links_in_tags – Examines <meta>, <script>, and <link> tags for external links.

SFH (Server Form Handler) – Determines if form actions are handled suspiciously.

Submitting_to_email – Checks if forms submit data directly to an email instead of a web server.

Abnormal_URL – Identifies if the website’s URL structure is inconsistent with common patterns.

Redirect – Counts the number of redirects; phishing websites may have excessive redirects.

Behavior-Based Features

on_mouseover – Checks if the website changes content when hovered over (used in deceptive techniques).

RightClick – Detects if right-click functionality is disabled (phishing sites may disable it).

popUpWindow – Identifies the presence of pop-ups, which can be used to trick users.

Iframe – Checks if the website uses <iframe> tags, often used in phishing attacks.

Traffic & Search Engine Features

web_traffic – Measures the website’s Alexa ranking; phishing sites tend to have low traffic.

Page_Rank – Google PageRank score; phishing sites usually have a low PageRank.

Google_Index – Checks if the website is indexed by Google (phishing sites may not be indexed).

Links_pointing_to_page – Counts the number of backlinks pointing to the website.

Statistical_report – Uses external sources to verify if the website has been reported for phishing.

Target Variable

Result – The classification label (1: Legitimate, -1: Phishing)
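
Many classifiers and metrics expect the positive class to be 1, so the 1 / -1 convention above is often remapped. A minimal sketch, assuming phishing is treated as the positive class; the helper name `to_binary_target` is hypothetical.

```python
# Label convention as stated in the dataset description.
RESULT_LABELS = {1: "legitimate", -1: "phishing"}


def to_binary_target(result: int) -> int:
    """Map Result (1 / -1) to 1 for phishing and 0 for legitimate."""
    if result not in RESULT_LABELS:
        raise ValueError(f"unexpected Result value: {result!r}")
    return 1 if result == -1 else 0


print([to_binary_target(r) for r in (1, -1, -1)])
```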

Usage

This dataset is valuable for:
✅ Machine Learning Models – Developing classifiers for phishing detection.
✅ Cybersecurity Research – Understanding patterns in phishing attacks.
✅ Browser Security Extensions – Enhancing anti-phishing tools.
Phishing Websites Dataset
kaggle.com
zip
Updated Mar 23, 2024
+ more versions
Cite
Arnav Samal (2024). Phishing Websites Dataset [Dataset]. https://www.kaggle.com/datasets/arnavs19/phishing-websites-dataset
Available download formats: zip (0 bytes)
Dataset updated
Mar 23, 2024
Authors
Arnav Samal
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
These data consist of a collection of legitimate and phishing website instances. Each website is represented by a set of features that indicate whether the website is legitimate or not. The data can serve as input for a machine learning process.

Here, the two variants of the Phishing Dataset are presented.

Full variant - dataset_full.csv

Total number of instances: 88,647

Number of legitimate website instances (labeled as 0): 58,000

Number of phishing website instances (labeled as 1): 30,647

Total number of features: 111

Small variant - dataset_small.csv

Total number of instances: 58,645

Number of legitimate website instances (labeled as 0): 27,998

Number of phishing website instances (labeled as 1): 30,647

Total number of features: 111
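
The two variants differ mainly in class balance, which matters when choosing metrics or resampling. The sketch below only redoes the arithmetic from the counts listed above; the helper name `phishing_share` is hypothetical.

```python
# Instance counts copied from the dataset description.
variants = {
    "dataset_full.csv": {"legitimate": 58_000, "phishing": 30_647},
    "dataset_small.csv": {"legitimate": 27_998, "phishing": 30_647},
}


def phishing_share(counts: dict) -> float:
    """Fraction of instances labeled as phishing (label 1)."""
    return counts["phishing"] / sum(counts.values())


for name, counts in variants.items():
    total = sum(counts.values())
    print(f"{name}: {total} instances, {phishing_share(counts):.1%} phishing")
```

The totals agree with the stated instance counts (88,647 and 58,645): the full variant is roughly one-third phishing, while the small variant is close to balanced.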
Fraudulent Bank Websites, Phishing E-mails and Similar Scams | DATA.GOV.HK
data.gov.hk
Updated Oct 27, 2018
Cite
data.gov.hk (2018). Fraudulent Bank Websites, Phishing E-mails and Similar Scams | DATA.GOV.HK [Dataset]. https://data.gov.hk/en-data/dataset/hk-hkma-banksvf-fraudulent-bank-scams
Dataset updated
Oct 27, 2018
Dataset provided by
data.gov.hk
Description
This API provides information on press releases issued by authorized institutions, and similar press releases issued by the HKMA in the past, regarding fraudulent bank websites, phishing e-mails and similar scams.
Web page phishing detection
data.mendeley.com
Updated Jun 25, 2021
+ more versions
Cite
Abdelhakim Hannousse (2021). Web page phishing detection [Dataset]. http://doi.org/10.17632/c2gw7fy2j4.3
Unique identifier
https://doi.org/10.17632/c2gw7fy2j4.3
Dataset updated
Jun 25, 2021
Authors
Abdelhakim Hannousse
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
The provided dataset includes 11,430 URLs with 87 extracted features. The dataset is designed to be used as a benchmark for machine-learning-based phishing detection systems. Features are from three different classes: 56 extracted from the structure and syntax of URLs, 24 extracted from the content of their corresponding pages, and 7 extracted by querying external services. The dataset is balanced: it contains exactly 50% phishing and 50% legitimate URLs. Along with the dataset, we provide the Python scripts used to extract the features, for potential replication or extension. The datasets were constructed in May 2020.

dataset_A: contains a list of URLs together with their DOM tree objects, which can be used for replication and for experimenting with new URL- and content-based features, overcoming the short lifetime of phishing web pages.

dataset_B: contains the extracted feature values, which can be used directly as input to classifiers for examination. Note that the data in this dataset are indexed by URL, so the index needs to be removed before experimentation.
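
The note about removing the URL index can be sketched as follows. The column names (`url`, `status`, `length_url`, `nb_dots`) and the two sample rows are made up for illustration; check the real file's header before reusing this.

```python
import csv
import io

# The three feature classes add up to the stated 87 features.
assert 56 + 24 + 7 == 87

# Tiny made-up stand-in for dataset_B; column names are hypothetical.
SAMPLE = (
    "url,length_url,nb_dots,status\n"
    "http://a.example/x,18,1,legitimate\n"
    "http://b.example/y,18,1,phishing\n"
)


def split_features(csv_text: str, index_col: str = "url", label_col: str = "status"):
    """Return (features, labels), dropping the URL index column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    features = [
        {k: v for k, v in row.items() if k not in (index_col, label_col)}
        for row in rows
    ]
    labels = [row[label_col] for row in rows]
    return features, labels


X, y = split_features(SAMPLE)
print(X[0], y)
```

Leaving the URL in would let a model memorize individual rows instead of learning from the extracted features, which is presumably why the description asks for it to be removed.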
data-phishing-detection
huggingface.co
Updated Oct 23, 2024
Cite
Reva (2024). data-phishing-detection [Dataset]. https://huggingface.co/datasets/RevaHQ/data-phishing-detection
Dataset updated
Oct 23, 2024
Dataset authored and provided by
Reva
Description
data-phishing-detection

A dataset to test methods for detecting phishing emails. The file data.parquet contains the dataset: 400 emails, of which 200 are synthetic phishing attempts and 200 are synthetic regular emails.

Schema

input - an email, synthesized by an LLM, that is either a phishing attempt or a regular email.
output - 'Yes' if the email is a phishing attempt, 'No' otherwise.
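
With this schema, scoring a detector amounts to comparing its 'Yes'/'No' answer with the output field. A minimal sketch: the two rows and the `keyword_baseline` classifier are invented for illustration and are not taken from data.parquet.

```python
def accuracy(examples, classify) -> float:
    """Share of examples where classify(input) equals the expected output."""
    correct = sum(1 for ex in examples if classify(ex["input"]) == ex["output"])
    return correct / len(examples)


# Made-up rows that follow the input/output schema.
examples = [
    {"input": "Verify your account now at http://example.test/login", "output": "Yes"},
    {"input": "Lunch at noon tomorrow?", "output": "No"},
]


def keyword_baseline(email: str) -> str:
    """Toy classifier: flags one suspicious phrase, nothing more."""
    return "Yes" if "verify your account" in email.lower() else "No"


print(accuracy(examples, keyword_baseline))
```

Because the dataset is evenly split between the two classes, plain accuracy is a reasonable first metric; any classifier, including an LLM call, can be passed in place of `keyword_baseline`.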

Prompt

The prompt.md file contains a prompt that can be used with an LLM as a starting… See the full description on the dataset page: https://huggingface.co/datasets/RevaHQ/data-phishing-detection.
Textual Data of Phishing Scams Targeting Academia
openicpsr.org
delimited
Updated Apr 30, 2024
Cite
Ethan Morrow (2024). Textual Data of Phishing Scams Targeting Academia [Dataset]. http://doi.org/10.3886/E201721V1
Available download formats: delimited
Unique identifier
https://doi.org/10.3886/E201721V1
Dataset updated
Apr 30, 2024
Dataset provided by
University of Illinois at Urbana-Champaign
Authors
Ethan Morrow
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
License information was derived automatically
Description
A partial dataset and document-term matrix of phishing emails targeting an institution of higher education and an associated script used for data analysis.
Outcomes of successful phishing attacks in companies worldwide 2021-2023
statista.com
ai-chatbox.pro
Updated Mar 10, 2025
Cite
Statista (2025). Outcomes of successful phishing attacks in companies worldwide 2021-2023 [Dataset]. https://www.statista.com/statistics/1350723/consequences-phishing-attacks/
Dataset updated
Mar 10, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Area covered
Worldwide
Description
Surveys of working adults and IT security professionals worldwide conducted in 2021 and 2023 found that the share of organizations experiencing severe consequences due to a successful cyber attack had declined. In 2023, the share of enterprises experiencing a breach of customer or client data was 29 percent, down from 44 percent in 2022. Ransomware infections that occurred through e-mail were common for 32 percent of the respondents in 2023. Cases of a credential or account compromise occurred in 27 percent of the organizations in 2023, a decrease of 25 percent compared to the year prior.
Phishing website dataset
data.niaid.nih.gov
zenodo.org
Updated Jun 10, 2021
Cite
van Dooremaal, Bram (2021). Phishing website dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4922597
Dataset updated
Jun 10, 2021
Dataset provided by
Zannone, Nicola
Burda, Pavlo
van Dooremaal, Bram
Allodi, Luca
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
The dataset comprises phishing and legitimate web pages, which have been used for experiments on early phishing detection.

Detailed information on the dataset and data collection is available at

Bram van Dooremaal, Pavlo Burda, Luca Allodi, and Nicola Zannone. 2021. Combining Text and Visual Features to Improve the Identification of Cloned Webpages for Early Phishing Detection. In ARES '21: Proceedings of the 16th International Conference on Availability, Reliability and Security. ACM.
Phishing Simulation Report
datainsightsmarket.com
doc, pdf, ppt
Updated Apr 29, 2025
Cite
Data Insights Market (2025). Phishing Simulation Report [Dataset]. https://www.datainsightsmarket.com/reports/phishing-simulation-1442865
Available download formats: pdf, doc, ppt
Dataset updated
Apr 29, 2025
Dataset authored and provided by
Data Insights Market
License
https://www.datainsightsmarket.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The phishing simulation market is experiencing robust growth, driven by the escalating sophistication of phishing attacks and the increasing regulatory pressure on organizations to enhance their cybersecurity posture. The market, currently valued at approximately $1.5 billion in 2025 (estimated based on typical market sizes for cybersecurity segments with similar growth rates), is projected to experience a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033. This expansion is fueled by several key factors. Firstly, the rising frequency and success rate of phishing campaigns targeting both large enterprises and SMEs necessitate proactive security measures like simulation training. Secondly, evolving attack vectors and techniques demand continuous adaptation and improvement in security awareness programs, creating a sustained demand for advanced phishing simulation solutions. Thirdly, stringent data privacy regulations like GDPR and CCPA are imposing significant penalties for data breaches resulting from successful phishing attacks, motivating organizations to invest heavily in preventative measures including simulation-based training. The market segmentation reveals a significant share held by software-based solutions, owing to their scalability, ease of deployment, and cost-effectiveness. However, the service segment is also experiencing strong growth due to the increasing need for expert guidance and managed services in designing and implementing effective phishing simulation programs. Geographically, North America currently dominates the market, followed by Europe, reflecting the high level of cybersecurity awareness and regulatory compliance in these regions. However, the Asia-Pacific region is expected to exhibit the highest growth rate over the forecast period, driven by increasing digital adoption and rising awareness of cybersecurity threats in developing economies. 
While the market faces certain restraints, such as the need for specialized expertise and the potential for high implementation costs, the overall growth trajectory remains positive, driven by the overwhelming need to combat the ever-evolving threat landscape of phishing attacks.
Phishing Attack Dataset
ieee-dataport.org
Updated May 3, 2025
Cite
Emin Kugu (2025). Phishing Attack Dataset [Dataset]. https://ieee-dataport.org/documents/phishing-attack-dataset
Dataset updated
May 3, 2025
Authors
Emin Kugu
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
The scenarios tested were run on small_dataset. The most successful configuration identified by the analysis on small_dataset was then applied to big_dataset.
PhiUSIIL Phishing URL Dataset
data.mendeley.com
Updated Nov 15, 2023
+ more versions
Cite
Arvind Prasad (2023). PhiUSIIL Phishing URL Dataset [Dataset]. http://doi.org/10.17632/shwpxscxy2.2
Unique identifier
https://doi.org/10.17632/shwpxscxy2.2
Dataset updated
Nov 15, 2023
Authors
Arvind Prasad
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
PhiUSIIL Phishing URL Dataset is a substantial dataset comprising 134,850 legitimate and 100,945 phishing URLs. Most of the URLs we analyzed while constructing the dataset are the latest URLs. Features are extracted from the source code of the webpage and URL. Features such as CharContinuationRate, URLTitleMatchScore, URLCharProb, and TLDLegitimateProb are derived from existing features.

Citation: Prasad, A., & Chandra, S. (2023). PhiUSIIL: A diverse security profile empowered phishing URL detection framework based on similarity index and incremental learning. Computers & Security, 103545. doi: https://doi.org/10.1016/j.cose.2023.103545

A Dataset of Information (DNS, IP, WHOIS/RDAP, TLS, GeoIP) for a Large...

zenodo.org

json

Updated Dec 10, 2024

Cite

Radek Hranický; Adam Horák; Ondřej Ondryáš (2024). A Dataset of Information (DNS, IP, WHOIS/RDAP, TLS, GeoIP) for a Large Corpus of Benign, Phishing, and Malware Domain Names 2024 [Dataset]. http://doi.org/10.5281/zenodo.13330074

Available download formats: json

Unique identifier

https://doi.org/10.5281/zenodo.13330074

Dataset updated

Dec 10, 2024

Dataset provided by

Zenodo

Authors

Radek Hranický; Adam Horák; Ondřej Ondryáš

License

Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically

Time period covered

Aug 16, 2024

Description

The dataset contains DNS records, IP-related features, WHOIS/RDAP information, information from TLS handshakes and certificates, and GeoIP information for 368,956 benign domains from Cisco Umbrella, 461,338 benign domains from the actual CESNET network traffic, 164,425 phishing domains from PhishTank and OpenPhish services, and 100,809 malware domains from various sources like ThreatFox, The Firebog, MISP threat intelligence platform, and other sources. The ground truth for the phishing dataset was double-checked with the VirusTotal (VT) service. Domain names not considered malicious by VT have been removed from phishing and malware datasets. Similarly, benign domain names that were considered risky by VT have been removed from the benign datasets. The data was collected between March 2023 and July 2024. The final assessment of the data was conducted in August 2024.

The dataset is useful for cybersecurity research, e.g. statistical analysis of domain data or feature extraction for training machine learning-based classifiers, e.g. for phishing and malware website detection.

Data Files

The data is located in the following individual files:
- benign_umbrella.json - data for 368,956 benign domains from Cisco Umbrella,
- benign_cesnet.json - data for 461,338 benign domains from the CESNET network,
- phishing.json - data for 164,425 phishing domains, and
- malware.json - data for 100,809 malware domains.

Data Structure

Each file contains a JSON array of records generated using mongoexport. The following table documents the structure of a record. Please note that:

- some fields may be missing (they should be interpreted as nulls),
- extra fields may be present (they should be ignored).

Field name	Field type	Nullable	Description
domain_name	String	No	The evaluated domain name
url	String	No	The source URL for the domain name
evaluated_on	Date	No	Date of last collection attempt
source	String	No	An identifier of the source
sourced_on	Date	No	Date of ingestion of the domain name
dns	Object	Yes	Data from DNS scan
rdap	Object	Yes	Data from RDAP or WHOIS
tls	Object	Yes	Data from TLS handshake
ip_data	Array of Objects	Yes	Array of data objects capturing the IP addresses related to the domain name
DNS data (dns field)
A	Array of Strings	No	Array of IPv4 addresses
AAAA	Array of Strings	No	Array of IPv6 addresses
TXT	Array of Strings	No	Array of raw TXT values
CNAME	Object	No	The CNAME target and related IPs
MX	Array of Objects	No	Array of objects with the MX target hostname, priority and related IPs
NS	Array of Objects	No	Array of objects with the NS target hostname and related IPs
SOA	Object	No	All the SOA fields, present if found at the target domain name
zone_SOA	Object	No	The SOA fields of the target’s zone (closest point of delegation), present if found and not a record in the target domain directly
dnssec	Object	No	Flags describing the DNSSEC validation result for each record type
ttls	Object	No	The TTL values for each record type
remarks	Object	No	The zone domain name and DNSSEC flags
RDAP data (rdap field)
copyright_notice	String	No	RDAP/WHOIS data usage copyright notice
dnssec	Bool	No	DNSSEC presence flag
entitites	Object	No	An object with various arrays representing the found related entity types (e.g. abuse, admin, registrant). The arrays contain objects describing the individual entities.
expiration_date	Date	Yes	The current date of expiration
handle	String	No	RDAP handle
last_changed_date	Date	Yes	The date when the domain was last changed
name	String	No	The target domain name for which the data in this object are stored
nameservers	Array of Strings	No	Nameserver hostnames provided by RDAP or WHOIS
registration_date	Date	Yes	First registration date
status	Array of Strings
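
The two rules stated before the table (missing fields count as nulls, extra fields are ignored) can be applied with a small normalization step. This is a sketch under assumptions: the sample record is invented, only the top-level fields from the table are handled, and the helper name `normalize_record` is hypothetical.

```python
import json

# Top-level field names taken from the table above.
TOP_LEVEL_FIELDS = (
    "domain_name", "url", "evaluated_on", "source", "sourced_on",
    "dns", "rdap", "tls", "ip_data",
)


def normalize_record(record: dict) -> dict:
    """Keep only documented fields; missing ones become None (null)."""
    return {field: record.get(field) for field in TOP_LEVEL_FIELDS}


# Invented sample: "tls" is missing and "unexpected" is an extra field.
raw = (
    '[{"domain_name": "example.test", "url": "http://example.test/", '
    '"source": "hypothetical-feed", "dns": {"A": ["192.0.2.1"]}, '
    '"unexpected": 123}]'
)

records = [normalize_record(r) for r in json.loads(raw)]
print(records[0])
```

Parsing a whole file with `json.loads` is only practical for small samples; for files with hundreds of thousands of records, a streaming parser may be a better fit.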

Data from: Spam email Dataset
kaggle.com
Updated Sep 1, 2023
Cite
_w1998 (2023). Spam email Dataset [Dataset]. https://www.kaggle.com/datasets/jackksoncsie/spam-email-dataset
Dataset updated
Sep 1, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
_w1998
License
http://www.gnu.org/licenses/lgpl-3.0.html
Description
Dataset Name: Spam Email Dataset

Description: This dataset contains a collection of email text messages, labeled as either spam or not spam. Each email message is associated with a binary label, where "1" indicates that the email is spam, and "0" indicates that it is not spam. The dataset is intended for use in training and evaluating spam email classification models.

Columns:

text (Text): This column contains the text content of the email messages. It includes the body of the emails along with any associated subject lines or headers.

spam_or_not (Binary): This column contains binary labels to indicate whether an email is spam or not. "1" represents spam, while "0" represents not spam.
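
A quick check of the label distribution is a sensible first step with a two-column file like this. The sketch below uses an in-memory sample with invented rows; only the column names `text` and `spam_or_not` come from the description, and `label_counts` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Invented rows using the documented column names.
SAMPLE = (
    "text,spam_or_not\n"
    '"Subject: win a prize now",1\n'
    '"Subject: meeting notes attached",0\n'
    '"Subject: cheap loans",1\n'
)


def label_counts(csv_text: str) -> Counter:
    """Count how many rows carry each binary label."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(int(row["spam_or_not"]) for row in reader)


print(label_counts(SAMPLE))
```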

Usage: This dataset can be used for various Natural Language Processing (NLP) tasks, such as text classification and spam detection. Researchers and data scientists can train and evaluate machine learning models using this dataset to build effective spam email filters.
Spear Phishing Report
datainsightsmarket.com
doc, pdf, ppt
Updated Jun 6, 2025
Cite
Data Insights Market (2025). Spear Phishing Report [Dataset]. https://www.datainsightsmarket.com/reports/spear-phishing-1951598
Available download formats: doc, ppt, pdf
Dataset updated
Jun 6, 2025
Dataset authored and provided by
Data Insights Market
License
https://www.datainsightsmarket.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The spear phishing market is experiencing robust growth, driven by the increasing sophistication of cyberattacks and the expanding digital landscape. While precise market sizing data is unavailable, considering the substantial investments in cybersecurity and the consistent rise in reported phishing incidents, a reasonable estimate for the 2025 market size would be in the range of $5-7 billion. This figure reflects the rising costs associated with data breaches, regulatory fines, and the increasing demand for advanced threat detection and response solutions. A Compound Annual Growth Rate (CAGR) of 12-15% over the forecast period (2025-2033) is plausible, considering ongoing technological advancements in spear phishing techniques and the corresponding need for robust countermeasures. Key drivers include the growth of remote work, increasing reliance on cloud services, and the evolving tactics employed by cybercriminals to target specific individuals and organizations. Trends point towards a greater focus on artificial intelligence (AI) and machine learning (ML) in threat detection, as well as a shift towards proactive security measures and employee training programs to mitigate the impact of spear phishing attacks. However, restraints include the ever-evolving nature of spear phishing techniques, the persistent skills gap in cybersecurity professionals, and the potential for false positives in automated detection systems. Segmentation within the market is likely to exist based on solution type (e.g., email security, security awareness training), deployment model (cloud, on-premises), and target industry (financial services, healthcare, government). Companies like BAE Systems, Check Point Software Technologies, Cisco Systems, and Proofpoint are key players actively innovating and competing within this dynamic market. The significant market expansion is further fueled by the high financial stakes involved in successful spear phishing campaigns. 
The impact of successful attacks, including data breaches, financial losses, and reputational damage, encourages organizations to invest heavily in comprehensive security solutions. The proliferation of sophisticated spear phishing techniques, such as personalized phishing emails and the use of social engineering, necessitates advanced detection and prevention technologies. The market's competitive landscape is characterized by both established cybersecurity vendors and emerging players who are constantly developing new solutions to combat the threat of spear phishing. The competitive dynamics will likely lead to further innovation and drive market growth in the coming years, enhancing the overall sophistication of spear phishing detection and prevention solutions.
Data from: phishing-emails
huggingface.co
Cite
Zion van Wyk, phishing-emails [Dataset]. https://huggingface.co/datasets/zionia/phishing-emails
Authors
Zion van Wyk
Description
Dataset Card for "phishing-emails"

More Information needed
Phishing Email: 11 Curated Datasets
figshare.com
bin
Updated May 2, 2024
Cite
Anonymized anonym (2024). Phishing Email: 11 Curated Datasets [Dataset]. http://doi.org/10.6084/m9.figshare.24952503.v1
Available download formats: bin
Unique identifier
https://doi.org/10.6084/m9.figshare.24952503.v1
Dataset updated
May 2, 2024
Dataset provided by
figshare
Authors
Anonymized anonym
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
We have curated 11 datasets. The Nazario and Nigerian Fraud datasets contain only phishing emails.

Cite this dataset:

A. I. Champa, M. F. Rabbi, and M. F. Zibran, “Why phishing emails escape detection: A closer look at the failure points,” in 12th International Symposium on Digital Forensics and Security (ISDFS), 2024, pp. 1–6 (to appear).

or

@inproceedings{champa2024why,
  title={Why Phishing Emails Escape Detection: A Closer Look at the Failure Points},
  author={Champa, Arifa I and Rabbi, Md Fazle and Zibran, Minhaz F},
  booktitle={12th International Symposium on Digital Forensics and Security (ISDFS)},
  pages={1--6 (to appear)},
  year={2024}
}
High-Risk URL and Content Dataset
kaggle.com
Updated Feb 9, 2024
Cite
mehmet korkmaz (2024). High-Risk URL and Content Dataset [Dataset]. https://www.kaggle.com/datasets/mehmetkorkmaz/high-risk-url-and-content-dataset
Dataset updated
Feb 9, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
mehmet korkmaz
Description
Cite from:

Korkmaz, M., Kocyigit, E., Sahingoz, O. K., & Diri, B. (2022). A Hybrid Phishing Detection System Using Deep Learning-based URL and Content Analysis. Elektronika Ir Elektrotechnika, 28(5), 80-89. https://doi.org/10.5755/j02.eie.31197

About

All data was collected from PhishTank.com.

This large-scale service operates as follows:
- Users submit URLs to a pool of URLs to be checked.
- URLs in this list, which is open to all guests, are checked by users and classified as phishing or legitimate.
- A URL is tagged according to the number of votes it receives.

Thus, three categories of URLs are listed: Phishing, Legitimate and Unrated.

If a URL is inactive and no user has moderated it, it is tagged as UNRATED. These can be treated as neutral elements in the URL list. URLs that were inspected and found to be harmful while they were live are labeled PHISHING, and they make up the phishing part of the dataset: URLs listed under Online and Valid Phish on the PhishTank website whose page content was available were added to the phishing section, together with that content. URLs that were inspected and found not to be harmful while they were live are labeled LEGITIMATE. These URLs, which are labelled Invalid in PhishTank and have content, form the legitimate part of the dataset. The legitimate part therefore consists of RISKY legitimate data: URLs that users submitted as suspicious and that were later labelled legitimate.

The dataset was created with 51,316 legitimate URLs and contents, 36,173 phishing URLs and contents, listed between 2006 and 2021.

The file "Dataset_Distribution.xlsx" shows the distribution of the data in the two categories by year, and also gives the date on which each URL was published. Approximately 91% of the phishing data was obtained in 2021, which confirms that the dataset contains data used in zero-day attacks. The legitimate data, by contrast, is more evenly distributed over time, which in turn gives an indication of how accurately the legitimate data is labelled.

URLs and contents were collected with a script written in Python. URLs whose content size was 0 KB were excluded from the dataset. In addition, each page's content was checked, and content with a 403 error code was removed from the dataset.
all-scam-spam
huggingface.co
Updated Sep 2, 2002
Cite
Fred Zhang (2002). all-scam-spam [Dataset]. https://huggingface.co/datasets/FredZhang7/all-scam-spam
Dataset updated
Sep 2, 2002
Authors
Fred Zhang
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
This is a large corpus of 42,619 preprocessed text messages and emails sent by humans in 43 languages. is_spam=1 means spam and is_spam=0 means ham. 1040 rows of balanced data, consisting of casual conversations and scam emails in ≈10 languages, were manually collected and annotated by me, with some help from ChatGPT.

Some preprocessing algorithms

spam_assassin.js, followed by spam_assassin.py enron_spam.py

Data composition Description

To make the text… See the full description on the dataset page: https://huggingface.co/datasets/FredZhang7/all-scam-spam.
Phishing corpus
academictorrents.com
bittorrent
Updated Jan 2, 2019
Cite
Vit Listik (2019). Phishing corpus [Dataset]. https://academictorrents.com/details/a77cda9a9d89a60dbdfbe581adf6e2df9197995a
Available download formats: bittorrent (37482335)
Dataset updated
Jan 2, 2019
Dataset authored and provided by
Vit Listik
License
https://academictorrents.com/nolicensespecified
Description
A BitTorrent file to download data with the title 'Phishing corpus'
Global number of e-mail phishing attacks 2022-2023
statista.com
ai-chatbox.pro
Updated Sep 23, 2024
Cite
Statista (2024). Global number of e-mail phishing attacks 2022-2023 [Dataset]. https://www.statista.com/statistics/1493550/phishing-attacks-global-number/
Dataset updated
Sep 23, 2024
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
Jan 2022 - Dec 2023
Area covered
Worldwide
Description
In December 2023, around 9.45 million phishing e-mails were detected worldwide, up from 5.59 million in September 2023. This figure has seen a continuous increase since January 2022. It is partially associated with the launch of ChatGPT in November 2022.
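
For scale, the change between the two quoted months works out to an increase of roughly 69%. The short calculation below uses only the two figures stated above; the variable names are illustrative.

```python
# Figures quoted in the description, in millions of detected phishing e-mails.
september_2023 = 5.59
december_2023 = 9.45

# Relative increase from September 2023 to December 2023.
relative_increase = (december_2023 - september_2023) / september_2023
print(f"{relative_increase:.1%}")
```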

Cite

Sairaj Adhav (2025). 🕵️ Phishing Websites Data [Dataset]. https://www.kaggle.com/datasets/sai10py/phishing-websites-data

🕵️ Phishing Websites Data

A useful dataset for analyzing and detecting phishing websites

311 scholarly articles cite this dataset (View in Google Scholar)

Dataset updated

Feb 24, 2025

Dataset provided by

Kaggle (http://kaggle.com/)

Authors

Sairaj Adhav

License

Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically

Description

Phishing Websites Dataset

Overview

This dataset is designed to aid in the analysis and detection of phishing websites. It contains various features that help distinguish between legitimate and phishing websites based on their structural, security, and behavioral attributes.

Dataset Information

Total Columns: 31 (30 Features + 1 Target)
Target Variable: Result (Indicates whether a website is phishing or legitimate)

Features Description

URL-Based Features

Prefix_Suffix – Checks if the URL contains a hyphen (-), which is commonly used in phishing domains.
double_slash_redirecting – Detects if the URL redirects using //, which may indicate a phishing attempt.
having_At_Symbol – Identifies the presence of @ in the URL, which can be used to deceive users.
Shortining_Service – Indicates whether the URL uses a shortening service (e.g., bit.ly, tinyurl).
URL_Length – Measures the length of the URL; phishing URLs tend to be longer.
having_IP_Address – Checks if an IP address is used in place of a domain name, which is suspicious.
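
A rough sketch of how the URL-based checks above might be computed from a raw URL is given below. The dataset description does not state the exact rules, thresholds, or value encoding used to produce these columns, so the function returns plain booleans and an integer length; the `SHORTENER_HOSTS` set is an illustrative, incomplete assumption, and testing the hyphen against the host name only is also an assumption.

```python
import ipaddress
from urllib.parse import urlsplit

# Illustrative, incomplete set of shortening-service hosts (an assumption).
SHORTENER_HOSTS = {"bit.ly", "tinyurl.com"}


def host_is_ip(host: str) -> bool:
    """Return True if the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def url_features(url: str) -> dict:
    """Compute raw indicators named after the dataset's URL-based columns."""
    host = urlsplit(url).hostname or ""
    return {
        # Hyphen in the host name (assumed scope; the description says "URL").
        "Prefix_Suffix": "-" in host,
        # A "//" anywhere after the scheme separator.
        "double_slash_redirecting": "//" in url.split("://", 1)[-1],
        "having_At_Symbol": "@" in url,
        "Shortining_Service": host in SHORTENER_HOSTS,
        "URL_Length": len(url),
        "having_IP_Address": host_is_ip(host),
    }


example = url_features("http://192.168.0.1/login")
```

Mapping these raw values onto the encoded values stored in the dataset would need the dataset's own rules, which are not given here.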

Domain-Based Features

having_Sub_Domain – Evaluates the number of subdomains; phishing sites often have excessive subdomains.
SSLfinal_State – Indicates whether the website has a valid SSL certificate (secure connection).
Domain_registeration_length – Measures the duration of domain registration; phishing sites often have short lifespans.
age_of_domain – The age of the domain in days; older domains are usually more trustworthy.
DNSRecord – Checks if the domain has valid DNS records; phishing domains may lack these.

Webpage-Based Features

Favicon – Determines if the website uses an external favicon (which can be a sign of phishing).
port – Identifies if the site is using suspicious or non-standard ports.
HTTPS_token – Checks if "HTTPS" is included in the URL but is used deceptively.
Request_URL – Measures the percentage of external resources loaded from different domains.
URL_of_Anchor – Analyzes anchor tags (<a> links) and their trustworthiness.
Links_in_tags – Examines <meta>, <script>, and <link> tags for external links.
SFH (Server Form Handler) – Determines if form actions are handled suspiciously.
Submitting_to_email – Checks if forms submit data directly to an email instead of a web server.
Abnormal_URL – Identifies if the website’s URL structure is inconsistent with common patterns.
Redirect – Counts the number of redirects; phishing websites may have excessive redirects.
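
To illustrate the idea behind Request_URL, the sketch below estimates the percentage of embedded resources that are loaded from a host other than the page's own. It is an assumption-laden approximation: the set of tags inspected, the host comparison, and the function and class names are chosen for illustration and are not taken from the dataset's documentation.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class ResourceCollector(HTMLParser):
    """Collects the src attribute of tags that load embedded resources."""

    # Assumed set of resource-loading tags.
    RESOURCE_TAGS = {"img", "script", "audio", "video", "embed", "source"}

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag in self.RESOURCE_TAGS:
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def external_resource_percentage(page_url: str, html: str) -> float:
    """Percentage of collected resources whose host differs from the page host."""
    collector = ResourceCollector()
    collector.feed(html)
    if not collector.sources:
        return 0.0
    page_host = urlsplit(page_url).hostname
    external = sum(
        1
        for src in collector.sources
        # Resolve relative paths against the page URL before comparing hosts.
        if urlsplit(urljoin(page_url, src)).hostname != page_host
    )
    return 100.0 * external / len(collector.sources)


sample_html = '<img src="/logo.png"><script src="https://cdn.example.net/a.js"></script>'
share = external_resource_percentage("https://example.com/index.html", sample_html)
```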

Behavior-Based Features

on_mouseover – Checks if the website changes content when hovered over (used in deceptive techniques).
RightClick – Detects if right-click functionality is disabled (phishing sites may disable it).
popUpWindow – Identifies the presence of pop-ups, which can be used to trick users.
Iframe – Checks if the website uses <iframe> tags, often used in phishing attacks.

Traffic & Search Engine Features

web_traffic – Measures the website’s Alexa ranking; phishing sites tend to have low traffic.
Page_Rank – Google PageRank score; phishing sites usually have a low PageRank.
Google_Index – Checks if the website is indexed by Google (phishing sites may not be indexed).
Links_pointing_to_page – Counts the number of backlinks pointing to the website.
Statistical_report – Uses external sources to verify if the website has been reported for phishing.

Target Variable

Result – The classification label (1: Legitimate, -1: Phishing)
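
Because many classifiers and metrics expect the positive class to be labelled 1, a small remapping step can be convenient. The sketch below converts the Result encoding stated above (1: Legitimate, -1: Phishing) into 1 for phishing and 0 for legitimate; the function name and the choice of phishing as the positive class are assumptions for illustration.

```python
from collections import Counter


def to_binary_label(result: int) -> int:
    """Map Result (-1 phishing, 1 legitimate) to 1 = phishing, 0 = legitimate."""
    if result == -1:
        return 1
    if result == 1:
        return 0
    raise ValueError(f"unexpected Result value: {result!r}")


# Toy Result values, not taken from the dataset.
results = [1, -1, -1, 1, 1]
binary = [to_binary_label(r) for r in results]
counts = Counter(binary)
```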

Usage

This dataset is valuable for:
✅ Machine Learning Models – Developing classifiers for phishing detection.
✅ Cybersecurity Research – Understanding patterns in phishing attacks.
✅ Browser Security Extensions – Enhancing anti-phishing tools.
