During the fourth quarter of 2024, nearly 23 percent of phishing attacks worldwide targeted social media. Web-based software services and webmail were targeted by over 23 percent of registered phishing attacks. Furthermore, financial institutions accounted for 12 percent of attacks.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Explore key phishing email stats, including attack frequency, success rates, target industries, user vulnerability, and cybersecurity impact!
In the fourth quarter of 2024, over 989,000 unique phishing attacks were detected worldwide, a slight increase from the preceding quarter. The largest jump by far in the number of unique phishing sites occurred between the second and third quarters of 2020, from nearly 147,000 to approximately 572,000. This figure is based on the number of unique base URLs of the phishing sites.
In 2023, users in Vietnam were most frequently targeted by phishing attacks. The phishing attack rate among internet users in the country was ***** percent. In the examined year, Peru ranked second, with an attack rate of nearly ** percent, while Taiwan followed with ***** percent.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset is designed to aid in the analysis and detection of phishing websites. It contains various features that help distinguish between legitimate and phishing websites based on their structural, security, and behavioral attributes.
Result (indicates whether a website is phishing or legitimate)
- Prefix_Suffix: Checks if the URL contains a hyphen (-), which is commonly used in phishing domains.
- double_slash_redirecting: Detects if the URL redirects using //, which may indicate a phishing attempt.
- having_At_Symbol: Identifies the presence of @ in the URL, which can be used to deceive users.
- Shortining_Service: Indicates whether the URL uses a shortening service (e.g., bit.ly, tinyurl).
- URL_Length: Measures the length of the URL; phishing URLs tend to be longer.
- having_IP_Address: Checks if an IP address is used in place of a domain name, which is suspicious.
- having_Sub_Domain: Evaluates the number of subdomains; phishing sites often have excessive subdomains.
- SSLfinal_State: Indicates whether the website has a valid SSL certificate (secure connection).
- Domain_registeration_length: Measures the duration of domain registration; phishing sites often have short lifespans.
- age_of_domain: The age of the domain in days; older domains are usually more trustworthy.
- DNSRecord: Checks if the domain has valid DNS records; phishing domains may lack these.
- Favicon: Determines if the website uses an external favicon (which can be a sign of phishing).
- port: Identifies if the site is using suspicious or non-standard ports.
- HTTPS_token: Checks if "HTTPS" is included in the URL but is used deceptively.
- Request_URL: Measures the percentage of external resources loaded from different domains.
- URL_of_Anchor: Analyzes anchor tags (<a> links) and their trustworthiness.
- Links_in_tags: Examines <meta>, <script>, and <link> tags for external links.
- SFH (Server Form Handler): Determines if form actions are handled suspiciously.
- Submitting_to_email: Checks if forms submit data directly to an email instead of a web server.
- Abnormal_URL: Identifies if the website's URL structure is inconsistent with common patterns.
- Redirect: Counts the number of redirects; phishing websites may have excessive redirects.
- on_mouseover: Checks if the website changes content when hovered over (used in deceptive techniques).
- RightClick: Detects if right-click functionality is disabled (phishing sites may disable it).
- popUpWindow: Identifies the presence of pop-ups, which can be used to trick users.
- Iframe: Checks if the website uses <iframe> tags, often used in phishing attacks.
- web_traffic: Measures the website's Alexa ranking; phishing sites tend to have low traffic.
- Page_Rank: Google PageRank score; phishing sites usually have a low PageRank.
- Google_Index: Checks if the website is indexed by Google (phishing sites may not be indexed).
- Links_pointing_to_page: Counts the number of backlinks pointing to the website.
- Statistical_report: Uses external sources to verify if the website has been reported for phishing.
- Result: The classification label (1: Legitimate, -1: Phishing).
This dataset is valuable for:
- Machine Learning Models: Developing classifiers for phishing detection.
- Cybersecurity Research: Understanding patterns in phishing attacks.
- Browser Security Extensions: Enhancing anti-phishing tools.
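A few of the URL-based features listed above can be approximated directly from the URL string. The following is a minimal Python sketch, not the dataset's actual extraction code: the function name `url_features`, the `SHORTENERS` set, and the boolean encoding are assumptions made for illustration (the dataset itself may encode values differently, for example as -1/0/1).

```python
import ipaddress
from urllib.parse import urlparse

# Illustrative shortener hosts only; this set is an assumption, not from the dataset.
SHORTENERS = {"bit.ly", "tinyurl.com"}

def url_features(url: str) -> dict:
    """Approximate a handful of lexical URL features (hypothetical encoding)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    # having_IP_Address: the host parses as an IPv4/IPv6 literal.
    try:
        ipaddress.ip_address(host)
        has_ip = True
    except ValueError:
        has_ip = False
    return {
        "URL_Length": len(url),
        "having_At_Symbol": "@" in url,
        "Prefix_Suffix": "-" in host,
        # A "//" appearing after the scheme separator hints at an embedded redirect.
        "double_slash_redirecting": "//" in url.split("://", 1)[-1],
        "Shortining_Service": host in SHORTENERS,
        "having_IP_Address": has_ip,
    }

print(url_features("http://secure-login.example.com/a"))
```

Features such as SSLfinal_State, age_of_domain, or web_traffic need network lookups or third-party data and are deliberately left out of this sketch.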
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains a collection of legitimate and phishing websites, along with information on the target brands (brands.csv) being impersonated in the phishing attacks. The dataset includes a total of 10,395 websites, 5,244 of which are legitimate and 5,151 of which are phishing websites. The phishing websites impersonate a total of 86 different target brands.
For phishing datasets, the files can be downloaded in a zip file with a "phishing" prefix, while for legitimate websites, the files can be downloaded in a zip file with a "not-phishing" prefix.
In addition, the dataset includes features such as screenshots, text, CSS, and HTML structure for each website, as well as domain information (WHOIS data), IP information, and SSL information. Each website is labeled as either legitimate or phishing and includes additional metadata such as the date it was discovered, the target brand being impersonated, and any other relevant information.
The dataset has been curated for research purposes and can be used to analyze the effectiveness of phishing attacks, develop and evaluate anti-phishing solutions, and identify trends and patterns in phishing attacks. It is hoped that this dataset will contribute to the advancement of research in the field of cybersecurity and help improve our understanding of phishing attacks.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The StealthPhisher Phishing Attack Dataset, generated at the Cybersecurity Lab, GLA University, Mathura, is a large, diverse, and recent phishing attack dataset developed to address the evolving nature of phishing attacks. It comprises 336,749 records, including 160,943 legitimate URLs and 175,806 phishing URLs, collected from reliable sources such as PhishTank. Reflecting the most recent phishing tactics, this dataset serves as a valuable resource for training and evaluating AI-based phishing detection systems.
Key features include URL-based attributes (e.g., length, TLD type, IP presence), statistical metrics (e.g., Shannon Entropy, Kolmogorov Complexity, Fractal Dimension), and HTML/interaction-based features (e.g., popups, redirects, forms). These multidimensional attributes provide comprehensive insights into phishing behavior, enabling accurate and robust threat detection. Designed to capture real-world scenarios, the dataset equips AI models to recognize both traditional and emerging phishing strategies effectively.
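As an illustration of one of the statistical metrics named above, Shannon entropy over the characters of a URL can be computed as follows. This is a minimal sketch assuming character-level entropy in bits; the dataset's own definition (for example, which part of the URL is measured) is not stated here, and the function name `shannon_entropy` is hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits; returns 0.0 for an empty string."""
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    # H = -sum(p * log2(p)) over the observed character frequencies.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(shannon_entropy("http://example.com/login"))
```

Random-looking strings score higher than repetitive ones, which is why entropy is often used as a signal for machine-generated URLs.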
This dataset was generated as part of the research work presented in the article "StealthPhisher: A Defensive Framework against Phishing Attack using Hybrid Deep Learning and GenAI," published in Expert Systems with Applications (https://doi.org/10.1016/j.eswa.2025.130205). Researchers using this dataset in their research work are kindly requested to cite this article.
Surveys of working adults and IT security professionals worldwide conducted in 2021 and 2023 found that the share of organizations experiencing severe consequences due to a successful cyber attack had declined. In 2023, the share of enterprises experiencing a breach of customer or client data was 29 percent, down from 44 percent in 2022. Ransomware infections that occurred through e-mail were reported by 32 percent of the respondents in 2023. Cases of a credential or account compromise occurred in 27 percent of the organizations in 2023, a decrease of 25 percent compared to the year prior.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Phishing URL dataset contains exclusively 54,807 URLs identified as phishing, providing a focused resource for studying and combating malicious online activities. Meanwhile, the URL dataset comprises 450,176 URLs sourced from various platforms, including PhishTank, the Majestic Million, and other pertinent sources. Each URL in the dataset is categorized as either "phishing" or "legitimate." Among these URLs, 104,438 have been flagged as phishing URLs, indicating malicious intent, while the remaining 345,738 URLs are classified as legitimate, denoting non-malicious or benign activity. This extensive dataset, drawn from multiple reputable sources, serves as a crucial asset for cybersecurity researchers and practitioners, facilitating the development and validation of advanced techniques for effectively detecting and mitigating phishing attacks.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Phishing Message Dataset (1000 Samples)
This dataset comprises 1,000 phishing messages, categorized based on NLP-based deception techniques commonly used in social engineering attacks.
- Urgency: Messages that create a sense of immediate action.
- Authority: Messages impersonating trusted figures or organizations.
- Persuasion: Messages using manipulative language to convince the recipient.
Each record contains the following fields:
- text: The phishing message (email or SMS).
- category: The type of phishing attack (urgency, authority, persuasion).
- label: A classification label ("phishing") for machine learning tasks.
- Natural Language Processing (NLP): Analyze linguistic patterns in phishing messages.
- Cybersecurity Research: Identify deceptive techniques used in phishing attacks.
- Phishing Detection Models: Train AI models to classify and detect phishing messages.
- AI-driven Threat Analysis: Improve automated cybersecurity threat detection.
This dataset serves as a valuable resource for developing AI-powered solutions in cybersecurity and NLP-based phishing detection.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was used for training machine learning models to detect phishing attacks and for studying the explainability of these models. It was published in 2024. The dataset covers phishing and legitimate websites. Phishing samples have been collected from two sources, namely, PhishTank and Tranco, whereas legitimate samples were collected from Alexa. The dataset is balanced and contains 5,000 phishing and 5,000 legitimate samples, each described by 74 features extracted from the entire URL as well as from the Fully Qualified Domain Name, pathname, filename, and parameters. Of these features, 70 are numerical and four are binary. The target variable is also binary.
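The URL parts named above (Fully Qualified Domain Name, pathname, filename, parameters) can be separated with the Python standard library. The sketch below is an assumption about how such a split might look, not the dataset authors' extraction code; in particular, whether their "pathname" includes the filename is not stated, and the function name `url_components` is hypothetical.

```python
import posixpath
from urllib.parse import urlparse

def url_components(url: str) -> dict:
    """Split a URL into the parts from which per-component features could be derived."""
    parsed = urlparse(url)
    return {
        "url": url,
        "fqdn": parsed.hostname or "",
        "pathname": parsed.path,  # assumed to include the filename
        "filename": posixpath.basename(parsed.path),
        "parameters": parsed.query,
    }

parts = url_components("https://www.example.com/a/b/index.php?id=1&x=2")
# Example numerical features: the length of each component.
print({name + "_length": len(value) for name, value in parts.items()})
```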
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Overview
This dataset is a large, feature-rich collection of URLs designed for research and development in malicious URL detection. It contains a total of 651,000 samples, each enriched with detailed lexical, structural, statistical, and phishing-related attributes. The goal of the dataset is to support the development of machine learning models capable of identifying harmful URLs before they lead to security incidents such as phishing attacks, data theft, and malware infections. By combining raw URL properties with engineered features, the dataset offers a comprehensive foundation for both traditional and advanced cybersecurity classification models.
The dataset consists of 65 features, including URL length, character frequencies, entropy measures, domain and subdomain statistics, and multiple phishing-specific indicators. These features capture a wide range of behavioral and structural patterns commonly found in malicious URLs. The final column, label, assigns each entry to one of four categories: benign, defacement, phishing, or malware. This multi-class structure allows the dataset to be used not only for malicious vs. benign classification but also for more detailed threat type identification.
The class distribution contains 428,103 benign URLs, 96,457 defacement URLs, 93,920 phishing URLs, and 32,520 malware URLs. While the dataset is naturally imbalanced, it remains representative of real-world cyber environments where benign traffic far exceeds malicious activity. This realistic distribution makes the dataset valuable for evaluating model robustness and handling class imbalance through techniques such as sampling or weighted training. Overall, the dataset provides a solid and versatile benchmark for cybersecurity machine learning tasks.
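To make the weighted-training option concrete, the class counts above can be turned into per-class weights. The sketch below uses the common "balanced" heuristic, weight = total / (number of classes × class count); this choice is an assumption for illustration and not something the dataset prescribes, and the function name `balanced_class_weights` is hypothetical.

```python
# Class counts as stated in the dataset description.
counts = {
    "benign": 428_103,
    "defacement": 96_457,
    "phishing": 93_920,
    "malware": 32_520,
}

def balanced_class_weights(class_counts: dict) -> dict:
    """Weight each class inversely to its frequency (balanced heuristic)."""
    total = sum(class_counts.values())
    k = len(class_counts)
    return {label: total / (k * n) for label, n in class_counts.items()}

weights = balanced_class_weights(counts)
for label, w in weights.items():
    print(f"{label}: {w:.3f}")
```

With these weights the majority class (benign) is down-weighted below 1 and the rarest class (malware) is up-weighted, so each class contributes equally to a weighted loss.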
65 features
The final dataset contains 65 features, including both raw URL characteristics and a wide range of engineered attributes. These features cover lexical patterns, special-character counts, entropy measures, subdomain and path statistics, phishing-specific indicators, and various statistical ratios. Together, they provide a comprehensive representation of each URL, making the dataset suitable for building strong and reliable machine learning models for malicious URL detection.
Official statistics are produced impartially and free from political influence.
Between February 2024 and February 2025, nearly 21 percent of employees at global organizations stated they had experienced a QR code phishing attack. Additionally, over 21 percent of customers of managed service providers (MSPs) reported encountering such attacks.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The "Phishing Data" dataset is a comprehensive collection of information specifically curated for analyzing and understanding phishing attacks. Phishing attacks involve malicious attempts to deceive individuals or organizations into disclosing sensitive information such as passwords or credit card details. This dataset comprises 18 distinct features that offer valuable insights into the characteristics of phishing attempts. These features include the URL of the website being analyzed, the length of the URL, the use of URL shortening services, the presence of the "@" symbol, the presence of redirection using "//", the presence of prefixes or suffixes in the URL, the number of subdomains, the usage of secure connection protocols (HTTPS), the length of time since domain registration, the presence of a favicon, the presence of HTTP or HTTPS tokens in the domain name, the URL of requested external resources, the presence of anchors in the URL, the number of hyperlinks in HTML tags, the server form handler used, the submission of data to email addresses, abnormal URL patterns, and estimated website traffic or popularity. Together, these features enable the analysis and detection of phishing attempts in the "Phishing Data" dataset, aiding in the development of models and algorithms to combat phishing attacks.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset consists of a collection of legitimate as well as phishing website instances. Each instance contains the URL and the relevant HTML page. The index.sql file is the root file, and it can be used to map the URLs with the relevant HTML pages. The dataset can serve as an input for the machine learning process.
Highlights:
- Total number of instances: 80,000 (83,275 instances in the dataset due to the existence of some removed SQL records in the preprocessing stage)
- Number of legitimate website instances (labelled as 0 in the SQL file): 50,000
- Number of phishing website instances (labelled as 1 in the SQL file): 30,000
Structure: The index.sql file is the root file. It consists of five fields.
1). rec_id - record number
2). url - URL of the webpage
3). website - Filename of the webpage (i.e. 1635698138155948.html)
4). result - Indicates whether a given URL is phishing or not (0 for legitimate and 1 for phishing).
5). created_date - Webpage downloaded date
Sources:
- Legitimate Data [50,000] - These data were collected from two sources.
1). Google search - A simple keyword search on the Google search engine was used, and the top 5 URLs of each search were collected. Domain restrictions were applied, limiting collection to a maximum of 10 URLs per domain, to obtain a diverse collection.
2). Ebbu2017 Phishing Dataset [1] - Nearly 25,874 active URLs were collected from this repository.
Data Collection Process:
- Legitimate Data:
- The URLs were collected from the above sources, and the relevant webpages were fetched separately.
- The URLs are of different lengths to minimize the URL length issue mentioned by Verma et al. [3].
- Phishing Data:
- The URLs were collected from the above sources, and at the same time, the relevant web pages were fetched.
- An automated script continuously monitored PhishTank and OpenPhish to collect the latest phishing URLs.
- The collected URLs were fetched simultaneously to minimize the resource-unavailable issue, since phishing pages do not stay on the web for long.
- PhishRepo provides all the resources relevant to a phishing webpage; therefore, its download function was simply used to download PhishRepo data.
References:
[1]. Ebbu2017 Phishing Dataset. Accessed 31 October 2021. Available: https://github.com/ebubekirbbr/pdd/tree/master/input.
[2]. PhishRepo. Accessed 31 October 2021. Available: https://moraphishdet.projects.uom.lk/phishrepo/.
[3]. Verma, Rakesh M., Victor Zeng, and Houtan Faridi. "Data quality for security challenges: Case studies of phishing, malware and intrusion detection datasets.", 2019.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset was compiled by researchers to study phishing email tactics. It combines emails from a variety of sources to create a comprehensive resource for analysis.
Enron and Ling Datasets: These datasets focus on the core content of phishing emails, containing subject lines, email body text, and labels indicating whether the email is spam (phishing) or legitimate.
CEAS, Nazario, Nigerian Fraud, and SpamAssassin Datasets: These datasets provide broader context for the emails, including sender information, recipient information, date, and labels for spam/legitimate classification.
The final dataset combines the information from the initial datasets into a single resource for analysis. This dataset contains:
This dataset allows researchers to study the content of phishing emails and the context in which they are sent to improve detection methods.
Please cite the following two articles if you are using this dataset:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The provided dataset includes 11,430 URLs with 87 extracted features. The dataset is designed to be used as a benchmark for machine learning-based phishing detection systems. Features are from three different classes: 56 are extracted from the structure and syntax of URLs, 24 are extracted from the content of their corresponding pages, and 7 are extracted by querying external services. The dataset is balanced: it contains exactly 50% phishing and 50% legitimate URLs. Along with the dataset, we provide the Python scripts used for feature extraction, for potential replication or extension. The datasets were constructed in May 2020.
dataset_A: contains a list of URLs together with their DOM tree objects, which can be used for replication and for experimenting with new URL and content-based features, overcoming the short lifespan of phishing web pages.
dataset_B: contains the extracted feature values, which can be used directly as input to classifiers for examination. Note that the data in this dataset are indexed with URLs, so the index needs to be removed before experimentation.
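Removing the URL index before feeding dataset_B to a classifier might look like the sketch below. The column names `url` and `status`, the two sample rows, and the function name `load_features` are all hypothetical placeholders chosen for illustration; the real file's header should be checked before reuse.

```python
import csv
import io

def load_features(csv_text: str, index_column: str = "url", label_column: str = "status"):
    """Drop the URL index column and separate numeric features from labels."""
    reader = csv.DictReader(io.StringIO(csv_text))
    features, labels = [], []
    for row in reader:
        row.pop(index_column, None)          # the URL is an index, not a feature
        labels.append(row.pop(label_column))
        features.append([float(v) for v in row.values()])
    return features, labels

# Tiny made-up sample with hypothetical column names, for demonstration only.
sample = (
    "url,length_url,nb_dots,status\n"
    "http://a.example/,17,1,legitimate\n"
    "http://b.example/login,22,1,phishing\n"
)
X, y = load_features(sample)
print(X, y)
```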
In 2024, the most common type of cybercrime reported to the United States Internet Crime Complaint Center (IC3) was phishing, together with its variation, spoofing, affecting approximately 193,000 individuals. In addition, over 86,000 cases of extortion were reported to the IC3 during that year.
Dynamics of phishing attacks
Over the past few years, phishing attacks have increased significantly. In 2024, over 193,000 individuals fell victim to such attacks. The highest number of phishing scam victims since 2018 was recorded in 2021, at approximately 324,000. Phishing attacks can take many shapes. Bulk phishing, smishing, and business e-mail compromise (BEC) are the most common types. With the recent development of generative AI, it has become easier to craft a believable phishing e-mail. This is currently among the top concerns of organizational leaders.
Impact of phishing attacks
Among the industries most targeted by cybercriminals are healthcare, financial, manufacturing, and education institutions. An observation carried out in the fourth quarter of 2024 found that software-as-a-service (SaaS) and webmail were the most likely to encounter phishing attacks. According to the reports, almost a quarter of phishing attacks in the measured period targeted this sector.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This study analyzes cybersecurity issues across the UK, using data from an independent sample of 850,000 individuals and organizations during the 12 months leading up to July 2024. With cybersecurity breaches affecting 50% of small businesses, 70% of medium businesses, 74% of large businesses, and up to 66% of charities, the findings highlight the increasing threat of ransomware (45%), phishing attacks (30%), and challenges related to GDPR compliance (15%). These insights provide a comprehensive view of how cybersecurity challenges impact businesses and individuals across the region.