100+ datasets found
  1. 🕵️ Phishing Websites Data

    • kaggle.com
    Updated Feb 24, 2025
    Cite
    Sairaj Adhav (2025). 🕵️ Phishing Websites Data [Dataset]. https://www.kaggle.com/datasets/sai10py/phishing-websites-data
    Dataset updated
    Feb 24, 2025
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Sairaj Adhav
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Phishing Websites Dataset

    Overview

    This dataset is designed to aid in the analysis and detection of phishing websites. It contains various features that help distinguish between legitimate and phishing websites based on their structural, security, and behavioral attributes.

    Dataset Information

    • Total Columns: 31 (30 Features + 1 Target)
    • Target Variable: Result (Indicates whether a website is phishing or legitimate)

    Features Description

    URL-Based Features

    • Prefix_Suffix – Checks if the URL contains a hyphen (-), which is commonly used in phishing domains.
    • double_slash_redirecting – Detects if the URL redirects using //, which may indicate a phishing attempt.
    • having_At_Symbol – Identifies the presence of @ in the URL, which can be used to deceive users.
    • Shortining_Service – Indicates whether the URL uses a shortening service (e.g., bit.ly, tinyurl).
    • URL_Length – Measures the length of the URL; phishing URLs tend to be longer.
    • having_IP_Address – Checks if an IP address is used in place of a domain name, which is suspicious.
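
    As a rough illustration of how a few of the URL-based checks above could be computed, here is a minimal sketch using only the Python standard library. The function name url_flags and the boolean return values are assumptions made for this example; the dataset's own extraction rules and value encoding are not given in the description, so this is not the author's method.

```python
import ipaddress
from urllib.parse import urlsplit


def url_flags(url: str) -> dict:
    """Return boolean flags loosely mirroring four of the URL-based features.

    Hypothetical helper: the dataset's real extraction rules may differ.
    """
    host = urlsplit(url).hostname or ""
    try:
        ipaddress.ip_address(host)  # raises ValueError if host is not an IP
        host_is_ip = True
    except ValueError:
        host_is_ip = False
    # Drop the scheme separator, then look for a further "//" in the URL.
    after_scheme = url.split("://", 1)[-1]
    return {
        "having_IP_Address": host_is_ip,
        "having_At_Symbol": "@" in url,
        "Prefix_Suffix": "-" in host,
        "double_slash_redirecting": "//" in after_scheme,
    }


# Made-up example URL (192.0.2.0/24 is a documentation address range).
flags = url_flags("http://192.0.2.10/login//http://secure-bank.example")
print(flags)
```

    Note that the hyphen check here inspects only the host name, which is one possible reading of the Prefix_Suffix description.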

    Domain-Based Features

    • having_Sub_Domain – Evaluates the number of subdomains; phishing sites often have excessive subdomains.
    • SSLfinal_State – Indicates whether the website has a valid SSL certificate (secure connection).
    • Domain_registeration_length – Measures the duration of domain registration; phishing sites often have short lifespans.
    • age_of_domain – The age of the domain in days; older domains are usually more trustworthy.
    • DNSRecord – Checks if the domain has valid DNS records; phishing domains may lack these.

    Webpage-Based Features

    • Favicon – Determines if the website uses an external favicon (which can be a sign of phishing).
    • port – Identifies if the site is using suspicious or non-standard ports.
    • HTTPS_token – Checks if "HTTPS" is included in the URL but is used deceptively.
    • Request_URL – Measures the percentage of external resources loaded from different domains.
    • URL_of_Anchor – Analyzes anchor tags (<a> links) and their trustworthiness.
    • Links_in_tags – Examines <meta>, <script>, and <link> tags for external links.
    • SFH (Server Form Handler) – Determines if form actions are handled suspiciously.
    • Submitting_to_email – Checks if forms submit data directly to an email instead of a web server.
    • Abnormal_URL – Identifies if the website’s URL structure is inconsistent with common patterns.
    • Redirect – Counts the number of redirects; phishing websites may have excessive redirects.

    Behavior-Based Features

    • on_mouseover – Checks if the website changes content when hovered over (used in deceptive techniques).
    • RightClick – Detects if right-click functionality is disabled (phishing sites may disable it).
    • popUpWindow – Identifies the presence of pop-ups, which can be used to trick users.
    • Iframe – Checks if the website uses <iframe> tags, often used in phishing attacks.

    Traffic & Search Engine Features

    • web_traffic – Measures the website’s Alexa ranking; phishing sites tend to have low traffic.
    • Page_Rank – Google PageRank score; phishing sites usually have a low PageRank.
    • Google_Index – Checks if the website is indexed by Google (phishing sites may not be indexed).
    • Links_pointing_to_page – Counts the number of backlinks pointing to the website.
    • Statistical_report – Uses external sources to verify if the website has been reported for phishing.

    Target Variable

    • Result – The classification label (1: Legitimate, -1: Phishing)

    Usage

    This dataset is valuable for:
    • Machine Learning Models – Developing classifiers for phishing detection.
    • Cybersecurity Research – Understanding patterns in phishing attacks.
    • Browser Security Extensions – Enhancing anti-phishing tools.

  2. Phishing Websites Dataset

    • kaggle.com
    zip
    Updated Mar 23, 2024
    + more versions
    Cite
    Arnav Samal (2024). Phishing Websites Dataset [Dataset]. https://www.kaggle.com/datasets/arnavs19/phishing-websites-dataset
    Dataset updated
    Mar 23, 2024
    Authors
    Arnav Samal
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These data consist of a collection of legitimate and phishing website instances. Each website is represented by a set of features that indicate whether the website is legitimate or not. The data can serve as input to a machine learning process.

    Here, the two variants of the Phishing Dataset are presented.

    1. Full variant - dataset_full.csv

      • Total number of instances: 88,647
      • Number of legitimate website instances (labeled as 0): 58,000
      • Number of phishing website instances (labeled as 1): 30,647
      • Total number of features: 111
    2. Small variant - dataset_small.csv

      • Total number of instances: 58,645
      • Number of legitimate website instances (labeled as 0): 27,998
      • Number of phishing website instances (labeled as 1): 30,647
      • Total number of features: 111
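
    The class balance of the two variants follows directly from the counts listed above. A small sketch of that arithmetic (the helper name phishing_share is made up for this example):

```python
# Instance counts copied from the description of the two variants.
variants = {
    "dataset_full.csv": {"legitimate": 58_000, "phishing": 30_647},
    "dataset_small.csv": {"legitimate": 27_998, "phishing": 30_647},
}


def phishing_share(counts: dict) -> tuple:
    """Return (total instances, fraction labeled phishing)."""
    total = counts["legitimate"] + counts["phishing"]
    return total, counts["phishing"] / total


for name, counts in variants.items():
    total, share = phishing_share(counts)
    print(f"{name}: {total} instances, {share:.1%} phishing")
```

    The full variant is skewed toward legitimate instances while the small variant is close to balanced, which may matter when choosing evaluation metrics.
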
  3. Fraudulent Bank Websites, Phishing E-mails and Similar Scams | DATA.GOV.HK

    • data.gov.hk
    Updated Oct 27, 2018
    Cite
    data.gov.hk (2018). Fraudulent Bank Websites, Phishing E-mails and Similar Scams | DATA.GOV.HK [Dataset]. https://data.gov.hk/en-data/dataset/hk-hkma-banksvf-fraudulent-bank-scams
    Dataset updated
    Oct 27, 2018
    Dataset provided by
    data.gov.hk
    Description

    This API provides information on press releases issued by authorized institutions, and on similar press releases issued by the HKMA in the past, regarding fraudulent bank websites, phishing e-mails and similar scams.

  4. Web page phishing detection

    • data.mendeley.com
    Updated Jun 25, 2021
    + more versions
    Cite
    Abdelhakim Hannousse (2021). Web page phishing detection [Dataset]. http://doi.org/10.17632/c2gw7fy2j4.3
    Dataset updated
    Jun 25, 2021
    Authors
    Abdelhakim Hannousse
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The provided dataset includes 11,430 URLs with 87 extracted features. The dataset is designed to be used as a benchmark for machine-learning-based phishing detection systems. Features are from three different classes: 56 extracted from the structure and syntax of URLs, 24 extracted from the content of their corresponding pages, and 7 extracted by querying external services. The dataset is balanced: it contains exactly 50% phishing and 50% legitimate URLs. Along with the dataset, we provide the Python scripts used for feature extraction, for potential replication or extension. The datasets were constructed in May 2020.

    dataset_A: contains a list of URLs together with their DOM tree objects, which can be used for replication and for experimenting with new URL- and content-based features, overcoming the short lifetime of phishing web pages.

    dataset_B: contains the extracted feature values, which can be used directly as input to classifiers. Note that the data in this dataset are indexed by URL, so the index must be removed before experimentation.
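
    The note about removing the URL index could be handled along the following lines. This is only a sketch on a tiny in-memory stand-in: the column names url, length_url, nb_dots and status are assumptions, not confirmed names from dataset_B.

```python
import csv
import io

# Tiny stand-in for dataset_B; all column names here are hypothetical.
sample = io.StringIO(
    "url,length_url,nb_dots,status\n"
    "http://a.example/login,22,1,phishing\n"
    "https://b.example/,18,1,legitimate\n"
)


def load_features(handle, index_column="url", label_column="status"):
    """Split rows into numeric feature dicts and labels, dropping the URL index."""
    features, labels = [], []
    for row in csv.DictReader(handle):
        row.pop(index_column)              # the URL is an index, not a feature
        labels.append(row.pop(label_column))
        features.append({name: float(value) for name, value in row.items()})
    return features, labels


X, y = load_features(sample)
print(X[0], y[0])
```

    Keeping the URL out of the feature set avoids a classifier memorising individual addresses instead of learning from the extracted features.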

  5. data-phishing-detection

    • huggingface.co
    Updated Oct 23, 2024
    Cite
    Reva (2024). data-phishing-detection [Dataset]. https://huggingface.co/datasets/RevaHQ/data-phishing-detection
    Dataset updated
    Oct 23, 2024
    Dataset authored and provided by
    Reva
    Description

    data-phishing-detection

    A dataset to test methods to detect phishing emails. The file data.parquet contains the dataset: 400 emails, of which 200 are synthetic phishing attempts and 200 are synthetic regular emails.

    Schema

    • input - an email, synthesized by an LLM, that is either a phishing attempt or a regular email.
    • output - 'Yes' if the email is a phishing attempt, 'No' otherwise.
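
    Given the input/output schema above, a detection method can be scored by comparing its 'Yes'/'No' answers with the output column. The sketch below assumes rows are available as dictionaries with those two keys; the four example emails and the keyword_rule predictor are invented purely for illustration.

```python
def evaluate(rows, predict):
    """Score a predictor returning 'Yes' or 'No' against the 'output' column."""
    tp = fp = fn = tn = 0
    for row in rows:
        guess, truth = predict(row["input"]), row["output"]
        if guess == "Yes" and truth == "Yes":
            tp += 1
        elif guess == "Yes" and truth == "No":
            fp += 1
        elif guess == "No" and truth == "Yes":
            fn += 1
        else:
            tn += 1
    return {
        "accuracy": (tp + tn) / len(rows),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def keyword_rule(email: str) -> str:
    """Deliberately naive baseline: flag any email mentioning 'password'."""
    return "Yes" if "password" in email.lower() else "No"


# Made-up rows in the documented schema (not taken from data.parquet).
rows = [
    {"input": "Verify your password now at this link", "output": "Yes"},
    {"input": "Lunch at noon tomorrow?", "output": "No"},
    {"input": "Your invoice is attached, open it today", "output": "Yes"},
    {"input": "Reminder: rotate your password every quarter", "output": "No"},
]
print(evaluate(rows, keyword_rule))
```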

    Prompt

    The prompt.md file contains a prompt that can be used with an LLM as a starting… See the full description on the dataset page: https://huggingface.co/datasets/RevaHQ/data-phishing-detection.

  6. Textual Data of Phishing Scams Targeting Academia

    • openicpsr.org
    delimited
    Updated Apr 30, 2024
    Cite
    Ethan Morrow (2024). Textual Data of Phishing Scams Targeting Academia [Dataset]. http://doi.org/10.3886/E201721V1
    Dataset updated
    Apr 30, 2024
    Dataset provided by
    University of Illinois at Urbana-Champaign
    Authors
    Ethan Morrow
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    A partial dataset and document-term matrix of phishing emails targeting an institution of higher education and an associated script used for data analysis.

  7. Outcomes of successful phishing attacks in companies worldwide 2021-2023

    • statista.com
    • ai-chatbox.pro
    Updated Mar 10, 2025
    Cite
    Statista (2025). Outcomes of successful phishing attacks in companies worldwide 2021-2023 [Dataset]. https://www.statista.com/statistics/1350723/consequences-phishing-attacks/
    Dataset updated
    Mar 10, 2025
    Dataset authored and provided by
    Statista: http://statista.com/
    Area covered
    Worldwide
    Description

    Surveys of working adults and IT security professionals worldwide conducted in 2021 and 2023 found that the share of organizations experiencing severe consequences due to a successful cyber attack had declined. In 2023, the share of enterprises experiencing a breach of customer or client data was 29 percent, down from 44 percent in 2022. Ransomware infections that occurred through e-mail were common for 32 percent of the respondents in 2023. Cases of a credential or account compromise occurred in 27 percent of the organizations in 2023, a decrease of 25 percent compared to the year prior.
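
    When reading survey shares like these, it helps to separate a change in percentage points from a relative change. A minimal sketch using the customer-data-breach figures quoted above (44 percent, then 29 percent); the helper name change is made up for this example:

```python
def change(previous: float, current: float) -> tuple:
    """Return (difference in percentage points, relative change as a fraction)."""
    points = current - previous
    relative = points / previous
    return points, relative


points, relative = change(previous=44, current=29)
print(f"{points} percentage points, {relative:.1%} relative change")
```

    The same two calculations can be applied to any pair of shares in the description once both years' values are known.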

  8. Phishing website dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 10, 2021
    Cite
    van Dooremaal, Bram (2021). Phishing website dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4922597
    Dataset updated
    Jun 10, 2021
    Dataset provided by
    Zannone, Nicola
    Burda, Pavlo
    van Dooremaal, Bram
    Allodi, Luca
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset comprises phishing and legitimate web pages, which have been used for experiments on early phishing detection.

    Detailed information on the dataset and data collection is available at

    Bram van Dooremaal, Pavlo Burda, Luca Allodi, and Nicola Zannone. 2021. Combining Text and Visual Features to Improve the Identification of Cloned Webpages for Early Phishing Detection. In ARES '21: Proceedings of the 16th International Conference on Availability, Reliability and Security. ACM.

  9. Phishing Simulation Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Apr 29, 2025
    Cite
    Data Insights Market (2025). Phishing Simulation Report [Dataset]. https://www.datainsightsmarket.com/reports/phishing-simulation-1442865
    Dataset updated
    Apr 29, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The phishing simulation market is experiencing robust growth, driven by the escalating sophistication of phishing attacks and the increasing regulatory pressure on organizations to enhance their cybersecurity posture. The market, currently valued at approximately $1.5 billion in 2025 (estimated based on typical market sizes for cybersecurity segments with similar growth rates), is projected to experience a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033. This expansion is fueled by several key factors. Firstly, the rising frequency and success rate of phishing campaigns targeting both large enterprises and SMEs necessitate proactive security measures like simulation training. Secondly, evolving attack vectors and techniques demand continuous adaptation and improvement in security awareness programs, creating a sustained demand for advanced phishing simulation solutions. Thirdly, stringent data privacy regulations like GDPR and CCPA are imposing significant penalties for data breaches resulting from successful phishing attacks, motivating organizations to invest heavily in preventative measures including simulation-based training. The market segmentation reveals a significant share held by software-based solutions, owing to their scalability, ease of deployment, and cost-effectiveness. However, the service segment is also experiencing strong growth due to the increasing need for expert guidance and managed services in designing and implementing effective phishing simulation programs. Geographically, North America currently dominates the market, followed by Europe, reflecting the high level of cybersecurity awareness and regulatory compliance in these regions. However, the Asia-Pacific region is expected to exhibit the highest growth rate over the forecast period, driven by increasing digital adoption and rising awareness of cybersecurity threats in developing economies. 
    While the market faces certain restraints, such as the need for specialized expertise and the potential for high implementation costs, the overall growth trajectory remains positive, driven by the overwhelming need to combat the ever-evolving threat landscape of phishing attacks.
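
    To see what the stated figures imply, the compound-growth formula value = base × (1 + rate)^years can be applied to the report's own estimates ($1.5 billion in 2025, 15% CAGR through 2033). This is only an arithmetic illustration of those two inputs, not an independent forecast; the function name project is made up for this example.

```python
def project(base: float, rate: float, years: int) -> float:
    """Compound a base value at a fixed annual growth rate."""
    return base * (1.0 + rate) ** years


BASE_YEAR = 2025
BASE_USD_BILLION = 1.5   # the report's own estimate
CAGR = 0.15              # the report's projected growth rate

for year in range(BASE_YEAR, 2034):
    value = project(BASE_USD_BILLION, CAGR, year - BASE_YEAR)
    print(f"{year}: ~${value:.2f}B")
```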

  10. Phishing Attack Dataset

    • ieee-dataport.org
    Updated May 3, 2025
    Cite
    Emin Kugu (2025). Phishing Attack Dataset [Dataset]. https://ieee-dataport.org/documents/phishing-attack-dataset
    Dataset updated
    May 3, 2025
    Authors
    Emin Kugu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The tested scenarios were run on small_dataset. The most successful configuration identified in that analysis was then applied to big_dataset.

  11. PhiUSIIL Phishing URL Dataset

    • data.mendeley.com
    Updated Nov 15, 2023
    + more versions
    Cite
    Arvind Prasad (2023). PhiUSIIL Phishing URL Dataset [Dataset]. http://doi.org/10.17632/shwpxscxy2.2
    Dataset updated
    Nov 15, 2023
    Authors
    Arvind Prasad
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    PhiUSIIL Phishing URL Dataset is a substantial dataset comprising 134,850 legitimate and 100,945 phishing URLs. Most of the URLs we analyzed while constructing the dataset are the latest URLs. Features are extracted from the source code of the webpage and URL. Features such as CharContinuationRate, URLTitleMatchScore, URLCharProb, and TLDLegitimateProb are derived from existing features.

    Citation: Prasad, A., & Chandra, S. (2023). PhiUSIIL: A diverse security profile empowered phishing URL detection framework based on similarity index and incremental learning. Computers & Security, 103545. doi: https://doi.org/10.1016/j.cose.2023.103545

  12. A Dataset of Information (DNS, IP, WHOIS/RDAP, TLS, GeoIP) for a Large...

    • zenodo.org
    json
    Updated Dec 10, 2024
    Cite
    Radek Hranický; Adam Horák; Ondřej Ondryáš (2024). A Dataset of Information (DNS, IP, WHOIS/RDAP, TLS, GeoIP) for a Large Corpus of Benign, Phishing, and Malware Domain Names 2024 [Dataset]. http://doi.org/10.5281/zenodo.13330074
    Dataset updated
    Dec 10, 2024
    Dataset provided by
    Zenodo
    Authors
    Radek Hranický; Adam Horák; Ondřej Ondryáš
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Aug 16, 2024
    Description

    The dataset contains DNS records, IP-related features, WHOIS/RDAP information, information from TLS handshakes and certificates, and GeoIP information for 368,956 benign domains from Cisco Umbrella, 461,338 benign domains from the actual CESNET network traffic, 164,425 phishing domains from PhishTank and OpenPhish services, and 100,809 malware domains from various sources like ThreatFox, The Firebog, MISP threat intelligence platform, and other sources. The ground truth for the phishing dataset was double-checked with the VirusTotal (VT) service. Domain names not considered malicious by VT have been removed from phishing and malware datasets. Similarly, benign domain names that were considered risky by VT have been removed from the benign datasets. The data was collected between March 2023 and July 2024. The final assessment of the data was conducted in August 2024.

    The dataset is useful for cybersecurity research, e.g. statistical analysis of domain data or feature extraction for training machine learning-based classifiers, e.g. for phishing and malware website detection.

    Data Files

    • The data is located in the following individual files:

      • benign_umbrella.json - data for 368,956 benign domains from Cisco Umbrella,
      • benign_cesnet.json - data for 461,338 benign domains from the CESNET network,
      • phishing.json - data for 164,425 phishing domains, and
      • malware.json - data for 100,809 malware domains.

    Data Structure

    Each file contains a JSON array of records generated using mongoexport. The following table documents the structure of a record. Please note that:

    • some fields may be missing (they should be interpreted as nulls),
    • extra fields may be present (they should be ignored).

    Field name | Field type | Nullable | Description
    domain_name | String | No | The evaluated domain name
    url | String | No | The source URL for the domain name
    evaluated_on | Date | No | Date of last collection attempt
    source | String | No | An identifier of the source
    sourced_on | Date | No | Date of ingestion of the domain name
    dns | Object | Yes | Data from DNS scan
    rdap | Object | Yes | Data from RDAP or WHOIS
    tls | Object | Yes | Data from TLS handshake
    ip_data | Array of Objects | Yes | Array of data objects capturing the IP addresses related to the domain name

    DNS data (dns field)

    A | Array of Strings | No | Array of IPv4 addresses
    AAAA | Array of Strings | No | Array of IPv6 addresses
    TXT | Array of Strings | No | Array of raw TXT values
    CNAME | Object | No | The CNAME target and related IPs
    MX | Array of Objects | No | Array of objects with the MX target hostname, priority and related IPs
    NS | Array of Objects | No | Array of objects with the NS target hostname and related IPs
    SOA | Object | No | All the SOA fields, present if found at the target domain name
    zone_SOA | Object | No | The SOA fields of the target’s zone (closest point of delegation), present if found and not a record in the target domain directly
    dnssec | Object | No | Flags describing the DNSSEC validation result for each record type
    ttls | Object | No | The TTL values for each record type
    remarks | Object | No | The zone domain name and DNSSEC flags

    RDAP data (rdap field)

    copyright_notice | String | No | RDAP/WHOIS data usage copyright notice
    dnssec | Bool | No | DNSSEC presence flag
    entitites | Object | No | An object with various arrays representing the found related entity types (e.g. abuse, admin, registrant). The arrays contain objects describing the individual entities.
    expiration_date | Date | Yes | The current date of expiration
    handle | String | No | RDAP handle
    last_changed_date | Date | Yes | The date when the domain was last changed
    name | String | No | The target domain name for which the data in this object are stored
    nameservers | Array of Strings | No | Nameserver hostnames provided by RDAP or WHOIS
    registration_date | Date | Yes | First registration date
    status | Array of Strings
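
    The two rules stated for these records (missing fields are to be read as nulls, extra fields are to be ignored) can be applied when loading a file. The sketch below works on a tiny made-up JSON array rather than the real files; the field list comes from the documented top-level fields, while the function name normalize and the extra "debug" key are inventions for this example.

```python
import json

# Top-level fields documented for each record.
KNOWN_FIELDS = (
    "domain_name", "url", "evaluated_on", "source", "sourced_on",
    "dns", "rdap", "tls", "ip_data",
)


def normalize(record: dict) -> dict:
    """Keep only documented fields; a missing field becomes None (null)."""
    return {field: record.get(field) for field in KNOWN_FIELDS}


# Made-up record: it has an undocumented "debug" key and lacks most fields.
raw = """[
  {"domain_name": "example.com",
   "url": "http://example.com/",
   "dns": {"A": ["192.0.2.1"]},
   "debug": true}
]"""

records = [normalize(item) for item in json.loads(raw)]
first = records[0]
ipv4 = (first["dns"] or {}).get("A", [])   # dns itself may be null
print(first["domain_name"], ipv4, first["tls"])
```

    The real files hold hundreds of thousands of records each, so parsing a whole array in memory as done here may not be practical; a streaming JSON parser would be one alternative.
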

  13. Data from: Spam email Dataset

    • kaggle.com
    Updated Sep 1, 2023
    Cite
    _w1998 (2023). Spam email Dataset [Dataset]. https://www.kaggle.com/datasets/jackksoncsie/spam-email-dataset
    Dataset updated
    Sep 1, 2023
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    _w1998
    License

    http://www.gnu.org/licenses/lgpl-3.0.html

    Description

    Dataset Name: Spam Email Dataset

    Description: This dataset contains a collection of email text messages, labeled as either spam or not spam. Each email message is associated with a binary label, where "1" indicates that the email is spam, and "0" indicates that it is not spam. The dataset is intended for use in training and evaluating spam email classification models.

    Columns:

    text (Text): This column contains the text content of the email messages. It includes the body of the emails along with any associated subject lines or headers.

    spam_or_not (Binary): This column contains binary labels to indicate whether an email is spam or not. "1" represents spam, while "0" represents not spam.

    Usage: This dataset can be used for various Natural Language Processing (NLP) tasks, such as text classification and spam detection. Researchers and data scientists can train and evaluate machine learning models using this dataset to build effective spam email filters.
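
    A first look at such data often starts with word counts per class. The sketch below assumes a CSV with the two documented columns, text and spam_or_not, and runs on three invented rows rather than the real file; the function name token_counts_by_label is hypothetical.

```python
import csv
import io
import re
from collections import Counter

# Invented rows using the two documented column names.
sample = io.StringIO(
    "text,spam_or_not\n"
    '"Subject: win a free prize now",1\n'
    '"Subject: meeting notes for monday",0\n'
    '"Subject: free offer, claim your prize",1\n'
)


def token_counts_by_label(handle) -> dict:
    """Count lowercase word tokens separately for spam ('1') and not spam ('0')."""
    counts = {"0": Counter(), "1": Counter()}
    for row in csv.DictReader(handle):
        tokens = re.findall(r"[a-z']+", row["text"].lower())
        counts[row["spam_or_not"]].update(tokens)
    return counts


counts = token_counts_by_label(sample)
print(counts["1"].most_common(3))
```

    Counts like these can feed a simple bag-of-words classifier, or just serve as a sanity check that the labels line up with the text.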

  14. Spear Phishing Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Jun 6, 2025
    Cite
    Data Insights Market (2025). Spear Phishing Report [Dataset]. https://www.datainsightsmarket.com/reports/spear-phishing-1951598
    Dataset updated
    Jun 6, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The spear phishing market is experiencing robust growth, driven by the increasing sophistication of cyberattacks and the expanding digital landscape. While precise market sizing data is unavailable, considering the substantial investments in cybersecurity and the consistent rise in reported phishing incidents, a reasonable estimate for the 2025 market size would be in the range of $5-7 billion. This figure reflects the rising costs associated with data breaches, regulatory fines, and the increasing demand for advanced threat detection and response solutions. A Compound Annual Growth Rate (CAGR) of 12-15% over the forecast period (2025-2033) is plausible, considering ongoing technological advancements in spear phishing techniques and the corresponding need for robust countermeasures. Key drivers include the growth of remote work, increasing reliance on cloud services, and the evolving tactics employed by cybercriminals to target specific individuals and organizations. Trends point towards a greater focus on artificial intelligence (AI) and machine learning (ML) in threat detection, as well as a shift towards proactive security measures and employee training programs to mitigate the impact of spear phishing attacks. However, restraints include the ever-evolving nature of spear phishing techniques, the persistent skills gap in cybersecurity professionals, and the potential for false positives in automated detection systems. Segmentation within the market is likely to exist based on solution type (e.g., email security, security awareness training), deployment model (cloud, on-premises), and target industry (financial services, healthcare, government). Companies like BAE Systems, Check Point Software Technologies, Cisco Systems, and Proofpoint are key players actively innovating and competing within this dynamic market. The significant market expansion is further fueled by the high financial stakes involved in successful spear phishing campaigns. 
    The impact of successful attacks, including data breaches, financial losses, and reputational damage, encourages organizations to invest heavily in comprehensive security solutions. The proliferation of sophisticated spear phishing techniques, such as personalized phishing emails and the use of social engineering, necessitates advanced detection and prevention technologies. The market's competitive landscape is characterized by both established cybersecurity vendors and emerging players who are constantly developing new solutions to combat the threat of spear phishing. The competitive dynamics will likely lead to further innovation and drive market growth in the coming years, enhancing the overall sophistication of spear phishing detection and prevention solutions.

  15. Data from: phishing-emails

    • huggingface.co
    Cite
    Zion van Wyk, phishing-emails [Dataset]. https://huggingface.co/datasets/zionia/phishing-emails
    Authors
    Zion van Wyk
    Description

    Dataset Card for "phishing-emails"

    More Information needed

  16. Phishing Email: 11 Curated Datasets

    • figshare.com
    bin
    Updated May 2, 2024
    Cite
    Anonymized anonym (2024). Phishing Email: 11 Curated Datasets [Dataset]. http://doi.org/10.6084/m9.figshare.24952503.v1
    Dataset updated
    May 2, 2024
    Dataset provided by
    figshare
    Authors
    Anonymized anonym
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We have curated 11 datasets. The Nazario and Nigerian Fraud datasets contain only phishing emails.

    Cite this dataset:

    A. I. Champa, M. F. Rabbi, and M. F. Zibran, “Why phishing emails escape detection: A closer look at the failure points,” in 12th International Symposium on Digital Forensics and Security (ISDFS), 2024, pp. 1–6 (to appear).

    or

    @inproceedings{champa2024why,
      title={Why Phishing Emails Escape Detection: A Closer Look at the Failure Points},
      author={Champa, Arifa I and Rabbi, Md Fazle and Zibran, Minhaz F},
      booktitle={12th International Symposium on Digital Forensics and Security (ISDFS)},
      pages = {1--6 (to appear)},
      year={2024}
    }

  17. High-Risk URL and Content Dataset

    • kaggle.com
    Updated Feb 9, 2024
    Cite
    mehmet korkmaz (2024). High-Risk URL and Content Dataset [Dataset]. https://www.kaggle.com/datasets/mehmetkorkmaz/high-risk-url-and-content-dataset
    Dataset updated
    Feb 9, 2024
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    mehmet korkmaz
    Description

    Cite from:

    Korkmaz, M., Kocyigit, E., Sahingoz, O. K., & Diri, B. (2022). A Hybrid Phishing Detection System Using Deep Learning-based URL and Content Analysis. Elektronika Ir Elektrotechnika, 28(5), 80-89. https://doi.org/10.5755/j02.eie.31197

    About

    All data is collected from Phishank.com.

    The operation on the website of such a large-scale organization is as follows: - Users leave URLs to the URL pool to be queried. - URLs in this list, which are open to all guests, are checked by users and classified as phishing or legitimate. - A URL is tagged according to the number of votes it receives.

    Thus, three categories of URLs are listed: Phishing, Legitimate and Unrated.

    If the URL is inactive and no user has moderated it, it will be tagged as UNRATED. These can qualify as neutral elements in the URL list. URLs that have been inspected and found to be harmful while they are live are labeled as PHISHING. The phishing part of the dataset contains these URLs. Those with website content from these URLs listed under Online and Valid Phish on the PhishTank website have been added to the Phishing section of the dataset, along with both the URL and the content. URLs that have been inspected and found to be not harmful while they are live are labeled as LEGITIMATE. These URLs, which are labelled as Invalid in PhishTank and have content, form the legitimate part of the dataset. Thus, the data that was added to the checklist after being suspicious by the users and then labelled as legitimate constituted in RISKY legitimate part.

    The dataset contains 51,316 legitimate URLs with their contents and 36,173 phishing URLs with their contents, listed between 2006 and 2021.

    The file "Dataset_Distribution.xlsx" shows the distribution of the two categories by year and also records the date on which each URL was published. Approximately 91% of the phishing data was obtained in 2021. This rate supports the presence of data used in zero-day attacks in the dataset. The legitimate data, by contrast, is more evenly distributed across years, which can also be read as an indication of how accurately the legitimate data is labelled.
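
    As a quick check on the figures quoted above, the class balance and the rough number of 2021 phishing entries can be worked out directly. This is only an illustrative calculation from the counts in the description; the 91% share is approximate, so the 2021 figure is an estimate.

```python
# Counts quoted in the dataset description.
legitimate = 51_316
phishing = 36_173

total = legitimate + phishing
legit_share = legitimate / total
phish_share = phishing / total

# "Approximately 91%" of the phishing data is from 2021, so this is an estimate only.
phishing_2021_estimate = round(phishing * 0.91)

print(f"total records:      {total}")
print(f"legitimate share:   {legit_share:.1%}")
print(f"phishing share:     {phish_share:.1%}")
print(f"phishing from 2021: about {phishing_2021_estimate}")
```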

    URLs and contents were collected with a Python script. URLs whose content size was 0 KB were excluded from the dataset. In addition, each content file was checked, and contents carrying an Error 403 code were removed.
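
    The two cleaning rules described above could be sketched roughly as follows. This is not the authors' script: the `Page` record, the marker strings used to recognise a 403 page, and the sample URLs are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical record: a URL plus the raw content fetched for it."""
    url: str
    content: bytes


def looks_like_403(content: bytes) -> bool:
    # Assumption: an error page is recognised by a marker string in its text.
    text = content.decode("utf-8", errors="ignore").lower()
    return "403 forbidden" in text or "error 403" in text


def keep_page(page: Page) -> bool:
    # Drop empty (0 KB) downloads and pages that carry a 403 error.
    return len(page.content) > 0 and not looks_like_403(page.content)


# Made-up sample records, one per case.
pages = [
    Page("http://example.com/login", b"<html><body>Sign in</body></html>"),
    Page("http://example.org/empty", b""),
    Page("http://example.net/blocked", b"<h1>403 Forbidden</h1>"),
]
kept = [p.url for p in pages if keep_page(p)]
print(kept)
```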

  18. all-scam-spam

    • huggingface.co
    Updated Sep 2, 2002
    Cite
    Fred Zhang (2002). all-scam-spam [Dataset]. https://huggingface.co/datasets/FredZhang7/all-scam-spam
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 2, 2002
    Authors
    Fred Zhang
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This is a large corpus of 42,619 preprocessed text messages and emails sent by humans in 43 languages. is_spam=1 means spam and is_spam=0 means ham. 1040 rows of balanced data, consisting of casual conversations and scam emails in ≈10 languages, were manually collected and annotated by me, with some help from ChatGPT.

      Some preprocessing algorithms

    • spam_assassin.js, followed by spam_assassin.py
    • enron_spam.py

      Data composition

      Description

    To make the text… See the full description on the dataset page: https://huggingface.co/datasets/FredZhang7/all-scam-spam.

  19. Phishing corpus

    • academictorrents.com
    bittorrent
    Updated Jan 2, 2019
    Cite
    Vit Listik (2019). Phishing corpus [Dataset]. https://academictorrents.com/details/a77cda9a9d89a60dbdfbe581adf6e2df9197995a
    Explore at:
    Available download formats: bittorrent (37482335)
    Dataset updated
    Jan 2, 2019
    Dataset authored and provided by
    Vit Listik
    License

    https://academictorrents.com/nolicensespecified

    Description

    A BitTorrent file to download data with the title 'Phishing corpus'

  20. Global number of e-mail phishing attacks 2022-2023

    • statista.com
    • ai-chatbox.pro
    Updated Sep 23, 2024
    Cite
    Statista (2024). Global number of e-mail phishing attacks 2022-2023 [Dataset]. https://www.statista.com/statistics/1493550/phishing-attacks-global-number/
    Explore at:
    Dataset updated
    Sep 23, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Jan 2022 - Dec 2023
    Area covered
    Worldwide
    Description

    In December 2023, around 9.45 million phishing e-mails were detected worldwide, up from 5.59 million in September 2023. This figure has seen a continuous increase since January 2022. It is partially associated with the launch of ChatGPT in November 2022.

Cite
Sairaj Adhav (2025). 🕵️ Phishing Websites Data [Dataset]. https://www.kaggle.com/datasets/sai10py/phishing-websites-data

🕵️ Phishing Websites Data

A useful dataset for analyzing and detecting phishing websites

Explore at:
311 scholarly articles cite this dataset (View in Google Scholar)
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Feb 24, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Sairaj Adhav
License

Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically

Description

Phishing Websites Dataset

Overview

This dataset is designed to aid in the analysis and detection of phishing websites. It contains various features that help distinguish between legitimate and phishing websites based on their structural, security, and behavioral attributes.

Dataset Information

  • Total Columns: 31 (30 Features + 1 Target)
  • Target Variable: Result (Indicates whether a website is phishing or legitimate)

Features Description

URL-Based Features

  • Prefix_Suffix – Checks if the URL contains a hyphen (-), which is commonly used in phishing domains.
  • double_slash_redirecting – Detects if the URL redirects using //, which may indicate a phishing attempt.
  • having_At_Symbol – Identifies the presence of @ in the URL, which can be used to deceive users.
  • Shortining_Service – Indicates whether the URL uses a shortening service (e.g., bit.ly, tinyurl).
  • URL_Length – Measures the length of the URL; phishing URLs tend to be longer.
  • having_IP_Address – Checks if an IP address is used in place of a domain name, which is suspicious.
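
The URL-based checks listed above can be approximated with the Python standard library. This is a minimal sketch, not the procedure used to build the dataset: the `url_features` function, the `SHORTENERS` list, the choice to test the hostname for hyphens, and the boolean outputs are all assumptions, and the dataset's own columns may be encoded differently.

```python
import ipaddress
from urllib.parse import urlsplit

# Assumption: a small illustrative list of shortener hosts, not the dataset's own.
SHORTENERS = {"bit.ly", "tinyurl.com"}


def host_is_ip(host: str) -> bool:
    """Return True when the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def url_features(url: str) -> dict:
    """Compute rough versions of the six URL-based features for one URL."""
    host = urlsplit(url).hostname or ""
    # Ignore the "//" that follows the scheme and look for a later one.
    rest = url.split("://", 1)[-1]
    return {
        "Prefix_Suffix": "-" in host,  # assumption: hyphen checked in the host only
        "double_slash_redirecting": "//" in rest,
        "having_At_Symbol": "@" in url,
        "Shortining_Service": host in SHORTENERS,
        "URL_Length": len(url),
        "having_IP_Address": host_is_ip(host),
    }


features = url_features("http://192.168.0.1//login")
print(features)
```

Returning plain booleans and a raw length keeps the sketch easy to read; mapping them onto whatever encoding the dataset uses would be a separate step.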

Domain-Based Features

  • having_Sub_Domain – Evaluates the number of subdomains; phishing sites often have excessive subdomains.
  • SSLfinal_State – Indicates whether the website has a valid SSL certificate (secure connection).
  • Domain_registeration_length – Measures the duration of domain registration; phishing sites often have short lifespans.
  • age_of_domain – The age of the domain in days; older domains are usually more trustworthy.
  • DNSRecord – Checks if the domain has valid DNS records; phishing domains may lack these.

Webpage-Based Features

  • Favicon – Determines if the website uses an external favicon (which can be a sign of phishing).
  • port – Identifies if the site is using suspicious or non-standard ports.
  • HTTPS_token – Checks if "HTTPS" is included in the URL but is used deceptively.
  • Request_URL – Measures the percentage of external resources loaded from different domains.
  • URL_of_Anchor – Analyzes anchor tags (<a> links) and their trustworthiness.
  • Links_in_tags – Examines <meta>, <script>, and <link> tags for external links.
  • SFH (Server Form Handler) – Determines if form actions are handled suspiciously.
  • Submitting_to_email – Checks if forms submit data directly to an email instead of a web server.
  • Abnormal_URL – Identifies if the website’s URL structure is inconsistent with common patterns.
  • Redirect – Counts the number of redirects; phishing websites may have excessive redirects.

Behavior-Based Features

  • on_mouseover – Checks if the website changes content when hovered over (used in deceptive techniques).
  • RightClick – Detects if right-click functionality is disabled (phishing sites may disable it).
  • popUpWindow – Identifies the presence of pop-ups, which can be used to trick users.
  • Iframe – Checks if the website uses <iframe> tags, often used in phishing attacks.

Traffic & Search Engine Features

  • web_traffic – Measures the website’s Alexa ranking; phishing sites tend to have low traffic.
  • Page_Rank – Google PageRank score; phishing sites usually have a low PageRank.
  • Google_Index – Checks if the website is indexed by Google (phishing sites may not be indexed).
  • Links_pointing_to_page – Counts the number of backlinks pointing to the website.
  • Statistical_report – Uses external sources to verify if the website has been reported for phishing.

Target Variable

  • Result – The classification label (1: Legitimate, -1: Phishing)
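
To show how the target encoding might be read in practice, the sketch below decodes the Result column and counts each class. The three sample rows and their feature values are hypothetical, and only two of the 31 columns are shown; the label mapping itself follows the description above.

```python
import csv
import io
from collections import Counter

# Label encoding as stated in the description.
LABELS = {"1": "legitimate", "-1": "phishing"}

# Hypothetical sample rows; a real file would carry all 31 columns.
sample_csv = """URL_Length,Result
1,1
-1,-1
1,-1
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
counts = Counter(LABELS[row["Result"]] for row in rows)
print(dict(counts))
```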

Usage

This dataset is valuable for:

  • Machine Learning Models – Developing classifiers for phishing detection.
  • Cybersecurity Research – Understanding patterns in phishing attacks.
  • Browser Security Extensions – Enhancing anti-phishing tools.
