Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Identify Phishing using Machine learning Algorithms
The dataset consists of a collection of legitimate as well as phishing website instances. Each instance contains the URL and the relevant HTML page. The index.sql file is the root file, and it can be used to map the URLs to the relevant HTML pages. The dataset can serve as an input for the machine learning process.

Highlights:
- Total number of instances: 80,000 (83,275 instances in the dataset due to the existence of some removed SQL records in the preprocessing stage)
- Number of legitimate website instances (labelled as 0 in the SQL file): 50,000
- Number of phishing website instances (labelled as 1 in the SQL file): 30,000

Structure:
The index.sql file is the root file. It consists of five fields.
1). rec_id - record number
2). url - URL of the webpage
3). website - Filename of the webpage (i.e. 1635698138155948.html)
4). result - Indicates whether a given URL is phishing or not (0 for legitimate and 1 for phishing)
5). created_date - Webpage downloaded date

Sources:
- Legitimate Data [50,000] - These data were collected from two sources.
1). Google search - A simple keyword search on the Google search engine was used, and the top 5 URLs of each search were collected. Domain restrictions were applied, limiting the collection to a maximum of 10 URLs per domain so that the final collection is diverse.
2). Ebbu2017 Phishing Dataset [1] - Nearly 25,874 active URLs were collected from this repository.
- Phishing Data [30,000] - Three sources were used.
1). PhishTank - From 01 December 2020 to 31 October 2021
2). OpenPhish - From 29 September 2021 to 31 October 2021
3). PhishRepo [2] - From 29 September 2021 to 31 October 2021

Data Collection Process:
- Legitimate Data:
- The URLs were collected from the above sources, and the relevant webpages were fetched separately.
- The URLs are of different lengths to minimize the URL length issue mentioned by Verma et al. [3].
- Phishing Data:
- The URLs were collected from the above sources, and at the same time, the relevant webpages were fetched.
- An automated script continuously monitored PhishTank and OpenPhish to collect the latest phishing URLs.
- The collected URLs were fetched simultaneously to minimize the resource-unavailable issue, since phishing pages do not stay on the web for long.
- PhishRepo provides all the resources relevant to a phishing webpage; therefore, its download function was simply used to download the PhishRepo data.

References:
[1]. Ebbu2017 Phishing Dataset. Accessed 31 October 2021. Available: https://github.com/ebubekirbbr/pdd/tree/master/input.
[2]. PhishRepo. Accessed 31 October 2021. Available: https://moraphishdet.projects.uom.lk/phishrepo/.
[3]. Verma, Rakesh M., Victor Zeng, and Houtan Faridi. "Data quality for security challenges: Case studies of phishing, malware and intrusion detection datasets.", 2019.
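As a rough illustration of how the five index.sql fields could be used to map URLs to their HTML files, the sketch below loads two made-up records into an in-memory SQLite database. The table name "index", the example URLs, the second filename, and the dates are assumptions for illustration; only the five field names and the 0/1 labelling come from the description above.

```python
import sqlite3

# In-memory database standing in for the contents of index.sql.
# ASSUMPTION: the table is called "index"; the real dump may use another name.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE "index" (
        rec_id INTEGER PRIMARY KEY,
        url TEXT,
        website TEXT,
        result INTEGER,
        created_date TEXT
    )"""
)

# Hypothetical records (invented URLs and dates, for illustration only).
rows = [
    (1, "https://example.com/", "1635698138155948.html", 0, "2021-10-31"),
    (2, "http://phish.example.net/login", "1635698138155949.html", 1, "2021-10-31"),
]
conn.executemany('INSERT INTO "index" VALUES (?, ?, ?, ?, ?)', rows)


def url_to_page(connection):
    """Return {url: (html_filename, label)} for every record."""
    cursor = connection.execute(
        'SELECT url, website, result FROM "index" ORDER BY rec_id'
    )
    return {url: (website, result) for url, website, result in cursor}


mapping = url_to_page(conn)
# Filenames of the pages labelled as phishing (result == 1).
phishing_pages = [page for page, label in mapping.values() if label == 1]
print(mapping)
print(phishing_pages)
```

With the real data, the same query could be run after importing index.sql into a database of choice; the HTML filename in the website field then points at the stored page.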
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
ISSR CS602 Machine Learning - Project
Website Phishing Data Set
Property | Value |
---|---|
Data Set Characteristics | Multivariate |
Number of Instances | 1353 |
Attribute Characteristics | Integer |
Number of Attributes | 10 |
Associated Tasks | Classification |
Number of Web Hits | 54880 |
Source: Neda Abdelhamid, Auckland Institute of Studies, nedah '@' ais.ac.nz
Data Set Information:
The phishing problem is considered a vital issue in the “.COM” industry, especially e-banking and e-commerce, given the number of online transactions involving payments. We have identified different features related to legitimate and phishy websites and collected 1353 different websites from different sources. Phishing websites were collected from the Phishtank data archive (www.phishtank.com), which is a free community site where users can submit, verify, track and share phishing data. The legitimate websites were collected from Yahoo and starting point directories using a web script developed in PHP. The PHP script was plugged into a browser, and we collected 548 legitimate websites out of 1353 websites. There are 702 phishing URLs and 103 suspicious URLs.
When a website is considered SUSPICIOUS that means it can be either phishy or legitimate, meaning the website held some legit and phishy features.
Attribute Information:
URL Anchor
Request URL
SFH
URL Length
Having '@'
Prefix/Suffix
IP
Sub Domain
Web traffic
Domain age
Class
The collected features hold the categorical values “Legitimate”, “Suspicious” and “Phishy”; these values have been replaced with the numerical values 1, 0 and -1 respectively. Details of each feature are given in the associated research paper.
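The value replacement described above can be sketched in a few lines. The mapping itself is taken from the text; the sample row and the helper name encode_row are made up for illustration.

```python
# Mapping stated in the description: Legitimate -> 1, Suspicious -> 0, Phishy -> -1.
LABEL_TO_INT = {"Legitimate": 1, "Suspicious": 0, "Phishy": -1}


def encode_row(row):
    """Replace each categorical value in a row with its numeric code."""
    return [LABEL_TO_INT[value] for value in row]


# Hypothetical feature row (values invented for illustration only).
sample = ["Legitimate", "Phishy", "Suspicious", "Phishy"]
encoded = encode_row(sample)
print(encoded)
```

An unknown category raises a KeyError here, which is usually preferable to silently encoding an unexpected value.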
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These data consist of a collection of legitimate as well as phishing website instances. Each website is represented by a set of features that denote whether the website is legitimate or not. The data can serve as an input for the machine learning process.
Here, the two variants of the Phishing Dataset are presented.
Full variant - dataset_full.csv
Small variant - dataset_small.csv
Author: Rami Mustafa A Mohammad (University of Huddersfield, rami.mohammad '@' hud.ac.uk, rami.mustafa.a '@' gmail.com), Lee McCluskey (University of Huddersfield, t.l.mccluskey '@' hud.ac.uk), Fadi Thabtah (Canadian University of Dubai, fadi '@' cud.ac.ae)
Source: UCI
Please cite: Please refer to the Machine Learning Repository's citation policy
Source: Rami Mustafa A Mohammad (University of Huddersfield, rami.mohammad '@' hud.ac.uk, rami.mustafa.a '@' gmail.com), Lee McCluskey (University of Huddersfield, t.l.mccluskey '@' hud.ac.uk), Fadi Thabtah (Canadian University of Dubai, fadi '@' cud.ac.ae)
Data Set Information:
One of the challenges faced by our research was the unavailability of reliable training datasets. In fact, this challenge faces any researcher in the field. Although plenty of articles about predicting phishing websites have been disseminated these days, no reliable training dataset has been published publicly, maybe because there is no agreement in the literature on the definitive features that characterize phishing webpages; hence it is difficult to shape a dataset that covers all possible features. In this dataset, we shed light on the important features that have proved to be sound and effective in predicting phishing websites. In addition, we propose some new features.
Attribute Information:
For further information about the features, see the features file in the data folder of UCI.
Relevant Papers:
Mohammad, Rami, McCluskey, T.L. and Thabtah, Fadi (2012) An Assessment of Features Related to Phishing Websites using an Automated Technique. In: International Conference For Internet Technology And Secured Transactions. ICITST 2012. IEEE, London, UK, pp. 492-497. ISBN 978-1-4673-5325-0
Mohammad, Rami, Thabtah, Fadi Abdeljaber and McCluskey, T.L. (2014) Predicting phishing websites based on self-structuring neural network. Neural Computing and Applications, 25 (2). pp. 443-458. ISSN 0941-0643
Mohammad, Rami, McCluskey, T.L. and Thabtah, Fadi Abdeljaber (2014) Intelligent Rule based Phishing Websites Classification. IET Information Security, 8 (3). pp. 153-160. ISSN 1751-8709
Citation Request:
Please refer to the Machine Learning Repository's citation policy
Cryptocurrency, as blockchain’s most famous implementation, suffers a huge economic loss due to phishing scams. In our work, accounts and transactions in Ethereum are treated as nodes and edges, thus detection of phishing accounts can be modeled as a node classification problem.
In this work, we collected phishing nodes from Ethereum that were reported in the Etherscan label cloud. Starting from the phishing nodes, we crawled a huge Ethereum transaction network via second-order BFS. The dataset contains 2,973,489 nodes, 13,551,303 edges and 1,165 labeled nodes.
MulDiGraph.pkl: This dataset is stored in pickle format and is a networkx object. Each node is an address with an attribute called isp indicating whether it is a phishing node. Each edge has two attributes, amount and timestamp, which represent the amount of the transaction and the timestamp of the transaction, respectively. In this dataset, the total number of nodes is 2,973,489, the number of transactions is 13,551,303, and the average degree is 4.5574.
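As a quick sanity check on the reported statistics, the short calculation below reproduces the stated average degree from the node and edge counts. It suggests the figure is the number of edges divided by the number of nodes (rather than twice that); the variable names are illustrative, and the share of labeled nodes is a derived quantity that the description does not state.

```python
# Counts quoted in the dataset description.
num_nodes = 2_973_489
num_edges = 13_551_303
num_labeled = 1_165

# Average degree as edges per node; this matches the reported 4.5574.
avg_degree = num_edges / num_nodes

# Share of nodes that carry a label (derived here, not stated in the source).
labeled_fraction = num_labeled / num_nodes

print(round(avg_degree, 4))
print(f"{labeled_fraction:.4%}")
```

The very small labeled share is worth keeping in mind when framing this as a node classification problem, since almost all nodes are unlabeled.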
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset designed for phishing classification tasks in various data types.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset was created by Akash Kumar
Released under CC0: Public Domain
https://www.enterpriseappstoday.com/privacy-policy
This dataset comprises high-quality, targeted spear-phishing emails created using a proprietary system that harnesses the power of LLMs and knowledge graphs. The primary purpose of releasing this dataset is to promote and facilitate further research in the field of spear-phishing detection.
We anticipate that LLM-generated spear-phishing attacks will soon gain prominence and potentially surpass traditional phishing campaigns, which current detection solutions are designed to identify.
A 2022 survey of working adults and IT security professionals worldwide found that electronics manufacturers showed the highest failure rate for phishing attack simulations, 14 percent. The aerospace and mining companies followed, with a 13 percent failure rate. Legal companies showed the lowest failure rate, down from 11 percent in 2021.
The data set is provided both as a text file and as a CSV file, and it provides the following resources that can be used as inputs for model building:
A collection of website URLs for 11000+ websites. Each sample has 30 website parameters and a class label identifying it as a phishing website or not (1 or -1).
The code template contains these code blocks: a. Import modules (Part 1) b. Load data function + input/output field descriptions
The data set also serves as an input for project scoping and tries to specify the functional and non-functional requirements for it.
You are expected to write the code for a binary classification model (phishing website or not) using Python Scikit-Learn that trains on the data and calculates an accuracy score on the test data. You have to use one or more of the classification algorithms to train a model on the phishing website data set.
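A minimal sketch of the kind of script this task describes might look as follows. It does not use the real file: the data are synthetic stand-ins with 30 ternary features and labels in {1, -1}, the labelling rule is invented so that there is something to learn, and the choice of a decision tree is only one of many scikit-learn classifiers that could be used.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# ASSUMPTION: synthetic stand-in for the real data (30 features per sample,
# each in {-1, 0, 1}); replace with the loaded CSV in practice.
rng = np.random.default_rng(0)
X = rng.choice([-1, 0, 1], size=(2000, 30))

# Invented labelling rule for illustration: class 1 if the first three
# features sum to a positive value, otherwise class -1.
y = np.where(X[:, :3].sum(axis=1) > 0, 1, -1)

# Hold out 20% of the samples for testing, keeping class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Train a classifier and score it on the held-out test data.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {accuracy:.3f}")
```

Swapping DecisionTreeClassifier for another estimator (for example a logistic regression or random forest) leaves the rest of the script unchanged, which makes it easy to compare algorithms on the same split.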
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Phishing Dataset for Machine Learning’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/shashwatwork/phishing-dataset-for-machine-learning on 29 August 2021.
--- Dataset description provided by original source is as follows ---
Anti-phishing refers to efforts to block phishing attacks. Phishing is a kind of cybercrime where attackers pose as known or trusted entities and contact individuals through email, text or telephone and ask them to share sensitive information. Typically, in a phishing email attack, the message will suggest that there is a problem with an invoice, that there has been suspicious activity on an account, or that the user must log in to verify an account or password. Users may also be prompted to enter credit card information or bank account details as well as other sensitive data. Once this information is collected, attackers may use it to access accounts, steal data and identities, and download malware onto the user’s computer.
This dataset contains 48 features extracted from 5000 phishing webpages and 5000 legitimate webpages, which were downloaded from January to May 2015 and from May to June 2017. An improved feature extraction technique is employed by leveraging the browser automation framework (i.e., Selenium WebDriver), which is more precise and robust compared to the parsing approach based on regular expressions.
Anti-phishing researchers and experts may find this dataset useful for phishing features analysis, conducting rapid proof of concept experiments or benchmarking phishing classification models.
Source of the dataset: Tan, Choon Lin (2018), “Phishing Dataset for Machine Learning: Feature Evaluation”, Mendeley Data, V1, doi: 10.17632/h3cgnj8hft.1
--- Original source retains full ownership of the source dataset ---
This dataset was created by Me_Rahul_K
Surveys of working adults and IT security professionals worldwide conducted in 2021 and 2023 found that the share of organizations experiencing severe consequences due to a successful cyber attack had declined. In 2023, the share of enterprises experiencing a breach of customer or client data was 29 percent, down from 44 percent in 2022. Ransomware infections that occurred through e-mail were common for 32 percent of the respondents in 2023. Cases of a credential or account compromise occurred in 27 percent of the organizations in 2023, a decrease of 25 percent compared to the year prior.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Phishing is a cybercrime in which deceitful websites lure naive users and trick them into disclosing confidential information, such as social media passwords or financial data. This phishing dataset can be used for training supervised or semi-supervised phishing detection models.
The dataset contains 38,800 URLs that have been classified as either phishing or benign.
Mowar, Peya, & Jain, Mini. (2021, December 28). Phishing and Benign Websites Dataset. 2021 International Conference on Cyber Situational Awareness, Data Analytics and Assessment (CyberSA), Dublin, Ireland. https://doi.org/10.5281/zenodo.5807622
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains 48 features extracted from 5000 phishing webpages and 5000 legitimate webpages, which were downloaded from January to May 2015 and from May to June 2017. An improved feature extraction technique is employed by leveraging the browser automation framework (i.e., Selenium WebDriver), which is more precise and robust compared to the parsing approach based on regular expressions. This dataset is WEKA-ready.
Phishing webpage source: PhishTank, OpenPhish
Legitimate webpage source: Alexa, Common Crawl
Anti-phishing researchers and experts may find this dataset useful for phishing features analysis, conducting rapid proof of concept experiments or benchmarking phishing classification models.
Although many articles about predicting phishing websites have been disseminated, no reliable training dataset has previously been published publicly, maybe because there is no agreement in the literature on the definitive features that characterize phishing webpages; hence it is difficult to shape a dataset that covers all possible features. This dataset was collected mainly from the PhishTank archive, the MillerSmiles archive, and Google's search operators.
Data Set Characteristics: N/A
Number of Instances: 2456
Area: Computer Security
Attribute Characteristics: Integer
Number of Attributes: 30
Date Donated: 2015-03-26
Associated Tasks: Classification
Missing Values? N/A
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Data and code to generate figures from Hakim et al., "Evaluating the cognitive mechanisms of phishing detection with PEST, an ecologically valid lab-based measure of phishing susceptibility".
NOTE: Figure 4 requires data from the original PHIT task. These are available online at XXX
Data files are csv files. Naming has the following form: scamdata_SUBJECTNUMBER_DATETIME_AGE_GENDER.dat, e.g. scamdata_1_10Oct2018090103_18_F.dat
Each datafile has 7 columns:
- userId : subject response (1 - safe with high confidence, 2 - safe with low confidence, 3 - scam with low confidence, 4 - scam with high confidence)
- reactTime : reaction time in seconds
- category : PHIT Email Category (and custom categories for pooled scam/safe emails)
- type : weapon of influence (for PHIT emails only)
- hasAtt : binary indicating whether email has an attachment
- realID : real email identifier (scam or safe)
- emailCode : unique ID of each email - used to locate specific emails within excel files
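The file naming scheme above can be unpacked with a small parser, sketched here. The regular expression and the helper name parse_filename are illustrative, and reading the DATETIME part as day, abbreviated month, year, then HHMMSS is an assumption based only on the single example filename.

```python
import re
from datetime import datetime

# Pattern for scamdata_SUBJECTNUMBER_DATETIME_AGE_GENDER.dat
FILENAME_RE = re.compile(
    r"^scamdata_(?P<subject>\d+)_(?P<stamp>[^_]+)_(?P<age>\d+)_(?P<gender>[^_.]+)\.dat$"
)


def parse_filename(name):
    """Split a data file name into subject number, timestamp, age and gender."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    return {
        "subject": int(match["subject"]),
        # ASSUMPTION: timestamp layout is DDMonYYYYHHMMSS, e.g. 10Oct2018090103.
        "datetime": datetime.strptime(match["stamp"], "%d%b%Y%H%M%S"),
        "age": int(match["age"]),
        "gender": match["gender"],
    }


info = parse_filename("scamdata_1_10Oct2018090103_18_F.dat")
print(info)
```

Parsing the metadata from the name this way makes it possible to attach subject, age and gender to each row when the per-subject files are concatenated for analysis.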
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The provided dataset includes 11430 URLs with 87 extracted features. The dataset is designed to be used as a benchmark for machine learning based phishing detection systems. Features are from three different classes: 56 extracted from the structure and syntax of URLs, 24 extracted from the content of their corresponding pages, and 7 extracted by querying external services. The dataset is balanced: it contains exactly 50% phishing and 50% legitimate URLs. Along with the dataset, we provide the Python scripts used for the extraction of the features, for potential replication or extension. The datasets were constructed in May 2020.
dataset_A: contains a list of URLs together with their DOM tree objects, which can be used for replication and for experimenting with new URL and content-based features, overcoming the short lifetime of phishing web pages.
dataset_B: contains the extracted feature values, which can be used directly as input to classifiers for examination. Note that the data in this dataset are indexed with URLs, so one needs to remove the index before experimentation.
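The note about removing the URL index can be illustrated with a small sketch using only the standard library. The column names (url, length_url, nb_dots, status) and the two rows are hypothetical stand-ins, not taken from the actual file; only the idea of dropping the URL column before feeding features to a classifier comes from the description.

```python
import csv
import io

# Hypothetical miniature of dataset_B: a URL index, two features, a label.
raw = io.StringIO(
    "url,length_url,nb_dots,status\n"
    "http://example.com/a,20,1,legitimate\n"
    "http://phish.example.net/b,26,2,phishing\n"
)

features, labels = [], []
for row in csv.DictReader(raw):
    row.pop("url")  # the URL is only an index, not a model input
    labels.append(1 if row.pop("status") == "phishing" else 0)
    features.append([float(value) for value in row.values()])

print(features)
print(labels)
```

Leaving the URL column in place would either break numeric estimators or let a model memorise individual URLs, so dropping it first is the safer default.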