Context Malicious URLs or malicious website is a very serious threat to cybersecurity. Malicious URLs host unsolicited content (spam, phishing, drive-by downloads, etc.) and lure unsuspecting users to become victims of scams (monetary loss, theft of private information, and malware installation), and cause losses of billions of dollars every year. We have collected this dataset to include a large number of examples of Malicious URLs so that a machine learning-based model can be developed to identify malicious urls so that we can stop them in advance before infecting computer system or spreading through inteinternet.
Content we have collected a huge dataset of 651,191 URLs, out of which 428103 benign or safe URLs, 96457 defacement URLs, 94111 phishing URLs, and 32520 malware URLs. Figure 2 depicts their distribution in terms of percentage. As we know one of the most crucial tasks is to curate the dataset for a machine learning project. We have curated this dataset from five different sources.
For collecting benign, phishing, malware and defacement URLs we have used URL dataset (ISCX-URL-2016) For increasing phishing and malware URLs, we have used Malware domain black list dataset. We have increased benign URLs using faizan git repo At last, we have increased more number of phishing URLs using Phishtank dataset and PhishStorm dataset As we have told you that dataset is collected from different sources. So firstly, we have collected the URLs from different sources into a separate data frame and finally merge them to retain only URLs and their class type.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The goal of our research is to identify malicious advertisement URLs and to apply adversarial attack on ensembles. We extract lexical and web-scrapped features from using python code. And then 4 machine learning algorithms are applied for the classification process and then used the K-Means clustering for the visual understanding. We check the vulnerability of the models by the adversarial examples. We applied Zeroth Order Optimization adversarial attack on the models and compute the attack accuracy.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary of previous works on malicious URL detection.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains DNS records, IP-related features, WHOIS/RDAP information, information from TLS handshakes and certificates, and GeoIP information for 368,956 benign domains from Cisco Umbrella, 461,338 benign domains from the actual CESNET network traffic, 164,425 phishing domains from PhishTank and OpenPhish services, and 100,809 malware domains from various sources like ThreatFox, The Firebog, MISP threat intelligence platform, and other sources. The ground truth for the phishing dataset was double-check with the VirusTotal (VT) service. Domain names not considered malicious by VT have been removed from phishing and malware datasets. Similarly, benign domain names that were considered risky by VT have been removed from the benign datasets. The data was collected between March 2023 and July 2024. The final assessment of the data was conducted in August 2024.
The dataset is useful for cybersecurity research, e.g. statistical analysis of domain data or feature extraction for training machine learning-based classifiers, e.g. for phishing and malware website detection.
The data is located in the following individual files:
Both files contain a JSON array of records generated using mongoexport. The following table documents the structure of a record. Please note that:
Field name |
Field type |
Nullable |
Description |
domain_name |
String |
No |
The evaluated domain name |
url |
String |
No |
The source URL for the domain name |
evaluated_on |
Date |
No |
Date of last collection attempt |
source |
String |
No |
An identifier of the source |
sourced_on |
Date |
No |
Date of ingestion of the domain name |
dns |
Object |
Yes |
Data from DNS scan |
rdap |
Object |
Yes |
Data from RDAP or WHOIS |
tls |
Object |
Yes |
Data from TLS handshake |
ip_data |
Array of Objects |
Yes |
Array of data objects capturing the IP addresses related to the domain name |
DNS data (dns field) | |||
A |
Array of Strings |
No |
Array of IPv4 addresses |
AAAA |
Array of Strings |
No |
Array of IPv6 addresses |
TXT |
Array of Strings |
No |
Array of raw TXT values |
CNAME |
Object |
No |
The CNAME target and related IPs |
MX |
Array of Objects |
No |
Array of objects with the MX target hostname, priority and related IPs |
NS |
Array of Objects |
No |
Array of objects with the NS target hostname and related IPs |
SOA |
Object |
No |
All the SOA fields, present if found at the target domain name |
zone_SOA |
Object |
No |
The SOA fields of the target’s zone (closest point of delegation), present if found and not a record in the target domain directly |
dnssec |
Object |
No |
Flags describing the DNSSEC validation result for each record type |
ttls |
Object |
No |
The TTL values for each record type |
remarks |
Object |
No |
The zone domain name and DNSSEC flags |
RDAP data (rdap field) | |||
copyright_notice |
String |
No |
RDAP/WHOIS data usage copyright notice |
dnssec |
Bool |
No |
DNSSEC presence flag |
entitites |
Object |
No |
An object with various arrays representing the found related entity types (e.g. abuse, admin, registrant). The arrays contain objects describing the individual entities. |
expiration_date |
Date |
Yes |
The current date of expiration |
handle |
String |
No |
RDAP handle |
last_changed_date |
Date |
Yes |
The date when the domain was last changed |
name |
String |
No |
The target domain name for which the data in this object are stored |
nameservers |
Array of Strings |
No |
Nameserver hostnames provided by RDAP or WHOIS |
registration_date |
Date |
Yes |
First registration date |
status |
Array of Strings |
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Web applications are important for various online businesses and operations because of their platform stability and low operation cost. The increasing usage of Internet-of-Things (IoT) devices within a network has contributed to the rise of network intrusion issues due to malicious Uniform Resource Locators (URLs). Generally, malicious URLs are initiated to promote scams, attacks, and frauds which can lead to high-risk intrusion. Several methods have been developed to detect malicious URLs in previous works. There has been a good amount of work done to detect malicious URLs using various methods such as random forest, regression, LightGBM, and more as reported in the literature. However, most of the previous works focused on the binary classification of malicious URLs and are tested on limited URL datasets. Nevertheless, the detection of malicious URLs remains a challenging task that remains open to research. Hence, this work proposed a stacking-based ensemble classifier to perform multi-class classification of malicious URLs on larger URL datasets to justify the robustness of the proposed method. This study focuses on obtaining lexical features directly from the URL to identify malicious websites. Then, the proposed stacking-based ensemble classifier is developed by integrating Random Forest, XGBoost, LightGBM, and CatBoost. In addition, hyperparameter tuning was performed using the Randomized Search method to optimize the proposed classifier. The proposed stacking-based ensemble classifier aims to take advantage of the performance of each machine learning model and aggregate the output to improve prediction accuracy. The classification accuracies of the machine learning model when applied individually are 93.6%, 95.2%, 95.7% and 94.8% for random forest, XGBoost, LightGBM, and CatBoost respectively. The proposed stacking-based ensemble classifier has shown significant results in classifying four classes of malicious URLs (phishing, malware, defacement, and benign) with an average accuracy of 96.8% when benchmarked with previous works.
https://www.archivemarketresearch.com/privacy-policyhttps://www.archivemarketresearch.com/privacy-policy
The global market for suspicious file and URL analysis is experiencing robust growth, projected to reach $88 million in 2025 and maintain a Compound Annual Growth Rate (CAGR) of 6.4% from 2025 to 2033. This expansion is driven by the escalating sophistication of cyber threats, the increasing reliance on digital infrastructure across various sectors, and the growing need for proactive security measures to mitigate risks associated with malicious files and URLs. The market's segmentation reveals a strong preference for cloud-based solutions, offering scalability and accessibility to organizations of all sizes. Large enterprises are the primary consumers, reflecting their higher vulnerability to advanced cyberattacks and their greater capacity for investment in robust security solutions. However, the market is also seeing significant adoption among SMEs, driven by the increasing affordability and ease of use of cloud-based solutions and a rising awareness of the risks associated with malicious online content. Several factors contribute to market growth. The development and proliferation of advanced malware necessitates continuous improvement in threat detection and analysis capabilities. Furthermore, the expanding attack surface due to remote work and the increasing use of IoT devices are contributing to a heightened demand for effective file and URL analysis tools. Regulatory compliance requirements, particularly within sensitive industries like finance and healthcare, further incentivize organizations to invest in these solutions. Conversely, challenges such as the emergence of obfuscated malware, the high cost of advanced solutions, and the need for specialized expertise pose some restraints to broader market penetration. The competitive landscape is diverse, with established cybersecurity players and innovative startups offering a range of solutions catering to specific needs and budgets. This competitive pressure is ultimately beneficial for consumers, driving innovation and fostering a more efficient and effective market for suspicious file and URL analysis.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by kianindeed
Released under MIT
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Web applications are important for various online businesses and operations because of their platform stability and low operation cost. The increasing usage of Internet-of-Things (IoT) devices within a network has contributed to the rise of network intrusion issues due to malicious Uniform Resource Locators (URLs). Generally, malicious URLs are initiated to promote scams, attacks, and frauds which can lead to high-risk intrusion. Several methods have been developed to detect malicious URLs in previous works. There has been a good amount of work done to detect malicious URLs using various methods such as random forest, regression, LightGBM, and more as reported in the literature. However, most of the previous works focused on the binary classification of malicious URLs and are tested on limited URL datasets. Nevertheless, the detection of malicious URLs remains a challenging task that remains open to research. Hence, this work proposed a stacking-based ensemble classifier to perform multi-class classification of malicious URLs on larger URL datasets to justify the robustness of the proposed method. This study focuses on obtaining lexical features directly from the URL to identify malicious websites. Then, the proposed stacking-based ensemble classifier is developed by integrating Random Forest, XGBoost, LightGBM, and CatBoost. In addition, hyperparameter tuning was performed using the Randomized Search method to optimize the proposed classifier. The proposed stacking-based ensemble classifier aims to take advantage of the performance of each machine learning model and aggregate the output to improve prediction accuracy. The classification accuracies of the machine learning model when applied individually are 93.6%, 95.2%, 95.7% and 94.8% for random forest, XGBoost, LightGBM, and CatBoost respectively. The proposed stacking-based ensemble classifier has shown significant results in classifying four classes of malicious URLs (phishing, malware, defacement, and benign) with an average accuracy of 96.8% when benchmarked with previous works.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Phishing URL Classification Dataset
This dataset contains URLs labeled as 'Safe' (0) or 'Not Safe' (1) for phishing detection tasks.
Dataset Summary
This dataset contains URLs labeled for phishing detection tasks. It's designed to help train and evaluate models that can identify potentially malicious URLs.
Dataset Creation
The dataset was synthetically generated using a custom script that creates both legitimate and potentially phishing URLs. This approach… See the full description on the dataset page: https://huggingface.co/datasets/imanoop7/phishing_url_classification.
This dataset was created by Shreeshail Chavan
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Web has long become a major platform for online criminal activities. URLs are used as the main vehicle in this domain. To counter this issues security community focused its efforts on developing techniques for mostly blacklisting of malicious URLs.
https://www.datainsightsmarket.com/privacy-policyhttps://www.datainsightsmarket.com/privacy-policy
The market for suspicious file and URL analysis is experiencing robust growth, driven by the escalating sophistication of cyber threats and the increasing reliance on digital infrastructure across various sectors. The $92 million market size in 2025, coupled with a compound annual growth rate (CAGR) of 6.7%, projects substantial expansion to approximately $150 million by 2033. This growth is fueled by several key factors. The rising frequency and severity of ransomware attacks, phishing campaigns, and malware distribution necessitate robust security solutions for proactive threat detection and response. Furthermore, the expanding adoption of cloud-based services and the increasing interconnectedness of devices amplify the attack surface, thereby increasing the demand for advanced file and URL analysis capabilities. The growing awareness of data privacy regulations, such as GDPR and CCPA, also incentivizes organizations to enhance their security posture and invest in solutions that can effectively identify and mitigate potential threats. The market landscape is highly competitive, with a diverse range of players, from established cybersecurity giants like CrowdStrike, McAfee, and Symantec to specialized providers like Any.Run and Joe Sandbox. The market's segmentation likely includes solutions based on different analysis techniques (static, dynamic, sandbox-based), deployment models (cloud, on-premise), and target users (enterprise, SMB, individuals). While the provided data lacks regional specifics, it's reasonable to expect that North America and Europe will initially dominate market share, given their advanced cybersecurity infrastructure and high rates of digital adoption. However, emerging markets in Asia-Pacific and Latin America are poised for significant growth in the coming years, driven by rising digital literacy and economic expansion. Competition will intensify as vendors strive to offer innovative features, such as AI-powered threat detection and improved integration with existing security ecosystems.
In 2023, the total detection cases of web-based malware sites in South Korea amounted to roughly 12.7 thousand, a slight decrease compared to the previous year. The highest number of detected web-based malware sites in South Korea was 47,703 cases in 2014. The type of web-based malware sites was comprised of distribution sites and staging sties.
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
PhishingURLDataset
This dataset is created for being used for neural network training, on phishing website detection. It has been generated using this raw template.
Dataset Details
This dataset contains phishing websites, which are labeled with "1" and are called "malignant", and benign websites, which are labeled with "0".
Dataset Sources
Kaggle Dataset on Phishing URLs: https://www.kaggle.com/datasets/siddharthkumar25/malicious-and-benign-urls USOM Phishing… See the full description on the dataset page: https://huggingface.co/datasets/semihGuner2002/PhishingURLsDataset.
In 2022, around 42 thousand different phishing URLs with a .SG domain were detected in Singapore. This was a slight decrease compared to the previous year.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains 44,000 512x512 pixels images containing different malicious payloads, i.e., JavaScript, HTML, PowerShell, URLs, and ethereum addresses, embedded via the Least Significant Bit (LSB) technique. The payloads are selected to fit in the first bit of each color channel, i.e., max 512x512x3 bits.
Dataset Information:
malicious JavaScripts:4600 malicious HTML:4600 malicious PS:4600 malicious URL:4600 malicious ethereum:4600 clean:4600
The main resource for this dataset is https://www.kaggle.com/datasets/marcozuppelli/stegoimagesdataset
In the second half of 2021, websites regarding manufacturing were the most common websites to be targeted by malicious URL redirections, with 39 percent of detected cases being found on these sites. Although manufacturing websites have been a common target for malware attacks before, finds on these sites have largely increased compared to the first half of the year, which recorded around 23 percent of cases redirecting through that industry.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset used in "Selvi, J., Rodríguez, R. J., & Soria-Olivas, E. (2019). Detection of algorithmically generated malicious domain names using masked N-grams. Expert Systems with Applications, 124, 156-163."
A docker container with the code used in this work was also shared: https://github.com/jselvi/docker-r-masked-ngrams
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The 22 lexical features were extracted using the feature engineering approach.
An October 2023 phishing simulation carried out at worldwide organizations found that *** percent of employees submitted passwords in the form embedded in the malicious webpage. On the other hand, *** percent of them clicked only the link, and **** percent did not click the link.
Context Malicious URLs or malicious website is a very serious threat to cybersecurity. Malicious URLs host unsolicited content (spam, phishing, drive-by downloads, etc.) and lure unsuspecting users to become victims of scams (monetary loss, theft of private information, and malware installation), and cause losses of billions of dollars every year. We have collected this dataset to include a large number of examples of Malicious URLs so that a machine learning-based model can be developed to identify malicious urls so that we can stop them in advance before infecting computer system or spreading through inteinternet.
Content we have collected a huge dataset of 651,191 URLs, out of which 428103 benign or safe URLs, 96457 defacement URLs, 94111 phishing URLs, and 32520 malware URLs. Figure 2 depicts their distribution in terms of percentage. As we know one of the most crucial tasks is to curate the dataset for a machine learning project. We have curated this dataset from five different sources.
For collecting benign, phishing, malware and defacement URLs we have used URL dataset (ISCX-URL-2016) For increasing phishing and malware URLs, we have used Malware domain black list dataset. We have increased benign URLs using faizan git repo At last, we have increased more number of phishing URLs using Phishtank dataset and PhishStorm dataset As we have told you that dataset is collected from different sources. So firstly, we have collected the URLs from different sources into a separate data frame and finally merge them to retain only URLs and their class type.