MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Context: Malicious URLs, or malicious websites, are a serious threat to cybersecurity. Malicious URLs host unsolicited content (spam, phishing, drive-by downloads, etc.), lure unsuspecting users into scams (monetary loss, theft of private information, and malware installation), and cause losses of billions of dollars every year. We collected this dataset to provide a large number of examples of malicious URLs so that a machine-learning-based model can be developed to identify malicious URLs and stop them in advance, before they infect computer systems or spread through the internet.
Content: We have collected a large dataset of 651,191 URLs: 428,103 benign (safe) URLs, 96,457 defacement URLs, 94,111 phishing URLs, and 32,520 malware URLs. Figure 2 depicts their distribution in terms of percentage. Curating the dataset is one of the most crucial tasks in a machine learning project; we curated this dataset from five different sources.
For collecting benign, phishing, malware, and defacement URLs we used the URL dataset (ISCX-URL-2016). To increase the number of phishing and malware URLs, we used the Malware Domain Blacklist dataset. We added more benign URLs from the faizan git repo, and finally added more phishing URLs from the PhishTank and PhishStorm datasets. Since the dataset is collected from different sources, we first gathered the URLs from each source into a separate data frame and then merged them, retaining only the URLs and their class type.
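The merge step described above can be sketched with pandas; the frame names and example URLs below are illustrative stand-ins, not the actual source data:

```python
import pandas as pd

# Hypothetical per-source frames; the real source collections may
# use different column names before unification.
iscx = pd.DataFrame({"url": ["http://a.example", "http://b.example"],
                     "type": ["benign", "phishing"]})
phishtank = pd.DataFrame({"url": ["http://c.example"], "type": ["phishing"]})

# Merge all sources, keep only the URL and its class type, drop duplicates.
merged = (pd.concat([iscx, phishtank], ignore_index=True)
            [["url", "type"]]
            .drop_duplicates(subset="url")
            .reset_index(drop=True))
print(len(merged))  # 3
```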
This dataset was created to form a balanced URL dataset with the same number of unique benign and malicious URLs: 632,508 unique URLs in total.
The creation of the dataset has involved 2 different datasets from Kaggle which are as follows:
First dataset: 450,176 URLs, of which 77% are benign and 23% malicious. Can be found here: https://www.kaggle.com/datasets/siddharthkumar25/malicious-and-benign-urls
Second dataset: 651,191 URLs, of which 428,103 are benign (safe), 96,457 defacement, 94,111 phishing, and 32,520 malware URLs. Can be found here: https://www.kaggle.com/datasets/sid321axn/malicious-urls-dataset
To create the balanced dataset, the first dataset served as the base; malicious URLs from the second dataset were added, and the extra benign URLs were then removed to keep the classes balanced. The columns were unified and duplicates removed so that only unique instances remain.
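A minimal pandas sketch of that balancing procedure, using toy stand-ins for the two Kaggle datasets (the URLs and counts here are invented for illustration):

```python
import pandas as pd

# Toy stand-ins for the two datasets after unifying the columns.
main = pd.DataFrame({"url": [f"http://b{i}.example" for i in range(6)] +
                            ["http://m0.example"],
                     "label": ["benign"] * 6 + ["malicious"]})
extra_malicious = pd.DataFrame({"url": ["http://m1.example", "http://m2.example"],
                                "label": ["malicious"] * 2})

# 1) add the extra malicious URLs, 2) drop duplicate URLs,
# 3) downsample the benign class to match the malicious count.
df = pd.concat([main, extra_malicious], ignore_index=True).drop_duplicates("url")
n_mal = (df["label"] == "malicious").sum()
balanced = pd.concat([df[df["label"] == "benign"].sample(n=n_mal, random_state=0),
                      df[df["label"] == "malicious"]], ignore_index=True)
print(balanced["label"].value_counts().to_dict())  # {'benign': 3, 'malicious': 3}
```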
For more information about the collection of the URLs themselves, please refer to the mentioned datasets above.
All the URLs are in one .csv file with 3 columns: 1. 'url' — the URL itself; 2. 'label' — the class of the URL, whether 'benign' or 'malicious'; 3. 'result' — the same class encoded numerically (0 is benign and 1 is malicious).
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
The dataset contains DNS records, IP-related features, WHOIS/RDAP information, information from TLS handshakes and certificates, and GeoIP information for 368,956 benign domains from Cisco Umbrella, 461,338 benign domains from actual CESNET network traffic, 164,425 phishing domains from the PhishTank and OpenPhish services, and 100,809 malware domains from various sources, including ThreatFox, The Firebog, and the MISP threat intelligence platform. The ground truth for the phishing dataset was double-checked with the VirusTotal (VT) service: domain names not considered malicious by VT were removed from the phishing and malware datasets, and benign domain names that VT considered risky were removed from the benign datasets. The data was collected between March 2023 and July 2024; the final assessment of the data was conducted in August 2024.
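The VirusTotal cross-check amounts to a simple filtering step; a sketch follows, where the `vt_hits` verdict counts and the domain names are invented for illustration (the actual assessment used the VT API):

```python
# Hypothetical VT verdicts: domain -> number of engines flagging it.
vt_hits = {"bad1.example": 7, "bad2.example": 0,
           "good1.example": 0, "good2.example": 3}

phishing = ["bad1.example", "bad2.example"]
benign = ["good1.example", "good2.example"]

# Keep only phishing/malware domains that VT also considers malicious,
# and drop benign domains that VT flags as risky.
phishing_clean = [d for d in phishing if vt_hits.get(d, 0) > 0]
benign_clean = [d for d in benign if vt_hits.get(d, 0) == 0]
print(phishing_clean, benign_clean)  # ['bad1.example'] ['good1.example']
```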
The dataset is useful for cybersecurity research, e.g. statistical analysis of domain data or feature extraction for training machine-learning-based classifiers for phishing and malware website detection.
The data is located in the following individual files:
Both files contain a JSON array of records generated using mongoexport. The following table documents the structure of a record. Please note that:
| Field name | Field type | Nullable | Description |
|---|---|---|---|
| domain_name | String | No | The evaluated domain name |
| url | String | No | The source URL for the domain name |
| evaluated_on | Date | No | Date of last collection attempt |
| source | String | No | An identifier of the source |
| sourced_on | Date | No | Date of ingestion of the domain name |
| dns | Object | Yes | Data from DNS scan |
| rdap | Object | Yes | Data from RDAP or WHOIS |
| tls | Object | Yes | Data from TLS handshake |
| ip_data | Array of Objects | Yes | Array of data objects capturing the IP addresses related to the domain name |
| DNS data (dns field) | | | |
| A | Array of Strings | No | Array of IPv4 addresses |
| AAAA | Array of Strings | No | Array of IPv6 addresses |
| TXT | Array of Strings | No | Array of raw TXT values |
| CNAME | Object | No | The CNAME target and related IPs |
| MX | Array of Objects | No | Array of objects with the MX target hostname, priority and related IPs |
| NS | Array of Objects | No | Array of objects with the NS target hostname and related IPs |
| SOA | Object | No | All the SOA fields, present if found at the target domain name |
| zone_SOA | Object | No | The SOA fields of the target's zone (closest point of delegation), present if found and not a record in the target domain directly |
| dnssec | Object | No | Flags describing the DNSSEC validation result for each record type |
| ttls | Object | No | The TTL values for each record type |
| remarks | Object | No | The zone domain name and DNSSEC flags |
| RDAP data (rdap field) | | | |
| copyright_notice | String | No | RDAP/WHOIS data usage copyright notice |
| dnssec | Bool | No | DNSSEC presence flag |
| entitites | Object | No | An object with various arrays representing the found related entity types (e.g. abuse, admin, registrant); the arrays contain objects describing the individual entities |
| expiration_date | Date | Yes | The current date of expiration |
| handle | String | No | RDAP handle |
| last_changed_date | Date | Yes | The date when the domain was last changed |
| name | String | No | The target domain name for which the data in this object are stored |
| nameservers | Array of Strings | No | Nameserver hostnames provided by RDAP or WHOIS |
| registration_date | Date | Yes | First registration date |
| status | Array of Strings | | |
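A record exported this way can be read with the standard `json` module. The sample record below is a minimal, hand-written illustration of the structure documented above (most fields elided, `source` value invented); nullable fields such as `dns`, `rdap`, and `tls` may be null and should be guarded before access:

```python
import json

# A minimal record in the documented shape (most fields elided).
raw = '''[
  {"domain_name": "example.org",
   "url": "http://example.org/",
   "source": "benign_umbrella",
   "dns": {"A": ["93.184.216.34"], "AAAA": []},
   "rdap": null,
   "tls": null,
   "ip_data": []}
]'''

records = json.loads(raw)
for rec in records:
    # Nullable fields may be null, so guard before accessing sub-keys.
    a_records = rec["dns"]["A"] if rec["dns"] else []
    print(rec["domain_name"], a_records)
```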
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
Description: This Kaggle Dataset focuses on Malicious URL detection, aiming to facilitate research and development in the field of cybersecurity and threat detection.
The dataset includes a diverse set of URLs, both benign and malicious, accompanied by an extensive list of calculated features. These features cover various aspects of URLs, providing a holistic view to aid in the identification of potential threats.
Key feature types:
Researchers and data scientists can leverage this dataset to develop and benchmark URL detection models, employing various machine learning and deep learning techniques. The abundance of calculated features offers a rich resource for exploring novel approaches to enhance the accuracy and robustness of URL classification systems. Additionally, the dataset encourages collaboration and the sharing of insights within the cybersecurity community.
Malicious websites are of great concern because it is impractical to analyze them one by one and index each URL in a blacklist. Unfortunately, there is a lack of datasets with malicious and benign web characteristics. This dataset is a research product of my bachelor students, which aims to fill this gap.
This is the first version of the dataset from our web security project; we are working to improve its results.
The project consisted of evaluating different classification models to predict malicious and benign websites based on application-layer and network characteristics. The data were obtained from different verified sources of benign and malicious URLs, using a low-interaction client honeypot to isolate network traffic. We used additional tools, such as Whois, to obtain other information such as the server country.
This is the first version, and we have some initial results from applying machine learning classifiers in a bachelor thesis. Further details on the data-processing methodology and the data description can be found in the article below.
This is an important topic and one of the most difficult to process. Following other articles and open resources, we used three blacklists:
+ machinelearning.inginf.units.it/data-andtools/hidden-fraudulent-urls-dataset
+ malwaredomainlist.com
+ zeuztacker.abuse.ch
From them we obtained around 185,181 URLs, which we assume to be malicious based on the sources' information; as a next research step, we recommend verifying them through another security tool, such as VirusTotal.
We obtained the benign URLs (345,000) from https://github.com/faizann24/Using-machinelearning-to-detect-malicious-URLs.git; as in the previous step, verification through other security systems is also recommended.
First, we wrote several Python scripts to systematically analyze each URL and generate its information (in the coming months we will release them to the open source community on GitHub).
We first verified that each URL was reachable using Python libraries (such as requests). We started with around 530,181 samples; after this filtering step, 63,191 URLs remained.
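That availability check can be sketched as follows. The checker is made injectable so the filtering logic can be demonstrated without network access; the URLs shown are placeholders, and a real run of `is_alive` of course needs network connectivity:

```python
from urllib import request, error

def is_alive(url, timeout=5):
    """Return True if the URL responds; a real run needs network access."""
    try:
        with request.urlopen(url, timeout=timeout):
            return True
    except (error.URLError, ValueError):
        return False

def filter_alive(urls, check=is_alive):
    # `check` is injectable so the filtering step can be tested offline.
    return [u for u in urls if check(u)]

# Offline demonstration with a stubbed checker:
sample = ["http://up.example", "http://down.example"]
kept = filter_alive(sample, check=lambda u: "up" in u)
print(kept)  # ['http://up.example']
```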
Figure: Framework to detect malicious websites
During the research we found that one way to study a malicious website is to analyze features from its application layer and network layer; to obtain them, the idea is to apply both dynamic and static analysis.
For the dynamic analysis, some articles used high-interaction web application honeypots, but these resources have not been updated in recent months, so some important vulnerabilities may not have been mapped.
If your papers or other works use our dataset, please cite our pap...
https://creativecommons.org/publicdomain/zero/1.0/
Terms: https://webtechsurvey.com/terms
A complete list of live websites affected by CVE-2020-11008, compiled through global website indexing conducted by WebTechSurvey.
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
Wget provenance data in edge-list format, parsed from CamFlow provenance data. This dataset contains attack wget base-graph data. Experiments ran for over an hour, with recurrent wget commands issued throughout (one every 120 seconds). Background activity was also captured, as CamFlow whole-system provenance was turned on. Several malicious URLs were fetched during each experimental session. Five attack experiments were recorded, each with a different mixture of normal benign wget operations. The provenance data was in JSON format and converted into edge-list format for the Unicorn IDS research project. Conversion date was Sept. 26th, 2018. Each experiment consists of a base and a streaming graph component.
This dataset was created using Python code to generate QR codes from the REAL list of URLs provided in the following Kaggle dataset: https://www.kaggle.com/datasets/samahsadiq/benign-and-malicious-urls
The mentioned dataset consists of over 600,000 URLs; however, only the first 100,000 URLs from each class (benign and malicious) were used to generate the QR codes. In total, there are 200,000 QR code images in the dataset, each encoding a REAL URL.
This is a balanced dataset of version-2 QR codes. The 100,000 benign QR codes were generated in a single Python loop, and likewise for the malicious QR codes.
QR code images belonging to malicious URLs are under the 'malicious' folder, with 'malicious' in their file names; QR codes belonging to benign URLs are under the 'benign' folder, with 'benign' in their file names.
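Given that layout, the class label can be recovered from the parent folder name alone; a small sketch (the file names below are hypothetical examples, not actual files from the dataset):

```python
from pathlib import Path

def label_for(path):
    """Derive the class label from the folder a QR image sits in."""
    parent = Path(path).parent.name
    if parent not in ("benign", "malicious"):
        raise ValueError(f"unexpected folder: {parent}")
    return parent

print(label_for("malicious/malicious_00042.png"))  # malicious
print(label_for("benign/benign_00007.png"))        # benign
```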
NOTE: Keep in mind that the malicious QR codes encode REAL malicious URLs; it is not recommended to scan them manually or visit the encoded websites.
For more information about the encoded URLs, please refer to the Kaggle dataset mentioned above.
Terms: https://webtechsurvey.com/terms
A complete list of live websites affected by CVE-2020-5260, compiled through global website indexing conducted by WebTechSurvey.
The purpose of building a DGA classifier isn't specifically to take down botnets, but to discover and detect DGA use on our own networks or services. If you have a list of domains resolved and accessed at your organization, it is now possible to see which of those were potentially generated and used by malware.
The dataset consists of three sources (as described in the Data-Driven Security blog):
Alexa: For samples of legitimate domains, an obvious choice is the Alexa list of top web sites. But it's not ready for our use as-is. If you grab the top 1 million Alexa domains and parse the list, you'll find just over 11 thousand entries are full URLs rather than domains, and there are thousands of domains with subdomains that don't help us (we are only classifying domains here). After removing the URLs, de-duplicating the domains, and cleaning up, I ended up with the Alexa top 965,843.
"Real World" data from OpenDNS: After reading the post from Frank Denis at OpenDNS titled "Why Using Real World Data Matters For Building Effective Security Models", I grabbed their 10,000 top domains and their 10,000 random samples. Comparing these to the top Alexa domains, 6,901 of the top ten thousand and 893 of the random domains also appear in the Alexa data; I will clean that up when making the final training dataset.
DGA domains: The Click Security version wasn't very clear about where they got their bad domains, so I decided to collect my own, and this was rather fun. Because I work with some interesting characters (who know interesting characters), I was able to collect several data sets from recent botnets: "Cryptolocker", two separate "GameOver Zeus" algorithms, and an anonymous collection of malicious (and algorithmically generated) domains. In the end, I collected 73,598 algorithmically generated domains.
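The kind of clean-up applied to the Alexa list (stripping full URLs down to host names and de-duplicating) can be sketched as follows; the entries are invented for illustration, and note that collapsing subdomains down to registered domains would additionally require a public-suffix list, which this sketch omits:

```python
from urllib.parse import urlparse

# Illustrative mix of bare domains, a full URL, and a duplicate.
raw_entries = ["google.com", "http://example.com/path?q=1",
               "news.example.com", "google.com"]

def to_domain(entry):
    # Entries may be bare domains or full URLs; normalise both forms.
    host = urlparse(entry).netloc if "//" in entry else entry
    return host.lower()

# De-duplicate while preserving order (dict keys keep insertion order).
domains = list(dict.fromkeys(to_domain(e) for e in raw_entries))
print(domains)  # ['google.com', 'example.com', 'news.example.com']
```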
http://opendatacommons.org/licenses/dbcl/1.0/
This dataset contains 44,000 512x512 pixels images containing different malicious payloads, i.e., JavaScript, HTML, PowerShell, URLs, and ethereum addresses, embedded via the Least Significant Bit (LSB) technique. The payloads are selected to fit in the first bit of each color channel, i.e., max 512x512x3 bits.
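The LSB technique itself can be sketched on a flattened list of channel values. This toy example is not the LSB-Steganography tool used to build the dataset, just an illustration of the bit manipulation it performs (one payload bit written into the least significant bit of each channel value):

```python
def embed_lsb(pixels, payload_bits):
    """Write each payload bit into the least significant bit of a
    channel value; the payload must fit in len(pixels) bits."""
    assert len(payload_bits) <= len(pixels)
    out = list(pixels)
    for i, bit in enumerate(payload_bits):
        out[i] = (out[i] & ~1) | bit
    return out

def extract_lsb(pixels, n_bits):
    return [p & 1 for p in pixels[:n_bits]]

# Flattened channel values of a tiny "image" and a 4-bit payload.
channels = [200, 201, 202, 203, 204]
bits = [1, 0, 1, 1]
stego = embed_lsb(channels, bits)
print(extract_lsb(stego, 4))  # [1, 0, 1, 1]
```

Because only the lowest bit changes, each channel value moves by at most 1, which is why the embedding is visually imperceptible.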
Dataset information:
- the tool used for the LSB steganographic technique is LSB-Steganography;
- the 8,000 original 512x512 pixel images are borrowed from different open source repositories (and released under the GPL3 license);
- the malicious JavaScripts are borrowed from the JavaScript Malware Collection;
- the malicious JavaScripts obfuscated in HTML are borrowed from the Malicious Javascript Dataset;
- the malicious PowerShell scripts are borrowed from the PowerDrive repository;
- the malicious URLs are borrowed from the URLhaus database;
- the ethereum addresses are borrowed from the Ethereum-lists repository.
The train set is composed of 16,000 images containing the different payloads; the test and validation sets contain 8,000 images each. The test folder contains two additional test sets:
- stego_b64: 8,000 images containing the different payloads "obfuscated" with the base64 algorithm;
- stego_zip: 8,000 images containing the different payloads "obfuscated" with the zip algorithm.
More information on how the dataset is composed can be found in the corresponding "dataset_info.csv" file.
If you use this dataset, please cite our work (doi: 10.22667/JOWUA.2022.09.30.050): N. Cassavia, L. Caviglione, M. Guarascio, G. Manco, M. Zuppelli, “Detection of Steganographic Threats Targeting Digital Images in Heterogeneous Ecosystems Through Machine Learning”, Journal of Wireless Mobile Networks, Ubiquitous Computing, and Dependable Applications, Vol. 13, No. 3, pp. 50-67, September 2022.
Machine learning dataset created as part of my 4th-year dissertation at Abertay University.
Dataset consists of:
20,175 phishing websites from PhishTank and PhishStats.
49,524 benign websites from Alexa top 1 million websites.
The dataset comes in two parts: an Excel spreadsheet and an accompanying txt file of the first page scraped from each URL. Additional information for each website, such as the number of redirects a request made and the site's WHOIS information, was also gathered.
The dataset was collected over a 38-day period. LightGBM was found to work best with the dataset; with a larger dataset, models on this framework should be very accurate.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Distributed Denial of Service (DDoS) attack is a menace to network security that aims at exhausting the target networks with malicious traffic. Although many statistical methods have been designed for DDoS attack detection, designing a real-time detector with low computational overhead is still one of the main concerns. On the other hand, the evaluation of new detection algorithms and techniques heavily relies on the existence of well-designed datasets. In this paper, first, we review the existing datasets comprehensively and propose a new taxonomy for DDoS attacks. Secondly, we generate a new dataset, namely CICDDoS2019, which remedies all current shortcomings. Thirdly, using the generated dataset, we propose a new detection and family classification approach based on a set of network flow features. Finally, we provide the most important feature sets to detect different types of DDoS attacks with their corresponding weights.
The dataset offers an extended set of Distributed Denial of Service attacks, most of which employ some form of amplification through reflection. The dataset shares its feature set with the other CIC NIDS datasets: IDS2017, IDS2018, and DoS2017.
Original paper: https://ieeexplore.ieee.org/abstract/document/8888419
Kaggle dataset: https://www.kaggle.com/datasets/dhoogla/cicddos2019
These datasets (DREBIN and AndroZoo) are contributions to the paper "Fast & Furious: On the Modelling of Malware Detection as an Evolving Data Stream". If you use them in your work, please cite our paper using the BibTeX below:
@article{CESCHIN2022118590,
title = {Fast & Furious: On the modelling of malware detection as an evolving data stream},
journal = {Expert Systems with Applications},
pages = {118590},
year = {2022},
issn = {0957-4174},
doi = {https://doi.org/10.1016/j.eswa.2022.118590},
url = {https://www.sciencedirect.com/science/article/pii/S0957417422016463},
author = {Fabrício Ceschin and Marcus Botacin and Heitor Murilo Gomes and Felipe Pinagé and Luiz S. Oliveira and André Grégio},
keywords = {Machine learning, Data streams, Concept drift, Malware detection, Android}
}
Both datasets are saved in the Parquet file format. To read them, use the following code:
import pandas as pd
data_drebin = pd.read_parquet("drebin_drift.parquet.zip")
data_androzoo = pd.read_parquet("androbin.parquet.zip")
Note that these datasets are different from their original versions. The original DREBIN dataset does not contain the samples' timestamps, which we collected using VirusTotal API. Our version of the AndroZoo dataset is a subset of reports from their dataset previously available in their APK Analysis API, which was discontinued.
The DREBIN dataset is composed of ten textual attributes from Android APKs (list of API calls, permissions, URLs, etc), which are publicly available to download and contain 123,453 benign and 5,560 malicious Android applications. Their distribution over time is shown below.
![DREBIN dataset distribution by month](https://i.imgur.com/IGKOMtE.png)
The AndroZoo dataset is a subset of Android application reports provided by the AndroZoo API, composed of eight textual attributes (resource names, source code classes and methods, manifest permissions, etc.), and contains 213,928 benign and 70,340 malicious applications. The distribution over time of our AndroZoo subset, which keeps the same goodware and malware distribution as the original dataset (composed of almost 10 million apps), is shown below.
![AndroZoo dataset distribution by month](https://i.imgur.com/8zxH3M4.png)
The source code for all the experiments shown in the paper is also available here on Kaggle (note that the experiments using the AndroZoo dataset did not run in the Kaggle environment due to high memory usage).
Experiment 1 (The Best-Case Scenario for AVs - ML Cross-Validation)
Here we classify all samples together to compare which feature extraction algorithm is the best and to report baseline results. We tested several parameters for both algorithms and fixed the vocabulary size at 100 for TF-IDF (top-100 features ordered by term frequency) and created projections with 100 dimensions for Word2Vec, resulting in 1,000 and 800 features per app for DREBIN and AndroZoo, respectively. All results are reported after 10-fold cross-validation, a method commonly used in ML to evaluate models because its results are less prone to bias (note that we train new classifiers and feature extractors at every iteration of the cross-validation process). In practice, folding the dataset implies that the AV company has a mixed view of both past and future threats, despite temporal effects, which is the best scenario for AV operation and ML evaluation.
Source Codes: DREBIN TFIDF | DREBIN W2V | ANDROZOO TFIDF | ANDROZOO W2V
Experiment 2 (On Classification Failure - Temporal Classification)
Although the currently used classification methodology helps reduce dataset biases, it would demand knowledge about future threats to work properly. AV companies train their classifiers using data from past samples and leverage them to predict future threats, expecting them to present the same characteristics as past ones. However, malware samples are very dynamic, so this strategy is the worst-case scenario for AV companies. To demonstrate the effects of predicting future threats based on past data, we split our datasets in two: we used the first half (oldest samples) to train our classifiers, which were then used to predict the newest samples from the second half. The results in the paper indicate a drop in all metrics when compared to the 10-fold experiment on both the DREBIN and AndroZoo dataset...
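The temporal split described above amounts to sorting by timestamp and cutting the stream in half; a toy pandas sketch (the column names and timestamps here are illustrative, not the actual dataset schema):

```python
import pandas as pd

# Toy stream of samples with collection timestamps.
df = pd.DataFrame({
    "sample_id": list("abcdef"),
    "first_seen": pd.to_datetime(["2015-01-01", "2015-06-01", "2016-01-01",
                                  "2016-06-01", "2017-01-01", "2017-06-01"]),
    "label": [0, 1, 0, 1, 0, 1],
})

# Train on the oldest half, test on the newest half.
df = df.sort_values("first_seen").reset_index(drop=True)
half = len(df) // 2
train_df, test_df = df.iloc[:half], df.iloc[half:]
print(len(train_df), len(test_df))  # 3 3
```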
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains network traffic logs captured by Burp-Suite, aimed at classifying web requests as either good or bad based on their characteristics. By analyzing patterns in the network logs, this dataset helps in identifying web requests that could be categorized as legitimate or malicious. The main goal is to assist in the detection and prevention of web-based attacks, contributing to a more secure online environment.
The dataset was gathered using Burp-Suite, a tool widely recognized for web vulnerability scanning and traffic monitoring. Burp-Suite captures network traffic, providing detailed records of the interactions between clients and servers. This includes information such as URL paths, headers, and parameters. Using this information, it is possible to develop classification models that assess the characteristics of each web request, determining whether they should be considered "good" or "bad."
Characteristics of the Dataset Traffic Log Format: The dataset contains various HTTP(S) requests, including headers, URLs, and request bodies. Each request is labeled based on its perceived legitimacy. The legitimate requests ("good") are standard, benign user activity, while the malicious requests ("bad") are crafted to exploit server vulnerabilities.
Feature Analysis: The characteristics of each web request are crucial to classifying it correctly. This includes analyzing the structure of the URL, the parameters included in the query string, and the content of request headers. By evaluating these elements, we can detect suspicious patterns that may indicate a malicious request, such as SQL injection attempts, cross-site scripting (XSS), or other forms of attack.
Detection Strategy: The classification is based on identifying certain key patterns and terms that are indicative of an attack. This approach is particularly useful for detecting known types of attacks, such as SQL injection, command injection, and XSS. The dataset includes specific keywords that have been identified as potential indicators of malicious intent.
List of Bad Words in the URL Path The dataset specifically highlights the importance of monitoring the URL path for certain keywords that are commonly used in attacks. The list of "bad words" that should be checked in the URL path includes:
- sleep: used in SQL injection attacks to delay server response.
- uid: frequently targeted in attempts to exploit user identifier vulnerabilities.
- select: a common SQL keyword that might indicate a potential SQL injection attempt.
- waitfor: used in SQL Server to introduce delays, often seen in timing attacks.
- delay: similar to "sleep", used to manipulate response times.
- system: can indicate an attempt to execute system commands through vulnerabilities like command injection.
- union: often used in SQL injection to combine results from different tables, which may lead to data exposure.
- order by: another SQL term that may be exploited to modify query results.
- group by: similar to "order by", used in SQL queries that might be manipulated for malicious purposes.
- admin: an attempt to access administrative sections of a website, possibly leading to privilege escalation.
- drop: indicates an attempt to delete database tables or other critical components.
- script: often used in cross-site scripting (XSS) attacks to inject malicious code into webpages.

These "bad words" serve as potential red flags in the dataset and play an important role in differentiating between legitimate and malicious requests. If any of these keywords appear in the URL path, it significantly increases the likelihood that the request is malicious and warrants further investigation or immediate blocking.
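A minimal sketch of this keyword check on the URL path, decoding percent-encoding first and deliberately ignoring the query string, since the text focuses on the path (the example URLs are invented):

```python
from urllib.parse import urlparse, unquote

BAD_WORDS = ["sleep", "uid", "select", "waitfor", "delay", "system",
             "union", "order by", "group by", "admin", "drop", "script"]

def suspicious_words(url):
    """Return the flagged keywords found in the (decoded) URL path."""
    path = unquote(urlparse(url).path).lower()
    return [w for w in BAD_WORDS if w in path]

print(suspicious_words("https://x.example/login"))                 # []
print(suspicious_words("https://x.example/q/1%20union%20select"))  # ['select', 'union']
```

A real detector would also inspect the query string, headers, and body; substring matching alone over-flags benign paths that happen to contain these words.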
Importance for Web Security The ability to classify web requests as good or bad is a fundamental aspect of web security. Many cyber attacks begin with an attempt to interact with web servers in unintended ways, exploiting vulnerabilities to gain unauthorized access, disrupt services, or steal sensitive information. By training models using this dataset, security teams can create intelligent systems that automatically recognize patterns associated with attacks.
SQL Injection Prevention: By identifying keywords such as "select," "union," "drop," and others, the model can flag requests that appear to contain SQL code, suggesting an attempt to execute an unauthorized database query.
Command Injection: Keywords like "system" may indicate attempts to execute shell commands, which can be highly damaging if successful.
Access Control: Requests that include "admin" may signal attempts to access restricted areas, possibly representing privilege escalation attacks.
Practical Usage In practice, this dataset could be used to build a machine learning model for real-time web traffic analysis. The model could be integrated into web application firewalls (WAFs) to detect and block suspicious requests be...