100+ datasets found
  1. Benign and Malicious URLs

    • kaggle.com
    zip
    Updated Jul 31, 2022
    Cite
    Samah Malibari (2022). Benign and Malicious URLs [Dataset]. https://www.kaggle.com/datasets/samahsadiq/benign-and-malicious-urls
    Available download formats: zip (12229374 bytes)
    Dataset updated
    Jul 31, 2022
    Authors
    Samah Malibari
    Description

    This dataset is created to form a Balanced URLs dataset with the same number of unique Benign and Malicious URLs. The total number of URLs in the dataset is 632,508 unique URLs.

    The creation of the dataset has involved 2 different datasets from Kaggle which are as follows:

    First Dataset: 450,176 URLs, out of which 77% benign and 23% malicious URLs. Can be found here: https://www.kaggle.com/datasets/siddharthkumar25/malicious-and-benign-urls

    Second Dataset: 651,191 URLs, out of which 428103 benign or safe URLs, 96457 defacement URLs, 94111 phishing URLs, and 32520 malware URLs. Can be found here: https://www.kaggle.com/datasets/sid321axn/malicious-urls-dataset

    To create the balanced dataset, the first dataset served as the main dataset and more malicious URLs from the second dataset were added; the extra benign URLs were then removed to keep the balance. The columns were unified and duplicates were removed so that only unique instances remain.
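
    The merge, de-duplicate, and rebalance steps described above can be sketched roughly as follows. This is a minimal illustration on toy data, not the author's actual script: the function name `build_balanced`, the rule that the first label seen wins for a duplicated URL, and the random downsampling of the larger class are all assumptions.

```python
import random


def build_balanced(primary, extra_malicious, seed=0):
    """Merge two URL sources, drop duplicates, and balance the classes.

    primary:         iterable of (url, label) pairs, label 'benign' or 'malicious'
    extra_malicious: iterable of URLs that are all treated as malicious
    """
    merged = {}
    for url, label in primary:
        merged.setdefault(url, label)        # assumption: first label wins
    for url in extra_malicious:
        merged.setdefault(url, "malicious")  # only adds URLs not seen before

    benign = sorted(u for u, lab in merged.items() if lab == "benign")
    malicious = sorted(u for u, lab in merged.items() if lab == "malicious")

    # Downsample the larger class so both classes end up the same size.
    n = min(len(benign), len(malicious))
    rng = random.Random(seed)
    benign = rng.sample(benign, n)
    malicious = rng.sample(malicious, n)

    # Emit rows as (url, label, result) with result 0 = benign, 1 = malicious.
    return [(u, "benign", 0) for u in benign] + [(u, "malicious", 1) for u in malicious]


# Toy input (hypothetical URLs, not taken from either source dataset).
primary = [
    ("a.example", "benign"),
    ("b.example", "benign"),
    ("c.example", "benign"),
    ("x.example", "malicious"),
    ("a.example", "benign"),   # duplicate, dropped on merge
]
extra_malicious = ["y.example", "x.example"]

rows = build_balanced(primary, extra_malicious)
print(len(rows), "rows,", sum(r[2] for r in rows), "malicious")
```

    A fixed seed keeps the downsampling reproducible; the description does not say how the extra benign URLs were chosen for removal, so random sampling here is only one plausible choice.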

    For more information about the collection of the URLs themselves, please refer to the mentioned datasets above.

    All the URLs are in one .csv file with 3 columns:

    1. 'url': the URL itself.
    2. 'label': the class of the URL, either 'benign' or 'malicious'.
    3. 'result': the same class encoded as 0 or 1 (0 is benign and 1 is malicious).
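
    As a quick sanity check on this layout, the sketch below reads rows with the three documented columns and verifies that 'label' and 'result' agree. The inline sample stands in for the real file, whose name is not given in the description; the function name `summarize` and the sample URLs are hypothetical.

```python
import csv
import io

LABEL_TO_RESULT = {"benign": 0, "malicious": 1}


def summarize(handle):
    """Count rows per class and collect line numbers where label and result disagree."""
    counts = {"benign": 0, "malicious": 0}
    mismatches = []
    # Line 1 is the header, so data rows start at line 2.
    for line_no, row in enumerate(csv.DictReader(handle), start=2):
        label = row["label"].strip().lower()
        result = int(row["result"])
        if LABEL_TO_RESULT.get(label) == result:
            counts[label] += 1
        else:
            mismatches.append(line_no)
    return counts, mismatches


# Hypothetical sample in the documented layout; the last row is deliberately inconsistent.
sample = io.StringIO(
    "url,label,result\n"
    "http://good.example/,benign,0\n"
    "http://bad.example/login,malicious,1\n"
    "http://odd.example/,benign,1\n"
)
counts, mismatches = summarize(sample)
print(counts, mismatches)
```

    On the real file, the handle would come from something like `open(path, newline="", encoding="utf-8")` (the encoding is an assumption). If the description holds, both counts should equal 316,254, which is half of the stated 632,508 URLs.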

  2. Dataset Malicious URLs

    • kaggle.com
    zip
    Updated Jan 3, 2025
    + more versions
    Cite
    Talha Barkaat Ahmad ☑️ (2025). Dataset Malicious URLs [Dataset]. https://www.kaggle.com/datasets/talhabarkaatahmad/dataset-malicious-urls
    Available download formats: zip (17866119 bytes)
    Dataset updated
    Jan 3, 2025
    Authors
    Talha Barkaat Ahmad ☑️
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Context: Malicious URLs, or malicious websites, are a serious threat to cybersecurity. They host unsolicited content (spam, phishing, drive-by downloads, etc.) and lure unsuspecting users into becoming victims of scams (monetary loss, theft of private information, and malware installation), causing losses of billions of dollars every year. We have collected this dataset to include a large number of examples of malicious URLs so that a machine learning-based model can be developed to identify them and stop them in advance, before they infect computer systems or spread through the internet.

    Content: We have collected a dataset of 651,191 URLs, of which 428,103 are benign or safe URLs, 96,457 are defacement URLs, 94,111 are phishing URLs, and 32,520 are malware URLs. Figure 2 depicts their distribution in terms of percentage. Curating the dataset is one of the most crucial tasks in a machine learning project. We have curated this dataset from five different sources.
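
    Since the figure itself is not reproduced here, the percentage distribution can be recomputed from the four class counts given above. The snippet is a small worked calculation; only the variable names are introduced for illustration.

```python
# Class counts as stated in the dataset description.
counts = {
    "benign": 428103,
    "defacement": 96457,
    "phishing": 94111,
    "malware": 32520,
}

total = sum(counts.values())

# Share of each class as a percentage of all URLs.
shares = {name: 100 * n / total for name, n in counts.items()}

print("total:", total)
for name, pct in shares.items():
    print(f"{name:<11}{pct:6.2f}%")
```

    The counts add up to the stated 651,191 URLs, and roughly two thirds of them are benign, so a classifier trained on this data should account for the class imbalance.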

    For collecting benign, phishing, malware and defacement URLs, we used the URL dataset (ISCX-URL-2016). To increase the number of phishing and malware URLs, we used the Malware domain black list dataset. We increased the number of benign URLs using the faizan git repo. Finally, we added more phishing URLs using the Phishtank dataset and the PhishStorm dataset. Because the dataset comes from different sources, we first collected the URLs from each source into a separate data frame and then merged them, retaining only the URLs and their class type.

  3. Malicious and Benign URLs dataset

    • kaggle.com
    zip
    Updated Nov 20, 2025
    Cite
    Nhu Trinh (Anna) (2025). Malicious and Benign URLs dataset [Dataset]. https://www.kaggle.com/datasets/nhutrinhanna/malicious-and-benign-urls-datasets
    Available download formats: zip (26384315 bytes)
    Dataset updated
    Nov 20, 2025
    Authors
    Nhu Trinh (Anna)
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is a consolidated collection of malicious, phishing, and unsafe URLs gathered from multiple reputable cybersecurity intelligence sources. It is designed to support machine learning research, threat detection modeling, academic projects, and security analysis. The dataset combines various categories of malicious URLs, including malware distribution sites, phishing links, and adult-content blacklist entries, to provide a comprehensive view of harmful web activity.

    This dataset does not contain live malicious content; only URL strings and labels are provided. It is safe for research and educational use.

  4. A Dataset of Information (DNS, IP, WHOIS/RDAP, TLS, GeoIP) for a Large...

    • zenodo.org
    json
    Updated Dec 10, 2024
    Cite
    Radek Hranický; Adam Horák; Ondřej Ondryáš (2024). A Dataset of Information (DNS, IP, WHOIS/RDAP, TLS, GeoIP) for a Large Corpus of Benign, Phishing, and Malware Domain Names 2024 [Dataset]. http://doi.org/10.5281/zenodo.13330074
    Available download formats: json
    Dataset updated
    Dec 10, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Radek Hranický; Adam Horák; Ondřej Ondryáš
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Aug 16, 2024
    Description

    The dataset contains DNS records, IP-related features, WHOIS/RDAP information, information from TLS handshakes and certificates, and GeoIP information for 368,956 benign domains from Cisco Umbrella, 461,338 benign domains from the actual CESNET network traffic, 164,425 phishing domains from PhishTank and OpenPhish services, and 100,809 malware domains from various sources like ThreatFox, The Firebog, MISP threat intelligence platform, and other sources. The ground truth for the phishing dataset was double-checked with the VirusTotal (VT) service. Domain names not considered malicious by VT have been removed from phishing and malware datasets. Similarly, benign domain names that were considered risky by VT have been removed from the benign datasets. The data was collected between March 2023 and July 2024. The final assessment of the data was conducted in August 2024.

    The dataset is useful for cybersecurity research, e.g. statistical analysis of domain data or feature extraction for training machine learning-based classifiers, e.g. for phishing and malware website detection.

    Data Files

    • The data is located in the following individual files:

      • benign_umbrella.json - data for 368,956 benign domains from Cisco Umbrella,
      • benign_cesnet.json - data for 461,338 benign domains from the CESNET network,
      • phishing.json - data for 164,425 phishing domains, and
      • malware.json - data for 100,809 malware domains.

    Data Structure

    Each file contains a JSON array of records generated using mongoexport. The following table documents the structure of a record. Please note that:

    • some fields may be missing (they should be interpreted as nulls),
    • extra fields may be present (they should be ignored).

    Field name | Field type | Nullable | Description
    domain_name | String | No | The evaluated domain name
    url | String | No | The source URL for the domain name
    evaluated_on | Date | No | Date of last collection attempt
    source | String | No | An identifier of the source
    sourced_on | Date | No | Date of ingestion of the domain name
    dns | Object | Yes | Data from DNS scan
    rdap | Object | Yes | Data from RDAP or WHOIS
    tls | Object | Yes | Data from TLS handshake
    ip_data | Array of Objects | Yes | Array of data objects capturing the IP addresses related to the domain name

    DNS data (dns field)

    Field name | Field type | Nullable | Description
    A | Array of Strings | No | Array of IPv4 addresses
    AAAA | Array of Strings | No | Array of IPv6 addresses
    TXT | Array of Strings | No | Array of raw TXT values
    CNAME | Object | No | The CNAME target and related IPs
    MX | Array of Objects | No | Array of objects with the MX target hostname, priority and related IPs
    NS | Array of Objects | No | Array of objects with the NS target hostname and related IPs
    SOA | Object | No | All the SOA fields, present if found at the target domain name
    zone_SOA | Object | No | The SOA fields of the target’s zone (closest point of delegation), present if found and not a record in the target domain directly
    dnssec | Object | No | Flags describing the DNSSEC validation result for each record type
    ttls | Object | No | The TTL values for each record type
    remarks | Object | No | The zone domain name and DNSSEC flags

    RDAP data (rdap field)

    Field name | Field type | Nullable | Description
    copyright_notice | String | No | RDAP/WHOIS data usage copyright notice
    dnssec | Bool | No | DNSSEC presence flag
    entitites | Object | No | An object with various arrays representing the found related entity types (e.g. abuse, admin, registrant). The arrays contain objects describing the individual entities.
    expiration_date | Date | Yes | The current date of expiration
    handle | String | No | RDAP handle
    last_changed_date | Date | Yes | The date when the domain was last changed
    name | String | No | The target domain name for which the data in this object are stored
    nameservers | Array of Strings | No | Nameserver hostnames provided by RDAP or WHOIS
    registration_date | Date | Yes | First registration date
    status | Array of Strings
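
    The two reading rules stated above (missing fields are treated as nulls, extra fields are ignored) can be sketched for the top-level record fields as follows. This is only an illustration: the `DomainRecord` class, the `parse_record` helper, and the sample values are hypothetical, and the date fields are left as raw values because the exact date encoding in the exported files is not described here.

```python
import json
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class DomainRecord:
    # Top-level fields from the documented record structure.
    domain_name: Optional[str] = None
    url: Optional[str] = None
    evaluated_on: Any = None   # kept raw; date encoding is an open assumption
    source: Optional[str] = None
    sourced_on: Any = None     # kept raw; date encoding is an open assumption
    dns: Optional[dict] = None
    rdap: Optional[dict] = None
    tls: Optional[dict] = None
    ip_data: Optional[list] = None


FIELD_NAMES = tuple(f.name for f in fields(DomainRecord))


def parse_record(raw: dict) -> DomainRecord:
    """Keep only the documented fields; anything missing becomes None."""
    return DomainRecord(**{name: raw.get(name) for name in FIELD_NAMES})


# Hypothetical one-record array: no 'tls' field, plus an undocumented '_id' field.
sample = """[
  {"domain_name": "example.com", "url": "http://example.com/",
   "evaluated_on": "2024-08-16", "source": "demo", "sourced_on": "2024-08-01",
   "dns": {"A": ["192.0.2.1"]}, "_id": "ignored"}
]"""

records = [parse_record(item) for item in json.loads(sample)]
print(records[0].domain_name, records[0].tls)
```

    Loading a whole file with `json.loads` keeps every record in memory at once, which may be impractical for the larger files; a streaming JSON parser would be the usual alternative.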

  5. Total detection cases of web-based malware website South Korea 2015-2024

    • statista.com
    Cite
    Statista, Total detection cases of web-based malware website South Korea 2015-2024 [Dataset]. https://www.statista.com/statistics/1308201/south-korea-total-detection-cases-of-web-based-malware/
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    South Korea
    Description

    In 2024, the total detection cases of web-based malware sites in South Korea amounted to roughly ** thousand, a slight increase compared to the previous year. The highest number of detected web-based malware sites in South Korea was ****** cases in 2014. The web-based malware sites comprised distribution sites and staging sites.

  6. Summary of previous works on malicious URL detection.

    • plos.figshare.com
    xls
    Updated May 31, 2024
    Cite
    Suresh Sankaranarayanan; Arvinthan Thevar Sivachandran; Anis Salwa Mohd Khairuddin; Khairunnisa Hasikin; Abdul Rahman Wahab Sait (2024). Summary of previous works on malicious URL detection. [Dataset]. http://doi.org/10.1371/journal.pone.0302196.t001
    Available download formats: xls
    Dataset updated
    May 31, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Suresh Sankaranarayanan; Arvinthan Thevar Sivachandran; Anis Salwa Mohd Khairuddin; Khairunnisa Hasikin; Abdul Rahman Wahab Sait
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Summary of previous works on malicious URL detection.

  7. Suspicious File and URL Analysis Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Mar 10, 2025
    Cite
    Archive Market Research (2025). Suspicious File and URL Analysis Report [Dataset]. https://www.archivemarketresearch.com/reports/suspicious-file-and-url-analysis-55344
    Available download formats: doc, ppt, pdf
    Dataset updated
    Mar 10, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The suspicious file and URL analysis market is booming, projected to reach $88 million in 2025 and grow at a CAGR of 6.4% through 2033. Learn about key drivers, trends, and top players shaping this crucial cybersecurity sector. Discover market size projections, regional breakdowns, and insights into cloud-based vs. on-premise solutions.

  8. Suspicious File and URL Analysis Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Mar 9, 2025
    Cite
    Archive Market Research (2025). Suspicious File and URL Analysis Report [Dataset]. https://www.archivemarketresearch.com/reports/suspicious-file-and-url-analysis-55170
    Available download formats: ppt, pdf, doc
    Dataset updated
    Mar 9, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global suspicious file and URL analysis market is booming, projected to reach $412 million by 2033 with a 15% CAGR. This report analyzes market trends, key players (CrowdStrike, Symantec, McAfee), and regional growth, highlighting the increasing demand for robust cybersecurity solutions in the face of rising cyber threats. Learn more about cloud-based solutions and the impact of AI on threat detection.

  9. phishing_url_classification

    • huggingface.co
    Updated Oct 19, 2024
    + more versions
    Cite
    Anoop Maurya (2024). phishing_url_classification [Dataset]. https://huggingface.co/datasets/imanoop7/phishing_url_classification
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 19, 2024
    Authors
    Anoop Maurya
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Phishing URL Classification Dataset

    This dataset contains URLs labeled as 'Safe' (0) or 'Not Safe' (1) for phishing detection tasks.

      Dataset Summary
    

    This dataset contains URLs labeled for phishing detection tasks. It's designed to help train and evaluate models that can identify potentially malicious URLs.

      Dataset Creation
    

    The dataset was synthetically generated using a custom script that creates both legitimate and potentially phishing URLs. This approach… See the full description on the dataset page: https://huggingface.co/datasets/imanoop7/phishing_url_classification.

  10. Malware Analysis Datasets: Top-1000 PE Imports

    • ieee-dataport.org
    Updated Nov 8, 2019
    + more versions
    Cite
    Angelo Oliveira (2019). Malware Analysis Datasets: Top-1000 PE Imports [Dataset]. https://ieee-dataport.org/open-access/malware-analysis-datasets-top-1000-pe-imports
    Dataset updated
    Nov 8, 2019
    Authors
    Angelo Oliveira
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is part of my PhD research on malware detection and classification using Deep Learning. It contains static analysis data: Top-1000 imported functions extracted from the 'pe_imports' elements of Cuckoo Sandbox reports. PE malware examples were downloaded from virusshare.com. PE goodware examples were downloaded from portableapps.com and from Windows 7 x86 directories.

  11. The performance of the proposed ensemble classifier in classifying four...

    • plos.figshare.com
    xls
    Updated May 31, 2024
    Cite
    Suresh Sankaranarayanan; Arvinthan Thevar Sivachandran; Anis Salwa Mohd Khairuddin; Khairunnisa Hasikin; Abdul Rahman Wahab Sait (2024). The performance of the proposed ensemble classifier in classifying four classes of malicious URLs for testing data. [Dataset]. http://doi.org/10.1371/journal.pone.0302196.t007
    Available download formats: xls
    Dataset updated
    May 31, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Suresh Sankaranarayanan; Arvinthan Thevar Sivachandran; Anis Salwa Mohd Khairuddin; Khairunnisa Hasikin; Abdul Rahman Wahab Sait
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The performance of the proposed ensemble classifier in classifying four classes of malicious URLs for testing data.

  12. Suspicious File and URL Analysis Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Jun 21, 2025
    Cite
    Data Insights Market (2025). Suspicious File and URL Analysis Report [Dataset]. https://www.datainsightsmarket.com/reports/suspicious-file-and-url-analysis-1462174
    Available download formats: ppt, doc, pdf
    Dataset updated
    Jun 21, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The suspicious file and URL analysis market is booming, projected to reach $150 million by 2033 with a 6.7% CAGR. This in-depth analysis explores market drivers, trends, restraints, key players (CrowdStrike, McAfee, Symantec, etc.), and regional growth. Discover the latest insights on protecting against ransomware, phishing, and malware.

  13. Android malware dataset for machine learning 2

    • figshare.com
    txt
    Updated Nov 26, 2025
    + more versions
    Cite
    Suleiman Yerima (2025). Android malware dataset for machine learning 2 [Dataset]. http://doi.org/10.6084/m9.figshare.5854653.v1
    Available download formats: txt
    Dataset updated
    Nov 26, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Suleiman Yerima
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset consisting of feature vectors of 215 attributes extracted from 15,036 applications (5,560 malware apps from the Drebin project and 9,476 benign apps). The dataset has been used to develop and evaluate a multilevel classifier fusion approach for Android malware detection, published in the IEEE Transactions on Cybernetics paper 'DroidFusion: A Novel Multilevel Classifier Fusion Approach for Android Malware Detection'. The supporting file contains further description of the feature vectors/attributes obtained via static code analysis of the Android apps.

  14. malicious-url

    • huggingface.co
    Updated Jun 28, 2024
    Cite
    Wayan Danu Tirta (2024). malicious-url [Dataset]. https://huggingface.co/datasets/EustassKidman/malicious-url
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 28, 2024
    Authors
    Wayan Danu Tirta
    Description

    EustassKidman/malicious-url dataset hosted on Hugging Face and contributed by the HF Datasets community

  15. Benchmarking the proposed work with previous works in malicious URL...

    • plos.figshare.com
    xls
    Updated May 31, 2024
    Cite
    Suresh Sankaranarayanan; Arvinthan Thevar Sivachandran; Anis Salwa Mohd Khairuddin; Khairunnisa Hasikin; Abdul Rahman Wahab Sait (2024). Benchmarking the proposed work with previous works in malicious URL detection. [Dataset]. http://doi.org/10.1371/journal.pone.0302196.t008
    Available download formats: xls
    Dataset updated
    May 31, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Suresh Sankaranarayanan; Arvinthan Thevar Sivachandran; Anis Salwa Mohd Khairuddin; Khairunnisa Hasikin; Abdul Rahman Wahab Sait
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Benchmarking the proposed work with previous works in malicious URL detection.

  16. Phishing URL Dataset

    • kaggle.com
    zip
    Updated Sep 27, 2025
    Cite
    Sri Harshitha Battula (2025). Phishing URL Dataset [Dataset]. https://www.kaggle.com/datasets/sriharshithabattula/phishing-url-dataset
    Available download formats: zip (290110 bytes)
    Dataset updated
    Sep 27, 2025
    Authors
    Sri Harshitha Battula
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset consists of phishing and malicious URLs along with detailed metadata that helps in understanding their activity, status, and technical characteristics. It is suitable for tasks such as URL classification, phishing detection, threat intelligence, and malware analysis.

    🔹 Key Highlights

    Contains URLs reported as phishing or malicious.

    Includes timestamps for when URLs were added and last seen online.

    Provides threat classifications (e.g., phishing, malware, fraud, botnet).

    Enriched with technical tags indicating malware families or targeted platforms (e.g., Mozi, elf, mips, 32-bit).

    Potential Use Cases

    Training machine learning models for phishing/malware URL detection.

    Building threat intelligence dashboards.

    Performing exploratory data analysis (EDA) on phishing trends over time.

    Understanding malware targeting patterns (e.g., IoT attacks using Mozi botnet).

  17. Website Malware Scanner Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Jun 21, 2025
    + more versions
    Cite
    Data Insights Market (2025). Website Malware Scanner Report [Dataset]. https://www.datainsightsmarket.com/reports/website-malware-scanner-1989693
    Available download formats: doc, pdf, ppt
    Dataset updated
    Jun 21, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    Discover the booming Website Malware Scanner market analysis! Explore key trends, growth drivers, leading companies (Invicti, Acunetix, Qualys), and future projections for 2025-2033. Learn how to protect your website from cyber threats.

  18. URL Threat Intelligence Database

    • controld.com
    Updated Feb 15, 2025
    Cite
    ControlD Security Research (2025). URL Threat Intelligence Database [Dataset]. https://controld.com/tools/website-link-checker
    Dataset updated
    Feb 15, 2025
    Dataset authored and provided by
    ControlD Security Research
    Description

    Comprehensive database of known malicious URLs, phishing sites, and threat indicators

  19. URL dataset for malicious/benign classification

    • resodate.org
    Updated Dec 16, 2024
    Cite
    Ethan M. Rudd; Ahmed Abdallah (2024). URL dataset for malicious/benign classification [Dataset]. https://resodate.org/resources/aHR0cHM6Ly9zZXJ2aWNlLnRpYi5ldS9sZG1zZXJ2aWNlL2RhdGFzZXQvdXJsLWRhdGFzZXQtZm9yLW1hbGljaW91cy1iZW5pZ24tY2xhc3NpZmljYXRpb24=
    Dataset updated
    Dec 16, 2024
    Dataset provided by
    Leibniz Data Manager
    Authors
    Ethan M. Rudd; Ahmed Abdallah
    Description

    A dataset of URLs with binary labels for malicious/benign classification

  20. Dataset of Malicious and Benign Webpages

    • data.mendeley.com
    Updated May 1, 2020
    Cite
    AK Singh (2020). Dataset of Malicious and Benign Webpages [Dataset]. http://doi.org/10.17632/gdx3pkwp47.1
    Dataset updated
    May 1, 2020
    Authors
    AK Singh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains extracted attributes from websites that can be used for classification of webpages as malicious or benign. It also includes raw page content, including JavaScript code, that can be used as unstructured data in deep learning or for extracting further attributes. The data was collected by crawling the Internet using MalCrawler [1]. The labels have been verified using the Google Safe Browsing API [2]. Attributes have been selected based on their relevance [3]. The dataset attributes are as follows:

    • 'url' - The URL of the webpage.
    • 'ip_add' - IP address of the webpage.
    • 'geo_loc' - The geographic location where the webpage is hosted.
    • 'url_len' - The length of the URL.
    • 'js_len' - Length of JavaScript code on the webpage.
    • 'js_obf_len' - Length of obfuscated JavaScript code.
    • 'tld' - The top-level domain of the webpage.
    • 'who_is' - Whether the WHOIS domain information is complete or not.
    • 'https' - Whether the site uses https or http.
    • 'content' - The raw webpage content including JavaScript code.
    • 'label' - The class label for benign or malicious webpage.

    Python code for extraction of the above listed dataset attributes is attached. The visualisation of this dataset and its Python code are also attached. This visualisation can be seen online on Kaggle [5].
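
    To give a feel for the simplest of these attributes, the sketch below derives 'url_len', 'tld', and 'https' from a URL string using only the standard library. It is not the attached extraction code: the function name `basic_url_attributes`, the boolean encoding of 'https', and the naive "last host label" rule for the TLD are assumptions made for illustration.

```python
from urllib.parse import urlsplit


def basic_url_attributes(url: str) -> dict:
    """Derive a few lexical attributes directly from the URL string."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Naive TLD: the last dot-separated label of the host name.
    # This does not handle multi-label suffixes such as 'co.uk'.
    tld = host.rsplit(".", 1)[-1] if "." in host else ""
    return {
        "url": url,
        "url_len": len(url),
        "tld": tld,
        "https": parts.scheme == "https",  # assumption: encoded as a boolean
    }


# Hypothetical example URLs.
secure = basic_url_attributes("https://www.example.com/index.html")
plain = basic_url_attributes("http://shop.example.org/cart?id=7")
print(secure)
print(plain)
```

    The remaining attributes ('ip_add', 'geo_loc', 'who_is', 'js_len', 'js_obf_len', 'content') need network lookups or the page content itself, so they are left out of this sketch.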

Benign and Malicious URLs

This is a balanced dataset containing benign and malicious URLs.

181 scholarly articles cite this dataset (View in Google Scholar)
