16 datasets found
  1. Dataset Malicious URLs

    • kaggle.com
    zip
    Updated Jan 3, 2025
    Cite
    Talha Barkaat Ahmad ☑️ (2025). Dataset Malicious URLs [Dataset]. https://www.kaggle.com/datasets/talhabarkaatahmad/dataset-malicious-urls
    Explore at:
    zip (17866119 bytes). Available download formats
    Dataset updated
    Jan 3, 2025
    Authors
    Talha Barkaat Ahmad ☑️
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Context: Malicious URLs, or malicious websites, are a serious threat to cybersecurity. They host unsolicited content (spam, phishing, drive-by downloads, etc.), lure unsuspecting users into becoming victims of scams (monetary loss, theft of private information, malware installation), and cause losses of billions of dollars every year. We have collected this dataset to include a large number of examples of malicious URLs so that a machine learning-based model can be developed to identify them and stop them in advance, before they infect a computer system or spread through the internet.

    Content: We have collected a large dataset of 651,191 URLs, of which 428,103 are benign (safe) URLs, 96,457 defacement URLs, 94,111 phishing URLs, and 32,520 malware URLs. Figure 2 depicts their distribution in terms of percentage. Curating the dataset is one of the most crucial tasks in a machine learning project; we have curated this dataset from five different sources.

    For collecting benign, phishing, malware, and defacement URLs we used the ISCX-URL-2016 dataset. To increase the number of phishing and malware URLs, we used the Malware Domain Blacklist dataset. We increased the benign URLs using the faizan git repo, and finally added more phishing URLs using the PhishTank and PhishStorm datasets. Since the dataset is collected from different sources, we first gathered the URLs from each source into separate data frames and then merged them, retaining only the URLs and their class type.
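    The merge step described above can be sketched in pandas; the three one-row frames and their column names below are illustrative stand-ins for the real source files:

```python
import pandas as pd

# Hypothetical per-source frames; the real sources use different schemas.
iscx = pd.DataFrame({"url": ["http://a.example/x"], "type": ["benign"]})
phishtank = pd.DataFrame({"url": ["http://b.example/y"], "type": ["phishing"]})
phishstorm = pd.DataFrame({"url": ["http://a.example/x"], "type": ["benign"]})

# Concatenate the sources, keep only the URL and its class type,
# and drop duplicate URLs so each URL appears once.
merged = (
    pd.concat([iscx, phishtank, phishstorm], ignore_index=True)
    [["url", "type"]]
    .drop_duplicates(subset="url")
    .reset_index(drop=True)
)
print(len(merged))  # 2 unique URLs remain
```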

  2. Benign and Malicious URLs

    • kaggle.com
    zip
    Updated Jul 31, 2022
    Cite
    Samah Malibari (2022). Benign and Malicious URLs [Dataset]. https://www.kaggle.com/datasets/samahsadiq/benign-and-malicious-urls
    Explore at:
    zip (12229374 bytes). Available download formats
    Dataset updated
    Jul 31, 2022
    Authors
    Samah Malibari
    Description

    This dataset was created to form a balanced URL dataset with the same number of unique benign and malicious URLs. The total number of URLs in the dataset is 632,508 unique URLs.

    The creation of the dataset has involved 2 different datasets from Kaggle which are as follows:

    First Dataset: 450,176 URLs, out of which 77% benign and 23% malicious URLs. Can be found here: https://www.kaggle.com/datasets/siddharthkumar25/malicious-and-benign-urls

    Second Dataset: 651,191 URLs, out of which 428103 benign or safe URLs, 96457 defacement URLs, 94111 phishing URLs, and 32520 malware URLs. Can be found here: https://www.kaggle.com/datasets/sid321axn/malicious-urls-dataset

    To create the balanced dataset, the first dataset was used as the base, more malicious URLs from the second dataset were added, and then the extra benign URLs were removed to keep the balance. The columns were unified and duplicates removed so that only unique instances remain.

    For more information about the collection of the URLs themselves, please refer to the mentioned datasets above.

    All the URLs are in one .csv file with three columns: 1. 'url', the list of URLs; 2. 'label', the class of the URL, either 'benign' or 'malicious'; 3. 'result', which also represents the class of the URL but as 0 (benign) or 1 (malicious).
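    A minimal sketch of working with the described schema, using a two-row stand-in frame instead of the real .csv; it also checks that 'label' and 'result' encode the same class:

```python
import pandas as pd

# A tiny stand-in for the dataset's schema (url, label, result);
# the real file is one large .csv with these three columns.
df = pd.DataFrame({
    "url": ["http://good.example/", "http://bad.example/"],
    "label": ["benign", "malicious"],
    "result": [0, 1],
})

# 'result' is just the numeric encoding of 'label' (0 = benign,
# 1 = malicious), so the two columns should always agree.
encoded = df["label"].map({"benign": 0, "malicious": 1})
assert (encoded == df["result"]).all()
print(df["label"].value_counts())
```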

  3. A Dataset of Information (DNS, IP, WHOIS/RDAP, TLS, GeoIP) for a Large...

    • zenodo.org
    json
    Updated Dec 10, 2024
    Cite
    Radek Hranický; Adam Horák; Ondřej Ondryáš (2024). A Dataset of Information (DNS, IP, WHOIS/RDAP, TLS, GeoIP) for a Large Corpus of Benign, Phishing, and Malware Domain Names 2024 [Dataset]. http://doi.org/10.5281/zenodo.13330074
    Explore at:
    json. Available download formats
    Dataset updated
    Dec 10, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Radek Hranický; Adam Horák; Ondřej Ondryáš
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Aug 16, 2024
    Description

    The dataset contains DNS records, IP-related features, WHOIS/RDAP information, information from TLS handshakes and certificates, and GeoIP information for 368,956 benign domains from Cisco Umbrella, 461,338 benign domains from actual CESNET network traffic, 164,425 phishing domains from the PhishTank and OpenPhish services, and 100,809 malware domains from various sources such as ThreatFox, The Firebog, and the MISP threat intelligence platform. The ground truth for the phishing dataset was double-checked with the VirusTotal (VT) service. Domain names not considered malicious by VT were removed from the phishing and malware datasets; similarly, benign domain names considered risky by VT were removed from the benign datasets. The data was collected between March 2023 and July 2024, and the final assessment of the data was conducted in August 2024.

    The dataset is useful for cybersecurity research, e.g. statistical analysis of domain data or feature extraction for training machine learning-based classifiers, such as phishing and malware website detectors.

    Data Files

    • The data is located in the following individual files:

      • benign_umbrella.json - data for 368,956 benign domains from Cisco Umbrella,
      • benign_cesnet.json - data for 461,338 benign domains from the CESNET network,
      • phishing.json - data for 164,425 phishing domains, and
      • malware.json - data for 100,809 malware domains.

    Data Structure

    Each file contains a JSON array of records generated using mongoexport. The following table documents the structure of a record. Please note that:

    • some fields may be missing (they should be interpreted as nulls),
    • extra fields may be present (they should be ignored).
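    Reading a record can be sketched as follows; the two records below are illustrative stand-ins, not data from the files, and show how a missing field (absent "dns") reads as None while an unknown extra field is simply ignored:

```python
import json

# Two illustrative records; real records carry many more fields.
raw = '''[
  {"domain_name": "example.com", "source": "umbrella",
   "dns": {"A": ["93.184.216.34"]}},
  {"domain_name": "no-dns.example", "source": "umbrella",
   "extra_field": 42}
]'''

records = json.loads(raw)
for rec in records:
    dns = rec.get("dns")        # None when the DNS scan data is missing
    rdap = rec.get("rdap")      # likewise for RDAP/WHOIS
    a_records = (dns or {}).get("A", [])
    print(rec["domain_name"], len(a_records))
```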

    Record fields (field name, type, nullable, description):

    • domain_name (String, not nullable): The evaluated domain name
    • url (String, not nullable): The source URL for the domain name
    • evaluated_on (Date, not nullable): Date of the last collection attempt
    • source (String, not nullable): An identifier of the source
    • sourced_on (Date, not nullable): Date of ingestion of the domain name
    • dns (Object, nullable): Data from the DNS scan
    • rdap (Object, nullable): Data from RDAP or WHOIS
    • tls (Object, nullable): Data from the TLS handshake
    • ip_data (Array of Objects, nullable): Array of data objects capturing the IP addresses related to the domain name

    DNS data (dns field)

    • A (Array of Strings, not nullable): Array of IPv4 addresses
    • AAAA (Array of Strings, not nullable): Array of IPv6 addresses
    • TXT (Array of Strings, not nullable): Array of raw TXT values
    • CNAME (Object, not nullable): The CNAME target and related IPs
    • MX (Array of Objects, not nullable): Array of objects with the MX target hostname, priority and related IPs
    • NS (Array of Objects, not nullable): Array of objects with the NS target hostname and related IPs
    • SOA (Object, not nullable): All the SOA fields, present if found at the target domain name
    • zone_SOA (Object, not nullable): The SOA fields of the target's zone (closest point of delegation), present if found and not a record in the target domain directly
    • dnssec (Object, not nullable): Flags describing the DNSSEC validation result for each record type
    • ttls (Object, not nullable): The TTL values for each record type
    • remarks (Object, not nullable): The zone domain name and DNSSEC flags

    RDAP data (rdap field)

    • copyright_notice (String, not nullable): RDAP/WHOIS data usage copyright notice
    • dnssec (Bool, not nullable): DNSSEC presence flag
    • entitites (Object, not nullable): An object with various arrays representing the found related entity types (e.g. abuse, admin, registrant). The arrays contain objects describing the individual entities.
    • expiration_date (Date, nullable): The current date of expiration
    • handle (String, not nullable): RDAP handle
    • last_changed_date (Date, nullable): The date when the domain was last changed
    • name (String, not nullable): The target domain name for which the data in this object are stored
    • nameservers (Array of Strings, not nullable): Nameserver hostnames provided by RDAP or WHOIS
    • registration_date (Date, nullable): First registration date
    • status (Array of Strings)

  4. Tabular dataset ready for malicious url detection

    • kaggle.com
    zip
    Updated Feb 14, 2024
    Cite
    Pilar Piñeiro (2024). Tabular dataset ready for malicious url detection [Dataset]. https://www.kaggle.com/datasets/pilarpieiro/tabular-dataset-ready-for-malicious-url-detection/code
    Explore at:
    zip (668529046 bytes). Available download formats
    Dataset updated
    Feb 14, 2024
    Authors
    Pilar Piñeiro
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Description: This Kaggle Dataset focuses on Malicious URL detection, aiming to facilitate research and development in the field of cybersecurity and threat detection.

    The dataset includes a diverse set of URLs, both benign and malicious, accompanied by an extensive list of calculated features. These features cover various aspects of URLs, providing a holistic view to aid in the identification of potential threats.

    Key feature types:

    1. Basic URL Components
    2. Domain Information
    3. Content Analysis
    4. Path and Query Parameters
    5. SSL Certificate Details
    6. Host Reputation
    7. Network Features
    8. Machine Learning-Derived Features
    9. Behavioural Features

    Researchers and data scientists can leverage this dataset to develop and benchmark URL detection models, employing various machine learning and deep learning techniques. The abundance of calculated features offers a rich resource for exploring novel approaches to enhance the accuracy and robustness of URL classification systems. Additionally, the dataset encourages collaboration and the sharing of insights within the cybersecurity community.
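    As an illustration of the "Basic URL Components" feature group, a hypothetical extractor might look like the sketch below; the feature names and definitions are assumptions for illustration, not the dataset's actual columns:

```python
from urllib.parse import urlparse

def basic_url_features(url: str) -> dict:
    """Hypothetical sketch of a 'Basic URL Components' feature group;
    the dataset's actual feature names and definitions may differ."""
    parsed = urlparse(url)
    return {
        "url_length": len(url),
        "hostname_length": len(parsed.netloc),
        "path_length": len(parsed.path),
        "num_query_params": len(parsed.query.split("&")) if parsed.query else 0,
        "uses_https": parsed.scheme == "https",
        "num_digits": sum(c.isdigit() for c in url),
    }

feats = basic_url_features("https://login.example.com/verify?id=123&token=abc")
print(feats)
```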

  5. Malicious and Benign Websites

    • kaggle.com
    zip
    Updated Apr 9, 2018
    Cite
    Christian Urcuqui (2018). Malicious and Benign Websites [Dataset]. https://www.kaggle.com/xwolf12/malicious-and-benign-websites
    Explore at:
    zip (48762 bytes). Available download formats
    Dataset updated
    Apr 9, 2018
    Authors
    Christian Urcuqui
    Description

    In the process of being edited.

    Context

    Malicious websites are of great concern because analyzing them one by one and indexing each URL in a blacklist is impractical. Unfortunately, there is a lack of datasets with malicious and benign web characteristics. This dataset is a research product of my bachelor students that aims to fill this gap.

    This is the first dataset version from our web security project; we are working to improve its results.

    Content

    The project consisted of evaluating different classification models to predict malicious and benign websites, based on application-layer and network characteristics. The data were obtained by using different verified sources of benign and malicious URLs in a low-interaction client honeypot to isolate network traffic. We used additional tools, such as Whois, to get other information like the server country.

    This is the first version, and we have some initial results from applying machine learning classifiers in a bachelor thesis. Further details on the data-making process and the data description can be found in the article below.

    URL Dataset

    This is an important topic and one of the most difficult to process. Following other articles and open resources, we used three blacklists: + machinelearning.inginf.units.it/data-and-tools/hidden-fraudulent-urls-dataset + malwaredomainlist.com + zeustracker.abuse.ch

    From them we got around 185,181 URLs that we assumed were malicious according to their information; as a next research step, we recommend verifying them through another security tool, such as VirusTotal.

    We got the benign URLs (345,000) from https://github.com/faizann24/Using-machinelearning-to-detect-malicious-URLs.git; as in the previous step, a verification process through other security systems is also recommended.

    Framework

    First, we made different scripts in Python to systematically analyze and generate the information for each URL (during the coming months we will release them to the open source community on GitHub).

    We then verified that each URL was reachable using Python libraries (such as requests). We started with around 530,181 samples, but after this filtering step we were left with 63,191 URLs.

    [Figure: Framework to detect malicious websites]

    Feature generator:

    During the research process we found that one way to study a malicious website is to analyze features from its application layer and network layer; to obtain them, the idea is to apply dynamic and static analysis.
    For the dynamic analysis, some articles used high-interaction web application honeypots, but those resources have not been updated in recent months, so some important vulnerabilities may not have been mapped.

    Data Description

    • URL: the anonymous identification of the URL analyzed in the study
    • URL_LENGTH: the number of characters in the URL
    • NUMBER_SPECIAL_CHARACTERS: the number of special characters identified in the URL, such as "/", "%", "#", "&", ".", "="
    • CHARSET: a categorical value indicating the character encoding standard (also called character set)
    • SERVER: a categorical value indicating the operating system of the server, obtained from the packet response
    • CONTENT_LENGTH: the content size of the HTTP header
    • WHOIS_COUNTRY: a categorical variable whose values are the countries obtained from the server response (specifically, our script used the Whois API)
    • WHOIS_STATEPRO: a categorical variable whose values are the states obtained from the server response (specifically, our script used the Whois API)
    • WHOIS_REGDATE: the server registration date provided by Whois, with format DD/MM/YYYY HH:MM
    • WHOIS_UPDATED_DATE: the last update date of the server, obtained through Whois
    • TCP_CONVERSATION_EXCHANGE: the number of TCP packets exchanged between the server and our honeypot client
    • DIST_REMOTE_TCP_PORT: the number of distinct remote TCP ports detected
    • REMOTE_IPS: the total number of IPs connected to the honeypot
    • APP_BYTES: the number of bytes transferred
    • SOURCE_APP_PACKETS: packets sent from the honeypot to the server
    • REMOTE_APP_PACKETS: packets received from the server
    • APP_PACKETS: the total number of IP packets generated during the communication between the honeypot and the server
    • DNS_QUERY_TIMES: the number of DNS packets generated during the communication between the honeypot and the server
    • TYPE: a categorical variable whose values represent the type of web page analyzed; 1 is for malicious websites and 0 is for benign websites
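    Two of the listed features can be sketched directly from their definitions above (a toy example using the listed special characters, not the authors' script):

```python
# URL_LENGTH and NUMBER_SPECIAL_CHARACTERS, as defined in the list above.
SPECIAL = set('/%#&.=')

def url_length(url: str) -> int:
    return len(url)

def number_special_characters(url: str) -> int:
    # Count only the special characters named in the feature description.
    return sum(1 for c in url if c in SPECIAL)

u = "http://example.com/index.php?id=1&x=2"
print(url_length(u), number_special_characters(u))  # 37 8
```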

    Conclusions and future works

    Acknowledgements

    If your papers or other works use our dataset, please cite our pap...

  6. URL List

    • kaggle.com
    zip
    Updated Sep 21, 2019
    Cite
    Akash G (2019). URL List [Dataset]. https://www.kaggle.com/akashsuper2000/url-list
    Explore at:
    zip (27996 bytes). Available download formats
    Dataset updated
    Sep 21, 2019
    Authors
    Akash G
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    URL List

    Description:

    The only file in this dataset contains a single column with a list of working URLs, both malicious and benign.

    Purpose:

    Usage recommendations include scraping features off the URLs and training ML models on those features.

  7. Websites susceptible to CVE-2020-11008

    • webtechsurvey.com
    csv
    Updated Apr 22, 2020
    Cite
    WebTechSurvey (2020). Websites susceptible to CVE-2020-11008 [Dataset]. https://webtechsurvey.com/cve/CVE-2020-11008
    Explore at:
    csv. Available download formats
    Dataset updated
    Apr 22, 2020
    Dataset authored and provided by
    WebTechSurvey
    License

    https://webtechsurvey.com/terms

    Time period covered
    2025
    Area covered
    Global
    Description

    A complete list of live websites affected by CVE-2020-11008, compiled through global website indexing conducted by WebTechSurvey.

  8. Hour-Long Wget Attack Dataset (Base Graph)

    • dataverse.harvard.edu
    Updated Oct 1, 2018
    Cite
    Xueyuan Han (2018). Hour-Long Wget Attack Dataset (Base Graph) [Dataset]. http://doi.org/10.7910/DVN/IWFWSP
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 1, 2018
    Dataset provided by
    Harvard Dataverse
    Authors
    Xueyuan Han
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Wget provenance data in edge-list format, parsed from CamFlow provenance data. This dataset contains attack wget base-graph data. Experiments ran for over an hour, with recurrent wget commands issued throughout (one every 120 seconds). Background activities were also captured, as CamFlow whole-system provenance was turned on. Several malicious URLs were visited during each experimental session. Five attack experiments were recorded with different mixtures of normal benign wget operations. Provenance data was in JSON format and converted into edge-list format for the Unicorn IDS research project; the conversion took place on Sept. 26th, 2018. Each experiment consists of a base and a streaming graph component.

  9. Benign and Malicious QR codes

    • kaggle.com
    Updated Aug 1, 2022
    Cite
    Samah Malibari (2022). Benign and Malicious QR codes [Dataset]. https://www.kaggle.com/datasets/samahsadiq/benign-and-malicious-qr-codes/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    Kaggle
    Authors
    Samah Malibari
    Description

    This dataset is created using Python code to generate QR codes from the REAL list of URLs provided in the following dataset from Kaggle: https://www.kaggle.com/datasets/samahsadiq/benign-and-malicious-urls

    The mentioned dataset consists of over 600,000 URLs. However, only the first 100,000 URLs from each class (benign and malicious) are used to generate the QR codes. In total, there are 200,000 QR code images in the dataset, each encoding a REAL URL.

    This dataset is a balanced dataset of version-2 QR codes. The 100,000 benign QR codes were generated by a single loop in Python, and the same was done for the malicious QR codes.

    The QR code images that belong to malicious URLs are under the 'malicious' folder, with the word 'malicious' in their file names. Likewise, the QR codes that belong to benign URLs are under the 'benign' folder, with 'benign' appearing in their file names.

    NOTE: Keep in mind that the malicious QR codes encode REAL malicious URLs; it is not recommended to scan them manually or visit the encoded websites.

    For more information about the encoded URLs, please refer to the dataset mentioned above on Kaggle.

  10. Websites susceptible to CVE-2020-5260

    • webtechsurvey.com
    csv
    Updated Apr 15, 2020
    Cite
    WebTechSurvey (2020). Websites susceptible to CVE-2020-5260 [Dataset]. https://webtechsurvey.com/cve/CVE-2020-5260
    Explore at:
    csv. Available download formats
    Dataset updated
    Apr 15, 2020
    Dataset authored and provided by
    WebTechSurvey
    License

    https://webtechsurvey.com/terms

    Time period covered
    2025
    Area covered
    Global
    Description

    A complete list of live websites affected by CVE-2020-5260, compiled through global website indexing conducted by WebTechSurvey.

  11. Building a DGA Classifier: Part 1, Data Preparation

    • impactcybertrust.org
    • search.datacite.org
    Updated Jan 28, 2019
    Cite
    External Data Source (2019). Building a DGA Classifier: Part 1, Data Preparation [Dataset]. http://doi.org/10.23721/100/1478811
    Explore at:
    Dataset updated
    Jan 28, 2019
    Authors
    External Data Source
    Description

    The purpose of building a DGA classifier isn't specifically to take down botnets, but to discover and detect their use on our networks or services. If you have a list of domains resolved and accessed at your organization, it is now possible to see which of those were potentially generated and used by malware.

    The dataset consists of three sources (as described in the Data-Driven Security blog):

    Alexa: For samples of legitimate domains, an obvious choice is the Alexa list of top web sites. But it's not ready for our use as-is. If you grab the top 1 million Alexa domains and parse them, you'll find just over 11 thousand are full URLs and not just domains, and there are thousands of domains with subdomains that don't help us (we are only classifying domains here). So after removing the URLs, de-duplicating the domains, and cleaning up, I end up with the Alexa top 965,843.

    "Real World" Data fromOpenDNS: After reading the post from Frank Denis at OpenDNS titled"Why Using Real World Data Matters For Building Effective Security Models", I grabbed their10,000 Top Domainsand their10,000 Random samples. If we compare that to the top Alexa domains, 6,901 of the top ten thousand are in the alexa data and 893 of the random domains are in the Alexa data. I will clean that up as I make the final training dataset.

    DGA domains: The Click Security version wasn't very clear about where they got their bad domains, so I decided to collect my own, and this was rather fun. Because I work with some interesting characters (who know interesting characters), I was able to collect several data sets from recent botnets: "Cryptolocker", two separate "GameOver Zeus" algorithms, and an anonymous collection of malicious (and algorithmically generated) domains. In the end, I was able to collect 73,598 algorithmically generated domains.
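    The Alexa clean-up step (dropping full URLs and de-duplicating domains) can be sketched as:

```python
# Toy stand-in for the raw Alexa entries described above.
raw_entries = [
    "example.com",
    "example.com",                   # duplicate entry
    "http://example.org/page.html",  # a full URL, not a bare domain
    "news.example.net",
]

cleaned = []
seen = set()
for entry in raw_entries:
    if "/" in entry:        # full URLs (and paths) are dropped
        continue
    if entry not in seen:   # de-duplicate while preserving order
        seen.add(entry)
        cleaned.append(entry)

print(cleaned)  # ['example.com', 'news.example.net']
```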

  12. Stego-Images-Dataset

    • kaggle.com
    zip
    Updated Aug 8, 2022
    Cite
    Marco Zuppelli (2022). Stego-Images-Dataset [Dataset]. https://www.kaggle.com/datasets/marcozuppelli/stegoimagesdataset
    Explore at:
    zip (1619497465 bytes). Available download formats
    Dataset updated
    Aug 8, 2022
    Authors
    Marco Zuppelli
    License

    Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    This dataset contains 44,000 512x512-pixel images containing different malicious payloads, i.e., JavaScript, HTML, PowerShell, URLs, and Ethereum addresses, embedded via the Least Significant Bit (LSB) technique. The payloads are selected to fit in the first bit of each color channel, i.e., max 512x512x3 bits.

    Dataset information:
    • the tool used for the LSB steganographic technique is LSB-Steganography;
    • the 8,000 original 512x512-pixel images are borrowed from different open source repositories (and released under the GPL3 license);
    • the malicious JavaScripts are borrowed from the JavaScript Malware Collection;
    • the malicious JavaScripts obfuscated in HTML are borrowed from the Malicious Javascript Dataset;
    • the malicious PowerShell scripts are borrowed from the PowerDrive repository;
    • the malicious URLs are borrowed from the URLhaus database;
    • the Ethereum addresses are borrowed from the Ethereum-lists repository.
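    A minimal NumPy sketch of the LSB embedding described, using a toy payload in a random image; this is an illustration of the technique, not the dataset's actual tooling (which is LSB-Steganography):

```python
import numpy as np

# Payload bits replace the least-significant bit of each colour channel
# of a 512x512 RGB image, so at most 512*512*3 bits fit.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(512, 512, 3), dtype=np.uint8)

payload = b"alert('hi')"  # stand-in for a malicious JavaScript payload
bits = np.unpackbits(np.frombuffer(payload, dtype=np.uint8))
assert bits.size <= img.size  # payload must fit in the first bits

flat = img.reshape(-1).copy()
flat[: bits.size] = (flat[: bits.size] & 0xFE) | bits  # overwrite the LSBs
stego = flat.reshape(img.shape)

# Recover the payload from the LSBs to confirm the round trip.
recovered_bits = stego.reshape(-1)[: bits.size] & 1
recovered = np.packbits(recovered_bits).tobytes()
print(recovered)
```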

    The train set is composed of 16,000 images containing the different payloads; the test set and the validation set each contain 8,000 such images. The test folder contains two additional test sets: - stego_b64: 8,000 images with payloads "obfuscated" with the base64 algorithm; - stego_zip: 8,000 images with payloads "obfuscated" with the zip algorithm. You can find more information on how the dataset is composed in the corresponding "dataset_info.csv" file.

    If you use this dataset, please cite our work (doi: 10.22667/JOWUA.2022.09.30.050): N. Cassavia, L. Caviglione, M. Guarascio, G. Manco, M. Zuppelli, “Detection of Steganographic Threats Targeting Digital Images in Heterogeneous Ecosystems Through Machine Learning”, Journal of Wireless Mobile Networks, Ubiquitous Computing, and Dependable Applications, Vol. 13, No. 3, pp. 50-67, September 2022.

  13. Malicious and Benign Website dataset

    • kaggle.com
    zip
    Updated Apr 2, 2024
    Cite
    Jack Cavar (2024). Malicious and Benign Website dataset [Dataset]. https://www.kaggle.com/datasets/jackcavar/malicious-and-benign-website-dataset
    Explore at:
    zip (2305952218 bytes). Available download formats
    Dataset updated
    Apr 2, 2024
    Authors
    Jack Cavar
    Description

    Machine learning dataset created as part of my 4th year dissertation at Abertay University.

    Dataset consists of:

    20,175 phishing websites from PhishTank and PhishStats.

    49,524 benign websites from Alexa top 1 million websites.

    The dataset is formed of two separate parts: an Excel spreadsheet and an accompanying txt file of the first page scraped from each URL. Additional information for each website, such as the number of redirects a request made and the site's WHOIS information, was also gathered.

    The dataset was collected over a 38-day period. LightGBM was found to work best with the dataset; with a larger dataset, models built on this framework should be very accurate.

  14. CIC-DDoS2019 Dataset

    • data.mendeley.com
    • kaggle.com
    Updated Mar 3, 2023
    Cite
    Md Alamin Talukder (2023). CIC-DDoS2019 Dataset [Dataset]. http://doi.org/10.17632/ssnc74xm6r.1
    Explore at:
    Dataset updated
    Mar 3, 2023
    Authors
    Md Alamin Talukder
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Distributed Denial of Service (DDoS) attack is a menace to network security that aims at exhausting the target networks with malicious traffic. Although many statistical methods have been designed for DDoS attack detection, designing a real-time detector with low computational overhead is still one of the main concerns. On the other hand, the evaluation of new detection algorithms and techniques heavily relies on the existence of well-designed datasets. In this paper, first, we review the existing datasets comprehensively and propose a new taxonomy for DDoS attacks. Secondly, we generate a new dataset, namely CICDDoS2019, which remedies all current shortcomings. Thirdly, using the generated dataset, we propose a new detection and family classification approach based on a set of network flow features. Finally, we provide the most important feature sets to detect different types of DDoS attacks with their corresponding weights.

    The dataset offers an extended set of Distributed Denial of Service attacks, most of which employ some form of amplification through reflection. The dataset shares its feature set with the other CIC NIDS datasets, IDS2017, IDS2018 and DoS2017

    Original paper link: https://ieeexplore.ieee.org/abstract/document/8888419
    Kaggle dataset link: https://www.kaggle.com/datasets/dhoogla/cicddos2019

  15. Fast & Furious: Malware Detection Data Stream

    • kaggle.com
    zip
    Updated Aug 12, 2022
    Cite
    Fabrício Ceschin (2022). Fast & Furious: Malware Detection Data Stream [Dataset]. https://www.kaggle.com/datasets/fabriciojoc/fast-furious-malware-data-stream/discussion
    Explore at:
    zip (8794473827 bytes). Available download formats
    Dataset updated
    Aug 12, 2022
    Authors
    Fabrício Ceschin
    Description

    These datasets (DREBIN and AndroZoo) are contributions to the paper "Fast & Furious: On the Modelling of Malware Detection as an Evolving Data Stream". If you use them in your work, please cite our paper using the BibTeX below:

    @article{CESCHIN2022118590,
    title = {Fast & Furious: On the modelling of malware detection as an evolving data stream},
    journal = {Expert Systems with Applications},
    pages = {118590},
    year = {2022},
    issn = {0957-4174},
    doi = {https://doi.org/10.1016/j.eswa.2022.118590},
    url = {https://www.sciencedirect.com/science/article/pii/S0957417422016463},
    author = {Fabrício Ceschin and Marcus Botacin and Heitor Murilo Gomes and Felipe Pinagé and Luiz S. Oliveira and André Grégio},
    keywords = {Machine learning, Data streams, Concept drift, Malware detection, Android}
    }
    

    Both datasets are saved in the Parquet file format. To read them with pandas, use the following code:

    import pandas as pd

    # Read the zipped parquet files shipped with the dataset
    data_drebin = pd.read_parquet("drebin_drift.parquet.zip")
    data_androzoo = pd.read_parquet("androbin.parquet.zip")
    

    Note that these datasets are different from their original versions. The original DREBIN dataset does not contain the samples' timestamps, which we collected using VirusTotal API. Our version of the AndroZoo dataset is a subset of reports from their dataset previously available in their APK Analysis API, which was discontinued.

    The DREBIN dataset is composed of ten textual attributes from Android APKs (lists of API calls, permissions, URLs, etc.). It is publicly available to download and contains 123,453 benign and 5,560 malicious Android applications. Their distribution over time is shown below.

    Figure: DREBIN dataset distribution by month (https://i.imgur.com/IGKOMtE.png)

    The AndroZoo dataset is a subset of Android application reports provided by the AndroZoo API, composed of eight textual attributes (resource names, source code classes and methods, manifest permissions, etc.), and contains 213,928 benign and 70,340 malicious applications. The distribution over time of our AndroZoo subset, which keeps the same goodware and malware distribution as the original dataset (composed of almost 10 million apps), is shown below.

    Figure: AndroZoo dataset distribution by month (https://i.imgur.com/8zxH3M4.png)

    The source code for all the experiments shown in the paper is also available here on Kaggle (note that the experiments using the AndroZooo dataset did not run in the Kaggle environment due to high memory usage).

    Experiment 1 (The Best-Case Scenario for AVs - ML Cross-Validation)

    Here we classify all samples together to compare which feature extraction algorithm is best and to report baseline results. We tested several parameters for both algorithms, fixing the vocabulary size at 100 for TF-IDF (top-100 features ordered by term frequency) and creating 100-dimensional projections for Word2Vec, resulting in 1,000 and 800 features per app for DREBIN and AndroZoo, respectively. All results are reported after a 10-fold cross-validation procedure, a method commonly used in ML to evaluate models because its results are less prone to bias (note that we train new classifiers and feature extractors at every iteration of the cross-validation process). In practice, folding the dataset implies that the AV company has a mixed view of both past and future threats, despite temporal effects, which is the best scenario for AV operation and ML evaluation.
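A minimal sketch of this cross-validation setup, assuming scikit-learn is available; the toy permission strings, labels, and the choice of a random forest below are invented stand-ins for the full DREBIN/AndroZoo attributes and the paper's classifiers:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# Toy corpus standing in for one of DREBIN's textual attributes.
docs = [
    "send_sms read_contacts internet",
    "internet access_fine_location",
    "send_sms read_sms internet admin",
    "internet vibrate",
    "send_sms admin read_contacts",
    "internet access_network_state",
]
labels = [1, 0, 1, 0, 1, 0]  # 1 = malware, 0 = goodware

# Vocabulary capped at 100 terms, as in the paper's TF-IDF setting.
X = TfidfVectorizer(max_features=100).fit_transform(docs)

# 3-fold CV here only because the toy corpus is tiny; the paper uses 10-fold.
scores = cross_val_score(RandomForestClassifier(random_state=0), X, labels, cv=3)
print(scores.mean())
```

Note that a leakage-free replication would fit the TF-IDF vectorizer inside each fold (e.g. with a `Pipeline`), which is what the paper's "new feature extractors at every iteration" describes.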

    Source Codes: DREBIN TFIDF | DREBIN W2V | ANDROZOO TFIDF | ANDROZOO W2V

    Experiment 2 (On Classification Failure - Temporal Classification)

    Although the currently used classification methodology helps reduce dataset biases, it would demand knowledge about future threats to work properly. AV companies train their classifiers using data from past samples and leverage them to predict future threats, expecting them to present the same characteristics as past ones. However, malware samples are very dynamic, so this strategy is the worst-case scenario for AV companies. To demonstrate the effects of predicting future threats based on past data, we split our datasets in two: we used the first half (oldest samples) to train our classifiers, which were then used to predict the newest samples from the second half. The results in the paper indicate a drop in all metrics compared to the 10-fold experiment on both the DREBIN and AndroZoo datasets...
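The temporal split described above can be sketched in a few lines of pandas; the miniature timestamped frame below is invented, standing in for the timestamped DREBIN/AndroZoo parquet files:

```python
import pandas as pd

# Hypothetical miniature dataset; real work would use the timestamped
# drebin_drift / androzoo parquet files described above.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2015-03-01", "2014-01-15", "2016-07-09",
         "2013-06-20", "2017-02-02", "2015-11-11"]),
    "label": [0, 1, 0, 1, 0, 1],
})

# Sort by collection time, train on the oldest half, test on the newest half.
df = df.sort_values("timestamp").reset_index(drop=True)
half = len(df) // 2
train, test = df.iloc[:half], df.iloc[half:]

# Every training sample predates every test sample.
assert train["timestamp"].max() <= test["timestamp"].min()
```

This ordering constraint is exactly what 10-fold cross-validation violates: shuffled folds let the classifier "see" samples newer than the ones it is asked to predict.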

  16. Web Network

    • kaggle.com
    zip
    Updated Sep 27, 2024
    willian oliveira (2024). Web Network [Dataset]. https://www.kaggle.com/datasets/willianoliveiragibin/web-network/discussion
    Explore at:
    zip(4812 bytes)Available download formats
    Dataset updated
    Sep 27, 2024
    Authors
    willian oliveira
    License

    https://creativecommons.org/publicdomain/zero/1.0/ (CC0 1.0)

    Description

    This dataset contains network traffic logs captured by Burp-Suite, aimed at classifying web requests as either good or bad based on their characteristics. By analyzing patterns in the network logs, this dataset helps in identifying web requests that could be categorized as legitimate or malicious. The main goal is to assist in the detection and prevention of web-based attacks, contributing to a more secure online environment.

    The dataset was gathered using Burp-Suite, a tool widely recognized for web vulnerability scanning and traffic monitoring. Burp-Suite captures network traffic, providing detailed records of the interactions between clients and servers. This includes information such as URL paths, headers, and parameters. Using this information, it is possible to develop classification models that assess the characteristics of each web request, determining whether they should be considered "good" or "bad."

    Characteristics of the Dataset

    Traffic Log Format: The dataset contains various HTTP(S) requests, including headers, URLs, and request bodies. Each request is labeled based on its perceived legitimacy. The legitimate requests ("good") are standard, benign user activity, while the malicious requests ("bad") are crafted to exploit server vulnerabilities.

    Feature Analysis: The characteristics of each web request are crucial to classifying it correctly. This includes analyzing the structure of the URL, the parameters included in the query string, and the content of request headers. By evaluating these elements, we can detect suspicious patterns that may indicate a malicious request, such as SQL injection attempts, cross-site scripting (XSS), or other forms of attack.

    Detection Strategy: The classification is based on identifying certain key patterns and terms that are indicative of an attack. This approach is particularly useful for detecting known types of attacks, such as SQL injection, command injection, and XSS. The dataset includes specific keywords that have been identified as potential indicators of malicious intent.

    List of Bad Words in the URL Path

    The dataset specifically highlights the importance of monitoring the URL path for certain keywords that are commonly used in attacks. The list of "bad words" that should be checked in the URL path includes:

    sleep: Used in SQL injection attacks to delay server response.
    uid: Frequently targeted in attempts to exploit user identifier vulnerabilities.
    select: A common SQL keyword that might indicate a potential SQL injection attempt.
    waitfor: Used in SQL Server to introduce delays, often seen in timing attacks.
    delay: Similar to "sleep," used to manipulate response times.
    system: Can indicate an attempt to execute system commands through vulnerabilities like command injection.
    union: Often used in SQL injection to combine results from different tables, which may lead to data exposure.
    order by: Another SQL term that may be exploited to modify query results.
    group by: Similar to "order by," used in SQL queries that might be manipulated for malicious purposes.
    admin: An attempt to access administrative sections of a website, possibly leading to privilege escalation.
    drop: Indicates an attempt to delete database tables or other critical components.
    script: Often used in cross-site scripting (XSS) attacks to inject malicious code into webpages.

    These "bad words" serve as potential red flags in the dataset and play an important role in differentiating between legitimate and malicious requests. If any of these keywords appear in the URL path, it significantly increases the likelihood that the request is malicious and warrants further investigation or immediate blocking.
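A keyword check like the one described can be sketched with the standard library alone; the helper name `flag_url` is hypothetical, and a production detector would need tokenization and decoding rather than plain substring matching (e.g. "uid" also matches inside "guide"):

```python
from urllib.parse import urlparse

# Keywords taken from the dataset description above.
BAD_WORDS = [
    "sleep", "uid", "select", "waitfor", "delay", "system",
    "union", "order by", "group by", "admin", "drop", "script",
]

def flag_url(url: str) -> list[str]:
    """Return the bad words found in the URL path (case-insensitive)."""
    path = urlparse(url).path.lower()
    return [w for w in BAD_WORDS if w in path]

print(flag_url("https://example.com/products/list"))       # → []
print(flag_url("https://example.com/admin/union-select"))  # → ['select', 'union', 'admin']
```

Note this only inspects the path, as the description specifies; query strings and request bodies, which the dataset also captures, would need the same scan.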

    Importance for Web Security The ability to classify web requests as good or bad is a fundamental aspect of web security. Many cyber attacks begin with an attempt to interact with web servers in unintended ways, exploiting vulnerabilities to gain unauthorized access, disrupt services, or steal sensitive information. By training models using this dataset, security teams can create intelligent systems that automatically recognize patterns associated with attacks.

    SQL Injection Prevention: By identifying keywords such as "select," "union," "drop," and others, the model can flag requests that appear to contain SQL code, suggesting an attempt to execute an unauthorized database query.

    Command Injection: Keywords like "system" may indicate attempts to execute shell commands, which can be highly damaging if successful.

    Access Control: Requests that include "admin" may signal attempts to access restricted areas, possibly representing privilege escalation attacks.

    Practical Usage In practice, this dataset could be used to build a machine learning model for real-time web traffic analysis. The model could be integrated into web application firewalls (WAFs) to detect and block suspicious requests be...


