73 datasets found
  1. Pristine and Malicious URLs

    • ieee-dataport.org
    Updated Nov 6, 2023
    Cite
    Ehsan Nowroozi (2023). Pristine and Malicious URLs [Dataset]. https://ieee-dataport.org/documents/pristine-and-malicious-urls
    Dataset updated
    Nov 6, 2023
    Authors
    Ehsan Nowroozi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The goal of our research is to identify malicious advertisement URLs and to apply adversarial attacks to ensemble models. We extract lexical and web-scraped features using Python code. Four machine learning algorithms are then applied for classification, and K-Means clustering is used for visual understanding. We check the vulnerability of the models with adversarial examples: we apply the Zeroth Order Optimization (ZOO) adversarial attack to the models and compute the attack accuracy.
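
    A Zeroth Order Optimization attack needs only model outputs, not gradients. The following is a minimal sketch of its core step, a coordinate-wise symmetric finite-difference gradient estimate, run on a toy scoring function. The names `estimate_gradient` and `toy_score`, the step size `h`, and the input vector are hypothetical stand-ins and are not the authors' models or features.

```python
from typing import Callable, List


def estimate_gradient(f: Callable[[List[float]], float],
                      x: List[float], h: float = 1e-4) -> List[float]:
    """Estimate the gradient of a black-box function f at x by querying f only."""
    grad = []
    for i in range(len(x)):
        plus = list(x)
        minus = list(x)
        plus[i] += h
        minus[i] -= h
        # Symmetric difference quotient along coordinate i.
        grad.append((f(plus) - f(minus)) / (2 * h))
    return grad


def toy_score(x: List[float]) -> float:
    """Hypothetical stand-in for a classifier's 'malicious' score."""
    return 0.5 * x[0] ** 2 + 3.0 * x[1]


x0 = [2.0, 1.0]
grad = estimate_gradient(toy_score, x0)
# One descent step that lowers the score, as an evasion attempt might.
x1 = [xi - 0.1 * gi for xi, gi in zip(x0, grad)]
```

    Each coordinate costs two queries to the model, which is why attacks of this kind are usually judged by query count as well as by attack accuracy.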

  2. Malicious URLs Dataset

    • paperswithcode.com
    Updated Oct 8, 2024
    Cite
    (2024). Malicious URLs Dataset [Dataset]. https://paperswithcode.com/dataset/malicious-urls-dataset
    Dataset updated
    Oct 8, 2024
    Description

    Context: Malicious URLs, or malicious websites, are a very serious threat to cybersecurity. They host unsolicited content (spam, phishing, drive-by downloads, etc.), lure unsuspecting users into becoming victims of scams (monetary loss, theft of private information, and malware installation), and cause losses of billions of dollars every year. We have collected this dataset to include a large number of examples of malicious URLs so that a machine learning-based model can be developed to identify them and stop them before they infect computer systems or spread through the internet.

    Content: We have collected a dataset of 651,191 URLs, of which 428,103 are benign or safe URLs, 96,457 are defacement URLs, 94,111 are phishing URLs, and 32,520 are malware URLs. Figure 2 depicts their distribution in terms of percentage. Curating the dataset is one of the most crucial tasks in a machine learning project; we curated this dataset from five different sources.
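
    The class shares behind that percentage distribution can be recomputed directly from the four stated counts. A short sketch (variable names are illustrative):

```python
counts = {
    "benign": 428_103,
    "defacement": 96_457,
    "phishing": 94_111,
    "malware": 32_520,
}
total = sum(counts.values())
# Share of each class as a percentage of all URLs.
shares = {label: 100 * n / total for label, n in counts.items()}
for label, pct in shares.items():
    print(f"{label:<11}{counts[label]:>8,}  {pct:5.1f}%")
```

    The four counts add up to the stated 651,191 URLs, and benign URLs make up roughly two thirds of the set, so a classifier trained on it faces a clear class imbalance.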

    • For collecting benign, phishing, malware, and defacement URLs, we used the URL dataset (ISCX-URL-2016).
    • To increase the number of phishing and malware URLs, we used the Malware domain black list dataset.
    • We increased the number of benign URLs using the faizan git repo.
    • Finally, we added more phishing URLs using the Phishtank dataset and the PhishStorm dataset.

    Because the dataset comes from different sources, we first collected the URLs from each source into a separate data frame and then merged them, retaining only the URLs and their class type.
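
    The merge step can be sketched with the standard library alone. This is an illustration under stated assumptions, not the authors' code: the description mentions data frames, whereas this sketch uses plain lists; the sample rows are hypothetical; and dropping duplicate URLs (keeping the first label seen) is an assumption the description does not confirm.

```python
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str]  # (url, class type)


def merge_sources(sources: Iterable[Iterable[Record]]) -> List[Record]:
    """Combine labelled URL lists, keeping the first label seen for each URL."""
    merged: Dict[str, str] = {}
    for source in sources:
        for url, label in source:
            merged.setdefault(url.strip(), label)
    return list(merged.items())


# Hypothetical sample rows standing in for the real source files.
iscx = [("http://example.com/", "benign"),
        ("http://bad.example/login", "phishing")]
phishtank = [("http://bad.example/login", "phishing"),
             ("http://phish.example/a", "phishing")]
dataset = merge_sources([iscx, phishtank])
```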

  3. A Dataset of Information (DNS, IP, WHOIS/RDAP, TLS, GeoIP) for a Large...

    • zenodo.org
    json
    Updated Dec 10, 2024
    Cite
    Radek Hranický; Adam Horák; Ondřej Ondryáš (2024). A Dataset of Information (DNS, IP, WHOIS/RDAP, TLS, GeoIP) for a Large Corpus of Benign, Phishing, and Malware Domain Names 2024 [Dataset]. http://doi.org/10.5281/zenodo.13330074
    Available download formats: json
    Dataset updated
    Dec 10, 2024
    Dataset provided by
    Zenodo
    Authors
    Radek Hranický; Adam Horák; Ondřej Ondryáš
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Aug 16, 2024
    Description

    The dataset contains DNS records, IP-related features, WHOIS/RDAP information, information from TLS handshakes and certificates, and GeoIP information for 368,956 benign domains from Cisco Umbrella, 461,338 benign domains from the actual CESNET network traffic, 164,425 phishing domains from PhishTank and OpenPhish services, and 100,809 malware domains from various sources like ThreatFox, The Firebog, MISP threat intelligence platform, and other sources. The ground truth for the phishing dataset was double-checked with the VirusTotal (VT) service. Domain names not considered malicious by VT have been removed from phishing and malware datasets. Similarly, benign domain names that were considered risky by VT have been removed from the benign datasets. The data was collected between March 2023 and July 2024. The final assessment of the data was conducted in August 2024.

    The dataset is useful for cybersecurity research, e.g. statistical analysis of domain data or feature extraction for training machine learning-based classifiers, e.g. for phishing and malware website detection.

    Data Files

    • The data is located in the following individual files:

      • benign_umbrella.json - data for 368,956 benign domains from Cisco Umbrella,
      • benign_cesnet.json - data for 461,338 benign domains from the CESNET network,
      • phishing.json - data for 164,425 phishing domains, and
      • malware.json - data for 100,809 malware domains.

    Data Structure

    Each data file contains a JSON array of records generated using mongoexport. The following table documents the structure of a record. Please note that:

    • some fields may be missing (they should be interpreted as nulls),
    • extra fields may be present (they should be ignored).

    | Field name | Field type | Nullable | Description |
    |---|---|---|---|
    | domain_name | String | No | The evaluated domain name |
    | url | String | No | The source URL for the domain name |
    | evaluated_on | Date | No | Date of last collection attempt |
    | source | String | No | An identifier of the source |
    | sourced_on | Date | No | Date of ingestion of the domain name |
    | dns | Object | Yes | Data from DNS scan |
    | rdap | Object | Yes | Data from RDAP or WHOIS |
    | tls | Object | Yes | Data from TLS handshake |
    | ip_data | Array of Objects | Yes | Array of data objects capturing the IP addresses related to the domain name |

    DNS data (dns field)

    | Field name | Field type | Nullable | Description |
    |---|---|---|---|
    | A | Array of Strings | No | Array of IPv4 addresses |
    | AAAA | Array of Strings | No | Array of IPv6 addresses |
    | TXT | Array of Strings | No | Array of raw TXT values |
    | CNAME | Object | No | The CNAME target and related IPs |
    | MX | Array of Objects | No | Array of objects with the MX target hostname, priority and related IPs |
    | NS | Array of Objects | No | Array of objects with the NS target hostname and related IPs |
    | SOA | Object | No | All the SOA fields, present if found at the target domain name |
    | zone_SOA | Object | No | The SOA fields of the target’s zone (closest point of delegation), present if found and not a record in the target domain directly |
    | dnssec | Object | No | Flags describing the DNSSEC validation result for each record type |
    | ttls | Object | No | The TTL values for each record type |
    | remarks | Object | No | The zone domain name and DNSSEC flags |

    RDAP data (rdap field)

    | Field name | Field type | Nullable | Description |
    |---|---|---|---|
    | copyright_notice | String | No | RDAP/WHOIS data usage copyright notice |
    | dnssec | Bool | No | DNSSEC presence flag |
    | entitites | Object | No | An object with various arrays representing the found related entity types (e.g. abuse, admin, registrant). The arrays contain objects describing the individual entities. |
    | expiration_date | Date | Yes | The current date of expiration |
    | handle | String | No | RDAP handle |
    | last_changed_date | Date | Yes | The date when the domain was last changed |
    | name | String | No | The target domain name for which the data in this object are stored |
    | nameservers | Array of Strings | No | Nameserver hostnames provided by RDAP or WHOIS |
    | registration_date | Date | Yes | First registration date |
    | status | Array of Strings | | |

  4. Common industries for malicious URL redirections South Korea H2 2021

    • statista.com
    Updated Jun 26, 2024
    Cite
    Statista (2024). Common industries for malicious URL redirections South Korea H2 2021 [Dataset]. https://www.statista.com/statistics/1311491/south-korea-malicious-url-redirect-common-industries/
    Dataset updated
    Jun 26, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    South Korea
    Description

    In the second half of 2021, manufacturing websites were the most common targets of malicious URL redirections, accounting for 39 percent of detected cases. Although manufacturing websites had been a common target for malware attacks before, detections on these sites rose sharply compared to the first half of the year, when around 23 percent of cases redirected through that industry.

  5. malicious-url

    • kaggle.com
    Updated Apr 23, 2024
    Cite
    kianindeed (2024). malicious-url [Dataset]. https://www.kaggle.com/datasets/kianindeed/malicious-url
    Dataset updated
    Apr 23, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    kianindeed
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset

    This dataset was created by kianindeed

    Released under MIT


  6. Summary of previous works on malicious URL detection.

    • plos.figshare.com
    xls
    Updated May 31, 2024
    Cite
    Suresh Sankaranarayanan; Arvinthan Thevar Sivachandran; Anis Salwa Mohd Khairuddin; Khairunnisa Hasikin; Abdul Rahman Wahab Sait (2024). Summary of previous works on malicious URL detection. [Dataset]. http://doi.org/10.1371/journal.pone.0302196.t001
    Available download formats: xls
    Dataset updated
    May 31, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Suresh Sankaranarayanan; Arvinthan Thevar Sivachandran; Anis Salwa Mohd Khairuddin; Khairunnisa Hasikin; Abdul Rahman Wahab Sait
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Summary of previous works on malicious URL detection.

  7. Suspicious File and URL Analysis Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Mar 10, 2025
    Cite
    Archive Market Research (2025). Suspicious File and URL Analysis Report [Dataset]. https://www.archivemarketresearch.com/reports/suspicious-file-and-url-analysis-55344
    Available download formats: doc, ppt, pdf
    Dataset updated
    Mar 10, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global market for suspicious file and URL analysis is experiencing robust growth, projected to reach $88 million in 2025 and maintain a Compound Annual Growth Rate (CAGR) of 6.4% from 2025 to 2033. This expansion is driven by the escalating sophistication of cyber threats, the increasing reliance on digital infrastructure across various sectors, and the growing need for proactive security measures to mitigate risks associated with malicious files and URLs. The market's segmentation reveals a strong preference for cloud-based solutions, offering scalability and accessibility to organizations of all sizes. Large enterprises are the primary consumers, reflecting their higher vulnerability to advanced cyberattacks and their greater capacity for investment in robust security solutions. However, the market is also seeing significant adoption among SMEs, driven by the increasing affordability and ease of use of cloud-based solutions and a rising awareness of the risks associated with malicious online content. Several factors contribute to market growth. The development and proliferation of advanced malware necessitates continuous improvement in threat detection and analysis capabilities. Furthermore, the expanding attack surface due to remote work and the increasing use of IoT devices are contributing to a heightened demand for effective file and URL analysis tools. Regulatory compliance requirements, particularly within sensitive industries like finance and healthcare, further incentivize organizations to invest in these solutions. Conversely, challenges such as the emergence of obfuscated malware, the high cost of advanced solutions, and the need for specialized expertise pose some restraints to broader market penetration. The competitive landscape is diverse, with established cybersecurity players and innovative startups offering a range of solutions catering to specific needs and budgets. 
This competitive pressure is ultimately beneficial for consumers, driving innovation and fostering a more efficient and effective market for suspicious file and URL analysis.

  8. A Dataset of Information (DNS, IP, WHOIS/RDAP, TLS, GeoIP) for a Large...

    • zenodo.org
    json
    Updated Dec 11, 2024
    + more versions
    Cite
    Radek Hranický; Jan Polišenský; Adam Horák; Petr Pouč; Kamil Jeřábek; Tomáš Ebert (2024). A Dataset of Information (DNS, IP, WHOIS/RDAP, TLS, GeoIP) for a Large Corpus of Benign, Phishing, and Malware Domain Names 2024 [Dataset]. http://doi.org/10.5281/zenodo.14332167
    Available download formats: json
    Dataset updated
    Dec 11, 2024
    Dataset provided by
    Zenodo
    Authors
    Radek Hranický; Jan Polišenský; Adam Horák; Petr Pouč; Kamil Jeřábek; Tomáš Ebert
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Aug 16, 2024
    Description

    The dataset contains DNS records, IP-related features, WHOIS/RDAP information, information from TLS handshakes and certificates, and GeoIP information for 368,956 benign domains from Cisco Umbrella, 461,338 benign domains from the actual CESNET network traffic, 164,425 phishing domains from PhishTank and OpenPhish services, and 100,809 malware domains from various sources like ThreatFox, The Firebog, MISP threat intelligence platform, and other sources. The ground truth for the phishing dataset was double-checked with the VirusTotal (VT) service. Domain names not considered malicious by VT have been removed from phishing and malware datasets. Similarly, benign domain names that were considered risky by VT have been removed from the benign datasets. The data was collected between March 2023 and July 2024. The final assessment of the data was conducted in August 2024.

    The dataset is useful for cybersecurity research, e.g. statistical analysis of domain data or feature extraction for training machine learning-based classifiers, e.g. for phishing and malware website detection.

    The dataset was created using software available in the associated GitHub repository nesfit/domainradar-dib.

    Data Files

    • The data is located in the following individual files:

      • benign_umbrella.json - data for 368,956 benign domains from Cisco Umbrella,
      • benign_cesnet.json - data for 461,338 benign domains from the CESNET network,
      • phishing.json - data for 164,425 phishing domains, and
      • malware.json - data for 100,809 malware domains.
    • The schema.json file contains a JSON Schema with detailed description of the data entries.

    Data Structure

    Each data file contains a JSON array of records generated using mongoexport (in the MongoDB Extended JSON (v2) format in Relaxed Mode). The following table documents the structure of a record. Please note that:

    • some fields may be missing (they should be interpreted as nulls),
    • extra fields may be present (they should be ignored).
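
    The two rules above (missing fields read as nulls, extra fields ignored) can be applied while loading a record. The sketch below is hedged: `normalize` and `parse_date` are hypothetical helper names, the sample record is invented rather than taken from the dataset, and it assumes dates arrive in the relaxed Extended JSON form `{"$date": "<ISO-8601 string>"}`.

```python
import json
from datetime import datetime
from typing import Any, Dict, Optional

KNOWN_FIELDS = ("domain_name", "url", "evaluated_on", "source", "sourced_on",
                "dns", "rdap", "tls", "ip_data", "malware_type")
DATE_FIELDS = ("evaluated_on", "sourced_on")


def parse_date(value: Any) -> Optional[datetime]:
    """Decode a relaxed Extended JSON date; return None for anything else."""
    if isinstance(value, dict) and isinstance(value.get("$date"), str):
        return datetime.fromisoformat(value["$date"].replace("Z", "+00:00"))
    return None


def normalize(record: Dict[str, Any]) -> Dict[str, Any]:
    """Keep documented fields only; fields absent from the record become None."""
    out = {name: record.get(name) for name in KNOWN_FIELDS}
    for name in DATE_FIELDS:
        out[name] = parse_date(out[name])
    return out


# Hypothetical record, not taken from the dataset.
raw = json.loads('{"domain_name": "example.com", "source": "demo", '
                 '"evaluated_on": {"$date": "2024-08-16T10:00:00.000Z"}, '
                 '"extra": 1}')
rec = normalize(raw)
```

    Normalizing at load time keeps later feature-extraction code free of per-field existence checks.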

    | Field name | Field type | Nullable | Description |
    |---|---|---|---|
    | domain_name | String | No | The evaluated domain name |
    | url | String | No | The source URL for the domain name |
    | evaluated_on | Date | No | Date of last collection attempt |
    | source | String | No | An identifier of the source |
    | sourced_on | Date | No | Date of ingestion of the domain name |
    | dns | Object | Yes | Data from DNS scan |
    | rdap | Object | Yes | Data from RDAP or WHOIS |
    | tls | Object | Yes | Data from TLS handshake |
    | ip_data | Array of Objects | Yes | Array of data objects capturing the IP addresses related to the domain name |
    | malware_type | String | No | The malware type/family or “unknown” (only present in malware.json) |

    DNS data (dns field)

    | Field name | Field type | Nullable | Description |
    |---|---|---|---|
    | A | Array of Strings | No | Array of IPv4 addresses |
    | AAAA | Array of Strings | No | Array of IPv6 addresses |
    | TXT | Array of Strings | No | Array of raw TXT values |
    | CNAME | Object | No | The CNAME target and related IPs |
    | MX | Array of Objects | No | Array of objects with the MX target hostname, priority and related IPs |
    | NS | Array of Objects | No | Array of objects with the NS target hostname and related IPs |
    | SOA | Object | No | All the SOA fields, present if found at the target domain name |
    | zone_SOA | Object | No | The SOA fields of the target’s zone (closest point of delegation), present if found and not a record in the target domain directly |
    | dnssec | Object | No | Flags describing the DNSSEC validation result for each record type |
    | ttls | Object | No | The TTL values for each record type |
    | remarks | Object | No | The zone domain name and DNSSEC flags |

    RDAP data (rdap field)

    | Field name | Field type | Nullable | Description |
    |---|---|---|---|
    | copyright_notice | String | No | RDAP/WHOIS data usage copyright notice |
    | dnssec | Bool | No | DNSSEC presence flag |
    | entitites | Object | No | An object with various arrays representing the found related entity types (e.g. abuse, admin, registrant). The arrays contain objects describing the individual entities. |
    | expiration_date | Date | Yes | The current date of expiration |
    | handle | String | No | RDAP handle |
    | last_changed_date | Date | Yes | The date when the domain was last changed |
    | name | String | No | |

  9. Total detection cases of web-based malware website South Korea 2014-2023

    • statista.com
    • ai-chatbox.pro
    Updated Dec 13, 2024
    Cite
    Statista (2024). Total detection cases of web-based malware website South Korea 2014-2023 [Dataset]. https://www.statista.com/statistics/1308201/south-korea-total-detection-cases-of-web-based-malware/
    Dataset updated
    Dec 13, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    South Korea
    Description

    In 2023, roughly 12.7 thousand web-based malware sites were detected in South Korea, a slight decrease compared to the previous year. The highest number of detected web-based malware sites in South Korea was 47,703 cases in 2014. The detected sites comprised distribution sites and staging sites.

  10. Malicious URL

    • kaggle.com
    Updated Aug 30, 2024
    Cite
    Shreeshail Chavan (2024). Malicious URL [Dataset]. https://www.kaggle.com/datasets/shreeshailchavan/malicious-url/discussion
    Dataset updated
    Aug 30, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Shreeshail Chavan
    Description

    Dataset

    This dataset was created by Shreeshail Chavan


  11. Suspicious File and URL Analysis Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Jun 21, 2025
    + more versions
    Cite
    Data Insights Market (2025). Suspicious File and URL Analysis Report [Dataset]. https://www.datainsightsmarket.com/reports/suspicious-file-and-url-analysis-1462174
    Available download formats: ppt, doc, pdf
    Dataset updated
    Jun 21, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The market for suspicious file and URL analysis is experiencing robust growth, driven by the escalating sophistication of cyber threats and the increasing reliance on digital infrastructure across various sectors. The $92 million market size in 2025, coupled with a compound annual growth rate (CAGR) of 6.7%, projects substantial expansion to approximately $150 million by 2033. This growth is fueled by several key factors. The rising frequency and severity of ransomware attacks, phishing campaigns, and malware distribution necessitate robust security solutions for proactive threat detection and response. Furthermore, the expanding adoption of cloud-based services and the increasing interconnectedness of devices amplify the attack surface, thereby increasing the demand for advanced file and URL analysis capabilities. The growing awareness of data privacy regulations, such as GDPR and CCPA, also incentivizes organizations to enhance their security posture and invest in solutions that can effectively identify and mitigate potential threats. The market landscape is highly competitive, with a diverse range of players, from established cybersecurity giants like CrowdStrike, McAfee, and Symantec to specialized providers like Any.Run and Joe Sandbox. The market's segmentation likely includes solutions based on different analysis techniques (static, dynamic, sandbox-based), deployment models (cloud, on-premise), and target users (enterprise, SMB, individuals). While the provided data lacks regional specifics, it's reasonable to expect that North America and Europe will initially dominate market share, given their advanced cybersecurity infrastructure and high rates of digital adoption. However, emerging markets in Asia-Pacific and Latin America are poised for significant growth in the coming years, driven by rising digital literacy and economic expansion. 
Competition will intensify as vendors strive to offer innovative features, such as AI-powered threat detection and improved integration with existing security ecosystems.
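
    The 2033 figure quoted in this description follows from compounding the 2025 market size at the stated CAGR. A small sketch of that arithmetic (the function name `project` is illustrative):

```python
def project(value: float, rate: float, years: int) -> float:
    """Compound a starting value at a fixed annual growth rate."""
    return value * (1 + rate) ** years


size_2025 = 92.0  # million USD, from the description
cagr = 0.067      # 6.7% per year, from the description
size_2033 = project(size_2025, cagr, 2033 - 2025)
print(f"Projected 2033 market size: ${size_2033:.1f} million")
```

    Eight years of compounding at 6.7% gives a value in the mid-150s of millions of dollars, which is in line with the description's "approximately $150 million".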

  12. ISCX-URL-2016

    • ieee-dataport.org
    Updated Dec 22, 2023
    Cite
    Sajidha S A (2023). ISCX-URL-2016 [Dataset]. https://ieee-dataport.org/documents/iscx-url-2016
    Dataset updated
    Dec 22, 2023
    Authors
    Sajidha S A
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Web has long been a major platform for online criminal activities, and URLs are the main vehicle in this domain. To counter this issue, the security community has focused its efforts on developing techniques, mostly for blacklisting malicious URLs.

  13. Hyperparameter of tuned Random Forest classifier.

    • figshare.com
    xls
    Updated May 31, 2024
    Cite
    Suresh Sankaranarayanan; Arvinthan Thevar Sivachandran; Anis Salwa Mohd Khairuddin; Khairunnisa Hasikin; Abdul Rahman Wahab Sait (2024). Hyperparameter of tuned Random Forest classifier. [Dataset]. http://doi.org/10.1371/journal.pone.0302196.t003
    Available download formats: xls
    Dataset updated
    May 31, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Suresh Sankaranarayanan; Arvinthan Thevar Sivachandran; Anis Salwa Mohd Khairuddin; Khairunnisa Hasikin; Abdul Rahman Wahab Sait
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Web applications are important for various online businesses and operations because of their platform stability and low operation cost. The increasing usage of Internet-of-Things (IoT) devices within a network has contributed to the rise of network intrusion issues due to malicious Uniform Resource Locators (URLs). Generally, malicious URLs are initiated to promote scams, attacks, and frauds which can lead to high-risk intrusion. Several methods have been developed to detect malicious URLs in previous works. There has been a good amount of work done to detect malicious URLs using various methods such as random forest, regression, LightGBM, and more as reported in the literature. However, most of the previous works focused on the binary classification of malicious URLs and are tested on limited URL datasets. Nevertheless, the detection of malicious URLs remains a challenging task that remains open to research. Hence, this work proposed a stacking-based ensemble classifier to perform multi-class classification of malicious URLs on larger URL datasets to justify the robustness of the proposed method. This study focuses on obtaining lexical features directly from the URL to identify malicious websites. Then, the proposed stacking-based ensemble classifier is developed by integrating Random Forest, XGBoost, LightGBM, and CatBoost. In addition, hyperparameter tuning was performed using the Randomized Search method to optimize the proposed classifier. The proposed stacking-based ensemble classifier aims to take advantage of the performance of each machine learning model and aggregate the output to improve prediction accuracy. The classification accuracies of the machine learning model when applied individually are 93.6%, 95.2%, 95.7% and 94.8% for random forest, XGBoost, LightGBM, and CatBoost respectively. 
The proposed stacking-based ensemble classifier has shown significant results in classifying four classes of malicious URLs (phishing, malware, defacement, and benign) with an average accuracy of 96.8% when benchmarked with previous works.
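
    Lexical features are computed from the URL string alone, without fetching the page. The sketch below shows what such an extractor might look like; the specific features and the name `lexical_features` are assumptions chosen for illustration and are not the feature set used in the study.

```python
from typing import Dict
from urllib.parse import urlparse


def lexical_features(url: str) -> Dict[str, float]:
    """Compute a few string-level features without fetching the URL."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "dot_count": host.count("."),
        "digit_count": sum(ch.isdigit() for ch in url),
        "path_depth": len([part for part in parsed.path.split("/") if part]),
        "has_at_symbol": float("@" in url),
        "uses_https": float(parsed.scheme == "https"),
    }


features = lexical_features("http://login.example.com/account/verify?id=123")
```

    A vector like this would be the input to each base learner (Random Forest, XGBoost, LightGBM, CatBoost), whose outputs a stacking ensemble then combines.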

  14. Malware Analysis Datasets: Top-1000 PE Imports

    • ieee-dataport.org
    Updated Nov 8, 2019
    Cite
    Angelo Oliveira (2019). Malware Analysis Datasets: Top-1000 PE Imports [Dataset]. https://ieee-dataport.org/open-access/malware-analysis-datasets-top-1000-pe-imports
    Dataset updated
    Nov 8, 2019
    Authors
    Angelo Oliveira
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is part of my PhD research on malware detection and classification using Deep Learning. It contains static analysis data: Top-1000 imported functions extracted from the 'pe_imports' elements of Cuckoo Sandbox reports. PE malware examples were downloaded from virusshare.com. PE goodware examples were downloaded from portableapps.com and from Windows 7 x86 directories.
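
    One plausible way to build a "top-N imported functions" feature set is to count how many samples import each function and keep the most frequent names as a vocabulary. The sketch below is an assumption about the general approach, not the author's pipeline: the input is simplified to plain lists of function names, the sample data is hypothetical, and the binary encoding is one choice among several.

```python
from collections import Counter
from typing import Iterable, List


def top_imports(samples: Iterable[Iterable[str]], n: int) -> List[str]:
    """Return the n function names imported by the most samples."""
    counts: Counter = Counter()
    for imports in samples:
        counts.update(set(imports))  # count each function once per sample
    return [name for name, _ in counts.most_common(n)]


def to_vector(imports: Iterable[str], vocabulary: List[str]) -> List[int]:
    """Binary feature vector: 1 if the sample imports the function, else 0."""
    present = set(imports)
    return [int(name in present) for name in vocabulary]


# Hypothetical import lists; real ones would come from sandbox reports.
samples = [
    ["CreateFileA", "WriteFile", "ExitProcess"],
    ["CreateFileA", "ExitProcess"],
    ["ExitProcess", "RegOpenKeyA"],
]
vocab = top_imports(samples, n=2)
vector = to_vector(samples[2], vocab)
```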

  15. Malicious URLs with preprocessing and split

    • kaggle.com
    Updated Dec 8, 2023
    Cite
    Zihan ZHAO_qq (2023). Malicious URLs with preprocessing and split [Dataset]. https://www.kaggle.com/datasets/zihanzhaoqq/malicious-urls-with-preprocessing-and-split
    Dataset updated
    Dec 8, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Zihan ZHAO_qq
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Zihan ZHAO_qq

    Released under Apache 2.0


  16. Phishing URL Classifier Dataset

    • opendatabay.com
    Updated Jul 3, 2025
    Cite
    Datasimple (2025). Phishing URL Classifier Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/705b35a9-e638-462d-a5e1-d9f70ff4234a
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Website Analytics & User Experience
    Description

    This dataset is a curated collection of over 800,000 URLs, designed to represent a variety of online domains. Approximately 52% of these domains are identified as legitimate entities, while the remaining 47% are categorised as phishing domains, indicating potential online threats. The dataset consists of two key columns: "url" and "status". The "status" column uses binary encoding, where 0 signifies phishing domains and 1 indicates legitimate domains. This balanced distribution between phishing and legitimate instances helps ensure the dataset's robustness for analysis and model development.

    Columns

    • url: This field contains the Uniform Resource Locators (URLs) for each domain, including both legitimate and phishing entries.
    • status: This field denotes the classification of the URL. A value of 0 represents a phishing domain, indicating a potential risk, while a value of 1 signifies a legitimate domain, offering assurance.
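
    Because the encoding is easy to read backwards (0 is phishing, 1 is legitimate), it helps to decode `status` explicitly when loading. A minimal sketch using the standard `csv` module; the sample rows are hypothetical and only mirror the documented two-column layout.

```python
import csv
import io
from collections import Counter

STATUS_LABELS = {"0": "phishing", "1": "legitimate"}

# Hypothetical rows in the documented two-column layout.
sample_csv = io.StringIO(
    "url,status\n"
    "http://example.com/,1\n"
    "http://phish.example/login,0\n"
    "http://example.org/docs,1\n"
)

rows = [(row["url"], STATUS_LABELS[row["status"]])
        for row in csv.DictReader(sample_csv)]
label_counts = Counter(label for _, label in rows)
```

    Counting labels this way on the real file is also a quick check of the class balance reported in the description.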

    Distribution

    The dataset is provided in a CSV file format. It contains 808,042 unique entries. The distribution of statuses is approximately 394,982 entries flagged as phishing (0) and 427,028 entries flagged as legitimate (1). This offers an almost equal balance across the two categories.

    Usage

    This dataset is ideal for applications aimed at understanding, combating, and mitigating online threats. It can be used for developing models related to phishing detection, binary classification, and website analytics. It is also suitable for data cleaning exercises and projects involving Natural Language Processing (NLP) and Deep Learning.

    Coverage

    The data collection for this dataset is global in scope. While a specific time range for data collection is not provided, the dataset was listed on 05/06/2025.

    License

    CC0

    Who Can Use It

    This dataset is particularly valuable for researchers and practitioners working in the fields of AI and Machine Learning. Intended users include those looking to:

    • Develop and train models for identifying malicious URLs.
    • Analyse patterns distinguishing legitimate websites from phishing attempts.
    • Enhance cybersecurity measures and protect users from online threats.

    Dataset Name Suggestions

    • URL Phishing Detection
    • Legitimate and Malicious URLs
    • Online Threat URL Dataset
    • Phishing URL Classifier Data
    • Web Security URL Collection

    Attributes

    Original Data Source: Phishing and Legitimate URLS

  17. GoDaddy Annual Cybersecurity Report: 2024 Website Malware Threat Landscape

    • godaddy.com
    pdf
    Updated Apr 8, 2025
    GoDaddy (2025). GoDaddy Annual Cybersecurity Report: 2024 Website Malware Threat Landscape [Dataset]. https://www.godaddy.com/resources/news/godaddy-annual-cybersecurity-report
    Dataset authored and provided by
    GoDaddy (http://godaddy.com/)
    Description

    In 2024, GoDaddy InfoSec researchers monitored and analyzed website security threats using Sucuri SiteCheck's remote scanning technology, which processed over 70 million website scans across all hosting providers globally. This analysis provides insights into attack patterns and malware campaigns affecting websites worldwide.

    The GoDaddy InfoSec malware research team helps protect the broader web ecosystem through automated continuous threat monitoring and detailed analysis, benefiting both our customers and the wider internet community. Our researchers develop and maintain sophisticated detection signatures by analyzing new malware samples, tracking emerging campaigns, and reverse engineering attack methodologies. This proactive approach helps us to identify and block new threats before they can impact our customers.

    Through collaboration between our malware research and threat intelligence teams along with analysis of malware samples and attack patterns, our security researchers documented sophisticated traffic distribution systems, social engineering tactics, and new methods of malware delivery and persistence. Analysis of 1.1 million infected websites revealed that malware and malicious redirects dominated the threat landscape, accounting for 74.7% of detected infections. Our researchers saw an increasing number of threat actors using social engineering tactics like fake browser updates and captchas to lure website visitors into installing malware. Additionally, we saw major campaigns including Balada Injector (149,351 detections) and Sign1 (96,084 detections) leveraging traffic distribution systems to monetize compromised website traffic while employing sophisticated visitor profiling to avoid detection.

    The abuse of legitimate WordPress plugins and themes continued to be a significant trend, with campaigns storing malicious code in database options rather than files to evade traditional security controls. This technique was particularly evident in the DNS TXT Records campaign, which utilized WPCode to execute malicious PHP code while maintaining persistence through automated reactivation systems. Additionally, the increase in compromises through stolen administrative credentials highlighted the growing connection between endpoint security and website security.

    SEO spam techniques continued to evolve, affecting 422,741 websites globally through various methods. Japanese spam (117,393 detections) and gambling-related content (79,817 detections) represented the most prevalent spam categories, employing advanced cloaking techniques and geo-targeting capabilities to maintain effectiveness while avoiding detection.

  18. test-url-cat-malware.com - Historical whois Lookup

    • whoisdatacenter.com
    csv
    AllHeart Web Inc, test-url-cat-malware.com - Historical whois Lookup [Dataset]. https://whoisdatacenter.com/domain/test-url-cat-malware.com/
    Dataset provided by
    AllHeart Web
    Authors
    AllHeart Web Inc
    License

    https://whoisdatacenter.com/terms-of-use/

    Time period covered
    Mar 15, 1985 - Jun 13, 2025
    Description

    Explore the historical Whois records related to test-url-cat-malware.com (Domain). Get insights into ownership history and changes over time.

  19. Website Malware Scanner Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Jun 21, 2025
    Data Insights Market (2025). Website Malware Scanner Report [Dataset]. https://www.datainsightsmarket.com/reports/website-malware-scanner-1989693
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The website malware scanner market is experiencing robust growth, driven by the escalating frequency and sophistication of cyberattacks targeting websites. The increasing reliance on e-commerce and online services makes website security paramount, fueling demand for effective malware scanning solutions. While precise market sizing data is unavailable, a logical estimation based on the prevalence of website security concerns and the growth of related sectors like cybersecurity suggests a market size exceeding $2 billion in 2025, expanding at a Compound Annual Growth Rate (CAGR) of approximately 15% over the forecast period (2025-2033).

    This growth is propelled by several key factors including the rise of sophisticated malware, the increasing adoption of cloud-based security solutions, and growing regulatory pressures demanding robust website security measures. The market is segmented by deployment (cloud, on-premise), scanning type (static, dynamic), and organization size (SMEs, enterprises), with cloud-based solutions currently dominating due to their scalability and cost-effectiveness.

    The competitive landscape is fragmented, with a mix of established players like Invicti, Acunetix, and Qualys alongside numerous smaller, specialized providers. Differentiation is primarily achieved through features such as advanced detection capabilities, ease of use, integration with existing security infrastructures, and pricing models. The market is expected to witness increased consolidation through mergers and acquisitions, as larger players seek to expand their product portfolios and market share.

    Growth restraints include the potential for false positives, the complexity of integrating scanning tools into existing workflows, and the ongoing evolution of malware techniques, demanding continuous updates and improvements to scanner technology. Future trends point towards an increase in AI-powered malware detection, enhanced vulnerability management integration, and the adoption of blockchain technology for enhanced security and transparency.
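To make the growth claim concrete, the sketch below compounds the quoted figures: a base of $2 billion in 2025 at a 15% CAGR through 2033. Both inputs are the report's own rough estimates ("exceeding" and "approximately"), so the result is only an illustration of the arithmetic, not a forecast. The function name `project_market_size` is hypothetical.

```python
def project_market_size(base, cagr, years):
    """Compound a base value at a constant annual growth rate."""
    return base * (1.0 + cagr) ** years


BASE_2025_BILLION_USD = 2.0   # "exceeding $2 billion" per the description
CAGR = 0.15                   # "approximately 15%" per the description
YEARS = 2033 - 2025           # eight compounding periods

projected_2033 = project_market_size(BASE_2025_BILLION_USD, CAGR, YEARS)
print(f"Implied 2033 market size: about ${projected_2033:.2f} billion")
```

Under these assumptions the implied 2033 figure is roughly three times the 2025 base, a little over $6 billion.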

  20. APT family and sample size.

    • plos.figshare.com
    xls
    Updated Jun 27, 2024
    Jian Zhang; Shengquan Liu; Zhihua Liu (2024). APT family and sample size. [Dataset]. http://doi.org/10.1371/journal.pone.0304066.t004
    Dataset provided by
    PLOS ONE
    Authors
    Jian Zhang; Shengquan Liu; Zhihua Liu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In recent years, with the development of the Internet, the attribution classification of APT malware remains an important issue in society. Existing methods have yet to consider the DLL link library and hidden file address during the execution process, and there are shortcomings in capturing the local and global correlation of event behaviors. Compared to the structural features of binary code, opcode features reflect the runtime instructions and do not consider the issue of multiple reuse of local operation behaviors within the same APT organization. Obfuscation techniques more easily influence attribution classification based on single features. To address the above issues, (1) an event behavior graph based on API instructions and related operations is constructed to capture the execution traces on the host using the GNNs model. (2) ImageCNTM captures the local spatial correlation and continuous long-term dependency of opcode images. (3) The word frequency and behavior features are concatenated and fused, proposing a multi-feature, multi-input deep learning model. We collected a publicly available dataset of APT malware to evaluate our method. The attribution classification results of the model based on a single feature reached 89.24% and 91.91%. Finally, compared to single-feature classifiers, the multi-feature fusion model achieves better classification performance.
