19 datasets found
  1. Businesses worldwide affected by ransomware 2018-2025

    • statista.com
    Updated Aug 26, 2025
    Cite
    Statista (2025). Businesses worldwide affected by ransomware 2018-2025 [Dataset]. https://www.statista.com/statistics/204457/businesses-ransomware-attack-rate/
    Dataset updated
    Aug 26, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    Worldwide
    Description

    As of 2025, nearly 63 percent of businesses worldwide were affected by ransomware attacks. This figure represents a decrease on the previous year and was by far the lowest figure reported since 2020. Overall, since 2018, more than half of the total survey respondents each year stated that their organizations had been victimized by ransomware.

    Most targeted industries

    In 2024, the critical manufacturing industry in the United States was once again most targeted by ransomware attacks. Overall, organizations in this industry experienced 258 cyberattacks in the measured year. Healthcare and the public health sector ranked second, followed by government facilities, with 238 and 220 cyberattacks, respectively.

    Ransomware in the manufacturing industry

    The manufacturing industry, along with its subindustries, is constantly targeted by ransomware attacks, causing data loss, business disruptions, and reputational damage. Often, such cyberattacks are international and have a political intent. In 2024, exploited vulnerabilities were the leading cause of ransomware attacks in the manufacturing industry.

  2. RDE-Dataset.zip Ransomware Defense Empowered: Deep Learning for Real-Time...

    • figshare.com
    zip
    Updated Mar 24, 2024
    Cite
    Hassan Jalil Hadi (2024). RDE-Dataset.zip Ransomware Defense Empowered: Deep Learning for Real-Time Family Identification with a Proprietary Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.25467826.v1
    Available download formats
    zip
    Dataset updated
    Mar 24, 2024
    Dataset provided by
    figshare
    Authors
    Hassan Jalil Hadi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Ransomware, leveraging sophisticated encryption techniques, poses a significant threat by encrypting crucial data, thereby rendering it inaccessible. The proliferation of diverse ransomware variants has caused considerable harm to governments, corporations, and individual users alike. Despite the increasing prevalence of cyber threats, existing solutions often struggle with real-time detection and early identification of ransomware families. To address this challenge, we introduce FCG-RFD, a novel benchmark dataset featuring extensive Function Call Graphs (FCG) tailored for ransomware family detection. Given the constantly evolving nature of malware, antivirus scanners face ongoing challenges, necessitating access to recent and updated datasets. Our dataset comprises 8,095 samples sourced from reputable repositories including VirusSamples, Virusshare, VirusSign, the Zoo, and MalwareBazaar. Additionally, we include 8,020 normal files obtained from trusted sources such as the Microsoft Store and Softonic. Through FCG-RFD, we aim to facilitate more robust and timely detection of ransomware families, ultimately enhancing cybersecurity measures against this pervasive threat.

  3. Malware Repositories and Their Authors on GitHub

    • data.niaid.nih.gov
    • zenodo.org
    Updated Mar 11, 2024
    Cite
    Tania, Nishat Ara (2024). Malware Repositories and Their Authors on GitHub [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10806592
    Dataset updated
    Mar 11, 2024
    Dataset provided by
    Zhang, Qian
    Masud, Md Rayhanul
    Faloutsos, Michalis
    Rokon, Md Omar Faruk
    Tania, Nishat Ara
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is rooted in a study aimed at unveiling the origins and motivations behind the creation of malware repositories on GitHub. Our research embarks on an innovative journey to dissect the profiles and intentions of GitHub users who have been involved in this dubious activity.

    Employing a robust methodology, we meticulously identified 14,000 GitHub users linked to malware repositories. By leveraging advanced large language model (LLM) analytics, we classified these individuals into distinct categories based on their perceived intent: 3,339 were deemed Malicious, 3,354 Likely Malicious, and 7,574 Benign, offering a nuanced perspective on the community behind these repositories.

    Our analysis penetrates the veil of anonymity and obscurity often associated with these GitHub profiles, revealing stark contrasts in their characteristics. Malicious authors were found to typically possess sparse profiles focused on nefarious activities, while Benign authors presented well-rounded profiles, actively contributing to cybersecurity education and research. Those labeled as Likely Malicious exhibited a spectrum of engagement levels, underlining the complexity and diversity within this digital ecosystem.

    We are offering two datasets in this paper. First, a list of malware repositories: we collected and extended the malware repositories on GitHub in 2022, following the original papers. Second, a CSV file with the GitHub users' information and their maliciousness classification labels.

    malware_repos.txt

    Purpose: This file contains a curated list of GitHub repositories identified as containing malware. These repositories were identified following the methodology outlined in the research paper "SourceFinder: Finding Malware Source-Code from Publicly Available Repositories in GitHub."

    Contents: The file is structured as a simple text file, with each line representing a unique repository in the format username/reponame. This format allows for easy identification and access to each repository on GitHub for further analysis or review.

    Usage: The list serves as a critical resource for researchers and cybersecurity professionals interested in studying malware, understanding its distribution on platforms like GitHub, or developing defense mechanisms against such malicious content.
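As a rough illustration of the one-repository-per-line `username/reponame` format described above, the following minimal sketch parses such lines and builds the matching GitHub URLs. The function names and the sample entries are hypothetical and are not part of the dataset; the sketch only assumes the format stated in the description.

```python
from typing import Iterable, List, Tuple


def parse_repo_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse 'username/reponame' lines into (username, reponame) pairs."""
    repos = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        user, sep, repo = line.partition("/")
        if not sep or not user or not repo:
            continue  # skip lines that do not look like username/reponame
        repos.append((user, repo))
    return repos


def repo_url(user: str, repo: str) -> str:
    """Build the public GitHub URL for a repository."""
    return f"https://github.com/{user}/{repo}"


# Hypothetical sample lines standing in for the contents of malware_repos.txt.
sample = ["alice/demo-repo\n", "\n", "bob/another_repo\n", "malformed-line\n"]
parsed = parse_repo_lines(sample)
urls = [repo_url(user, repo) for user, repo in parsed]
```

To use it on the real file, pass an open file handle, for example `parse_repo_lines(open("malware_repos.txt", encoding="utf-8"))`.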

    obfuscated_github_user_dataset.csv

    Purpose: Accompanying the list of malware repositories, this CSV file contains detailed, albeit obfuscated, profile information of the GitHub users who authored these repositories. The obfuscation process has been applied to protect user privacy and comply with ethical standards, especially given the sensitive nature of associating individuals with potentially malicious activities.

    Contents: The dataset includes several columns representing different aspects of user profiles, such as obfuscated identifiers (e.g., ID, login, name), contact information (e.g., email, blog), and GitHub-specific metrics (e.g., followers count, number of public repositories). Notably, sensitive information has been masked or replaced with generic placeholders to prevent user identification.

    Usage: This dataset can be instrumental for researchers analyzing behaviors, patterns, or characteristics of users involved in creating malware repositories on GitHub. It provides a basis for statistical analysis, trend identification, or the development of predictive models, all while upholding the necessary ethical considerations.

  4. Malware API Call Dataset

    • ieee-dataport.org
    Updated May 18, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ferhat Ozgur Catak (2022). Malware API Call Dataset [Dataset]. https://ieee-dataport.org/open-access/malware-api-call-dataset
    Dataset updated
    May 18, 2022
    Authors
    Ferhat Ozgur Catak
    Description

    This study seeks to obtain data which will help to address machine learning based malware research gaps. The specific objective of this study is to build a benchmark dataset for Windows operating system API calls of various malware. This is the first study to undertake metamorphic malware to build sequential API calls. It is hoped that this research will contribute to a deeper understanding of how metamorphic malware change their behavior (i.e. API calls) by adding meaningless opcodes with their own disassembler/assembler parts.

  5. Malware Dataset on Android Applications

    • zenodo.org
    Updated May 10, 2025
    Cite
    Deris Stiawan (2025). Malware Dataset on Android Applications [Dataset]. http://doi.org/10.5281/zenodo.15377874
    Dataset updated
    May 10, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Deris Stiawan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jun 29, 2023
    Description

    Android has become the most popular operating system on mobile devices, making it a prime target for threat actors in creating malware. The research conducted by the author aims to detect reverse TCP exploits in network traffic. The tools used are Metasploit for Android, Termux, PCAPdroid, Wireshark, OpenVPN, and Apktool in both terminal and application versions. The supporting devices for this research are hardware devices, namely a smartphone, VPS, Mikrotik Router, and laptop.

  6. MH-1M Dataset

    • figshare.com
    zip
    Updated Feb 21, 2025
    Cite
    Hendrio Bragança; Vanderson Rocha; Joner Assolin; Diego Kreutz; Eduardo Feitosa (2025). MH-1M Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.28355897.v2
    Available download formats
    zip
    Dataset updated
    Feb 21, 2025
    Dataset provided by
    figshare
    Authors
    Hendrio Bragança; Vanderson Rocha; Joner Assolin; Diego Kreutz; Eduardo Feitosa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The rapid and widespread increase of Android malware presents substantial obstacles to cybersecurity research. In order to revolutionize the field of malware research, we present the MH-1M dataset, which is a thorough compilation of 1,340,515 APK samples. This dataset encompasses a wide range of diverse attributes and metadata, offering a comprehensive perspective. The utilization of the VirusTotal API guarantees precise assessment of threats by amalgamating various detection techniques. Our research indicates that MH-1M is a highly current dataset that provides valuable insights into the changing nature of malware.

    MH-1M consists of 23,247 features that cover a wide range of application behavior, from intents::accept to apicalls::landroid/window/splashscreenview.remove. The features are categorized into four primary classifications:

    Feature Types | Values
    APICalls | 22,394
    Intents | 407
    OPCodes | 232
    Permissions | 214

    The dataset is stored efficiently, utilizing a memory capacity of 29.0 GB, which showcases its substantial yet controllable magnitude. The dataset consists of 1,221,421 benign applications and 119,094 malware applications, ensuring a balanced representation for accurate malware detection and analysis.

    The MH-1M repository also offers a wide variety of metadata from APKs, providing useful data into the development of malicious software over a period of more than ten years. The Android features include a wide variety of metadata, which includes SHA256 hashes, file names, package names, compilation APIs, and various other details. This GitHub repository contains over 400GB of valuable data, making it the largest and most comprehensive dataset available for advancing research and development in Android malware detection.
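The counts quoted in this description can be cross-checked with a few lines of arithmetic. The sketch below sums the feature categories, derives the malware share of the samples, and computes inverse-frequency class weights. The weighting formula is a common convention chosen here for illustration and is an assumption, not something the dataset authors prescribe.

```python
# Sample counts as stated in the description.
benign = 1_221_421
malware = 119_094
total = benign + malware  # should match the stated 1,340,515 APK samples

# Feature counts per category as stated in the description.
feature_counts = {
    "APICalls": 22_394,
    "Intents": 407,
    "OPCodes": 232,
    "Permissions": 214,
}
total_features = sum(feature_counts.values())  # should match the stated 23,247

# Fraction of samples that are malware.
malware_share = malware / total

# Inverse-frequency weights: n_samples / (n_classes * n_samples_in_class).
# This formula is an illustrative assumption, not taken from the dataset.
class_weights = {
    "benign": total / (2 * benign),
    "malware": total / (2 * malware),
}
```

With these figures the malware class makes up a little under 9 percent of the samples, so a training pipeline may want to account for the difference in class sizes, for example through weights like the ones above.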

  7. Dataset of Publication "Malware Communication in Smart Factories: A Network...

    • b2find.eudat.eu
    • researchdata.tuwien.ac.at
    Updated Aug 18, 2025
    Cite
    (2025). Dataset of Publication "Malware Communication in Smart Factories: A Network Traffic Data Set" [Dataset]. https://b2find.eudat.eu/dataset/5a44cf28-2ebc-5d4b-b163-238f939b5625
    Dataset updated
    Aug 18, 2025
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Machine learning-based intrusion detection requires suitable and realistic data sets for training and testing. However, data sets that originate from real networks are rare. Network data is considered privacy sensitive and the purposeful introduction of malicious traffic is usually not possible. In this paper we introduce a labeled data set captured at a smart factory located in Vienna, Austria during normal operation and during penetration tests with different attack types. The data set contains 173 GB of PCAP files, which represent 16 days (395 hours) of factory operation. It includes MQTT, OPC UA, and Modbus/TCP traffic. The captured malicious traffic originated from a professional penetration tester who performed two types of attacks: (a) aggressive attacks that are easier to detect and (b) stealthy attacks that are harder to detect. Our data set includes the raw PCAP files and extracted flow data. Labels for packets and flows indicate whether packets (or flows) originated from a specific attack or from benign communication. We describe the methodology for creating the data set, conduct an analysis of the data and provide detailed information about the recorded traffic itself. The data set is freely available to support reproducible research and the comparability of results in the area of intrusion detection in industrial networks.

    File description:

    • a_day1, a_day2, s_day1, s_day2, tf_a and tf_s: Main data set, where files starting with "tf" are training files containing only benign, operational data and all other files are attack files containing both operational data and attack data.
    • images.zip: Contains descriptive images about the data.
    • extractions.zip: Contains extracted packets and flows in both labeled and unlabeled form.
    • a_day_tuesday_dos.zip: Additional day of attack traffic containing benign and attack data, including a DoS attack. This day is not labeled.
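The naming rule for the main data set files (names starting with "tf" are benign-only training files, everything else is an attack file) can be written down as a small helper. This is only a sketch of that rule; the function name `file_role` is hypothetical and the list of names is copied from the file description.

```python
def file_role(name: str) -> str:
    """Return 'training' for benign-only tf* files, otherwise 'attack'."""
    return "training" if name.startswith("tf") else "attack"


# Main data set file names as listed in the file description.
main_files = ["a_day1", "a_day2", "s_day1", "s_day2", "tf_a", "tf_s"]
roles = {name: file_role(name) for name in main_files}
```

A split like this could be used to keep the benign-only files for fitting an anomaly detector and to reserve the attack files for evaluation.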

  8. Malware Detection in Network Traffic Data

    • kaggle.com
    Updated Dec 26, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Agung Pambudi (2023). Malware Detection in Network Traffic Data [Dataset]. http://doi.org/10.34740/kaggle/dsv/7285844
    Dataset updated
    Dec 26, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Agung Pambudi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    To cite the dataset please reference it as “Stratosphere Laboratory. A labeled dataset with malicious and benign IoT network traffic. January 22th. Agustin Parmisano, Sebastian Garcia, Maria Jose Erquiaga. https://www.stratosphereips.org/datasets-iot23”

    This dataset includes labels that explain the linkages between flows connected with harmful or possibly malicious activity to provide network malware researchers and analysts with more thorough information. These labels were painstakingly created at the Stratosphere labs using malware capture analysis.

    We present a concise explanation of the labels used for the identification of malicious flows, based on manual network analysis, below:

    Attack: This label signifies the occurrence of an attack originating from an infected device directed towards another host. Any flow that endeavors to exploit a vulnerable service, discerned through payload and behavioral analysis, falls under this classification. Examples include brute force attempts on telnet logins or header-based command injections in GET requests.

    Benign: The "Benign" label denotes connections where no suspicious or malicious activities have been detected.

    C&C (Command and Control): This label indicates that the infected device has established a connection with a Command and Control server. This observation is rooted in the periodic nature of connections or activities such as binary downloads or the exchange of IRC-like or decoded commands.

    DDoS (Distributed Denial of Service): "DDoS" is assigned when the infected device is actively involved in a Distributed Denial of Service attack, identifiable by the volume of flows directed towards a single IP address.

    FileDownload: This label signifies that a file is being downloaded to the infected device. It is determined by examining connections with response bytes exceeding a specified threshold (typically 3KB or 5KB), often in conjunction with known suspicious destination ports or IPs associated with Command and Control servers.

    HeartBeat: "HeartBeat" designates connections where packets serve the purpose of tracking the infected host by the Command and Control server. Such connections are identified through response bytes below a certain threshold (typically 1B) and exhibit periodic similarities. This is often associated with known suspicious destination ports or IPs linked to Command and Control servers.

    Mirai: This label is applied when connections exhibit characteristics resembling those of the Mirai botnet, based on patterns consistent with common Mirai attack profiles.

    Okiru: Similar to "Mirai," the "Okiru" label is assigned to connections displaying characteristics of the Okiru botnet. The parameters for this label are the same as for Mirai, but Okiru is a less prevalent botnet family.

    PartOfAHorizontalPortScan: This label is employed when connections are involved in a horizontal port scan aimed at gathering information for potential subsequent attacks. The labeling decision hinges on patterns such as shared ports, similar transmitted byte counts, and multiple distinct destination IPs among the connections.

    Torii: The "Torii" label is used when connections exhibit traits indicative of the Torii botnet, with labeling criteria similar to those used for Mirai, albeit in the context of a less common botnet family.
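The byte-count thresholds mentioned for the FileDownload and HeartBeat labels can be sketched as a simple rule. This is a loose illustration under stated assumptions, not the procedure used at the Stratosphere lab: the `known_cc_dest` flag is a hypothetical stand-in for "known suspicious destination ports or IPs", the 3KB value is one of the two thresholds the text mentions, and the periodicity check described for HeartBeat is left out entirely.

```python
from dataclasses import dataclass
from typing import Optional

# Thresholds taken from the label descriptions; the exact values used for
# labeling are not given, so these are illustrative choices.
FILE_DOWNLOAD_MIN_BYTES = 3 * 1024  # the text cites "typically 3KB or 5KB"
HEARTBEAT_MAX_BYTES = 1             # the text cites "typically 1B"


@dataclass
class Flow:
    resp_bytes: int      # bytes sent from the destination to the source
    known_cc_dest: bool  # hypothetical flag: destination is on a C&C watchlist


def candidate_label(flow: Flow) -> Optional[str]:
    """Suggest FileDownload or HeartBeat from response size alone."""
    if not flow.known_cc_dest:
        return None  # both rules lean on a suspicious destination
    if flow.resp_bytes > FILE_DOWNLOAD_MIN_BYTES:
        return "FileDownload"
    if flow.resp_bytes <= HEARTBEAT_MAX_BYTES:
        return "HeartBeat"
    return None  # size alone does not suggest either label
```

For example, `candidate_label(Flow(resp_bytes=5000, known_cc_dest=True))` yields `"FileDownload"`, while a flow to a destination that is not flagged yields `None` regardless of size.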

    Field Name | Description | Type
    ts | The timestamp of the connection event. | time
    uid | A unique identifier for the connection. | string
    id.orig_h | The source IP address. | addr
    id.orig_p | The source port. | port
    id.resp_h | The destination IP address. | addr
    id.resp_p | The destination port. | port
    proto | The network protocol used (e.g., 'tcp'). | enum
    service | The service associated with the connection. | string
    duration | The duration of the connection. | interval
    orig_bytes | The number of bytes sent from the source to the destination. | count
    resp_bytes | The number of bytes sent from the destination to the source. | count
    conn_state | The state of the connection. | string
    local_orig | Indicates whether the connection is considered local or not. | bool
    local_resp | Indicates whether the connection is considered...
  9. Dataset of Publication "Malware Communication in Smart Factories: A Network...

    • b2find.eudat.eu
    Updated Apr 12, 2025
    + more versions
    Cite
    (2025). Dataset of Publication "Malware Communication in Smart Factories: A Network Traffic Data Set" [Dataset]. https://b2find.eudat.eu/dataset/a4f43cd9-25b1-5df3-a529-e430ae2fe323
    Dataset updated
    Apr 12, 2025
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Brenner, B., Fabini, J., Offermanns, M., Semper, S., & Zseby, T. (2024). Malware communication in smart factories: A network traffic data set. Computer Networks, 255, 110804.

    or in BibTeX:

    @article{brenner2024malware,
    title={Malware communication in smart factories: A network traffic data set},
    author={Brenner, Bernhard and Fabini, Joachim and Offermanns, Magnus and Semper, Sabrina and Zseby, Tanja},
    journal={Computer Networks},
    volume={255},
    pages={110804},
    year={2024},
    publisher={Elsevier}
    }

    Context and methodology

    Machine learning-based intrusion detection requires suitable and realistic data sets for training and testing. However, data sets that originate from real networks are rare. Network data is considered privacy-sensitive, and the purposeful introduction of malicious traffic is usually not possible. In this paper, we introduce a labeled data set captured at a smart factory located in Vienna, Austria, during normal operation and during penetration tests with different attack types. The data set contains 173 GB of PCAP files, representing 16 days (395 hours) of factory operation. It includes MQTT, OPC UA, and Modbus/TCP traffic. The captured malicious traffic originated from a professional penetration tester who performed two types of attacks:

    (a) Aggressive attacks that are easier to detect.
    (b) Stealthy attacks that are harder to detect.

    Our data set includes the raw PCAP files and extracted flow data. Labels for packets and flows indicate whether they originated from a specific attack or from benign communication. We describe the methodology for creating the dataset, conduct an analysis of the data, and provide detailed information about the recorded traffic itself. The dataset is freely available to support reproducible research and the comparability of results in the area of intrusion detection in industrial networks.

    Technical details

    readme.txt: Information about the data collection, format, necessary software and versions to access it.

  10. WinMET Dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin, json
    Updated Sep 1, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Razvan Raducu; Alain Villagrasa-Labrador; Ricardo J. Rodríguez; Pedro Álvarez (2025). WinMET Dataset [Dataset]. http://doi.org/10.5281/zenodo.16414116
    Available download formats
    bin, json
    Dataset updated
    Sep 1, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Razvan Raducu; Alain Villagrasa-Labrador; Ricardo J. Rodríguez; Pedro Álvarez
    License

    https://www.gnu.org/licenses/gpl-3.0-standalone.html

    Description

    WinMET (Windows Malware Execution Traces) Dataset

    The WinMET dataset contains the execution traces generated with CAPE sandbox after analyzing several malware samples. The execution traces are valid JSON files that contain the spawned processes, the sequence of WinAPI and system calls invoked by each process, their parameters, their return values, and accessed OS resources, among many others.
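To give a feel for how such a trace might be consumed, here is a minimal sketch that counts API calls per name in one report. The key names `behavior`, `processes`, `calls`, and `api` follow the layout commonly seen in CAPE-style reports and are an assumption here; they should be checked against an actual WinMET trace before relying on them. The sample report is synthetic.

```python
import json
from collections import Counter
from typing import Any, Dict


def api_call_counts(report: Dict[str, Any]) -> Counter:
    """Count API calls by name across all processes in one report.

    Assumes report["behavior"]["processes"][i]["calls"][j]["api"];
    this layout is an assumption, not confirmed by the dataset page.
    """
    counts: Counter = Counter()
    for proc in report.get("behavior", {}).get("processes", []):
        for call in proc.get("calls", []):
            api = call.get("api")
            if api:
                counts[api] += 1
    return counts


# Tiny synthetic report used only to exercise the function.
sample_report = json.loads(
    '{"behavior": {"processes": [{"process_name": "a.exe", "calls": ['
    '{"api": "NtCreateFile"}, {"api": "NtWriteFile"}, {"api": "NtCreateFile"}]}]}}'
)
counts = api_call_counts(sample_report)
```

For a real trace, the dictionary would come from `json.load` on one of the extracted `.json` files instead of the inline sample.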

    Please use this DOI reference that always points to the latest WinMET version: https://doi.org/10.5281/zenodo.12647555

    How to use the dataset

    All 7z files are password protected. The password is: infected.

    • Total execution traces: 31844 (.json files), split into 5 volumes.
    • Compressed size per volume: ~2.5 GB.
    • Uncompressed size per volume: ~154 GB.
    • Total compressed size: ~13 GB.
    • Total uncompressed size: ~750 GB.

    WinMET Volumes:

    • WinMET_volume_1.7z - MD5: bf51181eafc8452090bb6ce9f47b6714
      • Total files: 6369
      • First file: 0000025c5ee1d6707e6dddfe2816f92d9d8d8bb7c84371c44529e8083109b0e5.json
      • Last file: 31c2e51efcbff0aa489aa6af1a48cf78f6a9febfb449a19d029f8cc8ebb4495f.json
    • WinMET_volume_2.7z - MD5: aee86b4591a46c69b0d027de80ff1011
      • Total files: 6369
      • First file: 31c4300fdba21e03ce5ad8ef340832493bcbf702a2ee897cf3a85fdd38dbf10c.json
      • Last file: 65a8b01babb2fcf3ed26a2236a606d7bc7d1f087749a455554b8ef7eddba56fc.json
    • WinMET_volume_3.7z - MD5: 996723774909bc2e6745382697317460
      • Total files: 6369
      • First file: 65a92f49f687b2f421397bbd3a6426b0b4914b896659c2d07a287e112a25939d.json
      • Last file: 996d4e0a67dcad433fa2049dca1defdd984d776fbb5bc5990c0114932be25066.json
    • WinMET_volume_4.7z - MD5: 4f5acbabeb9d24c96dadef71f56bd916
      • Total files: 6369
      • First file: 996fdb5a25f89426e241f02094474706fafd567fcc5980a07ac7a38efa8625ea.json
      • Last file: cdb5eed6579773d8fbdb13deb766664ba1c8cc01794790855e61e1564daf62f5.json
    • WinMET_volume_5.7z - MD5: b3d15c97990dd0dfb0d94e369f486025
      • Total files: 6368
      • First file: cdb7a65d6efc528d6084879e2a24cafb6869c84c45f076208ac437b3bdbdae94.json
      • Last file: ffff75b38c340f90d5fd3fbda5257f11caea5c8160daf26c9a29c04bb333a1c2.json
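The per-volume MD5 values and file counts listed above lend themselves to a quick integrity check after download. The following is a minimal sketch using only the standard library; the helper names `md5_of` and `verify_volume` are hypothetical, and the digests and counts are copied from the list.

```python
import hashlib
from pathlib import Path

# MD5 digests as listed for each volume.
EXPECTED_MD5 = {
    "WinMET_volume_1.7z": "bf51181eafc8452090bb6ce9f47b6714",
    "WinMET_volume_2.7z": "aee86b4591a46c69b0d027de80ff1011",
    "WinMET_volume_3.7z": "996723774909bc2e6745382697317460",
    "WinMET_volume_4.7z": "4f5acbabeb9d24c96dadef71f56bd916",
    "WinMET_volume_5.7z": "b3d15c97990dd0dfb0d94e369f486025",
}

# File counts as listed for volumes 1 to 5.
FILES_PER_VOLUME = [6369, 6369, 6369, 6369, 6368]
total_files = sum(FILES_PER_VOLUME)  # should match the stated 31844 traces


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_volume(path: Path) -> bool:
    """Return True if the file name is a known volume and its MD5 matches."""
    expected = EXPECTED_MD5.get(path.name)
    return expected is not None and md5_of(path) == expected
```

Calling `verify_volume(Path("WinMET_volume_1.7z"))` on a downloaded archive returns `True` only when the digest matches the listed value. Reading in chunks keeps memory use low for multi-gigabyte archives.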

    Additional files:

    • cape_report_to_label_mapping.json contains the mappings of each report with its corresponding label as assigned by the CAPE sandbox labeling algorithm, sorted in descending order (given the number of reports belonging to each label/family).
    • avclass_report_to_label_mapping.json contains the mappings of each report with its corresponding label as assigned by AVClass, sorted in descending order (given the number of reports belonging to each label/family).
    • reports_consensus_label.json contains both labels (CAPE and AVClass) for each execution trace.

    Citation

    If you use this dataset, cite it as follows:

    Raducu, R., Villagrasa-Labrador, A., Rodríguez, R. J., & Álvarez, P. (2025). WinMET Dataset [Data set]. Zenodo. https://doi.org/10.5281/zenodo.12647555

    BibTex:

    @misc{WinMET_dataset,
    author = {Raducu, Razvan and Villagrasa‑Labrador, Alain and Rodríguez, Ricardo J. and Álvarez, Pedro},
    title = {{WinMET Dataset: Windows Malware Execution Traces}},
    howpublished = {Zenodo, dataset},
    year = {2025},
    doi = {10.5281/zenodo.12647555},
    url = {https://doi.org/10.5281/zenodo.12647555}
    }

    This dataset was generated using the MALVADA framework (https://github.com/reverseame/MALVADA/tree/main), which you can read more about in our publication https://doi.org/10.1016/j.softx.2025.102082. The article also provides insights about the contents of this dataset.

    Razvan Raducu, Alain Villagrasa-Labrador, Ricardo J. Rodríguez, Pedro Álvarez, MALVADA: A framework for generating datasets of malware execution traces, SoftwareX, Volume 30, 2025, 102082, ISSN 2352-7110, https://doi.org/10.1016/j.softx.2025.102082. (https://www.sciencedirect.com/science/article/pii/S2352711025000494)

    Statistics

    The following statistics (and many more) can be obtained by analyzing the WinMET dataset with the MALVADA framework (https://github.com/reverseame/MALVADA).

    • Top 20 CAPE consensus labels (there are many more)*:
      • Dacic (3127 execution traces)
      • Padodor (1504 execution traces)
      • Redline (1455 execution traces)
      • Crifi (1101 execution traces)
      • Cosmu (836 execution traces)
      • Agenttesla (806 execution traces)
      • Amadey (601 execution traces)
      • Loki (551 execution traces)
      • Berbew (532 execution traces)
      • Qukart (496 execution traces)
      • Tedy (410 execution traces)
      • Mint (389 execution traces)
      • Metastealer (376 execution traces)
      • Smokeloader (349 execution traces)
      • Taskun (335 execution traces)
      • Virlock (313 execution traces)
      • Formbook (301 execution traces)
      • Strab (273 execution traces)
      • Agensla (235 execution traces)
      • Autorun (229 execution traces)
    • Top 20 AVClass consensus labels (there are many more)*:
      • Redline (4438 execution traces)
      • Vbclone (4023 execution traces)
      • Berbew (2794 execution traces)
      • Agenttesla (1201 execution traces)
      • Cosmu (899 execution traces)
      • Taskun (856 execution traces)
      • Disabler (799 execution traces)
      • Amadey (763 execution traces)
      • Gamarue (546 execution traces)
      • Noon (530 execution traces)
      • Strab (468 execution traces)
      • Snojan (433 execution traces)
      • Stop (399 execution traces)
      • Snakelogger (365 execution traces)
      • Virlock (326 execution traces)
      • Qbot (315 execution traces)
      • Equationdrug (270 execution traces)
      • Mokes (262 execution traces)
      • Blihan (261 execution traces)
      • Dofoil (254 execution traces)
    • There are 7256 execution traces with no CAPE label.
    • There are 1846 execution traces with no AVClass label.
    • There are 1241 execution traces with no label.

    * The execution traces with no label are assigned the "(n/a)" family. We omitted it here.

    Changelog

    • 2025.07.25:
      • Dataset now contains ~32K execution traces.
      • Split new dataset into 5 volumes.
      • Updated TOP20 consensus labels.
      • Added reports_consensus_label.json.
      • Fixed Reline <-> Redline AVClass mappings https://github.com/malicialab/avclass/pull/48.
    • Version 2.0: Added cape and avclass label mappings.
  11. DataSet for ICSE SEIP 25: Detecting Python Malware in the Software Supply...

    • zenodo.org
    bin
    Updated Dec 31, 2024
    Cite
    Ridwan Shariffdeen (2024). DataSet for ICSE SEIP 25: Detecting Python Malware in the Software Supply Chain with Program Analysis [Dataset]. http://doi.org/10.5281/zenodo.14580885
    Available download formats
    bin
    Dataset updated
    Dec 31, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ridwan Shariffdeen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    * MalOSS: subset of malicious packages from MalOSS dataset [RQ1, RQ2, RQ4]
    * BackStabber: subset of malicious packages from BackStabber Knife's Collection [RQ1, RQ2, RQ4]
    * MalRegistry: subset of malicious packages from Python MalRegistry dataset [RQ1, RQ2, RQ4]
    * Popular: a collection of top-100 most popular python packages from PyPI [RQ1, RQ2, RQ3, RQ4]
    * Trusted: a collection of packages from trusted organizations hosted in PyPI [RQ1, RQ2, RQ3, RQ4]
    * DataKund: a collection of newly identified malicious packages from PyPI [Case Study]
    * Recent: a collection of packages that were recently (2024 Oct) added to PyPI [Macaron Case Study]
  12. Database Security Audits Services Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Apr 25, 2025
    Cite
    Data Insights Market (2025). Database Security Audits Services Report [Dataset]. https://www.datainsightsmarket.com/reports/database-security-audits-services-1419617
    Explore at:
    Available download formats: pdf, ppt, doc
    Dataset updated
    Apr 25, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Database Security Audits Services market is experiencing robust growth, driven by the increasing reliance on databases across various industries and the escalating threat landscape. The market's expansion is fueled by several key factors. Firstly, stringent data privacy regulations like GDPR and CCPA are compelling organizations to prioritize database security and conduct regular audits to ensure compliance. Secondly, the rising frequency and sophistication of cyberattacks targeting databases, including ransomware and data breaches, are prompting proactive security measures, including comprehensive audits. Thirdly, the shift towards cloud-based databases introduces new security challenges and necessitates specialized audit services to address vulnerabilities inherent in cloud environments. The market is segmented by application (Financial, Medical, Telecom, Government, Manufacturing, Others) and type (Cloud-based, On-premise), with cloud-based services witnessing faster adoption due to the expanding cloud computing market. North America and Europe currently hold significant market share, but regions like Asia-Pacific are exhibiting rapid growth potential owing to increasing digitalization and adoption of advanced technologies. Major players are investing in innovative solutions and expanding their service portfolios to cater to diverse client needs, fostering competition and driving market evolution. While the market faces restraints like high implementation costs and a shortage of skilled professionals, the overall growth trajectory remains positive, propelled by the escalating demand for robust database security and compliance. The forecast period (2025-2033) anticipates continued expansion, potentially exceeding a compound annual growth rate (CAGR) of 15%. This optimistic projection is based on several factors. 
    First, the ongoing digital transformation across industries will lead to increased reliance on databases and subsequently, heightened demand for security audits. Second, the continuous evolution of cyber threats will necessitate more frequent and comprehensive audits, further boosting market growth. Third, the market will benefit from technological advancements in database security tools and methodologies, enabling more efficient and effective audits. However, challenges remain, particularly in addressing the skill gap and ensuring the affordability of these services for smaller organizations. Nevertheless, the long-term outlook for the Database Security Audits Services market remains strongly positive, with significant opportunities for market expansion and innovation.
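    To make the 15% CAGR figure concrete, the sketch below compounds a normalized base value over the forecast period. It assumes 2025 is the base year and 2033 the final year (eight compounding periods); the function name and the base value of 1.0 are illustrative and do not come from the report.

```python
def project_market_size(base_value: float, cagr: float, years: int) -> float:
    """Compound `base_value` forward by `cagr` for `years` annual periods."""
    return base_value * (1.0 + cagr) ** years


# Assumption: 2025 is the base year and 2033 the last forecast year.
periods = 2033 - 2025
growth_multiple = project_market_size(1.0, 0.15, periods)
print(f"Growth multiple at 15% CAGR over {periods} years: {growth_multiple:.2f}x")
```

    Under these assumptions, a market growing at exactly 15% per year would roughly triple over the forecast period.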

  13. cosoco-image-dataset

    • huggingface.co
    Updated May 28, 2025
    K3Y Ltd (2025). cosoco-image-dataset [Dataset]. http://doi.org/10.57967/hf/5853
    Explore at:
    Dataset updated
    May 28, 2025
    Dataset authored and provided by
    K3Y Ltd
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    COSOCO: Compromised Software Containers Image Dataset

    Paper: Malware Detection in Docker Containers: An Image is Worth a Thousand Logs
    Dataset Documentation: COSOCO Dataset Documentation

    Dataset Description

    COSOCO (Compromised Software Containers) is a synthetic dataset of 3364 images representing benign and malware-compromised software containers. Each image in the dataset represents a dockerized software container that has been converted to an image using common… See the full description on the dataset page: https://huggingface.co/datasets/k3ylabs/cosoco-image-dataset.

  14. Fast & Furious: Malware Detection Data Stream

    • kaggle.com
    Updated Aug 12, 2022
    Fabrício Ceschin (2022). Fast & Furious: Malware Detection Data Stream [Dataset]. https://www.kaggle.com/fabriciojoc/fast-furious-malware-data-stream
    Explore at:
    Dataset updated
    Aug 12, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Fabrício Ceschin
    Description

    These datasets (DREBIN and AndroZoo) are contributions to the paper "Fast & Furious: On the Modelling of Malware Detection as an Evolving Data Stream". If you use them in your work, please cite our paper using the BibTeX below:

    @article{CESCHIN2022118590,
    title = {Fast \& Furious: On the modelling of malware detection as an evolving data stream},
    journal = {Expert Systems with Applications},
    pages = {118590},
    year = {2022},
    issn = {0957-4174},
    doi = {https://doi.org/10.1016/j.eswa.2022.118590},
    url = {https://www.sciencedirect.com/science/article/pii/S0957417422016463},
    author = {Fabrício Ceschin and Marcus Botacin and Heitor Murilo Gomes and Felipe Pinagé and Luiz S. Oliveira and André Grégio},
    keywords = {Machine learning, Data streams, Concept drift, Malware detection, Android}
    }
    

    Both datasets are saved in the parquet file format. To read them, use the following code:

    import pandas as pd  # read_parquet also needs a parquet engine (pyarrow or fastparquet)
    data_drebin = pd.read_parquet("drebin_drift.parquet.zip")
    data_androzoo = pd.read_parquet("androbin.parquet.zip")
    

    Note that these datasets are different from their original versions. The original DREBIN dataset does not contain the samples' timestamps, which we collected using VirusTotal API. Our version of the AndroZoo dataset is a subset of reports from their dataset previously available in their APK Analysis API, which was discontinued.

    The DREBIN dataset is composed of ten textual attributes from Android APKs (list of API calls, permissions, URLs, etc), which are publicly available to download and contain 123,453 benign and 5,560 malicious Android applications. Their distribution over time is shown below.

    [Figure: DREBIN dataset distribution by month, https://i.imgur.com/IGKOMtE.png]

    The AndroZoo dataset is a subset of Android application reports provided by the AndroZoo API, composed of eight textual attributes (resource names, source code classes and methods, manifest permissions, etc.) and contains 213,928 benign and 70,340 malicious applications. The distribution over time of our AndroZoo subset, which keeps the same goodware and malware distribution as the original dataset (composed of most of 10 million apps), is shown below.

    [Figure: AndroZoo dataset distribution by month, https://i.imgur.com/8zxH3M4.png]

    The source code for all the experiments shown in the paper is also available here on Kaggle (note that the experiments using the AndroZoo dataset did not run in the Kaggle environment due to high memory usage).

    Experiment 1 (The Best-Case Scenario for AVs - ML Cross-Validation)

    Here we classify all samples together to compare which feature extraction algorithm is the best and report baseline results. We tested several parameters for both algorithms and fixed the vocabulary size at 100 for TF-IDF (top-100 features ordered by term frequency) and created projections with 100 dimensions for Word2Vec, resulting in 1,000 and 800 features for each app in both cases, for DREBIN and AndroZoo, respectively. All results are reported after 10-fold cross-validation, a method commonly used in ML to evaluate models because its results are less prone to bias (note that we train new classifiers and feature extractors at every iteration of the cross-validation process). In practice, folding the dataset implies that the AV company has a mixed view of both past and future threats, despite temporal effects, which is the best scenario for AV operation and ML evaluation.
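    The cross-validation procedure described above can be sketched with the standard library alone. This is a minimal illustration, not the authors' code: the toy corpus, the helper names `top_terms` and `k_fold_indices`, and the use of contiguous unshuffled folds are all assumptions. The key point it shows is that the vocabulary is re-fitted on the training fold at every iteration. The final lines check that the reported 1,000 and 800 feature counts are consistent with one 100-dimensional vector per textual attribute (ten for DREBIN, eight for AndroZoo).

```python
from collections import Counter


def top_terms(documents, size):
    """Return up to `size` terms, ordered by frequency across `documents`."""
    counts = Counter(term for doc in documents for term in doc.split())
    return [term for term, _ in counts.most_common(size)]


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) for k contiguous, near-equal folds."""
    start = 0
    for fold in range(k):
        size = n_samples // k + (1 if fold < n_samples % k else 0)
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if not start <= i < start + size]
        yield train_idx, test_idx
        start += size


# Hypothetical toy corpus: one whitespace-separated attribute string per app.
apps = [f"perm_{i % 7} api_{i % 5} url_{i % 3}" for i in range(50)]

folds = list(k_fold_indices(len(apps), k=10))
for train_idx, test_idx in folds:
    # Re-fit the vocabulary on the training fold only, never on the test fold.
    vocabulary = top_terms([apps[i] for i in train_idx], size=100)
    # A classifier would be trained on train_idx and scored on test_idx here.

# Feature-count check: 100 features per textual attribute.
features_drebin = 10 * 100   # ten textual attributes
features_androzoo = 8 * 100  # eight textual attributes
```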

    Source Codes: DREBIN TFIDF | DREBIN W2V | ANDROZOO TFIDF | ANDROZOO W2V

    Experiment 2 (On Classification Failure - Temporal Classification)

    Although the currently used classification methodology helps reduce dataset biases, it would demand knowledge about future threats to work properly. AV companies train their classifiers using data from past samples and leverage them to predict future threats, expecting these to present the same characteristics as past ones. However, malware samples are very dynamic, thus this strategy is the worst-case scenario for AV companies. To demonstrate the effects of predicting future threats based on past data, we split our datasets in two: we used the first half (oldest samples) to train our classifiers, which were then used to predict the newest samples from the second half. The results in the paper indicate a drop in all metrics when compared to the 10-fold experiment in both DREBIN and AndroZoo dataset...
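    The temporal split used in this experiment can be sketched as follows. The record layout and the `first_seen` field name are hypothetical (the actual column names in the parquet files are not given here); the sketch only shows the idea of training on the oldest half and testing on the newest half.

```python
from datetime import date


def temporal_split(samples, timestamp_key="first_seen"):
    """Sort samples by timestamp and return (oldest half, newest half)."""
    ordered = sorted(samples, key=lambda sample: sample[timestamp_key])
    midpoint = len(ordered) // 2
    return ordered[:midpoint], ordered[midpoint:]


# Hypothetical records; labels: 1 = malware, 0 = goodware.
samples = [
    {"first_seen": date(2013, 6, 1), "label": 0},
    {"first_seen": date(2011, 2, 15), "label": 1},
    {"first_seen": date(2014, 9, 30), "label": 1},
    {"first_seen": date(2012, 4, 10), "label": 0},
    {"first_seen": date(2010, 11, 5), "label": 0},
    {"first_seen": date(2013, 1, 20), "label": 1},
]

train, test = temporal_split(samples)
newest_train = max(sample["first_seen"] for sample in train)
oldest_test = min(sample["first_seen"] for sample in test)
print(f"train up to {newest_train}, test from {oldest_test}")
```

    Unlike the 10-fold setting, no test sample here is older than any training sample, so the classifier never sees data from the future.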

  15. Statistical values of the CNN model.

    • plos.figshare.com
    xls
    Updated Jan 19, 2024
    Muhammad Aamir; Muhammad Waseem Iqbal; Mariam Nosheen; M. Usman Ashraf; Ahmad Shaf; Khalid Ali Almarhabi; Ahmed Mohammed Alghamdi; Adel A. Bahaddad (2024). Statistical values of the CNN model. [Dataset]. http://doi.org/10.1371/journal.pone.0296722.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Jan 19, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Muhammad Aamir; Muhammad Waseem Iqbal; Mariam Nosheen; M. Usman Ashraf; Ahmad Shaf; Khalid Ali Almarhabi; Ahmed Mohammed Alghamdi; Adel A. Bahaddad
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Android is the most popular operating system on current mobile smart devices. Many Android applications have been developed for it and have become an essential part of daily life. Unfortunately, this endless stream of applications has also brought different kinds of Android malware, which can be installed through API calls, granted permissions, and extra package installations, and which undermines the system's security rules. It is therefore essential to detect and classify Android malware to protect user privacy and limit damage. Much research has already been conducted on techniques for Android malware detection and classification. In this work, we present AMDDLmodel, a deep learning technique based on a convolutional neural network. The model is configured through several parameters (filter sizes, number of epochs, learning rates, and layers) to detect and classify Android malware. The Drebin dataset, consisting of 215 features, was used to evaluate the model. The model achieves an accuracy of 99.92%; precision, recall, and F1-score are also reported. AMDDLmodel introduces innovative deep learning for Android malware detection, enhancing accuracy and practical user security through inventive feature engineering and comprehensive performance evaluation. The AMDDLmodel shows the highest accuracy compared with the existing techniques.

  16. State-of-the-art comparison with the existing techniques.

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Jan 19, 2024
    Muhammad Aamir; Muhammad Waseem Iqbal; Mariam Nosheen; M. Usman Ashraf; Ahmad Shaf; Khalid Ali Almarhabi; Ahmed Mohammed Alghamdi; Adel A. Bahaddad (2024). State-of-the-art comparison with the existing techniques. [Dataset]. http://doi.org/10.1371/journal.pone.0296722.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Jan 19, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Muhammad Aamir; Muhammad Waseem Iqbal; Mariam Nosheen; M. Usman Ashraf; Ahmad Shaf; Khalid Ali Almarhabi; Ahmed Mohammed Alghamdi; Adel A. Bahaddad
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    State-of-the-art comparison with the existing techniques.

  17. The confusion matrix.

    • plos.figshare.com
    xls
    Updated Sep 3, 2025
    Ye Tian; Xin Dai; Zhijun Li; Hong Guo; Xiao Mao (2025). The confusion matrix. [Dataset]. http://doi.org/10.1371/journal.pone.0331574.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Sep 3, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Ye Tian; Xin Dai; Zhijun Li; Hong Guo; Xiao Mao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    With the widespread adoption of internet technologies and email communication systems, the exponential growth in email usage has precipitated a corresponding surge in spam proliferation. These unsolicited messages not only consume users’ valuable time through information overload but also pose significant cybersecurity threats through malware distribution and phishing schemes, thereby jeopardizing both digital security and user experience. This emerging challenge underscores the critical importance of developing effective spam detection mechanisms as a cornerstone of modern cybersecurity infrastructure. Through empirical analysis of machine learning (ML) performance on publicly available spam datasets, we established that algorithmic ensemble methods consistently outperform individual models in detection accuracy. We propose an optimized stacking ensemble framework that strategically combines predictions from four heterogeneous base models (NBC, k-NN, LR, XGBoost) through meta-learner integration. Our methodology incorporates grid search cross-validation with hyperparameter space optimization, enabling systematic identification of parameter configurations that maximize detection performance. The enhanced model was rigorously evaluated using comprehensive metrics including accuracy (99.79%), precision, recall, and F1-score, demonstrating statistically significant improvements over both baseline models and existing solutions documented in the literature.
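    The metrics named in the description above (accuracy, precision, recall, F1-score) all derive from the four counts of a binary confusion matrix. The sketch below shows the standard definitions; the counts are hypothetical and are not taken from the paper's table.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard binary metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for a spam (positive) vs. ham (negative) classifier.
metrics = classification_metrics(tp=90, fp=10, fn=5, tn=895)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

    On imbalanced data such as spam corpora, accuracy alone can look high even when recall on the minority class is poor, which is one reason to report the other three metrics alongside it.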

  18. Performance comparison before and after tuning.

    • plos.figshare.com
    xls
    Updated Sep 3, 2025
    Ye Tian; Xin Dai; Zhijun Li; Hong Guo; Xiao Mao (2025). Performance comparison before and after tuning. [Dataset]. http://doi.org/10.1371/journal.pone.0331574.t004
    Explore at:
    Available download formats: xls
    Dataset updated
    Sep 3, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Ye Tian; Xin Dai; Zhijun Li; Hong Guo; Xiao Mao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    With the widespread adoption of internet technologies and email communication systems, the exponential growth in email usage has precipitated a corresponding surge in spam proliferation. These unsolicited messages not only consume users’ valuable time through information overload but also pose significant cybersecurity threats through malware distribution and phishing schemes, thereby jeopardizing both digital security and user experience. This emerging challenge underscores the critical importance of developing effective spam detection mechanisms as a cornerstone of modern cybersecurity infrastructure. Through empirical analysis of machine learning (ML) performance on publicly available spam datasets, we established that algorithmic ensemble methods consistently outperform individual models in detection accuracy. We propose an optimized stacking ensemble framework that strategically combines predictions from four heterogeneous base models (NBC, k-NN, LR, XGBoost) through meta-learner integration. Our methodology incorporates grid search cross-validation with hyperparameter space optimization, enabling systematic identification of parameter configurations that maximize detection performance. The enhanced model was rigorously evaluated using comprehensive metrics including accuracy (99.79%), precision, recall, and F1-score, demonstrating statistically significant improvements over both baseline models and existing solutions documented in the literature.
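    Grid search with cross-validation, as mentioned in the description above, can be illustrated with a small standard-library sketch. It tunes only a toy one-dimensional k-NN classifier (one of the four base model families named), not the paper's stacking ensemble; the data, the parameter grid, and all function names are assumptions made for illustration.

```python
from itertools import product


def knn_predict(train, query, k, weighted):
    """Predict a 0/1 label for `query` from its k nearest (x, label) points."""
    neighbours = sorted(train, key=lambda point: abs(point[0] - query))[:k]
    votes = 0.0
    for x, label in neighbours:
        weight = 1.0 / (abs(x - query) + 1e-9) if weighted else 1.0
        votes += weight if label == 1 else -weight
    return 1 if votes > 0 else 0


def cv_accuracy(data, n_folds, k, weighted):
    """Mean accuracy over n_folds strided folds."""
    correct = 0
    for fold in range(n_folds):
        test = data[fold::n_folds]
        train = [p for i, p in enumerate(data) if i % n_folds != fold]
        correct += sum(knn_predict(train, x, k, weighted) == y for x, y in test)
    return correct / len(data)


def grid_search(data, grid, n_folds=5):
    """Try every parameter combination and keep the best CV score."""
    names = list(grid)
    best_params, best_score = None, -1.0
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cv_accuracy(data, n_folds, **params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical toy data: label is 1 when x >= 10.
data = [(float(x), int(x >= 10)) for x in range(20)]
grid = {"k": [1, 3, 5], "weighted": [False, True]}
best_params, best_score = grid_search(data, grid)
print(best_params, best_score)
```

    The same exhaustive loop applies to any estimator; only the scoring function changes. With a stacking ensemble, the grid would span the hyperparameters of the base models and the meta-learner.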

  19. Performance of the proposed stacking model.

    • plos.figshare.com
    xls
    Updated Sep 3, 2025
    Ye Tian; Xin Dai; Zhijun Li; Hong Guo; Xiao Mao (2025). Performance of the proposed stacking model. [Dataset]. http://doi.org/10.1371/journal.pone.0331574.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    Sep 3, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Ye Tian; Xin Dai; Zhijun Li; Hong Guo; Xiao Mao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    With the widespread adoption of internet technologies and email communication systems, the exponential growth in email usage has precipitated a corresponding surge in spam proliferation. These unsolicited messages not only consume users’ valuable time through information overload but also pose significant cybersecurity threats through malware distribution and phishing schemes, thereby jeopardizing both digital security and user experience. This emerging challenge underscores the critical importance of developing effective spam detection mechanisms as a cornerstone of modern cybersecurity infrastructure. Through empirical analysis of machine learning (ML) performance on publicly available spam datasets, we established that algorithmic ensemble methods consistently outperform individual models in detection accuracy. We propose an optimized stacking ensemble framework that strategically combines predictions from four heterogeneous base models (NBC, k-NN, LR, XGBoost) through meta-learner integration. Our methodology incorporates grid search cross-validation with hyperparameter space optimization, enabling systematic identification of parameter configurations that maximize detection performance. The enhanced model was rigorously evaluated using comprehensive metrics including accuracy (99.79%), precision, recall, and F1-score, demonstrating statistically significant improvements over both baseline models and existing solutions documented in the literature.


Statista (2025). Businesses worldwide affected by ransomware 2018-2025 [Dataset]. https://www.statista.com/statistics/204457/businesses-ransomware-attack-rate/

Businesses worldwide affected by ransomware 2018-2025

Explore at:
24 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Aug 26, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Area covered
Worldwide
