21 datasets found
  1. Malware Analysis Datasets: Top-1000 PE Imports

    • ieee-dataport.org
    Updated Nov 8, 2019
    Cite
    Angelo Oliveira (2019). Malware Analysis Datasets: Top-1000 PE Imports [Dataset]. https://ieee-dataport.org/open-access/malware-analysis-datasets-top-1000-pe-imports
    Dataset updated
    Nov 8, 2019
    Authors
    Angelo Oliveira
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is part of my PhD research on malware detection and classification using Deep Learning. It contains static analysis data: Top-1000 imported functions extracted from the 'pe_imports' elements of Cuckoo Sandbox reports. PE malware examples were downloaded from virusshare.com. PE goodware examples were downloaded from portableapps.com and from Windows 7 x86 directories.
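
    As a rough illustration of how a top-N import list like this could be derived, the sketch below tallies imported function names across Cuckoo-style reports. It assumes each report exposes `static.pe_imports` as a list of `{"dll": ..., "imports": [{"name": ...}]}` entries, which is the usual Cuckoo layout but is not confirmed by the dataset page. The sample reports and the `top_imports` helper are hypothetical.

```python
from collections import Counter

def top_imports(reports, n=1000):
    """Return the n most frequently imported function names across reports."""
    counts = Counter()
    for report in reports:
        # Assumed Cuckoo-style layout: report["static"]["pe_imports"].
        for dll in report.get("static", {}).get("pe_imports", []):
            for imp in dll.get("imports", []):
                name = imp.get("name")
                if name:  # imports by ordinal may carry no name
                    counts[name] += 1
    return [name for name, _ in counts.most_common(n)]

# Hypothetical, hand-written reports for illustration only.
sample_reports = [
    {"static": {"pe_imports": [
        {"dll": "KERNEL32.dll",
         "imports": [{"name": "CreateFileW"}, {"name": "ExitProcess"}]},
    ]}},
    {"static": {"pe_imports": [
        {"dll": "KERNEL32.dll",
         "imports": [{"name": "ExitProcess"}, {"name": None}]},
    ]}},
]

top = top_imports(sample_reports, n=2)
```

    Each sample could then be encoded as a binary vector marking which of the top-N names it imports; that encoding step is likewise an assumption, not something the description states.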

  2. RDE-Dataset.zip: Ransomware Defense Empowered: Deep Learning for Real-Time...

    • figshare.com
    zip
    Updated Mar 24, 2024
    Cite
    Hassan Jalil Hadi (2024). RDE-Dataset.zip: Ransomware Defense Empowered: Deep Learning for Real-Time Family Identification with a Proprietary Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.25467826.v1
    Available download formats: zip
    Dataset updated
    Mar 24, 2024
    Dataset provided by
    figshare
    Authors
    Hassan Jalil Hadi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Ransomware, leveraging sophisticated encryption techniques, poses a significant threat by encrypting crucial data, thereby rendering it inaccessible. The proliferation of diverse ransomware variants has caused considerable harm to governments, corporations, and individual users alike. Despite the increasing prevalence of cyber threats, existing solutions often struggle with real-time detection and early identification of ransomware families. To address this challenge, we introduce FCG-RFD, a novel benchmark dataset featuring extensive Function Call Graphs (FCG) tailored for ransomware family detection. Given the constantly evolving nature of malware, antivirus scanners face ongoing challenges, necessitating access to recent and updated datasets. Our dataset comprises 8,095 samples sourced from reputable repositories including VirusSamples, Virusshare, VirusSign, the Zoo, and MalwareBazaar. Additionally, we include 8,020 normal files obtained from trusted sources such as the Microsoft Store and Softonic. Through FCG-RFD, we aim to facilitate more robust and timely detection of ransomware families, ultimately enhancing cybersecurity measures against this pervasive threat.

  3. Businesses worldwide affected by ransomware 2018-2023

    • statista.com
    Updated Nov 9, 2024
    Cite
    Statista (2024). Businesses worldwide affected by ransomware 2018-2023 [Dataset]. https://www.statista.com/statistics/204457/businesses-ransomware-attack-rate/
    Dataset updated
    Nov 9, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    Worldwide
    Description

    As of 2023, over 72 percent of businesses worldwide were affected by ransomware attacks. This figure represents an increase on the previous five years and was by far the highest figure reported. Overall, since 2018, more than half of the total survey respondents each year stated that their organizations had been victimized by ransomware.

    Most targeted industries
    In 2023, the healthcare industry in the United States was once again the most targeted by ransomware attacks. This industry also suffers the most data breaches as a consequence of cyberattacks. The critical manufacturing industry ranked second by the number of ransomware attacks, followed by the government facilities industry.

    Ransomware in the manufacturing industry
    The manufacturing industry, along with its subindustries, is constantly targeted by ransomware attacks, causing data loss, business disruptions, and reputational damage. Often, such cyberattacks are international and have a political intent. In 2023, compromised credentials were the leading cause of ransomware attacks in the manufacturing industry.

  4. Android Malware Dataset for Machine Learning

    • kaggle.com
    Updated Mar 13, 2021
    Cite
    Shashwat Tiwari (2021). Android Malware Dataset for Machine Learning [Dataset]. https://www.kaggle.com/datasets/shashwatwork/android-malware-dataset-for-machine-learning/discussion
    Dataset updated
    Mar 13, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Shashwat Tiwari
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context

    "Mobile malware is malicious software that targets mobile phones or wireless-enabled Personal digital assistants (PDA), by causing the collapse of the system and loss or leakage of confidential information. As wireless phones and PDA networks have become more and more common and have grown in complexity, it has become increasingly difficult to ensure their safety and security against electronic attacks in the form of viruses or other malware."

    Content

    Dataset consisting of feature vectors of 215 attributes extracted from 15,036 applications (5,560 malware apps from the Drebin project and 9,476 benign apps). The dataset has been used to develop and evaluate a multilevel classifier fusion approach for Android malware detection, published in the IEEE Transactions on Cybernetics paper 'DroidFusion: A Novel Multilevel Classifier Fusion Approach for Android Malware Detection'. The supporting file contains the description of the feature vectors/attributes obtained via static code analysis of the Android apps.

    Acknowledgements

    Yerima, Suleiman (2018): Android malware dataset for machine learning 2. figshare. Dataset. https://doi.org/10.6084/m9.figshare.5854653.v1 Data Source - https://figshare.com/articles/dataset/Android_malware_dataset_for_machine_learning_2/5854653 Literature URL - https://ieeexplore.ieee.org/document/8245867

  5. Malware API Call Dataset

    • ieee-dataport.org
    Updated May 18, 2022
    Cite
    Ferhat Ozgur Catak (2022). Malware API Call Dataset [Dataset]. https://ieee-dataport.org/open-access/malware-api-call-dataset
    Dataset updated
    May 18, 2022
    Authors
    Ferhat Ozgur Catak
    Description

    This study seeks to obtain data which will help to address machine learning based malware research gaps. The specific objective of this study is to build a benchmark dataset for Windows operating system API calls of various malware. This is the first study to use metamorphic malware to build sequential API calls. It is hoped that this research will contribute to a deeper understanding of how metamorphic malware change their behavior (i.e. API calls) by adding meaningless opcodes with their own disassembler/assembler parts.

  6. WinMET Dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin, json
    Updated Mar 27, 2025
    Cite
    Razvan Raducu; Alain Villagrasa-Labrador; Ricardo J. Rodríguez; Pedro Álvarez (2025). WinMET Dataset [Dataset]. http://doi.org/10.5281/zenodo.12737794
    Available download formats: json, bin
    Dataset updated
    Mar 27, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Razvan Raducu; Alain Villagrasa-Labrador; Ricardo J. Rodríguez; Pedro Álvarez
    License

    https://www.gnu.org/licenses/gpl-3.0-standalone.html

    Description

    WinMET (Windows Malware Execution Traces) Dataset

    The WinMET dataset contains the reports generated with CAPE sandbox after analyzing several malware samples. The reports are valid JSON files that contain the spawned processes, the sequence of WinAPI and system calls invoked by each process, their parameters, their return values, and the OS resources accessed, amongst many others.

    Please use this DOI reference that always points to the latest WinMET version: https://doi.org/10.5281/zenodo.12647555

    This dataset was generated using the MALVADA framework (https://github.com/reverseame/MALVADA/tree/main), which you can read more about in our publication https://doi.org/10.1016/j.softx.2025.102082. The article also provides insights about the contents of this dataset.

    Razvan Raducu, Alain Villagrasa-Labrador, Ricardo J. Rodríguez, Pedro Álvarez, MALVADA: A framework for generating datasets of malware execution traces, SoftwareX, Volume 30, 2025, 102082, ISSN 2352-7110, https://doi.org/10.1016/j.softx.2025.102082. (https://www.sciencedirect.com/science/article/pii/S2352711025000494)

    How to use the dataset

    The 7z file is password protected. The password is: infected.

    Compressed size on disk: ~2.5GiB.
    Decompressed size on disk: ~105GiB.
    Total decompressed .json files: 9889.

    The name of each .json file is irrelevant. It corresponds to its analysis ID.

    cape_report_to_label_mapping.json and avclass_report_to_label_mapping.json contain the mappings of each report to its corresponding consensus label, sorted in descending order (by the number of reports belonging to each label/family).

    Integrity checks for WinMET.7z:

    • MD5: 75b3354fb186ae5a47c320e253bd96ee
    • SHA256: 00faac011f4938a29ba9afbd9f0b50d89ede342d1d0d6877cb90b46eabd92c72
    • SHA512: 038ca9303623cadaa72eab680221e81e1d335449d08f6395b39eb99baad4092e02c00955089fba31ce1a9dd04260ae80b622491f754774331bced18e8e3be1c4
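
    One way to check a downloaded archive against the published digest is sketched below. It hashes the file in chunks, which matters for a multi-GiB download. The local path `WinMET.7z` and the helper names `file_sha256` and `verify` are assumptions for illustration.

```python
import hashlib

# Published SHA256 for WinMET.7z, copied from the integrity list above.
EXPECTED_SHA256 = "00faac011f4938a29ba9afbd9f0b50d89ede342d1d0d6877cb90b46eabd92c72"

def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so the archive never sits fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path, expected=EXPECTED_SHA256):
    """Return True only if the file's SHA256 matches the expected digest."""
    return file_sha256(path) == expected.lower()
```

    Calling `verify("WinMET.7z")` before extraction would flag a corrupted or tampered download; the same pattern applies to the MD5 and SHA512 values by swapping the hashlib constructor.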

    Citation

    If you use this dataset, cite it as follows:

    TBA.

    Statistics

    The following statistics (and many more) can be obtained by analyzing the WinMET dataset with the MALVADA framework (https://github.com/reverseame/MALVADA).

    • Total reports: 9889.
    • Average VT (VirusTotal) detections: ~53.
    • There are 268 benign or undetected reports, that is, reports with 10 or fewer VT detections (default threshold).
    • There are 2584 reports with no CAPE consensus label.
    • There are 695 reports with no AVClass consensus label.
    • Top 20 CAPE consensus labels (there are many more):
      • "(n/a)": 2584
      • "Redline": 1227
      • "Agenttesla": 1010
      • "Crifi": 622
      • "Amadey": 606
      • "Smokeloader": 538
      • "Virlock": 471
      • "Msilheracles": 408
      • "Tedy": 364
      • "Disabler": 343
      • "Xorstringsnet": 321
      • "Snake": 252
      • "Autorun": 252
      • "Metastealer": 246
      • "Formbook": 244
      • "Lokibot": 202
      • "Strab": 188
      • "Loki": 185
      • "Mint": 179
      • "Taskun": 178
    • Top 20 AVClass consensus labels (there are many more):
      • "Reline": 2187
      • "Disabler": 732
      • "(n/a)": 695
      • "Amadey": 575
      • "Agenttesla": 478
      • "Taskun": 382
      • "Virlock": 293
      • "Equationdrug": 270
      • "Stop": 268
      • "Strab": 260
      • "Noon": 259
      • "Gamarue": 181
      • "Dofoil": 135
      • "Makoob": 113
      • "Mokes": 110
      • "Snakelogger": 110
      • "Bladabindi": 98
      • "Zard": 84
      • "Gcleaner": 83
      • "Deyma": 80
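
    A ranking like the ones above could be recomputed from a label mapping along the following lines. This is only a sketch: it assumes the mapping is a flat JSON object of the form `{report_id: label}`, which the page does not confirm, and the miniature `raw` mapping is invented for illustration.

```python
import json
from collections import Counter

def top_labels(mapping, n=20):
    """Count reports per consensus label; mapping is assumed {report_id: label}."""
    return Counter(mapping.values()).most_common(n)

# Hypothetical miniature mapping; the real files may be shaped differently.
raw = '{"101": "Redline", "102": "Agenttesla", "103": "Redline", "104": "(n/a)"}'
mapping = json.loads(raw)
ranking = top_labels(mapping, n=2)
```

    If the real files group reports under each label instead, the same tally reduces to taking the length of each group.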

    Changelog

    • Version 2.0: Added cape and avclass label mappings.

  7. Malware Repositories and Their Authors on GitHub

    • data.niaid.nih.gov
    • zenodo.org
    Updated Mar 11, 2024
    Cite
    Tania, Nishat Ara (2024). Malware Repositories and Their Authors on GitHub [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10806592
    Dataset updated
    Mar 11, 2024
    Dataset provided by
    Zhang, Qian
    Rokon, Md Omar Faruk
    Faloutsos, Michalis
    Masud, Md Rayhanul
    Tania, Nishat Ara
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is rooted in a study aimed at unveiling the origins and motivations behind the creation of malware repositories on GitHub. Our research embarks on an innovative journey to dissect the profiles and intentions of GitHub users who have been involved in this dubious activity.

    Employing a robust methodology, we meticulously identified 14,000 GitHub users linked to malware repositories. By leveraging advanced large language model (LLM) analytics, we classified these individuals into distinct categories based on their perceived intent: 3,339 were deemed Malicious, 3,354 Likely Malicious, and 7,574 Benign, offering a nuanced perspective on the community behind these repositories.

    Our analysis penetrates the veil of anonymity and obscurity often associated with these GitHub profiles, revealing stark contrasts in their characteristics. Malicious authors were found to typically possess sparse profiles focused on nefarious activities, while Benign authors presented well-rounded profiles, actively contributing to cybersecurity education and research. Those labeled as Likely Malicious exhibited a spectrum of engagement levels, underlining the complexity and diversity within this digital ecosystem.

    We are offering two datasets in this paper. First, a list of malware repositories: we collected and extended the malware repositories on GitHub in 2022 following the original papers. Second, a CSV file with the GitHub users' information and their maliciousness classification labels.

    malware_repos.txt

    Purpose: This file contains a curated list of GitHub repositories identified as containing malware. These repositories were identified following the methodology outlined in the research paper "SourceFinder: Finding Malware Source-Code from Publicly Available Repositories in GitHub."

    Contents: The file is structured as a simple text file, with each line representing a unique repository in the format username/reponame. This format allows for easy identification and access to each repository on GitHub for further analysis or review.
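
    Reading a list in this `username/reponame` format could look like the sketch below. The helper name `parse_repo_list` and the sample entries are hypothetical; none of the sample lines come from malware_repos.txt.

```python
def parse_repo_list(text):
    """Turn 'username/reponame' lines into (owner, repo, url) tuples."""
    repos = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        owner, sep, repo = line.partition("/")
        if not sep or not owner or not repo:
            continue  # skip lines that do not match the documented format
        repos.append((owner, repo, f"https://github.com/{owner}/{repo}"))
    return repos

# Hypothetical entries for illustration only.
sample = "alice/example-repo\n\nbob/another-repo\nmalformed-line\n"
repos = parse_repo_list(sample)
```

    Skipping malformed lines rather than raising is a design choice for a quick survey; a stricter pipeline might prefer to fail loudly.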

    Usage: The list serves as a critical resource for researchers and cybersecurity professionals interested in studying malware, understanding its distribution on platforms like GitHub, or developing defense mechanisms against such malicious content.

    obfuscated_github_user_dataset.csv

    Purpose: Accompanying the list of malware repositories, this CSV file contains detailed, albeit obfuscated, profile information of the GitHub users who authored these repositories. The obfuscation process has been applied to protect user privacy and comply with ethical standards, especially given the sensitive nature of associating individuals with potentially malicious activities.

    Contents: The dataset includes several columns representing different aspects of user profiles, such as obfuscated identifiers (e.g., ID, login, name), contact information (e.g., email, blog), and GitHub-specific metrics (e.g., followers count, number of public repositories). Notably, sensitive information has been masked or replaced with generic placeholders to prevent user identification.

    Usage: This dataset can be instrumental for researchers analyzing behaviors, patterns, or characteristics of users involved in creating malware repositories on GitHub. It provides a basis for statistical analysis, trend identification, or the development of predictive models, all while upholding the necessary ethical considerations.

  8. Classification of Malwares (CLaMP) Dataset

    • cubig.ai
    Updated May 2, 2025
    Cite
    CUBIG (2025). Classification of Malwares (CLaMP) Dataset [Dataset]. https://cubig.ai/store/products/228/classification-of-malwares-clamp-dataset
    Dataset updated
    May 2, 2025
    Dataset authored and provided by
    CUBIG
    License

    https://cubig.ai/store/terms-of-service

    Measurement technique
    Privacy-preserving data transformation via differential privacy, Synthetic data generation using AI techniques for model training
    Description

    1) Data introduction
    • The Malwares dataset is a malware classifier dataset built from header field values of portable executable (PE) files.

    2) Data utilization
    (1) Malwares data has the following characteristics:
    • The source of the data is Mendeley Data. The goal is binary classification of malicious code based on a total of 5184 samples.
    (2) Malwares data can be used for:
    • Cybersecurity: This dataset can be used to develop advanced cybersecurity tools that enhance protection against malicious software by detecting and classifying malware based on PE header analysis.
    • Malware analysis: By analyzing the dataset, researchers can understand common patterns and features of malware PE headers and contribute to the broader field of malware research and defense strategies.
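
    To make "header field values" concrete, the sketch below reads a few standard PE/COFF fields from raw bytes using only the standard library. The page does not list which header fields the dataset actually uses, so the fields shown are simply the ones defined by the PE format; the byte string is synthetic and is not a real executable.

```python
import struct

def parse_coff_header(data: bytes) -> dict:
    """Read the DOS magic, e_lfanew and the 20-byte COFF file header."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # e_lfanew (offset of the PE signature) lives at 0x3C in the DOS header.
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    fields = struct.unpack_from("<HHIIIHH", data, e_lfanew + 4)
    names = ("Machine", "NumberOfSections", "TimeDateStamp",
             "PointerToSymbolTable", "NumberOfSymbols",
             "SizeOfOptionalHeader", "Characteristics")
    return {"e_lfanew": e_lfanew, **dict(zip(names, fields))}

# Synthetic header bytes built for illustration; not a real executable.
dos = b"MZ" + b"\x00" * 58 + struct.pack("<I", 64)   # e_lfanew = 64
coff = struct.pack("<HHIIIHH", 0x014C, 3, 0, 0, 0, 224, 0x0102)
header = parse_coff_header(dos + b"PE\x00\x00" + coff)
```

    A classifier would typically consume such values as numeric columns, one row per file.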

  9. Data from: Matthew Gaber: Peekaboo

    • researchdata.edu.au
    Updated Apr 12, 2024
    Cite
    Mohiuddin Ahmed; Matthew Gaber; Helge Janicke (2024). Matthew Gaber: Peekaboo [Dataset]. http://doi.org/10.25958/85P1-4W32
    Dataset updated
    Apr 12, 2024
    Dataset provided by
    Edith Cowan University
    Authors
    Mohiuddin Ahmed; Matthew Gaber; Helge Janicke
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Cyber-attacks continue to evolve, increasing in frequency and sophistication where Artificial Intelligence (AI) is becoming essential in detecting modern malware. However, the accuracy of AI in malware detection is dependent on the quality of the features it is trained with. Static and dynamic analysis of malware is limited by the widespread use of obfuscation and anti-analysis techniques employed by malware authors, where if an analysis environment is detected the malware will hide its malicious behavior. However, Dynamic Binary Instrumentation (DBI) allows deep and precise control of the malware sample, thereby facilitating the extraction of authentic features from sophisticated and evasive malware. We developed Peekaboo, a DBI tool to defeat the anti-analysis techniques and extract authentic behavior from live malware samples. We collected 18,527 malware samples across ransomware, spyware, trojans, botnets, worms, Advanced Persistent Threats (APT) and post exploitation tools where every sample includes type, family, and variant information, for example Ransomware-WannaCry-SHA256. We also collected 1,973 benign software samples for analysis.

    This dataset contains the results for each sample. Samples were run for up to 15 minutes to observe not only the anti-analysis techniques used but also their complete behavior. For each malware sample, the network traffic, every opcode that is executed and every evasive technique that is used are captured.

  10. Static DLL Feature Dataset for Malware Detection

    • dataverse.harvard.edu
    Updated May 17, 2025
    Cite
    Mohammed; Ahmad Almulhem (2025). Static DLL Feature Dataset for Malware Detection [Dataset]. http://doi.org/10.7910/DVN/GGD1G2
    Dataset updated
    May 17, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    Mohammed; Ahmad Almulhem
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset comprises 1,777 entries of dynamic-link libraries (DLLs), with 700 labeled as benign and the remaining as malicious. Each entry contains static metadata attributes including MD5 hash, name, timestamps, subsystem, import/export functions, imphash, entropy, and indicators like packer presence and overlay. These features are commonly used in static analysis pipelines for malware detection and classification. This dataset supports the development of static malware detection models, particularly useful where dynamic analysis is impractical due to evasion techniques or resource constraints.
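
    Entropy is one of the listed attributes. The page does not say how it was computed, so the sketch below shows the common byte-level Shannon entropy as an assumed stand-in: values near 8 bits per byte often suggest packed or encrypted content, while values near 0 indicate highly repetitive data.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte, ranging from 0.0 to 8.0."""
    if not data:
        return 0.0
    total = len(data)
    # Sum p * log2(1/p) over the observed byte values.
    return sum((c / total) * math.log2(total / c) for c in Counter(data).values())

uniform = bytes(range(256))    # every byte value exactly once
constant = b"\x00" * 1024      # a single repeated byte

high = shannon_entropy(uniform)
low = shannon_entropy(constant)
```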

  11. Dataset of Publication "Malware Communication in Smart Factories: A Network...

    • b2find.eudat.eu
    Updated Apr 12, 2025
    Cite
    (2025). Dataset of Publication "Malware Communication in Smart Factories: A Network Traffic Data Set" [Dataset]. https://b2find.eudat.eu/dataset/a4f43cd9-25b1-5df3-a529-e430ae2fe323
    Dataset updated
    Apr 12, 2025
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Brenner, B., Fabini, J., Offermanns, M., Semper, S., & Zseby, T. (2024). Malware communication in smart factories: A network traffic data set. Computer Networks, 255, 110804.

    or in BibTeX:

    @article{brenner2024malware,
      title={Malware communication in smart factories: A network traffic data set},
      author={Brenner, Bernhard and Fabini, Joachim and Offermanns, Magnus and Semper, Sabrina and Zseby, Tanja},
      journal={Computer Networks},
      volume={255},
      pages={110804},
      year={2024},
      publisher={Elsevier}
    }

    Context and methodology

    Machine learning-based intrusion detection requires suitable and realistic data sets for training and testing. However, data sets that originate from real networks are rare. Network data is considered privacy-sensitive, and the purposeful introduction of malicious traffic is usually not possible. In this paper, we introduce a labeled data set captured at a smart factory located in Vienna, Austria, during normal operation and during penetration tests with different attack types. The data set contains 173 GB of PCAP files, representing 16 days (395 hours) of factory operation. It includes MQTT, OPC UA, and Modbus/TCP traffic. The captured malicious traffic originated from a professional penetration tester who performed two types of attacks: (a) aggressive attacks that are easier to detect and (b) stealthy attacks that are harder to detect. Our data set includes the raw PCAP files and extracted flow data. Labels for packets and flows indicate whether they originated from a specific attack or from benign communication. We describe the methodology for creating the dataset, conduct an analysis of the data, and provide detailed information about the recorded traffic itself. The dataset is freely available to support reproducible research and the comparability of results in the area of intrusion detection in industrial networks.

    Technical details

    readme.txt: Information about the data collection, format, necessary software and versions to access it.

  12. Dataset Description for "Quantum AI for Cybersecurity Threat Prediction"

    • data.mendeley.com
    Updated Mar 20, 2025
    Cite
    Bindu Garg (2025). Dataset Description for "Quantum AI for Cybersecurity Threat Prediction" [Dataset]. http://doi.org/10.17632/fswng37vbz.2
    Dataset updated
    Mar 20, 2025
    Authors
    Bindu Garg
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is engineered to propel the development of quantum-enhanced anomaly detection systems for cybersecurity, merging real-world network traffic data with the potential for simulated attack scenarios. It comprises two datasets—malicious and non-malicious—crafted to train ML models, leveraging quantum AI to identify subtle anomalies and mitigate cyber threats, particularly those resistant to classical detection methods. Derived from Wireshark captures of normal web browsing and attack simulations, it provides a crucial baseline for quantum machine learning (QML) models.

    The dataset's strength lies in its fusion of traditional network attributes with frequency-based features. These frequency features are paramount for QML algorithms to discern complex patterns indicative of malicious behavior. For instance, QML can identify minute deviations in source/destination frequency or unusual protocol usage, often missed by classical methods.

    Column Descriptions:

    • No. (Record Number): Unique identifier.
    • Time: Timestamp of activity.
    • Source: Source device/IP.
    • Source_Count: Source frequency.
    • Destination: Destination device/IP.
    • Destination_Count: Destination frequency.
    • Protocol: Network protocol.
    • Protocol_Count: Protocol frequency.
    • Length: Packet size.
    • Info: Contextual details.
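
    The three `*_Count` columns read like per-value occurrence counts. A minimal sketch of deriving them from raw packet rows follows, assuming each count is the number of rows in the capture sharing that Source, Destination, or Protocol value; the page does not define the counting window, and the sample packets use documentation-range IP addresses.

```python
from collections import Counter

def add_frequency_counts(rows):
    """Attach Source_Count, Destination_Count and Protocol_Count to each row."""
    src = Counter(r["Source"] for r in rows)
    dst = Counter(r["Destination"] for r in rows)
    proto = Counter(r["Protocol"] for r in rows)
    return [
        {**r,
         "Source_Count": src[r["Source"]],
         "Destination_Count": dst[r["Destination"]],
         "Protocol_Count": proto[r["Protocol"]]}
        for r in rows
    ]

# Hypothetical packets for illustration only.
packets = [
    {"Source": "192.0.2.1", "Destination": "198.51.100.7", "Protocol": "TCP"},
    {"Source": "192.0.2.1", "Destination": "198.51.100.9", "Protocol": "DNS"},
    {"Source": "192.0.2.5", "Destination": "198.51.100.7", "Protocol": "TCP"},
]
enriched = add_frequency_counts(packets)
```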

    Uniqueness of the Dataset:

    • Two-Class Design: The dataset includes separate malicious and non-malicious traffic logs, essential for training ML models to differentiate between normal and attack patterns.
    • Frequency-Based Features: The inclusion of "Source_Count," "Destination_Count," and "Protocol_Count" significantly enhances analytical capabilities, allowing the detection of anomalies based on activity patterns.
    • Comprehensive Network Traffic Attributes: The dataset combines frequency features with standard network traffic attributes (Time, Source, Destination, Protocol, Length, Info), providing a holistic view of network activity.
    • Potential for Diverse Analysis: The combination of structured and semi-structured data (in the "Info" column) enables a wide range of analytical techniques, including time series analysis, machine learning, and natural language processing.
    • Cybersecurity Focus: Designed for cybersecurity threat prediction, it is valuable for researchers and practitioners in this domain.
    • Real-World and Simulated Attacks: The dataset includes both benign traffic and simulated attacks, making it ideal for testing security systems before deployment.

    Conclusion:

    This dataset is a powerful tool for cybersecurity analysis. Its strength lies in its ability to establish a baseline and detect deviations, even subtle ones. The inclusion of malicious and non-malicious data enables precise model training for threat detection. It is vital for behavioral analysis, DDoS detection, malware analysis, forensics, and training. This dataset empowers security professionals to develop advanced solutions, enhancing network security by revealing valuable insights from seemingly routine network traffic.

  13. Dataset of Publication "Malware Communication in Smart Factories: A Network...

    • researchdata.tuwien.ac.at
    • b2find.eudat.eu
    zip
    Updated Oct 18, 2024
    Cite
    Bernhard Brenner; Joachim Fabini; Magnus Offermanns; Sabrina Semper; Tanja Zseby (2024). Dataset of Publication "Malware Communication in Smart Factories: A Network Traffic Data Set" [Dataset]. http://doi.org/10.48436/vs6hv-1vs74
    Available download formats: zip
    Dataset updated
    Oct 18, 2024
    Dataset provided by
    TU Wien
    Authors
    Bernhard Brenner; Joachim Fabini; Magnus Offermanns; Sabrina Semper; Tanja Zseby
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Aug 11, 2024
    Description

    Machine learning-based intrusion detection requires suitable and realistic
    data sets for training and testing. However, data sets that originate from
    real networks are rare. Network data is considered privacy sensitive and the
    purposeful introduction of malicious traffic is usually not possible. In this
    paper we introduce a labeled data set captured at a smart factory located
    in Vienna, Austria during normal operation and during penetration tests with different
    attack types. The data set contains 173 GB of PCAP files, which represent 16 days (395 hours) of factory operation. It includes MQTT, OPC UA, and Modbus/TCP traffic. The captured malicious traffic originated
    from a professional penetration tester who performed two types of attacks: (a)
    aggressive attacks that are easier to detect and (b) stealthy attacks that are
    harder to detect. Our data set includes the raw PCAP files and extracted
    flow data. Labels for packets and flows indicate whether packets (or flows)
    originated from a specific attack or from benign communication. We describe
    the methodology for creating the data set, conduct an analysis of the data
    and provide detailed information about the recorded traffic itself. The data
    set is freely available to support reproducible research and the comparability
    of results in the area of intrusion detection in industrial networks.

    File description:

    a_day1, a_day2, s_day1, s_day2, tf_a and tf_s: Main data set, where files starting with "tf" are training files containing only benign, operational data and all other files are attack files containing both operational data and attack data.

    images.zip: Contains descriptive images about the data.

    extractions.zip: Contains extracted packets and flows in both labeled and unlabeled form.

    a_day_tuesday_dos.zip: additional day of attack traffic containing benign and attack data, including a DoS attack. This day is not labeled.
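
    The naming rule for the main data set can be expressed as a small lookup, sketched below. Only the "tf" prefix rule comes from the file description; the helper name `file_role` and the role strings are illustrative.

```python
def file_role(name):
    """Classify a main-data-set file by the naming rule in the file description."""
    if name.startswith("tf"):
        return "training (benign only)"
    return "attack (benign and attack traffic)"

main_files = ["a_day1", "a_day2", "s_day1", "s_day2", "tf_a", "tf_s"]
roles = {name: file_role(name) for name in main_files}
training_files = [name for name, role in roles.items() if role.startswith("training")]
```

    Keeping the split in code avoids accidentally training on files that contain attack traffic.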

  14. Data from: Cyber Threat Intelligent (CTI) dataset generated from public...

    • ieee-dataport.org
    Updated Jan 22, 2022
    Cite
    Daegeon Kim (2022). Cyber Threat Intelligent (CTI) dataset generated from public security reports and malware repositories [Dataset]. https://ieee-dataport.org/open-access/cyber-threat-intelligent-cti-dataset-generated-public-security-reports-and-malware
    Dataset updated
    Jan 22, 2022
    Authors
    Daegeon Kim
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SHA1

  15. D

    Data Encryption Market Report

    • promarketreports.com
    doc, pdf, ppt
    Updated Feb 19, 2025
    Cite
    Pro Market Reports (2025). Data Encryption Market Report [Dataset]. https://www.promarketreports.com/reports/data-encryption-market-9193
    Available download formats: doc, ppt, pdf
    Dataset updated
    Feb 19, 2025
    Dataset authored and provided by
    Pro Market Reports
    License

    https://www.promarketreports.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    Data Encryption Market Overview

    The global data encryption market is projected to register significant growth, with a market size of USD 14.5 billion in 2025 and a CAGR of 16% over the forecast period of 2025-2033. The increasing adoption of cloud computing and digital transformation initiatives is driving the demand for data encryption solutions to protect sensitive data from cyber threats. Additionally, industry regulations, such as GDPR and CCPA, are mandating that organizations implement data encryption measures, further fueling market growth.

    Market Drivers, Restraints, and Trends

    Key market drivers include rising cybersecurity threats, increasing data breaches, and the growing need for data privacy. The increasing adoption of IoT and mobile computing is also contributing to the need for data encryption. However, the high cost of implementation and the lack of skilled professionals can pose challenges to market growth. Notable market trends include the emergence of advanced encryption algorithms, such as quantum-safe cryptography, and the integration of encryption with AI and machine learning technologies. Regional factors, such as government regulations and technology adoption rates, also influence the market's growth trajectory.

    Recent developments include:

    • Apr. 11, 2023: Menlo Security, a leading provider of browser security solutions, published the results of the 10th Annual Cyberthreat Defense Report (CDR) by the CyberEdge Group. The report, partially sponsored by Menlo Security, highlights the growing importance of browser isolation technologies to combat ransomware and other malicious threats. The research revealed that most ransomware attacks include threats beyond data encryption. According to the report, around 51% of respondents confirmed that they have been using at least one type of browser or Internet isolation to protect their organizational data, while another 40% are about to deploy data encryption technology. Furthermore, around 33% of respondents noted that browser isolation is a key cybersecurity strategy to protect against sophisticated attacks, including ransomware, phishing, and zero-day attacks.
    • Feb. 14, 2023: EnterpriseDB, a relational database provider, announced the addition of Transparent Data Encryption (TDE) based on open-source PostgreSQL to its databases. The new TDE feature will be shipped along with the firm's enterprise version of its database. TDE is a method of encrypting database files to ensure data security while at rest and in motion. The company added that most enterprises use TDE for compliance reasons, since it helps ensure that data on the hard drive and files on a backup are encrypted. Before the development of built-in TDE, enterprises relied on either full-disk encryption or stackable cryptographic file system encryption.
    • Jan. 25, 2023: Researchers from the Tokyo University of Science, Japan, announced the development of a faster and cheaper method for handling encrypted data while improving security. The new data encryption method combines the best of homomorphic encryption and secret sharing to handle encrypted data. Homomorphic encryption and secret sharing are key methods to compute on sensitive data while preserving privacy. Homomorphic encryption is computationally intensive and involves performing computations on encrypted data on a single server, while secret sharing is fast and computationally efficient. In this method, the encrypted data/secret input is divided and distributed across multiple servers, each performing a computation, such as multiplication, on its data. The results of the computations are then used to reconstruct the original data.
    • September 2022: Convergence Technology Solutions Corp., a supplier of software-enabled IT and cloud solutions, declared that it has obtained certification in Canada to sell and deploy IBM zSystems and LinuxONE.
    • November 2019: Penta Security Systems announced that it has been selected as a finalist for the 2020 SC Magazine Awards, which are given by SC Media and celebrated in the United States. As a result, MyDiamo from Penta Security has been named the Best Database Security Solution of 2020. Additionally, this will result in the expansion of common-level encryption and improve the open-source DBMS installation procedure.

    Potential restraints include:

    • Issues regarding security and data breaches
    • High implementation costs and complexity
    • Issues with data consistency and interoperability across different edge platforms
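
    The growth figures quoted in the overview (a 2025 market size and a CAGR over 2025-2033) imply a simple compound-growth projection. The sketch below only illustrates that arithmetic. It assumes the USD 14.5 billion figure is the 2025 base value and that the 16% rate compounds annually over the eight years to 2033; the report itself does not state a 2033 figure, and the function name `project_market_size` is hypothetical.

```python
def project_market_size(base_value, cagr, years):
    """Compound a base value forward by `years` at a constant annual rate.

    Standard CAGR relation: future = base * (1 + cagr) ** years.
    """
    return base_value * (1.0 + cagr) ** years


# Assumptions taken from the description text, not from the report's tables:
BASE_2025_USD_BN = 14.5   # stated market size in 2025
CAGR = 0.16               # stated compound annual growth rate
YEARS = 2033 - 2025       # forecast period 2025-2033

implied_2033 = project_market_size(BASE_2025_USD_BN, CAGR, YEARS)
print(f"Implied 2033 market size: USD {implied_2033:.1f} billion")
```

    The result is only as reliable as the two quoted inputs and should not be read as a figure from the report.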

  16. t

    Dataset of Publication "Malware Communication in Smart Factories: A Network...

    • researchdata.tuwien.at
    csv, txt, zip
    Updated Mar 31, 2025
    Cite
    Bernhard Brenner; Joachim Fabini; Magnus Offermanns; Sabrina Semper; Tanja Zseby (2025). Dataset of Publication "Malware Communication in Smart Factories: A Network Traffic Data Set" [Dataset]. http://doi.org/10.48436/ghdc6-45k78
    Available download formats: csv, zip, txt
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    TU Wien
    Authors
    Bernhard Brenner; Joachim Fabini; Magnus Offermanns; Sabrina Semper; Tanja Zseby
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Aug 11, 2024
    Description

    Note: If you use this dataset, please cite the following paper:

    Brenner, B., Fabini, J., Offermanns, M., Semper, S., & Zseby, T. (2024). Malware communication in smart factories: A network traffic data set. Computer Networks, 255, 110804.

    or in BibTeX:

    @article{brenner2024malware,
    title={Malware communication in smart factories: A network traffic data set},
    author={Brenner, Bernhard and Fabini, Joachim and Offermanns, Magnus and Semper, Sabrina and Zseby, Tanja},
    journal={Computer Networks},
    volume={255},
    pages={110804},
    year={2024},
    publisher={Elsevier}
    }

    Context and methodology

    Machine learning-based intrusion detection requires suitable and realistic data sets for training and testing. However, data sets that originate from real networks are rare. Network data is considered privacy-sensitive, and the purposeful introduction of malicious traffic is usually not possible.

    In this paper, we introduce a labeled data set captured at a smart factory located in Vienna, Austria, during normal operation and during penetration tests with different attack types. The data set contains 173 GB of PCAP files, representing 16 days (395 hours) of factory operation. It includes MQTT, OPC UA, and Modbus/TCP traffic.

    The captured malicious traffic originated from a professional penetration tester who performed two types of attacks:
    (a) Aggressive attacks that are easier to detect.
    (b) Stealthy attacks that are harder to detect.

    Our data set includes the raw PCAP files and extracted flow data. Labels for packets and flows indicate whether they originated from a specific attack or from benign communication.

    We describe the methodology for creating the dataset, conduct an analysis of the data, and provide detailed information about the recorded traffic itself. The dataset is freely available to support reproducible research and the comparability of results in the area of intrusion detection in industrial networks.

    Technical details

    • readme.txt
      • Information about the data collection, format, necessary software and versions to access it.
    • license.txt:
      • Licensing information.
    • a_day1, a_day2, s_day1, s_day2, tf_a, and tf_s:
      • Main dataset, where files starting with "tf" are training files containing only benign,
        operational data. All other files are attack files containing both operational data and
        attack data.
    • images.zip:
      • Contains descriptive images about the data.
    • extractions.zip:
      • Contains extracted packets and flows in both labeled and unlabeled form.
    • a_day_tuesday_dos.zip:
      • An additional day of attack traffic containing benign and attack data, including a DoS attack. This day is not labeled.
    • list_of_extracted_features:
      • A complete list of features we extracted from the PCAP Files. All flow files contain these features.
    • list_of_identified_protocols.csv:
      • A complete list of all protocols that we could identify within the PCAP files provided.
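
    The naming rule in the technical details (files starting with "tf" are benign-only training files, all other main files are attack files) can be applied mechanically when preparing a training and evaluation split. The following is a minimal sketch, assuming the main files are referred to by the bare names listed above; the helper name `split_main_files` is hypothetical, and readme.txt remains the authoritative description of the files.

```python
def split_main_files(names):
    """Split main data set file names into (training, attack) lists.

    Per the file description, names starting with "tf" hold only benign,
    operational data; every other main file mixes operational and attack data.
    """
    training = [n for n in names if n.startswith("tf")]
    attack = [n for n in names if not n.startswith("tf")]
    return training, attack


# File names exactly as listed in the technical details.
main_files = ["a_day1", "a_day2", "s_day1", "s_day2", "tf_a", "tf_s"]
training, attack = split_main_files(main_files)
print("training:", training)
print("attack:", attack)
```
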
  17. D

    Database Security Audits Services Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Apr 25, 2025
    Cite
    Data Insights Market (2025). Database Security Audits Services Report [Dataset]. https://www.datainsightsmarket.com/reports/database-security-audits-services-1419617
    Available download formats: pdf, ppt, doc
    Dataset updated
    Apr 25, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Database Security Audits Services market is experiencing robust growth, driven by the increasing reliance on databases across various industries and the escalating threat landscape. The market's expansion is fueled by several key factors. First, stringent data privacy regulations like GDPR and CCPA are compelling organizations to prioritize database security and conduct regular audits to ensure compliance. Second, the rising frequency and sophistication of cyberattacks targeting databases, including ransomware and data breaches, are prompting proactive security measures, including comprehensive audits. Third, the shift towards cloud-based databases introduces new security challenges and necessitates specialized audit services to address vulnerabilities inherent in cloud environments.

    The market is segmented by application (Financial, Medical, Telecom, Government, Manufacturing, Others) and type (Cloud-based, On-premise), with cloud-based services witnessing faster adoption due to the expanding cloud computing market. North America and Europe currently hold significant market share, but regions like Asia-Pacific are exhibiting rapid growth potential owing to increasing digitalization and adoption of advanced technologies. Major players are investing in innovative solutions and expanding their service portfolios to cater to diverse client needs, fostering competition and driving market evolution. While the market faces restraints like high implementation costs and a shortage of skilled professionals, the overall growth trajectory remains positive, propelled by the escalating demand for robust database security and compliance.

    The forecast period (2025-2033) anticipates continued expansion, potentially exceeding a compound annual growth rate (CAGR) of 15%. This projection is based on several factors. First, the ongoing digital transformation across industries will lead to increased reliance on databases and, subsequently, heightened demand for security audits. Second, the continuous evolution of cyber threats will necessitate more frequent and comprehensive audits, further boosting market growth. Third, the market will benefit from technological advancements in database security tools and methodologies, enabling more efficient and effective audits. However, challenges remain, particularly in addressing the skill gap and ensuring the affordability of these services for smaller organizations. Nevertheless, the long-term outlook for the Database Security Audits Services market remains strongly positive, with significant opportunities for market expansion and innovation.

  18. h

    EMBER2024

    • huggingface.co
    Cite
    Robert J. Joyce, EMBER2024 [Dataset]. https://huggingface.co/datasets/joyce8/EMBER2024
    Authors
    Robert J. Joyce
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    EMBER2024 Dataset

    EMBER2024 is an update to the EMBER2017 and EMBER2018 datasets. It includes raw features and labels for 3.2 million malicious and benign files from 6 different file types (Win32, Win64, .NET, APK, ELF, and PDF). EMBER2024 is meant to allow researchers to explore a variety of common malware analysis classification tasks. The dataset includes 7 types of labels and tags that support malicious/benign detection, malware family classification, malware behavior prediction… See the full description on the dataset page: https://huggingface.co/datasets/joyce8/EMBER2024.

  19. h

    cosoco-image-dataset

    • huggingface.co
    Updated May 28, 2025
    Cite
    K3Y Ltd (2025). cosoco-image-dataset [Dataset]. http://doi.org/10.57967/hf/5853
    Dataset updated
    May 28, 2025
    Dataset authored and provided by
    K3Y Ltd
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    COSOCO: Compromised Software Containers Image Dataset

    Paper: Malware Detection in Docker Containers: An Image is Worth a Thousand Logs
    Dataset Documentation: COSOCO Dataset Documentation

    Dataset Description

    COSOCO (Compromised Software Containers) is a synthetic dataset of 3364 images representing benign and malware-compromised software containers. Each image in the dataset represents a dockerized software container that has been converted to an image using common… See the full description on the dataset page: https://huggingface.co/datasets/k3ylabs/cosoco-image-dataset.

  20. Dataset used for training IoT C&C classifier

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Mar 31, 2022
    Cite
    Daniel Uhříček; Karel Hynek; Tomáš Čejka; Dušan Kolář (2022). Dataset used for training IoT C&C classifier [Dataset]. http://doi.org/10.5281/zenodo.6396923
    Available download formats: zip
    Dataset updated
    Mar 31, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Daniel Uhříček; Karel Hynek; Tomáš Čejka; Dušan Kolář
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset was used for training the IoT C&C classifier. It is provided in the form of extended bidirectional flow data. The flow data were generated by ipfixprobe flow exporter and converted into CSV files. Apart from traditional flow information (IP addresses, ports, amount of transferred data), ipfixprobe was set with default timeouts (5 minutes active, 30 s inactive) to generate per-packet information for the first 30 packets. The flow records were then aggregated into 5-minute intervals - when the flow was split due to inactivity, the aggregator then stitched the flow back into a single one.
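
    The aggregation step described above (flow records merged into 5-minute intervals, with flows that were split by the inactive timeout stitched back together) can be sketched as follows. This is an illustration only, not the authors' aggregator: the `FlowRecord` type, the 5-tuple flow key, and the choice to bucket a record by the window of its first packet are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FlowRecord:
    key: tuple        # assumed flow key: (src_ip, dst_ip, src_port, dst_port, protocol)
    first: datetime   # timestamp of the first packet
    last: datetime    # timestamp of the last packet
    n_bytes: int
    packets: int
    count: int = 1    # number of merged flow records (cf. the COUNT column)


def window_start(ts):
    """Floor a timestamp to the start of its 5-minute window."""
    return ts - timedelta(minutes=ts.minute % 5,
                          seconds=ts.second,
                          microseconds=ts.microsecond)


def aggregate(records):
    """Merge records that share a flow key and a 5-minute window."""
    merged = {}
    for r in sorted(records, key=lambda rec: rec.first):
        slot = (r.key, window_start(r.first))
        if slot not in merged:
            # Copy the record so the caller's input is left untouched.
            merged[slot] = FlowRecord(r.key, r.first, r.last, r.n_bytes, r.packets, 1)
        else:
            m = merged[slot]
            m.last = max(m.last, r.last)
            m.n_bytes += r.n_bytes
            m.packets += r.packets
            m.count += 1
    return list(merged.values())
```

    With this scheme, a flow that the exporter split into two records after 30 s of inactivity ends up as a single record with `count == 2`, as long as both pieces start in the same window.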

    The column headers in provided CSV files stand for:

    Column Name               Description
    ipaddr DST_IP             Destination IP address
    ipaddr SRC_IP             Source IP address
    uint64 BYTES              The number of transmitted bytes from SRC->DST
    uint64 BYTES_REV          The number of transmitted bytes from DST->SRC
    time TIME_FIRST           Timestamp of the first packet in the flow in format YYYY-MM-DDTHH-MM-SS
    time TIME_LAST            Timestamp of the last packet in the flow in format YYYY-MM-DDTHH-MM-SS
    macaddr DST_MAC           Destination MAC address
    macaddr SRC_MAC           Source MAC address
    uint32 COUNT              Number of aggregated flow records
    uint32 PACKETS            The number of packets transmitted from Source to Destination
    uint32 PACKETS_REV        The number of packets transmitted from Destination to Source
    uint16 DST_PORT           Destination port
    uint16 SRC_PORT           Source port
    uint8 DIR_BIT_FIELD       Flag for distinguishing WAN(1)/LAN(0)
    uint8 PROTOCOL            The number of the transport protocol
    uint8 TCP_FLAGS           Logical OR across all TCP flags in the packets transmitted SRC->DST
    uint8 TCP_FLAGS_REV       Logical OR across all TCP flags in the packets transmitted DST->SRC
    int8* PPI_PKT_DIRECTIONS  Array with packets' directions: (1) SRC->DST, (-1) DST->SRC
    uint8* PPI_PKT_FLAGS      Array with packets' TCP flags
    uint16* PPI_PKT_LENGTHS   Array with packets' payload lengths
    time* PPI_PKT_TIMES       Array with packets' timestamps
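
    The TCP_FLAGS and TCP_FLAGS_REV columns are described as a logical OR over the flags of all packets in one direction. The sketch below shows how such a value relates to per-packet flags and directions, and how to decode it into flag names. It assumes the per-packet values have already been parsed into Python lists of integers (the on-disk serialization of the PPI arrays is not specified here) and uses the standard TCP header bit positions; the function names are hypothetical.

```python
# Standard TCP header flag bits (low six bits of the flags byte).
TCP_FLAG_BITS = {
    "FIN": 0x01,
    "SYN": 0x02,
    "RST": 0x04,
    "PSH": 0x08,
    "ACK": 0x10,
    "URG": 0x20,
}


def split_flags(directions, flags):
    """OR per-packet TCP flags separately for each direction.

    directions: 1 for SRC->DST, -1 for DST->SRC (as in PPI_PKT_DIRECTIONS)
    flags:      per-packet TCP flag bytes (as in PPI_PKT_FLAGS)
    Returns (forward_flags, reverse_flags).
    """
    forward, reverse = 0, 0
    for direction, value in zip(directions, flags):
        if direction == 1:
            forward |= value
        elif direction == -1:
            reverse |= value
    return forward, reverse


def flag_names(value):
    """Decode an OR-ed flag byte into the names of the set flags."""
    return [name for name, bit in TCP_FLAG_BITS.items() if value & bit]


# A hypothetical three-packet handshake: SYN, SYN+ACK, ACK.
fwd, rev = split_flags([1, -1, 1], [0x02, 0x12, 0x10])
print(flag_names(fwd), flag_names(rev))
```

    Because the per-packet arrays cover only the first 30 packets of a flow, values recomputed this way may differ from TCP_FLAGS and TCP_FLAGS_REV for longer flows.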

    Dataset consists of two parts: a benign part captured on the real ISP network and a malicious part captured in a lab environment.

    Benign part captured on the real ISP network
    This part was created by capturing packets at the metering points located at the perimeter of the CESNET2 network. The metering points monitor 100 Gbps backbone peering lines used by approximately half a million users. We performed packet filtering based on ports for the capture. The CESNET training capture was used as benign traffic in the C&C model training and testing pipeline to cover potential nuances and variability of benign data seen in the ISP-level network. Since we deal with data from the production network, we cannot guarantee the benign nature of all captured communication. However, we verified every IP address against the internal blocklist of the CESNET association and against external ones. We used the AbuseIPDB and URLhaus blocklists.

    Since we are dealing with real captures, the IP addresses and MAC addresses were anonymized.


    Malicious part created in a controlled lab environment
    From leaked source code, we picked one variant from each of the most prevalent client-server IoT botnet families: (1) Tsunami, (2) Gafgyt, (3) Mirai. Each implements a distinct communication protocol: Tsunami is an example of an IRC bot, Gafgyt uses a simple text-based protocol, and Mirai implements a custom binary protocol. Afterward, we prepared a virtualized testing environment.

    We deployed the malware in a controlled manner, filtering out its scanning and exploiting activities. The dataset covers the most notable C&C behavior. As previously recognized, the C&C communication consists of C&C heartbeat and bot commands. Thus, for each of the three prepared malware variants, we first imagine the malware running with no received commands. That includes the initiation of the TCP connection to the C&C server, which continues for one hour. Then, we imagine the malware receiving commands from its C&C server. The position of the command packets is chosen arbitrarily relative to the background heartbeat packets because, in a real-world scenario, the timing of the commands is tied to a random human action.


    Directory tree of provided dataset

    .
    ├── README.md
    ├── benign
    │  ├── AN_p20-21-25-143-3389.agg.head.csv
    │  ├── AN_p22.agg.head.csv
    │  ├── AN_p443.agg.head.csv
    │  ├── AN_p80.agg.head.csv
    │  └── AN_p8080.agg.head.csv
    └── cnc
      ├── kaiten
      │  ├── cnc.csv
      │  ├── command-01.csv
      │  ├── command-02.csv
      │  ├── command-03.csv
      │  ├── command-04.csv
      │  ├── command-05.csv
      │  ├── command-06.csv
      │  ├── command-07.csv
      │  └── command-08.csv
      ├── mirai
      │  ├── cnc.csv
      │  ├── command-01.csv
      │  ├── command-02.csv
      │  ├── command-03.csv
      │  ├── command-04.csv
      │  ├── command-05.csv
      │  ├── command-06.csv
      │  ├── command-07.csv
      │  └── command-08.csv
      └── qbot
        ├── cnc.csv
        ├── command-01.csv
        ├── command-02.csv
        ├── command-03.csv
        └── command-04.csv
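
    Given the directory layout above, class labels can be derived from each file's relative path. The sketch below is one possible way to do that. It assumes that `cnc.csv` in each family directory holds the heartbeat-only capture and that the `command-NN.csv` files hold captures with received commands; that reading follows the description of the malicious part but is not confirmed by it. The function name `label_for` is hypothetical.

```python
from pathlib import PurePosixPath


def label_for(rel_path):
    """Map a dataset-relative CSV path to (class, family, kind).

    benign/<file>.csv        -> ("benign", None, None)
    cnc/<family>/cnc.csv     -> ("malicious", family, "heartbeat")  # assumed meaning
    cnc/<family>/command-*.csv -> ("malicious", family, "command")  # assumed meaning
    """
    path = PurePosixPath(rel_path)
    parts = path.parts
    if len(parts) == 2 and parts[0] == "benign":
        return ("benign", None, None)
    if len(parts) == 3 and parts[0] == "cnc":
        family = parts[1]
        if path.stem == "cnc":
            return ("malicious", family, "heartbeat")
        if path.stem.startswith("command-"):
            return ("malicious", family, "command")
    raise ValueError(f"unrecognized dataset path: {rel_path}")


for p in ["benign/AN_p443.agg.head.csv",
          "cnc/mirai/command-03.csv",
          "cnc/qbot/cnc.csv"]:
    print(p, "->", label_for(p))
```

    Paths outside the two top-level directories, such as README.md, raise `ValueError` so that unexpected files are not silently labeled.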
    

    Acknowledgment
    This research was funded by the Ministry of Interior of the Czech Republic,
    grant No. VJ02010024: Flow-Based Encrypted Traffic Analysis and also by the
    Grant Agency of the CTU in Prague, grant No. SGS20/210/OHK3/3T/18 funded by
    the MEYS of the Czech Republic.

Angelo Oliveira (2019). Malware Analysis Datasets: Top-1000 PE Imports [Dataset]. https://ieee-dataport.org/open-access/malware-analysis-datasets-top-1000-pe-imports

Malware Analysis Datasets: Top-1000 PE Imports

Explore at:
7 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Nov 8, 2019
Authors
Angelo Oliveira
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This dataset is part of my PhD research on malware detection and classification using Deep Learning. It contains static analysis data: Top-1000 imported functions extracted from the 'pe_imports' elements of Cuckoo Sandbox reports. PE malware examples were downloaded from virusshare.com. PE goodware examples were downloaded from portableapps.com and from Windows 7 x86 directories.
