Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is part of my PhD research on malware detection and classification using Deep Learning. It contains static analysis data: Top-1000 imported functions extracted from the 'pe_imports' elements of Cuckoo Sandbox reports. PE malware examples were downloaded from virusshare.com. PE goodware examples were downloaded from portableapps.com and from Windows 7 x86 directories.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Ransomware, leveraging sophisticated encryption techniques, poses a significant threat by encrypting crucial data, thereby rendering it inaccessible. The proliferation of diverse ransomware variants has caused considerable harm to governments, corporations, and individual users alike. Despite the increasing prevalence of cyber threats, existing solutions often struggle with real-time detection and early identification of ransomware families. To address this challenge, we introduce FCG-RFD, a novel benchmark dataset featuring extensive Function Call Graphs (FCG) tailored for ransomware family detection. Given the constantly evolving nature of malware, antivirus scanners face ongoing challenges, necessitating access to recent and updated datasets. Our dataset comprises 8,095 samples sourced from reputable repositories including VirusSamples, Virusshare, VirusSign, the Zoo, and MalwareBazaar. Additionally, we include 8,020 normal files obtained from trusted sources such as the Microsoft Store and Softonic. Through FCG-RFD, we aim to facilitate more robust and timely detection of ransomware families, ultimately enhancing cybersecurity measures against this pervasive threat.
As of 2023, over 72 percent of businesses worldwide were affected by ransomware attacks. This figure represents an increase on the previous five years and was by far the highest figure reported. Overall, since 2018, more than half of the total survey respondents each year stated that their organizations had been victimized by ransomware.
Most targeted industries
In 2023, the healthcare industry in the United States was once again most targeted by ransomware attacks. This industry also suffers most data breaches as a consequence of cyberattacks. The critical manufacturing industry ranked second by the number of ransomware attacks, followed by the government facilities industry.
Ransomware in the manufacturing industry
The manufacturing industry, along with its subindustries, is constantly targeted by ransomware attacks, causing data loss, business disruptions, and reputational damage. Often, such cyberattacks are international and have a political intent. In 2023, compromised credentials were the leading cause of ransomware attacks in the manufacturing industry.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
"Mobile malware is malicious software that targets mobile phones or wireless-enabled Personal digital assistants (PDA), by causing the collapse of the system and loss or leakage of confidential information. As wireless phones and PDA networks have become more and more common and have grown in complexity, it has become increasingly difficult to ensure their safety and security against electronic attacks in the form of viruses or other malware."
Dataset consisting of feature vectors of 215 attributes extracted from 15,036 applications (5,560 malware apps from Drebin project and 9,476 benign apps). The dataset has been used to develop and evaluate multilevel classifier fusion approach for Android malware detection, published in the IEEE Transactions on Cybernetics paper 'DroidFusion: A Novel Multilevel Classifier Fusion Approach for Android Malware Detection. The supporting file contains the description of the feature vectors/attributes obtained via static code analysis of the Android apps.
Yerima, Suleiman (2018): Android malware dataset for machine learning 2. figshare. Dataset. https://doi.org/10.6084/m9.figshare.5854653.v1 Data Source - https://figshare.com/articles/dataset/Android_malware_dataset_for_machine_learning_2/5854653 Literature URL - https://ieeexplore.ieee.org/document/8245867
This study seeks to obtain data which will help to address machine learning based malware research gaps. The specific objective of this study is to build a benchmark dataset for Windows operating system API calls of various malware. This is the first study to undertake metamorphic malware to build sequential API calls. It is hoped that this research will contribute to a deeper understanding of how metamorphic malware change their behavior (i.e. API calls) by adding meaningless opcodes with their own dissembler/assembler parts.
https://www.gnu.org/licenses/gpl-3.0-standalone.htmlhttps://www.gnu.org/licenses/gpl-3.0-standalone.html
WinMET dataset contains the reports generated with CAPE sandbox after analyzing several malware samples. The reports are valid JSON files that contain the spawned processes, the sequence of WinAPI and system calls invoked by each process, their parameters, their return values, and OS accesed resources, amongst many others.
Please use this DOI reference that always points to the latest WinMET version: https://doi.org/10.5281/zenodo.12647555" target="_blank" rel="noopener">https://doi.org/10.5281/zenodo.12647555
This dataset was generated using the https://github.com/reverseame/MALVADA/tree/main" target="_blank" rel="noopener">MALVADA framework, which you can read more about in our publication https://doi.org/10.1016/j.softx.2025.102082" target="_blank" rel="noopener">https://doi.org/10.1016/j.softx.2025.102082. The article also provides insights about the contents of this dataset.
Razvan Raducu, Alain Villagrasa-Labrador, Ricardo J. Rodríguez, Pedro Álvarez, MALVADA: A framework for generating datasets of malware execution traces, SoftwareX, Volume 30, 2025, 102082, ISSN 2352-7110, https://doi.org/10.1016/j.softx.2025.102082.(https://www.sciencedirect.com/science/article/pii/S2352711025000494)
The 7z file is password protected. The password is: infected
.
Compressed size on disk: ~2.5GiB.
Decompressed size on disk: ~105GiB.
Total decompressed .json
files: 9889.
The name of each .json
file is irrelevant. It corresponds to its analysis ID.
cape_report_to_label_mapping.json
and avclass_report_to_label_mapping.json
contain the mappings of each report with its corresponding consensus label, sorted in descendent order (given the number of reports belonging to each label/family).
WinMET.7z
:If you use this dataset, cite it as follows:
TBA.
The following statistic (and many more) can be obtained by analyzing the WinMET dataset with the https://github.com/reverseame/MALVADA" target="_blank" rel="noopener">MALVADA framework.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is rooted in a study aimed at unveiling the origins and motivations behind the creation of malware repositories on GitHub. Our research embarks on an innovative journey to dissect the profiles and intentions of GitHub users who have been involved in this dubious activity.
Employing a robust methodology, we meticulously identified 14,000 GitHub users linked to malware repositories. By leveraging advanced large language model (LLM) analytics, we classified these individuals into distinct categories based on their perceived intent: 3,339 were deemed Malicious, 3,354 Likely Malicious, and 7,574 Benign, offering a nuanced perspective on the community behind these repositories.
Our analysis penetrates the veil of anonymity and obscurity often associated with these GitHub profiles, revealing stark contrasts in their characteristics. Malicious authors were found to typically possess sparse profiles focused on nefarious activities, while Benign authors presented well-rounded profiles, actively contributing to cybersecurity education and research. Those labeled as Likely Malicious exhibited a spectrum of engagement levels, underlining the complexity and diversity within this digital ecosystem.
We are offering two datasets in this paper. First, a list of malware repositories - we have collected and extended the malware repositories on the GitHub in 2022 following the original papers. Second, a csv file with the github users information with their maliciousness classfication label.
malware_repos.txt
Purpose: This file contains a curated list of GitHub repositories identified as containing malware. These repositories were identified following the methodology outlined in the research paper "SourceFinder: Finding Malware Source-Code from Publicly Available Repositories in GitHub."
Contents: The file is structured as a simple text file, with each line representing a unique repository in the format username/reponame. This format allows for easy identification and access to each repository on GitHub for further analysis or review.
Usage: The list serves as a critical resource for researchers and cybersecurity professionals interested in studying malware, understanding its distribution on platforms like GitHub, or developing defense mechanisms against such malicious content.
obfuscated_github_user_dataset.csv
Purpose: Accompanying the list of malware repositories, this CSV file contains detailed, albeit obfuscated, profile information of the GitHub users who authored these repositories. The obfuscation process has been applied to protect user privacy and comply with ethical standards, especially given the sensitive nature of associating individuals with potentially malicious activities.
Contents: The dataset includes several columns representing different aspects of user profiles, such as obfuscated identifiers (e.g., ID, login, name), contact information (e.g., email, blog), and GitHub-specific metrics (e.g., followers count, number of public repositories). Notably, sensitive information has been masked or replaced with generic placeholders to prevent user identification.
Usage: This dataset can be instrumental for researchers analyzing behaviors, patterns, or characteristics of users involved in creating malware repositories on GitHub. It provides a basis for statistical analysis, trend identification, or the development of predictive models, all while upholding the necessary ethical considerations.
https://cubig.ai/store/terms-of-servicehttps://cubig.ai/store/terms-of-service
1) Data introduction • Malwares dataset is a malware classifier dataset built from header field values of portable executable files.
2) Data utilization (1) Malwares data has characteristics that: • The source of the data is Mendeley data. The goal is to binary classify malicious code based on a total of 5184 samples. (2) Malwares data can be used to: • Cybersecurity: This dataset can be used to develop advanced cybersecurity tools that enhance protection against malicious software by detecting and classifying malware based on PE header analysis. • Malware analysis: By analyzing datasets, researchers can understand common patterns and features of malware PE headers and contribute to the broader field of malware research and defense strategies.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Cyber-attacks continue to evolve, increasing in frequency and sophistication where Artificial Intelligence (AI) is becoming essential in detecting modern malware. However, the accuracy of AI in malware detection is dependent on the quality of the features it is trained with. Static and dynamic analysis of malware is limited by the widespread use of obfuscation and anti-analysis techniques employed by malware authors, where if an analysis environment is detected the malware will hide its malicious behavior. However, Dynamic Binary Instrumentation (DBI) allows deep and precise control of the malware sample, thereby facilitating the extraction of authentic features from sophisticated and evasive malware. We developed Peekaboo, a DBI tool to defeat the anti-analysis techniques and extract authentic behavior from live malware samples. We collected 18,527 malware samples across ransomware, spyware, trojans, botnets, worms, Advanced Persistent Threats (APT) and post exploitation tools where every sample includes type, family, and variant information, for example Ransomware-WannaCry-SHA256. We also collected 1,973 benign software samples for analysis.
This dataset contains the results for each sample, that were run for up to 15 minutes, to observe not only the anti-analysis techniques used but also its complete behavior. For each malware sample, the network traffic, every opcode that is executed and every evasive technique that is used are captured.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset comprises 1,777 entries of dynamic-link libraries (DLLs), with 700 labeled as benign and the remaining as malicious. Each entry contains static metadata attributes including MD5 hash, name, timestamps, subsystem, import/export functions, imphash, entropy, and indicators like packer presence and overlay. These features are commonly used in static analysis pipelines for malware detection and classification. This dataset supports the development of static malware detection models, particularly useful where dynamic analysis is impractical due to evasion techniques or resource constraints.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Brenner, B., Fabini, J., Offermanns, M., Semper, S., & Zseby, T. (2024). Malware communication in smart factories: A network traffic data set. Computer Networks, 255, 110804. or in BibTeX: @article{brenner2024malware, title={Malware communication in smart factories: A network traffic data set}, author={Brenner, Bernhard and Fabini, Joachim and Offermanns, Magnus and Semper, Sabrina and Zseby, Tanja}, journal={Computer Networks}, volume={255}, pages={110804}, year={2024}, publisher={Elsevier}} Context and methodology Machine learning-based intrusion detection requires suitable and realistic data sets for training and testing. However, data sets that originate from real networks are rare. Network data is considered privacy-sensitive, and the purposeful introduction of malicious traffic is usually not possible. In this paper, we introduce a labeled data set captured at a smart factory located in Vienna, Austria, during normal operation and during penetration tests with different attack types. The data set contains 173 GB of PCAP files, representing 16 days (395 hours) of factory operation. It includes MQTT, OPC UA, and Modbus/TCP traffic. The captured malicious traffic originated from a professional penetration tester who performed two types of attacks:(a) Aggressive attacks that are easier to detect.(b) Stealthy attacks that are harder to detect. Our data set includes the raw PCAP files and extracted flow data. Labels for packets and flows indicate whether they originated from a specific attack or from benign communication. We describe the methodology for creating the dataset, conduct an analysis of the data, and provide detailed information about the recorded traffic itself. The dataset is freely available to support reproducible research and the comparability of results in the area of intrusion detection in industrial networks. Technical details readme.txt Information about the data collection, format, necessary software and versions to access it.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is engineered to propel the development of quantum-enhanced anomaly detection systems for cybersecurity, merging real-world network traffic data with the potential for simulated attack scenarios. It comprises two datasets—malicious and non-malicious—crafted to train ML models, leveraging quantum AI to identify subtle anomalies and mitigate cyber threats, particularly those resistant to classical detection methods. Derived from Wireshark captures of normal web browsing and attack simulations, it provides a crucial baseline for quantum machine learning (QML) models.
The dataset's strength lies in its fusion of traditional network attributes. These frequency features are paramount for QML algorithms to discern complex patterns indicative of malicious behavior. For instance, QML can identify minute deviations in source/destination frequency or unusual protocol usage, often missed by classical methods.
Column Descriptions:
No. (Record Number): Unique identifier. Time: Timestamp of activity. Source: Source device/IP. Source_Count: Source frequency. Destination: Destination device/IP. Destination_Count: Destination frequency. Protocol: Network protocol. Protocol_Count: Protocol frequency. Length: Packet size. Info: Contextual details.
Uniqueness of the Dataset:
• Two-Class Design: The dataset includes separate malicious and non-malicious traffic logs, essential for training ML models to differentiate between normal and attack patterns. • Frequency-Based Features: The inclusion of "Source_Count," "Destination_Count," and "Protocol_Count" significantly enhances analytical capabilities, allowing the detection of anomalies based on activity patterns. • Comprehensive Network Traffic Attributes: The dataset combines frequency features with standard network traffic attributes (Time, Source, Destination, Protocol, Length, Info), providing a holistic view of network activity. • Potential for Diverse Analysis: The combination of structured and semi-structured data (in the "Info" column) enables a wide range of analytical techniques, including time series analysis, machine learning, and natural language processing. • Cybersecurity Focus: Designed for cybersecurity threat prediction, it is valuable for researchers and practitioners in this domain. • Real-World and Simulated Attacks: The dataset includes both benign traffic and simulated attacks, making it ideal for testing security systems before deployment.
Conclusion:
This dataset, is a powerful tool for cybersecurity analysis. Its strength lies in its ability to establish a baseline and detect deviations, even subtle ones. The inclusion of malicious and non-malicious data enables precise model training for threat detection. It is vital for behavioral analysis, DDoS detection, malware analysis, forensics, and training. This dataset empowers security professionals to develop advanced solutions, enhancing network security by revealing valuable insights from seemingly routine network traffic.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Machine learning-based intrusion detection requires suitable and realistic
data sets for training and testing. However, data sets that originate from
real networks are rare. Network data is considered privacy sensitive and the
purposeful introduction of malicious traffic is usually not possible. In this
paper we introduce a labeled data set captured at a smart factory located
in Vienna, Austria during normal operation and during penetration tests with different
attack types. The data set contains 173 GB of PCAP files, which represent 16 days (395 hours) of factory operation. It includes MQTT, OPC UA, and Modbus/TCP traffic. The captured malicious traffic was originated
by a professional penetration tester who performed two types of attacks: (a)
aggressive attacks that are easier to detect and (b) stealthy attacks that are
harder to detect. Our data set includes the raw PCAP files and extracted
flow data. Labels for packets and flows indicate whether packets (or flows)
originated from a specific attack or from benign communication. We describe
the methodology for creating the data set, conduct an analysis of the data
and provide detailed information about the recorded traffic itself. The data
set is freely available to support reproducible research and the comparability
of results in the area of intrusion detection in industrial networks.
File description:
a_day1, a_day2, s_day1, s_day2, tf_a and tf_s: Main data set, where files starting with "tf" are training files containing only benign, operational data and all other files are attack files containing both, operational data and attack data.
images.zip: Contains descriptive images about the data.
extractions.zip: Contains extracted packets, flows in both labeled and unlabeled form.
a_day_tuesday_dos.zip: additional day of attack traffic containing benign and attack data, including a DoS attack. This day is not labeled.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SHA1
https://www.promarketreports.com/privacy-policyhttps://www.promarketreports.com/privacy-policy
Data Encryption Market Overview The global data encryption market is projected to register significant growth, with a market size of USD 14.5 billion in 2025 and a CAGR of 16% over the forecast period of 2025-2033. The increasing adoption of cloud computing and digital transformation initiatives are driving the demand for data encryption solutions to protect sensitive data from cyber threats. Additionally, industry regulations, such as GDPR and CCPA, are mandating organizations to implement data encryption measures, further fueling market growth. Market Drivers, Restraints, and Trends Key market drivers include rising cybersecurity threats, increasing data breaches, and the growing need for data privacy. The increasing adoption of IoT and mobile computing is also contributing to the need for data encryption. However, the high cost of implementation and the lack of skilled professionals can pose challenges to market growth. Notable market trends include the emergence of advanced encryption algorithms, such as quantum-safe cryptography, and the integration of encryption with AI and machine learning technologies. Regional factors, such as government regulations and technology adoption rates, also influence the market's growth trajectory. Recent developments include: On Apr. 11, 2023, Menlo Security, a leading provider of browser security solutions, published the results of the 10th Annual Cyberthreat Defense Report (CDR) by the CyberEdge Group. The report, partially sponsored by Menlo Security, highlights the augmenting importance of browser isolation technologies to combat ransomware and other malicious threats., The research revealed that most ransomware attacks include threats beyond data encryption. According to the report, around 51% of respondents confirmed that they have been using at least one type of browser or Internet isolation to protect their organizational data, while another 40% are about to deploy data encryption technology. Furthermore, around 33% of respondents noted that browser isolation is a key cybersecurity strategy to protect against sophisticated attacks, including ransomware, phishing, and zero-day attacks., On Feb.14, 2023, EnterpriseDB, a relational database provider, announced the addition of Transparent Data Encryption (TDE) based on open-source PostgreSQL to its databases. The new TDE feature will be shipped along with the firm's enterprise version of its database. TDE is a method of encrypting database files to ensure data security while at rest and in motion., Adding that most enterprises use TDE for compliance issues helps ensure data encryption on the hard drive and files on a backup. Before the development of built-in TDE, enterprises relied on either full-disk encryption or stackable cryptographic file system encryption., On Jan.25, 2023, Researchers from the Tokyo University of Science, Japan, announced the development of a faster and cheaper method for handling encrypted data while improving security. The new data encryption method developed by Japanese researchers combines the best of homomorphic encryption and secret sharing to handle encrypted data., Homomorphic encryption and secret sharing are key methods to compute sensitive data while preserving privacy. Homomorphic encryption is computationally intensive and involves performing computational data encryption on a single server, while secret sharing is fast and computationally efficient., In this method, the encrypted data/secret input is divided and distributed across multiple servers, each performing a computation, such as multiplication, on its data. The results of the computations are then used to reconstruct the original data., September 2022: Convergence Technology Solutions Corp., a supplier of software-enabled IT and cloud solutions, declared that it has obtained certification in Canada to sell and deploy IBM zsystems and LinuxONE., November 2019: Penta Security Systems announced that it has been selected as a finalist for the 2020 SC Magazine Awards, which are given by SC Media and celebrated in the United States. As a result, MyDiamo from Penta Security has been named the Best Database Security Solution of 2020. Additionally, this will result in the expansion of common-level encryption and improve the open-source DBMS installation procedure.. Potential restraints include: ISSUE REGARDING SECURITY AND DATA BREACH 44, HIGH IMPLEMENTATION COSTS AND COMPLEXITY 45; ISSUE WITH RESPECT TO DATA CONSISTENCY AND INTEROPERABILITY ACROSS DIFFERENT EDGE PLATFORMS 45.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Note: If you use this dataset, please cite the following paper:
Brenner, B., Fabini, J., Offermanns, M., Semper, S., & Zseby, T. (2024). Malware communication in smart factories: A network traffic data set. Computer Networks, 255, 110804.
or in BibTeX:
@article{brenner2024malware,
title={Malware communication in smart factories: A network traffic data set},
author={Brenner, Bernhard and Fabini, Joachim and Offermanns, Magnus and Semper, Sabrina and Zseby, Tanja},
journal={Computer Networks},
volume={255},
pages={110804},
year={2024},
publisher={Elsevier}
}
Machine learning-based intrusion detection requires suitable and realistic data sets for training and testing. However, data sets that originate from real networks are rare. Network data is considered privacy-sensitive, and the purposeful introduction of malicious traffic is usually not possible.
In this paper, we introduce a labeled data set captured at a smart factory located in Vienna, Austria, during normal operation and during penetration tests with different attack types. The data set contains 173 GB of PCAP files, representing 16 days (395 hours) of factory operation. It includes MQTT, OPC UA, and Modbus/TCP traffic.
The captured malicious traffic originated from a professional penetration tester who performed two types of attacks:
(a) Aggressive attacks that are easier to detect.
(b) Stealthy attacks that are harder to detect.
Our data set includes the raw PCAP files and extracted flow data. Labels for packets and flows indicate whether they originated from a specific attack or from benign communication.
We describe the methodology for creating the dataset, conduct an analysis of the data, and provide detailed information about the recorded traffic itself. The dataset is freely available to support reproducible research and the comparability of results in the area of intrusion detection in industrial networks.
https://www.datainsightsmarket.com/privacy-policyhttps://www.datainsightsmarket.com/privacy-policy
The Database Security Audits Services market is experiencing robust growth, driven by the increasing reliance on databases across various industries and the escalating threat landscape. The market's expansion is fueled by several key factors. Firstly, stringent data privacy regulations like GDPR and CCPA are compelling organizations to prioritize database security and conduct regular audits to ensure compliance. Secondly, the rising frequency and sophistication of cyberattacks targeting databases, including ransomware and data breaches, are prompting proactive security measures, including comprehensive audits. Thirdly, the shift towards cloud-based databases introduces new security challenges and necessitates specialized audit services to address vulnerabilities inherent in cloud environments. The market is segmented by application (Financial, Medical, Telecom, Government, Manufacturing, Others) and type (Cloud-based, On-premise), with cloud-based services witnessing faster adoption due to the expanding cloud computing market. North America and Europe currently hold significant market share, but regions like Asia-Pacific are exhibiting rapid growth potential owing to increasing digitalization and adoption of advanced technologies. Major players are investing in innovative solutions and expanding their service portfolios to cater to diverse client needs, fostering competition and driving market evolution. While the market faces restraints like high implementation costs and a shortage of skilled professionals, the overall growth trajectory remains positive, propelled by the escalating demand for robust database security and compliance. The forecast period (2025-2033) anticipates continued expansion, potentially exceeding a compound annual growth rate (CAGR) of 15%. This optimistic projection is based on several factors. First, the ongoing digital transformation across industries will lead to increased reliance on databases and subsequently, heightened demand for security audits. Second, the continuous evolution of cyber threats will necessitate more frequent and comprehensive audits, further boosting market growth. Thirdly, the market will benefit from technological advancements in database security tools and methodologies, enabling more efficient and effective audits. However, challenges remain, particularly in addressing the skill gap and ensuring the affordability of these services for smaller organizations. Nevertheless, the long-term outlook for the Database Security Audits Services market remains strongly positive, with significant opportunities for market expansion and innovation.
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
EMBER2024 Dataset
EMBER2024 is an update to the EMBER2017 and EMBER2018 datasets. It includes raw features and labels for 3.2 million malicious and benign files from 6 different file types (Win32, Win64, .NET, APK, ELF, and PDF). EMBER2024 is meant to allow researchers to explore a variety of common malware analysis classification tasks. The dataset includes 7 types of labels and tags that support malicious/benign detection, malware family classification, malware behavior prediction… See the full description on the dataset page: https://huggingface.co/datasets/joyce8/EMBER2024.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
COSOCO: Compromised Software Containers Image Dataset
Paper: Malware Detection in Docker Containers: An Image is Worth a Thousand Logs Dataset Documentation: COSOCO Dataset Documentation
Dataset Description
COSOCO (Compromised Software Containers) is a synthetic dataset of 3364 images representing benign and malware-compromised software containers. Each image in the dataset represents a dockerized software container that has been converted to an image using common… See the full description on the dataset page: https://huggingface.co/datasets/k3ylabs/cosoco-image-dataset.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was used for training the IoT C&C classifier. It is provided in the form of extended bidirectional flow data. The flow data were generated by ipfixprobe flow exporter and converted into CSV files. Apart from traditional flow information (IP addresses, ports, amount of transferred data), ipfixprobe was set with default timeouts (5 minutes active, 30 s inactive) to generate per-packet information for the first 30 packets. The flow records were then aggregated into 5-minute intervals - when the flow was split due to inactivity, the aggregator then stitched the flow back into a single one.
The column headers in provided CSV files stand for:
Column Name | Description |
---|---|
ipaddr DST_IP | Source IP address |
ipaddr SRC_IP | Destination IP address |
uint64 BYTES | The number of transmitted bytes from SRC->DST |
uint64 BYTES_REV | The number of transmitted bytes from DST->SRC |
time TIME_FIRST | Timestamp of the first packet in the flow in format YYYY-MM-DDTHH-MM-SS |
time TIME_LAST | Timestamp of the last packet in the flow in format YYYY-MM-DDTHH-MM-SS |
macaddr DST_MAC | Destination MAC address |
macaddr SRC_MAC | Source MAC address |
uint32 COUNT | Number of aggregated flow records |
uint32 PACKETS | The number of packets transmitted from Source to Destination |
uint32 PACKETS_REV | The number of packets transmitted from Destination to Source |
uint16 DST_PORT | Destination port |
uint16 SRC_PORT | Source port |
uint8 DIR_BIT_FIELD | Flag for distinguishin WAN(1)/LAN(0) |
uint8 PROTOCOL | The number of transport protocol |
uint8 TCP_FLAGS | Logic OR across all TCP flags in the packets transmitted SRC->DST |
uint8 TCP_FLAGS_REV | Logic OR across all TCP flags in the packets transmitted DST->SRC |
int8* PPI_PKT_DIRECTIONS | Array with packets' direction (1)- SRC->DST, (-1)-DST->SRC |
uint8* PPI_PKT_FLAGS | Array with packets' TCP flags |
uint16* PPI_PKT_LENGTHS | Array with packets' payload lengths |
time* PPI_PKT_TIMES | Array with packets' timestamps |
Dataset consists of two parts: a benign part captured on the real ISP network and a malicious part captured in a lab environment.
Bening part captured on the real ISP network
This part was created by packet capturing on the metering points located at the perimeter of the CESNET2 network. The metering points monitor 100 Gbps backbone peering lines used by approximately half a million users. We performed packet filtering based on ports for the capture. The CESNET training capture was used as benign traffic in the C&C model training and testing pipeline to cover potential nuances and variability of benign data seen in the ISP-level network. Since we deal with data from the production network,
we cannot guarantee a benign nature of all captured communication. However, we verified every IP address according to the internal blocklist of the CESNET association and external ones. We used AbuseIPDB and URLhaus blocklists.
Since we are dealing with the real captures, the IP addresses, and MAC addresses
were anonymized.
Malicious part created in the controlled lab-created environment
From leaked source codes, we picked one variant from each of the most prevalent client-server IoT botnet families: (1) Tsunami, (2) Gafgyt, (3) Mirai. Each implements a distinct communication protocol; Tsunami is an example of an IRC bot; Gafgyt
uses a simple text-based protocol; Mirai implements a custom binary protocol. Afterward, we prepared virtualized testing environment.
We deployed the malware in a controlled manner, filtering out its scanning and exploiting activities. The dataset covers the most notable C&C behavior. As previously recognized, the C&C communication consists of C&C heartbeat and
bot commands. Thus, for each of the three prepared malware variants, we first imagine the malware running with no received commands. That includes the initiation of the TCP connection to the C&C server, which continues for one hour. And then, we imagine the malware receiving commands from its C&C server. The position of the command packets is chosen arbitrarily relative to the background heartbeat packets because, in the real-world scenario, the timing of the commands is tied to a random human action.
Directory tree of provided dataset
.
├── README.md
├── benign
│ ├── AN_p20-21-25-143-3389.agg.head.csv
│ ├── AN_p22.agg.head.csv
│ ├── AN_p443.agg.head.csv
│ ├── AN_p80.agg.head.csv
│ └── AN_p8080.agg.head.csv
└── cnc
├── kaiten
│ ├── cnc.csv
│ ├── command-01.csv
│ ├── command-02.csv
│ ├── command-03.csv
│ ├── command-04.csv
│ ├── command-05.csv
│ ├── command-06.csv
│ ├── command-07.csv
│ └── command-08.csv
├── mirai
│ ├── cnc.csv
│ ├── command-01.csv
│ ├── command-02.csv
│ ├── command-03.csv
│ ├── command-04.csv
│ ├── command-05.csv
│ ├── command-06.csv
│ ├── command-07.csv
│ └── command-08.csv
└── qbot
├── cnc.csv
├── command-01.csv
├── command-02.csv
├── command-03.csv
└── command-04.csv
Acknowledgment
This research was funded by the Ministry of Interior of the Czech Republic,
grant No. VJ02010024: Flow-Based Encrypted Traffic Analysis and also by the
Grant Agency of the CTU in Prague, grant No. SGS20/210/OHK3/3T/18 funded by
the MEYS of the Czech Republic.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is part of my PhD research on malware detection and classification using Deep Learning. It contains static analysis data: Top-1000 imported functions extracted from the 'pe_imports' elements of Cuckoo Sandbox reports. PE malware examples were downloaded from virusshare.com. PE goodware examples were downloaded from portableapps.com and from Windows 7 x86 directories.