19 datasets found

Businesses worldwide affected by ransomware 2018-2025
statista.com
Updated Aug 26, 2025
Cite
Statista (2025). Businesses worldwide affected by ransomware 2018-2025 [Dataset]. https://www.statista.com/statistics/204457/businesses-ransomware-attack-rate/
Explore at:
Dataset updated
Aug 26, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Area covered
Worldwide
Description
As of 2025, nearly 63 percent of businesses worldwide were affected by ransomware attacks. This figure represents a decrease on the previous year and was by far the lowest figure reported since 2020. Overall, since 2018, more than half of the total survey respondents each year stated that their organizations had been victimized by ransomware.

Most targeted industries

In 2024, the critical manufacturing industry in the United States was once again most targeted by ransomware attacks. Overall, organizations in this industry experienced 258 cyberattacks in the measured year. Healthcare and the public health sector ranked second, followed by government facilities, with 238 and 220 cyberattacks, respectively.

Ransomware in the manufacturing industry

The manufacturing industry, along with its subindustries, is constantly targeted by ransomware attacks, causing data loss, business disruptions, and reputational damage. Often, such cyberattacks are international and have a political intent. In 2024, exploited vulnerabilities were the leading cause of ransomware attacks in the manufacturing industry.
RDE-Dataset.zip: Ransomware Defense Empowered: Deep Learning for Real-Time...
figshare.com
zip
Updated Mar 24, 2024
Cite
Hassan Jalil Hadi (2024). RDE-Dataset.zip: Ransomware Defense Empowered: Deep Learning for Real-Time Family Identification with a Proprietary Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.25467826.v1
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.25467826.v1
Dataset updated
Mar 24, 2024
Dataset provided by
figshare
Authors
Hassan Jalil Hadi
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Ransomware, leveraging sophisticated encryption techniques, poses a significant threat by encrypting crucial data, thereby rendering it inaccessible. The proliferation of diverse ransomware variants has caused considerable harm to governments, corporations, and individual users alike. Despite the increasing prevalence of cyber threats, existing solutions often struggle with real-time detection and early identification of ransomware families. To address this challenge, we introduce FCG-RFD, a novel benchmark dataset featuring extensive Function Call Graphs (FCG) tailored for ransomware family detection. Given the constantly evolving nature of malware, antivirus scanners face ongoing challenges, necessitating access to recent and updated datasets. Our dataset comprises 8,095 samples sourced from reputable repositories including VirusSamples, Virusshare, VirusSign, the Zoo, and MalwareBazaar. Additionally, we include 8,020 normal files obtained from trusted sources such as the Microsoft Store and Softonic. Through FCG-RFD, we aim to facilitate more robust and timely detection of ransomware families, ultimately enhancing cybersecurity measures against this pervasive threat.
Malware Repositories and Their Authors on GitHub
data.niaid.nih.gov
zenodo.org
Updated Mar 11, 2024
Cite
Tania, Nishat Ara (2024). Malware Repositories and Their Authors on GitHub [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10806592
Explore at:
Dataset updated
Mar 11, 2024
Dataset provided by
Zhang, Qian
Masud, Md Rayhanul
Faloutsos, Michalis
Rokon, Md Omar Faruk
Tania, Nishat Ara
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is rooted in a study aimed at unveiling the origins and motivations behind the creation of malware repositories on GitHub. Our research embarks on an innovative journey to dissect the profiles and intentions of GitHub users who have been involved in this dubious activity.

Employing a robust methodology, we meticulously identified 14,000 GitHub users linked to malware repositories. By leveraging advanced large language model (LLM) analytics, we classified these individuals into distinct categories based on their perceived intent: 3,339 were deemed Malicious, 3,354 Likely Malicious, and 7,574 Benign, offering a nuanced perspective on the community behind these repositories.

Our analysis penetrates the veil of anonymity and obscurity often associated with these GitHub profiles, revealing stark contrasts in their characteristics. Malicious authors were found to typically possess sparse profiles focused on nefarious activities, while Benign authors presented well-rounded profiles, actively contributing to cybersecurity education and research. Those labeled as Likely Malicious exhibited a spectrum of engagement levels, underlining the complexity and diversity within this digital ecosystem.

We are offering two datasets in this paper. First, a list of malware repositories: we collected and extended the list of malware repositories on GitHub in 2022, following the original papers. Second, a CSV file with GitHub user information and each user's maliciousness classification label.

malware_repos.txt

Purpose: This file contains a curated list of GitHub repositories identified as containing malware. These repositories were identified following the methodology outlined in the research paper "SourceFinder: Finding Malware Source-Code from Publicly Available Repositories in GitHub."

Contents: The file is structured as a simple text file, with each line representing a unique repository in the format username/reponame. This format allows for easy identification and access to each repository on GitHub for further analysis or review.

Usage: The list serves as a critical resource for researchers and cybersecurity professionals interested in studying malware, understanding its distribution on platforms like GitHub, or developing defense mechanisms against such malicious content.
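The one-repository-per-line username/reponame layout described above can be read with a few lines of Python. This is a minimal sketch, not part of the dataset's own tooling: the `Repo` type, the `parse_repo_lines` helper, and the sample entries are hypothetical names introduced here for illustration.

```python
from typing import Iterable, Iterator, NamedTuple


class Repo(NamedTuple):
    """One entry of the list, split into its two parts."""
    owner: str
    name: str

    @property
    def url(self) -> str:
        # Standard GitHub repository URL built from owner and repository name.
        return f"https://github.com/{self.owner}/{self.name}"


def parse_repo_lines(lines: Iterable[str]) -> Iterator[Repo]:
    """Yield Repo entries from lines in username/reponame form, skipping blanks."""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        owner, sep, name = line.partition("/")
        if not sep or not owner or not name or "/" in name:
            raise ValueError(f"not in username/reponame form: {line!r}")
        yield Repo(owner, name)


# Hypothetical entries for illustration only; they are not taken from the dataset.
sample = ["alice/example-repo\n", "\n", "bob/another_repo\n"]
repos = list(parse_repo_lines(sample))

# With the real file, one would pass the open file handle instead:
#     with open("malware_repos.txt", encoding="utf-8") as fh:
#         repos = list(parse_repo_lines(fh))
```

Rejecting malformed lines instead of silently skipping them is a design choice of this sketch; a more lenient reader could log and continue.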

obfuscated_github_user_dataset.csv

Purpose: Accompanying the list of malware repositories, this CSV file contains detailed, albeit obfuscated, profile information of the GitHub users who authored these repositories. The obfuscation process has been applied to protect user privacy and comply with ethical standards, especially given the sensitive nature of associating individuals with potentially malicious activities.

Contents: The dataset includes several columns representing different aspects of user profiles, such as obfuscated identifiers (e.g., ID, login, name), contact information (e.g., email, blog), and GitHub-specific metrics (e.g., followers count, number of public repositories). Notably, sensitive information has been masked or replaced with generic placeholders to prevent user identification.

Usage: This dataset can be instrumental for researchers analyzing behaviors, patterns, or characteristics of users involved in creating malware repositories on GitHub. It provides a basis for statistical analysis, trend identification, or the development of predictive models, all while upholding the necessary ethical considerations.
Malware API Call Dataset
ieee-dataport.org
Updated May 18, 2022
Cite
Ferhat Ozgur Catak (2022). Malware API Call Dataset [Dataset]. https://ieee-dataport.org/open-access/malware-api-call-dataset
Explore at:
Dataset updated
May 18, 2022
Authors
Ferhat Ozgur Catak
Description
This study seeks to obtain data that will help address gaps in machine learning-based malware research. The specific objective of this study is to build a benchmark dataset of Windows operating system API calls made by various malware. This is the first study to use metamorphic malware to build sequential API calls. It is hoped that this research will contribute to a deeper understanding of how metamorphic malware change their behavior (i.e. API calls) by adding meaningless opcodes with their own disassembler/assembler parts.
Malware Dataset on Android Applications
zenodo.org
Updated May 10, 2025
Cite
Deris Stiawan (2025). Malware Dataset on Android Applications [Dataset]. http://doi.org/10.5281/zenodo.15377874
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.15377874
Dataset updated
May 10, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Deris Stiawan
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jun 29, 2023
Description
Android has become the most popular operating system on mobile devices, making it a prime target for threat actors in creating malware. The research conducted by the author aims to detect reverse TCP exploits in network traffic. The tools used are Metasploit for Android, Termux, PCAPdroid, Wireshark, OpenVPN, and Apktool in both terminal and application versions. The supporting devices for this research are hardware devices, namely a smartphone, VPS, Mikrotik Router, and laptop.
MH-1M Dataset
figshare.com
zip
Updated Feb 21, 2025
Cite
Hendrio Bragança; Vanderson Rocha; Joner Assolin; Diego Kreutz; Eduardo Feitosa (2025). MH-1M Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.28355897.v2
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.28355897.v2
Dataset updated
Feb 21, 2025
Dataset provided by
figshare
Authors
Hendrio Bragança; Vanderson Rocha; Joner Assolin; Diego Kreutz; Eduardo Feitosa
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The rapid and widespread increase of Android malware presents substantial obstacles to cybersecurity research. In order to revolutionize the field of malware research, we present the MH-1M dataset, which is a thorough compilation of 1,340,515 APK samples. This dataset encompasses a wide range of diverse attributes and metadata, offering a comprehensive perspective. The utilization of the VirusTotal API guarantees precise assessment of threats by amalgamating various detection techniques. Our research indicates that MH-1M is a highly current dataset that provides valuable insights into the changing nature of malware.

MH-1M consists of 23,247 features that cover a wide range of application behavior, from intents::accept to apicalls::landroid/window/splashscreenview.remove. The features are categorized into four primary classifications:

Feature Types	Values
APICalls	22,394
Intents	407
OPCodes	232
Permissions	214

The dataset is stored efficiently, utilizing a memory capacity of 29.0 GB, which showcases its substantial yet controllable magnitude. The dataset consists of 1,221,421 benign applications and 119,094 malware applications, ensuring a balanced representation for accurate malware detection and analysis.

The MH-1M repository also offers a wide variety of metadata from APKs, providing useful data into the development of malicious software over a period of more than ten years. The Android features include a wide variety of metadata, which includes SHA256 hashes, file names, package names, compilation APIs, and various other details. This GitHub repository contains over 400GB of valuable data, making it the largest and most comprehensive dataset available for advancing research and development in Android malware detection.
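The two feature names quoted in the description suggest a category::name pattern, and the stated counts can be cross-checked with simple arithmetic. The sketch below assumes that every feature name carries such a prefix and that the prefixes for the other two categories are spelled `opcodes` and `permissions`; neither assumption is confirmed by the description. The helper names are hypothetical.

```python
from collections import Counter
from typing import Iterable

# Feature counts per category, as stated in the description.
# The lowercase keys are an assumed spelling of the name prefixes.
STATED_COUNTS = {"apicalls": 22_394, "intents": 407, "opcodes": 232, "permissions": 214}

# Sample counts, as stated in the description.
BENIGN_APPS = 1_221_421
MALWARE_APPS = 119_094


def feature_category(name: str) -> str:
    """Return the lowercased part before '::' (assumed naming convention)."""
    category, sep, _ = name.partition("::")
    if not sep:
        raise ValueError(f"no '::' separator in feature name: {name!r}")
    return category.lower()


def count_by_category(names: Iterable[str]) -> Counter:
    """Count how many feature names fall into each category prefix."""
    return Counter(feature_category(n) for n in names)


total_features = sum(STATED_COUNTS.values())      # should match the stated 23,247
total_samples = BENIGN_APPS + MALWARE_APPS        # should match the stated 1,340,515
malware_share = MALWARE_APPS / total_samples      # fraction of samples that are malware

# The only two feature names quoted in the description.
quoted = ["intents::accept", "apicalls::landroid/window/splashscreenview.remove"]
quoted_counts = count_by_category(quoted)
```

The category totals add up to the stated 23,247 features and the class counts add up to the stated 1,340,515 samples; malware makes up roughly 9 percent of the samples, which may be worth keeping in mind when choosing evaluation metrics.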
Dataset of Publication "Malware Communication in Smart Factories: A Network...
b2find.eudat.eu
researchdata.tuwien.ac.at
Updated Aug 18, 2025
Cite
(2025). Dataset of Publication "Malware Communication in Smart Factories: A Network Traffic Data Set" [Dataset]. https://b2find.eudat.eu/dataset/5a44cf28-2ebc-5d4b-b163-238f939b5625
Explore at:
Dataset updated
Aug 18, 2025
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Machine learning-based intrusion detection requires suitable and realistic data sets for training and testing. However, data sets that originate from real networks are rare. Network data is considered privacy sensitive and the purposeful introduction of malicious traffic is usually not possible. In this paper we introduce a labeled data set captured at a smart factory located in Vienna, Austria during normal operation and during penetration tests with different attack types. The data set contains 173 GB of PCAP files, which represent 16 days (395 hours) of factory operation. It includes MQTT, OPC UA, and Modbus/TCP traffic. The captured malicious traffic originated from a professional penetration tester who performed two types of attacks: (a) aggressive attacks that are easier to detect and (b) stealthy attacks that are harder to detect. Our data set includes the raw PCAP files and extracted flow data. Labels for packets and flows indicate whether packets (or flows) originated from a specific attack or from benign communication. We describe the methodology for creating the data set, conduct an analysis of the data and provide detailed information about the recorded traffic itself. The data set is freely available to support reproducible research and the comparability of results in the area of intrusion detection in industrial networks.

File description:

a_day1, a_day2, s_day1, s_day2, tf_a and tf_s: Main data set, where files starting with "tf" are training files containing only benign, operational data and all other files are attack files containing both operational data and attack data.

images.zip: Contains descriptive images about the data.

extractions.zip: Contains extracted packets and flows in both labeled and unlabeled form.

a_day_tuesday_dos.zip: Additional day of attack traffic containing benign and attack data, including a DoS attack. This day is not labeled.
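The file description implies a simple naming rule: names starting with "tf" are benign-only training files, and the remaining main files mix operational and attack traffic. A minimal sketch of that rule follows; the `file_role` helper and its return strings are hypothetical and only encode what the description states.

```python
def file_role(name: str) -> str:
    """Classify a file of the data set by the naming rule in the description."""
    if name in ("images.zip", "extractions.zip"):
        # Auxiliary archives: descriptive images and extracted packets/flows.
        return "auxiliary"
    if name.startswith("a_day_tuesday_dos"):
        # Extra attack day that includes a DoS attack and carries no labels.
        return "attack, unlabeled"
    if name.startswith("tf"):
        # Training files: benign, operational data only.
        return "training, benign only"
    # All other main files contain both operational and attack data.
    return "attack, labeled"


main_files = ["a_day1", "a_day2", "s_day1", "s_day2", "tf_a", "tf_s"]
roles = {name: file_role(name) for name in main_files}
training_files = [name for name, role in roles.items() if role.startswith("training")]
```

A split built this way keeps the benign-only files for fitting a model of normal operation and reserves the mixed files for evaluation.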

Malware Detection in Network Traffic Data

kaggle.com

Updated Dec 26, 2023

Cite

Agung Pambudi (2023). Malware Detection in Network Traffic Data [Dataset]. http://doi.org/10.34740/kaggle/dsv/7285844

Explore at:

Unique identifier

https://doi.org/10.34740/kaggle/dsv/7285844

Dataset updated

Dec 26, 2023

Dataset provided by

Kaggle (http://kaggle.com/)

Authors

Agung Pambudi

License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

To cite the dataset please reference it as “Stratosphere Laboratory. A labeled dataset with malicious and benign IoT network traffic. January 22th. Agustin Parmisano, Sebastian Garcia, Maria Jose Erquiaga. https://www.stratosphereips.org/datasets-iot23”

This dataset includes labels that explain the linkages between flows connected with harmful or possibly malicious activity to provide network malware researchers and analysts with more thorough information. These labels were painstakingly created at the Stratosphere labs using malware capture analysis.

We present a concise explanation of the labels used for the identification of malicious flows, based on manual network analysis, below:

Attack: This label signifies the occurrence of an attack originating from an infected device directed towards another host. Any flow that endeavors to exploit a vulnerable service, discerned through payload and behavioral analysis, falls under this classification. Examples include brute force attempts on telnet logins or header-based command injections in GET requests.

Benign: The "Benign" label denotes connections where no suspicious or malicious activities have been detected.

C&C (Command and Control): This label indicates that the infected device has established a connection with a Command and Control server. This observation is rooted in the periodic nature of connections or activities such as binary downloads or the exchange of IRC-like or decoded commands.

DDoS (Distributed Denial of Service): "DDoS" is assigned when the infected device is actively involved in a Distributed Denial of Service attack, identifiable by the volume of flows directed towards a single IP address.

FileDownload: This label signifies that a file is being downloaded to the infected device. It is determined by examining connections with response bytes exceeding a specified threshold (typically 3KB or 5KB), often in conjunction with known suspicious destination ports or IPs associated with Command and Control servers.

HeartBeat: "HeartBeat" designates connections where packets serve the purpose of tracking the infected host by the Command and Control server. Such connections are identified through response bytes below a certain threshold (typically 1B) and exhibit periodic similarities. This is often associated with known suspicious destination ports or IPs linked to Command and Control servers.

Mirai: This label is applied when connections exhibit characteristics resembling those of the Mirai botnet, based on patterns consistent with common Mirai attack profiles.

Okiru: Similar to "Mirai," the "Okiru" label is assigned to connections displaying characteristics of the Okiru botnet. The parameters for this label are the same as for Mirai, but Okiru is a less prevalent botnet family.

PartOfAHorizontalPortScan: This label is employed when connections are involved in a horizontal port scan aimed at gathering information for potential subsequent attacks. The labeling decision hinges on patterns such as shared ports, similar transmitted byte counts, and multiple distinct destination IPs among the connections.

Torii: The "Torii" label is used when connections exhibit traits indicative of the Torii botnet, with labeling criteria similar to those used for Mirai, albeit in the context of a less common botnet family.
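Two of the labels above, FileDownload and HeartBeat, are described partly in terms of response-byte thresholds. The sketch below shows only that size-based part, using the thresholds quoted in the text (3KB and 1B). It is a simplified illustration, not the Stratosphere labeling procedure: the real labels also rely on destination ports, known C&C addresses, and periodicity, and the function name, the 1 KB = 1024 bytes conversion, and the strict boundary comparisons are assumptions made here.

```python
# Thresholds quoted in the label descriptions; the byte conversion is an assumption.
FILE_DOWNLOAD_MIN_BYTES = 3 * 1024  # "typically 3KB or 5KB"
HEARTBEAT_MAX_BYTES = 1             # "typically 1B"


def size_hint(resp_bytes: int,
              download_threshold: int = FILE_DOWNLOAD_MIN_BYTES,
              heartbeat_threshold: int = HEARTBEAT_MAX_BYTES) -> str:
    """Return a size-based hint for a flow, given its response byte count.

    "Exceeding" is read as strictly greater and "below" as strictly less;
    the text does not say how the boundary values are treated.
    """
    if resp_bytes < 0:
        raise ValueError("resp_bytes must be non-negative")
    if resp_bytes > download_threshold:
        return "FileDownload candidate"
    if resp_bytes < heartbeat_threshold:
        return "HeartBeat candidate"
    return "no size-based hint"


hints = [size_hint(n) for n in (0, 500, 5000)]
```

A hint of this kind would only narrow down candidates; confirming either label would still need the port, address, and periodicity checks mentioned in the descriptions.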

Field Name	Description	Type
ts	The timestamp of the connection event.	time
uid	A unique identifier for the connection.	string
id.orig_h	The source IP address.	addr
id.orig_p	The source port.	port
id.resp_h	The destination IP address.	addr
id.resp_p	The destination port.	port
proto	The network protocol used (e.g., 'tcp').	enum
service	The service associated with the connection.	string
duration	The duration of the connection.	interval
orig_bytes	The number of bytes sent from the source to the destination.	count
resp_bytes	The number of bytes sent from the destination to the source.	count
conn_state	The state of the connection.	string
local_orig	Indicates whether the connection is considered local or not.	bool
local_resp	Indicates whether the connection is considered...
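The field table above can be turned into a small typed parser. The sketch below covers only the fields up to conn_state and makes several assumptions that the description does not confirm: that each record is one delimited line with fields in table order, that the delimiter is a tab (it is a parameter, since the files may use another separator), that "-" marks an unset value as in Zeek logs, and that ts and duration are decimal seconds. The sample values are made up.

```python
from typing import Callable, Dict, List, Optional

# Field names copied from the table, in table order, up to conn_state.
FIELDS: List[str] = [
    "ts", "uid", "id.orig_h", "id.orig_p", "id.resp_h", "id.resp_p",
    "proto", "service", "duration", "orig_bytes", "resp_bytes", "conn_state",
]

# Converters for the numeric fields; everything else stays a string.
CONVERTERS: Dict[str, Callable[[str], object]] = {
    "ts": float,
    "id.orig_p": int,
    "id.resp_p": int,
    "duration": float,
    "orig_bytes": int,
    "resp_bytes": int,
}


def parse_conn_record(line: str, delimiter: str = "\t",
                      unset: str = "-") -> Dict[str, Optional[object]]:
    """Parse one delimited line into a dict keyed by field name.

    Unset values (assumed to be written as "-") become None.
    Columns beyond conn_state are ignored.
    """
    values = line.rstrip("\n").split(delimiter)
    if len(values) < len(FIELDS):
        raise ValueError(f"expected at least {len(FIELDS)} fields, got {len(values)}")
    record: Dict[str, Optional[object]] = {}
    for name, raw in zip(FIELDS, values):
        record[name] = None if raw == unset else CONVERTERS.get(name, str)(raw)
    return record


# Made-up record for illustration; addresses come from documentation ranges.
sample_line = "\t".join([
    "1545403816.5", "Cexample1", "192.0.2.10", "41040",
    "198.51.100.7", "23", "tcp", "-", "2.5", "0", "0", "S0",
])
record = parse_conn_record(sample_line)
```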

Dataset of Publication "Malware Communication in Smart Factories: A Network...
b2find.eudat.eu
Updated Apr 12, 2025
Cite
(2025). Dataset of Publication "Malware Communication in Smart Factories: A Network Traffic Data Set" [Dataset]. https://b2find.eudat.eu/dataset/a4f43cd9-25b1-5df3-a529-e430ae2fe323
Explore at:
Dataset updated
Apr 12, 2025
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Brenner, B., Fabini, J., Offermanns, M., Semper, S., & Zseby, T. (2024). Malware communication in smart factories: A network traffic data set. Computer Networks, 255, 110804.

or in BibTeX:

@article{brenner2024malware,
title={Malware communication in smart factories: A network traffic data set},
author={Brenner, Bernhard and Fabini, Joachim and Offermanns, Magnus and Semper, Sabrina and Zseby, Tanja},
journal={Computer Networks},
volume={255},
pages={110804},
year={2024},
publisher={Elsevier}
}

Context and methodology

Machine learning-based intrusion detection requires suitable and realistic data sets for training and testing. However, data sets that originate from real networks are rare. Network data is considered privacy-sensitive, and the purposeful introduction of malicious traffic is usually not possible. In this paper, we introduce a labeled data set captured at a smart factory located in Vienna, Austria, during normal operation and during penetration tests with different attack types. The data set contains 173 GB of PCAP files, representing 16 days (395 hours) of factory operation. It includes MQTT, OPC UA, and Modbus/TCP traffic. The captured malicious traffic originated from a professional penetration tester who performed two types of attacks:

(a) Aggressive attacks that are easier to detect.
(b) Stealthy attacks that are harder to detect.

Our data set includes the raw PCAP files and extracted flow data. Labels for packets and flows indicate whether they originated from a specific attack or from benign communication. We describe the methodology for creating the dataset, conduct an analysis of the data, and provide detailed information about the recorded traffic itself. The dataset is freely available to support reproducible research and the comparability of results in the area of intrusion detection in industrial networks.

Technical details

readme.txt: Information about the data collection, format, necessary software and versions to access it.
WinMET Dataset
zenodo.org
data.niaid.nih.gov
bin, json
Updated Sep 1, 2025
Cite
Razvan Raducu; Alain Villagrasa-Labrador; Ricardo J. Rodríguez; Pedro Álvarez (2025). WinMET Dataset [Dataset]. http://doi.org/10.5281/zenodo.16414116
Explore at:
Available download formats: bin, json
Unique identifier
https://doi.org/10.5281/zenodo.16414116
Dataset updated
Sep 1, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Razvan Raducu; Alain Villagrasa-Labrador; Ricardo J. Rodríguez; Pedro Álvarez
License
https://www.gnu.org/licenses/gpl-3.0-standalone.html
Description
WinMET (Windows Malware Execution Traces) Dataset

The WinMET dataset contains the execution traces generated with CAPE sandbox after analyzing several malware samples. The execution traces are valid JSON files that contain the spawned processes, the sequence of WinAPI and system calls invoked by each process, their parameters, their return values, and the OS resources accessed, amongst many others.

Please use this DOI reference that always points to the latest WinMET version: https://doi.org/10.5281/zenodo.12647555

How to use the dataset

All 7z files are password protected. The password is: infected.

Total execution traces: 31844 (.json files), split into 5 volumes.

Compressed size per volume: ~2.5 GB.

Uncompressed size per volume: ~154 GB.

Total compressed size: ~13 GB.

Total uncompressed size: ~750 GB.

WinMET Volumes:

WinMET_volume_1.7z - MD5: bf51181eafc8452090bb6ce9f47b6714

Total files: 6369

First file: 0000025c5ee1d6707e6dddfe2816f92d9d8d8bb7c84371c44529e8083109b0e5.json

Last file: 31c2e51efcbff0aa489aa6af1a48cf78f6a9febfb449a19d029f8cc8ebb4495f.json

WinMET_volume_2.7z - MD5: aee86b4591a46c69b0d027de80ff1011

Total files: 6369

First file: 31c4300fdba21e03ce5ad8ef340832493bcbf702a2ee897cf3a85fdd38dbf10c.json

Last file: 65a8b01babb2fcf3ed26a2236a606d7bc7d1f087749a455554b8ef7eddba56fc.json

WinMET_volume_3.7z - MD5: 996723774909bc2e6745382697317460

Total files: 6369

First file: 65a92f49f687b2f421397bbd3a6426b0b4914b896659c2d07a287e112a25939d.json

Last file: 996d4e0a67dcad433fa2049dca1defdd984d776fbb5bc5990c0114932be25066.json

WinMET_volume_4.7z - MD5: 4f5acbabeb9d24c96dadef71f56bd916

Total files: 6369

First file: 996fdb5a25f89426e241f02094474706fafd567fcc5980a07ac7a38efa8625ea.json

Last file: cdb5eed6579773d8fbdb13deb766664ba1c8cc01794790855e61e1564daf62f5.json

WinMET_volume_5.7z - MD5: b3d15c97990dd0dfb0d94e369f486025

Total files: 6368

First file: cdb7a65d6efc528d6084879e2a24cafb6869c84c45f076208ac437b3bdbdae94.json

Last file: ffff75b38c340f90d5fd3fbda5257f11caea5c8160daf26c9a29c04bb333a1c2.json
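Before extracting the volumes, the MD5 checksums listed above can be checked against the downloaded files. The following is a minimal sketch using Python's standard `hashlib`; the checksums are copied from the list, while the function names and the directory argument are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict, Union

# MD5 checksums copied from the volume list.
EXPECTED_MD5: Dict[str, str] = {
    "WinMET_volume_1.7z": "bf51181eafc8452090bb6ce9f47b6714",
    "WinMET_volume_2.7z": "aee86b4591a46c69b0d027de80ff1011",
    "WinMET_volume_3.7z": "996723774909bc2e6745382697317460",
    "WinMET_volume_4.7z": "4f5acbabeb9d24c96dadef71f56bd916",
    "WinMET_volume_5.7z": "b3d15c97990dd0dfb0d94e369f486025",
}


def md5_of_file(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_volumes(directory: Union[str, Path]) -> Dict[str, bool]:
    """Map each volume name to True if it exists and its MD5 matches."""
    results: Dict[str, bool] = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results
```

Calling `verify_volumes` on the download directory returns one boolean per volume, so a missing or corrupted archive shows up as `False` before any time is spent on extraction. Chunked reading keeps memory use flat for multi-gigabyte files.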

Additional files:

cape_report_to_label_mapping.json contains the mappings of each report with its corresponding label as assigned by the CAPE sandbox labeling algorithm, sorted in descending order (given the number of reports belonging to each label/family).

avclass_report_to_label_mapping.json contains the mappings of each report with its corresponding label as assigned by AVClass, sorted in descending order (given the number of reports belonging to each label/family).

reports_consensus_label.json contains both labels (CAPE and AVClass) for each execution trace.

Citation

If you use this dataset, cite it as follows:

Raducu, R., Villagrasa-Labrador, A., Rodríguez, R. J., & Álvarez, P. (2025). WinMET Dataset [Data set]. Zenodo. https://doi.org/10.5281/zenodo.12647555

BibTex:

@misc{WinMET_dataset,
author = {Raducu, Razvan and Villagrasa‑Labrador, Alain and Rodríguez, Ricardo J. and Álvarez, Pedro},
title = {{WinMET Dataset: Windows Malware Execution Traces}},
howpublished = {Zenodo, dataset},
year = {2025},
doi = {10.5281/zenodo.12647555},
url = {https://doi.org/10.5281/zenodo.12647555}
}

This dataset was generated using the MALVADA framework (https://github.com/reverseame/MALVADA/tree/main), which you can read more about in our publication https://doi.org/10.1016/j.softx.2025.102082. The article also provides insights about the contents of this dataset.

Razvan Raducu, Alain Villagrasa-Labrador, Ricardo J. Rodríguez, Pedro Álvarez, MALVADA: A framework for generating datasets of malware execution traces, SoftwareX, Volume 30, 2025, 102082, ISSN 2352-7110, https://doi.org/10.1016/j.softx.2025.102082. (https://www.sciencedirect.com/science/article/pii/S2352711025000494)

Statistics

The following statistics (and many more) can be obtained by analyzing the WinMET dataset with the MALVADA framework (https://github.com/reverseame/MALVADA).

Top 20 CAPE consensus labels (there are many more)*:

Dacic (3127 execution traces)

Padodor (1504 execution traces)

Redline (1455 execution traces)

Crifi (1101 execution traces)

Cosmu (836 execution traces)

Agenttesla (806 execution traces)

Amadey (601 execution traces)

Loki (551 execution traces)

Berbew (532 execution traces)

Qukart (496 execution traces)

Tedy (410 execution traces)

Mint (389 execution traces)

Metastealer (376 execution traces)

Smokeloader (349 execution traces)

Taskun (335 execution traces)

Virlock (313 execution traces)

Formbook (301 execution traces)

Strab (273 execution traces)

Agensla (235 execution traces)

Autorun (229 execution traces)

Top 20 AVClass consensus labels (there are many more)*:

Redline (4438 execution traces)

Vbclone (4023 execution traces)

Berbew (2794 execution traces)

Agenttesla (1201 execution traces)

Cosmu (899 execution traces)

Taskun (856 execution traces)

Disabler (799 execution traces)

Amadey (763 execution traces)

Gamarue (546 execution traces)

Noon (530 execution traces)

Strab (468 execution traces)

Snojan (433 execution traces)

Stop (399 execution traces)

Snakelogger (365 execution traces)

Virlock (326 execution traces)

Qbot (315 execution traces)

Equationdrug (270 execution traces)

Mokes (262 execution traces)

Blihan (261 execution traces)

Dofoil (254 execution traces)

There are 7256 execution traces with no CAPE label.

There are 1846 execution traces with no AVClass label.

There are 1241 execution traces with no label.

* The execution traces with no label are assigned the "(n/a)" family. We omitted it here.
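The unlabeled counts above can be combined with the total number of traces to estimate how many traces carry both labels. The worked arithmetic below assumes that "no label" means a trace has neither a CAPE nor an AVClass label, so those 1241 traces are counted in both of the other two figures; the description does not state this explicitly.

```python
# Figures stated in the dataset description.
TOTAL_TRACES = 31_844
NO_CAPE_LABEL = 7_256
NO_AVCLASS_LABEL = 1_846
NO_LABEL_AT_ALL = 1_241  # assumed to be the overlap of the two sets above

# Inclusion-exclusion: traces missing at least one of the two labels.
missing_at_least_one = NO_CAPE_LABEL + NO_AVCLASS_LABEL - NO_LABEL_AT_ALL

# Traces that have both a CAPE label and an AVClass label.
with_both_labels = TOTAL_TRACES - missing_at_least_one

# Traces labeled by exactly one of the two tools.
avclass_only = NO_CAPE_LABEL - NO_LABEL_AT_ALL
cape_only = NO_AVCLASS_LABEL - NO_LABEL_AT_ALL

# Cross-check of the total against the per-volume file counts.
files_across_volumes = 4 * 6_369 + 6_368
```

Under that assumption, 23,983 traces have both labels, 6,015 have only an AVClass label, 605 have only a CAPE label, and the per-volume file counts add up to the stated total of 31,844.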

Changelog

2025.07.25:

Dataset now contains ~32K execution traces.

Split new dataset into 5 volumes.

Updated TOP20 consensus labels.

Added reports_consensus_label.json.

Fixed Reline <-> Redline AVClass mappings https://github.com/malicialab/avclass/pull/48.

Version 2.0: Added cape and avclass label mappings.
DataSet for ICSE SEIP 25: Detecting Python Malware in the Software Supply...
zenodo.org
bin
Updated Dec 31, 2024
Cite
Ridwan Shariffdeen (2024). DataSet for ICSE SEIP 25: Detecting Python Malware in the Software Supply Chain with Program Analysis [Dataset]. http://doi.org/10.5281/zenodo.14580885
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.14580885
Dataset updated
Dec 31, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Ridwan Shariffdeen
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
* MalOSS: subset of malicious packages from MalOSS dataset [RQ1, RQ2, RQ4]
* BackStabber: subset of malicious packages from Backstabber's Knife Collection [RQ1, RQ2, RQ4]
* MalRegistry: subset of malicious packages from Python MalRegistry dataset [RQ1, RQ2, RQ4]
* Popular: a collection of top-100 most popular python packages from PyPI [RQ1, RQ2, RQ3, RQ4]
* Trusted: a collection of packages from trusted organizations hosted in PyPI [RQ1, RQ2, RQ3, RQ4]
* DataKund: a collection of newly identified malicious packages from PyPI [Case Study]
* Recent: a collection of packages that were recently (2024 Oct) added to PyPI [Macaron Case Study]
Database Security Audits Services Report
datainsightsmarket.com
doc, pdf, ppt
Updated Apr 25, 2025
Cite
Data Insights Market (2025). Database Security Audits Services Report [Dataset]. https://www.datainsightsmarket.com/reports/database-security-audits-services-1419617
Explore at:
Available download formats: pdf, ppt, doc
Dataset updated
Apr 25, 2025
Dataset authored and provided by
Data Insights Market
License
https://www.datainsightsmarket.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The Database Security Audits Services market is experiencing robust growth, driven by the increasing reliance on databases across various industries and the escalating threat landscape. The market's expansion is fueled by several key factors. Firstly, stringent data privacy regulations like GDPR and CCPA are compelling organizations to prioritize database security and conduct regular audits to ensure compliance. Secondly, the rising frequency and sophistication of cyberattacks targeting databases, including ransomware and data breaches, are prompting proactive security measures, including comprehensive audits. Thirdly, the shift towards cloud-based databases introduces new security challenges and necessitates specialized audit services to address vulnerabilities inherent in cloud environments. The market is segmented by application (Financial, Medical, Telecom, Government, Manufacturing, Others) and type (Cloud-based, On-premise), with cloud-based services witnessing faster adoption due to the expanding cloud computing market. North America and Europe currently hold significant market share, but regions like Asia-Pacific are exhibiting rapid growth potential owing to increasing digitalization and adoption of advanced technologies. Major players are investing in innovative solutions and expanding their service portfolios to cater to diverse client needs, fostering competition and driving market evolution. While the market faces restraints like high implementation costs and a shortage of skilled professionals, the overall growth trajectory remains positive, propelled by the escalating demand for robust database security and compliance. The forecast period (2025-2033) anticipates continued expansion, potentially exceeding a compound annual growth rate (CAGR) of 15%. This optimistic projection is based on several factors. 
First, the ongoing digital transformation across industries will lead to increased reliance on databases and, in turn, to heightened demand for security audits. Second, the continuous evolution of cyber threats will necessitate more frequent and comprehensive audits, further boosting market growth. Third, the market will benefit from technological advancements in database security tools and methodologies, enabling more efficient and effective audits. However, challenges remain, particularly in addressing the skill gap and ensuring the affordability of these services for smaller organizations. Nevertheless, the long-term outlook for the Database Security Audits Services market remains strongly positive, with significant opportunities for market expansion and innovation.
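To put the projected growth rate in perspective, the compounding arithmetic can be sketched as follows. This is only an illustration: the base value of 1.0 is a normalized placeholder rather than a market-size figure from the report, the function name is hypothetical, and 15% is the threshold the description says the CAGR may exceed.

```python
def compound_growth(base: float, cagr: float, years: int) -> float:
    """Value reached after compounding `base` at annual rate `cagr` for `years` years."""
    return base * (1.0 + cagr) ** years

# Assumption: 2025 is the base year and 2033 the final year, i.e. 8 compounding periods.
YEARS = 2033 - 2025

# Normalized base of 1.0, so the result reads directly as a growth multiple.
multiplier = compound_growth(1.0, 0.15, YEARS)
print(f"Growth multiple over {YEARS} years at 15% CAGR: {multiplier:.2f}x")
```

Under these assumptions, a market growing at exactly 15% per year would roughly triple over the forecast period.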
cosoco-image-dataset
huggingface.co
Updated May 28, 2025
Cite
K3Y Ltd (2025). cosoco-image-dataset [Dataset]. http://doi.org/10.57967/hf/5853
Explore at:
Unique identifier
https://doi.org/10.57967/hf/5853
Dataset updated
May 28, 2025
Dataset authored and provided by
K3Y Ltd
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
COSOCO: Compromised Software Containers Image Dataset

Paper: Malware Detection in Docker Containers: An Image is Worth a Thousand Logs
Dataset Documentation: COSOCO Dataset Documentation

Dataset Description

COSOCO (Compromised Software Containers) is a synthetic dataset of 3364 images representing benign and malware-compromised software containers. Each image in the dataset represents a dockerized software container that has been converted to an image using common… See the full description on the dataset page: https://huggingface.co/datasets/k3ylabs/cosoco-image-dataset.
Fast & Furious: Malware Detection Data Stream
kaggle.com
Updated Aug 12, 2022
Cite
Fabrício Ceschin (2022). Fast & Furious: Malware Detection Data Stream [Dataset]. https://www.kaggle.com/fabriciojoc/fast-furious-malware-data-stream
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Aug 12, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Fabrício Ceschin
Description
These datasets (DREBIN and AndroZoo) are contributions to the paper "Fast & Furious: On the Modelling of Malware Detection as an Evolving Data Stream". If you use them in your work, please cite our paper using the BibTeX below:

@article{CESCHIN2022118590,
  title = {Fast & Furious: On the modelling of malware detection as an evolving data stream},
  journal = {Expert Systems with Applications},
  pages = {118590},
  year = {2022},
  issn = {0957-4174},
  doi = {https://doi.org/10.1016/j.eswa.2022.118590},
  url = {https://www.sciencedirect.com/science/article/pii/S0957417422016463},
  author = {Fabrício Ceschin and Marcus Botacin and Heitor Murilo Gomes and Felipe Pinagé and Luiz S. Oliveira and André Grégio},
  keywords = {Machine learning, Data streams, Concept drift, Malware detection, Android}
}

Both datasets are saved in the parquet file format. To read them, use the following code:

import pandas as pd
# Requires pandas with a parquet engine (e.g. pyarrow) installed.
data_drebin = pd.read_parquet("drebin_drift.parquet.zip")
data_androzoo = pd.read_parquet("androbin.parquet.zip")

Note that these datasets are different from their original versions. The original DREBIN dataset does not contain the samples' timestamps, which we collected using VirusTotal API. Our version of the AndroZoo dataset is a subset of reports from their dataset previously available in their APK Analysis API, which was discontinued.

The DREBIN dataset, which is publicly available for download, is composed of ten textual attributes extracted from Android APKs (lists of API calls, permissions, URLs, etc.) and contains 123,453 benign and 5,560 malicious Android applications. Their distribution over time is shown below.

[Figure: DREBIN dataset distribution by month (https://i.imgur.com/IGKOMtE.png)]

The AndroZoo dataset is a subset of the Android application reports provided by the AndroZoo API. It is composed of eight textual attributes (resource names, source code classes and methods, manifest permissions, etc.) and contains 213,928 benign and 70,340 malicious applications. The distribution over time of our AndroZoo subset, which keeps the same goodware and malware distribution as the original dataset (composed of most of 10 million apps), is shown below.

[Figure: AndroZoo dataset distribution by month (https://i.imgur.com/8zxH3M4.png)]

The source code for all the experiments shown in the paper is also available here on Kaggle (note that the experiments using the AndroZoo dataset did not run in the Kaggle environment due to high memory usage).

Experiment 1 (The Best-Case Scenario for AVs - ML Cross-Validation)

Here we classify all samples together to compare which feature extraction algorithm is the best and report baseline results. We tested several parameters for both algorithms and fixed the vocabulary size at 100 for TF-IDF (top-100 features ordered by term frequency) and created projections with 100 dimensions for Word2Vec, resulting, in both cases, in 1,000 and 800 features per app for DREBIN and AndroZoo, respectively. All results are reported after 10-fold cross-validation, a method commonly used in ML to evaluate models because its results are less prone to bias (note that we train new classifiers and feature extractors at every iteration of the cross-validation process). In practice, folding the dataset implies that the AV company has a mixed view of both past and future threats, despite temporal effects, which is the best scenario for AV operation and ML evaluation.

Source Codes: DREBIN TFIDF | DREBIN W2V | ANDROZOO TFIDF | ANDROZOO W2V
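The feature counts and the fold structure described in Experiment 1 can be illustrated with a small standard-library sketch. It is not the authors' code: the function names are hypothetical, the permission strings are toy data, and it assumes that each textual attribute is vectorized separately into 100 dimensions and the vectors are concatenated, which is consistent with 10 attributes giving 1,000 features and 8 attributes giving 800. The folds here are contiguous and unshuffled, purely for readability.

```python
from collections import Counter

def top_k_vocabulary(documents, k):
    """Return the k most frequent terms across whitespace-tokenized documents."""
    counts = Counter(term for doc in documents for term in doc.split())
    return [term for term, _ in counts.most_common(k)]

def total_features(n_attributes, per_attribute_dim):
    """Assumed layout: one fixed-size vector per attribute, concatenated per app."""
    return n_attributes * per_attribute_dim

def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous, non-overlapping folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# Toy permission strings (hypothetical, not taken from DREBIN or AndroZoo).
permissions = ["INTERNET SEND_SMS", "INTERNET SEND_SMS", "INTERNET READ_CONTACTS"]
vocab = top_k_vocabulary(permissions, k=2)

drebin_features = total_features(n_attributes=10, per_attribute_dim=100)
androzoo_features = total_features(n_attributes=8, per_attribute_dim=100)

# Ten folds over 25 toy samples: every sample is used for testing exactly once.
folds = list(k_fold_indices(n_samples=25, k=10))
print(vocab, drebin_features, androzoo_features, len(folds))
```

In a faithful reproduction the vocabulary would be rebuilt on the training part of every fold, as the note about retraining feature extractors at each iteration indicates.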

Experiment 2 (On Classification Failure - Temporal Classification)

Although the currently used classification methodology helps reduce dataset biases, it would demand knowledge about future threats to work properly. AV companies train their classifiers using data from past samples and leverage them to predict future threats, expecting these to present the same characteristics as past ones. However, malware samples are very dynamic, so this strategy is the worst-case scenario for AV companies. To demonstrate the effects of predicting future threats based on past data, we split our datasets in two: we used the first half (oldest samples) to train our classifiers, which were then used to predict the newest samples from the second half. The results in the paper indicate a drop in all metrics when compared to the 10-fold experiment in both DREBIN and AndroZoo dataset...
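The temporal split described in Experiment 2 can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the helper name `temporal_split` and the four records are hypothetical, and the only behavior taken from the description is "train on the oldest half, test on the newest half".

```python
from datetime import date

def temporal_split(samples, timestamp_key):
    """Sort samples chronologically and return (oldest half, newest half)."""
    ordered = sorted(samples, key=timestamp_key)
    midpoint = len(ordered) // 2
    return ordered[:midpoint], ordered[midpoint:]

# Hypothetical records of the form (first-seen date, label).
samples = [
    (date(2012, 5, 1), "malware"),
    (date(2010, 3, 9), "benign"),
    (date(2013, 1, 20), "benign"),
    (date(2011, 7, 4), "malware"),
]

train, test = temporal_split(samples, timestamp_key=lambda s: s[0])

# No training sample is newer than any test sample, so nothing "from the future" leaks in.
print([s[0].year for s in train], [s[0].year for s in test])
```

Unlike k-fold cross-validation, this split never lets the classifier see samples newer than those it is asked to predict.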
Statistical values of the CNN model.
plos.figshare.com
xls
Updated Jan 19, 2024
Cite
Muhammad Aamir; Muhammad Waseem Iqbal; Mariam Nosheen; M. Usman Ashraf; Ahmad Shaf; Khalid Ali Almarhabi; Ahmed Mohammed Alghamdi; Adel A. Bahaddad (2024). Statistical values of the CNN model. [Dataset]. http://doi.org/10.1371/journal.pone.0296722.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0296722.t001
Dataset updated
Jan 19, 2024
Dataset provided by
PLOS ONE
Authors
Muhammad Aamir; Muhammad Waseem Iqbal; Mariam Nosheen; M. Usman Ashraf; Ahmad Shaf; Khalid Ali Almarhabi; Ahmed Mohammed Alghamdi; Adel A. Bahaddad
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Android is the most popular operating system on the latest mobile smart devices. Many Android applications have been developed for this operating system and have become an essential part of our daily lives. Unfortunately, this endless stream of applications has also brought different kinds of Android malware, which is installed through API calls, granted permissions, and extra package installations, and which undermines the system's security rules to harm the system. It is therefore essential to detect and classify Android malware to protect users' privacy and limit the damage. Much research has already been carried out on techniques for Android malware detection and classification. In this work, we present AMDDLmodel, a deep learning technique that consists of a convolutional neural network. The model works with different parameters, filter sizes, numbers of epochs, learning rates, and layers to detect and classify Android malware. The Drebin dataset, consisting of 215 features, was used to evaluate the model. The model achieves an accuracy of 99.92%. The other statistical values reported are precision, recall, and F1-score. AMDDLmodel introduces innovative deep learning for Android malware detection, enhancing accuracy and practical user security through inventive feature engineering and comprehensive performance evaluation. The AMDDLmodel shows the highest accuracy compared to the existing techniques.
State-of-the-art comparison with the existing techniques.
plos.figshare.com
datasetcatalog.nlm.nih.gov
xls
Updated Jan 19, 2024
Cite
Muhammad Aamir; Muhammad Waseem Iqbal; Mariam Nosheen; M. Usman Ashraf; Ahmad Shaf; Khalid Ali Almarhabi; Ahmed Mohammed Alghamdi; Adel A. Bahaddad (2024). State-of-the-art comparison with the existing techniques. [Dataset]. http://doi.org/10.1371/journal.pone.0296722.t002
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0296722.t002
Dataset updated
Jan 19, 2024
Dataset provided by
PLOS ONE
Authors
Muhammad Aamir; Muhammad Waseem Iqbal; Mariam Nosheen; M. Usman Ashraf; Ahmad Shaf; Khalid Ali Almarhabi; Ahmed Mohammed Alghamdi; Adel A. Bahaddad
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
State-of-the-art comparison with the existing techniques.
The confusion matrix.
plos.figshare.com
xls
Updated Sep 3, 2025
Cite
Ye Tian; Xin Dai; Zhijun Li; Hong Guo; Xiao Mao (2025). The confusion matrix. [Dataset]. http://doi.org/10.1371/journal.pone.0331574.t002
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0331574.t002
Dataset updated
Sep 3, 2025
Dataset provided by
PLOS ONE
Authors
Ye Tian; Xin Dai; Zhijun Li; Hong Guo; Xiao Mao
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
With the widespread adoption of internet technologies and email communication systems, the exponential growth in email usage has precipitated a corresponding surge in spam proliferation. These unsolicited messages not only consume users’ valuable time through information overload but also pose significant cybersecurity threats through malware distribution and phishing schemes, thereby jeopardizing both digital security and user experience. This emerging challenge underscores the critical importance of developing effective spam detection mechanisms as a cornerstone of modern cybersecurity infrastructure. Through empirical analysis of machine learning (ML) performance on publicly available spam datasets, we established that algorithmic ensemble methods consistently outperform individual models in detection accuracy. We propose an optimized stacking ensemble framework that strategically combines predictions from four heterogeneous base models (NBC, k-NN, LR, XGBoost) through meta-learner integration. Our methodology incorporates grid search cross-validation with hyperparameter space optimization, enabling systematic identification of parameter configurations that maximize detection performance. The enhanced model was rigorously evaluated using comprehensive metrics including accuracy (99.79%), precision, recall, and F1-score, demonstrating statistically significant improvements over both baseline models and existing solutions documented in the literature.
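The four metrics named in the description (accuracy, precision, recall, and F1-score) all follow from the entries of a binary confusion matrix. The sketch below shows the standard definitions; the function name and the counts are hypothetical and are not taken from the paper's tables.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for a spam filter: "positive" means the message is spam.
metrics = binary_metrics(tp=90, fp=10, fn=30, tn=870)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With imbalanced classes, as in this toy example, accuracy can look high while recall lags, which is why the description reports all four values rather than accuracy alone.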
Performance comparison before and after tuning.
plos.figshare.com
xls
Updated Sep 3, 2025
Cite
Ye Tian; Xin Dai; Zhijun Li; Hong Guo; Xiao Mao (2025). Performance comparison before and after tuning. [Dataset]. http://doi.org/10.1371/journal.pone.0331574.t004
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0331574.t004
Dataset updated
Sep 3, 2025
Dataset provided by
PLOS ONE
Authors
Ye Tian; Xin Dai; Zhijun Li; Hong Guo; Xiao Mao
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
With the widespread adoption of internet technologies and email communication systems, the exponential growth in email usage has precipitated a corresponding surge in spam proliferation. These unsolicited messages not only consume users’ valuable time through information overload but also pose significant cybersecurity threats through malware distribution and phishing schemes, thereby jeopardizing both digital security and user experience. This emerging challenge underscores the critical importance of developing effective spam detection mechanisms as a cornerstone of modern cybersecurity infrastructure. Through empirical analysis of machine learning (ML) performance on publicly available spam datasets, we established that algorithmic ensemble methods consistently outperform individual models in detection accuracy. We propose an optimized stacking ensemble framework that strategically combines predictions from four heterogeneous base models (NBC, k-NN, LR, XGBoost) through meta-learner integration. Our methodology incorporates grid search cross-validation with hyperparameter space optimization, enabling systematic identification of parameter configurations that maximize detection performance. The enhanced model was rigorously evaluated using comprehensive metrics including accuracy (99.79%), precision, recall, and F1-score, demonstrating statistically significant improvements over both baseline models and existing solutions documented in the literature.
Performance of the proposed stacking model.
plos.figshare.com
xls
Updated Sep 3, 2025
Cite
Ye Tian; Xin Dai; Zhijun Li; Hong Guo; Xiao Mao (2025). Performance of the proposed stacking model. [Dataset]. http://doi.org/10.1371/journal.pone.0331574.t003
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0331574.t003
Dataset updated
Sep 3, 2025
Dataset provided by
PLOS ONE
Authors
Ye Tian; Xin Dai; Zhijun Li; Hong Guo; Xiao Mao
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
With the widespread adoption of internet technologies and email communication systems, the exponential growth in email usage has precipitated a corresponding surge in spam proliferation. These unsolicited messages not only consume users’ valuable time through information overload but also pose significant cybersecurity threats through malware distribution and phishing schemes, thereby jeopardizing both digital security and user experience. This emerging challenge underscores the critical importance of developing effective spam detection mechanisms as a cornerstone of modern cybersecurity infrastructure. Through empirical analysis of machine learning (ML) performance on publicly available spam datasets, we established that algorithmic ensemble methods consistently outperform individual models in detection accuracy. We propose an optimized stacking ensemble framework that strategically combines predictions from four heterogeneous base models (NBC, k-NN, LR, XGBoost) through meta-learner integration. Our methodology incorporates grid search cross-validation with hyperparameter space optimization, enabling systematic identification of parameter configurations that maximize detection performance. The enhanced model was rigorously evaluated using comprehensive metrics including accuracy (99.79%), precision, recall, and F1-score, demonstrating statistically significant improvements over both baseline models and existing solutions documented in the literature.
Cite

Statista (2025). Businesses worldwide affected by ransomware 2018-2025 [Dataset]. https://www.statista.com/statistics/204457/businesses-ransomware-attack-rate/

Businesses worldwide affected by ransomware 2018-2025

Explore at:

24 scholarly articles cite this dataset (View in Google Scholar)