100+ datasets found

MalwareBazaar Malware Dataset (Sep - Oct 2025)
kaggle.com
zip
Updated Oct 9, 2025
Cite
José Reyes (2025). MalwareBazaar Malware Dataset (Sep - Oct 2025) [Dataset]. https://www.kaggle.com/datasets/arkreyes/malwarebazaar-malware-dataset-sep-oct-2025
Explore at:
Available download formats: zip (9415213 bytes)
Dataset updated
Oct 9, 2025
Authors
José Reyes
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
MalwareBazaar Malware Dataset.

Introduction.

This dataset is useful for practicing data analysis or data science skills. It contains information about indicators of compromise found in MalwareBazaar's database.

Description.

The dataset was retrieved from MalwareBazaar's database, full dump CSV. Curated, formatted and cleaned by myself.

Metadata removed (footer with unreadable information).

'date' formatted to datetime (better reading format).

Data filtered from the last 90 days.

Unnecessary columns with "NaN" data removed.
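
The cleaning steps listed above could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the author's actual script. It assumes that metadata and footer lines start with '#', that the date column is named 'date' and uses the format YYYY-MM-DD HH:MM:SS, and that "NaN" columns appear as empty strings; the sample rows and the other column names are hypothetical.

```python
import csv
import io
from datetime import datetime, timedelta


def clean_dump(csv_text, now, days=90, date_column="date"):
    """Apply the four cleaning steps to a CSV dump held in memory."""
    # Step 1: drop metadata/footer lines (assumed to start with '#') and blank lines.
    lines = [ln for ln in csv_text.splitlines() if ln.strip() and not ln.startswith("#")]
    rows = list(csv.DictReader(io.StringIO("\n".join(lines))))

    # Step 2: parse the date column into datetime objects (format is an assumption).
    for row in rows:
        row[date_column] = datetime.strptime(row[date_column], "%Y-%m-%d %H:%M:%S")

    # Step 3: keep only rows from the last `days` days.
    cutoff = now - timedelta(days=days)
    rows = [row for row in rows if row[date_column] >= cutoff]

    # Step 4: drop columns that are empty in every remaining row.
    if not rows:
        return rows
    keep = [col for col in rows[0] if any(row[col] not in ("", None) for row in rows)]
    return [{col: row[col] for col in keep} for row in rows]


# Hypothetical sample data, for illustration only.
SAMPLE = """\
# hypothetical metadata line
date,sha256_hash,signature,unused
2025-10-01 12:00:00,aaa,AgentTesla,
2025-01-01 08:30:00,bbb,Mirai,
# hypothetical footer line
"""

cleaned = clean_dump(SAMPLE, now=datetime(2025, 10, 9))
```

Passing `now` explicitly keeps the 90-day filter reproducible, since the cutoff then does not depend on when the script is run.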
Complete Antivirus Database
comodo.com
cav
Updated Dec 8, 2015
Cite
Comodo (2015). Complete Antivirus Database [Dataset]. https://www.comodo.com/home/internet-security/updates/vdp/database.php
Explore at:
Available download formats: cav
Dataset updated
Dec 8, 2015
Dataset authored and provided by
Comodo
License
https://www.comodo.com/home/internet-security/updates/vdp/database.php
Description
The complete Comodo Internet Security database is available for download...
Android Malware Detection Dataset
kaggle.com
zip
Updated Feb 24, 2024
Cite
Danny Revaldo (2024). Android Malware Detection Dataset [Dataset]. https://www.kaggle.com/datasets/dannyrevaldo/android-malware-detection-dataset
Explore at:
Available download formats: zip (123470 bytes)
Dataset updated
Feb 24, 2024
Authors
Danny Revaldo
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
The "Android Malware Detection Dataset" is a comprehensive collection of data designed to facilitate research in the detection and analysis of malware targeting the Android platform. This dataset encompasses a wide range of features extracted from Android applications, providing valuable insights into their behaviors and functionalities.

Key features of the dataset include:

Permission Features: Various permissions requested by Android applications, such as access to location (coarse and fine), camera, microphone, contacts, SMS, calendar, storage, and more.

System Features: Features related to system functions and controls, including access to device hardware (e.g., sensors, Bluetooth, NFC), system settings (e.g., changing network state, WiFi settings), and system services (e.g., managing accounts, managing documents).

Security-related Features: Features related to security functionalities and behaviors, encompassing permission management, authentication, encryption (e.g., cryptographic operations), and security policy enforcement.

Communication Features: Features related to communication functionalities, including sending and receiving SMS messages, making phone calls, accessing network state, and managing network connections.

Data Access Features: Features related to accessing and manipulating data, such as reading and writing to various data sources (e.g., external storage, databases), accessing user information (e.g., contacts, call logs), and accessing app-specific data.

App Lifecycle Features: Features related to managing the application lifecycle, including app installation and uninstallation, app startup and shutdown, app updates, and app permissions.

Device Control Features: Features related to controlling device behavior and settings, such as changing system settings, modifying audio settings, controlling device display, and managing device power.

Miscellaneous Features: Other miscellaneous features including accessing system logs, system services and components (e.g., camera, location manager), handling system events (e.g., incoming calls, boot completed), and interacting with system UI components.

This dataset provides researchers with a rich source of information to develop and evaluate effective malware detection and analysis techniques, ultimately contributing to the enhancement of mobile security on the Android platform.
Data from: Malware Finances and Operations: a Data-Driven Study of the Value...
data.europa.eu
data.niaid.nih.gov
+1more
unknown
Updated Oct 18, 2023
Cite
Zenodo (2023). Malware Finances and Operations: a Data-Driven Study of the Value Chain for Infections and Compromised Access [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-8047205?locale=bg
Explore at:
Available download formats: unknown (8866943)
Dataset updated
Oct 18, 2023
Dataset authored and provided by
Zenodo (http://zenodo.org/)
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The datasets demonstrate the malware economy and the value chain published in our paper, Malware Finances and Operations: a Data-Driven Study of the Value Chain for Infections and Compromised Access, at the 12th International Workshop on Cyber Crime (IWCC 2023), part of the ARES Conference, published by the International Conference Proceedings Series of the ACM ICPS.

Using the well-documented scripts, it is straightforward to reproduce our findings. It takes an estimated 1 hour of human time and 3 hours of computing time to duplicate our key findings from MalwareInfectionSet; around one hour with VictimAccessSet; and minutes to replicate the price calculations using AccountAccessSet. See the included README.md files and Python scripts.

We choose to represent each victim by a single JavaScript Object Notation (JSON) data file. Data sources provide sets of victim JSON data files from which we've extracted the essential information and omitted Personally Identifiable Information (PII). We collected, curated, and modelled three datasets, which we publish under the Creative Commons Attribution 4.0 International License.

1. MalwareInfectionSet

We discover (and, to the best of our knowledge, document scientifically for the first time) that malware networks appear to dump their data collections online. We collected these infostealer malware logs available for free. We utilise 245 malware log dumps from 2019 and 2020 originating from 14 malware networks. The dataset contains 1.8 million victim files, with a dataset size of 15 GB.

2. VictimAccessSet

We demonstrate how Infostealer malware networks sell access to infected victims. Genesis Market focuses on user-friendliness and continuous supply of compromised data. Marketplace listings include everything necessary to gain access to the victim's online accounts, including passwords and usernames, but also detailed collection of information which provides a clone of the victim's browser session. Indeed, Genesis Market simplifies the import of compromised victim authentication data into a web browser session. We measure the prices on Genesis Market and how compromised device prices are determined. We crawled the website between April 2019 and May 2022, collecting the web pages offering the resources for sale. The dataset contains 0.5 million victim files, with a dataset size of 3.5 GB.

3. AccountAccessSet

The Database marketplace operates inside the anonymous Tor network. Vendors offer their goods for sale, and customers can purchase them with Bitcoins. The marketplace sells online accounts, such as PayPal and Spotify, as well as private datasets, such as driver's licence photographs and tax forms. We then collect data from Database Market, where vendors sell online credentials, and investigate similarly. To build our dataset, we crawled the website between November 2021 and June 2022, collecting the web pages offering the credentials for sale. The dataset contains 33,896 victim files, with a dataset size of 400 MB.

Credits

Authors

Billy Bob Brumley (Tampere University, Tampere, Finland)
Juha Nurmi (Tampere University, Tampere, Finland)
Mikko Niemelä (Cyber Intelligence House, Singapore)

Funding

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme under project numbers 804476 (SCARE) and 952622 (SPIRS).

Alternative links to download: AccountAccessSet, MalwareInfectionSet, and VictimAccessSet.
Maldeb Dataset
dataverse.telkomuniversity.ac.id
ieee-dataport.org
+1more
png
Updated Mar 28, 2024
Cite
Telkom University Dataverse (2024). Maldeb Dataset [Dataset]. http://doi.org/10.34820/FK2/HQYV4X
Explore at:
Available download formats: png (multiple files, roughly 2 KB to 43 KB each)
Unique identifier
https://doi.org/10.34820/FK2/HQYV4X
Dataset updated
Mar 28, 2024
Dataset provided by
Telkom University Dataverse
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Dataset funded by
Directorate General of Higher Education, Ministry of Education and Culture Republic of Indonesia
Japanese Student Service Association (JASSO)
Description
Malware-benign image representation. The malware samples were collected from several malware repositories, including TekDefense, TheZoo, The Malware-Repo, Malware Database and Malware Bazar. The benign samples were collected from Microsoft Windows 10 and 11 system apps and several open source software repositories, including CNET, Sourceforge, FileForum and PortableFreeware. The samples were validated by scanning them with the VirusTotal malware scanning service. The samples were preprocessed by converting each binary into a grayscale image, following the rules from Nataraj (2011). Nataraj paper: https://vision.ece.ucsb.edu/research/signal-processing-malware-analysis. The Maldeb Dataset was collected by Debi Amalia Septiyani and Halimul Hakim Khairul. D. A. Septiyani, “Generating Grayscale and RGB Images dataset for windows PE malware using Gist Features extraction method,” Institut Teknologi Bandung, 2022, and Dani Agung Prastiyo, "Design and implementation of a machine learning-based malware classification system with an audio signal feature Analysis Approach," Institut Teknologi Bandung, 2023. The complete dataset can be accessed at https://ieee-dataport.org/documents/maldeb-dataset and https://github.com/julismail/Self-Supervised
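
The binary-to-grayscale conversion mentioned above can be illustrated with a short sketch. It is not the Maldeb authors' code: the width thresholds are the file-size table as commonly cited from Nataraj et al. (2011) and should be checked against the paper, and zero-padding the last row is a choice made here for simplicity.

```python
def image_width(num_bytes):
    """Pick an image width from the file size (thresholds in kB, assumed from Nataraj et al.)."""
    table = [(10, 32), (30, 64), (60, 128), (100, 256), (200, 384), (500, 512), (1000, 768)]
    size_kb = num_bytes / 1024
    for limit_kb, width in table:
        if size_kb < limit_kb:
            return width
    return 1024


def bytes_to_grayscale(data):
    """Reshape raw bytes into rows of 0-255 pixel values, zero-padding the last row."""
    width = image_width(len(data))
    padded = data + bytes(-len(data) % width)  # pad up to a multiple of the width
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


# Hypothetical 100-byte input standing in for a binary file's contents.
rows = bytes_to_grayscale(bytes(range(100)))
```

Each byte becomes one pixel intensity, so the resulting rows can be handed to any image library to write a PNG.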
Portable Executable Malware Data
kaggle.com
zip
Updated Mar 10, 2025
Cite
malwareTBugs (2025). Portable Executable Malware Data [Dataset]. https://www.kaggle.com/datasets/malwaretbugs/maldata
Explore at:
Available download formats: zip (23094201 bytes)
Dataset updated
Mar 10, 2025
Authors
malwareTBugs
License
http://opendatacommons.org/licenses/dbcl/1.0/
Description
Dataset

This dataset was created by malwareTBugs

Released under Database: Open Database, Contents: Database Contents

Contents
Malware Analysis Datasets: Top-1000 PE Imports
ieee-dataport.org
Updated Nov 8, 2019
Cite
Angelo Oliveira (2019). Malware Analysis Datasets: Top-1000 PE Imports [Dataset]. https://ieee-dataport.org/open-access/malware-analysis-datasets-top-1000-pe-imports
Explore at:
Dataset updated
Nov 8, 2019
Authors
Angelo Oliveira
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is part of my PhD research on malware detection and classification using Deep Learning. It contains static analysis data: Top-1000 imported functions extracted from the 'pe_imports' elements of Cuckoo Sandbox reports. PE malware examples were downloaded from virusshare.com. PE goodware examples were downloaded from portableapps.com and from Windows 7 x86 directories.
Quttera Website Malware Threat Encyclopedia
threats.quttera.com
json
Updated Nov 21, 2025
Cite
Quttera (2025). Quttera Website Malware Threat Encyclopedia [Dataset]. https://threats.quttera.com/
Explore at:
Available download formats: json
Dataset updated
Nov 21, 2025
Dataset authored and provided by
Quttera
Time period covered
2024 - Present
Description
Comprehensive database of website malware threats, vulnerabilities, and security risks detected by Quttera's malware scanner.
Kraken2 Metagenomic Virus Database
osti.gov
Updated Apr 23, 2020
Cite
Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21) (2020). Kraken2 Metagenomic Virus Database [Dataset]. http://doi.org/10.13139/OLCF/1615774
Explore at:
Unique identifier
https://doi.org/10.13139/OLCF/1615774
Dataset updated
Apr 23, 2020
Dataset provided by
Department of Energy Biological and Environmental Research Program
Office of Science (http://www.er.doe.gov/)
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
Description
The Database: Kraken2 [1] database built from a classification tree containing over 700k metagenomic viruses from JGI IMG/VR [2].

(1) Wood, D. E., Lu, J., & Langmead, B. (2019). Improved metagenomic analysis with Kraken 2. Genome Biol., 20(1), 1–13. doi: 10.1186/s13059-019-1891-0

(2) Paez-Espino D, Chen I-MA, Palaniappan K, Ratner A, Chu K, Szeto E, et al. IMG/VR: a database of cultured and uncultured DNA Viruses and retroviruses. Nucleic Acids Res. 2017;45:D457–65.

For Paper:

Title: A k-mer based approach for virus classification in metatranscriptomic and metagenomic samples identifies viral associations in the Populus phytobiome and autism brains

Abstract

Background: Viruses are an underrepresented taxa in the study and identification of microbiome constituents; however, they play an important role in health, microbiome regulation, and transfer of genetic material. Only a few thousand viruses have been isolated, sequenced, and assigned a taxonomy, which further limits the ability to identify and quantify viruses in the microbiome. Additionally, the vast diversity of viruses represents a challenge for classification, not only in constructing a viral taxonomy, but also in identifying similarities between a virus' genotype and its phenotype. However, the diversity of viral sequences can be leveraged to classify their sequences in metagenomic and metatranscriptomic samples.

Methods: To identify viruses in transcriptomic and genomic samples, we developed a dynamic programming algorithm for creating a classification tree out of 715,672 metagenome viruses. To create the classification tree, we clustered proportional similarity scores generated from the k-mer profiles of each of the metagenome viruses. We then integrated the viral classification tree with the NCBI taxonomy for use with ParaKraken, a metagenomic/transcriptomic classifier.

Results: To illustrate the breadth of our utility for classifying viruses with ParaKraken, we analyzed data from a plant metagenome study identifying the differences between two Populus genotypes in three different compartments and on a human metatranscriptome study identifying the differences between Autism Spectrum Disorder patients and controls in post mortem brain biopsies. In the Populus study, we identified genotype and compartment specific viral signatures, while in the Autism study we identified a significant increased abundance of eight viral sequences in Autism brain biopsies.

Conclusion: Viruses represent an important aspect of the microbiome. The ability to classify viruses represents the first step in being able to better understand their role in the microbiome. The viral classification method presented here allows for more complete identification of viral sequences for use in identifying associations between viruses and the host and viruses and other microbiome members.

Acknowledgements and Funding: This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. This research was also supported by the Plant-Microbe Interfaces Scientific Focus Area in the Genomic Science Program, the Office of Biological and Environmental Research (BER) in the U.S. Department of Energy Office of Science, and by the Department of Energy, Laboratory Directed Research and Development funding (ProjectID 8321), at the Oak Ridge National Laboratory. Oak Ridge National Laboratory is managed by UT-Battelle, LLC, for the US DOE under contract DE-AC05-00OR22725. This research used resources of the Compute and Data Environment for Science (CADES).
Malware Repositories and Their Authors on GitHub
data.niaid.nih.gov
zenodo.org
Updated Mar 11, 2024
Cite
Tania, Nishat Ara; Masud, Md Rayhanul; Rokon, Md Omar Faruk; Zhang, Qian; Faloutsos, Michalis (2024). Malware Repositories and Their Authors on GitHub [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10806592
Explore at:
Dataset updated
Mar 11, 2024
Dataset provided by
University of California, Riverside
Walmart Global Tech
Authors
Tania, Nishat Ara; Masud, Md Rayhanul; Rokon, Md Omar Faruk; Zhang, Qian; Faloutsos, Michalis
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is rooted in a study aimed at unveiling the origins and motivations behind the creation of malware repositories on GitHub. Our research embarks on an innovative journey to dissect the profiles and intentions of GitHub users who have been involved in this dubious activity.

Employing a robust methodology, we meticulously identified 14,000 GitHub users linked to malware repositories. By leveraging advanced large language model (LLM) analytics, we classified these individuals into distinct categories based on their perceived intent: 3,339 were deemed Malicious, 3,354 Likely Malicious, and 7,574 Benign, offering a nuanced perspective on the community behind these repositories.

Our analysis penetrates the veil of anonymity and obscurity often associated with these GitHub profiles, revealing stark contrasts in their characteristics. Malicious authors were found to typically possess sparse profiles focused on nefarious activities, while Benign authors presented well-rounded profiles, actively contributing to cybersecurity education and research. Those labeled as Likely Malicious exhibited a spectrum of engagement levels, underlining the complexity and diversity within this digital ecosystem.

We are offering two datasets in this paper. First, a list of malware repositories: we have collected and extended the malware repositories on GitHub in 2022 following the original papers. Second, a CSV file with the GitHub users' information and their maliciousness classification labels.

malware_repos.txt

Purpose: This file contains a curated list of GitHub repositories identified as containing malware. These repositories were identified following the methodology outlined in the research paper "SourceFinder: Finding Malware Source-Code from Publicly Available Repositories in GitHub."

Contents: The file is structured as a simple text file, with each line representing a unique repository in the format username/reponame. This format allows for easy identification and access to each repository on GitHub for further analysis or review.

Usage: The list serves as a critical resource for researchers and cybersecurity professionals interested in studying malware, understanding its distribution on platforms like GitHub, or developing defense mechanisms against such malicious content.
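
Reading the one-repository-per-line format described above might look like the following sketch. The function name and the sample entries are hypothetical, and the file itself is not reproduced here; the only assumption taken from the description is the username/reponame layout.

```python
def parse_repo_list(text):
    """Turn 'username/reponame' lines into (user, repo, url) tuples, skipping bad lines."""
    repos = []
    for line in text.splitlines():
        line = line.strip()
        user, sep, repo = line.partition("/")
        if not sep or not user or not repo or "/" in repo:
            continue  # not in the expected username/reponame form
        repos.append((user, repo, f"https://github.com/{user}/{repo}"))
    return repos


# Hypothetical entries, for illustration only.
SAMPLE_LIST = "alice/sample-repo\n\nnot a repo line\nbob/tool\n"
parsed = parse_repo_list(SAMPLE_LIST)
```

Skipping malformed lines rather than raising keeps the parser tolerant of blank lines or stray text in the list.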

obfuscated_github_user_dataset.csv

Purpose: Accompanying the list of malware repositories, this CSV file contains detailed, albeit obfuscated, profile information of the GitHub users who authored these repositories. The obfuscation process has been applied to protect user privacy and comply with ethical standards, especially given the sensitive nature of associating individuals with potentially malicious activities.

Contents: The dataset includes several columns representing different aspects of user profiles, such as obfuscated identifiers (e.g., ID, login, name), contact information (e.g., email, blog), and GitHub-specific metrics (e.g., followers count, number of public repositories). Notably, sensitive information has been masked or replaced with generic placeholders to prevent user identification.

Usage: This dataset can be instrumental for researchers analyzing behaviors, patterns, or characteristics of users involved in creating malware repositories on GitHub. It provides a basis for statistical analysis, trend identification, or the development of predictive models, all while upholding the necessary ethical considerations.
RNA Virus Database
rrid.site
dknet.org
+2more
Cite
RNA Virus Database [Dataset]. http://identifiers.org/RRID:SCR_007899
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_007899
Description
THIS RESOURCE IS NO LONGER IN SERVICE, documented August 19, 2016. It is a database and web application describing the genome organization and providing analytical tools for the 938 known species of RNA virus. It can identify submitted nucleotide sequences, can place them into multiple whole-genome alignments (in species where more than one isolate has been fully sequenced) and contains translated genome sequences for all species. It has been created for two main purposes: to facilitate the comparative analysis of RNA viruses and to become a hub for other, more specialised virus Web sites.
AI-powered malware simulation of a medical imaging database
scidb.cn
Updated Sep 2, 2025
Cite
Somaya_haiba (2025). AI-powered malware simulation of a medical imaging database [Dataset]. http://doi.org/10.57760/sciencedb.27227
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Unique identifier
https://doi.org/10.57760/sciencedb.27227
Dataset updated
Sep 2, 2025
Dataset provided by
Science Data Bank
Authors
Somaya_haiba
Description
The dataset comprises medical imaging data that demonstrate the presence or absence of illnesses. It is used to simulate AI-based malware modulation and is paired with malware-modulated counterparts. Tampered images are created on the fly from the benign dataset using three mechanisms:

Adversarial perturbations: changes to input data that can cause misclassification.

Patch-level content edits: copy-pasting or inpainting of small square regions (8–32 px) to simulate lesion insertion or removal.

Metadata-consistent rescaling: random resize and crop variance.

Each training batch is a duplicate of the original images.
IVDB - Influenza Virus Database
neuinfo.org
dknet.org
+1more
Updated Jan 29, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2022). IVDB - Influenza Virus Database [Dataset]. http://identifiers.org/RRID:SCR_013404
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_013404
Dataset updated
Jan 29, 2022
Description
IVDB hosts complete genome sequences of influenza A virus generated by BGI and curates all other published influenza virus sequences after expert annotations. For the convenience of efficient data utilization, our Q-Filter system classifies and ranks all nucleotide sequences into 7 categories according to sequence content and integrity. IVDB provides a series of tools and viewers for analyzing the viral genomes, genes, genetic polymorphisms and phylogenetic relationships comparatively. A searching system is developed for users to retrieve a combination of different data types by setting various search options. To facilitate analysis of the global viral transmission and evolution, the IV Sequence Distribution Tool (IVDT) is developed to display worldwide geographic distribution of the viral genotypes and to couple genomic data with epidemiological data. The BLAST, multiple sequence alignment tools and phylogenetic analysis tools were integrated for online data analysis. Furthermore, IVDB offers instant access to the pre-computed alignments and polymorphism analysis of influenza virus genes and proteins and presents the results by SNP distribution plots and minor allele distributions. IVDB aims to be a powerful information resource and an analysis workbench for scientists working on IV genetics, evolution, diagnostics, vaccine development, and drug design.

Malware Detection in Network Traffic Data

kaggle.com

zip

Updated Dec 26, 2023

Cite

Agung Pambudi (2023). Malware Detection in Network Traffic Data [Dataset]. https://www.kaggle.com/datasets/agungpambudi/network-malware-detection-connection-analysis

Explore at:

Available download formats: zip (755409206 bytes)

Dataset updated

Dec 26, 2023

Authors

Agung Pambudi

License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

To cite the dataset, please reference it as “Stratosphere Laboratory. A labeled dataset with malicious and benign IoT network traffic. January 22th. Agustin Parmisano, Sebastian Garcia, Maria Jose Erquiaga. https://www.stratosphereips.org/datasets-iot23”

This dataset includes labels that explain the linkages between flows connected with harmful or possibly malicious activity to provide network malware researchers and analysts with more thorough information. These labels were painstakingly created at the Stratosphere labs using malware capture analysis.

We present a concise explanation of the labels used for the identification of malicious flows, based on manual network analysis, below:

Attack: This label signifies the occurrence of an attack originating from an infected device directed towards another host. Any flow that endeavors to exploit a vulnerable service, discerned through payload and behavioral analysis, falls under this classification. Examples include brute force attempts on telnet logins or header-based command injections in GET requests.

Benign: The "Benign" label denotes connections where no suspicious or malicious activities have been detected.

C&C (Command and Control): This label indicates that the infected device has established a connection with a Command and Control server. This observation is rooted in the periodic nature of connections or activities such as binary downloads or the exchange of IRC-like or decoded commands.

DDoS (Distributed Denial of Service): "DDoS" is assigned when the infected device is actively involved in a Distributed Denial of Service attack, identifiable by the volume of flows directed towards a single IP address.

FileDownload: This label signifies that a file is being downloaded to the infected device. It is determined by examining connections with response bytes exceeding a specified threshold (typically 3KB or 5KB), often in conjunction with known suspicious destination ports or IPs associated with Command and Control servers.

HeartBeat: "HeartBeat" designates connections where packets serve the purpose of tracking the infected host by the Command and Control server. Such connections are identified through response bytes below a certain threshold (typically 1B) and exhibit periodic similarities. This is often associated with known suspicious destination ports or IPs linked to Command and Control servers.

Mirai: This label is applied when connections exhibit characteristics resembling those of the Mirai botnet, based on patterns consistent with common Mirai attack profiles.

Okiru: Similar to "Mirai," the "Okiru" label is assigned to connections displaying characteristics of the Okiru botnet. The parameters for this label are the same as for Mirai, but Okiru is a less prevalent botnet family.

PartOfAHorizontalPortScan: This label is employed when connections are involved in a horizontal port scan aimed at gathering information for potential subsequent attacks. The labeling decision hinges on patterns such as shared ports, similar transmitted byte counts, and multiple distinct destination IPs among the connections.

Torii: The "Torii" label is used when connections exhibit traits indicative of the Torii botnet, with labeling criteria similar to those used for Mirai, albeit in the context of a less common botnet family.
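
As a rough illustration of the pattern behind the PartOfAHorizontalPortScan label (one source contacting the same port on many distinct destination IPs), the sketch below groups flows by source and destination port. It is not the Stratosphere labeling procedure, which the description says was based on manual analysis; the `min_targets` threshold and the sample flows are hypothetical.

```python
from collections import defaultdict


def find_horizontal_scans(flows, min_targets=3):
    """Return {(source IP, destination port): distinct host count} for likely scans."""
    targets = defaultdict(set)
    for flow in flows:
        key = (flow["id.orig_h"], flow["id.resp_p"])
        targets[key].add(flow["id.resp_h"])
    # Keep only sources that reached the same port on enough different hosts.
    return {key: len(hosts) for key, hosts in targets.items() if len(hosts) >= min_targets}


# Hypothetical flows: one device probing port 23 on three hosts, plus one web request.
SAMPLE_FLOWS = [
    {"id.orig_h": "10.0.0.5", "id.resp_h": "10.0.0.10", "id.resp_p": 23},
    {"id.orig_h": "10.0.0.5", "id.resp_h": "10.0.0.11", "id.resp_p": 23},
    {"id.orig_h": "10.0.0.5", "id.resp_h": "10.0.0.12", "id.resp_p": 23},
    {"id.orig_h": "10.0.0.5", "id.resp_h": "10.0.0.20", "id.resp_p": 80},
]
scans = find_horizontal_scans(SAMPLE_FLOWS)
```

A real analysis would also compare transmitted byte counts across the grouped flows, as the label description notes, before treating a group as a scan.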

Field Name	Description	Type
ts	The timestamp of the connection event.	time
uid	A unique identifier for the connection.	string
id.orig_h	The source IP address.	addr
id.orig_p	The source port.	port
id.resp_h	The destination IP address.	addr
id.resp_p	The destination port.	port
proto	The network protocol used (e.g., 'tcp').	enum
service	The service associated with the connection.	string
duration	The duration of the connection.	interval
orig_bytes	The number of bytes sent from the source to the destination.	count
resp_bytes	The number of bytes sent from the destination to the source.	count
conn_state	The state of the connection.	string
local_orig	Indicates whether the connection is considered local or not.	bool
local_resp	Indicates whether the connection is considered...
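
A tab-separated record with the fields listed above could be parsed into typed values along these lines. This is a sketch under assumptions: it covers only the fields shown before the table is cut off, it treats '-' and '(empty)' as unset values and 'T'/'F' as booleans (the usual Zeek log conventions), and the sample line is made up for illustration.

```python
FIELDS = ["ts", "uid", "id.orig_h", "id.orig_p", "id.resp_h", "id.resp_p", "proto",
          "service", "duration", "orig_bytes", "resp_bytes", "conn_state",
          "local_orig", "local_resp"]

# Converters for non-string fields; anything not listed stays a string.
CONVERTERS = {"ts": float, "id.orig_p": int, "id.resp_p": int, "duration": float,
              "orig_bytes": int, "resp_bytes": int,
              "local_orig": lambda v: v == "T", "local_resp": lambda v: v == "T"}


def parse_conn_line(line):
    """Map one tab-separated record onto the field names, converting types."""
    record = {}
    for name, raw in zip(FIELDS, line.rstrip("\n").split("\t")):
        if raw in ("-", "(empty)"):
            record[name] = None  # unset or empty value
        else:
            record[name] = CONVERTERS.get(name, str)(raw)
    return record


# Hypothetical record, for illustration only.
SAMPLE_LINE = "1545403816.962094\tCabc\t192.168.1.195\t41040\t185.244.25.235\t80\ttcp\t-\t3.139211\t0\t0\tS0\t-\t-"
record = parse_conn_line(SAMPLE_LINE)
```

Columns beyond the listed fields are ignored by `zip`, so the same function still works on records that carry the additional fields the truncated table does not show.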

benign and injected IoMT packet database
scidb.cn
Updated Apr 14, 2025
Cite
Somaya_haiba (2025). benign and injected IoMT packet database [Dataset]. http://doi.org/10.57760/sciencedb.23587
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Unique identifier
https://doi.org/10.57760/sciencedb.23587
Dataset updated
Apr 14, 2025
Dataset provided by
Science Data Bank
Authors
Somaya_haiba
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset was compiled over a year and a half from various websites and sources. It contains 7449 benign and malicious IoMT packets produced by real-world components of an e-healthcare system that monitor network transmission. Data quality is improved through several preprocessing stages: handling noise and unwanted string values, cleaning, encoding string features, and rescaling all disordered data values using data transformation functions. To standardize the analysis of network features, only features related to networking characteristics are considered; all features that provide insight into patients' vital signs are excluded. The dataset is intended for analyzing IoMT traffic behavior within smart hospital networks.
Static and dynamic analysis of both generic and APT-related malware
data.mendeley.com
Updated Mar 12, 2020
Cite
Luis F. Martin-Liras (2020). Static and dynamic analysis of both generic and APT-related malware [Dataset]. http://doi.org/10.17632/w2w8gjsgnt.1
Unique identifier
https://doi.org/10.17632/w2w8gjsgnt.1
Dataset updated
Mar 12, 2020
Authors
Luis F. Martin-Liras
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset provides a total of 1944 features obtained from the static and dynamic analysis reports of more than 19400 malware samples, more than 1000 of them obtained from APT attacks. The objective of this dataset is to provide researchers with a tool to discern the differences between generic and APT-related malware samples.
HCVDB - Hepatitis C Virus Database
neuinfo.org
rrid.site
Updated Jan 29, 2022
Cite
(2022). HCVDB - Hepatitis C Virus Database [Dataset]. http://identifiers.org/RRID:SCR_007703
Unique identifier
https://identifiers.org/RRID:SCR_007703
Dataset updated
Jan 29, 2022
Description
THIS RESOURCE IS NO LONGER IN SERVICE, documented August 23, 2016. The euHCVdb is a Hepatitis C Virus database oriented towards protein sequence, structure and function analyses and structural biology of HCV. To make the existing HCV databases as complementary as possible, current developments are coordinated with the other databases (Japan and Los Alamos) as part of an international collaborative effort. The database is updated monthly from the EMBL Nucleotide Sequence Database and maintained in a relational database management system. The programs for parsing the EMBL database flat files, annotating HCV entries, and filling and querying the database are written in SQL and Java. A fully automatic annotation procedure has been developed, based on a reference set of complete, annotated, well-characterized HCV genomes of various genotypes. This procedure standardizes the nomenclature for all entries and provides, for each entry, the genomic regions/proteins present, bibliographic reference, genotype, sites or domains of interest, source of the sequence, and structural data available as protein 3D models. Keywords: Hepatitis C, Hepatitis C Virus, Hepatitis C Virus protein.
Global data breaches caused by malware 2023-2024, by industry
statista.com
Updated May 15, 2025
Cite
Statista (2025). Global data breaches caused by malware 2023-2024, by industry [Dataset]. https://www.statista.com/statistics/1419328/data-breaches-malware-by-industry/
Dataset updated
May 15, 2025
Dataset authored and provided by
Statista, http://statista.com/
Time period covered
Nov 1, 2023 - Oct 31, 2024
Area covered
Worldwide
Description
Between November 2023 and October 2024, organizations in the manufacturing sector worldwide saw around 1,036 instances of data breaches caused by malware attacks. Professional services ranked second, with 824 data breach cases in the measured period. Furthermore, malware caused 468 data breach incidents in the information sector.
Ransomware Printable Character N-gram Feature Dataset
data.mendeley.com
Updated Sep 15, 2025
Cite
Keven Gonçalves (2025). Ransomware Printable Character N-gram Feature Dataset [Dataset]. http://doi.org/10.17632/ghpy6kdhx5.1
Unique identifier
https://doi.org/10.17632/ghpy6kdhx5.1
Dataset updated
Sep 15, 2025
Authors
Keven Gonçalves
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset was generated for the academic research paper titled "Zero-Day Ransomware Family Detection Based on Printable Character Analysis and Machine Learning", published in the Electronic Journal of Scientific Initiation in Computing (Revista Eletrônica de Iniciação Científica em Computação – REIC), vol. 23 (2025), doi: http://doi.org/10.5753/reic.2025.6021.

It contains structural features in the form of 3-, 4-, and 5-gram printable characters extracted from 2,675 binary executable samples. The training and validation set consists of 2,157 samples (80%): 1,023 ransomware samples from 25 relevant families and 1,134 goodware samples. The testing set consists of 518 samples (20%): 385 ransomware samples from 15 recent families and 133 goodware samples.

The CSV file columns are sample ID, filename, target class (RG), family ID, and numerical columns (binary features), as follows:

| | ID | filename | RG | family | 2000 Features |
| Training Goodware | 10000 to 11133 | Their name.exe | 0 | 0 | Binary features |
| Testing Goodware | 12000 to 12132 | Their name.exe | 0 | 0 | Binary features |
| Training Ransomware | 20000 to 21022 | Their SHA-256 hash | 1 | 1-25 family IDs | Binary features |
| Testing Ransomware | 22000 to 22384 | Their SHA-256 hash | 1 | 26-40 family IDs | Binary features |

Family IDs: Avaddon 1, Babuk 2, Blackmatter 3, Conti 4, Darkside 5, Dharma 6, Doppelpaymer 7, Exorcist 8, Gandcrab 9, Lockbit 10, Makop 11, Maze 12, Mountlocker 13, Nefilim 14, Netwalker 15, Phobos 16, Pysa 17, Ragnarok 18, RansomeXX 19, Revil 20, Ryuk 21, Stop 22, Thanos 23, Wastedlocker 24, Zeppelin 25

AvosLocker 26, BianLian 27, BlackBasta 28, BlackByte 29, BlackCat 30, BlueSky 31, Clop 32, Hive 33, HolyGhost 34, Karma 35, Lorenz 36, Maui 37, Night Sky 38, PlayCrypt 39, Quantum 40
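
Because the sample ID ranges above encode both the split and the class, a small lookup can recover them from an ID alone. The sketch below is illustrative only: the names `ID_RANGES`, `describe_sample`, and `range_sizes` are hypothetical and not part of the dataset, and the ranges are copied from the description and treated as inclusive.

```python
# Inclusive ID ranges as listed in the dataset description.
ID_RANGES = [
    (10000, 11133, "train", "goodware"),
    (12000, 12132, "test", "goodware"),
    (20000, 21022, "train", "ransomware"),
    (22000, 22384, "test", "ransomware"),
]


def describe_sample(sample_id):
    """Return (split, label) for a sample ID, or raise if it is out of range."""
    for low, high, split, label in ID_RANGES:
        if low <= sample_id <= high:
            return split, label
    raise ValueError(f"sample ID {sample_id} is outside the documented ranges")


def range_sizes():
    """Count how many IDs each documented range covers."""
    return {
        (split, label): high - low + 1
        for low, high, split, label in ID_RANGES
    }


sizes = range_sizes()
total = sum(sizes.values())
```

The range sizes computed this way match the sample counts stated in the description (1,134 and 133 goodware, 1,023 and 385 ransomware, 2,675 in total), which is a quick consistency check before loading the CSV itself.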
Android Malware and Normal permissions dataset
data.mendeley.com
impactcybertrust.org
Updated Mar 13, 2018
Cite
Arvind Mahindru (2018). Android Malware and Normal permissions dataset [Dataset]. http://doi.org/10.17632/958wvr38gy.1
Unique identifier
https://doi.org/10.17632/958wvr38gy.1
Dataset updated
Mar 13, 2018
Authors
Arvind Mahindru
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains 18,850 normal Android application packages and 10,000 malware Android packages, which are used to identify the behaviour of malware applications based on the permissions they need at run time.
