Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
This dataset is useful to practice skills in Data Analysis or Data Science, contains information about indicators of crompromise found in MalwareBazaar's database.
The dataset was retrieved from MalwareBazaar's database, full dump CSV. Curated, formatted and cleaned by myself.
Facebook
Twitterhttps://www.comodo.com/home/internet-security/updates/vdp/database.phphttps://www.comodo.com/home/internet-security/updates/vdp/database.php
The complete Comodo Internet Security database is available for download...
Facebook
TwitterMIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
The "Android Malware Detection Dataset" is a comprehensive collection of data designed to facilitate research in the detection and analysis of malware targeting the Android platform. This dataset encompasses a wide range of features extracted from Android applications, providing valuable insights into their behaviors and functionalities.
Key features of the dataset include:
This dataset provides researchers with a rich source of information to develop and evaluate effective malware detection and analysis techniques, ultimately contributing to the enhancement of mobile security on the Android platform.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description The datasets demonstrate the malware economy and the value chain published in our paper, Malware Finances and Operations: a Data-Driven Study of the Value Chain for Infections and Compromised Access, at the 12th International Workshop on Cyber Crime (IWCC 2023), part of the ARES Conference, published by the International Conference Proceedings Series of the ACM ICPS. Using the well-documented scripts, it is straightforward to reproduce our findings. It takes an estimated 1 hour of human time and 3 hours of computing time to duplicate our key findings from MalwareInfectionSet; around one hour with VictimAccessSet; and minutes to replicate the price calculations using AccountAccessSet. See the included README.md files and Python scripts. We choose to represent each victim by a single JavaScript Object Notation (JSON) data file. Data sources provide sets of victim JSON data files from which we've extracted the essential information and omitted Personally Identifiable Information (PII). We collected, curated, and modelled three datasets, which we publish under the Creative Commons Attribution 4.0 International License. 1. MalwareInfectionSet We discover (and, to the best of our knowledge, document scientifically for the first time) that malware networks appear to dump their data collections online. We collected these infostealer malware logs available for free. We utilise 245 malware log dumps from 2019 and 2020 originating from 14 malware networks. The dataset contains 1.8 million victim files, with a dataset size of 15 GB. 2. VictimAccessSet We demonstrate how Infostealer malware networks sell access to infected victims. Genesis Market focuses on user-friendliness and continuous supply of compromised data. Marketplace listings include everything necessary to gain access to the victim's online accounts, including passwords and usernames, but also detailed collection of information which provides a clone of the victim's browser session. Indeed, Genesis Market simplifies the import of compromised victim authentication data into a web browser session. We measure the prices on Genesis Market and how compromised device prices are determined. We crawled the website between April 2019 and May 2022, collecting the web pages offering the resources for sale. The dataset contains 0.5 million victim files, with a dataset size of 3.5 GB. 3. AccountAccessSet The Database marketplace operates inside the anonymous Tor network. Vendors offer their goods for sale, and customers can purchase them with Bitcoins. The marketplace sells online accounts, such as PayPal and Spotify, as well as private datasets, such as driver's licence photographs and tax forms. We then collect data from Database Market, where vendors sell online credentials, and investigate similarly. To build our dataset, we crawled the website between November 2021 and June 2022, collecting the web pages offering the credentials for sale. The dataset contains 33,896 victim files, with a dataset size of 400 MB. Credits Authors Billy Bob Brumley (Tampere University, Tampere, Finland) Juha Nurmi (Tampere University, Tampere, Finland) Mikko Niemelä (Cyber Intelligence House, Singapore) Funding This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme under project numbers 804476 (SCARE) and 952622 (SPIRS). Alternative links to download: AccountAccessSet, MalwareInfectionSet, and VictimAccessSet.
Facebook
TwitterCC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Malware-benign Image representation. The Dataset were collected from several malware repositories, including TekDefense, TheZoo, The Malware-Repo, Malware Database amd Malware Bazar. The benign samples were collected from Microsoft 10 and 11 system apps and several open source software repository including CNET, Sourceforge, FileForum, PortableFreeware. The samples were validated by scanning them using Virustotal Malware scanning services. The Samples underwent preprocessing by converting the malware binary into grayscale images following rules from Nataraj (2011). Nataraj Paper: https://vision.ece.ucsb.edu/research/signal-processing-malware-analysis. Maldeb Dataset is collected by Debi Amalia Septiyani and Halimul Hakim Khairul D. A. Septiyani, “Generating Grayscale and RGB Images dataset for windows PE malware using Gist Features extaction method,” Institut Teknologi Bandung, 2022, and Dani Agung Prastiyo, "Design and implementation of a machine learning-based malware classification system with an audio signal feature Analysis Approach," Institut Teknologi Bandung, 2023. The complete dataset can be accessed on this link https://ieee-dataport.org/documents/maldeb-dataset and https://github.com/julismail/Self-Supervised
Facebook
Twitterhttp://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/
This dataset was created by malwareTBugs
Released under Database: Open Database, Contents: Database Contents
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is part of my PhD research on malware detection and classification using Deep Learning. It contains static analysis data: Top-1000 imported functions extracted from the 'pe_imports' elements of Cuckoo Sandbox reports. PE malware examples were downloaded from virusshare.com. PE goodware examples were downloaded from portableapps.com and from Windows 7 x86 directories.
Facebook
TwitterComprehensive database of website malware threats, vulnerabilities, and security risks detected by Quttera's malware scanner.
Facebook
TwitterThe Database: Kraken2 [1] database built from a classification tree containing over 700k metagenomic viruses from JGI IMG/VR [2]. (1) Wood, D. E., Lu, J., & Langmead, B. (2019). Improved metagenomic analysis with Kraken 2. Genome Biol., 20(1), 1–13. doi: 10.1186/s13059-019-1891-0 (2) Paez-Espino D, Chen I-MA, Palaniappan K, Ratner A, Chu K, Szeto E, et al. IMG/VR: a database of cultured and uncultured DNA Viruses and retroviruses. Nucleic Acids Res. 2017;45:D457–65. For Paper: Title: A k-mer based approach for virus classification in metatranscriptomic and metagenomic samples identifies viral associations in the Populus phytobiome and autism brains Abstract Background Viruses are an underrepresented taxa in the study and identification of microbiome constituents; however, they play an important role in health, microbiome regulation, and transfer of genetic material. Only a few thousand viruses have been isolated, sequenced, and assigned a taxonomy, which further limits the ability to identify and quantify viruses in the microbiome. Additionally, the vast diversity of viruses represents a challenge for classification, not only in constructing a viral taxonomy, but also in identifying similarities between a virus' genotype and its phenotype. However, the diversity of viral sequences can be leveraged to classify their sequences in metagenomic and metatranscriptomic samples. Methods To identify viruses in transcriptomic and genomic samples, we developed a dynamic programming algorithm for creating a classification tree out of 715,672 metagenome viruses. To create the classification tree, we clustered proportional similarity scores generated from the k-mer profiles of each of the metagenome viruses. We then integrated the viral classification tree with the NCBI taxonomy for use with ParaKraken, a metagenomic/transcriptomic classifier. Results To illustrate the breadth of our utility for classifying viruses with ParaKraken, we analyzed data from a plant metagenome study identifying the differences between two Populus genotypes in three different compartments and on a human metatranscriptome study identifying the differences between Autism Spectrum Disorder patients and controls in post mortem brain biopsies. In the Populus study, we identified genotype and compartment specific viral signatures, while in the Autism study we identified a significant increased abundance of eight viral sequences in Autism brain biopsies. Conclusion Viruses represent an important aspect of the microbiome. The ability to classify viruses represents the first step in being able to better understand their role in the microbiome. The viral classification method presented here allows for more complete identification of viral sequences for use in identifying associations between viruses and the host and viruses and other microbiome members. Acknowledgements and Funding This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. This research was also supported by the Plant-Microbe Interfaces Scientific Focus Area in the Genomic Science Program, the Office of Biological and Environmental Research (BER) in the U.S. Department of Energy Office of Science, and by the Department of Energy, Laboratory Directed Research and Development funding (ProjectID 8321), at the Oak Ridge National Laboratory. Oak Ridge National Laboratory is managed by UT-Battelle, LLC, for the US DOE under contract DE-AC05-00OR22725. This research used resources of the Compute and Data Environment for Science (CADES).
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is rooted in a study aimed at unveiling the origins and motivations behind the creation of malware repositories on GitHub. Our research embarks on an innovative journey to dissect the profiles and intentions of GitHub users who have been involved in this dubious activity.
Employing a robust methodology, we meticulously identified 14,000 GitHub users linked to malware repositories. By leveraging advanced large language model (LLM) analytics, we classified these individuals into distinct categories based on their perceived intent: 3,339 were deemed Malicious, 3,354 Likely Malicious, and 7,574 Benign, offering a nuanced perspective on the community behind these repositories.
Our analysis penetrates the veil of anonymity and obscurity often associated with these GitHub profiles, revealing stark contrasts in their characteristics. Malicious authors were found to typically possess sparse profiles focused on nefarious activities, while Benign authors presented well-rounded profiles, actively contributing to cybersecurity education and research. Those labeled as Likely Malicious exhibited a spectrum of engagement levels, underlining the complexity and diversity within this digital ecosystem.
We are offering two datasets in this paper. First, a list of malware repositories - we have collected and extended the malware repositories on the GitHub in 2022 following the original papers. Second, a csv file with the github users information with their maliciousness classfication label.
malware_repos.txt
Purpose: This file contains a curated list of GitHub repositories identified as containing malware. These repositories were identified following the methodology outlined in the research paper "SourceFinder: Finding Malware Source-Code from Publicly Available Repositories in GitHub."
Contents: The file is structured as a simple text file, with each line representing a unique repository in the format username/reponame. This format allows for easy identification and access to each repository on GitHub for further analysis or review.
Usage: The list serves as a critical resource for researchers and cybersecurity professionals interested in studying malware, understanding its distribution on platforms like GitHub, or developing defense mechanisms against such malicious content.
obfuscated_github_user_dataset.csv
Purpose: Accompanying the list of malware repositories, this CSV file contains detailed, albeit obfuscated, profile information of the GitHub users who authored these repositories. The obfuscation process has been applied to protect user privacy and comply with ethical standards, especially given the sensitive nature of associating individuals with potentially malicious activities.
Contents: The dataset includes several columns representing different aspects of user profiles, such as obfuscated identifiers (e.g., ID, login, name), contact information (e.g., email, blog), and GitHub-specific metrics (e.g., followers count, number of public repositories). Notably, sensitive information has been masked or replaced with generic placeholders to prevent user identification.
Usage: This dataset can be instrumental for researchers analyzing behaviors, patterns, or characteristics of users involved in creating malware repositories on GitHub. It provides a basis for statistical analysis, trend identification, or the development of predictive models, all while upholding the necessary ethical considerations.
Facebook
TwitterTHIS RESOURCE IS NO LONGER IN SERVICE, documented August 19, 2016. It is a database and web application describing the genome organization and providing analytical tools for the 938 known species of RNA virus. It can identify submitted nucleotide sequences, can place them into multiple whole-genome alignments (in species where more than one isolate has been fully sequenced) and contains translated genome sequences for all species. It has been created for two main purposes: to facilitate the comparative analysis of RNA viruses and to become a hub for other, more specialised virus Web sites.
Facebook
TwitterThe dataset comprises medical imaging data that demonstrate the presence or absence of illnesses. used to simulate AI-based malware modulation, this database is paired with malware-modulated counterparts. By creating tampered images on the fly from the benign dataset using three mechanisms:Adversarial perturbations to input data that can cause data misclassification.Patch-level content edits by Copying-pasting or inpainting of small square regions (8–32 px) to simulate lesion insertion or removal.Metadata-consistent rescaling for random resize and crop variance. Each training batch is a duplicate of the original images.
Facebook
TwitterIVDB hosts complete genome sequences of influenza A virus generated by BGI and curates all other published influenza virus sequences after expert annotations. For the convenience of efficient data utilization, our Q-Filter system classifies and ranks all nucleotide sequences into 7 categories according to sequence content and integrity. IVDB provides a series of tools and viewers for analyzing the viral genomes, genes, genetic polymorphisms and phylogenetic relationships comparatively. A searching system is developed for users to retrieve a combination of different data types by setting various search options. To facilitate analysis of the global viral transmission and evolution, the IV Sequence Distribution Tool (IVDT) is developed to display worldwide geographic distribution of the viral genotypes and to couple genomic data with epidemiological data. The BLAST, multiple sequence alignment tools and phylogenetic analysis tools were integrated for online data analysis. Furthermore, IVDB offers instant access to the pre-computed alignments and polymorphism analysis of influenza virus genes and proteins and presents the results by SNP distribution plots and minor allele distributions. IVDB aims to be a powerful information resource and an analysis workbench for scientists working on IV genetics, evolution, diagnostics, vaccine development, and drug design.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
To cite the dataset please reference it as “Stratosphere Laboratory. A labeled dataset with malicious and benign IoT network traffic. January 22th. Agustin Parmisano, Sebastian Garcia, Maria Jose Erquiaga. https://www.stratosphereips.org/datasets-iot23
This dataset includes labels that explain the linkages between flows connected with harmful or possibly malicious activity to provide network malware researchers and analysts with more thorough information. These labels were painstakingly created at the Stratosphere labs using malware capture analysis.
We present a concise explanation of the labels used for the identification of malicious flows, based on manual network analysis, below:
Attack: This label signifies the occurrence of an attack originating from an infected device directed towards another host. Any flow that endeavors to exploit a vulnerable service, discerned through payload and behavioral analysis, falls under this classification. Examples include brute force attempts on telnet logins or header-based command injections in GET requests.
Benign: The "Benign" label denotes connections where no suspicious or malicious activities have been detected.
C&C (Command and Control): This label indicates that the infected device has established a connection with a Command and Control server. This observation is rooted in the periodic nature of connections or activities such as binary downloads or the exchange of IRC-like or decoded commands.
DDoS (Distributed Denial of Service): "DDoS" is assigned when the infected device is actively involved in a Distributed Denial of Service attack, identifiable by the volume of flows directed towards a single IP address.
FileDownload: This label signifies that a file is being downloaded to the infected device. It is determined by examining connections with response bytes exceeding a specified threshold (typically 3KB or 5KB), often in conjunction with known suspicious destination ports or IPs associated with Command and Control servers.
HeartBeat: "HeartBeat" designates connections where packets serve the purpose of tracking the infected host by the Command and Control server. Such connections are identified through response bytes below a certain threshold (typically 1B) and exhibit periodic similarities. This is often associated with known suspicious destination ports or IPs linked to Command and Control servers.
Mirai: This label is applied when connections exhibit characteristics resembling those of the Mirai botnet, based on patterns consistent with common Mirai attack profiles.
Okiru: Similar to "Mirai," the "Okiru" label is assigned to connections displaying characteristics of the Okiru botnet. The parameters for this label are the same as for Mirai, but Okiru is a less prevalent botnet family.
PartOfAHorizontalPortScan: This label is employed when connections are involved in a horizontal port scan aimed at gathering information for potential subsequent attacks. The labeling decision hinges on patterns such as shared ports, similar transmitted byte counts, and multiple distinct destination IPs among the connections.
Torii: The "Torii" label is used when connections exhibit traits indicative of the Torii botnet, with labeling criteria similar to those used for Mirai, albeit in the context of a less common botnet family.
| Field Name | Description | Type |
|---|---|---|
| ts | The timestamp of the connection event. | time |
| uid | A unique identifier for the connection. | string |
| id.orig_h | The source IP address. | addr |
| id.orig_p | The source port. | port |
| id.resp_h | The destination IP address. | addr |
| id.resp_p | The destination port. | port |
| proto | The network protocol used (e.g., 'tcp'). | enum |
| service | The service associated with the connection. | string |
| duration | The duration of the connection. | interval |
| orig_bytes | The number of bytes sent from the source to the destination. | count |
| resp_bytes | The number of bytes sent from the destination to the source. | count |
| conn_state | The state of the connection. | string |
| local_orig | Indicates whether the connection is considered local or not. | bool |
| local_resp | Indicates whether the connection is considered... |
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset was compiled over a year and a half from various websites and sources, and it contains 7449 benign and malicious IoMT packets presented by real-world components of the e-healthcare system that monitor network transmission. Data quality is improved at several preprocessing stages, including dealing with noises and unwanted values as strings, cleaning, encoding string features, and rescaling all disordered data values using data transformation functions. To standardize the analysis of network features, we only consider features related to networking characteristics and reject all other features that provide insights into the patient's vital signs. This data set is for analyzing the IoMT traffic behavior within the smart hospital's networks.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset provides a total of 1944 features obtained from the static and dynamic analysis reports of more than 19400 malware samples, more than 1000 of then belonging to malware samples obtained from APT attacks. The objective of this dataset is to provide researcher a tool to discern the differences between generic and APT-related malware samples.
Facebook
TwitterTHIS RESOURCE IS NO LONGER IN SERVICE, documented August 23, 2016. The euHCVdb is a Hepatitis C Virus database oriented towards protein sequence, structure and function analyses and structural biology of HCV. In order to make the existing HCV databases as complementary as possible, the current developments are coordinated with the other databases (Japan and Los Alamos) as part of an international collaborative effort. It is monthly updated from the EMBL Nucleotide sequence database and maintained in a relational database management system. Programs for parsing the EMBL database flat files, annotating HCV entries, filling up and querying the database used SQL and Java programming languages. Great efforts have been made to develop a fully automatic annotation procedure thanks to a reference set of HCV complete annotated well-characterized genomes of various genotypes. This automatic procedure ensures standardization of nomenclature for all entries and provides genomic regions/proteins present in the entry, bibliographic reference, genotype, interesting sites or domains, source of the sequence and structural data that are available as protein 3D models. Hepatitis C, Hepatitis C Virus, Hepatitis C Virus protein .
Facebook
TwitterBetween November 2023 and October 2024, organizations in the manufacturing sector worldwide saw around 1,036 instances of data breaches caused by malware attacks. Professional services ranked second, with 824 data breach cases in the measured period. Furthermore, malware caused 468 data breach incidents in the information sector.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was generated for the academic research paper titled "Zero-Day Ransomware Family Detection Based on Printable Character Analysis and Machine Learning", published in the Electronic Journal of Scientific Initiation in Computing (Revista Eletrônica de Iniciação Científica em Computação – REIC), vol. 23 (2025), doi: http://doi.org/10.5753/reic.2025.6021.
It contains structural features in the form of 3-, 4-, and 5-gram printable characters extracted from 2,675 binary executable samples. The training and validation set consists of 2,157 samples (80%): 1,023 ransomware samples from 25 relevant families and 1,134 goodware samples. The testing set consists of 518 samples (20%): 385 ransomware samples from 15 recent families and 133 goodware samples.
The CSV file columns are sample ID, filename, target class (RG), family ID, and numerical columns ( binaryfeatures), as follows: | ID | filename | RG | family | 2000 Features | Training Goodware | 10000 to 11133 | Their name.exe | 0 | 0 | Binary features | Testing Goodware | 12000 to 12132 | Their name.exe | 0 | 0 | Binary features | Training Ransomware | 20000 to 21022 | Their SHA-256 hash | 1 | 1-25 family IDs | Binary features | Testing Ransomware | 22000 to 22384 | Their SHA-256 hash | 1 | 26-40 family IDs | Binary features |
Family IDs: Avaddon 1 Babuk 2 Blackmatter 3 Conti 4 Darkside 5 Dharma 6 Doppelpaymer 7 Exorcist 8 Gandcrab 9 Lockbit 10 Makop 11 Maze 12 Mountlocker 13 Nefilim 14 Netwalker 15 Phobos 16 Pysa 17 Ragnarok 18 RansomeXX 19 Revil 20 Ryuk 21 Stop 22 Thanos 23 Wastedlocker 24 Zeppelin 25
AvosLocker 26 BianLian 27 BlackBasta 28 BlackByte 29 BlackCat 30 BlueSky 31 Clop 32 Hive 33 HolyGhost 34 Karma 35 Lorenz 36 Maui 37 Night Sky 38 PlayCrypt 39 Quantum 40
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains 18,850 normal android application packages and 10,000 malware android packages which are used to identify the behaviour of malware application on permission they need at run-time.
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
This dataset is useful to practice skills in Data Analysis or Data Science, contains information about indicators of crompromise found in MalwareBazaar's database.
The dataset was retrieved from MalwareBazaar's database, full dump CSV. Curated, formatted and cleaned by myself.