Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is rooted in a study aimed at unveiling the origins and motivations behind the creation of malware repositories on GitHub. Our research embarks on an innovative journey to dissect the profiles and intentions of GitHub users who have been involved in this dubious activity.
Employing a robust methodology, we meticulously identified 14,000 GitHub users linked to malware repositories. By leveraging advanced large language model (LLM) analytics, we classified these individuals into distinct categories based on their perceived intent: 3,339 were deemed Malicious, 3,354 Likely Malicious, and 7,574 Benign, offering a nuanced perspective on the community behind these repositories.
Our analysis penetrates the veil of anonymity and obscurity often associated with these GitHub profiles, revealing stark contrasts in their characteristics. Malicious authors were found to typically possess sparse profiles focused on nefarious activities, while Benign authors presented well-rounded profiles, actively contributing to cybersecurity education and research. Those labeled as Likely Malicious exhibited a spectrum of engagement levels, underlining the complexity and diversity within this digital ecosystem.
We offer two datasets in this paper. First, a list of malware repositories: we collected and extended the malware repositories on GitHub in 2022, following the original papers. Second, a CSV file with GitHub user information and their maliciousness classification labels.
malware_repos.txt - each line lists one repository as username/reponame. This format allows for easy identification and access to each repository on GitHub for further analysis or review.
obfuscated_github_user_dataset.csv - the obfuscated GitHub user information together with the maliciousness classification labels.
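As a minimal illustration of working with the user file, the class distribution can be tallied with pandas; the column name used below is an assumption, so inspect the header first:

import pandas as pd

# Minimal sketch: count how many users fall into each maliciousness class.
# "label" is a hypothetical column name; check users.columns for the real one.
users = pd.read_csv("obfuscated_github_user_dataset.csv")
print(users.columns.tolist())
print(users["label"].value_counts())   # expected classes: Malicious, Likely Malicious, Benign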
Protection against ransomware is particularly relevant in systems running the Android operating system, due to its huge user base and, therefore, its potential for monetization by attackers. In "Extinguishing Ransomware - A Hybrid Approach to Android Ransomware Detection" (see references for details), we describe a hybrid (static + dynamic) malware detection method with extremely good accuracy (100% detection rate, with a false positive rate below 4%).
We release a dataset related to the dynamic detection part of the aforementioned method, containing execution traces of ransomware Android applications, in order to facilitate further research as well as the adoption of dynamic detection in practice. The dataset contains execution traces from 666 ransomware applications taken from the Heldroid project [https://github.com/necst/heldroid] (the app repository is unavailable at the moment). Execution records were obtained by running the applications, one at a time, on the Android emulator. For each application, a maximum of 20,000 stimuli were applied, with a maximum execution time of 15 minutes. For most of the applications, all the stimuli could be applied within this timeframe; in some of the traces neither of the two limits is reached, due to emulator hiccups. The collected features relate to memory and CPU usage, network interaction, and system calls, and they are sampled with a period of two seconds. The Android emulator of the Android Software Development Kit for Android 4.0 (release 20140702) was used. To guarantee that the system was always in a pristine condition when a new sample was started, thus avoiding possible interference (e.g., changed settings, running processes, and modifications of operating system files) from previously run samples, the Android operating system was re-initialized before running each application. The application execution process was automated by means of a shell script that made use of the Android Debug Bridge (adb) and was run on a Linux PC. The Monkey application exerciser was used in the script as the generator of the aforementioned stimuli. Monkey is a command-line tool that can be run on any emulator instance or device; it sends a pseudo-random stream of user events (stimuli) into the system, which acts as a stress test on the application software.
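To give a concrete picture of the collection procedure, the sketch below approximates it in Python; it is not the original shell script, and the package name handling is a simplification (the real script installed each apk, drove Monkey, and re-initialized the emulator between samples):

import subprocess

PACKAGE = "com.example.sample"   # hypothetical package name of the app under test
MAX_EVENTS = 20000               # stimulus budget used for the dataset
TIMEOUT_S = 15 * 60              # 15-minute cap per application

subprocess.run(["adb", "install", "-r", "sample.apk"], check=True)
try:
    # Monkey sends a pseudo-random stream of user events (stimuli) to the package.
    subprocess.run(["adb", "shell", "monkey", "-p", PACKAGE, "-v", str(MAX_EVENTS)],
                   timeout=TIMEOUT_S)
except subprocess.TimeoutExpired:
    pass  # the 15-minute limit was reached before all events were applied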
In this dataset, we provide both per-app CSV files as well as unified files, in which CSV files of single applications have been concatenated. The CSV files contain the features extracted from the raw execution record. The provided files are listed below:
ransom-per_app-csv.zip - features obtained by executing ransomware applications, one CSV per application
ransom-unified-csv.zip - features obtained by executing ransomware applications, only one CSV file
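The unified file is simply the concatenation of the per-app CSVs, so either form can be rebuilt from the other; a minimal sketch (the extracted folder name is an assumption):

import glob
import pandas as pd

# Rebuild a unified table from the per-app CSVs extracted from ransom-per_app-csv.zip.
frames = [pd.read_csv(path) for path in sorted(glob.glob("ransom-per_app-csv/*.csv"))]
unified = pd.concat(frames, ignore_index=True)
print(unified.shape)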
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Malware-benign image representation. The dataset was collected from several malware repositories, including TekDefense, TheZoo, The Malware-Repo, Malware Database, and MalwareBazaar. The benign samples were collected from Windows 10 and 11 system apps and several open-source software repositories, including CNET, Sourceforge, FileForum, and PortableFreeware. The samples were validated by scanning them with the VirusTotal malware scanning service. The samples were preprocessed by converting each malware binary into a grayscale image following the rules from Nataraj (2011). Nataraj paper: https://vision.ece.ucsb.edu/research/signal-processing-malware-analysis. The Maldeb dataset was collected by Debi Amalia Septiyani and Halimul Hakim Khairul: D. A. Septiyani, "Generating Grayscale and RGB Images dataset for windows PE malware using Gist Features extaction method," Institut Teknologi Bandung, 2022; and Dani Agung Prastiyo, "Design and implementation of a machine learning-based malware classification system with an audio signal feature Analysis Approach," Institut Teknologi Bandung, 2023. The complete dataset can be accessed at https://ieee-dataport.org/documents/maldeb-dataset and https://github.com/julismail/Self-Supervised
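The Nataraj (2011) preprocessing treats every byte of a binary as one 8-bit grayscale pixel and reshapes the byte stream into a 2D image. A minimal sketch of the idea is shown below; it uses a fixed image width instead of Nataraj's size-dependent width table, and the file names are hypothetical:

import numpy as np
from PIL import Image

def binary_to_grayscale(path, width=256):
    # Each byte becomes one pixel in the 0-255 grayscale range.
    data = np.frombuffer(open(path, "rb").read(), dtype=np.uint8).copy()
    height = len(data) // width
    return Image.fromarray(data[: height * width].reshape(height, width), mode="L")

binary_to_grayscale("sample.exe").save("sample.png")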
GNU General Public License 3.0 https://www.gnu.org/licenses/gpl-3.0-standalone.html
The WinMET dataset contains reports generated with the CAPE sandbox after analyzing several malware samples. The reports are valid JSON files that contain the spawned processes, the sequence of WinAPI and system calls invoked by each process, their parameters, their return values, and the OS resources accessed, amongst many others.
Please use this DOI reference that always points to the latest WinMET version: https://doi.org/10.5281/zenodo.12647555
This dataset was generated using the MALVADA framework (https://github.com/reverseame/MALVADA/tree/main), which you can read more about in our publication https://doi.org/10.1016/j.softx.2025.102082. The article also provides insights about the contents of this dataset.
Razvan Raducu, Alain Villagrasa-Labrador, Ricardo J. Rodríguez, Pedro Álvarez, MALVADA: A framework for generating datasets of malware execution traces, SoftwareX, Volume 30, 2025, 102082, ISSN 2352-7110, https://doi.org/10.1016/j.softx.2025.102082.(https://www.sciencedirect.com/science/article/pii/S2352711025000494)
The 7z file (WinMET.7z) is password protected. The password is: infected.
Compressed size on disk: ~2.5 GiB.
Decompressed size on disk: ~105 GiB.
Total decompressed .json files: 9,889.
The name of each .json file is irrelevant; it corresponds to its analysis ID.
cape_report_to_label_mapping.json and avclass_report_to_label_mapping.json contain the mappings of each report to its corresponding consensus label, sorted in descending order (by the number of reports belonging to each label/family).
If you use this dataset, cite it as follows:
TBA.
Such statistics (and many more) can be obtained by analyzing the WinMET dataset with the MALVADA framework (https://github.com/reverseame/MALVADA).
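For a quick look at the family distribution, the consensus-label mapping files can also be inspected directly; the exact JSON layout is not documented here, so the sketch assumes a flat report-to-label dictionary:

import json
from collections import Counter

with open("avclass_report_to_label_mapping.json") as fh:
    mapping = json.load(fh)           # assumed layout: {report_id: family_label, ...}

print(Counter(mapping.values()).most_common(10))   # ten most populous families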
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here are the data set and source code related to the paper: "A Framework for Developing Strategic Cyber Threat Intelligence from Advanced Persistent Threat Analysis Reports Using Graph-Based Algorithms"
1- aptnotes-downloader.zip : contains source code that downloads all APT reports listed in https://github.com/aptnotes/data and https://github.com/CyberMonitor/APT_CyberCriminal_Campagin_Collections
2- apt-groups.zip : contains all APT group names gathered from https://docs.google.com/spreadsheets/d/1H9_xaxQHpWaa4O_Son4Gx0YOIzlcBWMsdvePFX68EKU/edit?gid=1864660085#gid=1864660085 and https://malpedia.caad.fkie.fraunhofer.de/actors
3- apt-reports.zip : contains all deduplicated APT reports gathered from https://github.com/aptnotes/data and https://github.com/CyberMonitor/APT_CyberCriminal_Campagin_Collections
4- countries.zip : contains a country name list.
5- ttps.zip : contains all MITRE techniques gathered from https://attack.mitre.org/resources/attack-data-and-tools/
6- malware-families.zip : contains all malware family names gathered from https://malpedia.caad.fkie.fraunhofer.de/families
7- ioc-searcher-app.zip : contains source code that extracts IoCs from APT reports. Extracted IoC files are provided in report-analyser.zip. Original code repo can be found at https://github.com/malicialab/iocsearcher
8- extracted-iocs.zip : contains the IoCs extracted by ioc-searcher-app.zip
9- report-analyser.zip : contains source code that searches APT reports, malware families, countries and TTPs. In case of a match, it updates the files in extracted-iocs.zip.
10- cti-transformation-app.zip : contains source code that transforms the files in extracted-iocs.zip into CTI triples and saves them into a Neo4j graph database.
11- graph-db-backup.zip : contains the volume folder of a Neo4j Docker container. When it is mounted to a Docker container, the entire CTI database becomes reachable from the Neo4j web interface. Here is how to run a Neo4j Docker container that mounts the folders in the zip:
docker run -d --publish=7474:7474 --publish=7687:7687 --volume={PATH_TO_VOLUME}/DEVIL_NEO4J_VOLUME/neo4j/data:/data --volume={PATH_TO_VOLUME}/DEVIL_NEO4J_VOLUME/neo4j/plugins:/plugins --volume={PATH_TO_VOLUME}/DEVIL_NEO4J_VOLUME/neo4j/logs:/logs --volume={PATH_TO_VOLUME}/DEVIL_NEO4J_VOLUME/neo4j/conf:/conf --env 'NEO4J_PLUGINS=["apoc","graph-data-science"]' --env NEO4J_apoc_export_file_enabled=true --env NEO4J_apoc_import_file_enabled=true --env NEO4J_apoc_import_file_use_neo4j_config=true --env=NEO4J_AUTH=none neo4j:5.13.0
web interface: http://localhost:7474
username: neo4j
password: neo4j
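Besides the web interface, the CTI graph can be queried programmatically; the sketch below uses the official Neo4j Python driver and only issues a schema-agnostic count, since the exact node labels and properties depend on the CTI triples loaded:

from neo4j import GraphDatabase

# NEO4J_AUTH=none in the container above, so no credentials are required.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=None)
with driver.session() as session:
    for record in session.run("MATCH (n) RETURN labels(n) AS labels, count(*) AS total"):
        print(record["labels"], record["total"])
driver.close()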
Mobile malware detection has attracted massive research effort in our community. A reliable and up-to-date malware dataset is critical to evaluating the effectiveness of malware detection approaches. Essentially, the malware ground truth should be manually verified by security experts, and the malicious behaviors should be carefully labelled. Although there are several widely-used malware benchmarks in our community (e.g., MalGenome, Drebin, Piggybacking, and AMD), these benchmarks face several limitations, including datedness, size, coverage, and reliability issues.
We have made the effort to create MalRadar, a growing and up-to-date Android malware dataset built in the most reliable way, i.e., by collecting malware based on the analysis reports of security experts. We crawled all the mobile-security-related reports released by ten leading security companies, and used an automated approach to extract and label the useful ones describing new Android malware and containing Indicators of Compromise (IoC) information. We have compiled MalRadar, a dataset that contains 4,534 unique Android malware samples (including both apks and metadata) released from 2014 to April 2021 by the time of this paper, all of which were manually verified by security experts with detailed behavior analysis. For more details, please visit https://malradar.github.io/
The dataset includes the following files:
(1) sample-info.csv
In this file, we list all the detailed information about each sample, including apk file hash, app name, package name, report family, etc.
(2) malradar.zip
We have packaged the malware samples in chunks of 1,000 applications: malradar-0, malradar-1, malradar-2, malradar-3. All apk files are named after the file's SHA-256 hash.
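Because each apk is named after its SHA-256 digest, sample integrity can be verified with a short check like the following sketch (the path is hypothetical):

import hashlib
from pathlib import Path

def verify_sample(apk_path):
    # MalRadar naming convention: the file name (without extension) is the apk's SHA-256.
    digest = hashlib.sha256(Path(apk_path).read_bytes()).hexdigest()
    return digest == Path(apk_path).stem.lower()

print(verify_sample("malradar-0/sample.apk"))   # hypothetical path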
If your papers or articles use our dataset, please include a citation to our paper:
@article{wang2022malradar,
title={MalRadar: Demystifying Android Malware in the New Era},
author={Wang, Liu and Wang, Haoyu and He, Ren and Tao, Ran and Meng, Guozhu and Luo, Xiapu and Liu, Xuanzhe},
journal={Proceedings of the ACM on Measurement and Analysis of Computing Systems},
volume={6},
number={2},
pages={1--27},
year={2022},
publisher={ACM New York, NY, USA}
}
To leverage the vast literature solving the original MNIST digit recognition problem on small thumbnails, this firmware dataset maps the first 1024 bytes of malicious, benign, and hacked Internet of Things and embedded software binaries (Executable and Linkable Format, ELF) into MNIST-style thumbnail images. The goal is to provide a drop-in replacement for MNIST techniques that is relevant to weeding out malware using image recognition.
The images are provided as CSV, where each row gives the filename, the label class (both categorical and numerical), and the first 1024 bytes mapped into a grayscale range of 0-255 by first converting each byte to a decimal value (0-15) and then scaling.
See additional background on ELF files, https://en.wikipedia.org/wiki/Executable_and_Linkable_Format and https://linux-audit.com/elf-binaries-on-linux-understanding-and-analysis/
The labeled ELF files repository, https://github.com/nimrodpar/Labeled-Elfs
The dataset supports comparing firmware detection based on these image representations with signature-based methods, as well as contrasting statistical (tree-based) methods with deep learning techniques.
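A minimal sketch of the thumbnail idea is given below; it reads the raw first 1024 bytes directly as 0-255 pixel values, which is a simplification of the per-byte scaling described above, and the file names are hypothetical:

import numpy as np
from PIL import Image

# Render the first 1024 bytes of an ELF binary as a 32x32 grayscale thumbnail (MNIST-sized input).
raw = open("sample.elf", "rb").read(1024).ljust(1024, b"\x00")   # zero-pad binaries shorter than 1024 bytes
pixels = np.frombuffer(raw, dtype=np.uint8).copy().reshape(32, 32)
Image.fromarray(pixels, mode="L").save("sample.png")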
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset is a collection of evasive PDF samples, labeled as malicious (1) or benign (0). Since the dataset has an evasive nature, it can be used to test the robustness of trained PDF malware classifiers against evasion attacks. The dataset contains 500,000 generated evasive samples, including 450,000 malicious and 50,000 benign PDFs.
More details about the data can be found in the publication: Leveraging Adversarial Samples for Enhanced Classification of Malicious and Evasive PDF Files.
This resource aims to support researchers and cybersecurity professionals in developing more advanced and robust detection mechanisms for PDF-based malware.
Any work that uses this dataset should cite the following paper:
Trad, F.; Hussein, A.; Chehab, A. Leveraging Adversarial Samples for Enhanced Classification of Malicious and Evasive PDF Files. Appl. Sci. 2023, 13, 3472. https://doi.org/10.3390/app13063472
Implementation code can be found here
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The USTC-TFC2016 dataset is mainly used for network traffic classification research, covering both malicious traffic and normal application traffic, and was produced jointly by the University of Science and Technology of China and the Institute of Acoustics of the Chinese Academy of Sciences. The data comes from two sources: the first is 10 types of malicious traffic selected from the CTU dataset, collected by researchers from the Czech CTU University in real environments between 2011 and 2015; the second is 10 types of normal application traffic generated by network instrument simulation. The dataset thus consists of 20 traffic types, corresponding to 20 data files, all in pcap format. To save space, some pcap files were compressed before uploading; after decompression, the pcap files total 3.71 GB. For more information about this dataset, please refer to: 1) Wei Wang, Ming Zhu, Xuewen Zeng, Xiaozhou Ye and Yiqiang Sheng, "Malware traffic classification using convolutional neural network for representation learning," ICOIN 2017, pp. 712-717; 2) Wang Wei, Research on Network Traffic Classification and Anomaly Detection Methods Based on Deep Learning, Ph.D. Thesis, University of Science and Technology of China, 2018. This dataset and a preprocessing tool were released in 2018 at https://github.com/echowei/ and are used by many researchers in China and abroad. Due to bandwidth and capacity constraints, it often cannot be downloaded there, so we have uploaded it to the "Science Database" website of the Chinese Academy of Sciences for long-term storage and easy download. At the same time, we look forward to relevant researchers uploading and sharing new malicious traffic and encrypted traffic using domestic cryptographic protocols, as well as expanding this dataset with more types of malicious and normal traffic (such as 100 each), forming a richer and more comprehensive dataset that better approaches real network traffic.
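Each of the 20 data files is an ordinary pcap, so any packet-processing library can read it; a minimal scapy sketch (the file name is hypothetical) is:

from scapy.all import rdpcap

packets = rdpcap("Geodo.pcap")    # hypothetical file name from the malicious-traffic portion
print(len(packets), "packets")
print(packets[0].summary())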
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We have developed ProjecTILs, a computational approach to project new data sets into a reference map of T cells, enabling their direct comparison in a stable, annotated system of coordinates. Because new cells are embedded in the same space of the reference, ProjecTILs enables the classification of query cells into annotated, discrete states, but also over a continuous space of intermediate states. By comparing multiple samples over the same map, and across alternative embeddings, the method allows exploring the effect of cellular perturbations (e.g. as the result of therapy or genetic engineering) and identifying genetic programs significantly altered in the query compared to a control set or to the reference map. We illustrate the projection of several data sets from recent publications over two cross-study murine T cell reference atlases: the first describing tumor-infiltrating T lymphocytes (TILs), the second characterizing acute and chronic viral infection.
Single-cell data to build the virus-specific CD8 T cell reference map were downloaded from GEO under the following entries: GSE131535, GSE134139 and GSE119943, selecting only samples in wild type conditions. Data for the Ptpn2-KO, Tox-KO and CD4-depletion projections were obtained from entries GSE134139, GSE119943, and GSE137007 and were not included in the construction of the reference map. To construct the LCMV reference map, we split the dataset into five batches that displayed strong batch effects, and applied STACAS (https://github.com/carmonalab/STACAS) to mitigate its confounding effects. We computed 800 variable genes per batch, excluding cell cycling genes, ribosomal and mitochondrial genes, and computed pairwise anchors using 200 integration genes, and otherwise default STACAS parameters. Anchors were filtered at the default threshold 0.8 percentile, and integration was performed with the IntegrateData Seurat3 function with the guide tree suggested by STACAS.
Next, we performed unsupervised clustering of the integrated cell embeddings using the Shared Nearest Neighbor (SNN) clustering method implemented in Seurat 3 with parameters {resolution=0.4, reduction="pca", k.param=20}. We then manually annotated individual clusters (merging clusters when necessary) based on several criteria: i) average expression of key marker genes in individual clusters; ii) gradients of gene expression over the UMAP representation of the reference map; iii) gene-set enrichment analysis to determine over- and under-expressed genes per cluster using MAST. In order to have access to predictive methods for UMAP, we recomputed PCA and UMAP embeddings independently of Seurat3 using respectively the prcomp function from the basic R package "stats", and the "umap" R package (https://github.com/tkonopka/umap).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The winFam dataset is a collection that amalgamates cryptoransomware files sourced from VirusShare and Windows PE32 malware/benign files from MaleVis dataset.
Key Features:
File Sources
Cryptoransomware files were sourced from VirusShare (https://virusshare.com/), a reputable repository known for its extensive collection of malware samples. Windows PE32 files were obtained from https://web.cs.hacettepe.edu.tr/~selman/malevis/, enriching the dataset with a variety of Windows executable files.
Conversion to Images
The executable files from Cryptoransomware were converted into RGB images using the binary2image tool (https://github.com/ncarkaci/binary-to-image/blob/master/binary2image.py). This transformation allows researchers to leverage image processing techniques and neural networks for advanced analysis.
Malware Families
The dataset comprises 26 distinct malware families plus a benign class, included for comparative analysis and to facilitate the development of effective detection mechanisms. The images are split into three folders: Train, Valid, and Test.
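Since the images are organized into Train, Valid, and Test folders with one subfolder per family, they can be loaded directly with torchvision's ImageFolder; the root path and image size below are assumptions:

from torchvision import datasets, transforms

transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
train_set = datasets.ImageFolder("winFam/Train", transform=transform)   # hypothetical root path
print(len(train_set), "images across", len(train_set.classes), "classes")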
Research Applications
It serves as a valuable resource for the development and evaluation of deep learning models based on malware visualization.
Community Contribution
The winFam dataset is shared with the cybersecurity community on Zenodo to encourage collaboration and knowledge-sharing. Researchers, analysts, and developers can leverage this dataset to advance the state of the art in malware analysis and cybersecurity.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
About Dataset
Context
The dataset is acquired via the VIRRION platform designed for label-free capture and enrichment of viruses. It consists of Raman spectra of three groups of RNA viruses: human respiratory viruses (influenza A H1N1 and H3N2, influenza B, rhinovirus, respiratory syncytial virus (RSV)), avian respiratory viruses (influenza A H5N2 and H7N2, infectious bronchitis virus (IBV), reovirus), and human enteroviruses (Coxsackievirus B types 1 and 3 (CVB1, CVB3), enteroviruses EV70 and EV71).
Content
The dataset contains 10 sample Raman spectra for each virus type/subtype in "virus_spectra.zip", and visualizations of the spectra in "virus_spectra_plots.zip". The Raman mapping data were collected for each sample on a 20x20 grid, and the first two columns are the corresponding X and Y coordinates, respectively. In each txt file, which corresponds to one Raman spectrum, the columns are:
X coordinate Y coordinate Wavenumber Signal Intensity
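Each spectrum file can be parsed directly with numpy, assuming whitespace-delimited columns in the order listed above (the file name is hypothetical):

import numpy as np

data = np.loadtxt("virus_spectra/H1N1_sample1.txt")   # hypothetical file name
x, y, wavenumber, intensity = data.T                  # columns: X, Y, wavenumber, signal intensity
print(data.shape)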
More data and machine learning source code
More data are available upon request, for research purposes only. Please email mtterrones@gmail.com (Dr. Mauricio Terrones) with a short description of the intended usage along with your request. The machine learning source code for our work is available via the GitHub repo below: https://github.com/karenyyy/Accurate-Virus-Identification-with-Interpretable-Raman-Signatures-by-Machine-Learning-.git
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Short Description: This is the dataset for the paper "Exploring the Use of Static and Dynamic Analysis to Improve the Performance of the Mining Sandbox Approach for Android Malware Identification", accepted for publication in the Journal of Systems and Software.
Link to this repository: https://github.com/droidxp/paper-replication-package
Authors of the Paper
Abstract
The popularization of the Android platform and the growing number of Android applications (apps) that manage sensitive data turned the Android ecosystem into an attractive target for malicious software. For this reason, researchers and practitioners have investigated new approaches to address Android's security issues, including techniques that leverage dynamic analysis to mine Android sandboxes. The mining sandbox approach consists in running dynamic analysis tools on a benign version of an Android app. This exploratory phase records all calls to sensitive APIs. Later, we can use this information to (a) prevent calls to other sensitive APIs (those not recorded in the exploratory phase) or (b) run the dynamic analysis tools again in a different version of the app. During this second execution of the fuzzing tools, a warning of possible malicious behavior is raised whenever the new version of the app calls a sensitive API not recorded in the exploratory phase.
The use of a mining sandbox approach is an effective technique for Android malware analysis, as previous research has revealed. In particular, existing reports present an accuracy of almost 70% in the identification of malicious behavior using dynamic analysis tools to mine Android sandboxes. However, although the use of dynamic analysis for mining Android sandboxes has been investigated before, little is known about the potential benefits of combining static analysis with a mining sandbox approach for identifying malicious behavior. Accordingly, in this paper we present the results of two studies that investigate the impact of using static analysis to complement the performance of existing dynamic analysis tools tailored for mining Android sandboxes in the task of identifying malicious behavior.
In the first study we conduct a non-exact replication of a previous study (hereafter BLL-Study) that compares the performance of test case generation tools for mining Android sandboxes. Differently from the original work, here we isolate the effect of an independent static analysis component (DroidFax) they used to instrument the Android apps in their experiments. This decision was motivated by the fact that DroidFax could have influenced the efficacy of the dynamic analyses tools positively---through the execution of specific static analysis algorithms DroidFax also implements. In our second study, we carried out a new experiment to investigate the efficacy of taint analysis algorithms to complement the mining sandbox approach previously used to identify malicious behavior. To this end, we executed the FlowDroid tool to mine the source-sink flows from benign/malign pairs of Android apps used in previous research work.
Our study brings several findings. For instance, the first study reveals that DroidFax alone (static analysis) can detect 43.75% of the malware in the BLL-Study dataset, contributing substantially to the performance of the dynamic analysis tools in the BLL-Study. The results of the second study show that taint analysis is also a practical complement to the mining sandbox approach, with a performance similar to that achieved by dynamic analysis tools.
Authors: Hengchuang Yin, Shufang Wu, Jie Tan, Qian Guo, Mo Li, Jinyuan Guo, Yaqi Wang, Xiaoqing Jiang, and Huaiqiu Zhu*
This work has been accepted by GigaScience.
Hengchuang Yin, Shufang Wu, Jie Tan, Qian Guo, Mo Li, Jinyuan Guo, Yaqi Wang, Xiaoqing Jiang, and Huaiqiu Zhu. "IPEV: Identification of Prokaryotic and Eukaryotic Virus-Derived Sequences in Virome Using Deep Learning." GigaScience 13 (2024): giae018. https://doi.org/10.1093/gigascience/giae018.
Background: The virome obtained through virus-like particle enrichment contains a mixture of prokaryotic and eukaryotic virus-derived fragments. Accurate identification and classification of these elements are crucial to understanding their roles and functions in microbial communities. However, the rapid mutation rates of viral genomes pose challenges in developing high-performance tools for classification, potentially limiting downstream analyses.
Findings: We present IPEV, a novel method to distinguish prokaryotic and eukaryotic viruses in viromes, with a 2D convolutional neural network combining trinucleotide pair relative distance and frequency. Cross-validation assessments of IPEV demonstrate its state-of-the-art precision, significantly improving the F1-score by approximately 22% on an independent test set compared to existing methods when query viruses share less than 30% sequence similarity with known viruses. Furthermore, IPEV outperforms other methods in accuracy on marine and gut virome samples based on annotations by sequence alignments. IPEV reduces runtime by at most 1,225 times compared to existing methods under the same computing configuration. We also utilized IPEV to analyze longitudinal samples and found that the gut virome exhibits a higher degree of temporal stability than previously observed in persistent personal viromes, providing novel insights into the resilience of the gut virome in individuals.
Conclusions: IPEV is a high-performance, user-friendly tool that assists biologists in identifying and classifying prokaryotic and eukaryotic viruses within viromes. The tool is available at https://github.com/basehc/IPEV.
5_fold_cross_validation.zip: Dataset of cross-validation of IPEV
Eukaryotic_virus_CV_Dataset-1.csv: GI and accession IDs for the cross-validation Dataset-1 (eukaryotic virus)
Prokaryotic_virus_CV_Dataset-1.csv: GI and accession IDs for the cross-validation Dataset-1 (prokaryotic virus)
Test_Prokaryotic_virus_Dataset-1.fasta: An independent test set of IPEV (prokaryotic virus)
Test_Eukaryotic_virus_Dataset-1.fasta: An independent test set of IPEV (eukaryotic virus)
Dataset_sequencing_error.zip: Simulated dataset with sequencing errors
Cap_enzyme_sequence.fasta: Accession IDs of Receptor Binding Proteins (RBPs) in phages collected by our article
Dataset_runtime_evaluation.zip: Dataset for evaluating the runtime of IPEV
Receptor_binding_protein_accession_id: Accession IDs of Receptor Binding Proteins (RBPs) in phages collected by our article
archaea_ID.txt: Accession ID information for the reference archaea dataset
bacteria_ID.txt: Accession ID information for the reference bacterial dataset
marine_virome_id.csv: Ocean virome data information used in our paper
gut_virome.csv: Gut virome data information used in our paper
fungi.txt: Negative sequence information used to train, validate, and test the model in the decontamination function
bacteria.txt: Negative sequence information used to train, validate, and test the model in the decontamination function
We also provide a Docker image file that does not require any environment configuration. You can reproduce the results of our paper (e.g., train and test our IPEV model) in a Docker image.
Pull the dryinhc/ipev_v1 image from Docker Hub. Open a terminal window and run the following command:
docker pull dryinhc/ipev_v1
This will download the image to your local machine.
Run the dryinhc/ipev_v1 image. In the same terminal window, run the following command:
docker run -it --rm dryinhc/ipev_v1
This will start a container based on the image and run the IPEV tool.
Inside the container, you can run cd train or cd into the other folders.
To exit the container, press Ctrl+D or type exit.
The Docker image contains four directories, namely 5 fold cross validation, independent set, marine virome, and gut virome. The 5-fold cross-validation directory holds the scripts required for implementing the 5-fold cross-validation method. The independent set directory contains the scripts necessary for working with the independent test set. Lastly, the marine virome and gut virome directories store scripts for analyzing the real datasets.
We hereby confirm that the dataset associated with the research described in this work is made available to the public under the Creative Commons Zero (CC0) license.
If you have any questions, please don't hesitate to contact us: yinhengchuang@pku.edu.cn or hqzhu@pku.edu.cn