Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is rooted in a study aimed at unveiling the origins and motivations behind the creation of malware repositories on GitHub. Our research embarks on an innovative journey to dissect the profiles and intentions of GitHub users who have been involved in this dubious activity.
Employing a robust methodology, we meticulously identified 14,000 GitHub users linked to malware repositories. By leveraging advanced large language model (LLM) analytics, we classified these individuals into distinct categories based on their perceived intent: 3,339 were deemed Malicious, 3,354 Likely Malicious, and 7,574 Benign, offering a nuanced perspective on the community behind these repositories.
Our analysis penetrates the veil of anonymity and obscurity often associated with these GitHub profiles, revealing stark contrasts in their characteristics. Malicious authors were found to typically possess sparse profiles focused on nefarious activities, while Benign authors presented well-rounded profiles, actively contributing to cybersecurity education and research. Those labeled as Likely Malicious exhibited a spectrum of engagement levels, underlining the complexity and diversity within this digital ecosystem.
We offer two datasets in this paper. First, a list of malware repositories: we collected and extended the malware repositories on GitHub in 2022, following the original papers. Second, a CSV file with GitHub user information and their maliciousness classification labels.
malware_repos.txt - each line lists one repository as username/reponame. This format allows for easy identification and access to each repository on GitHub for further analysis or review.
obfuscated_github_user_dataset.csv - the obfuscated GitHub user information together with the maliciousness classification labels.
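As a minimal illustration of working with the user file, the class distribution can be tallied with pandas; the column name used below is an assumption, so inspect the header first:

import pandas as pd

# Minimal sketch: count how many users fall into each maliciousness class.
# "label" is a hypothetical column name; check users.columns for the real one.
users = pd.read_csv("obfuscated_github_user_dataset.csv")
print(users.columns.tolist())
print(users["label"].value_counts())   # expected classes: Malicious, Likely Malicious, Benign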
Protection against ransomware is particularly relevant in systems running the Android operating system, due to its huge user base and, therefore, its potential for monetization by attackers. In "Extinguishing Ransomware - A Hybrid Approach to Android Ransomware Detection" (see references for details), we describe a hybrid (static + dynamic) malware detection method with extremely good accuracy (100% detection rate, with a false positive rate below 4%).
We release a dataset related to the dynamic detection part of the aforementioned method, containing execution traces of ransomware Android applications, in order to facilitate further research as well as the adoption of dynamic detection in practice. The dataset contains execution traces from 666 ransomware applications taken from the Heldroid project [https://github.com/necst/heldroid] (the app repository is unavailable at the moment). Execution records were obtained by running the applications, one at a time, on the Android emulator. For each application, a maximum of 20,000 stimuli were applied, with a maximum execution time of 15 minutes. For most of the applications, all the stimuli could be applied within this timeframe; in some of the traces neither of the two limits is reached, due to emulator hiccups. The collected features relate to memory and CPU usage, network interaction, and system calls, and they are sampled with a period of two seconds. The Android emulator of the Android Software Development Kit for Android 4.0 (release 20140702) was used. To guarantee that the system was always in a pristine condition when a new sample was started, thus avoiding possible interference (e.g., changed settings, running processes, and modifications of operating system files) from previously run samples, the Android operating system was re-initialized before running each application. The application execution process was automated by means of a shell script that made use of the Android Debug Bridge (adb) and was run on a Linux PC. The Monkey application exerciser was used in the script as the generator of the aforementioned stimuli. Monkey is a command-line tool that can be run on any emulator instance or device; it sends a pseudo-random stream of user events (stimuli) into the system, which acts as a stress test on the application software.
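To give a concrete picture of the collection procedure, the sketch below approximates it in Python; it is not the original shell script, and the package name handling is a simplification (the real script installed each apk, drove Monkey, and re-initialized the emulator between samples):

import subprocess

PACKAGE = "com.example.sample"   # hypothetical package name of the app under test
MAX_EVENTS = 20000               # stimulus budget used for the dataset
TIMEOUT_S = 15 * 60              # 15-minute cap per application

subprocess.run(["adb", "install", "-r", "sample.apk"], check=True)
try:
    # Monkey sends a pseudo-random stream of user events (stimuli) to the package.
    subprocess.run(["adb", "shell", "monkey", "-p", PACKAGE, "-v", str(MAX_EVENTS)],
                   timeout=TIMEOUT_S)
except subprocess.TimeoutExpired:
    pass  # the 15-minute limit was reached before all events were applied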
In this dataset, we provide both per-app CSV files as well as unified files, in which CSV files of single applications have been concatenated. The CSV files contain the features extracted from the raw execution record. The provided files are listed below:
ransom-per_app-csv.zip - features obtained by executing ransomware applications, one CSV per application
ransom-unified-csv.zip - features obtained by executing ransomware applications, only one CSV file
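The unified file is simply the concatenation of the per-app CSVs, so either form can be rebuilt from the other; a minimal sketch (the extracted folder name is an assumption):

import glob
import pandas as pd

# Rebuild a unified table from the per-app CSVs extracted from ransom-per_app-csv.zip.
frames = [pd.read_csv(path) for path in sorted(glob.glob("ransom-per_app-csv/*.csv"))]
unified = pd.concat(frames, ignore_index=True)
print(unified.shape)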
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Malware-benign image representation. The dataset was collected from several malware repositories, including TekDefense, TheZoo, The Malware-Repo, Malware Database, and MalwareBazaar. The benign samples were collected from Windows 10 and 11 system apps and several open-source software repositories, including CNET, Sourceforge, FileForum, and PortableFreeware. The samples were validated by scanning them with the VirusTotal malware scanning service. The samples were preprocessed by converting each malware binary into a grayscale image following the rules from Nataraj (2011). Nataraj paper: https://vision.ece.ucsb.edu/research/signal-processing-malware-analysis. The Maldeb dataset was collected by Debi Amalia Septiyani and Halimul Hakim Khairul: D. A. Septiyani, "Generating Grayscale and RGB Images dataset for windows PE malware using Gist Features extaction method," Institut Teknologi Bandung, 2022; and Dani Agung Prastiyo, "Design and implementation of a machine learning-based malware classification system with an audio signal feature Analysis Approach," Institut Teknologi Bandung, 2023. The complete dataset can be accessed at https://ieee-dataport.org/documents/maldeb-dataset and https://github.com/julismail/Self-Supervised
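The Nataraj (2011) preprocessing treats every byte of a binary as one 8-bit grayscale pixel and reshapes the byte stream into a 2D image. A minimal sketch of the idea is shown below; it uses a fixed image width instead of Nataraj's size-dependent width table, and the file names are hypothetical:

import numpy as np
from PIL import Image

def binary_to_grayscale(path, width=256):
    # Each byte becomes one pixel in the 0-255 grayscale range.
    data = np.frombuffer(open(path, "rb").read(), dtype=np.uint8).copy()
    height = len(data) // width
    return Image.fromarray(data[: height * width].reshape(height, width), mode="L")

binary_to_grayscale("sample.exe").save("sample.png")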
GNU General Public License 3.0 https://www.gnu.org/licenses/gpl-3.0-standalone.html
The WinMET dataset contains reports generated with the CAPE sandbox after analyzing several malware samples. The reports are valid JSON files that contain the spawned processes, the sequence of WinAPI and system calls invoked by each process, their parameters, their return values, and the OS resources accessed, amongst many others.
Please use this DOI reference that always points to the latest WinMET version: https://doi.org/10.5281/zenodo.12647555
This dataset was generated using the MALVADA framework (https://github.com/reverseame/MALVADA/tree/main), which you can read more about in our publication https://doi.org/10.1016/j.softx.2025.102082. The article also provides insights about the contents of this dataset.
Razvan Raducu, Alain Villagrasa-Labrador, Ricardo J. Rodríguez, Pedro Álvarez, MALVADA: A framework for generating datasets of malware execution traces, SoftwareX, Volume 30, 2025, 102082, ISSN 2352-7110, https://doi.org/10.1016/j.softx.2025.102082.(https://www.sciencedirect.com/science/article/pii/S2352711025000494)
The 7z file (WinMET.7z) is password protected. The password is: infected.
Compressed size on disk: ~2.5 GiB.
Decompressed size on disk: ~105 GiB.
Total decompressed .json files: 9,889.
The name of each .json file is irrelevant; it corresponds to its analysis ID.
cape_report_to_label_mapping.json and avclass_report_to_label_mapping.json contain the mappings of each report to its corresponding consensus label, sorted in descending order (by the number of reports belonging to each label/family).
If you use this dataset, cite it as follows:
TBA.
Such statistics (and many more) can be obtained by analyzing the WinMET dataset with the MALVADA framework (https://github.com/reverseame/MALVADA).
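For a quick look at the family distribution, the consensus-label mapping files can also be inspected directly; the exact JSON layout is not documented here, so the sketch assumes a flat report-to-label dictionary:

import json
from collections import Counter

with open("avclass_report_to_label_mapping.json") as fh:
    mapping = json.load(fh)           # assumed layout: {report_id: family_label, ...}

print(Counter(mapping.values()).most_common(10))   # ten most populous families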
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here are the data set and source code related to the paper: "A Framework for Developing Strategic Cyber Threat Intelligence from Advanced Persistent Threat Analysis Reports Using Graph-Based Algorithms"
1- aptnotes-downloader.zip : contains source code that downloads all APT reports listed in https://github.com/aptnotes/data and https://github.com/CyberMonitor/APT_CyberCriminal_Campagin_Collections
2- apt-groups.zip : contains all APT group names gathered from https://docs.google.com/spreadsheets/d/1H9_xaxQHpWaa4O_Son4Gx0YOIzlcBWMsdvePFX68EKU/edit?gid=1864660085#gid=1864660085 and https://malpedia.caad.fkie.fraunhofer.de/actors
3- apt-reports.zip : contains all deduplicated APT reports gathered from https://github.com/aptnotes/data and https://github.com/CyberMonitor/APT_CyberCriminal_Campagin_Collections
4- countries.zip : contains a country name list.
5- ttps.zip : contains all MITRE techniques gathered from https://attack.mitre.org/resources/attack-data-and-tools/
6- malware-families.zip : contains all malware family names gathered from https://malpedia.caad.fkie.fraunhofer.de/families
7- ioc-searcher-app.zip : contains source code that extracts IoCs from APT reports. Extracted IoC files are provided in report-analyser.zip. Original code repo can be found at https://github.com/malicialab/iocsearcher
8- extracted-iocs.zip : contains the IoCs extracted by ioc-searcher-app.zip
9- report-analyser.zip : contains source code that searches APT reports, malware families, countries and TTPs. In case of a match, it updates the files in extracted-iocs.zip.
10- cti-transformation-app.zip : contains source code that transforms the files in extracted-iocs.zip into CTI triples and saves them into a Neo4j graph database.
11- graph-db-backup.zip : contains the volume folder of a Neo4j Docker container. When it is mounted to a Docker container, the entire CTI database becomes reachable from the Neo4j web interface. Here is how to run a Neo4j Docker container that mounts the folders in the zip:
docker run -d --publish=7474:7474 --publish=7687:7687 --volume={PATH_TO_VOLUME}/DEVIL_NEO4J_VOLUME/neo4j/data:/data --volume={PATH_TO_VOLUME}/DEVIL_NEO4J_VOLUME/neo4j/plugins:/plugins --volume={PATH_TO_VOLUME}/DEVIL_NEO4J_VOLUME/neo4j/logs:/logs --volume={PATH_TO_VOLUME}/DEVIL_NEO4J_VOLUME/neo4j/conf:/conf --env 'NEO4J_PLUGINS=["apoc","graph-data-science"]' --env NEO4J_apoc_export_file_enabled=true --env NEO4J_apoc_import_file_enabled=true --env NEO4J_apoc_import_file_use_neo4j_config=true --env=NEO4J_AUTH=none neo4j:5.13.0
web interface: http://localhost:7474
username: neo4j
password: neo4j
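Besides the web interface, the CTI graph can be queried programmatically; the sketch below uses the official Neo4j Python driver and only issues a schema-agnostic count, since the exact node labels and properties depend on the CTI triples loaded:

from neo4j import GraphDatabase

# NEO4J_AUTH=none in the container above, so no credentials are required.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=None)
with driver.session() as session:
    for record in session.run("MATCH (n) RETURN labels(n) AS labels, count(*) AS total"):
        print(record["labels"], record["total"])
driver.close()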
Mobile malware detection has attracted massive research effort in our community. A reliable and up-to-date malware dataset is critical to evaluating the effectiveness of malware detection approaches. Essentially, the malware ground truth should be manually verified by security experts, and the malicious behaviors should be carefully labelled. Although there are several widely-used malware benchmarks in our community (e.g., MalGenome, Drebin, Piggybacking, and AMD), these benchmarks face several limitations, including datedness, size, coverage, and reliability issues.
We have made the effort to create MalRadar, a growing and up-to-date Android malware dataset built in the most reliable way, i.e., by collecting malware based on the analysis reports of security experts. We crawled all the mobile-security-related reports released by ten leading security companies, and used an automated approach to extract and label the useful ones describing new Android malware and containing Indicators of Compromise (IoC) information. We have compiled MalRadar, a dataset that contains 4,534 unique Android malware samples (including both apks and metadata) released from 2014 to April 2021 by the time of this paper, all of which were manually verified by security experts with detailed behavior analysis. For more details, please visit https://malradar.github.io/
The dataset includes the following files:
(1) sample-info.csv
In this file, we list all the detailed information about each sample, including apk file hash, app name, package name, report family, etc.
(2) malradar.zip
We have packaged the malware samples in chunks of 1,000 applications: malradar-0, malradar-1, malradar-2, malradar-3. All apk files are named after the file's SHA-256 hash.
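Because each apk is named after its SHA-256 digest, sample integrity can be verified with a short check like the following sketch (the path is hypothetical):

import hashlib
from pathlib import Path

def verify_sample(apk_path):
    # MalRadar naming convention: the file name (without extension) is the apk's SHA-256.
    digest = hashlib.sha256(Path(apk_path).read_bytes()).hexdigest()
    return digest == Path(apk_path).stem.lower()

print(verify_sample("malradar-0/sample.apk"))   # hypothetical path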
If your papers or articles use our dataset, please include a citation to our paper:
@article{wang2022malradar,
title={MalRadar: Demystifying Android Malware in the New Era},
author={Wang, Liu and Wang, Haoyu and He, Ren and Tao, Ran and Meng, Guozhu and Luo, Xiapu and Liu, Xuanzhe},
journal={Proceedings of the ACM on Measurement and Analysis of Computing Systems},
volume={6},
number={2},
pages={1--27},
year={2022},
publisher={ACM New York, NY, USA}
}
To leverage the vast literature solving the original MNIST digit recognition problem on small thumbnails, this firmware dataset maps the first 1024 bytes of malicious, benign, and hacked Internet of Things and embedded software binaries (Executable and Linkable Format, ELF) into MNIST-style thumbnail images. The goal is to provide a drop-in replacement for MNIST techniques that is relevant to weeding out malware using image recognition.
The images are provided as CSV, where each row gives the filename, the label class (both categorical and numerical), and the first 1024 bytes mapped into a grayscale range of 0-255 by first converting each byte to a decimal value (0-15) and then scaling.
See additional background on ELF files, https://en.wikipedia.org/wiki/Executable_and_Linkable_Format and https://linux-audit.com/elf-binaries-on-linux-understanding-and-analysis/
The labeled ELF files repository, https://github.com/nimrodpar/Labeled-Elfs
The dataset supports comparing firmware detection based on these image representations with signature-based methods, as well as contrasting statistical (tree-based) methods with deep learning techniques.
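A minimal sketch of the thumbnail idea is given below; it reads the raw first 1024 bytes directly as 0-255 pixel values, which is a simplification of the per-byte scaling described above, and the file names are hypothetical:

import numpy as np
from PIL import Image

# Render the first 1024 bytes of an ELF binary as a 32x32 grayscale thumbnail (MNIST-sized input).
raw = open("sample.elf", "rb").read(1024).ljust(1024, b"\x00")   # zero-pad binaries shorter than 1024 bytes
pixels = np.frombuffer(raw, dtype=np.uint8).copy().reshape(32, 32)
Image.fromarray(pixels, mode="L").save("sample.png")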
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset is a collection of evasive PDF samples, labeled as malicious (1) or benign (0). Since the dataset has an evasive nature, it can be used to test the robustness of trained PDF malware classifiers against evasion attacks. The dataset contains 500,000 generated evasive samples, including 450,000 malicious and 50,000 benign PDFs.
More details about the data can be found in the publication: Leveraging Adversarial Samples for Enhanced Classification of Malicious and Evasive PDF Files.
This resource aims to support researchers and cybersecurity professionals in developing more advanced and robust detection mechanisms for PDF-based malware.
Any work that uses this dataset should cite the following paper:
Trad, F.; Hussein, A.; Chehab, A. Leveraging Adversarial Samples for Enhanced Classification of Malicious and Evasive PDF Files. Appl. Sci. 2023, 13, 3472. https://doi.org/10.3390/app13063472
Implementation code can be found here
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The USTC-TFC2016 dataset is mainly used for network traffic classification research, covering both malicious traffic and normal application traffic, and was produced jointly by the University of Science and Technology of China and the Institute of Acoustics of the Chinese Academy of Sciences. The data comes from two sources: the first is 10 types of malicious traffic selected from the CTU dataset, collected by researchers from the Czech CTU University in real environments between 2011 and 2015; the second is 10 types of normal application traffic generated by network instrument simulation. The dataset thus consists of 20 traffic types, corresponding to 20 data files, all in pcap format. To save space, some pcap files were compressed before uploading; after decompression, the pcap files total 3.71 GB. For more information about this dataset, please refer to: 1) Wei Wang, Ming Zhu, Xuewen Zeng, Xiaozhou Ye and Yiqiang Sheng, "Malware traffic classification using convolutional neural network for representation learning," ICOIN 2017, pp. 712-717; 2) Wang Wei, Research on Network Traffic Classification and Anomaly Detection Methods Based on Deep Learning, Ph.D. Thesis, University of Science and Technology of China, 2018. This dataset and a preprocessing tool were released in 2018 at https://github.com/echowei/ and are used by many researchers in China and abroad. Due to bandwidth and capacity constraints, it often cannot be downloaded there, so we have uploaded it to the "Science Database" website of the Chinese Academy of Sciences for long-term storage and easy download. At the same time, we look forward to relevant researchers uploading and sharing new malicious traffic and encrypted traffic using domestic cryptographic protocols, as well as expanding this dataset with more types of malicious and normal traffic (such as 100 each), forming a richer and more comprehensive dataset that better approaches real network traffic.
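Each of the 20 data files is an ordinary pcap, so any packet-processing library can read it; a minimal scapy sketch (the file name is hypothetical) is:

from scapy.all import rdpcap

packets = rdpcap("Geodo.pcap")    # hypothetical file name from the malicious-traffic portion
print(len(packets), "packets")
print(packets[0].summary())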
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We have developed ProjecTILs, a computational approach to project new data sets into a reference map of T cells, enabling their direct comparison in a stable, annotated system of coordinates. Because new cells are embedded in the same space of the reference, ProjecTILs enables the classification of query cells into annotated, discrete states, but also over a continuous space of intermediate states. By comparing multiple samples over the same map, and across alternative embeddings, the method allows exploring the effect of cellular perturbations (e.g. as the result of therapy or genetic engineering) and identifying genetic programs significantly altered in the query compared to a control set or to the reference map. We illustrate the projection of several data sets from recent publications over two cross-study murine T cell reference atlases: the first describing tumor-infiltrating T lymphocytes (TILs), the second characterizing acute and chronic viral infection.
Single-cell data to build the virus-specific CD8 T cell reference map were downloaded from GEO under the following entries: GSE131535, GSE134139 and GSE119943, selecting only samples in wild type conditions. Data for the Ptpn2-KO, Tox-KO and CD4-depletion projections were obtained from entries GSE134139, GSE119943, and GSE137007 and were not included in the construction of the reference map. To construct the LCMV reference map, we split the dataset into five batches that displayed strong batch effects, and applied STACAS (https://github.com/carmonalab/STACAS) to mitigate its confounding effects. We computed 800 variable genes per batch, excluding cell cycling genes, ribosomal and mitochondrial genes, and computed pairwise anchors using 200 integration genes, and otherwise default STACAS parameters. Anchors were filtered at the default threshold 0.8 percentile, and integration was performed with the IntegrateData Seurat3 function with the guide tree suggested by STACAS.
Next, we performed unsupervised clustering of the integrated cell embeddings using the Shared Nearest Neighbor (SNN) clustering method implemented in Seurat 3 with parameters {resolution=0.4, reduction="pca", k.param=20}. We then manually annotated individual clusters (merging clusters when necessary) based on several criteria: i) average expression of key marker genes in individual clusters; ii) gradients of gene expression over the UMAP representation of the reference map; iii) gene-set enrichment analysis to determine over- and under-expressed genes per cluster using MAST. In order to have access to predictive methods for UMAP, we recomputed PCA and UMAP embeddings independently of Seurat3 using respectively the prcomp function from the basic R package "stats", and the "umap" R package (https://github.com/tkonopka/umap).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The winFam dataset is a collection that amalgamates cryptoransomware files sourced from VirusShare and Windows PE32 malware/benign files from MaleVis dataset.
Key Features:
File Sources
Cryptoransomware files were sourced from VirusShare (https://virusshare.com/), a reputable repository known for its extensive collection of malware samples. Windows PE32 files were obtained from https://web.cs.hacettepe.edu.tr/~selman/malevis/, enriching the dataset with a variety of Windows executable files.
Conversion to Images
The executable files from Cryptoransomware were converted into RGB images using the binary2image tool (https://github.com/ncarkaci/binary-to-image/blob/master/binary2image.py). This transformation allows researchers to leverage image processing techniques and neural networks for advanced analysis.
Malware Families
The dataset comprises 26 distinct malware families plus a benign class, included for comparative analysis and to facilitate the development of effective detection mechanisms. The images are split into three folders: Train, Valid, and Test.
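Since the images are organized into Train, Valid, and Test folders with one subfolder per family, they can be loaded directly with torchvision's ImageFolder; the root path and image size below are assumptions:

from torchvision import datasets, transforms

transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
train_set = datasets.ImageFolder("winFam/Train", transform=transform)   # hypothetical root path
print(len(train_set), "images across", len(train_set.classes), "classes")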
Research Applications
It serves as a valuable resource for the development and evaluation of deep learning models based on malware visualization.
Community Contribution
The winFam dataset is shared with the cybersecurity community on Zenodo to encourage collaboration and knowledge-sharing. Researchers, analysts, and developers can leverage this dataset to advance the state of the art in malware analysis and cybersecurity.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
About Dataset
Context
The dataset is acquired via the VIRRION platform designed for label-free capture and enrichment of viruses. It consists of Raman spectra of three groups of RNA viruses: human respiratory viruses (influenza A H1N1 and H3N2, influenza B, rhinovirus, respiratory syncytial virus (RSV)), avian respiratory viruses (influenza A H5N2 and H7N2, infectious bronchitis virus (IBV), reovirus), and human enteroviruses (Coxsackievirus B types 1 and 3 (CVB1, CVB3), enteroviruses EV70 and EV71).
Content
The dataset contains 10 sample Raman spectra for each virus type/subtype in "virus_spectra.zip", and visualizations of the spectra in "virus_spectra_plots.zip". The Raman mapping data were collected for each sample on a 20x20 grid, and the first two columns are the corresponding X and Y coordinates, respectively. In each txt file, which corresponds to one Raman spectrum, the columns are:
X coordinate Y coordinate Wavenumber Signal Intensity
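Each spectrum file can be parsed directly with numpy, assuming whitespace-delimited columns in the order listed above (the file name is hypothetical):

import numpy as np

data = np.loadtxt("virus_spectra/H1N1_sample1.txt")   # hypothetical file name
x, y, wavenumber, intensity = data.T                  # columns: X, Y, wavenumber, signal intensity
print(data.shape)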
More data and machine learning source code
More data are available upon request, for research purposes only. Please email mtterrones@gmail.com (Dr. Mauricio Terrones) with a short description of the intended usage along with your request. The machine learning source code for our work is available via the GitHub repo below: https://github.com/karenyyy/Accurate-Virus-Identification-with-Interpretable-Raman-Signatures-by-Machine-Learning-.git
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Short Description: This is the dataset for the paper "Exploring the Use of Static and Dynamic Analysis to Improve the Performance of the Mining Sandbox Approach for Android Malware Identification", accepted for publication in the Journal of Systems and Software.
Link to this repository: https://github.com/droidxp/paper-replication-package
Authors of the Paper
Abstract
The popularization of the Android platform and the growing number of Android applications (apps) that manage sensitive data turned the Android ecosystem into an attractive target for malicious software. For this reason, researchers and practitioners have investigated new approaches to address Android's security issues, including techniques that leverage dynamic analysis to mine Android sandboxes. The mining sandbox approach consists in running dynamic analysis tools on a benign version of an Android app. This exploratory phase records all calls to sensitive APIs. Later, we can use this information to (a) prevent calls to other sensitive APIs (those not recorded in the exploratory phase) or (b) run the dynamic analysis tools again in a different version of the app. During this second execution of the fuzzing tools, a warning of possible malicious behavior is raised whenever the new version of the app calls a sensitive API not recorded in the exploratory phase.
The use of a mining sandbox approach is an effective technique for Android malware analysis, as previous research has revealed. In particular, existing reports present an accuracy of almost 70% in the identification of malicious behavior using dynamic analysis tools to mine Android sandboxes. However, although the use of dynamic analysis for mining Android sandboxes has been investigated before, little is known about the potential benefits of combining static analysis with a mining sandbox approach for identifying malicious behavior. Accordingly, in this paper we present the results of two studies that investigate the impact of using static analysis to complement the performance of existing dynamic analysis tools tailored for mining Android sandboxes in the task of identifying malicious behavior.
In the first study we conduct a non-exact replication of a previous study (hereafter BLL-Study) that compares the performance of test case generation tools for mining Android sandboxes. Differently from the original work, here we isolate the effect of an independent static analysis component (DroidFax) they used to instrument the Android apps in their experiments. This decision was motivated by the fact that DroidFax could have influenced the efficacy of the dynamic analyses tools positively---through the execution of specific static analysis algorithms DroidFax also implements. In our second study, we carried out a new experiment to investigate the efficacy of taint analysis algorithms to complement the mining sandbox approach previously used to identify malicious behavior. To this end, we executed the FlowDroid tool to mine the source-sink flows from benign/malign pairs of Android apps used in previous research work.
Our study brings several findings. For instance, the first study reveals that DroidFax alone (static analysis) can detect 43.75% of the malware in the BLL-Study dataset, contributing substantially to the performance of the dynamic analysis tools in the BLL-Study. The results of the second study show that taint analysis is also a practical complement to the mining sandbox approach, with a performance similar to that achieved by dynamic analysis tools.
Authors: Hengchuang Yin, Shufang Wu, Jie Tan, Qian Guo, Mo Li, Jinyuan Guo, Yaqi Wang, Xiaoqing Jiang, and Huaiqiu Zhu*
This work has been accepted by GigaScience.
Hengchuang Yin, Shufang Wu, Jie Tan, Qian Guo, Mo Li, Jinyuan Guo, Yaqi Wang, Xiaoqing Jiang, and Huaiqiu Zhu. "IPEV: Identification of Prokaryotic and Eukaryotic Virus-Derived Sequences in Virome Using Deep Learning." GigaScience 13 (2024): giae018. https://doi.org/10.1093/gigascience/giae018.
Background: The virome obtained through virus-like particle enrichment contains a mixture of prokaryotic and eukaryotic virus-derived fragments. Accurate identification and classification of these elements are crucial to understanding their roles and functions in microbial communities. However, the rapid mutation rates of viral genomes pose challenges in developing high-performance tools for classification, potentially limiting downstream analyses.
Findings: We present IPEV, a novel method to distinguish prokaryotic and eukaryotic viruses in viromes, with a 2D convolutional neural network combining trinucleotide pair relative distance and frequency. Cross-validation assessments of IPEV demonstrate its state-of-the-art precision, significantly improving the F1-score by approximately 22% on an independent test set compared to existing methods when query viruses share less than 30% sequence similarity with known viruses. Furthermore, IPEV outperforms other methods in accuracy on marine and gut virome samples based on annotations by sequence alignments. IPEV reduces runtime by at most 1,225 times compared to existing methods under the same computing configuration. We also utilized IPEV to analyze longitudinal samples and found that the gut virome exhibits a higher degree of temporal stability than previously observed in persistent personal viromes, providing novel insights into the resilience of the gut virome in individuals.
Conclusions: IPEV is a high-performance, user-friendly tool that assists biologists in identifying and classifying prokaryotic and eukaryotic viruses within viromes. The tool is available at https://github.com/basehc/IPEV.
5_fold_cross_validation.zip: Dataset of cross-validation of IPEV
Eukaryotic_virus_CV_Dataset-1.csv: GI and accession IDs for the cross-validation Dataset-1 (eukaryotic virus)
Prokaryotic_virus_CV_Dataset-1.csv: GI and accession IDs for the cross-validation Dataset-1 (prokaryotic virus)
Test_Prokaryotic_virus_Dataset-1.fasta: An independent test set of IPEV (prokaryotic virus)
Test_Eukaryotic_virus_Dataset-1.fasta: An independent test set of IPEV (eukaryotic virus)
Dataset_sequencing_error.zip: Simulated dataset with sequencing errors
Cap_enzyme_sequence.fasta: Accession IDs of Receptor Binding Proteins (RBPs) in phages collected by our article
Dataset_runtime_evaluation.zip: Dataset for evaluating the runtime of IPEV
Receptor_binding_protein_accession_id: Accession IDs of Receptor Binding Proteins (RBPs) in phages collected by our article
archaea_ID.txt: Accession ID information for the reference archaea dataset
bacteria_ID.txt: Accession ID information for the reference bacterial dataset
marine_virome_id.csv: Ocean virome data information used in our paper
gut_virome.csv: Gut virome data information used in our paper
fungi.txt: Negative sequence information used to train, validate, and test the model in the decontamination function
bacteria.txt: Negative sequence information used to train, validate, and test the model in the decontamination function
We also provide a Docker image file that does not require any environment configuration. You can reproduce the results of our paper (e.g., train and test our IPEV model) in a Docker image.
Pull the dryinhc/ipev_v1 image from Docker Hub. Open a terminal window and run the following command:
docker pull dryinhc/ipev_v1
This will download the image to your local machine.
Run the dryinhc/ipev_v1 image. In the same terminal window, run the following command:
docker run -it --rm dryinhc/ipev_v1
This will start a container based on the image and run the IPEV tool.
Inside the container, you can run cd train or cd into the other folders.
To exit the container, press Ctrl+D or type exit.
The Docker image contains four directories, namely 5 fold cross validation, independent set, marine virome, and gut virome. The 5-fold cross-validation directory holds the scripts required for implementing the 5-fold cross-validation method. The independent set directory contains the scripts necessary for working with the independent test set. Lastly, the marine virome and gut virome directories store scripts for analyzing the real datasets.
We hereby confirm that the dataset associated with the research described in this work is made available to the public under the Creative Commons Zero (CC0) license.
If you have any questions, please don't hesitate to contact us: yinhengchuang@pku.edu.cn or hqzhu@pku.edu.cn