Mobile malware detection has attracted massive research effort in our community. A reliable and up-to-date malware dataset is critical for evaluating the effectiveness of malware detection approaches. Essentially, the malware ground truth should be manually verified by security experts, and the samples' malicious behaviors should be carefully labelled. Although there are several widely used malware benchmarks in our community (e.g., MalGenome, Drebin, Piggybacking, and AMD), these benchmarks suffer from several limitations, including staleness, limited size, narrow coverage, and reliability issues.
We set out to create MalRadar, a growing and up-to-date Android malware dataset built in the most reliable way, i.e., by collecting malware based on the analysis reports of security experts. We crawled all the mobile-security-related reports released by ten leading security companies and used an automated approach to extract and label the reports that describe new Android malware and contain Indicators of Compromise (IoC) information. The resulting dataset, MalRadar, contains 4,534 unique Android malware samples (both APKs and metadata) released from 2014 to April 2021 (as of the time of the paper), all of which were manually verified by security experts with detailed behavior analysis. For more details, please visit https://malradar.github.io/
The dataset includes the following files:
(1) sample-info.csv
In this file, we list the detailed information about each sample, including the APK file hash, app name, package name, report family, etc.
(2) malradar.zip
We have packaged the malware samples in chunks of 1,000 applications: malradar-0, malradar-1, malradar-2, and malradar-3. All APK files are named after their SHA256 hash.
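As a rough illustration of how these pieces fit together, the sketch below loads sample-info.csv and resolves a sample's APK inside the unzipped chunk directories. The chunk directory names come from the listing above; the CSV column names and the presence of an .apk extension are assumptions of this sketch.

```python
import csv
from pathlib import Path

# Assumed layout after unzipping malradar.zip: four chunk directories,
# each holding APK files named after their SHA256 hash.
CHUNK_DIRS = [Path(f"malradar-{i}") for i in range(4)]

def load_sample_info(csv_path="sample-info.csv"):
    # Column names (hash, app name, package name, family, ...) are whatever
    # the real header contains; this simply reads every row as a dict.
    with open(csv_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def locate_apk(sha256):
    # Files are named after their SHA256; whether they carry an .apk
    # extension is an assumption, so both variants are tried.
    for chunk in CHUNK_DIRS:
        for name in (sha256, f"{sha256}.apk"):
            candidate = chunk / name
            if candidate.exists():
                return candidate
    return None

samples = load_sample_info()
print(f"{len(samples)} samples listed in sample-info.csv")
```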
If your papers or articles use our dataset, please cite our paper:
@article{wang2022malradar, title={MalRadar: Demystifying Android Malware in the New Era}, author={Wang, Liu and Wang, Haoyu and He, Ren and Tao, Ran and Meng, Guozhu and Luo, Xiapu and Liu, Xuanzhe}, journal={Proceedings of the ACM on Measurement and Analysis of Computing Systems}, volume={6}, number={2}, pages={1--27}, year={2022}, publisher={ACM New York, NY, USA} }
As of March 2020, roughly 482,579 new Android malware samples were being recorded per month. According to AV-Test, trojans were the most common type of malware affecting Android devices.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically).
MH-100K is an extensive collection of Android malware information comprising 101,975 samples. It includes a main CSV file with valuable metadata: the SHA256 hash (the APK's signature), file name, package name, the official Android compilation API, 166 permissions, 24,417 API calls, and 250 intents.
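A minimal pandas sketch of how the main CSV might be split into metadata and binary feature columns is shown below. The file name and the metadata column names are hypothetical; only the feature group sizes (166 permissions, 24,417 API calls, 250 intents) come from the description above.

```python
import pandas as pd

# Hypothetical file name for the MH-100K main CSV.
df = pd.read_csv("mh100k_main.csv")

# Assumed metadata column names; replace them with the real header values.
meta_cols = ["SHA256", "FILE_NAME", "PACKAGE_NAME", "API"]
feature_cols = [c for c in df.columns if c not in meta_cols]

# The remaining columns should be the 166 permissions, 24,417 API calls and
# 250 intents, i.e. roughly 24,833 indicator features per sample.
X = df[feature_cols]
print(X.shape)
```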
Protection against ransomware is particularly relevant on systems running the Android operating system, due to its huge user base and, therefore, its potential for monetization by attackers. In "Extinguishing Ransomware - A Hybrid Approach to Android Ransomware Detection" (see references for details), we describe a hybrid (static + dynamic) malware detection method with extremely good accuracy (a 100% detection rate and a false positive rate below 4%).
We release a dataset related to the dynamic detection part of the aforementioned method, containing execution traces of Android ransomware applications, in order to facilitate further research as well as the adoption of dynamic detection in practice. The dataset contains execution traces from 666 ransomware applications taken from the Heldroid project [https://github.com/necst/heldroid] (the app repository is unavailable at the moment). Execution records were obtained by running the applications, one at a time, on the Android emulator. For each application, a maximum of 20,000 stimuli were applied with a maximum execution time of 15 minutes. For most of the applications, all the stimuli could be applied within this timeframe; in some traces, neither limit is reached due to emulator hiccups. The collected features relate to memory and CPU usage, network interaction, and system calls, and are sampled with a period of two seconds.

The Android emulator of the Android Software Development Kit for Android 4.0 (release 20140702) was used. To guarantee that the system was always in a pristine condition when a new sample was started, thus avoiding possible interference from previously run samples (e.g., changed settings, running processes, and modifications of operating system files), the Android operating system was re-initialized before running each application. The application execution process was automated by a shell script that used the Android Debug Bridge (adb) and ran on a Linux PC. The Monkey application exerciser was used in the script as the generator of the aforementioned stimuli; the Monkey is a command-line tool that can be run on any emulator instance or device, and it sends a pseudo-random stream of user events (stimuli) into the system, acting as a stress test on the application software.
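The original automation was a Linux shell script built around adb and the Monkey; the Python sketch below only illustrates that loop (install, stimulate with Monkey, poll coarse resource figures every two seconds). It is not the authors' script: the dumpsys-based sampling, the throttle value, and the package handling are stand-ins for the real memory/CPU/network/system-call collection.

```python
import subprocess
import time

def adb(*args):
    """Run an adb command and return its textual output."""
    return subprocess.run(["adb", *args], capture_output=True, text=True).stdout

def run_sample(apk_path, package, events=20000, max_minutes=15, period_s=2):
    """Install one APK, drive it with the Monkey exerciser and poll coarse
    memory/CPU figures until the event budget or the time limit is hit."""
    adb("install", "-r", apk_path)
    monkey = subprocess.Popen(
        ["adb", "shell", "monkey", "-p", package, "--throttle", "100", str(events)])
    deadline = time.time() + max_minutes * 60
    records = []
    while monkey.poll() is None and time.time() < deadline:
        records.append({
            "t": time.time(),
            "meminfo": adb("shell", "dumpsys", "meminfo", package),
            "cpuinfo": adb("shell", "dumpsys", "cpuinfo"),
        })
        time.sleep(period_s)
    if monkey.poll() is None:
        monkey.terminate()
    adb("uninstall", package)
    return records
```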
In this dataset, we provide both per-app CSV files and unified files, in which the CSV files of individual applications have been concatenated. The CSV files contain the features extracted from the raw execution records. The provided files are listed below:
ransom-per_app-csv.zip - features obtained by executing ransomware applications, one CSV per application
ransom-unified-csv.zip - features obtained by executing ransomware applications, only one CSV file
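A small sketch of how the per-app archive can be turned into a single labelled frame is given below; only the archive name comes from the listing above, while the feature columns are whatever headers the individual CSV files carry.

```python
import io
import zipfile
import pandas as pd

frames = []
with zipfile.ZipFile("ransom-per_app-csv.zip") as zf:
    for name in zf.namelist():
        if not name.endswith(".csv"):
            continue
        df = pd.read_csv(io.BytesIO(zf.read(name)))
        df["app"] = name            # remember which trace the rows came from
        df["label"] = "ransomware"  # every app in this archive is ransomware
        frames.append(df)

traces = pd.concat(frames, ignore_index=True)
print(traces.shape)
```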
According to AV-Test, trojans were found to be the preferred means for cybercriminals to infiltrate Android systems. In 2019, trojans accounted for 93.93 percent of all malware attacks on Android systems. Ransomware ranked second, with 2.47 percent of Android malware samples involving this variant.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically).
A dataset of meta-information on benign and malicious Android samples used in the paper: Martín, A., Calleja, A., Menéndez, H. D., Tapiador, J., & Camacho, D. (2016, December). ADROIT: Android malware detection using meta-information. In 2016 IEEE Symposium Series on Computational Intelligence (SSCI) (pp. 1-8). IEEE.
Martín, Alejandro; Calleja, Alejandro; Menéndez, Héctor D.; Tapiador, Juan; Camacho, David (2017), “ADROIT”, Mendeley Data, V2, doi: 10.17632/yr92xbrvgx.2
This is a dataset of coronavirus-themed malware for Android devices. It is a daily growing COVID-19 themed mobile app dataset, containing 4,322 COVID-19 themed APK samples (2,500 unique apps) and 611 potential malware samples (370 unique malicious apps) as of mid-November 2020.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically).
Contains a permission dataset extracted from .apk files downloaded from thirty repositories. It contains more than 1,500 permissions. The apps belong to thirty different categories.
Protection against malware is particularly relevant on systems running the Android operating system, due to its huge user base and, therefore, its potential for monetization by attackers.
Dynamic malware detection has been widely adopted by the scientific community but not yet in practical applications.
We release DYNAMISM (Dynamic Analysis of Malware), a dataset containing execution traces of both benign and malicious applications running on Android OS, in order to facilitate further research as well as the adoption of dynamic detection in practice. The dataset contains execution traces from 2,386 benign applications and 2,495 malicious applications taken from the Malware Genome Project repository [http://www.malgenomeproject.org] and from the Drebin dataset [https://www.sec.cs.tu-bs.de/~danarp/drebin/]. Execution records were obtained by running the applications, one at a time, on the Android emulator. For each application, a maximum of 2,000 stimuli were applied with a maximum execution time of 10 minutes. For most of the applications, all the stimuli could be applied within this timeframe; in some traces, neither limit is reached due to emulator hiccups. The collected features relate to memory and CPU usage, network interaction, and system calls, and are sampled with a period of two seconds.

The Android emulator of the Android Software Development Kit for Android 4.0 (release 20140702) was used. To guarantee that the system was always in a pristine condition when a new sample was started, thus avoiding possible interference from previously run samples (e.g., changed settings, running processes, and modifications of operating system files), the Android operating system was re-initialized before running each application. The application execution process was automated by a shell script that used the Android Debug Bridge (adb) and ran on a Linux PC. The Monkey application exerciser was used in the script as the generator of the aforementioned stimuli; the Monkey is a command-line tool that can be run on any emulator instance or device, and it sends a pseudo-random stream of user events (stimuli) into the system, acting as a stress test on the application software.
In this dataset, we provide both per-app CSV files and unified files, in which the CSV files of individual applications have been concatenated. The CSV files contain the features extracted from the raw execution records. The provided files are listed below:
benign-per_app-csv.zip - features obtained by executing benign applications, one CSV per application
benign-unified-csv.zip - features obtained by executing benign applications, only one CSV file
malicious-per_app-csv.zip - features obtained by executing malicious applications, one CSV per application
malicious-unified-csv.zip - features obtained by executing malicious applications, only one CSV file
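As an example of how the unified files could feed a detector, the sketch below concatenates the benign and malicious traces and trains a simple classifier. The CSV paths inside the archives are hypothetical, and any non-numeric identifier or timestamp columns are simply dropped, since their names are not specified above.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical paths after unzipping the two unified archives.
benign = pd.read_csv("benign-unified.csv")
malicious = pd.read_csv("malicious-unified.csv")

benign["label"], malicious["label"] = 0, 1
data = pd.concat([benign, malicious], ignore_index=True)

# Keep only numeric feature columns; identifier/timestamp columns are dropped.
X = data.drop(columns=["label"]).select_dtypes("number").fillna(0)
y = data["label"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(f"hold-out accuracy: {clf.score(X_te, y_te):.3f}")
```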
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically).
This repository includes VirusTotal scan reports of samples of two datasets used in the paper.
(1) GPset-VT-Reports.zip: the first and second VT scans for the GPset samples.
(2) 3rdset-VT-Reports.zip: the first and second VT scans for the 3rdset samples.
(3) FusedApp-VT-Reports.zip: the VT scan for the fused (multi-family) samples.
Each scan report is a JSON file generated by VirusTotal, including file metadata (e.g., hashes, size), certificate metadata (e.g., thumbprint), VirusTotal-specific data (e.g., submission time), and a list of detection labels assigned by various antivirus engines used to scan the file.
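As a rough illustration, the sketch below summarizes one such report: how many engines flagged the sample and which labels they assigned. It assumes the v2-style VirusTotal report layout with a top-level "scans" dictionary of {engine: {"detected": bool, "result": str}}; v3-style reports nest this information differently, so the keys may need adjusting.

```python
import json
from collections import Counter

def summarize_vt_report(path):
    """Summarize one VirusTotal JSON report (v2-style layout assumed)."""
    with open(path, encoding="utf-8") as f:
        report = json.load(f)
    scans = report.get("scans", {})
    labels = Counter(
        v["result"] for v in scans.values() if v.get("detected") and v.get("result"))
    return {
        "sha256": report.get("sha256"),
        "positives": report.get("positives",
                                sum(bool(v.get("detected")) for v in scans.values())),
        "total": report.get("total", len(scans)),
        "top_labels": labels.most_common(5),
    }

print(summarize_vt_report("example-report.json"))  # hypothetical file name
```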
License: Public Domain Dedication (CC0 1.0), https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains 200K Android malware apps that are labeled and characterized by their corresponding family. Benign Android apps (200K) are collected from the AndroZoo dataset to balance the overall dataset. We collected 14 malware categories, including adware, backdoor, file infector, no category, Potentially Unwanted Apps (PUA), ransomware, riskware, scareware, trojan, trojan-banker, trojan-dropper, trojan-sms, trojan-spy, and zero-day.
A complete taxonomy of the malware families of the captured malware apps is created by dividing them into 8 categories: sensitive data collection, media, hardware, actions/activities, internet connection, C&C, antivirus, and storage & settings. The per-category family and sample counts are listed in the table below, followed by a class-weighting sketch based on those counts.
Category | Number of families | Number of samples |
---|---|---|
Adware | 48 | 47,210 |
Backdoor | 11 | 1,538 |
File Infector | 5 | 669 |
No Category | - | 2,296 |
PUA | 8 | 2,051 |
Ransomware | 8 | 6,202 |
Riskware | 21 | 97,349 |
Scareware | 3 | 1,556 |
Trojan | 45 | 13,559 |
Trojan-Banker | 11 | 887 |
Trojan-Dropper | 9 | 2,302 |
Trojan-SMS | 11 | 3,125 |
Trojan-Spy | 11 | 3,540 |
Zero-day | - | 13,340 |
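Because the category sizes above are highly imbalanced (Riskware has 97,349 samples versus 669 for File Infector), a common first step is to weight classes by inverse frequency when training a classifier. The sketch below derives such weights from the table; the weighting scheme itself is a standard choice, not something prescribed by the dataset.

```python
# Per-category sample counts taken from the table above.
category_counts = {
    "Adware": 47210, "Backdoor": 1538, "File Infector": 669, "No Category": 2296,
    "PUA": 2051, "Ransomware": 6202, "Riskware": 97349, "Scareware": 1556,
    "Trojan": 13559, "Trojan-Banker": 887, "Trojan-Dropper": 2302,
    "Trojan-SMS": 3125, "Trojan-Spy": 3540, "Zero-day": 13340,
}

total = sum(category_counts.values())
n_classes = len(category_counts)

# Inverse-frequency weights: weight_c = total / (n_classes * count_c),
# so rare classes receive proportionally larger weights.
class_weights = {c: total / (n_classes * n) for c, n in category_counts.items()}
for category, weight in sorted(class_weights.items(), key=lambda kv: -kv[1]):
    print(f"{category:15s} {weight:7.2f}")
```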
AndroidManifest.xml contains many features that can be used for static analysis. The main extracted features include the following (a parsing sketch follows the table):
Feature | Values |
---|---|
Package Name | "com.fb.iwidget" |
Activities | "com.fb.iwidget.OverlayActivity" "org.acra.CrashReportDialog" "com.batch.android.BatchActionActivity" "com.fb.iwidget.MainActivity" "com.fb.iwidget.PreferencesActivity" "com.fb.iwidget.PickerActivity" "com.fb.iwidget.IntroActivity" |
Services | "com.batch.android.BatchActionService" "com.fb.iwidget.MainService" "com.fb.iwidget.SnapAccessService" |
Receivers/Providers | "com.fb.iwidget.ExpandWidgetProvider" "com.fb.iwidget.ActionReceiver" |
Intents Actions | "android.accessibilityservice.AccessibilityService" "android.appwidget.action.APPWIDGET_UPDATE" "android.intent.action.BOOT_COMPLETED" "android.intent.action.CREATE_SHORTCUT" "android.intent.action.MAIN" "android.intent.action.MY_PACKAGE_REPLACED" "android.intent.action.USER_PRESENT" "android.intent.action.VIEW" "com.fb.iwidget.action.SHOULD_REVIVE" |
Intents Categories | "android.intent.category.BROWSABLE" "android.intent.category.DEFAULT" "android.intent.category.LAUNCHER" |
Permissions | "android.permission.ACCESS_NETWORK_STATE" "android.permission.CALL_PHONE" "android.permission.INTERNET" "android.permission.RECEIVE_BOOT_COMPLETED" "android.permission.SYSTEM_ALERT_WINDOW" "com.android.vending.BILLING" "android.permission.BIND_ACCESSIBILITY_SERVICE" |
Meta-Data | "android.accessibilityservice" "android.appwidget.provider" |
#Icons | 331 |
#Pictures | 0 |
#Videos | 0 |
#Audio files | 0 |
Size of the App | 4.2M |
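A minimal sketch of extracting such manifest features is shown below. It assumes the manifest has already been decoded to plain XML (e.g., with apktool), since the binary AXML stored inside an APK needs a dedicated decoder such as androguard before it can be parsed with the standard library.

```python
import xml.etree.ElementTree as ET

ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

def android_name(elem):
    """Return the android:name attribute of a manifest element."""
    return elem.get(f"{ANDROID_NS}name")

def manifest_features(path="AndroidManifest.xml"):
    """Extract static features similar to the table above from a manifest
    that has already been decoded to plain XML."""
    root = ET.parse(path).getroot()
    app = root.find("application")

    def components(tag):
        return [android_name(e) for e in app.findall(tag)] if app is not None else []

    return {
        "package": root.get("package"),
        "permissions": [android_name(e) for e in root.findall("uses-permission")],
        "activities": components("activity"),
        "services": components("service"),
        "receivers": components("receiver"),
        "providers": components("provider"),
        "intent_actions": sorted({android_name(e) for e in root.iter("action")}),
        "intent_categories": sorted({android_name(e) for e in root.iter("category")}),
    }

print(manifest_features())
```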
To understand the behavioral changes of these malware categories and families, six categories of features are extracted after executing the malware in an emulated environment.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically).
This dataset contains 18,850 benign Android application packages and 10,000 malware Android packages, which are used to characterize the behaviour of malware applications based on the permissions they request at run time.
AMFD (Android Malware Family Dataset)
12,570 malware samples across 32 malware families.
The accumulated dataset combines botnet samples from the Android Malware Genome Project, malware security blogs, VirusTotal, and samples provided by a well-known anti-malware vendor. Overall, the dataset includes 1,929 samples spanning the period from 2010 (the first appearance of Android botnets) to 2014. The Android Botnet dataset consists of 14 families:

Family | Year of discovery | No. of samples |
---|---|---|
AnserverBot | 2011 | 244 |
Bmaster | 2012 | 6 |
DroidDream | 2011 | 363 |
Geinimi | 2010 | 264 |
MisoSMS | 2013 | 100 |
NickySpy | 2011 | 199 |
Not Compatible | 2014 | 76 |
PJapps | 2011 | 244 |
Pletor | 2014 | 85 |
RootSmart | 2012 | 28 |
Sandroid | 2014 | 44 |
TigerBot | 2012 | 96 |
Wroba | 2014 | 100 |
Zitmo | 2010 | 80 |

Contact: cic@unb.ca
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically).
A dataset containing 2,375 samples of Android process memory string dumps. The dataset is broadly composed of two classes, "Benign App" memory dumps and "Malicious App" memory dumps, split into two ZIP archives. The ZIP archives are approximately 17GB in total, while the unzipped contents are approximately 67GB.

This dataset is derived from a subset of the APK files originally made freely available for research through the AndroZoo project [1]. The AndroZoo project collected millions of Android applications and scanned them with the VirusTotal online malware scanning service, thereby classifying most of the apps as either malicious or benign at the time of scanning. The process memory dumps in this dataset were generated by running the subset of APK files from the AndroZoo dataset in an Android emulator, capturing the process memory of the individual process, and subsequently extracting only the strings from the process memory dump. This was facilitated by building two applications, Coriander and AndroMemDumpBeta, which handle the running of apps on Android emulators and the capturing of process memory, respectively. The source code for these software applications is available on GitHub.

The individual samples are labelled with the SHA256 hash filename from the original AndroZoo labeling and the application package names extracted from within the specific APK manifest file. They also contain a timestamp for when the memory dumping process took place for the specific file. The file extension used is ".dmp" to indicate that the files are memory dumps; however, they only contain strings and can thus be viewed in any simple text editor.

A subset of the first 10,000 APK files from the original AndroZoo dataset is also included within this dataset. The metadata of these APK files is present in the file "AndroZoo-First-10000", and the 2,375 Android apps that are the main subjects of our dataset are extracted from there.

Our dataset is intended to be used in furthering our research related to machine learning-based triage for Android memory forensics. It has been made openly available in order to foster opportunities for collaboration with other researchers, to enable validation of research results, and to enhance the body of knowledge in related areas of research.

References:
[1] K. Allix, T. F. Bissyandé, J. Klein, and Y. Le Traon. AndroZoo: Collecting Millions of Android Apps for the Research Community. Mining Software Repositories (MSR) 2016.
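As an illustration of how such string dumps could feed machine learning-based triage, the sketch below reads the ".dmp" text files from two folders, vectorizes the strings, and trains a simple classifier. The folder names are hypothetical stand-ins for the two unzipped archives; only the file format (one extracted string per line, plain text) comes from the description above.

```python
from pathlib import Path
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def load_dumps(folder, label):
    """Read every .dmp string dump under `folder` and attach a class label."""
    docs, labels = [], []
    for p in Path(folder).rglob("*.dmp"):
        docs.append(p.read_text(errors="ignore"))
        labels.append(label)
    return docs, labels

# Hypothetical folder names for the two unzipped archives.
benign_docs, benign_y = load_dumps("benign_dumps", 0)
malicious_docs, malicious_y = load_dumps("malicious_dumps", 1)

X = HashingVectorizer(n_features=2**18).fit_transform(benign_docs + malicious_docs)
y = benign_y + malicious_y

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"hold-out accuracy: {clf.score(X_te, y_te):.3f}")
```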
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically).
This dataset is published as part of the paper "LAMD: Context-driven Android Malware Detection and Classification with LLMs". It includes the training and testing data, along with the results of testing malware with LAMD.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically).
Short Description: This is the dataset for the paper "Exploring the Use of Static and Dynamic Analysis to Improve the Performance of the Mining Sandbox Approach for Android Malware Identification", accepted for publication in the Journal of Systems and Software.
Link to this repository: https://github.com/droidxp/paper-replication-package
Authors of the Paper
Abstract
The popularization of the Android platform and the growing number of Android applications (apps) that manage sensitive data have turned the Android ecosystem into an attractive target for malicious software. For this reason, researchers and practitioners have investigated new approaches to address Android's security issues, including techniques that leverage dynamic analysis to mine Android sandboxes. The mining sandbox approach consists of running dynamic analysis tools on a benign version of an Android app. This exploratory phase records all calls to sensitive APIs. Later, we can use this information to (a) prevent calls to other sensitive APIs (those not recorded in the exploratory phase) or (b) run the dynamic analysis tools again on a different version of the app. During this second execution of the fuzzing tools, a warning of possible malicious behavior is raised whenever the new version of the app calls a sensitive API not recorded in the exploratory phase.
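The check described above boils down to a set comparison. The sketch below illustrates it with sensitive APIs represented as plain method signatures, which is a simplification of this sketch; the actual tooling works on instrumented call traces, and the example traces are made up.

```python
def mine_sandbox(exploratory_calls):
    """Build the sandbox: the set of sensitive APIs observed while
    exercising the benign version of the app."""
    return set(exploratory_calls)

def check_new_version(sandbox, observed_calls):
    """Return the sensitive APIs called by the new version that were never
    recorded in the exploratory phase; each one triggers a warning."""
    violations = sorted(set(observed_calls) - sandbox)
    for api in violations:
        print(f"WARNING: unexpected sensitive API call: {api}")
    return violations

# Illustrative usage with made-up traces:
baseline = ["android.telephony.TelephonyManager.getDeviceId",
            "java.net.URL.openConnection"]
new_run = baseline + ["android.telephony.SmsManager.sendTextMessage"]
check_new_version(mine_sandbox(baseline), new_run)
```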
The use of the mining sandbox approach is an effective technique for Android malware analysis, as previous research has revealed. In particular, existing reports present an accuracy of almost 70% in the identification of malicious behavior using dynamic analysis tools to mine Android sandboxes. However, although the use of dynamic analysis for mining Android sandboxes has been investigated before, little is known about the potential benefits of combining static analysis with the mining sandbox approach for identifying malicious behavior. Accordingly, in this paper we present the results of two studies that investigate the impact of using static analysis to complement the performance of existing dynamic analysis tools tailored for mining Android sandboxes in the task of identifying malicious behavior.
In the first study, we conduct a non-exact replication of a previous study (hereafter BLL-Study) that compares the performance of test case generation tools for mining Android sandboxes. Differently from the original work, here we isolate the effect of an independent static analysis component (DroidFax) that they used to instrument the Android apps in their experiments. This decision was motivated by the fact that DroidFax could have positively influenced the efficacy of the dynamic analysis tools, through the execution of specific static analysis algorithms that DroidFax also implements. In our second study, we carried out a new experiment to investigate the efficacy of taint analysis algorithms in complementing the mining sandbox approach previously used to identify malicious behavior. To this end, we executed the FlowDroid tool to mine the source-sink flows from the benign/malign pairs of Android apps used in previous research work.
Our study brings several findings. For instance, the first study reveals that DroidFax alone (static analysis) can detect 43.75% of the malware in the BLL-Study dataset, contributing substantially to the performance of the dynamic analysis tools in the BLL-Study. The results of the second study show that taint analysis is also practical as a complement to the mining sandbox approach, with a performance similar to that reached by dynamic analysis tools.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically).
These are the two datasets, EMBER Class and AZ Class, needed to reproduce the results of the paper "MalCL: Leveraging GAN-Based Generative Replay to Combat Catastrophic Forgetting in Malware Classification", accepted for publication at the 39th Annual AAAI Conference on Artificial Intelligence (AAAI 2025).