These datasets (DREBIN and AndroZoo) are contributions to the paper "Fast & Furious: On the Modelling of Malware Detection as an Evolving Data Stream". If you use them in your work, please cite our paper using the BibTeX below:
@article{CESCHIN2022118590,
title = {Fast \& Furious: On the modelling of malware detection as an evolving data stream},
journal = {Expert Systems with Applications},
pages = {118590},
year = {2022},
issn = {0957-4174},
doi = {10.1016/j.eswa.2022.118590},
url = {https://www.sciencedirect.com/science/article/pii/S0957417422016463},
author = {Fabrício Ceschin and Marcus Botacin and Heitor Murilo Gomes and Felipe Pinagé and Luiz S. Oliveira and André Grégio},
keywords = {Machine learning, Data streams, Concept drift, Malware detection, Android}
}
Both datasets are saved in the Parquet file format. To read them with pandas, use the following code:
import pandas as pd

data_drebin = pd.read_parquet("drebin_drift.parquet.zip")
data_androzoo = pd.read_parquet("androbin.parquet.zip")
Note that these datasets differ from their original versions. The original DREBIN dataset does not contain the samples' timestamps, which we collected using the VirusTotal API. Our version of the AndroZoo dataset is a subset of reports from their dataset that was previously available through their APK Analysis API, which has been discontinued.
The DREBIN dataset is composed of ten textual attributes from Android APKs (lists of API calls, permissions, URLs, etc.), which are publicly available to download, and contains 123,453 benign and 5,560 malicious Android applications. Their distribution over time is shown below.
[Figure: DREBIN dataset distribution by month (https://i.imgur.com/IGKOMtE.png)]
The AndroZoo dataset is a subset of Android application reports provided by the AndroZoo API, composed of eight textual attributes (resource names, source code classes and methods, manifest permissions, etc.), and contains 213,928 benign and 70,340 malicious applications. The distribution over time of our AndroZoo subset, which keeps the same goodware and malware distribution as the original dataset (composed of almost 10 million apps), is shown below.
[Figure: AndroZoo dataset distribution by month (https://i.imgur.com/8zxH3M4.png)]
The source code for all the experiments shown in the paper is also available here on Kaggle (note that the experiments using the AndroZoo dataset do not run in the Kaggle environment due to high memory usage).
Experiment 1 (The Best-Case Scenario for AVs - ML Cross-Validation)
Here we classify all samples together to compare which feature extraction algorithm is the best and to report baseline results. We tested several parameters for both algorithms, fixing the vocabulary size at 100 for TF-IDF (top-100 features ordered by term frequency) and creating projections with 100 dimensions for Word2Vec, resulting in 1,000 and 800 features per app for DREBIN and AndroZoo, respectively. All results are reported after a 10-fold cross-validation procedure, a method commonly used in ML to evaluate models because its results are less prone to biases (note that we train new classifiers and feature extractors at every iteration of the cross-validation process). In practice, folding the dataset implies that the AV company has a mixed view of both past and future threats, despite temporal effects, which is the best scenario for AV operation and ML evaluation.
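The per-fold setup described above can be sketched as follows. The toy corpus of space-joined attribute strings and the RandomForest classifier are illustrative assumptions, not the paper's exact configuration; the key point is that placing the vectorizer inside the pipeline fits a fresh feature extractor on each training fold:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy "textual attribute" per app, e.g. a space-joined permission list.
texts = [
    "internet read_contacts send_sms",
    "internet access_network_state",
    "send_sms read_sms write_contacts",
    "internet camera access_fine_location",
] * 5
labels = [1, 0, 1, 0] * 5  # 1 = malware, 0 = goodware

# The vectorizer lives inside the pipeline, so a fresh TF-IDF vocabulary
# (capped at 100 terms) is fit on every training fold of the 10-fold CV.
model = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=100)),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=0)),
])
scores = cross_val_score(model, texts, labels, cv=10)
print("mean 10-fold accuracy:", scores.mean())
```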
Source Codes: DREBIN TFIDF | DREBIN W2V | ANDROZOO TFIDF | ANDROZOO W2V
Experiment 2 (On Classification Failure - Temporal Classification)
Although the currently used classification methodology helps reduce dataset biases, it would demand knowledge about future threats to work properly. AV companies train their classifiers using data from past samples and leverage them to predict future threats, expecting these to present the same characteristics as past ones. However, malware samples are very dynamic, so this strategy is the worst-case scenario for AV companies. To demonstrate the effects of predicting future threats based on past data, we split our datasets in two: we used the first half (the oldest samples) to train our classifiers, which were then used to predict the newest samples from the second half. The results in the paper indicate a drop in all metrics when compared to the 10-fold experiment on both the DREBIN and AndroZoo datasets.
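The temporal split can be sketched like this on toy data. The column names ("timestamp", "text", "label") and the logistic regression classifier are illustrative assumptions, not the paper's exact choices:

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy dataset with one timestamp per app; the real datasets carry
# VirusTotal-derived dates.
df = pd.DataFrame({
    "timestamp": pd.date_range("2015-01-01", periods=8, freq="MS"),
    "text": ["internet send_sms", "internet", "send_sms read_sms", "camera",
             "internet send_sms", "internet", "send_sms read_sms", "camera"],
    "label": [1, 0, 1, 0, 1, 0, 1, 0],
})

# Train on the oldest half, evaluate on the newest half.
df = df.sort_values("timestamp")
half = len(df) // 2
train, test = df.iloc[:half], df.iloc[half:]

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
model.fit(train["text"], train["label"])
acc = model.score(test["text"], test["label"])
print(f"temporal-split accuracy: {acc:.2f}")
```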
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Clean Android applications from AndroZoo
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
A dataset containing 2375 samples of Android process memory string dumps. The dataset is broadly composed of two classes, "Benign App" memory dumps and "Malicious App" memory dumps, split into two ZIP archives. Together the ZIP archives are approximately 17 GB in size; the unzipped contents are approximately 67 GB.
This dataset is derived from a subset of the APK files originally made freely available for research through the AndroZoo project [1]. The AndroZoo project collected millions of Android applications and scanned them with the VirusTotal online malware scanning service, thereby classifying most of the apps as either malicious or benign at the time of scanning. The process memory dumps in this dataset were generated by running the subset of APK files from the AndroZoo dataset in an Android emulator, capturing the process memory of the individual process, and then extracting only the strings from the process memory dump. This was done with two applications we built, Coriander and AndroMemDumpBeta, which handle running apps on Android emulators and capturing process memory, respectively. The source code for both applications is available on GitHub.
The individual samples are labelled with the SHA256 hash filename from the original AndroZoo labelling and with the application package names extracted from each APK's manifest file. They also contain a timestamp of when the memory dumping process took place for the specific file. The file extension used is ".dmp" to indicate that the files are memory dumps; however, they contain only strings and can therefore be viewed in any simple text editor.
A subset of the first 10,000 APK files from the original AndroZoo dataset is also included within this dataset. The metadata of these APK files is present in the file "AndroZoo-First-10000", and the 2375 Android apps that are the main subjects of our dataset were extracted from this subset.
Our dataset is intended to be used in furthering our research related to Machine Learning-based Triage for Android Memory Forensics. It has been made openly available in order to foster opportunities for collaboration with other researchers, to enable validation of research results as well as to enhance the body of knowledge in related areas of research.
References: [1] K. Allix, T. F. Bissyandé, J. Klein, and Y. Le Traon. AndroZoo: Collecting Millions of Android Apps for the Research Community. In Mining Software Repositories (MSR), 2016.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This dataset contains labels of 2.47 million Android apk hashes extracted from VirusTotal reports.
The dataset was used in the experiments of our publication titled "An Analysis of Android Malware Classification Services".
The CSV of labels extracted from the VirusTotal reports is provided in labeling_dataset.csv.gz. A cell's value of -1 is used whenever there was no result from the engine for the given APK file hash value. The column names are provided in cols_labeling_dataset.csv. Note that -1 is a string, not an integer.
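A hedged pandas sketch of reading the labels file: every column is read as text (since "-1" is a string, not an integer), and "-1" is then mapped to a real missing value. The miniature in-memory CSV and the column names below are illustrative stand-ins; the actual data lives in labeling_dataset.csv.gz with its column names in cols_labeling_dataset.csv:

```python
import gzip
import io

import pandas as pd

# Build a tiny gzipped stand-in for the labels file (hash, date, two engines).
sample = "abc123,2016-07,Trojan.Gen,-1\ndef456,2015-03,-1,Adware.X\n"
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    f.write(sample.encode())
buf.seek(0)

cols = ["sha256", "scan_date", "engine_a", "engine_b"]  # hypothetical names
labels = pd.read_csv(buf, compression="gzip", names=cols, dtype=str)

# "-1" means "no result from this engine"; map it to a proper missing value.
labels = labels.replace("-1", pd.NA)
print(labels.isna().sum().sum())  # count of missing engine verdicts
```

For the real file, replace the in-memory buffer with the path to labeling_dataset.csv.gz and load the header names from cols_labeling_dataset.csv.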
Rashed M, Suarez-Tangil G. An Analysis of Android Malware Classification Services. Sensors. 2021; 21(16):5671. https://doi.org/10.3390/s21165671
@Article{s21165671,
AUTHOR = {Rashed, Mohammed and Suarez-Tangil, Guillermo},
TITLE = {An Analysis of Android Malware Classification Services},
JOURNAL = {Sensors},
VOLUME = {21},
YEAR = {2021},
NUMBER = {16},
ARTICLE-NUMBER = {5671},
URL = {https://www.mdpi.com/1424-8220/21/16/5671},
ISSN = {1424-8220},
DOI = {10.3390/s21165671}
}
The file is compressed with gzip. On Debian/Ubuntu, gzip can be installed with apt-get install gzip; on most Linux and macOS systems gzip is pre-installed; on Windows, gzip is available from http://gnuwin32.sourceforge.net/packages/gzip.htm. There are two ways to use the file:
1. Decompress it fully: gunzip labelingDataset.csv.gz
2. Stream it without decompressing. For example, to extract the first column (the hashes) into list_of_selected_sha256, run the following command: zcat labelingDataset.csv.gz | cut -d',' -f1 > list_of_selected_sha256
Another streaming example, which drops rows matching ',snaggamea' and keeps only rows whose second column is 2016-05 or later: zcat labeling_dataset.csv.gz | grep -v ',snaggamea' | awk -F, '{if ( $2 >= "2016-05" ) {print} }'
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The list of apps that we analyzed for the paper titled "Security Smells Pervade Mobile App Servers," ESEM, 2021. The closed-source apps have been downloaded from the AndroZoo repository hosted at the University of Luxembourg (https://androzoo.uni.lu/), and the open-source apps have been downloaded from the F-Droid repository (https://f-droid.org/).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
2019
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
# AndroLibZoo
This repository hosts the AndroLibZoo dataset and the artifacts used in our study on Android libraries.
## Artifacts
The artifacts folder contains all the artifacts, i.e., datasets, scripts, results, source code, etc., to reproduce our study and produce AndroLibZoo.
All folders contain scripts that start with "XX_", with XX being a number that represents the order in which scripts need to be executed. Some of the scripts need to be parametrized with names of servers, AndroZoo API keys, prefixes for Redis server, etc.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This release includes the experimental results that are used for the ASE artifact analysis. All the results are described in our paper. All required artifacts are attached to this release.
The source code is made publicly available on the bitbucket page: https://bitbucket.org/se_anonymous/junittestgen/src/master/
Users can easily access and execute our tool by following the steps in the Setup section.
In addition, all of our experimental results are publicly available on this page. For simplicity, we provide the following instructions for our artifacts:
1. Experimental_Setup.zip: Experimental setup, including 10,000 Android apps (for each target SDK version between 21 (i.e., Android 5.0) and 30 (i.e., Android 11.0), from AndroZoo).
2. All_Tests.zip: All the generated test cases.
3. TestName_TargetAPI_Mapping.txt: The mapping between each test case name and its corresponding target API.
4. JUnitTestRun_SDK*_log.txt: Test execution results on SDKs 21-30 (* refers to the SDK version).
5. EvoSuite-tests.zip: The test cases generated by EvoSuite.
6. Evosuite_results.zip: Execution results of the tests generated by EvoSuite.
7. CiD.zip: CiD results.