These datasets (DREBIN and AndroZoo) are contributions to the paper "Fast & Furious: On the Modelling of Malware Detection as an Evolving Data Stream". If you use them in your work, please cite our paper using the BibTeX below:
@article{CESCHIN2022118590,
title = {Fast \& Furious: On the modelling of malware detection as an evolving data stream},
journal = {Expert Systems with Applications},
pages = {118590},
year = {2022},
issn = {0957-4174},
doi = {10.1016/j.eswa.2022.118590},
url = {https://www.sciencedirect.com/science/article/pii/S0957417422016463},
author = {Fabrício Ceschin and Marcus Botacin and Heitor Murilo Gomes and Felipe Pinagé and Luiz S. Oliveira and André Grégio},
keywords = {Machine learning, Data streams, Concept drift, Malware detection, Android}
}
Both datasets are saved in the Parquet file format. To read them with pandas, use the following code:
import pandas as pd

data_drebin = pd.read_parquet("drebin_drift.parquet.zip")
data_androzoo = pd.read_parquet("androbin.parquet.zip")
Note that these datasets differ from their original versions. The original DREBIN dataset does not contain the samples' timestamps, which we collected using the VirusTotal API. Our version of the AndroZoo dataset is a subset of reports from their dataset that was previously available through their APK Analysis API, which has been discontinued.
The DREBIN dataset is composed of ten textual attributes from Android APKs (lists of API calls, permissions, URLs, etc.), which are publicly available to download, and contains 123,453 benign and 5,560 malicious Android applications. Their distribution over time is shown below.
[Figure: DREBIN dataset distribution by month (https://i.imgur.com/IGKOMtE.png)]
The AndroZoo dataset is a subset of Android application reports provided by the AndroZoo API, composed of eight textual attributes (resource names, source code classes and methods, manifest permissions, etc.), and contains 213,928 benign and 70,340 malicious applications. The distribution over time of our AndroZoo subset, which keeps the same goodware and malware distribution as the original dataset (composed of almost 10 million apps), is shown below.
[Figure: AndroZoo dataset distribution by month (https://i.imgur.com/8zxH3M4.png)]
The source code for all the experiments shown in the paper is also available here on Kaggle (note that the experiments using the AndroZoo dataset do not run in the Kaggle environment due to high memory usage).
Experiment 1 (The Best-Case Scenario for AVs - ML Cross-Validation)
Here we classify all samples together to compare which feature extraction algorithm is the best and to report baseline results. We tested several parameters for both algorithms, fixing the vocabulary size at 100 for TF-IDF (top-100 features ordered by term frequency) and creating projections with 100 dimensions for Word2Vec, resulting in 1,000 and 800 features per app for DREBIN and AndroZoo, respectively. All results are reported after a 10-fold cross-validation procedure, a method commonly used in ML to evaluate models because its results are less prone to biases (note that we train new classifiers and feature extractors at every iteration of the cross-validation process). In practice, folding the dataset implies that the AV company has a mixed view of both past and future threats, despite temporal effects, which is the best scenario for AV operation and ML evaluation.
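The per-fold setup described above can be sketched as follows. The toy corpus of space-joined attribute strings and the RandomForest classifier are illustrative assumptions, not the paper's exact configuration; the key point is that placing the vectorizer inside the pipeline fits a fresh feature extractor on each training fold:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy "textual attribute" per app, e.g. a space-joined permission list.
texts = [
    "internet read_contacts send_sms",
    "internet access_network_state",
    "send_sms read_sms write_contacts",
    "internet camera access_fine_location",
] * 5
labels = [1, 0, 1, 0] * 5  # 1 = malware, 0 = goodware

# The vectorizer lives inside the pipeline, so a fresh TF-IDF vocabulary
# (capped at 100 terms) is fit on every training fold of the 10-fold CV.
model = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=100)),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=0)),
])
scores = cross_val_score(model, texts, labels, cv=10)
print("mean 10-fold accuracy:", scores.mean())
```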
Source Codes: DREBIN TFIDF | DREBIN W2V | ANDROZOO TFIDF | ANDROZOO W2V
Experiment 2 (On Classification Failure - Temporal Classification)
Although the currently used classification methodology helps reduce dataset biases, it would demand knowledge about future threats to work properly. AV companies train their classifiers using data from past samples and leverage them to predict future threats, expecting these to present the same characteristics as past ones. However, malware samples are very dynamic, so this strategy is the worst-case scenario for AV companies. To demonstrate the effects of predicting future threats based on past data, we split our datasets in two: we used the first half (the oldest samples) to train our classifiers, which were then used to predict the newest samples from the second half. The results in the paper indicate a drop in all metrics when compared to the 10-fold experiment on both the DREBIN and AndroZoo datasets.
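The temporal split can be sketched like this on toy data. The column names ("timestamp", "text", "label") and the logistic regression classifier are illustrative assumptions, not the paper's exact choices:

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy dataset with one timestamp per app; the real datasets carry
# VirusTotal-derived dates.
df = pd.DataFrame({
    "timestamp": pd.date_range("2015-01-01", periods=8, freq="MS"),
    "text": ["internet send_sms", "internet", "send_sms read_sms", "camera",
             "internet send_sms", "internet", "send_sms read_sms", "camera"],
    "label": [1, 0, 1, 0, 1, 0, 1, 0],
})

# Train on the oldest half, evaluate on the newest half.
df = df.sort_values("timestamp")
half = len(df) // 2
train, test = df.iloc[:half], df.iloc[half:]

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
model.fit(train["text"], train["label"])
acc = model.score(test["text"], test["label"])
print(f"temporal-split accuracy: {acc:.2f}")
```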
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Clean Android applications from AndroZoo
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
A dataset containing 2375 samples of Android process memory string dumps. The dataset is broadly composed of two classes, "Benign App" memory dumps and "Malicious App" memory dumps, split into two ZIP archives. Together the ZIP archives are approximately 17 GB in size; the unzipped contents are approximately 67 GB.
This dataset is derived from a subset of the APK files originally made freely available for research through the AndroZoo project [1]. The AndroZoo project collected millions of Android applications and scanned them with the VirusTotal online malware scanning service, thereby classifying most of the apps as either malicious or benign at the time of scanning. The process memory dumps in this dataset were generated by running the subset of APK files from the AndroZoo dataset in an Android emulator, capturing the process memory of the individual process, and then extracting only the strings from the process memory dump. This was done with two applications we built, Coriander and AndroMemDumpBeta, which handle running apps on Android emulators and capturing process memory, respectively. The source code for both applications is available on GitHub.
The individual samples are labelled with the SHA256 hash filename from the original AndroZoo labelling and with the application package names extracted from each APK's manifest file. They also contain a timestamp of when the memory dumping process took place for the specific file. The file extension used is ".dmp" to indicate that the files are memory dumps; however, they contain only strings and can therefore be viewed in any simple text editor.
A subset of the first 10,000 APK files from the original AndroZoo dataset is also included within this dataset. The metadata of these APK files is present in the file "AndroZoo-First-10000", and the 2375 Android apps that are the main subjects of our dataset were extracted from this subset.
Our dataset is intended to be used in furthering our research related to Machine Learning-based Triage for Android Memory Forensics. It has been made openly available in order to foster opportunities for collaboration with other researchers, to enable validation of research results as well as to enhance the body of knowledge in related areas of research.
References: [1] K. Allix, T. F. Bissyandé, J. Klein, and Y. Le Traon. AndroZoo: Collecting Millions of Android Apps for the Research Community. In Mining Software Repositories (MSR), 2016.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This dataset contains labels of 2.47 million Android apk hashes extracted from VirusTotal reports.
The dataset was used in the experiments of our publication titled "An Analysis of Android Malware Classification Services".
The CSV of labels extracted from the VirusTotal reports is provided in labeling_dataset.csv.gz. A cell's value of -1 is used whenever there was no result from the engine for the given APK file hash value. The column names are provided in cols_labeling_dataset.csv. Note that -1 is a string, not an integer.
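A hedged pandas sketch of reading the labels file: every column is read as text (since "-1" is a string, not an integer), and "-1" is then mapped to a real missing value. The miniature in-memory CSV and the column names below are illustrative stand-ins; the actual data lives in labeling_dataset.csv.gz with its column names in cols_labeling_dataset.csv:

```python
import gzip
import io

import pandas as pd

# Build a tiny gzipped stand-in for the labels file (hash, date, two engines).
sample = "abc123,2016-07,Trojan.Gen,-1\ndef456,2015-03,-1,Adware.X\n"
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    f.write(sample.encode())
buf.seek(0)

cols = ["sha256", "scan_date", "engine_a", "engine_b"]  # hypothetical names
labels = pd.read_csv(buf, compression="gzip", names=cols, dtype=str)

# "-1" means "no result from this engine"; map it to a proper missing value.
labels = labels.replace("-1", pd.NA)
print(labels.isna().sum().sum())  # count of missing engine verdicts
```

For the real file, replace the in-memory buffer with the path to labeling_dataset.csv.gz and load the header names from cols_labeling_dataset.csv.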
Rashed M, Suarez-Tangil G. An Analysis of Android Malware Classification Services. Sensors. 2021; 21(16):5671. https://doi.org/10.3390/s21165671
@Article{s21165671,
AUTHOR = {Rashed, Mohammed and Suarez-Tangil, Guillermo},
TITLE = {An Analysis of Android Malware Classification Services},
JOURNAL = {Sensors},
VOLUME = {21},
YEAR = {2021},
NUMBER = {16},
ARTICLE-NUMBER = {5671},
URL = {https://www.mdpi.com/1424-8220/21/16/5671},
ISSN = {1424-8220},
DOI = {10.3390/s21165671}
}
The file is compressed with gzip. On Debian/Ubuntu, gzip can be installed with apt-get install gzip; on most Linux and macOS systems gzip is pre-installed; on Windows, gzip is available from http://gnuwin32.sourceforge.net/packages/gzip.htm. There are two ways to use the file:
1. Decompress it fully: gunzip labelingDataset.csv.gz
2. Stream it without decompressing. For example, to extract the first column (the hashes) into list_of_selected_sha256, run the following command: zcat labelingDataset.csv.gz | cut -d',' -f1 > list_of_selected_sha256
Another streaming example, which drops rows matching ',snaggamea' and keeps only rows whose second column is 2016-05 or later: zcat labeling_dataset.csv.gz | grep -v ',snaggamea' | awk -F, '{if ( $2 >= "2016-05" ) {print} }'
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The list of apps that we analyzed for the paper titled "Security Smells Pervade Mobile App Servers," ESEM, 2021. The closed-source apps have been downloaded from the AndroZoo repository hosted at the University of Luxembourg (https://androzoo.uni.lu/), and the open-source apps have been downloaded from the F-Droid repository (https://f-droid.org/).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
2019
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
# AndroLibZoo
This repository hosts the AndroLibZoo dataset and the artifacts used in our study on Android libraries.
## Artifacts
The artifacts folder contains all the artifacts, i.e., datasets, scripts, results, source code, etc., to reproduce our study and produce AndroLibZoo.
All folders contain scripts that start with "XX_", with XX being a number that represents the order in which scripts need to be executed. Some of the scripts need to be parametrized with names of servers, AndroZoo API keys, prefixes for Redis server, etc.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This release includes the experimental results that are used for the ASE artifact analysis. All the results are described in our paper. All required artifacts are attached to this release.
The source code is made publicly available on the bitbucket page: https://bitbucket.org/se_anonymous/junittestgen/src/master/
Users can easily access and execute our tool by following the steps in the Setup section.
In addition, all of our experimental results are publicly available on this page. For simplicity, we provide the following instructions for our artifacts:
1. Experimental_Setup.zip: Experimental setup, including 10,000 Android apps (for each target SDK version between 21 (i.e., Android 5.0) and 30 (i.e., Android 11.0), from AndroZoo).
2. All_Tests.zip: All the generated test cases.
3. TestName_TargetAPI_Mapping.txt: The mapping between each test case name and its corresponding target API.
4. JUnitTestRun_SDK*_log.txt: Test execution results on SDKs 21-30 (* refers to the SDK version).
5. EvoSuite-tests.zip: The test cases generated by EvoSuite.
6. Evosuite_results.zip: Execution results of the tests generated by EvoSuite.
7. CiD.zip: CiD results.