8 datasets found
  1. Fast & Furious: Malware Detection Data Stream

    415K static Android malware samples from 2009 to 2018 with their timestamps

    • kaggle.com
    Updated Aug 12, 2022
    Cite
    Fabrício Ceschin (2022). Fast & Furious: Malware Detection Data Stream [Dataset]. https://www.kaggle.com/fabriciojoc/fast-furious-malware-data-stream
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Aug 12, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Fabrício Ceschin
    Description

    These datasets (DREBIN and AndroZoo) are contributions to the paper "Fast & Furious: On the Modelling of Malware Detection as an Evolving Data Stream". If you use them in your work, please cite our paper using the BibTeX below:

    @article{CESCHIN2022118590,
    title = {Fast & Furious: On the modelling of malware detection as an evolving data stream},
    journal = {Expert Systems with Applications},
    pages = {118590},
    year = {2022},
    issn = {0957-4174},
    doi = {https://doi.org/10.1016/j.eswa.2022.118590},
    url = {https://www.sciencedirect.com/science/article/pii/S0957417422016463},
    author = {Fabrício Ceschin and Marcus Botacin and Heitor Murilo Gomes and Felipe Pinagé and Luiz S. Oliveira and André Grégio},
    keywords = {Machine learning, Data streams, Concept drift, Malware detection, Android}
    }
    

    Both datasets are saved in the parquet file format. To read them, use the following code:

    import pandas as pd

    data_drebin = pd.read_parquet("drebin_drift.parquet.zip")
    data_androzoo = pd.read_parquet("androbin.parquet.zip")
    

    Note that these datasets differ from their original versions. The original DREBIN dataset does not contain the samples' timestamps, which we collected using the VirusTotal API. Our version of the AndroZoo dataset is a subset of reports from their dataset previously available through their APK Analysis API, which has since been discontinued.

    The DREBIN dataset is composed of ten textual attributes from Android APKs (API calls, permissions, URLs, etc.). It is publicly available to download and contains 123,453 benign and 5,560 malicious Android applications. Their distribution over time is shown below.

    [Figure: DREBIN dataset distribution by month (https://i.imgur.com/IGKOMtE.png)]

    The AndroZoo dataset is a subset of Android application reports provided by the AndroZoo API, composed of eight textual attributes (resource names, source code classes and methods, manifest permissions, etc.). It contains 213,928 benign and 70,340 malicious applications. The distribution over time of our AndroZoo subset, which keeps the same goodware and malware distribution as the original dataset (composed of roughly 10 million apps), is shown below.

    [Figure: AndroZoo dataset distribution by month (https://i.imgur.com/8zxH3M4.png)]

    The source code for all the experiments shown in the paper is also available here on Kaggle (note that the experiments using the AndroZoo dataset did not run in the Kaggle environment due to high memory usage).

    Experiment 1 (The Best-Case Scenario for AVs - ML Cross-Validation)

    Here we classify all samples together to compare which feature extraction algorithm is best and to report baseline results. We tested several parameters for both algorithms, fixing the vocabulary size at 100 for TF-IDF (top-100 features ordered by term frequency) and creating 100-dimensional projections for Word2Vec, resulting in 1,000 and 800 features per app for DREBIN and AndroZoo, respectively. All results are reported after 10-fold cross-validation, a method commonly used in ML to evaluate models because its results are less prone to biases (note that we train new classifiers and feature extractors at every iteration of the cross-validation process). In practice, folding the dataset implies that the AV company has a mixed view of both past and future threats, despite temporal effects, which is the best scenario for AV operation and ML evaluation.

    Source Codes: DREBIN TFIDF | DREBIN W2V | ANDROZOO TFIDF | ANDROZOO W2V
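    The cross-validation setup described above can be sketched as follows. This is an illustration, not the paper's exact code: it evaluates a single textual attribute with a top-100 TF-IDF vocabulary under 10-fold cross-validation, training a fresh feature extractor and classifier inside every fold; the classifier choice is an assumption.

```python
# Sketch of the Experiment 1 protocol under stated assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def crossval_tfidf(texts, labels, vocab_size=100, folds=10):
    """10-fold CV over one textual attribute; a new TF-IDF extractor
    and classifier are fit inside every fold, as in the paper."""
    model = make_pipeline(
        TfidfVectorizer(max_features=vocab_size),  # top-N terms by frequency
        LogisticRegression(max_iter=1000),         # placeholder classifier
    )
    return cross_val_score(model, texts, labels, cv=folds, scoring="f1")
```

    Because the vectorizer lives inside the pipeline, it is refit on each training fold, so no vocabulary information leaks from the held-out samples.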

    Experiment 2 (On Classification Failure - Temporal Classification)

    Although the classification methodology above helps reduce dataset biases, it would demand knowledge about future threats to work properly. AV companies train their classifiers using data from past samples and leverage them to predict future threats, expecting those to present the same characteristics as past ones. However, malware samples are very dynamic, so this strategy is the worst-case scenario for AV companies. To demonstrate the effects of predicting future threats based on past data, we split our datasets in two: we used the first half (oldest samples) to train our classifiers, which were then used to predict the newest samples from the second half. The results in the paper indicate a drop in all metrics when compared to the 10-fold experiment on both the DREBIN and AndroZoo datasets...
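    The temporal split described above amounts to sorting by timestamp and cutting the data in half. A minimal sketch, assuming the frame has a sortable timestamp column; the column name "timestamp" is an assumption and may not match the parquet schema.

```python
# Sketch of the Experiment 2 split: oldest half trains, newest half tests.
import pandas as pd

def temporal_split(df, time_col="timestamp"):
    """Return (train, test): the oldest half and the newest half."""
    ordered = df.sort_values(time_col, kind="stable").reset_index(drop=True)
    half = len(ordered) // 2
    return ordered.iloc[:half], ordered.iloc[half:]
```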

  2. Clean Apks - Androzoo

    • data.niaid.nih.gov
    Updated Apr 16, 2022
    Cite
    Razgallah, Asma (2022). Clean Apks - Androzoo [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6433988
    Explore at:
    Dataset updated
    Apr 16, 2022
    Dataset authored and provided by
    Razgallah, Asma
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Clean Android applications from AndroZoo.

  3. Android Process Memory String Dumps Dataset

    • researchdata.se
    • su.figshare.com
    • +1more
    Updated May 11, 2017
    Cite
    Irvin Homem; Panagiotis Papapetrou (2017). Android Process Memory String Dumps Dataset [Dataset]. http://doi.org/10.17045/STHLMUNI.4989773
    Explore at:
    Dataset updated
    May 11, 2017
    Dataset provided by
    Stockholm University
    Authors
    Irvin Homem; Panagiotis Papapetrou
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A dataset containing 2375 samples of Android process memory string dumps. The dataset is broadly composed of two classes, "Benign App" memory dumps and "Malicious App" memory dumps, split into two ZIP archives. Together the ZIP archives are approximately 17 GB in size; unzipped, their contents are approximately 67 GB.

    This dataset is derived from a subset of the APK files originally made freely available for research through the AndroZoo project [1]. The AndroZoo project collected millions of Android applications and scanned them with the VirusTotal online malware scanning service, thereby classifying most of the apps as either malicious or benign at the time of scanning. The process memory dumps in this dataset were generated by running the subset of APK files from the AndroZoo dataset in an Android emulator, capturing the process memory of the individual process, and then extracting only the strings from the process memory dump. This was facilitated by building two applications, Coriander and AndroMemDumpBeta, which handle running apps on Android emulators and capturing process memory, respectively. The source code for both applications is available on GitHub.

    The individual samples are labelled with the SHA256 hash filename from the original AndroZoo labeling and the application package names extracted from within the specific APK manifest file. They also contain a time-stamp for when the memory dumping process took place for the specific file. The file extension used is ".dmp" to indicate that the files are memory dumps, however they only contain strings, and thus can be viewed in any simple text editor.
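    Since the ".dmp" files contain only printable strings, ordinary text I/O suffices to load a sample. A minimal sketch, assuming one extracted string per line (the exact layout is not specified above):

```python
# Load one string-dump sample as a list of strings.
def read_string_dump(path, encoding="utf-8"):
    """Return the non-empty lines of a process memory string dump."""
    with open(path, encoding=encoding, errors="replace") as f:
        return [line.strip() for line in f if line.strip()]
```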

    A subset of the first 10000 APK files from the original AndroZoo dataset is also included within this dataset. The metadata of these APK files is present in the file "AndroZoo-First-10000", and the 2375 Android apps that are the main subjects of our dataset were extracted from there.

    Our dataset is intended to be used in furthering our research related to Machine Learning-based Triage for Android Memory Forensics. It has been made openly available in order to foster opportunities for collaboration with other researchers, to enable validation of research results as well as to enhance the body of knowledge in related areas of research.

    References: [1]. K. Allix, T. F. Bissyandé, J. Klein, and Y. Le Traon. AndroZoo: Collecting Millions of Android Apps for the Research Community. Mining Software Repositories (MSR) 2016

  4. Android Malware Dataset with VirusTotal Labels

    • zenodo.org
    zip
    Updated May 1, 2024
    Cite
    Mohammed Rashed; Juan Tapiador; Guillermo Suarez-Tangil (2024). Android Malware Dataset with VirusTotal Labels [Dataset]. http://doi.org/10.5281/zenodo.11095700
    Explore at:
    zipAvailable download formats
    Dataset updated
    May 1, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mohammed Rashed; Juan Tapiador; Guillermo Suarez-Tangil
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains labels of 2.47 million Android apk hashes extracted from VirusTotal reports.

    The dataset was used in the experiments of our publication titled "An Analysis of Android Malware Classification Services".

    The labels extracted from the VirusTotal reports are provided in labeling_dataset.csv.gz. A cell value of -1 is used whenever there was no result from the engine for the given apk file hash. The column names are provided in cols_labeling_dataset.csv.

    Note: -1 is stored as a string, not an integer.
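    Loading the label matrix with pandas can be sketched as below. This is a hedged sketch under stated assumptions: it assumes the gzipped csv has no header row and that cols_labeling_dataset.csv lists one column name per line; the actual layout may differ. The string "-1" (no engine result) is mapped to NaN so it is never mistaken for a verdict.

```python
# Read the gzipped label csv, applying the column names from the sidecar file
# and treating the string "-1" as a missing verdict.
import pandas as pd

def load_labels(labels_path, cols_path):
    cols = pd.read_csv(cols_path, header=None)[0].tolist()
    return pd.read_csv(labels_path, compression="gzip", names=cols,
                       dtype=str, na_values=["-1"])
```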

    If you use information from this repo, please cite our paper:

    Rashed M, Suarez-Tangil G. An Analysis of Android Malware Classification Services. Sensors. 2021; 21(16):5671. https://doi.org/10.3390/s21165671

    BibTeX

    @Article{s21165671,
    AUTHOR = {Rashed, Mohammed and Suarez-Tangil, Guillermo},
    TITLE = {An Analysis of Android Malware Classification Services},
    JOURNAL = {Sensors},
    VOLUME = {21},
    YEAR = {2021},
    NUMBER = {16},
    ARTICLE-NUMBER = {5671},
    URL = {https://www.mdpi.com/1424-8220/21/16/5671},
    ISSN = {1424-8220},
    DOI = {10.3390/s21165671}
    }

    Required Software

    gzip

    How to use the file?

    There are two ways to use the file:

    1. Extract the gzip file to obtain a csv output file. For that, install gzip and then extract the .csv.gz file, e.g. with the command gunzip labelingDataset.csv.gz
    2. Extract information from the zipped file directly (following the same logic of AndroZoo's csv):
      To extract the first column and save to a file called list_of_selected_sha256, run the following command:
      zcat labelingDataset.csv.gz | cut -d',' -f1 > list_of_selected_sha256
      To obtain rows of apk hashes that were first seen after the 1st of May, 2016, run this command:
      zcat labeling_dataset.csv.gz | grep -v ',snaggamea' | awk -F, '{if ( $2 >= "2016-05" ) {print} }'
  5. List of analyzed apps

    • figshare.com
    txt
    Updated Jul 14, 2021
    Cite
    Pascal Gadient (2021). List of analyzed apps [Dataset]. http://doi.org/10.6084/m9.figshare.14981061.v1
    Explore at:
    txtAvailable download formats
    Dataset updated
    Jul 14, 2021
    Dataset provided by
    figshare
    Authors
    Pascal Gadient
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The list of apps that we analyzed for the paper titled "Security Smells Pervade Mobile App Servers," ESEM, 2021. The closed-source apps were downloaded from the AndroZoo repository hosted at the University of Luxembourg (https://androzoo.uni.lu/), and the open-source apps were downloaded from the F-Droid repository (https://f-droid.org/).

  6. PermGuard Android Malware Dataset

    • ieee-dataport.org
    Updated Dec 30, 2024
    Cite
    Arvind Prasad (2024). PermGuard Android Malware Dataset [Dataset]. https://ieee-dataport.org/documents/permguard-android-malware-dataset
    Explore at:
    Dataset updated
    Dec 30, 2024
    Authors
    Arvind Prasad
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    2019

  7. AndroLibZoo

    • zenodo.org
    application/gzip, bin
    Updated Aug 5, 2023
    + more versions
    Cite
    Anonymous; Anonymous (2023). AndroLibZoo [Dataset]. http://doi.org/10.5281/zenodo.8207463
    Explore at:
    application/gzip, binAvailable download formats
    Dataset updated
    Aug 5, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous; Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    # AndroLibZoo

    This repository hosts the AndroLibZoo dataset and the artifacts used in our study on Android libraries.

    ## Artifacts

    The artifacts folder contains all the artifacts, i.e., datasets, scripts, results, source code, etc., to reproduce our study and produce AndroLibZoo.

    • The file AndroLibZoo.lst is our dataset.
    • The subfolders in « AndroLibZoo » are as follows:
      • The motivation folder contains all artifacts related to our motivation study (i.e., Section 3).
      • The methodology folder contains all artifacts related to our methodology (i.e., Section 4). In particular, we present:
        • How we gather libraries from Maven
        • How we gather libraries from Google
        • How we gather transitive dependencies
        • How we gather libraries from open source apps
        • How we gather libraries from Gradle plugin libraries
        • How we refined our list of libraries.
      • The description folder contains the artifacts useful to describe our dataset (i.e., Section 5).
      • The comparison folder holds the artifacts useful to compare our dataset with the state of the art (i.e., Section 6.a).
      • The importance folder holds the artifacts useful to highlight the importance of our AndroLibZoo to filter libraries for static analysis of Android apps (i.e., Section 6.b).
      • The evaluation folder holds the artifacts useful to show how AndroLibZoo improves existing static analyzers (i.e., Section 6.c).
    • The subfolders in « methodology_for_comparison_with_sota_list » contain the same folders as listed above, but are used to produce the 2016 version of our dataset for comparison against the state-of-the-art list.
    • The subfolders in « comparison_libradar » contain all the necessary files to compare our approach with libradar.

    All folders contain scripts that start with "XX_", with XX being a number that represents the order in which scripts need to be executed. Some of the scripts need to be parametrized with names of servers, AndroZoo API keys, prefixes for Redis server, etc.
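    The "XX_" naming convention means the intended execution order can be recovered by sorting on the numeric prefix. A hedged sketch, assuming a two-digit prefix; the script names used in the test are hypothetical.

```python
# Collect the "XX_*" scripts in a folder, ordered by their numeric prefix.
from pathlib import Path

def ordered_scripts(folder):
    """Return Paths to scripts named like '01_fetch.sh', in execution order."""
    scripts = [p for p in Path(folder).iterdir()
               if p.name[:2].isdigit() and p.name[2:3] == "_"]
    return sorted(scripts, key=lambda p: int(p.name[:2]))
```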

  8. Artifacts of the ASE2022 Submission JUnitTestGen

    • zenodo.org
    txt, zip
    Updated May 1, 2022
    Cite
    anonymous anonymous; anonymous anonymous (2022). Artifacts of the ASE2022 Submission JUnitTestGen [Dataset]. http://doi.org/10.5281/zenodo.6507579
    Explore at:
    txt, zipAvailable download formats
    Dataset updated
    May 1, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    anonymous anonymous; anonymous anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This release includes the experimental results that are used for the ASE artifact analysis. All the results are described in our paper. All required artifacts are attached to this release.


    The source code is made publicly available on the bitbucket page: https://bitbucket.org/se_anonymous/junittestgen/src/master/
    Users can easily access and execute our tool by following the steps in the Setup section.

    In addition, all of our experimental results are publicly available on this page. For simplicity, we provide the following instructions for our artifacts:

    1. Experimental_Setup.zip: the experimental setup, including 10,000 Android apps (for each target SDK version between 21 (i.e., Android 5.0) and 30 (i.e., Android 11)) from AndroZoo.
    2. All_Tests.zip: all the generated test cases.
    3. TestName_TargetAPI_Mapping.txt: the mapping between each test case name and its corresponding target API.
    4. JUnitTestRun_SDK*_log.txt: test execution results on SDK 21-30 (* refers to the SDK version).
    5. EvoSuite-tests.zip: the test cases generated by EvoSuite.
    6. Evosuite_results.zip: execution results of the tests generated by EvoSuite.
    7. CiD.zip: CiD results.
