Android is one of the most widely used mobile operating systems worldwide. Because of its technological impact, its open-source code, and the possibility of installing third-party applications without any central control, Android has become a malware target. Although it includes security mechanisms, recent news about malicious activity and Android's vulnerabilities points to the importance of continuing to develop methods and frameworks that improve its security.
To prevent malware attacks, researchers and developers have proposed different security solutions that apply static analysis, dynamic analysis, and artificial intelligence. Data science has become a promising area in cybersecurity, since data-driven analytical models can uncover insights that help to predict malicious activity.
In this work, we propose using network layer features as the basis for machine learning models that detect malware applications, using open datasets from the research community.
This dataset is based on another dataset (DroidCollector), which provides the full network traffic as pcap files. In our research we preprocessed those files to extract the network features described in the following article:
López, C. C. U., Villarreal, J. S. D., Belalcazar, A. F. P., Cadavid, A. N., & Cely, J. G. D. (2018, May). Features to Detect Android Malware. In 2018 IEEE Colombian Conference on Communications and Computing (COLCOM) (pp. 1-6). IEEE.
Cao, D., Wang, S., Li, Q., Cheny, Z., Yan, Q., Peng, L., & Yang, B. (2016, August). DroidCollector: A High Performance Framework for High Quality Android Traffic Collection. In Trustcom/BigDataSE/ISPA, 2016 IEEE (pp. 1753-1758). IEEE.
This dataset contains information on application install interactions of users in the Myket Android application market. The dataset was created to evaluate interaction prediction models, which require user and item identifiers along with the timestamps of the interactions. Hence, the dataset can be used for interaction prediction and for building a recommendation system. Furthermore, the data forms a dynamic network of interactions, so network representation learning can also be performed on the nodes of that network, which are users and applications.
Data Creation
The dataset was initially generated by the Myket data team, and later cleaned and subsampled by Erfan Loghmani, a master's student at Sharif University of Technology at the time. The data team focused on a two-week period and randomly sampled 1/3 of the users with interactions during that period. They then selected install and update interactions for three months before and after the two-week period, resulting in interactions spanning about 6 months and two weeks.
We further subsampled and cleaned the data to focus on application download interactions. We identified the top 8000 most installed applications and selected interactions related to them. We retained users with more than 32 interactions, resulting in 280,391 users. From this group, we randomly selected 10,000 users, and the data was filtered to include only interactions for these users. The detailed procedure can be found here.
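The subsampling steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used by the authors: the function name `subsample`, the tuple layout `(user_id, app_name, timestamp)`, the fixed random seed, and the choice to count a user's interactions after restricting to the top applications are all assumptions made for the example.

```python
import random
from collections import Counter


def subsample(interactions, top_apps=8000, min_interactions=32,
              n_users=10_000, seed=0):
    """Filter (user_id, app_name, timestamp) tuples in three steps.

    Hypothetical helper: the defaults mirror the numbers quoted in the
    text, but the exact order of operations is an assumption.
    """
    # Step 1: keep only interactions with the most installed applications.
    app_counts = Counter(app for _, app, _ in interactions)
    kept_apps = {app for app, _ in app_counts.most_common(top_apps)}
    rows = [r for r in interactions if r[1] in kept_apps]

    # Step 2: retain users with MORE than `min_interactions` interactions.
    user_counts = Counter(user for user, _, _ in rows)
    eligible = sorted(u for u, c in user_counts.items() if c > min_interactions)

    # Step 3: randomly select at most `n_users` of the eligible users.
    rng = random.Random(seed)
    chosen = set(rng.sample(eligible, min(n_users, len(eligible))))
    return [r for r in rows if r[0] in chosen]


if __name__ == "__main__":
    toy = [("u1", "a", 1), ("u1", "b", 2), ("u1", "a", 3),
           ("u2", "a", 4),
           ("u3", "c", 5), ("u3", "a", 6), ("u3", "b", 7)]
    # Tiny thresholds so the toy data exercises every step.
    print(subsample(toy, top_apps=2, min_interactions=1, n_users=10))
```

Sorting the eligible users before sampling keeps the result reproducible for a given seed, independent of dictionary ordering.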
Data Structure
The dataset has two main files.
- myket.csv: This file contains the interaction information and follows the same format as the datasets used in the "JODIE: Predicting Dynamic Embedding Trajectory in Temporal Interaction Networks" (ACM SIGKDD 2019) project. However, this data does not contain state labels or interaction features, so the associated columns are all zero.
- app_info_sample.csv: This file contains features associated with the applications present in the sample. For each application, it includes information such as the approximate number of installs, average rating, count of ratings, and category. These features provide insights into the applications present in the dataset.
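As a rough illustration of reading a file in this layout, the sketch below parses interaction rows with Python's standard `csv` module. The column order (user, item, timestamp, state label, features) and the header text in the sample are assumptions based on the JODIE-style format mentioned above, and `read_interactions` is a hypothetical helper; check the actual header of myket.csv before relying on it.

```python
import csv
import io


def read_interactions(handle):
    """Yield (user, item, timestamp) from a JODIE-style CSV handle.

    Assumed layout (not confirmed by the dataset description):
    one header line, then rows of
    user, item, timestamp, state_label, feature_1, ..., feature_n
    """
    reader = csv.reader(handle)
    next(reader, None)  # skip the header line
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        # State label and feature columns are all zero here, so ignore them.
        yield row[0], row[1], float(row[2])


if __name__ == "__main__":
    # Made-up sample rows, for illustration only.
    sample = io.StringIO(
        "user_id,item_id,timestamp,state_label,features\n"
        "0,0,0.0,0,0.0\n"
        "1,1,6.0,0,0.0\n"
    )
    for user, item, ts in read_interactions(sample):
        print(user, item, ts)
```

To read a real file, pass an open file object instead of the `io.StringIO` sample, for example `open("myket.csv", newline="")`.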
Dataset Details
- Total instances: 694,121 install interaction instances
- Instance format: triplets of user_id, app_name, timestamp
- 10,000 users and 7,988 Android applications
- Item features for 7,606 applications
For a detailed summary of the data's statistics, including information on users, applications, and interactions, please refer to the Python notebook available at summary-stats.ipynb. The notebook provides an overview of the dataset's characteristics and can be helpful for understanding the data's structure before using it for research or analysis.
Top 20 Most Installed Applications
| Package Name | Count of Interactions |
| ---------------------------------- | --------------------- |
| com.instagram.android | 15292 |
| ir.resaneh1.iptv | 12143 |
| com.tencent.ig | 7919 |
| com.ForgeGames.SpecialForcesGroup2 | 7797 |
| ir.nomogame.ClutchGame | 6193 |
| com.dts.freefireth | 6041 |
| com.whatsapp | 5876 |
| com.supercell.clashofclans | 5817 |
| com.mojang.minecraftpe | 5649 |
| com.lenovo.anyshare.gps | 5076 |
| ir.medu.shad | 4673 |
| com.firsttouchgames.dls3 | 4641 |
| com.activision.callofduty.shooter | 4357 |
| com.tencent.iglite | 4126 |
| com.aparat | 3598 |
| com.kiloo.subwaysurf | 3135 |
| com.supercell.clashroyale | 2793 |
| co.palang.QuizOfKings | 2589 |
| com.nazdika.app | 2436 |
| com.digikala | 2413 |
Comparison with SNAP Datasets
The Myket dataset introduced in this repository exhibits distinct characteristics compared to the real-world datasets used by the JODIE project. The table below provides a comparative overview of the key dataset characteristics:
| Dataset | #Users | #Items | #Interactions | Average Interactions per User | Average Unique Items per User |
|---|---|---|---|---|---|
| Myket | 10,000 | 7,988 | 694,121 | 69.4 | 54.6 |
| LastFM | 980 | 1,000 | 1,293,103 | 1,319.5 | 158.2 |
| Reddit | 10,000 | 984 | 672,447 | 67.2 | 7.9 |
| Wikipedia | 8,227 | 1,000 | 157,474 | 19.1 | 2.2 |
| MOOC | 7,047 | 97 | 411,749 | 58.4 | 25.3 |
The Myket dataset stands out by having a large number of both users and items, highlighting its relevance for real-world, large-scale applications. Unlike the LastFM, Reddit, and Wikipedia datasets, where users interact repeatedly with the same items, the Myket dataset contains comparatively few repeated interactions: on average, 54.6 of a user's 69.4 interactions are with distinct applications. This characteristic reflects the diverse nature of user behavior in the Android application market environment.
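The two per-user averages in the table can be computed directly from interaction triplets. The sketch below shows the arithmetic on a toy list; the function name `interaction_stats` and the tuple layout `(user_id, item_id, timestamp)` are assumptions for illustration. As a sanity check on the table itself, 694,121 interactions divided by 10,000 users gives about 69.4, matching the Myket row.

```python
from collections import defaultdict


def interaction_stats(interactions):
    """Summarise (user_id, item_id, timestamp) triplets.

    Hypothetical helper that reproduces the table's column definitions.
    """
    if not interactions:
        raise ValueError("interactions must not be empty")

    # Group the items each user interacted with (repeats included).
    per_user = defaultdict(list)
    for user, item, _ in interactions:
        per_user[user].append(item)

    n_users = len(per_user)
    return {
        "users": n_users,
        "items": len({item for _, item, _ in interactions}),
        "interactions": len(interactions),
        # Total interactions divided by the number of users.
        "avg_interactions_per_user": len(interactions) / n_users,
        # Mean number of DISTINCT items per user.
        "avg_unique_items_per_user":
            sum(len(set(items)) for items in per_user.values()) / n_users,
    }


if __name__ == "__main__":
    toy = [("u1", "a", 1), ("u1", "b", 2), ("u1", "a", 3),
           ("u2", "a", 4),
           ("u3", "c", 5), ("u3", "a", 6), ("u3", "b", 7)]
    print(interaction_stats(toy))
```

A small gap between the two averages indicates few repeated user-item pairs; a large gap, as in the LastFM row, indicates heavy repetition.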
Citation
If you use this dataset in your research, please cite the following preprint:
@misc{loghmani2023effect,
  title={Effect of Choosing Loss Function when Using T-batching for Representation Learning on Dynamic Networks},
  author={Erfan Loghmani and MohammadAmin Fazli},
  year={2023},
  eprint={2308.06862},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
This dataset was collected by the ICSI Netalyzr app for Android to characterize how operational decisions, such as network configurations, business models, and relationships between operators, introduce diversity in service quality and affect user security and privacy. We look in detail beyond the radio link and into network configuration and business relationships in six countries. We identify the widespread use of transparent middleboxes such as HTTP and DNS proxies, analyzing how they actively modify user traffic, compromise user privacy, and potentially undermine user security. In addition, we identify network sharing agreements between operators, highlighting the implications of roaming and characterizing the properties of MVNOs, including that a majority are simply rebranded versions of major operators. More broadly, our findings using this data highlight the importance of considering higher-layer relationships when seeking to analyze mobile traffic in a sound fashion.
Contact: narseo@icsi.berkeley.edu
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository is part of the ITC-NetMingledApp dataset, which includes network traffic data from 36 Android applications, with each capture featuring concurrent traffic from multiple applications and smartphones. This repository contains part #1 of the data related to the Iran-Tehran scenario. Each capture is stored in a compressed file containing the relevant PCAP files of the associated applications. The PCAP files are named according to the following convention: {TimeStamp}_{Application Name}{Download-Upload Speed}.pcap. Part #2 of the Iran-Tehran scenario is in the Tehran Dataset #2 repository (https://doi.org/10.17632/zsffy3j9y6.1).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Scenario description:
Platoon formation with live traffic light data included in planner.
- Enabled live traffic light data included in planner
- Not using the Android app
- Starting at default locations
- This test was filmed, including the GUI.
Session description:
Platoon formation improvement by traffic light data.
Datasets descriptions:
AUTOPILOT_BrainPort_Platooning_DriverVehicleInteraction: Data extracted from the CAN of the vehicle
This dataset contains e.g. throttlestatus, clutchstatus, brakestatus, brakeforce, wipersstatus, steeringwheel for the vehicle
AUTOPILOT_BrainPort_Platooning_EnvironmentSensorsAbsolute: Data extracted from the vehicle environment sensors
This dataset contains information about detected objects, with absolute coordinates
AUTOPILOT_BrainPort_Platooning_EnvironmentSensorsRelative: Data extracted from the vehicle environment sensors
This dataset contains information about detected objects, with relative coordinates
AUTOPILOT_BrainPort_Platooning_IotVehicleMessage: Data sent between all devices, vehicles and services
Each sensor data submission is a Message. A Message has an Envelope, a Path, and optionally (but likely) Path Events and optionally Path Media. The envelope carries basic information about the individual sender (the vehicle), but not at a level that would allow the owner of the vehicle to be identified or different messages to be traced to a single vehicle.
AUTOPILOT_BrainPort_Platooning_PlatoonFormation: Data sent from PlatoonService to vehicle
This dataset contains information about the route and speed for a specific vehicle for forming a platoon
AUTOPILOT_BrainPort_Platooning_PlatooningAction: Data logged by vehicle
This dataset contains information about the current status of the platooning
AUTOPILOT_BrainPort_Platooning_PlatooningEvent: Data logged by vehicle
This dataset contains information about the identifiers used for each specific platooning event
AUTOPILOT_BrainPort_Platooning_PlatoonStatus: Data sent by vehicle to PlatoonService
This dataset contains information about the current status of the platooning
AUTOPILOT_BrainPort_Platooning_PositioningSystem: Data from GPS on the vehicle
This dataset contains speed, longitude, latitude, heading from the GPS
AUTOPILOT_BrainPort_Platooning_PositioningSystemResample: Data from GPS on the vehicle
This dataset contains speed, longitude, latitude, heading from the GPS, resampled to 100 milliseconds
AUTOPILOT_BrainPort_Platooning_PSInfo: Data sent by PlatoonService to the vehicle
This dataset contains speed and route information for the vehicle to create a platoon
AUTOPILOT_BrainPort_Platooning_Target: Data from sensors on the vehicle
Target detection in the vicinity of the host vehicle, by a vehicle sensor or virtual sensor
AUTOPILOT_BrainPort_Platooning_Vehicle: Data from the CAN and sensors about the state of the vehicle
This dataset contains, among other things, temperature and battery state of the vehicles
AUTOPILOT_BrainPort_Platooning_VehicleDynamics: Data from the CAN and sensors about the state of the vehicle
This dataset contains, among other things, accelerations and speed limit of the vehicle, as observed from the CAN and the external sensors
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Scenario description:
Platoon formation and platooning, from Helmond to Eindhoven and back to the Automotive Campus.
- Starting in urban area with speed limits of 15 and 30 km/h.
- Driving East on the Europaweg with speed limits of 50 and 70 km/h. This includes 3 crossings with traffic lights.
- Driving on the N270, along the Automotive Campus. One crossing with traffic lights, just before the A270.
- Driving on the A270 (speed limit 100 km/h). Interrupted by one traffic light.
- U-turn at the fly-over or at the end of the A270, to return the same way to the Automotive Campus.
Session description:
Platoon formation and platooning, with live traffic light data included in planner.
- Live traffic light data available for planner
- Driver uses the Android app
- Starting at default locations
- Platooning (CACC and lane keeping) on the A270 when possible.
Datasets descriptions:
AUTOPILOT_BrainPort_Platooning_DriverVehicleInteraction: Data extracted from the CAN of the vehicle
This dataset contains e.g. throttlestatus, clutchstatus, brakestatus, brakeforce, wipersstatus, steeringwheel for the vehicle
AUTOPILOT_BrainPort_Platooning_EnvironmentSensorsAbsolute: Data extracted from the vehicle environment sensors
This dataset contains information about detected objects, with absolute coordinates
AUTOPILOT_BrainPort_Platooning_EnvironmentSensorsRelative: Data extracted from the vehicle environment sensors
This dataset contains information about detected objects, with relative coordinates
AUTOPILOT_BrainPort_Platooning_IotVehicleMessage: Data sent between all devices, vehicles and services
Each sensor data submission is a Message. A Message has an Envelope, a Path, and optionally (but likely) Path Events and optionally Path Media. The envelope carries basic information about the individual sender (the vehicle), but not at a level that would allow the owner of the vehicle to be identified or different messages to be traced to a single vehicle.
AUTOPILOT_BrainPort_Platooning_PlatoonFormation: Data sent from PlatoonService to vehicle
This dataset contains information about the route and speed for a specific vehicle for forming a platoon
AUTOPILOT_BrainPort_Platooning_PlatooningAction: Data logged by vehicle
This dataset contains information about the current status of the platooning
AUTOPILOT_BrainPort_Platooning_PlatooningEvent: Data logged by vehicle
This dataset contains information about the identifiers used for each specific platooning event
AUTOPILOT_BrainPort_Platooning_PlatoonStatus: Data sent by vehicle to PlatoonService
This dataset contains information about the current status of the platooning
AUTOPILOT_BrainPort_Platooning_PositioningSystem: Data from GPS on the vehicle
This dataset contains speed, longitude, latitude, heading from the GPS
AUTOPILOT_BrainPort_Platooning_PositioningSystemResample: Data from GPS on the vehicle
This dataset contains speed, longitude, latitude, heading from the GPS, resampled to 100 milliseconds
AUTOPILOT_BrainPort_Platooning_PSInfo: Data sent by PlatoonService to the vehicle
This dataset contains speed and route information for the vehicle to create a platoon
AUTOPILOT_BrainPort_Platooning_Target: Data from sensors on the vehicle
Target detection in the vicinity of the host vehicle, by a vehicle sensor or virtual sensor
AUTOPILOT_BrainPort_Platooning_Vehicle: Data from the CAN and sensors about the state of the vehicle
This dataset contains, among other things, temperature and battery state of the vehicles
AUTOPILOT_BrainPort_Platooning_VehicleDynamics: Data from the CAN and sensors about the state of the vehicle
This dataset contains, among other things, accelerations and speed limit of the vehicle, as observed from the CAN and the external sensors
Potential weaknesses and other security issues per app.
Public Domain Mark 1.0 https://creativecommons.org/publicdomain/mark/1.0/
License information was derived automatically
In this table you will find information about CoronaMelder. This concerns two variables:
1. The number of people who downloaded CoronaMelder
2. The number of people who warned others via CoronaMelder
1. The number of downloads is based on data from:
- App Store (iOS)
- Play Store (Android)
- Huawei App Gallery (Android)
2. If you have tested positive for corona, you can voluntarily indicate this in the app, together with an employee of the GGD. The numbers show how many people have done this.
Cuckoo Sandbox is the leading open source automated malware analysis system. You can throw any suspicious file at it and in a matter of seconds Cuckoo will give you detailed results outlining what the file did when executed inside an isolated environment.
Cuckoo Sandbox is free software that automates the task of analyzing any malicious file under Windows, OS X, Linux, and Android.
What can it do?
Cuckoo Sandbox is an advanced, extremely modular, and 100% open source automated malware analysis system with infinite application opportunities. By default it is able to:
Analyze many different malicious files (executables, office documents, pdf files, emails, etc) as well as malicious websites under Windows, Linux, Mac OS X, and Android virtualized environments.
Trace API calls and general behavior of the file and distill this into high level information and signatures comprehensible by anyone.
Dump and analyze network traffic, even when encrypted with SSL/TLS, with native network routing support to drop all traffic or route it through InetSIM, a network interface, or a VPN.
Perform advanced memory analysis of the infected virtualized system through Volatility as well as on a process memory granularity using YARA.
Due to Cuckoo's open source nature and extensive modular design, one may customize any aspect of the analysis environment, analysis results processing, and reporting stage. Cuckoo provides everything required to integrate the sandbox into your existing framework and backend in the way you want, with the format you want, and all of that without licensing requirements.