15 datasets found

Drone-Based Malware Detection (DBMD)
kaggle.com
Updated Jul 27, 2024
Cite
DatasetEngineer (2024). Drone-Based Malware Detection (DBMD) [Dataset]. http://doi.org/10.34740/kaggle/dsv/9045375
Unique identifier
https://doi.org/10.34740/kaggle/dsv/9045375
Dataset updated
Jul 27, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
DatasetEngineer
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Welcome to the Drone-Based Malware Detection dataset! This dataset is designed to aid researchers and practitioners in exploring innovative cybersecurity solutions using drone-collected data. The dataset contains detailed information on network traffic, drone sensor readings, malware detection indicators, and environmental conditions. It offers a unique perspective by integrating data from drones with traditional network security metrics to enhance malware detection capabilities.

Dataset Overview

The dataset comprises four main categories:

Network Traffic Data: Captures network traffic attributes including IP addresses, ports, protocols, packet sizes, and various derived metrics.
Drone Sensor Data: Includes GPS coordinates, altitude, speed, heading, battery level, and other sensor readings from drones.
Malware Detection Data: Contains indicators and scores relevant to detecting malware, such as anomaly scores, suspicious IP counts, reputation scores, and attack types.
Environmental Data: Provides context through environmental conditions like location type, noise level, weather conditions, and more.

Files and Features

The dataset is divided into four separate CSV files:

network_traffic_data.csv

timestamp: Date and time of the traffic event.
source_ip: Source IP address.
destination_ip: Destination IP address.
source_port: Source port number.
destination_port: Destination port number.
protocol: Network protocol (TCP, UDP, ICMP).
packet_length: Length of the network packet.
payload_data: Content of the packet payload.
flag: Network flag (SYN, ACK, FIN, RST).
traffic_volume: Volume of traffic in bytes.
flow_duration: Duration of the network flow.
flow_bytes_per_s: Bytes per second for the flow.
flow_packets_per_s: Packets per second for the flow.
packet_count: Number of packets in the flow.
average_packet_size: Average size of packets.
min_packet_size: Minimum packet size.
max_packet_size: Maximum packet size.
packet_size_variance: Variance in packet sizes.
header_length: Length of the packet header.
payload_length: Length of the packet payload.
ip_ttl: Time to live for the IP packet.
tcp_window_size: TCP window size.
icmp_type: ICMP type (echo_request, echo_reply, destination_unreachable).
dns_query_count: Number of DNS queries.
dns_response_count: Number of DNS responses.
http_method: HTTP method (GET, POST, PUT, DELETE).
http_status_code: HTTP status code (200, 404, 500, 301).
content_type: Content type (text/html, application/json, image/png).
ssl_tls_version: SSL/TLS version.
ssl_tls_cipher_suite: SSL/TLS cipher suite.

drone_data.csv

latitude: Latitude of the drone.
longitude: Longitude of the drone.
altitude: Altitude of the drone.
speed: Speed of the drone.
heading: Heading of the drone.
battery_level: Battery level of the drone.
drone_id: Unique identifier for the drone.
flight_time: Total flight time.
signal_strength: Strength of the drone's signal.
temperature: Temperature at the drone's location.
humidity: Humidity at the drone's location.
pressure: Atmospheric pressure at the drone's location.
wind_speed: Wind speed at the drone's location.
wind_direction: Wind direction at the drone's location.
gps_accuracy: Accuracy of the GPS signal.

malware_detection_data.csv

anomaly_score: Score indicating the level of anomaly detected.
suspicious_ip_count: Number of suspicious IP addresses detected.
malicious_payload_indicator: Indicator for malicious payload (0 or 1).
reputation_score: Reputation score for the network entity.
behavioral_score: Behavioral score indicating potential malicious activity.
attack_type: Type of attack (DDoS, phishing, malware).
signature_match: Indicator for signature match (0 or 1).
sandbox_result: Result from sandbox analysis (clean, infected).
heuristic_score: Heuristic score for potential threats.
traffic_pattern: Pattern of the traffic (burst, steady).

environmental_data.csv

location_type: Type of location (urban, rural).
nearby_devices: Number of nearby devices.
signal_interference: Level of signal interference.
noise_level: Noise level in the environment.
time_of_day: Time of day (morning, afternoon, evening, night).
day_of_week: Day of the week.
weather_conditions: Weather conditions (sunny, rainy, cloudy, stormy).

Usage and Applications

This dataset can be used for:

Cybersecurity Research: Developing and testing algorithms for malware detection using drone data.
Machine Learning: Training models to identify malicious activity based on network traffic and drone sensor readings.
Data Analysis: Exploring the relationships between environmental conditions, drone sensor data, and network traffic anomalies.
Educational Purposes: Teaching data science, machine learning, and cybersecurity concepts using a comprehensive and multi-faceted dataset.
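
The derived flow columns listed for network_traffic_data.csv (packet_count, average_packet_size, min_packet_size, max_packet_size, packet_size_variance, flow_bytes_per_s, flow_packets_per_s) can be illustrated with a minimal sketch. The dataset description does not say how these columns were computed, so the formulas below are assumptions: sizes in bytes, duration in seconds, and population variance. The function name flow_metrics and the sample values are hypothetical.

```python
from statistics import mean, pvariance


def flow_metrics(packet_sizes, flow_duration_s):
    """Summarize one flow from its packet sizes (bytes) and duration (seconds).

    Assumed definitions; the dataset does not document its own formulas.
    """
    if not packet_sizes or flow_duration_s <= 0:
        raise ValueError("need at least one packet and a positive duration")
    total_bytes = sum(packet_sizes)
    return {
        "packet_count": len(packet_sizes),
        "traffic_volume": total_bytes,
        "average_packet_size": mean(packet_sizes),
        "min_packet_size": min(packet_sizes),
        "max_packet_size": max(packet_sizes),
        "packet_size_variance": pvariance(packet_sizes),
        "flow_bytes_per_s": total_bytes / flow_duration_s,
        "flow_packets_per_s": len(packet_sizes) / flow_duration_s,
    }


# Made-up example: four packets observed over two seconds.
metrics = flow_metrics([100, 200, 300, 400], 2.0)
```

Recomputing such columns from raw values is one way to sanity-check a CSV before training on it, provided the units match the assumptions above.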

Acknowledgements

This dataset is based on real-world data collected from drone sensors and network traffic monitoring s...
Network Traffic Android Malware
kaggle.com
Updated Sep 12, 2019
Cite
Christian Urcuqui (2019). Network Traffic Android Malware [Dataset]. https://www.kaggle.com/datasets/xwolf12/network-traffic-android-malware
Available download formats: zip (116603 bytes)
Dataset updated
Sep 12, 2019
Authors
Christian Urcuqui
Description
Introduction

Android is one of the most widely used mobile operating systems worldwide. Because of its technological impact, its open-source code, and the possibility of installing third-party applications without any central control, Android has become a malware target. Although it includes security mechanisms, recent news about malicious activity and Android's vulnerabilities points to the importance of continuing to develop methods and frameworks that improve its security.

To prevent malware attacks, researchers and developers have proposed different security solutions, applying static analysis, dynamic analysis, and artificial intelligence. Indeed, data science has become a promising area in cybersecurity, since analytical models based on data allow for the discovery of insights that can help to predict malicious activities.

In this work, we propose to consider some network layer features as the basis for machine learning models that can successfully detect malware applications, using open datasets from the research community.

Content

This dataset is derived from another dataset (DroidCollector), which provides all the network traffic as pcap files. In our research, we preprocessed those files to extract the network features described in the following article:

López, C. C. U., Villarreal, J. S. D., Belalcazar, A. F. P., Cadavid, A. N., & Cely, J. G. D. (2018, May). Features to Detect Android Malware. In 2018 IEEE Colombian Conference on Communications and Computing (COLCOM) (pp. 1-6). IEEE.

Acknowledgements

Cao, D., Wang, S., Li, Q., Cheny, Z., Yan, Q., Peng, L., & Yang, B. (2016, August). DroidCollector: A High Performance Framework for High Quality Android Traffic Collection. In Trustcom/BigDataSE/ISPA, 2016 IEEE (pp. 1753-1758). IEEE.
Android Malware Dataset for Machine Learning
kaggle.com
Updated Mar 13, 2021
Cite
Shashwat Tiwari (2021). Android Malware Dataset for Machine Learning [Dataset]. https://www.kaggle.com/datasets/shashwatwork/android-malware-dataset-for-machine-learning/discussion
Dataset updated
Mar 13, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Shashwat Tiwari
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Context

"Mobile malware is malicious software that targets mobile phones or wireless-enabled Personal digital assistants (PDA), by causing the collapse of the system and loss or leakage of confidential information. As wireless phones and PDA networks have become more and more common and have grown in complexity, it has become increasingly difficult to ensure their safety and security against electronic attacks in the form of viruses or other malware."

Content

The dataset consists of feature vectors of 215 attributes extracted from 15,036 applications (5,560 malware apps from the Drebin project and 9,476 benign apps). It has been used to develop and evaluate a multilevel classifier fusion approach for Android malware detection, published in the IEEE Transactions on Cybernetics paper 'DroidFusion: A Novel Multilevel Classifier Fusion Approach for Android Malware Detection'. The supporting file contains the description of the feature vectors/attributes obtained via static code analysis of the Android apps.

Acknowledgements

Yerima, Suleiman (2018): Android malware dataset for machine learning 2. figshare. Dataset. https://doi.org/10.6084/m9.figshare.5854653.v1
Data Source - https://figshare.com/articles/dataset/Android_malware_dataset_for_machine_learning_2/5854653
Literature URL - https://ieeexplore.ieee.org/document/8245867
Malware Analysis Datasets: PE Section Headers
kaggle.com
Updated Aug 14, 2019
Cite
Angelo Oliveira (2019). Malware Analysis Datasets: PE Section Headers [Dataset]. https://www.kaggle.com/ang3loliveira/malware-analysis-datasets-pe-section-headers
Available download formats: zip (1307424 bytes)
Dataset updated
Aug 14, 2019
Authors
Angelo Oliveira
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
License information was derived automatically
Description
Introduction

This dataset is part of my PhD research on malware detection and classification using Deep Learning. It contains static analysis data (PE Section Headers of the .text, .code and CODE sections) extracted from the 'pe_sections' elements of Cuckoo Sandbox reports. PE malware examples were downloaded from virusshare.com. PE goodware examples were downloaded from portableapps.com and from Windows 7 x86 directories.

Features

Column name: hash
Description: MD5 hash of the example
Content: 32-character hexadecimal string

Column name: size_of_data
Description: The size of the section on disk
Content: Integer

Column name: virtual_address
Description: Memory address of the first byte of the section relative to the image base
Content: Integer

Column name: entropy
Description: Calculated entropy of the section
Content: Float

Column name: virtual_size
Description: The size of the section when loaded into memory
Content: Integer

Column name: malware
Description: Class
Content: 0 (Goodware) or 1 (Malware)
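
The entropy column is described only as the calculated entropy of the section. A common choice for this in PE analysis is Shannon entropy over byte values, measured in bits per byte (0 to 8); whether the Cuckoo Sandbox reports use exactly this quantity is an assumption here. A minimal sketch, with the hypothetical helper name section_entropy:

```python
import math
from collections import Counter


def section_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    # Sum p * log2(1/p) over the byte values that actually occur.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


low = section_entropy(b"\x00" * 64)        # a single repeated byte value
high = section_entropy(bytes(range(256)))  # every byte value exactly once
```

Values close to 8 are often associated with packed or encrypted sections, which is one reason entropy is a popular static feature.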

Acknowledgements

Thank you Cuckoo Sandbox for developing such an amazing dynamic analysis environment!
Thank you VirusShare! Because sharing is caring!

Citations

Please refer to http://dx.doi.org/10.21227/2czh-es14
EDGE-IIOTSET Dataset
paperswithcode.com
Updated Oct 16, 2023
Cite
(2023). EDGE-IIOTSET Dataset [Dataset]. https://paperswithcode.com/dataset/edge-iiotset
Dataset updated
Oct 16, 2023
Description
ABSTRACT In this project, we propose a new comprehensive, realistic cyber security dataset of IoT and IIoT applications, called Edge-IIoTset, which can be used by machine learning-based intrusion detection systems in two different modes, namely centralized and federated learning. Specifically, the proposed testbed is organized into seven layers: Cloud Computing Layer, Network Functions Virtualization Layer, Blockchain Network Layer, Fog Computing Layer, Software-Defined Networking Layer, Edge Computing Layer, and IoT and IIoT Perception Layer. In each layer, we propose new emerging technologies that satisfy the key requirements of IoT and IIoT applications, such as the ThingsBoard IoT platform, OPNFV platform, Hyperledger Sawtooth, Digital twin, ONOS SDN controller, Mosquitto MQTT brokers, Modbus TCP/IP, etc. The IoT data are generated from various IoT devices (more than 10 types), such as low-cost digital sensors for sensing temperature and humidity, Ultrasonic sensor, Water level detection sensor, pH Sensor Meter, Soil Moisture sensor, Heart Rate Sensor, Flame Sensor, etc. Furthermore, we identify and analyze fourteen attacks related to IoT and IIoT connectivity protocols, which are categorized into five threats: DoS/DDoS attacks, Information gathering, Man-in-the-middle attacks, Injection attacks, and Malware attacks. In addition, we extract features obtained from different sources, including alerts, system resources, logs, and network traffic, and propose 61 new features with high correlations from 1176 found features. After processing and analyzing the proposed realistic cyber security dataset, we provide a primary exploratory data analysis and evaluate the performance of machine learning approaches (i.e., traditional machine learning as well as deep learning) in both centralized and federated learning modes.

Instructions:

Great news! The Edge-IIoT dataset has been featured as a "Document in the top 1% of Web of Science." This indicates that it is ranked within the top 1% of all publications indexed by the Web of Science (WoS) in terms of citations and impact.

Please kindly visit kaggle link for the updates: https://www.kaggle.com/datasets/mohamedamineferrag/edgeiiotset-cyber-sec...

Free use of the Edge-IIoTset dataset for academic research purposes is hereby granted in perpetuity. Use for commercial purposes is allowable after asking the lead author, Dr Mohamed Amine Ferrag, who has asserted his right under the Copyright.

The details of the Edge-IIoT dataset were published in the following paper. For academic/public use of these datasets, authors have to cite the following paper:

Mohamed Amine Ferrag, Othmane Friha, Djallel Hamouda, Leandros Maglaras, Helge Janicke, "Edge-IIoTset: A New Comprehensive Realistic Cyber Security Dataset of IoT and IIoT Applications for Centralized and Federated Learning", IEEE Access, April 2022 (IF: 3.37), DOI: 10.1109/ACCESS.2022.3165809

Link to paper : https://ieeexplore.ieee.org/document/9751703

The directories of the Edge-IIoTset dataset include the following:

•File 1 (Normal traffic)

-File 1.1 (Distance): This file includes two documents, namely, Distance.csv and Distance.pcap. The IoT sensor (Ultrasonic sensor) is used to capture the IoT data.

-File 1.2 (Flame_Sensor): This file includes two documents, namely, Flame_Sensor.csv and Flame_Sensor.pcap. The IoT sensor (Flame Sensor) is used to capture the IoT data.

-File 1.3 (Heart_Rate): This file includes two documents, namely, Heart_Rate.csv and Heart_Rate.pcap. The IoT sensor (Heart Rate Sensor) is used to capture the IoT data.

-File 1.4 (IR_Receiver): This file includes two documents, namely, IR_Receiver.csv and IR_Receiver.pcap. The IoT sensor (IR (Infrared) Receiver Sensor) is used to capture the IoT data.

-File 1.5 (Modbus): This file includes two documents, namely, Modbus.csv and Modbus.pcap. The IoT sensor (Modbus Sensor) is used to capture the IoT data.

-File 1.6 (phValue): This file includes two documents, namely, phValue.csv and phValue.pcap. The IoT sensor (pH-sensor PH-4502C) is used to capture the IoT data.

-File 1.7 (Soil_Moisture): This file includes two documents, namely, Soil_Moisture.csv and Soil_Moisture.pcap. The IoT sensor (Soil Moisture Sensor v1.2) is used to capture the IoT data.

-File 1.8 (Sound_Sensor): This file includes two documents, namely, Sound_Sensor.csv and Sound_Sensor.pcap. The IoT sensor (LM393 Sound Detection Sensor) is used to capture the IoT data.

-File 1.9 (Temperature_and_Humidity): This file includes two documents, namely, Temperature_and_Humidity.csv and Temperature_and_Humidity.pcap. The IoT sensor (DHT11 Sensor) is used to capture the IoT data.

-File 1.10 (Water_Level): This file includes two documents, namely, Water_Level.csv and Water_Level.pcap. The IoT sensor (Water sensor) is used to capture the IoT data.

•File 2 (Attack traffic):

-File 2.1 (Attack traffic (CSV files)): This file includes 14 documents, namely, Backdoor_attack.csv, DDoS_HTTP_Flood_attack.csv, DDoS_ICMP_Flood_attack.csv, DDoS_TCP_SYN_Flood_attack.csv, DDoS_UDP_Flood_attack.csv, MITM_attack.csv, OS_Fingerprinting_attack.csv, Password_attack.csv, Port_Scanning_attack.csv, Ransomware_attack.csv, SQL_injection_attack.csv, Uploading_attack.csv, Vulnerability_scanner_attack.csv, XSS_attack.csv. Each document is specific to one attack.

-File 2.2 (Attack traffic (PCAP files)): This file includes 14 documents, namely, Backdoor_attack.pcap, DDoS_HTTP_Flood_attack.pcap, DDoS_ICMP_Flood_attack.pcap, DDoS_TCP_SYN_Flood_attack.pcap, DDoS_UDP_Flood_attack.pcap, MITM_attack.pcap, OS_Fingerprinting_attack.pcap, Password_attack.pcap, Port_Scanning_attack.pcap, Ransomware_attack.pcap, SQL_injection_attack.pcap, Uploading_attack.pcap, Vulnerability_scanner_attack.pcap, XSS_attack.pcap. Each document is specific to one attack.

•File 3 (Selected dataset for ML and DL):

-File 3.1 (DNN-EdgeIIoT-dataset): This file contains a selected dataset for the use of evaluating deep learning-based intrusion detection systems.

-File 3.2 (ML-EdgeIIoT-dataset): This file contains a selected dataset for the use of evaluating traditional machine learning-based intrusion detection systems.

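Step 1: Downloading the Edge-IIoTset dataset from the Kaggle platform

# Run in a Google Colab notebook; lines starting with "!" are shell commands.
from google.colab import files

!pip install -q kaggle

# Upload your kaggle.json API token when prompted.
files.upload()

# Place the token where the Kaggle CLI expects it and restrict its permissions.
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

# Download only the DNN selection of the dataset, then unpack it.
!kaggle datasets download -d mohamedamineferrag/edgeiiotset-cyber-security-dataset-of-iot-iiot -f "Edge-IIoTset dataset/Selected dataset for ML and DL/DNN-EdgeIIoT-dataset.csv"
!unzip DNN-EdgeIIoT-dataset.csv.zip
!rm DNN-EdgeIIoT-dataset.csv.zip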

Step 2: Reading the dataset's CSV file into a Pandas DataFrame

import pandas as pd
import numpy as np

df = pd.read_csv('DNN-EdgeIIoT-dataset.csv', low_memory=False)

Step 3: Exploring some of the DataFrame's contents

df.head(5)

print(df['Attack_type'].value_counts())

Step 4: Dropping data (columns, duplicated rows, NaN, null values)

from sklearn.utils import shuffle

# Columns removed before training (timestamps, addresses, ports, payloads).
drop_columns = ["frame.time", "ip.src_host", "ip.dst_host", "arp.src.proto_ipv4", "arp.dst.proto_ipv4",
                "http.file_data", "http.request.full_uri", "icmp.transmit_timestamp", "http.request.uri.query",
                "tcp.options", "tcp.payload", "tcp.srcport", "tcp.dstport", "udp.port", "mqtt.msg"]

df.drop(drop_columns, axis=1, inplace=True)

# Remove rows with any missing value, then exact duplicate rows.
df.dropna(axis=0, how='any', inplace=True)
df.drop_duplicates(subset=None, keep="first", inplace=True)

# Shuffle the rows.
df = shuffle(df)

# Check that no missing values remain and inspect the class balance.
df.isna().sum()

print(df['Attack_type'].value_counts())

Step 5: Categorical data encoding (dummy encoding)

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import preprocessing

def encode_text_dummy(df, name):
    # Replace the categorical column `name` with one indicator column per value.
    dummies = pd.get_dummies(df[name])
    for x in dummies.columns:
        dummy_name = f"{name}-{x}"
        df[dummy_name] = dummies[x]
    df.drop(name, axis=1, inplace=True)

encode_text_dummy(df, 'http.request.method')
encode_text_dummy(df, 'http.referer')
encode_text_dummy(df, "http.request.version")
encode_text_dummy(df, "dns.qry.name.len")
encode_text_dummy(df, "mqtt.conack.flags")
encode_text_dummy(df, "mqtt.protoname")
encode_text_dummy(df, "mqtt.topic")

Step 6: Creation of the preprocessed dataset

df.to_csv('preprocessed_DNN.csv', encoding='utf-8')

For more information about the dataset, please contact the lead author of this project, Dr Mohamed Amine Ferrag, on his email: mohamed.amine.ferrag@gmail.com

More information about Dr. Mohamed Amine Ferrag is available at:

https://www.linkedin.com/in/Mohamed-Amine-Ferrag

https://dblp.uni-trier.de/pid/142/9937.html

https://www.researchgate.net/profile/Mohamed_Amine_Ferrag

https://scholar.google.fr/citations?user=IkPeqxMAAAAJ&hl=fr&oi=ao

https://www.scopus.com/authid/detail.uri?authorId=56115001200

https://publons.com/researcher/1322865/mohamed-amine-ferrag/

https://orcid.org/0000-0002-0632-3172

Last Updated: 27 Mar. 2023
tabular-features
kaggle.com
Updated Apr 8, 2025
Cite
Sigehmireg Lbjevf (2025). tabular-features [Dataset]. https://www.kaggle.com/datasets/sigehmireglbjevf/comparison-tabular-graph-ml-malware
Dataset updated
Apr 8, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Sigehmireg Lbjevf
Description
Datasets of tabular features for the paper "A Unified Comparison of Tabular and Graph-Based Feature Representations in Machine Learning for Malware Detection", submitted to WORMA '25.

The .txt files contain the hashes of the files used for each part of the study.

The .csv files contain the static tabular features (EMBER) for each dataset.

The .pickle files contain the dynamic tabular features for each dataset.
Threat Intelligence Text Dataset
opendatabay.com
Updated Jul 3, 2025
Cite
Datasimple (2025). Threat Intelligence Text Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/8293a044-4601-409d-898b-a16bf6852ae2
Dataset updated
Jul 3, 2025
Dataset authored and provided by
Datasimple
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Area covered
Website Analytics & User Experience
Description
This curated dataset, Cyber-BERT, is designed for Natural Language Processing (NLP) applications within the cybersecurity domain. It contains text extracted from various cybersecurity sources, encompassing topics such as malware analysis, vulnerabilities, cyber threats, and network security. The dataset is well-suited for training BERT-based models to perform essential tasks like threat detection, text classification, and broader cybersecurity research. The data has been meticulously preprocessed to ensure cleanliness, with URLs, non-text symbols, HTML tags, metadata, and redundant content removed.

Columns

text: This column contains the processed cybersecurity-related text.

Distribution

The dataset is typically provided in a CSV file format, making it readily accessible for various applications. It contains approximately 50,000 samples, though the exact number may vary based on collection updates. The data has undergone significant preprocessing to enhance its utility for NLP tasks, including the removal of URLs, non-text symbols, HTML tags, metadata, and duplicate entries.
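
The preprocessing described above (removing URLs, non-text symbols, HTML tags, and duplicate entries) can be sketched as follows. This is not the dataset authors' pipeline, which is not published in this description; the regular expressions, the function names clean_text and clean_corpus, and the sample strings are all illustrative assumptions.

```python
import html
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")
# Anything that is not a word character, whitespace, or basic punctuation.
SYMBOL_RE = re.compile(r"[^\w\s.,;:!?'\"()-]")
SPACE_RE = re.compile(r"\s+")


def clean_text(raw: str) -> str:
    """Strip HTML tags, URLs, and non-text symbols, then normalize spaces."""
    text = html.unescape(raw)
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = SYMBOL_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def clean_corpus(samples):
    """Clean every sample and drop empty or duplicate results, keeping order."""
    seen = set()
    cleaned = []
    for raw in samples:
        text = clean_text(raw)
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


# Made-up samples for illustration only.
samples = [
    "<p>New ransomware spreads via phishing.</p> https://example.com/report",
    "New ransomware spreads via phishing.",
    "CVE details &amp; exploit notes ★",
]
corpus = clean_corpus(samples)
```

Deduplicating after cleaning, as done here, catches entries that differ only in markup or links.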

Usage

This dataset offers a range of valuable applications, including:
* Cyber Threat Detection: Utilise the dataset to train models for classifying security threats.
* Named Entity Recognition (NER): Identify and extract key entities such as malware, exploits, and vulnerabilities from cybersecurity text.
* Threat Intelligence Analysis: Extract valuable insights from cybersecurity reports and other relevant texts.
* BERT Fine-Tuning: Build specialised NLP models tailored for security domains and specific cybersecurity challenges.

Coverage

The text within this dataset is extracted from prominent cybersecurity sources including TheHackerNews, CVE Details, Any.Run, and OpenPhish. The dataset's scope is global. Specific time ranges for the data content itself are not provided.

License

CC0

Who Can Use It

This dataset is an excellent resource for:
* Researchers focused on advancing NLP techniques in cybersecurity.
* Data Scientists and Machine Learning Engineers developing threat detection systems or text classification models.
* Security Analysts looking to automate aspects of threat intelligence analysis.
* Anyone involved in building specialised NLP models for security domains.

Dataset Name Suggestions

Cyber-BERT

Cybersecurity NLP Corpus

Threat Intelligence Text Dataset

Security Text Analytics Data

BERT Security Dataset

Attributes

Original Data Source: Cyber-BERT
IoT Firmware Image Classification
kaggle.com
Updated Jun 15, 2021
Cite
tecperson (2021). IoT Firmware Image Classification [Dataset]. https://www.kaggle.com/datasets/datamunge/iot-firmware-image-classification
Dataset updated
Jun 15, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
tecperson
Description
Context

To leverage the vast literature solving the original MNIST digit recognition problem in small thumbnails, this firmware dataset maps the first 1024 bytes of malicious, benign and hacked Internet of Things and embedded software binaries (Executable and Linkable Format, ELF). The goal is to provide a drop-in replacement for MNIST techniques but relevant to weeding out malware using image recognition.

Content

The images are provided as CSV rows containing the filename, the label class (both categorical and numerical), and the first 1024 bytes mapped to a grayscale range of 0-255 by first converting each byte to decimal (0-15) and then scaling.

Acknowledgements

See additional background on ELF files, https://en.wikipedia.org/wiki/Executable_and_Linkable_Format and https://linux-audit.com/elf-binaries-on-linux-understanding-and-analysis/

The labeled ELF files repository, https://github.com/nimrodpar/Labeled-Elfs

Inspiration

Comparison of firmware detection using these image representations and comparing with signature-based methods as well as contrasting statistical (tree) methods with deep learning techniques
Malimg Dataset
paperswithcode.com
Updated Nov 8, 2022
Cite
Nataraj L.; Karthikeyan S.; Jacob G.; Manjunath B. S. (2022). Malimg Dataset [Dataset]. https://paperswithcode.com/dataset/malimg
Dataset updated
Nov 8, 2022
Authors
Nataraj L.; Karthikeyan S.; Jacob G.; Manjunath B. S.
Description
The Malimg Dataset contains 9,339 malware byteplot images from 25 different families.
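
A byteplot renders a binary as a grayscale image by treating each byte as one pixel intensity (0 to 255) and wrapping the byte stream into rows of a fixed width. The sketch below shows that mapping in plain Python. The function name byteplot_rows, the row width of 4, and the zero-padding of the last row are illustrative assumptions and are not taken from the Malimg authors' tooling.

```python
def byteplot_rows(data: bytes, width: int):
    """Arrange bytes into rows of `width` pixels; each byte is a 0-255 gray value."""
    if width <= 0:
        raise ValueError("width must be positive")
    rows = []
    for start in range(0, len(data), width):
        row = list(data[start:start + width])
        row.extend([0] * (width - len(row)))  # zero-pad the final partial row
        rows.append(row)
    return rows


# Made-up 10-byte input wrapped at 4 pixels per row.
image = byteplot_rows(bytes([0, 64, 128, 255, 1, 2, 3, 4, 9, 10]), width=4)
```

The resulting list of rows can be handed to any image library that accepts a 2D array of 8-bit gray values.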
Cybersecurity Threat and Awareness Program Dataset
kaggle.com
Updated Oct 19, 2024
Cite
DatasetEngineer (2024). Cybersecurity Threat and Awareness Program Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/9665651
Unique identifier
https://doi.org/10.34740/kaggle/dsv/9665651
Dataset updated
Oct 19, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
DatasetEngineer
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Dataset Title: Cybersecurity Threat Detection and Awareness Program Dataset (2018-2024)

Description: This dataset provides a comprehensive collection of cybersecurity events and network traffic data, spanning from January 2018 to March 2024, collected from real-world corporate environments in Texas, USA. The data includes a diverse range of cybersecurity incidents, covering normal activity as well as various types of threats. It was gathered from multiple sources, such as network traffic logs, system logs, and external threat intelligence feeds, making it suitable for developing machine learning models aimed at threat detection, incident response, and cybersecurity awareness improvement.

The dataset is well-suited for research and experimentation in threat intelligence, intrusion detection, cybersecurity awareness training, and anomaly detection. The included features allow for the modeling of various threat scenarios and multi-class classification tasks. The labeled data provides information on the severity and type of threats detected, supporting both supervised and unsupervised learning techniques.

Features Overview:

Date_Time: The timestamp of the event (e.g., 2022-05-01 14:30:00), indicating when the activity or incident occurred.

Source_IP: IP address of the originating device involved in the event (e.g., 192.168.1.1).

Destination_IP: IP address of the target device involved in the event (e.g., 10.0.0.5).

Source_Port: Port number on the originating device (e.g., 443).

Destination_Port: Port number on the target device (e.g., 80).

Protocol_Type: The protocol used for the communication, such as TCP, UDP, ICMP.

Flow_Duration: Duration of the network flow in milliseconds.

Packet_Size: The size of the packet in bytes.

Flow_Bytes/s: The number of bytes transmitted per second during the flow.

Flow_Packets/s: The number of packets transmitted per second during the flow.

Total_Forward_Packets: Total number of packets sent in the forward direction.

Total_Backward_Packets: Total number of packets sent in the reverse direction.

Packet_Length_Mean: Average packet length for the flow.

IAT_Forward: Inter-arrival time for packets in the forward direction.

IAT_Backward: Inter-arrival time for packets in the reverse direction.

Active_Duration: Duration of active time for the connection.

Idle_Duration: Duration of idle time for the connection.

IDS_Alert_Count: Number of intrusion detection system alerts triggered during the event.

Anomaly_Score: A score indicating the anomaly level of the event, derived from anomaly detection algorithms.

Attack_Vector: Type of attack vector used (e.g., Phishing, DDoS, Brute Force).

Attack_Severity: Severity of the detected threat, categorized as Low, Medium, High, or Critical.

Compromised_Hosts_Count: Number of hosts compromised during the event.

Botnet_Family: Family of botnet detected (if applicable), such as Mirai, Zeus.

Malware_Type: Type of malware detected, such as Ransomware, Trojan.

User_Login_Attempts: Number of login attempts during the event.

Geolocation: Geographic location of the originating IP (Country, City).

Device_Type: Type of device involved (e.g., Server, Router, Mobile).

Firewall_Logs: Binary indicator (0 or 1) showing whether firewall logs flagged the activity.

Antivirus_Alerts: Binary indicator (0 or 1) showing whether antivirus software detected a threat.

Open_Ports_Count: Number of open ports on the target device.

Reputation_Score: A score indicating the reputation of the IP/domain based on threat intelligence sources.

Blacklisted_IP: Binary indicator (0 or 1) indicating if the IP is listed on a blacklist.

Known_Vulnerability: Binary indicator (0 or 1) showing if the target system has known vulnerabilities (based on CVE).

Threat_Intelligence_Source: Source from which the threat intelligence information was gathered.

System_Patch_Status: Indicates whether the system is patched (Up-to-date, Outdated).

CPU_Utilization: CPU usage percentage during the event.

Memory_Utilization: Memory usage percentage during the event.

Employee_Training_Completion: Completion rate of cybersecurity awareness training for the employee involved.

Phishing_Simulation_Success: Result of phishing simulation attempts (Success, Failure).

Reported_Incidents: Number of cybersecurity incidents reported by the user.

Incident_Response_Time: Time taken to respond to the incident in minutes.

Label (Target Variable):

Threat_Severity: The severity level of the threat, categorized as:

0: No Threat

1: Low-Level Threat

2: Medium-Level Threat

3: High-Level Threat

4: Critical Threat

Usage: This dataset is ideal for training and testing machine learning models for tasks such as:

Multi-class classification for threat detection.

Anomaly detection.

Predictive modeling for incident response prioritization.

Cybersecurity awareness program improvement.

Researchers and...
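As a small illustration of the Threat_Severity encoding described for this dataset, the integer codes can be decoded with a lookup table. This is only a sketch: the names THREAT_SEVERITY, decode_severity, and is_threat are hypothetical helpers and are not part of the dataset.

```python
# Code-to-label mapping copied from the Threat_Severity description.
# The helper names below are illustrative assumptions, not dataset fields.
THREAT_SEVERITY = {
    0: "No Threat",
    1: "Low-Level Threat",
    2: "Medium-Level Threat",
    3: "High-Level Threat",
    4: "Critical Threat",
}


def decode_severity(code):
    """Return the text label for a Threat_Severity code (int or numeric string)."""
    try:
        return THREAT_SEVERITY[int(code)]
    except (KeyError, ValueError, TypeError):
        raise ValueError(f"unknown Threat_Severity code: {code!r}") from None


def is_threat(code):
    """Collapse the five classes into a binary flag: any code above 0 is a threat."""
    return int(code) > 0
```

A binary flag such as is_threat may be handy for anomaly-detection experiments, while the full mapping suits the multi-class task.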
Simargl2022
kaggle.com
Updated Jun 18, 2022
Cite
H2020 SIMARGL (2022). Simargl2022 [Dataset]. http://doi.org/10.34740/kaggle/ds/2173090
Unique identifier
https://doi.org/10.34740/kaggle/ds/2173090
Dataset updated
Jun 18, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
H2020 SIMARGL
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Article

The Proposition and Evaluation of the RoEduNet-SIMARGL2021 Network Intrusion Detection Dataset

Context

Cybersecurity is an arms race, with both defenders and adversaries attempting to outsmart one another, coming up with new attacks, new ways to defend against those attacks, and again with new ways to circumvent those defenses. This situation creates a constant need for novel, realistic cybersecurity datasets. This paper introduces the effects of using machine-learning-based intrusion detection methods in network traffic coming from a real-life architecture. The main contribution of this work is a dataset coming from a real-world, academic network. Real-life traffic was collected and, after performing a series of attacks, a dataset was assembled. The network data schema is in the NetFlow v9 format and it contains 44 unique features and a label describing each frame.

Cite

This dataset is publicly available for use. When using our dataset, please cite our related paper: Maria-Elena Mihailescu, Darius Mihai, Mihai Carabas, Mikolaj Komisarek, Marek Pawlicki, Witold Holubowicz, Rafal Kozik: The Proposition and Evaluation of the RoEduNet-SIMARGL2021 Network Intrusion Detection Dataset. Sensors 21(13): 4319 (2021)

Acknowledgements

This work is funded under the SIMARGL Project – Secure Intelligent Methods for Advanced RecoGnition of malware and stegomalware, with the support of the European Commission and the Horizon 2020 Program, under Grant Agreement No. 833042.
CoronaHack -Chest X-Ray-Dataset
kaggle.com
zip
Updated Mar 20, 2020
Cite
Praveen (2020). CoronaHack -Chest X-Ray-Dataset [Dataset]. https://www.kaggle.com/praveengovi/coronahack-chest-xraydataset
Explore at:
zip (1275680348 bytes)
Dataset updated
Mar 20, 2020
Authors
Praveen
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Context

The coronavirus (COVID-19) affects the respiratory system of healthy individuals, and chest X-ray is one of the important imaging methods for identifying it.

Using the chest X-ray dataset, develop a machine learning model to classify X-rays of healthy vs. pneumonia (corona) affected patients. This model powers an AI application that tests for the coronavirus more quickly.

Content

A collection of chest X-rays of healthy vs. pneumonia (corona) affected patients, along with a few other categories such as SARS (Severe Acute Respiratory Syndrome), Streptococcus, and ARDS (Acute Respiratory Distress Syndrome).

Image names and labels are available in Chest_Xray_Corona_Metadata.csv

COVID 19 - https://en.wikipedia.org/wiki/Coronavirus_disease_2019

ARDS - https://en.wikipedia.org/wiki/Acute_respiratory_distress_syndrome

Streptococcus - https://en.wikipedia.org/wiki/Streptococcus

SARS - https://en.wikipedia.org/wiki/Severe_acute_respiratory_syndrome

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F378285%2Fcfdeda929ebe5e6254590538601c0ef6%2FChest_XRay_dataset_labels.png?generation=1584770009221937&alt=media

Acknowledgements

I would like to thank Joseph Paul Cohen, Postdoctoral Fellow, Mila, University of Montreal, and his team for the corona dataset below. 80% of the dataset was collected from different sources.

Original source: https://github.com/ieee8023/covid-chestxray-dataset

Inspiration

Automated methods to detect and classify human diseases from medical images. Novel machine learning algorithms and neural networks help reduce coronavirus detection time and help the doctor conduct the consultation more effectively.
UMUDGA - Domain Generation
kaggle.com
zip
Updated Mar 27, 2021
Cite
Saurabh Shahane (2021). UMUDGA - Domain Generation [Dataset]. https://www.kaggle.com/saurabhshahane/domain-generation
Explore at:
zip (1346047998 bytes)
Dataset updated
Mar 27, 2021
Authors
Saurabh Shahane
Description
Context

In computer security, network botnets still represent a major cyber threat. Concealment techniques such as dynamic addressing and Domain Name Generation Algorithms (DGAs) require an improved and more effective detection process. To this end, this data descriptor presents a collection of over 30 million manually-labelled algorithmically generated domain names decorated with a feature set ready-to-use for Machine Learning analysis. This proposed data set enables researchers to move past the data collection, organization and pre-processing phases, eventually enabling them to focus on the analysis and the production of Machine-Learning powered solutions for network intrusion detection.

Content

Fifty of the most important malware variants have been selected. Each family is available both as a list of domains and as a collection of features. To be more precise, the former is generated by executing the malware DGAs in a controlled environment with fixed parameters, while the latter is generated by extracting a combination of statistical and Natural Language Processing (NLP) metrics.
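To give a feel for what "statistical metrics" over a domain name can look like, here is a minimal sketch that computes a few common lexical features for a domain's left-most label. The feature choice (length, character entropy, vowel ratio, digit ratio) and the function names are assumptions made for illustration. They are not the UMUDGA feature set, which is defined in the cited data descriptor.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Shannon entropy (bits per character) of the character distribution."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def domain_features(domain):
    """Illustrative lexical features of the left-most label of a domain name."""
    label = domain.lower().split(".")[0]
    size = len(label) or 1  # avoid division by zero on an empty label
    return {
        "length": len(label),
        "entropy": shannon_entropy(label),
        "vowel_ratio": sum(ch in "aeiou" for ch in label) / size,
        "digit_ratio": sum(ch.isdigit() for ch in label) / size,
    }
```

Algorithmically generated names often, though not always, show higher character entropy and fewer vowels than dictionary-based names, which is why features of this kind are a common starting point.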

Acknowledgements

Zago, Mattia; Gil Pérez, Manuel; Martinez Perez, Gregorio (2020), “UMUDGA - University of Murcia Domain Generation Algorithm Dataset”, Mendeley Data, V1, doi: 10.17632/y8ph45msv8.1
Assembly Shellcode Dataset
kaggle.com
Updated Dec 5, 2023
Cite
The Devastator (2023). Assembly Shellcode Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/assembly-shellcode-dataset
Dataset updated
Dec 5, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
The Devastator
License
CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
Description
Assembly Shellcode Dataset

The Largest Collection of Linux Assembly Shellcodes

By SoLID (from Hugging Face)

About this dataset

The dataset consists of multiple files for different purposes. The validation.csv file contains a set of carefully selected assembly shellcodes that serve the purpose of validation. These shellcodes are used to ensure the accuracy and integrity of any models or algorithms trained on this dataset.

The train.csv file contains both the intent column, which describes the purpose or objective behind each specific shellcode, and its corresponding assembly code snippets in order to facilitate supervised learning during training procedures. This file proves to be immensely valuable for researchers, practitioners, and developers seeking to study or develop effective techniques for dealing with malicious code analysis or security-related tasks.

For testing purposes, the test.csv file provides yet another collection of assembly shellcodes that can be employed as test cases to assess the performance, robustness, and generalization capability of various models or methodologies developed within this domain.

How to use the dataset

Understanding the Dataset

The dataset consists of multiple files that serve different purposes:

train.csv: This file contains the intent and corresponding assembly code snippets for training purposes. It can be used to train machine learning models or develop algorithms based on shellcode analysis.

test.csv: The test.csv file in the dataset contains a collection of assembly shellcodes specifically designed for testing purposes. You can use these shellcodes to evaluate and validate your models or analysis techniques.

validation.csv: The validation.csv file includes a set of assembly shellcodes that are specifically reserved for validation purposes. These shellcodes can be used separately to ensure the accuracy and reliability of your models.

Columns in the Dataset

The columns available in each CSV file are as follows:

intent: The intent column describes the purpose or objective of each specific shellcode entry. It provides information regarding what action or achievement is intended by using that particular piece of code.

snippet: The snippet column contains the actual assembly code corresponding to each intent entry in its respective row. It includes all necessary instructions and data required to execute the desired action specified by that intent.
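The two-column layout described above can be read with Python's standard csv module. The sketch below parses a small in-memory sample so that it runs without the dataset. The two sample rows are made up for illustration and are not taken from train.csv, and load_pairs is a hypothetical helper name.

```python
import csv
import io

# Hypothetical sample in the described intent/snippet layout (not real dataset rows).
SAMPLE_CSV = '''intent,snippet
"push the contents of eax onto the stack","push eax"
"move 1 into ebx","mov ebx, 1"
'''


def load_pairs(handle):
    """Read (intent, snippet) pairs from an open CSV file or file-like object."""
    reader = csv.DictReader(handle)
    return [(row["intent"], row["snippet"]) for row in reader]


pairs = load_pairs(io.StringIO(SAMPLE_CSV))

# With the real file, assuming it sits in the working directory:
# with open("train.csv", newline="", encoding="utf-8") as f:
#     pairs = load_pairs(f)
```

Quoting matters here because assembly operands often contain commas, so a CSV parser is safer than splitting lines by hand.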

Utilizing the Dataset

To effectively utilize this dataset, follow these general steps:

Familiarize yourself with assembly language: Assembly language is essential when working with shellcodes since they consist of low-level machine instructions understood by processors directly.

Explore intents: Start by analyzing and understanding different intents present in the dataset entries thoroughly. Each intent represents a specific goal or purpose behind creating an individual piece of code.

Examine snippets: Review the assembly code snippets corresponding to each intent entry. Carefully study the instructions and data used in the shellcode, as they directly influence their intended actions.

Train your models: If you are working on machine learning or algorithm development, utilize the train.csv file to train your models based on the labeled intent and snippet data provided. This step will enable you to build powerful tools for analyzing or detecting shellcodes automatically.

Evaluate using test datasets: Use the various assembly shellcodes present in test.csv to evaluate and validate your trained models or analysis techniques. This evaluation will help

Research Ideas

Malware analysis: The dataset can be used for studying and analyzing various shellcode techniques used in malware attacks. Researchers and security professionals can use this dataset to develop detection and prevention mechanisms against such attacks.

Penetration testing: Security experts can use this dataset to simulate real-world attack scenarios and test the effectiveness of their defensive measures. By having access to a diverse range of shellcodes, they can identify vulnerabilities in systems and patch them before malicious actors exploit them.

Machine learning training: This dataset can be used to train machine learning models for automatic detection or classification of shellcodes. By combining the intent column (which describes the objective of each shellcode) with the corresponding assembly code snippets, researchers can develop algorithms that automatically identify the purpose or ...
covid-19_mask_detection
kaggle.com
Updated May 3, 2020
Cite
omkar1008 (2020). covid-19_mask_detection [Dataset]. https://www.kaggle.com/datasets/omkar1008/covid19-mask-detection/suggestions
Dataset updated
May 3, 2020
Dataset provided by
Kaggle
Authors
omkar1008
Description
Dataset

This dataset was created by omkar1008

Contents