http://www.gnu.org/licenses/lgpl-3.0.htmlhttp://www.gnu.org/licenses/lgpl-3.0.html
The dataset has been introduced by the below-mentioned researches: E. C. P. Neto, S. Dadkhah, R. Ferreira, A. Zohourian, R. Lu, A. A. Ghorbani. "CICIoT2023: A real-time dataset and benchmark for large-scale attacks in IoT environment," Sensor (2023) – (submitted to Journal of Sensors). The present data contains different kinds of IoT intrusions. The categories of the IoT intrusions enlisted in the data are as follows: DDoS Brute Force Spoofing DoS Recon Web-based Mirai
There are several subcategories are present in the data for each kind of intrusion types in the IoT. The dataset contains 1191264 instances of network for intrusions and 47 features of each of the intrusions. The dataset can be used to prepare the predictive model through which different kind of intrusive attacks can be detected. The data is also suitable for designing the IDS system.
This dataset was created by Kiran
This dataset was created by sid007
ABSTRACT In this project, we propose a new comprehensive realistic cyber security dataset of IoT and IIoT applications, called Edge-IIoTset, which can be used by machine learning-based intrusion detection systems in two different modes, namely, centralized and federated learning. Specifically, the proposed testbed is organized into seven layers, including, Cloud Computing Layer, Network Functions Virtualization Layer, Blockchain Network Layer, Fog Computing Layer, Software-Defined Networking Layer, Edge Computing Layer, and IoT and IIoT Perception Layer. In each layer, we propose new emerging technologies that satisfy the key requirements of IoT and IIoT applications, such as, ThingsBoard IoT platform, OPNFV platform, Hyperledger Sawtooth, Digital twin, ONOS SDN controller, Mosquitto MQTT brokers, Modbus TCP/IP, ...etc. The IoT data are generated from various IoT devices (more than 10 types) such as Low-cost digital sensors for sensing temperature and humidity, Ultrasonic sensor, Water level detection sensor, pH Sensor Meter, Soil Moisture sensor, Heart Rate Sensor, Flame Sensor, ...etc.). However, we identify and analyze fourteen attacks related to IoT and IIoT connectivity protocols, which are categorized into five threats, including, DoS/DDoS attacks, Information gathering, Man in the middle attacks, Injection attacks, and Malware attacks. In addition, we extract features obtained from different sources, including alerts, system resources, logs, network traffic, and propose new 61 features with high correlations from 1176 found features. After processing and analyzing the proposed realistic cyber security dataset, we provide a primary exploratory data analysis and evaluate the performance of machine learning approaches (i.e., traditional machine learning as well as deep learning) in both centralized and federated learning modes.
Instructions:
Great news! The Edge-IIoT dataset has been featured as a "Document in the top 1% of Web of Science." This indicates that it is ranked within the top 1% of all publications indexed by the Web of Science (WoS) in terms of citations and impact.
Please kindly visit kaggle link for the updates: https://www.kaggle.com/datasets/mohamedamineferrag/edgeiiotset-cyber-sec...
Free use of the Edge-IIoTset dataset for academic research purposes is hereby granted in perpetuity. Use for commercial purposes is allowable after asking the leader author, Dr Mohamed Amine Ferrag, who has asserted his right under the Copyright.
The details of the Edge-IIoT dataset were published in following the paper. For the academic/public use of these datasets, the authors have to cities the following paper:
Mohamed Amine Ferrag, Othmane Friha, Djallel Hamouda, Leandros Maglaras, Helge Janicke, "Edge-IIoTset: A New Comprehensive Realistic Cyber Security Dataset of IoT and IIoT Applications for Centralized and Federated Learning", IEEE Access, April 2022 (IF: 3.37), DOI: 10.1109/ACCESS.2022.3165809
Link to paper : https://ieeexplore.ieee.org/document/9751703
The directories of the Edge-IIoTset dataset include the following:
•File 1 (Normal traffic)
-File 1.1 (Distance): This file includes two documents, namely, Distance.csv and Distance.pcap. The IoT sensor (Ultrasonic sensor) is used to capture the IoT data.
-File 1.2 (Flame_Sensor): This file includes two documents, namely, Flame_Sensor.csv and Flame_Sensor.pcap. The IoT sensor (Flame Sensor) is used to capture the IoT data.
-File 1.3 (Heart_Rate): This file includes two documents, namely, Flame_Sensor.csv and Flame_Sensor.pcap. The IoT sensor (Flame Sensor) is used to capture the IoT data.
-File 1.4 (IR_Receiver): This file includes two documents, namely, IR_Receiver.csv and IR_Receiver.pcap. The IoT sensor (IR (Infrared) Receiver Sensor) is used to capture the IoT data.
-File 1.5 (Modbus): This file includes two documents, namely, Modbus.csv and Modbus.pcap. The IoT sensor (Modbus Sensor) is used to capture the IoT data.
-File 1.6 (phValue): This file includes two documents, namely, phValue.csv and phValue.pcap. The IoT sensor (pH-sensor PH-4502C) is used to capture the IoT data.
-File 1.7 (Soil_Moisture): This file includes two documents, namely, Soil_Moisture.csv and Soil_Moisture.pcap. The IoT sensor (Soil Moisture Sensor v1.2) is used to capture the IoT data.
-File 1.8 (Sound_Sensor): This file includes two documents, namely, Sound_Sensor.csv and Sound_Sensor.pcap. The IoT sensor (LM393 Sound Detection Sensor) is used to capture the IoT data.
-File 1.9 (Temperature_and_Humidity): This file includes two documents, namely, Temperature_and_Humidity.csv and Temperature_and_Humidity.pcap. The IoT sensor (DHT11 Sensor) is used to capture the IoT data.
-File 1.10 (Water_Level): This file includes two documents, namely, Water_Level.csv and Water_Level.pcap. The IoT sensor (Water sensor) is used to capture the IoT data.
•File 2 (Attack traffic):
-File 2.1 (Attack traffic (CSV files)): This file includes 13 documents, namely, Backdoor_attack.csv, DDoS_HTTP_Flood_attack.csv, DDoS_ICMP_Flood_attack.csv, DDoS_TCP_SYN_Flood_attack.csv, DDoS_UDP_Flood_attack.csv, MITM_attack.csv, OS_Fingerprinting_attack.csv, Password_attack.csv, Port_Scanning_attack.csv, Ransomware_attack.csv, SQL_injection_attack.csv, Uploading_attack.csv, Vulnerability_scanner_attack.csv, XSS_attack.csv. Each document is specific for each attack.
-File 2.2 (Attack traffic (PCAP files)): This file includes 13 documents, namely, Backdoor_attack.pcap, DDoS_HTTP_Flood_attack.pcap, DDoS_ICMP_Flood_attack.pcap, DDoS_TCP_SYN_Flood_attack.pcap, DDoS_UDP_Flood_attack.pcap, MITM_attack.pcap, OS_Fingerprinting_attack.pcap, Password_attack.pcap, Port_Scanning_attack.pcap, Ransomware_attack.pcap, SQL_injection_attack.pcap, Uploading_attack.pcap, Vulnerability_scanner_attack.pcap, XSS_attack.pcap. Each document is specific for each attack.
•File 3 (Selected dataset for ML and DL):
-File 3.1 (DNN-EdgeIIoT-dataset): This file contains a selected dataset for the use of evaluating deep learning-based intrusion detection systems.
-File 3.2 (ML-EdgeIIoT-dataset): This file contains a selected dataset for the use of evaluating traditional machine learning-based intrusion detection systems.
Step 1: Downloading The Edge-IIoTset dataset From the Kaggle platform from google.colab import files
!pip install -q kaggle
files.upload()
!mkdir ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download -d mohamedamineferrag/edgeiiotset-cyber-security-dataset-of-iot-iiot -f "Edge-IIoTset dataset/Selected dataset for ML and DL/DNN-EdgeIIoT-dataset.csv"
!unzip DNN-EdgeIIoT-dataset.csv.zip
!rm DNN-EdgeIIoT-dataset.csv.zip
Step 2: Reading the Datasets' CSV file to a Pandas DataFrame: import pandas as pd
import numpy as np
df = pd.read_csv('DNN-EdgeIIoT-dataset.csv', low_memory=False)
Step 3 : Exploring some of the DataFrame's contents: df.head(5)
print(df['Attack_type'].value_counts())
Step 4: Dropping data (Columns, duplicated rows, NAN, Null..): from sklearn.utils import shuffle
drop_columns = ["frame.time", "ip.src_host", "ip.dst_host", "arp.src.proto_ipv4","arp.dst.proto_ipv4",
"http.file_data","http.request.full_uri","icmp.transmit_timestamp",
"http.request.uri.query", "tcp.options","tcp.payload","tcp.srcport",
"tcp.dstport", "udp.port", "mqtt.msg"]
df.drop(drop_columns, axis=1, inplace=True)
df.dropna(axis=0, how='any', inplace=True)
df.drop_duplicates(subset=None, keep="first", inplace=True)
df = shuffle(df)
df.isna().sum()
print(df['Attack_type'].value_counts())
Step 5: Categorical data encoding (Dummy Encoding): import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import preprocessing
def encode_text_dummy(df, name):
dummies = pd.get_dummies(df[name])
for x in dummies.columns:
dummy_name = f"{name}-{x}"
df[dummy_name] = dummies[x]
df.drop(name, axis=1, inplace=True)
encode_text_dummy(df,'http.request.method')
encode_text_dummy(df,'http.referer')
encode_text_dummy(df,"http.request.version")
encode_text_dummy(df,"dns.qry.name.len")
encode_text_dummy(df,"mqtt.conack.flags")
encode_text_dummy(df,"mqtt.protoname")
encode_text_dummy(df,"mqtt.topic")
Step 6: Creation of the preprocessed dataset df.to_csv('preprocessed_DNN.csv', encoding='utf-8')
For more information about the dataset, please contact the lead author of this project, Dr Mohamed Amine Ferrag, on his email: mohamed.amine.ferrag@gmail.com
More information about Dr. Mohamed Amine Ferrag is available at:
https://www.linkedin.com/in/Mohamed-Amine-Ferrag
https://dblp.uni-trier.de/pid/142/9937.html
https://www.researchgate.net/profile/Mohamed_Amine_Ferrag
https://scholar.google.fr/citations?user=IkPeqxMAAAAJ&hl=fr&oi=ao
https://www.scopus.com/authid/detail.uri?authorId=56115001200
https://publons.com/researcher/1322865/mohamed-amine-ferrag/
https://orcid.org/0000-0002-0632-3172
Last Updated: 27 Mar. 2023
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Distributed Denial of Service (DDoS) attack is a menace to network security that aims at exhausting the target networks with malicious traffic. Although many statistical methods have been designed for DDoS attack detection, designing a real-time detector with low computational overhead is still one of the main concerns. On the other hand, the evaluation of new detection algorithms and techniques heavily relies on the existence of well-designed datasets. In this paper, first, we review the existing datasets comprehensively and propose a new taxonomy for DDoS attacks. Secondly, we generate a new dataset, namely CICDDoS2019, which remedies all current shortcomings. Thirdly, using the generated dataset, we propose a new detection and family classification approach based on a set of network flow features. Finally, we provide the most important feature sets to detect different types of DDoS attacks with their corresponding weights.
The dataset offers an extended set of Distributed Denial of Service attacks, most of which employ some form of amplification through reflection. The dataset shares its feature set with the other CIC NIDS datasets, IDS2017, IDS2018 and DoS2017
original paper link: https://ieeexplore.ieee.org/abstract/document/8888419 kaggle dataset link: https://www.kaggle.com/datasets/dhoogla/cicddos2019
The dataset is from the University of Nevada, Reno Link: https://www.tapadhirdas.com/unr-idd-dataset
The goal of multi-class classification is to differentiate the intrusions not only from normal working conditions but also from each other. Multi-class classification helps us to learn about the root causes of network intrusions. The labels for multi-class classification in UNR-IDD are illustrated in the accompanying table.
Label | Description |
---|---|
Normal | Network Functionality. |
TCP-SYN | TCP-SYN Flood. |
PortScan | Port Scanning. |
Overflow | Flow Table Overflow. |
Blackhole | Blackhole Attack. |
Diversion | Traffic Diversion Attack. |
UNSW-NB15 is a network intrusion dataset. It contains nine different attacks, includes DoS, worms, Backdoors, and Fuzzers. The dataset contains raw network packets. The number of records in the training set is 175,341 records and the testing set is 82,332 records from the different types, attack and normal.
Paper: UNSW-NB15: a comprehensive data set for network intrusion detection systems
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
and contextual features
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Hussain Afzaal 03
Released under CC0: Public Domain
Kitsune Network Attack Dataset This is a collection of nine network attack datasets captured from a either an IP-based commercial surveillance system or a network full of IoT devices. Each dataset contains millions of network packets and diffrent cyber attack within it.
For each attack, you are supplied with:
A preprocessed dataset in csv format (ready for machine learning) The corresponding label vector in csv format The original network capture in pcap format (in case you want to engineer your own features)
We will now describe in detail what's in these datasets and how they were collected.
The Network Attacks We have collected a wide variety of attacks which you would find in a real network intrusion. The following is a list of the cyber attack datasets avalaible:
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F827271%2F79e305668553e521b0709a2413323c45%2Fkaggle_dataset_table.png?generation=1598461684070844&alt=media" alt="image" width="100">
For more details on the attacks themselves, please refer to our NDSS paper (citation below).
The Data Collection The following figure presents the network topologies which we used to collect the data, and the corrisponding attack vectors at which the attacks were performed. The network capture took place at point 1 and point X at the router (where a network intrusion detection system could feasibly be placed). For each dataset, clean network traffic was captured for the first 1 million packets, then the cyber attack was performed.
The Dataset Format Each preprocessed dataset csv has m rows (packets) and 115 columns (features) with no header. The 115 features were extracted using our AfterImage feature extractor, described in our NDSS paper (see below) and available in Python here. In summary, the 115 features provide a statistical snapshot of the network (hosts and behaviors) in the context of the current packet traversing the network. The AfterImage feature extractor is unique in that it can efficiently process millions of streams (network channels) in real-time, incrementally, making it suitable for handling network traffic.
Citation If you use these datasets, please cite:
@inproceedings{mirsky2018kitsune, title={Kitsune: An Ensemble of Autoencoders for Online Network Intrusion Detection}, author={Mirsky, Yisroel and Doitshman, Tomer and Elovici, Yuval and Shabtai, Asaf}, booktitle={The Network and Distributed System Security Symposium (NDSS) 2018}, year={2018} }
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
With the modern day technological advancements and the evolution of Industry 4.0
Source
https://www.kaggle.com/datasets/dhoogla/unswnb15?resource=download
Dataset
This is an academic intrusion detection dataset. All the credit goes to the original authors: dr. Nour Moustafa and dr. Jill Slay. Please cite their original paper and all other appropriate articles listed on the UNSW-NB15 page. The full dataset also offers the pcap, BRO and Argus files along with additional documentation. The modifications to the predesignated train-test sets are minimal… See the full description on the dataset page: https://huggingface.co/datasets/wwydmanski/UNSW-NB15.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset constructed to trigger IDS rules based on the community data set of the Snort Intrusion Detection System
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
IEC 60870-5-104
Intrusion Detection Dataset
Readme File
ITHACA – University of Western Macedonia - https://ithaca.ece.uowm.gr/
Authors: Panagiotis Radoglou-Grammatikis, Thomas Lagkas, Vasileios Argyriou, Panagiotis Sarigiannidis
Publication Date: September 23, 2022
1.Introduction
The evolution of the Industrial Internet of Things (IIoT) introduces several benefits, such as real-time monitoring, pervasive control and self-healing. However, despite the valuable services, security and privacy issues still remain given the presence of legacy and insecure communication protocols like IEC 60870-5-104. IEC 60870-5-104 is an industrial protocol widely applied in critical infrastructures, such as the smart electrical grid and industrial healthcare systems. The IEC 60870-5-104 Intrusion Detection Dataset was implemented in the context of the research paper entitled "Modeling, Detecting, and Mitigating Threats Against Industrial Healthcare Systems: A Combined Software Defined Networking and Reinforcement Learning Approach" [1], in the context of two H2020 projects: ELECTRON: rEsilient and seLf-healed EleCTRical pOwer Nanogrid (101021936) and SDN-microSENSE: SDN - microgrid reSilient Electrical eNergy SystEm (833955). This dataset includes labelled Transmission Control Protocol (TCP)/Internet Protocol (IP) network flow statistics (Common-Separated Values (CSV) format) and IEC 60870-5-104 flow statistics (CSV format) related to twelve IEC 60870-5-104 cyberattacks. In particular, the cyberattacks are related to unauthorised commands and Denial of Service (DoS) activities against IEC 60870-5-104. Moreover, the relevant Packet Capture (PCAP) files are available. The dataset can be utilised for Artificial Intelligence (AI)-based Intrusion Detection Systems (IDS), taking full advantage of Machine Learning (ML) and Deep Learning (DL).
2.Instructions
The IEC 60870-5-104 dataset was implemented following the methodology of A. Gharib et al. in [2], including eleven features: (a) Complete Network Configuration, (b) Complete Traffic, (c) Labelled Dataset, (d) Complete Interaction, (e) Complete Capture, (f) Available Protocols, (g) Attack Diversity, (h) Heterogeneity, (i) Feature Set and (j) Metadata.
A network topology consisting of (a) seven industrial entities, (b) one Human Machine Interfaces (HMI) and (c) three cyberattackers was used to construct the IEC 60870-5-104 Intrusion Detection Dataset. The industrial entities use IEC TestServer[1], while the HMI uses Qtester104[2]. On the other hand, the cyberattackers use Kali Linux[3] equipped with Metasploit[4], OpenMUC j60870[5] and Ettercap[6]. The cyberattacks were performed during the following days.
For each attack, a 7zip file is provided, including the network traffic and the network flow statistics for each entity. Moreover, a relevant diagram is provided, illustrating the corresponding cyberattack. In particular, for each entity, a folder is given, including (a) the relevant pcap file, (b) Transmission Control Protocol (TCP) / Internet Protocol (IP) network flow statistics in a Common Separated Value (CSV) format and (c) IEC 60870-5-104 flow statistics in a CSV format. The TCP/IP network flow statistics were generated by CICFlowMeter[7], while the IEC 60870-5-104 flow statistics were generated based on a Custom IEC 60870-5-104 Python Parser[8], taking full advantage of Scapy[9].
3.Dataset Structure
The dataset consists of the following files:
Each 7zip file includes respective folders related to the entities/devices (described in the following section) participating in each attack. In particular, for each entity/device, there is a folder including (a) the overall network traffic (pcap file) related to this entity/device during each attack, (b) the TCP/IP network flow statistics (CSV file) from CICFlowMeter for the overall network traffic, (c) the IEC 60870-5-104 network traffic (pcap file) related to this entity/device during each attack, (d) the TCP/IP network flow statistics (CSV file) from CICFlowMeter for the IEC 608770-5-104 network traffic, (e) the IEC 60870-5-104 flow statistics (CSV file) from the Custom IEC 60870-5-104 Python Parser for the IEC 608770-5-104 network traffic and finally, (f) an image showing how the attack was executed. Finally, it is noteworthy that the network flow from both CICFlowMeter and Custom IEC 60870-5-104 Python Parser in each CSV file are labelled based on the IEC 60870-5-104 cyberattacks executed for the generation of this dataset. The description of these attacks is given in the following section, while the various features from CICFlowMeter and Custom IEC 60870-5-104 Python Parser are presented in Section 5.
4.Testbed & IEC 60870-5-104 Attacks
The testbed created for generating this dataset is composed of five virtual RTU devices emulated by IEC TestServer and two real RTU devices. Moreover, there is another workstation which plays the role of Master Terminal Unit (MTU) and HMI, sending legitimate IEC 60870-5-104 commands to the corresponding RTUs. For this purpose, the workstation uses QTester104. In addition, there are three attackers that act as malicious insiders executing the following cyberattacks against the aforementioned RTUs. Finally, the network traffic data of each entity/device was captured through tshark.
Table 1: IEC 60870-5-104 Cyberattacks Description
IEC 60870-5-104 Cyberattack Description |
Description |
Dataset Files |
MITM Drop |
During this attack, the cyberattacker is placed between two endpoints, thus monitoring and dropping the network traffic |
This dataset was created by Sabarishwaran G
Message Queuing Telemetry Transport (MQTT) protocol is one of the most used standards used in Internet of Things (IoT) machine to machine communication. The increase in the number of available IoT devices and used protocols reinforce the need for new and robust Intrusion Detection Systems (IDS). However, building IoT IDS requires the availability of datasets to process, train and evaluate these models.
MQTT-IoT-IDS2020 is the first dataset to simulate an MQTT-based network. The dataset is generated using a simulated MQTT network architecture. The network comprises twelve sensors, a broker, a simulated camera, and an attacker. Five scenarios are recorded: (1) normal operation, (2) aggressive scan, (3) UDP scan, (4) Sparta SSH brute-force, and (5) MQTT brute-force attack. The raw pcap files are saved, then features are extracted. Three abstraction levels of features are extracted from the raw pcap files: (a) packet features, (b) Unidirectional flow features and (c) Bidirectional flow features. The csv feature files in the dataset are suited for Machine Learning (ML) usage. Also, the raw pcap files are suitable for the deeper analysis of MQTT IoT networks communication and the associated attacks.
This dataset was created by md Shamshuzzoha
The TII-SSRC-23 dataset offers a comprehensive collection of network traffic patterns, meticulously compiled to support the development and research of Intrusion Detection Systems (IDS). It presents a dual structure: one part provides a tabular representation of extracted features in CSV format, while the other offers raw network traffic data for each type of traffic in PCAP files. This rich dataset captures both benign and malicious network scenarios, serving as an invaluable resource for researchers in the machine learning field.
URL: https://www.kaggle.com/datasets/daniaherzalla/tii-ssrc-23
Darpa is a dataset consisting of communications between source IPs and destination IPs. This dataset contains different attacks between IPs.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The CICIoT2023 dataset is a large-scale, realistic intrusion detection dataset designed to support security analytics and machine learning research in the Internet of Things (IoT) domain. Created by the Canadian Institute for Cybersecurity (CIC), the dataset captures 33 different types of attacks (including DDoS, DoS, Recon, Web-based, Brute Force, Spoofing, and Mirai) executed by malicious IoT devices against other IoT targets.
The testbed consists of 105 real IoT devices of different types and manufacturers, including smart home devices and industrial equipment, configured in a complex network topology to emulate real-world conditions. The dataset includes benign and malicious traffic in various formats and supports feature extraction for both traditional ML and deep learning models.
This dataset aims to address the lack of diversity and scale in previous IoT security datasets, offering a robust benchmark for evaluating intrusion detection systems (IDS) and enabling research in IoT cybersecurity, anomaly detection, and network forensics.
http://www.gnu.org/licenses/lgpl-3.0.htmlhttp://www.gnu.org/licenses/lgpl-3.0.html
The dataset has been introduced by the below-mentioned researches: E. C. P. Neto, S. Dadkhah, R. Ferreira, A. Zohourian, R. Lu, A. A. Ghorbani. "CICIoT2023: A real-time dataset and benchmark for large-scale attacks in IoT environment," Sensor (2023) – (submitted to Journal of Sensors). The present data contains different kinds of IoT intrusions. The categories of the IoT intrusions enlisted in the data are as follows: DDoS Brute Force Spoofing DoS Recon Web-based Mirai
There are several subcategories are present in the data for each kind of intrusion types in the IoT. The dataset contains 1191264 instances of network for intrusions and 47 features of each of the intrusions. The dataset can be used to prepare the predictive model through which different kind of intrusive attacks can be detected. The data is also suitable for designing the IDS system.