License: https://creativecommons.org/publicdomain/zero/1.0/
A comprehensive dataset merging all the datasets described at: https://staff.itee.uq.edu.au/marius/NIDS_datasets/#RA5
The newly published dataset demonstrates the benefit of shared dataset feature sets, which make it possible to merge multiple smaller datasets. This will eventually lead to larger and more universal NIDS datasets containing flows from multiple network setups and different attack settings.
An additional label feature identifies the original dataset of each flow. This can be used to compare the same attack scenarios conducted over two or more different test-bed networks. The attack categories have been modified to combine all parent categories.
Attacks named DoS attacks-Hulk, DoS attacks-SlowHTTPTest, DoS attacks-GoldenEye and DoS attacks-Slowloris have been renamed to the parent DoS category. Attacks named DDOS attack-LOIC-UDP, DDOS attack-HOIC and DDoS attacks-LOIC-HTTP have been renamed to DDoS. Attacks named FTP-BruteForce, SSH-Bruteforce, Brute Force -Web and Brute Force -XSS have been combined as a brute-force category. Finally, SQL Injection attacks have been included in the injection attacks category.
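The relabeling described above can be sketched as a simple lookup table. This is only an illustration: the source names are taken from the description, but the exact spelling of the merged category strings ("Brute Force", "Injection") and the helper name `to_parent_category` are assumptions, not values confirmed by the dataset.

```python
# Fine-grained attack names mapped to their merged parent categories.
# Keys follow the description above; the target spellings are assumptions.
PARENT_CATEGORY = {
    "DoS attacks-Hulk": "DoS",
    "DoS attacks-SlowHTTPTest": "DoS",
    "DoS attacks-GoldenEye": "DoS",
    "DoS attacks-Slowloris": "DoS",
    "DDOS attack-LOIC-UDP": "DDoS",
    "DDOS attack-HOIC": "DDoS",
    "DDoS attacks-LOIC-HTTP": "DDoS",
    "FTP-BruteForce": "Brute Force",
    "SSH-Bruteforce": "Brute Force",
    "Brute Force -Web": "Brute Force",
    "Brute Force -XSS": "Brute Force",
    "SQL Injection": "Injection",
}


def to_parent_category(label: str) -> str:
    """Return the merged category, or the label unchanged if it has no parent."""
    cleaned = label.strip()
    return PARENT_CATEGORY.get(cleaned, cleaned)
```

Labels that are not listed (for example a benign label) pass through unchanged, so the function can be applied to every row of a label column.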
The NF-UQ-NIDS dataset has a total of 11,994,893 records, out of which 9,208,048 (76.77%) are benign flows and 2,786,845 (23.23%) are attacks. The table below lists the distribution of the final attack categories.
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset contains the original NetFlow V1 datasets, as published by the authors listed below. I have not made any changes — just uploaded them here to make access easier for others in the community who are working on machine learning-based network intrusion detection systems (NIDS).
If you use these datasets in your work, please cite the original paper: Mohanad Sarhan, Siamak Layeghy, Nour Moustafa and Marius Portmann. NetFlow Datasets for Machine Learning-Based Network Intrusion Detection Systems. In: Big Data Technologies and Applications. BDTA 2020, WiCON 2020. Springer, Cham, 2021.
The collection includes five datasets, converted by the original authors from four different formats into a unified NetFlow format. Each dataset contains 12 basic NetFlow features and is provided in CSV format.
🧠 Credits & Citation
All credit goes to the original authors: Mohanad Sarhan, Siamak Layeghy, Nour Moustafa, and Marius Portmann.
Published in: "NetFlow Datasets for Machine Learning-Based Network Intrusion Detection Systems", Big Data Technologies and Applications (BDTA 2020, WiCON 2020), Springer, Cham, 2021.
Note: If you are one of the original authors and would prefer this dataset to be taken down or credited differently, please feel free to reach out. I just wanted to help make the data more accessible to the Kaggle community.
License: http://guides.library.uq.edu.au/deposit_your_data/terms_and_conditions
NetFlow Version 2 of the datasets is made up of 43 extended NetFlow features. The details of the datasets are published in: Mohanad Sarhan, Siamak Layeghy, and Marius Portmann, Towards a Standard Feature Set for Network Intrusion Detection System Datasets, Mobile Networks and Applications, 103, 108379, 2022. The use of the datasets for academic research purposes is granted in perpetuity after citing the above paper. For commercial purposes, it should be agreed upon by the authors. Please get in touch with the author Mohanad Sarhan for more details.
License: Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
AIT Netflow Data Sets
This repository contains labeled synthetic netflows suitable for evaluation of intrusion detection systems, federated learning, and alert aggregation. The netflows are generated from the packet captures contained in the AIT-LDS-v2.0. A detailed description of that dataset is available in [1]. The packet captures were collected from eight testbeds that were built at the Austrian Institute of Technology (AIT) following the approach by [2]. Please cite these papers if the data is used for academic publications.
In brief, each of the datasets corresponds to a testbed representing a small enterprise network including mail server, file share, WordPress server, VPN, firewall, etc. Normal user behavior is simulated to generate background noise over a time span of 4-6 days. At some point, a sequence of attack steps is launched against the network. The following attacks are launched in the network:
This repository contains the following files:
Acknowledgements: Partially funded by the FFG projects INDICAETING (868306) and DECEPT (873980), and the EU projects GUARD (833456) and PANDORA (SI2.835928).
If you use the dataset, please cite the following publications:
[1] M. Landauer, F. Skopik, M. Frank, W. Hotwagner, M. Wurzenberger, and A. Rauber, "Maintainable Log Datasets for Evaluation of Intrusion Detection Systems," in IEEE Transactions on Dependable and Secure Computing, vol. 20, no. 4, pp. 3466-3482.
[2] M. Landauer, F. Skopik, M. Wurzenberger, W. Hotwagner, and A. Rauber, "Have it Your Way: Generating Customized Log Datasets With a Model-Driven Simulation Testbed," in IEEE Transactions on Reliability, vol. 70, no. 1, pp. 402-415, March 2021, doi: 10.1109/TR.2020.3031317.
License: http://guides.library.uq.edu.au/deposit_your_data/terms_and_conditions
NetFlow Version 1 of the datasets is made up of 8 basic NetFlow features. The details of the datasets are published in: Sarhan M., Layeghy S., Moustafa N., Portmann M. (2021) NetFlow Datasets for Machine Learning-Based Network Intrusion Detection Systems. In: Big Data Technologies and Applications. BDTA 2020, WiCON 2020. Springer, Cham. The use of the datasets for academic research purposes is granted in perpetuity after citing the above paper. For commercial purposes, it should be agreed upon by the authors. Please get in touch with the author Mohanad Sarhan for more details.
License: Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The network data schema is in the NetFlow v9 format. The data is provided in two files, 'train_net.csv' and 'test_net.csv'; train_net.csv indicates when each ALERT occurs. The dataset contains four classes: 'None', 'Port Scanning', 'Denial of Service', and 'Malware'.
SIMARGL Project – Secure Intelligent Methods for Advanced RecoGnition of malware and stegomalware, with the support of the European Commission and the Horizon 2020 Program, under Grant Agreement No. 833042.
Maria-Elena Mihailescu, Darius Mihai, Mihai Carabas, Mikolaj Komisarek, Marek Pawlicki, Witold Holubowicz, Rafal Kozik: The Proposition and Evaluation of the RoEduNet-SIMARGL2021 Network Intrusion Detection Dataset. Sensors 21(13): 4319 (2021)
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
NF-UQ-NIDS is the combined version of the four network intrusion detection system (NIDS) datasets in the NF collection by the University of Queensland. The aim was to standardize network-security datasets to achieve interoperability and to enable larger analyses. With some relabeling (documentation), the authors merged four independent NIDS datasets.
All credit goes to the original authors: Dr. Mohanad Sarhan, Dr. Siamak Layeghy, Dr. Nour Moustafa & Dr. Marius Portmann. Please cite their original conference article when using this dataset.
V1: Base dataset in CSV format as downloaded from here
V2: Cleaning -> parquet files
In the parquet files, all data types are already set correctly; there are 0 records with missing information and 0 duplicate records.
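The two quality claims above (no missing values, no duplicates) are easy to re-check after loading. The sketch below is a minimal, library-free illustration that works on rows already loaded as dictionaries. Reading the parquet files themselves needs a third-party reader, which is outside this sketch, and the function name `quality_report` is hypothetical.

```python
def quality_report(rows: list[dict]) -> dict:
    """Count rows with a missing value and rows that exactly repeat an earlier row."""
    seen = set()
    missing = 0
    duplicates = 0
    for row in rows:
        # Treat None and the empty string as "missing" (an assumption).
        if any(value is None or value == "" for value in row.values()):
            missing += 1
        # Column order should not matter when comparing rows.
        key = tuple(sorted(row.items()))
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return {"rows": len(rows), "missing": missing, "duplicates": duplicates}
```

For the cleaned files, both counts would be expected to come back as zero if the description holds.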
License: Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
| Field Name | Description |
|---|---|
| FLOW_ID | Unique identifier of the flow |
| IPV4_SRC_ADDR | IPv4 source address |
| IPV4_DST_ADDR | IPv4 destination address |
| IN_PKTS | Number of incoming packets |
| IN_BYTES | Number of incoming bytes |
| OUT_PKTS | Number of outgoing packets |
| OUT_BYTES | Number of outgoing bytes |
| FIRST_SWITCHED | Time of first packet in the flow |
| LAST_SWITCHED | Time of last packet in the flow |
| L4_SRC_PORT | Layer 4 source port |
| L4_DST_PORT | Layer 4 destination port |
| TCP_FLAGS | TCP flags |
| PROTOCOL | Protocol |
| PROTOCOL_MAP | Protocol map |
| TOTAL_FLOWS_EXP | Total flows exported |
| L7_PROTO | Layer 7 protocol |
| L7_PROTO_NAME | Layer 7 protocol name |
| ANOMALY_CATEGORY | Class name assigned to the flow |
| ANOMALY | Binary classification label of the flow |
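As an illustration of how the fields in the table might be consumed, the sketch below parses a small CSV sample and derives a few per-flow quantities. The sample rows are hypothetical, and two things are assumed rather than stated by the source: that FIRST_SWITCHED and LAST_SWITCHED are numeric timestamps in seconds, and that ANOMALY is encoded as 0 or 1.

```python
import csv
import io

# Hypothetical two-row sample using a subset of the fields listed above.
SAMPLE = """FLOW_ID,IN_PKTS,IN_BYTES,OUT_PKTS,OUT_BYTES,FIRST_SWITCHED,LAST_SWITCHED,ANOMALY
1,10,1200,8,900,1616660000,1616660004,0
2,3,180,0,0,1616660010,1616660010,1
"""


def derive_features(row: dict) -> dict:
    """Compute duration, byte total, and mean packet size for one flow record."""
    # Assumption: both timestamps are numeric and expressed in seconds.
    duration = float(row["LAST_SWITCHED"]) - float(row["FIRST_SWITCHED"])
    total_bytes = int(row["IN_BYTES"]) + int(row["OUT_BYTES"])
    total_pkts = int(row["IN_PKTS"]) + int(row["OUT_PKTS"])
    return {
        "flow_id": row["FLOW_ID"],
        "duration": duration,
        "total_bytes": total_bytes,
        "bytes_per_packet": total_bytes / total_pkts if total_pkts else 0.0,
        "label": int(row["ANOMALY"]),  # assumption: 0 = normal, 1 = anomalous
    }


rows = [derive_features(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
```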
This work is co-funded under the APPRAISE Project – fAcilitating Public & Private secuRity operAtors to mitigate terrorIsm Scenarios against soft targEts, with the support of the European Commission and the Horizon 2020 Program, under Grant Agreement No. 101021981.
License: Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
These datasets contain SQL injection attacks (SQLIA) as malicious NetFlow data. The attacks carried out are Union query SQL injection and Blind SQL injection. The SQLMAP tool was used to perform the attacks.
NetFlow traffic was generated using DOROTHEA (DOcker-based fRamework fOr gaTHering nEtflow trAffic). NetFlow is a network protocol developed by Cisco for the collection and monitoring of network traffic flow data. A flow is defined as a unidirectional sequence of packets with some common properties that pass through a network device.
Datasets
The first dataset (D1) was collected to train the detection models, and the second (D2) was collected using different attacks than those used in training, to test the models and ensure their generalization.
The datasets contain both benign and malicious traffic. All collected datasets are balanced.
The version of NetFlow used to build the datasets is 5.
| Dataset | Aim | Samples | Benign-malicious traffic ratio |
|---|---|---|---|
| D1 | Training | 400,003 | 50% |
| D2 | Test | 57,239 | 50% |
Infrastructure and implementation
Two sets of flow data were collected with DOROTHEA. DOROTHEA is a Docker-based framework for NetFlow data collection. It allows you to build interconnected virtual networks to generate and collect flow data using the NetFlow protocol. In DOROTHEA, network traffic packets are sent to a NetFlow generator that has a sensor ipt_netflow installed. The sensor consists of a module for the Linux kernel using Iptables, which processes the packets and converts them to NetFlow flows.
DOROTHEA is configured to use NetFlow v5 and to export a flow after it has been inactive for 15 seconds or after it has been active for 1800 seconds (30 minutes).
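The two export conditions can be sketched as a small decision function. This is a simplified illustration of the timeout rule only, not the ipt_netflow implementation. The class and function names are hypothetical, and treating the thresholds as inclusive (`>=`) is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

INACTIVE_TIMEOUT_S = 15    # export after 15 s without packets (from the text)
ACTIVE_TIMEOUT_S = 1800    # export after 30 min of activity (from the text)


@dataclass
class FlowState:
    first_seen: float  # time of the first packet, in seconds
    last_seen: float   # time of the most recent packet, in seconds


def should_export(flow: FlowState, now: float) -> Optional[str]:
    """Return the reason the flow should be exported at time `now`, or None."""
    if now - flow.last_seen >= INACTIVE_TIMEOUT_S:
        return "inactive"
    if now - flow.first_seen >= ACTIVE_TIMEOUT_S:
        return "active"
    return None
```

One practical consequence is that a long-lived connection appears as several flow records, one for each 30-minute slice, which may matter when aggregating per-connection statistics.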
Benign traffic generation nodes simulate network traffic generated by real users, performing tasks such as searching in web browsers, sending emails, or establishing Secure Shell (SSH) connections. Such tasks run as Python scripts. Users may customize them or even incorporate their own. The network traffic is managed by a gateway that performs two main tasks. On the one hand, it routes packets to the Internet. On the other hand, it sends them to a NetFlow data generation node (this process is carried out similarly for packets received from the Internet).
The malicious traffic collected (SQLI attacks) was generated using SQLMAP. SQLMAP is a penetration testing tool used to automate the process of detecting and exploiting SQL injection vulnerabilities.
The attacks were executed on 16 nodes, each launching SQLMAP with the parameters in the following table.
| Parameters | Description |
|---|---|
| '--banner','--current-user','--current-db','--hostname','--is-dba','--users','--passwords','--privileges','--roles','--dbs','--tables','--columns','--schema','--count','--dump','--comments' | Enumerate users, password hashes, privileges, roles, databases, tables and columns |
| --level=5 | Increase the probability of a false positive identification |
| --risk=3 | Increase the probability of extracting data |
| --random-agent | Select the User-Agent randomly |
| --batch | Never ask for user input, use the default behavior |
| --answers="follow=Y" | Predefine the answer to the "follow" prompt as yes |
Every node executed SQLIA on 200 victim nodes. The victim nodes had deployed a web form vulnerable to Union-type injection attacks, which was connected to the MySQL or SQL Server database engines (50% of the victim nodes deployed MySQL and the other 50% deployed SQL Server).
The web service was accessible from ports 443 and 80, which are the ports typically used to deploy web services. The IP address space was 182.168.1.1/24 for the benign and malicious traffic-generating nodes. For victim nodes, the address space was 126.52.30.0/24.
The malicious traffic in the test sets was collected under different conditions. For D1, SQLIA was performed using Union attacks on the MySQL and SQLServer databases.
However, for D2, BlindSQL SQLIAs were performed against the web form connected to a PostgreSQL database. The IP address spaces of the networks were also different from those of D1. In D2, the IP address space was 152.148.48.1/24 for benign and malicious traffic generating nodes and 140.30.20.1/24 for victim nodes.
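Because the address spaces differ between D1 and D2, a flow's endpoint role can be recovered from its IP address. The sketch below uses Python's standard `ipaddress` module with the prefixes exactly as written above. The role names ("generator", "victim", "external") and the helper `role_of` are hypothetical labels introduced for illustration.

```python
import ipaddress

# Address spaces as written in the description. strict=False accepts
# host-style notation such as 182.168.1.1/24 and normalizes it to the network.
NETWORKS = {
    "D1": {
        "generator": ipaddress.ip_network("182.168.1.1/24", strict=False),
        "victim": ipaddress.ip_network("126.52.30.0/24"),
    },
    "D2": {
        "generator": ipaddress.ip_network("152.148.48.1/24", strict=False),
        "victim": ipaddress.ip_network("140.30.20.1/24", strict=False),
    },
}


def role_of(address: str, dataset: str) -> str:
    """Classify an IPv4 address as a traffic generator, a victim, or external."""
    ip = ipaddress.ip_address(address)
    for role, network in NETWORKS[dataset].items():
        if ip in network:
            return role
    return "external"
```

Since the subnets encode the dataset of origin, a model trained with raw IP addresses as features could learn the address plan rather than the attack behavior, which is one reason to consider dropping or anonymizing them.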
To run the MySQL server we ran MariaDB version 10.4.12.
Microsoft SQL Server 2017 Express and PostgreSQL version 13 were used.
License: Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Detailed attributes for Netflow v5 flow records.
License: CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset provides comprehensive network flow analytics, capturing source and destination details, protocols, traffic volumes, flow durations, and autonomous system numbers. It includes traffic classification and anomaly detection flags, making it ideal for network monitoring, security analysis, and performance optimization across enterprise and service provider environments.
License: Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Proposition and Evaluation of the RoEduNet-SIMARGL2021 Network Intrusion Detection Dataset
Cybersecurity is an arms race, with both the security and the adversaries attempting to outsmart one another, coming up with new attacks, new ways to defend against those attacks, and again with new ways to circumvent those defenses. This situation creates a constant need for novel, realistic cybersecurity datasets. This paper introduces the effects of using machine-learning-based intrusion detection methods in network traffic coming from a real-life architecture. The main contribution of this work is a dataset coming from a real-world, academic network. Real-life traffic was collected and, after performing a series of attacks, a dataset was assembled. The network data schema is in the Netflow v9 format and it contains 44 unique features and a label describing each frame.
This dataset is publicly available for use. When using our dataset, please cite our related paper: Maria-Elena Mihailescu, Darius Mihai, Mihai Carabas, Mikolaj Komisarek, Marek Pawlicki, Witold Holubowicz, Rafal Kozik: The Proposition and Evaluation of the RoEduNet-SIMARGL2021 Network Intrusion Detection Dataset. Sensors 21(13): 4319 (2021)
This work is funded under the SIMARGL Project – Secure Intelligent Methods for Advanced RecoGnition of malware and stegomalware, with the support of the European Commission and the Horizon 2020 Program, under Grant Agreement No. 833042.
License: https://creativecommons.org/publicdomain/zero/1.0/
Overview
This dataset comprises 400,000 network traffic entries collected from logistics networks at major airports in the United States, including those in Texas and Washington. The dataset provides a real-world view of network activity, featuring a mix of benign and malicious traffic, making it an invaluable resource for researchers and practitioners in cybersecurity and network analysis. Please note that some data has been eliminated for privacy purposes.
Features
The dataset consists of 26 features, as outlined below:
Time: Timestamp of the network activity, formatted as YYYY-MM-DD HH:MM.
Protocol: Type of protocol used for communication (e.g., TCP, UDP).
Flag: TCP flags indicating the state of the connection (e.g., SYN, ACK).
Family: Classification of the traffic, including normal operations and various attack families (e.g., WannaCry, Phishing).
Clusters: Identifier for clustering similar traffic, useful for analyzing patterns.
Source Address: IP address of the device originating the traffic.
Destination Address: IP address of the destination device within the airport network.
BTC: Bitcoin transaction amounts, if applicable.
USD: USD transaction amounts, if applicable.
Netflow Bytes: Total bytes of data transmitted in the flow.
IP Address: Redundant field for clarity, representing the source IP.
Threat Level: Classification indicating the threat level of the traffic (e.g., Benign, Zero-Day Attack).
Port: Port number used for communication.
Prediction: Model prediction indicating whether the traffic is benign or represents an attack.
Payload Size: Size of the data payload transmitted.
Number of Packets: Count of packets involved in the traffic flow.
Application Layer Data: Information about the application layer requests (e.g., HTTP methods).
User-Agent: Information about the client software making the request.
Geolocation: Airport-related geolocation, indicating the specific airport involved (e.g., DFW, SEA).
Logistics ID: Unique identifier for logistics items (e.g., shipment ID).
Anomaly Score: Score indicating the likelihood of the traffic being anomalous or malicious.
Event Description: Descriptive label for the event, detailing the nature of the traffic.
Response Time: Time taken for the server to respond to the request.
Session ID: Unique identifier for the network session.
Data Transfer Rate: Rate of data transfer, measured in Mbps.
Error Code: HTTP or application-level error codes returned (if applicable).
Dataset Characteristics
Total Entries: 400,000
Class Distribution: 62% benign traffic and 38% representing zero-day attacks and other threats.
Geographical Focus: Traffic data includes activities at major airports, such as Dallas/Fort Worth International Airport (DFW) and Seattle-Tacoma International Airport (SEA).
Use Cases
This dataset can be utilized for:
Research: Investigating zero-day attack detection techniques.
Machine Learning: Training models to classify benign and malicious network traffic.
Network Security: Enhancing security measures in logistics networks at airports.
Conclusion
The "Zero-Day Attack Detection in Airport Logistics Networks" dataset provides a realistic and comprehensive view of network behavior within airport logistics, offering critical insights for developing effective cybersecurity strategies against zero-day threats.
License: Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
VHS-22 is a heterogeneous, flow-level dataset which combines the ISOT, CICIDS-17, Booters and CTU-13 datasets, as well as traffic from the Malware Traffic Analysis (MTA) site, to increase the variety of malicious and legitimate traffic flows. It contains 27.7 million flows (20.3 million legitimate flows and 7.4 million attack flows). The flows are represented in the form of 45 features; apart from classical NetFlow features, VHS-22 contains statistical parameters and network-level features. Their detailed description and the results of initial detection experiments are presented in the paper:
Paweł Szumełda, Natan Orzechowski, Mariusz Rawski, and Artur Janicki. 2022. VHS-22 – A Very Heterogeneous Set of Network Traffic Data for Threat Detection. In Proc. European Interdisciplinary Cybersecurity Conference (EICC 2022), June 15–16, 2022, Barcelona, Spain. ACM, New York, NY, USA, https://doi.org/10.1145/3528580.3532843
Every day contains different attacks mixed with legitimate traffic.
01-01-2022 Botnet attacks from ISOT dataset.
02-01-2022 Various attacks from MTA dataset.
03-01-2022 Web attacks from CICIDS-17 dataset.
04-01-2022 Bruteforce attacks from CICIDS-17 dataset.
05-01-2022 Botnet attacks from CICIDS-17 dataset.
06-01-2022 DDoS attacks from CICIDS-17 dataset.
07-01-2022 to 11-01-2022 DDoS attacks from Booters dataset.
12-01-2022 to 23-01-2022 Botnet traffic from CTU-13 dataset.
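The day-by-day schedule above can be turned into a lookup that maps a flow's date to its attack type and source dataset, for example to build leave-one-source-out splits. This is a sketch under the assumption that dates are written day-month-year (the range runs to the 23rd, so the first field cannot be a month). The attack-type strings and the function name `attack_source` are illustrative.

```python
from datetime import date, datetime

# (first day, last day, attack type, source dataset), from the schedule above.
SCHEDULE = [
    (date(2022, 1, 1), date(2022, 1, 1), "Botnet", "ISOT"),
    (date(2022, 1, 2), date(2022, 1, 2), "Various", "MTA"),
    (date(2022, 1, 3), date(2022, 1, 3), "Web", "CICIDS-17"),
    (date(2022, 1, 4), date(2022, 1, 4), "Bruteforce", "CICIDS-17"),
    (date(2022, 1, 5), date(2022, 1, 5), "Botnet", "CICIDS-17"),
    (date(2022, 1, 6), date(2022, 1, 6), "DDoS", "CICIDS-17"),
    (date(2022, 1, 7), date(2022, 1, 11), "DDoS", "Booters"),
    (date(2022, 1, 12), date(2022, 1, 23), "Botnet", "CTU-13"),
]


def attack_source(day: str):
    """Map a DD-MM-YYYY date string to (attack type, source dataset), or None."""
    parsed = datetime.strptime(day, "%d-%m-%Y").date()
    for start, end, attack, source in SCHEDULE:
        if start <= parsed <= end:
            return attack, source
    return None
```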
The VHS-22 dataset consists of labeled network flows and all data is publicly available for researchers in .csv format. When using VHS-22, please cite our paper which describes the VHS-22 dataset in detail, as well as the publications describing the source datasets:
Paweł Szumełda, Natan Orzechowski, Mariusz Rawski, and Artur Janicki. 2022. VHS-22 – A Very Heterogeneous Set of Network Traffic Data for Threat Detection. In Proc. European Interdisciplinary Cybersecurity Conference (EICC 2022), June 15–16, 2022, Barcelona, Spain. ACM, New York, NY, USA, https://doi.org/10.1145/3528580.3532843
Sherif Saad, Issa Traore, Ali Ghorbani, Bassam Sayed, David Zhao, Wei Lu, John Felix, and Payman Hakimian. 2011. Detecting P2P botnets through network behavior analysis and machine learning. In Proc. International Conference on Privacy, Security and Trust. IEEE, Montreal, Canada, 174–1
Iman Sharafaldin, Arash Habibi Lashkari, and Ali A. Ghorbani. 2018. Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization, In Proc. 4th International Conference on Information Systems Security and Privacy (ICISSP 2018), Funchal, Portugal
José Jair Santanna, Romain Durban, Anna Sperotto, and Aiko Pras. 2015. Inside booters: An analysis on operational databases. In Proc. International Symposium on Integrated Network Management (INM 2015). IFIP/IEEE, Ottawa, Canada, 432–440. https://doi.org/10.1109/INM.2015.71403
Riaz Khan, Xiaosong Zhang, Rajesh Kumar, Abubakar Sharif, Noorbakhsh Amiri Golilarz, and Mamoun Alazab. 2019. An Adaptive Multi-Layer Botnet Detection Technique Using Machine Learning Classifiers. Applied Sciences 9 (06 2019), 2375. https://doi.org/10.3390/app91123
The Malware Traffic Analysis data originate from https://www.malware-traffic-analysis.net, authored by Brad.
The work has been funded by the SIMARGL Project -- Secure Intelligent Methods for Advanced RecoGnition of malware and stegomalware, with the support of the European Commission and the Horizon 2020 Program, under Grant Agreement No. 833042.