Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Raw network data was collected over a period of 5 days, Monday through Friday, and stored in PCAP files. Monday was used to create most of the Benign data, while the Attack-Network implemented various types of attacks over the next 4 days, such as Brute Force connections (FTP and SSH), several types of DoS attacks, as well as a Botnet attack, Infiltration attacks and subsequent Port-Scanning activity.
The PCAP data was processed using a tool developed by one of the authors of [1], called CICFlowMeter [3]. This tool produces flow traces: sequences of packets between specific source and destination IP, with corresponding values for source and destination ports. TCP flows are usually terminated by connection teardowns, while UDP flows are terminated by a flow timeout. For each of these flow traces many features were selected, measuring flow characteristics, such as packet size, number of packets, flow duration, etc. For some of these variables, statistics such as their mean and standard deviations are provided as features as well. While several features are categorical (such as IP addresses, Port numbers and TCP flag counts), most of the other features are numerical.
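The flow-building idea described above can be illustrated with a minimal sketch. It is not CICFlowMeter's implementation: the packet record layout, the 120-second timeout, and the feature names are assumptions made for illustration, and the sketch closes flows on timeout only (it ignores TCP teardowns and treats each direction as its own flow).

```python
from statistics import mean, pstdev

# Assumed packet record (hypothetical layout):
# (timestamp_s, src_ip, src_port, dst_ip, dst_port, protocol, size_bytes)
FLOW_TIMEOUT_S = 120.0  # assumed value, not taken from CICFlowMeter


def build_flows(packets, timeout=FLOW_TIMEOUT_S):
    """Group packets by 5-tuple; a gap longer than `timeout` starts a new flow."""
    open_flows = {}  # 5-tuple -> [(timestamp, size), ...] of the flow still open
    finished = []    # (5-tuple, packets) pairs of flows closed by the timeout
    for ts, src_ip, src_port, dst_ip, dst_port, proto, size in sorted(packets):
        key = (src_ip, src_port, dst_ip, dst_port, proto)
        current = open_flows.get(key)
        if current is not None and ts - current[-1][0] > timeout:
            finished.append((key, current))
            current = None
        if current is None:
            current = open_flows[key] = []
        current.append((ts, size))
    finished.extend(open_flows.items())  # flows still open at end of capture
    return finished


def flow_features(key, pkts):
    """Compute a few per-flow statistics (feature names are illustrative)."""
    times = [t for t, _ in pkts]
    sizes = [s for _, s in pkts]
    return {
        "src_ip": key[0], "src_port": key[1],
        "dst_ip": key[2], "dst_port": key[3], "protocol": key[4],
        "packet_count": len(pkts),
        "flow_duration": times[-1] - times[0],
        "packet_size_mean": mean(sizes),
        "packet_size_std": pstdev(sizes),
    }


packets = [
    (0.0, "192.168.10.5", 50000, "192.168.10.50", 21, "TCP", 60),
    (0.5, "192.168.10.5", 50000, "192.168.10.50", 21, "TCP", 100),
    (1.0, "192.168.10.5", 50000, "192.168.10.50", 21, "TCP", 140),
    (0.2, "192.168.10.9", 53000, "192.168.10.3", 53, "UDP", 80),
    (300.2, "192.168.10.9", 53000, "192.168.10.3", 53, "UDP", 80),
]
features = [flow_features(key, pkts) for key, pkts in build_flows(packets)]
```

In this toy input the two UDP packets are 300 seconds apart, so they end up in separate flows, while the three TCP packets form a single flow.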
The result is the CICIDS-2017 dataset, with about 80 features and several attack families which can ultimately be divided into 16 categories: one Benign category and 15 Attack categories. This original dataset is available at [4]. Subsequently, the authors of [2] put considerable effort into correcting errors in the dataset, by fixing the CICFlowMeter software (especially regarding TCP flow terminations) and by re-labeling some of the samples accordingly. They posted the corrected dataset on their website [5], which also links to their GitHub site with Python code that can be used to import the data efficiently. I used that as a starting point for my notebook, here on Kaggle.
For each of the 5 days, a CSV file with network flows was produced.
These are the files in the dataset, with some changes: I created decimal values for the IP-addresses, and I removed a couple of rows with inf values.
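The two changes mentioned here can be sketched as follows. This is not the notebook's actual code; the column names and sample rows are assumptions chosen for illustration, and only the Python standard library is used.

```python
import ipaddress
import math


def ip_to_decimal(ip: str) -> int:
    """Convert a dotted-quad IPv4 address to its integer (decimal) value."""
    return int(ipaddress.ip_address(ip))


def has_inf(row: dict) -> bool:
    """True if any float value in the row is +inf or -inf."""
    return any(isinstance(v, float) and math.isinf(v) for v in row.values())


# Illustrative rows; the column names are assumptions, not the dataset schema.
rows = [
    {"Src IP": "192.168.10.5", "Flow Bytes/s": 1250.0},
    {"Src IP": "192.168.10.8", "Flow Bytes/s": float("inf")},
]

# Drop rows containing inf, then replace the IP string by its decimal value.
cleaned = [
    {**row, "Src IP": ip_to_decimal(row["Src IP"])}
    for row in rows
    if not has_inf(row)
]
```

Storing addresses as integers keeps them sortable and lets subnet membership be checked with simple range comparisons.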
In addition, I created 5 more files (_plus for each day), with extra features that translate information regarding traffic flows within the local network, or between the local network and external IP addresses. It should be noted that only two attacks have an external IP address, while for most attacks the local network is facing the gateway.
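One way such a local-versus-external feature could be derived is sketched below. The subnet `192.168.10.0/24`, the function name `traffic_direction`, and the category names are assumptions for illustration; the actual _plus files may encode this information differently.

```python
import ipaddress

# Assumed local subnet; adjust to the capture's real addressing plan.
LOCAL_NET = ipaddress.ip_network("192.168.10.0/24")


def traffic_direction(src_ip: str, dst_ip: str, local_net=LOCAL_NET) -> str:
    """Classify a flow by whether its endpoints lie inside the local network."""
    src_local = ipaddress.ip_address(src_ip) in local_net
    dst_local = ipaddress.ip_address(dst_ip) in local_net
    if src_local and dst_local:
        return "internal"   # both endpoints inside the local network
    if src_local:
        return "outbound"   # local host talking to an external address
    if dst_local:
        return "inbound"    # external address talking to a local host
    return "external"       # neither endpoint is local


examples = [
    traffic_direction("192.168.10.5", "192.168.10.50"),
    traffic_direction("192.168.10.5", "8.8.8.8"),
    traffic_direction("8.8.8.8", "192.168.10.5"),
]
```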
[1] Sharafaldin I., Lashkari A.H., and Ghorbani A.A. Toward generating a new intrusion detection dataset and intrusion traffic characterization, Proceedings of the 4th International Conference on Information Systems Security and Privacy ICISSP - Volume 1, 108-116, 2018.
[2] Engelen G., Rimmer V., and Joosen W. Troubleshooting an intrusion detection dataset: the CICIDS2017 case study, 2021 IEEE Security and Privacy Workshops (SPW), 2021:7-12.
[3] https://www.unb.ca/cic/research/applications.html
[4] https://www.unb.ca/cic/datasets/ids-2017.html
[5] https://intrusion-detection.distrinet-research.be/CNS2022/index.html
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Using NTLFlowLyzer, we generated the “BCCC-CIC-IDS2017” dataset by extracting key flows from the raw network traffic data of CIC-IDS2017, resulting in CSV files that integrate essential network and transport layer features. This new dataset offers a structured approach for analyzing intrusion detection, combining diverse traffic types into multiple sub-categories. The “BCCC-CIC-IDS2017” dataset provides the depth and variety needed to rigorously evaluate our proposed profiling model, advancing research in network security and supporting the development of intrusion detection systems.
The full research paper outlining the details of the dataset and its underlying principles:
"NTLFlowLyzer: Toward Generating an Intrusion Detection Dataset and Intruders Behavior Profiling through Network Layer Traffic Analysis and Pattern Extraction, MohammadMoein Shafi, Arash Habibi Lashkari, Arousha Haghighian Roudsari, Computers & Security, 104160, ISSN 0167-4048 (2024)"
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Intrusion Detection Systems (IDSs) and Intrusion Prevention Systems (IPSs) are the most important defense tools against sophisticated and ever-growing network attacks. Due to the lack of reliable test and validation datasets, anomaly-based intrusion detection approaches suffer from a lack of consistent and accurate performance evaluations.
Our evaluations of the eleven datasets available since 1998 show that most are out of date and unreliable. Some of these datasets lack traffic diversity and volume, some do not cover the variety of known attacks, while others anonymize packet payload data, which cannot reflect current trends. Some also lack a feature set and metadata.
The CICIDS2017 dataset contains benign traffic and the most up-to-date common attacks, and resembles true real-world data (PCAPs). It also includes the results of the network traffic analysis using CICFlowMeter, with flows labeled based on the time stamp, source and destination IPs, source and destination ports, protocols and attack (CSV files). The definitions of the extracted features are also available.
Generating realistic background traffic was our top priority in building this dataset. We used our proposed B-Profile system (Sharafaldin, et al. 2016) to profile the abstract behavior of human interactions and generate naturalistic benign background traffic. For this dataset, we built the abstract behaviour of 25 users based on the HTTP, HTTPS, FTP, SSH, and email protocols.
The data capturing period started at 9 a.m. on Monday, July 3, 2017 and ended at 5 p.m. on Friday, July 7, 2017, for a total of 5 days. Monday is the normal day and only includes benign traffic. The implemented attacks include Brute Force FTP, Brute Force SSH, DoS, Heartbleed, Web Attack, Infiltration, Botnet and DDoS. They were executed in both the morning and the afternoon on Tuesday, Wednesday, Thursday and Friday.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The network intrusion detection system (NIDS) plays a critical role in maintaining network security. However, traditional NIDS relies on a large volume of samples for training, which exhibits insufficient adaptability in rapidly changing network environments and complex attack methods, especially when facing novel and rare attacks. As attack strategies evolve, there is often a lack of sufficient samples to train models, making it difficult for traditional methods to respond quickly and effectively to new threats. Although existing few-shot network intrusion detection systems have begun to address sample scarcity, these systems often fail to effectively capture long-range dependencies within the network environment due to limited observational scope. To overcome these challenges, this paper proposes a novel elevated few-shot network intrusion detection method based on self-attention mechanisms and iterative refinement. This approach leverages the advantages of self-attention to effectively extract key features from network traffic and capture long-range dependencies. Additionally, the introduction of positional encoding ensures the temporal sequence of traffic is preserved during processing, enhancing the model’s ability to capture temporal dynamics. By combining multiple update strategies in meta-learning, the model is initially trained on a general foundation during the training phase, followed by fine-tuning with few-shot data during the testing phase, significantly reducing sample dependency while improving the model’s adaptability and prediction accuracy. Experimental results indicate that this method achieved detection rates of 99.90% and 98.23% on the CICIDS2017 and CICIDS2018 datasets, respectively, using only 10 samples.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Compared with the past, developments in computer and communication technologies have brought extensive and advanced changes. The use of new technologies gives great benefits to individuals, companies, and governments; however, it also causes problems for them, such as the privacy of important information, the security of stored data platforms, and the availability of information. Because of these problems, cyber terrorism is one of the most important issues in today's world. Cyber terror, which has caused many problems for individuals and institutions, has reached a level that could threaten public and national security through various groups, such as criminal organizations, professionals, and cyber activists. Intrusion Detection Systems (IDS) have therefore been developed to prevent cyber attacks. In this study, deep learning and support vector machine (SVM) algorithms were used to detect port scan attempts on the new CICIDS2017 dataset, achieving accuracy rates of 97.80% and 69.79%, respectively.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The rapid evolution of cyber threats poses significant challenges to the adaptability and performance of anomaly detection systems. This study presents an innovative hybrid deep learning framework that integrates Convolutional Neural Networks (CNN), Long Short-Term Memory networks (LSTM), and Transformer models with a novel self-learning mechanism to enhance network traffic anomaly detection. Our key contributions include: (1) a synergistic two-stage model fusion architecture that captures both spatial and temporal traffic patterns; (2) an adaptive learning mechanism with multi-metric drift detection that autonomously responds to evolving threats; and (3) a knowledge preservation strategy that maintains detection capabilities while adapting to new attack patterns. The proposed CNN-LSTM model achieves F1-scores of 0.9778 and 0.9695 on the UNSW-NB15 and CICIDS2017 datasets respectively for binary classification of normal vs. anomalous traffic. The LSTM-Transformer model further classifies specific anomaly types with accuracies of 0.9632 and 0.9528 on these datasets, representing significant improvements over recent methods. Experiments demonstrate the framework’s robustness, maintaining an average accuracy of 0.955 (± 0.005) over a 15-day simulated period with multiple induced concept drifts. The self-learning mechanism, with multi-metric drift detection and an efficient model update strategy, enables the system to detect drifts and recover performance within 23.4 ± 0.20 hours post-drift, while achieving a 92.8% detection rate for zero-day attacks. The proposed framework offers a promising direction for developing efficient and autonomous cybersecurity systems capable of handling dynamic and evolving threat landscapes.
This dataset is obtained from "Error Prevalence in NIDS datasets: A Case Study on CIC-IDS-2017 and CSE-CIC-IDS-2018".
It is improved according to the paper [1].
[1] Liu, Lisa, et al. "Error prevalence in nids datasets: A case study on cic-ids-2017 and cse-cic-ids-2018." 2022 IEEE Conference on Communications and Network Security (CNS). IEEE, 2022.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
License information was derived automatically
The Canadian Institute for Cybersecurity has published several datasets for network intrusion detection. Four of them (CIC-IDS2017, CIC-DoS2017, CSE-CIC-IDS2018 and CIC-DDoS2019) are collated here into one collection, cleaned up and with harmonized labeling.
The intent behind this collection is simple: to have a larger, more varied set of NIDS samples for more powerful analyses by researchers. Too often, researchers still rely on the individual datasets even though the full set is compatible out-of-the-box. The parts have been created for the same purpose and they have been processed with the same feature extraction tool chain.
This collection also takes into account 2 articles in which flawed features were discovered. Those features have been removed from the dataset. See the cleanup notebook for more information.
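As a rough illustration of what label harmonization and feature removal can look like, consider the sketch below. The label spellings in `LABEL_MAP`, the column name `hypothetical_flawed_feature`, and the helper `harmonize` are all assumptions; the cleanup notebook remains the authoritative description of what was actually done.

```python
# Assumed mapping from dataset-specific label spellings to one canonical label.
LABEL_MAP = {
    "benign": "Benign",
    "dos hulk": "DoS Hulk",
    "dos attacks-hulk": "DoS Hulk",
}

# Placeholder for columns reported as flawed; the real names are not listed here.
DROPPED_FEATURES = {"hypothetical_flawed_feature"}


def harmonize(row: dict) -> dict:
    """Drop flawed columns and map the label onto its canonical spelling."""
    out = {k: v for k, v in row.items() if k not in DROPPED_FEATURES}
    raw = str(out["Label"]).strip()
    out["Label"] = LABEL_MAP.get(raw.lower(), raw)  # unknown labels pass through
    return out


rows = [
    {"Flow Duration": 10, "hypothetical_flawed_feature": 1, "Label": "BENIGN"},
    {"Flow Duration": 25, "hypothetical_flawed_feature": 0, "Label": "DoS attacks-Hulk"},
]
harmonized = [harmonize(r) for r in rows]
```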
If you make use of this combined version, please credit the original authors. The relevant publications are cited here on Kaggle alongside the individual datasets and they are also readily available at the CIC's official dataset distribution page.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Labels of normal and attack classes in the CICIDS-2017 dataset.
CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset provides a cleaned and preprocessed version of the original CICIDS2017 network intrusion detection dataset, prepared for machine learning. It includes the following CSV file:
cicids2017_cleaned.csv: Contains the raw, unscaled feature values after cleaning and preprocessing, ready for further treatment (such as scaling and sampling) after train/test split.
Original Dataset:
The CICIDS2017 dataset (available here) is a widely used benchmark dataset in cybersecurity research. It captures network traffic with both benign (normal) activity and various attack scenarios, making it suitable for developing and testing intrusion detection systems. However, the original dataset presents some challenges for direct use in machine learning due to missing values, duplicate entries, inconsistencies, and the need for feature engineering.
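A minimal sketch of how the issues named above (missing values and duplicate entries) could be handled is shown below. It is not the uploader's pipeline: it assumes that infinite values are treated as missing, that rows with any missing value are dropped rather than imputed, and it uses made-up column names.

```python
import math


def is_missing(value) -> bool:
    """Treat None, NaN, and +/-inf as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and not math.isfinite(value))


def clean_rows(rows):
    """Drop rows with missing values, then drop exact duplicate rows."""
    seen = set()
    cleaned = []
    for row in rows:
        if any(is_missing(v) for v in row.values()):
            continue
        fingerprint = tuple(sorted(row.items()))  # order-independent row identity
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(row)
    return cleaned


rows = [
    {"Flow Duration": 10.0, "Flow Bytes/s": 500.0, "Label": "BENIGN"},
    {"Flow Duration": 10.0, "Flow Bytes/s": 500.0, "Label": "BENIGN"},      # duplicate
    {"Flow Duration": 0.0, "Flow Bytes/s": float("inf"), "Label": "BENIGN"},
    {"Flow Duration": 5.0, "Flow Bytes/s": float("nan"), "Label": "PortScan"},
    {"Flow Duration": 7.0, "Flow Bytes/s": 90.0, "Label": "PortScan"},
]
cleaned = clean_rows(rows)
```

Scaling and sampling are deliberately left out, in line with the note that they belong after the train/test split.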
Steps Taken:
NaN and then handled along with other missing values.
Source Code and Project:
Kudos to chethuhn, who, among others, uploaded the original CICIDS2017 to Kaggle.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The RS2024 dataset features logs generated by network load balancers (Layer 4) and application load balancers (Layer 7). It includes identified attack traffic through detailed payload analysis, making it ideal for testing Network Intrusion Detection Systems (NIDS) in real-time enterprise scenarios. The GC2024 dataset comprises Zeek network logs, offering a combination of attack traffic and bot traffic sourced from public repositories. This data was obtained from an AWS Elastic Compute Cloud (EC2) instance, presenting a varied setting for assessing security systems. The network intrusion detection model using the eXtreme Gradient Boosting (XGBoost) algorithm was trained on a combined dataset that includes UNSW-NB15, CICIDS2017, TON_IoT, RS-2024, and GC-2024. The model, saved in joblib format, is ready for deployment and can be used for further analysis.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Results of training under poisoned data.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Attack types distribution in CICIDS2017 and CICIDS2018 datasets.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The dynamic growth of cyber threats in IoT settings requires smart and scalable intrusion detection systems. This paper introduces a Lean-based hybrid intrusion detection framework that uses Particle Swarm Optimization and Genetic Algorithm (PSO-GA) for feature selection and Extreme Learning Machine with Bootstrap Aggregation (ELM-BA) for classification. The proposed framework obtains high detection rates on the CICIDS-2017 dataset, with 100 percent accuracy on important attack categories such as PortScan, SQL Injection, and Brute Force. Statistical verification and visual evaluation metrics are used to validate the model and show that it is interpretable and robust. Because the framework follows Lean principles, it has minimal computational overhead and optimal detection efficiency. It can be deployed in real time in smart cities and industrial Internet of Things (IoT) systems, where it provides scalable and effective cyber threat detection. Adopting it can greatly reduce false positives, decrease decision-making latency, and increase the resilience of critical IoT infrastructure against ever-changing cyber threats.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Comparative performance analysis on CICIDS2017 datasets.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Performance of testing on original dataset with generated dataset.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Results of training on the original dataset.