Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
It has been found that the dataset has a few major shortcomings. These issues are sufficient to bias the detection engine of any typical IDS.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
DoS
https://choosealicense.com/licenses/other/
Cleaned CICIDS2017 Dataset
This dataset is a cleaned and preprocessed version of the CICIDS2017 dataset created by the Canadian Institute for Cybersecurity, University of New Brunswick.
Modifications
Removed duplicate records
Normalized feature names
Filtered specific attack types
Pivoted the different attack data into a single dataset
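The modifications listed above can be illustrated with a minimal sketch. It is not the dataset author's actual pipeline: the function names, the sample rows and the `keep_labels` set are hypothetical, and the only property borrowed from the original CICIDS2017 CSV files is that some column names carry leading spaces.

```python
def normalize_name(name):
    """Trim whitespace, lowercase, and join words with underscores."""
    return "_".join(name.strip().lower().split())


def clean_records(tables, keep_labels):
    """Merge several per-attack tables into one deduplicated list of rows."""
    seen = set()
    merged = []
    for rows in tables:
        for row in rows:
            clean = {normalize_name(k): v for k, v in row.items()}
            if clean.get("label") not in keep_labels:
                continue  # filter out attack types that are not of interest
            key = tuple(sorted(clean.items()))
            if key in seen:
                continue  # drop exact duplicate records
            seen.add(key)
            merged.append(clean)
    return merged


# Hypothetical sample rows standing in for two per-day CSV files.
monday = [
    {" Destination Port": 80, " Label": "BENIGN"},
    {" Destination Port": 80, " Label": "BENIGN"},
]
wednesday = [
    {" Destination Port": 80, " Label": "DoS Hulk"},
    {" Destination Port": 21, " Label": "FTP-Patator"},
]

cleaned = clean_records([monday, wednesday], keep_labels={"BENIGN", "DoS Hulk"})
print(cleaned)
```

With real data the same steps would normally be done with a dataframe library; plain dictionaries are used here only to keep the sketch dependency-free.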
Source
Original dataset: CICIDS2017
License & Citation
This dataset is provided for research purposes. Please refer… See the full description on the dataset page: https://huggingface.co/datasets/agrawalchaitany/cyberbert_dataset.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
CIC-IDS-V2 is an extended version of the original CIC-IDS 2017 dataset. The dataset is normalised, and one new class called "Comb" is added, which combines synthesised data from multiple non-benign classes.
To cite the dataset, please reference the original paper, published in IEEE SmartNets, with DOI: 10.1109/SmartNets61466.2024.10577645.
Citation info:
Madhubalan, Akshayraj & Gautam, Amit & Tiwary, Priya. (2024). Blender-GAN: Multi-Target Conditional Generative Adversarial Network for Novel Class Synthetic Data Generation. 1-7. 10.1109/SmartNets61466.2024.10577645.
This dataset was made by Abluva Inc, a Palo Alto-based, research-driven data protection firm. Our data protection platform empowers customers to secure data through advanced security mechanisms such as fine-grained access control and sophisticated depersonalization algorithms (e.g. pseudonymization, anonymization and randomization). Abluva's data protection solutions facilitate data democratization within and outside organizations, mitigating concerns related to theft and compliance. The intrusion detection algorithm by Abluva employs patented technologies for a balanced approach that excludes normal access deviations, ensuring intrusion detection without disrupting business operations. Abluva's solution enables organizations to extract further value from their data by enabling secure Knowledge Graphs and deploying Secure Data as a Service, among other novel uses of data. Committed to providing a safe and secure environment, Abluva empowers organizations to unlock the full potential of their data.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Detection results on the CICIDS2017 dataset (K = 10).
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
makekali/CIC-IDS-2017 dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset was created by Mohaned Mohammed Naji
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Model performance results based on the CICIDS2017 dataset.
MIT License https://opensource.org/licenses/MIT
fikrimulyana/CIC-IDS-2017 dataset hosted on Hugging Face and contributed by the HF Datasets community
gyawalishiva/cic-ids-2017-textual dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Information technology has significantly impacted society. IoT and its specialized variant, IoMT, enable remote patient monitoring and improve healthcare. While IoMT contributes to improving healthcare services, it may pose significant security challenges, especially due to the growing interconnectivity of IoMT devices. Hence, a robust IDS is required to handle these issues and prevent future intrusions in a timely manner. This study proposes an IDS model for the IoMT that integrates advanced feature selection techniques and deep learning to enhance detection performance. The proposed model employs Information Gain (IG) and Recursive Feature Elimination (RFE) in parallel to select the top 50% of features, from which intersection and union subsets are created, followed by a deep autoencoder (DAE) to reduce dimensionality without losing important data. Finally, a deep neural network (DNN) classifies traffic as normal or anomalous. The experimental results demonstrate superior performance in terms of accuracy, precision, recall, and F1 score. The model achieves an accuracy of 99.93% on the WUSTL-EHMS-2020 dataset while reducing training time and attains 99.61% accuracy on the CICIDS2017 dataset. The model performance was validated with an average accuracy of 99.82% ± 0.16% and a statistically significant p-value of 0.0001 on the WUSTL-EHMS-2020 dataset, which indicates a stable, statistically significant improvement. This study indicates that the proposed strategy decreases computational complexity and enhances IDS efficiency in resource-constrained IoMT environments.
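The feature selection step described in the abstract (two selectors run in parallel, each keeping the top 50% of features, then combined by intersection and union) can be sketched as below. This is an illustration under assumptions, not the study's code: the score values are made up, `top_half` is a hypothetical helper, and RFE, which normally yields a ranking, is represented here by placeholder scores where higher means more important.

```python
def top_half(scores):
    """Return the names of the top 50% of features by score (rounded up)."""
    k = (len(scores) + 1) // 2
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


# Made-up scores for four hypothetical features.
ig_scores = {"f1": 0.9, "f2": 0.1, "f3": 0.5, "f4": 0.3}
rfe_scores = {"f1": 0.2, "f2": 0.8, "f3": 0.7, "f4": 0.1}

ig_top = top_half(ig_scores)
rfe_top = top_half(rfe_scores)

intersection_subset = ig_top & rfe_top  # features both selectors agree on
union_subset = ig_top | rfe_top         # features chosen by either selector

print(sorted(intersection_subset))
print(sorted(union_subset))
```

The intersection gives a smaller, more conservative subset, while the union keeps anything either selector considers useful; either subset would then be passed on to the autoencoder stage.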
This dataset was created by Sweety
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Packet Capture (PCAP) files of the UNSW-NB15 and CIC-IDS2017 datasets are processed and labelled using the CSV files. Each packet is labelled by comparing eight distinct features: *Source IP, Destination IP, Source Port, Destination Port, Starting time, Ending time, Protocol and Time to live*. The dimensions of the dataset are Nx1504. All columns of the dataset are integers, so you can use this dataset directly in your machine learning models. Details of the whole processing and transformation are provided in the following GitHub repo:
https://github.com/Yasir-ali-farrukh/Payload-Byte
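The labelling idea described above, matching each packet against the flow records from the CSV files, can be sketched roughly as follows. This is not the Payload-Byte implementation: the dictionary keys, the `label_packet` helper, the sample addresses and the `"unmatched"` fallback are all assumptions made for illustration, and the real tool may handle details such as flow direction differently.

```python
# Fields that must be identical between a packet and a flow record (assumed names).
MATCH_KEYS = ("src_ip", "dst_ip", "src_port", "dst_port", "protocol", "ttl")


def label_packet(packet, flows, default="unmatched"):
    """Return the label of the first flow record that matches the packet."""
    for flow in flows:
        same_fields = all(packet[key] == flow[key] for key in MATCH_KEYS)
        in_window = flow["start_time"] <= packet["timestamp"] <= flow["end_time"]
        if same_fields and in_window:
            return flow["label"]
    return default


# One hypothetical labelled flow, as it might be read from a CSV file.
flows = [{
    "src_ip": "192.168.10.5", "dst_ip": "192.168.10.50",
    "src_port": 49152, "dst_port": 80, "protocol": 6, "ttl": 64,
    "start_time": 100.0, "end_time": 160.0, "label": "DoS",
}]

packet = {
    "src_ip": "192.168.10.5", "dst_ip": "192.168.10.50",
    "src_port": 49152, "dst_port": 80, "protocol": 6, "ttl": 64,
    "timestamp": 120.5,
}
late_packet = dict(packet, timestamp=300.0)  # same endpoints, outside the flow window

print(label_packet(packet, flows))
print(label_packet(late_packet, flows))
```

A linear scan over all flows is used only for clarity; with millions of packets an index keyed on the matched fields would be a more practical design.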
You can use the tool available at the above-mentioned GitHub repo to generate the labelled dataset from scratch. All details of the processing and transformation are provided in the following paper:
```bibtex
@article{Payload,
  author = "Yasir Ali Farrukh and Irfan Khan and Syed Wali and David Bierbrauer and Nathaniel Bastian",
  title = "{Payload-Byte: A Tool for Extracting and Labeling Packet Capture Files of Modern Network Intrusion Detection Datasets}",
  year = "2022",
  month = "9",
  url = "https://www.techrxiv.org/articles/preprint/Payload-Byte_A_Tool_for_Extracting_and_Labeling_Packet_Capture_Files_of_Modern_Network_Intrusion_Detection_Datasets/20714221",
  doi = "10.36227/techrxiv.20714221.v1"
}
```
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Selected Features via a proposed approach from CICIDS2017.
bvk/CICIDS-2017-plus dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Comparison with recent methods on the CICIDS2017 dataset.
This dataset was created by Will
This dataset was created by saifullah saif
MIT License https://opensource.org/licenses/MIT
Network Traffic Embeddings Dataset
Model Description
This dataset contains embeddings generated from the CICIDS2017 network traffic dataset using a fine-tuned Meta-Llama-3.1-70B-Instruct model. The embeddings represent network traffic flows formatted in a structured way to capture key network traffic characteristics.
Structure of Embeddings Files
combined.npy
The combined.npy file contains a NumPy array of shape (N, D) where:
N is the total number… See the full description on the dataset page: https://huggingface.co/datasets/Hmehdi515/gtl-hids-embeddings.
This is a traffic dataset that contains balanced amounts of encrypted malicious and legitimate traffic for encrypted malicious traffic detection. The dataset is secondary CSV feature data composed from five public traffic datasets. Our dataset is composed based on three criteria. The first criterion is to combine widely considered public datasets that contain both encrypted malicious and legitimate traffic in existing works, such as the Malware Capture Facility Project dataset and the CICIDS-2017 dataset. The second criterion is to ensure data balance, i.e., a balance of malicious and legitimate network traffic and a similar size of network traffic contributed by each individual dataset. Thus, approximate proportions of malicious and legitimate traffic from each selected public dataset are extracted by random sampling. We also ensured that no selected public dataset contributes a traffic size much larger than the others. The third criterion is that our dataset includes encrypted malicious and legitimate traffic from both conventional devices and IoT devices, as these devices are increasingly being deployed and work in the same environments, such as offices, homes, and other smart city settings.
Based on these criteria, five public datasets are selected. After data pre-processing, details of each selected public dataset and the final composed dataset are shown in the “Dataset Statistic Analysis Document”. The document summarizes the malicious and legitimate traffic size we selected from each public dataset, the proportion of selected traffic from each public dataset with respect to the total traffic size of the composed dataset (% w.r.t. the composed dataset), the proportion of selected encrypted traffic from each public dataset (% from selected public dataset), and the total traffic size of the composed dataset. From the table, we can observe that each public dataset contributes approximately 20% of the composed dataset, except for CICDS-2012 (due to its limited amount of encrypted malicious traffic). This achieves a balance across individual datasets and reduces bias towards traffic belonging to any one dataset during learning. We can also observe that the sizes of malicious and legitimate traffic are almost the same, thus achieving class balance. The datasets now made available were prepared for encrypted malicious traffic detection. Since the dataset is used for machine learning model training, sample train and test sets are also provided. The train and test sets are split at a ratio of 1:4, and stratification is applied during the split. Such datasets can be used directly for machine or deep learning model training based on selected features.