License: MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This Cybersecurity Intrusion Detection Dataset is designed for detecting cyber intrusions based on network traffic and user behavior. Below, I’ll explain each aspect in detail, including the dataset structure, feature importance, possible analysis approaches, and how it can be used for machine learning.
The dataset consists of network-based and user behavior-based features. Each feature provides valuable information about potential cyber threats.
These features describe network-level information such as packet size, protocol type, and encryption methods.
network_packet_size (Packet Size in Bytes)
protocol_type (Communication Protocol)
encryption_used (Encryption Protocol)
These features track user activities, such as login attempts and session duration.
login_attempts (Number of Logins)
session_duration (Session Length in Seconds)
failed_logins (Failed Login Attempts)
unusual_time_access (Login Time Anomaly): a binary flag (0 or 1) indicating whether access happened at an unusual time.
ip_reputation_score (Trustworthiness of IP Address)
browser_type (User’s Browser)
Target variable (attack_detected): 1 means an attack was detected, 0 means normal activity.
This dataset can be used for intrusion detection systems (IDS) and cybersecurity research. Some key applications include:
Supervised Learning Approaches (with attack_detected as the target).
Deep Learning Approaches
If attack labels are missing, anomaly detection can be used:
- Autoencoders: Learn normal traffic and flag anomalies.
- Isolation Forest: Detects outliers based on feature isolation.
- One-Class SVM: Learns normal behavior and detects deviations.
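The idea shared by the label-free methods above (learn what normal looks like, then flag deviations) can be illustrated with a much simpler stand-in. The sketch below uses a z-score threshold on a single feature, not an autoencoder, Isolation Forest, or One-Class SVM. The sample values and the threshold of 3.0 are assumptions made for illustration.

```python
from statistics import mean, stdev

def fit_baseline(values):
    """Estimate mean and standard deviation from presumed-normal values."""
    return mean(values), stdev(values)

def is_anomalous(value, baseline, threshold=3.0):
    """Flag a value whose z-score magnitude exceeds the threshold."""
    mu, sigma = baseline
    if sigma == 0:
        # With no spread in the baseline, any different value is a deviation.
        return value != mu
    return abs(value - mu) / sigma > threshold

# Hypothetical failed_logins counts from sessions assumed to be normal.
normal_failed_logins = [0, 1, 0, 2, 1, 0, 1, 1, 0, 2]
baseline = fit_baseline(normal_failed_logins)

# Score three new sessions; only the last one is far from the baseline.
flags = [is_anomalous(v, baseline) for v in [1, 2, 9]]
```

A single-feature threshold ignores interactions between features, which is where the multivariate methods listed above would be expected to do better.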
License: https://creativecommons.org/publicdomain/zero/1.0/
The Global Cybersecurity Threats Dataset (2015-2024) provides extensive data on cyberattacks, malware types, targeted industries, and affected countries. It is designed for threat intelligence analysis, cybersecurity trend forecasting, and machine learning model development to enhance global digital security.
| Column Name | Description |
|---|---|
| Country | Country where the attack occurred |
| Year | Year of the incident |
| Threat Type | Type of cybersecurity threat (e.g., Malware, DDoS) |
| Attack Vector | Method of attack (e.g., Phishing, SQL Injection) |
| Affected Industry | Industry targeted (e.g., Finance, Healthcare) |
| Data Breached (GB) | Volume of data compromised |
| Financial Impact ($M) | Estimated financial loss in millions |
| Severity Level | Low, Medium, High, Critical |
| Response Time (Hours) | Time taken to mitigate the attack |
| Mitigation Strategy | Countermeasures taken |
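For trend analysis, the columns above can be aggregated directly. The following is a minimal sketch that sums financial impact per year. The three sample rows are hypothetical and only reuse the column names from the table; they are not values from the dataset.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows using column names from the table above.
SAMPLE_CSV = """Country,Year,Threat Type,Financial Impact ($M)
Germany,2019,Malware,12.5
Brazil,2019,DDoS,7.5
Japan,2020,Malware,30.0
"""

def impact_by_year(csv_text):
    """Sum the 'Financial Impact ($M)' column for each year."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[int(row["Year"])] += float(row["Financial Impact ($M)"])
    return dict(totals)

totals = impact_by_year(SAMPLE_CSV)
```

The same pattern works for grouping by Threat Type or Affected Industry by changing the key used for `totals`.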
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
We present a novel, large-scale multiclass dataset to improve the recognition of suspicious traffic in networks. The newly generated dataset contains up-to-date samples and features and is publicly available to help reduce the effect of upcoming cyberattacks with machine learning methods. Specifically, 6 million traffic samples with 60 features are collected and organized into two balanced classes: 50% normal traffic and 50% anomaly (attack) traffic. The anomaly traffic is composed of 15 different attacks: MITM-ARP-SPOOFING, SSH-BRUTE FORCE, FTP-BRUTE FORCE, DDOS-ICMP, DDOS-RAWIP, DDOS-UDP, DOS, EXPLOITING-FTP, FUZZING, ICMP FLOOD, SYN-FLOOD, PORT SCANNING, REMOTE CODE EXECUTION, SQL INJECTION, and XSS.
For details, please refer to and cite our article: Q. Abu Al-Haija, Z. Masoud, A. Yasin, K. Alesawi, Y. Alkarnawi, "Revolutionizing Threat Hunting in Communication Networks: Introducing a Cutting-Edge Large-Scale Multiclass Dataset", 15th International Conference on Information and Communication Systems (ICICS 2024), IEEE, Aug. 2024.
License: https://creativecommons.org/publicdomain/zero/1.0/
The CyberFedDefender dataset is a simulated dataset designed for developing and testing federated learning-based cyber threat detection models. This dataset is tailored for research and experimentation in distributed anomaly detection and privacy-preserving cybersecurity frameworks. It includes traffic features commonly used in intrusion detection systems (IDS) with a focus on cloud and edge computing environments. Each record represents network traffic metadata, with labeled instances of both normal and malicious activities, making it ideal for machine learning applications in cybersecurity.
Dataset Features
The dataset consists of 1,430 instances with 23 features, including information on packet size, duration, bytes sent/received, flow statistics, and attack labels. It covers common cyberattacks such as DDoS, Brute Force, and Ransomware, along with normal network traffic.
Feature List:
- Timestamp: The time when the network traffic was recorded.
- Source_IP: The IP address of the source machine.
- Destination_IP: The IP address of the destination machine.
- Protocol: The network protocol used (TCP, UDP, ICMP).
- Packet_Length: The length of the packet in bytes.
- Duration: The duration of the connection in seconds.
- Source_Port: The port number used by the source.
- Destination_Port: The port number used by the destination.
- Bytes_Sent: Total bytes sent from the source to the destination.
- Bytes_Received: Total bytes received by the destination from the source.
- Flags: TCP flags indicating the connection's state (e.g., SYN, ACK).
- Flow_Packets/s: Number of packets per second in the traffic flow.
- Flow_Bytes/s: Number of bytes per second in the traffic flow.
- Avg_Packet_Size: Average size of the packets during the connection.
- Total_Fwd_Packets: Total number of forward packets.
- Total_Bwd_Packets: Total number of backward packets.
- Fwd_Header_Length: Length of the forward packet headers.
- Bwd_Header_Length: Length of the backward packet headers.
- Sub_Flow_Fwd_Bytes: Bytes sent in the forward subflow.
- Sub_Flow_Bwd_Bytes: Bytes received in the backward subflow.
- Inbound: Indicates whether the traffic is inbound (1) or outbound (0).
- Attack_Type: Type of cyberattack or normal traffic (e.g., DDoS, Brute Force, Ransomware, Normal).
- Label: Binary classification label where 1 indicates malicious traffic and 0 represents normal traffic.
Usage
This dataset is designed for research in the following areas:
- Federated learning for cyber threat detection
- Privacy-preserving machine learning in cybersecurity
- Intrusion detection systems (IDS)
- Distributed anomaly detection in cloud and edge environments
Researchers can leverage this dataset to build and evaluate models for anomaly detection, perform comparative analysis, or enhance the robustness of federated learning frameworks in cybersecurity applications.
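The description does not say which federated algorithm the dataset targets. As one common choice, the sketch below shows the aggregation step of federated averaging (FedAvg), where a server combines client model weights in proportion to each client's local sample count. The weight vectors and client sizes are hypothetical.

```python
def federated_average(client_weights, client_sizes):
    """Average client weight vectors, weighted by local sample counts."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]

# Two hypothetical edge clients that trained the same two-parameter model
# on 100 and 300 local records respectively.
client_weights = [[0.0, 1.0], [1.0, 3.0]]
client_sizes = [100, 300]

# The larger client pulls the global model three times as strongly.
global_weights = federated_average(client_weights, client_sizes)
```

Only model parameters are exchanged in this step; the raw traffic records stay on each client, which is the privacy-preserving property the dataset description refers to.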
License: MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This dataset contains synthetic HTTP log data designed for cybersecurity analysis, particularly for anomaly detection tasks.
Dataset Features
- Timestamp: Simulated time for each log entry.
- IP_Address: Randomized IP addresses to simulate network traffic.
- Request_Type: Common HTTP methods (GET, POST, PUT, DELETE).
- Status_Code: HTTP response status codes (e.g., 200, 404, 403, 500).
- Anomaly_Flag: Binary flag indicating anomalies (1 = anomaly, 0 = normal).
- User_Agent: Simulated user agents for device and browser identification.
- Session_ID: Random session IDs to simulate user activity.
- Location: Geographic locations of requests.
Applications
This dataset can be used for:
- Anomaly Detection: Identify suspicious network activity or attacks.
- Machine Learning: Train models for classification tasks (e.g., detect anomalies).
- Cybersecurity Analysis: Analyze HTTP traffic patterns and identify threats.
Example Challenge
Build a machine learning model to predict the Anomaly_Flag based on the features provided.
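Before training a model for the challenge, it can help to check how strongly a single feature relates to Anomaly_Flag. The sketch below computes the anomaly rate per Status_Code as a simple baseline. The log entries are hypothetical and only follow the field names listed above.

```python
from collections import Counter

# Hypothetical log entries: (Request_Type, Status_Code, Anomaly_Flag)
entries = [
    ("GET", 200, 0), ("GET", 200, 0), ("POST", 403, 1),
    ("GET", 404, 0), ("DELETE", 403, 1), ("PUT", 500, 0),
    ("GET", 403, 0),
]

def anomaly_rate_by_status(rows):
    """Return the fraction of flagged entries for each status code."""
    totals, anomalies = Counter(), Counter()
    for _method, status, flag in rows:
        totals[status] += 1
        anomalies[status] += flag
    return {status: anomalies[status] / totals[status] for status in totals}

rates = anomaly_rate_by_status(entries)
```

Because the data is synthetic, a feature may turn out to carry no signal at all; a per-feature rate table like this makes that visible early.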
License: https://choosealicense.com/licenses/bsd-3-clause/
cyber-security-events
Dataset Description
This dataset contains cybersecurity events collected from honeypot infrastructure. The data has been processed and feature-engineered for machine learning applications in threat detection and security analytics.
Feature Categories
Network Features
- Connection flow statistics (bytes, packets, duration)
- Protocol-specific metrics
- Geographic information
- IP reputation data
Behavioral Features
Session… See the full description on the dataset page: https://huggingface.co/datasets/pyToshka/cyber-security-events.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset is part of the research work titled "A Dataset to Train Intrusion Detection Systems based on Machine Learning Models for Electrical Substations," which is currently awaiting approval for publication. The dataset has been meticulously curated to support the development and evaluation of machine learning models tailored for detecting cyber intrusions in the context of electrical substations. It is intended to facilitate research and advancements in cybersecurity for critical infrastructure, specifically focusing on real-world scenarios within electrical substation environments. We encourage its use for experimentation and benchmarking in related areas of study.
The following sections list the content of the dataset generated.
The outcomes of different test executions are available as follows:
Each test consists of the model results in Python pickle format (with a .pkl extension) and a detailed description of the execution conditions in an output log file (with a .log extension).
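A minimal sketch for reading one test's outputs follows. The structure of the pickled objects is not described here, so the loader returns whatever object was stored; if the results contain trained models, the library that produced them presumably needs to be installed for unpickling to succeed. The function names are hypothetical.

```python
import pickle
from pathlib import Path

def load_test_result(pkl_path):
    """Load one pickled result object. Only unpickle files you trust."""
    with Path(pkl_path).open("rb") as fh:
        return pickle.load(fh)

def read_execution_log(log_path):
    """Return the non-empty lines of the accompanying .log file."""
    text = Path(log_path).read_text(encoding="utf-8", errors="replace")
    return [line for line in text.splitlines() if line.strip()]
```

Unpickling can execute arbitrary code, so it is worth verifying that the files come from the original repository before loading them.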
A snapshot of the source code used to process these files is included under the filename source-code-cybersecurity-datasets-v2.0.zip. For an updated version, please visit the GitHub repository.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The StealthPhisher Phishing Attack Dataset, generated at the Cybersecurity Lab, GLA University, Mathura, is a large, diverse, and recent phishing attack dataset developed to address the evolving nature of phishing attacks. It comprises 336,749 records, including 160,943 legitimate URLs and 175,806 phishing URLs, collected from reliable sources such as PhishTank. Reflecting recent phishing tactics, this dataset serves as a valuable resource for training and evaluating AI-based phishing detection systems.
Key features include URL-based attributes (e.g., length, TLD type, IP presence), statistical metrics (e.g., Shannon Entropy, Kolmogorov Complexity, Fractal Dimension), and HTML/interaction-based features (e.g., popups, redirects, forms). These multidimensional attributes provide comprehensive insights into phishing behavior, enabling accurate and robust threat detection. Designed to capture real-world scenarios, the dataset equips AI models to recognize both traditional and emerging phishing strategies effectively.
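To make one of the statistical metrics concrete, the sketch below computes the Shannon entropy of a URL's character distribution, H = -Σ p(c) log2 p(c), alongside its length. How the dataset authors computed their entropy feature (for example, over the full URL or only the domain) is not stated here, so treat this as an illustration of the metric rather than a reproduction of their feature.

```python
import math
from collections import Counter

def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def url_features(url):
    """Return two simple URL-based attributes: length and character entropy."""
    return {"length": len(url), "entropy": shannon_entropy(url)}

# Hypothetical URL used only to show the call.
features = url_features("http://example.com/login")
```

Strings with many distinct, evenly used characters (such as random-looking subdomains) score higher than repetitive ones, which is why entropy is often used as a phishing signal.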
This dataset was generated as part of the research work presented in the article “StealthPhisher: A Defensive Framework against Phishing Attack using Hybrid Deep Learning and GenAI,” published in Expert Systems with Applications (https://doi.org/10.1016/j.eswa.2025.130205). Researchers using this dataset in their research work are kindly requested to cite this article.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Network traffic datasets created by Single Flow Time Series Analysis
Datasets were created for the paper: Network Traffic Classification based on Single Flow Time Series Analysis -- Josef Koumar, Karel Hynek, Tomáš Čejka -- which was published at The 19th International Conference on Network and Service Management (CNSM) 2023. Please cite usage of our datasets as:
J. Koumar, K. Hynek and T. Čejka, "Network Traffic Classification Based on Single Flow Time Series Analysis," 2023 19th International Conference on Network and Service Management (CNSM), Niagara Falls, ON, Canada, 2023, pp. 1-7, doi: 10.23919/CNSM59352.2023.10327876.
This Zenodo repository contains 23 datasets created from 15 well-known published datasets which are cited in the table below. Each dataset contains 69 features created by Time Series Analysis of Single Flow Time Series. The detailed description of features from datasets is in the file: feature_description.pdf
The following table describes each dataset file:
| File name | Detection problem | Citation of original raw dataset |
|---|---|---|
| botnet_binary.csv | Binary detection of botnet | S. García et al. An Empirical Comparison of Botnet Detection Methods. Computers & Security, 45:100–123, 2014. |
| botnet_multiclass.csv | Multi-class classification of botnet | S. García et al. An Empirical Comparison of Botnet Detection Methods. Computers & Security, 45:100–123, 2014. |
| cryptomining_design.csv | Binary detection of cryptomining; the design part | Richard Plný et al. Datasets of Cryptomining Communication. Zenodo, October 2022 |
| cryptomining_evaluation.csv | Binary detection of cryptomining; the evaluation part | Richard Plný et al. Datasets of Cryptomining Communication. Zenodo, October 2022 |
| dns_malware.csv | Binary detection of malware DNS | Samaneh Mahdavifar et al. Classifying Malicious Domains using DNS Traffic Analysis. In DASC/PiCom/CBDCom/CyberSciTech 2021, pages 60–67. IEEE, 2021. |
| doh_cic.csv | Binary detection of DoH | Mohammadreza MontazeriShatoori et al. Detection of doh tunnels using time-series classification of encrypted traffic. In DASC/PiCom/CBDCom/CyberSciTech 2020, pages 63–70. IEEE, 2020 |
| doh_real_world.csv | Binary detection of DoH | Kamil Jeřábek et al. Collection of datasets with DNS over HTTPS traffic. Data in Brief, 42:108310, 2022 |
| dos.csv | Binary detection of DoS | Nickolaos Koroniotis et al. Towards the development of realistic botnet dataset in the Internet of Things for network forensic analytics: Bot-IoT dataset. Future Gener. Comput. Syst., 100:779–796, 2019. |
| edge_iiot_binary.csv | Binary detection of IoT malware | Mohamed Amine Ferrag et al. Edge-iiotset: A new comprehensive realistic cyber security dataset of iot and iiot applications: Centralized and federated learning, 2022. |
| edge_iiot_multiclass.csv | Multi-class classification of IoT malware | Mohamed Amine Ferrag et al. Edge-iiotset: A new comprehensive realistic cyber security dataset of iot and iiot applications: Centralized and federated learning, 2022. |
| https_brute_force.csv | Binary detection of HTTPS Brute Force | Jan Luxemburk et al. HTTPS Brute-force dataset with extended network flows, November 2020 |
| ids_cic_binary.csv | Binary detection of intrusion in IDS | Iman Sharafaldin et al. Toward generating a new intrusion detection dataset and intrusion traffic characterization. ICISSP, 1:108–116, 2018. |
| ids_cic_multiclass.csv | Multi-class classification of intrusion in IDS | Iman Sharafaldin et al. Toward generating a new intrusion detection dataset and intrusion traffic characterization. ICISSP, 1:108–116, 2018. |
| ids_unsw_nb_15_binary.csv | Binary detection of intrusion in IDS | Nour Moustafa and Jill Slay. Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In 2015 military communications and information systems conference (MilCIS), pages 1–6. IEEE, 2015. |
| ids_unsw_nb_15_multiclass.csv | Multi-class classification of intrusion in IDS | Nour Moustafa and Jill Slay. Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In 2015 military communications and information systems conference (MilCIS), pages 1–6. IEEE, 2015. |
| iot_23.csv | Binary detection of IoT malware | Sebastian Garcia et al. IoT-23: A labeled dataset with malicious and benign IoT network traffic, January 2020. More details here: https://www.stratosphereips.org/datasets-iot23 |
| ton_iot_binary.csv | Binary detection of IoT malware | Nour Moustafa. A new distributed architecture for evaluating ai-based security systems at the edge: Network ton iot datasets. Sustainable Cities and Society, 72:102994, 2021 |
| ton_iot_multiclass.csv | Multi-class classification of IoT malware | Nour Moustafa. A new distributed architecture for evaluating ai-based security systems at the edge: Network ton iot datasets. Sustainable Cities and Society, 72:102994, 2021 |
| tor_binary.csv | Binary detection of TOR | Arash Habibi Lashkari et al. Characterization of Tor Traffic using Time based Features. In ICISSP 2017, pages 253–262. SciTePress, 2017. |
| tor_multiclass.csv | Multi-class classification of TOR | Arash Habibi Lashkari et al. Characterization of Tor Traffic using Time based Features. In ICISSP 2017, pages 253–262. SciTePress, 2017. |
| vpn_iscx_binary.csv | Binary detection of VPN | Gerard Draper-Gil et al. Characterization of Encrypted and VPN Traffic Using Time-related. In ICISSP, pages 407–414, 2016. |
| vpn_iscx_multiclass.csv | Multi-class classification of VPN | Gerard Draper-Gil et al. Characterization of Encrypted and VPN Traffic Using Time-related. In ICISSP, pages 407–414, 2016. |
| vpn_vnat_binary.csv | Binary detection of VPN | Steven Jorgensen et al. Extensible Machine Learning for Encrypted Network Traffic Application Labeling via Uncertainty Quantification. CoRR, abs/2205.05628, 2022 |
| vpn_vnat_multiclass.csv | Multi-class classification of VPN | Steven Jorgensen et al. Extensible Machine Learning for Encrypted Network Traffic Application Labeling via Uncertainty Quantification. CoRR, abs/2205.05628, 2022 |
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This traffic dataset contains balanced amounts of encrypted malicious and legitimate traffic for encrypted malicious traffic detection. It is a secondary CSV feature dataset composed from five public traffic datasets according to three criteria. The first criterion is to combine widely used public datasets that contain both encrypted malicious and legitimate traffic in existing works, such as the Malware Capture Facility Project dataset and the CICIDS-2017 dataset. The second criterion is to ensure data balance, i.e., a balance of malicious and legitimate network traffic and a similar amount of traffic contributed by each individual dataset. To this end, roughly equal proportions of malicious and legitimate traffic are extracted from each selected public dataset by random sampling, and no selected public dataset contributes much more traffic than the others. The third criterion is that the dataset includes encrypted malicious and legitimate traffic from both conventional devices and IoT devices, as these devices are increasingly deployed and work in the same environments, such as offices, homes, and other smart city settings.
Based on the criteria, 5 public datasets are selected. After data pre-processing, details of each selected public dataset and the final composed dataset are shown in “Dataset Statistic Analysis Document”. The document summarized the malicious and legitimate traffic size we selected from each selected public dataset, proportions of selected traffic size from each selected public dataset with respect to the total traffic size of the composed dataset (% w.r.t the composed dataset), proportions of selected encrypted traffic size from each selected public dataset (% of selected public dataset), and total traffic size of the composed dataset. From the table, we are able to observe that each public dataset equally contributes to approximately 20% of the composed dataset, except for CICDS-2012 (due to its limited number of encrypted malicious traffic). This achieves a balance across individual datasets and reduces bias towards traffic belonging to any dataset during learning. We can also observe that the size of malicious and legitimate traffic are almost the same, thus achieving class balance. The datasets now made available were prepared aiming at encrypted malicious traffic detection. Since the dataset is used for machine learning model training, a sample of train and test sets are also provided. The train and test datasets are separated based on 1:4 and stratification is applied during data split. Such datasets can be used directly for machine or deep learning model training based on selected features.
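The stratified split mentioned above can be sketched as follows. The text does not say which side of the 1:4 ratio is the test set; this sketch assumes test:train = 1:4, i.e. a 20% test fraction, and uses hypothetical labels. Stratification here means the split is performed per class so both subsets keep the original class proportions.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) preserving per-class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within each class, then cut off the test share.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical balanced label column with 50 records per class.
labels = ["malicious"] * 50 + ["legitimate"] * 50
train_idx, test_idx = stratified_split(labels)
```

With balanced input, each subset stays balanced, so accuracy on the test set is not skewed by a shifted class ratio.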
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The dataset used in this study is publicly available for research purposes. If you are using this dataset, please cite the following paper, which outlines the complete details of the dataset and the methodology used for its generation:
Amit Karamchandani, Javier Núñez, Luis de-la-Cal, Yenny Moreno, Alberto Mozo, Antonio Pastor, "On the Applicability of Network Digital Twins in Generating Synthetic Data for Heavy Hitter Discrimination," under submission.
This is a synthetic dataset generated to differentiate between benign and malicious heavy hitter (HH) flows within complex network environments. Heavy hitter flows, which include high-volume data transfers, can significantly impact network performance, leading to congestion and degraded quality of service. Distinguishing legitimate heavy hitter activity from malicious Distributed Denial-of-Service traffic is critical for network management and security, yet existing datasets lack the granularity needed to train machine learning models to make this distinction effectively.
To address this, a Network Digital Twin (NDT) approach was utilized to emulate realistic network conditions and traffic patterns, enabling automated generation of labeled data for both benign and malicious HH flows alongside regular traffic.
The feature set includes flow statistics commonly used in network analysis, such as:
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description: This is a Spearman correlation heatmap of the 32 features used for machine learning and deep learning models in cybersecurity. The diagonal cells show perfect self-correlation (value = 1), and the off-diagonal cells show pairwise correlations between features. Redundant or irrelevant features were removed beforehand, so no strong correlations (close to 1 or -1) remain and each selected feature contributes largely non-overlapping information to the model. Feature selection is key in building cyber intrusion detection systems because it reduces computational overhead, simplifies the model, and can improve accuracy and robustness. The heatmap is part of a systematic feature engineering process to optimize datasets for anomaly detection, network traffic analysis, and intrusion detection. Researchers in AI for cybersecurity can use it to build more interpretable and efficient detection models for large-scale networks. The figure illustrates the value of correlation analysis for high-dimensional datasets in cybersecurity, data science, and machine learning.
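For readers who want to reproduce this kind of analysis, the sketch below computes Spearman's rank correlation (Pearson correlation applied to average ranks) and lists feature pairs above a threshold. The 0.9 threshold and the three sample features are assumptions for illustration, not the authors' settings.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

def highly_correlated_pairs(features, threshold=0.9):
    """List feature-name pairs whose |rho| meets the threshold."""
    names = list(features)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(spearman(features[a], features[b])) >= threshold
    ]

# Hypothetical features: "b" is a monotonic function of "a", "c" is not.
features = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8.5], "c": [4, 1, 3, 2]}
pairs = highly_correlated_pairs(features)
```

One feature from each reported pair is a candidate for removal, since a monotonic relationship means the second feature adds little new ordering information.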
Why It Matters:
- Reduces overfitting in machine learning models.
- Improves computational efficiency for large-scale datasets.
- Enhances feature interpretability for robust cybersecurity solutions.
Keywords: Spearman Correlation Heatmap, Feature Selection, Intrusion Detection System, Cybersecurity, Machine Learning, Deep Learning, Anomaly Detection, Network Traffic Analysis, Artificial Intelligence in Cybersecurity, Dataset Optimization, Feature Engineering for Cyber Threats
References: This file pertains to our research study, which has been accepted for publication in the Scientific and Technical Journal of Information Technologies, Mechanics and Optics. The study is titled: "Enhancing and Extending CatBoost for Accurate Detection and Classification of DoS and DDoS Attack Subtypes in Network Traffic."
https://doi.org/10.1109/ICSIP61881.2024.10671552 https://doi.org/10.24143/2072-9502-2024-3-65-74
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
As technology advances, the number and complexity of cyber-attacks increase, forcing defense techniques to be updated and improved. To help develop effective tools for detecting security threats it is essential to have reliable and representative security datasets. Many existing security datasets have limitations that make them unsuitable for research, including lack of labels, unbalanced traffic, and outdated threats.
CTU-SME-11 is a labeled network dataset designed to address the limitations of previous datasets. The dataset was captured in a real network that mimics a small-medium enterprise setting. Raw network traffic (packets) was captured from 11 devices using tcpdump for a duration of 7 days, from the 20th to the 26th of February, 2023, in Prague, Czech Republic. The devices were chosen based on the enterprise setting and consist of IoT, desktop, and mobile devices, both bare metal and virtualized. The devices were infected with malware or exposed to Internet attacks, and factory reset to restore benign behavior.
The raw data was processed to generate network flows (Zeek logs), which were analyzed and labeled. The dataset contains two types of labels, a high-level label and a descriptive label, both assigned by experts. The former can take three values: benign, malicious, or background. The latter contains detailed information about the specific behavior observed in the network flows. The dataset contains 99 million labeled network flows. The overall compressed size of the dataset is 80GB and the uncompressed size is 170GB.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset has been meticulously prepared and utilized as a validation set during the evaluation phase of "Meta IDS" to assess the performance of various machine learning models. It is now made available for interested users and researchers who seek a reliable and diverse dataset for training and testing their own custom models.
The validation dataset comprises a comprehensive collection of labeled entries that indicate whether the packet type is "malicious" or "benign." It covers complex design patterns that are commonly encountered in real-world applications. The dataset is designed to be representative, encompassing edge and fog layers that are in contact with the cloud layer, thereby enabling thorough testing and evaluation of different models. Each sample in the dataset is labeled with the corresponding ground truth, providing a reliable reference for model performance evaluation.
To ensure convenient distribution and storage, the dataset has been broken down into three separate batches, each containing a portion of the dataset. This allows for convenient downloading and management of the dataset. The three batches are provided as individual compressed files.
To extract the data, follow these instructions:
Once uncompressed, you will have access to the dataset in its original format for further exploration, analysis, and model training. Extraction requires approximately 800 GB of storage in total: approximately 302 GB for the first batch, 203 GB for the second, and 297 GB for the third.
The first batch contains 1,049,527,992 entries, the second 711,043,331 entries, and the third 1,029,303,062 entries. The following table lists each feature name with its explanation and an example value from the extracted dataset.
| Feature | Description | Example Value |
|---|---|---|
| ip.src | Source IP address in the packet | a05d4ecc38da01406c9635ec694917e969622160e728495e3169f62822444e17 |
| ip.dst | Destination IP address in the packet | a52db0d87623d8a25d0db324d74f0900deb5ca4ec8ad9f346114db134e040ec5 |
| frame.time_epoch | Epoch time of the frame | 1676165569.930869 |
| arp.hw.type | Hardware type | 1 |
| arp.hw.size | Hardware size | 6 |
| arp.proto.size | Protocol size | 4 |
| arp.opcode | Opcode | 2 |
| data.len | Length | 2713 |
| eth.dst.lg | Destination LG bit | 1 |
| eth.dst.ig | Destination IG bit | 1 |
| eth.src.lg | Source LG bit | 1 |
| eth.src.ig | Source IG bit | 1 |
| frame.offset_shift | Time shift for this packet | 0 |
| frame.len | Frame length on the wire | 1208 |
| frame.cap_len | Frame length stored into the capture file | 215 |
| frame.marked | Frame is marked | 0 |
| frame.ignored | Frame is ignored | 0 |
| frame.encap_type | Encapsulation type | 1 |
| gre | Generic Routing Encapsulation | 'Generic Routing Encapsulation (IP)' |
| ip.version | Version | 6 |
| ip.hdr_len | Header length | 24 |
| ip.dsfield.dscp | Differentiated Services Codepoint | 56 |
| ip.dsfield.ecn | Explicit Congestion Notification | 2 |
| ip.len | Total length | 614 |
| ip.flags.rb | Reserved bit | 0 |
| ip.flags.df | Don't fragment | 1 |
| ip.flags.mf | More fragments | 0 |
| ip.frag_offset | Fragment offset | 0 |
| ip.ttl | Time to live | 31 |
| ip.proto | Protocol | 47 |
| ip.checksum.status | Header checksum status | 2 |
| tcp.srcport | TCP source port | 53425 |
| tcp.flags | Flags | 0x00000098 |
| tcp.flags.ns | Nonce | 0 |
| tcp.flags.cwr | Congestion Window Reduced (CWR) | 1 |
| udp.srcport | UDP source port | 64413 |
| udp.dstport | UDP destination port | 54087 |
| udp.stream | Stream index | 1345 |
| udp.length | Length | 225 |
| udp.checksum.status | Checksum status | 3 |
| packet_type | Type of the packet which is either "benign" or "malicious" | 0 |
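The tcp.flags example value in the table is a bit field. The sketch below decodes it using the standard TCP flag bit positions; the helper name is hypothetical. For the example value 0x00000098, the decoded set includes CWR, which agrees with the tcp.flags.cwr example value of 1 in the same table.

```python
# Standard TCP flag bit masks, lowest bit first.
TCP_FLAG_BITS = [
    ("FIN", 0x001), ("SYN", 0x002), ("RST", 0x004), ("PSH", 0x008),
    ("ACK", 0x010), ("URG", 0x020), ("ECE", 0x040), ("CWR", 0x080),
    ("NS", 0x100),
]

def decode_tcp_flags(raw):
    """Return the names of the flags set in a tcp.flags value (hex string or int)."""
    value = int(raw, 16) if isinstance(raw, str) else raw
    return [name for name, bit in TCP_FLAG_BITS if value & bit]

# 0x98 = 0x80 (CWR) + 0x10 (ACK) + 0x08 (PSH)
flags = decode_tcp_flags("0x00000098")
```

Expanding the field into one binary column per flag is a common way to feed it to a model without implying a numeric order between flag combinations.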
Furthermore, in compliance with the GDPR and to ensure the privacy of individuals, all IP addresses present in the dataset have been anonymized through hashing. This anonymization process helps protect the identity of individuals while preserving the integrity and utility of the dataset for research and model development purposes.
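The hashing scheme is not specified here. The example values for ip.src and ip.dst are 64 hexadecimal characters long, which is consistent with a 256-bit digest, so the sketch below assumes SHA-256. The salt and the function name are hypothetical.

```python
import hashlib

def anonymize_ip(ip, salt=b""):
    """Return a hex SHA-256 digest of the (optionally salted) IP string."""
    return hashlib.sha256(salt + ip.encode("utf-8")).hexdigest()

# The same address always maps to the same token, so flows from one host
# can still be grouped after anonymization.
token = anonymize_ip("192.0.2.10", salt=b"hypothetical-secret")
```

Because hashing is deterministic, per-host behavior remains analyzable. A secret salt matters in practice: the IPv4 address space is small enough that unsalted digests can be reversed by enumeration.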
Please note that while the dataset provides valuable insights and a solid foundation for machine learning tasks, it is not a substitute for extensive real-world data collection. However, it serves as a valuable resource for researchers, practitioners, and enthusiasts in the machine learning community, offering a compliant and anonymized dataset for developing and validating custom models in a specific problem domain.
By leveraging the validation dataset for machine learning model evaluation and custom model training, users can accelerate their research and development efforts, building upon the knowledge gained from my thesis while contributing to the advancement of the field.
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
MedSec-25 is a comprehensive, labeled network traffic dataset designed specifically for the Internet of Medical Things (IoMT) in healthcare environments. It addresses the limitations of existing generic IoT datasets by capturing realistic traffic from a custom-built healthcare IoT lab that mimics real-world hospital operations. The dataset includes both benign (normal) traffic and malicious traffic from multi-staged attack campaigns inspired by the MITRE ATT&CK framework. This allows for the development and evaluation of machine learning-based intrusion detection systems (IDS) tailored to IoMT scenarios, where patient safety and data privacy are critical. The dataset was generated using a variety of medical sensors (e.g., ECG, EEG, HHI, Respiration, SpO2) and environmental sensors (e.g., thermistor, ultrasonic, PIR, flame) connected via Raspberry Pi nodes and an IoT server. Traffic was captured over 7.5 hours using tools like Wireshark and tcpdump, resulting in PCAPNG files. These were processed with CICFlowMeter to extract flow-based features, producing a cleaned CSV dataset with 554,534 bidirectional network flows and 84 features.
Realistic Setup: Built in a physical lab at Rochester Institute of Technology, Dubai, incorporating diverse IoMT devices, protocols (e.g., MQTT, SSH, Telnet, FTP, HTTP, DNS), and real-time patient interactions (anonymized to comply with privacy regulations like HIPAA).
Multi-Staged Attacks: Unlike datasets focusing on isolated attacks, MedSec-25 simulates full attack chains: Reconnaissance (e.g., SYN/TCP scans, OS fingerprinting), Initial Access (e.g., brute-force, malformed MQTT packets), Lateral Movement (e.g., exploiting vulnerabilities to pivot between devices), and Exfiltration (e.g., data theft via MQTT).
Imbalanced Nature: This is the cleaned (imbalanced) version of the dataset. Users may need to apply balancing techniques (e.g., SMOTE oversampling + random undersampling) for model training, as demonstrated in the associated paper.
Size and Quality: 554,534 rows with no duplicates. The only missing values are 111 NaNs in Flow Byts/s (~0.02% of rows), which can be handled via imputation. Data types include float64 (45 columns), int64 (34 columns), and object (5 columns: Flow ID, Src IP, Dst IP, Timestamp, Label).
Utility: Preliminary models trained on this dataset (e.g., KNN: 98.09% accuracy, Decision Tree: 98.35% accuracy) show excellent performance for detecting attack stages.
This dataset is ideal for researchers in cybersecurity, machine learning, and healthcare IoT, enabling the creation of an IDS that can detect attacks at different phases to prevent escalation.
Benign Traffic: Generated over two days with active sensors, services (HTTP dashboard for patient monitoring, SSH/Telnet for remote access, FTP for file transfers), and real users (students/faculty) interacting with medical devices. No personally identifiable information was stored.
Malicious Traffic: Two Kali Linux attacker machines simulated MITRE ATT&CK-inspired campaigns using tools like Nmap, Scapy, Metasploit, and custom Python scripts.
Capture Tools: Wireshark and tcpdump for PCAPNG files (total ~1GB: 600MB benign, 400MB malicious).
Processing: Combined PCAP files per label, extracted features with CICFlowMeter, labeled flows manually based on attack phases, and cleaned for ML readiness. The final cleaned CSV is ~350MB.
The dataset includes 84 features extracted by CICFlowMeter, categorized as:
Identifiers: Flow ID, Src IP, Src Port, Dst IP, Dst Port, Protocol, Timestamp.
Time-Series Metrics: Flow Duration, Flow IAT Mean/Std/Max/Min, Fwd/Bwd IAT Tot/Mean/Std/Max/Min.
Size/Count Statistics: Tot Fwd/Bwd Pkts, TotLen Fwd/Bwd Pkts, Fwd/Bwd Pkt Len Max/Min/Mean/Std, Pkt Len Min/Max/Mean/Std/Var, Pkt Size Avg.
Flag Counts: Fwd/Bwd PSH/URG Flags, FIN/SYN/RST/PSH/ACK/URG/CWE/ECE Flag Cnt.
Rates and Ratios: Flow Byts/s, Flow Pkts/s, Fwd/Bwd Pkts/s, Down/Up Ratio, Active/Idle Mean/Std/Max/Min.
Segmentation and Others: Fwd/Bwd Seg Size Avg/Min, Subflow Fwd/Bwd Pkts/Byts, Init Fwd/Bwd Win Byts, Fwd Act Data Pkts, Fwd/Bwd Byts/b Avg, Fwd/Bwd Pkts/b Avg, Fwd/Bwd Blk Rate Avg.
The dataset is labeled with 5 classes representing benign behavior and attack stages:
Reconnaissance: 401,683 flows
Initial Access: 102,090 flows
Exfiltration: 25,915 flows
Lateral Movement: 12,498 flows
Benign: 12,348 flows
Note: The dataset is imbalanced, with Reconnaissance dominating. Apply balancing techniques for optimal ML performance.
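To make the imbalance concrete, the sketch below computes each class's share of the 554,534 flows and an inverse-frequency weight per class. Class weighting is shown here as a lightweight alternative to the SMOTE and undersampling approach mentioned in the description, not as the method used in the associated paper. The weight formula `total / (n_classes * count)` is the same heuristic scikit-learn applies for `class_weight="balanced"`.

```python
# Per-class flow counts as listed in the dataset description.
counts = {
    "Reconnaissance": 401_683,
    "Initial Access": 102_090,
    "Exfiltration": 25_915,
    "Lateral Movement": 12_498,
    "Benign": 12_348,
}

total = sum(counts.values())
n_classes = len(counts)

# Share of all flows, and an inverse-frequency weight for each class.
shares = {label: n / total for label, n in counts.items()}
weights = {label: total / (n_classes * n) for label, n in counts.items()}

for label, n in counts.items():
    print(f"{label:<17} {n:>9,} {shares[label]:6.1%}  weight={weights[label]:.3f}")
```

Rare classes such as Benign and Lateral Movement receive large weights, so misclassifying them costs more during training than misclassifying a Reconnaissance flow.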
Preprocessing Suggestions: Encode categorical features (e.g., Protocol, Label) using LabelEncoder. Normalize numerical features with Min-Max Scaler or StandardScaler. Handle the minor NaNs in Flow Byts/s via mean imputation.
Model Training: Split into train/test (e.g., 80/20). Suitable for classification tasks w...
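The preprocessing and splitting steps above can be sketched without any third-party library, to show what each step computes. This is an illustration under stated assumptions: the ten toy values stand in for the Flow Byts/s column, the helper names are hypothetical, and stratifying the 80/20 split by label is an added choice motivated by the class imbalance. In practice the scikit-learn classes named above would normally be used instead.

```python
import math
import random
from collections import defaultdict


def encode_labels(labels):
    """Map class names to integers in sorted order."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    return [index[name] for name in labels], classes


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), keeping class proportions in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        idx = by_class[label]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx


# Toy stand-in for one numeric column with a few NaNs, plus string labels.
nan = float("nan")
flow_bytes = [1200.0, nan, 300.0, 4500.0, 800.0, nan, 150.0, 2200.0, 950.0, 60.0]
labels = ["Reconnaissance"] * 5 + ["Benign"] * 5

y, classes = encode_labels(labels)
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=0)

# Fit the imputation mean and the min-max range on the training rows only,
# so no information from the test rows leaks into preprocessing.
observed = [flow_bytes[i] for i in train_idx if not math.isnan(flow_bytes[i])]
fill = sum(observed) / len(observed)
lo, hi = min(observed), max(observed)
span = (hi - lo) or 1.0  # guard against a constant column


def transform(value):
    """Mean-impute a NaN, then min-max scale using training statistics."""
    value = fill if math.isnan(value) else value
    return (value - lo) / span


x_train = [transform(flow_bytes[i]) for i in train_idx]
x_test = [transform(flow_bytes[i]) for i in test_idx]
print(classes, len(x_train), len(x_test))
```

Because the scaling range comes from the training rows alone, test values may fall slightly outside the interval from 0 to 1; that is expected and harmless for most classifiers.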
https://dataintelo.com/privacy-and-policy
According to our latest research, the global machine learning security market size reached USD 7.42 billion in 2024, reflecting a robust expansion driven by escalating cyber threats and the increasing adoption of advanced digital technologies across various sectors. The market is projected to register a compelling CAGR of 23.6% during the forecast period, with the total market value anticipated to reach USD 61.58 billion by 2033. This exponential growth is primarily fueled by the urgent need for proactive security solutions capable of identifying and mitigating sophisticated cyberattacks in real time.
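The headline figures can be sanity-checked with the standard compound annual growth rate formula, CAGR = (end / start)^(1 / periods) - 1. The sketch below is only an arithmetic check; the report does not state which base year the forecast period starts from, so the number of compounding periods is an assumption and both plausible values are shown.

```python
def cagr(start: float, end: float, periods: int) -> float:
    """Compound annual growth rate between two values."""
    return (end / start) ** (1 / periods) - 1


def project(start: float, rate: float, periods: int) -> float:
    """Value after compounding `rate` for `periods` years."""
    return start * (1 + rate) ** periods


start_usd_bn = 7.42   # stated 2024 market size
end_usd_bn = 61.58    # stated 2033 projection

rate_10 = cagr(start_usd_bn, end_usd_bn, 10)
rate_9 = cagr(start_usd_bn, end_usd_bn, 9)

print(f"implied CAGR over 10 periods: {rate_10:.1%}")
print(f"implied CAGR over 9 periods:  {rate_9:.1%}")
print(f"7.42 compounded at 23.6% for 10 periods: {project(7.42, 0.236, 10):.2f}")
```

The stated 23.6% rate reproduces the 2033 figure when ten compounding periods are assumed. Counting nine periods from 2024 to 2033 would instead imply a rate of roughly 26.5%.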
The primary growth driver for the machine learning security market is the rapid surge in cyberattacks, including ransomware, phishing, and advanced persistent threats. As organizations digitize operations and expand their cloud infrastructure, the attack surface increases, making traditional security measures insufficient. Machine learning-based security solutions can analyze vast datasets, detect anomalies, and respond to threats far more efficiently than conventional systems. The growing sophistication of cybercriminals, who now leverage artificial intelligence themselves, has made it imperative for enterprises to adopt equally advanced defense mechanisms. This environment of escalating cyber risk underpins the strong demand for machine learning security solutions across all major industries.
Another significant factor propelling the market is the increasing regulatory pressure and compliance requirements worldwide. Governments and regulatory bodies have introduced stringent data protection and privacy laws, such as the General Data Protection Regulation (GDPR) in Europe and the California Consumer Privacy Act (CCPA) in the United States. These regulations mandate organizations to implement robust security frameworks to safeguard sensitive information. Machine learning security tools not only enhance the effectiveness of compliance programs by automating threat detection and reporting but also help organizations avoid hefty penalties associated with data breaches. As businesses strive to maintain compliance and protect their reputations, investment in machine learning-driven security is becoming a strategic imperative.
Furthermore, the expansion of the Internet of Things (IoT), cloud computing, and remote work trends have introduced new vectors of vulnerability, necessitating adaptive security approaches. Machine learning security solutions can continuously learn and adapt to emerging threats, providing proactive protection for dynamic IT environments. The integration of machine learning into security operations centers (SOCs) enables real-time monitoring, rapid incident response, and predictive analytics. This capability is particularly valuable for sectors such as BFSI, healthcare, and manufacturing, where the cost of a security breach can be catastrophic. As digital transformation accelerates, the market is poised for sustained growth, with organizations prioritizing advanced security investments.
From a regional perspective, North America currently dominates the machine learning security market, accounting for the largest share in 2024, followed closely by Europe and Asia Pacific. The United States, in particular, is a frontrunner due to its mature cybersecurity ecosystem, high technology adoption rate, and presence of leading market players. Europe’s growth is bolstered by strict regulatory frameworks and strong investment in digital infrastructure, while Asia Pacific is experiencing the fastest growth rate, driven by rapid digitalization, increasing cybercrime, and a burgeoning IT sector. Latin America and the Middle East & Africa are also witnessing steady adoption as awareness of cybersecurity threats rises and governments take proactive measures to strengthen national cyber defenses.
The machine learning security market is segmented by component into software, hardware, and services, with each playing a critical role in enabling comprehensive security solutions. Software forms the backbone of machine learning security, encompassing threat detection platforms, security analytics, and automated response systems. These solutions leverage advanced algorithms to process and analyze massive volumes of security data, identify patterns of malicious activity, and provide actionable insights. The software segment is anticipated to maintain its dominance througho
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This project provides a comprehensive collection of open-source datasets focused on cybersecurity threats and AI security vulnerabilities. The datasets are carefully selected to align with specific security threats, such as:
Each dataset includes a detailed description, source type, purpose, and direct access links for easy retrieval.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview of the proposed insider threat detection algorithms. (XLSX 25.6 kb)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset consists of 247,950 instances, of which 128,541 are phishing URLs and 119,409 are legitimate URLs. It has 41 features and 1 target variable (0 = legitimate, 1 = phishing), making it suitable for training machine learning models to identify phishing attacks.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This Cybersecurity Intrusion Detection Dataset is designed for detecting cyber intrusions based on network traffic and user behavior. Below, I’ll explain each aspect in detail, including the dataset structure, feature importance, possible analysis approaches, and how it can be used for machine learning.
The dataset consists of network-based and user behavior-based features. Each feature provides valuable information about potential cyber threats.
These features describe network-level information such as packet size, protocol type, and encryption methods.
network_packet_size (Packet Size in Bytes)
protocol_type (Communication Protocol)
encryption_used (Encryption Protocol)
These features track user activities, such as login attempts and session duration.
login_attempts (Number of Logins)
session_duration (Session Length in Seconds)
failed_logins (Failed Login Attempts)
unusual_time_access (Login Time Anomaly)
A binary flag (0 or 1) indicating whether access happened at an unusual time.
ip_reputation_score (Trustworthiness of IP Address)
browser_type (User’s Browser)
Target variable (attack_detected): 1 means an attack was detected, 0 means normal activity.
This dataset can be used for intrusion detection systems (IDS) and cybersecurity research. Some key applications include:
Supervised Learning Approaches
Classification models can be trained with attack_detected as the target.
Deep Learning Approaches
If attack labels are missing, anomaly detection can be used:
- Autoencoders: Learn normal traffic and flag anomalies.
- Isolation Forest: Detects outliers based on feature isolation.
- One-Class SVM: Learns normal behavior and detects deviations.
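As an illustration of the "feature isolation" idea, here is a small dependency-free sketch of an isolation forest. It is a simplified teaching version under stated assumptions, not a reference implementation: the synthetic rows only borrow three column names from this dataset (login_attempts, failed_logins, session_duration), all function names are hypothetical, and a node becomes a leaf as soon as the randomly chosen feature is constant. For real work a library implementation such as scikit-learn's `IsolationForest` would be the usual choice.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def c_factor(n):
    """Expected path length of an unsuccessful binary-search-tree lookup."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(rows, depth, max_depth, rng):
    """Recursively split rows on a random feature at a random threshold."""
    if depth >= max_depth or len(rows) <= 1:
        return {"size": len(rows)}
    feature = rng.randrange(len(rows[0]))
    lo = min(r[feature] for r in rows)
    hi = max(r[feature] for r in rows)
    if lo == hi:  # simplification: stop if the chosen feature is constant
        return {"size": len(rows)}
    split = rng.uniform(lo, hi)
    left = [r for r in rows if r[feature] < split]
    right = [r for r in rows if r[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(row, node, depth=0):
    """Depth at which `row` lands, plus a correction for unsplit leaves."""
    if "size" in node:
        return depth + c_factor(node["size"])
    child = node["left"] if row[node["feature"]] < node["split"] else node["right"]
    return path_length(row, child, depth + 1)


def fit_forest(rows, n_trees=100, sample_size=256, seed=0):
    if len(rows) < 2:
        raise ValueError("need at least two rows")
    rng = random.Random(seed)
    sample_size = min(sample_size, len(rows))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [
        build_tree(rng.sample(rows, sample_size), 0, max_depth, rng)
        for _ in range(n_trees)
    ]
    return trees, sample_size


def anomaly_score(row, forest):
    """Score in (0, 1]; values near 1 mean the row is isolated quickly."""
    trees, sample_size = forest
    mean_depth = sum(path_length(row, t) for t in trees) / len(trees)
    return 2.0 ** (-mean_depth / c_factor(sample_size))


# Synthetic, unlabeled sessions: (login_attempts, failed_logins, session_duration)
data_rng = random.Random(42)
normal = [
    (data_rng.randint(1, 5), data_rng.randint(0, 2), data_rng.gauss(600, 150))
    for _ in range(120)
]
suspicious = (40, 35, 5.0)  # many failed logins in a very short session
rows = normal + [suspicious]

forest = fit_forest(rows)
scores = [anomaly_score(r, forest) for r in rows]
print("suspicious:", round(scores[-1], 3))
print("mean normal:", round(sum(scores[:-1]) / len(normal), 3))
```

Rows that differ sharply from the bulk of the data are separated after only a few random splits, so their average path is short and their score is high. No labels are needed, which is why this family of methods suits the unlabeled case described above.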