100+ datasets found

Cybersecurity 🪪 Intrusion 🦠 Detection Dataset
kaggle.com
Updated Feb 10, 2025
Cite
Dinesh Naveen Kumar Samudrala (2025). Cybersecurity 🪪 Intrusion 🦠 Detection Dataset [Dataset]. https://www.kaggle.com/datasets/dnkumars/cybersecurity-intrusion-detection-dataset
Dataset updated
Feb 10, 2025
Dataset provided by
Kaggle: http://kaggle.com/
Authors
Dinesh Naveen Kumar Samudrala
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
This Cybersecurity Intrusion Detection Dataset is designed for detecting cyber intrusions based on network traffic and user behavior. Below, I’ll explain each aspect in detail, including the dataset structure, feature importance, possible analysis approaches, and how it can be used for machine learning.

1. Understanding the Features

The dataset consists of network-based and user behavior-based features. Each feature provides valuable information about potential cyber threats.

A. Network-Based Features

These features describe network-level information such as packet size, protocol type, and encryption methods.

network_packet_size (Packet Size in Bytes)

Represents the size of network packets, ranging from 64 to 1500 bytes.

Packets on the lower end (~64 bytes) may indicate control messages, while larger packets (~1500 bytes) often carry bulk data.

Attackers may use abnormally small or large packets for reconnaissance or exploitation attempts.

protocol_type (Communication Protocol)

The protocol used in the session: TCP, UDP, or ICMP.

TCP (Transmission Control Protocol): Reliable, connection-oriented (common for HTTP, HTTPS, SSH).

UDP (User Datagram Protocol): Faster but less reliable (used for VoIP, streaming).

ICMP (Internet Control Message Protocol): Used for network diagnostics (ping); often abused in Denial-of-Service (DoS) attacks.

encryption_used (Encryption Protocol)

Values: AES, DES, None.

AES (Advanced Encryption Standard): Strong encryption, commonly used.

DES (Data Encryption Standard): Older encryption, weaker security.

None: Indicates unencrypted communication, which can be risky.

Attackers might use no encryption to avoid detection or weak encryption to exploit vulnerabilities.

B. User Behavior-Based Features

These features track user activities, such as login attempts and session duration.

login_attempts (Number of Logins)

High values might indicate brute-force attacks (repeated login attempts).

Typical users have 1–3 login attempts, while an attack may have hundreds or thousands.

session_duration (Session Length in Seconds)

A very long session might indicate unauthorized access or persistence by an attacker.

Attackers may try to stay connected to maintain access.

failed_logins (Failed Login Attempts)

High failed login counts indicate credential stuffing or dictionary attacks.

Many failed attempts followed by a successful login could suggest an account was compromised.

unusual_time_access (Login Time Anomaly)

A binary flag (0 or 1) indicating whether access happened at an unusual time.

Attackers often operate outside normal business hours to evade detection.

ip_reputation_score (Trustworthiness of IP Address)

A score from 0 to 1, where higher values indicate suspicious activity.

IP addresses associated with botnets, spam, or previous attacks tend to have higher scores.

browser_type (User’s Browser)

Common browsers: Chrome, Firefox, Edge, Safari.

Unknown: Could be an indicator of automated scripts or bots.

2. Target Variable (attack_detected)

Binary classification: 1 means an attack was detected, 0 means normal activity.

The dataset is useful for supervised machine learning, where a model learns from labeled attack patterns.

3. Possible Use Cases

This dataset can be used for intrusion detection systems (IDS) and cybersecurity research. Some key applications include:

A. Machine Learning-Based Intrusion Detection

Supervised Learning Approaches

Classification Models (Logistic Regression, Decision Trees, Random Forest, XGBoost, SVM)

Train the model using labeled data (attack_detected as the target).

Evaluate using accuracy, precision, recall, F1-score.
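
The four metrics named above can all be derived from the counts of true/false positives and negatives. The following is a minimal sketch in plain Python, assuming binary labels where 1 marks an attack (as in attack_detected); the function name and the sample labels are illustrative and not part of the dataset.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary labels (1 = attack)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # attacks correctly flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # normal traffic correctly passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed attacks
    total = len(pairs)
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
```

For intrusion detection, recall (the share of attacks that are caught) and precision (the share of alerts that are real) are usually worth reading separately, since accuracy alone can look good on imbalanced data.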

Deep Learning Approaches

Use Neural Networks (DNN, LSTM, CNN) for pattern recognition.

LSTMs work well for time-series-based network traffic analysis.

B. Anomaly Detection (Unsupervised Learning)

If attack labels are missing, anomaly detection can be used:

Autoencoders: Learn normal traffic and flag anomalies.

Isolation Forest: Detects outliers based on feature isolation.

One-Class SVM: Learns normal behavior and detects deviations.

C. Rule-Based Detection

If certain thresholds are met (e.g., failed_logins > 10 & ip_reputation_score > 0.8), an alert is triggered.
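
The example rule above can be sketched as a small predicate. This assumes each session is a dictionary keyed by the dataset's column names; the function name, the default thresholds (taken from the example), and the sample sessions are illustrative only. The "&" in the rule is read as a logical AND.

```python
def rule_alert(record, max_failed_logins=10, max_ip_score=0.8):
    """Return True when both example thresholds from the text are exceeded."""
    return (record["failed_logins"] > max_failed_logins
            and record["ip_reputation_score"] > max_ip_score)


# Hypothetical sessions, not rows from the dataset.
sessions = [
    {"failed_logins": 12, "ip_reputation_score": 0.91},  # both conditions met
    {"failed_logins": 12, "ip_reputation_score": 0.30},  # only failed logins high
    {"failed_logins": 1, "ip_reputation_score": 0.95},   # only IP score high
]
alerts = [rule_alert(s) for s in sessions]
```

Keeping the thresholds as parameters makes it easy to tune them against labeled data rather than hard-coding them.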

4. Challenges & Considerations

Adversarial Attacks: Attackers may modify traffic to evade detection.

Concept Drift: Cyber threats...

🌐 Global Cybersecurity Threats (2015-2024)

kaggle.com

zip

Updated Mar 16, 2025

Cite

Atharva Soundankar (2025). 🌐 Global Cybersecurity Threats (2015-2024) [Dataset]. https://www.kaggle.com/datasets/atharvasoundankar/global-cybersecurity-threats-2015-2024

Available download formats: zip (48178 bytes)

Dataset updated

Mar 16, 2025

Authors

Atharva Soundankar

License

https://creativecommons.org/publicdomain/zero/1.0/

Description

📂

The Global Cybersecurity Threats Dataset (2015-2024) provides extensive data on cyberattacks, malware types, targeted industries, and affected countries. It is designed for threat intelligence analysis, cybersecurity trend forecasting, and machine learning model development to enhance global digital security.

📊 Column Descriptions

Column Name	Description
Country	Country where the attack occurred
Year	Year of the incident
Threat Type	Type of cybersecurity threat (e.g., Malware, DDoS)
Attack Vector	Method of attack (e.g., Phishing, SQL Injection)
Affected Industry	Industry targeted (e.g., Finance, Healthcare)
Data Breached (GB)	Volume of data compromised
Financial Impact ($M)	Estimated financial loss in millions
Severity Level	Low, Medium, High, Critical
Response Time (Hours)	Time taken to mitigate the attack
Mitigation Strategy	Countermeasures taken

Large-Scale Network Cyberattacks Multiclass Dataset 2024 (LSNM2024)
data.mendeley.com
Updated Jul 1, 2024
Cite
Qasem Abu Al-Haija (2024). Large-Scale Network Cyberattacks Multiclass Dataset 2024 (LSNM2024) [Dataset]. http://doi.org/10.17632/7pzyfvv9jn.1
Unique identifier
https://doi.org/10.17632/7pzyfvv9jn.1
Dataset updated
Jul 1, 2024
Authors
Qasem Abu Al-Haija
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We present a novel, large-scale multiclass dataset intended to improve the recognition of suspicious traffic in networks. The newly generated dataset contains up-to-date samples and features and is publicly available to help reduce the effect of upcoming cyberattacks with machine learning methods. Specifically, 6 million traffic samples with 60 features are collected and organized into two balanced classes: 50% normal traffic and 50% anomaly (attack) traffic. The anomaly traffic is composed of 15 different attacks: MITM-ARP-SPOOFING, SSH-BRUTE FORCE, FTP-BRUTE FORCE, DDOS-ICMP, DDOS-RAWIP, DDOS-UDP, DOS, EXPLOITING-FTP, FUZZING, ICMP FLOOD, SYN-FLOOD, PORT SCANNING, REMOTE CODE EXECUTION, SQL INJECTION, and XSS.

For details, please refer to and cite our article: Q. Abu Al-Haija, Z. Masoud, A. Yasin, K. Alesawi, Y. Alkarnawi, "Revolutionizing Threat Hunting in Communication Networks: Introducing a Cutting-Edge Large-Scale Multiclass Dataset", 15th International Conference on Information and Communication Systems (ICICS 2024), IEEE, Aug. 2024.
Cyber Threat Detection
kaggle.com
zip
Updated Oct 23, 2024
Cite
Hussain Afzaal 03 (2024). Cyber Threat Detection [Dataset]. https://www.kaggle.com/datasets/hussainsheikh03/cyber-threat-detection
Available download formats: zip (51424 bytes)
Dataset updated
Oct 23, 2024
Authors
Hussain Afzaal 03
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
The CyberFedDefender dataset is a simulated dataset designed for developing and testing federated learning-based cyber threat detection models. This dataset is tailored for research and experimentation in distributed anomaly detection and privacy-preserving cybersecurity frameworks. It includes traffic features commonly used in intrusion detection systems (IDS) with a focus on cloud and edge computing environments. Each record represents network traffic metadata, with labeled instances of both normal and malicious activities, making it ideal for machine learning applications in cybersecurity.

Dataset Features

The dataset consists of 1,430 instances with 23 features, including information on packet size, duration, bytes sent/received, flow statistics, and attack labels. It covers common cyberattacks such as DDoS, Brute Force, and Ransomware, along with normal network traffic.

Feature List:

Timestamp: The time when the network traffic was recorded.
Source_IP: The IP address of the source machine.
Destination_IP: The IP address of the destination machine.
Protocol: The network protocol used (TCP, UDP, ICMP).
Packet_Length: The length of the packet in bytes.
Duration: The duration of the connection in seconds.
Source_Port: The port number used by the source.
Destination_Port: The port number used by the destination.
Bytes_Sent: Total bytes sent from the source to the destination.
Bytes_Received: Total bytes received by the destination from the source.
Flags: TCP flags indicating the connection's state (e.g., SYN, ACK).
Flow_Packets/s: Number of packets per second in the traffic flow.
Flow_Bytes/s: Number of bytes per second in the traffic flow.
Avg_Packet_Size: Average size of the packets during the connection.
Total_Fwd_Packets: Total number of forward packets.
Total_Bwd_Packets: Total number of backward packets.
Fwd_Header_Length: Length of the forward packet headers.
Bwd_Header_Length: Length of the backward packet headers.
Sub_Flow_Fwd_Bytes: Bytes sent in the forward subflow.
Sub_Flow_Bwd_Bytes: Bytes received in the backward subflow.
Inbound: Indicates whether the traffic is inbound (1) or outbound (0).
Attack_Type: Type of cyberattack or normal traffic (e.g., DDoS, Brute Force, Ransomware, Normal).
Label: Binary classification label where 1 indicates malicious traffic and 0 represents normal traffic.

Usage

This dataset is designed for research in the following areas:

Federated learning for cyber threat detection
Privacy-preserving machine learning in cybersecurity
Intrusion detection systems (IDS)
Distributed anomaly detection in cloud and edge environments

Researchers can leverage this dataset to build and evaluate models for anomaly detection, perform comparative analysis, or enhance the robustness of federated learning frameworks in cybersecurity applications.
Synthetic Cybersecurity Logs for Anomaly Detection
kaggle.com
zip
Updated Dec 16, 2024
Cite
fcWebDev (2024). Synthetic Cybersecurity Logs for Anomaly Detection [Dataset]. https://www.kaggle.com/datasets/fcwebdev/synthetic-cybersecurity-logs-for-anomaly-detection
Available download formats: zip (160070 bytes)
Dataset updated
Dec 16, 2024
Authors
fcWebDev
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
This dataset contains synthetic HTTP log data designed for cybersecurity analysis, particularly for anomaly detection tasks.

Dataset Features

Timestamp: Simulated time for each log entry.
IP_Address: Randomized IP addresses to simulate network traffic.
Request_Type: Common HTTP methods (GET, POST, PUT, DELETE).
Status_Code: HTTP response status codes (e.g., 200, 404, 403, 500).
Anomaly_Flag: Binary flag indicating anomalies (1 = anomaly, 0 = normal).
User_Agent: Simulated user agents for device and browser identification.
Session_ID: Random session IDs to simulate user activity.
Location: Geographic locations of requests.

Applications

This dataset can be used for:

Anomaly Detection: Identify suspicious network activity or attacks.
Machine Learning: Train models for classification tasks (e.g., detect anomalies).
Cybersecurity Analysis: Analyze HTTP traffic patterns and identify threats.

Example Challenge

Build a machine learning model to predict the Anomaly_Flag based on the features provided.
cyber-security-events
huggingface.co
Cite
Yuriy Medvedev, cyber-security-events [Dataset]. https://huggingface.co/datasets/pyToshka/cyber-security-events
Authors
Yuriy Medvedev
License
https://choosealicense.com/licenses/bsd-3-clause/
Description
cyber-security-events

Dataset Description

This dataset contains cybersecurity events collected from honeypot infrastructure. The data has been processed and feature-engineered for machine learning applications in threat detection and security analytics.

Feature Categories

Network Features

Connection flow statistics (bytes, packets, duration)
Protocol-specific metrics
Geographic information
IP reputation data

Behavioral Features

Session… See the full description on the dataset page: https://huggingface.co/datasets/pyToshka/cyber-security-events.
Edge-IIoTset: A New Comprehensive Realistic Cyber Security Dataset of IoT...
ieee-dataport.org
Updated Nov 19, 2025
Cite
Mohamed Amine FERRAG (2025). Edge-IIoTset: A New Comprehensive Realistic Cyber Security Dataset of IoT and IIoT Applications: Centralized and Federated Learning [Dataset]. https://ieee-dataport.org/documents/edge-iiotset-new-comprehensive-realistic-cyber-security-dataset-iot-and-iiot-applications
Dataset updated
Nov 19, 2025
Authors
Mohamed Amine FERRAG
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
namely
Dataset to Train Intrusion Detection Systems based on Machine Learning...
zenodo.org
application/gzip, bin +1
Updated Nov 11, 2024
Cite
Esteban Damian Gutierrez Mlot (2024). Dataset to Train Intrusion Detection Systems based on Machine Learning Models for Electrical Substations [Dataset]. http://doi.org/10.5281/zenodo.14066350
Available download formats: bin, application/gzip, zip
Unique identifier
https://doi.org/10.5281/zenodo.14066350
Dataset updated
Nov 11, 2024
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Esteban Damian Gutierrez Mlot
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
DATASET

This dataset is part of the research work titled "A Dataset to Train Intrusion Detection Systems based on Machine Learning Models for Electrical Substations," which is currently awaiting approval for publication. The dataset has been meticulously curated to support the development and evaluation of machine learning models tailored for detecting cyber intrusions in the context of electrical substations. It is intended to facilitate research and advancements in cybersecurity for critical infrastructure, specifically focusing on real-world scenarios within electrical substation environments. We encourage its use for experimentation and benchmarking in related areas of study.

The following sections list the content of the dataset generated.

Data

raw

iec6180

attack-free-data

capture61850-attackfree.pcap (from real substation)

capture61850-attackfree_PTP.pcap

capture61850-attackfree_normalfault.pcap

attack-data

capture61850-floodattack_withfault.pcap

capture61850-floodattack_withoutfault.pcap

capture61850-fuzzyattack_withfault.pcap

capture61850-fuzzyattack_withoutfault.pcap

capture61850-replay.pcap

capture61850-ptpattack.pcap

iec104

attack-free-data

capture104-attackfree.pcap (from real substation)

attack-data

capture104-dosattack.pcap

capture104-floodattack.pcap

capture104-fuzzyattack.pcap

capture104-iec104starvationattack.pcap

capture104-mitmattack.pcap

capture104-ntpddosattack.pcap

capture104-portscanattack.pcap

processed

iec6180

attack-free-data

capture61850-attackfree.csv

capture61850-attackfree_PTP.csv

capture61850-attackfree_normalfault.csv

attack-data

capture61850-floodattack_withfault.csv

capture61850-floodattack_withoutfault.csv

capture61850-fuzzyattack_withfault.csv

capture61850-fuzzyattack_withoutfault.csv

capture61850-replay.csv

capture61850-ptpattack.csv

headers_iec61850[all].txt

iec104

attack-free-data

capture104-attackfree.csv

attack-data

capture104-dosattack.csv

capture104-floodattack.csv

capture104-fuzzyattack.csv

capture104-iec104starvationattack.csv

capture104-mitmattack.csv

capture104-ntpddosattack.csv

capture104-portscanattack.csv

headers_iec104[all].txt

Description

file type: either capture61850 or capture104, depending on whether the file contains network captures of the IEC 61850 or the IEC 104 protocol.

attack: attackfree for attack-free captures; otherwise the attack name is added to the file name.

function: optional; details of the functionality captured (normalfault) or of a specific protocol capture (PTP).

file extension: PCAP (network capture) or CSV (flow file).
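
The naming convention described above can be parsed mechanically. The sketch below assumes the pattern capture<protocol>-<attack>[_<function>].<extension>, inferred from the file names listed earlier; the regular expression and the function name are illustrative and do not come from the dataset's own source code.

```python
import re

# Assumed pattern, inferred from the listed file names:
#   capture<protocol>-<attack>[_<function>].<extension>
CAPTURE_NAME = re.compile(
    r"^capture(?P<protocol>61850|104)"
    r"-(?P<attack>[A-Za-z0-9]+)"
    r"(?:_(?P<function>[A-Za-z0-9]+))?"
    r"\.(?P<ext>pcap|csv)$"
)


def parse_capture_name(filename):
    """Split a capture file name into its parts, or return None if it does not match."""
    match = CAPTURE_NAME.match(filename)
    if match is None:
        return None
    info = match.groupdict()
    # Everything except the "attackfree" captures is treated as attack traffic.
    info["is_attack"] = info["attack"] != "attackfree"
    return info


examples = {
    name: parse_capture_name(name)
    for name in (
        "capture61850-attackfree_PTP.pcap",
        "capture61850-floodattack_withfault.csv",
        "capture104-iec104starvationattack.csv",
    )
}
```

A parser like this can be used to derive per-file labels (protocol, attack name, attack vs. attack-free) when loading the CSV flow files for training.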

Results

results

test1-iec104

model-test1-iec104.pkl

test1-iec104.log

test1-iec61850

model-test1-iec61850.pkl

test1-iec61850.log

test2-iec61850

model-test2-iec61850.pkl

test2-iec61850.log

Description

The outcomes of different test executions are available as follows:

test1-iec104: IEC 104 protocol for all attacks and attack free scenario

test1-iec61850: IEC 61850 protocol for fuzzy attack with fault injection and attack free scenario

test2-iec61850: IEC 61850 protocol for fuzzy attack normal operation and attack free scenario

Each test consists of the model results in Python pickle format (with a .pkl extension) and a detailed description of the execution conditions in an output log file (with a .log extension).

Source Code

A snapshot of the source code used to process these files is included under the filename source-code-cybersecurity-datasets-v2.0.zip. For an updated version, please visit the GitHub repository.
StealthPhisher Phishing Attack Dataset
data.mendeley.com
Updated Nov 7, 2025
Cite
Tanmay Jha (2025). StealthPhisher Phishing Attack Dataset [Dataset]. http://doi.org/10.17632/m2479kmybx.2
Unique identifier
https://doi.org/10.17632/m2479kmybx.2
Dataset updated
Nov 7, 2025
Authors
Tanmay Jha
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The StealthPhisher Phishing Attack Dataset, generated at the Cybersecurity Lab, GLA University, Mathura, is a large, diverse, and recent phishing attack dataset developed to address the evolving nature of phishing attacks. It comprises 336,749 records: 160,943 legitimate URLs and 175,806 phishing URLs, collected from reliable sources such as PhishTank. Reflecting the most recent phishing tactics, this dataset serves as a valuable resource for training and evaluating AI-based phishing detection systems.

Key features include URL-based attributes (e.g., length, TLD type, IP presence), statistical metrics (e.g., Shannon Entropy, Kolmogorov Complexity, Fractal Dimension), and HTML/interaction-based features (e.g., popups, redirects, forms). These multidimensional attributes provide comprehensive insights into phishing behavior, enabling accurate and robust threat detection. Designed to capture real-world scenarios, the dataset equips AI models to recognize both traditional and emerging phishing strategies effectively.
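
As an illustration of one of the statistical metrics mentioned above, Shannon entropy over the characters of a string is H = -Σ p_i log2(p_i), where p_i is the relative frequency of character i. The sketch below assumes a character-level definition; how the dataset authors actually computed their entropy feature is not stated here, and the function name is illustrative.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Character-level Shannon entropy of a string, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    # H = -sum(p * log2(p)) over the observed characters.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


uniform = shannon_entropy("abcd")   # four equally frequent characters
constant = shannon_entropy("aaaa")  # a single repeated character
```

Strings with many distinct, evenly used characters (such as randomly generated host names) score higher than repetitive ones, which is why entropy is a common URL feature in phishing detection.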

This dataset was generated as part of the research work presented in the article “StealthPhisher: A Defensive Framework against Phishing Attack using Hybrid Deep Learning and GenAI,” published in Expert Systems with Applications (https://doi.org/10.1016/j.eswa.2025.130205). Researchers using this dataset in their research work are kindly requested to cite this article.

Network traffic datasets created by Single Flow Time Series Analysis

zenodo.org
data.niaid.nih.gov

csv, pdf

Updated Jul 11, 2024

Cite

Josef Koumar; Karel Hynek; Tomáš Čejka (2024). Network traffic datasets created by Single Flow Time Series Analysis [Dataset]. http://doi.org/10.5281/zenodo.8035724

Available download formats: csv, pdf

Unique identifier

https://doi.org/10.5281/zenodo.8035724

Dataset updated

Jul 11, 2024

Dataset provided by

Zenodo: http://zenodo.org/

Authors

Josef Koumar; Karel Hynek; Tomáš Čejka

License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Network traffic datasets created by Single Flow Time Series Analysis

Datasets were created for the paper: Network Traffic Classification based on Single Flow Time Series Analysis -- Josef Koumar, Karel Hynek, Tomáš Čejka -- which was published at The 19th International Conference on Network and Service Management (CNSM) 2023. Please cite usage of our datasets as:

J. Koumar, K. Hynek and T. Čejka, "Network Traffic Classification Based on Single Flow Time Series Analysis," 2023 19th International Conference on Network and Service Management (CNSM), Niagara Falls, ON, Canada, 2023, pp. 1-7, doi: 10.23919/CNSM59352.2023.10327876.

This Zenodo repository contains 23 datasets created from 15 well-known published datasets which are cited in the table below. Each dataset contains 69 features created by Time Series Analysis of Single Flow Time Series. The detailed description of features from datasets is in the file: feature_description.pdf

The following table describes each dataset file:

File name	Detection problem	Citation of original raw dataset
botnet_binary.csv	Binary detection of botnet	S. García et al. An Empirical Comparison of Botnet Detection Methods. Computers & Security, 45:100–123, 2014.
botnet_multiclass.csv	Multi-class classification of botnet	S. García et al. An Empirical Comparison of Botnet Detection Methods. Computers & Security, 45:100–123, 2014.
cryptomining_design.csv	Binary detection of cryptomining; the design part	Richard Plný et al. Datasets of Cryptomining Communication. Zenodo, October 2022
cryptomining_evaluation.csv	Binary detection of cryptomining; the evaluation part	Richard Plný et al. Datasets of Cryptomining Communication. Zenodo, October 2022
dns_malware.csv	Binary detection of malware DNS	Samaneh Mahdavifar et al. Classifying Malicious Domains using DNS Traffic Analysis. In DASC/PiCom/CBDCom/CyberSciTech 2021, pages 60–67. IEEE, 2021.
doh_cic.csv	Binary detection of DoH	Mohammadreza MontazeriShatoori et al. Detection of doh tunnels using time-series classification of encrypted traffic. In DASC/PiCom/CBDCom/CyberSciTech 2020, pages 63–70. IEEE, 2020
doh_real_world.csv	Binary detection of DoH	Kamil Jeřábek et al. Collection of datasets with DNS over HTTPS traffic. Data in Brief, 42:108310, 2022
dos.csv	Binary detection of DoS	Nickolaos Koroniotis et al. Towards the development of realistic botnet dataset in the Internet of Things for network forensic analytics: Bot-IoT dataset. Future Gener. Comput. Syst., 100:779–796, 2019.
edge_iiot_binary.csv	Binary detection of IoT malware	Mohamed Amine Ferrag et al. Edge-iiotset: A new comprehensive realistic cyber security dataset of iot and iiot applications: Centralized and federated learning, 2022.
edge_iiot_multiclass.csv	Multi-class classification of IoT malware	Mohamed Amine Ferrag et al. Edge-iiotset: A new comprehensive realistic cyber security dataset of iot and iiot applications: Centralized and federated learning, 2022.
https_brute_force.csv	Binary detection of HTTPS Brute Force	Jan Luxemburk et al. HTTPS Brute-force dataset with extended network flows, November 2020
ids_cic_binary.csv	Binary detection of intrusion in IDS	Iman Sharafaldin et al. Toward generating a new intrusion detection dataset and intrusion traffic characterization. ICISSP, 1:108–116, 2018.
ids_cic_multiclass.csv	Multi-class classification of intrusion in IDS	Iman Sharafaldin et al. Toward generating a new intrusion detection dataset and intrusion traffic characterization. ICISSP, 1:108–116, 2018.
ids_unsw_nb_15_binary.csv	Binary detection of intrusion in IDS	Nour Moustafa and Jill Slay. Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In 2015 military communications and information systems conference (MilCIS), pages 1–6. IEEE, 2015.
ids_unsw_nb_15_multiclass.csv	Multi-class classification of intrusion in IDS	Nour Moustafa and Jill Slay. Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In 2015 military communications and information systems conference (MilCIS), pages 1–6. IEEE, 2015.
iot_23.csv	Binary detection of IoT malware	Sebastian Garcia et al. IoT-23: A labeled dataset with malicious and benign IoT network traffic, January 2020. More details here https://www.stratosphereips.org/datasets-iot23
ton_iot_binary.csv	Binary detection of IoT malware	Nour Moustafa. A new distributed architecture for evaluating ai-based security systems at the edge: Network ton iot datasets. Sustainable Cities and Society, 72:102994, 2021
ton_iot_multiclass.csv	Multi-class classification of IoT malware	Nour Moustafa. A new distributed architecture for evaluating ai-based security systems at the edge: Network ton iot datasets. Sustainable Cities and Society, 72:102994, 2021
tor_binary.csv	Binary detection of TOR	Arash Habibi Lashkari et al. Characterization of Tor Traffic using Time based Features. In ICISSP 2017, pages 253–262. SciTePress, 2017.
tor_multiclass.csv	Multi-class classification of TOR	Arash Habibi Lashkari et al. Characterization of Tor Traffic using Time based Features. In ICISSP 2017, pages 253–262. SciTePress, 2017.
vpn_iscx_binary.csv	Binary detection of VPN	Gerard Draper-Gil et al. Characterization of Encrypted and VPN Traffic Using Time-related. In ICISSP, pages 407–414, 2016.
vpn_iscx_multiclass.csv	Multi-class classification of VPN	Gerard Draper-Gil et al. Characterization of Encrypted and VPN Traffic Using Time-related. In ICISSP, pages 407–414, 2016.
vpn_vnat_binary.csv	Binary detection of VPN	Steven Jorgensen et al. Extensible Machine Learning for Encrypted Network Traffic Application Labeling via Uncertainty Quantification. CoRR, abs/2205.05628, 2022
vpn_vnat_multiclass.csv	Multi-class classification of VPN	Steven Jorgensen et al. Extensible Machine Learning for Encrypted Network Traffic Application Labeling via Uncertainty Quantification. CoRR, abs/2205.05628, 2022

Composed Encrypted Malicious Traffic Dataset for machine learning based...
data.mendeley.com
Updated Oct 12, 2021
Cite
Zihao Wang (2021). Composed Encrypted Malicious Traffic Dataset for machine learning based encrypted malicious traffic analysis. [Dataset]. http://doi.org/10.17632/ztyk4h3v6s.2
Unique identifier
https://doi.org/10.17632/ztyk4h3v6s.2
Dataset updated
Oct 12, 2021
Authors
Zihao Wang
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This traffic dataset contains balanced amounts of encrypted malicious and legitimate traffic for encrypted malicious traffic detection. It is a secondary dataset of CSV feature data composed from five public traffic datasets. The dataset was composed according to three criteria. The first criterion is to combine widely considered public datasets that contain both encrypted malicious and legitimate traffic in existing works, such as the Malware Capture Facility Project dataset and the CICIDS-2017 dataset. The second criterion is to ensure data balance, i.e., a balance of malicious and legitimate network traffic and a similar amount of network traffic contributed by each individual dataset. Thus, approximately equal proportions of malicious and legitimate traffic are extracted from each selected public dataset by random sampling. We also ensured that no selected public dataset contributes much more traffic than the others. The third criterion is that the dataset includes encrypted malicious and legitimate traffic from both conventional devices and IoT devices, as these devices are increasingly being deployed and work in the same environments, such as offices, homes, and other smart city settings.

Based on these criteria, five public datasets are selected. After data pre-processing, details of each selected public dataset and of the final composed dataset are given in the “Dataset Statistic Analysis Document”. The document summarizes the malicious and legitimate traffic size selected from each public dataset, the proportion of each dataset's selected traffic with respect to the total traffic size of the composed dataset (% w.r.t. the composed dataset), the proportion of selected encrypted traffic within each public dataset (% of selected public dataset), and the total traffic size of the composed dataset. The table in that document shows that each public dataset contributes approximately 20% of the composed dataset, except for CICIDS-2012 (due to its limited amount of encrypted malicious traffic). This achieves a balance across individual datasets and reduces bias towards traffic belonging to any one dataset during learning. The sizes of malicious and legitimate traffic are also almost the same, thus achieving class balance. The datasets made available here were prepared for encrypted malicious traffic detection. Since the dataset is used for machine learning model training, sample train and test sets are also provided. The train and test sets are split in a 1:4 ratio, with stratification applied during the split. These datasets can be used directly for machine learning or deep learning model training based on selected features.
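
A stratified split of the kind described above keeps the class proportions the same in both subsets. The sketch below is a plain-Python illustration, not the authors' procedure; it assumes the 1:4 ratio means one part test to four parts train (a 20% test fraction), which the description does not state explicitly. The function name, the seed, and the toy labels are illustrative.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # random sampling within each class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 50 legitimate (0) and 50 malicious (1) samples.
labels = [0] * 50 + [1] * 50
train_idx, test_idx = stratified_split(labels)
```

With these toy labels the test set holds 20 samples, 10 from each class, so both subsets stay class-balanced like the full set.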
Network Digital Twin-Generated Dataset for Machine Learning-Based Detection...
zenodo.org
data.niaid.nih.gov
bin
Updated Nov 13, 2024
Cite
Amit Karamchandani Batra; Javier Nuñez Fuente; Luis de la Cal García; Yenny Moreno Meneses; Alberto Mozo Velasco; Antonio Pastor Perales; Diego R. López (2024). Network Digital Twin-Generated Dataset for Machine Learning-Based Detection of Benign and Malicious Heavy Hitter Flows [Dataset]. http://doi.org/10.5281/zenodo.14134646
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.14134646
Dataset updated
Nov 13, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Amit Karamchandani Batra; Amit Karamchandani Batra; Javier Nuñez Fuente; Luis de la Cal García; Luis de la Cal García; Yenny Moreno Meneses; Alberto Mozo Velasco; Alberto Mozo Velasco; Antonio Pastor Perales; Antonio Pastor Perales; Diego R. López; Diego R. López; Javier Nuñez Fuente; Yenny Moreno Meneses
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jul 11, 2024
Description
The dataset used in this study is publicly available for research purposes. If you are using this dataset, please cite the following paper, which outlines the complete details of the dataset and the methodology used for its generation:

Amit Karamchandani, Javier Núñez, Luis de-la-Cal, Yenny Moreno, Alberto Mozo, Antonio Pastor, "On the Applicability of Network Digital Twins in Generating Synthetic Data for Heavy Hitter Discrimination," under submission.

This is a synthetic dataset generated to differentiate between benign and malicious heavy hitter (HH) flows within complex network environments. Heavy hitter flows, which include high-volume data transfers, can significantly impact network performance, leading to congestion and degraded quality of service. Distinguishing legitimate heavy hitter activity from malicious Distributed Denial-of-Service traffic is critical for network management and security, yet existing datasets lack the granularity needed to train machine learning models to make this distinction effectively.

To address this, a Network Digital Twin (NDT) approach was utilized to emulate realistic network conditions and traffic patterns, enabling automated generation of labeled data for both benign and malicious HH flows alongside regular traffic.

The feature set includes flow statistics commonly used in network analysis, such as:

Traffic protocol type,

Flow duration (the time between the initial and final packet in both directions),

Total count of payload packets transmitted in both directions,

Cumulative bytes transmitted in both directions,

Time discrepancy between the first packet observations at the source and destination,

Packet and byte transmission rates per second within each interval, and

Total packet and byte counts within each interval in both directions.
Spearman Correlation Heatmaps After Feature Selection
data.mendeley.com
Updated Nov 20, 2024
abdulkader hajjouz (2024). Spearman Correlation Heatmaps After Feature Selection [Dataset]. http://doi.org/10.17632/hxd7gmrvth.1
Explore at:
Unique identifier
https://doi.org/10.17632/hxd7gmrvth.1
Dataset updated
Nov 20, 2024
Authors
abdulkader hajjouz
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Description: This is a Spearman correlation heatmap of the 32 features used for machine learning and deep learning models in cybersecurity. The diagonal cells show perfect self-correlation (value = 1) and the off-diagonal cells show pairwise correlations between features. No strong correlations (close to 1 or -1) remain because the redundant or irrelevant features were removed, so each selected feature contributes information that the others do not duplicate. Feature selection is key in building cyber intrusion detection systems because it reduces computational overhead, simplifies the model, and improves accuracy and robustness. This is part of a systematic feature engineering process to optimize datasets for anomaly detection, network traffic analysis, and intrusion detection. Researchers in AI for cybersecurity can use this to build more interpretable and efficient models to detect threats in large-scale networks. The figure shows the importance of correlation analysis for high-dimensional datasets and contributes to cybersecurity, data science, and machine learning.
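For readers who want to reproduce this kind of matrix, the following is a minimal standard-library sketch of a pairwise Spearman correlation (Pearson correlation computed on average ranks). The function names and the three toy columns are hypothetical and are not part of the published figure or its code.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j share one rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


def spearman_matrix(columns):
    """columns maps feature name -> list of values; returns a nested dict of correlations."""
    names = list(columns)
    return {a: {b: spearman(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical toy features, only to show the shape of the result.
columns = {
    "pkt_len": [1, 2, 3, 4, 5],
    "duration": [1, 4, 9, 16, 25],
    "ttl": [5, 3, 4, 1, 2],
}
matrix = spearman_matrix(columns)
```

A pair such as `pkt_len` and `duration`, which rise together monotonically, is the sort of strongly correlated pair a selection step would treat as redundant.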

Why It Matters: Reduces overfitting in machine learning models. Improves computational efficiency for large-scale datasets. Enhances feature interpretability for robust cybersecurity solutions.

Keywords: Spearman Correlation Heatmap, Feature Selection, Intrusion Detection System, Cybersecurity, Machine Learning, Deep Learning, Anomaly Detection, Network Traffic Analysis, Artificial Intelligence in Cybersecurity, Dataset Optimization, Feature Engineering for Cyber Threats

References: This file pertains to our research study, which has been accepted for publication in the Scientific and Technical Journal of Information Technologies, Mechanics and Optics. The study is titled: "Enhancing and Extending CatBoost for Accurate Detection and Classification of DoS and DDoS Attack Subtypes in Network Traffic."

https://doi.org/10.1109/ICSIP61881.2024.10671552 https://doi.org/10.24143/2072-9502-2024-3-65-74
CTU-SME-11: a labeled dataset with real benign and malicious network traffic...
zenodo.org
data.niaid.nih.gov
+1 more
bin, bz2, csv, html
Updated May 26, 2023
Štěpán Bendl; Veronica Valeros; Sebastian Garcia (2023). CTU-SME-11: a labeled dataset with real benign and malicious network traffic mimicking a small medium-size enterprise environment [Dataset]. http://doi.org/10.5281/zenodo.7958259
Explore at:
Available download formats: csv, html, bz2, bin
Unique identifier
https://doi.org/10.5281/zenodo.7958259
Dataset updated
May 26, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Štěpán Bendl; Veronica Valeros; Sebastian Garcia
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
As technology advances, the number and complexity of cyber-attacks increase, forcing defense techniques to be updated and improved. To help develop effective tools for detecting security threats it is essential to have reliable and representative security datasets. Many existing security datasets have limitations that make them unsuitable for research, including lack of labels, unbalanced traffic, and outdated threats.

CTU-SME-11 is a labeled network dataset designed to address the limitations of previous datasets. The dataset was captured in a real network that mimics a small-medium enterprise setting. Raw network traffic (packets) was captured from 11 devices using tcpdump for a duration of 7 days, from the 20th to the 26th of February, 2023, in Prague, Czech Republic. The devices were chosen based on the enterprise setting and consist of IoT, desktop, and mobile devices, both bare metal and virtualized. The devices were infected with malware or exposed to Internet attacks, and factory reset to restore benign behavior.

The raw data was processed to generate network flows (Zeek logs), which were analyzed and labeled. The dataset contains two types of labels, a high-level label and a descriptive label, both assigned by experts. The former can take three values: benign, malicious, or background. The latter contains detailed information about the specific behavior observed in the network flows. The dataset contains 99 million labeled network flows. The overall compressed size of the dataset is 80GB and the uncompressed size is 170GB.

Federated Learning for Distributed Intrusion Detection Systems in Public...

zenodo.org
data.europa.eu

bz2

Updated May 23, 2023

Alireza Bakhshi Zadi Mahmoodi; Panos Kostakos (2023). Federated Learning for Distributed Intrusion Detection Systems in Public Networks - Validation Dataset [Dataset]. http://doi.org/10.5281/zenodo.7956304

Explore at:

Available download formats: bz2

Unique identifier

https://doi.org/10.5281/zenodo.7956304

Dataset updated

May 23, 2023

Dataset provided by

Zenodo (http://zenodo.org/)

Authors

Alireza Bakhshi Zadi Mahmoodi; Panos Kostakos

License

Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically

Description

This dataset was prepared and used as a validation set during the evaluation phase of "Meta IDS" to assess the performance of various machine learning models. It is now made available for interested users and researchers who seek a reliable and diverse dataset for training and testing their own custom models.

The validation dataset comprises a comprehensive collection of entries, each labeled to indicate whether the packet type is "malicious" or "benign." It covers complex design patterns that are commonly encountered in real-world applications. The dataset is designed to be representative, encompassing edge and fog layers that are in contact with the cloud layer, thereby enabling thorough testing and evaluation of different models. Each sample in the dataset is labeled with the corresponding ground truth, providing a reliable reference for model performance evaluation.

To ensure convenient distribution and storage, the dataset has been broken down into three separate batches, each containing a portion of the dataset. The three batches are provided as individual compressed files.

To extract the data, follow these instructions:

Download and install bzip2 (if not already installed) from the official website or your package manager.
Place the compressed dataset file in a directory of your choice.
Open a terminal or command prompt and navigate to the directory where the compressed dataset file is located.
Execute the following command to uncompress the dataset:
- bzip2 -d filename.bz2
Replace "filename.bz2" with the actual name of the compressed dataset file.
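Where the `bzip2` command-line tool is not available, the same extraction step can be sketched with Python's standard `bz2` module. The helper name `decompress_bz2` and the chunk size are illustrative choices, not part of the dataset's instructions.

```python
import bz2
import shutil
from pathlib import Path


def decompress_bz2(src, dst=None, chunk_size=1024 * 1024):
    """Stream-decompress a .bz2 file to disk without loading it all into memory."""
    src = Path(src)
    # Default output name: drop the trailing ".bz2" (e.g. "filename.bz2" -> "filename").
    dst = Path(dst) if dst is not None else src.with_suffix("")
    with bz2.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout, chunk_size)
    return dst


# Usage (the file name is a placeholder, as in the instructions above):
# decompress_bz2("filename.bz2")
```

Unlike `bzip2 -d`, this sketch leaves the compressed file in place, so the disk must hold both the archive and the extracted data.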

Once uncompressed, you will have access to the dataset in its original format for further exploration, analysis, and model training. Extraction requires approximately 800 GB of storage in total: approximately 302 GB for the first batch, 203 GB for the second batch, and 297 GB for the third batch.

The first batch contains 1,049,527,992 entries, the second batch contains 711,043,331 entries, and the third and last batch contains 1,029,303,062 entries. The following table provides the feature names along with their explanations and example values once the dataset is extracted.

Feature	Description	Example Value
ip.src	Source IP address in the packet	a05d4ecc38da01406c9635ec694917e969622160e728495e3169f62822444e17
ip.dst	Destination IP address in the packet	a52db0d87623d8a25d0db324d74f0900deb5ca4ec8ad9f346114db134e040ec5
frame.time_epoch	Epoch time of the frame	1676165569.930869
arp.hw.type	Hardware type	1
arp.hw.size	Hardware size	6
arp.proto.size	Protocol size	4
arp.opcode	Opcode	2
data.len	Length	2713
eth.dst.lg	Destination LG bit	1
eth.dst.ig	Destination IG bit	1
eth.src.lg	Source LG bit	1
eth.src.ig	Source IG bit	1
frame.offset_shift	Time shift for this packet	0
frame.len	Frame length on the wire	1208
frame.cap_len	Frame length stored into the capture file	215
frame.marked	Frame is marked	0
frame.ignored	Frame is ignored	0
frame.encap_type	Encapsulation type	1
gre	Generic Routing Encapsulation	'Generic Routing Encapsulation (IP)'
ip.version	Version	6
ip.hdr_len	Header length	24
ip.dsfield.dscp	Differentiated Services Codepoint	56
ip.dsfield.ecn	Explicit Congestion Notification	2
ip.len	Total length	614
ip.flags.rb	Reserved bit	0
ip.flags.df	Don't fragment	1
ip.flags.mf	More fragments	0
ip.frag_offset	Fragment offset	0
ip.ttl	Time to live	31
ip.proto	Protocol	47
ip.checksum.status	Header checksum status	2
tcp.srcport	TCP source port	53425
tcp.flags	Flags	0x00000098
tcp.flags.ns	Nonce	0
tcp.flags.cwr	Congestion Window Reduced (CWR)	1
udp.srcport	UDP source port	64413
udp.dstport	UDP destination port	54087
udp.stream	Stream index	1345
udp.length	Length	225
udp.checksum.status	Checksum status	3
packet_type	Type of the packet which is either "benign" or "malicious"	0

Furthermore, in compliance with the GDPR and to ensure the privacy of individuals, all IP addresses present in the dataset have been anonymized through hashing. This anonymization process helps protect the identity of individuals while preserving the integrity and utility of the dataset for research and model development purposes.
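As an illustration of hashing-based anonymization, the sketch below maps an IP address to a hexadecimal digest. The description does not name the hash function or say whether a salt was used; SHA-256 is assumed here only because the example `ip.src` and `ip.dst` values are 64 hexadecimal characters long, and the `salt` parameter is hypothetical.

```python
import hashlib


def anonymize_ip(ip, salt=b""):
    """Return a hex digest standing in for the IP address (SHA-256 is an assumption)."""
    return hashlib.sha256(salt + ip.encode("utf-8")).hexdigest()


# 192.0.2.0/24 is a documentation-only address range, used here as placeholder input.
token_a = anonymize_ip("192.0.2.1")
token_b = anonymize_ip("192.0.2.2")
```

The mapping is deterministic, so the same address always yields the same token and per-host patterns remain usable as features. An unsalted hash of the IPv4 space can be reversed by enumeration, which is why a secret salt is commonly added.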

Please note that while the dataset provides valuable insights and a solid foundation for machine learning tasks, it is not a substitute for extensive real-world data collection. However, it serves as a valuable resource for researchers, practitioners, and enthusiasts in the machine learning community, offering a compliant and anonymized dataset for developing and validating custom models in a specific problem domain.

By leveraging the validation dataset for machine learning model evaluation and custom model training, users can accelerate their research and development efforts, building upon the knowledge gained from my thesis while contributing to the advancement of the field.

MedSec-25: IoMT Cybersecurity Dataset
kaggle.com
zip
Updated Sep 8, 2025
Muhammad Abdullah (2025). MedSec-25: IoMT Cybersecurity Dataset [Dataset]. https://www.kaggle.com/datasets/abdullah001234/medsec-25-iomt-cybersecurity-dataset
Explore at:
Available download formats: zip (38496221 bytes)
Dataset updated
Sep 8, 2025
Authors
Muhammad Abdullah
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
Description
Overview

MedSec-25 is a comprehensive, labeled network traffic dataset designed specifically for the Internet of Medical Things (IoMT) in healthcare environments. It addresses the limitations of existing generic IoT datasets by capturing realistic traffic from a custom-built healthcare IoT lab that mimics real-world hospital operations. The dataset includes both benign (normal) traffic and malicious traffic from multi-staged attack campaigns inspired by the MITRE ATT&CK framework. This allows for the development and evaluation of machine learning-based intrusion detection systems (IDS) tailored to IoMT scenarios, where patient safety and data privacy are critical. The dataset was generated using a variety of medical sensors (e.g., ECG, EEG, HHI, Respiration, SpO2) and environmental sensors (e.g., thermistor, ultrasonic, PIR, flame) connected via Raspberry Pi nodes and an IoT server. Traffic was captured over 7.5 hours using tools like Wireshark and tcpdump, resulting in PCAPNG files. These were processed with CICFlowMeter to extract flow-based features, producing a cleaned CSV dataset with 554,534 bidirectional network flows and 84 features.

Key Highlights:

Realistic Setup: Built in a physical lab at Rochester Institute of Technology, Dubai, incorporating diverse IoMT devices, protocols (e.g., MQTT, SSH, Telnet, FTP, HTTP, DNS), and real-time patient interactions (anonymized to comply with privacy regulations like HIPAA).

Multi-Staged Attacks: Unlike datasets focusing on isolated attacks, MedSec-25 simulates full attack chains: Reconnaissance (e.g., SYN/TCP scans, OS fingerprinting), Initial Access (e.g., brute-force, malformed MQTT packets), Lateral Movement (e.g., exploiting vulnerabilities to pivot between devices), and Exfiltration (e.g., data theft via MQTT).

Imbalanced Nature: This is the cleaned (imbalanced) version of the dataset. Users may need to apply balancing techniques (e.g., SMOTE oversampling + random undersampling) for model training, as demonstrated in the associated paper.

Size and Quality: 554,534 rows, no duplicates, no missing values (except 111 NaNs in Flow Byts/s, ~0.02%, which can be handled via imputation). Data types include float64 (45 columns), int64 (34 columns), and object (5 columns: Flow ID, Src IP, Dst IP, Timestamp, Label).

Utility: Preliminary models trained on this dataset (e.g., KNN: 98.09% accuracy, Decision Tree: 98.35% accuracy) show excellent performance for detecting attack stages.

This dataset is ideal for researchers in cybersecurity, machine learning, and healthcare IoT, enabling the creation of an IDS that can detect attacks at different phases to prevent escalation.

Data Collection

Benign Traffic: Generated over two days with active sensors, services (HTTP dashboard for patient monitoring, SSH/Telnet for remote access, FTP for file transfers), and real users (students/faculty) interacting with medical devices. No personally identifiable information was stored.

Malicious Traffic: Two Kali Linux attacker machines simulated MITRE ATT&CK-inspired campaigns using tools like Nmap, Scapy, Metasploit, and custom Python scripts.

Capture Tools: Wireshark and tcpdump for PCAPNG files (total ~1GB: 600MB benign, 400MB malicious).

Processing: Combined PCAP files per label, extracted features with CICFlowMeter, labeled flows manually based on attack phases, and cleaned for ML readiness. The final cleaned CSV is ~350MB.

Features

The dataset includes 84 features extracted by CICFlowMeter, categorized as:

Identifiers: Flow ID, Src IP, Src Port, Dst IP, Dst Port, Protocol, Timestamp.

Time-Series Metrics: Flow Duration, Flow IAT Mean/Std/Max/Min, Fwd/Bwd IAT Tot/Mean/Std/Max/Min.

Size/Count Statistics: Tot Fwd/Bwd Pkts, TotLen Fwd/Bwd Pkts, Fwd/Bwd Pkt Len Max/Min/Mean/Std, Pkt Len Min/Max/Mean/Std/Var, Pkt Size Avg.

Flag Counts: Fwd/Bwd PSH/URG Flags, FIN/SYN/RST/PSH/ACK/URG/CWE/ECE Flag Cnt.

Rates and Ratios: Flow Byts/s, Flow Pkts/s, Fwd/Bwd Pkts/s, Down/Up Ratio, Active/Idle Mean/Std/Max/Min.

Segmentation and Others: Fwd/Bwd Seg Size Avg/Min, Subflow Fwd/Bwd Pkts/Byts, Init Fwd/Bwd Win Byts, Fwd Act Data Pkts, Fwd/Bwd Byts/b Avg, Fwd/Bwd Pkts/b Avg, Fwd/Bwd Blk Rate Avg.

Labels

The dataset is labeled with 5 classes representing benign behavior and attack stages:

Reconnaissance: 401,683 flows

Initial Access: 102,090 flows

Exfiltration: 25,915 flows

Lateral Movement: 12,498 flows

Benign: 12,348 flows

Note: The dataset is imbalanced, with Reconnaissance dominating. Apply balancing techniques for optimal ML performance.
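To make the imbalance concrete, the sketch below computes each class's share and an inverse-frequency weight from the flow counts listed above. Class weighting is an alternative to the resampling mentioned in the description (SMOTE plus undersampling), not the method used in the associated paper; the function name is illustrative.

```python
# Flow counts per label, copied from the list above.
counts = {
    "Reconnaissance": 401_683,
    "Initial Access": 102_090,
    "Exfiltration": 25_915,
    "Lateral Movement": 12_498,
    "Benign": 12_348,
}


def balanced_class_weights(counts):
    """Weight each class by total / (n_classes * class_count), so rare classes weigh more."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
weights = balanced_class_weights(counts)
```

The counts sum to the 554,534 rows reported for the dataset, and Reconnaissance alone accounts for roughly 72% of them.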

Usage

Preprocessing Suggestions: Encode categorical features (e.g., Protocol, Label) using LabelEncoder. Normalize numerical features with Min-Max Scaler or StandardScaler. Handle the minor NaNs in Flow Byts/s via mean imputation.

Model Training: Split into train/test (e.g., 80/20). Suitable for classification tasks w...
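The preprocessing suggestions above (mean imputation, Min-Max scaling, label encoding) can be sketched without external libraries. The helper names and the toy values are hypothetical; in practice the description points to scikit-learn's LabelEncoder and scalers.

```python
import math


def impute_mean(values):
    """Replace NaNs with the mean of the non-missing values."""
    present = [v for v in values if not math.isnan(v)]
    mean = sum(present) / len(present)
    return [mean if math.isnan(v) else v for v in values]


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


def label_encode(labels):
    """Map each distinct label to an integer code, using sorted class order."""
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    return [index[label] for label in labels], classes


# Toy column standing in for "Flow Byts/s" (values are made up).
flow_bytes_per_s = [120.0, float("nan"), 480.0, 300.0]
filled = impute_mean(flow_bytes_per_s)
scaled = min_max_scale(filled)
codes, classes = label_encode(["Benign", "Reconnaissance", "Benign", "Exfiltration"])
```

When a train/test split is used, the mean and the min/max are normally computed on the training portion only and then applied to the test portion, to avoid leaking test statistics into training.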
Machine Learning Security Market Research Report 2033
dataintelo.com
csv, pdf, pptx
Updated Sep 30, 2025
Dataintelo (2025). Machine Learning Security Market Research Report 2033 [Dataset]. https://dataintelo.com/report/machine-learning-security-market
Explore at:
Available download formats: csv, pptx, pdf
Dataset updated
Sep 30, 2025
Dataset authored and provided by
Dataintelo
License
https://dataintelo.com/privacy-and-policy
Time period covered
2024 - 2032
Area covered
Global
Description
Machine Learning Security Market Outlook

According to our latest research, the global machine learning security market size reached USD 7.42 billion in 2024, reflecting a robust expansion driven by escalating cyber threats and the increasing adoption of advanced digital technologies across various sectors. The market is projected to register a compelling CAGR of 23.6% during the forecast period, with the total market value anticipated to reach USD 61.58 billion by 2033. This exponential growth is primarily fueled by the urgent need for proactive security solutions capable of identifying and mitigating sophisticated cyberattacks in real time.

The primary growth driver for the machine learning security market is the rapid surge in cyberattacks, including ransomware, phishing, and advanced persistent threats. As organizations digitize operations and expand their cloud infrastructure, the attack surface increases, making traditional security measures insufficient. Machine learning-based security solutions can analyze vast datasets, detect anomalies, and respond to threats far more efficiently than conventional systems. The growing sophistication of cybercriminals, who now leverage artificial intelligence themselves, has made it imperative for enterprises to adopt equally advanced defense mechanisms. This environment of escalating cyber risk underpins the strong demand for machine learning security solutions across all major industries.

Another significant factor propelling the market is the increasing regulatory pressure and compliance requirements worldwide. Governments and regulatory bodies have introduced stringent data protection and privacy laws, such as the General Data Protection Regulation (GDPR) in Europe and the California Consumer Privacy Act (CCPA) in the United States. These regulations mandate organizations to implement robust security frameworks to safeguard sensitive information. Machine learning security tools not only enhance the effectiveness of compliance programs by automating threat detection and reporting but also help organizations avoid hefty penalties associated with data breaches. As businesses strive to maintain compliance and protect their reputations, investment in machine learning-driven security is becoming a strategic imperative.

Furthermore, the expansion of the Internet of Things (IoT), cloud computing, and remote work trends have introduced new vectors of vulnerability, necessitating adaptive security approaches. Machine learning security solutions can continuously learn and adapt to emerging threats, providing proactive protection for dynamic IT environments. The integration of machine learning into security operations centers (SOCs) enables real-time monitoring, rapid incident response, and predictive analytics. This capability is particularly valuable for sectors such as BFSI, healthcare, and manufacturing, where the cost of a security breach can be catastrophic. As digital transformation accelerates, the market is poised for sustained growth, with organizations prioritizing advanced security investments.

From a regional perspective, North America currently dominates the machine learning security market, accounting for the largest share in 2024, followed closely by Europe and Asia Pacific. The United States, in particular, is a frontrunner due to its mature cybersecurity ecosystem, high technology adoption rate, and presence of leading market players. Europe’s growth is bolstered by strict regulatory frameworks and strong investment in digital infrastructure, while Asia Pacific is experiencing the fastest growth rate, driven by rapid digitalization, increasing cybercrime, and a burgeoning IT sector. Latin America and the Middle East & Africa are also witnessing steady adoption as awareness of cybersecurity threats rises and governments take proactive measures to strengthen national cyber defenses.

Component Analysis

The machine learning security market is segmented by component into software, hardware, and services, with each playing a critical role in enabling comprehensive security solutions. Software forms the backbone of machine learning security, encompassing threat detection platforms, security analytics, and automated response systems. These solutions leverage advanced algorithms to process and analyze massive volumes of security data, identify patterns of malicious activity, and provide actionable insights. The software segment is anticipated to maintain its dominance througho
Data Collection & Requirements
zenodo.org
bin
Updated Mar 20, 2025
Zenodo (2025). Data Collection & Requirements [Dataset]. http://doi.org/10.5281/zenodo.14976797
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.14976797
Dataset updated
Mar 20, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Open-Source Cybersecurity and AI Security Datasets

This project provides a comprehensive collection of open-source datasets focused on cybersecurity threats and AI security vulnerabilities. The datasets are carefully selected to align with specific security threats, such as:

Data Exfiltration

Data Poisoning

Model Manipulation

Adversarial Examples

Model Inversion

Model Extraction

Spoofing Attacks

Unauthorized Access

Supply Chain Compromise

Dataset Collection

Each dataset includes a detailed description, source type, purpose, and direct access links for easy retrieval.

1. DARPA Intrusion Detection Dataset

Access Here

Description: Simulated network traffic with various cyber attack scenarios (e.g., DoS, Probe, U2R, R2L).

Format: PCAP

Update Frequency: Static

Use Cases: IDS training, intrusion detection research

2. MITRE ATT&CK Framework Data

Access Here

Description: A globally-accessible knowledge base of adversarial tactics, techniques, and procedures (TTPs).

Format: JSON, STIX

Update Frequency: Quarterly

Use Cases: Threat intelligence, adversary simulation, AI model defense

3. VirusShare Malware Repository

Access Here (Registration Required)

Description: Large-scale collection of live malware samples for security research.

Format: ZIP, PE files

Update Frequency: Weekly

Use Cases: AI-based malware detection, sandbox testing

4. National Vulnerability Database (NVD)

Access Here

Description: A repository of reported vulnerabilities (CVEs) with severity scores and descriptions.

Format: XML, JSON

Update Frequency: Daily

Use Cases: Vulnerability management, exploit mitigation research

5. LANL Unified Host and Network Dataset

Access Here

Description: Enterprise-scale dataset containing network and host logs with real-world red-team attack events.

Format: Text files

Update Frequency: Static

Use Cases: Insider threat detection, anomaly detection in network security

6. CIC-IDS2017 (Intrusion Detection Dataset)

Access Here

Description: Network traffic dataset with multiple attack types, including DDoS, brute-force, and infiltration attacks.

Format: PCAP, CSV

Update Frequency: Static

Use Cases: Machine learning-based intrusion detection, behavioral analysis

7. CIC IoV CAN Bus Dataset 2024

Access Here

Description: Vehicle CAN bus data, including spoofing and denial-of-service (DoS) attack traces.

Format: CSV, PCAP

Update Frequency: Static

Use Cases: Automotive security, AI-based anomaly detection in vehicles

8. ImageNet-A (Adversarial Image Dataset)

Access Here

Description: A dataset of real-world images that cause misclassification in deep learning models.

Format: JPEG

Update Frequency: Static

Use Cases: Adversarial robustness evaluation, model retraining for security
Additional file 1: Table S1. of Detection and prediction of insider threats...
springernature.figshare.com
figshare.com
xlsx
Updated May 30, 2023
Iffat Gheyas; Ali Abdallah (2023). Additional file 1: Table S1. of Detection and prediction of insider threats to cyber security: a systematic literature review and meta-analysis [Dataset]. http://doi.org/10.6084/m9.figshare.c.3639305_D1.v1
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.6084/m9.figshare.c.3639305_D1.v1
Dataset updated
May 30, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Iffat Gheyas; Ali Abdallah
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Overview of the proposed insider threat detection algorithms. (XLSX 25.6 kb)
Phishing Detection Dataset
data.mendeley.com
Updated Jun 7, 2023
Maruf Tamal (2023). Phishing Detection Dataset [Dataset]. http://doi.org/10.17632/6tm2d6sz7p.1
Explore at:
Unique identifier
https://doi.org/10.17632/6tm2d6sz7p.1
Dataset updated
Jun 7, 2023
Authors
Maruf Tamal
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This dataset consists of 247950 instances, of which 128541 are from phishing URLs and 119409 are from legitimate URLs. It encompasses 41 features and 1 target variable (0=legitimate,1=phishing), making it suitable for implementing machine learning algorithms to identify phishing attacks.

Dinesh Naveen Kumar Samudrala (2025). Cybersecurity 🪪 Intrusion 🦠 Detection Dataset [Dataset]. https://www.kaggle.com/datasets/dnkumars/cybersecurity-intrusion-detection-dataset

Cybersecurity 🪪 Intrusion 🦠 Detection Dataset

Prevent before attack

Explore at:

Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Feb 10, 2025

Dataset provided by

Kaggle (http://kaggle.com/)

Authors

Dinesh Naveen Kumar Samudrala

License

MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically

Description

This Cybersecurity Intrusion Detection Dataset is designed for detecting cyber intrusions based on network traffic and user behavior. Below, I’ll explain each aspect in detail, including the dataset structure, feature importance, possible analysis approaches, and how it can be used for machine learning.

1. Understanding the Features

The dataset consists of network-based and user behavior-based features. Each feature provides valuable information about potential cyber threats.

A. Network-Based Features

These features describe network-level information such as packet size, protocol type, and encryption methods.

network_packet_size (Packet Size in Bytes)
- Represents the size of network packets, ranging from 64 to 1500 bytes.
- Packets on the lower end (~64 bytes) may indicate control messages, while larger packets (~1500 bytes) often carry bulk data.
- Attackers may use abnormally small or large packets for reconnaissance or exploitation attempts.
protocol_type (Communication Protocol)
- The protocol used in the session: TCP, UDP, or ICMP.
- TCP (Transmission Control Protocol): Reliable, connection-oriented (common for HTTP, HTTPS, SSH).
- UDP (User Datagram Protocol): Faster but less reliable (used for VoIP, streaming).
- ICMP (Internet Control Message Protocol): Used for network diagnostics (ping); often abused in Denial-of-Service (DoS) attacks.
encryption_used (Encryption Protocol)
- Values: AES, DES, None.
- AES (Advanced Encryption Standard): Strong encryption, commonly used.
- DES (Data Encryption Standard): Older encryption, weaker security.
- None: Indicates unencrypted communication, which can be risky.
- Attackers might use no encryption to avoid detection or weak encryption to exploit vulnerabilities.

B. User Behavior-Based Features

These features track user activities, such as login attempts and session duration.

login_attempts (Number of Logins)
- High values might indicate brute-force attacks (repeated login attempts).
- Typical users have 1–3 login attempts, while an attack may have hundreds or thousands.
session_duration (Session Length in Seconds)
- A very long session might indicate unauthorized access or persistence by an attacker.
- Attackers may try to stay connected to maintain access.
failed_logins (Failed Login Attempts)
- High failed login counts indicate credential stuffing or dictionary attacks.
- Many failed attempts followed by a successful login could suggest an account was compromised.
unusual_time_access (Login Time Anomaly)
- A binary flag (0 or 1) indicating whether access happened at an unusual time.
- Attackers often operate outside normal business hours to evade detection.
ip_reputation_score (Trustworthiness of IP Address)
- A score from 0 to 1, where higher values indicate a more suspicious (less trustworthy) IP address.
- IP addresses associated with botnets, spam, or previous attacks tend to have higher scores.
browser_type (User’s Browser)
- Common browsers: Chrome, Firefox, Edge, Safari.
- Unknown: Could be an indicator of automated scripts or bots.
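
Most models need the categorical columns (`protocol_type`, `encryption_used`, `browser_type`) converted to numbers. One common option is one-hot encoding, sketched below with the standard library only. The category lists are copied from the descriptions above and are assumed to be exhaustive; the column order in `NUMERIC` and the helper names are illustrative choices, not something the dataset prescribes.

```python
# Category lists taken from the feature descriptions; assumed to be exhaustive.
PROTOCOLS = ["TCP", "UDP", "ICMP"]
ENCRYPTION = ["AES", "DES", "None"]
BROWSERS = ["Chrome", "Firefox", "Edge", "Safari", "Unknown"]
NUMERIC = [
    "network_packet_size",
    "login_attempts",
    "session_duration",
    "failed_logins",
    "unusual_time_access",
    "ip_reputation_score",
]


def one_hot(value, categories):
    """Return a list with 1.0 at the matching category and 0.0 elsewhere."""
    return [1.0 if value == category else 0.0 for category in categories]


def to_vector(record: dict) -> list[float]:
    """Flatten one row into numeric features followed by one-hot blocks."""
    vector = [float(record[name]) for name in NUMERIC]
    vector += one_hot(record["protocol_type"], PROTOCOLS)
    vector += one_hot(record["encryption_used"], ENCRYPTION)
    vector += one_hot(record["browser_type"], BROWSERS)
    return vector


# Made-up row for illustration only.
example = {
    "network_packet_size": 599,
    "protocol_type": "UDP",
    "encryption_used": "None",
    "login_attempts": 4,
    "session_duration": 492.9,
    "failed_logins": 1,
    "unusual_time_access": 0,
    "ip_reputation_score": 0.6,
    "browser_type": "Edge",
}
vector = to_vector(example)
```

A value outside the listed categories produces an all-zero block here, which silently hides unexpected data; validating the categories first may be preferable.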

2. Target Variable (`attack_detected`)

Binary classification: 1 means an attack was detected, 0 means normal activity.
The dataset is useful for supervised machine learning, where a model learns from labeled attack patterns.
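
Before training on a binary target, it usually helps to check how the two classes are distributed, since a heavily skewed label makes plain accuracy misleading. A minimal sketch follows; the `labels` list is made up and does not reflect the real class ratio of this dataset.

```python
from collections import Counter


def class_balance(labels):
    """Return the fraction of rows per label value."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Made-up attack_detected values for illustration only.
labels = [0, 0, 0, 1, 0, 1, 0, 0]
balance = class_balance(labels)  # {0: 0.75, 1: 0.25}
```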

3. Possible Use Cases

This dataset can be used for intrusion detection systems (IDS) and cybersecurity research. Some key applications include:

A. Machine Learning-Based Intrusion Detection

Supervised Learning Approaches
- Classification Models (Logistic Regression, Decision Trees, Random Forest, XGBoost, SVM)
- Train the model using labeled data (attack_detected as the target).
- Evaluate using accuracy, precision, recall, F1-score.
Deep Learning Approaches
- Use Neural Networks (DNN, LSTM, CNN) for pattern recognition.
- LSTMs work well for time-series-based network traffic analysis.
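
The evaluation metrics named under the supervised approaches can be computed directly from the confusion-matrix counts. The sketch below uses made-up predictions rather than output from a real model, and treats `1` (attack) as the positive class.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # attacks caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # attacks missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # normal, correctly passed

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels and predictions for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
# tp=3, fp=1, fn=1, tn=3, so all four metrics equal 0.75 here
```

For intrusion detection, recall (the share of attacks caught) and precision (the share of alerts that are real) often matter more than accuracy.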

B. Anomaly Detection (Unsupervised Learning)

If attack labels are missing, anomaly detection can be used:
- Autoencoders: Learn normal traffic and flag anomalies.
- Isolation Forest: Detects outliers based on feature isolation.
- One-Class SVM: Learns normal behavior and detects deviations.
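
As one illustration of the unsupervised route, the sketch below fits scikit-learn's `IsolationForest` on synthetic "normal" sessions and scores a single suspicious one. It assumes NumPy and scikit-learn are installed. The three chosen columns, the synthetic distributions, and the suspicious row are all assumptions made for the example, not properties of the real data.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-in for normal traffic.
# Columns: failed_logins, ip_reputation_score, session_duration (seconds).
normal_sessions = np.column_stack([
    rng.poisson(1.0, size=300),
    rng.uniform(0.0, 0.4, size=300),
    rng.normal(600.0, 150.0, size=300),
])

# Fit on normal behaviour only; no attack labels are used.
model = IsolationForest(n_estimators=100, random_state=0)
model.fit(normal_sessions)

# A made-up session that is extreme in every column.
suspicious_session = np.array([[40, 0.95, 7200.0]])
prediction = model.predict(suspicious_session)  # -1 = outlier, 1 = inlier
```

On real data, the fraction of rows flagged depends on the `contamination` setting, so the default would need tuning against whatever labels or analyst feedback are available.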

C. Rule-Based Detection

If certain thresholds are met (e.g., failed_logins > 10 & ip_reputation_score > 0.8), an alert is triggered.
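
The example rule above can be written as a small predicate. This is a minimal sketch: the threshold values come from the example in the text, the `&` there is read as a logical AND, and the function name `should_alert` is hypothetical.

```python
# Thresholds from the example rule; in practice they would be tuned.
FAILED_LOGIN_THRESHOLD = 10
IP_REPUTATION_THRESHOLD = 0.8


def should_alert(record: dict) -> bool:
    """Return True when both conditions of the example rule are met."""
    return (
        record["failed_logins"] > FAILED_LOGIN_THRESHOLD
        and record["ip_reputation_score"] > IP_REPUTATION_THRESHOLD
    )


# Made-up rows for illustration only.
alert = should_alert({"failed_logins": 12, "ip_reputation_score": 0.9})
no_alert = should_alert({"failed_logins": 12, "ip_reputation_score": 0.5})
```

Both comparisons are strict, so a row with exactly 10 failed logins does not trigger the alert.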

4. Challenges & Considerations

Adversarial Attacks: Attackers may modify traffic to evade detection.
Concept Drift: Cyber threats...
