100+ datasets found
  1. Cybersecurity 🪪 Intrusion 🦠 Detection Dataset

    • kaggle.com
    Updated Feb 10, 2025
    Cite
    Dinesh Naveen Kumar Samudrala (2025). Cybersecurity 🪪 Intrusion 🦠 Detection Dataset [Dataset]. https://www.kaggle.com/datasets/dnkumars/cybersecurity-intrusion-detection-dataset
    Dataset updated
    Feb 10, 2025
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Dinesh Naveen Kumar Samudrala
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This Cybersecurity Intrusion Detection Dataset is designed for detecting cyber intrusions based on network traffic and user behavior. Below, I’ll explain each aspect in detail, including the dataset structure, feature importance, possible analysis approaches, and how it can be used for machine learning.

    1. Understanding the Features

    The dataset consists of network-based and user behavior-based features. Each feature provides valuable information about potential cyber threats.

    A. Network-Based Features

    These features describe network-level information such as packet size, protocol type, and encryption methods.

    1. network_packet_size (Packet Size in Bytes)

      • Represents the size of network packets, ranging from 64 to 1500 bytes.
      • Packets on the lower end (~64 bytes) may indicate control messages, while larger packets (~1500 bytes) often carry bulk data.
      • Attackers may use abnormally small or large packets for reconnaissance or exploitation attempts.
    2. protocol_type (Communication Protocol)

      • The protocol used in the session: TCP, UDP, or ICMP.
      • TCP (Transmission Control Protocol): Reliable, connection-oriented (common for HTTP, HTTPS, SSH).
      • UDP (User Datagram Protocol): Faster but less reliable (used for VoIP, streaming).
      • ICMP (Internet Control Message Protocol): Used for network diagnostics (ping); often abused in Denial-of-Service (DoS) attacks.
    3. encryption_used (Encryption Protocol)

      • Values: AES, DES, None.
      • AES (Advanced Encryption Standard): Strong encryption, commonly used.
      • DES (Data Encryption Standard): Older encryption, weaker security.
      • None: Indicates unencrypted communication, which can be risky.
      • Attackers might use no encryption to avoid detection or weak encryption to exploit vulnerabilities.

    B. User Behavior-Based Features

    These features track user activities, such as login attempts and session duration.

    1. login_attempts (Number of Logins)

      • High values might indicate brute-force attacks (repeated login attempts).
      • Typical users have 1–3 login attempts, while an attack may have hundreds or thousands.
    2. session_duration (Session Length in Seconds)

      • A very long session might indicate unauthorized access or persistence by an attacker.
      • Attackers may try to stay connected to maintain access.
    3. failed_logins (Failed Login Attempts)

      • High failed login counts indicate credential stuffing or dictionary attacks.
      • Many failed attempts followed by a successful login could suggest an account was compromised.
    4. unusual_time_access (Login Time Anomaly)

      • A binary flag (0 or 1) indicating whether access happened at an unusual time.
      • Attackers often operate outside normal business hours to evade detection.
    5. ip_reputation_score (Trustworthiness of IP Address)

      • A score from 0 to 1, where higher values indicate suspicious activity.
      • IP addresses associated with botnets, spam, or previous attacks tend to have higher scores.
    6. browser_type (User’s Browser)

      • Common browsers: Chrome, Firefox, Edge, Safari.
      • Unknown: Could be an indicator of automated scripts or bots.

    2. Target Variable (attack_detected)

    • Binary classification: 1 means an attack was detected, 0 means normal activity.
    • The dataset is useful for supervised machine learning, where a model learns from labeled attack patterns.

    3. Possible Use Cases

    This dataset can be used for intrusion detection systems (IDS) and cybersecurity research. Some key applications include:

    A. Machine Learning-Based Intrusion Detection

    1. Supervised Learning Approaches

      • Classification Models (Logistic Regression, Decision Trees, Random Forest, XGBoost, SVM)
      • Train the model using labeled data (attack_detected as the target).
      • Evaluate using accuracy, precision, recall, F1-score.
    2. Deep Learning Approaches

      • Use Neural Networks (DNN, LSTM, CNN) for pattern recognition.
      • LSTMs work well for time-series-based network traffic analysis.
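
    The evaluation step listed under the supervised approaches can be made concrete with a minimal sketch. The labels and predictions below are made-up illustrative values, not taken from the dataset, and `binary_metrics` is a hypothetical helper name; label 1 is assumed to mean attack_detected = 1.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary labels (1 = attack)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # attacks correctly flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # normal traffic correctly passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed attacks

    total = len(pairs)
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative values only (not real dataset rows).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

    For intrusion detection, recall (missed attacks) and precision (false alarms) are usually worth reporting alongside accuracy, since accuracy alone can hide either failure mode.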

    B. Anomaly Detection (Unsupervised Learning)

    If attack labels are missing, anomaly detection can be used:

    • Autoencoders: Learn normal traffic and flag anomalies.
    • Isolation Forest: Detects outliers based on feature isolation.
    • One-Class SVM: Learns normal behavior and detects deviations.

    C. Rule-Based Detection

    • If certain thresholds are met (e.g., failed_logins > 10 & ip_reputation_score > 0.8), an alert is triggered.
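
    The threshold rule above could be sketched as follows. The function name `rule_alert` and the sample sessions are hypothetical; only the column names and the two thresholds come from the description.

```python
def rule_alert(record, max_failed_logins=10, min_ip_score=0.8):
    """Return True when both thresholds from the example rule are exceeded."""
    return (
        record["failed_logins"] > max_failed_logins
        and record["ip_reputation_score"] > min_ip_score
    )


# Made-up sessions for illustration (not real dataset rows).
sessions = [
    {"failed_logins": 12, "ip_reputation_score": 0.91},  # both conditions met
    {"failed_logins": 12, "ip_reputation_score": 0.30},  # low-risk IP
    {"failed_logins": 1, "ip_reputation_score": 0.95},   # few failed logins
]
alerts = [rule_alert(s) for s in sessions]
print(alerts)
```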

    4. Challenges & Considerations

    • Adversarial Attacks: Attackers may modify traffic to evade detection.
    • Concept Drift: Cyber threats...
  2. 🌐 Global Cybersecurity Threats (2015-2024)

    • kaggle.com
    zip
    Updated Mar 16, 2025
    Cite
    Atharva Soundankar (2025). 🌐 Global Cybersecurity Threats (2015-2024) [Dataset]. https://www.kaggle.com/datasets/atharvasoundankar/global-cybersecurity-threats-2015-2024
    Available download formats: zip (48178 bytes)
    Dataset updated
    Mar 16, 2025
    Authors
    Atharva Soundankar
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    📂

    The Global Cybersecurity Threats Dataset (2015-2024) provides extensive data on cyberattacks, malware types, targeted industries, and affected countries. It is designed for threat intelligence analysis, cybersecurity trend forecasting, and machine learning model development to enhance global digital security.

    📊 Column Descriptions

    Column Name | Description
    Country | Country where the attack occurred
    Year | Year of the incident
    Threat Type | Type of cybersecurity threat (e.g., Malware, DDoS)
    Attack Vector | Method of attack (e.g., Phishing, SQL Injection)
    Affected Industry | Industry targeted (e.g., Finance, Healthcare)
    Data Breached (GB) | Volume of data compromised
    Financial Impact ($M) | Estimated financial loss in millions
    Severity Level | Low, Medium, High, Critical
    Response Time (Hours) | Time taken to mitigate the attack
    Mitigation Strategy | Countermeasures taken
  3. Large-Scale Network Cyberattacks Multiclass Dataset 2024 (LSNM2024)

    • data.mendeley.com
    Updated Jul 1, 2024
    + more versions
    Cite
    Qasem Abu Al-Haija (2024). Large-Scale Network Cyberattacks Multiclass Dataset 2024 (LSNM2024) [Dataset]. http://doi.org/10.17632/7pzyfvv9jn.1
    Dataset updated
    Jul 1, 2024
    Authors
    Qasem Abu Al-Haija
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present a novel, large-scale multiclass dataset intended to improve the detection of suspicious network traffic. The newly generated dataset contains up-to-date samples and features and is publicly available to help mitigate future cyberattacks using machine learning methods. Specifically, 6 million traffic samples with 60 features are collected and organized into two balanced classes: 50% normal traffic and 50% anomaly (attack) traffic. The anomaly traffic comprises 15 different attacks: MITM-ARP-SPOOFING, SSH-BRUTE FORCE, FTP-BRUTE FORCE, DDOS-ICMP, DDOS-RAWIP, DDOS-UDP, DOS, EXPLOITING-FTP, FUZZING, ICMP FLOOD, SYN-FLOOD, PORT SCANNING, REMOTE CODE EXECUTION, SQL INJECTION, and XSS.

    For details, please refer to and cite our article: Q. Abu Al-Haija, Z. Masoud, A. Yasin, K. Alesawi, Y. Alkarnawi, "Revolutionizing Threat Hunting in Communication Networks: Introducing a Cutting-Edge Large-Scale Multiclass Dataset", 15th International Conference on Information and Communication Systems (ICICS 2024), IEEE, Aug. 2024.

  4. Cyber Threat Detection

    • kaggle.com
    zip
    Updated Oct 23, 2024
    Cite
    Hussain Afzaal 03 (2024). Cyber Threat Detection [Dataset]. https://www.kaggle.com/datasets/hussainsheikh03/cyber-threat-detection
    Available download formats: zip (51424 bytes)
    Dataset updated
    Oct 23, 2024
    Authors
    Hussain Afzaal 03
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The CyberFedDefender dataset is a simulated dataset designed for developing and testing federated learning-based cyber threat detection models. This dataset is tailored for research and experimentation in distributed anomaly detection and privacy-preserving cybersecurity frameworks. It includes traffic features commonly used in intrusion detection systems (IDS) with a focus on cloud and edge computing environments. Each record represents network traffic metadata, with labeled instances of both normal and malicious activities, making it ideal for machine learning applications in cybersecurity.

    Dataset Features

    The dataset consists of 1,430 instances, with 23 features including information on packet size, duration, bytes sent/received, flow statistics, and attack labels. It covers common cyberattacks such as DDoS, Brute Force, and Ransomware, along with normal network traffic.

    Feature List:

    • Timestamp: The time when the network traffic was recorded.
    • Source_IP: The IP address of the source machine.
    • Destination_IP: The IP address of the destination machine.
    • Protocol: The network protocol used (TCP, UDP, ICMP).
    • Packet_Length: The length of the packet in bytes.
    • Duration: The duration of the connection in seconds.
    • Source_Port: The port number used by the source.
    • Destination_Port: The port number used by the destination.
    • Bytes_Sent: Total bytes sent from the source to the destination.
    • Bytes_Received: Total bytes received by the destination from the source.
    • Flags: TCP flags indicating the connection's state (e.g., SYN, ACK).
    • Flow_Packets/s: Number of packets per second in the traffic flow.
    • Flow_Bytes/s: Number of bytes per second in the traffic flow.
    • Avg_Packet_Size: Average size of the packets during the connection.
    • Total_Fwd_Packets: Total number of forward packets.
    • Total_Bwd_Packets: Total number of backward packets.
    • Fwd_Header_Length: Length of the forward packet headers.
    • Bwd_Header_Length: Length of the backward packet headers.
    • Sub_Flow_Fwd_Bytes: Bytes sent in the forward subflow.
    • Sub_Flow_Bwd_Bytes: Bytes received in the backward subflow.
    • Inbound: Indicates whether the traffic is inbound (1) or outbound (0).
    • Attack_Type: Type of cyberattack or normal traffic (e.g., DDoS, Brute Force, Ransomware, Normal).
    • Label: Binary classification label where 1 indicates malicious traffic and 0 represents normal traffic.

    Usage

    This dataset is designed for research in the following areas:

    • Federated learning for cyber threat detection
    • Privacy-preserving machine learning in cybersecurity
    • Intrusion detection systems (IDS)
    • Distributed anomaly detection in cloud and edge environments

    Researchers can leverage this dataset to build and evaluate models for anomaly detection, perform comparative analysis, or enhance the robustness of federated learning frameworks in cybersecurity applications.

  5. Synthetic Cybersecurity Logs for Anomaly Detection

    • kaggle.com
    zip
    Updated Dec 16, 2024
    Cite
    fcWebDev (2024). Synthetic Cybersecurity Logs for Anomaly Detection [Dataset]. https://www.kaggle.com/datasets/fcwebdev/synthetic-cybersecurity-logs-for-anomaly-detection
    Available download formats: zip (160070 bytes)
    Dataset updated
    Dec 16, 2024
    Authors
    fcWebDev
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset contains synthetic HTTP log data designed for cybersecurity analysis, particularly for anomaly detection tasks.

    Dataset Features

    • Timestamp: Simulated time for each log entry.
    • IP_Address: Randomized IP addresses to simulate network traffic.
    • Request_Type: Common HTTP methods (GET, POST, PUT, DELETE).
    • Status_Code: HTTP response status codes (e.g., 200, 404, 403, 500).
    • Anomaly_Flag: Binary flag indicating anomalies (1 = anomaly, 0 = normal).
    • User_Agent: Simulated user agents for device and browser identification.
    • Session_ID: Random session IDs to simulate user activity.
    • Location: Geographic locations of requests.

    Applications

    This dataset can be used for:

    • Anomaly Detection: Identify suspicious network activity or attacks.
    • Machine Learning: Train models for classification tasks (e.g., detect anomalies).
    • Cybersecurity Analysis: Analyze HTTP traffic patterns and identify threats.

    Example Challenge

    Build a machine learning model to predict the Anomaly_Flag based on the features provided.

  6. cyber-security-events

    • huggingface.co
    + more versions
    Cite
    Yuriy Medvedev, cyber-security-events [Dataset]. https://huggingface.co/datasets/pyToshka/cyber-security-events
    Authors
    Yuriy Medvedev
    License

    https://choosealicense.com/licenses/bsd-3-clause/

    Description

    Dataset Description

    This dataset contains cybersecurity events collected from honeypot infrastructure. The data has been processed and feature-engineered for machine learning applications in threat detection and security analytics.

    Feature Categories

    Network Features

    • Connection flow statistics (bytes, packets, duration)
    • Protocol-specific metrics
    • Geographic information
    • IP reputation data

    Behavioral Features

    Session… See the full description on the dataset page: https://huggingface.co/datasets/pyToshka/cyber-security-events.

  7. Edge-IIoTset: A New Comprehensive Realistic Cyber Security Dataset of IoT...

    • ieee-dataport.org
    Updated Nov 19, 2025
    Cite
    Mohamed Amine FERRAG (2025). Edge-IIoTset: A New Comprehensive Realistic Cyber Security Dataset of IoT and IIoT Applications: Centralized and Federated Learning [Dataset]. https://ieee-dataport.org/documents/edge-iiotset-new-comprehensive-realistic-cyber-security-dataset-iot-and-iiot-applications
    Dataset updated
    Nov 19, 2025
    Authors
    Mohamed Amine FERRAG
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    namely

  8. Dataset to Train Intrusion Detection Systems based on Machine Learning...

    • zenodo.org
    application/gzip, bin +1
    Updated Nov 11, 2024
    + more versions
    Cite
    Esteban Damian Gutierrez Mlot (2024). Dataset to Train Intrusion Detection Systems based on Machine Learning Models for Electrical Substations [Dataset]. http://doi.org/10.5281/zenodo.14066350
    Available download formats: bin, application/gzip, zip
    Dataset updated
    Nov 11, 2024
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Esteban Damian Gutierrez Mlot
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    DATASET

    This dataset is part of the research work titled "A Dataset to Train Intrusion Detection Systems based on Machine Learning Models for Electrical Substations," which is currently awaiting approval for publication. The dataset has been meticulously curated to support the development and evaluation of machine learning models tailored for detecting cyber intrusions in the context of electrical substations. It is intended to facilitate research and advancements in cybersecurity for critical infrastructure, specifically focusing on real-world scenarios within electrical substation environments. We encourage its use for experimentation and benchmarking in related areas of study.

    The following sections list the content of the dataset generated.

    Data

    • raw
      • iec6180
        • attack-free-data
          • capture61850-attackfree.pcap (from real substation)
          • capture61850-attackfree_PTP.pcap
          • capture61850-attackfree_normalfault.pcap
        • attack-data
          • capture61850-floodattack_withfault.pcap
          • capture61850-floodattack_withoutfault.pcap
          • capture61850-fuzzyattack_withfault.pcap
          • capture61850-fuzzyattack_withoutfault.pcap
          • capture61850-replay.pcap
          • capture61850-ptpattack.pcap
      • iec104
        • attack-free-data
          • capture104-attackfree.pcap (from real substation)
        • attack-data
          • capture104-dosattack.pcap
          • capture104-floodattack.pcap
          • capture104-fuzzyattack.pcap
          • capture104-iec104starvationattack.pcap
          • capture104-mitmattack.pcap
          • capture104-ntpddosattack.pcap
          • capture104-portscanattack.pcap
    • processed
      • iec6180
        • attack-free-data
          • capture61850-attackfree.csv
          • capture61850-attackfree_PTP.csv
          • capture61850-attackfree_normalfault.csv
        • attack-data
          • capture61850-floodattack_withfault.csv
          • capture61850-floodattack_withoutfault.csv
          • capture61850-fuzzyattack_withfault.csv
          • capture61850-fuzzyattack_withoutfault.csv
          • capture61850-replay.csv
          • capture61850-ptpattack.csv
        • headers_iec61850[all].txt
      • iec104
        • attack-free-data
          • capture104-attackfree.csv
        • attack-data
          • capture104-dosattack.csv
          • capture104-floodattack.csv
          • capture104-fuzzyattack.csv
          • capture104-iec104starvationattack.csv
          • capture104-mitmattack.csv
          • capture104-ntpddosattack.csv
          • capture104-portscanattack.csv
        • headers_iec104[all].txt

    Description

    • file type: the prefix is capture61850 or capture104, depending on whether the file contains network captures of the IEC 61850 or IEC 104 protocol.
    • attack: either attackfree or the attack name is appended to the file name.
    • function: optionally, details about the captured functionality (normalfault) or a specific protocol capture (PTP).
    • file extension: PCAP (network capture) or CSV (flow file).
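
    As an illustration of this naming convention, the following sketch splits a capture file name into its parts. The regular expression and the helper name `parse_capture_name` are assumptions inferred from the file names listed above, not something shipped with the dataset; header files such as headers_iec104[all].txt are deliberately rejected.

```python
import re

# Pattern inferred from the listed names, e.g. capture61850-attackfree_PTP.pcap
FILENAME_RE = re.compile(
    r"^capture(?P<protocol>\d+)-(?P<attack>[A-Za-z0-9]+)"
    r"(?:_(?P<function>[A-Za-z0-9]+))?\.(?P<extension>pcap|csv)$"
)


def parse_capture_name(filename):
    """Split a capture file name into protocol, attack, function and extension."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"not a capture file name: {filename!r}")
    parts = match.groupdict()
    parts["protocol"] = "IEC" + parts["protocol"]         # e.g. IEC61850 or IEC104
    parts["is_attack_free"] = parts["attack"] == "attackfree"
    return parts


print(parse_capture_name("capture61850-attackfree_PTP.pcap"))
print(parse_capture_name("capture104-iec104starvationattack.csv"))
```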

    Results

    • results
      • test1-iec104
        • model-test1-iec104.pkl
        • test1-iec104.log
      • test1-iec61850
        • model-test1-iec61850.pkl
        • test1-iec61850.log
      • test2-iec61850
        • model-test2-iec61850.pkl
        • test2-iec61850.log


    Description

    The outcomes of different test executions are available as follows:

    • test1-iec104: IEC 104 protocol for all attacks and attack free scenario
    • test1-iec61850: IEC 61850 protocol for fuzzy attack with fault injection and attack free scenario
    • test2-iec61850: IEC 61850 protocol for fuzzy attack normal operation and attack free scenario


    Each test consists of the model results in Python pickle format (with a .pkl extension) and a detailed description of the execution conditions in an output log file (with a .log extension).

    Source Code

    A snapshot of the source code used to process these files is included under the filename source-code-cybersecurity-datasets-v2.0.zip. For an updated version, please visit the GitHub repository.

  9. StealthPhisher Phishing Attack Dataset

    • data.mendeley.com
    Updated Nov 7, 2025
    Cite
    Tanmay Jha (2025). StealthPhisher Phishing Attack Dataset [Dataset]. http://doi.org/10.17632/m2479kmybx.2
    Dataset updated
    Nov 7, 2025
    Authors
    Tanmay Jha
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The StealthPhisher Phishing Attack Dataset, generated at the Cybersecurity Lab, GLA University, Mathura, is a large, diverse, and recent phishing attack dataset developed to address the evolving nature of phishing attacks. It comprises 336,749 records, including 160,943 legitimate URLs and 175,806 phishing URLs, collected from reliable sources such as PhishTank. Reflecting the most recent phishing tactics, this dataset serves as a valuable resource for training and evaluating AI-based phishing detection systems.

    Key features include URL-based attributes (e.g., length, TLD type, IP presence), statistical metrics (e.g., Shannon Entropy, Kolmogorov Complexity, Fractal Dimension), and HTML/interaction-based features (e.g., popups, redirects, forms). These multidimensional attributes provide comprehensive insights into phishing behavior, enabling accurate and robust threat detection. Designed to capture real-world scenarios, the dataset equips AI models to recognize both traditional and emerging phishing strategies effectively.
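
    To give a sense of what a statistical URL feature looks like, here is a minimal sketch of character-level Shannon entropy. How the dataset authors actually computed their entropy feature is not stated here, so the character-level definition, the helper name `shannon_entropy`, and the example URLs are all assumptions.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Character-level Shannon entropy of a string, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    # H = -sum(p * log2(p)) over the relative frequency p of each character
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical URLs for illustration only.
plain = "example.com/login"
random_looking = "x9q2-zk7v.example.com/a8Fj3LpQ"
print(round(shannon_entropy(plain), 3), round(shannon_entropy(random_looking), 3))
```

    Strings with many distinct, evenly distributed characters score higher, which is why entropy is often used as a signal for machine-generated URLs.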

    This dataset was generated as part of the research work presented in the article “StealthPhisher: A Defensive Framework against Phishing Attack using Hybrid Deep Learning and GenAI,” published in Expert Systems with Applications (https://doi.org/10.1016/j.eswa.2025.130205). Researchers using this dataset in their research work are kindly requested to cite this article.

  10. Network traffic datasets created by Single Flow Time Series Analysis

    • zenodo.org
    • data.niaid.nih.gov
    csv, pdf
    Updated Jul 11, 2024
    Cite
    Josef Koumar; Karel Hynek; Tomáš Čejka (2024). Network traffic datasets created by Single Flow Time Series Analysis [Dataset]. http://doi.org/10.5281/zenodo.8035724
    Available download formats: csv, pdf
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Josef Koumar; Karel Hynek; Tomáš Čejka
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Network traffic datasets created by Single Flow Time Series Analysis

    Datasets were created for the paper: Network Traffic Classification based on Single Flow Time Series Analysis -- Josef Koumar, Karel Hynek, Tomáš Čejka -- which was published at The 19th International Conference on Network and Service Management (CNSM) 2023. Please cite usage of our datasets as:

    J. Koumar, K. Hynek and T. Čejka, "Network Traffic Classification Based on Single Flow Time Series Analysis," 2023 19th International Conference on Network and Service Management (CNSM), Niagara Falls, ON, Canada, 2023, pp. 1-7, doi: 10.23919/CNSM59352.2023.10327876.

    This Zenodo repository contains 23 datasets created from 15 well-known published datasets which are cited in the table below. Each dataset contains 69 features created by Time Series Analysis of Single Flow Time Series. The detailed description of features from datasets is in the file: feature_description.pdf

    The following table describes each dataset file:

    File name | Detection problem | Citation of original raw dataset
    botnet_binary.csv | Binary detection of botnet | S. García et al. An Empirical Comparison of Botnet Detection Methods. Computers & Security, 45:100–123, 2014.
    botnet_multiclass.csv | Multi-class classification of botnet | S. García et al. An Empirical Comparison of Botnet Detection Methods. Computers & Security, 45:100–123, 2014.
    cryptomining_design.csv | Binary detection of cryptomining; the design part | Richard Plný et al. Datasets of Cryptomining Communication. Zenodo, October 2022
    cryptomining_evaluation.csv | Binary detection of cryptomining; the evaluation part | Richard Plný et al. Datasets of Cryptomining Communication. Zenodo, October 2022
    dns_malware.csv | Binary detection of malware DNS | Samaneh Mahdavifar et al. Classifying Malicious Domains using DNS Traffic Analysis. In DASC/PiCom/CBDCom/CyberSciTech 2021, pages 60–67. IEEE, 2021.
    doh_cic.csv | Binary detection of DoH | Mohammadreza MontazeriShatoori et al. Detection of doh tunnels using time-series classification of encrypted traffic. In DASC/PiCom/CBDCom/CyberSciTech 2020, pages 63–70. IEEE, 2020
    doh_real_world.csv | Binary detection of DoH | Kamil Jeřábek et al. Collection of datasets with DNS over HTTPS traffic. Data in Brief, 42:108310, 2022
    dos.csv | Binary detection of DoS | Nickolaos Koroniotis et al. Towards the development of realistic botnet dataset in the Internet of Things for network forensic analytics: Bot-IoT dataset. Future Gener. Comput. Syst., 100:779–796, 2019.
    edge_iiot_binary.csv | Binary detection of IoT malware | Mohamed Amine Ferrag et al. Edge-iiotset: A new comprehensive realistic cyber security dataset of iot and iiot applications: Centralized and federated learning, 2022.
    edge_iiot_multiclass.csv | Multi-class classification of IoT malware | Mohamed Amine Ferrag et al. Edge-iiotset: A new comprehensive realistic cyber security dataset of iot and iiot applications: Centralized and federated learning, 2022.
    https_brute_force.csv | Binary detection of HTTPS Brute Force | Jan Luxemburk et al. HTTPS Brute-force dataset with extended network flows, November 2020
    ids_cic_binary.csv | Binary detection of intrusion in IDS | Iman Sharafaldin et al. Toward generating a new intrusion detection dataset and intrusion traffic characterization. ICISSp, 1:108–116, 2018.
    ids_cic_multiclass.csv | Multi-class classification of intrusion in IDS | Iman Sharafaldin et al. Toward generating a new intrusion detection dataset and intrusion traffic characterization. ICISSp, 1:108–116, 2018.
    ids_unsw_nb_15_binary.csv | Binary detection of intrusion in IDS | Nour Moustafa and Jill Slay. Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In 2015 military communications and information systems conference (MilCIS), pages 1–6. IEEE, 2015.
    ids_unsw_nb_15_multiclass.csv | Multi-class classification of intrusion in IDS | Nour Moustafa and Jill Slay. Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In 2015 military communications and information systems conference (MilCIS), pages 1–6. IEEE, 2015.
    iot_23.csv | Binary detection of IoT malware | Sebastian Garcia et al. IoT-23: A labeled dataset with malicious and benign IoT network traffic, January 2020. More details here https://www.stratosphereips.org/datasets-iot23
    ton_iot_binary.csv | Binary detection of IoT malware | Nour Moustafa. A new distributed architecture for evaluating ai-based security systems at the edge: Network ton iot datasets. Sustainable Cities and Society, 72:102994, 2021
    ton_iot_multiclass.csv | Multi-class classification of IoT malware | Nour Moustafa. A new distributed architecture for evaluating ai-based security systems at the edge: Network ton iot datasets. Sustainable Cities and Society, 72:102994, 2021
    tor_binary.csv | Binary detection of TOR | Arash Habibi Lashkari et al. Characterization of Tor Traffic using Time based Features. In ICISSP 2017, pages 253–262. SciTePress, 2017.
    tor_multiclass.csv | Multi-class classification of TOR | Arash Habibi Lashkari et al. Characterization of Tor Traffic using Time based Features. In ICISSP 2017, pages 253–262. SciTePress, 2017.
    vpn_iscx_binary.csv | Binary detection of VPN | Gerard Draper-Gil et al. Characterization of Encrypted and VPN Traffic Using Time-related. In ICISSP, pages 407–414, 2016.
    vpn_iscx_multiclass.csv | Multi-class classification of VPN | Gerard Draper-Gil et al. Characterization of Encrypted and VPN Traffic Using Time-related. In ICISSP, pages 407–414, 2016.
    vpn_vnat_binary.csv | Binary detection of VPN | Steven Jorgensen et al. Extensible Machine Learning for Encrypted Network Traffic Application Labeling via Uncertainty Quantification. CoRR, abs/2205.05628, 2022
    vpn_vnat_multiclass.csv | Multi-class classification of VPN | Steven Jorgensen et al. Extensible Machine Learning for Encrypted Network Traffic Application Labeling via Uncertainty Quantification. CoRR, abs/2205.05628, 2022

  11. Composed Encrypted Malicious Traffic Dataset for machine learning based...

    • data.mendeley.com
    Updated Oct 12, 2021
    + more versions
    Cite
    Zihao Wang (2021). Composed Encrypted Malicious Traffic Dataset for machine learning based encrypted malicious traffic analysis. [Dataset]. http://doi.org/10.17632/ztyk4h3v6s.2
    Dataset updated
    Oct 12, 2021
    Authors
    Zihao Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This traffic dataset contains balanced amounts of encrypted malicious and legitimate traffic for encrypted malicious traffic detection. It is a secondary CSV feature dataset composed from five public traffic datasets. The dataset was composed according to three criteria. The first is to combine widely used public datasets that contain both encrypted malicious and legitimate traffic, such as the Malware Capture Facility Project dataset and the CICIDS-2017 dataset. The second is to ensure data balance, i.e., balance between malicious and legitimate network traffic and a similar amount of traffic contributed by each individual dataset. To this end, approximately equal proportions of malicious and legitimate traffic are extracted from each selected public dataset by random sampling, and no single public dataset contributes much more traffic than the others. The third is that the dataset includes encrypted malicious and legitimate traffic from both conventional and IoT devices, as these devices are increasingly deployed together in the same environments, such as offices, homes, and other smart city settings.

    Based on the criteria, 5 public datasets are selected. After data pre-processing, details of each selected public dataset and the final composed dataset are shown in “Dataset Statistic Analysis Document”. The document summarized the malicious and legitimate traffic size we selected from each selected public dataset, proportions of selected traffic size from each selected public dataset with respect to the total traffic size of the composed dataset (% w.r.t the composed dataset), proportions of selected encrypted traffic size from each selected public dataset (% of selected public dataset), and total traffic size of the composed dataset. From the table, we are able to observe that each public dataset equally contributes to approximately 20% of the composed dataset, except for CICDS-2012 (due to its limited number of encrypted malicious traffic). This achieves a balance across individual datasets and reduces bias towards traffic belonging to any dataset during learning. We can also observe that the size of malicious and legitimate traffic are almost the same, thus achieving class balance. The datasets now made available were prepared aiming at encrypted malicious traffic detection. Since the dataset is used for machine learning model training, a sample of train and test sets are also provided. The train and test datasets are separated based on 1:4 and stratification is applied during data split. Such datasets can be used directly for machine or deep learning model training based on selected features.
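
    The stratified split mentioned above can be sketched in a few lines. This is an illustration only, not the authors' procedure: it assumes that "1:4" means one part test to four parts train (a 20% test share), which the description does not spell out, and the record layout and the `stratified_split` helper are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_of, test_fraction=0.2, seed=0):
    """Split rows into (train, test) so each label keeps roughly the same share in both."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)                       # random sampling within one class
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Toy data: 50 malicious and 50 legitimate flows (hypothetical records).
flows = [{"id": i, "label": "malicious" if i % 2 else "legitimate"} for i in range(100)]
train, test = stratified_split(flows, label_of=lambda r: r["label"])
print(len(train), len(test))
```

    Splitting inside each class, rather than over the whole table, is what keeps the malicious-to-legitimate ratio the same in both parts.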

  12. Network Digital Twin-Generated Dataset for Machine Learning-Based Detection...

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Nov 13, 2024
    Cite
    Amit Karamchandani Batra; Javier Nuñez Fuente; Luis de la Cal García; Yenny Moreno Meneses; Alberto Mozo Velasco; Antonio Pastor Perales; Diego R. López (2024). Network Digital Twin-Generated Dataset for Machine Learning-Based Detection of Benign and Malicious Heavy Hitter Flows [Dataset]. http://doi.org/10.5281/zenodo.14134646
    Available download formats: bin
    Dataset updated
    Nov 13, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Amit Karamchandani Batra; Javier Nuñez Fuente; Luis de la Cal García; Yenny Moreno Meneses; Alberto Mozo Velasco; Antonio Pastor Perales; Diego R. López
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jul 11, 2024
    Description

    The dataset used in this study is publicly available for research purposes. If you are using this dataset, please cite the following paper, which outlines the complete details of the dataset and the methodology used for its generation:

    Amit Karamchandani, Javier Núñez, Luis de-la-Cal, Yenny Moreno, Alberto Mozo, Antonio Pastor, "On the Applicability of Network Digital Twins in Generating Synthetic Data for Heavy Hitter Discrimination," under submission.

    This is a synthetic dataset generated to differentiate between benign and malicious heavy hitter (HH) flows within complex network environments. HH flows, which include high-volume data transfers, can significantly impact network performance, leading to congestion and degraded quality of service. Distinguishing legitimate HH activity from malicious Distributed Denial-of-Service traffic is critical for network management and security, yet existing datasets lack the granularity needed to train machine learning models to make this distinction effectively.

    To address this, a Network Digital Twin (NDT) approach was utilized to emulate realistic network conditions and traffic patterns, enabling automated generation of labeled data for both benign and malicious HH flows alongside regular traffic.

    The feature set includes flow statistics commonly used in network analysis, such as:

    • Traffic protocol type,
    • Flow duration (the time between the initial and final packet in both directions),
    • Total count of payload packets transmitted in both directions,
    • Cumulative bytes transmitted in both directions,
    • Time discrepancy between the first packet observations at the source and destination,
    • Packet and byte transmission rates per second within each interval, and
    • Total packet and byte counts within each interval in both directions.
  13. Spearman Correlation Heatmaps After Feature Selection

    • data.mendeley.com
    Updated Nov 20, 2024
    Cite
    abdulkader hajjouz (2024). Spearman Correlation Heatmaps After Feature Selection [Dataset]. http://doi.org/10.17632/hxd7gmrvth.1
    Dataset updated
    Nov 20, 2024
    Authors
    abdulkader hajjouz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description: This is a Spearman correlation heatmap of the 32 features used for machine learning and deep learning models in cybersecurity. The diagonal cells show perfect self-correlation (value = 1) and the off-diagonal cells show pairwise correlations between features. No strong correlations (close to 1 or -1) remain because redundant or irrelevant features were removed, so each selected feature brings unique and independent information to the model. Feature selection is key in building cyber intrusion detection systems, as it reduces computational overhead, simplifies the model, and improves accuracy and robustness. The heatmap is part of a systematic feature engineering process to optimize datasets for anomaly detection, network traffic analysis, and intrusion detection. Researchers in AI for cybersecurity can use it to build more interpretable and efficient models for detecting intrusions in large-scale networks. The figure shows the importance of correlation analysis for high-dimensional datasets and contributes to cybersecurity, data science, and machine learning.
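
    For readers who want to reproduce this kind of matrix, the sketch below computes Spearman correlation as the Pearson correlation of average ranks, using only the standard library. The feature names and values are invented for illustration and are not taken from the 32-feature set.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        return float("nan")   # a constant column has no defined correlation
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def correlation_matrix(columns):
    names = list(columns)
    return {a: {b: spearman(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical feature columns.
features = {
    "flow_duration": [1.0, 2.0, 3.0, 4.0, 5.0],
    "total_bytes": [1.0, 4.0, 9.0, 16.0, 25.0],
    "idle_time": [5.0, 3.0, 4.0, 1.0, 2.0],
}
matrix = correlation_matrix(features)
```

    Because Spearman works on ranks, `flow_duration` and `total_bytes` come out perfectly correlated even though their relationship is not linear, which is the kind of redundancy a feature-selection step would flag.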

    Why It Matters: Reduces overfitting in machine learning models. Improves computational efficiency for large-scale datasets. Enhances feature interpretability for robust cybersecurity solutions.

    Keywords: Spearman Correlation Heatmap, Feature Selection, Intrusion Detection System, Cybersecurity, Machine Learning, Deep Learning, Anomaly Detection, Network Traffic Analysis, Artificial Intelligence in Cybersecurity, Dataset Optimization, Feature Engineering for Cyber Threats

    References: This file pertains to our research study, which has been accepted for publication in the Scientific and Technical Journal of Information Technologies, Mechanics and Optics. The study is titled: "Enhancing and Extending CatBoost for Accurate Detection and Classification of DoS and DDoS Attack Subtypes in Network Traffic."

    https://doi.org/10.1109/ICSIP61881.2024.10671552 https://doi.org/10.24143/2072-9502-2024-3-65-74

  14. CTU-SME-11: a labeled dataset with real benign and malicious network traffic...

    • zenodo.org
    • data.niaid.nih.gov
    bin, bz2, csv, html
    Updated May 26, 2023
    Cite
    Štěpán Bendl; Veronica Valeros; Sebastian Garcia (2023). CTU-SME-11: a labeled dataset with real benign and malicious network traffic mimicking a small medium-size enterprise environment [Dataset]. http://doi.org/10.5281/zenodo.7958259
    Available download formats: csv, html, bz2, bin
    Dataset updated
    May 26, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Štěpán Bendl; Veronica Valeros; Sebastian Garcia
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    As technology advances, the number and complexity of cyber-attacks increase, forcing defense techniques to be updated and improved. To help develop effective tools for detecting security threats it is essential to have reliable and representative security datasets. Many existing security datasets have limitations that make them unsuitable for research, including lack of labels, unbalanced traffic, and outdated threats.

    CTU-SME-11 is a labeled network dataset designed to address the limitations of previous datasets. The dataset was captured in a real network that mimics a small-medium enterprise setting. Raw network traffic (packets) was captured from 11 devices using tcpdump for a duration of 7 days, from the 20th to the 26th of February, 2023, in Prague, Czech Republic. The devices were chosen based on the enterprise setting and consist of IoT, desktop, and mobile devices, both bare metal and virtualized. The devices were infected with malware or exposed to Internet attacks, and factory reset to restore benign behavior.

    The raw data was processed to generate network flows (Zeek logs), which were analyzed and labeled. The dataset contains two types of labels, a high-level label and a descriptive label, both assigned by experts. The former can take three values: benign, malicious, or background. The latter contains detailed information about the specific behavior observed in the network flows. The dataset contains 99 million labeled network flows. The overall compressed size of the dataset is 80GB and the uncompressed size is 170GB.

  15. Federated Learning for Distributed Intrusion Detection Systems in Public...

    • zenodo.org
    • data.europa.eu
    bz2
    Updated May 23, 2023
    Cite
    Alireza Bakhshi Zadi Mahmoodi; Panos Kostakos (2023). Federated Learning for Distributed Intrusion Detection Systems in Public Networks - Validation Dataset [Dataset]. http://doi.org/10.5281/zenodo.7956304
    Available download formats: bz2
    Dataset updated
    May 23, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alireza Bakhshi Zadi Mahmoodi; Panos Kostakos
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset was prepared and used as a validation set during the evaluation phase of "Meta IDS" to assess the performance of various machine learning models. It is now made available to users and researchers who seek a reliable and diverse dataset for training and testing their own custom models.

    The validation dataset comprises labeled entries, each indicating whether the packet type is "malicious" or "benign." It covers complex design patterns that are commonly encountered in real-world applications. The dataset is designed to be representative, encompassing edge and fog layers that are in contact with the cloud layer, thereby enabling thorough testing and evaluation of different models. Each sample in the dataset is labeled with the corresponding ground truth, providing a reliable reference for model performance evaluation.

    To ensure convenient distribution and storage, the dataset has been broken down into three separate batches, each containing a portion of the dataset and provided as an individual compressed file.

    To extract the data, follow these steps:

    • Download and install bzip2 (if not already installed) from the official website or your package manager.
    • Place the compressed dataset file in a directory of your choice.
    • Open a terminal or command prompt and navigate to the directory where the compressed dataset file is located.
    • Execute the following command to uncompress the dataset:
      • bzip2 -d filename.bz2
    • Replace "filename.bz2" with the actual name of the compressed dataset file.
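
    Where the bzip2 command-line tool is not available, the same extraction can be done with Python's standard `bz2` module. The sketch below streams the file in chunks so that a batch of several hundred gigabytes never has to fit in memory; the `decompress_bz2` helper name and the example file name are placeholders.

```python
import bz2
import shutil
from pathlib import Path


def decompress_bz2(src, dst=None, chunk_size=1024 * 1024):
    """Stream-decompress src (a .bz2 file) to dst and return the output path.

    If dst is omitted, the trailing ".bz2" suffix is dropped from src.
    """
    src = Path(src)
    dst = Path(dst) if dst else src.with_suffix("")
    with bz2.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout, chunk_size)   # copies chunk by chunk
    return dst


# Usage (placeholder file name):
# decompress_bz2("filename.bz2")
```

    Unlike `bzip2 -d`, which removes the compressed file after extraction unless `-k` is given, this helper leaves the original archive in place, so both copies need disk space.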

    Once uncompressed, you will have access to the dataset in its original format for further exploration, analysis, and model training. Extraction requires approximately 800 GB of storage in total: approximately 302 GB for the first batch, 203 GB for the second, and 297 GB for the third.

    The first batch contains 1,049,527,992 entries, the second 711,043,331 entries, and the third and last 1,029,303,062 entries. The following table lists each feature name with its explanation and an example value from the extracted dataset.

    Feature | Description | Example Value
    ip.src | Source IP address in the packet | a05d4ecc38da01406c9635ec694917e969622160e728495e3169f62822444e17
    ip.dst | Destination IP address in the packet | a52db0d87623d8a25d0db324d74f0900deb5ca4ec8ad9f346114db134e040ec5
    frame.time_epoch | Epoch time of the frame | 1676165569.930869
    arp.hw.type | Hardware type | 1
    arp.hw.size | Hardware size | 6
    arp.proto.size | Protocol size | 4
    arp.opcode | Opcode | 2
    data.len | Length | 2713
    eth.dst.lg | Destination LG bit | 1
    eth.dst.ig | Destination IG bit | 1
    eth.src.lg | Source LG bit | 1
    eth.src.ig | Source IG bit | 1
    frame.offset_shift | Time shift for this packet | 0
    frame.len | Frame length on the wire | 1208
    frame.cap_len | Frame length stored into the capture file | 215
    frame.marked | Frame is marked | 0
    frame.ignored | Frame is ignored | 0
    frame.encap_type | Encapsulation type | 1
    gre | Generic Routing Encapsulation | 'Generic Routing Encapsulation (IP)'
    ip.version | Version | 6
    ip.hdr_len | Header length | 24
    ip.dsfield.dscp | Differentiated Services Codepoint | 56
    ip.dsfield.ecn | Explicit Congestion Notification | 2
    ip.len | Total length | 614
    ip.flags.rb | Reserved bit | 0
    ip.flags.df | Don't fragment | 1
    ip.flags.mf | More fragments | 0
    ip.frag_offset | Fragment offset | 0
    ip.ttl | Time to live | 31
    ip.proto | Protocol | 47
    ip.checksum.status | Header checksum status | 2
    tcp.srcport | TCP source port | 53425
    tcp.flags | Flags | 0x00000098
    tcp.flags.ns | Nonce | 0
    tcp.flags.cwr | Congestion Window Reduced (CWR) | 1
    udp.srcport | UDP source port | 64413
    udp.dstport | UDP destination port | 54087
    udp.stream | Stream index | 1345
    udp.length | Length | 225
    udp.checksum.status | Checksum status | 3
    packet_type | Type of the packet which is either "benign" or "malicious" | 0

    Furthermore, in compliance with the GDPR and to ensure the privacy of individuals, all IP addresses present in the dataset have been anonymized through hashing. This anonymization process helps protect the identity of individuals while preserving the integrity and utility of the dataset for research and model development purposes.
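
    The description does not say which hash function or keying scheme was used. The 64-hexadecimal-character values in the `ip.src` and `ip.dst` columns are consistent with a 256-bit digest, so the following sketch assumes a keyed SHA-256 (HMAC) purely for illustration; the key and the `pseudonymize_ip` helper are hypothetical.

```python
import hashlib
import hmac


def pseudonymize_ip(ip, key):
    """Return a 64-character hex digest that stands in for an IP address."""
    return hmac.new(key, ip.encode("ascii"), hashlib.sha256).hexdigest()


# Hypothetical secret; a real key would be generated randomly and kept private.
secret_key = b"example-key-keep-this-private"

token_a = pseudonymize_ip("192.0.2.10", secret_key)
token_b = pseudonymize_ip("192.0.2.11", secret_key)
print(token_a)
```

    The same address always maps to the same token, so flows can still be grouped by endpoint. A keyed hash is used here because the IPv4 space is small enough that an unkeyed hash could be reversed by hashing every possible address.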

    Please note that while the dataset provides valuable insights and a solid foundation for machine learning tasks, it is not a substitute for extensive real-world data collection. However, it serves as a valuable resource for researchers, practitioners, and enthusiasts in the machine learning community, offering a compliant and anonymized dataset for developing and validating custom models in a specific problem domain.

    By leveraging the validation dataset for machine learning model evaluation and custom model training, users can accelerate their research and development efforts, building upon the knowledge gained from my thesis while contributing to the advancement of the field.

  16. MedSec-25: IoMT Cybersecurity Dataset

    • kaggle.com
    zip
    Updated Sep 8, 2025
    Cite
    Muhammad Abdullah (2025). MedSec-25: IoMT Cybersecurity Dataset [Dataset]. https://www.kaggle.com/datasets/abdullah001234/medsec-25-iomt-cybersecurity-dataset
    Available download formats: zip (38496221 bytes)
    Dataset updated
    Sep 8, 2025
    Authors
    Muhammad Abdullah
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Overview

    MedSec-25 is a comprehensive, labeled network traffic dataset designed specifically for the Internet of Medical Things (IoMT) in healthcare environments. It addresses the limitations of existing generic IoT datasets by capturing realistic traffic from a custom-built healthcare IoT lab that mimics real-world hospital operations. The dataset includes both benign (normal) traffic and malicious traffic from multi-staged attack campaigns inspired by the MITRE ATT&CK framework. This allows for the development and evaluation of machine learning-based intrusion detection systems (IDS) tailored to IoMT scenarios, where patient safety and data privacy are critical. The dataset was generated using a variety of medical sensors (e.g., ECG, EEG, HHI, Respiration, SpO2) and environmental sensors (e.g., thermistor, ultrasonic, PIR, flame) connected via Raspberry Pi nodes and an IoT server. Traffic was captured over 7.5 hours using tools like Wireshark and tcpdump, resulting in PCAPNG files. These were processed with CICFlowMeter to extract flow-based features, producing a cleaned CSV dataset with 554,534 bidirectional network flows and 84 features.

    Key Highlights:

    Realistic Setup: Built in a physical lab at Rochester Institute of Technology, Dubai, incorporating diverse IoMT devices, protocols (e.g., MQTT, SSH, Telnet, FTP, HTTP, DNS), and real-time patient interactions (anonymized to comply with privacy regulations like HIPAA).

    Multi-Staged Attacks: Unlike datasets focusing on isolated attacks, MedSec-25 simulates full attack chains: Reconnaissance (e.g., SYN/TCP scans, OS fingerprinting), Initial Access (e.g., brute-force, malformed MQTT packets), Lateral Movement (e.g., exploiting vulnerabilities to pivot between devices), and Exfiltration (e.g., data theft via MQTT).

    Imbalanced Nature: This is the cleaned (imbalanced) version of the dataset. Users may need to apply balancing techniques (e.g., SMOTE oversampling + random undersampling) for model training, as demonstrated in the associated paper.

    Size and Quality: 554,534 rows, no duplicates, no missing values (except 111 NaNs in Flow Byts/s, ~0.02%, which can be handled via imputation). Data types include float64 (45 columns), int64 (34 columns), and object (5 columns: Flow ID, Src IP, Dst IP, Timestamp, Label).

    Utility: Preliminary models trained on this dataset (e.g., KNN: 98.09% accuracy, Decision Tree: 98.35% accuracy) show excellent performance for detecting attack stages.

    This dataset is ideal for researchers in cybersecurity, machine learning, and healthcare IoT, enabling the creation of an IDS that can detect attacks at different phases to prevent escalation.

    Data Collection

    Benign Traffic: Generated over two days with active sensors, services (HTTP dashboard for patient monitoring, SSH/Telnet for remote access, FTP for file transfers), and real users (students/faculty) interacting with medical devices. No personally identifiable information was stored.

    Malicious Traffic: Two Kali Linux attacker machines simulated MITRE ATT&CK-inspired campaigns using tools like Nmap, Scapy, Metasploit, and custom Python scripts.

    Capture Tools: Wireshark and tcpdump for PCAPNG files (total ~1GB: 600MB benign, 400MB malicious).

    Processing: Combined PCAP files per label, extracted features with CICFlowMeter, labeled flows manually based on attack phases, and cleaned for ML readiness. The final cleaned CSV is ~350MB.

    Features

    The dataset includes 84 features extracted by CICFlowMeter, categorized as:

    Identifiers: Flow ID, Src IP, Src Port, Dst IP, Dst Port, Protocol, Timestamp.

    Time-Series Metrics: Flow Duration, Flow IAT Mean/Std/Max/Min, Fwd/Bwd IAT Tot/Mean/Std/Max/Min.

    Size/Count Statistics: Tot Fwd/Bwd Pkts, TotLen Fwd/Bwd Pkts, Fwd/Bwd Pkt Len Max/Min/Mean/Std, Pkt Len Min/Max/Mean/Std/Var, Pkt Size Avg.

    Flag Counts: Fwd/Bwd PSH/URG Flags, FIN/SYN/RST/PSH/ACK/URG/CWE/ECE Flag Cnt.

    Rates and Ratios: Flow Byts/s, Flow Pkts/s, Fwd/Bwd Pkts/s, Down/Up Ratio, Active/Idle Mean/Std/Max/Min.

    Segmentation and Others: Fwd/Bwd Seg Size Avg/Min, Subflow Fwd/Bwd Pkts/Byts, Init Fwd/Bwd Win Byts, Fwd Act Data Pkts, Fwd/Bwd Byts/b Avg, Fwd/Bwd Pkts/b Avg, Fwd/Bwd Blk Rate Avg.

    Labels

    The dataset is labeled with 5 classes representing benign behavior and attack stages:

    • Reconnaissance: 401,683 flows
    • Initial Access: 102,090 flows
    • Exfiltration: 25,915 flows
    • Lateral Movement: 12,498 flows
    • Benign: 12,348 flows

    Note: The dataset is imbalanced, with Reconnaissance dominating. Apply balancing techniques for optimal ML performance.
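
    To make the imbalance concrete, the sketch below computes each class's share of the 554,534 flows and a set of inverse-frequency class weights. Class weighting is a different option from the SMOTE oversampling and random undersampling mentioned in the description; it is shown here only as one simple way to compensate, and the weighting formula is a common convention rather than something the dataset authors prescribe.

```python
# Flow counts per class, as listed in the dataset description.
counts = {
    "Reconnaissance": 401_683,
    "Initial Access": 102_090,
    "Exfiltration": 25_915,
    "Lateral Movement": 12_498,
    "Benign": 12_348,
}

total = sum(counts.values())

# Share of all flows that belongs to each class.
shares = {label: n / total for label, n in counts.items()}

# Inverse-frequency weights: total / (number_of_classes * class_count).
# Rare classes get weights above 1, dominant classes below 1.
weights = {label: total / (len(counts) * n) for label, n in counts.items()}

for label in counts:
    print(f"{label:17s} share={shares[label]:.4f} weight={weights[label]:.3f}")
```
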

    Usage

    Preprocessing Suggestions: Encode categorical features (e.g., Protocol, Label) using LabelEncoder. Normalize numerical features with Min-Max Scaler or StandardScaler. Handle the minor NaNs in Flow Byts/s via mean imputation.
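
    The three preprocessing suggestions can be illustrated without any third-party library. The helpers below are simplified stand-ins for scikit-learn's LabelEncoder and Min-Max scaling and for mean imputation, applied to a tiny made-up sample rather than the real CSV.

```python
import math


def label_encode(values):
    """Map each distinct label to an integer, assigning codes in sorted label order."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values], classes


def impute_mean(values):
    """Replace NaN entries with the mean of the values that are present."""
    present = [v for v in values if not math.isnan(v)]
    mean = sum(present) / len(present)
    return [mean if math.isnan(v) else v for v in values]


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]   # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


# Made-up sample standing in for the Label and Flow Byts/s columns.
labels = ["Benign", "Reconnaissance", "Benign", "Exfiltration"]
flow_bytes_per_s = [100.0, float("nan"), 300.0, 200.0]

codes, classes = label_encode(labels)
scaled = min_max_scale(impute_mean(flow_bytes_per_s))
```

    In a real experiment, the imputation mean and the scaling range would normally be computed on the training split only and then applied to the test split, to avoid leaking test information into training.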

    Model Training: Split into train/test (e.g., 80/20). Suitable for classification tasks w...

  17. Machine Learning Security Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Sep 30, 2025
    Cite
    Dataintelo (2025). Machine Learning Security Market Research Report 2033 [Dataset]. https://dataintelo.com/report/machine-learning-security-market
    Available download formats: csv, pptx, pdf
    Dataset updated
    Sep 30, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Machine Learning Security Market Outlook



    According to our latest research, the global machine learning security market size reached USD 7.42 billion in 2024, reflecting a robust expansion driven by escalating cyber threats and the increasing adoption of advanced digital technologies across various sectors. The market is projected to register a compelling CAGR of 23.6% during the forecast period, with the total market value anticipated to reach USD 61.58 billion by 2033. This exponential growth is primarily fueled by the urgent need for proactive security solutions capable of identifying and mitigating sophisticated cyberattacks in real time.




    The primary growth driver for the machine learning security market is the rapid surge in cyberattacks, including ransomware, phishing, and advanced persistent threats. As organizations digitize operations and expand their cloud infrastructure, the attack surface increases, making traditional security measures insufficient. Machine learning-based security solutions can analyze vast datasets, detect anomalies, and respond to threats far more efficiently than conventional systems. The growing sophistication of cybercriminals, who now leverage artificial intelligence themselves, has made it imperative for enterprises to adopt equally advanced defense mechanisms. This environment of escalating cyber risk underpins the strong demand for machine learning security solutions across all major industries.




    Another significant factor propelling the market is the increasing regulatory pressure and compliance requirements worldwide. Governments and regulatory bodies have introduced stringent data protection and privacy laws, such as the General Data Protection Regulation (GDPR) in Europe and the California Consumer Privacy Act (CCPA) in the United States. These regulations mandate organizations to implement robust security frameworks to safeguard sensitive information. Machine learning security tools not only enhance the effectiveness of compliance programs by automating threat detection and reporting but also help organizations avoid hefty penalties associated with data breaches. As businesses strive to maintain compliance and protect their reputations, investment in machine learning-driven security is becoming a strategic imperative.




    Furthermore, the expansion of the Internet of Things (IoT), cloud computing, and remote work trends have introduced new vectors of vulnerability, necessitating adaptive security approaches. Machine learning security solutions can continuously learn and adapt to emerging threats, providing proactive protection for dynamic IT environments. The integration of machine learning into security operations centers (SOCs) enables real-time monitoring, rapid incident response, and predictive analytics. This capability is particularly valuable for sectors such as BFSI, healthcare, and manufacturing, where the cost of a security breach can be catastrophic. As digital transformation accelerates, the market is poised for sustained growth, with organizations prioritizing advanced security investments.




    From a regional perspective, North America currently dominates the machine learning security market, accounting for the largest share in 2024, followed closely by Europe and Asia Pacific. The United States, in particular, is a frontrunner due to its mature cybersecurity ecosystem, high technology adoption rate, and presence of leading market players. Europe’s growth is bolstered by strict regulatory frameworks and strong investment in digital infrastructure, while Asia Pacific is experiencing the fastest growth rate, driven by rapid digitalization, increasing cybercrime, and a burgeoning IT sector. Latin America and the Middle East & Africa are also witnessing steady adoption as awareness of cybersecurity threats rises and governments take proactive measures to strengthen national cyber defenses.



    Component Analysis



    The machine learning security market is segmented by component into software, hardware, and services, with each playing a critical role in enabling comprehensive security solutions. Software forms the backbone of machine learning security, encompassing threat detection platforms, security analytics, and automated response systems. These solutions leverage advanced algorithms to process and analyze massive volumes of security data, identify patterns of malicious activity, and provide actionable insights. The software segment is anticipated to maintain its dominance througho

  18. Data Collection & Requirements

    • zenodo.org
    bin
    Updated Mar 20, 2025
    Cite
    Zenodo (2025). Data Collection & Requirements [Dataset]. http://doi.org/10.5281/zenodo.14976797
    Available download formats: bin
    Dataset updated
    Mar 20, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Open-Source Cybersecurity and AI Security Datasets

    This project provides a comprehensive collection of open-source datasets focused on cybersecurity threats and AI security vulnerabilities. The datasets are carefully selected to align with specific security threats, such as:

    • Data Exfiltration
    • Data Poisoning
    • Model Manipulation
    • Adversarial Examples
    • Model Inversion
    • Model Extraction
    • Spoofing Attacks
    • Unauthorized Access
    • Supply Chain Compromise

    Dataset Collection

    Each dataset includes a detailed description, source type, purpose, and direct access links for easy retrieval.

    1. DARPA Intrusion Detection Dataset

    • Access Here
    • Description: Simulated network traffic with various cyber attack scenarios (e.g., DoS, Probe, U2R, R2L).
    • Format: PCAP
    • Update Frequency: Static
    • Use Cases: IDS training, intrusion detection research

    2. MITRE ATT&CK Framework Data

    • Access Here
    • Description: A globally-accessible knowledge base of adversarial tactics, techniques, and procedures (TTPs).
    • Format: JSON, STIX
    • Update Frequency: Quarterly
    • Use Cases: Threat intelligence, adversary simulation, AI model defense

    3. VirusShare Malware Repository

    • Access Here (Registration Required)
    • Description: Large-scale collection of live malware samples for security research.
    • Format: ZIP, PE files
    • Update Frequency: Weekly
    • Use Cases: AI-based malware detection, sandbox testing

    4. National Vulnerability Database (NVD)

    • Access Here
    • Description: A repository of reported vulnerabilities (CVEs) with severity scores and descriptions.
    • Format: XML, JSON
    • Update Frequency: Daily
    • Use Cases: Vulnerability management, exploit mitigation research

    5. LANL Unified Host and Network Dataset

    • Access Here
    • Description: Enterprise-scale dataset containing network and host logs with real-world red-team attack events.
    • Format: Text files
    • Update Frequency: Static
    • Use Cases: Insider threat detection, anomaly detection in network security

    6. CIC-IDS2017 (Intrusion Detection Dataset)

    • Access Here
    • Description: Network traffic dataset with multiple attack types, including DDoS, brute-force, and infiltration attacks.
    • Format: PCAP, CSV
    • Update Frequency: Static
    • Use Cases: Machine learning-based intrusion detection, behavioral analysis

    7. CIC IoV CAN Bus Dataset 2024

    • Access Here
    • Description: Vehicle CAN bus data, including spoofing and denial-of-service (DoS) attack traces.
    • Format: CSV, PCAP
    • Update Frequency: Static
    • Use Cases: Automotive security, AI-based anomaly detection in vehicles

    8. ImageNet-A (Adversarial Image Dataset)

    • Access Here
    • Description: A dataset of real-world images that cause misclassification in deep learning models.
    • Format: JPEG
    • Update Frequency: Static
    • Use Cases: Adversarial robustness evaluation, model retraining for security
  19. Additional file 1: Table S1. of Detection and prediction of insider threats...

    • springernature.figshare.com
    • figshare.com
    xlsx
    Updated May 30, 2023
    Cite
    Iffat Gheyas; Ali Abdallah (2023). Additional file 1: Table S1. of Detection and prediction of insider threats to cyber security: a systematic literature review and meta-analysis [Dataset]. http://doi.org/10.6084/m9.figshare.c.3639305_D1.v1
    Available download formats: xlsx
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Iffat Gheyas; Ali Abdallah
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview of the proposed insider threat detection algorithms. (XLSX 25.6 kb)

  20. Phishing Detection Dataset

    • data.mendeley.com
    Updated Jun 7, 2023
    Cite
    Maruf Tamal (2023). Phishing Detection Dataset [Dataset]. http://doi.org/10.17632/6tm2d6sz7p.1
    Dataset updated
    Jun 7, 2023
    Authors
    Maruf Tamal
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset consists of 247,950 instances, of which 128,541 are phishing URLs and 119,409 are legitimate URLs. It has 41 features and 1 target variable (0 = legitimate, 1 = phishing), making it suitable for training machine learning models to identify phishing attacks.

Dinesh Naveen Kumar Samudrala (2025). Cybersecurity 🪪 Intrusion 🦠 Detection Dataset [Dataset]. https://www.kaggle.com/datasets/dnkumars/cybersecurity-intrusion-detection-dataset

Cybersecurity 🪪 Intrusion 🦠 Detection Dataset

Prevent before attack

Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 10, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Dinesh Naveen Kumar Samudrala
License

MIT License: https://opensource.org/licenses/MIT
License information was derived automatically

Description

This Cybersecurity Intrusion Detection Dataset is designed for detecting cyber intrusions based on network traffic and user behavior. Below, I’ll explain each aspect in detail, including the dataset structure, feature importance, possible analysis approaches, and how it can be used for machine learning.

1. Understanding the Features

The dataset consists of network-based and user behavior-based features. Each feature provides valuable information about potential cyber threats.

A. Network-Based Features

These features describe network-level information such as packet size, protocol type, and encryption methods.

  1. network_packet_size (Packet Size in Bytes)

    • Represents the size of network packets, ranging from 64 to 1500 bytes.
    • Packets on the lower end (~64 bytes) may indicate control messages, while larger packets (~1500 bytes) often carry bulk data.
    • Attackers may use abnormally small or large packets for reconnaissance or exploitation attempts.
  2. protocol_type (Communication Protocol)

    • The protocol used in the session: TCP, UDP, or ICMP.
    • TCP (Transmission Control Protocol): Reliable, connection-oriented (common for HTTP, HTTPS, SSH).
    • UDP (User Datagram Protocol): Faster but less reliable (used for VoIP, streaming).
    • ICMP (Internet Control Message Protocol): Used for network diagnostics (ping); often abused in Denial-of-Service (DoS) attacks.
  3. encryption_used (Encryption Protocol)

    • Values: AES, DES, None.
    • AES (Advanced Encryption Standard): Strong encryption, commonly used.
    • DES (Data Encryption Standard): Older encryption, weaker security.
    • None: Indicates unencrypted communication, which can be risky.
    • Attackers might use no encryption to avoid detection or weak encryption to exploit vulnerabilities.

B. User Behavior-Based Features

These features track user activities, such as login attempts and session duration.

  1. login_attempts (Number of Logins)

    • High values might indicate brute-force attacks (repeated login attempts).
    • Typical users have 1–3 login attempts, while an attack may have hundreds or thousands.
  2. session_duration (Session Length in Seconds)

    • A very long session might indicate unauthorized access or persistence by an attacker.
    • Attackers may try to stay connected to maintain access.
  3. failed_logins (Failed Login Attempts)

    • High failed login counts indicate credential stuffing or dictionary attacks.
    • Many failed attempts followed by a successful login could suggest an account was compromised.
  4. unusual_time_access (Login Time Anomaly)

    • A binary flag (0 or 1) indicating whether access happened at an unusual time.
    • Attackers often operate outside normal business hours to evade detection.
  5. ip_reputation_score (Trustworthiness of IP Address)

    • A score from 0 to 1, where higher values indicate suspicious activity.
    • IP addresses associated with botnets, spam, or previous attacks tend to have higher scores.
  6. browser_type (User’s Browser)

    • Common browsers: Chrome, Firefox, Edge, Safari.
    • Unknown: Could be an indicator of automated scripts or bots.
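Before any of these features can be fed to a model, the categorical ones (protocol_type, encryption_used, browser_type) need a numeric encoding. The sketch below shows one possible approach, one-hot encoding, using only the column names and category values listed above. The helper names (one_hot, encode_record) and the sample record are hypothetical, and the exact spelling of categories in the actual CSV file is an assumption that should be checked against the data.

```python
# Category values as listed in the feature descriptions above.
# Their exact spelling in the real CSV is an assumption.
PROTOCOLS = ["TCP", "UDP", "ICMP"]
ENCRYPTIONS = ["AES", "DES", "None"]
BROWSERS = ["Chrome", "Firefox", "Edge", "Safari", "Unknown"]

# Columns that are already numeric (or binary flags).
NUMERIC = [
    "network_packet_size",
    "login_attempts",
    "session_duration",
    "failed_logins",
    "unusual_time_access",
    "ip_reputation_score",
]


def one_hot(value, categories):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unexpected category: {value!r}")
    return [1.0 if value == c else 0.0 for c in categories]


def encode_record(record):
    """Turn one session record (a dict) into a flat numeric feature vector."""
    numeric = [float(record[name]) for name in NUMERIC]
    return (
        numeric
        + one_hot(record["protocol_type"], PROTOCOLS)
        + one_hot(record["encryption_used"], ENCRYPTIONS)
        + one_hot(record["browser_type"], BROWSERS)
    )


# Made-up example record, not taken from the dataset.
example = {
    "network_packet_size": 599,
    "protocol_type": "TCP",
    "login_attempts": 4,
    "session_duration": 492.98,
    "encryption_used": "DES",
    "ip_reputation_score": 0.61,
    "failed_logins": 1,
    "browser_type": "Edge",
    "unusual_time_access": 0,
}

vector = encode_record(example)
print(len(vector), vector)
```

One-hot encoding avoids implying an order among protocols or browsers; tree-based models can also work with simple integer codes, so this is a design choice rather than a requirement.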

2. Target Variable (attack_detected)

  • Binary classification: 1 means an attack was detected, 0 means normal activity.
  • The dataset is useful for supervised machine learning, where a model learns from labeled attack patterns.

3. Possible Use Cases

This dataset can be used for intrusion detection systems (IDS) and cybersecurity research. Some key applications include:

A. Machine Learning-Based Intrusion Detection

  1. Supervised Learning Approaches

    • Classification Models (Logistic Regression, Decision Trees, Random Forest, XGBoost, SVM)
    • Train the model using labeled data (attack_detected as the target).
    • Evaluate using accuracy, precision, recall, F1-score.
  2. Deep Learning Approaches

    • Use Neural Networks (DNN, LSTM, CNN) for pattern recognition.
    • LSTMs work well for time-series-based network traffic analysis.
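The evaluation metrics named under the supervised approaches (accuracy, precision, recall, F1-score) can be computed directly from predicted and true attack_detected labels. The following is a minimal sketch in plain Python; the function names and the label lists are made up for illustration and do not come from the dataset.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = attack)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 as a dict."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels: 1 = attack detected, 0 = normal activity.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

For intrusion detection, recall (how many real attacks were caught) and precision (how many alerts were real) are usually more informative than accuracy alone, especially if the classes are imbalanced.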

B. Anomaly Detection (Unsupervised Learning)

If attack labels are missing, anomaly detection can be used:

  • Autoencoders: Learn normal traffic and flag anomalies.
  • Isolation Forest: Detects outliers based on feature isolation.
  • One-Class SVM: Learns normal behavior and detects deviations.
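To illustrate the idea of flagging deviations without labels, here is a much simpler stand-in than the methods named above: a robust z-score based on the median and the median absolute deviation (MAD) of a single feature. It is not an Autoencoder, Isolation Forest, or One-Class SVM; it only shows the "learn what is typical, flag what is far from it" pattern. The 0.6745 scaling constant and the 3.5 cutoff are conventional choices for this score, and the failed_logins values are made up.

```python
import statistics


def robust_z_scores(values):
    """Score each value by its distance from the median, scaled by the MAD."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # All values (or at least half of them) are identical: nothing to scale by.
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]


def flag_anomalies(values, threshold=3.5):
    """Return True for values whose robust z-score exceeds the threshold."""
    return [abs(z) > threshold for z in robust_z_scores(values)]


# Made-up failed_logins counts for ten sessions; one is clearly unusual.
failed_logins = [0, 1, 0, 2, 1, 0, 1, 40, 1, 0]

flags = flag_anomalies(failed_logins)
print(flags)
```

A single-feature score like this cannot capture interactions between features, which is where the multivariate methods listed above become useful.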

C. Rule-Based Detection

  • If certain thresholds are met (e.g., failed_logins > 10 & ip_reputation_score > 0.8), an alert is triggered.
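The example rule above can be written as a small predicate. This is a sketch only: the thresholds (10 and 0.8) are the illustrative values from the text, the function name rule_based_alert is hypothetical, and the session records are made up.

```python
def rule_based_alert(record, max_failed_logins=10, max_ip_score=0.8):
    """Trigger an alert when both thresholds from the example rule are exceeded."""
    return (
        record["failed_logins"] > max_failed_logins
        and record["ip_reputation_score"] > max_ip_score
    )


# Made-up session records for illustration.
sessions = [
    {"failed_logins": 12, "ip_reputation_score": 0.91},  # both exceeded
    {"failed_logins": 12, "ip_reputation_score": 0.30},  # only failed logins
    {"failed_logins": 2, "ip_reputation_score": 0.95},   # only IP score
    {"failed_logins": 10, "ip_reputation_score": 0.80},  # exactly at thresholds
]

alerts = [rule_based_alert(s) for s in sessions]
print(alerts)
```

Because the comparisons are strict, a session sitting exactly at a threshold does not trigger an alert; whether that boundary should be inclusive is a policy decision.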

4. Challenges & Considerations

  • Adversarial Attacks: Attackers may modify traffic to evade detection.
  • Concept Drift: Cyber threats...