9 datasets found
  1. Healthcare Ransomware Dataset

    • kaggle.com
    Updated Feb 21, 2025
    Cite
    Rivalytics (2025). Healthcare Ransomware Dataset [Dataset]. https://www.kaggle.com/datasets/rivalytics/healthcare-ransomware-dataset
    Dataset updated
    Feb 21, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rivalytics
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    📌 Context of the Dataset

    The Healthcare Ransomware Dataset was created to simulate real-world cyberattacks in the healthcare industry. Hospitals, clinics, and research labs have become prime targets for ransomware due to their reliance on real-time patient data and legacy IT infrastructure. This dataset provides insight into attack patterns, recovery times, and cybersecurity practices across different healthcare organizations.

    Why is this important?

    Ransomware attacks on healthcare organizations can shut down entire hospitals, delay treatments, and put lives at risk. Understanding how different healthcare organizations respond to attacks can help develop better security strategies. The dataset allows cybersecurity analysts, data scientists, and researchers to study patterns in ransomware incidents and explore predictive modeling for risk mitigation.

    📌 Sources and Research Inspiration

    This simulated dataset was inspired by real-world cybersecurity reports and built using insights from official sources, including:

    1️⃣ IBM Cost of a Data Breach Report (2024)

    The healthcare sector had the highest average cost of data breaches ($10.93 million per incident). On average, organizations recovered only 64.8% of their data after paying ransom. Healthcare breaches took 277 days on average to detect and contain.

    2️⃣ Sophos State of Ransomware in Healthcare (2024)

    67% of healthcare organizations were hit by ransomware in 2024, an increase from 60% in 2023. 66% of backup compromise attempts succeeded, making data recovery significantly more difficult. The most common attack vectors included exploited vulnerabilities (34%) and compromised credentials (34%).

    3️⃣ Health & Human Services (HHS) Cybersecurity Reports

    Ransomware incidents in healthcare have doubled since 2016. Organizations that fail to monitor threats frequently experience higher infection rates.

    4️⃣ Cybersecurity & Infrastructure Security Agency (CISA) Alerts

    Identified phishing, unpatched software, and exposed RDP ports as top ransomware entry points. Only 13% of healthcare organizations monitor cyber threats more than once per day, increasing the risk of undetected attacks.

    5️⃣ Emsisoft 2020 Report on Ransomware in Healthcare

    The number of ransomware attacks in healthcare increased by 278% between 2018 and 2023. 560 healthcare facilities were affected in a single year, disrupting patient care and emergency services.

    📌 Why is This a Simulated Dataset?

    This dataset does not contain real patient data or actual ransomware cases. Instead, it was built using probabilistic modeling and structured randomness based on industry benchmarks and cybersecurity reports.

    How It Was Created:

    1️⃣ Defining the Dataset Structure

    The dataset was designed to simulate realistic attack patterns in healthcare, using actual ransomware case studies as inspiration.

    Columns were selected based on what real-world cybersecurity teams track, such as:

    - Attack methods (phishing, RDP exploits, credential theft).
    - Infection rates, recovery time, and backup compromise rates.
    - Organization type (hospitals, clinics, research labs) and monitoring frequency.

    2️⃣ Generating Realistic Data Using ChatGPT & Python

    ChatGPT assisted in defining relationships between attack factors, ensuring that key cybersecurity concepts were accurately reflected. Python’s NumPy and Pandas libraries were used to introduce randomized attack simulations based on real-world statistics. Data was validated against industry research to ensure it aligns with actual ransomware attack trends.

    3️⃣ Ensuring Logical Relationships Between Data Points

    Hospitals take longer to recover due to larger infrastructure and compliance requirements. Organizations that track more cyber threats recover faster because they detect attacks earlier. Backup security significantly impacts recovery time, reflecting the real-world risk of backup encryption attacks.
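
    The generation approach described above can be sketched roughly as follows. This is a minimal illustration, not the author's actual script: the column names (`org_type`, `monitoring_per_day`, `backup_compromised`, `recovery_days`), the distributions, and every coefficient are assumptions chosen only to reproduce the three stated relationships. The 0.66 probability simply reuses the backup-compromise figure quoted from the Sophos report.

```python
import numpy as np
import pandas as pd


def simulate_incidents(n_rows=1000, seed=0):
    """Generate a toy ransomware-incident table (all parameters hypothetical)."""
    rng = np.random.default_rng(seed)

    org_type = rng.choice(["Hospital", "Clinic", "Research Lab"], size=n_rows)
    # How many times per day the organization checks for threats (1 to 24).
    monitoring_per_day = rng.integers(1, 25, size=n_rows)
    # Whether the attackers also compromised the backups.
    backup_compromised = rng.random(n_rows) < 0.66

    # Hospitals start from a longer baseline recovery time.
    base_days = np.where(org_type == "Hospital", 30.0, 15.0)
    recovery_days = (
        base_days
        - 0.5 * monitoring_per_day       # more monitoring, faster recovery
        + 20.0 * backup_compromised      # compromised backups slow recovery
        + rng.normal(0.0, 3.0, size=n_rows)  # unexplained variation
    )

    return pd.DataFrame({
        "org_type": org_type,
        "monitoring_per_day": monitoring_per_day,
        "backup_compromised": backup_compromised,
        "recovery_days": np.clip(recovery_days, 1.0, None),
    })


df = simulate_incidents()
print(df.groupby("org_type")["recovery_days"].mean())
```

    Encoding each relationship as an explicit term keeps the simulated correlations easy to audit against the benchmarks they are meant to reflect.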

  2. Businesses worldwide affected by ransomware 2018-2023

    • statista.com
    Updated Nov 9, 2024
    Cite
    Statista (2024). Businesses worldwide affected by ransomware 2018-2023 [Dataset]. https://www.statista.com/statistics/204457/businesses-ransomware-attack-rate/
    Dataset updated
    Nov 9, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    Worldwide
    Description

    As of 2023, over 72 percent of businesses worldwide were affected by ransomware attacks. This figure represents an increase on the previous five years and was by far the highest figure reported. Overall, since 2018, more than half of the total survey respondents each year stated that their organizations had been victimized by ransomware.

    Most targeted industries

    In 2023, the healthcare industry in the United States was once again most targeted by ransomware attacks. This industry also suffers most data breaches as a consequence of cyberattacks. The critical manufacturing industry ranked second by the number of ransomware attacks, followed by the government facilities industry.

    Ransomware in the manufacturing industry

    The manufacturing industry, along with its subindustries, is constantly targeted by ransomware attacks, causing data loss, business disruptions, and reputational damage. Often, such cyberattacks are international and have a political intent. In 2023, compromised credentials were the leading cause of ransomware attacks in the manufacturing industry.

  3. Global cyberattack distribution 2023, by type

    • statista.com
    Updated Nov 14, 2024
    Cite
    Statista (2024). Global cyberattack distribution 2023, by type [Dataset]. https://www.statista.com/statistics/1382266/cyber-attacks-worldwide-by-type/
    Dataset updated
    Nov 14, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2023
    Area covered
    Worldwide
    Description

    In 2023, ransomware was the most frequently detected cyberattack worldwide, accounting for around 70 percent of all detected cyberattacks. Network breaches ranked second, with almost 19 percent of the detections. Data exfiltration was also among the detected cyberattacks, although it occurred less frequently.

  4. Cyber Threat Dataset: Network, Text & Relation

    • kaggle.com
    Updated Apr 14, 2024
    Cite
    RAMOLIYA FENIL (2024). Cyber Threat Dataset: Network, Text & Relation [Dataset]. http://doi.org/10.34740/kaggle/dsv/8113350
    Dataset updated
    Apr 14, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    RAMOLIYA FENIL
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    To use this dataset for research, publication, and other work, please cite the following paper:

    F. Ramoliya, R. Kakkar, R. Gupta, S. Tanwar and S. Agrawal, "SEAM: Deep Learning-based Secure Message Exchange Framework For Autonomous EVs," 2023 IEEE Globecom Workshops (GC Wkshps), Kuala Lumpur, Malaysia, 2023, pp. 80-85, doi: 10.1109/GCWkshps58843.2023.10465168.

    About this dataset:

    This dataset provides a comprehensive collection of data for detecting, diagnosing, and mitigating cyber threats using network traffic data, textual content, and entity relationships. It can be used for training machine learning models to identify various types of cyber threats, understand their underlying patterns, and recommend appropriate solutions.

    Column details:

    1. id: A unique identifier for each instance in the dataset.

    2. text: Textual content transferred over the network, such as emails, messages, or network traffic payloads. This column may contain descriptions of potential cyber threats or attack vectors.

    3. Entries: A list of JSON objects containing the following fields:

    - sender_id: The ID of the entity that sent or initiated the communication.
    - label: The type of cyber threat or attack pattern identified, such as malware, attack pattern, identity, benign, software attack, or threat actor.
    - start_offset: The starting character position of the identified entity or threat within the text field.
    - end_offset: The ending character position of the identified entity or threat within the text field.
    - receiver_ids: A list of IDs representing the entities that received or were targeted by the communication.

    4. relations: A list of tuples representing the relationships between entities, where each tuple contains a pair of entity IDs indicating the source and target of the relationship.

    5. diagnosis: A description or diagnosis of the identified cyber threat, providing insights into the nature and potential impact of the threat.

    6. solution: Recommended solutions or mitigation strategies for addressing the identified cyber threat, such as implementing specific security controls, software updates, or network configurations.
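
    To make the column layout concrete, here is a minimal sketch of how one row might be consumed. The sample record is invented for illustration, and two details are assumptions rather than facts from the dataset page: that `end_offset` is exclusive, and that the IDs in `relations` refer to the `sender_id` values of the entries.

```python
from collections import defaultdict

# Hypothetical record shaped like the columns described above.
record = {
    "id": 1,
    "text": "The Emotet trojan was delivered by a phishing email.",
    "entries": [
        {"sender_id": 10, "label": "malware",
         "start_offset": 4, "end_offset": 10, "receiver_ids": [11]},
        {"sender_id": 11, "label": "attack pattern",
         "start_offset": 37, "end_offset": 51, "receiver_ids": []},
    ],
    "relations": [(10, 11)],
}


def extract_spans(text, entries):
    """Return (label, substring) pairs, treating end_offset as exclusive."""
    return [
        (entry["label"], text[entry["start_offset"]:entry["end_offset"]])
        for entry in entries
    ]


def build_adjacency(relations):
    """Turn (source, target) ID pairs into a source -> [targets] mapping."""
    graph = defaultdict(list)
    for source, target in relations:
        graph[source].append(target)
    return dict(graph)


spans = extract_spans(record["text"], record["entries"])
graph = build_adjacency(record["relations"])
print(spans)
print(graph)
```

    The adjacency mapping is the kind of structure a graph-based model would start from, while the labelled spans are suitable input for a sequence-tagging setup.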

    Potential Use Cases:

    - Cyber Threat Detection: Train machine learning models to identify and classify various types of cyber threats based on network traffic data and textual content.

    - Threat Intelligence and Analysis: Analyze the relationships between entities, threat actors, and attack patterns to gain insights into emerging cyber threats and their propagation mechanisms.

    - Incident Response and Mitigation: Develop systems that can recommend appropriate solutions and mitigation strategies based on the diagnosed cyber threats, enabling timely and effective incident response.

    - Network Security Monitoring: Implement real-time monitoring and analysis of network traffic to detect and prevent cyber attacks as they occur.

    - Cybersecurity Education and Research: Utilize the dataset for training cybersecurity professionals, conducting research on cyber threat detection and mitigation techniques, and developing new algorithms and approaches.

    Advanced ML/DL/FL/RL Use Cases:

    - Multi-Modal Threat Detection: Develop multi-modal machine learning models that can leverage both the network traffic data and textual content to enhance cyber threat detection capabilities.

    - Natural Language Processing (NLP) for Threat Analysis: Apply NLP techniques to analyze the textual content and identify potential threats, threat actors, and their relationships.

    - Graph Neural Networks: Leverage entity relationships and network traffic patterns to build graph neural network models for detecting and classifying complex cyber threats.

    - Anomaly Detection: Implement unsupervised or semi-supervised learning algorithms to detect anomalous network traffic patterns and textual content indicating cyber threats.

    - Transfer Learning and Domain Adaptation: Explore transfer learning techniques to adapt pre-trained models or knowledge from related domains to the cyber threat detection task.

    - Federated Learning: Develop federated learning frameworks for collaborative threat intelligence, distributed threat monitoring, and personalized threat detection.

    Collaborative Threat Intelligence: Develop federated learning frameworks that enable organizations to collaboratively train machine learning models for cyber threat detection while preserving data privacy and confidentiality.

    Distributed Threat Monitoring: Implement federated learning systems that can monitor and detect cyber threats across multiple distributed networks or devices, without the need for centralized data collection.

    Personalized Threat Detection: Leverage federated learning to build personalized threat detection models tailored to specific organizatio...

  5. CTU-SME-11: a labeled dataset with real benign and malicious network traffic...

    • zenodo.org
    • data.niaid.nih.gov
    bin, bz2, csv, html
    Updated May 26, 2023
    Cite
    Štěpán Bendl; Veronica Valeros; Sebastian Garcia (2023). CTU-SME-11: a labeled dataset with real benign and malicious network traffic mimicking a small medium-size enterprise environment [Dataset]. http://doi.org/10.5281/zenodo.7958259
    Dataset updated
    May 26, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Štěpán Bendl; Veronica Valeros; Sebastian Garcia
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    As technology advances, the number and complexity of cyber-attacks increase, forcing defense techniques to be updated and improved. To help develop effective tools for detecting security threats it is essential to have reliable and representative security datasets. Many existing security datasets have limitations that make them unsuitable for research, including lack of labels, unbalanced traffic, and outdated threats.

    CTU-SME-11 is a labeled network dataset designed to address the limitations of previous datasets. The dataset was captured in a real network that mimics a small-medium enterprise setting. Raw network traffic (packets) was captured from 11 devices using tcpdump for a duration of 7 days, from 20th to 26th of February, 2023 in Prague, Czech Republic. The devices were chosen based on the enterprise setting and consist of IoT, desktop and mobile devices, both bare metal and virtualized. The devices were infected with malware or exposed to Internet attacks, and factory reset to restore benign behavior.

    The raw data was processed to generate network flows (Zeek logs), which were analyzed and labeled. The dataset contains two types of labels, a high-level label and a descriptive label, both assigned by experts. The former can take three values: benign, malicious or background. The latter contains detailed information about the specific behavior observed in the network flows. The dataset contains 99 million labeled network flows. The overall compressed size of the dataset is 80GB and the uncompressed size is 170GB.
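
    As a rough sketch of how labeled Zeek flows like these might be tallied, the snippet below parses Zeek's tab-separated log layout, in which a `#fields` header line names the columns. The sample rows and the column names `label` and `detailed_label` are assumptions made for illustration; the actual file layout of CTU-SME-11 should be checked against its documentation.

```python
import io
from collections import Counter

# Invented sample in Zeek's TSV style; real logs carry many more columns.
SAMPLE_LOG = "\n".join([
    "#fields\tts\tid.orig_h\tid.resp_h\tproto\tlabel\tdetailed_label",
    "1676851200.0\t192.168.1.10\t8.8.8.8\tudp\tbenign\tdns-query",
    "1676851201.5\t192.168.1.11\t203.0.113.5\ttcp\tmalicious\tc2-beacon",
    "1676851202.1\t192.168.1.12\t192.168.1.1\tudp\tbackground\t-",
    "1676851203.9\t192.168.1.10\t1.1.1.1\tudp\tbenign\tdns-query",
    "#close\t2023-02-26-00-00-00",
]) + "\n"


def read_zeek_tsv(handle):
    """Yield one dict per flow, using the '#fields' line as the header."""
    fields = None
    for raw_line in handle:
        line = raw_line.rstrip("\n")
        if line.startswith("#fields"):
            fields = line.split("\t")[1:]  # drop the '#fields' token itself
        elif line and not line.startswith("#") and fields is not None:
            yield dict(zip(fields, line.split("\t")))


label_counts = Counter(flow["label"] for flow in read_zeek_tsv(io.StringIO(SAMPLE_LOG)))
print(label_counts)
```

    Streaming the file line by line, rather than loading it whole, matters at this scale: with 99 million flows, an in-memory table may not fit on a typical workstation.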

  6. Share of cyberattacks in Italy 2024, by reason

    • statista.com
    • ai-chatbox.pro
    Updated Dec 16, 2024
    Cite
    Statista (2024). Share of cyberattacks in Italy 2024, by reason [Dataset]. https://www.statista.com/statistics/649358/share-cyber-attacks-in-italy-by-reason/
    Dataset updated
    Dec 16, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    Italy
    Description

    During the first half of 2024, around 88 percent of cyberattacks carried out in Italy were motivated by cybercrime. Cyber espionage was another motivation, representing the main reason behind roughly four percent of attacks. By contrast, information warfare only accounted for two percent of the cyberattacks in the country in the last examined period.

    Data breaches in Italy

    In 2023, over half of the Italian digital population was alerted that their personal data had been breached, and 77.5 percent of the alerted users had the misfortune of being affected by data compromise on the dark web. Despite a decrease in the number of data sets affected in data breaches between 2020 and 2023, Italy recorded almost one million exposed data sets at the beginning of 2023. Meanwhile, the average cost of data breaches for both Italian companies and targeted users kept growing, reaching 4.73 million U.S. dollars in 2024, up from the 3.86 million U.S. dollars recorded in the previous year.

    The Italian privacy landscape: GDPR effects

    As a member state of the European Union, Italy is covered by the General Data Protection Regulation (GDPR). Since 2018, the GDPR has regulated online data privacy and has the responsibility to represent consumers’ interests within the digital and tech landscape of the Union. As of 2023, approximately 265 fines were issued in Italy due to violations of the GDPR, making Italy the European country with the second-highest number of fines issued to tech companies. The highest GDPR fine ever issued in Italy was at the expense of Telecom Italia (TIM), one of the largest Italian telecommunications companies. TIM was fined approximately 27.8 million euros in January 2020. The GDPR is enforced with the support of the country's Garante della Privacy, the national institution overseeing Italian users’ online rights, cybersecurity, and digital privacy.

  7. Data Encryption Market Report

    • promarketreports.com
    doc, pdf, ppt
    Updated Feb 19, 2025
    Cite
    Pro Market Reports (2025). Data Encryption Market Report [Dataset]. https://www.promarketreports.com/reports/data-encryption-market-9193
    Dataset updated
    Feb 19, 2025
    Dataset authored and provided by
    Pro Market Reports
    License

    https://www.promarketreports.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    Data Encryption Market Overview

    The global data encryption market is projected to register significant growth, with a market size of USD 14.5 billion in 2025 and a CAGR of 16% over the forecast period of 2025-2033. The increasing adoption of cloud computing and digital transformation initiatives is driving the demand for data encryption solutions to protect sensitive data from cyber threats. Additionally, industry regulations, such as GDPR and CCPA, are mandating organizations to implement data encryption measures, further fueling market growth.

    Market Drivers, Restraints, and Trends

    Key market drivers include rising cybersecurity threats, increasing data breaches, and the growing need for data privacy. The increasing adoption of IoT and mobile computing is also contributing to the need for data encryption. However, the high cost of implementation and the lack of skilled professionals can pose challenges to market growth. Notable market trends include the emergence of advanced encryption algorithms, such as quantum-safe cryptography, and the integration of encryption with AI and machine learning technologies. Regional factors, such as government regulations and technology adoption rates, also influence the market's growth trajectory.

    Recent developments include:

    - On Apr. 11, 2023, Menlo Security, a leading provider of browser security solutions, published the results of the 10th Annual Cyberthreat Defense Report (CDR) by the CyberEdge Group. The report, partially sponsored by Menlo Security, highlights the growing importance of browser isolation technologies to combat ransomware and other malicious threats. The research revealed that most ransomware attacks include threats beyond data encryption. According to the report, around 51% of respondents confirmed that they have been using at least one type of browser or Internet isolation to protect their organizational data, while another 40% are about to deploy data encryption technology. Furthermore, around 33% of respondents noted that browser isolation is a key cybersecurity strategy to protect against sophisticated attacks, including ransomware, phishing, and zero-day attacks.

    - On Feb. 14, 2023, EnterpriseDB, a relational database provider, announced the addition of Transparent Data Encryption (TDE) based on open-source PostgreSQL to its databases. The new TDE feature will be shipped along with the firm's enterprise version of its database. TDE is a method of encrypting database files to ensure data security while at rest and in motion. The company added that most enterprises use TDE for compliance reasons, as it helps ensure encryption of data on the hard drive and of files on a backup. Before the development of built-in TDE, enterprises relied on either full-disk encryption or stackable cryptographic file system encryption.

    - On Jan. 25, 2023, researchers from the Tokyo University of Science, Japan, announced the development of a faster and cheaper method for handling encrypted data while improving security. The new data encryption method combines the best of homomorphic encryption and secret sharing to handle encrypted data. Homomorphic encryption and secret sharing are key methods to compute sensitive data while preserving privacy. Homomorphic encryption is computationally intensive and involves performing computational data encryption on a single server, while secret sharing is fast and computationally efficient. In this method, the encrypted data/secret input is divided and distributed across multiple servers, each performing a computation, such as multiplication, on its data. The results of the computations are then used to reconstruct the original data.

    - September 2022: Convergence Technology Solutions Corp., a supplier of software-enabled IT and cloud solutions, declared that it has obtained certification in Canada to sell and deploy IBM zsystems and LinuxONE.

    - November 2019: Penta Security Systems announced that it has been selected as a finalist for the 2020 SC Magazine Awards, which are given by SC Media and celebrated in the United States. As a result, MyDiamo from Penta Security has been named the Best Database Security Solution of 2020. Additionally, this will result in the expansion of common-level encryption and improve the open-source DBMS installation procedure.

    Potential restraints include:

    - ISSUE REGARDING SECURITY AND DATA BREACH
    - HIGH IMPLEMENTATION COSTS AND COMPLEXITY
    - ISSUE WITH RESPECT TO DATA CONSISTENCY AND INTEROPERABILITY ACROSS DIFFERENT EDGE PLATFORMS
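
    The headline figures (USD 14.5 billion in 2025, 16% CAGR through 2033) imply an end-of-period market size that the description does not state. The sketch below works it out under two assumptions: 2025 is the base year, and growth compounds once per year for the eight years to 2033.

```python
def project_market_size(base_size, cagr, years):
    """Compound a starting value at a constant annual growth rate."""
    return base_size * (1.0 + cagr) ** years


# Figures quoted in the report summary; the 8-year horizon is an assumption.
size_2033 = project_market_size(base_size=14.5, cagr=0.16, years=2033 - 2025)
print(f"Implied 2033 market size: USD {size_2033:.1f} billion")
```

    Under these assumptions the implied 2033 figure is roughly USD 47.5 billion. A different convention for the base year or the number of compounding periods would shift that result.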

  8. IoT-23 Dataset

    • paperswithcode.com
    Updated Jan 23, 2020
    Cite
    (2020). IoT-23 Dataset [Dataset]. https://paperswithcode.com/dataset/iot-23
    Dataset updated
    Jan 23, 2020
    Description

    IoT-23 is a dataset of network traffic from Internet of Things (IoT) devices. It has 20 malware captures executed in IoT devices, and 3 captures of benign IoT device traffic. It was first published in January 2020, with captures ranging from 2018 to 2019. This IoT network traffic was captured in the Stratosphere Laboratory, AIC group, FEL, CTU University, Czech Republic. Its goal is to offer a large dataset of real and labeled IoT malware infections and IoT benign traffic for researchers to develop machine learning algorithms. This dataset and its research were funded by Avast Software. The malware was allowed to connect to the Internet.

  9. EDGE-IIOTSET Dataset

    • paperswithcode.com
    Updated Oct 16, 2023
    Cite
    (2023). EDGE-IIOTSET Dataset [Dataset]. https://paperswithcode.com/dataset/edge-iiotset
    Dataset updated
    Oct 16, 2023
    Description

    Abstract: In this project, we propose a new comprehensive realistic cyber security dataset of IoT and IIoT applications, called Edge-IIoTset, which can be used by machine learning-based intrusion detection systems in two different modes, namely, centralized and federated learning. Specifically, the proposed testbed is organized into seven layers: Cloud Computing Layer, Network Functions Virtualization Layer, Blockchain Network Layer, Fog Computing Layer, Software-Defined Networking Layer, Edge Computing Layer, and IoT and IIoT Perception Layer. In each layer, we propose new emerging technologies that satisfy the key requirements of IoT and IIoT applications, such as ThingsBoard IoT platform, OPNFV platform, Hyperledger Sawtooth, Digital twin, ONOS SDN controller, Mosquitto MQTT brokers, Modbus TCP/IP, etc. The IoT data are generated from various IoT devices (more than 10 types), such as low-cost digital sensors for sensing temperature and humidity, Ultrasonic sensor, Water level detection sensor, pH Sensor Meter, Soil Moisture sensor, Heart Rate Sensor, Flame Sensor, etc. We identify and analyze fourteen attacks related to IoT and IIoT connectivity protocols, which are categorized into five threats: DoS/DDoS attacks, Information gathering, Man in the middle attacks, Injection attacks, and Malware attacks. In addition, we extract features obtained from different sources, including alerts, system resources, logs, and network traffic, and propose 61 new features with high correlations from 1176 found features. After processing and analyzing the proposed realistic cyber security dataset, we provide a primary exploratory data analysis and evaluate the performance of machine learning approaches (i.e., traditional machine learning as well as deep learning) in both centralized and federated learning modes.
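
    To illustrate the difference between the two modes: in centralized learning one model is trained on pooled data, whereas in federated learning each client trains locally and only model parameters are combined. The sketch below shows a weighted-average aggregation step in the style of federated averaging (FedAvg). Choosing FedAvg here is an assumption for illustration, not a statement about the authors' exact procedure, and the parameter vectors are toy values.

```python
import numpy as np


def federated_average(client_weights, client_sizes):
    """Average client parameter vectors, weighted by local sample counts."""
    sizes = np.asarray(client_sizes, dtype=float)
    stacked = np.stack([np.asarray(w, dtype=float) for w in client_weights])
    # Each client's parameters count in proportion to its amount of data.
    return (stacked * sizes[:, None]).sum(axis=0) / sizes.sum()


# Two hypothetical clients with locally trained parameters.
global_weights = federated_average(
    client_weights=[[1.0, 1.0], [3.0, 3.0]],
    client_sizes=[100, 300],
)
print(global_weights)
```

    Only the parameter vectors leave each client, which is why this mode suits settings where raw traffic captures cannot be shared.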

    Instructions:

    Great news! The Edge-IIoT dataset has been featured as a "Document in the top 1% of Web of Science." This indicates that it is ranked within the top 1% of all publications indexed by the Web of Science (WoS) in terms of citations and impact.

    Please kindly visit kaggle link for the updates: https://www.kaggle.com/datasets/mohamedamineferrag/edgeiiotset-cyber-sec...

    Free use of the Edge-IIoTset dataset for academic research purposes is hereby granted in perpetuity. Use for commercial purposes is allowable after asking the lead author, Dr Mohamed Amine Ferrag, who has asserted his right under the Copyright.

    The details of the Edge-IIoT dataset were published in the following paper. For academic/public use of these datasets, authors must cite the following paper:

    Mohamed Amine Ferrag, Othmane Friha, Djallel Hamouda, Leandros Maglaras, Helge Janicke, "Edge-IIoTset: A New Comprehensive Realistic Cyber Security Dataset of IoT and IIoT Applications for Centralized and Federated Learning", IEEE Access, April 2022 (IF: 3.37), DOI: 10.1109/ACCESS.2022.3165809

    Link to paper : https://ieeexplore.ieee.org/document/9751703

    The directories of the Edge-IIoTset dataset include the following:

    •File 1 (Normal traffic)

    -File 1.1 (Distance): This file includes two documents, namely, Distance.csv and Distance.pcap. The IoT sensor (Ultrasonic sensor) is used to capture the IoT data.

    -File 1.2 (Flame_Sensor): This file includes two documents, namely, Flame_Sensor.csv and Flame_Sensor.pcap. The IoT sensor (Flame Sensor) is used to capture the IoT data.

    -File 1.3 (Heart_Rate): This file includes two documents, namely, Heart_Rate.csv and Heart_Rate.pcap. The IoT sensor (Heart Rate Sensor) is used to capture the IoT data.

    -File 1.4 (IR_Receiver): This file includes two documents, namely, IR_Receiver.csv and IR_Receiver.pcap. The IoT sensor (IR (Infrared) Receiver Sensor) is used to capture the IoT data.

    -File 1.5 (Modbus): This file includes two documents, namely, Modbus.csv and Modbus.pcap. The IoT sensor (Modbus Sensor) is used to capture the IoT data.

    -File 1.6 (phValue): This file includes two documents, namely, phValue.csv and phValue.pcap. The IoT sensor (pH-sensor PH-4502C) is used to capture the IoT data.

    -File 1.7 (Soil_Moisture): This file includes two documents, namely, Soil_Moisture.csv and Soil_Moisture.pcap. The IoT sensor (Soil Moisture Sensor v1.2) is used to capture the IoT data.

    -File 1.8 (Sound_Sensor): This file includes two documents, namely, Sound_Sensor.csv and Sound_Sensor.pcap. The IoT sensor (LM393 Sound Detection Sensor) is used to capture the IoT data.

    -File 1.9 (Temperature_and_Humidity): This file includes two documents, namely, Temperature_and_Humidity.csv and Temperature_and_Humidity.pcap. The IoT sensor (DHT11 Sensor) is used to capture the IoT data.

    -File 1.10 (Water_Level): This file includes two documents, namely, Water_Level.csv and Water_Level.pcap. The IoT sensor (Water sensor) is used to capture the IoT data.

    •File 2 (Attack traffic):

    -File 2.1 (Attack traffic (CSV files)): This file includes 14 documents, namely, Backdoor_attack.csv, DDoS_HTTP_Flood_attack.csv, DDoS_ICMP_Flood_attack.csv, DDoS_TCP_SYN_Flood_attack.csv, DDoS_UDP_Flood_attack.csv, MITM_attack.csv, OS_Fingerprinting_attack.csv, Password_attack.csv, Port_Scanning_attack.csv, Ransomware_attack.csv, SQL_injection_attack.csv, Uploading_attack.csv, Vulnerability_scanner_attack.csv, XSS_attack.csv. Each document is specific to one attack.

    -File 2.2 (Attack traffic (PCAP files)): This file includes 14 documents, namely, Backdoor_attack.pcap, DDoS_HTTP_Flood_attack.pcap, DDoS_ICMP_Flood_attack.pcap, DDoS_TCP_SYN_Flood_attack.pcap, DDoS_UDP_Flood_attack.pcap, MITM_attack.pcap, OS_Fingerprinting_attack.pcap, Password_attack.pcap, Port_Scanning_attack.pcap, Ransomware_attack.pcap, SQL_injection_attack.pcap, Uploading_attack.pcap, Vulnerability_scanner_attack.pcap, XSS_attack.pcap. Each document is specific to one attack.

    •File 3 (Selected dataset for ML and DL):

    -File 3.1 (DNN-EdgeIIoT-dataset): This file contains a selected dataset for the use of evaluating deep learning-based intrusion detection systems.

    -File 3.2 (ML-EdgeIIoT-dataset): This file contains a selected dataset for the use of evaluating traditional machine learning-based intrusion detection systems.

    Step 1: Downloading the Edge-IIoTset dataset from the Kaggle platform:

    from google.colab import files

    !pip install -q kaggle

    files.upload()

    !mkdir ~/.kaggle

    !cp kaggle.json ~/.kaggle/

    !chmod 600 ~/.kaggle/kaggle.json

    !kaggle datasets download -d mohamedamineferrag/edgeiiotset-cyber-security-dataset-of-iot-iiot -f "Edge-IIoTset dataset/Selected dataset for ML and DL/DNN-EdgeIIoT-dataset.csv"

    !unzip DNN-EdgeIIoT-dataset.csv.zip

    !rm DNN-EdgeIIoT-dataset.csv.zip

    Step 2: Reading the dataset's CSV file into a Pandas DataFrame:

    import pandas as pd

    import numpy as np

    df = pd.read_csv('DNN-EdgeIIoT-dataset.csv', low_memory=False)

    Step 3: Exploring some of the DataFrame's contents:

    df.head(5)

    print(df['Attack_type'].value_counts())

    Step 4: Dropping data (columns, duplicated rows, NaN, null values):

    from sklearn.utils import shuffle

    drop_columns = ["frame.time", "ip.src_host", "ip.dst_host", "arp.src.proto_ipv4", "arp.dst.proto_ipv4",
                    "http.file_data", "http.request.full_uri", "icmp.transmit_timestamp",
                    "http.request.uri.query", "tcp.options", "tcp.payload", "tcp.srcport",
                    "tcp.dstport", "udp.port", "mqtt.msg"]
    

    df.drop(drop_columns, axis=1, inplace=True)

    df.dropna(axis=0, how='any', inplace=True)

    df.drop_duplicates(subset=None, keep="first", inplace=True)

    df = shuffle(df)

    df.isna().sum()

    print(df['Attack_type'].value_counts())
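The Step 4 cleaning can also be written as a small helper that returns a new DataFrame instead of mutating `df` in place. This is a minimal sketch, not the dataset authors' code: the name `clean_frame`, the `seed` parameter, and the toy data are assumptions made for illustration, and `DataFrame.sample(frac=1)` is swapped in for `sklearn.utils.shuffle` so the example depends only on pandas.

```python
import pandas as pd


def clean_frame(df, drop_columns, seed=0):
    """Drop unwanted columns, rows with missing values, and duplicate rows, then shuffle."""
    cleaned = (
        df.drop(columns=drop_columns, errors="ignore")  # tolerate columns that are absent
        .dropna(axis=0, how="any")
        .drop_duplicates(keep="first")
    )
    # sample(frac=1) returns every row in random order; a fixed seed makes it repeatable
    return cleaned.sample(frac=1, random_state=seed).reset_index(drop=True)


# Made-up rows: one duplicate and one row with a missing value
toy = pd.DataFrame({
    "frame.time": ["t0", "t1", "t2", "t3"],
    "tcp.len": [10.0, 10.0, None, 20.0],
    "Attack_type": ["Normal", "Normal", "MITM", "MITM"],
})
cleaned = clean_frame(toy, ["frame.time", "tcp.payload"])
print(cleaned["Attack_type"].value_counts())
```

Passing `errors="ignore"` is a design choice for this sketch: it lets the same column list be reused on files that lack some of the listed fields.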

    Step 5: Categorical data encoding (dummy encoding):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn import preprocessing

    def encode_text_dummy(df, name):
        # Replace column `name` with one indicator column per distinct value
        dummies = pd.get_dummies(df[name])
        for x in dummies.columns:
            dummy_name = f"{name}-{x}"
            df[dummy_name] = dummies[x]
        df.drop(name, axis=1, inplace=True)

    encode_text_dummy(df,'http.request.method')
    encode_text_dummy(df,'http.referer')
    encode_text_dummy(df,"http.request.version")
    encode_text_dummy(df,"dns.qry.name.len")
    encode_text_dummy(df,"mqtt.conack.flags")
    encode_text_dummy(df,"mqtt.protoname")
    encode_text_dummy(df,"mqtt.topic")
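To see what `encode_text_dummy` produces, the sketch below runs it on a made-up four-row frame. The values in `toy` are illustrative only and are not taken from the dataset.

```python
import pandas as pd


def encode_text_dummy(df, name):
    """Replace column `name` with one indicator column per distinct value (in place)."""
    dummies = pd.get_dummies(df[name])
    for x in dummies.columns:
        df[f"{name}-{x}"] = dummies[x]
    df.drop(name, axis=1, inplace=True)


# Made-up data: one categorical column and one numeric column
toy = pd.DataFrame({
    "http.request.method": ["GET", "POST", "GET", "PUT"],
    "tcp.len": [10, 20, 30, 40],
})
encode_text_dummy(toy, "http.request.method")
print(list(toy.columns))
```

The original column is removed and one new column per category is appended, so a column with many distinct values (a free-text field, for example) adds as many columns as it has values.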

    Step 6: Creation of the preprocessed dataset:

    df.to_csv('preprocessed_DNN.csv', encoding='utf-8')
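Step 5 imports `train_test_split` and `StandardScaler` but the listing never uses them. A plausible follow-on step, assumed here rather than taken from the source, is to split the preprocessed data and standardize the features. The sketch uses a made-up frame with hypothetical feature names (`feature_a`, `feature_b`); only the `Attack_type` label name comes from the listing.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up stand-in for the preprocessed data: two numeric features and a label
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature_a": rng.integers(0, 1500, size=100),
    "feature_b": rng.random(100),
    "Attack_type": ["Normal"] * 60 + ["MITM"] * 40,
})

X = toy.drop(columns=["Attack_type"])
y = toy["Attack_type"]

# Stratify so both splits keep the 60/40 class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fit the scaler on the training split only, then reuse it on the test split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Fitting the scaler on the training split alone keeps test-set statistics out of the model inputs, which matters when the result is used to evaluate an intrusion detection model.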

    For more information about the dataset, please contact the lead author of this project, Dr. Mohamed Amine Ferrag, by email at mohamed.amine.ferrag@gmail.com

    More information about Dr. Mohamed Amine Ferrag is available at:

    https://www.linkedin.com/in/Mohamed-Amine-Ferrag

    https://dblp.uni-trier.de/pid/142/9937.html

    https://www.researchgate.net/profile/Mohamed_Amine_Ferrag

    https://scholar.google.fr/citations?user=IkPeqxMAAAAJ&hl=fr&oi=ao

    https://www.scopus.com/authid/detail.uri?authorId=56115001200

    https://publons.com/researcher/1322865/mohamed-amine-ferrag/

    https://orcid.org/0000-0002-0632-3172

    Last Updated: 27 Mar. 2023


Rivalytics (2025). Healthcare Ransomware Dataset [Dataset]. https://www.kaggle.com/datasets/rivalytics/healthcare-ransomware-dataset

Healthcare Ransomware Dataset

Analyze attacks, strengthen security, and improve recovery in healthcare

177 scholarly articles cite this dataset (View in Google Scholar)
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated: Feb 21, 2025
Dataset provided by: Kaggle (http://kaggle.com/)
Authors: Rivalytics
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically

Description

📌 Context of the Dataset

The Healthcare Ransomware Dataset was created to simulate real-world cyberattacks in the healthcare industry. Hospitals, clinics, and research labs have become prime targets for ransomware due to their reliance on real-time patient data and legacy IT infrastructure. This dataset provides insight into attack patterns, recovery times, and cybersecurity practices across different healthcare organizations.

Why is this important?

Ransomware attacks on healthcare organizations can shut down entire hospitals, delay treatments, and put lives at risk. Understanding how different healthcare organizations respond to attacks can help develop better security strategies. The dataset allows cybersecurity analysts, data scientists, and researchers to study patterns in ransomware incidents and explore predictive modeling for risk mitigation.

📌 Sources and Research Inspiration

This simulated dataset was inspired by real-world cybersecurity reports and built using insights from official sources, including:

1️⃣ IBM Cost of a Data Breach Report (2024)

The healthcare sector had the highest average cost of data breaches ($10.93 million per incident). On average, organizations recovered only 64.8% of their data after paying ransom. Healthcare breaches took 277 days on average to detect and contain.

2️⃣ Sophos State of Ransomware in Healthcare (2024)

67% of healthcare organizations were hit by ransomware in 2024, an increase from 60% in 2023. 66% of backup compromise attempts succeeded, making data recovery significantly more difficult. The most common attack vectors included exploited vulnerabilities (34%) and compromised credentials (34%).

3️⃣ Health & Human Services (HHS) Cybersecurity Reports

Ransomware incidents in healthcare have doubled since 2016. Organizations that monitor threats infrequently experience higher infection rates.

4️⃣ Cybersecurity & Infrastructure Security Agency (CISA) Alerts

Identified phishing, unpatched software, and exposed RDP ports as top ransomware entry points. Only 13% of healthcare organizations monitor cyber threats more than once per day, increasing the risk of undetected attacks.

5️⃣ Emsisoft 2020 Report on Ransomware in Healthcare

The number of ransomware attacks in healthcare increased by 278% between 2018 and 2023. 560 healthcare facilities were affected in a single year, disrupting patient care and emergency services.

📌 Why is This a Simulated Dataset?

This dataset does not contain real patient data or actual ransomware cases. Instead, it was built using probabilistic modeling and structured randomness based on industry benchmarks and cybersecurity reports.

How It Was Created:

1️⃣ Defining the Dataset Structure

The dataset was designed to simulate realistic attack patterns in healthcare, using actual ransomware case studies as inspiration.

Columns were selected based on what real-world cybersecurity teams track, such as attack methods (phishing, RDP exploits, credential theft); infection rates, recovery time, and backup compromise rates; and organization type (hospitals, clinics, research labs) and monitoring frequency.

2️⃣ Generating Realistic Data Using ChatGPT & Python

ChatGPT assisted in defining relationships between attack factors, ensuring that key cybersecurity concepts were accurately reflected. Python’s NumPy and Pandas libraries were used to introduce randomized attack simulations based on real-world statistics. Data was validated against industry research to ensure it aligns with actual ransomware attack trends.

3️⃣ Ensuring Logical Relationships Between Data Points

Hospitals take longer to recover due to larger infrastructure and compliance requirements. Organizations that track more cyber threats recover faster because they detect attacks earlier. Backup security significantly impacts recovery time, reflecting the real-world risk of backup encryption attacks.
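The kind of "structured randomness" described above can be sketched with NumPy and Pandas. This is an illustration only, not the dataset author's generator: every column name, probability, and coefficient below is a made-up assumption, chosen so that the three stated relationships (hospitals recover more slowly, frequent monitoring speeds recovery, compromised backups slow it) hold on average.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 5000

# Hypothetical attributes; the probabilities are illustrative, not from any report
org_type = rng.choice(["Hospital", "Clinic", "Research Lab"], size=n, p=[0.5, 0.3, 0.2])
monitoring_per_week = rng.integers(1, 15, size=n)  # 1 to 14 checks per week
backup_compromised = rng.random(n) < 0.5

# Hypothetical baseline recovery time in days per organization type
base_days = pd.Series(org_type).map(
    {"Hospital": 14.0, "Clinic": 9.0, "Research Lab": 7.0}
).to_numpy()

# Recovery time = baseline - monitoring benefit + backup penalty + noise, floored at 1 day
recovery_days = (
    base_days
    - 0.3 * monitoring_per_week
    + 6.0 * backup_compromised
    + rng.normal(0.0, 1.5, size=n)
).clip(min=1.0)

sim = pd.DataFrame({
    "org_type": org_type,
    "monitoring_per_week": monitoring_per_week,
    "backup_compromised": backup_compromised,
    "recovery_days": recovery_days,
})

print(sim.groupby("org_type")["recovery_days"].mean().round(1))
print(sim.groupby("backup_compromised")["recovery_days"].mean().round(1))
```

Because the relationships are built into the generator, any model trained on such data recovers the assumptions rather than evidence about real incidents, which is worth keeping in mind when interpreting results.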
