6 datasets found
  1. Data Breaches

    • kaggle.com
    Updated Nov 10, 2022
    Cite
    The Devastator (2022). Data Breaches [Dataset]. https://www.kaggle.com/datasets/thedevastator/data-breaches-a-comprehensive-list/code
    Dataset updated
    Nov 10, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Data Breaches Dataset

    30,000 Records of cyber-security data breaches

    About this dataset

    This dataset is a compilation of data from various sources detailing data breaches. These sources include press reports, government news releases, and mainstream news articles. The list includes those involving the theft or compromise of 30,000 or more records, although many smaller breaches occur continually. In addition, the various methods used in the breaches are listed, with hacking being the most common.

    Organizations of all types and sizes are susceptible to data breaches, which can have devastating consequences. This dataset can help shed light on which organizations are most at risk and how these breaches occur, so that steps can be taken to prevent them in the future.

    How to use the dataset

    There are many ways to use this dataset. Here are a few ideas:

    • Use the data to understand which types of organizations are most commonly breached, and what methods are used most often.
    • Analyze the data to see if there are any trends or patterns in when or how breaches occur.
    • Use the data to create a visualization or infographic showing the prevalence of data breaches.

    Research Ideas

    • This dataset can be used to identify trends in data breaches in terms of methods used, types of organizations breached, and geographical distribution.

    • This dataset can be used to study the effect of data breaches on organizational reputation and customer trust.

    • This dataset can be used by organizations to benchmark their own security measures against those of similar organizations that have experienced data breaches.

    Acknowledgements

    License

    License: CC0 1.0 Universal (CC0 1.0), Public Domain Dedication. No copyright: you can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: df_1.csv

    | Column name       | Description                                                          |
    |:------------------|:---------------------------------------------------------------------|
    | Entity            | The name of the organization that was breached. (String)             |
    | Year              | The year when the breach occurred. (Integer)                         |
    | Records           | The number of records that were compromised in the breach. (Integer) |
    | Organization type | The type of organization that was breached. (String)                 |
    | Method            | The method that was used to breach the organization. (String)        |
    | Sources           | The sources from which the data was collected. (String)              |
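
As a minimal sketch of the first usage idea, the snippet below counts breaches per `Method` and totals `Records` per `Organization type`, using only the column names listed for df_1.csv. The three sample rows are invented placeholders, not entries from the dataset, and the in-memory string stands in for the real file. The exact header spelling and the presence of non-numeric values in `Records` are assumptions to verify against the actual CSV.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical stand-in for df_1.csv: same column names as the listing,
# but the rows are invented placeholders, not real dataset entries.
SAMPLE_CSV = """Entity,Year,Records,Organization type,Method,Sources
ExampleCorp,2019,50000,web,hacked,[1]
ExampleBank,2020,120000,financial,hacked,[2]
ExampleClinic,2020,unknown,healthcare,lost / stolen media,[3]
"""


def parse_records(value):
    """Return the record count as an int, or None if it is not a plain number."""
    cleaned = value.replace(",", "").strip()
    return int(cleaned) if cleaned.isdigit() else None


def summarize(csv_file):
    """Count breaches per method and total known records per organization type."""
    method_counts = Counter()
    records_by_type = defaultdict(int)
    for row in csv.DictReader(csv_file):
        method_counts[row["Method"]] += 1
        records = parse_records(row["Records"])
        if records is not None:  # skip rows whose count is missing or textual
            records_by_type[row["Organization type"]] += records
    return method_counts, dict(records_by_type)


method_counts, records_by_type = summarize(io.StringIO(SAMPLE_CSV))
print(method_counts.most_common())
print(records_by_type)
```

To run this against the downloaded file, one would pass `open("df_1.csv", newline="", encoding="utf-8")` to `summarize` instead of the in-memory sample.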

  2. All-time biggest online data breaches 2025

    • statista.com
    Updated May 26, 2025
    Cite
    Statista (2025). All-time biggest online data breaches 2025 [Dataset]. https://www.statista.com/statistics/290525/cyber-crime-biggest-online-data-breaches-worldwide/
    Dataset updated
    May 26, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Jan 2025
    Area covered
    Worldwide
    Description

    The largest reported data leak as of January 2025 was the Cam4 data breach in March 2020, which exposed more than 10 billion data records. The second-largest data breach in history so far, the Yahoo data breach, occurred in 2013. The company initially reported about one billion exposed data records, but after an investigation it revised the number, revealing that three billion accounts were affected. The National Public Data breach was announced in August 2024. The incident became public when personally identifiable information of individuals became available for sale on the dark web. Security professionals estimate that nearly three billion personal records were leaked. The next most significant data leak was the March 2018 security breach of India's national ID database, Aadhaar, with over 1.1 billion records exposed. This included identification numbers and biometric information such as fingerprint scans, which could be used to open bank accounts and receive financial aid, among other government services.

    Cybercrime - the dark side of digitalization

    As the world continues its journey into the digital age, corporations and governments across the globe have been increasing their reliance on technology to collect, analyze and store personal data. This, in turn, has led to a rise in the number of cyber crimes, ranging from minor breaches to global-scale attacks impacting billions of users, as in the case of Yahoo. Within the U.S. alone, 1,802 cases of data compromise were reported in 2022. This was a marked increase from the 447 cases reported a decade prior.

    The high price of data protection

    As of 2022, the average cost of a single data breach across all industries worldwide stood at around 4.35 million U.S. dollars. Breaches were most costly in the healthcare sector, where each leak was reported to have cost the affected party 10.1 million U.S. dollars. The financial segment followed. Here, each breach resulted in a loss of approximately 6 million U.S. dollars - 1.5 million more than the global average.
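
The comparisons in this description can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; the variable names are illustrative.

```python
# Figures quoted in the description above.
us_cases_2022 = 1802
us_cases_decade_prior = 447
global_avg_cost_musd = 4.35     # million U.S. dollars, all industries, 2022
healthcare_cost_musd = 10.1     # million U.S. dollars
financial_premium_musd = 1.5    # stated gap between financial sector and global average

# Growth in reported U.S. data compromises over the decade.
growth_factor = us_cases_2022 / us_cases_decade_prior

# Financial-sector cost implied by "1.5 million more than the global average".
implied_financial_cost_musd = global_avg_cost_musd + financial_premium_musd

# Healthcare cost relative to the global average.
healthcare_ratio = healthcare_cost_musd / global_avg_cost_musd

print(f"U.S. cases grew about {growth_factor:.1f}x")
print(f"Implied financial-sector cost: {implied_financial_cost_musd:.2f} million USD")
print(f"Healthcare cost is about {healthcare_ratio:.1f}x the global average")
```

The implied financial-sector figure of 5.85 million is consistent with the "approximately 6 million" quoted in the text.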

  3. Number of data compromises and impacted individuals in U.S. 2005-2024

    • statista.com
    Updated Jul 14, 2025
    Cite
    Statista (2025). Number of data compromises and impacted individuals in U.S. 2005-2024 [Dataset]. https://www.statista.com/statistics/273550/data-breaches-recorded-in-the-united-states-by-number-of-breaches-and-records-exposed/
    Dataset updated
    Jul 14, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    United States
    Description

    In 2024, the number of data compromises in the United States stood at 3,158 cases. Meanwhile, over 1.35 billion individuals were affected in the same year by data compromises, including data breaches, leakage, and exposure. While these are three different events, they have one thing in common: in all three, sensitive data is accessed by an unauthorized threat actor.

    Industries most vulnerable to data breaches

    Some industry sectors usually see more significant cases of private data violations than others. This is determined by the type and volume of the personal information that organizations in these sectors store. In 2024, financial services, healthcare, and professional services were the three industry sectors that recorded the most data breaches. Overall, the number of healthcare data breaches in some industry sectors in the United States has gradually increased within the past few years. However, some sectors saw a decrease.

    Largest data exposures worldwide

    In 2020, an adult streaming website, CAM4, experienced a leak of nearly 11 billion records. This is by far the most extensive reported data leak. The case is unusual, though, because cybersecurity researchers found the vulnerability before cybercriminals did. The second-largest data breach is the Yahoo data breach, dating back to 2013. The company first reported about one billion exposed records, then later, in 2017, gave an updated figure of three billion leaked records. In March 2018, the third-biggest data breach happened, involving India’s national identification database Aadhaar. As a result of this incident, over 1.1 billion records were exposed.

  4. Global number of breached user accounts Q1 2020-Q3 2024

    • statista.com
    Updated Jun 23, 2025
    Cite
    Statista (2025). Global number of breached user accounts Q1 2020-Q3 2024 [Dataset]. https://www.statista.com/statistics/1307426/number-of-data-breaches-worldwide/
    Dataset updated
    Jun 23, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    Worldwide
    Description

    During the third quarter of 2024, data breaches exposed more than *** million records worldwide. Since the first quarter of 2020, the highest number of data records was exposed in the first quarter of ***, at more than *** million data sets. Data breaches remain among the biggest concerns of company leaders worldwide. The most common causes of sensitive information loss were operating system vulnerabilities on endpoint devices.

    Which industries see the most data breaches?

    Certain conditions make some industry sectors more prone to data breaches than others. According to the latest observations, public administration experienced the highest number of data breaches between 2021 and 2022. The industry saw *** reported data breach incidents with confirmed data loss. Financial institutions ranked second, with *** data breach cases, followed by healthcare providers.

    Data breach cost

    Data breach incidents have various consequences, the most common being financial losses and business disruptions. As of 2023, the average data breach cost across businesses worldwide was **** million U.S. dollars. Meanwhile, a leaked data record cost about *** U.S. dollars. The United States saw the highest average breach cost globally, at **** million U.S. dollars.

  5. IoMT-TrafficData: A Dataset for Benchmarking Intrusion Detection in IoMT

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 30, 2024
    Cite
    Santos, Leonel (2024). IoMT-TrafficData: A Dataset for Benchmarking Intrusion Detection in IoMT [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8116337
    Dataset updated
    Aug 30, 2024
    Dataset provided by
    Areia, José
    Costa, Rogério Luís
    Bispo, Ivo Afonso
    Santos, Leonel
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Article Information

    The work involved in developing the dataset and benchmarking its use of machine learning is set out in the article ‘IoMT-TrafficData: Dataset and Tools for Benchmarking Intrusion Detection in Internet of Medical Things’. DOI: 10.1109/ACCESS.2024.3437214.

    Please cite this article when using the dataset.

    Abstract

    The increasing importance of securing the Internet of Medical Things (IoMT) due to its vulnerabilities to cyber-attacks highlights the need for an effective intrusion detection system (IDS). In this study, our main objective was to develop a Machine Learning Model for the IoMT to enhance the security of medical devices and protect patients’ private data. To address this issue, we built a scenario that utilised the Internet of Things (IoT) and IoMT devices to simulate real-world attacks. We collected and cleaned data, pre-processed it, and fed it into our machine-learning model to detect intrusions in the network. Our results revealed significant improvements in all performance metrics, indicating robustness and reproducibility in real-world scenarios. This research has implications in the context of IoMT and cybersecurity, as it helps mitigate vulnerabilities and lowers the number of breaches occurring with the rapid growth of IoMT devices. The use of machine learning algorithms for intrusion detection systems is essential, and our study provides valuable insights and a road map for future research and the deployment of such systems in live environments. By implementing our findings, we can contribute to a safer and more secure IoMT ecosystem, safeguarding patient privacy and ensuring the integrity of medical data.

    ZIP Folder Content

    The ZIP folder comprises two main components: Captures and Datasets. The captures folder includes all the captures used in this project, organized into separate folders by type of network analysis: BLE or IP-Based. The datasets folder is organized the same way and contains datasets categorized by type: BLE, IP-Based Packet, and IP-Based Flows.

    To cater to diverse analytical needs, the datasets are provided in two formats: CSV (Comma-Separated Values) and pickle. The CSV format facilitates seamless integration with various data analysis tools, while the pickle format preserves the intricate structures and relationships within the dataset.

    This organization enables researchers to easily locate and utilize the specific captures and datasets they require, based on their preferred network analysis type or dataset type. The availability of different formats further enhances the flexibility and usability of the provided data.
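
The practical difference between the two formats can be illustrated with a short sketch that uses only the Python standard library. The record below is a made-up placeholder that borrows a few feature names from this entry; it is not taken from the dataset. The object type stored in the actual pickle files (for example a pandas DataFrame) is not stated here and should be checked before loading. Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.

```python
import csv
import io
import pickle

# Hypothetical record; field names borrowed from this entry for illustration only.
rows = [{"proto": "tcp", "orig_bytes": 1024, "flow_duration": 0.25}]

# CSV round trip: every value comes back as a string.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
buffer.seek(0)
from_csv = list(csv.DictReader(buffer))

# Pickle round trip: the original Python types are preserved.
from_pickle = pickle.loads(pickle.dumps(rows))

print(type(from_csv[0]["orig_bytes"]).__name__)
print(type(from_pickle[0]["orig_bytes"]).__name__)
```

With CSV, numeric columns need to be converted back explicitly after reading, whereas the pickled object can be used as it was saved.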

    Datasets' Content

    Within this dataset, three sub-datasets are available, namely BLE, IP-Based Packet, and IP-Based Flows. The tables below list the features selected for each dataset and used in the evaluation model of the accompanying work.

    Identified Key Features Within Bluetooth Dataset

    | Feature                               | Meaning                                     |
    |:--------------------------------------|:--------------------------------------------|
    | btle.advertising_header               | BLE Advertising Packet Header               |
    | btle.advertising_header.ch_sel        | BLE Advertising Channel Selection Algorithm |
    | btle.advertising_header.length        | BLE Advertising Length                      |
    | btle.advertising_header.pdu_type      | BLE Advertising PDU Type                    |
    | btle.advertising_header.randomized_rx | BLE Advertising Rx Address                  |
    | btle.advertising_header.randomized_tx | BLE Advertising Tx Address                  |
    | btle.advertising_header.rfu.1         | Reserved For Future 1                       |
    | btle.advertising_header.rfu.2         | Reserved For Future 2                       |
    | btle.advertising_header.rfu.3         | Reserved For Future 3                       |
    | btle.advertising_header.rfu.4         | Reserved For Future 4                       |
    | btle.control.instant                  | Instant Value Within a BLE Control Packet   |
    | btle.crc.incorrect                    | Incorrect CRC                               |
    | btle.extended_advertising             | Advertiser Data Information                 |
    | btle.extended_advertising.did         | Advertiser Data Identifier                  |
    | btle.extended_advertising.sid         | Advertiser Set Identifier                   |
    | btle.length                           | BLE Length                                  |
    | frame.cap_len                         | Frame Length Stored Into the Capture File   |
    | frame.interface_id                    | Interface ID                                |
    | frame.len                             | Frame Length on the Wire                    |
    | nordic_ble.board_id                   | Board ID                                    |
    | nordic_ble.channel                    | Channel Index                               |
    | nordic_ble.crcok                      | Indicates if CRC is Correct                 |
    | nordic_ble.flags                      | Flags                                       |
    | nordic_ble.packet_counter             | Packet Counter                              |
    | nordic_ble.packet_time                | Packet time (start to end)                  |
    | nordic_ble.phy                        | PHY                                         |
    | nordic_ble.protover                   | Protocol Version                            |

    Identified Key Features Within IP-Based Packets Dataset

    | Feature                  | Meaning                                                                    |
    |:-------------------------|:---------------------------------------------------------------------------|
    | http.content_length      | Length of content in an HTTP response                                      |
    | http.request             | HTTP request being made                                                    |
    | http.response.code       | Status code of an HTTP response                                            |
    | http.response_number     | Sequential number of an HTTP response                                      |
    | http.time                | Time taken for an HTTP transaction                                         |
    | tcp.analysis.initial_rtt | Initial round-trip time for TCP connection                                 |
    | tcp.connection.fin       | TCP connection termination with a FIN flag                                 |
    | tcp.connection.syn       | TCP connection initiation with SYN flag                                    |
    | tcp.connection.synack    | TCP connection establishment with SYN-ACK flags                            |
    | tcp.flags.cwr            | Congestion Window Reduced flag in TCP                                      |
    | tcp.flags.ecn            | Explicit Congestion Notification flag in TCP                               |
    | tcp.flags.fin            | FIN flag in TCP                                                            |
    | tcp.flags.ns             | Nonce Sum flag in TCP                                                      |
    | tcp.flags.res            | Reserved flags in TCP                                                      |
    | tcp.flags.syn            | SYN flag in TCP                                                            |
    | tcp.flags.urg            | Urgent flag in TCP                                                         |
    | tcp.urgent_pointer       | Pointer to urgent data in TCP                                              |
    | ip.frag_offset           | Fragment offset in IP packets                                              |
    | eth.dst.ig               | Individual/group (IG) bit of the Ethernet destination address              |
    | eth.src.ig               | Individual/group (IG) bit of the Ethernet source address                   |
    | eth.src.lg               | Locally administered/globally unique (LG) bit of the Ethernet source address |
    | eth.src_not_group        | Ethernet source is not in any network group                                |
    | arp.isannouncement       | Indicates if an ARP message is an announcement                             |

    Identified Key Features Within IP-Based Flows Dataset

    | Feature           | Meaning                                    |
    |:------------------|:-------------------------------------------|
    | proto             | Transport layer protocol of the connection |
    | service           | Identification of an application protocol  |
    | orig_bytes        | Originator payload bytes                   |
    | resp_bytes        | Responder payload bytes                    |
    | history           | Connection state history                   |
    | orig_pkts         | Originator sent packets                    |
    | resp_pkts         | Responder sent packets                     |
    | flow_duration     | Length of the flow in seconds              |
    | fwd_pkts_tot      | Forward packets total                      |
    | bwd_pkts_tot      | Backward packets total                     |
    | fwd_data_pkts_tot | Forward data packets total                 |
    | bwd_data_pkts_tot | Backward data packets total                |
    | fwd_pkts_per_sec  | Forward packets per second                 |
    | bwd_pkts_per_sec  | Backward packets per second                |
    | flow_pkts_per_sec | Flow packets per second                    |
    | fwd_header_size   | Forward header bytes                       |
    | bwd_header_size   | Backward header bytes                      |
    | fwd_pkts_payload  | Forward payload bytes                      |
    | bwd_pkts_payload  | Backward payload bytes                     |
    | flow_pkts_payload | Flow payload bytes                         |
    | fwd_iat           | Forward inter-arrival time                 |
    | bwd_iat           | Backward inter-arrival time                |
    | flow_iat          | Flow inter-arrival time                    |
    | active            | Flow active duration                       |
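
A minimal sketch of how a few of the flow features listed above might be pulled out of a CSV export and turned into numeric vectors for a model. The column names come from the feature list; the sample values, the choice of numeric columns, and the in-memory CSV standing in for a real file are all assumptions made for illustration.

```python
import csv
import io

# Subset of numeric flow features named in the feature list above.
NUMERIC_FEATURES = ["orig_bytes", "resp_bytes", "orig_pkts", "resp_pkts", "flow_duration"]

# Hypothetical sample with invented values; a real file would be opened with open(...).
SAMPLE_FLOWS = """proto,service,orig_bytes,resp_bytes,orig_pkts,resp_pkts,flow_duration
tcp,http,512,2048,6,5,0.8
udp,dns,60,120,1,1,0.02
"""


def to_feature_vectors(csv_file, features):
    """Convert each CSV row into a list of floats for the requested features."""
    vectors = []
    for row in csv.DictReader(csv_file):
        vectors.append([float(row[name]) for name in features])
    return vectors


vectors = to_feature_vectors(io.StringIO(SAMPLE_FLOWS), NUMERIC_FEATURES)
print(vectors)
```

Categorical columns such as `proto`, `service`, and `history` are left out here because they would need an encoding step before being used as numeric input.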

  6. Average cost per data breach in the United States 2006-2024

    • statista.com
    Updated Jun 23, 2025
    Cite
    Statista (2025). Average cost per data breach in the United States 2006-2024 [Dataset]. https://www.statista.com/statistics/273575/us-average-cost-incurred-by-a-data-breach/
    Dataset updated
    Jun 23, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    United States
    Description

    As of 2024, the average cost of a data breach in the United States amounted to **** million U.S. dollars, down from **** million U.S. dollars in the previous year. The global average cost per data breach was **** million U.S. dollars in 2024.

    Cost of a data breach in different countries worldwide

    Data breaches pose a major threat to organizations globally. The monetary damage caused by data breaches has increased in many markets in the past decade. In 2023, Canada ranked behind the U.S. in data breach costs, with an average of **** million U.S. dollars. Since 2019, the average monetary damage caused by loss of sensitive information in Canada has increased notably. In the United Kingdom, the average cost of a data breach in 2024 amounted to around **** million U.S. dollars, while in Germany it stood at **** million U.S. dollars.

    The cost of data breach by industry and segment

    Data breach costs vary depending on the industry and segment. For the fourth consecutive year, the global healthcare sector registered the highest data breach costs, which in 2024 amounted to about **** million U.S. dollars. Financial institutions ranked second, with an average cost of *** million U.S. dollars per data breach. Detection and escalation was the costliest segment in data breaches worldwide, at **** U.S. dollars on average. The cost of lost business ranked second, while post-breach response was the third-costliest segment.

