https://academictorrents.com/nolicensespecified
Version 3, with 517M hashes and counts of password usage, ordered from most to least prevalent. Pwned Passwords are 517,238,891 real-world passwords previously exposed in data breaches. This exposure makes them unsuitable for ongoing use, as they're at much greater risk of being used to take over other accounts. They're searchable online below, as well as being downloadable for use in other online systems. The entire set of passwords is downloadable for free below, with each password represented as a SHA-1 hash to protect the original value (some passwords contain personally identifiable information), followed by a count of how many times that password has been seen in the source data breaches. The list may be integrated into other systems and used to verify whether a password has previously appeared in a data breach, after which a system may warn the user or even block the password outright.
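The hash-plus-count format lends itself to a simple offline check. Below is a minimal sketch, assuming the downloaded list has one `SHA1HASH:count` entry per line with upper-case hex hashes; the file name is a placeholder for wherever the list was saved.

```python
import hashlib

def pwned_count(password: str, path: str = "pwned-passwords-sha1.txt") -> int:
    """Return how many times `password` appears in the downloaded list (0 if absent).

    Assumes one 'SHA1HASH:count' entry per line; the file name is a placeholder.
    """
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    with open(path, "r", encoding="utf-8") as fh:
        for line in fh:
            candidate, _, count = line.strip().partition(":")
            if candidate == digest:
                return int(count)
    return 0

if __name__ == "__main__":
    # A system integrating the list might warn the user or block the password outright.
    if pwned_count("P@ssw0rd") > 0:
        print("This password has appeared in a breach; please choose another.")
```

A linear scan is shown only for clarity; a system doing many lookups would typically load the hashes into a database or similar indexed store first.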
The largest reported data leakage as of January 2025 was the Cam4 data breach in March 2020, which exposed more than 10 billion data records. The second-largest data breach in history so far, the Yahoo data breach, occurred in 2013. The company initially reported about one billion exposed data records, but after an investigation, it updated the number, revealing that three billion accounts were affected. The National Public Data breach was announced in August 2024. The incident became public when personally identifiable information of individuals became available for sale on the dark web. Overall, security professionals estimate the leakage of nearly three billion personal records. The next significant data leakage was the March 2018 security breach of India's national ID database, Aadhaar, with over 1.1 billion records exposed. This included biometric information such as identification numbers and fingerprint scans, which could be used to open bank accounts and receive financial aid, among other government services.
Cybercrime: the dark side of digitalization
As the world continues its journey into the digital age, corporations and governments across the globe have been increasing their reliance on technology to collect, analyze and store personal data. This, in turn, has led to a rise in the number of cybercrimes, ranging from minor breaches to global-scale attacks impacting billions of users, such as in the case of Yahoo. Within the U.S. alone, 1,802 cases of data compromise were reported in 2022, a marked increase from the 447 cases reported a decade prior.
The high price of data protection
As of 2022, the average cost of a single data breach across all industries worldwide stood at around 4.35 million U.S. dollars. Breaches were most costly in the healthcare sector, with each leak reported to have cost the affected party a hefty 10.1 million U.S. dollars. The financial sector followed closely behind: each breach there resulted in a loss of approximately 6 million U.S. dollars, 1.5 million more than the global average.
Washington law requires entities impacted by a data breach to notify the Attorney General's Office (AGO) when the personal information of more than 500 Washingtonians was compromised as a result of the breach. This dataset is a collection of statistics derived from these notices and is the source of data used to produce the AGO's Annual Data Breach Report.
It's hard to keep track of passwords in our digital world. It might be hard to remember logins, keep your accounts safe, and manage many accounts at once. This is where Dashlane comes in. It makes managing passwords easy because it has a simple UI and strong security. This tutorial is for you if you want to know how to log in to your Dashlane account and what makes it different from other password managers.
Why should you use Dashlane to manage your passwords?
It's crucial to know why millions of people around the world choose Dashlane before we get into the login process. Here's how it helps people in their daily lives:
Better security: Dashlane uses strong encryption to keep your credentials safe. With AES-256 encryption, which is the best in the business, you don't have to worry about hackers getting to your data.
Easy to use on all devices: Dashlane makes it easy to store passwords across various devices. You can easily get to your login information on any device, whether it's a smartphone, laptop, or tablet.
Easier to log in: After you set it up, Dashlane's autofill function lets you log in to apps and websites without having to type in your username and password. Not only is it faster, it also eliminates mistakes.
Keeping an eye on the dark web: Dashlane does more than merely keep track of passwords; it also checks the dark web for leaks of personal information. You will be notified right away if your information has been leaked.
Use a VPN to keep your privacy safe: In addition to password management, Dashlane includes a virtual private network (VPN) to keep your browsing private on public Wi-Fi.
How to Access Your Dashlane Account
Whether you're new to Dashlane or use it every day, it's easy to log in. To safely log into your account and start managing your passwords, do the following:
Step 1: Get the Dashlane app. Downloading Dashlane is the first thing you need to do if you're new. It works on all of the most popular platforms, including Windows, macOS, iOS, and Android. You can also use Dashlane as a browser extension on popular browsers, including Chrome, Firefox, and Edge.
Step 2: Launch Dashlane. Once you've installed the Dashlane app or browser extension, open it.
Step 3: Enter your email address. Type in the email address that is linked to your Dashlane account. This will take you to the login page.
Step 4: Provide your Master Password. The first time you log in, you'll need to create a Master Password: a single, strong password that unlocks your vault. To get back in later, just type in your Master Password. Tip: your Master Password should be hard to guess but easy to remember. Consider mixing numerals, uppercase and lowercase letters, and special characters.
Step 5: Verify (if necessary). If you have two-factor authentication (2FA) set up on your account, you will also need to complete this step. Dashlane may ask you to enter a code delivered to your email or generated by an authentication app.
Step 6: Open your vault. Once you sign in, you'll see your password vault. This is where you can manage your stored logins, credit card information, and confidential notes.
Important Security Features of Dashlane
Dashlane puts your safety first with these cutting-edge features:
Encryption with AES-256: Your private information is stored with military-grade encryption, which keeps it safe from hackers.
Zero-knowledge architecture: Dashlane uses a zero-knowledge security model, which means the company can't see or access your passwords.
Biometric login options: You can make things easier without giving up security by turning on biometric authentication, such as Face ID or fingerprint scanning, on devices that support it.
Password health insights: Dashlane doesn't just keep your passwords safe; it also analyzes them. It flags passwords that are weak or reused, which helps you make your accounts stronger.
Emergency access: You can let a trusted person in if there is an emergency, which makes it easier to keep track of critical accounts.
How to Get the Most Out of Dashlane
To get the most out of your Dashlane login account, follow these tips:
Turn on Autofill: Autofill can help you save time when you log in, especially to sites you visit often.
Change your passwords often: Change your passwords every now and then to make them more secure. You can make strong, unique passwords in seconds using Dashlane's Password Generator.
Turn on two-factor authentication: Always use two-factor authentication (2FA) to add an extra layer of security to your account. This way, even if your password is stolen, your account will still be safe.
Use the Password Health Tool: Check your password health score often and change any credentials that are marked.
What Makes Dashlane Unique
Dashlane is different from other password managers because it is easy to use and has sophisticated capabilities like dark web monitoring and built-in VPN services. Dashlane keeps your information safe without slowing you down when you sign in for work, shop online, or manage your personal accounts.
Last Thoughts
It's easier than ever to keep your online security in check. Dashlane makes it easy to get to your accounts, keeps your private data safe, and protects you from breaches before they happen. Now that you know how to log in to Dashlane, why not give it a shot and take charge of your passwords? Dashlane is the greatest way to keep your online life safe, because your safety deserves the best.
In 2024, the number of data compromises in the United States stood at 3,158 cases. Meanwhile, over 1.35 billion individuals were affected in the same year by data compromises, including data breaches, leakage, and exposure. While these are three different events, they have one thing in common: as a result of all three, sensitive data is accessed by an unauthorized threat actor.
Industries most vulnerable to data breaches
Some industry sectors usually see more significant cases of private data violations than others. This is determined by the type and volume of the personal information that organizations in these sectors store. In 2024, financial services, healthcare, and professional services were the three industry sectors that recorded the most data breaches. Overall, the number of data breaches in some industry sectors in the United States has gradually increased over the past few years; however, some sectors saw a decrease.
Largest data exposures worldwide
In 2020, an adult streaming website, CAM4, experienced a leakage of nearly 11 billion records, by far the most extensive reported data leakage. The case is unique, though, because cybersecurity researchers found the vulnerability before the cybercriminals did. The second-largest data breach is the Yahoo data breach, dating back to 2013. The company first reported about one billion exposed records, then later, in 2017, issued an updated figure of three billion leaked records. In March 2018, the third-biggest data breach happened, involving India's national identification database, Aadhaar. As a result of this incident, over 1.1 billion records were exposed.
During the third quarter of 2024, data breaches exposed more than *** million records worldwide. Since the first quarter of 2020, the highest number of data records were exposed in the first quarter of ***, at more than *** million data sets. Data breaches remain among the biggest concerns of company leaders worldwide. The most common causes of sensitive information loss were operating system vulnerabilities on endpoint devices.
Which industries see the most data breaches?
Certain conditions make some industry sectors more prone to data breaches than others. According to the latest observations, public administration experienced the highest number of data breaches between 2021 and 2022, with *** reported data breach incidents with confirmed data loss. Financial institutions came second, with *** data breach cases, followed by healthcare providers.
Data breach cost
Data breach incidents have various consequences, the most common being financial losses and business disruptions. As of 2023, the average data breach cost across businesses worldwide was **** million U.S. dollars, while a leaked data record cost about *** U.S. dollars. The United States saw the highest average breach cost globally, at **** million U.S. dollars.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. The data has been made public and presents a diverse set of email information ranging from internal, marketing emails to spam and fraud attempts.
In the early 2000s, Leslie Kaelbling at MIT purchased the dataset and noted that, though the dataset contained scam emails, it also had several integrity problems. The dataset was later updated, but it remains essential to ensure privacy in the data while it is used to train a deep neural network model.
Though the Enron Email Dataset contains over 500K emails, one of the problems with it is the lack of labeled fraud emails. Label annotation is performed to accurately detect an umbrella of fraud emails. Since fraud emails fall into several types, such as phishing, financial, romance, subscription, and Nigerian Prince scams, multiple heuristics have to be used to label all types of fraudulent emails effectively.
To tackle this problem, heuristics have been used to label the Enron data corpus using email signals, and automated labeling has been performed using simple ML models on other smaller email datasets available online. These fraud annotation techniques are discussed in detail below.
To perform fraud annotation on the Enron dataset, as well as to provide more fraud examples for modeling, two more fraud data sources have been used:
Phishing Email Dataset: https://www.kaggle.com/dsv/6090437
Social Engineering Dataset: http://aclweb.org/aclwiki
To label the Enron email dataset, two kinds of signals are used to filter suspicious emails and label them into fraud and non-fraud classes: automated ML labeling and email signals.
The following heuristics are used to annotate labels for Enron email data using the other two data sources:
Phishing Model Annotation: A high-precision SVM model trained on the Phishing mails dataset, which is used to annotate the Phishing Label on the Enron Dataset.
Social Engineering Model Annotation: A high-precision SVM model trained on the Social Engineering mails dataset, which is used to annotate the Social Engineering Label on the Enron Dataset.
The two ML Annotator models use Term Frequency Inverse Document Frequency (TF-IDF) to embed the input text and make use of SVM models with Gaussian Kernel.
If either of the models predicted that an email was a fraud, the mail metadata was checked for several email signals. If these heuristics meet the requirements of a high-probability fraud email, we label it as a fraud email.
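As a rough illustration of this annotation pipeline, the sketch below builds a TF-IDF plus Gaussian-kernel SVM annotator with scikit-learn and applies it to Enron message bodies. The file names, column names, and confidence threshold are assumptions for illustration, not the authors' exact setup.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical inputs: a labeled phishing corpus and the unlabeled Enron bodies.
phishing = pd.read_csv("phishing_emails.csv")   # columns: text, label (1 = phishing)
enron = pd.read_csv("enron_emails.csv")         # column: text

# TF-IDF embedding followed by an SVM with a Gaussian (RBF) kernel,
# mirroring the description of the two ML annotator models.
annotator = make_pipeline(
    TfidfVectorizer(stop_words="english", max_features=50_000),
    SVC(kernel="rbf", C=10.0, probability=True),
)
annotator.fit(phishing["text"], phishing["label"])

# Keep only high-precision predictions: flag an Enron email as suspicious
# when the model is very confident, then pass it on to the email-signal checks.
proba = annotator.predict_proba(enron["text"])[:, 1]
enron["ml_phishing_flag"] = proba > 0.9
```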
Email signal-based heuristics are used to specifically filter and target suspicious emails for fraud labeling. The signals used were as follows (a simple combining rule is sketched after the list):
Person Of Interest: There is a publicly available list of email addresses of employees who were liable for the massive data leak at Enron. These user mailboxes have a higher chance of containing quality fraud emails.
Suspicious Folders: The Enron data is dumped into several folders for every employee, consisting of inbox, deleted_items, junk, calendar, etc. Folders such as deleted_items and junk have a higher chance of containing fraud emails.
Sender Type: The sender type was categorized as ‘Internal’ and ‘External’ based on their email address.
Low Communication: A threshold of 4 emails, based on the table below, was used to define low communication. A user qualifies as a low-comm sender if their email count is below this threshold. Mails sent from low-comm senders were assigned a high probability of being fraud.
Contains Replies and Forwards: If an email contains forwards or replies, a low probability was assigned for it to be a fraud email.
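Below is a minimal sketch of how such signals might be combined into a single high-probability-fraud flag. The field names, folder names, and score cut-off are hypothetical; only the 4-email low-communication threshold comes from the description above.

```python
EMAIL_THRESHOLD = 4  # below this, a sender counts as "low communication"

def is_high_probability_fraud(mail: dict) -> bool:
    """Combine the email signals into a single heuristic flag.

    `mail` is a hypothetical record with the fields used below; only emails
    already flagged by one of the ML annotators reach this check.
    """
    score = 0
    score += mail["owner_is_person_of_interest"]           # POI mailboxes
    score += mail["folder"] in {"deleted_items", "junk"}    # suspicious folders
    score += mail["sender_type"] == "External"              # external senders
    score += mail["sender_email_count"] < EMAIL_THRESHOLD   # low-communication sender
    score -= mail["contains_reply_or_forward"]              # replies/forwards lower the odds
    return score >= 3  # illustrative cut-off, not the authors' exact rule
```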
To ensure high-quality labels, the mismatched examples from ML annotation have been manually inspected for Enron dataset relabeling.
Fraud | Non-Fraud |
---|---|
2327 | 445090 |
Enron Dataset Title: Enron Email Dataset URL: https://www.cs.cmu.edu/~enron/ Publisher: MIT, CMU Author: Leslie Kaelbling, William W. Cohen Year: 2015
Phishing Email Detection Dataset Title: Phishing Email Detection URL: https://www.kaggle.com/dsv/6090437 DOI: 10.34740/KAGGLE/DSV/6090437 Publisher: Kaggle Author: Subhadeep Chakraborty Year: 2023
CLAIR Fraud Email Collection Title: CLAIR collection of fraud email URL: http://aclweb.org/aclwiki Author: Radev, D. Year: 2008
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
We record data leaks within the organization of the municipality of Utrecht in the public data leak register. This dataset contains the following data:
• Date;
• Description of the data breach;
• (Possible) consequences for the person(s) involved;
• Corrective actions taken;
• Whether the Dutch Data Protection Authority (AP) has been informed;
• Whether the data subject(s) have been informed.
Only completed notifications are included in the register. Reports that are still being investigated by the municipality of Utrecht or the Dutch Data Protection Authority are not yet in the register; they are added once the investigation is completed.
More information?
Reporting a security or data leak: www.utrecht.nl/veiligheidslek-melden
How the municipality of Utrecht deals with privacy: www.utrecht.nl/privacy
This version of the CivilComments dataset provides access to the primary seven labels that were annotated by crowd workers; the toxicity and other tags are values between 0 and 1 indicating the fraction of annotators who assigned these attributes to the comment text.
The other tags are only available for a fraction of the input examples. They are currently ignored for the main dataset; the CivilCommentsIdentities set includes those labels, but only consists of the subset of the data with them. The other attributes that were part of the original CivilComments release are included only in the raw data. See the Kaggle documentation for more details about the available features.
The comments in this dataset come from an archive of the Civil Comments platform, a commenting plugin for independent news sites. These public comments were created from 2015 - 2017 and appeared on approximately 50 English-language news sites across the world. When Civil Comments shut down in 2017, they chose to make the public comments available in a lasting open archive to enable future research. The original data, published on figshare, includes the public comment text, some associated metadata such as article IDs, publication IDs, timestamps and commenter-generated "civility" labels, but does not include user ids. Jigsaw extended this dataset by adding additional labels for toxicity, identity mentions, as well as covert offensiveness. This data set is an exact replica of the data released for the Jigsaw Unintended Bias in Toxicity Classification Kaggle challenge. This dataset is released under CC0, as is the underlying comment text.
For comments that have a parent_id also in the civil comments data, the text of the previous comment is provided as the "parent_text" feature. Note that the splits were made without regard to this information, so using previous comments may leak some information. The annotators did not have access to the parent text when making the labels.
To use this dataset:
```python
import tensorflow_datasets as tfds

ds = tfds.load('civil_comments', split='train')
for ex in ds.take(4):
  print(ex)
```
See the guide for more information on tensorflow_datasets.
http://opensource.org/licenses/BSD-2-Clause
Python code (for Python 3.9 and Pandas 1.3.2) to generate the results used in "Compromised through Compression: Privacy Implications of Smart Meter Traffic Analysis".
Smart metering comes with risks to privacy. One concern is the possibility of an attacker seeing the traffic that reports the energy use of a household and deriving private information from it. Encryption helps to mask the actual energy measurements, but is not sufficient to cover all risks. One aspect which has so far gone unexplored, and where encryption does not help, is traffic analysis: whether the length of messages communicating energy measurements can leak privacy-sensitive information to an observer. In this paper we examine whether using encodings or compression for smart metering data could potentially leak information about household energy use. Our analysis is based on the real-world energy use data of ±80 Dutch households.
We find that traffic analysis could reveal information about the energy use of individual households if compression is used. As a result, when messages are sent daily, an attacker performing traffic analysis would be able to determine when all the members of a household are away or not using electricity for an entire day. We demonstrate this issue by recognizing when households from our dataset were on holiday. If messages are sent more often, more granular living patterns could likely be determined.
We propose a method of encoding the data that is nearly as effective as compression at reducing message size, but does not leak the information that compression leaks. By not requiring compression to achieve the best possible data savings, the risk of traffic analysis is eliminated.
This code operates on the relative energy measurements from the "Zonnedael dataset" from Liander N.V. This dataset needs to be obtained separately; see the instructions accompanying the code. The code transforms the dataset into absolute measurements such as would be taken by a smart meter. It then generates batch messages covering 24-hour periods starting at midnight, similar to how the Dutch infrastructure batches daily meter readings, in the different possible encodings with and without compression applied. For an explanation of the different encodings, see the paper. The code then provides statistics on the efficiency of encoding and compression for the entire dataset, and attempts to find the periods of multi-day absences for each household. It will also generate the graphs in the style used in the paper and presentation.
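The following is not the authors' code (which requires the Zonnedael dataset), but a minimal self-contained sketch of the underlying leak: a day of near-constant low readings compresses far better than a normal day, so message length alone can separate the two even when the payload is encrypted. The consumption profiles are invented for illustration.

```python
import random
import zlib

random.seed(0)

def daily_message(readings):
    """Encode 96 quarter-hourly Wh readings as a simple text batch message."""
    return "\n".join(str(r) for r in readings).encode("ascii")

# Hypothetical consumption profiles: a normal day versus an empty house.
normal_day = [random.randint(50, 400) for _ in range(96)]
away_day = [random.choice([8, 9, 10]) for _ in range(96)]  # fridge/standby only

for name, day in [("normal", normal_day), ("away", away_day)]:
    raw = daily_message(day)
    compressed = zlib.compress(raw)
    print(f"{name}: raw {len(raw)} bytes, compressed {len(compressed)} bytes")

# The repetitive "away" day compresses far better, so an observer who only
# sees ciphertext lengths can still tell the two kinds of day apart.
```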
Dataset Card for prompt-leak-R1-dpo
This dataset card aims to be a base template for new datasets. It has been generated using this raw template.
Dataset Details
Dataset Description
Curated by: [More Information Needed] Funded by [optional]: [More Information Needed] Shared by [optional]: [More Information Needed] Language(s) (NLP): [More Information Needed] License: [More Information Needed]
Dataset Sources [optional]
Repository: [More… See the full description on the dataset page: https://huggingface.co/datasets/lianghsun/prompt-leak-R1-dpo.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset is designed to test features of Connectomix. For more information, please visit the GitHub repository:
https://github.com/ln2t/connectomix
The dataset contains data for 4 participants, split into `control` and `patient` groups in the `participants.tsv` file. This split has been done artificially and serves only testing purposes.
The exact commands to run the analyses depend on your installation of fMRIPrep. In what follows, we simply assume that `fmriprep` is the command for fMRIPrep.
We show here the simplest version of the commands, assuming you adapt those depending on your setup (e.g. if you use Docker).
We also assume that the data are at the following locations:
```bash
bids_dir='/data/ds005699'
derivatives_dir='/data/ds005699/derivatives'
fmriprep $bids_dir ${derivatives_dir}/fmriprep participant --fs-license-file /path/to/fs/license
```
Note: The following has been tested for connectomix version 1.0.1.
First set-up path to connectomix script:
```bash
connectomix_cmd='/path/to/connectomix/connectomix/connectomix.py'
```
Second, set-up paths to config directory:
```bash
config_dir='/data/ds005625/code/connectomix/config'
$connectomix_cmd ${bids_dir} ${derivatives_dir}/connectomix participant --derivatives fmriprep="${derivatives_dir}/fmriprep" --config "${config_dir}/participant_level_config.yaml"
```
Notes:
- This is an example of an independent two-samples t-test.
- Since the dataset contains only four subjects (two per group), the number of possible permutations is very low. For this reason, the number of computed permutations is set to 4, and connectomix can then complete the group-level analysis. Of course, realistic cases should include not only many more participants but also a much larger number of permutations (see the connectomix documentation).
```bash
$connectomix_cmd ${bids_dir} ${derivatives_dir}/connectomix group --config "${config_dir}/group_level_config.yaml"
```
The data is updated nightly using ArcGIS scripting. Scripting will not update the ArcGIS Online "item updated" date, which only reflects the last time the ArcGIS Online item page was updated.
A typical leaking underground storage tank (LUST) scenario involves the release of a fuel product from an underground storage tank (UST) that can contaminate surrounding soil, groundwater, or surface waters, or affect indoor air spaces. Early detection of a UST release is important, as is determining the source of the release, the type of fuel released, the occurrence of imminently threatened receptors, and the appropriate initial response. The primary objective of the initial response is to determine the nature and extent of a release as soon as possible.
PROHIBITED USES: KSA 45-230 prohibits the use of names and addresses contained in public records for certain commercial purposes. By submitting this request, you are signing the following written certification that you will not use the information in the records for any purpose prohibited by law.
DATA LIMITATIONS:
This data set is not designed for use as a regulatory tool in permitting or siting decisions; it may be used as a reference source. Carefully consider the provisional or incomplete nature of these data before using them for decisions that concern personal safety or involve substantial monetary consequences.
This dataset contains one facility point per LUST data record. The points will be stacked if multiple LUSTs occurred at the same facility.
A new facility point is added when a new facility is added to the origination database.
Data is replicated on a nightly basis for public consumption. KDHE is not responsible for database integrity following download.
The facility point is not the exact location of the tank, but a general representative point somewhere on the property of the storage tank facility.
KDHE makes no assurances of the accuracy or validity of information presented in the spatial data. KDHE tanks have been located using a variety of locational methods. More recent points are geocoded and validated with an accuracy of 3-10 meters. Many inactive or old facilities only had a legal description from which to calculate point placement on a map, with an accuracy of 250-2000 meters.
For users who wish to interact with the data in a finished product, KDHE recommends using our Kansas Environmental Interest Finder. More information about KDHE can be found on the Kansas Department of Health and Environment website. More information about KDHE storage tanks can be found on the Storage Tanks Division pages of the KDHE website.
ATTRIBUTES description:
Start Date/End Date: The LUST is considered finished when the remediation has occurred and the environment is back to its pre-contamination state. A new LUST will be recorded if the tank leaks again.
Approved TRUST: Flag is Yes if approved for the EPA TRUST fund. In 1986, Congress created the Leaking Underground Storage Tank (LUST) Trust Fund to address petroleum releases from federally regulated underground storage tanks (USTs) by amending Subtitle I of the Solid Waste Disposal Act. In 2005, the Energy Policy Act expanded eligible uses of the Trust Fund to include certain leak prevention activities.
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
Only fires are included here. All other incidents, including EMS calls and false alarms, have been excluded.
This dataset contains Connecticut Fire Department Incidents as reported to the National Fire Department Incident Reporting System (NFIRS).
Note that the 2014 and 2016 data have far more entries than the other years; in particular, they detail "False Alarm and False Calls" and "Rescue and Emergency Medical Service (EMS) Incidents".
NFIRS collects details on Fire, HazMat and EMS incidents nationwide, detailing the type of incident, where it occurred, the resources used to mitigate it and more, with the goal of understanding the nature and causes of the incidents. Information is also collected on the number of civilian or firefighter casualties and an estimate of property loss.
Participation in NFIRS is voluntary.
Data is released yearly, with a considerable delay.
Each incident is assigned a 3-digit Incident Type Code. The code describes the situation emergency personnel found when they arrived. Incident Types are grouped into larger categories, called Series (a short sketch for deriving the Series from a code follows the list below).
For example, Series 400, the 'Hazardous Condition' category, includes incident types 411, 'Gasoline or other flammable liquid spill'; 412, 'Gas leak'; and 413, 'Oil or other combustible liquid spill'.
Not every Incident Type is included in the data. In 2012, 2013, 2014 and 2015, the NFIRS data releases contained these Series/Incident Types:
Series 100: Fire Incidents; Series 400: Hazardous Condition (No Fire); Incident Type 561: Unauthorized Burning, under the 'Service Call' Series; Incident Type 631: Authorized Controlled Burning, under the 'Good Intent Call' Series; and Incident Type 632: Prescribed Fires, also under the 'Good Intent Call' Series.
The 2014 and 2016 releases included these additional Series:
200: Overpressure Rupture, Explosion, Overheat (No Fire); 300: Rescue and Emergency Medical Service (EMS) Incidents; 500: Service Calls; 600: Good Intent Calls; 700: False Alarm and False Call; 800: Severe Weather and Natural Disaster; and 900: Special Incident Type.
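As referenced above, here is a short sketch of deriving the Series from a 3-digit Incident Type Code; the CSV file and column names are hypothetical stand-ins for however the dataset is exported.

```python
import pandas as pd

# Hypothetical export of the dataset with a 3-digit incident type code column.
incidents = pd.read_csv("ct_nfirs_incidents.csv")

# The Series is the hundreds bucket of the Incident Type Code: 411 -> 400.
incidents["series"] = (incidents["incident_type_code"] // 100) * 100

# Example: keep only the Hazardous Condition (No Fire) series.
hazardous = incidents[incidents["series"] == 400]
print(hazardous["incident_type_code"].value_counts())
```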
The official NFIRS documentation has been attached to this dataset.
This dataset does not contain all the detail available in the NFIRS database. If after reviewing the documentation, you find additional information you would like added to the dataset, please let us know.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
Reporting of leakage from water networks is based on the concept of monitoring flows at a time when demand is at a minimum, which is normally during the night. This dataset includes net night flow measurements for 10% of the publisher's total district metered areas. This 10% has been chosen on the basis that the telemetry on site is reliable, that it is not revealing of sensitive usage patterns, and that the night flow there is typical of low demand.
Key Definitions
Dataset
A structured and organized collection of related elements, often stored digitally, used for analysis and interpretation in various fields.
Data Triage
The process carried out by a Data Custodian to determine if there is any evidence of sensitivities associated with Data Assets, their associated Metadata and Software Scripts used to process Data Assets if they are used as Open Data.
District Metered Area (DMA)
The role of a district metered area (DMA) is to divide the water distribution network into manageable areas or sectors in which the flow can be measured. These areas give water providers guidance as to which DMAs require leak detection work.
Leakage
The accidental admission or escape of a fluid or gas through a hole or crack.
Night Flow
This technique considers that in a DMA, leakages can be estimated when the flow into the DMA is at its minimum. Typically, this is measured at night between 3am and 4am, when customer demand is low, so that network leakage can be detected.
Centroid
The centre of a geometric object.
Data History
Data Origin
Companies have configured their networks to be able to continuously monitor night flows using district meters. Flow data is recorded on meters and normally transmitted daily to a data centre. Data is analysed to confirm its validity and used to derive continuous night flow in each monitored area.
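A minimal sketch of deriving a net night flow figure from such meter readings with pandas is shown below; it assumes 15-minute readings indexed by timestamp and a hypothetical column name, and does not reproduce the publishers' own validation or allowance calculations.

```python
import pandas as pd

# Hypothetical input: 15-minute flow readings (litres/second) for one DMA,
# indexed by timestamp, as transmitted daily from the district meter.
flows = pd.read_csv("dma_flows.csv", parse_dates=["timestamp"], index_col="timestamp")

# Net night flow: the minimum flow observed between 03:00 and 04:00 each day,
# when customer demand is lowest and leakage dominates the measurement.
night_window = flows.between_time("03:00", "04:00")
net_night_flow = night_window["flow_l_per_s"].resample("1D").min()

print(net_night_flow.tail())
```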
Data Triage Considerations
Data Quality
Not all DMAs provide quality data for the purposes of trend analysis. It was decided that water companies should choose 10% of their DMAs to be represented in this data set to begin with. The advice to publishers is to choose those with reliable and consistent telemetry, indicative of genuine low demand during measurement times and not revealing of sensitive night usage patterns.
Data Consistency
There is a concern that companies measure flow allowance for legitimate night use and/or potential night use differently. To avoid any inconsistency, it was decided that we would share the net flow.
Critical National Infrastructure
The release of boundary data for district metered areas has been deemed to be revealing of critical national infrastructure. Because of this, it has been decided that the data set shall only contain point data from a centroid within the DMA.
Data Triage Review Frequency
Every 12 months, unless otherwise requested.
Data Limitations
Some of the flow recorded may be legitimate nighttime usage of the network.
Some measuring systems automatically infill estimated measurements where none have been received via telemetry. These estimates are based on past flow.
The reason for a fluctuation in night flow may not be determinable from this dataset, but potential causes include seasonal variation in nighttime water usage and mains bursts.
Data Publish Frequency
Monthly
Supplementary information
Below is a curated selection of links for additional reading, which provide a deeper understanding of this dataset.
Ofwat – Reporting Guidance https://www.ofwat.gov.uk/wp-content/uploads/2018/03/Reporting-guidance-leakage.pdf
Water UK – UK Leakage https://www.water.org.uk/wp-content/uploads/2022/03/Water-UK-A-leakage-Routemap-to-2050.pdf
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
In recent years, social Q&A sites such as Zhihu and Quora have developed rapidly, attracting hundreds of millions of users and providing a convenient platform for asking questions and sharing knowledge. Most users share knowledge through real-name answers, but this sometimes hinders knowledge sharing, for example when employees share company salaries or students share inside information about their laboratory. Revealing the author's true identity in such cases can cause significant harm to the author. To tackle this problem, social question-and-answer websites provide users with an anonymous answer function, which replaces the real author id of the answer with an anonymous id. Countless users who cannot disclose their identities use anonymous answers to share valuable knowledge.
Although there are countless anonymous answers in the Q&A community, few anonymization techniques have been applied. The two super-large Q&A sites, Quora and Zhihu, use two anonymization technologies: hiding the author's information and protecting the storage of anonymous user information. However, anonymous answers, questions, comments, and topics and their topological structure contain distinctive attributes unknown to most users, providing valuable features for de-anonymization attacks. Although the question-and-answer websites warn anonymous users that specific personal information and language styles in answers may lead to privacy leaks, the question-and-answer community cannot give the probability of, or the reason for, a privacy leak for a specific anonymous answer.
In this paper, we propose a novel task, the de-anonymization of the Q&A websites, which refers to recovering the identity information of the real author of the anonymous answer. This task aims to evaluate the risk of privacy leakage of a specific anonymous answer in the question-and-answer websites and explain why the answer is vulnerable to de-anonymization.
To explore the effectiveness of various methodologies, we employ web scraping techniques on public answers from online platforms Zhihu and Quora. The first step involves the selection of seed users related to the ten popular topics. We selected one seed user from each topic. This step can ensure that the collected Q&A community dataset encompasses a diverse range of popular topics. In the second step, we recursively crawl the social relationships between users based on the ten seed users crawled in the first step. In order to make the crawled user pool more widely distributed, we only crawled the first 100 following users for each user until the crawled user pool reached 700,000. Step three is to ascertain the users to be ultimately crawled. A community discovery algorithm is employed to identify a community with the highest transitivity. This community must have a population exceeding 5,000. As a result, we will crawl all users in this community. The fourth step involves extracting the data related to all users within the chosen community. To mitigate the issue of excessive data volume, this article sets a constraint on the number of answers collected by individual users during the crawling process. This upper limit ensures that all answers of 95% of users are crawled. The crawled data consists of user homepage information, relationships between users, user-generated questions and answers, and associated comments and topics. Each question includes its title and the name of the person who asked it. Each answer contains the author's name, time of submission, the content of the answer, and any first-level comments. Additionally, each comment includes its author, submission time, and content.
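The description does not name the community discovery algorithm, so the sketch below uses Louvain from networkx purely as an illustration of step three (selecting a community of more than 5,000 users with the highest transitivity); the edge-list file name is hypothetical.

```python
import networkx as nx

# Hypothetical input: the crawled "following" relationships as an edge list.
G = nx.read_edgelist("following_edges.txt")

# Community discovery (algorithm not specified in the description; Louvain is
# used here only as an illustration).
communities = nx.community.louvain_communities(G, seed=42)

# Keep communities with more than 5,000 members and pick the one whose
# induced subgraph has the highest transitivity.
candidates = [c for c in communities if len(c) > 5000]
best = max(candidates, key=lambda c: nx.transitivity(G.subgraph(c)))
print(f"selected community: {len(best)} users")
```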
Here is an example code to read the dataset https://www.kaggle.com/tianbaojie/example-code.
https://choosealicense.com/licenses/other/
Purpose and Features
🌍 World's largest open dataset for privacy masking 🌎 The dataset is useful to train and evaluate models to remove personally identifiable and sensitive information from text, especially in the context of AI assistants and LLMs. Key facts:
OpenPII-220k text entries have 27 PII classes (types of sensitive data), targeting 749 discussion subjects / use cases split across education, health, and psychology. FinPII contains an additional ~20 types tailored to… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-300k.
Description
Dataframe containing 2075 French books in txt format (= the ~2600 French books present in Gutenberg from which all books by authors present in the french_books_summuries dataset have been removed to avoid any leaks). More precisely (a short loading example follows the list):
- the `texte` column contains the texts
- the `titre` column contains the book title
- the `auteur` column contains the author's name and dates of birth and death (if you want to filter the texts to keep only those from the given century to the present… See the full description on the dataset page: https://huggingface.co/datasets/CATIE-AQ/french_books.
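A small sketch of loading the dataset with the datasets library and filtering on the `auteur` column follows; the assumption that birth/death dates appear as four-digit years in that column, and the `train` split name, are illustrative guesses.

```python
import re
from datasets import load_dataset

ds = load_dataset("CATIE-AQ/french_books", split="train")

# Keep only books whose author's earliest four-digit year (assumed to be the
# birth year embedded in the `auteur` column) is 1800 or later.
def from_19th_century_on(example):
    years = [int(y) for y in re.findall(r"\d{4}", example["auteur"])]
    return bool(years) and min(years) >= 1800

modern = ds.filter(from_19th_century_on)
print(len(modern))
```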
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the data repository for the LO2 dataset.
Here is an overview of the contents.
lo2-data.zip
This is the main dataset. This is the completely unedited output of our data collection process. Note that the uncompressed size is around 540 GB. For more information, see the paper and the data-appendix in this repository.
lo2-sample.zip
This is a sample that contains the data used for preliminary analysis. It contains only service logs and the most relevant metrics for the first 100 runs. Furthermore, the metrics are combined at the run level into a single CSV to make them easier to use.
data-appendix.pdf
This document contains further details and stats about the full dataset. These include file size distributions, empty file analysis, log type analysis and the appearance of an unknown file.
lo2-scripts.zip
Various scripts for processing the data to create the sample, to conduct the preliminary analysis and to create the statistics seen in the data-appendix.
Version v3: Updated data appendix introduction, added another stage in the log analysis process in loglead_lo2.py
Full title: Using Decision Trees to Detect and Isolate Simulated Leaks in the J-2X Rocket Engine
Mark Schwabacher, NASA Ames Research Center
Robert Aguilar, Pratt & Whitney Rocketdyne
Fernando Figueroa, NASA Stennis Space Center
Abstract
The goal of this work was to use data-driven methods to automatically detect and isolate faults in the J-2X rocket engine. It was decided to use decision trees, since they tend to be easier to interpret than other data-driven methods. The decision tree algorithm automatically "learns" a decision tree by performing a search through the space of possible decision trees to find one that fits the training data. The particular decision tree algorithm used is known as C4.5. Simulated J-2X data from a high-fidelity simulator developed at Pratt & Whitney Rocketdyne and known as the Detailed Real-Time Model (DRTM) was used to "train" and test the decision tree. Fifty-six DRTM simulations were performed for this purpose, with different leak sizes, different leak locations, and different times of leak onset. To make the simulations as realistic as possible, they included simulated sensor noise and a gradual degradation in both fuel and oxidizer turbine efficiency. A decision tree was trained using 11 of these simulations and tested using the remaining 45 simulations. In the training phase, the C4.5 algorithm was provided with labeled examples of data from nominal operation and data including leaks in each leak location. From the data, it "learned" a decision tree that can classify unseen data as having no leak or having a leak in one of the five leak locations. In the test phase, the decision tree produced very low false alarm rates and low missed detection rates on the unseen data. It had very good fault isolation rates for three of the five simulated leak locations, but it tended to confuse the remaining two locations, perhaps because a large leak at one of these two locations can look very similar to a small leak at the other location.
Introduction
The J-2X rocket engine will be tested on Test Stand A-1 at NASA Stennis Space Center (SSC) in Mississippi. A team including people from SSC, NASA Ames Research Center (ARC), and Pratt & Whitney Rocketdyne (PWR) is developing a prototype end-to-end integrated systems health management (ISHM) system that will be used to monitor the test stand and the engine while the engine is on the test stand [1]. The prototype will use several different methods for detecting and diagnosing faults in the test stand and the engine, including rule-based, model-based, and data-driven approaches. SSC is currently using the G2 tool (http://www.gensym.com) to develop rule-based and model-based fault detection and diagnosis capabilities for the A-1 test stand. This paper describes preliminary results in applying the data-driven approach to detecting and diagnosing faults in the J-2X engine.
The conventional approach to detecting and diagnosing faults in complex engineered systems such as rocket engines and test stands is to use large numbers of human experts. Test controllers watch the data in near-real time during each engine test. Engineers study the data after each test. These experts are aided by limit checks that signal when a particular variable goes outside of a predetermined range. The conventional approach is very labor intensive. Also, humans may not be able to recognize faults that involve the relationships among large numbers of variables.
Further, some potential faults could happen too quickly for humans to detect them and react before they become catastrophic. Automated fault detection and diagnosis is therefore needed. One approach to automation is to encode human knowledge into rules or models. Another approach is to use data-driven methods to automatically learn models from historical data or simulated data. Our prototype will combine the data-driven approach with the model-based and rule-based approaches.
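C4.5 itself is not shipped with common Python libraries, so the sketch below uses scikit-learn's DecisionTreeClassifier with an entropy criterion as a rough stand-in for the information-gain splitting that C4.5 performs; the CSV files, feature columns, and label values are hypothetical, not the actual DRTM sensor set.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical DRTM-style simulation exports: sensor columns plus a label that
# is either "nominal" or one of the five leak locations.
train = pd.read_csv("drtm_train_runs.csv")   # e.g. the 11 training simulations
test = pd.read_csv("drtm_test_runs.csv")     # e.g. the 45 held-out simulations

features = [c for c in train.columns if c != "label"]

# Entropy-based CART tree as a rough analogue of C4.5's information-gain splits.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=6, random_state=0)
tree.fit(train[features], train["label"])

accuracy = tree.score(test[features], test["label"])
print(f"held-out classification accuracy: {accuracy:.2f}")
print(export_text(tree, feature_names=features))  # inspect the learned rules
```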