https://academictorrents.com/nolicensespecified
Version 3, with 517M hashes and counts of password usage, ordered from most to least prevalent. Pwned Passwords are 517,238,891 real-world passwords previously exposed in data breaches. This exposure makes them unsuitable for ongoing use, as they're at much greater risk of being used to take over other accounts. They're searchable online below, as well as being downloadable for use in other online systems. The entire set of passwords is downloadable for free below, with each password represented as a SHA-1 hash to protect the original value (some passwords contain personally identifiable information), followed by a count of how many times that password has been seen in the source data breaches. The list may be integrated into other systems and used to verify whether a password has previously appeared in a data breach, after which a system may warn the user or even block the password outright.
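The hash-plus-count format lends itself to a simple offline check. Below is a minimal sketch, assuming the downloaded list has one `SHA1HASH:count` entry per line with upper-case hex hashes; the file name is a placeholder for wherever the list was saved.

```python
import hashlib

def pwned_count(password: str, path: str = "pwned-passwords-sha1.txt") -> int:
    """Return how many times `password` appears in the downloaded list (0 if absent).

    Assumes one 'SHA1HASH:count' entry per line; the file name is a placeholder.
    """
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    with open(path, "r", encoding="utf-8") as fh:
        for line in fh:
            candidate, _, count = line.strip().partition(":")
            if candidate == digest:
                return int(count)
    return 0

if __name__ == "__main__":
    # A system integrating the list might warn the user or block the password outright.
    if pwned_count("P@ssw0rd") > 0:
        print("This password has appeared in a breach; please choose another.")
```

A linear scan is shown only for clarity; a system doing many lookups would typically load the hashes into a database or similar indexed store first.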
The largest reported data leakage as of January 2025 was the Cam4 data breach in March 2020, which exposed more than 10 billion data records. The second-largest data breach in history so far, the Yahoo data breach, occurred in 2013. The company initially reported about one billion exposed data records, but after an investigation, it updated the number, revealing that three billion accounts were affected. The National Public Data breach was announced in August 2024. The incident became public when personally identifiable information of individuals became available for sale on the dark web. Overall, security professionals estimate the leakage of nearly three billion personal records. The next significant data leakage was the March 2018 security breach of India's national ID database, Aadhaar, with over 1.1 billion records exposed. This included biometric information such as identification numbers and fingerprint scans, which could be used to open bank accounts and receive financial aid, among other government services.
Cybercrime: the dark side of digitalization
As the world continues its journey into the digital age, corporations and governments across the globe have been increasing their reliance on technology to collect, analyze and store personal data. This, in turn, has led to a rise in the number of cybercrimes, ranging from minor breaches to global-scale attacks impacting billions of users, such as in the case of Yahoo. Within the U.S. alone, 1,802 cases of data compromise were reported in 2022, a marked increase from the 447 cases reported a decade prior.
The high price of data protection
As of 2022, the average cost of a single data breach across all industries worldwide stood at around 4.35 million U.S. dollars. Breaches were most costly in the healthcare sector, with each leak reported to have cost the affected party a hefty 10.1 million U.S. dollars. The financial sector followed closely behind: each breach there resulted in a loss of approximately 6 million U.S. dollars, 1.5 million more than the global average.
Washington law requires entities impacted by a data breach to notify the Attorney General's Office (AGO) when the personal information of more than 500 Washingtonians was compromised as a result of the breach. This dataset is a collection of statistics derived from these notices and is the source of data used to produce the AGO's Annual Data Breach Report.
It's hard to keep track of passwords in our digital world. It might be hard to remember logins, keep your accounts safe, and manage many accounts at once. This is where Dashlane comes in. It makes managing passwords easy because it has a simple UI and strong security. This tutorial is for you if you want to know how to log in to your Dashlane account and what makes it different from other password managers.
Why should you use Dashlane to manage your passwords?
It's crucial to know why millions of people around the world choose Dashlane before we get into the login process. Here's how it helps people in their daily lives:
Better security: Dashlane uses strong encryption to keep your credentials safe. With AES-256 encryption, which is the best in the business, you don't have to worry about hackers getting to your data.
Easy to use on all devices: Dashlane makes it easy to store passwords across various devices. You can easily get to your login information on any device, whether it's a smartphone, laptop, or tablet.
Easier to log in: After you set it up, Dashlane's autofill function lets you log in to apps and websites without having to type in your username and password. Not only is it faster, it also eliminates mistakes.
Keeping an eye on the dark web: Dashlane does more than merely keep track of passwords; it also checks the dark web for leaks of personal information. You will be notified right away if your information has been leaked.
Use a VPN to keep your privacy safe: In addition to password management, Dashlane includes a virtual private network (VPN) to keep your browsing private on public Wi-Fi.
How to Access Your Dashlane Account
Whether you're new to Dashlane or use it every day, it's easy to log in. To safely log into your account and start managing your passwords, do the following:
Step 1: Get the Dashlane app. Downloading Dashlane is the first thing you need to do if you're new. It works on all of the most popular platforms, including Windows, macOS, iOS, and Android. You can also use Dashlane as a browser extension on popular browsers, including Chrome, Firefox, and Edge.
Step 2: Launch Dashlane. Once you've installed the Dashlane app or browser extension, open it.
Step 3: Enter your email address. Type in the email address that is linked to your Dashlane account. This will take you to the login page.
Step 4: Provide your Master Password. The first time you log in, you'll need to create a Master Password: a single, strong password that unlocks your vault. To get back in later, just type in your Master Password. Tip: your Master Password should be hard to guess but easy to remember. Consider mixing numerals, uppercase and lowercase letters, and special characters.
Step 5: Verify (if necessary). If you have two-factor authentication (2FA) set up on your account, you will also need to complete this step. Dashlane may ask you to enter a code delivered to your email or generated by an authentication app.
Step 6: Open your vault. Once you sign in, you'll see your password vault. This is where you can manage your stored logins, credit card information, and confidential notes.
Important Security Features of Dashlane
Dashlane puts your safety first with these cutting-edge features:
Encryption with AES-256: Your private information is stored with military-grade encryption, which keeps it safe from hackers.
Zero-knowledge architecture: Dashlane uses a zero-knowledge security model, which means the company can't see or access your passwords.
Biometric login options: You can make things easier without giving up security by turning on biometric authentication, such as Face ID or fingerprint scanning, on devices that support it.
Password health insights: Dashlane doesn't just keep your passwords safe; it also analyzes them. It flags passwords that are weak or reused, which helps you make your accounts stronger.
Emergency access: You can let a trusted person in if there is an emergency, which makes it easier to keep track of critical accounts.
How to Get the Most Out of Dashlane
To get the most out of your Dashlane login account, follow these tips:
Turn on Autofill: Autofill can help you save time when you log in, especially to sites you visit often.
Change your passwords often: Change your passwords every now and then to make them more secure. You can make strong, unique passwords in seconds using Dashlane's Password Generator.
Turn on two-factor authentication: Always use two-factor authentication (2FA) to add an extra layer of security to your account. This way, even if your password is stolen, your account will still be safe.
Use the Password Health Tool: Check your password health score often and change any credentials that are marked.
What Makes Dashlane Unique
Dashlane is different from other password managers because it is easy to use and has sophisticated capabilities like dark web monitoring and built-in VPN services. Dashlane keeps your information safe without slowing you down when you sign in for work, shop online, or manage your personal accounts.
Last Thoughts
It's easier than ever to keep your online security in check. Dashlane makes it easy to get to your accounts, keeps your private data safe, and protects you from breaches before they happen. Now that you know how to log in to Dashlane, why not give it a shot and take charge of your passwords? Dashlane is the greatest way to keep your online life safe, because your safety deserves the best.
In 2024, the number of data compromises in the United States stood at 3,158 cases. Meanwhile, over 1.35 billion individuals were affected in the same year by data compromises, including data breaches, leakage, and exposure. While these are three different events, they have one thing in common: as a result of all three, sensitive data is accessed by an unauthorized threat actor.
Industries most vulnerable to data breaches
Some industry sectors usually see more significant cases of private data violations than others. This is determined by the type and volume of the personal information that organizations in these sectors store. In 2024, financial services, healthcare, and professional services were the three industry sectors that recorded the most data breaches. Overall, the number of data breaches in some industry sectors in the United States has gradually increased over the past few years; however, some sectors saw a decrease.
Largest data exposures worldwide
In 2020, an adult streaming website, CAM4, experienced a leakage of nearly 11 billion records, by far the most extensive reported data leakage. The case is unique, though, because cybersecurity researchers found the vulnerability before the cybercriminals did. The second-largest data breach is the Yahoo data breach, dating back to 2013. The company first reported about one billion exposed records, then later, in 2017, issued an updated figure of three billion leaked records. In March 2018, the third-biggest data breach happened, involving India's national identification database, Aadhaar. As a result of this incident, over 1.1 billion records were exposed.
During the third quarter of 2024, data breaches exposed more than *** million records worldwide. Since the first quarter of 2020, the highest number of data records were exposed in the first quarter of ***, at more than *** million data sets. Data breaches remain among the biggest concerns of company leaders worldwide. The most common causes of sensitive information loss were operating system vulnerabilities on endpoint devices.
Which industries see the most data breaches?
Certain conditions make some industry sectors more prone to data breaches than others. According to the latest observations, public administration experienced the highest number of data breaches between 2021 and 2022, with *** reported data breach incidents with confirmed data loss. Financial institutions came second, with *** data breach cases, followed by healthcare providers.
Data breach cost
Data breach incidents have various consequences, the most common being financial losses and business disruptions. As of 2023, the average data breach cost across businesses worldwide was **** million U.S. dollars, while a leaked data record cost about *** U.S. dollars. The United States saw the highest average breach cost globally, at **** million U.S. dollars.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. The data has been made public and presents a diverse set of email information ranging from internal, marketing emails to spam and fraud attempts.
In the early 2000s, Leslie Kaelbling at MIT purchased the dataset and noted that, though the dataset contained scam emails, it also had several integrity problems. The dataset was later updated, but it remains essential to ensure privacy in the data while it is used to train a deep neural network model.
Though the Enron Email Dataset contains over 500K emails, one of the problems with it is the lack of labeled fraud emails. Label annotation is performed to accurately detect an umbrella of fraud emails. Since fraud emails fall into several types, such as phishing, financial, romance, subscription, and Nigerian Prince scams, multiple heuristics have to be used to label all types of fraudulent emails effectively.
To tackle this problem, heuristics have been used to label the Enron data corpus using email signals, and automated labeling has been performed using simple ML models on other smaller email datasets available online. These fraud annotation techniques are discussed in detail below.
To perform fraud annotation on the Enron dataset, as well as to provide more fraud examples for modeling, two more fraud data sources have been used:
Phishing Email Dataset: https://www.kaggle.com/dsv/6090437
Social Engineering Dataset: http://aclweb.org/aclwiki
To label the Enron email dataset, two kinds of signals are used to filter suspicious emails and label them into fraud and non-fraud classes: automated ML labeling and email signals.
The following heuristics are used to annotate labels for Enron email data using the other two data sources:
Phishing Model Annotation: A high-precision SVM model trained on the Phishing mails dataset, which is used to annotate the Phishing Label on the Enron Dataset.
Social Engineering Model Annotation: A high-precision SVM model trained on the Social Engineering mails dataset, which is used to annotate the Social Engineering Label on the Enron Dataset.
The two ML Annotator models use Term Frequency Inverse Document Frequency (TF-IDF) to embed the input text and make use of SVM models with Gaussian Kernel.
If either of the models predicted that an email was a fraud, the mail metadata was checked for several email signals. If these heuristics meet the requirements of a high-probability fraud email, we label it as a fraud email.
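As a rough illustration of this annotation pipeline, the sketch below builds a TF-IDF plus Gaussian-kernel SVM annotator with scikit-learn and applies it to Enron message bodies. The file names, column names, and confidence threshold are assumptions for illustration, not the authors' exact setup.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical inputs: a labeled phishing corpus and the unlabeled Enron bodies.
phishing = pd.read_csv("phishing_emails.csv")   # columns: text, label (1 = phishing)
enron = pd.read_csv("enron_emails.csv")         # column: text

# TF-IDF embedding followed by an SVM with a Gaussian (RBF) kernel,
# mirroring the description of the two ML annotator models.
annotator = make_pipeline(
    TfidfVectorizer(stop_words="english", max_features=50_000),
    SVC(kernel="rbf", C=10.0, probability=True),
)
annotator.fit(phishing["text"], phishing["label"])

# Keep only high-precision predictions: flag an Enron email as suspicious
# when the model is very confident, then pass it on to the email-signal checks.
proba = annotator.predict_proba(enron["text"])[:, 1]
enron["ml_phishing_flag"] = proba > 0.9
```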
Email signal-based heuristics are used to specifically filter and target suspicious emails for fraud labeling. The signals used were as follows (a simple combining rule is sketched after the list):
Person Of Interest: There is a publicly available list of email addresses of employees who were liable for the massive data leak at Enron. These user mailboxes have a higher chance of containing quality fraud emails.
Suspicious Folders: The Enron data is dumped into several folders for every employee, consisting of inbox, deleted_items, junk, calendar, etc. Folders such as deleted_items and junk have a higher chance of containing fraud emails.
Sender Type: The sender type was categorized as ‘Internal’ and ‘External’ based on their email address.
Low Communication: A threshold of 4 emails, based on the table below, was used to define low communication. A user qualifies as a low-comm sender if their email count is below this threshold. Mails sent from low-comm senders were assigned a high probability of being fraud.
Contains Replies and Forwards: If an email contains forwards or replies, a low probability was assigned for it to be a fraud email.
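Below is a minimal sketch of how such signals might be combined into a single high-probability-fraud flag. The field names, folder names, and score cut-off are hypothetical; only the 4-email low-communication threshold comes from the description above.

```python
EMAIL_THRESHOLD = 4  # below this, a sender counts as "low communication"

def is_high_probability_fraud(mail: dict) -> bool:
    """Combine the email signals into a single heuristic flag.

    `mail` is a hypothetical record with the fields used below; only emails
    already flagged by one of the ML annotators reach this check.
    """
    score = 0
    score += mail["owner_is_person_of_interest"]           # POI mailboxes
    score += mail["folder"] in {"deleted_items", "junk"}    # suspicious folders
    score += mail["sender_type"] == "External"              # external senders
    score += mail["sender_email_count"] < EMAIL_THRESHOLD   # low-communication sender
    score -= mail["contains_reply_or_forward"]              # replies/forwards lower the odds
    return score >= 3  # illustrative cut-off, not the authors' exact rule
```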
To ensure high-quality labels, the mismatched examples from ML annotation have been manually inspected for Enron dataset relabeling.
Fraud | Non-Fraud |
---|---|
2327 | 445090 |
Enron Dataset Title: Enron Email Dataset URL: https://www.cs.cmu.edu/~enron/ Publisher: MIT, CMU Author: Leslie Kaelbling, William W. Cohen Year: 2015
Phishing Email Detection Dataset Title: Phishing Email Detection URL: https://www.kaggle.com/dsv/6090437 DOI: 10.34740/KAGGLE/DSV/6090437 Publisher: Kaggle Author: Subhadeep Chakraborty Year: 2023
CLAIR Fraud Email Collection Title: CLAIR collection of fraud email URL: http://aclweb.org/aclwiki Author: Radev, D. Year: 2008
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
We record data leaks within the organization of the municipality of Utrecht in the public data leak register. This dataset contains the following data:
• Date;
• Description of the data breach;
• (Possible) consequences for the person(s) involved;
• Corrective actions taken;
• Whether the Dutch Data Protection Authority (AP) has been informed;
• Whether the data subject(s) have been informed.
Only completed notifications are included in the register. Reports that are still being investigated by the municipality of Utrecht or the Dutch Data Protection Authority are not yet in the register; they are added once the investigation is completed.
More information?
Reporting a security or data leak: www.utrecht.nl/veiligheidslek-melden
How the municipality of Utrecht deals with privacy: www.utrecht.nl/privacy
This version of the CivilComments dataset provides access to the primary seven labels that were annotated by crowd workers; the toxicity and other tags are values between 0 and 1 indicating the fraction of annotators who assigned these attributes to the comment text.
The other tags are only available for a fraction of the input examples. They are currently ignored for the main dataset; the CivilCommentsIdentities set includes those labels, but only consists of the subset of the data with them. The other attributes that were part of the original CivilComments release are included only in the raw data. See the Kaggle documentation for more details about the available features.
The comments in this dataset come from an archive of the Civil Comments platform, a commenting plugin for independent news sites. These public comments were created from 2015 - 2017 and appeared on approximately 50 English-language news sites across the world. When Civil Comments shut down in 2017, they chose to make the public comments available in a lasting open archive to enable future research. The original data, published on figshare, includes the public comment text, some associated metadata such as article IDs, publication IDs, timestamps and commenter-generated "civility" labels, but does not include user ids. Jigsaw extended this dataset by adding additional labels for toxicity, identity mentions, as well as covert offensiveness. This data set is an exact replica of the data released for the Jigsaw Unintended Bias in Toxicity Classification Kaggle challenge. This dataset is released under CC0, as is the underlying comment text.
For comments that have a parent_id also in the civil comments data, the text of the previous comment is provided as the "parent_text" feature. Note that the splits were made without regard to this information, so using previous comments may leak some information. The annotators did not have access to the parent text when making the labels.
To use this dataset:
```python
import tensorflow_datasets as tfds

ds = tfds.load('civil_comments', split='train')
for ex in ds.take(4):
  print(ex)
```
See the guide for more information on tensorflow_datasets.
http://opensource.org/licenses/BSD-2-Clause
Python code (for Python 3.9 and Pandas 1.3.2) to generate the results used in "Compromised through Compression: Privacy Implications of Smart Meter Traffic Analysis".
Smart metering comes with risks to privacy. One concern is the possibility of an attacker seeing the traffic that reports the energy use of a household and deriving private information from it. Encryption helps to mask the actual energy measurements, but is not sufficient to cover all risks. One aspect which has so far gone unexplored, and where encryption does not help, is traffic analysis: whether the length of messages communicating energy measurements can leak privacy-sensitive information to an observer. In this paper we examine whether using encodings or compression for smart metering data could potentially leak information about household energy use. Our analysis is based on the real-world energy use data of ±80 Dutch households.
We find that traffic analysis could reveal information about the energy use of individual households if compression is used. As a result, when messages are sent daily, an attacker performing traffic analysis would be able to determine when all the members of a household are away or not using electricity for an entire day. We demonstrate this issue by recognizing when households from our dataset were on holiday. If messages are sent more often, more granular living patterns could likely be determined.
We propose a method of encoding the data that is nearly as effective as compression at reducing message size, but does not leak the information that compression leaks. By not requiring compression to achieve the best possible data savings, the risk of traffic analysis is eliminated.
This code operates on the relative energy measurements from the "Zonnedael dataset" from Liander N.V. This dataset needs to be obtained separately; see the instructions accompanying the code. The code transforms the dataset into absolute measurements such as would be taken by a smart meter. It then generates batch messages covering 24-hour periods starting at midnight, similar to how the Dutch infrastructure batches daily meter readings, in the different possible encodings with and without compression applied. For an explanation of the different encodings, see the paper. The code then provides statistics on the efficiency of encoding and compression for the entire dataset, and attempts to find the periods of multi-day absences for each household. It will also generate the graphs in the style used in the paper and presentation.
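The following is not the authors' code (which requires the Zonnedael dataset), but a minimal self-contained sketch of the underlying leak: a day of near-constant low readings compresses far better than a normal day, so message length alone can separate the two even when the payload is encrypted. The consumption profiles are invented for illustration.

```python
import random
import zlib

random.seed(0)

def daily_message(readings):
    """Encode 96 quarter-hourly Wh readings as a simple text batch message."""
    return "\n".join(str(r) for r in readings).encode("ascii")

# Hypothetical consumption profiles: a normal day versus an empty house.
normal_day = [random.randint(50, 400) for _ in range(96)]
away_day = [random.choice([8, 9, 10]) for _ in range(96)]  # fridge/standby only

for name, day in [("normal", normal_day), ("away", away_day)]:
    raw = daily_message(day)
    compressed = zlib.compress(raw)
    print(f"{name}: raw {len(raw)} bytes, compressed {len(compressed)} bytes")

# The repetitive "away" day compresses far better, so an observer who only
# sees ciphertext lengths can still tell the two kinds of day apart.
```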
Dataset Card for prompt-leak-R1-dpo
This dataset card aims to be a base template for new datasets. It has been generated using this raw template.
Dataset Details
Dataset Description
Curated by: [More Information Needed] Funded by [optional]: [More Information Needed] Shared by [optional]: [More Information Needed] Language(s) (NLP): [More Information Needed] License: [More Information Needed]
Dataset Sources [optional]
Repository: [More… See the full description on the dataset page: https://huggingface.co/datasets/lianghsun/prompt-leak-R1-dpo.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset is designed to test features of Connectomix. For more information, please visit the GitHub repository:
https://github.com/ln2t/connectomix
The dataset contains data for 4 participants, split into `control` and `patient` groups in the `participants.tsv` file. This split has been done artificially and serves only testing purposes.
The exact commands to run the analyses depend on your installation of fMRIPrep. In what follows, we simply assume that `fmriprep` is the command for fMRIPrep.
We show here the simplest version of the commands, assuming you adapt those depending on your setup (e.g. if you use Docker).
We also assume that the data are at the following locations:
```bash
bids_dir='/data/ds005699'
derivatives_dir='/data/ds005699/derivatives'
fmriprep $bids_dir ${derivatives_dir}/fmriprep participant --fs-license-file /path/to/fs/license
```
Note: The following has been tested for connectomix version 1.0.1.
First set-up path to connectomix script:
```bash
connectomix_cmd='/path/to/connectomix/connectomix/connectomix.py'
```
Second, set-up paths to config directory:
```bash
config_dir='/data/ds005625/code/connectomix/config'
$connectomix_cmd ${bids_dir} ${derivatives_dir}/connectomix participant --derivatives fmriprep="${derivatives_dir}/fmriprep" --config "${config_dir}/participant_level_config.yaml"
```
Notes:
- This is an example of an independent two-samples t-test.
- Since the dataset contains only four subjects (two per group), the number of possible permutations is very low. For this reason, the number of computed permutations is set to 4, and connectomix can then complete the group-level analysis. Of course, realistic cases should include not only many more participants but also a much larger number of permutations (see the connectomix documentation).
```bash
$connectomix_cmd ${bids_dir} ${derivatives_dir}/connectomix group --config "${config_dir}/group_level_config.yaml"
```
The data is updated nightly using ArcGIS scripting. Scripting will not update the ArcGIS Online "item updated" date, which only reflects the last time the ArcGIS Online item page was updated.
A typical leaking underground storage tank (LUST) scenario involves the release of a fuel product from an underground storage tank (UST) that can contaminate surrounding soil, groundwater, or surface waters, or affect indoor air spaces. Early detection of a UST release is important, as is determining the source of the release, the type of fuel released, the occurrence of imminently threatened receptors, and the appropriate initial response. The primary objective of the initial response is to determine the nature and extent of a release as soon as possible.
PROHIBITED USES: KSA 45-230 prohibits the use of names and addresses contained in public records for certain commercial purposes. By submitting this request, you are signing the following written certification that you will not use the information in the records for any purpose prohibited by law.
DATA LIMITATIONS:
This data set is not designed for use as a regulatory tool in permitting or siting decisions; it may be used as a reference source. Carefully consider the provisional or incomplete nature of these data before using them for decisions that concern personal safety or involve substantial monetary consequences.
This dataset contains one facility point per LUST data record. The points will be stacked if multiple LUSTs occurred at the same facility.
A new facility point is added when a new facility is added to the origination database.
Data is replicated on a nightly basis for public consumption. KDHE is not responsible for database integrity following download.
The facility point is not the exact location of the tank, but a general representative point somewhere on the property of the storage tank facility.
KDHE makes no assurances of the accuracy or validity of information presented in the spatial data. KDHE tanks have been located using a variety of locational methods. More recent points are geocoded and validated with an accuracy of 3-10 meters. Many inactive or old facilities only had a legal description from which to calculate point placement on a map, with an accuracy of 250-2000 meters.
For users who wish to interact with the data in a finished product, KDHE recommends using our Kansas Environmental Interest Finder. More information about KDHE can be found on the Kansas Department of Health and Environment website. More information about KDHE storage tanks can be found on the Storage Tanks Division pages of the KDHE website.
ATTRIBUTES description:
Start Date/End Date: The LUST is considered finished when the remediation has occurred and the environment is back to its pre-contamination state. A new LUST will be recorded if the tank leaks again.
Approved TRUST: Flag is Yes if approved for the EPA TRUST fund. In 1986, Congress created the Leaking Underground Storage Tank (LUST) Trust Fund to address petroleum releases from federally regulated underground storage tanks (USTs) by amending Subtitle I of the Solid Waste Disposal Act. In 2005, the Energy Policy Act expanded eligible uses of the Trust Fund to include certain leak prevention activities.
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
Only fires are included here. All other incidents, including EMS calls and false alarms, have been excluded.
This dataset contains Connecticut Fire Department Incidents as reported to the National Fire Department Incident Reporting System (NFIRS).
Note that the 2014 and 2016 data have far more entries than the other years; in particular, they detail "False Alarm and False Calls" and "Rescue and Emergency Medical Service (EMS) Incidents".
NFIRS collects details on Fire, HazMat and EMS incidents nationwide, detailing the type of incident, where it occurred, the resources used to mitigate it and more, with the goal of understanding the nature and causes of the incidents. Information is also collected on the number of civilian or firefighter casualties and an estimate of property loss.
Participation in NFIRS is voluntary.
Data is released yearly, with a considerable delay.
Each incident is assigned a 3-digit Incident Type Code. The code describes the situation emergency personnel found when they arrived. Incident Types are grouped into larger categories, called Series (a short sketch for deriving the Series from a code follows the list below).
For example, Series 400, the 'Hazardous Condition' category, includes incident types 411, 'Gasoline or other flammable liquid spill'; 412, 'Gas leak'; and 413, 'Oil or other combustible liquid spill'.
Not every Incident Type is included in the data. In 2012, 2013, 2014 and 2015, the NFIRS data releases contained these Series/Incident Types:
Series 100: Fire Incidents; Series 400: Hazardous Condition (No Fire); Incident Type 561: Unauthorized Burning, under the 'Service Call' Series; Incident Type 631: Authorized Controlled Burning, under the 'Good Intent Call' Series; and Incident Type 632: Prescribed Fires, also under the 'Good Intent Call' Series.
The 2014 and 2016 releases included these additional Series:
200: Overpressure Rupture, Explosion, Overheat (No Fire); 300: Rescue and Emergency Medical Service (EMS) Incidents; 500: Service Calls; 600: Good Intent Calls; 700: False Alarm and False Call; 800: Severe Weather and Natural Disaster; and 900: Special Incident Type.
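As referenced above, here is a short sketch of deriving the Series from a 3-digit Incident Type Code; the CSV file and column names are hypothetical stand-ins for however the dataset is exported.

```python
import pandas as pd

# Hypothetical export of the dataset with a 3-digit incident type code column.
incidents = pd.read_csv("ct_nfirs_incidents.csv")

# The Series is the hundreds bucket of the Incident Type Code: 411 -> 400.
incidents["series"] = (incidents["incident_type_code"] // 100) * 100

# Example: keep only the Hazardous Condition (No Fire) series.
hazardous = incidents[incidents["series"] == 400]
print(hazardous["incident_type_code"].value_counts())
```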
The official NFIRS documentation has been attached to this dataset.
This dataset does not contain all the detail available in the NFIRS database. If after reviewing the documentation, you find additional information you would like added to the dataset, please let us know.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
Reporting of leakage from water networks is based on the concept of monitoring flows at a time when demand is at a minimum, which is normally during the night. This dataset includes net night flow measurements for 10% of the publisher's total district metered areas. This 10% has been chosen on the basis that the telemetry on site is reliable, that it is not revealing of sensitive usage patterns, and that the night flow there is typical of low demand.
Key Definitions
Dataset
A structured and organized collection of related elements, often stored digitally, used for analysis and interpretation in various fields.
Data Triage
The process carried out by a Data Custodian to determine if there is any evidence of sensitivities associated with Data Assets, their associated Metadata and Software Scripts used to process Data Assets if they are used as Open Data.
District Metered Area (DMA)
The role of a district metered area (DMA) is to divide the water distribution network into manageable areas or sectors in which the flow can be measured. These areas give water providers guidance as to which DMAs require leak detection work.
Leakage
The accidental admission or escape of a fluid or gas through a hole or crack.
Night Flow
This technique considers that in a DMA, leakages can be estimated when the flow into the DMA is at its minimum. Typically, this is measured at night between 3am and 4am, when customer demand is low, so that network leakage can be detected.
Centroid
The centre of a geometric object.
Data History
Data Origin
Companies have configured their networks to be able to continuously monitor night flows using district meters. Flow data is recorded on meters and normally transmitted daily to a data centre. Data is analysed to confirm its validity and used to derive continuous night flow in each monitored area.
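A minimal sketch of deriving a net night flow figure from such meter readings with pandas is shown below; it assumes 15-minute readings indexed by timestamp and a hypothetical column name, and does not reproduce the publishers' own validation or allowance calculations.

```python
import pandas as pd

# Hypothetical input: 15-minute flow readings (litres/second) for one DMA,
# indexed by timestamp, as transmitted daily from the district meter.
flows = pd.read_csv("dma_flows.csv", parse_dates=["timestamp"], index_col="timestamp")

# Net night flow: the minimum flow observed between 03:00 and 04:00 each day,
# when customer demand is lowest and leakage dominates the measurement.
night_window = flows.between_time("03:00", "04:00")
net_night_flow = night_window["flow_l_per_s"].resample("1D").min()

print(net_night_flow.tail())
```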
Data Triage Considerations
Data Quality
Not all DMAs provide quality data for the purposes of trend analysis. It was decided that water companies should choose 10% of their DMAs to be represented in this data set to begin with. The advice to publishers is to choose those with reliable and consistent telemetry, indicative of genuine low demand during measurement times and not revealing of sensitive night usage patterns.
Data Consistency
There is a concern that companies measure flow allowance for legitimate night use and/or potential night use differently. To avoid any inconsistency, it was decided that we would share the net flow.
Critical National Infrastructure
The release of boundary data for district metered areas has been deemed to be revealing of critical national infrastructure. Because of this, it has been decided that the data set shall only contain point data from a centroid within the DMA.
Data Triage Review Frequency
Every 12 months, unless otherwise requested.
Data Limitations
Some of the flow recorded may be legitimate nighttime usage of the network.
Some measuring systems automatically infill estimated measurements where none have been received via telemetry. These estimates are based on past flow.
The reason for a fluctuation in night flow may not be determinable from this dataset, but potential causes include seasonal variation in nighttime water usage and mains bursts.
Data Publish Frequency
Monthly
Supplementary information
Below is a curated selection of links for additional reading, which provide a deeper understanding of this dataset.
Ofwat – Reporting Guidance https://www.ofwat.gov.uk/wp-content/uploads/2018/03/Reporting-guidance-leakage.pdf
Water UK – UK Leakage https://www.water.org.uk/wp-content/uploads/2022/03/Water-UK-A-leakage-Routemap-to-2050.pdf
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
In recent years, social Q&A sites such as Zhihu and Quora have developed rapidly, attracting hundreds of millions of users and providing a convenient platform for asking questions and sharing knowledge. Most users share knowledge through real-name answers, but this sometimes hinders knowledge sharing, for example when employees share company salaries or students share inside information about their laboratory. Revealing the author's true identity in such cases can cause significant harm to the author. To tackle this problem, social question-and-answer websites provide users with an anonymous answer function, which replaces the real author id of the answer with an anonymous id. Countless users who cannot disclose their identities use anonymous answers to share valuable knowledge.
Although there are countless anonymous answers in the Q&A community, few anonymization techniques have been applied. The two super-large Q&A sites, Quora and Zhihu, use two anonymization technologies: hiding the author's information and protecting the storage of anonymous user information. However, anonymous answers, questions, comments, and topics and their topological structure contain distinctive attributes unknown to most users, providing valuable features for de-anonymization attacks. Although the question-and-answer websites warn anonymous users that specific personal information and language styles in answers may lead to privacy leaks, the question-and-answer community cannot give the probability of, or the reason for, a privacy leak for a specific anonymous answer.
In this paper, we propose a novel task, the de-anonymization of the Q&A websites, which refers to recovering the identity information of the real author of the anonymous answer. This task aims to evaluate the risk of privacy leakage of a specific anonymous answer in the question-and-answer websites and explain why the answer is vulnerable to de-anonymization.
To explore the effectiveness of various methodologies, we employ web scraping techniques on public answers from online platforms Zhihu and Quora. The first step involves the selection of seed users related to the ten popular topics. We selected one seed user from each topic. This step can ensure that the collected Q&A community dataset encompasses a diverse range of popular topics. In the second step, we recursively crawl the social relationships between users based on the ten seed users crawled in the first step. In order to make the crawled user pool more widely distributed, we only crawled the first 100 following users for each user until the crawled user pool reached 700,000. Step three is to ascertain the users to be ultimately crawled. A community discovery algorithm is employed to identify a community with the highest transitivity. This community must have a population exceeding 5,000. As a result, we will crawl all users in this community. The fourth step involves extracting the data related to all users within the chosen community. To mitigate the issue of excessive data volume, this article sets a constraint on the number of answers collected by individual users during the crawling process. This upper limit ensures that all answers of 95% of users are crawled. The crawled data consists of user homepage information, relationships between users, user-generated questions and answers, and associated comments and topics. Each question includes its title and the name of the person who asked it. Each answer contains the author's name, time of submission, the content of the answer, and any first-level comments. Additionally, each comment includes its author, submission time, and content.
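The description does not name the community discovery algorithm, so the sketch below uses Louvain from networkx purely as an illustration of step three (selecting a community of more than 5,000 users with the highest transitivity); the edge-list file name is hypothetical.

```python
import networkx as nx

# Hypothetical input: the crawled "following" relationships as an edge list.
G = nx.read_edgelist("following_edges.txt")

# Community discovery (algorithm not specified in the description; Louvain is
# used here only as an illustration).
communities = nx.community.louvain_communities(G, seed=42)

# Keep communities with more than 5,000 members and pick the one whose
# induced subgraph has the highest transitivity.
candidates = [c for c in communities if len(c) > 5000]
best = max(candidates, key=lambda c: nx.transitivity(G.subgraph(c)))
print(f"selected community: {len(best)} users")
```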
Here is an example code to read the dataset https://www.kaggle.com/tianbaojie/example-code.
https://choosealicense.com/licenses/other/
Purpose and Features
🌍 World's largest open dataset for privacy masking 🌎 The dataset is useful to train and evaluate models to remove personally identifiable and sensitive information from text, especially in the context of AI assistants and LLMs. Key facts:
OpenPII-220k text entries have 27 PII classes (types of sensitive data), targeting 749 discussion subjects / use cases split across education, health, and psychology. FinPII contains an additional ~20 types tailored to… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-300k.
Description
Dataframe containing 2075 French books in txt format (= the ~2600 French books present in Gutenberg from which all books by authors present in the french_books_summuries dataset have been removed to avoid any leaks). More precisely (a short loading example follows the list):
- the `texte` column contains the texts
- the `titre` column contains the book title
- the `auteur` column contains the author's name and dates of birth and death (if you want to filter the texts to keep only those from the given century to the present… See the full description on the dataset page: https://huggingface.co/datasets/CATIE-AQ/french_books.
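A small sketch of loading the dataset with the datasets library and filtering on the `auteur` column follows; the assumption that birth/death dates appear as four-digit years in that column, and the `train` split name, are illustrative guesses.

```python
import re
from datasets import load_dataset

ds = load_dataset("CATIE-AQ/french_books", split="train")

# Keep only books whose author's earliest four-digit year (assumed to be the
# birth year embedded in the `auteur` column) is 1800 or later.
def from_19th_century_on(example):
    years = [int(y) for y in re.findall(r"\d{4}", example["auteur"])]
    return bool(years) and min(years) >= 1800

modern = ds.filter(from_19th_century_on)
print(len(modern))
```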
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the data repository for the LO2 dataset.
Here is an overview of the contents.
lo2-data.zip
This is the main dataset. This is the completely unedited output of our data collection process. Note that the uncompressed size is around 540 GB. For more information, see the paper and the data-appendix in this repository.
lo2-sample.zip
This is a sample that contains the data used for preliminary analysis. It contains only service logs and the most relevant metrics for the first 100 runs. Furthermore, the metrics are combined at the run level into a single CSV to make them easier to use.
data-appendix.pdf
This document contains further details and stats about the full dataset. These include file size distributions, empty file analysis, log type analysis and the appearance of an unknown file.
lo2-scripts.zip
Various scripts for processing the data to create the sample, to conduct the preliminary analysis and to create the statistics seen in the data-appendix.
Version v3: Updated data appendix introduction, added another stage in the log analysis process in loglead_lo2.py
Full title: Using Decision Trees to Detect and Isolate Simulated Leaks in the J-2X Rocket Engine
Mark Schwabacher, NASA Ames Research Center
Robert Aguilar, Pratt & Whitney Rocketdyne
Fernando Figueroa, NASA Stennis Space Center
Abstract
The goal of this work was to use data-driven methods to automatically detect and isolate faults in the J-2X rocket engine. It was decided to use decision trees, since they tend to be easier to interpret than other data-driven methods. The decision tree algorithm automatically "learns" a decision tree by performing a search through the space of possible decision trees to find one that fits the training data. The particular decision tree algorithm used is known as C4.5. Simulated J-2X data from a high-fidelity simulator developed at Pratt & Whitney Rocketdyne and known as the Detailed Real-Time Model (DRTM) was used to "train" and test the decision tree. Fifty-six DRTM simulations were performed for this purpose, with different leak sizes, different leak locations, and different times of leak onset. To make the simulations as realistic as possible, they included simulated sensor noise and a gradual degradation in both fuel and oxidizer turbine efficiency. A decision tree was trained using 11 of these simulations and tested using the remaining 45 simulations. In the training phase, the C4.5 algorithm was provided with labeled examples of data from nominal operation and data including leaks in each leak location. From the data, it "learned" a decision tree that can classify unseen data as having no leak or having a leak in one of the five leak locations. In the test phase, the decision tree produced very low false alarm rates and low missed detection rates on the unseen data. It had very good fault isolation rates for three of the five simulated leak locations, but it tended to confuse the remaining two locations, perhaps because a large leak at one of these two locations can look very similar to a small leak at the other location.
Introduction
The J-2X rocket engine will be tested on Test Stand A-1 at NASA Stennis Space Center (SSC) in Mississippi. A team including people from SSC, NASA Ames Research Center (ARC), and Pratt & Whitney Rocketdyne (PWR) is developing a prototype end-to-end integrated systems health management (ISHM) system that will be used to monitor the test stand and the engine while the engine is on the test stand [1]. The prototype will use several different methods for detecting and diagnosing faults in the test stand and the engine, including rule-based, model-based, and data-driven approaches. SSC is currently using the G2 tool (http://www.gensym.com) to develop rule-based and model-based fault detection and diagnosis capabilities for the A-1 test stand. This paper describes preliminary results in applying the data-driven approach to detecting and diagnosing faults in the J-2X engine.
The conventional approach to detecting and diagnosing faults in complex engineered systems such as rocket engines and test stands is to use large numbers of human experts. Test controllers watch the data in near-real time during each engine test. Engineers study the data after each test. These experts are aided by limit checks that signal when a particular variable goes outside of a predetermined range. The conventional approach is very labor intensive. Also, humans may not be able to recognize faults that involve the relationships among large numbers of variables.
Further, some potential faults could happen too quickly for humans to detect them and react before they become catastrophic. Automated fault detection and diagnosis is therefore needed. One approach to automation is to encode human knowledge into rules or models. Another approach is to use data-driven methods to automatically learn models from historical data or simulated data. Our prototype will combine the data-driven approach with the model-based and rule-based approaches.
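C4.5 itself is not shipped with common Python libraries, so the sketch below uses scikit-learn's DecisionTreeClassifier with an entropy criterion as a rough stand-in for the information-gain splitting that C4.5 performs; the CSV files, feature columns, and label values are hypothetical, not the actual DRTM sensor set.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical DRTM-style simulation exports: sensor columns plus a label that
# is either "nominal" or one of the five leak locations.
train = pd.read_csv("drtm_train_runs.csv")   # e.g. the 11 training simulations
test = pd.read_csv("drtm_test_runs.csv")     # e.g. the 45 held-out simulations

features = [c for c in train.columns if c != "label"]

# Entropy-based CART tree as a rough analogue of C4.5's information-gain splits.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=6, random_state=0)
tree.fit(train[features], train["label"])

accuracy = tree.score(test[features], test["label"])
print(f"held-out classification accuracy: {accuracy:.2f}")
print(export_text(tree, feature_names=features))  # inspect the learned rules
```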