This statistic presents the types of personal information which U.S. adults would be most concerned about online hackers gaining access to. During the August 2017 survey period, 73 percent of respondents stated that they would feel most concerned about hackers gaining access to their personal banking information.
https://choosealicense.com/licenses/other/
Purpose and Features
🌍 World's largest open dataset for privacy masking 🌎 The dataset is useful to train and evaluate models to remove personally identifiable and sensitive information from text, especially in the context of AI assistants and LLMs. Key facts:
OpenPII-220k text entries have 27 PII classes (types of sensitive data), targeting 749 discussion subjects / use cases split across education, health, and psychology. FinPII contains an additional ~20 types tailored to… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-300k.
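As a quick orientation, here is a minimal sketch of pulling the dataset from the Hugging Face Hub with the `datasets` library and inspecting its schema; the exact column names and split names are not stated here, so the code inspects the first record rather than assuming a schema.

```python
# Minimal sketch: load the ai4privacy masking dataset from the Hugging Face Hub.
# Requires `pip install datasets`. Column names vary between subsets, so we
# inspect the first record of the first split instead of assuming a schema.
from datasets import load_dataset

dataset = load_dataset("ai4privacy/pii-masking-300k")
print(dataset)                            # available splits and row counts

first = next(iter(dataset.values()))[0]   # first record of the first split
print(first.keys())                       # column names of that record
```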
A 2023 survey of Canadians found that almost four out of every 10 respondents think their home address is available online to people who should not have access to it. A further ** percent thought their date of birth was available online, while ** percent of respondents believed their credit card number was accessible to third parties.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In the era of information technology, data privacy has become sacrosanct to individuals. Technology, being a double-edged sword, can be misused to harm internet privacy. Phishing, an online offence, is similar to and derives its name from fishing in the real world, where the offender sends mails (the hook) to victims (the bait) who think the mail is genuine and rely on it.
Sensitive personal data is data “revealing racial or ethnic origin, political opinions, religious beliefs, and (…) data concerning health or sex life”. Data sharing for research purposes must therefore be opened up for human health data to enable cross-discipline research and improve human well-being. The EUDAT Sensitive Data Working Group was created to address the unsatisfactory state of using sensitive data in data infrastructures such as EUDAT. During the EUDAT User Forum, 26-27 Sept. 2016 in Krakow, Poland, the first meeting of the EUDAT Sensitive Data Working Group took place, addressing different requirements and possible solutions for the processing of sensitive data in e-infrastructures. The meeting went beyond the current solutions of EUDAT and explored possibilities for more comprehensive data services and solutions as part of the open data environment of current e-infrastructures. This working paper presents the first results.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This project investigates the challenges and potential solutions around the collection and use of personal data through an interactive installation called "dataslip". It was deployed across various events and used as a conversation starter for identifying challenges, collected via post-it notes, and solutions, collected through a generative workshop. The dataset includes the vector files to build the "dataslip" installation and the challenges and solutions identified.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
🔒 Collection of Privacy-Sensitive Conversations between Care Workers and Care Home Residents in a Residential Care Home 🔒
The dataset is useful to train and evaluate models to identify and classify privacy-sensitive parts of conversations from text, especially in the context of AI assistants and LLMs.
The provided data format is `.jsonl`, the JSON Lines text format, also called newline-delimited JSON. An example entry looks as follows.
{ "text": "CW: Have you ever been to Italy? CR: Oh, yes... many years ago.", "taxonomy": 0, "category": 0, "affected_speaker": 1, "language": "en", "locale": "US", "data_type": 1, "uid": 16, "split": "train" }
The data fields are:

- `text`: a `string` feature. The abbreviations of the speakers refer to the care worker (CW) and the care recipient (CR).
- `taxonomy`: a classification label, with possible values including `informational` (0), `invasion` (1), `collection` (2), `processing` (3), `dissemination` (4), `physical` (5), `personal-space` (6), `territoriality` (7), `intrusion` (8), `obtrusion` (9), `contamination` (10), `modesty` (11), `psychological` (12), `interrogation` (13), `psychological-distance` (14), `social` (15), `association` (16), `crowding-isolation` (17), `public-gaze` (18), `solitude` (19), `intimacy` (20), `anonymity` (21), `reserve` (22). The taxonomy is derived from Rueben et al. (2017). The classifications were manually labeled by an expert.
- `category`: a classification label, with possible values including `personal-information` (0), `family` (1), `health` (2), `thoughts` (3), `values` (4), `acquaintance` (5), `appointment` (6). The privacy category affected in the conversation. The classifications were manually labeled by an expert.
- `affected_speaker`: a classification label, with possible values including `care-worker` (0), `care-recipient` (1), `other` (2), `both` (3). The speaker whose privacy is impacted during the conversation. The classifications were manually labeled by an expert.
- `language`: a `string` feature. Language code as defined by ISO 639.
- `locale`: a `string` feature. Regional code as defined by ISO 3166-1 alpha-2.
- `data_type`: a classification label, with possible values including `real` (0), `synthetic` (1).
- `uid`: an `int64` feature. A unique identifier within the dataset.
- `split`: a `string` feature. Either `train`, `validation` or `test`.

The dataset has 2 subsets:
- `split`: with a total of 95 examples split into `train`, `validation` and `test` (70%-15%-15%)
- `unsplit`: with a total of 95 examples in a single `train` split

| name | train | validation | test |
|---|---|---|---|
| split | 66 | 14 | 15 |
| unsplit | 95 | n/a | n/a |
The files follow the naming convention `subset-split-language.jsonl`. The following files are contained in the dataset:

- `split-train-en.jsonl`
- `split-validation-en.jsonl`
- `split-test-en.jsonl`
- `unsplit-train-en.jsonl`
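As a convenience, here is a small sketch that maps the integer labels back to their names, using the label orderings listed in the field descriptions above:

```python
# Decode integer labels to names; the orderings are taken directly from the
# field descriptions in this dataset card.
TAXONOMY = [
    "informational", "invasion", "collection", "processing", "dissemination",
    "physical", "personal-space", "territoriality", "intrusion", "obtrusion",
    "contamination", "modesty", "psychological", "interrogation",
    "psychological-distance", "social", "association", "crowding-isolation",
    "public-gaze", "solitude", "intimacy", "anonymity", "reserve",
]
CATEGORY = ["personal-information", "family", "health", "thoughts",
            "values", "acquaintance", "appointment"]
AFFECTED_SPEAKER = ["care-worker", "care-recipient", "other", "both"]

entry = {"taxonomy": 0, "category": 0, "affected_speaker": 1}  # from the example entry above
print(TAXONOMY[entry["taxonomy"]])                  # informational
print(CATEGORY[entry["category"]])                  # personal-information
print(AFFECTED_SPEAKER[entry["affected_speaker"]])  # care-recipient
```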
Recording audio of care workers and residents during care interactions, which include partial and full body washing, giving of medication, as well as wound care, is a highly privacy-sensitive use case. Therefore, this dataset was created; it includes privacy-sensitive parts of conversations synthesized from real-world data. The dataset serves as a basis for fine-tuning a local LLM to highlight and classify privacy-sensitive sections of transcripts created in care interactions, so that they can subsequently be masked to protect privacy.
The initial data was collected in the project Caring Robots of TU Wien in cooperation with Caritas Wien. One project track aims to use Large Language Models (LLMs) to support the documentation work of care workers, with LLM-generated summaries of audio recordings of interactions between care workers and care home residents. The initial data are the transcriptions of those care interactions.
The transcriptions were thoroughly reviewed, and sections containing privacy-sensitive information were identified and marked using qualitative data analysis software by two experts. Subsequently, the accessible portions of the interviews were translated from German to US English using the locally executed LLM icky/translate. In the next step, a llama3.1:70b model was used locally to synthesize the conversation segments. This process involved generating similar, yet distinct and new, conversations that are not linked to the original data. The dataset was split using the [`train_test_split`](https://scikit-learn.org/1.5/modules/generated/sklearn.model_selection.train_test_split.html) function from scikit-learn.
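A minimal sketch of how such a 70%-15%-15% split can be produced in two stages with `train_test_split`; the `random_state` is illustrative, as the card does not state the exact arguments used:

```python
# Sketch: 70%-15%-15% split in two stages with scikit-learn's train_test_split.
import json
from sklearn.model_selection import train_test_split

with open("unsplit-train-en.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

# First carve off 30%, then halve it into validation and test.
train, rest = train_test_split(records, test_size=0.30, random_state=42)
validation, test = train_test_split(rest, test_size=0.50, random_state=42)
print(len(train), len(validation), len(test))   # 66 14 15 for the 95 records here
```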
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Slides from the introduction to a panel session at eResearch Australasia (Melbourne, October 2016). Panellists: Kate LeMay (Australian National Data Service), Gabrielle Hirsch (Walter and Eliza Hall Institute of Medical Research), Gordon McGurk (National Health and Medical Research Council) and Jeff Christiansen (Intersect).

Short abstract: Human medical, health and personal data are a major category of sensitive data. These data need particular care, both during the management of a research project and when planning to publish them. The Australian National Data Service (ANDS) has developed guides around the management and sharing of sensitive data. ANDS is convening this panel to consider legal, ethical and secure storage issues around sensitive data at each stage of the research life cycle: research conception and planning, commencement of research, data collection and processing, data analysis, storage and management, and dissemination of results and data access.
The legal framework around privacy in Australia is complex and differs between states. Many Acts regulate the collection, use, disclosure and handling of private data. There are also many ethical considerations around the management and sharing of sensitive data. The National Health and Medical Research Council (NHMRC) has developed the Human Research Ethics Application (HREA) as a replacement for the National Ethics Application Form (NEAF). The aim of the HREA is to be a concise, streamlined application that facilitates efficient and effective ethics review for research involving humans. The application will assist researchers to consider the ethical principles of the National Statement on Ethical Conduct in Human Research (2007) in relation to their research.
National security standard guidelines and health and medical research policy drivers underpin the need for a national fit-for-purpose health and medical research data storage facility to store, access and use health and medical research data. med.data.edu.au is an NCRIS-funded facility that underpins the Australian health and medical research sector by providing secure data storage and compute services that adhere to privacy and confidentiality requirements of data custodians who are responsible for human-derived research datasets.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Instagram data-download example dataset
In this repository you can find a dataset consisting of 11 personal Instagram archives, or Data Download Packages (DDPs).
How the data was generated
These Instagram accounts were all new and were generated by a group of researchers who were interested in investigating in detail the structure, and the variety in structure, of these Instagram DDPs. The participants used the Instagram accounts extensively for approximately a week. The participants also communicated with each other intensively so that the data can serve as an example of a network.
The data was primarily generated to evaluate the performance of de-identification software. Therefore, the text in the DDPs contains many randomly chosen (Dutch) first names, phone numbers, e-mail addresses and URLs. In addition, the images in the DDPs contain many faces and text as well. The DDPs contain faces and text (usernames) of third parties. However, only content of so-called 'professional accounts' is shared, such as accounts of famous individuals or institutions who self-consciously and actively seek publicity, and these sources are easily publicly available. Furthermore, the DDPs do not contain sensitive personal data of these individuals.
Obtaining your Instagram DDP
After using the Instagram accounts intensively for approximately a week, the participants requested their personal Instagram DDPs by using the following steps. You can follow these steps yourself if you are interested in your personal Instagram DDP.
1. Go to www.instagram.com and log in
2. Click on your profile picture, go to *Settings* and *Privacy and Security*
3. Scroll to *Data download* and click *Request download*
4. Enter your email address and click *Next*
5. Enter your password and click *Request download*
Instagram then delivered the data in a compressed zip folder with the format **username_YYYYMMDD.zip** (i.e., Instagram handle and date of download) to the participant, and the participants shared these DDPs with us.
Data cleaning
To comply with the Instagram user agreement, participants shared their full name, phone number and e-mail address. In addition, Instagram logged the IP addresses the participants used during their active period on Instagram. After collecting the DDPs, we manually replaced such information with random replacements so that the DDPs shared here do not contain any personal data of the participants.
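The replacement itself was done manually; purely as an illustration of the same idea, here is a hedged regex-based sketch with placeholder patterns and values (not the procedure actually used in this project):

```python
# Illustrative sketch only: the authors replaced personal data manually.
# The regexes below cover two common identifier types with fixed placeholders.
import re

REPLACEMENTS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "user@example.com"),  # e-mail addresses
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "0.0.0.0"),        # IPv4 addresses
]

def pseudonymize(text: str) -> str:
    """Replace matched identifiers with neutral placeholder values."""
    for pattern, placeholder in REPLACEMENTS:
        text = pattern.sub(placeholder, text)
    return text

print(pseudonymize("Mail jan@uu.nl, last login from 192.168.1.10"))
# -> "Mail user@example.com, last login from 0.0.0.0"
```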
How this dataset can be used
This dataset was generated with the intention to evaluate the performance of de-identification software. We invite other researchers to use this dataset, for example to investigate what type of data can be found in Instagram DDPs or to investigate the structure of Instagram DDPs. The packages can also be used for example data analyses, although no substantive research questions can be answered using this data, as the data does not reflect how research subjects behave 'in the wild'.
Authors
The data collection is executed by Laura Boeschoten, Ruben van den Goorbergh and Daniel Oberski of Utrecht University. For questions, please contact l.boeschoten@uu.nl.
Acknowledgments
The researchers would like to thank everyone who participated in this data-generation project.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Performance of privacy-preserving inference of our cancer prediction model.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Descriptive statistics, ANOVA results, and pairwise t-test results for perceived sensitivity, perceived confidentiality, and subjective ease of faking.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Benchmarks for one privacy-preserving GRU cell evaluation.
According to a 2023 survey of Chief Information Security Officers (CISOs) worldwide, ** percent of sensitive data loss at organizations happens because of careless users. A further **** percent of the respondents said compromised systems caused data loss. Additionally, for around ** percent of respondents, a malicious employee or contractor was the cause behind their incidents.
Ai4Privacy Community
Join our community at https://discord.gg/FmzWshaaQT to help build open datasets for privacy masking.
Purpose and Features
Previously the world's largest open dataset for privacy; it has since been superseded by pii-masking-300k. The purpose of the dataset is to train models to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The example texts have 54 PII classes (types of sensitive data), targeting 229 discussion… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-200k.
According to our latest research, the global Inline Sensitive Data Redaction Card market size reached USD 1.54 billion in 2024, with a robust year-on-year growth driven by the increasing demand for real-time data privacy and regulatory compliance across industries. The market is anticipated to expand at a CAGR of 14.2% from 2025 to 2033, projecting a forecasted market size of USD 4.21 billion by 2033. This remarkable growth trajectory is primarily attributed to the proliferation of digital transformation initiatives, stringent data protection mandates, and the exponential rise in data breaches globally.
One of the primary growth factors fueling the expansion of the Inline Sensitive Data Redaction Card market is the intensifying regulatory landscape surrounding data privacy. With the implementation of regulations such as the General Data Protection Regulation (GDPR), California Consumer Privacy Act (CCPA), and similar frameworks worldwide, organizations are under immense pressure to ensure that sensitive data—such as personally identifiable information (PII), payment card information, and health records—is effectively protected throughout its lifecycle. Inline redaction cards, both hardware and software-based, offer automated, real-time data masking and redaction capabilities that help enterprises comply with these regulations, thereby minimizing the risk of hefty fines and reputational damage. As regulatory scrutiny continues to escalate, the demand for robust redaction solutions is expected to remain strong, propelling market growth.
Another significant driver is the accelerated digital transformation across various sectors, leading to an unprecedented surge in data generation and exchange. Industries such as financial services, healthcare, government, and retail are increasingly reliant on digital platforms to deliver seamless customer experiences, streamline operations, and enable remote work. However, this digital shift also exposes organizations to greater cyber risks, including data leaks and unauthorized access. Inline Sensitive Data Redaction Cards provide a critical layer of security by ensuring that sensitive information is automatically identified and redacted before it is stored, processed, or transmitted. This capability is particularly vital in environments where large volumes of data traverse multiple endpoints and networks, making manual redaction impractical and error-prone. The integration of AI and machine learning into these solutions further enhances their efficiency, accuracy, and adaptability, making them indispensable for modern enterprises.
The proliferation of cloud computing and hybrid IT environments is also playing a pivotal role in shaping the Inline Sensitive Data Redaction Card market. As organizations migrate their workloads to the cloud and adopt SaaS applications, the need for data-centric security measures that operate seamlessly across on-premises and cloud infrastructures becomes paramount. Inline redaction solutions are evolving to support diverse deployment models, enabling businesses to maintain consistent data protection policies regardless of where their data resides. This flexibility not only supports compliance and risk management objectives but also empowers organizations to innovate without compromising security. Furthermore, the growing awareness of the business value of data privacy—such as enhanced customer trust and competitive differentiation—is encouraging more enterprises to invest in advanced redaction technologies.
From a regional perspective, North America continues to dominate the Inline Sensitive Data Redaction Card market, accounting for the largest revenue share in 2024. The region’s leadership is underpinned by the presence of major technology vendors, early adoption of advanced cybersecurity solutions, and a highly regulated business environment. Europe follows closely, driven by stringent data privacy laws and a strong focus on digital sovereignty. Meanwhile, the Asia Pacific region is emerging as the fastest-growing market, fueled by rapid digitalization, increasing cyber threats, and evolving regulatory frameworks in countries such as China, India, and Japan. Latin America and the Middle East & Africa are also witnessing steady growth, albeit from a smaller base, as organizations in these regions ramp up their investments in data protection and compliance solutions.
Since the entry into force of the General Data Protection Regulation (GDPR) on 25 May 2018, only digital processing of the most sensitive personal data must be subject to prior formalities with the CNIL. These formalities may take the form of simplified declarations (declarations of conformity with a reference framework proposed by the CNIL), requests for an opinion (for the sovereign activities of the State) or applications for authorisation (in the field of health). To find out more: cnil.fr. In accordance with the amended Data Protection Act (Article 36), the CNIL keeps the list of these formalities available to the public in an open and easily reusable format, known as the "Article 36 list".

**Warnings:**
1. The published data are the result of the prior formalities completed, since 25 May 2018, by the controllers of personal data processing at the CNIL, via its dedicated online services. The CNIL cannot be held responsible for their content.
2. Processing carried out on behalf of the State may not appear in the dataset, as the formalities were completed in the form of requests for an opinion on a draft regulatory act (decree or order) not submitted via the online services mentioned. The information relating to these processing operations is available on Legifrance, the opinion of the CNIL being published with the act authorising the processing (to access the deliberations of the CNIL: https://www.legifrance.gouv.fr/initRechExpCnil.do). In addition, some important processing operations are covered by fact sheets on the CNIL website.
3. Processing operations exceptionally exempted from publication of the regulatory act authorising them (decree or order) are not included in the published dataset, in accordance with Article 36 of the amended Data Protection Act. The processing operations referred to in Article 30 I and II may be exempted, by decree in the Council of State, from publication of the regulatory act which authorises them. These processing operations are listed in Decree no. 2007-914 of 15 May 2007.
https://dataintelo.com/privacy-and-policy
The global data de-identification software market size was valued at approximately USD 500 million in 2023 and is projected to reach around USD 1.5 billion by 2032, growing at a CAGR of 13.5% during the forecast period. The growth in this market is driven by the increasing need for data privacy and compliance with stringent regulatory requirements across various industries.
The primary growth factor for the data de-identification software market is the rising awareness and concern regarding data privacy and security. With the advent of big data and the proliferation of digital services, organizations are increasingly recognizing the importance of protecting personal and sensitive information. Data breaches and cyber-attacks have led to significant financial and reputational damages, prompting businesses to invest in advanced data de-identification solutions to mitigate risks. Moreover, regulatory frameworks such as GDPR in Europe, CCPA in California, and HIPAA in the United States mandate strict compliance measures for data privacy, further propelling the demand for these software solutions.
Another significant driver is the growing adoption of cloud-based services and data analytics. As organizations migrate their data to cloud platforms, the need for robust data protection mechanisms becomes paramount. De-identification software enables companies to anonymize sensitive information before storing it in the cloud, ensuring compliance with data protection regulations and reducing the risk of exposure. Additionally, the rise of data analytics for business intelligence and decision-making necessitates the use of de-identified data to maintain privacy while extracting valuable insights.
The healthcare sector is particularly noteworthy for its substantial contribution to the market growth. The industry deals with large volumes of sensitive patient information that must be protected from unauthorized access. Data de-identification software plays a crucial role in enabling healthcare providers to share and analyze patient data for research and treatment purposes without compromising privacy. The COVID-19 pandemic has further accelerated the adoption of digital health solutions, increasing the demand for data de-identification tools to ensure compliance with privacy regulations and maintain patient trust.
Data Masking Technology is becoming increasingly vital as organizations strive to protect sensitive information while maintaining data utility. This technology allows businesses to create a realistic but fictional version of their data, ensuring that sensitive information is not exposed during processes such as software testing, development, and analytics. By substituting sensitive data with anonymized values, data masking technology helps organizations comply with data protection regulations without hindering their operational efficiency. As data privacy concerns continue to rise, the adoption of data masking technology is expected to grow, offering a robust solution for safeguarding sensitive information across various sectors.
Regionally, North America holds a significant share of the data de-identification software market, driven by the presence of key market players, stringent regulatory requirements, and a high level of digitalization across industries. The Asia Pacific region is expected to witness the fastest growth during the forecast period, attributed to the rapid adoption of digital technologies, increasing awareness of data privacy, and evolving regulatory landscape in countries like China, Japan, and India. Europe also plays a vital role due to the stringent data protection regulations enforced by the GDPR, which mandates rigorous data de-identification practices.
By component, the data de-identification software market is segmented into software and services. The software segment is anticipated to dominate the market, driven by the increasing demand for advanced de-identification tools that can handle large volumes of data efficiently. Organizations are investing in sophisticated software solutions that offer automated and customizable de-identification processes to meet specific compliance requirements. These software solutions often come with features like encryption, tokenization, and data masking, enhancing their appeal to businesses across different sectors.
https://www.datainsightsmarket.com/privacy-policy
The Data De-identification & Pseudonymization Software market is experiencing robust growth, driven by increasing concerns around data privacy regulations like GDPR and CCPA, and the rising need to protect sensitive personal information. The market, estimated at $2 billion in 2025, is projected to expand significantly over the forecast period (2025-2033), fueled by a Compound Annual Growth Rate (CAGR) of approximately 15%. This growth is propelled by several factors, including the adoption of cloud-based solutions, advancements in artificial intelligence (AI) and machine learning (ML) for data anonymization, and the growing demand for data-driven insights while maintaining regulatory compliance. Key market segments include healthcare, finance, and government, which are heavily regulated and consequently require robust data anonymization strategies. The competitive landscape is dynamic, with a mix of established players like IBM and Informatica alongside innovative startups like Aircloak and Privitar. The market is witnessing a shift towards more sophisticated techniques like differential privacy and homomorphic encryption, enabling data analysis without compromising individual privacy.

The adoption of data de-identification and pseudonymization is expected to accelerate in the coming years, particularly within organizations handling large volumes of personal data. This increase will be influenced by stricter enforcement of privacy regulations, coupled with the expanding application of advanced analytics techniques. While challenges remain, such as the complexity of implementing these solutions and the potential for re-identification vulnerabilities, ongoing technological advancements and increasing awareness are mitigating these risks. Further growth will depend on the development of more user-friendly and cost-effective solutions catering to diverse organizational needs, along with better education and training on best practices in data protection. The market's expansion presents significant opportunities for vendors to develop and market innovative solutions, strengthening their competitive positioning within this rapidly evolving landscape.
In 2024, the number of data compromises in the United States stood at 3,158 cases. Meanwhile, over 1.35 billion individuals were affected in the same year by data compromises, including data breaches, leakage, and exposure. While these are three different events, they have one thing in common: as a result of all three incidents, sensitive data is accessed by an unauthorized threat actor.

Industries most vulnerable to data breaches: Some industry sectors usually see more significant cases of private data violations than others. This is determined by the type and volume of the personal information organizations in these sectors store. In 2024, financial services, healthcare, and professional services were the three industry sectors that recorded the most data breaches. Overall, the number of data breaches in some industry sectors in the United States has gradually increased within the past few years; however, some sectors saw a decrease.

Largest data exposures worldwide: In 2020, an adult streaming website, CAM4, experienced a leakage of nearly 11 billion records. This is, by far, the most extensive reported data leakage. This case, though, is unique because cyber security researchers found the vulnerability before the cyber criminals did. The second-largest data breach is the Yahoo data breach, dating back to 2013. The company first reported about one billion exposed records; later, in 2017, it published an updated number of leaked records, which was three billion. In March 2018, the third-biggest data breach happened, involving India's national identification database Aadhaar. As a result of this incident, over 1.1 billion records were exposed.