Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Elephant Labs Complex PII Dataset for Long Contexts and Advanced Anonymization (with Business and Software-related Entities)
Developed by: Elephant Labs
LinkedIn: Elephant Labs
Dataset Size: 20,0000 synthetic documents
Number of tokens in text: 14,140,795 (tokenized with tiktoken.encoding_for_model("gpt-3.5-turbo"))
Dataset Summary
Purpose: A synthetically generated dataset for advanced NER tasks, supporting both token classification and LLM fine-tuning (enabling… See the full description on the dataset page: https://huggingface.co/datasets/MorryShah/complex_ner.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for NER PII Extraction Dataset
Dataset Summary
This dataset is designed for training and evaluating Named Entity Recognition (NER) models focused on extracting Personally Identifiable Information (PII) from text. It includes a variety of entities such as names, addresses, phone numbers, email addresses, and identification numbers. The dataset is suitable for tasks that involve PII detection, compliance checks, and data anonymization.
Supported Tasks and Leaderboards
Named Entity… See the full description on the dataset page: https://huggingface.co/datasets/Josephgflowers/PII-NER.
https://choosealicense.com/licenses/other/
Purpose and Features
🌍 World's largest open dataset for privacy masking 🌎 The dataset is useful to train and evaluate models to remove personally identifiable and sensitive information from text, especially in the context of AI assistants and LLMs. Key facts:
OpenPII-220k text entries have 27 PII classes (types of sensitive data), targeting 749 discussion subjects / use cases split across education, health, and psychology. FinPII contains an additional ~20 types tailored to… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-300k.
Non-anonymized subset of the databases used in the paper "Traveling the Silk Road: A measurement analysis of a large anonymous online marketplace" (Christin, 2013). In this dataset, textual information (item name, description, or feedback text) and handles have not been anonymized and are thus available. We don't expect any private identifiers or other PII to be present in the data, which was collected from a publicly available website -- the Silk Road anonymous marketplace -- for a few months in 2012.
For less restricted usage terms, please consider the anonymized version, which is also available without any restrictions. This non-anonymized dataset should only be requested if your project MUST rely on full textual descriptions of items and/or feedback.
Christin (2013) Traveling the Silk Road: A measurement analysis of a large anonymous online marketplace. To appear in Proceedings of the 22nd International World Wide Web Conference (WWW'13). Rio de Janeiro, Brazil. May 2013.
Summary of every designation to every charity in a campaign year with anonymized data on the source (NO PII)
https://www.archivemarketresearch.com/privacy-policy
The global market for data masking tools is experiencing robust growth, driven by increasing regulatory compliance needs (like GDPR and CCPA), the rising adoption of cloud computing, and the expanding volume of sensitive data requiring protection. The market, currently estimated at $2.5 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033. This growth is fueled by organizations' increasing focus on data security and privacy, particularly within sectors like healthcare, finance, and government. The demand for sophisticated data masking solutions that can effectively anonymize and pseudonymize data while maintaining data utility for testing and development is a significant driver. Furthermore, the shift towards cloud-based data masking solutions, offering scalability and ease of management, is contributing to market expansion. Several key trends are shaping the market. The integration of advanced technologies such as AI and machine learning into data masking tools is enhancing their effectiveness and automating complex masking processes. The emergence of data masking solutions designed for specific data types, such as personally identifiable information (PII) and financial data, caters to niche requirements. However, challenges such as the complexity of implementing and managing data masking solutions, and concerns about the potential impact on data usability, represent restraints on market growth. The market is segmented by deployment type (cloud, on-premises), organization size (small, medium, large enterprises), and industry vertical (healthcare, finance, etc.). Key players in this space include Oracle, Delphix, BMC Software, Informatica, IBM, and several other specialized vendors offering a range of solutions to meet diverse organizational needs. The competitive landscape is dynamic, with ongoing innovation and consolidation shaping the future of the market.
Purpose and Features
The purpose of the model and dataset is to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The model is a fine-tuned version of "Distilled BERT", a smaller and faster version of BERT. It was adapted for the task of token classification based on what is, to our knowledge, the largest open-source PII masking dataset, which we are releasing simultaneously. The model size is 62 million parameters. The original… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-65k.
"Non-anonymized database pertaining to the Hansa marketplace. This data was used in the paper "Measurement by Proxy:
On the Accuracy of Online Marketplace Measurements" (Cuevas et al., 2018). In this dataset, textual information (item name, description, or feedback text) and handles have not been anonymized and are thus available. We don't expect any private identifiers or other PII to be present in the data, which was collected from a publicly available website (Hansa marketplace) over slightly less than two years (2015-2017).
For less restricted usage terms, please consider the anonymized version, which is also available without any restrictions. This non-anonymized dataset should only be requested if your project MUST rely on full textual descriptions of items and/or feedback.
Ai4Privacy Community
Join our community at https://discord.gg/FmzWshaaQT to help build open datasets for privacy masking.
Purpose and Features
Previously the world's largest open dataset for privacy; the largest is now pii-masking-300k. The purpose of the dataset is to train models to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The example texts have 54 PII classes (types of sensitive data), targeting 229 discussion… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-200k.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset consists of a curated and anonymized collection of real job application confirmation emails from a Gmail inbox. It includes confirmation emails, rejection notices, and other relevant correspondences. The dataset was originally curated to address the challenge of eliminating manual job application tracking, allowing for automatic tracking directly from the inbox, capturing application confirmations and rejection notifications.
The dataset has been carefully pre-processed, cleaned, and enriched with derived features such as:
The dataset was originally curated to build a job application tracking agent that can automatically extract and track application updates—such as confirmations, rejections, interview invites, and assessment notifications—directly from the inbox. The goal was to enable users to easily interact with an AI assistant to analyze and manage their job search process more efficiently.
⚠️ Disclaimer: All personally identifiable information (PII), such as names and email addresses, has been anonymized or redacted. Any personal names found in the dataset have been replaced with the fictional name "Michael Gary Scott" as a placeholder; this character reference is used purely for fun and does not correspond to any real individual. This dataset is intended strictly for educational and research purposes. Please ensure any further use of this dataset respects privacy and ethical data handling practices.
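As a rough illustration of the placeholder substitution described in the disclaimer, the sketch below replaces a list of known names with the stated placeholder. It is not the dataset author's actual pipeline, which the description does not specify; the function name `replace_names`, the sample text, and the idea of supplying a list of known names are all assumptions made for illustration.

```python
import re

# Placeholder name stated in the dataset description.
PLACEHOLDER = "Michael Gary Scott"


def replace_names(text: str, known_names: list) -> str:
    """Replace each known personal name in text with the placeholder.

    Hypothetical helper: assumes the names to redact are already known.
    """
    if not known_names:
        return text
    # Try longer names first so "Jane Doe" is matched before "Jane".
    alternatives = sorted(known_names, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in alternatives) + r")\b",
        flags=re.IGNORECASE,
    )
    return pattern.sub(PLACEHOLDER, text)


# Hypothetical sample email text, not taken from the dataset.
sample = "Hi Jane Doe, thank you for applying. Regards, jane"
print(replace_names(sample, ["Jane", "Jane Doe"]))
```

A list-based replacement like this only catches names that are known in advance; detecting unknown names would need an NER model or similar.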
https://www.marketresearchforecast.com/privacy-policy
The data pseudonymization software market is experiencing robust growth, driven by increasing concerns over data privacy regulations like GDPR and CCPA, and the rising need to protect sensitive customer information while still leveraging data for analytics and other business purposes. The market, estimated at $2 billion in 2025, is projected to witness a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033, reaching approximately $7 billion by 2033. This expansion is fueled by the adoption of cloud-based solutions, which offer scalability and cost-effectiveness, coupled with a growing preference for data pseudonymization techniques among enterprises, particularly in sectors like healthcare, finance, and telecommunications that handle vast quantities of personally identifiable information (PII). Key trends include the integration of advanced analytics capabilities into pseudonymization software and increasing demand for solutions capable of handling diverse data formats and sources. However, the market faces restraints including the complexity of implementing pseudonymization techniques, the need for specialized expertise, and potential concerns regarding data utility after pseudonymization. The market segmentation reveals a significant preference for cloud-based solutions over on-premises deployments, reflecting the broader trend toward cloud adoption in enterprise IT. Enterprise adoption outweighs individual usage, reflecting the higher volume and sensitivity of data handled by large organizations. Geographically, North America currently dominates the market, followed by Europe, driven by stringent data privacy regulations and advanced technological infrastructure. However, the Asia-Pacific region is expected to experience significant growth in the coming years, fueled by increasing digitalization and growing awareness of data privacy issues. 
Competition among vendors like Aircloak, AvePoint, Anonos, and others is intense, with companies focusing on innovation in areas such as AI-powered pseudonymization and enhanced data security features to gain a competitive edge. The long-term forecast indicates a sustained period of growth, propelled by ongoing regulatory pressure and the continuous need for robust data protection measures in a data-driven economy.
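The growth figures quoted above can be checked with a short compound-growth calculation. This is only a sketch: the function name `project_market_size` is introduced here for illustration, and the number of compounding periods is an assumption, since the description does not say how the 2033 figure was derived.

```python
def project_market_size(base_value: float, cagr: float, years: int) -> float:
    """Compound base_value at a constant annual growth rate for `years` periods."""
    return base_value * (1.0 + cagr) ** years


# Figures quoted in the description: $2 billion in 2025, 15% CAGR, horizon 2033.
base_2025 = 2.0        # billions of USD
cagr = 0.15
years = 2033 - 2025    # assumption: eight full compounding periods

projected_2033 = project_market_size(base_2025, cagr, years)
print(f"Eight periods: ${projected_2033:.2f} billion")

# Assumption: one extra period, in case the source compounds from a 2024 base.
print(f"Nine periods:  ${project_market_size(base_2025, cagr, years + 1):.2f} billion")
```

Under eight compounding periods the result is about $6.1 billion, somewhat below the quoted "approximately $7 billion"; nine periods gives about $7.0 billion, so the quoted figure may rest on a different base year or rounding convention.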
"Non-anonymized database pertaining to the AlphaBay marketplace. This data was used in the papers ""Plug and Prey? Measuring the Commoditization of Cybercrime via Online Anonymous Markets"" (Van Wegberg et al., 2018), ""An Empirical Analysis of Traceability in the Monero Blockchain"" (Moeser et al., 2018) and in the joint EMCDDA/EUROPOL report ""Drugs and thedarknet: Perspectives for enforcement, researchand policy"" (EMCDDA, 2017). In this dataset, textual information (item name, description, or feedback text) and handles have not been anonymized and are thus available. We don't expect any private identifiers or other PII to be present in the data, which was collected from a publicly available website (Alphabay marketplace) over two and a half years (2014-2017).
For less restricted usage terms, please consider the anonymized version, which is also available without any restrictions. This non-anonymized dataset should only be requested if your project MUST rely on full textual descriptions of items and/or feedback.
EMCDDA (2017) Drugs and thedarknet: Perspectives for enforcement, researchand policy. November 2017.
Van Wegberg et al.. Plug and Prey? Measuring the Commoditization of Cybercrime via Online Anonymous Markets. To appear in Proceedings of the 27th USENIX Security Symposium (USENIX Security'18). Baltimore, MD. August 2018.
Moeser et al. An Empirical Analysis of Traceability in the Monero Blockchain. To appear in Proceedings of the Privacy Enhancing Technology Symposium (PETS 2018), volume 3. Barcelona, Spain. July 2018."
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
📘 Overview
This dataset consists of augmented Azerbaijani text pairs (clean & masked) that contain personally identifiable information (PII). All content has been automatically generated using ChatGPT to simulate sensitive data scenarios for tasks like PII detection, anonymization, entity masking, and secure data handling.
🔍 Dataset Structure
Each example is a paired record:
original: The full augmented Azerbaijani text containing PII.
masked: The same text with PII… See the full description on the dataset page: https://huggingface.co/datasets/aimtune/az_personal_info_aug_masked.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is an additional DoH dataset used for researching DoH traffic data drift phenomena. It contains anonymized packet captures (pcaps) from the following days:
2022-11-28
2022-12-05
2022-12-12
2022-12-19
2022-12-26
The traffic was captured on the CESNET2 network and anonymized. The packet capturing and anonymization follow the methodology described in [1]. The list of IP addresses used for DoH recognition is also included within the dataset in doh_resolver_ip.csv file. The structure of the dataset is as follows:
.
├── doh_resolver_ip.csv
├── pcap
│   ├── 2022-11-28
│   │   ├── DoH-20221128180002.pcapng
│   │   └── HTTPS-20221128180002.pcapng
│   ├── 2022-12-05
│   │   ├── DoH-20221205180001.pcapng
│   │   └── HTTPS-20221205180001.pcapng
│   ├── 2022-12-12
│   │   ├── DoH-20221212180001.pcapng
│   │   └── HTTPS-20221212180001.pcapng
│   ├── 2022-12-19
│   │   ├── DoH-20221219180001.pcapng
│   │   └── HTTPS-20221219180001.pcapng
│   └── 2022-12-26
│       ├── DoH-20221226180001.pcapng
│       └── HTTPS-20221226180001.pcapng
└── README.md
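Given the layout above, a small helper can pair each capture day with its DoH and HTTPS files. This is a minimal sketch using only the Python standard library; the function name `group_captures_by_day` and the assumption that the file-name prefix before the first hyphen identifies the traffic type are introduced here for illustration and are not part of the dataset's own tooling.

```python
from collections import defaultdict
from pathlib import Path


def group_captures_by_day(dataset_root: Path) -> dict:
    """Map each capture day to its pcapng files, keyed by traffic type.

    Assumes the layout <root>/pcap/<YYYY-MM-DD>/<TYPE>-<timestamp>.pcapng.
    """
    captures = defaultdict(dict)
    for pcap_file in sorted((dataset_root / "pcap").glob("*/*.pcapng")):
        day = pcap_file.parent.name               # e.g. "2022-11-28"
        kind = pcap_file.name.split("-", 1)[0]    # "DoH" or "HTTPS"
        captures[day][kind] = pcap_file
    return dict(captures)
```

Calling `group_captures_by_day(Path("path/to/dataset"))` on an extracted copy would return one entry per day, each holding a `"DoH"` and an `"HTTPS"` path; the path shown is a placeholder.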
[1] Jeřábek, K., Hynek, K., Čejka, T., & Ryšavý, O. (2022). Collection of datasets with DNS over HTTPS traffic. Data in Brief, 42, 108310. https://www.sciencedirect.com/science/article/pii/S2352340922005121