14 datasets found

complex_ner
huggingface.co
Updated Sep 22, 2023
Morteza Shahrezaye (2023). complex_ner [Dataset]. http://doi.org/10.57967/hf/3143
Unique identifier
https://doi.org/10.57967/hf/3143
Dataset updated
Sep 22, 2023
Authors
Morteza Shahrezaye
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Elephant Labs Complex PII Dataset for Long Contexts and Advanced Anonymization (with Business and Software-related Entities)

Developed by: Elephant Labs
LinkedIn: Elephant Labs
Dataset Size: 20,0000 synthetic documents
Number of tokens in text: 14,140,795 (tokenized with tiktoken.encoding_for_model("gpt-3.5-turbo"))

Dataset Summary

Purpose: A synthetically generated dataset for advanced NER tasks, supporting both token classification and LLM fine-tuning (enabling… See the full description on the dataset page: https://huggingface.co/datasets/MorryShah/complex_ner.
PII-NER
huggingface.co
Updated Jul 20, 2024
Joseph G Flowers (2024). PII-NER [Dataset]. https://huggingface.co/datasets/Josephgflowers/PII-NER
Dataset updated
Jul 20, 2024
Authors
Joseph G Flowers
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset Card for NER PII Extraction Dataset

Dataset Summary

This dataset is designed for training and evaluating Named Entity Recognition (NER) models focused on extracting Personally Identifiable Information (PII) from text. It includes a variety of entities such as names, addresses, phone numbers, email addresses, and identification numbers. The dataset is suitable for tasks that involve PII detection, compliance checks, and data anonymization.

Supported Tasks and Leaderboards

Named Entity… See the full description on the dataset page: https://huggingface.co/datasets/Josephgflowers/PII-NER.
pii-masking-300k
huggingface.co
Updated Apr 4, 2024
Ai4Privacy (2024). pii-masking-300k [Dataset]. http://doi.org/10.57967/hf/1995
Unique identifier
https://doi.org/10.57967/hf/1995
Dataset updated
Apr 4, 2024
Dataset authored and provided by
Ai4Privacy
License
https://choosealicense.com/licenses/other/
Description
Purpose and Features

🌍 World's largest open dataset for privacy masking 🌎 The dataset is useful to train and evaluate models to remove personally identifiable and sensitive information from text, especially in the context of AI assistants and LLMs. Key facts:

OpenPII-220k text entries have 27 PII classes (types of sensitive data), targeting 749 discussion subjects / use cases split across education, health, and psychology. FinPII contains an additional ~20 types tailored to… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-300k.
Traveling the Silk Road: Non-anonymized datasets
impactcybertrust.org
Updated Mar 4, 2012
+ more versions
Carnegie Mellon University (2012). Traveling the Silk Road: Non-anonymized datasets [Dataset]. http://doi.org/10.23721/116/1406256
Unique identifier
https://doi.org/10.23721/116/1406256
Dataset updated
Mar 4, 2012
Authors
Carnegie Mellon University
Time period covered
Mar 4, 2012 - Jul 23, 2012
Description
Non-anonymized subset of the databases used in the paper "Traveling the Silk Road: A measurement analysis of a large anonymous online marketplace" (Christin, 2013). In this dataset, textual information (item name, description, or feedback text) and handles have not been anonymized and are thus available. We don't expect any private identifiers or other PII to be present in the data, which was collected from a publicly available website -- the Silk Road anonymous marketplace -- for a few months in 2012.

For less restricted usage terms, please consider the anonymized version, which is also available without any restrictions. This non-anonymized dataset should only be requested if your project MUST rely on full textual descriptions of items and/or feedback.

Christin (2013) Traveling the Silk Road: A measurement analysis of a large anonymous online marketplace. To appear in Proceedings of the 22nd International World Wide Web Conference (WWW'13). Rio de Janeiro, Brazil. May 2013.
CFC Anonymized Raw Data
datasets.ai
catalog.data.gov
Updated Sep 10, 2024
Office of Personnel Management (2024). CFC Anonymized Raw Data [Dataset]. https://datasets.ai/datasets/cfc-anonymized-raw-data-09c36
Dataset updated
Sep 10, 2024
Dataset provided by
United States Office of Personnel Management (https://opm.gov/)
Authors
Office of Personnel Management
Description
Summary of every designation to every charity in a campaign year with anonymized data on the source (NO PII)
Data Masking Tools Report
archivemarketresearch.com
doc, pdf, ppt
Updated Jun 21, 2025
+ more versions
Archive Market Research (2025). Data Masking Tools Report [Dataset]. https://www.archivemarketresearch.com/reports/data-masking-tools-560706
Available download formats: pdf, doc, ppt
Dataset updated
Jun 21, 2025
Dataset authored and provided by
Archive Market Research
License
https://www.archivemarketresearch.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The global market for data masking tools is experiencing robust growth, driven by increasing regulatory compliance needs (like GDPR and CCPA), the rising adoption of cloud computing, and the expanding volume of sensitive data requiring protection. The market, currently estimated at $2.5 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033. This growth is fueled by organizations' increasing focus on data security and privacy, particularly within sectors like healthcare, finance, and government. The demand for sophisticated data masking solutions that can effectively anonymize and pseudonymize data while maintaining data utility for testing and development is a significant driver. Furthermore, the shift towards cloud-based data masking solutions, offering scalability and ease of management, is contributing to market expansion. Several key trends are shaping the market. The integration of advanced technologies such as AI and machine learning into data masking tools is enhancing their effectiveness and automating complex masking processes. The emergence of data masking solutions designed for specific data types, such as personally identifiable information (PII) and financial data, caters to niche requirements. However, challenges such as the complexity of implementing and managing data masking solutions, and concerns about the potential impact on data usability, represent restraints on market growth. The market is segmented by deployment type (cloud, on-premises), organization size (small, medium, large enterprises), and industry vertical (healthcare, finance, etc.). Key players in this space include Oracle, Delphix, BMC Software, Informatica, IBM, and several other specialized vendors offering a range of solutions to meet diverse organizational needs. The competitive landscape is dynamic, with ongoing innovation and consolidation shaping the future of the market.
pii-masking-65k
huggingface.co
Updated Apr 5, 2024
+ more versions
Ai4Privacy (2024). pii-masking-65k [Dataset]. http://doi.org/10.57967/hf/2012
Unique identifier
https://doi.org/10.57967/hf/2012
Dataset updated
Apr 5, 2024
Dataset authored and provided by
Ai4Privacy
Description
Purpose and Features

The purpose of the model and dataset is to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The model is a fine-tuned version of "Distilled BERT", a smaller and faster version of BERT. It was adapted for the task of token classification based on what is, to our knowledge, the largest open-source PII masking dataset, which we are releasing simultaneously. The model size is 62 million parameters. The original… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-65k.
Hansa marketplace: Non-anonymized dataset
impactcybertrust.org
Updated Oct 8, 2015
+ more versions
Carnegie Mellon University (2015). Hansa marketplace: Non-anonymized dataset [Dataset]. https://www.impactcybertrust.org/dataset_view?idDataset=1499
Dataset updated
Oct 8, 2015
Authors
Carnegie Mellon University
Time period covered
Oct 8, 2015 - Jul 14, 2017
Description
Non-anonymized database pertaining to the Hansa marketplace. This data was used in the paper "Measurement by Proxy: On the Accuracy of Online Marketplace Measurements" (Cuevas et al., 2018). In this dataset, textual information (item name, description, or feedback text) and handles have not been anonymized and are thus available. We don't expect any private identifiers or other PII to be present in the data, which was collected from a publicly available website (Hansa marketplace) over slightly less than two years (2015-2017).

For less restricted usage terms, please consider the anonymized version, which is also available without any restrictions. This non-anonymized dataset should only be requested if your project MUST rely on full textual descriptions of items and/or feedback.
pii-masking-200k
huggingface.co
Updated Apr 22, 2024
Ai4Privacy (2024). pii-masking-200k [Dataset]. http://doi.org/10.57967/hf/1532
Unique identifier
https://doi.org/10.57967/hf/1532
Dataset updated
Apr 22, 2024
Dataset authored and provided by
Ai4Privacy
Description
Ai4Privacy Community

Join our community at https://discord.gg/FmzWshaaQT to help build open datasets for privacy masking.

Purpose and Features

Previously the world's largest open dataset for privacy; that title now belongs to pii-masking-300k. The purpose of the dataset is to train models to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The example texts have 54 PII classes (types of sensitive data), targeting 229 discussion… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-200k.
Job Application Email (Anonymized & Feature-Rich)
kaggle.com
Updated Apr 15, 2025
Rashmi Shree (2025). Job Application Email (Anonymized & Feature-Rich) [Dataset]. https://www.kaggle.com/datasets/rasho330/job-application-email-anonymized-and-feature-rich
Dataset updated
Apr 15, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Rashmi Shree
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
This dataset consists of a curated and anonymized collection of real job application confirmation emails from a Gmail inbox. It includes confirmation emails, rejection notices, and other relevant correspondences. The dataset was originally curated to address the challenge of eliminating manual job application tracking, allowing for automatic tracking directly from the inbox, capturing application confirmations and rejection notifications.

The dataset has been carefully pre-processed, cleaned, and enriched with derived features such as:

📅 Parsed Date and Time

🕒 Week, Month, and Year of Email

⏱️ Days Since Email Received

📩 Email Subject and Body

🏢 Company Name (Parsed from Subject/Body)

📊 Application Status Insights

The dataset was originally curated to build a job application tracking agent that can automatically extract and track application updates—such as confirmations, rejections, interview invites, and assessment notifications—directly from the inbox. The goal was to enable users to easily interact with an AI assistant to analyze and manage their job search process more efficiently.

⚠️ Disclaimer: All personally identifiable information (PII), such as names and email addresses, has been anonymized or redacted. Any personal names found in the dataset have been replaced with the fictional name "Michael Gary Scott" as a placeholder. This character reference is used purely for fun and does not correspond to any real individual. This dataset is intended strictly for educational and research purposes. Please ensure any further use of this dataset respects privacy and ethical data handling practices.
Data Pseudonymity Software Report
marketresearchforecast.com
doc, pdf, ppt
Updated Mar 6, 2025
Market Research Forecast (2025). Data Pseudonymity Software Report [Dataset]. https://www.marketresearchforecast.com/reports/data-pseudonymity-software-28311
Available download formats: ppt, pdf, doc
Dataset updated
Mar 6, 2025
Dataset authored and provided by
Market Research Forecast
License
https://www.marketresearchforecast.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The data pseudonymization software market is experiencing robust growth, driven by increasing concerns over data privacy regulations like GDPR and CCPA, and the rising need to protect sensitive customer information while still leveraging data for analytics and other business purposes. The market, estimated at $2 billion in 2025, is projected to witness a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033, reaching approximately $7 billion by 2033. This expansion is fueled by the adoption of cloud-based solutions, which offer scalability and cost-effectiveness, coupled with a growing preference for data pseudonymization techniques among enterprises, particularly in sectors like healthcare, finance, and telecommunications that handle vast quantities of personally identifiable information (PII). Key trends include the integration of advanced analytics capabilities into pseudonymization software and increasing demand for solutions capable of handling diverse data formats and sources. However, the market faces restraints including the complexity of implementing pseudonymization techniques, the need for specialized expertise, and potential concerns regarding data utility after pseudonymization. The market segmentation reveals a significant preference for cloud-based solutions over on-premises deployments, reflecting the broader trend toward cloud adoption in enterprise IT. Enterprise adoption outweighs individual usage, reflecting the higher volume and sensitivity of data handled by large organizations. Geographically, North America currently dominates the market, followed by Europe, driven by stringent data privacy regulations and advanced technological infrastructure. However, the Asia-Pacific region is expected to experience significant growth in the coming years, fueled by increasing digitalization and growing awareness of data privacy issues. 
Competition among vendors like Aircloak, AvePoint, Anonos, and others is intense, with companies focusing on innovation in areas such as AI-powered pseudonymization and enhanced data security features to gain a competitive edge. The long-term forecast indicates a sustained period of growth, propelled by ongoing regulatory pressure and the continuous need for robust data protection measures in a data-driven economy.
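The growth figures quoted in this description can be sanity-checked with the standard compound-growth formula, value × (1 + rate)^periods. The sketch below is illustrative only: the report does not say how many compounding periods it counts, so both 8 periods (2025 to 2033) and 9 periods are assumptions, and the function name `project` is hypothetical.

```python
def project(value_bn: float, cagr: float, periods: int) -> float:
    """Compound a starting value (in billions) at a fixed annual rate."""
    return value_bn * (1.0 + cagr) ** periods


base_bn = 2.0   # 2025 estimate quoted in the description, in billions of dollars
cagr = 0.15     # 15% compound annual growth rate quoted in the description

# Assumption: the period count is not stated in the report, so show both readings.
for periods in (8, 9):
    print(f"{periods} periods: ${project(base_bn, cagr, periods):.2f} billion")
# 8 periods: $6.12 billion
# 9 periods: $7.04 billion
```

Under these assumptions, the quoted "approximately $7 billion by 2033" matches nine compounding periods more closely than eight.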
Alphabay marketplace: Non-anonymized dataset
impactcybertrust.org
Updated Dec 31, 2014
+ more versions
Carnegie Mellon University (2014). Alphabay marketplace: Non-anonymized dataset [Dataset]. http://doi.org/10.23721/116/1462165
Unique identifier
https://doi.org/10.23721/116/1462165
Dataset updated
Dec 31, 2014
Authors
Carnegie Mellon University
Time period covered
Dec 31, 2014 - May 26, 2017
Description
Non-anonymized database pertaining to the AlphaBay marketplace. This data was used in the papers "Plug and Prey? Measuring the Commoditization of Cybercrime via Online Anonymous Markets" (Van Wegberg et al., 2018), "An Empirical Analysis of Traceability in the Monero Blockchain" (Moeser et al., 2018) and in the joint EMCDDA/EUROPOL report "Drugs and the darknet: Perspectives for enforcement, research and policy" (EMCDDA, 2017). In this dataset, textual information (item name, description, or feedback text) and handles have not been anonymized and are thus available. We don't expect any private identifiers or other PII to be present in the data, which was collected from a publicly available website (Alphabay marketplace) over two and a half years (2014-2017).

For less restricted usage terms, please consider the anonymized version, which is also available without any restrictions. This non-anonymized dataset should only be requested if your project MUST rely on full textual descriptions of items and/or feedback.

EMCDDA (2017) Drugs and the darknet: Perspectives for enforcement, research and policy. November 2017.
Van Wegberg et al. Plug and Prey? Measuring the Commoditization of Cybercrime via Online Anonymous Markets. To appear in Proceedings of the 27th USENIX Security Symposium (USENIX Security'18). Baltimore, MD. August 2018.
Moeser et al. An Empirical Analysis of Traceability in the Monero Blockchain. To appear in Proceedings of the Privacy Enhancing Technology Symposium (PETS 2018), volume 3. Barcelona, Spain. July 2018.
az_personal_info_aug_masked
huggingface.co
Hamza Agar, az_personal_info_aug_masked [Dataset]. https://huggingface.co/datasets/aimtune/az_personal_info_aug_masked
Authors
Hamza Agar
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
📘 Overview

This dataset consists of augmented Azerbaijani text pairs (clean & masked) that contain personally identifiable information (PII). All content has been automatically generated using ChatGPT to simulate sensitive data scenarios for tasks like PII detection, anonymization, entity masking, and secure data handling.

🔍 Dataset Structure

Each example is a paired record:

original: The full augmented Azerbaijani text containing PII.
masked: The same text with PII…

See the full description on the dataset page: https://huggingface.co/datasets/aimtune/az_personal_info_aug_masked.
Five-week DoH Dataset collected on ISP backbone lines
data.niaid.nih.gov
zenodo.org
Updated Sep 15, 2023
Kamil Jeřábek (2023). Five-week DoH Dataset collected on ISP backbone lines [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8348772
Dataset updated
Sep 15, 2023
Dataset provided by
Ondřej Ryšavý
Kamil Jeřábek
Karel Hynek
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is an additional DoH dataset used for researching DoH traffic data drift phenomena. It contains anonymized packet captures (pcaps) from the following days:

2022-11-28

2022-12-05

2022-12-12

2022-12-19

2022-12-26

The traffic was captured on the CESNET2 network and anonymized. The packet capturing and anonymization follow the methodology described in [1]. The list of IP addresses used for DoH recognition is also included within the dataset in doh_resolver_ip.csv file. The structure of the dataset is as follows:

.
├── doh_resolver_ip.csv
├── pcap
│   ├── 2022-11-28
│   │   ├── DoH-20221128180002.pcapng
│   │   └── HTTPS-20221128180002.pcapng
│   ├── 2022-12-05
│   │   ├── DoH-20221205180001.pcapng
│   │   └── HTTPS-20221205180001.pcapng
│   ├── 2022-12-12
│   │   ├── DoH-20221212180001.pcapng
│   │   └── HTTPS-20221212180001.pcapng
│   ├── 2022-12-19
│   │   ├── DoH-20221219180001.pcapng
│   │   └── HTTPS-20221219180001.pcapng
│   └── 2022-12-26
│       ├── DoH-20221226180001.pcapng
│       └── HTTPS-20221226180001.pcapng
└── README.md

[1] Jeřábek, K., Hynek, K., Čejka, T., & Ryšavý, O. (2022). Collection of datasets with DNS over HTTPS traffic. Data in Brief, 42, 108310. https://www.sciencedirect.com/science/article/pii/S2352340922005121
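
The capture file names in the listing above follow a regular pattern, so they can be split into a traffic kind and a timestamp before any packet processing. The sketch below is only an illustration: it assumes the 14 digits encode a capture start time as YYYYMMDDHHMMSS, which the dataset description does not state, and the names `Capture` and `parse_capture_name` are hypothetical helpers, not part of the dataset.

```python
import re
from datetime import datetime
from typing import NamedTuple

# Assumption: file names look like "<kind>-<YYYYMMDDHHMMSS>.pcapng",
# where kind is "DoH" or "HTTPS" as in the directory listing.
_NAME = re.compile(r"^(DoH|HTTPS)-(\d{14})\.pcapng$")


class Capture(NamedTuple):
    kind: str          # "DoH" or "HTTPS"
    started: datetime  # timestamp parsed from the file name


def parse_capture_name(name: str) -> Capture:
    """Split a capture file name into its traffic kind and timestamp."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected capture file name: {name!r}")
    started = datetime.strptime(match.group(2), "%Y%m%d%H%M%S")
    return Capture(match.group(1), started)


# Two file names taken from the listing.
for name in ("DoH-20221128180002.pcapng", "HTTPS-20221226180001.pcapng"):
    capture = parse_capture_name(name)
    print(capture.kind, capture.started.isoformat())
```

Rejecting names that do not match the pattern keeps files such as README.md or doh_resolver_ip.csv from being treated as captures.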
3 scholarly articles cite the complex_ner dataset.
