Ai4Privacy Community
Join our community at https://discord.gg/FmzWshaaQT to help build open datasets for privacy masking.
Purpose and Features
Previously the world's largest open dataset for privacy masking; it has since been superseded by pii-masking-300k. The purpose of the dataset is to train models to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The example texts cover 54 PII classes (types of sensitive data), targeting 229 discussion… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-200k.
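The masking task these datasets target can be sketched offline with a toy rule-based masker. This is illustrative only: the pattern set and the `[LABEL]` placeholder format are assumptions for the sketch, and trained models cover the 54 PII classes that simple regexes cannot.

```python
import re

# Toy rule-based PII masker: replaces matches with [LABEL] placeholders.
# Real models learn contextual classes far beyond what regexes capture.
PATTERNS = {
    "EMAIL": r"[\w.+-]+@[\w-]+\.[\w.-]+",
    "PHONE": r"\+?\d[\d\s().-]{7,}\d",
}

def mask_pii(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = re.sub(pattern, f"[{label}]", text)
    return text

print(mask_pii("Contact Jane at jane.doe@example.com or +1 555-123-4567."))
# → Contact Jane at [EMAIL] or [PHONE].
```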
ponoma16/context-aware-pii-detection-v3 dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for NER PII Extraction Dataset Dataset Summary This dataset is designed for training and evaluating Named Entity Recognition (NER) models focused on extracting Personally Identifiable Information (PII) from text. It includes a variety of entities such as names, addresses, phone numbers, email addresses, and identification numbers. The dataset is suitable for tasks that involve PII detection, compliance checks, and data anonymization. Supported Tasks and Leaderboards Named Entity… See the full description on the dataset page: https://huggingface.co/datasets/Josephgflowers/PII-NER.
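Training a token classifier on a dataset like this typically requires converting entity annotations into per-token BIO tags. A minimal sketch, assuming character-level `(start, end, label)` spans and whitespace tokenization (the dataset's actual annotation schema may differ):

```python
def spans_to_bio(text, spans):
    """Convert (start, end, label) character spans to whitespace-token BIO tags."""
    tokens, labels = [], []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)  # locate token in the original text
        end = start + len(token)
        pos = end
        tag = "O"
        for s, e, label in spans:
            if start >= s and end <= e:
                # First token of a span gets B-, continuation tokens get I-
                tag = ("B-" if start == s else "I-") + label
                break
        tokens.append(token)
        labels.append(tag)
    return tokens, labels

text = "Call John Smith at 555-0199"
spans = [(5, 15, "NAME"), (19, 27, "PHONE")]
print(spans_to_bio(text, spans))
# → (['Call', 'John', 'Smith', 'at', '555-0199'], ['O', 'B-NAME', 'I-NAME', 'O', 'B-PHONE'])
```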
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Developed by AI4Privacy, this dataset represents a pioneering effort in the realm of privacy and AI. As an expansive resource hosted on Hugging Face at ai4privacy/pii-masking-200k, it serves a crucial role in addressing the growing concerns around personal data security in AI applications.
Sources: The dataset is crafted using proprietary algorithms, ensuring the creation of synthetic data that avoids privacy violations. Its multilingual composition, including English, French, German, and Italian texts, reflects a diverse source base. The data is meticulously curated with human-in-the-loop validation, ensuring both relevance and quality.
Context: In an era where data privacy is paramount, this dataset is tailored to train AI models to identify and mask personally identifiable information (PII). It covers 54 PII classes and extends across 229 use cases in various domains like business, education, psychology, and legal fields, emphasizing its contextual richness and applicability.
Inspiration: The dataset draws inspiration from the need for enhanced privacy measures in AI interactions, particularly in LLMs and AI assistants. The creators, AI4Privacy, are dedicated to building tools that act as a 'global seatbelt' for AI, protecting individuals' personal data. This dataset is a testament to their commitment to advancing AI technology responsibly and ethically.
This comprehensive dataset is not just a tool but a step towards a future where AI and privacy coexist harmoniously, offering immense value to researchers, developers, and privacy advocates alike.
huangyizhuo/pii-url-detection dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Gretel Synthetic Domain-Specific Documents Dataset (English)
This dataset is a synthetically generated collection of documents enriched with Personally Identifiable Information (PII) and Protected Health Information (PHI) entities spanning multiple domains. Created using Gretel Navigator with mistral-nemo-2407 as the backend model, it is specifically designed for fine-tuning Gliner models. The dataset contains document passages featuring PII/PHI entities from a wide range of… See the full description on the dataset page: https://huggingface.co/datasets/gretelai/gretel-pii-masking-en-v1.
Purpose and Features
The purpose of the model and dataset is to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The model is a fine-tuned version of DistilBERT, a smaller and faster version of BERT, adapted for token classification on what is, to our knowledge, the largest open-source PII masking dataset, which we are releasing simultaneously. The model has 62 million parameters. The original… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-43k.
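Once a token classifier like this has produced per-token labels, the remaining step is splicing placeholders back into the text. A minimal offline sketch of that post-processing, assuming whitespace tokens and BIO labels (the function and placeholder format are illustrative, not the model's actual API):

```python
def apply_bio_mask(tokens, labels):
    """Replace each labeled token run with a single [LABEL] placeholder."""
    out = []
    for token, tag in zip(tokens, labels):
        if tag == "O":
            out.append(token)
        elif tag.startswith("B-"):
            out.append(f"[{tag[2:]}]")
        # "I-" tokens continue the previous placeholder, so emit nothing
    return " ".join(out)

tokens = ["Call", "John", "Smith", "at", "555-0199"]
labels = ["O", "B-NAME", "I-NAME", "O", "B-PHONE"]
print(apply_bio_mask(tokens, labels))  # → Call [NAME] at [PHONE]
```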
CDLA-Sharing-1.0: https://cdla.io/sharing-1-0/
Issues to resolve:
To downgrade, the following three packages must be changed together:
datasets==2.16.0
fsspec==2023.10.0
gcsfs==2023.10.0
Add the Kaggle input "dataset", which is just a requirements.txt file, then install the requirements:
!pip install -r /kaggle/input/tf2-16-requirements-txt/requirements.txt
!pip list | grep -E 'datasets|transformers|tensorflow|keras|ml-dtypes|numpy|fsspec|gcsfs'
datasets 2.16.0
fsspec 2023.10.0
gcsfs 2023.10.0
keras 3.0.5
keras-cv 0.8.2
keras-nlp 0.8.2
keras-tuner 1.4.6
ml-dtypes 0.3.2
msgpack-numpy 0.4.8
numpy 1.26.4
tensorflow 2.16.1
tensorflow-cloud 0.1.16
tensorflow-datasets 4.9.4
tensorflow-decision-forests 1.8.1
tensorflow-estimator 2.15.0
tensorflow-hub 0.16.1
tensorflow-io 0.35.0
tensorflow-io-gcs-filesystem 0.35.0
tensorflow-metadata 0.14.0
tensorflow-probability 0.23.0
tensorflow-serving-api 2.14.1
tensorflow-text 2.16.1
tensorflow-transform 0.14.0
tf_keras 2.16.0
transformers 4.38.2
Then run "Restart & clear cell outputs" to make these changes live in the kernel.
NOTE: DO NOT FACTORY RESET!
import datasets
import tensorflow as tf
assert datasets.__version__ == '2.16.0'
assert tf.__version__ == '2.16.1'
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created with the Gemini API using the Kaggle notebook: https://www.kaggle.com/code/newtonbaba12345/pii-detection-data-generation-using-gemini
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
King-Harry/NinjaMasker-PII-Redaction-Dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Image generated by DALL-E. See prompt for more details
💼 📊 Synthetic Financial Domain Documents with PII Labels
gretelai/synthetic_pii_finance_multilingual is a dataset of full-length synthetic financial documents containing Personally Identifiable Information (PII), generated using Gretel Navigator and released under Apache 2.0. This dataset is designed to assist with the following use cases:
🏷️ Training NER (Named Entity Recognition) models to detect and label PII in… See the full description on the dataset page: https://huggingface.co/datasets/gretelai/synthetic_pii_finance_multilingual.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Synthetic PII Detection Dataset (DeBERTa-PII-Synth)
A large-scale synthetic dataset for training token classification models to detect Personally Identifiable Information (PII).
Dataset Summary
DeBERTa-PII-Synth is a fully synthetic dataset created to train token-classification models for detecting Personally Identifiable Information (PII). It contains 120k+ samples, each including:
Natural-language text containing synthetic PII Gold span annotations (PERSON, EMAIL, DATE… See the full description on the dataset page: https://huggingface.co/datasets/tursunait/deberta-pii-synth.
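Gold span annotations like these can be applied to the text directly to produce masked training targets. A minimal sketch, assuming `(start, end, label)` character offsets (the dataset's actual span format may differ):

```python
def mask_spans(text, spans):
    """Replace gold (start, end, label) spans with [LABEL] placeholders.

    Spans are applied right-to-left so earlier character offsets stay valid
    as the text length changes.
    """
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text

text = "Email anna@example.org by 2024-05-01."
spans = [(6, 22, "EMAIL"), (26, 36, "DATE")]
print(mask_spans(text, spans))  # → Email [EMAIL] by [DATE].
```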
rasikabh/pile-pii dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Nemotron-PII: Synthesized Data for Privacy-Preserving AI
Dataset Description
Nemotron‑PII is a synthetic, persona‑grounded dataset for training and evaluating detection of Personally Identifiable Information (PII) and Protected Health Information (PHI) in text at production quality. It contains 100,000 English records across 50+ industries with span‑level annotations for 55+ PII/PHI categories, generated with NVIDIA NeMo Data Designer using synthetic personas grounded in… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/Nemotron-PII.
Dataset Card for "devign_with_norm_vul_lines"
More information needed. Original paper: https://www.sciencedirect.com/science/article/abs/pii/S0167739X24004680 BibTeX: @article{TRAN2024107504, title = {DetectVul: A statement-level code vulnerability detection for Python}, journal = {Future Generation Computer Systems}, pages = {107504}, year = {2024}, issn = {0167-739X}, doi = {https://doi.org/10.1016/j.future.2024.107504}, url =… See the full description on the dataset page: https://huggingface.co/datasets/DetectVul/devign.
🔏 Interaction Dialogue Dataset with Extracted Privacy Phrases and Annotated Private Information
📑 Dataset Details
Dataset Description
This is the dataset of "Automated Annotation of Privacy Information in User Interactions with Large Language Models". Traditional personally identifiable information (PII) detection in anonymous content is insufficient in real-name interaction scenarios with LLMs. By authenticating through user login, queries posed to LLMs… See the full description on the dataset page: https://huggingface.co/datasets/Nidhogg-zh/Interaction_Dialogue_with_Privacy.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Bangumi Image Base of Sasaki To Pii-chan
This is the image base of the bangumi Sasaki to Pii-chan; we detected 46 characters and 4654 images in total. The full dataset is here. Please note that these image bases are not guaranteed to be 100% clean; they may contain noise. If you intend to manually train models using this dataset, we recommend performing the necessary preprocessing on the downloaded dataset to eliminate potential noisy samples (approximately 1% probability). Here is the… See the full description on the dataset page: https://huggingface.co/datasets/BangumiBase/sasakitopiichan.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
CTW1500
META
https://github.com/open-mmlab/mmocr/blob/main/dataset_zoo/ctw1500/metafile.yml Name: 'CTW1500' Paper: Title: Curved scene text detection via transverse and longitudinal sequence connection URL: https://www.sciencedirect.com/science/article/pii/S0031320319300664 Venue: PR Year: '2019' BibTeX: '@article{liu2019curved, title={Curved scene text detection via transverse and longitudinal sequence connection}, author={Liu, Yuliang and Jin, Lianwen and… See the full description on the dataset page: https://huggingface.co/datasets/MiXaiLL76/CTW1500_OCR.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
📘 Overview
This dataset consists of augmented Azerbaijani text pairs (clean & masked) that contain personally identifiable information (PII). All content has been automatically generated using ChatGPT to simulate sensitive data scenarios for tasks like PII detection, anonymization, entity masking, and secure data handling.
🔍 Dataset Structure
Each example is a paired record:
original: The full augmented Azerbaijani text containing PII. masked: The same text with PII… See the full description on the dataset page: https://huggingface.co/datasets/aimtune/az_personal_info_aug_masked.
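A clean/masked pair is only usable for training if the masked side no longer leaks the raw PII. A minimal sanity check, sketched here with a hypothetical record and a hypothetical `[LABEL]` placeholder format (the dataset's actual placeholders may differ):

```python
def check_pair(original, masked, pii_values):
    """Sanity-check a clean/masked pair against a list of known PII values.

    True only if every PII value appears in the original text and none
    survives into the masked text.
    """
    appears = all(value in original for value in pii_values)
    leaked = any(value in masked for value in pii_values)
    return appears and not leaked

# Hypothetical record; field names follow the dataset's original/masked schema.
original = "Əli Məmmədov +994501234567 nömrəsindən zəng etdi."
masked = "[NAME] [PHONE] nömrəsindən zəng etdi."
print(check_pair(original, masked, ["Əli Məmmədov", "+994501234567"]))  # → True
```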
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PII NER Azerbaijani Dataset
Short, synthetic Azerbaijani dataset for PII-aware Named Entity Recognition (token classification). Useful for training and evaluating models that detect and localize personally identifiable information (PII) in Azerbaijani text. Note: All examples are synthetically generated with the library az-data-generator [https://github.com/LocalDoc-Azerbaijan/az-data-generator]. No real persons or contact details are included.
Dataset Summary
Each… See the full description on the dataset page: https://huggingface.co/datasets/LocalDoc/pii_ner_azerbaijani.