8 datasets found

pii-masking-200k
huggingface.co
Updated Apr 22, 2024
Cite
Ai4Privacy (2024). pii-masking-200k [Dataset]. http://doi.org/10.57967/hf/1532
Unique identifier
https://doi.org/10.57967/hf/1532
Dataset updated
Apr 22, 2024
Dataset authored and provided by
Ai4Privacy
Description
Ai4Privacy Community

Join our community at https://discord.gg/FmzWshaaQT to help build open datasets for privacy masking.

Purpose and Features

Previously the world's largest open dataset for privacy; that title now belongs to pii-masking-300k. The purpose of the dataset is to train models to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The example texts have 54 PII classes (types of sensitive data), targeting 229 discussion… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-200k.
pii-masking-65k
huggingface.co
Updated Apr 5, 2024
+ more versions
Cite
Ai4Privacy (2024). pii-masking-65k [Dataset]. http://doi.org/10.57967/hf/2012
Unique identifier
https://doi.org/10.57967/hf/2012
Dataset updated
Apr 5, 2024
Dataset authored and provided by
Ai4Privacy
Description
Purpose and Features

The purpose of the model and dataset is to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The model is a fine-tuned version of "Distilled BERT", a smaller and faster version of BERT. It was adapted for token classification using what is, to our knowledge, the largest open-source PII masking dataset, which we are releasing simultaneously. The model has 62 million parameters. The original… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-65k.
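
To make the masking task concrete, here is a minimal sketch of what "removing PII from text" can look like once a token-classification model has produced labelled spans. It is not the Ai4Privacy model or its output format: the function name `mask_pii`, the span tuple layout, and the labels `NAME` and `EMAIL` are all assumptions chosen for illustration.

```python
from typing import List, Tuple

# Hypothetical span format: (start, end, label), with `end` exclusive.
Span = Tuple[int, int, str]


def mask_pii(text: str, spans: List[Span]) -> str:
    """Replace each labelled character span with a [LABEL] placeholder."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            raise ValueError("spans must not overlap")
        pieces.append(text[cursor:start])  # untouched text before the span
        pieces.append(f"[{label}]")        # placeholder instead of the PII
        cursor = end
    pieces.append(text[cursor:])           # untouched tail after the last span
    return "".join(pieces)


text = "Hi, I am Jane Doe, email jane@example.com."

# In practice a model would predict these spans; here they are located by hand.
name_start = text.index("Jane Doe")
email_start = text.index("jane@example.com")
spans = [
    (name_start, name_start + len("Jane Doe"), "NAME"),
    (email_start, email_start + len("jane@example.com"), "EMAIL"),
]

masked = mask_pii(text, spans)
print(masked)
```

The spans are sorted and checked for overlap so that replacements cannot corrupt each other's offsets; a real pipeline would first need to merge token-level predictions into character spans.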
ai4privacy-pii-masking-en-v1
huggingface.co
+ more versions
Cite
Automated Analytics, ai4privacy-pii-masking-en-v1 [Dataset]. https://huggingface.co/datasets/automated-analytics/ai4privacy-pii-masking-en-v1
Dataset authored and provided by
Automated Analytics
Description
automated-analytics/ai4privacy-pii-masking-en-v1 dataset hosted on Hugging Face and contributed by the HF Datasets community
pii-masking-200k
huggingface.co
Updated Feb 4, 2024
+ more versions
Cite
Isotonic (2024). pii-masking-200k [Dataset]. https://huggingface.co/datasets/Isotonic/pii-masking-200k
Dataset updated
Feb 4, 2024
Authors
Isotonic
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Purpose and Features

World's largest open-source privacy dataset. The purpose of the dataset is to train models to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The example texts have 54 PII classes (types of sensitive data), targeting 229 discussion subjects / use cases split across business, education, psychology and legal fields, and 5 interaction styles (e.g. casual conversation, formal document, emails… See the full description on the dataset page: https://huggingface.co/datasets/Isotonic/pii-masking-200k.
ai4privacy-pii-masking-en-v1-ner-coarse
huggingface.co
Cite
Automated Analytics, ai4privacy-pii-masking-en-v1-ner-coarse [Dataset]. https://huggingface.co/datasets/automated-analytics/ai4privacy-pii-masking-en-v1-ner-coarse
Dataset authored and provided by
Automated Analytics
Description
automated-analytics/ai4privacy-pii-masking-en-v1-ner-coarse dataset hosted on Hugging Face and contributed by the HF Datasets community
ai4privacy-pii-coarse-grained
huggingface.co
+ more versions
Cite
Automated Analytics, ai4privacy-pii-coarse-grained [Dataset]. https://huggingface.co/datasets/automated-analytics/ai4privacy-pii-coarse-grained
Dataset authored and provided by
Automated Analytics
Description
automated-analytics/ai4privacy-pii-coarse-grained dataset hosted on Hugging Face and contributed by the HF Datasets community
ai4privacy-pii-coarse-grained-chatml
huggingface.co
Updated Jul 26, 2025
Cite
Automated Analytics (2025). ai4privacy-pii-coarse-grained-chatml [Dataset]. https://huggingface.co/datasets/automated-analytics/ai4privacy-pii-coarse-grained-chatml
Dataset updated
Jul 26, 2025
Dataset authored and provided by
Automated Analytics
Description
automated-analytics/ai4privacy-pii-coarse-grained-chatml dataset hosted on Hugging Face and contributed by the HF Datasets community
pii-masking-english-5k
huggingface.co
Updated Aug 22, 2007
+ more versions
Cite
Aniket Kulkarni (2007). pii-masking-english-5k [Dataset]. https://huggingface.co/datasets/aniket-curlscape/pii-masking-english-5k
Dataset updated
Aug 22, 2007
Authors
Aniket Kulkarni
License
https://choosealicense.com/licenses/other/
Description
Important

This repository contains the English-only subset of the Ai4Privacy PII-Masking-300k Dataset. The dataset is curated to provide English texts only, while retaining the structure, labeling schema, and licensing of the original dataset.

Licensing

Academic use is encouraged, with proper citation, provided it follows similar license terms. Commercial entities should contact us at licensing@ai4privacy.com for licensing inquiries and additional data access.

Terms… See the full description on the dataset page: https://huggingface.co/datasets/aniket-curlscape/pii-masking-english-5k.
14 scholarly articles cite the Ai4Privacy pii-masking-200k dataset (https://doi.org/10.57967/hf/1532).