8 datasets found

pii-masking-200k
huggingface.co
Updated Apr 22, 2024
Cite
Ai4Privacy (2024). pii-masking-200k [Dataset]. http://doi.org/10.57967/hf/1532
Unique identifier
https://doi.org/10.57967/hf/1532
Dataset updated
Apr 22, 2024
Dataset authored and provided by
Ai4Privacy
Description
Ai4Privacy Community

Join our community at https://discord.gg/FmzWshaaQT to help build open datasets for privacy masking.

Purpose and Features

Previously the world's largest open dataset for privacy masking; that title now belongs to pii-masking-300k. The purpose of the dataset is to train models to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The example texts have 54 PII classes (types of sensitive data), targeting 229 discussion… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-200k.
pii-masking-300k
huggingface.co
Updated Apr 4, 2024
Cite
Ai4Privacy (2024). pii-masking-300k [Dataset]. http://doi.org/10.57967/hf/1995
Unique identifier
https://doi.org/10.57967/hf/1995
Dataset updated
Apr 4, 2024
Dataset authored and provided by
Ai4Privacy
License
https://choosealicense.com/licenses/other/
Description
Purpose and Features

🌍 World's largest open dataset for privacy masking 🌎 The dataset is useful to train and evaluate models to remove personally identifiable and sensitive information from text, especially in the context of AI assistants and LLMs. Key facts:

OpenPII-220k text entries have 27 PII classes (types of sensitive data), targeting 749 discussion subjects / use cases split across education, health, and psychology. FinPII contains an additional ~20 types tailored to… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-300k.
pii-masking-65k
huggingface.co
Updated Apr 5, 2024
+ more versions
Cite
Ai4Privacy (2024). pii-masking-65k [Dataset]. http://doi.org/10.57967/hf/2012
Unique identifier
https://doi.org/10.57967/hf/2012
Dataset updated
Apr 5, 2024
Dataset authored and provided by
Ai4Privacy
Description
Purpose and Features

The purpose of the model and dataset is to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The model is a fine-tuned version of "Distilled BERT", a smaller and faster version of BERT. It was adapted for the task of token classification based on what is, to our knowledge, the largest open-source PII masking dataset, which we are releasing simultaneously. The model has 62 million parameters. The original… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-65k.
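As a rough illustration of what token classification means for PII masking, the sketch below tags each whitespace-delimited token with a BIO label derived from character-offset spans. The function name `bio_tags`, the `(start, end, label)` span layout, the label names, and the sample sentence are all assumptions made for this example; they are not taken from the dataset's schema or from the model's tokenizer.

```python
import re


def bio_tags(text, spans):
    """Assign a BIO tag to every whitespace-delimited token in `text`.

    `spans` is a list of (start, end, label) character offsets. This layout
    is hypothetical and chosen only for the sketch.
    """
    tags = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            # A token belongs to a span when their character ranges overlap.
            if match.start() < end and match.end() > start:
                # The first token of a span gets B-, later tokens get I-.
                prefix = "B" if match.start() <= start else "I"
                tag = f"{prefix}-{label}"
                break
        tags.append((match.group(), tag))
    return tags


# Hypothetical sample sentence and spans.
text = "Contact Jane Doe at jane@example.com today"
spans = [(8, 16, "NAME"), (20, 36, "EMAIL")]

for token, tag in bio_tags(text, spans):
    print(f"{token}\t{tag}")
```

A real pipeline would normally use the model's own subword tokenizer rather than whitespace splitting, so the alignment step would differ.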
open-pii-masking-500k-ai4privacy
dataverse.harvard.edu
Updated Mar 17, 2025
Cite
Michael Anthony (2025). open-pii-masking-500k-ai4privacy [Dataset]. http://doi.org/10.7910/DVN/4H11OA
Unique identifier
https://doi.org/10.7910/DVN/4H11OA
Dataset updated
Mar 17, 2025
Dataset provided by
Harvard Dataverse
Authors
Michael Anthony
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
🌍 World's largest open dataset for privacy masking 🌎

The dataset is useful to train and evaluate models to remove personally identifiable and sensitive information from text, especially in the context of AI assistants and LLMs.

# Dataset Analytics 📊 - ai4privacy/open-pii-masking-500k-ai4privacy

## p5y Data Analytics

- Total Entries: 580,227
- Total Tokens: 19,199,982
- Average Source Text Length: 17.37 words
- Total PII Labels: 5,705,973
- Number of Unique PII Classes: 20 (Open PII Labelset)
- Unique Identity Values: 704,215

---

## Language Distribution Analytics

Number of Unique Languages: 8

| Language | Count | Percentage |
|--------------------|----------|------------|
| English (en) 🇺🇸🇬🇧🇨🇦🇮🇳 | 150,693 | 25.97% |
| French (fr) 🇫🇷🇨🇭🇨🇦 | 112,136 | 19.33% |
| German (de) 🇩🇪🇨🇭 | 82,384 | 14.20% |
| Spanish (es) 🇪🇸 🇲🇽 | 78,013 | 13.45% |
| Italian (it) 🇮🇹🇨🇭 | 68,824 | 11.86% |
| Dutch (nl) 🇳🇱 | 26,628 | 4.59% |
| Hindi (hi)* 🇮🇳 | 33,963 | 5.85% |
| Telugu (te)* 🇮🇳 | 27,586 | 4.75% |

*These languages are in experimental stages.

---

## Region Distribution Analytics

Number of Unique Regions: 11

| Region | Count | Percentage |
|-----------------------|----------|------------|
| Switzerland (CH) 🇨🇭 | 112,531 | 19.39% |
| India (IN) 🇮🇳 | 99,724 | 17.19% |
| Canada (CA) 🇨🇦 | 74,733 | 12.88% |
| Germany (DE) 🇩🇪 | 41,604 | 7.17% |
| Spain (ES) 🇪🇸 | 39,557 | 6.82% |
| Mexico (MX) 🇲🇽 | 38,456 | 6.63% |
| France (FR) 🇫🇷 | 37,886 | 6.53% |
| Great Britain (GB) 🇬🇧 | 37,092 | 6.39% |
| United States (US) 🇺🇸 | 37,008 | 6.38% |
| Italy (IT) 🇮🇹 | 35,008 | 6.03% |
| Netherlands (NL) 🇳🇱 | 26,628 | 4.59% |

---

## Machine Learning Task Analytics

| Split | Count | Percentage |
|-------------|----------|------------|
| Train | 464,150 | 79.99% |
| Validate | 116,077 | 20.01% |

---

# Usage

Option 1: Python

In a terminal:

    pip install datasets

Then in Python:

    from datasets import load_dataset
    dataset = load_dataset("ai4privacy/open-pii-masking-500k-ai4privacy")

# Compatible Machine Learning Tasks:

- Token classification. Check out Hugging Face's guide on token classification.
  - ALBERT, BERT, BigBird, BioGpt, BLOOM, BROS, CamemBERT, CANINE, ConvBERT, Data2VecText, DeBERTa, DeBERTa-v2, DistilBERT, ELECTRA, ERNIE, ErnieM, ESM, Falcon, FlauBERT, FNet, Funnel Transformer, GPT-Sw3, OpenAI GPT-2, GPTBigCode, GPT Neo, GPT NeoX, I-BERT, LayoutLM, LayoutLMv2, LayoutLMv3, LiLT, Longformer, LUKE, MarkupLM, MEGA, Megatron-BERT, MobileBERT,...
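The counts reported in the analytics for this dataset can be cross-checked with a few lines of arithmetic. The sketch below recomputes the percentage shares from the language and split counts quoted in the description and confirms that the language counts add up to the stated total of 580,227 entries. The helper name `shares` and the dictionary layout are illustrative choices, not part of the dataset or its tooling.

```python
# Counts copied from the dataset description.
TOTAL_ENTRIES = 580_227

language_counts = {
    "English (en)": 150_693,
    "French (fr)": 112_136,
    "German (de)": 82_384,
    "Spanish (es)": 78_013,
    "Italian (it)": 68_824,
    "Dutch (nl)": 26_628,
    "Hindi (hi)": 33_963,
    "Telugu (te)": 27_586,
}

split_counts = {"Train": 464_150, "Validate": 116_077}


def shares(counts):
    """Return each count as a percentage of the group total, rounded to 2 places."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


# Both groupings should partition the full dataset.
print(sum(language_counts.values()) == TOTAL_ENTRIES)
print(sum(split_counts.values()) == TOTAL_ENTRIES)

for name, pct in shares(language_counts).items():
    print(f"{name}: {pct:.2f}%")
for name, pct in shares(split_counts).items():
    print(f"{name}: {pct:.2f}%")
```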
ai4privacy-pii-masking-en-v1-ner-coarse
huggingface.co
Cite
Automated Analytics, ai4privacy-pii-masking-en-v1-ner-coarse [Dataset]. https://huggingface.co/datasets/automated-analytics/ai4privacy-pii-masking-en-v1-ner-coarse
Dataset authored and provided by
Automated Analytics
Description
automated-analytics/ai4privacy-pii-masking-en-v1-ner-coarse dataset hosted on Hugging Face and contributed by the HF Datasets community
ai4privacy-pii-masking-en-v1-ner
huggingface.co
+ more versions
Cite
Automated Analytics, ai4privacy-pii-masking-en-v1-ner [Dataset]. https://huggingface.co/datasets/automated-analytics/ai4privacy-pii-masking-en-v1-ner
Dataset authored and provided by
Automated Analytics
Description
automated-analytics/ai4privacy-pii-masking-en-v1-ner dataset hosted on Hugging Face and contributed by the HF Datasets community
ai4privacy-pii-coarse-grained
huggingface.co
+ more versions
Cite
Automated Analytics, ai4privacy-pii-coarse-grained [Dataset]. https://huggingface.co/datasets/automated-analytics/ai4privacy-pii-coarse-grained
Dataset authored and provided by
Automated Analytics
Description
automated-analytics/ai4privacy-pii-coarse-grained dataset hosted on Hugging Face and contributed by the HF Datasets community
ai4privacy-pii-coarse-grained-chatml
huggingface.co
Cite
Automated Analytics, ai4privacy-pii-coarse-grained-chatml [Dataset]. https://huggingface.co/datasets/automated-analytics/ai4privacy-pii-coarse-grained-chatml
Dataset authored and provided by
Automated Analytics
Description
automated-analytics/ai4privacy-pii-coarse-grained-chatml dataset hosted on Hugging Face and contributed by the HF Datasets community

pii-masking-200k (ai4privacy/pii-masking-200k) is cited by 13 scholarly articles, according to Google Scholar.
