Ai4Privacy Community
Join our community at https://discord.gg/FmzWshaaQT to help build open datasets for privacy masking.
Purpose and Features
Previously the world's largest open dataset for privacy; it has since been superseded by pii-masking-300k. The purpose of the dataset is to train models to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The example texts have 54 PII classes (types of sensitive data), targeting 229 discussion… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-200k.
https://choosealicense.com/licenses/other/
Purpose and Features
World's largest open dataset for privacy masking. The dataset is useful for training and evaluating models that remove personally identifiable and sensitive information from text, especially in the context of AI assistants and LLMs. Key facts:
OpenPII-220k text entries have 27 PII classes (types of sensitive data), targeting 749 discussion subjects / use cases split across education, health, and psychology. FinPII contains an additional ~20 types tailored to… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-300k.
Purpose and Features
The purpose of the model and dataset is to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The model is a fine-tuned version of "Distilled BERT", a smaller and faster version of BERT. It was adapted for token classification using what is, to our knowledge, the largest open-source PII masking dataset, which we are releasing simultaneously. The model has 62 million parameters. The original… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-65k.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
In a terminal:
    pip install datasets
In Python:
    from datasets import load_dataset
    dataset = load_dataset("ai4privacy/open-pii-masking-500k-ai4privacy")
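To make the masking task concrete, here is a minimal sketch of how annotated PII spans could be turned into masked text. The span format (start offset, end offset, label) and the `mask_pii` helper are assumptions made for illustration; they are not taken from the dataset card, and the actual column names and label set should be checked on the dataset page.

```python
from typing import List, Tuple

# Hypothetical span format: (start, end, label), using character offsets
# with an exclusive end. The real dataset schema may differ.
Span = Tuple[int, int, str]


def mask_pii(text: str, spans: List[Span]) -> str:
    """Replace each annotated character span with a [LABEL] placeholder."""
    masked = text
    # Work from right to left so earlier offsets remain valid after each edit.
    for start, end, label in sorted(spans, key=lambda s: s[0], reverse=True):
        masked = masked[:start] + f"[{label}]" + masked[end:]
    return masked


example = "Contact Jane Doe at jane@example.com."
spans = [(8, 16, "NAME"), (20, 36, "EMAIL")]
print(mask_pii(example, spans))  # Contact [NAME] at [EMAIL].
```

Processing spans from the end of the string backwards avoids recomputing offsets after each replacement; the sketch assumes the spans do not overlap.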
Compatible Machine Learning Tasks:
- Token classification. Check out HuggingFace's guide on token classification.
- ALBERT, BERT, BigBird, BioGpt, BLOOM, BROS, CamemBERT, CANINE, ConvBERT, Data2VecText, DeBERTa, DeBERTa-v2, DistilBERT, ELECTRA, ERNIE, ErnieM, ESM, Falcon, FlauBERT, FNet, Funnel Transformer, GPT-Sw3, OpenAI GPT-2, GPTBigCode, GPT Neo, GPT NeoX, I-BERT, LayoutLM, LayoutLMv2, LayoutLMv3, LiLT, Longformer, LUKE, MarkupLM, MEGA, Megatron-BERT, MobileBERT,...
automated-analytics/ai4privacy-pii-masking-en-v1-ner-coarse dataset hosted on Hugging Face and contributed by the HF Datasets community
automated-analytics/ai4privacy-pii-masking-en-v1-ner dataset hosted on Hugging Face and contributed by the HF Datasets community
automated-analytics/ai4privacy-pii-coarse-grained dataset hosted on Hugging Face and contributed by the HF Datasets community
automated-analytics/ai4privacy-pii-coarse-grained-chatml dataset hosted on Hugging Face and contributed by the HF Datasets community