8 datasets found
  1. pii-masking-200k

    • huggingface.co
    Updated Apr 22, 2024
    Cite
    Ai4Privacy (2024). pii-masking-200k [Dataset]. http://doi.org/10.57967/hf/1532
    Dataset updated
    Apr 22, 2024
    Dataset authored and provided by
    Ai4Privacy
    Description

    Ai4Privacy Community

    Join our community at https://discord.gg/FmzWshaaQT to help build open datasets for privacy masking.

    Purpose and Features

    Previously the world's largest open dataset for privacy; that title now belongs to pii-masking-300k. The purpose of the dataset is to train models to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The example texts have 54 PII classes (types of sensitive data), targeting 229 discussion… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-200k.

  2. pii-masking-300k

    • huggingface.co
    Updated Apr 4, 2024
    Cite
    Ai4Privacy (2024). pii-masking-300k [Dataset]. http://doi.org/10.57967/hf/1995
    Dataset updated
    Apr 4, 2024
    Dataset authored and provided by
    Ai4Privacy
    License

    https://choosealicense.com/licenses/other/

    Description

    Purpose and Features

    ๐ŸŒ World's largest open dataset for privacy masking ๐ŸŒŽ The dataset is useful to train and evaluate models to remove personally identifiable and sensitive information from text, especially in the context of AI assistants and LLMs. Key facts:

    OpenPII-220k text entries have 27 PII classes (types of sensitive data), targeting 749 discussion subjects / use cases split across education, health, and psychology. FinPII contains an additional ~20 types tailored to… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-300k.

  3. pii-masking-65k

    • huggingface.co
    Updated Apr 5, 2024
    + more versions
    Cite
    Ai4Privacy (2024). pii-masking-65k [Dataset]. http://doi.org/10.57967/hf/2012
    Dataset updated
    Apr 5, 2024
    Dataset authored and provided by
    Ai4Privacy
    Description

    Purpose and Features

    The purpose of the model and dataset is to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The model is a fine-tuned version of "Distilled BERT", a smaller and faster version of BERT. It was adapted for the task of token classification based on what is, to our knowledge, the largest open-source PII masking dataset, which we are releasing simultaneously. The model size is 62 million parameters. The original… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-65k.

  4. open-pii-masking-500k-ai4privacy

    • dataverse.harvard.edu
    Updated Mar 17, 2025
    Cite
    Michael Anthony (2025). open-pii-masking-500k-ai4privacy [Dataset]. http://doi.org/10.7910/DVN/4H11OA
    Dataset updated
    Mar 17, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    Michael Anthony
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    ๐ŸŒ World's largest open dataset for privacy masking ๐ŸŒŽ The dataset is useful to train and evaluate models to remove personally identifiable and sensitive information from text, especially in the context of AI assistants and LLMs. Task Showcase of Privacy Masking # Dataset Analytics ๐Ÿ“Š - ai4privacy/open-pii-masking-500k-ai4privacy ## p5y Data Analytics - Total Entries: 580,227 - Total Tokens: 19,199,982 - Average Source Text Length: 17.37 words - Total PII Labels: 5,705,973 - Number of Unique PII Classes: 20 (Open PII Labelset) - Unique Identity Values: 704,215 --- ## Language Distribution Analytics Number of Unique Languages: 8 | Language | Count | Percentage | |--------------------|----------|------------| | English (en) ๐Ÿ‡บ๐Ÿ‡ธ๐Ÿ‡ฌ๐Ÿ‡ง๐Ÿ‡จ๐Ÿ‡ฆ๐Ÿ‡ฎ๐Ÿ‡ณ | 150,693 | 25.97% | | French (fr) ๐Ÿ‡ซ๐Ÿ‡ท๐Ÿ‡จ๐Ÿ‡ญ๐Ÿ‡จ๐Ÿ‡ฆ | 112,136 | 19.33% | | German (de) ๐Ÿ‡ฉ๐Ÿ‡ช๐Ÿ‡จ๐Ÿ‡ญ | 82,384 | 14.20% | | Spanish (es) ๐Ÿ‡ช๐Ÿ‡ธ ๐Ÿ‡ฒ๐Ÿ‡ฝ | 78,013 | 13.45% | | Italian (it) ๐Ÿ‡ฎ๐Ÿ‡น๐Ÿ‡จ๐Ÿ‡ญ | 68,824 | 11.86% | | Dutch (nl) ๐Ÿ‡ณ๐Ÿ‡ฑ | 26,628 | 4.59% | | Hindi (hi)* ๐Ÿ‡ฎ๐Ÿ‡ณ | 33,963 | 5.85% | | Telugu (te)* ๐Ÿ‡ฎ๐Ÿ‡ณ | 27,586 | 4.75% | *these languages are in experimental stages --- ## Region Distribution Analytics Number of Unique Regions: 11 | Region | Count | Percentage | |-----------------------|----------|------------| | Switzerland (CH) ๐Ÿ‡จ๐Ÿ‡ญ | 112,531 | 19.39% | | India (IN) ๐Ÿ‡ฎ๐Ÿ‡ณ | 99,724 | 17.19% | | Canada (CA) ๐Ÿ‡จ๐Ÿ‡ฆ | 74,733 | 12.88% | | Germany (DE) ๐Ÿ‡ฉ๐Ÿ‡ช | 41,604 | 7.17% | | Spain (ES) ๐Ÿ‡ช๐Ÿ‡ธ | 39,557 | 6.82% | | Mexico (MX) ๐Ÿ‡ฒ๐Ÿ‡ฝ | 38,456 | 6.63% | | France (FR) ๐Ÿ‡ซ๐Ÿ‡ท | 37,886 | 6.53% | | Great Britain (GB) ๐Ÿ‡ฌ๐Ÿ‡ง | 37,092 | 6.39% | | United States (US) ๐Ÿ‡บ๐Ÿ‡ธ | 37,008 | 6.38% | | Italy (IT) ๐Ÿ‡ฎ๐Ÿ‡น | 35,008 | 6.03% | | Netherlands (NL) ๐Ÿ‡ณ๐Ÿ‡ฑ | 26,628 | 4.59% | --- ## Machine Learning Task Analytics | Split | Count | Percentage | |-------------|----------|------------| | Train | 464,150 | 79.99% | | Validate| 116,077 | 20.01% | --- # Usage 
Option 1: Python terminal pip install datasets python from datasets import load_dataset dataset = load_dataset("ai4privacy/open-pii-masking-500k-ai4privacy") # Compatible Machine Learning Tasks: - Tokenclassification. Check out a HuggingFace's guide on token classification. - ALBERT, BERT, BigBird, BioGpt, BLOOM, BROS, CamemBERT, CANINE, ConvBERT, Data2VecText, DeBERTa, DeBERTa-v2, DistilBERT, ELECTRA, ERNIE, ErnieM, ESM, Falcon, FlauBERT, FNet, Funnel Transformer, GPT-Sw3, OpenAI GPT-2, GPTBigCode, GPT Neo, GPT NeoX, I-BERT, LayoutLM, LayoutLMv2, LayoutLMv3, LiLT, Longformer, LUKE, MarkupLM, MEGA, Megatron-BERT, MobileBERT,...

  5. ai4privacy-pii-masking-en-v1-ner-coarse

    • huggingface.co
    Cite
    Automated Analytics, ai4privacy-pii-masking-en-v1-ner-coarse [Dataset]. https://huggingface.co/datasets/automated-analytics/ai4privacy-pii-masking-en-v1-ner-coarse
    Dataset authored and provided by
    Automated Analytics
    Description

    automated-analytics/ai4privacy-pii-masking-en-v1-ner-coarse dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. ai4privacy-pii-masking-en-v1-ner

    • huggingface.co
    + more versions
    Cite
    Automated Analytics, ai4privacy-pii-masking-en-v1-ner [Dataset]. https://huggingface.co/datasets/automated-analytics/ai4privacy-pii-masking-en-v1-ner
    Dataset authored and provided by
    Automated Analytics
    Description

    automated-analytics/ai4privacy-pii-masking-en-v1-ner dataset hosted on Hugging Face and contributed by the HF Datasets community

  7. ai4privacy-pii-coarse-grained

    • huggingface.co
    + more versions
    Cite
    Automated Analytics, ai4privacy-pii-coarse-grained [Dataset]. https://huggingface.co/datasets/automated-analytics/ai4privacy-pii-coarse-grained
    Dataset authored and provided by
    Automated Analytics
    Description

    automated-analytics/ai4privacy-pii-coarse-grained dataset hosted on Hugging Face and contributed by the HF Datasets community

  8. ai4privacy-pii-coarse-grained-chatml

    • huggingface.co
    Cite
    Automated Analytics, ai4privacy-pii-coarse-grained-chatml [Dataset]. https://huggingface.co/datasets/automated-analytics/ai4privacy-pii-coarse-grained-chatml
    Dataset authored and provided by
    Automated Analytics
    Description

    automated-analytics/ai4privacy-pii-coarse-grained-chatml dataset hosted on Hugging Face and contributed by the HF Datasets community
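
The percentage column in the language table quoted under result 4 (open-pii-masking-500k-ai4privacy) follows directly from the counts. A minimal sketch that recomputes it, assuming only the counts quoted in that description; the helper name `percentages` is hypothetical and not part of any dataset tooling:

```python
# Language counts copied from the description quoted under result 4.
LANGUAGE_COUNTS = {
    "en": 150_693,
    "fr": 112_136,
    "de": 82_384,
    "es": 78_013,
    "it": 68_824,
    "nl": 26_628,
    "hi": 33_963,
    "te": 27_586,
}


def percentages(counts):
    """Return each count as a percentage of the total, rounded to 2 decimals."""
    total = sum(counts.values())
    return {key: round(100 * value / total, 2) for key, value in counts.items()}


# The counts should add up to the "Total Entries" figure in the description.
total_entries = sum(LANGUAGE_COUNTS.values())
language_share = percentages(LANGUAGE_COUNTS)
```

The same helper can be applied to the region and split counts to check their percentage columns.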


Cite
Ai4Privacy (2024). pii-masking-200k [Dataset]. http://doi.org/10.57967/hf/1532

pii-masking-200k

Ai4Privacy PII200k Dataset

ai4privacy/pii-masking-200k

13 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Apr 22, 2024
Dataset authored and provided by
Ai4Privacy
Description

Ai4Privacy Community

Join our community at https://discord.gg/FmzWshaaQT to help build open datasets for privacy masking.

Purpose and Features

Previously the world's largest open dataset for privacy; that title now belongs to pii-masking-300k. The purpose of the dataset is to train models to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The example texts have 54 PII classes (types of sensitive data), targeting 229 discussion… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-200k.
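
To make the masking task concrete, here is a minimal sketch of replacing annotated PII spans with class placeholders. The span format (`start`, `end`, `label`), the function name `mask_text`, the label names, and the sample sentence are all assumptions for illustration, not the dataset's documented schema; the dataset page lists the actual fields.

```python
def mask_text(text, spans):
    """Replace each annotated character span with a "[LABEL]" placeholder.

    `spans` is a hypothetical list of dicts with "start", "end" (exclusive)
    and "label" keys. Spans are assumed not to overlap.
    """
    pieces = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s["start"]):
        pieces.append(text[cursor:span["start"]])  # text before the span
        pieces.append(f"[{span['label']}]")        # placeholder for the PII
        cursor = span["end"]
    pieces.append(text[cursor:])                   # trailing text after the last span
    return "".join(pieces)


# Made-up example input; not taken from the dataset.
sample = "Contact Jane Doe at jane@example.com."
sample_spans = [
    {"start": 8, "end": 16, "label": "NAME"},
    {"start": 20, "end": 36, "label": "EMAIL"},
]
masked = mask_text(sample, sample_spans)
```

A token-classification model trained on such data would predict the spans; the replacement step itself stays this simple.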
