8 datasets found
  1. pii-masking-200k

    • huggingface.co
    Updated Apr 22, 2024
    Cite
    Ai4Privacy (2024). pii-masking-200k [Dataset]. http://doi.org/10.57967/hf/1532
    Dataset updated
    Apr 22, 2024
    Dataset authored and provided by
    Ai4Privacy
    Description

    Ai4Privacy Community

    Join our community at https://discord.gg/FmzWshaaQT to help build open datasets for privacy masking.

    Purpose and Features

    Previously the world's largest open dataset for privacy; the largest is now pii-masking-300k. The purpose of the dataset is to train models to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The example texts have 54 PII classes (types of sensitive data), targeting 229 discussion… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-200k.

  2. pii-masking-65k

    • huggingface.co
    Updated Apr 5, 2024
    + more versions
    Cite
    Ai4Privacy (2024). pii-masking-65k [Dataset]. http://doi.org/10.57967/hf/2012
    Dataset updated
    Apr 5, 2024
    Dataset authored and provided by
    Ai4Privacy
    Description

    Purpose and Features

    The purpose of the model and dataset is to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The model is a fine-tuned version of "Distilled BERT", a smaller and faster version of BERT. It was adapted for the task of token classification based on what is, to our knowledge, the largest open-source PII masking dataset, which we are releasing simultaneously. The model size is 62 million parameters. The original… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-65k.

  3. ai4privacy-pii-masking-en-v1

    • huggingface.co
    + more versions
    Cite
    Automated Analytics, ai4privacy-pii-masking-en-v1 [Dataset]. https://huggingface.co/datasets/automated-analytics/ai4privacy-pii-masking-en-v1
    Dataset authored and provided by
    Automated Analytics
    Description

    automated-analytics/ai4privacy-pii-masking-en-v1 dataset hosted on Hugging Face and contributed by the HF Datasets community

  4. pii-masking-200k

    • huggingface.co
    Updated Feb 4, 2024
    + more versions
    Cite
    Isotonic (2024). pii-masking-200k [Dataset]. https://huggingface.co/datasets/Isotonic/pii-masking-200k
    Dataset updated
    Feb 4, 2024
    Authors
    Isotonic
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Purpose and Features

    World's largest open source privacy dataset. The purpose of the dataset is to train models to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The example texts have 54 PII classes (types of sensitive data), targeting 229 discussion subjects / use cases split across business, education, psychology and legal fields, and 5 interaction styles (e.g. casual conversation, formal document, emails… See the full description on the dataset page: https://huggingface.co/datasets/Isotonic/pii-masking-200k.

  5. ai4privacy-pii-masking-en-v1-ner-coarse

    • huggingface.co
    Cite
    Automated Analytics, ai4privacy-pii-masking-en-v1-ner-coarse [Dataset]. https://huggingface.co/datasets/automated-analytics/ai4privacy-pii-masking-en-v1-ner-coarse
    Dataset authored and provided by
    Automated Analytics
    Description

    automated-analytics/ai4privacy-pii-masking-en-v1-ner-coarse dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. ai4privacy-pii-coarse-grained

    • huggingface.co
    + more versions
    Cite
    Automated Analytics, ai4privacy-pii-coarse-grained [Dataset]. https://huggingface.co/datasets/automated-analytics/ai4privacy-pii-coarse-grained
    Dataset authored and provided by
    Automated Analytics
    Description

    automated-analytics/ai4privacy-pii-coarse-grained dataset hosted on Hugging Face and contributed by the HF Datasets community

  7. ai4privacy-pii-coarse-grained-chatml

    • huggingface.co
    Updated Jul 26, 2025
    Cite
    Automated Analytics (2025). ai4privacy-pii-coarse-grained-chatml [Dataset]. https://huggingface.co/datasets/automated-analytics/ai4privacy-pii-coarse-grained-chatml
    Dataset updated
    Jul 26, 2025
    Dataset authored and provided by
    Automated Analytics
    Description

    automated-analytics/ai4privacy-pii-coarse-grained-chatml dataset hosted on Hugging Face and contributed by the HF Datasets community

  8. pii-masking-english-5k

    • huggingface.co
    Updated Aug 22, 2007
    + more versions
    Cite
    Aniket Kulkarni (2007). pii-masking-english-5k [Dataset]. https://huggingface.co/datasets/aniket-curlscape/pii-masking-english-5k
    Dataset updated
    Aug 22, 2007
    Authors
    Aniket Kulkarni
    License

    https://choosealicense.com/licenses/other/

    Description

    Important

    This repository contains the English-only subset of the Ai4Privacy PII-Masking-300k Dataset. The dataset is curated to provide English texts only, while retaining the structure, labeling schema, and licensing of the original dataset.

    Licensing

    Academic use is encouraged with proper citation, provided it follows similar license terms. Commercial entities should contact us at licensing@ai4privacy.com for licensing inquiries and additional data access.

    Terms… See the full description on the dataset page: https://huggingface.co/datasets/aniket-curlscape/pii-masking-english-5k.


pii-masking-200k

Ai4Privacy PII200k Dataset

ai4privacy/pii-masking-200k

14 scholarly articles cite this dataset (View in Google Scholar)
