21 datasets found
  1. pii-masking-200k

    • huggingface.co
    Updated Apr 22, 2024
    Cite
    Ai4Privacy (2024). pii-masking-200k [Dataset]. http://doi.org/10.57967/hf/1532
    Dataset updated
    Apr 22, 2024
    Dataset authored and provided by
    Ai4Privacy
    Description

    Ai4Privacy Community

    Join our community at https://discord.gg/FmzWshaaQT to help build open datasets for privacy masking.

      Purpose and Features
    

    Previously the world's largest open dataset for privacy; it has since been superseded by pii-masking-300k. The purpose of the dataset is to train models to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The example texts have 54 PII classes (types of sensitive data), targeting 229 discussion… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-200k.

  2. context-aware-pii-detection-v3

    • huggingface.co
    Updated Oct 11, 2025
    + more versions
    Cite
    Mariia Ponomarenko (2025). context-aware-pii-detection-v3 [Dataset]. https://huggingface.co/datasets/ponoma16/context-aware-pii-detection-v3
    Dataset updated
    Oct 11, 2025
    Authors
    Mariia Ponomarenko
    Description

    ponoma16/context-aware-pii-detection-v3 dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. PII-NER

    • huggingface.co
    Updated Jul 20, 2024
    Cite
    Joseph G Flowers (2024). PII-NER [Dataset]. https://huggingface.co/datasets/Josephgflowers/PII-NER
    Dataset updated
    Jul 20, 2024
    Authors
    Joseph G Flowers
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset Card for NER PII Extraction Dataset

    Dataset Summary

    This dataset is designed for training and evaluating Named Entity Recognition (NER) models focused on extracting Personally Identifiable Information (PII) from text. It includes a variety of entities such as names, addresses, phone numbers, email addresses, and identification numbers. The dataset is suitable for tasks that involve PII detection, compliance checks, and data anonymization.

    Supported Tasks and Leaderboards

    Named Entity… See the full description on the dataset page: https://huggingface.co/datasets/Josephgflowers/PII-NER.

  4. AI4privacy-PII

    • kaggle.com
    zip
    Updated Jan 23, 2024
    Cite
    Wilmer E. Henao (2024). AI4privacy-PII [Dataset]. https://www.kaggle.com/datasets/verracodeguacas/ai4privacy-pii
    Available download formats: zip (93130230 bytes)
    Dataset updated
    Jan 23, 2024
    Authors
    Wilmer E. Henao
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Developed by AI4Privacy, this dataset represents a pioneering effort in the realm of privacy and AI. As an expansive resource hosted on Hugging Face at ai4privacy/pii-masking-200k, it serves a crucial role in addressing the growing concerns around personal data security in AI applications.

    Sources: The dataset is crafted using proprietary algorithms, ensuring the creation of synthetic data that avoids privacy violations. Its multilingual composition, including English, French, German, and Italian texts, reflects a diverse source base. The data is meticulously curated with human-in-the-loop validation, ensuring both relevance and quality.

    Context: In an era where data privacy is paramount, this dataset is tailored to train AI models to identify and mask personally identifiable information (PII). It covers 54 PII classes and extends across 229 use cases in various domains like business, education, psychology, and legal fields, emphasizing its contextual richness and applicability.

    Inspiration: The dataset draws inspiration from the need for enhanced privacy measures in AI interactions, particularly in LLMs and AI assistants. The creators, AI4Privacy, are dedicated to building tools that act as a 'global seatbelt' for AI, protecting individuals' personal data. This dataset is a testament to their commitment to advancing AI technology responsibly and ethically.

    This comprehensive dataset is not just a tool but a step towards a future where AI and privacy coexist harmoniously, offering immense value to researchers, developers, and privacy advocates alike.

  5. pii-url-detection

    • huggingface.co
    Updated Oct 9, 2025
    Cite
    yizhuohuang (2025). pii-url-detection [Dataset]. https://huggingface.co/datasets/huangyizhuo/pii-url-detection
    Dataset updated
    Oct 9, 2025
    Authors
    yizhuohuang
    Description

    huangyizhuo/pii-url-detection dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. gretel-pii-masking-en-v1

    • huggingface.co
    Cite
    Gretel.ai, gretel-pii-masking-en-v1 [Dataset]. https://huggingface.co/datasets/gretelai/gretel-pii-masking-en-v1
    Dataset provided by
    Gretel.ai
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Gretel Synthetic Domain-Specific Documents Dataset (English)

    This dataset is a synthetically generated collection of documents enriched with Personally Identifiable Information (PII) and Protected Health Information (PHI) entities spanning multiple domains. Created using Gretel Navigator with mistral-nemo-2407 as the backend model, it is specifically designed for fine-tuning Gliner models. The dataset contains document passages featuring PII/PHI entities from a wide range of… See the full description on the dataset page: https://huggingface.co/datasets/gretelai/gretel-pii-masking-en-v1.

  7. pii-masking-43k

    • huggingface.co
    Updated Jul 1, 2023
    + more versions
    Cite
    Ai4Privacy (2023). pii-masking-43k [Dataset]. http://doi.org/10.57967/hf/0824
    Dataset updated
    Jul 1, 2023
    Dataset authored and provided by
    Ai4Privacy
    Description

    Purpose and Features

    The purpose of the model and dataset is to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The model is a fine-tuned version of "Distilled BERT", a smaller and faster version of BERT. It was adapted for the task of token classification based on the largest to our knowledge open-source PII masking dataset, which we are releasing simultaneously. The model size is 62 million parameters. The original… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-43k.

  8. TF2.16 requirements.txt

    • kaggle.com
    zip
    Updated Apr 10, 2024
    Cite
    Mark Susol (2024). TF2.16 requirements.txt [Dataset]. https://www.kaggle.com/datasets/gdataranger/tf2-16-requirements-txt
    Available download formats: zip (288 bytes)
    Dataset updated
    Apr 10, 2024
    Authors
    Mark Susol
    License

    https://cdla.io/sharing-1-0/

    Description

    tf2-16-requirements-txt

    Issues to resolve:

    1. https://www.kaggle.com/competitions/pii-detection-removal-from-educational-data/discussion/487474
    2. https://github.com/huggingface/datasets/issues/6753

    Downgrade datasets==2.16.0

    In order to downgrade, the following 3 must be changed together:

    datasets==2.16.0
    fsspec==2023.10.0
    gcsfs==2023.10.0
    

    Import the "dataset" that is just a requirements.txt file. Now install the requirements.

    !pip install -r /kaggle/input/tf2-16-requirements-txt/requirements.txt
    
    !pip list | grep -E 'datasets|transformers|tensorflow|keras|ml-dtypes|numpy|fsspec|gcsfs'
    
    datasets                      2.16.0
    fsspec                        2023.10.0
    gcsfs                         2023.10.0
    keras                         3.0.5
    keras-cv                      0.8.2
    keras-nlp                     0.8.2
    keras-tuner                   1.4.6
    ml-dtypes                     0.3.2
    msgpack-numpy                 0.4.8
    numpy                         1.26.4
    tensorflow                    2.16.1
    tensorflow-cloud              0.1.16
    tensorflow-datasets           4.9.4
    tensorflow-decision-forests   1.8.1
    tensorflow-estimator          2.15.0
    tensorflow-hub                0.16.1
    tensorflow-io                 0.35.0
    tensorflow-io-gcs-filesystem  0.35.0
    tensorflow-metadata           0.14.0
    tensorflow-probability        0.23.0
    tensorflow-serving-api        2.14.1
    tensorflow-text               2.16.1
    tensorflow-transform          0.14.0
    tf_keras                      2.16.0
    transformers                  4.38.2
    

    Then use "Restart & clear cell outputs" so that these changes take effect in the kernel.

    NOTE: DO NOT FACTORY RESET!

    import datasets
    import tensorflow as tf
    
    assert datasets.__version__ == '2.16.0'
    assert tf.__version__ == '2.16.1'
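
    The three pins listed above can also be checked without importing the heavy packages themselves. The following is a minimal sketch, not part of the original notebook: `PINS` copies the three versions given above, while `find_mismatches` and its `get_version` parameter are hypothetical names introduced here for illustration. It uses only the standard-library `importlib.metadata` module.

```python
from importlib import metadata

# The three pins that the description says must be changed together.
PINS = {
    "datasets": "2.16.0",
    "fsspec": "2023.10.0",
    "gcsfs": "2023.10.0",
}


def find_mismatches(pins, get_version=metadata.version):
    """Return {package: (expected, installed)} for every pin that is not met.

    `installed` is None when the package is not installed at all.
    `get_version` is injectable so the check can be exercised offline.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    problems = find_mismatches(PINS)
    for name, (expected, installed) in problems.items():
        print(f"{name}: expected {expected}, found {installed}")
    if not problems:
        print("All pinned versions match.")
```

    Run after the kernel restart, an empty result suggests the downgrade took effect; any entry reported means the corresponding pin was not applied.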
    
  9. PII-Dataset

    • huggingface.co
    Updated Apr 13, 2024
    Cite
    Prasann (2024). PII-Dataset [Dataset]. https://huggingface.co/datasets/Prasann15479/PII-Dataset
    Dataset updated
    Apr 13, 2024
    Authors
    Prasann
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This dataset was created with the Gemini API using the Kaggle notebook: https://www.kaggle.com/code/newtonbaba12345/pii-detection-data-generation-using-gemini

  10. NinjaMasker-PII-Redaction-Dataset

    • huggingface.co
    Updated Dec 7, 2023
    + more versions
    Cite
    Harry Roy McLaughlin (2023). NinjaMasker-PII-Redaction-Dataset [Dataset]. https://huggingface.co/datasets/King-Harry/NinjaMasker-PII-Redaction-Dataset
    Dataset updated
    Dec 7, 2023
    Authors
    Harry Roy McLaughlin
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    King-Harry/NinjaMasker-PII-Redaction-Dataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  11. synthetic_pii_finance_multilingual

    • huggingface.co
    Updated Jun 11, 2024
    Cite
    Gretel.ai (2024). synthetic_pii_finance_multilingual [Dataset]. https://huggingface.co/datasets/gretelai/synthetic_pii_finance_multilingual
    Dataset updated
    Jun 11, 2024
    Dataset provided by
    Gretel.ai
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

      💼 📊 Synthetic Financial Domain Documents with PII Labels
    

    gretelai/synthetic_pii_finance_multilingual is a dataset of full length synthetic financial documents containing Personally Identifiable Information (PII), generated using Gretel Navigator and released under Apache 2.0. This dataset is designed to assist with the following use cases:

    🏷️ Training NER (Named Entity Recognition) models to detect and label PII in… See the full description on the dataset page: https://huggingface.co/datasets/gretelai/synthetic_pii_finance_multilingual.

  12. deberta-pii-synth

    • huggingface.co
    Updated Nov 22, 2025
    Cite
    Tursunai Turumbekova (2025). deberta-pii-synth [Dataset]. https://huggingface.co/datasets/tursunait/deberta-pii-synth
    Dataset updated
    Nov 22, 2025
    Authors
    Tursunai Turumbekova
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Synthetic PII Detection Dataset (DeBERTa-PII-Synth)

    A large-scale synthetic dataset for training token classification models to detect Personally Identifiable Information (PII).

      Dataset Summary
    

    DeBERTa-PII-Synth is a fully synthetic dataset created to train token-classification models for detecting Personally Identifiable Information (PII). It contains 120k+ samples, each including:

    Natural-language text containing synthetic PII
    Gold span annotations (PERSON, EMAIL, DATE… See the full description on the dataset page: https://huggingface.co/datasets/tursunait/deberta-pii-synth.

  13. pile-pii

    • huggingface.co
    Updated Jun 30, 2022
    Cite
    Rasika Bhalerao (2022). pile-pii [Dataset]. https://huggingface.co/datasets/rasikabh/pile-pii
    Dataset updated
    Jun 30, 2022
    Authors
    Rasika Bhalerao
    Description

    rasikabh/pile-pii dataset hosted on Hugging Face and contributed by the HF Datasets community

  14. Nemotron-PII

    • huggingface.co
    Updated Oct 28, 2025
    Cite
    NVIDIA (2025). Nemotron-PII [Dataset]. https://huggingface.co/datasets/nvidia/Nemotron-PII
    Dataset updated
    Oct 28, 2025
    Dataset provided by
    Nvidia (http://nvidia.com/)
    Authors
    NVIDIA
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Nemotron-PII: Synthesized Data for Privacy-Preserving AI

      Dataset Description
    

    Nemotron‑PII is a synthetic, persona‑grounded dataset for training and evaluating detection of Personally Identifiable Information (PII) and Protected Health Information (PHI) in text at production quality. It contains 100,000 English records across 50+ industries with span‑level annotations for 55+ PII/PHI categories, generated with NVIDIA NeMo Data Designer using synthetic personas grounded in… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/Nemotron-PII.

  15. devign

    • huggingface.co
    Updated Feb 20, 2024
    + more versions
    Cite
    DetectVul (2024). devign [Dataset]. https://huggingface.co/datasets/DetectVul/devign
    Dataset updated
    Feb 20, 2024
    Dataset authored and provided by
    DetectVul
    Description

    Dataset Card for "devign_with_norm_vul_lines"

    More Information needed

    Original Paper: https://www.sciencedirect.com/science/article/abs/pii/S0167739X24004680

    bibtex @article{TRAN2024107504, title = {DetectVul: A statement-level code vulnerability detection for Python}, journal = {Future Generation Computer Systems}, pages = {107504}, year = {2024}, issn = {0167-739X}, doi = {https://doi.org/10.1016/j.future.2024.107504}, url =… See the full description on the dataset page: https://huggingface.co/datasets/DetectVul/devign.

  16. Interaction_Dialogue_with_Privacy

    • huggingface.co
    Updated May 11, 2025
    Cite
    Hang Zeng (2025). Interaction_Dialogue_with_Privacy [Dataset]. https://huggingface.co/datasets/Nidhogg-zh/Interaction_Dialogue_with_Privacy
    Dataset updated
    May 11, 2025
    Authors
    Hang Zeng
    Description

    🔏 Interaction Dialogue Dataset with Extracted Privacy Phrases and Annotated Private Information

      📑 Dataset Details
    
      Dataset Description
    

    This is the dataset of "Automated Annotation of Privacy Information in User Interactions with Large Language Models". Traditional personally identifiable information (PII) detection in anonymous content is insufficient in real-name interaction scenarios with LLMs. By authenticating through user login, queries posed to LLMs… See the full description on the dataset page: https://huggingface.co/datasets/Nidhogg-zh/Interaction_Dialogue_with_Privacy.

  17. sasakitopiichan

    • huggingface.co
    Updated Jul 22, 2024
    Cite
    BangumiBase (2024). sasakitopiichan [Dataset]. https://huggingface.co/datasets/BangumiBase/sasakitopiichan
    Dataset updated
    Jul 22, 2024
    Dataset authored and provided by
    BangumiBase
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Bangumi Image Base of Sasaki To Pii-chan

    This is the image base of the bangumi Sasaki to Pii-chan; we detected 46 characters and 4654 images in total. The full dataset is here. Please note that these image bases are not guaranteed to be 100% clean; they may actually be noisy. If you intend to manually train models using this dataset, we recommend preprocessing the downloaded dataset to eliminate potentially noisy samples (approximately 1% probability). Here is the… See the full description on the dataset page: https://huggingface.co/datasets/BangumiBase/sasakitopiichan.

  18. CTW1500_OCR

    • huggingface.co
    Updated Jan 18, 2025
    Cite
    Mikhail Stepanov (2025). CTW1500_OCR [Dataset]. https://huggingface.co/datasets/MiXaiLL76/CTW1500_OCR
    Dataset updated
    Jan 18, 2025
    Authors
    Mikhail Stepanov
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    CTW1500

      META
    

    https://github.com/open-mmlab/mmocr/blob/main/dataset_zoo/ctw1500/metafile.yml

    Name: 'CTW1500'
    Paper:
      Title: Curved scene text detection via transverse and longitudinal sequence connection
      URL: https://www.sciencedirect.com/science/article/pii/S0031320319300664
      Venue: PR
      Year: '2019'
      BibTeX: '@article{liu2019curved, title={Curved scene text detection via transverse and longitudinal sequence connection}, author={Liu, Yuliang and Jin, Lianwen and… See the full description on the dataset page: https://huggingface.co/datasets/MiXaiLL76/CTW1500_OCR.

  19. az_personal_info_aug_masked

    • huggingface.co
    Cite
    Hamza Agar, az_personal_info_aug_masked [Dataset]. https://huggingface.co/datasets/aimtune/az_personal_info_aug_masked
    Authors
    Hamza Agar
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    📘 Overview

    This dataset consists of augmented Azerbaijani text pairs (clean & masked) that contain personally identifiable information (PII). All content has been automatically generated using ChatGPT to simulate sensitive data scenarios for tasks like PII detection, anonymization, entity masking, and secure data handling.

      🔍 Dataset Structure
    

    Each example is a paired record:

    original: The full augmented Azerbaijani text containing PII.
    masked: The same text with PII… See the full description on the dataset page: https://huggingface.co/datasets/aimtune/az_personal_info_aug_masked.

  20. pii_ner_azerbaijani

    • huggingface.co
    Updated Aug 16, 2025
    Cite
    LocalDoc (2025). pii_ner_azerbaijani [Dataset]. http://doi.org/10.57967/hf/6238
    Dataset updated
    Aug 16, 2025
    Dataset authored and provided by
    LocalDoc
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    PII NER Azerbaijani Dataset

    Short, synthetic Azerbaijani dataset for PII-aware Named Entity Recognition (token classification). Useful for training and evaluating models that detect and localize personally identifiable information (PII) in Azerbaijani text. Note: All examples are synthetically generated with the library az-data-generator [https://github.com/LocalDoc-Azerbaijan/az-data-generator]. No real persons or contact details are included.

      Dataset Summary
    

    Each… See the full description on the dataset page: https://huggingface.co/datasets/LocalDoc/pii_ner_azerbaijani.

pii-masking-200k (Ai4Privacy PII200k Dataset, ai4privacy/pii-masking-200k): 21 scholarly articles cite this dataset.
