Ai4Privacy Community
Join our community at https://discord.gg/FmzWshaaQT to help build open datasets for privacy masking.
Purpose and Features
Previously the world's largest open dataset for privacy masking; it has since been superseded by pii-masking-300k. The purpose of the dataset is to train models to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The example texts cover 54 PII classes (types of sensitive data), targeting 229 discussion… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-200k.
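The masking task these datasets target can be sketched offline with a toy rule-based masker. This is illustrative only: the pattern set and the `[LABEL]` placeholder format are assumptions for the sketch, and trained models cover the 54 PII classes that simple regexes cannot.

```python
import re

# Toy rule-based PII masker: replaces matches with [LABEL] placeholders.
# Real models learn contextual classes far beyond what regexes capture.
PATTERNS = {
    "EMAIL": r"[\w.+-]+@[\w-]+\.[\w.-]+",
    "PHONE": r"\+?\d[\d\s().-]{7,}\d",
}

def mask_pii(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = re.sub(pattern, f"[{label}]", text)
    return text

print(mask_pii("Contact Jane at jane.doe@example.com or +1 555-123-4567."))
# → Contact Jane at [EMAIL] or [PHONE].
```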
ponoma16/context-aware-pii-detection-v3 dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for NER PII Extraction Dataset Dataset Summary This dataset is designed for training and evaluating Named Entity Recognition (NER) models focused on extracting Personally Identifiable Information (PII) from text. It includes a variety of entities such as names, addresses, phone numbers, email addresses, and identification numbers. The dataset is suitable for tasks that involve PII detection, compliance checks, and data anonymization. Supported Tasks and Leaderboards Named Entity… See the full description on the dataset page: https://huggingface.co/datasets/Josephgflowers/PII-NER.
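Training a token classifier on a dataset like this typically requires converting entity annotations into per-token BIO tags. A minimal sketch, assuming character-level `(start, end, label)` spans and whitespace tokenization (the dataset's actual annotation schema may differ):

```python
def spans_to_bio(text, spans):
    """Convert (start, end, label) character spans to whitespace-token BIO tags."""
    tokens, labels = [], []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)  # locate token in the original text
        end = start + len(token)
        pos = end
        tag = "O"
        for s, e, label in spans:
            if start >= s and end <= e:
                # First token of a span gets B-, continuation tokens get I-
                tag = ("B-" if start == s else "I-") + label
                break
        tokens.append(token)
        labels.append(tag)
    return tokens, labels

text = "Call John Smith at 555-0199"
spans = [(5, 15, "NAME"), (19, 27, "PHONE")]
print(spans_to_bio(text, spans))
# → (['Call', 'John', 'Smith', 'at', '555-0199'], ['O', 'B-NAME', 'I-NAME', 'O', 'B-PHONE'])
```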
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Developed by AI4Privacy, this dataset represents a pioneering effort in the realm of privacy and AI. As an expansive resource hosted on Hugging Face at ai4privacy/pii-masking-200k, it serves a crucial role in addressing the growing concerns around personal data security in AI applications.
Sources: The dataset is crafted using proprietary algorithms, ensuring the creation of synthetic data that avoids privacy violations. Its multilingual composition, including English, French, German, and Italian texts, reflects a diverse source base. The data is meticulously curated with human-in-the-loop validation, ensuring both relevance and quality.
Context: In an era where data privacy is paramount, this dataset is tailored to train AI models to identify and mask personally identifiable information (PII). It covers 54 PII classes and extends across 229 use cases in various domains like business, education, psychology, and legal fields, emphasizing its contextual richness and applicability.
Inspiration: The dataset draws inspiration from the need for enhanced privacy measures in AI interactions, particularly in LLMs and AI assistants. The creators, AI4Privacy, are dedicated to building tools that act as a 'global seatbelt' for AI, protecting individuals' personal data. This dataset is a testament to their commitment to advancing AI technology responsibly and ethically.
This comprehensive dataset is not just a tool but a step towards a future where AI and privacy coexist harmoniously, offering immense value to researchers, developers, and privacy advocates alike.
huangyizhuo/pii-url-detection dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Gretel Synthetic Domain-Specific Documents Dataset (English)
This dataset is a synthetically generated collection of documents enriched with Personally Identifiable Information (PII) and Protected Health Information (PHI) entities spanning multiple domains. Created using Gretel Navigator with mistral-nemo-2407 as the backend model, it is specifically designed for fine-tuning Gliner models. The dataset contains document passages featuring PII/PHI entities from a wide range of… See the full description on the dataset page: https://huggingface.co/datasets/gretelai/gretel-pii-masking-en-v1.
Purpose and Features
The purpose of the model and dataset is to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The model is a fine-tuned version of DistilBERT, a smaller and faster version of BERT, adapted for token classification on what is, to our knowledge, the largest open-source PII masking dataset, which we are releasing simultaneously. The model has 62 million parameters. The original… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-43k.
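Once a token classifier like this has produced per-token labels, the remaining step is splicing placeholders back into the text. A minimal offline sketch of that post-processing, assuming whitespace tokens and BIO labels (the function and placeholder format are illustrative, not the model's actual API):

```python
def apply_bio_mask(tokens, labels):
    """Replace each labeled token run with a single [LABEL] placeholder."""
    out = []
    for token, tag in zip(tokens, labels):
        if tag == "O":
            out.append(token)
        elif tag.startswith("B-"):
            out.append(f"[{tag[2:]}]")
        # "I-" tokens continue the previous placeholder, so emit nothing
    return " ".join(out)

tokens = ["Call", "John", "Smith", "at", "555-0199"]
labels = ["O", "B-NAME", "I-NAME", "O", "B-PHONE"]
print(apply_bio_mask(tokens, labels))  # → Call [NAME] at [PHONE]
```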
CDLA-Sharing-1.0: https://cdla.io/sharing-1-0/
Issues to resolve:
To downgrade, the following three packages must be changed together:
datasets==2.16.0
fsspec==2023.10.0
gcsfs==2023.10.0
Add the Kaggle input "dataset", which is just a requirements.txt file, then install the requirements:
!pip install -r /kaggle/input/tf2-16-requirements-txt/requirements.txt
!pip list | grep -E 'datasets|transformers|tensorflow|keras|ml-dtypes|numpy|fsspec|gcsfs'
datasets 2.16.0
fsspec 2023.10.0
gcsfs 2023.10.0
keras 3.0.5
keras-cv 0.8.2
keras-nlp 0.8.2
keras-tuner 1.4.6
ml-dtypes 0.3.2
msgpack-numpy 0.4.8
numpy 1.26.4
tensorflow 2.16.1
tensorflow-cloud 0.1.16
tensorflow-datasets 4.9.4
tensorflow-decision-forests 1.8.1
tensorflow-estimator 2.15.0
tensorflow-hub 0.16.1
tensorflow-io 0.35.0
tensorflow-io-gcs-filesystem 0.35.0
tensorflow-metadata 0.14.0
tensorflow-probability 0.23.0
tensorflow-serving-api 2.14.1
tensorflow-text 2.16.1
tensorflow-transform 0.14.0
tf_keras 2.16.0
transformers 4.38.2
Then run "Restart & clear cell outputs" to make these changes live in the kernel.
NOTE: DO NOT FACTORY RESET!
import datasets
import tensorflow as tf
assert datasets.__version__ == '2.16.0'
assert tf.__version__ == '2.16.1'
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created with the Gemini API using the Kaggle notebook: https://www.kaggle.com/code/newtonbaba12345/pii-detection-data-generation-using-gemini
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
King-Harry/NinjaMasker-PII-Redaction-Dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Image generated by DALL-E. See prompt for more details
💼 📊 Synthetic Financial Domain Documents with PII Labels
gretelai/synthetic_pii_finance_multilingual is a dataset of full-length synthetic financial documents containing Personally Identifiable Information (PII), generated using Gretel Navigator and released under Apache 2.0. This dataset is designed to assist with the following use cases:
🏷️ Training NER (Named Entity Recognition) models to detect and label PII in… See the full description on the dataset page: https://huggingface.co/datasets/gretelai/synthetic_pii_finance_multilingual.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Synthetic PII Detection Dataset (DeBERTa-PII-Synth)
A large-scale synthetic dataset for training token classification models to detect Personally Identifiable Information (PII).
Dataset Summary
DeBERTa-PII-Synth is a fully synthetic dataset created to train token-classification models for detecting Personally Identifiable Information (PII). It contains 120k+ samples, each including:
Natural-language text containing synthetic PII Gold span annotations (PERSON, EMAIL, DATE… See the full description on the dataset page: https://huggingface.co/datasets/tursunait/deberta-pii-synth.
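Gold span annotations like these can be applied to the text directly to produce masked training targets. A minimal sketch, assuming `(start, end, label)` character offsets (the dataset's actual span format may differ):

```python
def mask_spans(text, spans):
    """Replace gold (start, end, label) spans with [LABEL] placeholders.

    Spans are applied right-to-left so earlier character offsets stay valid
    as the text length changes.
    """
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text

text = "Email anna@example.org by 2024-05-01."
spans = [(6, 22, "EMAIL"), (26, 36, "DATE")]
print(mask_spans(text, spans))  # → Email [EMAIL] by [DATE].
```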
rasikabh/pile-pii dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Nemotron-PII: Synthesized Data for Privacy-Preserving AI
Dataset Description
Nemotron‑PII is a synthetic, persona‑grounded dataset for training and evaluating detection of Personally Identifiable Information (PII) and Protected Health Information (PHI) in text at production quality. It contains 100,000 English records across 50+ industries with span‑level annotations for 55+ PII/PHI categories, generated with NVIDIA NeMo Data Designer using synthetic personas grounded in… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/Nemotron-PII.
Dataset Card for "devign_with_norm_vul_lines"
More information needed. Original paper: https://www.sciencedirect.com/science/article/abs/pii/S0167739X24004680 BibTeX: @article{TRAN2024107504, title = {DetectVul: A statement-level code vulnerability detection for Python}, journal = {Future Generation Computer Systems}, pages = {107504}, year = {2024}, issn = {0167-739X}, doi = {https://doi.org/10.1016/j.future.2024.107504}, url =… See the full description on the dataset page: https://huggingface.co/datasets/DetectVul/devign.
🔏 Interaction Dialogue Dataset with Extracted Privacy Phrases and Annotated Private Information
📑 Dataset Details
Dataset Description
This is the dataset of "Automated Annotation of Privacy Information in User Interactions with Large Language Models". Traditional personally identifiable information (PII) detection in anonymous content is insufficient in real-name interaction scenarios with LLMs. By authenticating through user login, queries posed to LLMs… See the full description on the dataset page: https://huggingface.co/datasets/Nidhogg-zh/Interaction_Dialogue_with_Privacy.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Bangumi Image Base of Sasaki To Pii-chan
This is the image base of the bangumi Sasaki to Pii-chan; we detected 46 characters and 4654 images in total. The full dataset is here. Please note that these image bases are not guaranteed to be 100% clean; they may contain noise. If you intend to manually train models using this dataset, we recommend performing the necessary preprocessing on the downloaded dataset to eliminate potential noisy samples (approximately 1% probability). Here is the… See the full description on the dataset page: https://huggingface.co/datasets/BangumiBase/sasakitopiichan.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
CTW1500
META
https://github.com/open-mmlab/mmocr/blob/main/dataset_zoo/ctw1500/metafile.yml Name: 'CTW1500' Paper: Title: Curved scene text detection via transverse and longitudinal sequence connection URL: https://www.sciencedirect.com/science/article/pii/S0031320319300664 Venue: PR Year: '2019' BibTeX: '@article{liu2019curved, title={Curved scene text detection via transverse and longitudinal sequence connection}, author={Liu, Yuliang and Jin, Lianwen and… See the full description on the dataset page: https://huggingface.co/datasets/MiXaiLL76/CTW1500_OCR.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
📘 Overview
This dataset consists of augmented Azerbaijani text pairs (clean & masked) that contain personally identifiable information (PII). All content has been automatically generated using ChatGPT to simulate sensitive data scenarios for tasks like PII detection, anonymization, entity masking, and secure data handling.
🔍 Dataset Structure
Each example is a paired record:
original: The full augmented Azerbaijani text containing PII. masked: The same text with PII… See the full description on the dataset page: https://huggingface.co/datasets/aimtune/az_personal_info_aug_masked.
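A clean/masked pair is only usable for training if the masked side no longer leaks the raw PII. A minimal sanity check, sketched here with a hypothetical record and a hypothetical `[LABEL]` placeholder format (the dataset's actual placeholders may differ):

```python
def check_pair(original, masked, pii_values):
    """Sanity-check a clean/masked pair against a list of known PII values.

    True only if every PII value appears in the original text and none
    survives into the masked text.
    """
    appears = all(value in original for value in pii_values)
    leaked = any(value in masked for value in pii_values)
    return appears and not leaked

# Hypothetical record; field names follow the dataset's original/masked schema.
original = "Əli Məmmədov +994501234567 nömrəsindən zəng etdi."
masked = "[NAME] [PHONE] nömrəsindən zəng etdi."
print(check_pair(original, masked, ["Əli Məmmədov", "+994501234567"]))  # → True
```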
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PII NER Azerbaijani Dataset
Short, synthetic Azerbaijani dataset for PII-aware Named Entity Recognition (token classification). Useful for training and evaluating models that detect and localize personally identifiable information (PII) in Azerbaijani text. Note: All examples are synthetically generated with the library az-data-generator [https://github.com/LocalDoc-Azerbaijan/az-data-generator]. No real persons or contact details are included.
Dataset Summary
Each… See the full description on the dataset page: https://huggingface.co/datasets/LocalDoc/pii_ner_azerbaijani.