14 datasets found
  1. complex_ner

    • huggingface.co
    Updated Sep 22, 2023
    Cite
    Morteza Shahrezaye (2023). complex_ner [Dataset]. http://doi.org/10.57967/hf/3143
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 22, 2023
    Authors
    Morteza Shahrezaye
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Elephant Labs Complex PII Dataset for Long Contexts and Advanced Anonymization (with Business and Software-related Entities)

    Developed by: Elephant Labs
    LinkedIn: Elephant Labs
    Dataset Size: 20,0000 synthetic documents
    Number of tokens in text: 14,140,795 (tokenized with tiktoken.encoding_for_model("gpt-3.5-turbo"))

    Dataset Summary

    Purpose: A synthetically generated dataset for advanced NER tasks, supporting both token classification and LLM fine-tuning (enabling… See the full description on the dataset page: https://huggingface.co/datasets/MorryShah/complex_ner.
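
    The token total quoted above is attributed to tiktoken's encoding for "gpt-3.5-turbo". A minimal sketch of how such a count could be reproduced is shown here. The helper names `count_tokens` and `tiktoken_encode` are hypothetical, and the dataset's text column is not named on this page, so the caller is assumed to supply the document strings.

```python
from typing import Callable, Iterable, List


def count_tokens(texts: Iterable[str], encode: Callable[[str], List]) -> int:
    """Sum the number of tokens that `encode` produces for each text."""
    return sum(len(encode(text)) for text in texts)


def tiktoken_encode(model: str = "gpt-3.5-turbo") -> Callable[[str], List[int]]:
    """Return the encode function of the tiktoken encoding for `model`.

    Needs the third-party `tiktoken` package; the first call may download
    the encoding file.
    """
    import tiktoken  # imported lazily so count_tokens works without it

    return tiktoken.encoding_for_model(model).encode
```

    With tiktoken installed, `count_tokens(documents, tiktoken_encode())` would give a total comparable to the figure above. Any other callable that maps a string to a list of tokens, such as `str.split`, can be passed for a rough check.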

  2. PII-NER

    • huggingface.co
    Updated Jul 20, 2024
    Cite
    Joseph G Flowers (2024). PII-NER [Dataset]. https://huggingface.co/datasets/Josephgflowers/PII-NER
    Explore at:
    Croissant
    Dataset updated
    Jul 20, 2024
    Authors
    Joseph G Flowers
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for NER PII Extraction Dataset

    Dataset Summary

    This dataset is designed for training and evaluating Named Entity Recognition (NER) models focused on extracting Personally Identifiable Information (PII) from text. It includes a variety of entities such as names, addresses, phone numbers, email addresses, and identification numbers. The dataset is suitable for tasks that involve PII detection, compliance checks, and data anonymization.

    Supported Tasks and Leaderboards

    Named Entity… See the full description on the dataset page: https://huggingface.co/datasets/Josephgflowers/PII-NER.
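
    NER models of the kind described above usually emit one tag per token, and those tags have to be grouped into entities before PII can be reported or masked. The sketch below assumes BIO-style tags ("B-NAME", "I-NAME", "O"). The truncated description does not state the tagging scheme or label names, so both are illustrative assumptions, as is the function name `bio_to_spans`.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (label, text) entity pairs."""
    spans: List[Tuple[str, str]] = []
    label: Optional[str] = None
    words: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close the one that was open, if any.
            if label is not None:
                spans.append((label, " ".join(words)))
            label, words = tag[2:], [token]
        elif tag.startswith("I-") and label == tag[2:]:
            # Continuation of the entity that is currently open.
            words.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity.
            if label is not None:
                spans.append((label, " ".join(words)))
            label, words = None, []
    if label is not None:
        spans.append((label, " ".join(words)))
    return spans
```

    For example, the tokens `["Call", "Jane", "Doe", "at", "555-0100"]` with tags `["O", "B-NAME", "I-NAME", "O", "B-PHONE"]` yield one NAME entity and one PHONE entity.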

  3. pii-masking-300k

    • huggingface.co
    Updated Apr 4, 2024
    Cite
    Ai4Privacy (2024). pii-masking-300k [Dataset]. http://doi.org/10.57967/hf/1995
    Explore at:
    Croissant
    Dataset updated
    Apr 4, 2024
    Dataset authored and provided by
    Ai4Privacy
    License

    https://choosealicense.com/licenses/other/

    Description

    Purpose and Features

    🌍 World's largest open dataset for privacy masking 🌎 The dataset is useful to train and evaluate models to remove personally identifiable and sensitive information from text, especially in the context of AI assistants and LLMs. Key facts:

    OpenPII-220k text entries have 27 PII classes (types of sensitive data), targeting 749 discussion subjects / use cases split across education, health, and psychology. FinPII contains an additional ~20 types tailored to… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-300k.
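
    Privacy masking, as described above, amounts to replacing each detected PII span with a class placeholder. The following is a minimal sketch under stated assumptions: spans are given as (start, end, label) character offsets, and placeholders are written as `[LABEL]`. Neither the span format nor the placeholder style is taken from the dataset, and `mask_text` is a hypothetical helper.

```python
from typing import Iterable, List, Tuple


def mask_text(text: str, spans: Iterable[Tuple[int, int, str]]) -> str:
    """Replace each (start, end, label) character span with "[label]"."""
    pieces: List[str] = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            raise ValueError("spans must not overlap")
        pieces.append(text[cursor:start])  # untouched text before the span
        pieces.append(f"[{label}]")        # placeholder for the PII itself
        cursor = end
    pieces.append(text[cursor:])           # untouched text after the last span
    return "".join(pieces)
```

    Sorting the spans first means the caller can pass them in any order, and the overlap check fails loudly instead of producing a corrupted string.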

  4. Traveling the Silk Road: Non-anonymized datasets

    • impactcybertrust.org
    Updated Mar 4, 2012
    Cite
    Carnegie Mellon University (2012). Traveling the Silk Road: Non-anonymized datasets [Dataset]. http://doi.org/10.23721/116/1406256
    Dataset updated
    Mar 4, 2012
    Authors
    Carnegie Mellon University
    Time period covered
    Mar 4, 2012 - Jul 23, 2012
    Description

    Non-anonymized subset of the databases used in the paper "Traveling the Silk Road: A measurement analysis of a large anonymous online marketplace" (Christin, 2013). In this dataset, textual information (item name, description, or feedback text) and handles have not been anonymized and are thus available. We don't expect any private identifiers or other PII to be present in the data, which was collected from a publicly available website -- the Silk Road anonymous marketplace -- for a few months in 2012.

    For less restricted usage terms, please consider the anonymized version, which is also available without any restrictions. This non-anonymized dataset should only be requested if your project MUST rely on full textual descriptions of items and/or feedback.

    Christin (2013) Traveling the Silk Road: A measurement analysis of a large anonymous online marketplace. To appear in Proceedings of the 22nd International World Wide Web Conference (WWW'13). Rio de Janeiro, Brazil. May 2013.

  5. CFC Anonymized Raw Data

    • datasets.ai
    • catalog.data.gov
    Updated Sep 10, 2024
    Cite
    Office of Personnel Management (2024). CFC Anonymized Raw Data [Dataset]. https://datasets.ai/datasets/cfc-anonymized-raw-data-09c36
    Dataset updated
    Sep 10, 2024
    Dataset provided by
    United States Office of Personnel Management: https://opm.gov/
    Authors
    Office of Personnel Management
    Description

    Summary of every designation to every charity in a campaign year with anonymized data on the source (NO PII)

  6. Data Masking Tools Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Jun 21, 2025
    Cite
    Archive Market Research (2025). Data Masking Tools Report [Dataset]. https://www.archivemarketresearch.com/reports/data-masking-tools-560706
    Available download formats: pdf, doc, ppt
    Dataset updated
    Jun 21, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global market for data masking tools is experiencing robust growth, driven by increasing regulatory compliance needs (like GDPR and CCPA), the rising adoption of cloud computing, and the expanding volume of sensitive data requiring protection. The market, estimated at $2.5 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033. This growth is fueled by organizations' increasing focus on data security and privacy, particularly within sectors like healthcare, finance, and government. The demand for sophisticated data masking solutions that can effectively anonymize and pseudonymize data while maintaining data utility for testing and development is a significant driver. Furthermore, the shift towards cloud-based data masking solutions, offering scalability and ease of management, is contributing to market expansion.

    Several key trends are shaping the market. The integration of advanced technologies such as AI and machine learning into data masking tools is enhancing their effectiveness and automating complex masking processes. The emergence of data masking solutions designed for specific data types, such as personally identifiable information (PII) and financial data, caters to niche requirements. However, challenges such as the complexity of implementing and managing data masking solutions, and concerns about the potential impact on data usability, represent restraints on market growth.

    The market is segmented by deployment type (cloud, on-premises), organization size (small, medium, large enterprises), and industry vertical (healthcare, finance, etc.). Key players in this space include Oracle, Delphix, BMC Software, Informatica, IBM, and several other specialized vendors offering a range of solutions to meet diverse organizational needs. The competitive landscape is dynamic, with ongoing innovation and consolidation shaping the future of the market.

  7. pii-masking-65k

    • huggingface.co
    Updated Apr 5, 2024
    Cite
    Ai4Privacy (2024). pii-masking-65k [Dataset]. http://doi.org/10.57967/hf/2012
    Explore at:
    Croissant
    Dataset updated
    Apr 5, 2024
    Dataset authored and provided by
    Ai4Privacy
    Description

    Purpose and Features

    The purpose of the model and dataset is to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The model is a fine-tuned version of "Distilled BERT", a smaller and faster version of BERT. It was adapted for token classification using what is, to our knowledge, the largest open-source PII masking dataset, which we are releasing simultaneously. The model has 62 million parameters. The original… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-65k.

  8. Hansa marketplace: Non-anonymized dataset

    • impactcybertrust.org
    Updated Oct 8, 2015
    Cite
    Carnegie Mellon University (2015). Hansa marketplace: Non-anonymized dataset [Dataset]. https://www.impactcybertrust.org/dataset_view?idDataset=1499
    Dataset updated
    Oct 8, 2015
    Authors
    Carnegie Mellon University
    Time period covered
    Oct 8, 2015 - Jul 14, 2017
    Description

    Non-anonymized database pertaining to the Hansa marketplace. This data was used in the paper "Measurement by Proxy: On the Accuracy of Online Marketplace Measurements" (Cuevas et al., 2018). In this dataset, textual information (item name, description, or feedback text) and handles have not been anonymized and are thus available. We don't expect any private identifiers or other PII to be present in the data, which was collected from a publicly available website (Hansa marketplace) over slightly less than two years (2015-2017).

    For less restricted usage terms, please consider the anonymized version, which is also available without any restrictions. This non-anonymized dataset should only be requested if your project MUST rely on full textual descriptions of items and/or feedback.

  9. pii-masking-200k

    • huggingface.co
    Updated Apr 22, 2024
    Cite
    Ai4Privacy (2024). pii-masking-200k [Dataset]. http://doi.org/10.57967/hf/1532
    Explore at:
    Croissant
    Dataset updated
    Apr 22, 2024
    Dataset authored and provided by
    Ai4Privacy
    Description

    Ai4Privacy Community

    Join our community at https://discord.gg/FmzWshaaQT to help build open datasets for privacy masking.

    Purpose and Features

    Previously the world's largest open dataset for privacy; that title now belongs to pii-masking-300k. The purpose of the dataset is to train models to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The example texts have 54 PII classes (types of sensitive data), targeting 229 discussion… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-200k.

  10. Job Application Email (Anonymized & Feature-Rich)

    • kaggle.com
    Updated Apr 15, 2025
    Cite
    Rashmi Shree (2025). Job Application Email (Anonymized & Feature-Rich) [Dataset]. https://www.kaggle.com/datasets/rasho330/job-application-email-anonymized-and-feature-rich
    Explore at:
    Croissant
    Dataset updated
    Apr 15, 2025
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Rashmi Shree
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This dataset consists of a curated and anonymized collection of real job application confirmation emails from a Gmail inbox. It includes confirmation emails, rejection notices, and other relevant correspondences. The dataset was originally curated to address the challenge of eliminating manual job application tracking, allowing for automatic tracking directly from the inbox, capturing application confirmations and rejection notifications.

    The dataset has been carefully pre-processed, cleaned, and enriched with derived features such as:

    1. 📅 Parsed Date and Time
    2. 🕒 Week, Month, and Year of Email
    3. ⏱️ Days Since Email Received
    4. 📩 Email Subject and Body
    5. 🏢 Company Name (Parsed from Subject/Body)
    6. 📊 Application Status Insights

    The dataset was originally curated to build a job application tracking agent that can automatically extract and track application updates (such as confirmations, rejections, interview invites, and assessment notifications) directly from the inbox. The goal was to enable users to easily interact with an AI assistant to analyze and manage their job search process more efficiently.

    ⚠️ Disclaimer: All personally identifiable information (PII), such as names and email addresses, has been fully anonymized or redacted. Any personal names found in the dataset have been replaced with the fictional name "Michael Gary Scott" as a placeholder. This character reference is used purely for fun and does not correspond to any real individual. This dataset is intended strictly for educational and research purposes. Please ensure any further use of this dataset respects privacy and ethical data handling practices.
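
    The derived date features listed above (parsed date and time, week, month, year, days since receipt) can be computed from a standard email Date header. The sketch below is one way to do so and is not the author's preprocessing code. The function name `date_features` and the output keys are hypothetical, and ISO week numbering is an assumption.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def date_features(date_header: str, now: datetime) -> dict:
    """Derive calendar features from an RFC 2822 Date header.

    `now` must be timezone-aware when the header carries a UTC offset,
    so that the subtraction below is well defined.
    """
    received = parsedate_to_datetime(date_header)
    return {
        "date": received.date().isoformat(),
        "time": received.time().isoformat(),
        "week": received.isocalendar()[1],  # ISO week number (assumption)
        "month": received.month,
        "year": received.year,
        "days_since_received": (now - received).days,
    }


features = date_features(
    "Tue, 15 Apr 2025 09:30:00 +0000",
    now=datetime(2025, 4, 20, 12, 0, tzinfo=timezone.utc),
)
```

    Passing `now` explicitly keeps the function deterministic, which matters because "days since email received" otherwise changes every time the dataset is regenerated.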

  11. Data Pseudonymity Software Report

    • marketresearchforecast.com
    doc, pdf, ppt
    Updated Mar 6, 2025
    Cite
    Market Research Forecast (2025). Data Pseudonymity Software Report [Dataset]. https://www.marketresearchforecast.com/reports/data-pseudonymity-software-28311
    Available download formats: ppt, pdf, doc
    Dataset updated
    Mar 6, 2025
    Dataset authored and provided by
    Market Research Forecast
    License

    https://www.marketresearchforecast.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The data pseudonymization software market is experiencing robust growth, driven by increasing concerns over data privacy regulations like GDPR and CCPA, and the rising need to protect sensitive customer information while still leveraging data for analytics and other business purposes. The market, estimated at $2 billion in 2025, is projected to witness a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033, reaching approximately $7 billion by 2033. This expansion is fueled by the adoption of cloud-based solutions, which offer scalability and cost-effectiveness, coupled with a growing preference for data pseudonymization techniques among enterprises, particularly in sectors like healthcare, finance, and telecommunications that handle vast quantities of personally identifiable information (PII). Key trends include the integration of advanced analytics capabilities into pseudonymization software and increasing demand for solutions capable of handling diverse data formats and sources. However, the market faces restraints including the complexity of implementing pseudonymization techniques, the need for specialized expertise, and potential concerns regarding data utility after pseudonymization.

    The market segmentation reveals a significant preference for cloud-based solutions over on-premises deployments, reflecting the broader trend toward cloud adoption in enterprise IT. Enterprise adoption outweighs individual usage, reflecting the higher volume and sensitivity of data handled by large organizations. Geographically, North America currently dominates the market, followed by Europe, driven by stringent data privacy regulations and advanced technological infrastructure. However, the Asia-Pacific region is expected to experience significant growth in the coming years, fueled by increasing digitalization and growing awareness of data privacy issues.

    Competition among vendors like Aircloak, AvePoint, Anonos, and others is intense, with companies focusing on innovation in areas such as AI-powered pseudonymization and enhanced data security features to gain a competitive edge. The long-term forecast indicates a sustained period of growth, propelled by ongoing regulatory pressure and the continuous need for robust data protection measures in a data-driven economy.
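
    The figures quoted above ($2 billion in 2025, 15% CAGR, roughly $7 billion by 2033) can be checked with the standard compound-growth formula, value × (1 + rate)^periods. The sketch below is only an arithmetic check. The report does not say how many compounding periods it counts between 2025 and 2033, so both 8 and 9 are tried, and the function names are hypothetical.

```python
def project(value: float, rate: float, periods: int) -> float:
    """Compound `value` at `rate` per period for `periods` periods."""
    return value * (1 + rate) ** periods


def implied_cagr(start: float, end: float, periods: int) -> float:
    """Growth rate per period that takes `start` to `end` in `periods` periods."""
    return (end / start) ** (1 / periods) - 1


eight_periods = project(2.0, 0.15, 8)   # 2025 -> 2033 counted as 8 periods
nine_periods = project(2.0, 0.15, 9)    # counted as 9 periods
rate_for_7bn = implied_cagr(2.0, 7.0, 8)
```

    Under these assumptions, 8 periods give about $6.1 billion and 9 periods about $7.0 billion, so the "approximately $7 billion" figure matches a 9-period count, or equivalently a rate of about 17% over 8 periods.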

  12. Alphabay marketplace: Non-anonymized dataset

    • impactcybertrust.org
    Updated Dec 31, 2014
    Cite
    Carnegie Mellon University (2014). Alphabay marketplace: Non-anonymized dataset [Dataset]. http://doi.org/10.23721/116/1462165
    Dataset updated
    Dec 31, 2014
    Authors
    Carnegie Mellon University
    Time period covered
    Dec 31, 2014 - May 26, 2017
    Description

    Non-anonymized database pertaining to the AlphaBay marketplace. This data was used in the papers "Plug and Prey? Measuring the Commoditization of Cybercrime via Online Anonymous Markets" (Van Wegberg et al., 2018), "An Empirical Analysis of Traceability in the Monero Blockchain" (Moeser et al., 2018) and in the joint EMCDDA/EUROPOL report "Drugs and the darknet: Perspectives for enforcement, research and policy" (EMCDDA, 2017). In this dataset, textual information (item name, description, or feedback text) and handles have not been anonymized and are thus available. We don't expect any private identifiers or other PII to be present in the data, which was collected from a publicly available website (AlphaBay marketplace) over two and a half years (2014-2017).

    For less restricted usage terms, please consider the anonymized version, which is also available without any restrictions. This non-anonymized dataset should only be requested if your project MUST rely on full textual descriptions of items and/or feedback.

    EMCDDA (2017) Drugs and the darknet: Perspectives for enforcement, research and policy. November 2017.
    Van Wegberg et al. Plug and Prey? Measuring the Commoditization of Cybercrime via Online Anonymous Markets. To appear in Proceedings of the 27th USENIX Security Symposium (USENIX Security'18). Baltimore, MD. August 2018.
    Moeser et al. An Empirical Analysis of Traceability in the Monero Blockchain. To appear in Proceedings of the Privacy Enhancing Technology Symposium (PETS 2018), volume 3. Barcelona, Spain. July 2018.

  13. az_personal_info_aug_masked

    • huggingface.co
    Cite
    Hamza Agar, az_personal_info_aug_masked [Dataset]. https://huggingface.co/datasets/aimtune/az_personal_info_aug_masked
    Authors
    Hamza Agar
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    📘 Overview

    This dataset consists of augmented Azerbaijani text pairs (clean & masked) that contain personally identifiable information (PII). All content has been automatically generated using ChatGPT to simulate sensitive data scenarios for tasks like PII detection, anonymization, entity masking, and secure data handling.

      🔍 Dataset Structure
    

    Each example is a paired record:

    original: The full augmented Azerbaijani text containing PII.
    masked: The same text with PII… See the full description on the dataset page: https://huggingface.co/datasets/aimtune/az_personal_info_aug_masked.
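
    Given paired `original` and `masked` fields like those described above, the masked segments can be recovered by aligning the two texts. The sketch below uses a word-level diff from Python's standard `difflib`. The placeholder style (`[NAME]`, `[PHONE]`) and the sample record are invented for illustration, since the description of the `masked` field is truncated on this page.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def masked_segments(original: str, masked: str) -> List[Tuple[str, str]]:
    """Return (original_text, replacement) pairs where the two texts differ."""
    a, b = original.split(), masked.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    pairs: List[Tuple[str, str]] = []
    for op, a0, a1, b0, b1 in matcher.get_opcodes():
        if op == "replace":  # words in `original` swapped for other words
            pairs.append((" ".join(a[a0:a1]), " ".join(b[b0:b1])))
    return pairs


# Hypothetical record; the real dataset's mask format may differ.
record = {
    "original": "Ad Ali Mammadov telefon 0501234567",
    "masked": "Ad [NAME] telefon [PHONE]",
}
segments = masked_segments(record["original"], record["masked"])
```

    A word-level diff is a heuristic: it assumes the masked text keeps the surrounding words unchanged, and it can mis-align when a placeholder sits next to repeated words.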

  14. Five-week DoH Dataset collected on ISP backbone lines

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 15, 2023
    Cite
    Kamil Jeřábek (2023). Five-week DoH Dataset collected on ISP backbone lines [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8348772
    Dataset updated
    Sep 15, 2023
    Dataset provided by
    Ondřej Ryšavý
    Kamil Jeřábek
    Karel Hynek
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is an additional DoH dataset used for researching DoH traffic data drift phenomena. It contains anonymized packet captures (pcaps) from the following days:

    2022-11-28

    2022-12-05

    2022-12-12

    2022-12-19

    2022-12-26

    The traffic was captured on the CESNET2 network and anonymized. The packet capturing and anonymization follow the methodology described in [1]. The list of IP addresses used for DoH recognition is also included within the dataset in the doh_resolver_ip.csv file. The structure of the dataset is as follows:

    .
    ├── doh_resolver_ip.csv
    ├── pcap
    │   ├── 2022-11-28
    │   │   ├── DoH-20221128180002.pcapng
    │   │   └── HTTPS-20221128180002.pcapng
    │   ├── 2022-12-05
    │   │   ├── DoH-20221205180001.pcapng
    │   │   └── HTTPS-20221205180001.pcapng
    │   ├── 2022-12-12
    │   │   ├── DoH-20221212180001.pcapng
    │   │   └── HTTPS-20221212180001.pcapng
    │   ├── 2022-12-19
    │   │   ├── DoH-20221219180001.pcapng
    │   │   └── HTTPS-20221219180001.pcapng
    │   └── 2022-12-26
    │       ├── DoH-20221226180001.pcapng
    │       └── HTTPS-20221226180001.pcapng
    └── README.md

    [1] Jeřábek, K., Hynek, K., Čejka, T., & Ryšavý, O. (2022). Collection of datasets with DNS over HTTPS traffic. Data in Brief, 42, 108310. https://www.sciencedirect.com/science/article/pii/S2352340922005121
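
    With the directory layout above, each capture day holds one DoH file and one HTTPS file whose timestamps differ slightly between days, so it is easier to discover the files than to construct their names. A minimal sketch, assuming the archive has been extracted to a local directory; the function name `pair_captures` is hypothetical.

```python
from pathlib import Path
from typing import Dict


def pair_captures(root: Path) -> Dict[str, Dict[str, Path]]:
    """Map each day directory under root/pcap to its capture files by kind."""
    pairs: Dict[str, Dict[str, Path]] = {}
    for day_dir in sorted((root / "pcap").iterdir()):
        if not day_dir.is_dir():
            continue
        files: Dict[str, Path] = {}
        for capture in sorted(day_dir.glob("*.pcapng")):
            # File names look like "DoH-<timestamp>.pcapng" or "HTTPS-<timestamp>.pcapng".
            kind = capture.name.split("-", 1)[0]
            files[kind] = capture
        pairs[day_dir.name] = files
    return pairs
```

    Calling `pair_captures(Path("path/to/dataset"))` returns a mapping such as `{"2022-11-28": {"DoH": ..., "HTTPS": ...}}`, which makes it straightforward to process the DoH and HTTPS traffic of the same day side by side.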



complex_ner

MorryShah/complex_ner

3 scholarly articles cite this dataset (View in Google Scholar)