13 datasets found
  1. open-pii-masking-500k-ai4privacy

    • kaggle.com
    • dataverse.harvard.edu
    Updated Mar 17, 2025
    Cite
    Michael Anthony (2025). open-pii-masking-500k-ai4privacy [Dataset]. https://www.kaggle.com/datasets/mikedoes/open-pii-masking-500k-ai4privacy
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 17, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Michael Anthony
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    🌍 World's largest open dataset for privacy masking 🌎

    The dataset is useful to train and evaluate models to remove personally identifiable and sensitive information from text, especially in the context of AI assistants and LLMs.

    Task Showcase of Privacy Masking

    Dataset Analytics 📊 - ai4privacy/open-pii-masking-500k-ai4privacy

    p5y Data Analytics

    • Total Entries: 580,227
    • Total Tokens: 19,199,982
    • Average Source Text Length: 17.37 words
    • Total PII Labels: 5,705,973
    • Number of Unique PII Classes: 20 (Open PII Labelset)
    • Unique Identity Values: 704,215

    Language Distribution Analytics

    **Number of Unique Languages**: 8

    | Language | Count | Percentage |
    |--------------------|----------|------------|
    | English (en) 🇺🇸🇬🇧🇨🇦🇮🇳 | 150,693 | 25.97% |
    | French (fr) 🇫🇷🇨🇭🇨🇦 | 112,136 | 19.33% |
    | German (de) 🇩🇪🇨🇭 | 82,384 | 14.20% |
    | Spanish (es) 🇪🇸 🇲🇽 | 78,013 | 13.45% |
    | Italian (it) 🇮🇹🇨🇭 | 68,824 | 11.86% |
    | Dutch (nl) 🇳🇱 | 26,628 | 4.59% |
    | Hindi (hi)* 🇮🇳 | 33,963 | 5.85% |
    | Telugu (te)* 🇮🇳 | 27,586 | 4.75% |

    *These languages are in experimental stages.

    Region Distribution Analytics

    **Number of Unique Regions**: 11

    | Region | Count | Percentage |
    |-----------------------|----------|------------|
    | Switzerland (CH) 🇨🇭 | 112,531 | 19.39% |
    | India (IN) 🇮🇳 | 99,724 | 17.19% |
    | Canada (CA) 🇨🇦 | 74,733 | 12.88% |
    | Germany (DE) 🇩🇪 | 41,604 | 7.17% |
    | Spain (ES) 🇪🇸 | 39,557 | 6.82% |
    | Mexico (MX) 🇲🇽 | 38,456 | 6.63% |
    | France (FR) 🇫🇷 | 37,886 | 6.53% |
    | Great Britain (GB) 🇬🇧 | 37,092 | 6.39% |
    | United States (US) 🇺🇸 | 37,008 | 6.38% |
    | Italy (IT) 🇮🇹 | 35,008 | 6.03% |
    | Netherlands (NL) 🇳🇱 | 26,628 | 4.59% |

    Machine Learning Task Analytics

    | Split | Count | Percentage |
    |-------------|----------|------------|
    | **Train** | 464,150 | 79.99% |
    | **Validate** | 116,077 | 20.01% |
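The published counts can be cross-checked with a few lines of arithmetic. The sketch below copies the figures from the language and split tables above and confirms that each set of counts adds up to the stated total of 580,227 entries. The variable and function names are illustrative only.

```python
# Figures copied from the dataset analytics above.
TOTAL_ENTRIES = 580_227

language_counts = {
    "en": 150_693, "fr": 112_136, "de": 82_384, "es": 78_013,
    "it": 68_824, "nl": 26_628, "hi": 33_963, "te": 27_586,
}
split_counts = {"train": 464_150, "validate": 116_077}


def percentages(counts, total):
    """Return each count as a percentage of total, rounded to two decimals."""
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


language_pct = percentages(language_counts, TOTAL_ENTRIES)
split_pct = percentages(split_counts, TOTAL_ENTRIES)

# Both breakdowns should account for every entry.
print(sum(language_counts.values()) == TOTAL_ENTRIES)
print(sum(split_counts.values()) == TOTAL_ENTRIES)
print(split_pct)
```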

    Usage

    Option 1: Python

    $ pip install datasets

    from datasets import load_dataset
    dataset = load_dataset("ai4privacy/open-pii-masking-500k-ai4privacy")
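To illustrate the masking task itself, here is a minimal sketch that replaces labelled character spans with placeholder tokens. The span format (`start`, `end`, `label`), the label names, and the `mask_text` helper are assumptions made for this example. They are not taken from the dataset's schema, which should be checked on the dataset card.

```python
def mask_text(text, spans):
    """Replace each (start, end, label) span in text with a [LABEL] placeholder.

    Spans are processed right to left so earlier offsets stay valid.
    """
    masked = text
    for start, end, label in sorted(spans, key=lambda s: s[0], reverse=True):
        masked = masked[:start] + f"[{label}]" + masked[end:]
    return masked


# Hypothetical example; the sentence and label set are invented for illustration.
source = "Contact Jane Doe at jane@example.com."
spans = [(8, 16, "NAME"), (20, 36, "EMAIL")]
print(mask_text(source, spans))
```

Working from the rightmost span backwards avoids having to recompute offsets after each replacement changes the string length.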

    Compatible Machine Learning Tasks:

  2. Multi-modality medical image dataset for medical image processing in Python...

    • zenodo.org
    zip
    Updated Aug 12, 2024
    Cite
    Candace Moore; Giulia Crocioni (2024). Multi-modality medical image dataset for medical image processing in Python lesson [Dataset]. http://doi.org/10.5281/zenodo.13305760
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 12, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Candace Moore; Giulia Crocioni
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains a collection of medical imaging files for use in the "Medical Image Processing with Python" lesson, developed by the Netherlands eScience Center.

    The dataset includes:

    1. SimpleITK compatible files: MRI T1 and CT scans (training_001_mr_T1.mha, training_001_ct.mha), digital X-ray (digital_xray.dcm in DICOM format), neuroimaging data (A1_grayT1.nrrd, A1_grayT2.nrrd). Data have been downloaded from here.
    2. MRI data: a T2-weighted image (OBJECT_phantom_T2W_TSE_Cor_14_1.nii in NIfTI-1 format). Data have been downloaded from here.
    3. Example images for the machine learning lesson: chest X-rays (rotatechest.png, other_op.png), cardiomegaly example (cardiomegaly_cc0.png).
    4. Additional anonymized data: TBA

    These files represent various medical imaging modalities and formats commonly used in clinical research and practice. They are intended for educational purposes, allowing students to practice image processing techniques, machine learning applications, and statistical analysis of medical images using Python libraries such as scikit-image, pydicom, and SimpleITK.

  3. Anonymized Image Data for DWI and T2W MRI Registration Quality Assurance

    • figshare.com
    zip
    Updated Nov 9, 2022
    Cite
    Kareem Wahid; Mohamed Naser (2022). Anonymized Image Data for DWI and T2W MRI Registration Quality Assurance [Dataset]. http://doi.org/10.6084/m9.figshare.17162435.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 9, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Kareem Wahid; Mohamed Naser
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The following are pre-radiotherapy T2W and DWI MRI sequences in Digital Imaging and Communications in Medicine (DICOM) format for 20 patients curated from the MD Anderson Databases (NCT03145077).

    For each image set (T2W image and DWI image), ground truth segmentations for the left and right submandibular glands, left and right parotid glands, cervical spinal cord, brainstem, and primary gross tumor volume were manually generated by a trained physician expert (radiologist with > 5 years of experience in HNC). In a subset of five cases, segmentations for all structures in both sequences were also manually generated by three additional separate observers (two physicians and one medical student). All segmentations were generated in Velocity AI (v.3.0.1; Varian Medical Systems; Palo Alto, CA, USA) in DICOM RT structure format.

    DICOM data was anonymized using an in-house Python script that implements the RSNA CRP DICOM Anonymizer software. All files have had any DICOM header info and metadata containing PHI removed or replaced with dummy entries.
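As a rough illustration of the header scrubbing described above, the sketch below replaces or drops identifying fields in a header represented as a plain Python dict. This is not the RSNA anonymizer or the authors' script: the `PHI_DUMMIES` table, the `PHI_REMOVE` set, the `scrub_header` helper, and the choice of fields are all assumptions. Real DICOM files would need a DICOM library and a complete de-identification profile.

```python
# Hypothetical dummy values for a few identifying DICOM attributes.
PHI_DUMMIES = {
    "PatientName": "ANONYMOUS",
    "PatientID": "000000",
    "PatientBirthDate": "19000101",
}
# Hypothetical attributes that are dropped rather than replaced.
PHI_REMOVE = {"InstitutionName", "ReferringPhysicianName"}


def scrub_header(header):
    """Return a copy of header with PHI fields removed or set to dummies."""
    cleaned = {}
    for key, value in header.items():
        if key in PHI_REMOVE:
            continue  # drop the attribute entirely
        cleaned[key] = PHI_DUMMIES.get(key, value)
    return cleaned


# Made-up header for demonstration.
header = {
    "PatientName": "DOE^JANE",
    "PatientID": "MRN12345",
    "InstitutionName": "Example Hospital",
    "Modality": "MR",
}
print(scrub_header(header))
```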

  4. MNE-somato-data-bids (anonymized)

    • openneuro.org
    Updated Aug 31, 2020
    Cite
    Lauri Parkkonen; Stefan Appelhoff; Alexandre Gramfort; Mainak Jas; Richard Höchenberger (2020). MNE-somato-data-bids (anonymized) [Dataset]. http://doi.org/10.18112/openneuro.ds003104.v1.0.0
    Explore at:
    Dataset updated
    Aug 31, 2020
    Dataset provided by
    OpenNeuro (https://openneuro.org/)
    Authors
    Lauri Parkkonen; Stefan Appelhoff; Alexandre Gramfort; Mainak Jas; Richard Höchenberger
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    MNE-somato-data-bids

    This dataset contains the MNE-somato-data in BIDS format.

    The conversion can be reproduced through the Python script stored in the /code directory of this dataset. See the README in that directory.

    The /derivatives directory contains the outputs of running the FreeSurfer pipeline recon-all on the MRI data with no additional commandline options (only defaults were used):

    $ recon-all -i sub-01_T1w.nii.gz -s 01 -all

    After the recon-all call, there were further FreeSurfer calls from the MNE API:

    $ mne make_scalp_surfaces -s 01 --force
    $ mne watershed_bem -s 01

    The derivatives also contain the forward model *-fwd.fif, which was produced using the source space definition, a *-trans.fif file, and the boundary element model (=conductor model) that lives in freesurfer/subjects/01/bem/*-bem-sol.fif.

    The *-trans.fif file is not saved, but can be recovered from the anatomical landmarks in the sub-01/anat/T1w.json file and MNE-BIDS' function get_head_mri_transform.

    See: https://github.com/mne-tools/mne-bids for more information.

    Notes on FreeSurfer

    The FreeSurfer pipeline recon-all was rerun for the conversion of the somato data to BIDS format. This was needed to change the "somato" subject name to the BIDS subject label "01". Note that this is NOT "sub-01", because in BIDS, "sub-" is just a prefix, whereas "01" is the subject label.

  5. CMT1A-BioStampNPoint2023: Charcot-Marie-Tooth disease type 1A accelerometry...

    • data.niaid.nih.gov
    • search.dataone.org
    • +2more
    zip
    Updated Jun 8, 2023
    Cite
    Karthik Dinesh; Nicole White; Lindsay Baker; Janet Sowden; Steffen Behrens-Spraggins; Elizabeth P Wood; Julie L Charles; David Herrmann; Gaurav Sharma; Katy Eichinger (2023). CMT1A-BioStampNPoint2023: Charcot-Marie-Tooth disease type 1A accelerometry dataset from three wearable sensor study [Dataset]. http://doi.org/10.5061/dryad.p5hqbzktr
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    University of Rochester Medical Center
    University of Rochester
    Authors
    Karthik Dinesh; Nicole White; Lindsay Baker; Janet Sowden; Steffen Behrens-Spraggins; Elizabeth P Wood; Julie L Charles; David Herrmann; Gaurav Sharma; Katy Eichinger
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    The CMT1A-BioStampNPoint2023 dataset provides data from a wearable sensor accelerometry study conducted to study gait, balance, and activity in 15 individuals with Charcot-Marie-Tooth disease Type 1A (CMT1A). In addition to individuals with CMT1A, the dataset includes data for 15 controls who went through the same in-clinic study protocol as the CMT1A participants, with a substantial fraction (9) of the controls also participating in the in-home study protocol. For the CMT1A participants, data is provided for 15 participants for the baseline visit and associated home recording duration and, additionally, for a subset of 12 of these participants data is also provided for a 12-month longitudinal visit and associated home recording duration. For controls, no longitudinal data is provided as none was recorded. The data were acquired using lightweight MC 10 BioStamp NPoint sensors (MC 10 Inc, Lexington, MA), three of which were attached to each participant for gathering data over a roughly one-day interval. For additional details, see the description in the "README.md" included with the dataset.

    Methods

    The dataset contains data from wearable sensors and clinical data. The wearable sensor data was acquired using wearable sensors and the clinical data was extracted from the clinical record. The sensor data has not been processed per se, but the start of the recording time has been anonymized to comply with HIPAA requirements. Both the sensor data and the clinical data passed through a Python program for the aforementioned time anonymization and for standard formatting. Additional details of the time anonymization are provided in the file "README.md" included with the dataset.
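One common way to anonymize recording start times is to convert absolute timestamps into offsets from the first sample. The sketch below shows that idea only. The actual procedure used for this dataset is described in its README.md and may differ, and the `to_relative_seconds` helper and the sample values are hypothetical.

```python
from datetime import datetime


def to_relative_seconds(timestamps):
    """Convert absolute timestamps to seconds elapsed since the first one."""
    start = timestamps[0]
    return [(t - start).total_seconds() for t in timestamps]


# Hypothetical accelerometer sample times.
samples = [
    datetime(2021, 3, 4, 9, 15, 0),
    datetime(2021, 3, 4, 9, 15, 1),
    datetime(2021, 3, 4, 9, 17, 30),
]
print(to_relative_seconds(samples))
```

Relative offsets keep the spacing between samples intact, which is what gait and activity analyses typically depend on, while the calendar date and clock time are no longer recoverable from the values themselves.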

  6. Anonymized source data files for figures in: Recurrent processes support a...

    • datadryad.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Sep 4, 2020
    Cite
    Laura Gwilliams; Jean-Remi King (2020). Anonymized source data files for figures in: Recurrent processes support a cascade of hierarchical decisions [Dataset]. http://doi.org/10.5061/dryad.70rxwdbtr
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 4, 2020
    Dataset provided by
    Dryad
    Authors
    Laura Gwilliams; Jean-Remi King
    Time period covered
    2020
    Description

    The files are of two formats: .npy and .stc.

    .npy files can be read using the NumPy module in Python, e.g.:

    import numpy as np

    data = np.load('file_name.npy')

    https://numpy.org/doc/stable/reference/generated/numpy.load.html

    .stc files can be read using the MNE module in Python, e.g.:

    from mne import read_source_estimate

    stc = read_source_estimate('stc_name-lh.stc')

    Note that reading in the data from just one hemisphere file will automatically read the data for the other one too.

    https://mne.tools/stable/generated/mne.read_source_estimate.html

  7. Dataset for a machine learning tool to improve lymph node staging with...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 16, 2024
    Cite
    Rogasch, Julian M.M. (2024). Dataset for a machine learning tool to improve lymph node staging with FDG-PET/CT [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7094286
    Explore at:
    Dataset updated
    Jul 16, 2024
    Dataset authored and provided by
    Rogasch, Julian M.M.
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This upload provides Open Data associated with the publication "A machine learning tool to improve prediction of mediastinal lymph node metastases in non-small cell lung cancer using routinely obtainable [18F]FDG-PET/CT parameters" by Rogasch JMM et al. (2022).

    The upload contains the anonymized dataset with 10 features necessary for the final GBM model that was presented in the publication. However, the original full dataset with 40 features was excluded from this Open Data repository because it may not comply with strict rules of data anonymization. The full dataset can be obtained from the corresponding author (julian.rogasch@charite.de) upon reasonable request.

    Besides the dataset, this upload provides the original Python and R scripts that were used, as well as their output.

    A description of all files can be found in "content_description_2022_11_19.txt".

    A user-friendly web tool that implements the final machine learning model can be found here: PET_LN_calculator

  8. Dataset: Deenz Dark Triad Scale – Poland

    • sodha.be
    • datacatalogue.cessda.eu
    tsv
    Updated Feb 20, 2025
    + more versions
    Cite
    Deen Mohd Dar (2025). Dataset: Deenz Dark Triad Scale – Poland [Dataset]. http://doi.org/10.34934/DVN/4WYRN9
    Explore at:
    Available download formats: tsv (6069)
    Dataset updated
    Feb 20, 2025
    Dataset provided by
    Social Sciences and Digital Humanities Archive – SODHA
    Authors
    Deen Mohd Dar
    License

    https://www.sodha.be/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.34934/DVN/4WYRN9

    Area covered
    Poland
    Description

    This dataset comes from a study conducted in Poland with 44 participants. The goal of the study was to measure personality traits known as the Dark Triad. The Dark Triad consists of three key traits that influence how people think and behave towards others. These traits are Machiavellianism, Narcissism, and Psychopathy.

    Machiavellianism refers to a person's tendency to manipulate others and be strategic in their actions. People with high Machiavellianism scores often believe that deception is necessary to achieve their goals. Narcissism is related to self-importance and the need for admiration. Individuals with high narcissism scores may see themselves as special and expect others to recognize their greatness. Psychopathy is linked to impulsive behavior and a lack of empathy. People with high psychopathy scores tend to be less concerned about the feelings of others and may take risks without worrying about consequences.

    Each participant in the dataset answered 30 questions, divided into three sections, with 10 questions per trait. The answers were recorded using a Likert scale from 1 to 5, where:

    • 1 means "Strongly Disagree"
    • 2 means "Disagree"
    • 3 means "Neutral"
    • 4 means "Agree"
    • 5 means "Strongly Agree"

    This scale helps measure how much a person agrees with statements related to each of the three traits.

    The dataset also includes basic demographic information. Each participant has a unique ID (such as P001, P002, etc.) to keep their identity anonymous. The dataset records their age, which ranges from 18 to 60 years old, and their gender, which is categorized as "Male," "Female," or "Other."

    The responses in the dataset are realistic, with small variations to reflect natural differences in personality. On average, participants scored around 3.2 for Machiavellianism, meaning most people showed a moderate tendency to be strategic or manipulative. The average Narcissism score was 3.5, indicating that some participants valued themselves highly and sought admiration. The average Psychopathy score was 2.8, showing that most participants did not strongly exhibit impulsive or reckless behaviors.

    This dataset can be useful for many purposes. Researchers can use it to analyze personality traits and see how they compare across different groups. The data can also be used for cross-cultural comparisons, allowing researchers to study how personality traits in Poland differ from those in other countries. Additionally, psychologists can use this data to understand how Dark Triad traits influence behavior in everyday life.

    The dataset is saved in a CSV format, which makes it easy to open in programs like Excel, SPSS, or Python for further analysis. Because the data is structured and anonymized, it can be used safely for research without revealing personal information.

    In summary, this dataset provides valuable insights into personality traits among people in Poland. It allows researchers to explore how Machiavellianism, Narcissism, and Psychopathy vary among individuals. By studying these traits, psychologists can better understand human behavior and how it affects relationships, decision-making, and personal success.
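A per-trait score of the kind reported above can be obtained by averaging the 10 Likert responses that belong to each trait. A minimal sketch, assuming the 30 answers are ordered as Machiavellianism, then Narcissism, then Psychopathy (the description does not state the column order, so the file header should be checked before relying on this):

```python
from statistics import mean

TRAITS = ("Machiavellianism", "Narcissism", "Psychopathy")
ITEMS_PER_TRAIT = 10


def trait_scores(responses):
    """Average the 1-5 Likert answers in consecutive blocks of 10 items."""
    if len(responses) != len(TRAITS) * ITEMS_PER_TRAIT:
        raise ValueError("expected 30 responses")
    if any(not 1 <= r <= 5 for r in responses):
        raise ValueError("responses must be on the 1-5 Likert scale")
    return {
        trait: mean(responses[i * ITEMS_PER_TRAIT:(i + 1) * ITEMS_PER_TRAIT])
        for i, trait in enumerate(TRAITS)
    }


# Made-up participant: all 3s, then all 4s, then all 2s.
answers = [3] * 10 + [4] * 10 + [2] * 10
print(trait_scores(answers))
```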

  9. Data from: Thy-Wise: An interpretable machine learning model for the...

    • figshare.com
    zip
    Updated Aug 2, 2022
    Cite
    Zhe Jin; Shufang Pei; Lizhu Ouyang; Lu Zhang; Xiaokai Mo; Qiuying Chen; Jingjing You; Luyan Chen; Bin Zhang; Shuixing Zhang (2022). Thy-Wise: An interpretable machine learning model for the evaluation of thyroid nodules [Dataset]. http://doi.org/10.6084/m9.figshare.20417895.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 2, 2022
    Dataset provided by
    figshare
    Authors
    Zhe Jin; Shufang Pei; Lizhu Ouyang; Lu Zhang; Xiaokai Mo; Qiuying Chen; Jingjing You; Luyan Chen; Bin Zhang; Shuixing Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data Description: The database contains ultrasound images of thyroid nodules that were finally included in the study. As the aim of this study was to identify nodules as benign or malignant, all nodules were placed in two zip files according to their pathological nature: benign_after.zip and malignant_after.zip. After unzipping the zip package and opening the folder, you can see several folders named by "pathological nature + number", each folder corresponds to a thyroid nodule and contains its ultrasound images collected in a single examination.

    Ethical Approval: This retrospective study was approved by the institutional Ethics Committees of the First Affiliated Hospital of Jinan University, and the requirement for informed consent was waived.

    Sensitive Information Protection: All sensitive information contained in the image, including the patient's personal information, the hospital visited, and the time of the visit, has been removed using the CV2 toolkit from python for the purpose of anonymization.

    Processing pipeline and analysis steps: All the annotations in the images and clips were eliminated before review. US images were evaluated in a blinded fashion, with no US or pathology reports available, by two board-certified radiologists (with more than 10 years of experience in thyroid sonography) independently. Nodule size was measured as the maximal dimension on US images and the five gray-scale US categories were reviewed according to the ACR TI-RADS lexicon (5): composition, echogenicity, shape, margin, and echogenic foci. In the ACR TI-RADS, the TI-RADS risk level for nodules was determined by the total score of the five US categories, ranging from TR1 (benign) to TR5 (highly suspicious).

  10. Dataset for: "Using AI to detect misinformation and emotions on Telegram: a...

    • zenodo.org
    csv
    Updated Jun 12, 2025
    Cite
    Andrés Montoro Montarroso; Javier Cantón-Correa (2025). Dataset for: "Using AI to detect misinformation and emotions on Telegram: a comparison with the media" [Dataset]. http://doi.org/10.5281/zenodo.15640048
    Explore at:
    Available download formats: csv
    Dataset updated
    Jun 12, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andrés Montoro Montarroso; Javier Cantón-Correa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Nov 30, 2023
    Description

    This dataset contains the raw data used in the article “Using AI to detect misinformation and emotions on Telegram: a comparison with the media”, accepted for publication in index.comunicación. The data includes:
    • Telegram dataset (tg_messages.csv): 54,456 posts extracted from 33 public Telegram channels between 23 July and 16 November 2023, related to the political debate around the Amnesty Law in Spain. Each entry includes message metadata such as channel, date, views, and content.
    • News headlines dataset (Titulares.csv): 46,022 news headlines mentioning “amnesty”, extracted from 377 Spanish national media outlets indexed in MediaCloud, during the same period.
    • Analysis scripts: Available upon request or pending publication in the article’s supplementary materials.


    The data was used for topic modelling, sentiment and emotion detection with NLP techniques based on Python libraries like BERTopic and pysentimiento. All data is anonymized and publicly accessible or derived from open sources.

  11. Data from: CO-Fun: A German Dataset on Company Outsourcing in Fund...

    • zenodo.org
    zip
    Updated Sep 2024
    Cite
    Neda Foroutan; Markus Schröder; Andreas Dengel (2024). CO-Fun: A German Dataset on Company Outsourcing in Fund Prospectuses for Named Entity Recognition and Relation Extraction [Dataset]. http://doi.org/10.5281/zenodo.12745116
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Neda Foroutan; Markus Schröder; Andreas Dengel
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jul 2, 2024
    Description

    The process of cyber mapping gives insights into relationships among financial entities and service providers. Centered around the outsourcing practices of companies within fund prospectuses in Germany, we introduce a dataset specifically designed for named entity recognition and relation extraction tasks. The labeling of 948 sentences was carried out by three experts, yielding 5,969 annotations for four entity types (Outsourcing, Company, Location and Software) and 4,102 relation annotations (Outsourcing–Company, Company–Location). Furthermore, state-of-the-art deep learning models were trained on this dataset to recognize entities and extract relations. This repository is the anonymized version of the dataset, along with guidelines and the code used for model training.

    In the following the content of each file is explained:

    The CO-Fun-1.0-anonymized.jsonl file contains the raw data of CO-Fun as records formatted in JSON. Each entry holds the annotated text in the form of HTML. The annotations for each named entity in the text are specified with span tags. Below is an example of an entry in the raw data:

    {
      "datetime": "2023-05-04T14:15:54.501875783",
      "entities": [
        {
          "color": "rgb(255, 0, 0)",
          "text": "Ermittlung der täglichen und jährlichen Steuerdaten",
          "id": "255c1d4a-d9b0-4fff-8779-6a68f803ce51",
          "type": "Auslagerung"
        },
        {
          "color": "rgb(0, 0, 255)",
          "text": "tba - the beauty aside GmbH",
          "id": "fad78727-1645-4b39-9478-daecb3b4bd2b",
          "type": "Unternehmen"
        }
      ],
      "text": "Die Ermittlung der täglichen und jährlichen Steuerdaten für die Fonds wurde auf die tba - the beauty aside GmbH ausgelagert.",
      "relations": [
        {
          "src": {
            "color": "rgb(255, 0, 0)",
            "text": "Ermittlung der täglichen und jährlichen Steuerdaten",
            "id": "255c1d4a-d9b0-4fff-8779-6a68f803ce51",
            "type": "Auslagerung"
          },
          "trg": {
            "color": "rgb(0, 0, 255)",
            "text": "tba - the beauty aside GmbH",
            "id": "fad78727-1645-4b39-9478-daecb3b4bd2b",
            "type": "Unternehmen"
          },
          "type": "Auslagerung-Unternehmen"
        }
      ]
    }
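Records in this shape can be read with Python's standard `json` module, one object per line. The sketch below uses a shortened, hand-written record with the same keys as the example entry (`entities`, `relations`, `src`, `trg`, `type`, `text`). The `relation_triples` helper is hypothetical and not part of the published code.

```python
import json


def relation_triples(record):
    """Return (source text, relation type, target text) for each relation."""
    return [
        (rel["src"]["text"], rel["type"], rel["trg"]["text"])
        for rel in record.get("relations", [])
    ]


# One JSONL line, abbreviated from the example entry.
line = json.dumps({
    "entities": [
        {"text": "Ermittlung der täglichen und jährlichen Steuerdaten",
         "type": "Auslagerung"},
        {"text": "tba - the beauty aside GmbH", "type": "Unternehmen"},
    ],
    "relations": [{
        "src": {"text": "Ermittlung der täglichen und jährlichen Steuerdaten",
                "type": "Auslagerung"},
        "trg": {"text": "tba - the beauty aside GmbH", "type": "Unternehmen"},
        "type": "Auslagerung-Unternehmen",
    }],
})

record = json.loads(line)
for src, rel_type, trg in relation_triples(record):
    print(f"{src} --[{rel_type}]--> {trg}")
```

For the full file, the same helper could be applied to each line read from CO-Fun-1.0-anonymized.jsonl, skipping blank lines.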

    CO-Fun_Annotation-Guideline-EN.pdf is a graphical user interface in German to annotate a sentence with named entities and relations.

    The prepared-data-and-code folder consists of datasets and Python code files for Named Entity Recognition (NER) and Relation Extraction tasks. The training, development, and test sets are provided in text format for the CRF model, and in text and SpaCy formats for the BERT and RoBERTa models.

  12. Cebulka (Polish dark web cryptomarket and image board) messages data

    • data.niaid.nih.gov
    • zenodo.org
    Updated Mar 18, 2024
    Cite
    Shi, Haitao (2024). Cebulka (Polish dark web cryptomarket and image board) messages data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10810938
    Explore at:
    Dataset updated
    Mar 18, 2024
    Dataset provided by
    Siuda, Piotr
    Cheba, Patrycja
    Świeca, Leszek
    Shi, Haitao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    General Information

    1. Title of Dataset

    Cebulka (Polish dark web cryptomarket and image board) messages data.

    2. Data Collectors

    Haitao Shi (The University of Edinburgh, UK); Patrycja Cheba (Jagiellonian University); Leszek Świeca (Kazimierz Wielki University in Bydgoszcz, Poland).

    3. Funding Information

    The dataset is part of the research supported by the Polish National Science Centre (Narodowe Centrum Nauki) grant 2021/43/B/HS6/00710.

    Project title: “Rhizomatic networks, circulation of meanings and contents, and offline contexts of online drug trade” (2022-2025; PLN 956 620; funding institution: Polish National Science Centre [NCN], call: OPUS 22; Principal Investigator: Piotr Siuda [Kazimierz Wielki University in Bydgoszcz, Poland]).

    Data Collection Context

    1. Data Source

    Polish dark web cryptomarket and image board called Cebulka (http://cebulka7uxchnbpvmqapg5pfos4ngaxglsktzvha7a5rigndghvadeyd.onion/index.php).

    2. Purpose

    This dataset was developed within the abovementioned project. The project focuses on studying internet behavior concerning disruptive actions, particularly emphasizing the online narcotics market in Poland. The research seeks to (1) investigate how the open internet, including social media, is used in the drug trade; (2) outline the significance of darknet platforms in the distribution of drugs; and (3) explore the complex exchange of content related to the drug trade between the surface web and the darknet, along with understanding meanings constructed within the drug subculture.

    Within this context, Cebulka is identified as a critical digital venue in Poland’s dark web illicit substances scene. Besides serving as a marketplace, it plays a crucial role in shaping the narratives and discussions prevalent in the drug subculture. The dataset has proved to be a valuable tool for performing the analyses needed to achieve the project’s objectives.

    Data Content

    1. Data Description

    The data was collected in three periods, i.e., in January 2023, June 2023, and January 2024.

    The dataset comprises a sample of messages posted on Cebulka from its inception until January 2024 (including all the messages with drug advertisements). These messages include the initial posts that start each thread and the subsequent posts (replies) within those threads. The dataset is organized into two directories. The “cebulka_adverts” directory contains posts related to drug advertisements (both advertisements and comments). In contrast, the “cebulka_community” directory holds a sample of posts from other parts of the cryptomarket, i.e., those not related directly to trading drugs but rather focusing on discussing illicit substances. The dataset consists of 16,842 posts.

    2. Data Cleaning, Processing, and Anonymization

    The data has been cleaned and processed using regular expressions in Python. Additionally, all personal information was removed through regular expressions. The data has been hashed to exclude all identifiers related to instant messaging apps and email addresses. Furthermore, all usernames appearing in messages have been eliminated.

    3. File Formats and Variables/Fields

    The dataset consists of the following files:

    Zipped .txt files (“cebulka_adverts.zip” and “cebulka_community.zip”) containing all messages. These files are organized into individual directories that mirror the folder structure found on Cebulka.

    Two .csv files that list all the messages, including file names and the content of each post. The first .csv lists messages from “cebulka_adverts.zip,” and the second .csv lists messages from “cebulka_community.zip.”
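The cleaning and anonymization step described above can be pictured roughly as follows: a regular expression finds identifiers, and each match is replaced by a one-way hash. This is a sketch under assumptions, not the project's script. The email pattern, the salt, the truncation to 12 hex characters, and the `hash_identifiers` helper are all hypothetical.

```python
import hashlib
import re

# Simplified email pattern for illustration; real addresses are more varied.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
SALT = b"example-salt"  # hypothetical; a real salt must be kept secret


def hash_identifiers(text):
    """Replace every email address in text with a salted SHA-256 digest."""
    def _replace(match):
        digest = hashlib.sha256(SALT + match.group(0).encode("utf-8"))
        return f"<id:{digest.hexdigest()[:12]}>"
    return EMAIL_RE.sub(_replace, text)


# Made-up message for demonstration.
message = "Write to seller@example.org for details."
print(hash_identifiers(message))
```

Because the same address always maps to the same digest, repeated mentions of one identifier stay linkable across posts without exposing the identifier itself.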

    Ethical Considerations

    1. Ethics Statement

    A set of data handling policies aimed at ensuring safety and ethics has been outlined in the following paper:

    Harviainen, J.T., Haasio, A., Ruokolainen, T., Hassan, L., Siuda, P., Hamari, J. (2021). Information Protection in Dark Web Drug Markets Research [in:] Proceedings of the 54th Hawaii International Conference on System Sciences, HICSS 2021, Grand Hyatt Kauai, Hawaii, USA, 4-8 January 2021, Maui, Hawaii, (ed.) Tung X. Bui, Honolulu, HI, pp. 4673-4680.

    The primary safeguard was early-stage, irreversible hashing of usernames and identifiers in the messages, performed by automated systems. Because automatic name removal might not catch every identifier, the data was also reviewed manually to ensure compliance with research ethics and thorough anonymization.

  13. Dopek.eu (Polish clear web and dark web message board) messages data

    • data.niaid.nih.gov
    • zenodo.org
    Updated Mar 18, 2024
    Świeca, Leszek (2024). Dopek.eu (Polish clear web and dark web message board) messages data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10810554
    Dataset updated
    Mar 18, 2024
    Dataset provided by
    Siuda, Piotr
    Świeca, Leszek
    Shi, Haitao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    General Information

    1. Title of Dataset

    Dopek.eu (Polish clear web and dark web message board) messages data.

    1. Data Collectors

    Haitao Shi (The University of Edinburgh, UK); Leszek Świeca (Kazimierz Wielki University in Bydgoszcz, Poland).

    1. Funding Information

    The dataset is part of the research supported by the Polish National Science Centre (Narodowe Centrum Nauki) grant 2021/43/B/HS6/00710.

    Project title: “Rhizomatic networks, circulation of meanings and contents, and offline contexts of online drug trade” (2022-2025; PLN 956 620; funding institution: Polish National Science Centre [NCN], call: OPUS 22; Principal Investigator: Piotr Siuda [Kazimierz Wielki University in Bydgoszcz, Poland]).

    Data Collection Context

    1. Data Source

    Clear web and dark web message board called dopek.eu (https://dopek.eu/).

    1. Purpose

    This dataset was developed within the abovementioned project. The project delves into internet dynamics within disruptive activities, specifically focusing on the online drug trade in Poland. It aims to (1) examine the utilization of the open internet, including social media, in the drug trade; (2) delineate the role of darknet environments in narcotics distribution; and (3) uncover the intricate flow of drug trade-related content and its meanings between the open web and the darknet, and how these meanings are shaped within the so-called drug subculture.

    The dopek.eu forum is a pivotal online space on the Polish internet, serving as a hub for trading, discussions, and the exchange of knowledge and experiences concerning the use of so-called new psychoactive substances (designer drugs). The dataset has been used in analyses addressing the project goals listed above.

    1. Collection Method

    The dataset was compiled using the Scrapy framework, a web crawling and scraping library for Python. This tool facilitated systematic content extraction from the targeted message board.

    1. Collection Date

    The data was collected in October 2023.

    Data Content

    1. Data Description

    The dataset comprises all messages posted on dopek.eu from its inception until October 2023. These messages include the initial posts that start each thread and the subsequent posts (replies) within those threads. A .txt file has been prepared detailing the structure of the message board folders from which the posts were extracted. The dataset includes 171,121 posts.

    1. Data Cleaning, Processing, and Anonymization

    The data has been cleaned and processed using regular expressions in Python, which were also used to remove all personal information. Identifiers related to instant messaging apps and email addresses have been hashed, and all usernames appearing in messages have been removed.
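The kind of regex-based cleaning described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the collectors' actual pipeline: the email pattern, the salt, the `<EMAIL:...>` and `<USER>` placeholders, and the function names are all hypothetical, and only email addresses and a known list of usernames are handled.

```python
import hashlib
import re

# Deliberately simple email pattern; real-world addresses may need a stricter one.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def hash_identifier(value: str, salt: str) -> str:
    """Return a short, salted SHA-256 digest of an identifier."""
    digest = hashlib.sha256((salt + value.lower()).encode("utf-8")).hexdigest()
    return digest[:12]


def anonymize(text: str, usernames: set, salt: str = "hypothetical-salt") -> str:
    # Replace each email address with a token carrying its hash.
    text = EMAIL_RE.sub(lambda m: f"<EMAIL:{hash_identifier(m.group(0), salt)}>", text)
    # Replace known usernames (whole words only, case-insensitive), longest first.
    for name in sorted(usernames, key=len, reverse=True):
        text = re.sub(rf"\b{re.escape(name)}\b", "<USER>", text, flags=re.IGNORECASE)
    return text


sample = "Write to user42@example.com or ask user42 on the board."
cleaned = anonymize(sample, {"user42"})
```

Emails are processed before usernames so that an address containing a username is hashed as a whole rather than partially rewritten. Pattern matching of this kind can miss unusual identifiers, which is one reason a manual review pass remains useful.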

    1. File Formats and Variables/Fields

    The dataset consists of the following types of files:

    Zipped .txt files (dopek.zip) containing all messages (posts).

    A .csv file that lists all the messages, including file names and the content of each post.
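A minimal sketch of loading such a .csv into a lookup from file name to post text is shown below. The column names `file_name` and `content` are assumptions made for illustration; the actual header of the published file should be checked first.

```python
import csv
import io


def load_posts(csv_file, name_col="file_name", text_col="content"):
    """Map each file name to its post text. Column names are hypothetical defaults."""
    reader = csv.DictReader(csv_file)
    return {row[name_col]: row[text_col] for row in reader}


# In-memory stand-in for the real .csv (hypothetical header and rows).
demo_csv = io.StringIO(
    "file_name,content\n"
    "thread_1/post_1.txt,first post\n"
    'thread_1/post_2.txt,"a reply, with a comma"\n'
)
posts = load_posts(demo_csv)
```

With a file on disk, opening it as `open(path, newline="", encoding="utf-8")` lets the `csv` module handle quoted fields that contain line breaks, which forum posts often do.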

    Accessibility and Usage

    1. Access Conditions

    The data can be accessed without any restrictions.

    1. Related Documentation

    Attached are .txt files detailing the tree of folders for “dopek.zip”.

    Ethical Considerations

    1. Ethics Statement

    A set of data handling policies aimed at ensuring safety and ethics has been outlined in the following paper:

    Harviainen, J.T., Haasio, A., Ruokolainen, T., Hassan, L., Siuda, P., Hamari, J. (2021). Information Protection in Dark Web Drug Markets Research [in:] Proceedings of the 54th Hawaii International Conference on System Sciences, HICSS 2021, Grand Hyatt Kauai, Hawaii, USA, 4-8 January 2021, Maui, Hawaii, (ed.) Tung X. Bui, Honolulu, HI, pp. 4673-4680.

    The primary safeguard was early-stage, irreversible hashing of usernames and identifiers in the posts, performed by automated systems. Because scraping and automatic name removal might not catch every identifier, the data was also reviewed manually to ensure compliance with research ethics and thorough anonymization.


Michael Anthony (2025). open-pii-masking-500k-ai4privacy [Dataset]. https://www.kaggle.com/datasets/mikedoes/open-pii-masking-500k-ai4privacy

open-pii-masking-500k-ai4privacy

🌍 World's largest open dataset for privacy masking 🌎

Dataset updated
Mar 17, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Michael Anthony
License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically

Description

🌍 World's largest open dataset for privacy masking 🌎

The dataset is useful to train and evaluate models to remove personally identifiable and sensitive information from text, especially in the context of AI assistants and LLMs.

Task Showcase of Privacy Masking

Dataset Analytics 📊 - ai4privacy/open-pii-masking-500k-ai4privacy

p5y Data Analytics

  • Total Entries: 580,227
  • Total Tokens: 19,199,982
  • Average Source Text Length: 17.37 words
  • Total PII Labels: 5,705,973
  • Number of Unique PII Classes: 20 (Open PII Labelset)
  • Unique Identity Values: 704,215

Language Distribution Analytics

**Number of Unique Languages**: 8

| Language | Count | Percentage |
|----------|-------|------------|
| English (en) 🇺🇸🇬🇧🇨🇦🇮🇳 | 150,693 | 25.97% |
| French (fr) 🇫🇷🇨🇭🇨🇦 | 112,136 | 19.33% |
| German (de) 🇩🇪🇨🇭 | 82,384 | 14.20% |
| Spanish (es) 🇪🇸 🇲🇽 | 78,013 | 13.45% |
| Italian (it) 🇮🇹🇨🇭 | 68,824 | 11.86% |
| Dutch (nl) 🇳🇱 | 26,628 | 4.59% |
| Hindi (hi)* 🇮🇳 | 33,963 | 5.85% |
| Telugu (te)* 🇮🇳 | 27,586 | 4.75% |

*these languages are in experimental stages

Region Distribution Analytics

**Number of Unique Regions**: 11

| Region | Count | Percentage |
|--------|-------|------------|
| Switzerland (CH) 🇨🇭 | 112,531 | 19.39% |
| India (IN) 🇮🇳 | 99,724 | 17.19% |
| Canada (CA) 🇨🇦 | 74,733 | 12.88% |
| Germany (DE) 🇩🇪 | 41,604 | 7.17% |
| Spain (ES) 🇪🇸 | 39,557 | 6.82% |
| Mexico (MX) 🇲🇽 | 38,456 | 6.63% |
| France (FR) 🇫🇷 | 37,886 | 6.53% |
| Great Britain (GB) 🇬🇧 | 37,092 | 6.39% |
| United States (US) 🇺🇸 | 37,008 | 6.38% |
| Italy (IT) 🇮🇹 | 35,008 | 6.03% |
| Netherlands (NL) 🇳🇱 | 26,628 | 4.59% |

Machine Learning Task Analytics

| Split | Count | Percentage |
|-------|-------|------------|
| **Train** | 464,150 | 79.99% |
| **Validate** | 116,077 | 20.01% |
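As a quick sanity check on the split figures, the short sketch below recomputes the percentages from the reported counts and confirms that the two splits add up to the reported total of 580,227 entries. The variable names are illustrative only.

```python
# Counts as reported in the split table and the "Total Entries" figure.
splits = {"Train": 464_150, "Validate": 116_077}
total_entries = 580_227

# The two splits should partition the dataset exactly.
assert sum(splits.values()) == total_entries

# Share of each split, as a percentage rounded to two decimals.
shares = {name: round(100 * count / total_entries, 2) for name, count in splits.items()}
```

The recomputed shares match the reported 79.99% and 20.01%, so the split is roughly 80/20 rather than exactly so.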

Usage

Option 1: Python

Terminal:

    pip install datasets

Python:

    from datasets import load_dataset
    dataset = load_dataset("ai4privacy/open-pii-masking-500k-ai4privacy")

Compatible Machine Learning Tasks:
