13 datasets found

open-pii-masking-500k-ai4privacy
kaggle.com
dataverse.harvard.edu
Updated Mar 17, 2025
Cite
Michael Anthony (2025). open-pii-masking-500k-ai4privacy [Dataset]. https://www.kaggle.com/datasets/mikedoes/open-pii-masking-500k-ai4privacy
Explore at:
Croissant: Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 17, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Michael Anthony
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
🌍 World's largest open dataset for privacy masking 🌎

The dataset is useful to train and evaluate models to remove personally identifiable and sensitive information from text, especially in the context of AI assistants and LLMs.

Dataset Analytics 📊 - ai4privacy/open-pii-masking-500k-ai4privacy

p5y Data Analytics

Total Entries: 580,227

Total Tokens: 19,199,982

Average Source Text Length: 17.37 words

Total PII Labels: 5,705,973

Number of Unique PII Classes: 20 (Open PII Labelset)

Unique Identity Values: 704,215

Language Distribution Analytics

**Number of Unique Languages**: 8

| Language | Count | Percentage |
|--------------------|----------|------------|
| English (en) 🇺🇸🇬🇧🇨🇦🇮🇳 | 150,693 | 25.97% |
| French (fr) 🇫🇷🇨🇭🇨🇦 | 112,136 | 19.33% |
| German (de) 🇩🇪🇨🇭 | 82,384 | 14.20% |
| Spanish (es) 🇪🇸 🇲🇽 | 78,013 | 13.45% |
| Italian (it) 🇮🇹🇨🇭 | 68,824 | 11.86% |
| Dutch (nl) 🇳🇱 | 26,628 | 4.59% |
| Hindi (hi)* 🇮🇳 | 33,963 | 5.85% |
| Telugu (te)* 🇮🇳 | 27,586 | 4.75% |

*these languages are in experimental stages

Region Distribution Analytics

**Number of Unique Regions**: 11

| Region | Count | Percentage |
|-----------------------|----------|------------|
| Switzerland (CH) 🇨🇭 | 112,531 | 19.39% |
| India (IN) 🇮🇳 | 99,724 | 17.19% |
| Canada (CA) 🇨🇦 | 74,733 | 12.88% |
| Germany (DE) 🇩🇪 | 41,604 | 7.17% |
| Spain (ES) 🇪🇸 | 39,557 | 6.82% |
| Mexico (MX) 🇲🇽 | 38,456 | 6.63% |
| France (FR) 🇫🇷 | 37,886 | 6.53% |
| Great Britain (GB) 🇬🇧 | 37,092 | 6.39% |
| United States (US) 🇺🇸 | 37,008 | 6.38% |
| Italy (IT) 🇮🇹 | 35,008 | 6.03% |
| Netherlands (NL) 🇳🇱 | 26,628 | 4.59% |

Machine Learning Task Analytics

| Split | Count | Percentage |
|-------------|----------|------------|
| **Train** | 464,150 | 79.99% |
| **Validate**| 116,077 | 20.01% |

Usage

Option 1: Python

$ pip install datasets

from datasets import load_dataset

dataset = load_dataset("ai4privacy/open-pii-masking-500k-ai4privacy")

Compatible Machine Learning Tasks:

Token classification. Check out Hugging Face's guide on token classification.

ALBERT, BERT, BigBird, BioGpt, BLOOM, BROS, CamemBERT, CANINE, ConvBERT, Data2VecText, DeBERTa, DeBERTa-v2, DistilBERT, ELECTRA, ERNIE, ErnieM, ESM, Falcon, FlauBERT, FNet, Funnel Transformer, GPT-Sw3, OpenAI GPT-2, GPTBigCode, GPT Neo, GPT NeoX, I-BERT, [LayoutLM](http...
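Token classification models are usually trained on one tag per token. As a minimal sketch of how character-level PII spans could be turned into such tags, the example below uses whitespace tokenization and the common BIO scheme. The span layout `(start, end, label)` and the label names `NAME` and `EMAIL` are assumptions made for illustration; they are not taken from the dataset's schema or its Open PII Labelset.

```python
from typing import List, Tuple

# Hypothetical span layout: (start_char, end_char, label).
Span = Tuple[int, int, str]


def whitespace_tokens(text: str) -> List[Tuple[int, int, str]]:
    """Split on whitespace and keep each token's character offsets."""
    tokens = []
    pos = 0
    for piece in text.split():
        start = text.index(piece, pos)
        end = start + len(piece)
        tokens.append((start, end, piece))
        pos = end
    return tokens


def bio_labels(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Assign B-/I-/O tags to tokens based on overlap with labeled spans."""
    tagged = []
    for start, end, token in whitespace_tokens(text):
        tag = "O"
        for s_start, s_end, label in spans:
            if start < s_end and end > s_start:  # token overlaps the span
                # The first token of a span gets B-, later tokens get I-.
                tag = ("B-" if start <= s_start else "I-") + label
                break
        tagged.append((token, tag))
    return tagged


# Illustrative input only; the labels are placeholders.
example_text = "Contact Maria Rossi at maria@example.com today"
example_spans = [(8, 19, "NAME"), (23, 40, "EMAIL")]
tagged_example = bio_labels(example_text, example_spans)
```

A real pipeline would normally align spans to the subword tokens of the chosen model rather than to whitespace tokens; the overlap test stays the same.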
Multi-modality medical image dataset for medical image processing in Python...
zenodo.org
zip
Updated Aug 12, 2024
Cite
Candace Moore; Giulia Crocioni (2024). Multi-modality medical image dataset for medical image processing in Python lesson [Dataset]. http://doi.org/10.5281/zenodo.13305760
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.13305760
Dataset updated
Aug 12, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Candace Moore; Giulia Crocioni
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains a collection of medical imaging files for use in the "Medical Image Processing with Python" lesson, developed by the Netherlands eScience Center.

The dataset includes:

SimpleITK compatible files: MRI T1 and CT scans (training_001_mr_T1.mha, training_001_ct.mha), digital X-ray (digital_xray.dcm in DICOM format), neuroimaging data (A1_grayT1.nrrd, A1_grayT2.nrrd). Data have been downloaded from here.

MRI data: a T2-weighted image (OBJECT_phantom_T2W_TSE_Cor_14_1.nii in NIfTI-1 format). Data have been downloaded from here.

Example images for the machine learning lesson: chest X-rays (rotatechest.png, other_op.png), cardiomegaly example (cardiomegaly_cc0.png).

Additional anonymized data: TBA

These files represent various medical imaging modalities and formats commonly used in clinical research and practice. They are intended for educational purposes, allowing students to practice image processing techniques, machine learning applications, and statistical analysis of medical images using Python libraries such as scikit-image, pydicom, and SimpleITK.
Anonymized Image Data for DWI and T2W MRI Registration Quality Assurance
figshare.com
zip
Updated Nov 9, 2022
Cite
Kareem Wahid; Mohamed Naser (2022). Anonymized Image Data for DWI and T2W MRI Registration Quality Assurance [Dataset]. http://doi.org/10.6084/m9.figshare.17162435.v1
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.17162435.v1
Dataset updated
Nov 9, 2022
Dataset provided by
Figshare (http://figshare.com/)
Authors
Kareem Wahid; Mohamed Naser
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The following are pre-radiotherapy T2W and DWI MRI sequences in Digital Imaging and Communications in Medicine (DICOM) format for 20 patients curated from the MD Anderson Databases (NCT03145077).

For each image set (T2W image and DWI image), ground truth segmentations for the left and right submandibular glands, left and right parotid glands, cervical spinal cord, brainstem, and primary gross tumor volume were manually generated by a trained physician expert (radiologist with > 5 years of experience in HNC). In a subset of five cases, segmentations for all structures in both sequences were also manually generated by three additional separate observers (two physicians and one medical student). All segmentations were generated in Velocity AI (v.3.0.1; Varian Medical Systems; Palo Alto, CA, USA) in DICOM RT structure format.

DICOM data was anonymized using an in-house Python script that implements the RSNA CRP DICOM Anonymizer software. All files have had any DICOM header info and metadata containing PHI removed or replaced with dummy entries.
MNE-somato-data-bids (anonymized)
openneuro.org
Updated Aug 31, 2020
Cite
Lauri Parkkonen; Stefan Appelhoff; Alexandre Gramfort; Mainak Jas; Richard Höchenberger (2020). MNE-somato-data-bids (anonymized) [Dataset]. http://doi.org/10.18112/openneuro.ds003104.v1.0.0
Explore at:
Unique identifier
https://doi.org/10.18112/openneuro.ds003104.v1.0.0
Dataset updated
Aug 31, 2020
Dataset provided by
OpenNeuro (https://openneuro.org/)
Authors
Lauri Parkkonen; Stefan Appelhoff; Alexandre Gramfort; Mainak Jas; Richard Höchenberger
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
MNE-somato-data-bids

This dataset contains the MNE-somato-data in BIDS format.

The conversion can be reproduced through the Python script stored in the /code directory of this dataset. See the README in that directory.

The /derivatives directory contains the outputs of running the FreeSurfer pipeline recon-all on the MRI data with no additional commandline options (only defaults were used):

$ recon-all -i sub-01_T1w.nii.gz -s 01 -all

After the recon-all call, there were further FreeSurfer calls from the MNE API:

$ mne make_scalp_surfaces -s 01 --force
$ mne watershed_bem -s 01

The derivatives also contain the forward model *-fwd.fif, which was produced using the source space definition, a *-trans.fif file, and the boundary element model (=conductor model) that lives in freesurfer/subjects/01/bem/*-bem-sol.fif.

The *-trans.fif file is not saved, but can be recovered from the anatomical landmarks in the sub-01/anat/T1w.json file and MNE-BIDS' function get_head_mri_transform.

See: https://github.com/mne-tools/mne-bids for more information.

Notes on FreeSurfer

The FreeSurfer pipeline recon-all was run anew for the sake of converting the somato data to BIDS format. This was needed to change the "somato" subject name to the BIDS subject label "01". Note that this is NOT "sub-01", because in BIDS, "sub-" is just a prefix, whereas "01" is the subject label.
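The distinction between subject label and prefix can be illustrated with a small sketch. The helper name `bids_subject_dir` is hypothetical and not part of MNE-BIDS; it only shows that tools such as recon-all receive the bare label ("01"), while file and directory names carry the "sub-" prefix.

```python
def bids_subject_dir(label: str) -> str:
    """Build the prefixed BIDS name from a bare subject label."""
    if label.startswith("sub-"):
        raise ValueError("pass the bare label (e.g. '01'), not the prefixed form")
    return f"sub-{label}"


subject_label = "01"                            # what is passed via -s 01
subject_dir = bids_subject_dir(subject_label)   # prefixed form used in file names
t1w_name = f"{subject_dir}_T1w.nii.gz"          # matches the recon-all input above
```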
CMT1A-BioStampNPoint2023: Charcot-Marie-Tooth disease type 1A accelerometry...
data.niaid.nih.gov
search.dataone.org
zip
Updated Jun 8, 2023
Cite
Karthik Dinesh; Nicole White; Lindsay Baker; Janet Sowden; Steffen Behrens-Spraggins; Elizabeth P Wood; Julie L Charles; David Herrmann; Gaurav Sharma; Katy Eichinger (2023). CMT1A-BioStampNPoint2023: Charcot-Marie-Tooth disease type 1A accelerometry dataset from three wearable sensor study [Dataset]. http://doi.org/10.5061/dryad.p5hqbzktr
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.p5hqbzktr
Dataset updated
Jun 8, 2023
Dataset provided by
University of Rochester Medical Center
University of Rochester
Authors
Karthik Dinesh; Nicole White; Lindsay Baker; Janet Sowden; Steffen Behrens-Spraggins; Elizabeth P Wood; Julie L Charles; David Herrmann; Gaurav Sharma; Katy Eichinger
License
https://spdx.org/licenses/CC0-1.0.html
Description
The CMT1A-BioStampNPoint2023 dataset provides data from a wearable sensor accelerometry study of gait, balance, and activity in 15 individuals with Charcot-Marie-Tooth disease Type 1A (CMT1A). The dataset also includes data for 15 controls who went through the same in-clinic study protocol as the CMT1A participants; a substantial fraction (9) of the controls also participated in the in-home study protocol. For the CMT1A participants, data is provided for all 15 participants for the baseline visit and associated home recording duration and, for a subset of 12 of these participants, also for a 12-month longitudinal visit and associated home recording duration. For controls, no longitudinal data is provided as none was recorded. The data were acquired using lightweight MC 10 BioStamp NPoint sensors (MC 10 Inc, Lexington, MA), three of which were attached to each participant to gather data over an interval of roughly one day. For additional details, see the description in the "README.md" included with the dataset.

Methods

The dataset contains data from wearable sensors and clinical data. The wearable sensor data was acquired using wearable sensors and the clinical data was extracted from the clinical record. The sensor data has not been processed per se, but the start of the recording time has been anonymized to comply with HIPAA requirements. Both the sensor data and the clinical data passed through a Python program for the aforementioned time anonymization and for standard formatting. Additional details of the time anonymization are provided in the file "README.md" included with the dataset.
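The exact time anonymization scheme is documented only in the dataset's README.md. As a rough sketch of one common approach, and purely as an assumption about how such a step might look, the example below replaces absolute timestamps with offsets from the first sample, so that the true recording start is no longer recoverable while the spacing between samples is preserved. The function name `anonymize_start` is hypothetical.

```python
from datetime import datetime, timedelta
from typing import List


def anonymize_start(timestamps: List[datetime]) -> List[timedelta]:
    """Replace absolute timestamps with offsets from the earliest sample."""
    if not timestamps:
        return []
    start = min(timestamps)
    return [t - start for t in timestamps]


# Made-up timestamps, not taken from the dataset.
raw = [
    datetime(2021, 3, 5, 9, 30, 0),
    datetime(2021, 3, 5, 9, 30, 1),
    datetime(2021, 3, 5, 9, 30, 3),
]
relative = anonymize_start(raw)
```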
Anonymized source data files for figures in: Recurrent processes support a...
datadryad.org
data.niaid.nih.gov
zip
Updated Sep 4, 2020
Cite
Laura Gwilliams; Jean-Remi King (2020). Anonymized source data files for figures in: Recurrent processes support a cascade of hierarchical decisions [Dataset]. http://doi.org/10.5061/dryad.70rxwdbtr
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.70rxwdbtr
Dataset updated
Sep 4, 2020
Dataset provided by
Dryad
Authors
Laura Gwilliams; Jean-Remi King
Time period covered
2020
Description
The files are of two formats: .npy and .stc.

.npy files can be read using the NumPy module in Python, e.g.:

import numpy as np

data = np.load('file_name.npy')

https://numpy.org/doc/stable/reference/generated/numpy.load.html

.stc files can be read using the MNE module in Python, e.g.:

from mne import read_source_estimate

stc = read_source_estimate('stc_name-lh.stc')

Note that reading the data from just one hemisphere file will automatically read the data for the other hemisphere too.

https://mne.tools/stable/generated/mne.read_source_estimate.html
Dataset for a machine learning tool to improve lymph node staging with...
data.niaid.nih.gov
zenodo.org
Updated Jul 16, 2024
Cite
Rogasch, Julian M.M. (2024). Dataset for a machine learning tool to improve lymph node staging with FDG-PET/CT [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7094286
Explore at:
Dataset updated
Jul 16, 2024
Dataset authored and provided by
Rogasch, Julian M.M.
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This upload provides Open Data associated with the publication "A machine learning tool to improve prediction of mediastinal lymph node metastases in non-small cell lung cancer using routinely obtainable [18F]FDG-PET/CT parameters" by Rogasch JMM et al. (2022).

The upload contains the anonymized dataset with 10 features necessary for the final GBM model that was presented in the publication. However, the original full dataset with 40 features was excluded from this Open Data repository because it may not comply with strict rules of data anonymization. The full dataset can be obtained from the corresponding author (julian.rogasch@charite.de) upon reasonable request.

Besides the dataset, this upload provides the original python and R scripts that were used as well as their output.

A description of all files can be found in "content_description_2022_11_19.txt".

A user-friendly web tool that implements the final machine learning model can be found here: PET_LN_calculator
Dataset: Deenz Dark Triad Scale – Poland
sodha.be
datacatalogue.cessda.eu
tsv
Updated Feb 20, 2025
Cite
Deen Mohd Dar (2025). Dataset: Deenz Dark Triad Scale – Poland [Dataset]. http://doi.org/10.34934/DVN/4WYRN9
Explore at:
Available download formats: tsv (6069)
Unique identifier
https://doi.org/10.34934/DVN/4WYRN9
Dataset updated
Feb 20, 2025
Dataset provided by
Social Sciences and Digital Humanities Archive – SODHA
Authors
Deen Mohd Dar
License
https://www.sodha.be/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.34934/DVN/4WYRN9
Area covered
Poland
Description
This dataset comes from a study conducted in Poland with 44 participants. The goal of the study was to measure personality traits known as the Dark Triad. The Dark Triad consists of three key traits that influence how people think and behave towards others. These traits are Machiavellianism, Narcissism, and Psychopathy.

Machiavellianism refers to a person's tendency to manipulate others and be strategic in their actions. People with high Machiavellianism scores often believe that deception is necessary to achieve their goals. Narcissism is related to self-importance and the need for admiration. Individuals with high narcissism scores may see themselves as special and expect others to recognize their greatness. Psychopathy is linked to impulsive behavior and a lack of empathy. People with high psychopathy scores tend to be less concerned about the feelings of others and may take risks without worrying about consequences.

Each participant in the dataset answered 30 questions, divided into three sections, with 10 questions per trait. The answers were recorded using a Likert scale from 1 to 5, where:

1 means "Strongly Disagree"
2 means "Disagree"
3 means "Neutral"
4 means "Agree"
5 means "Strongly Agree"

This scale helps measure how much a person agrees with statements related to each of the three traits.

The dataset also includes basic demographic information. Each participant has a unique ID (such as P001, P002, etc.) to keep their identity anonymous. The dataset records their age, which ranges from 18 to 60 years old, and their gender, which is categorized as "Male," "Female," or "Other."

The responses in the dataset are realistic, with small variations to reflect natural differences in personality. On average, participants scored around 3.2 for Machiavellianism, meaning most people showed a moderate tendency to be strategic or manipulative. The average Narcissism score was 3.5, indicating that some participants valued themselves highly and sought admiration. The average Psychopathy score was 2.8, showing that most participants did not strongly exhibit impulsive or reckless behaviors.

This dataset can be useful for many purposes. Researchers can use it to analyze personality traits and see how they compare across different groups. The data can also be used for cross-cultural comparisons, allowing researchers to study how personality traits in Poland differ from those in other countries. Additionally, psychologists can use this data to understand how Dark Triad traits influence behavior in everyday life.

The dataset is saved in a CSV format, which makes it easy to open in programs like Excel, SPSS, or Python for further analysis. Because the data is structured and anonymized, it can be used safely for research without revealing personal information.

In summary, this dataset provides valuable insights into personality traits among people in Poland. It allows researchers to explore how Machiavellianism, Narcissism, and Psychopathy vary among individuals. By studying these traits, psychologists can better understand human behavior and how it affects relationships, decision-making, and personal success.
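Per-trait scores of the kind quoted above are typically computed as the mean of a participant's 10 answers for each trait. The sketch below illustrates that arithmetic. It assumes, hypothetically, that the 30 answers arrive as three consecutive blocks of 10 in the order Machiavellianism, Narcissism, Psychopathy; the actual column layout of the file is not described in the listing and should be checked before use.

```python
from typing import Dict, List

TRAITS = ("Machiavellianism", "Narcissism", "Psychopathy")
ITEMS_PER_TRAIT = 10  # 30 questions, 10 per trait


def trait_scores(answers: List[int]) -> Dict[str, float]:
    """Average the 1-5 Likert answers within each block of 10 items."""
    if len(answers) != len(TRAITS) * ITEMS_PER_TRAIT:
        raise ValueError("expected 30 answers")
    if any(a < 1 or a > 5 for a in answers):
        raise ValueError("Likert answers must be between 1 and 5")
    scores = {}
    for i, trait in enumerate(TRAITS):
        # Assumed layout: items for each trait are stored contiguously.
        block = answers[i * ITEMS_PER_TRAIT:(i + 1) * ITEMS_PER_TRAIT]
        scores[trait] = sum(block) / len(block)
    return scores


# Made-up answers for one participant, not taken from the dataset.
example_answers = [3] * 10 + [4] * 10 + [2] * 10
example_scores = trait_scores(example_answers)
```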
Data from: Thy-Wise: An interpretable machine learning model for the...
figshare.com
zip
Updated Aug 2, 2022
Cite
Zhe Jin; Shufang Pei; Lizhu Ouyang; Lu Zhang; Xiaokai Mo; Qiuying Chen; Jingjing You; Luyan Chen; Bin Zhang; Shuixing Zhang (2022). Thy-Wise: An interpretable machine learning model for the evaluation of thyroid nodules [Dataset]. http://doi.org/10.6084/m9.figshare.20417895.v1
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.20417895.v1
Dataset updated
Aug 2, 2022
Dataset provided by
figshare
Authors
Zhe Jin; Shufang Pei; Lizhu Ouyang; Lu Zhang; Xiaokai Mo; Qiuying Chen; Jingjing You; Luyan Chen; Bin Zhang; Shuixing Zhang
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data Description: The database contains ultrasound images of thyroid nodules that were finally included in the study. As the aim of this study was to identify nodules as benign or malignant, all nodules were placed in two zip files according to their pathological nature: benign_after.zip and malignant_after.zip. After unzipping the zip package and opening the folder, you can see several folders named by "pathological nature + number", each folder corresponds to a thyroid nodule and contains its ultrasound images collected in a single examination.

Ethical Approval: This retrospective study was approved by the institutional Ethics Committees of the First Affiliated Hospital of Jinan University, and the requirement for informed consent was waived.

Sensitive Information Protection: All sensitive information contained in the image, including the patient's personal information, the hospital visited, and the time of the visit, has been removed using the CV2 toolkit from python for the purpose of anonymization.

Processing pipeline and analysis steps: All the annotations in the images and clips were eliminated before review. US images were evaluated in a blinded fashion, with no US or pathology reports available, by two board-certified radiologists (with more than 10 years of experience in thyroid sonography) independently. Nodule size was measured as the maximal dimension on US images and the five gray-scale US categories were reviewed according to the ACR TI-RADS lexicon (5): composition, echogenicity, shape, margin, and echogenic foci. In the ACR TI-RADS, the TI-RADS risk level for nodules was determined by the total score of the five US categories, ranging from TR1 (benign) to TR5 (highly suspicious).
Dataset for: "Using AI to detect misinformation and emotions on Telegram: a...
zenodo.org
csv
Updated Jun 12, 2025
Cite
Andrés Montoro Montarroso; Javier Cantón-Correa (2025). Dataset for: "Using AI to detect misinformation and emotions on Telegram: a comparison with the media" [Dataset]. http://doi.org/10.5281/zenodo.15640048
Explore at:
Available download formats: csv
Unique identifier
https://doi.org/10.5281/zenodo.15640048
Dataset updated
Jun 12, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Andrés Montoro Montarroso; Javier Cantón-Correa
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Nov 30, 2023
Description
This dataset contains the raw data used in the article “Using AI to detect misinformation and emotions on Telegram: a comparison with the media”, accepted for publication in index.comunicación. The data includes:
• Telegram dataset (tg_messages.csv): 54,456 posts extracted from 33 public Telegram channels between 23 July and 16 November 2023, related to the political debate around the Amnesty Law in Spain. Each entry includes message metadata such as channel, date, views, and content.
• News headlines dataset (Titulares.csv): 46,022 news headlines mentioning “amnesty”, extracted from 377 Spanish national media outlets indexed in MediaCloud, during the same period.
• Analysis scripts: Available upon request or pending publication in the article’s supplementary materials.

The data was used for topic modelling, sentiment and emotion detection with NLP techniques based on Python libraries like BERTopic and pysentimiento. All data is anonymized and publicly accessible or derived from open sources.
Data from: CO-Fun: A German Dataset on Company Outsourcing in Fund...
zenodo.org
zip
Updated Sep 2024
Cite
Neda Foroutan; Markus Schröder; Andreas Dengel (2024). CO-Fun: A German Dataset on Company Outsourcing in Fund Prospectuses for Named Entity Recognition and Relation Extraction [Dataset]. http://doi.org/10.5281/zenodo.12745116
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.12745116
Dataset updated
Sep 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Neda Foroutan; Markus Schröder; Andreas Dengel
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jul 2, 2024
Description
The process of cyber mapping gives insights into relationships among financial entities and service providers. Centered around the outsourcing practices of companies within fund prospectuses in Germany, we introduce a dataset specifically designed for named entity recognition and relation extraction tasks. The labeling process on 948 sentences was carried out by three experts, yielding 5,969 annotations for four entity types (Outsourcing, Company, Location and Software) and 4,102 relation annotations (Outsourcing–Company, Company–Location). Furthermore, state-of-the-art deep learning models were trained on this dataset to recognize entities and extract relations. This repository is the anonymized version of the dataset, along with guidelines and the code used for model training.

In the following the content of each file is explained:

The CO-Fun-1.0-anonymized.jsonl file contains the raw data of CO-Fun and consists of records formatted in JSON. Each entry holds the annotated text in the form of HTML. The annotations for each named entity in the text are specified with span tags. Below is an example of an entry in the raw data:

{
  "datetime": "2023-05-04T14:15:54.501875783",
  "entities": [
    {
      "color": "rgb(255, 0, 0)",
      "text": "Ermittlung der täglichen und jährlichen Steuerdaten",
      "id": "255c1d4a-d9b0-4fff-8779-6a68f803ce51",
      "type": "Auslagerung"
    },
    {
      "color": "rgb(0, 0, 255)",
      "text": "tba - the beauty aside GmbH",
      "id": "fad78727-1645-4b39-9478-daecb3b4bd2b",
      "type": "Unternehmen"
    }
  ],
  "text": "• Die Ermittlung der täglichen und jährlichen Steuerdaten für die Fonds wurde auf die tba - the beauty aside GmbH ausgelagert.",
  "relations": [
    {
      "src": {
        "color": "rgb(255, 0, 0)",
        "text": "Ermittlung der täglichen und jährlichen Steuerdaten",
        "id": "255c1d4a-d9b0-4fff-8779-6a68f803ce51",
        "type": "Auslagerung"
      },
      "trg": {
        "color": "rgb(0, 0, 255)",
        "text": "tba - the beauty aside GmbH",
        "id": "fad78727-1645-4b39-9478-daecb3b4bd2b",
        "type": "Unternehmen"
      },
      "type": "Auslagerung-Unternehmen"
    }
  ]
}
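A record with the fields shown above can be read with Python's standard `json` module. The following is a minimal sketch, not the authors' loading code: the helper names `read_jsonl` and `summarize` are hypothetical, and the sample line is abbreviated from the example entry. Opening the file explicitly as UTF-8 avoids mis-decoded umlauts in the German text.

```python
import json
from typing import Iterator, List, Tuple


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def summarize(record: dict) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str, str]]]:
    """Return (type, text) pairs for entities and (src, type, trg) triples for relations."""
    entities = [(e["type"], e["text"]) for e in record["entities"]]
    relations = [
        (r["src"]["text"], r["type"], r["trg"]["text"]) for r in record["relations"]
    ]
    return entities, relations


# One line of the file, abbreviated from the example entry (ids and colors omitted).
outsourcing = {"text": "Ermittlung der täglichen und jährlichen Steuerdaten", "type": "Auslagerung"}
company = {"text": "tba - the beauty aside GmbH", "type": "Unternehmen"}
sample_line = json.dumps({
    "entities": [outsourcing, company],
    "relations": [{"src": outsourcing, "trg": company, "type": "Auslagerung-Unternehmen"}],
})

entities, relations = summarize(json.loads(sample_line))
```

To process the whole file, `read_jsonl("CO-Fun-1.0-anonymized.jsonl")` can be iterated and each record passed to `summarize`.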

CO-Fun_Annotation-Guideline-EN.pdf is a graphical user interface in German to annotate a sentence with named entities and relations.

The prepared-data-and-code folder consists of datasets and Python code files for Named Entity Recognition (NER) and Relation Extraction tasks. It includes the training, development and test sets in text format for the CRF model, as well as in text and spaCy formats for the BERT and RoBERTa models.
Cebulka (Polish dark web cryptomarket and image board) messages data
data.niaid.nih.gov
zenodo.org
Updated Mar 18, 2024
Cite
Shi, Haitao (2024). Cebulka (Polish dark web cryptomarket and image board) messages data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10810938
Explore at:
Dataset updated
Mar 18, 2024
Dataset provided by
Siuda, Piotr
Cheba, Patrycja
Świeca, Leszek
Shi, Haitao
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
General Information

Title of Dataset

Cebulka (Polish dark web cryptomarket and image board) messages data.

Data Collectors

Haitao Shi (The University of Edinburgh, UK); Patrycja Cheba (Jagiellonian University); Leszek Świeca (Kazimierz Wielki University in Bydgoszcz, Poland).

Funding Information

The dataset is part of the research supported by the Polish National Science Centre (Narodowe Centrum Nauki) grant 2021/43/B/HS6/00710.

Project title: “Rhizomatic networks, circulation of meanings and contents, and offline contexts of online drug trade” (2022-2025; PLN 956 620; funding institution: Polish National Science Centre [NCN], call: OPUS 22; Principal Investigator: Piotr Siuda [Kazimierz Wielki University in Bydgoszcz, Poland]).

Data Collection Context

Data Source

Polish dark web cryptomarket and image board called Cebulka (http://cebulka7uxchnbpvmqapg5pfos4ngaxglsktzvha7a5rigndghvadeyd.onion/index.php).

Purpose

This dataset was developed within the abovementioned project. The project focuses on studying internet behavior concerning disruptive actions, particularly emphasizing the online narcotics market in Poland. The research seeks to (1) investigate how the open internet, including social media, is used in the drug trade; (2) outline the significance of darknet platforms in the distribution of drugs; and (3) explore the complex exchange of content related to the drug trade between the surface web and the darknet, along with understanding meanings constructed within the drug subculture.

Within this context, Cebulka is identified as a critical digital venue in Poland’s dark web illicit substances scene. Besides serving as a marketplace, it plays a crucial role in shaping the narratives and discussions prevalent in the drug subculture. The dataset has proved to be a valuable tool for performing the analyses needed to achieve the project’s objectives.

Data Content

Data Description

The data was collected in three periods, i.e., in January 2023, June 2023, and January 2024.

The dataset comprises a sample of messages posted on Cebulka from its inception until January 2024 (including all the messages with drug advertisements). These messages include the initial posts that start each thread and the subsequent posts (replies) within those threads. The dataset is organized into two directories. The “cebulka_adverts” directory contains posts related to drug advertisements (both advertisements and comments). In contrast, the “cebulka_community” directory holds a sample of posts from other parts of the cryptomarket, i.e., those not related directly to trading drugs but rather focusing on discussing illicit substances. The dataset consists of 16,842 posts.

Data Cleaning, Processing, and Anonymization

The data has been cleaned and processed using regular expressions in Python. Additionally, all personal information was removed through regular expressions. The data has been hashed to exclude all identifiers related to instant messaging apps and email addresses. Furthermore, all usernames appearing in messages have been eliminated.
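The general idea of combining regular expressions with hashing can be sketched as follows. This is an illustration only: the actual patterns, hash function, and replacement format used for the dataset are not stated here, so the e-mail pattern, the use of SHA-256, the optional salt, and the `<id:...>` placeholder are all assumptions.

```python
import hashlib
import re

# Simplified e-mail pattern, assumed for illustration; real cleaning would
# need further patterns (messenger handles, usernames, and so on).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def hash_identifier(value: str, salt: str = "") -> str:
    """Map an identifier to a short placeholder derived from its SHA-256 digest."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    # Truncated to 12 hex characters for readability; hypothetical format.
    return f"<id:{digest[:12]}>"


def pseudonymize(text: str, salt: str = "") -> str:
    """Replace every matched e-mail address with its hashed placeholder."""
    return EMAIL_RE.sub(lambda m: hash_identifier(m.group(0), salt), text)


sample = "write to user@example.com or again to user@example.com"
masked = pseudonymize(sample, salt="project-secret")
```

Because the same input always yields the same placeholder, repeated mentions of one identifier remain linkable within the corpus without exposing the identifier itself.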

File Formats and Variables/Fields

The dataset consists of the following files:

Zipped .txt files (“cebulka_adverts.zip” and “cebulka_community.zip”) containing all messages. These files are organized into individual directories that mirror the folder structure found on Cebulka.

Two .csv files that list all the messages, including file names and the content of each post. The first .csv lists messages from “cebulka_adverts.zip,” and the second .csv lists messages from “cebulka_community.zip.”

Ethical Considerations

Ethics Statement

A set of data handling policies aimed at ensuring safety and ethics has been outlined in the following paper:

Harviainen, J.T., Haasio, A., Ruokolainen, T., Hassan, L., Siuda, P., Hamari, J. (2021). Information Protection in Dark Web Drug Markets Research [in:] Proceedings of the 54th Hawaii International Conference on System Sciences, HICSS 2021, Grand Hyatt Kauai, Hawaii, USA, 4-8 January 2021, Maui, Hawaii, (ed.) Tung X. Bui, Honolulu, HI, pp. 4673-4680.

The primary safeguard was the early-stage hashing of usernames and identifiers from the messages, utilizing automated systems for irreversible hashing. Recognizing that automatic name removal might not catch all identifiers, the data underwent manual review to ensure compliance with research ethics and thorough anonymization.
Dopek.eu (Polish clear web and dark web message board) messages data
data.niaid.nih.gov
zenodo.org
Updated Mar 18, 2024
Cite
Świeca, Leszek (2024). Dopek.eu (Polish clear web and dark web message board) messages data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10810554
Explore at:
Dataset updated
Mar 18, 2024
Dataset provided by
Siuda, Piotr
Świeca, Leszek
Shi, Haitao
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
General Information

Title of Dataset

Dopek.eu (Polish clear web and dark web message board) messages data.

Data Collectors

Haitao Shi (The University of Edinburgh, UK); Leszek Świeca (Kazimierz Wielki University in Bydgoszcz, Poland).

Funding Information

The dataset is part of the research supported by the Polish National Science Centre (Narodowe Centrum Nauki) grant 2021/43/B/HS6/00710.

Project title: “Rhizomatic networks, circulation of meanings and contents, and offline contexts of online drug trade” (2022-2025; PLN 956 620; funding institution: Polish National Science Centre [NCN], call: OPUS 22; Principal Investigator: Piotr Siuda [Kazimierz Wielki University in Bydgoszcz, Poland]).

Data Collection Context

Data Source

Clear web and dark web message board called dopek.eu (https://dopek.eu/).

Purpose

This dataset was developed within the project described above. The project examines internet dynamics in disruptive activities, focusing on the online drug trade in Poland. It aims to (1) examine the use of the open internet, including social media, in the drug trade; (2) delineate the role of darknet environments in narcotics distribution; and (3) trace the flow of drug trade-related content and its meanings between the open web and the darknet, and how these meanings are shaped within the so-called drug subculture.

The dopek.eu forum is a pivotal online space on the Polish internet, serving as a hub for trading, discussion, and the exchange of knowledge and experiences concerning the use of so-called new psychoactive substances (designer drugs). The dataset has been used in analyses addressing the project goals listed above.

Collection Method

The dataset was compiled using the Scrapy framework, a web crawling and scraping library for Python. This tool facilitated systematic content extraction from the targeted message board.

Collection Date

The data was collected in October 2023.

Data Content

Data Description

The dataset comprises all messages posted on dopek.eu from its inception until October 2023. These messages include the initial posts that start each thread and the subsequent posts (replies) within those threads. A .txt file has been prepared detailing the structure of the message board folders from which the posts were extracted. The dataset includes 171,121 posts.

Data Cleaning, Processing, and Anonymization

The data was cleaned and processed with regular expressions in Python, which were also used to remove all personal information. Identifiers related to instant messaging apps and email addresses were hashed, and all usernames appearing in messages were removed.
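The cleaning scripts themselves are not part of this description. As a rough illustration of the kind of regex-based anonymization described above, the following is a minimal sketch under stated assumptions: the salt, the simplified email pattern, the `<ID_…>` and `<USER>` placeholders, and the function names `hash_identifier` and `anonymize` are all hypothetical and are not taken from the dataset authors' pipeline.

```python
import hashlib
import re

# Hypothetical secret salt; the dataset description does not specify one.
SALT = b"replace-with-a-project-secret"

# Simplified, illustrative pattern; real email matching is messier.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def hash_identifier(value: str) -> str:
    """Return a short one-way token for an identifier (case-insensitive)."""
    digest = hashlib.sha256(SALT + value.lower().encode("utf-8")).hexdigest()
    return f"<ID_{digest[:12]}>"


def anonymize(text: str, usernames: set[str]) -> str:
    """Hash email addresses, then replace known usernames with a placeholder."""
    text = EMAIL_RE.sub(lambda m: hash_identifier(m.group(0)), text)
    # Longest names first, so a short name never clips a longer one.
    for name in sorted(usernames, key=len, reverse=True):
        # Word boundaries avoid touching substrings of longer words.
        pattern = rf"\b{re.escape(name)}\b"
        text = re.sub(pattern, "<USER>", text, flags=re.IGNORECASE)
    return text
```

Calling `anonymize(post, known_usernames)` on each post would leave the message text readable while replacing the matched identifiers. Patterns this simple will miss some identifiers, which is consistent with the manual review mentioned in the ethics statement.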

File Formats and Variables/Fields

The dataset consists of the following types of files:

Zipped .txt files (dopek.zip) containing all messages (posts).

A .csv file that lists all the messages, including file names and the content of each post.
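As a hedged illustration of how the two file types might be read with the Python standard library, the sketch below lists the .txt members of an archive and maps file names to post content from the .csv listing. The column names `file_name` and `content` are assumptions; the description states only that the .csv includes file names and the content of each post, so check the actual header first.

```python
import csv
import io
import zipfile


def list_txt_members(zip_source):
    """Return the paths of all .txt members in the archive.

    `zip_source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        return [name for name in archive.namelist() if name.endswith(".txt")]


def read_csv_posts(csv_text, name_col="file_name", text_col="content"):
    """Map each file name to its post text from the .csv listing.

    The column names are hypothetical defaults; pass the real ones if
    the header differs.
    """
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    return {row[name_col]: row[text_col] for row in reader}
```

Comparing the keys returned by `read_csv_posts` with the base names returned by `list_txt_members` is one way to confirm that the .csv listing and the archive describe the same set of posts.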

Accessibility and Usage

Access Conditions

The data can be accessed without any restrictions.

Related Documentation

Attached are .txt files detailing the tree of folders for “dopek.zip”.

Ethical Considerations

Ethics Statement

A set of data handling policies aimed at ensuring safety and ethics has been outlined in the following paper:

Harviainen, J.T., Haasio, A., Ruokolainen, T., Hassan, L., Siuda, P., Hamari, J. (2021). Information Protection in Dark Web Drug Markets Research [in:] Proceedings of the 54th Hawaii International Conference on System Sciences, HICSS 2021, Grand Hyatt Kauai, Hawaii, USA, 4-8 January 2021, Maui, Hawaii, (ed.) Tung X. Bui, Honolulu, HI, pp. 4673-4680.

The primary safeguard was irreversible hashing of usernames and identifiers in the posts, applied automatically at an early stage of processing. Because scraping and automatic name removal might not catch every identifier, the data also underwent manual review to ensure compliance with research ethics and thorough anonymization.
open-pii-masking-500k-ai4privacy

Machine Learning Task Analytics

| Split | Count | Percentage |
|-------------|----------|------------|
| **Train** | 464,150 | 79.99% |
| **Validate**| 116,077 | 20.01% |
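The split percentages follow directly from the reported counts. The short check below recomputes them and confirms that the two splits sum to the reported total of 580,227 entries; the helper name `split_percentages` is illustrative only.

```python
# Split sizes as reported for the dataset.
splits = {"Train": 464_150, "Validate": 116_077}


def split_percentages(counts):
    """Return each split's share of the total, in percent, to two decimals."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


total_entries = sum(splits.values())
percentages = split_percentages(splits)
```

Here `total_entries` equals 580,227, and `percentages` reproduces the 79.99% and 20.01% figures, so the split is very close to 80/20.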

Usage

Option 1: Python

Install the library from a terminal:

    pip install datasets

Then load the dataset in Python:

    from datasets import load_dataset
    dataset = load_dataset("ai4privacy/open-pii-masking-500k-ai4privacy")

Compatible Machine Learning Tasks:

Token classification. Check out HuggingFace's guide on token classification.
- ALBERT, BERT, BigBird, BioGpt, BLOOM, BROS, CamemBERT, CANINE, ConvBERT, Data2VecText, DeBERTa, DeBERTa-v2, DistilBERT, ELECTRA, ERNIE, ErnieM, ESM, Falcon, FlauBERT, FNet, Funnel Transformer, GPT-Sw3, OpenAI GPT-2, GPTBigCode, GPT Neo, GPT NeoX, I-BERT, [LayoutLM](http...
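For readers new to token classification, the sketch below shows how character-level PII spans can be turned into per-token BIO tags, the label format such models are commonly trained on. It is a sketch under stated assumptions: the `(start, end, label)` span layout, the whitespace tokenizer, and the label names `NAME` and `PHONE` are hypothetical and do not reflect this dataset's actual schema or the Open PII Labelset.

```python
import re


def bio_tags(text, spans):
    """Assign a BIO tag to each whitespace-delimited token.

    `spans` is a list of (start, end, label) character offsets with an
    exclusive end. This layout is an assumption made for illustration.
    """
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        tag = "O"  # default: the token is outside every PII span
        for start, end, label in spans:
            # A token belongs to a span if the two character ranges overlap.
            if match.start() < end and match.end() > start:
                # The first token of a span gets B-, later tokens get I-.
                prefix = "B" if match.start() <= start else "I"
                tag = f"{prefix}-{label}"
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


text = "Call Anna Nowak at 555-0100"
tokens, tags = bio_tags(text, [(5, 15, "NAME"), (19, 27, "PHONE")])
# tags is ["O", "B-NAME", "I-NAME", "O", "B-PHONE"]
```

In practice a model's own subword tokenizer would replace the whitespace split, and the tags would be aligned to subword pieces instead.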
