Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The dataset is useful to train and evaluate models to remove personally identifiable and sensitive information from text, especially in the context of AI assistants and LLMs.
Option 1: Python
terminal
pip install datasets
python
from datasets import load_dataset
dataset = load_dataset("ai4privacy/open-pii-masking-500k-ai4privacy")
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains a collection of medical imaging files for use in the "Medical Image Processing with Python" lesson, developed by the Netherlands eScience Center.
The dataset includes:
These files represent various medical imaging modalities and formats commonly used in clinical research and practice. They are intended for educational purposes, allowing students to practice image processing techniques, machine learning applications, and statistical analysis of medical images using Python libraries such as scikit-image, pydicom, and SimpleITK.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The following are pre-radiotherapy T2W and DWI MRI sequences in Digital Imaging and Communications in Medicine (DICOM) format for 20 patients curated from the MD Anderson Databases (NCT03145077).
For each image set (T2W image and DWI image), ground truth segmentations for the left and right submandibular glands, left and right parotid glands, cervical spinal cord, brainstem, and primary gross tumor volume were manually generated by a trained physician expert (radiologist with > 5 years of experience in HNC). In a subset of five cases, segmentations for all structures in both sequences were also manually generated by three additional separate observers (two physicians and one medical student). All segmentations were generated in Velocity AI (v.3.0.1; Varian Medical Systems; Palo Alto, CA, USA) in DICOM RT structure format.
DICOM data was anonymized using an in-house Python script that implements the RSNA CRP DICOM Anonymizer software. All files have had any DICOM header info and metadata containing PHI removed or replaced with dummy entries.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains the MNE-somato-data in BIDS format.
The conversion can be reproduced through the Python script stored in the /code directory of this dataset. See the README in that directory.
The /derivatives directory contains the outputs of running the FreeSurfer pipeline recon-all on the MRI data with no additional command-line options (only defaults were used):
$ recon-all -i sub-01_T1w.nii.gz -s 01 -all
After the recon-all call, there were further FreeSurfer calls from the MNE API:
$ mne make_scalp_surfaces -s 01 --force
$ mne watershed_bem -s 01
The derivatives also contain the forward model *-fwd.fif, which was produced using the source space definition, a *-trans.fif file, and the boundary element model (= conductor model) that lives in freesurfer/subjects/01/bem/*-bem-sol.fif.
The *-trans.fif file is not saved, but can be recovered from the anatomical landmarks in the sub-01/anat/T1w.json file and MNE-BIDS' function get_head_mri_transform.
See: https://github.com/mne-tools/mne-bids for more information.
The FreeSurfer pipeline recon-all was run anew for the sake of converting the somato data to BIDS format. This was needed to change the "somato" subject name to the BIDS subject label "01". Note that this is NOT "sub-01", because in BIDS, "sub-" is just a prefix, whereas "01" is the subject label.
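The difference between the subject label and the "sub-" prefix can be illustrated with a minimal sketch. The helper names subject_entity and subject_label are hypothetical and are not part of MNE-BIDS; they only show how the label "01" relates to the directory name "sub-01".

```python
# Hypothetical helpers (not part of MNE-BIDS) illustrating label vs. prefix.
BIDS_SUBJECT_PREFIX = "sub-"


def subject_entity(label: str) -> str:
    """Build the subject entity (e.g. 'sub-01') from a bare label (e.g. '01')."""
    if label.startswith(BIDS_SUBJECT_PREFIX):
        raise ValueError(f"expected a bare label such as '01', got {label!r}")
    return f"{BIDS_SUBJECT_PREFIX}{label}"


def subject_label(entity: str) -> str:
    """Recover the bare label (e.g. '01') from a subject entity (e.g. 'sub-01')."""
    if not entity.startswith(BIDS_SUBJECT_PREFIX):
        raise ValueError(f"expected an entity such as 'sub-01', got {entity!r}")
    return entity[len(BIDS_SUBJECT_PREFIX):]


entity = subject_entity("01")   # name used for directories and file names
label = subject_label(entity)   # value passed as the subject label
```

In this sketch the FreeSurfer subject name corresponds to the bare label, which matches the "-s 01" argument in the commands shown earlier.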
https://spdx.org/licenses/CC0-1.0.html
The CMT1A-BioStampNPoint2023 dataset provides data from a wearable-sensor accelerometry study of gait, balance, and activity in 15 individuals with Charcot-Marie-Tooth disease Type 1A (CMT1A). The dataset also includes data for 15 controls who went through the same in-clinic study protocol as the CMT1A participants; 9 of the controls also participated in the in-home study protocol. For the CMT1A participants, data is provided for all 15 participants for the baseline visit and associated home recording duration and, for a subset of 12 of these participants, for a 12-month longitudinal visit and associated home recording duration. No longitudinal data is provided for controls, as none was recorded. The data were acquired using lightweight MC 10 BioStamp NPoint sensors (MC 10 Inc, Lexington, MA), three of which were attached to each participant to gather data over an interval of roughly one day. For additional details, see the description in the "README.md" included with the dataset.
Methods
The dataset contains wearable sensor data and clinical data. The sensor data was acquired using the wearable sensors, and the clinical data was extracted from the clinical record. The sensor data has not been processed per se, but the start of the recording time has been anonymized to comply with HIPAA requirements. Both the sensor data and the clinical data passed through a Python program for this time anonymization and for standard formatting. Additional details of the time anonymization are provided in the file "README.md" included with the dataset.
The files are of two formats: .npy and .stc.
.npy files can be read using the NumPy module in Python, e.g.:
import numpy as np
data = np.load('file_name.npy')
https://numpy.org/doc/stable/reference/generated/numpy.load.html
.stc files can be read using the MNE module in Python, e.g.:
from mne import read_source_estimate
stc = read_source_estimate('stc_name-lh.stc')
Note that reading in the data from just one hemisphere file will automatically read the data for the other hemisphere too.
https://mne.tools/stable/generated/mne.read_source_estimate.html
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This upload provides Open Data associated with the publication "A machine learning tool to improve prediction of mediastinal lymph node metastases in non-small cell lung cancer using routinely obtainable [18F]FDG-PET/CT parameters" by Rogasch JMM et al. (2022).
The upload contains the anonymized dataset with 10 features necessary for the final GBM model that was presented in the publication. However, the original full dataset with 40 features was excluded from this Open Data repository because it may not comply with strict rules of data anonymization. The full dataset can be obtained from the corresponding author (julian.rogasch@charite.de) upon reasonable request.
Besides the dataset, this upload provides the original Python and R scripts that were used, as well as their output.
A description of all files can be found in "content_description_2022_11_19.txt".
A user-friendly web tool that implements the final machine learning model can be found here: PET_LN_calculator
https://www.sodha.be/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.34934/DVN/4WYRN9
This dataset comes from a study conducted in Poland with 44 participants. The goal of the study was to measure personality traits known as the Dark Triad. The Dark Triad consists of three key traits that influence how people think and behave towards others. These traits are Machiavellianism, Narcissism, and Psychopathy. Machiavellianism refers to a person's tendency to manipulate others and be strategic in their actions. People with high Machiavellianism scores often believe that deception is necessary to achieve their goals. Narcissism is related to self-importance and the need for admiration. Individuals with high narcissism scores may see themselves as special and expect others to recognize their greatness. Psychopathy is linked to impulsive behavior and a lack of empathy. People with high psychopathy scores tend to be less concerned about the feelings of others and may take risks without worrying about consequences.
Each participant in the dataset answered 30 questions, divided into three sections, with 10 questions per trait. The answers were recorded using a Likert scale from 1 to 5, where:
1 means "Strongly Disagree"
2 means "Disagree"
3 means "Neutral"
4 means "Agree"
5 means "Strongly Agree"
This scale helps measure how much a person agrees with statements related to each of the three traits.
The dataset also includes basic demographic information. Each participant has a unique ID (such as P001, P002, etc.) to keep their identity anonymous. The dataset records their age, which ranges from 18 to 60 years old, and their gender, which is categorized as "Male," "Female," or "Other."
The responses in the dataset are realistic, with small variations to reflect natural differences in personality. On average, participants scored around 3.2 for Machiavellianism, meaning most people showed a moderate tendency to be strategic or manipulative. The average Narcissism score was 3.5, indicating that some participants valued themselves highly and sought admiration. The average Psychopathy score was 2.8, showing that most participants did not strongly exhibit impulsive or reckless behaviors.
This dataset can be useful for many purposes. Researchers can use it to analyze personality traits and see how they compare across different groups. The data can also be used for cross-cultural comparisons, allowing researchers to study how personality traits in Poland differ from those in other countries. Additionally, psychologists can use this data to understand how Dark Triad traits influence behavior in everyday life. The dataset is saved in a CSV format, which makes it easy to open in programs like Excel, SPSS, or Python for further analysis. Because the data is structured and anonymized, it can be used safely for research without revealing personal information.
In summary, this dataset provides valuable insights into personality traits among people in Poland. It allows researchers to explore how Machiavellianism, Narcissism, and Psychopathy vary among individuals. By studying these traits, psychologists can better understand human behavior and how it affects relationships, decision-making, and personal success.
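The per-trait scoring described above (10 Likert items per trait, averaged) could be sketched in Python roughly as follows. The column names (ID, M1..M10, N1..N10, P1..P10) and the sample row are assumptions made up for illustration; the actual CSV header is not documented here and may differ.

```python
import csv
import io
from statistics import mean

# Assumed column prefixes per trait; the real CSV header may use other names.
TRAIT_PREFIXES = {"Machiavellianism": "M", "Narcissism": "N", "Psychopathy": "P"}
ITEMS_PER_TRAIT = 10


def trait_scores(row):
    """Return the mean Likert answer (1-5) per trait for one participant row."""
    scores = {}
    for trait, prefix in TRAIT_PREFIXES.items():
        answers = [int(row[f"{prefix}{i}"]) for i in range(1, ITEMS_PER_TRAIT + 1)]
        if any(answer < 1 or answer > 5 for answer in answers):
            raise ValueError(f"answers for {trait} must lie between 1 and 5")
        scores[trait] = mean(answers)
    return scores


# Made-up sample row: all Machiavellianism items 3, Narcissism 4, Psychopathy 2.
header = ["ID"] + [f"{p}{i}" for p in "MNP" for i in range(1, ITEMS_PER_TRAIT + 1)]
values = ["P001"] + ["3"] * 10 + ["4"] * 10 + ["2"] * 10
sample_csv = ",".join(header) + "\n" + ",".join(values) + "\n"

rows = list(csv.DictReader(io.StringIO(sample_csv)))
example = trait_scores(rows[0])
```

To use the sketch on the real file, the io.StringIO object would be replaced by an opened CSV file and the prefixes adapted to the actual column names.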
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data Description: The database contains the ultrasound images of thyroid nodules that were finally included in the study. As the aim of this study was to identify nodules as benign or malignant, all nodules were placed in two zip files according to their pathological nature: benign_after.zip and malignant_after.zip. After unzipping a package and opening the folder, you will see several folders named "pathological nature + number". Each folder corresponds to one thyroid nodule and contains its ultrasound images collected in a single examination.
Ethical Approval: This retrospective study was approved by the institutional Ethics Committees of the First Affiliated Hospital of Jinan University, and the requirement for informed consent was waived.
Sensitive Information Protection: All sensitive information contained in the images, including the patient's personal information, the hospital visited, and the time of the visit, has been removed for anonymization using the cv2 (OpenCV) module in Python.
Processing pipeline and analysis steps: All the annotations in the images and clips were eliminated before review. US images were evaluated in a blinded fashion, with no US or pathology reports available, by two board-certified radiologists (with more than 10 years of experience in thyroid sonography) independently. Nodule size was measured as the maximal dimension on US images and the five gray-scale US categories were reviewed according to the ACR TI-RADS lexicon (5): composition, echogenicity, shape, margin, and echogenic foci. In the ACR TI-RADS, the TI-RADS risk level for nodules was determined by the total score of the five US categories, ranging from TR1 (benign) to TR5 (highly suspicious).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the raw data used in the article “Using AI to detect misinformation and emotions on Telegram: a comparison with the media”, accepted for publication in index.comunicación. The data includes:
• Telegram dataset (tg_messages.csv): 54,456 posts extracted from 33 public Telegram channels between 23 July and 16 November 2023, related to the political debate around the Amnesty Law in Spain. Each entry includes message metadata such as channel, date, views, and content.
• News headlines dataset (Titulares.csv): 46,022 news headlines mentioning “amnesty”, extracted from 377 Spanish national media outlets indexed in MediaCloud, during the same period.
• Analysis scripts: Available upon request or pending publication in the article’s supplementary materials.
The data was used for topic modelling, sentiment and emotion detection with NLP techniques based on Python libraries like BERTopic and pysentimiento. All data is anonymized and publicly accessible or derived from open sources.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The process of cyber mapping gives insights into relationships among financial entities and service providers. Centered around the outsourcing practices of companies within fund prospectuses in Germany, we introduce a dataset specifically designed for named entity recognition and relation extraction tasks. The labeling of 948 sentences was carried out by three experts, yielding 5,969 annotations for four entity types (Outsourcing, Company, Location and Software) and 4,102 relation annotations (Outsourcing–Company, Company–Location). Furthermore, state-of-the-art deep learning models were trained on this dataset to recognize entities and extract relations. This repository is the anonymized version of the dataset, along with guidelines and the code used for model training.
In the following the content of each file is explained:
The CO-Fun-1.0-anonymized.jsonl file contains the raw data of CO-Fun, which consists of records formatted in JSON. Each entry has the annotated text in the form of HTML. The annotations for the named entities in the text are specified with span tags. Below is an example of an entry in the raw data:
{
"datetime": "2023-05-04T14:15:54.501875783",
"entities": [
{
"color": "rgb(255, 0, 0)",
"text": "Ermittlung der täglichen und jährlichen Steuerdaten",
"id": "255c1d4a-d9b0-4fff-8779-6a68f803ce51",
"type": "Auslagerung"
},
{
"color": "rgb(0, 0, 255)",
"text": "tba - the beauty aside GmbH",
"id": "fad78727-1645-4b39-9478-daecb3b4bd2b",
"type": "Unternehmen"
}
],
"text": "
• Die Ermittlung der täglichen und jährlichen Steuerdaten für die Fonds wurde auf die tba - the beauty aside GmbH ausgelagert.
",
"relations": [
{
"src": {
"color": "rgb(255, 0, 0)",
"text": "Ermittlung der täglichen und jährlichen Steuerdaten",
"id": "255c1d4a-d9b0-4fff-8779-6a68f803ce51",
"type": "Auslagerung"
},
"trg": {
"color": "rgb(0, 0, 255)",
"text": "tba - the beauty aside GmbH",
"id": "fad78727-1645-4b39-9478-daecb3b4bd2b",
"type": "Unternehmen"
},
"type": "Auslagerung-Unternehmen"
}
]
}
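As a rough illustration of how such records might be consumed, the sketch below parses JSONL lines and summarizes the "entities" and "relations" fields shown in the entry above. It assumes one JSON object per line; the function names are hypothetical, and the sample line is a shortened stand-in for a real record (its "text" field is a placeholder), not actual file content.

```python
import json
from collections import Counter


def iter_records(lines):
    """Yield one parsed record per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def entity_type_counts(records):
    """Count how often each entity type occurs across all records."""
    counts = Counter()
    for record in records:
        counts.update(entity["type"] for entity in record.get("entities", []))
    return counts


def relation_triples(record):
    """Return (source text, relation type, target text) for each relation."""
    return [
        (rel["src"]["text"], rel["type"], rel["trg"]["text"])
        for rel in record.get("relations", [])
    ]


# Shortened stand-in for one line of the JSONL file (placeholder "text" field).
outsourcing = {"text": "Ermittlung der täglichen und jährlichen Steuerdaten",
               "type": "Auslagerung"}
company = {"text": "tba - the beauty aside GmbH", "type": "Unternehmen"}
sample_line = json.dumps({
    "entities": [outsourcing, company],
    "text": "...",
    "relations": [{"src": outsourcing, "trg": company,
                   "type": "Auslagerung-Unternehmen"}],
})

records = list(iter_records([sample_line, ""]))
counts = entity_type_counts(records)
triples = relation_triples(records[0])
```

For the real file, the list passed to iter_records would be replaced by an open file handle on CO-Fun-1.0-anonymized.jsonl, for example one opened with encoding="utf-8".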
CO-Fun_Annotation-Guideline-EN.pdf is a graphical user interface in German to annotate a sentence with named entities and relations.
The prepared-data-and-code folder consists of datasets and Python code files for Named Entity Recognition (NER) and Relation Extraction tasks. It provides the training, development and test sets in text format for the CRF model, as well as in text and spaCy formats for the BERT and RoBERTa models.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
General Information
Cebulka (Polish dark web cryptomarket and image board) messages data.
Haitao Shi (The University of Edinburgh, UK); Patrycja Cheba (Jagiellonian University); Leszek Świeca (Kazimierz Wielki University in Bydgoszcz, Poland).
The dataset is part of the research supported by the Polish National Science Centre (Narodowe Centrum Nauki) grant 2021/43/B/HS6/00710.
Project title: “Rhizomatic networks, circulation of meanings and contents, and offline contexts of online drug trade” (2022-2025; PLN 956 620; funding institution: Polish National Science Centre [NCN], call: OPUS 22; Principal Investigator: Piotr Siuda [Kazimierz Wielki University in Bydgoszcz, Poland]).
Data Collection Context
Polish dark web cryptomarket and image board called Cebulka (http://cebulka7uxchnbpvmqapg5pfos4ngaxglsktzvha7a5rigndghvadeyd.onion/index.php).
This dataset was developed within the abovementioned project. The project focuses on studying internet behavior concerning disruptive actions, particularly emphasizing the online narcotics market in Poland. The research seeks to (1) investigate how the open internet, including social media, is used in the drug trade; (2) outline the significance of darknet platforms in the distribution of drugs; and (3) explore the complex exchange of content related to the drug trade between the surface web and the darknet, along with understanding meanings constructed within the drug subculture.
Within this context, Cebulka is identified as a critical digital venue in Poland’s dark web illicit substances scene. Besides serving as a marketplace, it plays a crucial role in shaping the narratives and discussions prevalent in the drug subculture. The dataset has proved to be a valuable tool for performing the analyses needed to achieve the project’s objectives.
Data Content
The data was collected in three periods, i.e., in January 2023, June 2023, and January 2024.
The dataset comprises a sample of messages posted on Cebulka from its inception until January 2024 (including all the messages with drug advertisements). These messages include the initial posts that start each thread and the subsequent posts (replies) within those threads. The dataset is organized into two directories. The “cebulka_adverts” directory contains posts related to drug advertisements (both advertisements and comments). In contrast, the “cebulka_community” directory holds a sample of posts from other parts of the cryptomarket, i.e., those not related directly to trading drugs but rather focusing on discussing illicit substances. The dataset consists of 16,842 posts.
The data has been cleaned and processed using regular expressions in Python. Additionally, all personal information was removed through regular expressions. The data has been hashed to exclude all identifiers related to instant messaging apps and email addresses. Furthermore, all usernames appearing in messages have been eliminated.
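The general idea of replacing identifiers with irreversible hashes via regular expressions can be sketched as follows. This is not the authors' script: the email pattern is deliberately simplified, and the function names, the salt parameter, and the "<id:...>" token format are assumptions chosen only for illustration.

```python
import hashlib
import re

# Simplified email pattern (an assumption); real cleaning rules may be broader.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def hash_identifier(value: str, salt: str = "") -> str:
    """Map an identifier to a short token derived from its SHA-256 digest."""
    digest = hashlib.sha256((salt + value.lower()).encode("utf-8")).hexdigest()
    return f"<id:{digest[:12]}>"


def mask_emails(text: str, salt: str = "") -> str:
    """Replace every email address in the text with its hashed token."""
    return EMAIL_RE.sub(lambda match: hash_identifier(match.group(0), salt), text)


masked = mask_emails("kontakt: Jan.Kowalski@example.com")
```

Because the same address always maps to the same token, repeated mentions remain linkable within the corpus while the original address is not stored. Automated patterns like this can miss identifiers, which is why a manual review step remains useful.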
The dataset consists of the following files:
Zipped .txt files (“cebulka_adverts.zip” and “cebulka_community.zip”) containing all messages. These files are organized into individual directories that mirror the folder structure found on Cebulka.
Two .csv files that list all the messages, including file names and the content of each post. The first .csv lists messages from “cebulka_adverts.zip,” and the second .csv lists messages from “cebulka_community.zip.”
Ethical Considerations
A set of data handling policies aimed at ensuring safety and ethics has been outlined in the following paper:
Harviainen, J.T., Haasio, A., Ruokolainen, T., Hassan, L., Siuda, P., Hamari, J. (2021). Information Protection in Dark Web Drug Markets Research [in:] Proceedings of the 54th Hawaii International Conference on System Sciences, HICSS 2021, Grand Hyatt Kauai, Hawaii, USA, 4-8 January 2021, Maui, Hawaii, (ed.) Tung X. Bui, Honolulu, HI, pp. 4673-4680.
The primary safeguard was the early-stage hashing of usernames and identifiers from the messages, utilizing automated systems for irreversible hashing. Recognizing that automatic name removal might not catch all identifiers, the data underwent manual review to ensure compliance with research ethics and thorough anonymization.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
General Information
Dopek.eu (Polish clear web and dark web message board) messages data.
Haitao Shi (The University of Edinburgh, UK); Leszek Świeca (Kazimierz Wielki University in Bydgoszcz, Poland).
The dataset is part of the research supported by the Polish National Science Centre (Narodowe Centrum Nauki) grant 2021/43/B/HS6/00710.
Project title: “Rhizomatic networks, circulation of meanings and contents, and offline contexts of online drug trade” (2022-2025; PLN 956 620; funding institution: Polish National Science Centre [NCN], call: OPUS 22; Principal Investigator: Piotr Siuda [Kazimierz Wielki University in Bydgoszcz, Poland]).
Data Collection Context
Clear web and dark web message board called dopek.eu (https://dopek.eu/).
This dataset was developed within the abovementioned project. The project delves into internet dynamics within disruptive activities, specifically focusing on the online drug trade in Poland. It aims to (1) examine the utilization of the open internet, including social media, in the drug trade; (2) delineate the role of darknet environments in narcotics distribution; and (3) uncover the intricate flow of drug trade-related content and its meanings between the open web and the darknet, and how these meanings are shaped within the so-called drug subculture.
The dopek.eu forum emerges as a pivotal online space on the Polish internet, serving as a hub for trading, discussions, and the exchange of knowledge and experiences concerning the use of the so-called new psychoactive substances (designer drugs). The dataset has been instrumental in conducting analyses pertinent to the earlier project goals.
The dataset was compiled using the Scrapy framework, a web crawling and scraping library for Python. This tool facilitated systematic content extraction from the targeted message board.
The data was collected in October 2023.
Data Content
The dataset comprises all messages posted on dopek.eu from its inception until October 2023. These messages include the initial posts that start each thread and the subsequent posts (replies) within those threads. A .txt file has been prepared detailing the structure of the message board folders from which the posts were extracted. The dataset includes 171,121 posts.
The data has been cleaned and processed using regular expressions in Python. Additionally, all personal information was removed through regular expressions. The data has been hashed to exclude all identifiers related to instant messaging apps and email addresses. Furthermore, all usernames appearing in messages have been eliminated.
The dataset consists of the following types of files:
Zipped .txt files (dopek.zip) containing all messages (posts).
A .csv file that lists all the messages, including file names and the content of each post.
Accessibility and Usage
The data can be accessed without any restrictions.
Attached are .txt files detailing the tree of folders for “dopek.zip”.
Ethical Considerations
A set of data handling policies aimed at ensuring safety and ethics has been outlined in the following paper:
Harviainen, J.T., Haasio, A., Ruokolainen, T., Hassan, L., Siuda, P., Hamari, J. (2021). Information Protection in Dark Web Drug Markets Research [in:] Proceedings of the 54th Hawaii International Conference on System Sciences, HICSS 2021, Grand Hyatt Kauai, Hawaii, USA, 4-8 January 2021, Maui, Hawaii, (ed.) Tung X. Bui, Honolulu, HI, pp. 4673-4680.
The primary safeguard was the early-stage hashing of usernames and identifiers from the posts, utilizing automated systems for irreversible hashing. Recognizing that scraping and automatic name removal might not catch all identifiers, the data underwent manual review to ensure compliance with research ethics and thorough anonymization.