59 datasets found

AI4privacy-PII
kaggle.com
zip
Updated Jan 23, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Wilmer E. Henao (2024). AI4privacy-PII [Dataset]. https://www.kaggle.com/datasets/verracodeguacas/ai4privacy-pii
Explore at:
zip(93130230 bytes)Available download formats
Dataset updated
Jan 23, 2024
Authors
Wilmer E. Henao
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Developed by AI4Privacy, this dataset represents a pioneering effort in the realm of privacy and AI. As an expansive resource hosted on Hugging Face at ai4privacy/pii-masking-200k, it serves a crucial role in addressing the growing concerns around personal data security in AI applications.

Sources: The dataset is crafted using proprietary algorithms, ensuring the creation of synthetic data that avoids privacy violations. Its multilingual composition, including English, French, German, and Italian texts, reflects a diverse source base. The data is meticulously curated with human-in-the-loop validation, ensuring both relevance and quality.

Context: In an era where data privacy is paramount, this dataset is tailored to train AI models to identify and mask personally identifiable information (PII). It covers 54 PII classes and extends across 229 use cases in various domains like business, education, psychology, and legal fields, emphasizing its contextual richness and applicability.

Inspiration: The dataset draws inspiration from the need for enhanced privacy measures in AI interactions, particularly in LLMs and AI assistants. The creators, AI4Privacy, are dedicated to building tools that act as a 'global seatbelt' for AI, protecting individuals' personal data. This dataset is a testament to their commitment to advancing AI technology responsibly and ethically.

This comprehensive dataset is not just a tool but a step towards a future where AI and privacy coexist harmoniously, offering immense value to researchers, developers, and privacy advocates alike.
h
pii-masking-200k
huggingface.co
Updated Apr 22, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ai4Privacy (2024). pii-masking-200k [Dataset]. http://doi.org/10.57967/hf/1532
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.57967/hf/1532
Dataset updated
Apr 22, 2024
Dataset authored and provided by
Ai4Privacy
Description
Ai4Privacy Community

Join our community at https://discord.gg/FmzWshaaQT to help build open datasets for privacy masking.

Purpose and Features

Previous world's largest open dataset for privacy. Now it is pii-masking-300k The purpose of the dataset is to train models to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The example texts have 54 PII classes (types of sensitive data), targeting 229 discussion… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-200k.
AI-4-privacy_PII-masking-en-38k
kaggle.com
zip
Updated Apr 2, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kris Smith (2024). AI-4-privacy_PII-masking-en-38k [Dataset]. https://www.kaggle.com/datasets/krist0phersmith/ai-4-privacy-pii-masking-en-38k
Explore at:
zip(23238153 bytes)Available download formats
Dataset updated
Apr 2, 2024
Authors
Kris Smith
Description
This data was scraped from the AI 4 Privacy organizations PII-300k multi lingual dataset.

It consists of English only PII labeled tokens. 30k training samples 8k validation samples

Open licensed for academic purposes: https://huggingface.co/datasets/ai4privacy/pii-masking-300k/blob/main/LICENSE.md

See the model card for full dataset here: https://huggingface.co/datasets/ai4privacy/pii-masking-300k

Citation: @misc {ai4privacy_2024, author = { {Ai4Privacy} }, title = { pii-masking-300k (Revision 86db63b) }, year = 2024, url = { https://huggingface.co/datasets/ai4privacy/pii-masking-300k }, doi = { 10.57967/hf/1995 }, publisher = { Hugging Face } }
Fundraising Data
kaggle.com
zip
Updated Aug 17, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Michael Pawlus (2018). Fundraising Data [Dataset]. https://www.kaggle.com/michaelpawlus/fundraising-data
Explore at:
zip(1087024 bytes)Available download formats
Dataset updated
Aug 17, 2018
Authors
Michael Pawlus
Description
Context

This data set is a collection of anonymized sample fundraising data sets so that practitioners within our field can practice and share examples using a common data source

Open Call for More Content

If you have any anonymous data that you would like to include here let me know: Michael Pawlus (pawlus@usc.edu)

Acknowledgements

Thanks to everyone who has shared data so far to make this possible.
MultiSocial
zenodo.org
data.niaid.nih.gov
+1more
Updated Aug 20, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dominik Macko; Dominik Macko; Jakub Kopal; Robert Moro; Robert Moro; Ivan Srba; Ivan Srba; Jakub Kopal (2025). MultiSocial [Dataset]. http://doi.org/10.5281/zenodo.13846152
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.13846152
Dataset updated
Aug 20, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Dominik Macko; Dominik Macko; Jakub Kopal; Robert Moro; Robert Moro; Ivan Srba; Ivan Srba; Jakub Kopal
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
MultiSocial is a dataset (described in a paper) for multilingual (22 languages) machine-generated text detection benchmark in social-media domain (5 platforms). It contains 472,097 texts, of which about 58k are human-written and approximately the same amount is generated by each of 7 multilingual large language models by using 3 iterations of paraphrasing. The dataset has been anonymized to minimize amount of sensitive data by hiding email addresses, usernames, and phone numbers.

If you use this dataset in any publication, project, tool or in any other form, please, cite the paper.

Disclaimer

Due to data source (described below), the dataset may contain harmful, disinformation, or offensive content. Based on a multilingual toxicity detector, about 8% of the text samples are probably toxic (from 5% in WhatsApp to 10% in Twitter). Although we have used data sources of older date (lower probability to include machine-generated texts), the labeling (of human-written text) might not be 100% accurate. The anonymization procedure might not successfully hiden all the sensitive/personal content; thus, use the data cautiously (if feeling affected by such content, report the found issues in this regard to dpo[at]kinit.sk). The intended use if for non-commercial research purpose only.

Data Source

The human-written part consists of a pseudo-randomly selected subset of social media posts from 6 publicly available datasets:

Telegram data originated in Pushshift Telegram, containing 317M messages (Baumgartner et al., 2020). It contains messages from 27k+ channels. The collection started with a set of right-wing extremist and cryptocurrency channels (about 300 in total) and was expanded based on occurrence of forwarded messages from other channels. In the end, it thus contains a wide variety of topics and societal movements reflecting the data collection time.

Twitter data originated in CLEF2022-CheckThat! Task 1, containing 34k tweets on COVID-19 and politics (Nakov et al., 2022, combined with Sentiment140, containing 1.6M tweets on various topics (Go et al., 2009).

Gab data originated in the dataset containing 22M posts from Gab social network. The authors of the dataset (Zannettou et al., 2018) found out that “Gab is predominantly used for the dissemination and discussion of news and world events, and that it attracts alt-right users, conspiracy theorists, and other trolls.” They also found out that hate speech is much more prevalent there compared to Twitter, but lower than 4chan's Politically Incorrect board.

Discord data originated in Discord-Data, containing 51M messages. This is a long-context, anonymized, clean, multi-turn and single-turn conversational dataset based on Discord data scraped from a large variety of servers, big and small. According to the dataset authors, it contains around 0.1% of potentially toxic comments (based on the applied heuristic/classifier).

WhatsApp data originated in whatsapp-public-groups, containing 300k messages (Garimella & Tyson, 2018). The public dataset contains the anonymised data, collected for around 5 months from around 178 groups. Original messages were made available to us on request to dataset authors for research purposes.

From these datasets, we have pseudo-randomly sampled up to 1300 texts (up to 300 for test split and the remaining up to 1000 for train split if available) for each of the selected 22 languages (using a combination of automated approaches to detect the language) and platform. This process resulted in 61,592 human-written texts, which were further filtered out based on occurrence of some characters or their length, resulting in about 58k human-written texts.

The machine-generated part contains texts generated by 7 LLMs (Aya-101, Gemini-1.0-pro, GPT-3.5-Turbo-0125, Mistral-7B-Instruct-v0.2, opt-iml-max-30b, v5-Eagle-7B-HF, vicuna-13b). All these models were self-hosted except for GPT and Gemini, where we used the publicly available APIs. We generated the texts using 3 paraphrases of the original human-written data and then preprocessed the generated texts (filtered out cases when the generation obviously failed).

The dataset has the following fields:

'text' - a text sample,

'label' - 0 for human-written text, 1 for machine-generated text,

'multi_label' - a string representing a large language model that generated the text or the string "human" representing a human-written text,

'split' - a string identifying train or test split of the dataset for the purpose of training and evaluation respectively,

'language' - the ISO 639-1 language code identifying the detected language of the given text,

'length' - word count of the given text,

'source' - a string identifying the source dataset / platform of the given text,

'potential_noise' - 0 for text without identified noise, 1 for text with potential noise.

ToDo Statistics (under construction)
F
Data from: WEA-Acceptance Data: Wind Turbine Dataset Including Acoustical,...
data.uni-hannover.de
.csv, json, parquet +2
Updated Aug 7, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Institut für Statik und Dynamik (2025). WEA-Acceptance Data: Wind Turbine Dataset Including Acoustical, Meteorological and Turbine Parameters (Version 2.0) [Dataset]. https://data.uni-hannover.de/dataset/wea-acceptance_data_v1
Explore at:
parquet, pdf, zip, json, .csvAvailable download formats
Dataset updated
Aug 7, 2025
Dataset authored and provided by
Institut für Statik und Dynamik
License
Attribution-NonCommercial 3.0 (CC BY-NC 3.0)https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
Description
Within the project WEA-Acceptance¹, extensive measurement campaigns were carried out, which included the recording of acoustic, meteorological and turbine-specific data. Acoustic quantities were measured at several distances to the wind turbine and under various atmospheric and turbine conditions. In the project WEA-Acceptance-Data², the acquired measurements are stored in a structured and anonymized form and provided for research purposes. Besides the data and its documentation, first evaluations as well as reference data sets for chosen scenarios are published.

In this version of the data platform, a specification 2.0, an anonymized data set and three use cases are published. The specification contains the concept of the data platform, which is primarily based on the FAIR (Findable, Accessible, Interoperable, and Reusable) principle. The data set consists of turbine-specific, meteorological and acoustic data recorded over one month. Herein, the data were corrected, conditioned and anonymized so that relevant outliers are marked and erroneous data are removed in the data set. The acoustic data includes anonymized sound pressure levels and one-third octave spectra averaged over ten minutes as well as audio data. In addition, the metadata and an overview of data availability are uploaded. As examples for the application of the data, three use cases are also published. Important information such as the approach for data anonymization is briefly described in the ReadMe file.

For further information about the measurements, it is referred to "Martens, S., Bohne, T., and Rolfes, R.: An evaluation method for extensive wind turbine sound measurement data and its application, Proceedings of Meetings on Acoustics, Acoustical Society of America, 41, 040001, https://doi.org/10.1121/2.0001326, 2020.

¹The project WEA-Acceptance (FKZ 0324134A) was funded by the German Federal Ministry for Economic Affairs and Energy (BMWi).

²The project WEA-Acceptance-Data (FKZ 03EE3062) was funded by the German Federal Ministry for Economic Affairs and Energy (BMWi).
h
pii-masking-43k
huggingface.co
Updated Jul 1, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ai4Privacy (2023). pii-masking-43k [Dataset]. http://doi.org/10.57967/hf/0824
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.57967/hf/0824
Dataset updated
Jul 1, 2023
Dataset authored and provided by
Ai4Privacy
Description
Purpose and Features

The purpose of the model and dataset is to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The model is a fine-tuned version of "Distilled BERT", a smaller and faster version of BERT. It was adapted for the task of token classification based on the largest to our knowledge open-source PII masking dataset, which we are releasing simultaneously. The model size is 62 million parameters. The original… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-43k.
Group Health Dataset (Sleep and Screen Time)
zenodo.org
csv
Updated Apr 8, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gogate; Gogate (2025). Group Health Dataset (Sleep and Screen Time) [Dataset]. http://doi.org/10.5281/zenodo.15171250
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.15171250
Dataset updated
Apr 8, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Gogate; Gogate
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Group Health (Sleep and Screen Time) Dataset

Title: Group Health (Sleep and Screen Time) Dataset

Description: This dataset includes biometric and self-reported sleep-related information from users wearing health monitoring devices. It tracks heart rate data, screen time, and sleep quality ratings, intended for health analytics, sleep research, or machine learning applications.

Creator: Eindhoven University of Technology

Version: 1.0

License: CC-BY 4.0

Keywords: sleep health, wearable data, heart rate, screen time, sleep rating, health analytics

Format: CSV (.csv)

Size: 301,556 records

PID: 10.5281/zenodo.15171250

Column Descriptions

- Uid (int64): Unique identifier for the user. Example: `2`

- Sid (object): Session ID representing device/session (e.g., wearable device). Example: `huami.32093/11110030`

- Key (object): The type of health metric (e.g., 'heart_rate'). Example: `heart_rate`

- Time (int64): Unix timestamp of when the measurement was taken. Example: `1743911820`

- Value (object): JSON object containing measurement details (e.g., heart rate BPM). Example: `{"time":1743911820,"bpm":64}`

- UpdateTime (float64): Timestamp when the record was last updated. Example: `1743911982.0`

- screentime (object): Reported or detected screen time during sleep period. Example: `0 days 08:25:00`

- expected_sleep (object): Expected sleep time duration (possibly self-reported or algorithmic). Example: `0 days 07:45:00`

- sleep_rating (float64): Numerical rating of sleep quality. Example: `0.65`

Notes

- The `Value` field stores JSON-like strings which should be parsed for specific values such as heart rate (`bpm`).

- Missing data in `screentime`, `expected_sleep`, and `sleep_rating` should be handled carefully during analysis.

- Timestamps are in Unix format and may need conversion to readable datetime.

Provenance

The Group Health (Sleep and Screen Time) Dataset was collected by the students at the Eindhoven University of Technology as part of a health monitoring study. Participants wore wearable health devices (Mi Band Smartwatches) that tracked biometric data, including heart rate, screen time, and self-reported sleep information. The dataset was compiled from multiple sessions of device usage over time the course of two weeks, with the data anonymized for privacy and research purposes. The original data was already in a standardized csv format and was altered for preprocessing purposes and analysis. This dataset is openly shared under a CC-BY 4.0 license, enabling users to reuse and modify the data while properly attributing the original creators
Two example datasets of mobile phone records in Nanjing, China
figshare.com
txt
Updated Aug 22, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
XY He (2025). Two example datasets of mobile phone records in Nanjing, China [Dataset]. http://doi.org/10.6084/m9.figshare.29966149.v2
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.29966149.v2
Dataset updated
Aug 22, 2025
Dataset provided by
Figsharehttp://figshare.com/
figshare
Authors
XY He
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Nanjing, China
Description
This dataset contains two examples of processed mobile phone signaling data, derived from anonymized raw records. The files Example_Data_01 and Example_Data_02 correspond to February 5, 2023, and February 24, 2024, respectively, in Nanjing, Jiangsu Province, China. They include full-day mobility records of 88,554 and 87,575 users. The dataset can be used for research on human mobility, travel behavior, urban dynamics, and spatiotemporal data analysis.
s
Data from: GoiEner smart meters data
research.science.eus
observatorio-cientifico.ua.es
+1more
Updated 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Granja, Carlos Quesada; Hernández, Cruz Enrique Borges; Astigarraga, Leire; Merveille, Chris; Granja, Carlos Quesada; Hernández, Cruz Enrique Borges; Astigarraga, Leire; Merveille, Chris (2022). GoiEner smart meters data [Dataset]. https://research.science.eus/documentos/668fc48cb9e7c03b01be0b72
Explore at:
Dataset updated
2022
Authors
Granja, Carlos Quesada; Hernández, Cruz Enrique Borges; Astigarraga, Leire; Merveille, Chris; Granja, Carlos Quesada; Hernández, Cruz Enrique Borges; Astigarraga, Leire; Merveille, Chris
Description
Name: GoiEner smart meters data Summary: The dataset contains hourly time series of electricity consumption (kWh) provided by the Spanish electricity retailer GoiEner. The time series are arranged in four compressed files: raw.tzst, contains raw time series of all GoiEner clients (any date, any length, may have missing samples). imp-pre.tzst, contains processed time series (imputation of missing samples), longer than one year, collected before March 1, 2020. imp-in.tzst, contains processed time series (imputation of missing samples), longer than one year, collected between March 1, 2020 and May 30, 2021. imp-post.tzst, contains processed time series (imputation of missing samples), longer than one year, collected after May 30, 2020. metadata.csv, contains relevant information for each time series. License: CC-BY-SA Acknowledge: These data have been collected in the framework of the WHY project. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 891943. Disclaimer: The sole responsibility for the content of this publication lies with the authors. It does not necessarily reflect the opinion of the Executive Agency for Small and Medium-sized Enterprises (EASME) or the European Commission (EC). EASME or the EC are not responsible for any use that may be made of the information contained therein. Collection Date: From November 2, 2014 to June 8, 2022. Publication Date: December 1, 2022. DOI: 10.5281/zenodo.7362094 Other repositories: None. Author: GoiEner, University of Deusto. Objective of collection: This dataset was originally used to establish a methodology for clustering households according to their electricity consumption. Description: The meaning of each column is described next for each file. raw.tzst: (no column names provided) timestamp; electricity consumption in kWh. imp-pre.tzst, imp-in.tzst, imp-post.tzst: “timestamp”: timestamp; “kWh”: electricity consumption in kWh; “imputed”: binary value indicating whether the row has been obtained by imputation. metadata.csv: “user”: 64-character identifying a user; “start_date”: initial timestamp of the time series; “end_date”: final timestamp of the time series; “length_days”: number of days elapsed between the initial and the final timestamps; “length_years”: number of years elapsed between the initial and the final timestamps; “potential_samples”: number of samples that should be between the initial and the final timestamps of the time series if there were no missing values; “actual_samples”: number of actual samples of the time series; “missing_samples_abs”: number of potential samples minus actual samples; “missing_samples_pct”: potential samples minus actual samples as a percentage; “contract_start_date”: contract start date; “contract_end_date”: contract end date; “contracted_tariff”: type of tariff contracted (2.X: households and SMEs, 3.X: SMEs with high consumption, 6.X: industries, large commercial areas, and farms); “self_consumption_type”: the type of self-consumption to which the users are subscribed; “p1”, “p2”, “p3”, “p4”, “p5”, “p6”: contracted power (in kW) for each of the six time slots; “province”: province where the user is located; “municipality”: municipality where the user is located (municipalities below 50.000 inhabitants have been removed); “zip_code”: post code (post codes of municipalities below 50.000 inhabitants have been removed); “cnae”: CNAE (Clasificación Nacional de Actividades Económicas) code for economic activity classification. 5 star: ⭐⭐⭐ Preprocessing steps: Data cleaning (imputation of missing values using the Last Observation Carried Forward algorithm using weekly seasons); data integration (combination of multiple SIMEL files, i.e. the data sources); data transformation (anonymization, unit conversion, metadata generation). Reuse: This dataset is related to datasets: "A database of features extracted from different electricity load profiles datasets" (DOI 10.5281/zenodo.7382818), where time series feature extraction has been performed. "Measuring the flexibility achieved by a change of tariff" (DOI 10.5281/zenodo.7382924), where the metadata has been extended to include the results of a socio-economic characterization and the answers to a survey about barriers to adapt to a change of tariff. Update policy: There might be a single update in mid-2023. Ethics and legal aspects: The data provided by GoiEner contained values of the CUPS (Meter Point Administration Number), which are personal data. A pre-processing step has been carried out to replace the CUPS by random 64-character hashes. Technical aspects: raw.tzst contains a 15.1 GB folder with 25,559 CSV files; imp-pre.tzst contains a 6.28 GB folder with 12,149 CSV files; imp-in.tzst contains a 4.36 GB folder with 15.562 CSV files; and imp-post.tzst contains a 4.01 GB folder with 17.519 CSV files. Other: None.
m
BSL-Static-48: A Dataset of Anonymized Images and MediaPipe Hand Landmarks...
data.mendeley.com
Updated Oct 30, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nahid Khan (2025). BSL-Static-48: A Dataset of Anonymized Images and MediaPipe Hand Landmarks for BSL Recognition [Dataset]. http://doi.org/10.17632/ms5phkw8sr.1
Explore at:
Unique identifier
https://doi.org/10.17632/ms5phkw8sr.1
Dataset updated
Oct 30, 2025
Authors
Nahid Khan
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset provides a collection of images and extracted landmark features for 48 fundamental static signs in Bangla Sign Language (BSL), including 38 alphabets and 10 digits (0-9). It was created to support research in isolated sign language recognition (SLR) for BSL and provide a benchmark resource for the research community. In total, the dataset comprises 14,566 raw images, 14,566 mirrored images, and 29,132 processed feature samples.

Data Contents: The dataset is organized into two main folders: 01_Images: Contains 29,132 images in .jpg format (14,566 raw + 14,566 mirrored). • Raw_Images: Contains 14,566 original images collected from participants. • Mirrored_Images: Contains 14,566 horizontally flipped versions of the raw images for data augmentation purposes. • Privacy Note: Facial regions in all images within this folder have been anonymized (blurred) to protect participant privacy, as formal
informed consent for sharing identifiable images was not obtained prior to collection.

02_Processed_Features_NPY: Contains 29,132 126-dimensional hand landmark features saved as NumPy arrays in .npy format. Features were extracted using MediaPipe Holistic (capturing 21 landmarks each for the left and right hands, resulting in 63 + 63 = 126 features per image). These feature files are pre-split into train (23,293 samples), val (2,911 samples), and test (2,928 samples) subdirectories (approximately 80%/10%/10%) for standardized model evaluation and benchmarking .

Data Collection: Images were collected from 5 volunteers using a Macbook Air M3 camera. Data collection took place indoors under room lighting conditions against a white background. Images were captured manually using a Python script to ensure clarity.

Potential Use: Researchers can utilize the anonymized raw and mirrored images (01_Images) to develop or test novel feature extraction techniques or multimodal recognition systems. Alternatively, the pre-processed and split .npy feature files (02_Processed_Features_NPY) can be directly used to efficiently train and evaluate machine learning models for static BSL recognition, facilitating reproducible research and benchmarking.

Further Details: Please refer to the README.md file included within the dataset for detailed class mapping (e.g., L1='অ', D0='০'), comprehensive file statistics per class , specifics on the data processing pipeline, and citation guidelines.

Federated Learning for Distributed Intrusion Detection Systems in Public...

zenodo.org
data.europa.eu

bz2

Updated May 23, 2023

+ more versions

Facebook

Twitter

Click to copy link

Link copied

Cite

Alireza Bakhshi Zadi Mahmoodi; Alireza Bakhshi Zadi Mahmoodi; Panos Kostakos; Panos Kostakos (2023). Federated Learning for Distributed Intrusion Detection Systems in Public Networks - Validation Dataset [Dataset]. http://doi.org/10.5281/zenodo.7956304

Explore at:

bz2Available download formats

Unique identifier

https://doi.org/10.5281/zenodo.7956304

Dataset updated

May 23, 2023

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Alireza Bakhshi Zadi Mahmoodi; Alireza Bakhshi Zadi Mahmoodi; Panos Kostakos; Panos Kostakos

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This dataset has been meticulously prepared and utilized as a validation set during the evaluation phase of "Meta IDS" to asses the performance of various machine learning models. It is now made available for interested users and researchers who seek a reliable and diverse dataset for training and testing their own custom models.

The validation dataset comprises a comprehensive collection of labeled entries, that determines whether the packet type is "malicious" or "benign." It covers complex design patterns that are commonly encountered in real-world applications. The dataset is designed to be representative, encompassing edge and fog layers that are in contact with cloud layer, thereby enabling thorough testing and evaluation of different models. Each sample in the dataset is labeled with the corresponding ground truth, providing a reliable reference for model performance evaluation.

To ensure convenient distribution and storage, the dataset has been broken down into three separate batches, each containing a portion of the dataset. This allows for convenient downloading and management of the dataset. The three batches are provided as individual compressed files.

In order to extract the data, follow the following instructions:

Download and install bzip2 (if not already installed) from the official website or your package manager.
Place the compressed dataset file in a directory of your choice.
Open a terminal or command prompt and navigate to the directory where the compressed dataset file is located.
Execute the following command to uncompress the dataset:
- bzip2 -d filename.bz2
Replace "filename.bz2" with the actual name of the compressed dataset file.

Once uncompressed, you will have access to the dataset in its original format for further exploration, analysis, and model training etc. The total storage required for extraction is approximately 800 GB in total, with the first batch requiring approximately 302 GB, the second batch requiring approximately 203 GB, and the third batch requiring approximately 297 GB of data storage.

The first batch contains 1,049,527,992 entries, where as the second batch contains 711,043,331 entries, and for the third and last batch we have 1,029,303,062 entries. The following table provides the feature names along with their explanation and example value once the dataset is extracted.

Feature	Description	Example Value
ip.src	Source IP address in the packet	a05d4ecc38da01406c9635ec694917e969622160e728495e3169f62822444e17
ip.dst	Destination IP address in the packet	a52db0d87623d8a25d0db324d74f0900deb5ca4ec8ad9f346114db134e040ec5
frame.time_epoch	Epoch time of the frame	1676165569.930869
arp.hw.type	Hardware type	1
arp.hw.size	Hardware size	6
arp.proto.size	Protocol size	4
arp.opcode	Opcode	2
data.len	Length	2713
eth.dst.lg	Destination LG bit	1
eth.dst.ig	Destination IG bit	1
eth.src.lg	Source LG bit	1
eth.src.ig	Source IG bit	1
frame.offset_shift	Time shift for this packet	0
frame.len	frame length on the wire	1208
frame.cap_len	Frame length stored into the capture file	215
frame.marked	Frame is marked	0
frame.ignored	Frame is ignored	0
frame.encap_type	Encapsulation type	1
gre	Generic Routing Encapsulation	'Generic Routing Encapsulation (IP)’
ip.version	Version	6
ip.hdr_len	Header length	24
ip.dsfield.dscp	Differentiated Services Codepoint	56
ip.dsfield.ecn	Explicit Congestion Notification	2
ip.len	Total length	614
ip.flags.rb	Reserved bit	0
ip.flags.df	Don't fragment	1
ip.flags.mf	More fragments	0
ip.frag_offset	Fragment offset	0
ip.ttl	Time to live	31
ip.proto	Protocol	47
ip.checksum.status	Header checksum status	2
tcp.srcport	TCP source port	53425
tcp.flags	Flags	0x00000098
tcp.flags.ns	Nonce	0
tcp.flags.cwr	Congestion Window Reduced (CWR)	1
udp.srcport	UDP source port	64413
udp.dstport	UDP destination port	54087
udp.stream	Stream index	1345
udp.length	Length	225
udp.checksum.status	Checksum status	3
packet_type	Type of the packet which is either "benign" or "malicious"	0

Furthermore, in compliance with the GDPR and to ensure the privacy of individuals, all IP addresses present in the dataset have been anonymized through hashing. This anonymization process helps protect the identity of individuals while preserving the integrity and utility of the dataset for research and model development purposes.

Please note that while the dataset provides valuable insights and a solid foundation for machine learning tasks, it is not a substitute for extensive real-world data collection. However, it serves as a valuable resource for researchers, practitioners, and enthusiasts in the machine learning community, offering a compliant and anonymized dataset for developing and validating custom models in a specific problem domain.

By leveraging the validation dataset for machine learning model evaluation and custom model training, users can accelerate their research and development efforts, building upon the knowledge gained from my thesis while contributing to the advancement of the field.

Mental Health
kaggle.com
zip
Updated May 7, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mahdi Mashayekhi (2025). Mental Health [Dataset]. https://www.kaggle.com/datasets/mahdimashayekhi/mental-health
Explore at:
zip(137847 bytes)Available download formats
Dataset updated
May 7, 2025
Authors
Mahdi Mashayekhi
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
📘 Dataset Description

This dataset provides a realistic, synthetic simulation of global mental health survey responses from 10,000 individuals. It was created to reflect actual patterns seen in workplace mental health data while ensuring full anonymity and privacy.

🧠 Context & Purpose

Mental health issues affect people across all ages, countries, and industries. Understanding patterns in mental health at work, access to treatment, and stigma around disclosure is essential for shaping better workplace policies and interventions.

This dataset is ideal for:

Training and evaluating machine learning models

Practicing classification or clustering techniques

Performing exploratory data analysis (EDA)

Studying fairness and bias in mental health predictions

Creating realistic dashboards for HR analytics or healthcare systems

📊 Dataset Highlights

10,000 rows representing anonymized individuals

Diverse global coverage with country/state info

Demographic attributes like age, gender, employment type

Information about work environment and company support

Responses about mental health history, treatment, and workplace stigma

💡 Example Use Cases

Predicting the likelihood of an employee seeking mental health treatment

Identifying factors most correlated with workplace stress

Segmenting users by mental health risk using clustering

Building fairness-aware models to reduce bias in mental health predictions

⚠️ Notes

This dataset is entirely synthetic. No personally identifiable information (PII) or real user data is included.

It was generated based on patterns observed in public mental health datasets and surveys.
Measurement Data: Latencies and Traffic Traces in Global Mobile Roaming with...
data.europa.eu
unknown
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zenodo, Measurement Data: Latencies and Traffic Traces in Global Mobile Roaming with Regional Breakouts [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-11065734?locale=ro
Explore at:
unknown(13095)Available download formats
Dataset authored and provided by
Zenodohttp://zenodo.org/
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A Shortcut through the IPX: Measuring Latencies in Global Mobile Roaming with Regional Breakouts This repository contains a description and sample data for the Paper A Shortcut through the IPX: Measuring Latencies in Global Mobile Roaming with Regional Breakouts published at the Network Traffic Measurement and Analysis (TMA) Conference 2024.In the provided README.md file, we present example snippets of the datasets, including an explanation of all contained fields. We cover the three main datasets covered in the related paper:- DT1: User plane traces captured at multiple GGSN/PGW instances of a globaly operating MVNO- DT2: GTP echo round trip times between visited network SGSN/SGWs and home network GGSN/PGWs- DT3: IPX routing information, as extracted from BGP routing tables For legal reasons, we are not able to publish the secondary datasets (DT4, DT5) covered in the manuscript. Finally, for privacy, security, and political reasons, certain fields in each of the datasets have been anonymized. These are indicated by the _anonymized prefix.In case of IP addresses, the anonymization ist consistent across datasets, meaning that similar IPs have been anonymized such that their values are still identical after anonymization. Contact For questions regarding the dataset, contact Viktoria Vomhoff (viktoria.vomhoff@uni-wuerzburg.de)
Multi-modality medical image dataset for medical image processing in Python...
zenodo.org
zip
Updated Aug 12, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Candace Moore; Candace Moore; Giulia Crocioni; Giulia Crocioni (2024). Multi-modality medical image dataset for medical image processing in Python lesson [Dataset]. http://doi.org/10.5281/zenodo.13305760
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.13305760
Dataset updated
Aug 12, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Candace Moore; Candace Moore; Giulia Crocioni; Giulia Crocioni
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains a collection of medical imaging files for use in the "Medical Image Processing with Python" lesson, developed by the Netherlands eScience Center.

The dataset includes:

SimpleITK compatible files: MRI T1 and CT scans (training_001_mr_T1.mha, training_001_ct.mha), digital X-ray (digital_xray.dcm in DICOM format), neuroimaging data (A1_grayT1.nrrd, A1_grayT2.nrrd). Data have been downloaded from here.

MRI data: a T2-weighted image (OBJECT_phantom_T2W_TSE_Cor_14_1.nii in NIfTI-1 format). Data have been downloaded from here.

Example images for the machine learning lesson: chest X-rays (rotatechest.png, other_op.png), cardiomegaly example (cardiomegaly_cc0.png).

Additional anonymized data: TBA

These files represent various medical imaging modalities and formats commonly used in clinical research and practice. They are intended for educational purposes, allowing students to practice image processing techniques, machine learning applications, and statistical analysis of medical images using Python libraries such as scikit-image, pydicom, and SimpleITK.
Fraud Detection in Financial Transactions
kaggle.com
zip
Updated Jan 17, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Darshan Dalvi (2025). Fraud Detection in Financial Transactions [Dataset]. https://www.kaggle.com/datasets/darshandalvi12/fraud-detection-in-financial-transactions
Explore at:
zip(230131674 bytes)Available download formats
Dataset updated
Jan 17, 2025
Authors
Darshan Dalvi
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Credit Card Fraud Detection Dataset (Updated)

This dataset contains 284,807 transactions from a credit card company, where 492 transactions are fraudulent. The data is highly imbalanced, with only a small fraction of transactions being fraudulent. The dataset is commonly used to build and evaluate fraud detection models.

Dataset Details:

Number of Transactions: 284,807

Fraudulent Transactions: 492 (Highly Imbalanced)

Features:

28 anonymized features (V1 to V28)

Transaction amount

Timestamp

Label:

0: Legitimate

1: Fraudulent

Data Preprocessing:

SMOTE (Synthetic Minority Oversampling Technique) has been applied to address the class imbalance in the dataset, generating synthetic examples for the minority class (fraudulent transactions).

Additional Operations: Various preprocessing steps were performed, including data cleaning and feature engineering, to ensure the quality of the dataset for model training.

Processed Files:

The dataset has been split into training and testing sets and saved in the following files: - X_train.csv: Feature data for the training set - X_test.csv: Feature data for the testing set - y_train.csv: Labels for the training set (fraudulent or legitimate) - y_test.csv: Labels for the testing set

This updated dataset is ready to be used for training and evaluating machine learning models, specifically designed for credit card fraud detection tasks.

This description highlights the key aspects of the dataset, including its preprocessing steps and the availability of the processed files for ease of use.
Some features of the dataset from a bank.
plos.figshare.com
xls
Updated Mar 6, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
HaiChao Du; Li Lv; Hongliang Wang; An Guo (2024). Some features of the dataset from a bank. [Dataset]. http://doi.org/10.1371/journal.pone.0294537.t001
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0294537.t001
Dataset updated
Mar 6, 2024
Dataset provided by
PLOShttp://plos.org/
Authors
HaiChao Du; Li Lv; Hongliang Wang; An Guo
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Credit card fraud is a significant problem that costs billions of dollars annually. Detecting fraudulent transactions is challenging due to the imbalance in class distribution, where the majority of transactions are legitimate. While pre-processing techniques such as oversampling of minority classes are commonly used to address this issue, they often generate unrealistic or overgeneralized samples. This paper proposes a method called autoencoder with probabilistic xgboost based on SMOTE and CGAN(AE-XGB-SMOTE-CGAN) for detecting credit card frauds.AE-XGB-SMOTE-CGAN is a novel method proposed for credit card fraud detection problems. The credit card fraud dataset comes from a real dataset anonymized by a bank and is highly imbalanced, with normal data far greater than fraud data. Autoencoder (AE) is used to extract relevant features from the dataset, enhancing the ability of feature representation learning, and are then fed into xgboost for classification according to the threshold. Additionally, in this study, we propose a novel approach that hybridizes Generative Adversarial Network (GAN) and Synthetic Minority Over-Sampling Technique (SMOTE) to tackle class imbalance problems. Our two-phase oversampling approach involves knowledge transfer and leverages the synergies of SMOTE and GAN. Specifically, GAN transforms the unrealistic or overgeneralized samples generated by SMOTE into realistic data distributions where there is not enough minority class data available for GAN to process effectively on its own. SMOTE is used to address class imbalance issues and CGAN is used to generate new, realistic data to supplement the original dataset. The AE-XGB-SMOTE-CGAN algorithm is also compared to other commonly used machine learning algorithms, such as KNN and Light GBM, and shows an overall improvement of 2% in terms of the ACC index compared to these algorithms. The AE-XGB-SMOTE-CGAN algorithm also outperforms KNN in terms of the MCC index by 30% when the threshold is set to 0.35. This indicates that the AE-XGB-SMOTE-CGAN algorithm has higher accuracy, true positive rate, true negative rate, and Matthew’s correlation coefficient, making it a promising method for detecting credit card fraud.
2025 Kaggle Machine Learning & Data Science Survey
kaggle.com
Updated Jan 28, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Hina Ismail (2025). 2025 Kaggle Machine Learning & Data Science Survey [Dataset]. https://www.kaggle.com/datasets/sonialikhan/2025-kaggle-machine-learning-and-data-science-survey
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 28, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Hina Ismail
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Overview Welcome to Kaggle's second annual Machine Learning and Data Science Survey ― and our first-ever survey data challenge.

This year, as last year, we set out to conduct an industry-wide survey that presents a truly comprehensive view of the state of data science and machine learning. The survey was live for one week in October, and after cleaning the data we finished with 23,859 responses, a 49% increase over last year!

There's a lot to explore here. The results include raw numbers about who is working with data, what’s happening with machine learning in different industries, and the best ways for new data scientists to break into the field. We've published the data in as raw a format as possible without compromising anonymization, which makes it an unusual example of a survey dataset.

Challenge This year Kaggle is launching the first Data Science Survey Challenge, where we will be awarding a prize pool of $28,000 to kernel authors who tell a rich story about a subset of the data science and machine learning community..

In our second year running this survey, we were once again awed by the global, diverse, and dynamic nature of the data science and machine learning industry. This survey data EDA provides an overview of the industry on an aggregate scale, but it also leaves us wanting to know more about the many specific communities comprised within the survey. For that reason, we’re inviting the Kaggle community to dive deep into the survey datasets and help us tell the diverse stories of data scientists from around the world.

The challenge objective: tell a data story about a subset of the data science community represented in this survey, through a combination of both narrative text and data exploration. A “story” could be defined any number of ways, and that’s deliberate. The challenge is to deeply explore (through data) the impact, priorities, or concerns of a specific group of data science and machine learning practitioners. That group can be defined in the macro (for example: anyone who does most of their coding in Python) or the micro (for example: female data science students studying machine learning in masters programs). This is an opportunity to be creative and tell the story of a community you identify with or are passionate about!

Submissions will be evaluated on the following:

Composition - Is there a clear narrative thread to the story that’s articulated and supported by data? The subject should be well defined, well researched, and well supported through the use of data and visualizations. Originality - Does the reader learn something new through this submission? Or is the reader challenged to think about something in a new way? A great entry will be informative, thought provoking, and fresh all at the same time. Documentation - Are your code, and kernel, and additional data sources well documented so a reader can understand what you did? Are your sources clearly cited? A high quality analysis should be concise and clear at each step so the rationale is easy to follow and the process is reproducible To be valid, a submission must be contained in one kernel, made public on or before the submission deadline. Participants are free to use any datasets in addition to the Kaggle Data Science survey, but those datasets must also be publicly available on Kaggle by the deadline for a submission to be valid.

While the challenge is running, Kaggle will also give a Weekly Kernel Award of $1,500 to recognize excellent kernels that are public analyses of the survey. Weekly Kernel Awards will be announced every Friday between 11/9 and 11/30.

How to Participate To make a submission, complete the submission form. Only one submission will be judged per participant, so if you make multiple submissions we will review the last (most recent) entry.

No submission is necessary for the Weekly Kernels Awards. To be eligible, a kernel must be public and use the 2018 Data Science Survey as a data source.

Timeline All dates are 11:59PM UTC

Submission deadline: December 3rd

Winners announced: December 10th

Weekly Kernels Award prize winners announcements: November 9th, 16th, 23rd, and 30th

All kernels are evaluated after the deadline.

Rules To be eligible to win a prize in either of the above prize tracks, you must be:

a registered account holder at Kaggle.com; the older of 18 years old or the age of majority in your jurisdiction of residence; and not a resident of Crimea, Cuba, Iran, Syria, North Korea, or Sudan Your kernels will only be eligible to win if they have been made public on kaggle.com by the above deadline. All prizes are awarded at the discretion of Kaggle. Kaggle reserves the right to cancel or modify prize criteria.

Unfortunately employees, interns, contractors, officers and directors of Kaggle Inc., and their parent companies, are not eligible to win any prizes.

Survey Methodology ...
w
Data from: TweetsKB - A Public and Large-Scale RDF Corpus of Annotated...
data.wu.ac.at
api/sparql, rdf/n3
Updated Dec 13, 2017
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
L3S Research Center (2017). TweetsKB - A Public and Large-Scale RDF Corpus of Annotated Tweets [Dataset]. https://data.wu.ac.at/schema/datahub_io/NWQzMDJiYWItNTlkZS00Zjg0LWIxNDQtNWZhNmQwMTRiNTFj
Explore at:
rdf/n3, api/sparqlAvailable download formats
Dataset updated
Dec 13, 2017
Dataset provided by
L3S Research Center
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
TweetsKB is a public RDF corpus of anonymized data for a large collection of annotated tweets. The dataset currently contains data for more than 1.5 billion tweets, spanning almost 5 years (January 2013 - November 2017). Metadata information about the tweets as well as extracted entities, sentiments, hashtags and user mentions are exposed in RDF using established RDF/S vocabularies. For the sake of privacy, we anonymize the usernames and we do not provide the text of the tweets. However, through the tweet IDs, actual tweet content and further information can be fetched.

Links to all parts:

Part 1 (Jan 2013 - Feb 2014): https://zenodo.org/record/573852

Part 2 (Mar 2014 - Dec 2014): https://zenodo.org/record/577572

Part 3 (Jan 2015 - Oct 2015): https://zenodo.org/record/579597

Part 4 (Nov 2015 - Aug 2016): https://zenodo.org/record/579601

Part 5 (Sep 2016 - Nov 2017): https://zenodo.org/record/1095592

Sample files, example queries and more information are available through TweetsKB's home page: http://l3s.de/tweetsKB/.
Z
Training Dataset for HNTSMRG 2024 Challenge
data.niaid.nih.gov
Updated Jun 21, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Wahid, Kareem; Dede, Cem; Naser, Mohamed; Fuller, Clifton (2024). Training Dataset for HNTSMRG 2024 Challenge [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11199558
Explore at:
Dataset updated
Jun 21, 2024
Dataset provided by
The University of Texas MD Anderson Cancer Center
Authors
Wahid, Kareem; Dede, Cem; Naser, Mohamed; Fuller, Clifton
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Training Dataset for HNTSMRG 2024 Challenge

Overview

This repository houses the publicly available training dataset for the Head and Neck Tumor Segmentation for MR-Guided Applications (HNTSMRG) 2024 Challenge.

Patient cohorts correspond to patients with histologically proven head and neck cancer who underwent radiotherapy (RT) at The University of Texas MD Anderson Cancer Center. The cancer types are predominately oropharyngeal cancer or cancer of unknown primary. Images include a pre-RT T2w MRI scan (1-3 weeks before start of RT) and a mid-RT T2w MRI scan (2-4 weeks intra-RT) for each patient. Segmentation masks of primary gross tumor volumes (abbreviated GTVp) and involved metastatic lymph nodes (abbreviated GTVn) are provided for each image (derived from multi-observer STAPLE consensus).

HNTSMRG 2024 is split into 2 tasks:

Task 1: Segmentation of tumor volumes (GTVp and GTVn) on pre-RT MRI.

Task 2: Segmentation of tumor volumes (GTVp and GTVn) on mid-RT MRI.

The same patient cases will be used for the training and test sets of both tasks of this challenge. Therefore, we are releasing a single training dataset that can be used to construct solutions for either segmentation task. The test data provided (via Docker containers), however, will be different for the two tasks. Please consult the challenge website for more details.

Data Details

DICOM files (images and structure files) have been converted to NIfTI format (.nii.gz) for ease of use by participants via DICOMRTTool v. 1.0.

Images are a mix of fat-suppressed and non-fat-suppressed MRI sequences. Pre-RT and mid-RT image pairs for a given patient are consistently either fat-suppressed or non-fat-suppressed.

Though some sequences may appear to be contrast enhancing, no exogenous contrast is used.

All images have been manually cropped from the top of the clavicles to the bottom of the nasal septum (~ oropharynx region to shoulders), allowing for more consistent image field of views and removal of identifiable facial structures.

The mask files have one of three possible values: background = 0, GTVp = 1, GTVn = 2 (in the case of multiple lymph nodes, they are concatenated into one single label). This labeling convention is similar to the 2022 HECKTOR Challenge.

150 unique patients are included in this dataset. Anonymized patient numeric identifiers are utilized.

The entire training dataset is ~15 GB.

Dataset Folder/File Structure

The dataset is uploaded as a ZIP archive. Please unzip before use. NIfTI files conform to the following standardized nomenclature: ID_timepoint_image/mask.nii.gz. For mid-RT files, a "registered" suffix (ID_timepoint_image/mask_registered.nii.gz) indicates the image or mask has been registered to the mid-RT image space (see more details in Additional Notes below).

The data is provided with the following folder hierarchy:

Top-level folder (named "HNTSMRG24_train")

Patient-level folder (anonymized patient ID, example: "2")

Pre-radiotherapy data folder ("preRT")

Original pre-RT T2w MRI volume (example: "2_preRT_T2.nii.gz").

Original pre-RT tumor segmentation mask (example: "2_preRT_mask.nii.gz").

Mid-radiotherapy data folder ("midRT")

Original mid-RT T2w MRI volume (example: "2_midRT_T2.nii.gz").

Original mid-RT tumor segmentation mask (example: "2_midRT_mask.nii.gz").

Registered pre-RT T2w MRI volume (example: "2_preRT_T2_registered.nii.gz").

Registered pre-RT tumor segmentation mask (example: "2_preRT_mask_registered.nii.gz").

Note: Cases will exhibit variable presentation of ground truth mask structures. For example, a case could have only a GTVp label present, only a GTVn label present, both GTVp and GTVn labels present, or a completely empty mask (i.e., complete tumor response at mid-RT). The following case IDs have empty masks at mid-RT (indicating a complete response): 21, 25, 29, 42. These empty masks are not errors. There will similarly be some cases in the test set for Task 2 that have empty masks.

Details Relevant for Algorithm Building

The goal of Task 1 is to generate a pre-RT tumor segmentation mask (e.g., "2_preRT_mask.nii.gz" is the relevant label). During blind testing for Task 1, only the pre-RT MRI (e.g., "2_preRT_T2.nii.gz") will be provided to the participants algorithms.

The goal of Task 2 is to generate a mid-RT segmentation mask (e.g., "2_midRT_mask.nii.gz" is the relevant label). During blind testing for Task 2, the mid-RT MRI (e.g., "2_midRT_T2.nii.gz"), original pre-RT MRI (e.g., "2_preRT_T2.nii.gz"), original pre-RT tumor segmentation mask (e.g., "2_preRT_mask.nii.gz"), registered pre-RT MRI (e.g., "2_preRT_T2_registered.nii.gz"), and registered pre-RT tumor segmentation mask (e.g., "2_preRT_mask_registered.nii.gz") will be provided to the participants algorithms.

When building models, the resolution of the generated prediction masks should be the same as the corresponding MRI for the given task. In other words, the generated masks should be in the correct pixel spacing and origin with respect to the original reference frame (i.e., pre-RT image for Task 1, mid-RT image for Task 2). More details on the submission of models will be located on the challenge website.

Additional Notes

General notes.

NIfTI format images and segmentations may be easily visualized in any NIfTI viewing software such as 3D Slicer.

Test data will not be made public until the completion of the challenge. The complete training and test data will be published together (along with all original multi-observer annotations and relevant clinical data) at a later date via The Cancer Imaging Archive. Expected date ~ Spring 2025.

Task 1 related notes.

When training their algorithms for Task 1, participants can choose to use only pre-RT data or add in mid-RT data as well. Initially, our plan was to limit participants to utilizing only pre-RT data for training their algorithms in Task 1. However, upon reflection, we recognized that in a practical setting, individuals aiming to develop auto-segmentation algorithms could theoretically train models using any accessible data at their disposal. Based on current literature, we actually don't know what the best solution would be! Would the incorporation of mid-RT data for training a pre-RT segmentation model actually be helpful, or would it merely introduce harmful noise? The answer remains unclear. Therefore, we leave this choice to the participants.

Remember, though, during testing, you will ONLY have the pre-RT image as an input to your model (naturally, since Task 1 is a pre-RT segmentation task and you won't know what mid-RT data for a patient will look like).

Task 2 related notes.

In addition to the mid-RT MRI and segmentation mask, we have also provided a registered pre-RT MRI and the corresponding registered pre-RT segmentation mask for each patient. We offer this data for participants who opt not to integrate any image registration techniques into their algorithms for Task 2 but still wish to use the two images as a joint input to their model. Moreover, in a real-world adaptive RT context, such registered scans are typically readily accessible. Naturally, participants are also free to incorporate their own image registration processes into their pipelines if they wish (or ignore the pre-RT images/masks altogether).

Registrations were generated using SimpleITK, where the mid-RT image serves as the fixed image and the pre-RT image serves as the moving image. Specifically, we utilized the following steps: 1. Apply a centered transformation, 2. Apply a rigid transformation, 3. Apply a deformable transformation with Elastix using a preset parameter map (Parameter map 23 in the Elastix Zoo). This particular deformable transformation was selected as it is open-source and was benchmarked in a previous similar application (https://doi.org/10.1002/mp.16128). For cases where excessive warping was noted during deformable registration (a small minority of cases), only the rigid transformation was applied.

Contact

We have set up a general email address that you can message to notify all organizers at: hntsmrg2024@gmail.com. Additional specific organizer contacts:

Kareem A. Wahid, PhD (kawahid@mdanderson.org)

Cem Dede, MD (cdede@mdanderson.org)

Mohamed A. Naser, PhD (manaser@mdanderson.org)

Facebook

Twitter

Click to copy link

Link copied

Cite

Wilmer E. Henao (2024). AI4privacy-PII [Dataset]. https://www.kaggle.com/datasets/verracodeguacas/ai4privacy-pii

AI4privacy-PII

AI4Privacy PII-Masking Dataset: Safeguarding Personal Data in AI Interactions

Explore at:

zip(93130230 bytes)Available download formats

Dataset updated

Jan 23, 2024

Authors

Wilmer E. Henao

License

MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically

Description

Developed by AI4Privacy, this dataset represents a pioneering effort in the realm of privacy and AI. As an expansive resource hosted on Hugging Face at ai4privacy/pii-masking-200k, it serves a crucial role in addressing the growing concerns around personal data security in AI applications.

Sources: The dataset is crafted using proprietary algorithms, ensuring the creation of synthetic data that avoids privacy violations. Its multilingual composition, including English, French, German, and Italian texts, reflects a diverse source base. The data is meticulously curated with human-in-the-loop validation, ensuring both relevance and quality.

Context: In an era where data privacy is paramount, this dataset is tailored to train AI models to identify and mask personally identifiable information (PII). It covers 54 PII classes and extends across 229 use cases in various domains like business, education, psychology, and legal fields, emphasizing its contextual richness and applicability.

Inspiration: The dataset draws inspiration from the need for enhanced privacy measures in AI interactions, particularly in LLMs and AI assistants. The creators, AI4Privacy, are dedicated to building tools that act as a 'global seatbelt' for AI, protecting individuals' personal data. This dataset is a testament to their commitment to advancing AI technology responsibly and ethically.

This comprehensive dataset is not just a tool but a step towards a future where AI and privacy coexist harmoniously, offering immense value to researchers, developers, and privacy advocates alike.

Clear search

Close search

Google apps

Main menu

AI4privacy-PII

pii-masking-200k

AI-4-privacy_PII-masking-en-38k

Fundraising Data

Context

Open Call for More Content

Acknowledgements

MultiSocial

Disclaimer

Data Source

Data from: WEA-Acceptance Data: Wind Turbine Dataset Including Acoustical,...

pii-masking-43k

Group Health Dataset (Sleep and Screen Time)

Two example datasets of mobile phone records in Nanjing, China

Data from: GoiEner smart meters data

BSL-Static-48: A Dataset of Anonymized Images and MediaPipe Hand Landmarks...

Federated Learning for Distributed Intrusion Detection Systems in Public...

Mental Health

📘 Dataset Description

🧠 Context & Purpose

📊 Dataset Highlights

💡 Example Use Cases

⚠️ Notes

Measurement Data: Latencies and Traffic Traces in Global Mobile Roaming with...

Multi-modality medical image dataset for medical image processing in Python...

Fraud Detection in Financial Transactions

Credit Card Fraud Detection Dataset (Updated)

Dataset Details:

Data Preprocessing:

Processed Files:

Some features of the dataset from a bank.

2025 Kaggle Machine Learning & Data Science Survey

Data from: TweetsKB - A Public and Large-Scale RDF Corpus of Annotated...

Training Dataset for HNTSMRG 2024 Challenge

AI4privacy-PII

AI4Privacy PII-Masking Dataset: Safeguarding Personal Data in AI Interactions