59 datasets found
  1. AI4privacy-PII

    • kaggle.com
    zip
    Updated Jan 23, 2024
    Cite
    Wilmer E. Henao (2024). AI4privacy-PII [Dataset]. https://www.kaggle.com/datasets/verracodeguacas/ai4privacy-pii
    Explore at:
    Available download formats: zip (93,130,230 bytes)
    Dataset updated
    Jan 23, 2024
    Authors
    Wilmer E. Henao
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Developed by AI4Privacy, this dataset represents a pioneering effort in the realm of privacy and AI. As an expansive resource hosted on Hugging Face at ai4privacy/pii-masking-200k, it serves a crucial role in addressing the growing concerns around personal data security in AI applications.

    Sources: The dataset is crafted using proprietary algorithms, ensuring the creation of synthetic data that avoids privacy violations. Its multilingual composition, including English, French, German, and Italian texts, reflects a diverse source base. The data is meticulously curated with human-in-the-loop validation, ensuring both relevance and quality.

    Context: In an era where data privacy is paramount, this dataset is tailored to train AI models to identify and mask personally identifiable information (PII). It covers 54 PII classes and extends across 229 use cases in various domains like business, education, psychology, and legal fields, emphasizing its contextual richness and applicability.

    Inspiration: The dataset draws inspiration from the need for enhanced privacy measures in AI interactions, particularly in LLMs and AI assistants. The creators, AI4Privacy, are dedicated to building tools that act as a 'global seatbelt' for AI, protecting individuals' personal data. This dataset is a testament to their commitment to advancing AI technology responsibly and ethically.

    This comprehensive dataset is not just a tool but a step towards a future where AI and privacy coexist harmoniously, offering immense value to researchers, developers, and privacy advocates alike.

  2. pii-masking-200k

    • huggingface.co
    Updated Apr 22, 2024
    Cite
    Ai4Privacy (2024). pii-masking-200k [Dataset]. http://doi.org/10.57967/hf/1532
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 22, 2024
    Dataset authored and provided by
    Ai4Privacy
    Description

    Ai4Privacy Community

    Join our community at https://discord.gg/FmzWshaaQT to help build open datasets for privacy masking.

      Purpose and Features
    

    Previously the world's largest open dataset for privacy masking; it has since been superseded by pii-masking-300k. The purpose of the dataset is to train models to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The example texts cover 54 PII classes (types of sensitive data), targeting 229 discussion… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-200k.

  3. AI-4-privacy_PII-masking-en-38k

    • kaggle.com
    zip
    Updated Apr 2, 2024
    Cite
    Kris Smith (2024). AI-4-privacy_PII-masking-en-38k [Dataset]. https://www.kaggle.com/datasets/krist0phersmith/ai-4-privacy-pii-masking-en-38k
    Explore at:
    Available download formats: zip (23,238,153 bytes)
    Dataset updated
    Apr 2, 2024
    Authors
    Kris Smith
    Description

    This data was scraped from the AI4Privacy organization's PII-300k multilingual dataset.

    It consists of English-only PII-labeled tokens: 30k training samples and 8k validation samples.

    Openly licensed for academic purposes: https://huggingface.co/datasets/ai4privacy/pii-masking-300k/blob/main/LICENSE.md

    See the dataset card for the full dataset here: https://huggingface.co/datasets/ai4privacy/pii-masking-300k

    Citation: @misc {ai4privacy_2024, author = { {Ai4Privacy} }, title = { pii-masking-300k (Revision 86db63b) }, year = 2024, url = { https://huggingface.co/datasets/ai4privacy/pii-masking-300k }, doi = { 10.57967/hf/1995 }, publisher = { Hugging Face } }

  4. Fundraising Data

    • kaggle.com
    zip
    Updated Aug 17, 2018
    Cite
    Michael Pawlus (2018). Fundraising Data [Dataset]. https://www.kaggle.com/michaelpawlus/fundraising-data
    Explore at:
    Available download formats: zip (1,087,024 bytes)
    Dataset updated
    Aug 17, 2018
    Authors
    Michael Pawlus
    Description

    Context

    This data set is a collection of anonymized sample fundraising data sets, provided so that practitioners within our field can practice and share examples using a common data source.

    Open Call for More Content

    If you have any anonymous data that you would like to include here, let me know: Michael Pawlus (pawlus@usc.edu).

    Acknowledgements

    Thanks to everyone who has shared data so far to make this possible.

  5. MultiSocial

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    Updated Aug 20, 2025
    Cite
    Dominik Macko; Jakub Kopal; Robert Moro; Ivan Srba (2025). MultiSocial [Dataset]. http://doi.org/10.5281/zenodo.13846152
    Explore at:
    Dataset updated
    Aug 20, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Dominik Macko; Jakub Kopal; Robert Moro; Ivan Srba
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    MultiSocial is a benchmark dataset (described in a paper) for multilingual (22 languages) machine-generated text detection in the social-media domain (5 platforms). It contains 472,097 texts, of which about 58k are human-written; approximately the same amount was generated by each of 7 multilingual large language models using 3 iterations of paraphrasing. The dataset has been anonymized to minimize the amount of sensitive data by hiding email addresses, usernames, and phone numbers.

    If you use this dataset in any publication, project, tool or in any other form, please, cite the paper.

    Disclaimer

    Due to the data sources (described below), the dataset may contain harmful, disinformative, or offensive content. Based on a multilingual toxicity detector, about 8% of the text samples are probably toxic (from 5% in WhatsApp to 10% in Twitter). Although we have used data sources of older date (lower probability of including machine-generated texts), the labeling (of human-written text) might not be 100% accurate. The anonymization procedure might not have successfully hidden all sensitive/personal content; thus, use the data cautiously (if affected by such content, report the issues to dpo[at]kinit.sk). The intended use is for non-commercial research purposes only.

    Data Source

    The human-written part consists of a pseudo-randomly selected subset of social media posts from 6 publicly available datasets:

    1. Telegram data originated in Pushshift Telegram, containing 317M messages (Baumgartner et al., 2020). It contains messages from 27k+ channels. The collection started with a set of right-wing extremist and cryptocurrency channels (about 300 in total) and was expanded based on occurrence of forwarded messages from other channels. In the end, it thus contains a wide variety of topics and societal movements reflecting the data collection time.

    2. Twitter data originated in CLEF2022-CheckThat! Task 1, containing 34k tweets on COVID-19 and politics (Nakov et al., 2022), combined with Sentiment140, containing 1.6M tweets on various topics (Go et al., 2009).

    3. Gab data originated in the dataset containing 22M posts from Gab social network. The authors of the dataset (Zannettou et al., 2018) found out that “Gab is predominantly used for the dissemination and discussion of news and world events, and that it attracts alt-right users, conspiracy theorists, and other trolls.” They also found out that hate speech is much more prevalent there compared to Twitter, but lower than 4chan's Politically Incorrect board.

    4. Discord data originated in Discord-Data, containing 51M messages. This is a long-context, anonymized, clean, multi-turn and single-turn conversational dataset based on Discord data scraped from a large variety of servers, big and small. According to the dataset authors, it contains around 0.1% of potentially toxic comments (based on the applied heuristic/classifier).

    5. WhatsApp data originated in whatsapp-public-groups, containing 300k messages (Garimella & Tyson, 2018). The public dataset contains the anonymised data, collected for around 5 months from around 178 groups. Original messages were made available to us on request to dataset authors for research purposes.

    From these datasets, we have pseudo-randomly sampled up to 1,300 texts (up to 300 for the test split and the remaining up to 1,000 for the train split, if available) for each of the selected 22 languages (using a combination of automated approaches to detect the language) and platforms. This process resulted in 61,592 human-written texts, which were further filtered based on the occurrence of certain characters or their length, resulting in about 58k human-written texts.

    The machine-generated part contains texts generated by 7 LLMs (Aya-101, Gemini-1.0-pro, GPT-3.5-Turbo-0125, Mistral-7B-Instruct-v0.2, opt-iml-max-30b, v5-Eagle-7B-HF, vicuna-13b). All these models were self-hosted except for GPT and Gemini, where we used the publicly available APIs. We generated the texts using 3 paraphrases of the original human-written data and then preprocessed the generated texts (filtered out cases when the generation obviously failed).

    The dataset has the following fields:

    • 'text' - a text sample,

    • 'label' - 0 for human-written text, 1 for machine-generated text,

    • 'multi_label' - a string representing a large language model that generated the text or the string "human" representing a human-written text,

    • 'split' - a string identifying train or test split of the dataset for the purpose of training and evaluation respectively,

    • 'language' - the ISO 639-1 language code identifying the detected language of the given text,

    • 'length' - word count of the given text,

    • 'source' - a string identifying the source dataset / platform of the given text,

    • 'potential_noise' - 0 for text without identified noise, 1 for text with potential noise.
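
    As a rough illustration of how the fields above fit together, here is a minimal pandas sketch; it assumes the release has been exported to a local pandas-readable file (the file name and the "telegram" source value are placeholders, not confirmed by the dataset description).

      # Minimal sketch, assuming a local pandas-readable export of the dataset.
      # "multisocial.parquet" and the "telegram" source value are placeholders.
      import pandas as pd

      df = pd.read_parquet("multisocial.parquet")

      # Restrict to the training split of one language/platform combination.
      subset = df[(df["split"] == "train") & (df["language"] == "de") & (df["source"] == "telegram")]

      # Binary label: 0 = human-written, 1 = machine-generated.
      print(subset["label"].value_counts())

      # Finer-grained view: which LLM (or "human") produced each text.
      print(subset["multi_label"].value_counts())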

    ToDo Statistics (under construction)

  6. Data from: WEA-Acceptance Data: Wind Turbine Dataset Including Acoustical,...

    • data.uni-hannover.de
    .csv, json, parquet +2
    Updated Aug 7, 2025
    + more versions
    Cite
    Institut für Statik und Dynamik (2025). WEA-Acceptance Data: Wind Turbine Dataset Including Acoustical, Meteorological and Turbine Parameters (Version 2.0) [Dataset]. https://data.uni-hannover.de/dataset/wea-acceptance_data_v1
    Explore at:
    Available download formats: parquet, pdf, zip, json, .csv
    Dataset updated
    Aug 7, 2025
    Dataset authored and provided by
    Institut für Statik und Dynamik
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Description

    Within the project WEA-Acceptance¹, extensive measurement campaigns were carried out, which included the recording of acoustic, meteorological and turbine-specific data. Acoustic quantities were measured at several distances to the wind turbine and under various atmospheric and turbine conditions. In the project WEA-Acceptance-Data², the acquired measurements are stored in a structured and anonymized form and provided for research purposes. Besides the data and its documentation, first evaluations as well as reference data sets for chosen scenarios are published.

    In this version of the data platform, specification 2.0, an anonymized data set, and three use cases are published. The specification contains the concept of the data platform, which is primarily based on the FAIR (Findable, Accessible, Interoperable, and Reusable) principles. The data set consists of turbine-specific, meteorological and acoustic data recorded over one month. Herein, the data were corrected, conditioned and anonymized so that relevant outliers are marked and erroneous data are removed in the data set. The acoustic data includes anonymized sound pressure levels and one-third octave spectra averaged over ten minutes as well as audio data. In addition, the metadata and an overview of data availability are uploaded. As examples for the application of the data, three use cases are also published. Important information such as the approach for data anonymization is briefly described in the ReadMe file.

    For further information about the measurements, please refer to: Martens, S., Bohne, T., and Rolfes, R.: An evaluation method for extensive wind turbine sound measurement data and its application, Proceedings of Meetings on Acoustics, Acoustical Society of America, 41, 040001, https://doi.org/10.1121/2.0001326, 2020.

    ¹The project WEA-Acceptance (FKZ 0324134A) was funded by the German Federal Ministry for Economic Affairs and Energy (BMWi).

    ²The project WEA-Acceptance-Data (FKZ 03EE3062) was funded by the German Federal Ministry for Economic Affairs and Energy (BMWi).

  7. pii-masking-43k

    • huggingface.co
    Updated Jul 1, 2023
    + more versions
    Cite
    Ai4Privacy (2023). pii-masking-43k [Dataset]. http://doi.org/10.57967/hf/0824
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 1, 2023
    Dataset authored and provided by
    Ai4Privacy
    Description

    Purpose and Features

    The purpose of the model and dataset is to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The model is a fine-tuned version of DistilBERT, a smaller and faster version of BERT. It was adapted for the task of token classification based on what is, to our knowledge, the largest open-source PII masking dataset, which we are releasing simultaneously. The model size is 62 million parameters. The original… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-43k.

  8. Group Health Dataset (Sleep and Screen Time)

    • zenodo.org
    csv
    Updated Apr 8, 2025
    Cite
    Gogate (2025). Group Health Dataset (Sleep and Screen Time) [Dataset]. http://doi.org/10.5281/zenodo.15171250
    Explore at:
    Available download formats: csv
    Dataset updated
    Apr 8, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gogate
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    Group Health (Sleep and Screen Time) Dataset


    Title: Group Health (Sleep and Screen Time) Dataset

    Description: This dataset includes biometric and self-reported sleep-related information from users wearing health monitoring devices. It tracks heart rate data, screen time, and sleep quality ratings, intended for health analytics, sleep research, or machine learning applications.
    Creator: Eindhoven University of Technology
    Version: 1.0
    License: CC-BY 4.0
    Keywords: sleep health, wearable data, heart rate, screen time, sleep rating, health analytics
    Format: CSV (.csv)
    Size: 301,556 records
    PID: 10.5281/zenodo.15171250

    Column Descriptions

    - Uid (int64): Unique identifier for the user. Example: `2`
    - Sid (object): Session ID representing device/session (e.g., wearable device). Example: `huami.32093/11110030`
    - Key (object): The type of health metric (e.g., 'heart_rate'). Example: `heart_rate`
    - Time (int64): Unix timestamp of when the measurement was taken. Example: `1743911820`
    - Value (object): JSON object containing measurement details (e.g., heart rate BPM). Example: `{"time":1743911820,"bpm":64}`
    - UpdateTime (float64): Timestamp when the record was last updated. Example: `1743911982.0`
    - screentime (object): Reported or detected screen time during sleep period. Example: `0 days 08:25:00`
    - expected_sleep (object): Expected sleep time duration (possibly self-reported or algorithmic). Example: `0 days 07:45:00`
    - sleep_rating (float64): Numerical rating of sleep quality. Example: `0.65`

    Notes
    - The `Value` field stores JSON-like strings which should be parsed for specific values such as heart rate (`bpm`).
    - Missing data in `screentime`, `expected_sleep`, and `sleep_rating` should be handled carefully during analysis.
    - Timestamps are in Unix format and may need conversion to readable datetime.
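
    A minimal pandas sketch of the two notes above (JSON parsing and timestamp conversion); the file name is a placeholder and the column names are taken from the descriptions above.

      # Minimal sketch; "group_health.csv" is a placeholder file name.
      import json
      import pandas as pd

      df = pd.read_csv("group_health.csv")

      # Convert Unix timestamps to readable datetimes.
      df["Time_dt"] = pd.to_datetime(df["Time"], unit="s")

      # Parse the JSON-like Value strings, e.g. '{"time":1743911820,"bpm":64}'.
      def extract_bpm(value):
          try:
              return json.loads(value).get("bpm")
          except (TypeError, ValueError):
              return None  # leave missing/malformed entries as NaN

      df["bpm"] = df["Value"].apply(extract_bpm)

      # Check how much of the self-reported data is missing before analysis.
      print(df[["screentime", "expected_sleep", "sleep_rating"]].isna().mean())
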
    Provenance
    The Group Health (Sleep and Screen Time) Dataset was collected by students at Eindhoven University of Technology as part of a health monitoring study. Participants wore wearable health devices (Mi Band smartwatches) that tracked biometric data, including heart rate, screen time, and self-reported sleep information. The dataset was compiled from multiple sessions of device usage over the course of two weeks, with the data anonymized for privacy and research purposes. The original data was already in a standardized CSV format and was altered for preprocessing and analysis purposes. This dataset is openly shared under a CC-BY 4.0 license, enabling users to reuse and modify the data while properly attributing the original creators.

  9. Two example datasets of mobile phone records in Nanjing, China

    • figshare.com
    txt
    Updated Aug 22, 2025
    Cite
    XY He (2025). Two example datasets of mobile phone records in Nanjing, China [Dataset]. http://doi.org/10.6084/m9.figshare.29966149.v2
    Explore at:
    Available download formats: txt
    Dataset updated
    Aug 22, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    XY He
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Nanjing, China
    Description

    This dataset contains two examples of processed mobile phone signaling data, derived from anonymized raw records. The files Example_Data_01 and Example_Data_02 correspond to February 5, 2023, and February 24, 2024, respectively, in Nanjing, Jiangsu Province, China. They include full-day mobility records of 88,554 and 87,575 users. The dataset can be used for research on human mobility, travel behavior, urban dynamics, and spatiotemporal data analysis.

  10. Data from: GoiEner smart meters data

    • research.science.eus
    • observatorio-cientifico.ua.es
    • +1more
    Updated 2022
    Cite
    Granja, Carlos Quesada; Hernández, Cruz Enrique Borges; Astigarraga, Leire; Merveille, Chris (2022). GoiEner smart meters data [Dataset]. https://research.science.eus/documentos/668fc48cb9e7c03b01be0b72
    Explore at:
    Dataset updated
    2022
    Authors
    Granja, Carlos Quesada; Hernández, Cruz Enrique Borges; Astigarraga, Leire; Merveille, Chris
    Description

    Name: GoiEner smart meters data

    Summary: The dataset contains hourly time series of electricity consumption (kWh) provided by the Spanish electricity retailer GoiEner. The time series are arranged in four compressed files plus a metadata file:

    • raw.tzst: raw time series of all GoiEner clients (any date, any length, may have missing samples).
    • imp-pre.tzst: processed time series (imputation of missing samples), longer than one year, collected before March 1, 2020.
    • imp-in.tzst: processed time series (imputation of missing samples), longer than one year, collected between March 1, 2020 and May 30, 2021.
    • imp-post.tzst: processed time series (imputation of missing samples), longer than one year, collected after May 30, 2020.
    • metadata.csv: relevant information for each time series.

    License: CC-BY-SA

    Acknowledge: These data have been collected in the framework of the WHY project. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 891943.

    Disclaimer: The sole responsibility for the content of this publication lies with the authors. It does not necessarily reflect the opinion of the Executive Agency for Small and Medium-sized Enterprises (EASME) or the European Commission (EC). EASME and the EC are not responsible for any use that may be made of the information contained therein.

    Collection Date: From November 2, 2014 to June 8, 2022.

    Publication Date: December 1, 2022.

    DOI: 10.5281/zenodo.7362094

    Other repositories: None.

    Author: GoiEner, University of Deusto.

    Objective of collection: This dataset was originally used to establish a methodology for clustering households according to their electricity consumption.

    Description: The meaning of each column is described next for each file.

    • raw.tzst (no column names provided): timestamp; electricity consumption in kWh.
    • imp-pre.tzst, imp-in.tzst, imp-post.tzst: "timestamp": timestamp; "kWh": electricity consumption in kWh; "imputed": binary value indicating whether the row has been obtained by imputation.
    • metadata.csv: "user": 64-character identifier of a user; "start_date": initial timestamp of the time series; "end_date": final timestamp of the time series; "length_days": number of days elapsed between the initial and the final timestamps; "length_years": number of years elapsed between the initial and the final timestamps; "potential_samples": number of samples that should be between the initial and the final timestamps of the time series if there were no missing values; "actual_samples": number of actual samples of the time series; "missing_samples_abs": number of potential samples minus actual samples; "missing_samples_pct": potential samples minus actual samples as a percentage; "contract_start_date": contract start date; "contract_end_date": contract end date; "contracted_tariff": type of tariff contracted (2.X: households and SMEs, 3.X: SMEs with high consumption, 6.X: industries, large commercial areas, and farms); "self_consumption_type": the type of self-consumption to which the users are subscribed; "p1", "p2", "p3", "p4", "p5", "p6": contracted power (in kW) for each of the six time slots; "province": province where the user is located; "municipality": municipality where the user is located (municipalities below 50,000 inhabitants have been removed); "zip_code": post code (post codes of municipalities below 50,000 inhabitants have been removed); "cnae": CNAE (Clasificación Nacional de Actividades Económicas) code for economic activity classification.

    5 star: ⭐⭐⭐

    Preprocessing steps: Data cleaning (imputation of missing values using the Last Observation Carried Forward algorithm with weekly seasons); data integration (combination of multiple SIMEL files, i.e. the data sources); data transformation (anonymization, unit conversion, metadata generation).

    Reuse: This dataset is related to the datasets "A database of features extracted from different electricity load profiles datasets" (DOI 10.5281/zenodo.7382818), where time series feature extraction has been performed, and "Measuring the flexibility achieved by a change of tariff" (DOI 10.5281/zenodo.7382924), where the metadata has been extended to include the results of a socio-economic characterization and the answers to a survey about barriers to adapting to a change of tariff.

    Update policy: There might be a single update in mid-2023.

    Ethics and legal aspects: The data provided by GoiEner contained values of the CUPS (Meter Point Administration Number), which are personal data. A pre-processing step has been carried out to replace the CUPS by random 64-character hashes.

    Technical aspects: raw.tzst contains a 15.1 GB folder with 25,559 CSV files; imp-pre.tzst contains a 6.28 GB folder with 12,149 CSV files; imp-in.tzst contains a 4.36 GB folder with 15,562 CSV files; and imp-post.tzst contains a 4.01 GB folder with 17,519 CSV files.

    Other: None.
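
    A minimal pandas sketch of how the files described above could be combined; it assumes the .tzst archives have already been extracted, and the per-user file path is a placeholder (actual file names are 64-character hashes).

      # Minimal sketch, assuming the .tzst archives are already extracted.
      import pandas as pd

      meta = pd.read_csv("metadata.csv")

      # Users with at least one year of data and few missing samples.
      candidates = meta[(meta["length_years"] >= 1) & (meta["missing_samples_pct"] < 5)]
      print(len(candidates), "candidate users")

      # One processed series from the imp-pre folder (path/file name is a placeholder).
      ts = pd.read_csv("imp-pre/<64-character-user-hash>.csv", parse_dates=["timestamp"])
      weekly_kwh = ts.set_index("timestamp")["kWh"].resample("W").sum()
      print(weekly_kwh.head())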

  11. BSL-Static-48: A Dataset of Anonymized Images and MediaPipe Hand Landmarks...

    • data.mendeley.com
    Updated Oct 30, 2025
    Cite
    Nahid Khan (2025). BSL-Static-48: A Dataset of Anonymized Images and MediaPipe Hand Landmarks for BSL Recognition [Dataset]. http://doi.org/10.17632/ms5phkw8sr.1
    Explore at:
    Dataset updated
    Oct 30, 2025
    Authors
    Nahid Khan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset provides a collection of images and extracted landmark features for 48 fundamental static signs in Bangla Sign Language (BSL), including 38 alphabets and 10 digits (0-9). It was created to support research in isolated sign language recognition (SLR) for BSL and provide a benchmark resource for the research community. In total, the dataset comprises 14,566 raw images, 14,566 mirrored images, and 29,132 processed feature samples.

    Data Contents: The dataset is organized into two main folders.

    01_Images: Contains 29,132 images in .jpg format (14,566 raw + 14,566 mirrored).
    • Raw_Images: Contains 14,566 original images collected from participants.
    • Mirrored_Images: Contains 14,566 horizontally flipped versions of the raw images for data augmentation purposes.
    • Privacy Note: Facial regions in all images within this folder have been anonymized (blurred) to protect participant privacy, as formal informed consent for sharing identifiable images was not obtained prior to collection.

    02_Processed_Features_NPY: Contains 29,132 126-dimensional hand landmark features saved as NumPy arrays in .npy format. Features were extracted using MediaPipe Holistic (capturing 21 landmarks each for the left and right hands, resulting in 63 + 63 = 126 features per image). These feature files are pre-split into train (23,293 samples), val (2,911 samples), and test (2,928 samples) subdirectories (approximately 80%/10%/10%) for standardized model evaluation and benchmarking.
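
    A minimal NumPy sketch of consuming one of the .npy feature files described above; the path is a placeholder, and the left-before-right ordering of the two 63-value halves is an assumption based on the description.

      # Minimal sketch; the path is a placeholder and the left/right ordering
      # of the two 63-value halves is assumed from the description above.
      import numpy as np

      features = np.load("02_Processed_Features_NPY/train/<sample>.npy")
      assert features.shape == (126,)

      # Split into per-hand arrays of shape (21 landmarks, 3 coordinates).
      left_hand = features[:63].reshape(21, 3)
      right_hand = features[63:].reshape(21, 3)
      print(left_hand.shape, right_hand.shape)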

    Data Collection: Images were collected from 5 volunteers using a Macbook Air M3 camera. Data collection took place indoors under room lighting conditions against a white background. Images were captured manually using a Python script to ensure clarity.

    Potential Use: Researchers can utilize the anonymized raw and mirrored images (01_Images) to develop or test novel feature extraction techniques or multimodal recognition systems. Alternatively, the pre-processed and split .npy feature files (02_Processed_Features_NPY) can be directly used to efficiently train and evaluate machine learning models for static BSL recognition, facilitating reproducible research and benchmarking.

    Further Details: Please refer to the README.md file included within the dataset for detailed class mapping (e.g., L1='অ', D0='০'), comprehensive file statistics per class, specifics on the data processing pipeline, and citation guidelines.

  12. Federated Learning for Distributed Intrusion Detection Systems in Public...

    • zenodo.org
    • data.europa.eu
    bz2
    Updated May 23, 2023
    + more versions
    Cite
    Alireza Bakhshi Zadi Mahmoodi; Panos Kostakos (2023). Federated Learning for Distributed Intrusion Detection Systems in Public Networks - Validation Dataset [Dataset]. http://doi.org/10.5281/zenodo.7956304
    Explore at:
    Available download formats: bz2
    Dataset updated
    May 23, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alireza Bakhshi Zadi Mahmoodi; Panos Kostakos
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset has been meticulously prepared and utilized as a validation set during the evaluation phase of "Meta IDS" to assess the performance of various machine learning models. It is now made available for interested users and researchers who seek a reliable and diverse dataset for training and testing their own custom models.

    The validation dataset comprises a comprehensive collection of labeled entries that indicate whether each packet is "malicious" or "benign." It covers complex design patterns that are commonly encountered in real-world applications. The dataset is designed to be representative, encompassing edge and fog layers that are in contact with the cloud layer, thereby enabling thorough testing and evaluation of different models. Each sample in the dataset is labeled with the corresponding ground truth, providing a reliable reference for model performance evaluation.

    To ensure convenient distribution and storage, the dataset has been split into three separate batches, which simplifies downloading and management. The three batches are provided as individual compressed files.

    To extract the data, follow these instructions:

    • Download and install bzip2 (if not already installed) from the official website or your package manager.
    • Place the compressed dataset file in a directory of your choice.
    • Open a terminal or command prompt and navigate to the directory where the compressed dataset file is located.
    • Execute the following command to uncompress the dataset:
      • bzip2 -d filename.bz2
    • Replace "filename.bz2" with the actual name of the compressed dataset file.

    Once uncompressed, you will have access to the dataset in its original format for further exploration, analysis, model training, etc. The total storage required for extraction is approximately 800 GB, with the first batch requiring approximately 302 GB, the second approximately 203 GB, and the third approximately 297 GB of data storage.
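
    Equivalently, the batches can be stream-decompressed from Python with the standard-library bz2 module; the sketch below uses placeholder file names and writes the decompressed output in chunks to keep memory use low.

      # Minimal sketch; "batch1.bz2" and "batch1_extracted" are placeholder names.
      import bz2
      import shutil

      src = "batch1.bz2"
      dst = "batch1_extracted"  # expect hundreds of GB of output (see sizes above)

      with bz2.open(src, "rb") as fin, open(dst, "wb") as fout:
          shutil.copyfileobj(fin, fout, length=16 * 1024 * 1024)  # 16 MB chunks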

    The first batch contains 1,049,527,992 entries, whereas the second batch contains 711,043,331 entries, and the third and last batch contains 1,029,303,062 entries. The following table provides the feature names along with their explanations and example values once the dataset is extracted.

    Feature | Description | Example value
    ip.src | Source IP address in the packet | a05d4ecc38da01406c9635ec694917e969622160e728495e3169f62822444e17
    ip.dst | Destination IP address in the packet | a52db0d87623d8a25d0db324d74f0900deb5ca4ec8ad9f346114db134e040ec5
    frame.time_epoch | Epoch time of the frame | 1676165569.930869
    arp.hw.type | Hardware type | 1
    arp.hw.size | Hardware size | 6
    arp.proto.size | Protocol size | 4
    arp.opcode | Opcode | 2
    data.len | Length | 2713
    eth.dst.lg | Destination LG bit | 1
    eth.dst.ig | Destination IG bit | 1
    eth.src.lg | Source LG bit | 1
    eth.src.ig | Source IG bit | 1
    frame.offset_shift | Time shift for this packet | 0
    frame.len | Frame length on the wire | 1208
    frame.cap_len | Frame length stored into the capture file | 215
    frame.marked | Frame is marked | 0
    frame.ignored | Frame is ignored | 0
    frame.encap_type | Encapsulation type | 1
    gre | Generic Routing Encapsulation | 'Generic Routing Encapsulation (IP)'
    ip.version | Version | 6
    ip.hdr_len | Header length | 24
    ip.dsfield.dscp | Differentiated Services Codepoint | 56
    ip.dsfield.ecn | Explicit Congestion Notification | 2
    ip.len | Total length | 614
    ip.flags.rb | Reserved bit | 0
    ip.flags.df | Don't fragment | 1
    ip.flags.mf | More fragments | 0
    ip.frag_offset | Fragment offset | 0
    ip.ttl | Time to live | 31
    ip.proto | Protocol | 47
    ip.checksum.status | Header checksum status | 2
    tcp.srcport | TCP source port | 53425
    tcp.flags | Flags | 0x00000098
    tcp.flags.ns | Nonce | 0
    tcp.flags.cwr | Congestion Window Reduced (CWR) | 1
    udp.srcport | UDP source port | 64413
    udp.dstport | UDP destination port | 54087
    udp.stream | Stream index | 1345
    udp.length | Length | 225
    udp.checksum.status | Checksum status | 3
    packet_type | Type of the packet, either "benign" or "malicious" | 0

    Furthermore, in compliance with the GDPR and to ensure the privacy of individuals, all IP addresses present in the dataset have been anonymized through hashing. This anonymization process helps protect the identity of individuals while preserving the integrity and utility of the dataset for research and model development purposes.
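
    For intuition, hash-based anonymization of an IP address can look like the sketch below; the exact hash function and any salt used for this dataset are not published, and SHA-256 is shown only because the example values above are 64 hexadecimal characters.

      # Illustrative sketch only; the dataset's actual hashing scheme/salt is not published.
      import hashlib

      def anonymize_ip(ip: str, salt: str = "") -> str:
          """Map an IP address to a fixed-length hexadecimal digest."""
          return hashlib.sha256((salt + ip).encode("utf-8")).hexdigest()

      print(anonymize_ip("192.0.2.1"))  # 64-character digest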

    Please note that while the dataset provides valuable insights and a solid foundation for machine learning tasks, it is not a substitute for extensive real-world data collection. However, it serves as a valuable resource for researchers, practitioners, and enthusiasts in the machine learning community, offering a compliant and anonymized dataset for developing and validating custom models in a specific problem domain.

    By leveraging the validation dataset for machine learning model evaluation and custom model training, users can accelerate their research and development efforts, building upon the knowledge gained from my thesis while contributing to the advancement of the field.

  13. Mental Health

    • kaggle.com
    zip
    Updated May 7, 2025
    Cite
    Mahdi Mashayekhi (2025). Mental Health [Dataset]. https://www.kaggle.com/datasets/mahdimashayekhi/mental-health
    Explore at:
    Available download formats: zip (137,847 bytes)
    Dataset updated
    May 7, 2025
    Authors
    Mahdi Mashayekhi
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    📘 Dataset Description

    This dataset provides a realistic, synthetic simulation of global mental health survey responses from 10,000 individuals. It was created to reflect actual patterns seen in workplace mental health data while ensuring full anonymity and privacy.

    🧠 Context & Purpose

    Mental health issues affect people across all ages, countries, and industries. Understanding patterns in mental health at work, access to treatment, and stigma around disclosure is essential for shaping better workplace policies and interventions.

    This dataset is ideal for:

    • Training and evaluating machine learning models
    • Practicing classification or clustering techniques
    • Performing exploratory data analysis (EDA)
    • Studying fairness and bias in mental health predictions
    • Creating realistic dashboards for HR analytics or healthcare systems

    📊 Dataset Highlights

    • 10,000 rows representing anonymized individuals
    • Diverse global coverage with country/state info
    • Demographic attributes like age, gender, employment type
    • Information about work environment and company support
    • Responses about mental health history, treatment, and workplace stigma

    💡 Example Use Cases

    • Predicting the likelihood of an employee seeking mental health treatment
    • Identifying factors most correlated with workplace stress
    • Segmenting users by mental health risk using clustering
    • Building fairness-aware models to reduce bias in mental health predictions

    ⚠️ Notes

    • This dataset is entirely synthetic. No personally identifiable information (PII) or real user data is included.
    • It was generated based on patterns observed in public mental health datasets and surveys.
  14. Measurement Data: Latencies and Traffic Traces in Global Mobile Roaming with...

    • data.europa.eu
    unknown
    Cite
    Zenodo, Measurement Data: Latencies and Traffic Traces in Global Mobile Roaming with Regional Breakouts [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-11065734?locale=ro
    Explore at:
    Available download formats: unknown (13095)
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A Shortcut through the IPX: Measuring Latencies in Global Mobile Roaming with Regional Breakouts

    This repository contains a description and sample data for the paper "A Shortcut through the IPX: Measuring Latencies in Global Mobile Roaming with Regional Breakouts", published at the Network Traffic Measurement and Analysis (TMA) Conference 2024. In the provided README.md file, we present example snippets of the datasets, including an explanation of all contained fields. We cover the three main datasets described in the related paper:

    • DT1: User plane traces captured at multiple GGSN/PGW instances of a globally operating MVNO
    • DT2: GTP echo round trip times between visited network SGSN/SGWs and home network GGSN/PGWs
    • DT3: IPX routing information, as extracted from BGP routing tables

    For legal reasons, we are not able to publish the secondary datasets (DT4, DT5) covered in the manuscript. Finally, for privacy, security, and political reasons, certain fields in each of the datasets have been anonymized. These are indicated by the _anonymized prefix. In the case of IP addresses, the anonymization is consistent across datasets, meaning that identical IPs are anonymized to identical values across the datasets.

    Contact: For questions regarding the dataset, contact Viktoria Vomhoff (viktoria.vomhoff@uni-wuerzburg.de).

  15. Multi-modality medical image dataset for medical image processing in Python...

    • zenodo.org
    zip
    Updated Aug 12, 2024
    Cite
    Candace Moore; Giulia Crocioni (2024). Multi-modality medical image dataset for medical image processing in Python lesson [Dataset]. http://doi.org/10.5281/zenodo.13305760
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 12, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Candace Moore; Giulia Crocioni
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains a collection of medical imaging files for use in the "Medical Image Processing with Python" lesson, developed by the Netherlands eScience Center.

    The dataset includes:

    1. SimpleITK compatible files: MRI T1 and CT scans (training_001_mr_T1.mha, training_001_ct.mha), digital X-ray (digital_xray.dcm in DICOM format), neuroimaging data (A1_grayT1.nrrd, A1_grayT2.nrrd). Data have been downloaded from here.
    2. MRI data: a T2-weighted image (OBJECT_phantom_T2W_TSE_Cor_14_1.nii in NIfTI-1 format). Data have been downloaded from here.
    3. Example images for the machine learning lesson: chest X-rays (rotatechest.png, other_op.png), cardiomegaly example (cardiomegaly_cc0.png).
    4. Additional anonymized data: TBA

    These files represent various medical imaging modalities and formats commonly used in clinical research and practice. They are intended for educational purposes, allowing students to practice image processing techniques, machine learning applications, and statistical analysis of medical images using Python libraries such as scikit-image, pydicom, and SimpleITK.
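
    A minimal Python sketch of opening two of the files listed above with SimpleITK and pydicom; it assumes both libraries are installed and that the files sit in the working directory.

      # Minimal sketch, assuming SimpleITK and pydicom are installed.
      import SimpleITK as sitk
      import pydicom

      # Read the CT volume and inspect its geometry.
      ct = sitk.ReadImage("training_001_ct.mha")
      print(ct.GetSize(), ct.GetSpacing())

      # Convert to a NumPy array for processing with scikit-image, etc.
      ct_array = sitk.GetArrayFromImage(ct)  # shape: (slices, rows, columns)

      # Read the digital X-ray DICOM file and look at basic attributes.
      xray = pydicom.dcmread("digital_xray.dcm")
      print(xray.Modality, xray.pixel_array.shape)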

  16. Fraud Detection in Financial Transactions

    • kaggle.com
    zip
    Updated Jan 17, 2025
    Cite
    Darshan Dalvi (2025). Fraud Detection in Financial Transactions [Dataset]. https://www.kaggle.com/datasets/darshandalvi12/fraud-detection-in-financial-transactions
    Explore at:
    Available download formats: zip (230,131,674 bytes)
    Dataset updated
    Jan 17, 2025
    Authors
    Darshan Dalvi
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Credit Card Fraud Detection Dataset (Updated)

    This dataset contains 284,807 transactions from a credit card company, where 492 transactions are fraudulent. The data is highly imbalanced, with only a small fraction of transactions being fraudulent. The dataset is commonly used to build and evaluate fraud detection models.

    Dataset Details:

    • Number of Transactions: 284,807
    • Fraudulent Transactions: 492 (Highly Imbalanced)
    • Features:
      • 28 anonymized features (V1 to V28)
      • Transaction amount
      • Timestamp
    • Label:
      • 0: Legitimate
      • 1: Fraudulent

    Data Preprocessing:

    • SMOTE (Synthetic Minority Oversampling Technique) has been applied to address the class imbalance in the dataset, generating synthetic examples for the minority class (fraudulent transactions).
    • Additional Operations: Various preprocessing steps were performed, including data cleaning and feature engineering, to ensure the quality of the dataset for model training.

    Processed Files:

    The dataset has been split into training and testing sets and saved in the following files:

    • X_train.csv: Feature data for the training set
    • X_test.csv: Feature data for the testing set
    • y_train.csv: Labels for the training set (fraudulent or legitimate)
    • y_test.csv: Labels for the testing set
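
    A minimal scikit-learn sketch of loading these four files and fitting a baseline classifier; it assumes the CSVs load with default headers and that the label columns contain 0/1 values (logistic regression is only a stand-in baseline, not the intended fraud model).

      # Minimal baseline sketch, assuming the CSVs load with default headers.
      import pandas as pd
      from sklearn.linear_model import LogisticRegression
      from sklearn.metrics import classification_report

      X_train = pd.read_csv("X_train.csv")
      X_test = pd.read_csv("X_test.csv")
      y_train = pd.read_csv("y_train.csv").squeeze("columns")  # single label column
      y_test = pd.read_csv("y_test.csv").squeeze("columns")

      clf = LogisticRegression(max_iter=1000)
      clf.fit(X_train, y_train)

      print(classification_report(y_test, clf.predict(X_test), digits=4))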

    This updated dataset is ready to be used for training and evaluating machine learning models, specifically designed for credit card fraud detection tasks.

    This description highlights the key aspects of the dataset, including its preprocessing steps and the availability of the processed files for ease of use.

  17. Some features of the dataset from a bank.

    • plos.figshare.com
    xls
    Updated Mar 6, 2024
    Cite
    HaiChao Du; Li Lv; Hongliang Wang; An Guo (2024). Some features of the dataset from a bank. [Dataset]. http://doi.org/10.1371/journal.pone.0294537.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Mar 6, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    HaiChao Du; Li Lv; Hongliang Wang; An Guo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Credit card fraud is a significant problem that costs billions of dollars annually. Detecting fraudulent transactions is challenging due to the imbalance in class distribution, where the majority of transactions are legitimate. While pre-processing techniques such as oversampling of minority classes are commonly used to address this issue, they often generate unrealistic or overgeneralized samples. This paper proposes a method called autoencoder with probabilistic xgboost based on SMOTE and CGAN (AE-XGB-SMOTE-CGAN) for detecting credit card fraud.

    AE-XGB-SMOTE-CGAN is a novel method proposed for credit card fraud detection problems. The credit card fraud dataset comes from a real dataset anonymized by a bank and is highly imbalanced, with normal data far greater than fraud data. An autoencoder (AE) is used to extract relevant features from the dataset, enhancing the ability of feature representation learning; these features are then fed into xgboost for classification according to the threshold. Additionally, in this study, we propose a novel approach that hybridizes a Generative Adversarial Network (GAN) and the Synthetic Minority Over-Sampling Technique (SMOTE) to tackle class imbalance problems. Our two-phase oversampling approach involves knowledge transfer and leverages the synergies of SMOTE and GAN. Specifically, GAN transforms the unrealistic or overgeneralized samples generated by SMOTE into realistic data distributions where there is not enough minority class data available for GAN to process effectively on its own. SMOTE is used to address class imbalance issues and CGAN is used to generate new, realistic data to supplement the original dataset.

    The AE-XGB-SMOTE-CGAN algorithm is also compared to other commonly used machine learning algorithms, such as KNN and LightGBM, and shows an overall improvement of 2% in terms of the ACC index compared to these algorithms. The AE-XGB-SMOTE-CGAN algorithm also outperforms KNN in terms of the MCC index by 30% when the threshold is set to 0.35. This indicates that the AE-XGB-SMOTE-CGAN algorithm has higher accuracy, true positive rate, true negative rate, and Matthews correlation coefficient, making it a promising method for detecting credit card fraud.
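
    For orientation, the sketch below shows only the SMOTE-plus-XGBoost core of such an approach on stand-in data (using imbalanced-learn and xgboost); the paper's autoencoder and CGAN stages are omitted, so this is not the authors' full AE-XGB-SMOTE-CGAN pipeline.

      # Simplified sketch of SMOTE oversampling + XGBoost with a probability threshold.
      # The autoencoder and CGAN stages of the paper are intentionally omitted.
      import numpy as np
      from imblearn.over_sampling import SMOTE
      from sklearn.model_selection import train_test_split
      from xgboost import XGBClassifier

      # Stand-in imbalanced data: roughly 5% positive (fraud) class.
      X = np.random.rand(2000, 10)
      y = np.random.binomial(1, 0.05, size=2000)
      X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

      # Oversample only the training split to avoid leaking synthetic samples into the test set.
      X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

      model = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
      model.fit(X_res, y_res)

      # Classify at a tunable threshold (the paper reports results at 0.35).
      proba = model.predict_proba(X_te)[:, 1]
      preds = (proba >= 0.35).astype(int)
      print("flagged as fraud:", int(preds.sum()), "of", len(preds))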

  18. 2025 Kaggle Machine Learning & Data Science Survey

    • kaggle.com
    Updated Jan 28, 2025
    Cite
    Hina Ismail (2025). 2025 Kaggle Machine Learning & Data Science Survey [Dataset]. https://www.kaggle.com/datasets/sonialikhan/2025-kaggle-machine-learning-and-data-science-survey
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 28, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Hina Ismail
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Overview

    Welcome to Kaggle's second annual Machine Learning and Data Science Survey ― and our first-ever survey data challenge.

    This year, as last year, we set out to conduct an industry-wide survey that presents a truly comprehensive view of the state of data science and machine learning. The survey was live for one week in October, and after cleaning the data we finished with 23,859 responses, a 49% increase over last year!

    There's a lot to explore here. The results include raw numbers about who is working with data, what’s happening with machine learning in different industries, and the best ways for new data scientists to break into the field. We've published the data in as raw a format as possible without compromising anonymization, which makes it an unusual example of a survey dataset.

    Challenge

    This year Kaggle is launching the first Data Science Survey Challenge, where we will be awarding a prize pool of $28,000 to kernel authors who tell a rich story about a subset of the data science and machine learning community.

    In our second year running this survey, we were once again awed by the global, diverse, and dynamic nature of the data science and machine learning industry. This survey data EDA provides an overview of the industry on an aggregate scale, but it also leaves us wanting to know more about the many specific communities comprised within the survey. For that reason, we’re inviting the Kaggle community to dive deep into the survey datasets and help us tell the diverse stories of data scientists from around the world.

    The challenge objective: tell a data story about a subset of the data science community represented in this survey, through a combination of both narrative text and data exploration. A “story” could be defined any number of ways, and that’s deliberate. The challenge is to deeply explore (through data) the impact, priorities, or concerns of a specific group of data science and machine learning practitioners. That group can be defined in the macro (for example: anyone who does most of their coding in Python) or the micro (for example: female data science students studying machine learning in masters programs). This is an opportunity to be creative and tell the story of a community you identify with or are passionate about!

    Submissions will be evaluated on the following:

    • Composition - Is there a clear narrative thread to the story that's articulated and supported by data? The subject should be well defined, well researched, and well supported through the use of data and visualizations.
    • Originality - Does the reader learn something new through this submission? Or is the reader challenged to think about something in a new way? A great entry will be informative, thought provoking, and fresh all at the same time.
    • Documentation - Are your code, kernel, and additional data sources well documented so a reader can understand what you did? Are your sources clearly cited? A high-quality analysis should be concise and clear at each step so the rationale is easy to follow and the process is reproducible.

    To be valid, a submission must be contained in one kernel, made public on or before the submission deadline. Participants are free to use any datasets in addition to the Kaggle Data Science survey, but those datasets must also be publicly available on Kaggle by the deadline for a submission to be valid.

    While the challenge is running, Kaggle will also give a Weekly Kernel Award of $1,500 to recognize excellent kernels that are public analyses of the survey. Weekly Kernel Awards will be announced every Friday between 11/9 and 11/30.

    How to Participate

    To make a submission, complete the submission form. Only one submission will be judged per participant, so if you make multiple submissions we will review the last (most recent) entry.

    No submission is necessary for the Weekly Kernels Awards. To be eligible, a kernel must be public and use the 2018 Data Science Survey as a data source.

    Timeline

    All dates are 11:59 PM UTC.

    Submission deadline: December 3rd

    Winners announced: December 10th

    Weekly Kernels Award prize winners announcements: November 9th, 16th, 23rd, and 30th

    All kernels are evaluated after the deadline.

    Rules

    To be eligible to win a prize in either of the above prize tracks, you must be:

    • a registered account holder at Kaggle.com;
    • the older of 18 years old or the age of majority in your jurisdiction of residence; and
    • not a resident of Crimea, Cuba, Iran, Syria, North Korea, or Sudan.

    Your kernels will only be eligible to win if they have been made public on kaggle.com by the above deadline. All prizes are awarded at the discretion of Kaggle. Kaggle reserves the right to cancel or modify prize criteria.

    Unfortunately employees, interns, contractors, officers and directors of Kaggle Inc., and their parent companies, are not eligible to win any prizes.

    Survey Methodology ...

  19. Data from: TweetsKB - A Public and Large-Scale RDF Corpus of Annotated...

    • data.wu.ac.at
    api/sparql, rdf/n3
    Updated Dec 13, 2017
    + more versions
    Cite
    L3S Research Center (2017). TweetsKB - A Public and Large-Scale RDF Corpus of Annotated Tweets [Dataset]. https://data.wu.ac.at/schema/datahub_io/NWQzMDJiYWItNTlkZS00Zjg0LWIxNDQtNWZhNmQwMTRiNTFj
    Explore at:
    Available download formats: rdf/n3, api/sparql
    Dataset updated
    Dec 13, 2017
    Dataset provided by
    L3S Research Center
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    TweetsKB is a public RDF corpus of anonymized data for a large collection of annotated tweets. The dataset currently contains data for more than 1.5 billion tweets, spanning almost 5 years (January 2013 - November 2017). Metadata information about the tweets as well as extracted entities, sentiments, hashtags and user mentions are exposed in RDF using established RDF/S vocabularies. For the sake of privacy, we anonymize the usernames and we do not provide the text of the tweets. However, through the tweet IDs, actual tweet content and further information can be fetched.


    Sample files, example queries and more information are available through TweetsKB's home page: http://l3s.de/tweetsKB/.
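
    A minimal rdflib sketch for inspecting one downloaded N3 part of the corpus; the file name is a placeholder, and no specific vocabulary terms are assumed beyond what the dump itself contains.

      # Minimal sketch; "tweetskb_part.n3" is a placeholder for a downloaded part.
      from rdflib import Graph

      g = Graph()
      g.parse("tweetskb_part.n3", format="n3")
      print(len(g), "triples loaded")

      # List the vocabulary (predicates) actually used in this part.
      predicates = sorted({str(p) for _, p, _ in g})
      for p in predicates[:20]:
          print(p)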

  20. Training Dataset for HNTSMRG 2024 Challenge

    • data.niaid.nih.gov
    Updated Jun 21, 2024
    Cite
    Wahid, Kareem; Dede, Cem; Naser, Mohamed; Fuller, Clifton (2024). Training Dataset for HNTSMRG 2024 Challenge [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11199558
    Explore at:
    Dataset updated
    Jun 21, 2024
    Dataset provided by
    The University of Texas MD Anderson Cancer Center
    Authors
    Wahid, Kareem; Dede, Cem; Naser, Mohamed; Fuller, Clifton
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Training Dataset for HNTSMRG 2024 Challenge

    Overview

    This repository houses the publicly available training dataset for the Head and Neck Tumor Segmentation for MR-Guided Applications (HNTSMRG) 2024 Challenge.

    Patient cohorts correspond to patients with histologically proven head and neck cancer who underwent radiotherapy (RT) at The University of Texas MD Anderson Cancer Center. The cancer types are predominately oropharyngeal cancer or cancer of unknown primary. Images include a pre-RT T2w MRI scan (1-3 weeks before start of RT) and a mid-RT T2w MRI scan (2-4 weeks intra-RT) for each patient. Segmentation masks of primary gross tumor volumes (abbreviated GTVp) and involved metastatic lymph nodes (abbreviated GTVn) are provided for each image (derived from multi-observer STAPLE consensus).

    HNTSMRG 2024 is split into 2 tasks:

    Task 1: Segmentation of tumor volumes (GTVp and GTVn) on pre-RT MRI.

    Task 2: Segmentation of tumor volumes (GTVp and GTVn) on mid-RT MRI.

    The same patient cases will be used for the training and test sets of both tasks of this challenge. Therefore, we are releasing a single training dataset that can be used to construct solutions for either segmentation task. The test data provided (via Docker containers), however, will be different for the two tasks. Please consult the challenge website for more details.

    Data Details

    DICOM files (images and structure files) have been converted to NIfTI format (.nii.gz) for ease of use by participants via DICOMRTTool v. 1.0.

    Images are a mix of fat-suppressed and non-fat-suppressed MRI sequences. Pre-RT and mid-RT image pairs for a given patient are consistently either fat-suppressed or non-fat-suppressed.

    Though some sequences may appear to be contrast enhancing, no exogenous contrast is used.

    All images have been manually cropped from the top of the clavicles to the bottom of the nasal septum (~ oropharynx region to shoulders), allowing for more consistent image field of views and removal of identifiable facial structures.

    The mask files have one of three possible values: background = 0, GTVp = 1, GTVn = 2 (in the case of multiple lymph nodes, they are concatenated into one single label). This labeling convention is similar to the 2022 HECKTOR Challenge.

    150 unique patients are included in this dataset. Anonymized patient numeric identifiers are utilized.

    The entire training dataset is ~15 GB.

    Dataset Folder/File Structure

    The dataset is uploaded as a ZIP archive. Please unzip before use. NIfTI files conform to the following standardized nomenclature: ID_timepoint_image/mask.nii.gz. Files in the midRT folder that carry a "registered" suffix (ID_timepoint_image/mask_registered.nii.gz) are pre-RT images or masks that have been registered to the mid-RT image space (see more details in Additional Notes below).

    The data is provided with the following folder hierarchy:

    Top-level folder (named "HNTSMRG24_train")
        Patient-level folder (anonymized patient ID, example: "2")
            Pre-radiotherapy data folder ("preRT")
                Original pre-RT T2w MRI volume (example: "2_preRT_T2.nii.gz")
                Original pre-RT tumor segmentation mask (example: "2_preRT_mask.nii.gz")
            Mid-radiotherapy data folder ("midRT")
                Original mid-RT T2w MRI volume (example: "2_midRT_T2.nii.gz")
                Original mid-RT tumor segmentation mask (example: "2_midRT_mask.nii.gz")
                Registered pre-RT T2w MRI volume (example: "2_preRT_T2_registered.nii.gz")
                Registered pre-RT tumor segmentation mask (example: "2_preRT_mask_registered.nii.gz")
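    To make the layout concrete, here is a minimal, non-official sketch of loading one training case with SimpleITK; the patient ID and root path are illustrative assumptions, and any NIfTI reader such as nibabel would also work.

        import os
        import SimpleITK as sitk

        root = "HNTSMRG24_train"   # path to the unzipped training archive (illustrative)
        pid = "2"                  # anonymized patient ID

        pre_dir = os.path.join(root, pid, "preRT")
        mid_dir = os.path.join(root, pid, "midRT")

        pre_img  = sitk.ReadImage(os.path.join(pre_dir, f"{pid}_preRT_T2.nii.gz"))
        pre_mask = sitk.ReadImage(os.path.join(pre_dir, f"{pid}_preRT_mask.nii.gz"))
        mid_img  = sitk.ReadImage(os.path.join(mid_dir, f"{pid}_midRT_T2.nii.gz"))
        mid_mask = sitk.ReadImage(os.path.join(mid_dir, f"{pid}_midRT_mask.nii.gz"))

        # Registered pre-RT data (already resampled into the mid-RT image space)
        # is stored alongside the mid-RT files.
        pre_img_reg  = sitk.ReadImage(os.path.join(mid_dir, f"{pid}_preRT_T2_registered.nii.gz"))
        pre_mask_reg = sitk.ReadImage(os.path.join(mid_dir, f"{pid}_preRT_mask_registered.nii.gz"))

        print("pre-RT size:", pre_img.GetSize(), "mid-RT size:", mid_img.GetSize())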

    Note: Cases will exhibit variable presentation of ground truth mask structures. For example, a case could have only a GTVp label present, only a GTVn label present, both GTVp and GTVn labels present, or a completely empty mask (i.e., complete tumor response at mid-RT). The following case IDs have empty masks at mid-RT (indicating a complete response): 21, 25, 29, 42. These empty masks are not errors. There will similarly be some cases in the test set for Task 2 that have empty masks.

    Details Relevant for Algorithm Building

    The goal of Task 1 is to generate a pre-RT tumor segmentation mask (e.g., "2_preRT_mask.nii.gz" is the relevant label). During blind testing for Task 1, only the pre-RT MRI (e.g., "2_preRT_T2.nii.gz") will be provided to the participants' algorithms.

    The goal of Task 2 is to generate a mid-RT segmentation mask (e.g., "2_midRT_mask.nii.gz" is the relevant label). During blind testing for Task 2, the mid-RT MRI (e.g., "2_midRT_T2.nii.gz"), original pre-RT MRI (e.g., "2_preRT_T2.nii.gz"), original pre-RT tumor segmentation mask (e.g., "2_preRT_mask.nii.gz"), registered pre-RT MRI (e.g., "2_preRT_T2_registered.nii.gz"), and registered pre-RT tumor segmentation mask (e.g., "2_preRT_mask_registered.nii.gz") will be provided to the participants' algorithms.

    When building models, the resolution of the generated prediction masks should be the same as the corresponding MRI for the given task. In other words, the generated masks should be in the correct pixel spacing and origin with respect to the original reference frame (i.e., pre-RT image for Task 1, mid-RT image for Task 2). More details on the submission of models will be located on the challenge website.
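    As a hedged illustration of this geometry requirement (not an official submission utility), the sketch below resamples a model prediction onto the reference image grid with SimpleITK; the file names are illustrative, and the prediction is assumed to be already spatially aligned with the reference image.

        import SimpleITK as sitk

        # Reference image for the task: the pre-RT T2w MRI for Task 1, the mid-RT T2w MRI for Task 2.
        reference  = sitk.ReadImage("2_preRT_T2.nii.gz")
        prediction = sitk.ReadImage("prediction.nii.gz")   # model output, possibly on a different grid

        # Nearest-neighbor interpolation preserves the integer labels (0 = background, 1 = GTVp, 2 = GTVn).
        resampled = sitk.Resample(
            prediction,
            reference,                 # copy size, spacing, origin, and direction from the reference
            sitk.Transform(),          # identity transform (prediction assumed already aligned)
            sitk.sitkNearestNeighbor,
            0,                         # background value for voxels outside the prediction
            prediction.GetPixelID(),
        )
        sitk.WriteImage(resampled, "prediction_in_reference_space.nii.gz")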

    Additional Notes

    General notes.

    NIfTI format images and segmentations may be easily visualized in any NIfTI viewing software such as 3D Slicer.

    Test data will not be made public until the completion of the challenge. The complete training and test data will be published together (along with all original multi-observer annotations and relevant clinical data) at a later date via The Cancer Imaging Archive. Expected date ~ Spring 2025.

    Task 1 related notes.

    When training their algorithms for Task 1, participants can choose to use only pre-RT data or add in mid-RT data as well. Initially, our plan was to limit participants to utilizing only pre-RT data for training their algorithms in Task 1. However, upon reflection, we recognized that in a practical setting, individuals aiming to develop auto-segmentation algorithms could theoretically train models using any accessible data at their disposal. Based on current literature, we actually don't know what the best solution would be! Would the incorporation of mid-RT data for training a pre-RT segmentation model actually be helpful, or would it merely introduce harmful noise? The answer remains unclear. Therefore, we leave this choice to the participants.

    Remember, though, during testing, you will ONLY have the pre-RT image as an input to your model (naturally, since Task 1 is a pre-RT segmentation task and you won't know what mid-RT data for a patient will look like).

    Task 2 related notes.

    In addition to the mid-RT MRI and segmentation mask, we have also provided a registered pre-RT MRI and the corresponding registered pre-RT segmentation mask for each patient. We offer this data for participants who opt not to integrate any image registration techniques into their algorithms for Task 2 but still wish to use the two images as a joint input to their model. Moreover, in a real-world adaptive RT context, such registered scans are typically readily accessible. Naturally, participants are also free to incorporate their own image registration processes into their pipelines if they wish (or ignore the pre-RT images/masks altogether).

    Registrations were generated using SimpleITK, where the mid-RT image serves as the fixed image and the pre-RT image serves as the moving image. Specifically, we utilized the following steps: 1. Apply a centered transformation, 2. Apply a rigid transformation, 3. Apply a deformable transformation with Elastix using a preset parameter map (Parameter map 23 in the Elastix Zoo). This particular deformable transformation was selected as it is open-source and was benchmarked in a previous similar application (https://doi.org/10.1002/mp.16128). For cases where excessive warping was noted during deformable registration (a small minority of cases), only the rigid transformation was applied.
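    For orientation only, the following is a minimal SimpleITK sketch of the centered-initialization and rigid steps of such a pipeline. It is not the organizers' exact code: the metric and optimizer settings are illustrative assumptions, and the deformable Elastix step (Zoo parameter map 23) is only indicated in a comment because it requires an Elastix-enabled build (e.g., SimpleElastix or itk-elastix).

        import SimpleITK as sitk

        fixed  = sitk.ReadImage("2_midRT_T2.nii.gz", sitk.sitkFloat32)   # mid-RT image = fixed
        moving = sitk.ReadImage("2_preRT_T2.nii.gz", sitk.sitkFloat32)   # pre-RT image = moving

        # Step 1: centered initialization.
        initial = sitk.CenteredTransformInitializer(
            fixed, moving, sitk.Euler3DTransform(),
            sitk.CenteredTransformInitializerFilter.GEOMETRY)

        # Step 2: rigid registration (mutual information + gradient descent; settings are assumptions).
        reg = sitk.ImageRegistrationMethod()
        reg.SetMetricAsMattesMutualInformation(numberOfHistogramBins=50)
        reg.SetOptimizerAsGradientDescent(learningRate=1.0, numberOfIterations=200)
        reg.SetOptimizerScalesFromPhysicalShift()
        reg.SetInitialTransform(initial, inPlace=False)
        reg.SetInterpolator(sitk.sitkLinear)
        rigid = reg.Execute(fixed, moving)

        # Step 3 (omitted here): a deformable Elastix registration with a preset parameter map
        # (Elastix Zoo parameter map 23) would refine this result; per the description above,
        # rigid-only output was kept for the few cases with excessive warping.

        moved = sitk.Resample(moving, fixed, rigid, sitk.sitkLinear, 0.0, moving.GetPixelID())
        sitk.WriteImage(moved, "preRT_T2_rigid_to_midRT.nii.gz")   # illustrative output name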

    Contact

    We have set up a general email address that reaches all organizers: hntsmrg2024@gmail.com. Additional specific organizer contacts:

    Kareem A. Wahid, PhD (kawahid@mdanderson.org)

    Cem Dede, MD (cdede@mdanderson.org)

    Mohamed A. Naser, PhD (manaser@mdanderson.org)
