35 datasets found
  1. pii-masking-43k

    • huggingface.co
    Updated Jul 1, 2023
    + more versions
    Cite
    Ai4Privacy (2023). pii-masking-43k [Dataset]. http://doi.org/10.57967/hf/0824
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant). 3 scholarly articles cite this dataset (view in Google Scholar).
    Dataset updated
    Jul 1, 2023
    Dataset authored and provided by
    Ai4Privacy
    Description

    Purpose and Features

    The purpose of the model and dataset is to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The model is a fine-tuned version of DistilBERT, a smaller and faster version of BERT, adapted for token classification and trained on what is, to our knowledge, the largest open-source PII masking dataset, which we are releasing simultaneously. The model size is 62 million parameters. The original… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-43k.
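
    As a quick, hedged illustration (not from the dataset card), the dataset can be pulled with the Hugging Face datasets library; the split and column names below are assumptions to verify on the dataset page:

    ```python
    # A minimal sketch: load the dataset and inspect one record.
    from datasets import load_dataset

    ds = load_dataset("ai4privacy/pii-masking-43k")
    print(ds)                    # shows the available splits and their columns
    split = list(ds.keys())[0]   # e.g. "train" (assumed; check the dataset page)
    print(ds[split][0])          # one unmasked/masked text pair
    ```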

  2. AI4privacy-PII

    • kaggle.com
    zip
    Updated Jan 23, 2024
    Cite
    Wilmer E. Henao (2024). AI4privacy-PII [Dataset]. https://www.kaggle.com/datasets/verracodeguacas/ai4privacy-pii
    Explore at:
    Available download formats: zip (93130230 bytes)
    Dataset updated
    Jan 23, 2024
    Authors
    Wilmer E. Henao
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Developed by AI4Privacy, this dataset represents a pioneering effort in the realm of privacy and AI. As an expansive resource hosted on Hugging Face at ai4privacy/pii-masking-200k, it serves a crucial role in addressing the growing concerns around personal data security in AI applications.

    Sources: The dataset is crafted using proprietary algorithms, ensuring the creation of synthetic data that avoids privacy violations. Its multilingual composition, including English, French, German, and Italian texts, reflects a diverse source base. The data is meticulously curated with human-in-the-loop validation, ensuring both relevance and quality.

    Context: In an era where data privacy is paramount, this dataset is tailored to train AI models to identify and mask personally identifiable information (PII). It covers 54 PII classes and extends across 229 use cases in various domains like business, education, psychology, and legal fields, emphasizing its contextual richness and applicability.

    Inspiration: The dataset draws inspiration from the need for enhanced privacy measures in AI interactions, particularly in LLMs and AI assistants. The creators, AI4Privacy, are dedicated to building tools that act as a 'global seatbelt' for AI, protecting individuals' personal data. This dataset is a testament to their commitment to advancing AI technology responsibly and ethically.

    This comprehensive dataset is not just a tool but a step towards a future where AI and privacy coexist harmoniously, offering immense value to researchers, developers, and privacy advocates alike.

  3. Sentinel-2 Cloud Mask Catalogue

    • data.niaid.nih.gov
    Updated Jul 19, 2024
    Cite
    Francis, Alistair; Mrziglod, John; Sidiropoulos, Panagiotis; Muller, Jan-Peter (2024). Sentinel-2 Cloud Mask Catalogue [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4172870
    Explore at:
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    University College London
    Hummingbird Technologies Ltd
    World Food Programme
    Authors
    Francis, Alistair; Mrziglod, John; Sidiropoulos, Panagiotis; Muller, Jan-Peter
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    This dataset comprises cloud masks for 513 1022-by-1022 pixel subscenes, at 20m resolution, sampled randomly from the 2018 Level-1C Sentinel-2 archive. The design of this dataset follows from some observations about cloud masking: (i) performance over an entire product is highly correlated, so subscenes provide more value per pixel than full scenes; (ii) current cloud masking datasets often focus on specific regions, or hand-select the products used, which introduces a bias that is not representative of real-world data; (iii) cloud mask performance appears to be highly correlated with surface type and cloud structure, so testing should include analysis of failure modes in relation to these variables.

    The data were annotated semi-automatically using the IRIS toolkit, which allows users to dynamically train a Random Forest (implemented using LightGBM), speeding up annotation by iteratively improving its predictions while preserving the annotator's ability to make final manual changes when needed. This hybrid approach allowed us to process many more masks than would have been possible manually, which we felt was vital in creating a large enough dataset to approximate the statistics of the whole Sentinel-2 archive.

    In addition to the pixel-wise, 3-class (CLEAR, CLOUD, CLOUD_SHADOW) segmentation masks, we also provide users with binary classification "tags" for each subscene that can be used in testing to determine performance in specific circumstances. These include:

    SURFACE TYPE: 11 categories

    CLOUD TYPE: 7 categories

    CLOUD HEIGHT: low, high

    CLOUD THICKNESS: thin, thick

    CLOUD EXTENT: isolated, extended

    Wherever practical, cloud shadows were also annotated; however, this was sometimes not possible due to high-relief terrain or large ambiguities. In total, 424 subscenes were marked with shadows (where present), and 89 have shadows that were not annotatable due to very ambiguous shadow boundaries, or terrain that cast significant shadows. If users wish to train an algorithm specifically for cloud shadow masks, we advise removing those 89 images for which shadow annotation was not possible; however, bear in mind that this will systematically reduce the difficulty of the shadow class compared to real-world use, as these contain the most difficult shadow examples.
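
    For instance, a quick per-subscene class-balance check might look like the sketch below; the one-channel-per-class array layout and the filename are assumptions to verify against the dataset README:

    ```python
    # A minimal sketch (not the dataset's official loader): per-class pixel
    # fractions for one subscene mask. Assumes the mask loads as a
    # (1022, 1022, 3) boolean array, one channel per class; the actual file
    # layout is documented in the README.
    import numpy as np

    CLASSES = ["CLEAR", "CLOUD", "CLOUD_SHADOW"]
    mask = np.load("subscene_0001_mask.npy")   # hypothetical filename

    for i, name in enumerate(CLASSES):
        print(f"{name}: {mask[..., i].mean():.1%} of pixels")
    ```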

    In addition to the 20m sampled subscenes and masks, we also provide users with shapefiles that define the boundary of the mask on the original Sentinel-2 scene. If users wish to retrieve the L1C bands at their original resolutions, they can use these to do so.

    Please see the README for further details on the dataset structure and more.

    Contributions & Acknowledgements

    The data were collected, annotated, checked, formatted and published by Alistair Francis and John Mrziglod.

    Support and advice was provided by Prof. Jan-Peter Muller and Dr. Panagiotis Sidiropoulos, for which we are grateful.

    We would like to extend our thanks to Dr. Pierre-Philippe Mathieu and the rest of the team at ESA PhiLab, who provided the environment in which this project was conceived, and continued to give technical support throughout.

    Finally, we thank the ESA Network of Resources for sponsoring this project by providing ICT resources.

  4. pii-masking-200k

    • huggingface.co
    Updated Apr 22, 2024
    Cite
    Ai4Privacy (2024). pii-masking-200k [Dataset]. http://doi.org/10.57967/hf/1532
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 22, 2024
    Dataset authored and provided by
    Ai4Privacy
    Description

    Ai4Privacy Community

    Join our community at https://discord.gg/FmzWshaaQT to help build open datasets for privacy masking.

    Purpose and Features

    Previously the world's largest open dataset for privacy masking; it has since been superseded by pii-masking-300k. The purpose of the dataset is to train models to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The example texts have 54 PII classes (types of sensitive data), targeting 229 discussion… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-200k.

  5. 2D Printed Mask and Replay Attack Videos Dataset

    • kaggle.com
    zip
    Updated Aug 17, 2025
    + more versions
    Cite
    Unidata (2025). 2D Printed Mask and Replay Attack Videos Dataset [Dataset]. https://www.kaggle.com/datasets/unidpro/2d-printed-mask-dataset/code
    Explore at:
    Available download formats: zip (636187802 bytes)
    Dataset updated
    Aug 17, 2025
    Authors
    Unidata
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    2D Mask Attack Dataset - 26,436 videos

    The dataset comprises 26,436 videos of real faces, 2D print attacks (printed photos), and replay attacks (faces displayed on screens), captured under varied conditions. Designed for attack detection research, it supports the development of robust face antispoofing and spoofing detection methods, critical for facial recognition security.

    Ideal for training models and refining anti-spoofing methods, the dataset enhances detection accuracy in biometric systems.


    💵 Buy the Dataset: This is a limited preview of the data. To access the full dataset, please contact us at https://unidata.pro to discuss your requirements and pricing options.

    Researchers can leverage this training data to improve detection accuracy, validate models trained on adversarial examples, and advance recognition systems against sophisticated masked attacks.

    🌐 UniData provides high-quality datasets, content moderation, data collection and annotation for your AI/ML projects

  6. Example of strategies to implement masking during data analysis.

    • plos.figshare.com
    xls
    Updated Jun 21, 2023
    Cite
    Natasha A. Karp; Esther J. Pearl; Emma J. Stringer; Chris Barkus; Jane Coates Ulrichsen; Nathalie Percie du Sert (2023). Example of strategies to implement masking during data analysis. [Dataset]. http://doi.org/10.1371/journal.pbio.3001873.t005
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Natasha A. Karp; Esther J. Pearl; Emma J. Stringer; Chris Barkus; Jane Coates Ulrichsen; Nathalie Percie du Sert
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We have classified the masking options as either low, moderate, or high based on the ability to minimise risk of bias. Shown in bold is the recommended high-quality strategy that is readily implementable across different experiment types.

  7. Image Mask (Deprecated)

    • data-salemva.opendata.arcgis.com
    Updated Jun 27, 2018
    Cite
    esri_en (2018). Image Mask (Deprecated) [Dataset]. https://data-salemva.opendata.arcgis.com/datasets/59486ebf228f4661aeaecb770dd73de8
    Explore at:
    Dataset updated
    Jun 27, 2018
    Dataset provided by
    Esri (http://esri.com/)
    Authors
    esri_en
    Description

    Image Mask is a configurable app template for identifying areas of an image that have changed over time or that meet user-set thresholds for calculated spectral indexes. The template also includes tools for measurement, recording locations, and more. App users can zoom to bookmarked areas of interest (or search for their own), select any of the imagery layers from the associated web map to analyze, use a time slider or dropdown menu to select images, then choose between the Change Detection or Mask tools to produce results.

    Image Mask users can do the following:

    • Zoom to bookmarked areas of interest (or bookmark their own)
    • Select specific images from a layer to visualize (search by date or another attribute)
    • Use the Change Detection tool to compare two images in a layer (see options below)
    • Use the Mask tool to highlight areas that meet a user-set threshold for common spectral indexes (NDVI, SAVI, a burn index, and a water index); for example, highlight all the areas in an image with NDVI values above 0.25 to find vegetation
    • Annotate imagery using editable feature layers
    • Perform image measurement on imagery layers that have mensuration capabilities
    • Export an imagery layer to the user's local machine, or as a layer in the user's ArcGIS account

    Use cases:

    • A student investigating urban expansion over time using Esri's Multispectral Landsat image service
    • A farmer using NAIP imagery to examine changes in crop health
    • An image analyst recording burn scar extents using satellite imagery
    • An aid worker identifying regions with extreme drought to focus assistance

    Change detection methods. For each imagery layer, give app users one or more of the following change detection options:

    • Image Brightness (calculates the change in overall brightness)
    • Vegetation Index (NDVI) (requires red and infrared bands)
    • Soil-Adjusted Vegetation Index (SAVI) (requires red and infrared bands)
    • Water Index (requires green and short-wave infrared bands)
    • Burn Index (requires infrared and short-wave infrared bands)

    For each of the indexes, users also have a choice between three modes:

    • Difference Image: calculates increases and decreases for the full extent
    • Difference Mask: users can focus on significant change by setting the minimum increase or decrease to be masked; for example, a user could mask only areas where NDVI increased by at least 0.2
    • Threshold Mask: the user sets a threshold and magnitude for what is masked as change; the app will only identify change that is above the user-set lower threshold and bigger than the user-set minimum magnitude

    Supported devices: this application is responsively designed to support use in browsers on desktops, mobile phones, and tablets.

    Data requirements: creating an app with this template requires a web map with at least one imagery layer.

    Get started. This application can be created in the following ways:

    • Click the Create a Web App button on this page
    • Share a map and choose to Create a Web App
    • On the Content page, click Create - App - From Template
    • Click the Download button to access the source code, if you want to host the app on your own server and optionally customize it to add features or change styling

  8. Multi Token Completion

    • registry.opendata.aws
    Updated Feb 11, 2023
    Cite
    Amazon (2023). Multi Token Completion [Dataset]. https://registry.opendata.aws/multi-token-completion/
    Explore at:
    Dataset updated
    Feb 11, 2023
    Dataset provided by
    Amazon.com (http://amazon.com/)
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset provides masked sentences and multi-token phrases that were masked out of these sentences. We offer 3 datasets: a general-purpose dataset extracted from the Wikipedia and Books corpora, and 2 additional datasets extracted from PubMed abstracts. As for the PubMed data, please be aware that the dataset does not reflect the most current/accurate data available from NLM (it is not being updated). For these datasets, the columns provided for each datapoint are as follows:

    • text - the original sentence
    • span - the span (phrase) which is masked out
    • span_lower - the lowercase version of span
    • range - the range in the text string which will be masked out (this is important because span might appear more than once in text)
    • freq - the corpus frequency of span_lower
    • masked_text - the masked version of text, where span is replaced with [MASK]

    Additionally, we provide a small (3K) dataset with human annotations.
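
    To make the column semantics concrete, here is a small sketch of how masked_text relates to text and range; the exact encoding of range (character start/end offsets) is an assumption to verify against the data:

    ```python
    # A minimal sketch: rebuild masked_text by replacing the character range
    # (not merely the first occurrence of span) with [MASK].
    def build_masked_text(text: str, start: int, end: int) -> str:
        """Replace the character span text[start:end] with the [MASK] token."""
        return text[:start] + "[MASK]" + text[end:]

    row = {  # illustrative values, not a real datapoint
        "text": "The mask was worn over the mask mandate sign.",
        "span": "mask mandate",
        "range": (27, 39),
    }
    print(build_masked_text(row["text"], *row["range"]))
    # -> The mask was worn over the [MASK] sign.
    ```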

  9. Printed 2D Masks Attacks Video

    • kaggle.com
    zip
    Updated Aug 1, 2023
    Cite
    Unique Data (2023). Printed 2D Masks Attacks Video [Dataset]. https://www.kaggle.com/datasets/trainingdatapro/cut-out-printout-attacks
    Explore at:
    Available download formats: zip (702274659 bytes)
    Dataset updated
    Aug 1, 2023
    Authors
    Unique Data
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Printed 2D Masks Attacks - Biometric Attack dataset

    The anti-spoofing dataset includes 3 different types of files of real people: original selfies, original videos, and videos of attacks with printed 2D masks. The liveness detection dataset addresses tasks in the field of anti-spoofing and is useful for business and safety systems.

    👉 Legally sourced datasets and carefully structured for AI training and model development. Explore samples from our dataset - Full dataset

    Content

    The dataset consists of three folders:

    • live_selfie contains the original selfies of people
    • live_video includes original videos of people
    • 2d_masks contains videos of attacks with the 2D printed mask, created using the original images from the "live_selfie" folder

    🧩 This is just an example of the data. Leave a request here to learn more

    File with the extension .csv includes the following information for each media file:

    • live_selfie: the link to access the original selfie
    • live_video: the link to access the original video
    • phone_model: the model of the phone with which the selfie and video were shot
    • 2d_masks: the link to access the video of the attack with the 2D printed mask

    🚀 You can learn more about our high-quality unique datasets here

    keywords: ibeta level 1, ibeta level 2, liveness detection systems, liveness detection dataset, biometric dataset, biometric data dataset, biometric system attacks, anti-spoofing dataset, face liveness detection, deep learning dataset, face spoofing database, face anti-spoofing, face recognition, face detection, face identification, human video dataset, video dataset, presentation attack detection, presentation attack dataset, 2d print attacks, print 2d attacks dataset, phone attack dataset, face anti spoofing, large-scale face anti spoofing, rich annotations anti spoofing dataset, cut prints spoof attack

  10. Land/Sea static mask relevant to IMERG precipitation 0.1x0.1 degree V2...

    • registry.opendata.aws
    • s.cnmilf.com
    • +5 more
    Updated Aug 14, 2025
    + more versions
    Cite
    NASA (2025). Land/Sea static mask relevant to IMERG precipitation 0.1x0.1 degree V2 (GPM_IMERG_LandSeaMask) at GES DISC [Dataset]. https://registry.opendata.aws/nasa-gpmimerglandseamask/
    Explore at:
    Dataset updated
    Aug 14, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    Version 2 is the current version of the data set; older versions are no longer available and have been superseded by Version 2.

    This land/sea mask originated from the NOAA group at SSEC in the 1980s. It was originally produced at 1/6 deg resolution and then regridded for the purposes of the GPCP, TMPA, and IMERG precipitation products. NASA code 610.2, Terrestrial Information Systems Laboratory, restructured this land/sea mask to match the IMERG grid and converted the file to CF-compliant netCDF4. Version 2 was created in May 2019 to resolve detected inaccuracies in coastal regions.

    Users should be aware that this is a static mask, i.e. there is no seasonal or annual variability, and it is due for an update. It is not recommended for use outside of the aforementioned precipitation data. Read our doc on how to get AWS credentials to retrieve this data: https://data.gesdisc.earthdata.nasa.gov/s3credentials (see also the README).
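
    As an illustration only, once the netCDF4 file has been retrieved (e.g., with the temporary S3 credentials above), it can be inspected with xarray; the local filename and the variable name are assumptions:

    ```python
    # A minimal sketch: open the land/sea mask and sample one 0.1-degree cell.
    import xarray as xr

    ds = xr.open_dataset("GPM_IMERG_LandSeaMask.2.nc4")  # hypothetical local filename
    print(ds.data_vars)                                  # list the actual variable names
    mask = ds["landseamask"]                             # assumed name (percent water per cell)
    print(mask.sel(lat=40.0, lon=-105.0, method="nearest").values)
    ```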

  11. MODIS/Terra Cloud Mask and Spectral Test Results 5-Min L2 Swath 250m and 1km...

    • data.nasa.gov
    • datasets.ai
    • +3 more
    Updated Apr 1, 2025
    + more versions
    Cite
    nasa.gov (2025). MODIS/Terra Cloud Mask and Spectral Test Results 5-Min L2 Swath 250m and 1km - NRT [Dataset]. https://data.nasa.gov/dataset/modis-terra-cloud-mask-and-spectral-test-results-5-min-l2-swath-250m-and-1km-nrt-88812
    Explore at:
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    The MODIS level-2 cloud mask product is a global product generated for both daytime and nighttime conditions at 1-km spatial resolution (at nadir) and for daytime at 250-m resolution. The algorithm employs a series of visible and infrared threshold and consistency tests to specify confidence levels that an unobstructed view of the Earth's surface is observed.

    The Terra MODIS Photovoltaic (PVLWIR) bands 27-30 are known to experience an electronic crosstalk contamination. The influence of the crosstalk has gradually increased over the mission lifetime, causing, for example, earth surface features to become prominent in atmospheric band 27, increased detector striping, and long-term drift in the radiometric bias of these bands. The drift has compromised the climate quality of C6 Terra MODIS L2 products that depend significantly on these bands, including cloud mask (MOD35), cloud fraction and cloud top properties (MOD06), and total precipitable water (MOD07). A linear crosstalk correction algorithm has been developed and tested by MCST. The electronic crosstalk correction was made to the calibration algorithm for bands 27-30 and implemented into C6.1 operational L1B processing. This implementation greatly improves the performance of the cloud mask. For more information on C6.1 changes visit: https://modis-atmos.gsfc.nasa.gov/documentation/collection-61

    The shortname for this level-2 MODIS cloud mask product is MOD35_L2 and the principal investigator for this product is MODIS scientist Dr. Paul Menzel (paulm@ssec.wisc.edu). MOD35_L2 product files are stored in Hierarchical Data Format (HDF-EOS). Each of the 9 gridded parameters is stored as a Scientific Data Set (SDS) within the HDF-EOS file. The Cloud Mask and Quality Assurance SDSs are stored at 1-kilometer pixel resolution. All other SDSs (those relating to time, geolocation, and viewing geometry) are stored at 5-kilometer pixel resolution. Link to the MODIS homepage for more data set information: https://modis-atmos.gsfc.nasa.gov/products/cloud-mask
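
    As a hedged illustration of working with the product, the sketch below unpacks the first byte of the Cloud_Mask SDS with pyhdf, following the bit layout described in the MOD35 user guide (bit 0: mask determined; bits 1-2: cloudiness confidence); the granule filename is hypothetical and the bit positions should be verified against that documentation:

    ```python
    # A minimal sketch: read the 1-km cloud mask flags from an HDF-EOS granule.
    import numpy as np
    from pyhdf.SD import SD, SDC

    f = SD("MOD35_L2.A2025091.0000.061.hdf", SDC.READ)   # hypothetical granule name
    cm = f.select("Cloud_Mask")[:]       # (byte_segment, along-track, across-track)
    byte0 = cm[0].astype(np.uint8)

    determined = byte0 & 0b1             # bit 0: 1 = mask was determined
    confidence = (byte0 >> 1) & 0b11     # bits 1-2: 0=cloudy ... 3=confident clear
    print(np.bincount(confidence.ravel(), minlength=4))
    ```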

  12. The Empirical Cloud Mask Algorithm (ECMA)

    • data.mendeley.com
    Updated Oct 17, 2018
    Cite
    Fahad Alawadi (2018). The Empirical Cloud Mask Algorithm (ECMA) [Dataset]. http://doi.org/10.17632/92dpg5xvr2.1
    Explore at:
    Dataset updated
    Oct 17, 2018
    Authors
    Fahad Alawadi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    NASA's atmospheric science team, using MODIS, has developed its own standard cloud mask product that can be used to detect the presence of clouds over a given area. The cloud mask product, however, has several significant drawbacks. For example, the mask is incapable of discriminating effectively between heavy aerosols (dust) and clouds that appear over both land and water, a shortcoming commonly attributed to its dependency on static thresholds. Moreover, it cannot be generated in near-real-time, due to the absence of specific ancillary data required for its generation.

    Due to these shortcomings, a single mathematical formula, the Empirical Cloud Mask Algorithm (ECMA), has been devised that bypasses these constraints of dependency on ancillary data or static thresholds. The ECMA is composed of a total of eight MODIS bands (469 nm, 555 nm, 645 nm, 859 nm, 1380 nm, 2130 nm, 11 μm and 12 μm).

    Currently, the ECMA has a simple yes/no output corresponding to cloudy or non-cloudy pixels in a day-only MODIS scene, so further research remains pending to extend its applicability to night passes. Although the current ECMA expression was developed based on the MODIS bands, its applicability to other sensors with different spectral and thermal bands is also open for investigation. Finally, the ECMA can be viewed as a dynamic mathematical representation of day-only clouds observed in MODIS, whose nature is equally dynamic, which may explain its success.

  13. Mask Bylaw Survey

    • splitgraph.com
    • data.edmonton.ca
    Updated Mar 24, 2022
    Cite
    edmonton-ca (2022). Mask Bylaw Survey [Dataset]. https://www.splitgraph.com/edmonton-ca/mask-bylaw-survey-w8d9-693d
    Explore at:
    Available download formats: json, application/openapi+json, application/vnd.splitgraph.image
    Dataset updated
    Mar 24, 2022
    Authors
    edmonton-ca
    Description

    This was a single topic survey. To view the survey questions, click the following link:

    https://www.edmontoninsightcommunity.ca/c/a/6iUwvMihdo6IbwunZP6YSY?t=1

    The survey was open from February 28 - March 7, 2022.

    The dataset includes 77,869 responses.

    Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications. For example:
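
    (The page's inline example did not survive extraction; below is a hedged sketch of such a query against Splitgraph's DDN HTTP endpoint. The endpoint URL follows Splitgraph's public docs as I understand them, and the table name "responses" is hypothetical; check the repository page for the actual tables.)

    ```python
    # A minimal sketch: run one SQL query over HTTP and print the JSON result.
    import json
    import urllib.request

    payload = json.dumps({
        "sql": 'SELECT * FROM "edmonton-ca/mask-bylaw-survey-w8d9-693d".responses LIMIT 5'
    }).encode()

    req = urllib.request.Request(
        "https://data.splitgraph.com/sql/query/ddn",   # assumed endpoint
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp))
    ```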

    See the Splitgraph documentation for more information.

  14. ATLAS/ICESat-2 ATL03 Ancillary Masks, Version 1

    • nsidc.org
    • search.dataone.org
    • +2 more
    Updated Oct 13, 2018
    Cite
    National Snow and Ice Data Center (2018). ATLAS/ICESat-2 ATL03 Ancillary Masks, Version 1 [Dataset]. http://doi.org/10.5067/GCQEVTNR6IED
    Explore at:
    Dataset updated
    Oct 13, 2018
    Dataset authored and provided by
    National Snow and Ice Data Center
    Area covered
    WGS 84 EPSG:4326
    Description

    This ancillary ICESat-2 data set contains four static surface masks (land ice, sea ice, land, and ocean) provided by ATL03 to reduce the volume of data that each surface-specific along-track data product is required to process. For example, the land ice surface mask directs the ATL06 land ice algorithm to consider data from only those areas of interest to the land ice community. Similarly, the sea ice, land, and ocean masks direct ATL07, ATL08, and ATL12 algorithms, respectively. A detailed description of all four masks can be found in section 4 of the Algorithm Theoretical Basis Document (ATBD) for ATL03 linked under technical references.

  15. Example of strategies to implement masking during the allocation.

    • plos.figshare.com
    xls
    Updated Jun 21, 2023
    Cite
    Natasha A. Karp; Esther J. Pearl; Emma J. Stringer; Chris Barkus; Jane Coates Ulrichsen; Nathalie Percie du Sert (2023). Example of strategies to implement masking during the allocation. [Dataset]. http://doi.org/10.1371/journal.pbio.3001873.t004
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Natasha A. Karp; Esther J. Pearl; Emma J. Stringer; Chris Barkus; Jane Coates Ulrichsen; Nathalie Percie du Sert
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Example of strategies to implement masking during the allocation.

  16. Face Mask Detection Dataset - 500 GB of data

    • kaggle.com
    zip
    Updated Jun 14, 2021
    Cite
    KUCEV ROMAN (2021). Face Mask Detection Dataset - 500 GB of data [Dataset]. https://www.kaggle.com/tapakah68/medical-masks-part1
    Explore at:
    Available download formats: zip (86359172123 bytes)
    Dataset updated
    Jun 14, 2021
    Authors
    KUCEV ROMAN
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Face Mask Detection - Faces Dataset

    The dataset includes 376,000+ images covering 4 types of mask wear across 94,000 unique faces. All images were collected by TrainingData.pro

    💴 For Commercial Usage: To discuss your requirements, learn about the price, and buy the dataset, leave a request at roman@kucev.com

    Metadata for the full dataset:

    • assignment_id - unique identifier of the media file
    • worker_id - unique identifier of the person
    • age - age of the person
    • true_gender - gender of the person
    • country - country of the person
    • ethnicity - ethnicity of the person
    • photo_1_extension, photo_2_extension, photo_3_extension, photo_4_extension - photo extensions in the dataset
    • photo_1_resolution, photo_2_resolution, photo_3_resolution, photo_4_resolution - photo resolutions in the dataset

    Content

    File with the extension .csv includes the following variables:

    • ID - image id
    • TYPE - image type
    • USER_ID - user id
    • GENDER - gender of the person
    • AGE - person's age
    • name - file name
    • size_mb - image size in MB

    Types of images:

    • TYPE 1 - There is no mask on the face.
    • TYPE 2 - The mask is on, but does not cover the nose or mouth.
    • TYPE 3 - The mask covers the mouth, but does not cover the nose.
    • TYPE 4 - The mask is worn correctly, covers the nose and mouth.
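
    As a small illustration, the metadata CSV described above could be summarized per mask type as sketched below; the filename is hypothetical and the columns follow the description above:

    ```python
    # A minimal sketch: map the four TYPE codes to labels and count unique
    # people per type. Assumes the columns TYPE and USER_ID described above.
    import pandas as pd

    TYPE_LABELS = {
        1: "no mask",
        2: "mask on, nose and mouth uncovered",
        3: "mask covers mouth only",
        4: "mask worn correctly",
    }

    df = pd.read_csv("medical_masks_metadata.csv")   # hypothetical filename
    df["type_label"] = df["TYPE"].map(TYPE_LABELS)
    print(df.groupby("type_label")["USER_ID"].nunique())
    ```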


    💴 Buy the Dataset: This is just an example of the data. Leave a request at roman@kucev.com to discuss your requirements, learn about the price, and buy the dataset.

    keywords: facial mask detection, face masks detection, face masks classification, face masks recognition, covid-19, re-identification, public safety, health, automatic face mask detection, biometric system, biometric system attacks, biometric dataset, face recognition database, face recognition dataset, face detection dataset, facial analysis, object detection dataset, deep learning datasets, computer vision datset, human images dataset, human faces dataset

  17. Dataset for: Modelling the filtration efficiency of a woven fabric: The role...

    • zenodo.org
    • data.niaid.nih.gov
    bin, text/x-python +2
    Updated Jul 17, 2024
    Cite
    Richard P Sear; Ioatzin Rios de Anda; Jake W Wilkins; Joshua F Robinson; C Patrick Royall (2024). Dataset for: Modelling the filtration efficiency of a woven fabric: The role of multiple lengthscales [Dataset]. http://doi.org/10.5281/zenodo.5552357
    Explore at:
    Available download formats: text/x-python, txt, bin, tiff
    Dataset updated
    Jul 17, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Richard P Sear; Ioatzin Rios de Anda; Jake W Wilkins; Joshua F Robinson; C Patrick Royall
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is data for: "Modelling the filtration efficiency of a woven fabric: The role of multiple lengthscales", on arXiv

    The files are as follows (this list is also in the README):

    1) FinalFused.tif : stack of slices taken with confocal at Bristol by Ioatzin Rios de Anda. This is the imaging data of the fabric used

    2) processDataTo3D_PAPER.py : Python code to analyse 1) to produce mask of fibre voxels needed for LB simulation, by Jake Wilkins

    3) LBregionstack.tiff : image stack for region in LB simulations

    4) masknx330ny280nz462_t10.txt : mask in the right format to be read into the Palabos LB code, to specify which voxels are fibre and so need bounce-back

    5) Ioatzin3D.cpp : C++ code for Palabos LB. NB: needs the Palabos LB code (https://palabos.unige.ch/); should go in the directory "~/palabos-v2.2.0/examples/Ioatzin/3D". Needs 4)

    6) make_pkl.py : converts output of LB code into Python pickled format for .py codes below.

    7) IoatzinDarcy_pkl.py : takes pickled output of LB code and computes Darcy k etc

    8) traj2_pkledge.py : computes trajectories of particles and hence the filtration efficiency; needs the pickled output of the LB code and 9)

    9) lattice_params.yaml : parameter values for 7) and 8)

    10) eff_filter_edges.txt : filtration efficiencies computed by 8) WITH inertia

    11) eff_filter0Stokes.txt : filtration efficiencies computed by 8) WITHOUT inertia

    12) plot_filtration.py : plots 10) and 11)

    13) Final_render.mp4 : rotating animation showing region simulated by LB code, by Jake Wilkins

    14) alpha_ofz.txt : alpha, the fraction of fibre voxels as a function of z

    15) plot_justalpha.py : plots 14)

    16) vtk01.vti : velocity field of the flow in .vti format, as used by Paraview

    17) vel3D.pkl : velocity field of the flow in Python's pkl format

    18) slice_heatmap.py : produces heatmap of velocities in xy slice through the flow field

    19) plot_sigma_streamlines.py : plots Sigma (curvature lengthscale) from 20), 21), 22), 23)

    20) stream4.txt: streamline for flow field

    21) stream5.txt: streamline for flow field

    22) stream6.txt: streamline for flow field

    23) stream7.txt: streamline for flow field

    24) plot_Stokes.py : plots Stokes number as function of particle diameter

    25) 0traj20.0_47.xyz : trajectory in format that Paraview can read

    26) intraj20.0_47.xyz : another trajectory

    27) streamlines_pkl.py : calculates streamlines, eg 20), 21), 22) and 23)

    28) this README file

    Abstract of that work:

    During the COVID-19 pandemic, many millions have worn masks made of woven fabric, to reduce the risk of transmission of COVID-19. Masks are essentially air filters worn on the face, that should filter out as many of the dangerous particles as possible. Here the dangerous particles are the droplets containing virus that are exhaled by an infected person. Woven fabric is unlike the material used in standard air filters. Woven fabric consists of fibres twisted together into yarns that are then woven into fabric. There are therefore two lengthscales: the diameters of: (i) the fibre and (ii) the yarn. Standard air filters have only (i). To understand how woven fabrics filter, we have used confocal microscopy to take three dimensional images of woven fabric. We then used the image to perform Lattice Boltzmann simulations of the air flow through fabric. With this flow field we calculated the filtration efficiency for particles around a micrometre in diameter. We find that for particles in this size range, filtration efficiency is low ($\sim 10\%$) but increases with increasing particle size. These efficiencies are comparable to measurements made for fabrics. The low efficiency is due to most of the air flow being channeled through relatively large (tens of micrometres across) inter-yarn pores. So we conclude that our sampled fabric is expected to filter poorly due to the hierarchical structure of woven fabrics.
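
    For a feel of the inertia effect discussed above, the sketch below evaluates the textbook Stokes number St = rho_p d_p^2 U / (18 mu L); the parameter values are illustrative assumptions, not values taken from this dataset or its code:

    ```python
    # A back-of-envelope sketch of the Stokes number for a particle of diameter
    # d_p [m] in air at speed U [m/s] past an obstacle of size L [m], e.g. a fibre.
    def stokes_number(d_p, rho_p=1000.0, U=0.1, mu=1.8e-5, L=10e-6):
        return rho_p * d_p**2 * U / (18.0 * mu * L)

    for d_um in (0.5, 1.0, 2.0, 5.0):   # micrometre-scale droplets, as in the paper
        print(f"d_p = {d_um} um -> St = {stokes_number(d_um * 1e-6):.3g}")
    ```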

  18. 2D Masks Presentation Attack Detection

    • kaggle.com
    zip
    Updated Aug 1, 2023
    + more versions
    Cite
    Unique Data (2023). 2D Masks Presentation Attack Detection [Dataset]. https://www.kaggle.com/datasets/trainingdatapro/real-people-and-attacks-with-2d-masks/data
    Explore at:
    Available download formats: zip (901069185 bytes)
    Dataset updated
    Aug 1, 2023
    Authors
    Unique Data
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    2D Masks Presentation Attack Detection - Biometric Attack dataset

    The anti-spoofing dataset consists of videos of individuals wearing printed 2D masks, or printed 2D masks with cut-out eyes, while looking directly at the camera. Videos are filmed under different lighting conditions and in different places (indoors, outdoors). Each video in the liveness detection dataset has an approximate duration of 2 seconds.

    👉 Legally sourced datasets and carefully structured for AI training and model development. Explore samples from our dataset - Full dataset

    Types of videos in the dataset:

    • real - 4 videos of the person without a mask.
    • mask - 4 videos of the person wearing a printed 2D mask.
    • cut - 4 videos of the person wearing a printed 2D mask with cut-out holes for eyes.


    People in the dataset wear various accessories, such as glasses, caps, scarves, hats, and masks. Most of these are worn over the printed mask; however, glasses and masks can also be printed on the mask itself.


    The dataset serves as a valuable resource for computer vision, anti-spoofing tasks, video analysis, and security systems. It allows for the development of algorithms and models that can effectively detect attacks perpetrated by individuals wearing printed 2D masks.

    The dataset comprises videos of genuine facial presentations using various methods, including 2D masks and printed photos, as well as real and spoof faces. It proposes a novel approach that learns and extracts facial features to prevent spoofing attacks, based on deep neural networks and advanced biometric techniques.

    Our results show that this technology works effectively in securing most applications and prevents unauthorized access by distinguishing between genuine and spoofed inputs. Additionally, it addresses the challenging task of identifying unseen spoofing cues, making it one of the most effective techniques in the field of anti-spoofing research.

    🧩 This is just an example of the data. Leave a request here to learn more

    Content

    The folder "files" includes 17 folders:

    • corresponding to each person in the sample
    • containing of 12 videos of the individual

    File with the extension .csv

    • user: person in the videos,
    • real_1,... real_4: links to the videos with people without mask,
    • mask_1,... mask_4: links to the videos with 2D mask,
    • cut_1,... cut_4: links to the videos with 2D mask with cut-out eyes

    Attacks might be collected in accordance with your requirements.

    🚀 You can learn more about our high-quality unique datasets here

    keywords: ibeta level 1, ibeta level 2, liveness detection systems, liveness detection dataset, biometric dataset, biometric data dataset, biometric system attacks, anti-spoofing dataset, face liveness detection, deep learning dataset, face spoofing database, face anti-spoofing, face recognition, face detection, face identification, human video dataset, video dataset, presentation attack detection, presentation attack dataset, 2d print attacks, print 2d attacks dataset, phone attack dataset, face anti spoofing, large-scale face anti spoofing, rich annotations anti spoofing dataset, cut prints spoof attack

  19. Silicone Masks Biometric Attacks Dataset

    • kaggle.com
    zip
    Updated Oct 3, 2023
    Cite
    Unique Data (2023). Silicone Masks Biometric Attacks Dataset [Dataset]. https://www.kaggle.com/datasets/trainingdatapro/silicone-masks-biometric-attacks/discussion
    Explore at:
    Available download formats: zip (156867023 bytes)
    Dataset updated
    Oct 3, 2023
    Authors
    Unique Data
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Silicone Masks Biometric Attack Dataset

    The anti-spoofing dataset consists of videos of individuals and of attacks with printed 2D masks and silicone masks. Videos are filmed under different lighting conditions (in a dark room, daylight, a light room, and nightlight). The dataset includes videos of people with different attributes (glasses, mask, hat, hood, wigs, and mustaches for men).

    The dataset comprises videos of genuine facial presentations using various methods, including 3D masks and photos, as well as real and spoof faces. It proposes a novel approach that learns and extracts facial features to prevent spoofing attacks, based on deep neural networks and advanced biometric techniques.

    Our results show that this technology works effectively in securing most applications and prevents unauthorized access by distinguishing between genuine and spoofed inputs. Additionally, it addresses the challenging task of identifying unseen spoofing cues, making it one of the most effective techniques in the field of anti-spoofing research.

    👉 Legally sourced datasets and carefully structured for AI training and model development. Explore samples from our dataset - Full dataset

    Types of videos in the dataset:

    • real - real video of the person
    • outline - video of the person wearing a printed 2D mask
    • silicone - video of the person wearing a silicone mask


    Types and number of videos in the full dataset:

    • 2885 real videos of people
    • 2859 videos of people wearing silicone mask
    • 48 videos of people wearing a 2D mask.

    Gender of people in the dataset:

    • women: 2685
    • men: 3107

    The dataset serves as a valuable resource for computer vision, anti-spoofing tasks, video analysis, and security systems. It allows for the development of algorithms and models that can effectively detect attacks.

    Studying the dataset may lead to the development of improved security systems, surveillance technologies, and solutions to mitigate the risks associated with masked individuals carrying out attacks.

    🧩 This is just an example of the data. Leave a request here to learn more

    Content

    • real - contains real videos of people
    • mask - contains videos of people wearing a printed 2D mask
    • silicone - contains videos of people wearing a silicone mask
    • dataset_info.csv - includes information about the videos in the dataset

    File with the extension .csv

    • video: link to the video,
    • type: type of the video

    Attacks might be collected in accordance with your requirements.

    🚀 You can learn more about our high-quality unique datasets here

    keywords: ibeta level 1, ibeta level 2, liveness detection systems, liveness detection dataset, biometric dataset, biometric data dataset, biometric system attacks, anti-spoofing dataset, face liveness detection, deep learning dataset, face spoofing database, face anti-spoofing, face recognition, face detection, face identification, human video dataset, video dataset, presentation attack detection, presentation attack dataset, silicone masks attacks, spoofing deep face recognition, phone attack dataset, face anti spoofing, large-scale face anti spoofing, rich annotations anti spoofing dataset

  20. Training Dataset for HNTSMRG 2024 Challenge

    • data.niaid.nih.gov
    Updated Jun 21, 2024
    Cite
    Wahid, Kareem; Dede, Cem; Naser, Mohamed; Fuller, Clifton (2024). Training Dataset for HNTSMRG 2024 Challenge [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11199558
    Explore at:
    Dataset updated
    Jun 21, 2024
    Dataset provided by
    The University of Texas MD Anderson Cancer Center
    Authors
    Wahid, Kareem; Dede, Cem; Naser, Mohamed; Fuller, Clifton
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Training Dataset for HNTSMRG 2024 Challenge

    Overview

    This repository houses the publicly available training dataset for the Head and Neck Tumor Segmentation for MR-Guided Applications (HNTSMRG) 2024 Challenge.

    Patient cohorts correspond to patients with histologically proven head and neck cancer who underwent radiotherapy (RT) at The University of Texas MD Anderson Cancer Center. The cancer types are predominantly oropharyngeal cancer or cancer of unknown primary. Images include a pre-RT T2w MRI scan (1-3 weeks before the start of RT) and a mid-RT T2w MRI scan (2-4 weeks intra-RT) for each patient. Segmentation masks of primary gross tumor volumes (abbreviated GTVp) and involved metastatic lymph nodes (abbreviated GTVn) are provided for each image (derived from multi-observer STAPLE consensus).

    HNTSMRG 2024 is split into 2 tasks:

    Task 1: Segmentation of tumor volumes (GTVp and GTVn) on pre-RT MRI.

    Task 2: Segmentation of tumor volumes (GTVp and GTVn) on mid-RT MRI.

    The same patient cases will be used for the training and test sets of both tasks of this challenge. Therefore, we are releasing a single training dataset that can be used to construct solutions for either segmentation task. The test data provided (via Docker containers), however, will be different for the two tasks. Please consult the challenge website for more details.

    Data Details

    DICOM files (images and structure files) have been converted to NIfTI format (.nii.gz) for ease of use by participants via DICOMRTTool v. 1.0.

    Images are a mix of fat-suppressed and non-fat-suppressed MRI sequences. Pre-RT and mid-RT image pairs for a given patient are consistently either fat-suppressed or non-fat-suppressed.

    Though some sequences may appear to be contrast enhancing, no exogenous contrast is used.

    All images have been manually cropped from the top of the clavicles to the bottom of the nasal septum (~ oropharynx region to shoulders), allowing for more consistent image field of views and removal of identifiable facial structures.

    The mask files have one of three possible values: background = 0, GTVp = 1, GTVn = 2 (in the case of multiple lymph nodes, they are concatenated into one single label). This labeling convention is similar to the 2022 HECKTOR Challenge.

    150 unique patients are included in this dataset. Anonymized patient numeric identifiers are utilized.

    The entire training dataset is ~15 GB.

    Dataset Folder/File Structure

    The dataset is uploaded as a ZIP archive. Please unzip before use. NIfTI files conform to the following standardized nomenclature: ID_timepoint_image/mask.nii.gz. For mid-RT files, a "registered" suffix (ID_timepoint_image/mask_registered.nii.gz) indicates the image or mask has been registered to the mid-RT image space (see more details in Additional Notes below).

    The data is provided with the following folder hierarchy:

    • Top-level folder (named "HNTSMRG24_train")
      • Patient-level folder (anonymized patient ID, example: "2")
        • Pre-radiotherapy data folder ("preRT")
          • Original pre-RT T2w MRI volume (example: "2_preRT_T2.nii.gz")
          • Original pre-RT tumor segmentation mask (example: "2_preRT_mask.nii.gz")
        • Mid-radiotherapy data folder ("midRT")
          • Original mid-RT T2w MRI volume (example: "2_midRT_T2.nii.gz")
          • Original mid-RT tumor segmentation mask (example: "2_midRT_mask.nii.gz")
          • Registered pre-RT T2w MRI volume (example: "2_preRT_T2_registered.nii.gz")
          • Registered pre-RT tumor segmentation mask (example: "2_preRT_mask_registered.nii.gz")

    Note: Cases will exhibit variable presentation of ground truth mask structures. For example, a case could have only a GTVp label present, only a GTVn label present, both GTVp and GTVn labels present, or a completely empty mask (i.e., complete tumor response at mid-RT). The following case IDs have empty masks at mid-RT (indicating a complete response): 21, 25, 29, 42. These empty masks are not errors. There will similarly be some cases in the test set for Task 2 that have empty masks.
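
    As a hedged illustration, the label convention and the empty-mask cases can be checked with nibabel, as sketched below (the path follows the nomenclature above; nibabel itself is not prescribed by the challenge):

    ```python
    # A minimal sketch: count voxels per label (0=background, 1=GTVp, 2=GTVn)
    # and flag empty (complete-response) masks.
    import nibabel as nib
    import numpy as np

    mask = nib.load("HNTSMRG24_train/21/midRT/21_midRT_mask.nii.gz")
    data = np.asarray(mask.dataobj).astype(np.int16)

    gtvp = int((data == 1).sum())
    gtvn = int((data == 2).sum())
    print("GTVp voxels:", gtvp, "| GTVn voxels:", gtvn)
    if gtvp == 0 and gtvn == 0:
        print("Empty mask: complete response at this timepoint (e.g., case 21).")
    ```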

    Details Relevant for Algorithm Building

    The goal of Task 1 is to generate a pre-RT tumor segmentation mask (e.g., "2_preRT_mask.nii.gz" is the relevant label). During blind testing for Task 1, only the pre-RT MRI (e.g., "2_preRT_T2.nii.gz") will be provided to the participants' algorithms.

    The goal of Task 2 is to generate a mid-RT segmentation mask (e.g., "2_midRT_mask.nii.gz" is the relevant label). During blind testing for Task 2, the mid-RT MRI (e.g., "2_midRT_T2.nii.gz"), original pre-RT MRI (e.g., "2_preRT_T2.nii.gz"), original pre-RT tumor segmentation mask (e.g., "2_preRT_mask.nii.gz"), registered pre-RT MRI (e.g., "2_preRT_T2_registered.nii.gz"), and registered pre-RT tumor segmentation mask (e.g., "2_preRT_mask_registered.nii.gz") will be provided to the participants' algorithms.

    When building models, the resolution of the generated prediction masks should be the same as the corresponding MRI for the given task. In other words, the generated masks should be in the correct pixel spacing and origin with respect to the original reference frame (i.e., pre-RT image for Task 1, mid-RT image for Task 2). More details on the submission of models will be located on the challenge website.

    Additional Notes

    General notes.

    NIfTI format images and segmentations may be easily visualized in any NIfTI viewing software such as 3D Slicer.

    Test data will not be made public until the completion of the challenge. The complete training and test data will be published together (along with all original multi-observer annotations and relevant clinical data) at a later date via The Cancer Imaging Archive. Expected date ~ Spring 2025.

    Task 1 related notes.

    When training their algorithms for Task 1, participants can choose to use only pre-RT data or add in mid-RT data as well. Initially, our plan was to limit participants to utilizing only pre-RT data for training their algorithms in Task 1. However, upon reflection, we recognized that in a practical setting, individuals aiming to develop auto-segmentation algorithms could theoretically train models using any accessible data at their disposal. Based on current literature, we actually don't know what the best solution would be! Would the incorporation of mid-RT data for training a pre-RT segmentation model actually be helpful, or would it merely introduce harmful noise? The answer remains unclear. Therefore, we leave this choice to the participants.

    Remember, though, during testing, you will ONLY have the pre-RT image as an input to your model (naturally, since Task 1 is a pre-RT segmentation task and you won't know what mid-RT data for a patient will look like).

    Task 2 related notes.

    In addition to the mid-RT MRI and segmentation mask, we have also provided a registered pre-RT MRI and the corresponding registered pre-RT segmentation mask for each patient. We offer this data for participants who opt not to integrate any image registration techniques into their algorithms for Task 2 but still wish to use the two images as a joint input to their model. Moreover, in a real-world adaptive RT context, such registered scans are typically readily accessible. Naturally, participants are also free to incorporate their own image registration processes into their pipelines if they wish (or ignore the pre-RT images/masks altogether).

    Registrations were generated using SimpleITK, where the mid-RT image serves as the fixed image and the pre-RT image serves as the moving image. Specifically, we utilized the following steps: 1. Apply a centered transformation, 2. Apply a rigid transformation, 3. Apply a deformable transformation with Elastix using a preset parameter map (Parameter map 23 in the Elastix Zoo). This particular deformable transformation was selected as it is open-source and was benchmarked in a previous similar application (https://doi.org/10.1002/mp.16128). For cases where excessive warping was noted during deformable registration (a small minority of cases), only the rigid transformation was applied.
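
    For orientation, a sketch of the first two steps in SimpleITK follows; this is an illustration under the stated pipeline, not the organizers' code, and the deformable Elastix stage (preset parameter map 23) is only indicated in a comment since it requires an Elastix/SimpleElastix installation:

    ```python
    # A minimal sketch: centered initialization then rigid registration of the
    # pre-RT image (moving) onto the mid-RT image (fixed).
    import SimpleITK as sitk

    fixed = sitk.ReadImage("2_midRT_T2.nii.gz", sitk.sitkFloat32)
    moving = sitk.ReadImage("2_preRT_T2.nii.gz", sitk.sitkFloat32)

    # Step 1: centered transformation
    initial = sitk.CenteredTransformInitializer(
        fixed, moving, sitk.Euler3DTransform(),
        sitk.CenteredTransformInitializerFilter.GEOMETRY)

    # Step 2: rigid registration (metric/optimizer choices here are assumptions)
    reg = sitk.ImageRegistrationMethod()
    reg.SetMetricAsMattesMutualInformation(numberOfHistogramBins=50)
    reg.SetOptimizerAsGradientDescent(learningRate=1.0, numberOfIterations=200)
    reg.SetInitialTransform(initial, inPlace=False)
    reg.SetInterpolator(sitk.sitkLinear)
    rigid = reg.Execute(fixed, moving)

    registered = sitk.Resample(moving, fixed, rigid, sitk.sitkLinear, 0.0)
    # Step 3 (not shown): deformable registration with Elastix, parameter map 23.
    ```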

    Contact

    We have set up a general email address that you can message to notify all organizers at: hntsmrg2024@gmail.com. Additional specific organizer contacts:

    Kareem A. Wahid, PhD (kawahid@mdanderson.org)

    Cem Dede, MD (cdede@mdanderson.org)

    Mohamed A. Naser, PhD (manaser@mdanderson.org)
