Ai4Privacy Community
Join our community at https://discord.gg/FmzWshaaQT to help build open datasets for privacy masking.
Purpose and Features
Previously the world's largest open dataset for privacy masking; it has since been superseded by pii-masking-300k. The purpose of the dataset is to train models to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The example texts cover 54 PII classes (types of sensitive data), targeting 229 discussion… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-200k.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The anti-spoofing dataset includes 3 different types of files of real people: original selfies, original videos, and videos of attacks with printed 2D masks. The liveness detection dataset solves tasks in the field of anti-spoofing and is useful for business and safety systems.
The metadata includes the following information for each media file:
- live_selfie: the link to access the original selfie
- live_video: the link to access the original video
- phone_model: model of the phone with which the selfie and video were shot
- 2d_masks: the link to access the video of the attack with the 2D printed mask
You can learn more about our high-quality unique datasets here
keywords: ibeta level 1, ibeta level 2, liveness detection systems, liveness detection dataset, biometric dataset, biometric data dataset, biometric system attacks, anti-spoofing dataset, face liveness detection, deep learning dataset, face spoofing database, face anti-spoofing, face recognition, face detection, face identification, human video dataset, video dataset, presentation attack detection, presentation attack dataset, 2d print attacks, print 2d attacks dataset, phone attack dataset, face anti spoofing, large-scale face anti spoofing, rich annotations anti spoofing dataset, cut prints spoof attack
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
This dataset comprises cloud masks for 513 1022-by-1022 pixel subscenes, at 20m resolution, sampled randomly from the 2018 Level-1C Sentinel-2 archive. The design of this dataset follows from some observations about cloud masking: (i) performance over an entire product is highly correlated, thus subscenes provide more value per-pixel than full scenes; (ii) current cloud masking datasets often focus on specific regions, or hand-select the products used, which introduces a bias that is not representative of real-world data; (iii) cloud mask performance appears to be highly correlated with surface type and cloud structure, so testing should include analysis of failure modes in relation to these variables.
The data were annotated semi-automatically using the IRIS toolkit, which allows users to dynamically train a Random Forest (implemented using LightGBM), speeding up annotation by iteratively improving its predictions while preserving the annotator's ability to make final manual changes when needed. This hybrid approach allowed us to process many more masks than would have been possible manually, which we felt was vital in creating a dataset large enough to approximate the statistics of the whole Sentinel-2 archive.
In addition to the pixel-wise, 3 class (CLEAR, CLOUD, CLOUD_SHADOW) segmentation masks, we also provide users with binary classification "tags" for each subscene that can be used in testing to determine performance in specific circumstances. These include:
SURFACE TYPE: 11 categories
CLOUD TYPE: 7 categories
CLOUD HEIGHT: low, high
CLOUD THICKNESS: thin, thick
CLOUD EXTENT: isolated, extended
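As a sketch of how these tags might be used in evaluation, the snippet below filters subscenes by tag values. The tag keys, values and subscene IDs here are illustrative, not the dataset's actual schema.

```python
# Hypothetical example: selecting subscenes by binary tags so a cloud
# mask can be scored under specific conditions (e.g. thin, low cloud).
# Tag names and subscene IDs are made up for illustration.

subscene_tags = {
    "subscene_001": {"cloud_height": "low", "cloud_thickness": "thin"},
    "subscene_002": {"cloud_height": "high", "cloud_thickness": "thick"},
    "subscene_003": {"cloud_height": "low", "cloud_thickness": "thick"},
}

def select_subscenes(tags, **criteria):
    """Return the IDs of subscenes whose tags match all given criteria."""
    return sorted(
        sid for sid, t in tags.items()
        if all(t.get(k) == v for k, v in criteria.items())
    )

print(select_subscenes(subscene_tags, cloud_height="low", cloud_thickness="thin"))
# ['subscene_001']
```

Scoring a model separately on each such subset exposes failure modes (e.g. thin cloud over bright surfaces) that an aggregate accuracy number would hide.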
Wherever practical, cloud shadows were also annotated; however, this was sometimes not possible due to high-relief terrain or large ambiguities. In total, 424 subscenes were marked with shadows (if present), and 89 have shadows that were not annotatable due to very ambiguous shadow boundaries or terrain that cast significant shadows. If users wish to train an algorithm specifically for cloud shadow masks, we advise them to remove those 89 images for which shadow annotation was not possible; however, bear in mind that this will systematically reduce the difficulty of the shadow class compared to real-world use, as these contain the most difficult shadow examples.
In addition to the 20m sampled subscenes and masks, we also provide users with shapefiles that define the boundary of the mask on the original Sentinel-2 scene. If users wish to retrieve the L1C bands at their original resolutions, they can use these to do so.
Please see the README for further details on the dataset structure and more.
Contributions & Acknowledgements
The data were collected, annotated, checked, formatted and published by Alistair Francis and John Mrziglod.
Support and advice was provided by Prof. Jan-Peter Muller and Dr. Panagiotis Sidiropoulos, for which we are grateful.
We would like to extend our thanks to Dr. Pierre-Philippe Mathieu and the rest of the team at ESA PhiLab, who provided the environment in which this project was conceived, and continued to give technical support throughout.
Finally, we thank the ESA Network of Resources for sponsoring this project by providing ICT resources.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The anti-spoofing dataset consists of videos of individuals wearing printed 2D masks, or printed 2D masks with cut-out eyes, while looking directly at the camera. Videos are filmed in different lighting conditions and in different places (indoors, outdoors). Each video in the liveness detection dataset has an approximate duration of 2 seconds.
People in the dataset wear different accessories, such as glasses, caps, scarves, hats, and face masks. Most are worn over the printed mask; however, glasses and face masks may also be printed on the mask itself.
The dataset serves as a valuable resource for computer vision, anti-spoofing tasks, video analysis, and security systems. It allows for the development of algorithms and models that can effectively detect attacks perpetrated by individuals wearing printed 2D masks.
The dataset comprises videos of genuine facial presentations using various methods, including 2D masks and printed photos, as well as real and spoof faces. It proposes a novel approach that learns and extracts facial features to prevent spoofing attacks, based on deep neural networks and advanced biometric techniques.
Our results show that this technology works effectively in securing most applications and prevents unauthorized access by distinguishing between genuine and spoofed inputs. Additionally, it addresses the challenging task of identifying unseen spoofing cues, making it one of the most effective techniques in the field of anti-spoofing research.
You can learn more about our high-quality unique datasets here
keywords: ibeta level 1, ibeta level 2, liveness detection systems, liveness detection dataset, biometric dataset, biometric data dataset, biometric system attacks, anti-spoofing dataset, face liveness detection, deep learning dataset, face spoofing database, face anti-spoofing, face recognition, face detection, face identification, human video dataset, video dataset, presentation attack detection, presentation attack dataset, 2d print attacks, print 2d attacks dataset, phone attack dataset, face anti spoofing, large-scale face anti spoofing, rich annotations anti spoofing dataset, cut prints spoof attack
This dataset consists of face and silicone mask images from 8 different subjects, captured with 3 different smartphones.
This dataset consists of images captured from 8 different bona fide subjects using three different smartphones (iPhone X, Samsung S7 and Samsung S8). For each subject within the database, a varying number of samples was collected using all three phones. Similarly, the silicone masks of each subject were captured using the three phones. The masks, each costing about USD 4000, were manufactured by a professional special-effects company.
For the bona fide presentations of the same eight subjects, each data subject was asked to pose in a manner compliant with standard portrait capture. The data were captured indoors, with adequate artificial lighting. Silicone mask presentations were captured under similar conditions, by placing the masks on their bespoke support provided by the manufacturer, with prosthetic eyes and silicone eye sockets.
The database is organized in three folders corresponding to three smartphones and further each subject within the database is organized in sub-folders.
The files are named using the convention "PHONE/CLASS/SUBJECTNUMBER/PHONEIDENTIFIER-PRESENTATION-SUBJECTNUMBER-SAMPLENUMBER.jpg".
PHONE is iPhone, SamS7 or SamS8 corresponding to iPhone, Samsung S7 and Samsung S8 respectively.
CLASS is "Bona" or "Mask" indicating the bona fide presentation or mask presentation respectively.
SUBJECTNUMBER is "s1" to "s8" indicating 8 subjects in the database.
PHONEIDENTIFIER is the two-letter keyword "ip", "s7" or "s8", corresponding to iPhone, Samsung S7 and Samsung S8 respectively.
PRESENTATION identifies bona-fide or mask-attack presentation using 2 letter identifier "bp" or "ap".
SAMPLENUMBER indicates the sample number of the subject.
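As an illustrative sketch (not an official tool shipped with the database), the naming convention above can be captured in a single regular expression; the sample path below is made up but follows the stated pattern.

```python
import re

# Parse paths of the form
# PHONE/CLASS/SUBJECTNUMBER/PHONEIDENTIFIER-PRESENTATION-SUBJECTNUMBER-SAMPLENUMBER.jpg
# using the vocabularies described in the text. The example path is hypothetical.

PATTERN = re.compile(
    r"(?P<phone>iPhone|SamS7|SamS8)/"
    r"(?P<cls>Bona|Mask)/"
    r"(?P<subject>s[1-8])/"
    r"(?P<phone_id>ip|s7|s8)-(?P<presentation>bp|ap)-(?P<subject2>s[1-8])-(?P<sample>\d+)\.jpg"
)

def parse_path(path):
    """Return the naming-convention fields of a database file path."""
    m = PATTERN.fullmatch(path)
    if m is None:
        raise ValueError(f"path does not follow the naming convention: {path}")
    return m.groupdict()

info = parse_path("iPhone/Bona/s1/ip-bp-s1-03.jpg")
print(info["cls"], info["subject"], info["sample"])  # Bona s1 03
```

A parser like this makes it easy to split the images into bona fide and attack sets, or to group samples per subject and phone.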
Reference
If you publish results using this dataset, please cite the following publication.
"Custom Silicone Face Masks - Vulnerability of Commercial Face Recognition Systems & Presentation Attack Detection", R. Raghavendra, S. Venkatesh, K. B. Raja, S. Bhattacharjee, P. Wasnik, S. Marcel, and C. Busch. IAPR/IEEE International Workshop on Biometrics and Forensics (IWBF), 2019. 10.1109/IWBF.2019.8739236 https://publications.idiap.ch/index.php/publications/show/4065
Image Mask is a configurable app template for identifying areas of an image that have changed over time or that meet user-set thresholds for calculated spectral indexes. The template also includes tools for measurement, recording locations, and more.
App users can zoom to bookmarked areas of interest (or search for their own), select any of the imagery layers from the associated web map to analyze, use a time slider or dropdown menu to select images, then choose between the Change Detection or Mask tools to produce results.
Image Mask users can do the following:
- Zoom to bookmarked areas of interest (or bookmark their own)
- Select specific images from a layer to visualize (search by date or another attribute)
- Use the Change Detection tool to compare two images in a layer (see options, below)
- Use the Mask tool to highlight areas that meet a user-set threshold for common spectral indexes (NDVI, SAVI, a burn index, and a water index). For example, highlight all the areas in an image with NDVI values above 0.25 to find vegetation.
- Annotate imagery using editable feature layers
- Perform image measurement on imagery layers that have mensuration capabilities
- Export an imagery layer to the user's local machine, or as a layer in the user's ArcGIS account
Use Cases
- A student investigating urban expansion over time using Esri's Multispectral Landsat image service
- A farmer using NAIP imagery to examine changes in crop health
- An image analyst recording burn scar extents using satellite imagery
- An aid worker identifying regions with extreme drought to focus assistance
Change detection methods
For each imagery layer, give app users one or more of the following change detection options:
- Image Brightness (calculates the change in overall brightness)
- Vegetation Index (NDVI) (requires red and infrared bands)
- Soil-Adjusted Vegetation Index (SAVI) (requires red and infrared bands)
- Water Index (requires green and short-wave infrared bands)
- Burn Index (requires infrared and short-wave infrared bands)
For each of the indexes, users also have a choice between three modes:
- Difference Image: calculates increases and decreases for the full extent
- Difference Mask: users can focus on significant change by setting the minimum increase or decrease to be masked; for example, a user could mask only areas where NDVI increased by at least 0.2
- Threshold Mask: the user sets a threshold and magnitude for what is masked as change. The app will only identify change that is above the user-set lower threshold and bigger than the user-set minimum magnitude.
Supported Devices
This application is responsively designed to support use in browsers on desktops, mobile phones, and tablets.
Data Requirements
Creating an app with this template requires a web map with at least one imagery layer.
Get Started
This application can be created in the following ways:
- Click the Create a Web App button on this page
- Share a map and choose to Create a Web App
- On the Content page, click Create - App - From Template
- Click the Download button to access the source code. Do this if you want to host the app on your own server and optionally customize it to add features or change styling.
This child item contains files representing Particle Image Velocimetry (PIV) processing masks, which excluded regions of invalid velocities from the PIV results. Masks are typically used to screen out velocities, or prevent the creation of velocities, in regions of an image where computed PIV velocities would be nonsensical or invalid: for example, near or on the channel banks, where a tree overhangs the channel, or where a boat or other object is present in the water. By using masks, these regions can be excluded from analysis. The PIVLab software allows for the designation of a rectangular Region of Interest (ROI). For five of the field sites, which were located at engineered canals, a rectangular ROI was sufficient to exclude invalid areas in an image scene, such as the channel banks. However, for three of the sites located in natural rivers, a rectangular ROI was not sufficient to screen invalid regions in the image, so polygonal masks were used in conjunction with the ROIs to segment valid regions for the PIV analysis.
The mask files included here are named with a prefix indicating which field site the mask is for. In two of the sites, multiple masks were used. Mask files are simple, headerless comma-delimited text files consisting of x,y pixel coordinate pairs which outline a polygon on the PIV image.
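A minimal sketch of how such a mask file could be read and used, assuming the stated format (headerless comma-delimited x,y pixel pairs outlining a polygon). The point-in-polygon test is a standard ray-casting check, and the example coordinates are made up; this is not code from the data release itself.

```python
# Read a headerless "x,y" polygon mask file into a list of vertex pairs.
def read_mask(path):
    with open(path) as f:
        return [tuple(float(v) for v in line.split(",")) for line in f if line.strip()]

# Standard ray-casting point-in-polygon test: count edge crossings of a
# horizontal ray from (x, y); an odd count means the point is inside.
def point_in_polygon(x, y, polygon):
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Illustrative polygon; a real mask would come from read_mask("SITE_mask1.txt").
square = [(0, 0), (100, 0), (100, 100), (0, 100)]
print(point_in_polygon(50, 50, square))   # True
print(point_in_polygon(150, 50, square))  # False
```

Velocity vectors whose pixel coordinates fall inside such a polygon would then be excluded from (or retained for) the PIV analysis, depending on whether the polygon marks invalid or valid regions.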
Each Field Site is abbreviated in various files in this data release. File and folder names are used to quickly identify which site a particular file or dataset represents. The following abbreviations are used for masks:
Purpose and Features
The purpose of the model and dataset is to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The model is a fine-tuned version of "Distilled BERT", a smaller and faster version of BERT. It was adapted for the task of token classification based on the largest open-source PII masking dataset known to us, which we are releasing simultaneously. The model size is 62 million parameters. The original… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-43k.
This ancillary ICESat-2 data set contains four static surface masks (land ice, sea ice, land, and ocean) provided by ATL03 to reduce the volume of data that each surface-specific along-track data product is required to process. For example, the land ice surface mask directs the ATL06 land ice algorithm to consider data from only those areas of interest to the land ice community. Similarly, the sea ice, land, and ocean masks direct the ATL07, ATL08, and ATL12 algorithms, respectively. A detailed description of all four masks can be found in section 4 of the Algorithm Theoretical Basis Document (ATBD) for ATL03, linked under technical references.
The 3D Mask Attack Database (3DMAD) is a biometric (face) spoofing database. It contains 76500 frames of 17 persons, recorded using Kinect for both real-access and spoofing attacks. Each frame consists of:
a depth image (640x480 pixels, 1x11 bits)
the corresponding RGB image (640x480 pixels, 3x8 bits)
manually annotated eye positions (with respect to the RGB image).
The data is collected in 3 different sessions for all subjects and for each session 5 videos of 300 frames are captured. The recordings are done under controlled conditions, with frontal-view and neutral expression. The first two sessions are dedicated to the real access samples, in which subjects are recorded with a time delay of ~2 weeks between the acquisitions. In the third session, 3D mask attacks are captured by a single operator (attacker).
In each video, the eye-positions are manually labelled for every 1st, 61st, 121st, 181st, 241st and 300th frames and they are linearly interpolated for the rest.
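The interpolation scheme described above can be sketched as follows; the labelled frame coordinates are made up for illustration, and the function is a hypothetical helper, not part of the database's distribution.

```python
# Linearly interpolate eye positions between manually labelled frames
# (the 1st, 61st, 121st, 181st, 241st and 300th frames of each video).

def interpolate_eyes(labels, frame):
    """labels: {frame_number: (x, y)} with 1-based frame numbers.
    Returns the labelled position, or a linear interpolation between
    the two labelled frames that bracket `frame`."""
    if frame in labels:
        return labels[frame]
    keys = sorted(labels)
    lo = max(k for k in keys if k < frame)
    hi = min(k for k in keys if k > frame)
    t = (frame - lo) / (hi - lo)
    x0, y0 = labels[lo]
    x1, y1 = labels[hi]
    return (x0 + t * (x1 - x0), y0 + t * (y1 - y0))

labels = {1: (100.0, 200.0), 61: (160.0, 230.0)}
print(interpolate_eyes(labels, 31))  # (130.0, 215.0)
```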
The real-size masks are obtained using "ThatsMyFace.com". The database additionally contains the face images used to generate these masks (1 frontal and 2 profiles) and paper-cut masks that are also produced by the same service and using the same images.
The satellite package, which contains the Bob accessor methods to use this database directly from Python with the certified protocols, is available in two different distribution formats:
You can download it from PyPI, or
You can download it in its source form from its git repository.
Acknowledgments
If you use this database, please cite the following publication:
Nesli Erdogmus and Sébastien Marcel, "Spoofing in 2D Face Recognition with 3D Masks and Anti-spoofing with Kinect", Biometrics: Theory, Applications and Systems, 2013. 10.1109/BTAS.2013.6712688 https://publications.idiap.ch/index.php/publications/show/2657
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset provides masked sentences and multi-token phrases that were masked out of these sentences. We offer 3 datasets: a general-purpose dataset extracted from the Wikipedia and Books corpora, and 2 additional datasets extracted from PubMed abstracts. As for the PubMed data, please be aware that the dataset does not reflect the most current/accurate data available from NLM (it is not being updated). For these datasets, the columns provided for each datapoint are as follows:
- text: the original sentence
- span: the span (phrase) which is masked out
- span_lower: the lowercase version of span
- range: the range in the text string which will be masked out (this is important because span might appear more than once in text)
- freq: the corpus frequency of span_lower
- masked_text: the masked version of text, in which span is replaced with [MASK]
Additionally, we provide a small (3K) dataset with human annotations.
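As a sketch of how the columns relate, the snippet below rebuilds masked_text from text and a character range. The example sentence is made up, and the range is treated here as start/end character offsets; the dataset's actual range encoding may differ.

```python
# Replace the character range [start, end) of `text` with the [MASK]
# token, mirroring the text -> masked_text relationship described above.
def mask_text(text, start, end):
    return text[:start] + "[MASK]" + text[end:]

text = "The heart pumps blood through the heart valves."
span = "heart"
# The span occurs twice; the range disambiguates which occurrence is
# masked. Here we mask the second occurrence.
start = text.index(span, text.index(span) + 1)
end = start + len(span)
masked = mask_text(text, start, end)
print(masked)  # The heart pumps blood through the [MASK] valves.
```

This is why the range column matters: span alone would be ambiguous whenever the phrase appears more than once in the sentence.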
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The dataset comprises 26,436 videos of real faces, 2D print attacks (printed photos), and replay attacks (faces displayed on screens), captured under varied conditions. Designed for attack detection research, it supports the development of robust face anti-spoofing and spoofing detection methods, critical for facial recognition security.
Ideal for training models and refining anti-spoofing methods, the dataset enhances detection accuracy in biometric systems.
Researchers can leverage this training data to improve detection accuracy, validate models trained on adversarial examples, and advance recognition systems against sophisticated masked attacks.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Developed by AI4Privacy, this dataset represents a pioneering effort in the realm of privacy and AI. As an expansive resource hosted on Hugging Face at ai4privacy/pii-masking-200k, it serves a crucial role in addressing the growing concerns around personal data security in AI applications.
Sources: The dataset is crafted using proprietary algorithms, ensuring the creation of synthetic data that avoids privacy violations. Its multilingual composition, including English, French, German, and Italian texts, reflects a diverse source base. The data is meticulously curated with human-in-the-loop validation, ensuring both relevance and quality.
Context: In an era where data privacy is paramount, this dataset is tailored to train AI models to identify and mask personally identifiable information (PII). It covers 54 PII classes and extends across 229 use cases in various domains like business, education, psychology, and legal fields, emphasizing its contextual richness and applicability.
Inspiration: The dataset draws inspiration from the need for enhanced privacy measures in AI interactions, particularly in LLMs and AI assistants. The creators, AI4Privacy, are dedicated to building tools that act as a 'global seatbelt' for AI, protecting individuals' personal data. This dataset is a testament to their commitment to advancing AI technology responsibly and ethically.
This comprehensive dataset is not just a tool but a step towards a future where AI and privacy coexist harmoniously, offering immense value to researchers, developers, and privacy advocates alike.
Version 2 is the current version of the data set; older versions are no longer available and have been superseded by Version 2. This land sea mask originated from the NOAA group at SSEC in the 1980s. It was originally produced at 1/6 deg resolution, and then regridded for the purposes of the GPCP, TMPA, and IMERG precipitation products. NASA code 610.2, Terrestrial Information Systems Laboratory, restructured this land sea mask to match the IMERG grid and converted the file to CF-compliant netCDF4. Version 2 was created in May 2019 to resolve detected inaccuracies in coastal regions. Users should be aware that this is a static mask, i.e. there is no seasonal or annual variability, and it is due for update. It is not recommended to be used outside of the aforementioned precipitation data. Read our doc on how to get AWS Credentials to retrieve this data: https://data.gesdisc.earthdata.nasa.gov/s3credentials
README
According to our latest research, the global privacy-masked analytics at the edge market size reached USD 2.75 billion in 2024, driven by the growing necessity for data privacy and compliance in real-time edge computing. The market is exhibiting a robust growth trajectory, registering a CAGR of 22.4% from 2025 to 2033. By the end of 2033, the market is forecasted to attain a value of USD 21.86 billion, underscoring the transformative impact of privacy-preserving technologies in decentralized analytics. This growth is fueled by the convergence of stringent data protection regulations, advances in edge computing hardware, and escalating demand for secure, real-time insights across diverse industries.
A primary growth factor for the privacy-masked analytics at the edge market is the intensification of global data privacy regulations such as GDPR, CCPA, and similar frameworks in emerging economies. Organizations are compelled to process and analyze data closer to its source, minimizing exposure and ensuring compliance with privacy mandates. The proliferation of IoT devices and the exponential increase in data generated at the edge further augment this shift. Enterprises across sectors like healthcare, finance, and government are rapidly adopting privacy-masked analytics to mitigate risks associated with data breaches and unauthorized access. This regulatory environment, combined with the need for immediate actionable intelligence, is catalyzing the adoption of privacy-masked analytics at the edge, making it an indispensable component of modern digital infrastructure.
Another significant driver is the technological advancement in edge computing hardware and software. The evolution of lightweight, high-performance edge devices equipped with built-in privacy-masking capabilities has revolutionized data processing paradigms. These advancements enable real-time analytics and decision-making without compromising sensitive information, which is particularly crucial in industries handling confidential or regulated data. The integration of artificial intelligence and machine learning with privacy-preserving techniques, such as differential privacy and homomorphic encryption, is further enhancing the utility and reliability of edge analytics solutions. As organizations increasingly prioritize zero-trust architectures and data minimization strategies, privacy-masked analytics at the edge emerges as a critical enabler of secure digital transformation.
Furthermore, the surge in remote work, digital transformation initiatives, and the expansion of 5G networks are accelerating the demand for edge-based analytics solutions. Enterprises are seeking to decentralize their data processing capabilities to support distributed workforces and improve operational agility. Privacy-masked analytics at the edge provides a scalable and efficient approach to managing sensitive data in real time, reducing latency and bandwidth consumption while upholding compliance standards. The convergence of edge computing with privacy-enhancing technologies is empowering organizations to unlock new business models and revenue streams without sacrificing user trust or regulatory compliance. This trend is expected to intensify as more industries recognize the strategic value of privacy-masked analytics in driving innovation and maintaining competitive advantage.
From a regional perspective, North America currently dominates the privacy-masked analytics at the edge market, accounting for the largest share in 2024, followed by Europe and Asia Pacific. The United States, in particular, is witnessing significant investments from both public and private sectors to enhance data privacy infrastructure and edge computing capabilities. Europe's strong regulatory environment and Asia Pacific's rapid digitalization are also fostering substantial market growth. Latin America and the Middle East & Africa are emerging as promising markets, driven by increasing awareness and adoption of privacy-preserving technologies. Regional dynamics are shaped by local regulatory frameworks, technological readiness, and industry-specific adoption patterns, with each region presenting unique opportunities and challenges for market participants.
GLAH06 is used in conjunction with GLAH05 to create the Level-2 altimetry products. Level-2 altimetry data provide surface elevations for ice sheets (GLAH12), sea ice (GLAH13), land (GLAH14), and oceans (GLAH15). Data also include the laser footprint geolocation and reflectance, as well as geodetic, instrument, and atmospheric corrections for range measurements. The Level-2 elevation products are regional products archived at 14 orbits per granule, starting and stopping at the same demarcation (±50° latitude) as GLAH05 and GLAH06. Each regional product is processed with algorithms specific to that surface type. Surface type masks define which data are written to each of the products. If any data within a given record fall within a specific mask, the entire record is written to the product. Masks can overlap: for example, non-land data in the sea ice region may be written to both the sea ice and ocean products. This means that an algorithm may write the same data to more than one Level-2 product; in this case, different algorithms calculate the elevations in their respective products. The surface type masks are versioned and archived at NSIDC, so users can tell which data to expect in each product. Each data granule has an associated browse product.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
This dataset includes 3,600 videos of 30 people wearing different types of fabric masks, filmed in diverse settings. Its purpose is to aid research in detecting presentation attacks, helping to improve facial recognition technology and prevent fraud.
| Characteristic | Data |
|---|---|
| Description | Videos of people in fabric masks training algorithms to detect biometric hacking attempts |
| Data types | Video |
| Tasks | Face recognition, Computer Vision |
| Total number of videos | 3,600 |
| Total number of people | 30 |
| Labeling | Only technical characteristics and metadata (age, gender, ethnicity, glasses, wig, camera, light condition, background) |
| Gender | Male (50%), Female (50%) |
| Ethnicity | Caucasian, African, Asian |
These valid ice masks provide a way to remove spurious ice caused by residual weather effects and land spillover in passive microwave data. They are derived from the National Ice Center Arctic Sea Ice Charts and Climatologies data set and show where ice could possibly exist based on where it has existed in the past. There are 12 valid ice masks, one for each month, in netCDF-CF 1.6 compliant files with all associated metadata. The data are on a 304 x 448 grid and are available via FTP.
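A minimal sketch of how such a monthly mask would be applied: zero out retrieved ice concentration wherever the mask says ice cannot occur. The arrays here are synthetic stand-ins on the stated 304 x 448 grid; a real workflow would read the mask from the monthly netCDF file.

```python
import numpy as np

# Synthetic sea ice concentration field and valid-ice mask on the
# 304 x 448 polar stereographic grid (values and mask shape are made up).
rng = np.random.default_rng(0)
ice_concentration = rng.uniform(0.0, 1.0, size=(448, 304))  # fake retrieval
valid_ice = np.zeros((448, 304), dtype=bool)
valid_ice[:200, :] = True                                   # fake monthly mask

# Keep concentrations only where ice has historically been possible;
# everything outside the valid-ice region is treated as spurious.
screened = np.where(valid_ice, ice_concentration, 0.0)
print(float(screened[250:, :].max()))  # 0.0
```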
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Reference classifications generated with Active Learning for Cloud Detection (ALCD)
This data set provides a reference cloud mask data set for 38 Sentinel-2 scenes. These reference masks have been created with the ALCD tool, developed by Louis Baetens under the direction of Olivier Hagolle at CESBIO/CNES [1]. They were created to validate the cloud masks generated by the MAJA software [2].
- The `Reference_dataset` directory contains 31 scenes selected in 2017 or 2018.
- The `Hollstein` directory contains 7 scenes that were used to validate the ALCD tool by comparison to manually generated reference images kindly provided by Hollstein et al. [3]
One of these scenes is present in both directories. For the validation of MAJA, the "Hollstein" scenes were not used because they were acquired at a time when Sentinel-2 was not yet operational, with a degraded revisit frequency of observations.
# Description of the data structure
The name of each scene directory is the name of the corresponding Sentinel-2 L1C product.
In the scene directory, three sub-directories can be found.
- `Classification`
- `Samples`
- `Statistics`
# Description of the files
- `Classification/classification_map.tif` --- the main product, which is the classified scene. Each class is represented with a different integer:
0: no_data.
1: not used.
2: low clouds.
3: high clouds.
4: clouds shadows.
5: land.
6: water.
7: snow.
- `Classification/confidence_enhanced.tif` --- enhanced confidence map of the classification. The values are between 0 and 255 (coded on 1 byte).
The original confidence map gives, for each pixel, the proportion of votes for the majority class, as the classification map was created via a Random Forest algorithm.
A median filter has been applied to this confidence map. Finally, the values were saved on 1 byte, giving values between 0 and 255.
- `Classification/contours.png` --- the contours of the classes from the classification map, overlaid on the scene. The color code depends on each class.
Green: low and high clouds. Yellow: cloud shadows. Blue: water. Purple: snow.
- `Classification/used_parameters.json` --- the parameters that were used to classify the scene. It includes the tile code, the cloudy and clear dates, along with their product reference.
- `Samples/` --- this directory contains all the shapefiles, one per class.
- `Statistics/k_fold_summary.json` --- results of the 10-fold cross-validation on the scene.
5 metrics are computed, in the order given in "metrics_names". "all_metrics" is a list over the 10 folds, with the 5 metrics in that order for each fold.
"means" and "stds" are the means and standard deviations of the 10 folds.
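As a sketch of how the class encoding in `classification_map.tif` might be summarised, the snippet below computes per-class pixel fractions. The small array stands in for the real map, which in practice would be read from the GeoTIFF (e.g. with rasterio); the helper function is hypothetical, not part of the data set.

```python
import numpy as np

# Class encoding as listed in the file description above.
CLASSES = {
    0: "no_data", 1: "not_used", 2: "low_clouds", 3: "high_clouds",
    4: "cloud_shadows", 5: "land", 6: "water", 7: "snow",
}

def class_fractions(class_map):
    """Fraction of valid (non no_data) pixels falling in classes 2-7."""
    valid = class_map[class_map != 0]
    counts = np.bincount(valid.ravel(), minlength=8)
    total = valid.size
    return {CLASSES[i]: counts[i] / total for i in range(2, 8)}

# Tiny synthetic map standing in for a real classified scene.
demo = np.array([[5, 5, 2], [6, 2, 0], [7, 5, 2]])
fracs = class_fractions(demo)
print(fracs["low_clouds"])  # 0.375
```

Such per-class statistics are a natural first check before comparing a candidate cloud mask against the reference classification.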
# References
[1] Baetens, L.; Desjardins, C.; Hagolle, O. Validation of Copernicus Sentinel-2 Cloud Masks Obtained from MAJA, Sen2Cor, and FMask Processors Using Reference Cloud Masks Generated with a Supervised Active Learning Procedure. Remote Sens. 2019, 11, 433.
[2] Hagolle, O.; Huc, M.; Villa Pascual, D.; Dedieu, G. A Multi-Temporal Method for Cloud Detection, Applied to FORMOSAT-2, VENµS, LANDSAT and SENTINEL-2 Images. Remote Sensing of Environment 2010, 114 (8), 1747-1755.
[3] Hollstein, A.; Segl, K.; Guanter, L.; Brell, M.; Enesco, M. Ready-to-Use Methods for the Detection of Clouds, Cirrus, Snow, Shadow, Water and Clear Sky Pixels in Sentinel-2 MSI Images. Remote Sens. 2016, 8, 666