Ai4Privacy Community
Join our community at https://discord.gg/FmzWshaaQT to help build open datasets for privacy masking.
Purpose and Features
Previously the world's largest open dataset for privacy masking; it has since been superseded by pii-masking-300k. The purpose of the dataset is to train models to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The example texts cover 54 PII classes (types of sensitive data), targeting 229 discussion… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-200k.
Purpose and Features
The purpose of the model and dataset is to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The model is a fine-tuned version of DistilBERT, a smaller and faster variant of BERT, adapted for token classification on the largest open-source PII masking dataset known to us, which we are releasing simultaneously. The model has 62 million parameters. The… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-43k.
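As a sketch of how the output of such a token-classification model can be turned into masked text (the helper function and the span offsets below are our own illustration, not part of the released model or dataset):

```python
def mask_pii(text, entities):
    """Replace each detected PII span with a [LABEL] placeholder.

    `entities` is a list of (start, end, label) tuples, e.g. as produced by a
    token-classification model; start/end are character offsets into `text`.
    """
    out, last = [], 0
    for start, end, label in sorted(entities):
        out.append(text[last:start])   # keep text before the span
        out.append(f"[{label}]")       # substitute the placeholder
        last = end
    out.append(text[last:])
    return "".join(out)

masked = mask_pii("Contact Jane Doe at jane@example.com",
                  [(8, 16, "NAME"), (20, 36, "EMAIL")])
# masked == "Contact [NAME] at [EMAIL]"
```

The same post-processing works regardless of which token-classification model produced the spans.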
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
This dataset comprises cloud masks for 513 1022-by-1022-pixel subscenes, at 20 m resolution, sampled randomly from the 2018 Level-1C Sentinel-2 archive. The design of this dataset follows from several observations about cloud masking: (i) performance over an entire product is highly correlated, so subscenes provide more value per pixel than full scenes; (ii) current cloud masking datasets often focus on specific regions, or hand-select the products used, introducing a bias that makes the dataset unrepresentative of real-world data; (iii) cloud mask performance appears to be highly correlated with surface type and cloud structure, so testing should include analysis of failure modes in relation to these variables.
The data was annotated semi-automatically using the IRIS toolkit, which allows users to dynamically train a Random Forest (implemented using LightGBM), speeding up annotation by iteratively improving its predictions while preserving the annotator's ability to make final manual changes when needed. This hybrid approach allowed us to process many more masks than would have been possible manually, which we felt was vital in creating a dataset large enough to approximate the statistics of the whole Sentinel-2 archive.
In addition to the pixel-wise, 3 class (CLEAR, CLOUD, CLOUD_SHADOW) segmentation masks, we also provide users with binary classification "tags" for each subscene that can be used in testing to determine performance in specific circumstances. These include:
SURFACE TYPE: 11 categories
CLOUD TYPE: 7 categories
CLOUD HEIGHT: low, high
CLOUD THICKNESS: thin, thick
CLOUD EXTENT: isolated, extended
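As an illustration of how these per-subscene tags might be used to break down test performance (the result records and `iou` scores below are invented for the example; they are not part of the dataset):

```python
def score_by_tag(results, tag):
    """Mean IoU over subscenes carrying `tag` vs. the rest.

    `results` is a list of dicts with a set of tags and a per-subscene
    IoU score (a hypothetical evaluation format, not the dataset's own).
    """
    with_tag = [r["iou"] for r in results if tag in r["tags"]]
    without = [r["iou"] for r in results if tag not in r["tags"]]
    mean = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    return mean(with_tag), mean(without)

results = [{"tags": {"thin"}, "iou": 0.62},
           {"tags": {"thick"}, "iou": 0.91},
           {"tags": {"thin", "isolated"}, "iou": 0.58}]
print(score_by_tag(results, "thin"))
```

Comparing the two means per tag exposes failure modes tied to specific cloud or surface conditions, as intended by the dataset design.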
Wherever practical, cloud shadows were also annotated; however, this was sometimes not possible due to high-relief terrain or large ambiguities. In total, 424 subscenes were annotated with shadows (where present), and 89 have shadows that could not be annotated due to very ambiguous shadow boundaries or terrain casting significant shadows. If users wish to train an algorithm specifically for cloud shadow masking, we advise removing those 89 images for which shadow annotation was not possible; however, bear in mind that this will systematically reduce the difficulty of the shadow class relative to real-world use, as these contain the most difficult shadow examples.
In addition to the 20m sampled subscenes and masks, we also provide users with shapefiles that define the boundary of the mask on the original Sentinel-2 scene. If users wish to retrieve the L1C bands at their original resolutions, they can use these to do so.
Please see the README for further details on the dataset structure and more.
Contributions & Acknowledgements
The data were collected, annotated, checked, formatted and published by Alistair Francis and John Mrziglod.
Support and advice was provided by Prof. Jan-Peter Muller and Dr. Panagiotis Sidiropoulos, for which we are grateful.
We would like to extend our thanks to Dr. Pierre-Philippe Mathieu and the rest of the team at ESA PhiLab, who provided the environment in which this project was conceived, and continued to give technical support throughout.
Finally, we thank the ESA Network of Resources for sponsoring this project by providing ICT resources.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We have classified the masking options as either low, moderate, or high based on the ability to minimise risk of bias. Shown in bold is the recommended high-quality strategy that is readily implementable across different experiment types.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background
Nonpharmacological interventions, such as personal protective equipment (for example, surgical masks and respirators) and maintenance of hand hygiene, along with COVID-19 vaccines, have been recommended to reduce viral transmission in the community and in health care settings. There is evidence in the literature that surgical and N95 masks may reduce the initial degree of exposure to the virus. Limited research has studied the cost-effectiveness of surgical masks and N95 masks among health care workers for the prevention of COVID-19 in India. The objective of this study was to estimate the cost-effectiveness of N95 and surgical masks, compared to wearing no mask, in public hospital settings for preventing COVID-19 infection among health care workers (HCWs), from the health care provider's perspective.
Methods
A deterministic baseline model without any mask use, based on Eikenberry et al., was used as the foundation for parameter estimation and to estimate transmission rates among HCWs. Information on mask efficacy, including the overall filtering efficiency of a mask and clinical efficiency in terms of either inward efficiency (ei) or outward efficiency (e0), was obtained from the published literature. Hospitalized HCWs were assumed to be in one of four disease states: mild, moderate, severe, or critical. A total of 10,000 HCWs was considered representative of the size of a tertiary care institution's HCW population. The utility values for the mild, moderate, and severe model health states were sourced from primary data collection on the quality of life of HCW COVID-19 survivors. The utility scores for mild, moderate, and severe COVID-19 conditions were 0.88, 0.738, and 0.58, respectively.
The costs of treatment for mild sickness (6,500 INR per day), moderate sickness (10,000 INR per day), severe sickness (requiring ICU facility without ventilation, 15,000 INR per day), and critical sickness (requiring ICU facility with ventilation, 18,000 INR per day), as per government and private COVID-19 treatment costs and capping, were considered. One-way sensitivity analyses were performed to identify the model inputs with the largest impact on model results.
Results
The use of N95 masks compared to using no mask is cost-saving of $1,454,632 (INR 0.106 billion) per 10,000 HCWs in a year. The use of N95 masks compared to using surgical masks is cost-saving of $63,919 (INR 0.005 billion) per 10,000 HCWs in a year. The use of surgical masks compared to using no mask is cost-saving of $1,390,713 (INR 0.102 billion) per 10,000 HCWs in a year. The uncertainty analysis showed that, with a fixed transmission rate (1.7), adopting mask efficiencies of 20%, 50%, and 80% reduces the cumulative relative mortality by 41%, 79%, and 94%, respectively. On considering ei = e0 = 99% for N95 masks and ei = e0 = 90% for surgical masks, the cumulative relative mortality was reduced by 97%, and the use of N95 masks compared to surgical masks is cost-saving of $24,361 (INR 0.002 billion) per 10,000 HCWs in a year.
Discussion
Both considered interventions were dominant compared to no mask based on the model estimates. N95 masks were also dominant compared to surgical masks.
This child item contains files representing Particle Image Velocimetry (PIV) processing masks, which excluded regions of invalid velocities from the PIV results. Masks are typically used to screen out velocities, or to prevent the creation of velocities, in regions of an image where computed PIV velocities would be nonsensical or invalid: for example, near or on the channel banks, where a tree overhangs the channel, or where a boat or other object is present in the water. By using masks, these regions can be excluded from analysis. The PIVLab software allows for the designation of a rectangular Region of Interest (ROI). For five of the field sites, which were located at engineered canals, a rectangular ROI was sufficient to exclude areas in an image scene that were invalid, such as the channel banks. However, for three of the sites located in natural rivers, a rectangular ROI was not sufficient to screen invalid regions in the image, so polygonal masks were used in conjunction with the ROIs to segment valid regions for the PIV analysis.
The mask files included here are named with a prefix indicating which field site the mask is for. In two of the sites, multiple masks were used. Mask files are simple, headerless comma-delimited text files consisting of x,y pixel coordinate pairs which outline a polygon on the PIV image.
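A minimal sketch of parsing such a headerless x,y mask file and testing whether a pixel falls inside the masked polygon, assuming the coordinate pairs trace a closed polygon (the coordinates below are invented for the example; stdlib only, using a simple ray-casting test rather than any PIVLab routine):

```python
import csv, io

def load_mask(fileobj):
    """Parse a headerless comma-delimited file of x,y pixel pairs."""
    return [(float(x), float(y)) for x, y in csv.reader(fileobj)]

def inside(poly, px, py):
    """Even-odd (ray-casting) point-in-polygon test."""
    hit = False
    for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]):
        if (y1 > py) != (y2 > py) and px < (x2 - x1) * (py - y1) / (y2 - y1) + x1:
            hit = not hit
    return hit

poly = load_mask(io.StringIO("0,0\n100,0\n100,50\n0,50"))
print(inside(poly, 10, 10))   # True: pixel lies within the polygon
print(inside(poly, 200, 10))  # False: pixel lies outside
```

For real files, replace the `io.StringIO` stand-in with `open("sitename_mask.csv")`.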
Each Field Site is abbreviated in various files in this data release. File and folder names are used to quickly identify which site a particular file or dataset represents. The following abbreviations are used for masks:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset provides supplemental information for the manuscript, "Diverse terracing practices revealed by automated lidar analysis across the Sāmoan islands", submitted to Archaeological Prospection. The dataset contains a trained Mask R-CNN deep learning model designed for detecting archaeological terracing features on the islands of American Samoa, associated training data, and the raw and cleaned output of detected terraces.
https://www.promarketreports.com/privacy-policy
The global single-sided masking tape market is experiencing robust growth, driven by increasing demand across diverse industries. While the exact market size for 2025 isn't provided, we can reasonably estimate it based on available information and industry trends. Assuming a CAGR (Compound Annual Growth Rate) of, for example, 5% (a conservative estimate given the growth potential in various application segments), and a hypothetical 2019 market size of $2 billion (this figure needs to be replaced with the actual value from the provided data if available), the 2025 market size could be estimated at approximately $2.6 billion. This growth is fueled by several key factors: the expanding automotive and electronics sectors (requiring precise masking solutions), the increasing popularity of DIY and professional painting projects, and the rise of advanced manufacturing processes using specialized tapes. Key segments like silicon-based adhesives are expected to maintain a strong presence, while acrylic-based adhesives are likely to witness significant growth due to their cost-effectiveness and versatility. The market's geographical distribution is also expected to be dynamic, with North America and Asia-Pacific likely maintaining their leading positions due to robust industrial activities and manufacturing bases. However, emerging economies in regions like South America and Africa hold significant untapped potential for future expansion. Continued growth in the single-sided masking tape market is anticipated through 2033, driven by technological advancements in adhesive formulations (improving performance in high-temperature applications, for instance) and the ongoing need for precise and efficient masking solutions across diverse industries. The competitive landscape is characterized by both established players (like 3M, tesa, and Nitto Denko) and regional manufacturers. 
These companies are constantly innovating to meet the evolving needs of their customers, leading to advancements in product features (enhanced adhesion, improved removability, and specialized functionalities). Furthermore, sustainable and eco-friendly tape options are gaining traction, catering to rising environmental concerns. These combined factors position the single-sided masking tape market for continued growth and evolution in the coming years. A more accurate projection will require the missing market size data from the provided content. This comprehensive report provides an in-depth analysis of the global single-sided masking tape market, projecting a value exceeding $15 billion by 2030. We delve into market segmentation, competitive landscape, emerging trends, and growth drivers to offer a complete picture for stakeholders. The report utilizes rigorous data analysis and incorporates insights from leading industry players such as 3M, Intertape Polymer Group, and Shurtape. Keywords: Single sided masking tape, masking tape market, adhesive tape, industrial tape, automotive tape, painting tape, 3M tape, market analysis, market trends, market forecast.
This ancillary ICESat-2 data set contains four static surface masks (land ice, sea ice, land, and ocean) provided by ATL03 to reduce the volume of data that each surface-specific along-track data product is required to process. For example, the land ice surface mask directs the ATL06 land ice algorithm to consider data from only those areas of interest to the land ice community. Similarly, the sea ice, land, and ocean masks direct ATL07, ATL08, and ATL12 algorithms, respectively. A detailed description of all four masks can be found in section 4 of the Algorithm Theoretical Basis Document (ATBD) for ATL03 linked under technical references.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Example strategies for implementing masking during allocation.
The MODIS level-2 cloud mask product is a global product generated for both daytime and nighttime conditions at 1-km spatial resolution (at nadir) and, for daytime, at 250-m resolution. The algorithm employs a series of visible and infrared threshold and consistency tests to specify confidence levels that an unobstructed view of the Earth's surface is observed. The Terra MODIS photovoltaic (PVLWIR) bands 27-30 are known to experience electronic crosstalk contamination. The influence of the crosstalk has gradually increased over the mission lifetime, causing, for example, Earth surface features to become prominent in atmospheric band 27, increased detector striping, and long-term drift in the radiometric bias of these bands. The drift has compromised the climate quality of C6 Terra MODIS L2 products that depend significantly on these bands, including cloud mask (MOD35), cloud fraction and cloud top properties (MOD06), and total precipitable water (MOD07). A linear crosstalk correction algorithm has been developed and tested by MCST. The electronic crosstalk correction was applied to the calibration algorithm for bands 27-30 and implemented in C6.1 operational L1B processing. This implementation greatly improves the performance of the cloud mask. For more information on C6.1 changes, visit: https://modis-atmos.gsfc.nasa.gov/documentation/collection-61. The shortname for this Level-2 MODIS cloud mask product is MOD35_L2, and the principal investigator for this product is MODIS scientist Dr. Paul Menzel (paulm@ssec.wisc.edu). MOD35_L2 product files are stored in Hierarchical Data Format (HDF-EOS). Each of the 9 gridded parameters is stored as a Scientific Data Set (SDS) within the HDF-EOS file. The Cloud Mask and Quality Assurance SDSs are stored at 1-kilometer pixel resolution.
All other SDSs (those relating to time, geolocation, and viewing geometry) are stored at 5-kilometer pixel resolution. Link to the MODIS homepage for more data set information: https://modis-atmos.gsfc.nasa.gov/products/cloud-mask
https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
Transformer-based neural language models have led to breakthroughs for a variety of natural language processing (NLP) tasks. However, most models are pretrained on general domain data. We propose a methodology to produce a model focused on the clinical domain: continued pretraining of a model with a broad representation of biomedical terminology (PubMedBERT) on a clinical corpus (MIMIC-III) along with a novel entity-centric masking strategy to infuse domain knowledge in the learning process.
We curated the MIMIC-III corpus by annotating events (including diseases/disorders, signs/symptoms, medications, anatomical sites, and procedures) and time expressions (e.g. "yesterday", "this weekend", "02/31/2028" (an example date)) with special markers. Marked events and time expressions are randomly chosen, together with other words in a certain ratio, to be masked for training the entity-centric masked language model. The models are therefore infused with clinical entity information and are well suited to entity-related clinical NLP tasks.
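A rough sketch of what such an entity-centric masking step could look like (the function, the way the token ratio is handled, and the example sentence are our own illustration under stated assumptions, not the authors' implementation):

```python
import random

def entity_centric_mask(tokens, entity_spans, ratio=0.15, rng=random):
    """Mask marked entity/time spans first, then sample ordinary tokens
    until roughly `ratio` of all tokens are masked.

    `entity_spans` holds (start, end) token-index ranges for marked spans.
    """
    masked = set()
    for start, end in entity_spans:           # always mask marked spans
        masked.update(range(start, end))
    budget = max(0, int(len(tokens) * ratio) - len(masked))
    others = [i for i in range(len(tokens)) if i not in masked]
    masked.update(rng.sample(others, min(budget, len(others))))
    return [("[MASK]" if i in masked else t) for i, t in enumerate(tokens)]

tokens = "pt denies fever since yesterday after taking aspirin".split()
print(entity_centric_mask(tokens, [(2, 3), (4, 5)], ratio=0.3))
```

Here the spans for "fever" and "yesterday" stand in for an annotated event and time expression; biasing the mask toward such spans is what injects entity knowledge into pretraining.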
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset provides masked sentences and multi-token phrases that were masked out of these sentences. We offer 3 datasets: a general-purpose dataset extracted from the Wikipedia and Books corpora, and 2 additional datasets extracted from PubMed abstracts. As for the PubMed data, please be aware that the dataset does not reflect the most current/accurate data available from NLM (it is not being updated). For these datasets, the columns provided for each datapoint are as follows:
text - the original sentence
span - the span (phrase) which is masked out
span_lower - the lowercase version of span
range - the range in the text string which will be masked out (this is important because span might appear more than once in text)
freq - the corpus frequency of span_lower
masked_text - the masked version of text, where span is replaced with [MASK]
Additionally, we provide a small (3K) dataset with human annotations.
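A minimal sketch of how `masked_text` relates to `text` and `range`, assuming `range` is a character-offset pair (the exact stored format may differ; the sentence below is invented):

```python
def apply_mask(text, start, end):
    """Replace text[start:end] with the [MASK] token."""
    return text[:start] + "[MASK]" + text[end:]

row = {"text": "The Eiffel Tower is in Paris.",
       "span": "Eiffel Tower",
       "range": (4, 16)}
print(apply_mask(row["text"], *row["range"]))
# "The [MASK] is in Paris."
```

Using `range` rather than a string search is what makes masking unambiguous when `span` occurs more than once in `text`.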
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We have classified the masking options as either low, moderate, or high based on the ability to minimise risk of bias.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
CITATION
Please cite this data and code as: H. Khamis, R. Weiss, Y. Xie, C-W. Chang, N. H. Lovell, S. J. Redmond, "QRS detection algorithm for telehealth electrocardiogram recordings," IEEE Transactions on Biomedical Engineering, vol. 63(7), pp. 1377-1388, 2016.

DATABASE DESCRIPTION
The following description of the TELE database is from Khamis et al. (2016): "In Redmond et al (2012), 300 ECG single lead-I signals recorded in a telehealth environment are described. The data was recorded using the TeleMedCare Health Monitor (TeleMedCare Pty. Ltd. Sydney, Australia). This ECG is sampled at a rate of 500 Hz using dry metal Ag/AgCl plate electrodes which the patient holds with each hand; a reference electrode plate is also positioned under the pad of the right hand. Of the 300 recordings, 250 were selected randomly from 120 patients, and the remaining 50 were manually selected from 168 patients to obtain a larger representation of poor quality data. Three independent scorers annotated the data by identifying sections of artifact and QRS complexes. All scorers then annotated the signals as a group, to reconcile the individual annotations. Sections of the ECG signal which were less than 5 s in duration were considered to be part of the neighboring artifact sections and were subsequently masked. QRS annotations in the masked regions were discarded prior to the artifact mask and QRS locations being saved. Of the 300 telehealth ECG records in Redmond et al. (2012), 50 records (including 29 of the 250 randomly selected records and 21 of the 50 manually selected records) were discarded, as all annotated RR intervals within these records overlap with the annotated artifact mask and therefore no heart rate can be calculated, which is required for measuring algorithm performance. The remaining 250 records will be referred to as the TELE database."
For all 250 recordings in the TELE database, the mains frequency was 50 Hz, the sampling frequency was 500 Hz, and the top and bottom rail voltages were 5.556912223578890 mV and -5.554198887532222 mV respectively.

DATA FILE DESCRIPTION
Each record in the TELE database is stored as an X_Y.dat file, where X indicates the index of the record in the TELE database (250 records in total) and Y indicates the index of the record in the original dataset of 300 records (see Redmond et al. 2012). The .dat file is a comma-separated values file. Each line contains:
- the ECG sample value (mV)
- a boolean indicating the locations of the annotated QRS complexes
- a boolean indicating the visually determined mask
- a boolean indicating the software-determined mask (see Khamis et al. 2016)

CONVERTING DATA TO MATLAB STRUCTURE
A MATLAB function (readFromCSV_TELE.m) has been provided to read the .dat files into a MATLAB structure:
%%
% [DB,fm,fs,rail_mv] = readFromCSV_TELE(DATA_PATH)
%
% Extracts the data for each of the 250 telehealth ECG records of the TELE database [1]
% and returns a structure containing all data, annotations and masks.
%
% IN: DATA_PATH - String. The path containing the .hdr and .dat files
%
% OUT: DB - 1xM Structure. Contains the extracted data from the M (250) data files.
% The structure has fields:
% * data_orig_ind - 1x1 double. The index of the data file in the original dataset of 300 records (see [1]) - for tracking purposes.
% * ecg_mv - 1xN double. The ecg samples (mV). N is the number of samples for the data file.
% * qrs_annotations - 1xN double. The qrs complexes - value of 1 where a qrs is located and 0 otherwise.
% * visual_mask - 1xN double. The visually determined artifact mask - value of 1 where the data is masked and 0 otherwise.
% * software_mask - 1xN double. The software artifact mask - value of 1 where the data is masked and 0 otherwise.
% fm - 1x1 double. The mains frequency (Hz)
% fs - 1x1 double. The sampling frequency (Hz)
% rail_mv - 1x2 double. The bottom and top rail voltages (mV)
%
% If you use this code or data, please cite as follows:
%
% [1] H. Khamis, R. Weiss, Y. Xie, C-W. Chang, N. H. Lovell, S. J. Redmond,
% "QRS detection algorithm...
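For users working outside MATLAB, a hedged Python sketch of reading one .dat record, mirroring the four-column layout described above (this is our own illustration, not part of the released code; the in-memory demo lines are invented):

```python
import csv, io

def read_tele_record(fileobj):
    """Parse one TELE .dat record: per line, an ECG sample (mV), a QRS
    flag, a visual-mask flag, and a software-mask flag."""
    ecg, qrs, vmask, smask = [], [], [], []
    for sample, q, v, s in csv.reader(fileobj):
        ecg.append(float(sample))
        qrs.append(bool(int(q)))
        vmask.append(bool(int(v)))
        smask.append(bool(int(s)))
    return ecg, qrs, vmask, smask

demo = io.StringIO("0.12,0,0,0\n0.95,1,0,0\n0.10,0,1,1")
ecg, qrs, vmask, smask = read_tele_record(demo)
print(sum(qrs), sum(vmask))  # one annotated QRS, one visually masked sample
```

For real records, replace the `io.StringIO` stand-in with `open("X_Y.dat")`.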
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
# README: IRHMapNet Radargram and Mask Patches Dataset
## Dataset Overview
This dataset contains radargram patches and corresponding mask patches used for training and evaluating the **IRHMapNet** model. The dataset is designed for segmentation of internal reflection horizons (IRHs) from radio-echo sounding data. The data is organized into two directories: radargram patches (`grams_patches`) and mask patches (`masks_patches`), with each patch having dimensions of 512x512 pixels.
### Contents
- **grams_patches/**: Contains 600 `.csv` files representing radargram patches. Each file is a 512x512 matrix corresponding to a small section of the radargram image.
- **masks_patches/**: Contains 600 `.csv` files representing the ground-truth mask patches for segmentation. Each file is a 512x512 binary mask, where `1` indicates the presence of an internal reflection horizon (IRH), and `0` represents background or ice.
## Data Format
- The files in both directories are named consistently, with matching pairs of radargram and mask patches.
- Example: `grams_patches/patch_001.csv` corresponds to `masks_patches/patch_001.csv`.
- Each `.csv` file is a comma-separated values (CSV) file containing 512 rows and 512 columns.
## Directory Structure
```
DATA_IRHMapNet/
├── grams_patches/ # Radargram patches
│ ├── patch_001.csv
│ ├── patch_002.csv
│ └── ... (600 patches)
└── masks_patches/ # Mask patches (Ground truth)
├── patch_001.csv
├── patch_002.csv
└── ... (600 patches)
```
## Usage Instructions
1. **Loading the data**: Each `.csv` file can be loaded using standard CSV reading functions in Python, such as `numpy.loadtxt()` or `pandas.read_csv()`.
Example in Python using `numpy`:
```python
import numpy as np
radargram_patch = np.loadtxt('grams_patches/patch_001.csv', delimiter=',')
mask_patch = np.loadtxt('masks_patches/patch_001.csv', delimiter=',')
```
2. **Model training**: These patches are designed as input to a U-Net or similar convolutional neural network architecture for pixel-wise classification tasks. The radargram patches serve as input, and the mask patches provide the ground-truth labels for training.
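A small sketch of assembling the CSV patches into network-ready arrays, extending the loading example above (the shapes shown use tiny 3x3 in-memory stand-ins for the real 512x512 files; the helper name is ours):

```python
import io
import numpy as np

def load_pairs(gram_files, mask_files):
    """Stack radargram/mask patch files into (N, H, W, 1) arrays."""
    x = np.stack([np.loadtxt(f, delimiter=",") for f in gram_files])
    y = np.stack([np.loadtxt(f, delimiter=",") for f in mask_files])
    return x[..., np.newaxis], y[..., np.newaxis]  # add channel axis

gram = io.StringIO("0.1,0.2,0.3\n0.4,0.5,0.6\n0.7,0.8,0.9")
mask = io.StringIO("0,1,0\n0,1,0\n0,0,0")
x, y = load_pairs([gram], [mask])
print(x.shape, y.shape)  # (1, 3, 3, 1) (1, 3, 3, 1)
```

With the real data, pass sorted lists of paths from `grams_patches/` and `masks_patches/` so the matching filename pairs stay aligned.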
## License
This dataset is made available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. You are free to:
Share — copy and redistribute the material in any medium or format.
Adapt — remix, transform, and build upon the material for any purpose, even commercially.
You must give appropriate credit by citing the following publication:
**Citation**: Moqadam, H., et al. (2024). Going deeper with deep learning: Automatically tracing internal reflection horizons in ice sheets. *Journal of Geophysical Research: Machine Learning and Computation*. DOI: [insert DOI]
## Contact
For questions or further information, please contact Hameed Moqadam at [hameed.moqadam@awi.de].
Data Curator: Hameed Moqadam
Annotator: Hameed Moqadam
Data Manager: Hameed Moqadam
Broad-scale alterations of historical fire regimes and vegetation dynamics have occurred in many landscapes in the U.S. through the combined influence of land management practices, fire exclusion, ungulate herbivory, insect and disease outbreaks, climate change, and invasion of non-native plant species. The LANDFIRE Project produces maps of simulated historical fire regimes and vegetation conditions using the LANDSUM landscape succession and disturbance dynamics model. The LANDFIRE Project also produces maps of current vegetation and measurements of current vegetation departure from simulated historical reference conditions. These maps support fire and landscape management planning outlined in the goals of the National Fire Plan, Federal Wildland Fire Management Policy, and the Healthy Forests Restoration Act.

Data Summary: The Vegetation Condition Class (VCC) data layer categorizes departure between current vegetation conditions and reference vegetation conditions according to the methods outlined in the Interagency Fire Regime Condition Class Guidebook (Hann and others 2004). For the full product description, please refer to Rollins and others 2007, Developing the LANDFIRE Fire Regime Data Products, available at www.landfire.gov; note, however, that LANDSUM was not incorporated into the LF_1.1.0 methods.

Technical Methods: Hydrologic unit codes (HUCs) were used within LANDFIRE mapping zones to stratify the calculation of vegetation departure. Within each biophysical setting (BpS) in each subsection, we compare the reference percentage of each succession class (SClass) to the current percentage, and the smaller of the two is summed to determine the similarity index for the BpS. This value is then subtracted from 100 to determine the departure index. The departure index is represented on a 0 to 100 percent scale, with 100 representing maximum departure. The departure index is then classified into three condition classes.
It is important to note that the LANDFIRE VCC approach differs from that outlined in the Interagency Fire Regime Condition Class Guidebook (Hann and others 2004): LANDFIRE VCC is based on the departure of current vegetation conditions from reference vegetation conditions only, whereas the Guidebook approach also includes the departure of current fire regimes from those of the reference period. The reference conditions are derived from the vegetation and disturbance dynamics model VDDT. The current conditions are derived from the corresponding version of the LANDFIRE Succession Class data layer; please refer to the product description page at www.landfire.gov for more information. The proportion of the landscape occupied by each SClass in each BpS unit in each subsection is used to represent the current condition of that SClass in the VCC calculation. Areas currently mapped to agriculture, urban, water, barren, or sparsely vegetated BpS units are not included in the VCC calculation; thus, VCC is based entirely on the remaining area of each BpS unit that is occupied by valid SClasses. The vegetation condition classes are defined as follows:
Condition Class I: vegetation departure index of 0 to 33
Condition Class II: vegetation departure index of 34 to 66
Condition Class III: vegetation departure index of 67 to 100
Additional data layer values were included to represent VCC not calculated (0), Water (111), Snow/Ice (112), Urban (120), Barren (131), Sparsely Vegetated BpS (132), and Agriculture (180). Summarization at the national and state levels does not change the relevance of LANDFIRE data that are available to support management decisions at the unit level. The advantages of a nationally consistent dataset and repeatable methodology outweigh any shortcomings of the LANDFIRE data products when used at the local level. Field plot data contributed either directly or indirectly to this LANDFIRE National data product.
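The departure-index and condition-class computation described in the Technical Methods can be sketched as follows (the succession-class percentages are invented for the example; the function is our illustration of the stated method, not LANDFIRE code):

```python
def vegetation_condition_class(reference, current):
    """Similarity = sum over SClasses of min(reference %, current %);
    departure = 100 - similarity; then bin into condition classes 1-3."""
    similarity = sum(min(reference.get(sc, 0), current.get(sc, 0))
                     for sc in set(reference) | set(current))
    departure = 100 - similarity
    vcc = 1 if departure <= 33 else 2 if departure <= 66 else 3
    return departure, vcc

ref = {"A": 40, "B": 35, "C": 25}   # reference SClass percentages for a BpS
cur = {"A": 10, "B": 60, "C": 30}   # current SClass percentages
print(vegetation_condition_class(ref, cur))  # (30, 1)
```

Here the similarity index is 10 + 35 + 25 = 70, so the departure index is 30, which falls in Condition Class I.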
Go to for more information regarding contributors of field plot data.

REFRESH 2008 (lf_1.1.0): Refresh 2008 (lf_1.1.0) used 2001 data as a launching point to incorporate disturbance and its severity, both managed and natural, which occurred on the landscape after 2001. Specific examples of disturbance are: fire, vegetation management, weather, and insect and disease. The final disturbance data used in Refresh 2008 (lf_1.1.0) are the result of several efforts that include data derived in part from remotely sensed land change methods, Monitoring Trends in Burn Severity (MTBS), and the LANDFIRE Refresh events data call. Vegetation growth was modeled in both disturbed and undisturbed areas.

References: Hann, W.; Shlisky, A.; Havlina, D.; Schon, K.; Barrett, S.; DeMeo, T.; Pohl, K.; Menakis, J.; Hamilton, D.; Jones, J.; Levesque, M. 2004. Interagency Fire Regime Condition Class Guidebook. Interagency and The Nature Conservancy Fire Regime Condition Class website. USDA Forest Service, U.S. Department of the Interior, The Nature Conservancy, and Systems for Environmental Management. Available online: www.frcc.gov.
Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
Information received in response to notices published in the Canada Gazette under section 71 of the Canadian Environmental Protection Act (CEPA 1999). These notices target chemical substances of interest under the Chemicals Management Plan.
To increase transparency and to facilitate access to information on substances in commerce in Canada, these documents provide the non-confidential information collected by the Government of Canada under the respective Notices.
Important information about these summaries:
Some information gathered under these initiatives was considered Confidential Business Information (CBI) by the submitters. While these summaries and the Excel compilations were prepared using the full dataset (including CBI), CBI was masked in both documents prior to publication. Masking refers to the process whereby the information is used in such a manner that CBI is not revealed. This can be done, for example, by aggregating data or by providing quantity ranges. For instances when masking could not adequately provide protection, CBI data elements were removed from the final version.
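One of the masking techniques described above, replacing an exact reported quantity with a quantity range, can be illustrated as follows; the range boundaries here are invented for the example and are not the ones used in the actual compilations.

```python
# Hypothetical quantity-range masking: an exact quantity is replaced
# by the bucket it falls into, so the precise figure is not revealed.
RANGES = [(0, 100), (100, 1_000), (1_000, 10_000), (10_000, 100_000)]

def mask_quantity(kg):
    """Return a quantity range string in place of an exact quantity."""
    for lo, hi in RANGES:
        if lo <= kg < hi:
            return f"{lo}-{hi} kg"
    return f">= {RANGES[-1][1]} kg"
```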
Sensitive information, such as names of submitters, names of their customers and suppliers, or any information that could identify a submitter, was also not included.
The section 71 notices targeted specific information to address data needs identified for the substances. As such, the information gathered and reported here does not represent the entire range of commercial activities in Canada with the substances. The specific reporting requirements, exclusions and definitions applicable to the notices can be found in Schedule 2 and Schedule 3 to the notices.
Certain submitters chose to provide information that was not legally required under the Notice. This voluntary information is included here.
It should be noted that these documents do not include an assessment of the potential risks these substances may represent for the environment or the health of Canadians.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Pan-EU Land Mask Summary
Considering the land mask for pan-EU, we closely match the data coverage of https://land.copernicus.eu/pan-european, i.e. the official selection of countries (EEA39) listed here: https://land.copernicus.eu/portal_vocabularies/geotags/eea39.
There are a total of three landmask files available, each of which is aligned with the standard spatial/temporal resolution and sizes of AI4SoilHealth Data Cube specifications, which is: Xmin = 900,000, Ymin = 899,000, Xmax = 7,401,000, Ymax = 5,501,000, with Coordinate reference system of epsg:3035. Additionally, these files include a corresponding look-up table that provides explanations for the values present in the raster data. The scripts used to generate these masks can be found here.
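Given the bounding box and resolutions above, the expected raster dimensions can be checked with a few lines of arithmetic (assuming pixel edges align exactly with the stated bounds):

```python
# AI4SoilHealth Data Cube bounding box in EPSG:3035 metres,
# as stated in the specification above.
XMIN, YMIN, XMAX, YMAX = 900_000, 899_000, 7_401_000, 5_501_000

def grid_shape(res_m):
    """Return (columns, rows) of the cube grid at a given
    resolution in metres."""
    cols = (XMAX - XMIN) // res_m
    rows = (YMAX - YMIN) // res_m
    return cols, rows
```

For example, at 100 m resolution the grid is 65,010 columns by 46,020 rows.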
The masks are:
Landmask
ISO-code country mask
NUTS3 mask
Name convention
To ensure consistency and ease of use across and within projects, the files here are named according to the standard OpenLandMap file-naming convention. This convention uses 10 fields that define the most important properties of the data, so that users can search for files and prepare data analyses without needing to access or open the files themselves. The 10 fields are:
Generic variable name: country.code
Variable procedure combination i.e. method standard (standard abbreviation): iso.3166
Position in the probability distribution / variable type: c
Spatial support (usually horizontal block) in m or km: 30m
Depth reference or depth interval e.g. below ("b"), above ("a") ground or at surface ("s"): s
Time reference begin time (YYYYMMDD): 20210101
Time reference end time: 20211231
Bounding box (2 letters max): eu
EPSG code: epsg.3035
Version code i.e. creation date: v20230722
An example of a file-name based on the description above:
country.code_iso.3166_c_100m_s_20210101_20211231_eu_epsg.3035_v20230722
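The 10-field name can be split back into its parts with a small parser; the field names below follow the list above, and the assumption (consistent with the example) is that underscores separate fields while dots are allowed inside a field.

```python
# Field names for the 10-field OpenLandMap-style convention,
# in the order listed above.
FIELDS = [
    "variable", "method", "variable_type", "spatial_support",
    "depth_reference", "time_begin", "time_end", "bbox",
    "epsg", "version",
]

def parse_name(filename):
    """Split an OpenLandMap-style name into a field dictionary."""
    parts = filename.split("_")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))

info = parse_name(
    "country.code_iso.3166_c_100m_s_20210101_20211231_eu_epsg.3035_v20230722"
)
```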
Landmask
The basic principle in creating the land mask is to include as much land as possible, so that no land pixels are missed, while still differentiating precisely between land, ocean and inland water bodies.
Two reference datasets are used:
WorldCover, 10 m resolution.
EuroGlobalMap, with shapefiles of administrative boundaries, inland water bodies, ocean and landmask.
When generating the land mask, the two reference datasets are combined in such a way that:
If either of the two reference datasets identifies a pixel as land, it is considered a land pixel in our mask.
Regarding ocean and inland water bodies, a pixel is classified as a water pixel only when both reference datasets confirm its identification as water.
The landmask consists of 4 values:
10: not in the pan-EU area, i.e. out of mapping scope
1: land
2: inland water
3: ocean
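The combination rule above can be sketched as follows, using the output codes just listed; the `"land"`/`"water"` input labels are a simplification for illustration, and value 10 (outside the pan-EU mapping scope) is assigned separately from the bounding geometry rather than from the two reference datasets.

```python
def combine(worldcover, euroglobalmap, inland=False):
    """Combine two reference classifications for one pixel.
    A pixel becomes land (1) if EITHER dataset calls it land;
    it is water only when BOTH agree, in which case it is
    inland water (2) or ocean (3)."""
    if worldcover == "land" or euroglobalmap == "land":
        return 1               # land
    return 2 if inland else 3  # inland water / ocean
```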
This landmask is available in 10m, 30m, 100m, 250m, and 1km resolution formats. The coarse resolution landmasks (>10 m) are generated by resampling from the 10m resolution base map using the “min” resampling method in GDAL, which takes the minimum value of the contributing pixels so as to keep as much land as possible.
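The effect of the two resampling methods used for these masks can be illustrated on a single block of contributing pixels, using the landmask codes above (this is a conceptual sketch of what GDAL computes, not a GDAL call):

```python
from statistics import mode

# One coarse pixel aggregating four fine pixels:
# one land pixel (1) among three ocean pixels (3).
block = [1, 3, 3, 3]

# "min" (landmask): any land in the block keeps the coarse
# pixel as land, since land (1) < water codes (2, 3).
min_value = min(block)

# "mode" (country and NUTS3 masks): categorical codes take
# the most frequent contributing value instead.
mode_value = mode(block)
```

Here `min_value` is 1 (land) while `mode_value` is 3 (ocean), which is why “min” is the right choice for a mask that must not lose land pixels, and “mode” for categorical ID rasters.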
ISO-3166 country code mask
This ISO-3166 country code mask is created from the EuroGlobalMap country shapefile. This mask is available in 10m, 30m and 100m resolution. In this raster file, each country is assigned a unique value, which allows for the interpretation and analysis of data associated with a specific country.
The values are assigned to each country according to its ISO 3166 country code, which can be found in the corresponding look-up table. The coarse resolution masks (>10 m) are generated by resampling from the 10m resolution base map using the “mode” resampling method in GDAL.
NUTS-3 mask
The NUTS3 code mask is created from the European NUTS3 shapefile. In this raster file, each unique NUTS3-level area is assigned a unique value, which allows for the interpretation and analysis of data associated with specific NUTS3 regions.
The pixel values and their associated meanings can be found in the corresponding look-up table. This NUTS3 code mask is available in 10m, 30m and 100m resolution formats. The coarse resolution masks (>10 m) are generated by resampling from the 10m resolution base map using the “mode” resampling method in GDAL.
It should be noted that the ISO-code country mask covers a more extensive area than the NUTS3 mask. This broader coverage includes countries such as Ukraine that fall outside the NUTS3 mask, while the NUTS3 mask shows more detail about regional administrative boundaries.
Open Government Licence - Canada 2.0https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
Information received in response to notices published in the Canada Gazette under sections 46 and 71 of the Canadian Environmental Protection Act, 1999 (CEPA). These notices target certain substances under the Chemicals Management Plan.
Supplemental Information
To increase transparency and to facilitate access to information on substances in commerce in Canada, these documents provide the non-confidential information collected by the Government of Canada under the respective notices.
Important information about these summaries:
Some information gathered under these initiatives was considered Confidential Business Information (CBI) by the submitters. While these summaries and the Excel compilations were prepared using the full dataset (including CBI), CBI and protected information were masked in both documents prior to publication. Masking refers to the process whereby the information is used in such a manner that CBI and protected information are not revealed. This can be done, for example, by aggregating data or by providing quantity ranges. For instances when masking could not adequately provide protection, data was removed.
Protected information, such as names of submitters, names of their customers and suppliers, or any information that could identify a submitter, was also removed.
The section 46 and 71 notices targeted information to address data needs identified for the substances. As such, the information gathered and reported here does not represent the entire range of commercial activities in Canada with the substances. The specific reporting requirements, exclusions and definitions can be found in each applicable notice.
It should be noted that these documents do not include an assessment of the potential risks these substances may represent for the environment or the health of Canadians.
Datasets available for download