Ai4Privacy Community
Join our community at https://discord.gg/FmzWshaaQT to help build open datasets for privacy masking.
Purpose and Features
Previously the world's largest open dataset for privacy; it has since been superseded by pii-masking-300k. The purpose of the dataset is to train models to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The example texts have 54 PII classes (types of sensitive data), targeting 229 discussion… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-200k.
Purpose and Features
The purpose of the model and dataset is to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The model is a fine-tuned version of "Distilled BERT", a smaller and faster version of BERT. It was adapted for the task of token classification based on what is, to our knowledge, the largest open-source PII masking dataset, which we are releasing simultaneously. The model size is 62 million parameters. The original… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-65k.
https://choosealicense.com/licenses/other/
Purpose and Features
🌍 World's largest open dataset for privacy masking 🌎 The dataset is useful to train and evaluate models to remove personally identifiable and sensitive information from text, especially in the context of AI assistants and LLMs. Key facts:
OpenPII-220k text entries have 27 PII classes (types of sensitive data), targeting 749 discussion subjects / use cases split across education, health, and psychology. FinPII contains an additional ~20 types tailored to… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-300k.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Install the library from a terminal:
    pip install datasets
Load the dataset in Python:
    from datasets import load_dataset
    dataset = load_dataset("ai4privacy/open-pii-masking-500k-ai4privacy")
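As a rough illustration of the masking task these datasets target, the sketch below replaces annotated character spans with class placeholders. The span format (start, end, label), the label names, and the example sentence are assumptions made for illustration, not the datasets' actual schema; the dataset pages describe the real columns.

```python
from typing import List, Tuple

# Hypothetical span format: (start offset, end offset, PII class label).
Span = Tuple[int, int, str]


def mask_pii(text: str, spans: List[Span]) -> str:
    """Replace each annotated character span with a [LABEL] placeholder."""
    masked = []
    cursor = 0
    # Walk spans left to right so offsets stay valid against the original text.
    for start, end, label in sorted(spans):
        if start < cursor:
            raise ValueError("overlapping spans are not supported in this sketch")
        masked.append(text[cursor:start])
        masked.append(f"[{label}]")
        cursor = end
    masked.append(text[cursor:])
    return "".join(masked)


example_text = "Contact Jane Doe at jane@example.com."
example_spans = [(8, 16, "NAME"), (20, 36, "EMAIL")]
masked_text = mask_pii(example_text, example_spans)
print(masked_text)
```

A trained token-classification model would predict the spans; here they are supplied by hand so that the masking step can be read in isolation.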
Compatible Machine Learning Tasks:
- Token classification. Check out Hugging Face's guide on token classification.
- ALBERT, BERT, BigBird, BioGpt, BLOOM, BROS, CamemBERT, CANINE, ConvBERT, Data2VecText, DeBERTa, DeBERTa-v2, DistilBERT, ELECTRA, ERNIE, ErnieM, ESM, Falcon, FlauBERT, FNet, Funnel Transformer, GPT-Sw3, OpenAI GPT-2, GPTBigCode, GPT Neo, GPT NeoX, I-BERT, LayoutLM, LayoutLMv2, LayoutLMv3, LiLT, Longformer, LUKE, MarkupLM, MEGA, Megatron-BERT, MobileBERT,...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset records the perceptions of Bangladeshi university students on the influence that AI tools, especially ChatGPT, have on their academic practices, learning experiences, and problem-solving abilities. It covers the varying role of AI in education: common usage statistics, the effect of AI on creative abilities, its impact on learning, and whether it could invade privacy. The dataset shows how AI tools are changing education in the country and offers valuable information for researchers, educators, and policymakers seeking to understand trends, challenges, and opportunities in the adoption of AI in the academic context.
Methodology
Data Collection Method: Online survey using Google Forms
Participants: A total of 3,512 students from various Bangladeshi universities participated.
Survey Questions: The survey included questions on demographic information, frequency of AI tool usage, perceived benefits, concerns regarding privacy, and impacts on creativity and learning.
Sampling Technique: Random sampling of university students
Data Collection Period: June 2024 to December 2024
Privacy Compliance
This dataset has been anonymized to remove any personally identifiable information (PII). It adheres to relevant privacy regulations to ensure the confidentiality of participants.
For further inquiries, please contact: Name: Md Jhirul Islam, Daffodil International University Email: jhirul15-4063@diu.edu.bd Phone: 01316317573
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
Water companies in the UK are responsible for testing the quality of drinking water. This dataset contains the results of samples taken from the taps in domestic households to make sure they meet the standards set out by UK and European legislation. This data shows the location, date, and measured levels of determinands set out by the Drinking Water Inspectorate (DWI).
Key Definitions
Aggregation
Process involving summarising or grouping data to obtain a single or reduced set of information, often for analysis or reporting purposes
Anonymisation
Anonymised data is a type of information sanitisation in which data anonymisation tools encrypt or remove personally identifiable information from datasets for the purpose of preserving a data subject's privacy
Dataset
Structured and organised collection of related elements, often stored digitally, used for analysis and interpretation in various fields.
Determinand
A constituent or property of drinking water which can be determined or estimated.
DWI
Drinking Water Inspectorate, an organisation “providing independent reassurance that water supplies in England and Wales are safe and drinking water quality is acceptable to consumers.”
DWI Determinands
Constituents or properties that are tested for when evaluating a sample for its quality as per the guidance of the DWI. For this dataset, only determinands with “point of compliance” as “customer taps” are included.
Granularity
Data granularity is a measure of the level of detail in a data structure. In time-series data, for example, the granularity of measurement might be based on intervals of years, months, weeks, days, or hours
ID
Abbreviation for Identification that refers to any means of verifying the unique identifier assigned to each asset for the purposes of tracking, management, and maintenance.
LSOA
Lower-Level Super Output Area is made up of small geographic areas used for statistical and administrative purposes by the Office for National Statistics. It is designed to have homogeneous populations in terms of population size, making them suitable for statistical analysis and reporting. Each LSOA is built from groups of contiguous Output Areas with an average of about 1,500 residents or 650 households allowing for granular data collection useful for analysis, planning and policy-making while ensuring privacy.
ONS
Office for National Statistics
Open Data Triage
The process carried out by a Data Custodian to determine if there is any evidence of sensitivities associated with Data Assets, their associated Metadata and Software Scripts used to process Data Assets if they are used as Open Data.
Sample
A sample is a representative segment or portion of water taken from a larger whole for the purpose of analysing or testing to ensure compliance with safety and quality standards.
Schema
Structure for organizing and handling data within a dataset, defining the attributes, their data types, and the relationships between different entities. It acts as a framework that ensures data integrity and consistency by specifying permissible data types and constraints for each attribute.
Units
Standard measurements used to quantify and compare different physical quantities.
Water Quality
The chemical, physical, biological, and radiological characteristics of water, typically in relation to its suitability for a specific purpose, such as drinking, swimming, or ecological health. It is determined by assessing a variety of parameters, including but not limited to pH, turbidity, microbial content, dissolved oxygen, presence of substances and temperature.
Data History
Data Origin
These samples were taken from customer taps. They were then analysed for water quality, and the results were uploaded to a database. This dataset is an extract from this database.
Data Triage Considerations
Granularity
Is it useful to share results as averages or individual?
We decided to share as individual results as the lowest level of granularity
Anonymisation
It is a requirement that this data cannot be used to identify a singular person or household. We discussed many options for aggregating the data to a specific geography to ensure this requirement is met. The following geographical aggregations were discussed:
• Water Supply Zone (WSZ) - Limits interoperability with other datasets
• Postcode – Some postcodes contain very few households and may not offer necessary anonymisation
• Postal Sector – Deemed not granular enough in highly populated areas
• Rounded Co-ordinates – Not a recognised standard and may cause overlapping areas
• MSOA – Deemed not granular enough
• LSOA – Agreed as a recognised standard appropriate for England and Wales
• Data Zones – Agreed as a recognised standard appropriate for Scotland
Data Triage Review Frequency
Annually unless otherwise requested
Publish Frequency
Annually
Data Specifications
• Each dataset will cover a year of samples in calendar year
• This dataset will be published annually
• Historical datasets will be published as far back as 2016 from the introduction of The Water Supply (Water Quality) Regulations 2016
• The determinands included in the dataset are as per the list that is required to be reported to the Drinking Water Inspectorate.
• A small proportion of samples could not be allocated to an LSOA – these represented less than 0.1% of samples and were removed from the dataset in 2023.
• The postcode to LSOA lookup table used for 2022 was not available when 2023 data was processed, see supplementary information for the lookup table applied to each calendar year of data.
Context
Many UK water companies provide a search tool on their websites where you can search for water quality in your area by postcode. The results of the search may identify the water supply zone that supplies the postcode searched. Water supply zones are not linked to LSOAs which means the results may differ to this dataset. Some sample results are influenced by internal plumbing and may not be representative of drinking water quality in the wider area. Some samples are tested on site and others are sent to scientific laboratories.
Supplementary information
Below is a curated selection of links for additional reading, which provide a deeper understanding of this dataset.
1. Drinking Water Inspectorate Standards and Regulations: https://www.dwi.gov.uk/drinking-water-standards-and-regulations/
2. LSOA (England and Wales) and Data Zone (Scotland): https://www.nrscotland.gov.uk/files/geography/2011-census/geography-bckground-info-comparison-of-thresholds.pdf
3. Description for LSOA boundaries by the ONS: https://www.ons.gov.uk/methodology/geography/ukgeographies/censusgeographies/census2021geographies
4. Postcode to LSOA lookup tables (2022 calendar year data): https://geoportal.statistics.gov.uk/datasets/postcode-to-2021-census-output-area-to-lower-layer-super-output-area-to-middle-layer-super-output-area-to-local-authority-district-august-2023-lookup-in-the-uk/about
5. Postcode to LSOA lookup tables (2023 calendar year data): https://geoportal.statistics.gov.uk/datasets/b8451168e985446eb8269328615dec62/about
6. Legislation history: https://www.dwi.gov.uk/water-companies/legislation/
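The allocation of samples to LSOAs described in the data specifications could look roughly like the sketch below: each sample's postcode is swapped for an LSOA code using the lookup table for its calendar year, and samples that cannot be allocated are dropped. The postcodes, LSOA codes, field names, and function names are all invented for illustration; the publisher's actual processing is not documented here.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical lookup tables keyed by calendar year: postcode -> LSOA code.
# The codes below are invented placeholders, not real ONS values.
LOOKUPS: Dict[int, Dict[str, str]] = {
    2022: {"AB1 2CD": "E01000001", "EF3 4GH": "E01000002"},
    2023: {"AB1 2CD": "E01000001"},
}


def normalise_postcode(postcode: str) -> str:
    """Upper-case and collapse whitespace so lookups are consistent."""
    return " ".join(postcode.upper().split())


def assign_lsoa(samples: List[dict]) -> Tuple[List[dict], int]:
    """Swap each sample's postcode for an LSOA code; drop unmatched samples."""
    kept: List[dict] = []
    dropped = 0
    for sample in samples:
        lookup = LOOKUPS.get(sample["year"], {})
        lsoa: Optional[str] = lookup.get(normalise_postcode(sample["postcode"]))
        if lsoa is None:
            dropped += 1  # could not be allocated to an LSOA
            continue
        # Copy every field except the postcode, which is not published.
        record = {k: v for k, v in sample.items() if k != "postcode"}
        record["lsoa"] = lsoa
        kept.append(record)
    return kept, dropped


samples = [
    {"year": 2022, "postcode": "ab1  2cd", "determinand": "Lead", "result": 0.4},
    {"year": 2023, "postcode": "EF3 4GH", "determinand": "Lead", "result": 0.6},
]
published, removed = assign_lsoa(samples)
print(published, removed)
```

Keeping one lookup per year mirrors the note that different lookup tables were applied to the 2022 and 2023 data, so the same postcode can resolve differently (or not at all) depending on the year.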
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset and codes for "Observation of Acceleration and Deceleration Periods at Pine Island Ice Shelf from 1997–2023"
The MATLAB codes and related datasets are used for generating the figures for the paper "Observation of Acceleration and Deceleration Periods at Pine Island Ice Shelf from 1997–2023".
Files and variables
File 1: Data_and_Code.zip
Directory: Main_function
Description: Includes MATLAB scripts and functions. Each script includes descriptions that guide the user on how to use it and how to find the dataset used for processing.
MATLAB main scripts: Include all the steps to process the data and to output figures and videos.
Script_1_Ice_velocity_process_flow.m
Script_2_strain_rate_process_flow.m
Script_3_DROT_grounding_line_extraction.m
Script_4_Read_ICESat2_h5_files.m
Script_5_Extraction_results.m
MATLAB functions: Directories containing MATLAB functions that support the main scripts:
1_Ice_velocity_code: Includes MATLAB functions related to ice velocity post-processing, including outlier removal, filtering, correction for atmospheric and tidal effects, inverse weighted averaging, and error estimation.
2_strain_rate: Includes MATLAB functions related to strain rate calculation.
3_DROT_extract_grounding_line_code: Includes MATLAB functions that convert range offset results output from GAMMA to differential vertical displacement and use the result to extract the grounding line.
4_Extract_data_from_2D_result: Includes MATLAB functions used to extract profiles from 2D data.
5_NeRD_Damage_detection: Modified code from Izeboud et al. 2023. When applying this code, please also cite Izeboud et al. 2023 (https://www.sciencedirect.com/science/article/pii/S0034425722004655).
6_Figure_plotting_code: Includes MATLAB functions related to the figures in the paper and supporting information.
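The "inverse weighted" averaging mentioned for the ice velocity functions is not specified further in this description. One common reading is inverse-variance weighting, sketched below in Python rather than MATLAB; the weighting scheme, the function name, and the sample numbers are assumptions, and the repository's scripts remain the authority on what is actually done.

```python
import math
from typing import List, Tuple


def inverse_variance_average(values: List[float], sigmas: List[float]) -> Tuple[float, float]:
    """Combine estimates with weights 1/sigma^2; return (mean, combined sigma)."""
    if len(values) != len(sigmas) or not values:
        raise ValueError("values and sigmas must be non-empty and of equal length")
    weights = [1.0 / (s * s) for s in sigmas]
    total_weight = sum(weights)
    mean = sum(w * v for w, v in zip(weights, values)) / total_weight
    # For independent errors the combined standard deviation is 1/sqrt(sum of weights).
    combined_sigma = math.sqrt(1.0 / total_weight)
    return mean, combined_sigma


# Three hypothetical velocity estimates (m/day) for one pixel, with their errors.
velocity, error = inverse_variance_average([10.0, 12.0, 11.0], [1.0, 2.0, 1.0])
print(velocity, error)
```

Under this scheme the less certain estimate (sigma of 2.0) contributes a quarter of the weight of the other two, and the combined error is smaller than any single input error.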
Directory: data_and_result
Description: Includes directories that store the results output from MATLAB. Users only need to modify the path in the MATLAB scripts to their own path.
1_origin: Sample data ("PS-20180323-20180329", "PS-20180329-20180404", "PS-20180404-20180410") output from GAMMA software in GeoTIFF format that can be used to calculate DROT and velocity. Includes displacement, theta, phi, and ccp.
2_maskccpN: Outliers removed where ccp < 0.05, and displacement converted to velocity (m/day).
3_rockpoint: Velocities extracted at non-moving regions.
4_constant_detrend: Orbit error removed.
5_Tidal_correction: Atmospheric and tidally induced errors removed.
6_rockpoint: Non-aggregated velocities extracted at non-moving regions.
6_vx_vy_v: Velocities transformed from va/vr to vx/vy.
7_rockpoint: Aggregated velocities extracted at non-moving regions.
7_vx_vy_v_aggregate_and_error_estimate: Inverse weighted average of three ice velocity maps, with the calculated error maps.
8_strain_rate: Strain rate calculated from the aggregated ice velocity.
9_compare: Results before and after tidal correction and aggregation.
10_Block_result: Time series results extracted from 2D data.
11_MALAB_output_png_result: .png files and time series results.
12_DROT: Differential Range Offset Tracking results.
13_ICESat_2: ICESat-2 .h5 and .mat files can be placed here (this archive includes only samples from tracks 0965 and 1094).
14_MODIS_images: MODIS images can be stored here.
shp: Grounding line, rock region, ice front, and other shapefiles.
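The 2_maskccpN step (mask by ccp, then convert displacement to velocity) can be sketched as follows, in Python rather than MATLAB. The 0.05 threshold comes from the description above; reading the time interval from the pair name, treating displacement as metres, and the function names are assumptions made for illustration.

```python
from datetime import datetime
from typing import List, Optional

CCP_THRESHOLD = 0.05  # threshold quoted in the 2_maskccpN description


def pair_interval_days(pair_name: str) -> int:
    """Days between the two dates in a name like 'PS-20180323-20180329'."""
    _, first, second = pair_name.split("-")
    fmt = "%Y%m%d"
    return (datetime.strptime(second, fmt) - datetime.strptime(first, fmt)).days


def displacement_to_velocity(
    displacement_m: List[float], ccp: List[float], pair_name: str
) -> List[Optional[float]]:
    """Mask pixels with ccp below the threshold, then convert metres to m/day."""
    days = pair_interval_days(pair_name)
    return [
        None if c < CCP_THRESHOLD else d / days
        for d, c in zip(displacement_m, ccp)
    ]


# Hypothetical pixel values: displacement in metres and matching ccp scores.
velocities = displacement_to_velocity(
    [60.0, 66.0, 500.0], [0.30, 0.08, 0.01], "PS-20180323-20180329"
)
print(velocities)
```

Masked pixels are returned as None here; on real rasters a NaN mask on the array would be the more natural choice.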
File 2 : PIG_front_1947_2023.zip
Includes ice front position shapefiles from 1947 to 2023, which are used for plotting Figure 1 in the paper.
File 3 : PIG_DROT_GL_2016_2021.zip
Includes grounding line position shapefiles from 1947 to 2023, which are used for plotting Figure 1 in the paper.
Data were derived from the following sources:
The links can be found in the MATLAB scripts or in the "Open Research" section of the paper.
Knowledge of data protection laws and awareness of the independent data protection authority. Knowledge about the protection of personal data. Topics: interest in the protection of personal data stored by private and public organisations; trust in selected institutions in one's own country with regard to data protection; opinion on the protection of personal data: adequate data protection in one's own country, assessment of general awareness of the protection of personal data, concern about leaving personal data on the internet, trust in data protection legislation; awareness of the data protection authority and its tasks: accepting complaints from individuals, imposing sanctions, own contact with this authority; knowledge of the obligations of data-holding organisations towards the respondent; knowledge test on the respondent's rights regarding the use of their personal data: required consent, right to object, right of access, right to correction or deletion of data, legal remedies against violations, claims for damages in case of unlawful use; opinion on the security of data transmission over the internet; awareness of technologies that limit the collection of personal data from one's own computer (cookies, firewall); use of these technologies; reasons for not using them; attitude towards the monitoring of telephone calls, internet use, credit card use, and flight passenger data to combat terrorism (split: reversed response options); awareness of the ban on transferring personal data to non-EU countries with inadequate data protection; awareness of stricter data protection rules for sensitive data. Demography: sex; age; age at end of education; occupation; professional position; degree of urbanisation; household composition and household size; ownership of a mobile phone; landline telephone in the household. Additionally coded: respondent ID; interview language; interviewer ID; country; interview date; interview duration (start and end of interview); interview mode (mobile phone or landline); region; weighting factor.
Attitudes towards the protection of personal data. Topics: concern with regard to the protection of personal information by private and public organisations; trust in the following institutions regarding the use of personal information in a proper way: travel companies, medical services, insurance companies, credit card companies, financial institutions, employers, police, social security, tax authorities, local authorities, credit reference agencies, mail order companies, non-profit organisations, market and opinion research companies; attitude towards the following statements on the protection of personal data in the own country: is properly protected, low awareness of people on the subject, worry about leaving personal information on the internet, appropriate legislation to cope with growing number of personal information on the internet; awareness of the national authority to monitor the application of data protection laws; responsibility of the national authority to hear individuals; ability of the authority to impose sanctions; personal contact to authority; awareness of the obligation of data collectors to provide information on identity, purpose, and further data sharing; knowledge test concerning the storage of personal data: need for personal consent with regard to the use of personal information, right to oppose the use, legal assurance to access personal data, right to correct or remove data, national laws allow access to courts to seek remedies for breaches of data protection laws, right for compensation caused by unlawful use of personal data; assessment of the security of transmitting personal data over the internet; awareness of technologies to limit the collection of personal data from personal computer; use of these technologies; 
reasons for not using; attitude towards selected measures to fight international terrorism: monitor telephone calls, monitor internet use, monitor credit card use, monitor flight passenger data; awareness of the assurance that personal data of EU citizens can only be transferred outside the EU to countries which ensure an adequate level of protection; awareness of stricter data protection rules applied for sensitive data. Demography: sex; age; age at end of education; occupation; professional position; type of community; household composition and household size; own a mobile phone and fixed (landline) phone. Additionally coded was: respondent ID; language of the interview; interviewer ID; country; date of interview; time of the beginning of the interview; duration of the interview; type of phone line; region; weighting factor.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
Water companies in the UK are responsible for testing the quality of drinking water. This dataset contains the results of samples taken from the taps in domestic households to make sure they meet the standards set out by UK and European legislation. This data shows the location, date, and measured levels of determinands set out by the Drinking Water Inspectorate (DWI).
Key Definitions
Aggregation
Process involving summarizing or grouping data to obtain a single or reduced set of information, often for analysis or reporting purposes
Anonymisation
Anonymised data is a type of information sanitization in which data anonymisation tools encrypt or remove personally identifiable information from datasets for the purpose of preserving a data subject's privacy
Dataset
Structured and organized collection of related elements, often stored digitally, used for analysis and interpretation in various fields.
Determinand
A constituent or property of drinking water which can be determined or estimated.
DWI
Drinking Water Inspectorate, an organisation “providing independent reassurance that water supplies in England and Wales are safe and drinking water quality is acceptable to consumers.”
DWI Determinands
Constituents or properties that are tested for when evaluating a sample for its quality as per the guidance of the DWI. For this dataset, only determinands with “point of compliance” as “customer taps” are included.
Granularity
Data granularity is a measure of the level of detail in a data structure. In time-series data, for example, the granularity of measurement might be based on intervals of years, months, weeks, days, or hours
ID
Abbreviation for Identification that refers to any means of verifying the unique identifier assigned to each asset for the purposes of tracking, management, and maintenance.
LSOA
Lower-Level Super Output Area is made up of small geographic areas used for statistical and administrative purposes by the Office for National Statistics. It is designed to have homogeneous populations in terms of population size, making them suitable for statistical analysis and reporting. Each LSOA is built from groups of contiguous Output Areas with an average of about 1,500 residents or 650 households allowing for granular data collection useful for analysis, planning and policy-making while ensuring privacy.
ONS
Office for National Statistics
Open Data Triage
The process carried out by a Data Custodian to determine if there is any evidence of sensitivities associated with Data Assets, their associated Metadata and Software Scripts used to process Data Assets if they are used as Open Data.
Sample
A sample is a representative segment or portion of water taken from a larger whole for the purpose of analysing or testing to ensure compliance with safety and quality standards.
Schema
Structure for organizing and handling data within a dataset, defining the attributes, their data types, and the relationships between different entities. It acts as a framework that ensures data integrity and consistency by specifying permissible data types and constraints for each attribute.
Units
Standard measurements used to quantify and compare different physical quantities.
Water Quality
The chemical, physical, biological, and radiological characteristics of water, typically in relation to its suitability for a specific purpose, such as drinking, swimming, or ecological health. It is determined by assessing a variety of parameters, including but not limited to pH, turbidity, microbial content, dissolved oxygen, presence of substances and temperature.
Data History
Data Origin
These samples were taken from customer taps. They were then analysed for water quality, and the results were uploaded to a database. This dataset is an extract from this database.
Data Triage Considerations
Granularity
Is it useful to share results as averages or individual?
We decided to share as individual results as the lowest level of granularity
Anonymisation
It is a requirement that this data cannot be used to identify a singular person or household. We discussed many options for aggregating the data to a specific geography to ensure this requirement is met. The following geographical aggregations were discussed:
• Water Supply Zone (WSZ) - Limits interoperability with other datasets
• Postcode – Some postcodes contain very few households and may not offer necessary anonymisation
• Postal Sector – Deemed not granular enough in highly populated areas
• Rounded Co-ordinates – Not a recognised standard and may cause overlapping areas
• MSOA – Deemed not granular enough
• LSOA – Agreed as a recognised standard appropriate for England and Wales
• Data Zones – Agreed as a recognised standard appropriate for Scotland
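The concern that some geographies contain too few households to anonymise results can be checked with a simple count per area, as sketched below. The area codes, the threshold of three households, and the function name are hypothetical; the description does not state what minimum size the publishers applied.

```python
from collections import Counter
from typing import Dict, Iterable, List


def small_areas(household_areas: Iterable[str], min_households: int) -> List[str]:
    """Return area codes whose household count falls below the threshold."""
    counts: Dict[str, int] = Counter(household_areas)
    return sorted(area for area, n in counts.items() if n < min_households)


# One hypothetical area code per household, at two candidate geographies.
by_postcode = ["PC1", "PC1", "PC2", "PC3", "PC3", "PC3"]
by_lsoa = ["LSOA_A"] * 6

risky_postcodes = small_areas(by_postcode, min_households=3)
risky_lsoas = small_areas(by_lsoa, min_households=3)
print(risky_postcodes, risky_lsoas)
```

In this toy data the same six households produce two under-sized postcode groups but no under-sized LSOA group, which is the kind of comparison that favours a coarser recognised geography.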
Data Specifications
Each dataset will cover a calendar year of samples
This dataset will be published annually
Historical datasets will be published as far back as 2016 from the introduction of The Water Supply (Water Quality) Regulations 2016
The Determinands included in the dataset are as per the list that is required to be reported to the Drinking Water Inspectorate.
Context
Many UK water companies provide a search tool on their websites where you can search for water quality in your area by postcode. The results of the search may identify the water supply zone that supplies the postcode searched. Water supply zones are not linked to LSOAs which means the results may differ to this dataset
Some sample results are influenced by internal plumbing and may not be representative of drinking water quality in the wider area.
Some samples are tested on site and others are sent to scientific laboratories.
Data Publish Frequency
Annually
Data Triage Review Frequency
Annually unless otherwise requested
Supplementary information
Below is a curated selection of links for additional reading, which provide a deeper understanding of this dataset.
1. Drinking Water Inspectorate Standards and Regulations: https://www.dwi.gov.uk/drinking-water-standards-and-regulations/
2. LSOA (England and Wales) and Data Zone (Scotland)
3. Description for LSOA boundaries by the ONS: Census 2021 geographies - Office for National Statistics (ons.gov.uk)
4. Postcode to LSOA lookup tables: Postcode to 2021 Census Output Area to Lower Layer Super Output Area to Middle Layer Super Output Area to Local Authority District (August 2023) Lookup in the UK (statistics.gov.uk)
5. Legislation history: Legislation - Drinking Water Inspectorate (dwi.gov.uk)
https://creativecommons.org/publicdomain/zero/1.0/
The dataset contains lung X-ray images.
The dataset we use is compiled from several reputable sources:
Dataset 1 [1]: This dataset includes four classes of diseases: COVID-19, viral pneumonia, bacterial pneumonia, and normal. It has multiple versions, and we are currently using the latest version (version 4). Previous studies, such as those by Hariri et al. [18] and Ahmad et al. [20], have also utilized earlier versions of this dataset.
Dataset 2 [2]: This dataset is from the National Institutes of Health (NIH) Chest X-Ray Dataset, which contains over 100,000 chest X-ray images from over 30,000 patients. It includes 14 disease classes, including conditions like atelectasis, consolidation, and infiltration. For this study, we have selected 2,550 chest X-ray images specifically from the Emphysema class.
Dataset 3 [3]: This is the COVQU dataset, which we have extended to include two additional classes: COVID-19 and viral pneumonia. This dataset has been widely used in previous studies by M.E.H. Chowdhury et al. [4] and Rahman T et al. [5], establishing its reputation as a reliable resource.
In addition, we also publish a modified dataset that aims to remove image regions that do not contain lungs (abdomen, arms, etc.).
References:
[1] U. Sait, K. G. Lal, S. P. Prajapati, R. Bhaumik, T. Kumar, S. Shivakumar, K. Bhalla, Curated dataset for COVID-19 posterior-anterior chest radiography images (X-rays), Mendeley Data V4 (2022). doi:10.17632/9xkhgts2s6.4.
[2] X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, R. M. Summers, ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases (2017) 3462–3471. doi:10.1109/CVPR.2017.369.
[3] A. M. Tahir, M. E. Chowdhury, A. Khandakar, T. Rahman, Y. Qiblawey, U. Khurshid, S. Kiranyaz, N. Ibtehaz, M. S. Rahman, S. Al-Maadeed, S. Mahmud, M. Ezeddin, K. Hameed, T. Hamid, COVID-19 infection localization and severity grading from chest X-ray images, Computers in Biology and Medicine 139 (2021) 105002. URL: https://www.sciencedirect.com/science/article/pii/S0010482521007964. doi:10.1016/j.compbiomed.2021.105002.
[4] M. E. Chowdhury, T. Rahman, A. Khandakar, R. Mazhar, M. A. Kadir, Z. B. Mahbub, K. R. Islam, M. S. Khan, A. Iqbal, N. A. Emadi, M. B. I. Reaz, M. T. Islam, Can AI help in screening viral and COVID-19 pneumonia?, IEEE Access 8 (2020) 132665–132676. doi:10.1109/ACCESS.2020.3010287.
[5] T. Rahman, A. Khandakar, Y. Qiblawey, A. Tahir, S. Kiranyaz, S. B. A. Kashem, M. T. Islam, S. A. Maadeed, S. M. Zughaier, M. S. Khan, M. E. Chowdhury, Exploring the effect of image enhancement techniques on COVID-19 detection using chest X-ray images, Computers in Biology and Medicine 132 (2021). doi:10.1016/j.compbiomed.2021.104319.
Version 2 of BIPED has been released; please use BIPEDv2. It contains 250 outdoor images of 1280x720 pixels each. These images have been carefully annotated by experts in the computer vision field, hence no redundancy has been considered. In spite of that, all results have been cross-checked several times in order to correct possible mistakes or wrong edges by just one subject. This dataset is publicly available as a benchmark for evaluating edge detection algorithms. The generation of this dataset is motivated by the lack of edge detection datasets; in fact, there is just one dataset publicly available for the edge detection task, published in 2016 (MDBD: Multicue Dataset for Boundary Detection, the subset for edge detection). The level of detail of the edge-level annotations in the BIPED images can be appreciated by looking at the GT; see Figs above.
The BIPED dataset has 250 high-definition images, already split into 200 for training and 50 for testing.
To augment the BIPED data for the DL training visit this repository
Please cite our dataset if you find it helpful:
@InProceedings{soria2020dexined,
title={Dense Extreme Inception Network: Towards a Robust CNN Model for Edge Detection},
author={Xavier Soria and Edgar Riba and Angel Sappa},
booktitle={The IEEE Winter Conference on Applications of Computer Vision (WACV '20)},
year={2020}
}
@article{SORIA2023BIPEDv2,
title = {Dense extreme inception network for edge detection},
journal = {Pattern Recognition},
volume = {139},
pages = {109461},
year = {2023},
issn = {0031-3203},
doi = {https://doi.org/10.1016/j.patcog.2023.109461},
url = {https://www.sciencedirect.com/science/article/pii/S0031320323001619},
author = {Xavier Soria and Angel Sappa and Patricio Humanante and Arash Akbarinia},
keywords = {Edge detection, Deep learning, CNN, Contour detection, Boundary detection, Segmentation}
}
This dataset is made freely available to academic and non-academic entities for non-commercial purposes such as academic research, teaching, scientific publications, or personal experimentation. Permission is granted to use the data given that you agree to our license terms. However, if any of our images infringe any privacy or rights, feel free to contact us and we will remove them immediately.
If you need more information, don't hesitate to contact me :)
This dataset contains high resolution residential water use data for 31 residential homes located in Logan City and Providence City in Cache County, Utah, USA. Data were collected using a low-cost, open source monitoring device that was designed to operate on magnetically driven residential water meters. Data were recorded with a temporal frequency of 4 seconds and were collected for a period of at least two weeks during the summer when outdoor water use was active and two weeks during the winter when no outdoor water use was expected. The data were measured on the meter located on the water supply line to each home and represent a trace of the total water use for each residence. The dataset also includes secondary data about each of the residences at which data were collected. These data have been anonymized to remove any personally identifiable information from participants in this data collection effort.
Open Data Commons Attribution License (ODC-By) v1.0 https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset consists of over 5,000 mineral varieties with a total of nearly 44,000 images. Some images are also accompanied by textual descriptions.
The dataset accompanies the paper https://www.sciencedirect.com/science/article/abs/pii/S0098300423001188
Images are processed to remove text tables and to zoom in on the minerals.
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3838650%2Fcd90450b88cab88d993acb7f80664545%2Fminerals_data_preprocessing_scheme.png?generation=1709238741202275&alt=media
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
CREDIT: Dataset Card for "heliosbrahma/mental_health_chatbot_dataset"
Dataset Description
Dataset Summary
This dataset contains conversational pairs of questions and answers, each combined in a single text, related to mental health. The dataset was curated from popular healthcare blogs such as WebMD, Mayo Clinic, and Healthline, online FAQs, etc. All questions and answers have been anonymized to remove any PII and pre-processed to remove any unwanted characters.
Languages… See the full description on the dataset page: https://huggingface.co/datasets/ZahrizhalAli/mental_health_conversational_dataset.
OpenEDS2020 is a dataset of eye-image sequences captured at a frame rate of 100 Hz under controlled illumination, using a virtual-reality head-mounted display mounted with two synchronized eye-facing cameras. The dataset, which is anonymized to remove any personally identifiable information on participants, consists of 80 participants of varied appearance performing several gaze-elicited tasks, and is divided into two subsets: 1) Gaze Prediction Dataset, with up to 66,560 sequences containing 550,400 eye-images and respective gaze vectors, created to foster research in spatio-temporal gaze estimation and prediction approaches; and 2) Eye Segmentation Dataset, consisting of 200 sequences sampled at 5 Hz, with up to 29,500 images, of which 5% contain a semantic segmentation label, devised to encourage the use of temporal information to propagate labels to contiguous frames.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
ZOD-Mini-2D-Road-Scenes
The ZOD-Mini-2D-Road-Scenes dataset is derived from the Zenseact Open Dataset (ZOD), property of Zenseact AB (© 2022 Zenseact AB), and is licensed under the permissive CC BY-SA 4.0. Any public use, distribution, or display of this dataset must contain this entire notice:
For this dataset, Zenseact AB has taken all reasonable measures to remove all personally identifiable information, including faces and license plates. To the extent that you like to request… See the full description on the dataset page: https://huggingface.co/datasets/8bits-ai/ZOD-Mini-2D-Road-Scenes.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Twitter Indonesia Sarcastic
Twitter Indonesia Sarcastic is a dataset intended for sarcasm detection in the Indonesian language. This dataset is introduced in Khotijah et al. (2020), whereby Indonesian tweets are collected and labeled as either sarcastic or non-sarcastic. We took the raw data, and performed several cleaning procedures such as: sentence order re-reversal, deduplication with minHash LSH, PII masking to remove usernames, hashtags, emails, URLs, and finally a random… See the full description on the dataset page: https://huggingface.co/datasets/w11wo/twitter_indonesia_sarcastic.
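As an illustration of the PII masking step mentioned above, here is a minimal Python sketch that replaces URLs, emails, usernames, and hashtags with placeholder tokens. It is not the dataset authors' actual pipeline: the regular expressions, the `mask_pii` function name, and the placeholder tokens (`<url>`, `<email>`, `<username>`, `<hashtag>`) are all assumptions made for this example.

```python
import re

# Hypothetical placeholder tokens and patterns; the source does not state
# which tokens or rules the dataset authors actually used.
# Order matters: URLs and emails are masked before usernames so that the "@"
# inside an email address is not mistaken for a username mention, and a "#"
# fragment inside a URL is not mistaken for a hashtag.
PATTERNS = [
    (re.compile(r"https?://\S+"), "<url>"),
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "<email>"),
    (re.compile(r"@\w+"), "<username>"),
    (re.compile(r"#\w+"), "<hashtag>"),
]


def mask_pii(text: str) -> str:
    """Replace URLs, emails, usernames, and hashtags with placeholder tokens."""
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return text


example = "@budi cek https://example.com/a atau email ke budi@example.com #sarkas"
print(mask_pii(example))
# prints: <username> cek <url> atau email ke <email> <hashtag>
```

Simple patterns like these will miss edge cases (for example, URLs without a scheme), so a real cleaning pipeline would likely need more careful rules.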