18 datasets found

pii-masking-200k
huggingface.co
Updated Apr 22, 2024
Cite
Ai4Privacy (2024). pii-masking-200k [Dataset]. http://doi.org/10.57967/hf/1532
Unique identifier
https://doi.org/10.57967/hf/1532
Dataset updated
Apr 22, 2024
Dataset authored and provided by
Ai4Privacy
Description
Ai4Privacy Community

Join our community at https://discord.gg/FmzWshaaQT to help build open datasets for privacy masking.

Purpose and Features

Previously the world's largest open dataset for privacy; that title now belongs to pii-masking-300k. The purpose of the dataset is to train models to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The example texts have 54 PII classes (types of sensitive data), targeting 229 discussion… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-200k.
pii-masking-65k
huggingface.co
Updated Apr 5, 2024
+ more versions
Cite
Ai4Privacy (2024). pii-masking-65k [Dataset]. http://doi.org/10.57967/hf/2012
Unique identifier
https://doi.org/10.57967/hf/2012
Dataset updated
Apr 5, 2024
Dataset authored and provided by
Ai4Privacy
Description
Purpose and Features

The purpose of the model and dataset is to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The model is a fine-tuned version of "Distilled BERT", a smaller and faster version of BERT. It was adapted for the task of token classification based on what is, to our knowledge, the largest open-source PII masking dataset, which we are releasing simultaneously. The model size is 62 million parameters. The original… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-65k.
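To make the token-classification framing concrete, here is a minimal sketch of how per-token labels could be turned into masked text. The tokens, the label names (`NAME`, `EMAIL`, `O`), and the `mask_tokens` helper are illustrative assumptions, not the dataset's actual label set or the model's API.

```python
# Hypothetical tokens and labels; the label names are assumptions for
# illustration and are not taken from the dataset's real label set.
def mask_tokens(tokens, labels):
    """Replace each run of tokens sharing a PII label with one [LABEL] placeholder."""
    masked = []
    previous = "O"  # "O" marks tokens that carry no PII
    for token, label in zip(tokens, labels):
        if label == "O":
            masked.append(token)
        elif label != previous:
            # First token of a new PII span: emit the placeholder once.
            masked.append(f"[{label}]")
        previous = label
    return " ".join(masked)


tokens = ["Contact", "Jane", "Doe", "at", "jane@example.com", "today"]
labels = ["O", "NAME", "NAME", "O", "EMAIL", "O"]
masked_text = mask_tokens(tokens, labels)
print(masked_text)
```

In practice the labels would come from a trained classifier rather than being written by hand; the sketch only shows the masking step that follows classification.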
pii-masking-300k
huggingface.co
Updated Apr 4, 2024
Cite
Ai4Privacy (2024). pii-masking-300k [Dataset]. http://doi.org/10.57967/hf/1995
Unique identifier
https://doi.org/10.57967/hf/1995
Dataset updated
Apr 4, 2024
Dataset authored and provided by
Ai4Privacy
License
https://choosealicense.com/licenses/other/
Description
Purpose and Features

🌍 World's largest open dataset for privacy masking 🌎 The dataset is useful to train and evaluate models to remove personally identifiable and sensitive information from text, especially in the context of AI assistants and LLMs. Key facts:

OpenPII-220k text entries have 27 PII classes (types of sensitive data), targeting 749 discussion subjects / use cases split across education, health, and psychology. FinPII contains an additional ~20 types tailored to… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-300k.
open-pii-masking-500k-ai4privacy
dataverse.harvard.edu
Updated Mar 17, 2025
Cite
Michael Anthony (2025). open-pii-masking-500k-ai4privacy [Dataset]. http://doi.org/10.7910/DVN/4H11OA
Unique identifier
https://doi.org/10.7910/DVN/4H11OA
Dataset updated
Mar 17, 2025
Dataset provided by
Harvard Dataverse
Authors
Michael Anthony
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
🌍 World's largest open dataset for privacy masking 🌎

The dataset is useful to train and evaluate models to remove personally identifiable and sensitive information from text, especially in the context of AI assistants and LLMs.

# Dataset Analytics 📊 - ai4privacy/open-pii-masking-500k-ai4privacy

## p5y Data Analytics

- Total Entries: 580,227
- Total Tokens: 19,199,982
- Average Source Text Length: 17.37 words
- Total PII Labels: 5,705,973
- Number of Unique PII Classes: 20 (Open PII Labelset)
- Unique Identity Values: 704,215

---

## Language Distribution Analytics

Number of Unique Languages: 8

| Language | Count | Percentage |
|--------------------|----------|------------|
| English (en) 🇺🇸🇬🇧🇨🇦🇮🇳 | 150,693 | 25.97% |
| French (fr) 🇫🇷🇨🇭🇨🇦 | 112,136 | 19.33% |
| German (de) 🇩🇪🇨🇭 | 82,384 | 14.20% |
| Spanish (es) 🇪🇸 🇲🇽 | 78,013 | 13.45% |
| Italian (it) 🇮🇹🇨🇭 | 68,824 | 11.86% |
| Dutch (nl) 🇳🇱 | 26,628 | 4.59% |
| Hindi (hi)* 🇮🇳 | 33,963 | 5.85% |
| Telugu (te)* 🇮🇳 | 27,586 | 4.75% |

*these languages are in experimental stages

---

## Region Distribution Analytics

Number of Unique Regions: 11

| Region | Count | Percentage |
|-----------------------|----------|------------|
| Switzerland (CH) 🇨🇭 | 112,531 | 19.39% |
| India (IN) 🇮🇳 | 99,724 | 17.19% |
| Canada (CA) 🇨🇦 | 74,733 | 12.88% |
| Germany (DE) 🇩🇪 | 41,604 | 7.17% |
| Spain (ES) 🇪🇸 | 39,557 | 6.82% |
| Mexico (MX) 🇲🇽 | 38,456 | 6.63% |
| France (FR) 🇫🇷 | 37,886 | 6.53% |
| Great Britain (GB) 🇬🇧 | 37,092 | 6.39% |
| United States (US) 🇺🇸 | 37,008 | 6.38% |
| Italy (IT) 🇮🇹 | 35,008 | 6.03% |
| Netherlands (NL) 🇳🇱 | 26,628 | 4.59% |

---

## Machine Learning Task Analytics

| Split | Count | Percentage |
|-------------|----------|------------|
| Train | 464,150 | 79.99% |
| Validate | 116,077 | 20.01% |

---

# Usage

Option 1: Python

    pip install datasets

    from datasets import load_dataset
    dataset = load_dataset("ai4privacy/open-pii-masking-500k-ai4privacy")

# Compatible Machine Learning Tasks:

- Token classification. Check out HuggingFace's guide on token classification.
  - ALBERT, BERT, BigBird, BioGpt, BLOOM, BROS, CamemBERT, CANINE, ConvBERT, Data2VecText, DeBERTa, DeBERTa-v2, DistilBERT, ELECTRA, ERNIE, ErnieM, ESM, Falcon, FlauBERT, FNet, Funnel Transformer, GPT-Sw3, OpenAI GPT-2, GPTBigCode, GPT Neo, GPT NeoX, I-BERT, LayoutLM, LayoutLMv2, LayoutLMv3, LiLT, Longformer, LUKE, MarkupLM, MEGA, Megatron-BERT, MobileBERT,...
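As a quick sanity check on the analytics above, the following sketch recomputes the language and split percentages from the published counts. The counts are copied from the description; the variable names are chosen only for illustration.

```python
# Counts copied from the language distribution and split tables in the description.
language_counts = {
    "en": 150_693, "fr": 112_136, "de": 82_384, "es": 78_013,
    "it": 68_824, "nl": 26_628, "hi": 33_963, "te": 27_586,
}
split_counts = {"train": 464_150, "validate": 116_077}

total = sum(language_counts.values())

# Share of each language, in percent, rounded to two decimals as in the table.
language_shares = {lang: round(100 * n / total, 2) for lang, n in language_counts.items()}
split_shares = {name: round(100 * n / total, 2) for name, n in split_counts.items()}

print(total)
print(language_shares)
print(split_shares)
```

Both tables sum to the stated total of 580,227 entries, so the language and split breakdowns describe the same set of rows.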
The Impact of AI and ChatGPT on Bangladeshi University Students
data.mendeley.com
Updated Jan 6, 2025
Cite
Md Jhirul Islam (2025). The Impact of AI and ChatGPT on Bangladeshi University Students [Dataset]. http://doi.org/10.17632/zykphpvbr7.2
Unique identifier
https://doi.org/10.17632/zykphpvbr7.2
Dataset updated
Jan 6, 2025
Authors
Md Jhirul Islam
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Bangladesh
Description
The dataset records the perceptions of Bangladeshi university students on the influence that AI tools, especially ChatGPT, have on their academic practices, learning experiences, and problem-solving abilities. It covers the varying role of AI in education, including common usage statistics, the effect of AI on creative abilities, its impact on learning, and whether it could invade privacy. The dataset offers a perspective on how AI tools are changing education in the country and provides valuable information for researchers, educators, and policymakers seeking to understand trends, challenges, and opportunities in the adoption of AI in the academic context.

Methodology
Data Collection Method: Online survey using Google Forms
Participants: A total of 3,512 students from various Bangladeshi universities participated.
Survey Questions: The survey included questions on demographic information, frequency of AI tool usage, perceived benefits, concerns regarding privacy, and impacts on creativity and learning.

Sampling Technique: Random sampling of university students
Data Collection Period: June 2024 to December 2024

Privacy Compliance This dataset has been anonymized to remove any personally identifiable information (PII). It adheres to relevant privacy regulations to ensure the confidentiality of participants.

For further inquiries, please contact: Name: Md Jhirul Islam, Daffodil International University Email: jhirul15-4063@diu.edu.bd Phone: 01316317573
Wessex Water Domestic Water Quality
arc-gis-hub-home-arcgishub.hub.arcgis.com
Updated Jan 30, 2024
Cite
sophie.sherriff_wessex (2024). Wessex Water Domestic Water Quality [Dataset]. https://arc-gis-hub-home-arcgishub.hub.arcgis.com/datasets/acc078ffd7a44426998ebfa3f468e89f
Dataset updated
Jan 30, 2024
Dataset authored and provided by
sophie.sherriff_wessex
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Overview

Water companies in the UK are responsible for testing the quality of drinking water. This dataset contains the results of samples taken from the taps in domestic households to make sure they meet the standards set out by UK and European legislation. This data shows the location, date, and measured levels of determinands set out by the Drinking Water Inspectorate (DWI).

Key Definitions

Aggregation

Process involving summarising or grouping data to obtain a single or reduced set of information, often for analysis or reporting purposes

Anonymisation

Anonymised data is a type of information sanitisation in which data anonymisation tools encrypt or remove personally identifiable information from datasets for the purpose of preserving a data subject's privacy

Dataset

Structured and organised collection of related elements, often stored digitally, used for analysis and interpretation in various fields.

Determinand

A constituent or property of drinking water which can be determined or estimated.

DWI

Drinking Water Inspectorate, an organisation “providing independent reassurance that water supplies in England and Wales are safe and drinking water quality is acceptable to consumers.”

DWI Determinands

Constituents or properties that are tested for when evaluating a sample for its quality as per the guidance of the DWI. For this dataset, only determinands with “point of compliance” as “customer taps” are included.

Granularity

Data granularity is a measure of the level of detail in a data structure. In time-series data, for example, the granularity of measurement might be based on intervals of years, months, weeks, days, or hours

ID

Abbreviation for Identification that refers to any means of verifying the unique identifier assigned to each asset for the purposes of tracking, management, and maintenance.

LSOA

Lower-Level Super Output Area is made up of small geographic areas used for statistical and administrative purposes by the Office for National Statistics. It is designed to have homogeneous populations in terms of population size, making them suitable for statistical analysis and reporting. Each LSOA is built from groups of contiguous Output Areas with an average of about 1,500 residents or 650 households allowing for granular data collection useful for analysis, planning and policy-making while ensuring privacy.

ONS

Office for National Statistics

Open Data Triage

The process carried out by a Data Custodian to determine if there is any evidence of sensitivities associated with Data Assets, their associated Metadata and Software Scripts used to process Data Assets if they are used as Open Data.

Sample

A sample is a representative segment or portion of water taken from a larger whole for the purpose of analysing or testing to ensure compliance with safety and quality standards.

Schema

Structure for organizing and handling data within a dataset, defining the attributes, their data types, and the relationships between different entities. It acts as a framework that ensures data integrity and consistency by specifying permissible data types and constraints for each attribute.

Units

Standard measurements used to quantify and compare different physical quantities.

Water Quality

The chemical, physical, biological, and radiological characteristics of water, typically in relation to its suitability for a specific purpose, such as drinking, swimming, or ecological health. It is determined by assessing a variety of parameters, including but not limited to pH, turbidity, microbial content, dissolved oxygen, presence of substances and temperature.

Data History

Data Origin

These samples were taken from customer taps. They were then analysed for water quality, and the results were uploaded to a database. This dataset is an extract from this database.

Data Triage Considerations

Granularity

Is it useful to share results as averages or individual?

We decided to share as individual results as the lowest level of granularity

Anonymisation

It is a requirement that this data cannot be used to identify a singular person or household. We discussed many options for aggregating the data to a specific geography to ensure this requirement is met. The following geographical aggregations were discussed:

• Water Supply Zone (WSZ) - Limits interoperability with other datasets
• Postcode – Some postcodes contain very few households and may not offer necessary anonymisation
• Postal Sector – Deemed not granular enough in highly populated areas
• Rounded Co-ordinates – Not a recognised standard and may cause overlapping areas
• MSOA – Deemed not granular enough
• LSOA – Agreed as a recognised standard appropriate for England and Wales
• Data Zones – Agreed as a recognised standard appropriate for Scotland

Data Triage Review Frequency

Annually unless otherwise requested

Publish Frequency

Annually

Data Specifications

• Each dataset will cover a year of samples in calendar year
• This dataset will be published annually
• Historical datasets will be published as far back as 2016 from the introduction of The Water Supply (Water Quality) Regulations 2016
• The determinands included in the dataset are as per the list that is required to be reported to the Drinking Water Inspectorate.
• A small proportion of samples could not be allocated to an LSOA – these represented less than 0.1% of samples and were removed from the dataset in 2023.
• The postcode to LSOA lookup table used for 2022 was not available when 2023 data was processed, see supplementary information for the lookup table applied to each calendar year of data.

Context

Many UK water companies provide a search tool on their websites where you can search for water quality in your area by postcode. The results of the search may identify the water supply zone that supplies the postcode searched. Water supply zones are not linked to LSOAs which means the results may differ to this dataset.

Some sample results are influenced by internal plumbing and may not be representative of drinking water quality in the wider area.

Some samples are tested on site and others are sent to scientific laboratories.

Supplementary information

Below is a curated selection of links for additional reading, which provide a deeper understanding of this dataset.

1. Drinking Water Inspectorate Standards and Regulations: https://www.dwi.gov.uk/drinking-water-standards-and-regulations/
2. LSOA (England and Wales) and Data Zone (Scotland): https://www.nrscotland.gov.uk/files/geography/2011-census/geography-bckground-info-comparison-of-thresholds.pdf
3. Description for LSOA boundaries by the ONS: https://www.ons.gov.uk/methodology/geography/ukgeographies/censusgeographies/census2021geographies
4. Postcode to LSOA lookup tables (2022 calendar year data): https://geoportal.statistics.gov.uk/datasets/postcode-to-2021-census-output-area-to-lower-layer-super-output-area-to-middle-layer-super-output-area-to-local-authority-district-august-2023-lookup-in-the-uk/about
5. Postcode to LSOA lookup tables (2023 calendar year data): https://geoportal.statistics.gov.uk/datasets/b8451168e985446eb8269328615dec62/about
6. Legislation history: https://www.dwi.gov.uk/water-companies/legislation/
Dataset for the paper "Observation of Acceleration and Deceleration Periods...
zenodo.org
Updated Mar 26, 2025
Cite
Yide Qian; Yide Qian (2025). Dataset for the paper "Observation of Acceleration and Deceleration Periods at Pine Island Ice Shelf from 1997–2023 " [Dataset]. http://doi.org/10.5281/zenodo.15022854
Unique identifier
https://doi.org/10.5281/zenodo.15022854
Dataset updated
Mar 26, 2025
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Yide Qian; Yide Qian
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Pine Island Glacier
Description
Dataset and code for "Observation of Acceleration and Deceleration Periods at Pine Island Ice Shelf from 1997–2023"

Description of the data and file structure

The MATLAB codes and related datasets are used for generating the figures for the paper "Observation of Acceleration and Deceleration Periods at Pine Island Ice Shelf from 1997–2023".

Files and variables

File 1: Data_and_Code.zip

Directory: Main_function

Description: Includes MATLAB scripts and functions. Each script includes descriptions that guide the user on how to use it and how to find the datasets used for processing.

MATLAB Main Scripts: Include all the steps to process the data and to output figures and videos.

Script_1_Ice_velocity_process_flow.m

Script_2_strain_rate_process_flow.m

Script_3_DROT_grounding_line_extraction.m

Script_4_Read_ICESat2_h5_files.m

Script_5_Extraction_results.m

MATLAB functions: Five files that include MATLAB functions that support the main scripts:

1_Ice_velocity_code: Includes MATLAB functions related to ice velocity post-processing, including outlier removal, filtering, correction for atmospheric and tidal effects, inverse weighted averaging, and error estimation.

2_strain_rate: Includes MATLAB functions related to strain rate calculation.

3_DROT_extract_grounding_line_code: Includes MATLAB functions that convert range offset results output from GAMMA to differential vertical displacement and use the result to extract the grounding line.

4_Extract_data_from_2D_result: Includes MATLAB functions used to extract profiles from 2D data.

5_NeRD_Damage_detection: Modified code from Izeboud et al. 2023. When applying this code, please also cite Izeboud et al. 2023 (https://www.sciencedirect.com/science/article/pii/S0034425722004655).

6_Figure_plotting_code: Includes MATLAB functions related to the figures in the paper and the supporting information.

Directory: data_and_result

Description: Includes directories that store the results output from MATLAB. Users only need to modify the path in the MATLAB scripts to their own path.

1_origin: Sample data ("PS-20180323-20180329", "PS-20180329-20180404", "PS-20180404-20180410") output from GAMMA software in GeoTIFF format that can be used to calculate DROT and velocity. Includes displacement, theta, phi, and ccp.

2_maskccpN: Remove outliers by ccp < 0.05 and change displacement to velocity (m/day).

3_rockpoint: Extract velocities at non-moving region

4_constant_detrend: removed orbit error

5_Tidal_correction: remove atmospheric and tidal induced error

6_rockpoint: Extract non-aggregated velocities at non-moving region

6_vx_vy_v: transform velocities from va/vr to vx/vy

7_rockpoint: Extract aggregated velocities at non-moving region

7_vx_vy_v_aggregate_and_error_estimate: inverse weighted average of three ice velocity maps and calculate the error maps

8_strain_rate: calculated strain rate from aggregate ice velocity

9_compare: store the results before and after tidal correction and aggregation.

10_Block_result: time series results extracted from 2D data.

11_MALAB_output_png_result: Store .png files and time series results

12_DROT: Differential Range Offset Tracking results

13_ICESat_2: ICESat_2 .h5 files and .mat files can be put here (this directory includes only the samples from tracks 0965 and 1094)

14_MODIS_images: you can store MODIS images here

shp: grounding line, rock region, ice front, and other shape files.
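Two of the processing steps listed above, masking low-quality pixels and combining three velocity maps, can be sketched as follows. This is a minimal Python illustration, not the authors' MATLAB code: the function names, the sample pixel values, the 6-day interval (inferred from the sample names such as "PS-20180323-20180329"), and the reading of "inverse weighted average" as inverse-variance weighting are all assumptions.

```python
import math


def mask_and_convert(displacement_m, ccp, days, ccp_min=0.05):
    """Set pixels with ccp < ccp_min to NaN and convert displacement (m) to velocity (m/day)."""
    return [d / days if c >= ccp_min else math.nan
            for d, c in zip(displacement_m, ccp)]


def inverse_variance_mean(values, sigmas):
    """Inverse-variance weighted mean and its standard error, skipping NaN values."""
    pairs = [(v, s) for v, s in zip(values, sigmas) if not math.isnan(v)]
    if not pairs:
        return math.nan, math.nan
    weights = [1.0 / (s * s) for _, s in pairs]
    mean = sum(w * v for w, (v, _) in zip(weights, pairs)) / sum(weights)
    error = math.sqrt(1.0 / sum(weights))
    return mean, error


# One pixel observed in three consecutive maps (hypothetical values).
displacement = [60.0, 66.0, 54.0]   # metres over the acquisition interval
ccp = [0.20, 0.30, 0.01]            # third map falls below the 0.05 threshold
sigma = [1.0, 1.0, 2.0]             # assumed per-map velocity errors in m/day

velocities = mask_and_convert(displacement, ccp, days=6)
mean_velocity, mean_error = inverse_variance_mean(velocities, sigma)
print(velocities, mean_velocity, mean_error)
```

With equal errors on the two surviving maps, the weighted mean reduces to a plain average and the combined error shrinks by a factor of the square root of two.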

File 2 : PIG_front_1947_2023.zip

Includes ice front position shapefiles from 1947 to 2023, which were used for plotting Figure 1 in the paper.

File 3 : PIG_DROT_GL_2016_2021.zip

Includes grounding line position shapefiles from 1947 to 2023, which were used for plotting Figure 1 in the paper.

Data was derived from the following sources:
The links can be found in the MATLAB scripts or in the paper's "Open Research" section.
Flash Eurobarometer 225 (Data Protection - General Public) - Dataset -...
b2find.eudat.eu
Updated Jul 23, 2025
Cite
The citation is currently not available for this dataset.
Dataset updated
Jul 23, 2025
Description
Knowledge of data protection laws and awareness of the independent data protection authority. Knowledge about the protection of personal data. Topics: interest in the protection of personal data stored by private and public organisations; trust in selected institutions in one's own country regarding data protection; opinion on the protection of personal data: sufficient data protection in one's own country, assessment of general awareness of personal data protection, concern about leaving personal data on the internet, trust in data protection legislation; awareness of the data protection authority and its tasks: receiving complaints from private individuals, imposing sanctions, own contact with this authority; knowledge of the obligations of data-holding organisations towards the respondent; knowledge test on the respondent's rights regarding the use of their personal data: required consent, right to object, right of access, right to correction or deletion of data, legal remedies against violations, claims for damages in case of unlawful use; opinion on the security of data transmission over the internet; awareness of technologies that limit the collection of personal data from one's own computer (cookies, firewall); use of these technologies; reasons for not using them; attitude towards the monitoring of telephone calls, internet use, credit card use, and flight passenger data to combat terrorism (split: reversed response options); awareness of the ban on transferring personal data to non-EU countries with inadequate data protection; awareness of stricter data protection rules for sensitive data. Demography: sex; age; age at end of education; occupation; professional position; degree of urbanisation; household composition and household size; ownership of a mobile phone; landline phone in the household. Additionally coded: respondent ID; interview language; interviewer ID; country; date of interview; interview duration (start and end of interview); interview mode (mobile phone or landline); region; weighting factor.

Attitudes towards the protection of personal data. Topics: concern with regard to the protection of personal information by private and public organisations; trust in the following institutions regarding the use of personal information in a proper way: travel companies, medical services, insurance companies, credit card companies, financial institutions, employers, police, social security, tax authorities, local authorities, credit reference agencies, mail order companies, non-profit organisations, market and opinion research companies; attitude towards the following statements on the protection of personal data in the own country: is properly protected, low awareness of people on the subject, worry about leaving personal information on the internet, appropriate legislation to cope with growing number of personal information on the internet; awareness of the national authority to monitor the application of data protection laws; responsibility of the national authority to hear individuals; ability of the authority to impose sanctions; personal contact to authority; awareness of the obligation of data collectors to provide information on identity, purpose, and further data sharing; knowledge test concerning the storage of personal data: need for personal consent with regard to the use of personal information, right to oppose the use, legal assurance to access personal data, right to correct or remove data, national laws allow access to courts to seek remedies for breaches of data protection laws, right for compensation caused by unlawful use of personal data; assessment of the security of transmitting personal data over the internet; awareness of technologies to limit the collection of personal data from personal computer; use of these technologies; reasons for not using; attitude towards selected measures to fight international terrorism: monitor telephone calls, monitor internet use, monitor credit card use, monitor flight passenger data; awareness of the assurance that personal data of EU citizens can only be transferred outside the EU to countries which ensure an adequate level of protection; awareness of stricter data protection rules applied for sensitive data. Demography: sex; age; age at end of education; occupation; professional position; type of community; household composition and household size; own a mobile phone and fixed (landline) phone. Additionally coded was: respondent ID; language of the interview; interviewer ID; country; date of interview; time of the beginning of the interview; duration of the interview; type of phone line; region; weighting factor.
Portsmouth Water Drinking Water Quality Data 2022, 2023 & 2024
streamwaterdata.co.uk
Updated May 22, 2024
Cite
AHughes_Portsmouth (2024). Portsmouth Water Drinking Water Quality Data 2022, 2023 & 2024 [Dataset]. https://www.streamwaterdata.co.uk/datasets/b9b8d038ae70461386f5cab102adbbb9
Dataset updated
May 22, 2024
Dataset authored and provided by
AHughes_Portsmouth
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Overview

Water companies in the UK are responsible for testing the quality of drinking water. This dataset contains the results of samples taken from the taps in domestic households to make sure they meet the standards set out by UK and European legislation. This data shows the location, date, and measured levels of determinands set out by the Drinking Water Inspectorate (DWI).

Key Definitions

Aggregation

Process involving summarizing or grouping data to obtain a single or reduced set of information, often for analysis or reporting purposes

Anonymisation

Anonymised data is a type of information sanitization in which data anonymisation tools encrypt or remove personally identifiable information from datasets for the purpose of preserving a data subject's privacy

Dataset

Structured and organized collection of related elements, often stored digitally, used for analysis and interpretation in various fields.

Determinand

A constituent or property of drinking water which can be determined or estimated.

DWI

Drinking Water Inspectorate, an organisation “providing independent reassurance that water supplies in England and Wales are safe and drinking water quality is acceptable to consumers.”

DWI Determinands

Constituents or properties that are tested for when evaluating a sample for its quality as per the guidance of the DWI. For this dataset, only determinands with “point of compliance” as “customer taps” are included.

Granularity

Data granularity is a measure of the level of detail in a data structure. In time-series data, for example, the granularity of measurement might be based on intervals of years, months, weeks, days, or hours

ID

Abbreviation for Identification that refers to any means of verifying the unique identifier assigned to each asset for the purposes of tracking, management, and maintenance.

LSOA

Lower-Level Super Output Area is made up of small geographic areas used for statistical and administrative purposes by the Office for National Statistics. It is designed to have homogeneous populations in terms of population size, making them suitable for statistical analysis and reporting. Each LSOA is built from groups of contiguous Output Areas with an average of about 1,500 residents or 650 households allowing for granular data collection useful for analysis, planning and policy- making while ensuring privacy.

ONS

Office for National Statistics

Open Data Triage

The process carried out by a Data Custodian to determine if there is any evidence of sensitivities associated with Data Assets, their associated Metadata and Software Scripts used to process Data Assets if they are used as Open Data.

Sample

A sample is a representative segment or portion of water taken from a larger whole for the purpose of analysing or testing to ensure compliance with safety and quality standards.

Schema

Structure for organizing and handling data within a dataset, defining the attributes, their data types, and the relationships between different entities. It acts as a framework that ensures data integrity and consistency by specifying permissible data types and constraints for each attribute.

Units

Standard measurements used to quantify and compare different physical quantities.

Water Quality

The chemical, physical, biological, and radiological characteristics of water, typically in relation to its suitability for a specific purpose, such as drinking, swimming, or ecological health. It is determined by assessing a variety of parameters, including but not limited to pH, turbidity, microbial content, dissolved oxygen, presence of substances and temperature.

Data History

Data Origin

These samples were taken from customer taps. They were then analysed for water quality, and the results were uploaded to a database. This dataset is an extract from this database.

Data Triage Considerations

Granularity

Is it useful to share results as averages or individual?

We decided to share as individual results as the lowest level of granularity

Anonymisation

It is a requirement that this data cannot be used to identify a singular person or household. We discussed many options for aggregating the data to a specific geography to ensure this requirement is met. The following geographical aggregations were discussed:

• Water Supply Zone (WSZ) - Limits interoperability with other datasets

• Postcode – Some postcodes contain very few households and may not offer necessary anonymisation

• Postal Sector – Deemed not granular enough in highly populated areas

• Rounded Co-ordinates – Not a recognised standard and may cause overlapping areas

• MSOA – Deemed not granular enough

• LSOA – Agreed as a recognised standard appropriate for England and Wales

• Data Zones – Agreed as a recognised standard appropriate for Scotland
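The LSOA option can be illustrated with a minimal sketch of replacing each sample's postcode with its LSOA code via a lookup table. The postcodes, LSOA codes, sample values, and the `anonymise` helper below are all hypothetical, and dropping samples that have no match in the lookup is an assumption about how unmatched rows might be handled.

```python
# Hypothetical postcode-to-LSOA lookup; real tables are published by the ONS.
postcode_to_lsoa = {
    "AB1 2CD": "E01000001",
    "AB1 2CE": "E01000001",
    "XY9 8ZW": "E01000002",
}

# Hypothetical individual sample results keyed by household postcode.
samples = [
    {"postcode": "AB1 2CD", "determinand": "pH", "result": 7.4},
    {"postcode": "AB1 2CE", "determinand": "pH", "result": 7.6},
    {"postcode": "XY9 8ZW", "determinand": "Turbidity", "result": 0.2},
    {"postcode": "ZZ0 0ZZ", "determinand": "pH", "result": 7.1},  # not in lookup
]


def anonymise(rows, lookup):
    """Swap the postcode for its LSOA code; count rows that cannot be allocated."""
    published, unallocated = [], 0
    for row in rows:
        lsoa = lookup.get(row["postcode"])
        if lsoa is None:
            unallocated += 1  # assumption: unmatched samples are left out
            continue
        published.append({
            "lsoa": lsoa,
            "determinand": row["determinand"],
            "result": row["result"],
        })
    return published, unallocated


published, unallocated = anonymise(samples, postcode_to_lsoa)
print(published)
print(unallocated)
```

Each result stays individual, in line with the granularity decision above; only the location is coarsened from postcode to LSOA.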

Data Specifications

Each dataset will cover a calendar year of samples

This dataset will be published annually

Historical datasets will be published as far back as 2016 from the introduction of The Water Supply (Water Quality) Regulations 2016

The Determinands included in the dataset are as per the list that is required to be reported to the Drinking Water Inspectorate.

Context

Many UK water companies provide a search tool on their websites where you can search for water quality in your area by postcode. The results of the search may identify the water supply zone that supplies the postcode searched. Water supply zones are not linked to LSOAs, which means the results may differ from this dataset.

Some sample results are influenced by internal plumbing and may not be representative of drinking water quality in the wider area.

Some samples are tested on site and others are sent to scientific laboratories.

Data Publish Frequency

Annually

Data Triage Review Frequency

Annually unless otherwise requested

Supplementary information

Below is a curated selection of links for additional reading, which provide a deeper understanding of this dataset.

1. Drinking Water Inspectorate Standards and Regulations: https://www.dwi.gov.uk/drinking-water-standards-and-regulations/

2. LSOA (England and Wales) and Data Zone (Scotland): https://www.nrscotland.gov.uk/files/geography/2011-census/geography-bckground-info-comparison-of-thresholds.pdf

3. Description for LSOA boundaries by the ONS: Census 2021 geographies - Office for National Statistics (ons.gov.uk)

4. Postcode to LSOA lookup tables: Postcode to 2021 Census Output Area to Lower Layer Super Output Area to Middle Layer Super Output Area to Local Authority District (August 2023) Lookup in the UK (statistics.gov.uk)

5. Legislation history: Legislation - Drinking Water Inspectorate (dwi.gov.uk)
Dataset (Covid-Bacterial-Viral-Normal-Emphysema)
kaggle.com
Updated Jun 13, 2024
Nhật Nguyễn Minh (2024). Dataset (Covid-Bacterial-Viral-Normal-Emphysema) [Dataset]. https://www.kaggle.com/datasets/minhnhat232/dataset-covid-bacterial-viral-normal-emphysema/discussion
Dataset updated
Jun 13, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Nhật Nguyễn Minh
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
The dataset contains lung X-ray images in the following classes:

Normal - 3,270 images

Covid-19 - 3,017 images

Viral-pneumonia - 3,013 images

Bacterial-pneumonia - 3,000 images

Emphysema - 2,550 images
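The class sizes listed above are close to balanced but not equal, which can matter when choosing metrics or a sampling strategy. The short sketch below only restates those counts and derives the total and per-class shares; the variable names are illustrative.

```python
# Image counts per class, as listed in the description above.
class_counts = {
    "Normal": 3270,
    "Covid-19": 3017,
    "Viral-pneumonia": 3013,
    "Bacterial-pneumonia": 3000,
    "Emphysema": 2550,
}

total = sum(class_counts.values())
shares = {name: count / total for name, count in class_counts.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print("total images:", total)
```

The counts sum to 14,850 images, with Normal the largest class and Emphysema the smallest.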

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F15315323%2F8041ddd2485bfe9cdf2ba1f9d96bd7e5%2F6_Class_Img.jpg?generation=1741951756137022&alt=media

The dataset we use is compiled from many reputable sources including:

Dataset 1 [1]: This dataset includes four classes of diseases: COVID-19, viral pneumonia, bacterial pneumonia, and normal. It has multiple versions, and we are currently using the latest version (version 4). Previous studies, such as those by Hariri et al. [18] and Ahmad et al. [20], have also utilized earlier versions of this dataset.

Dataset 2 [2]: This dataset is from the National Institutes of Health (NIH) Chest X-Ray Dataset, which contains over 100,000 chest X-ray images from over 30,000 patients. It includes 14 disease classes, including conditions like atelectasis, consolidation, and infiltration. For this study, we have selected 2,550 chest X-ray images specifically from the Emphysema class.

Dataset 3 [3]: This is the COVQU dataset, which we have extended to include two additional classes: COVID-19 and viral pneumonia. This dataset has been widely used in previous studies by M.E.H. Chowdhury et al. [4] and Rahman T et al. [5], establishing its reputation as a reliable resource.

In addition, we also publish a modified dataset that aims to remove image regions that do not contain lungs (abdomen, arms, etc.).

References:

[1] U. Sait, K. G. Lal, S. P. Prajapati, R. Bhaumik, T. Kumar, S. Shivakumar, K. Bhalla, Curated dataset for covid-19 posterior-anterior chest radiography images (x-rays), Mendeley Data V4 (2022). doi:10.17632/9xkhgts2s6.4.

[2] X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, R. M. Summers, Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases (2017) 3462–3471. doi:10.1109/CVPR.2017.369.

[3] A. M. Tahir, M. E. Chowdhury, A. Khandakar, T. Rahman, Y. Qiblawey, U. Khurshid, S. Kiranyaz, N. Ibtehaz, M. S. Rahman, S. Al-Maadeed, S. Mahmud, M. Ezeddin, K. Hameed, T. Hamid, Covid-19 infection localization and severity grading from chest x-ray images, Computers in Biology and Medicine 139 (2021) 105002. URL: https://www.sciencedirect.com/science/article/pii/S0010482521007964. doi:10.1016/j.compbiomed.2021.105002.

[4] M. E. Chowdhury, T. Rahman, A. Khandakar, R. Mazhar, M. A. Kadir, Z. B. Mahbub, K. R. Islam, M. S. Khan, A. Iqbal, N. A. Emadi, M. B. I. Reaz, M. T. Islam, Can ai help in screening viral and covid-19 pneumonia?, IEEE Access 8 (2020) 132665–132676. doi:10.1109/ACCESS.2020.3010287.

[5] T. Rahman, A. Khandakar, Y. Qiblawey, A. Tahir, S. Kiranyaz, S. B. A. Kashem, M. T. Islam, S. A. Maadeed, S. M. Zughaier, M. S. Khan, M. E. Chowdhury, Exploring the effect of image enhancement techniques on covid-19 detection using chest x-ray images, Computers in Biology and Medicine 132 (2021). doi:10.1016/j.compbiomed.2021.104319.
BIPED Dataset
kaggle.com
Updated Oct 27, 2021
Xavier Soria (2021). BIPED Dataset [Dataset]. https://www.kaggle.com/xavysp/biped/tasks
Dataset updated
Oct 27, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Xavier Soria
Description
Version 2 of BIPED has been released; please use BIPEDv2. It contains 250 outdoor images of 1280x720 pixels each. These images have been carefully annotated by experts in the computer vision field, hence no redundancy has been considered. In spite of that, all results have been cross-checked several times in order to correct possible mistakes or wrong edges made by a single annotator. This dataset is publicly available as a benchmark for evaluating edge detection algorithms. The generation of this dataset is motivated by the lack of edge detection datasets: there is just one dataset publicly available for the edge detection task, published in 2016 (MDBD: Multicue Dataset for Boundary Detection, the subset for edge detection). The level of detail of the edge-level annotations in BIPED's images can be appreciated by looking at the GT, see Figs above.

BIPED dataset has 250 images in high definition. Those images are already split up for training and testing. 200 for training and 50 for testing.

BIPED Augmentation

To augment the BIPED data for the DL training visit this repository

Citation

Please cite our dataset if you find it helpful:

@InProceedings{soria2020dexined,
  title={Dense Extreme Inception Network: Towards a Robust CNN Model for Edge Detection},
  author={Xavier Soria and Edgar Riba and Angel Sappa},
  booktitle={The IEEE Winter Conference on Applications of Computer Vision (WACV '20)},
  year={2020}
}

@article{SORIA2023BIPEDv2,
  title = {Dense extreme inception network for edge detection},
  journal = {Pattern Recognition},
  volume = {139},
  pages = {109461},
  year = {2023},
  issn = {0031-3203},
  doi = {https://doi.org/10.1016/j.patcog.2023.109461},
  url = {https://www.sciencedirect.com/science/article/pii/S0031320323001619},
  author = {Xavier Soria and Angel Sappa and Patricio Humanante and Arash Akbarinia},
  keywords = {Edge detection, Deep learning, CNN, Contour detection, Boundary detection, Segmentation}
}

License

This dataset is made freely available to academic and non-academic entities for non-commercial purposes such as academic research, teaching, scientific publications, or personal experimentation. Permission is granted to use the data given that you agree to our license terms. However, if any of our images infringe any privacy or rights, feel free to contact us and we will remove them immediately.

If you need more information, don't hesitate to contact me :)
High Resolution Residential Water Use Data in Cache County, Utah, USA
search.dataone.org
Updated Dec 5, 2021
Camilo J. Bastidas Pacheco; Nour Atallah; Jeffery S. Horsburgh (2021). High Resolution Residential Water Use Data in Cache County, Utah, USA [Dataset]. https://search.dataone.org/view/sha256%3A0f3764e6031308973489ad5fc20e8953635e46056cd6c63001cfd8e6f1a079b6
Dataset updated
Dec 5, 2021
Dataset provided by
Hydroshare
Authors
Camilo J. Bastidas Pacheco; Nour Atallah; Jeffery S. Horsburgh
Description
This dataset contains high resolution residential water use data for 31 residential homes located in Logan City and Providence City in Cache County, Utah, USA. Data were collected using a low-cost, open source monitoring device that was designed to operate on magnetically driven residential water meters. Data were recorded with a temporal frequency of 4 seconds and were collected for a period of at least two weeks during the summer when outdoor water use was active and two weeks during the winter when no outdoor water use was expected. The data were measured on the meter located on the water supply line to each home and represent a trace of the total water use for each residence. The dataset also includes secondary data about each of the residences at which data were collected. These data have been anonymized to remove any personally identifiable information from participants in this data collection effort.
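To give a sense of scale for the 4-second recording interval, the sketch below counts how many readings a single meter would produce over a two-week collection period. It assumes continuous logging with no gaps, which the description does not guarantee, and the function and constant names are illustrative only.

```python
SAMPLE_INTERVAL_S = 4            # recording interval stated in the description
SECONDS_PER_DAY = 24 * 60 * 60


def readings_for(days: int, interval_s: int = SAMPLE_INTERVAL_S) -> int:
    """Number of readings in `days` of gap-free logging at the given interval."""
    return days * SECONDS_PER_DAY // interval_s


two_weeks = readings_for(14)
print(two_weeks)
```

Under that assumption, two weeks of logging yields 302,400 readings per home for each seasonal collection period.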
Destination Sri Lanka
kaggle.com
Updated May 5, 2024
Kanchana1990 (2024). Destination Sri Lanka [Dataset]. http://doi.org/10.34740/kaggle/dsv/8321374
Unique identifier
https://doi.org/10.34740/kaggle/dsv/8321374
Dataset updated
May 5, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Kanchana1990
License
Open Data Commons Attribution License (ODC-By) v1.0 (https://www.opendatacommons.org/licenses/by/1.0/)
License information was derived automatically
Area covered
Sri Lanka
Description
Overview:

Dataset comprises nearly 5,000 accommodations across Sri Lanka, including villas, guest houses, homestays, hotels, and bungalows.

Captures a variety of data points such as town-based location, capacity, and ratings.

Data Science Applications:

Ideal for Market Rating Analysis.

Can be used for geographic data visualization and competitive analysis in the hospitality sector.

Ethically Mined Data:

Data has been ethically sourced and rigorously anonymized to remove any Personally Identifiable Information (PII).

Ensures compliance with data privacy standards.

Acknowledgements:

Special thanks to Airbnb for being a primary data source.

Additional data gathered through personal contacts, ensuring a comprehensive dataset.

Image Credits:

Dataset thumbnail on Kaggle created with DALL-E 3, showcasing iconic Sri Lankan imagery.
MineralImage5k
kaggle.com
Updated Feb 29, 2024
Sergey Nesteruk (2024). MineralImage5k [Dataset]. https://www.kaggle.com/datasets/sergeynesteruk/minerals/code
Dataset updated
Feb 29, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Sergey Nesteruk
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
License information was derived automatically
Description
This dataset consists of over 5,000 mineral varieties with the total of nearly 44,000 images. Some images are also accompanied by textual descriptions.

The dataset accompanies the paper https://www.sciencedirect.com/science/article/abs/pii/S0098300423001188

Images are processed to remove text tables and to zoom in on the minerals.

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3838650%2Fcd90450b88cab88d993acb7f80664545%2Fminerals_data_preprocessing_scheme.png?generation=1709238741202275&alt=media
mental_health_conversational_dataset
huggingface.co
Updated Aug 10, 2023
Zahrizhal Ali (2023). mental_health_conversational_dataset [Dataset]. https://huggingface.co/datasets/ZahrizhalAli/mental_health_conversational_dataset
Dataset updated
Aug 10, 2023
Authors
Zahrizhal Ali
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
CREDIT: Dataset Card for "heliosbrahma/mental_health_chatbot_dataset"

Dataset Description Dataset Summary

This dataset contains conversational pairs of questions and answers in a single text related to Mental Health. The dataset was curated from popular healthcare blogs like WebMD, Mayo Clinic and Healthline, online FAQs etc. All questions and answers have been anonymized to remove any PII data and pre-processed to remove any unwanted characters.

Languages… See the full description on the dataset page: https://huggingface.co/datasets/ZahrizhalAli/mental_health_conversational_dataset.
OpenEDS2020 Dataset
paperswithcode.com
opendatalab.com
Updated May 7, 2020
Cristina Palmero; Abhishek Sharma; Karsten Behrendt; Kapil Krishnakumar; Oleg V. Komogortsev; Sachin S. Talathi (2020). OpenEDS2020 Dataset [Dataset]. https://paperswithcode.com/dataset/openeds2020
Dataset updated
May 7, 2020
Authors
Cristina Palmero; Abhishek Sharma; Karsten Behrendt; Kapil Krishnakumar; Oleg V. Komogortsev; Sachin S. Talathi
Description
OpenEDS2020 is a dataset of eye-image sequences captured at a frame rate of 100 Hz under controlled illumination, using a virtual-reality head-mounted display fitted with two synchronized eye-facing cameras. The dataset, which is anonymized to remove any personally identifiable information on participants, consists of 80 participants of varied appearance performing several gaze-elicited tasks, and is divided into two subsets: 1) Gaze Prediction Dataset, with up to 66,560 sequences containing 550,400 eye-images and respective gaze vectors, created to foster research in spatio-temporal gaze estimation and prediction approaches; and 2) Eye Segmentation Dataset, consisting of 200 sequences sampled at 5 Hz, with up to 29,500 images, of which 5% contain a semantic segmentation label, devised to encourage the use of temporal information to propagate labels to contiguous frames.
ZOD-Mini-2D-Road-Scenes
huggingface.co
Updated Aug 8, 2024
8Bits (2024). ZOD-Mini-2D-Road-Scenes [Dataset]. https://huggingface.co/datasets/8bits-ai/ZOD-Mini-2D-Road-Scenes
Dataset updated
Aug 8, 2024
Dataset authored and provided by
8Bits
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
Description
ZOD-Mini-2D-Road-Scenes

The ZOD-Mini-2D-Road-Scenes dataset is derived from the Zenseact Open Dataset (ZOD), property of Zenseact AB (© 2022 Zenseact AB), and is licensed under the permissive CC BY-SA 4.0. Any public use, distribution, or display of this dataset must contain this entire notice:

For this dataset, Zenseact AB has taken all reasonable measures to remove all personally identifiable information, including faces and license plates. To the extent that you like to request… See the full description on the dataset page: https://huggingface.co/datasets/8bits-ai/ZOD-Mini-2D-Road-Scenes.
twitter_indonesia_sarcastic
huggingface.co
Updated Nov 19, 2024
Wilson Wongso (2024). twitter_indonesia_sarcastic [Dataset]. https://huggingface.co/datasets/w11wo/twitter_indonesia_sarcastic
Dataset updated
Nov 19, 2024
Authors
Wilson Wongso
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Area covered
Indonesia
Description
Twitter Indonesia Sarcastic

Twitter Indonesia Sarcastic is a dataset intended for sarcasm detection in the Indonesian language. This dataset is introduced in Khotijah et al. (2020), whereby Indonesian tweets are collected and labeled as either sarcastic or non-sarcastic. We took the raw data, and performed several cleaning procedures such as: sentence order re-reversal, deduplication with minHash LSH, PII masking to remove usernames, hashtags, emails, URLs, and finally a random… See the full description on the dataset page: https://huggingface.co/datasets/w11wo/twitter_indonesia_sarcastic.
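The PII masking step mentioned above could look roughly like the following sketch. This is not the authors' actual cleaning code: the regular expressions, the placeholder tokens (`<url>`, `<email>`, `<username>`, `<hashtag>`), the `mask_pii` helper and the sample tweet are all assumptions chosen for illustration.

```python
import re

# Illustrative patterns only. Order matters: emails are masked before
# usernames because an email address also contains "@".
PATTERNS = [
    (re.compile(r"https?://\S+"), "<url>"),
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "<email>"),
    (re.compile(r"@\w+"), "<username>"),
    (re.compile(r"#\w+"), "<hashtag>"),
]


def mask_pii(text: str) -> str:
    """Replace URLs, emails, usernames and hashtags with placeholder tokens."""
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return text


# Made-up example tweet.
masked = mask_pii("@budi cek https://example.com #promo, email ke budi@example.com")
print(masked)
```

Simple patterns like these will miss edge cases (for example, URLs without a scheme), so a production pipeline would likely need more careful rules.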
Ai4Privacy (2024). pii-masking-200k [Dataset]. http://doi.org/10.57967/hf/1532

12 scholarly articles cite this dataset (View in Google Scholar)
