18 datasets found
  1. h

    pii-masking-200k

    • huggingface.co
    Updated Apr 22, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ai4Privacy (2024). pii-masking-200k [Dataset]. http://doi.org/10.57967/hf/1532
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 22, 2024
    Dataset authored and provided by
    Ai4Privacy
    Description

    Ai4Privacy Community

    Join our community at https://discord.gg/FmzWshaaQT to help build open datasets for privacy masking.

      Purpose and Features
    

    Previous world's largest open dataset for privacy. Now it is pii-masking-300k The purpose of the dataset is to train models to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The example texts have 54 PII classes (types of sensitive data), targeting 229 discussion… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-200k.

  2. h

    pii-masking-65k

    • huggingface.co
    Updated Apr 5, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ai4Privacy (2024). pii-masking-65k [Dataset]. http://doi.org/10.57967/hf/2012
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 5, 2024
    Dataset authored and provided by
    Ai4Privacy
    Description

    Purpose and Features

    The purpose of the model and dataset is to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The model is a fine-tuned version of "Distilled BERT", a smaller and faster version of BERT. It was adapted for the task of token classification based on the largest to our knowledge open-source PII masking dataset, which we are releasing simultaneously. The model size is 62 million parameters. The original… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-65k.

  3. h

    pii-masking-300k

    • huggingface.co
    Updated Apr 4, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ai4Privacy (2024). pii-masking-300k [Dataset]. http://doi.org/10.57967/hf/1995
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 4, 2024
    Dataset authored and provided by
    Ai4Privacy
    License

    https://choosealicense.com/licenses/other/https://choosealicense.com/licenses/other/

    Description

    Purpose and Features

    🌍 World's largest open dataset for privacy masking 🌎 The dataset is useful to train and evaluate models to remove personally identifiable and sensitive information from text, especially in the context of AI assistants and LLMs. Key facts:

    OpenPII-220k text entries have 27 PII classes (types of sensitive data), targeting 749 discussion subjects / use cases split across education, health, and psychology. FinPII contains an additional ~20 types tailored to… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-300k.

  4. H

    open-pii-masking-500k-ai4privacy

    • dataverse.harvard.edu
    Updated Mar 17, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Michael Anthony (2025). open-pii-masking-500k-ai4privacy [Dataset]. http://doi.org/10.7910/DVN/4H11OA
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 17, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    Michael Anthony
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    🌍 World's largest open dataset for privacy masking 🌎 The dataset is useful to train and evaluate models to remove personally identifiable and sensitive information from text, especially in the context of AI assistants and LLMs. Task Showcase of Privacy Masking # Dataset Analytics 📊 - ai4privacy/open-pii-masking-500k-ai4privacy ## p5y Data Analytics - Total Entries: 580,227 - Total Tokens: 19,199,982 - Average Source Text Length: 17.37 words - Total PII Labels: 5,705,973 - Number of Unique PII Classes: 20 (Open PII Labelset) - Unique Identity Values: 704,215 --- ## Language Distribution Analytics Number of Unique Languages: 8 | Language | Count | Percentage | |--------------------|----------|------------| | English (en) 🇺🇸🇬🇧🇨🇦🇮🇳 | 150,693 | 25.97% | | French (fr) 🇫🇷🇨🇭🇨🇦 | 112,136 | 19.33% | | German (de) 🇩🇪🇨🇭 | 82,384 | 14.20% | | Spanish (es) 🇪🇸 🇲🇽 | 78,013 | 13.45% | | Italian (it) 🇮🇹🇨🇭 | 68,824 | 11.86% | | Dutch (nl) 🇳🇱 | 26,628 | 4.59% | | Hindi (hi)* 🇮🇳 | 33,963 | 5.85% | | Telugu (te)* 🇮🇳 | 27,586 | 4.75% | *these languages are in experimental stages --- ## Region Distribution Analytics Number of Unique Regions: 11 | Region | Count | Percentage | |-----------------------|----------|------------| | Switzerland (CH) 🇨🇭 | 112,531 | 19.39% | | India (IN) 🇮🇳 | 99,724 | 17.19% | | Canada (CA) 🇨🇦 | 74,733 | 12.88% | | Germany (DE) 🇩🇪 | 41,604 | 7.17% | | Spain (ES) 🇪🇸 | 39,557 | 6.82% | | Mexico (MX) 🇲🇽 | 38,456 | 6.63% | | France (FR) 🇫🇷 | 37,886 | 6.53% | | Great Britain (GB) 🇬🇧 | 37,092 | 6.39% | | United States (US) 🇺🇸 | 37,008 | 6.38% | | Italy (IT) 🇮🇹 | 35,008 | 6.03% | | Netherlands (NL) 🇳🇱 | 26,628 | 4.59% | --- ## Machine Learning Task Analytics | Split | Count | Percentage | |-------------|----------|------------| | Train | 464,150 | 79.99% | | Validate| 116,077 | 20.01% | --- # Usage Option 1: Python terminal pip install datasets python from datasets import load_dataset dataset = load_dataset("ai4privacy/open-pii-masking-500k-ai4privacy") # Compatible Machine Learning Tasks: - Tokenclassification. Check out a HuggingFace's guide on token classification. - ALBERT, BERT, BigBird, BioGpt, BLOOM, BROS, CamemBERT, CANINE, ConvBERT, Data2VecText, DeBERTa, DeBERTa-v2, DistilBERT, ELECTRA, ERNIE, ErnieM, ESM, Falcon, FlauBERT, FNet, Funnel Transformer, GPT-Sw3, OpenAI GPT-2, GPTBigCode, GPT Neo, GPT NeoX, I-BERT, LayoutLM, LayoutLMv2, LayoutLMv3, LiLT, Longformer, LUKE, MarkupLM, MEGA, Megatron-BERT, MobileBERT,...

  5. m

    The Impact of AI and ChatGPT on Bangladeshi University Students

    • data.mendeley.com
    Updated Jan 6, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Md Jhirul Islam (2025). The Impact of AI and ChatGPT on Bangladeshi University Students [Dataset]. http://doi.org/10.17632/zykphpvbr7.2
    Explore at:
    Dataset updated
    Jan 6, 2025
    Authors
    Md Jhirul Islam
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Bangladesh
    Description

    The data set records the perceptions of Bangladeshi university students on the influence that AI tools, especially ChatGPT, have on their academic practices, learning experiences, and problem-solving abilities. The varying role of AI in education, which covers common usage statistics, what AI does to our creative abilities, its impact on our learning, and whether it could invade our privacy. This dataset reveals perspective on how AI tools are changing education in the country and offering valuable information for researchers, educators, policymakers, to understand trends, challenges, and opportunities in the adoption of AI in the academic contex.

    Methodology Data Collection Method: Online survey using google from Participants: A total of 3,512 students from various Bangladeshi universities participated. Survey Questions:The survey included questions on demographic information, frequency of AI tool usage, perceived benefits, concerns regarding privacy, and impacts on creativity and learning.

    Sampling Technique: Random sampling of university students Data Collection Period: June 2024 to December 2024

    Privacy Compliance This dataset has been anonymized to remove any personally identifiable information (PII). It adheres to relevant privacy regulations to ensure the confidentiality of participants.

    For further inquiries, please contact: Name: Md Jhirul Islam, Daffodil International University Email: jhirul15-4063@diu.edu.bd Phone: 01316317573

  6. a

    Wessex Water Domestic Water Quality

    • arc-gis-hub-home-arcgishub.hub.arcgis.com
    Updated Jan 30, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    sophie.sherriff_wessex (2024). Wessex Water Domestic Water Quality [Dataset]. https://arc-gis-hub-home-arcgishub.hub.arcgis.com/datasets/acc078ffd7a44426998ebfa3f468e89f
    Explore at:
    Dataset updated
    Jan 30, 2024
    Dataset authored and provided by
    sophie.sherriff_wessex
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    OverviewWater companies in the UK are responsible for testing the quality of drinking water. This dataset contains the results of samples taken from the taps in domestic households to make sure they meet the standards set out by UK and European legislation. This data shows the location, date, and measured levels of determinands set out by the Drinking Water Inspectorate (DWI).Key Definitions  AggregationProcess involving summarising or grouping data to obtain a single or reduced set of information, often for analysis or reporting purposes  Anonymisation Anonymised data is a type of information sanitisation in which data anonymisation tools encrypt or remove personally identifiable information from datasets for the purpose of preserving a data subject's privacy Dataset Structured and organised collection of related elements, often stored digitally, used for analysis and interpretation in various fields.  Determinand A constituent or property of drinking water which can be determined or estimated. DWI Drinking Water Inspectorate, an organisation “providing independent reassurance that water supplies in England and Wales are safe and drinking water quality is acceptable to consumers.”  DWI Determinands Constituents or properties that are tested for when evaluating a sample for its quality as per the guidance of the DWI. For this dataset, only determinands with “point of compliance” as “customer taps” are included.   Granularity Data granularity is a measure of the level of detail in a data structure. In time-series data, for example, the granularity of measurement might be based on intervals of years, months, weeks, days, or hours ID Abbreviation for Identification that refers to any means of verifying the unique identifier assigned to each asset for the purposes of tracking, management, and maintenance.  LSOA Lower-Level Super Output Area is made up of small geographic areas used for statistical and administrative purposes by the Office for National Statistics. It is designed to have homogeneous populations in terms of population size, making them suitable for statistical analysis and reporting. Each LSOA is built from groups of contiguous Output Areas with an average of about 1,500 residents or 650 households allowing for granular data collection useful for analysis, planning and policy- making while ensuring privacy.  ONS Office for National Statistics  Open Data Triage The process carried out by a Data Custodian to determine if there is any evidence of sensitivities associated with Data Assets, their associated Metadata and Software Scripts used to process Data Assets if they are used as Open Data.  Sample A sample is a representative segment or portion of water taken from a larger whole for the purpose of analysing or testing to ensure compliance with safety and quality standards.  Schema Structure for organizing and handling data within a dataset, defining the attributes, their data types, and the relationships between different entities. It acts as a framework that ensures data integrity and consistency by specifying permissible data types and constraints for each attribute.  Units Standard measurements used to quantify and compare different physical quantities.  Water Quality The chemical, physical, biological, and radiological characteristics of water, typically in relation to its suitability for a specific purpose, such as drinking, swimming, or ecological health. It is determined by assessing a variety of parameters, including but not limited to pH, turbidity, microbial content, dissolved oxygen, presence of substances and temperature.Data HistoryData Origin  These samples were taken from customer taps. They were then analysed for water quality, and the results were uploaded to a database. This dataset is an extract from this database.Data Triage Considerations Granularity Is it useful to share results as averages or individual? We decided to share as individual results as the lowest level of granularity Anonymisation It is a requirement that this data cannot be used to identify a singular person or household. We discussed many options for aggregating the data to a specific geography to ensure this requirement is met. The following geographical aggregations were discussed: • Water Supply Zone (WSZ) - Limits interoperability with other datasets • Postcode – Some postcodes contain very few households and may not offer necessary anonymisation • Postal Sector – Deemed not granular enough in highly populated areas • Rounded Co-ordinates – Not a recognised standard and may cause overlapping areas • MSOA – Deemed not granular enough • LSOA – Agreed as a recognised standard appropriate for England and Wales • Data Zones – Agreed as a recognised standard appropriate for Scotland Data Triage Review Frequency Annually unless otherwise requested Publish FrequencyAnnuallyData Specifications • Each dataset will cover a year of samples in calendar year • This dataset will be published annually • Historical datasets will be published as far back as 2016 from the introduction of The Water Supply (Water Quality) Regulations 2016 • The determinands included in the dataset are as per the list that is required to be reported to the Drinking Water Inspectorate. • A small proportion of samples could not be allocated to an LSOA – these represented less than 0.1% of samples and were removed from the dataset in 2023. • The postcode to LSOA lookup table used for 2022 was not available when 2023 data was processed, see supplementary information for the lookup table applied to each calendar year of data. Context Many UK water companies provide a search tool on their websites where you can search for water quality in your area by postcode. The results of the search may identify the water supply zone that supplies the postcode searched. Water supply zones are not linked to LSOAs which means the results may differ to this dataset. Some sample results are influenced by internal plumbing and may not be representative of drinking water quality in the wider area. Some samples are tested on site and others are sent to scientific laboratories.Supplementary informationBelow is a curated selection of links for additional reading, which provide a deeper understanding of this dataset.   1. Drinking Water Inspectorate Standards and Regulations: https://www.dwi.gov.uk/drinking-water-standards-and-regulations/   2. LSOA (England and Wales) and Data Zone (Scotland): https://www.nrscotland.gov.uk/files/geography/2011-census/geography-bckground-info-comparison-of-thresholds.pdf   3. Description for LSOA boundaries by the ONS: https://www.ons.gov.uk/methodology/geography/ukgeographies/censusgeographies/census2021geographies4. Postcode to LSOA lookup tables (2022 calendar year data): https://geoportal.statistics.gov.uk/datasets/postcode-to-2021-census-output-area-to-lower-layer-super-output-area-to-middle-layer-super-output-area-to-local-authority-district-august-2023-lookup-in-the-uk/about   5. Postcode to LSOA lookup tables (2023 calendar year data):  https://geoportal.statistics.gov.uk/datasets/b8451168e985446eb8269328615dec62/about6. Legislation history: https://www.dwi.gov.uk/water-companies/legislation/

  7. Dataset for the paper "Observation of Acceleration and Deceleration Periods...

    • zenodo.org
    Updated Mar 26, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Yide Qian; Yide Qian (2025). Dataset for the paper "Observation of Acceleration and Deceleration Periods at Pine Island Ice Shelf from 1997–2023 " [Dataset]. http://doi.org/10.5281/zenodo.15022854
    Explore at:
    Dataset updated
    Mar 26, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Yide Qian; Yide Qian
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Pine Island Glacier
    Description

    Dataset and codes for "Observation of Acceleration and Deceleration Periods at Pine Island Ice Shelf from 1997–2023 "

    • Description of the data and file structure

    The MATLAB codes and related datasets are used for generating the figures for the paper "Observation of Acceleration and Deceleration Periods at Pine Island Ice Shelf from 1997–2023".

    Files and variables

    File 1: Data_and_Code.zip

    Directory: Main_function

    **Description:****Include MATLAB scripts and functions. Each script include discriptions that guide the user how to used it and how to find the dataset that used for processing.

    MATLAB Main Scripts: Include the whole steps to process the data, output figures, and output videos.

    Script_1_Ice_velocity_process_flow.m

    Script_2_strain_rate_process_flow.m

    Script_3_DROT_grounding_line_extraction.m

    Script_4_Read_ICESat2_h5_files.m

    Script_5_Extraction_results.m

    MATLAB functions: Five Files that includes MATLAB functions that support the main script:

    1_Ice_velocity_code: Include MATLAB functions related to ice velocity post-processing, includes remove outliers, filter, correct for atmospheric and tidal effect, inverse weited averaged, and error estimate.

    2_strain_rate: Include MATLAB functions related to strain rate calculation.

    3_DROT_extract_grounding_line_code: Include MATLAB functions related to convert range offset results output from GAMMA to differential vertical displacement and used the result extract grounding line.

    4_Extract_data_from_2D_result: Include MATLAB functions that used for extract profiles from 2D data.

    5_NeRD_Damage_detection: Modified code fom Izeboud et al. 2023. When apply this code please also cite Izeboud et al. 2023 (https://www.sciencedirect.com/science/article/pii/S0034425722004655).

    6_Figure_plotting_code:Include MATLAB functions related to Figures in the paper and support information.

    Director: data_and_result

    Description:**Include directories that store the results output from MATLAB. user only neeed to modify the path in MATLAB script to their own path.

    1_origin : Sample data ("PS-20180323-20180329", “PS-20180329-20180404”, “PS-20180404-20180410”) output from GAMMA software in Geotiff format that can be used to calculate DROT and velocity. Includes displacment, theta, phi, and ccp.

    2_maskccpN: Remove outliers by ccp < 0.05 and change displacement to velocity (m/day).

    3_rockpoint: Extract velocities at non-moving region

    4_constant_detrend: removed orbit error

    5_Tidal_correction: remove atmospheric and tidal induced error

    6_rockpoint: Extract non-aggregated velocities at non-moving region

    6_vx_vy_v: trasform velocities from va/vr to vx/vy

    7_rockpoint: Extract aggregated velocities at non-moving region

    7_vx_vy_v_aggregate_and_error_estimate: inverse weighted average of three ice velocity maps and calculate the error maps

    8_strain_rate: calculated strain rate from aggregate ice velocity

    9_compare: store the results before and after tidal correction and aggregation.

    10_Block_result: times series results that extrac from 2D data.

    11_MALAB_output_png_result: Store .png files and time serties result

    12_DROT: Differential Range Offset Tracking results

    13_ICESat_2: ICESat_2 .h5 files and .mat files can put here (in this file only include the samples from tracks 0965 and 1094)

    14_MODIS_images: you can store MODIS images here

    shp: grounding line, rock region, ice front, and other shape files.

    File 2 : PIG_front_1947_2023.zip

    Includes Ice front positions shape files from 1947 to 2023, which used for plotting figure.1 in the paper.

    File 3 : PIG_DROT_GL_2016_2021.zip

    Includes grounding line positions shape files from 1947 to 2023, which used for plotting figure.1 in the paper.

    Data was derived from the following sources:
    Those links can be found in MATLAB scripts or in the paper "**Open Research" **section.

  8. e

    Flash Eurobarometer 225 (Data Protection - General Public) - Dataset -...

    • b2find.eudat.eu
    Updated Jul 23, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    The citation is currently not available for this dataset.
    Explore at:
    Dataset updated
    Jul 23, 2025
    Description

    Kenntnisse über Datenschutzgesetze und Kenntnis der unabhängigen Datenschutzbehörde. Kenntnisse über den Schutz persönlicher Daten. Themen: Interesse am Schutz persönlicher Daten, die in privaten und öffentlichen Organisationen gespeichert werden; Vertrauen in ausgewählte Institutionen im eigenen Lande bezüglich des Datenschutzes; Meinung zum Schutz persönlicher Daten: ausreichender Datenschutz im eigenen Land, Einschätzung des allgemeinen Bewusstseins über den Schutz persönlicher Daten, Beunruhigung über das Hinterlassen persönlicher Daten im Internet, Vertrauen in die Datenschutzgesetzgebung; Kenntnis der Datenschutzbehörde und deren Aufgaben: Annahme von Beschwerden von Privatpersonen, Verhängen von Sanktionen, eigene Kontaktaufnahme zu dieser Behörde; Kenntnisse der Pflichten von datenhaltenden Organisationen gegenüber dem Befragten; Kenntnistest der Rechte des Befragten hinsichtlich der Verwendung seiner persönlichen Daten: erforderliche Zustimmung, Widerspruchsrecht, Auskunftsrecht, Recht auf Korrektur oder Löschung von Daten, Rechtsmittel gegen Verstöße, Schadensersatzforderung bei ungesetzlicher Verwendung; Meinung über Übertragungssicherheit von Daten im Internet; Kenntnis über Technologien, die die Sammlung persönlicher Daten vom eigenen Computer einschränken (Cookies, Firewall); Verwendung dieser Technologien; Gründe für eine Nichtnutzung; Einstellung zur Überwachung von: Telefongesprächen, Internetnutzung, Kreditkartennutzung und Daten von Flugpassagieren zur Terrorismusbekämpfung (Split: umgedrehte Antwortvorgaben); Kenntnis über Verbot der Weitergabe persönlicher Daten an Nicht-EU-Länder, mit unzureichendem Datenschutz; Kenntnis über strengere Datenschutzregelungen für empfindliche Daten. Demographie: Geschlecht; Alter; Alter bei Beendigung der Ausbildung; Beruf; berufliche Stellung; Urbanisierungsgrad; Haushaltszusammensetzung und Haushaltsgröße; Besitz eines Mobiltelefons; Festnetztelefon im Haushalt. Zusätzlich verkodet wurde: Befragten-ID; Interviewsprache; Interviewer-ID; Land; Interviewdatum; Interviewdauer (Interviewbeginn und Interviewende); Interviewmodus (Mobiltelefon oder Festnetz); Region; Gewichtungsfaktor. Attitudes towards the protection of personal data. Topics: concern with regard to the protection of personal information by private and public organisations; trust in the following institutions regarding the use of personal information in a proper way: travel companies, medical services, insurance companies, credit card companies, financial institutions, employers, police, social security, tax authorities, local authorities, credit reference agencies, mail order companies, non-profit organisations, market and opinion research companies; attitude towards the following statements on the protection of personal data in the own country: is properly protected, low awareness of people on the subject, worry about leaving personal information on the internet, appropriate legislation to cope with growing number of personal information on the internet; awareness of the national authority to monitor the application of data protection laws; responsibility of the national authority to hear individuals; ability of the authority to pose sanctions; personal contact to authority; awareness of the obligation of data collectors to provide information on identity, purpose, and further data sharing; knowledge test concerning the storage of personal data: need for personal consent with regard to the use of personal information, right to oppose the use, legal assurance to access personal data, right to correct or remove data, national laws allow access to courts to seek remedies for breaches of data protection laws, right for compensation caused by unlawful use of personal data; assessment of the security of transmitting personal data over the internet; awareness of technologies to limit the collection of personal data from personal computer; use of these technologies; reasons for not using; attitude towards selected measures to fight international terrorism: monitor telephone calls, monitor internet use, monitor credit card use, monitor flight passenger data; awareness of the assurance that personal data of EU citizens can only be transferred outside the EU to countries which ensure an adequate level or protection; awareness of stricter data protection rules applied for sensitive data. Demography: sex; age; age at end of education; occupation; professional position; type of community; household composition and household size; own a mobile phone and fixed (landline) phone. Additionally coded was: respondent ID; language of the interview; interviewer ID; country; date of interview; time of the beginning of the interview; duration of the interview; type of phone line; region; weighting factor.

  9. s

    Portsmouth Water Drinking Water Quality Data 2022, 2023 & 2024

    • streamwaterdata.co.uk
    Updated May 22, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    AHughes_Portsmouth (2024). Portsmouth Water Drinking Water Quality Data 2022, 2023 & 2024 [Dataset]. https://www.streamwaterdata.co.uk/datasets/b9b8d038ae70461386f5cab102adbbb9
    Explore at:
    Dataset updated
    May 22, 2024
    Dataset authored and provided by
    AHughes_Portsmouth
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    Water companies in the UK are responsible for testing the quality of drinking water. This dataset contains the results of samples taken from the taps in domestic households to make sure they meet the standards set out by UK and European legislation. This data shows the location, date, and measured levels of determinands set out by the Drinking Water Inspectorate (DWI).

    Key Definitions

    Aggregation

    Process involving summarizing or grouping data to obtain a single or reduced set of information, often for analysis or reporting purposes

    Anonymisation

    Anonymised data is a type of information sanitization in which data anonymisation tools encrypt or remove personally identifiable information from datasets for the purpose of preserving a data subject's privacy

    Dataset

    Structured and organized collection of related elements, often stored digitally, used for analysis and interpretation in various fields.

    Determinand

    A constituent or property of drinking water which can be determined or estimated.

    DWI

    Drinking Water Inspectorate, an organisation “providing independent reassurance that water supplies in England and Wales are safe and drinking water quality is acceptable to consumers.”

    DWI Determinands

    Constituents or properties that are tested for when evaluating a sample for its quality as per the guidance of the DWI. For this dataset, only determinands with “point of compliance” as “customer taps” are included.

    Granularity

    Data granularity is a measure of the level of detail in a data structure. In time-series data, for example, the granularity of measurement might be based on intervals of years, months, weeks, days, or hours

    ID

    Abbreviation for Identification that refers to any means of verifying the unique identifier assigned to each asset for the purposes of tracking, management, and maintenance.

    LSOA

    Lower-Level Super Output Area is made up of small geographic areas used for statistical and administrative purposes by the Office for National Statistics. It is designed to have homogeneous populations in terms of population size, making them suitable for statistical analysis and reporting. Each LSOA is built from groups of contiguous Output Areas with an average of about 1,500 residents or 650 households allowing for granular data collection useful for analysis, planning and policy- making while ensuring privacy.

    ONS

    Office for National Statistics

    Open Data Triage

    The process carried out by a Data Custodian to determine if there is any evidence of sensitivities associated with Data Assets, their associated Metadata and Software Scripts used to process Data Assets if they are used as Open Data. <

    Sample

    A sample is a representative segment or portion of water taken from a larger whole for the purpose of analysing or testing to ensure compliance with safety and quality standards.

    Schema

    Structure for organizing and handling data within a dataset, defining the attributes, their data types, and the relationships between different entities. It acts as a framework that ensures data integrity and consistency by specifying permissible data types and constraints for each attribute.

    Units

    Standard measurements used to quantify and compare different physical quantities.

    Water Quality

    The chemical, physical, biological, and radiological characteristics of water, typically in relation to its suitability for a specific purpose, such as drinking, swimming, or ecological health. It is determined by assessing a variety of parameters, including but not limited to pH, turbidity, microbial content, dissolved oxygen, presence of substances and temperature.

    Data History

    Data Origin

    These samples were taken from customer taps. They were then analysed for water quality, and the results were uploaded to a database. This dataset is an extract from this database.

    Data Triage Considerations

    Granularity

    Is it useful to share results as averages or individual?

    We decided to share as individual results as the lowest level of granularity

    Anonymisation

    It is a requirement that this data cannot be used to identify a singular person or household. We discussed many options for aggregating the data to a specific geography to ensure this requirement is met. The following geographical aggregations were discussed:

    <!--·
    Water Supply Zone (WSZ) - Limits interoperability with other datasets

    <!--·
    Postcode – Some postcodes contain very few households and may not offer necessary anonymisation

    <!--·
    Postal Sector – Deemed not granular enough in highly populated areas

    <!--·
    Rounded Co-ordinates – Not a recognised standard and may cause overlapping areas

    <!--·
    MSOA – Deemed not granular enough

    <!--·
    LSOA – Agreed as a recognised standard appropriate for England and Wales

    <!--·
    Data Zones – Agreed as a recognised standard appropriate for Scotland

    Data Specifications

    Each dataset will cover a calendar year of samples

    This dataset will be published annually

    Historical datasets will be published as far back as 2016 from the introduction of of The Water Supply (Water Quality) Regulations 2016

    The Determinands included in the dataset are as per the list that is required to be reported to the Drinking Water Inspectorate.

    Context

    Many UK water companies provide a search tool on their websites where you can search for water quality in your area by postcode. The results of the search may identify the water supply zone that supplies the postcode searched. Water supply zones are not linked to LSOAs which means the results may differ to this dataset

    Some sample results are influenced by internal plumbing and may not be representative of drinking water quality in the wider area.

    Some samples are tested on site and others are sent to scientific laboratories.

    Data Publish Frequency

    Annually

    Data Triage Review Frequency

    Annually unless otherwise requested

    Supplementary information

    Below is a curated selection of links for additional reading, which provide a deeper understanding of this dataset.

    <!--1.
    Drinking Water Inspectorate Standards and Regulations:

    <!--2.
    https://www.dwi.gov.uk/drinking-water-standards-and-regulations/

    <!--3.
    LSOA (England and Wales) and Data Zone (Scotland):

    <!--4. https://www.nrscotland.gov.uk/files/geography/2011-census/geography-bckground-info-comparison-of-thresholds.pdf

    <!--5.
    Description for LSOA boundaries by the ONS: Census 2021 geographies - Office for National Statistics (ons.gov.uk)

    <!--[6.
    Postcode to LSOA lookup tables: Postcode to 2021 Census Output Area to Lower Layer Super Output Area to Middle Layer Super Output Area to Local Authority District (August 2023) Lookup in the UK (statistics.gov.uk)

    <!--7.
    Legislation history: Legislation - Drinking Water Inspectorate (dwi.gov.uk)

  10. Dataset (Covid-Bacterial-Viral-Normal-Emphysema)

    • kaggle.com
    Updated Jun 13, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Nhật Nguyễn Minh (2024). Dataset (Covid-Bacterial-Viral-Normal-Emphysema) [Dataset]. https://www.kaggle.com/datasets/minhnhat232/dataset-covid-bacterial-viral-normal-emphysema/discussion
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 13, 2024
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Nhật Nguyễn Minh
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The dataset contain lung x-ray image including:

    1. Normal - 3,270 images
    2. Covid-19 - 3,017 images
    3. Viral-pneumonia - 3,013 images
    4. Bacterial-pneumonia - 3,000 images
    5. Emphysema - 2,550 images

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F15315323%2F8041ddd2485bfe9cdf2ba1f9d96bd7e5%2F6_Class_Img.jpg?generation=1741951756137022&alt=media" alt="">

    The dataset we use is compiled from many reputable sources including: Dataset 1 [1]: This dataset includes four classes of diseases: COVID-19, viral pneumonia, bacterial pneumonia, and normal. It has multiple versions, and we are currently using the latest version (version 4). Previous studies, such as those by Hariri et al. [18] and Ahmad et al. [20], have also utilized earlier versions of this dataset. Dataset 2 [2]: This dataset is from the National Institutes of Health (NIH) Chest X-Ray Dataset, which contains over 100,000 chest X-ray images from over 30,000 patients. It includes 14 disease classes, including conditions like atelectasis, consolidation, and infiltration. For this study, we have selected 2,550 chest X-ray images specifically from the Emphysema class. Dataset 3 [3]: This is the COVQU dataset, which we have extended to include two additional classes: COVID-19 and viral pneumonia. This dataset has been widely used in previous studies by M.E.H. Chowdhury et al. [4] and Rahman T et al. [5], establishing its reputation as a reliable resource.

    In addition, we also publish a modified dataset that aims to remove image regions that do not contain lungs (abdomen, arms, etc.).

    References: [1] U. Sait, K. G. Lal, S. P. Prajapati, R. Bhaumik, T. Kumar, S. Shivakumar, K. Bhalla, Curated dataset for covid-19 posterior-anterior chest radiography images (x-rays)., Mendeley Data V4 (2022). doi:10.17632/9xkhgts2s6.4. [2] X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, R. M. Summers, Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases (2017) 3462–3471. doi:10.1109/CVPR.2017.369. [3] A. M. Tahir, M. E. Chowdhury, A. Khandakar, T. Rahman, Y. Qiblawey, U. Khurshid, S. Kiranyaz, N. Ibtehaz, M. S. Rahman, S. Al-Maadeed,S. Mahmud, M. Ezeddin, K. Hameed, T. Hamid, Covid-19 infection localization and severity grading from chest x-ray images, Computers in Biology and Medicine 139 (2021) 105002. URL: https://www.sciencedirect.com/science/article/pii/S0010482521007964. doi:https://doi.org/10.1016/j.compbiomed.2021.105002. [4] M. E. Chowdhury, T. Rahman, A. Khandakar, R. Mazhar, M. A. Kadir, Z. B. Mahbub, K. R. Islam, M. S. Khan, A. Iqbal, N. A. Emadi, M. B. I. Reaz, M. T. Islam, Can ai help in screening viral and covid-19 pneumonia?, IEEE Access 8 (2020) 132665–132676. doi:10.1109/ACCESS.2020.3010287. [5] T. Rahman, A. Khandakar, Y. Qiblawey, A. Tahir, S. Kiranyaz, S. B. A. Kashem, M. T. Islam, S. A. Maadeed, S. M. Zughaier, M. S. Khan, M. E. Chowdhury, Exploring the effect of image enhancement techniques on covid-19 detection using chest x-ray images, Computers in Biology and Medicine 132 (2021). doi:10.1016/j.compbiomed.2021.104319.

  11. BIPED Dataset

    • kaggle.com
    Updated Oct 27, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Xavier Soria (2021). BIPED Dataset [Dataset]. https://www.kaggle.com/xavysp/biped/tasks
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 27, 2021
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Xavier Soria
    Description

    ** Version 2 of BIPED have been released, please use BIPEDv2** It contains 250 outdoor images of 1280x720 pixels each. These images have been carefully annotated by experts on the computer vision field, hence no redundancy has been considered. In spite of that, all results have been cross-checked several times in order to correct possible mistakes or wrong edges by just one subject. This dataset is publicly available as a benchmark for evaluating edge detection algorithms. The generation of this dataset is motivated by the lack of edge detection datasets, actually, there is just one dataset publicly available for the edge detection task published in 2016 (MDBD: Multicue Dataset for Boundary Detection---the subset for edge detection). The level of details of the edge level annotations in the BIPED's images can be appreciated looking at the GT, see Figs above.

    BIPED dataset has 250 images in high definition. Those images are already split up for training and testing. 200 for training and 50 for testing.

    BIPED Augmentation

    To augment the BIPED data for the DL training visit this repository

    Citation

    Please cite our dataset if you find helpful,

    @InProceedings{soria2020dexined,
      title={Dense Extreme Inception Network: Towards a Robust CNN Model for Edge Detection},
      author={Xavier Soria and Edgar Riba and Angel Sappa},
      booktitle={The IEEE Winter Conference on Applications of Computer Vision (WACV '20)},
      year={2020}
    }
    
    @article{SORIA2023BIPEDv2,
    title = {Dense extreme inception network for edge detection},
    journal = {Pattern Recognition},
    volume = {139},
    pages = {109461},
    year = {2023},
    issn = {0031-3203},
    doi = {https://doi.org/10.1016/j.patcog.2023.109461},
    url = {https://www.sciencedirect.com/science/article/pii/S0031320323001619},
    author = {Xavier Soria and Angel Sappa and Patricio Humanante and Arash Akbarinia},
    keywords = {Edge detection, Deep learning, CNN, Contour detection, Boundary detection, Segmentation}
    }
    

    License

    This Dataset is made freely available to academic and non-academic entities for non-commercial purposes such as academic research, teaching, scientific publications, or personal experimentation. Permission is granted to use the data given that you agree to our license terms. However, if any of our images are infringing any privacy or rights, feel free to contact us and we will remove immediately.

    If you need more information, Dont hesitate and contact me :)

  12. d

    High Resolution Residential Water Use Data in Cache County, Utah, USA

    • search.dataone.org
    Updated Dec 5, 2021
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Camilo J. Bastidas Pacheco; Nour Atallah; Jeffery S. Horsburgh (2021). High Resolution Residential Water Use Data in Cache County, Utah, USA [Dataset]. https://search.dataone.org/view/sha256%3A0f3764e6031308973489ad5fc20e8953635e46056cd6c63001cfd8e6f1a079b6
    Explore at:
    Dataset updated
    Dec 5, 2021
    Dataset provided by
    Hydroshare
    Authors
    Camilo J. Bastidas Pacheco; Nour Atallah; Jeffery S. Horsburgh
    Area covered
    Description

    This dataset contains high resolution residential water use data for 31 residential homes located in Logan City and Providence City in Cache County, Utah, USA. Data were collected using a low-cost, open source monitoring device that was designed to operate on magnetically driven residential water meters. Data were recorded with a temporal frequency of 4 seconds and were collected for a period of at least two weeks during the summer when outdoor water use was active and two weeks during the winter when no outdoor water use was expected. The data were measured on the meter located on the water supply line to each home and represent a trace of the total water use for each residence. The dataset also includes secondary data about each of the residences at which data were collected. These data have been anonymized to remove any personally identifiable information from participants in this data collection effort.

  13. Destination Sri Lanka

    • kaggle.com
    Updated May 5, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Kanchana1990 (2024). Destination Sri Lanka [Dataset]. http://doi.org/10.34740/kaggle/dsv/8321374
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 5, 2024
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Kanchana1990
    License

    Open Data Commons Attribution License (ODC-By) v1.0https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Area covered
    Sri Lanka
    Description

    Overview:

    • Dataset comprises nearly 5,000 accommodations across Sri Lanka, including villas, guest houses, homestays, hotels, and bungalows.
    • Captures a variety of data points like town based location, capacity, and ratings.

    Data Science Applications:

    • Ideal for Market Rating Analysis.
    • Can be used for geographic data visualization and competitive analysis in the hospitality sector.

    Ethically Mined Data:

    • Data has been ethically sourced and rigorously anonymized to remove any Personally Identifiable Information (PII).
    • Ensures compliance with data privacy standards.

    Acknowledgements:

    • Special thanks to Airbnb for being a primary data source.
    • Additional data gathered through personal contacts, ensuring a comprehensive dataset.

    Image Credits:

    • Dataset thumbnail on Kaggle created with DALL-E 3, showcasing iconic Sri Lankan imagery.
  14. MineralImage5k

    • kaggle.com
    Updated Feb 29, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Sergey Nesteruk (2024). MineralImage5k [Dataset]. https://www.kaggle.com/datasets/sergeynesteruk/minerals/code
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 29, 2024
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Sergey Nesteruk
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This dataset consists of over 5,000 mineral varieties with the total of nearly 44,000 images. Some images are also accompanied by textual descriptions.

    The dataset is with the paper https://www.sciencedirect.com/science/article/abs/pii/S0098300423001188

    Images are processed to remove text tables, and to zoom in the minerals.

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3838650%2Fcd90450b88cab88d993acb7f80664545%2Fminerals_data_preprocessing_scheme.png?generation=1709238741202275&alt=media" alt="">

  15. h

    mental_health_conversational_dataset

    • huggingface.co
    Updated Aug 10, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Zahrizhal Ali (2023). mental_health_conversational_dataset [Dataset]. https://huggingface.co/datasets/ZahrizhalAli/mental_health_conversational_dataset
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 10, 2023
    Authors
    Zahrizhal Ali
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    CREDIT: Dataset Card for "heliosbrahma/mental_health_chatbot_dataset"

      Dataset Description
    
    
    
    
    
      Dataset Summary
    

    This dataset contains conversational pair of questions and answers in a single text related to Mental Health. Dataset was curated from popular healthcare blogs like WebMD, Mayo Clinic and HeatlhLine, online FAQs etc. All questions and answers have been anonymized to remove any PII data and pre-processed to remove any unwanted characters.

      Languages… See the full description on the dataset page: https://huggingface.co/datasets/ZahrizhalAli/mental_health_conversational_dataset.
    
  16. P

    OpenEDS2020 Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated May 7, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Cristina Palmero; Abhishek Sharma; Karsten Behrendt; Kapil Krishnakumar; Oleg V. Komogortsev; Sachin S. Talathi (2020). OpenEDS2020 Dataset [Dataset]. https://paperswithcode.com/dataset/openeds2020
    Explore at:
    Dataset updated
    May 7, 2020
    Authors
    Cristina Palmero; Abhishek Sharma; Karsten Behrendt; Kapil Krishnakumar; Oleg V. Komogortsev; Sachin S. Talathi
    Description

    OpenEDS2020 is a dataset of eye-image sequences captured at a frame rate of 100 Hz under controlled illumination, using a virtual-reality head-mounted display mounted with two synchronized eye-facing cameras. The dataset, which is anonymized to remove any personally identifiable information on participants, consists of 80 participants of varied appearance performing several gaze-elicited tasks, and is divided in two subsets: 1) Gaze Prediction Dataset, with up to 66,560 sequences containing 550,400 eye-images and respective gaze vectors, created to foster research in spatio-temporal gaze estimation and prediction approaches; and 2) Eye Segmentation Dataset, consisting of 200 sequences sampled at 5 Hz, with up to 29,500 images, of which 5% contain a semantic segmentation label, devised to encourage the use of temporal information to propagate labels to contiguous frames.

  17. h

    ZOD-Mini-2D-Road-Scenes

    • huggingface.co
    Updated Aug 8, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    8Bits (2024). ZOD-Mini-2D-Road-Scenes [Dataset]. https://huggingface.co/datasets/8bits-ai/ZOD-Mini-2D-Road-Scenes
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 8, 2024
    Dataset authored and provided by
    8Bits
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    ZOD-Mini-2D-Road-Scenes

    The ZOD-Mini-2D-Road-Scenes dataset is derived from the Zenseact Open Dataset (ZOD), property of Zenseact AB (© 2022 Zenseact AB), and is licensed under the permissive CC BY-SA 4.0. Any public use, distribution, or display of this dataset must contain this entire notice:

    For this dataset, Zenseact AB has taken all reasonable measures to remove all personally identifiable information, including faces and license plates. To the extent that you like to request… See the full description on the dataset page: https://huggingface.co/datasets/8bits-ai/ZOD-Mini-2D-Road-Scenes.

  18. h

    twitter_indonesia_sarcastic

    • huggingface.co
    Updated Nov 19, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Wilson Wongso (2024). twitter_indonesia_sarcastic [Dataset]. https://huggingface.co/datasets/w11wo/twitter_indonesia_sarcastic
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 19, 2024
    Authors
    Wilson Wongso
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Area covered
    Indonesia
    Description

    Twitter Indonesia Sarcastic

    Twitter Indonesia Sarcastic is a dataset intended for sarcasm detection in the Indonesian language. This dataset is introduced in Khotijah et al. (2020), whereby Indonesian tweets are collected and labeled as either sarcastic or non-sarcastic. We took the raw data, and performed several cleaning procedures such as: sentence order re-reversal, deduplication with minHash LSH, PII masking to remove usernames, hashtags, emails, URLs, and finally a random… See the full description on the dataset page: https://huggingface.co/datasets/w11wo/twitter_indonesia_sarcastic.

  19. Not seeing a result you expected?
    Learn how you can add new datasets to our index.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Ai4Privacy (2024). pii-masking-200k [Dataset]. http://doi.org/10.57967/hf/1532

pii-masking-200k

Ai4Privacy PII200k Dataset

ai4privacy/pii-masking-200k

Explore at:
12 scholarly articles cite this dataset (View in Google Scholar)
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 22, 2024
Dataset authored and provided by
Ai4Privacy
Description

Ai4Privacy Community

Join our community at https://discord.gg/FmzWshaaQT to help build open datasets for privacy masking.

  Purpose and Features

Previous world's largest open dataset for privacy. Now it is pii-masking-300k The purpose of the dataset is to train models to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The example texts have 54 PII classes (types of sensitive data), targeting 229 discussion… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-200k.

Search
Clear search
Close search
Google apps
Main menu