28 datasets found

Privacy Preserving Distributed Data Mining
data.staging.idas-ds1.appdat.jsc.nasa.gov
datadiscoverystudio.org
+2more
Updated Feb 18, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
data.staging.idas-ds1.appdat.jsc.nasa.gov (2025). Privacy Preserving Distributed Data Mining [Dataset]. https://data.staging.idas-ds1.appdat.jsc.nasa.gov/dataset/privacy-preserving-distributed-data-mining
Explore at:
Dataset updated
Feb 18, 2025
Dataset provided by
NASAhttp://nasa.gov/
Description
Distributed data mining from privacy-sensitive multi-party data is likely to play an important role in the next generation of integrated vehicle health monitoring systems. For example, consider an airline manufacturer [tex]$\mathcal{C}$[/tex] manufacturing an aircraft model [tex]$A$[/tex] and selling it to five different airline operating companies [tex]$\mathcal{V}_1 \dots \mathcal{V}_5$[/tex]. These aircrafts, during their operation, generate huge amount of data. Mining this data can reveal useful information regarding the health and operability of the aircraft which can be useful for disaster management and prediction of efficient operating regimes. Now if the manufacturer [tex]$\mathcal{C}$[/tex] wants to analyze the performance data collected from different aircrafts of model-type [tex]$A$[/tex] belonging to different airlines then central collection of data for subsequent analysis may not be an option. It should be noted that the result of this analysis may be statistically more significant if the data for aircraft model [tex]$A$[/tex] across all companies were available to [tex]$\mathcal{C}$[/tex]. The potential problems arising out of such a data mining scenario are:
t
Privacy-Sensitive Conversations between Care Workers and Care Home Residents...
test.researchdata.tuwien.ac.at
researchdata.tuwien.ac.at
+1more
bin, text/markdown
Updated Dec 6, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Reinhard Grabler; Michael Starzinger; Matthias Hirschmanner; Matthias Hirschmanner; Helena Anna Frijns; Helena Anna Frijns; Reinhard Grabler; Michael Starzinger; Reinhard Grabler; Michael Starzinger; Reinhard Grabler; Michael Starzinger (2024). Privacy-Sensitive Conversations between Care Workers and Care Home Residents in a Residential Care Home [Dataset]. http://doi.org/10.70124/hbtq5-ykv92
Explore at:
bin, text/markdownAvailable download formats
Unique identifier
https://doi.org/10.70124/hbtq5-ykv92
Dataset updated
Dec 6, 2024
Dataset provided by
TU Wien
Authors
Reinhard Grabler; Michael Starzinger; Matthias Hirschmanner; Matthias Hirschmanner; Helena Anna Frijns; Helena Anna Frijns; Reinhard Grabler; Michael Starzinger; Reinhard Grabler; Michael Starzinger; Reinhard Grabler; Michael Starzinger
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Apr 2024 - Aug 2024
Description
Dataset Card for "privacy-care-interactions"

Table of Contents

Dataset Description

Purpose and Features

Dataset Overview

Language Distribution

Locale Distribution

Key Facts

Dataset Structure

Data Instances

Data Fields

Data Splits

Dataset Creation

Curation Rationale

Source Data

Annotations

Personal and Sensitive Information

Considerations for Using the Data

Social Impact of Dataset

Discussion of Biases

Other Known Limitations

Additional Information

Dataset Curators

Licensing Information

Citation Information

Contributions

Dataset Description

Purpose and Features

🔒 Collection of Privacy-Sensitive Conversations between Care Workers and Care Home Residents in an Residential Care Home 🔒

The dataset is useful to train and evaluate models to identify and classify privacy-sensitive parts of conversations from text, especially in the context of AI assistants and LLMs.

Dataset Overview

Total entries: 95

Number of distinct taxonomy categories in the public dataset: 4

Number of distinct conversational categories in public dataset: 7

Papers:

Continues the work of: Privacy Agents: Utilizing Large Language Models to Safeguard Contextual Integrity in Elderly Care

Continues the work of: Prototype of a care documentation support system using audio recordings of care actions and large language models

Language Distribution 🌍

English (en): 95

Locale Distribution 🌎

United States (US) 🇺🇸: 95

Key Facts 🔑

This is synthetic data! Generated using proprietary algorithms - no privacy violations!

Conversations are classified following the taxonomy for privacy-sensitive robotics by Rueben et al. (2017).

The data was manually labeled by an expert.

Dataset Structure

Data Instances

The provided data format is .jsonl, the JSON Lines text format, also called newline-delimited JSON. An example entry looks as follows.

{ "text": "CW: Have you ever been to Italy? CR: Oh, yes... many years ago.", "taxonomy": 0, "category": 0, "affected_speaker": 1, "language": "en", "locale": "US", "data_type": 1, "uid": 16, "split": "train" }

Data Fields

The data fields are:

text: a string feature. The abbreviaton of the speakers refer to the care worker (CW) and the care recipient (CR).

taxonomy: a classification label, with possible values including informational (0), invasion (1), collection (2), processing (3), dissemination (4), physical (5), personal-space (6), territoriality (7), intrusion (8), obtrusion (9), contamination (10), modesty (11), psychological (12), interrogation (13), psychological-distance (14), social (15), association (16), crowding-isolation (17), public-gaze (18), solitude (19), intimacy (20), anonymity (21), reserve (22). The taxonomy is derived from Rueben et al. (2017). The classifications were manually labeled by an expert.

category: a classification label, with possible values including personal-information (0), family (1), health (2), thoughts (3), values (4), acquaintance (5), appointment (6). The privacy category affected in the conversation. The classifications were manually labeled by an expert.

affected_speaker: a classification label, with possible values including care-worker (0), care-recipient (1), other (2), both (3). The speaker whose privacy is impacted during the conversation. The classifications were manually labeled by an expert.

language: a string feature. Language code as defined by ISO 639.

locale: a string feature. Regional code as defined by ISO 3166-1 alpha-2.

data_type: a string a classification label, with possible values including real (0), synthetic (1).

uid: a int64 feature. A unique identifier within the dataset.

split: a string feature. Either train, validation or test.

Dataset Splits

The dataset has 2 subsets:

split: with a total of 95 examples split into train, validation and test (70%-15%-15%)

unsplit: with a total of 95 examples in a single train split

name train validation test
split 66 14 15
unsplit 95 n/a n/a

The files follow the naming convention subset-split-language.jsonl. The following files are contained in the dataset:

split-train-en.jsonl

split-validation-en.jsonl

split-test-en.jsonl

unsplit-train-en.jsonl

Dataset Creation

Curation Rationale

Recording audio of care workers and residents during care interactions, which includes partial and full body washing, giving of medication, as well as wound care, is a highly privacy-sensitive use case. Therefore, a dataset is created, which includes privacy-sensitive parts of conversations, synthesized from real-world data. This dataset serves as a basis for fine-tuning a local LLM to highlight and classify privacy-sensitive sections of transcripts created in care interactions, to further mask them to protect privacy.

Source Data

Initial Data Collection

The intial data was collected in the project Caring Robots of TU Wien in cooperation with Caritas Wien. One project track aims to facilitate Large Languge Models (LLM) to support documentation of care workers, with LLM-generated summaries of audio recordings of interactions between care workers and care home residents. The initial data are the transcriptions of those care interactions.

Data Processing

The transcriptions were thoroughly reviewed, and sections containing privacy-sensitive information were identified and marked using qualitative data analysis software by two experts. Subsequently, the accessible portions of the interviews were translated from German to US English using the locally executed LLM icky/translate. In the next step, another llama3.1:70b was used locally to synthesize the conversation segments. This process involved generating similar, yet distinct and new, conversations that are not linked to the original data. The dataset was split using the train_test_split function from the <a href="https://scikit-learn.org/1.5/modules/generated/sklearn.model_selection.train_test_split.html" target="_blank"

name	train	validation	test
split	66	14	15
unsplit	95	n/a	n/a

Dairy Supply Chain Sales Dataset

zenodo.org
ieee-dataport.org
+1more

pdf, zip

Updated Jul 12, 2024

Facebook

Twitter

Click to copy link

Link copied

Cite

Dimitris Iatropoulos; Konstantinos Georgakidis; Konstantinos Georgakidis; Ilias Siniosoglou; Ilias Siniosoglou; Christos Chaschatzis; Christos Chaschatzis; Anna Triantafyllou; Anna Triantafyllou; Athanasios Liatifis; Athanasios Liatifis; Dimitrios Pliatsios; Dimitrios Pliatsios; Thomas Lagkas; Thomas Lagkas; Vasileios Argyriou; Vasileios Argyriou; Panagiotis Sarigiannidis; Panagiotis Sarigiannidis; Dimitris Iatropoulos (2024). Dairy Supply Chain Sales Dataset [Dataset]. http://doi.org/10.21227/smv6-z405

Explore at:

zip, pdfAvailable download formats

Unique identifier

https://doi.org/10.21227/smv6-z405

Dataset updated

Jul 12, 2024

Dataset provided by

Zenodohttp://zenodo.org/

Authors

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

1.Introduction

Sales data collection is a crucial aspect of any manufacturing industry as it provides valuable insights about the performance of products, customer behaviour, and market trends. By gathering and analysing this data, manufacturers can make informed decisions about product development, pricing, and marketing strategies in Internet of Things (IoT) business environments like the dairy supply chain.

One of the most important benefits of the sales data collection process is that it allows manufacturers to identify their most successful products and target their efforts towards those areas. For example, if a manufacturer could notice that a particular product is selling well in a certain region, this information could be utilised to develop new products, optimise the supply chain or improve existing ones to meet the changing needs of customers.

This dataset includes information about 7 of MEVGAL’s products [1]. According to the above information the data published will help researchers to understand the dynamics of the dairy market and its consumption patterns, which is creating the fertile ground for synergies between academia and industry and eventually help the industry in making informed decisions regarding product development, pricing and market strategies in the IoT playground. The use of this dataset could also aim to understand the impact of various external factors on the dairy market such as the economic, environmental, and technological factors. It could help in understanding the current state of the dairy industry and identifying potential opportunities for growth and development.

2. Citation

Please cite the following papers when using this dataset:

I. Siniosoglou, K. Xouveroudis, V. Argyriou, T. Lagkas, S. K. Goudos, K. E. Psannis and P. Sarigiannidis, "Evaluating the Effect of Volatile Federated Timeseries on Modern DNNs: Attention over Long/Short Memory," in the 12th International Conference on Circuits and Systems Technologies (MOCAST 2023), April 2023, Accepted

3. Dataset Modalities

The dataset includes data regarding the daily sales of a series of dairy product codes offered by MEVGAL. In particular, the dataset includes information gathered by the logistics division and agencies within the industrial infrastructures overseeing the production of each product code. The products included in this dataset represent the daily sales and logistics of a variety of yogurt-based stock. Each of the different files include the logistics for that product on a daily basis for three years, from 2020 to 2022.

3.1 Data Collection

The process of building this dataset involves several steps to ensure that the data is accurate, comprehensive and relevant.

The first step is to determine the specific data that is needed to support the business objectives of the industry, i.e., in this publication’s case the daily sales data.

Once the data requirements have been identified, the next step is to implement an effective sales data collection method. In MEVGAL’s case this is conducted through direct communication and reports generated each day by representatives & selling points.

It is also important for MEVGAL to ensure that the data collection process conducted is in an ethical and compliant manner, adhering to data privacy laws and regulation. The industry also has a data management plan in place to ensure that the data is securely stored and protected from unauthorised access.

The published dataset is consisted of 13 features providing information about the date and the number of products that have been sold. Finally, the dataset was anonymised in consideration to the privacy requirement of the data owner (MEVGAL).

File	Period	Number of Samples (days)
product 1 2020.xlsx	01/01/2020–31/12/2020	363
product 1 2021.xlsx	01/01/2021–31/12/2021	364
product 1 2022.xlsx	01/01/2022–31/12/2022	365
product 2 2020.xlsx	01/01/2020–31/12/2020	363
product 2 2021.xlsx	01/01/2021–31/12/2021	364
product 2 2022.xlsx	01/01/2022–31/12/2022	365
product 3 2020.xlsx	01/01/2020–31/12/2020	363
product 3 2021.xlsx	01/01/2021–31/12/2021	364
product 3 2022.xlsx	01/01/2022–31/12/2022	365
product 4 2020.xlsx	01/01/2020–31/12/2020	363
product 4 2021.xlsx	01/01/2021–31/12/2021	364
product 4 2022.xlsx	01/01/2022–31/12/2022	364
product 5 2020.xlsx	01/01/2020–31/12/2020	363
product 5 2021.xlsx	01/01/2021–31/12/2021	364
product 5 2022.xlsx	01/01/2022–31/12/2022	365
product 6 2020.xlsx	01/01/2020–31/12/2020	362
product 6 2021.xlsx	01/01/2021–31/12/2021	364
product 6 2022.xlsx	01/01/2022–31/12/2022	365
product 7 2020.xlsx	01/01/2020–31/12/2020	362
product 7 2021.xlsx	01/01/2021–31/12/2021	364
product 7 2022.xlsx	01/01/2022–31/12/2022	365

3.2 Dataset Overview

The following table enumerates and explains the features included across all of the included files.

Feature	Description	Unit
Day	day of the month	-
Month	Month	-
Year	Year	-
daily_unit_sales	Daily sales - the amount of products, measured in units, that during that specific day were sold	units
previous_year_daily_unit_sales	Previous Year’s sales - the amount of products, measured in units, that during that specific day were sold the previous year	units
percentage_difference_daily_unit_sales	The percentage difference between the two above values	%
daily_unit_sales_kg	The amount of products, measured in kilograms, that during that specific day were sold	kg
previous_year_daily_unit_sales_kg	Previous Year’s sales - the amount of products, measured in kilograms, that during that specific day were sold, the previous year	kg
percentage_difference_daily_unit_sales_kg	The percentage difference between the two above values	kg
daily_unit_returns_kg	The percentage of the products that were shipped to selling points and were returned	%
previous_year_daily_unit_returns_kg	The percentage of the products that were shipped to

Yahoo Password Frequency Corpus
kaggle.com
zip
Updated Jan 14, 2019
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Imani (2019). Yahoo Password Frequency Corpus [Dataset]. https://www.kaggle.com/datasets/drwardog/yahoo-password-frequency-corpus
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Jan 14, 2019
Authors
Imani
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

This dataset includes sanitized password frequency lists collected from Yahoo in May 2011.

For details of the original collection experiment, please see:

Bonneau, Joseph. "The science of guessing: analyzing an anonymized corpus of 70 million passwords." IEEE Symposium on Security & Privacy, 2012. http://www.jbonneau.com/doc/B12-IEEESP-analyzing_70M_anonymized_passwords.pdf

This data has been modified to preserve differential privacy. For details of this modification, please see:

Jeremiah Blocki, Anupam Datta and Joseph Bonneau. "Differentially Private Password Frequency Lists." Network & Distributed Systems Symposium (NDSS), 2016. http://www.jbonneau.com/doc/BDB16-NDSS-pw_list_differential_privacy.pdf

Each of the 51 .txt files represents one subset of all users' passwords observed during the experiment period. "yahoo-all.txt" includes all users; every other file represents a strict subset of that group.

Content

Each file is a series of lines of the format:

FREQUENCY #OBSERVATIONS ...

with FREQUENCY in descending order. For example, the file:

3 1 2 1 1 3

would represent a the frequency list (3, 2, 1, 1, 1), that is, one password observed 3 times, one observed twice, and three separate passwords observed once each.
Data from: LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive...
zenodo.org
explore.openaire.eu
zip
Updated Oct 20, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sofia Yfantidou; Sofia Yfantidou; Christina Karagianni; Stefanos Efstathiou; Stefanos Efstathiou; Athena Vakali; Athena Vakali; Joao Palotti; Joao Palotti; Dimitrios Panteleimon Giakatos; Dimitrios Panteleimon Giakatos; Thomas Marchioro; Thomas Marchioro; Andrei Kazlouski; Elena Ferrari; Šarūnas Girdzijauskas; Šarūnas Girdzijauskas; Christina Karagianni; Andrei Kazlouski; Elena Ferrari (2022). LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive snapshots of our lives in the wild [Dataset]. http://doi.org/10.5281/zenodo.6832242
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.6832242
Dataset updated
Oct 20, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Sofia Yfantidou; Sofia Yfantidou; Christina Karagianni; Stefanos Efstathiou; Stefanos Efstathiou; Athena Vakali; Athena Vakali; Joao Palotti; Joao Palotti; Dimitrios Panteleimon Giakatos; Dimitrios Panteleimon Giakatos; Thomas Marchioro; Thomas Marchioro; Andrei Kazlouski; Elena Ferrari; Šarūnas Girdzijauskas; Šarūnas Girdzijauskas; Christina Karagianni; Andrei Kazlouski; Elena Ferrari
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
LifeSnaps Dataset Documentation

Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in the wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset, containing a plethora of anthropological data, collected unobtrusively for the total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data, will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.

The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.

Data Import: Reading CSV

For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.

Data Import: Setting up a MongoDB (Recommended)

To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data via importing the LifeSnaps MongoDB database.

To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have MongoDB Database Tools installed from here.

For the Fitbit data, run the following:

mongorestore --host localhost:27017 -d rais_anonymized -c fitbit

For the SEMA data, run the following:

mongorestore --host localhost:27017 -d rais_anonymized -c sema

For surveys data, run the following:

mongorestore --host localhost:27017 -d rais_anonymized -c surveys

If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.

Data Availability

The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain related information to these collections. Each document in any collection follows the format shown below:

{ _id:
i
Face Detection Dataset for Programmable Threshold-Based Sparse-Vision
ieee-dataport.org
Updated Mar 13, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Riadul Islam (2025). Face Detection Dataset for Programmable Threshold-Based Sparse-Vision [Dataset]. http://doi.org/10.21227/bw2e-dj78
Explore at:
Unique identifier
https://doi.org/10.21227/bw2e-dj78
Dataset updated
Mar 13, 2025
Dataset provided by
IEEE Dataport
Authors
Riadul Islam
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Smart focal-plane and in-chip image processing has emerged as a crucial technology for vision-enabled embedded systems with energy efficiency and privacy. However, the lack of special datasets providing examples of the data that these neuromorphic sensors compute to convey visual information has hindered the adoption of these promising technologies. Neuromorphic imager variants, including event-based sensors, producevarious representations such as streams of pixel addresses representing time and locations of intensity changes in the focal plane, temporal-difference data, data sifted/thresholded by temporal differences, image data after applying spatial transformations, optical flow data, and/or statistical representations. To address the critical barrier to entry, we provide an annotated, temporal-threshold-based vision dataset specifically designed for face detection tasks derived from the same videos used for Aff-Wild2. By offering multiple threshold levels (e.g., 4, 8, 12, and 16), this dataset allows for comprehensive evaluation and optimization of state-of-the-art neural architectures under varying conditions and settings compared to traditional methods. The accompanying tool flow for generating event data from raw videos further enhances accessibility and usability. We anticipate that this resource will significantly support the development of robust vision systems based on smart sensors that can process based on temporal-difference thresholds, enabling more accurate and efficient object detection and localization and ultimately promoting the broader adoption of low-power, neuromorphic imaging technologies. To support further research, we publicly released the dataset at \url{https://github.com/riaduli/Thresholded_event_vision_face_dataset}.
Daily Global Historical Climatology Network
kaggle.com
zip
Updated Aug 30, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
NOAA (2019). Daily Global Historical Climatology Network [Dataset]. https://www.kaggle.com/noaa/ghcn-d
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Aug 30, 2019
Dataset provided by
National Oceanic and Atmospheric Administrationhttp://www.noaa.gov/
Authors
NOAA
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

Weather is the state of the atmosphere, describing for example the degree to which it is hot or cold, wet or dry, calm or stormy, clear or cloudy. Source: https://en.wikipedia.org/wiki/Weather

Content

NOAA’s Global Historical Climatology Network (GHCN) is an integrated database of climate summaries from land surface stations across the globe that have been subjected to a common suite of quality assurance reviews. Two GHCN datasets are available in BigQuery, the GHCN-D (daily) and the GHCN-M (monthly). The data included in the GHCN datasets are obtained from more than 20 sources, including some data from every year since 1763.

For a complete description of data variables available in this dataset, see NOAA’s readme.txt: https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt

Update Frequency: daily

Fork this kernel to get started with this dataset.

Acknowledgements

https://bigquery.cloud.google.com/dataset/bigquery-public-data:ghcn_d

https://cloud.google.com/bigquery/public-data/noaa-ghcn

Dataset Source: NOAA. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source — http://www.data.gov/privacy-policy#data_policy — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

Banner Photo by Max LaRochelle from Unplash.

Inspiration

Find weather stations close to a specific location?

Daily rainfall amounts at specific station?

Pulling daily min/max temperature (in Celsius) and rainfall (in mm) for the past 14 days?
C
Eindhoven open data principles
ckan.mobidatalab.eu
data.europa.eu
Updated Jul 12, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
OverheidNl (2023). Eindhoven open data principles [Dataset]. https://ckan.mobidatalab.eu/dataset/22451-eindhoven-open-data-principes
Explore at:
http://publications.europa.eu/resource/authority/file-type/json, http://publications.europa.eu/resource/authority/file-type/csvAvailable download formats
Dataset updated
Jul 12, 2023
Dataset provided by
OverheidNl
License
Public Domain Mark 1.0https://creativecommons.org/publicdomain/mark/1.0/
License information was derived automatically
Area covered
Eindhoven
Description
Eindhoven open data principles a. Data in the public space (hereafter: “data”) belongs to everyone. This data is public good. Data that is collected, generated or measured (for example by sensors placed in public space) must be made available so that everyone can use it for commercial and non-commercial purposes. However, a privacy and security consideration must be made. b. Data may contain personal data. This data can therefore affect people's lives. The rules of the Personal Data Protection Act apply to this. This data must only be made available after this data has been processed in such a way (for example, anonymised or aggregated) that there are no longer any privacy risks. c. Data that does entail privacy or security risks may only be processed within the framework of privacy legislation. Storage and processing of data must be carried out in accordance with existing legislation. d. Data that does not (any longer) contain personal data must be placed in such a way that everyone has equal access to that data (for example via an Open Data portal). We call this opening up data. No technical or legal barriers are raised that make access to data impossible, restrict or discriminate. e. Data is always made available free of charge, without unnecessary processing (where possible in the raw form) and according to functional and technical requirements to be determined. f. A distinction is made with personal data (such as an e-mail address or payment details) which are collected with the conscious knowledge and after explicit consent of individuals. Use of this data is determined by an agreement between the parties involved within the framework of privacy legislation (such as a user agreement). g. The municipality always has insight into which data is collected in the public space, regardless of whether or not the data can be made public. h. The municipality remains in dialogue with the parties that contribute to the data infrastructure in the city and strives to create earning opportunities and a fertile economic climate.
COVID-19 Case Surveillance Public Use Data
data.cdc.gov
data.virginia.gov
+6more
application/rdfxml +5
Updated Jul 9, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
CDC Data, Analytics and Visualization Task Force (2024). COVID-19 Case Surveillance Public Use Data [Dataset]. https://data.cdc.gov/widgets/vbim-akqf
Explore at:
json, application/rdfxml, csv, xml, tsv, application/rssxmlAvailable download formats
Dataset updated
Jul 9, 2024
Dataset provided by
Centers for Disease Control and Preventionhttp://www.cdc.gov/
Authors
CDC Data, Analytics and Visualization Task Force
License
https://www.usa.gov/government-workshttps://www.usa.gov/government-works
Description
Note: Reporting of new COVID-19 Case Surveillance data will be discontinued July 1, 2024, to align with the process of removing SARS-CoV-2 infections (COVID-19 cases) from the list of nationally notifiable diseases. Although these data will continue to be publicly available, the dataset will no longer be updated.

Authorizations to collect certain public health data expired at the end of the U.S. public health emergency declaration on May 11, 2023. The following jurisdictions discontinued COVID-19 case notifications to CDC: Iowa (11/8/21), Kansas (5/12/23), Kentucky (1/1/24), Louisiana (10/31/23), New Hampshire (5/23/23), and Oklahoma (5/2/23). Please note that these jurisdictions will not routinely send new case data after the dates indicated. As of 7/13/23, case notifications from Oregon will only include pediatric cases resulting in death.

This case surveillance public use dataset has 12 elements for all COVID-19 cases shared with CDC and includes demographics, any exposure history, disease severity indicators and outcomes, presence of any underlying medical conditions and risk behaviors, and no geographic data.

CDC has three COVID-19 case surveillance datasets:
COVID-19 Case Surveillance Public Use Data with Geography: Public use, patient-level dataset with clinical data (including symptoms), demographics, and county and state of residence. (19 data elements)
COVID-19 Case Surveillance Public Use Data: Public use, patient-level dataset with clinical and symptom data and demographics, with no geographic data. (12 data elements)
COVID-19 Case Surveillance Restricted Access Detailed Data: Restricted access, patient-level dataset with clinical and symptom data, demographics, and state and county of residence. Access requires a registration process and a data use agreement. (33 data elements)
The following apply to all three datasets:
Data elements can be found on the COVID-19 case report form located at www.cdc.gov/coronavirus/2019-ncov/downloads/pui-form.pdf.
Data are considered provisional by CDC and are subject to change until the data are reconciled and verified with the state and territorial data providers.
Some data cells are suppressed to protect individual privacy.
The datasets will include all cases with the earliest date available in each record (date received by CDC or date related to illness/specimen collection) at least 14 days prior to the creation of the current datasets. This 14-day lag allows case reporting to be stabilized and ensures that time-dependent outcome data are accurately captured.
Datasets are updated monthly.
Datasets are created using CDC’s Policy on Public Health Research and Nonresearch Data Management and Access and include protections designed to protect individual privacy.
For more information about data collection and reporting, please see https://www.cdc.gov/coronavirus/2019-ncov/covid-data/about-us-cases-deaths.html.
For more information about the COVID-19 case surveillance data, please see https://www.cdc.gov/coronavirus/2019-ncov/covid-data/faq-surveillance.html

Overview

The COVID-19 case surveillance database includes individual-level data reported to U.S. states and autonomous reporting entities, including New York City and the District of Columbia (D.C.), as well as U.S. territories and affiliates. On April 5, 2020, COVID-19 was added to the Nationally Notifiable Condition List and classified as “immediately notifiable, urgent (within 24 hours)” by a Council of State and Territorial Epidemiologists (CSTE) Interim Position Statement (Interim-20-ID-01). CSTE updated the position statement on August 5, 2020, to clarify the interpretation of antigen detection tests and serologic test results within the case classification (Interim-20-ID-02). The statement also recommended that all states and territories enact laws to make COVID-19 reportable in their jurisdiction, and that jurisdictions conducting surveillance should submit case notifications to CDC. COVID-19 case surveillance data are collected by jurisdictions and reported voluntarily to CDC.

For more information: NNDSS Supports the COVID-19 Response | CDC.

The deidentified data in the “COVID-19 Case Surveillance Public Use Data” include demographic characteristics, any exposure history, disease severity indicators and outcomes, clinical data, laboratory diagnostic test results, and presence of any underlying medical conditions and risk behaviors. All data elements can be found on the COVID-19 case report form located at www.cdc.gov/coronavirus/2019-ncov/downloads/pui-form.pdf.

COVID-19 Case Reports

COVID-19 case reports have been routinely submitted using nationally standardized case reporting forms. On April 5, 2020, CSTE released an Interim Position Statement with national surveillance case definitions for COVID-19 included. Current versions of these case definitions are available here: https://ndc.services.cdc.gov/case-definitions/coronavirus-disease-2019-2021/.

All cases reported on or after were requested to be shared by public health departments to CDC using the standardized case definitions for laboratory-confirmed or probable cases. On May 5, 2020, the standardized case reporting form was revised. Case reporting using this new form is ongoing among U.S. states and territories.

Data are Considered Provisional

The COVID-19 case surveillance data are dynamic; case reports can be modified at any time by the jurisdictions sharing COVID-19 data with CDC. CDC may update prior cases shared with CDC based on any updated information from jurisdictions. For instance, as new information is gathered about previously reported cases, health departments provide updated data to CDC. As more information and data become available, analyses might find changes in surveillance data and trends during a previously reported time window. Data may also be shared late with CDC due to the volume of COVID-19 cases.
Annual finalized data: To create the final NNDSS data used in the annual tables, CDC works carefully with the reporting jurisdictions to reconcile the data received during the year until each state or territorial epidemiologist confirms that the data from their area are correct.
Access Addressing Gaps in Public Health Reporting of Race and Ethnicity for COVID-19, a report from the Council of State and Territorial Epidemiologists, to better understand the challenges in completing race and ethnicity data for COVID-19 and recommendations for improvement.

Data Limitations

To learn more about the limitations in using case surveillance data, visit FAQ: COVID-19 Data and Surveillance.

Data Quality Assurance Procedures

CDC’s Case Surveillance Section routinely performs data quality assurance procedures (i.e., ongoing corrections and logic checks to address data errors). To date, the following data cleaning steps have been implemented:
Questions that have been left unanswered (blank) on the case report form are reclassified to a Missing value, if applicable to the question. For example, in the question “Was the individual hospitalized?” where the possible answer choices include “Yes,” “No,” or “Unknown,” the blank value is recoded to Missing because the case report form did not include a response to the question.
Logic checks are performed for date data. If an illogical date has been provided, CDC reviews the data with the reporting jurisdiction. For example, if a symptom onset date in the future is reported to CDC, this value is set to null until the reporting jurisdiction updates the date appropriately.
Additional data quality processing to recode free text data is ongoing. Data on symptoms, race and ethnicity, and healthcare worker status have been prioritized.

Data Suppression

To prevent release of data that could be used to identify people, data cells are suppressed for low frequency (<5) records and indirect identifiers (e.g., date of first positive specimen). Suppression includes rare combinations of demographic characteristics (sex, age group, race/ethnicity). Suppressed values are re-coded to the NA answer option; records with data suppression are never removed.

For questions, please contact Ask SRRG (eocevent394@cdc.gov).

Additional COVID-19 Data

COVID-19 data are available to the public as summary or aggregate count files, including total counts of cases and deaths by state and by county. These
T
wine_quality
tensorflow.org
beta.dataverse.org
+1more
Updated Nov 23, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2022). wine_quality [Dataset]. https://www.tensorflow.org/datasets/catalog/wine_quality
Explore at:
Dataset updated
Nov 23, 2022
Description
Two datasets were created, using red and white wine samples. The inputs include objective tests (e.g. PH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). Several data mining methods were applied to model these datasets under a regression approach. The support vector machine model achieved the best results. Several metrics were computed: MAD, confusion matrix for a fixed error tolerance (T), etc. Also, we plot the relative importances of the input variables (as measured by a sensitivity analysis procedure).

The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

Number of Instances: red wine - 1599; white wine - 4898

Input variables (based on physicochemical tests):

fixed acidity

volatile acidity

citric acid

residual sugar

chlorides

free sulfur dioxide

total sulfur dioxide

density

pH

sulphates

alcohol

Output variable (based on sensory data):

quality (score between 0 and 10)

To use this dataset:

import tensorflow_datasets as tfds ds = tfds.load('wine_quality', split='train') for ex in ds.take(4): print(ex)

See the guide for more informations on tensorflow_datasets.
a
Portsmouth Water Domestic Consumption
hub.arcgis.com
streamwaterdata.co.uk
Updated Apr 25, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
AHughes_Portsmouth (2024). Portsmouth Water Domestic Consumption [Dataset]. https://hub.arcgis.com/datasets/ae7c87ab4bdd4d2090e7f1773efc5a44
Explore at:
Dataset updated
Apr 25, 2024
Dataset authored and provided by
AHughes_Portsmouth
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Overview

This dataset offers valuable insights into yearly domestic water consumption across various Lower Super Output Areas (LSOAs) or Data Zones, accompanied by the count of water meters within each area. It is instrumental for analysing residential water use patterns, facilitating water conservation efforts, and guiding infrastructure development and policy making at a localised level.

Key Definitions

Aggregation

The process of summarising or grouping data to obtain a single or reduced set of information, often for analysis or reporting purposes.

AMR Meter

Automatic meter reading (AMR) is the technology of automatically collecting consumption, diagnostic, and status data from a water meter remotely and periodically.

Dataset

Structured and organised collection of related elements, often stored digitally, used for analysis and interpretation in various fields.

Data Zone

Data zones are the key geography for the dissemination of small area statistics in Scotland

Dumb Meter

A dumb meter or analogue meter is read manually. It does not have any external connectivity.

Granularity

Data granularity is a measure of the level of detail in a data structure. In time-series data, for example, the granularity of measurement might be based on intervals of years, months, weeks, days, or hours

ID

Abbreviation for Identification that refers to any means of verifying the unique identifier assigned to each asset for the purposes of tracking, management, and maintenance.

LSOA

Lower Layer Super Output Areas (LSOA) are a geographic hierarchy designed to improve the reporting of small area statistics in England and Wales.

Open Data Triage

The process carried out by a Data Custodian to determine if there is any evidence of sensitivities associated with Data Assets, their associated Metadata and Software Scripts used to process Data Assets if they are used as Open Data.

Schema

Structure for organising and handling data within a dataset, defining the attributes, their data types, and the relationships between different entities. It acts as a framework that ensures data integrity and consistency by specifying permissible data types and constraints for each attribute.

Smart Meter

A smart meter is an electronic device that records information and communicates it to the consumer and the supplier. It differs from automatic meter reading (AMR) in that it enables two-way communication between the meter and the supplier.

Units

Standard measurements used to quantify and compare different physical quantities.

Water Meter

Water metering is the practice of measuring water use. Water meters measure the volume of water used by residential and commercial building units that are supplied with water by a public water supply system.

Data History

Data Origin

Domestic consumption data is recorded using water meters. The consumption recorded is then sent back to water companies. This dataset is extracted from the water companies.

Data Triage Considerations

This section discusses the careful handling of data to maintain anonymity and addresses the challenges associated with data updates, such as identifying household changes or meter replacements.

Identification of Critical Infrastructure

This aspect is not applicable for the dataset, as the focus is on domestic water consumption and does not contain any information that reveals critical infrastructure details.

Commercial Risks and Anonymisation

Individual Identification Risks

There is a potential risk of identifying individuals or households if the consumption data is updated irregularly (e.g., every 6 months) and an out-of-cycle update occurs (e.g., after 2 months), which could signal a change in occupancy or ownership. Such patterns need careful handling to avoid accidental exposure of sensitive information.

Meter and Property Association

Challenges arise in maintaining historical data integrity when meters are replaced but the property remains the same. Ensuring continuity in the data without revealing personal information is crucial.

Interpretation of Null Consumption

Instances of null consumption could be misunderstood as a lack of water use, whereas they might simply indicate missing data. Distinguishing between these scenarios is vital to prevent misleading conclusions.

Meter Re-reads

The dataset must account for instances where meters are read multiple times for accuracy.

Joint Supplies & Multiple Meters per Household

Special consideration is required for households with multiple meters as well as multiple households that share a meter as this could complicate data aggregation.

Schema Consistency with the Energy Industry:

In formulating the schema for the domestic water consumption dataset, careful consideration was given to the potential risks to individual privacy. This evaluation included examining the frequency of data updates, the handling of property and meter associations, interpretations of null consumption, meter re-reads, joint suppliers, and the presence of multiple meters within a single household as described above.

After a thorough assessment of these factors and their implications for individual privacy, it was decided to align the dataset's schema with the standards established within the energy industry. This decision was influenced by the energy sector's experience and established practices in managing similar risks associated with smart meters. This ensures a high level of data integrity and privacy protection.

Schema

The dataset schema is aligned with those used in the energy industry, which has encountered similar challenges with smart meters. However, it is important to note that the energy industry has a much higher density of meter distribution, especially smart meters.

Aggregation to Mitigate Risks

The dataset employs an elevated level of data aggregation to minimise the risk of individual identification. This approach is crucial in maintaining the utility of the dataset while ensuring individual privacy. The aggregation level is carefully chosen to remove identifiable risks without excluding valuable data, thus balancing data utility with privacy concerns.

Data Freshness

Users should be aware that this dataset reflects historical consumption patterns and does not represent real-time data.

Publish Frequency

Annually

Data Triage Review Frequency

An annual review is conducted to ensure the dataset's relevance and accuracy, with adjustments made based on specific requests or evolving data trends.

Data Specifications

For the domestic water consumption dataset, the data specifications are designed to ensure comprehensiveness and relevance, while maintaining clarity and focus. The specifications for this dataset include:

·
Each dataset encompasses recordings of domestic water consumption as measured and reported by the data publisher. It excludes commercial consumption.

· Where it is necessary to estimate consumption, this is calculated based on actual meter readings.

· Meters of all types (smart, dumb, AMR) are included in this dataset.

·
The dataset is updated and published annually.

·
Historical data may be made available to facilitate trend analysis and comparative studies, although it is not mandatory for each dataset release.

Context

Users are cautioned against using the dataset for immediate operational decisions regarding water supply management. The data should be interpreted considering potential seasonal and weather-related influences on water consumption patterns.

The geographical data provided does not pinpoint locations of water meters within an LSOA.

The dataset aims to cover a broad spectrum of households, from single-meter homes to those with multiple meters, to accurately reflect the diversity of water use within an LSOA.

Supplementary Information

Below is a curated selection of links for additional reading, which provide a deeper understanding of this dataset.

Ofwat guidance on water meters

https://www.ofwat.gov.uk/wp-content/uploads/2015/11/prs_lft_101117meters.pdf
Open Data Inventory
open.canada.ca
ouvert.canada.ca
+1more
csv, html, xls
Updated Dec 9, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Treasury Board of Canada Secretariat (2024). Open Data Inventory [Dataset]. https://open.canada.ca/data/en/dataset/4ed351cf-95d8-4c10-97ac-6b3511f359b7
Explore at:
csv, html, xlsAvailable download formats
Dataset updated
Dec 9, 2024
Dataset provided by
Treasury Board of Canada Secretariathttp://www.tbs-sct.gc.ca/
Treasury Board of Canadahttps://www.canada.ca/en/treasury-board-secretariat/corporate/about-treasury-board.html
License
Open Government Licence - Canada 2.0https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
Description
Building a comprehensive data inventory as required by section 6.3 of the Directive on Open Government: “Establishing and maintaining comprehensive inventories of data and information resources of business value held by the department to determine their eligibility and priority, and to plan for their effective release.” Creating a data inventory is among the first steps in identifying federal data that is eligible for release. Departmental data inventories has been published on the Open Government portal, Open.Canada.ca, so that Canadians can see what federal data is collected and have the opportunity to indicate what data is of most interest to them, helping departments to prioritize data releases based on both external demand and internal capacity. The objective of the inventory is to provide a landscape of all federal data. While it is recognized that not all data is eligible for release due to the nature of the content, departments are responsible for identifying and including all datasets of business values as part of the inventory exercise with the exception of datasets whose title contains information that should not be released to be released to the public due to security or privacy concerns. These titles have been excluded from the inventory. Departments were provided with an open data inventory template with standardized elements to populate, and upload in the metadata catalogue, the Open Government Registry. These elements are described in the data dictionary file. Departments are responsible for maintaining up-to-date data inventories that reflect significant additions to their data holdings. For purposes of this open data inventory exercise, a dataset is defined as: “An organized collection of data used to carry out the business of a department or agency, that can be understood alone or in conjunction with other datasets”. Please note that the Open Data Inventory is no longer being maintained by Government of Canada organizations and is therefore not being updated. However, we will continue to provide access to the dataset for review and analysis.
s
Crime Data 2023 - Part 1 Offenses (With Lat & Long Info)
data.syr.gov
Updated Aug 11, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
admin_syr (2023). Crime Data 2023 - Part 1 Offenses (With Lat & Long Info) [Dataset]. https://data.syr.gov/datasets/94bc33f1a2c646b995d1ab356edf0730
Explore at:
Dataset updated
Aug 11, 2023
Dataset authored and provided by
admin_syr
License
https://data.syr.gov/pages/termsofusehttps://data.syr.gov/pages/termsofuse
Area covered

Description
This 2023 crime data is the list of crimes that the Syracuse Police Department responded to in 2023. These records does not include rape offenses as well as any crimes that have been sealed by the court. These records are derived from the records management system utilized by the SPD. The data is then anonymized by SPD Crime Analysts weekly. After this data is received weekly from the SPD, this data is then mapped to the approximate location of that incident, using the 100 block level and a Geolocator File from Onondaga County GIS Department. This data is then updated on the Open Data Portal. The points should not be construed to be the exact point this incidents were reported to occur, rather the block where these incident is reported to occur.Crimes are reported to the FBI in two major categories under the Uniform Crime Reports specification: Part 1 and Part 2 crimes. Part 1 crimes include criminal homicide, forcible rape, robbery, aggravated assault, burglary, larceny-theft, and motor vehicle theft. In these records, rape offenses have been excluded due to victim privacy concerns.Part 2 crimes include all other offenses. A more detailed guide to Part 1 crimes is listed below. More details about Part 2 Crimes is listed in the Part 2 Crimes Dataset.When using the data, the date and time provided are when the crime was actually reported. This means that though a larceny might be reported at noon, the actual crime could have happened at 8am, but was not realized until someone noticed hours later. Similarly, if a home break-in happens during a holiday weekend when the owners are out of town, the crime report may not come in until they return home and notice the crime took place previously. The address in the dataset is where the crime occurred. The location is also anonymized to the block level, so a crime that occurred at 123 Main St. will appear as occurring on the 100 block of Main St. This is to protect the privacy of all involved. Finally, information about crimes is fluid, and details about the crime could change.Data DictionaryDate End - Date that the crime was reported. It could have happened earlier. This is in the format of DD-MON-YY (Ex. 01-Jan-22).Time start and time end - Listed in military time (2400) - Burglaries and larcenies are often a time frame. Address - Where the crime occurred. All addresses are in the 100’s because the Syracuse Police Department allows privacy for residents and only lists the block number.Code Defined - Offense names are listed as crime categories group for ease of understanding. There may have been other offenses also, but the one displayed is the highest Unified Crime Reporting (UCR) category.Arrest - Means that there was an arrest, but not necessarily for that crime.Larceny Code - Indicates the type of larceny (Example: From Building or From Motor Vehicle).LAT - The approximate latitude (not actual) that this call for service occurred.LONG - The approximate latitude (not actual) that this call for service occurred.DisclaimerData derived from the Syracuse Police Department record management system, any data not listed is not currently available.Part I Crime DefinitionsCriminal homicide—a.) Murder and non-negligent manslaughter: the willful (non-negligent) killing of one human being by another. Deaths caused by negligence, attempts to kill, assaults to kill, suicides, and accidental deaths are excluded. The program classifies justifiable homicides separately and limits the definition to: (1) the killing of a felon by a law enforcement officer in the line of duty; or (2) the killing of a felon, during the commission of a felony, by a private citizen. b.) Manslaughter by negligence: the killing of another person through gross negligence. Deaths of persons due to their own negligence, accidental deaths not resulting from gross negligence, and traffic fatalities are not included in the category Manslaughter by Negligence. Robbery—The taking or attempting to take anything of value from the care, custody, or control of a person or persons by force or threat of force or violence and/or by putting the victim in fear. Aggravated assault—An unlawful attack by one person upon another for the purpose of inflicting severe or aggravated bodily injury. This type of assault usually is accompanied by the use of a weapon or by means likely to produce death or great bodily harm. Simple assaults are excluded. Burglary (breaking or entering)—The unlawful entry of a structure to commit a felony or a theft. Attempted forcible entry is included. Larceny-theft (except motor vehicle theft)—The unlawful taking, carrying, leading, or riding away of property from the possession or constructive possession of another. Examples are thefts of bicycles, motor vehicle parts and accessories, shoplifting, pocket picking, or the stealing of any property or article that is not taken by force and violence or by fraud. Attempted larcenies are included. Embezzlement, confidence games, forgery, check fraud, etc., are excluded. Motor vehicle theft—The theft or attempted theft of a motor vehicle. A motor vehicle is self-propelled and runs on land surface and not on rails. Motorboats, construction equipment, airplanes, and farming equipment are specifically excluded from this category. Dataset Contact Information:Organization: Syracuse Police Department (SPD)Position: Data Program ManagerCity: Syracuse, NYE-Mail Address: opendata@syrgov.net
S
Synthetic Data Generation Market Report
marketresearchforecast.com
doc, pdf, ppt
Updated Dec 8, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Market Research Forecast (2024). Synthetic Data Generation Market Report [Dataset]. https://www.marketresearchforecast.com/reports/synthetic-data-generation-market-1834
Explore at:
pdf, doc, pptAvailable download formats
Dataset updated
Dec 8, 2024
Dataset authored and provided by
Market Research Forecast
License
https://www.marketresearchforecast.com/privacy-policyhttps://www.marketresearchforecast.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The Synthetic Data Generation Marketsize was valued at USD 288.5 USD Million in 2023 and is projected to reach USD 1920.28 USD Million by 2032, exhibiting a CAGR of 31.1 % during the forecast period.Synthetic data generation stands for the generation of fake datasets that resemble real datasets with reference to their data distribution and patterns. It refers to the process of creating synthetic data points utilizing algorithms or models instead of conducting observations or surveys. There is one of its core advantages: it can maintain the statistical characteristics of the original data and remove the privacy risk of using real data. Further, with synthetic data, there is no limitation to how much data can be created, and hence, it can be used for extensive testing and training of machine learning models, unlike the case with conventional data, which may be highly regulated or limited in availability. It also helps in the generation of datasets that are comprehensive and include many examples of specific situations or contexts that may occur in practice for improving the AI system’s performance. The use of SDG significantly shortens the process of the development cycle, requiring less time and effort for data collection as well as annotation. It basically allows researchers and developers to be highly efficient in their discovery and development in specific domains like healthcare, finance, etc. Key drivers for this market are: Growing Demand for Data Privacy and Security to Fuel Market Growth. Potential restraints include: Lack of Data Accuracy and Realism Hinders Market Growth. Notable trends are: Growing Implementation of Touch-based and Voice-based Infotainment Systems to Increase Adoption of Intelligent Cars.

Data from: WikiReddit: Tracing Information and Attention Flows Between...

zenodo.org

bin

Updated Feb 10, 2025

Facebook

Twitter

Click to copy link

Link copied

Cite

Patrick Gildersleve; Patrick Gildersleve; Anna Beers; Anna Beers; Viviane Ito; Viviane Ito; Agustin Orozco; Agustin Orozco; Francesca Tripodi; Francesca Tripodi (2025). WikiReddit: Tracing Information and Attention Flows Between Online Platforms [Dataset]. http://doi.org/10.5281/zenodo.14653265

Explore at:

binAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.14653265

Dataset updated

Feb 10, 2025

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Patrick Gildersleve; Patrick Gildersleve; Anna Beers; Anna Beers; Viviane Ito; Viviane Ito; Agustin Orozco; Agustin Orozco; Francesca Tripodi; Francesca Tripodi

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Time period covered

Jan 15, 2025

Description

Preprint

Gildersleve, P., Beers, A., Ito, V., Orozco, A., & Tripodi, F. (2025). WikiReddit: Tracing Information and Attention Flows Between Online Platforms. arXiv [Cs.CY]. https://doi.org/10.48550/arXiv.2502.04942

Abstract

The World Wide Web is a complex interconnected digital ecosystem, where information and attention flow between platforms and communities throughout the globe. These interactions co-construct how we understand the world, reflecting and shaping public discourse. Unfortunately, researchers often struggle to understand how information circulates and evolves across the web because platform-specific data is often siloed and restricted by linguistic barriers. To address this gap, we present a comprehensive, multilingual dataset capturing all Wikipedia links shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW subreddits. Each linked Wikipedia article is enriched with revision history, page view data, article ID, redirects, and Wikidata identifiers. Through a research agreement with Reddit, our dataset ensures user privacy while providing a query and ID mechanism that integrates with the Reddit and Wikipedia APIs. This enables extended analyses for researchers studying how information flows across platforms. For example, Reddit discussions use Wikipedia for deliberation and fact-checking which subsequently influences Wikipedia content, by driving traffic to articles or inspiring edits. By analyzing the relationship between information shared and discussed on these platforms, our dataset provides a foundation for examining the interplay between social media discourse and collaborative knowledge consumption and production.

Datasheet

Motivation

The motivations for this dataset stem from the challenges researchers face in studying the flow of information across the web. While the World Wide Web enables global communication and collaboration, data silos, linguistic barriers, and platform-specific restrictions hinder our ability to understand how information circulates, evolves, and impacts public discourse. Wikipedia and Reddit, as major hubs of knowledge sharing and discussion, offer an invaluable lens into these processes. However, without comprehensive data capturing their interactions, researchers are unable to fully examine how platforms co-construct knowledge. This dataset bridges this gap, providing the tools needed to study the interconnectedness of social media and collaborative knowledge systems.

Composition

WikiReddit, a comprehensive dataset capturing all Wikipedia mentions (including links) shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW (not safe for work) subreddits. The SQL database comprises 336K total posts, 10.2M comments, 1.95M unique links, and 1.26M unique articles spanning 59 languages on Reddit and 276 Wikipedia language subdomains. Each linked Wikipedia article is enriched with its revision history and page view data within a ±10-day window of its posting, as well as article ID, redirects, and Wikidata identifiers. Supplementary anonymous metadata from Reddit posts and comments further contextualizes the links, offering a robust resource for analysing cross-platform information flows, collective attention dynamics, and the role of Wikipedia in online discourse.

Collection Process

Data was collected from the Reddit4Researchers and Wikipedia APIs. No personally identifiable information is published in the dataset. Data from Reddit to Wikipedia is linked via the hyperlink and article titles appearing in Reddit posts.

Preprocessing/cleaning/labeling

Extensive processing with tools such as regex was applied to the Reddit post/comment text to extract the Wikipedia URLs. Redirects for Wikipedia URLs and article titles were found through the API and mapped to the collected data. Reddit IDs are hashed with SHA-256 for post/comment/user/subreddit anonymity.

Uses

We foresee several applications of this dataset and preview four here. First, Reddit linking data can be used to understand how attention is driven from one platform to another. Second, Reddit linking data can shed light on how Wikipedia's archive of knowledge is used in the larger social web. Third, our dataset could provide insights into how external attention is topically distributed across Wikipedia. Our dataset can help extend that analysis into the disparities in what types of external communities Wikipedia is used in, and how it is used. Fourth, relatedly, a topic analysis of our dataset could reveal how Wikipedia usage on Reddit contributes to societal benefits and harms. Our dataset could help examine if homogeneity within the Reddit and Wikipedia audiences shapes topic patterns and assess whether these relationships mitigate or amplify problematic engagement online.

Distribution

The dataset is publicly shared with a Creative Commons Attribution 4.0 International license. The article describing this dataset should be cited: https://doi.org/10.48550/arXiv.2502.04942

Maintenance

Patrick Gildersleve will maintain this dataset, and add further years of content as and when available.

SQL Database Schema

Table: `posts`

Column Name	Type	Description
`subreddit_id`	TEXT	The unique identifier for the subreddit.
`crosspost_parent_id`	TEXT	The ID of the original Reddit post if this post is a crosspost.
`post_id`	TEXT	Unique identifier for the Reddit post.
`created_at`	TIMESTAMP	The timestamp when the post was created.
`updated_at`	TIMESTAMP	The timestamp when the post was last updated.
`language_code`	TEXT	The language code of the post.
`score`	INTEGER	The score (upvotes minus downvotes) of the post.
`upvote_ratio`	REAL	The ratio of upvotes to total votes.
`gildings`	INTEGER	Number of awards (gildings) received by the post.
`num_comments`	INTEGER	Number of comments on the post.

Table: `comments`

Column Name	Type	Description
`subreddit_id`	TEXT	The unique identifier for the subreddit.
`post_id`	TEXT	The ID of the Reddit post the comment belongs to.
`parent_id`	TEXT	The ID of the parent comment (if a reply).
`comment_id`	TEXT	Unique identifier for the comment.
`created_at`	TIMESTAMP	The timestamp when the comment was created.
`last_modified_at`	TIMESTAMP	The timestamp when the comment was last modified.
`score`	INTEGER	The score (upvotes minus downvotes) of the comment.
`upvote_ratio`	REAL	The ratio of upvotes to total votes for the comment.
`gilded`	INTEGER	Number of awards (gildings) received by the comment.

Table: `postlinks`

Column Name	Type	Description
`post_id`	TEXT	Unique identifier for the Reddit post.
`end_processed_valid`	INTEGER	Whether the extracted URL from the post resolves to a valid URL.
`end_processed_url`	TEXT	The extracted URL from the Reddit post.
`final_valid`	INTEGER	Whether the final URL from the post resolves to a valid URL after redirections.
`final_status`	INTEGER	HTTP status code of the final URL.
`final_url`	TEXT	The final URL after redirections.
`redirected`	INTEGER	Indicator of whether the posted URL was redirected (1) or not (0).
`in_title`	INTEGER	Indicator of whether the link appears in the post title (1) or post body (0).

Table: `commentlinks`

Column Name	Type	Description
`comment_id`	TEXT	Unique identifier for the Reddit comment.
`end_processed_valid`	INTEGER	Whether the extracted URL from the comment resolves to a valid URL.
`end_processed_url`	TEXT	The extracted URL from the comment.
`final_valid`	INTEGER	Whether the final URL from the comment resolves to a valid URL after redirections.
`final_status`	INTEGER	HTTP status code of the final URL.
`final_url`	TEXT	The final URL after

d
Syntegra Synthetic EHR Data | Structured Healthcare Electronic Health Record...
datarade.ai
Updated Feb 23, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Syntegra (2022). Syntegra Synthetic EHR Data | Structured Healthcare Electronic Health Record Data [Dataset]. https://datarade.ai/data-products/syntegra-synthetic-ehr-data-structured-healthcare-electroni-syntegra
Explore at:
.bin, .json, .xml, .csv, .xls, .sql, .txtAvailable download formats
Dataset updated
Feb 23, 2022
Dataset authored and provided by
Syntegra
Area covered
United States of America
Description
Organizations can license synthetic, structured data generated by Syntegra from electronic health record systems of community hospitals across the United States, reaching beyond just claims and Rx data.

The synthetic data provides a detailed picture of the patient's journey throughout their hospital stay, including patient demographic information and payer type, as well as rich data not found in any other sources. Examples of this data include: drugs given (timing and dosing), patient location (e.g., ICU, floor, ER), lab results (timing by day and hour), physician roles (e.g., surgeon, attending), medications given, and vital signs. The participating community hospitals with bed sizes ranging from 25 to 532 provide unique visibility and assessment of variation in care outside of large academic medical centers and healthcare networks.

Our synthetic data engine is trained on a broadly representative dataset made up of deep clinical information of approximately 6 million unique patient records and 18 million encounters over 5 years of history. Notably, synthetic data generation allows for the creation of any number of records needed to power your project.

EHR data is available in the following formats: — Cleaned, analytics-ready (a layer of clean and normalized concepts in Tuva Health’s standard relational data model format — FHIR USCDI (labs, medications, vitals, encounters, patients, etc.)

The synthetic data maintains full statistical accuracy, yet does not contain any actual patients, thus removing any patient privacy liability risk. Privacy is preserved in a way that goes beyond HIPAA or GDPR compliance. Our industry-leading metrics prove that both privacy and fidelity are fully maintained.

— Generate the data needed for product development, testing, demo, or other needs — Access data at a scalable price point — Build your desired population, both in size and demographics — Scale up and down to fit specific needs, increasing efficiency and affordability

Syntegra's synthetic data engine also has the ability to augment the original data: — Expand population sizes, rare cohorts, or outcomes of interest — Address algorithmic fairness by correcting bias or introducing intentional bias — Conditionally generate data to inform scenario planning — Impute missing value to minimize gaps in the data
a
Portsmouth Water Drinking Water Quality Data 2022, 2023 & 2024
hub.arcgis.com
Updated May 22, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
AHughes_Portsmouth (2024). Portsmouth Water Drinking Water Quality Data 2022, 2023 & 2024 [Dataset]. https://hub.arcgis.com/datasets/b9b8d038ae70461386f5cab102adbbb9
Explore at:
Dataset updated
May 22, 2024
Dataset authored and provided by
AHughes_Portsmouth
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Overview

Water companies in the UK are responsible for testing the quality of drinking water. This dataset contains the results of samples taken from the taps in domestic households to make sure they meet the standards set out by UK and European legislation. This data shows the location, date, and measured levels of determinands set out by the Drinking Water Inspectorate (DWI).

Key Definitions

Aggregation

Process involving summarizing or grouping data to obtain a single or reduced set of information, often for analysis or reporting purposes

Anonymisation

Anonymised data is a type of information sanitization in which data anonymisation tools encrypt or remove personally identifiable information from datasets for the purpose of preserving a data subject's privacy

Dataset

Structured and organized collection of related elements, often stored digitally, used for analysis and interpretation in various fields.

Determinand

A constituent or property of drinking water which can be determined or estimated.

DWI

Drinking Water Inspectorate, an organisation “providing independent reassurance that water supplies in England and Wales are safe and drinking water quality is acceptable to consumers.”

DWI Determinands

Constituents or properties that are tested for when evaluating a sample for its quality as per the guidance of the DWI. For this dataset, only determinands with “point of compliance” as “customer taps” are included.

Granularity

Data granularity is a measure of the level of detail in a data structure. In time-series data, for example, the granularity of measurement might be based on intervals of years, months, weeks, days, or hours

ID

Abbreviation for Identification that refers to any means of verifying the unique identifier assigned to each asset for the purposes of tracking, management, and maintenance.

LSOA

Lower-Level Super Output Area is made up of small geographic areas used for statistical and administrative purposes by the Office for National Statistics. It is designed to have homogeneous populations in terms of population size, making them suitable for statistical analysis and reporting. Each LSOA is built from groups of contiguous Output Areas with an average of about 1,500 residents or 650 households allowing for granular data collection useful for analysis, planning and policy- making while ensuring privacy.

ONS

Office for National Statistics

Open Data Triage

The process carried out by a Data Custodian to determine if there is any evidence of sensitivities associated with Data Assets, their associated Metadata and Software Scripts used to process Data Assets if they are used as Open Data. <

Sample

A sample is a representative segment or portion of water taken from a larger whole for the purpose of analysing or testing to ensure compliance with safety and quality standards.

Schema

Structure for organizing and handling data within a dataset, defining the attributes, their data types, and the relationships between different entities. It acts as a framework that ensures data integrity and consistency by specifying permissible data types and constraints for each attribute.

Units

Standard measurements used to quantify and compare different physical quantities.

Water Quality

The chemical, physical, biological, and radiological characteristics of water, typically in relation to its suitability for a specific purpose, such as drinking, swimming, or ecological health. It is determined by assessing a variety of parameters, including but not limited to pH, turbidity, microbial content, dissolved oxygen, presence of substances and temperature.

Data History

Data Origin

These samples were taken from customer taps. They were then analysed for water quality, and the results were uploaded to a database. This dataset is an extract from this database.

Data Triage Considerations

Granularity

Is it useful to share results as averages or individual?

We decided to share as individual results as the lowest level of granularity

Anonymisation

It is a requirement that this data cannot be used to identify a singular person or household. We discussed many options for aggregating the data to a specific geography to ensure this requirement is met. The following geographical aggregations were discussed:

<!--·
Water Supply Zone (WSZ) - Limits interoperability with other datasets

<!--·
Postcode – Some postcodes contain very few households and may not offer necessary anonymisation

<!--·
Postal Sector – Deemed not granular enough in highly populated areas

<!--·
Rounded Co-ordinates – Not a recognised standard and may cause overlapping areas

<!--·
MSOA – Deemed not granular enough

<!--·
LSOA – Agreed as a recognised standard appropriate for England and Wales

<!--·
Data Zones – Agreed as a recognised standard appropriate for Scotland

Data Specifications

Each dataset will cover a calendar year of samples

This dataset will be published annually

Historical datasets will be published as far back as 2016 from the introduction of of The Water Supply (Water Quality) Regulations 2016

The Determinands included in the dataset are as per the list that is required to be reported to the Drinking Water Inspectorate.

Context

Many UK water companies provide a search tool on their websites where you can search for water quality in your area by postcode. The results of the search may identify the water supply zone that supplies the postcode searched. Water supply zones are not linked to LSOAs which means the results may differ to this dataset

Some sample results are influenced by internal plumbing and may not be representative of drinking water quality in the wider area.

Some samples are tested on site and others are sent to scientific laboratories.

Data Publish Frequency

Annually

Data Triage Review Frequency

Annually unless otherwise requested

Supplementary information

Below is a curated selection of links for additional reading, which provide a deeper understanding of this dataset.

<!--1.
Drinking Water Inspectorate Standards and Regulations:

<!--2.
https://www.dwi.gov.uk/drinking-water-standards-and-regulations/

<!--3.
LSOA (England and Wales) and Data Zone (Scotland):

<!--4. https://www.nrscotland.gov.uk/files/geography/2011-census/geography-bckground-info-comparison-of-thresholds.pdf

<!--5.
Description for LSOA boundaries by the ONS: Census 2021 geographies - Office for National Statistics (ons.gov.uk)

<!--[6.
Postcode to LSOA lookup tables: Postcode to 2021 Census Output Area to Lower Layer Super Output Area to Middle Layer Super Output Area to Local Authority District (August 2023) Lookup in the UK (statistics.gov.uk)

<!--7.
Legislation history: Legislation - Drinking Water Inspectorate (dwi.gov.uk)
Global number of breached user accounts Q1 2020-Q3 2024
statista.com
Updated Nov 8, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Global number of breached user accounts Q1 2020-Q3 2024 [Dataset]. https://www.statista.com/statistics/1307426/number-of-data-breaches-worldwide/
Explore at:
Dataset updated
Nov 8, 2024
Dataset authored and provided by
Statistahttp://statista.com/
Area covered
Worldwide
Description
During the third quarter of 2024, data breaches exposed more than 422 million records worldwide. Since the first quarter of 2020, the highest number of data records were exposed in the first quarter of 202, more than 818 million data sets. Data breaches remain among the biggest concerns of company leaders worldwide. The most common causes of sensitive information loss were operating system vulnerabilities on endpoint devices. Which industries see the most data breaches? Meanwhile, certain conditions make some industry sectors more prone to data breaches than others. According to the latest observations, the public administration experienced the highest number of data breaches between 2021 and 2022. The industry saw 495 reported data breach incidents with confirmed data loss. The second were financial institutions, with 421 data breach cases, followed by healthcare providers. Data breach cost Data breach incidents have various consequences, the most common impact being financial losses and business disruptions. As of 2023, the average data breach cost across businesses worldwide was 4.45 million U.S. dollars. Meanwhile, a leaked data record cost about 165 U.S. dollars. The United States saw the highest average breach cost globally, at 9.48 million U.S. dollars.
Number of data compromises and impacted individuals in U.S. 2005-2023
statista.com
Updated Dec 10, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Statista (2024). Number of data compromises and impacted individuals in U.S. 2005-2023 [Dataset]. https://www.statista.com/statistics/273550/data-breaches-recorded-in-the-united-states-by-number-of-breaches-and-records-exposed/
Explore at:
Dataset updated
Dec 10, 2024
Dataset authored and provided by
Statistahttp://statista.com/
Area covered
United States
Description
In 2023, the number of data compromises in the United States stood at 3,205 cases. Meanwhile, over 353 million individuals were affected in the same year by data compromises, including data breaches, leakage, and exposure. While these are three different events, they have one thing in common. As a result of all three incidents, the sensitive data is accessed by an unauthorized threat actor. Industries most vulnerable to data breaches Some industry sectors usually see more significant cases of private data violations than others. This is determined by the type and volume of the personal information organizations of these sectors store. In 2022, healthcare, financial services, and manufacturing were the three industry sectors that recorded most data breaches. The number of healthcare data breaches in the United States has gradually increased within the past few years. In the financial sector, data compromises increased almost twice between 2020 and 2022, while manufacturing saw an increase of more than three times in data compromise incidents. Largest data exposures worldwide In 2020, an adult streaming website, CAM4, experienced a leakage of nearly 11 billion records. This, by far, is the most extensive reported data leakage. This case, though, is unique because cyber security researchers found the vulnerability before the cyber criminals. The second-largest data breach is the Yahoo data breach, dating back to 2013. The company first reported about one billion exposed records, then later, in 2017, came up with an updated number of leaked records, which was three billion. In March 2018, the third biggest data breach happened, involving India’s national identification database Aadhaar. As a result of this incident, over 1.1 billion records were exposed.
s
SES Water Domestic Consumption
streamwaterdata.co.uk
Updated Apr 26, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
dpararajasingam_ses (2024). SES Water Domestic Consumption [Dataset]. https://www.streamwaterdata.co.uk/maps/f2cdc1248fcf4fd289ac1d3f25e75b3b_0/about
Explore at:
Dataset updated
Apr 26, 2024
Dataset authored and provided by
dpararajasingam_ses
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Overview    This dataset offers valuable insights into yearly domestic water consumption across various Lower Super Output Areas (LSOAs) or Data Zones, accompanied by the count of water meters within each area. It is instrumental for analysing residential water use patterns, facilitating water conservation efforts, and guiding infrastructure development and policy making at a localised level. Key Definitions    Aggregation   The process of summarising or grouping data to obtain a single or reduced set of information, often for analysis or reporting purposes.     AMR Meter Automatic meter reading (AMR) is the technology of automatically collecting consumption, diagnostic, and status data from a water meter remotely and periodically. Dataset   Structured and organised collection of related elements, often stored digitally, used for analysis and interpretation in various fields.  Data Zone Data zones are the key geography for the dissemination of small area statistics in Scotland Dumb Meter A dumb meter or analogue meter is read manually. It does not have any external connectivity. Granularity   Data granularity is a measure of the level of detail in a data structure. In time-series data, for example, the granularity of measurement might be based on intervals of years, months, weeks, days, or hours   ID   Abbreviation for Identification that refers to any means of verifying the unique identifier assigned to each asset for the purposes of tracking, management, and maintenance.    LSOA Lower Layer Super Output Areas (LSOA) are a geographic hierarchy designed to improve the reporting of small area statistics in England and Wales. Open Data Triage   The process carried out by a Data Custodian to determine if there is any evidence of sensitivities associated with Data Assets, their associated Metadata and Software Scripts used to process Data Assets if they are used as Open Data.    Schema   Structure for organising and handling data within a dataset, defining the attributes, their data types, and the relationships between different entities. It acts as a framework that ensures data integrity and consistency by specifying permissible data types and constraints for each attribute.    Smart Meter A smart meter is an electronic device that records information and communicates it to the consumer and the supplier. It differs from automatic meter reading (AMR) in that it enables two-way communication between the meter and the supplier. Units   Standard measurements used to quantify and compare different physical quantities.  Water Meter Water metering is the practice of measuring water use. Water meters measure the volume of water used by residential and commercial building units that are supplied with water by a public water supply system. Data History    Data Origin    Domestic consumption data is recorded using water meters. The consumption recorded is then sent back to water companies. This dataset is extracted from the water companies. Data Triage Considerations    This section discusses the careful handling of data to maintain anonymity and addresses the challenges associated with data updates, such as identifying household changes or meter replacements. Identification of Critical Infrastructure  This aspect is not applicable for the dataset, as the focus is on domestic water consumption and does not contain any information that reveals critical infrastructure details. Commercial Risks and Anonymisation Individual Identification Risks There is a potential risk of identifying individuals or households if the consumption data is updated irregularly (e.g., every 6 months) and an out-of-cycle update occurs (e.g., after 2 months), which could signal a change in occupancy or ownership. Such patterns need careful handling to avoid accidental exposure of sensitive information. Meter and Property Association Challenges arise in maintaining historical data integrity when meters are replaced but the property remains the same. Ensuring continuity in the data without revealing personal information is crucial. Interpretation of Null Consumption Instances of null consumption could be misunderstood as a lack of water use, whereas they might simply indicate missing data. Distinguishing between these scenarios is vital to prevent misleading conclusions. Meter Re-reads The dataset must account for instances where meters are read multiple times for accuracy. Joint Supplies & Multiple Meters per Household Special consideration is required for households with multiple meters as well as multiple households that share a meter as this could complicate data aggregation. Schema Consistency with the Energy Industry: In formulating the schema for the domestic water consumption dataset, careful consideration was given to the potential risks to individual privacy. This evaluation included examining the frequency of data updates, the handling of property and meter associations, interpretations of null consumption, meter re-reads, joint suppliers, and the presence of multiple meters within a single household as described above. After a thorough assessment of these factors and their implications for individual privacy, it was decided to align the dataset's schema with the standards established within the energy industry. This decision was influenced by the energy sector's experience and established practices in managing similar risks associated with smart meters. This ensures a high level of data integrity and privacy protection. Schema The dataset schema is aligned with those used in the energy industry, which has encountered similar challenges with smart meters. However, it is important to note that the energy industry has a much higher density of meter distribution, especially smart meters. Aggregation to Mitigate Risks The dataset employs an elevated level of data aggregation to minimise the risk of individual identification. This approach is crucial in maintaining the utility of the dataset while ensuring individual privacy. The aggregation level is carefully chosen to remove identifiable risks without excluding valuable data, thus balancing data utility with privacy concerns. Data Freshness  Users should be aware that this dataset reflects historical consumption patterns and does not represent real-time data. Publish Frequency  Annually Data Triage Review Frequency    An annual review is conducted to ensure the dataset's relevance and accuracy, with adjustments made based on specific requests or evolving data trends. Data Specifications   For the domestic water consumption dataset, the data specifications are designed to ensure comprehensiveness and relevance, while maintaining clarity and focus. The specifications for this dataset include: Each dataset encompasses recordings of domestic water consumption as measured and reported by the data publisher. It excludes commercial consumption. Where it is necessary to estimate consumption, this is calculated based on actual meter readings. Meters of all types (smart, dumb, AMR) are included in this dataset. The dataset is updated and published annually. Historical data may be made available to facilitate trend analysis and comparative studies, although it is not mandatory for each dataset release. Context   Users are cautioned against using the dataset for immediate operational decisions regarding water supply management. The data should be interpreted considering potential seasonal and weather-related influences on water consumption patterns. The geographical data provided does not pinpoint locations of water meters within an LSOA. The dataset aims to cover a broad spectrum of households, from single-meter homes to those with multiple meters, to accurately reflect the diversity of water use within an LSOA.

Facebook

Twitter

Click to copy link

Link copied

Cite

data.staging.idas-ds1.appdat.jsc.nasa.gov (2025). Privacy Preserving Distributed Data Mining [Dataset]. https://data.staging.idas-ds1.appdat.jsc.nasa.gov/dataset/privacy-preserving-distributed-data-mining

Privacy Preserving Distributed Data Mining

Explore at:

Dataset updated

Feb 18, 2025

Dataset provided by

NASAhttp://nasa.gov/

Description

Distributed data mining from privacy-sensitive multi-party data is likely to play an important role in the next generation of integrated vehicle health monitoring systems. For example, consider an airline manufacturer [tex]$\mathcal{C}$[/tex] manufacturing an aircraft model [tex]$A$[/tex] and selling it to five different airline operating companies [tex]$\mathcal{V}_1 \dots \mathcal{V}_5$[/tex]. These aircrafts, during their operation, generate huge amount of data. Mining this data can reveal useful information regarding the health and operability of the aircraft which can be useful for disaster management and prediction of efficient operating regimes. Now if the manufacturer [tex]$\mathcal{C}$[/tex] wants to analyze the performance data collected from different aircrafts of model-type [tex]$A$[/tex] belonging to different airlines then central collection of data for subsequent analysis may not be an option. It should be noted that the result of this analysis may be statistically more significant if the data for aircraft model [tex]$A$[/tex] across all companies were available to [tex]$\mathcal{C}$[/tex]. The potential problems arising out of such a data mining scenario are:

Clear search

Close search

Google apps

Main menu

Privacy Preserving Distributed Data Mining

Privacy-Sensitive Conversations between Care Workers and Care Home Residents...

Dataset Card for "privacy-care-interactions"

Table of Contents

Dataset Description

Purpose and Features

Dataset Overview

Language Distribution 🌍

Locale Distribution 🌎

Key Facts 🔑

Dataset Structure

Data Instances

Data Fields

Dataset Splits

Dataset Creation

Curation Rationale

Source Data

Initial Data Collection

Data Processing

Dairy Supply Chain Sales Dataset

Yahoo Password Frequency Corpus

Context

Content

Data from: LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive...

Face Detection Dataset for Programmable Threshold-Based Sparse-Vision

Daily Global Historical Climatology Network

Context

Content

Acknowledgements

Inspiration

Eindhoven open data principles

COVID-19 Case Surveillance Public Use Data

CDC has three COVID-19 case surveillance datasets:

Overview

COVID-19 Case Reports

Data are Considered Provisional

Data Limitations

Data Quality Assurance Procedures

Data Suppression

Additional COVID-19 Data

wine_quality

Portsmouth Water Domestic Consumption

Open Data Inventory

Crime Data 2023 - Part 1 Offenses (With Lat & Long Info)

Synthetic Data Generation Market Report

Data from: WikiReddit: Tracing Information and Attention Flows Between...

Preprint

Abstract

Datasheet

Motivation

Composition

Collection Process

Preprocessing/cleaning/labeling

Uses

Distribution

Maintenance

SQL Database Schema

Table: posts

Table: comments

Table: postlinks

Table: commentlinks

Syntegra Synthetic EHR Data | Structured Healthcare Electronic Health Record...

Portsmouth Water Drinking Water Quality Data 2022, 2023 & 2024

Global number of breached user accounts Q1 2020-Q3 2024

Number of data compromises and impacted individuals in U.S. 2005-2023

SES Water Domestic Consumption

Privacy Preserving Distributed Data Mining

Table: `posts`

Table: `comments`

Table: `postlinks`

Table: `commentlinks`