46 datasets found
  1. Data from: Synthetic time series data generation for edge analytics

    • zenodo.org
    bin
    Updated Nov 25, 2021
    Cite
    Subarmaniam Kannan; Subarmaniam Kannan (2021). Synthetic time series data generation for edge analytics [Dataset]. http://doi.org/10.5281/zenodo.5673806
    Available download formats: bin
    Dataset updated
    Nov 25, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Subarmaniam Kannan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In this research, we create synthetic data with features similar to data from IoT devices. We use an existing air quality dataset that includes temperature and gas sensor measurements. This real-time dataset includes component values for the Air Quality Index (AQI) and ppm concentrations for various polluting gases. We build a JavaScript Object Notation (JSON) model that captures the distribution of variables and the structure of this real dataset in order to generate the synthetic data. Based on the synthetic dataset and the original dataset, we create a comparative predictive model. Analysis of the predictive model built on the synthetic dataset shows that it can successfully be used for edge analytics purposes, replacing real-world datasets. There is no significant difference between the real-world dataset and the synthetic dataset. The generated synthetic data requires no modification to suit the edge computing requirements. The framework can generate correct synthetic datasets based on JSON schema attributes. The accuracy, precision, and recall values for the real and synthetic datasets indicate that the logistic regression model is capable of successfully classifying the data.
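
    As an illustration of the comparison described above, the following is a minimal sketch that trains a logistic regression classifier on a real and a synthetic table and reports accuracy, precision, and recall. The file names and the binary 'aqi_class' label column are hypothetical placeholders, not part of the published dataset.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, precision_score, recall_score
    from sklearn.model_selection import train_test_split

    def evaluate(csv_path):
        # Fit a logistic regression classifier on one dataset and report test metrics.
        df = pd.read_csv(csv_path)
        X = df.drop(columns=['aqi_class'])   # hypothetical binary label column
        y = df['aqi_class']
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
        pred = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict(X_test)
        return accuracy_score(y_test, pred), precision_score(y_test, pred), recall_score(y_test, pred)

    # Compare metrics obtained on the real and the synthetic dataset (hypothetical file names).
    print('real     :', evaluate('real_air_quality.csv'))
    print('synthetic:', evaluate('synthetic_air_quality.csv'))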

  2. Data from: A large synthetic dataset for machine learning applications in...

    • zenodo.org
    csv, json, png, zip
    Updated Mar 25, 2025
    Cite
    Marc Gillioz; Marc Gillioz; Guillaume Dubuis; Philippe Jacquod; Philippe Jacquod; Guillaume Dubuis (2025). A large synthetic dataset for machine learning applications in power transmission grids [Dataset]. http://doi.org/10.5281/zenodo.13378476
    Available download formats: zip, png, csv, json
    Dataset updated
    Mar 25, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marc Gillioz; Guillaume Dubuis; Philippe Jacquod
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    With the ongoing energy transition, power grids are evolving fast. They operate more and more often close to their technical limits, under more and more volatile conditions. Fast, essentially real-time computational approaches to evaluate their operational safety, stability and reliability are therefore highly desirable. Machine Learning methods have been advocated to solve this challenge; however, they are heavy consumers of training and testing data, while historical operational data for real-world power grids are hard, if not impossible, to access.

    This dataset contains long time series for production, consumption, and line flows, amounting to 20 years of data with a time resolution of one hour, for several thousand loads and several hundred generators of various types representing the ultra-high-voltage transmission grid of continental Europe. The synthetic time series have been statistically validated against real-world data.

    Data generation algorithm

    The algorithm is described in a Nature Scientific Data paper. It relies on the PanTaGruEl model of the European transmission network -- the admittance of its lines as well as the location, type and capacity of its power generators -- and aggregated data gathered from the ENTSO-E transparency platform, such as power consumption aggregated at the national level.

    Network

    The network information is encoded in the file europe_network.json. It is given in PowerModels format, which is itself derived from MatPower and compatible with PandaPower. The network features 7822 power lines and 553 transformers connecting 4097 buses, to which are attached 815 generators of various types.

    Time series

    The time series forming the core of this dataset are given in CSV format. Each CSV file is a table with 8736 rows, one for each hourly time step of a 364-day year. All years are truncated to exactly 52 weeks of 7 days, and start on a Monday (the load profiles are typically different during weekdays and weekends). The number of columns depends on the type of table: there are 4097 columns in load files, 815 for generators, and 8375 for lines (including transformers). Each column is described by a header corresponding to the element identifier in the network file. All values are given in per-unit, both in the model file and in the tables, i.e. they are multiples of a base unit taken to be 100 MW.
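
    As a small illustration of the per-unit convention, the snippet below converts one loads table to megawatts. The file name follows the labeling scheme described in the next paragraph, and 100 MW is the base unit stated above.

    import pandas as pd

    BASE_MW = 100  # 1.0 per-unit corresponds to 100 MW, as stated above
    loads_pu = pd.read_csv('loads_2016_1.csv')   # one loads table, named as in the next paragraph
    loads_mw = loads_pu * BASE_MW                # element-wise conversion from per-unit to MW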

    There are 20 tables of each type, labeled with a reference year (2016 to 2020) and an index (1 to 4), zipped into archive files arranged by year. This amounts to a total of 20 years of synthetic data. When using loads, generators, and lines profiles together, it is important to use the same label: for instance, the files loads_2020_1.csv, gens_2020_1.csv, and lines_2020_1.csv represent the same year of the dataset, whereas gens_2020_2.csv is unrelated (it actually shares some features, such as nuclear profiles, but it is based on a dispatch with distinct loads).

    Usage

    The time series can be used without a reference to the network file, simply using all or a selection of columns of the CSV files, depending on the needs. We show below how to select series from a particular country, or how to aggregate hourly time steps into days or weeks. These examples use Python and the data analysis library pandas, but other frameworks can be used as well (Matlab, Julia). Since all the yearly time series are periodic, it is always possible to define a coherent time window modulo the length of the series.

    Selecting a particular country

    This example illustrates how to select generation data for Switzerland in Python. This can be done without parsing the network file, but using instead gens_by_country.csv, which contains a list of all generators for any country in the network. We start by importing the pandas library, and read the column of the file corresponding to Switzerland (country code CH):

    import pandas as pd
    CH_gens = pd.read_csv('gens_by_country.csv', usecols=['CH'], dtype=str)

    The object created in this way is a DataFrame with some null values (not all countries have the same number of generators). It can be turned into a list with:

    CH_gens_list = CH_gens.dropna().squeeze().to_list()

    Finally, we can import all the time series of Swiss generators from a given data table with

    pd.read_csv('gens_2016_1.csv', usecols=CH_gens_list)

    The same procedure can be applied to loads using the list contained in the file loads_by_country.csv.

    Averaging over time

    This second example shows how to change the time resolution of the series. Suppose that we are interested in all the loads from a given table, which are given by default with a one-hour resolution:

    hourly_loads = pd.read_csv('loads_2018_3.csv')

    To get a daily average of the loads, we can use:

    daily_loads = hourly_loads.groupby([t // 24 for t in range(24 * 364)]).mean()

    This results in series of length 364. To average further over entire weeks and get series of length 52, we use:

    weekly_loads = hourly_loads.groupby([t // (24 * 7) for t in range(24 * 364)]).mean()

    Source code

    The code used to generate the dataset is freely available at https://github.com/GeeeHesso/PowerData. It consists of two packages and several documentation notebooks. The first package, written in Python, provides functions to handle the data and to generate synthetic series based on historical data. The second package, written in Julia, is used to perform the optimal power flow. The documentation, in the form of Jupyter notebooks, contains numerous examples of how to use both packages. The entire workflow used to create this dataset is also provided, starting from raw ENTSO-E data files and ending with the synthetic dataset given in the repository.

    Funding

    This work was supported by the Cyber-Defence Campus of armasuisse and by an internal research grant of the Engineering and Architecture domain of HES-SO.

  3. Synthetic Datasets of Water Consumptions

    • data.mendeley.com
    Updated Feb 8, 2021
    + more versions
    Cite
    Marta Santos (2021). Synthetic Datasets of Water Consumptions [Dataset]. http://doi.org/10.17632/v4ynw83j6k.2
    Dataset updated
    Feb 8, 2021
    Authors
    Marta Santos
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A group of 6 synthetic datasets is provided with a 914-day water consumption time horizon and structural breaks, based on actual datasets from a hotel and a hospital. The parameters of the best-fit probability distributions to the actual water consumption data were used to generate these datasets: the gamma distribution for the hotel dataset, and the gamma and logistic distributions for the hospital dataset. Two structural breaks of 5% and 10% in the mean of the distributions were added to simulate reductions in water consumption patterns.
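
    A minimal sketch of the generation idea described above, using illustrative gamma parameters (the actual best-fit parameters are not reproduced here) and a 5% break at an arbitrary position:

    import numpy as np

    rng = np.random.default_rng(42)
    horizon, break_day = 914, 457      # 914-day horizon; break position chosen arbitrarily
    shape, scale = 2.0, 50.0           # illustrative gamma parameters, not the fitted ones

    consumption = rng.gamma(shape, scale, size=horizon)
    consumption[break_day:] *= 0.95    # 5% reduction in the mean after the structural break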

  4. Synthetic Data Generation for Hard Drive Failure Prediction in Large-scale...

    • figshare.com
    zip
    Updated Apr 27, 2025
    Cite
    Chandranil Chakraborttii (2025). Synthetic Data Generation for Hard Drive Failure Prediction in Large-scale Systems [Dataset]. http://doi.org/10.6084/m9.figshare.28878830.v1
    Available download formats: zip
    Dataset updated
    Apr 27, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Chandranil Chakraborttii
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Accurate failure prediction is critical for the reliability of HPC facilities and data center storage systems. This study addresses data scarcity, privacy concerns, and class imbalance in HDD failure datasets by leveraging synthetic data generation. We propose an end-to-end framework to generate synthetic storage data using Generative Adversarial Networks and Diffusion models. We implement a data segmentation approach that considers the temporal variation of disk access to generate high-fidelity synthetic data that replicates the nuanced temporal and feature-specific patterns of disk failures. Experimental results show that synthetic data achieves similarity scores of 0.81–0.89 and enhances failure prediction performance, with up to 3% improvement in accuracy and 2% in ROC-AUC. With only minor performance drops versus real-data training, synthetically trained models prove viable for predictive maintenance.
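
    The similarity scores quoted above come from the authors' own framework; as a rough stand-in, a per-feature two-sample Kolmogorov-Smirnov comparison between a real and a synthetic table could look like the sketch below (file names and columns are hypothetical, not part of this dataset).

    import pandas as pd
    from scipy.stats import ks_2samp

    real = pd.read_csv('real_smart.csv')           # hypothetical file names
    synthetic = pd.read_csv('synthetic_smart.csv')

    # 1 - KS statistic per shared numeric column: closer to 1 means more similar marginals.
    for col in real.select_dtypes('number').columns.intersection(synthetic.columns):
        stat, _ = ks_2samp(real[col].dropna(), synthetic[col].dropna())
        print(col, round(1 - stat, 3))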

  5. ActiveHuman Part 2

    • zenodo.org
    • data.niaid.nih.gov
    Updated Apr 24, 2025
    + more versions
    Cite
    Charalampos Georgiadis; Charalampos Georgiadis (2025). ActiveHuman Part 2 [Dataset]. http://doi.org/10.5281/zenodo.8361114
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Charalampos Georgiadis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is Part 2/2 of the ActiveHuman dataset! Part 1 can be found here.

    Dataset Description

    ActiveHuman was generated using Unity's Perception package.

    It consists of 175428 RGB images and their semantic segmentation counterparts taken in different environments, lighting conditions, camera distances and angles. In total, the dataset contains images for 8 environments, 33 humans, 4 lighting conditions, 7 camera distances (1m-4m) and 36 camera angles (0-360 at 10-degree intervals).

    The dataset does not include images at every single combination of available camera distances and angles, since for some values the camera would collide with another object or go outside the confines of an environment. As a result, some combinations of camera distances and angles do not exist in the dataset.

    Alongside each image, 2D Bounding Box, 3D Bounding Box and Keypoint ground truth annotations are also generated via the use of Labelers and are stored as a JSON-based dataset. These Labelers are scripts that are responsible for capturing ground truth annotations for each captured image or frame. Keypoint annotations follow the COCO format defined by the COCO keypoint annotation template offered in the perception package.
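
    As a rough sketch of consuming these JSON annotations, the snippet below walks one captures file and prints the 2D bounding-box values described later in this section. The file name and the top-level 'captures' key are assumptions about the Perception output layout, not guaranteed by this description.

    import json

    with open('captures_000.json') as f:            # hypothetical captures file name
        data = json.load(f)

    for capture in data.get('captures', []):        # top-level key assumed here
        for annotation in capture.get('annotations', []):
            for value in annotation.get('values', []):
                # 2D bounding boxes carry x, y, width, height (see the field list below).
                if {'x', 'y', 'width', 'height'} <= value.keys():
                    print(value.get('label_name'), value['x'], value['y'],
                          value['width'], value['height'])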

    Folder configuration

    The dataset consists of 3 folders:

    • JSON Data: Contains all the generated JSON files.
    • RGB Images: Contains the generated RGB images.
    • Semantic Segmentation Images: Contains the generated semantic segmentation images.

    Essential Terminology

    • Annotation: Recorded data describing a single capture.
    • Capture: One completed rendering process of a Unity sensor which stored the rendered result to data files (e.g. PNG, JPG, etc.).
    • Ego: Object or person to which a collection of sensors is attached (e.g., if a drone has a camera attached to it, the drone would be the ego and the camera would be the sensor).
    • Ego coordinate system: Coordinates with respect to the ego.
    • Global coordinate system: Coordinates with respect to the global origin in Unity.
    • Sensor: Device that captures the dataset (in this instance the sensor is a camera).
    • Sensor coordinate system: Coordinates with respect to the sensor.
    • Sequence: Time-ordered series of captures. This is very useful for video capture where the time-order relationship of two captures is vital.
    • UUID: Universally Unique Identifier. It is a unique hexadecimal identifier that can represent an individual instance of a capture, ego, sensor, annotation, labeled object or keypoint, or keypoint template.

    Dataset Data

    The dataset includes 4 types of JSON annotation files:

    • annotation_definitions.json: Contains annotation definitions for all of the active Labelers of the simulation stored in an array. Each entry consists of a collection of key-value pairs which describe a particular type of annotation and contain information about that specific annotation describing how its data should be mapped back to labels or objects in the scene. Each entry contains the following key-value pairs:
      • id: Integer identifier of the annotation's definition.
      • name: Annotation name (e.g., keypoints, bounding box, bounding box 3D, semantic segmentation).
      • description: Description of the annotation's specifications.
      • format: Format of the file containing the annotation specifications (e.g., json, PNG).
      • spec: Format-specific specifications for the annotation values generated by each Labeler.

    Most Labelers generate different annotation specifications in the spec key-value pair:

    • BoundingBox2DLabeler/BoundingBox3DLabeler:
      • label_id: Integer identifier of a label.
      • label_name: String identifier of a label.
    • KeypointLabeler:
      • template_id: Keypoint template UUID.
      • template_name: Name of the keypoint template.
      • key_points: Array containing all the joints defined by the keypoint template. This array includes the key-value pairs:
        • label: Joint label.
        • index: Joint index.
        • color: RGBA values of the keypoint.
        • color_code: Hex color code of the keypoint
      • skeleton: Array containing all the skeleton connections defined by the keypoint template. Each skeleton connection defines a connection between two different joints. This array includes the key-value pairs:
        • label1: Label of the first joint.
        • label2: Label of the second joint.
        • joint1: Index of the first joint.
        • joint2: Index of the second joint.
        • color: RGBA values of the connection.
        • color_code: Hex color code of the connection.
    • SemanticSegmentationLabeler:
      • label_name: String identifier of a label.
      • pixel_value: RGBA values of the label.
      • color_code: Hex color code of the label.

    • captures_xyz.json: Each of these files contains an array of ground truth annotations generated by each active Labeler for each capture separately, as well as extra metadata that describe the state of each active sensor that is present in the scene. Each array entry contains the following key-value pairs:
      • id: UUID of the capture.
      • sequence_id: UUID of the sequence.
      • step: Index of the capture within a sequence.
      • timestamp: Timestamp (in ms) since the beginning of a sequence.
      • sensor: Properties of the sensor. This entry contains a collection with the following key-value pairs:
        • sensor_id: Sensor UUID.
        • ego_id: Ego UUID.
        • modality: Modality of the sensor (e.g., camera, radar).
        • translation: 3D vector that describes the sensor's position (in meters) with respect to the global coordinate system.
        • rotation: Quaternion variable that describes the sensor's orientation with respect to the ego coordinate system.
        • camera_intrinsic: matrix containing (if it exists) the camera's intrinsic calibration.
        • projection: Projection type used by the camera (e.g., orthographic, perspective).
      • ego: Attributes of the ego. This entry contains a collection with the following key-value pairs:
        • ego_id: Ego UUID.
        • translation: 3D vector that describes the ego's position (in meters) with respect to the global coordinate system.
        • rotation: Quaternion variable containing the ego's orientation.
        • velocity: 3D vector containing the ego's velocity (in meters per second).
        • acceleration: 3D vector containing the ego's acceleration (in meters per second squared).
      • format: Format of the file captured by the sensor (e.g., PNG, JPG).
      • annotations: Key-value pair collections, one for each active Labeler. These key-value pairs are as follows:
        • id: Annotation UUID.
        • annotation_definition: Integer identifier of the annotation's definition.
        • filename: Name of the file generated by the Labeler. This entry is only present for Labelers that generate an image.
        • values: List of key-value pairs containing annotation data for the current Labeler.

    Each Labeler generates different annotation specifications in the values key-value pair:

    • BoundingBox2DLabeler:
      • label_id: Integer identifier of a label.
      • label_name: String identifier of a label.
      • instance_id: UUID of one instance of an object. Each object with the same label that is visible on the same capture has different instance_id values.
      • x: Position of the 2D bounding box on the X axis.
      • y: Position of the 2D bounding box position on the Y axis.
      • width: Width of the 2D bounding box.
      • height: Height of the 2D bounding box.
    • BoundingBox3DLabeler:
      • label_id: Integer identifier of a label.
      • label_name: String identifier of a label.
      • instance_id: UUID of one instance of an object. Each object with the same label that is visible on the same capture has different instance_id values.
      • translation: 3D vector containing the location of the center of the 3D bounding box with respect to the sensor coordinate system (in meters).
      • size: 3D

  6. Details of parameters used to generate synthetic data.

    • plos.figshare.com
    xls
    Updated Jun 10, 2023
    Cite
    Kanchan Aggarwal; Siddhartha Mukhopadhya; Arun K. Tangirala (2023). Details of parameters used to generate synthetic data. [Dataset]. http://doi.org/10.1371/journal.pone.0250008.t003
    Available download formats: xls
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Kanchan Aggarwal; Siddhartha Mukhopadhya; Arun K. Tangirala
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Details of parameters used to generate synthetic data.

  7. Controlled Anomalies Time Series (CATS) Dataset

    • zenodo.org
    • explore.openaire.eu
    bin
    Updated Jul 12, 2024
    + more versions
    Cite
    Patrick Fleith; Patrick Fleith (2024). Controlled Anomalies Time Series (CATS) Dataset [Dataset]. http://doi.org/10.5281/zenodo.7646897
    Available download formats: bin
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Solenix Engineering GmbH
    Authors
    Patrick Fleith
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Controlled Anomalies Time Series (CATS) Dataset consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with 200 injected anomalies.

    The CATS Dataset exhibits a set of desirable properties that make it very suitable for benchmarking Anomaly Detection Algorithms in Multivariate Time Series [1]:

    • Multivariate (17 variables) including sensor readings and control signals. It simulates the operational behaviour of an arbitrary complex system including:
      • 4 Deliberate Actuations / Control Commands sent by a simulated operator / controller, for instance, commands of an operator to turn ON/OFF some equipment.
      • 3 Environmental Stimuli / External Forces acting on the system and affecting its behaviour, for instance, the wind affecting the orientation of a large ground antenna.
      • 10 Telemetry Readings representing the observable states of the complex system by means of sensors, for instance, a position, a temperature, a pressure, a voltage, current, humidity, velocity, acceleration, etc.
    • 5 million timestamps. Sensor readings are at 1 Hz sampling frequency.
      • 1 million nominal observations (the first 1 million datapoints). This is suitable to start learning the "normal" behaviour.
      • 4 million observations that include both nominal and anomalous segments. This is suitable to evaluate both semi-supervised approaches (novelty detection) as well as unsupervised approaches (outlier detection).
    • 200 anomalous segments. One anomalous segment may contain several successive anomalous observations / timestamps. Only the last 4 million observations contain anomalous segments.
    • Different types of anomalies to understand what anomaly types can be detected by different approaches.
    • Fine control over ground truth. As this is a simulated system with deliberate anomaly injection, the start and end times of the anomalous behaviour are known very precisely. In contrast to real-world datasets, there is no risk that the ground truth contains mislabelled segments, which is often the case for real data.
    • Obvious anomalies. The simulated anomalies have been designed to be "easy" to detect for human eyes (i.e., there are very large spikes or oscillations), hence also detectable by most algorithms. This makes this synthetic dataset useful for screening tasks (i.e., to eliminate algorithms that are not capable of detecting those obvious anomalies). However, during our initial experiments, the dataset turned out to be challenging enough even for state-of-the-art anomaly detection approaches, making it suitable also for regular benchmark studies.
    • Context provided. Some variables can only be considered anomalous in relation to other behaviours. A typical example consists of a light and switch pair. The light being either on or off is nominal, the same goes for the switch, but having the switch on and the light off shall be considered anomalous. In the CATS dataset, users can choose (or not) to use the available context, and external stimuli, to test the usefulness of the context for detecting anomalies in this simulation.
    • Pure signal ideal for robustness-to-noise analysis. The simulated signals are provided without noise: while this may seem unrealistic at first, it is an advantage, since users of the dataset can add any type and amplitude of noise on top of the provided series. This makes it well suited to test how sensitive and robust detection algorithms are against various levels of noise (a minimal sketch follows this list).
    • No missing data. You can drop whatever data you want to assess the impact of missing values on your detector with respect to a clean baseline.
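
    As mentioned in the pure-signal point above, users can add their own noise on top of the series. A minimal sketch, assuming a hypothetical data.csv holding the 17 variables as numeric columns:

    import numpy as np
    import pandas as pd

    df = pd.read_csv('data.csv')                     # hypothetical file name
    signals = df.select_dtypes('number')             # keep only the numeric channels
    rng = np.random.default_rng(0)

    # Zero-mean Gaussian noise with amplitude set to 1% of each channel's standard deviation.
    noisy = signals + rng.normal(0.0, 0.01 * signals.std().to_numpy(), size=signals.shape)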

    [1] Example Benchmark of Anomaly Detection in Time Series: “Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly Detection in Time Series: A Comprehensive Evaluation. PVLDB, 15(9): 1779 - 1797, 2022. doi:10.14778/3538598.3538602”

    About Solenix

    Solenix is an international company providing software engineering, consulting services and software products for the space market. Solenix is a dynamic company that brings innovative technologies and concepts to the aerospace market, keeping up to date with technical advancements and actively promoting spin-in and spin-out technology activities. We combine modern solutions which complement conventional practices. We aspire to achieve maximum customer satisfaction by fostering collaboration, constructivism, and flexibility.

  8. Data from: SMART-DS Synthetic Electrical Network Data OpenDSS Models for...

    • catalog.data.gov
    • data.openei.org
    Updated Jan 3, 2024
    + more versions
    Cite
    National Renewable Energy Laboratory (NREL) (2024). SMART-DS Synthetic Electrical Network Data OpenDSS Models for SFO, GSO, and AUS [Dataset]. https://catalog.data.gov/dataset/smart-ds-synthetic-electrical-network-data-opendss-models-for-sfo-gso-and-aus
    Dataset updated
    Jan 3, 2024
    Dataset provided by
    National Renewable Energy Laboratory (NREL)
    Description

    The SMART-DS datasets (Synthetic Models for Advanced, Realistic Testing: Distribution systems and Scenarios) are realistic large-scale U.S. electrical distribution models for testing advanced grid algorithms and technology analysis. This document provides a user guide for the datasets. This dataset contains synthetic detailed electrical distribution network models and connected time series loads for the greater San Francisco (SFO), Greensboro (GSO), and Austin (AUS) areas. It is intended to provide researchers with very realistic and complete models that can be used for extensive powerflow simulations under a variety of scenarios. The data is synthetic, but has been validated against thousands of utility feeders to ensure statistical and operational similarity to electrical distribution networks in the US. The OpenDSS data is partitioned into several regions (each zipped separately). After unzipping these files, each region has a folder for each substation, and subsequent folders for each feeder within the substation. This allows users to simulate smaller sections of the full dataset. Each of these folders (region, substation and feeder) has a folder titled "analysis" which contains CSV files listing voltages and overloads throughout the network for the peak loading time in the year. It also contains .png files showing the loading of residential and commercial loads on the network for every day of the year, and daily breakdowns of loads for commercial building categories. Time series data is provided in the "profiles" folder, including real and reactive power at 15-minute resolution, along with parquet files in the "endues" folder with breakdowns of building end-uses.
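
    A minimal sketch of navigating the unzipped folder hierarchy described above; the region folder path is a placeholder, and only the presence of "analysis" folders with CSV summaries is assumed from the description.

    from pathlib import Path

    region = Path('SFO_region')                       # placeholder path to one unzipped region
    for analysis_dir in sorted(region.rglob('analysis')):
        # Region, substation, and feeder folders each carry an "analysis" folder
        # with CSVs of voltages and overloads at the peak loading time of the year.
        csvs = [p.name for p in analysis_dir.glob('*.csv')]
        print(analysis_dir.relative_to(region), csvs)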

  9. Heat pump COP drop - synthetic faults

    • kaggle.com
    zip
    Updated Feb 28, 2023
    Cite
    Mathieu Vallee (2023). Heat pump COP drop - synthetic faults [Dataset]. https://www.kaggle.com/datasets/mathieuvallee/ai-dhc-heatpump-cop
    Available download formats: zip (68378018 bytes)
    Dataset updated
    Feb 28, 2023
    Authors
    Mathieu Vallee
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset contains data generated in the AI DHC project.

    This dataset contains synthetic fault data for a decrease in the COP (coefficient of performance) of a heat pump.

    The IEA DHC Annex XIII project “Artificial Intelligence for Failure Detection and Forecasting of Heat Production and Heat demand in District Heating Networks” is developing Artificial Intelligence (AI) methods for forecasting heat demand and heat production and is evaluating algorithms for detecting faults which can be used by interested stakeholders (operators, suppliers of DHC components and manufacturers of control devices).

    See https://github.com/mathieu-vallee/ai-dhc for the models and Python scripts used to generate the dataset.

    Please cite this dataset as: Vallee, M., Wissocq T., Gaoua Y., Lamaison N., Generation and Evaluation of a Synthetic Dataset to improve Fault Detection in District Heating and Cooling Systems, 2023 (under review at the Energy journal)

    Disclaimer notice (IEA DHC): This project has been independently funded by the International Energy Agency Technology Collaboration Programme on District Heating and Cooling including Combined Heat and Power (IEA DHC).

    Any views expressed in this publication are not necessarily those of IEA DHC.

    IEA DHC can take no responsibility for the use of the information within this publication, nor for any errors or omissions it may contain.

    The information contained herein has been compiled from, or arrived at via, sources believed to be reliable. Nevertheless, the authors or their organizations do not accept liability for any loss or damage arising from the use thereof. Using the given information is strictly your own responsibility.

    Disclaimer Notice (Authors):

    This publication has been compiled with reasonable skill and care. However, neither the authors nor the DHC Contracting Parties (of the International Energy Agency Technology Collaboration Programme on District Heating & Cooling) make any representation as to the adequacy or accuracy of the information contained herein, or as to its suitability for any particular application, and accept no responsibility or liability arising out of the use of this publication. The information contained herein does not supersede the requirements given in any national codes, regulations or standards, and should not be regarded as a substitute

    Copyright:

    All property rights, including copyright, are vested in IEA DHC. In particular, all parts of this publication may be reproduced, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise only by crediting IEA DHC as the original source. Republishing of this report in another format or storing the report in a public retrieval system is prohibited unless explicitly permitted by the IEA DHC Operating Agent in writing.

  10. Dataset Artifact for paper "Root Cause Analysis for Microservice System...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Aug 25, 2024
    Cite
    Luan Pham; Luan Pham; Huong Ha; Huong Ha; Hongyu Zhang; Hongyu Zhang (2024). Dataset Artifact for paper "Root Cause Analysis for Microservice System based on Causal Inference: How Far Are We?" [Dataset]. http://doi.org/10.5281/zenodo.13305663
    Available download formats: zip
    Dataset updated
    Aug 25, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Luan Pham; Huong Ha; Hongyu Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Artifacts for the paper titled Root Cause Analysis for Microservice System based on Causal Inference: How Far Are We?.

    This artifact repository contains 9 compressed folders, as follows:

    ID  File Name            Description
    1   syn_circa.zip        CIRCA10 and CIRCA50 datasets for Causal Discovery
    2   syn_rcd.zip          RCD10 and RCD50 datasets for Causal Discovery
    3   syn_causil.zip       CausIL10 and CausIL50 datasets for Causal Discovery
    4   rca_circa.zip        CIRCA10 and CIRCA50 datasets for RCA
    5   rca_rcd.zip          RCD10 and RCD50 datasets for RCA
    6   online-boutique.zip  Online Boutique dataset for RCA
    7   sock-shop-1.zip      Sock Shop 1 dataset for RCA
    8   sock-shop-2.zip      Sock Shop 2 dataset for RCA
    9   train-ticket.zip     Train Ticket dataset for RCA

    Each zip file contains the generated/collected data from the corresponding data generator or microservice benchmark systems (e.g., online-boutique.zip contains metrics data collected from the Online Boutique system).

    Details about the generation of our datasets

    1. Synthetic datasets

    We use three different synthetic data generators from three previous RCA studies [15, 25, 28] to create the synthetic datasets: CIRCA, RCD, and CausIL data generators. Their mechanisms are as follows:

    1. CIRCA data generator [28] generates a random causal directed acyclic graph (DAG) based on a given number of nodes and edges. From this DAG, time series data for each node is generated using a vector auto-regression (VAR) model. A fault is injected into a node by altering the noise term in the VAR model for two timestamps.

    2. RCD data generator [25] uses the pyAgrum package [3] to generate a random DAG based on a given number of nodes, subsequently generating discrete time series data for each node, with values ranging from 0 to 5. A fault is introduced into a node by changing its conditional probability distribution.

    3. CausIL data generator [15] generates causal graphs and time series data that simulate the behavior of microservice systems. It first constructs a DAG of services and metrics based on domain knowledge, then generates metric data for each node of the DAG using regressors trained on real metrics data. Unlike the CIRCA and RCD data generators, the CausIL data generator does not have the capability to inject faults.

    To create our synthetic datasets, we first generate 10 DAGs whose nodes range from 10 to 50 for each of the synthetic data generators. Next, we generate fault-free datasets using these DAGs with different seedings, resulting in 100 cases for the CIRCA and RCD generators and 10 cases for the CausIL generator. We then create faulty datasets by introducing ten faults into each DAG and generating the corresponding faulty data, yielding 100 cases for the CIRCA and RCD data generators. The fault-free datasets (e.g. `syn_rcd`, `syn_circa`) are used to evaluate causal discovery methods, while the faulty datasets (e.g. `rca_rcd`, `rca_circa`) are used to assess RCA methods.
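
    To make the CIRCA-style mechanism in point 1 concrete, here is a toy sketch (not the actual CIRCA generator) that builds a random DAG, simulates VAR-style time series on it, and perturbs the noise term of one node for two timestamps:

    import numpy as np

    rng = np.random.default_rng(1)
    n_nodes, n_steps = 10, 1000

    # Upper-triangular weights guarantee an acyclic graph; roughly 30% of possible edges kept.
    W = np.triu(rng.uniform(0.1, 0.5, (n_nodes, n_nodes)), k=1)
    W *= rng.random((n_nodes, n_nodes)) < 0.3

    X = np.zeros((n_steps, n_nodes))
    fault_node, fault_times = 7, {500, 501}    # arbitrary fault location and timestamps

    for t in range(1, n_steps):
        scale = np.ones(n_nodes)
        if t in fault_times:
            scale[fault_node] = 10.0           # alter the noise term to inject the fault
        X[t] = X[t - 1] @ W + rng.normal(0.0, scale)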

    2. Data collected from benchmark microservice systems

    We deploy three popular benchmark microservice systems: Sock Shop [6], Online Boutique [4], and Train Ticket [8], on a four-node Kubernetes cluster hosted by AWS. Next, we use the Istio service mesh [2] with Prometheus [5] and cAdvisor [1] to monitor and collect resource-level and service-level metrics of all services, as in previous works [25, 39, 59]. To generate traffic, we use the load generators provided by these systems and customise them to explore all services with 100 to 200 users concurrently. We then introduce five common faults (CPU hog, memory leak, disk IO stress, network delay, and packet loss) into five different services within each system. Finally, we collect metrics data before and after the fault injection operation. An overview of our setup is presented in the Figure below.

    Code

    The code to reproduce the experimental results in the paper is available at https://github.com/phamquiluan/RCAEval.

    References

    As in our paper.

  11. Data from: Synthetic Wind Speed Data for Tunnel Ventilation System...

    • data.mendeley.com
    • portalinvestigacion.uniovi.es
    Updated Mar 26, 2025
    Cite
    Luciano Sanchez (2025). Synthetic Wind Speed Data for Tunnel Ventilation System Monitoring [Dataset]. http://doi.org/10.17632/jk85jpzn6j.1
    Dataset updated
    Mar 26, 2025
    Authors
    Luciano Sanchez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains synthetic time series data generated to simulate the operation of a road tunnel ventilation system subject to vehicle induced disturbances. The data were created as part of a study on generative modeling for industrial equipment condition monitoring. The primary goal is to illustrate how latent input effects (specifically, the "piston effect" induced by passing vehicles) can be accounted for in a model that estimates the relationship between the power supplied to the ventilation fans and the measured wind speed in the tunnel.

    Context and Application: In the simulated tunnel, ventilation fans are used to renew the air continuously. The fans can operate at different speeds; under normal conditions, a higher power input produces a faster wind speed. However, as the fans degrade over time, the same power input results in a lower wind speed. A major challenge in monitoring system performance is that passing vehicles generate transient disturbances (the piston effect) that temporarily alter the measured wind speed. These synthetic data mimic the operational scenario where measurements of wind speed (from anemometers placed at the tunnel entrance and exit) are corrupted by such disturbances.

    File Descriptions: The dataset comprises six CSV files. Each file contains a sequence of 600 measurements and represents one of two vehicle separation scenarios combined with three noise (disturbance) conditions. The naming convention is as follows:

    Filename Convention: The files follow the format 25_3_SYN_{separation}_G{gain}.csv, where {separation} indicates the vehicle separation (20 = low rate, 5 = high rate) and {gain} indicates the noise level (1.5 = high noise, 1.0 = medium noise, 0.5 = low noise).

    Data Format and Variables: Each CSV file includes the following columns:

    • time: Sequential time steps (in seconds).
    • u1, u2: The observable input representing the fan setpoint (power supplied to the fans).
    • y1, y2: The measured output, corresponding to the wind speed recorded by the anemometers.
    • y1clean, y2clean: The theoretical wind speed in the absence of vehicles.
    • ySSA1, ySSA2: The estimates of y1clean and y2clean obtained from u1, u2, y1, y2 in the accompanying paper.

    Note: The latent variable representing vehicle entries that cause the piston effect is not directly observable in the files; instead, its impact is embedded in the measured output.
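
    A minimal sketch of loading one file and quantifying the effect of the disturbances; the file name is built from the convention above (separation 20, medium noise):

    import pandas as pd

    df = pd.read_csv('25_3_SYN_20_G1.0.csv')

    # Root-mean-square deviation between the disturbed measurement and the vehicle-free signal.
    rmse = ((df['y1'] - df['y1clean']) ** 2).mean() ** 0.5
    print(rmse)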

    Usage: Researchers can use these data files to:

    • Reproduce the experiments described in the accompanying paper.
    • Test and benchmark alternative methods for filtering or denoising signals affected by transient disturbances.
    • Explore generative modeling techniques and inverse problem formulations in the context of equipment condition monitoring.

    Citation: If you use these data in your research, please cite the accompanying paper as well as this dataset.

  12. Synthetic Cybersecurity Logs for Anomaly Detection

    • kaggle.com
    Updated Dec 16, 2024
    Cite
    fcWebDev (2024). Synthetic Cybersecurity Logs for Anomaly Detection [Dataset]. http://doi.org/10.34740/kaggle/dsv/10211131
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Dec 16, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    fcWebDev
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset contains synthetic HTTP log data designed for cybersecurity analysis, particularly for anomaly detection tasks.

    Dataset Features

    • Timestamp: Simulated time for each log entry.
    • IP_Address: Randomized IP addresses to simulate network traffic.
    • Request_Type: Common HTTP methods (GET, POST, PUT, DELETE).
    • Status_Code: HTTP response status codes (e.g., 200, 404, 403, 500).
    • Anomaly_Flag: Binary flag indicating anomalies (1 = anomaly, 0 = normal).
    • User_Agent: Simulated user agents for device and browser identification.
    • Session_ID: Random session IDs to simulate user activity.
    • Location: Geographic locations of requests.

    Applications

    This dataset can be used for:

    • Anomaly Detection: Identify suspicious network activity or attacks.
    • Machine Learning: Train models for classification tasks (e.g., detect anomalies).
    • Cybersecurity Analysis: Analyze HTTP traffic patterns and identify threats.

    Example Challenge

    Build a machine learning model to predict the Anomaly_Flag based on the features provided.
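
    A minimal sketch of the example challenge, assuming the logs are exported to a hypothetical synthetic_logs.csv with the columns listed above:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    df = pd.read_csv('synthetic_logs.csv')            # hypothetical file name

    # One-hot encode a few categorical log fields; Anomaly_Flag is the target.
    X = pd.get_dummies(df[['Request_Type', 'Status_Code', 'Location']].astype(str))
    y = df['Anomaly_Flag']

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
    print(classification_report(y_test, model.predict(X_test)))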

  13. Data from: A Meta-Learner Approach to Multistep-Ahead Time Series Prediction...

    • data.niaid.nih.gov
    Updated May 9, 2023
    + more versions
    Cite
    Mark (2023). A Meta-Learner Approach to Multistep-Ahead Time Series Prediction [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7907676
    Dataset updated
    May 9, 2023
    Dataset provided by
    andrew.mccarren@dcu.ie
    Fouad Bahrpeyma
    Mark
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract

    The application of machine learning has become commonplace for problems in modern data science. The democratization of the decision process when choosing a machine learning algorithm has also received considerable attention through the use of meta features and automated machine learning for both classification and regression type problems. However, this is not the case for multistep-ahead time series problems. Time series models generally rely upon the series itself to make future predictions, as opposed to independent features used in regression and classification problems. The structure of a time series is generally described by features such as trend, seasonality, cyclicality, and irregularity. In this research, we demonstrate how time series metrics for these features, in conjunction with an ensemble based regression learner, were used to predict the standardized mean square error of candidate time series prediction models. These experiments used datasets that cover a wide feature space and enable researchers to select the single best performing model or the top N performing models. A robust evaluation was carried out to test the learner's performance on both synthetic and real time series.

    Proposed Dataset

    The dataset proposed here gives the results of 20-step-ahead predictions for eight Machine Learning/multistep-ahead prediction strategies on 5,842 time series datasets outlined here. It was used as the training data for the Meta Learners in this research. The meta features used are columns C to AE. Column AH outlines the method/strategy used, and columns AI to BB (the error) are the outcome variables for each prediction step. The description of the methods/strategies is as follows:

    Machine Learning methods:

    NN: Neural Network

    ARIMA: Autoregressive Integrated Moving Average

    SVR: Support Vector Regression

    LSTM: Long Short Term Memory

    RNN: Recurrent Neural Network

    Multistep ahead prediction strategy:

    OSAP: One Step ahead strategy

    MRFA: Multi Resolution Forecast Aggregation

  14. Online Retail & E-Commerce Dataset

    • kaggle.com
    Updated Mar 20, 2025
    Cite
    Ertuğrul EŞOL (2025). Online Retail & E-Commerce Dataset [Dataset]. https://www.kaggle.com/datasets/ertugrulesol/online-retail-data
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Mar 20, 2025
    Dataset provided by
    Kaggle
    Authors
    Ertuğrul EŞOL
    License

    Public Domain (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Overview:

    This dataset contains 1000 rows of synthetic online retail sales data, mimicking transactions from an e-commerce platform. It includes information about customer demographics, product details, purchase history, and (optional) reviews. This dataset is suitable for a variety of data analysis, data visualization and machine learning tasks, including but not limited to: customer segmentation, product recommendation, sales forecasting, market basket analysis, and exploring general e-commerce trends. The data was generated using the Python Faker library, ensuring realistic values and distributions, while maintaining no privacy concerns as it contains no real customer information.

    Data Source:

    This dataset is entirely synthetic. It was generated using the Python Faker library and does not represent any real individuals or transactions.

    Data Content:

    • customer_id (Integer): Unique customer identifier (ranging from 10000 to 99999).
    • order_date (Date): Order date (a random date within the last year).
    • product_id (Integer): Product identifier (ranging from 100 to 999).
    • category_id (Integer): Product category identifier (10, 20, 30, 40, or 50).
    • category_name (String): Product category name (Electronics, Fashion, Home & Living, Books & Stationery, Sports & Outdoors).
    • product_name (String): Product name (randomly selected from a list of products within the corresponding category).
    • quantity (Integer): Quantity of the product ordered (ranging from 1 to 5).
    • price (Float): Unit price of the product (ranging from 10.00 to 500.00, with two decimal places).
    • payment_method (String): Payment method used (Credit Card, Bank Transfer, Cash on Delivery).
    • city (String): Customer's city (generated using Faker's city() method, so the locations depend on the Faker locale used).
    • review_score (Integer): Customer's product rating (ranging from 1 to 5, or None with a 20% probability).
    • gender (String): Customer's gender (M/F, or None with a 10% probability).
    • age (Integer): Customer's age (ranging from 18 to 75).
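
    Since the data was generated with the Python Faker library, a row with the same shape can be sketched as below; this is an illustration of the schema, not the script used to build the dataset.

    import random
    from faker import Faker

    fake = Faker()

    # One synthetic order row following the column ranges listed above.
    row = {
        'customer_id': random.randint(10000, 99999),
        'order_date': fake.date_between(start_date='-1y', end_date='today'),
        'product_id': random.randint(100, 999),
        'category_id': random.choice([10, 20, 30, 40, 50]),
        'quantity': random.randint(1, 5),
        'price': round(random.uniform(10.0, 500.0), 2),
        'payment_method': random.choice(['Credit Card', 'Bank Transfer', 'Cash on Delivery']),
        'city': fake.city(),
        'review_score': random.choice([None, 1, 2, 3, 4, 5]),
        'gender': random.choice(['M', 'F', None]),
        'age': random.randint(18, 75),
    }
    print(row)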

    Potential Use Cases (Inspiration):

    Customer Segmentation: Group customers based on demographics, purchasing behavior, and preferences.

    Product Recommendation: Build a recommendation system to suggest products to customers based on their past purchases and browsing history.

    Sales Forecasting: Predict future sales based on historical trends.

    Market Basket Analysis: Identify products that are frequently purchased together.

    Price Optimization: Analyze the relationship between price and demand.

    Geographic Analysis: Explore sales patterns across different cities.

    Time Series Analysis: Investigate sales trends over time.

    Educational Purposes: Great for practicing data cleaning, EDA, feature engineering, and modeling.

  15. The Health Gym v2.0 Synthetic Antiretroviral Therapy (ART) for HIV Dataset

    • figshare.com
    txt
    Updated May 31, 2023
    Cite
    Nicholas Kuo (2023). The Health Gym v2.0 Synthetic Antiretroviral Therapy (ART) for HIV Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.22827878.v1
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Nicholas Kuo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ===

    This synthetic dataset, centred on ART for HIV, was synthesised employing the model outlined in reference [1], incorporating the techniques of WGAN-GP+G_EOT+VAE+Buffer.

    This dataset serves as a principal resource for the Centre for Big Data Research in Health (CBDRH) Datathon (see: CBDRH Health Data Science Datathon 2023 (cbdrh-hds-datathon-2023.github.io)). Its primary purpose is to advance the Health Data Analytics (HDAT) courses at the University of New South Wales (UNSW), providing students with exposure to synthetic yet realistic datasets that simulate real-world data.

    The dataset is composed of 534,960 records, distributed over 15 distinct columns, and is preserved in a CSV format with a size of 39.1 MB. It contains information about 8,916 synthetic patients over a period of 60 months, with data summarised on a monthly basis. The total number of records corresponds to the product of the synthetic patient count and the record duration in months, thus equating to 8,916 multiplied by 60.

    The dataset's structure encompasses 15 columns, which include 13 variables pertinent to ART for HIV as delineated in reference [1], a unique patient identifier, and a further variable signifying the specific time point.
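
    A minimal sketch of checking the stated record count, assuming a hypothetical file name and that the patient identifier is the first column:

    import pandas as pd

    df = pd.read_csv('health_gym_v2_art_for_hiv.csv')    # hypothetical file name

    # 8,916 synthetic patients x 60 monthly records each = 534,960 rows.
    print(len(df))
    print(df.groupby(df.columns[0]).size().describe())   # records per patient (identifier column assumed)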

    ===

    This dataset forms part of a continuous series of work, building upon reference [2]. For further details, kindly refer to our papers: [1] Kuo, Nicholas I., Louisa Jorm, and Sebastiano Barbieri. "Generating Synthetic Clinical Data that Capture Class Imbalanced Distributions with Generative Adversarial Networks: Example using Antiretroviral Therapy for HIV." arXiv preprint arXiv:2208.08655 (2022). [2] Kuo, Nicholas I-Hsien, et al. "The Health Gym: synthetic health-related datasets for the development of reinforcement learning algorithms." Scientific Data 9.1 (2022): 693.

    ===

    Latest edit: 16th May 2023.

  16. Supporting data and code: Beyond Economic Dispatch: Modeling Renewable...

    • zenodo.org
    zip
    Updated Apr 15, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jaruwan Pimsawan; Jaruwan Pimsawan; Stefano Galelli; Stefano Galelli (2025). Supporting data and code: Beyond Economic Dispatch: Modeling Renewable Purchase Agreements in Production Cost Models [Dataset]. http://doi.org/10.5281/zenodo.15219959
    Available download formats: zip
    Dataset updated
    Apr 15, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jaruwan Pimsawan; Stefano Galelli
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository provides the necessary data and Python code to replicate the experiments and generate the figures presented in our manuscript: "Supporting data and code: Beyond Economic Dispatch: Modeling Renewable Purchase Agreements in Production Cost Models".

    Contents:

    • pownet.zip: Contains PowNet version 3.2, the specific version of the simulation software used in this study.
    • inputs.zip: Contains essential modeling inputs required by PowNet for the experiments, including network data, and pre-generated synthetic load and solar time series.
    • scripts.zip: Contains the Python scripts used for installing PowNet, optionally regenerating synthetic data, running simulation experiments, processing results, and generating figures.
    • thai_data.zip (Reference Only): Contains raw data related to the 2023 Thai power system. This data served as a reference during the creation of the PowNet inputs for this study but is not required to run the replication experiments themselves. Code to process the raw data is also provided.

    System Requirements:

    • Python version 3.10+
    • pip package manager

    Setup Instructions:

    1. Download and Unzip Core Files: Download pownet.zip, inputs.zip, scripts.zip, and thai_data.zip. Extract their contents into the same parent folder. Your directory structure should look like this:

      Parent_Folder/
      ├── pownet/    # from pownet.zip
      ├── inputs/    # from inputs.zip
      ├── scripts/    # from scripts.zip
      ├── thai_data/ # from thai_data.zip
      ├── figures/   # Created by scripts later
      ├── outputs/ # Created by scripts later
    2. Install PowNet:

      • Open your terminal or command prompt.
      • Navigate into the pownet directory that you just extracted:

    cd path/to/Parent_Folder/pownet

    pip install -e .

      • These commands install PowNet and its required dependencies into your active Python environment.

    Workflow and Usage:

    Note: All subsequent Python script commands should be run from the scripts directory. Navigate to it first:

    cd path/to/Parent_Folder/scripts

    1. Generate Synthetic Time Series (Optional):

    • This step is optional as the required time series files are already provided within the inputs directory (extracted from inputs.zip). If you wish to regenerate them:
    • Run the generation scripts:
      python create_synthetic_load.py
      python create_synthetic_solar.py
    • Evaluate the generated time series (optional):
      python eval_synthetic_load.py
      python eval_synthetic_solar.py

    2. Calculate Total Solar Availability:

    • Process solar scenarios using data from the inputs directory:
      python process_scenario_solar.py
      

    3. Experiment 1: Compare Strategies for Modeling Purchase Obligations:

    • Run the base case simulations for different modeling strategies:
      • No Must-Take (NoMT):
        python run_basecase.py --model_name "TH23NMT"
        
      • Zero-Cost Renewables (ZCR):
        python run_basecase.py --model_name "TH23ZC"
        
      • Penalized Curtailment (Proposed Method):
        python run_basecase.py --model_name "TH23"
        
    • Run the base case simulation for the Minimum Capacity (MinCap) strategy:
      python run_min_cap.py

      This is a new script because we need to modify the objective function and add constraints.

    4. Experiment 2: Simulate Partial-Firm Contract Switching:

    • Run simulations comparing the base case with the partial-firm contract scenario:
      • Base Case Scenario:
        python run_scenarios.py --model_name "TH23"
        
      • Partial-Firm Contract Scenario:
        python run_scenarios.py --model_name "TH23ESB"
        

    5. Visualize Results:

    • Generate all figures presented in the manuscript:
      python run_viz.py
      
    • Figures will typically be saved in a figures directory within the Parent_Folder.

  17. CSU Synthetic Attribution Benchmark Dataset

    • cmr.earthdata.nasa.gov
    Updated Oct 10, 2023
    Cite
    (2023). CSU Synthetic Attribution Benchmark Dataset [Dataset]. http://doi.org/10.34911/rdnt.8snx6c
    Explore at:
    Dataset updated
    Oct 10, 2023
    Time period covered
    Jan 1, 2020 - Jan 1, 2023
    Area covered
    Description

    This is a synthetic dataset intended for users interested in benchmarking methods of explainable artificial intelligence (XAI) for geoscientific applications. The dataset is inspired by a climate forecasting setting (seasonal timescales) in which the task is to predict regional climate variability given global climate information lagged in time. It consists of a synthetic input X (a series of 2D arrays of random fields drawn from a multivariate normal distribution) and a synthetic output Y (a scalar series) generated using a nonlinear function F: R^d -> R.

    The synthetic input represents temporally independent realizations of anomalous global fields of sea surface temperature, the synthetic output series represents some regional climate variability of interest (temperature, precipitation totals, etc.), and the function F is a simplification of the climate system.

    Since the nonlinear function F that is used to generate the output given the input is known, we also derive and provide the attribution of each output value to the corresponding input features. Using this synthetic dataset users can train any AI model to predict Y given X and then implement XAI methods to interpret it. Based on the “ground truth” of attribution of F the user can assess the faithfulness of any XAI method.

    NOTE: the spatial configuration of the observations in the NetCDF database file conforms to the planetocentric coordinate system (89.5N - 89.5S, 0.5E - 359.5E), where longitude is measured positive heading east from the prime meridian.
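
    For readers who want a concrete picture of this setup, the toy sketch below mimics it at small scale: inputs drawn from a multivariate normal distribution, a known nonlinear function F, and the exact attribution of each output value to the input features obtained from the gradient of F. The dimensions, covariance, and particular choice of F are assumptions for illustration and do not reproduce the actual benchmark.

      # Toy analogue of the benchmark setup -- illustrative only, not the actual dataset generator.
      import numpy as np

      rng = np.random.default_rng(0)
      d, n = 100, 5000                          # number of input features and samples (assumed)
      # Spatially correlated random fields via an exponential covariance kernel (assumed form).
      cov = np.exp(-0.1 * np.abs(np.subtract.outer(np.arange(d), np.arange(d))))
      X = rng.multivariate_normal(mean=np.zeros(d), cov=cov, size=n)

      w = rng.normal(size=d)
      def F(x):                                 # simple nonlinear F: R^d -> R (assumed choice)
          return np.tanh(x @ w)

      Y = F(X)                                  # synthetic scalar output series

      # "Ground truth" attribution: gradient of F at each sample, dF/dx_j = (1 - tanh(x.w)^2) * w_j.
      attributions = (1.0 - np.tanh(X @ w) ** 2)[:, None] * w[None, :]   # shape (n, d)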

  18. Spacecraft Thruster Firing Test Dataset

    • zenodo.org
    • data.niaid.nih.gov
    csv, zip
    Updated Jul 16, 2024
    Cite
    Patrick Fleith; Patrick Fleith (2024). Spacecraft Thruster Firing Test Dataset [Dataset]. http://doi.org/10.5281/zenodo.7137930
    Explore at:
    zip, csvAvailable download formats
    Dataset updated
    Jul 16, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Patrick Fleith; Patrick Fleith
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    WARNING

    This version of the dataset is not recommended for the anomaly detection use case. We discovered discrepancies in the anomalous sequences. A new version will be released. In the meantime, please ignore all sequences marked as anomalous.

    CONTEXT

    Testing hardware to qualify it for spaceflight is critical for modelling and verifying performance. Hot fire tests (also known as life-tests) are typically run during the qualification campaigns of satellite thrusters, but the results remain proprietary, making it difficult for the machine learning community to develop suitable data-driven predictive models. This synthetic dataset was generated partially based on the real-world physics of monopropellant chemical thrusters, to foster the development and benchmarking of new data-driven analytical methods (machine learning, deep learning, etc.).

    The PDF document "STFT Dataset Description" describes in detail the structure, context, use cases, and domain knowledge about thrusters so that ML practitioners can use the dataset.

    PROPOSED TASKS

    Supervised:

    • Performance Modelling: Prediction of thruster performance (the target can be thrust, mass flow rate, and/or the average specific impulse)
    • Acceptance Test for Individualised Performance Model refinement: Taking the acceptance test of an individual thruster into account may help generate an individualised predictive model for that thruster
    • Uncertainty Quantification for thruster-to-thruster reproducibility verification, i.e. evaluating the prediction variability between several thrusters in order to construct uncertainty bounds (predictive intervals) around the predicted thrust and mass flow rate of future thrusters that may be used during an actual space mission

    Unsupervised / Anomaly Detection

    • Anomaly Detection: Anomalies can be detected in an unsupervised setting (outlier detection) or in a semi-supervised setting (novelty detection). The dataset includes a total of 270 anomalies. A simple approach is to predict whether a firing test sequence is anomalous or nominal. A more advanced approach is to predict which portion of a time series is anomalous. The dataset also provides detailed information about whether each time point is anomalous or nominal. In the case of an anomaly, a code is provided that allows the detection system's performance to be diagnosed on the different types of anomalies contained in the dataset.
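
    As an illustration of the simpler framing (classifying an entire firing test sequence as anomalous or nominal), the sketch below applies an unsupervised outlier detector to per-sequence summary statistics. The file layout, column names, and feature choices are assumptions for illustration; the "STFT Dataset Description" PDF documents the actual structure.

      # Illustrative unsupervised outlier detection on per-sequence summary features.
      # File names and column names below are hypothetical -- adapt them to the real layout.
      import glob

      import numpy as np
      import pandas as pd
      from sklearn.ensemble import IsolationForest

      features = []
      for path in sorted(glob.glob("firing_tests/*.csv")):            # hypothetical per-sequence files
          seq = pd.read_csv(path)                                     # assumed columns: thrust, mass_flow_rate
          features.append([
              seq["thrust"].mean(), seq["thrust"].std(),
              seq["mass_flow_rate"].mean(), seq["mass_flow_rate"].std(),
          ])

      X = np.asarray(features)
      detector = IsolationForest(contamination=0.05, random_state=0)  # contamination is a guess
      labels = detector.fit_predict(X)                                # -1 = flagged anomalous, 1 = nominal
      print(f"Flagged {int((labels == -1).sum())} of {len(labels)} sequences as potential anomalies.")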

  19. Data from: BuildingsBench: A Large-Scale Dataset of 900K Buildings and...

    • data.openei.org
    • osti.gov
    code, data, website
    Updated Dec 31, 2018
    + more versions
    Cite
    Patrick Emami; Peter Graf; Patrick Emami; Peter Graf (2018). BuildingsBench: A Large-Scale Dataset of 900K Buildings and Benchmark for Short-Term Load Forecasting [Dataset]. http://doi.org/10.25984/1986147
    Explore at:
    code, website, dataAvailable download formats
    Dataset updated
    Dec 31, 2018
    Dataset provided by
    Open Energy Data Initiative (OEDI)
    USDOE Office of Energy Efficiency and Renewable Energy (EERE), Multiple Programs (EE)
    National Renewable Energy Laboratory
    Authors
    Patrick Emami; Peter Graf; Patrick Emami; Peter Graf
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The BuildingsBench datasets consist of:

    • Buildings-900K: A large-scale dataset of 900K buildings for pretraining models on the task of short-term load forecasting (STLF). Buildings-900K is statistically representative of the entire U.S. building stock.
    • 7 real residential and commercial building datasets for benchmarking two downstream tasks evaluating generalization: zero-shot STLF and transfer learning for STLF.

    Buildings-900K can be used for pretraining models on day-ahead STLF for residential and commercial buildings. The specific gap it fills is the lack of large-scale and diverse time series datasets of sufficient size for studying pretraining and finetuning with scalable machine learning models. Buildings-900K consists of synthetically generated energy consumption time series. It is derived from the NREL End-Use Load Profiles (EULP) dataset (see link to this database in the links further below). However, the EULP was not originally developed for the purpose of STLF. Rather, it was developed to "...help electric utilities, grid operators, manufacturers, government entities, and research organizations make critical decisions about prioritizing research and development, utility resource and distribution system planning, and state and local energy planning and regulation." Similar to the EULP, Buildings-900K is a collection of Parquet files and it follows nearly the same Parquet dataset organization as the EULP. As it only contains a single energy consumption time series per building, it is much smaller (~110 GB).

    BuildingsBench also provides an evaluation benchmark: a collection of open-source residential and commercial real-building energy consumption datasets. The evaluation datasets, provided alongside Buildings-900K below, are collections of CSV files containing annual energy consumption. Altogether the evaluation datasets are less than 1 GB in size, and they are listed below:

    1. ElectricityLoadDiagrams20112014
    2. Building Data Genome Project-2
    3. Individual household electric power consumption (Sceaux)
    4. Borealis
    5. SMART
    6. IDEAL
    7. Low Carbon London

    A README file providing details about how the data is stored and describing the organization of the datasets can be found within each data lake version under BuildingsBench.
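
    As a rough sketch of how the pretraining data might be consumed for day-ahead STLF, the snippet below loads one building's hourly series from a Parquet file and slices it into 24-hour input/output windows. The file path and column name are hypothetical placeholders; the per-version README describes the actual organization.

      # Illustrative day-ahead windowing of a single building's hourly load series.
      # The Parquet path and column name are hypothetical placeholders.
      import numpy as np
      import pandas as pd

      df = pd.read_parquet("buildings_900k/building_0001.parquet")   # hypothetical file
      load = df["energy_consumption"].to_numpy()                     # hypothetical column name

      context, horizon = 24, 24        # 24 h of history -> next 24 h (day-ahead STLF)
      inputs, targets = [], []
      for start in range(0, len(load) - context - horizon + 1, horizon):
          inputs.append(load[start:start + context])
          targets.append(load[start + context:start + context + horizon])

      X = np.stack(inputs)             # shape: (num_windows, 24)
      y = np.stack(targets)            # shape: (num_windows, 24)
      print(X.shape, y.shape)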

  20. Synthetic Dataset of Hospital Admissions for an acute Stroke

    • healthdatagateway.org
    unknown
    Updated Dec 4, 2024
    Cite
    This publication uses data from PIONEER, an ethically approved database and analytical environment (East Midlands Derby Research Ethics 20/EM/0158) (2024). Synthetic Dataset of Hospital Admissions for an acute Stroke [Dataset]. https://healthdatagateway.org/en/dataset/1003
    Explore at:
    unknownAvailable download formats
    Dataset updated
    Dec 4, 2024
    Dataset authored and provided by
    This publication uses data from PIONEER, an ethically approved database and analytical environment (East Midlands Derby Research Ethics 20/EM/0158)
    License

    https://www.pioneerdatahub.co.uk/data/data-request-process/https://www.pioneerdatahub.co.uk/data/data-request-process/

    Description

    Strokes can be ischaemic or haemorrhagic in nature, leading to debilitating symptoms which are dependent on the location of the stroke in the brain and the severity of the insult. Stroke care is centred around Hyper-acute Stroke Units (HASU), Acute Stroke and Brain Injury Units (ASU/ABIU) and specialist stroke services. Early presentation enables the use of more invasive treatments to clear blood clots, but commonly strokes present late, preventing their use.

    This synthetic dataset represents approximately 29,000 stroke patients. Data includes demography, socioeconomic status, co-morbidities, “time stamped” serial acuity, physiology and treatments, investigations (structured and unstructured data), hospital care processes, and outcomes.

    The dataset was created using the Synthetic Data Vault (SDV) package, specifically employing the GAN synthesizer. Real data was first read and pre-processed, ensuring datetime columns were correctly parsed and identifiers were handled as strings. Metadata was defined to capture the schema, specifying field types and primary keys. This metadata guided the synthesizer in understanding the structure of the data. The GAN synthesizer was then fitted to the real data, learning the distributions and dependencies within it. After fitting, the synthesizer generated synthetic data that mirrors the statistical properties and relationships of the original dataset.
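
    A minimal sketch of this workflow using the SDV single-table API is shown below. It assumes SDV 1.x with the CTGAN (GAN-based) synthesizer; the file name and column names are hypothetical, and it is not the exact PIONEER pipeline.

      # Sketch of the described SDV workflow (SDV 1.x single-table API); not the exact PIONEER pipeline.
      import pandas as pd
      from sdv.metadata import SingleTableMetadata
      from sdv.single_table import CTGANSynthesizer

      # Hypothetical file and column names used for illustration only.
      real = pd.read_csv("stroke_admissions.csv", parse_dates=["admission_datetime"])
      real["patient_id"] = real["patient_id"].astype(str)     # identifiers handled as strings

      metadata = SingleTableMetadata()
      metadata.detect_from_dataframe(real)                    # capture the schema from the real data
      metadata.update_column(column_name="patient_id", sdtype="id")
      metadata.set_primary_key(column_name="patient_id")

      synthesizer = CTGANSynthesizer(metadata)                # GAN-based synthesizer
      synthesizer.fit(real)                                   # learn distributions and dependencies
      synthetic = synthesizer.sample(num_rows=len(real))      # generate synthetic admissions
      synthetic.to_csv("synthetic_stroke_admissions.csv", index=False)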

    Geography: The West Midlands (WM) has a population of 6 million & includes a diverse ethnic & socio-economic mix. UHB is one of the largest NHS Trusts in England, providing direct acute stroke services & specialist care across four hospital sites.

    Data set availability: Data access is available via the PIONEER Hub for projects which will benefit the public or patients. Data access can be provided to NHS, academic, commercial, policy and third sector organisations. Applications from SMEs are welcome. There is a single data access process, with public oversight provided by our public review committee, the Data Trust Committee. Contact pioneer@uhb.nhs.uk or visit www.pioneerdatahub.co.uk for more details.

    Available supplementary data: Matched controls; ambulance and community data. Unstructured data (images). We can provide the dataset in OMOP and other common data models and can build synthetic data to meet bespoke requirements.

    Available supplementary support: Analytics, model build, validation & refinement; A.I. support. Data partner support for ETL (extract, transform & load) processes. Bespoke and “off the shelf” Trusted Research Environment (TRE) build and run. Consultancy with clinical, patient & end-user and purchaser access/ support. Support for regulatory requirements. Cohort discovery. Data-driven trials and “fast screen” services to assess population size.
