Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In this research, we create synthetic data with features similar to data from IoT devices. We use an existing air quality dataset that includes temperature and gas sensor measurements. This real-time dataset includes component values for the Air Quality Index (AQI) and concentrations in ppm for various polluting gases. We build a JavaScript Object Notation (JSON) model that captures the distribution of variables and the structure of this real dataset in order to generate the synthetic data. Based on the synthetic dataset and the original dataset, we create a comparative predictive model. Analysis of the predictive model built on the synthetic dataset shows that it can be successfully used for edge analytics purposes, replacing real-world datasets. There is no significant difference between the real-world dataset and the synthetic dataset. The generated synthetic data requires no modification to suit the edge computing requirements. The framework can generate correct synthetic datasets based on JSON schema attributes. The accuracy, precision, and recall values for the real and synthetic datasets indicate that the logistic regression model is capable of successfully classifying the data.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
With the ongoing energy transition, power grids are evolving fast. They increasingly operate close to their technical limits, under increasingly volatile conditions. Fast, essentially real-time computational approaches to evaluate their operational safety, stability and reliability are therefore highly desirable. Machine Learning methods have been advocated to solve this challenge; however, they are heavy consumers of training and testing data, while historical operational data for real-world power grids are hard, if not impossible, to access.
This dataset contains long time series for production, consumption, and line flows, amounting to 20 years of data with a time resolution of one hour, for several thousand loads and several hundred generators of various types representing the ultra-high-voltage transmission grid of continental Europe. The synthetic time series have been statistically validated against real-world data.
The algorithm is described in a Nature Scientific Data paper. It relies on the PanTaGruEl model of the European transmission network -- the admittance of its lines as well as the location, type and capacity of its power generators -- and aggregated data gathered from the ENTSO-E transparency platform, such as power consumption aggregated at the national level.
The network information is encoded in the file europe_network.json. It is given in PowerModels format, which is itself derived from MatPower and compatible with PandaPower. The network features 7822 power lines and 553 transformers connecting 4097 buses, to which are attached 815 generators of various types.
The time series forming the core of this dataset are given in CSV format. Each CSV file is a table with 8736 rows, one for each hourly time step of a 364-day year. All years are truncated to exactly 52 weeks of 7 days, and start on a Monday (the load profiles are typically different during weekdays and weekends). The number of columns depends on the type of table: there are 4097 columns in load files, 815 for generators, and 8375 for lines (including transformers). Each column is described by a header corresponding to the element identifier in the network file. All values are given in per-unit, both in the model file and in the tables, i.e. they are multiples of a base unit taken to be 100 MW.
There are 20 tables of each type, labeled with a reference year (2016 to 2020) and an index (1 to 4), zipped into archive files arranged by year. This amounts to a total of 20 years of synthetic data. When using load, generator, and line profiles together, it is important to use the same label: for instance, the files loads_2020_1.csv, gens_2020_1.csv, and lines_2020_1.csv represent the same year of the dataset, whereas gens_2020_2.csv is unrelated (it actually shares some features, such as nuclear profiles, but it is based on a dispatch with distinct loads).
The time series can be used without a reference to the network file, simply using all or a selection of columns of the CSV files, depending on the needs. We show below how to select series from a particular country, or how to aggregate hourly time steps into days or weeks. These examples use Python and the data analysis library pandas, but other frameworks can be used as well (Matlab, Julia). Since all the yearly time series are periodic, it is always possible to define a coherent time window modulo the length of the series.
This example illustrates how to select generation data for Switzerland in Python. This can be done without parsing the network file, but using instead gens_by_country.csv, which contains a list of all generators for any country in the network. We start by importing the pandas library, and read the column of the file corresponding to Switzerland (country code CH):
import pandas as pd
CH_gens = pd.read_csv('gens_by_country.csv', usecols=['CH'], dtype=str)
The object created in this way is a DataFrame with some null values (not all countries have the same number of generators). It can be turned into a list with:
CH_gens_list = CH_gens.dropna().squeeze().to_list()
Finally, we can import all the time series of Swiss generators from a given data table with
pd.read_csv('gens_2016_1.csv', usecols=CH_gens_list)
The same procedure can be applied to loads using the list contained in the file loads_by_country.csv.
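For instance, one can load the Swiss load series and convert them from per-unit to MW using the 100 MW base mentioned above. This short sketch assumes that loads_by_country.csv has the same layout as gens_by_country.csv and that loads_2016_1.csv follows the naming scheme described earlier:
import pandas as pd
CH_loads = pd.read_csv('loads_by_country.csv', usecols=['CH'], dtype=str)
CH_loads_list = CH_loads.dropna().squeeze().to_list()
CH_loads_pu = pd.read_csv('loads_2016_1.csv', usecols=CH_loads_list)
CH_loads_mw = CH_loads_pu * 100  # per-unit values are multiples of a 100 MW base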
This second example shows how to change the time resolution of the series. Suppose that we are interested in all the loads from a given table, which are given by default with a one-hour resolution:
hourly_loads = pd.read_csv('loads_2018_3.csv')
To get a daily average of the loads, we can use:
daily_loads = hourly_loads.groupby([t // 24 for t in range(24 * 364)]).mean()
This results in series of length 364. To average further over entire weeks and get series of length 52, we use:
weekly_loads = hourly_loads.groupby([t // (24 * 7) for t in range(24 * 364)]).mean()
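Since the series are periodic, a time window that wraps around the end of the year can also be selected by taking indices modulo the series length. A minimal sketch, reusing hourly_loads from above:
T = 24 * 364                                            # hourly time steps in one synthetic year
start = 24 * 350                                        # arbitrary starting time step
window = [(start + t) % T for t in range(24 * 7 * 4)]   # four weeks wrapping past the year end
four_weeks = hourly_loads.iloc[window]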
The code used to generate the dataset is freely available at https://github.com/GeeeHesso/PowerData. It consists of two packages and several documentation notebooks. The first package, written in Python, provides functions to handle the data and to generate synthetic series based on historical data. The second package, written in Julia, is used to perform the optimal power flow. The documentation in the form of Jupyter notebooks contains numerous examples of how to use both packages. The entire workflow used to create this dataset is also provided, starting from raw ENTSO-E data files and ending with the synthetic dataset given in the repository.
This work was supported by the Cyber-Defence Campus of armasuisse and by an internal research grant of the Engineering and Architecture domain of HES-SO.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A group of six synthetic datasets is provided, each with a 914-day water consumption time horizon and structural breaks, based on actual datasets from a hotel and a hospital. The parameters of the best-fit probability distributions to the actual water consumption data were used to generate these datasets. The best-fit distributions were the gamma distribution for the hotel dataset and the gamma and logistic distributions for the hospital dataset. Two structural breaks of 5% and 10% in the mean of the distributions were added to simulate reductions in water consumption patterns.
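A minimal sketch of the kind of construction described above, with illustrative (not the actual fitted) gamma parameters and a 5% break in the mean placed mid-series:
import numpy as np

rng = np.random.default_rng(0)
n_days, break_day = 914, 457      # break position chosen for illustration only
shape, scale = 5.0, 20.0          # illustrative gamma parameters, not the fitted ones
consumption = rng.gamma(shape, scale, size=n_days)
consumption[break_day:] *= 0.95   # 5% reduction in the mean after the structural break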
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Accurate failure prediction is critical for the reliability of storage systems in HPC facilities and data centers. This study addresses data scarcity, privacy concerns, and class imbalance in HDD failure datasets by leveraging synthetic data generation. We propose an end-to-end framework to generate synthetic storage data using Generative Adversarial Networks and diffusion models. We implement a data segmentation approach that accounts for the temporal variation of disk access in order to generate high-fidelity synthetic data that replicates the nuanced temporal and feature-specific patterns of disk failures. Experimental results show that the synthetic data achieves similarity scores of 0.81–0.89 and enhances failure prediction performance, with up to a 3% improvement in accuracy and 2% in ROC-AUC. With only minor performance drops versus real-data training, synthetically trained models prove viable for predictive maintenance.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is Part 2/2 of the ActiveHuman dataset! Part 1 can be found here.
Dataset Description
ActiveHuman was generated using Unity's Perception package.
It consists of 175,428 RGB images and their semantic segmentation counterparts, taken in different environments, lighting conditions, camera distances and angles. In total, the dataset contains images for 8 environments, 33 humans, 4 lighting conditions, 7 camera distances (1m-4m) and 36 camera angles (0-360 degrees at 10-degree intervals).
The dataset does not include images at every single combination of available camera distances and angles, since for some values the camera would collide with another object or go outside the confines of an environment. As a result, some combinations of camera distances and angles do not exist in the dataset.
Alongside each image, 2D Bounding Box, 3D Bounding Box and Keypoint ground truth annotations are also generated via the use of Labelers and are stored as a JSON-based dataset. These Labelers are scripts responsible for capturing ground truth annotations for each captured image or frame. Keypoint annotations follow the COCO format defined by the COCO keypoint annotation template offered in the Perception package.
Folder configuration
The dataset consists of 3 folders:
Essential Terminology
Dataset Data
The dataset includes 4 types of JSON annotation files:
Most Labelers generate different annotation specifications in the spec key-value pair:
Each Labeler generates different annotation specifications in the values key-value pair:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Details of parameters used to generate synthetic data.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Controlled Anomalies Time Series (CATS) Dataset consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with 200 injected anomalies.
The CATS Dataset exhibits a set of desirable properties that make it very suitable for benchmarking Anomaly Detection Algorithms in Multivariate Time Series [1]:
[1] Example benchmark of anomaly detection in time series: Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly Detection in Time Series: A Comprehensive Evaluation. PVLDB, 15(9): 1779-1797, 2022. doi:10.14778/3538598.3538602
About Solenix
Solenix is an international company providing software engineering, consulting services and software products for the space market. Solenix is a dynamic company that brings innovative technologies and concepts to the aerospace market, keeping up to date with technical advancements and actively promoting spin-in and spin-out technology activities. We combine modern solutions which complement conventional practices. We aspire to achieve maximum customer satisfaction by fostering collaboration, constructivism, and flexibility.
The SMART-DS datasets (Synthetic Models for Advanced, Realistic Testing: Distribution systems and Scenarios) are realistic large-scale U.S. electrical distribution models for testing advanced grid algorithms and technology analysis. This document provides a user guide for the datasets. This dataset contains synthetic detailed electrical distribution network models and connected time series loads for the greater San Francisco (SFO), Greensboro, and Austin areas. It is intended to provide researchers with very realistic and complete models that can be used for extensive power flow simulations under a variety of scenarios. The data is synthetic but has been validated against thousands of utility feeders to ensure statistical and operational similarity to electrical distribution networks in the US. The OpenDSS data is partitioned into several regions (each zipped separately). After unzipping these files, each region has a folder for each substation, and subsequent folders for each feeder within the substation. This allows users to simulate smaller sections of the full dataset. Each of these folders (region, substation and feeder) has a folder titled "analysis" which contains CSV files listing voltages and overloads throughout the network for the peak loading time in the year. It also contains .png files showing the loading of residential and commercial loads on the network for every day of the year, and daily breakdowns of loads for commercial building categories. Time series data is provided in the "profiles" folder, including real and reactive power at 15-minute resolution, along with parquet files in the "endues" folder with breakdowns of building end-uses.
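As a hedged illustration of how the time series might be consumed, the sketch below reads one feeder's load profile with pandas and aggregates the 15-minute real power values to hourly means. The file path and column names are assumptions for illustration only, not the actual SMART-DS layout:
import pandas as pd

# Assumed path and column names -- adapt them to the actual contents of the "profiles" folder.
profile = pd.read_csv("SFO/substation_1/feeder_1/profiles/load_profile.csv",
                      parse_dates=["timestamp"], index_col="timestamp")
hourly_kw = profile["real_power"].resample("1h").mean()  # 15-minute samples -> hourly means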
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains data generated in the AI DHC project.
This dataset contains synthetic fault data for a decrease in the COP (coefficient of performance) of a heat pump.
The IEA DHC Annex XIII project "Artificial Intelligence for Failure Detection and Forecasting of Heat Production and Heat Demand in District Heating Networks" is developing Artificial Intelligence (AI) methods for forecasting heat demand and heat production, and is evaluating fault detection algorithms that can be used by interested stakeholders (operators, suppliers of DHC components and manufacturers of control devices).
See https://github.com/mathieu-vallee/ai-dhc for the models and Python scripts used to generate the dataset.
Please cite this dataset as: Vallee, M., Wissocq T., Gaoua Y., Lamaison N., Generation and Evaluation of a Synthetic Dataset to improve Fault Detection in District Heating and Cooling Systems, 2023 (under review at the Energy journal)
Disclaimer notice (IEA DHC): This project has been independently funded by the International Energy Agency Technology Collaboration Programme on District Heating and Cooling including Combined Heat and Power (IEA DHC).
Any views expressed in this publication are not necessarily those of IEA DHC.
IEA DHC can take no responsibility for the use of the information within this publication, nor for any errors or omissions it may contain.
The information contained herein has been compiled from, or arrived at using, sources believed to be reliable. Nevertheless, the authors and their organizations do not accept liability for any loss or damage arising from the use thereof. Using the given information is strictly your own responsibility.
Disclaimer Notice (Authors):
This publication has been compiled with reasonable skill and care. However, neither the authors nor the DHC Contracting Parties (of the International Energy Agency Technology Collaboration Programme on District Heating & Cooling) make any representation as to the adequacy or accuracy of the information contained herein, or as to its suitability for any particular application, and accept no responsibility or liability arising out of the use of this publication. The information contained herein does not supersede the requirements given in any national codes, regulations or standards, and should not be regarded as a substitute
Copyright:
All property rights, including copyright, are vested in IEA DHC. In particular, all parts of this publication may be reproduced, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise only by crediting IEA DHC as the original source. Republishing of this report in another format or storing the report in a public retrieval system is prohibited unless explicitly permitted by the IEA DHC Operating Agent in writing.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Artifacts for the paper titled Root Cause Analysis for Microservice System based on Causal Inference: How Far Are We?.
This artifact repository contains 9 compressed folders, as follows:
ID | File Name | Description
---|---|---
1 | syn_circa.zip | CIRCA10 and CIRCA50 datasets for Causal Discovery
2 | syn_rcd.zip | RCD10 and RCD50 datasets for Causal Discovery
3 | syn_causil.zip | CausIL10 and CausIL50 datasets for Causal Discovery
4 | rca_circa.zip | CIRCA10 and CIRCA50 datasets for RCA
5 | rca_rcd.zip | RCD10 and RCD50 datasets for RCA
6 | online-boutique.zip | Online Boutique dataset for RCA
7 | sock-shop-1.zip | Sock Shop 1 dataset for RCA
8 | sock-shop-2.zip | Sock Shop 2 dataset for RCA
9 | train-ticket.zip | Train Ticket dataset for RCA
Each zip file contains the generated/collected data from the corresponding data generator or microservice benchmark systems (e.g., online-boutique.zip contains metrics data collected from the Online Boutique system).
Details about the generation of our datasets
1. Synthetic datasets
We use three different synthetic data generators from three previous RCA studies [15, 25, 28] to create the synthetic datasets: CIRCA, RCD, and CausIL data generators. Their mechanisms are as follows:
1. CIRCA data generator [28] generates a random causal directed acyclic graph (DAG) based on a given number of nodes and edges. From this DAG, time series data for each node are generated using a vector autoregression (VAR) model. A fault is injected into a node by altering the noise term in the VAR model for two timestamps (a minimal sketch of this mechanism is given at the end of this subsection).
2. RCD data generator [25] uses the pyAgrum package [3] to generate a random DAG based on a given number of nodes, subsequently generating discrete time series data for each node, with values ranging from 0 to 5. A fault is introduced into a node by changing its conditional probability distribution.
3. CausIL data generator [15] generates causal graphs and time series data that simulate the behavior of microservice systems. It first constructs a DAG of services and metrics based on domain knowledge, then generates metric data for each node of the DAG using regressors trained on real metrics data. Unlike the CIRCA and RCD data generators, the CausIL data generator does not have the capability to inject faults.
To create our synthetic datasets, we first generate 10 DAGs whose nodes range from 10 to 50 for each of the synthetic data generators. Next, we generate fault-free datasets using these DAGs with different seeds, resulting in 100 cases for the CIRCA and RCD generators and 10 cases for the CausIL generator. We then create faulty datasets by introducing ten faults into each DAG and generating the corresponding faulty data, yielding 100 cases for the CIRCA and RCD data generators. The fault-free datasets (e.g., `syn_rcd`, `syn_circa`) are used to evaluate causal discovery methods, while the faulty datasets (e.g., `rca_rcd`, `rca_circa`) are used to assess RCA methods.
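The sketch below illustrates the CIRCA-style mechanism described above (random DAG, VAR-generated series, fault injected by inflating the noise term of one node for two timestamps). It is a simplified illustration with assumed parameter choices, not the authors' generator:
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)

def random_dag(n_nodes=10, n_edges=20):
    # Random DAG: only allow edges from lower to higher node index (illustrative construction)
    g = nx.DiGraph()
    g.add_nodes_from(range(n_nodes))
    while g.number_of_edges() < n_edges:
        u, v = sorted(rng.choice(n_nodes, size=2, replace=False))
        g.add_edge(int(u), int(v))
    return g

def simulate_var(dag, T=1000, fault_node=None, fault_steps=(500, 501), fault_scale=10.0):
    # Each node is a weighted sum of its parents' previous values plus noise;
    # the fault inflates the noise standard deviation of one node for two timestamps.
    n = dag.number_of_nodes()
    weights = {(u, v): rng.uniform(0.2, 0.8) for u, v in dag.edges}
    data = np.zeros((T, n))
    for t in range(1, T):
        for v in range(n):
            drive = sum(weights[(u, v)] * data[t - 1, u] for u in dag.predecessors(v))
            noise_std = fault_scale if (v == fault_node and t in fault_steps) else 1.0
            data[t, v] = 0.5 * data[t - 1, v] + drive + rng.normal(0.0, noise_std)
    return data

dag = random_dag()
faulty_series = simulate_var(dag, fault_node=3)  # fault injected at node 3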
2. Data collected from benchmark microservice systems
We deploy three popular benchmark microservice systems: Sock Shop [6], Online Boutique [4], and Train Ticket [8], on a four-node Kubernetes cluster hosted by AWS. Next, we use the Istio service mesh [2] with Prometheus [5] and cAdvisor [1] to monitor and collect resource-level and service-level metrics of all services, as in previous works [25, 39, 59]. To generate traffic, we use the load generators provided by these systems and customise them to explore all services with 100 to 200 concurrent users. We then introduce five common faults (CPU hog, memory leak, disk IO stress, network delay, and packet loss) into five different services within each system. Finally, we collect metrics data before and after the fault injection operation. An overview of our setup is presented in the figure below.
Code
The code to reproduce the experimental results in the paper is available at https://github.com/phamquiluan/RCAEval.
References
As in our paper.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains synthetic time series data generated to simulate the operation of a road tunnel ventilation system subject to vehicle induced disturbances. The data were created as part of a study on generative modeling for industrial equipment condition monitoring. The primary goal is to illustrate how latent input effects (specifically, the "piston effect" induced by passing vehicles) can be accounted for in a model that estimates the relationship between the power supplied to the ventilation fans and the measured wind speed in the tunnel.
Context and Application: In the simulated tunnel, ventilation fans are used to renew the air continuously. The fans can operate at different speeds; under normal conditions, a higher power input produces a faster wind speed. However, as the fans degrade over time, the same power input results in a lower wind speed. A major challenge in monitoring system performance is that passing vehicles generate transient disturbances (the piston effect) that temporarily alter the measured wind speed. These synthetic data mimic the operational scenario where measurements of wind speed (from anemometers placed at the tunnel entrance and exit) are corrupted by such disturbances.
File Descriptions: The dataset comprises six CSV files. Each file contains a sequence of 600 measurements and represents one of two vehicle separation scenarios combined with three noise (disturbance) conditions. The naming convention is as follows:
Filename Convention: The files follow the format 25_3_SYN_{separation}_G{gain}.csv, where {separation} indicates the vehicle separation (20 = low rate, 5 = high rate) and {gain} indicates the noise level (1.5 = high noise, 1.0 = medium noise, 0.5 = low noise).
Data Format and Variables: Each CSV file includes the following columns:
time: Sequential time steps (in seconds).
u1, u2: The observable inputs representing the fan setpoint (power supplied to the fans).
y1, y2: The measured outputs, corresponding to the wind speed recorded by the anemometers.
y1clean, y2clean: The theoretical wind speed in the absence of vehicles.
ySSA1, ySSA2: The estimates of y1clean and y2clean obtained from u1, u2, y1, y2 in the accompanying paper.
Note: The latent variable representing vehicle entries that cause the piston effect is not directly observable in the files; instead, its impact is embedded in the measured output.
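A hedged sketch of loading one scenario and checking how closely the provided estimate recovers the vehicle-free wind speed; the exact file name is an assumption based on the convention above:
import numpy as np
import pandas as pd

df = pd.read_csv("25_3_SYN_20_G1.0.csv")  # low vehicle rate, medium noise (assumed name)
rmse_raw = np.sqrt(np.mean((df["y1"] - df["y1clean"]) ** 2))     # size of the piston-effect disturbance
rmse_ssa = np.sqrt(np.mean((df["ySSA1"] - df["y1clean"]) ** 2))  # residual error after estimation
print(f"RMSE raw vs clean: {rmse_raw:.3f}, RMSE SSA vs clean: {rmse_ssa:.3f}")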
Usage: Researchers can use these data files to:
Citation: If you use these data in your research, please cite the accompanying paper as well as this dataset.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains synthetic HTTP log data designed for cybersecurity analysis, particularly for anomaly detection tasks.
Dataset Features
Timestamp: Simulated time for each log entry.
IP_Address: Randomized IP addresses to simulate network traffic.
Request_Type: Common HTTP methods (GET, POST, PUT, DELETE).
Status_Code: HTTP response status codes (e.g., 200, 404, 403, 500).
Anomaly_Flag: Binary flag indicating anomalies (1 = anomaly, 0 = normal).
User_Agent: Simulated user agents for device and browser identification.
Session_ID: Random session IDs to simulate user activity.
Location: Geographic locations of requests.

Applications
This dataset can be used for:
Anomaly Detection: Identify suspicious network activity or attacks.
Machine Learning: Train models for classification tasks (e.g., detect anomalies).
Cybersecurity Analysis: Analyze HTTP traffic patterns and identify threats.

Example Challenge
Build a machine learning model to predict the Anomaly_Flag based on the features provided.
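A minimal baseline sketch for the example challenge, assuming the data is available as a CSV with the columns listed above (the file name is an assumption):
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("synthetic_http_logs.csv")  # assumed file name

features = ["Request_Type", "Status_Code", "User_Agent", "Location"]
X = pd.get_dummies(df[features].astype(str))  # one-hot encode the categorical features
y = df["Anomaly_Flag"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
clf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))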
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract
The application of machine learning has become commonplace for problems in modern data science. The democratization of the decision process when choosing a machine learning algorithm has also received considerable attention through the use of meta features and automated machine learning for both classification and regression type problems. However, this is not the case for multistep-ahead time series problems. Time series models generally rely upon the series itself to make future predictions, as opposed to independent features used in regression and classification problems. The structure of a time series is generally described by features such as trend, seasonality, cyclicality, and irregularity. In this research, we demonstrate how time series metrics for these features, in conjunction with an ensemble based regression learner, were used to predict the standardized mean square error of candidate time series prediction models. These experiments used datasets that cover a wide feature space and enable researchers to select the single best performing model or the top N performing models. A robust evaluation was carried out to test the learner's performance on both synthetic and real time series.
Proposed Dataset
The dataset proposed here gives the results of 20-step-ahead predictions for eight machine learning/multi-step-ahead prediction strategies applied to 5,842 time series datasets. It was used as the training data for the meta learners in this research. The meta features used are columns C to AE. Column AH outlines the method/strategy used, and columns AI to BB contain the error (the outcome variable) for each prediction step. The description of the methods/strategies is as follows:
Machine Learning methods:
NN: Neural Network
ARIMA: Autoregressive Integrated Moving Average
SVR: Support Vector Regression
LSTM: Long Short Term Memory
RNN: Recurrent Neural Network
Multistep ahead prediction strategy:
OSAP: One Step ahead strategy
MRFA: Multi Resolution Forecast Aggregation
https://creativecommons.org/publicdomain/zero/1.0/
Overview:
This dataset contains 1000 rows of synthetic online retail sales data, mimicking transactions from an e-commerce platform. It includes information about customer demographics, product details, purchase history, and (optional) reviews. This dataset is suitable for a variety of data analysis, data visualization and machine learning tasks, including but not limited to: customer segmentation, product recommendation, sales forecasting, market basket analysis, and exploring general e-commerce trends. The data was generated using the Python Faker library, ensuring realistic values and distributions, while maintaining no privacy concerns as it contains no real customer information.
Data Source:
This dataset is entirely synthetic. It was generated using the Python Faker library and does not represent any real individuals or transactions.
Data Content:
Column Name | Data Type | Description |
---|---|---|
customer_id | Integer | Unique customer identifier (ranging from 10000 to 99999) |
order_date | Date | Order date (a random date within the last year) |
product_id | Integer | Product identifier (ranging from 100 to 999) |
category_id | Integer | Product category identifier (10, 20, 30, 40, or 50) |
category_name | String | Product category name (Electronics, Fashion, Home & Living, Books & Stationery, Sports & Outdoors) |
product_name | String | Product name (randomly selected from a list of products within the corresponding category) |
quantity | Integer | Quantity of the product ordered (ranging from 1 to 5) |
price | Float | Unit price of the product (ranging from 10.00 to 500.00, with two decimal places) |
payment_method | String | Payment method used (Credit Card, Bank Transfer, Cash on Delivery) |
city | String | Customer's city (generated using Faker's city() method, so the locations will depend on the Faker locale you used) |
review_score | Integer | Customer's product rating (ranging from 1 to 5, or None with a 20% probability) |
gender | String | Customer's gender (M/F, or None with a 10% probability) |
age | Integer | Customer's age (ranging from 18 to 75) |
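For illustration, a minimal sketch of how rows with this shape could be generated with the Faker library. The category list matches the table above, while product_name is omitted and the missing-value probabilities are simplified, so this is not the original generator:
import random
from faker import Faker

fake = Faker()
categories = {10: "Electronics", 20: "Fashion", 30: "Home & Living",
              40: "Books & Stationery", 50: "Sports & Outdoors"}

def make_row():
    cat_id = random.choice(list(categories))
    return {
        "customer_id": random.randint(10000, 99999),
        "order_date": fake.date_between(start_date="-1y", end_date="today"),
        "product_id": random.randint(100, 999),
        "category_id": cat_id,
        "category_name": categories[cat_id],
        "quantity": random.randint(1, 5),
        "price": round(random.uniform(10.0, 500.0), 2),
        "payment_method": random.choice(["Credit Card", "Bank Transfer", "Cash on Delivery"]),
        "city": fake.city(),
        "review_score": random.choice([None] + list(range(1, 6))),  # None stands in for a missing review
        "gender": random.choice(["M", "F", None]),
        "age": random.randint(18, 75),
    }

rows = [make_row() for _ in range(1000)]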
Potential Use Cases (Inspiration):
Customer Segmentation: Group customers based on demographics, purchasing behavior, and preferences.
Product Recommendation: Build a recommendation system to suggest products to customers based on their past purchases and browsing history.
Sales Forecasting: Predict future sales based on historical trends.
Market Basket Analysis: Identify products that are frequently purchased together.
Price Optimization: Analyze the relationship between price and demand.
Geographic Analysis: Explore sales patterns across different cities.
Time Series Analysis: Investigate sales trends over time.
Educational Purposes: Great for practicing data cleaning, EDA, feature engineering, and modeling.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This synthetic dataset, centred on antiretroviral therapy (ART) for HIV, was synthesised using the model outlined in reference [1], incorporating the techniques of WGAN-GP+G_EOT+VAE+Buffer.
This dataset serves as a principal resource for the Centre for Big Data Research in Health (CBDRH) Datathon (see: CBDRH Health Data Science Datathon 2023 (cbdrh-hds-datathon-2023.github.io)). Its primary purpose is to advance the Health Data Analytics (HDAT) courses at the University of New South Wales (UNSW), providing students with exposure to synthetic yet realistic datasets that simulate real-world data.
The dataset is composed of 534,960 records, distributed over 15 distinct columns, and is preserved in a CSV format with a size of 39.1 MB. It contains information about 8,916 synthetic patients over a period of 60 months, with data summarised on a monthly basis. The total number of records corresponds to the product of the synthetic patient count and the record duration in months, thus equating to 8,916 multiplied by 60.
The dataset's structure encompasses 15 columns, which include 13 variables pertinent to ART for HIV as delineated in reference [1], a unique patient identifier, and a further variable signifying the specific time point.
This dataset forms part of a continuous series of work, building upon reference [2]. For further details, kindly refer to our papers: [1] Kuo, Nicholas I., Louisa Jorm, and Sebastiano Barbieri. "Generating Synthetic Clinical Data that Capture Class Imbalanced Distributions with Generative Adversarial Networks: Example using Antiretroviral Therapy for HIV." arXiv preprint arXiv:2208.08655 (2022). [2] Kuo, Nicholas I-Hsien, et al. "The Health Gym: synthetic health-related datasets for the development of reinforcement learning algorithms." Scientific Data 9.1 (2022): 693.
Latest edit: 16th May 2023.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository provides the necessary data and Python code to replicate the experiments and generate the figures presented in our manuscript: "Supporting data and code: Beyond Economic Dispatch: Modeling Renewable Purchase Agreements in Production Cost Models".
Contents:

pownet.zip: Contains PowNet version 3.2, the specific version of the simulation software used in this study.
inputs.zip: Contains essential modeling inputs required by PowNet for the experiments, including network data and pre-generated synthetic load and solar time series.
scripts.zip: Contains the Python scripts used for installing PowNet, optionally regenerating synthetic data, running simulation experiments, processing results, and generating figures.
thai_data.zip (Reference Only): Contains raw data related to the 2023 Thai power system. This data served as a reference during the creation of the PowNet inputs for this study but is not required to run the replication experiments themselves. Code to process the raw data is also provided.

System Requirements:

Python with the pip package manager.

Setup Instructions:

Download and Unzip Core Files: Download pownet.zip, inputs.zip, scripts.zip, and thai_data.zip. Extract their contents into the same parent folder. Your directory structure should look like this:
Parent_Folder/
├── pownet/ # from pownet.zip
├── inputs/ # from inputs.zip
├── scripts/ # from scripts.zip
├── thai_data/ # from thai_data.zip
├── figures/ # Created by scripts later
├── outputs/ # Created by scripts later
Install PowNet: Navigate to the pownet directory that you just extracted:

cd path/to/Parent_Folder/pownet
pip install -e .
Workflow and Usage:

Note: All subsequent Python script commands should be run from the scripts directory. Navigate to it first:

cd path/to/Parent_Folder/scripts

1. Generate Synthetic Time Series (Optional): Pre-generated synthetic time series are already provided in the inputs directory (extracted from inputs.zip). If you wish to regenerate them:

python create_synthetic_load.py
python create_synthetic_solar.py
python eval_synthetic_load.py
python eval_synthetic_solar.py

2. Calculate Total Solar Availability: This step uses the files in the inputs directory:

python process_scenario_solar.py
3. Experiment 1: Compare Strategies for Modeling Purchase Obligations:
python run_basecase.py --model_name "TH23NMT"
python run_basecase.py --model_name "TH23ZC"
python run_basecase.py --model_name "TH23"
python run_min_cap.py
run_min_cap.py is a new script because we need to modify the objective function and add constraints.
4. Experiment 2: Simulate Partial-Firm Contract Switching:
python run_scenarios.py --model_name "TH23"
python run_scenarios.py --model_name "TH23ESB"
5. Visualize Results:
python run_viz.py
The figures are saved in the figures directory within the Parent_Folder.

This is a synthetic dataset that can be used by users interested in benchmarking methods of explainable artificial intelligence (XAI) for geoscientific applications. The dataset is specifically inspired by a climate forecasting setting (seasonal timescales), where the task is to predict regional climate variability given global climate information lagged in time. The dataset consists of a synthetic input X (a series of 2D arrays of random fields drawn from a multivariate normal distribution) and a synthetic output Y (a scalar series) generated using a nonlinear function F: R^d -> R.
The synthetic input aims to represent temporally independent realizations of anomalous global fields of sea surface temperature, the synthetic output series represents some type of regional climate variability that is of interest (temperature, precipitation totals, etc.) and the function F is a simplification of the climate system.
Since the nonlinear function F that is used to generate the output given the input is known, we also derive and provide the attribution of each output value to the corresponding input features. Using this synthetic dataset users can train any AI model to predict Y given X and then implement XAI methods to interpret it. Based on the “ground truth” of attribution of F the user can assess the faithfulness of any XAI method.
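A minimal sketch of the intended workflow (train a model on X and Y, compute an attribution, and compare it with the ground truth). It uses a small toy stand-in for the fields and for F, rather than the actual NetCDF data and generating function:
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_samples, ny, nx = 2000, 10, 20           # toy grid instead of the real global fields
X = rng.standard_normal((n_samples, ny, nx))
w = rng.standard_normal((ny, nx))          # toy "climate system" weights (assumption)
Y = np.tanh((X * w).sum(axis=(1, 2)))      # toy nonlinear F with known structure

Xf = X.reshape(n_samples, -1)
model = Ridge(alpha=1.0).fit(Xf, Y)

# Simple "input x coefficient" attribution for the linear surrogate model
attribution = Xf * model.coef_
# Known per-cell contributions for this toy F, playing the role of the ground-truth attribution
ground_truth = (X * w).reshape(n_samples, -1)

# Faithfulness check: mean per-sample correlation between estimated and true attributions
corr = np.mean([np.corrcoef(a, g)[0, 1] for a, g in zip(attribution, ground_truth)])
print(f"mean per-sample attribution correlation: {corr:.2f}")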
NOTE: the spatial configuration of the observations in the NetCDF database file conform to the planetocentric coordinate system (89.5N - 89.5S, 0.5E - 359.5E), where longitude is measured in the positive heading east from the prime meridian.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
WARNING
This version of the dataset is not recommended for the anomaly detection use case. We discovered discrepancies in the anomalous sequences. A new version will be released. In the meantime, please ignore all sequences marked as anomalous.
CONTEXT
Testing hardware to qualify it for Spaceflight is critical to model and verify performances. Hot fire tests (also known as life-tests) are typically run during the qualification campaigns of satellite thrusters, but results remain proprietary data, hence making it difficult for the machine learning community to develop suitable data-driven predictive models. This synthetic dataset was generated partially based on the real-world physics of monopropellant chemical thrusters, to foster the development and benchmarking of new data-driven analytical methods (machine learning, deep-learning, etc.).
The PDF document "STFT Dataset Description" describes in detail the structure, context, use cases and domain knowledge about the thruster, so that ML practitioners can use the dataset.
PROPOSED TASKS
Supervised:
Unsupervised / Anomaly Detection
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The BuildingsBench datasets consist of:
Buildings-900K can be used for pretraining models on day-ahead STLF for residential and commercial buildings. The specific gap it fills is the lack of large-scale and diverse time series datasets of sufficient size for studying pretraining and finetuning with scalable machine learning models. Buildings-900K consists of synthetically generated energy consumption time series. It is derived from the NREL End-Use Load Profiles (EULP) dataset (see link to this database in the links further below). However, the EULP was not originally developed for the purpose of STLF. Rather, it was developed to "...help electric utilities, grid operators, manufacturers, government entities, and research organizations make critical decisions about prioritizing research and development, utility resource and distribution system planning, and state and local energy planning and regulation." Similar to the EULP, Buildings-900K is a collection of Parquet files and it follows nearly the same Parquet dataset organization as the EULP. As it only contains a single energy consumption time series per building, it is much smaller (~110 GB).
BuildingsBench also provides an evaluation benchmark that is a collection of various open source residential and commercial real building energy consumption datasets. The evaluation datasets, which are provided alongside Buildings-900K below, are collections of CSV files which contain annual energy consumption. The size of the evaluation datasets altogether is less than 1GB, and they are listed out below:
A README file providing details about how the data is stored and describing the organization of the datasets can be found within each data lake version under BuildingsBench.
https://www.pioneerdatahub.co.uk/data/data-request-process/
Strokes can be ischaemic or haemorrhagic in nature, leading to debilitating symptoms which are dependent on the location of the stroke in the brain and the severity of the insult. Stroke care is centred around Hyper-acute Stroke Units (HASU), Acute Stroke and Brain Injury Units (ASU/ABIU) and specialist stroke services. Early presentation enables the use of more invasive treatments to clear blood clots, but commonly strokes present late, preventing their use.
This synthetic dataset represents approximately 29,000 stroke patients. Data includes demography, socioeconomic status, co-morbidities, “time stamped” serial acuity, physiology and treatments, investigations (structured and unstructured data), hospital care processes, and outcomes.
The dataset was created using the Synthetic Data Vault (SDV) package, specifically employing the GAN synthesizer. Real data was first read and pre-processed, ensuring datetime columns were correctly parsed and identifiers were handled as strings. Metadata was defined to capture the schema, specifying field types and primary keys. This metadata guided the synthesizer in understanding the structure of the data. The GAN synthesizer was then fitted to the real data, learning the distributions and dependencies within it. After fitting, the synthesizer generated synthetic data that mirrors the statistical properties and relationships of the original dataset.
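A minimal sketch of the kind of SDV workflow described above, assuming the SDV 1.x API; the file name and column names are illustrative assumptions, not the actual schema:
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

real = pd.read_csv("stroke_real.csv", parse_dates=["admission_datetime"])  # assumed file and column

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real)               # infer the schema from the real data
metadata.update_column("patient_id", sdtype="id")  # assumed identifier column

synth = CTGANSynthesizer(metadata)                 # GAN-based single-table synthesizer
synth.fit(real)
synthetic = synth.sample(num_rows=29000)           # roughly the cohort size described above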
Geography: The West Midlands (WM) has a population of 6 million & includes a diverse ethnic & socio-economic mix. UHB is one of the largest NHS Trusts in England, providing direct acute stroke services & specialist care across four hospital sites.
Data set availability: Data access is available via the PIONEER Hub for projects which will benefit the public or patients. Data access can be provided to NHS, academic, commercial, policy and third sector organisations. Applications from SMEs are welcome. There is a single data access process, with public oversight provided by our public review committee, the Data Trust Committee. Contact pioneer@uhb.nhs.uk or visit www.pioneerdatahub.co.uk for more details.
Available supplementary data: Matched controls; ambulance and community data. Unstructured data (images). We can provide the dataset in OMOP and other common data models and can build synthetic data to meet bespoke requirements.
Available supplementary support: Analytics, model build, validation & refinement; A.I. support. Data partner support for ETL (extract, transform & load) processes. Bespoke and “off the shelf” Trusted Research Environment (TRE) build and run. Consultancy with clinical, patient & end-user and purchaser access/ support. Support for regulatory requirements. Cohort discovery. Data-driven trials and “fast screen” services to assess population size.