Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
With the ongoing energy transition, power grids are evolving fast. They operate more and more often close to their technical limits, under increasingly volatile conditions. Fast, essentially real-time computational approaches to evaluate their operational safety, stability, and reliability are therefore highly desirable. Machine Learning methods have been advocated to solve this challenge; however, they are heavy consumers of training and testing data, while historical operational data for real-world power grids are hard, if not impossible, to access.
This dataset contains long time series for production, consumption, and line flows, amounting to 20 years of data with a time resolution of one hour, for several thousand loads and several hundred generators of various types representing the ultra-high-voltage transmission grid of continental Europe. The synthetic time series have been statistically validated against real-world data.
The algorithm is described in a Nature Scientific Data paper. It relies on the PanTaGruEl model of the European transmission network -- the admittance of its lines as well as the location, type and capacity of its power generators -- and aggregated data gathered from the ENTSO-E transparency platform, such as power consumption aggregated at the national level.
The network information is encoded in the file europe_network.json. It is given in PowerModels format, which is itself derived from MatPower and compatible with PandaPower. The network features 7822 power lines and 553 transformers connecting 4097 buses, to which are attached 815 generators of various types.
The time series forming the core of this dataset are given in CSV format. Each CSV file is a table with 8736 rows, one for each hourly time step of a 364-day year. All years are truncated to exactly 52 weeks of 7 days, and start on a Monday (the load profiles are typically different during weekdays and weekends). The number of columns depends on the type of table: there are 4097 columns in load files, 815 for generators, and 8375 for lines (including transformers). Each column is described by a header corresponding to the element identifier in the network file. All values are given in per-unit, both in the model file and in the tables, i.e. they are multiples of a base unit taken to be 100 MW.
There are 20 tables of each type, labeled with a reference year (2016 to 2020) and an index (1 to 4), zipped into archive files arranged by year. This amounts to a total of 20 years of synthetic data. When using load, generator, and line profiles together, it is important to use the same label: for instance, the files loads_2020_1.csv, gens_2020_1.csv, and lines_2020_1.csv represent the same year of the dataset, whereas gens_2020_2.csv is unrelated (it actually shares some features, such as nuclear profiles, but it is based on a dispatch with distinct loads).
The time series can be used without reference to the network file, simply using all or a selection of columns of the CSV files, depending on the needs. We show below how to select series from a particular country, and how to aggregate hourly time steps into days or weeks. These examples use Python and the data analysis library pandas, but other frameworks can be used as well (Matlab, Julia). Since all the yearly time series are periodic, it is always possible to define a coherent time window modulo the length of the series.
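For example, a window that runs past the end of a yearly table can be wrapped around by indexing modulo the series length (8736 hours). The snippet below is a minimal sketch of this idea; the file name and window position are just illustrative choices:
import pandas as pd
loads = pd.read_csv('loads_2018_3.csv')
# 14-day window starting in the last week of the year, wrapped modulo the series length
start = 363 * 24
window = [(start + t) % 8736 for t in range(14 * 24)]
two_weeks = loads.iloc[window].reset_index(drop=True)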
This example illustrates how to select generation data for Switzerland in Python. This can be done without parsing the network file, but using instead gens_by_country.csv, which contains a list of all generators for any country in the network. We start by importing the pandas library, and read the column of the file corresponding to Switzerland (country code CH):
import pandas as pd
CH_gens = pd.read_csv('gens_by_country.csv', usecols=['CH'], dtype=str)
The object created in this way is a DataFrame with some null values (not all countries have the same number of generators). It can be turned into a list with:
CH_gens_list = CH_gens.dropna().squeeze().to_list()
Finally, we can import all the time series of Swiss generators from a given data table with:
pd.read_csv('gens_2016_1.csv', usecols=CH_gens_list)
The same procedure can be applied to loads using the list contained in the file loads_by_country.csv.
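For instance, assuming the same column layout as gens_by_country.csv and the file naming convention described above, the Swiss loads of one table could be selected as follows:
CH_loads = pd.read_csv('loads_by_country.csv', usecols=['CH'], dtype=str)
CH_loads_list = CH_loads.dropna().squeeze().to_list()
swiss_loads = pd.read_csv('loads_2016_1.csv', usecols=CH_loads_list)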
This second example shows how to change the time resolution of the series. Suppose that we are interested in all the loads from a given table, which are given by default with a one-hour resolution:
hourly_loads = pd.read_csv('loads_2018_3.csv')
To get a daily average of the loads, we can use:
daily_loads = hourly_loads.groupby([t // 24 for t in range(24 * 364)]).mean()
This results in series of length 364. To average further over entire weeks and get series of length 52, we use:
weekly_loads = hourly_loads.groupby([t // (24 * 7) for t in range(24 * 364)]).mean()
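Since all values are given in per-unit with a 100 MW base (see above), the aggregated series can be converted to physical units by a simple rescaling, for example:
weekly_loads_MW = weekly_loads * 100  # per-unit values times the 100 MW base
total_weekly_load_MW = weekly_loads_MW.sum(axis=1)  # total load of the grid, week by week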
The code used to generate the dataset is freely available at https://github.com/GeeeHesso/PowerData. It consists of two packages and several documentation notebooks. The first package, written in Python, provides functions to handle the data and to generate synthetic series based on historical data. The second package, written in Julia, is used to perform the optimal power flow. The documentation, in the form of Jupyter notebooks, contains numerous examples of how to use both packages. The entire workflow used to create this dataset is also provided, starting from raw ENTSO-E data files and ending with the synthetic dataset given in the repository.
This work was supported by the Cyber-Defence Campus of armasuisse and by an internal research grant of the Engineering and Architecture domain of HES-SO.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
This synthetic dataset contains 20,000 records of X-ray data labeled as "Normal" or "Tuberculosis". It is specifically created for training and evaluating classification models in the field of medical image analysis. The dataset aims to aid in building machine learning and deep learning models for detecting tuberculosis from X-ray data.
Tuberculosis (TB) is a highly infectious disease that primarily affects the lungs. Accurate detection of TB using chest X-rays can significantly enhance medical diagnostics. However, real-world datasets are often scarce or restricted due to privacy concerns. This synthetic dataset bridges that gap by providing simulated patient data while maintaining realistic distributions and patterns commonly observed in TB cases.
Column Name | Description |
---|---|
Patient_ID | Unique ID for each patient (e.g., PID000001) |
Age | Age of the patient (in years) |
Gender | Gender of the patient (Male/Female) |
Chest_Pain | Presence of chest pain (Yes/No) |
Cough_Severity | Severity of cough (Scale: 0-9) |
Breathlessness | Severity of breathlessness (Scale: 0-4) |
Fatigue | Level of fatigue experienced (Scale: 0-9) |
Weight_Loss | Weight loss (in kg) |
Fever | Level of fever (Mild, Moderate, High) |
Night_Sweats | Whether night sweats are present (Yes/No) |
Sputum_Production | Level of sputum production (Low, Medium, High) |
Blood_in_Sputum | Presence of blood in sputum (Yes/No) |
Smoking_History | Smoking status (Never, Former, Current) |
Previous_TB_History | Previous tuberculosis history (Yes/No) |
Class | Target variable indicating the condition (Normal, Tuberculosis) |
The dataset was generated using Python with the following libraries:
- Pandas: To create and save the dataset as a CSV file
- NumPy: To generate random numbers and simulate realistic data
- Random Seed: Set to ensure reproducibility
The target variable "Class" has a 70-30 distribution between Normal and Tuberculosis cases. The data is randomly generated with realistic patterns that mimic typical TB symptoms and demographic distributions.
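The exact generation script is not included here, but a minimal sketch of the approach described above (NumPy for random sampling, pandas for the CSV output, a fixed seed, and a 70-30 class split) could look like the following; all distributional choices and the output file name below are illustrative assumptions, not the original parameters:
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed for reproducibility
n = 20_000
df = pd.DataFrame({
    'Patient_ID': [f'PID{i:06d}' for i in range(1, n + 1)],
    'Age': rng.integers(1, 91, size=n),                       # illustrative age range
    'Gender': rng.choice(['Male', 'Female'], size=n),
    'Cough_Severity': rng.integers(0, 10, size=n),             # scale 0-9
    'Class': rng.choice(['Normal', 'Tuberculosis'], size=n, p=[0.7, 0.3]),  # 70-30 split
})
df.to_csv('synthetic_tb_dataset.csv', index=False)             # hypothetical file name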
This dataset is intended for:
- Machine Learning and Deep Learning classification tasks
- Data exploration and feature analysis
- Model evaluation and comparison
- Educational and research purposes
This synthetic dataset is open for educational and research use. Please credit the creator if used in any public or academic work.
This dataset was generated as a synthetic alternative to real-world data to help developers and researchers practice building and fine-tuning classification models without the constraints of sensitive patient data.
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Synthetic Healthcare Dataset
Overview
This dataset is a synthetic healthcare dataset created for use in data analysis. It mimics real-world patient healthcare data and is intended for applications within the healthcare industry.
Data Generation
The data has been generated using the Faker Python library, which produces randomized and synthetic records that resemble real-world data patterns. It includes various healthcare-related fields such as patient… See the full description on the dataset page: https://huggingface.co/datasets/vrajakishore/dummy_health_data.
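As a rough illustration of how Faker can produce such records (the field names below are assumptions for the sketch, not necessarily the columns of this dataset):
from faker import Faker
import pandas as pd

fake = Faker()
Faker.seed(0)  # reproducible output
records = [{
    'name': fake.name(),
    'date_of_birth': fake.date_of_birth(minimum_age=0, maximum_age=95),
    'blood_type': fake.random_element(elements=('A+', 'A-', 'B+', 'B-', 'O+', 'O-', 'AB+', 'AB-')),
    'admission_date': fake.date_this_decade(),
} for _ in range(10)]
df = pd.DataFrame(records)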
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Synthetic river flow videos for evaluating image-based velocimetry methods
### This file describes the data attached to the article
This folder contains the data used in the case studies: synthetic videos + reference files.
- 00_reference_velocities
-> Reference velocities interpolated on a regular grid. Data are given in conventional units, i.e. m/s and m.
- 01_XX
-> Data of the first case study
- 02_XX
-> Data of the second case study
This folder contains the Python libraries and Mantaflow modified source code used in the paper. The libraries are provided as is. Feel free to contact us for support or guidelines.
- lspiv
-> Python library used to extract, process and display results of LSPIV analysis carried out with Fudaa-LSPIV
- mantaflow-modified
-> Modified version of Mantaflow described in the article. Installation instructions can be found at http://mantaflow.com
- syri
-> Python library used to extract, process and display fluid simulations carried out with Mantaflow and Blender. (Requires the lspiv library)
This folder contains synthetic videos generated with the method described in the article. The fluid simulation parameters, and thus the reference velocities, are the same as those presented in the article.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Synthetic data, source code, and supplementary text for the article "Euler deconvolution of potential field data" by Leonardo Uieda, Vanderlei C. Oliveira Jr., and Valéria C. F. Barbosa. This is part of a tutorial submitted to The Leading Edge (http://library.seg.org/journal/tle). Results were generated using the open-source Python package Fatiando a Terra version 0.2 (http://www.fatiando.org). This material, along with the manuscript, can also be found at https://github.com/pinga-lab/paper-tle-euler-tutorial
Synthetic data and model
Examples in the tutorial use synthetic data generated with the IPython notebook create_synthetic_data.ipynb. File synthetic_data.txt has 4 columns: x (north), y (east), z (down), and the total field magnetic anomaly. x, y, and z are in meters. The total field anomaly is in nanoTesla (nT). File metadata.json contains extra information about the data, such as the inclination and declination of the inducing field (in degrees), the shape of the data grid (number of points in y and x, respectively), the area containing the data (W, E, S, N, in meters), and the model boundaries (W, E, S, N, top, bottom, in meters). File model.pickle is a serialized version of the model used to generate the data. It contains a list of instances of the PolygonalPrism class of Fatiando a Terra. The serialization was done using the cPickle Python module.
Reproducing the results in the tutorial
The notebook euler-deconvolution-examples.ipynb runs the Euler deconvolution on the synthetic data and generates the figures for the manuscript. It also presents a more detailed explanation of the method and more tests than went into the finished manuscript.
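A minimal sketch for reading the data files with NumPy and the standard library (column order as described above; the 'shape' key name in metadata.json is an assumption, and Fatiando a Terra is not needed for this step):
import json
import numpy as np

# x (north), y (east), z (down) in meters; total field anomaly in nT
x, y, z, anomaly = np.loadtxt('synthetic_data.txt', unpack=True)

with open('metadata.json') as f:
    meta = json.load(f)
ny, nx = meta['shape']  # assumed key holding the grid shape (points in y, x)
anomaly_grid = anomaly.reshape(ny, nx)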
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
This dataset provides a simulated retail data warehouse designed using star schema modeling principles.
It includes both normalized and denormalized versions of a retail sales star schema, making it a valuable resource for data engineers, analysts, and data warehouse enthusiasts who want to explore real-world scenarios, performance tuning, and modeling strategies.
This dataset has two fact tables:
- fact_sales_normalized.csv: no columns from the dim_* tables have been denormalised into it.
Figure: Normalized Retail Star Schema.
However, the dim_* tables stay the same for both:
- Dim_Customers.csv
- Dim_Products.csv
- Dim_Stores.csv
- Dim_Dates.csv
- Dim_Salesperson
- Dim_Campaign
Explore how denormalization affects storage, redundancy, and performance
All data is synthetic and randomly generated via Python scripts that use the Polars library for data manipulation; no real customer or business data is included.
Ideal for use with tools like SQL engines, Redshift, BigQuery, Snowflake, or even DuckDB.
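A small Polars sketch of how the normalized fact table might be joined back to a dimension; the join key customer_id is a hypothetical column name, to be checked against the actual files:
import polars as pl

fact = pl.read_csv('fact_sales_normalized.csv')
customers = pl.read_csv('Dim_Customers.csv')

# Denormalize one dimension into the fact table; 'customer_id' is an assumed key column
sales_with_customers = fact.join(customers, on='customer_id', how='left')
print(sales_with_customers.head())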
Shrinivas Vishnupurikar, Data Engineer @Velotio Technologies.
CC0 1.0 Universal (Public Domain Dedication)https://creativecommons.org/publicdomain/zero/1.0/
This repository contains a comprehensive and clean dataset for predicting e-commerce sales, tailored for data scientists, machine learning enthusiasts, and researchers. The dataset is crafted to analyze sales trends, optimize pricing strategies, and develop predictive models for sales forecasting.
The dataset includes 1,000 records across the following features:
Column Name | Description |
---|---|
Date | The date of the sale (01-01-2023 onward). |
Product_Category | Category of the product (e.g., Electronics, Sports, Other). |
Price | Price of the product (numerical). |
Discount | Discount applied to the product (numerical). |
Customer_Segment | Buyer segment (e.g., Regular, Occasional, Other). |
Marketing_Spend | Marketing budget allocated for sales (numerical). |
Units_Sold | Number of units sold per transaction (numerical). |
Date: - Range: 01-01-2023 to 12-31-2023. - Contains 1,000 unique values without missing data.
Product_Category: - Categories: Electronics (21%), Sports (21%), Other (58%). - Most common category: Electronics (21%).
Price: - Range: From 244 to 999. - Mean: 505, Standard Deviation: 290. - Most common price range: 14.59 - 113.07.
Discount: - Range: From 0.01% to 49.92%. - Mean: 24.9%, Standard Deviation: 14.4%. - Most common discount range: 0.01 - 5.00%.
Customer_Segment: - Segments: Regular (35%), Occasional (34%), Other (31%). - Most common segment: Regular.
Marketing_Spend: - Range: From 2.41k to 10k. - Mean: 4.91k, Standard Deviation: 2.84k.
Units_Sold: - Range: From 5 to 57. - Mean: 29.6, Standard Deviation: 7.26. - Most common range: 24 - 34 units sold.
The dataset is suitable for creating the following visualizations:
1. Price Distribution: Histogram to show the spread of prices.
2. Discount Distribution: Histogram to analyze promotional offers.
3. Marketing Spend Distribution: Histogram to understand marketing investment patterns.
4. Customer Segment Distribution: Bar plot of customer segments.
5. Price vs Units Sold: Scatter plot to show pricing effects on sales.
6. Discount vs Units Sold: Scatter plot to explore the impact of discounts.
7. Marketing Spend vs Units Sold: Scatter plot for marketing effectiveness.
8. Correlation Heatmap: Identify relationships between features.
9. Pairplot: Visualize pairwise feature interactions.
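For instance, the price histogram and the correlation heatmap (items 1 and 8) can be produced with pandas and Matplotlib; the file name follows the regression example further below:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('ecommerce_sales.csv')

# 1. Price distribution
df['Price'].plot(kind='hist', bins=30, title='Price Distribution')
plt.show()

# 8. Correlation heatmap of the numerical features
corr = df[['Price', 'Discount', 'Marketing_Spend', 'Units_Sold']].corr()
plt.matshow(corr)
plt.xticks(range(len(corr.columns)), corr.columns, rotation=45)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.colorbar()
plt.show()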
The dataset is synthetically generated to mimic realistic e-commerce sales trends. Below are the steps taken for data generation:
Feature Engineering:
Data Simulation:
Validation:
Note: The dataset is synthetic and not sourced from any real-world e-commerce platform.
Here's an example of building a predictive model using Linear Regression:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Load the dataset
df = pd.read_csv('ecommerce_sales.csv')
# Feature selection
X = df[['Price', 'Discount', 'Marketing_Spend']]
y = df['Units_Sold']
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Model training
model = LinearRegression()
model.fit(X_train, y_train)
# Predictions
y_pred = model.predict(X_test)
# Evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse:.2f}')
print(f'R-squared: {r2:.2f}')
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract: This repository/dataset provides a suite of Python scripts to generate a simulated relational database for inventory management processes and transform this data into object-centric event logs (OCEL) suitable for advanced process mining analysis. The primary goal is to offer a synthetic yet realistic dataset that facilitates research, development, and application of object-centric process mining techniques in the domain of inventory control and supply chain management. The generated event logs capture common inventory operations, track stock level changes, and are enriched with key inventory management parameters (like EOQ, Safety Stock, Reorder Point) and status-based activity labels (e.g., indicating understock or overstock situations).
Overview: Inventory management is a critical business process characterized by the interaction of various entities such as materials, purchase orders, sales orders, plants, suppliers, and customers. Traditional process mining often struggles to capture these complex interactions. Object-Centric Process Mining (OCPM) offers a more suitable paradigm. This project provides the tools to create and explore such data.
The workflow involves generating a simulated relational database, transforming it into an object-centric event log in CSV format, and converting that log into the standard OCEL XML format with the pm4py library.
Contents:
The repository contains the following Python scripts:
- 01_generate_simulation.py: Simulates the inventory management processes and stores the results in an SQLite database, inventory_management.db. The database contains the tables Materials, SalesOrderDocuments, SalesOrderItems, PurchaseOrderDocuments, PurchaseOrderItems, PurchaseRequisitions, GoodsReceiptsAndIssues, MaterialStocks, MaterialDocuments, SalesDocumentFlows, and OrderSuggestions.
- 02_database_to_ocel_csv.py: Reads inventory_management.db and transforms it into an object-centric event log in CSV format, ocel_inventory_management.csv. The object types are MAT (Material), PLA (Plant), PO_ITEM (Purchase Order Item), SO_ITEM (Sales Order Item), CUSTOMER, and SUPPLIER, and the log uses the standard OCEL columns (ocel:activity, ocel:timestamp, ocel:type:<object type>).
- 03_ocel_csv_to_ocel.py: Reads ocel_inventory_management.csv and uses pm4py to convert the CSV event log into the standard OCEL XML format (ocel_inventory_management.xml).
- 04_postprocess_activities.py: Uses inventory_management.db to calculate inventory parameters (EOQ, Safety Stock, Reorder Point), enriches ocel_inventory_management.csv with a status-based ocel:activity label (e.g., "Goods Issue (Understock)"), and adds a MAT_PLA (Material-Plant combination) object for easier status tracking. The result is saved as post_ocel_inventory_management.csv.
- 05_ocel_csv_to_ocel.py: Reads post_ocel_inventory_management.csv and uses pm4py to convert this enriched CSV event log into the standard OCEL XML format (post_ocel_inventory_management.xml).
Generated Dataset Files (if included, or can be generated using the scripts):
- inventory_management.db: The SQLite database containing the simulated raw data.
- ocel_inventory_management.csv: The initial OCEL in CSV format.
- ocel_inventory_management.xml: The initial OCEL in standard OCEL XML format.
- post_ocel_inventory_management.csv: The post-processed and enriched OCEL in CSV format.
- post_ocel_inventory_management.xml: The post-processed and enriched OCEL in standard OCEL XML format.
How to Use:
The scripts require Python with sqlite3 (standard library), pandas, numpy, and pm4py. Run them in order:
1. python 01_generate_simulation.py (generates inventory_management.db)
2. python 02_database_to_ocel_csv.py (generates ocel_inventory_management.csv from the database)
3. python 03_ocel_csv_to_ocel.py (generates ocel_inventory_management.xml)
4. python 04_postprocess_activities.py (generates post_ocel_inventory_management.csv using the database and the initial CSV OCEL)
5. python 05_ocel_csv_to_ocel.py (generates post_ocel_inventory_management.xml)
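Once the XML logs have been generated, they can be inspected with pm4py. The following is a minimal sketch; the exact function set may vary slightly across pm4py versions:
import pm4py

# Load the enriched object-centric event log
ocel = pm4py.read_ocel('post_ocel_inventory_management.xml')

# Basic inspection: events, objects, and object types present in the log
print(ocel.events.head())
print(ocel.objects.head())
print(pm4py.ocel_get_object_types(ocel))

# Discover and view an object-centric directly-follows graph
ocdfg = pm4py.discover_ocdfg(ocel)
pm4py.view_ocdfg(ocdfg, format='svg')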
Potential Applications and Research: This dataset and the accompanying scripts can be used for research, development, and application of object-centric process mining techniques in inventory control and supply chain management.
Keywords: Object-Centric Event Log, OCEL, Process Mining, Inventory Management, Supply Chain, Simulation, Synthetic Data, SQLite, Python, pandas, pm4py, Economic Order Quantity (EOQ), Safety Stock (SS), Reorder Point (ROP), Stock Status Analysis.
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Smashcima is a library and framework for synthesizing images containing handwritten music, for creating synthetic training data for OMR models. It is primarily intended to be used as part of optical music recognition workflows, especially with domain adaptation in mind. The target user is therefore a machine-learning, document processing, library sciences, or computational musicology researcher with minimal skills in Python programming.
Smashcima is the only tool that simultaneously:
- synthesizes handwritten music notation,
- produces not only raster images but also segmentation masks, classification labels, bounding boxes, and more,
- synthesizes entire pages as well as individual symbols,
- synthesizes background paper textures,
- synthesizes also polyphonic and pianoform music images,
- accepts just MusicXML as input,
- is written in Python, which simplifies its adoption and extensibility.
Therefore, Smashcima brings a unique new capability for optical music recognition (OMR): synthesizing a near-realistic image of handwritten sheet music from just a MusicXML file. As opposed to notation editors, which work with a fixed set of fonts and a set of layout rules, it can adapt handwriting styles from existing OMR datasets to arbitrary music (beyond the music encoded in existing OMR datasets), and randomize layout to simulate the imprecisions of handwriting, while guaranteeing the semantic correctness of the output rendering. Crucially, the rendered image is provided also with the positions of all the visual elements of music notation, so that both object detection-based and sequence-to-sequence OMR pipelines can utilize Smashcima as a synthesizer of training data.
(In combination with the LMX canonical linearization of MusicXML, one can imagine the endless possibilities of running Smashcima on inputs from a MusicXML generator.)
Licence Ouverte / Open Licence (Etalab)https://www.etalab.gouv.fr/licence-ouverte-open-licence
This synthetic dataset was created as part of the translation and implementation of the algorithm used by the CNAM to build the "top diabetes" group (link to the description sheet of the algorithm).
The Python and SAS versions adapted by the HDH cover synthetic data for the years 2018-2019, but can be extended to other years. The CNAM source program was developed in SAS and runs on data from 2015 to 2019.
The algorithm mentioned above aims to target people receiving care for diabetes in the main SNDS database, in order to build the "top diabetes" group of the pathology mapping (version G8) created and maintained by the CNAM.
Implementing the top diabetes algorithm required the use of synthetic (fictitious) tables and variables.
- merging the yearly tables into a single table for ER_PRS_F, ER_ETE_F, ER_PHA_F,
Data/SNDS community. - Results related to the creation of the database: the algorithm used by the CNAM to construct the top diabetes group (source version (CNAM), Python version and SAS version (HDH)) (https://www.health-data-hub.fr/library-open-algorithms-health/algorithm-to-build-the-top-diabete-of-mapping).
The programs run on the HDH synthetic data, with some adaptations: this dataset was generated using the schema of the main SNDS database tables for 2019.
- conversion of the date format to yymmdd10.
Patient identification is based on the targeting of specific medications and/or ALD (long-term condition) status and/or hospitalization in MCO.
- renaming of NUM_ENQ to BEN_NIR_PSA. The mapping algorithms aim to maximize specificity (not sensitivity), i.e. to guarantee the absence of non-diabetics among the targeted patients.
Implementing the algorithm requires the following tables and variables (the required history is indicated in the corresponding box):
Patients with fewer than 3 dispensings of specific medications, who have no ALD and who have not been hospitalized for diabetes within 5 years, are not retained.
The programs adapted in SAS and Python run on synthetic data from 2018 and 2019. The CNAM source code (in SAS) is designed to work on data from 2015 through 2019.
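To make the retention rule above concrete, here is a purely illustrative pandas sketch of that logic; the table and column names are hypothetical placeholders, not the actual SNDS/HDH schema:
import pandas as pd

# Hypothetical per-patient summary built beforehand from the synthetic tables
patients = pd.DataFrame({
    'BEN_NIR_PSA': ['p1', 'p2', 'p3'],
    'n_dispensings_antidiabetics': [5, 1, 0],          # dispensings of specific drugs
    'has_diabetes_ALD': [False, False, True],          # ALD (long-term condition) for diabetes
    'hospitalized_diabetes_5y': [False, False, False], # MCO stay for diabetes within 5 years
})

# A patient is retained if at least one criterion is met
retained = patients[
    (patients['n_dispensings_antidiabetics'] >= 3)
    | patients['has_diabetes_ALD']
    | patients['hospitalized_diabetes_5y']
]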
Figure: SNDS tables and variables required by the algorithm (https://gitlab.com/healthdatahub/boas/cnam/top-diabete/-/raw/main/Tables_et_variables_du_SNDS_n%C3%A9cessaires.png)
The use of synthetic data, although useful for practicing the manipulation of SNDS data, has limitations: the lack of medical consistency, the lack of updating for annual changes, and an evolving table schema that may be incomplete and imperfect.
This program does not include an analysis of the estimated expenditure items reimbursed by the health insurance.
The algorithm identifies prevalent diabetes patients for a given year (2019). It does not determine the exact date of diabetes onset in the database.
More information on the use of the database in the context of the top diabetes programs (CNAM) is available on the programs' GitLab repository (link to the GitLab repository).
Contact point: dir.donnees-SNDS@health-data-hub.fr
On GitLab (create a ticket or a merge request)