10 datasets found
  1. Data from: A large synthetic dataset for machine learning applications in power transmission grids

    • zenodo.org
    csv, json, png, zip
    Updated Mar 25, 2025
    Cite
    Marc Gillioz; Guillaume Dubuis; Philippe Jacquod (2025). A large synthetic dataset for machine learning applications in power transmission grids [Dataset]. http://doi.org/10.5281/zenodo.13378476
    Available download formats: zip, png, csv, json
    Dataset updated
    Mar 25, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marc Gillioz; Guillaume Dubuis; Philippe Jacquod
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    With the ongoing energy transition, power grids are evolving fast. They operate more and more often close to their technical limits, under increasingly volatile conditions. Fast, essentially real-time computational approaches to evaluate their operational safety, stability, and reliability are therefore highly desirable. Machine learning methods have been advocated to solve this challenge; however, they are heavy consumers of training and testing data, while historical operational data for real-world power grids are hard, if not impossible, to access.

    This dataset contains long time series for production, consumption, and line flows, amounting to 20 years of data with a time resolution of one hour, for several thousands of loads and several hundreds of generators of various types, representing the ultra-high-voltage transmission grid of continental Europe. The synthetic time series have been statistically validated against real-world data.

    Data generation algorithm

    The algorithm is described in a Nature Scientific Data paper. It relies on the PanTaGruEl model of the European transmission network -- the admittance of its lines as well as the location, type and capacity of its power generators -- and aggregated data gathered from the ENTSO-E transparency platform, such as power consumption aggregated at the national level.

    Network

    The network information is encoded in the file europe_network.json. It is given in PowerModels format, which is itself derived from MatPower and compatible with PandaPower. The network features 7822 power lines and 553 transformers connecting 4097 buses, to which are attached 815 generators of various types.

    Time series

    The time series forming the core of this dataset are given in CSV format. Each CSV file is a table with 8736 rows, one for each hourly time step of a 364-day year. All years are truncated to exactly 52 weeks of 7 days, and start on a Monday (the load profiles are typically different during weekdays and weekends). The number of columns depends on the type of table: there are 4097 columns in load files, 815 for generators, and 8375 for lines (including transformers). Each column is described by a header corresponding to the element identifier in the network file. All values are given in per-unit, both in the model file and in the tables, i.e. they are multiples of a base unit taken to be 100 MW.

    There are 20 tables of each type, labeled with a reference year (2016 to 2020) and an index (1 to 4), zipped into archive files arranged by year. This amounts to a total of 20 years of synthetic data. When using load, generator, and line profiles together, it is important to use the same label: for instance, the files loads_2020_1.csv, gens_2020_1.csv, and lines_2020_1.csv represent the same year of the dataset, whereas gens_2020_2.csv is unrelated (it actually shares some features, such as nuclear profiles, but it is based on a dispatch with distinct loads).
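
    As an illustration (a minimal sketch, not part of the original documentation), the three matched tables of one label can be loaded together and converted from per-unit to MW using the 100 MW base mentioned above; the files are assumed to have already been extracted from the yearly archives:

    import pandas as pd

    BASE_MW = 100  # 1.0 per-unit corresponds to 100 MW (see above)

    # The three tables sharing the label 2020_1 describe the same synthetic year
    loads_2020_1 = pd.read_csv('loads_2020_1.csv')
    gens_2020_1 = pd.read_csv('gens_2020_1.csv')
    lines_2020_1 = pd.read_csv('lines_2020_1.csv')

    # Convert the per-unit values to MW
    loads_mw = loads_2020_1 * BASE_MW
    gens_mw = gens_2020_1 * BASE_MW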

    Usage

    The time series can be used without reference to the network file, simply using all or a selection of columns of the CSV files, depending on the needs. We show below how to select series from a particular country, and how to aggregate hourly time steps into days or weeks. These examples use Python and the data analysis library pandas, but other frameworks can be used as well (Matlab, Julia). Since all the yearly time series are periodic, it is always possible to define a coherent time window modulo the length of the series.

    Selecting a particular country

    This example illustrates how to select generation data for Switzerland in Python. This can be done without parsing the network file, but using instead gens_by_country.csv, which contains a list of all generators for any country in the network. We start by importing the pandas library, and read the column of the file corresponding to Switzerland (country code CH):

    import pandas as pd
    CH_gens = pd.read_csv('gens_by_country.csv', usecols=['CH'], dtype=str)

    The object created in this way is a DataFrame with some null values (not all countries have the same number of generators). It can be turned into a list with:

    CH_gens_list = CH_gens.dropna().squeeze().to_list()

    Finally, we can import all the time series of Swiss generators from a given data table with

    pd.read_csv('gens_2016_1.csv', usecols=CH_gens_list)

    The same procedure can be applied to loads using the list contained in the file loads_by_country.csv.

    Averaging over time

    This second example shows how to change the time resolution of the series. Suppose that we are interested in all the loads from a given table, which are given by default with a one-hour resolution:

    hourly_loads = pd.read_csv('loads_2018_3.csv')

    To get a daily average of the loads, we can use:

    daily_loads = hourly_loads.groupby([t // 24 for t in range(24 * 364)]).mean()

    This results in series of length 364. To average further over entire weeks and get series of length 52, we use:

    weekly_loads = hourly_loads.groupby([t // (24 * 7) for t in range(24 * 364)]).mean()
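
    The two steps can also be combined; for instance, reusing CH_gens_list from above to obtain weekly-averaged Swiss generation:

    CH_hourly = pd.read_csv('gens_2016_1.csv', usecols=CH_gens_list)
    CH_weekly = CH_hourly.groupby([t // (24 * 7) for t in range(24 * 364)]).mean()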

    Source code

    The code used to generate the dataset is freely available at https://github.com/GeeeHesso/PowerData. It consists of two packages and several documentation notebooks. The first package, written in Python, provides functions to handle the data and to generate synthetic series based on historical data. The second package, written in Julia, is used to perform the optimal power flow. The documentation, in the form of Jupyter notebooks, contains numerous examples of how to use both packages. The entire workflow used to create this dataset is also provided, starting from raw ENTSO-E data files and ending with the synthetic dataset given in the repository.

    Funding

    This work was supported by the Cyber-Defence Campus of armasuisse and by an internal research grant of the Engineering and Architecture domain of HES-SO.

  2. Tuberculosis X-Ray Dataset (Synthetic)

    • kaggle.com
    Updated Mar 12, 2025
    Cite
    Arif Miah (2025). Tuberculosis X-Ray Dataset (Synthetic) [Dataset]. https://www.kaggle.com/datasets/miadul/tuberculosis-x-ray-dataset-synthetic
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Mar 12, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Arif Miah
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    šŸ“ Dataset Summary

    This synthetic dataset contains 20,000 records of X-ray data labeled as "Normal" or "Tuberculosis". It is specifically created for training and evaluating classification models in the field of medical image analysis. The dataset aims to aid in building machine learning and deep learning models for detecting tuberculosis from X-ray data.

    Context

    Tuberculosis (TB) is a highly infectious disease that primarily affects the lungs. Accurate detection of TB using chest X-rays can significantly enhance medical diagnostics. However, real-world datasets are often scarce or restricted due to privacy concerns. This synthetic dataset bridges that gap by providing simulated patient data while maintaining realistic distributions and patterns commonly observed in TB cases.

    Dataset Details

    • Number of Rows: 20,000
    • Number of Columns: 15
    • File Format: CSV
    • Resolution: Simulated patient data, not real X-ray images
    • Size: Approximately 10 MB

    šŸ·ļø Columns and Descriptions

    Column NameDescription
    Patient_IDUnique ID for each patient (e.g., PID000001)
    AgeAge of the patient (in years)
    GenderGender of the patient (Male/Female)
    Chest_PainPresence of chest pain (Yes/No)
    Cough_SeveritySeverity of cough (Scale: 0-9)
    BreathlessnessSeverity of breathlessness (Scale: 0-4)
    FatigueLevel of fatigue experienced (Scale: 0-9)
    Weight_LossWeight loss (in kg)
    FeverLevel of fever (Mild, Moderate, High)
    Night_SweatsWhether night sweats are present (Yes/No)
    Sputum_ProductionLevel of sputum production (Low, Medium, High)
    Blood_in_SputumPresence of blood in sputum (Yes/No)
    Smoking_HistorySmoking status (Never, Former, Current)
    Previous_TB_HistoryPrevious tuberculosis history (Yes/No)
    ClassTarget variable indicating the condition (Normal, Tuberculosis)

    Data Generation Process

    The dataset was generated using Python with the following libraries:
    - Pandas: To create and save the dataset as a CSV file
    - NumPy: To generate random numbers and simulate realistic data
    - Random Seed: Set to ensure reproducibility

    The target variable "Class" has a 70-30 distribution between Normal and Tuberculosis cases. The data is randomly generated with realistic patterns that mimic typical TB symptoms and demographic distributions.
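
    As a rough illustration (a sketch under assumed distributions, not the author's actual script), a table like this can be simulated with NumPy and Pandas using a fixed seed and the 70-30 class split described above:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(42)  # fixed seed for reproducibility
    n = 20_000

    df = pd.DataFrame({
        'Patient_ID': [f'PID{i:06d}' for i in range(1, n + 1)],
        'Age': rng.integers(1, 91, n),
        'Gender': rng.choice(['Male', 'Female'], n),
        'Cough_Severity': rng.integers(0, 10, n),   # scale 0-9
        'Breathlessness': rng.integers(0, 5, n),    # scale 0-4
        'Night_Sweats': rng.choice(['Yes', 'No'], n),
        # target with the 70-30 Normal/Tuberculosis split mentioned above
        'Class': rng.choice(['Normal', 'Tuberculosis'], n, p=[0.7, 0.3]),
    })
    df.to_csv('tb_synthetic_sample.csv', index=False)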

    Usage

    This dataset is intended for:
    - Machine Learning and Deep Learning classification tasks
    - Data exploration and feature analysis
    - Model evaluation and comparison
    - Educational and research purposes

    Potential Applications

    1. Tuberculosis Detection Models: Train CNNs or other classification algorithms to detect TB.
    2. Healthcare Research: Analyze the correlation between symptoms and TB outcomes.
    3. Data Visualization: Perform EDA to uncover patterns and insights.
    4. Model Benchmarking: Compare various algorithms for TB detection.

    License

    This synthetic dataset is open for educational and research use. Please credit the creator if used in any public or academic work.

    Acknowledgments

    This dataset was generated as a synthetic alternative to real-world data to help developers and researchers practice building and fine-tuning classification models without the constraints of sensitive patient data.

  3. dummy_health_data

    • huggingface.co
    Updated May 29, 2025
    Cite
    Mudumbai Vraja Kishore (2025). dummy_health_data [Dataset]. https://huggingface.co/datasets/vrajakishore/dummy_health_data
    Dataset updated
    May 29, 2025
    Authors
    Mudumbai Vraja Kishore
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Synthetic Healthcare Dataset

      Overview
    

    This dataset is a synthetic healthcare dataset created for use in data analysis. It mimics real-world patient healthcare data and is intended for applications within the healthcare industry.

      Data Generation
    

    The data has been generated using the Faker Python library, which produces randomized and synthetic records that resemble real-world data patterns. It includes various healthcare-related fields such as patient… See the full description on the dataset page: https://huggingface.co/datasets/vrajakishore/dummy_health_data.
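
    For orientation, a minimal sketch of how Faker can produce records of this kind (the field names here are illustrative assumptions, not the dataset's actual schema):

    import random
    import pandas as pd
    from faker import Faker

    fake = Faker()
    Faker.seed(0)
    random.seed(0)

    # Generate a handful of fake patient-like records
    records = [{
        'patient_name': fake.name(),
        'date_of_birth': fake.date_of_birth(minimum_age=18, maximum_age=90),
        'admission_date': fake.date_this_decade(),
        'blood_type': random.choice(['A+', 'A-', 'B+', 'B-', 'AB+', 'AB-', 'O+', 'O-']),
    } for _ in range(5)]

    print(pd.DataFrame(records))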

  4. Synthetic river flow videos dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 3, 2022
    Cite
    Synthetic river flow videos dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6257391
    Dataset updated
    Aug 3, 2022
    Dataset provided by
    Magali Jodeau
    Alexandre Hauet
    JƩrƓme Le Coz
    Guillaume Bodart
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Synthetic river flow videos for evaluating image-based velocimetry methods


    Year : 2022

    Authors : G.Bodart (guillaume.bodart@inrae.fr), J.Le Coz (jerome.lecoz@inrae.fr), M.Jodeau (magali.jodeau@edf.fr), A.Hauet (alexandre.hauet@edf.fr)

    This file describes the data attached to the article:

    -> 00_article_cases

    This folder contains the data used in the case studies: synthetic videos + reference files.

    - 00_reference_velocities
      -> Reference velocities interpolated on a regular grid. Data are given in conventional units, i.e. m/s and m.
    
    
    - 01_XX
      -> Data of the first case study
    
    
    - 02_XX
      -> Data of the second case study
    

    -> 01_dev

    This folder contains the Python libraries and Mantaflow modified source code used in the paper. The libraries are provided as is. Feel free to contact us for support or guidelines.

    - lspiv
      -> Python library used to extract, process and display results of LSPIV analysis carried out with Fudaa-LSPIV
    
    
    - mantaflow-modified
      -> Modified version of Mantaflow described in the article. Installation instructions can be found at http://mantaflow.com
    
    
    - syri
      -> Python library used to extract, process and display fluid simulations carried out with Mantaflow and Blender. (Requires the lspiv library)
    

    -> 02_dataset

    This folder contains synthetic videos generated with the method described in the article. The fluid simulation parameters, and thus the reference velocities, are the same as those presented in the article.

    • The videos can be used freely. Please consider citing the corresponding paper.
  5. Synthetic total-field magnetic anomaly data and code to perform Euler deconvolution on it

    • figshare.com
    txt
    Updated May 30, 2023
    Cite
    Leonardo Uieda; Vanderlei C. Oliveira Jr.; Valeria C. F. Barbosa (2023). Synthetic total-field magnetic anomaly data and code to perform Euler deconvolution on it [Dataset]. http://doi.org/10.6084/m9.figshare.923450.v1
    Available download formats: txt
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Leonardo Uieda; Vanderlei C. Oliveira Jr.; Valeria C. F. Barbosa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Synthetic data, source code, and supplementary text for the article "Euler deconvolution of potential field data" by Leonardo Uieda, Vanderlei C. Oliveira Jr., and ValƩria C. F. Barbosa. This is part of a tutorial submitted to The Leading Edge (http://library.seg.org/journal/tle). Results were generated using the open-source Python package Fatiando a Terra version 0.2 (http://www.fatiando.org). This material, along with the manuscript, can also be found at https://github.com/pinga-lab/paper-tle-euler-tutorial

    Synthetic data and model

    Examples in the tutorial use synthetic data generated with the IPython notebook create_synthetic_data.ipynb. File synthetic_data.txt has 4 columns: x (north), y (east), z (down), and the total-field magnetic anomaly. x, y, and z are in meters. The total-field anomaly is in nanotesla (nT). File metadata.json contains extra information about the data, such as the inclination and declination of the inducing field (in degrees), the shape of the data grid (number of points in y and x, respectively), the area containing the data (W, E, S, N, in meters), and the model boundaries (W, E, S, N, top, bottom, in meters). File model.pickle is a serialized version of the model used to generate the data. It contains a list of instances of the PolygonalPrism class of Fatiando a Terra. The serialization was done using the cPickle Python module.

    Reproducing the results in the tutorial

    The notebook euler-deconvolution-examples.ipynb runs the Euler deconvolution on the synthetic data and generates the figures for the manuscript. It also presents a more detailed explanation of the method and more tests than went into the finished manuscript.
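
    For instance, a minimal sketch of reading the files described above with NumPy and the standard json module (the metadata key used for the grid shape is an assumption; inspect metadata.json for the actual names):

    import json
    import numpy as np

    # synthetic_data.txt columns: x (north), y (east), z (down), total-field anomaly (nT)
    x, y, z, anomaly = np.loadtxt('synthetic_data.txt', unpack=True)

    with open('metadata.json') as f:
        metadata = json.load(f)

    # Reshape the anomaly into the (ny, nx) grid described in the metadata
    ny, nx = metadata['shape']  # key name is an assumption
    anomaly_grid = anomaly.reshape(ny, nx)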

  6. Retail Store Star Schema Dataset

    • kaggle.com
    Updated Apr 22, 2025
    Cite
    Shrinivas Vishnupurikar (2025). Retail Store Star Schema Dataset [Dataset]. https://www.kaggle.com/datasets/shrinivasv/retail-store-star-schema-dataset
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Apr 22, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Shrinivas Vishnupurikar
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Retail Star Schema (Normalized & Denormalized) – Synthetic Dataset

    This dataset provides a simulated retail data warehouse designed using star schema modeling principles.

    It includes both normalized and denormalized versions of a retail sales star schema, making it a valuable resource for data engineers, analysts, and data warehouse enthusiasts who want to explore real-world scenarios, performance tuning, and modeling strategies.

    šŸ“ Dataset Structure

    This dataset has two fact tables:

    • fact_sales_normalized.csv – no columns from the dim_* tables have been denormalized into the fact table (diagram: Normalized Retail Star Schema).

    • fact_sales_denormalized.csv – specific columns from certain dim_* tables have been denormalized into the fact table (diagram: Denormalized Retail Star Schema).

    The dim_* tables stay the same for both:
    - Dim_Customers.csv
    - Dim_Products.csv
    - Dim_Stores.csv
    - Dim_Dates.csv
    - Dim_Salesperson
    - Dim_Campaign

    Use Cases

    • Practice star schema design and dimensional modeling
    • Learn how to denormalize dimensions for BI and analytics performance
    • Benchmark analytical queries (joins, aggregations, filtering)
    • Test data pipelines, ETL/ELT transformations, and query optimization strategies

    • Explore how denormalization affects storage, redundancy, and performance
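
    As a small usage illustration (a sketch, not part of the dataset documentation), a dimension can be resolved for each fact row with Polars; the join key name customer_id is an assumption and should be checked against the actual column headers:

    import polars as pl

    fact = pl.read_csv('fact_sales_normalized.csv')
    dim_customers = pl.read_csv('Dim_Customers.csv')

    # Star-schema lookup: attach customer attributes to each sale
    sales_with_customers = fact.join(dim_customers, on='customer_id', how='left')
    print(sales_with_customers.head())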

    Notes

    All data is synthetic and randomly generated via Python scripts that use the Polars library for data manipulation; no real customer or business data is included.

    Ideal for use with tools like SQL engines, Redshift, BigQuery, Snowflake, or even DuckDB.

    Credits

    Shrinivas Vishnupurikar, Data Engineer @Velotio Technologies.

  7. E-commerce Sales Prediction Dataset

    • kaggle.com
    Updated Dec 14, 2024
    Cite
    Nevil Dhinoja (2024). E-commerce Sales Prediction Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/10197264
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Dec 14, 2024
    Dataset provided by
    Kaggle
    Authors
    Nevil Dhinoja
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    E-commerce Sales Prediction Dataset

    This repository contains a comprehensive and clean dataset for predicting e-commerce sales, tailored for data scientists, machine learning enthusiasts, and researchers. The dataset is crafted to analyze sales trends, optimize pricing strategies, and develop predictive models for sales forecasting.

    Dataset Overview

    The dataset includes 1,000 records across the following features:

    Date: The date of the sale (01-01-2023 onward).
    Product_Category: Category of the product (e.g., Electronics, Sports, Other).
    Price: Price of the product (numerical).
    Discount: Discount applied to the product (numerical).
    Customer_Segment: Buyer segment (e.g., Regular, Occasional, Other).
    Marketing_Spend: Marketing budget allocated for sales (numerical).
    Units_Sold: Number of units sold per transaction (numerical).

    Data Summary

    General Properties

    Date:
    - Range: 01-01-2023 to 12-31-2023.
    - Contains 1,000 unique values without missing data.

    Product_Category:
    - Categories: Electronics (21%), Sports (21%), Other (58%).
    - Most common category: Electronics (21%).

    Price:
    - Range: from 244 to 999.
    - Mean: 505, standard deviation: 290.
    - Most common price range: 14.59 - 113.07.

    Discount:
    - Range: from 0.01% to 49.92%.
    - Mean: 24.9%, standard deviation: 14.4%.
    - Most common discount range: 0.01 - 5.00%.

    Customer_Segment:
    - Segments: Regular (35%), Occasional (34%), Other (31%).
    - Most common segment: Regular.

    Marketing_Spend:
    - Range: from 2.41k to 10k.
    - Mean: 4.91k, standard deviation: 2.84k.

    Units_Sold:
    - Range: from 5 to 57.
    - Mean: 29.6, standard deviation: 7.26.
    - Most common range: 24 - 34 units sold.

    Data Visualizations

    The dataset is suitable for creating the following visualizations:

    1. Price Distribution: Histogram to show the spread of prices.
    2. Discount Distribution: Histogram to analyze promotional offers.
    3. Marketing Spend Distribution: Histogram to understand marketing investment patterns.
    4. Customer Segment Distribution: Bar plot of customer segments.
    5. Price vs Units Sold: Scatter plot to show pricing effects on sales.
    6. Discount vs Units Sold: Scatter plot to explore the impact of discounts.
    7. Marketing Spend vs Units Sold: Scatter plot for marketing effectiveness.
    8. Correlation Heatmap: Identify relationships between features.
    9. Pairplot: Visualize pairwise feature interactions.

    How the Data Was Created

    The dataset is synthetically generated to mimic realistic e-commerce sales trends. Below are the steps taken for data generation:

    1. Feature Engineering:

      • Identified key attributes such as product category, price, discount, and marketing spend, typically observed in e-commerce data.
      • Generated dependent features like units sold based on logical relationships.
    2. Data Simulation:

      • Python Libraries: Used NumPy and Pandas to generate and distribute values.
      • Statistical Modeling: Ensured feature distributions aligned with real-world sales data patterns.
    3. Validation:

      • Verified data consistency with no missing or invalid values.
      • Ensured logical correlations (e.g., higher discounts → increased units sold).

    Note: The dataset is synthetic and not sourced from any real-world e-commerce platform.
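
    As a rough sketch (not the author's actual generation script), dependent features like Units_Sold can be simulated from the other columns with NumPy and Pandas; the coefficients below are illustrative assumptions, while the value ranges follow the data summary above:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    n = 1_000

    price = rng.uniform(244, 999, n)
    discount = rng.uniform(0.01, 49.92, n)          # percent
    marketing_spend = rng.uniform(2410, 10_000, n)

    # Units sold rises with discount and marketing spend, falls with price, plus noise
    units_sold = (30 - 0.01 * price + 0.2 * discount
                  + 0.001 * marketing_spend + rng.normal(0, 5, n))
    units_sold = np.clip(units_sold, 5, 57).round().astype(int)

    df = pd.DataFrame({'Price': price, 'Discount': discount,
                       'Marketing_Spend': marketing_spend, 'Units_Sold': units_sold})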

    Example Usage: Sales Prediction Model

    Here's an example, written in Python, of building a predictive model using Linear Regression:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error, r2_score
    
    # Load the dataset
    df = pd.read_csv('ecommerce_sales.csv')
    
    # Feature selection
    X = df[['Price', 'Discount', 'Marketing_Spend']]
    y = df['Units_Sold']
    
    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Model training
    model = LinearRegression()
    model.fit(X_train, y_train)
    
    # Predictions
    y_pred = model.predict(X_test)
    
    # Evaluation
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    print(f'Mean Squared Error: {mse:.2f}')
    print(f'R-squared: {r2:.2f}')
    
  8. Simulated Inventory Management Database and Object-Centric Event Logs for Process Analysis

    • zenodo.org
    bin, csv +2
    Updated May 26, 2025
    Cite
    Alessandro Berti (2025). Simulated Inventory Management Database and Object-Centric Event Logs for Process Analysis [Dataset]. http://doi.org/10.5281/zenodo.15515788
    Available download formats: xml, text/x-python, csv, bin
    Dataset updated
    May 26, 2025
    Dataset provided by
    Zenodo
    Authors
    Alessandro Berti
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract: This repository/dataset provides a suite of Python scripts to generate a simulated relational database for inventory management processes and transform this data into object-centric event logs (OCEL) suitable for advanced process mining analysis. The primary goal is to offer a synthetic yet realistic dataset that facilitates research, development, and application of object-centric process mining techniques in the domain of inventory control and supply chain management. The generated event logs capture common inventory operations, track stock level changes, and are enriched with key inventory management parameters (like EOQ, Safety Stock, Reorder Point) and status-based activity labels (e.g., indicating understock or overstock situations).

    Overview: Inventory management is a critical business process characterized by the interaction of various entities such as materials, purchase orders, sales orders, plants, suppliers, and customers. Traditional process mining often struggles to capture these complex interactions. Object-Centric Process Mining (OCPM) offers a more suitable paradigm. This project provides the tools to create and explore such data.

    The workflow involves:

    1. Database Simulation: Generating a SQLite database with tables for materials, sales orders, purchase orders, goods movements, stock levels, etc., populated with simulated data.
    2. Initial OCEL Generation: Extracting data from the SQLite database and structuring it as an object-centric event log (in CSV format). This log includes activities like "Create Purchase Order Item", "Goods Receipt", "Create Sales Order Item", "Goods Issue", and tracks running stock levels for materials.
    3. OCEL Post-processing and Enrichment:
      • Calculating standard inventory management metrics such as Economic Order Quantity (EOQ), Safety Stock (SS), and Reorder Point (ROP) for each material-plant combination based on the simulated historical data.
      • Merging these metrics into the event log.
      • Enhancing activity labels to include the current stock status (e.g., "Understock", "Overstock", "Normal") relative to calculated SS and Overstock (OS) levels (where OS = SS + EOQ).
      • Generating new, distinct events to explicitly mark the moments when stock statuses change (e.g., "START UNDERSTOCK", "ST CHANGE NORMAL to OVERSTOCK", "END NORMAL").
    4. Format Conversion: Converting the CSV-based OCELs into the standard OCEL XML/OCEL2 format using the pm4py library.

    Contents:

    The repository contains the following Python scripts:

    • 01_generate_simulation.py:

      • Creates a SQLite database named inventory_management.db.
      • Defines and populates tables including: Materials, SalesOrderDocuments, SalesOrderItems, PurchaseOrderDocuments, PurchaseOrderItems, PurchaseRequisitions, GoodsReceiptsAndIssues, MaterialStocks, MaterialDocuments, SalesDocumentFlows, and OrderSuggestions.
      • Simulates data for a configurable number of materials, customers, sales, purchases, etc., with randomized dates and quantities.
    • 02_database_to_ocel_csv.py:

      • Connects to the inventory_management.db.
      • Executes a SQL query to extract relevant events and their associated objects for inventory processes.
      • Constructs an initial object-centric event log, saved as ocel_inventory_management.csv.
      • Identified object types include: MAT (Material), PLA (Plant), PO_ITEM (Purchase Order Item), SO_ITEM (Sales Order Item), CUSTOMER, SUPPLIER.
      • Calculates "Stock Before" and "Stock After" for each event affecting material stock.
      • Standardizes column names to OCEL conventions (e.g., ocel:activity, ocel:timestamp, ocel:type:).
    • 03_ocel_csv_to_ocel.py:

      • Reads ocel_inventory_management.csv.
      • Uses pm4py to convert the CSV event log into the standard OCEL XML format (ocel_inventory_management.xml).
    • 04_postprocess_activities.py:

      • Reads data from inventory_management.db to calculate inventory parameters:
        • Annual Demand (Dm)
        • Average Daily Demand (dm)
        • Standard Deviation of Daily Demand (σm)
        • Average Lead Time (lm)
        • Economic Order Quantity (EOQ): √((2 · Dm · S) / H), where S is the fixed order cost and H is the holding cost
        • Safety Stock (SS): z · σm · √lm, where z is the z-score for the desired service level
        • Reorder Point (ROP): (dm · lm) + SS (a pandas sketch of these computations is given after this list)
      • Merges these calculated parameters with ocel_inventory_management.csv.
      • Computes an Overstock level (OS) as SS+EOQ.
      • Derives a "Current Status" (Understock, Overstock, Normal) for each event based on "Stock After" relative to SS and OS.
      • Appends this status to the ocel:activity label (e.g., "Goods Issue (Understock)").
      • Generates new events for status changes (e.g., "START NORMAL", "ST CHANGE UNDERSTOCK to NORMAL", "END OVERSTOCK") with adjusted timestamps to precisely mark these transitions.
      • Creates a new object type MAT_PLA (Material-Plant combination) for easier status tracking.
      • Saves the enriched and transformed log as post_ocel_inventory_management.csv.
    • 05_ocel_csv_to_ocel.py:

      • Reads the post-processed post_ocel_inventory_management.csv.
      • Uses pm4py to convert this enriched CSV event log into the standard OCEL XML format (post_ocel_inventory_management.xml).
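
    For reference, a minimal pandas sketch of the parameter computation described above (the cost constants, lead time, and the layout of the demand table are illustrative assumptions, not the values used by 04_postprocess_activities.py):

    import numpy as np
    import pandas as pd

    S = 50.0   # fixed order cost (assumption)
    H = 2.0    # annual holding cost per unit (assumption)
    z = 1.65   # z-score for roughly a 95% service level

    # Assumed layout: one row per material/plant/day with a 'quantity' column,
    # covering one year of history so that the sum approximates annual demand.
    daily_demand = pd.read_csv('daily_demand.csv')
    grouped = daily_demand.groupby(['material', 'plant'])['quantity']

    params = grouped.agg(dm='mean', sigma_m='std', Dm='sum').reset_index()
    params['lead_time'] = 7  # average lead time in days (assumption)

    params['EOQ'] = np.sqrt(2 * params['Dm'] * S / H)
    params['SS'] = z * params['sigma_m'] * np.sqrt(params['lead_time'])
    params['ROP'] = params['dm'] * params['lead_time'] + params['SS']
    params['OS'] = params['SS'] + params['EOQ']  # overstock threshold = SS + EOQ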

    Generated Dataset Files (if included, or can be generated using the scripts):

    • inventory_management.db: The SQLite database containing the simulated raw data.
    • ocel_inventory_management.csv: The initial OCEL in CSV format.
    • ocel_inventory_management.xml: The initial OCEL in standard OCEL XML format.
    • post_ocel_inventory_management.csv: The post-processed and enriched OCEL in CSV format.
    • post_ocel_inventory_management.xml: The post-processed and enriched OCEL in standard OCEL XML format.

    How to Use:

    1. Ensure you have Python installed along with the following libraries: sqlite3 (standard library), pandas, numpy, pm4py.
    2. Run the scripts sequentially in a terminal or command prompt:
      • python 01_generate_simulation.py (generates inventory_management.db)
      • python 02_database_to_ocel_csv.py (generates ocel_inventory_management.csv from the database)
      • python 03_ocel_csv_to_ocel.py (generates ocel_inventory_management.xml)
      • python 04_postprocess_activities.py (generates post_ocel_inventory_management.csv using the database and the initial CSV OCEL)
      • python 05_ocel_csv_to_ocel.py (generates post_ocel_inventory_management.xml)

    Potential Applications and Research: This dataset and the accompanying scripts can be used for:

    • Applying and evaluating object-centric process mining algorithms on inventory management data.
    • Analyzing inventory dynamics, such as the causes and effects of understocking or overstocking.
    • Discovering and conformance checking process models that involve multiple interacting objects (materials, orders, plants).
    • Investigating the impact of different inventory control parameters (EOQ, SS, ROP) on process execution.
    • Developing educational materials for teaching OCPM in a supply chain context.
    • Serving as a benchmark for new OCEL-based analysis techniques.

    Keywords: Object-Centric Event Log, OCEL, Process Mining, Inventory Management, Supply Chain, Simulation, Synthetic Data, SQLite, Python, pandas, pm4py, Economic Order Quantity (EOQ), Safety Stock (SS), Reorder Point (ROP), Stock Status Analysis.

  9. Smashcima (2025-03-28)

    • live.european-language-grid.eu
    Updated Dec 29, 2024
    + more versions
    Cite
    (2024). Smashcima (2025-03-28) [Dataset]. https://live.european-language-grid.eu/catalogue/tool-service/23850
    Dataset updated
    Dec 29, 2024
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Smashcima is a library and framework for synthesizing images of handwritten music, used to create synthetic training data for OMR (optical music recognition) models. It is primarily intended to be used as part of optical music recognition workflows, especially with domain adaptation in mind. The target user is therefore a machine-learning, document processing, library sciences, or computational musicology researcher with minimal skills in Python programming.

    Smashcima is the only tool that simultaneously:
    - synthesizes handwritten music notation,
    - produces not only raster images but also segmentation masks, classification labels, bounding boxes, and more,
    - synthesizes entire pages as well as individual symbols,
    - synthesizes background paper textures,
    - also synthesizes polyphonic and pianoform music images,
    - accepts just MusicXML as input,
    - is written in Python, which simplifies its adoption and extensibility.

    Therefore, Smashcima brings a unique new capability for optical music recognition (OMR): synthesizing a near-realistic image of handwritten sheet music from just a MusicXML file. As opposed to notation editors, which work with a fixed set of fonts and a set of layout rules, it can adapt handwriting styles from existing OMR datasets to arbitrary music (beyond the music encoded in existing OMR datasets), and randomize layout to simulate the imprecisions of handwriting, while guaranteeing the semantic correctness of the output rendering. Crucially, the rendered image is provided also with the positions of all the visual elements of music notation, so that both object detection-based and sequence-to-sequence OMR pipelines can utilize Smashcima as a synthesizer of training data.

    (In combination with the LMX canonical linearization of MusicXML, one can imagine the endless possibilities of running Smashcima on inputs from a MusicXML generator.)

  10. Synthetische gegevens - topdiabetes (synthetic data for the "top diabetes" indicator)

    • data.europa.eu
    zip
    Cite
    Health Data Hub, Synthetische gegevens - topdiabetes [Dataset]. https://data.europa.eu/data/datasets/662a7a37ee85069bfb9a666b?locale=nl
    Available download formats: zip (179413420)
    Dataset authored and provided by
    Health Data Hub
    License

    Licence Ouverte / Open Licence: https://www.etalab.gouv.fr/licence-ouverte-open-licence

    Description

    Description of the database:

    • Objectives and original purposes of the database:

    This synthetic dataset was created as part of the translation and implementation of the algorithm used by the CNAM to build the "top diabetes" indicator (link to the description sheet of the algorithm).

    The Python and SAS versions adapted by the HDH cover synthetic data for the years 2018-2019, but can be extended to other years. The CNAM source program was developed in SAS and runs on data from 2015 to 2019.

    The algorithm mentioned above aims to target people receiving diabetes care in the main NSDS database in order to create the "top diabetes" category of the pathology mapping (version G8) created and maintained by the CNAM.

    • Context of creation:

    The implementation of the top diabetes algorithm required the use of synthetic (fictitious) tables and variables.

    - merging annual tables into a single table for ER_PRS_F, ER_ETE_F, ER_PHA_F,

    Data/SNDS community. - Results related to the creation of the database: the algorithm used by the CNAM to construct the top diabetes (source version (CNAM), Python version and SAS version (HDH)) (https://www.health-data-hub.fr/library-open-algorithms-health/algorithm-to-build-the-top-diabete-of-mapping).


    • Collection methodology and inclusion criteria:

    Data presentation:

    The programs work on the HDH synthetic data, with a few adaptations. This dataset was generated using the schema of the main NSDS database tables of 2019.

    • Target population:

    - conversion of the date format to yymmdd10.

    Patient identification is based on the targeting of specific medications and/or long-term illness (ALD) status and/or hospitalisation in MCO.

    - renaming of NUM_ENQ to BEN_NIR_PSA. The mapping algorithms aim to maximise specificity (not sensitivity), i.e. to ensure the absence of non-diabetics among the targeted patients.

    • Choice of variables:

    The implementation of the algorithm requires the following tables and variables (the required history is indicated in the corresponding box):

    Patients with fewer than 3 dispensations of specific medications, who have no ALD and who have not been hospitalised for diabetes within 5 years, are not retained.

    The programs adapted in SAS and Python run on synthetic data from the years 2018 and 2019. The CNAM source code (in SAS) is designed to work on data from the years 2015 through 2019.

    Limitations of this dataset:

    (image of the required SNDS tables and variables: https://gitlab.com/healthdatahub/boas/cnam/top-diabete/-/raw/main/Tables_et_variables_du_SNDS_n%C3%A9cessaires.png) The lack of medical consistency, the lack of updating for annual changes, and an evolving table schema that may be incomplete and imperfect.

    This program does not include an analysis of the estimated expenditure items reimbursed by the health insurance.

    The algorithm identifies prevalent diabetes patients in a given year (2019). It does not determine the exact date of diabetes onset in the database.

    The use of synthetic data, although useful for working with NSDS data, has limitations:

    More information about the use of the database in the context of the top diabetes programs (CNAM) is available on the GitLab repository of the programs (link to the GitLab repository).

    Support:

    Contact point: dir.donnees-SNDS@health-data-hub.fr

    Contribution:

    On GitLab (create a ticket or a merge request)

