10 datasets found
  1. Data from: A large synthetic dataset for machine learning applications in power transmission grids

    • zenodo.org
    csv, json, png, zip
    Updated Mar 25, 2025
    Cite
    Marc Gillioz; Guillaume Dubuis; Philippe Jacquod (2025). A large synthetic dataset for machine learning applications in power transmission grids [Dataset]. http://doi.org/10.5281/zenodo.13378476
    Available download formats: zip, png, csv, json
    Dataset updated
    Mar 25, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marc Gillioz; Guillaume Dubuis; Philippe Jacquod
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    With the ongoing energy transition, power grids are evolving fast. They operate more and more often close to their technical limits, under increasingly volatile conditions. Fast, essentially real-time computational approaches to evaluate their operational safety, stability, and reliability are therefore highly desirable. Machine learning methods have been advocated to solve this challenge; however, they are heavy consumers of training and testing data, while historical operational data for real-world power grids are hard, if not impossible, to access.

    This dataset contains long time series for production, consumption, and line flows, amounting to 20 years of data with a time resolution of one hour, for several thousands of loads and several hundreds of generators of various types, representing the ultra-high-voltage transmission grid of continental Europe. The synthetic time series have been statistically validated against real-world data.

    Data generation algorithm

    The algorithm is described in a Nature Scientific Data paper. It relies on the PanTaGruEl model of the European transmission network -- the admittance of its lines as well as the location, type and capacity of its power generators -- and aggregated data gathered from the ENTSO-E transparency platform, such as power consumption aggregated at the national level.

    Network

    The network information is encoded in the file europe_network.json. It is given in PowerModels format, which is itself derived from MatPower and compatible with PandaPower. The network features 7822 power lines and 553 transformers connecting 4097 buses, to which are attached 815 generators of various types.

    Time series

    The time series forming the core of this dataset are given in CSV format. Each CSV file is a table with 8736 rows, one for each hourly time step of a 364-day year. All years are truncated to exactly 52 weeks of 7 days, and start on a Monday (the load profiles are typically different during weekdays and weekends). The number of columns depends on the type of table: there are 4097 columns in load files, 815 for generators, and 8375 for lines (including transformers). Each column is described by a header corresponding to the element identifier in the network file. All values are given in per-unit, both in the model file and in the tables, i.e. they are multiples of a base unit taken to be 100 MW.

    There are 20 tables of each type, labeled with a reference year (2016 to 2020) and an index (1 to 4), zipped into archive files arranged by year. This amounts to a total of 20 years of synthetic data. When using load, generator, and line profiles together, it is important to use the same label: for instance, the files loads_2020_1.csv, gens_2020_1.csv, and lines_2020_1.csv represent the same year of the dataset, whereas gens_2020_2.csv is unrelated (it actually shares some features, such as nuclear profiles, but it is based on a dispatch with distinct loads).
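
    As an illustration (a minimal sketch, not part of the original documentation), the three matched tables of one label can be loaded together and converted from per-unit to MW using the 100 MW base mentioned above; the files are assumed to have already been extracted from the yearly archives:

    import pandas as pd

    BASE_MW = 100  # 1.0 per-unit corresponds to 100 MW (see above)

    # The three tables sharing the label 2020_1 describe the same synthetic year
    loads_2020_1 = pd.read_csv('loads_2020_1.csv')
    gens_2020_1 = pd.read_csv('gens_2020_1.csv')
    lines_2020_1 = pd.read_csv('lines_2020_1.csv')

    # Convert the per-unit values to MW
    loads_mw = loads_2020_1 * BASE_MW
    gens_mw = gens_2020_1 * BASE_MW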

    Usage

    The time series can be used without reference to the network file, simply using all or a selection of columns of the CSV files, depending on the needs. We show below how to select series from a particular country, and how to aggregate hourly time steps into days or weeks. These examples use Python and the data analysis library pandas, but other frameworks can be used as well (Matlab, Julia). Since all the yearly time series are periodic, it is always possible to define a coherent time window modulo the length of the series.

    Selecting a particular country

    This example illustrates how to select generation data for Switzerland in Python. This can be done without parsing the network file, but using instead gens_by_country.csv, which contains a list of all generators for any country in the network. We start by importing the pandas library, and read the column of the file corresponding to Switzerland (country code CH):

    import pandas as pd
    CH_gens = pd.read_csv('gens_by_country.csv', usecols=['CH'], dtype=str)

    The object created in this way is a DataFrame with some null values (not all countries have the same number of generators). It can be turned into a list with:

    CH_gens_list = CH_gens.dropna().squeeze().to_list()

    Finally, we can import all the time series of Swiss generators from a given data table with

    pd.read_csv('gens_2016_1.csv', usecols=CH_gens_list)

    The same procedure can be applied to loads using the list contained in the file loads_by_country.csv.

    Averaging over time

    This second example shows how to change the time resolution of the series. Suppose that we are interested in all the loads from a given table, which are given by default with a one-hour resolution:

    hourly_loads = pd.read_csv('loads_2018_3.csv')

    To get a daily average of the loads, we can use:

    daily_loads = hourly_loads.groupby([t // 24 for t in range(24 * 364)]).mean()

    This results in series of length 364. To average further over entire weeks and get series of length 52, we use:

    weekly_loads = hourly_loads.groupby([t // (24 * 7) for t in range(24 * 364)]).mean()
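
    The two steps can also be combined; for instance, reusing CH_gens_list from above to obtain weekly-averaged Swiss generation:

    CH_hourly = pd.read_csv('gens_2016_1.csv', usecols=CH_gens_list)
    CH_weekly = CH_hourly.groupby([t // (24 * 7) for t in range(24 * 364)]).mean()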

    Source code

    The code used to generate the dataset is freely available at https://github.com/GeeeHesso/PowerData. It consists of two packages and several documentation notebooks. The first package, written in Python, provides functions to handle the data and to generate synthetic series based on historical data. The second package, written in Julia, is used to perform the optimal power flow. The documentation, in the form of Jupyter notebooks, contains numerous examples of how to use both packages. The entire workflow used to create this dataset is also provided, starting from raw ENTSO-E data files and ending with the synthetic dataset given in the repository.

    Funding

    This work was supported by the Cyber-Defence Campus of armasuisse and by an internal research grant of the Engineering and Architecture domain of HES-SO.

  2. Tuberculosis X-Ray Dataset (Synthetic)

    • kaggle.com
    Updated Mar 12, 2025
    Cite
    Arif Miah (2025). Tuberculosis X-Ray Dataset (Synthetic) [Dataset]. https://www.kaggle.com/datasets/miadul/tuberculosis-x-ray-dataset-synthetic
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Mar 12, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Arif Miah
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    šŸ“ Dataset Summary

    This synthetic dataset contains 20,000 records of X-ray data labeled as "Normal" or "Tuberculosis". It is specifically created for training and evaluating classification models in the field of medical image analysis. The dataset aims to aid in building machine learning and deep learning models for detecting tuberculosis from X-ray data.

    Context

    Tuberculosis (TB) is a highly infectious disease that primarily affects the lungs. Accurate detection of TB using chest X-rays can significantly enhance medical diagnostics. However, real-world datasets are often scarce or restricted due to privacy concerns. This synthetic dataset bridges that gap by providing simulated patient data while maintaining realistic distributions and patterns commonly observed in TB cases.

    Dataset Details

    • Number of Rows: 20,000
    • Number of Columns: 15
    • File Format: CSV
    • Resolution: Simulated patient data, not real X-ray images
    • Size: Approximately 10 MB

    šŸ·ļø Columns and Descriptions

    Column NameDescription
    Patient_IDUnique ID for each patient (e.g., PID000001)
    AgeAge of the patient (in years)
    GenderGender of the patient (Male/Female)
    Chest_PainPresence of chest pain (Yes/No)
    Cough_SeveritySeverity of cough (Scale: 0-9)
    BreathlessnessSeverity of breathlessness (Scale: 0-4)
    FatigueLevel of fatigue experienced (Scale: 0-9)
    Weight_LossWeight loss (in kg)
    FeverLevel of fever (Mild, Moderate, High)
    Night_SweatsWhether night sweats are present (Yes/No)
    Sputum_ProductionLevel of sputum production (Low, Medium, High)
    Blood_in_SputumPresence of blood in sputum (Yes/No)
    Smoking_HistorySmoking status (Never, Former, Current)
    Previous_TB_HistoryPrevious tuberculosis history (Yes/No)
    ClassTarget variable indicating the condition (Normal, Tuberculosis)

    Data Generation Process

    The dataset was generated using Python with the following libraries:
    - Pandas: To create and save the dataset as a CSV file
    - NumPy: To generate random numbers and simulate realistic data
    - Random Seed: Set to ensure reproducibility

    The target variable "Class" has a 70-30 distribution between Normal and Tuberculosis cases. The data is randomly generated with realistic patterns that mimic typical TB symptoms and demographic distributions.
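
    As a rough illustration (a sketch under assumed distributions, not the author's actual script), a table like this can be simulated with NumPy and Pandas using a fixed seed and the 70-30 class split described above:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(42)  # fixed seed for reproducibility
    n = 20_000

    df = pd.DataFrame({
        'Patient_ID': [f'PID{i:06d}' for i in range(1, n + 1)],
        'Age': rng.integers(1, 91, n),
        'Gender': rng.choice(['Male', 'Female'], n),
        'Cough_Severity': rng.integers(0, 10, n),   # scale 0-9
        'Breathlessness': rng.integers(0, 5, n),    # scale 0-4
        'Night_Sweats': rng.choice(['Yes', 'No'], n),
        # target with the 70-30 Normal/Tuberculosis split mentioned above
        'Class': rng.choice(['Normal', 'Tuberculosis'], n, p=[0.7, 0.3]),
    })
    df.to_csv('tb_synthetic_sample.csv', index=False)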

    Usage

    This dataset is intended for:
    - Machine Learning and Deep Learning classification tasks
    - Data exploration and feature analysis
    - Model evaluation and comparison
    - Educational and research purposes

    Potential Applications

    1. Tuberculosis Detection Models: Train CNNs or other classification algorithms to detect TB.
    2. Healthcare Research: Analyze the correlation between symptoms and TB outcomes.
    3. Data Visualization: Perform EDA to uncover patterns and insights.
    4. Model Benchmarking: Compare various algorithms for TB detection.

    License

    This synthetic dataset is open for educational and research use. Please credit the creator if used in any public or academic work.

    Acknowledgments

    This dataset was generated as a synthetic alternative to real-world data to help developers and researchers practice building and fine-tuning classification models without the constraints of sensitive patient data.

  3. dummy_health_data

    • huggingface.co
    Updated May 29, 2025
    Cite
    Mudumbai Vraja Kishore (2025). dummy_health_data [Dataset]. https://huggingface.co/datasets/vrajakishore/dummy_health_data
    Dataset updated
    May 29, 2025
    Authors
    Mudumbai Vraja Kishore
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Synthetic Healthcare Dataset

      Overview
    

    This dataset is a synthetic healthcare dataset created for use in data analysis. It mimics real-world patient healthcare data and is intended for applications within the healthcare industry.

      Data Generation
    

    The data has been generated using the Faker Python library, which produces randomized and synthetic records that resemble real-world data patterns. It includes various healthcare-related fields such as patient… See the full description on the dataset page: https://huggingface.co/datasets/vrajakishore/dummy_health_data.
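
    For orientation, a minimal sketch of how Faker can produce records of this kind (the field names here are illustrative assumptions, not the dataset's actual schema):

    import random
    import pandas as pd
    from faker import Faker

    fake = Faker()
    Faker.seed(0)
    random.seed(0)

    # Generate a handful of fake patient-like records
    records = [{
        'patient_name': fake.name(),
        'date_of_birth': fake.date_of_birth(minimum_age=18, maximum_age=90),
        'admission_date': fake.date_this_decade(),
        'blood_type': random.choice(['A+', 'A-', 'B+', 'B-', 'AB+', 'AB-', 'O+', 'O-']),
    } for _ in range(5)]

    print(pd.DataFrame(records))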

  4. Synthetic river flow videos dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 3, 2022
    Cite
    Synthetic river flow videos dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6257391
    Dataset updated
    Aug 3, 2022
    Dataset provided by
    Magali Jodeau
    Alexandre Hauet
    JƩrƓme Le Coz
    Guillaume Bodart
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Synthetic river flow videos for evaluating image-based velocimetry methods


    Year : 2022

    Authors : G.Bodart (guillaume.bodart@inrae.fr), J.Le Coz (jerome.lecoz@inrae.fr), M.Jodeau (magali.jodeau@edf.fr), A.Hauet (alexandre.hauet@edf.fr)

    This file describes the data attached to the article:

    -> 00_article_cases

    This folder contains the data used in the case studies: synthetic videos + reference files.

    - 00_reference_velocities
      -> Reference velocities interpolated on a regular grid. Data are given in conventional units, i.e. m/s and m.
    
    
    - 01_XX
      -> Data of the first case study
    
    
    - 02_XX
      -> Data of the second case study
    

    -> 01_dev

    This folder contains the Python libraries and Mantaflow modified source code used in the paper. The libraries are provided as is. Feel free to contact us for support or guidelines.

    - lspiv
      -> Python library used to extract, process and display results of LSPIV analysis carried out with Fudaa-LSPIV
    
    
    - mantaflow-modified
      -> Modified version of Mantaflow described in the article. Installation instructions can be found at http://mantaflow.com
    
    
    - syri
      -> Python library used to extract, process and display fluid simulations carried out with Mantaflow and Blender. (Requires the lspiv library)
    

    -> 02_dataset

    This folder contains synthetic videos generated with the method described in the article. The fluid simulation parameters, and thus the reference velocities, are the same as those presented in the article.

    • The videos can be used freely. Please consider citing the corresponding paper.
  5. Synthetic total-field magnetic anomaly data and code to perform Euler deconvolution on it

    • figshare.com
    txt
    Updated May 30, 2023
    Cite
    Leonardo Uieda; Vanderlei C. Oliveira Jr.; Valeria C. F. Barbosa (2023). Synthetic total-field magnetic anomaly data and code to perform Euler deconvolution on it [Dataset]. http://doi.org/10.6084/m9.figshare.923450.v1
    Available download formats: txt
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Leonardo Uieda; Vanderlei C. Oliveira Jr.; Valeria C. F. Barbosa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Synthetic data, source code, and supplementary text for the article "Euler deconvolution of potential field data" by Leonardo Uieda, Vanderlei C. Oliveira Jr., and ValƩria C. F. Barbosa. This is part of a tutorial submitted to The Leading Edge (http://library.seg.org/journal/tle). Results were generated using the open-source Python package Fatiando a Terra version 0.2 (http://www.fatiando.org). This material, along with the manuscript, can also be found at https://github.com/pinga-lab/paper-tle-euler-tutorial

    Synthetic data and model

    Examples in the tutorial use synthetic data generated with the IPython notebook create_synthetic_data.ipynb. File synthetic_data.txt has 4 columns: x (north), y (east), z (down), and the total-field magnetic anomaly. x, y, and z are in meters. The total-field anomaly is in nanotesla (nT). File metadata.json contains extra information about the data, such as the inclination and declination of the inducing field (in degrees), the shape of the data grid (number of points in y and x, respectively), the area containing the data (W, E, S, N, in meters), and the model boundaries (W, E, S, N, top, bottom, in meters). File model.pickle is a serialized version of the model used to generate the data. It contains a list of instances of the PolygonalPrism class of Fatiando a Terra. The serialization was done using the cPickle Python module.

    Reproducing the results in the tutorial

    The notebook euler-deconvolution-examples.ipynb runs the Euler deconvolution on the synthetic data and generates the figures for the manuscript. It also presents a more detailed explanation of the method and more tests than went into the finished manuscript.
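
    For instance, a minimal sketch of reading the files described above with NumPy and the standard json module (the metadata key used for the grid shape is an assumption; inspect metadata.json for the actual names):

    import json
    import numpy as np

    # synthetic_data.txt columns: x (north), y (east), z (down), total-field anomaly (nT)
    x, y, z, anomaly = np.loadtxt('synthetic_data.txt', unpack=True)

    with open('metadata.json') as f:
        metadata = json.load(f)

    # Reshape the anomaly into the (ny, nx) grid described in the metadata
    ny, nx = metadata['shape']  # key name is an assumption
    anomaly_grid = anomaly.reshape(ny, nx)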

  6. Retail Store Star Schema Dataset

    • kaggle.com
    Updated Apr 22, 2025
    Cite
    Shrinivas Vishnupurikar (2025). Retail Store Star Schema Dataset [Dataset]. https://www.kaggle.com/datasets/shrinivasv/retail-store-star-schema-dataset
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Apr 22, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Shrinivas Vishnupurikar
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Retail Star Schema (Normalized & Denormalized) – Synthetic Dataset

    This dataset provides a simulated retail data warehouse designed using star schema modeling principles.

    It includes both normalized and denormalized versions of a retail sales star schema, making it a valuable resource for data engineers, analysts, and data warehouse enthusiasts who want to explore real-world scenarios, performance tuning, and modeling strategies.

    šŸ“ Dataset Structure

    This dataset has two fact tables:

    • fact_sales_normalized.csv – no columns from the dim_* tables have been denormalized into the fact table (diagram: Normalized Retail Star Schema).

    • fact_sales_denormalized.csv – specific columns from certain dim_* tables have been denormalized into the fact table (diagram: Denormalized Retail Star Schema).

    The dim_* tables stay the same for both:
    - Dim_Customers.csv
    - Dim_Products.csv
    - Dim_Stores.csv
    - Dim_Dates.csv
    - Dim_Salesperson
    - Dim_Campaign

    Use Cases

    • Practice star schema design and dimensional modeling
    • Learn how to denormalize dimensions for BI and analytics performance
    • Benchmark analytical queries (joins, aggregations, filtering)
    • Test data pipelines, ETL/ELT transformations, and query optimization strategies

    • Explore how denormalization affects storage, redundancy, and performance
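
    As a small usage illustration (a sketch, not part of the dataset documentation), a dimension can be resolved for each fact row with Polars; the join key name customer_id is an assumption and should be checked against the actual column headers:

    import polars as pl

    fact = pl.read_csv('fact_sales_normalized.csv')
    dim_customers = pl.read_csv('Dim_Customers.csv')

    # Star-schema lookup: attach customer attributes to each sale
    sales_with_customers = fact.join(dim_customers, on='customer_id', how='left')
    print(sales_with_customers.head())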

    Notes

    All data is synthetic and randomly generated via Python scripts that use the Polars library for data manipulation; no real customer or business data is included.

    Ideal for use with tools like SQL engines, Redshift, BigQuery, Snowflake, or even DuckDB.

    Credits

    Shrinivas Vishnupurikar, Data Engineer @Velotio Technologies.

  7. E-commerce Sales Prediction Dataset

    • kaggle.com
    Updated Dec 14, 2024
    Cite
    Nevil Dhinoja (2024). E-commerce Sales Prediction Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/10197264
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Dec 14, 2024
    Dataset provided by
    Kaggle
    Authors
    Nevil Dhinoja
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    E-commerce Sales Prediction Dataset

    This repository contains a comprehensive and clean dataset for predicting e-commerce sales, tailored for data scientists, machine learning enthusiasts, and researchers. The dataset is crafted to analyze sales trends, optimize pricing strategies, and develop predictive models for sales forecasting.

    Dataset Overview

    The dataset includes 1,000 records across the following features:

    Date: The date of the sale (01-01-2023 onward).
    Product_Category: Category of the product (e.g., Electronics, Sports, Other).
    Price: Price of the product (numerical).
    Discount: Discount applied to the product (numerical).
    Customer_Segment: Buyer segment (e.g., Regular, Occasional, Other).
    Marketing_Spend: Marketing budget allocated for sales (numerical).
    Units_Sold: Number of units sold per transaction (numerical).

    Data Summary

    General Properties

    Date:
    - Range: 01-01-2023 to 12-31-2023.
    - Contains 1,000 unique values without missing data.

    Product_Category:
    - Categories: Electronics (21%), Sports (21%), Other (58%).
    - Most common category: Electronics (21%).

    Price:
    - Range: from 244 to 999.
    - Mean: 505, standard deviation: 290.
    - Most common price range: 14.59 - 113.07.

    Discount:
    - Range: from 0.01% to 49.92%.
    - Mean: 24.9%, standard deviation: 14.4%.
    - Most common discount range: 0.01 - 5.00%.

    Customer_Segment:
    - Segments: Regular (35%), Occasional (34%), Other (31%).
    - Most common segment: Regular.

    Marketing_Spend:
    - Range: from 2.41k to 10k.
    - Mean: 4.91k, standard deviation: 2.84k.

    Units_Sold:
    - Range: from 5 to 57.
    - Mean: 29.6, standard deviation: 7.26.
    - Most common range: 24 - 34 units sold.

    Data Visualizations

    The dataset is suitable for creating the following visualizations:

    1. Price Distribution: Histogram to show the spread of prices.
    2. Discount Distribution: Histogram to analyze promotional offers.
    3. Marketing Spend Distribution: Histogram to understand marketing investment patterns.
    4. Customer Segment Distribution: Bar plot of customer segments.
    5. Price vs Units Sold: Scatter plot to show pricing effects on sales.
    6. Discount vs Units Sold: Scatter plot to explore the impact of discounts.
    7. Marketing Spend vs Units Sold: Scatter plot for marketing effectiveness.
    8. Correlation Heatmap: Identify relationships between features.
    9. Pairplot: Visualize pairwise feature interactions.

    How the Data Was Created

    The dataset is synthetically generated to mimic realistic e-commerce sales trends. Below are the steps taken for data generation:

    1. Feature Engineering:

      • Identified key attributes such as product category, price, discount, and marketing spend, typically observed in e-commerce data.
      • Generated dependent features like units sold based on logical relationships.
    2. Data Simulation:

      • Python Libraries: Used NumPy and Pandas to generate and distribute values.
      • Statistical Modeling: Ensured feature distributions aligned with real-world sales data patterns.
    3. Validation:

      • Verified data consistency with no missing or invalid values.
      • Ensured logical correlations (e.g., higher discounts → increased units sold).

    Note: The dataset is synthetic and not sourced from any real-world e-commerce platform.
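
    As a rough sketch (not the author's actual generation script), dependent features like Units_Sold can be simulated from the other columns with NumPy and Pandas; the coefficients below are illustrative assumptions, while the value ranges follow the data summary above:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    n = 1_000

    price = rng.uniform(244, 999, n)
    discount = rng.uniform(0.01, 49.92, n)          # percent
    marketing_spend = rng.uniform(2410, 10_000, n)

    # Units sold rises with discount and marketing spend, falls with price, plus noise
    units_sold = (30 - 0.01 * price + 0.2 * discount
                  + 0.001 * marketing_spend + rng.normal(0, 5, n))
    units_sold = np.clip(units_sold, 5, 57).round().astype(int)

    df = pd.DataFrame({'Price': price, 'Discount': discount,
                       'Marketing_Spend': marketing_spend, 'Units_Sold': units_sold})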

    Example Usage: Sales Prediction Model

    Here's an example, written in Python, of building a predictive model using Linear Regression:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error, r2_score
    
    # Load the dataset
    df = pd.read_csv('ecommerce_sales.csv')
    
    # Feature selection
    X = df[['Price', 'Discount', 'Marketing_Spend']]
    y = df['Units_Sold']
    
    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Model training
    model = LinearRegression()
    model.fit(X_train, y_train)
    
    # Predictions
    y_pred = model.predict(X_test)
    
    # Evaluation
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    print(f'Mean Squared Error: {mse:.2f}')
    print(f'R-squared: {r2:.2f}')
    
  8. Simulated Inventory Management Database and Object-Centric Event Logs for Process Analysis

    • zenodo.org
    bin, csv +2
    Updated May 26, 2025
    Cite
    Alessandro Berti (2025). Simulated Inventory Management Database and Object-Centric Event Logs for Process Analysis [Dataset]. http://doi.org/10.5281/zenodo.15515788
    Available download formats: xml, text/x-python, csv, bin
    Dataset updated
    May 26, 2025
    Dataset provided by
    Zenodo
    Authors
    Alessandro Berti
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract: This repository/dataset provides a suite of Python scripts to generate a simulated relational database for inventory management processes and transform this data into object-centric event logs (OCEL) suitable for advanced process mining analysis. The primary goal is to offer a synthetic yet realistic dataset that facilitates research, development, and application of object-centric process mining techniques in the domain of inventory control and supply chain management. The generated event logs capture common inventory operations, track stock level changes, and are enriched with key inventory management parameters (like EOQ, Safety Stock, Reorder Point) and status-based activity labels (e.g., indicating understock or overstock situations).

    Overview: Inventory management is a critical business process characterized by the interaction of various entities such as materials, purchase orders, sales orders, plants, suppliers, and customers. Traditional process mining often struggles to capture these complex interactions. Object-Centric Process Mining (OCPM) offers a more suitable paradigm. This project provides the tools to create and explore such data.

    The workflow involves:

    1. Database Simulation: Generating a SQLite database with tables for materials, sales orders, purchase orders, goods movements, stock levels, etc., populated with simulated data.
    2. Initial OCEL Generation: Extracting data from the SQLite database and structuring it as an object-centric event log (in CSV format). This log includes activities like "Create Purchase Order Item", "Goods Receipt", "Create Sales Order Item", "Goods Issue", and tracks running stock levels for materials.
    3. OCEL Post-processing and Enrichment:
      • Calculating standard inventory management metrics such as Economic Order Quantity (EOQ), Safety Stock (SS), and Reorder Point (ROP) for each material-plant combination based on the simulated historical data.
      • Merging these metrics into the event log.
      • Enhancing activity labels to include the current stock status (e.g., "Understock", "Overstock", "Normal") relative to calculated SS and Overstock (OS) levels (where OS = SS + EOQ).
      • Generating new, distinct events to explicitly mark the moments when stock statuses change (e.g., "START UNDERSTOCK", "ST CHANGE NORMAL to OVERSTOCK", "END NORMAL").
    4. Format Conversion: Converting the CSV-based OCELs into the standard OCEL XML/OCEL2 format using the pm4py library.

    Contents:

    The repository contains the following Python scripts:

    • 01_generate_simulation.py:

      • Creates a SQLite database named inventory_management.db.
      • Defines and populates tables including: Materials, SalesOrderDocuments, SalesOrderItems, PurchaseOrderDocuments, PurchaseOrderItems, PurchaseRequisitions, GoodsReceiptsAndIssues, MaterialStocks, MaterialDocuments, SalesDocumentFlows, and OrderSuggestions.
      • Simulates data for a configurable number of materials, customers, sales, purchases, etc., with randomized dates and quantities.
    • 02_database_to_ocel_csv.py:

      • Connects to the inventory_management.db.
      • Executes a SQL query to extract relevant events and their associated objects for inventory processes.
      • Constructs an initial object-centric event log, saved as ocel_inventory_management.csv.
      • Identified object types include: MAT (Material), PLA (Plant), PO_ITEM (Purchase Order Item), SO_ITEM (Sales Order Item), CUSTOMER, SUPPLIER.
      • Calculates "Stock Before" and "Stock After" for each event affecting material stock.
      • Standardizes column names to OCEL conventions (e.g., ocel:activity, ocel:timestamp, ocel:type:).
    • 03_ocel_csv_to_ocel.py:

      • Reads ocel_inventory_management.csv.
      • Uses pm4py to convert the CSV event log into the standard OCEL XML format (ocel_inventory_management.xml).
    • 04_postprocess_activities.py:

      • Reads data from inventory_management.db to calculate inventory parameters:
        • Annual Demand (Dm)
        • Average Daily Demand (dm)
        • Standard Deviation of Daily Demand (σm)
        • Average Lead Time (lm)
        • Economic Order Quantity (EOQ): √((2 · Dm · S) / H), where S is the fixed order cost and H is the holding cost
        • Safety Stock (SS): z · σm · √lm, where z is the z-score for the desired service level
        • Reorder Point (ROP): (dm · lm) + SS (a pandas sketch of these computations is given after this list)
      • Merges these calculated parameters with ocel_inventory_management.csv.
      • Computes an Overstock level (OS) as SS+EOQ.
      • Derives a "Current Status" (Understock, Overstock, Normal) for each event based on "Stock After" relative to SS and OS.
      • Appends this status to the ocel:activity label (e.g., "Goods Issue (Understock)").
      • Generates new events for status changes (e.g., "START NORMAL", "ST CHANGE UNDERSTOCK to NORMAL", "END OVERSTOCK") with adjusted timestamps to precisely mark these transitions.
      • Creates a new object type MAT_PLA (Material-Plant combination) for easier status tracking.
      • Saves the enriched and transformed log as post_ocel_inventory_management.csv.
    • 05_ocel_csv_to_ocel.py:

      • Reads the post-processed post_ocel_inventory_management.csv.
      • Uses pm4py to convert this enriched CSV event log into the standard OCEL XML format (post_ocel_inventory_management.xml).
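
    For reference, a minimal pandas sketch of the parameter computation described above (the cost constants, lead time, and the layout of the demand table are illustrative assumptions, not the values used by 04_postprocess_activities.py):

    import numpy as np
    import pandas as pd

    S = 50.0   # fixed order cost (assumption)
    H = 2.0    # annual holding cost per unit (assumption)
    z = 1.65   # z-score for roughly a 95% service level

    # Assumed layout: one row per material/plant/day with a 'quantity' column,
    # covering one year of history so that the sum approximates annual demand.
    daily_demand = pd.read_csv('daily_demand.csv')
    grouped = daily_demand.groupby(['material', 'plant'])['quantity']

    params = grouped.agg(dm='mean', sigma_m='std', Dm='sum').reset_index()
    params['lead_time'] = 7  # average lead time in days (assumption)

    params['EOQ'] = np.sqrt(2 * params['Dm'] * S / H)
    params['SS'] = z * params['sigma_m'] * np.sqrt(params['lead_time'])
    params['ROP'] = params['dm'] * params['lead_time'] + params['SS']
    params['OS'] = params['SS'] + params['EOQ']  # overstock threshold = SS + EOQ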

    Generated Dataset Files (if included, or can be generated using the scripts):

    • inventory_management.db: The SQLite database containing the simulated raw data.
    • ocel_inventory_management.csv: The initial OCEL in CSV format.
    • ocel_inventory_management.xml: The initial OCEL in standard OCEL XML format.
    • post_ocel_inventory_management.csv: The post-processed and enriched OCEL in CSV format.
    • post_ocel_inventory_management.xml: The post-processed and enriched OCEL in standard OCEL XML format.

    How to Use:

    1. Ensure you have Python installed along with the following libraries: sqlite3 (standard library), pandas, numpy, pm4py.
    2. Run the scripts sequentially in a terminal or command prompt:
      • python 01_generate_simulation.py (generates inventory_management.db)
      • python 02_database_to_ocel_csv.py (generates ocel_inventory_management.csv from the database)
      • python 03_ocel_csv_to_ocel.py (generates ocel_inventory_management.xml)
      • python 04_postprocess_activities.py (generates post_ocel_inventory_management.csv using the database and the initial CSV OCEL)
      • python 05_ocel_csv_to_ocel.py (generates post_ocel_inventory_management.xml)

    Potential Applications and Research: This dataset and the accompanying scripts can be used for:

    • Applying and evaluating object-centric process mining algorithms on inventory management data.
    • Analyzing inventory dynamics, such as the causes and effects of understocking or overstocking.
    • Discovering and conformance checking process models that involve multiple interacting objects (materials, orders, plants).
    • Investigating the impact of different inventory control parameters (EOQ, SS, ROP) on process execution.
    • Developing educational materials for teaching OCPM in a supply chain context.
    • Serving as a benchmark for new OCEL-based analysis techniques.

    Keywords: Object-Centric Event Log, OCEL, Process Mining, Inventory Management, Supply Chain, Simulation, Synthetic Data, SQLite, Python, pandas, pm4py, Economic Order Quantity (EOQ), Safety Stock (SS), Reorder Point (ROP), Stock Status Analysis.

  9. Smashcima (2025-03-28)

    • live.european-language-grid.eu
    Updated Dec 29, 2024
    + more versions
    Cite
    (2024). Smashcima (2025-03-28) [Dataset]. https://live.european-language-grid.eu/catalogue/tool-service/23850
    Dataset updated
    Dec 29, 2024
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Smashcima is a library and framework for synthesizing images of handwritten music, used to create synthetic training data for OMR (optical music recognition) models. It is primarily intended to be used as part of optical music recognition workflows, especially with domain adaptation in mind. The target user is therefore a machine-learning, document processing, library sciences, or computational musicology researcher with minimal skills in Python programming.

    Smashcima is the only tool that simultaneously:
    - synthesizes handwritten music notation,
    - produces not only raster images but also segmentation masks, classification labels, bounding boxes, and more,
    - synthesizes entire pages as well as individual symbols,
    - synthesizes background paper textures,
    - also synthesizes polyphonic and pianoform music images,
    - accepts just MusicXML as input,
    - is written in Python, which simplifies its adoption and extensibility.

    Therefore, Smashcima brings a unique new capability for optical music recognition (OMR): synthesizing a near-realistic image of handwritten sheet music from just a MusicXML file. As opposed to notation editors, which work with a fixed set of fonts and a set of layout rules, it can adapt handwriting styles from existing OMR datasets to arbitrary music (beyond the music encoded in existing OMR datasets), and randomize layout to simulate the imprecisions of handwriting, while guaranteeing the semantic correctness of the output rendering. Crucially, the rendered image is provided also with the positions of all the visual elements of music notation, so that both object detection-based and sequence-to-sequence OMR pipelines can utilize Smashcima as a synthesizer of training data.

    (In combination with the LMX canonical linearization of MusicXML, one can imagine the endless possibilities of running Smashcima on inputs from a MusicXML generator.)

  10. Synthetische gegevens - topdiabetes (synthetic data for the "top diabetes" indicator)

    • data.europa.eu
    zip
    Cite
    Health Data Hub, Synthetische gegevens - topdiabetes [Dataset]. https://data.europa.eu/data/datasets/662a7a37ee85069bfb9a666b?locale=nl
    Available download formats: zip (179413420)
    Dataset authored and provided by
    Health Data Hub
    License

    Licence Ouverte / Open Licence: https://www.etalab.gouv.fr/licence-ouverte-open-licence

    Description

    Description of the database:

    • Objectives and original purposes of the database:

    This synthetic dataset was created as part of the translation and implementation of the algorithm used by the CNAM to build the "top diabetes" indicator (link to the description sheet of the algorithm).

    The Python and SAS versions adapted by the HDH cover synthetic data for the years 2018-2019, but can be extended to other years. The CNAM source program was developed in SAS and runs on data from 2015 to 2019.

    The algorithm mentioned above aims to target people receiving diabetes care in the main NSDS database in order to create the "top diabetes" category of the pathology mapping (version G8) created and maintained by the CNAM.

    • Context of creation:

    The implementation of the top diabetes algorithm required the use of synthetic (fictitious) tables and variables.

    - merging annual tables into a single table for ER_PRS_F, ER_ETE_F, ER_PHA_F,

    Data/SNDS community. - Results related to the creation of the database: the algorithm used by the CNAM to construct the top diabetes (source version (CNAM), Python version and SAS version (HDH)) (https://www.health-data-hub.fr/library-open-algorithms-health/algorithm-to-build-the-top-diabete-of-mapping).


    • Collection methodology and inclusion criteria:

    Data presentation:

    The programs work on the HDH synthetic data, with a few adaptations. This dataset was generated using the schema of the main NSDS database tables of 2019.

    • Target population:

    - conversion of the date format to yymmdd10.

    Patient identification is based on the targeting of specific medications and/or long-term illness (ALD) status and/or hospitalisation in MCO.

    - renaming of NUM_ENQ to BEN_NIR_PSA. The mapping algorithms aim to maximise specificity (not sensitivity), i.e. to ensure the absence of non-diabetics among the targeted patients.

    • Choice of variables:

    The implementation of the algorithm requires the following tables and variables (the required history is indicated in the corresponding box):

    Patients with fewer than 3 dispensations of specific medications, who have no ALD and who have not been hospitalised for diabetes within 5 years, are not retained.

    The programs adapted in SAS and Python run on synthetic data from the years 2018 and 2019. The CNAM source code (in SAS) is designed to work on data from the years 2015 through 2019.

    Limitations of this dataset:

    (image of the required SNDS tables and variables: https://gitlab.com/healthdatahub/boas/cnam/top-diabete/-/raw/main/Tables_et_variables_du_SNDS_n%C3%A9cessaires.png) The lack of medical consistency, the lack of updating for annual changes, and an evolving table schema that may be incomplete and imperfect.

    This program does not include an analysis of the estimated expenditure items reimbursed by the health insurance.

    The algorithm identifies prevalent diabetes patients in a given year (2019). It does not determine the exact date of diabetes onset in the database.

    The use of synthetic data, although useful for working with NSDS data, has limitations:

    More information about the use of the database in the context of the top diabetes programs (CNAM) is available on the GitLab repository of the programs (link to the GitLab repository).

    Support:

    Contact point: dir.donnees-SNDS@health-data-hub.fr

    Contribution:

    On GitLab (create a ticket or a merge request)

