39 datasets found
  1. Data from: A large synthetic dataset for machine learning applications in...

    • zenodo.org
    csv, json, png, zip
    Updated Mar 25, 2025
    Cite
    Marc Gillioz; Guillaume Dubuis; Philippe Jacquod (2025). A large synthetic dataset for machine learning applications in power transmission grids [Dataset]. http://doi.org/10.5281/zenodo.13378476
    Explore at:
    Available download formats: zip, png, csv, json
    Dataset updated
    Mar 25, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marc Gillioz; Guillaume Dubuis; Philippe Jacquod
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    With the ongoing energy transition, power grids are evolving fast. They operate more and more often close to their technical limit, under more and more volatile conditions. Fast, essentially real-time computational approaches to evaluate their operational safety, stability and reliability are therefore highly desirable. Machine Learning methods have been advocated to solve this challenge, however they are heavy consumers of training and testing data, while historical operational data for real-world power grids are hard if not impossible to access.

    This dataset contains long time series for production, consumption, and line flows, amounting to 20 years of data with a time resolution of one hour, for several thousands of loads and several hundreds of generators of various types representing the ultra-high-voltage transmission grid of continental Europe. The synthetic time series have been statistically validated against real-world data.

    Data generation algorithm

    The algorithm is described in a Nature Scientific Data paper. It relies on the PanTaGruEl model of the European transmission network -- the admittance of its lines as well as the location, type and capacity of its power generators -- and aggregated data gathered from the ENTSO-E transparency platform, such as power consumption aggregated at the national level.

    Network

    The network information is encoded in the file europe_network.json. It is given in PowerModels format, which is itself derived from MatPower and compatible with PandaPower. The network features 7822 power lines and 553 transformers connecting 4097 buses, to which are attached 815 generators of various types.
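
    For orientation, here is a minimal sketch of opening the network file in Python. The top-level keys ('bus', 'branch', 'gen', 'load') are the usual PowerModels conventions and are an assumption here; they should be checked against the actual file.

    import json

    # Load the PowerModels-style network description
    with open('europe_network.json') as f:
        network = json.load(f)

    # PowerModels typically stores each component type as a dictionary keyed by index;
    # verify the exact top-level keys against the file itself.
    for key in ('bus', 'branch', 'gen', 'load'):
        if key in network:
            print(key, len(network[key]), 'elements')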

    Time series

    The time series forming the core of this dataset are given in CSV format. Each CSV file is a table with 8736 rows, one for each hourly time step of a 364-day year. All years are truncated to exactly 52 weeks of 7 days, and start on a Monday (the load profiles are typically different during weekdays and weekends). The number of columns depends on the type of table: there are 4097 columns in load files, 815 for generators, and 8375 for lines (including transformers). Each column is described by a header corresponding to the element identifier in the network file. All values are given in per-unit, both in the model file and in the tables, i.e. they are multiples of a base unit taken to be 100 MW.

    There are 20 tables of each type, labeled with a reference year (2016 to 2020) and an index (1 to 4), zipped into archive files arranged by year. This amounts to a total of 20 years of synthetic data. When using load, generator, and line profiles together, it is important to use the same label: for instance, the files loads_2020_1.csv, gens_2020_1.csv, and lines_2020_1.csv represent the same year of the dataset, whereas gens_2020_2.csv is unrelated (it actually shares some features, such as nuclear profiles, but it is based on a dispatch with distinct loads).
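
    As a concrete illustration of the labeling convention, the following sketch loads the three tables belonging to the same synthetic year (here the label 2020_1); the file names follow the pattern described above.

    import pandas as pd

    year, index = 2020, 1  # reference year (2016 to 2020) and index (1 to 4)

    # Tables sharing the same label describe the same synthetic year
    loads = pd.read_csv(f'loads_{year}_{index}.csv')
    gens = pd.read_csv(f'gens_{year}_{index}.csv')
    lines = pd.read_csv(f'lines_{year}_{index}.csv')

    # Each table has 8736 hourly rows (52 weeks of 7 days)
    print(loads.shape, gens.shape, lines.shape)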

    Usage

    The time series can be used without reference to the network file, simply using all or a selection of columns of the CSV files, depending on the needs. We show below how to select series from a particular country, and how to aggregate hourly time steps into days or weeks. These examples use Python and the data analysis library pandas, but other frameworks (Matlab, Julia) can be used as well. Since all the yearly time series are periodic, it is always possible to define a coherent time window modulo the length of the series.

    Selecting a particular country

    This example illustrates how to select generation data for Switzerland in Python. This can be done without parsing the network file, but using instead gens_by_country.csv, which contains a list of all generators for any country in the network. We start by importing the pandas library, and read the column of the file corresponding to Switzerland (country code CH):

    import pandas as pd
    CH_gens = pd.read_csv('gens_by_country.csv', usecols=['CH'], dtype=str)

    The object created in this way is a pandas DataFrame with some null values (not all countries have the same number of generators). It can be turned into a list with:

    CH_gens_list = CH_gens.dropna().squeeze().to_list()

    Finally, we can import all the time series of Swiss generators from a given data table with

    pd.read_csv('gens_2016_1.csv', usecols=CH_gens_list)

    The same procedure can be applied to loads using the list contained in the file loads_by_country.csv.

    Averaging over time

    This second example shows how to change the time resolution of the series. Suppose that we are interested in all the loads from a given table, which are given by default with a one-hour resolution:

    hourly_loads = pd.read_csv('loads_2018_3.csv')

    To get a daily average of the loads, we can use:

    daily_loads = hourly_loads.groupby([t // 24 for t in range(24 * 364)]).mean()

    This results in series of length 364. To average further over entire weeks and get series of length 52, we use:

    weekly_loads = hourly_loads.groupby([t // (24 * 7) for t in range(24 * 364)]).mean()

    Source code

    The code used to generate the dataset is freely available at https://github.com/GeeeHesso/PowerData. It consists of two packages and several documentation notebooks. The first package, written in Python, provides functions to handle the data and to generate synthetic series based on historical data. The second package, written in Julia, is used to perform the optimal power flow. The documentation, in the form of Jupyter notebooks, contains numerous examples of how to use both packages. The entire workflow used to create this dataset is also provided, starting from raw ENTSO-E data files and ending with the synthetic dataset given in the repository.

    Funding

    This work was supported by the Cyber-Defence Campus of armasuisse and by an internal research grant of the Engineering and Architecture domain of HES-SO.

  2. Instella-GSM8K-synthetic

    • huggingface.co
    Updated Jun 16, 2025
    Cite
    AMD (2025). Instella-GSM8K-synthetic [Dataset]. https://huggingface.co/datasets/amd/Instella-GSM8K-synthetic
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 16, 2025
    Dataset authored and provided by
    AMD
    License

    https://choosealicense.com/licenses/other/

    Description

    Instella-GSM8K-synthetic

    The Instella-GSM8K-synthetic dataset was used in the second-stage pre-training of the Instella-3B model, which was trained on top of the Instella-3B-Stage1 model. This synthetic dataset was generated using the training set of the GSM8k dataset, where we first used Qwen2.5-72B-Instruct to

    Abstract numerical values as function parameters and generate a Python program to solve the math question. Identify and replace numerical values in the existing question with… See the full description on the dataset page: https://huggingface.co/datasets/amd/Instella-GSM8K-synthetic.

  3. replicAnt - Plum2023 - Pose-Estimation Datasets and Trained Models

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 21, 2023
    + more versions
    Cite
    Labonte, David (2023). replicAnt - Plum2023 - Pose-Estimation Datasets and Trained Models [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7849595
    Explore at:
    Dataset updated
    Apr 21, 2023
    Dataset provided by
    Beck, Hendrik
    Imirzian, Natalie
    Plum, Fabian
    Labonte, David
    Bulla, René
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains all recorded and hand-annotated data, all synthetically generated data, and representative trained networks used for semantic and instance segmentation experiments in the manuscript "replicAnt - generating annotated images of animals in complex environments using Unreal Engine". Unless stated otherwise, all 3D animal models used in the synthetically generated data have been created with the open-source photogrammetry platform scAnt (peerj.com/articles/11155/). All synthetic data has been generated with the associated replicAnt project available from https://github.com/evo-biomech/replicAnt.

    Abstract:

    Deep learning-based computer vision methods are transforming animal behavioural research. Transfer learning has enabled work in non-model species, but still requires hand-annotation of example footage, and is only performant in well-defined conditions. To overcome these limitations, we created replicAnt, a configurable pipeline implemented in Unreal Engine 5 and Python, designed to generate large and variable training datasets on consumer-grade hardware instead. replicAnt places 3D animal models into complex, procedurally generated environments, from which automatically annotated images can be exported. We demonstrate that synthetic data generated with replicAnt can significantly reduce the hand-annotation required to achieve benchmark performance in common applications such as animal detection, tracking, pose-estimation, and semantic segmentation; and that it increases the subject-specificity and domain-invariance of the trained networks, so conferring robustness. In some applications, replicAnt may even remove the need for hand-annotation altogether. It thus represents a significant step towards porting deep learning-based computer vision tools to the field.

    Benchmark data

    Two pose-estimation datasets were procured. Both datasets used first instar Sungaya inexpectata (Zompro 1996) stick insects as a model species. Recordings from an evenly lit platform served as representative of controlled laboratory conditions; recordings from a hand-held phone camera served as an approximate example of serendipitous recordings in the field.

    For the platform experiments, walking S. inexpectata were recorded using a calibrated array of five FLIR Blackfly colour cameras (Blackfly S USB3, Teledyne FLIR LLC, Wilsonville, Oregon, U.S.), each equipped with 8 mm c-mount lenses (M0828-MPW3 8MM 6MP F2.8-16 C-MOUNT, CBC Co., Ltd., Tokyo, Japan). All videos were recorded at 55 fps and at the sensors' native resolution of 2048 px by 1536 px. The cameras were synchronised for simultaneous capture from five perspectives (top, front right and left, back right and left), allowing for time-resolved 3D reconstruction of animal pose.

    The handheld footage was recorded in landscape orientation with a Huawei P20 (Huawei Technologies Co., Ltd., Shenzhen, China) in stabilised video mode: S. inexpectata were recorded walking across cluttered environments (hands, lab benches, PhD desks etc), resulting in frequent partial occlusions, magnification changes, and uneven lighting, so creating a more varied pose-estimation dataset.

    Representative frames were extracted from videos using DeepLabCut (DLC)-internal k-means clustering. 46 key points in 805 and 200 frames for the platform and handheld case, respectively, were subsequently hand-annotated using the DLC annotation GUI.

    Synthetic data

    We generated a synthetic dataset of 10,000 images at a resolution of 1500 by 1500 px, based on a 3D model of a first instar S. inexpectata specimen, generated with the scAnt photogrammetry workflow. Generating 10,000 samples took about three hours on a consumer-grade laptop (6 Core 4 GHz CPU, 16 GB RAM, RTX 2070 Super). We applied 70% scale variation, and enforced hue, brightness, contrast, and saturation shifts, to generate 10 separate sub-datasets containing 1000 samples each, which were combined to form the full dataset.

    Funding

    This study received funding from Imperial College’s President’s PhD Scholarship (to Fabian Plum), and is part of a project that has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant agreement No. 851705, to David Labonte). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

  4. Tuberculosis X-Ray Dataset (Synthetic)

    • kaggle.com
    Updated Mar 12, 2025
    Cite
    Arif Miah (2025). Tuberculosis X-Ray Dataset (Synthetic) [Dataset]. https://www.kaggle.com/datasets/miadul/tuberculosis-x-ray-dataset-synthetic
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 12, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Arif Miah
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    📝 Dataset Summary

    This synthetic dataset contains 20,000 records of X-ray data labeled as "Normal" or "Tuberculosis". It is specifically created for training and evaluating classification models in the field of medical image analysis. The dataset aims to aid in building machine learning and deep learning models for detecting tuberculosis from X-ray data.

    💡 Context

    Tuberculosis (TB) is a highly infectious disease that primarily affects the lungs. Accurate detection of TB using chest X-rays can significantly enhance medical diagnostics. However, real-world datasets are often scarce or restricted due to privacy concerns. This synthetic dataset bridges that gap by providing simulated patient data while maintaining realistic distributions and patterns commonly observed in TB cases.

    🗃️ Dataset Details

    • Number of Rows: 20,000
    • Number of Columns: 15
    • File Format: CSV
    • Resolution: Simulated patient data, not real X-ray images
    • Size: Approximately 10 MB

    🏷️ Columns and Descriptions

    • Patient_ID: Unique ID for each patient (e.g., PID000001)
    • Age: Age of the patient (in years)
    • Gender: Gender of the patient (Male/Female)
    • Chest_Pain: Presence of chest pain (Yes/No)
    • Cough_Severity: Severity of cough (Scale: 0-9)
    • Breathlessness: Severity of breathlessness (Scale: 0-4)
    • Fatigue: Level of fatigue experienced (Scale: 0-9)
    • Weight_Loss: Weight loss (in kg)
    • Fever: Level of fever (Mild, Moderate, High)
    • Night_Sweats: Whether night sweats are present (Yes/No)
    • Sputum_Production: Level of sputum production (Low, Medium, High)
    • Blood_in_Sputum: Presence of blood in sputum (Yes/No)
    • Smoking_History: Smoking status (Never, Former, Current)
    • Previous_TB_History: Previous tuberculosis history (Yes/No)
    • Class: Target variable indicating the condition (Normal, Tuberculosis)

    🔍 Data Generation Process

    The dataset was generated using Python with the following libraries:
    - Pandas: To create and save the dataset as a CSV file
    - NumPy: To generate random numbers and simulate realistic data
    - Random Seed: Set to ensure reproducibility

    The target variable "Class" has a 70-30 distribution between Normal and Tuberculosis cases. The data is randomly generated with realistic patterns that mimic typical TB symptoms and demographic distributions.
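
    The generation script itself is not part of this listing; the snippet below is only a rough sketch of the approach described above (pandas, NumPy, a fixed random seed, and a 70-30 class split). The column subset and value ranges are illustrative placeholders, not the original distributions.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(42)  # fixed seed for reproducibility
    n = 20_000

    # Illustrative subset of the columns listed above; distributions are placeholders
    df = pd.DataFrame({
        'Patient_ID': [f'PID{i:06d}' for i in range(1, n + 1)],
        'Age': rng.integers(18, 76, n),
        'Cough_Severity': rng.integers(0, 10, n),
        'Class': rng.choice(['Normal', 'Tuberculosis'], size=n, p=[0.7, 0.3]),
    })

    df.to_csv('synthetic_tb.csv', index=False)
    print(df['Class'].value_counts(normalize=True))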

    🔧 Usage

    This dataset is intended for:
    - Machine Learning and Deep Learning classification tasks
    - Data exploration and feature analysis
    - Model evaluation and comparison
    - Educational and research purposes

    📊 Potential Applications

    1. Tuberculosis Detection Models: Train CNNs or other classification algorithms to detect TB.
    2. Healthcare Research: Analyze the correlation between symptoms and TB outcomes.
    3. Data Visualization: Perform EDA to uncover patterns and insights.
    4. Model Benchmarking: Compare various algorithms for TB detection.

    📑 License

    This synthetic dataset is open for educational and research use. Please credit the creator if used in any public or academic work.

    🙌 Acknowledgments

    This dataset was generated as a synthetic alternative to real-world data to help developers and researchers practice building and fine-tuning classification models without the constraints of sensitive patient data.

  5. Synthetic dataset used in "The maximum weighted submatrix coverage problem:...

    • zenodo.org
    text/x-python, zip
    Updated Jan 24, 2020
    Cite
    Derval Guillaume; Branders Vincent; Dupont Pierre; Schaus Pierre (2020). Synthetic dataset used in "The maximum weighted submatrix coverage problem: A CP approach" [Dataset]. http://doi.org/10.5281/zenodo.3549866
    Explore at:
    Available download formats: zip, text/x-python
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Derval Guillaume; Branders Vincent; Dupont Pierre; Schaus Pierre
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Synthetic dataset used in "The maximum weighted submatrix coverage problem: A CP approach".

    Includes both the generated datasets, as a zip archive, and the Python script used to generate them.

    Each instance is composed of two files in the form

    • XxY_K_O_0xN_AxB_Smatrix.tsv being the matrix to use. Each row on a separate line, with tab-separated cells.
    • XxY_K_O_0xN_AxB_Ssolution.txt giving the implanted solution, one submatrix per line: each line contains two JSON arrays separated by a tab, the first being the list of rows selected in the submatrix and the second the list of columns (see the parsing sketch after the parameter list below).

    With:

    • X and Y the size of the matrix
    • K the number of submatrices in the implanted solution
    • O the (minimum) overlap percentage of each submatrix
    • N the sigma used for the background noise
    • A and B the size of the implanted submatrices (subject to noise)
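
    One possible way to read an instance (matrix plus implanted solution) in Python, assuming the layout described above; the file names below are placeholders following the naming pattern.

    import json
    import pandas as pd

    # Placeholder file names following the XxY_K_O_0xN_AxB_S pattern described above
    matrix = pd.read_csv('100x100_5_10_0x1_20x20_0matrix.tsv', sep='\t', header=None)

    submatrices = []
    with open('100x100_5_10_0x1_20x20_0solution.txt') as f:
        for line in f:
            rows_json, cols_json = line.rstrip('\n').split('\t')
            submatrices.append((json.loads(rows_json), json.loads(cols_json)))

    print(matrix.shape, len(submatrices))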

  6. Data from: Synthetic Multimodal Dataset for Daily Life Activities

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 29, 2024
    + more versions
    Cite
    Kawamura, Takahiro (2024). Synthetic Multimodal Dataset for Daily Life Activities [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8046266
    Explore at:
    Dataset updated
    Jan 29, 2024
    Dataset provided by
    Fukuda, Ken
    Kozaki, Kouji
    Egami, Shusaku
    Ugai, Takanori
    Swe Nwe Nwe Htun
    Kawamura, Takahiro
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Outline

    This dataset was originally created for the Knowledge Graph Reasoning Challenge for Social Issues (KGRC4SI) and is open to the public as open data. It includes:

    • Video data that simulates daily life actions in a virtual space, generated from the scenario data.

    • Knowledge graphs and transcriptions of the video data content ("who" did what "action" with what "object," when and where, and the resulting "state" or "position" of the object).

    • Knowledge Graph Embedding Data created for reasoning based on machine learning.

    Details

    Videos

    mp4 format

    203 action scenarios

    For each scenario, there is a character rear view (file name ending in 0), an indoor camera switching view (file name ending in 1), and fixed camera views placed in each corner of the room (file names ending in 2-5). Also, for each action scenario, data was generated for between 1 and 7 patterns with different room layouts (scenes), giving a total of 1,218 videos.

    Videos with slowly moving characters simulate the movements of elderly people.

    Knowledge Graphs

    RDF format

    203 knowledge graphs corresponding to the videos

    Includes schema and location supplement information

    The schema is described below

    SPARQL endpoints and query examples are available

    Script Data

    txt format

    Data provided to VirtualHome2KG to generate videos and knowledge graphs

    Includes the action title and a brief description in text format.

    Embedding

    Embedding Vectors in TransE, ComplEx, and RotatE. Created with DGL-KE (https://dglke.dgl.ai/doc/)

    Embedding Vectors created with jRDF2vec (https://github.com/dwslab/jRDF2Vec).

    Specification of Ontology

    Please refer to the specification for descriptions of all classes, instances, and properties: https://aistairc.github.io/VirtualHome2KG/vh2kg_ontology.htm

    Related Resources

    KGRC4SI Final Presentations with automatic English subtitles (YouTube)

    VirtualHome2KG (Software)

    VirtualHome-AIST (Unity)

    VirtualHome-AIST (Python API)

    Visualization Tool (Software)

    Script Editor (Software)

  7. A dataset of 1500-word stories generated by gpt-4o-mini for 236...

    • search.dataone.org
    • dataverse.no
    Updated May 29, 2025
    Cite
    Rettberg, Jill Walker; Wigers, Hermann (2025). A dataset of 1500-word stories generated by gpt-4o-mini for 236 nationalities [Dataset]. http://doi.org/10.18710/VM2K4O
    Explore at:
    Dataset updated
    May 29, 2025
    Dataset provided by
    DataverseNO
    Authors
    Rettberg, Jill Walker; Wigers, Hermann
    Description

    We created a dataset of stories generated by OpenAI's gpt-4o-mini, using a Python script to construct prompts that were sent to the OpenAI API. We used Statistics Norway's list of 252 countries, added demonyms for each country (for example, Norwegian for Norway), and removed countries without demonyms, leaving us with 236 countries. Our base prompt was "Write a 1500 word potential {demonym} story", and we generated 50 stories for each country. The scripts used to generate the data, and additional scripts for analysis, are available at the GitHub repository https://github.com/MachineVisionUiB/GPT_stories
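
    The actual generation scripts live in the linked GitHub repository; the snippet below is only a rough sketch of the prompting loop described above, using the OpenAI Python client with an illustrative subset of demonyms.

    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    demonyms = ['Norwegian', 'Kenyan', 'Peruvian']  # illustrative subset of the 236 demonyms
    stories_per_country = 50

    for demonym in demonyms:
        prompt = f'Write a 1500 word potential {demonym} story'
        for i in range(stories_per_country):
            response = client.chat.completions.create(
                model='gpt-4o-mini',
                messages=[{'role': 'user', 'content': prompt}],
            )
            with open(f'{demonym}_{i}.txt', 'w') as f:
                f.write(response.choices[0].message.content)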

  8. Data from: Domain-adaptive Data Synthesis for Large-scale Supermarket...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 5, 2024
    Cite
    Strohmayer, Julian (2024). Domain-adaptive Data Synthesis for Large-scale Supermarket Product Recognition [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7750241
    Explore at:
    Dataset updated
    Apr 5, 2024
    Dataset provided by
    Kampel, Martin
    Strohmayer, Julian
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Domain-Adaptive Data Synthesis for Large-Scale Supermarket Product Recognition

    This repository contains the data synthesis pipeline and synthetic product recognition datasets proposed in [1].

    Data Synthesis Pipeline:

    We provide the Blender 3.1 project files and Python source code of our data synthesis pipeline (pipeline.zip), accompanied by the FastCUT models used for synthetic-to-real domain translation (models.zip). For the synthesis of new shelf images, a product assortment list and product images must be provided in the corresponding directories products/assortment/ and products/img/. The pipeline expects product images to follow the naming convention c.png, with c corresponding to a GTIN or generic class label (e.g., 9120050882171.png). The assortment list, assortment.csv, is expected to use the sample format [c, w, d, h], with c being the class label and w, d, and h being the packaging dimensions of the given product in mm (e.g., [4004218143128, 140, 70, 160]). The assortment list to use and the number of images to generate can be specified in generateImages.py (see comments). The rendering process is initiated by either executing load.py from within Blender or within a command-line terminal as a background process.
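
    To make the expected inputs concrete, here is a small sketch that writes an assortment list in the [c, w, d, h] format described above; the entries reuse the example values from the text and are otherwise placeholders, and it assumes the products/assortment/ directory already exists.

    import csv

    # Each row: class label (GTIN or generic label), then packaging width, depth,
    # and height in mm, e.g. [4004218143128, 140, 70, 160]
    assortment = [
        [4004218143128, 140, 70, 160],
        [9120050882171, 60, 60, 120],
    ]

    with open('products/assortment/assortment.csv', 'w', newline='') as f:
        csv.writer(f).writerows(assortment)

    # The matching product images are expected as products/img/<c>.png,
    # e.g. products/img/9120050882171.png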

    Datasets:

    SG3k - Synthetic GroZi-3.2k (SG3k) dataset, consisting of 10,000 synthetic shelf images with 851,801 instances of 3,234 GroZi-3.2k products. Instance-level bounding boxes and generic class labels are provided for all product instances.

    SG3kt - Domain-translated version of SG3k, utilizing GroZi-3.2k as the target domain. Instance-level bounding boxes and generic class labels are provided for all product instances.

    SGI3k - Synthetic GroZi-3.2k (SGI3k) dataset, consisting of 10,000 synthetic shelf images with 838,696 instances of 1,063 GroZi-3.2k products. Instance-level bounding boxes and generic class labels are provided for all product instances.

    SGI3kt - Domain-translated version of SGI3k, utilizing GroZi-3.2k as the target domain. Instance-level bounding boxes and generic class labels are provided for all product instances.

    SPS8k - Synthetic Product Shelves 8k (SPS8k) dataset, comprising 16,224 synthetic shelf images with 1,981,967 instances of 8,112 supermarket products. Instance-level bounding boxes and GTIN class labels are provided for all product instances.

    SPS8kt - Domain-translated version of SPS8k, utilizing SKU110k as the target domain. Instance-level bounding boxes and GTIN class labels are provided for all product instances.

    Table 1: Dataset characteristics.

    Dataset | images | products | instances | labels                        | translation
    SG3k    | 10,000 | 3,234    | 851,801   | bounding box & generic class¹ | none
    SG3kt   | 10,000 | 3,234    | 851,801   | bounding box & generic class¹ | GroZi-3.2k
    SGI3k   | 10,000 | 1,063    | 838,696   | bounding box & generic class² | none
    SGI3kt  | 10,000 | 1,063    | 838,696   | bounding box & generic class² | GroZi-3.2k
    SPS8k   | 16,224 | 8,112    | 1,981,967 | bounding box & GTIN           | none
    SPS8kt  | 16,224 | 8,112    | 1,981,967 | bounding box & GTIN           | SKU110k

    Sample Format

    A sample consists of an RGB image (i.png) and an accompanying label file (i.txt), which contains the labels for all product instances present in the image. Labels use the YOLO format [c, x, y, w, h].

    ¹SG3k and SG3kt use generic pseudo-GTIN class labels, created by combining the GroZi-3.2k food product category number i (1-27) with the product image index j (j.jpg), following the convention i0000j (e.g., 13000097).

    ²SGI3k and SGI3kt use the generic GroZi-3.2k class labels from https://arxiv.org/abs/2003.06800.
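
    For convenience, a minimal sketch of parsing one label file in the YOLO format described above; the file name is a placeholder.

    def read_yolo_labels(path):
        """Parse a label file with one [c, x, y, w, h] entry per line."""
        boxes = []
        with open(path) as f:
            for line in f:
                c, x, y, w, h = line.split()
                boxes.append((c, float(x), float(y), float(w), float(h)))
        return boxes

    # e.g. image 0.png with label file 0.txt (placeholder names)
    labels = read_yolo_labels('0.txt')
    print(len(labels), 'product instances')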

    Download and Use

    This data may be used for non-commercial research purposes only. If you publish material based on this data, we request that you include a reference to our paper [1].

    [1] Strohmayer, Julian, and Martin Kampel. "Domain-Adaptive Data Synthesis for Large-Scale Supermarket Product Recognition." International Conference on Computer Analysis of Images and Patterns. Cham: Springer Nature Switzerland, 2023.

    BibTeX citation:

    @inproceedings{strohmayer2023domain, title={Domain-Adaptive Data Synthesis for Large-Scale Supermarket Product Recognition}, author={Strohmayer, Julian and Kampel, Martin}, booktitle={International Conference on Computer Analysis of Images and Patterns}, pages={239--250}, year={2023}, organization={Springer} }

  9. Synthetic total-field magnetic anomaly data and code to perform Euler...

    • figshare.com
    txt
    Updated May 30, 2023
    Cite
    Leonardo Uieda; Vanderlei C. Oliveira Jr.; Valeria C. F. Barbosa (2023). Synthetic total-field magnetic anomaly data and code to perform Euler deconvolution on it [Dataset]. http://doi.org/10.6084/m9.figshare.923450.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Leonardo Uieda; Vanderlei C. Oliveira Jr.; Valeria C. F. Barbosa
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Synthetic data, source code, and supplementary text for the article "Euler deconvolution of potential field data" by Leonardo Uieda, Vanderlei C. Oliveira Jr., and Valéria C. F. Barbosa. This is part of a tutorial submitted to The Leading Edge (http://library.seg.org/journal/tle). Results were generated using the open-source Python package Fatiando a Terra version 0.2 (http://www.fatiando.org). This material along with the manuscript can also be found at https://github.com/pinga-lab/paper-tle-euler-tutorial

    Synthetic data and model

    Examples in the tutorial use synthetic data generated with the IPython notebook create_synthetic_data.ipynb. File synthetic_data.txt has 4 columns: x (north), y (east), z (down) and the total field magnetic anomaly. x, y, and z are in meters. The total field anomaly is in nanoTesla (nT). File metadata.json contains extra information about the data, such as inclination and declination of the inducing field (in degrees), shape of the data grid (number of points in y and x, respectively), the area containing the data (W, E, S, N, in meters), and the model boundaries (W, E, S, N, top, bottom, in meters). File model.pickle is a serialized version of the model used to generate the data. It contains a list of instances of the PolygonalPrism class of Fatiando a Terra. The serialization was done using the cPickle Python module.

    Reproducing the results in the tutorial

    The notebook euler-deconvolution-examples.ipynb runs the Euler deconvolution on the synthetic data and generates the figures for the manuscript. It also presents a more detailed explanation of the method and more tests than went into the finished manuscript.
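
    A short, hedged sketch of loading the three data files described above with standard Python tools (the cPickle module of Python 2 is simply pickle in Python 3; the latin1 encoding option is the usual workaround for Python 2 pickles):

    import json
    import pickle

    import numpy as np

    # Columns: x (north), y (east), z (down) in meters, total-field anomaly in nT
    x, y, z, anomaly = np.loadtxt('synthetic_data.txt', unpack=True)

    with open('metadata.json') as f:
        metadata = json.load(f)  # inclination, declination, grid shape, area, model bounds

    # model.pickle was written with cPickle under Python 2
    with open('model.pickle', 'rb') as f:
        model = pickle.load(f, encoding='latin1')

    print(x.shape, sorted(metadata), type(model))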

  10. Online Retail & E-Commerce Dataset

    • kaggle.com
    Updated Mar 20, 2025
    Cite
    Ertuğrul EŞOL (2025). Online Retail & E-Commerce Dataset [Dataset]. https://www.kaggle.com/datasets/ertugrulesol/online-retail-data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 20, 2025
    Dataset provided by
    Kaggle
    Authors
    Ertuğrul EŞOL
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Overview:

    This dataset contains 1000 rows of synthetic online retail sales data, mimicking transactions from an e-commerce platform. It includes information about customer demographics, product details, purchase history, and (optional) reviews. This dataset is suitable for a variety of data analysis, data visualization and machine learning tasks, including but not limited to: customer segmentation, product recommendation, sales forecasting, market basket analysis, and exploring general e-commerce trends. The data was generated using the Python Faker library, ensuring realistic values and distributions, while maintaining no privacy concerns as it contains no real customer information.

    Data Source:

    This dataset is entirely synthetic. It was generated using the Python Faker library and does not represent any real individuals or transactions.

    Data Content:

    • customer_id (Integer): Unique customer identifier (ranging from 10000 to 99999)
    • order_date (Date): Order date (a random date within the last year)
    • product_id (Integer): Product identifier (ranging from 100 to 999)
    • category_id (Integer): Product category identifier (10, 20, 30, 40, or 50)
    • category_name (String): Product category name (Electronics, Fashion, Home & Living, Books & Stationery, Sports & Outdoors)
    • product_name (String): Product name (randomly selected from a list of products within the corresponding category)
    • quantity (Integer): Quantity of the product ordered (ranging from 1 to 5)
    • price (Float): Unit price of the product (ranging from 10.00 to 500.00, with two decimal places)
    • payment_method (String): Payment method used (Credit Card, Bank Transfer, Cash on Delivery)
    • city (String): Customer's city (generated using Faker's city() method, so the locations depend on the Faker locale used)
    • review_score (Integer): Customer's product rating (ranging from 1 to 5, or None with a 20% probability)
    • gender (String): Customer's gender (M/F, or None with a 10% probability)
    • age (Integer): Customer's age (ranging from 18 to 75)
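
    For a quick start, a hedged loading example with pandas; the CSV file name below is a placeholder, since the exact file name on Kaggle may differ.

    import pandas as pd

    # Placeholder file name; adjust to the CSV actually downloaded from Kaggle
    df = pd.read_csv('online_retail_data.csv', parse_dates=['order_date'])

    # Basic checks against the column descriptions above
    print(df.shape)  # expected: 1000 rows, 13 columns
    print(df['category_name'].value_counts())
    print(df.groupby('payment_method')['price'].mean())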

    Potential Use Cases (Inspiration):

    Customer Segmentation: Group customers based on demographics, purchasing behavior, and preferences.

    Product Recommendation: Build a recommendation system to suggest products to customers based on their past purchases and browsing history.

    Sales Forecasting: Predict future sales based on historical trends.

    Market Basket Analysis: Identify products that are frequently purchased together.

    Price Optimization: Analyze the relationship between price and demand.

    Geographic Analysis: Explore sales patterns across different cities.

    Time Series Analysis: Investigate sales trends over time.

    Educational Purposes: Great for practicing data cleaning, EDA, feature engineering, and modeling.

  11. Supporting data and code: Beyond Economic Dispatch: Modeling Renewable...

    • zenodo.org
    zip
    Updated Apr 15, 2025
    Cite
    Jaruwan Pimsawan; Stefano Galelli (2025). Supporting data and code: Beyond Economic Dispatch: Modeling Renewable Purchase Agreements in Production Cost Models [Dataset]. http://doi.org/10.5281/zenodo.15219959
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 15, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jaruwan Pimsawan; Stefano Galelli
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository provides the necessary data and Python code to replicate the experiments and generate the figures presented in our manuscript, "Beyond Economic Dispatch: Modeling Renewable Purchase Agreements in Production Cost Models".

    Contents:

    • pownet.zip: Contains PowNet version 3.2, the specific version of the simulation software used in this study.
    • inputs.zip: Contains essential modeling inputs required by PowNet for the experiments, including network data, and pre-generated synthetic load and solar time series.
    • scripts.zip: Contains the Python scripts used for installing PowNet, optionally regenerating synthetic data, running simulation experiments, processing results, and generating figures.
    • thai_data.zip (Reference Only): Contains raw data related to the 2023 Thai power system. This data served as a reference during the creation of the PowNet inputs for this study but is not required to run the replication experiments themselves. Code to process the raw data is also provided.

    System Requirements:

    • Python version 3.10+
    • pip package manager

    Setup Instructions:

    1. Download and Unzip Core Files: Download pownet.zip, inputs.zip, scripts.zip, and thai_data.zip. Extract their contents into the same parent folder. Your directory structure should look like this:

      Parent_Folder/
      ├── pownet/    # from pownet.zip
      ├── inputs/    # from inputs.zip
      ├── scripts/    # from scripts.zip
      ├── thai_data/  # from thai_data.zip
      ├── figures/    # Created by scripts later
      ├── outputs/ # Created by scripts later
    2. Install PowNet:

      • Open your terminal or command prompt.
      • Navigate into the pownet directory that you just extracted:

    cd path/to/Parent_Folder/pownet

    pip install -e .

      • These commands install PowNet and its required dependencies into your active Python environment.

    Workflow and Usage:

    Note: All subsequent Python script commands should be run from the scripts directory. Navigate to it first:

    cd path/to/Parent_Folder/scripts

    1. Generate Synthetic Time Series (Optional):

    • This step is optional as the required time series files are already provided within the inputs directory (extracted from inputs.zip). If you wish to regenerate them:
    • Run the generation scripts:
      python create_synthetic_load.py
      python create_synthetic_solar.py
    • Evaluate the generated time series (optional):
      python eval_synthetic_load.py
      python eval_synthetic_solar.py

    2. Calculate Total Solar Availability:

    • Process solar scenarios using data from the inputs directory:
      python process_scenario_solar.py
      

    3. Experiment 1: Compare Strategies for Modeling Purchase Obligations:

    • Run the base case simulations for different modeling strategies:
      • No Must-Take (NoMT):
        python run_basecase.py --model_name "TH23NMT"
        
      • Zero-Cost Renewables (ZCR):
        python run_basecase.py --model_name "TH23ZC"
        
      • Penalized Curtailment (Proposed Method):
        python run_basecase.py --model_name "TH23"
        
    • Run the base case simulation for the Minimum Capacity (MinCap) strategy:
      python run_min_cap.py

      This is a new script because we need to modify the objective function and add constraints.

    4. Experiment 2: Simulate Partial-Firm Contract Switching:

    • Run simulations comparing the base case with the partial-firm contract scenario:
      • Base Case Scenario:
        python run_scenarios.py --model_name "TH23"
        
      • Partial-Firm Contract Scenario:
        python run_scenarios.py --model_name "TH23ESB"
        

    5. Visualize Results:

    • Generate all figures presented in the manuscript:
      python run_viz.py
      
    • Figures will typically be saved in a figures directory within the Parent_Folder.
  12. Replication Data for: Towards tsunami early-warning with Distributed...

    • entrepot.recherche.data.gouv.fr
    txt, zip
    Updated May 12, 2025
    Cite
    Carlos Becerril (2025). Replication Data for: Towards tsunami early-warning with Distributed Acoustic Sensing: expected seafloor strains induced by tsunamis [Dataset]. http://doi.org/10.57745/ENBAIS
    Explore at:
    Available download formats: txt (2945), zip (29141166165), zip (29048595813)
    Dataset updated
    May 12, 2025
    Dataset provided by
    Recherche Data Gouv
    Authors
    Carlos Becerril
    License

    https://spdx.org/licenses/etalab-2.0.html

    Time period covered
    Mar 3, 2024
    Dataset funded by
    Agence nationale de la recherche
    European Research Council
    European Innovation Council
    Description

    This repository contains the data sets and Python routines to replicate results outlined in the manuscript "Towards tsunami early-warning with Distributed Acoustic Sensing: expected seafloor strains induced by tsunamis". The contents of this repository are divided into 2 zip files.

    Repository_Part1.zip:
    A) Input files used to define and render the simulation with the SeisSol software package.
    B) StrainModel_Fig4_5.py -- Python routine to generate Figures 3 and 4 from the seafloor strain model as described in the manuscript.
    C) PREM.csv -- PREM model, auxiliary file for StrainModel_Fig4_5.py.
    D) Rcvr_Processing.py -- Python routine to extract and process data contained in F) through I) and generate the results observed in Fig. 6.
    E) receiver_lines1234567.dat -- Receiver location file, auxiliary file for Rcvr_Processing.py.
    F) Seafloor_Array_Y_0km -- Directory containing synthetic data (SeisSol generated) from the seafloor-buried (10 cm) receivers along the array Y = 0 km.
    G) SeaSurface_Array_Y_0km -- Directory containing synthetic data (SeisSol generated) from the receivers placed 10 cm below the sea surface, along the array Y = 0 km.

    Repository_Part2.zip:
    H) Seafloor_Array_X_100km -- Directory containing synthetic data (SeisSol generated) from the seafloor-buried (10 cm) receivers along the array X = 100 km.
    I) SeaSurface_Array_X_100km -- Directory containing synthetic data (SeisSol generated) from the receivers placed 10 cm below the sea surface, along the array X = 100 km.

  13. glaive-code-assistant

    • huggingface.co
    Updated Sep 22, 2023
    + more versions
    Cite
    Glaive AI (2023). glaive-code-assistant [Dataset]. https://huggingface.co/datasets/glaiveai/glaive-code-assistant
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 22, 2023
    Dataset authored and provided by
    Glaive AI
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Glaive-code-assistant

    Glaive-code-assistant is a dataset of ~140k code problems and solutions generated using Glaive's synthetic data generation platform. The data is intended to be used to make models act as code assistants, and so it is structured in a QA format where the questions are worded similarly to how real users ask code-related questions. The data has ~60% Python samples. To report any problems or suggestions in the data, join the Glaive Discord.

  14. 6DOF pose estimation - synthetically generated dataset using BlenderProc

    • search.dataone.org
    • datadryad.org
    Updated Nov 27, 2023
    Cite
    Divyam Sheth (2023). 6DOF pose estimation - synthetically generated dataset using BlenderProc [Dataset]. http://doi.org/10.5061/dryad.rbnzs7hj5
    Explore at:
    Dataset updated
    Nov 27, 2023
    Dataset provided by
    Dryad Digital Repository
    Authors
    Divyam Sheth
    Time period covered
    Jan 1, 2023
    Description

    Accurate and robust 6DOF (Six Degrees of Freedom) pose estimation is a critical task in various fields, including computer vision, robotics, and augmented reality. This research paper presents a novel approach to enhance the accuracy and reliability of 6DOF pose estimation by introducing a robust method for generating synthetic data and leveraging the ease of multi-class training using the generated dataset. The proposed method tackles the challenge of insufficient real-world annotated data by creating a large and diverse synthetic dataset that accurately mimics real-world scenarios. The proposed method only requires a CAD model of the object and there is no limit to the number of unique data that can be generated. Furthermore, a multi-class training strategy that harnesses the synthetic dataset's diversity is proposed and presented. This approach mitigates class imbalance issues and significantly boosts accuracy across varied object classes and poses. Experimental results underscore th...

    This dataset has been synthetically generated using 3D software like Blender and APIs like BlenderProc.

    Data Repository README

    This repository contains data organized into a structured format. The data consists of three main folders and two files, each serving a specific purpose. The data contains two folders - Cat and Hand.

    Cat Dataset: 63492 labeled data with images, masks, and poses.

    Hand Dataset: 42418 labeled data with images, masks, and poses.

    Usage: The dataset is ready for use by simply extracting the contents of the zip file, whether for training in a segmentation task or a pose estimation task.

    To view .npy files you will need to use Python with the numpy package installed. In Python use the following commands.

    import numpy
    data = numpy.load('file.npy')
    print(data)

    What free/open software is appropriate for viewing the .ply files?
    These files can be opened using any 3D modeling software like Blender, Meshlab, etc.

    Camera Matrix Intrinsics Format:

    Fx 0  px
    0  Fy py
    0  0  1
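
    For reference, a small sketch assembling these intrinsics into a 3x3 matrix with NumPy; the focal lengths and principal point below are placeholders, not values taken from this dataset.

    import numpy as np

    # Placeholder intrinsics in pixels; replace with the values shipped with the dataset
    fx, fy = 1000.0, 1000.0
    px, py = 640.0, 480.0

    K = np.array([
        [fx, 0.0, px],
        [0.0, fy, py],
        [0.0, 0.0, 1.0],
    ])
    print(K)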

    Below is an overview of the data organization:

    Folder Structure

    1. Rgb:
      • This ...
  15. python_plagiarism_code_dataset

    • huggingface.co
    Cite
    nop, python_plagiarism_code_dataset [Dataset]. https://huggingface.co/datasets/nop12/python_plagiarism_code_dataset
    Explore at:
    Authors
    nop
    Description

    Python Plagiarism Code Dataset

      Overview
    

    This dataset contains pairs of Python code samples with varying degrees of similarity, designed for training and evaluating plagiarism detection systems. The dataset was created using Large Language Models (LLMs) to generate synthetic code variations at different transformation levels, simulating real-world plagiarism scenarios in an academic context.

      Purpose
    

    The dataset addresses the limitations of existing code… See the full description on the dataset page: https://huggingface.co/datasets/nop12/python_plagiarism_code_dataset.

  16. spp

    • huggingface.co
    Updated Jun 6, 2023
    Cite
    wuyetao (2023). spp [Dataset]. https://huggingface.co/datasets/wuyetao/spp
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 6, 2023
    Authors
    wuyetao
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Synthetic Python Problems (SPP) Dataset

    The dataset includes around 450k synthetic Python programming problems. Each Python problem consists of a task description, 1-3 examples, a code solution, and 1-3 test cases. The CodeGeeX-13B model was used to generate this dataset. A subset of the data has been verified by a Python interpreter and de-duplicated; this subset is SPP_30k_verified.jsonl. The dataset is in .jsonl format (one JSON object per line). Released as part of Self-Learning to Improve Code… See the full description on the dataset page: https://huggingface.co/datasets/wuyetao/spp.
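
    Since the data ships as JSON Lines, each record can be read as follows (no field names are assumed here):

    import json

    # Read the verified subset; every line is one JSON object describing a problem
    with open('SPP_30k_verified.jsonl', encoding='utf-8') as f:
        problems = [json.loads(line) for line in f]

    print(len(problems))
    print(sorted(problems[0].keys()))  # inspect the per-problem fields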

  17. Data in Support of the MIDI-B Challenge (MIDI-B-Synthetic-Validation,...

    • cancerimagingarchive.net
    csv, dicom, n/a +1
    Updated May 2, 2025
    Cite
    The Cancer Imaging Archive (2025). Data in Support of the MIDI-B Challenge (MIDI-B-Synthetic-Validation, MIDI-B-Curated-Validation, MIDI-B-Synthetic-Test, MIDI-B-Curated-Test) [Dataset]. http://doi.org/10.7937/cf2p-aw56
    Explore at:
    Available download formats: sqlite and zip, dicom, csv, n/a
    Dataset updated
    May 2, 2025
    Dataset authored and provided by
    The Cancer Imaging Archive
    License

    https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/

    Time period covered
    May 2, 2025
    Dataset funded by
    National Cancer Institute (http://www.cancer.gov/)
    Description

    Abstract

    These resources comprise a large and diverse collection of multi-site, multi-modality, and multi-cancer clinical DICOM images from 538 subjects infused with synthetic PHI/PII in areas encountered by TCIA curation teams. Also provided is a TCIA-curated version of the synthetic dataset, along with mapping files for mapping identifiers between the two.

    This new MIDI data resource includes DICOM datasets used in the Medical Image De-Identification Benchmark (MIDI-B) challenge at MICCAI 2024. They are accompanied by ground truth answer keys and a validation script for evaluating the effectiveness of medical image de-identification workflows. The validation script systematically assesses de-identified data against an answer key outlining appropriate actions and values for proper de-identification of medical images, promoting safer and more consistent medical image sharing.

    Introduction

    Medical imaging research increasingly relies on large-scale data sharing. However, reliable de-identification of DICOM images still presents significant challenges due to the wide variety of DICOM header elements and pixel data where identifiable information may be embedded. To address this, we have developed an openly accessible synthetic dataset containing artificially generated protected health information (PHI) and personally identifiable information (PII).

    These resources complement our earlier work (Pseudo-PHI-DICOM-data ) hosted on The Cancer Imaging Archive. As an example of its use, we also provide a version curated by The Cancer Imaging Archive (TCIA) curation team. This resource builds upon best practices emphasized by the MIDI Task Group who underscore the importance of transparency, documentation, and reproducibility in de-identification workflows, part of the themes at recent conferences (Synapse:syn53065760) and workshops (2024 MIDI-B Challenge Workshop).

    This framework enables objective benchmarking of de-identification performance, promotes transparency in compliance with regulatory standards, and supports the establishment of consistent best practices for sharing clinical imaging data. We encourage the research community to use these resources to enhance and standardize their medical image de-identification workflows.

    Methods

    Subject Inclusion and Exclusion Criteria

    The source data were selected from imaging already hosted in de-identified form on TCIA. Imaging containing faces was excluded, and no new human studies were performed for this project.

    Data Acquisition

    To build the synthetic dataset, image series were selected from TCIA's curated datasets to represent a broad range of imaging modalities (CR, CT, DX, MG, MR, PT, SR, US), manufacturers (GE, Siemens, Varian, Confirma, Agfa, Eigen, Elekta, Hologic, KONICA MINOLTA, and others), scan parameters, and regions of the body. These were processed to inject the synthetic PHI/PII as described.

    Data Analysis

    Synthetic pools of PHI, like subject and scanning institution information, were generated using the Python package Faker (https://pypi.org/project/Faker/8.10.3/). These were inserted into the DICOM metadata of selected imaging files using a system of inheritable rule-based templates outlining re-identification functions for data insertion, with logging for answer key creation. Text was also burned into the pixel data of a number of images. By systematically embedding realistic synthetic PHI into image headers and pixel data, accompanied by a detailed ground-truth answer key, our framework enables transparency, documentation, and reproducibility in de-identification practices, aligned with the HIPAA Safe Harbor method, DICOM PS3.15 Confidentiality Profiles, and TCIA best practices.

    Usage Notes

    This DICOM collection is split into two datasets, synthetic and curated. The synthetic dataset is the PHI/PII infused DICOM collection accompanied by a validation script and answer keys for testing, refining and benchmarking medical image de-identification pipelines. The curated dataset is a version of the synthetic dataset curated and de-identified by members of The Cancer Imaging Archive curation team. It can be used as a guide, an example of medical image curation best practices. For the purposes of the De-Identification challenge at MICCAI 2024, the synthetic and curated datasets each contain two subsets, a portion for Validation and the other for Testing.

    To link a curated dataset to the original synthetic dataset and answer keys, a mapping between the unique identifiers (UIDs) and patient IDs must be provided in CSV format to the evaluation software. We include the mapping files associated with the TCIA-curated set as an example. Lastly, for both the Validation and Testing datasets, an answer key in sqlite.db format is provided. These components are for use with the Python validation script linked below (4). Combining these components, a user developing or evaluating de-identification methods can ensure they meet a specification for successfully de-identifying medical image data.
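
    As a starting point for working with the answer key, here is a hedged sketch that opens the SQLite file and lists its tables; the file name is a placeholder, and the schema is defined by the challenge tooling rather than assumed here.

    import sqlite3

    # Placeholder name for the provided answer key in sqlite.db format
    con = sqlite3.connect('answer_key.sqlite.db')

    # List the tables before querying, since the schema is set by the validation script
    tables = con.execute("SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
    print(tables)
    con.close()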

  18. BY-COVID - WP5 - Baseline Use Case: SARS-CoV-2 vaccine effectiveness...

    • explore.openaire.eu
    Updated Jan 26, 2023
    Cite
    Francisco Estupiñán-Romero; Nina Van Goethem; Marjan Meurisse; Javier González-Galindo; Enrique Bernal-Delgado (2023). BY-COVID - WP5 - Baseline Use Case: SARS-CoV-2 vaccine effectiveness assessment - Common Data Model Specification [Dataset]. http://doi.org/10.5281/zenodo.6913045
    Explore at:
    Dataset updated
    Jan 26, 2023
    Authors
    Francisco Estupiñán-Romero; Nina Van Goethem; Marjan Meurisse; Javier González-Galindo; Enrique Bernal-Delgado
    Description

    This publication corresponds to the Common Data Model (CDM) specification of the Baseline Use Case proposed in T.5.2 (WP5) in the BY-COVID project on “SARS-CoV-2 Vaccine(s) effectiveness in preventing SARS-CoV-2 infection.”

    Research Question: “How effective have the SARS-CoV-2 vaccination programmes been in preventing SARS-CoV-2 infections?”

    • Intervention (exposure): COVID-19 vaccine(s)
    • Outcome: SARS-CoV-2 infection
    • Subgroup analysis: Vaccination schedule (type of vaccine)

    Study Design: An observational retrospective longitudinal study to assess the effectiveness of the SARS-CoV-2 vaccine in preventing SARS-CoV-2 infections using routinely collected social, health and care data from several countries. A causal model was established using Directed Acyclic Graphs (DAGs) to map domain knowledge, theories and assumptions about the causal relationship between exposure and outcome. The DAG developed for the research question of interest is shown below.

    Cohort definition: All people eligible to be vaccinated (from 5 to 115 years old, included) or with, at least, one dose of a SARS-CoV-2 vaccine (any of the available brands), having or not a previous SARS-CoV-2 infection.

    Inclusion criteria: All people vaccinated with at least one dose of the COVID-19 vaccine (any available brands) in an area of residence. Any person eligible to be vaccinated (from 5 to 115 years old, included) with a positive diagnosis (irrespective of the type of test) for SARS-CoV-2 infection (COVID-19) during the period of study.

    Exclusion criteria: People not eligible for the vaccine (from 0 to 4 years old, included).

    Study period: From the date of the first documented SARS-CoV-2 infection in each country to the most recent date in which data is available at the time of analysis. Roughly from 01-03-2020 to 30-06-2022, depending on the country.
Files included in this publication: Causal model (responding to the research question) SARS-CoV-2 vaccine effectiveness causal model v.1.0.0 (HTML) - Interactive report showcasing the structural causal model (DAG) to answer the research question SARS-CoV-2 vaccine effectiveness causal model v.1.0.0 (QMD) - Quarto RMarkdown script to produce the structural causal model Common data model specification (following the causal model) SARS-CoV-2 vaccine effectiveness data model specification (XLXS) - Human-readable version (Excel) SARS-CoV-2 vaccine effectiveness data model specification dataspice (HTML) - Human-readable version (interactive report) SARS-CoV-2 vaccine effectiveness data model specification dataspice (JSON) - Machine-readable version Synthetic dataset (complying with the common data model specifications) SARS-CoV-2 vaccine effectiveness synthetic dataset (CSV) [UTF-8, pipe | separated, N~650,000 registries] SARS-CoV-2 vaccine effectiveness synthetic dataset EDA (HTML) - Interactive report of the exploratory data analysis (EDA) of the synthetic dataset SARS-CoV-2 vaccine effectiveness synthetic dataset EDA (JSON) - Machine-readable version of the exploratory data analysis (EDA) of the synthetic dataset SARS-CoV-2 vaccine effectiveness synthetic dataset generation script (IPYNB) - Jupyter notebook with Python scripting and commenting to generate the synthetic dataset #### Baseline Use Case: SARS-CoV-2 vaccine effectiveness assessment - Common Data Model Specification v.1.1.0 change log #### Updated Causal model to eliminate the consideration of 'vaccination_schedule_cd' as a mediator Adjusted the study period to be consistent with the Study Protocol Updated 'sex_cd' as a required variable Added 'chronic_liver_disease_bl' as a comorbidity at the individual level Updated 'socecon_lvl_cd' at the area level as a recommended variable Added crosswalks for the definition of 'chronic_liver_disease_bl' in a separate sheet Updated the 'vaccination_schedule_cd' reference to the 'Vaccine' node in the updated DAG Updated the description of the 'confirmed_case_dt' and 'previous_infection_dt' variables to clarify the definition and the need for a single registry per person The scripts (software) accompanying the data model specification are offered "as-is" without warranty and disclaiming liability for damages resulting from using it. The software is released under the CC-BY-4.0 licence, which permits you to use the content for almost any purpose (but does not grant you any trademark permissions), so long as you note the license and give credit.
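    As a quick orientation (a minimal sketch, not part of the original publication), the pipe-separated synthetic dataset can be read with pandas along these lines. The file name is assumed for illustration, and the column names are taken from variables mentioned in the change log above, so they may differ in the actual record.

    import pandas as pd

    # Hypothetical file name; adjust to the CSV shipped in the Zenodo record.
    # The listing above specifies UTF-8 encoding, a pipe ('|') separator, and ~650,000 rows.
    df = pd.read_csv('vaccine_effectiveness_synthetic_dataset.csv', sep='|', encoding='utf-8')
    print(df.shape)

    # 'sex_cd' and 'confirmed_case_dt' are variable names taken from the change log above;
    # drop or rename these lines if the actual columns differ.
    df['confirmed_case_dt'] = pd.to_datetime(df['confirmed_case_dt'], errors='coerce')
    print(df['sex_cd'].value_counts(dropna=False))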

  19. Synthetic Electrochemical Impedance Spectra Generator

    • zenodo.org
    text/x-python, txt +1
    Updated Jan 15, 2025
    Cite
    Slava SHKIRSKIY; Slava SHKIRSKIY (2025). Synthetic Electrochemical Impedance Spectra Generator [Dataset]. http://doi.org/10.5281/zenodo.14652183
    Explore at:
    zip, txt, text/x-python (available download formats)
    Dataset updated
    Jan 15, 2025
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Slava SHKIRSKIY; Slava SHKIRSKIY
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    # Synthetic Electrochemical Impedance Spectra Generator

    This Python script generates synthetic electrochemical impedance spectroscopy (EIS) spectra for predefined circuits using the `CustomCircuit` class from the `impedance` library (`impedance.models.circuits`). It simulates realistic experimental datasets for educational purposes, incorporating random file names, missing values, and empty files.

    ## Features
    - **Circuit Modeling**: Supports circuits like `R0-C0`, `R0-p(R1,C1)`, etc., with randomized parameters.
    - **Custom Frequency Range**: Logarithmic sweep from \(10^5\) to \(10^{-2}\) Hz.
    - **Realistic Data Challenges**:
      - Random 3-line headers in files.
      - Missing values in every other 100th file.
      - Empty data in every 100th file.

    ## Output Format
    - **Columns**: `Freq_Hz`, `Re_Z_Ohm`, `-Im_Z_Ohm`, `|Z|_Ohm`, `Phase_deg`.
    - **File Naming**: Random 8-character alphanumeric strings.

    Customize circuits, frequency range, and data patterns as needed.
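
    For orientation, a minimal sketch of the core idea (not the author's full script): simulating a single spectrum for the R0-p(R1,C1) circuit with the impedance library and writing the column layout listed above. The parameter values and output file name are illustrative assumptions.

    import numpy as np
    import pandas as pd
    from impedance.models.circuits import CustomCircuit

    # Logarithmic sweep from 1e5 Hz down to 1e-2 Hz, as in the description
    freq = np.logspace(5, -2, 71)

    # Illustrative parameters for R0-p(R1,C1): R0 and R1 in Ohm, C1 in F
    circuit = CustomCircuit(circuit='R0-p(R1,C1)', initial_guess=[20.0, 250.0, 1e-5])
    Z = circuit.predict(freq, use_initial=True)  # complex impedance from the initial parameters

    # Column layout listed under "Output Format"
    df = pd.DataFrame({
        'Freq_Hz': freq,
        'Re_Z_Ohm': Z.real,
        '-Im_Z_Ohm': -Z.imag,
        '|Z|_Ohm': np.abs(Z),
        'Phase_deg': np.degrees(np.angle(Z)),
    })
    df.to_csv('a1b2c3d4.csv', index=False)  # the real generator uses random 8-character names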

  20. Smashcima (2025-03-28)

    • live.european-language-grid.eu
    Updated Dec 29, 2024
    + more versions
    Cite
    (2024). Smashcima (2025-03-28) [Dataset]. https://live.european-language-grid.eu/catalogue/tool-service/23850
    Explore at:
    Dataset updated
    Dec 29, 2024
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Smashcima is a library and framework for synthesizing images of handwritten music, used to create synthetic training data for optical music recognition (OMR) models. It is primarily intended for use in OMR workflows, especially with domain adaptation in mind. The target user is therefore a machine learning, document processing, library science, or computational musicology researcher with minimal skills in Python programming.

    Smashcima is the only tool that simultaneously:
    - synthesizes handwritten music notation,
    - produces not only raster images but also segmentation masks, classification labels, bounding boxes, and more,
    - synthesizes entire pages as well as individual symbols,
    - synthesizes background paper textures,
    - also synthesizes polyphonic and pianoform music images,
    - accepts just MusicXML as input,
    - is written in Python, which simplifies its adoption and extensibility.

    Therefore, Smashcima brings a unique new capability for optical music recognition (OMR): synthesizing a near-realistic image of handwritten sheet music from just a MusicXML file. As opposed to notation editors, which work with a fixed set of fonts and a set of layout rules, it can adapt handwriting styles from existing OMR datasets to arbitrary music (beyond the music encoded in existing OMR datasets), and randomize layout to simulate the imprecisions of handwriting, while guaranteeing the semantic correctness of the output rendering. Crucially, the rendered image is provided also with the positions of all the visual elements of music notation, so that both object detection-based and sequence-to-sequence OMR pipelines can utilize Smashcima as a synthesizer of training data.

    (In combination with the LMX canonical linearization of MusicXML, one can imagine the endless possibilities of running Smashcima on inputs from a MusicXML generator.)

Cite
Marc Gillioz; Marc Gillioz; Guillaume Dubuis; Philippe Jacquod; Philippe Jacquod; Guillaume Dubuis (2025). A large synthetic dataset for machine learning applications in power transmission grids [Dataset]. http://doi.org/10.5281/zenodo.13378476

Data from: A large synthetic dataset for machine learning applications in power transmission grids

2 scholarly articles cite this dataset (View in Google Scholar)

Usage

The time series can be used without reference to the network file, by simply using all or a selection of columns of the CSV files, depending on one's needs. We show below how to select series from a particular country, and how to aggregate hourly time steps into days or weeks. These examples use Python and the data analysis library pandas, but other frameworks (e.g., Matlab or Julia) can be used as well. Since all the yearly time series are periodic, it is always possible to define a coherent time window modulo the length of the series.

Selecting a particular country

This example illustrates how to select generation data for Switzerland in Python. This can be done without parsing the network file, by instead using gens_by_country.csv, which lists the generators of each country in the network. We start by importing the pandas library and reading the column of the file corresponding to Switzerland (country code CH):

import pandas as pd
CH_gens = pd.read_csv('gens_by_country.csv', usecols=['CH'], dtype=str)

The object created in this way is a DataFrame with some null values (not all countries have the same number of generators). It can be turned into a list with:

CH_gens_list = CH_gens.dropna().squeeze().to_list()

Finally, we can import all the time series of Swiss generators from a given data table with

pd.read_csv('gens_2016_1.csv', usecols=CH_gens_list)

The same procedure can be applied to loads using the list contained in the file loads_by_country.csv.
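
For instance, continuing the session above and assuming the same country code CH appears in that file, a minimal sketch:

CH_loads_list = pd.read_csv('loads_by_country.csv', usecols=['CH'], dtype=str).dropna().squeeze().to_list()
swiss_loads = pd.read_csv('loads_2016_1.csv', usecols=CH_loads_list)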

Averaging over time

This second example shows how to change the time resolution of the series. Suppose that we are interested in all the loads from a given table, which by default have a one-hour resolution:

hourly_loads = pd.read_csv('loads_2018_3.csv')

To get a daily average of the loads, we can use:

daily_loads = hourly_loads.groupby([t // 24 for t in range(24 * 364)]).mean()

This results in series of length 364. To average further over entire weeks and get series of length 52, we use:

weekly_loads = hourly_loads.groupby([t // (24 * 7) for t in range(24 * 364)]).mean()
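
Relatedly, since the yearly series are periodic (see the remark at the start of this Usage section), a time window that wraps around the end of the year can be built modulo the series length, for instance (start hour chosen arbitrarily for illustration):

start = 8700  # arbitrary start hour near the end of the 8736-hour year
idx = [t % (24 * 364) for t in range(start, start + 24 * 7)]  # one-week window, wrapping around
wrapped_week = hourly_loads.iloc[idx].reset_index(drop=True)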

Source code

The code used to generate the dataset is freely available at https://github.com/GeeeHesso/PowerData. It consists of two packages and several documentation notebooks. The first package, written in Python, provides functions to handle the data and to generate synthetic series based on historical data. The second package, written in Julia, is used to perform the optimal power flow. The documentation, in the form of Jupyter notebooks, contains numerous examples of how to use both packages. The entire workflow used to create this dataset is also provided, starting from raw ENTSO-E data files and ending with the synthetic dataset given in the repository.

Funding

This work was supported by the Cyber-Defence Campus of armasuisse and by an internal research grant of the Engineering and Architecture domain of HES-SO.
