Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Synthetic electronic health record data accompanying the paper "Synthesizing High-dimensional Longitudinal Electronic Health Records via Hierarchical Autoregressive Language Model"
Overview
This is the data archive for the paper "Copula-based synthetic data augmentation for machine-learning emulators". It contains the paper's model outputs (see the results folder) and the Singularity image for (optionally) re-running the experiments.
For the Python tool used to generate synthetic data, please refer to Synthia.
Requirements
Although PBS is not a strict requirement, it is required to run the helper scripts included in this repository. Please note that, depending on your specific system settings and resource availability, you may need to modify the PBS parameters at the top of the submit scripts stored in the hpc directory (e.g. #PBS -l walltime=72:00:00).
Usage
To reproduce the results from the experiments described in the paper, first fit all copula models to the reduced NWP-SAF dataset with:
qsub hpc/fit.sh
Then, to generate synthetic data, run all machine-learning model configurations, and compute the relevant statistics, use:
qsub hpc/stats.sh
qsub hpc/ml_control.sh
qsub hpc/ml_synth.sh
Finally, to plot all artifacts included in the paper, use:
qsub hpc/plot.sh
Licence
Code released under MIT license. Data from the reduced NWP-SAF dataset released under CC BY 4.0.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset access for the paper: A Large-scale Synthetic Pathological Dataset for Deep Learning-enabled Segmentation of Breast Cancer
https://spdx.org/licenses/etalab-2.0.html
This repository contains the dataset and code used to generate the synthetic dataset described in the paper "Usefulness of synthetic datasets for diatom automatic detection using a deep-learning approach".
Dataset: the dataset consists of two components, individual diatom images extracted from publicly available diatom atlases [1,2,3] and individual debris images.
- Individual diatom images: currently, the repository covers 166 diatom species, totalling 9,230 images. These images were automatically extracted from the atlases using PDF scraping, then cleaned and verified by diatom taxonomists. The subfolders within each diatom species indicate the origin of the images: RA [1], IDF [2], BRG [3]. Additional diatom species and images will be added to the repository regularly.
- Individual debris images: the debris images were extracted from real microscopy images. The repository contains 600 debris objects.
Code: contains the code used to generate synthetic microscopy images. For details on how to use the code, refer to the README file available in synthetic_data_generator/.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The rise of artificial intelligence (AI) and in particular modern machine learning (ML) algorithms during the last decade has been met with great interest in the agricultural industry. While undisputedly powerful, their main drawback remains the need for sufficient and diverse training data. The collection of real datasets and their annotation are the main cost drivers of ML developments, and while promising results on synthetically generated training data have been shown, their generation is not without difficulties on their own. In this paper, we present a development model for the iterative, cost-efficient generation of synthetic training data. Its application is demonstrated by developing a low-cost early disease detector for tomato plants (Solanum lycopersicum) using synthetic training data. A neural classifier is trained by exclusively using synthetic images, whose generation process is iteratively refined to obtain optimal performance. In contrast to other approaches that rely on a human assessment of similarity between real and synthetic data, we instead introduce a structured, quantitative approach. Our evaluation shows superior generalization results when compared to using non-task-specific real training data and a higher cost efficiency of development compared to traditional synthetic training data. We believe that our approach will help to reduce the cost of synthetic data generation in future applications.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We here provide an image dataset consisting of 1,000 synthetic whole-body bone scintigraphy scans (anterior projection) generated by a generative artificial intelligence model. This dataset consists of images representing three different clinical conditions: (1) bone uptake indicative of bone metastases, (2) cardiac uptake indicative of cardiac amyloidosis, and (3) none of the two.
The clinical condition (label) of each image is provided in the accompanying CSV file.
This synthetic dataset does not comprise real patient data. The provided synthetic images were created by a generative artificial intelligence model. The model was trained on bone scintigraphy scans (radiotracer: 99mTc-DPD) from 9,170 patients from the Vienna General Hospital collected as part of the clinical routine. The training data covered a wide range of different pathologies, scanners, and imaging protocols. Hence, the provided synthetic dataset represents real-world data without disclosing patient privacy.
More details about the dataset can be found in the corresponding paper (link added upon publication). Please cite this paper if you use the dataset.
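A minimal sketch of reading such a label file with Python's csv module; the filename and column names below are hypothetical, since the entry does not spell them out:

```python
import csv
import io

# Hypothetical example content: the actual CSV filename and column names
# are not specified in the dataset description, so "image" and "label"
# columns are assumptions for illustration only.
sample = io.StringIO(
    "image,label\n"
    "scan_0001.png,bone_metastases\n"
    "scan_0002.png,cardiac_amyloidosis\n"
    "scan_0003.png,none\n"
)

# Map each image filename to its clinical-condition label.
labels = {row["image"]: row["label"] for row in csv.DictReader(sample)}
print(labels["scan_0001.png"])  # bone_metastases
```

In practice, replace the in-memory sample with `open(...)` on the CSV file shipped with the dataset.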
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SYNTHETIC dataset to replicate the results in "Grasp Pre-shape Selection by Synthetic Training: Eye-in-hand Shared Control on the Hannes Prosthesis", accepted to IEEE/RSJ IROS 2022.
To fully reproduce the experiments, also download the REAL dataset.
To download the REAL and SYNTHETIC datasets automatically, run the script provided at the link below.
Code to replicate the results available at: https://github.com/hsp-iit/prosthetic-grasping-experiments
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/2.0/customlicense?persistentId=doi:10.7910/DVN/EXVWQY
This dataset contains Synthea synthetic patient data used in building ML models for stroke risk prediction. The ML models are used to simulate ML-enabled learning health systems (LHS). See the first LHS simulation paper published in Nature Scientific Reports. This open dataset is part of the synthetic data repository of the Open LHS project on GitHub: https://github.com/lhs-open/synthetic-data.
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/3.0/customlicense?persistentId=doi:10.7910/DVN/GD5XWE
This dataset contains Synthea synthetic patient data used in building ML models for lung cancer risk prediction. The ML models are used to simulate ML-enabled LHS. This open dataset is part of the synthetic data repository of the Open LHS project on GitHub: https://github.com/lhs-open/synthetic-data. For data source and methods, see the first ML-LHS simulation paper published in Nature Scientific Reports: https://www.nature.com/articles/s41598-022-23011-4.
Synthetic dataset of over 13,000 images of damaged and intact parcels with full 2D and 3D annotations in the COCO format. For details see our paper and for visual samples our project page.
Relevant computer vision tasks:
The dataset is for academic research use only, since it uses resources with restrictive licenses.
For a detailed description of how the resources are used, we refer to our paper and project page.
Licenses of the resources in detail:
You can use our textureless models (i.e. the obj files) of damaged parcels under CC BY 4.0 (note that this does not apply to the textures).
If you use this resource for scientific research, please consider citing:
@inproceedings{naumannParcel3DShapeReconstruction2023,
author = {Naumann, Alexander and Hertlein, Felix and D\"orr, Laura and Furmans, Kai},
title = {Parcel3D: Shape Reconstruction From Single RGB Images for Applications in Transportation Logistics},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
month = {June},
year = {2023},
pages = {4402-4412}
}
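The COCO-format annotations mentioned above can be read with standard JSON tooling. A minimal sketch, assuming only the standard COCO schema (images, annotations, categories); the file content here is fabricated for illustration, not taken from the actual dataset:

```python
import json
import io

# Fabricated COCO-style content for illustration; real usage would be
# json.load(open("annotations.json")) with the dataset's own file.
coco_json = io.StringIO(json.dumps({
    "images": [{"id": 1, "file_name": "parcel_0001.png",
                "width": 640, "height": 480}],
    "annotations": [{"id": 10, "image_id": 1, "category_id": 2,
                     "bbox": [100, 120, 200, 150]}],  # [x, y, width, height]
    "categories": [{"id": 2, "name": "damaged_parcel"}],
}))

coco = json.load(coco_json)

# Resolve category ids to human-readable names.
cat_names = {c["id"]: c["name"] for c in coco["categories"]}
for ann in coco["annotations"]:
    print(ann["image_id"], cat_names[ann["category_id"]], ann["bbox"])
```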
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context: The Bearings with Varying Degradation Behaviors data set is a synthetic data set representing the run-to-failure degradation data of rolling bearings. This data set is designed to facilitate the development and evaluation of diagnostic and prognostic methods in the context of Prognostics and Health Management (PHM). For the generation of the data set, the simulation model presented by Mauthe, Hagmeyer, and Zeiler (2025) was used. The simulation model is publicly available on GitHub.
Simulation Model: Mauthe, Hagmeyer, and Zeiler (2025) introduce a generic simulation model for generating representative run-to-failure data of rolling bearings. It is designed to address challenges in the development of data-driven diagnostic and prognostic methods, such as unbalanced or limited data availability. The model consists of three modular components: the life and fault modeling, the degradation progression simulation, and the vibration signal generation. Each module incorporates random processes to reproduce real-world variations, such as differences in bearing lives and degradation progressions under similar operating conditions. The model simulates vibration signals throughout a bearing's life, reflecting both operating and degradation conditions. As such, the versatile model enables its users to create synthetic data sets of rolling bearings tailored to specific scenarios. A more detailed description of the model can be found in the corresponding paper (see Data Set Citation).
Given Data Scenario and Specification: See the provided description file Bearings_with_Varying_Degradation_Behaviors.pdf
Task: The data set contains training and test data, consisting of run-to-failure data from 28 and 12 simulated bearings, respectively. The objective is to predict the remaining useful life (RUL) of the rolling bearings in the given test data. All runs proceed up to the same failure threshold, meaning that RUL = 0 applies at the last point in time, i.e. the last vibration measurement.
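The RUL labelling convention described above (RUL = 0 at the final measurement) can be sketched as follows; the timestamps are hypothetical and this is not the data set creators' code:

```python
def rul_targets(timestamps):
    """Return the remaining useful life for each timestamp, in the same
    time unit: time-to-failure counts down to 0 at the last measurement."""
    t_fail = timestamps[-1]
    return [t_fail - t for t in timestamps]

# Hypothetical measurement times (e.g. hours) for one run-to-failure sequence.
hours = [0, 10, 20, 30, 40]
print(rul_targets(hours))  # [40, 30, 20, 10, 0]
```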
Data Set Creator: Hochschule Esslingen – University of Applied Sciences, Institute for Technical Reliability and Prognostics (IZP), Robert-Bosch-Straße 1, 73037 Göppingen, Germany
Data Set Citation: Mauthe, F.; Hagmeyer, S.; Zeiler, P. (2025). Holistic simulation model of the temporal degradation of rolling bearings. In E. B. Abrahamsen, T. Aven, F. Bouder, R. Flage, and M. Yloenen (Eds.), Proceedings of the 35th European Safety and Reliability Conference (Accepted). Research Publishing.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Explore the Synthetic Rock Paper Scissors Dataset featuring a diverse collection of augmented images for training and testing machine learning models.
This is a dataset used to test digital twin-supported deep learning for fault diagnosis. It contains:
- A digital twin model of a robot.
- Synthetic data from the digital twin, used to train a deep learning-based fault diagnosis model.
- A real dataset collected from the physical robot to test sim-to-real performance.
Download the dataset from: https://nextcloud.centralesupelec.fr/s/7AR6aamBZNXcRM8/download
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
With deep learning becoming a more prominent approach for automatic classification of three-dimensional point cloud data, a key bottleneck is the amount of high-quality training data, especially when compared to that available for two-dimensional images. One potential solution is the use of synthetic data for pre-training networks; however, the ability of models to generalise from synthetic data to real-world data has been poorly studied for point clouds. Despite this, a huge wealth of 3D virtual environments exists which, if proven effective, can be exploited. We therefore argue that research in this domain would be hugely useful. In this paper we present SynthCity, an open dataset to help aid research. SynthCity is a 367.9M-point synthetic full-colour Mobile Laser Scanning point cloud. Every point is labelled with one of nine categories. We generate our point cloud in a typical urban/suburban environment using the Blensor plugin for Blender. See our project website http://www.synthcity.xyz or paper https://arxiv.org/abs/1907.04758 for more information.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Synthetic COVID-19 dataset, including 1,186 chest CT images. These data come from "Deep Learning for COVID-19 chest CT (computed tomography) image analysis". The deep learning model used in the paper is CycleGAN, and a classification experiment is used to test the usability of the synthetic COVID-19 dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the image files from Survey2Survey: a deep learning generative model approach for cross-survey image mapping. Please cite https://arxiv.org/abs/2011.07124 if you use this data in a publication. For more information, contact Brandon Buncher at buncher2(at)illinois.edu
--- Directory structure ---
tutorial.ipynb demonstrates how to load the image files (uploaded here as tarballs). Images were obtained from the SDSS DR16 cutout server (https://skyserver.sdss.org/dr16/en/help/docs/api.aspx) and the DES DR1 cutout server (https://des.ncsa.illinois.edu/desaccess/).
./sdss_train/ and ./des_train/ contain the original SDSS and DES images used to train the neural network (Stripe82).
./sdss_test/ and ./des_test/ contain the original SDSS and DES images used for the validation dataset (Stripe82).
./sdss_ext/ contains images from the external SDSS dataset (SDSS images without a DES counterpart, outside Stripe82).
./cae/ and ./cyclegan/ contain images generated by the CAE and CycleGAN, respectively. Within each, train_decoded/ and test_decoded/ contain reconstructions of the images from the training and test datasets, respectively, and external_decoded/ contains the DES-like reconstructions of SDSS objects from the external dataset (outside Stripe82).
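As a small illustration of working with the tarballs using Python's standard tarfile module (the member names here are hypothetical; tutorial.ipynb in the archive shows the intended workflow):

```python
import io
import tarfile

# Build a tiny in-memory tarball standing in for one of the dataset
# archives; the member name and contents are fabricated for illustration.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    data = b"fake image bytes"
    info = tarfile.TarInfo(name="sdss_train/obj_0001.png")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

# Reading side: list members, as one would with the downloaded tarballs.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r:gz") as tar:
    names = tar.getnames()
print(names)  # ['sdss_train/obj_0001.png']
```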
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset used in this study is publicly available for research purposes. If you are using this dataset, please cite the following paper, which outlines the complete details of the dataset and the methodology used for its generation:
Amit Karamchandani, Javier Núñez, Luis de-la-Cal, Yenny Moreno, Alberto Mozo, Antonio Pastor, "On the Applicability of Network Digital Twins in Generating Synthetic Data for Heavy Hitter Discrimination," under submission.
This is a synthetic dataset generated to differentiate between benign and malicious heavy hitter (HH) flows within complex network environments. Heavy hitter flows, which include high-volume data transfers, can significantly impact network performance, leading to congestion and degraded quality of service. Distinguishing legitimate heavy hitter activity from malicious Distributed Denial-of-Service (DDoS) traffic is critical for network management and security, yet existing datasets lack the granularity needed to train machine learning models to make this distinction effectively.
To address this, a Network Digital Twin (NDT) approach was utilized to emulate realistic network conditions and traffic patterns, enabling automated generation of labeled data for both benign and malicious HH flows alongside regular traffic.
The feature set includes flow statistics commonly used in network analysis, such as:
Traffic protocol type,
Flow duration (the time between the initial and final packet in both directions),
Total count of payload packets transmitted in both directions,
Cumulative bytes transmitted in both directions,
Time discrepancy between the first packet observations at the source and destination,
Packet and byte transmission rates per second within each interval, and
Total packet and byte counts within each interval in both directions.
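A sketch of how a few of the listed flow statistics could be computed from per-packet records; the record format (timestamp in seconds, payload bytes, direction) is an assumption for illustration, not the authors' pipeline:

```python
# Hypothetical per-packet records for one flow:
# (timestamp_seconds, payload_bytes, direction).
packets = [
    (0.00, 400, "fwd"),
    (0.05, 1200, "bwd"),
    (0.90, 800, "fwd"),
]

# Flow duration: time between the first and last packet, both directions.
duration = packets[-1][0] - packets[0][0]
# Cumulative bytes transmitted in both directions.
total_bytes = sum(size for _, size, _ in packets)
# Count of payload packets in the forward direction.
pkts_fwd = sum(1 for _, _, d in packets if d == "fwd")
# Byte transmission rate per second over the flow's lifetime.
byte_rate = total_bytes / duration if duration else 0.0

print(duration, total_bytes, pkts_fwd, byte_rate)
```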
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a dataset of 637 journal papers applying neural networks to various tasks in seismology, spanning January 1988 to January 2022. The dataset mainly includes peer-reviewed papers and does not contain duplicated works. It follows a hierarchical classification of papers based on seismological tasks (i.e. category, sub_category_I, sub_category_II, task, and sub_task). For each paper the following information is provided: 1) first author's last name, 2) publication year, 3) paper's title, 4) journal's name, 5) machine learning method used, 6) the type of neural network used, 7) the name of the neural network architecture, 8) the number of neurons/kernels in each hidden layer, 9) the type of training process (supervised, semi-supervised, etc.), 10) input data into the network, 11) output data, 12) data domain (time, frequency, feature, etc.), 13) the type of data used for training (synthetic or real), 14) the size of the training set, 15) the metrics used to measure performance, 16) performance scores, 17) the baseline method used for evaluation, and 18) a short note summarizing the paper's objective, approach, and significance.
An updated version of the dataset can be found here: https://smousavi05.github.io/dl_seismology/ and here: https://github.com/smousavi05/dl_seismology/tree/main/docs.
A continuously updated glossary of seismological tasks and relevant machine learning techniques and papers is provided here: https://smousavi05.gitbook.io/mlseismology/
This dataset contains the capsicum NIR+RGB dataset used in our paper "deepNIR: Dataset for generating synthetic NIR images and improved fruit detection system using deep learning techniques". Please refer to http://tiny.one/deepNIR for more details.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Supporting datasets (iid and ood) used in the evaluation experiments of the paper "Why did AI get this one wrong? - tree-based explanations of machine learning model predictions" by Parimbelli, Buonocore, Nicora, Michalowski, Wilk and Bellazzi.