86 datasets found
  1. Surgical-Synthetic-Data-Generation-and-Segmentation

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 16, 2025
    Cite
    Leoncini, Pietro (2025). Surgical-Synthetic-Data-Generation-and-Segmentation [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14671905
    Dataset updated
    Jan 16, 2025
    Authors
    Leoncini, Pietro
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains synthetic and real images, with their labels, for Computer Vision in robotic surgery. It is part of ongoing research on sim-to-real applications in surgical robotics. The dataset will be updated with further details and references once the related work is published. For further information see the repository on GitHub: https://github.com/PietroLeoncini/Surgical-Synthetic-Data-Generation-and-Segmentation

  2. MOSTLY AI Prize Data

    • kaggle.com
    zip
    Updated May 16, 2025
    + more versions
    Cite
    ivonaK (2025). MOSTLY AI Prize Data [Dataset]. https://www.kaggle.com/datasets/ivonav/mostly-ai-prize-data/code
    Available download formats: zip (9871594 bytes)
    Dataset updated
    May 16, 2025
    Authors
    ivonaK
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    Competition

    • Generate the BEST tabular synthetic data and win 100,000 USD in cash.
    • Competition runs for 50 days: May 14 - July 3, 2025.
    • MOSTLY AI Prize

    This competition features two independent synthetic data challenges that you can join separately:

    • The FLAT DATA Challenge
    • The SEQUENTIAL DATA Challenge

    For each challenge, generate a dataset with the same size and structure as the original, capturing its statistical patterns — but without being significantly closer to the (released) original samples than to the (unreleased) holdout samples.
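
    For intuition, the "not significantly closer to the original than to the holdout" criterion resembles a distance-to-closest-record (DCR) check. Below is a minimal Python sketch under assumed conditions (purely numeric columns, invented file names); the official scoring is done by the Synthetic Data Quality Assurance toolkit, not this code:

    import numpy as np
    import pandas as pd
    from sklearn.neighbors import NearestNeighbors

    def dcr(synthetic: pd.DataFrame, reference: pd.DataFrame) -> np.ndarray:
        """Distance from each synthetic row to its closest reference row."""
        nn = NearestNeighbors(n_neighbors=1).fit(reference.values)
        distances, _ = nn.kneighbors(synthetic.values)
        return distances.ravel()

    syn = pd.read_csv("submission.csv")   # invented file names
    train = pd.read_csv("train.csv")      # released original samples
    holdout = pd.read_csv("holdout.csv")  # unreleased in the real competition

    # A good submission is not systematically closer to train than to holdout.
    print("median DCR to train:  ", np.median(dcr(syn, train)))
    print("median DCR to holdout:", np.median(dcr(syn, holdout)))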

    Train a generative model that generalizes well, using any open-source tools (Synthetic Data SDK, synthcity, reprosyn, etc.) or your own solution. Submissions must be fully open-source, reproducible, and runnable within 6 hours on a standard machine.

    Timeline

    • Submissions open: May 14, 2025, 15:30 UTC
    • Submission credits: 3 per calendar week (+bonus)
    • Submissions close: July 3, 2025, 23:59 UTC
    • Evaluation of Leaders: July 3 - July 9
    • Winners announced: on July 9 🏆

    Datasets

    • Flat Data: 100,000 records; 80 data columns (60 numeric, 20 categorical)
    • Sequential Data: 20,000 groups of 5-10 records each; 10 data columns (7 numeric, 3 categorical)

    Evaluation

    • CSV submissions are parsed using pandas.read_csv() and checked for expected structure & size (see the sketch after this list)
    • Evaluated using the Synthetic Data Quality Assurance toolkit
    • Compared against the released training set and a hidden holdout set (same size, non-overlapping, from the same source)
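
    A minimal sketch of that structural check, assuming invented file names and the published Flat Data shape (100,000 rows, 80 columns):

    import pandas as pd

    EXPECTED_ROWS = 100_000   # from the Flat Data spec above
    EXPECTED_COLS = 80

    submission = pd.read_csv("flat_submission.csv")  # invented file name
    assert submission.shape == (EXPECTED_ROWS, EXPECTED_COLS), (
        f"expected {(EXPECTED_ROWS, EXPECTED_COLS)}, got {submission.shape}")

    # Column names should match the released training data exactly.
    train = pd.read_csv("flat_train.csv")
    assert list(submission.columns) == list(train.columns), "column mismatch"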

    Submission

    MOSTLY AI Prize

    Citation

    If you use this dataset in your research, please cite:

    @dataset{mostlyaiprize,
      author = {MOSTLY AI},
      title  = {MOSTLY AI Prize Dataset},
      year   = {2025},
      url    = {https://www.mostlyaiprize.com/},
    }
    
  3. Synthetic datasets of the UK Biobank cohort

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, pdf, zip
    Updated Sep 17, 2025
    Cite
    Antonio Gasparrini; Jacopo Vanoli (2025). Synthetic datasets of the UK Biobank cohort [Dataset]. http://doi.org/10.5281/zenodo.13983170
    Available download formats: bin, csv, zip, pdf
    Dataset updated
    Sep 17, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Antonio Gasparrini; Jacopo Vanoli
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository stores synthetic datasets derived from the database of the UK Biobank (UKB) cohort.

    The datasets were generated for illustrative purposes, in particular for reproducing specific analyses on the health risks associated with long-term exposure to air pollution using the UKB cohort. The code used to create the synthetic datasets is available and documented in a related GitHub repo, with details provided in the section below. These datasets can be freely used for code testing and for illustrating other examples of analyses on the UKB cohort.

    The synthetic data have been used so far in two analyses described in related peer-reviewed publications, which also provide information about the original data sources:

    • Vanoli J, et al. Long-term associations between time-varying exposure to ambient PM2.5 and mortality: an analysis of the UK Biobank. Epidemiology. 2025;36(1):1-10. DOI: 10.1097/EDE.0000000000001796 [freely available here, with code provided in this GitHub repo]
    • Vanoli J, et al. Confounding issues in air pollution epidemiology: an empirical assessment with the UK Biobank cohort. International Journal of Epidemiology. 2025;54(5):dyaf163. DOI: 10.1093/ije/dyaf163 [freely available here, with code provided in this GitHub repo]

    Note: while the synthetic versions of the datasets resemble the real ones in several aspects, users should be aware that these data are fake and must not be used to test or draw inferences about specific research hypotheses. Even more importantly, these data cannot be considered a reliable description of the original UKB data, and they must not be presented as such.

    The work was supported by the Medical Research Council-UK (Grant ID: MR/Y003330/1).

    Content

    The series of synthetic datasets (stored in two versions with csv and RDS formats) are the following:

    • synthbdcohortinfo: basic cohort information regarding the follow-up period and birth/death dates for 502,360 participants.
    • synthbdbasevar: baseline variables, mostly collected at recruitment.
    • synthpmdata: annual average exposure to PM2.5 for each participant reconstructed using their residential history.
    • synthoutdeath: death records that occurred during the follow-up with date and ICD-10 code.

    In addition, this repository provides these additional files:

    • codebook: a pdf file with a codebook for the variables of the various datasets, including references to the fields of the original UKB database.
    • asscentre: a csv file with information on the assessment centres used for recruitment of the UKB participants, including code, names, and location (as northing/easting coordinates of the British National Grid).
    • Countries_December_2022_GB_BUC: a zip file including the shapefile defining the boundaries of the countries in Great Britain (England, Wales, and Scotland), used for mapping purposes [source].

    Generation of the synthetic data

    The datasets resemble the real data used in the analysis, and they were generated using the R package synthpop (www.synthpop.org.uk). The generation process involves two steps, namely the synthesis of the main data (cohort info, baseline variables, annual PM2.5 exposure) and then the sampling of death events. The R scripts for performing the data synthesis are provided in the GitHub repo (subfolder Rcode/synthcode).

    The first part merges all the data, including the annual PM2.5 levels, into a single wide-format dataset (with a row for each subject), generates a synthetic version, adds fake IDs, and then extracts (and reshapes) the single datasets. In the second part, a Cox proportional hazard model is fitted on the original data to estimate risks associated with various predictors (including the main exposure represented by PM2.5), and then these relationships are used to simulate death events in each year. Details on the modelling aspects are provided in the article.
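
    The actual pipeline is the R code in the Rcode/synthcode subfolder of the GitHub repo; purely for intuition, the logic of the second step can be sketched in Python, with every column name and coefficient below invented rather than taken from the fitted Cox model:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(42)

    # Invented stand-in for the synthetic wide-format data (one row per subject).
    n = 1000
    cohort = pd.DataFrame({
        "age": rng.normal(56, 8, n),
        "pm25": rng.normal(10, 2, n),   # annual-average PM2.5 exposure
    })

    # Invented log-hazard coefficients standing in for the fitted Cox model.
    beta = {"age": 0.08, "pm25": 0.05}
    baseline_yearly_hazard = 0.002

    lin_pred = sum(beta[c] * (cohort[c] - cohort[c].mean()) for c in beta)
    hazard = baseline_yearly_hazard * np.exp(lin_pred.to_numpy())

    # Simulate death events year by year over the follow-up.
    years = 10
    alive = np.ones(n, dtype=bool)
    death_year = np.full(n, -1)
    for year in range(years):
        dies = alive & (rng.random(n) < hazard)
        death_year[dies] = year
        alive &= ~dies

    print(f"{(death_year >= 0).sum()} simulated deaths over {years} years")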

    This process guarantees that the synthetic data do not hold specific information about the original records, thus preserving confidentiality. At the same time, the multivariate distribution and correlation across variables, as well as the mortality risks, resemble those of the original data, so the results of descriptive and inferential analyses are similar to those in the original assessments. However, as noted above, the data are used only for illustrative purposes, and they must not be used to test other research hypotheses.

  4. Nominal and adversarial synthetic PMU data for standard IEEE test systems

    • osti.gov
    Updated Jun 15, 2021
    Cite
    Pacific Northwest National Laboratory 2 (2021). Nominal and adversarial synthetic PMU data for standard IEEE test systems [Dataset]. http://doi.org/10.25584/DataHub/1788186
    Dataset updated
    Jun 15, 2021
    Dataset provided by
    Pacific Northwest National Laboratory (PNNL), US
    Description

    GridSTAGE (Spatio-Temporal Adversarial scenario GEneration) is a framework for the simulation of adversarial scenarios and the generation of multivariate spatio-temporal data in cyber-physical systems. GridSTAGE is developed in Matlab and leverages the Power System Toolbox (PST), where the evolution of the power network is governed by nonlinear differential equations. Using GridSTAGE, one can create several event scenarios that correspond to different operating states of the power network by enabling or disabling any of the following: faults, AGC control, PSS control, exciter control, load changes, generation changes, and different types of cyber-attacks. Standard IEEE bus system data is used to define the power system environment. GridSTAGE emulates the data from PMU and SCADA sensors; the sampling rate and the location of the sensors can be adjusted as well. Detailed instructions on generating data scenarios with different system topologies, attack characteristics, load characteristics, sensor configurations, and control parameters are available in the GitHub repository: https://github.com/pnnl/GridSTAGE.

    There is no existing adversarial data-generation framework that can incorporate several attack characteristics and yield adversarial PMU data. The GridSTAGE framework currently supports simulation of false data injection attacks (such as ramp, step, random, trapezoidal, multiplicative, replay, and freezing attacks) and denial-of-service attacks (such as time-delay and packet-loss) on PMU data. Furthermore, it supports generating spatio-temporal time-series data corresponding to several random load changes across the network or to several generation changes.

    A Koopman mode decomposition (KMD) based algorithm to detect and identify false data attacks in real time is proposed in https://ieeexplore.ieee.org/document/9303022. Machine learning-based predictive models are developed to capture the dynamics of the underlying power system with a high level of accuracy under various operating conditions for the IEEE 68-bus system. The corresponding machine learning models are available at https://github.com/pnnl/grid_prediction.
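
    For intuition, a ramp-style false data injection on a PMU measurement stream can be sketched in a few lines. GridSTAGE itself is Matlab/PST-based; this Python fragment only illustrates the attack shape, with all signal and attack parameters invented:

    import numpy as np

    fs = 30                                      # PMU reporting rate, samples/s
    t = np.arange(0, 10, 1 / fs)                 # 10 s window
    freq = 60 + 0.005 * np.random.randn(t.size)  # nominal 60 Hz measurement

    attacked = freq.copy()
    start, end = 4.0, 7.0                        # attack window, seconds
    mask = (t >= start) & (t < end)
    ramp_rate = 0.02                             # Hz/s, attacker-chosen
    attacked[mask] += ramp_rate * (t[mask] - start)

    print("max injected bias:", np.max(attacked - freq))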

  5. MatSim Dataset and benchmark for one-shot visual materials and textures...

    • zenodo.org
    • data.niaid.nih.gov
    pdf, zip
    Updated Jun 25, 2025
    Cite
    Manuel S. Drehwald; Sagi Eppel; Jolina Li; Han Hao; Alan Aspuru-Guzik (2025). MatSim Dataset and benchmark for one-shot visual materials and textures recognition [Dataset]. http://doi.org/10.5281/zenodo.7390166
    Available download formats: zip, pdf
    Dataset updated
    Jun 25, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Manuel S. Drehwald; Sagi Eppel; Jolina Li; Han Hao; Alan Aspuru-Guzik
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The MatSim Dataset and benchmark

    Latest version

    Synthetic dataset and real images benchmark for visual similarity recognition of materials and textures.

    MatSim: a synthetic dataset, a benchmark, and a method for computer vision-based recognition of similarities and transitions between materials and textures focusing on identifying any material under any conditions using one or a few examples (one-shot learning).

    Based on the paper: One-shot recognition of any material anywhere using contrastive learning with physics-based rendering
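
    As a rough illustration of the one-shot setup the benchmark targets, recognition reduces to nearest-neighbour matching in an embedding space; the embed function below is a placeholder for the paper's contrastively trained network, not the released model:

    import numpy as np

    def embed(image: np.ndarray) -> np.ndarray:
        """Placeholder for a trained descriptor net; returns a unit vector."""
        v = image.mean(axis=(0, 1))          # trivial stand-in featurisation
        return v / np.linalg.norm(v)

    def classify(query, references):
        """Assign the query to the reference material with highest cosine similarity."""
        q = embed(query)
        sims = {name: float(q @ embed(img)) for name, img in references.items()}
        return max(sims, key=sims.get)

    refs = {"copper": np.random.rand(64, 64, 3),   # one example per material
            "fabric": np.random.rand(64, 64, 3)}
    print(classify(np.random.rand(64, 64, 3), refs))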

    Benchmark_MATSIM.zip: contains the benchmark made of real-world images, as described in the paper.

    MatSim_object_train_split_1,2,3.zip: contains a subset of the synthetic dataset: CGI images of materials on random objects, as described in the paper.

    MatSim_Vessels_Train_1,2,3.zip: contains a subset of the synthetic dataset: CGI images of materials inside transparent containers, as described in the paper.

    *Note: these are subsets of the dataset; the full dataset can be found at:
    https://e1.pcloud.link/publink/show?code=kZIiSQZCYU5M4HOvnQykql9jxF4h0KiC5MX

    or
    https://icedrive.net/s/A13FWzZ8V2aP9T4ufGQ1N3fBZxDF

    Code:

    Up-to-date code for generating the dataset, data readers, evaluation scripts, and trained nets can be found at: https://github.com/sagieppel/MatSim-Dataset-Generator-Scripts-And-Neural-net

    Dataset Generation Scripts.zip: contains the Blender (3.1) Python scripts used for generating the dataset. This copy might be old; up-to-date code can be found at the URL above.
    Net_Code_And_Trained_Model.zip: contains reference neural net code, including loaders, trained models, and evaluator scripts that can be used to read and train with the synthetic dataset or test the model with the benchmark. Note: the code in this ZIP file is not up to date and contains some bugs; for the latest version see the URL above.

    Further documentation can be found inside the zip files or in the paper.

  6. Credit_Card_Frauds(Synthetic Dataset)

    • kaggle.com
    zip
    Updated Apr 18, 2023
    + more versions
    Cite
    Mahesh Yadav (2023). Credit_Card_Frauds(Synthetic Dataset) [Dataset]. https://www.kaggle.com/datasets/maheshyaadav/credit-card-fraudssynthetic-dataset
    Available download formats: zip (211766720 bytes)
    Dataset updated
    Apr 18, 2023
    Authors
    Mahesh Yadav
    Description

    About the Dataset

    This is a simulated credit card transaction dataset containing legitimate and fraudulent transactions covering the period 1 Jan 2019 - 31 Dec 2020. It covers the credit cards of 1,000 customers transacting with a pool of 800 merchants.

    Source of Simulation

    This was generated using the Sparkov Data Generation tool (available on GitHub), created by Brandon Harris. The simulation was run for the period 1 Jan 2019 to 31 Dec 2020. The files were combined and converted into a standard format.

    Information about the Simulator

    I do not own the simulator. I used the one built by Brandon Harris, and to understand how it works I went through a few portions of the code. This is what I understood from what I read:

    The simulator has pre-defined lists of merchants, customers, and transaction categories. Using the Python library "faker", together with the numbers of customers and merchants you specify for the simulation, an intermediate list is created.

    After this, the transactions are created according to the profile you choose, e.g. "adults_2550_female_rural.json" (which simulates the properties of adult females, aged 25-50, from rural areas). For each profile (see "adults_2550_female_rural.json" in the Sparkov GitHub repository), parameter ranges are defined in terms of minimum and maximum transactions per day, the distribution of transactions across days of the week, and normal-distribution parameters (mean, standard deviation) for amounts in various categories. Using these distribution measures, the transactions are generated with faker.

    What I did was generate transactions across all profiles and then merged them together to create a more realistic representation of simulated transactions.
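
    A toy Python sketch of that profile-driven idea (not Sparkov's actual code; the profile values below are invented):

    import random
    from faker import Faker

    fake = Faker()
    profile = {                      # invented stand-in for a profile JSON
        "txns_per_day": (1, 4),      # min, max
        "categories": {"grocery": (55.0, 20.0), "travel": (300.0, 120.0)},
    }

    def simulate_day(customer):
        n = random.randint(*profile["txns_per_day"])
        for _ in range(n):
            category = random.choice(list(profile["categories"]))
            mean, sd = profile["categories"][category]
            yield {
                "customer": customer,
                "merchant": fake.company(),
                "category": category,
                "amount": round(max(0.5, random.gauss(mean, sd)), 2),
            }

    for txn in simulate_day(fake.name()):
        print(txn)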

    Acknowledgements - Brandon Harris for his amazing work in creating this easy-to-use simulation tool for creating fraud transaction datasets.

  7. synpat-dataset

    • huggingface.co
    Updated May 28, 2025
    Cite
    Karan Srivastava (2025). synpat-dataset [Dataset]. https://huggingface.co/datasets/Karan0901/synpat-dataset
    Dataset updated
    May 28, 2025
    Authors
    Karan Srivastava
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    SynPAT: Generating Synthetic Physical Theories with Data

    This is the Hugging Face dataset entry for SynPAT, a synthetic theory and data generation system developed for the paper "SynPAT: Generating Synthetic Physical Theories with Data" (GitHub: https://github.com/jlenchner/theorizer). SynPAT generates symbolic physical systems and corresponding synthetic data to benchmark symbolic regression and scientific discovery algorithms. Each synthetic system includes symbolic equations… See the full description on the dataset page: https://huggingface.co/datasets/Karan0901/synpat-dataset.
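
    For intuition, generating benchmark data from a symbolic system looks roughly like the sketch below; the equation and noise level are invented examples, not SynPAT's actual generated theories:

    import numpy as np
    import sympy as sp

    m, a = sp.symbols("m a")
    force = m * a                     # invented example system: F = m*a
    f = sp.lambdify((m, a), force, "numpy")

    rng = np.random.default_rng(0)
    M = rng.uniform(0.5, 5.0, 200)    # sampled variable values
    A = rng.uniform(-2.0, 2.0, 200)
    F = f(M, A) * (1 + 0.01 * rng.standard_normal(200))  # 1% noise

    data = np.column_stack([M, A, F])  # (m, a, F) rows for a regression benchmark
    print(data[:3])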

  8. internal-datasets

    • huggingface.co
    Updated Jun 1, 2023
    + more versions
    Cite
    Ivan Rivaldo Marbun (2023). internal-datasets [Dataset]. https://huggingface.co/datasets/Marbyun/internal-datasets
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jun 1, 2023
    Authors
    Ivan Rivaldo Marbun
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    SynQA is a Reading Comprehension dataset created in the work "Improving Question Answering Model Robustness with Synthetic Adversarial Data Generation" (https://aclanthology.org/2021.emnlp-main.696/). It consists of 314,811 synthetically generated questions on the passages in the SQuAD v1.1 (https://arxiv.org/abs/1606.05250) training set.

    In this work, we use synthetic adversarial data generation to make QA models more robust to human adversaries. We develop a data generation pipeline that selects source passages, identifies candidate answers, generates questions, and then finally filters or re-labels them to improve quality. Using this approach, we amplify a smaller human-written adversarial dataset into a much larger set of synthetic question-answer pairs. By incorporating our synthetic data, we improve the state-of-the-art on the AdversarialQA (https://adversarialqa.github.io/) dataset by 3.7 F1 points and improve model generalisation on nine of the twelve MRQA datasets. We further conduct a novel human-in-the-loop evaluation to show that our models are considerably more robust to new human-written adversarial examples: crowdworkers can fool our model only 8.8% of the time on average, compared to 17.6% for a model trained without synthetic data.
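
    The pipeline's shape (passage -> candidate answer -> generated question -> filter) can be sketched generically; the checkpoint names and the round-trip filtering rule below are placeholder choices, not the paper's actual components:

    from transformers import pipeline

    generator = pipeline("text2text-generation",
                         model="valhalla/t5-small-qg-hl")  # placeholder QG model
    qa_filter = pipeline("question-answering")             # round-trip filter

    passage = ("Normandy is a region in France. The Normans gave their "
               "name to Normandy in the 10th century.")
    answer = "Normandy"  # in practice, candidate answers are extracted first

    # Highlight the answer span, as T5 question-generation checkpoints expect.
    highlighted = passage.replace(answer, f"<hl> {answer} <hl>", 1)
    question = generator(f"generate question: {highlighted}")[0]["generated_text"]

    # Keep the pair only if a QA model recovers the intended answer.
    pred = qa_filter(question=question, context=passage)
    if answer.lower() in pred["answer"].lower():
        print({"question": question, "answer": answer})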

    For full details on how the dataset was created, kindly refer to the paper.

  9. Bearings with Varying Degradation Behaviors

    • kaggle.com
    zip
    Updated Jun 13, 2025
    Cite
    Prognostics @ HSE (2025). Bearings with Varying Degradation Behaviors [Dataset]. https://www.kaggle.com/datasets/prognosticshse/bearings-with-varying-degradation-behaviors
    Available download formats: zip (297945986 bytes)
    Dataset updated
    Jun 13, 2025
    Authors
    Prognostics @ HSE
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context: The Bearings with Varying Degradation Behaviors data set is a synthetic data set representing the run-to-failure degradation data of rolling bearings. This data set is designed to facilitate the development and evaluation of diagnostic and prognostic methods in the context of Prognostics and Health Management (PHM). For the generation of the data set, the simulation model presented by Mauthe, Hagmeyer, and Zeiler (2025) was used. The simulation model is publicly available on GitHub.

    Simulation Model: Mauthe, Hagmeyer, and Zeiler (2025) introduce a generic simulation model for generating representative run-to-failure data of rolling bearings. It is designed to address challenges in the development of data-driven diagnostic and prognostic methods, such as unbalanced or limited data availability. The model consists of three modular components: the life and fault modeling, the degradation progression simulation, and the vibration signal generation. Each module incorporates random processes to reproduce real-world variations, such as differences in bearing lives and degradation progressions under similar operating conditions. The model simulates vibration signals throughout a bearing's life, reflecting both operating and degradation conditions. As such, the versatile model enables its users to create synthetic data sets of rolling bearings tailored to specific scenarios. A more detailed description of the model can be found in the corresponding paper (see Data Set Citation).

    Given Data Scenario and Specification: See the provided description file Bearings_with_Varying_Degradation_Behaviors.pdf

    Task: The data set contains training and test data, consisting of run-to-failure data from 28 and 12 simulated bearings, respectively. The objective is to predict the remaining useful life (RUL) of the rolling bearings in the test data. All runs proceed up to the identical failure threshold, which means that RUL=0 applies to the last point in time and the last vibration measurement, respectively.
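
    Because every run ends exactly at the failure threshold, ground-truth RUL labels follow directly from position in the sequence; a minimal sketch with an invented run length and placeholder features:

    import numpy as np

    n_measurements = 500                    # invented length of one run
    timestamps = np.arange(n_measurements)  # measurement index over the run

    # RUL counts down to zero at the final (failure) measurement.
    rul = timestamps[-1] - timestamps
    assert rul[-1] == 0 and rul[0] == n_measurements - 1

    # Typical supervised setup: per-measurement features -> RUL regression.
    features = np.random.rand(n_measurements, 16)  # placeholder features
    training_pairs = list(zip(features, rul))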

    Data Set Creator: Hochschule Esslingen – University of Applied Sciences, Institute for Technical Reliability and Prognostics (IZP), Robert-Bosch-Straße 1, 73037 Göppingen, Germany

    Data Set Citation: Mauthe, F.; Hagmeyer, S.; Zeiler, P. (2025). Holistic simulation model of the temporal degradation of rolling bearings. Proceedings of the 35th European Safety and Reliability Conference and the 33rd Society for Risk Analysis Europe Conference, 15.06. – 19.06.2025, Stavanger, Norway, pp. 953–960, DOI: 10.3850/978-981-94-3281-3_ESREL-SRA-E2025-P8028-cd

    https://rpsonline.com.sg/proceedings/esrel-sra-e2025/html/ESREL-SRA-E2025-P8028.html

  10. synthetic-multiturn-multimodal

    • huggingface.co
    Updated Jan 28, 2024
    Cite
    Mesolitica (2024). synthetic-multiturn-multimodal [Dataset]. https://huggingface.co/datasets/mesolitica/synthetic-multiturn-multimodal
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jan 28, 2024
    Dataset authored and provided by
    Mesolitica
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Multiturn Multimodal

    We want to generate synthetic data that is able to understand position and relationships between multiple images and multiple audio clips; an example is below. All notebooks are at https://github.com/mesolitica/malaysian-dataset/tree/master/chatbot/multiturn-multimodal

    Multi-images

    synthetic-multi-images-relationship.jsonl, 100000 rows, 109MB. Images at https://huggingface.co/datasets/mesolitica/translated-LLaVA-Pretrain/tree/main

    Example data

    {'filename':… See the full description on the dataset page: https://huggingface.co/datasets/mesolitica/synthetic-multiturn-multimodal.

  11. LLM - Detect AI Datamix

    • kaggle.com
    zip
    Updated Jan 19, 2024
    Cite
    Raja Biswas (2024). LLM - Detect AI Datamix [Dataset]. https://www.kaggle.com/datasets/conjuring92/ai-mix-v26
    Available download formats: zip (172818297 bytes)
    Dataset updated
    Jan 19, 2024
    Authors
    Raja Biswas
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This is the datamix created by Team 🔍 📝 🕵️‍♂️ 🤖 during the LLM - Detect AI Generated Text competition. This dataset helped us win the competition. It facilitates a text-classification task: separating LLM-generated essays from student-written ones.

    It was developed incrementally, focusing on size, diversity, and complexity. For each datamix iteration, we attempted to plug blind spots of the previous generation of models while maintaining robustness.

    To maximally leverage in-domain human texts, we used the entire PERSUADE corpus, comprising all 15 prompts. We also included diverse human texts from sources such as the OpenAI GPT-2 output dataset, the ELLIPSE corpus, NarrativeQA, Wikipedia, the NLTK Brown corpus, and IMDB movie reviews.

    Sources for our generated essays can be grouped under four categories:

    • Proprietary LLMs (gpt-3.5, gpt-4, claude, cohere, gemini, palm)
    • Open-source LLMs (llama, falcon, mistral, mixtral)
    • Existing LLM-generated text datasets:
      • Synthetic dataset made by T5
      • DAIGT V2 subset
      • OUTFOX
      • Ghostbuster
      • gpt-2-output-dataset
    • Fine-tuned open-source LLMs (mistral, llama, falcon, deci-lm, t5, pythia, OPT, BLOOM, GPT2). For LLM fine-tuning, we leveraged the PERSUADE corpus in different ways:
      • Instruction tuning: Instructions were composed of different metadata e.g. prompt name, holistic essay score, ELL status and grade level. Responses were the corresponding student essays.
      • One topic held out: LLMs fine-tuned on PERSUADE essays with one prompt held out. When generating, only the held out prompt essays were generated. This was done to encourage new writing styles.
      • Span wise generation: Generate one span (discourse) at a time conditioned on the remaining essay.

    We used a wide variety of generation configs and prompting strategies to promote diversity & complexity in the data. Generated essays leveraged a combination of the following (see the decoding sketch after this list):

    • Contrastive search
    • Use of guidance scale, typical_p, suppress_tokens
    • High temperature & large values of top-k
    • Prompting to fill-in-the-blank: randomly mask words in an essay and ask the LLM to reconstruct the original essay (similar to MLM)
    • Prompting without source texts
    • Prompting with source texts
    • Prompting to rewrite existing essays
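
    A hedged sketch of what such decoding settings look like with the Hugging Face generate API; the model and all parameter values are illustrative, not the team's exact configs:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")          # placeholder model
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prompt = "Write an essay about the benefits of school uniforms."
    inputs = tok(prompt, return_tensors="pt")

    # Contrastive search: penalty_alpha together with a small top_k.
    contrastive = model.generate(**inputs, penalty_alpha=0.6, top_k=4,
                                 max_new_tokens=200)

    # High-temperature sampling with large top-k and typical_p; suppressing
    # the EOS token forces long continuations.
    sampled = model.generate(**inputs, do_sample=True, temperature=1.4,
                             top_k=500, typical_p=0.9,
                             suppress_tokens=[tok.eos_token_id],
                             max_new_tokens=200)

    print(tok.decode(sampled[0], skip_special_tokens=True))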

    Finally, we incorporated augmented essays to make our models aware of typical attacks on LLM content detection systems and of obfuscations present in the provided training data. We mainly used a combination of the following augmentations on a random subset of essays:

    • Spelling correction
    • Deletion/insertion/swapping of characters
    • Replacement with synonyms
    • Introduced obfuscations
    • Back-translation
    • Random capitalization
    • Sentence swapping

  12. Unmet Risk Index Dataset

    • data.niaid.nih.gov
    Updated Aug 15, 2023
    Cite
    Jeanson, Francis; Farkouh, Michael E.; Godoy, Lucas C.; Minha, Sa'ar; Tzuman, Oran; Marcus, Gil (2023). Unmet Risk Index Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8241871
    Dataset updated
    Aug 15, 2023
    Dataset provided by
    Department of Cardiology, Shamir Medical Center, Zeriffin, Israel; and Sackler School of Medicine, Tel-Aviv University, Ramat-Aviv, Israel
    Datadex Inc., Toronto, Canada
    Peter Munk Cardiac Centre and Heart and Stroke Richard Lewar Centre, University of Toronto, Toronto, Ontario, Canada
    Authors
    Jeanson, Francis; Farkouh, Michael E.; Godoy, Lucas C.; Minha, Sa'ar; Tzuman, Oran; Marcus, Gil
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the generated profiles from the combination of the ASCVD and SMART risk calculators. In addition, the Unmet Risk Index value is included at the end of each data row. This data was used in the research paper titled "Medical calculators derived synthetic patients: a novel method for generation of synthetic patient data", currently available as a preprint.

    The code used to generate these profiles is available on GitHub at: https://github.com/FrancisJMR/unmet-risk-index
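
    Purely for intuition, calculator-derived synthetic patients can be sketched as a sweep over plausible risk-factor combinations with a score attached to each profile; the ranges and the toy_risk function below are invented placeholders, not the ASCVD or SMART equations:

    import itertools
    import math

    ages = range(40, 80, 10)
    sbp = range(100, 180, 20)        # systolic blood pressure, mmHg
    smoker = (0, 1)

    def toy_risk(age, bp, smoke):
        """Invented logistic-style score standing in for a real calculator."""
        z = -8.0 + 0.08 * age + 0.02 * bp + 0.7 * smoke
        return 1 / (1 + math.exp(-z))

    profiles = [
        {"age": a, "sbp": b, "smoker": s, "risk": round(toy_risk(a, b, s), 4)}
        for a, b, s in itertools.product(ages, sbp, smoker)
    ]
    print(len(profiles), profiles[0])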

  13. aditi-syn-v1

    • huggingface.co
    Updated Mar 28, 2024
    Cite
    Manish Prakash (2024). aditi-syn-v1 [Dataset]. https://huggingface.co/datasets/manishiitg/aditi-syn-v1
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Mar 28, 2024
    Authors
    Manish Prakash
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    v1 of the synthetic dataset generated for the aditi model. Generation scripts are located at https://github.com/manishiitg/aditi_dataset/tree/main/gen

  14. SOD Synthetic Forecast Generation Dataset

    • hydroshare.org
    zip
    Updated Oct 31, 2025
    Cite
    Zachary Paul Brodeur (2025). SOD Synthetic Forecast Generation Dataset [Dataset]. http://doi.org/10.4211/hs.833b01b4c0ee47378fd1eac7ba17ace4
    Available download formats: zip (223.3 MB)
    Dataset updated
    Oct 31, 2025
    Dataset provided by
    HydroShare
    Authors
    Zachary Paul Brodeur
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Mar 2, 1912 - Sep 30, 2024
    Description

    Pre-processed subset of raw HEFS hindcast data for Seven Oaks Dam (SOD), configured for compatibility with the repository structure of the version 1 and version 2 synthetic forecast models contained here: https://github.com/zpb4/Synthetic-Forecast-v1-FIRO-DISES and here: https://github.com/zpb4/Synthetic-Forecast-v2-FIRO-DISES. The data are pre-structured for the repository setup, and README files in both GitHub repos include instructions on how to set up the data contained in this resource.

    Contains HEFS hindcast .csv files and observed full-natural-flow .csv files for the following site:

    • SRWC1 - main reservoir inflow to Seven Oaks Dam

    Note: The zipped file contains some R scripts that were used to pre-process the raw data. They do not interact with the GitHub scripts referenced above and can be discarded. All the information in the raw data is contained within the zipped files; it has simply been converted to a standardized format for compatibility with the synthetic forecast generation codebase.

  15. Data from: GHTraffic: A Dataset for Reproducible Research in...

    • zenodo.org
    zip
    Updated Aug 29, 2020
    Cite
    Thilini Bhagya; Jens Dietrich; Hans Guesgen (2020). GHTraffic: A Dataset for Reproducible Research in Service-Oriented Computing [Dataset]. http://doi.org/10.5281/zenodo.3748921
    Available download formats: zip
    Dataset updated
    Aug 29, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Thilini Bhagya; Jens Dietrich; Hans Guesgen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the latest version of the GHTraffic project. The main aim is to model a variety of transaction sequences to reflect more complex service behaviour.

    This version consists of a single edition collected from the google/guava repository.

    The entire data generation process is quite similar to the original GHTraffic design, but it incorporates minor changes: after a resource is successfully posted, a random later date is used to construct the request and response for all of the HTTP methods, and yet another subset of unsuccessful transactions is added by issuing requests before resource creation has succeeded.

    This results in a far more dynamic series of transactions to named resources.
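
    A toy sketch of that kind of record (field names invented, not GHTraffic's actual schema): requests dated before the resource's creation produce unsuccessful (404) transactions, while requests after it succeed:

    import json
    import random
    from datetime import datetime, timedelta

    created_at = datetime(2020, 1, 1) + timedelta(days=random.randint(0, 365))

    def transaction(method, when):
        exists = when >= created_at
        ok_status = {"GET": 200, "PUT": 200, "DELETE": 204}[method]
        return {
            "timestamp": when.isoformat(),
            "request": {"method": method, "uri": "/issues/42"},
            "response": {"status": ok_status if exists else 404},
        }

    log = [transaction(random.choice(["GET", "PUT", "DELETE"]),
                       created_at + timedelta(days=random.randint(-30, 30)))
           for _ in range(5)]
    print(json.dumps(log, indent=2))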

    Scripts used for dataset construction are accessible from the repository.

  16. replicAnt - Plum2023 - Detection & Tracking Datasets and Trained Networks

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Apr 21, 2023
    + more versions
    Cite
    Plum, Fabian; Bulla, René; Beck, Hendrik; Imirzian, Natalie; Labonte, David (2023). replicAnt - Plum2023 - Detection & Tracking Datasets and Trained Networks [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7849416
    Dataset updated
    Apr 21, 2023
    Dataset provided by
    The Pocket Dimension, Munich
    Imperial College London
    Authors
    Plum, Fabian; Bulla, René; Beck, Hendrik; Imirzian, Natalie; Labonte, David
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains all recorded and hand-annotated data, all synthetically generated data, and representative trained networks used for detection and tracking experiments in the manuscript "replicAnt - generating annotated images of animals in complex environments using Unreal Engine". Unless stated otherwise, all 3D animal models used in the synthetically generated data were created with the open-source photogrammetry platform scAnt (peerj.com/articles/11155/). All synthetic data were generated with the associated replicAnt project, available from https://github.com/evo-biomech/replicAnt.

    Abstract:

    Deep learning-based computer vision methods are transforming animal behavioural research. Transfer learning has enabled work in non-model species, but still requires hand-annotation of example footage, and is only performant in well-defined conditions. To overcome these limitations, we created replicAnt, a configurable pipeline implemented in Unreal Engine 5 and Python, designed to generate large and variable training datasets on consumer-grade hardware instead. replicAnt places 3D animal models into complex, procedurally generated environments, from which automatically annotated images can be exported. We demonstrate that synthetic data generated with replicAnt can significantly reduce the hand-annotation required to achieve benchmark performance in common applications such as animal detection, tracking, pose-estimation, and semantic segmentation; and that it increases the subject-specificity and domain-invariance of the trained networks, so conferring robustness. In some applications, replicAnt may even remove the need for hand-annotation altogether. It thus represents a significant step towards porting deep learning-based computer vision tools to the field.

    Benchmark data

    Two video datasets were curated to quantify detection performance; one in laboratory and one in field conditions. The laboratory dataset consists of top-down recordings of foraging trails of Atta vollenweideri (Forel 1893) leaf-cutter ants. The colony was collected in Uruguay in 2014, and housed in a climate chamber at 25°C and 60% humidity. A recording box was built from clear acrylic, and placed between the colony nest and a box external to the climate chamber, which functioned as feeding site. Bramble leaves were placed in the feeding area prior to each recording session, and ants had access to the recording area at will. The recorded area was 104 mm wide and 200 mm long. An OAK-D camera (OpenCV AI Kit: OAK-D, Luxonis Holding Corporation) was positioned centrally 195 mm above the ground. While keeping the camera position constant, lighting, exposure, and background conditions were varied to create recordings with variable appearance: The “base” case is an evenly lit and well exposed scene with scattered leaf fragments on an otherwise plain white backdrop. A “bright” and “dark” case are characterised by systematic over- or underexposure, respectively, which introduces motion blur, colour-clipped appendages, and extensive flickering and compression artefacts. In a separate well exposed recording, the clear acrylic backdrop was substituted with a printout of a highly textured forest ground to create a “noisy” case. Last, we decreased the camera distance to 100 mm at constant focal distance, effectively doubling the magnification, and yielding a “close” case, distinguished by out-of-focus workers. All recordings were captured at 25 frames per second (fps).

    The field datasets consists of video recordings of Gnathamitermes sp. desert termites, filmed close to the nest entrance in the desert of Maricopa County, Arizona, using a Nikon D850 and a Nikkor 18-105 mm lens on a tripod at camera distances between 20 cm to 40 cm. All video recordings were well exposed, and captured at 23.976 fps.

    Each video was trimmed to the first 1000 frames, and contains between 36 and 103 individuals. In total, 5000 and 1000 frames were hand-annotated for the laboratory- and field-dataset, respectively: each visible individual was assigned a constant size bounding box, with a centre coinciding approximately with the geometric centre of the thorax in top-down view. The size of the bounding boxes was chosen such that they were large enough to completely enclose the largest individuals, and was automatically adjusted near the image borders. A custom-written Blender Add-on aided hand-annotation: the Add-on is a semi-automated multi animal tracker, which leverages blender’s internal contrast-based motion tracker, but also include track refinement options, and CSV export functionality. Comprehensive documentation of this tool and Jupyter notebooks for track visualisation and benchmarking is provided on the replicAnt and BlenderMotionExport GitHub repositories.
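
    The constant-size, border-adjusted boxes described above amount to a simple clamp; a minimal sketch (box and frame sizes invented):

    def make_box(cx, cy, size, w, h):
        """Constant-size box centred on (cx, cy), truncated at image edges."""
        x0 = max(0, int(cx - size / 2))
        y0 = max(0, int(cy - size / 2))
        x1 = min(w, int(cx + size / 2))
        y1 = min(h, int(cy + size / 2))
        return x0, y0, x1, y1

    # Thorax centre near the right border of a 1024 x 1024 frame:
    print(make_box(cx=1010, cy=500, size=64, w=1024, h=1024))
    # -> (978, 468, 1024, 532): the box is clipped at x = 1024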

    Synthetic data generation

    Two synthetic datasets, each with a population size of 100, were generated from 3D models of Atta vollenweideri leaf-cutter ants. All 3D models were created with the scAnt photogrammetry workflow. A "group" population was based on three distinct 3D models of an ant minor (1.1 mg), a media (9.8 mg), and a major (50.1 mg) (see 10.5281/zenodo.7849059). To approximately simulate the size distribution of A. vollenweideri colonies, these models make up 20%, 60%, and 20% of the simulated population, respectively. A 33% within-class scale variation, with default hue, contrast, and brightness subject material variation, was used. A "single" population was generated using the major model only, with 90% scale variation, but equal material variation settings.

    A Gnathamitermes sp. synthetic dataset was generated from two hand-sculpted models; a worker and a soldier made up 80% and 20% of the simulated population of 100 individuals, respectively with default hue, contrast, and brightness subject material variation. Both 3D models were created in Blender v3.1, using reference photographs.

    Each of the three synthetic datasets contains 10,000 images, rendered at a resolution of 1024 by 1024 px, using the default generator settings as documented in the Generator_example level file (see documentation on GitHub). To assess how the training dataset size affects performance, we trained networks on 100 (“small”), 1,000 (“medium”), and 10,000 (“large”) subsets of the “group” dataset. Generating 10,000 samples at the specified resolution took approximately 10 hours per dataset on a consumer-grade laptop (6 Core 4 GHz CPU, 16 GB RAM, RTX 2070 Super).

    Additionally, five datasets which contain both real and synthetic images were curated. These “mixed” datasets combine image samples from the synthetic “group” dataset with image samples from the real “base” case. The ratio between real and synthetic images across the five datasets varied between 10/1 to 1/100.

    Funding

    This study received funding from Imperial College’s President’s PhD Scholarship (to Fabian Plum), and is part of a project that has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant agreement No. 851705, to David Labonte). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

  17. YRS Synthetic Forecast Generation Dataset

    • search.dataone.org
    • hydroshare.org
    Updated Jun 7, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Zachary Paul Brodeur (2025). YRS Synthetic Forecast Generation Dataset [Dataset]. http://doi.org/10.4211/hs.29a7c696ee4e4766883078ca0d681884
    Dataset updated
    Jun 7, 2025
    Dataset provided by
    Hydroshare
    Authors
    Zachary Paul Brodeur
    Time period covered
    Oct 2, 1985 - Sep 30, 2019
    Description

    Pre-processed subset of raw HEFS hindcast data for the Feather-Yuba system (YRS), configured for compatibility with the repository structure of the version 1 and version 2 synthetic forecast models contained here: https://github.com/zpb4/Synthetic-Forecast-v1-FIRO-DISES and here: https://github.com/zpb4/Synthetic-Forecast-v2-FIRO-DISES. The data are pre-structured for the repository setup, and README files in both GitHub repos include instructions on how to set up the data contained in this resource.

    Contains HEFS hindcast .csv files and observed full-natural-flow files for the following sites:

    • ORDC1 - main reservoir inflow to Oroville Lake
    • NBBC1 - main reservoir inflow to New Bullards Bar
    • MRYC1L - downstream local flows at Marysville junction

    The data also contain the R scripts used to preprocess the raw HEFS data. These raw data are too large for easy storage in a public repository (YRS has 30+ modeled sites) but are available upon reasonable request from: Zach Brodeur, zpb4@cornell.edu

  18. Data from: eCARLA-scenes: A synthetically generated dataset for event-based...

    • zenodo.org
    zip
    Updated Dec 13, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jad Mansour; Hayat Rajani; Rafael Garcia; Nuno Gracias (2024). eCARLA-scenes: A synthetically generated dataset for event-based optical flow prediction [Dataset]. http://doi.org/10.5281/zenodo.14412251
    Available download formats: zip
    Dataset updated
    Dec 13, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jad Mansour; Hayat Rajani; Rafael Garcia; Nuno Gracias
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository hosts a synthetic event-based optical flow dataset, meticulously designed to simulate diverse environments under varying weather conditions using the CARLA simulator. The dataset is specifically tailored for autonomous field vehicles, featuring event streams, grayscale images, and corresponding ground truth optical flow.

    In addition to the dataset, the accompanying repository provides a user-friendly pipeline for generating custom datasets, including optical flow displacements and grayscale images. The generated data leverages the optimized eWiz framework, ensuring efficient storage, access, and processing.

    The data generation pipeline can be utilized by cloning the eCARLA-scenes repository. Whether you're a researcher or developer, this resource is an ideal starting point for advancing event-based vision systems in real-world autonomous applications.

  19. 2D high-resolution synthetic MR images of Alzheimer's patients and healthy...

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, csv
    Updated Dec 13, 2023
    Cite
    Matteo Lai; Chiara Marzi; Luca Citi; Stefano Diciotti (2023). 2D high-resolution synthetic MR images of Alzheimer's patients and healthy subjects using PACGAN [Dataset]. http://doi.org/10.5281/zenodo.8276786
    Available download formats: application/gzip, csv
    Dataset updated
    Dec 13, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Matteo Lai; Chiara Marzi; Luca Citi; Stefano Diciotti
    Description

    This dataset encompasses a NIfTI file containing a collection of 500 images, each capturing the central axial slice of a synthetic brain MRI.

    Accompanying this file is a CSV dataset with the corresponding label for each image (a loading sketch follows the list below):

    • Label 0: Healthy Controls (HC)
    • Label 1: Alzheimer's Disease (AD)
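
    A minimal loading sketch; the file names and the CSV column name are assumptions, not the repository's documented layout:

    import nibabel as nib
    import pandas as pd

    volume = nib.load("synthetic_slices.nii.gz")   # assumed file name
    images = volume.get_fdata()                    # e.g. an (H, W, 500) array

    labels = pd.read_csv("labels.csv")             # assumed file and column names
    print(images.shape, labels["label"].value_counts().to_dict())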

    Each image within this dataset has been generated by PACGAN (Progressive Auxiliary Classifier Generative Adversarial Network), a framework designed and implemented by the AI for Medicine Research Group at the University of Bologna.

    PACGAN is a generative adversarial network trained to generate high-resolution images belonging to different classes. In our work, we trained this framework on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, which contains brain MRI images of AD patients and HC.

    The implementation of the training algorithm can be found within our GitHub repository, with Docker containerization.

    For further exploration, the pre-trained models are available within the Code Ocean capsule. These models can facilitate the generation of synthetic images for both classes and also aid in classifying new brain MRI images.

  20. GARD: Gustavo’s Awesome Runway Dataset (2025)

    • kaggle.com
    zip
    Updated Mar 30, 2025
    Cite
    Gustavo de Paula (2025). GARD: Gustavo’s Awesome Runway Dataset (2025) [Dataset]. https://www.kaggle.com/datasets/depaulagu/gard2025
    Available download formats: zip (55376320216 bytes)
    Dataset updated
    Mar 30, 2025
    Authors
    Gustavo de Paula
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    GARD (Gustavo’s Awesome Runway Dataset) is the largest publicly available synthetic runway image dataset, built to support machine learning tasks in vision-based aircraft landing systems. It contains over 45,000 high-resolution (1024×1024) labeled images.

    This dataset was created using Canny2Concrete, a modular open-source data augmentation pipeline leveraging ControlNet and Stable Diffusion XL. The generation process conditions on edge maps extracted from real-world template images and applies multiple stages of variation including weather, lighting, and occlusion effects.
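
    A hedged sketch of the edge-conditioned generation step that pipeline describes, using the diffusers ControlNet + SDXL API; the checkpoints, file names, and prompt are illustrative, not Canny2Concrete's actual configuration:

    import cv2
    import numpy as np
    import torch
    from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline
    from PIL import Image

    # 1) Edge map from a real-world template image.
    template = cv2.imread("runway_template.png")          # placeholder file
    edges = cv2.Canny(template, 100, 200)
    control = Image.fromarray(np.stack([edges] * 3, axis=-1))

    # 2) Generate a variant conditioned on the edge map.
    controlnet = ControlNetModel.from_pretrained(
        "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16)
    pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        controlnet=controlnet, torch_dtype=torch.float16).to("cuda")

    image = pipe("airport runway at dusk, light rain, photorealistic",
                 image=control, num_inference_steps=30).images[0]
    image.save("runway_variant.png")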

    Models trained with GARD have been shown to outperform or match those trained on existing synthetic datasets like LARD, especially in challenging segmentation tasks.

    🚀 What’s Inside:

    • BaseImages: Direct and diverse generations from runway edge maps (Canny).
    • VariantImages: Geometric augmentations (rotations, translations, etc).
    • VariantImagesWithOcclusion: Added weather occlusion effects (rain, fog, snow, night).

    Each image includes:

    • 📷 .png image file
    • 🏷 .txt YOLO-format label
    • 🧩 .mask.png segmentation mask
    • 📄 .json full metadata, designed for full reproducibility (prompt, seed, label points, effects applied)


    🏁 Built For:

    • Runway segmentation and detection
    • Computer vision research in aviation
    • Synthetic dataset generation at scale
    • Researchers working on UAV and autonomous landing