Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains synthetic and real images, with their labels, for Computer Vision in robotic surgery. It is part of ongoing research on sim-to-real applications in surgical robotics. The dataset will be updated with further details and references once the related work is published. For further information see the repository on GitHub: https://github.com/PietroLeoncini/Surgical-Synthetic-Data-Generation-and-Segmentation
Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
This competition features two independent synthetic data challenges that you can join separately:
- The FLAT DATA Challenge
- The SEQUENTIAL DATA Challenge
For each challenge, generate a dataset with the same size and structure as the original, capturing its statistical patterns — but without being significantly closer to the (released) original samples than to the (unreleased) holdout samples.
Train a generative model that generalizes well, using any open-source tools (Synthetic Data SDK, synthcity, reprosyn, etc.) or your own solution. Submissions must be fully open-source, reproducible, and runnable within 6 hours on a standard machine.
Flat Data
- 100,000 records
- 80 data columns: 60 numeric, 20 categorical
Sequential Data
- 20,000 groups
- each group contains 5-10 records
- 10 data columns: 7 numeric, 3 categorical
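For orientation only, here is a minimal Python sketch of the expected submission shape for the flat-data challenge. It samples each column independently from its empirical distribution, which reproduces the required size, schema, and marginal statistics but ignores cross-column correlations, so it illustrates the output format rather than a competitive solution. The file names are assumptions.

```python
import pandas as pd

# Hypothetical file names; the actual challenge data layout may differ.
train = pd.read_csv("flat-training.csv")

# Naive baseline: sample each column independently from its empirical
# distribution. This matches the 100,000 x 80 shape and the marginals,
# but drops correlations between columns, so it would score poorly on
# accuracy. It only illustrates the expected submission format.
synthetic = pd.DataFrame({
    col: train[col].sample(n=len(train), replace=True).reset_index(drop=True)
    for col in train.columns
})

assert synthetic.shape == train.shape
synthetic.to_csv("flat-submission.csv", index=False)
```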
If you use this dataset in your research, please cite:
@dataset{mostlyaiprize,
author = {MOSTLY AI},
title = {MOSTLY AI Prize Dataset},
year = {2025},
url = {https://www.mostlyaiprize.com/},
}
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository stores synthetic datasets derived from the database of the UK Biobank (UKB) cohort.
The datasets were generated for illustrative purposes, in particular for reproducing specific analyses on the health risks associated with long-term exposure to air pollution using the UKB cohort. The code used to create the synthetic datasets is available and documented in a related GitHub repo, with details provided in the section below. These datasets can be freely used for code testing and for illustrating other examples of analyses on the UKB cohort.
The synthetic data have been used so far in two analyses described in related peer-reviewed publications, which also provide information about the original data sources:
Note: while the synthetic versions of the datasets resemble the real ones in several aspects, users should be aware that these data are fake and must not be used to test or make inferences about specific research hypotheses. Even more importantly, these data cannot be considered a reliable description of the original UKB data, and they must not be presented as such.
The work was supported by the Medical Research Council-UK (Grant ID: MR/Y003330/1).
The series of synthetic datasets (stored in two versions, in csv and RDS formats) is the following:
In addition, this repository provides these additional files:
The datasets resemble the real data used in the analysis, and they were generated using the R package synthpop (www.synthpop.org.uk). The generation process involves two steps, namely the synthesis of the main data (cohort info, baseline variables, annual PM2.5 exposure) and then the sampling of death events. The R scripts for performing the data synthesis are provided in the GitHub repo (subfolder Rcode/synthcode).
The first part merges all the data, including the annual PM2.5 levels, into a single wide-format dataset (with a row for each subject), generates a synthetic version, adds fake IDs, and then extracts (and reshapes) the single datasets. In the second part, a Cox proportional hazard model is fitted on the original data to estimate risks associated with various predictors (including the main exposure represented by PM2.5), and then these relationships are used to simulate death events in each year. Details on the modelling aspects are provided in the article.
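As an illustration of the second step (the released scripts are written in R and use synthpop; see the GitHub repo), here is a hedged Python analogue using the lifelines package: fit a Cox proportional hazards model on the original data, then use the fitted relative hazards to simulate yearly death events in the synthetic cohort. Column names and the baseline hazard below are assumptions, not values from the actual workflow.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

# Hedged Python analogue of the death-event sampling step; the released
# scripts use R and synthpop, and all column names here are hypothetical.
original = pd.read_csv("original_cohort.csv")    # covariates incl. PM2.5, follow-up, death
synthetic = pd.read_csv("synthetic_cohort.csv")  # synthesised covariates, no outcomes yet

# Fit a Cox proportional hazards model on the original data.
cph = CoxPHFitter()
cph.fit(original, duration_col="followup_years", event_col="death")

# Use the fitted relative hazards to simulate death events year by year,
# assuming a constant baseline yearly hazard (illustrative value only).
baseline_hazard = 0.01
rel_hazard = cph.predict_partial_hazard(synthetic).to_numpy().ravel()
p_death = 1.0 - np.exp(-baseline_hazard * rel_hazard)

rng = np.random.default_rng(42)
alive = np.ones(len(synthetic), dtype=bool)
for year in range(1, 11):                        # ten simulated follow-up years
    died = alive & (rng.random(len(synthetic)) < p_death)
    synthetic[f"death_year_{year}"] = died.astype(int)
    alive &= ~died
```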
This process guarantees that the synthetic data do not hold specific information about the original records, thus preserving confidentiality. At the same time, the multivariate distribution and correlation across variables, as well as the mortality risks, resemble those of the original data, so the results of descriptive and inferential analyses are similar to those in the original assessments. However, as noted above, the data are used only for illustrative purposes, and they must not be used to test other research hypotheses.
GridSTAGE (Spatio-Temporal Adversarial scenario GEneration) is a framework for the simulation of adversarial scenarios and the generation of multivariate spatio-temporal data in cyber-physical systems. GridSTAGE is developed in Matlab and leverages the Power System Toolbox (PST), in which the evolution of the power network is governed by nonlinear differential equations. Using GridSTAGE, one can create event scenarios that correspond to different operating states of the power network by enabling or disabling any of the following: faults, AGC control, PSS control, exciter control, load changes, generation changes, and different types of cyber-attacks. Standard IEEE bus system data is used to define the power system environment. GridSTAGE emulates data from PMU and SCADA sensors; the reporting rate and location of the sensors can be adjusted as well. Detailed instructions on generating data scenarios with different system topologies, attack characteristics, load characteristics, sensor configurations, and control parameters are available in the GitHub repository: https://github.com/pnnl/GridSTAGE.
There is no existing adversarial data-generation framework that can incorporate several attack characteristics and yield adversarial PMU data. The GridSTAGE framework currently supports simulation of false data injection attacks (such as ramp, step, random, trapezoidal, multiplicative, replay, and freezing attacks) and denial-of-service attacks (such as time-delay and packet-loss) on PMU data. Furthermore, it supports generating spatio-temporal time-series data corresponding to several random load changes across the network or to several generation changes. A Koopman mode decomposition (KMD) based algorithm to detect and identify false data attacks in real time is proposed in https://ieeexplore.ieee.org/document/9303022. Machine learning-based predictive models are developed to capture the dynamics of the underlying power system with a high level of accuracy under various operating conditions for the IEEE 68-bus system; the corresponding machine learning models are available at https://github.com/pnnl/grid_prediction.
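As an illustration of one of the supported attack types, here is a minimal Python sketch of a ramp-type false data injection on a simulated PMU frequency stream. GridSTAGE itself implements these attacks in Matlab/PST; the signal, rates, and attack window below are assumptions for illustration only.

```python
import numpy as np

# Hedged sketch of a ramp-type false data injection (FDI) attack on PMU data.
# GridSTAGE implements this in Matlab/PST; all values here are illustrative.
fs = 30                                               # assumed PMU reporting rate [samples/s]
t = np.arange(0.0, 20.0, 1.0 / fs)                    # 20 s measurement window
clean_freq = 60.0 + 0.01 * np.random.randn(t.size)    # clean bus frequency [Hz]

attack_start, attack_end = 8.0, 14.0                  # attack interval [s]
ramp_rate = 0.05                                      # injected drift [Hz/s]

# The attacker adds a slowly growing bias during the attack window.
in_attack = (t >= attack_start) & (t <= attack_end)
injection = np.zeros_like(t)
injection[in_attack] = ramp_rate * (t[in_attack] - attack_start)

attacked_freq = clean_freq + injection                # what the operator would observe
```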
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The MatSim Dataset and benchmark
Synthetic dataset and real images benchmark for visual similarity recognition of materials and textures.
MatSim: a synthetic dataset, a benchmark, and a method for computer vision-based recognition of similarities and transitions between materials and textures, focused on identifying any material under any conditions using one or a few examples (one-shot learning).
Based on the paper: One-shot recognition of any material anywhere using contrastive learning with physics-based rendering
Benchmark_MATSIM.zip: contains the benchmark of real-world images described in the paper.
MatSim_object_train_split_1,2,3.zip: contains a subset of the synthetic dataset: CGI images of materials on random objects, as described in the paper.
MatSim_Vessels_Train_1,2,3.zip: contains a subset of the synthetic dataset: CGI images of materials inside transparent containers, as described in the paper.
*Note: these are subsets of the dataset; the full dataset can be found at:
https://e1.pcloud.link/publink/show?code=kZIiSQZCYU5M4HOvnQykql9jxF4h0KiC5MX
or
https://icedrive.net/s/A13FWzZ8V2aP9T4ufGQ1N3fBZxDF
Code:
Up-to-date code for generating the dataset, readers and evaluation scripts, and trained nets can be found at this URL: https://github.com/sagieppel/MatSim-Dataset-Generator-Scripts-And-Neural-net
Dataset Generation Scripts.zip: contains the Blender (3.1) Python scripts used for generating the dataset. This code might be old; up-to-date code can be found at the URL above.
Net_Code_And_Trained_Model.zip: contains reference neural-net code, including loaders, trained models, and evaluation scripts that can be used to read and train with the synthetic dataset or to test the model on the benchmark. Note: the code in the ZIP file is not up to date and contains some bugs; for the latest version of this code, see the URL above.
Further documentation can be found inside the zip files or in the paper.
About the Dataset: This is a simulated credit card transaction dataset containing legitimate and fraudulent transactions covering the period 1 Jan 2019 to 31 Dec 2020. It covers the credit cards of 1,000 customers transacting with a pool of 800 merchants.
Source of Simulation: This was generated using the Sparkov Data Generation tool (available on GitHub) created by Brandon Harris. The simulation was run for the period 1 Jan 2019 to 31 Dec 2020, and the resulting files were combined and converted into a standard format.
Information about the Simulator: I do not own the simulator; I used the one created by Brandon Harris, and to understand how it works I went through a few portions of the code. This is what I understood from what I read:
The simulator has pre-defined lists of merchants, customers, and transaction categories. Using the Python library "faker", together with the numbers of customers and merchants specified for the simulation, an intermediate list is created.
After this, transactions are created according to the profile you choose, e.g. "adults_2550_female_rural.json" (the simulation properties of adult females aged 25-50 from rural areas). For this profile (see "Sparkov | Github | adults_2550_female_rural.json"), parameter ranges are defined in terms of minimum and maximum transactions per day, the distribution of transactions across days of the week, and normal-distribution parameters (mean, standard deviation) for amounts in the various categories. Using these distributions, the transactions are generated with faker, as sketched below.
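The following is a minimal Python sketch of that mechanism, not the Sparkov code itself: the category parameters and daily transaction range are made-up stand-ins for the values stored in a profile JSON.

```python
import random
from faker import Faker

fake = Faker()

# Hypothetical profile parameters standing in for a file such as
# adults_2550_female_rural.json: per-category amount distributions (mean, std)
# and a min/max number of transactions per day.
categories = {
    "grocery_pos":   {"mean": 55.0, "std": 20.0},
    "gas_transport": {"mean": 65.0, "std": 25.0},
    "entertainment": {"mean": 40.0, "std": 30.0},
}
min_per_day, max_per_day = 1, 4

transactions = []
for _ in range(730):                                  # two simulated years of days
    for _ in range(random.randint(min_per_day, max_per_day)):
        category = random.choice(list(categories))
        params = categories[category]
        transactions.append({
            "customer":  fake.name(),
            "merchant":  fake.company(),
            "category":  category,
            # amounts drawn from the profile's normal distribution, floored at $1
            "amount":    round(max(1.0, random.gauss(params["mean"], params["std"])), 2),
            "timestamp": fake.date_time_between(start_date="-2y", end_date="now"),
        })
```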
What I did was generate transactions across all profiles and then merge them together to create a more realistic representation of simulated transactions.
Acknowledgements: Brandon Harris, for his amazing work in creating this easy-to-use simulation tool for generating fraud transaction datasets.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
SynPAT: Generating Synthetic Physical Theories with Data
This is the Hugging Face dataset entry for SynPAT, a synthetic theory and data generation system developed for the paper "SynPAT: Generating Synthetic Physical Theories with Data". GitHub: https://github.com/jlenchner/theorizer. SynPAT generates symbolic physical systems and corresponding synthetic data to benchmark symbolic regression and scientific discovery algorithms. Each synthetic system includes symbolic equations… See the full description on the dataset page: https://huggingface.co/datasets/Karan0901/synpat-dataset.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
SynQA is a Reading Comprehension dataset created in the work "Improving Question Answering Model Robustness with Synthetic Adversarial Data Generation" (https://aclanthology.org/2021.emnlp-main.696/). It consists of 314,811 synthetically generated questions on the passages in the SQuAD v1.1 (https://arxiv.org/abs/1606.05250) training set.
In this work, we use synthetic adversarial data generation to make QA models more robust to human adversaries. We develop a data generation pipeline that selects source passages, identifies candidate answers, generates questions, and finally filters or re-labels them to improve quality. Using this approach, we amplify a smaller human-written adversarial dataset into a much larger set of synthetic question-answer pairs. By incorporating our synthetic data, we improve the state of the art on the AdversarialQA (https://adversarialqa.github.io/) dataset by 3.7 F1 and improve model generalisation on nine of the twelve MRQA datasets. We further conduct a novel human-in-the-loop evaluation to show that our models are considerably more robust to new human-written adversarial examples: crowdworkers can fool our model only 8.8% of the time on average, compared to 17.6% for a model trained without synthetic data.
For full details on how the dataset was created, kindly refer to the paper.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context: The Bearings with Varying Degradation Behaviors data set is a synthetic data set representing the run-to-failure degradation data of rolling bearings. This data set is designed to facilitate the development and evaluation of diagnostic and prognostic methods in the context of Prognostics and Health Management (PHM). For the generation of the data set, the simulation model presented by Mauthe, Hagmeyer, and Zeiler (2025) was used. The simulation model is publicly available on GitHub.
Simulation Model: Mauthe, Hagmeyer, and Zeiler (2025) introduce a generic simulation model for generating representative run-to-failure data of rolling bearings. It is designed to address challenges in the development of data-driven diagnostic and prognostic methods, such as unbalanced or limited data availability. The model consists of three modular components: the life and fault modeling, the degradation progression simulation, and the vibration signal generation. Each module incorporates random processes to reproduce real-world variations, such as differences in bearing lives and degradation progressions under similar operating conditions. The model simulates vibration signals throughout a bearing's life, reflecting both operating and degradation conditions. As such, the versatile model enables its users to create synthetic data sets of rolling bearings tailored to specific scenarios. A more detailed description of the model can be found in the corresponding paper (see Data Set Citation).
Given Data Scenario and Specification: See the provided description file Bearings_with_Varying_Degradation_Behaviors.pdf
Task: The data set contains training and test data, consisting of run-to-failure data from 28 and 12 simulated bearings, respectively. The objective is to predict the remaining useful life (RUL) of the rolling bearings in the test data. All runs proceed up to the identical failure threshold, which means that RUL = 0 applies to the last point in time and to the last vibration measurement, respectively.
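As a minimal sketch of what that labelling convention implies (not part of the data set itself), the RUL target for a run-to-failure sequence simply counts down to zero at the final measurement; here RUL is expressed in number of remaining measurements, whereas the data set may use a time unit.

```python
import numpy as np

def rul_labels(n_measurements: int) -> np.ndarray:
    """RUL target for one run-to-failure sequence: the last measurement sits at
    the failure threshold, so its label is 0 and earlier ones count down to it."""
    return np.arange(n_measurements - 1, -1, -1)

print(rul_labels(5))   # -> [4 3 2 1 0]
```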
Data Set Creator: Hochschule Esslingen – University of Applied Sciences, Institute for Technical Reliability and Prognostics (IZP), Robert-Bosch-Straße 1, 73037 Göppingen, Germany
Data Set Citation: Mauthe, F.; Hagmeyer, S.; Zeiler, P. (2025). Holistic simulation model of the temporal degradation of rolling bearings. Proceedings of the 35th European Safety and Reliability Conference and the 33rd Society for Risk Analysis Europe Conference, 15.06. – 19.06.2025, Stavanger, Norway, pp. 953–960, DOI: 10.3850/978-981-94-3281-3_ESREL-SRA-E2025-P8028-cd
https://rpsonline.com.sg/proceedings/esrel-sra-e2025/html/ESREL-SRA-E2025-P8028.html
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Multiturn Multimodal
We want to generate synthetic data that is able to capture the position of, and relationships between, multiple images and multiple audio clips; an example is given below. All notebooks are at https://github.com/mesolitica/malaysian-dataset/tree/master/chatbot/multiturn-multimodal
multi-images
synthetic-multi-images-relationship.jsonl, 100000 rows, 109MB. Images at https://huggingface.co/datasets/mesolitica/translated-LLaVA-Pretrain/tree/main
Example data
{'filename':… See the full description on the dataset page: https://huggingface.co/datasets/mesolitica/synthetic-multiturn-multimodal.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is the datamix created by Team 🔍 📝 🕵️♂️ 🤖 during the LLM - Detect AI Generated Text competition. This dataset helped us win the competition. It supports a text-classification task that separates LLM-generated essays from student-written ones.
It was developed incrementally, focusing on size, diversity, and complexity. For each datamix iteration, we attempted to plug blind spots of the previous generation's models while maintaining robustness.
To maximally leverage in-domain human texts, we used the entire Persuade corpus, comprising all 15 prompts. We also included diverse human texts from sources such as the OpenAI GPT-2 output dataset, the ELLIPSE corpus, NarrativeQA, Wikipedia, the NLTK Brown corpus, and IMDB movie reviews.
Sources for our generated essays can be grouped under four categories:
- Proprietary LLMs (gpt-3.5, gpt-4, claude, cohere, gemini, palm)
- Open-source LLMs (llama, falcon, mistral, mixtral)
- Existing LLM-generated text datasets
- Synthetic dataset made by T5
- DAIGT V2 subset
- OUTFOX
- Ghostbuster
- gpt-2-output-dataset
We used a wide variety of generation configs and prompting strategies to promote diversity and complexity in the data. Generated essays leveraged a combination of the following:
- Contrastive search
- Use of guidance scale, typical_p, suppress_tokens
- High temperature and large values of top-k
- Prompting to fill in the blanks: randomly mask words in an essay and ask the LLM to reconstruct the original essay (similar to MLM)
- Prompting without source texts
- Prompting with source texts
- Prompting to rewrite existing essays
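For illustration, here is a hedged sketch of two of these decoding setups using the Hugging Face transformers generate API; the model name, prompt, and parameter values are assumptions, not the exact configs used for the datamix.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed stand-in model; the datamix drew on many proprietary and open LLMs.
model_name = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto")

prompt = "Write an essay arguing that schools should adopt a four-day week."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# High temperature with a large top-k to push diversity.
sampled = model.generate(**inputs, do_sample=True, temperature=1.3,
                         top_k=500, max_new_tokens=600)

# Contrastive search: deterministic decoding with a degeneration penalty.
contrastive = model.generate(**inputs, penalty_alpha=0.6, top_k=4,
                             max_new_tokens=600)

print(tokenizer.decode(sampled[0], skip_special_tokens=True))
```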
Finally, we incorporated augmented essays to make our models aware of typical attacks on LLM content-detection systems and of the obfuscations present in the provided training data. We mainly used a combination of the following augmentations on a random subset of essays:
- Spelling correction
- Deletion/insertion/swapping of characters
- Replacement with synonyms
- Introducing obfuscations
- Back translation
- Random capitalization
- Sentence swapping
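A toy sketch of the character-level portion of such augmentations (deletion, adjacent swaps, and random capitalization); this is an illustration only, not the actual augmentation code used for the datamix.

```python
import random

def augment_chars(text: str, p: float = 0.05, seed: int = 0) -> str:
    """Apply random character deletions, adjacent swaps, and case flips."""
    rng = random.Random(seed)
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        c = chars[i]
        r = rng.random()
        if r < p:                                    # delete this character
            i += 1
            continue
        if r < 2 * p and i + 1 < len(chars):         # swap with the next one
            out.extend([chars[i + 1], c])
            i += 2
            continue
        if r < 3 * p:                                # flip capitalization
            c = c.upper() if c.islower() else c.lower()
        out.append(c)
        i += 1
    return "".join(out)

print(augment_chars("Students benefit from project-based learning.", p=0.08))
```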
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the generated profiles from the combination of the ASCVD and SMART risk calculators. In addition, the Unmet Risk Index value is included at the end of each data row. These data were used in the research paper titled "Medical calculators derived synthetic patients: a novel method for generation of synthetic patient data", currently available as a preprint.
The code used to generate these profiles is available on GitHub at: https://github.com/FrancisJMR/unmet-risk-index
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
v1 of the synthetic dataset generated for the aditi model. Generation scripts are located at https://github.com/manishiitg/aditi_dataset/tree/main/gen
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Pre-processed subset of raw HEFS hindcast data for Seven Oaks Dam (SOD), configured for compatibility with the repository structure of the version 1 and version 2 synthetic forecast models contained here: https://github.com/zpb4/Synthetic-Forecast-v1-FIRO-DISES and here: https://github.com/zpb4/Synthetic-Forecast-v2-FIRO-DISES. The data are pre-structured for the repository setup, and the README files of both GitHub repos include instructions on how to set up the data contained in this resource.
Contains HEFS hindcast .csv files and observed full-natural-flow .csv files for the following sites: SRWC1 - main reservoir inflow to Seven Oaks Dam
Note: The zipped file contains some R scripts that were used to pre-process the raw data. They do not interact with the GitHub scripts referenced above and can be discarded. All the information in the raw data is contained within the zipped files; it has simply been converted to a standardized format for compatibility with the synthetic forecast generation codebase.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the latest version of the GHTraffic project. The main aim is to model a variety of transaction sequences to reflect more complex service behaviour.
This version consists of a single edition collected from the google/guava repository.
The entire data generation process is quite similar to the original GHTraffic design, but it incorporates minor changes to the synthetic data generation: after a resource is successfully posted, a random later date is used to make up the request and response for all of the HTTP methods. It also adds another subset of unsuccessful transactions by issuing requests before the resource creation has succeeded.
This results in a far more dynamic series of transactions to named resources.
Scripts used for dataset construction are accessible from the repository.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains all recorded and hand-annotated data, all synthetically generated data, and representative trained networks used for the detection and tracking experiments in the manuscript "replicAnt - generating annotated images of animals in complex environments using Unreal Engine". Unless stated otherwise, all 3D animal models used in the synthetically generated data were created with the open-source photogrammetry platform scAnt (peerj.com/articles/11155/). All synthetic data were generated with the associated replicAnt project available from https://github.com/evo-biomech/replicAnt.
Abstract:
Deep learning-based computer vision methods are transforming animal behavioural research. Transfer learning has enabled work in non-model species, but still requires hand-annotation of example footage, and is only performant in well-defined conditions. To overcome these limitations, we created replicAnt, a configurable pipeline implemented in Unreal Engine 5 and Python, designed to generate large and variable training datasets on consumer-grade hardware instead. replicAnt places 3D animal models into complex, procedurally generated environments, from which automatically annotated images can be exported. We demonstrate that synthetic data generated with replicAnt can significantly reduce the hand-annotation required to achieve benchmark performance in common applications such as animal detection, tracking, pose-estimation, and semantic segmentation; and that it increases the subject-specificity and domain-invariance of the trained networks, so conferring robustness. In some applications, replicAnt may even remove the need for hand-annotation altogether. It thus represents a significant step towards porting deep learning-based computer vision tools to the field.
Benchmark data
Two video datasets were curated to quantify detection performance; one in laboratory and one in field conditions. The laboratory dataset consists of top-down recordings of foraging trails of Atta vollenweideri (Forel 1893) leaf-cutter ants. The colony was collected in Uruguay in 2014, and housed in a climate chamber at 25°C and 60% humidity. A recording box was built from clear acrylic, and placed between the colony nest and a box external to the climate chamber, which functioned as feeding site. Bramble leaves were placed in the feeding area prior to each recording session, and ants had access to the recording area at will. The recorded area was 104 mm wide and 200 mm long. An OAK-D camera (OpenCV AI Kit: OAK-D, Luxonis Holding Corporation) was positioned centrally 195 mm above the ground. While keeping the camera position constant, lighting, exposure, and background conditions were varied to create recordings with variable appearance: The “base” case is an evenly lit and well exposed scene with scattered leaf fragments on an otherwise plain white backdrop. A “bright” and “dark” case are characterised by systematic over- or underexposure, respectively, which introduces motion blur, colour-clipped appendages, and extensive flickering and compression artefacts. In a separate well exposed recording, the clear acrylic backdrop was substituted with a printout of a highly textured forest ground to create a “noisy” case. Last, we decreased the camera distance to 100 mm at constant focal distance, effectively doubling the magnification, and yielding a “close” case, distinguished by out-of-focus workers. All recordings were captured at 25 frames per second (fps).
The field dataset consists of video recordings of Gnathamitermes sp. desert termites, filmed close to the nest entrance in the desert of Maricopa County, Arizona, using a Nikon D850 and a Nikkor 18-105 mm lens on a tripod at camera distances between 20 cm and 40 cm. All video recordings were well exposed and captured at 23.976 fps.
Each video was trimmed to the first 1000 frames, and contains between 36 and 103 individuals. In total, 5000 and 1000 frames were hand-annotated for the laboratory and field datasets, respectively: each visible individual was assigned a constant-size bounding box, with a centre coinciding approximately with the geometric centre of the thorax in top-down view. The size of the bounding boxes was chosen such that they were large enough to completely enclose the largest individuals, and was automatically adjusted near the image borders. A custom-written Blender Add-on aided hand-annotation: the Add-on is a semi-automated multi-animal tracker which leverages Blender's internal contrast-based motion tracker, but also includes track refinement options and CSV export functionality. Comprehensive documentation of this tool, and Jupyter notebooks for track visualisation and benchmarking, are provided on the replicAnt and BlenderMotionExport GitHub repositories.
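As a minimal sketch of that bounding-box convention (constant size, centred on the thorax, adjusted near the image borders), assuming "adjusted" means clipped to the frame; this is an illustration, not the Blender Add-on's code.

```python
def thorax_box(cx: float, cy: float, box_size: float, img_w: int, img_h: int):
    """Constant-size box centred on the thorax centre (cx, cy), clipped so it
    stays inside an img_w x img_h frame. Illustrative only."""
    half = box_size / 2
    x0, y0 = max(0, cx - half), max(0, cy - half)
    x1, y1 = min(img_w, cx + half), min(img_h, cy + half)
    return x0, y0, x1, y1

# A 64 px box near the left border of a 1024 x 1024 frame gets clipped:
print(thorax_box(20, 300, 64, 1024, 1024))   # -> (0, 268.0, 52.0, 332.0)
```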
Synthetic data generation
Two synthetic datasets, each with a population size of 100, were generated from 3D models of Atta vollenweideri leaf-cutter ants. All 3D models were created with the scAnt photogrammetry workflow. A “group” population was based on three distinct 3D models of an ant minor (1.1 mg), a media (9.8 mg), and a major (50.1 mg) (see 10.5281/zenodo.7849059). To approximately simulate the size distribution of A. vollenweideri colonies, these models make up 20%, 60%, and 20% of the simulated population, respectively. A 33% within-class scale variation, with default hue, contrast, and brightness subject material variation, was used. A “single” population was generated using the major model only, with 90% scale variation, but equal material variation settings.
A Gnathamitermes sp. synthetic dataset was generated from two hand-sculpted models; a worker and a soldier made up 80% and 20% of the simulated population of 100 individuals, respectively with default hue, contrast, and brightness subject material variation. Both 3D models were created in Blender v3.1, using reference photographs.
Each of the three synthetic datasets contains 10,000 images, rendered at a resolution of 1024 by 1024 px, using the default generator settings as documented in the Generator_example level file (see documentation on GitHub). To assess how the training dataset size affects performance, we trained networks on 100 (“small”), 1,000 (“medium”), and 10,000 (“large”) subsets of the “group” dataset. Generating 10,000 samples at the specified resolution took approximately 10 hours per dataset on a consumer-grade laptop (6 Core 4 GHz CPU, 16 GB RAM, RTX 2070 Super).
Additionally, five datasets which contain both real and synthetic images were curated. These “mixed” datasets combine image samples from the synthetic “group” dataset with image samples from the real “base” case. The ratio between real and synthetic images across the five datasets varied from 10/1 to 1/100.
Funding
This study received funding from Imperial College’s President’s PhD Scholarship (to Fabian Plum), and is part of a project that has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant agreement No. 851705, to David Labonte). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Pre-processed subset of raw HEFS hindcast data for the Feather-Yuba system (YRS), configured for compatibility with the repository structure of the version 1 and version 2 synthetic forecast models contained here: https://github.com/zpb4/Synthetic-Forecast-v1-FIRO-DISES and here: https://github.com/zpb4/Synthetic-Forecast-v2-FIRO-DISES. The data are pre-structured for the repository setup, and the README files of both GitHub repos include instructions on how to set up the data contained in this resource.
Contains HEFS hindcast .csv files and observed full-natural-flow files for the following sites:
- ORDC1 - main reservoir inflow to Oroville Lake
- NBBC1 - main reservoir inflow to New Bullards Bar
- MRYC1L - downstream local flows at Marysville junction
Data also contains R scripts used to preprocess the raw HEFS data. These raw data are too large for easy storage in a public repository (YRS has 30+ modeled sites) but are available upon reasonable request from: Zach Brodeur, zpb4@cornell.edu
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository hosts a synthetic event-based optical flow dataset, meticulously designed to simulate diverse environments under varying weather conditions using the CARLA simulator. The dataset is specifically tailored for autonomous field vehicles, featuring event streams, grayscale images, and corresponding ground truth optical flow.
In addition to the dataset, the accompanying repository provides a user-friendly pipeline for generating custom datasets, including optical flow displacements and grayscale images. The generated data leverages the optimized eWiz framework, ensuring efficient storage, access, and processing.
The data generation pipeline can be utilized by cloning the eCARLA-scenes repository. Whether you're a researcher or developer, this resource is an ideal starting point for advancing event-based vision systems in real-world autonomous applications.
This dataset consists of a NIfTI file containing a collection of 500 images, each capturing the central axial slice of a synthetic brain MRI.
Accompanying this file is a CSV dataset that stores the corresponding labels for each image.
Each image within this dataset has been generated by PACGAN (Progressive Auxiliary Classifier Generative Adversarial Network), a framework designed and implemented by the AI for Medicine Research Group at the University of Bologna.
PACGAN is a generative adversarial network trained to generate high-resolution images belonging to different classes. In our work, we trained this framework on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, which contains brain MRI images of Alzheimer's disease (AD) patients and healthy controls (HC).
The implementation of the training algorithm can be found within our GitHub repository, with Docker containerization.
For further exploration, the pre-trained models are available within the Code Ocean capsule. These models can facilitate the generation of synthetic images for both classes and also aid in classifying new brain MRI images.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
GARD (Gustavo’s Awesome Runway Dataset) is the largest publicly available synthetic runway image dataset, built to support machine learning tasks in vision-based aircraft landing systems. It contains over 45,000 high-resolution (1024×1024) labeled images.
This dataset was created using Canny2Concrete, a modular open-source data augmentation pipeline leveraging ControlNet and Stable Diffusion XL. The generation process conditions on edge maps extracted from real-world template images and applies multiple stages of variation including weather, lighting, and occlusion effects.
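A hedged sketch of this kind of edge-conditioned generation using the diffusers library; the model IDs, prompt, and parameters below are assumptions for illustration, not the actual Canny2Concrete configuration.

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline

# Publicly available ControlNet (Canny) and SDXL base weights, used here as
# stand-ins for whatever models the Canny2Concrete pipeline actually loads.
controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet, torch_dtype=torch.float16).to("cuda")

# Edge map extracted from a real-world runway template image (hypothetical file).
template = cv2.imread("runway_template.png")
edges = cv2.Canny(template, 100, 200)
edges = np.stack([edges] * 3, axis=-1)            # 3-channel conditioning image
cond = Image.fromarray(edges)

# Generate a weather/lighting variation while keeping the runway geometry.
image = pipe(
    prompt="airport runway at dusk, heavy rain, wet asphalt, approach view",
    image=cond, controlnet_conditioning_scale=0.8,
    num_inference_steps=30).images[0]
image.save("synthetic_runway.png")
```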
Models trained with GARD have been shown to outperform or match those trained on existing synthetic datasets like LARD, especially in challenging segmentation tasks.
Each image includes:
- 📷 .png image file
- 🏷 .txt YOLO-format label
- 🧩 .mask.png segmentation mask
- 📄 .json full metadata, designed for full reproducibility (prompt, seed, label points, effects applied)