100+ datasets found

f
Data Sheet 2_Large language models generating synthetic clinical datasets: a...
frontiersin.figshare.com
xlsx
Updated Feb 5, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin (2025). Data Sheet 2_Large language models generating synthetic clinical datasets: a feasibility and comparative analysis with real-world perioperative data.xlsx [Dataset]. http://doi.org/10.3389/frai.2025.1533508.s002
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.3389/frai.2025.1533508.s002
Dataset updated
Feb 5, 2025
Dataset provided by
Frontiers
Authors
Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
BackgroundClinical data is instrumental to medical research, machine learning (ML) model development, and advancing surgical care, but access is often constrained by privacy regulations and missing data. Synthetic data offers a promising solution to preserve privacy while enabling broader data access. Recent advances in large language models (LLMs) provide an opportunity to generate synthetic data with reduced reliance on domain expertise, computational resources, and pre-training.ObjectiveThis study aims to assess the feasibility of generating realistic tabular clinical data with OpenAI’s GPT-4o using zero-shot prompting, and evaluate the fidelity of LLM-generated data by comparing its statistical properties to the Vital Signs DataBase (VitalDB), a real-world open-source perioperative dataset.MethodsIn Phase 1, GPT-4o was prompted to generate a dataset with qualitative descriptions of 13 clinical parameters. The resultant data was assessed for general errors, plausibility of outputs, and cross-verification of related parameters. In Phase 2, GPT-4o was prompted to generate a dataset using descriptive statistics of the VitalDB dataset. Fidelity was assessed using two-sample t-tests, two-sample proportion tests, and 95% confidence interval (CI) overlap.ResultsIn Phase 1, GPT-4o generated a complete and structured dataset comprising 6,166 case files. The dataset was plausible in range and correctly calculated body mass index for all case files based on respective heights and weights. Statistical comparison between the LLM-generated datasets and VitalDB revealed that Phase 2 data achieved significant fidelity. Phase 2 data demonstrated statistical similarity in 12/13 (92.31%) parameters, whereby no statistically significant differences were observed in 6/6 (100.0%) categorical/binary and 6/7 (85.71%) continuous parameters. Overlap of 95% CIs were observed in 6/7 (85.71%) continuous parameters.ConclusionZero-shot prompting with GPT-4o can generate realistic tabular synthetic datasets, which can replicate key statistical properties of real-world perioperative data. This study highlights the potential of LLMs as a novel and accessible modality for synthetic data generation, which may address critical barriers in clinical data access and eliminate the need for technical expertise, extensive computational resources, and pre-training. Further research is warranted to enhance fidelity and investigate the use of LLMs to amplify and augment datasets, preserve multivariate relationships, and train robust ML models.
i
Dataset of article: Synthetic Datasets Generator for Testing Information...
ieee-dataport.org
Updated Mar 13, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sandro Mendonça (2020). Dataset of article: Synthetic Datasets Generator for Testing Information Visualization and Machine Learning Techniques and Tools [Dataset]. http://doi.org/10.21227/5aeq-rr34
Explore at:
Unique identifier
https://doi.org/10.21227/5aeq-rr34
Dataset updated
Mar 13, 2020
Dataset provided by
IEEE Dataport
Authors
Sandro Mendonça
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset used in the article entitled 'Synthetic Datasets Generator for Testing Information Visualization and Machine Learning Techniques and Tools'. These datasets can be used to test several characteristics in machine learning and data processing algorithms.
f
Table1_Enhancing biomechanical machine learning with limited data:...
frontiersin.figshare.com
pdf
Updated Feb 14, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Carlo Dindorf; Jonas Dully; Jürgen Konradi; Claudia Wolf; Stephan Becker; Steven Simon; Janine Huthwelker; Frederike Werthmann; Johanna Kniepert; Philipp Drees; Ulrich Betz; Michael Fröhlich (2024). Table1_Enhancing biomechanical machine learning with limited data: generating realistic synthetic posture data using generative artificial intelligence.pdf [Dataset]. http://doi.org/10.3389/fbioe.2024.1350135.s001
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.3389/fbioe.2024.1350135.s001
Dataset updated
Feb 14, 2024
Dataset provided by
Frontiers
Authors
Carlo Dindorf; Jonas Dully; Jürgen Konradi; Claudia Wolf; Stephan Becker; Steven Simon; Janine Huthwelker; Frederike Werthmann; Johanna Kniepert; Philipp Drees; Ulrich Betz; Michael Fröhlich
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Objective: Biomechanical Machine Learning (ML) models, particularly deep-learning models, demonstrate the best performance when trained using extensive datasets. However, biomechanical data are frequently limited due to diverse challenges. Effective methods for augmenting data in developing ML models, specifically in the human posture domain, are scarce. Therefore, this study explored the feasibility of leveraging generative artificial intelligence (AI) to produce realistic synthetic posture data by utilizing three-dimensional posture data.Methods: Data were collected from 338 subjects through surface topography. A Variational Autoencoder (VAE) architecture was employed to generate and evaluate synthetic posture data, examining its distinguishability from real data by domain experts, ML classifiers, and Statistical Parametric Mapping (SPM). The benefits of incorporating augmented posture data into the learning process were exemplified by a deep autoencoder (AE) for automated feature representation.Results: Our findings highlight the challenge of differentiating synthetic data from real data for both experts and ML classifiers, underscoring the quality of synthetic data. This observation was also confirmed by SPM. By integrating synthetic data into AE training, the reconstruction error can be reduced compared to using only real data samples. Moreover, this study demonstrates the potential for reduced latent dimensions, while maintaining a reconstruction accuracy comparable to AEs trained exclusively on real data samples.Conclusion: This study emphasizes the prospects of harnessing generative AI to enhance ML tasks in the biomechanics domain.
S
Synthetic Data Generation Market Report
marketresearchforecast.com
doc, pdf, ppt
Updated Dec 8, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Market Research Forecast (2024). Synthetic Data Generation Market Report [Dataset]. https://www.marketresearchforecast.com/reports/synthetic-data-generation-market-1834
Explore at:
pdf, doc, pptAvailable download formats
Dataset updated
Dec 8, 2024
Dataset authored and provided by
Market Research Forecast
License
https://www.marketresearchforecast.com/privacy-policyhttps://www.marketresearchforecast.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The Synthetic Data Generation Marketsize was valued at USD 288.5 USD Million in 2023 and is projected to reach USD 1920.28 USD Million by 2032, exhibiting a CAGR of 31.1 % during the forecast period.Synthetic data generation stands for the generation of fake datasets that resemble real datasets with reference to their data distribution and patterns. It refers to the process of creating synthetic data points utilizing algorithms or models instead of conducting observations or surveys. There is one of its core advantages: it can maintain the statistical characteristics of the original data and remove the privacy risk of using real data. Further, with synthetic data, there is no limitation to how much data can be created, and hence, it can be used for extensive testing and training of machine learning models, unlike the case with conventional data, which may be highly regulated or limited in availability. It also helps in the generation of datasets that are comprehensive and include many examples of specific situations or contexts that may occur in practice for improving the AI system’s performance. The use of SDG significantly shortens the process of the development cycle, requiring less time and effort for data collection as well as annotation. It basically allows researchers and developers to be highly efficient in their discovery and development in specific domains like healthcare, finance, etc. Key drivers for this market are: Growing Demand for Data Privacy and Security to Fuel Market Growth. Potential restraints include: Lack of Data Accuracy and Realism Hinders Market Growth. Notable trends are: Growing Implementation of Touch-based and Voice-based Infotainment Systems to Increase Adoption of Intelligent Cars.
D
TiCaM: Synthetic Images Dataset
datasetninja.com
Updated May 23, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jigyasa Katrolia; Jason Raphael Rambach; Bruno Mirbach (2021). TiCaM: Synthetic Images Dataset [Dataset]. https://datasetninja.com/ticam-synthetic-images
Explore at:
Dataset updated
May 23, 2021
Dataset provided by
Dataset Ninja
Authors
Jigyasa Katrolia; Jason Raphael Rambach; Bruno Mirbach
License
https://spdx.org/licenses/https://spdx.org/licenses/
Description
TiCaM Synthectic Images: A Time-of-Flight In-Car Cabin Monitoring Dataset is a time-of-flight dataset of car in-cabin images providing means to test extensive car cabin monitoring systems based on deep learning methods. The authors provide a synthetic image dataset of car cabin images similar to the real dataset leveraging advanced simulation software’s capability to generate abundant data with little effort. This can be used to test domain adaptation between synthetic and real data for select classes. For both datasets the authors provide ground truth annotations for 2D and 3D object detection, as well as for instance segmentation.
Synthea synthetic patient generator data in OMOP Common Data Model
registry.opendata.aws
Updated Jan 4, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Amazon Web Sevices (2023). Synthea synthetic patient generator data in OMOP Common Data Model [Dataset]. https://registry.opendata.aws/synthea-omop/
Explore at:
Dataset updated
Jan 4, 2023
Dataset provided by
Amazon.comhttp://amazon.com/
Description
The Synthea generated data is provided here as a 1,000 person (1k), 100,000 person (100k), and 2,800,000 persom (2.8m) data sets in the OMOP Common Data Model format. SyntheaTM is a synthetic patient generator that models the medical history of synthetic patients. Our mission is to output high-quality synthetic, realistic but not real, patient data and associated health records covering every aspect of healthcare. The resulting data is free from cost, privacy, and security restrictions. It can be used without restriction for a variety of secondary uses in academia, research, industry, and government (although a citation would be appreciated). You can read our first academic paper here: https://doi.org/10.1093/jamia/ocx079
Australian synthetic healthcare data with Synthea
data.csiro.au
Updated Jul 4, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Australian synthetic healthcare data with Synthea [Dataset]. https://data.csiro.au/collection/csiro:61499
Explore at:
Unique identifier
https://doi.org/10.25919/efcw-bm49
Dataset updated
Jul 4, 2024
Dataset provided by
CSIROhttp://www.csiro.au/
Authors
Ibrahima Diouf; Mitchell O'Brien; Hamed Hassanzadeh; Donna Truran; Hoa Ngo; Parnesh Raniga; Denis Bauer; David Hansen; Sankalp Khanna; Roc Reguant Comellas; Michael Lawley; John Grimes
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Australia
Dataset funded by
CSIROhttp://www.csiro.au/
Description
We developed an Australianised version of Synthea. Synthea is a synthetic data generation software that uses publicly available population aggregate statistics such as demographics, disease prevalence and incidence rates, and health reports. Synthea generates data based on manually curated models of clinical workflows and disease progression that cover a patient’s entire life and does not use real patient data; guaranteeing a completely synthetic dataset. We generated 117,258 synthetic patients from Queensland.
Synthetic Cohort for VHA Innovation Ecosystem and precisionFDA COVID-19 Risk...
catalog.data.gov
data.va.gov
+2more
Updated Apr 25, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Department of Veterans Affairs (2021). Synthetic Cohort for VHA Innovation Ecosystem and precisionFDA COVID-19 Risk Factor Modeling Challenge [Dataset]. https://catalog.data.gov/dataset/synthetic-cohort-for-vha-innovation-ecosystem-and-precisionfda-covid-19-risk-factor-modeli
Explore at:
Dataset updated
Apr 25, 2021
Dataset provided by
United States Department of Veterans Affairshttp://va.gov/
Description
The dataset is a synthetic cohort for use for the VHA Innovation Ecosystem and precisionFDA COVID-19 Risk Factor Modeling Challenge. The dataset was generated using Synthea, a tool created by MITRE to generate synthetic electronic health records (EHRs) from curated care maps and publicly available statistics. This dataset represents 147,451 patients developed using the COVID-19 module. The dataset format conforms to the CSV file outputs. Below are links to all relevant information. PrecisionFDA Challenge: https://precision.fda.gov/challenges/11 Synthea hompage: https://synthetichealth.github.io/synthea/ Synethea GitHub repository: https://github.com/synthetichealth/synthea Synthea COVID-19 Module publication: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7531559/ CSV File Format Data Dictionary: https://github.com/synthetichealth/synthea/wiki/CSV-File-Data-Dictionary
d
Scrambled text: training Language Models to correct OCR errors using...
b2find.dkrz.de
Updated Oct 15, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). Scrambled text: training Language Models to correct OCR errors using synthetic data - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/1ea0205e-de3a-54e7-a918-fde36ad3156f
Explore at:
Dataset updated
Oct 15, 2024
Description
This data repository contains the key datasets required to reproduce the paper "Scrambled text: training Language Models to correct OCR errors using synthetic data".In addition it contains the 10,000 synthetic 19th century articles generated using GPT4o. These articles are available both as a csv with the prompt parameters as columns as well as the articles as individual text files.The files in the repository are as followsncse_hf_dataset: A huggingface dictionary dataset containing 91 articles from the Nineteenth Century Serials Edition (NCSE) with original OCR and the transcribed groundtruth. This dataset is used as the testset in the papersynth_gt.zip: A zip file containing 5 parquet files of training data from the 10,000 synthetic articles. The each parquet file is made up of observations of a fixed length of tokens, for a total of 2 Million tokens. The observation lengths are 200, 100, 50, 25, 10.synthetic_articles.zip: A zip file containing the csv of all the synthetic articles and the prompts used to generate them.synthetic_articles_text.zip: A zip file containing the text files of all the synthetic articles. The file names are the prompt parameters and the id reference from the synthetic article csv.The data in this repo is used by the code repositories associated with the project https://github.com/JonnoB/scrambledtext_analysishttps://github.com/JonnoB/training_lms_with_synthetic_data
d
Synthetic version of anonymized Norway Registry data containing...
search.dataone.org
dataverse.azure.uit.no
+2more
Updated Sep 25, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Chauhan, Pavitra (2024). Synthetic version of anonymized Norway Registry data containing prescriptions and hospitalization of the patients [Dataset]. http://doi.org/10.18710/YABAGM
Explore at:
Unique identifier
https://doi.org/10.18710/YABAGM
Dataset updated
Sep 25, 2024
Dataset provided by
DataverseNO
Authors
Chauhan, Pavitra
Time period covered
Jan 1, 2011 - Jan 1, 2013
Description
This dataset represents synthetic data derived from anonymized Norwegian Registry Data of pa aged 65 and above from 2011 to 2013. It includes the Norwegian Patient Registry (NPR), which contains hospitalization details, and the Norwegian Prescription Database (NorPD), which contains prescription details. The NPR and NorPD datasets are combined into a single CSV file. This real dataset was part of a project to study medication use in the elderly and its association with hospitalization. The project has ethical approval from the Regional Committees for Medical and Health Research Ethics in Norway (REK-Nord number: 2014/2182). The dataset was anonymized to ensure that the synthetic version could not reasonably be identical to any real-life individuals. The anonymization process was done as follows: first, only relevant information was kept from the original data set. Second, individuals' birth year and gender were replaced with randomly generated values within a plausible range of values. And last, all dates were replaced with randomly generated dates. This dataset was sufficiently scrambled to generate a synthetic dataset and was only used for the current study. The dataset has details related to Patient, Prescriber, Hospitalization, Diagnosis, Location, Medications, Prescriptions, and Prescriptions dispatched. A publication using this data to create a machine learning model for predicting hospitalization risk is under review.
6DOF pose estimation - synthetically generated dataset using BlenderProc
data.niaid.nih.gov
datadryad.org
zip
Updated Nov 26, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Divyam Sheth (2023). 6DOF pose estimation - synthetically generated dataset using BlenderProc [Dataset]. http://doi.org/10.5061/dryad.rbnzs7hj5
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.rbnzs7hj5
Dataset updated
Nov 26, 2023
Dataset provided by
Dwarkadas J. Sanghvi College of Engineering
Authors
Divyam Sheth
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
Accurate and robust 6DOF (Six Degrees of Freedom) pose estimation is a critical task in various fields, including computer vision, robotics, and augmented reality. This research paper presents a novel approach to enhance the accuracy and reliability of 6DOF pose estimation by introducing a robust method for generating synthetic data and leveraging the ease of multi-class training using the generated dataset. The proposed method tackles the challenge of insufficient real-world annotated data by creating a large and diverse synthetic dataset that accurately mimics real-world scenarios. The proposed method only requires a CAD model of the object and there is no limit to the number of unique data that can be generated. Furthermore, a multi-class training strategy that harnesses the synthetic dataset's diversity is proposed and presented. This approach mitigates class imbalance issues and significantly boosts accuracy across varied object classes and poses. Experimental results underscore the method's effectiveness in challenging conditions, highlighting its potential for advancing 6DOF pose estimation across diverse applications. Our approach only uses a single RGB frame and is real-time. Methods This dataset has been synthetically generated using 3D software like Blender and APIs like Blendeproc.
d
Syntegra Synthetic EHR Data | Structured Healthcare Electronic Health Record...
datarade.ai
Updated Feb 23, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Syntegra (2022). Syntegra Synthetic EHR Data | Structured Healthcare Electronic Health Record Data [Dataset]. https://datarade.ai/data-products/syntegra-synthetic-ehr-data-structured-healthcare-electroni-syntegra
Explore at:
.bin, .json, .xml, .csv, .xls, .sql, .txtAvailable download formats
Dataset updated
Feb 23, 2022
Dataset authored and provided by
Syntegra
Area covered
United States of America
Description
Organizations can license synthetic, structured data generated by Syntegra from electronic health record systems of community hospitals across the United States, reaching beyond just claims and Rx data.

The synthetic data provides a detailed picture of the patient's journey throughout their hospital stay, including patient demographic information and payer type, as well as rich data not found in any other sources. Examples of this data include: drugs given (timing and dosing), patient location (e.g., ICU, floor, ER), lab results (timing by day and hour), physician roles (e.g., surgeon, attending), medications given, and vital signs. The participating community hospitals with bed sizes ranging from 25 to 532 provide unique visibility and assessment of variation in care outside of large academic medical centers and healthcare networks.

Our synthetic data engine is trained on a broadly representative dataset made up of deep clinical information of approximately 6 million unique patient records and 18 million encounters over 5 years of history. Notably, synthetic data generation allows for the creation of any number of records needed to power your project.

EHR data is available in the following formats: — Cleaned, analytics-ready (a layer of clean and normalized concepts in Tuva Health’s standard relational data model format — FHIR USCDI (labs, medications, vitals, encounters, patients, etc.)

The synthetic data maintains full statistical accuracy, yet does not contain any actual patients, thus removing any patient privacy liability risk. Privacy is preserved in a way that goes beyond HIPAA or GDPR compliance. Our industry-leading metrics prove that both privacy and fidelity are fully maintained.

— Generate the data needed for product development, testing, demo, or other needs — Access data at a scalable price point — Build your desired population, both in size and demographics — Scale up and down to fit specific needs, increasing efficiency and affordability

Syntegra's synthetic data engine also has the ability to augment the original data: — Expand population sizes, rare cohorts, or outcomes of interest — Address algorithmic fairness by correcting bias or introducing intentional bias — Conditionally generate data to inform scenario planning — Impute missing value to minimize gaps in the data
Synthetic Design-Related Data Generated by LLMs
figshare.com
txt
Updated Aug 24, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yunjian Qiu (2024). Synthetic Design-Related Data Generated by LLMs [Dataset]. http://doi.org/10.6084/m9.figshare.26122543.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.26122543.v1
Dataset updated
Aug 24, 2024
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Yunjian Qiu
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
To produce a domain-specific dataset, GPT-4 is assigned the role of an engineering design expert. Furthermore, the ontology, which signifies the design process and design entities, is integrated into the prompts to label the synthetic dataset and enhance the GPT model's grasp of the conceptual design process and domain-specific knowledge. Additionally, the CoT prompting technique compels the GPT models to clarify their reasoning process, thereby fostering a deeper understanding of the tasks.
replicAnt - Plum2023 - Detection & Tracking Datasets and Trained Networks
zenodo.org
data.niaid.nih.gov
zip
Updated Apr 21, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Fabian Plum; Fabian Plum; René Bulla; Hendrik Beck; Hendrik Beck; Natalie Imirzian; Natalie Imirzian; David Labonte; David Labonte; René Bulla (2023). replicAnt - Plum2023 - Detection & Tracking Datasets and Trained Networks [Dataset]. http://doi.org/10.5281/zenodo.7849417
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.7849417
Dataset updated
Apr 21, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Fabian Plum; Fabian Plum; René Bulla; Hendrik Beck; Hendrik Beck; Natalie Imirzian; Natalie Imirzian; David Labonte; David Labonte; René Bulla
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains all recorded and hand-annotated as well as all synthetically generated data as well as representative trained networks used for detection and tracking experiments in the replicAnt - generating annotated images of animals in complex environments using Unreal Engine manuscript. Unless stated otherwise, all 3D animal models used in the synthetically generated data have been generated with the open-source photgrammetry platform scAnt peerj.com/articles/11155/. All synthetic data has been generated with the associated replicAnt project available from https://github.com/evo-biomech/replicAnt.

Abstract:

Deep learning-based computer vision methods are transforming animal behavioural research. Transfer learning has enabled work in non-model species, but still requires hand-annotation of example footage, and is only performant in well-defined conditions. To overcome these limitations, we created replicAnt, a configurable pipeline implemented in Unreal Engine 5 and Python, designed to generate large and variable training datasets on consumer-grade hardware instead. replicAnt places 3D animal models into complex, procedurally generated environments, from which automatically annotated images can be exported. We demonstrate that synthetic data generated with replicAnt can significantly reduce the hand-annotation required to achieve benchmark performance in common applications such as animal detection, tracking, pose-estimation, and semantic segmentation; and that it increases the subject-specificity and domain-invariance of the trained networks, so conferring robustness. In some applications, replicAnt may even remove the need for hand-annotation altogether. It thus represents a significant step towards porting deep learning-based computer vision tools to the field.

Benchmark data

Two video datasets were curated to quantify detection performance; one in laboratory and one in field conditions. The laboratory dataset consists of top-down recordings of foraging trails of Atta vollenweideri (Forel 1893) leaf-cutter ants. The colony was collected in Uruguay in 2014, and housed in a climate chamber at 25°C and 60% humidity. A recording box was built from clear acrylic, and placed between the colony nest and a box external to the climate chamber, which functioned as feeding site. Bramble leaves were placed in the feeding area prior to each recording session, and ants had access to the recording area at will. The recorded area was 104 mm wide and 200 mm long. An OAK-D camera (OpenCV AI Kit: OAK-D, Luxonis Holding Corporation) was positioned centrally 195 mm above the ground. While keeping the camera position constant, lighting, exposure, and background conditions were varied to create recordings with variable appearance: The “base” case is an evenly lit and well exposed scene with scattered leaf fragments on an otherwise plain white backdrop. A “bright” and “dark” case are characterised by systematic over- or underexposure, respectively, which introduces motion blur, colour-clipped appendages, and extensive flickering and compression artefacts. In a separate well exposed recording, the clear acrylic backdrop was substituted with a printout of a highly textured forest ground to create a “noisy” case. Last, we decreased the camera distance to 100 mm at constant focal distance, effectively doubling the magnification, and yielding a “close” case, distinguished by out-of-focus workers. All recordings were captured at 25 frames per second (fps).

The field datasets consists of video recordings of Gnathamitermes sp. desert termites, filmed close to the nest entrance in the desert of Maricopa County, Arizona, using a Nikon D850 and a Nikkor 18-105 mm lens on a tripod at camera distances between 20 cm to 40 cm. All video recordings were well exposed, and captured at 23.976 fps.

Each video was trimmed to the first 1000 frames, and contains between 36 and 103 individuals. In total, 5000 and 1000 frames were hand-annotated for the laboratory- and field-dataset, respectively: each visible individual was assigned a constant size bounding box, with a centre coinciding approximately with the geometric centre of the thorax in top-down view. The size of the bounding boxes was chosen such that they were large enough to completely enclose the largest individuals, and was automatically adjusted near the image borders. A custom-written Blender Add-on aided hand-annotation: the Add-on is a semi-automated multi animal tracker, which leverages blender’s internal contrast-based motion tracker, but also include track refinement options, and CSV export functionality. Comprehensive documentation of this tool and Jupyter notebooks for track visualisation and benchmarking is provided on the replicAnt and BlenderMotionExport GitHub repositories.

Synthetic data generation

Two synthetic datasets, each with a population size of 100, were generated from 3D models of \textit{Atta vollenweideri} leaf-cutter ants. All 3D models were created with the scAnt photogrammetry workflow. A “group” population was based on three distinct 3D models of an ant minor (1.1 mg), a media (9.8 mg), and a major (50.1 mg) (see 10.5281/zenodo.7849059)). To approximately simulate the size distribution of A. vollenweideri colonies, these models make up 20%, 60%, and 20% of the simulated population, respectively. A 33% within-class scale variation, with default hue, contrast, and brightness subject material variation, was used. A “single” population was generated using the major model only, with 90% scale variation, but equal material variation settings.

A Gnathamitermes sp. synthetic dataset was generated from two hand-sculpted models; a worker and a soldier made up 80% and 20% of the simulated population of 100 individuals, respectively with default hue, contrast, and brightness subject material variation. Both 3D models were created in Blender v3.1, using reference photographs.

Each of the three synthetic datasets contains 10,000 images, rendered at a resolution of 1024 by 1024 px, using the default generator settings as documented in the Generator_example level file (see documentation on GitHub). To assess how the training dataset size affects performance, we trained networks on 100 (“small”), 1,000 (“medium”), and 10,000 (“large”) subsets of the “group” dataset. Generating 10,000 samples at the specified resolution took approximately 10 hours per dataset on a consumer-grade laptop (6 Core 4 GHz CPU, 16 GB RAM, RTX 2070 Super).

Additionally, five datasets which contain both real and synthetic images were curated. These “mixed” datasets combine image samples from the synthetic “group” dataset with image samples from the real “base” case. The ratio between real and synthetic images across the five datasets varied between 10/1 to 1/100.

Funding

This study received funding from Imperial College’s President’s PhD Scholarship (to Fabian Plum), and is part of a project that has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant agreement No. 851705, to David Labonte). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
g
Synthetic datasets
generated.photos
Updated Jun 25, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Generated Media, Inc. (2024). Synthetic datasets [Dataset]. https://generated.photos/datasets
Explore at:
Dataset updated
Jun 25, 2024
Dataset authored and provided by
Generated Media, Inc.
Description
100% synthetic. Based on model-released photos. Can be used for any purpose except for the ones violating the law. Worldwide. Different backgrounds: colored, transparent, photographic. Diversity: ethnicity, demographics, facial expressions, and poses.
R
Dataset for publication: Usefulness of synthetic datasets for diatom...
entrepot.recherche.data.gouv.fr
dorel.univ-lorraine.fr
+1more
bin, jpeg +3
Updated Jul 21, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Martin Laviale; Martin Laviale; Aishwarya Venkataramanan; Aishwarya Venkataramanan (2023). Dataset for publication: Usefulness of synthetic datasets for diatom automatic detection using a deep-learning approach [Dataset]. http://doi.org/10.12763/UADENQ
Explore at:
text/x-python(1957), text/x-python(4882), jpeg(7239), tsv(1716), text/markdown(2269), bin(1530), bin(456), text/x-python(8545), text/x-python(652), bin(50188610), text/x-python(3391), text/x-python(12356)Available download formats
Unique identifier
https://doi.org/10.12763/UADENQ
Dataset updated
Jul 21, 2023
Dataset provided by
Recherche Data Gouv
Authors
Martin Laviale; Martin Laviale; Aishwarya Venkataramanan; Aishwarya Venkataramanan
License
https://spdx.org/licenses/etalab-2.0.htmlhttps://spdx.org/licenses/etalab-2.0.html
Description
This repository contains the dataset and code used to generate synthetic dataset as explained in the paper "Usefulness of synthetic datasets for diatom automatic detection using a deep-learning approach". Dataset : The dataset consists of two components: individual diatom images extracted from publicly available diatom atlases [1,2,3] and individual debris images. - Individual diatom images : currently, the repository consists of 166 diatom species, totalling 9230 images. These images were automatically extracted from atlases using PDF scraping, cleaned and verified by diatom taxonomists. The subfolders within each diatom specie indicates the origin of the images: RA[1], IDF[2], BRG[3]. Additional diatom species and images will be regularly updated in the repository. - Individual debris images : the debris images were extracted from real microscopy images. The repository contains 600 debris objects. Code : Contains the code used to generate synthetic microscopy images. For details on how to use the code, kindly refer to the README file available in synthetic_data_generator/.
S
Global Synthetic Data Solution Market Growth Opportunities 2025-2032
statsndata.org
excel, pdf
Updated Feb 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Stats N Data (2025). Global Synthetic Data Solution Market Growth Opportunities 2025-2032 [Dataset]. https://www.statsndata.org/report/synthetic-data-solution-market-311381
Explore at:
pdf, excelAvailable download formats
Dataset updated
Feb 2025
Dataset authored and provided by
Stats N Data
License
https://www.statsndata.org/how-to-orderhttps://www.statsndata.org/how-to-order
Area covered
Global
Description
The Synthetic Data Solution market is rapidly emerging as a transformative force across various industries, providing organizations with the ability to generate artificial data that closely mimics real-world scenarios. This innovative approach to data generation is proving to be invaluable in sectors like healthcare
Z
Data from: Synthetic Multimodal Dataset for Daily Life Activities
data.niaid.nih.gov
zenodo.org
Updated Jan 29, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Fukuda, Ken (2024). Synthetic Multimodal Dataset for Daily Life Activities [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8046266
Explore at:
Dataset updated
Jan 29, 2024
Dataset provided by
Kozaki, Kouji
Swe Nwe Nwe Htun
Egami, Shusaku
Fukuda, Ken
Ugai, Takanori
Kawamura, Takahiro
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Outline

This dataset is originally created for the Knowledge Graph Reasoning Challenge for Social Issues (KGRC4SI)

Video data that simulates daily life actions in a virtual space from Scenario Data.

Knowledge graphs, and transcriptions of the Video Data content ("who" did what "action" with what "object," when and where, and the resulting "state" or "position" of the object).

Knowledge Graph Embedding Data are created for reasoning based on machine learning

This data is open to the public as open data

Details

Videos

mp4 format

203 action scenarios

For each scenario, there is a character rear view (file name ending in 0), an indoor camera switching view (file name ending in 1), and a fixed camera view placed in each corner of the room (file name ending in 2-5). Also, for each action scenario, data was generated for a minimum of 1 to a maximum of 7 patterns with different room layouts (scenes). A total of 1,218 videos

Videos with slowly moving characters simulate the movements of elderly people.

Knowledge Graphs

RDF format

203 knowledge graphs corresponding to the videos

Includes schema and location supplement information

The schema is described below

SPARQL endpoints and query examples are available

Script Data

txt format

Data provided to VirtualHome2KG to generate videos and knowledge graphs

Includes the action title and a brief description in text format.

Embedding

Embedding Vectors in TransE, ComplEx, and RotatE. Created with DGL-KE (https://dglke.dgl.ai/doc/)

Embedding Vectors created with jRDF2vec (https://github.com/dwslab/jRDF2Vec).

Specification of Ontology

Please refer to the specification for descriptions of all classes, instances, and properties: https://aistairc.github.io/VirtualHome2KG/vh2kg_ontology.htm

Related Resources

KGRC4SI Final Presentations with automatic English subtitles (YouTube)

VirtualHome2KG (Software)

VirtualHome-AIST (Unity)

VirtualHome-AIST (Python API)

Visualization Tool (Software)

Script Editor (Software)
Kimberlina 1.2 CCUS Geophysical Models and Synthetic Data Sets
osti.gov
Updated Sep 14, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Energy Technology Laboratory (NETL), Pittsburgh, PA, Morgantown, WV (United States) (2022). Kimberlina 1.2 CCUS Geophysical Models and Synthetic Data Sets [Dataset]. http://doi.org/10.18141/1887287
Explore at:
Unique identifier
https://doi.org/10.18141/1887287
Dataset updated
Sep 14, 2022
Dataset provided by
National Energy Technology Laboratoryhttps://netl.doe.gov/
United States Department of Energyhttp://energy.gov/
Description
This synthetic multi-scale and multi-physics data set was produced in collaboration with teams at the Lawrence Berkeley National Laboratory, National Energy Technology Laboratory, Los Alamos National Laboratory, and Colorado School of Mines through the Science-informed Machine Learning for Accelerating Real-Time Decisions in Subsurface Applications (SMART) Initiative. Data are associated with the following publication: Alumbaugh, D., Gasperikova, E., Crandall, D., Commer, M., Feng, S., Harbert, W., Li, Y., Lin, Y., and Samarasinghe, S., “The Kimberlina Synthetic Geophysical Model and Data Set for CO2 Monitoring Investigations”, The Geoscience Data Journal, 2023, DOI: 10.1002/gdj3.191. The dataset uses the Kimberlina 1.2 CO2 reservoir flow model simulations based on a hypothetical CO2 storage site in California (Birkholzer et al., 2011; Wainwright et al., 2013). Geophysical properties models (P- and S-wave seismic velocities, saturated density, and electrical resistivity) were produced with an approach similar to that of Yang et al. (2019) and Gasperikova et al. (2022) for 100 Kimberlina 1.2 reservoir models. Links to individual resources are provided below: CO2 Saturation Models; Resistivity Models – part 1, part 2, and part 3; Vp Velocity Models; Vs Velocity Models; Density Models. The 3D distributions of geophysical properties for the 33 time stamps of the SIM001 model were used to generate synthetic seismic, gravity, and electromagnetic (EM) responses for 33 times between zero and 200 years. Synthetic surface seismic data were generated using 2D and 3D finite-difference codes that simulate the acoustic wave equation (Moczo et al., 2007). 2D data were simulated for six point-pressure sources along a 2D line with 10 m receiver spacing and a time spacing of 0.0005 s. 3D simulations were completed for 25 surface pressure sources using a source separation of 1 km in both the x and y directions and a time spacing of 0.001 s. Links to individual resources are provided below: 2D velocity models and 2D surface seismic data. 3D velocity models, and 3D seismic data year0, year1, year2, year5, year10, year15, year20, year25, year30, year35, year40, year45, year49, year50, year51, year52, year55, year60, year65, year70, year75, year80, year85, year90, year95, year100, year110, year120, year130, year140, year150, year175, year200. EM simulations used a borehole-to-surface survey configuration, with the source located near the reservoir level and receivers on the surface using the code developed by Commer and Newman (2008). Pseudo-2D data for the source at 2500 m and 3025 m, used a 2D inline receiver configuration to simulate a response over 3D resistivity models. The 3D data contain electric fields generated by borehole sources at monitoring well locations and measured over a surface receiver grid. Vector gravity data, both on the surface and in boreholes, were simulated using a modeling code developed by Rim and Li (2015). The simulation scenarios were parallel to those used for the EM: pseudo-2D data were calculated along the same lines and within the same boreholes, and 3D data were simulated over 3D models on the surface and in three monitoring wells. A series of synthetic well logs of CO2 saturation, acoustic velocity, density, and induction resistivity in the injection well and three monitoring wells are also provided at 0, 1, 2, 5, 10, 15, and 20 years after the initiation of injection. These were constructed by combining the low-frequency trend of the geophysical models with the high-frequency variations of actual well logs collected in the Kimberlina 1 well that was drilled at the proposed site. Measurements of permeability and pore connectivity were made on cores of Vedder Sandstone, which forms the primary reservoir unit: CT micro scans and [Industrial CT
SASC: A Simple Approach to Synthetic Cohorts. Applying COVID-19 clinical...
zenodo.org
data.niaid.nih.gov
bin
Updated Oct 2, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zaliani Andrea; Zaliani Andrea (2021). SASC: A Simple Approach to Synthetic Cohorts. Applying COVID-19 clinical data to generate longitudinal observational patient cohorts and comparison with alternative synthetic cohort approaches as well as real patient data [Dataset]. http://doi.org/10.5281/zenodo.5544057
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.5544057
Dataset updated
Oct 2, 2021
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Zaliani Andrea; Zaliani Andrea
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Subset from COVID-19 Dataset from https://zenodo.org/record/3766350#.YVcfyTFBxgA. Used as reference for a publication dealing with synthetic patient cohort generation.

Facebook

Twitter

Click to copy link

Link copied

Cite

Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin (2025). Data Sheet 2_Large language models generating synthetic clinical datasets: a feasibility and comparative analysis with real-world perioperative data.xlsx [Dataset]. http://doi.org/10.3389/frai.2025.1533508.s002

Data Sheet 2_Large language models generating synthetic clinical datasets: a feasibility and comparative analysis with real-world perioperative data.xlsx

Explore at:

xlsxAvailable download formats

Unique identifier

https://doi.org/10.3389/frai.2025.1533508.s002

Dataset updated

Feb 5, 2025

Dataset provided by

Frontiers

Authors

Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

BackgroundClinical data is instrumental to medical research, machine learning (ML) model development, and advancing surgical care, but access is often constrained by privacy regulations and missing data. Synthetic data offers a promising solution to preserve privacy while enabling broader data access. Recent advances in large language models (LLMs) provide an opportunity to generate synthetic data with reduced reliance on domain expertise, computational resources, and pre-training.ObjectiveThis study aims to assess the feasibility of generating realistic tabular clinical data with OpenAI’s GPT-4o using zero-shot prompting, and evaluate the fidelity of LLM-generated data by comparing its statistical properties to the Vital Signs DataBase (VitalDB), a real-world open-source perioperative dataset.MethodsIn Phase 1, GPT-4o was prompted to generate a dataset with qualitative descriptions of 13 clinical parameters. The resultant data was assessed for general errors, plausibility of outputs, and cross-verification of related parameters. In Phase 2, GPT-4o was prompted to generate a dataset using descriptive statistics of the VitalDB dataset. Fidelity was assessed using two-sample t-tests, two-sample proportion tests, and 95% confidence interval (CI) overlap.ResultsIn Phase 1, GPT-4o generated a complete and structured dataset comprising 6,166 case files. The dataset was plausible in range and correctly calculated body mass index for all case files based on respective heights and weights. Statistical comparison between the LLM-generated datasets and VitalDB revealed that Phase 2 data achieved significant fidelity. Phase 2 data demonstrated statistical similarity in 12/13 (92.31%) parameters, whereby no statistically significant differences were observed in 6/6 (100.0%) categorical/binary and 6/7 (85.71%) continuous parameters. Overlap of 95% CIs were observed in 6/7 (85.71%) continuous parameters.ConclusionZero-shot prompting with GPT-4o can generate realistic tabular synthetic datasets, which can replicate key statistical properties of real-world perioperative data. This study highlights the potential of LLMs as a novel and accessible modality for synthetic data generation, which may address critical barriers in clinical data access and eliminate the need for technical expertise, extensive computational resources, and pre-training. Further research is warranted to enhance fidelity and investigate the use of LLMs to amplify and augment datasets, preserve multivariate relationships, and train robust ML models.

Clear search

Close search

Google apps

Main menu

Data Sheet 2_Large language models generating synthetic clinical datasets: a...

Dataset of article: Synthetic Datasets Generator for Testing Information...

Table1_Enhancing biomechanical machine learning with limited data:...

Synthetic Data Generation Market Report

TiCaM: Synthetic Images Dataset

Synthea synthetic patient generator data in OMOP Common Data Model

Australian synthetic healthcare data with Synthea

Synthetic Cohort for VHA Innovation Ecosystem and precisionFDA COVID-19 Risk...

Scrambled text: training Language Models to correct OCR errors using...

Synthetic version of anonymized Norway Registry data containing...

6DOF pose estimation - synthetically generated dataset using BlenderProc

Syntegra Synthetic EHR Data | Structured Healthcare Electronic Health Record...

Synthetic Design-Related Data Generated by LLMs

replicAnt - Plum2023 - Detection & Tracking Datasets and Trained Networks

Synthetic datasets

Dataset for publication: Usefulness of synthetic datasets for diatom...

Global Synthetic Data Solution Market Growth Opportunities 2025-2032

Data from: Synthetic Multimodal Dataset for Daily Life Activities

Kimberlina 1.2 CCUS Geophysical Models and Synthetic Data Sets

SASC: A Simple Approach to Synthetic Cohorts. Applying COVID-19 clinical...

Data Sheet 2_Large language models generating synthetic clinical datasets: a feasibility and comparative analysis with real-world perioperative data.xlsxSee More Versions

Data Sheet 2_Large language models generating synthetic clinical datasets: a feasibility and comparative analysis with real-world perioperative data.xlsx