71 datasets found
  1. Data Sheet 2_Large language models generating synthetic clinical datasets: a...

    • frontiersin.figshare.com
    xlsx
    Updated Feb 5, 2025
    Cite
    Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin (2025). Data Sheet 2_Large language models generating synthetic clinical datasets: a feasibility and comparative analysis with real-world perioperative data.xlsx [Dataset]. http://doi.org/10.3389/frai.2025.1533508.s002
    Explore at:
    xlsx
    Dataset updated
    Feb 5, 2025
    Dataset provided by
    Frontiers
    Authors
    Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Clinical data is instrumental to medical research, machine learning (ML) model development, and advancing surgical care, but access is often constrained by privacy regulations and missing data. Synthetic data offers a promising solution to preserve privacy while enabling broader data access. Recent advances in large language models (LLMs) provide an opportunity to generate synthetic data with reduced reliance on domain expertise, computational resources, and pre-training.
    Objective: This study aims to assess the feasibility of generating realistic tabular clinical data with OpenAI’s GPT-4o using zero-shot prompting, and to evaluate the fidelity of LLM-generated data by comparing its statistical properties to the Vital Signs DataBase (VitalDB), a real-world open-source perioperative dataset.
    Methods: In Phase 1, GPT-4o was prompted to generate a dataset from qualitative descriptions of 13 clinical parameters. The resultant data was assessed for general errors, plausibility of outputs, and cross-verification of related parameters. In Phase 2, GPT-4o was prompted to generate a dataset using descriptive statistics of the VitalDB dataset. Fidelity was assessed using two-sample t-tests, two-sample proportion tests, and 95% confidence interval (CI) overlap.
    Results: In Phase 1, GPT-4o generated a complete and structured dataset comprising 6,166 case files. The dataset was plausible in range and correctly calculated body mass index for all case files from the respective heights and weights. Statistical comparison between the LLM-generated datasets and VitalDB showed that the Phase 2 data achieved high fidelity: it was statistically similar in 12/13 (92.31%) parameters, with no statistically significant differences in 6/6 (100.0%) categorical/binary and 6/7 (85.71%) continuous parameters. Overlap of 95% CIs was observed in 6/7 (85.71%) continuous parameters.
    Conclusion: Zero-shot prompting with GPT-4o can generate realistic tabular synthetic datasets that replicate key statistical properties of real-world perioperative data. This study highlights the potential of LLMs as a novel and accessible modality for synthetic data generation, which may address critical barriers in clinical data access and reduce the need for technical expertise, extensive computational resources, and pre-training. Further research is warranted to enhance fidelity and to investigate the use of LLMs to amplify and augment datasets, preserve multivariate relationships, and train robust ML models.
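    The Phase 2 fidelity checks lend themselves to a short sketch. Below is a minimal illustration (not the authors' code; the sample sizes, means, and variances are invented) of comparing one continuous parameter of a synthetic dataset against its real counterpart with a two-sample Welch t-test and a 95% CI overlap check, using only the standard library:

```python
# Illustrative sketch, NOT the study's actual code: two of the fidelity
# checks named in the abstract, applied to one made-up continuous parameter.
import math
import random

random.seed(0)
real = [random.gauss(60.0, 12.0) for _ in range(500)]       # e.g. real ages
synthetic = [random.gauss(60.5, 12.5) for _ in range(500)]  # synthetic ages

def mean_se(x):
    """Sample mean and standard error of the mean."""
    m = sum(x) / len(x)
    var = sum((v - m) ** 2 for v in x) / (len(x) - 1)
    return m, math.sqrt(var / len(x))

m_r, se_r = mean_se(real)
m_s, se_s = mean_se(synthetic)

# Welch t statistic; with n = 500 per arm, a normal approximation
# of the two-sided p-value is adequate.
t = (m_r - m_s) / math.sqrt(se_r**2 + se_s**2)
p_value = math.erfc(abs(t) / math.sqrt(2))

# 95% CI overlap: do the two intervals for the mean intersect?
ci = lambda m, se: (m - 1.96 * se, m + 1.96 * se)
(lo_r, hi_r), (lo_s, hi_s) = ci(m_r, se_r), ci(m_s, se_s)
overlap = max(lo_r, lo_s) <= min(hi_r, hi_s)
print(f"p = {p_value:.3f}, CI overlap: {overlap}")
```

    A non-significant p-value together with overlapping CIs is the pattern the study reports for most continuous parameters.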

  2. Table1_Enhancing biomechanical machine learning with limited data:...

    • frontiersin.figshare.com
    pdf
    Updated Feb 14, 2024
    Cite
    Carlo Dindorf; Jonas Dully; Jürgen Konradi; Claudia Wolf; Stephan Becker; Steven Simon; Janine Huthwelker; Frederike Werthmann; Johanna Kniepert; Philipp Drees; Ulrich Betz; Michael Fröhlich (2024). Table1_Enhancing biomechanical machine learning with limited data: generating realistic synthetic posture data using generative artificial intelligence.pdf [Dataset]. http://doi.org/10.3389/fbioe.2024.1350135.s001
    Explore at:
    pdf
    Dataset updated
    Feb 14, 2024
    Dataset provided by
    Frontiers
    Authors
    Carlo Dindorf; Jonas Dully; Jürgen Konradi; Claudia Wolf; Stephan Becker; Steven Simon; Janine Huthwelker; Frederike Werthmann; Johanna Kniepert; Philipp Drees; Ulrich Betz; Michael Fröhlich
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Objective: Biomechanical machine learning (ML) models, particularly deep-learning models, demonstrate the best performance when trained using extensive datasets. However, biomechanical data are frequently limited due to diverse challenges. Effective methods for augmenting data in developing ML models, specifically in the human posture domain, are scarce. Therefore, this study explored the feasibility of leveraging generative artificial intelligence (AI) to produce realistic synthetic posture data by utilizing three-dimensional posture data.
    Methods: Data were collected from 338 subjects through surface topography. A Variational Autoencoder (VAE) architecture was employed to generate and evaluate synthetic posture data, examining its distinguishability from real data by domain experts, ML classifiers, and Statistical Parametric Mapping (SPM). The benefits of incorporating augmented posture data into the learning process were exemplified by a deep autoencoder (AE) for automated feature representation.
    Results: Our findings highlight the challenge of differentiating synthetic data from real data for both experts and ML classifiers, underscoring the quality of the synthetic data; this observation was also confirmed by SPM. By integrating synthetic data into AE training, the reconstruction error can be reduced compared to using only real data samples. Moreover, this study demonstrates the potential for reduced latent dimensions while maintaining a reconstruction accuracy comparable to AEs trained exclusively on real data samples.
    Conclusion: This study emphasizes the prospects of harnessing generative AI to enhance ML tasks in the biomechanics domain.
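    Once a VAE is trained, generating a new synthetic sample reduces to drawing a latent vector (via the reparameterisation trick, z = mu + sigma * eps) and decoding it. The following sketch shows only that generation step; the dimensions and the stand-in linear decoder are invented for illustration and are not the paper's model:

```python
# Minimal sketch of VAE-style sample generation (assumed/toy setup, not the
# paper's architecture): draw a latent vector and decode it to a posture.
import random

random.seed(1)
LATENT_DIM, POSTURE_DIM = 8, 30  # hypothetical sizes, for illustration

# Stand-in decoder weights; a trained VAE would have learned these.
W = [[random.gauss(0, 0.1) for _ in range(LATENT_DIM)]
     for _ in range(POSTURE_DIM)]

def decode(z):
    """Map a latent vector to a synthetic 'posture' feature vector."""
    return [sum(w_ij * z_j for w_ij, z_j in zip(row, z)) for row in W]

# Reparameterisation trick: z = mu + sigma * eps, with eps ~ N(0, 1).
mu = [0.0] * LATENT_DIM
sigma = [1.0] * LATENT_DIM
eps = [random.gauss(0, 1) for _ in range(LATENT_DIM)]
z = [m + s * e for m, s, e in zip(mu, sigma, eps)]

synthetic_posture = decode(z)
print(len(synthetic_posture))  # one sample with POSTURE_DIM features
```

    The reduced-latent-dimension finding in the abstract corresponds to shrinking LATENT_DIM while keeping reconstruction error acceptable.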

  3. Syntegra Synthetic EHR Data | Structured Healthcare Electronic Health Record...

    • datarade.ai
    Updated Feb 23, 2022
    Cite
    Syntegra (2022). Syntegra Synthetic EHR Data | Structured Healthcare Electronic Health Record Data [Dataset]. https://datarade.ai/data-products/syntegra-synthetic-ehr-data-structured-healthcare-electroni-syntegra
    Explore at:
    .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Feb 23, 2022
    Dataset authored and provided by
    Syntegra
    Area covered
    United States of America
    Description

    Organizations can license synthetic, structured data generated by Syntegra from electronic health record systems of community hospitals across the United States, reaching beyond just claims and Rx data.

    The synthetic data provides a detailed picture of the patient's journey throughout their hospital stay, including patient demographic information and payer type, as well as rich data not found in any other sources. Examples of this data include: drugs given (timing and dosing), patient location (e.g., ICU, floor, ER), lab results (timing by day and hour), physician roles (e.g., surgeon, attending), medications given, and vital signs. The participating community hospitals with bed sizes ranging from 25 to 532 provide unique visibility and assessment of variation in care outside of large academic medical centers and healthcare networks.

    Our synthetic data engine is trained on a broadly representative dataset made up of deep clinical information of approximately 6 million unique patient records and 18 million encounters over 5 years of history. Notably, synthetic data generation allows for the creation of any number of records needed to power your project.

    EHR data is available in the following formats:
    — Cleaned, analytics-ready (a layer of clean and normalized concepts in Tuva Health’s standard relational data model)
    — FHIR USCDI (labs, medications, vitals, encounters, patients, etc.)

    The synthetic data maintains full statistical accuracy, yet does not contain any actual patients, thus removing any patient privacy liability risk. Privacy is preserved in a way that goes beyond HIPAA or GDPR compliance. Our industry-leading metrics prove that both privacy and fidelity are fully maintained.

    — Generate the data needed for product development, testing, demo, or other needs
    — Access data at a scalable price point
    — Build your desired population, both in size and demographics
    — Scale up and down to fit specific needs, increasing efficiency and affordability

    Syntegra's synthetic data engine can also augment the original data:
    — Expand population sizes, rare cohorts, or outcomes of interest
    — Address algorithmic fairness by correcting bias or introducing intentional bias
    — Conditionally generate data to inform scenario planning
    — Impute missing values to minimize gaps in the data

  4. Synthetic Cohort for VHA Innovation Ecosystem and precisionFDA COVID-19 Risk...

    • catalog.data.gov
    • data.va.gov
    • +2more
    Updated Apr 25, 2021
    Cite
    Department of Veterans Affairs (2021). Synthetic Cohort for VHA Innovation Ecosystem and precisionFDA COVID-19 Risk Factor Modeling Challenge [Dataset]. https://catalog.data.gov/dataset/synthetic-cohort-for-vha-innovation-ecosystem-and-precisionfda-covid-19-risk-factor-modeli
    Explore at:
    Dataset updated
    Apr 25, 2021
    Dataset provided by
    United States Department of Veterans Affairs, http://va.gov/
    Description

    The dataset is a synthetic cohort for use in the VHA Innovation Ecosystem and precisionFDA COVID-19 Risk Factor Modeling Challenge. The dataset was generated using Synthea, a tool created by MITRE to generate synthetic electronic health records (EHRs) from curated care maps and publicly available statistics. This dataset represents 147,451 patients developed using the COVID-19 module. The dataset format conforms to the CSV file outputs. Links to all relevant information:
    PrecisionFDA Challenge: https://precision.fda.gov/challenges/11
    Synthea homepage: https://synthetichealth.github.io/synthea/
    Synthea GitHub repository: https://github.com/synthetichealth/synthea
    Synthea COVID-19 Module publication: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7531559/
    CSV File Format Data Dictionary: https://github.com/synthetichealth/synthea/wiki/CSV-File-Data-Dictionary
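    Because the cohort ships as plain CSV, it can be explored with standard tooling. A hypothetical sketch of reading a Synthea-style patients file with the standard library; the column names below are illustrative placeholders, not the exact Synthea schema (consult the CSV File Data Dictionary linked above for the real column set):

```python
# Hypothetical example of filtering a Synthea-style CSV export.
# Column names are invented for illustration; see the official
# CSV File Data Dictionary for the actual schema.
import csv
import io

sample = io.StringIO(
    "Id,BIRTHDATE,GENDER,COVID19_STATUS\n"
    "p1,1955-03-02,F,positive\n"
    "p2,1980-11-17,M,negative\n"
)
patients = list(csv.DictReader(sample))
positives = [p["Id"] for p in patients if p["COVID19_STATUS"] == "positive"]
print(len(patients), positives)
```

    For the real dataset, replace the in-memory sample with `open("patients.csv")` and the placeholder columns with those documented in the data dictionary.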

  5. Data from: Synthetic Multimodal Dataset for Daily Life Activities

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 29, 2024
    Cite
    Fukuda, Ken (2024). Synthetic Multimodal Dataset for Daily Life Activities [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8046266
    Explore at:
    Dataset updated
    Jan 29, 2024
    Dataset provided by
    Kawamura, Takahiro
    Egami, Shusaku
    Fukuda, Ken
    Swe Nwe Nwe Htun
    Ugai, Takanori
    Kozaki, Kouji
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Outline

    This dataset was originally created for the Knowledge Graph Reasoning Challenge for Social Issues (KGRC4SI). It comprises:

    Video data that simulates daily life actions in a virtual space, generated from the scenario data.

    Knowledge graphs and transcriptions of the video content ("who" did what "action" with what "object," when and where, and the resulting "state" or "position" of the object).

    Knowledge graph embedding data, created for reasoning based on machine learning.

    This data is open to the public as open data.

    Details

    Videos

    mp4 format

    203 action scenarios

    For each scenario, there is a character rear view (file name ending in 0), an indoor camera switching view (file name ending in 1), and fixed camera views placed in each corner of the room (file names ending in 2-5). Also, for each action scenario, data was generated for between 1 and 7 patterns with different room layouts (scenes), giving a total of 1,218 videos.

    Videos with slowly moving characters simulate the movements of elderly people.

    Knowledge Graphs

    RDF format

    203 knowledge graphs corresponding to the videos

    Includes schema and location supplement information

    The schema is described below

    SPARQL endpoints and query examples are available

    Script Data

    txt format

    Data provided to VirtualHome2KG to generate videos and knowledge graphs

    Includes the action title and a brief description in text format.

    Embedding

    Embedding Vectors in TransE, ComplEx, and RotatE. Created with DGL-KE (https://dglke.dgl.ai/doc/)

    Embedding Vectors created with jRDF2vec (https://github.com/dwslab/jRDF2Vec).
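    The TransE embeddings listed above follow a simple scoring idea: a triple (head, relation, tail) is plausible when head + relation lands near tail in vector space. A toy sketch of that scoring function (the vectors below are invented, not the released embeddings):

```python
# Sketch of TransE triple scoring: score = -||h + r - t||, so higher
# (closer to zero) means a more plausible triple. Toy 2-D vectors only.
def transe_score(h, r, t):
    """Negative L2 distance between h + r and t."""
    return -sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)) ** 0.5

h = [0.1, 0.3]    # e.g. embedding of an entity ("person")
r = [0.2, -0.1]   # e.g. embedding of a relation ("performs")
t = [0.3, 0.2]    # e.g. embedding of an entity ("action")

good = transe_score(h, r, t)          # t ≈ h + r, score near 0
bad = transe_score(h, r, [5.0, 5.0])  # far away, strongly negative
print(good > bad)
```

    ComplEx and RotatE use different scoring functions (complex-valued products and rotations, respectively), but the released vectors are consumed the same way: score candidate triples and rank them.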

    Specification of Ontology

    Please refer to the specification for descriptions of all classes, instances, and properties: https://aistairc.github.io/VirtualHome2KG/vh2kg_ontology.htm

    Related Resources

    KGRC4SI Final Presentations with automatic English subtitles (YouTube)

    VirtualHome2KG (Software)

    VirtualHome-AIST (Unity)

    VirtualHome-AIST (Python API)

    Visualization Tool (Software)

    Script Editor (Software)

  6. B

    Open Data Training Workshop: Synthetic Data & The 2023 Pediatric Sepsis Data...

    • borealisdata.ca
    Updated Apr 18, 2023
    Cite
    Charly Huxford; Vuong Nguyen; Jessica Trawin; Teresa Johnson; Niranjan Kissoon; Matthew Wiens; Gina Ogilvie; Srinivas Murthy; Gurm Dhugga; Maggie Woo Kinshella; J Mark Ansermino (2023). Open Data Training Workshop: Synthetic Data & The 2023 Pediatric Sepsis Data Challenge [Dataset]. http://doi.org/10.5683/SP3/IVSKZ6
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Apr 18, 2023
    Dataset provided by
    Borealis
    Authors
    Charly Huxford; Vuong Nguyen; Jessica Trawin; Teresa Johnson; Niranjan Kissoon; Matthew Wiens; Gina Ogilvie; Srinivas Murthy; Gurm Dhugga; Maggie Woo Kinshella; J Mark Ansermino
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Dataset funded by
    Digital Research Alliance of Canada
    Description

    Objective(s): Momentum for open access to research is growing. Funding agencies and publishers increasingly require researchers to make their data and research outputs open and publicly available. However, this introduces many challenges, especially when managing confidential clinical data. The aim of this 1 hr virtual workshop is to provide participants with knowledge about what synthetic data is, methods to create synthetic data, and the 2023 Pediatric Sepsis Data Challenge. Workshop Agenda: 1. Introduction (Mark Ansermino, Director, Centre for International Child Health); 2. "Leveraging Synthetic Data for an International Data Challenge" (Charly Huxford, Research Assistant, Centre for International Child Health); 3. "Methods in Synthetic Data Generation" (Vuong Nguyen, Biostatistician, Centre for International Child Health and The HIPpy Lab). This workshop draws on work supported by the Digital Research Alliance of Canada. Data Description: Presentation slides, workshop video, and workshop communication: Charly Huxford's "Leveraging Synthetic Data for an International Data Challenge" presentation with accompanying PowerPoint slides, and Vuong Nguyen's "Methods in Synthetic Data Generation" presentation with accompanying PowerPoint slides. This workshop was developed as part of Dr. Ansermino's Data Champions Pilot Project supported by the Digital Research Alliance of Canada. NOTE for restricted files: If you are not yet a CoLab member, please complete our membership application survey to gain access to restricted files within 2 business days. Some files may remain restricted to CoLab members. These files are deemed more sensitive by the file owner and are meant to be shared on a case-by-case basis. Please contact the CoLab coordinator on this page under "Collaborate with the Pediatric Sepsis CoLab."

  7. Synthetic version of anonymized Norway Registry data containing...

    • search.dataone.org
    • dataverse.azure.uit.no
    • +2more
    Updated Sep 25, 2024
    Cite
    Chauhan, Pavitra (2024). Synthetic version of anonymized Norway Registry data containing prescriptions and hospitalization of the patients [Dataset]. http://doi.org/10.18710/YABAGM
    Explore at:
    Dataset updated
    Sep 25, 2024
    Dataset provided by
    DataverseNO
    Authors
    Chauhan, Pavitra
    Time period covered
    Jan 1, 2011 - Jan 1, 2013
    Description

    This dataset represents synthetic data derived from anonymized Norwegian registry data on patients aged 65 and above from 2011 to 2013. It includes the Norwegian Patient Registry (NPR), which contains hospitalization details, and the Norwegian Prescription Database (NorPD), which contains prescription details; the NPR and NorPD data are combined into a single CSV file. The real dataset was part of a project to study medication use in the elderly and its association with hospitalization. The project has ethical approval from the Regional Committees for Medical and Health Research Ethics in Norway (REK-Nord number: 2014/2182). The dataset was anonymized to ensure that the synthetic version could not reasonably be identical to any real-life individuals. The anonymization process was as follows: first, only relevant information was kept from the original dataset; second, individuals' birth year and gender were replaced with randomly generated values within a plausible range; and last, all dates were replaced with randomly generated dates. This dataset was sufficiently scrambled to generate a synthetic dataset and was used only for the current study. The dataset has details related to patient, prescriber, hospitalization, diagnosis, location, medications, prescriptions, and prescriptions dispatched. A publication using this data to create a machine learning model for predicting hospitalization risk is under review.
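    The three-step scrambling described above can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions, not the project's code: the field names are hypothetical, and the plausible ranges (birth years for a 65+ cohort, dates within the 2011-2013 study window) are inferred from the description:

```python
# Illustrative sketch of the described anonymization steps; field names
# and ranges are assumptions, not the actual registry schema.
import datetime
import random

random.seed(42)

# Step 1 already applied: only relevant fields are kept.
record = {"birth_year": 1941, "gender": "F",
          "admission_date": datetime.date(2012, 5, 3),
          "atc_code": "C09AA02"}  # clinical detail, kept as-is

def scramble(rec):
    out = dict(rec)
    # Step 2: random birth year within a plausible range (cohort is 65+).
    out["birth_year"] = random.randint(1920, 1948)
    out["gender"] = random.choice(["F", "M"])
    # Step 3: replace dates with random dates in the 2011-2013 window.
    start = datetime.date(2011, 1, 1)
    out["admission_date"] = start + datetime.timedelta(
        days=random.randrange(3 * 365))
    return out

synthetic = scramble(record)
print(synthetic["birth_year"], synthetic["admission_date"].year)
```

    The key design point is that clinically meaningful fields survive intact while the quasi-identifiers (birth year, gender, dates) are replaced, breaking linkage to real individuals.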

  8. replicAnt - Plum2023 - Detection & Tracking Datasets and Trained Networks

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Apr 21, 2023
    Cite
    Fabian Plum; Fabian Plum; René Bulla; Hendrik Beck; Hendrik Beck; Natalie Imirzian; Natalie Imirzian; David Labonte; David Labonte; René Bulla (2023). replicAnt - Plum2023 - Detection & Tracking Datasets and Trained Networks [Dataset]. http://doi.org/10.5281/zenodo.7849417
    Explore at:
    zip
    Dataset updated
    Apr 21, 2023
    Dataset provided by
    Zenodo, http://zenodo.org/
    Authors
    Fabian Plum; Fabian Plum; René Bulla; Hendrik Beck; Hendrik Beck; Natalie Imirzian; Natalie Imirzian; David Labonte; David Labonte; René Bulla
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains all recorded and hand-annotated data, all synthetically generated data, and representative trained networks used for the detection and tracking experiments in the manuscript "replicAnt - generating annotated images of animals in complex environments using Unreal Engine". Unless stated otherwise, all 3D animal models used in the synthetically generated data were generated with the open-source photogrammetry platform scAnt (peerj.com/articles/11155/). All synthetic data was generated with the associated replicAnt project, available from https://github.com/evo-biomech/replicAnt.

    Abstract:

    Deep learning-based computer vision methods are transforming animal behavioural research. Transfer learning has enabled work in non-model species, but still requires hand-annotation of example footage, and is only performant in well-defined conditions. To overcome these limitations, we created replicAnt, a configurable pipeline implemented in Unreal Engine 5 and Python, designed to generate large and variable training datasets on consumer-grade hardware instead. replicAnt places 3D animal models into complex, procedurally generated environments, from which automatically annotated images can be exported. We demonstrate that synthetic data generated with replicAnt can significantly reduce the hand-annotation required to achieve benchmark performance in common applications such as animal detection, tracking, pose-estimation, and semantic segmentation; and that it increases the subject-specificity and domain-invariance of the trained networks, so conferring robustness. In some applications, replicAnt may even remove the need for hand-annotation altogether. It thus represents a significant step towards porting deep learning-based computer vision tools to the field.

    Benchmark data

    Two video datasets were curated to quantify detection performance; one in laboratory and one in field conditions. The laboratory dataset consists of top-down recordings of foraging trails of Atta vollenweideri (Forel 1893) leaf-cutter ants. The colony was collected in Uruguay in 2014, and housed in a climate chamber at 25°C and 60% humidity. A recording box was built from clear acrylic, and placed between the colony nest and a box external to the climate chamber, which functioned as feeding site. Bramble leaves were placed in the feeding area prior to each recording session, and ants had access to the recording area at will. The recorded area was 104 mm wide and 200 mm long. An OAK-D camera (OpenCV AI Kit: OAK-D, Luxonis Holding Corporation) was positioned centrally 195 mm above the ground. While keeping the camera position constant, lighting, exposure, and background conditions were varied to create recordings with variable appearance: The “base” case is an evenly lit and well exposed scene with scattered leaf fragments on an otherwise plain white backdrop. A “bright” and “dark” case are characterised by systematic over- or underexposure, respectively, which introduces motion blur, colour-clipped appendages, and extensive flickering and compression artefacts. In a separate well exposed recording, the clear acrylic backdrop was substituted with a printout of a highly textured forest ground to create a “noisy” case. Last, we decreased the camera distance to 100 mm at constant focal distance, effectively doubling the magnification, and yielding a “close” case, distinguished by out-of-focus workers. All recordings were captured at 25 frames per second (fps).

    The field dataset consists of video recordings of Gnathamitermes sp. desert termites, filmed close to the nest entrance in the desert of Maricopa County, Arizona, using a Nikon D850 and a Nikkor 18-105 mm lens on a tripod, at camera distances between 20 cm and 40 cm. All video recordings were well exposed, and captured at 23.976 fps.

    Each video was trimmed to the first 1000 frames, and contains between 36 and 103 individuals. In total, 5000 and 1000 frames were hand-annotated for the laboratory- and field-dataset, respectively: each visible individual was assigned a constant size bounding box, with a centre coinciding approximately with the geometric centre of the thorax in top-down view. The size of the bounding boxes was chosen such that they were large enough to completely enclose the largest individuals, and was automatically adjusted near the image borders. A custom-written Blender Add-on aided hand-annotation: the Add-on is a semi-automated multi animal tracker, which leverages blender’s internal contrast-based motion tracker, but also include track refinement options, and CSV export functionality. Comprehensive documentation of this tool and Jupyter notebooks for track visualisation and benchmarking is provided on the replicAnt and BlenderMotionExport GitHub repositories.
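    The constant-size, border-adjusted bounding-box convention described above is easy to sketch. A minimal illustration (box size, coordinates, and image dimensions here are invented, not the dataset's actual values):

```python
# Sketch of the annotation convention: a constant-size box centred on the
# thorax, clipped at the image borders. All numbers are illustrative.
def make_box(cx, cy, box_size, img_w, img_h):
    """Constant-size box around (cx, cy), adjusted at image borders."""
    half = box_size / 2
    x0, y0 = max(0, cx - half), max(0, cy - half)
    x1, y1 = min(img_w, cx + half), min(img_h, cy + half)
    return x0, y0, x1, y1

# A worker near the left border: the box is truncated on that side.
print(make_box(cx=10, cy=200, box_size=60, img_w=1024, img_h=1024))
```

    Choosing one box size large enough for the biggest individuals keeps annotations consistent across frames, at the cost of loose boxes around small workers.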

    Synthetic data generation

    Two synthetic datasets, each with a population size of 100, were generated from 3D models of Atta vollenweideri leaf-cutter ants. All 3D models were created with the scAnt photogrammetry workflow. A “group” population was based on three distinct 3D models of an ant minor (1.1 mg), a media (9.8 mg), and a major (50.1 mg) (see 10.5281/zenodo.7849059). To approximately simulate the size distribution of A. vollenweideri colonies, these models make up 20%, 60%, and 20% of the simulated population, respectively. A 33% within-class scale variation, with default hue, contrast, and brightness subject material variation, was used. A “single” population was generated using the major model only, with 90% scale variation, but equal material variation settings.

    A Gnathamitermes sp. synthetic dataset was generated from two hand-sculpted models; a worker and a soldier made up 80% and 20% of the simulated population of 100 individuals, respectively with default hue, contrast, and brightness subject material variation. Both 3D models were created in Blender v3.1, using reference photographs.

    Each of the three synthetic datasets contains 10,000 images, rendered at a resolution of 1024 by 1024 px, using the default generator settings as documented in the Generator_example level file (see documentation on GitHub). To assess how the training dataset size affects performance, we trained networks on 100 (“small”), 1,000 (“medium”), and 10,000 (“large”) subsets of the “group” dataset. Generating 10,000 samples at the specified resolution took approximately 10 hours per dataset on a consumer-grade laptop (6 Core 4 GHz CPU, 16 GB RAM, RTX 2070 Super).


    Additionally, five datasets which contain both real and synthetic images were curated. These “mixed” datasets combine image samples from the synthetic “group” dataset with image samples from the real “base” case. The ratio between real and synthetic images across the five datasets varied between 10/1 to 1/100.
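    Assembling such a "mixed" dataset at a chosen real-to-synthetic ratio amounts to sampling from the two pools and shuffling. A sketch with placeholder file lists (the counts and names below are invented, not the curated datasets themselves):

```python
# Sketch of building a mixed real/synthetic training set at a given ratio.
# File names and pool sizes are placeholders for illustration.
import random

random.seed(7)
real = [f"real_{i:04d}.png" for i in range(5000)]
synthetic = [f"synt_{i:05d}.png" for i in range(10000)]

def mixed_dataset(real_imgs, synt_imgs, n_real, n_synt):
    """Sample n_real real and n_synt synthetic images, shuffled together."""
    subset = random.sample(real_imgs, n_real) + random.sample(synt_imgs, n_synt)
    random.shuffle(subset)
    return subset

# e.g. a 1:100 real-to-synthetic mix
mix = mixed_dataset(real, synthetic, n_real=100, n_synt=10000)
print(len(mix))
```

    Sweeping n_real/n_synt across the 10/1 to 1/100 range reproduces the kind of ratio grid described above.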

    Funding

    This study received funding from Imperial College’s President’s PhD Scholarship (to Fabian Plum), and is part of a project that has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant agreement No. 851705, to David Labonte). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

  9. Annual Survey of Hours and Earnings 2011 and Census 2011: Synthetic Data

    • datacatalogue.cessda.eu
    • beta.ukdataservice.ac.uk
    Updated Nov 29, 2024
    Cite
    Little, C.; Elliott, M.; Allmendinger, R., University of Manchester (2024). Annual Survey of Hours and Earnings 2011 and Census 2011: Synthetic Data [Dataset]. http://doi.org/10.5255/UKDA-SN-9282-1
    Explore at:
    Dataset updated
    Nov 29, 2024
    Dataset provided by
    Manchester Business School
    University of Manchester
    Authors
    Little, C.; Elliott, M.; Allmendinger, R., University of Manchester
    Time period covered
    Jan 1, 2023 - Dec 31, 2023
    Area covered
    England and Wales
    Variables measured
    Individuals, National
    Measurement technique
    Compilation/Synthesis
    Description

    Abstract copyright UK Data Service and data collection copyright owner.


    The aim of this project was to create a synthetic dataset without using the original (secure, controlled) dataset, relying instead only on publicly available analytical output (i.e. output cleared for publication). Such synthetic data may allow users to gain familiarity with, and practise on, data that resembles the original before they gain access to it (where time in a secure setting may be limited).

    The Annual Survey of Hours and Earnings 2011 and Census 2011: Synthetic Data was created without access to the original ASHE-2011 Census dataset (which is only available in a secure setting via the ONS Secure Research Service: "Annual Survey of Hours and Earnings linked to 2011 Census - England and Wales"). It was created as a teaching aid to support a training course "An Introduction to the linked ASHE-2011 Census dataset" organised by Administrative Data Research UK and the National Centre for Research Methods. The synthetic dataset contains a subset of the variables in the original dataset and was designed to reproduce the analytical output contained in the ASHE-Census 2011 Data Linkage User Guide.


    Main Topics:

    Variables available in this study relate to synthetic employment, earnings and demographic information for adults employed in England and Wales in 2011.

  10. Synthetic Administrative Data: Census 1991, 2023

    • datacatalogue.cessda.eu
    • beta.ukdataservice.ac.uk
    Updated Mar 25, 2025
    Cite
    Shlomo, N; Kim, M (2025). Synthetic Administrative Data: Census 1991, 2023 [Dataset]. http://doi.org/10.5255/UKDA-SN-856310
    Explore at:
    Dataset updated
    Mar 25, 2025
    Dataset provided by
    University of Manchester
    Authors
    Shlomo, N; Kim, M
    Time period covered
    Jan 1, 2021 - Jan 1, 2023
    Area covered
    United Kingdom
    Variables measured
    Individual
    Measurement technique
    This is a synthetic administrative dataset with only 6 variables, created to enable the calculation of quality indicators in the R package qualadmin (https://github.com/sook-tusk/qualadmin; see also the user manual). The dataset was created from a 1991 synthetic UK census dataset containing over 1 million records by deleting, moving, and duplicating records across geographies according to pre-specified proportions within broad ethnic group and gender. The geography variable covers 6 local authorities, completely anonymized and labelled 1 to 6. The other variables are (number of categories in parentheses): sex (2), age groups (14), ethnic groups (5), and employment (3). The final size of the synthetic administrative data is 1,033,664 individuals. Descriptions of the variables are in the data dictionary uploaded with the data.
    Description

    We created a synthetic administrative dataset that mimics the properties of a real administrative dataset, according to specifications by the ONS, for use in developing the R package for calculating quality indicators for administrative data (see: https://github.com/sook-tusk/qualadmin). Taking over 1 million records from a synthetic 1991 UK census dataset, we deleted records, moved records to a different geography, and duplicated records into a different geography according to pre-specified proportions for each broad ethnic group (White, Non-white) and gender (males, females). The final size of the synthetic administrative data was 1,033,664 individuals.
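    The delete/move/duplicate perturbation described above can be sketched as a single pass over the records. The proportions, record count, and field names below are invented for illustration; the actual dataset used pre-specified proportions per broad ethnic group and gender:

```python
# Illustrative sketch of the perturbation scheme (hypothetical proportions
# and fields): within the population, delete some records, move some to a
# different local authority, and duplicate some into another one.
import random

random.seed(3)
records = [{"id": i, "la": random.randint(1, 6),
            "sex": random.choice(["M", "F"]),
            "ethnic": random.choice(["White", "Non-white"])}
           for i in range(10_000)]

P_DELETE, P_MOVE, P_DUP = 0.02, 0.03, 0.01  # hypothetical proportions

out = []
for rec in records:
    u = random.random()
    if u < P_DELETE:
        continue                          # record deleted (under-coverage)
    rec = dict(rec)
    if u < P_DELETE + P_MOVE:
        rec["la"] = random.randint(1, 6)  # moved to another geography
    out.append(rec)
    if random.random() < P_DUP:
        dup = dict(rec)
        dup["la"] = random.randint(1, 6)  # duplicated elsewhere (over-coverage)
        out.append(dup)

print(len(records), "->", len(out))
```

    A real implementation would stratify the three proportions by ethnic group and gender cell, as the description specifies; this flat version only shows the mechanics.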

    National Statistical Institutes (NSIs) are directing resources into advancing the use of administrative data in official statistics systems. This is a top priority for the UK Office for National Statistics (ONS) as they are undergoing transformations in their statistical systems to make more use of administrative data for future censuses and population statistics. Administrative data are defined as secondary data sources since they are produced by other agencies as a result of an event or a transaction relating to administrative procedures of organisations, public administrations and government agencies. Nevertheless, they have the potential to become important data sources for the production of official statistics by significantly reducing the cost and burden of response and improving the efficiency of such systems. Embedding administrative data in statistical systems is not without costs, and it is vital to understand where potential errors may arise. The Total Administrative Data Error Framework sets out all possible sources of error when using administrative data as statistical data, depending on whether it is a single data source or integrated with other data sources such as survey data. For a single administrative data source, one of the main sources of error is coverage and representativeness relative to the target population of interest. This is particularly relevant when administrative data is delivered over time, such as tax data for maintaining the Business Register. For sub-project 1 of this research project, we develop quality indicators that allow the statistical agency to assess whether the administrative data is representative of the target population and which sub-groups may be missing or over-covered. This is essential for producing unbiased estimates from administrative data.

    Another priority at statistical agencies is to produce a statistical register for population characteristic estimates, such as employment statistics, from multiple sources of administrative and survey data. Using administrative data to build a spine, survey data can be integrated using record linkage and statistical matching approaches on a set of common matching variables. This will be the topic for sub-project 2, which will be split into several topics of research. The first topic is whether adding statistical predictions and correlation structures improves the linkage and data integration. The second topic is to research a mass imputation framework for imputing missing target variables in the statistical register where the missing data may be due to multiple underlying mechanisms. Therefore, the third topic will aim to improve the mass imputation framework to mitigate against possible measurement errors, for example by adding benchmarks and other constraints into the approaches. On completion of a statistical register, estimates for key target variables at local areas can easily be aggregated. However, it is essential to also measure the precision of these estimates through mean square errors, and this will be the fourth topic of the sub-project. Finally, this new way of producing official statistics is compared to the more common method of incorporating administrative data through survey weights and model-based estimation approaches. In other words, we evaluate whether it is better 'to weight' or 'to impute' for population characteristic estimates - a key question under investigation by survey statisticians in the last decade.

  11. CMS 2008-2010 Data Entrepreneurs’ Synthetic Public Use File (DE-SynPUF) in...

    • registry.opendata.aws
    Updated Jan 18, 2023
    Cite
    Amazon Web Services (2023). CMS 2008-2010 Data Entrepreneurs’ Synthetic Public Use File (DE-SynPUF) in OMOP Common Data Model [Dataset]. https://registry.opendata.aws/cmsdesynpuf-omop/
    Explore at:
    Dataset updated
    Jan 18, 2023
    Dataset provided by
    Amazon.com (http://amazon.com/)
    Description

    DE-SynPUF is provided here as 1,000-person (1k), 100,000-person (100k), and 2,300,000-person (2.3M) datasets in the OMOP Common Data Model format. The DE-SynPUF was created with the goal of providing a realistic set of claims data in the public domain while providing the very highest degree of protection to the Medicare beneficiaries’ protected health information. The purposes of the DE-SynPUF are to:

    1. allow data entrepreneurs to develop and create software and applications that may eventually be applied to actual CMS claims data;
    2. train researchers on the use and complexity of conducting analyses with CMS claims data prior to initiating the process to obtain access to actual CMS data; and,
    3. support safe data mining innovations that may reveal unanticipated knowledge gains while preserving beneficiary privacy.

    The files have been designed so that programs and procedures created on the DE-SynPUF will function on CMS Limited Data Sets. The data structure of the Medicare DE-SynPUF is very similar to that of the CMS Limited Data Sets, but with a smaller number of variables. The DE-SynPUF also provides a robust set of metadata on the CMS claims data that has not been previously available in the public domain. Although the DE-SynPUF has very limited inferential research value for drawing conclusions about Medicare beneficiaries, due to the synthetic processes used to create the file, the Medicare DE-SynPUF does increase access to a realistic Medicare claims data file in a timely and less expensive manner, spurring the innovation necessary to achieve the goals of better care for beneficiaries and improved health of the population.

  12. Synthetic nursing handover training and development data set - text files

    • data.csiro.au
    • researchdata.edu.au
    Updated Mar 21, 2017
    + more versions
    Cite
    Maricel Angel; Hanna Suominen; Liyuan Zhou; Leif Hanlen (2017). Synthetic nursing handover training and development data set - text files [Dataset]. http://doi.org/10.4225/08/58d097ee92e95
    Explore at:
    Dataset updated
    Mar 21, 2017
    Dataset provided by
    CSIRO (http://www.csiro.au/)
    Authors
    Maricel Angel; Hanna Suominen; Liyuan Zhou; Leif Hanlen
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Dataset funded by
    NICTA (http://nicta.com.au/)
    Description

    This is one of two collection records. Please see the link below for the other collection of associated audio files.

    Both collections together comprise an open clinical dataset of three sets of 101 nursing handover records, very similar to real documents in Australian English. Each record consists of a patient profile, spoken free-form text document, written free-form text document, and written structured document.

    This collection contains 3 sets of text documents.

    Data Set 1 for Training and Development

    The data set, released in June 2014, includes the following documents:

    Folder initialisation: Initialisation details for speech recognition using Dragon Medical 11.0 (i.e., i) DOCX for the written, free-form text document that originates from the Dragon software release and ii) WMA for the spoken, free-form text document by the RN)
    Folder 100profiles: 100 patient profiles (DOCX)
    Folder 101writtenfreetextreports: 101 written, free-form text documents (TXT)
    Folder 100x6speechrecognised: 100 speech-recognized, written, free-form text documents for six Dragon vocabularies (TXT)
    Folder 101informationextraction: 101 written, structured documents for information extraction that include i) the reference standard text, ii) features used by our best system, iii) form categories with respect to the reference standard and iv) form categories with respect to our best information extraction system (TXT in CRF++ format).

    An Independent Data Set 2

    The aforementioned data set was supplemented in April 2015 with an independent set that was used as a test set in the CLEFeHealth 2015 Task 1a on clinical speech recognition and can be used as a validation set in the CLEFeHealth 2016 Task 1 on handover information extraction. Hence, when using this set, please avoid its repeated use in evaluation – we do not wish to overfit to these data sets.

    The set released in April 2015 consists of 100 patient profiles (DOCX), 100 written, and 100 speech-recognized, written, free-form text documents for the Dragon vocabulary of Nursing (TXT). The set released in November 2015 consists of the respective 100 written free-form text documents (TXT) and 100 written, structured documents for information extraction.

    An Independent Data Set 3

    For evaluation purposes, the aforementioned data sets were supplemented in April 2016 with an independent set of another 100 synthetic cases.

    Lineage: Data creation included the following steps: generation of patient profiles; creation of written, free form text documents; development of a structured handover form, using this form and the written, free-form text documents to create written, structured documents; creation of spoken, free-form text documents; using a speech recognition engine with different vocabularies to convert the spoken documents to written, free-form text; and using an information extraction system to fill out the handover form from the written, free-form text documents.

    See Suominen et al (2015) in the links below for a detailed description and examples.

  13. data for: Synthetic Datasets Generator for Testing Techniques and Tools of...

    • data.mendeley.com
    Updated Mar 12, 2019
    + more versions
    Cite
    Yvan Brito (2019). data for: Synthetic Datasets Generator for Testing Techniques and Tools of Information Visualization and Machine Learning [Dataset]. http://doi.org/10.17632/2j3hg4j6tc.1
    Explore at:
    Dataset updated
    Mar 12, 2019
    Authors
    Yvan Brito
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data model to generate datasets used in the tests of the article: Synthetic Datasets Generator for Testing Techniques and Tools of Information Visualization and Machine Learning.

  14. Rule-based Synthetic Data for Japanese GEC

    • zenodo.org
    • live.european-language-grid.eu
    • +1more
    bin, tsv
    Updated Dec 31, 2020
    Cite
    Alex Kimn; Alex Kimn; Yiqun Hu; Takako Aikawa; Takako Aikawa; Yiqun Hu (2020). Rule-based Synthetic Data for Japanese GEC [Dataset]. http://doi.org/10.5281/zenodo.4276130
    Explore at:
    tsv, binAvailable download formats
    Dataset updated
    Dec 31, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alex Kimn; Alex Kimn; Yiqun Hu; Takako Aikawa; Takako Aikawa; Yiqun Hu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    Title: Rule-based Synthetic Data for Japanese GEC
    Dataset Contents:
    This dataset contains two parallel corpora intended for the training and evaluation of models for the NLP (natural language processing) subtask of Japanese GEC (grammatical error correction). These are as follows:
    
    Synthetic Corpus - *synthesized_data.tsv*
    This corpus file contains 2,179,130 parallel sentence pairs synthesized using the process described in [1]. Each line of the file consists of two sentences delimited by a tab. The first sentence is the erroneous sentence while the second is the corresponding correction.
    
    These paired sentences are derived from data scraped from the keyword-lookup site 
  15. Immune Checkpoint Inhibitors synthetic data: HDR UK Medicines Programme...

    • web.dev.hdruk.cloud
    • healthdatagateway.org
    unknown
    Updated Oct 8, 2024
    Cite
    This publication uses data from PIONEER, an ethically approved database and analytical environment (East Midlands Derby Research Ethics 20/EM/0158) (2024). Immune Checkpoint Inhibitors synthetic data: HDR UK Medicines Programme resource [Dataset]. https://web.dev.hdruk.cloud/dataset/189
    Explore at:
    unknownAvailable download formats
    Dataset updated
    Oct 8, 2024
    Dataset authored and provided by
    This publication uses data from PIONEER, an ethically approved database and analytical environment (East Midlands Derby Research Ethics 20/EM/0158)
    License

    https://www.pioneerdatahub.co.uk/data/data-request-process/

    Description

    This highly granular synthetic dataset, created as an asset for the HDR UK Medicines programme, includes information on 680 cancer patients over a period of three years. It includes simulated patient-related data, such as demographics & co-morbidities extracted from ICD-10 and SNOMED-CT codes, as well as serial, structured data pertaining to the acute care process (readmissions, survival), primary diagnosis, presenting complaint, physiology readings, blood results (infection, inflammatory markers), acuity markers such as the AVPU Scale and NEWS2 score, imaging reports, prescribed & administered treatments including fluids, blood products and procedures, information on outpatient admissions, and survival outcomes up to one year post discharge.

    The data was generated using a generative adversarial network model (CTGAN). A flat real-data table was created by consolidating essential information from various key relational tables (medications, demographics). A synthetic version of the flat table was generated using a customized script based on the SDV package (N. Patki, 2016) that replicated the real distributions and logical relationships.
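CTGAN itself is beyond a short sketch, but the shape of the pipeline (consolidate a flat table, fit per-column models, then sample a same-sized synthetic table) can be illustrated with a much-simplified parametric stand-in. All column names and values below are hypothetical; the actual pipeline used the SDV package's CTGAN model, which also learns cross-column relationships rather than treating columns independently:

```python
import random
from statistics import mean, stdev

random.seed(42)

# Toy flat "real" table, as if consolidated from relational sources
real = [{"age": random.gauss(65.0, 10.0), "on_ici": random.random() < 0.4}
        for _ in range(680)]

# Fit trivial per-column models: Gaussian for numeric, Bernoulli for binary
age_mu = mean(r["age"] for r in real)
age_sd = stdev(r["age"] for r in real)
p_ici = sum(r["on_ici"] for r in real) / len(real)

# Sample a synthetic table of the same size from the fitted models
synthetic = [{"age": random.gauss(age_mu, age_sd),
              "on_ici": random.random() < p_ici}
             for _ in range(len(real))]

print(len(synthetic))
```

A GAN-based model such as CTGAN replaces the per-column fits with a learned joint distribution, which is what preserves the "logical relationships" between columns mentioned above.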

    Geography: The West Midlands (WM) has a population of 6 million & includes a diverse ethnic & socio-economic mix. UHB is one of the largest NHS Trusts in England, providing direct acute services & specialist care across four hospital sites, with 2.2 million patient episodes per year, 2750 beds & > 120 ITU bed capacity. UHB runs a fully electronic healthcare record (EHR) (PICS; Birmingham Systems), a shared primary & secondary care record (Your Care Connected) & a patient portal “My Health”.

    Data set availability: Data access is available via the PIONEER Hub for projects which will benefit the public or patients. This can be by developing a new understanding of disease, by providing insights into how to improve care, or by developing new models, tools, treatments, or care processes. Data access can be provided to NHS, academic, commercial, policy and third sector organisations. Applications from SMEs are welcome. There is a single data access process, with public oversight provided by our public review committee, the Data Trust Committee. Contact pioneer@uhb.nhs.uk or visit www.pioneerdatahub.co.uk for more details.

    Available supplementary data: Matched controls; ambulance and community data. Unstructured data (images). We can provide the dataset in OMOP and other common data models, and can provide the real data via application.

    Available supplementary support: Analytics, model build, validation & refinement; A.I. support. Data partner support for ETL (extract, transform & load) processes. Bespoke and “off the shelf” Trusted Research Environment (TRE) build and run. Consultancy with clinical, patient & end-user and purchaser access/ support. Support for regulatory requirements. Cohort discovery. Data-driven trials and “fast screen” services to assess population size.

  16. Synthetic Fruit Object Detection Dataset - raw

    • public.roboflow.com
    zip
    Updated Aug 11, 2021
    + more versions
    Cite
    Brad Dwyer (2021). Synthetic Fruit Object Detection Dataset - raw [Dataset]. https://public.roboflow.com/object-detection/synthetic-fruit/1
    Explore at:
    zipAvailable download formats
    Dataset updated
    Aug 11, 2021
    Dataset authored and provided by
    Brad Dwyer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Bounding Boxes of Fruits
    Description

    About this dataset

    This dataset contains 6,000 example images generated with the process described in Roboflow's How to Create a Synthetic Dataset tutorial.

    The images are composed of a background (randomly selected from Google's Open Images dataset) and a number of fruits (from Horea94's Fruit Classification Dataset) superimposed on top with a random orientation, scale, and color transformation. All images are 416x550 to simulate a smartphone aspect ratio.
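The random placement step of this compositing recipe also yields each bounding-box label. A dependency-free sketch of that step, with an assumed toy sprite size (the real pipeline pastes actual fruit images with color transformations as well):

```python
import random

random.seed(7)

BG_W, BG_H = 416, 550        # smartphone-like aspect ratio used by the dataset
FRUIT_W, FRUIT_H = 64, 48    # assumed toy sprite size

# Random scale, as in the tutorial's scale transformation
scale = random.uniform(0.5, 1.5)
w, h = int(FRUIT_W * scale), int(FRUIT_H * scale)

# Random top-left corner such that the sprite stays inside the background
x = random.randint(0, BG_W - w)
y = random.randint(0, BG_H - h)

bbox = (x, y, x + w, y + h)  # the object-detection label for this paste
print(bbox)
```

Repeating this for several sprites per background, and recording each `bbox`, produces the annotations that ship with the dataset.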

    To generate your own images, follow our tutorial or download the code.

    Example image: https://blog.roboflow.ai/content/images/2020/04/synthetic-fruit-examples.jpg

  17. Dataset for publication: Usefulness of synthetic datasets for diatom...

    • dorel.univ-lorraine.fr
    • zenodo.org
    • +1more
    bin, jpeg +4
    Updated Jul 21, 2023
    + more versions
    Cite
    Université de Lorraine (2023). Dataset for publication: Usefulness of synthetic datasets for diatom automatic detection using a deep-learning approach [Dataset]. http://doi.org/10.12763/UADENQ
    Explore at:
    text/x-python(652), bin(456), tsv(1716), text/x-python(1957), text/x-python(4882), text/x-python(3391), text/x-python(12356), jpeg(7239), text/x-python(8545), zip(50188610), bin(1530), text/markdown(2269)Available download formats
    Dataset updated
    Jul 21, 2023
    Dataset provided by
    University of Lorraine (http://www.univ-lorraine.fr/)
    License

    Licence Ouverte / Open Licence 2.0: https://www.etalab.gouv.fr/wp-content/uploads/2018/11/open-licence.pdf
    License information was derived automatically

    Description

    This repository contains the dataset and code used to generate the synthetic dataset as explained in the paper "Usefulness of synthetic datasets for diatom automatic detection using a deep-learning approach".

    Dataset: The dataset consists of two components: individual diatom images extracted from publicly available diatom atlases [1,2,3] and individual debris images.
    - Individual diatom images: currently, the repository contains 166 diatom species, totalling 9,230 images. These images were automatically extracted from the atlases using PDF scraping, then cleaned and verified by diatom taxonomists. The subfolders within each diatom species indicate the origin of the images: RA [1], IDF [2], BRG [3]. Additional diatom species and images will be regularly added to the repository.
    - Individual debris images: the debris images were extracted from real microscopy images. The repository contains 600 debris objects.

    Code: Contains the code used to generate synthetic microscopy images. For details on how to use the code, kindly refer to the README file available in synthetic_data_generator/.

  18. Synset Boulevard: Synthetic image dataset for Vehicle Make and Model...

    • gimi9.com
    Updated Dec 15, 2024
    + more versions
    Cite
    (2024). Synset Boulevard: Synthetic image dataset for Vehicle Make and Model Recognition (VMMR) | gimi9.com [Dataset]. https://gimi9.com/dataset/eu_725679870677258240
    Explore at:
    Dataset updated
    Dec 15, 2024
    Description

    The Synset Boulevard dataset contains a total of 259,200 synthetically generated images of cars from a frontal traffic-camera perspective, annotated with vehicle make, model and year of construction for machine learning (ML) methods in the task of vehicle make and model recognition (VMMR). The dataset contains 162 vehicle models from 43 brands with 200 images each, as well as 8 sub-datasets each, to enable the investigation of different imaging qualities. In addition to the classification annotations, the dataset also contains label images for semantic segmentation, information on image and scene properties, and vehicle color.

    The dataset was presented in May 2024 by Anne Sielemann, Stefan Wolf, Masoud Roschani, Jens Ziehn and Jürgen Beyerer in the publication: Sielemann, A., Wolf, S., Roschani, M., Ziehn, J. and Beyerer, J. (2024). Synset Boulevard: A Synthetic Image Dataset for VMMR. In 2024 IEEE International Conference on Robotics and Automation (ICRA).

    The model information is based on information from the ADAC online database (www.adac.de/rund-ums-fahrzeug/autokatalog/marken-modelle). The data was generated using the simulation environment OCTANE (www.octane.org), which uses the Cycles ray tracer of the Blender project. The dataset's website provides detailed information on the generation process and model assumptions. The dataset is therefore also intended to be used for the suitability analysis of simulated, synthetic datasets.

    The dataset was developed as part of the Fraunhofer PREPARE program in the "ML4Safety" project with the funding code PREPARE 40-02702, and was also funded by the "Invest BW" funding program of the Ministry of Economic Affairs, Labour and Tourism as part of the "FeinSyn" research project.

  19. Real_synthetic Data Dataset

    • universe.roboflow.com
    zip
    Updated Mar 15, 2025
    + more versions
    Cite
    Eswars CV Platform (2025). Real_synthetic Data Dataset [Dataset]. https://universe.roboflow.com/eswars-cv-platform/real_synthetic-data/dataset/1
    Explore at:
    zipAvailable download formats
    Dataset updated
    Mar 15, 2025
    Dataset authored and provided by
    Eswars CV Platform
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    GlassBottle MetalCan Bounding Boxes
    Description

    Real_Synthetic Data

    ## Overview
    
    Real_Synthetic Data is a dataset for object detection tasks - it contains GlassBottle MetalCan annotations for 1,993 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  20. Data from: SIPHER Synthetic Population for Individuals in Great Britain,...

    • beta.ukdataservice.ac.uk
    • datacatalogue.cessda.eu
    Updated 2024
    Cite
    UK Data Service (2024). SIPHER Synthetic Population for Individuals in Great Britain, 2019-2021: Supplementary Material, 2024 [Dataset]. http://doi.org/10.5255/ukda-sn-856754
    Explore at:
    Dataset updated
    2024
    Dataset provided by
    UK Data Service (https://ukdataservice.ac.uk/)
    Area covered
    Great Britain, United Kingdom
    Description

    IMPORTANT: This deposit contains a range of supplementary material related to the deposit of the SIPHER Synthetic Population for Individuals, 2019-2021 (https://doi.org/10.5255/UKDA-SN-9277-1). See the shared readme file for a detailed description of this deposit. Please note that this deposit does not contain the SIPHER Synthetic Population dataset, or any other Understanding Society survey datasets.

    The lack of a centralised and comprehensive register-based system in Great Britain limits opportunities for studying the interaction of aspects such as health, employment, benefit payments, or housing quality at the level of individuals and households. At the same time, the data that do exist are typically strictly controlled and only available in safe-haven environments under a “create-and-destroy” model. In particular, when testing policy options via simulation models where results are required swiftly, these limitations can present major hurdles to co-production and collaborative work connecting researchers, policymakers, and key stakeholders. In some cases, survey data can provide a suitable alternative to the lack of readily available administrative data. However, survey data typically does not allow for a small-area perspective. Although special licence area-level linkages of survey data can offer more detailed spatial information, the data’s coverage and statistical power might be too low for meaningful analysis.

    Through a linkage with the UK Household Longitudinal Study (Understanding Society, SN 6614, wave k), the SIPHER Synthetic Population allows for the creation of a survey-based full-scale synthetic population for all of Great Britain. By drawing on data reflecting “real” survey respondents, the dataset represents over 50 million synthetic (i.e. “not real”) individuals. As a digital twin of the adult population in Great Britain, the SIPHER Synthetic Population provides a novel source of microdata for understanding the “status quo” and modelling “what if” scenarios (e.g., via static/dynamic microsimulation models), as well as other exploratory analyses where a granular geographical resolution is required.

    As the SIPHER Synthetic Population is the outcome of a statistical creation process, all results obtained from this dataset should always be treated as “model output”, including basic descriptive statistics. The SIPHER Synthetic Population should not replace the underlying Understanding Society survey data for standard statistical analyses (e.g., standard regression analysis, longitudinal multi-wave analysis). Please see the User Guide provided for this dataset for further information on creation and validation.

    This research was conducted as part of the Systems Science in Public Health and Health Economics Research - SIPHER Consortium and we thank the whole team for valuable input and discussions that have informed this work.

Cite
Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin (2025). Data Sheet 2_Large language models generating synthetic clinical datasets: a feasibility and comparative analysis with real-world perioperative data.xlsx [Dataset]. http://doi.org/10.3389/frai.2025.1533508.s002

Explore at:
xlsxAvailable download formats
Dataset updated
Feb 5, 2025
Dataset provided by
Frontiers
Authors
Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Background: Clinical data is instrumental to medical research, machine learning (ML) model development, and advancing surgical care, but access is often constrained by privacy regulations and missing data. Synthetic data offers a promising solution to preserve privacy while enabling broader data access. Recent advances in large language models (LLMs) provide an opportunity to generate synthetic data with reduced reliance on domain expertise, computational resources, and pre-training.

Objective: This study aims to assess the feasibility of generating realistic tabular clinical data with OpenAI’s GPT-4o using zero-shot prompting, and to evaluate the fidelity of LLM-generated data by comparing its statistical properties to the Vital Signs DataBase (VitalDB), a real-world open-source perioperative dataset.

Methods: In Phase 1, GPT-4o was prompted to generate a dataset from qualitative descriptions of 13 clinical parameters. The resultant data was assessed for general errors, plausibility of outputs, and cross-verification of related parameters. In Phase 2, GPT-4o was prompted to generate a dataset using descriptive statistics of the VitalDB dataset. Fidelity was assessed using two-sample t-tests, two-sample proportion tests, and 95% confidence interval (CI) overlap.

Results: In Phase 1, GPT-4o generated a complete and structured dataset comprising 6,166 case files. The dataset was plausible in range and correctly calculated body mass index for all case files based on the respective heights and weights. Statistical comparison between the LLM-generated datasets and VitalDB revealed that Phase 2 data achieved significant fidelity. Phase 2 data demonstrated statistical similarity in 12/13 (92.31%) parameters: no statistically significant differences were observed in 6/6 (100.0%) categorical/binary and 6/7 (85.71%) continuous parameters, and overlap of 95% CIs was observed in 6/7 (85.71%) continuous parameters.

Conclusion: Zero-shot prompting with GPT-4o can generate realistic tabular synthetic datasets, which can replicate key statistical properties of real-world perioperative data. This study highlights the potential of LLMs as a novel and accessible modality for synthetic data generation, which may address critical barriers in clinical data access and eliminate the need for technical expertise, extensive computational resources, and pre-training. Further research is warranted to enhance fidelity and investigate the use of LLMs to amplify and augment datasets, preserve multivariate relationships, and train robust ML models.
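The fidelity checks named in the abstract (two-sample t-tests and 95% CI overlap) can be sketched for a single continuous parameter. The values below are toy data, and the 1.96 cutoffs assume a large-sample normal approximation rather than the exact tests the authors used:

```python
from statistics import mean, stdev
from math import sqrt

def ci95(xs):
    # 95% confidence interval for the mean (normal approximation, large n)
    m, se = mean(xs), stdev(xs) / sqrt(len(xs))
    return (m - 1.96 * se, m + 1.96 * se)

def welch_t(xs, ys):
    # Welch's two-sample t statistic (unequal variances)
    vx = stdev(xs) ** 2 / len(xs)
    vy = stdev(ys) ** 2 / len(ys)
    return (mean(xs) - mean(ys)) / sqrt(vx + vy)

# Toy stand-ins for one parameter in the real vs. LLM-generated data
real  = [70.0 + 0.1 * i for i in range(200)]
synth = [70.5 + 0.1 * i for i in range(200)]

t = welch_t(real, synth)
(lo_r, hi_r), (lo_s, hi_s) = ci95(real), ci95(synth)
cis_overlap = lo_r <= hi_s and lo_s <= hi_r

print(f"t = {t:.2f}, CIs overlap: {cis_overlap}")
```

Here |t| below roughly 1.96 means no significant difference at the 5% level, and overlapping CIs are the second, coarser similarity signal reported in the abstract.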
