14 datasets found
  1.

    Data Sheet 2_Large language models generating synthetic clinical datasets: a...

    • frontiersin.figshare.com
    xlsx
    Updated Feb 5, 2025
    + more versions
    Cite
    Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin (2025). Data Sheet 2_Large language models generating synthetic clinical datasets: a feasibility and comparative analysis with real-world perioperative data.xlsx [Dataset]. http://doi.org/10.3389/frai.2025.1533508.s002
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Feb 5, 2025
    Dataset provided by
    Frontiers
    Authors
    Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Clinical data are instrumental to medical research, machine learning (ML) model development, and advancing surgical care, but access is often constrained by privacy regulations and missing data. Synthetic data offers a promising solution to preserve privacy while enabling broader data access. Recent advances in large language models (LLMs) provide an opportunity to generate synthetic data with reduced reliance on domain expertise, computational resources, and pre-training.

    Objective: This study aims to assess the feasibility of generating realistic tabular clinical data with OpenAI’s GPT-4o using zero-shot prompting, and to evaluate the fidelity of LLM-generated data by comparing its statistical properties to the Vital Signs DataBase (VitalDB), a real-world open-source perioperative dataset.

    Methods: In Phase 1, GPT-4o was prompted to generate a dataset from qualitative descriptions of 13 clinical parameters. The resultant data was assessed for general errors, plausibility of outputs, and cross-verification of related parameters. In Phase 2, GPT-4o was prompted to generate a dataset using descriptive statistics of the VitalDB dataset. Fidelity was assessed using two-sample t-tests, two-sample proportion tests, and 95% confidence interval (CI) overlap.

    Results: In Phase 1, GPT-4o generated a complete and structured dataset comprising 6,166 case files. The dataset was plausible in range and correctly calculated body mass index for all case files based on the respective heights and weights. Statistical comparison between the LLM-generated datasets and VitalDB revealed that Phase 2 data achieved significant fidelity: Phase 2 data demonstrated statistical similarity in 12/13 (92.31%) parameters, with no statistically significant differences observed in 6/6 (100.0%) categorical/binary and 6/7 (85.71%) continuous parameters. Overlap of 95% CIs was observed in 6/7 (85.71%) continuous parameters.

    Conclusion: Zero-shot prompting with GPT-4o can generate realistic tabular synthetic datasets that replicate key statistical properties of real-world perioperative data. This study highlights the potential of LLMs as a novel and accessible modality for synthetic data generation, which may address critical barriers in clinical data access and eliminate the need for technical expertise, extensive computational resources, and pre-training. Further research is warranted to enhance fidelity and to investigate the use of LLMs to amplify and augment datasets, preserve multivariate relationships, and train robust ML models.
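    The authors' exact test code is not included in this listing; as an illustrative sketch only, the two-sample comparison and 95% CI overlap check for a continuous parameter might look like the following, using a large-sample normal approximation in place of the t-distribution:

```python
import math
from statistics import mean, stdev

def two_sample_z(a, b):
    """Large-sample normal approximation to Welch's two-sample t-test.
    Returns the z statistic and a two-sided p-value."""
    se = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    z = (mean(a) - mean(b)) / se
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

def ci95(x):
    """Approximate 95% confidence interval for the mean."""
    half = 1.96 * stdev(x) / math.sqrt(len(x))
    return (mean(x) - half, mean(x) + half)

def cis_overlap(a, b):
    """True if the two samples' 95% CIs for the mean overlap."""
    lo1, hi1 = ci95(a)
    lo2, hi2 = ci95(b)
    return lo1 <= hi2 and lo2 <= hi1
```

    A parameter would count toward the fidelity tally when the p-value is above the significance threshold and/or the CIs overlap; the real analysis used proper t-tests and proportion tests rather than this approximation.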

  2.

    synthetic-from-unit-triple-tasks-danish

    • huggingface.co
    • sprogteknologi.dk
    Updated Jan 26, 2025
    Cite
    synthetic-from-unit-triple-tasks-danish [Dataset]. https://huggingface.co/datasets/ThatsGroes/synthetic-from-unit-triple-tasks-danish
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 26, 2025
    Authors
    Kasper Groes Albin Ludvigsen
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Thanks to Arrow Denmark and Nvidia for sponsoring the compute used to generate this dataset

    The purpose of this dataset is to pre- or post-train embedding models for Danish on text similarity tasks. The dataset consists of 100,000 samples generated with gemma-2-27b-it. The column "prompt" shows the prompt given to the LLM and "response" shows the LLM output. The data generation process described in this paper was followed: https://arxiv.org/pdf/2401.00368 Compute sponsored by… See the full description on the dataset page: https://huggingface.co/datasets/ThatsGroes/synthetic-from-unit-triple-tasks-danish.

  3.

    synthetic-from-unit-triple-tasks-norwegian

    • huggingface.co
    Updated Jan 26, 2025
    + more versions
    Cite
    Kasper Groes Albin Ludvigsen (2025). synthetic-from-unit-triple-tasks-norwegian [Dataset]. https://huggingface.co/datasets/ThatsGroes/synthetic-from-unit-triple-tasks-norwegian
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 26, 2025
    Authors
    Kasper Groes Albin Ludvigsen
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Thanks to Arrow Denmark and Nvidia for sponsoring the compute used to generate this dataset

    The purpose of this dataset is to pre- or post-train embedding models on text similarity tasks. The dataset consists of 100,000 samples generated with gemma-2-27b-it. The column "prompt" shows the prompt given to the LLM and "response" shows the LLM output. The data generation process described in this paper was followed: https://arxiv.org/pdf/2401.00368 Compute sponsored by Arrow Denmark and… See the full description on the dataset page: https://huggingface.co/datasets/ThatsGroes/synthetic-from-unit-triple-tasks-norwegian.

  4.

    atcoder_cot

    • huggingface.co
    Updated Mar 19, 2025
    Cite
    atcoder_cot [Dataset]. https://huggingface.co/datasets/Nan-Do/atcoder_cot
    Explore at:
    Dataset updated
    Mar 19, 2025
    Authors
    Fernando Tarin Morales
    License

    ODC-By License: https://choosealicense.com/licenses/odc-by/

    Description

    Dataset Card for Atcoder-CoT

      Dataset Description
    

    Atcoder-CoT is a proof-of-concept dataset designed to demonstrate how a dataset like the one found here can be used to generate synthetic datasets for training reasoning models, particularly for Supervised Fine-Tuning (SFT) and Knowledge Distillation. It leverages human-created and debugged solutions, combined with LLM-generated text to create conversational turns. The dataset currently consists of a single column:… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/atcoder_cot.

  5.

    WangchanThaiInstruct_Multi-turn_Conversation_Dataset

    • huggingface.co
    • zenodo.org
    Updated Sep 12, 2024
    + more versions
    Cite
    Thai Synthetic QA (2024). WangchanThaiInstruct_Multi-turn_Conversation_Dataset [Dataset]. https://huggingface.co/datasets/ThaiSyntheticQA/WangchanThaiInstruct_Multi-turn_Conversation_Dataset
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 12, 2024
    Dataset authored and provided by
    Thai Synthetic QA
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    WangchanThaiInstruct Multi-turn Conversation Dataset

    We created a Thai multi-turn conversation dataset from airesearch/WangchanThaiInstruct (Batch 1) using an LLM. It was created with a synthetic method using an open-source LLM in the Thai language.

      Citation
    

    Thammaleelakul, S., & Phatthiyaphaibun, W. (2024). WangchanThaiInstruct Multi-turn Conversation Dataset [Data set]. Zenodo. https://doi.org/10.5281/zenodo.13132633

    or BibTeX @dataset{thammaleelakul_2024_13132633, author… See the full description on the dataset page: https://huggingface.co/datasets/ThaiSyntheticQA/WangchanThaiInstruct_Multi-turn_Conversation_Dataset.

  6.

    Generative Artificial Intelligence AI in Healthcare Market Report

    • marketresearchforecast.com
    doc, pdf, ppt
    Updated Dec 26, 2024
    Cite
    Market Research Forecast (2024). Generative Artificial Intelligence AI in Healthcare Market Report [Dataset]. https://www.marketresearchforecast.com/reports/generative-artificial-intelligence-ai-in-healthcare-market-10297
    Explore at:
    Available download formats: pdf, ppt, doc
    Dataset updated
    Dec 26, 2024
    Dataset authored and provided by
    Market Research Forecast
    License

    https://www.marketresearchforecast.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Generative Artificial Intelligence (AI) in Healthcare Market was valued at USD XX Million in 2023 and is projected to reach USD XXX Million by 2032, with an expected CAGR of XXX% during the forecast period. Generative AI in healthcare refers to the application of advanced machine learning models that can create new, innovative outputs based on existing data. In healthcare, generative AI is used to design new drug molecules, create synthetic medical data for research, generate personalized treatment plans, and assist in medical imaging analysis. By learning patterns from vast datasets of patient records, medical literature, and diagnostic images, generative AI models can generate insights, predictive models, and recommendations, enhancing the efficiency, accuracy, and personalization of healthcare services. These tools can also contribute to the development of diagnostic algorithms, enabling earlier detection of diseases and improving patient outcomes. The market growth is primarily attributed to the rising demand for personalized medicine, the increasing adoption of AI in healthcare applications, and government initiatives promoting the use of AI in healthcare. Recent developments include: In February 2024, Persistent Systems launched an innovative generative AI-powered population health management (PHM) solution in collaboration with Microsoft. In August 2023, Cognizant expanded its partnership with Google Cloud to develop healthcare large language model (LLM) solutions using Google Cloud’s generative AI technology. In April 2023, Microsoft expanded its collaboration agreement with Epic Systems Corporation to develop and integrate generative AI into healthcare; under the agreement, Microsoft would use the Azure OpenAI Service with Epic Systems Corporation’s electronic health record (EHR) software to increase productivity, enhance patient care, and improve the financial integrity of health systems globally. In March 2023, NVIDIA Corporation announced its collaboration with Medtronic to accelerate the development of generative AI technology in the healthcare system and to bring new AI-based solutions into patient care. In November 2022, Syntegra announced the launch of Syntegra Medical Mind 2.0 to expand its generative AI technology for generating synthetic healthcare data.

  7.

    synthetic-from-classification-tasks-danish

    • huggingface.co
    Updated Jan 26, 2025
    + more versions
    Cite
    Kasper Groes Albin Ludvigsen (2025). synthetic-from-classification-tasks-danish [Dataset]. https://huggingface.co/datasets/ThatsGroes/synthetic-from-classification-tasks-danish
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 26, 2025
    Authors
    Kasper Groes Albin Ludvigsen
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Thanks to Arrow Denmark and Nvidia for sponsoring the compute used to generate this dataset

    The purpose of this dataset is to pre- or post-train embedding models for Danish text classification tasks. The dataset consists of 100,000 samples generated with gemma-2-27b-it. The column "prompt" shows the prompt given to the LLM and "response" shows the LLM output. Each sample in the dataset was generated from a seed task randomly sampled from… See the full description on the dataset page: https://huggingface.co/datasets/ThatsGroes/synthetic-from-classification-tasks-danish.

  8.

    VietMed-Sum Dataset

    • paperswithcode.com
    Updated Jun 21, 2024
    + more versions
    Cite
    Khai Le-Duc; Khai-Nguyen Nguyen; Long Vo-Dang; Truong-Son Hy (2024). VietMed-Sum Dataset [Dataset]. https://paperswithcode.com/dataset/vietmed-sum
    Explore at:
    Dataset updated
    Jun 21, 2024
    Authors
    Khai Le-Duc; Khai-Nguyen Nguyen; Long Vo-Dang; Truong-Son Hy
    Description

    In doctor-patient conversations, identifying medically relevant information is crucial, posing the need for conversation summarization. In this work, we propose the first deployable real-time speech summarization system for real-world applications in industry, which generates a local summary after every N speech utterances within a conversation and a global summary after the end of a conversation. Our system could enhance user experience from a business standpoint, while also reducing computational costs from a technical perspective. Secondly, we present VietMed-Sum which, to our knowledge, is the first speech summarization dataset for medical conversations. Thirdly, we are the first to utilize LLM and human annotators collaboratively to create gold standard and synthetic summaries for medical conversation summarization.
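    The system's implementation is not shown in this listing; as a rough sketch of the "local summary after every N utterances, global summary at the end" behavior described above, with a stub `summarize` callable standing in for the actual summarization model:

```python
from typing import Callable, List

class RealTimeSummarizer:
    """Buffers utterances, emitting a local summary every n utterances
    and a global summary over the whole conversation at the end.
    The `summarize` callable is a placeholder for the real model."""

    def __init__(self, summarize: Callable[[List[str]], str], n: int = 5):
        self.summarize = summarize
        self.n = n
        self.buffer: List[str] = []      # utterances since the last local summary
        self.all: List[str] = []         # every utterance in the conversation
        self.local_summaries: List[str] = []

    def add_utterance(self, text: str) -> None:
        self.buffer.append(text)
        self.all.append(text)
        if len(self.buffer) == self.n:   # flush: produce a local summary
            self.local_summaries.append(self.summarize(self.buffer))
            self.buffer = []

    def finish(self) -> str:
        """Called when the conversation ends; returns the global summary."""
        return self.summarize(self.all)
```

    The batching is what makes the system deployable in real time: only the last N utterances are summarized at each step, while the full transcript is summarized once at the end.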

  9.

    Supplementary Material for: Automated Identification of Stroke Thrombolysis...

    • karger.figshare.com
    docx
    Updated Mar 17, 2025
    Cite
    figshare admin karger; Chen B.Y.; Antaki F.; Gonzalez M.; Uchino K.; Albahra S.; Robertson S.; Ibrikji S.; Aube E.; Russman A.; Hussain M.S. (2025). Supplementary Material for: Automated Identification of Stroke Thrombolysis Contraindications from Synthetic Clinical Notes – a Proof-of-Concept Study [Dataset]. http://doi.org/10.6084/m9.figshare.28605911.v1
    Explore at:
    Available download formats: docx
    Dataset updated
    Mar 17, 2025
    Dataset provided by
    Karger Publishers
    Authors
    figshare admin karger; Chen B.Y.; Antaki F.; Gonzalez M.; Uchino K.; Albahra S.; Robertson S.; Ibrikji S.; Aube E.; Russman A.; Hussain M.S.
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Timely thrombolytic therapy improves outcomes in acute ischemic stroke. Manual chart review to screen for thrombolysis contraindications may be time-consuming and prone to errors. We developed and tested a large language model (LLM)-based tool to identify thrombolysis contraindications from clinical notes using synthetic data in a proof-of-concept study.

    Methods: We generated 150 synthetic clinical notes containing randomly assigned thrombolysis contraindications using LLMs. We then used Llama 3.1 405B with a custom prompt to generate a list of thrombolysis contraindications from each note. Performance was evaluated using sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, and F1 score.

    Results: A total of 150 synthetic notes were generated using five different models: ChatGPT-4o, Llama 3.1 405B, Llama 3.1 70B, ChatGPT-4o mini, and Gemini 1.5 Flash. On average, each note contained 241.6 words (SD 110.7; range 80-549) and included 1.5 contraindications (SD 1.1; range 0-5). Our tool achieved a sensitivity of 90.9% (95% CI: 86.3%-94.3%), specificity of 99.2% (95% CI: 98.8%-99.5%), PPV of 87.7% (95% CI: 82.7%-91.7%), NPV of 99.4% (95% CI: 99.1%-99.6%), accuracy of 98.7% (95% CI: 98.2%-99.0%), and an F1 score of 0.892. Among the false positives, 24 (86%) were due to the inclusion of irrelevant contraindications, and 4 (14%) resulted from repetitive information. No hallucinations were observed.

    Conclusion: Our LLM-based tool may identify stroke thrombolysis contraindications from synthetic clinical notes with high sensitivity and PPV. Future studies will validate its performance using real EMR data and integrate it into acute stroke workflows to facilitate faster and safer thrombolysis decision-making.
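    The abstract reports the metrics but not the raw confusion-matrix counts behind them; with hypothetical counts, the reported quantities reduce to the standard definitions:

```python
def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)            # recall / true positive rate
    specificity = tn / (tn + fp)            # true negative rate
    ppv = tp / (tp + fp)                    # positive predictive value (precision)
    npv = tn / (tn + fn)                    # negative predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)  # harmonic mean of PPV and sensitivity
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "accuracy": accuracy,
        "f1": f1,
    }
```

    Note that the reported F1 of 0.892 is consistent with the reported sensitivity (90.9%) and PPV (87.7%), since F1 is their harmonic mean.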

  10.

    synthetic-from-text-mathing-short-tasks-norwegian

    • huggingface.co
    Updated Jan 26, 2025
    + more versions
    Cite
    synthetic-from-text-mathing-short-tasks-norwegian [Dataset]. https://huggingface.co/datasets/ThatsGroes/synthetic-from-text-mathing-short-tasks-norwegian
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 26, 2025
    Authors
    Kasper Groes Albin Ludvigsen
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Thanks to Arrow Denmark and Nvidia for sponsoring the compute used to generate this dataset

    The purpose of this dataset is to pre- or post-train embedding models for text matching tasks on short texts. The dataset consists of 100,000 samples generated with gemma-2-27b-it. The column "prompt" shows the prompt given to the LLM and "response" shows the LLM output. Each sample in the dataset was generated from a seed task randomly sampled from… See the full description on the dataset page: https://huggingface.co/datasets/ThatsGroes/synthetic-from-text-mathing-short-tasks-norwegian.

  11.

    synthetic-from-text-matching-short-tasks-danish

    • huggingface.co
    Updated Jan 26, 2025
    + more versions
    Cite
    Kasper Groes Albin Ludvigsen (2025). synthetic-from-text-matching-short-tasks-danish [Dataset]. https://huggingface.co/datasets/ThatsGroes/synthetic-from-text-matching-short-tasks-danish
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 26, 2025
    Authors
    Kasper Groes Albin Ludvigsen
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Area covered
    Denmark
    Description

    Thanks to Arrow Denmark and Nvidia for sponsoring the compute used to generate this dataset

    The purpose of this dataset is to pre- or post-train embedding models for Danish text matching tasks on short texts. The dataset consists of 100,000 samples generated with gemma-2-27b-it. The column "prompt" shows the prompt given to the LLM and "response" shows the LLM output. Each sample in the dataset was generated from a seed task randomly sampled from… See the full description on the dataset page: https://huggingface.co/datasets/ThatsGroes/synthetic-from-text-matching-short-tasks-danish.

  12.

    synthetic_social_persona_tweets

    • huggingface.co
    Updated Dec 29, 2024
    Cite
    Chris Levy (2024). synthetic_social_persona_tweets [Dataset]. https://huggingface.co/datasets/chrislevy/synthetic_social_persona_tweets
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 29, 2024
    Authors
    Chris Levy
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Synthetic Social Persona Tweets Dataset

    This dataset contains synthetic social media posts generated by various language models. This dataset is only meant to be used for quick and dirty experiments i.e. it's a toy dataset. Every column/field in this dataset is generated by an LLM. The code/prompts used to create this dataset can be found here. The dataset was built to be used for some fine-tuning experiments with ModernBert for one of my blog posts/tutorials. Each row in the… See the full description on the dataset page: https://huggingface.co/datasets/chrislevy/synthetic_social_persona_tweets.

  13.

    gutenberg-dpo-v0.1

    • huggingface.co
    Updated Jan 11, 2024
    Cite
    Jon Durbin (2024). gutenberg-dpo-v0.1 [Dataset]. https://huggingface.co/datasets/jondurbin/gutenberg-dpo-v0.1
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 11, 2024
    Authors
    Jon Durbin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Gutenberg DPO

      Overview
    

    This is a dataset meant to enhance novel writing capabilities of LLMs, by using public domain books from Project Gutenberg

      Process
    

    First, each book is parsed, split into chapters, and cleaned up from the original format (superfluous newlines, illustration tags, etc. removed). Once we have chapters, an LLM is prompted with each chapter to create a synthetic prompt that would result in that chapter being written. Each chapter… See the full description on the dataset page: https://huggingface.co/datasets/jondurbin/gutenberg-dpo-v0.1.
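    The repository's actual parsing code is not reproduced here; a minimal sketch of the chapter-splitting and cleanup step, assuming a simple `CHAPTER` heading convention (real Gutenberg files vary considerably), might look like:

```python
import re

def split_chapters(book_text: str) -> list:
    """Split a plain-text book on lines beginning with 'CHAPTER' and
    collapse runs of superfluous blank lines within each chapter.
    The heading pattern is an illustrative assumption, not the
    dataset's actual parser."""
    parts = re.split(r"(?m)^CHAPTER\b.*$", book_text)
    chapters = []
    for part in parts[1:]:  # parts[0] is front matter before the first chapter
        cleaned = re.sub(r"\n{3,}", "\n\n", part).strip()
        if cleaned:
            chapters.append(cleaned)
    return chapters
```

    Each cleaned chapter would then be fed to the LLM to produce the synthetic writing prompt that pairs with it.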

  14.

    ThaiQA-v1

    • huggingface.co
    Updated Jul 24, 2024
    + more versions
    Cite
    ThaiQA-v1 [Dataset]. https://huggingface.co/datasets/ThaiSyntheticQA/ThaiQA-v1
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 24, 2024
    Dataset authored and provided by
    Thai Synthetic QA
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ThaiQA v1

    ThaiQA v1 is a Thai synthetic QA dataset, created with a synthetic method using an open-source LLM in the Thai language. We used Nvidia Nemotron 4 (340B) to create this dataset. Topics (sample counts): Technology and Gadgets (100), Travel and Tourism (91), Food and Cooking (99), Sports and Fitness (50), Arts and Entertainment (24), Home and Garden (72), Fashion and Beauty (99), Science and Nature (100), History and Culture (91), Education and Learning (99), Pets and Animals (83), Relationships and Family (78), Personal… See the full description on the dataset page: https://huggingface.co/datasets/ThaiSyntheticQA/ThaiQA-v1.

