Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Clinical data is instrumental to medical research, machine learning (ML) model development, and advancing surgical care, but access is often constrained by privacy regulations and missing data. Synthetic data offers a promising solution that preserves privacy while enabling broader data access. Recent advances in large language models (LLMs) provide an opportunity to generate synthetic data with reduced reliance on domain expertise, computational resources, and pre-training.
Objective: This study aims to assess the feasibility of generating realistic tabular clinical data with OpenAI’s GPT-4o using zero-shot prompting, and to evaluate the fidelity of the LLM-generated data by comparing its statistical properties to the Vital Signs DataBase (VitalDB), a real-world open-source perioperative dataset.
Methods: In Phase 1, GPT-4o was prompted to generate a dataset from qualitative descriptions of 13 clinical parameters. The resulting data were assessed for general errors, plausibility of outputs, and cross-verification of related parameters. In Phase 2, GPT-4o was prompted to generate a dataset using descriptive statistics from the VitalDB dataset. Fidelity was assessed using two-sample t-tests, two-sample proportion tests, and 95% confidence interval (CI) overlap.
Results: In Phase 1, GPT-4o generated a complete and structured dataset comprising 6,166 case files. The data were plausible in range, and body mass index was calculated correctly for all case files from the respective heights and weights. Statistical comparison between the LLM-generated datasets and VitalDB showed that the Phase 2 data achieved high fidelity: statistical similarity was observed in 12/13 (92.3%) parameters, with no statistically significant differences in 6/6 (100%) categorical/binary and 6/7 (85.7%) continuous parameters. Overlap of the 95% CIs was observed in 6/7 (85.7%) continuous parameters.
Conclusion: Zero-shot prompting with GPT-4o can generate realistic tabular synthetic datasets that replicate key statistical properties of real-world perioperative data. This study highlights the potential of LLMs as a novel and accessible modality for synthetic data generation, which may address critical barriers in clinical data access and eliminate the need for technical expertise, extensive computational resources, and pre-training. Further research is warranted to enhance fidelity and to investigate the use of LLMs to amplify and augment datasets, preserve multivariate relationships, and train robust ML models.
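The Phase 2 fidelity checks named in this abstract (two-sample t-tests, two-sample proportion tests, and 95% CI overlap) can be sketched as follows. This is an illustrative outline over synthetic arrays, not the study's code or data.

```python
import numpy as np
from scipy import stats

def t_test_and_ci_overlap(real, synth, alpha=0.05):
    """Welch's two-sample t-test plus a check of whether the two 95% CIs overlap."""
    _, p = stats.ttest_ind(real, synth, equal_var=False)
    cis = []
    for x in (real, synth):
        m, se = np.mean(x), stats.sem(x)
        h = se * stats.t.ppf(1 - alpha / 2, len(x) - 1)
        cis.append((m - h, m + h))
    overlap = cis[0][0] <= cis[1][1] and cis[1][0] <= cis[0][1]
    return p, overlap

def proportion_test(k1, n1, k2, n2):
    """Two-sample z-test for proportions using a pooled standard error."""
    p_pool = (k1 + k2) / (n1 + n2)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (k1 / n1 - k2 / n2) / se
    return 2 * stats.norm.sf(abs(z))

# Illustrative synthetic arrays standing in for one continuous parameter
rng = np.random.default_rng(0)
real_vals = rng.normal(70, 10, 500)     # e.g. weight in kg
synth_vals = rng.normal(70.5, 10, 500)
p_value, cis_overlap = t_test_and_ci_overlap(real_vals, synth_vals)
```

A parameter passes the similarity check when the test fails to reject at alpha = 0.05 and, for continuous parameters, the CIs overlap.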
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Thanks to Arrow Denmark and Nvidia for sponsoring the compute used to generate this dataset
The purpose of this dataset is to pre- or post-train embedding models for Danish on text similarity tasks. The dataset consists of 100,000 samples generated with gemma-2-27b-it. The column "prompt" shows the prompt given to the LLM and "response" shows the LLM output. The data generation process described in this paper was followed: https://arxiv.org/pdf/2401.00368 Compute sponsored by… See the full description on the dataset page: https://huggingface.co/datasets/ThatsGroes/synthetic-from-unit-triple-tasks-danish.
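The generation recipe cited above (arXiv:2401.00368) produces, per task, a query together with a positive and a hard-negative passage. A minimal sketch of parsing one such LLM response into an (anchor, positive, negative) training triple might look like this; the JSON field names and sample text are assumptions for illustration, not the dataset's actual schema.

```python
import json

# Hypothetical LLM "response" for one Danish triple task; the field names
# and text are invented for illustration, not taken from the dataset.
sample_response = json.dumps({
    "anchor": "Hvordan laver man rugbrød?",
    "positive": "Opskrift på klassisk dansk rugbrød med surdej.",
    "negative": "Vejret i København bliver regnfuldt i morgen.",
})

def to_triple(response: str) -> tuple:
    """Parse a JSON-formatted LLM response into an (anchor, positive, negative) triple."""
    obj = json.loads(response)
    return obj["anchor"], obj["positive"], obj["negative"]

triple = to_triple(sample_response)
```

Triples in this shape are the standard input for contrastive fine-tuning of embedding models.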
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Thanks to Arrow Denmark and Nvidia for sponsoring the compute used to generate this dataset
The purpose of this dataset is to pre- or post-train embedding models on text similarity tasks. The dataset consists of 100,000 samples generated with gemma-2-27b-it. The column "prompt" shows the prompt given to the LLM and "response" shows the LLM output. The data generation process described in this paper was followed: https://arxiv.org/pdf/2401.00368 Compute sponsored by Arrow Denmark and… See the full description on the dataset page: https://huggingface.co/datasets/ThatsGroes/synthetic-from-unit-triple-tasks-norwegian.
Open Data Commons Attribution License (ODC-By): https://choosealicense.com/licenses/odc-by/
Dataset Card for Atcoder-CoT
Dataset Description
Atcoder-CoT is a proof-of-concept dataset designed to demonstrate how a dataset like the one found here can be used to generate synthetic datasets for training reasoning models, particularly for Supervised Fine-Tuning (SFT) and Knowledge Distillation. It leverages human-created and debugged solutions, combined with LLM-generated text, to create conversational turns. The dataset currently consists of a single column:… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/atcoder_cot.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
WangchanThaiInstruct Multi-turn Conversation Dataset
This Thai multi-turn conversation dataset was created from airesearch/WangchanThaiInstruct (Batch 1) using an LLM. It was generated synthetically with an open-source LLM in the Thai language.
Citation
Thammaleelakul, S., & Phatthiyaphaibun, W. (2024). WangchanThaiInstruct Multi-turn Conversation Dataset [Data set]. Zenodo. https://doi.org/10.5281/zenodo.13132633
or BibTeX @dataset{thammaleelakul_2024_13132633, author… See the full description on the dataset page: https://huggingface.co/datasets/ThaiSyntheticQA/WangchanThaiInstruct_Multi-turn_Conversation_Dataset.
https://www.marketresearchforecast.com/privacy-policy
The size of the Generative Artificial Intelligence (AI) in Healthcare Market was valued at USD XX Million in 2023 and is projected to reach USD XXX Million by 2032, with an expected CAGR of XXX% during the forecast period. Generative AI in healthcare refers to the application of advanced machine learning models that can create new outputs based on existing data. In healthcare, generative AI is used to design new drug molecules, create synthetic medical data for research, generate personalized treatment plans, and assist in medical imaging analysis. By learning patterns from vast datasets of patient records, medical literature, and diagnostic images, generative AI models can produce insights, predictive models, and recommendations, enhancing the efficiency, accuracy, and personalization of healthcare services. These tools can also contribute to the development of diagnostic algorithms, enabling earlier detection of diseases and improving patient outcomes. The market growth is primarily attributed to the rising demand for personalized medicine, the increasing adoption of AI in healthcare applications, and government initiatives promoting the use of AI in healthcare.
Recent developments include: In February 2024, Persistent Systems launched a generative AI-powered population health management (PHM) solution in collaboration with Microsoft. In August 2023, Cognizant expanded its partnership with Google Cloud to develop healthcare large language model (LLM) solutions using Google Cloud's generative AI technology. In April 2023, Microsoft expanded its collaboration agreement with Epic Systems Corporation to develop and integrate generative AI into healthcare; under the agreement, Microsoft would use the Azure OpenAI Service with Epic's electronic health record (EHR) software to increase productivity, enhance patient care, and improve the financial integrity of health systems globally. In March 2023, NVIDIA Corporation announced a collaboration with Medtronic to accelerate the development of generative AI technology in the healthcare system and bring new AI-based solutions into patient care. In November 2022, Syntegra announced the launch of Syntegra Medical Mind 2.0 to expand its generative AI technology for generating synthetic healthcare data.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Thanks to Arrow Denmark and Nvidia for sponsoring the compute used to generate this dataset
The purpose of this dataset is to pre- or post-train embedding models for Danish text classification tasks. The dataset consists of 100,000 samples generated with gemma-2-27b-it. The column "prompt" shows the prompt given to the LLM and "response" shows the LLM output. Each sample in the dataset was generated from a seed task randomly sampled from… See the full description on the dataset page: https://huggingface.co/datasets/ThatsGroes/synthetic-from-classification-tasks-danish.
In doctor-patient conversations, identifying medically relevant information is crucial, which motivates conversation summarization. In this work, we propose the first deployable real-time speech summarization system for real-world industry applications: it generates a local summary after every N speech utterances within a conversation and a global summary at the end of the conversation. The system can enhance user experience from a business standpoint while also reducing computational costs from a technical perspective. Second, we present VietMed-Sum, which, to our knowledge, is the first speech summarization dataset for medical conversations. Third, we are the first to use an LLM and human annotators collaboratively to create gold-standard and synthetic summaries for medical conversation summarization.
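The local/global summarization scheme described above can be sketched as a simple control-flow outline; `summarize_fn` stands in for the actual summarization model and is an assumption for illustration, not the system's real interface.

```python
def summarize_stream(utterances, n, summarize_fn):
    """Emit a local summary after every n utterances and a global summary at the end."""
    local_summaries, buffer = [], []
    for utt in utterances:
        buffer.append(utt)
        if len(buffer) == n:
            local_summaries.append(summarize_fn(buffer))
            buffer = []
    if buffer:  # summarize any trailing partial window
        local_summaries.append(summarize_fn(buffer))
    return local_summaries, summarize_fn(utterances)

# Toy stand-in summarizer: simply joins the utterances in each window
utts = [f"utterance {i}" for i in range(7)]
locals_, global_ = summarize_stream(utts, 3, lambda chunk: " | ".join(chunk))
```

Emitting local summaries per window keeps latency bounded during the conversation, while the single global pass runs only once at the end.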
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Timely thrombolytic therapy improves outcomes in acute ischemic stroke. Manual chart review to screen for thrombolysis contraindications may be time-consuming and prone to errors. We developed and tested a large language model (LLM)-based tool to identify thrombolysis contraindications from clinical notes using synthetic data in a proof-of-concept study.
Methods: We generated 150 synthetic clinical notes containing randomly assigned thrombolysis contraindications using LLMs. We then used Llama 3.1 405B with a custom prompt to generate a list of thrombolysis contraindications from each note. Performance was evaluated using sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, and F1 score.
Results: A total of 150 synthetic notes were generated using five different models: ChatGPT-4o, Llama 3.1 405B, Llama 3.1 70B, ChatGPT-4o mini, and Gemini 1.5 Flash. On average, each note contained 241.6 words (SD 110.7; range 80-549) and included 1.5 contraindications (SD 1.1; range 0-5). Our tool achieved a sensitivity of 90.9% (95% CI: 86.3%-94.3%), specificity of 99.2% (95% CI: 98.8%-99.5%), PPV of 87.7% (95% CI: 82.7%-91.7%), NPV of 99.4% (95% CI: 99.1%-99.6%), accuracy of 98.7% (95% CI: 98.2%-99.0%), and an F1 score of 0.892. Among the false positives, 24 (86%) were due to the inclusion of irrelevant contraindications, and 4 (14%) resulted from repetitive information. No hallucinations were observed.
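The metrics reported above follow directly from confusion-matrix counts. A small sketch of how they are computed (with illustrative counts, not the study's actual tallies):

```python
def screening_metrics(tp, fp, tn, fn):
    """Standard screening metrics computed from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    ppv = tp / (tp + fp)  # positive predictive value
    return {
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "ppv": ppv,
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "f1": 2 * ppv * sensitivity / (ppv + sensitivity),
    }

# Illustrative counts only -- the study reports rates, not raw tallies
metrics = screening_metrics(tp=90, fp=10, tn=890, fn=10)
```

Note that with many true negatives (most notes lack most contraindications), accuracy and NPV are high almost automatically, which is why sensitivity and PPV are the more informative figures here.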
Conclusion: Our LLM-based tool may identify stroke thrombolysis contraindications from synthetic clinical notes with high sensitivity and PPV. Future studies will validate its performance using real EMR data and integrate it into acute stroke workflows to facilitate faster and safer thrombolysis decision-making.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Thanks to Arrow Denmark and Nvidia for sponsoring the compute used to generate this dataset
The purpose of this dataset is to pre- or post-train embedding models for text matching tasks on short texts. The dataset consists of 100,000 samples generated with gemma-2-27b-it. The column "prompt" shows the prompt given to the LLM and "response" shows the LLM output. Each sample in the dataset was generated from a seed task randomly sampled from… See the full description on the dataset page: https://huggingface.co/datasets/ThatsGroes/synthetic-from-text-mathing-short-tasks-norwegian.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Thanks to Arrow Denmark and Nvidia for sponsoring the compute used to generate this dataset
The purpose of this dataset is to pre- or post-train embedding models for Danish text matching tasks on short texts. The dataset consists of 100,000 samples generated with gemma-2-27b-it. The column "prompt" shows the prompt given to the LLM and "response" shows the LLM output. Each sample in the dataset was generated from a seed task randomly sampled from… See the full description on the dataset page: https://huggingface.co/datasets/ThatsGroes/synthetic-from-text-matching-short-tasks-danish.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Synthetic Social Persona Tweets Dataset
This dataset contains synthetic social media posts generated by various language models. It is only meant for quick-and-dirty experiments, i.e., it is a toy dataset. Every column/field in this dataset is generated by an LLM. The code/prompts used to create this dataset can be found here. The dataset was built for some fine-tuning experiments with ModernBERT for one of my blog posts/tutorials. Each row in the… See the full description on the dataset page: https://huggingface.co/datasets/chrislevy/synthetic_social_persona_tweets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Gutenberg DPO
Overview
This dataset is meant to enhance the novel-writing capabilities of LLMs, using public-domain books from Project Gutenberg.
Process
First, each book is parsed, split into chapters, and cleaned up from the original format (removing superfluous newlines, illustration tags, etc.). Once we have chapters, an LLM is prompted with each chapter to create a synthetic prompt that would result in that chapter being written. Each chapter… See the full description on the dataset page: https://huggingface.co/datasets/jondurbin/gutenberg-dpo-v0.1.
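The parse/split/clean step described above can be sketched roughly as follows; the chapter-heading pattern and cleanup rules here are assumptions for illustration, not the actual gutenberg-dpo pipeline.

```python
import re

def split_chapters(book_text: str) -> list:
    """Split a plain-text book on 'CHAPTER ...' headings and normalise whitespace."""
    parts = re.split(r"(?mi)^chapter\s+\w+.*$", book_text)
    chapters = []
    for part in parts[1:]:  # parts[0] is front matter before the first chapter
        cleaned = re.sub(r"\n{3,}", "\n\n", part)               # collapse blank-line runs
        cleaned = re.sub(r"\[Illustration:[^\]]*\]", "", cleaned)  # drop illustration tags
        cleaned = cleaned.strip()
        if cleaned:
            chapters.append(cleaned)
    return chapters

book = "Front matter\n\nCHAPTER I\nIt was a dark night.\n\n\n\nVery dark.\n\nCHAPTER II\nMorning came."
chapters = split_chapters(book)
```

Each cleaned chapter would then be fed to the LLM to produce the synthetic writing prompt paired with it.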
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ThaiQA v1
ThaiQA v1 is a synthetic Thai QA dataset, generated with an open-source LLM in the Thai language; we used Nvidia Nemotron 4 (340B) to create it. Topics (question counts): Technology and Gadgets (100), Travel and Tourism (91), Food and Cooking (99), Sports and Fitness (50), Arts and Entertainment (24), Home and Garden (72), Fashion and Beauty (99), Science and Nature (100), History and Culture (91), Education and Learning (99), Pets and Animals (83), Relationships and Family (78), Personal… See the full description on the dataset page: https://huggingface.co/datasets/ThaiSyntheticQA/ThaiQA-v1.