Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Clinical data is instrumental to medical research, machine learning (ML) model development, and advancing surgical care, but access is often constrained by privacy regulations and missing data. Synthetic data offers a promising solution that preserves privacy while enabling broader data access. Recent advances in large language models (LLMs) provide an opportunity to generate synthetic data with reduced reliance on domain expertise, computational resources, and pre-training.
Objective: This study aims to assess the feasibility of generating realistic tabular clinical data with OpenAI’s GPT-4o using zero-shot prompting, and to evaluate the fidelity of the LLM-generated data by comparing its statistical properties to the Vital Signs DataBase (VitalDB), a real-world open-source perioperative dataset.
Methods: In Phase 1, GPT-4o was prompted to generate a dataset from qualitative descriptions of 13 clinical parameters. The resulting data were assessed for general errors, plausibility of outputs, and cross-verification of related parameters. In Phase 2, GPT-4o was prompted to generate a dataset using descriptive statistics from the VitalDB dataset. Fidelity was assessed using two-sample t-tests, two-sample proportion tests, and 95% confidence interval (CI) overlap.
Results: In Phase 1, GPT-4o generated a complete and structured dataset comprising 6,166 case files. The data were plausible in range, and body mass index was calculated correctly for all case files from the respective heights and weights. Statistical comparison between the LLM-generated datasets and VitalDB showed that the Phase 2 data achieved high fidelity: statistical similarity was observed in 12/13 (92.3%) parameters, with no statistically significant differences in 6/6 (100%) categorical/binary and 6/7 (85.7%) continuous parameters. Overlap of the 95% CIs was observed in 6/7 (85.7%) continuous parameters.
Conclusion: Zero-shot prompting with GPT-4o can generate realistic tabular synthetic datasets that replicate key statistical properties of real-world perioperative data. This study highlights the potential of LLMs as a novel and accessible modality for synthetic data generation, which may address critical barriers in clinical data access and eliminate the need for technical expertise, extensive computational resources, and pre-training. Further research is warranted to enhance fidelity and to investigate the use of LLMs to amplify and augment datasets, preserve multivariate relationships, and train robust ML models.
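The Phase 2 fidelity checks named in this abstract (two-sample t-tests, two-sample proportion tests, and 95% CI overlap) can be sketched as follows. This is an illustrative outline over synthetic arrays, not the study's code or data.

```python
import numpy as np
from scipy import stats

def t_test_and_ci_overlap(real, synth, alpha=0.05):
    """Welch's two-sample t-test plus a check of whether the two 95% CIs overlap."""
    _, p = stats.ttest_ind(real, synth, equal_var=False)
    cis = []
    for x in (real, synth):
        m, se = np.mean(x), stats.sem(x)
        h = se * stats.t.ppf(1 - alpha / 2, len(x) - 1)
        cis.append((m - h, m + h))
    overlap = cis[0][0] <= cis[1][1] and cis[1][0] <= cis[0][1]
    return p, overlap

def proportion_test(k1, n1, k2, n2):
    """Two-sample z-test for proportions using a pooled standard error."""
    p_pool = (k1 + k2) / (n1 + n2)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (k1 / n1 - k2 / n2) / se
    return 2 * stats.norm.sf(abs(z))

# Illustrative synthetic arrays standing in for one continuous parameter
rng = np.random.default_rng(0)
real_vals = rng.normal(70, 10, 500)     # e.g. weight in kg
synth_vals = rng.normal(70.5, 10, 500)
p_value, cis_overlap = t_test_and_ci_overlap(real_vals, synth_vals)
```

A parameter passes the similarity check when the test fails to reject at alpha = 0.05 and, for continuous parameters, the CIs overlap.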
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Thanks to Arrow Denmark and Nvidia for sponsoring the compute used to generate this dataset
The purpose of this dataset is to pre- or post-train embedding models for Danish on text similarity tasks. The dataset consists of 100,000 samples generated with gemma-2-27b-it. The column "prompt" shows the prompt given to the LLM and "response" shows the LLM output. The data generation process described in this paper was followed: https://arxiv.org/pdf/2401.00368 Compute sponsored by… See the full description on the dataset page: https://huggingface.co/datasets/ThatsGroes/synthetic-from-unit-triple-tasks-danish.
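The generation recipe cited above (arXiv:2401.00368) produces, per task, a query together with a positive and a hard-negative passage. A minimal sketch of parsing one such LLM response into an (anchor, positive, negative) training triple might look like this; the JSON field names and sample text are assumptions for illustration, not the dataset's actual schema.

```python
import json

# Hypothetical LLM "response" for one Danish triple task; the field names
# and text are invented for illustration, not taken from the dataset.
sample_response = json.dumps({
    "anchor": "Hvordan laver man rugbrød?",
    "positive": "Opskrift på klassisk dansk rugbrød med surdej.",
    "negative": "Vejret i København bliver regnfuldt i morgen.",
})

def to_triple(response: str) -> tuple:
    """Parse a JSON-formatted LLM response into an (anchor, positive, negative) triple."""
    obj = json.loads(response)
    return obj["anchor"], obj["positive"], obj["negative"]

triple = to_triple(sample_response)
```

Triples in this shape are the standard input for contrastive fine-tuning of embedding models.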
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Thanks to Arrow Denmark and Nvidia for sponsoring the compute used to generate this dataset
The purpose of this dataset is to pre- or post-train embedding models on text similarity tasks. The dataset consists of 100,000 samples generated with gemma-2-27b-it. The column "prompt" shows the prompt given to the LLM and "response" shows the LLM output. The data generation process described in this paper was followed: https://arxiv.org/pdf/2401.00368 Compute sponsored by Arrow Denmark and… See the full description on the dataset page: https://huggingface.co/datasets/ThatsGroes/synthetic-from-unit-triple-tasks-norwegian.
Open Data Commons Attribution License (ODC-By): https://choosealicense.com/licenses/odc-by/
Dataset Card for Atcoder-CoT
Dataset Description
Atcoder-CoT is a proof-of-concept dataset designed to demonstrate how a dataset like the one found here can be used to generate synthetic datasets for training reasoning models, particularly for Supervised Fine-Tuning (SFT) and Knowledge Distillation. It leverages human-created and debugged solutions, combined with LLM-generated text, to create conversational turns. The dataset currently consists of a single column:… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/atcoder_cot.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
WangchanThaiInstruct Multi-turn Conversation Dataset
This Thai multi-turn conversation dataset was created from airesearch/WangchanThaiInstruct (Batch 1) using an LLM. It was generated synthetically with an open-source LLM in the Thai language.
Citation
Thammaleelakul, S., & Phatthiyaphaibun, W. (2024). WangchanThaiInstruct Multi-turn Conversation Dataset [Data set]. Zenodo. https://doi.org/10.5281/zenodo.13132633
or BibTeX @dataset{thammaleelakul_2024_13132633, author… See the full description on the dataset page: https://huggingface.co/datasets/ThaiSyntheticQA/WangchanThaiInstruct_Multi-turn_Conversation_Dataset.
https://www.marketresearchforecast.com/privacy-policy
The size of the Generative Artificial Intelligence (AI) in Healthcare Market was valued at USD XX Million in 2023 and is projected to reach USD XXX Million by 2032, with an expected CAGR of XXX% during the forecast period. Generative AI in healthcare refers to the application of advanced machine learning models that can create new outputs based on existing data. In healthcare, generative AI is used to design new drug molecules, create synthetic medical data for research, generate personalized treatment plans, and assist in medical imaging analysis. By learning patterns from vast datasets of patient records, medical literature, and diagnostic images, generative AI models can produce insights, predictive models, and recommendations, enhancing the efficiency, accuracy, and personalization of healthcare services. These tools can also contribute to the development of diagnostic algorithms, enabling earlier detection of diseases and improving patient outcomes. The market growth is primarily attributed to the rising demand for personalized medicine, the increasing adoption of AI in healthcare applications, and government initiatives promoting the use of AI in healthcare.
Recent developments include: In February 2024, Persistent Systems launched a generative AI-powered population health management (PHM) solution in collaboration with Microsoft. In August 2023, Cognizant expanded its partnership with Google Cloud to develop healthcare large language model (LLM) solutions using Google Cloud's generative AI technology. In April 2023, Microsoft expanded its collaboration agreement with Epic Systems Corporation to develop and integrate generative AI into healthcare; under the agreement, Microsoft would use the Azure OpenAI Service with Epic's electronic health record (EHR) software to increase productivity, enhance patient care, and improve the financial integrity of health systems globally. In March 2023, NVIDIA Corporation announced a collaboration with Medtronic to accelerate the development of generative AI technology in the healthcare system and bring new AI-based solutions into patient care. In November 2022, Syntegra announced the launch of Syntegra Medical Mind 2.0 to expand its generative AI technology for generating synthetic healthcare data.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Thanks to Arrow Denmark and Nvidia for sponsoring the compute used to generate this dataset
The purpose of this dataset is to pre- or post-train embedding models for Danish text classification tasks. The dataset consists of 100,000 samples generated with gemma-2-27b-it. The column "prompt" shows the prompt given to the LLM and "response" shows the LLM output. Each sample in the dataset was generated from a seed task randomly sampled from… See the full description on the dataset page: https://huggingface.co/datasets/ThatsGroes/synthetic-from-classification-tasks-danish.
In doctor-patient conversations, identifying medically relevant information is crucial, which motivates conversation summarization. In this work, we propose the first deployable real-time speech summarization system for real-world industry applications: it generates a local summary after every N speech utterances within a conversation and a global summary at the end of the conversation. The system can enhance user experience from a business standpoint while also reducing computational costs from a technical perspective. Second, we present VietMed-Sum, which, to our knowledge, is the first speech summarization dataset for medical conversations. Third, we are the first to use an LLM and human annotators collaboratively to create gold-standard and synthetic summaries for medical conversation summarization.
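The local/global summarization scheme described above can be sketched as a simple control-flow outline; `summarize_fn` stands in for the actual summarization model and is an assumption for illustration, not the system's real interface.

```python
def summarize_stream(utterances, n, summarize_fn):
    """Emit a local summary after every n utterances and a global summary at the end."""
    local_summaries, buffer = [], []
    for utt in utterances:
        buffer.append(utt)
        if len(buffer) == n:
            local_summaries.append(summarize_fn(buffer))
            buffer = []
    if buffer:  # summarize any trailing partial window
        local_summaries.append(summarize_fn(buffer))
    return local_summaries, summarize_fn(utterances)

# Toy stand-in summarizer: simply joins the utterances in each window
utts = [f"utterance {i}" for i in range(7)]
locals_, global_ = summarize_stream(utts, 3, lambda chunk: " | ".join(chunk))
```

Emitting local summaries per window keeps latency bounded during the conversation, while the single global pass runs only once at the end.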
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Timely thrombolytic therapy improves outcomes in acute ischemic stroke. Manual chart review to screen for thrombolysis contraindications may be time-consuming and prone to errors. We developed and tested a large language model (LLM)-based tool to identify thrombolysis contraindications from clinical notes using synthetic data in a proof-of-concept study.
Methods: We generated 150 synthetic clinical notes containing randomly assigned thrombolysis contraindications using LLMs. We then used Llama 3.1 405B with a custom prompt to generate a list of thrombolysis contraindications from each note. Performance was evaluated using sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, and F1 score.
Results: A total of 150 synthetic notes were generated using five different models: ChatGPT-4o, Llama 3.1 405B, Llama 3.1 70B, ChatGPT-4o mini, and Gemini 1.5 Flash. On average, each note contained 241.6 words (SD 110.7; range 80-549) and included 1.5 contraindications (SD 1.1; range 0-5). Our tool achieved a sensitivity of 90.9% (95% CI: 86.3%-94.3%), specificity of 99.2% (95% CI: 98.8%-99.5%), PPV of 87.7% (95% CI: 82.7%-91.7%), NPV of 99.4% (95% CI: 99.1%-99.6%), accuracy of 98.7% (95% CI: 98.2%-99.0%), and an F1 score of 0.892. Among the false positives, 24 (86%) were due to the inclusion of irrelevant contraindications, and 4 (14%) resulted from repetitive information. No hallucinations were observed.
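The metrics reported above follow directly from confusion-matrix counts. A small sketch of how they are computed (with illustrative counts, not the study's actual tallies):

```python
def screening_metrics(tp, fp, tn, fn):
    """Standard screening metrics computed from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    ppv = tp / (tp + fp)  # positive predictive value
    return {
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "ppv": ppv,
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "f1": 2 * ppv * sensitivity / (ppv + sensitivity),
    }

# Illustrative counts only -- the study reports rates, not raw tallies
metrics = screening_metrics(tp=90, fp=10, tn=890, fn=10)
```

Note that with many true negatives (most notes lack most contraindications), accuracy and NPV are high almost automatically, which is why sensitivity and PPV are the more informative figures here.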
Conclusion: Our LLM-based tool may identify stroke thrombolysis contraindications from synthetic clinical notes with high sensitivity and PPV. Future studies will validate its performance using real EMR data and integrate it into acute stroke workflows to facilitate faster and safer thrombolysis decision-making.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Thanks to Arrow Denmark and Nvidia for sponsoring the compute used to generate this dataset
The purpose of this dataset is to pre- or post-train embedding models for text matching tasks on short texts. The dataset consists of 100,000 samples generated with gemma-2-27b-it. The column "prompt" shows the prompt given to the LLM and "response" shows the LLM output. Each sample in the dataset was generated from a seed task randomly sampled from… See the full description on the dataset page: https://huggingface.co/datasets/ThatsGroes/synthetic-from-text-mathing-short-tasks-norwegian.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Thanks to Arrow Denmark and Nvidia for sponsoring the compute used to generate this dataset
The purpose of this dataset is to pre- or post-train embedding models for Danish text matching tasks on short texts. The dataset consists of 100,000 samples generated with gemma-2-27b-it. The column "prompt" shows the prompt given to the LLM and "response" shows the LLM output. Each sample in the dataset was generated from a seed task randomly sampled from… See the full description on the dataset page: https://huggingface.co/datasets/ThatsGroes/synthetic-from-text-matching-short-tasks-danish.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Synthetic Social Persona Tweets Dataset
This dataset contains synthetic social media posts generated by various language models. It is only meant for quick-and-dirty experiments, i.e., it is a toy dataset. Every column/field in this dataset is generated by an LLM. The code/prompts used to create this dataset can be found here. The dataset was built for some fine-tuning experiments with ModernBERT for one of my blog posts/tutorials. Each row in the… See the full description on the dataset page: https://huggingface.co/datasets/chrislevy/synthetic_social_persona_tweets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Gutenberg DPO
Overview
This dataset is meant to enhance the novel-writing capabilities of LLMs, using public-domain books from Project Gutenberg.
Process
First, each book is parsed, split into chapters, and cleaned up from the original format (removing superfluous newlines, illustration tags, etc.). Once we have chapters, an LLM is prompted with each chapter to create a synthetic prompt that would result in that chapter being written. Each chapter… See the full description on the dataset page: https://huggingface.co/datasets/jondurbin/gutenberg-dpo-v0.1.
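The parse/split/clean step described above can be sketched roughly as follows; the chapter-heading pattern and cleanup rules here are assumptions for illustration, not the actual gutenberg-dpo pipeline.

```python
import re

def split_chapters(book_text: str) -> list:
    """Split a plain-text book on 'CHAPTER ...' headings and normalise whitespace."""
    parts = re.split(r"(?mi)^chapter\s+\w+.*$", book_text)
    chapters = []
    for part in parts[1:]:  # parts[0] is front matter before the first chapter
        cleaned = re.sub(r"\n{3,}", "\n\n", part)               # collapse blank-line runs
        cleaned = re.sub(r"\[Illustration:[^\]]*\]", "", cleaned)  # drop illustration tags
        cleaned = cleaned.strip()
        if cleaned:
            chapters.append(cleaned)
    return chapters

book = "Front matter\n\nCHAPTER I\nIt was a dark night.\n\n\n\nVery dark.\n\nCHAPTER II\nMorning came."
chapters = split_chapters(book)
```

Each cleaned chapter would then be fed to the LLM to produce the synthetic writing prompt paired with it.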
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ThaiQA v1
ThaiQA v1 is a synthetic Thai QA dataset, generated with an open-source LLM in the Thai language; we used Nvidia Nemotron 4 (340B) to create it. Topics (question counts): Technology and Gadgets (100), Travel and Tourism (91), Food and Cooking (99), Sports and Fitness (50), Arts and Entertainment (24), Home and Garden (72), Fashion and Beauty (99), Science and Nature (100), History and Culture (91), Education and Learning (99), Pets and Animals (83), Relationships and Family (78), Personal… See the full description on the dataset page: https://huggingface.co/datasets/ThaiSyntheticQA/ThaiQA-v1.