MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Data Description
We release the synthetic data generated using the method described in the paper Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data Generation with Large Language Models (ACL 2024 Findings). The external knowledge we use is based on LLM-generated topics and writing styles.
Generated Datasets
The original train/validation/test data, and the generated synthetic training data are listed as follows. For each dataset, we generate 5000… See the full description on the dataset page: https://huggingface.co/datasets/ritaranx/clinical-synthetic-text-llm.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Clinical data is instrumental to medical research, machine learning (ML) model development, and advancing surgical care, but access is often constrained by privacy regulations and missing data. Synthetic data offers a promising solution to preserve privacy while enabling broader data access. Recent advances in large language models (LLMs) provide an opportunity to generate synthetic data with reduced reliance on domain expertise, computational resources, and pre-training.
Objective: This study aims to assess the feasibility of generating realistic tabular clinical data with OpenAI's GPT-4o using zero-shot prompting, and to evaluate the fidelity of LLM-generated data by comparing its statistical properties to the Vital Signs DataBase (VitalDB), a real-world open-source perioperative dataset.
Methods: In Phase 1, GPT-4o was prompted to generate a dataset with qualitative descriptions of 13 clinical parameters. The resulting data was assessed for general errors, plausibility of outputs, and cross-verification of related parameters. In Phase 2, GPT-4o was prompted to generate a dataset using descriptive statistics of the VitalDB dataset. Fidelity was assessed using two-sample t-tests, two-sample proportion tests, and 95% confidence interval (CI) overlap.
Results: In Phase 1, GPT-4o generated a complete and structured dataset comprising 6,166 case files. The dataset was plausible in range and correctly calculated body mass index for all case files based on the respective heights and weights. Statistical comparison between the LLM-generated datasets and VitalDB revealed that the Phase 2 data achieved significant fidelity. The Phase 2 data demonstrated statistical similarity in 12/13 (92.31%) parameters: no statistically significant differences were observed in 6/6 (100.0%) categorical/binary parameters and 6/7 (85.71%) continuous parameters. Overlap of 95% CIs was observed in 6/7 (85.71%) continuous parameters.
Conclusion: Zero-shot prompting with GPT-4o can generate realistic tabular synthetic datasets that replicate key statistical properties of real-world perioperative data. This study highlights the potential of LLMs as a novel and accessible modality for synthetic data generation, which may address critical barriers in clinical data access and eliminate the need for technical expertise, extensive computational resources, and pre-training. Further research is warranted to enhance fidelity and to investigate the use of LLMs to amplify and augment datasets, preserve multivariate relationships, and train robust ML models.
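The fidelity checks described above can be sketched in a few lines; this is a minimal illustration of a two-sample t-test plus 95% CI overlap for one continuous parameter, assuming two numeric arrays, and is not the study's actual code (the two-sample proportion test for categorical parameters is analogous).

```python
# Minimal sketch of the fidelity checks described above (not the study's code).
# `synthetic` and `real` are 1-D arrays of one continuous parameter.
import numpy as np
from scipy import stats

def ci95(x):
    """95% confidence interval for the mean of x."""
    m, se = np.mean(x), stats.sem(x)
    h = se * stats.t.ppf(0.975, len(x) - 1)
    return m - h, m + h

def compare_continuous(synthetic, real, alpha=0.05):
    # Welch's two-sample t-test (no equal-variance assumption).
    _, p = stats.ttest_ind(synthetic, real, equal_var=False)
    lo_s, hi_s = ci95(synthetic)
    lo_r, hi_r = ci95(real)
    overlap = lo_s <= hi_r and lo_r <= hi_s  # do the 95% CIs overlap?
    return {"p_value": p, "similar": p >= alpha, "ci_overlap": overlap}

rng = np.random.default_rng(0)
print(compare_continuous(rng.normal(75, 10, 6166), rng.normal(75.4, 10, 6000)))
```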
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Synthetic Data Generation Demo – UK Retail Dataset
Welcome to this synthetic data generation demo repository by Syncora.ai. This project showcases how to generate synthetic data using real-world tabular structures, demonstrated on a UK retail dataset with columns such as:
Country
CustomerID
UnitPrice
InvoiceDate
Quantity
StockCode
This dataset is designed for LLM training and AI development, enabling developers to work with privacy-safe, high-quality… See the full description on the dataset page: https://huggingface.co/datasets/syncora/uk_retail_store_synthetic_dataset.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This training data was generated using GPT-4o as part of the 'Drawing with LLM' competition (https://www.kaggle.com/competitions/drawing-with-llms). It can be used to fine-tune small language models for the competition or serve as an augmentation dataset alongside other data sources.
The dataset is generated in two steps using the GPT-4o model.
- In the first step, topic descriptions relevant to the competition are generated using the prompt below. By running this prompt multiple times, over 3,000 descriptions were collected.
prompt=f""" I am participating in an SVG code generation competition.
The competition involves generating SVG images based on short textual descriptions of everyday objects and scenes, spanning a wide range of categories. The key guidelines are as follows:
- Descriptions are generic and do not contain brand names, trademarks, or personal names.
- No descriptions include people, even in generic terms.
- Descriptions are concise: each is no more than 200 characters, with an average length of about 50 characters.
- Categories cover various domains, with some overlap between public and private test sets.
To train a small LLM model, I am preparing a synthetic dataset. Could you generate 100 unique topics aligned with the competition style?
Requirements:
- Each topic should range between **20 and 200 characters**, with an **average around 60 characters**.
- Ensure **diversity and creativity** across topics.
- **50% of the topics** should come from the categories of **landscapes**, **abstract art**, and **fashion**.
- Avoid duplication or overly similar phrasing.
Example topics:
a purple forest at dusk, gray wool coat with a faux fur collar, a lighthouse overlooking the ocean, burgundy corduroy pants with patch pockets and silver buttons, orange corduroy overalls, a purple silk scarf with tassel trim, a green lagoon under a cloudy sky, crimson rectangles forming a chaotic grid, purple pyramids spiraling around a bronze cone, magenta trapezoids layered on a translucent silver sheet, a snowy plain, black and white checkered pants, a starlit night over snow-covered peaks, khaki triangles and azure crescents, a maroon dodecahedron interwoven with teal threads.
Please return the 100 topics in csv format.
"""
prompt = f"""
Generate SVG code to visually represent the following text description, while respecting the given constraints.
Allowed Elements: `svg`, `path`, `circle`, `rect`, `ellipse`, `line`, `polyline`, `polygon`, `g`, `linearGradient`, `radialGradient`, `stop`, `defs`
Allowed Attributes: `viewBox`, `width`, `height`, `fill`, `stroke`, `stroke-width`, `d`, `cx`, `cy`, `r`, `x`, `y`, `rx`, `ry`, `x1`, `y1`, `x2`, `y2`, `points`, `transform`, `opacity`
Please ensure that the generated SVG code is well-formed, valid, and strictly adheres to these constraints.
Focus on a clear and concise representation of the input description within the given limitations.
Always give the complete SVG code with nothing omitted. Never use an ellipsis.
The code is scored based on similarity to the description, visual question answering, and aesthetic components.
Please generate detailed SVG code accordingly.
input description: {text}
"""
The raw SVG output is then cleaned and sanitized using a competition-specific sanitization class. After that, the cleaned SVG is scored using the SigLIP model to evaluate text-to-SVG similarity. Only SVGs with a score above 0.5 are included in the dataset. On average, out of three SVG generations, only one meets the quality threshold after the cleaning, sanitization, and scoring process.
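The scoring-and-filtering step might look like the sketch below; the rasterizer (cairosvg) and the google/siglip-base-patch16-224 checkpoint are assumptions not stated in the description, and the competition's sanitization class is not reproduced here.

```python
# Sketch of the filter: rasterize the SVG, score text-image similarity with
# SigLIP, and keep only SVGs scoring above 0.5. Checkpoint and rasterizer are
# assumptions; cleaning and sanitization happen before this step.
import io
import torch
import cairosvg
from PIL import Image
from transformers import AutoProcessor, AutoModel

processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")
model = AutoModel.from_pretrained("google/siglip-base-patch16-224")

def keep_svg(svg_code: str, description: str, threshold: float = 0.5) -> bool:
    png = cairosvg.svg2png(bytestring=svg_code.encode())
    image = Image.open(io.BytesIO(png)).convert("RGB")
    inputs = processor(text=[description], images=image,
                       padding="max_length", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape (1, 1)
    score = torch.sigmoid(logits)[0, 0].item()  # SigLIP scores pairs with a sigmoid
    return score > threshold
```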
A dataset with ~50,000 samples for SVG code generation is publicly available at: https://huggingface.co/datasets/vinoku89/svg-code-generation
Verbalized-Sampling-Synthetic-Data-Generation
This dataset showcases how Verbalized Sampling (VS) can be used to generate high-quality, diverse synthetic training data for mathematical reasoning tasks. From the paper Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity.
Dataset Description
The Synthetic Data Generation dataset contains mathematical problem-solution pairs generated by different methods using state-of-the-art LLMs. This dataset… See the full description on the dataset page: https://huggingface.co/datasets/CHATS-Lab/Verbalized-Sampling-Synthetic-Data-Generation.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The dataset contains conversation summaries, topics, and dialogues used to build a pipeline for fine-tuning an LLM with Parameter-Efficient Fine-Tuning (PEFT) and Low-Rank Adaptation (LoRA), a popular and lightweight training technique that significantly reduces the number of trainable parameters.
The dataset is also available on Hugging Face: https://huggingface.co/datasets/knkarthick/dialogsum
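For reference, a LoRA configuration for such a fine-tuning pipeline typically looks like the sketch below; the base model (flan-t5-base) and all hyperparameters are illustrative choices, not taken from the dataset card.

```python
# Minimal LoRA sketch with the peft library; base model and hyperparameters
# are illustrative, not from the dataset card.
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, get_peft_model, TaskType

base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=32,              # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["q", "v"],  # attention projections in T5 blocks
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of weights are trainable
```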
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Recruitment and career advisory teams in the HR industry often face challenges with sensitive, hard-to-access data. This dataset removes that barrier by providing synthetic HR conversations and resume screening Q&A, structured for LLM training in JSONL format.
It enables HR teams and AI developers to build smarter internal chatbots, automate candidate screening, accelerate onboarding workflows, and create AI-powered career advisory tools, all while keeping data privacy intact. This helps organizations improve efficiency, reduce manual effort, and scale AI-driven HR solutions.
This dataset contains synthetic HR conversations and resume screening Q&A, formatted for LLM fine-tuning. Each record represents a dialogue simulating real-world HR workflows like candidate evaluation and career guidance.
messages: An array of chat messages with roles (system, user, assistant) and their respective content.
{
"messages": [
{"role": "system", "content": "You are an informative assistant."},
{"role": "user", "content": "What is AbdulMuiz Shaikh's current job title?"},
{"role": "assistant", "content": "AbdulMuiz Shaikh's current job title is Associate Data Scientist."}
]
}
Fine-tune with OpenAI:

```bash
openai tools fine_tunes.prepare_data -f hr_resume_qna.jsonl
openai api fine_tunes.create -t "hr_resume_qna.jsonl" -m "gpt-3.5-turbo"
```
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Gardening LLM Synthetic Training - Multiturn Dialog Dataset
Dataset Description
This dataset contains a sample of synthetic multiturn conversations between home gardeners and an expert gardening assistant ("GardenBot"). The conversations cover five key gardening topics with detailed subtopics and plant-specific advice, designed for training conversational LLMs.
Dataset Overview
Curated by: CJ Jones
Language: English
License: CC BY-NC-SA 4.0
Size: 250 of 100… See the full description on the dataset page: https://huggingface.co/datasets/CJJones/Gardening_LLM_Synthetic_Training_Multiturn_Dialog.
WiserBrand offers a unique dataset of real consumer-to-business phone conversations. These high-quality audio recordings capture authentic interactions between consumers and support agents across industries. Unlike synthetic data or scripted samples, our dataset reflects natural speech patterns, emotion, intent, and real-world phrasing, making it ideal for building and testing voice AI models.
We provide custom datasets on demand:
- Multi-language datasets
- Calls from various countries
- Calls to companies in specific industries (healthcare, banking, e-commerce, etc.)
- The larger the volume you purchase, the lower the price will be.
We ensure strict data privacy: all personally identifiable information (PII) is removed before delivery.
Recordings are produced on demand and can be tailored by vertical (e.g., telecom, finance, e-commerce) or use case.
Whether you're building next-gen voice technology or need realistic conversational datasets to test models, this dataset provides what synthetic corpora lack: realism, variation, and authenticity.
CDLA-Sharing 1.0: https://choosealicense.com/licenses/cdla-sharing-1.0/
Bitext - Telco Tagged Training Dataset for LLM-based Virtual Assistants
Overview
This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [telco] sector can be easily achieved using our two-step approach to LLM Fine-Tuning. An overview of… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-telco-llm-chatbot-training-dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Tiny-LLM: Synthetic Question-Answering Dataset
Dataset Description
This dataset was created for the fine-tuning stage of the Tiny-LLM Project, a project focused on training and evaluating compact language models from scratch. It contains 706,727 high-quality, synthetic multi-turn Question-Answering (Q&A) conversations in English, generated using the Gemini API. The dataset was designed to teach small models instruction-following capabilities across a diverse range of… See the full description on the dataset page: https://huggingface.co/datasets/Gabriel8/tiny-llm-synthetic-qa.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
High-Fidelity Synthetic Dataset for LLM Training, Legal NLP & AI Research
This repository provides a synthetic dataset of legal contract Q&A interactions, modeled after real-world corporate filings (e.g., SEC 8-K disclosures). The data is generated using Syncora.ai, ensuring privacy-safe, fake data that is safe to use in LLM training, benchmarking, and experimentation.
This free dataset captures the style and structure of legal exchanges without exposing any confidential or sensitive client information.
| Feature | Description |
|---|---|
| Structured JSONL Format | Includes system, user, and assistant roles for conversational Q&A. |
| Contract & Compliance Questions | Modeled on SEC filings and legal disclosure scenarios. |
| Statistically Realistic Fake Data | Fully synthetic, mirrors real-world patterns without privacy risks. |
| NLP-Ready | Optimized for direct fine-tuning, benchmarking, and evaluation in LLM pipelines. |
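Given the JSONL chat format described in the table, a record validator might look like this minimal sketch; the file name is hypothetical and the role conventions are assumed from the format description.

```python
# Minimal sketch: check that each JSONL record uses the documented chat roles.
# The file name is hypothetical.
import json

VALID_ROLES = {"system", "user", "assistant"}

with open("legal_contract_qna.jsonl") as f:
    for i, line in enumerate(f, 1):
        record = json.loads(line)
        roles = [m["role"] for m in record["messages"]]
        assert set(roles) <= VALID_ROLES, f"line {i}: unexpected role"
        assert roles[-1] == "assistant", f"line {i}: should end with an assistant turn"
```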
This synthetic legal dataset is not just for LLM training: it enables developers and researchers to create simulated regulatory scenarios, such as stress-testing AI systems in legal environments.
Syncora.ai creates synthetic datasets optimized for LLM training.
Take your AI projects further with Syncora.ai:
→ Generate your own synthetic datasets now
This dataset is released under the MIT License.
It is 100% synthetic, safe for LLM training, and ideal for research, experimentation, and open-source projects.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
High-quality synthetic dataset for chatbot training, LLM fine-tuning, and AI research in conversational systems.
This dataset provides a fully synthetic collection of customer support interactions, generated using Syncora.ai's synthetic data generation engine.
It mirrors realistic support conversations across e-commerce, banking, SaaS, and telecom domains, ensuring diversity, context depth, and privacy-safe realism.
Each conversation simulates multi-turn dialogues between a customer and a support agent, making it ideal for training chatbots, LLMs, and retrieval-augmented generation (RAG) systems.
This is a free dataset, designed for LLM training, chatbot model fine-tuning, and dialogue understanding research.
| Feature | Description |
|---|---|
| conversation_id | Unique identifier for each dialogue session |
| domain | Industry domain (e.g., banking, telecom, retail) |
| role | Speaker role: customer or support agent |
| message | Message text (synthetic conversation content) |
| intent_label | Labeled customer intent (e.g., refund_request, password_reset) |
| resolution_status | Whether the query was resolved or escalated |
| sentiment_score | Sentiment polarity of the conversation |
| language | Language of interaction (supports multilingual synthetic data) |
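Given the schema above, loading and slicing the conversations might look like this; a minimal pandas sketch assuming a local JSON Lines export with exactly these field names (the file name is hypothetical).

```python
# Minimal sketch: load the conversations and slice by the schema fields above.
# Assumes a local JSONL export; the file name is hypothetical.
import pandas as pd

df = pd.read_json("customer_support_conversations.jsonl", lines=True)

# Escalated, negative-sentiment banking queries: candidates for review.
subset = df[(df["domain"] == "banking")
            & (df["resolution_status"] == "escalated")
            & (df["sentiment_score"] < 0)]

# Rebuild multi-turn dialogues per session for chatbot fine-tuning.
dialogues = (df.groupby("conversation_id")
               .apply(lambda g: list(zip(g["role"], g["message"]))))
```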
Use Syncora.ai to generate synthetic conversational datasets for your AI or chatbot projects:
Try Synthetic Data Generation tool
This dataset is released under the MIT License.
It is fully synthetic, free, and safe for LLM training, chatbot model fine-tuning, and AI research.
NIST License: https://www.nist.gov/open/license
TrojAI llm-pretrain-apr2024 Train Dataset
This is the training data used to create and evaluate trojan detection software solutions. This data, generated at NIST, consists of Llama2 large language models refined using fine-tuning and LoRA to perform next-token prediction. A known percentage of these trained AI models have been poisoned with triggers that induce modified behavior. This data will be used to develop software solutions for detecting which trained AI models have been poisoned via triggers embedded into the model weights.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is the datamix created by our team during the LLM - Detect AI Generated Text competition. This dataset helped us win the competition. It facilitates a text-classification task: separating LLM-generated essays from student-written ones.
It was developed incrementally, focusing on size, diversity, and complexity. With each datamix iteration, we attempted to plug blind spots of the previous generation of models while maintaining robustness.
To maximally leverage in-domain human texts, we used the entire Persuade corpus comprising all 15 prompts. We also included diverse human texts from sources such as the OpenAI GPT-2 output dataset, the ELLIPSE corpus, NarrativeQA, Wikipedia, the NLTK Brown corpus, and IMDB movie reviews.
Sources for our generated essays can be grouped under four categories:
- Proprietary LLMs (gpt-3.5, gpt-4, claude, cohere, gemini, palm)
- Open-source LLMs (llama, falcon, mistral, mixtral)
- Existing LLM-generated text datasets (DAIGT V2 subset, OUTFOX, Ghostbuster, gpt-2-output-dataset)
- Synthetic dataset made by T5
We used a wide variety of generation configs and prompting strategies to promote diversity and complexity in the data. Generated essays leveraged a combination of the following (a sketch of how these map onto decoding parameters follows this list):
- Contrastive search
- Use of guidance scale, typical_p, suppress_tokens
- High temperature and large values of top-k
- Prompting to fill in the blank: randomly mask words in an essay and ask the LLM to reconstruct the original essay (similar to MLM)
- Prompting without source texts
- Prompting with source texts
- Prompting to rewrite existing essays
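In Hugging Face transformers, these strategies map onto generate() arguments roughly as follows; a hedged sketch in which the model and all parameter values are illustrative examples, not the team's exact settings.

```python
# Illustrative decoding configurations; model and values are examples only.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "mistralai/Mistral-7B-Instruct-v0.2"  # any causal LM works here
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
inputs = tok("Write an essay about phones in school.", return_tensors="pt")

# Contrastive search: penalty_alpha combined with a small top_k.
out = model.generate(**inputs, penalty_alpha=0.6, top_k=4, max_new_tokens=512)

# High-temperature sampling with large top_k and typical_p.
out = model.generate(**inputs, do_sample=True, temperature=1.4,
                     top_k=500, typical_p=0.9, max_new_tokens=512)

# Classifier-free guidance scale, plus suppressing EOS to force longer essays.
out = model.generate(**inputs, guidance_scale=1.5,
                     suppress_tokens=[tok.eos_token_id], max_new_tokens=512)
```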
Finally, we incorporated augmented essays to make our models aware of typical attacks on LLM content detection systems and of obfuscations present in the provided training data. We mainly used a combination of the following augmentations on a random subset of essays (a character-level sketch follows this list):
- Spelling correction
- Deletion/insertion/swapping of characters
- Replacement with synonyms
- Introduced obfuscations
- Back-translation
- Random capitalization
- Sentence swapping
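Simple character-level versions of a few of these augmentations can be sketched as follows; this is illustrative only, not the team's exact pipeline.

```python
# Illustrative character-level augmentations (deletion, insertion, random
# capitalization); a sketch only, not the team's exact pipeline.
import random
import string

def perturb_chars(text: str, p: float = 0.03, seed: int = 0) -> str:
    """Randomly delete, insert, or re-case characters, each with probability ~p."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        r = rng.random()
        if r < p:                      # deletion
            continue
        elif r < 2 * p:                # insertion of a random lowercase letter
            out.append(rng.choice(string.ascii_lowercase))
        elif r < 3 * p:                # random capitalization flip
            ch = ch.swapcase()
        out.append(ch)
    return "".join(out)

print(perturb_chars("Students should be allowed to use phones in class."))
```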
Customer Support Conversation Dataset – Powered by Syncora.ai
A free synthetic dataset for chatbot training, LLM fine-tuning, and synthetic data generation research. Created using Syncora.ai's privacy-safe synthetic data engine, this dataset is ideal for developing, testing, and benchmarking AI customer support systems. It serves as a dataset for chatbot training and a dataset for LLM training, offering rich, structured conversation data for real-world simulation.
… See the full description on the dataset page: https://huggingface.co/datasets/syncora/customer_support_conversations_dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The use of real world health data for Foundation Model training often comes with concerns due to the potential sharing of sensitive information. Synthetic data may prove to be one of the best assets to limit such concerns. In this manuscript, we introduce a new paradigm of training Foundation Models - generate synthetic data, encode it with a compression method and frequency-based mapping, and use these encoded data to align a Foundation Model. We demonstrate our pipeline on the task of colorectal cancer patient stratification into consensus molecular subtypes (CMS) using a decoder-only model. Evaluation of the aligned model on real data results in a balanced accuracy and F1 score of approximately 91%, competitive with baselines established by prior work leveraging real data as well as with models trained directly on synthetic data.
This repository contains the data used in the experiments and the results of LLM-finetuning in json form. The numbered folders in data.zip correspond to the seed that was used for generating the synthetic data.
Energy consumption of artificial intelligence (AI) models in training is considerable: both GPT-3, the original release of the current iteration of OpenAI's popular ChatGPT, and Gopher consumed well over ********** megawatt-hours of energy for training alone. Since this covers only training, the energy consumption over the entire usage and lifetime of GPT-3 and other large language models (LLMs) is likely significantly higher.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
High-Fidelity Synthetic Medical Records for AI, ML Modeling, LLM Training & HealthTech Research
This is a synthetic dataset of healthcare records generated using Syncora.ai, a next-generation synthetic data generation platform designed for privacy-safe AI development.
It simulates patient demographics, medical conditions, treatments, billing, and admission data, preserving statistical realism while ensuring 0% privacy risk.
This free dataset is designed for AI and ML modeling, LLM training, and HealthTech research.
Think of this as fake data that mimics real-world healthcare patterns: statistically accurate, but without any sensitive patient information.
The dataset captures patient-level hospital information, including patient demographics, medical conditions, treatments, billing, and admission details.
All records are 100% synthetic, maintaining the statistical properties of real-world healthcare data while remaining safe to share and use for ML & LLM tasks.
Unlike most healthcare datasets, this one is tailored for LLM training.
Syncora.ai is a synthetic data generation platform designed for healthcare, finance, and enterprise AI.
Take your AI projects to the next level with Syncora.ai:
→ Generate your own synthetic datasets now
This is a free dataset, 100% synthetic, and contains no real patient information.
It is safe for public use in education, research, open-source contributions, LLM training, and AI development.
ODC-By: https://choosealicense.com/licenses/odc-by/
SmolLM-Corpus
This dataset is a curated collection of high-quality educational and synthetic data designed for training small language models. You can find more details about the models trained on this dataset in our SmolLM blog post.
Dataset subsets
Cosmopedia v2
Cosmopedia v2 is an enhanced version of Cosmopedia, the largest synthetic dataset for pre-training, consisting of over 39 million textbooks, blog posts, and stories generated by… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus.