MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Data Description
We release the synthetic data generated using the method described in the paper Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data Generation with Large Language Models (ACL 2024 Findings). The external knowledge we use is based on LLM-generated topics and writing styles.
Generated Datasets
The original train/validation/test data, and the generated synthetic training data are listed as follows. For each dataset, we generate 5000… See the full description on the dataset page: https://huggingface.co/datasets/ritaranx/clinical-synthetic-text-llm.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Clinical data is instrumental to medical research, machine learning (ML) model development, and advancing surgical care, but access is often constrained by privacy regulations and missing data. Synthetic data offers a promising solution to preserve privacy while enabling broader data access. Recent advances in large language models (LLMs) provide an opportunity to generate synthetic data with reduced reliance on domain expertise, computational resources, and pre-training.
Objective: This study aims to assess the feasibility of generating realistic tabular clinical data with OpenAI's GPT-4o using zero-shot prompting, and to evaluate the fidelity of LLM-generated data by comparing its statistical properties to the Vital Signs DataBase (VitalDB), a real-world open-source perioperative dataset.
Methods: In Phase 1, GPT-4o was prompted to generate a dataset with qualitative descriptions of 13 clinical parameters. The resulting data was assessed for general errors, plausibility of outputs, and cross-verification of related parameters. In Phase 2, GPT-4o was prompted to generate a dataset using descriptive statistics of the VitalDB dataset. Fidelity was assessed using two-sample t-tests, two-sample proportion tests, and 95% confidence interval (CI) overlap.
Results: In Phase 1, GPT-4o generated a complete and structured dataset comprising 6,166 case files. The dataset was plausible in range and correctly calculated body mass index for all case files based on the respective heights and weights. Statistical comparison between the LLM-generated datasets and VitalDB revealed that the Phase 2 data achieved significant fidelity. The Phase 2 data demonstrated statistical similarity in 12/13 (92.31%) parameters: no statistically significant differences were observed in 6/6 (100.0%) categorical/binary parameters and 6/7 (85.71%) continuous parameters. Overlap of 95% CIs was observed in 6/7 (85.71%) continuous parameters.
Conclusion: Zero-shot prompting with GPT-4o can generate realistic tabular synthetic datasets that replicate key statistical properties of real-world perioperative data. This study highlights the potential of LLMs as a novel and accessible modality for synthetic data generation, which may address critical barriers in clinical data access and eliminate the need for technical expertise, extensive computational resources, and pre-training. Further research is warranted to enhance fidelity and to investigate the use of LLMs to amplify and augment datasets, preserve multivariate relationships, and train robust ML models.
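The fidelity checks described above can be sketched in a few lines; this is a minimal illustration of a two-sample t-test plus 95% CI overlap for one continuous parameter, assuming two numeric arrays, and is not the study's actual code (the two-sample proportion test for categorical parameters is analogous).

```python
# Minimal sketch of the fidelity checks described above (not the study's code).
# `synthetic` and `real` are 1-D arrays of one continuous parameter.
import numpy as np
from scipy import stats

def ci95(x):
    """95% confidence interval for the mean of x."""
    m, se = np.mean(x), stats.sem(x)
    h = se * stats.t.ppf(0.975, len(x) - 1)
    return m - h, m + h

def compare_continuous(synthetic, real, alpha=0.05):
    # Welch's two-sample t-test (no equal-variance assumption).
    _, p = stats.ttest_ind(synthetic, real, equal_var=False)
    lo_s, hi_s = ci95(synthetic)
    lo_r, hi_r = ci95(real)
    overlap = lo_s <= hi_r and lo_r <= hi_s  # do the 95% CIs overlap?
    return {"p_value": p, "similar": p >= alpha, "ci_overlap": overlap}

rng = np.random.default_rng(0)
print(compare_continuous(rng.normal(75, 10, 6166), rng.normal(75.4, 10, 6000)))
```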
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Synthetic Data Generation Demo – UK Retail Dataset
Welcome to this synthetic data generation demo repository by Syncora.ai. This project showcases how to generate synthetic data using real-world tabular structures, demonstrated on a UK retail dataset with columns such as:
Country
CustomerID
UnitPrice
InvoiceDate
Quantity
StockCode
This dataset is designed for LLM training and AI development, enabling developers to work with privacy-safe, high-quality… See the full description on the dataset page: https://huggingface.co/datasets/syncora/uk_retail_store_synthetic_dataset.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This training data was generated using GPT-4o as part of the 'Drawing with LLM' competition (https://www.kaggle.com/competitions/drawing-with-llms). It can be used to fine-tune small language models for the competition or serve as an augmentation dataset alongside other data sources.
The dataset is generated in two steps using the GPT-4o model.
- In the first step, topic descriptions relevant to the competition are generated using the prompt below. By running this prompt multiple times, over 3,000 descriptions were collected.
prompt=f""" I am participating in an SVG code generation competition.
The competition involves generating SVG images based on short textual descriptions of everyday objects and scenes, spanning a wide range of categories. The key guidelines are as follows:
- Descriptions are generic and do not contain brand names, trademarks, or personal names.
- No descriptions include people, even in generic terms.
- Descriptions are concise: each is no more than 200 characters, with an average length of about 50 characters.
- Categories cover various domains, with some overlap between public and private test sets.
To train a small LLM model, I am preparing a synthetic dataset. Could you generate 100 unique topics aligned with the competition style?
Requirements:
- Each topic should range between **20 and 200 characters**, with an **average around 60 characters**.
- Ensure **diversity and creativity** across topics.
- **50% of the topics** should come from the categories of **landscapes**, **abstract art**, and **fashion**.
- Avoid duplication or overly similar phrasing.
Example topics:
a purple forest at dusk, gray wool coat with a faux fur collar, a lighthouse overlooking the ocean, burgundy corduroy pants with patch pockets and silver buttons, orange corduroy overalls, a purple silk scarf with tassel trim, a green lagoon under a cloudy sky, crimson rectangles forming a chaotic grid, purple pyramids spiraling around a bronze cone, magenta trapezoids layered on a translucent silver sheet, a snowy plain, black and white checkered pants, a starlit night over snow-covered peaks, khaki triangles and azure crescents, a maroon dodecahedron interwoven with teal threads.
Please return the 100 topics in csv format.
"""
prompt = f"""
Generate SVG code to visually represent the following text description, while respecting the given constraints.
Allowed Elements: `svg`, `path`, `circle`, `rect`, `ellipse`, `line`, `polyline`, `polygon`, `g`, `linearGradient`, `radialGradient`, `stop`, `defs`
Allowed Attributes: `viewBox`, `width`, `height`, `fill`, `stroke`, `stroke-width`, `d`, `cx`, `cy`, `r`, `x`, `y`, `rx`, `ry`, `x1`, `y1`, `x2`, `y2`, `points`, `transform`, `opacity`
Please ensure that the generated SVG code is well-formed, valid, and strictly adheres to these constraints.
Focus on a clear and concise representation of the input description within the given limitations.
Always give the complete SVG code with nothing omitted. Never use an ellipsis.
The code is scored based on similarity to the description, visual question answering, and aesthetic components.
Please generate detailed SVG code accordingly.
input description: {text}
"""
The raw SVG output is then cleaned and sanitized using a competition-specific sanitization class. After that, the cleaned SVG is scored using the SigLIP model to evaluate text-to-SVG similarity. Only SVGs with a score above 0.5 are included in the dataset. On average, out of three SVG generations, only one meets the quality threshold after the cleaning, sanitization, and scoring process.
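The scoring-and-filtering step might look like the sketch below; the rasterizer (cairosvg) and the google/siglip-base-patch16-224 checkpoint are assumptions not stated in the description, and the competition's sanitization class is not reproduced here.

```python
# Sketch of the filter: rasterize the SVG, score text-image similarity with
# SigLIP, and keep only SVGs scoring above 0.5. Checkpoint and rasterizer are
# assumptions; cleaning and sanitization happen before this step.
import io
import torch
import cairosvg
from PIL import Image
from transformers import AutoProcessor, AutoModel

processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")
model = AutoModel.from_pretrained("google/siglip-base-patch16-224")

def keep_svg(svg_code: str, description: str, threshold: float = 0.5) -> bool:
    png = cairosvg.svg2png(bytestring=svg_code.encode())
    image = Image.open(io.BytesIO(png)).convert("RGB")
    inputs = processor(text=[description], images=image,
                       padding="max_length", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape (1, 1)
    score = torch.sigmoid(logits)[0, 0].item()  # SigLIP scores pairs with a sigmoid
    return score > threshold
```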
A dataset with ~50,000 samples for SVG code generation is publicly available at: https://huggingface.co/datasets/vinoku89/svg-code-generation
Verbalized-Sampling-Synthetic-Data-Generation
This dataset showcases how Verbalized Sampling (VS) can be used to generate high-quality, diverse synthetic training data for mathematical reasoning tasks. From the paper Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity.
Dataset Description
The Synthetic Data Generation dataset contains mathematical problem-solution pairs generated by different methods using state-of-the-art LLMs. This dataset… See the full description on the dataset page: https://huggingface.co/datasets/CHATS-Lab/Verbalized-Sampling-Synthetic-Data-Generation.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The dataset contains conversation summaries, topics, and dialogues used to build a pipeline for fine-tuning an LLM with Parameter-Efficient Fine-Tuning (PEFT) and Low-Rank Adaptation (LoRA), a popular and lightweight training technique that significantly reduces the number of trainable parameters.
The dataset is also available on Hugging Face: https://huggingface.co/datasets/knkarthick/dialogsum
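For reference, a LoRA configuration for such a fine-tuning pipeline typically looks like the sketch below; the base model (flan-t5-base) and all hyperparameters are illustrative choices, not taken from the dataset card.

```python
# Minimal LoRA sketch with the peft library; base model and hyperparameters
# are illustrative, not from the dataset card.
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, get_peft_model, TaskType

base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=32,              # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["q", "v"],  # attention projections in T5 blocks
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of weights are trainable
```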
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Recruitment and career advisory teams in the HR industry often face challenges with sensitive, hard-to-access data. This dataset removes that barrier by providing synthetic HR conversations and resume screening Q&A, structured for LLM training in JSONL format.
It enables HR teams and AI developers to build smarter internal chatbots, automate candidate screening, accelerate onboarding workflows, and create AI-powered career advisory tools, all while keeping data privacy intact. This helps organizations improve efficiency, reduce manual effort, and scale AI-driven HR solutions.
This dataset contains synthetic HR conversations and resume screening Q&A, formatted for LLM fine-tuning. Each record represents a dialogue simulating real-world HR workflows like candidate evaluation and career guidance.
messages: An array of chat messages with roles (system, user, assistant) and their respective content.
{
"messages": [
{"role": "system", "content": "You are an informative assistant."},
{"role": "user", "content": "What is AbdulMuiz Shaikh's current job title?"},
{"role": "assistant", "content": "AbdulMuiz Shaikh's current job title is Associate Data Scientist."}
]
}
Fine-tune with OpenAI:

```bash
openai tools fine_tunes.prepare_data -f hr_resume_qna.jsonl
openai api fine_tunes.create -t "hr_resume_qna.jsonl" -m "gpt-3.5-turbo"
```
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Gardening LLM Synthetic Training - Multiturn Dialog Dataset
Dataset Description
This dataset contains a sample of synthetic multiturn conversations between home gardeners and an expert gardening assistant ("GardenBot"). The conversations cover five key gardening topics with detailed subtopics and plant-specific advice, designed for training conversational LLMs.
Dataset Overview
Curated by: CJ Jones
Language: English
License: CC BY-NC-SA 4.0
Size: 250 of 100… See the full description on the dataset page: https://huggingface.co/datasets/CJJones/Gardening_LLM_Synthetic_Training_Multiturn_Dialog.
WiserBrand offers a unique dataset of real consumer-to-business phone conversations. These high-quality audio recordings capture authentic interactions between consumers and support agents across industries. Unlike synthetic data or scripted samples, our dataset reflects natural speech patterns, emotion, intent, and real-world phrasing, making it ideal for building and testing voice AI models.
We provide custom datasets on demand:
- Multi-language datasets
- Calls from various countries
- Calls to companies in specific industries (healthcare, banking, e-commerce, etc.)
- The larger the volume you purchase, the lower the price will be.
We ensure strict data privacy: all personally identifiable information (PII) is removed before delivery.
Recordings are produced on demand and can be tailored by vertical (e.g., telecom, finance, e-commerce) or use case.
Whether you're building next-gen voice technology or need realistic conversational datasets to test models, this dataset provides what synthetic corpora lack: realism, variation, and authenticity.
CDLA-Sharing 1.0: https://choosealicense.com/licenses/cdla-sharing-1.0/
Bitext - Telco Tagged Training Dataset for LLM-based Virtual Assistants
Overview
This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [telco] sector can be easily achieved using our two-step approach to LLM Fine-Tuning. An overview of… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-telco-llm-chatbot-training-dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Tiny-LLM: Synthetic Question-Answering Dataset
Dataset Description
This dataset was created for the fine-tuning stage of the Tiny-LLM Project, a project focused on training and evaluating compact language models from scratch. It contains 706,727 high-quality, synthetic multi-turn Question-Answering (Q&A) conversations in English, generated using the Gemini API. The dataset was designed to teach small models instruction-following capabilities across a diverse range of… See the full description on the dataset page: https://huggingface.co/datasets/Gabriel8/tiny-llm-synthetic-qa.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
High-Fidelity Synthetic Dataset for LLM Training, Legal NLP & AI Research
This repository provides a synthetic dataset of legal contract Q&A interactions, modeled after real-world corporate filings (e.g., SEC 8-K disclosures). The data is generated using Syncora.ai, ensuring privacy-safe, fake data that is safe to use in LLM training, benchmarking, and experimentation.
This free dataset captures the style and structure of legal exchanges without exposing any confidential or sensitive client information.
| Feature | Description |
|---|---|
| Structured JSONL Format | Includes system, user, and assistant roles for conversational Q&A. |
| Contract & Compliance Questions | Modeled on SEC filings and legal disclosure scenarios. |
| Statistically Realistic Fake Data | Fully synthetic, mirrors real-world patterns without privacy risks. |
| NLP-Ready | Optimized for direct fine-tuning, benchmarking, and evaluation in LLM pipelines. |
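Given the JSONL chat format described in the table, a record validator might look like this minimal sketch; the file name is hypothetical and the role conventions are assumed from the format description.

```python
# Minimal sketch: check that each JSONL record uses the documented chat roles.
# The file name is hypothetical.
import json

VALID_ROLES = {"system", "user", "assistant"}

with open("legal_contract_qna.jsonl") as f:
    for i, line in enumerate(f, 1):
        record = json.loads(line)
        roles = [m["role"] for m in record["messages"]]
        assert set(roles) <= VALID_ROLES, f"line {i}: unexpected role"
        assert roles[-1] == "assistant", f"line {i}: should end with an assistant turn"
```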
This synthetic legal dataset is not just for LLM training: it enables developers and researchers to create simulated regulatory scenarios, such as stress-testing AI systems in legal environments.
Syncora.ai creates synthetic datasets optimized for LLM training.
Take your AI projects further with Syncora.ai:
→ Generate your own synthetic datasets now
This dataset is released under the MIT License.
It is 100% synthetic, safe for LLM training, and ideal for research, experimentation, and open-source projects.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
High-quality synthetic dataset for chatbot training, LLM fine-tuning, and AI research in conversational systems.
This dataset provides a fully synthetic collection of customer support interactions, generated using Syncora.ai's synthetic data generation engine.
It mirrors realistic support conversations across e-commerce, banking, SaaS, and telecom domains, ensuring diversity, context depth, and privacy-safe realism.
Each conversation simulates multi-turn dialogues between a customer and a support agent, making it ideal for training chatbots, LLMs, and retrieval-augmented generation (RAG) systems.
This is a free dataset, designed for LLM training, chatbot model fine-tuning, and dialogue understanding research.
| Feature | Description |
|---|---|
| conversation_id | Unique identifier for each dialogue session |
| domain | Industry domain (e.g., banking, telecom, retail) |
| role | Speaker role: customer or support agent |
| message | Message text (synthetic conversation content) |
| intent_label | Labeled customer intent (e.g., refund_request, password_reset) |
| resolution_status | Whether the query was resolved or escalated |
| sentiment_score | Sentiment polarity of the conversation |
| language | Language of interaction (supports multilingual synthetic data) |
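Given the schema above, loading and slicing the conversations might look like this; a minimal pandas sketch assuming a local JSON Lines export with exactly these field names (the file name is hypothetical).

```python
# Minimal sketch: load the conversations and slice by the schema fields above.
# Assumes a local JSONL export; the file name is hypothetical.
import pandas as pd

df = pd.read_json("customer_support_conversations.jsonl", lines=True)

# Escalated, negative-sentiment banking queries: candidates for review.
subset = df[(df["domain"] == "banking")
            & (df["resolution_status"] == "escalated")
            & (df["sentiment_score"] < 0)]

# Rebuild multi-turn dialogues per session for chatbot fine-tuning.
dialogues = (df.groupby("conversation_id")
               .apply(lambda g: list(zip(g["role"], g["message"]))))
```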
Use Syncora.ai to generate synthetic conversational datasets for your AI or chatbot projects:
Try Synthetic Data Generation tool
This dataset is released under the MIT License.
It is fully synthetic, free, and safe for LLM training, chatbot model fine-tuning, and AI research.
NIST License: https://www.nist.gov/open/license
TrojAI llm-pretrain-apr2024 Train Dataset
This is the training data used to create and evaluate trojan detection software solutions. This data, generated at NIST, consists of Llama2 large language models refined using fine-tuning and LoRA to perform next-token prediction. A known percentage of these trained AI models have been poisoned with triggers that induce modified behavior. This data will be used to develop software solutions for detecting which trained AI models have been poisoned via triggers embedded into the model weights.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is the datamix created by our team during the LLM - Detect AI Generated Text competition. This dataset helped us win the competition. It facilitates a text-classification task: separating LLM-generated essays from student-written ones.
It was developed incrementally, focusing on size, diversity, and complexity. With each datamix iteration, we attempted to plug blind spots of the previous generation of models while maintaining robustness.
To maximally leverage in-domain human texts, we used the entire Persuade corpus comprising all 15 prompts. We also included diverse human texts from sources such as the OpenAI GPT-2 output dataset, the ELLIPSE corpus, NarrativeQA, Wikipedia, the NLTK Brown corpus, and IMDB movie reviews.
Sources for our generated essays can be grouped under four categories:
- Proprietary LLMs (gpt-3.5, gpt-4, claude, cohere, gemini, palm)
- Open-source LLMs (llama, falcon, mistral, mixtral)
- Existing LLM-generated text datasets (DAIGT V2 subset, OUTFOX, Ghostbuster, gpt-2-output-dataset)
- Synthetic dataset made by T5
We used a wide variety of generation configs and prompting strategies to promote diversity and complexity in the data. Generated essays leveraged a combination of the following (a sketch of how these map onto decoding parameters follows this list):
- Contrastive search
- Use of guidance scale, typical_p, suppress_tokens
- High temperature and large values of top-k
- Prompting to fill in the blank: randomly mask words in an essay and ask the LLM to reconstruct the original essay (similar to MLM)
- Prompting without source texts
- Prompting with source texts
- Prompting to rewrite existing essays
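In Hugging Face transformers, these strategies map onto generate() arguments roughly as follows; a hedged sketch in which the model and all parameter values are illustrative examples, not the team's exact settings.

```python
# Illustrative decoding configurations; model and values are examples only.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "mistralai/Mistral-7B-Instruct-v0.2"  # any causal LM works here
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
inputs = tok("Write an essay about phones in school.", return_tensors="pt")

# Contrastive search: penalty_alpha combined with a small top_k.
out = model.generate(**inputs, penalty_alpha=0.6, top_k=4, max_new_tokens=512)

# High-temperature sampling with large top_k and typical_p.
out = model.generate(**inputs, do_sample=True, temperature=1.4,
                     top_k=500, typical_p=0.9, max_new_tokens=512)

# Classifier-free guidance scale, plus suppressing EOS to force longer essays.
out = model.generate(**inputs, guidance_scale=1.5,
                     suppress_tokens=[tok.eos_token_id], max_new_tokens=512)
```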
Finally, we incorporated augmented essays to make our models aware of typical attacks on LLM content detection systems and of obfuscations present in the provided training data. We mainly used a combination of the following augmentations on a random subset of essays (a character-level sketch follows this list):
- Spelling correction
- Deletion/insertion/swapping of characters
- Replacement with synonyms
- Introduced obfuscations
- Back-translation
- Random capitalization
- Sentence swapping
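Simple character-level versions of a few of these augmentations can be sketched as follows; this is illustrative only, not the team's exact pipeline.

```python
# Illustrative character-level augmentations (deletion, insertion, random
# capitalization); a sketch only, not the team's exact pipeline.
import random
import string

def perturb_chars(text: str, p: float = 0.03, seed: int = 0) -> str:
    """Randomly delete, insert, or re-case characters, each with probability ~p."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        r = rng.random()
        if r < p:                      # deletion
            continue
        elif r < 2 * p:                # insertion of a random lowercase letter
            out.append(rng.choice(string.ascii_lowercase))
        elif r < 3 * p:                # random capitalization flip
            ch = ch.swapcase()
        out.append(ch)
    return "".join(out)

print(perturb_chars("Students should be allowed to use phones in class."))
```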
Customer Support Conversation Dataset – Powered by Syncora.ai
A free synthetic dataset for chatbot training, LLM fine-tuning, and synthetic data generation research. Created using Syncora.ai's privacy-safe synthetic data engine, this dataset is ideal for developing, testing, and benchmarking AI customer support systems. It serves as a dataset for chatbot training and a dataset for LLM training, offering rich, structured conversation data for real-world simulation.
… See the full description on the dataset page: https://huggingface.co/datasets/syncora/customer_support_conversations_dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The use of real world health data for Foundation Model training often comes with concerns due to the potential sharing of sensitive information. Synthetic data may prove to be one of the best assets to limit such concerns. In this manuscript, we introduce a new paradigm of training Foundation Models - generate synthetic data, encode it with a compression method and frequency-based mapping, and use these encoded data to align a Foundation Model. We demonstrate our pipeline on the task of colorectal cancer patient stratification into consensus molecular subtypes (CMS) using a decoder-only model. Evaluation of the aligned model on real data results in a balanced accuracy and F1 score of approximately 91%, competitive with baselines established by prior work leveraging real data as well as with models trained directly on synthetic data.
This repository contains the data used in the experiments and the results of LLM-finetuning in json form. The numbered folders in data.zip correspond to the seed that was used for generating the synthetic data.
Energy consumption of artificial intelligence (AI) models in training is considerable: both GPT-3, the original release of the current iteration of OpenAI's popular ChatGPT, and Gopher consumed well over ********** megawatt-hours of energy for training alone. Since this covers only training, the energy consumption over the entire usage and lifetime of GPT-3 and other large language models (LLMs) is likely significantly higher.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
High-Fidelity Synthetic Medical Records for AI, ML Modeling, LLM Training & HealthTech Research
This is a synthetic dataset of healthcare records generated using Syncora.ai, a next-generation synthetic data generation platform designed for privacy-safe AI development.
It simulates patient demographics, medical conditions, treatments, billing, and admission data, preserving statistical realism while ensuring 0% privacy risk.
This free dataset is designed for AI and ML modeling, LLM training, and HealthTech research.
Think of this as fake data that mimics real-world healthcare patterns: statistically accurate, but without any sensitive patient information.
The dataset captures patient-level hospital information, including patient demographics, medical conditions, treatments, billing, and admission details.
All records are 100% synthetic, maintaining the statistical properties of real-world healthcare data while remaining safe to share and use for ML & LLM tasks.
Unlike most healthcare datasets, this one is tailored for LLM training.
Syncora.ai is a synthetic data generation platform designed for healthcare, finance, and enterprise AI.
Take your AI projects to the next level with Syncora.ai:
→ Generate your own synthetic datasets now
This is a free dataset, 100% synthetic, and contains no real patient information.
It is safe for public use in education, research, open-source contributions, LLM training, and AI development.
ODC-By: https://choosealicense.com/licenses/odc-by/
SmolLM-Corpus
This dataset is a curated collection of high-quality educational and synthetic data designed for training small language models. You can find more details about the models trained on this dataset in our SmolLM blog post.
Dataset subsets
Cosmopedia v2
Cosmopedia v2 is an enhanced version of Cosmopedia, the largest synthetic dataset for pre-training, consisting of over 39 million textbooks, blog posts, and stories generated by… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus.