72 datasets found
  1. clinical-synthetic-text-llm

    • huggingface.co
    Updated Jul 5, 2024
    + more versions
    Cite
    Ran Xu (2024). clinical-synthetic-text-llm [Dataset]. https://huggingface.co/datasets/ritaranx/clinical-synthetic-text-llm
    Explore at:
    Croissant (a machine-learning dataset format; see mlcommons.org/croissant)
    Dataset updated
    Jul 5, 2024
    Authors
    Ran Xu
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Data Description

    We release the synthetic data generated using the method described in the paper Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data Generation with Large Language Models (ACL 2024 Findings). The external knowledge we use is based on LLM-generated topics and writing styles.

      Generated Datasets
    

    The original train/validation/test data, and the generated synthetic training data are listed as follows. For each dataset, we generate 5000… See the full description on the dataset page: https://huggingface.co/datasets/ritaranx/clinical-synthetic-text-llm.
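    A minimal, hedged sketch of pulling these splits with the Hugging Face datasets library; the config and split names are assumptions about the repo layout, so inspect the returned object before relying on it:

      from datasets import load_dataset

      # Load the repo and list whatever splits it exposes
      ds = load_dataset("ritaranx/clinical-synthetic-text-llm")
      print(ds)                      # available splits
      first_split = next(iter(ds))
      print(ds[first_split][0])      # peek at one synthetic record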

  2. Data Sheet 2_Large language models generating synthetic clinical datasets: a feasibility and comparative analysis with real-world perioperative data

    • frontiersin.figshare.com
    • figshare.com
    xlsx
    Updated Feb 5, 2025
    + more versions
    Cite
    Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin (2025). Data Sheet 2_Large language models generating synthetic clinical datasets: a feasibility and comparative analysis with real-world perioperative data.xlsx [Dataset]. http://doi.org/10.3389/frai.2025.1533508.s002
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Feb 5, 2025
    Dataset provided by
    Frontiers
    Authors
    Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Clinical data is instrumental to medical research, machine learning (ML) model development, and advancing surgical care, but access is often constrained by privacy regulations and missing data. Synthetic data offers a promising solution to preserve privacy while enabling broader data access. Recent advances in large language models (LLMs) provide an opportunity to generate synthetic data with reduced reliance on domain expertise, computational resources, and pre-training.

    Objective: This study aims to assess the feasibility of generating realistic tabular clinical data with OpenAI's GPT-4o using zero-shot prompting, and to evaluate the fidelity of LLM-generated data by comparing its statistical properties to the Vital Signs DataBase (VitalDB), a real-world open-source perioperative dataset.

    Methods: In Phase 1, GPT-4o was prompted to generate a dataset from qualitative descriptions of 13 clinical parameters. The resultant data was assessed for general errors, plausibility of outputs, and cross-verification of related parameters. In Phase 2, GPT-4o was prompted to generate a dataset using descriptive statistics of the VitalDB dataset. Fidelity was assessed using two-sample t-tests, two-sample proportion tests, and 95% confidence interval (CI) overlap.

    Results: In Phase 1, GPT-4o generated a complete and structured dataset comprising 6,166 case files. The dataset was plausible in range and correctly calculated body mass index for all case files based on the respective heights and weights. Statistical comparison between the LLM-generated datasets and VitalDB revealed that Phase 2 data achieved significant fidelity: statistical similarity was observed in 12/13 (92.31%) parameters, with no statistically significant differences in 6/6 (100.0%) categorical/binary and 6/7 (85.71%) continuous parameters. Overlap of 95% CIs was observed in 6/7 (85.71%) continuous parameters.

    Conclusion: Zero-shot prompting with GPT-4o can generate realistic tabular synthetic datasets that replicate key statistical properties of real-world perioperative data. This study highlights the potential of LLMs as a novel and accessible modality for synthetic data generation, which may address critical barriers in clinical data access and eliminate the need for technical expertise, extensive computational resources, and pre-training. Further research is warranted to enhance fidelity and investigate the use of LLMs to amplify and augment datasets, preserve multivariate relationships, and train robust ML models.
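    The fidelity tests named in the Methods can be made concrete with a short sketch. The following is a hedged illustration on simulated data, not the authors' code: Welch's two-sample t-test for a continuous parameter, a pooled two-proportion z-test for a categorical one, and a normal-approximation 95% CI overlap check.

      import numpy as np
      from scipy import stats

      def ci95(x):
          # 95% CI for the sample mean (normal approximation)
          m, h = np.mean(x), 1.96 * stats.sem(x)
          return m - h, m + h

      def compare_continuous(real, synth):
          t, p = stats.ttest_ind(real, synth, equal_var=False)  # Welch's t-test
          (lo_r, hi_r), (lo_s, hi_s) = ci95(real), ci95(synth)
          overlap = lo_r <= hi_s and lo_s <= hi_r               # do the 95% CIs overlap?
          return {"t": t, "p": p, "ci_overlap": overlap}

      def compare_proportion(k_real, n_real, k_synth, n_synth):
          # Pooled two-sample proportion z-test
          p_pool = (k_real + k_synth) / (n_real + n_synth)
          se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_real + 1 / n_synth))
          z = (k_real / n_real - k_synth / n_synth) / se
          return {"z": z, "p": 2 * stats.norm.sf(abs(z))}

      # Illustrative example: simulated heights (cm) for real vs. synthetic cohorts
      rng = np.random.default_rng(0)
      print(compare_continuous(rng.normal(165, 10, 6000), rng.normal(165.4, 10, 6166)))
      print(compare_proportion(2900, 6000, 3050, 6166))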

  3. uk_retail_store_synthetic_dataset

    • huggingface.co
    Cite
    Syncora.ai - Agentic Synthetic Data Platform, uk_retail_store_synthetic_dataset [Dataset]. https://huggingface.co/datasets/syncora/uk_retail_store_synthetic_dataset
    Explore at:
    Authors
    Syncora.ai - Agentic Synthetic Data Platform
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Area covered
    United Kingdom
    Description

    Synthetic Data Generation Demo: UK Retail Dataset

    Welcome to this synthetic data generation demo repository by Syncora.ai. This project showcases how to generate synthetic data using real-world tabular structures, demonstrated on a UK retail dataset with columns such as:

    Country
    CustomerID
    UnitPrice
    InvoiceDate
    Quantity
    StockCode

    This dataset is designed for LLM training and AI development, enabling developers to work with privacy-safe, high-quality… See the full description on the dataset page: https://huggingface.co/datasets/syncora/uk_retail_store_synthetic_dataset.
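    A hedged sketch of loading and inspecting the documented columns with the datasets library and pandas; the split name and exact column spellings are assumptions based on the card:

      from datasets import load_dataset

      ds = load_dataset("syncora/uk_retail_store_synthetic_dataset", split="train")
      df = ds.to_pandas()
      print(df[["Country", "CustomerID", "UnitPrice", "InvoiceDate", "Quantity", "StockCode"]].head())
      # Example derived field: revenue per invoice line
      df["Revenue"] = df["UnitPrice"] * df["Quantity"]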

  4. SVG Code Generation Sample Training Data

    • kaggle.com
    zip
    Updated May 3, 2025
    Cite
    Vinothkumar Sekar (2025). SVG Code Generation Sample Training Data [Dataset]. https://www.kaggle.com/datasets/vinothkumarsekar89/svg-generation-sample-training-data
    Explore at:
    Available download formats: zip (193,477 bytes)
    Dataset updated
    May 3, 2025
    Authors
    Vinothkumar Sekar
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This training data was generated using GPT-4o as part of the 'Drawing with LLM' competition (https://www.kaggle.com/competitions/drawing-with-llms). It can be used to fine-tune small language models for the competition or serve as an augmentation dataset alongside other data sources.

    The dataset is generated in two steps using the GPT-4o model. In the first step, topic descriptions relevant to the competition are generated using a specific prompt. By running this prompt multiple times, over 3,000 descriptions were collected.

     
    prompt=f""" I am participating in an SVG code generation competition.
      
       The competition involves generating SVG images based on short textual descriptions of everyday objects and scenes, spanning a wide range of categories. The key guidelines are as follows:
      
       - Descriptions are generic and do not contain brand names, trademarks, or personal names.
       - No descriptions include people, even in generic terms.
       - Descriptions are concise: each is no more than 200 characters, with an average length of about 50 characters.
       - Categories cover various domains, with some overlap between public and private test sets.
      
       To train a small LLM model, I am preparing a synthetic dataset. Could you generate 100 unique topics aligned with the competition style?
      
       Requirements:
       - Each topic should range between **20 and 200 characters**, with an **average around 60 characters**.
       - Ensure **diversity and creativity** across topics.
       - **50% of the topics** should come from the categories of **landscapes**, **abstract art**, and **fashion**.
       - Avoid duplication or overly similar phrasing.
      
       Example topics:
                     a purple forest at dusk, gray wool coat with a faux fur collar, a lighthouse overlooking the ocean, burgundy corduroy, pants with patch pockets and silver buttons, orange corduroy overalls, a purple silk scarf with tassel trim, a green lagoon under a cloudy sky, crimson rectangles forming a chaotic grid,  purple pyramids spiraling around a bronze cone, magenta trapezoids layered on a translucent silver sheet,  a snowy plain, black and white checkered pants,  a starlit night over snow-covered peaks, khaki triangles and azure crescents,  a maroon dodecahedron interwoven with teal threads.
      
       Please return the 100 topics in csv format.
       """
     
    In the second step, SVG code is generated by prompting the GPT-4o model. The following prompt is used to query the model to generate the SVG.
     
      prompt = f"""
          Generate SVG code to visually represent the following text description, while respecting the given constraints.
          
          Allowed Elements: `svg`, `path`, `circle`, `rect`, `ellipse`, `line`, `polyline`, `polygon`, `g`, `linearGradient`, `radialGradient`, `stop`, `defs`
          Allowed Attributes: `viewBox`, `width`, `height`, `fill`, `stroke`, `stroke-width`, `d`, `cx`, `cy`, `r`, `x`, `y`, `rx`, `ry`, `x1`, `y1`, `x2`, `y2`, `points`, `transform`, `opacity`
          
    
          Please ensure that the generated SVG code is well-formed, valid, and strictly adheres to these constraints. 
          Focus on a clear and concise representation of the input description within the given limitations. 
          Always give the complete SVG code with nothing omitted. Never use an ellipsis.
    
          The code is scored based on similarity to the description, visual question answering, and aesthetic components.
          Please generate a detailed svg code accordingly.
    
          input description: {text}
          """
     

    The raw SVG output is then cleaned and sanitized using a competition-specific sanitization class. After that, the cleaned SVG is scored using the SigLIP model to evaluate text-to-SVG similarity. Only SVGs with a score above 0.5 are included in the dataset. On average, out of three SVG generations, only one meets the quality threshold after the cleaning, sanitization, and scoring process.
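    A hedged sketch of that filtering step, assuming cairosvg for rasterization and the base SigLIP checkpoint on the Hugging Face Hub; the 0.5 threshold mirrors the description above, everything else is illustrative:

      import io

      import cairosvg
      import torch
      from PIL import Image
      from transformers import AutoModel, AutoProcessor

      processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")
      model = AutoModel.from_pretrained("google/siglip-base-patch16-224")

      def keep_svg(svg_code: str, description: str, threshold: float = 0.5) -> bool:
          # Render the (already sanitized) SVG to a PNG, then to an RGB image
          png = cairosvg.svg2png(bytestring=svg_code.encode())
          image = Image.open(io.BytesIO(png)).convert("RGB")
          inputs = processor(text=[description], images=image,
                             padding="max_length", return_tensors="pt")
          with torch.no_grad():
              logits = model(**inputs).logits_per_image   # shape (1, 1)
          score = torch.sigmoid(logits)[0, 0].item()      # SigLIP scores via sigmoid
          return score > threshold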

    A dataset with ~50,000 samples for SVG code generation is publicly available at: https://huggingface.co/datasets/vinoku89/svg-code-generation

  5. Verbalized-Sampling-Synthetic-Data-Generation

    • huggingface.co
    Updated Oct 31, 2025
    Cite
    CHATS-Lab (2025). Verbalized-Sampling-Synthetic-Data-Generation [Dataset]. https://huggingface.co/datasets/CHATS-Lab/Verbalized-Sampling-Synthetic-Data-Generation
    Explore at:
    Dataset updated
    Oct 31, 2025
    Dataset authored and provided by
    CHATS-Lab
    Description

    Verbalized-Sampling-Synthetic-Data-Generation

    This dataset showcases how Verbalized Sampling (VS) can be used to generate high-quality, diverse synthetic training data for mathematical reasoning tasks. From the paper Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity.

      Dataset Description
    

    The Synthetic Data Generation dataset contains mathematical problem-solution pairs generated by different methods using state-of-the-art LLMs. This dataset… See the full description on the dataset page: https://huggingface.co/datasets/CHATS-Lab/Verbalized-Sampling-Synthetic-Data-Generation.

  6. Training_Data_FineTuning_LLM_PEFT_LORA

    • kaggle.com
    zip
    Updated Aug 8, 2024
    Cite
    Rupak Roy/ Bob (2024). Training_Data_FineTuning_LLM_PEFT_LORA [Dataset]. https://www.kaggle.com/datasets/rupakroy/training-dataset-peft-lora
    Explore at:
    Available download formats: zip (29,562,174 bytes)
    Dataset updated
    Aug 8, 2024
    Authors
    Rupak Roy/ Bob
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    The dataset contains conversation summaries, topics, and dialogues used to build a pipeline for fine-tuning an LLM with Parameter-Efficient Fine-Tuning (PEFT) and Low-Rank Adaptation (LoRA), a popular and lightweight training technique that significantly reduces the number of trainable parameters.

    The dataset is also available on Hugging Face: https://huggingface.co/datasets/knkarthick/dialogsum. A minimal LoRA setup is sketched below.
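    As a hedged sketch of what such a PEFT/LoRA setup looks like with the peft library (the base model and hyperparameters are illustrative, not the author's exact pipeline):

      from peft import LoraConfig, get_peft_model
      from transformers import AutoModelForCausalLM

      model = AutoModelForCausalLM.from_pretrained("gpt2")
      lora_config = LoraConfig(
          r=8,              # rank of the low-rank update matrices
          lora_alpha=16,    # scaling factor applied to the updates
          lora_dropout=0.05,
          task_type="CAUSAL_LM",
      )
      model = get_peft_model(model, lora_config)
      model.print_trainable_parameters()  # only a small fraction is trainable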

  7. resume-screening-llm-training-dataset

    • kaggle.com
    zip
    Updated Sep 11, 2025
    Cite
    Syncora_ai (2025). resume-screening-llm-training-dataset [Dataset]. https://www.kaggle.com/datasets/syncoraai/resume-screening-llm-training-dataset
    Explore at:
    Available download formats: zip (60,353 bytes)
    Dataset updated
    Sep 11, 2025
    Authors
    Syncora_ai
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Resume Screening & HR Conversations Dataset for LLM Training

    Recruitment and career advisory teams in the HR industry often face challenges with sensitive, hard-to-access data. This dataset removes that barrier by providing synthetic HR conversations and resume screening Q&A, structured for LLM training in JSONL format.

    It enables HR teams and AI developers to build smarter internal chatbots, automate candidate screening, accelerate onboarding workflows, and create AI-powered career advisory tools, all while keeping data privacy intact. This helps organizations improve efficiency, reduce manual effort, and scale AI-driven HR solutions.

    By Syncora.ai, enabling privacy-safe, high-quality synthetic data for smarter AI.

    ✅ Why This Dataset?

    • Accelerate HR AI development – No need to source or anonymize real resumes.
    • Ready for LLM Training – Structured in OpenAI-compatible JSONL format.
    • Privacy-Safe & Scalable – 100% synthetic, zero PII.

    📂 Dataset Description

    This dataset contains synthetic HR conversations and resume screening Q&A, formatted for LLM fine-tuning. Each record represents a dialogue simulating real-world HR workflows like candidate evaluation and career guidance.

    • Format: JSONL (OpenAI fine-tuning compatible)
    • Schema:
      • messages: An array of chat messages with roles (system, user, assistant) and their respective content.

    Example:

    {
     "messages": [
      {"role": "system", "content": "You are an informative assistant."},
      {"role": "user", "content": "What is AbdulMuiz Shaikh's current job title?"},
      {"role": "assistant", "content": "AbdulMuiz Shaikh's current job title is Associate Data Scientist."}
     ]
    }
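    A hedged pre-flight check that every JSONL line parses and follows the messages schema above (the filename matches the How To Use section below):

      import json

      VALID_ROLES = {"system", "user", "assistant"}

      with open("hr_resume_qna.jsonl", encoding="utf-8") as f:
          for i, line in enumerate(f, 1):
              messages = json.loads(line)["messages"]
              assert all(m["role"] in VALID_ROLES and m["content"] for m in messages), \
                  f"bad record on line {i}"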
    

    🔍 What's Inside

    • Resume Screening Q&A
      • Job titles, experience, role responsibilities.
    • Career Guidance Questions
      • Tech trends, upskilling, and career planning.

    👥 Who Should Use This Dataset?

    • Recruitment Tech Startups – Build automated candidate screeners.
    • HR Teams – Deploy internal career chatbots.
    • AI Engineers & Researchers – Fine-tune LLMs for HR applications.
    • EdTech Platforms – Power career advisory assistants.

    🚀 How To Use

    Fine-tune with OpenAI:

      openai tools fine_tunes.prepare_data -f hr_resume_qna.jsonl
      openai api fine_tunes.create -t "hr_resume_qna.jsonl" -m "gpt-3.5-turbo"

    ✅ Why Synthetic?

    • No compliance headaches (GDPR, PII safe)
    • Faster experimentation for LLM applications
    • High-fidelity HR scenarios for realistic model behavior

    🔗 Start Generating Your Own Synthetic Data

    → Generate your own synthetic datasets now

  8. Gardening_LLM_Synthetic_Training_Multiturn_Dialog

    • huggingface.co
    Updated Sep 3, 2025
    Cite
    Cameron Jones (2025). Gardening_LLM_Synthetic_Training_Multiturn_Dialog [Dataset]. https://huggingface.co/datasets/CJJones/Gardening_LLM_Synthetic_Training_Multiturn_Dialog
    Explore at:
    Dataset updated
    Sep 3, 2025
    Authors
    Cameron Jones
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Gardening LLM Synthetic Training - Multiturn Dialog Dataset

      Dataset Description
    

    This dataset contains a sample of synthetic multiturn conversations between home gardeners and an expert gardening assistant ("GardenBot"). The conversations cover five key gardening topics with detailed subtopics and plant-specific advice, designed for training conversational LLMs.

      Dataset Overview
    

    Curated by: CJ Jones Language: English License: CC BY-NC-SA 4.0 Size: 250 of 100… See the full description on the dataset page: https://huggingface.co/datasets/CJJones/Gardening_LLM_Synthetic_Training_Multiturn_Dialog.

  9. AI Training Data | Audio Data | Unique Consumer Sentiment Data: Recordings of the calls between consumers and companies

    • datarade.ai
    .wav
    Updated Dec 8, 2023
    Cite
    WiserBrand.com (2023). AI Training Data | Audio Data| Unique Consumer Sentiment Data: Recordings of the calls between consumers and companies [Dataset]. https://datarade.ai/data-products/ai-training-data-audio-data-unique-consumer-sentiment-data-wiserbrand-com
    Explore at:
    Available download formats: .wav
    Dataset updated
    Dec 8, 2023
    Dataset provided by
    WiserBrand
    Area covered
    United States of America
    Description

    WiserBrand offers a unique dataset of real consumer-to-business phone conversations. These high-quality audio recordings capture authentic interactions between consumers and support agents across industries. Unlike synthetic data or scripted samples, our dataset reflects natural speech patterns, emotion, intent, and real-world phrasing, making it ideal for:

    • Training ASR (Automatic Speech Recognition) systems
    • Improving voice assistants and LLM audio understanding
    • Enhancing call center AI tools (e.g., sentiment analysis, intent detection)
    • Benchmarking conversational AI performance with real-world noise and context

    Dataset language: English (other languages on request)

    We provide custom datasets on demand:

    • Multi-language datasets
    • Calls from various countries
    • Calls to companies in specific industries (healthcare, banking, e-commerce, etc.)

    The larger the volume you purchase, the lower the price.

    We ensure strict data privacy: all personally identifiable information (PII) is removed before delivery.

    Recordings are produced on demand and can be tailored by vertical (e.g., telecom, finance, e-commerce) or use case.

    Whether you're building next-gen voice technology or need realistic conversational datasets to test models, this dataset provides what synthetic corpora lack: realism, variation, and authenticity.

  10. Bitext-telco-llm-chatbot-training-dataset

    • huggingface.co
    Updated Jan 10, 2025
    + more versions
    Cite
    Bitext (2025). Bitext-telco-llm-chatbot-training-dataset [Dataset]. https://huggingface.co/datasets/bitext/Bitext-telco-llm-chatbot-training-dataset
    Explore at:
    Croissant (a machine-learning dataset format; see mlcommons.org/croissant)
    Dataset updated
    Jan 10, 2025
    Dataset authored and provided by
    Bitext
    License

    CDLA-Sharing-1.0 (https://choosealicense.com/licenses/cdla-sharing-1.0/)

    Description

    Bitext - Telco Tagged Training Dataset for LLM-based Virtual Assistants

      Overview
    

    This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [telco] sector can be easily achieved using our two-step approach to LLM Fine-Tuning. An overview of… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-telco-llm-chatbot-training-dataset.

  11. tiny-llm-synthetic-qa

    • huggingface.co
    Updated Oct 19, 2025
    Cite
    Gabriel de Antonio Mazetto (2025). tiny-llm-synthetic-qa [Dataset]. https://huggingface.co/datasets/Gabriel8/tiny-llm-synthetic-qa
    Explore at:
    Dataset updated
    Oct 19, 2025
    Authors
    Gabriel de Antonio Mazetto
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Tiny-LLM: Synthetic Question-Answering Dataset

      Dataset Description
    

    This dataset was created for the fine-tuning stage of the Tiny-LLM Project, a project focused on training and evaluating compact language models from scratch. It contains 706,727 high-quality, synthetic multi-turn Question-Answering (Q&A) conversations in English, generated using the Gemini API. The dataset was designed to teach small models instruction-following capabilities across a diverse range of… See the full description on the dataset page: https://huggingface.co/datasets/Gabriel8/tiny-llm-synthetic-qa.

  12. synthetic-legal-contracts-dataset

    • kaggle.com
    zip
    Updated Sep 11, 2025
    + more versions
    Cite
    Syncora_ai (2025). synthetic-legal-contracts-dataset [Dataset]. https://www.kaggle.com/datasets/syncoraai/synthetic-legal-contracts-dataset
    Explore at:
    Available download formats: zip (109,408 bytes)
    Dataset updated
    Sep 11, 2025
    Authors
    Syncora_ai
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Synthetic Legal Contract Dataset - Powered by Syncora

    High-Fidelity Synthetic Dataset for LLM Training, Legal NLP & AI Research

    About This Dataset

    This repository provides a synthetic dataset of legal contract Q&A interactions, modeled after real-world corporate filings (e.g., SEC 8-K disclosures). The data is generated using Syncora.ai, ensuring privacy-safe, fake data that is safe to use in LLM training, benchmarking, and experimentation.

    This free dataset captures the style and structure of legal exchanges without exposing any confidential or sensitive client information.

    Dataset Context & Features

    • Structured JSONL Format: Includes system, user, and assistant roles for conversational Q&A.
    • Contract & Compliance Questions: Modeled on SEC filings and legal disclosure scenarios.
    • Statistically Realistic Fake Data: Fully synthetic; mirrors real-world patterns without privacy risks.
    • NLP-Ready: Optimized for direct fine-tuning, benchmarking, and evaluation in LLM pipelines.

    🚨 Simulated Regulatory Scenarios

    This synthetic legal dataset is not just for LLM training: it enables developers and researchers to create simulated regulatory scenarios. Examples include:

    • Detecting high-risk clauses in contracts before real-world deployment
    • Testing AI models on rare or edge-case compliance situations
    • Simulating SEC filings and corporate disclosures to evaluate NLP models
    • Benchmarking contract analysis tools safely without exposing sensitive data

    These simulated scenarios give the dataset practical, standout value for stress-testing AI in legal environments.

    Why Syncora?

    Syncora.ai creates synthetic datasets optimized for LLM training with:

    • High similarity to real-world distributions
    • Free dataset access for research and open innovation
    • 0% privacy leakage: fully synthetic fake data
    • Robust benchmarking potential for AI & legal NLP tasks

    🔗 Generate Your Own Synthetic Data

    Take your AI projects further with Syncora.ai:
    → Generate your own synthetic datasets now

    📜 License

    This dataset is released under the MIT License.

    It is 100% synthetic, safe for LLM training, and ideal for research, experimentation, and open-source projects.

  13. customer support conversations

    • kaggle.com
    zip
    Updated Oct 9, 2025
    Cite
    Syncora_ai (2025). customer support conversations [Dataset]. https://www.kaggle.com/datasets/syncoraai/customer-support-conversations/code
    Explore at:
    Available download formats: zip (303,724,713 bytes)
    Dataset updated
    Oct 9, 2025
    Authors
    Syncora_ai
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Customer Support Conversation Dataset - Powered by Syncora.ai

    High-quality synthetic dataset for chatbot training, LLM fine-tuning, and AI research in conversational systems.

    About This Dataset

    This dataset provides a fully synthetic collection of customer support interactions, generated using Syncora.ai's synthetic data generation engine.
    It mirrors realistic support conversations across e-commerce, banking, SaaS, and telecom domains, ensuring diversity, context depth, and privacy-safe realism.

    Each conversation simulates multi-turn dialogues between a customer and a support agent, making it ideal for training chatbots, LLMs, and retrieval-augmented generation (RAG) systems.

    This is a free dataset, designed for LLM training, chatbot model fine-tuning, and dialogue understanding research.

    Dataset Context & Features

    • conversation_id: Unique identifier for each dialogue session
    • domain: Industry domain (e.g., banking, telecom, retail)
    • role: Speaker role (customer or support agent)
    • message: Message text (synthetic conversation content)
    • intent_label: Labeled customer intent (e.g., refund_request, password_reset)
    • resolution_status: Whether the query was resolved or escalated
    • sentiment_score: Sentiment polarity of the conversation
    • language: Language of interaction (supports multilingual synthetic data)
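    As a quick, hedged illustration of working with this schema, the sketch below assumes the zip ships a flat CSV with the columns above; the filename and label values are hypothetical:

      import pandas as pd

      df = pd.read_csv("customer_support_conversations.csv")

      # Share of resolved queries per industry domain
      resolved = df["resolution_status"].eq("resolved")
      print(resolved.groupby(df["domain"]).mean().sort_values(ascending=False))

      # Average sentiment per labeled intent
      print(df.groupby("intent_label")["sentiment_score"].mean())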

    Use Cases

    • Chatbot Training & Evaluation – Build and fine-tune conversational agents with realistic dialogue data.
    • LLM Training & Alignment – Use as a dataset for LLM training on dialogue tasks.
    • Customer Support Automation – Prototype or benchmark AI-driven support systems.
    • Dialogue Analytics – Study sentiment, escalation patterns, and domain-specific behavior.
    • Synthetic Data Research – Validate synthetic data generation pipelines for conversational systems.

    Why Synthetic?

    • Privacy-Safe – No real user data; fully synthetic and compliant.
    • Scalable – Generate millions of conversations for LLM and chatbot training.
    • Balanced & Bias-Controlled – Ensures diversity and fairness in training data.
    • Instantly Usable – Pre-structured and cleanly labeled for NLP tasks.

    Generate Your Own Synthetic Data

    Use Syncora.ai to generate synthetic conversational datasets for your AI or chatbot projects:
    Try Synthetic Data Generation tool

    License

    This dataset is released under the MIT License.
    It is fully synthetic, free, and safe for LLM training, chatbot model fine-tuning, and AI research.

  14. Trojan Detection Software Challenge - llm-pretrain-apr2024-train

    • data.nist.gov
    • nist.gov
    • +1more
    Updated Apr 16, 2024
    + more versions
    Cite
    Michael Paul Majurski (2024). Trojan Detection Software Challenge - llm-pretrain-apr2024-train [Dataset]. http://doi.org/10.18434/mds2-3235
    Explore at:
    Dataset updated
    Apr 16, 2024
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Authors
    Michael Paul Majurski
    License

    NIST Open License (https://www.nist.gov/open/license)

    Description

    TrojAI llm-pretrain-apr2024 Train Dataset

    This is the training data used to create and evaluate trojan detection software solutions. This data, generated at NIST, consists of Llama2 large language models refined using fine-tuning and LoRA to perform next-token prediction. A known percentage of these trained AI models have been poisoned with triggers that induce modified behavior. This data will be used to develop software solutions for detecting which trained AI models have been poisoned via triggers embedded in the model weights.

  15. LLM - Detect AI Datamix

    • kaggle.com
    zip
    Updated Jan 19, 2024
    Cite
    Raja Biswas (2024). LLM - Detect AI Datamix [Dataset]. https://www.kaggle.com/datasets/conjuring92/ai-mix-v26
    Explore at:
    Available download formats: zip (172,818,297 bytes)
    Dataset updated
    Jan 19, 2024
    Authors
    Raja Biswas
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This is the datamix created by Team 🔍 📝 🕵️‍♂️ 🤖 during the LLM - Detect AI Generated Text competition. This dataset helped us win the competition. It supports a text-classification task: separating LLM-generated essays from student-written ones.

    It was developed incrementally, focusing on size, diversity, and complexity. With each datamix iteration, we attempted to plug blind spots of the previous generation of models while maintaining robustness.

    To maximally leverage in-domain human texts, we used the entire PERSUADE corpus comprising all 15 prompts. We also included diverse human texts from sources such as the OpenAI GPT-2 output dataset, the ELLIPSE corpus, NarrativeQA, Wikipedia, the NLTK Brown corpus, and IMDB movie reviews.

    Sources for our generated essays can be grouped under four categories:

    • Proprietary LLMs (gpt-3.5, gpt-4, claude, cohere, gemini, palm)
    • Open-source LLMs (llama, falcon, mistral, mixtral)
    • Existing LLM-generated text datasets:
      • Synthetic dataset made by T5
      • DAIGT V2 subset
      • OUTFOX
      • Ghostbuster
      • gpt-2-output-dataset

    • Fine-tuned open-source LLMs (mistral, llama, falcon, deci-lm, t5, pythia, OPT, BLOOM, GPT2). For LLM fine-tuning, we leveraged the PERSUADE corpus in different ways:
      • Instruction tuning: Instructions were composed of different metadata e.g. prompt name, holistic essay score, ELL status and grade level. Responses were the corresponding student essays.
      • One topic held out: LLMs fine-tuned on PERSUADE essays with one prompt held out. When generating, only the held out prompt essays were generated. This was done to encourage new writing styles.
      • Span wise generation: Generate one span (discourse) at a time conditioned on the remaining essay.

    We used a wide variety of generation configs and prompting strategies to promote diversity and complexity in the data (a hedged decoding sketch follows this list). Generated essays leveraged a combination of the following:

    • Contrastive search
    • Guidance scale, typical_p, suppress_tokens
    • High temperature and large values of top-k
    • Prompting to fill in the blank: randomly mask words in an essay and ask the LLM to reconstruct the original essay (similar to MLM)
    • Prompting without source texts
    • Prompting with source texts
    • Prompting to rewrite existing essays
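    A hedged sketch of two of the decoding strategies named above, expressed with the Hugging Face generate API; the model and parameter values are illustrative, not the team's exact configs:

      from transformers import AutoModelForCausalLM, AutoTokenizer

      tokenizer = AutoTokenizer.from_pretrained("gpt2")
      model = AutoModelForCausalLM.from_pretrained("gpt2")
      inputs = tokenizer("Write an essay about phones in school.", return_tensors="pt")

      # Contrastive search: penalty_alpha trades confidence against degeneration
      out_cs = model.generate(**inputs, penalty_alpha=0.6, top_k=4, max_new_tokens=200)

      # High-temperature sampling with typical_p, a large top-k, and suppress_tokens
      out_hot = model.generate(
          **inputs,
          do_sample=True,
          temperature=1.4,
          typical_p=0.9,
          top_k=500,
          suppress_tokens=[tokenizer.eos_token_id],  # discourage ending early
          max_new_tokens=200,
      )
      print(tokenizer.decode(out_hot[0], skip_special_tokens=True))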

    Finally, we incorporated augmented essays to make our models aware of typical attacks on LLM content detection systems and obfuscations present in the provided training data. We mainly used a combination of the following augmentations on a random subset of essays:

    • Spelling correction
    • Deletion/insertion/swapping of characters
    • Replacement with synonyms
    • Introduced obfuscations
    • Back-translation
    • Random capitalization
    • Sentence swapping

  16. customer_support_conversations_dataset

    • huggingface.co
    Updated Oct 10, 2025
    Cite
    Syncora.ai - Agentic Synthetic Data Platform (2025). customer_support_conversations_dataset [Dataset]. https://huggingface.co/datasets/syncora/customer_support_conversations_dataset
    Explore at:
    Dataset updated
    Oct 10, 2025
    Authors
    Syncora.ai - Agentic Synthetic Data Platform
    Description

    💬 Customer Support Conversation Dataset - Powered by Syncora.ai

    A free synthetic dataset for chatbot training, LLM fine-tuning, and synthetic data generation research. Created using Syncora.ai's privacy-safe synthetic data engine, this dataset is ideal for developing, testing, and benchmarking AI customer support systems. It serves as a dataset for chatbot training and a dataset for LLM training, offering rich, structured conversation data for real-world simulation.

    … See the full description on the dataset page: https://huggingface.co/datasets/syncora/customer_support_conversations_dataset.
    
  17. Aligning foundation models on encoded synthetic omic data for patient stratification

    • zenodo.org
    zip
    Updated Jun 11, 2025
    Cite
    Nikita Janakarajan; Nikita Janakarajan (2025). Aligning foundation models on encoded synthetic omic data for patient stratification [Dataset]. http://doi.org/10.5281/zenodo.15641421
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 11, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nikita Janakarajan; Nikita Janakarajan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The use of real world health data for Foundation Model training often comes with concerns due to the potential sharing of sensitive information. Synthetic data may prove to be one of the best assets to limit such concerns. In this manuscript, we introduce a new paradigm of training Foundation Models - generate synthetic data, encode it with a compression method and frequency-based mapping, and use these encoded data to align a Foundation Model. We demonstrate our pipeline on the task of colorectal cancer patient stratification into consensus molecular subtypes (CMS) using a decoder-only model. Evaluation of the aligned model on real data results in a balanced accuracy and F1 score of approximately 91%, competitive with baselines established by prior work leveraging real data as well as with models trained directly on synthetic data.

    This repository contains the data used in the experiments and the results of LLM fine-tuning in JSON form. The numbered folders in data.zip correspond to the seeds used for generating the synthetic data.

  18. CO2 emissions of LLMs during training in 2022 (in CO2 eq tonnes)

    • statista.com
    Updated Jun 30, 2025
    + more versions
    Cite
    Statista (2025). CO2 emissions of LLMs during training in 2022 (in CO2 eq tonnes) [Dataset]. https://www.statista.com/statistics/1384418/co2-emissions-when-training-llm-models/
    Explore at:
    Dataset updated
    Jun 30, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2022
    Area covered
    Worldwide
    Description

    Energy consumption of artificial intelligence (AI) models during training is considerable: both GPT-3, the original release of the current iteration of OpenAI's popular ChatGPT, and Gopher consumed well over ********** megawatt hours of energy for training alone. As this figure covers only training, the energy consumption over the entire usage and lifetime of GPT-3 and other large language models (LLMs) is likely significantly higher.

  19. synthetic-medical-records-dataset

    • kaggle.com
    zip
    Updated Sep 11, 2025
    Cite
    Syncora_ai (2025). synthetic-medical-records-dataset [Dataset]. https://www.kaggle.com/datasets/syncoraai/synthetic-medical-records-dataset
    Explore at:
    Available download formats: zip (1,582,643 bytes)
    Dataset updated
    Sep 11, 2025
    Authors
    Syncora_ai
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Synthetic Healthcare Dataset - Powered by Syncora

    High-Fidelity Synthetic Medical Records for AI, ML Modeling, LLM Training & HealthTech Research

    About This Dataset

    This is a synthetic dataset of healthcare records generated using Syncora.ai, a next-generation synthetic data generation platform designed for privacy-safe AI development.

    It simulates patient demographics, medical conditions, treatments, billing, and admission data, preserving statistical realism while ensuring 0% privacy risk.

    This free dataset is designed for:

    • Healthcare AI research
    • Predictive analytics (disease risk, treatment outcomes)
    • LLM training on structured tabular healthcare data
    • Medical data science education & experimentation

    Think of this as fake data that mimics real-world healthcare patterns: statistically accurate, but without any sensitive patient information.

    Dataset Context & Features

    The dataset captures patient-level hospital information, including:

    • Demographics: Age, Gender, Blood Type
    • Medical Details: Diagnosed medical condition, prescribed medication, test results
    • Hospital Records: Admission type (emergency, planned, outpatient), billing amount
    • Target Applications: Predictive modeling, anomaly detection, cost optimization

    All records are 100% synthetic, maintaining the statistical properties of real-world healthcare data while remaining safe to share and use for ML & LLM tasks.

    LLM Training & Generative AI Applications 🧠

    Unlike most healthcare datasets, this one is tailored for LLM training:

    • Fine-tune LLMs on tabular + medical data for reasoning tasks
    • Create medical report generators from structured fields (e.g., convert demographics + condition + test results into natural language summaries; see the sketch after this list)
    • Use as fake data for prompt engineering, synthetic QA pairs, or generative simulations
    • Safely train LLMs to understand healthcare schemas without exposing private patient data
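    As a hedged illustration of the report-generator idea, this sketch turns one structured record into a natural-language summary; the field names are guesses at the schema, not the dataset's documented columns:

      def summarize(record: dict) -> str:
          return (
              f"A {record['age']}-year-old {record['gender']} patient "
              f"(blood type {record['blood_type']}) was admitted "
              f"({record['admission_type']}) with {record['condition']}. "
              f"Prescribed: {record['medication']}. Test results: {record['test_results']}. "
              f"Billing amount: ${record['billing_amount']:.2f}."
          )

      print(summarize({
          "age": 54, "gender": "female", "blood_type": "O+",
          "admission_type": "emergency", "condition": "type 2 diabetes",
          "medication": "metformin", "test_results": "abnormal",
          "billing_amount": 1824.50,
      }))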

    Machine Learning & AI Use Cases

    • Predictive Modeling: Forecast patient outcomes or readmission likelihood
    • Classification: Disease diagnosis prediction using demographic and medical variables
    • Clustering: Patient segmentation by condition, treatment, or billing pattern
    • Healthcare Cost Prediction: Estimate and optimize billing amounts
    • Bias & Fairness Testing: Study algorithmic bias without exposing sensitive patient data

    Why Syncora?

    Syncora.ai is a synthetic data generation platform designed for healthcare, finance, and enterprise AI.

    Key benefits:

    • Privacy-first: 100% synthetic, zero risk of re-identification
    • Statistical accuracy: Feature relationships preserved for ML & LLM training
    • Regulatory compliance: HIPAA, GDPR, DPDP safe
    • Scalability: Generate millions of synthetic patient records with agentic AI

    Ideas for Exploration

    • Which medical conditions correlate with higher billing amounts?
    • Can test results predict hospitalization type?
    • How do demographics influence treatment or billing trends?
    • Can synthetic datasets reduce bias in healthcare AI & LLMs?

    🔗 Generate Your Own Synthetic Data

    Take your AI projects to the next level with Syncora.ai:
    → Generate your own synthetic datasets now

    Licensing & Compliance

    This is a free dataset, 100% synthetic, and contains no real patient information.
    It is safe for public use in education, research, open-source contributions, LLM training, and AI development.

  20. smollm-corpus

    • huggingface.co
    Updated Jul 16, 2024
    + more versions
    Cite
    Hugging Face Smol Models Research (2024). smollm-corpus [Dataset]. https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus
    Explore at:
    Croissant (a machine-learning dataset format; see mlcommons.org/croissant)
    Dataset updated
    Jul 16, 2024
    Dataset provided by
    Hugging Face (https://huggingface.co/)
    Authors
    Hugging Face Smol Models Research
    License

    ODC-BY (https://choosealicense.com/licenses/odc-by/)

    Description

    SmolLM-Corpus

    This dataset is a curated collection of high-quality educational and synthetic data designed for training small language models. You can find more details about the models trained on this dataset in our SmolLM blog post.

      Dataset subsets

      Cosmopedia v2
    

    Cosmopedia v2 is an enhanced version of Cosmopedia, the largest synthetic dataset for pre-training, consisting of over 39 million textbooks, blog posts, and stories generated by… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus.
