19 datasets found
  1. customer support conversations

    • kaggle.com
    zip
    Updated Oct 9, 2025
    Cite
    Syncora_ai (2025). customer support conversations [Dataset]. https://www.kaggle.com/datasets/syncoraai/customer-support-conversations/code
    Explore at:
Available download formats: zip (303724713 bytes)
    Dataset updated
    Oct 9, 2025
    Authors
    Syncora_ai
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Customer Support Conversation Dataset — Powered by Syncora.ai

    High-quality synthetic dataset for chatbot training, LLM fine-tuning, and AI research in conversational systems.

    About This Dataset

    This dataset provides a fully synthetic collection of customer support interactions, generated using Syncora.ai’s synthetic data generation engine.
    It mirrors realistic support conversations across e-commerce, banking, SaaS, and telecom domains, ensuring diversity, context depth, and privacy-safe realism.

    Each conversation simulates multi-turn dialogues between a customer and a support agent, making it ideal for training chatbots, LLMs, and retrieval-augmented generation (RAG) systems.

    This is a free dataset, designed for LLM training, chatbot model fine-tuning, and dialogue understanding research.

    Dataset Context & Features

    • conversation_id: Unique identifier for each dialogue session
    • domain: Industry domain (e.g., banking, telecom, retail)
    • role: Speaker role (customer or support agent)
    • message: Message text (synthetic conversation content)
    • intent_label: Labeled customer intent (e.g., refund_request, password_reset)
    • resolution_status: Whether the query was resolved or escalated
    • sentiment_score: Sentiment polarity of the conversation
    • language: Language of interaction (supports multilingual synthetic data)
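    To make the schema concrete, here is a minimal sketch of filtering conversations by intent. The row values and the exact file layout are assumptions for illustration; only the field names come from the schema above.

```python
# Rows modeled on the documented schema; the values are invented examples,
# not taken from the dataset itself.
rows = [
    {"conversation_id": "c1", "domain": "banking", "role": "customer",
     "message": "I need my password reset.", "intent_label": "password_reset",
     "resolution_status": "resolved", "sentiment_score": -0.2, "language": "en"},
    {"conversation_id": "c2", "domain": "retail", "role": "customer",
     "message": "Where is my refund?", "intent_label": "refund_request",
     "resolution_status": "escalated", "sentiment_score": -0.6, "language": "en"},
]

def filter_by_intent(rows, intent):
    """Return every row whose intent_label matches `intent`."""
    return [r for r in rows if r["intent_label"] == intent]

refunds = filter_by_intent(rows, "refund_request")
print([r["conversation_id"] for r in refunds])  # ['c2']
```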

    Use Cases

    • Chatbot Training & Evaluation – Build and fine-tune conversational agents with realistic dialogue data.
    • LLM Training & Alignment – Train and align LLMs on multi-turn dialogue tasks.
    • Customer Support Automation – Prototype or benchmark AI-driven support systems.
    • Dialogue Analytics – Study sentiment, escalation patterns, and domain-specific behavior.
    • Synthetic Data Research – Validate synthetic data generation pipelines for conversational systems.

    Why Synthetic?

    • Privacy-Safe – No real user data; fully synthetic and compliant.
    • Scalable – Generate millions of conversations for LLM and chatbot training.
    • Balanced & Bias-Controlled – Ensures diversity and fairness in training data.
    • Instantly Usable – Pre-structured and cleanly labeled for NLP tasks.

    Generate Your Own Synthetic Data

    Use Syncora.ai to generate synthetic conversational datasets for your AI or chatbot projects:
    Try Synthetic Data Generation tool

    License

    This dataset is released under the MIT License.
    It is fully synthetic, free, and safe for LLM training, chatbot model fine-tuning, and AI research.

  2. legal_contract_dataset

    • huggingface.co
    Updated Sep 15, 2025
    + more versions
    Cite
    Syncora.ai - Agentic Synthetic Data Platform (2025). legal_contract_dataset [Dataset]. https://huggingface.co/datasets/syncora/legal_contract_dataset
    Explore at:
    Dataset updated
    Sep 15, 2025
    Authors
    Syncora.ai - Agentic Synthetic Data Platform
    Description

    Synthetic Legal Contract Dataset — Powered by Syncora.ai ⚖️

    High-Fidelity Synthetic Dataset for LLM Training, Legal NLP & AI Research

      🌟 About This Dataset
    

    This repository provides a synthetic dataset of legal contract Q&A interactions, modeled after real-world corporate filings (e.g., SEC 8-K disclosures).
    All records are fully synthetic, generated using Syncora.ai, making the dataset privacy-safe and free to use for LLM training, benchmarking, and experimentation.… See the full description on the dataset page: https://huggingface.co/datasets/syncora/legal_contract_dataset.

  3. Public Domain Synthetic Datasets

    • kaggle.com
    zip
    Updated Aug 5, 2024
    Cite
    Thomas Anderson (2024). Public Domain Synthetic Datasets [Dataset]. https://www.kaggle.com/datasets/thomasanderson1962/public-domain-synthetic-datasets
    Explore at:
    Available download formats: zip (748302 bytes)
    Dataset updated
    Aug 5, 2024
    Authors
    Thomas Anderson
    License

    CC0 1.0 Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    This dataset is sourced from running the Synthetic Dataset Creation w/ InternVL2 script. The dataset was made in mind for compatibility for LLM Finetuning Script which finetunes Large Language Models (LLM) through the use of datasets. This is just an example of how a dataset is supposed to be structured for the LLM Finetuning Script. Feel free to make your own datasets with the help of the Synthetic Dataset Creation w/ InternVL2 script.

  4. mental_health_survey_dataset

    • huggingface.co
    Cite
    Syncora.ai - Agentic Synthetic Data Platform, mental_health_survey_dataset [Dataset]. https://huggingface.co/datasets/syncora/mental_health_survey_dataset
    Explore at:
    Authors
    Syncora.ai - Agentic Synthetic Data Platform
    Description

    🧠 Mental Health Posting Dataset — Synthetic Dataset for LLM & Chatbot Training

    Free dataset for mental health research, LLM training, and chatbot development, generated using synthetic data generation techniques to ensure privacy and high fidelity.

      🌟 About This Dataset
    

    This dataset contains synthetic mental health survey responses across multiple demographics and occupations. It includes participant-reported stress levels, coping mechanisms, mood swings, and social… See the full description on the dataset page: https://huggingface.co/datasets/syncora/mental_health_survey_dataset.

  5. synthetic-medical-records-dataset

    • kaggle.com
    zip
    Updated Sep 11, 2025
    Cite
    Syncora_ai (2025). synthetic-medical-records-dataset [Dataset]. https://www.kaggle.com/datasets/syncoraai/synthetic-medical-records-dataset
    Explore at:
    Available download formats: zip (1582643 bytes)
    Dataset updated
    Sep 11, 2025
    Authors
    Syncora_ai
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Synthetic Healthcare Dataset — Powered by Syncora

    High-Fidelity Synthetic Medical Records for AI, ML Modeling, LLM Training & HealthTech Research

    About This Dataset

    This is a synthetic dataset of healthcare records generated using Syncora.ai, a next-generation synthetic data generation platform designed for privacy-safe AI development.

    It simulates patient demographics, medical conditions, treatments, billing, and admission data, preserving statistical realism while ensuring 0% privacy risk.

    This free dataset is designed for:

    • Healthcare AI research
    • Predictive analytics (disease risk, treatment outcomes)
    • LLM training on structured tabular healthcare data
    • Medical data science education & experimentation

    Think of this as fake data that mimics real-world healthcare patterns — statistically accurate, but without any sensitive patient information.

    Dataset Context & Features

    The dataset captures patient-level hospital information, including:

    • Demographics: Age, Gender, Blood Type
    • Medical Details: Diagnosed medical condition, prescribed medication, test results
    • Hospital Records: Admission type (emergency, planned, outpatient), billing amount
    • Target Applications: Predictive modeling, anomaly detection, cost optimization

    All records are 100% synthetic, maintaining the statistical properties of real-world healthcare data while remaining safe to share and use for ML & LLM tasks.

    LLM Training & Generative AI Applications 🧠

    Unlike most healthcare datasets, this one is tailored for LLM training:

    • Fine-tune LLMs on tabular + medical data for reasoning tasks
    • Create medical report generators from structured fields (e.g., convert demographics + condition + test results into natural language summaries)
    • Use as fake data for prompt engineering, synthetic QA pairs, or generative simulations
    • Safely train LLMs to understand healthcare schemas without exposing private patient data
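    The "medical report generator" idea above can be sketched as a simple template over the listed fields. The record keys and values here are illustrative assumptions, not the dataset's actual column names.

```python
def record_to_summary(rec):
    """Render a structured (synthetic) patient record as a one-line
    natural-language summary, e.g. to build field-to-text LLM training pairs."""
    return (
        f"{rec['age']}-year-old {rec['gender']} patient "
        f"(blood type {rec['blood_type']}) diagnosed with {rec['condition']}; "
        f"admission: {rec['admission_type']}, test result: {rec['test_result']}, "
        f"billing amount: ${rec['billing_amount']:.2f}."
    )

# Hypothetical record with field names loosely matching the features above.
record = {
    "age": 54, "gender": "female", "blood_type": "A+",
    "condition": "hypertension", "admission_type": "emergency",
    "test_result": "abnormal", "billing_amount": 1834.50,
}
print(record_to_summary(record))
```

    Pairs of (record, summary) produced this way are one common starting point for teaching an LLM to verbalize tabular healthcare schemas.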

    Machine Learning & AI Use Cases

    • Predictive Modeling: Forecast patient outcomes or readmission likelihood
    • Classification: Disease diagnosis prediction using demographic and medical variables
    • Clustering: Patient segmentation by condition, treatment, or billing pattern
    • Healthcare Cost Prediction: Estimate and optimize billing amounts
    • Bias & Fairness Testing: Study algorithmic bias without exposing sensitive patient data

    Why Syncora?

    Syncora.ai is a synthetic data generation platform designed for healthcare, finance, and enterprise AI.

    Key benefits:

    • Privacy-first: 100% synthetic, zero risk of re-identification
    • Statistical accuracy: Feature relationships preserved for ML & LLM training
    • Regulatory compliance: HIPAA, GDPR, DPDP safe
    • Scalability: Generate millions of synthetic patient records with agentic AI

    Ideas for Exploration

    • Which medical conditions correlate with higher billing amounts?
    • Can test results predict hospitalization type?
    • How do demographics influence treatment or billing trends?
    • Can synthetic datasets reduce bias in healthcare AI & LLMs?

    🔗 Generate Your Own Synthetic Data

    Take your AI projects to the next level with Syncora.ai:
    → Generate your own synthetic datasets now

    Licensing & Compliance

    This is a free dataset, 100% synthetic, and contains no real patient information.
    It is safe for public use in education, research, open-source contributions, LLM training, and AI development.

  6. fitness-tracker-dataset

    • huggingface.co
    Updated Oct 5, 2025
    Cite
    Syncora.ai - Agentic Synthetic Data Platform (2025). fitness-tracker-dataset [Dataset]. https://huggingface.co/datasets/syncora/fitness-tracker-dataset
    Explore at:
    Dataset updated
    Oct 5, 2025
    Authors
    Syncora.ai - Agentic Synthetic Data Platform
    Description

    šŸƒ Synthetic Wearable & Activity Dataset — Powered by Syncora.ai

    Free dataset for health analytics, activity recognition, synthetic data generation, and LLM training.

      🌟 About This Dataset
    

    This dataset contains synthetic wearable fitness records, modeled on signals from devices such as the Apple Watch. All entries are fully synthetic, generated with Syncora.ai’s synthetic data engine, ensuring privacy-safe and bias-aware data.
    The dataset provides rich… See the full description on the dataset page: https://huggingface.co/datasets/syncora/fitness-tracker-dataset.

  7. Bitext-customer-support-llm-chatbot-training-dataset

    • huggingface.co
    • opendatalab.com
    Updated Jul 16, 2024
    + more versions
    Cite
    Bitext (2024). Bitext-customer-support-llm-chatbot-training-dataset [Dataset]. https://huggingface.co/datasets/bitext/Bitext-customer-support-llm-chatbot-training-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jul 16, 2024
    Dataset authored and provided by
    Bitext
    License

    CDLA-Sharing-1.0 (https://choosealicense.com/licenses/cdla-sharing-1.0/)

    Description

    Bitext - Customer Service Tagged Training Dataset for LLM-based Virtual Assistants

      Overview
    

    This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the Customer Support sector can be easily achieved using our two-step approach to LLM… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-customer-support-llm-chatbot-training-dataset.

  8. mental health survey dataset

    • kaggle.com
    zip
    Updated Oct 9, 2025
    Cite
    Syncora_ai (2025). mental health survey dataset [Dataset]. https://www.kaggle.com/datasets/syncoraai/mental-health-survey-dataset
    Explore at:
    Available download formats: zip (676541 bytes)
    Dataset updated
    Oct 9, 2025
    Authors
    Syncora_ai
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Mental Health Survey Dataset — Synthetic Responses for LLM Training

    A free synthetic dataset of mental health survey responses, designed for LLM training, AI research, and synthetic data generation.
    It captures stress levels, coping strategies, mood swings, and work-life interactions across diverse demographics.
    All entries are fully synthetic, privacy-safe, and structured for easy modeling and analysis.
    Ideal for developing AI models, visualizations, or fine-tuning LLMs on structured mental health data.

    Dataset Overview

    This dataset contains synthetic survey responses covering mental health, lifestyle, and work-related factors.
    It is suitable for LLM training, mental health research, and experimentation with free synthetic datasets.

    Dataset Schema

    • Timestamp: Date and time of survey response
    • Gender: Respondent gender
    • Country: Respondent country
    • Occupation: Profession or role
    • self_employed: Whether respondent is self-employed
    • family_history: Family history of mental health issues
    • treatment: Whether respondent has sought treatment
    • Days_Indoors: Average number of days indoors
    • Growing_Stress: Respondent perception of stress growth
    • Changes_Habits: Whether lifestyle habits have changed
    • Mental_Health_History: Past mental health history
    • Mood_Swings: Frequency of mood swings
    • Coping_Struggles: Difficulty in coping
    • Work_Interest: Level of engagement at work
    • Social_Weakness: Social interaction challenges
    • mental_health_interview: Respondent willingness for interview
    • care_options: Preferred care options
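    As a quick example of the kind of analysis this schema supports, the sketch below computes treatment-seeking rates per occupation. Field names follow the schema above; the in-memory records are invented, and the real dataset's value encodings may differ.

```python
from collections import defaultdict

# Invented rows using schema-style field names (Occupation, treatment).
responses = [
    {"Occupation": "Engineer", "treatment": "Yes"},
    {"Occupation": "Engineer", "treatment": "No"},
    {"Occupation": "Teacher",  "treatment": "Yes"},
]

def treatment_rate_by_occupation(rows):
    """Fraction of respondents per occupation who sought treatment."""
    totals, sought = defaultdict(int), defaultdict(int)
    for r in rows:
        totals[r["Occupation"]] += 1
        if r["treatment"] == "Yes":
            sought[r["Occupation"]] += 1
    return {occ: sought[occ] / totals[occ] for occ in totals}

print(treatment_rate_by_occupation(responses))  # {'Engineer': 0.5, 'Teacher': 1.0}
```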


    Use Cases

    • 💬 LLM Training: Fine-tune language models on structured survey data
    • 📊 Mental Health Research: Analyze trends in stress, mood, and coping
    • ⚡ Synthetic Data Generation: Benchmark methods for generating structured datasets
    • 🧠 Model Development: Create predictive or classification models on mental health indicators

    Resources

    • Synthetic Data Generator: Generate your own structured datasets
      Open Generator

    • Syncora.ai: Platform powering the synthetic dataset
      Visit Website

    License

    Released under MIT License.

    This dataset is 100% synthetic, free, and safe for LLM training, mental health research, and synthetic data experiments.

  9. synthetic_text_to_sql

    • huggingface.co
    Cite
    Gretel.ai, synthetic_text_to_sql [Dataset]. https://huggingface.co/datasets/gretelai/synthetic_text_to_sql
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset provided by
    Gretel.ai
    License

    Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Image generated by DALL-E. See prompt for more details

      synthetic_text_to_sql
    

    gretelai/synthetic_text_to_sql is a rich dataset of high quality synthetic Text-to-SQL samples, designed and generated using Gretel Navigator, and released under Apache 2.0. Please see our release blogpost for more details. The dataset includes:

    • 105,851 records partitioned into 100,000 train and 5,851 test records
    • ~23M total tokens, including ~12M SQL tokens
    • Coverage across 100 distinct… See the full description on the dataset page: https://huggingface.co/datasets/gretelai/synthetic_text_to_sql.

  10. Evaluation Dataset Curation for LLMs Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Oct 4, 2025
    Cite
    Growth Market Reports (2025). Evaluation Dataset Curation for LLMs Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/evaluation-dataset-curation-for-llms-market
    Explore at:
    Available download formats: pptx, csv, pdf
    Dataset updated
    Oct 4, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Evaluation Dataset Curation for LLMs Market Outlook



    According to our latest research, the global Evaluation Dataset Curation for LLMs market size reached USD 1.18 billion in 2024, reflecting robust momentum driven by the proliferation of large language models (LLMs) across industries. The market is projected to expand at a CAGR of 24.7% from 2025 to 2033, reaching a forecasted value of USD 9.01 billion by 2033. This impressive growth is primarily fueled by the surging demand for high-quality, unbiased, and diverse datasets essential for evaluating, benchmarking, and fine-tuning LLMs, as well as for ensuring their safety and fairness in real-world applications.




    The exponential growth of the Evaluation Dataset Curation for LLMs market is underpinned by the rapid advancements in artificial intelligence and natural language processing technologies. As organizations increasingly deploy LLMs for a variety of applications, the need for meticulously curated datasets has become paramount. High-quality datasets are the cornerstone for testing model robustness, identifying biases, and ensuring compliance with ethical standards. The proliferation of domain-specific use cases—from healthcare diagnostics to legal document analysis—has further intensified the demand for specialized datasets tailored to unique linguistic and contextual requirements. Moreover, the growing recognition of dataset quality as a critical determinant of model performance is prompting enterprises and research institutions to invest heavily in advanced curation platforms and services.




    Another significant growth driver for the Evaluation Dataset Curation for LLMs market is the heightened regulatory scrutiny and societal emphasis on AI transparency, fairness, and accountability. Governments and standard-setting bodies worldwide are introducing stringent guidelines to mitigate the risks associated with biased or unsafe AI systems. This regulatory landscape is compelling organizations to adopt rigorous dataset curation practices, encompassing bias detection, fairness assessment, and safety evaluations. As LLMs become integral to decision-making processes in sensitive domains such as finance, healthcare, and public policy, the imperative for trustworthy and explainable AI models is fueling the adoption of comprehensive evaluation datasets. This trend is expected to accelerate as new regulations come into force, further expanding the market’s scope.




    The market is also benefiting from the collaborative efforts between academia, industry, and open-source communities to establish standardized benchmarks and best practices for LLM evaluation. These collaborations are fostering innovation in dataset curation methodologies, including the use of synthetic data generation, crowdsourcing, and automated annotation tools. The integration of multimodal data—combining text, images, and code—is enabling more holistic assessments of LLM capabilities, thereby expanding the market’s addressable segments. Additionally, the emergence of specialized startups focused on dataset curation services is introducing competitive dynamics and driving technological advancements. These factors collectively contribute to the market’s sustained growth trajectory.




    Regionally, North America continues to dominate the Evaluation Dataset Curation for LLMs market, accounting for the largest share in 2024, followed by Europe and Asia Pacific. The United States, in particular, is home to leading AI research institutions, technology giants, and a vibrant ecosystem of startups dedicated to LLM development and evaluation. Europe is witnessing increased investments in AI ethics and regulatory compliance, while Asia Pacific is rapidly emerging as a key growth market due to its expanding AI research capabilities and government-led digital transformation initiatives. Latin America and the Middle East & Africa are also showing promise, albeit from a smaller base, as local enterprises and public sector organizations begin to recognize the strategic importance of robust LLM evaluation frameworks.






  11. REGEN: Reviews Enhanced with Generative Narratives

    • kaggle.com
    zip
    Updated Mar 13, 2025
    Cite
    Google AI (2025). REGEN: Reviews Enhanced with Generative Narratives [Dataset]. https://www.kaggle.com/datasets/googleai/regen-reviews-enhanced-with-generative-narratives
    Explore at:
    Available download formats: zip (15353495064 bytes)
    Dataset updated
    Mar 13, 2025
    Dataset authored and provided by
    Google AI (http://ai.google/)
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    License information was derived automatically

    Description

    REGEN is a synthetic dataset designed to evaluate the conversational capabilities of Recommender Language Models (RLMs). Derived from the 2018 Amazon Reviews dataset (https://jmcauley.ucsd.edu/data/amazon/), REGEN transforms raw product and review data into rich, interactive narratives, simulating real-world user interactions. Further details of REGEN and benchmarks using recommender LLMs can be found at https://arxiv.org/pdf/2503.11924. We summarize the key details below.

    Leveraging the power of Gemini Flash, REGEN adds diverse narratives such as purchase reasons, explanations, product endorsements, user summaries & concise user profiles, personalized to each user. The narratives vary in context, length, and style, encompassing: - Explicit context narratives (e.g., purchase explanations, endorsements) - Context-free narratives (e.g., user summaries) - Short-form and long-form narratives

    Further, it adds plausible critiques, or natural language feedback that the user issues in response to recommendations to steer the recommender to the next item.

    Key Features

    Sequential Interactions

    REGEN organizes user reviews chronologically, capturing the evolution of user preferences and product experiences over time. Each sequence is capped at 50 reviews.

    Gemini Pro Flash Generated Narratives

    Inpainting for Synthesizing Multi-Turn Conversations

    The Amazon Reviews dataset, while valuable for recommendation studies, primarily reflects a series of individual user interactions rather than real conversations. To create a synthetic dataset that emulates multi-turn conversational dynamics, we employ an inpainting technique to generate the elements typically missing in nonconversational review data. Specifically, we leverage LLMs to synthesize these elements, using carefully designed prompts and an iterative evaluation process to ensure high-quality generation and minimize inaccuracies. We exploit the extended context length capabilities of the LLM to effectively process the complete user history for each user. Our prompts are simple, employing a task prefix with instructions and the desired output format. This prefix is followed by the user’s entire interaction history, including item metadata and review text. Finally, the prompt specifies the expected output format. We use the user history and associated reviews to generate critiques, purchase reasons and summaries. For product endorsement generation, however, we intentionally withhold the user’s last review, enabling the model to learn how to endorse a newly recommended item effectively.

    Feature Types

    Critiques

    REGEN adds generated critiques (short utterance) to steer the system from the current recommended item to a desired item. Our aim is to focus on the setting where a user refines from one item to a closely-related item variant (e.g., "red ball-point pen" to "black ball-point pen"), rather than to another arbitrary item, which would be better served by other mechanisms, e.g. a new search query. Thus, REGEN only generates critiques between adjacent pairs of items that are sufficiently similar. We use the Amazon Reviews-provided hierarchical item categories as a rough proxy for item similarity and consider items sufficiently similar if at least four levels of the category match.

    To generate critiques for adjacent item pairs that meet the similarity criteria, we query Gemini 1.5 Flash to synthesize several options, instructing the LLM to treat the first item as the current recommendation and the second item as representative of the user's desired goal. The prompt contains the item descriptions and few shot examples. We select at random one of the generated critiques to inpaint into the dataset for that pair. For pairs that do not meet the similarity criteria, REGEN contains a sentinel placeholder. Dataset users can decide to train on the placeholder, to model the case where end-users do not provide critiques emulating a new search query, or to replace them with other critiques. For the "Clothing" dataset, REGEN includes LLM-generated critiques for about 18.6% of adjacent items appearing in user sequences. For "Offices," there are about 17.6%.
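    The similarity rule ("at least four levels of the category match") can be sketched directly. The category paths below are invented examples; the dataset's actual matching code is not published here.

```python
def sufficiently_similar(cat_a, cat_b, min_levels=4):
    """REGEN-style proxy: two items count as sufficiently similar when their
    hierarchical category paths agree on at least `min_levels` leading levels."""
    matched = 0
    for a, b in zip(cat_a, cat_b):
        if a != b:
            break
        matched += 1
    return matched >= min_levels

# Hypothetical category paths for illustration.
pen_red   = ["Office", "Writing", "Pens", "Ballpoint", "Red"]
pen_black = ["Office", "Writing", "Pens", "Ballpoint", "Black"]
stapler   = ["Office", "Supplies", "Staplers"]

print(sufficiently_similar(pen_red, pen_black))  # True  (4 shared levels)
print(sufficiently_similar(pen_red, stapler))    # False (only 1 shared level)
```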

    Narratives

    We aim to generate diverse narratives for conversational recommender systems, varying in:

    • Contextualization: Inclusion or exclusion of explicit context (e.g., user summaries, purchase explanations, endorsements) to assess its impact on quality and relevance.

    • Length: Short-form and long-form narratives for varied conversational scenarios.

    Detailed descriptions and examples of the narratives are summarized below.

    [Image: table of narrative examples]

    Comprehensive Evaluation

    Using ...

  12. smollm-corpus

    • huggingface.co
    Updated Jul 16, 2024
    + more versions
    Cite
    Hugging Face Smol Models Research (2024). smollm-corpus [Dataset]. https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jul 16, 2024
    Dataset provided by
    Hugging Face (https://huggingface.co/)
    Authors
    Hugging Face Smol Models Research
    License

    ODC-By (https://choosealicense.com/licenses/odc-by/)

    Description

    SmolLM-Corpus

    This dataset is a curated collection of high-quality educational and synthetic data designed for training small language models. You can find more details about the models trained on this dataset in our SmolLM blog post.

      Dataset subsets
    
    
    
    
    
      Cosmopedia v2
    

    Cosmopedia v2 is an enhanced version of Cosmopedia, the largest synthetic dataset for pre-training, consisting of over 39 million textbooks, blog posts, and stories generated by… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus.

  13. hr-policies-qa-dataset

    • kaggle.com
    zip
    Updated Sep 11, 2025
    Cite
    Syncora_ai (2025). hr-policies-qa-dataset [Dataset]. https://www.kaggle.com/datasets/syncoraai/hr-policies-qa-dataset
    Explore at:
    Available download formats: zip (54895 bytes)
    Dataset updated
    Sep 11, 2025
    Authors
    Syncora_ai
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    šŸ¢ HR Policies Q&A Synthetic Dataset

    This synthetic dataset for LLM training captures realistic employee–assistant interactions about HR and compliance policies.
    Generated using Syncora.ai's synthetic data generation engine, it provides privacy-safe, high-quality conversations for training Large Language Models (LLMs) to handle HR-related queries.

    Perfect for researchers, HR tech startups, and AI developers building chatbots, compliance assistants, or policy QA systems — without exposing sensitive employee data.

    🧠 Context & Applications

    HR departments handle countless queries on policies, compliance, and workplace practices.
    This dataset simulates those Q&A flows, making it a powerful dataset for LLM training and research.

    You can use it for:

    • HR chatbot prototyping
    • Policy compliance assistants
    • Internal knowledge base fine-tuning
    • Generative AI experimentation
    • Synthetic benchmarking in enterprise QA systems

    📊 Dataset Features

    • role: Role of the message author (system, user, or assistant)
    • content: Actual text of the message
    • messages: Grouped sequence of role–content exchanges (conversation turns)

    Each entry represents a self-contained dialogue snippet designed to reflect natural HR conversations, ideal for synthetic data generation research.
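    Given the role/content/messages structure described above, one common preprocessing step is flattening a conversation into a single training string. The turn texts below are invented for illustration; only the role/content/messages shape follows the documented schema.

```python
# Hypothetical entry shaped like the documented schema: a "messages" list
# of role/content turns.
example = {
    "messages": [
        {"role": "system", "content": "You are an HR policy assistant."},
        {"role": "user", "content": "How do I request parental leave?"},
        {"role": "assistant",
         "content": "Submit the leave request form to HR at least 30 days in advance."},
    ]
}

def to_training_text(messages):
    """Join role/content turns into one prompt-style string for fine-tuning."""
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)

print(to_training_text(example["messages"]))
```

    Real fine-tuning pipelines would usually apply a model-specific chat template instead of this plain join, but the idea is the same.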

    📦 This Repo Contains

    • HR Policies QA Dataset – JSON format, ready to use for LLM training or evaluation
    • Jupyter Notebook – Explore the dataset structure and basic preprocessing
    • Synthetic Data Tools – Generate your own datasets using Syncora.ai
    • ⚡ Generate Synthetic Data
      Need more? Use Syncora.ai’s synthetic data generation tool to create custom HR/compliance datasets. Our process is simple, reliable, and ensures privacy.

    🧪 ML & Research Use Cases

    • Policy Chatbots — Train assistants to answer compliance and HR questions
    • Knowledge Management — Fine-tune models for consistent responses
    • Synthetic Data Research — Explore structured dialogue datasets without legal risks
    • Evaluation Benchmarks — Test enterprise AI assistants on HR-related queries
    • Dataset Expansion — Combine this dataset with your own data using synthetic generation

    🔒 Why Syncora.ai Synthetic Data?

    • Zero real-user data → Zero privacy liability
    • High realism → Actionable insights for LLM training
    • Fully customizable → Generate synthetic data tailored to your domain
    • Ethically aligned → Safe and responsible dataset creation

    Whether you're building an HR assistant, compliance bot, or experimenting with enterprise LLMs, Syncora.ai synthetic datasets give you trustworthy, free datasets to start with — and scalable tools to grow further.

    💬 Questions or Contributions?

    Got feedback, research use cases, or want to collaborate?
    Open an issue or reach out — we’re excited to work with AI researchers, HR tech builders, and compliance innovators.


    āš ļø Disclaimer

    This dataset is 100% synthetic and does not represent real employees or organizations.
    It is intended solely for research, educational, and experimental use in HR analytics, compliance automation, and machine learning.

  14. Bitext-retail-banking-llm-chatbot-training-dataset

    • huggingface.co
    Updated Jul 16, 2024
    Cite
    Bitext (2024). Bitext-retail-banking-llm-chatbot-training-dataset [Dataset]. https://huggingface.co/datasets/bitext/Bitext-retail-banking-llm-chatbot-training-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 16, 2024
    Dataset authored and provided by
    Bitext
    License

    https://choosealicense.com/licenses/cdla-sharing-1.0/

    Description

    Bitext - Retail Banking Tagged Training Dataset for LLM-based Virtual Assistants

      Overview
    

    This hybrid synthetic dataset is designed for fine-tuning Large Language Models such as GPT, Mistral, and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [Retail Banking] sector can be easily achieved using our two-step approach to LLM Fine-Tuning. See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-retail-banking-llm-chatbot-training-dataset.

  15. credit scoring dataset

    • kaggle.com
    zip
    Updated Sep 12, 2025
    Cite
    Syncora_ai (2025). credit scoring dataset [Dataset]. https://www.kaggle.com/datasets/syncoraai/credit-scoring-dataset
    Explore at:
    Available download formats: zip (1018261 bytes)
    Dataset updated
    Sep 12, 2025
    Authors
    Syncora_ai
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    💳 Synthetic Credit Scoring Dataset — Powered by Syncora

    High-Fidelity Synthetic Financial Behavior Dataset for AI, ML, and Risk Modeling

    🌟 About This Dataset

    This is a synthetic credit scoring dataset simulating realistic financial behaviors of individuals. Generated using Syncora.ai, it enables you to generate synthetic data safely without privacy concerns. Perfect for financial risk modeling, tabular ML classification, and as a dataset for LLM training.

    Key points:
    - Fully synthetic / fake data — 0% privacy risk.
    - Maintains realistic correlations between income, savings, debt, and credit behavior.
    - Ideal free dataset for experimentation, AI/ML education, and model prototyping.

    📊 Dataset Features

    Feature        Type
    CUST_ID        string
    INCOME         int32
    SAVINGS        int32
    DEBT           int32
    CREDIT_SCORE   int32
    DEFAULT        int32

    Task Categories: Tabular Classification, Financial Risk Modeling
    License: Apache-2.0
    Size Category: 10K < n < 100K

    🤖 Machine Learning & AI Use Cases

    • 💳 Credit Risk Modeling: Train classification models to predict default risk.
    • ⚙️ Feature Engineering: Extract behavioral features like debt-to-income and repayment consistency.
    • 🧠 LLM Alignment: Use as a structured dataset for LLM training (e.g., converting tabular inputs into human-readable risk assessments).
    • 📊 Benchmarking: Compare model accuracy, precision, and recall across logistic regression, random forest, XGBoost, and deep learning.
    • 🔍 Explainability: Apply SHAP, LIME, or ELI5 to interpret model predictions.
    • ⚖️ Bias & Fairness Studies: Explore whether synthetic datasets can reduce bias compared to real-world financial data.
    • ✅ Synthetic Data Validation: Test how well synthetic datasets maintain model performance relative to real datasets.
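    As a sketch of the feature-engineering idea above, here is the debt-to-income ratio computed over the dataset's column names (CUST_ID, INCOME, SAVINGS, DEBT); the row values and the 0.5 risk threshold are made up for illustration:

```python
# Debt-to-income feature over the dataset's columns; guard against zero income.
def debt_to_income(row):
    return row["DEBT"] / row["INCOME"] if row["INCOME"] else float("inf")

# Made-up rows in the dataset's schema.
rows = [
    {"CUST_ID": "C001", "INCOME": 60000, "SAVINGS": 15000, "DEBT": 12000},
    {"CUST_ID": "C002", "INCOME": 45000, "SAVINGS": 2000,  "DEBT": 36000},
]

for row in rows:
    row["DTI"] = debt_to_income(row)

# Naive illustrative risk flag; the 0.5 cutoff is arbitrary, not from the dataset.
high_risk = [r["CUST_ID"] for r in rows if r["DTI"] > 0.5]
print(high_risk)  # → ['C002']
```

    In practice such derived ratios would feed a classifier (logistic regression, XGBoost, etc.) rather than a fixed threshold.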

    💡 Ideas for Exploration

    • Which financial ratios most strongly predict DEFAULT?
    • Can synthetic datasets improve fairness in credit scoring models?
    • How do demographic variables (education, marital status, dependents) interact with repayment behavior?
    • Can LLMs generate reliable credit narratives when trained on structured synthetic data?

    Syncora.ai Platform – Generate your own high-fidelity synthetic datasets.
    ⚡ Generate Your Own Synthetic Data

    šŸ“ Citation

    @dataset{syncora_synthetic_credit_scoring,
      title  = {Synthetic Credit Scoring Dataset},
      author = {Syncora.ai},
      year   = {2025},
      url    = {https://www.kaggle.com/datasets/syncora-ai/synthetic-credit-scoring}
    }
    
  16. synthetic-quotes

    • kaggle.com
    zip
    Updated Jun 29, 2024
    Cite
    duckycode (2024). synthetic-quotes [Dataset]. https://www.kaggle.com/datasets/duckycode/synthetic-quotes/discussion
    Explore at:
    Available download formats: zip (88374 bytes)
    Dataset updated
    Jun 29, 2024
    Authors
    duckycode
    License

    Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Small, complex dataset for classifying LLM-generated text. If a competition on AI-generated text detection comes up, I would recommend using just the train CSV to validate your pipeline. The task was to get an LLM to generate human-sounding quotes, which is very hard, so you can probably tell the difference just by reading them; even so, it is hard to train a good classifier without overfitting.

    Includes 500 real quotes from here. (490 train, 10 validation)

    Synthetic quotes are generated via mistral-small-2402. (300 train, 10 validation)

    There is a slight class imbalance; it hasn't caused much trouble for classification so far, but the dataset may be updated.

    Text generation used a 3-shot prompt in JSON mode (grammars) at a temperature of 1.0 to encourage creative outputs. A seed value added extra randomness, and the LLM also judged each quote's "depth" and "tone". The same will be done for the real quotes in a future update, to see whether an LLM judge's scores correlate with real vs. synthetic quotes.

    All done using Mistral API free grants.

    IMPORTANT

    Label convention: 1 = real, 0 = synthetic
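    A small sketch of applying the label convention above when reading the CSV. The column names ("text", "label") and the sample rows are assumptions for illustration; the actual train CSV columns may differ:

```python
import csv
import io

# In-memory stand-in for train.csv; hypothetical column names.
sample_csv = """text,label
"The unexamined life is not worth living.",1
"Dreams are the compass of a wandering heart.",0
"""

# Count class balance using the 1 = real, 0 = synthetic convention.
reader = csv.DictReader(io.StringIO(sample_csv))
counts = {"real": 0, "synthetic": 0}
for row in reader:
    counts["real" if row["label"] == "1" else "synthetic"] += 1

print(counts)  # → {'real': 1, 'synthetic': 1}
```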

  17. Training_Data_FineTuning_LLM_PEFT_LORA

    • kaggle.com
    zip
    Updated Aug 8, 2024
    Cite
    Rupak Roy/ Bob (2024). Training_Data_FineTuning_LLM_PEFT_LORA [Dataset]. https://www.kaggle.com/datasets/rupakroy/training-dataset-peft-lora
    Explore at:
    Available download formats: zip (29562174 bytes)
    Dataset updated
    Aug 8, 2024
    Authors
    Rupak Roy/ Bob
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    The dataset contains conversation summaries, topics, and dialogues used to build a pipeline for fine-tuning an LLM with Parameter-Efficient Fine-Tuning (PEFT) and Low-Rank Adaptation (LoRA), a popular and lightweight training technique that significantly reduces the number of trainable parameters.

    The dataset is also available on Hugging Face: https://huggingface.co/datasets/knkarthick/dialogsum

  18. Sample RAG Knowledge Item Dataset

    • kaggle.com
    Updated Aug 18, 2024
    Cite
    David Hundley (2024). Sample RAG Knowledge Item Dataset [Dataset]. https://www.kaggle.com/datasets/dkhundley/sample-rag-knowledge-item-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 18, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    David Hundley
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This is a small dataset that learners can use for testing out their retrieval augmented generation (RAG) knowledge. New RAG students will learn that their RAG processes can be optimized in many ways, including how the documents are chunked, how chunks are retrieved, and more. This dataset was designed to allow students to experiment with these different strategies.

    This smaller dataset was generated from a larger dataset that I created, which can be found at this link:

    https://www.kaggle.com/datasets/dkhundley/synthetic-it-related-knowledge-items

    The larger dataset represents a set of 100 articles like those you might find in a typical Fortune 500 company's IT helpdesk. Students are advised to use the larger dataset for full RAG experimentation; the smaller dataset provided here contains a focused set of material to test with across your experiments.
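    One of the chunking strategies a student might try on these articles is fixed-size chunking with overlap; here is a minimal sketch (the chunk size and overlap values are illustrative, not a recommendation):

```python
# Fixed-size character chunker with overlap between consecutive chunks.
def chunk(text, size=200, overlap=50):
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

# Stand-in for one knowledge-item article.
article = "IT helpdesk article text " * 40
pieces = chunk(article, size=200, overlap=50)
print(len(pieces), len(pieces[0]))
```

    Variants worth comparing in a RAG experiment include sentence-aware splitting, token-based sizes, and varying the overlap.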

    Both this dataset and the other larger dataset were generated using this Kaggle notebook:

    https://www.kaggle.com/code/dkhundley/generate-synthetic-ki-dataset

  19. BTC Trading Bot Dataset with Advanced Indicators

    • kaggle.com
    zip
    Updated Dec 8, 2024
    Cite
    EMİRHAN BULUT (2024). BTC Trading Bot Dataset with Advanced Indicators [Dataset]. https://www.kaggle.com/datasets/emirhanai/btc-trading-bot-dataset-with-advanced-indicators
    Explore at:
    Available download formats: zip (54013 bytes)
    Dataset updated
    Dec 8, 2024
    Authors
    EMİRHAN BULUT
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    BTC Trading Bot Dataset with Advanced Indicators - Predict $100,000 Breakthrough

    Description:
    This dataset is designed to power innovative machine learning and LLM-based trading bots focused on Bitcoin (BTC) trading strategies. It combines historical BTC price data with advanced custom indicators, offering valuable insights for predicting key market trends, including the potential breakthrough of BTC exceeding $100,000.

    The dataset contains enriched features, meticulously crafted to serve as the foundation for predictive models and algorithmic trading strategies. Each indicator has been optimized for capturing both macro and micro trends in BTC's volatile market.

    Dataset Features

    1. Date and Time: Timestamp of the price data in UTC.
    2. Open, High, Low, Close (OHLC): Standard candlestick charting data.
    3. Volume: Total trading volume in BTC during the respective time period.
    4. Custom Indicators (Synthetic):

      • Momentum Signal (MOM_S): Measures the velocity of price movements over a custom period.
      • Volatility Tracker (VOL_T): Captures the price range fluctuations with a proprietary algorithm.
      • Trend Oscillator (T_OSC): Detects market trends by analyzing price shifts and crossovers.
      • Buy Pressure Index (BPI): Highlights potential buy zones based on unique volume-weighted formulas.
      • Breakthrough Indicator (BRI): Focused on identifying signals for major price thresholds like $100,000.
      • Sentiment Score (SENT_S): Derived from on-chain data and social sentiment.
    5. Derived Metrics:

      • Moving Averages (MA10, MA50, MA200): Commonly used technical indicators.
      • RSI (Relative Strength Index): Measures the magnitude of recent price changes to evaluate overbought/oversold conditions.
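    The moving averages and RSI listed above can be sketched in a few lines; the closing prices below are made-up numbers, and this RSI uses plain average gains/losses rather than Wilder smoothing:

```python
# Simple moving average over the trailing `window` closes.
def sma(closes, window):
    return [
        sum(closes[i - window + 1:i + 1]) / window
        for i in range(window - 1, len(closes))
    ]

# Basic RSI: average gain / average loss over the first `period` deltas.
def rsi(closes, period=14):
    deltas = [b - a for a, b in zip(closes, closes[1:])]
    gains = [max(d, 0) for d in deltas[:period]]
    losses = [-min(d, 0) for d in deltas[:period]]
    avg_gain, avg_loss = sum(gains) / period, sum(losses) / period
    if avg_loss == 0:
        return 100.0
    rs = avg_gain / avg_loss
    return 100 - 100 / (1 + rs)

# Illustrative closes, not from the dataset.
closes = [100, 102, 101, 105, 107, 106, 108, 110, 109, 111,
          113, 112, 114, 116, 115]
print(sma(closes, 3)[:2])  # → [101.0, 102.66666666666667]
```

    Real charting libraries apply exponential or Wilder smoothing to RSI; this plain-average version is the simplest form for experimentation.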

    Target Audience

    This dataset is ideal for:
    - Quantitative Analysts: Exploring BTC trends and creating trading signals.
    - Data Scientists: Building LLMs for price prediction and risk analysis.
    - Financial Enthusiasts: Experimenting with algorithmic trading strategies.

    Usage Scenarios

    1. Predicting Market Movements: Utilize the synthetic indicators and historical data to train models for identifying price spikes or drops.
    2. Developing Trading Bots: Fine-tune machine learning-based trading bots, focusing on the $100,000 BTC price breakthrough.
    3. Backtesting Strategies: Validate trading algorithms and measure performance against historical data.

    Additional Notes

    This dataset is tailored for Bitcoin trading and does not include unrelated financial instruments. The provided indicators were designed for educational and research purposes but align with real-world trading practices for improved predictive accuracy. Dive into the data and start creating your trading bot today!

  20. Not seeing a result you expected?
    Learn how you can add new datasets to our index.


customer support conversations

135 scholarly articles cite this dataset (View in Google Scholar)


Use Cases

  • Chatbot Training & Evaluation – Build and fine-tune conversational agents with realistic dialogue data.
  • LLM Training & Alignment – Use as a dataset for LLM training on dialogue tasks.
  • Customer Support Automation – Prototype or benchmark AI-driven support systems.
  • Dialogue Analytics – Study sentiment, escalation patterns, and domain-specific behavior.
  • Synthetic Data Research – Validate synthetic data generation pipelines for conversational systems.
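The dialogue-analytics use case above can be sketched over the fields in the feature table (conversation_id, domain, intent_label, resolution_status); the records below are made up for illustration:

```python
# Made-up records in the dataset's schema.
records = [
    {"conversation_id": "c1", "domain": "banking", "intent_label": "refund_request",
     "resolution_status": "resolved"},
    {"conversation_id": "c2", "domain": "telecom", "intent_label": "password_reset",
     "resolution_status": "escalated"},
    {"conversation_id": "c3", "domain": "banking", "intent_label": "refund_request",
     "resolution_status": "escalated"},
]

# Escalation analysis: which conversations were not resolved?
escalated = [r["conversation_id"] for r in records
             if r["resolution_status"] == "escalated"]

# Group conversations by labeled intent.
by_intent = {}
for r in records:
    by_intent.setdefault(r["intent_label"], []).append(r["conversation_id"])

print(escalated)                    # → ['c2', 'c3']
print(by_intent["refund_request"])  # → ['c1', 'c3']
```

The same grouping extends naturally to sentiment_score averages per domain or escalation rates per intent.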

Why Synthetic?

  • Privacy-Safe – No real user data; fully synthetic and compliant.
  • Scalable – Generate millions of conversations for LLM and chatbot training.
  • Balanced & Bias-Controlled – Ensures diversity and fairness in training data.
  • Instantly Usable – Pre-structured and cleanly labeled for NLP tasks.

Generate Your Own Synthetic Data

Use Syncora.ai to generate synthetic conversational datasets for your AI or chatbot projects:
Try the Synthetic Data Generation tool

License

This dataset is released under the MIT License.
It is fully synthetic, free, and safe for LLM training, chatbot model fine-tuning, and AI research.
