MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
High-quality synthetic dataset for chatbot training, LLM fine-tuning, and AI research in conversational systems.
This dataset provides a fully synthetic collection of customer support interactions, generated using Syncora.ai's synthetic data generation engine.
It mirrors realistic support conversations across e-commerce, banking, SaaS, and telecom domains, ensuring diversity, context depth, and privacy-safe realism.
Each conversation simulates multi-turn dialogues between a customer and a support agent, making it ideal for training chatbots, LLMs, and retrieval-augmented generation (RAG) systems.
This is a free dataset, designed for LLM training, chatbot model fine-tuning, and dialogue understanding research.
| Feature | Description |
|---|---|
| conversation_id | Unique identifier for each dialogue session |
| domain | Industry domain (e.g., banking, telecom, retail) |
| role | Speaker role: customer or support agent |
| message | Message text (synthetic conversation content) |
| intent_label | Labeled customer intent (e.g., refund_request, password_reset) |
| resolution_status | Whether the query was resolved or escalated |
| sentiment_score | Sentiment polarity of the conversation |
| language | Language of interaction (supports multilingual synthetic data) |
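A minimal sketch of how a dataset with this schema might be consumed, assuming a CSV export with the columns above (the file name and the "escalated" label value are hypothetical):

```python
import pandas as pd

# Hypothetical file name; adjust to the actual download.
df = pd.read_csv("synthetic_customer_support.csv")

# Reassemble each multi-turn dialogue in order of appearance.
for conv_id, turns in df.groupby("conversation_id", sort=False):
    dialogue = "\n".join(f"{row.role}: {row.message}" for row in turns.itertuples())

# Example filter: escalated banking conversations.
# The "escalated" value is assumed; check the actual label set.
escalated = df[(df["domain"] == "banking") & (df["resolution_status"] == "escalated")]
```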
Use Syncora.ai to generate synthetic conversational datasets for your AI or chatbot projects:
Try Synthetic Data Generation tool
This dataset is released under the MIT License.
It is fully synthetic, free, and safe for LLM training, chatbot model fine-tuning, and AI research.
Synthetic Legal Contract Dataset - Powered by Syncora.ai
High-Fidelity Synthetic Dataset for LLM Training, Legal NLP & AI Research
About This Dataset
This repository provides a synthetic dataset of legal contract Q&A interactions, modeled after real-world corporate filings (e.g., SEC 8-K disclosures).
All records are fake data, generated using Syncora.ai, ensuring privacy-safe, free dataset access suitable for LLM training, benchmarking, and experimentation. … See the full description on the dataset page: https://huggingface.co/datasets/syncora/legal_contract_dataset.
CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/
This dataset is sourced from running the Synthetic Dataset Creation w/ InternVL2 script. It was designed for compatibility with the LLM Finetuning Script, which fine-tunes Large Language Models (LLMs) using datasets like this one. It is simply an example of how a dataset should be structured for the LLM Finetuning Script; feel free to make your own datasets with the help of the Synthetic Dataset Creation w/ InternVL2 script.
Mental Health Posting Dataset - Synthetic Dataset for LLM & Chatbot Training
Free dataset for mental health research, LLM training, and chatbot development, generated using synthetic data generation techniques to ensure privacy and high fidelity.
About This Dataset
This dataset contains synthetic mental health survey responses across multiple demographics and occupations. It includes participant-reported stress levels, coping mechanisms, mood swings, and social… See the full description on the dataset page: https://huggingface.co/datasets/syncora/mental_health_survey_dataset.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
High-Fidelity Synthetic Medical Records for AI, ML Modeling, LLM Training & HealthTech Research
This is a synthetic dataset of healthcare records generated using Syncora.ai, a next-generation synthetic data generation platform designed for privacy-safe AI development.
It simulates patient demographics, medical conditions, treatments, billing, and admission data, preserving statistical realism while ensuring 0% privacy risk.
This free dataset is designed for:
Think of this as fake data that mimics real-world healthcare patterns: statistically accurate, but without any sensitive patient information.
The dataset captures patient-level hospital information, including:
All records are 100% synthetic, maintaining the statistical properties of real-world healthcare data while remaining safe to share and use for ML & LLM tasks.
Unlike most healthcare datasets, this one is tailored for LLM training:
Syncora.ai is a synthetic data generation platform designed for healthcare, finance, and enterprise AI.
Key benefits:
Take your AI projects to the next level with Syncora.ai:
Generate your own synthetic datasets now
This is a free dataset, 100% synthetic, and contains no real patient information.
It is safe for public use in education, research, open-source contributions, LLM training, and AI development.
Synthetic Wearable & Activity Dataset - Powered by Syncora.ai
Free dataset for health analytics, activity recognition, synthetic data generation, and LLM training, generated using synthetic data generation techniques to ensure privacy and high fidelity.
About This Dataset
This dataset contains synthetic wearable fitness records, modeled on signals from devices such as the Apple Watch. All entries are fully synthetic, generated with Syncora.ai's synthetic data engine, ensuring privacy-safe and bias-aware data.
The dataset provides rich… See the full description on the dataset page: https://huggingface.co/datasets/syncora/fitness-tracker-dataset.
CDLA-Sharing-1.0: https://choosealicense.com/licenses/cdla-sharing-1.0/
Bitext - Customer Service Tagged Training Dataset for LLM-based Virtual Assistants
Overview
This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral, and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the Customer Support sector can be easily achieved using our two-step approach to LLM… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-customer-support-llm-chatbot-training-dataset.
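If the dataset is fetched from the Hugging Face Hub, a minimal loading sketch might look like this (the split name is an assumption):

```python
from datasets import load_dataset

# Load the Bitext customer-support dataset from the Hugging Face Hub.
ds = load_dataset("bitext/Bitext-customer-support-llm-chatbot-training-dataset")
print(ds["train"][0])  # inspect one tagged utterance
```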
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
A free synthetic dataset of mental health survey responses, designed for LLM training, AI research, and synthetic data generation.
It captures stress levels, coping strategies, mood swings, and work-life interactions across diverse demographics.
All entries are fully synthetic, privacy-safe, and structured for easy modeling and analysis.
Ideal for developing AI models, visualizations, or fine-tuning LLMs on structured mental health data.
This dataset contains synthetic survey responses covering mental health, lifestyle, and work-related factors.
It is suitable for LLM training, mental health research, and experimentation with free synthetic datasets.
| Column | Description |
|---|---|
| Timestamp | Date and time of survey response |
| Gender | Respondent gender |
| Country | Respondent country |
| Occupation | Profession or role |
| self_employed | Whether respondent is self-employed |
| family_history | Family history of mental health issues |
| treatment | Whether respondent has sought treatment |
| Days_Indoors | Average number of days indoors |
| Growing_Stress | Respondent perception of stress growth |
| Changes_Habits | Whether lifestyle habits have changed |
| Mental_Health_History | Past mental health history |
| Mood_Swings | Frequency of mood swings |
| Coping_Struggles | Difficulty in coping |
| Work_Interest | Level of engagement at work |
| Social_Weakness | Social interaction challenges |
| mental_health_interview | Respondent willingness for interview |
| care_options | Preferred care options |
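A minimal analysis sketch against this schema, assuming a CSV export and a Yes/No encoding for the `treatment` column (both assumptions):

```python
import pandas as pd

# Hypothetical file name; adjust to the actual download.
survey = pd.read_csv("mental_health_survey.csv")

# Example slice: share of respondents who sought treatment, by occupation.
# A "Yes"/"No" encoding for `treatment` is assumed; check the actual values.
rates = (survey["treatment"].eq("Yes")
         .groupby(survey["Occupation"])
         .mean()
         .sort_values(ascending=False))
print(rates)
```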
Synthetic Data Generator: Generate your own structured datasets
Open Generator
Syncora.ai: Platform powering the synthetic dataset
Visit Website
Released under MIT License.
This dataset is 100% synthetic, free, and safe for LLM training, mental health research, and synthetic data experiments.
Apache License 2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
synthetic_text_to_sql
gretelai/synthetic_text_to_sql is a rich dataset of high-quality synthetic Text-to-SQL samples, designed and generated using Gretel Navigator, and released under Apache 2.0. Please see our release blogpost for more details. The dataset includes:
- 105,851 records partitioned into 100,000 train and 5,851 test records
- ~23M total tokens, including ~12M SQL tokens
- Coverage across 100 distinct… See the full description on the dataset page: https://huggingface.co/datasets/gretelai/synthetic_text_to_sql.
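A minimal loading sketch via the `datasets` library, using the dataset ID from the page above:

```python
from datasets import load_dataset

# Loads the train/test partition described above (100,000 / 5,851 records).
ds = load_dataset("gretelai/synthetic_text_to_sql")
print(ds["train"].num_rows, ds["test"].num_rows)
print(ds["train"][0])  # inspect one Text-to-SQL record
```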
According to our latest research, the global Evaluation Dataset Curation for LLMs market size reached USD 1.18 billion in 2024, reflecting robust momentum driven by the proliferation of large language models (LLMs) across industries. The market is projected to expand at a CAGR of 24.7% from 2025 to 2033, reaching a forecasted value of USD 9.01 billion by 2033. This impressive growth is primarily fueled by the surging demand for high-quality, unbiased, and diverse datasets essential for evaluating, benchmarking, and fine-tuning LLMs, as well as for ensuring their safety and fairness in real-world applications.
The exponential growth of the Evaluation Dataset Curation for LLMs market is underpinned by the rapid advancements in artificial intelligence and natural language processing technologies. As organizations increasingly deploy LLMs for a variety of applications, the need for meticulously curated datasets has become paramount. High-quality datasets are the cornerstone for testing model robustness, identifying biases, and ensuring compliance with ethical standards. The proliferation of domain-specific use cases, from healthcare diagnostics to legal document analysis, has further intensified the demand for specialized datasets tailored to unique linguistic and contextual requirements. Moreover, the growing recognition of dataset quality as a critical determinant of model performance is prompting enterprises and research institutions to invest heavily in advanced curation platforms and services.
Another significant growth driver for the Evaluation Dataset Curation for LLMs market is the heightened regulatory scrutiny and societal emphasis on AI transparency, fairness, and accountability. Governments and standard-setting bodies worldwide are introducing stringent guidelines to mitigate the risks associated with biased or unsafe AI systems. This regulatory landscape is compelling organizations to adopt rigorous dataset curation practices, encompassing bias detection, fairness assessment, and safety evaluations. As LLMs become integral to decision-making processes in sensitive domains such as finance, healthcare, and public policy, the imperative for trustworthy and explainable AI models is fueling the adoption of comprehensive evaluation datasets. This trend is expected to accelerate as new regulations come into force, further expanding the market's scope.
The market is also benefiting from the collaborative efforts between academia, industry, and open-source communities to establish standardized benchmarks and best practices for LLM evaluation. These collaborations are fostering innovation in dataset curation methodologies, including the use of synthetic data generation, crowdsourcing, and automated annotation tools. The integration of multimodal data, combining text, images, and code, is enabling more holistic assessments of LLM capabilities, thereby expanding the market's addressable segments. Additionally, the emergence of specialized startups focused on dataset curation services is introducing competitive dynamics and driving technological advancements. These factors collectively contribute to the market's sustained growth trajectory.
Regionally, North America continues to dominate the Evaluation Dataset Curation for LLMs market, accounting for the largest share in 2024, followed by Europe and Asia Pacific. The United States, in particular, is home to leading AI research institutions, technology giants, and a vibrant ecosystem of startups dedicated to LLM development and evaluation. Europe is witnessing increased investments in AI ethics and regulatory compliance, while Asia Pacific is rapidly emerging as a key growth market due to its expanding AI research capabilities and government-led digital transformation initiatives. Latin America and the Middle East & Africa are also showing promise, albeit from a smaller base, as local enterprises and public sector organizations begin to recognize the strategic importance of robust LLM evaluation frameworks.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
REGEN is a synthetic dataset designed to evaluate the conversational capabilities of Recommender Language Models (RLMs). Derived from the 2018 Amazon Reviews dataset (https://jmcauley.ucsd.edu/data/amazon/), REGEN transforms raw product and review data into rich, interactive narratives, simulating real-world user interactions. Further details of REGEN and benchmarks using recommender LLMs can be found at https://arxiv.org/pdf/2503.11924. We summarize the key details below.
Leveraging the power of Gemini Flash, REGEN adds diverse narratives such as purchase reasons, explanations, product endorsements, user summaries, and concise user profiles, personalized to each user. The narratives vary in context, length, and style, encompassing:
- Explicit context narratives (e.g., purchase explanations, endorsements)
- Context-free narratives (e.g., user summaries)
- Short-form and long-form narratives
Further, it adds plausible critiques, or natural language feedback that the user issues in response to recommendations to steer the recommender to the next item.
REGEN organizes user reviews chronologically, capturing the evolution of user preferences and product experiences over time. Each sequence is capped at 50 reviews.
The Amazon Reviews dataset, while valuable for recommendation studies, primarily reflects a series of individual user interactions rather than real conversations. To create a synthetic dataset that emulates multi-turn conversational dynamics, we employ an inpainting technique to generate the elements typically missing in nonconversational review data. Specifically, we leverage LLMs to synthesize these elements, using carefully designed prompts and an iterative evaluation process to ensure high-quality generation and minimize inaccuracies. We exploit the extended context length capabilities of the LLM to effectively process the complete user history for each user. Our prompts are simple, employing a task prefix with instructions and the desired output format. This prefix is followed by the user's entire interaction history, including item metadata and review text. Finally, the prompt specifies the expected output format. We use the user history and associated reviews to generate critiques, purchase reasons and summaries. For product endorsement generation, however, we intentionally withhold the user's last review, enabling the model to learn how to endorse a newly recommended item effectively.
REGEN adds generated critiques (short utterances) to steer the system from the current recommended item to a desired item. Our aim is to focus on the setting where a user refines from one item to a closely-related item variant (e.g., "red ball-point pen" to "black ball-point pen"), rather than to another arbitrary item, which would be better served by other mechanisms, e.g. a new search query. Thus, REGEN only generates critiques between adjacent pairs of items that are sufficiently similar. We use the Amazon Reviews-provided hierarchical item categories as a rough proxy for item similarity and consider items sufficiently similar if at least four levels of the category match.
To generate critiques for adjacent item pairs that meet the similarity criteria, we query Gemini 1.5 Flash to synthesize several options, instructing the LLM to treat the first item as the current recommendation and the second item as representative of the user's desired goal. The prompt contains the item descriptions and few-shot examples. We select at random one of the generated critiques to inpaint into the dataset for that pair. For pairs that do not meet the similarity criteria, REGEN contains a sentinel placeholder. Dataset users can decide to train on the placeholder, to model the case where end-users do not provide critiques emulating a new search query, or to replace them with other critiques. For the "Clothing" dataset, REGEN includes LLM-generated critiques for about 18.6% of adjacent items appearing in user sequences. For "Offices," there are about 17.6%.
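A sketch of the four-level category-match rule described above, assuming category paths are represented as root-to-leaf lists (the example paths are illustrative, not taken from the dataset):

```python
def sufficiently_similar(path_a, path_b, min_levels=4):
    """Proxy for item similarity: the first `min_levels` levels of the
    hierarchical category paths must match."""
    if min(len(path_a), len(path_b)) < min_levels:
        return False
    return path_a[:min_levels] == path_b[:min_levels]

# Illustrative category paths:
pen_a = ["Office Products", "Writing", "Pens", "Ballpoint", "Retractable"]
pen_b = ["Office Products", "Writing", "Pens", "Ballpoint", "Stick"]
assert sufficiently_similar(pen_a, pen_b)  # eligible for a generated critique
```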
We aim to generate diverse narratives for conversational recommender systems, varying in:
- Contextualization: inclusion or exclusion of explicit context (e.g., user summaries, purchase explanations, endorsements) to assess its impact on quality and relevance.
- Length: short-form and long-form narratives for varied conversational scenarios.
Detailed descriptions and examples of the narratives are summarized below.
ODC-By License: https://choosealicense.com/licenses/odc-by/
SmolLM-Corpus
This dataset is a curated collection of high-quality educational and synthetic data designed for training small language models. You can find more details about the models trained on this dataset in our SmolLM blog post.
Dataset subsets
Cosmopedia v2
Cosmopedia v2 is an enhanced version of Cosmopedia, the largest synthetic dataset for pre-training, consisting of over 39 million textbooks, blog posts, and stories generated by… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This synthetic dataset for LLM training captures realistic employee-assistant interactions about HR and compliance policies.
Generated using Syncora.ai's synthetic data generation engine, it provides privacy-safe, high-quality conversations for training Large Language Models (LLMs) to handle HR-related queries.
Perfect for researchers, HR tech startups, and AI developers building chatbots, compliance assistants, or policy QA systems, without exposing sensitive employee data.
HR departments handle countless queries on policies, compliance, and workplace practices.
This dataset simulates those Q&A flows, making it a powerful dataset for LLM training and research.
You can use it for:
| Column | Description |
|---|---|
| role | Role of the message author (system, user, or assistant) |
| content | Actual text of the message |
| messages | Grouped sequence of role-content exchanges (conversation turns) |
Each entry represents a self-contained dialogue snippet designed to reflect natural HR conversations, ideal for synthetic data generation research.
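For illustration, one record under this schema might look like the following; the layout assumes the common chat-messages JSON convention, and the text is invented rather than taken from the dataset:

```python
# Illustrative record, assuming the chat-messages layout implied by the schema above.
example = {
    "messages": [
        {"role": "system", "content": "You are an HR policy assistant."},
        {"role": "user", "content": "How many days of parental leave am I entitled to?"},
        {"role": "assistant", "content": "Under the sample policy, eligible employees receive..."},
    ]
}
```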
Whether you're building an HR assistant, a compliance bot, or experimenting with enterprise LLMs, Syncora.ai synthetic datasets give you trustworthy, free datasets to start with, and scalable tools to grow further.
Got feedback, research use cases, or want to collaborate?
Open an issue or reach out; we're excited to work with AI researchers, HR tech builders, and compliance innovators.
This dataset is 100% synthetic and does not represent real employees or organizations.
It is intended solely for research, educational, and experimental use in HR analytics, compliance automation, and machine learning.
CDLA-Sharing-1.0: https://choosealicense.com/licenses/cdla-sharing-1.0/
Bitext - Retail Banking Tagged Training Dataset for LLM-based Virtual Assistants
Overview
This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral, and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the Retail Banking sector can be easily achieved using our two-step approach to LLM Fine-Tuning. … See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-retail-banking-llm-chatbot-training-dataset.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
High-Fidelity Synthetic Financial Behavior Dataset for AI, ML, and Risk Modeling
This is a synthetic credit scoring dataset simulating realistic financial behaviors of individuals. Generated using Syncora.ai, it enables you to generate synthetic data safely without privacy concerns. Perfect for financial risk modeling, tabular ML classification, and as a dataset for LLM training.
Key points:
- Fully synthetic / fake data: 0% privacy risk.
- Maintains realistic correlations between income, savings, debt, and credit behavior.
- Ideal free dataset for experimentation, AI/ML education, and model prototyping.
| Feature | Type |
|---|---|
| CUST_ID | string |
| INCOME | int32 |
| SAVINGS | int32 |
| DEBT | int32 |
| CREDIT_SCORE | int32 |
| DEFAULT | int32 |
Task Categories: Tabular Classification, Financial Risk Modeling
License: Apache-2.0
Size Category: 10K < n < 100K
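A minimal tabular-classification sketch over this schema, assuming a CSV export (the file name is hypothetical):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical file name; adjust to the actual download.
df = pd.read_csv("synthetic_credit_scoring.csv")

X = df[["INCOME", "SAVINGS", "DEBT", "CREDIT_SCORE"]]
y = df["DEFAULT"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)
print(f"Holdout accuracy: {clf.score(X_test, y_test):.3f}")
```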
Syncora.ai Platform: generate your own high-fidelity synthetic datasets.
Generate Your Own Synthetic Data
@dataset{syncora_synthetic_credit_scoring,
title = {Synthetic Credit Scoring Dataset},
author = {Syncora.ai},
year = {2025},
url = {https://www.kaggle.com/datasets/syncora-ai/synthetic-credit-scoring}
}
Apache License 2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
A small but challenging dataset for classifying LLM-generated text. I'd recommend just using the train CSV to validate your pipeline for AI-generated texts if a competition comes up. The task was to get an LLM to generate human-sounding quotes, which is very hard, so you can probably tell the difference just by reading them, but it is hard to train a good classifier without overfitting.
Includes 500 real quotes from here. (490 train, 10 validation)
Synthetic quotes are generated via mistral-small-2402. (300 train, 10 validation)
There is a slight dataset imbalance; I haven't seen much of a problem with classification, and may update.
Text generation was done with a 3-shot prompt, in JSON mode (grammars), at a temperature of 1.0 to encourage creative outputs. A seed value was used to encourage some extra randomness, and the LLM judged each quote's "depth" and "tone". I will do the same with the real quotes in an update to see if there is any correlation between an LLM judge and real vs. synthetic quotes.
All done using Mistral API free grants.
IMPORTANT: labels are 1 = Real, 0 = Synthetic.
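A baseline classifier sketch for these labels, assuming CSV files with `quote` and `label` columns (file and column names are hypothetical):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical file and column names; adjust to the actual CSVs.
train = pd.read_csv("train.csv")  # assumed columns: "quote", "label"

clf = make_pipeline(
    # Modest vocabulary to reduce the overfitting risk noted above.
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    LogisticRegression(max_iter=1000),
)
clf.fit(train["quote"], train["label"])  # label: 1 = real, 0 = synthetic
```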
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The dataset contains conversation summaries, topics, and dialogues used to build a pipeline for fine-tuning an LLM with Parameter-Efficient Fine-Tuning (PEFT) and Low-Rank Adaptation (LoRA), a popular and lightweight training technique that significantly reduces the number of trainable parameters.
The dataset is also available on Hugging Face: https://huggingface.co/datasets/knkarthick/dialogsum
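A minimal PEFT/LoRA setup sketch in the spirit of the pipeline described above; the base model and hyperparameters are assumptions, not the author's exact configuration:

```python
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model

# Model choice is an assumption; any seq2seq LLM suited to summarization works.
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=32,              # scaling applied to the LoRA update
    lora_dropout=0.05,
    target_modules=["q", "v"],  # T5 attention projections
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```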
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This is a small dataset that learners can use for testing out their retrieval augmented generation (RAG) knowledge. New RAG students will learn that their RAG processes can be optimized in many ways, including how the documents are chunked, how chunks are retrieved, and more. This dataset was designed to allow students to experiment with these different strategies.
This smaller dataset was generated from a larger dataset that I created, which can be found at this link:
https://www.kaggle.com/datasets/dkhundley/synthetic-it-related-knowledge-items
This larger dataset represents a set of 100 articles that you might find in a typical Fortune 500 company's IT helpdesk. Students are advised to use the larger dataset for full RAG experimentation, but the smaller dataset provided here contains a focused set of material to test with across your experiments.
Both this dataset and the other larger dataset were generated using this Kaggle notebook:
https://www.kaggle.com/code/dkhundley/generate-synthetic-ki-dataset
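As a starting point for the chunking experiments mentioned above, a minimal fixed-size chunker might look like this (one strategy among many worth comparing):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size character chunking with overlap; alternatives include
    sentence-based, heading-based, and semantic chunking."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Usage: compare retrieval quality across chunk sizes and overlaps.
chunks = chunk_text("..." * 1000, chunk_size=500, overlap=50)
```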
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
BTC Trading Bot Dataset with Advanced Indicators - Predict $100,000 Breakthrough
Description:
This dataset is designed to power innovative machine learning and LLM-based trading bots focused on Bitcoin (BTC) trading strategies. It combines historical BTC price data with advanced custom indicators, offering valuable insights for predicting key market trends, including the potential breakthrough of BTC exceeding $100,000.
The dataset contains enriched features, meticulously crafted to serve as the foundation for predictive models and algorithmic trading strategies. Each indicator has been optimized for capturing both macro and micro trends in BTC's volatile market.
Custom Indicators (Synthetic):
Derived Metrics:
This dataset is ideal for:
- Quantitative Analysts: Exploring BTC trends and creating trading signals.
- Data Scientists: Building LLMs for price prediction and risk analysis.
- Financial Enthusiasts: Experimenting with algorithmic trading strategies.
This dataset is tailored for Bitcoin trading and does not include unrelated financial instruments. The provided indicators were designed for educational and research purposes but align with real-world trading practices for improved predictive accuracy. Dive into the data and start creating your trading bot today!
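The dataset's own custom indicators are not enumerated above, but as an illustration of the kind of derived metric involved, a standard moving-average crossover signal could be computed like this (file and column names are hypothetical):

```python
import pandas as pd

# Hypothetical file and column names; the dataset's custom indicators
# are not reproduced here.
btc = pd.read_csv("btc_prices.csv", parse_dates=["date"]).set_index("date")

# A classic derived metric: 50/200-day moving-average crossover ("golden cross").
btc["ma_50"] = btc["close"].rolling(50).mean()
btc["ma_200"] = btc["close"].rolling(200).mean()
btc["golden_cross"] = (btc["ma_50"] > btc["ma_200"]).astype(int)
```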