100+ datasets found

h
clinical-synthetic-text-llm
huggingface.co
Updated Jul 5, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ran Xu (2024). clinical-synthetic-text-llm [Dataset]. https://huggingface.co/datasets/ritaranx/clinical-synthetic-text-llm
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 5, 2024
Authors
Ran Xu
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Data Description

We release the synthetic data generated using the method described in the paper Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data Generation with Large Language Models (ACL 2024 Findings). The external knowledge we use is based on LLM-generated topics and writing styles.

Generated Datasets

The original train/validation/test data, and the generated synthetic training data are listed as follows. For each dataset, we generate 5000… See the full description on the dataset page: https://huggingface.co/datasets/ritaranx/clinical-synthetic-text-llm.
LLM Prompt Recovery - Synthetic Datastore
kaggle.com
zip
Updated Feb 29, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Darien Schettler (2024). LLM Prompt Recovery - Synthetic Datastore [Dataset]. https://www.kaggle.com/datasets/dschettler8845/llm-prompt-recovery-synthetic-datastore
Explore at:
zip(988448 bytes)Available download formats
Dataset updated
Feb 29, 2024
Authors
Darien Schettler
License
https://www.licenses.ai/ai-licenseshttps://www.licenses.ai/ai-licenses
Description
High Level Description

This dataset uses Gemma 7B-IT to generate synthetic dataset for the LLM Prompt Recovery competition.

Contributors

Please go upvote these other datasets as my work is not possible without them

thedrcat's dataset - LLM Prompt Recovery Data

TBD

First Dataset - 1000 Examples From @thedrcat

Update 1 - February 29, 2024

The only file presently found in this dataset is gemma1000_7b.csv which uses the dataset created by @thedrcat found here: https://www.kaggle.com/datasets/thedrcat/llm-prompt-recovery-data?select=gemma1000.csv

The file below is the file Darek created with two additional columns appended. The first is the output of Gemma 7B-IT (raw based on the instructions below)(vs. 2B-IT that Darek used) and the second is the output with the 'Sure... blah blah

' sentence removed.

I generated things using the following setup:

# I used a vLLM server to host Gemma 7B on paperspace (A100) # Step 1 - Install vLLM >>> pip install vllm # Step 2 - Authenticate HuggingFace CLI (for model weights) >>> huggingface-cli login --token
LLM - Detect AI Datamix
kaggle.com
zip
Updated Jan 19, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Raja Biswas (2024). LLM - Detect AI Datamix [Dataset]. https://www.kaggle.com/datasets/conjuring92/ai-mix-v26
Explore at:
zip(172818297 bytes)Available download formats
Dataset updated
Jan 19, 2024
Authors
Raja Biswas
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
This is the datamix created by Team 🔍 📝 🕵️‍♂️ 🤖 during the LLM - Detect AI Generated Text competition. This dataset helped us to win the competition. It facilitates a text-classification task to separate LLM generate essays from the student written ones.

It was developed in an incremental way focusing on size, diversity and complexity. For each datamix iteration, we attempted to plug blindspots of the previous generation models while maintaining robustness.

To maximally leverage in-domain human texts, we used the entire Persuade corpus comprising all 15 prompts. We also included diverse human texts from sources such as OpenAI GPT2 output dataset, ELLIPSE corpus, NarrativeQA, wikipedia, NLTK Brown corpus and IMDB movie reviews.

Sources for our generated essays can be grouped under four categories: - Proprietary LLMs (gpt-3.5, gpt-4, claude, cohere, gemini, palm) - Open source LLMs (llama, falcon, mistral, mixtral) - Existing LLM generated text datasets - Synthetic dataset made by T5 - DAIGT V2 subset - OUTFOX - Ghostbuster - gpt-2-output-dataset

Fine-tuned open-source LLMs (mistral, llama, falcon, deci-lm, t5, pythia, OPT, BLOOM, GPT2). For LLM fine-tuning, we leveraged the PERSUADE corpus in different ways:

Instruction tuning: Instructions were composed of different metadata e.g. prompt name, holistic essay score, ELL status and grade level. Responses were the corresponding student essays.

One topic held out: LLMs fine-tuned on PERSUADE essays with one prompt held out. When generating, only the held out prompt essays were generated. This was done to encourage new writing styles.

Span wise generation: Generate one span (discourse) at a time conditioned on the remaining essay.

We used a wide variety of generation configs and prompting strategies to promote diversity & complexity to the data. Generated essays leveraged a combination of the following: - Contrastive search - Use of Guidance scale, typical_p, suppress_tokens - High temperature & large values of top-k - Prompting to fill-in-the-blank: randomly mask words in an essay and asking LLM to reconstruct the original essay (similar to MLM) - Prompting without source texts - Prompting with source texts - Prompting to rewrite existing essays

Finally, we incorporated augmented essays to make our models aware of typical attacks on LLM content detection systems and obfuscations present in the provided training data. We mainly used a combination of the following augmentations on a random subset of essays: - Spelling correction - Deletion/insertion/swapping of characters - Replacement with synonym - Introduce obfuscations - Back translation - Random capitalization - Swap sentence
f
Data Sheet 2_Large language models generating synthetic clinical datasets: a...
frontiersin.figshare.com
figshare.com
xlsx
Updated Feb 5, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin (2025). Data Sheet 2_Large language models generating synthetic clinical datasets: a feasibility and comparative analysis with real-world perioperative data.xlsx [Dataset]. http://doi.org/10.3389/frai.2025.1533508.s002
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.3389/frai.2025.1533508.s002
Dataset updated
Feb 5, 2025
Dataset provided by
Frontiers
Authors
Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
BackgroundClinical data is instrumental to medical research, machine learning (ML) model development, and advancing surgical care, but access is often constrained by privacy regulations and missing data. Synthetic data offers a promising solution to preserve privacy while enabling broader data access. Recent advances in large language models (LLMs) provide an opportunity to generate synthetic data with reduced reliance on domain expertise, computational resources, and pre-training.ObjectiveThis study aims to assess the feasibility of generating realistic tabular clinical data with OpenAI’s GPT-4o using zero-shot prompting, and evaluate the fidelity of LLM-generated data by comparing its statistical properties to the Vital Signs DataBase (VitalDB), a real-world open-source perioperative dataset.MethodsIn Phase 1, GPT-4o was prompted to generate a dataset with qualitative descriptions of 13 clinical parameters. The resultant data was assessed for general errors, plausibility of outputs, and cross-verification of related parameters. In Phase 2, GPT-4o was prompted to generate a dataset using descriptive statistics of the VitalDB dataset. Fidelity was assessed using two-sample t-tests, two-sample proportion tests, and 95% confidence interval (CI) overlap.ResultsIn Phase 1, GPT-4o generated a complete and structured dataset comprising 6,166 case files. The dataset was plausible in range and correctly calculated body mass index for all case files based on respective heights and weights. Statistical comparison between the LLM-generated datasets and VitalDB revealed that Phase 2 data achieved significant fidelity. Phase 2 data demonstrated statistical similarity in 12/13 (92.31%) parameters, whereby no statistically significant differences were observed in 6/6 (100.0%) categorical/binary and 6/7 (85.71%) continuous parameters. Overlap of 95% CIs were observed in 6/7 (85.71%) continuous parameters.ConclusionZero-shot prompting with GPT-4o can generate realistic tabular synthetic datasets, which can replicate key statistical properties of real-world perioperative data. This study highlights the potential of LLMs as a novel and accessible modality for synthetic data generation, which may address critical barriers in clinical data access and eliminate the need for technical expertise, extensive computational resources, and pre-training. Further research is warranted to enhance fidelity and investigate the use of LLMs to amplify and augment datasets, preserve multivariate relationships, and train robust ML models.
LLM Question-Answer Dataset
kaggle.com
zip
Updated Mar 6, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Unique Data (2024). LLM Question-Answer Dataset [Dataset]. https://www.kaggle.com/datasets/trainingdatapro/llm-dataset/code
Explore at:
zip(543652 bytes)Available download formats
Dataset updated
Mar 6, 2024
Authors
Unique Data
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Description
LLM Dataset - Prompts and Generated Texts

The dataset contains prompts and texts generated by the Large Language Models (LLMs) in 32 different languages. The prompts are short sentences or phrases for the model to generate text. The texts generated by the LLM are responses to these prompts and can vary in length and complexity.

Researchers and developers can use this dataset to train and fine-tune their own language models for multilingual applications. The dataset provides a rich and diverse collection of outputs from the model, demonstrating its ability to generate coherent and contextually relevant text in multiple languages.

👉 Legally sourced datasets and carefully structured for AI training and model development. Explore samples from our dataset - Full dataset

Models used for text generation:

GPT-3.5,

GPT-4

Languages in the dataset:

Arabic, Azerbaijani, Catalan, Chinese, Czech, Danish, German, Greek, English, Esperanto, Spanish, Persian, Finnish, French, Irish, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Malayalam, Maratham, Netherlands, Polish, Portuguese, Portuguese (Brazil), Slovak, Swedish, Thai, Turkish, Ukrainian

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12421376%2Ff60c93f09ec82a765aa39678e4aa0a58%2Fsnapedit_1709731090855.jpeg?generation=1709738798916444&alt=media" alt="">

🧩 This is just an example of the data. Leave a request here to learn more

Content

CSV File includes the following data: - from_language: language the prompt is made in, - model: type of the model (GPT-3.5, GPT-4 and Uncensored GPT Version), - time: time when the answer was generated, - text: user prompt, - response: response generated by the model

🚀 You can learn more about our high-quality unique datasets here

keywords: dataset, machine learning, natural language processing, artificial intelligence, deep learning, neural networks, text generation, language models, openai, gpt-3, data science, predictive modeling, sentiment analysis, keyword extraction, text classification, sequence-to-sequence models, attention mechanisms, transformer architecture, word embeddings, glove embeddings, chatbots, question answering, language understanding, text mining, information retrieval, data preprocessing, feature engineering, explainable ai, model deployment
h
uk_retail_store_synthetic_dataset
huggingface.co
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Syncora.ai - Agentic Synthetic Data Platform, uk_retail_store_synthetic_dataset [Dataset]. https://huggingface.co/datasets/syncora/uk_retail_store_synthetic_dataset
Explore at:
Authors
Syncora.ai - Agentic Synthetic Data Platform
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Area covered
United Kingdom
Description
Synthetic Data Generation Demo — UK Retail Dataset

Welcome to this synthetic data generation demo repository by Syncora.ai. This project showcases how to generate synthetic data using real-world tabular structures, demonstrated on a UK retail dataset with columns such as:

Country
CustomerID
UnitPrice
InvoiceDate
Quantity
StockCode

This dataset is designed for dataset for LLM training and AI development, enabling developers to work with privacy-safe, high-quality… See the full description on the dataset page: https://huggingface.co/datasets/syncora/uk_retail_store_synthetic_dataset.
LLM: 7 prompt training dataset
kaggle.com
zip
Updated Nov 15, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Carl McBride Ellis (2023). LLM: 7 prompt training dataset [Dataset]. https://www.kaggle.com/datasets/carlmcbrideellis/llm-7-prompt-training-dataset/code
Explore at:
zip(43404345 bytes)Available download formats
Dataset updated
Nov 15, 2023
Authors
Carl McBride Ellis
License
https://cdla.io/sharing-1-0/https://cdla.io/sharing-1-0/
Description
Version 4: Adding the data from "LLM-generated essay using PaLM from Google Gen-AI" kindly generated by Kingki19 / Muhammad Rizqi.
File: train_essays_RDizzl3_seven_v2.csv
Human texts: 14247 LLM texts: 3004

See also: a new dataset of an additional 4900 LLM generated texts: LLM: Mistral-7B Instruct texts

Version 3: "**The RDizzl3 Seven**"
File: train_essays_RDizzl3_seven_v1.csv

"Car-free cities"

"Does the electoral college work?"

"Exploring Venus"

"The Face on Mars"

"Facial action coding system"

"A Cowboy Who Rode the Waves"

"Driverless cars"

How this dataset was made: see the notebook "LLM: Make 7 prompt train dataset"

Version 2: (train_essays_7_prompts_v2.csv) This dataset is composed of 13,712 human texts and 1638 AI-LLM generated texts originating from 7 of the PERSUADE 2.0 corpus prompts.

Namely:

"Car-free cities"

"Does the electoral college work?"

"Exploring Venus"

"The Face on Mars"

"Facial action coding system"

"Seeking multiple opinions"

"Phones and driving"

This dataset is a derivative of the datasets

LLM Generated Essays for the Detect AI Comp! by Radek Osmulski

persuade corpus 2.0 provided by Nicholas Broad

daigt data - llama 70b and falcon180b by Nicholas Broad

Hello, Claude! 1000 essays from Anthropic... by Darragh

as well as the original competition training dataset

Version 1:This dataset is composed of 13,712 human texts and 1165 AI-LLM generated texts originating from 7 of the PERSUADE 2.0 corpus prompts.
D
Data Lineage For LLM Training Market Research Report 2033
dataintelo.com
csv, pdf, pptx
Updated Sep 30, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dataintelo (2025). Data Lineage For LLM Training Market Research Report 2033 [Dataset]. https://dataintelo.com/report/data-lineage-for-llm-training-market
Explore at:
pdf, pptx, csvAvailable download formats
Dataset updated
Sep 30, 2025
Dataset authored and provided by
Dataintelo
License
https://dataintelo.com/privacy-and-policyhttps://dataintelo.com/privacy-and-policy
Time period covered
2024 - 2032
Area covered
Global
Description
Data Lineage for LLM Training Market Outlook

According to our latest research, the global Data Lineage for LLM Training market size reached USD 1.29 billion in 2024, with an impressive compound annual growth rate (CAGR) of 21.8% expected through the forecast period. By 2033, the market is projected to grow to USD 8.93 billion, as organizations worldwide recognize the critical importance of robust data lineage solutions in ensuring transparency, compliance, and efficiency in large language model (LLM) training. The primary growth driver stems from the surging adoption of generative AI and LLMs across diverse industries, necessitating advanced data lineage capabilities for responsible and auditable AI development.

The exponential growth of the Data Lineage for LLM Training market is fundamentally driven by the increasing complexity and scale of data used in training modern AI models. As organizations deploy LLMs for a wide array of applications—from customer service automation to advanced analytics—the need for precise tracking of data provenance, transformation, and usage has become paramount. This trend is further amplified by the proliferation of multi-source and multi-format data, which significantly complicates the process of tracing data origins and transformations. Enterprises are investing heavily in data lineage solutions to ensure that their AI models are trained on high-quality, compliant, and auditable datasets, thereby reducing risks associated with data bias, inconsistency, and regulatory violations.

Another significant growth factor is the evolving regulatory landscape surrounding AI and data governance. Governments and regulatory bodies worldwide are introducing stringent guidelines for data usage, privacy, and accountability in AI systems. Regulations such as the European Union’s AI Act and the U.S. AI Bill of Rights are compelling organizations to implement comprehensive data lineage practices to demonstrate compliance and mitigate legal risks. This regulatory pressure is particularly pronounced in highly regulated industries such as banking, healthcare, and government, where the consequences of non-compliance can be financially and reputationally devastating. As a result, the demand for advanced data lineage software and services is surging, driving market expansion.

Technological advancements in data management platforms and the integration of AI-driven automation are further catalyzing the growth of the Data Lineage for LLM Training market. Modern data lineage tools now leverage machine learning and natural language processing to automatically map data flows, detect anomalies, and generate real-time lineage reports. These innovations drastically reduce the manual effort required for lineage documentation and enhance the scalability of lineage solutions across large and complex data environments. The continuous evolution of such technologies is enabling organizations to achieve higher levels of transparency, trust, and operational efficiency in their AI workflows, thereby fueling market growth.

Regionally, North America dominates the Data Lineage for LLM Training market, accounting for over 42% of the global market share in 2024. This dominance is attributed to the early adoption of AI technologies, the presence of leading technology vendors, and a mature regulatory environment. Europe follows closely, driven by strict data governance regulations and a rapidly growing AI ecosystem. The Asia Pacific region is witnessing the fastest growth, with a projected CAGR of 24.6% through 2033, fueled by digital transformation initiatives, increased AI investments, and a burgeoning startup landscape. Latin America and the Middle East & Africa are also emerging as promising markets, albeit at a relatively nascent stage.

Component Analysis

The Data Lineage for LLM Training market is segmented by component into software and services, each playing a pivotal role in supporting organizations’ lineage initiatives. The software segment holds the largest market share, accounting for nearly 68% of the total market revenue in 2024. This dominance is primarily due to the widespread adoption of advanced data lineage platforms that offer features such as automated lineage mapping, visualization, impact analysis, and integration with existing data management and AI training workflows. These platforms are essential for organ
F
Finnish Open Ended Question Answer Text Dataset
futurebeeai.com
wav
Updated Aug 1, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). Finnish Open Ended Question Answer Text Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/finnish-open-ended-question-answer-text-dataset
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
The Finnish Open-Ended Question Answering Dataset is a meticulously curated collection of comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and Question-answering models in the Finnish language, advancing the field of artificial intelligence.
Dataset Content:
This QA dataset comprises a diverse set of open-ended questions paired with corresponding answers in Finnish. There is no context paragraph given to choose an answer from, and each question is answered without any predefined context content. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.
Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Finnish people, and references were taken from diverse sources like books, news articles, websites, and other reliable references.
This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Question Diversity:
To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. Additionally, questions are further classified into fact-based and opinion-based categories, creating a comprehensive variety. The QA dataset also contains the question with constraints and persona restrictions, which makes it even more useful for LLM training.
Answer Formats:
To accommodate varied learning experiences, the dataset incorporates different types of answer formats. These formats include single-word, short phrases, single sentences, and paragraph types of answers. The answer contains text strings, numerical values, date and time formats as well. Such diversity strengthens the Language model's ability to generate coherent and contextually appropriate answers.
Data Format and Annotation Details:
This fully labeled Finnish Open Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as id, language, domain, question_length, prompt_type, question_category, question_type, complexity, answer_type, rich_text.
Quality and Accuracy:
The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.
Both the question and answers in Finnish are grammatically accurate without any word or grammatical errors. No copyrighted, toxic, or harmful content is used while building this dataset.
Continuous Updates and Customization:
The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.
License:
The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Finnish Open Ended Question Answer Dataset to enhance the language understanding capabilities of their generative ai models, improve response generation, and explore new approaches to NLP question-answering tasks.
h
Bitext-travel-llm-chatbot-training-dataset
huggingface.co
Updated Jun 21, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bitext (2025). Bitext-travel-llm-chatbot-training-dataset [Dataset]. https://huggingface.co/datasets/bitext/Bitext-travel-llm-chatbot-training-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 21, 2025
Dataset authored and provided by
Bitext
License
https://choosealicense.com/licenses/cdla-sharing-1.0/https://choosealicense.com/licenses/cdla-sharing-1.0/
Description
Bitext - Travel Tagged Training Dataset for LLM-based Virtual Assistants

Overview

This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [Travel] sector can be easily achieved using our two-step approach to LLM Fine-Tuning. An overview of… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-travel-llm-chatbot-training-dataset.
h
investopedia-instruction-tuning-dataset
huggingface.co
Updated Jul 26, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FinLang (2023). investopedia-instruction-tuning-dataset [Dataset]. https://huggingface.co/datasets/FinLang/investopedia-instruction-tuning-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 26, 2023
Dataset authored and provided by
FinLang
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Dataset Card for investopedia-instruction-tuning dataset

We curate a dataset of substantial size pertaining to finance from Investopedia using a new technique that leverages unstructured scraping data and LLM to generate structured data that is suitable for fine-tuning embedding models. The dataset generation uses a new method of self-verification that ensures that the generated question-answer pairs and not hallucinated by the LLM with high probability.

Dataset… See the full description on the dataset page: https://huggingface.co/datasets/FinLang/investopedia-instruction-tuning-dataset.
h
Verbalized-Sampling-Synthetic-Data-Generation
huggingface.co
Updated Oct 31, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
CHATS-Lab (2025). Verbalized-Sampling-Synthetic-Data-Generation [Dataset]. https://huggingface.co/datasets/CHATS-Lab/Verbalized-Sampling-Synthetic-Data-Generation
Explore at:
Dataset updated
Oct 31, 2025
Dataset authored and provided by
CHATS-Lab
Description
Verbalized-Sampling-Synthetic-Data-Generation

This dataset showcases how Verbalized Sampling (VS) can be used to generate high-quality, diverse synthetic training data for mathematical reasoning tasks. From the paper Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity.

Dataset Description

The Synthetic Data Generation dataset contains mathematical problem-solution pairs generated by different methods using state-of-the-art LLMs. This dataset… See the full description on the dataset page: https://huggingface.co/datasets/CHATS-Lab/Verbalized-Sampling-Synthetic-Data-Generation.
Dataset for From Literature to Lab LLM powered catalyst synthesis Protocol...
zenodo.org
Updated Sep 19, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yue Zhang; Yue Zhang (2025). Dataset for From Literature to Lab LLM powered catalyst synthesis Protocol Generation [Dataset]. http://doi.org/10.5281/zenodo.16948950
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.16948950
Dataset updated
Sep 19, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Yue Zhang; Yue Zhang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset for "From Literature to Lab: LLM powered catalyst synthesis Protocol Generation"

Code are hosted at Github: https://github.com/nsndimt/ChemSynthesis

Due to potential copyright issue, please email zhangyue@udel.edu or hfang@udel.edu to access the human annotated dataset.

Human Annotated Dataset

Files: stage1/stage2 + train/eval/input JSON

Dataset split suffix meaning:

input: all 250 paragraphs

train: 200 paragrahs used for training

eval: 50 paragraphs used for validation

Manual Evaluation Logs

Files: stage1/stage2_manual_evaluation.csv

ID meaning: paragraphs with their ID from 1 to 50 are in the same order as the eval dataset split

LLM Checkpoint and Prediction

Prediction Files: stage1/stage2_pred.tar.gz

Checkpoint File: llamafactory_checkpoint.tar

Copy of all LLM checkpoints and predictions used in the paper, give a reference for reproduction

Note: all stage 2 prediction are generated by end-to-end testing, meaning we first use Stage 1 LLM to generate predictions and then sent these to Stage 2 LLM

Misc

File: llamafactory_dataset.tar.gz

Compressed LLaMA-Factory Dataset files to help people runing Github Codes
d
Customer Service Call Dataset [Multisector] – Annotated support transcripts...
datarade.ai
Updated Apr 11, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
WiserBrand.com (2025). Customer Service Call Dataset [Multisector] – Annotated support transcripts for training AI and improving CX [Dataset]. https://datarade.ai/data-products/customer-service-call-dataset-multisector-annotated-suppo-wiserbrand-com
Explore at:
.json, .csv, .xls, .txtAvailable download formats
Dataset updated
Apr 11, 2025
Dataset provided by
WiserBrand
Area covered
United States of America
Description
"This dataset contains transcribed customer support calls from companies in over 160 industries, offering a high-quality foundation for developing customer-aware AI systems and improving service operations. It captures how real people express concerns, frustrations, and requests — and how support teams respond.

Included in each record:

Full call transcription with labeled speakers (system, agent, customer)

Concise human-written summary of the conversation

Sentiment tag for the overall interaction: positive, neutral, or negative

Company name, duration, and geographic location of the caller

Call context includes industries such as eCommerce, banking, telecom, and streaming services

Common use cases:

Train NLP models to understand support calls and detect churn risk

Power complaint detection engines for customer success and support teams

Create high-quality LLM training sets with real support narratives

Build summarization and topic tagging pipelines for CX dashboards

Analyze tone shifts and resolution language in customer-agent interaction

This dataset is structured, high-signal, and ready for use in AI pipelines, CX design, and quality assurance systems. It brings full transparency to what actually happens during customer service moments — from routine fixes to emotional escalations."

The more you purchase, the lower the price will be.
F
Japanese Closed Ended Question Answer Text Dataset
futurebeeai.com
wav
Updated Aug 1, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). Japanese Closed Ended Question Answer Text Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/japanese-closed-ended-question-answer-text-dataset
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
The Japanese Closed-Ended Question Answering Dataset is a meticulously curated collection of 5000 comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and question-answering models in the Japanese language, advancing the field of artificial intelligence.
Dataset Content
This closed-ended QA dataset comprises a diverse set of context paragraphs and questions paired with corresponding answers in Japanese. There is a context paragraph given for each question to get the answer from. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.
Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Japanese people, and references were taken from diverse sources like books, news articles, websites, web forums, and other reliable references.
This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Question Diversity
To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. The QA dataset also contains questions with constraints, which makes it even more useful for LLM training.
Answer Formats
To accommodate varied learning experiences, the dataset incorporates different types of answer formats. These formats include single-word, short phrases, single sentences, and paragraphs types of answers. The answers contain text strings, numerical values, date and time formats as well. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.
Data Format and Annotation Details
This fully labeled Japanese Closed-Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as a unique id, context paragraph, context reference link, question, question type, question complexity, question category, domain, prompt type, answer, answer type, and rich text presence.
Quality and Accuracy
The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.
The Japanese versions is grammatically accurate without any spelling or grammatical errors. No toxic or harmful content is used while building this dataset.
Continuous Updates and Customization
The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.
License:
The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Japanese Closed-Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.

SVG Code Generation Sample Training Data

kaggle.com

zip

Updated May 3, 2025

Facebook

Twitter

Click to copy link

Link copied

Cite

Vinothkumar Sekar (2025). SVG Code Generation Sample Training Data [Dataset]. https://www.kaggle.com/datasets/vinothkumarsekar89/svg-generation-sample-training-data

Explore at:

zip(193477 bytes)Available download formats

Dataset updated

May 3, 2025

Authors

Vinothkumar Sekar

License

Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically

Description

This training data was generated using GPT-4o as part of the 'Drawing with LLM' competition (https://www.kaggle.com/competitions/drawing-with-llms). It can be used to fine-tune small language models for the competition or serve as an augmentation dataset alongside other data sources.

The dataset is generated in two steps using the GPT-4o model. - In the first step, topic descriptions relevant to the competition are generated using a specific prompt. By running this prompt multiple times, over 3,000 descriptions were collected.

 
prompt=f""" I am participating in an SVG code generation competition.
  
   The competition involves generating SVG images based on short textual descriptions of everyday objects and scenes, spanning a wide range of categories. The key guidelines are as follows:
  
   - Descriptions are generic and do not contain brand names, trademarks, or personal names.
   - No descriptions include people, even in generic terms.
   - Descriptions are concise—each is no more than 200 characters, with an average length of about 50 characters.
   - Categories cover various domains, with some overlap between public and private test sets.
  
   To train a small LLM model, I am preparing a synthetic dataset. Could you generate 100 unique topics aligned with the competition style?
  
   Requirements:
   - Each topic should range between **20 and 200 characters**, with an **average around 60 characters**.
   - Ensure **diversity and creativity** across topics.
   - **50% of the topics** should come from the categories of **landscapes**, **abstract art**, and **fashion**.
   - Avoid duplication or overly similar phrasing.
  
   Example topics:
                 a purple forest at dusk, gray wool coat with a faux fur collar, a lighthouse overlooking the ocean, burgundy corduroy, pants with patch pockets and silver buttons, orange corduroy overalls, a purple silk scarf with tassel trim, a green lagoon under a cloudy sky, crimson rectangles forming a chaotic grid,  purple pyramids spiraling around a bronze cone, magenta trapezoids layered on a translucent silver sheet,  a snowy plain, black and white checkered pants,  a starlit night over snow-covered peaks, khaki triangles and azure crescents,  a maroon dodecahedron interwoven with teal threads.
  
   Please return the 100 topics in csv format.
   """

In the second step, SVG code is generated by prompting the GPT-4o model. The following prompt is used to query the model to generate svg.

 
  prompt = f"""
      Generate SVG code to visually represent the following text description, while respecting the given constraints.
      
      Allowed Elements: `svg`, `path`, `circle`, `rect`, `ellipse`, `line`, `polyline`, `polygon`, `g`, `linearGradient`, `radialGradient`, `stop`, `defs`
      Allowed Attributes: `viewBox`, `width`, `height`, `fill`, `stroke`, `stroke-width`, `d`, `cx`, `cy`, `r`, `x`, `y`, `rx`, `ry`, `x1`, `y1`, `x2`, `y2`, `points`, `transform`, `opacity`
      

      Please ensure that the generated SVG code is well-formed, valid, and strictly adheres to these constraints. 
      Focus on a clear and concise representation of the input description within the given limitations. 
      Always give the complete SVG code with nothing omitted. Never use an ellipsis.

      The code is scored based on similarity to the description, Visual question anwering and aesthetic components.
      Please generate a detailed svg code accordingly.

      input description: {text}
      """

The raw SVG output is then cleaned and sanitized using a competition-specific sanitization class. After that, the cleaned SVG is scored using the SigLIP model to evaluate text-to-SVG similarity. Only SVGs with a score above 0.5 are included in the dataset. On average, out of three SVG generations, only one meets the quality threshold after the cleaning, sanitization, and scoring process.

A dataset with ~50,000 samples for SVG code generation is publicly available at: https://huggingface.co/datasets/vinoku89/svg-code-generation

Z
Unlocking LLM Insights: A Dataset for Automatic Model Card Generation
data.niaid.nih.gov
Updated Jun 4, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Singh, Shruti (2024). Unlocking LLM Insights: A Dataset for Automatic Model Card Generation [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11466896
Explore at:
Dataset updated
Jun 4, 2024
Authors
Singh, Shruti
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Language models (LMs) are no longer restricted to the ML community, and instruction-following LMs have led to a rise in autonomous AI agents. As the accessibility of LMs grows, it is imperative that an understanding of their capabilities, intended usage, and development cycle also improves. Model cards are a widespread practice for documenting detailed information about an ML model. To automate model card generation, we introduce a dataset of 500 question-answer pairs for 25 LMs that cover crucial aspects of the model, such as its training configurations, datasets, biases, architecture details, and training resources. We employ annotators to extract the answers from the original paper. Further, we explore the capabilities of LMs in generating model cards by answering questions. We experiment with three configurations: zero-shot generation, retrieval-augmented generation, and fine-tuning on our dataset. The fine-tuned Llama 3 model shows an improvement of 7 points over the retrieval-augmented generation setup. This indicates that our dataset can be used to train models to automatically generate model cards from paper text and reduce the human effort in the model card curation process.
r
Synthetic datasets generated by Large Language Models
resodate.org
Updated May 27, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yanco Amor Torterolo Orta; Yanco Amor Torterolo Orta; Sofía Micaela Roseti; Sofía Micaela Roseti; Antonio Moreno-Sandoval; Antonio Moreno-Sandoval (2025). Synthetic datasets generated by Large Language Models [Dataset]. http://doi.org/10.21950/YXP8Q8
Explore at:
Unique identifier
https://doi.org/10.21950/YXP8Q8
Dataset updated
May 27, 2025
Dataset provided by
GRESEL-UAM: Narrativas Financieras y Literatura
Universidad Autónoma de Madrid
Eciencia Data
Authors
Yanco Amor Torterolo Orta; Yanco Amor Torterolo Orta; Sofía Micaela Roseti; Sofía Micaela Roseti; Antonio Moreno-Sandoval; Antonio Moreno-Sandoval
Description
This dataset is the result of the work done in the project GRESEL-UAM: About GRESEL: AI Generation Results Enriched with Simplified Explanations Based on Linguistic Features (Resultados de Generación de IA Enriquecidos con Explicaciones Simplificadas Basadas en Características Lingüísticas). This dataset is part of the publication titled "Assessing a Literary RAG System with a Human-Evaluated Synthetic QA Dataset Generated by an LLM: Experiments with Knowledge Graphs," which will be presented in September 2025 in Zaragoza, within the framework of the conference of the Sociedad Española para el Procesamiento del Lenguaje Natural (SEPLN). The work has already been accepted for publication in SEPLN’s official journal, Procesamiento del Lenguaje Natural. This dataset consists of three synthetically generated datasets, a process known as Synthetic Data Generation (SDG). We used three different LLMs: deepseek-r1:14b, llama3.1:8b-instruct-q8_0, and mistral:7b-instruct. Each was given a prompt instructing them to generate a question answering (QA) dataset based on context fragments from the novel Trafalgar by Benito Pérez Galdós. These datasets were later used to evaluate a Retrieval-Augmented Generation (RAG) system. Three CSV files are provided, each corresponding to the synthetic dataset generated by one of the models. In total, the dataset contains 359 items. The header includes the following fields: id, context, question, answer, and success. Fields are separated by tabs. The id column is simply an identifier number. The context column contains the text fragment from which the model generated the questions and answers. The question and answer fields contain the generated questions and answers, respectively. The success column indicates whether the model successfully generated the question and answer in the corresponding fields ("yes" or "no").
F
Malayalam Open Ended Classification Prompt & Response Dataset
futurebeeai.com
wav
Updated Aug 1, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). Malayalam Open Ended Classification Prompt & Response Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/malayalam-open-ended-classification-text-dataset
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Welcome to the Malayalam Open Ended Classification Prompt-Response Dataset, an extensive collection of 3000 meticulously curated prompt and response pairs. This dataset is a valuable resource for training Language Models (LMs) to classify input text accurately, a crucial aspect in advancing generative AI.
Dataset Content
This open-ended classification dataset comprises a diverse set of prompts and responses where the prompt contains input text to be classified and may also contain task instruction, context, constraints, and restrictions while completion contains the best classification category as response. Both these prompts and completions are available in Malayalam language. As this is an open-ended dataset, there will be no options given to choose the right classification category as a part of the prompt.
These prompt and completion pairs cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more. Each prompt is accompanied by a response, providing valuable information and insights to enhance the language model training process. Both the prompt and response were manually curated by native Malayalam people, and references were taken from diverse sources like books, news articles, websites, and other reliable references.
This open-ended classification prompt and completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains prompts and responses with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Prompt Diversity
To ensure diversity, this open-ended classification dataset includes prompts with varying complexity levels, ranging from easy to medium and hard. Different types of prompts, such as multiple-choice, direct, and true/false, are included. Additionally, prompts are diverse in terms of length from short to medium and long, creating a comprehensive variety. The classification dataset also contains prompts with constraints and persona restrictions, which makes it even more useful for LLM training.
Response Formats
To accommodate diverse learning experiences, our dataset incorporates different types of responses depending on the prompt. These formats include single-word, short phrase, and single sentence type of response. These responses encompass text strings, numerical values, and date and time formats, enhancing the language model's ability to generate reliable, coherent, and contextually appropriate answers.
Data Format and Annotation Details
This fully labeled Malayalam Open Ended Classification Prompt Completion Dataset is available in JSON and CSV formats. It includes annotation details such as a unique ID, prompt, prompt type, prompt length, prompt complexity, domain, response, response type, and rich text presence.
Quality and Accuracy
Our dataset upholds the highest standards of quality and accuracy. Each prompt undergoes meticulous validation, and the corresponding responses are thoroughly verified. We prioritize inclusivity, ensuring that the dataset incorporates prompts and completions representing diverse perspectives and writing styles, maintaining an unbiased and discrimination-free stance.
The Malayalam version is grammatically accurate without any spelling or grammatical errors. No copyrighted, toxic, or harmful content is used during the construction of this dataset.
Continuous Updates and Customization
The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Ongoing efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to gather custom open-ended classification prompt and completion data tailored to specific needs, providing flexibility and customization options.
License
The dataset, created by FutureBeeAI, is now available for commercial use. Researchers, data scientists, and developers can leverage this fully labeled and ready-to-deploy Malayalam Open Ended Classification Prompt-Completion Dataset to enhance the classification abilities and accurate response generation capabilities of their generative AI models and explore new approaches to NLP tasks.
Data from: Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph...
zenodo.org
data.niaid.nih.gov
zip
Updated May 23, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nandana Mihindukulasooriya; Nandana Mihindukulasooriya; Sanju Tiwari; Sanju Tiwari; Carlos F. Enguix; Carlos F. Enguix; Kusum Lata; Kusum Lata (2023). Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph Generation from Text [Dataset]. http://doi.org/10.5281/zenodo.7916716
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.7916716
Dataset updated
May 23, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Nandana Mihindukulasooriya; Nandana Mihindukulasooriya; Sanju Tiwari; Sanju Tiwari; Carlos F. Enguix; Carlos F. Enguix; Kusum Lata; Kusum Lata
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the repository for ISWC 2023 Resource Track submission for Text2KGBench: Benchmark for Ontology-Driven Knowledge Graph Generation from Text. Text2KGBench is a benchmark to evaluate the capabilities of language models to generate KGs from natural language text guided by an ontology. Given an input ontology and a set of sentences, the task is to extract facts from the text while complying with the given ontology (concepts, relations, domain/range constraints) and being faithful to the input sentences.

It contains two datasets (i) Wikidata-TekGen with 10 ontologies and 13,474 sentences and (ii) DBpedia-WebNLG with 19 ontologies and 4,860 sentences.

An example

An example test sentence:

Test Sentence: {"id": "ont_music_test_n", "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King."}

An example of ontology:

Ontology: Music Ontology

Expected Output:

{ "id": "ont_k_music_test_n", "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King.", "triples": [ { "sub": "The Loco-Motion", "rel": "publication date", "obj": "01 January 1962" },{ "sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Gerry Goffin" },{ "sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Carole King" },] }

The data is released under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY 4.0) License.

The structure of the repo is as the following.

Text2KGBench

src: the source code used for generation and evaluation, and baseline

benchmark the code used to generate the benchmark

evaluation evaluation scripts for calculating the results

baseline code for generating the baselines including prompts, sentence similarities, and LLM client.

data: the benchmark datasets and baseline data. There are two datasets: wikidata_tekgen and dbpedia_webnlg.

wikidata_tekgen Wikidata-TekGen Dataset

ontologies 10 ontologies used by this dataset

train training data

test test data

manually_verified_sentences ids of a subset of test cases manually validated

unseen_sentences new sentences that are added by the authors which are not part of Wikipedia

test unseen test unseen test sentences

ground_truth ground truth for unseen test sentences.

ground_truth ground truth for the test data

baselines data related to running the baselines.

test_train_sent_similarity for each test case, 5 most similar train sentences generated using SBERT T5-XXL model.

prompts prompts corresponding to each test file

unseen prompts unseen prompts for the unseen test cases

Alpaca-LoRA-13B data related to the Alpaca-LoRA model

llm_responses raw LLM responses and extracted triples

eval_metrics ontology-level and aggregated evaluation results

unseen results results for the unseen test cases

llm_responses raw LLM responses and extracted triples

eval_metrics ontology-level and aggregated evaluation results

Vicuna-13B data related to the Vicuna-13B model

llm_responses raw LLM responses and extracted triples

eval_metrics ontology-level and aggregated evaluation results

dbpedia_webnlg DBpedia Dataset

ontologies 19 ontologies used by this dataset

train training data

test test data

ground_truth ground truth for the test data

baselines data related to running the baselines.

test_train_sent_similarity for each test case, 5 most similar train sentences generated using SBERT T5-XXL model.

prompts prompts corresponding to each test file

Alpaca-LoRA-13B data related to the Alpaca-LoRA model

llm_responses raw LLM responses and extracted triples

eval_metrics ontology-level and aggregated evaluation results

Vicuna-13B data related to the Vicuna-13B model

llm_responses raw LLM responses and extracted triples

eval_metrics ontology-level and aggregated evaluation results

This benchmark contains data derived from the TekGen corpus (part of the KELM corpus) [1] released under CC BY-SA 2.0 license and WebNLG 3.0 corpus [2] released under CC BY-NC-SA 4.0 license.

[1] Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3554–3565, Online. Association for Computational Linguistics.

[2] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating Training Corpora for NLG Micro-Planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages

Facebook

Twitter

Click to copy link

Link copied

Cite

Ran Xu (2024). clinical-synthetic-text-llm [Dataset]. https://huggingface.co/datasets/ritaranx/clinical-synthetic-text-llm

clinical-synthetic-text-llm

ritaranx/clinical-synthetic-text-llm

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Jul 5, 2024

Authors

Ran Xu

License

MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically

Description

Data Description

We release the synthetic data generated using the method described in the paper Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data Generation with Large Language Models (ACL 2024 Findings). The external knowledge we use is based on LLM-generated topics and writing styles.

  Generated Datasets

The original train/validation/test data, and the generated synthetic training data are listed as follows. For each dataset, we generate 5000… See the full description on the dataset page: https://huggingface.co/datasets/ritaranx/clinical-synthetic-text-llm.

Clear search

Close search

Google apps

Main menu

clinical-synthetic-text-llm

LLM Prompt Recovery - Synthetic Datastore

High Level Description

Contributors

First Dataset - 1000 Examples From @thedrcat

LLM - Detect AI Datamix

Data Sheet 2_Large language models generating synthetic clinical datasets: a...

LLM Question-Answer Dataset

LLM Dataset - Prompts and Generated Texts

👉 Legally sourced datasets and carefully structured for AI training and model development. Explore samples from our dataset - Full dataset

Models used for text generation:

Languages in the dataset:

🧩 This is just an example of the data. Leave a request here to learn more

Content

uk_retail_store_synthetic_dataset

LLM: 7 prompt training dataset

Data Lineage For LLM Training Market Research Report 2033

Data Lineage for LLM Training Market Outlook

Component Analysis

Finnish Open Ended Question Answer Text Dataset

Dataset Content:

Question Diversity:

Answer Formats:

Data Format and Annotation Details:

Quality and Accuracy:

Continuous Updates and Customization:

License:

Bitext-travel-llm-chatbot-training-dataset

investopedia-instruction-tuning-dataset

Verbalized-Sampling-Synthetic-Data-Generation

Dataset for From Literature to Lab LLM powered catalyst synthesis Protocol...

Human Annotated Dataset

Manual Evaluation Logs

LLM Checkpoint and Prediction

Misc

Customer Service Call Dataset [Multisector] – Annotated support transcripts...

Japanese Closed Ended Question Answer Text Dataset

Dataset Content

Question Diversity

Answer Formats

Data Format and Annotation Details

Quality and Accuracy

Continuous Updates and Customization

License:

SVG Code Generation Sample Training Data

Unlocking LLM Insights: A Dataset for Automatic Model Card Generation

Synthetic datasets generated by Large Language Models

Malayalam Open Ended Classification Prompt & Response Dataset

Dataset Content

Prompt Diversity

Response Formats

Data Format and Annotation Details

Quality and Accuracy

Continuous Updates and Customization

License

Data from: Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph...

clinical-synthetic-text-llmSee More Versions

ritaranx/clinical-synthetic-text-llm

clinical-synthetic-text-llm