100+ datasets found
  1. llm-training-dataset

    • huggingface.co
    Cite
    UniData, llm-training-dataset [Dataset]. https://huggingface.co/datasets/UniDataPro/llm-training-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    UniData
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    LLM Fine-Tuning Dataset - 4,000,000+ logs, 32 languages

    The dataset contains over 4 million logs written in 32 languages and is tailored for LLM training. It includes log and response pairs from 3 models and is designed for language-model and instruction fine-tuning to achieve improved performance in various NLP tasks.

      Models used for text generation:

    • GPT-3.5
    • GPT-4
    • Uncensored GPT version (not included in the sample)

      Languages in the… See the full description on the dataset page: https://huggingface.co/datasets/UniDataPro/llm-training-dataset.
    
  2. LLM: 7 prompt training dataset

    • kaggle.com
    Updated Nov 15, 2023
    Cite
    Carl McBride Ellis (2023). LLM: 7 prompt training dataset [Dataset]. https://www.kaggle.com/datasets/carlmcbrideellis/llm-7-prompt-training-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 15, 2023
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Carl McBride Ellis
    License

    CDLA-Sharing-1.0: https://cdla.io/sharing-1-0/

    Description
    • Version 4: Adds the data from "LLM-generated essay using PaLM from Google Gen-AI", kindly generated by Kingki19 / Muhammad Rizqi.
      File: train_essays_RDizzl3_seven_v2.csv
      Human texts: 14,247; LLM texts: 3,004

      See also: a new dataset of an additional 4,900 LLM-generated texts: LLM: Mistral-7B Instruct texts



    • Version 3: "The RDizzl3 Seven"
      File: train_essays_RDizzl3_seven_v1.csv

    • "Car-free cities"

    • "Does the electoral college work?"

    • "Exploring Venus"

    • "The Face on Mars"

    • "Facial action coding system"

    • "A Cowboy Who Rode the Waves"

    • "Driverless cars"

    How this dataset was made: see the notebook "LLM: Make 7 prompt train dataset"

    • Version 2: (train_essays_7_prompts_v2.csv) This dataset is composed of 13,712 human texts and 1,638 AI-LLM-generated texts originating from 7 of the PERSUADE 2.0 corpus prompts.

    Namely:

    • "Car-free cities"
    • "Does the electoral college work?"
    • "Exploring Venus"
    • "The Face on Mars"
    • "Facial action coding system"
    • "Seeking multiple opinions"
    • "Phones and driving"

    This dataset is a derivative of several earlier datasets, as well as the original competition training dataset.

    • Version 1: This dataset is composed of 13,712 human texts and 1,165 AI-LLM-generated texts originating from 7 of the PERSUADE 2.0 corpus prompts.
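
    As a quick sanity check, the v2 file can be loaded with pandas. This is a minimal sketch; the column names used below ("text", "label") are assumptions for illustration, not confirmed by the dataset page:

    import pandas as pd

    # Path assumed to point at the downloaded Kaggle file
    df = pd.read_csv("train_essays_RDizzl3_seven_v2.csv")

    # Expect 14,247 human texts and 3,004 LLM texts in total
    print(df.shape)

    # Class balance (the column name 'label' is an assumption)
    print(df["label"].value_counts())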
  3. Bitext-travel-llm-chatbot-training-dataset

    • huggingface.co
    Updated Jun 21, 2025
    + more versions
    Cite
    Bitext (2025). Bitext-travel-llm-chatbot-training-dataset [Dataset]. https://huggingface.co/datasets/bitext/Bitext-travel-llm-chatbot-training-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 21, 2025
    Dataset authored and provided by
    Bitext
    License

    CDLA-Sharing-1.0: https://choosealicense.com/licenses/cdla-sharing-1.0/

    Description

    Bitext - Travel Tagged Training Dataset for LLM-based Virtual Assistants

      Overview
    

    This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [Travel] sector can be easily achieved using our two-step approach to LLM Fine-Tuning. An overview of… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-travel-llm-chatbot-training-dataset.
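
    As an illustration, the dataset can be pulled straight from the Hugging Face Hub for inspection before fine-tuning. This is a minimal sketch; check the dataset card for the exact field names:

    from datasets import load_dataset

    # Load the travel chatbot dataset from the Hugging Face Hub
    ds = load_dataset("bitext/Bitext-travel-llm-chatbot-training-dataset")

    # Print one record to see the instruction/response fields before fine-tuning
    print(ds["train"][0])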

  4. llm-bootcamp-train-samples

    • huggingface.co
    Updated Jul 5, 2024
    + more versions
    Cite
    Stanislav Sandler (2024). llm-bootcamp-train-samples [Dataset]. https://huggingface.co/datasets/stas1k/llm-bootcamp-train-samples
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 5, 2024
    Authors
    Stanislav Sandler
    Description

    The stas1k/llm-bootcamp-train-samples dataset, hosted on Hugging Face and contributed by the HF Datasets community.

  5. Data from: Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph...

    • zenodo.org
    zip
    Updated May 23, 2023
    Cite
    Nandana Mihindukulasooriya; Sanju Tiwari; Carlos F. Enguix; Kusum Lata (2023). Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph Generation from Text [Dataset]. http://doi.org/10.5281/zenodo.7916716
    Explore at:
    Available download formats: zip
    Dataset updated
    May 23, 2023
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Nandana Mihindukulasooriya; Sanju Tiwari; Carlos F. Enguix; Kusum Lata
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the repository for the ISWC 2023 Resource Track submission for Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph Generation from Text. Text2KGBench is a benchmark to evaluate the capabilities of language models to generate KGs from natural language text guided by an ontology. Given an input ontology and a set of sentences, the task is to extract facts from the text while complying with the given ontology (concepts, relations, domain/range constraints) and being faithful to the input sentences.

    It contains two datasets (i) Wikidata-TekGen with 10 ontologies and 13,474 sentences and (ii) DBpedia-WebNLG with 19 ontologies and 4,860 sentences.

    An example

    An example test sentence:

    Test Sentence:
    {"id": "ont_music_test_n", "sent": "\"The Loco-Motion\" is a 1962 pop song written by 
    American songwriters Gerry Goffin and Carole King."}
    

    An example of ontology:

    Ontology: Music Ontology

    Expected Output:

    {
     "id": "ont_k_music_test_n", 
     "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King.", 
     "triples": [
     {
      "sub": "The Loco-Motion", 
      "rel": "publication date",
      "obj": "01 January 1962"
     },{
      "sub": "The Loco-Motion",
      "rel": "lyrics by",
      "obj": "Gerry Goffin"
     },{
      "sub": "The Loco-Motion", 
      "rel": "lyrics by", 
      "obj": "Carole King"
     }]
    }
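
    Given gold and system outputs in this format, triple-level precision and recall can be computed in a few lines of Python. The sketch below assumes triples are compared as exact (sub, rel, obj) string matches; it is not the benchmark's official evaluation script:

    def triple_set(record):
        # Treat each triple as an exact (sub, rel, obj) string tuple
        return {(t["sub"], t["rel"], t["obj"]) for t in record["triples"]}

    def precision_recall(gold, system):
        gold_t, sys_t = triple_set(gold), triple_set(system)
        correct = len(gold_t & sys_t)
        precision = correct / len(sys_t) if sys_t else 0.0
        recall = correct / len(gold_t) if gold_t else 0.0
        return precision, recall

    # Self-check: the expected output above, scored against itself
    record = {
        "id": "ont_k_music_test_n",
        "triples": [
            {"sub": "The Loco-Motion", "rel": "publication date", "obj": "01 January 1962"},
            {"sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Gerry Goffin"},
            {"sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Carole King"},
        ],
    }
    print(precision_recall(record, record))  # (1.0, 1.0)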
    

    The data is released under a Creative Commons Attribution 4.0 International (CC BY 4.0) License.

    The structure of the repo is as follows.

    This benchmark contains data derived from the TekGen corpus (part of the KELM corpus) [1] released under CC BY-SA 2.0 license and WebNLG 3.0 corpus [2] released under CC BY-NC-SA 4.0 license.

    [1] Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3554–3565, Online. Association for Computational Linguistics.

    [2] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating Training Corpora for NLG Micro-Planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 179–188, Vancouver, Canada. Association for Computational Linguistics.

  6. Large Language Model (LLM) Training Data | 236 Countries | AI-Enhanced...

    • datarade.ai
    Updated Apr 15, 2025
    Cite
    Silencio Network (2025). Large Language Model (LLM) Training Data | 236 Countries | AI-Enhanced Ground Truth Based | 10M+ Hours of Measurements | 100% Traceable Consent [Dataset]. https://datarade.ai/data-products/large-language-model-llm-training-data-236-countries-ai-silencio-network
    Explore at:
    Available download formats: .json, .xml, .csv, .xls
    Dataset updated
    Apr 15, 2025
    Dataset provided by
    Quickkonnect UG
    Authors
    Silencio Network
    Area covered
    Hungary, Sri Lanka, Libya, Guernsey, Puerto Rico, Taiwan, Oman, United Arab Emirates, Saint Kitts and Nevis, Serbia
    Description

    Silencio’s interpolation dataset delivers spatially continuous noise data combining:

    • 10M+ hours of real dBA measurements
    • AI-generated interpolations

    Applications:

    • AI-based acoustic mapping
    • Digital twin and simulation models
    • Ground-truth data for AI validation

    Delivered via CSV or S3. GDPR-compliant.

  7. Data Sheet 1_Large language models generating synthetic clinical datasets: a...

    • frontiersin.figshare.com
    xlsx
    Updated Feb 5, 2025
    Cite
    Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin (2025). Data Sheet 1_Large language models generating synthetic clinical datasets: a feasibility and comparative analysis with real-world perioperative data.xlsx [Dataset]. http://doi.org/10.3389/frai.2025.1533508.s001
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Feb 5, 2025
    Dataset provided by
    Frontiers
    Authors
    Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Clinical data is instrumental to medical research, machine learning (ML) model development, and advancing surgical care, but access is often constrained by privacy regulations and missing data. Synthetic data offers a promising solution to preserve privacy while enabling broader data access. Recent advances in large language models (LLMs) provide an opportunity to generate synthetic data with reduced reliance on domain expertise, computational resources, and pre-training.

    Objective: This study aims to assess the feasibility of generating realistic tabular clinical data with OpenAI’s GPT-4o using zero-shot prompting, and to evaluate the fidelity of LLM-generated data by comparing its statistical properties to the Vital Signs DataBase (VitalDB), a real-world open-source perioperative dataset.

    Methods: In Phase 1, GPT-4o was prompted to generate a dataset from qualitative descriptions of 13 clinical parameters. The resultant data was assessed for general errors, plausibility of outputs, and cross-verification of related parameters. In Phase 2, GPT-4o was prompted to generate a dataset using descriptive statistics of the VitalDB dataset. Fidelity was assessed using two-sample t-tests, two-sample proportion tests, and 95% confidence interval (CI) overlap.

    Results: In Phase 1, GPT-4o generated a complete and structured dataset comprising 6,166 case files. The dataset was plausible in range and correctly calculated body mass index for all case files based on respective heights and weights. Statistical comparison between the LLM-generated datasets and VitalDB revealed that Phase 2 data achieved significant fidelity. Phase 2 data demonstrated statistical similarity in 12/13 (92.31%) parameters: no statistically significant differences were observed in 6/6 (100.0%) categorical/binary and 6/7 (85.71%) continuous parameters. Overlap of 95% CIs was observed in 6/7 (85.71%) continuous parameters.

    Conclusion: Zero-shot prompting with GPT-4o can generate realistic tabular synthetic datasets that replicate key statistical properties of real-world perioperative data. This study highlights the potential of LLMs as a novel and accessible modality for synthetic data generation, which may address critical barriers in clinical data access and eliminate the need for technical expertise, extensive computational resources, and pre-training. Further research is warranted to enhance fidelity and investigate the use of LLMs to amplify and augment datasets, preserve multivariate relationships, and train robust ML models.
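
    As an illustration of the fidelity checks described above (not the authors' actual analysis code), a two-sample t-test and a 95% CI overlap check for one continuous parameter might look like this, with stand-in NumPy arrays for the real and synthetic values:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    real = rng.normal(loc=36.8, scale=0.4, size=500)       # stand-in for a real-world parameter
    synthetic = rng.normal(loc=36.9, scale=0.4, size=500)  # stand-in for LLM-generated values

    # Two-sample t-test for a continuous parameter
    t_stat, p_value = stats.ttest_ind(real, synthetic, equal_var=False)
    print(f"t = {t_stat:.3f}, p = {p_value:.3f}")

    # 95% confidence intervals for the means, and their overlap
    def mean_ci(x, confidence=0.95):
        margin = stats.sem(x) * stats.t.ppf((1 + confidence) / 2, len(x) - 1)
        return x.mean() - margin, x.mean() + margin

    real_ci, synth_ci = mean_ci(real), mean_ci(synthetic)
    overlap = real_ci[0] <= synth_ci[1] and synth_ci[0] <= real_ci[1]
    print("CI overlap:", overlap)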

  8. Foundation Model Data Collection and Data Annotation | Large Language...

    • datarade.ai
    Updated Jan 25, 2024
    Cite
    Nexdata (2024). Foundation Model Data Collection and Data Annotation | Large Language Model(LLM) Data | SFT Data| Red Teaming Services [Dataset]. https://datarade.ai/data-products/nexdata-foundation-model-data-solutions-llm-sft-rhlf-nexdata
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Jan 25, 2024
    Dataset authored and provided by
    Nexdata
    Area covered
    Czech Republic, Portugal, Ireland, Azerbaijan, El Salvador, Kyrgyzstan, Spain, Russian Federation, Taiwan, Maldives
    Description
    1. Overview

    Unsupervised Learning: For the training data required in unsupervised learning, Nexdata delivers data collection and cleaning services for both single-modal and cross-modal data. We provide Large Language Model (LLM) data cleaning and personnel support services based on the specific data types and characteristics of the client's domain.

    -SFT: Nexdata assists clients in generating high-quality supervised fine-tuning data for model optimization through prompt and output annotation.

    -Red teaming: Nexdata helps clients train and validate models by drafting various adversarial attacks, such as exploratory or potentially harmful questions. Our red team capabilities help clients identify problems in their models related to hallucinations, harmful content, false information, discrimination, and language bias.

    -RLHF: Nexdata assists clients in manually ranking multiple outputs generated by the SFT-trained model according to the rules provided by the client, or provides multi-factor scoring. By training annotators to align with values and utilizing a multi-person fitting approach, the quality of feedback can be improved.

    2. Our Capacity

    -Global Resources: Global resources covering hundreds of languages worldwide

    -Compliance: All Large Language Model (LLM) data is collected with proper authorization

    -Quality: Multiple rounds of quality inspections ensure high-quality data output

    -Secure Implementation: An NDA is signed to guarantee secure implementation, and data is destroyed upon delivery.

    -Efficiency: Our platform supports human-machine interaction and semi-automatic labeling, increasing labeling efficiency by more than 30% per annotator. It has successfully been applied to nearly 5,000 projects.

    3. About Nexdata

    Nexdata is equipped with professional data collection devices, tools, and environments, as well as experienced project managers in data collection and quality control, so that we can meet Large Language Model (LLM) data collection requirements in various scenarios and types. We have global data processing centers and more than 20,000 professional annotators, supporting on-demand Large Language Model (LLM) data annotation services such as speech, image, video, point cloud, and Natural Language Processing (NLP) data. Please visit us at https://www.nexdata.ai/?source=Datarade

  9. Chain-of-Thought collection

    • kaggle.com
    Updated Jun 19, 2023
    Cite
    Konrad Banachewicz (2023). Chain-of-Thought collection [Dataset]. http://identifiers.org/arxiv:2305.14045
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 19, 2023
    Dataset provided by
    Kaggle
    Authors
    Konrad Banachewicz
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Dataset accompanying the paper "The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning", including 1.88M CoT rationales extracted across 1,060 tasks: https://arxiv.org/abs/2305.14045

    From the release repo https://github.com/kaistAI/CoT-Collection: Large Language Models (LLMs) have shown enhanced capabilities of solving novel tasks by reasoning step-by-step, known as Chain-of-Thought (CoT) reasoning. How can we instill the same capability of reasoning step-by-step on unseen tasks into LMs that possess fewer than 100B parameters? To address this question, we first introduce the CoT Collection, a new instruction-tuning dataset that augments 1.88 million CoT rationales across 1,060 tasks. We show that continually fine-tuning Flan-T5 (3B & 11B) with the CoT Collection enables the 3B & 11B LMs to perform CoT better on unseen tasks, leading to an improvement in the average zero-shot accuracy on 27 datasets of the BIG-Bench-Hard benchmark by +4.34% and +2.44%, respectively. Furthermore, we show that instruction tuning with CoT allows LMs to possess stronger few-shot learning capabilities, resulting in an improvement of +2.97% and +2.37% on 4 domain-specific tasks over Flan-T5 (3B & 11B), respectively.
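
    To make the idea concrete, instruction-tuning data in this style pairs a task input with a rationale that leads into the final answer. Below is a minimal formatting sketch; the field names and connector phrase are illustrative assumptions, not the dataset's actual schema:

    def format_cot_example(source, rationale, target):
        # Train the model to produce the rationale before the final answer,
        # in the spirit of CoT fine-tuning
        prompt = source.strip()
        completion = f"{rationale.strip()} So the answer is {target.strip()}."
        return {"prompt": prompt, "completion": completion}

    example = format_cot_example(
        source="Q: If a train travels 60 km in 1.5 hours, what is its average speed?",
        rationale="Average speed is distance divided by time: 60 / 1.5 = 40 km/h.",
        target="40 km/h",
    )
    print(example["completion"])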

  10. NewsMediaBias-Plus Dataset

    • zenodo.org
    • huggingface.co
    bin, zip
    Updated Nov 29, 2024
    Cite
    Shaina Raza (2024). NewsMediaBias-Plus Dataset [Dataset]. http://doi.org/10.5281/zenodo.13961155
    Explore at:
    Available download formats: bin, zip
    Dataset updated
    Nov 29, 2024
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Shaina Raza
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    NewsMediaBias-Plus Dataset

    Overview

    The NewsMediaBias-Plus dataset is designed for the analysis of media bias and disinformation by combining textual and visual data from news articles. It aims to support research in detecting, categorizing, and understanding biased reporting in media outlets.

    Dataset Description

    NewsMediaBias-Plus pairs news articles with relevant images and annotations indicating perceived biases and the reliability of the content. It adds a multimodal dimension for bias detection in news media.

    Contents

    • unique_id: Unique identifier for each news item. Each unique_id matches an image for the same article.
    • outlet: The publisher of the article.
    • headline: The headline of the article.
    • article_text: The full content of the news article.
    • image_description: Description of the paired image.
    • image: The file path of the associated image.
    • date_published: The date the article was published.
    • source_url: The original URL of the article.
    • canonical_link: The canonical URL of the article.
    • new_categories: Categories assigned to the article.
    • news_categories_confidence_scores: Confidence scores for each category.

    Annotation Labels

    • text_label: Indicates the likelihood of the article being disinformation:

      • Likely: Likely to be disinformation.
      • Unlikely: Unlikely to be disinformation.
    • multimodal_label: Indicates the likelihood of disinformation from the combination of the text snippet and image content:

      • Likely: Likely to be disinformation.
      • Unlikely: Unlikely to be disinformation.

    Getting Started

    Prerequisites

    • Python 3.6+
    • Pandas
    • Hugging Face Datasets
    • Hugging Face Hub

    Installation

    Load the dataset into Python:

    from datasets import load_dataset

    ds = load_dataset("vector-institute/newsmediabias-plus")
    print(ds)               # View structure and splits
    print(ds['train'][0])   # Access the first record of the train split
    print(ds['train'][:5])  # Access the first five records

    Load a Few Records

    from datasets import load_dataset

    # Load the dataset in streaming mode
    streamed_dataset = load_dataset("vector-institute/newsmediabias-plus", streaming=True)

    # Get an iterable over the first five records
    dataset_iterable = streamed_dataset['train'].take(5)

    # Print the records
    for record in dataset_iterable:
        print(record)

    Contributions

    Contributions are welcome! You can:

    • Add Data: Contribute more data points.
    • Refine Annotations: Improve annotation accuracy.
    • Share Usage Examples: Help others use the dataset effectively.

    To contribute, fork the repository and create a pull request with your changes.

    License

    This dataset is released under a non-commercial license. See the LICENSE file for more details.

    Citation

    Please cite the dataset using this BibTeX entry:

    @misc{vector_institute_2024_newsmediabias_plus,
      title={NewsMediaBias-Plus: A Multimodal Dataset for Analyzing Media Bias},
      author={Vector Institute Research Team},
      year={2024},
      url={https://huggingface.co/datasets/vector-institute/newsmediabias-plus}
    }

    Contact

    For questions or support, contact Shaina Raza at: shaina.raza@vectorinstitute.ai

    Disclaimer and User Guidance

    Disclaimer: The labels Likely and Unlikely are based on LLM annotations and expert assessments, intended for informational use only. They should not be considered final judgments.

    Guidance: This dataset is for research purposes. Cross-reference findings with other reliable sources before drawing conclusions. The dataset aims to encourage critical thinking, not provide definitive classifications.

  11. GPTFuzzer Dataset

    • paperswithcode.com
    Updated Mar 17, 2024
    Cite
    Jiahao Yu; Xingwei Lin; Zheng Yu; Xinyu Xing (2025). GPTFuzzer Dataset [Dataset]. https://paperswithcode.com/dataset/gptfuzzer
    Explore at:
    Dataset updated
    Mar 17, 2024
    Authors
    Jiahao Yu; Xingwei Lin; Zheng Yu; Xinyu Xing
    Description

    GPTFuzzer is a project that explores red teaming of large language models (LLMs) using auto-generated jailbreak prompts.

    Project Overview: GPTFuzzer aims to assess the security and robustness of LLMs by crafting prompts that can potentially lead to harmful or unintended behavior.

    The project focuses on GPT-3 and similar models.

    Datasets:

    The datasets used in GPTFuzzer include:

    Harmful Questions: Sampled from public datasets like llm-jailbreak-study and hh-rlhf.
    Human-Written Templates: Collected from llm-jailbreak-study.
    Responses: Gathered by querying models like Vicuna-7B, ChatGPT, and Llama-2-7B-chat.

    Models:

    The judgment model is a finetuned RoBERTa-large model. The training code and data are available in the repository.

    During fuzzing experiments, the model is automatically downloaded and cached.

    Updates:

    The project has received recognition and awards at conferences like Geekcon 2023. The team continues to improve the codebase and aims to build a general black-box fuzzing framework for LLMs.

    Sources: GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts (https://github.com/sherdencooper/GPTFuzz); paper: https://arxiv.org/pdf/2309.10253.pdf

  12. LLM Question-Answer Dataset

    • opendatabay.com
    Updated Jun 18, 2025
    + more versions
    Cite
    Datasimple (2025). LLM Question-Answer Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/0ccec8f4-3216-4689-9f6e-b4d01e271bdf
    Explore at:
    Available download formats: not specified
    Dataset updated
    Jun 18, 2025
    Dataset authored and provided by
    Datasimple
    Area covered
    Education & Learning Analytics
    Description

    LLM Dataset - Prompts and Generated Texts

    The dataset contains prompts and texts generated by Large Language Models (LLMs) in 32 different languages. The prompts are short sentences or phrases for the model to generate text from. The texts generated by the LLM are responses to these prompts and can vary in length and complexity.

    Researchers and developers can use this dataset to train and fine-tune their own language models for multilingual applications. The dataset provides a rich and diverse collection of outputs from the model, demonstrating its ability to generate coherent and contextually relevant text in multiple languages.

    💴 For Commercial Usage: The full version of the dataset includes 4,000,000 logs generated in 32 languages with different types of LLM, including Uncensored GPT. Leave a request on TrainingData to buy the dataset.

    Models used for text generation: GPT-3.5, GPT-4

    Languages in the dataset: Arabic, Azerbaijani, Catalan, Chinese, Czech, Danish, German, Greek, English, Esperanto, Spanish, Persian, Finnish, French, Irish, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Malayalam, Marathi, Dutch, Polish, Portuguese, Portuguese (Brazil), Slovak, Swedish, Thai, Turkish, Ukrainian

    Content: the CSV file includes the following fields:

    • from_language: language the prompt is made in
    • model: type of the model (GPT-3.5, GPT-4, or Uncensored GPT Version)
    • time: time when the answer was generated
    • text: user prompt
    • response: response generated by the model

    💴 Buy the Dataset: This is just an example of the data. Leave a request on https://trainingdata.pro/datasets to discuss your requirements, learn about the price, and buy the dataset. TrainingData provides high-quality data annotation tailored to your needs.

    Keywords: dataset, machine learning, natural language processing, artificial intelligence, deep learning, neural networks, text generation, language models, openai, gpt-3, data science, predictive modeling, sentiment analysis, keyword extraction, text classification, sequence-to-sequence models, attention mechanisms, transformer architecture, word embeddings, glove embeddings, chatbots, question answering, language understanding, text mining, information retrieval, data preprocessing, feature engineering, explainable ai, model deployment
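
    For example, a quick structural pass over the data could use pandas. This is a minimal sketch; the file name llm_logs.csv is hypothetical, and the column names follow the field list above:

    import pandas as pd

    # File name is hypothetical; columns follow the description above
    df = pd.read_csv("llm_logs.csv")

    # Distribution of prompts across languages and models
    print(df.groupby(["from_language", "model"]).size())

    # Average response length (in characters) per model
    print(df["response"].str.len().groupby(df["model"]).mean())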

    License

    CC-BY-NC

    Original Data Source: LLM Question-Answer Dataset

  13. Bitext-retail-ecommerce-llm-chatbot-training-dataset

    • huggingface.co
    Updated Aug 6, 2024
    + more versions
    Cite
    Bitext (2024). Bitext-retail-ecommerce-llm-chatbot-training-dataset [Dataset]. https://huggingface.co/datasets/bitext/Bitext-retail-ecommerce-llm-chatbot-training-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 6, 2024
    Dataset authored and provided by
    Bitext
    License

    CDLA-Sharing-1.0: https://choosealicense.com/licenses/cdla-sharing-1.0/

    Description

    Bitext - Retail (eCommerce) Tagged Training Dataset for LLM-based Virtual Assistants

      Overview
    

    This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [Retail (eCommerce)] sector can be easily achieved using our two-step approach to LLM… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-retail-ecommerce-llm-chatbot-training-dataset.

  14. Large Language Model (LLM) Data | Machine Learning (ML) Data | AI Training...

    • datarade.ai
    Updated Jan 23, 2025
    + more versions
    Cite
    MealMe (2025). Large Language Model (LLM) Data | Machine Learning (ML) Data | AI Training Data (RAG) for 1M+ Global Grocery, Restaurant, and Retail Stores [Dataset]. https://datarade.ai/data-products/ai-training-data-rag-for-grocery-restaurant-and-retail-ra-mealme
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Jan 23, 2025
    Dataset authored and provided by
    MealMe
    Area covered
    Norfolk Island, Christmas Island, Trinidad and Tobago, Romania, Iceland, Saint Lucia, Andorra, Uruguay, Kosovo, Korea (Republic of)
    Description

    A comprehensive dataset covering over 1 million stores in the US and Canada, designed for training and optimizing retrieval-augmented generation (RAG) models and other AI/ML systems. This dataset includes highly detailed, structured information such as:

    Menus: Restaurant menus with item descriptions, categories, and modifiers.

    Inventory: Grocery and retail product availability, SKUs, and detailed attributes like sizes, flavors, and variations.

    Pricing: Real-time and historical pricing data for dynamic pricing strategies and recommendations.

    Availability: Real-time stock status and fulfillment details for grocery, restaurant, and retail items.

    Applications:

    Retrieval-Augmented Generation (RAG): Train AI models to retrieve and generate contextually relevant information.

    Search Optimization: Build advanced, accurate search and recommendation engines.

    Personalization: Enable personalized shopping, ordering, and discovery experiences in apps.

    Data-Driven Insights: Develop AI systems for pricing analysis, consumer behavior studies, and logistics optimization.

    This dataset empowers businesses in marketplaces, grocery apps, delivery services, and retail platforms to scale their AI solutions with precision and reliability.

  15. TagX Data collection for AI/ML training | LLM data | Data collection for AI...

    • datarade.ai
    .json, .csv, .xls
    Updated Jun 18, 2021
    Cite
    TagX (2021). TagX Data collection for AI/ ML training | LLM data | Data collection for AI development & model finetuning | Text, image, audio, and document data [Dataset]. https://datarade.ai/data-products/data-collection-and-capture-services-tagx
    Explore at:
    Available download formats: .json, .csv, .xls
    Dataset updated
    Jun 18, 2021
    Dataset authored and provided by
    TagX
    Area covered
    Colombia, Benin, Djibouti, Saudi Arabia, Antigua and Barbuda, Qatar, Iceland, Russian Federation, Equatorial Guinea, Belize
    Description

    We offer comprehensive data collection services that cater to a wide range of industries and applications. Whether you require image, audio, or text data, we have the expertise and resources to collect and deliver high-quality data that meets your specific requirements. Our data collection methods include manual collection, web scraping, and other automated techniques that ensure accuracy and completeness of data.

    Our team of experienced data collectors and quality assurance professionals ensure that the data is collected and processed according to the highest standards of quality. We also take great care to ensure that the data we collect is relevant and applicable to your use case. This means that you can rely on us to provide you with clean and useful data that can be used to train machine learning models, improve business processes, or conduct research.

    We are committed to delivering data in the format that you require. Whether you need raw data or a processed dataset, we can deliver the data in your preferred format, including CSV, JSON, or XML. We understand that every project is unique, and we work closely with our clients to ensure that we deliver the data that meets their specific needs. So if you need reliable data collection services for your next project, look no further than us.

  16. GSM8K Dataset

    • paperswithcode.com
    • tensorflow.org
    • +2 more
    Updated Dec 31, 2024
    Cite
    Karl Cobbe; Vineet Kosaraju; Mohammad Bavarian; Mark Chen; Heewoo Jun; Lukasz Kaiser; Matthias Plappert; Jerry Tworek; Jacob Hilton; Reiichiro Nakano; Christopher Hesse; John Schulman (2024). GSM8K Dataset [Dataset]. https://paperswithcode.com/dataset/gsm8k
    Explore at:
    Dataset updated
    Dec 31, 2024
    Authors
    Karl Cobbe; Vineet Kosaraju; Mohammad Bavarian; Mark Chen; Heewoo Jun; Lukasz Kaiser; Matthias Plappert; Jerry Tworek; Jacob Hilton; Reiichiro Nakano; Christopher Hesse; John Schulman
    Description

    GSM8K is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems created by human problem writers. The dataset is segmented into 7.5K training problems and 1K test problems. These problems take between 2 and 8 steps to solve, and solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+, −, ×, ÷) to reach the final answer. A bright middle school student should be able to solve every problem. It can be used for multi-step mathematical reasoning.
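
    The dataset is also available on the Hugging Face Hub, and reference solutions end with a final line of the form "#### <answer>", which makes exact-match scoring straightforward. A minimal loading-and-parsing sketch:

    from datasets import load_dataset

    # The 'main' config holds the 7.5K train / 1K test split described above
    ds = load_dataset("gsm8k", "main")

    def final_answer(solution: str) -> str:
        # Reference solutions end with a line like '#### 42'
        return solution.split("####")[-1].strip()

    sample = ds["test"][0]
    print(sample["question"])
    print("Gold answer:", final_answer(sample["answer"]))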

  17. cncf-question-and-answer-dataset-for-llm-training

    • huggingface.co
    Updated Nov 29, 2020
    Cite
    Kubermatic (2020). cncf-question-and-answer-dataset-for-llm-training [Dataset]. https://huggingface.co/datasets/Kubermatic/cncf-question-and-answer-dataset-for-llm-training
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 29, 2020
    Dataset authored and provided by
    Kubermatic
    Description

    CNCF QA Dataset for LLM Tuning

      Description
    

    This dataset, named cncf-qa-dataset-for-llm-tuning, is designed for fine-tuning large language models (LLMs) and is formatted in a question-answer (QA) style. The data is sourced from PDF and markdown (MD) files extracted from various project repositories within the CNCF (Cloud Native Computing Foundation) landscape. These files were processed and converted into a QA format to be fed to the LLM. The dataset includes the… See the full description on the dataset page: https://huggingface.co/datasets/Kubermatic/cncf-question-and-answer-dataset-for-llm-training.
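
    Loading the dataset for inspection before fine-tuning might look like the following minimal sketch; the split name and field layout should be verified against the dataset card:

    from datasets import load_dataset

    ds = load_dataset("Kubermatic/cncf-question-and-answer-dataset-for-llm-training")

    # Inspect the schema and a sample record before building a tuning pipeline
    print(ds)
    print(ds["train"][0])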

  18. Speech Brown Dataset

    • paperswithcode.com
    Updated Dec 16, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Mohammad Mahdi Abootorabi; Ehsaneddin Asgari (2024). Speech Brown Dataset [Dataset]. https://paperswithcode.com/dataset/speechbrown
    Explore at:
    Dataset updated
    Dec 16, 2024
    Authors
    Mohammad Mahdi Abootorabi; Ehsaneddin Asgari
    Description

    Dataset Summary

    Speech Brown is a comprehensive, synthetic, and diverse paired speech-text dataset in 15 categories, covering a wide range of topics from fiction to religion. This dataset consists of over 55,000 sentence-level samples.

    To train the CLASP model, we created this dataset based on the Brown Corpus. The synthetic speech was generated using the NVIDIA Tacotron 2 text-to-speech model.

    For more information about our proposed model, please refer to this paper. The dataset generation pipeline, along with code and usage instructions, is available on this GitHub page.

    Dataset Statistics

    Total size: Approximately 30 GB.
    Number of samples: 55,173 pairs of speech and text.
    Average tokens per sample: 19.00.
    Maximum tokens in a sample: 48.
    Average characters per sample: 96.72.
    Number of unique tokens: 50,667.
    Categories: 15 categories consisting of adventure, belles_lettres, editorial, fiction, government, hobbies, humor, learned, lore, mystery, news, religion, reviews, romance, science_fiction.

    Dataset Structure

    To ensure ease of use, the dataset is partitioned into 10 parts. Each part can be used independently if it meets the requirements of your task and model.

    Metadata Files

    global_metadata: A JSON file containing metadata for all 55,173 samples.
    localized_metadata: A JSON file containing metadata for all samples, categorized into the 10 dataset partitions.

    Metadata Fields

    id: The unique identifier for the sample.
    audio_file_path: The file path for the audio in the dataset.
    category: The category of the sample's text.
    text: The corresponding text of the audio file.

    Usage Instructions

    To use this dataset, download the parts and metadata files as follows:

    Option 1: Manual Download Visit the dataset repository and download all dataset_partX.zip files and the global_metadata.json file.

    Option 2: Programmatic Download Use the huggingface_hub library to download the files programmatically:

    from huggingface_hub import hf_hub_download
    from zipfile import ZipFile
    import os
    import json
    
    # Download dataset parts
    zip_file_path1 = hf_hub_download(repo_id="llm-lab/SpeechBrown", filename="dataset_part1.zip", repo_type="dataset")
    zip_file_path2 = hf_hub_download(repo_id="llm-lab/SpeechBrown", filename="dataset_part2.zip", repo_type="dataset")
    
    # Download other parts...
    # Download metadata
    metadata_file_path = hf_hub_download(repo_id="llm-lab/SpeechBrown", filename="global_metadata.json", repo_type="dataset")
    
    for i in range(1, 11):
      with ZipFile(f'dataset_part{i}.zip', 'r') as zip_ref:
        zip_ref.extractall(f'dataset_part{i}')
      os.remove(f'dataset_part{i}.zip')
    
    with open('global_metadata.json', 'r') as f:
      metadata = json.load(f)
    metadata.keys()
    

    Citations

    If you find our paper, code, data, or models useful, please cite the paper:

    @misc{abootorabi2024claspcontrastivelanguagespeechpretraining,
      title={CLASP: Contrastive Language-Speech Pretraining for Multilingual Multimodal Information Retrieval},
      author={Mohammad Mahdi Abootorabi and Ehsaneddin Asgari},
      year={2024},
      eprint={2412.13071},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.13071},
    }

    Contact

    If you have questions, please email mahdi.abootorabi2@gmail.com or asgari@berkeley.edu.

  19. Synthetic Consumer Behaviour Dataset

    • opendatabay.com
    Updated May 6, 2025
    Cite
    Opendatabay Labs (2025). Synthetic Consumer Behaviour Dataset [Dataset]. https://www.opendatabay.com/data/synthetic/ad9e2ab7-7559-4c89-af01-7d9df45b4255
    Explore at:
    Available download formats: not specified
    Dataset updated
    May 6, 2025
    Dataset provided by
    Buy & Sell Data | Opendatabay - AI & Synthetic Data Marketplace
    Authors
    Opendatabay Labs
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Retail & Consumer Behavior
    Description

    This synthetic customer purchase dataset has been created as an educational resource for data science, machine learning, and retail analytics applications. The data focuses on key consumer purchase behaviours, including demographic information, product details, purchase history, and payment methods. It is designed to help users practice data manipulation, analysis, and predictive modelling in the context of retail and e-commerce.

    Dataset Features:

    • Customer ID: Unique identifier for each customer.
    • Age: Age of the customer (in years).
    • Gender: Gender of the customer (e.g., "Male," "Female").
    • Item Purchased: Item that was purchased (e.g., "Blouse," "Sandals").
    • Category: Category of the item purchased (e.g., "Accessories," "Clothing").
    • Purchase Amount (USD): The amount spent on the purchase (in USD).
    • Location: Geographical location of the customer (e.g., "Wyoming," "Hawaii").
    • Size: Size of the purchased item (e.g., "M," "S," "L").
    • Color: Color of the purchased item (e.g., "Red," "White").
    • Season: Season during which the item was purchased (e.g., "Winter," "Summer").
    • Review Rating: Rating given by the customer to the purchased item (on a scale from 1 to 5).
    • Subscription Status: Whether the customer is subscribed to a loyalty program or subscription service (e.g., "Yes," "No").
    • Shipping Type: Shipping method used for the purchase (e.g., "Free Shipping," "Standard").
    • Discount Applied: Whether a discount was applied to the purchase (e.g., "Yes," "No").
    • Promo Code Used: Whether a promotional code was used during the purchase (e.g., "Yes," "No").
    • Previous Purchases: Number of previous purchases made by the customer.
    • Payment Method: Method of payment used (e.g., "Bank Transfer," "PayPal," "Venmo").
    • Frequency of Purchases: How often the customer makes purchases (e.g., "Annually," "Bi-Weekly," "Monthly").


    Usage:

    This dataset is useful for a variety of applications, including:

    • Customer Behavior Analysis: To explore trends in customer demographics, purchase behaviours, and preferences.
    • Retail Analytics: To understand how different factors (like season, location, and payment method) influence purchasing decisions.
    • Predictive Modeling: To develop models that predict customer behaviours such as purchase frequency or subscription status.
    • Marketing Strategy: To analyze the effectiveness of promotions, discounts, and shipping methods in driving purchases.
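
    For example, a first exploratory pass over the file might look like this in pandas (a minimal sketch; the file name consumer_behaviour.csv is hypothetical, and column names follow the feature list above):

    import pandas as pd

    # File name is hypothetical; columns follow the feature list above
    df = pd.read_csv("consumer_behaviour.csv")

    # Average spend by category and season
    print(df.groupby(["Category", "Season"])["Purchase Amount (USD)"].mean())

    # Share of purchases where a discount was applied
    print((df["Discount Applied"] == "Yes").mean())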

    Coverage:

    This dataset is synthetic and anonymized, making it a safe tool for experimentation and learning without compromising any real customer data.

    License:

    CC0 (Public Domain)

    Who can use it:

    • Data science enthusiasts: For learning and practising retail data analysis, customer segmentation, and predictive modelling.
    • Researchers and educators: For academic studies or teaching purposes in retail analytics and consumer behaviour.
    • Marketing professionals: For analyzing purchasing patterns and designing targeted promotional campaigns.

  20. Energy consumption when training LLMs in 2022 (in MWh)

    • statista.com
    Updated Jun 30, 2025
    + more versions
    Cite
    Statista (2025). Energy consumption when training LLMs in 2022 (in MWh) [Dataset]. https://www.statista.com/statistics/1384401/energy-use-when-training-llm-models/
    Explore at:
    Dataset updated
    Jun 30, 2025
    Dataset authored and provided by
    Statista: http://statista.com/
    Time period covered
    2022
    Area covered
    Worldwide
    Description

    Energy consumption of artificial intelligence (AI) models in training is considerable, with both GPT-3, the original release of the current iteration of OpenAI's popular ChatGPT, and Gopher consuming well over ********** megawatt hours of energy simply for training. As this covers only training, the energy consumption for the entire usage and lifetime of GPT-3 and other large language models (LLMs) is likely significantly higher. The largest consumer of energy, GPT-3, consumed roughly the equivalent of *** Germans in 2022. While not a staggering amount, it is a considerable use of energy.

    Energy savings through AI

    While it is undoubtedly true that training LLMs takes a considerable amount of energy, the energy savings are also likely to be substantial. Any AI model that improves processes by minute amounts might save hours on shipment, liters of fuel, or dozens of computations. Each of these uses energy as well, and the sum of energy saved through an LLM might vastly outperform its energy cost. A good example is mobile phone operators, of which a ***** expect that AI might reduce power consumption by *** to ******* percent. Considering that much of the world uses mobile phones, this would be a considerable energy saver.

    Emissions are considerable

    The amount of CO2 emissions from training LLMs is also considerable, with GPT-3 producing nearly *** tonnes of CO2. This again could be radically changed based on the types of energy production creating the emissions. Most data center operators, for instance, would prefer to have nuclear energy play a key role, as it is a significantly low-emission energy producer.
