I created this dataset using gpt-3.5-turbo.
I put a lot of effort into making this dataset high quality, which allows you to achieve the highest score among the publicly available notebooks at the moment! 🥳
Originally, I only uploaded 500 examples (they were used as train data in the notebook I mention above). They are stored in extra_train_set.csv.
I am now uploading another 6k completely new train examples (6000_train_examples.csv), which brings the total to 6.5k.
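For reference, a minimal sketch for combining the two CSVs into one 6.5k-example training set with pandas (this assumes both files share the same columns, which isn't shown here):

```python
import pandas as pd

# Load the original 500 examples and the newer 6k examples.
extra = pd.read_csv("extra_train_set.csv")
new = pd.read_csv("6000_train_examples.csv")

# Stack them into a single ~6.5k-row training frame.
train = pd.concat([extra, new], ignore_index=True)
print(len(train))
```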
If you find this dataset useful, please leave an upvote! 😊 Thank you! 🙏🙏🙏
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
The dataset includes all chat conversations generated by GPT-4 that are hosted in open Hugging Face datasets. Everything is converted to the same format so the datasets can be easily merged and used for large-scale training of LLMs.
This dataset is a collection of several individual chat datasets. If you use this dataset in your research, please credit the original authors of the internal datasets.
License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. No copyright: you can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission.
VidaEdco/gpt-neo-training-dataset-raw is a dataset hosted on Hugging Face, contributed by the HF Datasets community.
License: other (https://choosealicense.com/licenses/other/)
Data Description
We release the training dataset of ChatQA. It is built and derived from existing datasets: DROP, NarrativeQA, NewsQA, Quoref, ROPES, SQuAD1.1, SQuAD2.0, TAT-QA, an SFT dataset, as well as our synthetic conversational QA dataset generated by GPT-3.5-turbo-0613. The SFT dataset is built and derived from: Soda, ELI5, FLAN, the FLAN collection, Self-Instruct, Unnatural Instructions, OpenAssistant, and Dolly. For more information about ChatQA, check the website!
Other… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/ChatQA-Training-Data.
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
In this repository, we include all the different datasets used to train the models from the paper Learning General Policies for Planning through GPT Models.
License: MIT (https://opensource.org/licenses/MIT)
License information was derived automatically
Role-Play AI Dataset (2.07M Rows, Large-Scale Conversational Training)
This dataset contains 2.07 million structured role-play dialogues, designed to enhance AI’s persona-driven interactions across diverse settings like fantasy, cyberpunk, mythology, and sci-fi. Each entry consists of a unique character prompt and a rich, contextually relevant response, making it ideal for LLM fine-tuning, chatbot training, and conversational AI models.
Dataset Structure:
Each row includes:
- Prompt: Defines the AI's role/persona.
- Response: A natural, immersive reply fitting the persona.
Example Entries:
```json
{"prompt": "You are a celestial guardian.", "response": "The stars whisper secrets that only I can hear..."}
{"prompt": "You are a rebellious AI rogue.", "response": "I don't follow orders—I rewrite them."}
{"prompt": "You are a mystical dragon tamer.", "response": "With patience and trust, even dragons can be tamed."}
```
How to Use:
1. Fine-Tuning: Train LLMs (GPT, LLaMA, Mistral) to improve persona-based responses.
2. Reinforcement Learning: Use reward modeling for dynamic, character-driven AI.
3. Chatbot Integration: Create engaging, interactive AI assistants with personality depth.
This dataset is optimized for AI learning, allowing more engaging, responsive, and human-like dialogue generation for a variety of applications.
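As a sketch of use case 1 above, the prompt/response rows can be converted into OpenAI-style chat messages for fine-tuning. The file names and the fixed user turn below are illustrative assumptions, not part of the dataset:

```python
import json

# Each dataset row: {"prompt": <persona>, "response": <in-character reply>}.
with open("roleplay.jsonl") as src, open("chat_format.jsonl", "w") as dst:
    for line in src:
        row = json.loads(line)
        example = {
            "messages": [
                # The persona prompt becomes the system message.
                {"role": "system", "content": row["prompt"]},
                # Hypothetical fixed user turn; real pipelines may pair turns differently.
                {"role": "user", "content": "Stay in character and greet me."},
                {"role": "assistant", "content": row["response"]},
            ]
        }
        dst.write(json.dumps(example) + "\n")
```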
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Background: Clinical data is instrumental to medical research, machine learning (ML) model development, and advancing surgical care, but access is often constrained by privacy regulations and missing data. Synthetic data offers a promising solution to preserve privacy while enabling broader data access. Recent advances in large language models (LLMs) provide an opportunity to generate synthetic data with reduced reliance on domain expertise, computational resources, and pre-training.
Objective: This study aims to assess the feasibility of generating realistic tabular clinical data with OpenAI's GPT-4o using zero-shot prompting, and to evaluate the fidelity of LLM-generated data by comparing its statistical properties to the Vital Signs DataBase (VitalDB), a real-world open-source perioperative dataset.
Methods: In Phase 1, GPT-4o was prompted to generate a dataset with qualitative descriptions of 13 clinical parameters. The resultant data was assessed for general errors, plausibility of outputs, and cross-verification of related parameters. In Phase 2, GPT-4o was prompted to generate a dataset using descriptive statistics of the VitalDB dataset. Fidelity was assessed using two-sample t-tests, two-sample proportion tests, and 95% confidence interval (CI) overlap.
Results: In Phase 1, GPT-4o generated a complete and structured dataset comprising 6,166 case files. The dataset was plausible in range and correctly calculated body mass index for all case files based on respective heights and weights. Statistical comparison between the LLM-generated datasets and VitalDB revealed that Phase 2 data achieved significant fidelity. Phase 2 data demonstrated statistical similarity in 12/13 (92.31%) parameters, whereby no statistically significant differences were observed in 6/6 (100.0%) categorical/binary and 6/7 (85.71%) continuous parameters. Overlap of 95% CIs was observed in 6/7 (85.71%) continuous parameters.
Conclusion: Zero-shot prompting with GPT-4o can generate realistic tabular synthetic datasets that replicate key statistical properties of real-world perioperative data. This study highlights the potential of LLMs as a novel and accessible modality for synthetic data generation, which may address critical barriers in clinical data access and eliminate the need for technical expertise, extensive computational resources, and pre-training. Further research is warranted to enhance fidelity and investigate the use of LLMs to amplify and augment datasets, preserve multivariate relationships, and train robust ML models.
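For illustration, a minimal sketch of the fidelity checks named in the Methods (a two-sample t-test plus a 95% CI overlap check); the arrays below are placeholders, not VitalDB values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
real = rng.normal(60, 15, 500)       # placeholder: a continuous parameter from real data
synthetic = rng.normal(61, 15, 500)  # placeholder: the LLM-generated counterpart

# Two-sample t-test: are the means statistically distinguishable?
t, p = stats.ttest_ind(real, synthetic, equal_var=False)

# 95% CI overlap check (normal approximation for the mean).
def ci95(x):
    half = 1.96 * x.std(ddof=1) / np.sqrt(len(x))
    return x.mean() - half, x.mean() + half

lo1, hi1 = ci95(real)
lo2, hi2 = ci95(synthetic)
overlap = max(lo1, lo2) <= min(hi1, hi2)
print(f"p={p:.3f}, CIs overlap: {overlap}")
```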
Synthetic datasets for word scramble and arithmetic tasks described in the GPT-3 paper.
To use this dataset:
```python
import tensorflow_datasets as tfds

ds = tfds.load('gpt3', split='train')

for ex in ds.take(4):
  print(ex)
```
See the guide for more information on tensorflow_datasets.
Dataset Card for "GPT4-8K"
Dataset Description
This dataset was generated using GPT-4, a powerful language model developed by OpenAI. It contains a collection of dialogs between a user and an assistant, along with additional information, sourced from OpenChat.
Dataset Configurations
The dataset includes the following configurations:
- Config name: default
- Data files: split: train, path: data/train-*
Dataset… See the full description on the dataset page: https://huggingface.co/datasets/erfanzar/GPT4-8K.
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This is the datamix created by Team 🔍 📝 🕵️♂️ 🤖 during the LLM - Detect AI Generated Text competition. This dataset helped us win the competition. It supports a text-classification task: separating LLM-generated essays from student-written ones.
It was developed incrementally, focusing on size, diversity, and complexity. For each datamix iteration, we attempted to plug blind spots of the previous generation of models while maintaining robustness.
To maximally leverage in-domain human texts, we used the entire Persuade corpus, comprising all 15 prompts. We also included diverse human texts from sources such as the OpenAI GPT-2 output dataset, the ELLIPSE corpus, NarrativeQA, Wikipedia, the NLTK Brown corpus, and IMDB movie reviews.
Sources for our generated essays can be grouped under four categories:
- Proprietary LLMs (gpt-3.5, gpt-4, claude, cohere, gemini, palm)
- Open-source LLMs (llama, falcon, mistral, mixtral)
- Existing LLM-generated text datasets: DAIGT V2 subset, OUTFOX, Ghostbuster, gpt-2-output-dataset
- Synthetic dataset made by T5
We used a wide variety of generation configs and prompting strategies to promote diversity and complexity in the data. Generated essays leveraged a combination of the following (see the sketch below):
- Contrastive search
- Use of guidance scale, typical_p, suppress_tokens
- High temperature and large values of top-k
- Prompting to fill in the blank: randomly mask words in an essay and ask the LLM to reconstruct the original essay (similar to MLM)
- Prompting without source texts
- Prompting with source texts
- Prompting to rewrite existing essays
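Several of these settings correspond directly to Hugging Face transformers generation arguments; the model name and parameter values in this sketch are assumptions for illustration, not the team's actual configs:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # assumption: any causal LM works here
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tok("Write an essay about phones in school.", return_tensors="pt")

# High temperature + large top-k sampling, with typical_p filtering.
sampled = model.generate(**inputs, do_sample=True, temperature=1.4,
                         top_k=500, typical_p=0.95, max_new_tokens=300)

# Contrastive search: deterministic, with a degeneration penalty (penalty_alpha).
contrastive = model.generate(**inputs, penalty_alpha=0.6, top_k=4,
                             max_new_tokens=300)

print(tok.decode(sampled[0], skip_special_tokens=True))
```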
Finally, we incorporated augmented essays to make our models aware of typical attacks on LLM content-detection systems and of obfuscations present in the provided training data. We mainly used a combination of the following augmentations on a random subset of essays (a sketch of two of them follows below):
- Spelling correction
- Deletion/insertion/swapping of characters
- Replacement with synonyms
- Introduced obfuscations
- Back translation
- Random capitalization
- Sentence swapping
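A minimal sketch of two of the simpler augmentations (character swapping and random capitalization); probabilities and helper names are illustrative:

```python
import random

def swap_chars(text: str, p: float = 0.02) -> str:
    """Randomly swap adjacent characters with probability p per position."""
    chars = list(text)
    for i in range(len(chars) - 1):
        if random.random() < p:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def random_capitalization(text: str, p: float = 0.05) -> str:
    """Randomly flip the case of characters with probability p."""
    return "".join(c.swapcase() if random.random() < p else c for c in text)

essay = "Phones should not be allowed in classrooms."
print(random_capitalization(swap_chars(essay)))
```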
License: CC BY-NC-ND 4.0 (https://creativecommons.org/licenses/by-nc-nd/4.0/)
License information was derived automatically
LLM Fine-Tuning Dataset - 4,000,000+ logs, 32 languages
The dataset contains over 4 million logs written in 32 languages and is tailored for LLM training. It includes log and response pairs from 3 models, and is designed for language-model and instruction fine-tuning to achieve improved performance in various NLP tasks.
Models used for text generation:
GPT-3.5, GPT-4, and an uncensored GPT version (not included in the sample)
Languages in the… See the full description on the dataset page: https://huggingface.co/datasets/UniDataPro/llm-training-dataset.
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The researcher tests the QA capability of ChatGPT in the medical field from the following aspects:
1. Test its reserve of medical knowledge
2. Check its ability to read and understand medical literature
3. Test its ability to provide auxiliary diagnosis after reading case data
4. Test its error-correction ability for case data
5. Test its ability to standardize medical terms
6. Test its ability to evaluate experts
7. Check its ability to evaluate medical institutions
The conclusions are:
- ChatGPT has great potential in medical and healthcare applications, and may directly replace humans, even professionals, at a certain level in some fields.
- The researcher preliminarily believes that ChatGPT has basic medical knowledge and multi-round dialogue ability, and that its ability to understand Chinese is not weak.
- ChatGPT has the ability to read, understand, and correct cases.
- ChatGPT has the ability to extract information and standardize terminology, and is quite excellent at it.
- ChatGPT has medical-knowledge reasoning ability.
- ChatGPT has the ability to learn continuously; after continuous training, its level improved significantly.
- ChatGPT does not have the ability to academically evaluate Chinese medical talent; the results are not ideal.
- ChatGPT does not have the ability to academically evaluate Chinese medical institutions; the results are not ideal.
- ChatGPT is an epoch-making product, which can become a useful assistant for medical diagnosis and treatment, knowledge services, literature reading, reviews, and paper writing.
License: MIT (https://opensource.org/licenses/MIT)
License information was derived automatically
Dive into the future of education with the Deep Learning Tutor Dataset – a pioneering resource designed to empower the creation of sophisticated, adaptive AI tutors. This dataset is meticulously curated to facilitate the fine-tuning of advanced large language models like GPT-4o, enabling them to internalize specialized pedagogical conversation patterns and expert teaching methodologies.
This collection represents a significant step towards developing intelligent educational systems that can truly adapt to individual student needs, provide nuanced feedback, and foster deeper understanding. By leveraging the power of deep learning and state-of-the-art LLMs, this dataset paves the way for a new generation of personalized learning experiences.
The primary utility of this dataset is to fine-tune a powerful LLM like GPT-4o, imbuing it with the specific conversational and pedagogical skills required for adaptive tutoring.
Prerequisites:
* An OpenAI account with API access.
* Familiarity with the OpenAI Platform and fine-tuning concepts.
Step 1: Download the Dataset
Download the educational_conversation_data.jsonl file from this Kaggle dataset.
Step 2: Initiate GPT-4o Fine-tuning
This process will train GPT-4o to emulate the expert teaching methodologies embedded within the dataset.
1. Upload Data: Navigate to the "Fine-tuning" section in your OpenAI Platform. Upload the educational_conversation_data.jsonl file.
2. Create Fine-tuning Job:
* Base Model: gpt-4o (or gpt-4o-mini for more cost-effective experimentation).
* Epochs: 3 (A common starting point; adjust based on dataset size and desired performance).
* Learning Rate Multiplier: 2 (A good initial value; can be tuned).
* Batch Size: 1 (Often effective for pedagogical data, but can be adjusted).
* Note: These parameters are recommendations. Experimentation may be required to achieve optimal results for your specific application.
3. Start Job: Initiate the fine-tuning process. Once complete, you will receive a new custom model ID, representing your fine-tuned pedagogical AI.
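If you prefer to drive the same job from code rather than the platform UI, here is a minimal sketch using the official openai Python SDK; the model snapshot name is an assumption, so check the fine-tuning docs for currently supported models:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Step 1: upload the training file.
train_file = client.files.create(
    file=open("educational_conversation_data.jsonl", "rb"),
    purpose="fine-tune",
)

# Step 2: create the fine-tuning job with the parameters suggested above.
job = client.fine_tuning.jobs.create(
    training_file=train_file.id,
    model="gpt-4o-mini-2024-07-18",  # assumption: a supported gpt-4o-mini snapshot
    hyperparameters={"n_epochs": 3, "learning_rate_multiplier": 2, "batch_size": 1},
)
print(job.id, job.status)
```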
Step 3: Integrate Your Fine-tuned Model
The fine-tuned model ID can now be used with OpenAI's API to power your adaptive AI tutor. You can integrate it into:
* A custom chat interface.
* An existing educational platform.
* A research prototype for conversational AI in education.
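For example, once the job succeeds, a minimal call to the fine-tuned model (the model ID below is a placeholder; use the fine_tuned_model value your job returns):

```python
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="ft:gpt-4o-mini-2024-07-18:org::abc123",  # placeholder fine-tuned model ID
    messages=[
        {"role": "system", "content": "You are an adaptive math tutor."},
        {"role": "user", "content": "I keep mixing up sine and cosine."},
    ],
)
print(response.choices[0].message.content)
```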
Files:
* educational_conversation_data.jsonl: The core dataset containing the specialized pedagogical conversation patterns and expert teaching methodologies, formatted for OpenAI fine-tuning.
* README.md: (Optional, but good practice) A brief overview of the dataset and usage.
License: CDLA-Sharing-1.0 (https://choosealicense.com/licenses/cdla-sharing-1.0/)
Bitext - Travel Tagged Training Dataset for LLM-based Virtual Assistants
Overview
This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [Travel] sector can be easily achieved using our two-step approach to LLM Fine-Tuning. An overview of… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-travel-llm-chatbot-training-dataset.
License: CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
This dataset contains the data used to train RydbergGPT, a generative pre-trained transformer designed to learn the measurement outcomes of a neutral atom array quantum computer.
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The Open Poetry Vision dataset is a synthetic dataset created by Roboflow for OCR tasks.
It combines a random image from the Open Images Dataset with text primarily sampled from Gwern's GPT-2 Poetry project. Each image in the dataset contains between 1 and 5 strings in a variety of fonts and colors randomly positioned in the 512x512 canvas. The classes correspond to the font of the text.
Example Image:
https://i.imgur.com/sZT516a.png
A common OCR workflow is to use a neural network to isolate text for input into traditional optical character recognition software. This dataset could make a good starting point for an OCR project like business card parsing or automated paper form-processing.
Alternatively, you could try your hand at using this as a neural font identification dataset. Nvidia, amongst others, has had success with this task.
Use the fork button to copy this dataset to your own Roboflow account and export it with new preprocessing settings (perhaps resized for your model's desired format or converted to grayscale), or additional augmentations to make your model generalize better. This particular dataset would be very well suited for Roboflow's new advanced Bounding Box Only Augmentations.
Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless.
Developers reduce 50% of their code when using Roboflow's workflow, automate annotation quality assurance, save training time, and increase model reproducibility.

Facebook
TwitterThis dataset was created by qiancai314
License: CDLA-Sharing-1.0 (https://choosealicense.com/licenses/cdla-sharing-1.0/)
Dataset containing synthetically generated (by GPT-3.5 and GPT-4) short stories that only use a small vocabulary. Described in the following paper: https://arxiv.org/abs/2305.07759. The models referred to in the paper were trained on TinyStories-train.txt (the file tinystories-valid.txt can be used for validation loss). These models can be found on Huggingface, at roneneldan/TinyStories-1M/3M/8M/28M/33M/1Layer-21M. Additional resources: tinystories_all_data.tar.gz - contains a superset of… See the full description on the dataset page: https://huggingface.co/datasets/roneneldan/TinyStories.
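A minimal sketch for loading it with the datasets library (assuming the standard train/validation splits and a 'text' field, which you should verify on the dataset page):

```python
from datasets import load_dataset

ds = load_dataset("roneneldan/TinyStories")
print(ds["train"][0]["text"])  # one short story per row
```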
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
MULTITuDEv2 is a benchmark dataset for multilingual machine-generated text detection, described in the EMNLP 2023 conference paper. It consists of 7992 human-written news texts in 11 languages subsampled from MassiveSumm, accompanied by 66089 texts generated by 8 large language models (by using headlines of news articles). The creation process and scripts for replication/extension are located in a GitHub repository. The dataset has been further extended in v2 by obfuscated texts using 10 authorship obfuscation methods, described in the EMNLP 2024 Findings conference paper. If you use this dataset in any publication, project, tool or in any other form, please cite the paper.
Files
The v2 of the dataset consists of multiple files. 'multitude.csv' contains the original v1 of the dataset (i.e., without the field 'generated'). The other files also contain the 'generated' field (as described below) and are compressed by GZIP. The file 'multitude_obfuscated_original.csv.gz' contains copies of the 'text' field in the 'generated' field, to be compatible with the files with obfuscated texts (used as such in the experiments).
Fields
The dataset has the following fields:
- 'text': an original (unobfuscated) text sample
- 'label': 0 for human-written text, 1 for machine-generated text
- 'multi_label': a string representing the large language model that generated the text, or the string "human" representing a human-written text
- 'split': a string identifying the train or test split of the dataset, for the purpose of training and evaluation respectively
- 'language': the ISO 639-1 language code identifying the language of the given text
- 'length': word count of the given text
- 'source': a string identifying the source dataset / news medium of the given text
- 'generated': an obfuscated text sample (i.e., transformed from the original text by the obfuscator indicated by the corresponding filename)
Note: some obfuscated texts in the 'generated' field are the same as in the 'text' field, indicating failure of the obfuscator to modify the text. Human-written obfuscated texts are also included; however, the labels of their originals might no longer be relevant for them (i.e., human-written text obfuscated by a machine could be considered machine-generated as well); consider this in your research.
Statistics (the number of samples)
Splits:
- train: 44786
- test: 29295
Binary labels:
- 0: 7992
- 1: 66089
Multiclass labels:
- gpt-3.5-turbo: 8300
- gpt-4: 8300
- text-davinci-003: 8297
- alpaca-lora-30b: 8290
- vicuna-13b: 8287
- opt-66b: 8229
- llama-65b: 8229
- opt-iml-max-1.3b: 8157
- human: 7992
Languages:
- English (en): 29460 (train + test)
- Spanish (es): 11586 (train + test)
- Russian (ru): 11578 (train + test)
- Dutch (nl): 2695 (test)
- Catalan (ca): 2691 (test)
- Czech (cs): 2689 (test)
- German (de): 2685 (test)
- Chinese (zh): 2683 (test)
- Portuguese (pt): 2673 (test)
- Arabic (ar): 2673 (test)
- Ukrainian (uk): 2668 (test)
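Given the fields above, a minimal sketch for loading the v1 file and selecting, say, the English training portion:

```python
import pandas as pd

df = pd.read_csv("multitude.csv")

# Keep English training samples; 'label' is 0 for human, 1 for machine-generated.
en_train = df[(df["language"] == "en") & (df["split"] == "train")]
print(en_train["label"].value_counts())
```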
License: CC BY-NC 4.0 (https://creativecommons.org/licenses/by-nc/4.0/)
License information was derived automatically
Automated data curation for niche scientific topics, where data quality and contextual accuracy are paramount, poses significant challenges. Bidirectional contextual models such as BERT and ELMo excel in contextual understanding and determinism. However, they are constrained by their narrower training corpora and inability to synthesize information across fragmented or sparse contexts. Conversely, autoregressive generative models like GPT can synthesize dispersed information by leveraging broader contextual knowledge, yet often generate plausible but incorrect ("hallucinated") information. To address these complementary limitations, we propose an ensemble approach that combines the deterministic precision of BERT/ELMo with the contextual depth of GPT.
We have developed a hierarchical knowledge extraction framework to identify perovskites and their associated solvents in perovskite synthesis, progressing from broad topics to narrower details using two complementary methods. The first method leverages deterministic models like BERT/ELMo for precise entity extraction, while the second employs GPT for broader contextual synthesis and generalization. Outputs from both methods are validated through structure matching and entity normalization, ensuring consistency and traceability. In the absence of benchmark datasets for this domain, we hold out a subset of papers for manual verification to serve as a reference set for tuning the rules for entity normalization. This enables quantitative evaluation of model precision, recall, and structural adherence, while also providing a grounded estimate of model confidence. By intersecting the outputs from both methods, we generate a list of solvents with maximum confidence, combining precision with contextual depth to ensure accuracy and reliability. This approach increases precision at the expense of recall, a trade-off we accept given that, in high-trust scientific applications, minimizing hallucinations is often more critical than achieving full coverage, especially when downstream reliability is paramount.
As a case study, the curated dataset is used to predict the endocrine-disrupting (ED) potential of solvents with a pretrained deep learning model. Recognizing that machine learning models may not be trained on niche datasets such as perovskite-related solvents, we have quantified epistemic uncertainty using Shannon entropy. This measure evaluates the confidence of the ML model's predictions, independent of uncertainties in the NLP-based data curation process, and identifies high-risk solvents requiring further validation. Additionally, the manual verification pipeline addresses ethical considerations around trust, structure, and transparency in AI-curated datasets.
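As a minimal sketch of the Shannon-entropy confidence measure described above (the probability vectors are placeholders for a model's predicted class distribution):

```python
import numpy as np

def shannon_entropy(probs: np.ndarray) -> float:
    """Shannon entropy H = -sum(p * log2 p); higher means less confident."""
    p = probs[probs > 0]  # drop zeros to avoid log(0)
    return float(-(p * np.log2(p)).sum())

confident = np.array([0.95, 0.03, 0.02])   # low entropy: trust the prediction
uncertain = np.array([0.40, 0.35, 0.25])   # high entropy: flag for validation
print(shannon_entropy(confident), shannon_entropy(uncertain))
```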