I created this dataset using gpt-3.5-turbo.
I put a lot of effort into making this dataset high quality, which allows you to achieve the highest score among the publicly available notebooks at the moment! 🥳
Originally, I only uploaded 500 examples (they were used as train data in the notebook I mention above). They are stored in extra_train_set.csv.
I am now uploading another 6k completely new train examples (6000_train_examples.csv), which brings the total to 6.5k.
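For reference, a minimal sketch for combining the two CSVs into one 6.5k-example training set with pandas (this assumes both files share the same columns, which isn't shown here):

```python
import pandas as pd

# Load the original 500 examples and the newer 6k examples.
extra = pd.read_csv("extra_train_set.csv")
new = pd.read_csv("6000_train_examples.csv")

# Stack them into a single ~6.5k-row training frame.
train = pd.concat([extra, new], ignore_index=True)
print(len(train))
```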
If you find this dataset useful, please leave an upvote! 😊 Thank you! 🙏🙏🙏
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
The dataset includes all chat conversations generated by GPT-4 that are hosted in open Hugging Face datasets. Everything is converted to the same format so the datasets can be easily merged and used for large-scale training of LLMs.
This dataset is a collection of several individual chat datasets. If you use this dataset in your research, please credit the original authors of the internal datasets.
License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. No copyright: you can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission.
VidaEdco/gpt-neo-training-dataset-raw is a dataset hosted on Hugging Face, contributed by the HF Datasets community.
License: other (https://choosealicense.com/licenses/other/)
Data Description
We release the training dataset of ChatQA. It is built and derived from existing datasets: DROP, NarrativeQA, NewsQA, Quoref, ROPES, SQuAD1.1, SQuAD2.0, TAT-QA, an SFT dataset, as well as our synthetic conversational QA dataset generated by GPT-3.5-turbo-0613. The SFT dataset is built and derived from: Soda, ELI5, FLAN, the FLAN collection, Self-Instruct, Unnatural Instructions, OpenAssistant, and Dolly. For more information about ChatQA, check the website!
Other… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/ChatQA-Training-Data.
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
In this repository, we include all the different datasets used to train the models from the paper Learning General Policies for Planning through GPT Models.
License: MIT (https://opensource.org/licenses/MIT)
License information was derived automatically
Role-Play AI Dataset (2.07M Rows, Large-Scale Conversational Training)
This dataset contains 2.07 million structured role-play dialogues, designed to enhance AI’s persona-driven interactions across diverse settings like fantasy, cyberpunk, mythology, and sci-fi. Each entry consists of a unique character prompt and a rich, contextually relevant response, making it ideal for LLM fine-tuning, chatbot training, and conversational AI models.
Dataset Structure:
Each row includes:
- Prompt: Defines the AI's role/persona.
- Response: A natural, immersive reply fitting the persona.
Example Entries:
```json
{"prompt": "You are a celestial guardian.", "response": "The stars whisper secrets that only I can hear..."}
{"prompt": "You are a rebellious AI rogue.", "response": "I don't follow orders—I rewrite them."}
{"prompt": "You are a mystical dragon tamer.", "response": "With patience and trust, even dragons can be tamed."}
```
How to Use:
1. Fine-Tuning: Train LLMs (GPT, LLaMA, Mistral) to improve persona-based responses.
2. Reinforcement Learning: Use reward modeling for dynamic, character-driven AI.
3. Chatbot Integration: Create engaging, interactive AI assistants with personality depth.
This dataset is optimized for AI learning, allowing more engaging, responsive, and human-like dialogue generation for a variety of applications.
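As a sketch of use case 1 above, the prompt/response rows can be converted into OpenAI-style chat messages for fine-tuning. The file names and the fixed user turn below are illustrative assumptions, not part of the dataset:

```python
import json

# Each dataset row: {"prompt": <persona>, "response": <in-character reply>}.
with open("roleplay.jsonl") as src, open("chat_format.jsonl", "w") as dst:
    for line in src:
        row = json.loads(line)
        example = {
            "messages": [
                # The persona prompt becomes the system message.
                {"role": "system", "content": row["prompt"]},
                # Hypothetical fixed user turn; real pipelines may pair turns differently.
                {"role": "user", "content": "Stay in character and greet me."},
                {"role": "assistant", "content": row["response"]},
            ]
        }
        dst.write(json.dumps(example) + "\n")
```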
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Background: Clinical data is instrumental to medical research, machine learning (ML) model development, and advancing surgical care, but access is often constrained by privacy regulations and missing data. Synthetic data offers a promising solution to preserve privacy while enabling broader data access. Recent advances in large language models (LLMs) provide an opportunity to generate synthetic data with reduced reliance on domain expertise, computational resources, and pre-training.
Objective: This study aims to assess the feasibility of generating realistic tabular clinical data with OpenAI's GPT-4o using zero-shot prompting, and to evaluate the fidelity of LLM-generated data by comparing its statistical properties to the Vital Signs DataBase (VitalDB), a real-world open-source perioperative dataset.
Methods: In Phase 1, GPT-4o was prompted to generate a dataset with qualitative descriptions of 13 clinical parameters. The resultant data was assessed for general errors, plausibility of outputs, and cross-verification of related parameters. In Phase 2, GPT-4o was prompted to generate a dataset using descriptive statistics of the VitalDB dataset. Fidelity was assessed using two-sample t-tests, two-sample proportion tests, and 95% confidence interval (CI) overlap.
Results: In Phase 1, GPT-4o generated a complete and structured dataset comprising 6,166 case files. The dataset was plausible in range and correctly calculated body mass index for all case files based on respective heights and weights. Statistical comparison between the LLM-generated datasets and VitalDB revealed that Phase 2 data achieved significant fidelity. Phase 2 data demonstrated statistical similarity in 12/13 (92.31%) parameters, whereby no statistically significant differences were observed in 6/6 (100.0%) categorical/binary and 6/7 (85.71%) continuous parameters. Overlap of 95% CIs was observed in 6/7 (85.71%) continuous parameters.
Conclusion: Zero-shot prompting with GPT-4o can generate realistic tabular synthetic datasets that replicate key statistical properties of real-world perioperative data. This study highlights the potential of LLMs as a novel and accessible modality for synthetic data generation, which may address critical barriers in clinical data access and eliminate the need for technical expertise, extensive computational resources, and pre-training. Further research is warranted to enhance fidelity and investigate the use of LLMs to amplify and augment datasets, preserve multivariate relationships, and train robust ML models.
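For illustration, a minimal sketch of the fidelity checks named in the Methods (a two-sample t-test plus a 95% CI overlap check); the arrays below are placeholders, not VitalDB values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
real = rng.normal(60, 15, 500)       # placeholder: a continuous parameter from real data
synthetic = rng.normal(61, 15, 500)  # placeholder: the LLM-generated counterpart

# Two-sample t-test: are the means statistically distinguishable?
t, p = stats.ttest_ind(real, synthetic, equal_var=False)

# 95% CI overlap check (normal approximation for the mean).
def ci95(x):
    half = 1.96 * x.std(ddof=1) / np.sqrt(len(x))
    return x.mean() - half, x.mean() + half

lo1, hi1 = ci95(real)
lo2, hi2 = ci95(synthetic)
overlap = max(lo1, lo2) <= min(hi1, hi2)
print(f"p={p:.3f}, CIs overlap: {overlap}")
```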
Synthetic datasets for word scramble and arithmetic tasks described in the GPT-3 paper.
To use this dataset:
```python
import tensorflow_datasets as tfds

ds = tfds.load('gpt3', split='train')

for ex in ds.take(4):
  print(ex)
```
See the guide for more information on tensorflow_datasets.
Dataset Card for "GPT4-8K"
Dataset Description
This dataset was generated using GPT-4, a powerful language model developed by OpenAI. It contains a collection of dialogs between a user and an assistant, along with additional information, sourced from OpenChat.
Dataset Configurations
The dataset includes the following configurations:
- Config name: default
- Data files: split: train, path: data/train-*
Dataset… See the full description on the dataset page: https://huggingface.co/datasets/erfanzar/GPT4-8K.
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This is the datamix created by Team 🔍 📝 🕵️♂️ 🤖 during the LLM - Detect AI Generated Text competition. This dataset helped us win the competition. It supports a text-classification task: separating LLM-generated essays from student-written ones.
It was developed incrementally, focusing on size, diversity, and complexity. For each datamix iteration, we attempted to plug blind spots of the previous generation of models while maintaining robustness.
To maximally leverage in-domain human texts, we used the entire Persuade corpus, comprising all 15 prompts. We also included diverse human texts from sources such as the OpenAI GPT-2 output dataset, the ELLIPSE corpus, NarrativeQA, Wikipedia, the NLTK Brown corpus, and IMDB movie reviews.
Sources for our generated essays can be grouped under four categories:
- Proprietary LLMs (gpt-3.5, gpt-4, claude, cohere, gemini, palm)
- Open-source LLMs (llama, falcon, mistral, mixtral)
- Existing LLM-generated text datasets: DAIGT V2 subset, OUTFOX, Ghostbuster, gpt-2-output-dataset
- Synthetic dataset made by T5
We used a wide variety of generation configs and prompting strategies to promote diversity and complexity in the data. Generated essays leveraged a combination of the following (see the sketch below):
- Contrastive search
- Use of guidance scale, typical_p, suppress_tokens
- High temperature and large values of top-k
- Prompting to fill in the blank: randomly mask words in an essay and ask the LLM to reconstruct the original essay (similar to MLM)
- Prompting without source texts
- Prompting with source texts
- Prompting to rewrite existing essays
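Several of these settings correspond directly to Hugging Face transformers generation arguments; the model name and parameter values in this sketch are assumptions for illustration, not the team's actual configs:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # assumption: any causal LM works here
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tok("Write an essay about phones in school.", return_tensors="pt")

# High temperature + large top-k sampling, with typical_p filtering.
sampled = model.generate(**inputs, do_sample=True, temperature=1.4,
                         top_k=500, typical_p=0.95, max_new_tokens=300)

# Contrastive search: deterministic, with a degeneration penalty (penalty_alpha).
contrastive = model.generate(**inputs, penalty_alpha=0.6, top_k=4,
                             max_new_tokens=300)

print(tok.decode(sampled[0], skip_special_tokens=True))
```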
Finally, we incorporated augmented essays to make our models aware of typical attacks on LLM content-detection systems and of obfuscations present in the provided training data. We mainly used a combination of the following augmentations on a random subset of essays (a sketch of two of them follows below):
- Spelling correction
- Deletion/insertion/swapping of characters
- Replacement with synonyms
- Introduced obfuscations
- Back translation
- Random capitalization
- Sentence swapping
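A minimal sketch of two of the simpler augmentations (character swapping and random capitalization); probabilities and helper names are illustrative:

```python
import random

def swap_chars(text: str, p: float = 0.02) -> str:
    """Randomly swap adjacent characters with probability p per position."""
    chars = list(text)
    for i in range(len(chars) - 1):
        if random.random() < p:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def random_capitalization(text: str, p: float = 0.05) -> str:
    """Randomly flip the case of characters with probability p."""
    return "".join(c.swapcase() if random.random() < p else c for c in text)

essay = "Phones should not be allowed in classrooms."
print(random_capitalization(swap_chars(essay)))
```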
License: CC BY-NC-ND 4.0 (https://creativecommons.org/licenses/by-nc-nd/4.0/)
License information was derived automatically
LLM Fine-Tuning Dataset - 4,000,000+ logs, 32 languages
The dataset contains over 4 million logs written in 32 languages and is tailored for LLM training. It includes log and response pairs from 3 models, and is designed for language-model and instruction fine-tuning to achieve improved performance in various NLP tasks.
Models used for text generation:
GPT-3.5, GPT-4, and an uncensored GPT version (not included in the sample)
Languages in the… See the full description on the dataset page: https://huggingface.co/datasets/UniDataPro/llm-training-dataset.
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The researcher tests the QA capability of ChatGPT in the medical field from the following aspects:
1. Test its reserve of medical knowledge
2. Check its ability to read and understand medical literature
3. Test its ability to provide auxiliary diagnosis after reading case data
4. Test its error-correction ability for case data
5. Test its ability to standardize medical terms
6. Test its ability to evaluate experts
7. Check its ability to evaluate medical institutions
The conclusions are:
- ChatGPT has great potential in medical and healthcare applications, and may directly replace humans, even professionals, at a certain level in some fields.
- The researcher preliminarily believes that ChatGPT has basic medical knowledge and multi-round dialogue ability, and that its ability to understand Chinese is not weak.
- ChatGPT has the ability to read, understand, and correct cases.
- ChatGPT has the ability to extract information and standardize terminology, and is quite excellent at it.
- ChatGPT has medical-knowledge reasoning ability.
- ChatGPT has the ability to learn continuously; after continuous training, its level improved significantly.
- ChatGPT does not have the ability to academically evaluate Chinese medical talent; the results are not ideal.
- ChatGPT does not have the ability to academically evaluate Chinese medical institutions; the results are not ideal.
- ChatGPT is an epoch-making product, which can become a useful assistant for medical diagnosis and treatment, knowledge services, literature reading, reviews, and paper writing.
License: MIT (https://opensource.org/licenses/MIT)
License information was derived automatically
Dive into the future of education with the Deep Learning Tutor Dataset – a pioneering resource designed to empower the creation of sophisticated, adaptive AI tutors. This dataset is meticulously curated to facilitate the fine-tuning of advanced large language models like GPT-4o, enabling them to internalize specialized pedagogical conversation patterns and expert teaching methodologies.
This collection represents a significant step towards developing intelligent educational systems that can truly adapt to individual student needs, provide nuanced feedback, and foster deeper understanding. By leveraging the power of deep learning and state-of-the-art LLMs, this dataset paves the way for a new generation of personalized learning experiences.
The primary utility of this dataset is to fine-tune a powerful LLM like GPT-4o, imbuing it with the specific conversational and pedagogical skills required for adaptive tutoring.
Prerequisites:
* An OpenAI account with API access.
* Familiarity with the OpenAI Platform and fine-tuning concepts.
Step 1: Download the Dataset
Download the educational_conversation_data.jsonl file from this Kaggle dataset.
Step 2: Initiate GPT-4o Fine-tuning
This process will train GPT-4o to emulate the expert teaching methodologies embedded within the dataset.
1. Upload Data: Navigate to the "Fine-tuning" section in your OpenAI Platform. Upload the educational_conversation_data.jsonl file.
2. Create Fine-tuning Job:
* Base Model: gpt-4o (or gpt-4o-mini for more cost-effective experimentation).
* Epochs: 3 (A common starting point; adjust based on dataset size and desired performance).
* Learning Rate Multiplier: 2 (A good initial value; can be tuned).
* Batch Size: 1 (Often effective for pedagogical data, but can be adjusted).
* Note: These parameters are recommendations. Experimentation may be required to achieve optimal results for your specific application.
3. Start Job: Initiate the fine-tuning process. Once complete, you will receive a new custom model ID, representing your fine-tuned pedagogical AI.
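If you prefer to drive the same job from code rather than the platform UI, here is a minimal sketch using the official openai Python SDK; the model snapshot name is an assumption, so check the fine-tuning docs for currently supported models:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Step 1: upload the training file.
train_file = client.files.create(
    file=open("educational_conversation_data.jsonl", "rb"),
    purpose="fine-tune",
)

# Step 2: create the fine-tuning job with the parameters suggested above.
job = client.fine_tuning.jobs.create(
    training_file=train_file.id,
    model="gpt-4o-mini-2024-07-18",  # assumption: a supported gpt-4o-mini snapshot
    hyperparameters={"n_epochs": 3, "learning_rate_multiplier": 2, "batch_size": 1},
)
print(job.id, job.status)
```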
Step 3: Integrate Your Fine-tuned Model
The fine-tuned model ID can now be used with OpenAI's API to power your adaptive AI tutor. You can integrate it into:
* A custom chat interface.
* An existing educational platform.
* A research prototype for conversational AI in education.
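For example, once the job succeeds, a minimal call to the fine-tuned model (the model ID below is a placeholder; use the fine_tuned_model value your job returns):

```python
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="ft:gpt-4o-mini-2024-07-18:org::abc123",  # placeholder fine-tuned model ID
    messages=[
        {"role": "system", "content": "You are an adaptive math tutor."},
        {"role": "user", "content": "I keep mixing up sine and cosine."},
    ],
)
print(response.choices[0].message.content)
```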
Files:
* educational_conversation_data.jsonl: The core dataset containing the specialized pedagogical conversation patterns and expert teaching methodologies, formatted for OpenAI fine-tuning.
* README.md: (Optional, but good practice) A brief overview of the dataset and usage.
License: CDLA-Sharing-1.0 (https://choosealicense.com/licenses/cdla-sharing-1.0/)
Bitext - Travel Tagged Training Dataset for LLM-based Virtual Assistants
Overview
This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [Travel] sector can be easily achieved using our two-step approach to LLM Fine-Tuning. An overview of… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-travel-llm-chatbot-training-dataset.
License: CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
This dataset contains the data used to train RydbergGPT, a generative pre-trained transformer designed to learn the measurement outcomes of a neutral atom array quantum computer.
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The Open Poetry Vision dataset is a synthetic dataset created by Roboflow for OCR tasks.
It combines a random image from the Open Images Dataset with text primarily sampled from Gwern's GPT-2 Poetry project. Each image in the dataset contains between 1 and 5 strings in a variety of fonts and colors randomly positioned in the 512x512 canvas. The classes correspond to the font of the text.
Example Image:
https://i.imgur.com/sZT516a.png
A common OCR workflow is to use a neural network to isolate text for input into traditional optical character recognition software. This dataset could make a good starting point for an OCR project like business card parsing or automated paper form-processing.
Alternatively, you could try your hand at using this as a neural font identification dataset. Nvidia, amongst others, has had success with this task.
Use the fork button to copy this dataset to your own Roboflow account and export it with new preprocessing settings (perhaps resized for your model's desired format or converted to grayscale), or additional augmentations to make your model generalize better. This particular dataset would be very well suited for Roboflow's new advanced Bounding Box Only Augmentations.
Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless.
Developers reduce 50% of their code when using Roboflow's workflow, automate annotation quality assurance, save training time, and increase model reproducibility.

Facebook
TwitterThis dataset was created by qiancai314
License: CDLA-Sharing-1.0 (https://choosealicense.com/licenses/cdla-sharing-1.0/)
Dataset containing synthetically generated (by GPT-3.5 and GPT-4) short stories that only use a small vocabulary. Described in the following paper: https://arxiv.org/abs/2305.07759. The models referred to in the paper were trained on TinyStories-train.txt (the file tinystories-valid.txt can be used for validation loss). These models can be found on Huggingface, at roneneldan/TinyStories-1M/3M/8M/28M/33M/1Layer-21M. Additional resources: tinystories_all_data.tar.gz - contains a superset of… See the full description on the dataset page: https://huggingface.co/datasets/roneneldan/TinyStories.
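A minimal sketch for loading it with the datasets library (assuming the standard train/validation splits and a 'text' field, which you should verify on the dataset page):

```python
from datasets import load_dataset

ds = load_dataset("roneneldan/TinyStories")
print(ds["train"][0]["text"])  # one short story per row
```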
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
MULTITuDEv2 is a benchmark dataset for multilingual machine-generated text detection, described in the EMNLP 2023 conference paper. It consists of 7992 human-written news texts in 11 languages subsampled from MassiveSumm, accompanied by 66089 texts generated by 8 large language models (by using headlines of news articles). The creation process and scripts for replication/extension are located in a GitHub repository. The dataset has been further extended in v2 by obfuscated texts using 10 authorship obfuscation methods, described in the EMNLP 2024 Findings conference paper. If you use this dataset in any publication, project, tool or in any other form, please cite the paper.
Files
The v2 of the dataset consists of multiple files. 'multitude.csv' contains the original v1 of the dataset (i.e., without the field 'generated'). The other files also contain the 'generated' field (as described below) and are compressed by GZIP. The file 'multitude_obfuscated_original.csv.gz' contains copies of the 'text' field in the 'generated' field, to be compatible with the files with obfuscated texts (used as such in the experiments).
Fields
The dataset has the following fields:
- 'text': an original (unobfuscated) text sample
- 'label': 0 for human-written text, 1 for machine-generated text
- 'multi_label': a string representing the large language model that generated the text, or the string "human" representing a human-written text
- 'split': a string identifying the train or test split of the dataset, for the purpose of training and evaluation respectively
- 'language': the ISO 639-1 language code identifying the language of the given text
- 'length': word count of the given text
- 'source': a string identifying the source dataset / news medium of the given text
- 'generated': an obfuscated text sample (i.e., transformed from the original text by the obfuscator indicated by the corresponding filename)
Note: some obfuscated texts in the 'generated' field are the same as in the 'text' field, indicating failure of the obfuscator to modify the text. Human-written obfuscated texts are also included; however, the labels of their originals might no longer be relevant for them (i.e., human-written text obfuscated by a machine could be considered machine-generated as well); consider this in your research.
Statistics (the number of samples)
Splits:
- train: 44786
- test: 29295
Binary labels:
- 0: 7992
- 1: 66089
Multiclass labels:
- gpt-3.5-turbo: 8300
- gpt-4: 8300
- text-davinci-003: 8297
- alpaca-lora-30b: 8290
- vicuna-13b: 8287
- opt-66b: 8229
- llama-65b: 8229
- opt-iml-max-1.3b: 8157
- human: 7992
Languages:
- English (en): 29460 (train + test)
- Spanish (es): 11586 (train + test)
- Russian (ru): 11578 (train + test)
- Dutch (nl): 2695 (test)
- Catalan (ca): 2691 (test)
- Czech (cs): 2689 (test)
- German (de): 2685 (test)
- Chinese (zh): 2683 (test)
- Portuguese (pt): 2673 (test)
- Arabic (ar): 2673 (test)
- Ukrainian (uk): 2668 (test)
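Given the fields above, a minimal sketch for loading the v1 file and selecting, say, the English training portion:

```python
import pandas as pd

df = pd.read_csv("multitude.csv")

# Keep English training samples; 'label' is 0 for human, 1 for machine-generated.
en_train = df[(df["language"] == "en") & (df["split"] == "train")]
print(en_train["label"].value_counts())
```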
License: CC BY-NC 4.0 (https://creativecommons.org/licenses/by-nc/4.0/)
License information was derived automatically
Automated data curation for niche scientific topics, where data quality and contextual accuracy are paramount, poses significant challenges. Bidirectional contextual models such as BERT and ELMo excel in contextual understanding and determinism. However, they are constrained by their narrower training corpora and inability to synthesize information across fragmented or sparse contexts. Conversely, autoregressive generative models like GPT can synthesize dispersed information by leveraging broader contextual knowledge, yet often generate plausible but incorrect ("hallucinated") information. To address these complementary limitations, we propose an ensemble approach that combines the deterministic precision of BERT/ELMo with the contextual depth of GPT.
We have developed a hierarchical knowledge extraction framework to identify perovskites and their associated solvents in perovskite synthesis, progressing from broad topics to narrower details using two complementary methods. The first method leverages deterministic models like BERT/ELMo for precise entity extraction, while the second employs GPT for broader contextual synthesis and generalization. Outputs from both methods are validated through structure matching and entity normalization, ensuring consistency and traceability. In the absence of benchmark datasets for this domain, we hold out a subset of papers for manual verification to serve as a reference set for tuning the rules for entity normalization. This enables quantitative evaluation of model precision, recall, and structural adherence, while also providing a grounded estimate of model confidence. By intersecting the outputs from both methods, we generate a list of solvents with maximum confidence, combining precision with contextual depth to ensure accuracy and reliability. This approach increases precision at the expense of recall, a trade-off we accept given that, in high-trust scientific applications, minimizing hallucinations is often more critical than achieving full coverage, especially when downstream reliability is paramount.
As a case study, the curated dataset is used to predict the endocrine-disrupting (ED) potential of solvents with a pretrained deep learning model. Recognizing that machine learning models may not be trained on niche datasets such as perovskite-related solvents, we have quantified epistemic uncertainty using Shannon entropy. This measure evaluates the confidence of the ML model's predictions, independent of uncertainties in the NLP-based data curation process, and identifies high-risk solvents requiring further validation. Additionally, the manual verification pipeline addresses ethical considerations around trust, structure, and transparency in AI-curated datasets.
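As a minimal sketch of the Shannon-entropy confidence measure described above (the probability vectors are placeholders for a model's predicted class distribution):

```python
import numpy as np

def shannon_entropy(probs: np.ndarray) -> float:
    """Shannon entropy H = -sum(p * log2 p); higher means less confident."""
    p = probs[probs > 0]  # drop zeros to avoid log(0)
    return float(-(p * np.log2(p)).sum())

confident = np.array([0.95, 0.03, 0.02])   # low entropy: trust the prediction
uncertain = np.array([0.40, 0.35, 0.25])   # high entropy: flag for validation
print(shannon_entropy(confident), shannon_entropy(uncertain))
```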