100+ datasets found
  1. 📊 6.5k train examples for LLM Science Exam 📝

    • kaggle.com
    Updated Jul 22, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Radek Osmulski (2023). 📊 6.5k train examples for LLM Science Exam 📝 [Dataset]. https://www.kaggle.com/datasets/radek1/additional-train-data-for-llm-science-exam
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 22, 2023
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Radek Osmulski
    Description

    I created this dataset using gpt-3.5-turbo.

    I put a lot of effort into making this dataset high quality which allows you to achieve the highest score among the publicly available notebooks available at the moment! 🥳

    Originally, I only uploaded 500 examples (they were used as train data in the notebook I mention above). They are stored in extra_train_set.csv.

    I am now uploading another 6k (6000_train_examples.csv) completely new train examples which brings the total to 6.5k.

    If you find this dataset useful, please leave an upvote! 😊 Thank you! 🙏🙏🙏

  2. All GPT-4 Conversations

    • kaggle.com
    Updated Nov 21, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    The Devastator (2023). All GPT-4 Conversations [Dataset]. https://www.kaggle.com/datasets/thedevastator/all-gpt-4-synthetic-chat-datasets
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    The Devastator
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    All GPT-4 Generated Datasets

    Every chat dataset generated by GPT-4 from Huggingface at the same format

    From [Huggingface datasets]

    About this dataset

    How to use the dataset

    The dataset includes all chat conversations generated by GPT-4 that are hosted on open Huggingface datasets. Everything is converted to the same format so the datasets can be easily merged and used for large scale training of LLMs.

    Acknowledgements

    This dataset is a collection of several single chat datasets. If you use this dataset in your research, please credit the original authors of the internal datasets. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

  3. h

    gpt-neo-training-dataset-raw

    • huggingface.co
    Updated Sep 20, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Vida Tayebati (2024). gpt-neo-training-dataset-raw [Dataset]. https://huggingface.co/datasets/VidaEdco/gpt-neo-training-dataset-raw
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 20, 2024
    Authors
    Vida Tayebati
    Description

    VidaEdco/gpt-neo-training-dataset-raw dataset hosted on Hugging Face and contributed by the HF Datasets community

  4. ChatQA-Training-Data

    • huggingface.co
    Updated Jun 30, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    NVIDIA (2023). ChatQA-Training-Data [Dataset]. https://huggingface.co/datasets/nvidia/ChatQA-Training-Data
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 30, 2023
    Dataset provided by
    Nvidiahttp://nvidia.com/
    Authors
    NVIDIA
    License

    https://choosealicense.com/licenses/other/https://choosealicense.com/licenses/other/

    Description

    Data Description

    We release the training dataset of ChatQA. It is built and derived from existing datasets: DROP, NarrativeQA, NewsQA, Quoref, ROPES, SQuAD1.1, SQuAD2.0, TAT-QA, a SFT dataset, as well as a our synthetic conversational QA dataset by GPT-3.5-turbo-0613. The SFT dataset is built and derived from: Soda, ELI5, FLAN, the FLAN collection, Self-Instruct, Unnatural Instructions, OpenAssistant, and Dolly. For more information about ChatQA, check the website!

      Other… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/ChatQA-Training-Data.
    
  5. Datasets of planGPT

    • zenodo.org
    zip
    Updated Jul 25, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Massimiliano Tummolo; Massimiliano Tummolo (2024). Datasets of planGPT [Dataset]. http://doi.org/10.5281/zenodo.10925404
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jul 25, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Massimiliano Tummolo; Massimiliano Tummolo
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In this repository, we include all the different dataset used to train the models from the paper Learning General Policies for Planning through GPT Models.

  6. RolePlay DataSet

    • kaggle.com
    zip
    Updated Feb 16, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Vampelium (2025). RolePlay DataSet [Dataset]. https://www.kaggle.com/datasets/vampelium/roleplay-dataset
    Explore at:
    zip(14183258 bytes)Available download formats
    Dataset updated
    Feb 16, 2025
    Authors
    Vampelium
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Role-Play AI Dataset (2.07M Rows, Large-Scale Conversational Training)

    This dataset contains 2.07 million structured role-play dialogues, designed to enhance AI’s persona-driven interactions across diverse settings like fantasy, cyberpunk, mythology, and sci-fi. Each entry consists of a unique character prompt and a rich, contextually relevant response, making it ideal for LLM fine-tuning, chatbot training, and conversational AI models.

    Dataset Structure:

    Each row includes: • Prompt: Defines the AI’s role/persona. • Response: A natural, immersive reply fitting the persona.

    Example Entries: ```json

    {"prompt": "You are a celestial guardian.", "response": "The stars whisper secrets that only I can hear..."}
    {"prompt": "You are a rebellious AI rogue.", "response": "I don't follow orders—I rewrite them."}
    {"prompt": "You are a mystical dragon tamer.", "response": "With patience and trust, even dragons can be tamed."}

    How to Use:
      1. Fine-Tuning: Train LLMs (GPT, LLaMA, Mistral) to improve persona-based responses.
      2. Reinforcement Learning: Use reward modeling for dynamic, character-driven AI.
      3. Chatbot Integration: Create engaging, interactive AI assistants with personality depth.
    
    This dataset is optimized for AI learning, allowing more engaging, responsive, and human-like dialogue generation for a variety of applications.
    
  7. f

    Data Sheet 2_Large language models generating synthetic clinical datasets: a...

    • frontiersin.figshare.com
    • figshare.com
    xlsx
    Updated Feb 5, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin (2025). Data Sheet 2_Large language models generating synthetic clinical datasets: a feasibility and comparative analysis with real-world perioperative data.xlsx [Dataset]. http://doi.org/10.3389/frai.2025.1533508.s002
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    Feb 5, 2025
    Dataset provided by
    Frontiers
    Authors
    Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    BackgroundClinical data is instrumental to medical research, machine learning (ML) model development, and advancing surgical care, but access is often constrained by privacy regulations and missing data. Synthetic data offers a promising solution to preserve privacy while enabling broader data access. Recent advances in large language models (LLMs) provide an opportunity to generate synthetic data with reduced reliance on domain expertise, computational resources, and pre-training.ObjectiveThis study aims to assess the feasibility of generating realistic tabular clinical data with OpenAI’s GPT-4o using zero-shot prompting, and evaluate the fidelity of LLM-generated data by comparing its statistical properties to the Vital Signs DataBase (VitalDB), a real-world open-source perioperative dataset.MethodsIn Phase 1, GPT-4o was prompted to generate a dataset with qualitative descriptions of 13 clinical parameters. The resultant data was assessed for general errors, plausibility of outputs, and cross-verification of related parameters. In Phase 2, GPT-4o was prompted to generate a dataset using descriptive statistics of the VitalDB dataset. Fidelity was assessed using two-sample t-tests, two-sample proportion tests, and 95% confidence interval (CI) overlap.ResultsIn Phase 1, GPT-4o generated a complete and structured dataset comprising 6,166 case files. The dataset was plausible in range and correctly calculated body mass index for all case files based on respective heights and weights. Statistical comparison between the LLM-generated datasets and VitalDB revealed that Phase 2 data achieved significant fidelity. Phase 2 data demonstrated statistical similarity in 12/13 (92.31%) parameters, whereby no statistically significant differences were observed in 6/6 (100.0%) categorical/binary and 6/7 (85.71%) continuous parameters. Overlap of 95% CIs were observed in 6/7 (85.71%) continuous parameters.ConclusionZero-shot prompting with GPT-4o can generate realistic tabular synthetic datasets, which can replicate key statistical properties of real-world perioperative data. This study highlights the potential of LLMs as a novel and accessible modality for synthetic data generation, which may address critical barriers in clinical data access and eliminate the need for technical expertise, extensive computational resources, and pre-training. Further research is warranted to enhance fidelity and investigate the use of LLMs to amplify and augment datasets, preserve multivariate relationships, and train robust ML models.

  8. T

    gpt3

    • tensorflow.org
    Updated Dec 19, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2023). gpt3 [Dataset]. https://www.tensorflow.org/datasets/catalog/gpt3
    Explore at:
    Dataset updated
    Dec 19, 2023
    Description

    Synthetic datasets for word scramble and arithmetic tasks described in the GPT3 paper.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('gpt3', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more informations on tensorflow_datasets.

  9. h

    GPT4-8K

    • huggingface.co
    Updated Jan 6, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Erfan zare chavoshi (2024). GPT4-8K [Dataset]. https://huggingface.co/datasets/erfanzar/GPT4-8K
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 6, 2024
    Authors
    Erfan zare chavoshi
    Description

    Dataset Card for "GPT4-8K"

    Sure! Here's a README.md file for your dataset:

      Dataset Description
    

    This dataset was generated using GPT-4, a powerful language model developed by OpenAI. It contains a collection of dialogs between a user and an assistant, along with additional information. from OpenChat

      Dataset Configurations
    

    The dataset includes the following configurations:

    Config Name: default

    Data Files: Split: train Path: data/train-*

      Dataset… See the full description on the dataset page: https://huggingface.co/datasets/erfanzar/GPT4-8K.
    
  10. LLM - Detect AI Datamix

    • kaggle.com
    zip
    Updated Jan 19, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Raja Biswas (2024). LLM - Detect AI Datamix [Dataset]. https://www.kaggle.com/datasets/conjuring92/ai-mix-v26
    Explore at:
    zip(172818297 bytes)Available download formats
    Dataset updated
    Jan 19, 2024
    Authors
    Raja Biswas
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This is the datamix created by Team 🔍 📝 🕵️‍♂️ 🤖 during the LLM - Detect AI Generated Text competition. This dataset helped us to win the competition. It facilitates a text-classification task to separate LLM generate essays from the student written ones.

    It was developed in an incremental way focusing on size, diversity and complexity. For each datamix iteration, we attempted to plug blindspots of the previous generation models while maintaining robustness.

    To maximally leverage in-domain human texts, we used the entire Persuade corpus comprising all 15 prompts. We also included diverse human texts from sources such as OpenAI GPT2 output dataset, ELLIPSE corpus, NarrativeQA, wikipedia, NLTK Brown corpus and IMDB movie reviews.

    Sources for our generated essays can be grouped under four categories: - Proprietary LLMs (gpt-3.5, gpt-4, claude, cohere, gemini, palm) - Open source LLMs (llama, falcon, mistral, mixtral) - Existing LLM generated text datasets - Synthetic dataset made by T5 - DAIGT V2 subset - OUTFOX - Ghostbuster - gpt-2-output-dataset

    • Fine-tuned open-source LLMs (mistral, llama, falcon, deci-lm, t5, pythia, OPT, BLOOM, GPT2). For LLM fine-tuning, we leveraged the PERSUADE corpus in different ways:
      • Instruction tuning: Instructions were composed of different metadata e.g. prompt name, holistic essay score, ELL status and grade level. Responses were the corresponding student essays.
      • One topic held out: LLMs fine-tuned on PERSUADE essays with one prompt held out. When generating, only the held out prompt essays were generated. This was done to encourage new writing styles.
      • Span wise generation: Generate one span (discourse) at a time conditioned on the remaining essay.

    We used a wide variety of generation configs and prompting strategies to promote diversity & complexity to the data. Generated essays leveraged a combination of the following: - Contrastive search - Use of Guidance scale, typical_p, suppress_tokens - High temperature & large values of top-k - Prompting to fill-in-the-blank: randomly mask words in an essay and asking LLM to reconstruct the original essay (similar to MLM) - Prompting without source texts - Prompting with source texts - Prompting to rewrite existing essays

    Finally, we incorporated augmented essays to make our models aware of typical attacks on LLM content detection systems and obfuscations present in the provided training data. We mainly used a combination of the following augmentations on a random subset of essays: - Spelling correction - Deletion/insertion/swapping of characters - Replacement with synonym - Introduce obfuscations - Back translation - Random capitalization - Swap sentence

  11. h

    llm-training-dataset

    • huggingface.co
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Unidata, llm-training-dataset [Dataset]. https://huggingface.co/datasets/UniDataPro/llm-training-dataset
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Authors
    Unidata
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    LLM Fine-Tuning Dataset - 4,000,000+ logs, 32 languages

    The dataset contains over 4 million+ logs written in 32 languages and is tailored for LLM training. It includes log and response pairs from 3 models, and is designed for language models and instruction fine-tuning to achieve improved performance in various NLP tasks - Get the data

      Models used for text generation:
    

    GPT-3.5 GPT-4 Uncensored GPT Version (is not included inthe sample)

      Languages in the… See the full description on the dataset page: https://huggingface.co/datasets/UniDataPro/llm-training-dataset.
    
  12. S

    Test dataset of ChatGPT in medical field

    • scidb.cn
    Updated Mar 3, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    robin shen (2023). Test dataset of ChatGPT in medical field [Dataset]. http://doi.org/10.57760/sciencedb.o00130.00001
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 3, 2023
    Dataset provided by
    Science Data Bank
    Authors
    robin shen
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The researcher tests the QA capability of ChatGPT in the medical field from the following aspects:1. Test their reserve capacity for medical knowledge2. Check their ability to read literature and understand medical literature3. Test their ability of auxiliary diagnosis after reading case data4. Test its error correction ability for case data5. Test its ability to standardize medical terms6. Test their evaluation ability to experts7. Check their ability to evaluate medical institutionsThe conclusion is:ChatGPT has great potential in the application of medical and health care, and may directly replace human beings or even professionals at a certain level in some fields;The researcher preliminarily believe that ChatGPT has basic medical knowledge and the ability of multiple rounds of dialogue, and its ability to understand Chinese is not weak;ChatGPT has the ability to read, understand and correct cases;ChatGPT has the ability of information extraction and terminology standardization, and is quite excellent;ChatGPT has the reasoning ability of medical knowledge;ChatGPT has the ability of continuous learning. After continuous training, its level has improved significantly;ChatGPT does not have the academic evaluation ability of Chinese medical talents, and the results are not ideal;ChatGPT does not have the academic evaluation ability of Chinese medical institutions, and the results are not ideal;ChatGPT is an epoch-making product, which can become a useful assistant for medical diagnosis and treatment, knowledge service, literature reading, review and paper writing.

  13. Deep Learning Tutor Dataset

    • kaggle.com
    zip
    Updated Aug 12, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    monkwarrior08 (2025). Deep Learning Tutor Dataset [Dataset]. https://www.kaggle.com/datasets/monkwarrior08/deep-learning-tutor-dataset
    Explore at:
    zip(120655 bytes)Available download formats
    Dataset updated
    Aug 12, 2025
    Authors
    monkwarrior08
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dive into the future of education with the Deep Learning Tutor Dataset – a pioneering resource designed to empower the creation of sophisticated, adaptive AI tutors. This dataset is meticulously curated to facilitate the fine-tuning of advanced large language models like GPT-4o, enabling them to internalize specialized pedagogical conversation patterns and expert teaching methodologies.

    This collection represents a significant step towards developing intelligent educational systems that can truly adapt to individual student needs, provide nuanced feedback, and foster deeper understanding. By leveraging the power of deep learning and state-of-the-art LLMs, this dataset paves the way for a new generation of personalized learning experiences.

    Key Features & Contents:

    • Specialized Pedagogical Conversation Data: An extensive collection of educational dialogue, carefully structured to represent effective tutoring interactions. This includes examples of:
      • Expert Explanations: Clear, concise, and multi-faceted explanations of complex concepts.
      • Adaptive Feedback: Responses tailored to student understanding levels, common misconceptions, and learning styles.
      • Guided Inquiry: Dialogue patterns that encourage critical thinking and problem-solving.
      • Conceptual Clarification: Interactions focused on identifying and addressing misunderstandings.
      • Motivational Prompts: Examples of how to engage and encourage learners.
    • Structured for Fine-tuning GPT-4o: The dataset is provided in a format optimized for fine-tuning OpenAI's GPT-4o, allowing the model to go beyond general knowledge and adopt a truly pedagogical persona.
    • Foundational for Adaptive Tutoring Systems: This data is the bedrock for training AI systems that can dynamically adjust their teaching approach based on student performance, engagement, and learning progress.

    Applications:

    • Building Next-Generation AI Tutors: Develop intelligent tutors capable of empathetic, effective, and adaptive teaching.
    • Research in AI in Education (AIEd): A valuable resource for researchers exploring the application of LLMs in educational contexts, dialogue systems, and personalized learning.
    • Enhancing E-Learning Platforms: Integrate AI-driven tutoring capabilities into existing or new online learning environments.
    • Developing Conversational AI for Learning: Train models to understand and generate educational dialogues that mimic expert human tutors.
    • Personalized Learning Initiatives: Contribute to systems that offer highly individualized learning paths and support.

    How to Leverage This Dataset: Fine-tuning Your AI Tutor

    The primary utility of this dataset is to fine-tune a powerful LLM like GPT-4o, imbuing it with the specific conversational and pedagogical skills required for adaptive tutoring.

    Prerequisites: * An OpenAI account with API access. * Familiarity with the OpenAI Platform and fine-tuning concepts.

    Step 1: Download the Dataset Download the educational_conversation_data.jsonl file from this Kaggle dataset.

    Step 2: Initiate GPT-4o Fine-tuning This process will train GPT-4o to emulate the expert teaching methodologies embedded within the dataset. 1. Upload Data: Navigate to the "Fine-tuning" section in your OpenAI Platform. Upload the educational_conversation_data.jsonl file. 2. Create Fine-tuning Job: * Base Model: gpt-4o (or gpt-4o-mini for more cost-effective experimentation). * Epochs: 3 (A common starting point; adjust based on dataset size and desired performance). * Learning Rate Multiplier: 2 (A good initial value; can be tuned). * Batch Size: 1 (Often effective for pedagogical data, but can be adjusted). * Note: These parameters are recommendations. Experimentation may be required to achieve optimal results for your specific application. 3. Start Job: Initiate the fine-tuning process. Once complete, you will receive a new custom model ID, representing your fine-tuned pedagogical AI.

    Step 3: Integrate Your Fine-tuned Model The fine-tuned model ID can now be used with OpenAI's API to power your adaptive AI tutor. You can integrate it into: * A custom chat interface. * An existing educational platform. * A research prototype for conversational AI in education.

    Files in This Dataset:

    • educational_conversation_data.jsonl: The core dataset containing the specialized pedagogical conversation patterns and expert teaching methodologies, formatted for OpenAI fine-tuning.
    • README.md: (Optional, but good practice) A brief overview of the dataset and usage.
  14. h

    Bitext-travel-llm-chatbot-training-dataset

    • huggingface.co
    Updated Jun 21, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Bitext (2025). Bitext-travel-llm-chatbot-training-dataset [Dataset]. https://huggingface.co/datasets/bitext/Bitext-travel-llm-chatbot-training-dataset
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 21, 2025
    Dataset authored and provided by
    Bitext
    License

    https://choosealicense.com/licenses/cdla-sharing-1.0/https://choosealicense.com/licenses/cdla-sharing-1.0/

    Description

    Bitext - Travel Tagged Training Dataset for LLM-based Virtual Assistants

      Overview
    

    This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [Travel] sector can be easily achieved using our two-step approach to LLM Fine-Tuning. An overview of… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-travel-llm-chatbot-training-dataset.

  15. p

    RydbergGPT data for quantum computing

    • pennylane.ai
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    David Fitzek; Yi Hong Teoh; Hin Pok Fung; Gebremedhin A. Dagnew; Ejaaz Merali; M. Schuyler Moss; Benjamin Maclellan; Roger G. Melko, RydbergGPT data for quantum computing [Dataset]. https://pennylane.ai/datasets/rydberggpt
    Explore at:
    Authors
    David Fitzek; Yi Hong Teoh; Hin Pok Fung; Gebremedhin A. Dagnew; Ejaaz Merali; M. Schuyler Moss; Benjamin Maclellan; Roger G. Melko
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Measurement technique
    Simulation
    Dataset funded by
    Xanadu Quantum Technologies
    Description

    This dataset contains the data used to train RydbergGPT, a generative pre-trained transformer designed to learn the measurement outcomes of a neutral atom array quantum computer.

  16. R

    Open Poetry Vision Object Detection Dataset - 512x512

    • public.roboflow.com
    zip
    Updated Apr 7, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Brad Dwyer (2022). Open Poetry Vision Object Detection Dataset - 512x512 [Dataset]. https://public.roboflow.com/object-detection/open-poetry-vision/1
    Explore at:
    zipAvailable download formats
    Dataset updated
    Apr 7, 2022
    Dataset authored and provided by
    Brad Dwyer
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Bounding Boxes of text
    Description

    Overview

    The Open Poetry Vision dataset is a synthetic dataset created by Roboflow for OCR tasks.

    It combines a random image from the Open Images Dataset with text primarily sampled from Gwern's GPT-2 Poetry project. Each image in the dataset contains between 1 and 5 strings in a variety of fonts and colors randomly positioned in the 512x512 canvas. The classes correspond to the font of the text.

    Example Image: https://i.imgur.com/sZT516a.png" alt="Example Image">

    Use Cases

    A common OCR workflow is to use a neural network to isolate text for input into traditional optical character recognition software. This dataset could make a good starting point for an OCR project like business card parsing or automated paper form-processing.

    Alternatively, you could try your hand using this as a neural font identification dataset. Nvidia, amongst others, have had success with this task.

    Using this Dataset

    Use the fork button to copy this dataset to your own Roboflow account and export it with new preprocessing settings (perhaps resized for your model's desired format or converted to grayscale), or additional augmentations to make your model generalize better. This particular dataset would be very well suited for Roboflow's new advanced Bounding Box Only Augmentations.

    Version 5 of this dataset (classes_all_text-raw-images) has all classes remapped to be labeled as "text." This was accomplished by using Modify Classes as a preprocessing step.

    Version 6 of this dataset (classes_all_text-augmented-FAST) has all classes remapped to be labeled as "text." and was trained with Roboflow's Fast Model.

    Version 7 of this dataset (classes_all_text-augmented-ACCURATE) has all classes remapped to be labeled as "text." and was trained with Roboflow's Accurate Model.

    About Roboflow

    Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless.

    Developers reduce 50% of their code when using Roboflow's workflow, automate annotation quality assurance, save training time, and increase model reproducibility.

    Roboflow Workmark

  17. train for gpt

    • kaggle.com
    zip
    Updated May 1, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    qiancai314 (2023). train for gpt [Dataset]. https://www.kaggle.com/datasets/qiancai314/train-for-gpt
    Explore at:
    zip(36343309 bytes)Available download formats
    Dataset updated
    May 1, 2023
    Authors
    qiancai314
    Description

    Dataset

    This dataset was created by qiancai314

    Contents

  18. h

    TinyStories

    • huggingface.co
    • opendatalab.com
    • +1more
    Updated May 22, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ronen Eldan (2023). TinyStories [Dataset]. https://huggingface.co/datasets/roneneldan/TinyStories
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 22, 2023
    Authors
    Ronen Eldan
    License

    https://choosealicense.com/licenses/cdla-sharing-1.0/https://choosealicense.com/licenses/cdla-sharing-1.0/

    Description

    Dataset containing synthetically generated (by GPT-3.5 and GPT-4) short stories that only use a small vocabulary. Described in the following paper: https://arxiv.org/abs/2305.07759. The models referred to in the paper were trained on TinyStories-train.txt (the file tinystories-valid.txt can be used for validation loss). These models can be found on Huggingface, at roneneldan/TinyStories-1M/3M/8M/28M/33M/1Layer-21M. Additional resources: tinystories_all_data.tar.gz - contains a superset of… See the full description on the dataset page: https://huggingface.co/datasets/roneneldan/TinyStories.

  19. MULTITuDEv2

    • data.europa.eu
    • data.niaid.nih.gov
    unknown
    Updated Jul 3, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Zenodo (2025). MULTITuDEv2 [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-13846588?locale=hu
    Explore at:
    unknownAvailable download formats
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Zenodohttp://zenodo.org/
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    MULTITuDEv2 is a dataset for multilingual machine-generated text detection benchmark, described in the EMNLP 2023 conference paper. It consists of 7992 human-written news texts in 11 languages subsampled from MassiveSumm, accompanied by 66089 texts generated by 8 large language models (by using headlines of news articles). The creation process and scripts for replication/extension are located in a GitHub repository. The dataset has been further extended in v2 by obfuscated texts using 10 authorship obfuscation methods, described in the EMNL 2024 Findings conference paper. If you use this dataset in any publication, project, tool or in any other form, please, cite the paper. Files The v2 of the dataset consists of multiple files. 'multitude.csv' contains original v1 of the dataset (i.e., without the field 'generated'). The other files contains also the 'generated' field (as described below) and are compressed by GZIP. The file 'multitude_obfuscated_original.csv.gz' contains copies of the 'text' field in the 'generated' field to be compatible with files with the obfuscated texts (used as such in the experiments). Fields The dataset has the following fields: 'text' - an original (unobfuscated) text sample, 'label' - 0 for human-written text, 1 for machine-generated text, 'multi_label' - a string representing a large language model that generated the text or the string "human" representing a human-written text, 'split' - a string identifying train or test split of the dataset for the purpose of training and evaluation respectively, 'language' - the ISO 639-1 language code identifying the language of the given text, 'length' - word count of the given text, 'source' - a string identifying the source dataset / news medium of the given text, 'generated' - an obfuscated text sample (i.e., transformed from original text by the obfuscator indicated by the corresponding filename) Note: some obfuscated text in the 'generated' field are the same as in the 'text' field, indicating failure of the obfuscator to modify the text. Human-written obfuscated texts are also included; however, labels of their originals might be no longer relevant for them (i.e., human-written text obfuscated by a machine could be considered as machine-generated as well); thus, consider this in your research. Statistics (the number of samples) Splits: train - 44786 test - 29295 Binary labels: 0 - 7992 1 - 66089 Multiclass labels: gpt-3.5-turbo - 8300 gpt-4 - 8300 text-davinci-003 - 8297 alpaca-lora-30b - 8290 vicuna-13b - 8287 opt-66b - 8229 llama-65b - 8229 opt-iml-max-1.3b - 8157 human - 7992 Languages: English (en) - 29460 (train + test) Spanish (es) - 11586 (train + test) Russian (ru) - 11578 (train + test) Dutch (nl) - 2695 (test) Catalan (ca) - 2691 (test) Czech (cs) - 2689 (test) German (de) - 2685 (test) Chinese (zh) - 2683 (test) Portuguese (pt) - 2673 (test) Arabic (ar) - 2673 (test) Ukrainian (uk) - 2668 (test)

  20. f

    Data from: Uncertainty-Informed Screening for Safer Solvents Used in the...

    • figshare.com
    • acs.figshare.com
    xlsx
    Updated Jul 22, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Arpan Mukherjee; Deepesh Giri; Krishna Rajan (2025). Uncertainty-Informed Screening for Safer Solvents Used in the Synthesis of Perovskites via Language Models [Dataset]. http://doi.org/10.1021/acs.jcim.5c00612.s002
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    Jul 22, 2025
    Dataset provided by
    ACS Publications
    Authors
    Arpan Mukherjee; Deepesh Giri; Krishna Rajan
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Automated data curation for niche scientific topics, where data quality and contextual accuracy are paramount, poses significant challenges. Bidirectional contextual models such as BERT and ELMo excel in contextual understanding and determinism. However, they are constrained by their narrower training corpora and inability to synthesize information across fragmented or sparse contexts. Conversely, autoregressive generative models like GPT can synthesize dispersed information by leveraging broader contextual knowledge and yet often generate plausible but incorrect (“hallucinated”) information. To address these complementary limitations, we propose an ensemble approach that combines the deterministic precision of BERT/ELMo with the contextual depth of GPT. We have developed a hierarchical knowledge extraction framework to identify perovskites and their associated solvents in perovskite synthesis, progressing from broad topics to narrower details using two complementary methods. The first method leverages deterministic models like BERT/ELMo for precise entity extraction, while the second employs GPT for broader contextual synthesis and generalization. Outputs from both methods are validated through structure-matching and entity normalization, ensuring consistency and traceability. In the absence of benchmark data sets for this domain, we hold out a subset of papers for manual verification to serve as a reference set for tuning the rules for entity normalization. This enables quantitative evaluation of model precision, recall, and structural adherence while also providing a grounded estimate of model confidence. By intersecting the outputs from both methods, we generate a list of solvents with maximum confidence, combining precision with contextual depth to ensure accuracy and reliability. This approach increases precision at the expense of recalla trade-off we accept given that, in high-trust scientific applications, minimizing hallucinations is often more critical than achieving full coverage, especially when downstream reliability is paramount. As a case study, the curated data set is used to predict the endocrine-disrupting (ED) potential of solvents with a pretrained deep learning model. Recognizing that machine learning models may not be trained on niche data sets such as perovskite-related solvents, we have quantified epistemic uncertainty using Shannon entropy. This measure evaluates the confidence of the ML model predictions, independent of uncertainties in the NLP-based data curation process, and identifies high-risk solvents requiring further validation. Additionally, the manual verification pipeline addresses ethical considerations around trust, structure, and transparency in AI-curated data sets.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Radek Osmulski (2023). 📊 6.5k train examples for LLM Science Exam 📝 [Dataset]. https://www.kaggle.com/datasets/radek1/additional-train-data-for-llm-science-exam
Organization logo

📊 6.5k train examples for LLM Science Exam 📝

Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 22, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Radek Osmulski
Description

I created this dataset using gpt-3.5-turbo.

I put a lot of effort into making this dataset high quality which allows you to achieve the highest score among the publicly available notebooks available at the moment! 🥳

Originally, I only uploaded 500 examples (they were used as train data in the notebook I mention above). They are stored in extra_train_set.csv.

I am now uploading another 6k (6000_train_examples.csv) completely new train examples which brings the total to 6.5k.

If you find this dataset useful, please leave an upvote! 😊 Thank you! 🙏🙏🙏

Search
Clear search
Close search
Google apps
Main menu