34 datasets found
  1. teknium-GPT4-LLM-Cleaned

    • huggingface.co
    Updated Aug 27, 2024
    Cite
    Post-training-Data-Flywheel (2024). teknium-GPT4-LLM-Cleaned [Dataset]. https://huggingface.co/datasets/Post-training-Data-Flywheel/teknium-GPT4-LLM-Cleaned
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 27, 2024
    Dataset authored and provided by
    Post-training-Data-Flywheel
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Post-training-Data-Flywheel/teknium-GPT4-LLM-Cleaned dataset hosted on Hugging Face and contributed by the HF Datasets community

  2. All GPT-4 Conversations

    • kaggle.com
    Updated Nov 21, 2023
    Cite
    The Devastator (2023). All GPT-4 Conversations [Dataset]. https://www.kaggle.com/datasets/thedevastator/all-gpt-4-synthetic-chat-datasets
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    All GPT-4 Generated Datasets

    Every chat dataset generated by GPT-4 from Hugging Face, converted to the same format

    From [Huggingface datasets]

    About this dataset

    How to use the dataset

    The dataset includes all chat conversations generated by GPT-4 that are hosted on open Huggingface datasets. Everything is converted to the same format so the datasets can be easily merged and used for large scale training of LLMs.
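
    Since all sub-datasets share one schema, merging them is a simple concatenation. A minimal sketch, assuming each sub-dataset ships as a CSV with identical columns (the directory and file names below are hypothetical):

    import glob
    import pandas as pd

    # Every per-source CSV uses the same columns, so they can be stacked directly.
    frames = [pd.read_csv(path) for path in glob.glob("gpt4_chats/*.csv")]
    merged = pd.concat(frames, ignore_index=True)
    print(len(merged), "conversations in the merged training table")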

    Acknowledgements

    This dataset is a collection of several single chat datasets. If you use this dataset in your research, please credit the original authors of the internal datasets.

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

  3. gpt4-self-instruct

    • huggingface.co
    Updated Aug 27, 2024
    Cite
    Post-training-Data-Flywheel (2024). gpt4-self-instruct [Dataset]. https://huggingface.co/datasets/Post-training-Data-Flywheel/gpt4-self-instruct
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 27, 2024
    Dataset authored and provided by
    Post-training-Data-Flywheel
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Post-training-Data-Flywheel/gpt4-self-instruct dataset hosted on Hugging Face and contributed by the HF Datasets community

  4. GPT4-8K

    • huggingface.co
    Updated Jan 6, 2024
    Cite
    Erfan zare chavoshi (2024). GPT4-8K [Dataset]. https://huggingface.co/datasets/erfanzar/GPT4-8K
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 6, 2024
    Authors
    Erfan zare chavoshi
    Description

    Dataset Card for "GPT4-8K"

      Dataset Description
    

    This dataset was generated using GPT-4, a powerful language model developed by OpenAI. It contains a collection of dialogs between a user and an assistant, along with additional information, sourced from OpenChat.

      Dataset Configurations
    

    The dataset includes the following configurations:

    Config Name: default

    Data Files: Split: train Path: data/train-*

      Dataset… See the full description on the dataset page: https://huggingface.co/datasets/erfanzar/GPT4-8K.
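
    As a usage note, a minimal sketch of loading the default configuration's train split with the Hugging Face datasets library (assuming the standard load_dataset API):

    from datasets import load_dataset

    # Loads the "default" config, train split (data/train-*) as listed above.
    ds = load_dataset("erfanzar/GPT4-8K", split="train")
    print(ds)      # features and row count
    print(ds[0])   # first dialog record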
    
  5. daigt-v3-train-dataset

    • kaggle.com
    zip
    Updated Dec 28, 2023
    Cite
    Darek Kłeczek (2023). daigt-v3-train-dataset [Dataset]. https://www.kaggle.com/datasets/thedrcat/daigt-v3-train-dataset
    Explore at:
    Available download formats: zip (86685168 bytes)
    Dataset updated
    Dec 28, 2023
    Authors
    Darek Kłeczek
    Description

    New release of DAIGT train dataset! New models: 'text-ada-001', 'text-babbage-001', 'text-curie-001', 'text-davinci-001', 'text-davinci-002', 'text-davinci-003'

    These models from OpenAI are getting deprecated, so I made sure to generate some essays with them and share them here. I also added the following public datasets (please upvote!):

    - https://www.kaggle.com/datasets/phanisrikanth/daigt-essays-from-intel-neural-chat-7b
    - https://www.kaggle.com/datasets/carlmcbrideellis/llm-mistral-7b-instruct-texts
    - https://www.kaggle.com/datasets/nbroad/daigt-data-llama-70b-and-falcon180b
    - https://www.kaggle.com/datasets/snassimr/gpt4-rephrased-llm-daigt-dataset

    All merged with my previous dataset for convenience (https://www.kaggle.com/datasets/thedrcat/daigt-v2-train-dataset)

    Enjoy ❤️

    Version 2 update:

    - removed NaNs and duplicated/short generations
    - applied cleaning procedure from @nbroad's notebook - give it an upvote please!
    - added model column to indicate model family used in generations

  6. Model Output of GPT-3.5 and GPT-4 for ECHR-AM

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    Updated Dec 13, 2024
    Cite
    Zubaer, Abdullah Al; Granitzer, Michael; Mitrović, Jelena (2024). Model Output of GPT-3.5 and GPT-4 for ECHR-AM [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_8246128
    Explore at:
    Dataset updated
    Dec 13, 2024
    Dataset provided by
    University of Passau
    University of Passau | Institute for Artificial Intelligence Research and Development of Serbia, Novi Sad, Serbia.
    Authors
    Zubaer, Abdullah Al; Granitzer, Michael; Mitrović, Jelena
    Description

    "gpt3.5-gpt4-input-output-echram.zip" :

    Inputs to and outputs from GPT-3.5 and GPT-4, based on the ECHR dataset published in JSON format in this paper, for argument component classification only, i.e. clauses that are argumentative (conclusion/premise), extracted from the JSON file.

    Note: Output of the model is under OpenAI Terms & policies.

    Please also cite our paper if you use this dataset: Performance analysis of large language models in the domain of legal argument mining.

    You can copy the BibTeX below.

    @ARTICLE{10.3389/frai.2023.1278796,
      AUTHOR={Al Zubaer, Abdullah and Granitzer, Michael and Mitrović, Jelena},
      TITLE={Performance analysis of large language models in the domain of legal argument mining},
      JOURNAL={Frontiers in Artificial Intelligence},
      VOLUME={6},
      YEAR={2023},
      URL={https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2023.1278796},
      DOI={10.3389/frai.2023.1278796},
      ISSN={2624-8212},
      ABSTRACT={Generative pre-trained transformers (GPT) have recently demonstrated excellent performance in various natural language tasks. The development of ChatGPT and the recently released GPT-4 model has shown competence in solving complex and higher-order reasoning tasks without further training or fine-tuning. However, the applicability and strength of these models in classifying legal texts in the context of argument mining are yet to be realized and have not been tested thoroughly. In this study, we investigate the effectiveness of GPT-like models, specifically GPT-3.5 and GPT-4, for argument mining via prompting. We closely study the model's performance considering diverse prompt formulation and example selection in the prompt via semantic search using state-of-the-art embedding models from OpenAI and sentence transformers. We primarily concentrate on the argument component classification task on the legal corpus from the European Court of Human Rights. To address these models' inherent non-deterministic nature and make our result statistically sound, we conducted 5-fold cross-validation on the test set. Our experiments demonstrate, quite surprisingly, that relatively small domain-specific models outperform GPT 3.5 and GPT-4 in the F1-score for premise and conclusion classes, with 1.9% and 12% improvements, respectively. We hypothesize that the performance drop indirectly reflects the complexity of the structure in the dataset, which we verify through prompt and data analysis. Nevertheless, our results demonstrate a noteworthy variation in the performance of GPT models based on prompt formulation. We observe comparable performance between the two embedding models, with a slight improvement in the local model's ability for prompt selection. This suggests that local models are as semantically rich as the embeddings from the OpenAI model. Our results indicate that the structure of prompts significantly impacts the performance of GPT models and should be considered when designing them.}
    }

  7. airoboros-gpt4

    • huggingface.co
    Updated Jun 4, 2023
    + more versions
    Cite
    Jon Durbin (2023). airoboros-gpt4 [Dataset]. https://huggingface.co/datasets/jondurbin/airoboros-gpt4
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 4, 2023
    Authors
    Jon Durbin
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The data was generated by GPT-4 and is therefore subject to the OpenAI ToS. The tool used to generate the data, airoboros, is Apache-2.0 licensed. Specific areas of focus for this training data:

    - trivia
    - math
    - nonsensical math
    - coding
    - closed context question answering
    - closed context question answering, with multiple contexts to choose from as confounding factors
    - writing
    - multiple choice

      Usage and License Notices
    

    All airoboros models and datasets are intended and licensed for research use only.… See the full description on the dataset page: https://huggingface.co/datasets/jondurbin/airoboros-gpt4.

  8. DAIGT V2 Train Dataset

    • kaggle.com
    zip
    Updated Nov 16, 2023
    Cite
    Darek Kłeczek (2023). DAIGT V2 Train Dataset [Dataset]. https://www.kaggle.com/datasets/thedrcat/daigt-v2-train-dataset
    Explore at:
    Available download formats: zip (29923908 bytes)
    Dataset updated
    Nov 16, 2023
    Authors
    Darek Kłeczek
    Description

    Please use version 2 (there were some issues with v1 that I fixed)!

    New release of DAIGT train dataset! Improvements:

    - new models: Cohere Command, Google Palm, GPT4 (from Radek!)
    - new prompts, including source texts from the original essays!
    - mapping of essay text to original prompt from persuade corpus
    - filtering by the famous "RDizzl3_seven"

    persuade_corpus            25996
    chat_gpt_moth             2421
    llama2_chat              2421
    mistral7binstruct_v2          2421
    mistral7binstruct_v1          2421
    original_moth             2421
    train_essays              1378
    llama_70b_v1              1172
    falcon_180b_v1             1055
    darragh_claude_v7           1000
    darragh_claude_v6           1000
    radek_500                500
    NousResearch/Llama-2-7b-chat-hf     400
    mistralai/Mistral-7B-Instruct-v0.1   400
    cohere-command             350
    palm-text-bison1            349
    radekgpt4                200
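
    The table above reads like a per-source frequency count; a minimal sketch of reproducing it with pandas, assuming the training CSV exposes a column naming the generation source (the file and column names here are assumptions for illustration):

    import pandas as pd

    df = pd.read_csv("train_v2_drcat_02.csv")   # hypothetical file name inside the dataset
    print(df["source"].value_counts())          # per-source counts, as listed above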
    

    Sources (please upvote the original datasets!):

    - Text generated with ChatGPT by MOTH (https://www.kaggle.com/datasets/alejopaullier/daigt-external-dataset)
    - Persuade corpus contributed by Nicholas Broad (https://www.kaggle.com/datasets/nbroad/persaude-corpus-2/)
    - Text generated with Llama-70b and Falcon180b by Nicholas Broad (https://www.kaggle.com/datasets/nbroad/daigt-data-llama-70b-and-falcon180b)
    - Text generated with ChatGPT and GPT4 by Radek (https://www.kaggle.com/datasets/radek1/llm-generated-essays)
    - 2000 Claude essays generated by @darraghdog (https://www.kaggle.com/datasets/darraghdog/hello-claude-1000-essays-from-anthropic)
    - LLM-generated essay using PaLM from Google Gen-AI by @kingki19 (https://www.kaggle.com/datasets/kingki19/llm-generated-essay-using-palm-from-google-gen-ai)
    - Official train essays
    - Essays I generated with various LLMs

    License: MIT for the data I generated. Check source datasets for the other sources mentioned above.

  9. GPT-4-Prompts

    • huggingface.co
    Updated Dec 22, 2024
    Cite
    Erfan zare chavoshi (2024). GPT-4-Prompts [Dataset]. https://huggingface.co/datasets/erfanzar/GPT-4-Prompts
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 22, 2024
    Authors
    Erfan zare chavoshi
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Multi-Turn Conversational Prompts from ChatGPT-4 (10K+ Tokens) Abstract: This dataset offers a valuable collection of multi-turn conversational prompts generated by ChatGPT-4, carefully curated for diverse prompt styles (chatml, gemma, llama). Each prompt exceeds 10,000 tokens, providing ample context and inspiration for training and evaluating large language models. Ideal for researchers and developers interested in exploring advanced conversational AI capabilities. Table of Contents:… See the full description on the dataset page: https://huggingface.co/datasets/erfanzar/GPT-4-Prompts.

  10. Experimental results (%) of applying lossST, lossLM, and lossIE in one...

    • figshare.com
    • plos.figshare.com
    xls
    Updated Aug 11, 2025
    Cite
    Yuting Bai; Tonghua Su; Zixing Bai (2025). Experimental results (%) of applying lossST, lossLM, and lossIE in one training stage and segmented training. [Dataset]. http://doi.org/10.1371/journal.pone.0329590.t005
    Explore at:
    Available download formats: xls
    Dataset updated
    Aug 11, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Yuting Bai; Tonghua Su; Zixing Bai
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Experimental results (%) of applying lossST, lossLM, and lossIE in one training stage and segmented training.

  11. GPT4-LLM-Cleaned

    • huggingface.co
    • opendatalab.com
    Updated May 15, 2023
    Cite
    Teknium (2023). GPT4-LLM-Cleaned [Dataset]. https://huggingface.co/datasets/teknium/GPT4-LLM-Cleaned
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 15, 2023
    Authors
    Teknium
    Description

    This is the GPT4-LLM dataset from: https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM It has been filtered of all OpenAI disclaimers and refusals. (Disclaimer: it may have removed some additional things besides just OAI disclaimers, as I used the following script, which is a bit broader: https://huggingface.co/datasets/ehartford/WizardLM_alpaca_evol_instruct_70k_unfiltered/blob/main/wizardlm_clean.py) There is a modified script of that in the repo that was used specifically for… See the full description on the dataset page: https://huggingface.co/datasets/teknium/GPT4-LLM-Cleaned.
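
    As an illustration of this kind of disclaimer/refusal filtering, here is a minimal sketch; the phrase list and file name below are assumptions for illustration, not the actual contents of wizardlm_clean.py:

    import json

    # Hypothetical refusal/disclaimer markers; the real cleaning script uses a broader list.
    BLOCKLIST = [
        "as an ai language model",
        "i'm sorry, but i cannot",
        "openai",
    ]

    def is_clean(example: dict) -> bool:
        """Keep an instruction/output pair only if no blocklisted phrase appears."""
        text = (example.get("instruction", "") + " " + example.get("output", "")).lower()
        return not any(phrase in text for phrase in BLOCKLIST)

    with open("alpaca_gpt4_data.json") as f:   # hypothetical input file name
        data = json.load(f)

    cleaned = [ex for ex in data if is_clean(ex)]
    print(len(data), "->", len(cleaned), "examples after filtering")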

  12. Conversations on Coding, Debugging, Storytelling

    • kaggle.com
    zip
    Updated Dec 1, 2023
    Cite
    The Devastator (2023). Conversations on Coding, Debugging, Storytelling [Dataset]. https://www.kaggle.com/datasets/thedevastator/conversations-on-coding-debugging-storytelling-s
    Explore at:
    Available download formats: zip (1371478 bytes)
    Dataset updated
    Dec 1, 2023
    Authors
    The Devastator
    License

    CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Conversations on Coding, Debugging, Storytelling & Science

    By Peevski (From Huggingface) [source]

    About this dataset

    The OpenLeecher/GPT4-10k dataset is a comprehensive collection of 100 diverse conversations, presented in text format and revolving around a wide range of topics. These conversations cover domains such as coding, debugging, storytelling, and science. Aimed at facilitating training and analysis for researchers and developers alike, this dataset offers an extensive array of conversation samples.

    Each conversation delves into a different subject, from coding techniques, debugging strategies, and storytelling methods to concepts like spatial and logical thinking. The conversations also touch on scientific fields including chemistry, physics, and biology, and, to add further depth, include discussions on the topic of law.

    By providing this rich assortment of conversations spanning multiple domains and disciplines in one cohesive dataset (a train.csv file on the Kaggle platform), it lets users explore and analyze these dialogue examples effortlessly. This compilation serves as a valuable resource for understanding coding practices alongside stimulating scientific discussions across multiple fields.

    How to use the dataset

    Introduction:

    • Understanding the Dataset Structure: The dataset consists of a CSV file named 'train.csv'. When examining the file's columns using the software or programming language of your choice (e.g., Python), you will notice the key column 'chat', which contains text data representing conversations between two or more participants.

    • Exploring Different Topics: The dataset covers a vast spectrum of subjects, including coding techniques, debugging strategies, storytelling methods, spatial thinking, logical thinking, chemistry, physics, biology, and law. In particular:

      • Coding Techniques: Discover discussions on various programming concepts and best practices.
      • Debugging Strategies: Explore conversations related to identifying and fixing software issues.
      • Storytelling Methods: Dive into dialogues about effective storytelling techniques in different contexts.
      • Spatial Thinking: Engage with conversations that involve developing spatial reasoning skills for problem-solving.
      • Logical Thinking: Learn from discussions focused on enhancing logical reasoning abilities related to different domains.
      • Chemistry
      • Physics
      • Biology
      • Law
    • Analyzing Conversations: leverage natural language processing (NLP) tools or techniques such as sentiment analysis on the conversations; for a quick start, load train.csv and count the conversations (see the sketch after this list).

    • Accessible Code Examples

    Maximize Training Efficiency:

    • Taking Advantage of Diversity:

    • Creating New Applications:

    Conclusion:
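
    Picking up the analysis bullet above, a minimal sketch of loading train.csv and counting conversations with pandas (assuming the 'chat' column described earlier):

    import pandas as pd

    df = pd.read_csv("train.csv")
    print("Number of Conversations:", len(df))   # corpus size
    print(df["chat"].iloc[0][:500])              # preview the first conversation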

    Research Ideas

    • Natural Language Processing Research: Researchers can leverage this dataset to train and evaluate natural language processing models, particularly in the context of conversational understanding and generation. The diverse conversations on coding, debugging, storytelling, and science can provide valuable insights into modeling human-like conversation patterns.
    • Chatbot Development: The dataset can be utilized for training chatbots or virtual assistants that can engage in conversations related to coding, debugging, storytelling, and science. By exposing the chatbot to a wide range of conversation samples from different domains, developers can ensure that their chatbots are capable of providing relevant and accurate responses.
    • Domain-specific Intelligent Assistants: Organizations or individuals working in fields such as coding education or scientific research may use this dataset to develop intelligent assistants tailored specifically for these domains. These assistants can help users navigate complex topics by answering questions related to coding techniques, debugging strategies, storytelling methods, or scientific concepts. Overall, 'train.csv' provides a rich resource for researchers and developers interested in building conversational AI systems with knowledge across multiple domains, including even legal matters.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    **Li...

  13. dolphin-2.9.3-mistral-nemo-12b

    • kaggle.com
    zip
    Updated Sep 1, 2024
    Cite
    Serhii Kharchuk (2024). dolphin-2.9.3-mistral-nemo-12b [Dataset]. https://www.kaggle.com/datasets/serhiikharchuk/dolphin-2-9-3-mistral-nemo-12b
    Explore at:
    Available download formats: zip (19411245984 bytes)
    Dataset updated
    Sep 1, 2024
    Authors
    Serhii Kharchuk
    Description

    Dolphin 2.9.3 Mistral Nemo 12b 🐬

    Curated and trained by Eric Hartford and Cognitive Computations

    Discord: https://discord.gg/h3K4XGj2RH

    Our appreciation for the sponsors of Dolphin 2.9.3:

    Crusoe Cloud - provided excellent on-demand 8xL40S node
    

    This model is based on mistralai/Mistral-Nemo-Base-2407, and is governed by the apache 2.0 license.

    The base model has 128K context, and our finetuning used 8192 sequence length.

    Dolphin 2.9.3 uses ChatML prompt template format.

    example:

    <|im_start|>system
    You are Dolphin, a helpful AI assistant.<|im_end|>
    <|im_start|>user
    {prompt}<|im_end|>
    <|im_start|>assistant

    Dolphin-2.9.3 has a variety of instruction following, conversational, and coding skills. It also has initial agentic abilities and supports function calling.
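
    A minimal sketch of assembling a single-turn prompt in the ChatML format shown above, using plain string formatting (not tied to any particular inference library):

    def chatml_prompt(prompt: str, system: str = "You are Dolphin, a helpful AI assistant.") -> str:
        """Build a ChatML prompt, leaving the assistant turn open for generation."""
        return (
            f"<|im_start|>system\n{system}<|im_end|>\n"
            f"<|im_start|>user\n{prompt}<|im_end|>\n"
            f"<|im_start|>assistant\n"
        )

    print(chatml_prompt("Summarize the Apache 2.0 license in one sentence."))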

    Dolphin is uncensored. We have filtered the dataset to remove alignment and bias. This makes the model more compliant. You are advised to implement your own alignment layer before exposing the model as a service. It will be highly compliant with any requests, even unethical ones. Please read my blog post about uncensored models. https://erichartford.com/uncensored-models You are responsible for any content you create using this model. Enjoy responsibly.

    Dolphin is licensed under the Apache 2.0 license. We grant permission for any use, including commercial. Dolphin was trained on data generated from GPT4, among other models. Evals: see evals.

    Training

    Built with Axolotl. See the axolotl config.

    Visualize in Weights & Biases: workspace/axolotl/dolphin-2.9.3-mistral-nemo

    This model was trained from scratch on the None dataset. It achieves the following results on the evaluation set:

    Loss: 0.5605
    

    Model description

    More information needed

    Intended uses & limitations

    More information needed

    Training and evaluation data

    More information needed

    Training procedure

    Training hyperparameters

    The following hyperparameters were used during training:

    learning_rate: 5e-06
    train_batch_size: 1
    eval_batch_size: 1
    seed: 42
    distributed_type: multi-GPU
    num_devices: 8
    gradient_accumulation_steps: 16
    total_train_batch_size: 128
    total_eval_batch_size: 8
    optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
    lr_scheduler_type: cosine
    lr_scheduler_warmup_steps: 100
    num_epochs: 3
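    (The total_train_batch_size above follows from the other values: train_batch_size × num_devices × gradient_accumulation_steps = 1 × 8 × 16 = 128.)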
    

    Training results

    Training Loss  Epoch   Step  Validation Loss
    0.5691         1.0162  983   0.5734
    0.5335         2.0174  1968  0.5609
    0.5297         2.9639  2901  0.5605

    Framework versions

    Transformers 4.43.0.dev0
    Pytorch 2.2.2+cu121
    Datasets 2.19.1
    Tokenizers 0.19.1
    
  14. tinystoriesv2_gpt4

    • huggingface.co
    Updated Jan 16, 2024
    Cite
    Haris Jabbar (2024). tinystoriesv2_gpt4 [Dataset]. https://huggingface.co/datasets/maveriq/tinystoriesv2_gpt4
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 16, 2024
    Authors
    Haris Jabbar
    Description

    Prepared dataset from roneneldan/TinyStoriesV2-GPT4

      Data Preparation pipeline.
    

    Download TinyStoriesV2-GPT4-train.txt from https://huggingface.co/datasets/roneneldan/TinyStories/blob/main/TinyStoriesV2-GPT4-train.txt

    from tqdm import tqdm

    raw = open('TinyStoriesV2-GPT4-train.txt').readlines()
    stories = []
    chunk = []  # accumulates lines of the current story (initialization added; the original snippet is truncated)
    for x in tqdm(raw, total=len(raw)):
        if x == ' ':
            continue
        if x.startswith('<|endoftext|>'):
            chunk.append(x.strip())
            stories.append(" ".join(chunk))
    … See the full description on the dataset page: https://huggingface.co/datasets/maveriq/tinystoriesv2_gpt4.

  15. Flan-GPT4

    • huggingface.co
    Updated Jan 6, 2024
    Cite
    Erfan zare chavoshi (2024). Flan-GPT4 [Dataset]. https://huggingface.co/datasets/erfanzar/Flan-GPT4
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 6, 2024
    Authors
    Erfan zare chavoshi
    Description

    Flan-GPT4 Dataset

      Overview
    

    The Flan-GPT4 dataset is a collection of prompts and responses designed for training and evaluating language generation models. It contains various features such as response, instruction, system, toxin_prompt, and llama_prompt, each with a data type of string. Edited and customized from SlimOrca-Flan

      Dataset Information
    

    Features:

    response (string), instruction (string), system (string), toxin_prompt (string), llama_prompt (string)… See the full description on the dataset page: https://huggingface.co/datasets/erfanzar/Flan-GPT4.

  16. Perception performance comparison on MME benchmark.

    • plos.figshare.com
    xls
    Updated Aug 11, 2025
    Cite
    Yuting Bai; Tonghua Su; Zixing Bai (2025). Perception performance comparison on MME benchmark. [Dataset]. http://doi.org/10.1371/journal.pone.0329590.t012
    Explore at:
    Available download formats: xls
    Dataset updated
    Aug 11, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Yuting Bai; Tonghua Su; Zixing Bai
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Perception performance comparison on MME benchmark.

  17. Experimental results (%) of lossIE at both ends of the sequence.

    • plos.figshare.com
    • figshare.com
    xls
    Updated Aug 11, 2025
    Cite
    Yuting Bai; Tonghua Su; Zixing Bai (2025). Experimental results (%) of lossIE at both ends of the sequence. [Dataset]. http://doi.org/10.1371/journal.pone.0329590.t008
    Explore at:
    Available download formats: xls
    Dataset updated
    Aug 11, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Yuting Bai; Tonghua Su; Zixing Bai
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Experimental results (%) of lossIE at both ends of the sequence.

  18. alpaca-gpt4-MG

    • huggingface.co
    Updated Sep 25, 2025
    Cite
    Lorenzo Mamelona (2025). alpaca-gpt4-MG [Dataset]. https://huggingface.co/datasets/Lo-Renz-O/alpaca-gpt4-MG
    Explore at:
    Dataset updated
    Sep 25, 2025
    Authors
    Lorenzo Mamelona
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Overview

    This dataset is a Malagasy adaptation of the Alpaca-GPT4 instruction-following dataset. It contains instruction-response pairs translated or adapted into Malagasy, designed for fine-tuning instruction-following language models. Each entry includes an instruction, optional input context, and a reference response generated by GPT-4 and adapted to Malagasy using Gemini 2.5 for the translation.
    The dataset enables training and evaluating LLMs on instruction understanding… See the full description on the dataset page: https://huggingface.co/datasets/Lo-Renz-O/alpaca-gpt4-MG.

  19. LLM Human Preference Data - Ultrafeedback

    • kaggle.com
    zip
    Updated May 3, 2024
    Cite
    Darek Kłeczek (2024). LLM Human Preference Data - Ultrafeedback [Dataset]. https://www.kaggle.com/datasets/thedrcat/llm-human-preference-data-ultrafeedback
    Explore at:
    Available download formats: zip (505786139 bytes)
    Dataset updated
    May 3, 2024
    Authors
    Darek Kłeczek
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    External data for LMSYS - Chatbot Arena Human Preference Predictions competition.

    Downloaded from HuggingFace dataset: argilla/ultrafeedback-multi-binarized-preferences-cleaned

    Additionally, I converted the data into LMSYS train data format (you may still need to shuffle the responses).
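
    A minimal sketch of that shuffling step, assuming the converted file exposes LMSYS-style response_a/response_b columns with matching winner labels (the file and column names below are assumptions, not guaranteed by this dataset):

    import numpy as np
    import pandas as pd

    df = pd.read_csv("ultrafeedback_lmsys_format.csv")   # hypothetical file name

    # Randomly swap the A/B responses (and their labels) for roughly half the rows,
    # so the preferred answer is not always in the same position.
    rng = np.random.default_rng(42)
    swap = rng.random(len(df)) < 0.5
    df.loc[swap, ["response_a", "response_b"]] = df.loc[swap, ["response_b", "response_a"]].values
    df.loc[swap, ["winner_model_a", "winner_model_b"]] = df.loc[swap, ["winner_model_b", "winner_model_a"]].values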

    Version 2 contains additional examples with ties between model responses that were previously filtered out.

    NOTE: This dataset uses GPT4-as-a-judge as a proxy for human preference ratings.

    UltraFeedback - Multi-Binarized using the Average of Preference Ratings (Cleaned) dataset represents a new iteration on top of argilla/ultrafeedback-binarized-preferences-cleaned, and has been created to explore whether DPO fine-tuning with more than one rejection per chosen response helps the model perform better in the AlpacaEval, MT-Bench, and LM Eval Harness benchmarks.

    Paper: https://arxiv.org/pdf/2310.01377

  20. Experimental results (%) under different orders of lossST and lossLM.

    • figshare.com
    xls
    Updated Aug 11, 2025
    + more versions
    Cite
    Yuting Bai; Tonghua Su; Zixing Bai (2025). Experimental results (%) under different orders of lossST and lossLM. [Dataset]. http://doi.org/10.1371/journal.pone.0329590.t007
    Explore at:
    Available download formats: xls
    Dataset updated
    Aug 11, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Yuting Bai; Tonghua Su; Zixing Bai
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Experimental results (%) under different orders of lossST and lossLM.
