20 datasets found
  1. GPT4-8K

    • huggingface.co
    Updated Jan 6, 2024
    Cite
    Erfan zare chavoshi (2024). GPT4-8K [Dataset]. https://huggingface.co/datasets/erfanzar/GPT4-8K
    Explore at:
    13 scholarly articles cite this dataset (View in Google Scholar)
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 6, 2024
    Authors
    Erfan zare chavoshi
    Description

    Dataset Card for "GPT4-8K"

      Dataset Description
    

    This dataset was generated using GPT-4, a powerful language model developed by OpenAI. It contains a collection of dialogs between a user and an assistant, along with additional information, sourced from OpenChat.

      Dataset Configurations
    

    The dataset includes the following configurations:

    Config Name: default

    Data Files:
      Split: train
      Path: data/train-*

      Dataset… See the full description on the dataset page: https://huggingface.co/datasets/erfanzar/GPT4-8K.
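The Path pattern above is a glob over data shards; a minimal sketch of how such a pattern selects train files (the shard names below are hypothetical, not taken from the repository):

```python
from fnmatch import fnmatch

# The split's data files are declared with a glob-style pattern.
pattern = "data/train-*"

# Hypothetical shard names; actual file names in the repo may differ.
shards = [
    "data/train-00000-of-00002.parquet",
    "data/train-00001-of-00002.parquet",
    "data/test-00000-of-00001.parquet",
]

# Only shards matching the split's pattern belong to the train split.
train_files = [s for s in shards if fnmatch(s, pattern)]
print(train_files)  # the two train-* shards
```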
    
  2. airoboros-gpt4

    • huggingface.co
    Updated Jun 4, 2023
    + more versions
    Cite
    Jon Durbin (2023). airoboros-gpt4 [Dataset]. https://huggingface.co/datasets/jondurbin/airoboros-gpt4
    Explore at:
    Croissant
    Dataset updated
    Jun 4, 2023
    Authors
    Jon Durbin
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The data was generated by GPT-4 and is therefore subject to the OpenAI ToS. The tool used to generate the data, airoboros, is Apache-2.0 licensed. Specific areas of focus for this training data:

    • trivia
    • math
    • nonsensical math
    • coding
    • closed context question answering
    • closed context question answering, with multiple contexts to choose from as confounding factors
    • writing
    • multiple choice

      Usage and License Notices
    

    All airoboros models and datasets are intended and licensed for research use only.… See the full description on the dataset page: https://huggingface.co/datasets/jondurbin/airoboros-gpt4.

  3. alpaca-gpt4-data-zh

    • huggingface.co
    Updated Apr 11, 2023
    + more versions
    Cite
    Chris Alexiuk (2023). alpaca-gpt4-data-zh [Dataset]. https://huggingface.co/datasets/llm-wizard/alpaca-gpt4-data-zh
    Explore at:
    Croissant
    Dataset updated
    Apr 11, 2023
    Authors
    Chris Alexiuk
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for "alpaca-gpt4-data-zh"

    All of the work is done by this team.

      Usage and License Notices
    

    The data is intended and licensed for research use only. The dataset is CC BY NC 4.0 (allowing only non-commercial use) and models trained using the dataset should not be used outside of research purposes.

      English Dataset
    

    Found here

      Citation
    

    @article{peng2023gpt4llm, title={Instruction Tuning with GPT-4}, author={Baolin Peng, Chunyuan Li… See the full description on the dataset page: https://huggingface.co/datasets/llm-wizard/alpaca-gpt4-data-zh.

  4. Model Output of GPT-3.5 and GPT-4 for ECHR-AM

    • data.niaid.nih.gov
    • explore.openaire.eu
    Updated Dec 13, 2024
    Cite
    Mitrović, Jelena (2024). Model Output of GPT-3.5 and GPT-4 for ECHR-AM [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8246128
    Explore at:
    Dataset updated
    Dec 13, 2024
    Dataset provided by
    Mitrović, Jelena
    Granitzer, Michael
    Zubaer, Abdullah Al
    Description

    "gpt3.5-gpt4-input-output-echram.zip":

    Input and output of GPT-3.5 and GPT-4 on the ECHR dataset (published in JSON format in this paper), for argument component classification only, i.e. clauses that are argumentative (conclusion/premise), extracted from the JSON file.

    Note: Output of the model is subject to OpenAI Terms & policies.

    Please also cite our paper if you use this dataset: Performance analysis of large language models in the domain of legal argument mining

    You can click here for BibTex or copy the text below.

    @ARTICLE{10.3389/frai.2023.1278796,
      AUTHOR={Al Zubaer, Abdullah and Granitzer, Michael and Mitrović, Jelena},
      TITLE={Performance analysis of large language models in the domain of legal argument mining},
      JOURNAL={Frontiers in Artificial Intelligence},
      VOLUME={6},
      YEAR={2023},
      URL={https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2023.1278796},
      DOI={10.3389/frai.2023.1278796},
      ISSN={2624-8212},
      ABSTRACT={Generative pre-trained transformers (GPT) have recently demonstrated excellent performance in various natural language tasks. The development of ChatGPT and the recently released GPT-4 model has shown competence in solving complex and higher-order reasoning tasks without further training or fine-tuning. However, the applicability and strength of these models in classifying legal texts in the context of argument mining are yet to be realized and have not been tested thoroughly. In this study, we investigate the effectiveness of GPT-like models, specifically GPT-3.5 and GPT-4, for argument mining via prompting. We closely study the model's performance considering diverse prompt formulation and example selection in the prompt via semantic search using state-of-the-art embedding models from OpenAI and sentence transformers. We primarily concentrate on the argument component classification task on the legal corpus from the European Court of Human Rights. To address these models' inherent non-deterministic nature and make our result statistically sound, we conducted 5-fold cross-validation on the test set. Our experiments demonstrate, quite surprisingly, that relatively small domain-specific models outperform GPT 3.5 and GPT-4 in the F1-score for premise and conclusion classes, with 1.9% and 12% improvements, respectively. We hypothesize that the performance drop indirectly reflects the complexity of the structure in the dataset, which we verify through prompt and data analysis. Nevertheless, our results demonstrate a noteworthy variation in the performance of GPT models based on prompt formulation. We observe comparable performance between the two embedding models, with a slight improvement in the local model's ability for prompt selection. This suggests that local models are as semantically rich as the embeddings from the OpenAI model. Our results indicate that the structure of prompts significantly impacts the performance of GPT models and should be considered when designing them.}
    }

  5. daigt-v3-train-dataset

    • kaggle.com
    Updated Dec 28, 2023
    Cite
    Darek Kłeczek (2023). daigt-v3-train-dataset [Dataset]. https://www.kaggle.com/datasets/thedrcat/daigt-v3-train-dataset
    Explore at:
    Croissant
    Dataset updated
    Dec 28, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Darek Kłeczek
    Description

    New release of DAIGT train dataset! New models: 'text-ada-001', 'text-babbage-001', 'text-curie-001', 'text-davinci-001', 'text-davinci-002', 'text-davinci-003'

    These models from OpenAI are getting deprecated, so I made sure to generate some essays with them and share them here. I also added the following public datasets (please upvote!):
    • https://www.kaggle.com/datasets/phanisrikanth/daigt-essays-from-intel-neural-chat-7b
    • https://www.kaggle.com/datasets/carlmcbrideellis/llm-mistral-7b-instruct-texts
    • https://www.kaggle.com/datasets/nbroad/daigt-data-llama-70b-and-falcon180b
    • https://www.kaggle.com/datasets/snassimr/gpt4-rephrased-llm-daigt-dataset

    All merged with my previous dataset for convenience (https://www.kaggle.com/datasets/thedrcat/daigt-v2-train-dataset)

    Enjoy ❤️

    Version 2 update:
    • removed NaNs and duplicated/short generations
    • applied cleaning procedure from @nbroad's notebook (give it an upvote please!)
    • added model column to indicate the model family used in generations

  6. Alpaca GPT-4

    • kaggle.com
    • opendatalab.com
    • +1more
    Updated Nov 24, 2023
    Cite
    The Devastator (2023). Alpaca GPT-4 [Dataset]. https://www.kaggle.com/datasets/thedevastator/gpt-4-instruction-following-dataset/versions/2
    Explore at:
    Croissant
    Dataset updated
    Nov 24, 2023
    Dataset provided by
    Kaggle
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Alpaca GPT-4

    High-Performance NLP for Instruction-Following Reasoning

    By Huggingface Hub [source]

    About this dataset

    This dataset consists of 52K instruction-following data generated by GPT-4 in English using the same prompts as in Alpaca. This data has been crafted specifically to help researchers break ground and explore new strategies for natural language processing, with a special focus on instruction-following reasoning.

    What makes this dataset unique and powerful is that it offers ample variety for experimenting with models that can excel at instruction-following tasks: from refining specific components such as predicting outputs or analyzing long textual conversations, to using the entire platform to train and evaluate end-to-end approaches. It allows researchers to iterate rapidly on their experiments while having the confidence of a high-performing model with few limitations, making it an invaluable resource for anyone looking to push the boundaries of artificial intelligence techniques for logical reasoning problems.

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    This dataset is an invaluable resource for researching artificial intelligence approaches to logical reasoning problems. This dataset consists of 52K instruction-following samples generated by GPT-4 in English using the same prompts as in Alpaca. Here are some tips on how to make the most out of this dataset:

    • The columns in this dataset provide essential data for evaluating models on an instruction-following task: instruction, input, output, and text. To use this data effectively, researchers should be familiar with each column and understand its purpose: a) the instruction column provides a statement which an AI model must interpret in order to complete a task correctly; b) the input column is pre-generated data that helps the model make sense of the instruction; c) the output column indicates what result must be returned once the model interprets the instruction correctly; and d) the text column is the full text generated by GPT-4, which gives deeper insight into what gave rise to the output from the instruction and input.

      Note: It's very important that researchers pay attention to all four columns when working with such datasets, as all four components work together as an integrated whole.

      To get better results, consider fine-tuning existing schemes so they become better suited for instruction-following tasks, using these four columns as guidance points. It would also be useful if the datasets came with corresponding hyperparameters so users could fine-tune more quickly without losing accuracy or any other metric needed in such scenarios!

      Additionally, readers should review the dataset context closely and consider which model type best suits their use case before attempting any evaluation, since some models may produce more accurate results but take longer to process, or vice versa.
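As a rough illustration of the four columns described above, here is a toy record; all values are hypothetical, and treating `text` as an Alpaca-style prompt-template concatenation of the other three columns is an assumption, not something the card states:

```python
# Toy record for the instruction/input/output/text layout (hypothetical values).
record = {
    "instruction": "Summarize the following sentence.",
    "input": "The quick brown fox jumps over the lazy dog.",
    "output": "A fox jumps over a dog.",
}

# Assumed Alpaca-style template: `text` stitches the other columns together.
record["text"] = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context.\n\n"
    f"### Instruction:\n{record['instruction']}\n\n"
    f"### Input:\n{record['input']}\n\n"
    f"### Response:\n{record['output']}"
)

print(record["text"])
```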

    Research Ideas

    • Training intelligent conversational agents with instruction-following reasoning capabilities.
    • Developing more complex and powerful instructions processing models driven by natural language understanding and reasoning algorithms.
    • Establishing an online platform to help academic, business, or other organizations construct auto-grading systems for large-scale, relatively inexpensive evaluation of instruction-following skills.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv | Colu...

  7. GPT-4-Prompts

    • huggingface.co
    Updated Dec 22, 2024
    Cite
    Erfan zare chavoshi (2024). GPT-4-Prompts [Dataset]. https://huggingface.co/datasets/erfanzar/GPT-4-Prompts
    Explore at:
    Croissant
    Dataset updated
    Dec 22, 2024
    Authors
    Erfan zare chavoshi
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Multi-Turn Conversational Prompts from ChatGPT-4 (10K+ Tokens) Abstract: This dataset offers a valuable collection of multi-turn conversational prompts generated by ChatGPT-4, carefully curated for diverse prompt styles (chatml, gemma, llama). Each prompt exceeds 10,000 tokens, providing ample context and inspiration for training and evaluating large language models. Ideal for researchers and developers interested in exploring advanced conversational AI capabilities. Table of Contents:… See the full description on the dataset page: https://huggingface.co/datasets/erfanzar/GPT-4-Prompts.

  8. Curlie Enhanced with LLM Annotations: Two Datasets for Advancing...

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Dec 21, 2023
    Cite
    Peter Nutter; Mika Senghaas; Ludek Cizinsky (2023). Curlie Enhanced with LLM Annotations: Two Datasets for Advancing Homepage2Vec's Multilingual Website Classification [Dataset]. http://doi.org/10.5281/zenodo.10413068
    Explore at:
    csv (available download format)
    Dataset updated
    Dec 21, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Peter Nutter; Mika Senghaas; Ludek Cizinsky
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Advancing Homepage2Vec with LLM-Generated Datasets for Multilingual Website Classification

    This dataset contains two subsets of labeled website data, specifically created to enhance the performance of Homepage2Vec, a multi-label model for website classification. The datasets were generated using Large Language Models (LLMs) to provide more accurate and diverse topic annotations for websites, addressing a limitation of existing Homepage2Vec training data.

    Key Features:

    • LLM-generated annotations: Both datasets feature website topic labels generated using LLMs, a novel approach to creating high-quality training data for website classification models.
    • Improved multi-label classification: Fine-tuning Homepage2Vec with these datasets has been shown to improve its macro F1 score from 38% to 43% evaluated on a human-labeled dataset, demonstrating their effectiveness in capturing a broader range of website topics.
    • Multilingual applicability: The datasets facilitate classification of websites in multiple languages, reflecting the inherent multilingual nature of Homepage2Vec.

    Dataset Composition:

    • curlie-gpt3.5-10k: 10,000 websites labeled using GPT-3.5, context 2 and 1-shot
    • curlie-gpt4-10k: 10,000 websites labeled using GPT-4, context 2 and zero-shot

    Intended Use:

    • Fine-tuning and advancing Homepage2Vec or similar website classification models
    • Research on LLM-generated datasets for text classification tasks
    • Exploration of multilingual website classification

    Additional Information:

    Acknowledgments:

    This dataset was created as part of a project at EPFL's Data Science Lab (DLab) in collaboration with Prof. Robert West and Tiziano Piccardi.

  9. Flan-GPT4

    • huggingface.co
    Updated Jan 6, 2024
    Cite
    Erfan zare chavoshi (2024). Flan-GPT4 [Dataset]. https://huggingface.co/datasets/erfanzar/Flan-GPT4
    Explore at:
    Croissant
    Dataset updated
    Jan 6, 2024
    Authors
    Erfan zare chavoshi
    Description

    Flan-GPT4 Dataset

      Overview
    

    The Flan-GPT4 dataset is a collection of prompts and responses designed for training and evaluating language generation models. It contains various features such as response, instruction, system, toxin_prompt, and llama_prompt, each with a data type of string. It is edited and customized from SlimOrca-Flan.

      Dataset Information
    

    Features:

    • response (string)
    • instruction (string)
    • system (string)
    • toxin_prompt (string)
    • llama_prompt (string)

    … See the full description on the dataset page: https://huggingface.co/datasets/erfanzar/Flan-GPT4.

  10. covid-bing-query-gpt4-avs_triplets

    • huggingface.co
    Updated Aug 30, 2024
    + more versions
    Cite
    Aivin Solatorio (2024). covid-bing-query-gpt4-avs_triplets [Dataset]. https://huggingface.co/datasets/avsolatorio/covid-bing-query-gpt4-avs_triplets
    Explore at:
    Croissant
    Dataset updated
    Aug 30, 2024
    Authors
    Aivin Solatorio
    Description

    COVq dataset

    This dataset was used in the paper GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning. Refer to https://arxiv.org/abs/2402.16829 for details. The code for generating the data is available at https://github.com/avsolatorio/GISTEmbed.

      Citation
    

    @article{solatorio2024gistembed, title={GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning}, author={Aivin V. Solatorio}… See the full description on the dataset page: https://huggingface.co/datasets/avsolatorio/covid-bing-query-gpt4-avs_triplets.

  11. GPT4-Mixtral-MMLU-Preference-Complexity-train

    • huggingface.co
    Updated Jul 29, 2024
    Cite
    Bud (2024). GPT4-Mixtral-MMLU-Preference-Complexity-train [Dataset]. https://huggingface.co/datasets/budecosystem/GPT4-Mixtral-MMLU-Preference-Complexity-train
    Explore at:
    Croissant
    Dataset updated
    Jul 29, 2024
    Dataset authored and provided by
    Bud
    Description

    The budecosystem/GPT4-Mixtral-MMLU-Preference-Complexity-train dataset is hosted on Hugging Face and contributed by the HF Datasets community.

  12. LLaVAR

    • huggingface.co
    Updated Jan 27, 2021
    Cite
    Social And Language Technology Lab (2021). LLaVAR [Dataset]. https://huggingface.co/datasets/SALT-NLP/LLaVAR
    Explore at:
    Croissant
    Dataset updated
    Jan 27, 2021
    Dataset authored and provided by
    Social And Language Technology Lab
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    LLaVAR Data: Enhanced Visual Instruction Data with Text-Rich Images

    More info at LLaVAR project page, Github repo, and paper.

      Training Data
    

    Based on the LAION dataset, we collect 422K pretraining examples based on OCR results. For finetuning, we collect 16K high-quality instruction-following examples by interacting with language-only GPT-4. Note that we also release a larger and more diverse finetuning dataset below (20K), which contains the 16K used for the paper. The… See the full description on the dataset page: https://huggingface.co/datasets/SALT-NLP/LLaVAR.

  13. UltraChat-Mixin

    • huggingface.co
    Updated Apr 22, 2023
    Cite
    Erfan zare chavoshi (2023). UltraChat-Mixin [Dataset]. https://huggingface.co/datasets/erfanzar/UltraChat-Mixin
    Explore at:
    Croissant
    Dataset updated
    Apr 22, 2023
    Authors
    Erfan zare chavoshi
    Description

    Dataset Card for "UltraChat-Mixin"

      UltraChat-Mixin Dataset

      Overview
    

    UltraChat-Mixin is a dataset created by the author as a mix of three datasets: 'stingning/ultrachat', 'jondurbin/airoboros-2.1', and 'erfanzar/GPT4-8K'. It is designed for training conversational AI models.

      Dataset Configuration
    

    The dataset is configured as follows:

    configs:
      - config_name: default
        data_files:
          - split: train
            path: data/train-*

    … See the full description on the dataset page: https://huggingface.co/datasets/erfanzar/UltraChat-Mixin.

  14. mitsu_full_borda

    • huggingface.co
    Updated Oct 10, 2024
    + more versions
    Cite
    Lightblue KK. (2024). mitsu_full_borda [Dataset]. https://huggingface.co/datasets/lightblue/mitsu_full_borda
    Explore at:
    Croissant
    Dataset updated
    Oct 10, 2024
    Dataset authored and provided by
    Lightblue KK.
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Mitsu

    [Paper] [Model] This is a multilingual preference dataset generated using human-written prompts and responses from 7 LLMs. We evaluate each set of responses 5 times using GPT-4. Note that this dataset has a non-commercial license, as we used the Command R and Command R+ models to create this data. We are currently working on developing a commercially usable model, so stay tuned for that!

      Dataset details
    

    This is the ORPO training dataset derived from the… See the full description on the dataset page: https://huggingface.co/datasets/lightblue/mitsu_full_borda.

  15. math

    • huggingface.co
    Updated Apr 11, 2023
    + more versions
    Cite
    CAMEL-AI.org (2023). math [Dataset]. https://huggingface.co/datasets/camel-ai/math
    Explore at:
    Croissant
    Dataset updated
    Apr 11, 2023
    Dataset provided by
    CAMEL-AI.org
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    CAMEL: Communicative Agents for “Mind” Exploration of Large Scale Language Model Society

    Github: https://github.com/lightaime/camel Website: https://www.camel-ai.org/ Arxiv Paper: https://arxiv.org/abs/2303.17760

      Dataset Summary
    

    The Math dataset is composed of 50K problem-solution pairs obtained using GPT-4. The problem-solution pairs were generated from 25 math topics, with 25 subtopics for each topic and 80 problems for each (topic, subtopic) pair. We provide the data… See the full description on the dataset page: https://huggingface.co/datasets/camel-ai/math.
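As a quick sanity check, the stated counts multiply out to the 50K total:

```python
# The card's counts: 25 topics, 25 subtopics per topic, 80 problems per pair.
topics = 25
subtopics_per_topic = 25
problems_per_pair = 80

total_pairs = topics * subtopics_per_topic * problems_per_pair
print(total_pairs)  # 50000, matching the stated 50K problem-solution pairs
```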

  16. orca-math-word-problems-200k

    • huggingface.co
    Updated Mar 4, 2024
    + more versions
    Cite
    Microsoft (2024). orca-math-word-problems-200k [Dataset]. https://huggingface.co/datasets/microsoft/orca-math-word-problems-200k
    Explore at:
    Croissant
    Dataset updated
    Mar 4, 2024
    Dataset authored and provided by
    Microsoft (http://microsoft.com/)
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card

    This dataset contains ~200K grade school math word problems. All the answers in this dataset are generated using Azure GPT4-Turbo. Please refer to Orca-Math: Unlocking the potential of SLMs in Grade School Math for details about the dataset construction.

      Dataset Sources
    

    Repository: microsoft/orca-math-word-problems-200k Paper: Orca-Math: Unlocking the potential of SLMs in Grade School Math

      Direct Use
    

    This dataset has been designed to… See the full description on the dataset page: https://huggingface.co/datasets/microsoft/orca-math-word-problems-200k.

  17. Human-Style-Answers

    • huggingface.co
    Updated Apr 20, 2024
    Cite
    INNOVA AI (2024). Human-Style-Answers [Dataset]. https://huggingface.co/datasets/innova-ai/Human-Style-Answers
    Explore at:
    Croissant
    Dataset updated
    Apr 20, 2024
    Dataset authored and provided by
    INNOVA AI
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Human Style Answers

    This dataset contains questions and answers on different topics, written in a human style (for chatbot training). It was built using top AI models such as GPT-4, Claude 3, and Command R+.

      Dataset Details

      Description
    

    The Human Style Response Dataset is a rich collection of question-and-answer pairs, meticulously crafted in a human-like style. It serves as a valuable resource for training chatbots and conversational AI models. Let's dive into the… See the full description on the dataset page: https://huggingface.co/datasets/innova-ai/Human-Style-Answers.

  18. tulu-v1-sft-mixture

    • huggingface.co
    Updated Nov 14, 2023
    Cite
    Ai2 (2023). tulu-v1-sft-mixture [Dataset]. https://huggingface.co/datasets/allenai/tulu-v1-sft-mixture
    Explore at:
    Croissant
    Dataset updated
    Nov 14, 2023
    Dataset provided by
    Allen Institute for AI (http://allenai.org/)
    Authors
    Ai2
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    Dataset Card for Tulu Instruction Mix

    For a newer version, see Tulu V2. This version, the human data mixture, consists of a mix of:

    • FLAN (Apache 2.0): FLAN v2 with CoT examples (most of the tasks in SuperNatural Instructions are included here)
    • Open Assistant 1 (Apache 2.0)
    • Dolly (CC BY SA 3.0)
    • ShareGPT (Apache 2.0 listed, no official repo found)
    • GPT4-Alpaca (CC BY NC 4.0)
    • Code-Alpaca (CC BY NC 4.0)

    These are made by taking either just the training set of the subsets or the… See the full description on the dataset page: https://huggingface.co/datasets/allenai/tulu-v1-sft-mixture.

  19. wikipedia-document-question-answer

    • huggingface.co
    Updated Jan 24, 2024
    + more versions
    Cite
    Grimulkan (2024). wikipedia-document-question-answer [Dataset]. https://huggingface.co/datasets/grimulkan/wikipedia-document-question-answer
    Explore at:
    Croissant
    Dataset updated
    Jan 24, 2024
    Authors
    Grimulkan
    License

    https://choosealicense.com/licenses/unknown/

    Description

    Multi-round questions and answers for randomly selected Wikipedia articles of varying lengths, in fastchat JSON format, generated by gpt-4-1106-preview. OpenAI terms apply. This was designed to train a 32K context-length model. Check the total conversation lengths before using data items for training to ensure that they fit inside your target context window, and discard queries that don't fit.

    Both the questions and answers were generated by GPT4, based on the document. Only information from… See the full description on the dataset page: https://huggingface.co/datasets/grimulkan/wikipedia-document-question-answer.
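The length check the card recommends can be sketched as follows; the ~4-characters-per-token estimate and the fastchat-style field names ("conversations", "from", "value") are assumptions, and a real tokenizer should replace the heuristic for actual training:

```python
# Discard conversations whose estimated token count exceeds the target
# context window. Rough heuristic: ~4 characters per token.
CHARS_PER_TOKEN = 4

def estimated_tokens(conversation):
    # Sum the text length of every turn, then convert characters to tokens.
    total_chars = sum(len(turn["value"]) for turn in conversation["conversations"])
    return total_chars // CHARS_PER_TOKEN

def fits_context(conversation, context_tokens=32_768):
    # True when the whole conversation fits inside the target window.
    return estimated_tokens(conversation) <= context_tokens

sample = {"conversations": [
    {"from": "human", "value": "What does the article say about the 1990 expedition?"},
    {"from": "gpt", "value": "It describes the route, the team, and the outcome."},
]}
print(fits_context(sample))  # True: a short dialog easily fits a 32K window
```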

  20. alpaca-zh

    • huggingface.co
    Updated Apr 11, 2023
    + more versions
    Cite
    Ming Xu (徐明) (2023). alpaca-zh [Dataset]. https://huggingface.co/datasets/shibing624/alpaca-zh
    Explore at:
    Croissant
    Dataset updated
    Apr 11, 2023
    Authors
    Ming Xu (徐明)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for "alpaca-zh"

    This dataset contains roughly 50K self-instruct samples obtained with GPT-4, following the Alpaca approach. Dataset from https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM It is the Chinese dataset from https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM/blob/main/data/alpaca_gpt4_data_zh.json

      Usage and License Notices
    

    The data is intended and licensed for research use only. The dataset is CC BY NC 4.0 (allowing only non-commercial use) and models trained using the dataset should not… See the full description on the dataset page: https://huggingface.co/datasets/shibing624/alpaca-zh.
