Dataset Card for "GPT4-8K"
Dataset Description
This dataset was generated using GPT-4, a powerful language model developed by OpenAI. It contains a collection of dialogs between a user and an assistant, along with additional information, from OpenChat.
Dataset Configurations
The dataset includes the following configurations:
Config Name: default
Data Files: Split: train Path: data/train-*
Dataset… See the full description on the dataset page: https://huggingface.co/datasets/erfanzar/GPT4-8K.
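A minimal loading sketch, assuming the Hugging Face `datasets` library and the default configuration described above (train split at data/train-*):

```python
# Minimal sketch: load the default config of erfanzar/GPT4-8K and inspect a record.
from datasets import load_dataset

ds = load_dataset("erfanzar/GPT4-8K", split="train")
print(ds)      # schema and row count
print(ds[0])   # first user/assistant dialog with its additional fields
```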
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The data was generated by GPT-4 and is therefore subject to the OpenAI Terms of Service. The tool used to generate the data, airoboros, is Apache-2.0 licensed. Specific areas of focus for this training data:
- trivia
- math
- nonsensical math
- coding
- closed context question answering
- closed context question answering, with multiple contexts to choose from as confounding factors
- writing
- multiple choice
Usage and License Notices
All airoboros models and datasets are intended and licensed for research use only.… See the full description on the dataset page: https://huggingface.co/datasets/jondurbin/airoboros-gpt4.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for "alpaca-gpt4-data-zh"
All of the work is done by this team.
Usage and License Notices
The data is intended and licensed for research use only. The dataset is CC BY NC 4.0 (allowing only non-commercial use) and models trained using the dataset should not be used outside of research purposes.
English Dataset
Found here
Citation
@article{peng2023gpt4llm, title={Instruction Tuning with GPT-4}, author={Baolin Peng, Chunyuan Li… See the full description on the dataset page: https://huggingface.co/datasets/llm-wizard/alpaca-gpt4-data-zh.
"gpt3.5-gpt4-input-output-echram.zip" :
Input and output to GPT-3.5 and GPT-4 based on ECHR dataset published in JSON format in this paper for argument component classification only i.e. clauses that are argumentative (conclusion/premise), extracted from the JSON file
Note: Output of the model is under OpenAI Terms & policies.
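A hedged sketch of reading the archive, assuming its JSON files contain lists of labeled clauses; the field name "label" used here is a hypothetical placeholder and should be checked against the actual schema before use:

```python
# Hedged sketch: iterate over JSON files in the archive and count argument
# component labels (conclusion / premise). Field names are hypothetical.
import json
import zipfile
from collections import Counter

with zipfile.ZipFile("gpt3.5-gpt4-input-output-echram.zip") as zf:
    for name in zf.namelist():
        if not name.endswith(".json"):
            continue
        records = json.loads(zf.read(name))
        labels = Counter(r.get("label") for r in records if isinstance(r, dict))
        print(name, dict(labels))
```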
Please also cite our paper if you use this dataset: Performance analysis of large language models in the domain of legal argument mining. You can copy the BibTeX below.
@ARTICLE{10.3389/frai.2023.1278796,
AUTHOR={Al Zubaer, Abdullah and Granitzer, Michael and Mitrović, Jelena },
TITLE={Performance analysis of large language models in the domain of legal argument mining},
JOURNAL={Frontiers in Artificial Intelligence},
VOLUME={6},
YEAR={2023},
URL={https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2023.1278796},
DOI={10.3389/frai.2023.1278796},
ISSN={2624-8212},
ABSTRACT={Generative pre-trained transformers (GPT) have recently demonstrated excellent performance in various natural language tasks. The development of ChatGPT and the recently released GPT-4 model has shown competence in solving complex and higher-order reasoning tasks without further training or fine-tuning. However, the applicability and strength of these models in classifying legal texts in the context of argument mining are yet to be realized and have not been tested thoroughly. In this study, we investigate the effectiveness of GPT-like models, specifically GPT-3.5 and GPT-4, for argument mining via prompting. We closely study the model's performance considering diverse prompt formulation and example selection in the prompt via semantic search using state-of-the-art embedding models from OpenAI and sentence transformers. We primarily concentrate on the argument component classification task on the legal corpus from the European Court of Human Rights. To address these models' inherent non-deterministic nature and make our result statistically sound, we conducted 5-fold cross-validation on the test set. Our experiments demonstrate, quite surprisingly, that relatively small domain-specific models outperform GPT 3.5 and GPT-4 in the F1-score for premise and conclusion classes, with 1.9% and 12% improvements, respectively. We hypothesize that the performance drop indirectly reflects the complexity of the structure in the dataset, which we verify through prompt and data analysis. Nevertheless, our results demonstrate a noteworthy variation in the performance of GPT models based on prompt formulation. We observe comparable performance between the two embedding models, with a slight improvement in the local model's ability for prompt selection. This suggests that local models are as semantically rich as the embeddings from the OpenAI model. Our results indicate that the structure of prompts significantly impacts the performance of GPT models and should be considered when designing them.}}
New release of DAIGT train dataset! New models: 'text-ada-001', 'text-babbage-001', 'text-curie-001', 'text-davinci-001', 'text-davinci-002', 'text-davinci-003'
These models from OpenAI are being deprecated, so I made sure to generate some essays with them and share them here. I also added the following public datasets (please upvote!):
- https://www.kaggle.com/datasets/phanisrikanth/daigt-essays-from-intel-neural-chat-7b
- https://www.kaggle.com/datasets/carlmcbrideellis/llm-mistral-7b-instruct-texts
- https://www.kaggle.com/datasets/nbroad/daigt-data-llama-70b-and-falcon180b
- https://www.kaggle.com/datasets/snassimr/gpt4-rephrased-llm-daigt-dataset
All merged with my previous dataset for convenience (https://www.kaggle.com/datasets/thedrcat/daigt-v2-train-dataset)
Enjoy ❤️
Version 2 update:
- removed NaNs and duplicated/short generations
- applied cleaning procedure from @nbroad's notebook - give it an upvote please!
- added 'model' column to indicate model family used in generations
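A rough sketch of the cleaning steps above, assuming the data is loaded as a pandas DataFrame with a "text" column; the file name and length cutoff are illustrative, not the exact procedure from @nbroad's notebook:

```python
# Illustrative cleaning sketch (not the exact notebook procedure).
import pandas as pd

df = pd.read_csv("train_v2.csv")             # hypothetical file name
df = df.dropna(subset=["text"])              # remove NaNs
df = df.drop_duplicates(subset=["text"])     # remove duplicated generations
df = df[df["text"].str.len() > 200]          # drop very short generations (illustrative cutoff)
# The released version also carries a "model" column giving the model family per row.
```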
https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This dataset consists of 52K instruction-following data generated by GPT-4 in English using the same prompts as in Alpaca. This data has been crafted specifically to help researchers break ground and explore new strategies for natural language processing, with a special focus on instruction-following reasoning.
What makes this dataset unique and powerful is that it offers an ample variety of options for experimenting with models that excel at instruction-following tasks: from refining specific components, such as predicting outputs or analyzing long textual conversations, to training and evaluating end-to-end approaches. It allows researchers to iterate on their experiments rapidly while having the confidence of a high-performing model with few limitations, making it an invaluable resource for anyone looking to push the boundaries of artificial intelligence techniques for logical reasoning problems.
For more datasets, click here.
This dataset is an invaluable resource for researching artificial intelligence approaches to logical reasoning problems. This dataset consists of 52K instruction-following samples generated by GPT-4 in English using the same prompts as in Alpaca. Here are some tips on how to make the most out of this dataset:
The columns in this dataset provide essential data that can help researchers evaluate their models on an instruction-following task: 'instruction', 'input', 'output', and 'text'. To use this data effectively, researchers should be familiar with each column and understand its purpose and contribution to instruction-following principles.
a) The 'instruction' column provides a statement that an AI model must interpret in order to complete a task correctly;
b) The 'input' column contains pre-generated data that helps the model make sense of the instruction;
c) The 'output' column indicates the result that must be returned once the model interprets the instruction correctly; and finally,
d) The 'text' column is the full text generated by GPT-4, which gives deeper insight into how the output arises from the instruction and input.
Note: researchers should pay attention to all four columns when working with this dataset, as the four components work together as a whole.
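A minimal inspection sketch, assuming the train.csv file referenced below carries exactly these four column names:

```python
# Minimal sketch: peek at the instruction / input / output / text columns.
import pandas as pd

df = pd.read_csv("train.csv")
print(df.columns.tolist())        # expected: ['instruction', 'input', 'output', 'text']
row = df.iloc[0]
print(row["instruction"])         # the task statement the model must interpret
print(row["input"])               # pre-generated context supporting the instruction (may be empty)
print(row["output"])              # the expected result
print(row["text"])                # full GPT-4 generated text combining the above
```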
To get better results, one should consider fine-tuning existing schemes so they become better suited for instruction-following tasks, using these four columns as guidance. It would also be useful if the dataset shipped with corresponding hyperparameters, so users could fine-tune models more quickly without losing accuracy or any other metric relevant to such scenarios.
Additionally, readers should review the context closely, assess the evaluation measure, and consider which model type best suits the use case before attempting any sort of evaluation, since some models may produce more accurate results but take longer to process, or vice versa; evaluation metrics may also vary depending on the data observed.
- Training intelligent conversational agents with instruction-following reasoning capabilities.
- Developing more complex and powerful instructions processing models driven by natural language understanding and reasoning algorithms.
- Establishing an online platform to help academic, business, or other organizations construct auto-grading systems for large-scale, relatively inexpensive evaluation of their staff's instruction-following skills.
If you use this dataset in your research, please credit the original authors.
Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv | Colu...
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Multi-Turn Conversational Prompts from ChatGPT-4 (10K+ Tokens) Abstract: This dataset offers a valuable collection of multi-turn conversational prompts generated by ChatGPT-4, carefully curated for diverse prompt styles (chatml, gemma, llama). Each prompt exceeds 10,000 tokens, providing ample context and inspiration for training and evaluating large language models. Ideal for researchers and developers interested in exploring advanced conversational AI capabilities. Table of Contents:… See the full description on the dataset page: https://huggingface.co/datasets/erfanzar/GPT-4-Prompts.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains two subsets of labeled website data, specifically created to enhance the performance of Homepage2Vec, a multi-label model for website classification. The datasets were generated using Large Language Models (LLMs) to provide more accurate and diverse topic annotations for websites, addressing a limitation of existing Homepage2Vec training data.
Key Features:
Dataset Composition:
Intended Use:
Additional Information:
Acknowledgments:
This dataset was created as part of a project at EPFL's Data Science Lab (DLab) in collaboration with Prof. Robert West and Tiziano Piccardi.
Flan-GPT4 Dataset
Overview
The Flan-GPT4 dataset is a collection of prompts and responses designed for training and evaluating language generation models. It contains various features such as response, instruction, system, toxin_prompt, and llama_prompt, each with a data type of string. Edited and customized from SlimOrca-Flan
Dataset Information
Features:
response (string) instruction (string) system (string) toxin_prompt (string) llama_prompt (string)… See the full description on the dataset page: https://huggingface.co/datasets/erfanzar/Flan-GPT4.
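A small sketch, assuming the `datasets` library, for verifying the string features listed above:

```python
# Sketch: load Flan-GPT4 and check that the five features are strings.
from datasets import load_dataset

ds = load_dataset("erfanzar/Flan-GPT4", split="train")
print(ds.features)                 # response, instruction, system, toxin_prompt, llama_prompt
print(ds[0]["instruction"])        # inspect one example
```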
COVq dataset
This dataset was used in the paper GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning. Refer to https://arxiv.org/abs/2402.16829 for details. The code for generating the data is available at https://github.com/avsolatorio/GISTEmbed.
Citation
@article{solatorio2024gistembed, title={GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning}, author={Aivin V. Solatorio}… See the full description on the dataset page: https://huggingface.co/datasets/avsolatorio/covid-bing-query-gpt4-avs_triplets.
budecosystem/GPT4-Mixtral-MMLU-Preference-Complexity-train dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
LLaVAR Data: Enhanced Visual Instruction Data with Text-Rich Images
More info at LLaVAR project page, Github repo, and paper.
Training Data
Based on the LAION dataset, we collect 422K pretraining examples based on OCR results. For finetuning data, we collect 16K high-quality instruction-following examples by interacting with language-only GPT-4. Note that we also release a larger and more diverse finetuning dataset below (20K), which contains the 16K we used for the paper. The… See the full description on the dataset page: https://huggingface.co/datasets/SALT-NLP/LLaVAR.
Dataset Card for "UltraChat-Mixin"
UltraChat-Mixin Dataset
Overview
UltraChat-Mixin is a dataset I created as a mix of three datasets: 'stingning/ultrachat', 'jondurbin/airoboros-2.1', and 'erfanzar/GPT4-8K'. This dataset is designed for training conversational AI models.
Dataset Configuration
The dataset is configured as follows:
configs:
  - config_name: default
    data_files:
      - split: train
        path: data/train-*
… See the full description on the dataset page: https://huggingface.co/datasets/erfanzar/UltraChat-Mixin.
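A hedged sketch of how such a mix could be assembled with the `datasets` library; this is not the author's script, and the flattening function below is a hypothetical placeholder for real schema harmonisation:

```python
# Hedged sketch: concatenate the three source datasets into one training mix.
from datasets import load_dataset, concatenate_datasets

sources = ["stingning/ultrachat", "jondurbin/airoboros-2.1", "erfanzar/GPT4-8K"]

def flatten(example):
    # Placeholder harmonisation: collapse each source's fields into a single string.
    return {"text": str(example)}

parts = []
for name in sources:
    ds = load_dataset(name, split="train")
    parts.append(ds.map(flatten, remove_columns=ds.column_names))

mixed = concatenate_datasets(parts)
print(mixed)
```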
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Mitsu
[Paper] [Model] This is a multilingual preference dataset generated using human-written prompts and responses from 7 LLMs. We evaluate each set of responses 5 times using GPT-4. Note that this dataset has a non-commercial license, as we used the Command R and Command R+ models to create this data. We are currently working on developing a commercially usable model, so stay tuned for that!
Dataset details
This is the ORPO training dataset derived from the… See the full description on the dataset page: https://huggingface.co/datasets/lightblue/mitsu_full_borda.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
CAMEL: Communicative Agents for “Mind” Exploration of Large Scale Language Model Society
Github: https://github.com/lightaime/camel Website: https://www.camel-ai.org/ Arxiv Paper: https://arxiv.org/abs/2303.17760
Dataset Summary
The Math dataset is composed of 50K problem-solution pairs obtained using GPT-4. The problem-solution pairs were generated from 25 math topics, with 25 subtopics per topic and 80 problems per (topic, subtopic) pair (25 × 25 × 80 = 50,000). We provide the data… See the full description on the dataset page: https://huggingface.co/datasets/camel-ai/math.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card
This dataset contains ~200K grade school math word problems. All the answers in this dataset were generated using Azure GPT-4 Turbo. Please refer to Orca-Math: Unlocking the potential of SLMs in Grade School Math for details about the dataset construction.
Dataset Sources
Repository: microsoft/orca-math-word-problems-200k Paper: Orca-Math: Unlocking the potential of SLMs in Grade School Math
Direct Use
This dataset has been designed to… See the full description on the dataset page: https://huggingface.co/datasets/microsoft/orca-math-word-problems-200k.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Human Style Answers
This dataset contains questions and answers on different topics in a human style (for chatbot training). It was built using top AI models such as GPT-4, Claude 3, Command R+, etc.
Dataset Details
Description
The Human Style Response Dataset is a rich collection of question-and-answer pairs, meticulously crafted in a human-like style. It serves as a valuable resource for training chatbots and conversational AI models. Let's dive into the… See the full description on the dataset page: https://huggingface.co/datasets/innova-ai/Human-Style-Answers.
https://choosealicense.com/licenses/odc-by/
Dataset Card for Tulu Instruction Mix
For a newer version, see Tulu V2. This version, the human data mixture, consists of a mix of:
- FLAN (Apache 2.0): FLAN v2 with CoT examples (most of the tasks in SuperNatural Instructions are included here)
- Open Assistant 1 (Apache 2.0)
- Dolly (CC BY SA 3.0)
- ShareGPT (Apache 2.0 listed, no official repo found)
- GPT4-Alpaca (CC BY NC 4.0)
- Code-Alpaca (CC BY NC 4.0)
These are made by taking either just the training set of the subsets or the… See the full description on the dataset page: https://huggingface.co/datasets/allenai/tulu-v1-sft-mixture.
https://choosealicense.com/licenses/unknown/
Multi-round questions and answers for randomly selected Wikipedia articles of varying lengths, in fastchat JSON format, generated by gpt-4-1106-preview. OpenAI terms apply. This was designed to train a 32K context-length model. Check the total conversation lengths before using data items for training to ensure that they fit inside your target context window, and discard queries that don't fit.
Both the questions and answers were generated by GPT4, based on the document. Only information from… See the full description on the dataset page: https://huggingface.co/datasets/grimulkan/wikipedia-document-question-answer.
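A hedged sketch of the suggested length check, filtering out conversations that exceed a 32K-token context window; the tokenizer choice and local file name are assumptions, not part of the dataset release:

```python
# Sketch: keep only fastchat-format conversations that fit a 32K context window.
import json
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # illustrative tokenizer choice
MAX_TOKENS = 32_768

def fits(conversations):
    # conversations: list of {"from": ..., "value": ...} turns (fastchat JSON format)
    return sum(len(enc.encode(turn["value"])) for turn in conversations) <= MAX_TOKENS

with open("wikipedia_qa.json") as f:         # hypothetical local file name
    data = json.load(f)

kept = [item for item in data if fits(item["conversations"])]
print(f"kept {len(kept)} of {len(data)} items")
```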
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for "alpaca-zh"
This dataset contains about 50K self-instruct samples obtained with GPT-4 following the Alpaca approach. Dataset from https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM It is the Chinese dataset from https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM/blob/main/data/alpaca_gpt4_data_zh.json
Usage and License Notices
The data is intended and licensed for research use only. The dataset is CC BY NC 4.0 (allowing only non-commercial use) and models trained using the dataset should not… See the full description on the dataset page: https://huggingface.co/datasets/shibing624/alpaca-zh.
Dataset Card for "GPT4-8K"
Sure! Here's a README.md file for your dataset:
Dataset Description
This dataset was generated using GPT-4, a powerful language model developed by OpenAI. It contains a collection of dialogs between a user and an assistant, along with additional information. from OpenChat
Dataset Configurations
The dataset includes the following configurations:
Config Name: default
Data Files: Split: train Path: data/train-*
Dataset… See the full description on the dataset page: https://huggingface.co/datasets/erfanzar/GPT4-8K.