Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Post-training-Data-Flywheel/teknium-GPT4-LLM-Cleaned dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The dataset includes all chat conversations generated by GPT-4 that are hosted on open Huggingface datasets. Everything is converted to the same format so the datasets can be easily merged and used for large scale training of LLMs.
This dataset is a collection of several single chat datasets. If you use this dataset in your research, please credit the original authors of the internal datasets. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Post-training-Data-Flywheel/gpt4-self-instruct dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for "GPT4-8K"
Dataset Description
This dataset was generated using GPT-4, a powerful language model developed by OpenAI. It contains a collection of dialogs between a user and an assistant, along with additional information, sourced from OpenChat.
Dataset Configurations
The dataset includes the following configurations:
Config Name: default
Data Files: split: train, path: data/train-*
Dataset… See the full description on the dataset page: https://huggingface.co/datasets/erfanzar/GPT4-8K.
New release of DAIGT train dataset! New models: 'text-ada-001', 'text-babbage-001', 'text-curie-001', 'text-davinci-001', 'text-davinci-002', 'text-davinci-003'
These models from OpenAI are being deprecated, so I made sure to generate some essays with them and share them here. I also added the following public datasets (please upvote!):
- https://www.kaggle.com/datasets/phanisrikanth/daigt-essays-from-intel-neural-chat-7b
- https://www.kaggle.com/datasets/carlmcbrideellis/llm-mistral-7b-instruct-texts
- https://www.kaggle.com/datasets/nbroad/daigt-data-llama-70b-and-falcon180b
- https://www.kaggle.com/datasets/snassimr/gpt4-rephrased-llm-daigt-dataset
All merged with my previous dataset for convenience (https://www.kaggle.com/datasets/thedrcat/daigt-v2-train-dataset)
Enjoy ❤️
Version 2 update:
- removed NaNs and duplicated/short generations
- applied the cleaning procedure from @nbroad's notebook - give it an upvote please!
- added model column to indicate model family used in generations
"gpt3.5-gpt4-input-output-echram.zip":
Inputs to and outputs from GPT-3.5 and GPT-4, based on the ECHR dataset published in JSON format in the paper cited below, for argument component classification only, i.e., clauses that are argumentative (conclusion/premise), extracted from the JSON file.
Note: the output of the model is subject to OpenAI's Terms & policies.
Please also cite our paper if you use this dataset: Performance analysis of large language models in the domain of legal argument mining
You can click here for BibTex or copy the text below.
@ARTICLE{10.3389/frai.2023.1278796,
AUTHOR={Al Zubaer, Abdullah and Granitzer, Michael and Mitrović, Jelena },
TITLE={Performance analysis of large language models in the domain of legal argument mining},
JOURNAL={Frontiers in Artificial Intelligence},
VOLUME={6},
YEAR={2023},
URL={https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2023.1278796},
DOI={10.3389/frai.2023.1278796},
ISSN={2624-8212},
ABSTRACT={Generative pre-trained transformers (GPT) have recently demonstrated excellent performance in various natural language tasks. The development of ChatGPT and the recently released GPT-4 model has shown competence in solving complex and higher-order reasoning tasks without further training or fine-tuning. However, the applicability and strength of these models in classifying legal texts in the context of argument mining are yet to be realized and have not been tested thoroughly. In this study, we investigate the effectiveness of GPT-like models, specifically GPT-3.5 and GPT-4, for argument mining via prompting. We closely study the model's performance considering diverse prompt formulation and example selection in the prompt via semantic search using state-of-the-art embedding models from OpenAI and sentence transformers. We primarily concentrate on the argument component classification task on the legal corpus from the European Court of Human Rights. To address these models' inherent non-deterministic nature and make our result statistically sound, we conducted 5-fold cross-validation on the test set. Our experiments demonstrate, quite surprisingly, that relatively small domain-specific models outperform GPT 3.5 and GPT-4 in the F1-score for premise and conclusion classes, with 1.9% and 12% improvements, respectively. We hypothesize that the performance drop indirectly reflects the complexity of the structure in the dataset, which we verify through prompt and data analysis. Nevertheless, our results demonstrate a noteworthy variation in the performance of GPT models based on prompt formulation. We observe comparable performance between the two embedding models, with a slight improvement in the local model's ability for prompt selection. This suggests that local models are as semantically rich as the embeddings from the OpenAI model. Our results indicate that the structure of prompts significantly impacts the performance of GPT models and should be considered when designing them.}}
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The data was generated by GPT-4 and is therefore subject to the OpenAI ToS. The tool used to generate the data, airoboros, is Apache-2.0 licensed. Specific areas of focus for this training data:
- trivia
- math
- nonsensical math
- coding
- closed context question answering
- closed context question answering, with multiple contexts to choose from as confounding factors
- writing
- multiple choice
Usage and License Notices
All airoboros models and datasets are intended and licensed for research use only.… See the full description on the dataset page: https://huggingface.co/datasets/jondurbin/airoboros-gpt4.
Please use version 2 (there were some issues with v1 that I fixed)!
New release of DAIGT train dataset! Improvements:
- new models: Cohere Command, Google PaLM, GPT-4 (from Radek!)
- new prompts, including source texts from the original essays!
- mapping of essay text to the original prompt from the Persuade corpus
- filtering by the famous "RDizzl3_seven"
persuade_corpus 25996
chat_gpt_moth 2421
llama2_chat 2421
mistral7binstruct_v2 2421
mistral7binstruct_v1 2421
original_moth 2421
train_essays 1378
llama_70b_v1 1172
falcon_180b_v1 1055
darragh_claude_v7 1000
darragh_claude_v6 1000
radek_500 500
NousResearch/Llama-2-7b-chat-hf 400
mistralai/Mistral-7B-Instruct-v0.1 400
cohere-command 350
palm-text-bison1 349
radekgpt4 200
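The per-source listing above is a value_counts-style summary; as a minimal sketch (the file and column names are assumptions, not taken from the dataset itself), it could be reproduced with pandas:

```python
import pandas as pd

df = pd.read_csv("train_v2_drcat_02.csv")  # hypothetical file name for the merged CSV
print(df["source"].value_counts())         # per-source essay counts, as listed above
```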
Sources (please upvote the original datasets!):
- Text generated with ChatGPT by MOTH (https://www.kaggle.com/datasets/alejopaullier/daigt-external-dataset)
- Persuade corpus contributed by Nicholas Broad (https://www.kaggle.com/datasets/nbroad/persaude-corpus-2/)
- Text generated with Llama-70b and Falcon180b by Nicholas Broad (https://www.kaggle.com/datasets/nbroad/daigt-data-llama-70b-and-falcon180b)
- Text generated with ChatGPT and GPT4 by Radek (https://www.kaggle.com/datasets/radek1/llm-generated-essays)
- 2000 Claude essays generated by @darraghdog (https://www.kaggle.com/datasets/darraghdog/hello-claude-1000-essays-from-anthropic)
- LLM-generated essay using PaLM from Google Gen-AI by @kingki19 (https://www.kaggle.com/datasets/kingki19/llm-generated-essay-using-palm-from-google-gen-ai)
- Official train essays
- Essays I generated with various LLMs
License: MIT for the data I generated. Check source datasets for the other sources mentioned above.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Multi-Turn Conversational Prompts from ChatGPT-4 (10K+ Tokens) Abstract: This dataset offers a valuable collection of multi-turn conversational prompts generated by ChatGPT-4, carefully curated for diverse prompt styles (chatml, gemma, llama). Each prompt exceeds 10,000 tokens, providing ample context and inspiration for training and evaluating large language models. Ideal for researchers and developers interested in exploring advanced conversational AI capabilities. Table of Contents:… See the full description on the dataset page: https://huggingface.co/datasets/erfanzar/GPT-4-Prompts.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Experimental results (%) of applying loss_ST, loss_LM, and loss_IE in one training stage and in segmented training.
This is the GPT4-LLM dataset from: https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM It has been filtered of all OpenAI disclaimers and refusals. (Disclaimer: It may have removed some additional things besides just OAI disclaimers, as I used the following script, which is a bit more broad: https://huggingface.co/datasets/ehartford/WizardLM_alpaca_evol_instruct_70k_unfiltered/blob/main/wizardlm_clean.py) There is a modified script of that in the repo that was used specifically for… See the full description on the dataset page: https://huggingface.co/datasets/teknium/GPT4-LLM-Cleaned.
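As an illustration of this kind of phrase-based cleaning (a minimal sketch, not the wizardlm_clean.py script itself; the "output" field name, the phrase list, and the file names are assumptions):

```python
import json

# Hypothetical refusal/disclaimer phrases; the real script uses a broader list.
REFUSAL_PHRASES = [
    "as an ai language model",
    "i'm sorry, but i cannot",
    "openai",
]

def is_clean(example: dict) -> bool:
    """Keep an example only if its response contains none of the flagged phrases."""
    text = example.get("output", "").lower()   # "output" field name is an assumption
    return not any(phrase in text for phrase in REFUSAL_PHRASES)

with open("alpaca_gpt4_data.json") as f:       # hypothetical input file
    data = json.load(f)

cleaned = [ex for ex in data if is_clean(ex)]

with open("alpaca_gpt4_data_cleaned.json", "w") as f:
    json.dump(cleaned, f, ensure_ascii=False, indent=2)
```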
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
By Peevski (From Huggingface) [source]
The OpenLeecher/GPT4-10k dataset is a comprehensive collection of 100 diverse conversations, presented in text format, revolving around a wide range of topics. These conversations cover various domains such as coding, debugging, storytelling, and science. Aimed at facilitating training and analysis purposes for researchers and developers alike, this dataset offers an extensive array of conversation samples.
Each conversation within this dataset delves into a different subject, covering coding techniques, debugging strategies, and storytelling methods, while also exploring concepts like spatial and logical thinking. The conversations further touch on scientific fields including chemistry, physics, and biology, and the dataset also includes discussions on the topic of law.
By providing this assortment of conversations spanning multiple domains and disciplines in a single train.csv file on the Kaggle platform, the dataset lets users explore and analyze these dialogue examples with ease. The compilation serves as a valuable resource for understanding coding practices as well as scientific discussions across multiple fields.
Introduction:
Understanding the Dataset Structure: The dataset consists of a CSV file named 'train.csv'. When you examine the file's columns using the software or programming language of your choice (e.g., Python), you will find a 'chat' column containing text data that represents conversations between two or more participants.
Exploring Different Topics: The dataset covers a vast spectrum of subjects, including coding techniques, debugging strategies, storytelling methods, spatial thinking, logical thinking, chemistry, physics, biology, and law. Each conversation falls under one or more of the following areas:
- Coding Techniques: Discover discussions on various programming concepts and best practices.
- Debugging Strategies: Explore conversations related to identifying and fixing software issues.
- Storytelling Methods: Dive into dialogues about effective storytelling techniques in different contexts.
- Spatial Thinking: Engage with conversations that involve developing spatial reasoning skills for problem-solving.
- Logical Thinking: Learn from discussions focused on enhancing logical reasoning abilities related to different domains.
- Chemistry
- Physics
- Biology
- Law
Analyzing Conversations: Leverage natural language processing (NLP) tools and techniques such as sentiment analysis to study the dialogues; even a simple check like print("Number of Conversations:", len(df)) tells you how many conversations you are working with (see the sketch under 'Accessible Code Examples' below).
Accessible Code Examples
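A minimal sketch of loading the dataset with pandas and counting the conversations (the file name train.csv and the 'chat' column are taken from the description above; adjust the path to your local copy):

```python
import pandas as pd

# Load the Kaggle train.csv and count the conversations it contains.
df = pd.read_csv("train.csv")
print("Number of Conversations:", len(df))

# Peek at the start of the first conversation to get a feel for the dialogue format.
print(df["chat"].iloc[0][:500])
```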
Maximize Training Efficiency:
Taking Advantage of Diversity:
Creating New Applications:
Conclusion:
- Natural Language Processing Research: Researchers can leverage this dataset to train and evaluate natural language processing models, particularly in the context of conversational understanding and generation. The diverse conversations on coding, debugging, storytelling, and science can provide valuable insights into modeling human-like conversation patterns.
- Chatbot Development: The dataset can be utilized for training chatbots or virtual assistants that can engage in conversations related to coding, debugging, storytelling, and science. By exposing the chatbot to a wide range of conversation samples from different domains, developers can ensure that their chatbots are capable of providing relevant and accurate responses.
- Domain-specific Intelligent Assistants: Organizations or individuals working in fields such as coding education or scientific research may use this dataset to develop intelligent assistants tailored specifically for these domains. These assistants can help users navigate complex topics by answering questions related to coding techniques, debugging strategies, storytelling methods, or scientific concepts.
Overall, 'train.csv' provides a rich resource for researchers and developers interested in building conversational AI systems with knowledge across multiple domains, including legal matters.
If you use this dataset in your research, please credit the original authors.
Dolphin 2.9.3 Mistral Nemo 12b 🐬
Curated and trained by Eric Hartford and Cognitive Computations
Discord: https://discord.gg/h3K4XGj2RH
Our appreciation for the sponsors of Dolphin 2.9.3:
Crusoe Cloud - provided excellent on-demand 8xL40S node
This model is based on mistralai/Mistral-Nemo-Base-2407 and is governed by the Apache 2.0 license.
The base model has 128K context, and our finetuning used 8192 sequence length.
Dolphin 2.9.3 uses ChatML prompt template format.
Example:
<|im_start|>system
You are Dolphin, a helpful AI assistant.<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
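For reference, a minimal sketch of producing this prompt format with the 🤗 Transformers chat-template API. The repo id below is an assumption; substitute the actual Dolphin 2.9.3 checkpoint you use, and note that this only matches the format above if the tokenizer ships a ChatML chat template:

```python
from transformers import AutoTokenizer

model_id = "cognitivecomputations/dolphin-2.9.3-mistral-nemo-12b"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are Dolphin, a helpful AI assistant."},
    {"role": "user", "content": "Explain function calling in one paragraph."},
]

# Renders the <|im_start|>/<|im_end|> structure shown above and appends the
# assistant header so the model continues from there.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```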
Dolphin-2.9.3 has a variety of instruction following, conversational, and coding skills. It also has initial agentic abilities and supports function calling.
Dolphin is uncensored. We have filtered the dataset to remove alignment and bias. This makes the model more compliant. You are advised to implement your own alignment layer before exposing the model as a service. It will be highly compliant with any requests, even unethical ones. Please read my blog post about uncensored models. https://erichartford.com/uncensored-models You are responsible for any content you create using this model. Enjoy responsibly.
Dolphin is licensed under the Apache 2.0 license. We grant permission for any use, including commercial. Dolphin was trained on data generated from GPT-4, among other models. Evals: see evals.
Training
Built with Axolotl (see the axolotl config).
Visualize in Weights & Biases: workspace/axolotl/dolphin-2.9.3-mistral-nemo
This model was fine-tuned from mistralai/Mistral-Nemo-Base-2407. It achieves the following results on the evaluation set:
Loss: 0.5605
Model description
More information needed
Intended uses & limitations
More information needed
Training and evaluation data
More information needed
Training procedure
Training hyperparameters
The following hyperparameters were used during training (a quick consistency check on the effective batch size follows the list):
learning_rate: 5e-06
train_batch_size: 1
eval_batch_size: 1
seed: 42
distributed_type: multi-GPU
num_devices: 8
gradient_accumulation_steps: 16
total_train_batch_size: 128
total_eval_batch_size: 8
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_steps: 100
num_epochs: 3
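A sketch of the arithmetic behind the values above, nothing more: the total train batch size is the per-device batch size times the gradient accumulation steps times the number of devices.

```python
train_batch_size = 1
gradient_accumulation_steps = 16
num_devices = 8

# 1 * 16 * 8 = 128, matching the total_train_batch_size reported above.
total_train_batch_size = train_batch_size * gradient_accumulation_steps * num_devices
assert total_train_batch_size == 128
```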
Training results

| Training Loss | Epoch  | Step | Validation Loss |
|---------------|--------|------|-----------------|
| 0.5691        | 1.0162 | 983  | 0.5734          |
| 0.5335        | 2.0174 | 1968 | 0.5609          |
| 0.5297        | 2.9639 | 2901 | 0.5605          |

Framework versions
Transformers 4.43.0.dev0
Pytorch 2.2.2+cu121
Datasets 2.19.1
Tokenizers 0.19.1
Prepared dataset from roneneldan/TinyStoriesV2-GPT4
Data Preparation pipeline.
Download TinyStoriesV2-GPT4-train.txt from https://huggingface.co/datasets/roneneldan/TinyStories/blob/main/TinyStoriesV2-GPT4-train.txt
from tqdm import tqdm

raw = open('TinyStoriesV2-GPT4-train.txt').readlines()
stories = []
chunk = []
for x in tqdm(raw, total=len(raw)):
    if x.strip() == '':                  # skip blank separator lines
        continue
    if x.startswith('<|endoftext|>'):    # end-of-story marker: flush the current chunk
        stories.append(" ".join(chunk))
        chunk = []
    else:
        chunk.append(x.strip())
… See the full description on the dataset page: https://huggingface.co/datasets/maveriq/tinystoriesv2_gpt4.
Flan-GPT4 Dataset
Overview
The Flan-GPT4 dataset is a collection of prompts and responses designed for training and evaluating language generation models. It contains features such as response, instruction, system, toxin_prompt, and llama_prompt, each with a data type of string. It was edited and customized from SlimOrca-Flan.
Dataset Information
Features:
response (string)
instruction (string)
system (string)
toxin_prompt (string)
llama_prompt (string)
… See the full description on the dataset page: https://huggingface.co/datasets/erfanzar/Flan-GPT4.
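A minimal usage sketch (assuming the dataset exposes a default train split):

```python
from datasets import load_dataset

ds = load_dataset("erfanzar/Flan-GPT4", split="train")
print(ds.features)           # response, instruction, system, toxin_prompt, llama_prompt
print(ds[0]["instruction"])  # peek at a single example
```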
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Perception performance comparison on the MME benchmark.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Experimental results (%) of loss_IE at both ends of the sequence.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Overview
This dataset is a Malagasy adaptation of the Alpaca-GPT4 instruction-following dataset. It contains instruction-response pairs translated or adapted into Malagasy, designed for fine-tuning instruction-following language models. Each entry includes an instruction, optional input context, and a reference response generated by GPT-4 and adapted into Malagasy using Gemini 2.5 for the translation.
The dataset enables training and evaluating LLMs on instruction understanding… See the full description on the dataset page: https://huggingface.co/datasets/Lo-Renz-O/alpaca-gpt4-MG.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
External data for LMSYS - Chatbot Arena Human Preference Predictions competition.
Downloaded from HuggingFace dataset: argilla/ultrafeedback-multi-binarized-preferences-cleaned
Additionally, I converted the data into LMSYS train data format (you may still need to shuffle the responses).
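A minimal sketch of that shuffling step (the column names follow the LMSYS train format described above and are assumptions, as is the file name; adjust to the actual files):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("ultrafeedback_lmsys_format.csv")  # hypothetical file name

# Randomly swap response_a/response_b (and their winner labels) for roughly half
# the rows so the preferred answer does not always sit in the same column.
rng = np.random.default_rng(42)
swap = rng.random(len(df)) < 0.5

a_cols = ["response_a", "winner_model_a"]
b_cols = ["response_b", "winner_model_b"]
df.loc[swap, a_cols + b_cols] = df.loc[swap, b_cols + a_cols].to_numpy()

df.to_csv("ultrafeedback_lmsys_format_shuffled.csv", index=False)
```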
Version 2 contains additional examples with ties between model responses that were previously filtered out.
NOTE: This dataset uses GPT-4 as a judge, as a proxy for human preference ratings.
UltraFeedback - Multi-Binarized using the Average of Preference Ratings (Cleaned) dataset represents a new iteration on top of argilla/ultrafeedback-binarized-preferences-cleaned, and has been created to explore whether DPO fine-tuning with more than one rejection per chosen response helps the model perform better in the AlpacaEval, MT-Bench, and LM Eval Harness benchmarks.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Experimental results (%) under different orders of loss_ST and loss_LM.