20 datasets found

h
GPT4-8K
huggingface.co
Updated Jan 6, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Erfan zare chavoshi (2024). GPT4-8K [Dataset]. https://huggingface.co/datasets/erfanzar/GPT4-8K
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 6, 2024
Authors
Erfan zare chavoshi
Description
Dataset Card for "GPT4-8K"

Sure! Here's a README.md file for your dataset:

Dataset Description

This dataset was generated using GPT-4, a powerful language model developed by OpenAI. It contains a collection of dialogs between a user and an assistant, along with additional information. from OpenChat

Dataset Configurations

The dataset includes the following configurations:

Config Name: default

Data Files: Split: train Path: data/train-*

Dataset… See the full description on the dataset page: https://huggingface.co/datasets/erfanzar/GPT4-8K.
h
airoboros-gpt4
huggingface.co
Updated Jun 4, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jon Durbin (2023). airoboros-gpt4 [Dataset]. https://huggingface.co/datasets/jondurbin/airoboros-gpt4
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 4, 2023
Authors
Jon Durbin
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
The data was generated by gpt-4, and therefore is subject to OpenAI ToS. The tool used to generate the data airoboros is apache-2. Specific areas of focus for this training data:

trivia math nonsensical math coding closed context question answering closed context question answering, with multiple contexts to choose from as confounding factors writing multiple choice

Usage and License Notices

All airoboros models and datasets are intended and licensed for research use only.… See the full description on the dataset page: https://huggingface.co/datasets/jondurbin/airoboros-gpt4.
h
alpaca-gpt4-data-zh
huggingface.co
Updated Apr 11, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Chris Alexiuk (2023). alpaca-gpt4-data-zh [Dataset]. https://huggingface.co/datasets/llm-wizard/alpaca-gpt4-data-zh
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 11, 2023
Authors
Chris Alexiuk
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset Card for "alpaca-gpt4-data-zh"

All of the work is done by this team.

Usage and License Notices

The data is intended and licensed for research use only. The dataset is CC BY NC 4.0 (allowing only non-commercial use) and models trained using the dataset should not be used outside of research purposes.

English Dataset

Found here

Citation

@article{peng2023gpt4llm, title={Instruction Tuning with GPT-4}, author={Baolin Peng, Chunyuan Li… See the full description on the dataset page: https://huggingface.co/datasets/llm-wizard/alpaca-gpt4-data-zh.
Z
Model Output of GPT-3.5 and GPT-4 for ECHR-AM
data.niaid.nih.gov
explore.openaire.eu
Updated Dec 13, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mitrović, Jelena (2024). Model Output of GPT-3.5 and GPT-4 for ECHR-AM [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8246128
Explore at:
Dataset updated
Dec 13, 2024
Dataset provided by
Mitrović, Jelena
Granitzer, Michael
Zubaer, Abdullah Al
Description
"gpt3.5-gpt4-input-output-echram.zip" :

Input and output to GPT-3.5 and GPT-4 based on ECHR dataset published in JSON format in this paper for argument component classification only i.e. clauses that are argumentative (conclusion/premise), extracted from the JSON file

Note: Output of the model is under OpenAI Terms & policies.

Please cite our paper also if you use this dataset: Performance analysis of large language models in the domain of legal argument mining

You can click here for BibTex or copy the text below.

@ARTICLE{10.3389/frai.2023.1278796,

AUTHOR={Al Zubaer, Abdullah and Granitzer, Michael and Mitrović, Jelena },

TITLE={Performance analysis of large language models in the domain of legal argument mining},

JOURNAL={Frontiers in Artificial Intelligence},

VOLUME={6},

YEAR={2023},

URL={https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2023.1278796},

DOI={10.3389/frai.2023.1278796},

ISSN={2624-8212},

ABSTRACT={Generative pre-trained transformers (GPT) have recently demonstrated excellent performance in various natural language tasks. The development of ChatGPT and the recently released GPT-4 model has shown competence in solving complex and higher-order reasoning tasks without further training or fine-tuning. However, the applicability and strength of these models in classifying legal texts in the context of argument mining are yet to be realized and have not been tested thoroughly. In this study, we investigate the effectiveness of GPT-like models, specifically GPT-3.5 and GPT-4, for argument mining via prompting. We closely study the model's performance considering diverse prompt formulation and example selection in the prompt via semantic search using state-of-the-art embedding models from OpenAI and sentence transformers. We primarily concentrate on the argument component classification task on the legal corpus from the European Court of Human Rights. To address these models' inherent non-deterministic nature and make our result statistically sound, we conducted 5-fold cross-validation on the test set. Our experiments demonstrate, quite surprisingly, that relatively small domain-specific models outperform GPT 3.5 and GPT-4 in the F1-score for premise and conclusion classes, with 1.9% and 12% improvements, respectively. We hypothesize that the performance drop indirectly reflects the complexity of the structure in the dataset, which we verify through prompt and data analysis. Nevertheless, our results demonstrate a noteworthy variation in the performance of GPT models based on prompt formulation. We observe comparable performance between the two embedding models, with a slight improvement in the local model's ability for prompt selection. This suggests that local models are as semantically rich as the embeddings from the OpenAI model. Our results indicate that the structure of prompts significantly impacts the performance of GPT models and should be considered when designing them.}}
daigt-v3-train-dataset
kaggle.com
Updated Dec 28, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Darek Kłeczek (2023). daigt-v3-train-dataset [Dataset]. https://www.kaggle.com/datasets/thedrcat/daigt-v3-train-dataset/discussion?sort=undefined
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Dec 28, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Darek Kłeczek
Description
New release of DAIGT train dataset! New models: 'text-ada-001', 'text-babbage-001', 'text-curie-001', 'text-davinci-001', 'text-davinci-002', 'text-davinci-003'

These models from OpenAI are getting deprecated, so I made sure to generate some essays with them and share here. I also added following public datasets (please upvote!): - https://www.kaggle.com/datasets/phanisrikanth/daigt-essays-from-intel-neural-chat-7b - https://www.kaggle.com/datasets/carlmcbrideellis/llm-mistral-7b-instruct-texts - https://www.kaggle.com/datasets/nbroad/daigt-data-llama-70b-and-falcon180b - https://www.kaggle.com/datasets/snassimr/gpt4-rephrased-llm-daigt-dataset

All merged with my previous dataset for convenience (https://www.kaggle.com/datasets/thedrcat/daigt-v2-train-dataset)

Enjoy ❤️

Version 2 update: - removed NaNs and duplicated/short generations - applied cleaning prodedure from @nbroad's notebook - give it an upvote please! - added model column to indicate model family used in generations
Alpaca GPT-4
kaggle.com
opendatalab.com
+1more
Updated Nov 24, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The Devastator (2023). Alpaca GPT-4 [Dataset]. https://www.kaggle.com/datasets/thedevastator/gpt-4-instruction-following-dataset/versions/2
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 24, 2023
Dataset provided by
Kaggle
Authors
The Devastator
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Alpaca GPT-4

High-Performance NLP for Instruction-Following Reasoning

By Huggingface Hub [source]

About this dataset

This dataset consists of 52K instruction-following data generated by GPT-4 in English using the same prompts as in Alpaca. This data has been crafted specifically to help researchers break ground and explore new strategies for natural language processing, with a special focus on instruction-following reasoning.

What makes this dataset unique and powerful is that it offers an ample variety of options for experimenting with models that can excel at instruction following tasks; from refining specific components such as predicting outputs or analyzing long textual conversations, to using the entire platform to train and evaluate end-to-end approaches. Allowing researchers the opportunity to rapidly iterate their experiments while having the confidence of a high performant model with few limitations - making this an invaluable resource for anyone looking to push the boundaries of artificial intelligence techniques for logical reasoning problems

More Datasets

For more datasets, click here.

Featured Notebooks

🚨 Your notebook can be here! 🚨!

How to use the dataset

This dataset is an invaluable resource for researching artificial intelligence approaches to logical reasoning problems. This dataset consists of 52K instruction-following samples generated by GPT-4 in English using the same prompts as in Alpaca. Here are some tips on how to make the most out of this dataset:

The columns in this dataset provide essential data that can help researchers evaluate their models on a task involving instruction following: instruction, input, output and text. In order to effectively use this data, it is important for researchers to be familiar with each column and understand its purpose and contribution towards understanding instructional following principles. a) The instruction column provides a statement which an AI model must interpret in order for it complete a task correctly; b) The 'input' column is basically pre-generated data that helps an AI model make sense of the instructions; c) The 'output' column indicates what kind of result must be returned after the AI model interprets instructions correctly; and finally,
d) The ‘text’ column is full text generated by GPT-4 which gives us deeper insight into what gave rise our output results from input & instruction handling.

Note : It's very important that researchers pay attention to all four columns when overseeing their work on such datasets, as all four components collaborate together integrately.

To get better results one should consider fine tuning existing schemes so they become better suited for instruction following tasks using these 4 columns as guidance points. It would be also useful if the datasets came with corresponding hyperparameters so users can fine tune them quicker without losing accuracy or any other metric needed on such scenarios!

Additionally, readers should Oyverviewedthe contextcloserlytoaccuracy assessthepunishmeasure opinion toneandGoforwhichmodeltypebestsuitsitcaseization given before attempting any sort of evaluation since some might bringmore accurateresultsbuttakelongertoprocess ore viceversa!yerinaredaviews satismetricmayvariaentdataobservioletorsalld .yCdgntricular error%mnfreeunerratreated too accommodate certain scenarios better than others but will still depend largely onthedatasetaccuratelyusedtocourubricateperformances026 (269units). For example, if changes are

Research Ideas

Training intelligent conversational agents with instruction-following reasoning capabilities.

Developing more complex and powerful instructions processing models driven by natural language understanding and reasoning algorithms.

Establishing an online platform to help academic, business or other organizations to construct auto-grading systems for instruction-following skills evaluation of their staff at large scale in a relatively cheap way

Acknowledgements

If you use this dataset in your research, please credit the original authors. Data Source

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: train.csv | Colu...
h
GPT-4-Prompts
huggingface.co
Updated Dec 22, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Erfan zare chavoshi (2024). GPT-4-Prompts [Dataset]. https://huggingface.co/datasets/erfanzar/GPT-4-Prompts
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Dec 22, 2024
Authors
Erfan zare chavoshi
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Multi-Turn Conversational Prompts from ChatGPT-4 (10K+ Tokens) Abstract: This dataset offers a valuable collection of multi-turn conversational prompts generated by ChatGPT-4, carefully curated for diverse prompt styles (chatml, gemma, llama). Each prompt exceeds 10,000 tokens, providing ample context and inspiration for training and evaluating large language models. Ideal for researchers and developers interested in exploring advanced conversational AI capabilities. Table of Contents:… See the full description on the dataset page: https://huggingface.co/datasets/erfanzar/GPT-4-Prompts.
Curlie Enhanced with LLM Annotations: Two Datasets for Advancing...
zenodo.org
data.niaid.nih.gov
csv
Updated Dec 21, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Peter Nutter; Mika Senghaas; Ludek Cizinsky; Peter Nutter; Mika Senghaas; Ludek Cizinsky (2023). Curlie Enhanced with LLM Annotations: Two Datasets for Advancing Homepage2Vec's Multilingual Website Classification [Dataset]. http://doi.org/10.5281/zenodo.10413068
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.10413068
Dataset updated
Dec 21, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Peter Nutter; Mika Senghaas; Ludek Cizinsky; Peter Nutter; Mika Senghaas; Ludek Cizinsky
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Advancing Homepage2Vec with LLM-Generated Datasets for Multilingual Website Classification

This dataset contains two subsets of labeled website data, specifically created to enhance the performance of Homepage2Vec, a multi-label model for website classification. The datasets were generated using Large Language Models (LLMs) to provide more accurate and diverse topic annotations for websites, addressing a limitation of existing Homepage2Vec training data.

Key Features:

LLM-generated annotations: Both datasets feature website topic labels generated using LLMs, a novel approach to creating high-quality training data for website classification models.

Improved multi-label classification: Fine-tuning Homepage2Vec with these datasets has been shown to improve its macro F1 score from 38% to 43% evaluated on a human-labeled dataset, demonstrating their effectiveness in capturing a broader range of website topics.

Multilingual applicability: The datasets facilitate classification of websites in multiple languages, reflecting the inherent multilingual nature of Homepage2Vec.

Dataset Composition:

curlie-gpt3.5-10k: 10,000 websites labeled using GPT-3.5, context 2 and 1-shot

curlie-gpt4-10k: 10,000 websites labeled using GPT-4, context 2 and zero-shot

Intended Use:

Fine-tuning and advancing Homepage2Vec or similar website classification models

Research on LLM-generated datasets for text classification tasks

Exploration of multilingual website classification

Additional Information:

Project and report repository: https://github.com/CS-433/ml-project-2-mlp

Acknowledgments:

This dataset was created as part of a project at EPFL's Data Science Lab (DLab) in collaboration with Prof. Robert West and Tiziano Piccardi.
h
Flan-GPT4
huggingface.co
Updated Jan 6, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Erfan zare chavoshi (2024). Flan-GPT4 [Dataset]. https://huggingface.co/datasets/erfanzar/Flan-GPT4
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 6, 2024
Authors
Erfan zare chavoshi
Description
Flan-GPT4 Dataset

Overview

The Flan-GPT4 dataset is a collection of prompts and responses designed for training and evaluating language generation models. It contains various features such as response, instruction, system, toxin_prompt, and llama_prompt, each with a data type of string. Edited and customized from SlimOrca-Flan

Dataset Information

Features:

response (string) instruction (string) system (string) toxin_prompt (string) llama_prompt (string)… See the full description on the dataset page: https://huggingface.co/datasets/erfanzar/Flan-GPT4.
h
covid-bing-query-gpt4-avs_triplets
huggingface.co
Updated Aug 30, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Aivin Solatorio (2024). covid-bing-query-gpt4-avs_triplets [Dataset]. https://huggingface.co/datasets/avsolatorio/covid-bing-query-gpt4-avs_triplets
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 30, 2024
Authors
Aivin Solatorio
Description
COVq dataset

This dataset was used in the paper GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning. Refer to https://arxiv.org/abs/2402.16829 for details. The code for generating the data is available at https://github.com/avsolatorio/GISTEmbed.

Citation

@article{solatorio2024gistembed, title={GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning}, author={Aivin V. Solatorio}… See the full description on the dataset page: https://huggingface.co/datasets/avsolatorio/covid-bing-query-gpt4-avs_triplets.
h
GPT4-Mixtral-MMLU-Preference-Complexity-train
huggingface.co
Updated Jul 29, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bud (2024). GPT4-Mixtral-MMLU-Preference-Complexity-train [Dataset]. https://huggingface.co/datasets/budecosystem/GPT4-Mixtral-MMLU-Preference-Complexity-train
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 29, 2024
Dataset authored and provided by
Bud
Description
budecosystem/GPT4-Mixtral-MMLU-Preference-Complexity-train dataset hosted on Hugging Face and contributed by the HF Datasets community
h
LLaVAR
huggingface.co
Updated Jan 27, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Social And Language Technology Lab (2021). LLaVAR [Dataset]. https://huggingface.co/datasets/SALT-NLP/LLaVAR
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 27, 2021
Dataset authored and provided by
Social And Language Technology Lab
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
LLaVAR Data: Enhanced Visual Instruction Data with Text-Rich Images

More info at LLaVAR project page, Github repo, and paper.

Training Data

Based on the LAION dataset, we collect 422K pretraining data based on OCR results. For finetuning data, we collect 16K high-quality instruction-following data by interacting with langauge-only GPT-4. Note that we also release a larger and more diverse finetuning dataset below (20K), which contains the 16K we used for the paper. The… See the full description on the dataset page: https://huggingface.co/datasets/SALT-NLP/LLaVAR.
h
UltraChat-Mixin
huggingface.co
Updated Apr 22, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Erfan zare chavoshi (2023). UltraChat-Mixin [Dataset]. https://huggingface.co/datasets/erfanzar/UltraChat-Mixin
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 22, 2023
Authors
Erfan zare chavoshi
Description
Dataset Card for "UltraChat-Mixin"

UltraChat-Mixin Dataset Overview

UltraChat-Mixin is a dataset created by Me, which is a mix of three datasets: 'stingning/ultrachat', 'jondurbin/airoboros-2.1', and 'erfanzar/GPT4-8K'. This dataset is designed for training conversational AI models.

Dataset Configuration

The dataset is configured as follows: configs: - config_name: default data_files: - split: train path: data/train-*… See the full description on the dataset page: https://huggingface.co/datasets/erfanzar/UltraChat-Mixin.
h
mitsu_full_borda
huggingface.co
Updated Oct 10, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lightblue KK. (2024). mitsu_full_borda [Dataset]. https://huggingface.co/datasets/lightblue/mitsu_full_borda
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Oct 10, 2024
Dataset authored and provided by
Lightblue KK.
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Mitsu

[Paper] [Model] This is a multilingual preference dataset generated using human written prompts and responses from 7 LLMs. We evaluate each set of responses 5 times using GPT4. Note that this model has a non-commerical license as we used the Command R and Command R+ models to create this data. We are currently working on a developing a commerically usable model, so stay tuned for that!

Dataset details

This is the ORPO training dataset derived from the… See the full description on the dataset page: https://huggingface.co/datasets/lightblue/mitsu_full_borda.
h
math
huggingface.co
Updated Apr 11, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
CAMEL-AI.org (2023). math [Dataset]. https://huggingface.co/datasets/camel-ai/math
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 11, 2023
Dataset provided by
CAMEL-AI.org
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
CAMEL: Communicative Agents for “Mind” Exploration of Large Scale Language Model Society

Github: https://github.com/lightaime/camel Website: https://www.camel-ai.org/ Arxiv Paper: https://arxiv.org/abs/2303.17760

Dataset Summary

Math dataset is composed of 50K problem-solution pairs obtained using GPT-4. The dataset problem-solutions pairs generating from 25 math topics, 25 subtopics for each topic and 80 problems for each "topic,subtopic" pairs. We provide the data… See the full description on the dataset page: https://huggingface.co/datasets/camel-ai/math.
orca-math-word-problems-200k
huggingface.co
Updated Mar 4, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Microsoft (2024). orca-math-word-problems-200k [Dataset]. https://huggingface.co/datasets/microsoft/orca-math-word-problems-200k
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 4, 2024
Dataset authored and provided by
Microsofthttp://microsoft.com/
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset Card

This dataset contains ~200K grade school math word problems. All the answers in this dataset is generated using Azure GPT4-Turbo. Please refer to Orca-Math: Unlocking the potential of SLMs in Grade School Math for details about the dataset construction.

Dataset Sources

Repository: microsoft/orca-math-word-problems-200k Paper: Orca-Math: Unlocking the potential of SLMs in Grade School Math

Direct Use

This dataset has been designed to… See the full description on the dataset page: https://huggingface.co/datasets/microsoft/orca-math-word-problems-200k.
h
Human-Style-Answers
huggingface.co
Updated Apr 20, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
INNOVA AI (2024). Human-Style-Answers [Dataset]. https://huggingface.co/datasets/innova-ai/Human-Style-Answers
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 20, 2024
Dataset authored and provided by
INNOVA AI
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Human Style Answers

This Datasets contains question and answers on different topics in Human style. (For Chatbots training) This Datasets is build using TOP AI like (GPT4, Claude3 , Command R+, etc.)

Dataset Details Description

The Human Style Response Dataset is a rich collection of question-and-answer pairs, meticulously crafted in a human-like style. It serves as a valuable resource for training chatbots and conversational AI models. Let's dive into the… See the full description on the dataset page: https://huggingface.co/datasets/innova-ai/Human-Style-Answers.
tulu-v1-sft-mixture
huggingface.co
Updated Nov 14, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ai2 (2023). tulu-v1-sft-mixture [Dataset]. https://huggingface.co/datasets/allenai/tulu-v1-sft-mixture
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 14, 2023
Dataset provided by
Allen Institute for AIhttp://allenai.org/
Authors
Ai2
License
https://choosealicense.com/licenses/odc-by/https://choosealicense.com/licenses/odc-by/
Description
Dataset Card for Tulu Instruction Mix

For a newer version, see Tulu V2 This version, the human data mixture, dataset consists of a mix of:

FLAN (Apache 2.0): FLAN v2 with CoT examples (most of the tasks in SuperNatural Instructions are included here) Open Assistant 1 (Apache 2.0) Dolly (CC By SA 3.0) ShareGPT (Apache 2.0 listed, no official repo found) GPT4-Alpaca (CC By NC 4.0) Code-Alpaca (CC By NC 4.0)

These are made by taking either just the training set of the subsets or the… See the full description on the dataset page: https://huggingface.co/datasets/allenai/tulu-v1-sft-mixture.
h
wikipedia-document-question-answer
huggingface.co
Updated Jan 24, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Grimulkan (2024). wikipedia-document-question-answer [Dataset]. https://huggingface.co/datasets/grimulkan/wikipedia-document-question-answer
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 24, 2024
Authors
Grimulkan
License
https://choosealicense.com/licenses/unknown/https://choosealicense.com/licenses/unknown/
Description
Multi-round questions and answers for randomly selected Wikipedia articles of varying lengths, in fastchat JSON format, generated by gpt-4-1106-preview. OpenAI terms apply. This was designed to train a 32K context-length model. Check the total conversation lengths before using data items for training to ensure that they fit inside your target context window, and discard queries that don't fit.

Both the questions and answers were generated by GPT4, based on the document. Only information from… See the full description on the dataset page: https://huggingface.co/datasets/grimulkan/wikipedia-document-question-answer.
h
alpaca-zh
huggingface.co
Updated Apr 11, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ming Xu (徐明) (2023). alpaca-zh [Dataset]. https://huggingface.co/datasets/shibing624/alpaca-zh
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 11, 2023
Authors
Ming Xu (徐明)
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset Card for "alpaca-zh"

本数据集是参考Alpaca方法基于GPT4得到的self-instruct数据，约5万条。 Dataset from https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM It is the chinese dataset from https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM/blob/main/data/alpaca_gpt4_data_zh.json

Usage and License Notices

The data is intended and licensed for research use only. The dataset is CC BY NC 4.0 (allowing only non-commercial use) and models trained using the dataset should not… See the full description on the dataset page: https://huggingface.co/datasets/shibing624/alpaca-zh.
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

Erfan zare chavoshi (2024). GPT4-8K [Dataset]. https://huggingface.co/datasets/erfanzar/GPT4-8K

GPT4-8K

GPT4

erfanzar/GPT4-8K

Explore at:

13 scholarly articles cite this dataset (View in Google Scholar)

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Jan 6, 2024

Authors

Erfan zare chavoshi

Description

Dataset Card for "GPT4-8K"

Sure! Here's a README.md file for your dataset:

  Dataset Description

This dataset was generated using GPT-4, a powerful language model developed by OpenAI. It contains a collection of dialogs between a user and an assistant, along with additional information. from OpenChat

  Dataset Configurations

The dataset includes the following configurations:

Config Name: default

Data Files: Split: train Path: data/train-*

  Dataset… See the full description on the dataset page: https://huggingface.co/datasets/erfanzar/GPT4-8K.

Clear search

Close search

Google apps

Main menu

GPT4-8K

airoboros-gpt4

alpaca-gpt4-data-zh

Model Output of GPT-3.5 and GPT-4 for ECHR-AM

daigt-v3-train-dataset

Alpaca GPT-4

Alpaca GPT-4

High-Performance NLP for Instruction-Following Reasoning

About this dataset

More Datasets

Featured Notebooks

How to use the dataset

Research Ideas

Acknowledgements

License

Columns

GPT-4-Prompts

Curlie Enhanced with LLM Annotations: Two Datasets for Advancing...

Advancing Homepage2Vec with LLM-Generated Datasets for Multilingual Website Classification

Flan-GPT4

covid-bing-query-gpt4-avs_triplets

GPT4-Mixtral-MMLU-Preference-Complexity-train

LLaVAR

UltraChat-Mixin

mitsu_full_borda

math

orca-math-word-problems-200k

Human-Style-Answers

tulu-v1-sft-mixture

wikipedia-document-question-answer

alpaca-zh

GPT4-8K

GPT4

erfanzar/GPT4-8K