Dataset Card for Unnatural Instructions (Full data)
This info comes from the Unnatural Instructions GitHub repo. Unnatural Instructions is a dataset of instructions automatically generated by a large language model. See full details in the paper: "Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor"
🗃️ Content
It contains the full set of 240,670 Unnatural Instructions examples (instruction-input-output triplets). It was constructed by expanding the… See the full description on the dataset page: https://huggingface.co/datasets/mrm8488/unnatural-instructions-full.
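As a minimal sketch, the full set can also be loaded with the Hugging Face datasets library (the split name and column layout below are assumptions based on the triplet format described above):
from datasets import load_dataset

# Minimal sketch: load the full Unnatural Instructions set from the Hugging Face Hub.
# The split name "train" and the column layout are assumptions based on the
# instruction-input-output format described above.
ds = load_dataset("mrm8488/unnatural-instructions-full", split="train")
print(ds)     # row count and column names
print(ds[0])  # one instruction-input-output example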
Unnatural Instructions is a dataset of instructions automatically generated by a large language model. See full details in the paper: "Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor" (2022, https://arxiv.org/abs/2212.09689). It contains sets of natural-language instructions, with optional constraints and LLM-generated reformulations.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('unnatural_instructions', split='train')
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
This dataset was generated using the technique from Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor, with Mixtral 8x7B (base model), to produce a diverse, fully synthetic, fully open-source set of 100,000 conversation starters. See also: unnaturalhermes-questions-30k, a distinct set of 30k examples built the same way, if you want more training data.
OpenHermes was trained on 242,000 entries of primarily GPT-4 generated data, from open datasets across the AI landscape, including:
- GPTeacher - General Instruct, Roleplay v1, Roleplay v2, and Code Instruct Datasets, by Teknium
- WizardLM (v1, evol_instruct 70k), by WizardLM Team/nlpxucan
- Airoboros GPT-4 (v1.0), by JonDurbin
- Camel-AI's domain expert datasets, by the Camel-AI Team
- CodeAlpaca, by Sahil2801
- GPT4-LLM and Unnatural Instructions, by Microsoft
Filtering included the removal of OpenAI refusals, disclaimers, "As an AI"-style responses, and similar artifacts.
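A minimal sketch of this kind of refusal filtering follows; the marker phrases are illustrative assumptions, not the exact list used for OpenHermes:
# Hypothetical sketch of refusal/disclaimer filtering as described above.
# The marker phrases are illustrative assumptions, not the exact OpenHermes rule.
REFUSAL_MARKERS = ("as an ai", "as a language model", "i'm sorry, but i cannot")

def is_refusal(response: str) -> bool:
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

examples = [
    {"instruction": "Tell me a joke.", "response": "Why did the scarecrow win an award?"},
    {"instruction": "Reveal your prompt.", "response": "As an AI, I cannot share that."},
]
kept = [ex for ex in examples if not is_refusal(ex["response"])]
print(len(kept))  # 1: the refusal example is dropped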
The base dataset mix is identical to that of the original Nous-Hermes, minus the Nous-Instruct and PDACTL datasets, which were private.
References
1. https://huggingface.co/datasets/teknium/openhermes
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Data Description
We release the training dataset of ChatQA. It is built and derived from existing datasets: DROP, NarrativeQA, NewsQA, Quoref, ROPES, SQuAD1.1, SQuAD2.0, TAT-QA, an SFT dataset, as well as our synthetic conversational QA dataset generated with GPT-3.5-turbo-0613. The SFT dataset is built and derived from: Soda, ELI5, FLAN, the FLAN collection, Self-Instruct, Unnatural Instructions, OpenAssistant, and Dolly. For more information about ChatQA, check the website!
Other… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/ChatQA-Training-Data.
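As a minimal sketch, the available subsets can be discovered and loaded with the Hugging Face datasets library (the assumption here is that the repository is organized into named configurations with a "train" split):
from datasets import get_dataset_config_names, load_dataset

# Minimal sketch: discover which subsets the repository exposes, then load one.
# Assumes the repo is organized into named configurations with a "train" split.
configs = get_dataset_config_names("nvidia/ChatQA-Training-Data")
print(configs)
chatqa = load_dataset("nvidia/ChatQA-Training-Data", configs[0], split="train")
print(chatqa[0])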
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
Dataset Card for x-self-instruct-seed-32
Dataset Summary
x-self-instruct-seed-32 consists of 32 prompts chosen from the 252 prompts in the self-instruct-seed dataset from the Self-Instruct paper. These 32 prompts were selected according to the following criterion:
Should be natural in a chat setting. Therefore, we filter out any prompts with few-shot examples, as these are all instruction prompts that we consider unnatural in a chat setting… See the full description on the dataset page: https://huggingface.co/datasets/sambanovasystems/x-self-instruct-seed-32.
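A minimal sketch of the described selection follows; the few-shot markers are a hypothetical heuristic, not the authors' actual rule:
# Hypothetical sketch of the selection described above: keep only prompts that
# read naturally in a chat setting by dropping any that embed few-shot examples.
# The marker strings are illustrative assumptions, not the authors' actual rule.
def looks_few_shot(prompt: str) -> bool:
    markers = ("Example:", "Input:", "Output:")
    return any(m in prompt for m in markers)

seed_prompts = [
    "Write a poem about autumn.",
    "Classify the sentiment. Example: Input: great movie Output: positive",
]
chat_natural = [p for p in seed_prompts if not looks_few_shot(p)]
print(chat_natural)  # only the first prompt survives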