Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Card for Alpaca
Dataset Summary
Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction tuning for language models and make them follow instructions better. The authors built on the data generation pipeline from the Self-Instruct framework and made the following modifications:
The text-davinci-003 engine to generate the instruction data instead… See the full description on the dataset page: https://huggingface.co/datasets/tatsu-lab/alpaca.
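As a quick orientation, here is a minimal sketch of loading this dataset and inspecting one record, assuming the Hugging Face `datasets` library and the standard Alpaca field names:

```python
from datasets import load_dataset

# Load the Alpaca instruction-tuning data from the Hugging Face Hub.
dataset = load_dataset("tatsu-lab/alpaca", split="train")
print(len(dataset))  # roughly 52,000 examples

# Each record carries an instruction, an optional input, and a model-written output.
example = dataset[0]
print(example["instruction"])
print(example["input"])
print(example["output"])
```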
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for Alpaca-Cleaned
Repository: https://github.com/gururise/AlpacaDataCleaned
Dataset Description
This is a cleaned version of the original Alpaca Dataset released by Stanford. The following issues have been identified in the original release and fixed in this dataset:
Hallucinations: Many instructions in the original dataset referenced data on the internet, which simply caused GPT-3 to hallucinate an answer.
"instruction":"Summarize the… See the full description on the dataset page: https://huggingface.co/datasets/weiwei888/VIS.
Academic Free License v3.0 (AFL-3.0): https://choosealicense.com/licenses/afl-3.0/
09/04/2023 update: new instructions added from https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM. Original version: https://github.com/tatsu-lab/stanford_alpaca#data-release. AI-based translation of Stanford Alpaca from English to Turkish. For academic use only; please cite before using: Taşar, D. E. T. (2023). stanford-alpaca-cleaned-turkish-translated [Dataset]. In Stanford Alpaca TR (1.0.1.a). https://huggingface.co/datasets/emre/stanford-alpaca-cleaned-turkish-translated… See the full description on the dataset page: https://huggingface.co/datasets/emre/stanford-alpaca-cleaned-turkish-translated.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
LLaMA is a great work that demonstrates impressive zero-shot and few-shot abilities. It significantly reduces the cost of training, fine-tuning, and using competitive large language models: LLaMA-13B outperforms GPT-3 (175B), and LLaMA-65B is competitive with PaLM-540B. Recently, to boost the instruction-following ability of LLaMA, Stanford Alpaca fine-tuned LLaMA-7B on 52K instruction-following examples generated with the Self-Instruct technique. However, the LLM research community still faces three challenges: 1. Even LLaMA-7B still has high computing-resource requirements; 2. There are few open-source datasets for instruction fine-tuning; and 3. There is a lack of empirical study on the impact of various types of instructions on model abilities, such as the ability to respond to Chinese instructions and CoT reasoning.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This is a Thai 🇹🇭-instructed dataset translated from the cleaned version of the original Alpaca Dataset released by Stanford, using Google Cloud Translation. It contains 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction tuning for language models and make them follow instructions better.
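As a rough sketch of the approach described above (not the authors' actual pipeline), translating one Alpaca record with the `google-cloud-translate` client might look like this; the helper name and field handling are illustrative assumptions:

```python
from google.cloud import translate_v2 as translate

# Requires GOOGLE_APPLICATION_CREDENTIALS to point at a service-account key.
client = translate.Client()

def translate_record(record: dict, target: str = "th") -> dict:
    """Translate the instruction/input/output fields of one Alpaca record."""
    translated = {}
    for key in ("instruction", "input", "output"):
        text = record.get(key, "")
        if text:
            result = client.translate(text, target_language=target)
            translated[key] = result["translatedText"]
        else:
            translated[key] = ""
    return translated
```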
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for Roman Urdu + Alpaca QA Mix
This dataset is intended to support fine-tuning and evaluation of language models that understand and respond to Roman Urdu and English instructions. It consists of 1,022 records in total:
500 examples in Roman Urdu generated from high-quality Urdu sources and transliterated using the ChatGPT API.
500 examples in English randomly sampled from the Stanford Alpaca dataset.
The dataset follows the same format as Alpaca-style instruction… See the full description on the dataset page: https://huggingface.co/datasets/Redgerd/roman-urdu-alpaca-qa-mix.
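A hedged sketch of how such a mix could be assembled with the Hugging Face `datasets` library; the local file name `roman_urdu_instructions.json` and the shared instruction/input/output schema are assumptions, and the authors' actual script may differ:

```python
from datasets import load_dataset, concatenate_datasets

# 500 English examples sampled from Stanford Alpaca.
english = (
    load_dataset("tatsu-lab/alpaca", split="train")
    .shuffle(seed=42)
    .select(range(500))
    .select_columns(["instruction", "input", "output"])  # keep only the shared schema
)

# 500 Roman Urdu examples, assumed here to live in a local JSON file
# (hypothetical file name) with the same instruction/input/output fields.
roman_urdu = (
    load_dataset("json", data_files="roman_urdu_instructions.json", split="train")
    .select(range(500))
)

mixed = concatenate_datasets([english, roman_urdu]).shuffle(seed=42)
print(len(mixed))  # 1,000 examples in this simplified sketch
```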
🇻🇳 Vietnamese modified Alpaca Dataset
This dataset is designed specifically for Vietnamese, based on ideas from Stanford Alpaca, the Self-Instruct paper, and Chinese LLaMA. The motivation behind its creation is the hope of contributing a high-quality dataset to the Vietnamese community for training language models. To construct this dataset, we follow a two-step process:
Step 1: Manually create Vietnamese seed tasks We employ the methodology outlined in the Self-Instruct… See the full description on the dataset page: https://huggingface.co/datasets/bkai-foundation-models/vi-alpaca-input-output-format.
License: https://github.com/XueFuzhao/InstructionWild/blob/main/LICENSE
Instruction tuning is a key component of ChatGPT. OpenAI uses an instruction dataset built from its users, but unfortunately this dataset is not open source. Self-Instruct released a small instruction dataset consisting of 175 human-written instructions. The Stanford Alpaca team used the text-davinci-003 model to generate 52K instructions from those 175 seed instructions.
The project's goal is a larger and more diverse instruction dataset. To this end, we collected 429 instructions from ChatGPT usage screenshots and released Chinese and English versions. We found that these instructions are very diverse, even though the scale is still small. Following Alpaca, we generate 52K instructions and their responses. All data can be found in the data directory.
NOTE: This is an ongoing project. We are still collecting and improving our data. We release this dataset early to accelerate our LLM research. We will also publish a white paper soon.
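For illustration, a Self-Instruct-style generation loop in the spirit described above might look like the following; this is not the project's actual code, and the model name and prompt wording are placeholders:

```python
import random
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_new_instructions(seed_instructions: list[str], n_seeds: int = 3) -> str:
    """Ask the model to propose new instructions conditioned on a few sampled seeds."""
    sampled = random.sample(seed_instructions, k=n_seeds)
    prompt = (
        "Here are some example instructions:\n"
        + "\n".join(f"- {s}" for s in sampled)
        + "\nWrite 10 new, diverse instructions in the same style."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```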
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Card for Alpaca-Hu-2k
This is the dataset card for the Hungarian translation of a subset of the Stanford Alpaca prompts.
Dataset Details
Dataset Description
The dataset is the first Hungarian language instruction-following corpus created for fine-tuning large language models, specifically developed by translating and localizing a portion of the Stanford Alpaca corpus. It contains 2000 translated and 100 localized prompts, designed to train… See the full description on the dataset page: https://huggingface.co/datasets/NYTK/alpaca_hu_2k.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
MixInstruct
Introduction
This is the official release of the MixInstruct dataset for the LLM-Blender project. This dataset contains 11 responses from currently popular instruction-following LLMs:
Stanford Alpaca, FastChat Vicuna, Dolly V2, StableLM, Open Assistant, Koala, Baize, Flan-T5, ChatGLM, MOSS, and Mosaic MPT.
We evaluate each response with automatic metrics including BLEU, ROUGE, BERTScore, and BARTScore, and provide pairwise comparison results by prompting ChatGPT for the… See the full description on the dataset page: https://huggingface.co/datasets/llm-blender/mix-instruct.
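A minimal sketch of computing the automatic metrics named above with the Hugging Face `evaluate` library; BARTScore is omitted because it is distributed as a separate research codebase rather than an `evaluate` metric:

```python
import evaluate

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

candidates = ["The cat sat on the mat."]
references = ["A cat was sitting on the mat."]

# Corpus-level BLEU, ROUGE-1/2/L, and BERTScore for one candidate/reference pair.
print(bleu.compute(predictions=candidates, references=[references]))
print(rouge.compute(predictions=candidates, references=references))
print(bertscore.compute(predictions=candidates, references=references, lang="en"))
```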
OpenRAIL: https://choosealicense.com/licenses/openrail/
This dataset reformats data into the lightweight Stanford Alpaca fine-tuning format for the Llama 2 large language model. Please cite the original work: @inproceedings{cohan-etal-2018-discourse, title = "A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents", author = "Cohan, Arman and Dernoncourt, Franck and Kim, Doo Soon and Bui, Trung and Kim, Seokhwan and Chang, Walter and Goharian, Nazli", booktitle = "Proceedings… See the full description on the dataset page: https://huggingface.co/datasets/ZhongshengWang/Alpaca-pubmed-summarization.
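As a hedged illustration of the Alpaca-style formatting described above, a summarization pair could be converted like this; the `article`/`abstract` names and the instruction wording are assumptions, not the dataset's actual fields:

```python
def to_alpaca_record(article: str, abstract: str) -> dict:
    """Wrap a document/summary pair in the Alpaca instruction/input/output format."""
    return {
        "instruction": "Summarize the following biomedical article.",  # illustrative wording
        "input": article,
        "output": abstract,
    }

record = to_alpaca_record(
    article="BACKGROUND: ... (full paper text) ...",
    abstract="We show that ...",
)
print(record["instruction"])
```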
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for ChatAlpaca 20K
ChatAlpaca: A Multi-Turn Dialogue Corpus based on Alpaca Instructions
Dataset Description
ChatAlpaca is a chat dataset that aims to help researchers develop models for instruction following in multi-turn conversations. The dataset extends the Stanford Alpaca data with multi-turn instructions and their corresponding responses. ChatAlpaca is developed by the Chinese Information Processing Laboratory at the… See the full description on the dataset page: https://huggingface.co/datasets/robinsmits/ChatAlpaca-20K.
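A hedged illustration of what a multi-turn, Alpaca-derived conversation record might look like; the role/content structure and field names are assumptions for illustration, and the actual ChatAlpaca schema may differ:

```python
# Hypothetical multi-turn record built on top of a single Alpaca instruction;
# the actual ChatAlpaca field names may differ.
conversation = {
    "id": "example-0",
    "turns": [
        {"role": "user", "content": "Give three tips for staying healthy."},
        {"role": "assistant", "content": "1. Eat a balanced diet. 2. Exercise regularly. 3. Get enough sleep."},
        {"role": "user", "content": "Can you expand on the second tip?"},
        {"role": "assistant", "content": "Regular exercise, such as 30 minutes of walking a day, helps ..."},
    ],
}
```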
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
RuTurboAlpaca
Dataset of ChatGPT-generated instructions in Russian.
Code: rulm/self_instruct. The code is based on Stanford Alpaca and self-instruct. 29,822 examples.
Preliminary evaluation by an expert based on 400 samples:
83% of samples contain correct instructions; 63% of samples have correct instructions and outputs.
Crowdsourcing-based evaluation on 3,500 samples:
90% of samples contain correct instructions; 68% of samples have correct instructions and outputs.
Prompt template:… See the full description on the dataset page: https://huggingface.co/datasets/IlyaGusev/ru_turbo_alpaca.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for Alpaca-Cleaned
Repository: https://github.com/gururise/AlpacaDataCleaned
Dataset Description
This is a cleaned version of the original Alpaca Dataset released by Stanford. The following issues have been identified in the original release and fixed in this dataset:
Hallucinations: Many instructions in the original dataset referenced data on the internet, which simply caused GPT-3 to hallucinate an answer.
"instruction":"Summarize the… See the full description on the dataset page: https://huggingface.co/datasets/mikemoe/mavis.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for Indonesian Alpaca-Cleaned
Repository: https://github.com/gururise/AlpacaDataCleaned
Dataset Description
This is the Indonesian translation of the cleaned version of the original Alpaca Dataset released by Stanford. The following issues have been identified in the original release and fixed in this dataset:
Hallucinations: Many instructions in the original dataset referenced data on the internet, which simply caused GPT-3 to hallucinate an… See the full description on the dataset page: https://huggingface.co/datasets/cahya/alpaca-id-cleaned.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
NOTE: This is a machine translated version of the yahma/alpaca-cleaned dataset.
Dataset Card for Alpaca-Cleaned
Repository: https://github.com/gururise/AlpacaDataCleaned
Dataset Description
This is a cleaned version of the original Alpaca Dataset released by Stanford. The following issues have been identified in the original release and fixed in this dataset:
Hallucinations: Many instructions in the original dataset referenced data on the internet… See the full description on the dataset page: https://huggingface.co/datasets/dominguesm/alpaca-data-pt-br.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
BERTIN Alpaca Spanish
This dataset is a Spanish translation of alpaca_data_cleaned.json, a cleaned version of the Alpaca dataset made at Stanford. An earlier version used Facebook's NLLB 1.3B model, but the current version uses OpenAI's gpt-3.5-turbo; hence this dataset cannot be used to create models that compete in any way against OpenAI.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for Alpaca-Cleaned
Repository: https://github.com/gururise/AlpacaDataCleaned
Dataset Description
This is a cleaned version of the original Alpaca Dataset released by Stanford. The following issues have been identified in the original release and fixed in this dataset:
Hallucinations: Many instructions in the original dataset referenced data on the internet, which simply caused GPT-3 to hallucinate an answer.
"instruction":"Summarize the… See the full description on the dataset page: https://huggingface.co/datasets/Apex-X/prodigy-cleaned.
Dataset Summary
This dataset is a translation of the yahma/alpaca-cleaned dataset into Uzbek, leveraging the Google Translate API. The original dataset is a cleaned version of the Stanford Alpaca dataset, which contains instruction-following data for fine-tuning large language models. The cleaned version improves upon the original Alpaca dataset by removing low-quality data and inconsistencies in formatting, which helps enhance the quality and robustness of models trained on it.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Kyrgyz Alpaca
This repo is made for research use only, i.e., it cannot be used for commercial purposes or entertainment.
References
All of our achievements were made possible thanks to the robust AI community in Kyrgyzstan and the contributions made by individuals within the AkylAI project (by TheCramer.com). We also express our gratitude to Stanford for their outstanding efforts, and we extend the accessibility of this dataset to a global audience.
Dataset
Kyrgyz… See the full description on the dataset page: https://huggingface.co/datasets/the-cramer-project/kyrgyz-alpaca.