Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Card for Alpaca
Dataset Summary
Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction tuning for language models and make them follow instructions better. The authors built on the data generation pipeline from the Self-Instruct framework and made the following modifications:
The text-davinci-003 engine to generate the instruction data instead… See the full description on the dataset page: https://huggingface.co/datasets/tatsu-lab/alpaca.
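As a quick orientation, here is a minimal sketch of loading this dataset and inspecting one record, assuming the Hugging Face `datasets` library and the standard Alpaca field names:

```python
from datasets import load_dataset

# Load the Alpaca instruction-tuning data from the Hugging Face Hub.
dataset = load_dataset("tatsu-lab/alpaca", split="train")
print(len(dataset))  # roughly 52,000 examples

# Each record carries an instruction, an optional input, and a model-written output.
example = dataset[0]
print(example["instruction"])
print(example["input"])
print(example["output"])
```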
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for Alpaca-Cleaned
Repository: https://github.com/gururise/AlpacaDataCleaned
Dataset Description
This is a cleaned version of the original Alpaca Dataset released by Stanford. The following issues have been identified in the original release and fixed in this dataset:
Hallucinations: Many instructions in the original dataset referenced data on the internet, which simply caused GPT-3 to hallucinate an answer.
"instruction":"Summarize the… See the full description on the dataset page: https://huggingface.co/datasets/weiwei888/VIS.
Academic Free License v3.0 (AFL-3.0): https://choosealicense.com/licenses/afl-3.0/
09/04/2023 update: new instructions added from https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM. Original version: https://github.com/tatsu-lab/stanford_alpaca#data-release. AI-based translation of Stanford Alpaca from English to Turkish. For academic use only; please cite before using: Taşar, D. E. T. (2023). stanford-alpaca-cleaned-turkish-translated [Dataset]. In Stanford Alpaca TR (1.0.1.a). https://huggingface.co/datasets/emre/stanford-alpaca-cleaned-turkish-translated… See the full description on the dataset page: https://huggingface.co/datasets/emre/stanford-alpaca-cleaned-turkish-translated.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
LLaMA is a great work that demonstrates impressive zero-shot and few-shot abilities. It significantly reduces the cost of training, fine-tuning, and using competitive large language models: LLaMA-13B outperforms GPT-3 (175B), and LLaMA-65B is competitive with PaLM-540B. Recently, to boost the instruction-following ability of LLaMA, Stanford Alpaca fine-tuned LLaMA-7B on 52K instruction-following examples generated with the Self-Instruct technique. However, the LLM research community still faces three challenges: 1. Even LLaMA-7B still has high computing-resource requirements; 2. There are few open-source datasets for instruction fine-tuning; and 3. There is a lack of empirical study on the impact of various types of instructions on model abilities, such as the ability to respond to Chinese instructions and CoT reasoning.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This is a Thai 🇹🇭-instructed dataset translated from the cleaned version of the original Alpaca Dataset released by Stanford, using Google Cloud Translation. It contains 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction tuning for language models and make them follow instructions better.
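As a rough sketch of the approach described above (not the authors' actual pipeline), translating one Alpaca record with the `google-cloud-translate` client might look like this; the helper name and field handling are illustrative assumptions:

```python
from google.cloud import translate_v2 as translate

# Requires GOOGLE_APPLICATION_CREDENTIALS to point at a service-account key.
client = translate.Client()

def translate_record(record: dict, target: str = "th") -> dict:
    """Translate the instruction/input/output fields of one Alpaca record."""
    translated = {}
    for key in ("instruction", "input", "output"):
        text = record.get(key, "")
        if text:
            result = client.translate(text, target_language=target)
            translated[key] = result["translatedText"]
        else:
            translated[key] = ""
    return translated
```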
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for Roman Urdu + Alpaca QA Mix
This dataset is intended to support fine-tuning and evaluation of language models that understand and respond to Roman Urdu and English instructions. It consists of 1,022 records in total:
500 examples in Roman Urdu generated from high-quality Urdu sources and transliterated using the ChatGPT API.
500 examples in English randomly sampled from the Stanford Alpaca dataset.
The dataset follows the same format as Alpaca-style instruction… See the full description on the dataset page: https://huggingface.co/datasets/Redgerd/roman-urdu-alpaca-qa-mix.
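A hedged sketch of how such a mix could be assembled with the Hugging Face `datasets` library; the local file name `roman_urdu_instructions.json` and the shared instruction/input/output schema are assumptions, and the authors' actual script may differ:

```python
from datasets import load_dataset, concatenate_datasets

# 500 English examples sampled from Stanford Alpaca.
english = (
    load_dataset("tatsu-lab/alpaca", split="train")
    .shuffle(seed=42)
    .select(range(500))
    .select_columns(["instruction", "input", "output"])  # keep only the shared schema
)

# 500 Roman Urdu examples, assumed here to live in a local JSON file
# (hypothetical file name) with the same instruction/input/output fields.
roman_urdu = (
    load_dataset("json", data_files="roman_urdu_instructions.json", split="train")
    .select(range(500))
)

mixed = concatenate_datasets([english, roman_urdu]).shuffle(seed=42)
print(len(mixed))  # 1,000 examples in this simplified sketch
```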
🇻🇳 Vietnamese modified Alpaca Dataset
This dataset is designed specifically for Vietnamese, based on ideas from Stanford Alpaca, the Self-Instruct paper, and Chinese LLaMA. The motivation behind its creation is the hope of contributing a high-quality dataset to the Vietnamese community for training language models. To construct this dataset, we follow a two-step process:
Step 1: Manually create Vietnamese seed tasks We employ the methodology outlined in the Self-Instruct… See the full description on the dataset page: https://huggingface.co/datasets/bkai-foundation-models/vi-alpaca-input-output-format.
License: https://github.com/XueFuzhao/InstructionWild/blob/main/LICENSE
Instruction tuning is a key component of ChatGPT. OpenAI uses an instruction dataset built from its users, but unfortunately this dataset is not open source. Self-Instruct released a small instruction dataset consisting of 175 human-written instructions. The Stanford Alpaca team used the text-davinci-003 model to generate 52K instructions from those 175 seed instructions.
The project's goal is a larger and more diverse instruction dataset. To this end, we collected 429 instructions from ChatGPT usage screenshots and released Chinese and English versions. We found that these instructions are very diverse, even though the scale is still small. Following Alpaca, we generate 52K instructions and their responses. All data can be found in the data directory.
NOTE: This is an ongoing project. We are still collecting and improving our data. We release this dataset early to accelerate our LLM research. We will also publish a white paper soon.
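For illustration, a Self-Instruct-style generation loop in the spirit described above might look like the following; this is not the project's actual code, and the model name and prompt wording are placeholders:

```python
import random
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_new_instructions(seed_instructions: list[str], n_seeds: int = 3) -> str:
    """Ask the model to propose new instructions conditioned on a few sampled seeds."""
    sampled = random.sample(seed_instructions, k=n_seeds)
    prompt = (
        "Here are some example instructions:\n"
        + "\n".join(f"- {s}" for s in sampled)
        + "\nWrite 10 new, diverse instructions in the same style."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```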
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Card for Alpaca-Hu-2k
This is the dataset card for the Hungarian translation of a subset of the Stanford Alpaca prompts.
Dataset Details
Dataset Description
The dataset is the first Hungarian language instruction-following corpus created for fine-tuning large language models, specifically developed by translating and localizing a portion of the Stanford Alpaca corpus. It contains 2000 translated and 100 localized prompts, designed to train… See the full description on the dataset page: https://huggingface.co/datasets/NYTK/alpaca_hu_2k.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
MixInstruct
Introduction
This is the official release of the MixInstruct dataset for the LLM-Blender project. This dataset contains 11 responses from currently popular instruction-following LLMs:
Stanford Alpaca, FastChat Vicuna, Dolly V2, StableLM, Open Assistant, Koala, Baize, Flan-T5, ChatGLM, MOSS, and Mosaic MPT.
We evaluate each response with automatic metrics including BLEU, ROUGE, BERTScore, and BARTScore, and provide pairwise comparison results by prompting ChatGPT for the… See the full description on the dataset page: https://huggingface.co/datasets/llm-blender/mix-instruct.
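A minimal sketch of computing the automatic metrics named above with the Hugging Face `evaluate` library; BARTScore is omitted because it is distributed as a separate research codebase rather than an `evaluate` metric:

```python
import evaluate

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

candidates = ["The cat sat on the mat."]
references = ["A cat was sitting on the mat."]

# Corpus-level BLEU, ROUGE-1/2/L, and BERTScore for one candidate/reference pair.
print(bleu.compute(predictions=candidates, references=[references]))
print(rouge.compute(predictions=candidates, references=references))
print(bertscore.compute(predictions=candidates, references=references, lang="en"))
```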
OpenRAIL: https://choosealicense.com/licenses/openrail/
This dataset reformats data into the lightweight Stanford Alpaca fine-tuning format for the Llama 2 large language model. Please cite the original work: @inproceedings{cohan-etal-2018-discourse, title = "A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents", author = "Cohan, Arman and Dernoncourt, Franck and Kim, Doo Soon and Bui, Trung and Kim, Seokhwan and Chang, Walter and Goharian, Nazli", booktitle = "Proceedings… See the full description on the dataset page: https://huggingface.co/datasets/ZhongshengWang/Alpaca-pubmed-summarization.
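As a hedged illustration of the Alpaca-style formatting described above, a summarization pair could be converted like this; the `article`/`abstract` names and the instruction wording are assumptions, not the dataset's actual fields:

```python
def to_alpaca_record(article: str, abstract: str) -> dict:
    """Wrap a document/summary pair in the Alpaca instruction/input/output format."""
    return {
        "instruction": "Summarize the following biomedical article.",  # illustrative wording
        "input": article,
        "output": abstract,
    }

record = to_alpaca_record(
    article="BACKGROUND: ... (full paper text) ...",
    abstract="We show that ...",
)
print(record["instruction"])
```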
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for ChatAlpaca 20K
ChatAlpaca: A Multi-Turn Dialogue Corpus based on Alpaca Instructions
Dataset Description
ChatAlpaca is a chat dataset that aims to help researchers develop models for instruction following in multi-turn conversations. The dataset extends the Stanford Alpaca data with multi-turn instructions and their corresponding responses. ChatAlpaca is developed by the Chinese Information Processing Laboratory at the… See the full description on the dataset page: https://huggingface.co/datasets/robinsmits/ChatAlpaca-20K.
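A hedged illustration of what a multi-turn, Alpaca-derived conversation record might look like; the role/content structure and field names are assumptions for illustration, and the actual ChatAlpaca schema may differ:

```python
# Hypothetical multi-turn record built on top of a single Alpaca instruction;
# the actual ChatAlpaca field names may differ.
conversation = {
    "id": "example-0",
    "turns": [
        {"role": "user", "content": "Give three tips for staying healthy."},
        {"role": "assistant", "content": "1. Eat a balanced diet. 2. Exercise regularly. 3. Get enough sleep."},
        {"role": "user", "content": "Can you expand on the second tip?"},
        {"role": "assistant", "content": "Regular exercise, such as 30 minutes of walking a day, helps ..."},
    ],
}
```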
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
RuTurboAlpaca
Dataset of ChatGPT-generated instructions in Russian.
Code: rulm/self_instruct. The code is based on Stanford Alpaca and self-instruct. 29,822 examples.
Preliminary evaluation by an expert based on 400 samples:
83% of samples contain correct instructions; 63% of samples have correct instructions and outputs.
Crowdsourcing-based evaluation on 3,500 samples:
90% of samples contain correct instructions; 68% of samples have correct instructions and outputs.
Prompt template:… See the full description on the dataset page: https://huggingface.co/datasets/IlyaGusev/ru_turbo_alpaca.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for Alpaca-Cleaned
Repository: https://github.com/gururise/AlpacaDataCleaned
Dataset Description
This is a cleaned version of the original Alpaca Dataset released by Stanford. The following issues have been identified in the original release and fixed in this dataset:
Hallucinations: Many instructions in the original dataset referenced data on the internet, which simply caused GPT-3 to hallucinate an answer.
"instruction":"Summarize the… See the full description on the dataset page: https://huggingface.co/datasets/mikemoe/mavis.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for Indonesian Alpaca-Cleaned
Repository: https://github.com/gururise/AlpacaDataCleaned
Dataset Description
This is the Indonesian translation of the cleaned version of the original Alpaca Dataset released by Stanford. The following issues have been identified in the original release and fixed in this dataset:
Hallucinations: Many instructions in the original dataset referenced data on the internet, which simply caused GPT-3 to hallucinate an… See the full description on the dataset page: https://huggingface.co/datasets/cahya/alpaca-id-cleaned.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
NOTE: This is a machine translated version of the yahma/alpaca-cleaned dataset.
Dataset Card for Alpaca-Cleaned
Repository: https://github.com/gururise/AlpacaDataCleaned
Dataset Description
This is a cleaned version of the original Alpaca Dataset released by Stanford. The following issues have been identified in the original release and fixed in this dataset:
Hallucinations: Many instructions in the original dataset referenced data on the internet… See the full description on the dataset page: https://huggingface.co/datasets/dominguesm/alpaca-data-pt-br.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
BERTIN Alpaca Spanish
This dataset is a Spanish translation of alpaca_data_cleaned.json, a cleaned version of the Alpaca dataset made at Stanford. An earlier version used Facebook's NLLB 1.3B model, but the current version uses OpenAI's gpt-3.5-turbo; hence this dataset cannot be used to create models that compete in any way against OpenAI.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for Alpaca-Cleaned
Repository: https://github.com/gururise/AlpacaDataCleaned
Dataset Description
This is a cleaned version of the original Alpaca Dataset released by Stanford. The following issues have been identified in the original release and fixed in this dataset:
Hallucinations: Many instructions in the original dataset referenced data on the internet, which simply caused GPT-3 to hallucinate an answer.
"instruction":"Summarize the… See the full description on the dataset page: https://huggingface.co/datasets/Apex-X/prodigy-cleaned.
Dataset Summary
This dataset is a translation of the yahma/alpaca-cleaned dataset into Uzbek, leveraging the Google Translate API. The original dataset is a cleaned version of the Stanford Alpaca dataset, which contains instruction-following data for fine-tuning large language models. The cleaned version improves upon the original Alpaca dataset by removing low-quality data and inconsistencies in formatting, which helps enhance the quality and robustness of models trained on it.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Kyrgyz Alpaca
This repo is made for research use only, i.e., it cannot be used for commercial purposes or entertainment.
References
All of our achievements were made possible thanks to the robust AI community in Kyrgyzstan and the contributions made by individuals within the AkylAI project (by TheCramer.com). We also express our gratitude to Stanford for their outstanding efforts, and we extend the accessibility of this dataset to a global audience.
Dataset
Kyrgyz… See the full description on the dataset page: https://huggingface.co/datasets/the-cramer-project/kyrgyz-alpaca.