Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Card for Alpaca
Dataset Summary
Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction tuning for language models and make them follow instructions better. The authors built on the data generation pipeline from the Self-Instruct framework and made the following modifications:
The text-davinci-003 engine to generate the instruction data instead… See the full description on the dataset page: https://huggingface.co/datasets/tatsu-lab/alpaca.
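As an illustration (not part of the card), here is a minimal sketch of loading the dataset and rendering one record with the prompt template popularized by the Stanford Alpaca repository; the field names instruction, input, and output follow the card's schema, and the exact prompt wording should be adapted to whatever your trainer expects.

```python
# Minimal sketch: load the Alpaca dataset and format one record into an
# instruction-tuning prompt. The wording mirrors the template used in the
# Stanford Alpaca repo; adjust as needed for your training setup.
from datasets import load_dataset

dataset = load_dataset("tatsu-lab/alpaca", split="train")

def format_example(example):
    if example["input"]:
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}"
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['output']}"
    )

print(format_example(dataset[0]))
```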
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for Alpaca-Cleaned
Repository: https://github.com/gururise/AlpacaDataCleaned
Dataset Description
This is a cleaned version of the original Alpaca Dataset released by Stanford. The following issues have been identified in the original release and fixed in this dataset:
Hallucinations: Many instructions in the original dataset referenced data on the internet, which simply caused GPT-3 to hallucinate an answer.
"instruction":"Summarize… See the full description on the dataset page: https://huggingface.co/datasets/alexl83/AlpacaDataCleaned.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Summary
This is a Thai 🇹🇭 instruction dataset translated from the cleaned version of the original Alpaca Dataset released by Stanford, using Google Cloud Translation. It contains 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction tuning for language models and make them follow instructions better. The following issues have been identified in the original release and fixed in this… See the full description on the dataset page: https://huggingface.co/datasets/Thaweewat/alpaca-cleaned-52k-th.
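For illustration, here is a minimal sketch of how one record might be translated field by field with the Google Cloud Translation v2 client; the per-field loop and the translate_record helper are assumptions made for this sketch, not the authors' actual pipeline.

```python
# Sketch of field-by-field translation with the Google Cloud Translation API
# (v2 client). Requires GOOGLE_APPLICATION_CREDENTIALS to be configured.
# The field names follow the Alpaca schema; not the authors' actual pipeline.
from google.cloud import translate_v2 as translate

client = translate.Client()

def translate_record(record, target="th"):  # hypothetical helper
    out = {}
    for field in ("instruction", "input", "output"):
        text = record[field]
        # client.translate() returns a dict with a "translatedText" key.
        out[field] = (
            client.translate(text, target_language=target)["translatedText"]
            if text else text
        )
    return out

example = {"instruction": "Name three primary colors.", "input": "",
           "output": "Red, blue, and yellow."}
print(translate_record(example))
```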
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
NOTE: This is a machine-translated version of the yahma/alpaca-cleaned dataset.
Dataset Card for Alpaca-Cleaned
Repository: https://github.com/gururise/AlpacaDataCleaned
Dataset Description
This is a cleaned version of the original Alpaca Dataset released by Stanford. The following issues have been identified in the original release and fixed in this dataset:
Hallucinations: Many instructions in the original dataset referenced data on the… See the full description on the dataset page: https://huggingface.co/datasets/dominguesm/alpaca-data-pt-br.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Galician version of alpaca_data.json
This is a Galician translation of the Stanford alpaca_data.json dataset, produced with the Python package googletranslatepy. Our working notes are available here.
Dataset Structure
The dataset contains 52K instruction-following elements in a JSON file with a list of dictionaries. Each dictionary contains the following fields:
instruction: str, describes the task the model should perform. Each of the 52K instructions is unique.
input:… See the full description on the dataset page: https://huggingface.co/datasets/irlab-udc/alpaca_data_galician.
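For illustration, here is a minimal sketch that walks the JSON structure described above and translates it field by field; translate_to_galician is a hypothetical placeholder standing in for the authors' googletranslatepy call, whose exact API is not reproduced here.

```python
# Sketch of iterating the alpaca_data.json structure: a JSON list of dicts
# with "instruction", "input", and "output" fields.
import json

def translate_to_galician(text: str) -> str:
    # Hypothetical placeholder: substitute the actual googletranslatepy call.
    return text

with open("alpaca_data.json", encoding="utf-8") as f:
    records = json.load(f)  # list of ~52K dicts

translated = [
    {field: translate_to_galician(rec[field]) if rec[field] else rec[field]
     for field in ("instruction", "input", "output")}
    for rec in records
]

with open("alpaca_data_gl.json", "w", encoding="utf-8") as f:
    json.dump(translated, f, ensure_ascii=False, indent=2)
```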
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Citation Information
```
@misc{alpaca,
  author = {Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto},
  title = {Stanford Alpaca: An Instruction-following LLaMA model},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/tatsu-lab/stanford_alpaca}},
}
```
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
TatAlpaca
Dataset of Gemini-generated instructions in Tatar language.
Code: tatlm/self_instruct (based on Stanford Alpaca and self-instruct).
166,257 examples
Prompt template: {{num_tasks}} җыелмасының составы тел моделен өйрәнү өчен төрле: (Tatar: "The composition of the {{num_tasks}} collection is diverse for training the language model:")
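For illustration only, here is a hedged sketch of filling the {{num_tasks}} placeholder and sending the prompt to a Gemini model through the google-generativeai client; the model name, API-key handling, and task count are assumptions, and the dataset's actual generation code lives in the tatlm/self_instruct repository.

```python
# Sketch only: fill the {{num_tasks}} placeholder in the Tatar prompt template
# and request new tasks from a Gemini model. Model name and credentials are
# assumptions; see the tatlm/self_instruct repo for the real pipeline.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # hypothetical credential
model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model choice

template = "{{num_tasks}} җыелмасының составы тел моделен өйрәнү өчен төрле:"
prompt = template.replace("{{num_tasks}}", "20")  # assumed task count

response = model.generate_content(prompt)
print(response.text)
```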
Alpaca-Odia Dataset
This dataset contains 52,001 instruction-response pairs translated from the original Stanford Alpaca dataset into the Odia language using IndicTrans2. You can load the dataset as follows:
from datasets import load_dataset
dataset = load_dataset("sumankumarbhadra/alpaca-odia")
Translation Details
Translation Model: IndicTrans2 (ai4bharat/indictrans2-indic-en-1B)
Source Language Code: eng_Latn
Target Language Code: ory_Orya… See the full description on the dataset page: https://huggingface.co/datasets/sumankumarbhadra/alpaca-odia.
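For illustration, here is a hedged sketch of an English-to-Odia pass with IndicTrans2 through transformers; the en-indic checkpoint, the IndicTransToolkit import path, and the generation settings are assumptions rather than the card's exact setup.

```python
# Hedged sketch of an English->Odia translation pass with IndicTrans2.
# The checkpoint, import path, and generation settings are assumptions;
# see the dataset card for the exact setup used.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit import IndicProcessor  # assumed import path

model_name = "ai4bharat/indictrans2-en-indic-1B"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, trust_remote_code=True)
ip = IndicProcessor(inference=True)

sentences = ["Name three primary colors."]
# preprocess_batch tags each sentence with source/target language codes.
batch = ip.preprocess_batch(sentences, src_lang="eng_Latn", tgt_lang="ory_Orya")
inputs = tokenizer(batch, padding="longest", return_tensors="pt")
with torch.no_grad():
    generated = model.generate(**inputs, max_length=256, num_beams=5)
decoded = tokenizer.batch_decode(generated, skip_special_tokens=True)
print(ip.postprocess_batch(decoded, lang="ory_Orya"))
```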