Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Card for Alpaca
Dataset Summary
Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction tuning of language models, making them follow instructions better. The authors built on the data generation pipeline from the Self-Instruct framework and made the following modifications:
The text-davinci-003 engine to generate the instruction data instead… See the full description on the dataset page: https://huggingface.co/datasets/tatsu-lab/alpaca.
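Each record in the Alpaca release follows a simple three-field schema. A minimal sketch (the field values below are illustrative, not taken from the actual dataset):

```python
# A single Alpaca-style record: `instruction` states the task,
# `input` optionally supplies context, and `output` is the
# text-davinci-003 demonstration used as the training target.
record = {
    "instruction": "Classify the sentiment of the sentence.",
    "input": "The alpaca wool scarf was wonderfully soft.",
    "output": "Positive",
}

# Records that need no extra context leave `input` as an empty string.
assert set(record) == {"instruction", "input", "output"}
```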
generative-technologies/synth-ehr-icd10-alpaca-format dataset hosted on Hugging Face and contributed by the HF Datasets community
Na0s/sft-ready-Text-Generation-Augmented-Data-Alpaca-Format dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for Alpaca-Cleaned
Repository: https://github.com/gururise/AlpacaDataCleaned
Dataset Description
This is a cleaned version of the original Alpaca Dataset released by Stanford. The following issues have been identified in the original release and fixed in this dataset:
Hallucinations: Many instructions in the original dataset referenced data on the internet, which simply caused GPT-3 to hallucinate an answer.
"instruction":"Summarize the… See the full description on the dataset page: https://huggingface.co/datasets/yahma/alpaca-cleaned.
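A cleaning pass for the hallucination issue can be sketched as a simple filter. The heuristic below (flagging examples whose instruction or input points at a URL the model cannot fetch) is an illustrative assumption, not the exact rule used by AlpacaDataCleaned:

```python
import re

# text-davinci-003 cannot browse, so instructions that ask it to read
# a web page reliably produce hallucinated answers.
URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)

def is_hallucination_prone(example: dict) -> bool:
    """Flag examples whose instruction or input references a URL."""
    combined = example["instruction"] + " " + example["input"]
    return bool(URL_PATTERN.search(combined))

examples = [
    {"instruction": "Summarize the article at https://example.com/news",
     "input": "", "output": "..."},
    {"instruction": "Translate to French.",
     "input": "Hello", "output": "Bonjour"},
]

# Keep only the self-contained examples.
cleaned = [ex for ex in examples if not is_hallucination_prone(ex)]
```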
🇻🇳 Vietnamese modified Alpaca Dataset
This dataset is specifically designed for Vietnamese, based on ideas from Stanford Alpaca, the Self-Instruct paper, and Chinese LLaMA. The motivation behind its creation is the hope of contributing a high-quality dataset to the Vietnamese community for training language models. To construct this dataset, we follow a two-step process:
Step 1: Manually create Vietnamese seed tasks. We employ the methodology outlined in the Self-Instruct… See the full description on the dataset page: https://huggingface.co/datasets/bkai-foundation-models/vi-alpaca-input-output-format.
LangAGI-Lab/limo-trial7-alpaca-format dataset hosted on Hugging Face and contributed by the HF Datasets community
remyxai/ffmperative-alpaca-format-50k dataset hosted on Hugging Face and contributed by the HF Datasets community
usham/mental-alpaca-format dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Alpaca-style Question and Answer Dataset
This dataset contains question-answer pairs formatted in the Alpaca instruction style, suitable for instruction fine-tuning of language models.
Format
Each example contains:
instruction: The question
input: Empty string (can be used for context in other applications)
output: The answer
text: The formatted text using the Alpaca template
Template
Below is an instruction that describes a task, paired with an input that… See the full description on the dataset page: https://huggingface.co/datasets/sweatSmile/alpaca-qa-data.
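Rendering a record's `text` field from the template can be sketched as below. The two templates are the standard Stanford Alpaca prompts (with and without an input); whether this particular dataset uses the exact same wording is an assumption:

```python
# Standard Stanford Alpaca prompt templates. The `text` field of a
# record is the appropriate template filled in with its field values.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response "
    "that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{output}"
)

def to_alpaca_text(example: dict) -> str:
    """Render one instruction/input/output record into the `text` field."""
    template = PROMPT_WITH_INPUT if example["input"] else PROMPT_NO_INPUT
    # str.format ignores unused keyword arguments, so passing the whole
    # record works for both templates.
    return template.format(**example)

text = to_alpaca_text(
    {"instruction": "Name a llama relative.", "input": "", "output": "The alpaca."}
)
```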
LangAGI-Lab/limo-small-alpaca-format dataset hosted on Hugging Face and contributed by the HF Datasets community
https://choosealicense.com/licenses/gpl-3.0/
This dataset is an adaptation of the Stanford Alpaca dataset in order to turn a text generation model like GPT-J into an "instruct" model. The initial dataset was slightly reworked in order to match the GPT-J fine-tuning format with Mesh Transformer Jax on TPUs.
aditya3w3733/retail-alpaca-format dataset hosted on Hugging Face and contributed by the HF Datasets community
LangAGI-Lab/qwen-7b-instruct-8k-rft-alpaca-format dataset hosted on Hugging Face and contributed by the HF Datasets community
https://choosealicense.com/licenses/cc0-1.0/
adambuttrick/100K-ner-indexes-multiple-organizations-locations-alpaca-format-json-response-all-cases dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for MMLU-Alpaca
This dataset contains instruction-input-output pairs converted to ShareGPT format, designed for instruction tuning and text generation tasks.
Dataset Description
The dataset consists of carefully curated instruction-input-output pairs, formatted for conversational AI training. Each entry contains:
An instruction that specifies the task
An optional input providing context
A detailed output that addresses the instruction
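The conversion from instruction-input-output pairs into ShareGPT-style conversations can be sketched as follows. The exact field names are an assumption based on the common ShareGPT convention (a `conversations` list of `from`/`value` turns), not necessarily what MMLU-Alpaca uses internally:

```python
def alpaca_to_sharegpt(example: dict) -> dict:
    """Fold an instruction/input/output triple into a two-turn conversation."""
    # The optional input is appended to the instruction as extra context
    # for the human turn; the output becomes the assistant turn.
    human = example["instruction"]
    if example.get("input"):
        human += "\n\n" + example["input"]
    return {
        "conversations": [
            {"from": "human", "value": human},
            {"from": "gpt", "value": example["output"]},
        ]
    }

conv = alpaca_to_sharegpt(
    {"instruction": "What is 2 + 2?", "input": "", "output": "4"}
)
```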
Usage
This… See the full description on the dataset page: https://huggingface.co/datasets/HappyAIUser/MMLU-Alpaca.
LangAGI-Lab/magpie-reasoning-v1-10k-step-by-step-rationale-alpaca-format dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
abhijitkumarjha88192/ts_repl_ai_alpaca dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
CreitinGameplays/magpie-reasoning-v1-10k-step-by-step-rationale-alpaca-format-changedtoken dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
abhijitkumarjha88192/py_tiny_codes_alpaca dataset hosted on Hugging Face and contributed by the HF Datasets community
LangAGI-Lab/magpie-reasoning-v1-20k-math-verifiable-verification-min-4000-3200-alpaca-format dataset hosted on Hugging Face and contributed by the HF Datasets community