100+ datasets found
  1. alpaca

    • huggingface.co
    • opendatalab.com
    Updated Mar 14, 2023
    + more versions
    Cite
    Tatsu Lab (2023). alpaca [Dataset]. https://huggingface.co/datasets/tatsu-lab/alpaca
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset authored and provided by
    Tatsu Lab
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Dataset Card for Alpaca

      Dataset Summary
    

    Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction tuning for language models and make them follow instructions better. The authors built on the data generation pipeline from the Self-Instruct framework and made the following modifications:

    The text-davinci-003 engine to generate the instruction data instead… See the full description on the dataset page: https://huggingface.co/datasets/tatsu-lab/alpaca.
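Records in this dataset follow the instruction/input/output convention. A minimal sketch of rendering one record into a training prompt is shown below; the "### Instruction:/### Input:/### Response:" template is a common convention, not something defined by the dataset card itself, and the example record is illustrative.

```python
# Sketch: turn one Alpaca-style record into a single training prompt.
# The section-header template here is a common convention, not the only one.
def build_prompt(record: dict) -> str:
    header = "Below is an instruction that describes a task."
    parts = [header, f"### Instruction:\n{record['instruction']}"]
    if record.get("input"):  # the input field is optional and may be empty
        parts.append(f"### Input:\n{record['input']}")
    parts.append(f"### Response:\n{record['output']}")
    return "\n\n".join(parts)

example = {
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "1. Eat a balanced diet. 2. Exercise regularly. 3. Sleep well.",
}
print(build_prompt(example))
```

Because the input field is empty here, the "### Input:" section is omitted from the rendered prompt.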

  2. synth-ehr-icd10-alpaca-format

    • huggingface.co
    Updated Jun 26, 2024
    Cite
    Generative Technologies, Inc (2024). synth-ehr-icd10-alpaca-format [Dataset]. https://huggingface.co/datasets/generative-technologies/synth-ehr-icd10-alpaca-format
    Explore at: Croissant
    Dataset authored and provided by
    Generative Technologies, Inc
    Description

    generative-technologies/synth-ehr-icd10-alpaca-format dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. sft-ready-Text-Generation-Augmented-Data-Alpaca-Format

    • huggingface.co
    Updated Dec 11, 2024
    + more versions
    Cite
    Ali Janati (2024). sft-ready-Text-Generation-Augmented-Data-Alpaca-Format [Dataset]. https://huggingface.co/datasets/Na0s/sft-ready-Text-Generation-Augmented-Data-Alpaca-Format
    Explore at: Croissant
    Authors
    Ali Janati
    Description

    Na0s/sft-ready-Text-Generation-Augmented-Data-Alpaca-Format dataset hosted on Hugging Face and contributed by the HF Datasets community

  4. alpaca-cleaned

    • huggingface.co
    • kaggle.com
    Updated Apr 9, 2023
    + more versions
    Cite
    Gene Ruebsamen (2023). alpaca-cleaned [Dataset]. https://huggingface.co/datasets/yahma/alpaca-cleaned
    Explore at: Croissant
    Authors
    Gene Ruebsamen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for Alpaca-Cleaned

    Repository: https://github.com/gururise/AlpacaDataCleaned

      Dataset Description
    

    This is a cleaned version of the original Alpaca Dataset released by Stanford. The following issues have been identified in the original release and fixed in this dataset:

    Hallucinations: Many instructions in the original dataset referenced data on the internet, which simply caused GPT-3 to hallucinate an answer.

    "instruction":"Summarize the… See the full description on the dataset page: https://huggingface.co/datasets/yahma/alpaca-cleaned.

  5. toxicity-instruct-alpaca-format

    • huggingface.co
    Cite
    raj, toxicity-instruct-alpaca-format [Dataset]. https://huggingface.co/datasets/acloudfan/toxicity-instruct-alpaca-format
    Explore at: Croissant
    Authors
    raj
    Description

    acloudfan/toxicity-instruct-alpaca-format dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. Alpaca Dataset Image Classification Dataset

    • paperswithcode.com
    • gts.ai
    Updated Jun 26, 2025
    + more versions
    Cite
    (2025). Alpaca Dataset Image Classification Dataset [Dataset]. https://paperswithcode.com/dataset/alpaca-dataset-image-classification
    Description


    The Alpaca Dataset is a collection of JPEG images designed for binary image classification tasks, specifically classifying images as “Alpaca” or “Not Alpaca”. This dataset is ideal for training and fine-tuning machine learning models using transfer learning techniques.


    Context

    This small dataset is perfect for educational purposes, initial model testing, and developing proof-of-concept applications in image classification. Due to its limited size, it is most beneficial when used in conjunction with transfer learning to leverage pre-trained models for improved accuracy.

    Content

    The dataset is organized into two primary directories:

    Alpaca: Contains images that include alpacas.

    Not Alpaca: Contains images without alpacas, featuring subjects that may resemble alpacas but are not.

    Additional Information

    Format: All images are in JPEG format, ensuring compatibility with a wide range of image processing libraries and tools.

    Usage: This dataset can be utilized in various machine learning frameworks such as TensorFlow, PyTorch, and Keras for building and testing classification models.

    Applications: Potential applications include animal recognition systems, educational tools, and development of AI-driven content moderation systems.

    Data Statistics

    Total Images: X (Number of images in the dataset)

    Alpaca Images: Y (Number of images in the Alpaca directory)

    Not Alpaca Images: Z (Number of images in the Not Alpaca directory)

    Image Resolution: Varies, with most images having a resolution suitable for quick model training and evaluation.

    This dataset is sourced from Kaggle.
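Since the two-directory layout described above encodes the label in the parent folder name, a label can be derived directly from each image path. This is a minimal sketch under that assumption; the file names are invented for illustration.

```python
from pathlib import Path

# Sketch: infer the binary label from the "Alpaca" / "Not Alpaca"
# directory layout described above. Paths here are hypothetical.
def label_from_path(image_path: str) -> int:
    """Return 1 for images under an 'Alpaca' directory, else 0."""
    return 1 if Path(image_path).parent.name == "Alpaca" else 0

labels = [label_from_path(p) for p in [
    "dataset/Alpaca/img_001.jpg",
    "dataset/Not Alpaca/img_042.jpg",
]]
print(labels)  # [1, 0]
```

The same convention is what directory-based loaders (e.g. torchvision's ImageFolder or Keras's image_dataset_from_directory) rely on when ingesting this kind of dataset.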

  7. ALPACA-overview-paper-metadata

    • catalog.data.gov
    Updated Sep 9, 2024
    Cite
    U.S. EPA Office of Research and Development (ORD) (2024). ALPACA-overview-paper-metadata [Dataset]. https://catalog.data.gov/dataset/alpaca-overview-paper-metadata
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    This is metadata for the figures from an overview paper for the ALPACA Fairbanks winter air quality study. The paper includes only (non-EPA funded/generated) data to support some early analysis and drivers of the field study, but it is expected that the bulk of the data from the study will be analyzed and published by individual PIs in the (near) future. Note that final data from the study will be available to the scientific community through the ALPACA data portal hosted by Arcticdata.io (https://arcticdata.io/catalog/portals/ALPACA). This dataset is not publicly accessible because: The ALPACA overview paper includes (non-EPA funded/generated) data to support some early analysis and drivers of the field study, but it is expected that the bulk of the data from the study will be analyzed and published by individual PIs in the (near) future. It can be accessed through the following means: Final data from the field study will be available to the scientific community through the ALPACA data portal hosted by Arcticdata.io (https://arcticdata.io/catalog/portals/ALPACA). Format: Here we include metadata for overview paper figures (including data definitions and units). This dataset is associated with the following publication: Simpson, W., J. Mao, G. Fochesatto, K. Law, P. DeCarlo, J. Schmale, K. Pratt, S. Arnold, J. Stutz, J. Dibb, J. Creamean, R. Weber, B. Williams, B. Alexander, L. Hu, R. Yokelson, M. Shiraiwa, S. Decesari, C. Anastasio, B. D'Anna, R. Gilliam, A. Nenes, J. St. Clair, B. Trost, J. Flynn, J. Savarino, L. Conner, N. Kettle, K. Heeringa, S. Albertin, A. Baccarini, B. Barret, M. Battaglia, S. Bekki, T. Brado, N. Brett, D. Brus, J. Campbell, M. Cesler-Maloney, S. Cooperdock, K. Cysneiros de Carvalho, H. Delbarre, P. DeMott, C. Dennehy, E. Dieudonné, K. Dingilian, A. Donateo, K. Doulgeris, K. Edwards, K. Fahey, T. Fang, F. Guo, L. Heinlein, A. Holen, D. Huff, A. Ijaz, S. Johnson, S. Kapur, D. Ketcherside, E. Levin, E. Lill, A. Moon, T. Onishi, G. 
Pappaccogli, R. Perkins, R. Pohorsky, J. Raut, F. Ravetta, T. Roberts, E. Robinson, F. Scoto, V. Selimovic, M. Sunday, B. Temime-Roussel, X. Tian, J. Wu, and Y. Yang. OVERVIEW OF THE ALASKAN LAYERED POLLUTION AND CHEMICAL ANALYSIS (ALPACA) FIELD EXPERIMENT. ENVIRONMENTAL SCIENCE & TECHNOLOGY. American Chemical Society, Washington, DC, USA, 1(3): 200-222, (2024).

  8. limo-trial7-verified-alpaca-format

    • huggingface.co
    Updated Feb 14, 2025
    + more versions
    Cite
    Language & AGI Lab (2025). limo-trial7-verified-alpaca-format [Dataset]. https://huggingface.co/datasets/LangAGI-Lab/limo-trial7-verified-alpaca-format
    Explore at: Croissant
    Dataset authored and provided by
    Language & AGI Lab
    Description

    LangAGI-Lab/limo-trial7-verified-alpaca-format dataset hosted on Hugging Face and contributed by the HF Datasets community

  9. WRF outputs for ALPACA 2022

    • catalog.data.gov
    Updated Jan 31, 2025
    Cite
    U.S. EPA Office of Research and Development (ORD) (2025). WRF outputs for ALPACA 2022 [Dataset]. https://catalog.data.gov/dataset/wrf-outputs-for-alpaca-2022
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    Standard WRF output in NetCDF format. This dataset is not publicly accessible because: The dataset consists of 59 WRF output files totaling about 90 GB. It can be accessed through the following means: US EPA Atmos tape drive archive: /asm/MOD3DEV/met/ALPACA/wrf_outputs/ALL_OBS_TWEAKS_FDDA9. Format: Raw WRF outputs in NetCDF format. This dataset is associated with the following publication: Brett, N., K. Law, S. Arnold, J.G. Fochesatto, J. Raut, T. Onishi, R. Gilliam, K. Fahey, D. Huff, G. Pouliot, B. Barret, E. Dieudonné, R. Pohorsky, J. Schmale, A. Baccarini, S. Bekki, G. Pappaccogli, F. Scoto, S. Decesari, A. Donateo, M. Cesler-Maloney, W. Simpson, P. Medina, B. D'Anna, B. Temime-Roussel, J. Savarino, S. Albertin, J. Mao, B. Alexander, A. Moon, P. DeCarlo, V. Selimovic, R. Yokelson, and E.S. Robinson. Investigating processes influencing simulation of local Arctic wintertime anthropogenic pollution in Fairbanks, Alaska during ALPACA-2022. Atmospheric Chemistry and Physics. Copernicus Publications, Katlenburg-Lindau, GERMANY, 25(2): 1063–1104, (2025).

  10. Alpaca Cleaned Dutch

    • data.niaid.nih.gov
    Updated Jun 20, 2023
    Cite
    Vanroy, Bram (2023). Alpaca Cleaned Dutch [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8052362
    Dataset authored and provided by
    Vanroy, Bram
    License

    Attribution-NonCommercial 2.0 (CC BY-NC 2.0): https://creativecommons.org/licenses/by-nc/2.0/
    License information was derived automatically

    Description

    This dataset contains 51,712 conversations between an AI assistant and a (fake) "Human", generated in Dutch. They are translations of the Alpaca Cleaned dataset.

    Data Instances

    {
      'id': 7,
      'instruction': 'Leg uit waarom de volgende breuk gelijk is aan 1/4',
      'input': '4/16',
      'output': 'De breuk 4/16 is gelijk aan 1/4 omdat zowel de teller als de noemer deelbaar zijn door 4. Door zowel de teller als de noemer door 4 te delen, krijgen we de breuk 1/4.'
    }

    Data Fields

    id: the ID of the item. The following ID is not included because it could not be translated: [23019]

    instruction: the given instruction

    input: optional input to accompany the instruction. Can be empty.

    output: the "answer" to the instruction

    Dataset Creation

    The instructions, inputs and outputs were translated with OpenAI's API for gpt-3.5-turbo, using max_tokens=1024 and temperature=0 as parameters.

    The prompt template to translate is (where src_lang is English and tgt_lang is Dutch):

    TRANSLATION_PROMPT = """You are asked to translate a task's instruction, optional input to the task, and the output of the task, from {src_lang} into {tgt_lang}.

    Here are the requirements that you should adhere to: 1. maintain the format: the task consists of a task instruction (marked instruction:), optional input to the task (marked input:) and output for the task marked with output:; 2. do not translate the identifiers instruction:, input:, and output: but instead copy them to your output; 3. make sure that text is fluent to read and does not contain grammatical errors. Use standard {tgt_lang} without regional bias; 4. translate the instruction and input text using informal, but standard, language; 5. make sure to avoid biases (such as gender bias, grammatical bias, social bias); 6. if the instruction is to correct grammar mistakes or spelling mistakes then you have to generate a similar mistake in the input in {tgt_lang}, and then also generate a corrected output version in the output in {tgt_lang}; 7. if the instruction is to translate text from one language to another, then you do not translate the text that needs to be translated in the instruction or the input, nor the translation in the output (just copy them as-is); 8. do not translate code fragments but copy them to your output. If there are English examples, variable names or definitions in code fragments, keep them in English.

    Now translate the following task with the requirements set out above. Do not provide an explanation and do not add anything else.

    """

    This prompt is concatenated with the instruction, optionally the input, and the output. In code, that last part looks like this:

    text = f'instruction: "{instruction}"\n\n'
    if inputstr:
        text += f'input: "{inputstr}"\n\n'
    text += f'output: "{outputstr}"'

    The system message was:

    You are a helpful assistant that translates English to Dutch to the requirements that are given to you.

    Note that 1 item (0.0001%) was not successfully translated. The translation was missing the input, instruction, or output keywords where those were expected. The ID for the missing item is [23019].
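Putting the pieces above together, the request sent per item combines the system message, the translation prompt, and the formatted task text. This is a sketch of that assembly only (no API call); the function name and arguments are illustrative, not the author's exact code.

```python
# Sketch: assemble the chat messages for one translation request,
# following the system message and task formatting described above.
SYSTEM = ("You are a helpful assistant that translates English to Dutch "
          "to the requirements that are given to you.")

def build_messages(prompt: str, instruction: str,
                   outputstr: str, inputstr: str = "") -> list[dict]:
    text = f'instruction: "{instruction}"\n\n'
    if inputstr:  # the input field is optional and may be empty
        text += f'input: "{inputstr}"\n\n'
    text += f'output: "{outputstr}"'
    return [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": prompt + text},
    ]

msgs = build_messages("Translate the task below.\n\n", "Name a color.", "Blue.")
print(msgs[1]["content"].startswith("Translate"))  # True
```

The resulting list is the shape expected by a chat-completion endpoint such as gpt-3.5-turbo with max_tokens=1024 and temperature=0, as stated above.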

    Initial data creation of the English dataset by Tatsu lab and cleaned by Yahma.

    Also available on HuggingFace hub (with a more extensive README).

    Licensing Information

    As per OpenAI's terms of use, this dataset cannot be used to build a commercial system that competes with OpenAI's services. Similar to the original Alpaca dataset, this dataset is released under CC NC 4.0.

    This text was generated (either in part or in full) with GPT-3 (gpt-3.5-turbo), OpenAI’s large-scale language-generation model. Upon generating draft language, the author reviewed, edited, and revised the language to their own liking and takes ultimate responsibility for the content of this publication.

    If you use this dataset, you must also follow the Sharing and Usage policies.

    As clearly stated in their Terms of Use, specifically 2c.iii, "[you may not] use output from the Services to develop models that compete with OpenAI". That means that you cannot use this dataset to build models that are intended to commercially compete with OpenAI. As far as I am aware, that is a specific restriction that should serve as an addendum to the current license.

  11. Emissions for Fairbanks AK and ALPACA products

    • catalog.data.gov
    Updated Dec 15, 2024
    Cite
    U.S. EPA Office of Research and Development (ORD) (2024). Emissions for Fairbanks AK and ALPACA products [Dataset]. https://catalog.data.gov/dataset/emissions-for-fairbanks-ak-and-alpaca-products
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Area covered
    Alaska, Fairbanks
    Description

    January-February 2022 Fairbanks Emissions for the US EPA 1.33 km WRF-CMAQ domain. Portions of this dataset are inaccessible because: Raw emissions files in NetCDF IOAPI format are too large. The directory sizes in this archive of ALPACA emissions (around 75 GB total) are:

    63G  ./premerged
    11G  ./onroad_ADEC_surrogates
    121M ./ptegu_zehnder
    121M ./ptegu_uaf
    121M ./ptegu_north_pole
    121M ./ptegu_ft_wainwright
    121M ./ptegu_doyon
    121M ./ptegu_aurora_chena
    2.0M ./smk_merge_dates_202202.txt
    2.0M ./smk_merge_dates_202201.txt

    They can be accessed through the following means: See the data dictionary at the archived location on the US EPA computer system for specifics, but all emissions are archived on tape drives connected to the Atmos computer system here: /asm/MOD3DEV/kfa/Fairbanks/ALPACA/emis. Format: See the provided data dictionary document (Data_dictionary_Emissions.docx) for full details. This dataset includes emissions inputs for the Community Multiscale Air Quality (CMAQ) modeling system for the 1.33 km resolution Fairbanks domain (Figure 1, inner box) during the ALPACA period (January 17-February 25, 2022). The following sections describe the data as well as their location on the archival file system, /asm, for EPA's Atmos high-performance computing platform.

  12. vi-alpaca-input-output-format

    • huggingface.co
    Updated Apr 28, 2025
    + more versions
    Cite
    BKAI-HUST Foundation Models Lab (2025). vi-alpaca-input-output-format [Dataset]. https://huggingface.co/datasets/bkai-foundation-models/vi-alpaca-input-output-format
    Explore at: Croissant
    Dataset authored and provided by
    BKAI-HUST Foundation Models Lab
    Description

    🇻🇳 Vietnamese modified Alpaca Dataset

    This dataset is especially designed for Vietnamese, based on ideas from Stanford Alpaca, the Self-Instruct paper, and Chinese LLaMA. The motivation behind its creation stems from the hope to contribute a high-quality dataset to the Vietnamese community for training language models. To construct this dataset, we follow a two-step process:

    Step 1: Manually create Vietnamese seed tasks We employ the methodology outlined in the Self-Instruct… See the full description on the dataset page: https://huggingface.co/datasets/bkai-foundation-models/vi-alpaca-input-output-format.

  13. AlpacaEval Dataset

    • paperswithcode.com
    Updated Mar 6, 2024
    Cite
    Yann Dubois; Xuechen Li; Rohan Taori; Tianyi Zhang; Ishaan Gulrajani; Jimmy Ba; Carlos Guestrin; Percy Liang; Tatsunori B. Hashimoto (2024). AlpacaEval Dataset [Dataset]. https://paperswithcode.com/dataset/alpacaeval
    Authors
    Yann Dubois; Xuechen Li; Rohan Taori; Tianyi Zhang; Ishaan Gulrajani; Jimmy Ba; Carlos Guestrin; Percy Liang; Tatsunori B. Hashimoto
    Description

    The AlpacaEval set contains 805 instructions from self-instruct, open-assistant, vicuna, koala, and hh-rlhf. These were selected so that the AlpacaEval ranking of models on the AlpacaEval set would be similar to the ranking on the Alpaca demo data.

  14. Alpaca Instruction NLU Dataset

    • opendatabay.com
    Updated Jul 5, 2025
    Cite
    Datasimple (2025). Alpaca Instruction NLU Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/1db8957f-c1f2-4529-a98d-3e46b53360b4
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Data Science and Analytics
    Description

    This dataset, titled "TokenBender: 122k Alpaca-Style Instructions Word-Level Classification Towards Accurate Natural Language Understanding", offers a collection of 122,000 Alpaca-style instructions, each paired with corresponding input, text, and output for word-level classification. It is designed to facilitate natural language understanding (NLU) research by providing entries from diverse areas such as programming code instructions and gaming instructions, presented at varying levels of complexity. The dataset assists developers aiming to apply natural language processing (NLP) techniques, offering insights into how to improve the accuracy and ease the comprehension of human language commands. Utilising this dataset, one can develop advanced algorithms, such as neural networks or decision trees, capable of quickly understanding commands in various languages and bridging the gap between machines and humans for practical applications. It serves as a valuable resource for those seeking to gain insight into NLU through data science approaches.

    Columns

    • input: The input associated with the instruction.
    • text: The Alpaca-Style instruction that corresponds to the user's input.
    • output: The associated output for word-level classification.

    Distribution

    The dataset is structured as a train.csv file, containing 122,000 Alpaca-Style Instructions. The input column holds 121,683 unique values, the text column contains 121,957 unique values, and the output column features 120,724 unique values.
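Given the train.csv structure described above, a record can be read with an ordinary CSV parser. This sketch uses the documented column names (input, text, output) but an invented row, since no actual row content is shown here.

```python
import csv
import io

# Illustrative only: a tiny in-memory stand-in for train.csv, using the
# documented columns (input, text, output). The row content is invented.
sample_csv = io.StringIO(
    "input,text,output\n"
    '"2+2","Compute the sum.","4"\n'
)

rows = list(csv.DictReader(sample_csv))
print(sorted(rows[0].keys()))  # ['input', 'output', 'text']
```

Reading the real file would be the same call with `open("train.csv", newline="")` in place of the StringIO.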

    Usage

    This dataset is ideal for: * Developing AI-based algorithms to accurately understand the meaning of natural language instructions. * Training and testing machine learning models for classifying specific words and phrases within natural language instructions. * Training deep learning models to generate visual components based on given input, text, and output values. * Applying and enhancing natural language processing techniques for machine comprehension. * Developing advanced neural networks or decision trees for understanding commands across languages.

    Coverage

    The dataset's coverage is global. It was listed on 16/06/2025. It includes diverse instruction types, such as programming code and gaming instructions. No specific historical time range or demographic scope is detailed beyond the listing date.

    License

    CC0

    Who Can Use It

    • Developers focused on applying and improving natural language processing techniques.
    • Researchers engaged in natural language understanding.
    • Data scientists seeking insights into NLU through data science methods.
    • Anyone developing AI-based algorithms for natural language comprehension.
    • Teams and individuals training machine learning or deep learning models for classification or generation tasks related to natural language.

    Dataset Name Suggestions

    • Alpaca Instruction NLU Dataset
    • TokenBender Word Classification Data
    • Natural Language Understanding Instructions
    • Alpaca-Style NLP Training Set
    • Word-Level Text Classification Data

    Attributes

    Original Data Source: Alpaca

  15. mental-alpaca-format

    • huggingface.co
    Updated Apr 18, 2019
    Cite
    boris (2019). mental-alpaca-format [Dataset]. https://huggingface.co/datasets/usham/mental-alpaca-format
    Authors
    boris
    Description

    usham/mental-alpaca-format dataset hosted on Hugging Face and contributed by the HF Datasets community

  16. ffmperative-alpaca-format-50k

    • huggingface.co
    Updated Mar 1, 2024
    + more versions
    Cite
    Remyx AI (2024). ffmperative-alpaca-format-50k [Dataset]. https://huggingface.co/datasets/remyxai/ffmperative-alpaca-format-50k
    Explore at: Croissant
    Dataset authored and provided by
    Remyx AI
    Description

    remyxai/ffmperative-alpaca-format-50k dataset hosted on Hugging Face and contributed by the HF Datasets community

  17. Machine_Mindset_MBTI_dataset Dataset

    • paperswithcode.com
    Updated Jan 6, 2024
    Cite
    Jiaxi Cui; Liuzhenghao Lv; Jing Wen; Rongsheng Wang; Jing Tang; Yonghong Tian; Li Yuan (2023). Machine_Mindset_MBTI_dataset Dataset [Dataset]. https://paperswithcode.com/dataset/machine-mindset-mbti-dataset
    Authors
    Jiaxi Cui; Liuzhenghao Lv; Jing Wen; Rongsheng Wang; Jing Tang; Yonghong Tian; Li Yuan
    Description

    Dataset introduction

    There are four dimensions in MBTI, and there are two opposite attributes within each dimension.

    To be specific:

    Energy: Extraversion (E) - Introversion (I)

    Information: Sensing (S) - Intuition (N)

    Decision: Thinking (T) - Feeling (F)

    Execution: Judging (J) - Perceiving (P)

    Based on the above, you can infer the content of the json file from its name.

    The datasets follow the Alpaca format, consisting of instruction, input and output.

    How to use these datasets for behavior supervised fine-tuning (SFT) For example, if you want to make an LLM behave like an ISFJ, you need to select the four corresponding files (en_energe_introversion.json, en_information_sensing.json, en_decision_feeling.json, en_execution_judging.json).

    And use the four for SFT.

    How to use these datasets for direct preference optimization (DPO) For example, if you want to make an LLM be more feeling (F) than thinking (T) by DPO, you need to select the two corresponding files (en_decision_feeling.json, en_decision_thinking.json).

    And then compile the two into the correct format for DPO. For the correct format, please refer to this.
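The file selection described above is mechanical, so it can be sketched as a small lookup. This is not the authors' code: the four ISFJ file names match the listing above (including its "energe" spelling), while the names for the opposite attributes not shown there (extraversion, intuition, thinking, perceiving) are guesses by analogy.

```python
# Sketch: map each MBTI letter to its dataset file stem, following the
# en_<dimension>_<attribute>.json scheme shown above. Stems marked with
# the listing are taken from it; the others are assumed by analogy.
DIMENSIONS = {
    "E": "energe_extraversion",   # assumed
    "I": "energe_introversion",   # from the listing
    "S": "information_sensing",   # from the listing
    "N": "information_intuition", # assumed
    "T": "decision_thinking",     # from the DPO example
    "F": "decision_feeling",      # from the listing
    "J": "execution_judging",     # from the listing
    "P": "execution_perceiving",  # assumed
}

def sft_files(mbti_type: str, lang: str = "en") -> list[str]:
    """Return the four JSON file names to use for SFT, e.g. for 'ISFJ'."""
    return [f"{lang}_{DIMENSIONS[letter]}.json" for letter in mbti_type.upper()]

print(sft_files("ISFJ"))
```

For ISFJ this reproduces the four files named above: en_energe_introversion.json, en_information_sensing.json, en_decision_feeling.json, en_execution_judging.json.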

  18. limo-new-alpaca-format

    • huggingface.co
    + more versions
    Cite
    Language & AGI Lab, limo-new-alpaca-format [Dataset]. https://huggingface.co/datasets/LangAGI-Lab/limo-new-alpaca-format
    Dataset authored and provided by
    Language & AGI Lab
    Description

    LangAGI-Lab/limo-new-alpaca-format dataset hosted on Hugging Face and contributed by the HF Datasets community

  19. retail-alpaca-format

    • huggingface.co
    Cite
    S Aditya, retail-alpaca-format [Dataset]. https://huggingface.co/datasets/aditya3w3733/retail-alpaca-format
    Authors
    S Aditya
    Description

    aditya3w3733/retail-alpaca-format dataset hosted on Hugging Face and contributed by the HF Datasets community

  20. qwen-7b-instruct-8k-rft-alpaca-format

    • huggingface.co
    + more versions
    Cite
    Language & AGI Lab, qwen-7b-instruct-8k-rft-alpaca-format [Dataset]. https://huggingface.co/datasets/LangAGI-Lab/qwen-7b-instruct-8k-rft-alpaca-format
    Explore at: Croissant
    Dataset authored and provided by
    Language & AGI Lab
    Description

    LangAGI-Lab/qwen-7b-instruct-8k-rft-alpaca-format dataset hosted on Hugging Face and contributed by the HF Datasets community

Tatsu Lab (2023). alpaca [Dataset]. https://huggingface.co/datasets/tatsu-lab/alpaca. Cited by 62 scholarly articles (View in Google Scholar).