100+ datasets found
  1. alpaca

    • huggingface.co
    • opendatalab.com
    Updated Mar 14, 2023
    + more versions
    Cite
    Tatsu Lab (2023). alpaca [Dataset]. https://huggingface.co/datasets/tatsu-lab/alpaca
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset authored and provided by
    Tatsu Lab
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Dataset Card for Alpaca

      Dataset Summary
    

    Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction tuning for language models and make them follow instructions better. The authors built on the data generation pipeline from the Self-Instruct framework and made the following modifications:

    The text-davinci-003 engine to generate the instruction data instead… See the full description on the dataset page: https://huggingface.co/datasets/tatsu-lab/alpaca.
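Records in this dataset follow the instruction/input/output convention. A minimal sketch of rendering one record into a training prompt is shown below; the "### Instruction:/### Input:/### Response:" template is a common convention, not something defined by the dataset card itself, and the example record is illustrative.

```python
# Sketch: turn one Alpaca-style record into a single training prompt.
# The section-header template here is a common convention, not the only one.
def build_prompt(record: dict) -> str:
    header = "Below is an instruction that describes a task."
    parts = [header, f"### Instruction:\n{record['instruction']}"]
    if record.get("input"):  # the input field is optional and may be empty
        parts.append(f"### Input:\n{record['input']}")
    parts.append(f"### Response:\n{record['output']}")
    return "\n\n".join(parts)

example = {
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "1. Eat a balanced diet. 2. Exercise regularly. 3. Sleep well.",
}
print(build_prompt(example))
```

Because the input field is empty here, the "### Input:" section is omitted from the rendered prompt.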

  2. synth-ehr-icd10-alpaca-format

    • huggingface.co
    Updated Jun 26, 2024
    Cite
    Generative Technologies, Inc (2024). synth-ehr-icd10-alpaca-format [Dataset]. https://huggingface.co/datasets/generative-technologies/synth-ehr-icd10-alpaca-format
    Explore at: Croissant
    Dataset authored and provided by
    Generative Technologies, Inc
    Description

    generative-technologies/synth-ehr-icd10-alpaca-format dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. sft-ready-Text-Generation-Augmented-Data-Alpaca-Format

    • huggingface.co
    Updated Dec 11, 2024
    + more versions
    Cite
    Ali Janati (2024). sft-ready-Text-Generation-Augmented-Data-Alpaca-Format [Dataset]. https://huggingface.co/datasets/Na0s/sft-ready-Text-Generation-Augmented-Data-Alpaca-Format
    Explore at: Croissant
    Authors
    Ali Janati
    Description

    Na0s/sft-ready-Text-Generation-Augmented-Data-Alpaca-Format dataset hosted on Hugging Face and contributed by the HF Datasets community

  4. alpaca-cleaned

    • huggingface.co
    • kaggle.com
    Updated Apr 9, 2023
    + more versions
    Cite
    Gene Ruebsamen (2023). alpaca-cleaned [Dataset]. https://huggingface.co/datasets/yahma/alpaca-cleaned
    Explore at: Croissant
    Authors
    Gene Ruebsamen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for Alpaca-Cleaned

    Repository: https://github.com/gururise/AlpacaDataCleaned

      Dataset Description
    

    This is a cleaned version of the original Alpaca Dataset released by Stanford. The following issues have been identified in the original release and fixed in this dataset:

    Hallucinations: Many instructions in the original dataset referenced data on the internet, which simply caused GPT-3 to hallucinate an answer.

    "instruction":"Summarize the… See the full description on the dataset page: https://huggingface.co/datasets/yahma/alpaca-cleaned.

  5. toxicity-instruct-alpaca-format

    • huggingface.co
    Cite
    raj, toxicity-instruct-alpaca-format [Dataset]. https://huggingface.co/datasets/acloudfan/toxicity-instruct-alpaca-format
    Explore at: Croissant
    Authors
    raj
    Description

    acloudfan/toxicity-instruct-alpaca-format dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. Alpaca Dataset Image Classification Dataset

    • paperswithcode.com
    • gts.ai
    Updated Jun 26, 2025
    + more versions
    Cite
    (2025). Alpaca Dataset Image Classification Dataset [Dataset]. https://paperswithcode.com/dataset/alpaca-dataset-image-classification
    Description


    The Alpaca Dataset is a collection of JPEG images designed for binary image classification tasks, specifically classifying images as “Alpaca” or “Not Alpaca”. This dataset is ideal for training and fine-tuning machine learning models using transfer learning techniques.


    Context

    This small dataset is perfect for educational purposes, initial model testing, and developing proof-of-concept applications in image classification. Due to its limited size, it is most beneficial when used in conjunction with transfer learning to leverage pre-trained models for improved accuracy.

    Content

    The dataset is organized into two primary directories:

    Alpaca: Contains images that include alpacas.

    Not Alpaca: Contains images without alpacas, featuring subjects that may resemble alpacas but are not.

    Additional Information

    Format: All images are in JPEG format, ensuring compatibility with a wide range of image processing libraries and tools.

    Usage: This dataset can be utilized in various machine learning frameworks such as TensorFlow, PyTorch, and Keras for building and testing classification models.

    Applications: Potential applications include animal recognition systems, educational tools, and development of AI-driven content moderation systems.

    Data Statistics

    Total Images: X (Number of images in the dataset)

    Alpaca Images: Y (Number of images in the Alpaca directory)

    Not Alpaca Images: Z (Number of images in the Not Alpaca directory)

    Image Resolution: Varies, with most images having a resolution suitable for quick model training and evaluation.

    This dataset is sourced from Kaggle.
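Since the two-directory layout described above encodes the label in the parent folder name, a label can be derived directly from each image path. This is a minimal sketch under that assumption; the file names are invented for illustration.

```python
from pathlib import Path

# Sketch: infer the binary label from the "Alpaca" / "Not Alpaca"
# directory layout described above. Paths here are hypothetical.
def label_from_path(image_path: str) -> int:
    """Return 1 for images under an 'Alpaca' directory, else 0."""
    return 1 if Path(image_path).parent.name == "Alpaca" else 0

labels = [label_from_path(p) for p in [
    "dataset/Alpaca/img_001.jpg",
    "dataset/Not Alpaca/img_042.jpg",
]]
print(labels)  # [1, 0]
```

The same convention is what directory-based loaders (e.g. torchvision's ImageFolder or Keras's image_dataset_from_directory) rely on when ingesting this kind of dataset.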

  7. ALPACA-overview-paper-metadata

    • catalog.data.gov
    Updated Sep 9, 2024
    Cite
    U.S. EPA Office of Research and Development (ORD) (2024). ALPACA-overview-paper-metadata [Dataset]. https://catalog.data.gov/dataset/alpaca-overview-paper-metadata
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    This is metadata for the figures from an overview paper for the ALPACA Fairbanks winter air quality study. The paper includes only (non-EPA funded/generated) data to support some early analysis and drivers of the field study, but it is expected that the bulk of the data from the study will be analyzed and published by individual PIs in the (near) future. Note that final data from the study will be available to the scientific community through the ALPACA data portal hosted by Arcticdata.io (https://arcticdata.io/catalog/portals/ALPACA). This dataset is not publicly accessible because: The ALPACA overview paper includes (non-EPA funded/generated) data to support some early analysis and drivers of the field study, but it is expected that the bulk of the data from the study will be analyzed and published by individual PIs in the (near) future. It can be accessed through the following means: Final data from the field study will be available to the scientific community through the ALPACA data portal hosted by Arcticdata.io (https://arcticdata.io/catalog/portals/ALPACA). Format: Here we include metadata for overview paper figures (including data definitions and units). This dataset is associated with the following publication: Simpson, W., J. Mao, G. Fochesatto, K. Law, P. DeCarlo, J. Schmale, K. Pratt, S. Arnold, J. Stutz, J. Dibb, J. Creamean, R. Weber, B. Williams, B. Alexander, L. Hu, R. Yokelson, M. Shiraiwa, S. Decesari, C. Anastasio, B. D'Anna, R. Gilliam, A. Nenes, J. St. Clair, B. Trost, J. Flynn, J. Savarino, L. Conner, N. Kettle, K. Heeringa, S. Albertin, A. Baccarini, B. Barret, M. Battaglia, S. Bekki, T. Brado, N. Brett, D. Brus, J. Campbell, M. Cesler-Maloney, S. Cooperdock, K. Cysneiros de Carvalho, H. Delbarre, P. DeMott, C. Dennehy, E. Dieudonné, K. Dingilian, A. Donateo, K. Doulgeris, K. Edwards, K. Fahey, T. Fang, F. Guo, L. Heinlein, A. Holen, D. Huff, A. Ijaz, S. Johnson, S. Kapur, D. Ketcherside, E. Levin, E. Lill, A. Moon, T. Onishi, G. 
Pappaccogli, R. Perkins, R. Pohorsky, J. Raut, F. Ravetta, T. Roberts, E. Robinson, F. Scoto, V. Selimovic, M. Sunday, B. Temime-Roussel, X. Tian, J. Wu, and Y. Yang. OVERVIEW OF THE ALASKAN LAYERED POLLUTION AND CHEMICAL ANALYSIS (ALPACA) FIELD EXPERIMENT. ENVIRONMENTAL SCIENCE & TECHNOLOGY. American Chemical Society, Washington, DC, USA, 1(3): 200-222, (2024).

  8. limo-trial7-verified-alpaca-format

    • huggingface.co
    Updated Feb 14, 2025
    + more versions
    Cite
    Language & AGI Lab (2025). limo-trial7-verified-alpaca-format [Dataset]. https://huggingface.co/datasets/LangAGI-Lab/limo-trial7-verified-alpaca-format
    Explore at: Croissant
    Dataset authored and provided by
    Language & AGI Lab
    Description

    LangAGI-Lab/limo-trial7-verified-alpaca-format dataset hosted on Hugging Face and contributed by the HF Datasets community

  9. WRF outputs for ALPACA 2022

    • catalog.data.gov
    Updated Jan 31, 2025
    Cite
    U.S. EPA Office of Research and Development (ORD) (2025). WRF outputs for ALPACA 2022 [Dataset]. https://catalog.data.gov/dataset/wrf-outputs-for-alpaca-2022
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    Standard WRF output in NetCDF format. This dataset is not publicly accessible because: The dataset consists of 59 WRF output files totaling about 90 GB. It can be accessed through the following means: US EPA Atmos tape drive archive: /asm/MOD3DEV/met/ALPACA/wrf_outputs/ALL_OBS_TWEAKS_FDDA9. Format: Raw WRF outputs in NetCDF format. This dataset is associated with the following publication: Brett, N., K. Law, S. Arnold, J.G. Fochesatto, J. Raut, T. Onishi, R. Gilliam, K. Fahey, D. Huff, G. Pouliot, B. Barret, E. Dieudonné, R. Pohorsky, J. Schmale, A. Baccarini, S. Bekki, G. Pappaccogli, F. Scoto, S. Decesari, A. Donateo, M. Cesler-Maloney, W. Simpson, P. Medina, B. D'Anna, B. Temime-Roussel, J. Savarino, S. Albertin, J. Mao, B. Alexander, A. Moon, P. DeCarlo, V. Selimovic, R. Yokelson, and E.S. Robinson. Investigating processes influencing simulation of local Arctic wintertime anthropogenic pollution in Fairbanks, Alaska during ALPACA-2022. Atmospheric Chemistry and Physics. Copernicus Publications, Katlenburg-Lindau, GERMANY, 25(2): 1063–1104, (2025).

  10. Alpaca Cleaned Dutch

    • data.niaid.nih.gov
    Updated Jun 20, 2023
    Cite
    Vanroy, Bram (2023). Alpaca Cleaned Dutch [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8052362
    Dataset authored and provided by
    Vanroy, Bram
    License

    Attribution-NonCommercial 2.0 (CC BY-NC 2.0): https://creativecommons.org/licenses/by-nc/2.0/
    License information was derived automatically

    Description

    This dataset contains 51,712 conversations between an AI assistant and a (fake) "Human", generated in Dutch. They are translations of the Alpaca Cleaned dataset.

    Data Instances

    {
      'id': 7,
      'instruction': 'Leg uit waarom de volgende breuk gelijk is aan 1/4',
      'input': '4/16',
      'output': 'De breuk 4/16 is gelijk aan 1/4 omdat zowel de teller als de noemer deelbaar zijn door 4. Door zowel de teller als de noemer door 4 te delen, krijgen we de breuk 1/4.'
    }

    Data Fields

    id: the ID of the item. The following ID is not included because it could not be translated: [23019]

    instruction: the given instruction

    input: optional input to accompany the instruction. Can be empty.

    output: the "answer" to the instruction

    Dataset Creation

    The instructions, inputs and outputs were translated with OpenAI's API for gpt-3.5-turbo, using max_tokens=1024 and temperature=0 as parameters.

    The prompt template to translate is (where src_lang is English and tgt_lang is Dutch):

    TRANSLATION_PROMPT = """You are asked to translate a task's instruction, optional input to the task, and the output of the task, from {src_lang} into {tgt_lang}.

    Here are the requirements that you should adhere to: 1. maintain the format: the task consists of a task instruction (marked instruction:), optional input to the task (marked input:) and output for the task marked with output:; 2. do not translate the identifiers instruction:, input:, and output: but instead copy them to your output; 3. make sure that text is fluent to read and does not contain grammatical errors. Use standard {tgt_lang} without regional bias; 4. translate the instruction and input text using informal, but standard, language; 5. make sure to avoid biases (such as gender bias, grammatical bias, social bias); 6. if the instruction is to correct grammar mistakes or spelling mistakes then you have to generate a similar mistake in the input in {tgt_lang}, and then also generate a corrected output version in the output in {tgt_lang}; 7. if the instruction is to translate text from one language to another, then you do not translate the text that needs to be translated in the instruction or the input, nor the translation in the output (just copy them as-is); 8. do not translate code fragments but copy them to your output. If there are English examples, variable names or definitions in code fragments, keep them in English.

    Now translate the following task with the requirements set out above. Do not provide an explanation and do not add anything else.

    """

    This prompt is concatenated with the instruction, optionally the input, and the output. In code, that last part looks like this:

    text = f'instruction: "{instruction}"\n\n'
    if inputstr:
        text += f'input: "{inputstr}"\n\n'
    text += f'output: "{outputstr}"'

    The system message was:

    You are a helpful assistant that translates English to Dutch to the requirements that are given to you.

    Note that 1 item (0.0001%) was not successfully translated. The translation was missing the input, instruction, or output keywords where those were expected. The ID for the missing item is [23019].
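Putting the pieces above together, the request sent per item combines the system message, the translation prompt, and the formatted task text. This is a sketch of that assembly only (no API call); the function name and arguments are illustrative, not the author's exact code.

```python
# Sketch: assemble the chat messages for one translation request,
# following the system message and task formatting described above.
SYSTEM = ("You are a helpful assistant that translates English to Dutch "
          "to the requirements that are given to you.")

def build_messages(prompt: str, instruction: str,
                   outputstr: str, inputstr: str = "") -> list[dict]:
    text = f'instruction: "{instruction}"\n\n'
    if inputstr:  # the input field is optional and may be empty
        text += f'input: "{inputstr}"\n\n'
    text += f'output: "{outputstr}"'
    return [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": prompt + text},
    ]

msgs = build_messages("Translate the task below.\n\n", "Name a color.", "Blue.")
print(msgs[1]["content"].startswith("Translate"))  # True
```

The resulting list is the shape expected by a chat-completion endpoint such as gpt-3.5-turbo with max_tokens=1024 and temperature=0, as stated above.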

    Initial data creation of the English dataset by Tatsu lab and cleaned by Yahma.

    Also available on HuggingFace hub (with a more extensive README).

    Licensing Information

    As per OpenAI's terms of use, this dataset cannot be used to build a commercial system that competes with OpenAI's services. Similar to the original Alpaca dataset, this dataset is released under CC NC 4.0.

    This text was generated (either in part or in full) with GPT-3 (gpt-3.5-turbo), OpenAI’s large-scale language-generation model. Upon generating draft language, the author reviewed, edited, and revised the language to their own liking and takes ultimate responsibility for the content of this publication.

    If you use this dataset, you must also follow the Sharing and Usage policies.

    As clearly stated in their Terms of Use, specifically 2c.iii, "[you may not] use output from the Services to develop models that compete with OpenAI". That means that you cannot use this dataset to build models that are intended to commercially compete with OpenAI. As far as I am aware, that is a specific restriction that should serve as an addendum to the current license.

  11. Emissions for Fairbanks AK and ALPACA products

    • catalog.data.gov
    Updated Dec 15, 2024
    Cite
    U.S. EPA Office of Research and Development (ORD) (2024). Emissions for Fairbanks AK and ALPACA products [Dataset]. https://catalog.data.gov/dataset/emissions-for-fairbanks-ak-and-alpaca-products
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Area covered
    Alaska, Fairbanks
    Description

    January-February 2022 Fairbanks Emissions for the US EPA 1.33 km WRF-CMAQ domain. Portions of this dataset are inaccessible because: Raw emissions files in NetCDF IOAPI format are too large. The directory sizes in this archive of ALPACA emissions (around 75 GB total) are:

    63G  ./premerged
    11G  ./onroad_ADEC_surrogates
    121M ./ptegu_zehnder
    121M ./ptegu_uaf
    121M ./ptegu_north_pole
    121M ./ptegu_ft_wainwright
    121M ./ptegu_doyon
    121M ./ptegu_aurora_chena
    2.0M ./smk_merge_dates_202202.txt
    2.0M ./smk_merge_dates_202201.txt

    They can be accessed through the following means: See the data dictionary at the archived location on the US EPA computer system for specifics, but all emissions are archived on tape drives connected to the Atmos computer system here: /asm/MOD3DEV/kfa/Fairbanks/ALPACA/emis. Format: See the provided data dictionary document (Data_dictionary_Emissions.docx) for full details. This dataset includes emissions inputs for the Community Multiscale Air Quality (CMAQ) modeling system for the 1.33 km resolution Fairbanks domain (Figure 1, inner box) during the ALPACA period (January 17-February 25, 2022). The following sections describe the data as well as their location on the archival file system, /asm, for EPA's Atmos high-performance computing platform.

  12. vi-alpaca-input-output-format

    • huggingface.co
    Updated Apr 28, 2025
    + more versions
    Cite
    BKAI-HUST Foundation Models Lab (2025). vi-alpaca-input-output-format [Dataset]. https://huggingface.co/datasets/bkai-foundation-models/vi-alpaca-input-output-format
    Explore at: Croissant
    Dataset authored and provided by
    BKAI-HUST Foundation Models Lab
    Description

    🇻🇳 Vietnamese modified Alpaca Dataset

    This dataset is especially designed for Vietnamese, based on ideas from Stanford Alpaca, the Self-Instruct paper, and Chinese LLaMA. The motivation behind its creation stems from the hope to contribute a high-quality dataset to the Vietnamese community for training language models. To construct this dataset, we follow a two-step process:

    Step 1: Manually create Vietnamese seed tasks We employ the methodology outlined in the Self-Instruct… See the full description on the dataset page: https://huggingface.co/datasets/bkai-foundation-models/vi-alpaca-input-output-format.

  13. AlpacaEval Dataset

    • paperswithcode.com
    Updated Mar 6, 2024
    Cite
    Yann Dubois; Xuechen Li; Rohan Taori; Tianyi Zhang; Ishaan Gulrajani; Jimmy Ba; Carlos Guestrin; Percy Liang; Tatsunori B. Hashimoto (2024). AlpacaEval Dataset [Dataset]. https://paperswithcode.com/dataset/alpacaeval
    Authors
    Yann Dubois; Xuechen Li; Rohan Taori; Tianyi Zhang; Ishaan Gulrajani; Jimmy Ba; Carlos Guestrin; Percy Liang; Tatsunori B. Hashimoto
    Description

    The AlpacaEval set contains 805 instructions from self-instruct, open-assistant, vicuna, koala, and hh-rlhf. These were selected so that the AlpacaEval ranking of models on the AlpacaEval set would be similar to the ranking on the Alpaca demo data.

  14. Alpaca Instruction NLU Dataset

    • opendatabay.com
    Updated Jul 5, 2025
    Cite
    Datasimple (2025). Alpaca Instruction NLU Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/1db8957f-c1f2-4529-a98d-3e46b53360b4
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Data Science and Analytics
    Description

    This dataset, titled "TokenBender: 122k Alpaca-Style Instructions Word-Level Classification Towards Accurate Natural Language Understanding", offers a collection of 122,000 Alpaca-style instructions, each paired with corresponding input, text, and output for word-level classification. It is designed to facilitate natural language understanding (NLU) research by providing entries from diverse areas such as programming code instructions and gaming instructions, presented at varying levels of complexity. The dataset assists developers aiming to apply natural language processing (NLP) techniques, offering insights into how to improve the accuracy and ease the comprehension of human language commands. Utilising this dataset, one can develop advanced algorithms, such as neural networks or decision trees, capable of quickly understanding commands in various languages and bridging the gap between machines and humans for practical applications. It serves as a valuable resource for those seeking to gain insight into NLU through data science approaches.

    Columns

    • input: The input associated with the instruction.
    • text: The Alpaca-Style instruction that corresponds to the user's input.
    • output: The associated output for word-level classification.

    Distribution

    The dataset is structured as a train.csv file, containing 122,000 Alpaca-Style Instructions. The input column holds 121,683 unique values, the text column contains 121,957 unique values, and the output column features 120,724 unique values.
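Given the train.csv structure described above, a record can be read with an ordinary CSV parser. This sketch uses the documented column names (input, text, output) but an invented row, since no actual row content is shown here.

```python
import csv
import io

# Illustrative only: a tiny in-memory stand-in for train.csv, using the
# documented columns (input, text, output). The row content is invented.
sample_csv = io.StringIO(
    "input,text,output\n"
    '"2+2","Compute the sum.","4"\n'
)

rows = list(csv.DictReader(sample_csv))
print(sorted(rows[0].keys()))  # ['input', 'output', 'text']
```

Reading the real file would be the same call with `open("train.csv", newline="")` in place of the StringIO.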

    Usage

    This dataset is ideal for: * Developing AI-based algorithms to accurately understand the meaning of natural language instructions. * Training and testing machine learning models for classifying specific words and phrases within natural language instructions. * Training deep learning models to generate visual components based on given input, text, and output values. * Applying and enhancing natural language processing techniques for machine comprehension. * Developing advanced neural networks or decision trees for understanding commands across languages.

    Coverage

    The dataset's coverage is global. It was listed on 16/06/2025. It includes diverse instruction types, such as programming code and gaming instructions. No specific historical time range or demographic scope is detailed beyond the listing date.

    License

    CC0

    Who Can Use It

    • Developers focused on applying and improving natural language processing techniques.
    • Researchers engaged in natural language understanding.
    • Data scientists seeking insights into NLU through data science methods.
    • Anyone developing AI-based algorithms for natural language comprehension.
    • Teams and individuals training machine learning or deep learning models for classification or generation tasks related to natural language.

    Dataset Name Suggestions

    • Alpaca Instruction NLU Dataset
    • TokenBender Word Classification Data
    • Natural Language Understanding Instructions
    • Alpaca-Style NLP Training Set
    • Word-Level Text Classification Data

    Attributes

    Original Data Source: Alpaca

  15. mental-alpaca-format

    • huggingface.co
    Updated Apr 18, 2019
    Cite
    boris (2019). mental-alpaca-format [Dataset]. https://huggingface.co/datasets/usham/mental-alpaca-format
    Authors
    boris
    Description

    usham/mental-alpaca-format dataset hosted on Hugging Face and contributed by the HF Datasets community

  16. ffmperative-alpaca-format-50k

    • huggingface.co
    Updated Mar 1, 2024
    + more versions
    Cite
    Remyx AI (2024). ffmperative-alpaca-format-50k [Dataset]. https://huggingface.co/datasets/remyxai/ffmperative-alpaca-format-50k
    Explore at: Croissant
    Dataset authored and provided by
    Remyx AI
    Description

    remyxai/ffmperative-alpaca-format-50k dataset hosted on Hugging Face and contributed by the HF Datasets community

  17. Machine_Mindset_MBTI_dataset Dataset

    • paperswithcode.com
    Updated Jan 6, 2024
    Cite
    Jiaxi Cui; Liuzhenghao Lv; Jing Wen; Rongsheng Wang; Jing Tang; Yonghong Tian; Li Yuan (2023). Machine_Mindset_MBTI_dataset Dataset [Dataset]. https://paperswithcode.com/dataset/machine-mindset-mbti-dataset
    Authors
    Jiaxi Cui; Liuzhenghao Lv; Jing Wen; Rongsheng Wang; Jing Tang; Yonghong Tian; Li Yuan
    Description

    Dataset introduction

    There are four dimensions in MBTI, and there are two opposite attributes within each dimension.

    To be specific:

    Energy: Extraversion (E) - Introversion (I)

    Information: Sensing (S) - Intuition (N)

    Decision: Thinking (T) - Feeling (F)

    Execution: Judging (J) - Perceiving (P)

    Based on the above, you can infer the content of the json file from its name.

    The datasets follow the Alpaca format, consisting of instruction, input and output.

    How to use these datasets for behavior supervised fine-tuning (SFT) For example, if you want to make an LLM behave like an ISFJ, you need to select the four corresponding files (en_energe_introversion.json, en_information_sensing.json, en_decision_feeling.json, en_execution_judging.json).

    And use the four for SFT.

    How to use these datasets for direct preference optimization (DPO) For example, if you want to make an LLM be more feeling (F) than thinking (T) by DPO, you need to select the two corresponding files (en_decision_feeling.json, en_decision_thinking.json).

    And then compile the two into the correct format for DPO. For the correct format, please refer to this.
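The file selection described above is mechanical, so it can be sketched as a small lookup. This is not the authors' code: the four ISFJ file names match the listing above (including its "energe" spelling), while the names for the opposite attributes not shown there (extraversion, intuition, thinking, perceiving) are guesses by analogy.

```python
# Sketch: map each MBTI letter to its dataset file stem, following the
# en_<dimension>_<attribute>.json scheme shown above. Stems marked with
# the listing are taken from it; the others are assumed by analogy.
DIMENSIONS = {
    "E": "energe_extraversion",   # assumed
    "I": "energe_introversion",   # from the listing
    "S": "information_sensing",   # from the listing
    "N": "information_intuition", # assumed
    "T": "decision_thinking",     # from the DPO example
    "F": "decision_feeling",      # from the listing
    "J": "execution_judging",     # from the listing
    "P": "execution_perceiving",  # assumed
}

def sft_files(mbti_type: str, lang: str = "en") -> list[str]:
    """Return the four JSON file names to use for SFT, e.g. for 'ISFJ'."""
    return [f"{lang}_{DIMENSIONS[letter]}.json" for letter in mbti_type.upper()]

print(sft_files("ISFJ"))
```

For ISFJ this reproduces the four files named above: en_energe_introversion.json, en_information_sensing.json, en_decision_feeling.json, en_execution_judging.json.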

  18. limo-new-alpaca-format

    • huggingface.co
    + more versions
    Cite
    Language & AGI Lab, limo-new-alpaca-format [Dataset]. https://huggingface.co/datasets/LangAGI-Lab/limo-new-alpaca-format
    Dataset authored and provided by
    Language & AGI Lab
    Description

    LangAGI-Lab/limo-new-alpaca-format dataset hosted on Hugging Face and contributed by the HF Datasets community

  19. retail-alpaca-format

    • huggingface.co
    Cite
    S Aditya, retail-alpaca-format [Dataset]. https://huggingface.co/datasets/aditya3w3733/retail-alpaca-format
    Authors
    S Aditya
    Description

    aditya3w3733/retail-alpaca-format dataset hosted on Hugging Face and contributed by the HF Datasets community

  20. qwen-7b-instruct-8k-rft-alpaca-format

    • huggingface.co
    + more versions
    Cite
    Language & AGI Lab, qwen-7b-instruct-8k-rft-alpaca-format [Dataset]. https://huggingface.co/datasets/LangAGI-Lab/qwen-7b-instruct-8k-rft-alpaca-format
    Explore at: Croissant
    Dataset authored and provided by
    Language & AGI Lab
    Description

    LangAGI-Lab/qwen-7b-instruct-8k-rft-alpaca-format dataset hosted on Hugging Face and contributed by the HF Datasets community

Tatsu Lab (2023). alpaca [Dataset]. https://huggingface.co/datasets/tatsu-lab/alpaca. Cited by 62 scholarly articles (View in Google Scholar).