8 datasets found
  1. alpaca

    • huggingface.co
    • opendatalab.com
    Updated Mar 14, 2023
    Cite
    Tatsu Lab (2023). alpaca [Dataset]. https://huggingface.co/datasets/tatsu-lab/alpaca
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 14, 2023
    Dataset authored and provided by
    Tatsu Lab
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Dataset Card for Alpaca

      Dataset Summary
    

    Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used for instruction-tuning language models so that they follow instructions better. The authors built on the data generation pipeline from the Self-Instruct framework and made the following modifications:

    The text-davinci-003 engine to generate the instruction data instead… See the full description on the dataset page: https://huggingface.co/datasets/tatsu-lab/alpaca.

  2. AlpacaDataCleaned

    • huggingface.co
    Updated Apr 10, 2023
    Cite
    Alessandro Lannocca (2023). AlpacaDataCleaned [Dataset]. https://huggingface.co/datasets/alexl83/AlpacaDataCleaned
    Dataset updated
    Apr 10, 2023
    Authors
    Alessandro Lannocca
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for Alpaca-Cleaned

    Repository: https://github.com/gururise/AlpacaDataCleaned

      Dataset Description
    

    This is a cleaned version of the original Alpaca Dataset released by Stanford. The following issues have been identified in the original release and fixed in this dataset:

    Hallucinations: Many instructions in the original dataset referenced data on the internet, which simply caused GPT-3 to hallucinate an answer.

    "instruction":"Summarize… See the full description on the dataset page: https://huggingface.co/datasets/alexl83/AlpacaDataCleaned.

  3. alpaca-cleaned-52k-th

    • huggingface.co
    Updated May 12, 2023
    Cite
    Thaweewat (2023). alpaca-cleaned-52k-th [Dataset]. https://huggingface.co/datasets/Thaweewat/alpaca-cleaned-52k-th
    Dataset updated
    May 12, 2023
    Authors
    Thaweewat
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Summary

    This is a Thai 🇹🇭 instruction dataset, translated with Google Cloud Translation from the cleaned version of the original Alpaca Dataset released by Stanford. It contains 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used for instruction-tuning language models so that they follow instructions better. The following issues have been identified in the original release and fixed in this… See the full description on the dataset page: https://huggingface.co/datasets/Thaweewat/alpaca-cleaned-52k-th.

  4. alpaca-data-pt-br

    • huggingface.co
    Updated Apr 3, 2023
    Cite
    alpaca-data-pt-br [Dataset]. https://huggingface.co/datasets/dominguesm/alpaca-data-pt-br
    Dataset updated
    Apr 3, 2023
    Authors
    Maicon Domingues
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    NOTE: This is a machine translated version of the yahma/alpaca-cleaned dataset.

      Dataset Card for Alpaca-Cleaned
    

    Repository: https://github.com/gururise/AlpacaDataCleaned

      Dataset Description
    

    This is a cleaned version of the original Alpaca Dataset released by Stanford. The following issues have been identified in the original release and fixed in this dataset:

    Hallucinations: Many instructions in the original dataset had instructions referencing data on the… See the full description on the dataset page: https://huggingface.co/datasets/dominguesm/alpaca-data-pt-br.

  5. alpaca_data_galician

    • huggingface.co
    Updated Nov 8, 2023
    Cite
    alpaca_data_galician [Dataset]. https://huggingface.co/datasets/irlab-udc/alpaca_data_galician
    Dataset updated
    Nov 8, 2023
    Dataset authored and provided by
    Information Retrieval Lab @ University of A Coruña
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Galician version of alpaca_data.json

    This is a version of the Stanford alpaca_data.json dataset translated into Galician with the Python package googletranslatepy. Our working notes are available here.

      Dataset Structure
    

    The dataset contains 52K instruction-following elements in a JSON file with a list of dictionaries. Each dictionary contains the following fields:

    instruction: str, describes the task the model should perform. Each of the 52K instructions is unique. input:… See the full description on the dataset page: https://huggingface.co/datasets/irlab-udc/alpaca_data_galician.
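The structure described above (a JSON list of dictionaries whose instruction field is a unique string) can be checked with a short script. This is only a sketch: the function name check_alpaca_records and the sample records are hypothetical, and only the instruction field is validated because the rest of the field list is truncated in this listing.

```python
import json


def check_alpaca_records(records):
    """Return a list of problems found in a list of Alpaca-style dictionaries."""
    problems = []
    seen = set()
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            problems.append(f"record {i}: not a dictionary")
            continue
        instruction = rec.get("instruction")
        if not isinstance(instruction, str):
            problems.append(f"record {i}: 'instruction' missing or not a str")
            continue
        # The description states that every instruction is unique.
        if instruction in seen:
            problems.append(f"record {i}: duplicate instruction")
        seen.add(instruction)
    return problems


# Made-up sample standing in for the real JSON file (assumption).
sample = json.loads('[{"instruction": "a"}, {"instruction": "a"}, {"text": "x"}]')
print(check_alpaca_records(sample))
```

For the real file, the same function could be applied to the result of json.load on the downloaded JSON; an empty list would mean no structural problems were found for the checked field.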

  6. mk-alpaca-cleaned

    • huggingface.co
    Updated Nov 24, 2024
    Cite
    Milan Velinovski (2024). mk-alpaca-cleaned [Dataset]. https://huggingface.co/datasets/milanvelinovski/mk-alpaca-cleaned
    Dataset updated
    Nov 24, 2024
    Authors
    Milan Velinovski
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Citation Information

    @misc{alpaca,
      author = {Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto},
      title = {Stanford Alpaca: An Instruction-following LLaMA model},
      year = {2023},
      publisher = {GitHub},
      journal = {GitHub repository},
      howpublished = {\url{https://github.com/tatsu-lab/stanford_alpaca}},
    }

  7. alpaca-tat

    • huggingface.co
    Updated Mar 26, 2025
    Cite
    alpaca-tat [Dataset]. https://huggingface.co/datasets/yasalma/alpaca-tat
    Dataset updated
    Mar 26, 2025
    Dataset authored and provided by
    Yasalma
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    TatAlpaca

    Dataset of Gemini-generated instructions in Tatar language.

    Code: tatlm/self_instruct. The code is based on Stanford Alpaca and self-instruct.

    Size: 166,257 examples.

    Prompt template: {{num_tasks}} җыелмасының составы тел моделен өйрәнү өчен төрле:

    1. Биремнәрне максималь рәвештә типлары, соралган гамәлләре, формулировкалары, керү мөмкинлекләре буенча бер-берсенә охшамаган итеп эшлә.
    2. Биремнәр рәсемнәр, видео, аудио белән эшли белмәгән һәм тышкы дөньяга керү мөмкинлеге булмаган… See the full description on the dataset page: https://huggingface.co/datasets/yasalma/alpaca-tat.
  8. alpaca-odia

    • huggingface.co
    Updated Mar 27, 2025
    Cite
    Suman Kumar Bhadra (2025). alpaca-odia [Dataset]. https://huggingface.co/datasets/sumankumarbhadra/alpaca-odia
    Dataset updated
    Mar 27, 2025
    Authors
    Suman Kumar Bhadra
    Description

    Alpaca-Odia Dataset

    This dataset contains 52,001 instruction-response pairs translated from the original Stanford Alpaca dataset into the Odia language using IndicTrans2. You can load the dataset as follows:

        from datasets import load_dataset

        # Load the dataset
        dataset = load_dataset("sumankumarbhadra/alpaca-odia")

      Translation Details
    

    Translation Model: IndicTrans2 (ai4bharat/indictrans2-indic-en-1B)
    Source Language Code: eng_Latn
    Target Language Code: ory_Orya… See the full description on the dataset page: https://huggingface.co/datasets/sumankumarbhadra/alpaca-odia.


52 scholarly articles cite the tatsu-lab/alpaca dataset (per Google Scholar).