16 datasets found
  1. CodeAlpaca_20K

    • huggingface.co
    • opendatalab.com
    Updated Mar 29, 2023
    Cite
    CodeAlpaca_20K [Dataset]. https://huggingface.co/datasets/HuggingFaceH4/CodeAlpaca_20K
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 29, 2023
    Dataset provided by
    Hugging Face (https://huggingface.co/)
    Authors
    Hugging Face H4
    License

    https://choosealicense.com/licenses/cc/

    Description

    This dataset splits the original CodeAlpaca dataset into train and test splits.
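The description does not say how the split was produced. As a minimal sketch of one common approach, a seeded shuffle followed by a cut, consider the following; the function name, the 10% test fraction, and the seed are assumptions chosen for illustration, not details taken from the dataset card.

```python
import random


def train_test_split(records, test_fraction=0.1, seed=0):
    """Shuffle a copy of `records` and cut it into train and test lists.

    test_fraction and seed are illustrative defaults, not values documented
    for CodeAlpaca_20K.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(records)               # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # private RNG, so the split is reproducible
    n_test = max(1, round(len(shuffled) * test_fraction))
    return {"train": shuffled[n_test:], "test": shuffled[:n_test]}


# Toy stand-in records; these are not real dataset rows.
records = [
    {"instruction": f"task {i}", "input": "", "output": f"answer {i}"}
    for i in range(20)
]
splits = train_test_split(records, test_fraction=0.1, seed=0)
```

Fixing the seed keeps the test set stable across runs, which matters when results from different fine-tuning runs are compared.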

  2. code-alpaca-20k

    • huggingface.co
    Updated Mar 31, 2023
    Cite
    code-alpaca-20k [Dataset]. https://huggingface.co/datasets/flwrlabs/code-alpaca-20k
    Dataset updated
    Mar 31, 2023
    Dataset provided by
    Flower Labs GmbH
    Authors
    Flower Labs
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for Dataset Name

    This dataset originates from the Code Alpaca repository. The CodeAlpaca 20K dataset is specifically used for training code generation models.

    Dataset Details

    Dataset Description

    Each sample comprises three columns: instruction, input, and output.

    Language(s): English
    License: Apache-2.0
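The three columns are usually flattened into a single prompt before fine-tuning. The sketch below uses the prompt wording popularized by the Stanford Alpaca project; the dataset card shown here does not prescribe any template, so treat the template choice, the `build_example` helper, and the `prompt`/`completion` keys as assumptions.

```python
# Prompt wording follows the Stanford Alpaca convention (an assumption here;
# this dataset card does not specify a template).
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def build_example(sample):
    """Turn one instruction/input/output row into a prompt/completion pair."""
    extra_input = (sample.get("input") or "").strip()
    # Rows with an empty input column use the shorter template.
    template = PROMPT_WITH_INPUT if extra_input else PROMPT_NO_INPUT
    prompt = template.format(instruction=sample["instruction"], input=extra_input)
    return {"prompt": prompt, "completion": sample["output"]}


# Hypothetical row, written for illustration only.
example = build_example(
    {
        "instruction": "Reverse the given string.",
        "input": "abc",
        "output": "cba",
    }
)
```

Keeping the completion separate from the prompt makes it straightforward to mask the prompt tokens out of the training loss if that is desired.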

    Dataset Sources

    The code from the original repository was adapted to post the dataset here.

    Repository:… See the full description on the dataset page: https://huggingface.co/datasets/flwrlabs/code-alpaca-20k.

  3. CodeAlpaca-AddLanguage

    • huggingface.co
    + more versions
    Cite
    Luong Nguyen Dinh, CodeAlpaca-AddLanguage [Dataset]. https://huggingface.co/datasets/dinhlnd1610/CodeAlpaca-AddLanguage
    Authors
    Luong Nguyen Dinh
    Description

    dinhlnd1610/CodeAlpaca-AddLanguage dataset hosted on Hugging Face and contributed by the HF Datasets community

  4. test

    • huggingface.co
    Updated Dec 30, 2024
    + more versions
    Cite
    Ding (2024). test [Dataset]. https://huggingface.co/datasets/Ding0702/test
    Dataset updated
    Dec 30, 2024
    Authors
    Ding
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Ding0702/test dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. evol-codealpaca-pairwise-sharegpt

    • huggingface.co
    Updated Jan 26, 2024
    + more versions
    Cite
    Aleksey Korshuk (2024). evol-codealpaca-pairwise-sharegpt [Dataset]. https://huggingface.co/datasets/AlekseyKorshuk/evol-codealpaca-pairwise-sharegpt
    Dataset updated
    Jan 26, 2024
    Authors
    Aleksey Korshuk
    Description

    AlekseyKorshuk/evol-codealpaca-pairwise-sharegpt dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. Alpaca Pullover Import Data & Buyers List in USA

    • seair.co.in
    Updated Mar 14, 2024
    Cite
    Seair Exim (2024). Alpaca Pullover Import Data & Buyers List in USA [Dataset]. https://www.seair.co.in
    Available download formats: .bin, .xml, .csv, .xls
    Dataset updated
    Mar 14, 2024
    Dataset provided by
    Seair Info Solutions PVT LTD
    Authors
    Seair Exim
    Area covered
    United States
    Description

    Subscribers can look up export and import data for 23 countries by HS code or product name. This demo is helpful for market analysis.

  7. code_evaluation_prompts

    • huggingface.co
    Updated May 3, 2023
    Cite
    Hugging Face H4 (2023). code_evaluation_prompts [Dataset]. https://huggingface.co/datasets/HuggingFaceH4/code_evaluation_prompts
    Dataset updated
    May 3, 2023
    Dataset provided by
    Hugging Face (https://huggingface.co/)
    Authors
    Hugging Face H4
    Description

    Dataset Card for H4 Code Evaluation Prompts

    These are a filtered set of prompts for evaluating code instruction models. It will contain a variety of languages and task types. Currently, we used ChatGPT (GPT-3.5-turbo) to generate these, so we encourage using them only for qualitative evaluation and not to train your models. The generation of this data is similar to something like CodeAlpaca, which you can download here, but we intend to make these tasks both a) more challenging… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/code_evaluation_prompts.

  8. code-alpaca-eval-debug

    • huggingface.co
    Updated Mar 27, 2024
    + more versions
    Cite
    Aleksey Korshuk (2024). code-alpaca-eval-debug [Dataset]. https://huggingface.co/datasets/AlekseyKorshuk/code-alpaca-eval-debug
    Dataset updated
    Mar 27, 2024
    Authors
    Aleksey Korshuk
    Description

    AlekseyKorshuk/code-alpaca-eval-debug dataset hosted on Hugging Face and contributed by the HF Datasets community

  9. Evol-Instruct-Code-80k-v1

    • huggingface.co
    • opendatalab.com
    Updated Jul 22, 2023
    Cite
    Nick Roshdieh (2023). Evol-Instruct-Code-80k-v1 [Dataset]. https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1
    Dataset updated
    Jul 22, 2023
    Authors
    Nick Roshdieh
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Open-source implementation of Evol-Instruct-Code as described in the WizardCoder paper. Code for the instruction generation can be found on GitHub as Evol-Teacher.

  10. python-code-instructions-18k-alpaca-standardized

    • huggingface.co
    Updated Sep 2, 2023
    + more versions
    Cite
    HydraLM (2023). python-code-instructions-18k-alpaca-standardized [Dataset]. https://huggingface.co/datasets/HydraLM/python-code-instructions-18k-alpaca-standardized
    Dataset updated
    Sep 2, 2023
    Dataset authored and provided by
    HydraLM
    Description

    Dataset Card for "python-code-instructions-18k-alpaca-standardized"

    More Information needed

  11. moose-dataset

    • huggingface.co
    Cite
    Naman Bansal, moose-dataset [Dataset]. https://huggingface.co/datasets/namanbnsl/moose-dataset
    Authors
    Naman Bansal
    Description

    Moose dataset 🫎

    This is a combination of 7 datasets, namely:

    • Alpaca - Instruction following
    • CodeAlpaca - Programming
    • Dolly - Instruction following
    • Tigerbot GSM - Math
    • Tiger StackExchange - Chat
    • Glaive Code - Programming/Computer Questions
    • MetaMath QA - Math

    Note: No changes were made to the content of the above datasets. The only changes were to the column names. Input columns were added for some datasets.
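The kind of column harmonization described in the note can be sketched as follows. The target column names (instruction, input, output), the helper name, and the source column names in the example are assumptions for illustration; the dataset card excerpt does not list the actual mappings.

```python
def to_alpaca_columns(row, column_map):
    """Rename columns per column_map and add an empty "input" if it is missing.

    The row content itself is left unchanged; only the keys differ.
    """
    renamed = {column_map.get(key, key): value for key, value in row.items()}
    renamed.setdefault("input", "")  # add the input column when the source has none
    # Keep only the three assumed target columns, in a fixed order.
    return {name: renamed[name] for name in ("instruction", "input", "output")}


# Hypothetical source schema with "question"/"answer" columns.
row = {"question": "What is 2 + 3?", "answer": "5"}
mapped = to_alpaca_columns(row, {"question": "instruction", "answer": "output"})
```

Applying one such mapping per source dataset yields rows that can be concatenated into a single table without touching the underlying text.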

      Uses 🪴
    

    This dataset was made to… See the full description on the dataset page: https://huggingface.co/datasets/namanbnsl/moose-dataset.

  12. Labyrinth

    • huggingface.co
    Updated Nov 7, 2023
    Cite
    Pavan Narasimha Karthik (2023). Labyrinth [Dataset]. https://huggingface.co/datasets/pnkvalavala/Labyrinth
    Dataset updated
    Nov 7, 2023
    Authors
    Pavan Narasimha Karthik
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Labyrinth Dataset

    Labyrinth is a code dataset that combines three existing datasets without modifying the data itself but adapting the structure/format to streamline fine-tuning for Zephyr on code.

      Dataset Sources
    

    Labyrinth is composed of code examples and instructions from the following three datasets:

    • CodeAlpaca by Sahil Chaudhary
    • Codegen-instruct by Teknium
    • llama-2-instruct-121k-code by Davut Emre TASAR

  13. Data from: DS2

    • huggingface.co
    Updated Aug 18, 2024
    + more versions
    Cite
    DS2 [Dataset]. https://huggingface.co/datasets/timkoehne/DS2
    Dataset updated
    Aug 18, 2024
    Authors
    Tim Köhne
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset was created as part of my bachelor's thesis, where I fine-tuned the llama3.1:8B language model for generating ABAP code using Unsloth 4-bit QLoRA. The data is based on 1000 random samples of CodeAlpaca translated to ABAP using llama3.1:8B. I don't recommend using this dataset; it resulted in a pretty bad model.

  14. alpaca

    • huggingface.co
    • opendatalab.com
    Updated Mar 14, 2023
    + more versions
    Cite
    Tatsu Lab (2023). alpaca [Dataset]. https://huggingface.co/datasets/tatsu-lab/alpaca
    Dataset updated
    Mar 14, 2023
    Dataset authored and provided by
    Tatsu Lab
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Dataset Card for Alpaca

      Dataset Summary
    

    Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to instruction-tune language models so that they follow instructions better. The authors built on the data generation pipeline from the Self-Instruct framework and made the following modifications:

    The text-davinci-003 engine to generate the instruction data instead… See the full description on the dataset page: https://huggingface.co/datasets/tatsu-lab/alpaca.

  15. alpaca-gpt4

    • huggingface.co
    • opendatalab.com
    Updated Apr 14, 2023
    + more versions
    Cite
    alpaca-gpt4 [Dataset]. https://huggingface.co/datasets/vicgalle/alpaca-gpt4
    Dataset updated
    Apr 14, 2023
    Authors
    Victor Gallego
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Dataset Card for "alpaca-gpt4"

    This dataset contains English instruction-following data generated by GPT-4 using Alpaca prompts for fine-tuning LLMs. The dataset was originally shared in this repository: https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM. This is just a wrapper for compatibility with Hugging Face's datasets library.

      Dataset structure
    

    It contains 52K instruction-following data generated by GPT-4 using the same prompts as in Alpaca. The dataset has… See the full description on the dataset page: https://huggingface.co/datasets/vicgalle/alpaca-gpt4.

  16. tulu-2-unfiltered

    • huggingface.co
    Updated Mar 17, 2025
    Cite
    Hamish Ivison (2025). tulu-2-unfiltered [Dataset]. https://huggingface.co/datasets/hamishivi/tulu-2-unfiltered
    Dataset updated
    Mar 17, 2025
    Authors
    Hamish Ivison
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    Tulu 2 Unfiltered

    This is an 'unfiltered' version of the Tulu v2 SFT mixture, created by collating the original Tulu 2 sources and avoiding downsampling.

      Details
    

    The dataset consists of a mix of:

    • FLAN (Apache 2.0, we only sample 961,322 samples along with 398,439 CoT samples from the full set for this data pool)
    • Open Assistant 1 (Apache 2.0)
    • ShareGPT (Apache 2.0 listed, no official repo found)
    • GPT4-Alpaca (CC By NC 4.0)
    • Code-Alpaca (CC By NC 4.0)
    • LIMA (CC BY-NC-SA)…

    See the full description on the dataset page: https://huggingface.co/datasets/hamishivi/tulu-2-unfiltered.


CodeAlpaca_20K

HuggingFaceH4/CodeAlpaca_20K

4 scholarly articles cite this dataset (View in Google Scholar)