46 datasets found
  1. CodeAlpaca-20k

    • huggingface.co
    + more versions
    Cite
    Sahil Chaudhary, CodeAlpaca-20k [Dataset]. https://huggingface.co/datasets/sahil2801/CodeAlpaca-20k
    Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    Sahil Chaudhary
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    sahil2801/CodeAlpaca-20k dataset hosted on Hugging Face and contributed by the HF Datasets community

  2. CodeAlpaca_20K

    • huggingface.co
    • opendatalab.com
    Updated Mar 29, 2023
    Cite
    Hugging Face H4 (2023). CodeAlpaca_20K [Dataset]. https://huggingface.co/datasets/HuggingFaceH4/CodeAlpaca_20K
    Explore at: Croissant
    Dataset updated
    Mar 29, 2023
    Dataset provided by
    Hugging Face, https://huggingface.co/
    Authors
    Hugging Face H4
    License

    https://choosealicense.com/licenses/cc/

    Description

    This dataset splits the original CodeAlpaca dataset into train and test splits.
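
    The listing does not say how the split was produced or what ratio was used. As a minimal sketch of the general idea, assuming a seeded shuffle and a hypothetical 10% test fraction, a list of instruction records could be divided like this:

```python
import random


def train_test_split(records, test_fraction=0.1, seed=0):
    """Shuffle a copy of `records` and return (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(records)
    # A dedicated Random instance keeps the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical toy records standing in for CodeAlpaca rows.
records = [
    {"instruction": f"task {i}", "input": "", "output": f"answer {i}"}
    for i in range(20)
]
train, test = train_test_split(records, test_fraction=0.1, seed=0)
```

    The function name, the fraction, and the seed are illustrative choices, not details taken from the dataset card.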

  3. evol-codealpaca-v1

    • huggingface.co
    • kaggle.com
    Updated Sep 12, 2023
    Cite
    theblackcat102 (2023). evol-codealpaca-v1 [Dataset]. https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1
    Explore at: Croissant
    Dataset updated
    Sep 12, 2023
    Authors
    theblackcat102
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Evolved codealpaca

    Updates:

    2023/08/26 - Filtered results now contain only pure English instructions, and any response mentioning that it was trained by OAI has been removed

    Median sequence length: 471

    We employed a methodology similar to that of WizardCoder, with the exception that ours is open-source. We used the gpt-4-0314 and gpt-4-0613 models to augment and answer each response, with the bulk of generation handled by gpt-4-0314. The aim of this dataset is twofold: firstly, to facilitate the… See the full description on the dataset page: https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1.

  4. code-alpaca-20k

    • huggingface.co
    Updated Mar 31, 2023
    Cite
    Flower Labs (2023). code-alpaca-20k [Dataset]. https://huggingface.co/datasets/flwrlabs/code-alpaca-20k
    Explore at: Croissant
    Dataset updated
    Mar 31, 2023
    Dataset provided by
    Flower Labs GmbH
    Authors
    Flower Labs
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for CodeAlpaca 20K

    This dataset originates from the Code Alpaca repository. The CodeAlpaca 20K dataset is specifically used for training code generation models.

    Dataset Details

    Dataset Description

    Each sample comprises three columns: instruction, input and output.

    Language(s): English
    License: Apache-2.0
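
    To illustrate how the three columns are typically combined for instruction tuning, here is a minimal sketch. The Alpaca-style template, the function name `build_prompt`, and the sample record are assumptions for illustration and are not taken from the dataset card.

```python
def build_prompt(sample):
    """Render one record as an Alpaca-style prompt (template is an assumption)."""
    instruction = sample["instruction"].strip()
    context = sample.get("input", "").strip()
    if context:
        # Records with a non-empty input get an extra "Input" section.
        return (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{context}\n\n"
            "### Response:\n"
        )
    return f"### Instruction:\n{instruction}\n\n### Response:\n"


# Hypothetical record using the three documented column names.
sample = {
    "instruction": "Sort the list in ascending order.",
    "input": "[3, 1, 2]",
    "output": "sorted([3, 1, 2])",
}
prompt = build_prompt(sample)
# During training, the target text would be sample["output"].
```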

    Dataset Sources

    The code from the original repository was adapted to post the dataset here.

    Repository:… See the full description on the dataset page: https://huggingface.co/datasets/flwrlabs/code-alpaca-20k.

  5. CodeAlpaca

    • huggingface.co
    + more versions
    Cite
    Ansh Gupta, CodeAlpaca [Dataset]. https://huggingface.co/datasets/thisisanshgupta/CodeAlpaca
    Explore at: Croissant
    Authors
    Ansh Gupta
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    thisisanshgupta/CodeAlpaca dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. CodeAlpaca-20k-CodePlusExplanation

    • huggingface.co
    Updated May 28, 2025
    Cite
    Gedik (2025). CodeAlpaca-20k-CodePlusExplanation [Dataset]. https://huggingface.co/datasets/ByGedik/CodeAlpaca-20k-CodePlusExplanation
    Dataset updated
    May 28, 2025
    Authors
    Gedik
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Code Alpaca 20K – Code + Explanation

    🧠 A dataset designed to enhance large language models (LLMs) with code generation and instructional explanation capabilities. This version is an extension of the original sahil2801/CodeAlpaca-20k, with AI-generated explanations added to the output section using the Gemini API.

    📘 Overview

    This dataset enhances the original CodeAlpaca-20k examples by adding natural language explanations to code outputs. The goal is not just to… See the full description on the dataset page: https://huggingface.co/datasets/ByGedik/CodeAlpaca-20k-CodePlusExplanation.

  7. evol-codealpaca-v1-dpo

    • huggingface.co
    Updated Jun 11, 2024
    + more versions
    Cite
    Aleksey Korshuk (2024). evol-codealpaca-v1-dpo [Dataset]. https://huggingface.co/datasets/AlekseyKorshuk/evol-codealpaca-v1-dpo
    Explore at: Croissant
    Dataset updated
    Jun 11, 2024
    Authors
    Aleksey Korshuk
    Description

    AlekseyKorshuk/evol-codealpaca-v1-dpo dataset hosted on Hugging Face and contributed by the HF Datasets community

  8. codealpaca-filtered

    • huggingface.co
    Updated Nov 20, 2023
    Cite
    José Antonio Hernández López (2023). codealpaca-filtered [Dataset]. https://huggingface.co/datasets/antolin/codealpaca-filtered
    Explore at: Croissant
    Dataset updated
    Nov 20, 2023
    Authors
    José Antonio Hernández López
    Description

    Dataset Card for "codealpaca-filtered"

    More Information needed

  9. Data from: Incentivizing Inclusive Data Contributions in Personalized...

    • figshare.com
    • springernature.figshare.com
    zip
    Updated Jul 29, 2025
    Cite
    Enpei Zhang; Jingyi Chai; Rui Ye; Siheng Chen; Yanfeng Wang (2025). Incentivizing Inclusive Data Contributions in Personalized Federated Learning [Dataset]. http://doi.org/10.6084/m9.figshare.29669246.v1
    Available download formats: zip
    Dataset updated
    Jul 29, 2025
    Dataset provided by
    Figshare, http://figshare.com/
    Authors
    Enpei Zhang; Jingyi Chai; Rui Ye; Siheng Chen; Yanfeng Wang
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The preprocessed datasets used in our experiments are provided in the data folder. For the image or character classification tasks, we use five classical datasets: CIFAR-10, Fashion-MNIST, PACS, FEMNIST, and Shakespeare. For instruction-tuning tasks, we consider the mixed-finance and code+finance scenarios, involving three financial datasets (TFNS, FiQA, NWGI) and a code dataset (CodeAlpaca). CIFAR-10 and Fashion-MNIST are widely used image classification benchmarks, each containing 10 categories. PACS has four domains (photo, art painting, cartoon, and sketch) and contains seven categories. FEMNIST (image classification) and Shakespeare (next-character prediction) are from the naturally heterogeneous synthetic dataset Leaf. The three finance datasets are FiQA, comprising 17k sentences sourced from microblog headlines and financial news; Twitter Financial News Sentiment (TFNS), with 11,932 annotated documents of finance-related tweets; and News With GPT Instruction (NWGI), featuring labels generated by ChatGPT. The code dataset CodeAlpaca contains 20K instruction-following examples. Note that all raw data resources can be found in the "Data availability" section of our paper.

  10. CodeAlpaca-20K-Python

    • huggingface.co
    Updated Apr 26, 2024
    Cite
    graycat (2024). CodeAlpaca-20K-Python [Dataset]. https://huggingface.co/datasets/graycatHCO3/CodeAlpaca-20K-Python
    Explore at: Croissant
    Dataset updated
    Apr 26, 2024
    Authors
    graycat
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    graycatHCO3/CodeAlpaca-20K-Python dataset hosted on Hugging Face and contributed by the HF Datasets community

  11. fleece2instructions-codealpaca

    • huggingface.co
    Updated May 14, 2023
    Cite
    Peter Szemraj (2023). fleece2instructions-codealpaca [Dataset]. https://huggingface.co/datasets/pszemraj/fleece2instructions-codealpaca
    Explore at: Croissant
    Dataset updated
    May 14, 2023
    Authors
    Peter Szemraj
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    codealpaca for text2text generation

    This dataset was downloaded from the sahil280114/codealpaca github repo and parsed into text2text format for "generating" instructions. It was downloaded under the wonderful Creative Commons Attribution-NonCommercial 4.0 International Public License (see snapshots of the repo and data license), so that license applies to this dataset. Note that the inputs and instruction columns in the original dataset have been aggregated together for text2text… See the full description on the dataset page: https://huggingface.co/datasets/pszemraj/fleece2instructions-codealpaca.
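
    The description is truncated, so the exact output format is not known. As a rough sketch only, assuming the response text is the source and the aggregated instruction plus input is the generation target, the conversion might look like this. The column names `text` and `target` and the newline separator are hypothetical.

```python
def to_text2text(record):
    """Map one CodeAlpaca-style record to a text2text pair (a sketch)."""
    parts = [record["instruction"].strip(), record.get("input", "").strip()]
    # Aggregate the instruction and input columns, skipping an empty input.
    target = "\n".join(part for part in parts if part)
    # Assumption: the output is the source text, and the aggregated
    # instruction is what a text2text model would learn to generate.
    return {"text": record["output"].strip(), "target": target}


# Hypothetical record for illustration.
example = {
    "instruction": "Reverse the given string.",
    "input": "hello",
    "output": "print('hello'[::-1])",
}
pair = to_text2text(example)
```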

  12. codealpaca-stanford

    • huggingface.co
    Updated Mar 27, 2024
    Cite
    Sharper Mind AI (2024). codealpaca-stanford [Dataset]. https://huggingface.co/datasets/shapermindai/codealpaca-stanford
    Explore at: Croissant
    Dataset updated
    Mar 27, 2024
    Authors
    Sharper Mind AI
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    shapermindai/codealpaca-stanford dataset hosted on Hugging Face and contributed by the HF Datasets community

  13. 53M Token Instruction, Code & QA Dataset

    • kaggle.com
    zip
    Updated Jul 16, 2025
    Cite
    GODELEV (2025). 53M Token Instruction, Code & QA Dataset [Dataset]. https://www.kaggle.com/datasets/godelev/53m-token-instruction-code-and-qa-dataset/versions/1
    Available download formats: zip (133709957 bytes)
    Dataset updated
    Jul 16, 2025
    Authors
    GODELEV
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Brahma Water is a compact, high-quality pretraining dataset containing over 53 million tokens across 365k+ examples. Designed for small to mid-scale language models (10–100M parameters), it balances instruction-tuned tasks, logic and math reasoning, multilingual samples, dialogue, and code.

    Key Features:

    • 📘 158K+ instruction samples from Alpaca, Dolly, CodeAlpaca, etc.

    • 🧠 Logic & math reasoning tasks (GSM8k, COSMOS QA, SciQ, OpenbookQA)

    • 💬 Conversational dialogue from open-source datasets

    • 💻 Code examples in Python from MBPP, CodeSearchNet

    • 🌍 Multilingual data (Hindi, Indian languages, XNLI)

    It’s ideal for:

    • Training efficient LLMs from scratch

    • Instruction-tuning compact models

    • Proving new architectures (e.g., symbolic, non-transformer)

  14. CodeAlpaca-20k-no-input

    • huggingface.co
    Cite
    Hakob Petrosyan, CodeAlpaca-20k-no-input [Dataset]. https://huggingface.co/datasets/jacpetro/CodeAlpaca-20k-no-input
    Explore at: Croissant
    Authors
    Hakob Petrosyan
    Description

    jacpetro/CodeAlpaca-20k-no-input dataset hosted on Hugging Face and contributed by the HF Datasets community

  15. CodeAlpaca-20k_standardized

    • huggingface.co
    Updated Mar 31, 2023
    + more versions
    Cite
    HydraLM (2023). CodeAlpaca-20k_standardized [Dataset]. https://huggingface.co/datasets/HydraLM/CodeAlpaca-20k_standardized
    Explore at: Croissant
    Dataset updated
    Mar 31, 2023
    Dataset authored and provided by
    HydraLM
    Description

    Dataset Card for "CodeAlpaca-20k_standardized"

    More Information needed

  16. OpenHermes

    • kaggle.com
    • huggingface.co
    zip
    Updated Dec 17, 2023
    Cite
    Volodymyr Pivoshenko 🇺🇦 (2023). OpenHermes [Dataset]. https://www.kaggle.com/datasets/volodymyrpivoshenko/openhermes
    Available download formats: zip (207638532 bytes)
    Dataset updated
    Dec 17, 2023
    Authors
    Volodymyr Pivoshenko 🇺🇦
    Description

    OpenHermes was trained on 242,000 entries of primarily GPT-4 generated data, from open datasets across the AI landscape, including:

    • GPTeacher - General Instruct, Roleplay v1, Roleplay v2, and Code Instruct Datasets, by Teknium

    • WizardLM (v1, evol_instruct 70k), by WizardLM Team/nlpxucan

    • Airoboros GPT-4 (v1.0), by JonDurbin

    • Camel-AI's domain expert datasets, by the Camel-AI Team

    • CodeAlpaca, by Sahil2801

    • GPT4-LLM and Unnatural Instructions, by Microsoft

    Filtering included the removal of OpenAI refusals, disclaimers, "As an AI" type examples, and more.
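
    The actual filtering rules are not given here. A minimal sketch of this kind of filter, assuming simple case-insensitive substring matching against a hypothetical list of marker phrases, could look like this:

```python
# Hypothetical marker phrases; the real filter list is not documented here.
REFUSAL_MARKERS = ("as an ai", "i'm sorry, but", "i cannot fulfill")


def is_clean(response, markers=REFUSAL_MARKERS):
    """Return True if the response contains none of the marker phrases."""
    lowered = response.lower()
    return not any(marker in lowered for marker in markers)


def filter_examples(examples):
    """Keep only the examples whose output passes the marker check."""
    return [ex for ex in examples if is_clean(ex["output"])]


# Hypothetical toy examples for illustration.
examples = [
    {"instruction": "Add 2 and 3.", "output": "2 + 3 = 5"},
    {"instruction": "Tell a secret.", "output": "As an AI language model, I cannot do that."},
]
kept = filter_examples(examples)
```

    Substring matching is crude and can drop legitimate examples that merely quote one of the phrases, so a production filter would likely need review of what it removes.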

    The base dataset mix is identical to the original Nous-Hermes', minus the Nous-Instruct and PDACTL datasets which were private datasets.

    References 1. https://huggingface.co/datasets/teknium/openhermes

  17. evol-codealpaca-decontaminated

    • huggingface.co
    Updated Nov 23, 2023
    Cite
    Alpay Ariyak (2023). evol-codealpaca-decontaminated [Dataset]. https://huggingface.co/datasets/alpayariyak/evol-codealpaca-decontaminated
    Dataset updated
    Nov 23, 2023
    Authors
    Alpay Ariyak
    Description

    Dataset Card for "evol-codealpaca-decontaminated"

    More Information needed

  18. CodeAlpaca-1k-revised

    • huggingface.co
    Updated Nov 18, 2024
    + more versions
    Cite
    Prateek Gupta (2024). CodeAlpaca-1k-revised [Dataset]. https://huggingface.co/datasets/Prateek-Gupta123/CodeAlpaca-1k-revised
    Explore at: Croissant
    Dataset updated
    Nov 18, 2024
    Authors
    Prateek Gupta
    Description

    Prateek-Gupta123/CodeAlpaca-1k-revised dataset hosted on Hugging Face and contributed by the HF Datasets community

  19. CodeAlpaca-lf-processed

    • huggingface.co
    Updated Jun 12, 2025
    + more versions
    Cite
    Junxia Cui (2025). CodeAlpaca-lf-processed [Dataset]. https://huggingface.co/datasets/autoprogrammer/CodeAlpaca-lf-processed
    Dataset updated
    Jun 12, 2025
    Authors
    Junxia Cui
    Description

    autoprogrammer/CodeAlpaca-lf-processed dataset hosted on Hugging Face and contributed by the HF Datasets community

  20. CodeAlpaca-20k-finetuning-format

    • huggingface.co
    Cite
    Rohan Awhad, CodeAlpaca-20k-finetuning-format [Dataset]. https://huggingface.co/datasets/rohanawhad/CodeAlpaca-20k-finetuning-format
    Explore at: Croissant
    Authors
    Rohan Awhad
    Description

    rohanawhad/CodeAlpaca-20k-finetuning-format dataset hosted on Hugging Face and contributed by the HF Datasets community

Cite
Sahil Chaudhary, CodeAlpaca-20k [Dataset]. https://huggingface.co/datasets/sahil2801/CodeAlpaca-20k

CodeAlpaca-20k

CodeAlpaca 20K

sahil2801/CodeAlpaca-20k

52 scholarly articles cite this dataset (View in Google Scholar)
Explore at: Croissant
Authors
Sahil Chaudhary
License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

sahil2801/CodeAlpaca-20k dataset hosted on Hugging Face and contributed by the HF Datasets community
