Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
sahil2801/CodeAlpaca-20k dataset hosted on Hugging Face and contributed by the HF Datasets community
License: https://choosealicense.com/licenses/cc/
This dataset splits the original CodeAlpaca dataset into train and test splits.
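As a rough sketch of how a comparable split could be produced from the original data with the `datasets` library (the 10% test fraction and seed below are assumptions, not the values used by this dataset):

```python
from datasets import load_dataset

# Load the original CodeAlpaca-20k data from the Hugging Face Hub
ds = load_dataset("sahil2801/CodeAlpaca-20k", split="train")

# Split into train/test; test_size and seed here are illustrative assumptions
splits = ds.train_test_split(test_size=0.1, seed=42)
print(splits["train"].num_rows, splits["test"].num_rows)
```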
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Evolved codealpaca
Updates:
2023/08/26 - Filtered results now contain only pure English instructions, and any responses mentioning being trained by OpenAI have been removed.
Median sequence length: 471. We employed a methodology similar to that of WizardCoder, with the exception that ours is open-source. We used the gpt-4-0314 and gpt-4-0613 models to augment and answer each response, with the bulk of generation handled by gpt-4-0314. The aim of this dataset is twofold: firstly, to facilitate the… See the full description on the dataset page: https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1.
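For illustration only, one augmentation call with the OpenAI SDK might look like the sketch below; the prompt wording is invented here, and only the model names come from the description above:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def evolve_instruction(instruction: str) -> str:
    """Ask GPT-4 to rewrite an instruction into a more complex variant (illustrative prompt)."""
    response = client.chat.completions.create(
        model="gpt-4-0314",  # the description says the bulk of generation used this model
        messages=[{
            "role": "user",
            "content": (
                "Rewrite the following programming instruction so that it is more "
                f"challenging, without changing its topic:\n\n{instruction}"
            ),
        }],
    )
    return response.choices[0].message.content
```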
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for CodeAlpaca 20K
This dataset originates from the Code Alpaca repository. The CodeAlpaca 20K dataset is specifically used for training code generation models.
Dataset Details
Dataset Description
Each sample comprises three columns: instruction, input, and output.
Language(s): English
License: Apache-2.0
Dataset Sources
The code from the original repository was adapted for posting here.
Repository:… See the full description on the dataset page: https://huggingface.co/datasets/flwrlabs/code-alpaca-20k.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
thisisanshgupta/CodeAlpaca dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Code Alpaca 20K – Code + Explanation
🧠 A dataset designed to enhance large language models (LLMs) with code generation and instructional explanation capabilities. This version is an extension of the original sahil2801/CodeAlpaca-20k, with AI-generated explanations added to the output section using the Gemini API.
📘 Overview
This dataset enhances the original CodeAlpaca-20k examples by adding natural language explanations to code outputs. The goal is not just to… See the full description on the dataset page: https://huggingface.co/datasets/ByGedik/CodeAlpaca-20k-CodePlusExplanation.
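A minimal sketch of how such explanations could be appended with the Gemini API; the model name, prompt wording, and output layout are assumptions, as the card does not document the exact pipeline:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")  # placeholder, not a real key
model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model; the card only says "Gemini API"

def add_explanation(example: dict) -> dict:
    """Append an AI-generated, step-by-step explanation to the code in `output`."""
    prompt = f"Explain what the following code does, step by step:\n\n{example['output']}"
    explanation = model.generate_content(prompt).text
    example["output"] = f"{example['output']}\n\nExplanation:\n{explanation}"
    return example
```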
AlekseyKorshuk/evol-codealpaca-v1-dpo dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for "codealpaca-filtered"
More Information needed
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The preprocessed datasets used in our experiments are provided in the data folder. For the image or character classification task, we use 5 classical datasets: CIFAR-10, Fashion-MNIST, PACS, FEMNIST, and Shakespeare. We consider the mixed-finance and code+finance scenarios for instruction-tuning tasks, involving 3 financial datasets (TFNS, FIQA, NWGI) and a code dataset (CodeAlpaca). CIFAR-10 and Fashion-MNIST are widely used benchmarks in the literature for image classification tasks containing 10 categories. PACS has four domains (photo, art painting, cartoon, and sketch) and contains seven categories. FEMNIST for image classification and Shakespeare for next-character prediction come from the naturally heterogeneous synthetic benchmark LEAF. The three finance datasets are: FiQA, comprising 17k sentences sourced from microblog headlines and financial news; the Twitter Financial News Sentiment (TFNS) dataset, with 11,932 annotated finance-related tweets; and the News With GPT Instruction (NWGI) dataset, featuring labels generated by ChatGPT. The code dataset CodeAlpaca contains 20K instruction-following examples. Note that all raw data resources can be found in the "Data availability" section of our paper.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
graycatHCO3/CodeAlpaca-20K-Python dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
codealpaca for text2text generation
This dataset was downloaded from the sahil280114/codealpaca github repo and parsed into text2text format for "generating" instructions. It was downloaded under the wonderful Creative Commons Attribution-NonCommercial 4.0 International Public License (see snapshots of the repo and data license), so that license applies to this dataset. Note that the inputs and instruction columns in the original dataset have been aggregated together for text2text… See the full description on the dataset page: https://huggingface.co/datasets/pszemraj/fleece2instructions-codealpaca.
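A minimal sketch of the kind of aggregation described above, assuming the instruction and input columns are concatenated to form the generation target while the code output serves as the source text (the column names and exact mapping used by the dataset author are assumptions here):

```python
from datasets import load_dataset

# Load the CodeAlpaca data from its Hugging Face mirror (used here for illustration)
ds = load_dataset("sahil2801/CodeAlpaca-20k", split="train")

def to_text2text(example: dict) -> dict:
    # Aggregate instruction + optional input into a single string, as described above;
    # "text" (source) and "target" are illustrative column names, not the dataset's own
    prompt = example["instruction"]
    if example.get("input"):
        prompt = f"{prompt}\n\n{example['input']}"
    return {"text": example["output"], "target": prompt}

text2text = ds.map(to_text2text, remove_columns=ds.column_names)
```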
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
shapermindai/codealpaca-stanford dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Brahma Water is a compact, high-quality pretraining dataset containing over 53 million tokens across 365k+ examples. Designed for small to mid-scale language models (10–100M parameters), it balances instruction-tuned tasks, logic and math reasoning, multilingual samples, dialogue, and code.
Key Features:
- 📘 158K+ instruction samples from Alpaca, Dolly, CodeAlpaca, etc.
- 🧠 Logic & math reasoning tasks (GSM8k, COSMOS QA, SciQ, OpenbookQA)
- 💬 Conversational dialogue from open-source datasets
- 💻 Code examples in Python from MBPP, CodeSearchNet
- 🌍 Multilingual data (Hindi, Indian languages, XNLI)

It's ideal for:
- Training efficient LLMs from scratch
- Instruction-tuning compact models
- Proving new architectures (e.g., symbolic, non-transformer)
jacpetro/CodeAlpaca-20k-no-input dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for "CodeAlpaca-20k_standardized"
More Information needed
OpenHermes was trained on 242,000 entries of primarily GPT-4 generated data, from open datasets across the AI landscape, including:
- GPTeacher - General Instruct, Roleplay v1, Roleplay v2, and Code Instruct Datasets, by Teknium
- WizardLM (v1, evol_instruct 70k), by WizardLM Team/nlpxucan
- Airoboros GPT-4 (v1.0), by JonDurbin
- Camel-AI's domain expert datasets, by the Camel-AI Team
- CodeAlpaca, by Sahil2801
- GPT4-LLM and Unnatural Instructions, by Microsoft
Filtering included the removal of OpenAI refusals, disclaimers, "As an AI"-type examples, and more.
The base dataset mix is identical to the original Nous-Hermes', minus the Nous-Instruct and PDACTL datasets, which were private.
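A minimal sketch of the kind of filtering described above, using a hand-picked phrase list and an assumed response column name (the actual OpenHermes filter rules are not published in this entry):

```python
from datasets import load_dataset

# Illustrative markers only; the real OpenHermes filter list is not reproduced here
REFUSAL_MARKERS = ["as an ai", "i'm sorry, but", "i cannot", "openai"]

def keep_example(example: dict) -> bool:
    """Drop entries whose response contains refusal or disclaimer boilerplate."""
    text = example.get("output", "").lower()  # column name assumed
    return not any(marker in text for marker in REFUSAL_MARKERS)

ds = load_dataset("teknium/openhermes", split="train")  # dataset referenced below
filtered = ds.filter(keep_example)
print(f"kept {filtered.num_rows} of {ds.num_rows} examples")
```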
References
1. https://huggingface.co/datasets/teknium/openhermes
Dataset Card for "evol-codealpaca-decontaminated"
More Information needed
Prateek-Gupta123/CodeAlpaca-1k-revised dataset hosted on Hugging Face and contributed by the HF Datasets community
autoprogrammer/CodeAlpaca-lf-processed dataset hosted on Hugging Face and contributed by the HF Datasets community
rohanawhad/CodeAlpaca-20k-finetuning-format dataset hosted on Hugging Face and contributed by the HF Datasets community