Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
sahil2801/CodeAlpaca-20k dataset hosted on Hugging Face and contributed by the HF Datasets community
License: https://choosealicense.com/licenses/cc/
This dataset splits the original CodeAlpaca dataset into train and test splits.
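As a rough sketch of how a comparable split could be produced from the original data with the `datasets` library (the 10% test fraction and seed below are assumptions, not the values used by this dataset):

```python
from datasets import load_dataset

# Load the original CodeAlpaca-20k data from the Hugging Face Hub
ds = load_dataset("sahil2801/CodeAlpaca-20k", split="train")

# Split into train/test; test_size and seed here are illustrative assumptions
splits = ds.train_test_split(test_size=0.1, seed=42)
print(splits["train"].num_rows, splits["test"].num_rows)
```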
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Evolved codealpaca
Updates:
2023/08/26 - Filtered results now contain only pure English instructions, and any responses mentioning being trained by OpenAI have been removed.
Median sequence length: 471. We employed a methodology similar to that of WizardCoder, with the exception that ours is open-source. We used the gpt-4-0314 and gpt-4-0613 models to augment and answer each response, with the bulk of generation handled by gpt-4-0314. The aim of this dataset is twofold: firstly, to facilitate the… See the full description on the dataset page: https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1.
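For illustration only, one augmentation call with the OpenAI SDK might look like the sketch below; the prompt wording is invented here, and only the model names come from the description above:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def evolve_instruction(instruction: str) -> str:
    """Ask GPT-4 to rewrite an instruction into a more complex variant (illustrative prompt)."""
    response = client.chat.completions.create(
        model="gpt-4-0314",  # the description says the bulk of generation used this model
        messages=[{
            "role": "user",
            "content": (
                "Rewrite the following programming instruction so that it is more "
                f"challenging, without changing its topic:\n\n{instruction}"
            ),
        }],
    )
    return response.choices[0].message.content
```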
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for CodeAlpaca 20K
This dataset originates from the Code Alpaca repository. The CodeAlpaca 20K dataset is specifically used for training code generation models.
Dataset Details
Dataset Description
Each sample comprises three columns: instruction, input, and output.
Language(s): English
License: Apache-2.0
Dataset Sources
The code from the original repository was adapted for posting here.
Repository:… See the full description on the dataset page: https://huggingface.co/datasets/flwrlabs/code-alpaca-20k.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
thisisanshgupta/CodeAlpaca dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Code Alpaca 20K – Code + Explanation
🧠 A dataset designed to enhance large language models (LLMs) with code generation and instructional explanation capabilities. This version is an extension of the original sahil2801/CodeAlpaca-20k, with AI-generated explanations added to the output section using the Gemini API.
📘 Overview
This dataset enhances the original CodeAlpaca-20k examples by adding natural language explanations to code outputs. The goal is not just to… See the full description on the dataset page: https://huggingface.co/datasets/ByGedik/CodeAlpaca-20k-CodePlusExplanation.
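A minimal sketch of how such explanations could be appended with the Gemini API; the model name, prompt wording, and output layout are assumptions, as the card does not document the exact pipeline:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")  # placeholder, not a real key
model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model; the card only says "Gemini API"

def add_explanation(example: dict) -> dict:
    """Append an AI-generated, step-by-step explanation to the code in `output`."""
    prompt = f"Explain what the following code does, step by step:\n\n{example['output']}"
    explanation = model.generate_content(prompt).text
    example["output"] = f"{example['output']}\n\nExplanation:\n{explanation}"
    return example
```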
AlekseyKorshuk/evol-codealpaca-v1-dpo dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for "codealpaca-filtered"
More Information needed
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The preprocessed datasets used in our experiments are provided in the data folder. For the image or character classification task, we use 5 classical datasets: CIFAR-10, Fashion-MNIST, PACS, FEMNIST, and Shakespeare. We consider the mixed-finance and code+finance scenarios for instruction-tuning tasks, involving 3 financial datasets (TFNS, FIQA, NWGI) and a code dataset (CodeAlpaca). CIFAR-10 and Fashion-MNIST are widely used benchmarks in the literature for image classification tasks containing 10 categories. PACS has four domains (photo, art painting, cartoon, and sketch) and contains seven categories. FEMNIST for image classification and Shakespeare for next-character prediction come from the naturally heterogeneous synthetic benchmark LEAF. The three finance datasets are: FiQA, comprising 17k sentences sourced from microblog headlines and financial news; the Twitter Financial News Sentiment (TFNS) dataset, with 11,932 annotated finance-related tweets; and the News With GPT Instruction (NWGI) dataset, featuring labels generated by ChatGPT. The code dataset CodeAlpaca contains 20K instruction-following examples. Note that all raw data resources can be found in the "Data availability" section of our paper.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
graycatHCO3/CodeAlpaca-20K-Python dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
codealpaca for text2text generation
This dataset was downloaded from the sahil280114/codealpaca github repo and parsed into text2text format for "generating" instructions. It was downloaded under the wonderful Creative Commons Attribution-NonCommercial 4.0 International Public License (see snapshots of the repo and data license), so that license applies to this dataset. Note that the inputs and instruction columns in the original dataset have been aggregated together for text2text… See the full description on the dataset page: https://huggingface.co/datasets/pszemraj/fleece2instructions-codealpaca.
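A minimal sketch of the kind of aggregation described above, assuming the instruction and input columns are concatenated to form the generation target while the code output serves as the source text (the column names and exact mapping used by the dataset author are assumptions here):

```python
from datasets import load_dataset

# Load the CodeAlpaca data from its Hugging Face mirror (used here for illustration)
ds = load_dataset("sahil2801/CodeAlpaca-20k", split="train")

def to_text2text(example: dict) -> dict:
    # Aggregate instruction + optional input into a single string, as described above;
    # "text" (source) and "target" are illustrative column names, not the dataset's own
    prompt = example["instruction"]
    if example.get("input"):
        prompt = f"{prompt}\n\n{example['input']}"
    return {"text": example["output"], "target": prompt}

text2text = ds.map(to_text2text, remove_columns=ds.column_names)
```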
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
shapermindai/codealpaca-stanford dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Brahma Water is a compact, high-quality pretraining dataset containing over 53 million tokens across 365k+ examples. Designed for small to mid-scale language models (10–100M parameters), it balances instruction-tuned tasks, logic and math reasoning, multilingual samples, dialogue, and code.
Key Features:
- 📘 158K+ instruction samples from Alpaca, Dolly, CodeAlpaca, etc.
- 🧠 Logic & math reasoning tasks (GSM8k, COSMOS QA, SciQ, OpenbookQA)
- 💬 Conversational dialogue from open-source datasets
- 💻 Code examples in Python from MBPP, CodeSearchNet
- 🌍 Multilingual data (Hindi, Indian languages, XNLI)

It's ideal for:
- Training efficient LLMs from scratch
- Instruction-tuning compact models
- Proving new architectures (e.g., symbolic, non-transformer)
jacpetro/CodeAlpaca-20k-no-input dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for "CodeAlpaca-20k_standardized"
More Information needed
OpenHermes was trained on 242,000 entries of primarily GPT-4 generated data, from open datasets across the AI landscape, including:
- GPTeacher - General Instruct, Roleplay v1, Roleplay v2, and Code Instruct Datasets, by Teknium
- WizardLM (v1, evol_instruct 70k), by WizardLM Team/nlpxucan
- Airoboros GPT-4 (v1.0), by JonDurbin
- Camel-AI's domain expert datasets, by the Camel-AI Team
- CodeAlpaca, by Sahil2801
- GPT4-LLM and Unnatural Instructions, by Microsoft
Filtering included the removal of OpenAI refusals, disclaimers, "As an AI"-type examples, and more.
The base dataset mix is identical to the original Nous-Hermes', minus the Nous-Instruct and PDACTL datasets, which were private.
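A minimal sketch of the kind of filtering described above, using a hand-picked phrase list and an assumed response column name (the actual OpenHermes filter rules are not published in this entry):

```python
from datasets import load_dataset

# Illustrative markers only; the real OpenHermes filter list is not reproduced here
REFUSAL_MARKERS = ["as an ai", "i'm sorry, but", "i cannot", "openai"]

def keep_example(example: dict) -> bool:
    """Drop entries whose response contains refusal or disclaimer boilerplate."""
    text = example.get("output", "").lower()  # column name assumed
    return not any(marker in text for marker in REFUSAL_MARKERS)

ds = load_dataset("teknium/openhermes", split="train")  # dataset referenced below
filtered = ds.filter(keep_example)
print(f"kept {filtered.num_rows} of {ds.num_rows} examples")
```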
References
1. https://huggingface.co/datasets/teknium/openhermes
Dataset Card for "evol-codealpaca-decontaminated"
More Information needed
Prateek-Gupta123/CodeAlpaca-1k-revised dataset hosted on Hugging Face and contributed by the HF Datasets community
autoprogrammer/CodeAlpaca-lf-processed dataset hosted on Hugging Face and contributed by the HF Datasets community
rohanawhad/CodeAlpaca-20k-finetuning-format dataset hosted on Hugging Face and contributed by the HF Datasets community