https://choosealicense.com/licenses/cc/
This dataset splits the original CodeAlpaca dataset into train and test splits.
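The listing does not say how the split was produced. As a minimal sketch only, assuming a seeded random shuffle and a hypothetical 20% test fraction (neither is confirmed by the dataset card), a train/test split over CodeAlpaca-style records could look like this:

```python
import random


def train_test_split(samples, test_fraction=0.2, seed=42):
    """Shuffle a copy of `samples` and cut it into (train, test) lists.

    `test_fraction` and `seed` are illustrative defaults, not values
    taken from the dataset card.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(samples)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Toy records standing in for real CodeAlpaca samples.
samples = [
    {"instruction": f"task {i}", "input": "", "output": f"answer {i}"}
    for i in range(20)
]
train, test = train_test_split(samples)
print(len(train), len(test))
```

Fixing the seed keeps the two splits stable across runs, which matters when the test split is used to compare models.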
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for Dataset Name
This dataset originates from the Code Alpaca repository. The CodeAlpaca 20K dataset is specifically used for training code generation models.
Dataset Details
Dataset Description
Each sample comprises three columns: instruction, input, and output.
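To illustrate how the three columns are typically consumed, here is a minimal sketch that turns one sample into a training prompt. It assumes the widely used Alpaca prompt template; the dataset card does not state which template, if any, was used, and `build_prompt` is a hypothetical helper name.

```python
def build_prompt(sample):
    """Format one {instruction, input, output} record as an Alpaca-style prompt.

    Assumption: the standard Alpaca template. Samples with an empty
    `input` column use the shorter variant without an Input section.
    """
    instruction = sample["instruction"].strip()
    context = sample.get("input", "").strip()
    if context:
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{context}\n\n"
            "### Response:\n"
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        "### Response:\n"
    )


# Illustrative record, not taken from the dataset.
sample = {
    "instruction": "Write a function that reverses a string.",
    "input": "",
    "output": "def reverse(s):\n    return s[::-1]",
}
prompt = build_prompt(sample)
training_text = prompt + sample["output"]  # the output column is the target completion
print(training_text)
```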
Language(s): English
License: Apache-2.0
Dataset Sources
The code from the original repository was adapted to post it here.
Repository:… See the full description on the dataset page: https://huggingface.co/datasets/flwrlabs/code-alpaca-20k.
dinhlnd1610/CodeAlpaca-AddLanguage dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Ding0702/test dataset hosted on Hugging Face and contributed by the HF Datasets community
AlekseyKorshuk/evol-codealpaca-pairwise-sharegpt dataset hosted on Hugging Face and contributed by the HF Datasets community
Subscribers can find out export and import data of 23 countries by HS code or product’s name. This demo is helpful for market analysis.
Dataset Card for H4 Code Evaluation Prompts
This is a filtered set of prompts for evaluating code instruction models. It will contain a variety of languages and task types. Currently, we used ChatGPT (GPT-3.5-turbo) to generate these, so we encourage using them only for qualitative evaluation and not to train your models. The generation of this data is similar to something like CodeAlpaca, which you can download here, but we intend to make these tasks both a) more challenging… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/code_evaluation_prompts.
AlekseyKorshuk/code-alpaca-eval-debug dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Open Source Implementation of Evol-Instruct-Code as described in the WizardCoder Paper. Code for the instruction generation can be found on GitHub as Evol-Teacher.
Dataset Card for "python-code-instructions-18k-alpaca-standardized"
More Information needed
Moose dataset 🫎
This is a combination of 7 datasets namely:
Alpaca - Instruction following
CodeAlpaca - Programming
Dolly - Instruction following
Tigerbot GSM - Math
Tiger StackExchange - Chat
Glaive Code - Programming/Computer Questions
MetaMath QA - Math
Note: No changes were made to the content in the above datasets. The only changes made were the column names in the above datasets. Input columns were added for some datasets.
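The kind of change described in the note can be sketched as follows. This is an illustration only: the source column names (`question`, `answer`) and the helper `standardize` are hypothetical, since the listing does not say which columns were renamed in which dataset.

```python
def standardize(record, column_map):
    """Rename columns per `column_map` and add an empty `input` column if missing.

    The content of each field is left untouched; only the keys change.
    """
    renamed = {column_map.get(key, key): value for key, value in record.items()}
    renamed.setdefault("input", "")  # add the input column only when it is absent
    return renamed


# Hypothetical source record with differently named columns.
math_record = {"question": "What is 2 + 2?", "answer": "4"}
unified = standardize(math_record, {"question": "instruction", "answer": "output"})
print(unified)
```

Records that already carry an `input` column keep their value, so applying the helper to an Alpaca-style record is a no-op.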
Uses 🪴
This dataset was made to… See the full description on the dataset page: https://huggingface.co/datasets/namanbnsl/moose-dataset.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Labyrinth Dataset
Labyrinth is a code dataset that combines three existing datasets without modifying the data itself but adapting the structure/format to streamline fine-tuning for Zephyr on code.
Dataset Sources
Labyrinth is composed of code examples and instructions from the following three datasets:
CodeAlpaca by Sahil Chaudhary
Codegen-instruct by Teknium
llama-2-instruct-121k-code by Davut Emre TASAR
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created as part of my bachelor's thesis, where I fine-tuned the llama3.1:8B language model for generating ABAP code using Unsloth 4-Bit QLoRA. The data is based on 1000 random samples of CodeAlpaca translated to ABAP using llama3.1:8B. I don't recommend using this dataset; it resulted in a pretty bad model.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Card for Alpaca
Dataset Summary
Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction-tuning for language models and make the language model follow instructions better. The authors built on the data generation pipeline from the Self-Instruct framework and made the following modifications:
The text-davinci-003 engine to generate the instruction data instead… See the full description on the dataset page: https://huggingface.co/datasets/tatsu-lab/alpaca.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Card for "alpaca-gpt4"
This dataset contains English instruction-following data generated by GPT-4 using Alpaca prompts for fine-tuning LLMs. The dataset was originally shared in this repository: https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM. This is just a wrapper for compatibility with Hugging Face's datasets library.
Dataset structure
It contains 52K instruction-following data generated by GPT-4 using the same prompts as in Alpaca. The dataset has… See the full description on the dataset page: https://huggingface.co/datasets/vicgalle/alpaca-gpt4.
https://choosealicense.com/licenses/odc-by/
Tulu 2 Unfiltered
This is an 'unfiltered' version of the Tulu v2 SFT mixture, created by collating the original Tulu 2 sources and avoiding downsampling.
Details
The dataset consists of a mix of:
FLAN (Apache 2.0, we only sample 961,322 samples along with 398,439 CoT samples from the full set for this data pool)
Open Assistant 1 (Apache 2.0)
ShareGPT (Apache 2.0 listed, no official repo found)
GPT4-Alpaca (CC By NC 4.0)
Code-Alpaca (CC By NC 4.0)
LIMA (CC BY-NC-SA)… See the full description on the dataset page: https://huggingface.co/datasets/hamishivi/tulu-2-unfiltered.