Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Open Source Implementation of Evol-Instruct-Code as described in the WizardCoder paper. Code for the instruction generation can be found on GitHub as Evol-Teacher.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
A decontaminated version of evol-codealpaca-v1. Decontamination is done in the same way as StarCoder (bigcode decontamination process).
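For illustration, the core of an n-gram based decontamination pass can be sketched as follows. This is a simplified stand-in for the bigcode/StarCoder pipeline (which also performs exact-match and near-duplicate detection); the function names and the 10-gram default are assumptions, not the actual implementation.

```python
def word_ngrams(text, n=10):
    """Lowercased word n-grams used for overlap matching."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def decontaminate(samples, benchmark_solutions, n=10):
    """Drop any training sample that shares a word n-gram with a benchmark
    solution, so evaluation problems do not leak into the training set."""
    bench = set()
    for solution in benchmark_solutions:
        bench |= word_ngrams(solution, n)
    return [s for s in samples if not (word_ngrams(s, n) & bench)]
```

In the real pipeline the benchmark side would be built from the test sets of HumanEval-style evaluations rather than the toy strings used here.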
Evol-Instruct-Python-26k
Filtered version of the nickrosh/Evol-Instruct-Code-80k-v1 dataset that keeps only Python code (26,588 samples). A smaller version is available as mlabonne/Evol-Instruct-Python-1k. The dataset card shows the distribution of the number of tokens in each row (instruction + output) using Llama's tokenizer.
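The Python-only filtering can be approximated with a keyword heuristic; a minimal sketch, assuming each row is a dict with `instruction` and `output` fields. The exact filter used to produce the 26,588-sample subset is not documented here, and the real token counts come from Llama's tokenizer via the transformers library rather than the whitespace proxy below.

```python
def looks_like_python(sample):
    """Heuristic keep-filter: the row mentions Python or contains common
    Python syntax. An illustrative assumption, not the published filter."""
    text = (sample["instruction"] + "\n" + sample["output"]).lower()
    return "python" in text or "def " in text or "import " in text

def approx_row_length(sample):
    """Whitespace-token proxy for the per-row (instruction + output) length;
    the dataset card's histogram uses Llama's tokenizer instead."""
    return len((sample["instruction"] + " " + sample["output"]).split())
```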
The dataset is used in research related to MultilingualSIFT.
Evol-Instruct-Code-80k is a dataset for evaluating the performance of code generation models.
The dataset used in the paper for the in-context learning task.
The dataset is created by (1) translating English questions of Evol-instruct-70k into Chinese and (2) requesting GPT4 to generate Chinese responses. For more details, please refer to:
Repository: https://github.com/FreedomIntelligence/AceGPT and https://github.com/FreedomIntelligence/LLMZoo
Papers: AceGPT, Localizing Large Language Models in Arabic; Phoenix: Democratizing ChatGPT across Languages
BibTeX entry and citation info
@article{huang2023acegpt, title={AceGPT, Localizing… See the full description on the dataset page: https://huggingface.co/datasets/FreedomIntelligence/Evol-Instruct-Chinese-GPT4.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by CoverLover
Released under MIT
artificial-citizen/Evol-Instruct-Code dataset hosted on Hugging Face and contributed by the HF Datasets community
Alignment-Lab-AI/clusteredleaves-evol-instruct dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Wizard-LM-Chinese is a dataset created by translating the instructions in MSRA's Wizard-LM dataset into Chinese and then calling GPT to obtain the answers. Wizard-LM contains many instructions that are more difficult than those in Alpaca. During translation, a small amount of instruction injection may cause translation failures. The Chinese answers are obtained by prompting with the translated Chinese questions.
plaguss/distilabel-sample-evol-instruct dataset hosted on Hugging Face and contributed by the HF Datasets community
WizardCoder: Empowering Code Large Language Models with Evol-Instruct
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
By Hugging Face Hub
This dataset contains 80,000 unique instruction-output pairs for machine learning and AI research. Each instruction is paired with an output representing the result of carrying it out. The data can be used to train instruction-following AI agents, build natural language applications, develop dialogue between bots and humans, and support research into models that understand instructions across domains such as engineering, medicine, finance, and law.
This dataset contains 80,000 pairs of instructions and outputs for machine learning and AI research. The data can be used to train a variety of AI agents and supports tasks such as autonomous navigation, dialogue, language modelling, natural language processing (NLP), and robotics applications. The following guide outlines the steps to get the most out of this resource.
1. Download the dataset from Kaggle. Once downloaded you'll have access to two files: instruction.csv and output.csv.
2. Examine the data. Take some time to familiarize yourself with the dataset; the columns contain instructions along with the accompanying outputs generated from executing them.
3. Transform the data. Apply feature engineering techniques appropriate for your project to extract features that can be used downstream by supervised algorithms such as neural networks or by unsupervised methods such as clustering.
4. Train and test models. Split the data into a training set (80%) and a validation set (20%) before training so that model performance can be properly assessed against the validation set; develop predictive models, adjusting hyperparameters until the desired results are obtained, and fix random seeds where repeatability matters.
5. Deploy models. Deploy the model in real-world scenarios where appropriate, e.g. an autonomous car relying on natural language inputs while driving through town, or a domestic robot understanding sentences given by its user.
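The download-and-split steps above can be sketched in a few lines; a minimal sketch, assuming the two CSV files each hold a single column of text with no header (inspect the files first, since the actual column layout is not documented here).

```python
import csv
import random

def load_pairs(instruction_path, output_path):
    """Read the two CSV files shipped with the dataset and pair them row by
    row. Single-column, headerless files are an assumption; adjust if needed."""
    with open(instruction_path, newline="", encoding="utf-8") as f:
        instructions = [row[0] for row in csv.reader(f)]
    with open(output_path, newline="", encoding="utf-8") as f:
        outputs = [row[0] for row in csv.reader(f)]
    return list(zip(instructions, outputs))

def train_val_split(pairs, val_frac=0.2, seed=42):
    """Shuffle with a fixed seed (for repeatability) and split 80/20 into
    training and validation sets."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    return shuffled[n_val:], shuffled[:n_val]
```

The fixed seed makes the split reproducible across runs, which keeps hyperparameter comparisons in step 4 meaningful.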
- Training virtual assistants with specific domain knowledge (e.g. medical, finance, etc.).
- Developing autonomous navigation systems that respond to verbal instructions given by a user in natural language.
- Creating dialogue agents that can answer questions based on a pre-defined set of rules pertaining to the instructions given by the user.
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0), Public Domain Dedication. No Copyright. You can copy, modify, distribute and perform the work, even for commercial purposes, all without...
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Evolved codealpaca
Updates:
2023/08/26 - Filtered results now contain only pure English instructions, and any responses mentioning being trained by OAI have been removed.
Median sequence length: 471. We employed a methodology similar to that of WizardCoder, with the exception that ours is open-source. We used the gpt-4-0314 and gpt-4-0613 models to augment and answer each response, with the bulk of generation handled by gpt-4-0314. The aim of this dataset is twofold: firstly, to facilitate the… See the full description on the dataset page: https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1.
Locutusque/Collective-Evol-Instruct-v0.1 dataset hosted on Hugging Face and contributed by the HF Datasets community
zichao22/evol-instruct dataset hosted on Hugging Face and contributed by the HF Datasets community
alvarobartt/evol-instruct dataset hosted on Hugging Face and contributed by the HF Datasets community
banksy235/Codefuse-Evol-Instruct-Clean dataset hosted on Hugging Face and contributed by the HF Datasets community
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Data from a study investigating the influence of Nature of Science Instruction on evolution acceptance.