Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset is timdettmers/openassistant-guanaco converted to what I believe
to be the Llama 2 prompt format (based on this Reddit post).
It is otherwise unchanged.
The format is like this:
<s>[INST] <<SYS>>
{system prompt}
<</SYS>>

{user prompt} [/INST] {model response} </s>
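As a rough sketch of how such a conversion can be done (the "text" column name, the "### Human:/### Assistant:" turn markers, and the decision to omit the optional <<SYS>> block are assumptions based on the original guanaco layout, not something this card specifies):

import re
from datasets import load_dataset

# Sketch: map guanaco-style "### Human:/### Assistant:" turns onto Llama 2's
# [INST] ... [/INST] blocks. The system block is omitted for brevity.
def to_llama2(example):
    text = example["text"]
    # Split into alternating (role, content) pieces.
    turns = re.split(r"### (Human|Assistant): ", text)[1:]
    pairs = [(turns[i], turns[i + 1].strip()) for i in range(0, len(turns) - 1, 2)]
    out = ""
    for role, content in pairs:
        if role == "Human":
            out += f"<s>[INST] {content} [/INST] "
        else:
            out += f"{content} </s>"
    return {"text": out}

dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")
dataset = dataset.map(to_llama2)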
Standardized format from: https://huggingface.co/datasets/timdettmers/openassistant-guanaco?row=0
This dataset is a subset of the Open Assistant dataset, which you can find here: https://huggingface.co/datasets/OpenAssistant/oasst1/tree/main
This subset of the data only contains the highest-rated paths in the conversation tree, with a total of 9,846 samples. This dataset was used to train Guanaco with QLoRA. For further information, please see the original dataset.
License: Apache 2.0
Guanaco: Lazy Llama 2 Formatting
This is the excellent timdettmers/openassistant-guanaco dataset, processed to match Llama 2's prompt format as described in this article. Useful if you don't want to reformat it by yourself (e.g., using a script). It was designed for this article about fine-tuning a Llama 2 model in a Google Colab.
Dataset Card for "guanaco-ai-filtered"
This dataset is a subset of TimDettmers/openassistant-guanaco useful for training generalist English-language chatbots. It has been filtered to a) remove conversations in languages other than English using a fasttext classifier, and b) remove conversations where Open Assistant is mentioned, as people training their own chatbots likely do not want their chatbot to think it is named OpenAssistant.
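A minimal sketch of that kind of filtering, assuming the pre-trained fasttext language-ID model lid.176.bin, a "text" column, and an arbitrary 0.5 confidence threshold (none of which are specified by the card):

import fasttext
from datasets import load_dataset

# Sketch: keep only English conversations and drop rows mentioning OpenAssistant.
lang_model = fasttext.load_model("lid.176.bin")  # pre-trained language-ID model

def keep(example):
    text = example["text"].replace("\n", " ")  # fasttext predict() rejects newlines
    labels, probs = lang_model.predict(text, k=1)
    is_english = labels[0] == "__label__en" and probs[0] > 0.5  # threshold is an assumption
    mentions_oa = "open assistant" in text.lower() or "openassistant" in text.lower()
    return is_english and not mentions_oa

dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")
dataset = dataset.filter(keep)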
Guanaco-1k: Lazy Llama 2 Formatting
This is a subset (1000 samples) of the excellent timdettmers/openassistant-guanaco dataset, processed to match Llama 2's prompt format as described in this article. It was created using the following colab notebook. Useful if you don't want to reformat it by yourself (e.g., using a script). It was designed for this article about fine-tuning a Llama 2 (chat) model in a Google Colab.
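A sketch of how such a 1,000-sample subset can be drawn with the datasets library (the shuffle seed and the target repository name are placeholders; the card's own colab notebook may do this differently):

from datasets import load_dataset

# Sketch: draw a random 1,000-sample subset and push it to the Hub.
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")
subset = dataset.shuffle(seed=42).select(range(1000))
subset.push_to_hub("your-username/guanaco-llama2-1k")  # hypothetical repo name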
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Summary
Fine-tune Dataset: Chinese dataset. This dataset is the Chinese version of timdettmers/openassistant-guanaco, translated directly by machine without human-checked grammar. For a description of timdettmers/openassistant-guanaco, see its dataset card. License: Apache 2.0
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for "guanaco-spanish-dataset"
CLEANING AND CURATION OF THE DATASET HAS BEEN PERFORMED. IT IS NOW FULLY IN SPANISH (date: 12/01/2024). This dataset is a subset of the original timdettmers/openassistant-guanaco, which is itself a subset of the Open Assistant dataset, available here: https://huggingface.co/datasets/OpenAssistant/oasst1/tree/main/ This subset of the data only contains the highest-rated paths in the conversation tree, with a total of 2,369 samples, translated… See the full description on the dataset page: https://huggingface.co/datasets/hlhdatscience/guanaco-spanish-dataset.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
OpenAssistant Conversations Dataset (OASST1)
Dataset Summary
In an effort to democratize research on large-scale alignment, we release OpenAssistant Conversations (OASST1), a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages in 35 different languages, annotated with 461,292 quality ratings, resulting in over 10,000 fully annotated conversation trees. The corpus is a product of a worldwide crowd-sourcing effort… See the full description on the dataset page: https://huggingface.co/datasets/OpenAssistant/oasst1.
This is a derived collection of 3,000 samples from the well-known timdettmers/openassistant-guanaco dataset, reformatted to match the prompt structure required by Llama 2.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card
This is a 2,000-example extract of https://huggingface.co/datasets/timdettmers/openassistant-guanaco
Open Assistant Guanaco translated to Portuguese
Prompts originally in Portuguese were not processed. Translated with the ChatGPT 3.5-turbo API. Identified low-quality prompts were removed from the dataset.
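A sketch of what that translation step could look like with the OpenAI Python SDK (the prompt wording, the system message, and the helper name are assumptions; the card only states that gpt-3.5-turbo was used):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical helper: translate one guanaco example to Portuguese.
def translate_to_portuguese(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Translate the conversation to Portuguese, keeping the ### Human:/### Assistant: markers."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content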
About the Original Guanaco
Guanaco (https://huggingface.co/datasets/timdettmers/openassistant-guanaco) is a dataset that forms part of the Open Assistant dataset, which can be found here:… See the full description on the dataset page: https://huggingface.co/datasets/ocordeiro/guanaco-pt.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Reference: the Hugging Face dataset timdettmers/openassistant-guanaco (https://huggingface.co/datasets/timdettmers/openassistant-guanaco)
This is the excellent timdettmers/openassistant-guanaco dataset, processed to match Llama 2's prompt format as described in this article. Useful if you don't want to reformat it by yourself (e.g., using a script). It was designed for this article about fine-tuning a Llama 2 model in a Google Colab.
This is a subset (1000 samples) of timdettmers/openassistant-guanaco dataset, processed to match Mistral-7B-instruct-v0.2's prompt format as described in this article. It was created using the colab notebook. Inspired by Maxime Labonne's llm-course repo.
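As an illustration of that target format, one option is the tokenizer's built-in chat template in transformers (using apply_chat_template here is my assumption; the referenced article and notebook may format the text with their own script):

from transformers import AutoTokenizer

# Sketch: render one exchange with Mistral-7B-Instruct-v0.2's chat template.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
messages = [
    {"role": "user", "content": "What is a guanaco?"},
    {"role": "assistant", "content": "A South American camelid closely related to the llama."},
]
print(tokenizer.apply_chat_template(messages, tokenize=False))
# roughly: <s>[INST] What is a guanaco? [/INST] A South American camelid ...</s>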
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for "Arabic_guanaco_oasst1"
This dataset is the openassistant-guanaco dataset, a subset of the Open Assistant dataset, translated to Arabic. You can find the original dataset here: https://huggingface.co/datasets/timdettmers/openassistant-guanaco Or the main dataset here: https://huggingface.co/datasets/OpenAssistant/oasst1/tree/main This subset of the data only contains the highest-rated paths in the conversation tree, with a total of 9,846 samples. For further… See the full description on the dataset page: https://huggingface.co/datasets/alielfilali01/Arabic_guanaco_oasst1.
I used a notebook exercise to perform the dataset conversion. To train llama-2, no data conversion is needed; timdettmers/openassistant-guanaco can be used directly. For the llama-2-chat version, the data format does need to be converted; for the conversion process, refer to the notebook. Llama2-chat-dataset conversion
iruca-1k: Lazy Llama 2 Formatting
This is a subset (1000 samples) of the excellent timdettmers/openassistant-guanaco dataset, processed to match Llama 2's prompt format as described in this article. It was created using the following colab notebook. Useful if you don't want to reformat it by yourself (e.g., using a script). It was designed for this article about fine-tuning a Llama 2 (chat) model in a Google Colab.
Format from xlsx file to CSV
pip install openpyxl pandas… See the full description on the dataset page: https://huggingface.co/datasets/xinqiyang/iruca_llama2_japanese_demo.
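A minimal sketch of that xlsx-to-CSV step with pandas (file and sheet names are placeholders; the dataset page's own script may differ):

import pandas as pd

# openpyxl is the engine pandas uses to read .xlsx files.
df = pd.read_excel("iruca_demo.xlsx", sheet_name=0, engine="openpyxl")
df.to_csv("iruca_demo.csv", index=False, encoding="utf-8")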