theblackcat102/llava-instruct-mix reformatted for VSFT with TRL's SFT Trainer. See https://github.com/huggingface/trl/blob/main/examples/scripts/vsft_llava.py.
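As a quick way to inspect the source dataset before following the linked VSFT script, a minimal sketch (the "train" split name and printed fields are assumptions, not taken from the dataset card):

from datasets import load_dataset

# Load the original dataset that was reformatted for VSFT; "train" split is assumed.
dataset = load_dataset("theblackcat102/llava-instruct-mix", split="train")
print(dataset[0])  # inspect one example; field names vary by dataset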
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
euclaise/reddit-instruct-curated in ChatML format, ready to use in HuggingFace TRL's SFT Trainer. Python code used for conversion:

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Felladrin/Llama-160M-Chat-v1")
dataset = load_dataset("euclaise/reddit-instruct-curated", split="train")

def format(columns):
    post_title = columns["post_title"].strip()
    post_text = columns["post_text"].strip()
    comment_text = …

See the full description on the dataset page: https://huggingface.co/datasets/Felladrin/ChatML-reddit-instruct-curated.
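A plausible completion of the truncated format function, sketched under the assumption that the remaining column is comment_text and that the post becomes the user turn while the comment becomes the assistant turn (this role mapping is an assumption, not taken from the dataset card):

def format(columns):
    post_title = columns["post_title"].strip()
    post_text = columns["post_text"].strip()
    comment_text = columns["comment_text"].strip()  # column name assumed from the truncated snippet
    # Assumed pairing: post as the user message, comment as the assistant reply.
    messages = [
        {"role": "user", "content": f"{post_title}\n\n{post_text}".strip()},
        {"role": "assistant", "content": comment_text},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

dataset = dataset.map(format, remove_columns=dataset.column_names)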
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
VMware/open-instruct in ChatML format, ready to use in HuggingFace TRL's SFT Trainer. Python code used for conversion:

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Felladrin/Llama-160M-Chat-v1")
dataset = load_dataset("VMware/open-instruct", split="train")

def format(columns):
    messages = [
        {
            "role": "user",
            "content": columns["instruction"].strip(),
        },
        {…

See the full description on the dataset page: https://huggingface.co/datasets/Felladrin/ChatML-open-instruct.
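A plausible completion, assuming the assistant turn comes from a "response" column (the column name is an assumption; the snippet above only shows the user turn):

def format(columns):
    messages = [
        {"role": "user", "content": columns["instruction"].strip()},
        {"role": "assistant", "content": columns["response"].strip()},  # "response" is assumed
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

dataset = dataset.map(format, remove_columns=dataset.column_names)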
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
starfishmedical/webGPT_x_dolly in ChatML format, ready to use in HuggingFace TRL's SFT Trainer. Python code used for conversion:

from datasets import load_dataset
from transformers import AutoTokenizer
import random

tokenizer = AutoTokenizer.from_pretrained("Felladrin/Llama-160M-Chat-v1")
dataset = load_dataset("starfishmedical/webGPT_x_dolly", split="train")

def format(columns):
    instruction = columns["instruction"].strip()
    input = columns["input"].strip()
    assistant_message = …

See the full description on the dataset page: https://huggingface.co/datasets/Felladrin/ChatML-webGPT_x_dolly.
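A plausible completion, assuming the answer lives in an "output" column and that random (imported above) varies how instruction and input are joined; both details are assumptions:

def format(columns):
    instruction = columns["instruction"].strip()
    input_text = columns["input"].strip()  # renamed from the snippet's "input" to avoid shadowing the built-in
    assistant_message = columns["output"].strip()  # "output" column name is assumed
    # Assumed use of the random import: vary the separator for template diversity.
    separator = random.choice(["\n", "\n\n"])
    user_message = f"{instruction}{separator}{input_text}".strip() if input_text else instruction
    messages = [
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": assistant_message},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}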
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
OpenAssistant TOP-1 English Conversations
This is a twice-filtered dataset derived from oasst2, a set of conversation trees collected by the OpenAssistant project. It was first filtered to the top-ranked branch of each conversation tree, forming blancsw/oasst2_top1_chat_format. It was then filtered down to English-only conversations and reduced to a single 'messages' data column, which allows the dataset to be fed directly into the HuggingFace SFTTrainer (provided your tokenizer has a chat template)… See the full description on the dataset page: https://huggingface.co/datasets/Trelis/oasst2_top1_chat_format_en.
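A minimal usage sketch with TRL's SFTTrainer, assuming a recent TRL release; the model choice, split name, and output directory are placeholders:

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("Trelis/oasst2_top1_chat_format_en", split="train")

# With a 'messages' column and a tokenizer that defines a chat template,
# SFTTrainer applies the template itself; no manual formatting is needed.
trainer = SFTTrainer(
    model="Felladrin/Llama-160M-Chat-v1",  # placeholder model
    train_dataset=dataset,
    args=SFTConfig(output_dir="sft-output"),
)
trainer.train()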
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
sablo/oasst2_curated in ChatML format, ready to use in HuggingFace TRL's SFT Trainer. Python code used for conversion:

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Felladrin/Llama-160M-Chat-v1")
dataset = load_dataset("sablo/oasst2_curated", split="train")

def format(columns):
    return {"text": tokenizer.apply_chat_template(columns["messages"], tokenize=False)}

… See the full description on the dataset page: https://huggingface.co/datasets/Felladrin/ChatML-oasst2_curated.
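The step that presumably follows is mapping the function over the dataset; a minimal sketch (the remove_columns choice is an assumption):

dataset = dataset.map(format, remove_columns=dataset.column_names)
print(dataset[0]["text"])  # inspect one ChatML-formatted conversation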
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Open-Orca/OpenOrca in ChatML format, ready to use in HuggingFace TRL's SFT Trainer. Python code used for conversion:

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Felladrin/Minueza-32M-Base")
dataset = load_dataset("Open-Orca/OpenOrca", split="train")

def format(columns):
    messages = []
    system_prompt = columns["system_prompt"].strip()
    if system_prompt:
        messages.append({
            "role": "system"…

See the full description on the dataset page: https://huggingface.co/datasets/Felladrin/ChatML-OpenOrca.