theblackcat102/llava-instruct-mix reformatted for VSFT with TRL's SFT Trainer. See https://github.com/huggingface/trl/blob/main/examples/scripts/vsft_llava.py.
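As a quick way to inspect the source dataset before following the linked VSFT script, a minimal sketch (the "train" split name and printed fields are assumptions, not taken from the dataset card):

from datasets import load_dataset

# Load the original dataset that was reformatted for VSFT; "train" split is assumed.
dataset = load_dataset("theblackcat102/llava-instruct-mix", split="train")
print(dataset[0])  # inspect one example; field names vary by dataset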
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
euclaise/reddit-instruct-curated in ChatML format, ready to use in HuggingFace TRL's SFT Trainer. Python code used for conversion:

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Felladrin/Llama-160M-Chat-v1")
dataset = load_dataset("euclaise/reddit-instruct-curated", split="train")

def format(columns):
    post_title = columns["post_title"].strip()
    post_text = columns["post_text"].strip()
    comment_text = …

See the full description on the dataset page: https://huggingface.co/datasets/Felladrin/ChatML-reddit-instruct-curated.
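A plausible completion of the truncated format function, sketched under the assumption that the remaining column is comment_text and that the post becomes the user turn while the comment becomes the assistant turn (this role mapping is an assumption, not taken from the dataset card):

def format(columns):
    post_title = columns["post_title"].strip()
    post_text = columns["post_text"].strip()
    comment_text = columns["comment_text"].strip()  # column name assumed from the truncated snippet
    # Assumed pairing: post as the user message, comment as the assistant reply.
    messages = [
        {"role": "user", "content": f"{post_title}\n\n{post_text}".strip()},
        {"role": "assistant", "content": comment_text},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

dataset = dataset.map(format, remove_columns=dataset.column_names)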
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
VMware/open-instruct in ChatML format, ready to use in HuggingFace TRL's SFT Trainer. Python code used for conversion:

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Felladrin/Llama-160M-Chat-v1")
dataset = load_dataset("VMware/open-instruct", split="train")

def format(columns):
    messages = [
        {
            "role": "user",
            "content": columns["instruction"].strip(),
        },
        {…

See the full description on the dataset page: https://huggingface.co/datasets/Felladrin/ChatML-open-instruct.
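A plausible completion, assuming the assistant turn comes from a "response" column (the column name is an assumption; the snippet above only shows the user turn):

def format(columns):
    messages = [
        {"role": "user", "content": columns["instruction"].strip()},
        {"role": "assistant", "content": columns["response"].strip()},  # "response" is assumed
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

dataset = dataset.map(format, remove_columns=dataset.column_names)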
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
starfishmedical/webGPT_x_dolly in ChatML format, ready to use in HuggingFace TRL's SFT Trainer. Python code used for conversion:

from datasets import load_dataset
from transformers import AutoTokenizer
import random

tokenizer = AutoTokenizer.from_pretrained("Felladrin/Llama-160M-Chat-v1")
dataset = load_dataset("starfishmedical/webGPT_x_dolly", split="train")

def format(columns):
    instruction = columns["instruction"].strip()
    input = columns["input"].strip()
    assistant_message = …

See the full description on the dataset page: https://huggingface.co/datasets/Felladrin/ChatML-webGPT_x_dolly.
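A plausible completion, assuming the answer lives in an "output" column and that random (imported above) varies how instruction and input are joined; both details are assumptions:

def format(columns):
    instruction = columns["instruction"].strip()
    input_text = columns["input"].strip()  # renamed from the snippet's "input" to avoid shadowing the built-in
    assistant_message = columns["output"].strip()  # "output" column name is assumed
    # Assumed use of the random import: vary the separator for template diversity.
    separator = random.choice(["\n", "\n\n"])
    user_message = f"{instruction}{separator}{input_text}".strip() if input_text else instruction
    messages = [
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": assistant_message},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}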
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
OpenAssistant TOP-1 English Conversations
This is a twice-filtered dataset derived from oasst2, a set of conversation trees collected by the OpenAssistant project. It was first filtered to the top-ranked branch of each conversation tree, forming blancsw/oasst2_top1_chat_format. It was then filtered down to English-only conversations and reduced to a single 'messages' data column, which allows the dataset to be fed directly into the HuggingFace SFTTrainer (provided your tokenizer has a chat template)… See the full description on the dataset page: https://huggingface.co/datasets/Trelis/oasst2_top1_chat_format_en.
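A minimal usage sketch with TRL's SFTTrainer, assuming a recent TRL release; the model choice, split name, and output directory are placeholders:

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("Trelis/oasst2_top1_chat_format_en", split="train")

# With a 'messages' column and a tokenizer that defines a chat template,
# SFTTrainer applies the template itself; no manual formatting is needed.
trainer = SFTTrainer(
    model="Felladrin/Llama-160M-Chat-v1",  # placeholder model
    train_dataset=dataset,
    args=SFTConfig(output_dir="sft-output"),
)
trainer.train()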
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
sablo/oasst2_curated in ChatML format, ready to use in HuggingFace TRL's SFT Trainer. Python code used for conversion:

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Felladrin/Llama-160M-Chat-v1")
dataset = load_dataset("sablo/oasst2_curated", split="train")

def format(columns):
    return {"text": tokenizer.apply_chat_template(columns["messages"], tokenize=False)}

… See the full description on the dataset page: https://huggingface.co/datasets/Felladrin/ChatML-oasst2_curated.
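The step that presumably follows is mapping the function over the dataset; a minimal sketch (the remove_columns choice is an assumption):

dataset = dataset.map(format, remove_columns=dataset.column_names)
print(dataset[0]["text"])  # inspect one ChatML-formatted conversation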
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Open-Orca/OpenOrca in ChatML format, ready to use in HuggingFace TRL's SFT Trainer. Python code used for conversion:

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Felladrin/Minueza-32M-Base")
dataset = load_dataset("Open-Orca/OpenOrca", split="train")

def format(columns):
    messages = []
    system_prompt = columns["system_prompt"].strip()
    if system_prompt:
        messages.append({
            "role": "system"…

See the full description on the dataset page: https://huggingface.co/datasets/Felladrin/ChatML-OpenOrca.