7 datasets found
  1. llava-instruct-mix-vsft

    • huggingface.co
    Updated Apr 11, 2024
    Cite
    Hugging Face H4 (2024). llava-instruct-mix-vsft [Dataset]. https://huggingface.co/datasets/HuggingFaceH4/llava-instruct-mix-vsft
    Dataset updated
    Apr 11, 2024
    Dataset provided by
    Hugging Face: https://huggingface.co/
    Authors
    Hugging Face H4
    Description

    theblackcat102/llava-instruct-mix reformatted for VSFT with TRL's SFT Trainer. See https://github.com/huggingface/trl/blob/main/examples/scripts/vsft_llava.py.
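
    As a quick orientation, the sketch below loads this dataset and inspects one record before handing it to the VSFT example script. The "train" split name and the "messages"/"images" column names are assumptions based on the TRL VSFT example, not stated in the description above.

    from datasets import load_dataset

    # Load the reformatted dataset (split name assumed).
    dataset = load_dataset("HuggingFaceH4/llava-instruct-mix-vsft", split="train")

    # Peek at one record; "messages" (chat turns) and "images" are assumed columns.
    example = dataset[0]
    print(example.keys())
    print(example["messages"][0])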

  2. ChatML-reddit-instruct-curated

    • huggingface.co
    Updated Nov 5, 2023
    Cite
    Victor Nogueira (2023). ChatML-reddit-instruct-curated [Dataset]. https://huggingface.co/datasets/Felladrin/ChatML-reddit-instruct-curated
    Dataset updated
    Nov 5, 2023
    Authors
    Victor Nogueira
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    euclaise/reddit-instruct-curated in ChatML format, ready to use in HuggingFace TRL's SFT Trainer. Python code used for conversion:

    from datasets import load_dataset
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("Felladrin/Llama-160M-Chat-v1")

    dataset = load_dataset("euclaise/reddit-instruct-curated", split="train")

    def format(columns):
        post_title = columns["post_title"].strip()
        post_text = columns["post_text"].strip()
        comment_text =…

    See the full description on the dataset page: https://huggingface.co/datasets/Felladrin/ChatML-reddit-instruct-curated.
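
    The conversion snippet above is truncated. A hypothetical completion is sketched below: it pairs the post title and body as the user turn with the top comment as the assistant turn, then renders the pair through the tokenizer's chat template. The "comment_text" column name and the exact message layout are assumptions; the authoritative code is on the dataset page.

    # Hypothetical reconstruction of the truncated format() above,
    # continuing from the tokenizer and dataset defined in that snippet.
    def format(columns):
        post_title = columns["post_title"].strip()
        post_text = columns["post_text"].strip()
        comment_text = columns["comment_text"].strip()  # assumed column name
        messages = [
            {"role": "user", "content": f"{post_title}\n\n{post_text}"},
            {"role": "assistant", "content": comment_text},
        ]
        # Render the conversation as a single ChatML string.
        return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

    dataset = dataset.map(format)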

  3. ChatML-open-instruct

    • huggingface.co
    Updated Feb 22, 2024
    Cite
    Victor Nogueira (2024). ChatML-open-instruct [Dataset]. https://huggingface.co/datasets/Felladrin/ChatML-open-instruct
    Dataset updated
    Feb 22, 2024
    Authors
    Victor Nogueira
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    VMware/open-instruct in ChatML format, ready to use in HuggingFace TRL's SFT Trainer. Python code used for conversion:

    from datasets import load_dataset
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("Felladrin/Llama-160M-Chat-v1")

    dataset = load_dataset("VMware/open-instruct", split="train")

    def format(columns):
        messages = [
            {
                "role": "user",
                "content": columns["instruction"].strip(),
            },
            {…

    See the full description on the dataset page: https://huggingface.co/datasets/Felladrin/ChatML-open-instruct.
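
    Since each converted row is a single ChatML string, the dataset can be fed straight to TRL's SFT Trainer. The sketch below assumes a TRL release that provides SFTConfig (roughly 0.9 onwards); on other versions the dataset_text_field argument is passed to SFTTrainer directly, so check the docs for your installed version. The base model choice is arbitrary.

    from datasets import load_dataset
    from trl import SFTConfig, SFTTrainer

    # The converted dataset stores one ChatML-formatted string per row in a "text" column.
    dataset = load_dataset("Felladrin/ChatML-open-instruct", split="train")

    trainer = SFTTrainer(
        model="Felladrin/Llama-160M-Chat-v1",  # hypothetical base model
        args=SFTConfig(output_dir="sft-open-instruct", dataset_text_field="text"),
        train_dataset=dataset,
    )
    trainer.train()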

  4. ChatML-webGPT_x_dolly

    • huggingface.co
    Updated Feb 18, 2024
    + more versions
    Cite
    Victor Nogueira (2024). ChatML-webGPT_x_dolly [Dataset]. https://huggingface.co/datasets/Felladrin/ChatML-webGPT_x_dolly
    Dataset updated
    Feb 18, 2024
    Authors
    Victor Nogueira
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    starfishmedical/webGPT_x_dolly in ChatML format, ready to use in HuggingFace TRL's SFT Trainer. Python code used for conversion:

    from datasets import load_dataset
    from transformers import AutoTokenizer
    import random

    tokenizer = AutoTokenizer.from_pretrained("Felladrin/Llama-160M-Chat-v1")

    dataset = load_dataset("starfishmedical/webGPT_x_dolly", split="train")

    def format(columns):
        instruction = columns["instruction"].strip()
        input = columns["input"].strip()
        assistant_message =…

    See the full description on the dataset page: https://huggingface.co/datasets/Felladrin/ChatML-webGPT_x_dolly.
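
    A small sanity check before training: load the published ChatML version and print the start of one row to confirm the markup. The "text" column name mirrors the conversion pattern shown in these listings but is an assumption here; verify it on the dataset page.

    from datasets import load_dataset

    # Load the already-converted dataset and peek at one row.
    dataset = load_dataset("Felladrin/ChatML-webGPT_x_dolly", split="train")
    print(dataset[0]["text"][:500])  # "text" column assumed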

  5. oasst2_top1_chat_format_en

    • huggingface.co
    Updated Apr 19, 2024
    Cite
    Trelis (2024). oasst2_top1_chat_format_en [Dataset]. https://huggingface.co/datasets/Trelis/oasst2_top1_chat_format_en
    Dataset updated
    Apr 19, 2024
    Dataset authored and provided by
    Trelis
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    OpenAssistant TOP-1 English Conversations

    This is a twice-filtered dataset derived from oasst2, a set of conversation trees collected by the OpenAssistant project. It was first filtered to the top-ranked branch of each conversation tree, forming blancsw/oasst2_top1_chat_format. It was then filtered down to English only and reduced to a single 'messages' data column, so the dataset can be passed directly to the HuggingFace SFTTrainer (provided your tokenizer has a chat template)… See the full description on the dataset page: https://huggingface.co/datasets/Trelis/oasst2_top1_chat_format_en.
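
    Because the dataset is reduced to a single 'messages' column, a recent TRL release can apply the tokenizer's chat template on the fly. The sketch below assumes such a version (older releases expect a pre-rendered text column or a formatting function) and an arbitrary chat-templated base model.

    from datasets import load_dataset
    from trl import SFTConfig, SFTTrainer

    # Each row holds a list of chat turns in a "messages" column (per the description above).
    dataset = load_dataset("Trelis/oasst2_top1_chat_format_en", split="train")

    trainer = SFTTrainer(
        model="Felladrin/Llama-160M-Chat-v1",  # hypothetical base model with a chat template
        args=SFTConfig(output_dir="sft-oasst2-top1-en"),
        train_dataset=dataset,  # recent TRL renders "messages" with the chat template
    )
    trainer.train()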

  6. ChatML-oasst2_curated

    • huggingface.co
    Updated Feb 21, 2024
    + more versions
    Cite
    Victor Nogueira (2024). ChatML-oasst2_curated [Dataset]. https://huggingface.co/datasets/Felladrin/ChatML-oasst2_curated
    Dataset updated
    Feb 21, 2024
    Authors
    Victor Nogueira
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    sablo/oasst2_curated in ChatML format, ready to use in HuggingFace TRL's SFT Trainer. Python code used for conversion:

    from datasets import load_dataset
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("Felladrin/Llama-160M-Chat-v1")

    dataset = load_dataset("sablo/oasst2_curated", split="train")

    def format(columns):
        return {
            "text": tokenizer.apply_chat_template(columns["messages"], tokenize=False)
        }…

    See the full description on the dataset page: https://huggingface.co/datasets/Felladrin/ChatML-oasst2_curated.
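
    Here the format() function survives the truncation intact, so the end-to-end conversion can be reproduced as below. The map/remove_columns invocation is an assumption, as it is not shown in the snippet above.

    from datasets import load_dataset
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("Felladrin/Llama-160M-Chat-v1")
    dataset = load_dataset("sablo/oasst2_curated", split="train")

    def format(columns):
        # Render the list of chat turns into one ChatML-formatted string.
        return {"text": tokenizer.apply_chat_template(columns["messages"], tokenize=False)}

    # Apply the conversion and drop the original columns (invocation assumed).
    dataset = dataset.map(format, remove_columns=dataset.column_names)
    print(dataset[0]["text"][:300])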

  7. ChatML-OpenOrca

    • huggingface.co
    Updated Mar 8, 2024
    Cite
    Victor Nogueira (2024). ChatML-OpenOrca [Dataset]. https://huggingface.co/datasets/Felladrin/ChatML-OpenOrca
    Dataset updated
    Mar 8, 2024
    Authors
    Victor Nogueira
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Open-Orca/OpenOrca in ChatML format, ready to use in HuggingFace TRL's SFT Trainer. Python code used for conversion:

    from datasets import load_dataset
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("Felladrin/Minueza-32M-Base")

    dataset = load_dataset("Open-Orca/OpenOrca", split="train")

    def format(columns):
        messages = []

        system_prompt = columns["system_prompt"].strip()

        if system_prompt:
            messages.append({
                "role": "system"…

    See the full description on the dataset page: https://huggingface.co/datasets/Felladrin/ChatML-OpenOrca.
    
