33 datasets found
  1. open-instruct

    • huggingface.co
    Updated Feb 6, 2023
    Cite
    VMware AI Labs (2023). open-instruct [Dataset]. https://huggingface.co/datasets/VMware/open-instruct
    Dataset updated
    Feb 6, 2023
    Dataset provided by
    VMware (http://www.vmware.com/)
    Authors
    VMware AI Labs
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    Dataset Card for "open-instruct"

    This dataset is a combination of:

    • Filtered subset of OpenAssistant/oasst1
    • Train split of Mosaic-dolly-hhrlhf (consists of Databricks' dolly-15k dataset and a filtered subset of Anthropic's HH-RLHF)
    • Filtered subset of conceptofmind/cot_submix_original

      Dataset
    

    The dataset consists of 6 columns:

    instruction: The natural language instruction without any prompt templates (we extracted them out of the alpaca-format in… See the full description on the dataset page: https://huggingface.co/datasets/VMware/open-instruct.
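
The instruction column is described as the bare instruction with the prompt template removed. A minimal sketch of that kind of extraction follows, assuming the widely used Alpaca prompt layout with "### Instruction:" and "### Response:" markers. The function name and the regular expression are hypothetical and are not taken from the dataset card.

```python
import re

# Assumption: prompts follow the common Alpaca layout, where the instruction
# sits between "### Instruction:" and the next "### Input:" or "### Response:".
_INSTRUCTION_RE = re.compile(
    r"### Instruction:\n(.*?)\n\n### (?:Input|Response):", re.DOTALL
)


def extract_instruction(alpaca_prompt: str) -> str:
    """Return the instruction text with the surrounding template removed."""
    match = _INSTRUCTION_RE.search(alpaca_prompt)
    if match is None:
        raise ValueError("prompt is not in the expected Alpaca layout")
    return match.group(1).strip()


if __name__ == "__main__":
    prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\nName three primary colors.\n\n### Response:\n"
    )
    print(extract_instruction(prompt))
```

Prompts that do not match the assumed layout raise an error rather than returning a partial string, which makes template drift easy to notice.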

  2. Open-Instruct-v1

    • huggingface.co
    Updated Mar 5, 2024
    Cite
    Lee Jackson (2024). Open-Instruct-v1 [Dataset]. https://huggingface.co/datasets/NeuralNovel/Open-Instruct-v1
    Dataset updated
    Mar 5, 2024
    Authors
    Lee Jackson
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    NeuralNovel/Open-Instruct-v1 dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. open-instruct-sharegpt

    • huggingface.co
    Updated Nov 20, 2023
    Cite
    Alignment Lab Ai (2023). open-instruct-sharegpt [Dataset]. https://huggingface.co/datasets/AlignmentLab-AI/open-instruct-sharegpt
    Dataset updated
    Nov 20, 2023
    Dataset authored and provided by
    Alignment Lab Ai
    Description

    AlignmentLab-AI/open-instruct-sharegpt dataset hosted on Hugging Face and contributed by the HF Datasets community

  4. open-instruct-gpt4o_55k_rev_sit_72k

    • huggingface.co
    Updated Sep 1, 2024
    Cite
    sgp-bench (2024). open-instruct-gpt4o_55k_rev_sit_72k [Dataset]. https://huggingface.co/datasets/sgp-bench/open-instruct-gpt4o_55k_rev_sit_72k
    Dataset updated
    Sep 1, 2024
    Dataset authored and provided by
    sgp-bench
    Description

    sgp-bench/open-instruct-gpt4o_55k_rev_sit_72k dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. four-digits-multiply-open-instruct

    • huggingface.co
    Updated Dec 13, 2024
    Cite
    Fireworks AI (2024). four-digits-multiply-open-instruct [Dataset]. https://huggingface.co/datasets/fireworks-ai/four-digits-multiply-open-instruct
    Dataset updated
    Dec 13, 2024
    Dataset authored and provided by
    Fireworks AI
    Description

    fireworks-ai/four-digits-multiply-open-instruct dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. open-instruct

    • huggingface.co
    Updated Feb 7, 2023
    Cite
    sgp-bench (2023). open-instruct [Dataset]. https://huggingface.co/datasets/sgp-bench/open-instruct
    Dataset updated
    Feb 7, 2023
    Dataset authored and provided by
    sgp-bench
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    sgp-bench/open-instruct dataset hosted on Hugging Face and contributed by the HF Datasets community

  7. open-instruct-v3

    • huggingface.co
    Updated Dec 3, 2023
    Cite
    Kimiko (2023). open-instruct-v3 [Dataset]. https://huggingface.co/datasets/Chat-Error/open-instruct-v3
    Dataset updated
    Dec 3, 2023
    Authors
    Kimiko
    Description

    Chat-Error/open-instruct-v3 dataset hosted on Hugging Face and contributed by the HF Datasets community

  8. rejection_sampling_22591

    • huggingface.co
    Updated Sep 15, 2024
    Cite
    Jacob Morrison (2024). rejection_sampling_22591 [Dataset]. https://huggingface.co/datasets/jacobmorrison/rejection_sampling_22591
    Dataset updated
    Sep 15, 2024
    Authors
    Jacob Morrison
    Description

    allenai/open_instruct: Rejection Sampling Dataset

    See https://github.com/allenai/open-instruct/blob/main/docs/algorithms/rejection_sampling.md for more detail

      Configs
    

    args: {'add_timestamp': False, 'hf_entity': 'jacobmorrison', 'hf_repo_id': 'rejection_sampling_22591', 'hf_repo_id_scores': 'scores_22591', 'input_filename': '/output/shards/22591/29.jsonl', 'max_forward_batch_size': 64, 'mode': 'judgement', 'model_names_or_paths':… See the full description on the dataset page: https://huggingface.co/datasets/jacobmorrison/rejection_sampling_22591.
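
For readers unfamiliar with the term, rejection sampling in this setting generally means scoring several candidate completions per prompt and keeping the best (and often the worst) of them. The sketch below illustrates only that selection step, assuming one scalar score per completion. The function name and the returned keys are hypothetical, and this is not the code from the allenai/open-instruct repository; the linked documentation describes the actual procedure.

```python
from typing import Sequence


def select_chosen_rejected(
    completions: Sequence[str], scores: Sequence[float]
) -> dict:
    """Pick the highest-scored completion as 'chosen' and the lowest as 'rejected'.

    Assumption: scores[i] is a scalar reward for completions[i], higher is better.
    """
    if not completions or len(completions) != len(scores):
        raise ValueError("need one score per completion and at least one completion")
    indices = range(len(scores))
    best = max(indices, key=lambda i: scores[i])
    worst = min(indices, key=lambda i: scores[i])
    return {"chosen": completions[best], "rejected": completions[worst]}


if __name__ == "__main__":
    print(select_chosen_rejected(["a", "b", "c"], [0.1, 0.9, -0.3]))
```

With a single candidate, the chosen and rejected entries are the same string, so a real pipeline would likely skip such prompts.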

  9. open-instruct-v1

    • huggingface.co
    Updated Feb 16, 2024
    Cite
    Better Uncensored (2024). open-instruct-v1 [Dataset]. https://huggingface.co/datasets/betteruncensored/open-instruct-v1
    Dataset updated
    Feb 16, 2024
    Dataset authored and provided by
    Better Uncensored
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Open Instruct V1 Better Uncensored

    This is the open-instruct-v1 dataset processed with the Better Uncensored pipeline. About 2.5% of the dataset was removed. A quick review of the removed examples suggests they are mostly false positives or answers with debatable moralizing content. No clear refusal was seen in the quick review. The original dataset may be safe for training uncensored models, but if you want to be extra sure you can use this one.

      Open Instruct V1 -… See the full description on the dataset page: https://huggingface.co/datasets/betteruncensored/open-instruct-v1.
    
  10. open-instruct-uncensored-refusals-removed

    • huggingface.co
    Updated Jun 24, 2023
    Cite
    vfouroo (2023). open-instruct-uncensored-refusals-removed [Dataset]. https://huggingface.co/datasets/userv4oo/open-instruct-uncensored-refusals-removed
    Dataset updated
    Jun 24, 2023
    Authors
    vfouroo
    Description

    userv4oo/open-instruct-uncensored-refusals-removed dataset hosted on Hugging Face and contributed by the HF Datasets community

  11. open-instruct-uncensored-alpaca

    • huggingface.co
    Updated Jun 27, 1996
    Cite
    xzuyn (1996). open-instruct-uncensored-alpaca [Dataset]. https://huggingface.co/datasets/xzuyn/open-instruct-uncensored-alpaca
    Dataset updated
    Jun 27, 1996
    Authors
    xzuyn
    Description

    Original dataset page from ehartford. 810,102 entries. Sourced from open-instruct-uncensored.jsonl. Converted the JSONL to a JSON file that can be loaded into something like LLaMa-LoRA-Tuner. I've also included smaller datasets that include fewer entries, depending on how much memory you have to work with. Each one is randomized before being converted, so each dataset is unique in order. Count of each dataset: code_alpaca: 19991; unnatural_instructions: 68231; baize: 166096; self_instruct: 81512… See the full description on the dataset page: https://huggingface.co/datasets/xzuyn/open-instruct-uncensored-alpaca.
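
The conversion described above (shuffle the JSONL records, optionally keep a smaller subset, write a single JSON array) can be sketched as follows. This is an illustration under stated assumptions, not the author's script: the function name, the `max_entries` and `seed` parameters, and the in-memory string interface are all hypothetical.

```python
import json
import random
from typing import Optional


def jsonl_to_json(
    jsonl_text: str, max_entries: Optional[int] = None, seed: Optional[int] = None
) -> str:
    """Convert JSON Lines text into one JSON array, shuffling the records first.

    Assumption: every non-blank line is a complete JSON object.
    """
    records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    rng = random.Random(seed)
    rng.shuffle(records)  # each converted file gets its own record order
    if max_entries is not None:
        records = records[:max_entries]  # smaller variant for limited memory
    return json.dumps(records, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    sample = '{"instruction": "a"}\n{"instruction": "b"}\n{"instruction": "c"}\n'
    print(jsonl_to_json(sample, max_entries=2, seed=0))
```

Shuffling before truncating means each smaller variant is a random sample of the full set rather than its first rows.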

  12. open-instruct-v1_deduped

    • huggingface.co
    Updated Apr 25, 2023
    Cite
    Isotonic (2023). open-instruct-v1_deduped [Dataset]. https://huggingface.co/datasets/Isotonic/open-instruct-v1_deduped
    Dataset updated
    Apr 25, 2023
    Authors
    Isotonic
    Description

    Dataset Card for "open-instruct-v1_deduped"

    • Deduplicated version of Isotonic/open-instruct-v1
    • Deduplicated with min Jaccard similarity of 0.8
    • Uses Stability's System Prompt

    System: StableLM Tuned (Alpha version)

    • StableLM is a helpful and harmless open-source AI language model developed by StabilityAI.
    • StableLM is excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
    • StableLM is more than just an information… See the full description on the dataset page: https://huggingface.co/datasets/Isotonic/open-instruct-v1_deduped.
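
The card mentions deduplication at a minimum Jaccard similarity of 0.8 but does not say which tool was used. The sketch below shows the idea with exact pairwise comparison over word trigrams. The shingle size, the function names, and the greedy keep-first policy are assumptions, and large datasets would typically use an approximate method such as MinHash instead of this quadratic loop.

```python
import re


def shingles(text: str, n: int = 3) -> set[str]:
    """Return the set of lowercased word n-grams in a text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |A & B| / |A | B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedupe(texts: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a text only if it is below the threshold against every kept text."""
    kept: list[str] = []
    kept_shingles: list[set[str]] = []
    for text in texts:
        current = shingles(text)
        if all(jaccard(current, other) < threshold for other in kept_shingles):
            kept.append(text)
            kept_shingles.append(current)
    return kept


if __name__ == "__main__":
    rows = [
        "Explain how a binary search works.",
        "explain how a binary search works",
        "Write a haiku about rain.",
    ]
    print(dedupe(rows))
```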

  13. ChatML-open-instruct

    • huggingface.co
    Updated Feb 22, 2024
    Cite
    Victor Nogueira (2024). ChatML-open-instruct [Dataset]. https://huggingface.co/datasets/Felladrin/ChatML-open-instruct
    Dataset updated
    Feb 22, 2024
    Authors
    Victor Nogueira
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    VMware/open-instruct in ChatML format, ready to use in HuggingFace TRL's SFT Trainer. Python code used for conversion:

    from datasets import load_dataset
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("Felladrin/Llama-160M-Chat-v1")

    dataset = load_dataset("VMware/open-instruct", split="train")

    def format(columns):
        messages = [
            {
                "role": "user",
                "content": columns["instruction"].strip(),
            },
            {… See the full description on the dataset page: https://huggingface.co/datasets/Felladrin/ChatML-open-instruct.
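
Because the conversion code above is truncated, here is a separate, dependency-free sketch of what rendering one row as ChatML text can look like. It writes the `<|im_start|>` and `<|im_end|>` markers by hand instead of using a tokenizer, and it assumes the answer lives in a column named "response" and that the output key is "text". Both names, and the function name, are assumptions rather than details confirmed by the dataset card.

```python
def to_chatml(columns: dict) -> dict:
    """Render one instruction/response row as a ChatML-formatted string."""
    messages = [
        {"role": "user", "content": columns["instruction"].strip()},
        # Assumption: the answer column is called "response".
        {"role": "assistant", "content": columns["response"].strip()},
    ]
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    return {"text": text}


if __name__ == "__main__":
    row = {"instruction": " Say hi. ", "response": "Hi!\n"}
    print(to_chatml(row)["text"])
```

A function with this shape could be passed to a row-wise mapping step, since it takes one row as a dict and returns a dict of new columns.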

  14. open-instruct-uncensored-alpaca

    • huggingface.co
    Cite
    James, open-instruct-uncensored-alpaca [Dataset]. https://huggingface.co/datasets/jtatman/open-instruct-uncensored-alpaca
    Authors
    James
    Description

    Dataset Card for "open-instruct-uncensored-alpaca"

    More Information needed

  15. generation_1746311715

    • huggingface.co
    Updated May 4, 2025
    Cite
    Victoria Graf (2025). generation_1746311715 [Dataset]. https://huggingface.co/datasets/VGraf/generation_1746311715
    Dataset updated
    May 4, 2025
    Authors
    Victoria Graf
    Description

    allenai/open_instruct: Generation Dataset

    See https://github.com/allenai/open-instruct/blob/main/docs/algorithms/rejection_sampling.md for more detail

      Configs
    

    args: {'add_timestamp': True, 'dataset_end_idx': 3239, 'dataset_mixer_list': ['VGraf/alpacaeval_paraphrase_questions_dev', '1.0'], 'dataset_splits': ['train', 'train'], 'dataset_start_idx': 0, 'hf_entity': 'VGraf', 'hf_repo_id': 'generation', 'mode': 'generation', 'model_name_or_path': 'gpt-3.5-turbo-0125'… See the full description on the dataset page: https://huggingface.co/datasets/VGraf/generation_1746311715.

  16. oasst2

    • huggingface.co
    Updated Jun 12, 2025
    Cite
    PRLM (2025). oasst2 [Dataset]. https://huggingface.co/datasets/PRLM/oasst2
    Dataset updated
    Jun 12, 2025
    Dataset authored and provided by
    PRLM
    Description

    from datasets import load_dataset, Dataset
    import re

    Script to filter and process the OpenAssistant dataset (oasst2).

    Based on the conversion script from the open-instruct repo -> https://github.com/allenai/open-instruct/blob/main/scripts/data/sft/utils.py#L1

    def should_be_filtered_by_keyword(example, verbose=False):
        # we filter out conversations that contain some specific strings
        filter_strings = [
            "OpenAI",
            "Open AI",
            "ChatGPT",
            "Chat GPT"… See the full description on the dataset page: https://huggingface.co/datasets/PRLM/oasst2.
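
The filter above is cut off after its first four strings. The sketch below completes the idea using only those four visible strings. It assumes each example carries a "messages" list of dicts with a "content" field and that matching is case-insensitive. Both are assumptions, and the real script may use a longer keyword list and different matching rules.

```python
# Only the four strings visible in the truncated description are used here.
FILTER_STRINGS = ["OpenAI", "Open AI", "ChatGPT", "Chat GPT"]


def should_be_filtered_by_keyword(example: dict, verbose: bool = False) -> bool:
    """Return True if any message in the conversation mentions a filtered string.

    Assumption: example["messages"] is a list of {"role": ..., "content": ...}.
    """
    for message in example["messages"]:
        content = message["content"].lower()
        for keyword in FILTER_STRINGS:
            if keyword.lower() in content:
                if verbose:
                    print(f"Filtered because of keyword: {keyword!r}")
                return True
    return False


if __name__ == "__main__":
    flagged = {"messages": [{"role": "assistant", "content": "I was built by OpenAI."}]}
    clean = {"messages": [{"role": "user", "content": "What is a croissant?"}]}
    print(should_be_filtered_by_keyword(flagged), should_be_filtered_by_keyword(clean))
```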

  17. generation_19367

    • huggingface.co
    Updated Sep 14, 2024
    Cite
    Jacob Morrison (2024). generation_19367 [Dataset]. https://huggingface.co/datasets/jacobmorrison/generation_19367
    Dataset updated
    Sep 14, 2024
    Authors
    Jacob Morrison
    Description

    allenai/open_instruct: Generation Dataset

    See https://github.com/allenai/open-instruct/blob/main/docs/algorithms/rejection_sampling.md for more detail

      Configs
    

    args: {'add_timestamp': False, 'hf_entity': 'jacobmorrison', 'hf_repo_id': 'generation_19367', 'mode': 'generation', 'model_name_or_path': '/model', 'push_to_hub': True, 'revision': 'main', 'save_filename': '/output/shards/19367/14.jsonl', 'skill': 'chat'}

    dataset_args: {'dataset_end_idx': 29850… See the full description on the dataset page: https://huggingface.co/datasets/jacobmorrison/generation_19367.

  18. scores_5816

    • huggingface.co
    Updated Sep 14, 2024
    Cite
    Jacob Morrison (2024). scores_5816 [Dataset]. https://huggingface.co/datasets/jacobmorrison/scores_5816
    Dataset updated
    Sep 14, 2024
    Authors
    Jacob Morrison
    Description

    allenai/open_instruct: Rejection Sampling Dataset

    See https://github.com/allenai/open-instruct/blob/main/docs/algorithms/rejection_sampling.md for more detail

      Configs
    

    args: {'add_timestamp': False, 'hf_entity': 'jacobmorrison', 'hf_repo_id': 'rejection_sampling_5816', 'hf_repo_id_scores': 'scores_5816', 'input_filename': '/output/shards/5816/4.jsonl', 'max_forward_batch_size': 64, 'mode': 'judgement', 'model_names_or_paths': ['Skywork/Skywork-Reward-Llama-3.1-8B']… See the full description on the dataset page: https://huggingface.co/datasets/jacobmorrison/scores_5816.

  19. scores_26764

    • huggingface.co
    Updated Jul 30, 2024
    Cite
    Shengyi Costa Huang (2024). scores_26764 [Dataset]. https://huggingface.co/datasets/vwxyzjn/scores_26764
    Dataset updated
    Jul 30, 2024
    Authors
    Shengyi Costa Huang
    Description

    allenai/open_instruct: Rejection Sampling Dataset

    See https://github.com/allenai/open-instruct/blob/main/docs/algorithms/rejection_sampling.md for more detail

      Configs
    

    args: {'add_timestamp': False, 'hf_entity': 'vwxyzjn', 'hf_repo_id': 'rejection_sampling_26764', 'hf_repo_id_scores': 'scores_26764', 'input_filename': 'output/shards/26764/3.jsonl', 'max_forward_batch_size': 64, 'mode': 'judgement', 'model_names_or_paths': ['allenai/llama-3-tulu-2-8b-uf-mean-rm']… See the full description on the dataset page: https://huggingface.co/datasets/vwxyzjn/scores_26764.

  20. generation_6296

    • huggingface.co
    Updated Sep 6, 2024
    Cite
    Shengyi Costa Huang (2024). generation_6296 [Dataset]. https://huggingface.co/datasets/vwxyzjn/generation_6296
    Dataset updated
    Sep 6, 2024
    Authors
    Shengyi Costa Huang
    Description

    allenai/open_instruct: Generation Dataset

    See https://github.com/allenai/open-instruct/blob/main/docs/algorithms/rejection_sampling.md for more detail

      Configs
    

    args: {'add_timestamp': False, 'hf_entity': 'vwxyzjn', 'hf_repo_id': 'generation_6296', 'mode': 'generation', 'model_name_or_path': 'allenai/open_instruct_dev', 'push_to_hub': True, 'revision': 'costa_finetune_tulu3_8b_norobot_meta-llama_Meta-Llama-3.1-8B_42_1725559869', 'save_filename':… See the full description on the dataset page: https://huggingface.co/datasets/vwxyzjn/generation_6296.
