Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Dataset Card for "open-instruct"
This dataset is a combination of:
- a filtered subset of OpenAssistant/oasst1
- the train split of Mosaic-dolly-hhrlhf (consisting of Databricks' dolly-15k dataset and a filtered subset of Anthropic's HH-RLHF)
- a filtered subset of conceptofmind/cot_submix_original
Dataset
The dataset consists of 6 columns:
instruction: The natural language instruction without any prompt templates (we extracted them out of the alpaca-format in… See the full description on the dataset page: https://huggingface.co/datasets/VMware/open-instruct.
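A minimal sketch of loading the combined dataset and inspecting its columns, assuming only that the datasets library is installed:

from datasets import load_dataset

# the combined instruction-tuning data described above
ds = load_dataset("VMware/open-instruct", split="train")
print(ds.column_names)        # expect 6 columns, including "instruction"
print(ds[0]["instruction"])   # first instruction, free of prompt templates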
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
NeuralNovel/Open-Instruct-v1 dataset hosted on Hugging Face and contributed by the HF Datasets community
AlignmentLab-AI/open-instruct-sharegpt dataset hosted on Hugging Face and contributed by the HF Datasets community
sgp-bench/open-instruct-gpt4o_55k_rev_sit_72k dataset hosted on Hugging Face and contributed by the HF Datasets community
fireworks-ai/four-digits-multiply-open-instruct dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
sgp-bench/open-instruct dataset hosted on Hugging Face and contributed by the HF Datasets community
Chat-Error/open-instruct-v3 dataset hosted on Hugging Face and contributed by the HF Datasets community
allenai/open_instruct: Rejection Sampling Dataset
See https://github.com/allenai/open-instruct/blob/main/docs/algorithms/rejection_sampling.md for more detail
Configs
args: {'add_timestamp': False, 'hf_entity': 'jacobmorrison', 'hf_repo_id': 'rejection_sampling_22591', 'hf_repo_id_scores': 'scores_22591', 'input_filename': '/output/shards/22591/29.jsonl', 'max_forward_batch_size': 64, 'mode': 'judgement', 'model_names_or_paths': … See the full description on the dataset page: https://huggingface.co/datasets/jacobmorrison/rejection_sampling_22591.
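A hedged sketch of the rejection-sampling loop this config drives: sample several completions per prompt in generation mode, score them with a reward model in judgement mode, and keep the best. generate_fn and score_fn are hypothetical placeholders, not the open_instruct API:

def rejection_sample(prompt, generate_fn, score_fn, n=8):
    # sample n candidate completions for the prompt
    candidates = [generate_fn(prompt) for _ in range(n)]
    # score each candidate with a reward model
    scores = [score_fn(prompt, c) for c in candidates]
    # keep the highest-scoring completion
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best], scores[best]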
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Open Instruct V1 Better Uncensored
This is the open-instruct-v1 dataset processed with the Better Uncensored pipeline. About 2.5% of the dataset was removed; a quick review of the removed examples suggests they are mostly false positives or answers with debatable moralizing content. No clear refusals were seen in the quick review. The original dataset may be safe for training uncensored models, but if you want to be extra sure, you can use this one.
Open Instruct V1 - … See the full description on the dataset page: https://huggingface.co/datasets/betteruncensored/open-instruct-v1.
userv4oo/open-instruct-uncensored-refusals-removed dataset hosted on Hugging Face and contributed by the HF Datasets community
Original dataset page from ehartford. 810,102 entries, sourced from open-instruct-uncensored.jsonl. Converted the jsonl to a json which can be loaded into something like LLaMa-LoRA-Tuner. I've also included smaller datasets that include fewer entries, depending on how much memory you have to work with. Each one is randomized before being converted, so each dataset is unique in order. Count of each dataset:
- code_alpaca: 19991
- unnatural_instructions: 68231
- baize: 166096
- self_instruct: 81512
… See the full description on the dataset page: https://huggingface.co/datasets/xzuyn/open-instruct-uncensored-alpaca.
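A minimal sketch of the jsonl-to-json conversion and randomized subsetting described above; the file names and the 100k cutoff are illustrative, not taken from the card:

import json
import random

with open("open-instruct-uncensored.jsonl") as f:
    entries = [json.loads(line) for line in f]

random.shuffle(entries)  # randomize order before writing each variant

with open("open-instruct-uncensored.json", "w") as f:
    json.dump(entries, f)             # full set
with open("open-instruct-uncensored-100k.json", "w") as f:
    json.dump(entries[:100_000], f)   # smaller subset for tighter memory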
Dataset Card for "open-instruct-v1_deduped"
Deduplicated version of Isotonic/open-instruct-v1. Deduplicated with a minimum Jaccard similarity of 0.8. Uses Stability's system prompt.
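A hedged sketch of threshold-based near-duplicate removal at Jaccard similarity 0.8, using plain token sets; the actual pipeline likely used MinHash/LSH rather than this O(n^2) loop:

def jaccard(a, b):
    # |A intersect B| / |A union B| over token sets
    return len(a & b) / len(a | b) if (a or b) else 1.0

def dedupe(texts, threshold=0.8):
    kept, kept_sets = [], []
    for text in texts:
        tokens = set(text.lower().split())
        # drop any text too similar to one already kept
        if any(jaccard(tokens, s) >= threshold for s in kept_sets):
            continue
        kept.append(text)
        kept_sets.append(tokens)
    return kept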
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
VMware/open-instruct in ChatML format, ready to use in HuggingFace TRL's SFT Trainer. Python code used for conversion:

from datasets import load_dataset
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Felladrin/Llama-160M-Chat-v1")
dataset = load_dataset("VMware/open-instruct", split="train")
def format(columns):
    messages = [
        {
            "role": "user",
            "content": columns["instruction"].strip(),
        },
        { …
See the full description on the dataset page: https://huggingface.co/datasets/Felladrin/ChatML-open-instruct.
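The snippet is cut off above; a plausible completion, assuming the second message is the assistant turn built from a response column and the conversation is rendered with the tokenizer's chat template:

def format(columns):
    messages = [
        {
            "role": "user",
            "content": columns["instruction"].strip(),
        },
        {
            "role": "assistant",
            "content": columns["response"].strip(),  # assumed column name
        },
    ]
    # render the two-turn conversation as a single ChatML string
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

dataset = dataset.map(format)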
Dataset Card for "open-instruct-uncensored-alpaca"
More Information needed
allenai/open_instruct: Generation Dataset
See https://github.com/allenai/open-instruct/blob/main/docs/algorithms/rejection_sampling.md for more detail
Configs
args: {'add_timestamp': True, 'dataset_end_idx': 3239, 'dataset_mixer_list': ['VGraf/alpacaeval_paraphrase_questions_dev', '1.0'], 'dataset_splits': ['train', 'train'], 'dataset_start_idx': 0, 'hf_entity': 'VGraf', 'hf_repo_id': 'generation', 'mode': 'generation', 'model_name_or_path': 'gpt-3.5-turbo-0125' … See the full description on the dataset page: https://huggingface.co/datasets/VGraf/generation_1746311715.
from datasets import load_dataset, Dataset
import re
def should_be_filtered_by_keyword(example, verbose=False):
    # we filter out conversations that contain some specific strings
    filter_strings = [
        "OpenAI",
        "Open AI",
        "ChatGPT",
        "Chat GPT" …
See the full description on the dataset page: https://huggingface.co/datasets/PRLM/oasst2.
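The function is truncated above; a hedged sketch of how such a keyword filter typically finishes (the example["messages"] structure is an assumption, not taken from the PRLM/oasst2 card):

def should_be_filtered_by_keyword(example, verbose=False):
    # we filter out conversations that contain some specific strings
    filter_strings = [
        "OpenAI",
        "Open AI",
        "ChatGPT",
        "Chat GPT",
    ]
    for message in example["messages"]:  # assumed conversation field
        if any(s.lower() in message["content"].lower() for s in filter_strings):
            if verbose:
                print("filtered:", message["content"][:80])
            return True
    return False

dataset = dataset.filter(lambda ex: not should_be_filtered_by_keyword(ex))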
allenai/open_instruct: Generation Dataset
See https://github.com/allenai/open-instruct/blob/main/docs/algorithms/rejection_sampling.md for more detail
Configs
args: {'add_timestamp': False, 'hf_entity': 'jacobmorrison', 'hf_repo_id': 'generation_19367', 'mode': 'generation', 'model_name_or_path': '/model', 'push_to_hub': True, 'revision': 'main', 'save_filename': '/output/shards/19367/14.jsonl', 'skill': 'chat'}
dataset_args: {'dataset_end_idx': 29850 … See the full description on the dataset page: https://huggingface.co/datasets/jacobmorrison/generation_19367.
allenai/open_instruct: Rejection Sampling Dataset
See https://github.com/allenai/open-instruct/blob/main/docs/algorithms/rejection_sampling.md for more detail
Configs
args: {'add_timestamp': False, 'hf_entity': 'jacobmorrison', 'hf_repo_id': 'rejection_sampling_5816', 'hf_repo_id_scores': 'scores_5816', 'input_filename': '/output/shards/5816/4.jsonl', 'max_forward_batch_size': 64, 'mode': 'judgement', 'model_names_or_paths': ['Skywork/Skywork-Reward-Llama-3.1-8B'] … See the full description on the dataset page: https://huggingface.co/datasets/jacobmorrison/scores_5816.
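A hedged sketch of the judgement step this config describes: scoring a prompt/completion pair with a sequence-classification reward model such as Skywork/Skywork-Reward-Llama-3.1-8B. The chat-template usage is typical for such reward models but not verified against open_instruct's own code:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

rm_name = "Skywork/Skywork-Reward-Llama-3.1-8B"
rm_tokenizer = AutoTokenizer.from_pretrained(rm_name)
rm = AutoModelForSequenceClassification.from_pretrained(rm_name)

def score(prompt, completion):
    messages = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": completion},
    ]
    input_ids = rm_tokenizer.apply_chat_template(messages, return_tensors="pt")
    with torch.no_grad():
        # single scalar reward for the whole sequence
        return rm(input_ids).logits[0].item()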
allenai/open_instruct: Rejection Sampling Dataset
See https://github.com/allenai/open-instruct/blob/main/docs/algorithms/rejection_sampling.md for more detail
Configs
args: {'add_timestamp': False, 'hf_entity': 'vwxyzjn', 'hf_repo_id': 'rejection_sampling_26764', 'hf_repo_id_scores': 'scores_26764', 'input_filename': 'output/shards/26764/3.jsonl', 'max_forward_batch_size': 64, 'mode': 'judgement', 'model_names_or_paths': ['allenai/llama-3-tulu-2-8b-uf-mean-rm'] … See the full description on the dataset page: https://huggingface.co/datasets/vwxyzjn/scores_26764.
allenai/open_instruct: Generation Dataset
See https://github.com/allenai/open-instruct/blob/main/docs/algorithms/rejection_sampling.md for more detail
Configs
args: {'add_timestamp': False, 'hf_entity': 'vwxyzjn', 'hf_repo_id': 'generation_6296', 'mode': 'generation', 'model_name_or_path': 'allenai/open_instruct_dev', 'push_to_hub': True, 'revision': 'costa_finetune_tulu3_8b_norobot_meta-llama_Meta-Llama-3.1-8B_42_1725559869', 'save_filename': … See the full description on the dataset page: https://huggingface.co/datasets/vwxyzjn/generation_6296.