Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Dataset Card for "open-instruct"
This dataset is a combination of:
- a filtered subset of OpenAssistant/oasst1
- the train split of Mosaic-dolly-hhrlhf (consisting of Databricks' dolly-15k dataset and a filtered subset of Anthropic's HH-RLHF)
- a filtered subset of conceptofmind/cot_submix_original
Dataset
The dataset consists of 6 columns:
instruction: The natural language instruction without any prompt templates (we extracted them out of the alpaca-format in… See the full description on the dataset page: https://huggingface.co/datasets/VMware/open-instruct.
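A minimal sketch of loading the combined dataset and inspecting its columns, assuming only that the datasets library is installed:

from datasets import load_dataset

# the combined instruction-tuning data described above
ds = load_dataset("VMware/open-instruct", split="train")
print(ds.column_names)        # expect 6 columns, including "instruction"
print(ds[0]["instruction"])   # first instruction, free of prompt templates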
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
NeuralNovel/Open-Instruct-v1 dataset hosted on Hugging Face and contributed by the HF Datasets community
AlignmentLab-AI/open-instruct-sharegpt dataset hosted on Hugging Face and contributed by the HF Datasets community
sgp-bench/open-instruct-gpt4o_55k_rev_sit_72k dataset hosted on Hugging Face and contributed by the HF Datasets community
fireworks-ai/four-digits-multiply-open-instruct dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
sgp-bench/open-instruct dataset hosted on Hugging Face and contributed by the HF Datasets community
Chat-Error/open-instruct-v3 dataset hosted on Hugging Face and contributed by the HF Datasets community
allenai/open_instruct: Rejection Sampling Dataset
See https://github.com/allenai/open-instruct/blob/main/docs/algorithms/rejection_sampling.md for more detail
Configs
args: {'add_timestamp': False, 'hf_entity': 'jacobmorrison', 'hf_repo_id': 'rejection_sampling_22591', 'hf_repo_id_scores': 'scores_22591', 'input_filename': '/output/shards/22591/29.jsonl', 'max_forward_batch_size': 64, 'mode': 'judgement', 'model_names_or_paths': … See the full description on the dataset page: https://huggingface.co/datasets/jacobmorrison/rejection_sampling_22591.
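A hedged sketch of the rejection-sampling loop this config drives: sample several completions per prompt in generation mode, score them with a reward model in judgement mode, and keep the best. generate_fn and score_fn are hypothetical placeholders, not the open_instruct API:

def rejection_sample(prompt, generate_fn, score_fn, n=8):
    # sample n candidate completions for the prompt
    candidates = [generate_fn(prompt) for _ in range(n)]
    # score each candidate with a reward model
    scores = [score_fn(prompt, c) for c in candidates]
    # keep the highest-scoring completion
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best], scores[best]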
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Open Instruct V1 Better Uncensored
This is the open-instruct-v1 dataset processed with the Better Uncensored pipeline. About 2.5% of the dataset was removed; a quick review of the removed examples suggests they are mostly false positives or answers with debatable moralizing content. No clear refusals were seen in the quick review. The original dataset may be safe for training uncensored models, but if you want to be extra sure, you can use this one.
Open Instruct V1 - … See the full description on the dataset page: https://huggingface.co/datasets/betteruncensored/open-instruct-v1.
userv4oo/open-instruct-uncensored-refusals-removed dataset hosted on Hugging Face and contributed by the HF Datasets community
Original dataset page from ehartford. 810,102 entries, sourced from open-instruct-uncensored.jsonl. Converted the jsonl to a json which can be loaded into something like LLaMa-LoRA-Tuner. I've also included smaller datasets that include fewer entries, depending on how much memory you have to work with. Each one is randomized before being converted, so each dataset is unique in order. Count of each dataset:
- code_alpaca: 19991
- unnatural_instructions: 68231
- baize: 166096
- self_instruct: 81512
… See the full description on the dataset page: https://huggingface.co/datasets/xzuyn/open-instruct-uncensored-alpaca.
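A minimal sketch of the jsonl-to-json conversion and randomized subsetting described above; the file names and the 100k cutoff are illustrative, not taken from the card:

import json
import random

with open("open-instruct-uncensored.jsonl") as f:
    entries = [json.loads(line) for line in f]

random.shuffle(entries)  # randomize order before writing each variant

with open("open-instruct-uncensored.json", "w") as f:
    json.dump(entries, f)             # full set
with open("open-instruct-uncensored-100k.json", "w") as f:
    json.dump(entries[:100_000], f)   # smaller subset for tighter memory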
Dataset Card for "open-instruct-v1_deduped"
Deduplicated version of Isotonic/open-instruct-v1. Deduplicated with a minimum Jaccard similarity of 0.8. Uses Stability's system prompt.
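A hedged sketch of threshold-based near-duplicate removal at Jaccard similarity 0.8, using plain token sets; the actual pipeline likely used MinHash/LSH rather than this O(n^2) loop:

def jaccard(a, b):
    # |A intersect B| / |A union B| over token sets
    return len(a & b) / len(a | b) if (a or b) else 1.0

def dedupe(texts, threshold=0.8):
    kept, kept_sets = [], []
    for text in texts:
        tokens = set(text.lower().split())
        # drop any text too similar to one already kept
        if any(jaccard(tokens, s) >= threshold for s in kept_sets):
            continue
        kept.append(text)
        kept_sets.append(tokens)
    return kept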
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
VMware/open-instruct in ChatML format, ready to use in HuggingFace TRL's SFT Trainer. Python code used for conversion:

from datasets import load_dataset
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Felladrin/Llama-160M-Chat-v1")
dataset = load_dataset("VMware/open-instruct", split="train")
def format(columns):
    messages = [
        {
            "role": "user",
            "content": columns["instruction"].strip(),
        },
        { …
See the full description on the dataset page: https://huggingface.co/datasets/Felladrin/ChatML-open-instruct.
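The snippet is cut off above; a plausible completion, assuming the second message is the assistant turn built from a response column and the conversation is rendered with the tokenizer's chat template:

def format(columns):
    messages = [
        {
            "role": "user",
            "content": columns["instruction"].strip(),
        },
        {
            "role": "assistant",
            "content": columns["response"].strip(),  # assumed column name
        },
    ]
    # render the two-turn conversation as a single ChatML string
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

dataset = dataset.map(format)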
Dataset Card for "open-instruct-uncensored-alpaca"
More Information needed
allenai/open_instruct: Generation Dataset
See https://github.com/allenai/open-instruct/blob/main/docs/algorithms/rejection_sampling.md for more detail
Configs
args: {'add_timestamp': True, 'dataset_end_idx': 3239, 'dataset_mixer_list': ['VGraf/alpacaeval_paraphrase_questions_dev', '1.0'], 'dataset_splits': ['train', 'train'], 'dataset_start_idx': 0, 'hf_entity': 'VGraf', 'hf_repo_id': 'generation', 'mode': 'generation', 'model_name_or_path': 'gpt-3.5-turbo-0125' … See the full description on the dataset page: https://huggingface.co/datasets/VGraf/generation_1746311715.
from datasets import load_dataset, Dataset
import re
def should_be_filtered_by_keyword(example, verbose=False):
    # we filter out conversations that contain some specific strings
    filter_strings = [
        "OpenAI",
        "Open AI",
        "ChatGPT",
        "Chat GPT" …
See the full description on the dataset page: https://huggingface.co/datasets/PRLM/oasst2.
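The function is truncated above; a hedged sketch of how such a keyword filter typically finishes (the example["messages"] structure is an assumption, not taken from the PRLM/oasst2 card):

def should_be_filtered_by_keyword(example, verbose=False):
    # we filter out conversations that contain some specific strings
    filter_strings = [
        "OpenAI",
        "Open AI",
        "ChatGPT",
        "Chat GPT",
    ]
    for message in example["messages"]:  # assumed conversation field
        if any(s.lower() in message["content"].lower() for s in filter_strings):
            if verbose:
                print("filtered:", message["content"][:80])
            return True
    return False

dataset = dataset.filter(lambda ex: not should_be_filtered_by_keyword(ex))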
allenai/open_instruct: Generation Dataset
See https://github.com/allenai/open-instruct/blob/main/docs/algorithms/rejection_sampling.md for more detail
Configs
args: {'add_timestamp': False, 'hf_entity': 'jacobmorrison', 'hf_repo_id': 'generation_19367', 'mode': 'generation', 'model_name_or_path': '/model', 'push_to_hub': True, 'revision': 'main', 'save_filename': '/output/shards/19367/14.jsonl', 'skill': 'chat'}
dataset_args: {'dataset_end_idx': 29850 … See the full description on the dataset page: https://huggingface.co/datasets/jacobmorrison/generation_19367.
allenai/open_instruct: Rejection Sampling Dataset
See https://github.com/allenai/open-instruct/blob/main/docs/algorithms/rejection_sampling.md for more detail
Configs
args: {'add_timestamp': False, 'hf_entity': 'jacobmorrison', 'hf_repo_id': 'rejection_sampling_5816', 'hf_repo_id_scores': 'scores_5816', 'input_filename': '/output/shards/5816/4.jsonl', 'max_forward_batch_size': 64, 'mode': 'judgement', 'model_names_or_paths': ['Skywork/Skywork-Reward-Llama-3.1-8B'] … See the full description on the dataset page: https://huggingface.co/datasets/jacobmorrison/scores_5816.
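A hedged sketch of the judgement step this config describes: scoring a prompt/completion pair with a sequence-classification reward model such as Skywork/Skywork-Reward-Llama-3.1-8B. The chat-template usage is typical for such reward models but not verified against open_instruct's own code:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

rm_name = "Skywork/Skywork-Reward-Llama-3.1-8B"
rm_tokenizer = AutoTokenizer.from_pretrained(rm_name)
rm = AutoModelForSequenceClassification.from_pretrained(rm_name)

def score(prompt, completion):
    messages = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": completion},
    ]
    input_ids = rm_tokenizer.apply_chat_template(messages, return_tensors="pt")
    with torch.no_grad():
        # single scalar reward for the whole sequence
        return rm(input_ids).logits[0].item()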
allenai/open_instruct: Rejection Sampling Dataset
See https://github.com/allenai/open-instruct/blob/main/docs/algorithms/rejection_sampling.md for more detail
Configs
args: {'add_timestamp': False, 'hf_entity': 'vwxyzjn', 'hf_repo_id': 'rejection_sampling_26764', 'hf_repo_id_scores': 'scores_26764', 'input_filename': 'output/shards/26764/3.jsonl', 'max_forward_batch_size': 64, 'mode': 'judgement', 'model_names_or_paths': ['allenai/llama-3-tulu-2-8b-uf-mean-rm'] … See the full description on the dataset page: https://huggingface.co/datasets/vwxyzjn/scores_26764.
allenai/open_instruct: Generation Dataset
See https://github.com/allenai/open-instruct/blob/main/docs/algorithms/rejection_sampling.md for more detail
Configs
args: {'add_timestamp': False, 'hf_entity': 'vwxyzjn', 'hf_repo_id': 'generation_6296', 'mode': 'generation', 'model_name_or_path': 'allenai/open_instruct_dev', 'push_to_hub': True, 'revision': 'costa_finetune_tulu3_8b_norobot_meta-llama_Meta-Llama-3.1-8B_42_1725559869', 'save_filename': … See the full description on the dataset page: https://huggingface.co/datasets/vwxyzjn/generation_6296.