MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Overview
This is a new curated subset of our OpenOrca data. This release provides an efficient means of reaching performance on par with using larger slices of our data, while only including ~500k GPT-4 completions. The key change in this dataset is that we've done an additional pass, using GPT-4 to remove answers which appear wrong based on the human annotations from the FLAN dataset. This reduces the dataset size to only ~500k entries, allowing training to a similar quality level… See the full description on the dataset page: https://huggingface.co/datasets/Open-Orca/SlimOrca.
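A rough sketch of what such a GPT-4 filtering pass could look like, assuming the OpenAI chat API is used to compare each completion against the FLAN human annotation (the prompt wording and the looks_wrong helper are illustrative assumptions, not the authors' actual pipeline):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def looks_wrong(question, completion, flan_reference):
    # Ask GPT-4 whether the completion contradicts the human-annotated
    # FLAN reference answer; rows flagged YES would be dropped.
    prompt = (
        f"Question: {question}\n"
        f"Reference answer (FLAN): {flan_reference}\n"
        f"Candidate answer: {completion}\n"
        "Does the candidate answer contradict the reference? Reply YES or NO."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")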
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Overview
"SlimOrca Dedup" is a deduplicated, unfiltered subset of the SlimOrca dataset, excluding RLHF instances, resulting in 363k unique examples.
Key Features
Removal of RLHF instances. Deduplication using MinHash and Jaccard similarity techniques.
Demo Models
Note: These models were trained on the full SlimOrca dataset, not the deduplicated, unfiltered version.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Copy of Open-Orca/SlimOrca-Dedup in ChatML format, downsampled to 100k
"SlimOrca Dedup" is a deduplicated, unfiltered subset of the SlimOrca dataset, excluding RLHF instances, resulting in 363k unique examples.
Key Features
Removal of RLHF instances. Deduplication using MinHash and Jaccard similarity techniques.
Demo Models
Note: These models were trained on the full SlimOrca dataset, not the deduplicated, unfiltered version.
… See the full description on the dataset page: https://huggingface.co/datasets/philschmid/slimorca-dedup-chatml-100k.
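For reference, converting a SlimOrca-style record to ChatML mostly amounts to mapping the record's from roles onto ChatML turns; a minimal sketch (the role mapping and record shape are inferred from the example records shown on these dataset pages, not taken from this card's code):

ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def to_chatml(conversations):
    # Wrap each turn in ChatML <|im_start|> ... <|im_end|> markers.
    parts = []
    for turn in conversations:
        role = ROLE_MAP[turn["from"]]
        parts.append(f"<|im_start|>{role}\n{turn['value']}<|im_end|>")
    return "\n".join(parts)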
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for Isotonic/SlimOrca
Dataset Summary
This dataset is a deduplicated version of Open-Orca/OpenOrca (MinHash deduplication with Jaccard threshold = 0.80).
Original dataset size: 4233923
Number of duplicate clusters: 522077
Files in duplicate clusters: 2115143
Unique files in duplicate clusters: 892638
Filtered dataset size: 3011418
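Deduplication at a fixed Jaccard threshold like 0.80 is commonly implemented with MinHash LSH; a minimal sketch using the datasketch library (the tokenization and keys here are illustrative, not this dataset's actual pipeline):

from datasketch import MinHash, MinHashLSH

def minhash_of(text, num_perm=128):
    # Hash each whitespace token into the MinHash signature.
    m = MinHash(num_perm=num_perm)
    for token in text.split():
        m.update(token.encode("utf-8"))
    return m

def dedup(texts, threshold=0.80):
    # Keep a text only if no already-kept text is estimated to exceed
    # the Jaccard similarity threshold.
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for i, text in enumerate(texts):
        m = minhash_of(text)
        if not lsh.query(m):  # no near-duplicate among kept texts
            lsh.insert(str(i), m)
            kept.append(text)
    return kept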
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
CREDIT: https://huggingface.co/cgato. There were some minor formatting errors, which were corrected and pushed to the Open-Orca org. What is this dataset? Half of the Slim Orca Deduped dataset, but further cleaned by removing instances of soft prompting. I removed a ton of prompt prefixes which did not add any information or were redundant, e.g. "Question:", "Q:", "Write the Answer:", "Read this:", "Instructions:". I also removed a ton of prompt suffixes which were simply there to lead the model to answer as… See the full description on the dataset page: https://huggingface.co/datasets/Open-Orca/slimorca-deduped-cleaned-corrected.
ondevicellm/SlimOrca dataset hosted on Hugging Face and contributed by the HF Datasets community
orangetin/SlimOrca-Convo dataset hosted on Hugging Face and contributed by the HF Datasets community
lilac/SlimOrca
This dataset is a Lilac processed dataset. Original dataset: https://huggingface.co/datasets/Open-Orca/SlimOrca
To download the dataset to a local directory:
lilac download lilacai/lilac-SlimOrca
or from Python with (assuming import lilac as ll):
ll.download("lilacai/lilac-SlimOrca")
Open-Orca/SlimOrca-Dedup
{ "processed": true, "4keys": true, "jsonifize": true, "uploaded": true }
LICENSE FOUND AT: https://huggingface.co/datasets/Open-Orca/SlimOrca-Dedup. Reformatting generated by AlignmentLab.AI; please refer to the original authors' work for attribution.
Example line:
{'conversations': [{'from': 'system', 'value': 'You are an AI assistant. You will be given a task. You must generate a detailed and long answer.'}, {'from': 'human', 'value':… See the full description on the dataset page: https://huggingface.co/datasets/jsonifize/SlimOrca-Dedup-4keys.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This is a modified version of the slimorca-deduped-cleaned-corrected dataset. It contains only English characters. Open Orca Slim for Pascal Developers is a subset of the original Open Orca dataset. The Open Orca Slim for Pascal Developers dataset was created with:

from datasets import load_dataset

def biggest_char_code(input_string):
    """
    Returns the largest character code in a string.
    """
    if not input_string:
        return None  # Handle empty string case
    largest_code… See the full description on the dataset page: https://huggingface.co/datasets/schuler/open-orca-slimorca-deduped-cleaned-corrected-for-pascal-txt.
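The snippet above is truncated on the card. A minimal sketch of a plausible completion and usage, assuming the helper is used to keep only rows whose characters fit in 7-bit ASCII (the completion of the body and the filter criterion below are our assumptions, not the card's code):

from datasets import load_dataset

def biggest_char_code(input_string):
    """Returns the largest character code in a string."""
    if not input_string:
        return None  # Handle empty string case
    # Assumed completion: scan the string for the maximum code point.
    return max(ord(ch) for ch in input_string)

# Hypothetical usage: keep only English (ASCII-only) rows.
ds = load_dataset("Open-Orca/slimorca-deduped-cleaned-corrected")
ds = ds.filter(lambda row: (biggest_char_code(str(row)) or 0) < 128)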
flpelerin/slimorca-100k dataset hosted on Hugging Face and contributed by the HF Datasets community
xzuyn/SlimOrca-Dedup-random5k dataset hosted on Hugging Face and contributed by the HF Datasets community
flpelerin/slimorca-1k dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is a ChatML-formatted version of the original SlimOrca-Dedup dataset with a few modifications to the system prompts.
AyanAnsar/slimorca-llama2-1K dataset hosted on Hugging Face and contributed by the HF Datasets community
splm/openchat-spin-slimorca-iter2-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
flpelerin/slimorca-5k dataset hosted on Hugging Face and contributed by the HF Datasets community
flpelerin/slimorca-deduped-cleaned-corrected-text dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for "SlimOrca"
More Information needed
diwank/slimorca-autoj-corrected dataset hosted on Hugging Face and contributed by the HF Datasets community