MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Overview
This is a new curated subset of our OpenOrca data. This release provides an efficient means of reaching performance on par with using larger slices of our data, while only including ~500k GPT-4 completions. The key change in this dataset is that we've done an additional pass, using GPT-4 to remove answers which appear wrong based on the human annotations from the FLAN dataset. This reduces the dataset size to only ~500k entries, allowing training to a similar quality level… See the full description on the dataset page: https://huggingface.co/datasets/Open-Orca/SlimOrca.
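A rough sketch of what such a GPT-4 filtering pass could look like, assuming the OpenAI chat API is used to compare each completion against the FLAN human annotation (the prompt wording and the looks_wrong helper are illustrative assumptions, not the authors' actual pipeline):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def looks_wrong(question, completion, flan_reference):
    # Ask GPT-4 whether the completion contradicts the human-annotated
    # FLAN reference answer; rows flagged YES would be dropped.
    prompt = (
        f"Question: {question}\n"
        f"Reference answer (FLAN): {flan_reference}\n"
        f"Candidate answer: {completion}\n"
        "Does the candidate answer contradict the reference? Reply YES or NO."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")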
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Overview
"SlimOrca Dedup" is a deduplicated, unfiltered subset of the SlimOrca dataset, excluding RLHF instances, resulting in 363k unique examples.
Key Features
Removal of RLHF instances. Deduplication using MinHash and Jaccard similarity techniques.
Demo Models
Note: These models were trained on the full SlimOrca dataset, not the deduplicated, unfiltered version.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Copy of Open-Orca/SlimOrca-Dedup in ChatML format, downsampled to 100k
"SlimOrca Dedup" is a deduplicated, unfiltered subset of the SlimOrca dataset, excluding RLHF instances, resulting in 363k unique examples.
Key Features
Removal of RLHF instances. Deduplication using MinHash and Jaccard similarity techniques.
Demo Models
Note: These models were trained on the full SlimOrca dataset, not the deduplicated, unfiltered version.
… See the full description on the dataset page: https://huggingface.co/datasets/philschmid/slimorca-dedup-chatml-100k.
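For reference, converting a SlimOrca-style record to ChatML mostly amounts to mapping the record's from roles onto ChatML turns; a minimal sketch (the role mapping and record shape are inferred from the example records shown on these dataset pages, not taken from this card's code):

ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def to_chatml(conversations):
    # Wrap each turn in ChatML <|im_start|> ... <|im_end|> markers.
    parts = []
    for turn in conversations:
        role = ROLE_MAP[turn["from"]]
        parts.append(f"<|im_start|>{role}\n{turn['value']}<|im_end|>")
    return "\n".join(parts)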
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for Isotonic/SlimOrca
Dataset Summary
This dataset is a deduplicated version of Open-Orca/OpenOrca (MinHash deduplication with Jaccard threshold = 0.80).
Original dataset size: 4233923
Number of duplicate clusters: 522077
Files in duplicate clusters: 2115143
Unique files in duplicate clusters: 892638
Filtered dataset size: 3011418
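Deduplication at a fixed Jaccard threshold like 0.80 is commonly implemented with MinHash LSH; a minimal sketch using the datasketch library (the tokenization and keys here are illustrative, not this dataset's actual pipeline):

from datasketch import MinHash, MinHashLSH

def minhash_of(text, num_perm=128):
    # Hash each whitespace token into the MinHash signature.
    m = MinHash(num_perm=num_perm)
    for token in text.split():
        m.update(token.encode("utf-8"))
    return m

def dedup(texts, threshold=0.80):
    # Keep a text only if no already-kept text is estimated to exceed
    # the Jaccard similarity threshold.
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for i, text in enumerate(texts):
        m = minhash_of(text)
        if not lsh.query(m):  # no near-duplicate among kept texts
            lsh.insert(str(i), m)
            kept.append(text)
    return kept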
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
CREDIT: https://huggingface.co/cgato. There were some minor formatting errors, which were corrected and pushed to the Open-Orca org. What is this dataset? Half of the Slim Orca Deduped dataset, but further cleaned by removing instances of soft prompting. I removed a ton of prompt prefixes which did not add any information or were redundant, e.g. "Question:", "Q:", "Write the Answer:", "Read this:", "Instructions:". I also removed a ton of prompt suffixes which were simply there to lead the model to answer as… See the full description on the dataset page: https://huggingface.co/datasets/Open-Orca/slimorca-deduped-cleaned-corrected.
ondevicellm/SlimOrca dataset hosted on Hugging Face and contributed by the HF Datasets community
orangetin/SlimOrca-Convo dataset hosted on Hugging Face and contributed by the HF Datasets community
lilac/SlimOrca
This dataset is a Lilac processed dataset. Original dataset: https://huggingface.co/datasets/Open-Orca/SlimOrca
To download the dataset to a local directory:
lilac download lilacai/lilac-SlimOrca
or from Python with (assuming import lilac as ll):
ll.download("lilacai/lilac-SlimOrca")
Open-Orca/SlimOrca-Dedup
{ "processed": true, "4keys": true, "jsonifize": true, "uploaded": true }
LICENSE FOUND AT: https://huggingface.co/datasets/Open-Orca/SlimOrca-Dedup. Reformatting generated by AlignmentLab.AI; please refer to the original authors' work for attribution.
Example line:
{'conversations': [{'from': 'system', 'value': 'You are an AI assistant. You will be given a task. You must generate a detailed and long answer.'}, {'from': 'human', 'value':… See the full description on the dataset page: https://huggingface.co/datasets/jsonifize/SlimOrca-Dedup-4keys.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This is a modified version of the slimorca-deduped-cleaned-corrected dataset. It contains only English characters. Open Orca Slim for Pascal Developers is a subset of the original Open Orca dataset. The Open Orca Slim for Pascal Developers dataset was created with:

from datasets import load_dataset

def biggest_char_code(input_string):
    """
    Returns the largest character code in a string.
    """
    if not input_string:
        return None  # Handle empty string case
    largest_code… See the full description on the dataset page: https://huggingface.co/datasets/schuler/open-orca-slimorca-deduped-cleaned-corrected-for-pascal-txt.
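The snippet above is truncated on the card. A minimal sketch of a plausible completion and usage, assuming the helper is used to keep only rows whose characters fit in 7-bit ASCII (the completion of the body and the filter criterion below are our assumptions, not the card's code):

from datasets import load_dataset

def biggest_char_code(input_string):
    """Returns the largest character code in a string."""
    if not input_string:
        return None  # Handle empty string case
    # Assumed completion: scan the string for the maximum code point.
    return max(ord(ch) for ch in input_string)

# Hypothetical usage: keep only English (ASCII-only) rows.
ds = load_dataset("Open-Orca/slimorca-deduped-cleaned-corrected")
ds = ds.filter(lambda row: (biggest_char_code(str(row)) or 0) < 128)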
flpelerin/slimorca-100k dataset hosted on Hugging Face and contributed by the HF Datasets community
xzuyn/SlimOrca-Dedup-random5k dataset hosted on Hugging Face and contributed by the HF Datasets community
flpelerin/slimorca-1k dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is a ChatML-formatted version of the original SlimOrca-Dedup dataset with a few modifications to the system prompts.
AyanAnsar/slimorca-llama2-1K dataset hosted on Hugging Face and contributed by the HF Datasets community
splm/openchat-spin-slimorca-iter2-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
flpelerin/slimorca-5k dataset hosted on Hugging Face and contributed by the HF Datasets community
flpelerin/slimorca-deduped-cleaned-corrected-text dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for "SlimOrca"
More Information needed
diwank/slimorca-autoj-corrected dataset hosted on Hugging Face and contributed by the HF Datasets community