57 datasets found
  1. h

    SlimOrca

    • huggingface.co
    • opendatalab.com
    Updated Oct 11, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    OpenOrca (2023). SlimOrca [Dataset]. https://huggingface.co/datasets/Open-Orca/SlimOrca
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 11, 2023
    Dataset authored and provided by
    OpenOrca
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Overview

    This is a new curated subset of our OpenOrca data. This release provides an efficient means of reaching performance on-par with using larger slices of our data, while only including ~500k GPT-4 completions. The key change in this dataset is that we've done an additional pass, using GPT-4 to remove answers which appear wrong based on the human annotations from the FLAN dataset. This reduces the dataset size to only ~500k entries, allowing training to a similar quality levelโ€ฆ See the full description on the dataset page: https://huggingface.co/datasets/Open-Orca/SlimOrca.

  2. h

    SlimOrca-Dedup

    • huggingface.co
    Updated Nov 1, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    OpenOrca (2023). SlimOrca-Dedup [Dataset]. https://huggingface.co/datasets/Open-Orca/SlimOrca-Dedup
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 1, 2023
    Dataset authored and provided by
    OpenOrca
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Overview

    "SlimOrca Dedup" is a deduplicated, unfiltered subset of the SlimOrca dataset, excluding RLHF instances, resulting in 363k unique examples.

      Key Features
    

    Removal of RLHF instances. Deduplication using minhash and Jaccard similarity techniques.

      Demo Models
    
    
    
    
    
      Note: These models were trained on the full SlimOrca dataset, not the deduplicated, unfiltered version.
    
  3. h

    slimorca-dedup-chatml-100k

    • huggingface.co
    Updated Feb 29, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Philipp Schmid (2024). slimorca-dedup-chatml-100k [Dataset]. https://huggingface.co/datasets/philschmid/slimorca-dedup-chatml-100k
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 29, 2024
    Authors
    Philipp Schmid
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Copy of Open-Orca/SlimOrca-Dedup in ChatML format downsample to 100k

    "SlimOrca Dedup" is a deduplicated, unfiltered subset of the SlimOrca dataset, excluding RLHF instances, resulting in 363k unique examples.

      Key Features
    

    Removal of RLHF instances. Deduplication using minhash and Jaccard similarity techniques.

      Demo Models
    
    
    
    
    
      Note: These models were trained on the full SlimOrca dataset, not the deduplicated, unfiltered version.
    

    *โ€ฆ See the full description on the dataset page: https://huggingface.co/datasets/philschmid/slimorca-dedup-chatml-100k.

  4. h

    SlimOrca

    • huggingface.co
    Updated Mar 26, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Isotonic (2025). SlimOrca [Dataset]. https://huggingface.co/datasets/Isotonic/SlimOrca
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 26, 2025
    Authors
    Isotonic
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for Isotonic/SlimOrca

      Dataset Summary
    

    This dataset is a deduplicated version of Open-Orca/OpenOrca MinHash Deduplication with Jaccard Threshold = 0.80 Original dataset size: 4233923 Number of duplicate clusters: 522077 Files in duplicate cluster: 2115143 Unique files in duplicate cluster: 892638 Filtered dataset size: 3011418

  5. h

    slimorca-deduped-cleaned-corrected

    • huggingface.co
    Updated Apr 13, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    OpenOrca (2025). slimorca-deduped-cleaned-corrected [Dataset]. https://huggingface.co/datasets/Open-Orca/slimorca-deduped-cleaned-corrected
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 13, 2025
    Dataset authored and provided by
    OpenOrca
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    CREDIT: https://huggingface.co/cgato there was some minor formatting errors in, corrected and pushed to the Open-Orca org* What is this dataset? Half of the Slim Orca Deduped dataset, but further cleaned by removing instances of soft prompting. I removed a ton prompt prefixes which did not add any information or were redundant. Ex. "Question:", "Q:", "Write the Answer:", "Read this:", "Instructions:" I also removed a ton of prompt suffixes which were simply there to lead the model to answer asโ€ฆ See the full description on the dataset page: https://huggingface.co/datasets/Open-Orca/slimorca-deduped-cleaned-corrected.

  6. h

    SlimOrca

    • huggingface.co
    Updated Jan 17, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    On-Device-LLM (2024). SlimOrca [Dataset]. https://huggingface.co/datasets/ondevicellm/SlimOrca
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 17, 2024
    Dataset authored and provided by
    On-Device-LLM
    Description

    ondevicellm/SlimOrca dataset hosted on Hugging Face and contributed by the HF Datasets community

  7. h

    SlimOrca-Convo

    • huggingface.co
    Updated Apr 29, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    orangetin (2024). SlimOrca-Convo [Dataset]. https://huggingface.co/datasets/orangetin/SlimOrca-Convo
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 29, 2024
    Authors
    orangetin
    Description

    orangetin/SlimOrca-Convo dataset hosted on Hugging Face and contributed by the HF Datasets community

  8. h

    lilac-SlimOrca

    • huggingface.co
    Updated Feb 7, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Lilac AI (2024). lilac-SlimOrca [Dataset]. https://huggingface.co/datasets/lilacai/lilac-SlimOrca
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 7, 2024
    Dataset authored and provided by
    Lilac AI
    Description

    lilac/SlimOrca

    This dataset is a Lilac processed dataset. Original dataset: https://huggingface.co/datasets/Open-Orca/SlimOrca To download the dataset to a local directory: lilac download lilacai/lilac-SlimOrca

    or from python with: ll.download("lilacai/lilac-SlimOrca")

  9. h

    SlimOrca-Dedup-4keys

    • huggingface.co
    Updated Jun 14, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    jsonifize (2024). SlimOrca-Dedup-4keys [Dataset]. https://huggingface.co/datasets/jsonifize/SlimOrca-Dedup-4keys
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 14, 2024
    Dataset authored and provided by
    jsonifize
    Description

    Open-Orca/SlimOrca-Dedup

    { "processed": true, "4keys": true, "jsonifize": true, "uploaded": true }

    LICENSE FOUND AT: https://huggingface.co/datasetsOpen-Orca/SlimOrca-Dedup Reformatting generated by AlignmentLab.AI please refer to the original authors work for attribution

      line
    

    {'conversations': [{'from': 'system', 'value': 'You are an AI assistant. You will be given a task. You must generate a detailed and long answer.'}, {'from': 'human', 'value':โ€ฆ See the full description on the dataset page: https://huggingface.co/datasets/jsonifize/SlimOrca-Dedup-4keys.

  10. h

    open-orca-slimorca-deduped-cleaned-corrected-for-pascal-txt

    • huggingface.co
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dr. Joao Paulo Schwarz Schuler, open-orca-slimorca-deduped-cleaned-corrected-for-pascal-txt [Dataset]. https://huggingface.co/datasets/schuler/open-orca-slimorca-deduped-cleaned-corrected-for-pascal-txt
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Authors
    Dr. Joao Paulo Schwarz Schuler
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This is a modified version of the slimorca-deduped-cleaned-corrected dataset. It contains English only characters. Open Orca Slim for Pascal Developers is a subset of the original Open Orca dataset . Open Orca Slim for Pascal Developers dataset was created with: from datasets import load_dataset

    Coded by Gemini

    def biggest_char_code(input_string): """ Returns the largest character code in a string. """ if not input_string: return None # Handle empty string case

    largest_codeโ€ฆ See the full description on the dataset page: https://huggingface.co/datasets/schuler/open-orca-slimorca-deduped-cleaned-corrected-for-pascal-txt.

  11. h

    slimorca-100k

    • huggingface.co
    Updated Jun 10, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Florian Pelerin (2024). slimorca-100k [Dataset]. https://huggingface.co/datasets/flpelerin/slimorca-100k
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 10, 2024
    Authors
    Florian Pelerin
    Description

    flpelerin/slimorca-100k dataset hosted on Hugging Face and contributed by the HF Datasets community

  12. h

    SlimOrca-Dedup-random5k

    • huggingface.co
    Updated Feb 7, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    xzuyn (2024). SlimOrca-Dedup-random5k [Dataset]. https://huggingface.co/datasets/xzuyn/SlimOrca-Dedup-random5k
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 7, 2024
    Authors
    xzuyn
    Description

    xzuyn/SlimOrca-Dedup-random5k dataset hosted on Hugging Face and contributed by the HF Datasets community

  13. h

    slimorca-1k

    • huggingface.co
    Updated Jun 10, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Florian Pelerin (2024). slimorca-1k [Dataset]. https://huggingface.co/datasets/flpelerin/slimorca-1k
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 10, 2024
    Authors
    Florian Pelerin
    Description

    flpelerin/slimorca-1k dataset hosted on Hugging Face and contributed by the HF Datasets community

  14. h

    slimorca-dedup-chatml

    • huggingface.co
    Updated Feb 17, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Radiantloom AI (2024). slimorca-dedup-chatml [Dataset]. https://huggingface.co/datasets/Radiantloom/slimorca-dedup-chatml
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 17, 2024
    Dataset authored and provided by
    Radiantloom AI
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This is a chatml formatted version of original SlimOrca-Dedup dataset with few modifications to the system prompts.

  15. h

    slimorca-llama2-1K

    • huggingface.co
    Updated Nov 1, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ayan Ansar (2024). slimorca-llama2-1K [Dataset]. https://huggingface.co/datasets/AyanAnsar/slimorca-llama2-1K
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 1, 2024
    Authors
    Ayan Ansar
    Description

    AyanAnsar/slimorca-llama2-1K dataset hosted on Hugging Face and contributed by the HF Datasets community

  16. h

    openchat-spin-slimorca-iter2-dataset

    • huggingface.co
    Updated Feb 25, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Self Play Language Models (2024). openchat-spin-slimorca-iter2-dataset [Dataset]. https://huggingface.co/datasets/splm/openchat-spin-slimorca-iter2-dataset
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 25, 2024
    Dataset authored and provided by
    Self Play Language Models
    Description

    splm/openchat-spin-slimorca-iter2-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  17. h

    slimorca-5k

    • huggingface.co
    Updated Jun 10, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Florian Pelerin (2024). slimorca-5k [Dataset]. https://huggingface.co/datasets/flpelerin/slimorca-5k
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 10, 2024
    Authors
    Florian Pelerin
    Description

    flpelerin/slimorca-5k dataset hosted on Hugging Face and contributed by the HF Datasets community

  18. h

    slimorca-deduped-cleaned-corrected-text

    • huggingface.co
    Updated May 25, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Florian Pelerin (2024). slimorca-deduped-cleaned-corrected-text [Dataset]. https://huggingface.co/datasets/flpelerin/slimorca-deduped-cleaned-corrected-text
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 25, 2024
    Authors
    Florian Pelerin
    Description

    flpelerin/slimorca-deduped-cleaned-corrected-text dataset hosted on Hugging Face and contributed by the HF Datasets community

  19. h

    SlimOrca

    • huggingface.co
    Updated Oct 17, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    pxy (2024). SlimOrca [Dataset]. https://huggingface.co/datasets/pxyyy/SlimOrca
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 17, 2024
    Authors
    pxy
    Description

    Dataset Card for "SlimOrca"

    More Information needed

  20. h

    slimorca-autoj-corrected

    • huggingface.co
    Updated Apr 13, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Diwank Tomer (2024). slimorca-autoj-corrected [Dataset]. https://huggingface.co/datasets/diwank/slimorca-autoj-corrected
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 13, 2024
    Authors
    Diwank Tomer
    Description

    diwank/slimorca-autoj-corrected dataset hosted on Hugging Face and contributed by the HF Datasets community

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
OpenOrca (2023). SlimOrca [Dataset]. https://huggingface.co/datasets/Open-Orca/SlimOrca

SlimOrca

SlimOrca

Open-Orca/SlimOrca

Explore at:
89 scholarly articles cite this dataset (View in Google Scholar)
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Oct 11, 2023
Dataset authored and provided by
OpenOrca
License

MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically

Description

Overview

This is a new curated subset of our OpenOrca data. This release provides an efficient means of reaching performance on-par with using larger slices of our data, while only including ~500k GPT-4 completions. The key change in this dataset is that we've done an additional pass, using GPT-4 to remove answers which appear wrong based on the human annotations from the FLAN dataset. This reduces the dataset size to only ~500k entries, allowing training to a similar quality levelโ€ฆ See the full description on the dataset page: https://huggingface.co/datasets/Open-Orca/SlimOrca.

Search
Clear search
Close search
Google apps
Main menu