14 datasets found
  1. h

    slimorca-dedup-chatml-100k

    • huggingface.co
    Updated Jun 5, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Philipp Schmid (2023). slimorca-dedup-chatml-100k [Dataset]. https://huggingface.co/datasets/philschmid/slimorca-dedup-chatml-100k
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 5, 2023
    Authors
    Philipp Schmid
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Copy of Open-Orca/SlimOrca-Dedup in ChatML format downsample to 100k

    "SlimOrca Dedup" is a deduplicated, unfiltered subset of the SlimOrca dataset, excluding RLHF instances, resulting in 363k unique examples.

      Key Features
    

    Removal of RLHF instances. Deduplication using minhash and Jaccard similarity techniques.

      Demo Models
    
    
    
    
    
      Note: These models were trained on the full SlimOrca dataset, not the deduplicated, unfiltered version.
    

    *… See the full description on the dataset page: https://huggingface.co/datasets/philschmid/slimorca-dedup-chatml-100k.

  2. h

    SlimOrca-Dedup

    • huggingface.co
    Updated Jun 5, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    OpenOrca (2023). SlimOrca-Dedup [Dataset]. https://huggingface.co/datasets/Open-Orca/SlimOrca-Dedup
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 5, 2023
    Dataset authored and provided by
    OpenOrca
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Overview

    "SlimOrca Dedup" is a deduplicated, unfiltered subset of the SlimOrca dataset, excluding RLHF instances, resulting in 363k unique examples.

      Key Features
    

    Removal of RLHF instances. Deduplication using minhash and Jaccard similarity techniques.

      Demo Models
    
    
    
    
    
      Note: These models were trained on the full SlimOrca dataset, not the deduplicated, unfiltered version.
    
  3. h

    SlimOrca-enPurified-openai-messages

    • huggingface.co
    Updated Jan 19, 2026
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    C.T. Lahey (2026). SlimOrca-enPurified-openai-messages [Dataset]. https://huggingface.co/datasets/enPurified/SlimOrca-enPurified-openai-messages
    Explore at:
    Dataset updated
    Jan 19, 2026
    Authors
    C.T. Lahey
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for enPurified/SlimOrca-enPurified-openai-messages

    This dataset was updated on January 14th to remove even more math, code, and low quality prose from the dataset. That's why the number below shows it was trimming from 270k.

      Dataset Summary
    

    The enPurified collection is an initiative to curate high-fidelity English prose datasets for language modeling. The primary objective is to isolate high-quality natural language text by strictly excising code… See the full description on the dataset page: https://huggingface.co/datasets/enPurified/SlimOrca-enPurified-openai-messages.

  4. h

    slim-orca-ukrainian

    • huggingface.co
    Updated Jun 5, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Center of Innovations and Defence Technologies Development MOD of UA (2023). slim-orca-ukrainian [Dataset]. https://huggingface.co/datasets/cidtd-mod-ua/slim-orca-ukrainian
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 5, 2023
    Dataset authored and provided by
    Center of Innovations and Defence Technologies Development MOD of UA
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Slim Orca(Deduped) Translated to Ukrainian 🇺🇦

      Dataset Description
    

    A Ukrainian language dataset comprising 350,000+ records translated from the SlimOrca dataset. This dataset is suitable for various natural language processing tasks. Слава Україні!

      Disclaimer
    

    Prepare data before your usage. There are some errors in texts, so be carefull.

      How to Use
    

    This dataset can be loaded using the Hugging Face Datasets library: from datasets import load_dataset… See the full description on the dataset page: https://huggingface.co/datasets/cidtd-mod-ua/slim-orca-ukrainian.

  5. h

    TinyOrca

    • huggingface.co
    Updated Jun 5, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Prince Canuma (2023). TinyOrca [Dataset]. https://huggingface.co/datasets/prince-canuma/TinyOrca
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 5, 2023
    Authors
    Prince Canuma
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Overview

    This is a new curated subset of the SlimOpenOrca data.

      Citation
    

    @misc{TinyOrca, title = {TinyOrca: An Open Dataset of GPT-4 Augmented FLAN Reasoning Traces, with Verification}, author = {Prince Canuma}, year = {2024}, publisher = {HuggingFace}, url = {https://https://huggingface.co/prince-canuma/TinyOrca} }

    @misc{SlimOrca, title = {SlimOrca: An Open Dataset of GPT-4 Augmented FLAN Reasoning Traces, with Verification}, author = {Wing Lian and Guan… See the full description on the dataset page: https://huggingface.co/datasets/prince-canuma/TinyOrca.

  6. slimorca-th-translated

    • huggingface.co
    Updated Dec 25, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    VISAI AI (2025). slimorca-th-translated [Dataset]. https://huggingface.co/datasets/VISAI-AI/slimorca-th-translated
    Explore at:
    Dataset updated
    Dec 25, 2025
    Dataset provided by
    Visai AI Co., Ltd.
    Authors
    VISAI AI
    Description

    Slimorca TH Translated

    A subset of SlimOrca dataset translated using Qwen3-30BA3B-Instruct-2507.

      Limitation
    

    The dataset was translated naively using prompting, this SOMETIMES often caused the translated text to answer the question text instead of translating. Make sure to filter and clean the dataset accordingly.

      Translation Code
    

    import requests import os import hashlib import time from functools import partial from typing import List from concurrent.futures… See the full description on the dataset page: https://huggingface.co/datasets/VISAI-AI/slimorca-th-translated.

  7. h

    calibration-test-01

    • huggingface.co
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ben Buzard, calibration-test-01 [Dataset]. https://huggingface.co/datasets/benbuzard/calibration-test-01
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Authors
    Ben Buzard
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Generated from:

    FineWeb Github Code SlimOrca-Deduped-Cleaned-Corrected Glaive Code Assistant v3 MS Orca Math

  8. h

    SlimOrca_eu

    • huggingface.co
    Updated Jul 28, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Orai NLP technologies (2025). SlimOrca_eu [Dataset]. https://huggingface.co/datasets/orai-nlp/SlimOrca_eu
    Explore at:
    Dataset updated
    Jul 28, 2025
    Dataset authored and provided by
    Orai NLP technologies
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    SlimOrca machine translated instruction dataset for Basque

      Dataset Creation
    
    
    
    
    
      Source Data
    

    Machine translated to Basque from the SlimOrca dataset.

      Annotations
    
    
    
    
    
      Annotation process
    

    Machine translated to Basque from the SlimOrca dataset.

      Citation [optional]
    

    If you use this dataset please cite the following reference: @misc{Llama-eus, title = {Llama-eus-8B, a foundational sub-10 billion parameter LLM for Basque}, author =… See the full description on the dataset page: https://huggingface.co/datasets/orai-nlp/SlimOrca_eu.

  9. h

    GRaPE-preview

    • huggingface.co
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Sweaterdog, GRaPE-preview [Dataset]. https://huggingface.co/datasets/Sweaterdog/GRaPE-preview
    Explore at:
    Authors
    Sweaterdog
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    GRaPE-Preview Dataset

    This dataset May or May not be the final one.

      General & Reasoning
    

    NousResearch/Hermes-3-Dataset Open-Orca/SlimOrca HuggingFaceH4/ultrafeedback_binarized

      Code & STEM
    

    glaiveai/glaive-code-assistant-v3 nickrosh/Evol-Instruct-Code-80k-v1 meta-math/MetaMathQA

      Agentic & Function Calling
    

    Sweaterdog/Andy-4-base gardner/glaive-function-calling-v2-sharegpt

      Uncensored SFT & DPO
    

    NobodyExistsOnTheInternet/ToxicQAtextFiltered… See the full description on the dataset page: https://huggingface.co/datasets/Sweaterdog/GRaPE-preview.

  10. h

    tinyllama-function-calling-eval

    • huggingface.co
    Updated Feb 5, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Gardner Bickford (2024). tinyllama-function-calling-eval [Dataset]. https://huggingface.co/datasets/gardner/tinyllama-function-calling-eval
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 5, 2024
    Authors
    Gardner Bickford
    Description

    Not intended for training

    This dataset is the result of an evaluation run on the model located here: gardner/TinyLlama-1.1B-SlimOrca-Function-Calling-3T

      Format
    

    In this result set, response1 is from the fine tuned model, and response2 is from the test dataset.

  11. h

    hopkok-v1

    • huggingface.co
    Updated Jun 21, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Tim Olsén (2024). hopkok-v1 [Dataset]. https://huggingface.co/datasets/skvarre/hopkok-v1
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 21, 2024
    Authors
    Tim Olsén
    Description

    Dataset Details

    Hopkok is a Swedish instruction dataset, consisting of translated examples, synthetically generated examples and Q&A examples collected from the web. The datasets have been cleaned and curated for training a Swedish large language model. This dataset will be further iterated upon, and it is by no means considered a well cleaned and curated dataset in its current state.

      Dataset Sources
    
    
    
    
    
      SlimOrca-SV-33K
    

    SlimOrca-SV-33K is a machine… See the full description on the dataset page: https://huggingface.co/datasets/skvarre/hopkok-v1.

  12. h

    hyperion-v3.0

    • huggingface.co
    Updated Mar 17, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Sebastian Gabarain (2024). hyperion-v3.0 [Dataset]. https://huggingface.co/datasets/Locutusque/hyperion-v3.0
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 17, 2024
    Authors
    Sebastian Gabarain
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Hyperion-3.0 has significantly improved performance over its predecessors. "I found that having more code datasets than general purpose datasets ironically decreases performance in both coding and general tasks." Data sources:

    OpenOrca/SlimOrca cognitivecomputations/dolphin (300k examples) microsoft/orca-math-word-problems-200k (60k examples) glaiveai/glaive-code-assistant Vezora/Tested-22k-Python-Alpaca Unnatural Instructions BI55/MedText LDJnr/Pure-Dove Various domain-specific datasets by… See the full description on the dataset page: https://huggingface.co/datasets/Locutusque/hyperion-v3.0.

  13. h

    ko-openchat-0404

    • huggingface.co
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Heegyu Kim, ko-openchat-0404 [Dataset]. https://huggingface.co/datasets/heegyu/ko-openchat-0404
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Authors
    Heegyu Kim
    Description

    한국어 챗봇 학습을 위해, 여러 데이터를 가져와서 포멧을 통일

    heegyu/glaive-function-calling-v2-ko: 15170 items FreedomIntelligence/evol-instruct-korean: 59022 items heegyu/PKU-SafeRLHF-ko: 135213 items maywell/koVast: 684579 items MarkrAI/KoCommercial-Dataset: 175454 items HuggingFaceH4/ultrachat_200k: 207865 items Open-Orca/SlimOrca-Dedup: 363491 items glaiveai/glaive-code-assistant-v2: 215166 items

  14. h

    ko-openchat-0406

    • huggingface.co
    Updated Apr 6, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Heegyu Kim (2024). ko-openchat-0406 [Dataset]. https://huggingface.co/datasets/heegyu/ko-openchat-0406
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 6, 2024
    Authors
    Heegyu Kim
    Description

    다음 공개된 데이터를 모두 포멧 통일 후 병합. 이후 1000개를 무작위로 추출하여 test set으로 사용

      지시문 수행(Instruction-Following), 추론(Reasoning), 일반상식(Commonsense)
    

    이 데이터들에도 수학, 코딩 데이터가 섞여있긴 합니다

    FreedomIntelligence/evol-instruct-korean heegyu/OpenOrca-gugugo-ko-len500 MarkrAI/KoCommercial-Dataset heegyu/CoT-collection-ko changpt/ko-lima-vicuna maywell/koVast dbdu/ShareGPT-74k-koHuggingFaceH4/ultrachat_200k Open-Orca/SlimOrca-Dedup

      수학, 코딩, 함수 호출 (Function Calling)
    

    heegyu/glaive-function-calling-v2-ko… See the full description on the dataset page: https://huggingface.co/datasets/heegyu/ko-openchat-0406.

  15. Not seeing a result you expected?
    Learn how you can add new datasets to our index.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Philipp Schmid (2023). slimorca-dedup-chatml-100k [Dataset]. https://huggingface.co/datasets/philschmid/slimorca-dedup-chatml-100k

slimorca-dedup-chatml-100k

SlimOrca Dedup

philschmid/slimorca-dedup-chatml-100k

Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 5, 2023
Authors
Philipp Schmid
License

MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically

Description

Copy of Open-Orca/SlimOrca-Dedup in ChatML format downsample to 100k

"SlimOrca Dedup" is a deduplicated, unfiltered subset of the SlimOrca dataset, excluding RLHF instances, resulting in 363k unique examples.

  Key Features

Removal of RLHF instances. Deduplication using minhash and Jaccard similarity techniques.

  Demo Models





  Note: These models were trained on the full SlimOrca dataset, not the deduplicated, unfiltered version.

*… See the full description on the dataset page: https://huggingface.co/datasets/philschmid/slimorca-dedup-chatml-100k.

Search
Clear search
Close search
Google apps
Main menu