14 datasets found

h
slimorca-dedup-chatml-100k
huggingface.co
Updated Jun 5, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Philipp Schmid (2023). slimorca-dedup-chatml-100k [Dataset]. https://huggingface.co/datasets/philschmid/slimorca-dedup-chatml-100k
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 5, 2023
Authors
Philipp Schmid
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Copy of Open-Orca/SlimOrca-Dedup in ChatML format downsample to 100k

"SlimOrca Dedup" is a deduplicated, unfiltered subset of the SlimOrca dataset, excluding RLHF instances, resulting in 363k unique examples.

Key Features

Removal of RLHF instances. Deduplication using minhash and Jaccard similarity techniques.

Demo Models Note: These models were trained on the full SlimOrca dataset, not the deduplicated, unfiltered version.

*… See the full description on the dataset page: https://huggingface.co/datasets/philschmid/slimorca-dedup-chatml-100k.
h
SlimOrca-Dedup
huggingface.co
Updated Jun 5, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
OpenOrca (2023). SlimOrca-Dedup [Dataset]. https://huggingface.co/datasets/Open-Orca/SlimOrca-Dedup
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 5, 2023
Dataset authored and provided by
OpenOrca
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Overview

"SlimOrca Dedup" is a deduplicated, unfiltered subset of the SlimOrca dataset, excluding RLHF instances, resulting in 363k unique examples.

Key Features

Removal of RLHF instances. Deduplication using minhash and Jaccard similarity techniques.

Demo Models Note: These models were trained on the full SlimOrca dataset, not the deduplicated, unfiltered version.

https://huggingface.co/openaccess-ai-collective/jackalope-7b *… See the full description on the dataset page: https://huggingface.co/datasets/Open-Orca/SlimOrca-Dedup.
h
SlimOrca-enPurified-openai-messages
huggingface.co
Updated Jan 19, 2026
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
C.T. Lahey (2026). SlimOrca-enPurified-openai-messages [Dataset]. https://huggingface.co/datasets/enPurified/SlimOrca-enPurified-openai-messages
Explore at:
Dataset updated
Jan 19, 2026
Authors
C.T. Lahey
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset Card for enPurified/SlimOrca-enPurified-openai-messages

This dataset was updated on January 14th to remove even more math, code, and low quality prose from the dataset. That's why the number below shows it was trimming from 270k.

Dataset Summary

The enPurified collection is an initiative to curate high-fidelity English prose datasets for language modeling. The primary objective is to isolate high-quality natural language text by strictly excising code… See the full description on the dataset page: https://huggingface.co/datasets/enPurified/SlimOrca-enPurified-openai-messages.
h
slim-orca-ukrainian
huggingface.co
Updated Jun 5, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Center of Innovations and Defence Technologies Development MOD of UA (2023). slim-orca-ukrainian [Dataset]. https://huggingface.co/datasets/cidtd-mod-ua/slim-orca-ukrainian
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 5, 2023
Dataset authored and provided by
Center of Innovations and Defence Technologies Development MOD of UA
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Slim Orca(Deduped) Translated to Ukrainian 🇺🇦

Dataset Description

A Ukrainian language dataset comprising 350,000+ records translated from the SlimOrca dataset. This dataset is suitable for various natural language processing tasks. Слава Україні!

Disclaimer

Prepare data before your usage. There are some errors in texts, so be carefull.

How to Use

This dataset can be loaded using the Hugging Face Datasets library: from datasets import load_dataset… See the full description on the dataset page: https://huggingface.co/datasets/cidtd-mod-ua/slim-orca-ukrainian.
h
TinyOrca
huggingface.co
Updated Jun 5, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Prince Canuma (2023). TinyOrca [Dataset]. https://huggingface.co/datasets/prince-canuma/TinyOrca
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 5, 2023
Authors
Prince Canuma
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Overview

This is a new curated subset of the SlimOpenOrca data.

Citation

@misc{TinyOrca, title = {TinyOrca: An Open Dataset of GPT-4 Augmented FLAN Reasoning Traces, with Verification}, author = {Prince Canuma}, year = {2024}, publisher = {HuggingFace}, url = {https://https://huggingface.co/prince-canuma/TinyOrca} }

@misc{SlimOrca, title = {SlimOrca: An Open Dataset of GPT-4 Augmented FLAN Reasoning Traces, with Verification}, author = {Wing Lian and Guan… See the full description on the dataset page: https://huggingface.co/datasets/prince-canuma/TinyOrca.
slimorca-th-translated
huggingface.co
Updated Dec 25, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
VISAI AI (2025). slimorca-th-translated [Dataset]. https://huggingface.co/datasets/VISAI-AI/slimorca-th-translated
Explore at:
Dataset updated
Dec 25, 2025
Dataset provided by
Visai AI Co., Ltd.
Authors
VISAI AI
Description
Slimorca TH Translated

A subset of SlimOrca dataset translated using Qwen3-30BA3B-Instruct-2507.

Limitation

The dataset was translated naively using prompting, this SOMETIMES often caused the translated text to answer the question text instead of translating. Make sure to filter and clean the dataset accordingly.

Translation Code

import requests import os import hashlib import time from functools import partial from typing import List from concurrent.futures… See the full description on the dataset page: https://huggingface.co/datasets/VISAI-AI/slimorca-th-translated.
h
calibration-test-01
huggingface.co
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ben Buzard, calibration-test-01 [Dataset]. https://huggingface.co/datasets/benbuzard/calibration-test-01
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Authors
Ben Buzard
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Generated from:

FineWeb Github Code SlimOrca-Deduped-Cleaned-Corrected Glaive Code Assistant v3 MS Orca Math
h
SlimOrca_eu
huggingface.co
Updated Jul 28, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Orai NLP technologies (2025). SlimOrca_eu [Dataset]. https://huggingface.co/datasets/orai-nlp/SlimOrca_eu
Explore at:
Dataset updated
Jul 28, 2025
Dataset authored and provided by
Orai NLP technologies
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
SlimOrca machine translated instruction dataset for Basque

Dataset Creation Source Data

Machine translated to Basque from the SlimOrca dataset.

Annotations Annotation process

Machine translated to Basque from the SlimOrca dataset.

Citation [optional]

If you use this dataset please cite the following reference: @misc{Llama-eus, title = {Llama-eus-8B, a foundational sub-10 billion parameter LLM for Basque}, author =… See the full description on the dataset page: https://huggingface.co/datasets/orai-nlp/SlimOrca_eu.
h
GRaPE-preview
huggingface.co
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sweaterdog, GRaPE-preview [Dataset]. https://huggingface.co/datasets/Sweaterdog/GRaPE-preview
Explore at:
Authors
Sweaterdog
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
GRaPE-Preview Dataset

This dataset May or May not be the final one.

General & Reasoning

NousResearch/Hermes-3-Dataset Open-Orca/SlimOrca HuggingFaceH4/ultrafeedback_binarized

Code & STEM

glaiveai/glaive-code-assistant-v3 nickrosh/Evol-Instruct-Code-80k-v1 meta-math/MetaMathQA

Agentic & Function Calling

Sweaterdog/Andy-4-base gardner/glaive-function-calling-v2-sharegpt

Uncensored SFT & DPO

NobodyExistsOnTheInternet/ToxicQAtextFiltered… See the full description on the dataset page: https://huggingface.co/datasets/Sweaterdog/GRaPE-preview.
h
tinyllama-function-calling-eval
huggingface.co
Updated Feb 5, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gardner Bickford (2024). tinyllama-function-calling-eval [Dataset]. https://huggingface.co/datasets/gardner/tinyllama-function-calling-eval
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 5, 2024
Authors
Gardner Bickford
Description
Not intended for training

This dataset is the result of an evaluation run on the model located here: gardner/TinyLlama-1.1B-SlimOrca-Function-Calling-3T

Format

In this result set, response1 is from the fine tuned model, and response2 is from the test dataset.
h
hopkok-v1
huggingface.co
Updated Jun 21, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Tim Olsén (2024). hopkok-v1 [Dataset]. https://huggingface.co/datasets/skvarre/hopkok-v1
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 21, 2024
Authors
Tim Olsén
Description
Dataset Details

Hopkok is a Swedish instruction dataset, consisting of translated examples, synthetically generated examples and Q&A examples collected from the web. The datasets have been cleaned and curated for training a Swedish large language model. This dataset will be further iterated upon, and it is by no means considered a well cleaned and curated dataset in its current state.

Dataset Sources SlimOrca-SV-33K

SlimOrca-SV-33K is a machine… See the full description on the dataset page: https://huggingface.co/datasets/skvarre/hopkok-v1.
h
hyperion-v3.0
huggingface.co
Updated Mar 17, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sebastian Gabarain (2024). hyperion-v3.0 [Dataset]. https://huggingface.co/datasets/Locutusque/hyperion-v3.0
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 17, 2024
Authors
Sebastian Gabarain
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Hyperion-3.0 has significantly improved performance over its predecessors. "I found that having more code datasets than general purpose datasets ironically decreases performance in both coding and general tasks." Data sources:

OpenOrca/SlimOrca cognitivecomputations/dolphin (300k examples) microsoft/orca-math-word-problems-200k (60k examples) glaiveai/glaive-code-assistant Vezora/Tested-22k-Python-Alpaca Unnatural Instructions BI55/MedText LDJnr/Pure-Dove Various domain-specific datasets by… See the full description on the dataset page: https://huggingface.co/datasets/Locutusque/hyperion-v3.0.
h
ko-openchat-0404
huggingface.co
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Heegyu Kim, ko-openchat-0404 [Dataset]. https://huggingface.co/datasets/heegyu/ko-openchat-0404
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Authors
Heegyu Kim
Description
한국어 챗봇 학습을 위해, 여러 데이터를 가져와서 포멧을 통일

heegyu/glaive-function-calling-v2-ko: 15170 items FreedomIntelligence/evol-instruct-korean: 59022 items heegyu/PKU-SafeRLHF-ko: 135213 items maywell/koVast: 684579 items MarkrAI/KoCommercial-Dataset: 175454 items HuggingFaceH4/ultrachat_200k: 207865 items Open-Orca/SlimOrca-Dedup: 363491 items glaiveai/glaive-code-assistant-v2: 215166 items
h
ko-openchat-0406
huggingface.co
Updated Apr 6, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Heegyu Kim (2024). ko-openchat-0406 [Dataset]. https://huggingface.co/datasets/heegyu/ko-openchat-0406
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 6, 2024
Authors
Heegyu Kim
Description
다음 공개된 데이터를 모두 포멧 통일 후 병합. 이후 1000개를 무작위로 추출하여 test set으로 사용

지시문 수행(Instruction-Following), 추론(Reasoning), 일반상식(Commonsense)

이 데이터들에도 수학, 코딩 데이터가 섞여있긴 합니다

FreedomIntelligence/evol-instruct-korean heegyu/OpenOrca-gugugo-ko-len500 MarkrAI/KoCommercial-Dataset heegyu/CoT-collection-ko changpt/ko-lima-vicuna maywell/koVast dbdu/ShareGPT-74k-koHuggingFaceH4/ultrachat_200k Open-Orca/SlimOrca-Dedup

수학, 코딩, 함수 호출 (Function Calling)

heegyu/glaive-function-calling-v2-ko… See the full description on the dataset page: https://huggingface.co/datasets/heegyu/ko-openchat-0406.
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

Philipp Schmid (2023). slimorca-dedup-chatml-100k [Dataset]. https://huggingface.co/datasets/philschmid/slimorca-dedup-chatml-100k

slimorca-dedup-chatml-100k

SlimOrca Dedup

philschmid/slimorca-dedup-chatml-100k

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Jun 5, 2023

Authors

Philipp Schmid

License

MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically

Description

Copy of Open-Orca/SlimOrca-Dedup in ChatML format downsample to 100k

"SlimOrca Dedup" is a deduplicated, unfiltered subset of the SlimOrca dataset, excluding RLHF instances, resulting in 363k unique examples.

  Key Features

Removal of RLHF instances. Deduplication using minhash and Jaccard similarity techniques.

  Demo Models





  Note: These models were trained on the full SlimOrca dataset, not the deduplicated, unfiltered version.

*… See the full description on the dataset page: https://huggingface.co/datasets/philschmid/slimorca-dedup-chatml-100k.

Clear search

Close search

Google apps

Main menu

slimorca-dedup-chatml-100k

SlimOrca-Dedup

SlimOrca-enPurified-openai-messages

slim-orca-ukrainian

TinyOrca

slimorca-th-translated

calibration-test-01

SlimOrca_eu

GRaPE-preview

tinyllama-function-calling-eval

hopkok-v1

hyperion-v3.0

ko-openchat-0404

ko-openchat-0406

slimorca-dedup-chatml-100kSee More Versions

SlimOrca Dedup

philschmid/slimorca-dedup-chatml-100k

slimorca-dedup-chatml-100k