MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
🐋 The OpenOrca Dataset! 🐋
We are thrilled to announce the release of the OpenOrca dataset! This rich collection of augmented FLAN data aligns, as best as possible, with the distributions outlined in the Orca paper. It has been instrumental in generating high-performing model checkpoints and serves as a valuable resource for all NLP researchers and developers!
Official Models
Mistral-7B-OpenOrca
Our latest model, the first 7B to score better overall than all… See the full description on the dataset page: https://huggingface.co/datasets/Open-Orca/OpenOrca.
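As a quick orientation, here is a minimal sketch of loading the dataset with the Hugging Face datasets library; the repo id comes from the card above, and streaming is used only so the full multi-million-row corpus is not downloaded up front:

```python
from datasets import load_dataset

# Stream OpenOrca so the full corpus is not downloaded before inspection.
ds = load_dataset("Open-Orca/OpenOrca", split="train", streaming=True)

# Peek at the first record to see the available fields.
first = next(iter(ds))
print(first.keys())
```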
Dataset Card for "Open-Orca-OpenOrca"
More Information needed
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
OpenOrca-KO
A dataset of about 20,000 samples drawn from the OpenOrca dataset and translated into Korean. If you credit this dataset when building a model or dataset from it, it would be a great help to our research 😭😭
Dataset info
NIV // 1,571 entries
FLAN // 9,434 entries
T0 // 6,351 entries
CoT // 2,117 entries
KoCoT // 2,159 entries
Translation
Using DeepL Pro API. Thanks.
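For context, a minimal sketch of the kind of DeepL Pro call such a translation pass would involve; the auth-key placeholder and the EN-to-KO direction are assumptions, since the card only states that the DeepL Pro API was the engine:

```python
import deepl  # pip install deepl

# Authenticate against the DeepL API (placeholder key, not a real credential).
translator = deepl.Translator("YOUR_DEEPL_AUTH_KEY")

# Translate one English instruction into Korean, as described above.
result = translator.translate_text(
    "Summarize the following article in one sentence.",
    target_lang="KO",
)
print(result.text)
```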
Below is the original dataset card
🐋 The OpenOrca Dataset! 🐋
We are thrilled to announce the release of the OpenOrca dataset! This rich collection of augmented FLAN data aligns, as best as possible, with… See the full description on the dataset page: https://huggingface.co/datasets/kyujinpy/OpenOrca-KO.
The dataset used for the Vectara hallucination task, containing OpenOrca questions.
ctuning/MLPerf-OpenOrca dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Overview
This is a new curated subset of our OpenOrca data. This release provides an efficient means of reaching performance on-par with using larger slices of our data, while only including ~500k GPT-4 completions. The key change in this dataset is that we've done an additional pass, using GPT-4 to remove answers which appear wrong based on the human annotations from the FLAN dataset. This reduces the dataset size to only ~500k entries, allowing training to a similar quality level… See the full description on the dataset page: https://huggingface.co/datasets/Open-Orca/SlimOrca.
GPL-3.0 License: https://choosealicense.com/licenses/gpl-3.0/
Dataset Card for "tamil-alpaca"
This repository includes Tamil-translated versions of the Alpaca dataset and a subset of the OpenOrca dataset. This dataset is part of the release of the Tamil LLaMA family of models, an important step in advancing LLMs for the Tamil language. To dive deep into the development and capabilities of this model, please read the research paper and the introductory blog post (WIP) that outline our journey and the model's potential impact. GitHub Repository:… See the full description on the dataset page: https://huggingface.co/datasets/abhinand/tamil-alpaca-orca.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The OpenOrca-Top5Percent Dataset!
We are excited to introduce the OpenOrca-Top5Percent dataset, a refined version of the original OpenOrca dataset. This dataset contains only those entries which utilize the top 5% most frequently used words in the OpenOrca dataset, aiming to focus on high-frequency vocabulary for various NLP tasks.
Dataset Summary
The OpenOrca-Top5Percent dataset is a curated subset of the augmented FLAN Collection data, focusing specifically on entries that… See the full description on the dataset page: https://huggingface.co/datasets/dynopii/OpenOrca-Top5percent.
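Since the card describes the selection criterion but not the build script, here is a minimal sketch of that kind of high-frequency-vocabulary filter; the whitespace tokenization, lowercasing, and the `ratio` knob are all assumptions for illustration:

```python
from collections import Counter

def frequency_filter(entries, ratio=0.05):
    """Keep entries whose words all fall in the top `ratio` most frequent words."""
    # Count word frequencies across the whole corpus (naive whitespace tokens).
    counts = Counter(w for e in entries for w in e.lower().split())
    # The allowed vocabulary is the most frequent slice of distinct words.
    k = max(1, int(len(counts) * ratio))
    vocab = {w for w, _ in counts.most_common(k)}
    # Retain only entries fully covered by that vocabulary.
    return [e for e in entries if all(w in vocab for w in e.lower().split())]

sample = ["the cat sat", "quantum chromodynamics digression", "the cat sat"]
print(frequency_filter(sample, ratio=0.5))  # ['the cat sat', 'the cat sat']
```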
The dataset used for training large language models, with a focus on balancing the text distribution and mitigating overfitting.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for "OpenOrca-tr"
This dataset is part of a series of datasets aimed at advancing Turkish LLM development by establishing a rigorous Turkish dataset collection to enhance the performance of LLMs produced in the Turkish language. malhajar/orca-tr is a translated version of OpenOrca and is the first SFT dataset in the Turkish language, with more than 2M entries! Translated by: Mohamad Alhajar
Dataset Summary
The OpenOrca dataset is a collection of… See the full description on the dataset page: https://huggingface.co/datasets/malhajar/OpenOrca-tr.
VityaVitalich/openorca dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset contains a subsample of 1,500 records from the original Open-Orca/OpenOrca dataset.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
🐋 The OpenOrca-Chinese Dataset! 🐋
Thanks to the release of the Open-Orca/OpenOrca dataset, which has brought a valuable resource to NLP researchers and developers! This is a Traditional Chinese translation of the Open-Orca/OpenOrca dataset, produced with Google Translate, in the hope of making a small contribution to Chinese LLM research.
Dataset Summary
The OpenOrca dataset is a collection of augmented FLAN Collection data. Currently ~1M GPT-4 completions, and ~3.2M GPT-3.5 completions. It is tabularized in alignment with the distributions presented in the ORCA paper and currently represents a partial completion of the full intended dataset, with ongoing… See the full description on the dataset page: https://huggingface.co/datasets/lchakkei/OpenOrca-Traditional-Chinese.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
🍮 The WHOLE FLAN Collection! 🍮
Overview
This repository includes the full dataset from the FLAN Collection, totalling ~300GB as parquets. Generated using the official seqio templating from the Google FLAN Collection GitHub repo. The data is subject to all the same licensing of the component datasets. To keep up with our continued work on OpenOrca and other exciting research, find our Discord here: https://AlignmentLab.ai
Motivation
This work was done as part ofโฆ See the full description on the dataset page: https://huggingface.co/datasets/Open-Orca/FLAN.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset card for 'OpenOrca-zh-20k'
This is the Chinese version of Open-Orca/OpenOrca from Azure99/blossom-orca-v3. Compared to Azure99/blossom-orca-v3:
This dataset extracts all Chinese blossom-orca-v3 samples (around 20K) into a separate zh split.
All samples are formatted in the orca format with an optional system role in the first round.
Instead of using a 1:1 En-Zh ratio as in blossom-orca-v3, this dataset contains 200K GPT-4 generated English samples from OpenOrca in the en… See the full description on the dataset page: https://huggingface.co/datasets/wenbopan/OpenOrca-zh-20k.
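A minimal sketch of pulling just the Chinese portion, assuming the `zh` split described above is addressable directly through the datasets library:

```python
from datasets import load_dataset

# Load only the ~20K-sample Chinese split described in the card.
zh = load_dataset("wenbopan/OpenOrca-zh-20k", split="zh")
print(len(zh))
```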
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
OpenOrca Korean Translation Dataset
We are translating the OpenOrca dataset using Gugugo-koen-7B-V1.1. Please see below for the current translation progress.
Progress
GPT-4 generations: about 640K of roughly 1M translated. GPT-3.5 generations: about 1.59M of roughly 3.5M translated.
Citing this dataset when you use it is a great encouragement to its creator.
Original dataset card: OpenOrca
🐋 The OpenOrca Dataset! 🐋
We are thrilled to announce the release of the OpenOrca dataset! This rich collection of augmented FLAN data aligns, as best as possible, with the distributions outlined in the Orca paper. It has… See the full description on the dataset page: https://huggingface.co/datasets/squarelike/OpenOrca-gugugo-ko.
Dataset Card for "OpenOrca-zh"
More Information needed
OpenOrca-50k Dataset
Description
OpenOrca-50k is a curated subset of the original Open-Orca dataset available on HuggingFace. This subset contains 50,000 random samples from the main dataset. It has been extracted to serve specific research purposes, especially for those requiring a smaller but representative portion of the original dataset. Each entry in the dataset has the following structure:
id: The unique identifier for the sample. system_prompt: System-generated… See the full description on the dataset page: https://huggingface.co/datasets/kimnt93/OpenOrca-50k.
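A minimal sketch of inspecting that structure; the `question` and `response` field names after the truncation are assumed to follow the parent Open-Orca/OpenOrca schema:

```python
from datasets import load_dataset

# Load the 50K-sample subset and print one record's fields.
ds = load_dataset("kimnt93/OpenOrca-50k", split="train")
row = ds[0]
for field in ("id", "system_prompt", "question", "response"):
    print(field, "->", str(row.get(field))[:60])
```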
lilac/OpenOrca
This dataset is a Lilac-processed dataset. Original dataset: https://huggingface.co/datasets/Open-Orca/OpenOrca
To download the dataset to a local directory: lilac download lilacai/lilac-OpenOrca
or from python with: ll.download("lilacai/lilac-OpenOrca")
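For the Python route, the sketch below fills in the import that the one-liner above presupposes; the `import lilac as ll` alias is an assumption based on the `ll` shorthand in the card:

```python
import lilac as ll  # pip install lilac (alias assumed from the card's shorthand)

# Download the processed dataset to the local working directory.
ll.download("lilacai/lilac-OpenOrca")
```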
Sadanto3933/OpenOrca dataset hosted on Hugging Face and contributed by the HF Datasets community