100+ datasets found
  1. OpenOrca

    • huggingface.co
    • opendatalab.com
    Updated Jun 29, 2023
    Cite
    OpenOrca (2023). OpenOrca [Dataset]. https://huggingface.co/datasets/Open-Orca/OpenOrca
    Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 29, 2023
    Dataset authored and provided by
    OpenOrca
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    🐋 The OpenOrca Dataset! 🐋

    We are thrilled to announce the release of the OpenOrca dataset! This rich collection of augmented FLAN data aligns, as best as possible, with the distributions outlined in the Orca paper. It has been instrumental in generating high-performing model checkpoints and serves as a valuable resource for all NLP researchers and developers!

      Official Models

      Mistral-7B-OpenOrca

    Our latest model, the first 7B to score better overall than all… See the full description on the dataset page: https://huggingface.co/datasets/Open-Orca/OpenOrca.

  2. Open-Orca-OpenOrca

    • huggingface.co
    Updated Aug 1, 2023
    Cite
    AGIE AI Technology (2023). Open-Orca-OpenOrca [Dataset]. https://huggingface.co/datasets/agie-ai/Open-Orca-OpenOrca
    Explore at: Croissant
    Dataset updated
    Aug 1, 2023
    Dataset authored and provided by
    AGIE AI Technology
    Description

    Dataset Card for "Open-Orca-OpenOrca"

    More Information needed

  3. OpenOrca-KO

    • huggingface.co
    • opendatalab.com
    Updated Oct 13, 2023
    Cite
    KyujinHan (2023). OpenOrca-KO [Dataset]. https://huggingface.co/datasets/kyujinpy/OpenOrca-KO
    Explore at: Croissant
    Dataset updated
    Oct 13, 2023
    Authors
    KyujinHan
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    OpenOrca-KO

    A dataset of about 20,000 examples sampled from the OpenOrca dataset and translated. If you use this dataset to build a model or another dataset, a brief source attribution would be a great help to our research 😭😭

      Dataset info

    NIV // 1,571 examples
    FLAN // 9,434 examples
    T0 // 6,351 examples
    CoT // 2,117 examples
    KoCoT // 2,159 examples
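The per-source counts above can be checked against the card's "about 20,000" figure with a little arithmetic. This is a minimal sketch that uses only the numbers quoted in the card; the variable names are illustrative choices, not part of the dataset.

```python
# Per-source example counts as listed in the OpenOrca-KO card.
counts = {"NIV": 1571, "FLAN": 9434, "T0": 6351, "CoT": 2117, "KoCoT": 2159}

# Total number of examples across all sources.
total = sum(counts.values())

# Share of each source in the total, as a fraction between 0 and 1.
shares = {name: n / total for name, n in counts.items()}

if __name__ == "__main__":
    print(total)  # 21632
    for name, share in shares.items():
        print(f"{name}: {share:.1%}")
```

The total of 21,632 is consistent with the rounded "about 20,000" in the description, and FLAN is the largest single source.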

      Translation
    

    Using DeepL Pro API. Thanks.

    Below is the original dataset card.

    🐋 The OpenOrca Dataset! 🐋

    We are thrilled to announce the release of the OpenOrca dataset! This rich collection of augmented FLAN data aligns, as best as possible, with… See the full description on the dataset page: https://huggingface.co/datasets/kyujinpy/OpenOrca-KO.

  4. OpenOrca dataset - Dataset - LDM

    • service.tib.eu
    Updated Dec 16, 2024
    Cite
    (2024). OpenOrca dataset - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/openorca-dataset
    Dataset updated
    Dec 16, 2024
    Description

    The dataset used for the Vectara hallucination task, containing OpenOrca questions.

  5. MLPerf-OpenOrca

    • huggingface.co
    Updated Mar 2, 2025
    Cite
    cTuning foundation (2025). MLPerf-OpenOrca [Dataset]. https://huggingface.co/datasets/ctuning/MLPerf-OpenOrca
    Explore at: Croissant
    Dataset updated
    Mar 2, 2025
    Dataset provided by
    CTuning foundation (https://ctuning.org/)
    Authors
    cTuning foundation
    Description

    ctuning/MLPerf-OpenOrca dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. SlimOrca

    • huggingface.co
    • opendatalab.com
    Updated Oct 11, 2023
    Cite
    OpenOrca (2023). SlimOrca [Dataset]. https://huggingface.co/datasets/Open-Orca/SlimOrca
    Explore at: Croissant
    Dataset updated
    Oct 11, 2023
    Dataset authored and provided by
    OpenOrca
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Overview

    This is a new curated subset of our OpenOrca data. This release provides an efficient means of reaching performance on par with using larger slices of our data, while only including ~500k GPT-4 completions. The key change in this dataset is that we've done an additional pass, using GPT-4 to remove answers which appear wrong based on the human annotations from the FLAN dataset. This reduces the dataset size to only ~500k entries, allowing training to a similar quality level… See the full description on the dataset page: https://huggingface.co/datasets/Open-Orca/SlimOrca.

  7. tamil-alpaca-orca

    • huggingface.co
    Updated Nov 13, 2023
    Cite
    Abhinand Balachandran (2023). tamil-alpaca-orca [Dataset]. https://huggingface.co/datasets/abhinand/tamil-alpaca-orca
    Explore at: Croissant
    Dataset updated
    Nov 13, 2023
    Authors
    Abhinand Balachandran
    License

    https://choosealicense.com/licenses/gpl-3.0/

    Description

    Dataset Card for "tamil-alpaca"

    This repository includes Tamil-translated versions of the Alpaca dataset and a subset of the OpenOrca dataset. This dataset is part of the release of the Tamil LLaMA family of models, an important step in advancing LLMs for the Tamil language. To dive deep into the development and capabilities of this model, please read the research paper and the introductory blog post (WIP) that outlines our journey and the model's potential impact. GitHub Repository:… See the full description on the dataset page: https://huggingface.co/datasets/abhinand/tamil-alpaca-orca.

  8. OpenOrca-Top5percent

    • huggingface.co
    Updated Mar 13, 2024
    Cite
    Dynopii Inc (2024). OpenOrca-Top5percent [Dataset]. https://huggingface.co/datasets/dynopii/OpenOrca-Top5percent
    Explore at: Croissant
    Dataset updated
    Mar 13, 2024
    Dataset authored and provided by
    Dynopii Inc
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    🐋 The OpenOrca-Top5Percent Dataset! 🐋

    We are excited to introduce the OpenOrca-Top5Percent dataset, a refined version of the original OpenOrca dataset. This dataset contains only those entries which utilize the top 5% most frequently used words in the OpenOrca dataset, aiming to focus on high-frequency vocabulary for various NLP tasks.

      Dataset Summary
    

    The OpenOrca-Top5Percent dataset is a curated subset of the augmented FLAN Collection data, focusing specifically on entries that… See the full description on the dataset page: https://huggingface.co/datasets/dynopii/OpenOrca-Top5percent.
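The OpenOrca-Top5Percent card does not spell out its exact selection rule, so the following is only a sketch of one plausible reading: keep an entry when every word in it belongs to the top 5% most frequent distinct words in the corpus. The tokenizer, the function names, and the "every word" criterion are all assumptions made for illustration, not the authors' published method.

```python
from collections import Counter
import math
import re


def tokenize(text):
    """Lowercase the text and split it into simple alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def top_fraction_vocab(texts, fraction=0.05):
    """Return the most frequent `fraction` of distinct words across all texts."""
    counts = Counter(word for text in texts for word in tokenize(text))
    # Always keep at least one word so tiny corpora do not yield an empty vocabulary.
    keep = max(1, math.ceil(len(counts) * fraction))
    return {word for word, _ in counts.most_common(keep)}


def filter_entries(texts, fraction=0.05):
    """Keep non-empty entries whose words all fall inside the top-fraction vocabulary."""
    vocab = top_fraction_vocab(texts, fraction)
    kept = []
    for text in texts:
        words = tokenize(text)
        if words and all(word in vocab for word in words):
            kept.append(text)
    return kept


if __name__ == "__main__":
    # Toy corpus; real use would pass the text of each dataset entry instead.
    toy_corpus = ["the cat", "the dog", "the the", "the"]
    print(filter_entries(toy_corpus, fraction=0.3))
```

A stricter or looser rule (for example, requiring only a minimum share of high-frequency words per entry) would change which entries survive, so the threshold and criterion should be checked against the full dataset card before relying on this.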

  9. Yunfan Shao, Linyang Li, Zhaoye Fei, Hang Yan, Dahua Lin, Xipeng Qiu (2024)....

    • service.tib.eu
    Updated Dec 16, 2024
    Cite
    (2024). Yunfan Shao, Linyang Li, Zhaoye Fei, Hang Yan, Dahua Lin, Xipeng Qiu (2024). Dataset: Open-Orca. https://doi.org/10.57702/pmheosqy [Dataset]. https://service.tib.eu/ldmservice/dataset/open-orca
    Dataset updated
    Dec 16, 2024
    Description

    The dataset used for training large language models, with a focus on balancing the text distribution and mitigating overfitting.

  10. OpenOrca-tr

    • huggingface.co
    Updated Apr 2, 2024
    Cite
    Mohamad Alhajar (2024). OpenOrca-tr [Dataset]. https://huggingface.co/datasets/malhajar/OpenOrca-tr
    Explore at: Croissant
    Dataset updated
    Apr 2, 2024
    Authors
    Mohamad Alhajar
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset Card for "OpenOrca-tr"

    This dataset is part of a series aimed at advancing Turkish LLM development by establishing a rigorous Turkish dataset collection to improve the performance of LLMs produced for the Turkish language. malhajar/orca-tr is a translated version of OpenOrca and is the first SFT dataset in Turkish with more than 2M entries. Translated by: Mohamad Alhajar

      Dataset Summary
    

    The OpenOrca dataset is a collection of… See the full description on the dataset page: https://huggingface.co/datasets/malhajar/OpenOrca-tr.

  11. openorca

    • huggingface.co
    Updated May 2, 2024
    Cite
    Viktor Moskvoretskii (2024). openorca [Dataset]. https://huggingface.co/datasets/VityaVitalich/openorca
    Explore at: Croissant
    Dataset updated
    May 2, 2024
    Authors
    Viktor Moskvoretskii
    Description

    VityaVitalich/openorca dataset hosted on Hugging Face and contributed by the HF Datasets community

  12. OpenOrca-1500

    • huggingface.co
    Updated Mar 19, 2014
    Cite
    Muntasir Hossain (2014). OpenOrca-1500 [Dataset]. https://huggingface.co/datasets/MuntasirHossain/OpenOrca-1500
    Explore at: Croissant
    Dataset updated
    Mar 19, 2014
    Authors
    Muntasir Hossain
    Description

    This dataset contains a subsample of 1500 records of the original Open-Orca/OpenOrca dataset.

  13. OpenOrca-Traditional-Chinese

    • huggingface.co
    Updated Mar 28, 2025
    Cite
    Lee Chak Kei (2025). OpenOrca-Traditional-Chinese [Dataset]. https://huggingface.co/datasets/lchakkei/OpenOrca-Traditional-Chinese
    Explore at: Croissant
    Dataset updated
    Mar 28, 2025
    Authors
    Lee Chak Kei
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    🐋 The OpenOrca-Chinese Dataset! 🐋

    Thanks to the release of the Open-Orca/OpenOrca dataset, NLP researchers and developers have gained a valuable resource! This is a Chinese translation of the Open-Orca/OpenOrca dataset, produced with Google Translate, in the hope of making a small contribution to Chinese LLM research.

      Dataset Summary

    The OpenOrca dataset is a collection of augmented FLAN Collection data. Currently ~1M GPT-4 completions, and ~3.2M GPT-3.5 completions. It is tabularized in alignment with the distributions presented in the ORCA paper and currently represents a partial completion of the full intended dataset, with ongoing… See the full description on the dataset page: https://huggingface.co/datasets/lchakkei/OpenOrca-Traditional-Chinese.

  14. FLAN

    • huggingface.co
    Updated Aug 3, 2023
    Cite
    OpenOrca (2023). FLAN [Dataset]. https://huggingface.co/datasets/Open-Orca/FLAN
    Explore at: Croissant
    Dataset updated
    Aug 3, 2023
    Dataset authored and provided by
    OpenOrca
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    🍮 The WHOLE FLAN Collection! 🍮

      Overview
    

    This repository includes the full dataset from the FLAN Collection, totalling ~300GB as parquets. Generated using the official seqio templating from the Google FLAN Collection GitHub repo. The data is subject to all the same licensing of the component datasets. To keep up with our continued work on OpenOrca and other exciting research, find our Discord here: https://AlignmentLab.ai

      Motivation
    

    This work was done as part of… See the full description on the dataset page: https://huggingface.co/datasets/Open-Orca/FLAN.

  15. OpenOrca-zh-20k

    • huggingface.co
    Updated Apr 1, 2024
    Cite
    Belandros Pan (2024). OpenOrca-zh-20k [Dataset]. https://huggingface.co/datasets/wenbopan/OpenOrca-zh-20k
    Explore at: Croissant
    Dataset updated
    Apr 1, 2024
    Authors
    Belandros Pan
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset card for 'OpenOrca-zh-20k'

    This is the Chinese version of Open-Orca/OpenOrca from Azure99/blossom-orca-v3. Compared to Azure99/blossom-orca-v3:

    This dataset extracts all Chinese blossom-orca-v3 samples (around 20K) into a separate zh split.

    All samples are formatted in the orca format with an optional system role in the first round.

    Instead of using a 1:1 En-Zh ratio as in blossom-orca-v3, this dataset contains 200K GPT-4 generated English samples from OpenOrca in the en… See the full description on the dataset page: https://huggingface.co/datasets/wenbopan/OpenOrca-zh-20k.

  16. OpenOrca-gugugo-ko

    • huggingface.co
    Updated Jan 1, 2001
    Cite
    Woojun Jeong (2001). OpenOrca-gugugo-ko [Dataset]. https://huggingface.co/datasets/squarelike/OpenOrca-gugugo-ko
    Explore at: Croissant
    Dataset updated
    Jan 1, 2001
    Authors
    Woojun Jeong
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    OpenOrca Korean Translation Dataset

    We are translating the OpenOrca dataset using Gugugo-koen-7B-V1.1. See below for the translation progress.

      Progress

    About 640,000 of roughly 1 million GPT-4 generations have been translated.
    About 1.59 million of roughly 3.5 million GPT-3.5 generations have been translated.

    Crediting the source when you use this dataset is a great encouragement to its creator.

      Original dataset card: OpenOrca

    🐋 The OpenOrca Dataset! 🐋

    We are thrilled to announce the release of the OpenOrca dataset! This rich collection of augmented FLAN data aligns, as best as possible, with the distributions outlined in the Orca paper. It has… See the full description on the dataset page: https://huggingface.co/datasets/squarelike/OpenOrca-gugugo-ko.
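The translation progress quoted in the OpenOrca-gugugo-ko card can be turned into completion percentages. This is a rough sketch: the card gives only approximate counts, so the results are approximate too, and the names `progress` and `completion` are illustrative.

```python
# Approximate (translated, total) counts as quoted in the card.
progress = {
    "GPT-4": (640_000, 1_000_000),
    "GPT-3.5": (1_590_000, 3_500_000),
}


def completion(done, total):
    """Return the fraction of items translated so far."""
    return done / total


# Per-model completion and the combined figure across both subsets.
per_model = {name: completion(done, total) for name, (done, total) in progress.items()}
overall = completion(
    sum(done for done, _ in progress.values()),
    sum(total for _, total in progress.values()),
)

if __name__ == "__main__":
    for name, fraction in per_model.items():
        print(f"{name}: {fraction:.0%} translated")
    print(f"Overall: {overall:.0%} translated")
```

By these figures roughly two thirds of the GPT-4 subset and a little under half of the GPT-3.5 subset had been translated when the card was written.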

  17. OpenOrca-zh

    • huggingface.co
    Updated Nov 1, 2023
    Cite
    Southern university of science and technology (2023). OpenOrca-zh [Dataset]. https://huggingface.co/datasets/SUSTech/OpenOrca-zh
    Dataset updated
    Nov 1, 2023
    Dataset authored and provided by
    Southern university of science and technology
    Description

    Dataset Card for "OpenOrca-zh"

    More Information needed

  18. OpenOrca-50k

    • huggingface.co
    Updated Mar 19, 2014
    Cite
    Kim (2014). OpenOrca-50k [Dataset]. https://huggingface.co/datasets/kimnt93/OpenOrca-50k
    Explore at: Croissant
    Dataset updated
    Mar 19, 2014
    Authors
    Kim
    Description

    OpenOrca-50k Dataset

      Description
    

    OpenOrca-50k is a curated subset of the original Open-Orca dataset available on HuggingFace. This subset contains 50,000 random samples from the main dataset. It has been extracted to serve specific research purposes, especially for those requiring a smaller but representative portion of the original dataset. Each entry in the dataset has the following structure:

    id: The unique identifier for the sample. system_prompt: System-generated… See the full description on the dataset page: https://huggingface.co/datasets/kimnt93/OpenOrca-50k.
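Subsets such as OpenOrca-50k are described as random samples of the original data, but the cards do not say how the sampling was done. The sketch below shows one common approach, uniform sampling without replacement with a fixed seed for reproducibility. The helper name `subsample`, the seed, and the toy records are assumptions; the records carry only the `id` and `system_prompt` fields that the card names.

```python
import random


def subsample(records, k, seed=0):
    """Return k records drawn uniformly without replacement, reproducibly."""
    if k > len(records):
        raise ValueError("k cannot exceed the number of records")
    # A dedicated Random instance keeps the draw independent of global state.
    rng = random.Random(seed)
    return rng.sample(records, k)


if __name__ == "__main__":
    # Toy stand-in for the full dataset; IDs and prompts are made up.
    toy_records = [{"id": f"toy.{i}", "system_prompt": ""} for i in range(1000)]
    subset = subsample(toy_records, 50, seed=42)
    print(len(subset), subset[0]["id"])
```

Fixing the seed means the same subset can be regenerated later, which matters when a smaller slice is meant to be a stable benchmark or training set.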

  19. lilac-OpenOrca

    • huggingface.co
    Updated Feb 7, 2024
    Cite
    Lilac AI (2024). lilac-OpenOrca [Dataset]. https://huggingface.co/datasets/lilacai/lilac-OpenOrca
    Explore at: Croissant
    Dataset updated
    Feb 7, 2024
    Dataset authored and provided by
    Lilac AI
    Description

    lilac/OpenOrca

    This dataset is a Lilac processed dataset. Original dataset: https://huggingface.co/datasets/Open-Orca/OpenOrca To download the dataset to a local directory: lilac download lilacai/lilac-OpenOrca

    or from python with: ll.download("lilacai/lilac-OpenOrca")

  20. OpenOrca

    • huggingface.co
    Updated Aug 5, 2024
    Cite
    Narmo Sadanto (2024). OpenOrca [Dataset]. https://huggingface.co/datasets/Sadanto3933/OpenOrca
    Explore at: Croissant
    Dataset updated
    Aug 5, 2024
    Authors
    Narmo Sadanto
    Description

    Sadanto3933/OpenOrca dataset hosted on Hugging Face and contributed by the HF Datasets community

Open-Orca/OpenOrca

Explore at:
406 scholarly articles cite this dataset (View in Google Scholar)