MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
🐋 The OpenOrca Dataset! 🐋
We are thrilled to announce the release of the OpenOrca dataset! This rich collection of augmented FLAN data aligns, as best as possible, with the distributions outlined in the Orca paper. It has been instrumental in generating high-performing model checkpoints and serves as a valuable resource for all NLP researchers and developers!
Official Models
Mistral-7B-OpenOrca
Our latest model, the first 7B to score better overall than all… See the full description on the dataset page: https://huggingface.co/datasets/Open-Orca/OpenOrca.
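As a quick orientation, here is a minimal sketch of loading the dataset with the Hugging Face datasets library; the repo id comes from the card above, and streaming is used only so the full multi-million-row corpus is not downloaded up front:

```python
from datasets import load_dataset

# Stream OpenOrca so the full corpus is not downloaded before inspection.
ds = load_dataset("Open-Orca/OpenOrca", split="train", streaming=True)

# Peek at the first record to see the available fields.
first = next(iter(ds))
print(first.keys())
```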
Dataset Card for "Open-Orca-OpenOrca"
More Information needed
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
OpenOrca-KO
A dataset of about 20,000 samples drawn from the OpenOrca dataset and translated into Korean. If you credit this dataset when building a model or dataset from it, it would be a great help to our research 😭😭
Dataset info
NIV // 1,571 entries
FLAN // 9,434 entries
T0 // 6,351 entries
CoT // 2,117 entries
KoCoT // 2,159 entries
Translation
Using DeepL Pro API. Thanks.
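For context, a minimal sketch of the kind of DeepL Pro call such a translation pass would involve; the auth-key placeholder and the EN-to-KO direction are assumptions, since the card only states that the DeepL Pro API was the engine:

```python
import deepl  # pip install deepl

# Authenticate against the DeepL API (placeholder key, not a real credential).
translator = deepl.Translator("YOUR_DEEPL_AUTH_KEY")

# Translate one English instruction into Korean, as described above.
result = translator.translate_text(
    "Summarize the following article in one sentence.",
    target_lang="KO",
)
print(result.text)
```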
Below is the original dataset card
🐋 The OpenOrca Dataset! 🐋
We are thrilled to announce the release of the OpenOrca dataset! This rich collection of augmented FLAN data aligns, as best as possible, with… See the full description on the dataset page: https://huggingface.co/datasets/kyujinpy/OpenOrca-KO.
The dataset used for the Vectara hallucination task, containing OpenOrca questions.
ctuning/MLPerf-OpenOrca dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Overview
This is a new curated subset of our OpenOrca data. This release provides an efficient means of reaching performance on-par with using larger slices of our data, while only including ~500k GPT-4 completions. The key change in this dataset is that we've done an additional pass, using GPT-4 to remove answers which appear wrong based on the human annotations from the FLAN dataset. This reduces the dataset size to only ~500k entries, allowing training to a similar quality level… See the full description on the dataset page: https://huggingface.co/datasets/Open-Orca/SlimOrca.
GPL-3.0 License: https://choosealicense.com/licenses/gpl-3.0/
Dataset Card for "tamil-alpaca"
This repository includes Tamil-translated versions of the Alpaca dataset and a subset of the OpenOrca dataset. This dataset is part of the release of the Tamil LLaMA family of models, an important step in advancing LLMs for the Tamil language. To dive deep into the development and capabilities of this model, please read the research paper and the introductory blog post (WIP) that outline our journey and the model's potential impact. GitHub Repository:… See the full description on the dataset page: https://huggingface.co/datasets/abhinand/tamil-alpaca-orca.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The OpenOrca-Top5Percent Dataset!
We are excited to introduce the OpenOrca-Top5Percent dataset, a refined version of the original OpenOrca dataset. This dataset contains only those entries which utilize the top 5% most frequently used words in the OpenOrca dataset, aiming to focus on high-frequency vocabulary for various NLP tasks.
Dataset Summary
The OpenOrca-Top5Percent dataset is a curated subset of the augmented FLAN Collection data, focusing specifically on entries that… See the full description on the dataset page: https://huggingface.co/datasets/dynopii/OpenOrca-Top5percent.
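Since the card describes the selection criterion but not the build script, here is a minimal sketch of that kind of high-frequency-vocabulary filter; the whitespace tokenization, lowercasing, and the `ratio` knob are all assumptions for illustration:

```python
from collections import Counter

def frequency_filter(entries, ratio=0.05):
    """Keep entries whose words all fall in the top `ratio` most frequent words."""
    # Count word frequencies across the whole corpus (naive whitespace tokens).
    counts = Counter(w for e in entries for w in e.lower().split())
    # The allowed vocabulary is the most frequent slice of distinct words.
    k = max(1, int(len(counts) * ratio))
    vocab = {w for w, _ in counts.most_common(k)}
    # Retain only entries fully covered by that vocabulary.
    return [e for e in entries if all(w in vocab for w in e.lower().split())]

sample = ["the cat sat", "quantum chromodynamics digression", "the cat sat"]
print(frequency_filter(sample, ratio=0.5))  # ['the cat sat', 'the cat sat']
```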
The dataset used for training large language models, with a focus on balancing the text distribution and mitigating overfitting.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for "OpenOrca-tr"
This dataset is part of a series of datasets aimed at advancing Turkish LLM development by establishing a rigorous Turkish dataset collection to enhance the performance of LLMs produced in the Turkish language. malhajar/orca-tr is a translated version of OpenOrca and is the first SFT dataset in the Turkish language, with more than 2M entries! Translated by: Mohamad Alhajar
Dataset Summary
The OpenOrca dataset is a collection of… See the full description on the dataset page: https://huggingface.co/datasets/malhajar/OpenOrca-tr.
VityaVitalich/openorca dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset contains a subsample of 1,500 records from the original Open-Orca/OpenOrca dataset.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
🐋 The OpenOrca-Chinese Dataset! 🐋
Thanks to the release of the Open-Orca/OpenOrca dataset, which has brought a valuable resource to NLP researchers and developers! This is a Traditional Chinese translation of the Open-Orca/OpenOrca dataset, produced with Google Translate, in the hope of making a small contribution to Chinese LLM research.
Dataset Summary
The OpenOrca dataset is a collection of augmented FLAN Collection data. Currently ~1M GPT-4 completions, and ~3.2M GPT-3.5 completions. It is tabularized in alignment with the distributions presented in the ORCA paper and currently represents a partial completion of the full intended dataset, with ongoing… See the full description on the dataset page: https://huggingface.co/datasets/lchakkei/OpenOrca-Traditional-Chinese.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
🍮 The WHOLE FLAN Collection! 🍮
Overview
This repository includes the full dataset from the FLAN Collection, totalling ~300GB as parquets. Generated using the official seqio templating from the Google FLAN Collection GitHub repo. The data is subject to all the same licensing of the component datasets. To keep up with our continued work on OpenOrca and other exciting research, find our Discord here: https://AlignmentLab.ai
Motivation
This work was done as part ofโฆ See the full description on the dataset page: https://huggingface.co/datasets/Open-Orca/FLAN.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset card for 'OpenOrca-zh-20k'
This is the Chinese version of Open-Orca/OpenOrca from Azure99/blossom-orca-v3. Compared to Azure99/blossom-orca-v3:
This dataset extracts all Chinese blossom-orca-v3 samples (around 20K) into a separate zh split.
All samples are formatted in the orca format with an optional system role in the first round.
Instead of using a 1:1 En-Zh ratio as in blossom-orca-v3, this dataset contains 200K GPT-4 generated English samples from OpenOrca in the en… See the full description on the dataset page: https://huggingface.co/datasets/wenbopan/OpenOrca-zh-20k.
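A minimal sketch of pulling just the Chinese portion, assuming the `zh` split described above is addressable directly through the datasets library:

```python
from datasets import load_dataset

# Load only the ~20K-sample Chinese split described in the card.
zh = load_dataset("wenbopan/OpenOrca-zh-20k", split="zh")
print(len(zh))
```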
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
OpenOrca Korean Translation Dataset
We are translating the OpenOrca dataset using Gugugo-koen-7B-V1.1. Please see below for the current translation progress.
Progress
GPT-4 generations: about 640K of roughly 1M translated. GPT-3.5 generations: about 1.59M of roughly 3.5M translated.
Citing this dataset when you use it is a great encouragement to its creator.
Original dataset card: OpenOrca
🐋 The OpenOrca Dataset! 🐋
We are thrilled to announce the release of the OpenOrca dataset! This rich collection of augmented FLAN data aligns, as best as possible, with the distributions outlined in the Orca paper. It has… See the full description on the dataset page: https://huggingface.co/datasets/squarelike/OpenOrca-gugugo-ko.
Dataset Card for "OpenOrca-zh"
More Information needed
OpenOrca-50k Dataset
Description
OpenOrca-50k is a curated subset of the original Open-Orca dataset available on HuggingFace. This subset contains 50,000 random samples from the main dataset. It has been extracted to serve specific research purposes, especially for those requiring a smaller but representative portion of the original dataset. Each entry in the dataset has the following structure:
id: The unique identifier for the sample. system_prompt: System-generated… See the full description on the dataset page: https://huggingface.co/datasets/kimnt93/OpenOrca-50k.
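A minimal sketch of inspecting that structure; the `question` and `response` field names after the truncation are assumed to follow the parent Open-Orca/OpenOrca schema:

```python
from datasets import load_dataset

# Load the 50K-sample subset and print one record's fields.
ds = load_dataset("kimnt93/OpenOrca-50k", split="train")
row = ds[0]
for field in ("id", "system_prompt", "question", "response"):
    print(field, "->", str(row.get(field))[:60])
```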
lilac/OpenOrca
This dataset is a Lilac-processed dataset. Original dataset: https://huggingface.co/datasets/Open-Orca/OpenOrca
To download the dataset to a local directory: lilac download lilacai/lilac-OpenOrca
or from python with: ll.download("lilacai/lilac-OpenOrca")
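For the Python route, the sketch below fills in the import that the one-liner above presupposes; the `import lilac as ll` alias is an assumption based on the `ll` shorthand in the card:

```python
import lilac as ll  # pip install lilac (alias assumed from the card's shorthand)

# Download the processed dataset to the local working directory.
ll.download("lilacai/lilac-OpenOrca")
```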
Sadanto3933/OpenOrca dataset hosted on Hugging Face and contributed by the HF Datasets community