88 datasets found

OpenOrca
huggingface.co
opendatalab.com
Updated Jun 29, 2023
Cite
OpenOrca (2023). OpenOrca [Dataset]. https://huggingface.co/datasets/Open-Orca/OpenOrca
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 29, 2023
Dataset authored and provided by
OpenOrca
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
🐋 The OpenOrca Dataset! 🐋

We are thrilled to announce the release of the OpenOrca dataset! This rich collection of augmented FLAN data aligns, as best as possible, with the distributions outlined in the Orca paper. It has been instrumental in generating high-performing model checkpoints and serves as a valuable resource for all NLP researchers and developers!

Official Models Mistral-7B-OpenOrca

Our latest model, the first 7B to score better overall than all… See the full description on the dataset page: https://huggingface.co/datasets/Open-Orca/OpenOrca.
SlimOrca
huggingface.co
opendatalab.com
Updated Oct 11, 2023
Cite
OpenOrca (2023). SlimOrca [Dataset]. https://huggingface.co/datasets/Open-Orca/SlimOrca
Dataset updated
Oct 11, 2023
Dataset authored and provided by
OpenOrca
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
Overview

This is a new curated subset of our OpenOrca data. This release provides an efficient means of reaching performance on par with larger slices of our data, while including only ~500k GPT-4 completions. The key change in this dataset is that we've done an additional pass, using GPT-4 to remove answers which appear wrong based on the human annotations from the FLAN dataset. This reduces the dataset size to only ~500k entries, allowing training to a similar quality level… See the full description on the dataset page: https://huggingface.co/datasets/Open-Orca/SlimOrca.
Open-Orca-OpenOrca
huggingface.co
Updated Aug 1, 2023
Cite
AGIE AI Technology (2023). Open-Orca-OpenOrca [Dataset]. https://huggingface.co/datasets/agie-ai/Open-Orca-OpenOrca
Dataset updated
Aug 1, 2023
Dataset authored and provided by
AGIE AI Technology
Description
Dataset Card for "Open-Orca-OpenOrca"

More Information needed
Open Orca Dataset Embedding
kaggle.com
Updated Feb 29, 2024
Cite
Atman_Coder (2024). Open Orca Dataset Embedding [Dataset]. https://www.kaggle.com/datasets/atmancoder/open-orca-dataset-embedding/data
Dataset updated
Feb 29, 2024
Dataset provided by
Kaggle: http://kaggle.com/
Authors
Atman_Coder
Description
Dataset

This dataset was created by Atman_Coder

Contents
FLAN
huggingface.co
Updated Aug 3, 2023
Cite
OpenOrca (2023). FLAN [Dataset]. https://huggingface.co/datasets/Open-Orca/FLAN
Dataset updated
Aug 3, 2023
Dataset authored and provided by
OpenOrca
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
🍮 The WHOLE FLAN Collection! 🍮

Overview

This repository includes the full dataset from the FLAN Collection, totalling ~300GB as parquets. Generated using the official seqio templating from the Google FLAN Collection GitHub repo. The data is subject to all the same licensing of the component datasets. To keep up with our continued work on OpenOrca and other exciting research, find our Discord here: https://AlignmentLab.ai

Motivation

This work was done as part of… See the full description on the dataset page: https://huggingface.co/datasets/Open-Orca/FLAN.
Orca - Dataset - NASA Open Data Portal
data.nasa.gov
data.staging.idas-ds1.appdat.jsc.nasa.gov
Updated Mar 28, 2010
Cite
nasa.gov (2025). Orca - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/orca
Dataset updated
Mar 28, 2010
Dataset provided by
NASA: http://nasa.gov/
Description
Orca is a data-driven, unsupervised anomaly detection algorithm that uses a distance-based approach. It uses a novel pruning rule that allows it to run in nearly linear time. Orca was co-developed by Stephen Bay of ISLE and Mark Schwabacher of NASA ARC. More information about Orca, including downloadable software, can be found here: http://stephenbay.net/orca/ A conference paper about Orca can be found here: https://dashlink.arc.nasa.gov/paper/mining-distance-based-outliers-in-near-linear-time/
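To make the idea of distance-based outlier scoring with a pruning rule concrete, here is a minimal Python sketch. It is not NASA's Orca software: the function name `top_outliers`, the choice of score (average distance to the k nearest neighbours), and all parameters are assumptions made for illustration. A candidate is dropped as soon as its running score falls below the weakest score among the outliers found so far, which is the kind of shortcut that lets most normal points be discarded after only a few comparisons.

```python
import heapq
import math
import random


def top_outliers(points, k=3, n_outliers=2, seed=0):
    """Return [(score, index)] for the n_outliers highest-scoring points.

    Score (an assumption for this sketch): average Euclidean distance
    to the k nearest neighbours. Larger means more anomalous.
    """
    if len(points) <= k:
        raise ValueError("need more than k points")
    rng = random.Random(seed)
    order = list(range(len(points)))
    rng.shuffle(order)  # a random scan order makes early pruning more likely

    cutoff = 0.0  # weakest score among the current top outliers
    top = []      # min-heap of (score, index)

    for i, p in enumerate(points):
        nearest = []  # max-heap (negated distances) of the k closest so far
        pruned = False
        for j in order:
            if j == i:
                continue
            d = math.dist(p, points[j])
            if len(nearest) < k:
                heapq.heappush(nearest, -d)
            elif d < -nearest[0]:
                heapq.heapreplace(nearest, -d)
            # The running score can only shrink as closer neighbours appear,
            # so once it drops below the cutoff the point cannot be a top outlier.
            if len(nearest) == k and -sum(nearest) / k < cutoff:
                pruned = True
                break
        if pruned:
            continue
        heapq.heappush(top, (-sum(nearest) / k, i))
        if len(top) > n_outliers:
            heapq.heappop(top)
        if len(top) == n_outliers:
            cutoff = top[0][0]

    return sorted(top, reverse=True)


points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10), (-8, 7)]
result = top_outliers(points, k=2, n_outliers=2)
```

In this toy input the two points far from the unit-square cluster are the ones returned; the cluster points are either pruned mid-scan or pushed out of the heap.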
slimorca-deduped-cleaned-corrected
huggingface.co
Updated Apr 13, 2025
Cite
OpenOrca (2025). slimorca-deduped-cleaned-corrected [Dataset]. https://huggingface.co/datasets/Open-Orca/slimorca-deduped-cleaned-corrected
Dataset updated
Apr 13, 2025
Dataset authored and provided by
OpenOrca
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
CREDIT: https://huggingface.co/cgato. There were some minor formatting errors, which were corrected and pushed to the Open-Orca org. What is this dataset? Half of the Slim Orca Deduped dataset, but further cleaned by removing instances of soft prompting. I removed a ton of prompt prefixes which did not add any information or were redundant, e.g. "Question:", "Q:", "Write the Answer:", "Read this:", "Instructions:". I also removed a ton of prompt suffixes which were simply there to lead the model to answer as… See the full description on the dataset page: https://huggingface.co/datasets/Open-Orca/slimorca-deduped-cleaned-corrected.
SurgeGlobal/Orca Dataset
paperswithcode.com
Updated Apr 17, 2024
Cite
Chandeepa Dissanayake; Lahiru Lowe; Sachith Gunasekara; Yasiru Ratnayake (2024). SurgeGlobal/Orca Dataset [Dataset]. https://paperswithcode.com/dataset/surgeglobal-orca
Dataset updated
Apr 17, 2024
Authors
Chandeepa Dissanayake; Lahiru Lowe; Sachith Gunasekara; Yasiru Ratnayake
Description
Dataset Generation

- Base Model: h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2
- Seed Instructions: Derived from the FLAN-v2 Collection.
- Generation Approach: Explanation tuning with detailed responses generated from h2ogpt-gm-oasst1-en-2048-falcon-40b-v2.
- Total Instructions: 5,507 explanation tuning data samples.

Dataset Sources

- Repository: Bitbucket Project
- Paper: Pre-Print

Structure

The dataset entries consist of:
- Query
- Response
- System Message (when applicable)

Usage

The Orca Dataset is intended for fine-tuning language models to not only imitate the style but also the reasoning process of LFMs, thereby improving the safety and quality of the models' responses.

Citation

If you find our work useful, please cite our paper as follows:

@misc{surge2024openbezoar,
  title={OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data},
  author={Chandeepa Dissanayake and Lahiru Lowe and Sachith Gunasekara and Yasiru Ratnayake},
  year={2024},
  eprint={2404.12195},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Dataset Authors

Chandeepa Dissanayake, Lahiru Lowe, Sachith Gunasekara, and Yasiru Ratnayake
SlimOrca-Dedup
huggingface.co
Updated Nov 1, 2023
Cite
OpenOrca (2023). SlimOrca-Dedup [Dataset]. https://huggingface.co/datasets/Open-Orca/SlimOrca-Dedup
Dataset updated
Nov 1, 2023
Dataset authored and provided by
OpenOrca
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
Overview

"SlimOrca Dedup" is a deduplicated, unfiltered subset of the SlimOrca dataset, excluding RLHF instances, resulting in 363k unique examples.

Key Features

Removal of RLHF instances. Deduplication using minhash and Jaccard similarity techniques.
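
For readers unfamiliar with the technique, the following is a minimal sketch of near-duplicate removal with MinHash signatures and an estimated Jaccard similarity. It is not the pipeline used to build SlimOrca-Dedup: the shingle size, number of hash functions, similarity threshold, and every function name here are assumptions chosen for illustration, and the greedy all-pairs comparison would need an index such as LSH to scale to hundreds of thousands of examples.

```python
import hashlib


def shingles(text, n=3):
    """Set of word n-grams; short texts yield a single shingle."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def minhash_signature(items, num_perm=64):
    """One minimum per salted hash function; salts stand in for permutations."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates the Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def dedup(texts, threshold=0.8, num_perm=64):
    """Greedily keep a text unless it is too similar to one already kept."""
    kept, kept_sigs = [], []
    for text in texts:
        sig = minhash_signature(shingles(text), num_perm)
        if any(estimated_jaccard(sig, other) >= threshold for other in kept_sigs):
            continue
        kept.append(text)
        kept_sigs.append(sig)
    return kept


texts = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog",
    "completely different sentence about whales and datasets",
]
unique = dedup(texts)
```

The threshold trades recall for precision: a lower value removes more paraphrased duplicates but risks dropping distinct examples.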

Demo Models Note: These models were trained on the full SlimOrca dataset, not the deduplicated, unfiltered version.

https://huggingface.co/openaccess-ai-collective/jackalope-7b *… See the full description on the dataset page: https://huggingface.co/datasets/Open-Orca/SlimOrca-Dedup.
ORCAS Dataset
paperswithcode.com
opendatalab.com
Updated Dec 20, 2020
Cite
Nick Craswell; Daniel Campos; Bhaskar Mitra; Emine Yilmaz; Bodo Billerbeck (2020). ORCAS Dataset [Dataset]. https://paperswithcode.com/dataset/orcas
Dataset updated
Dec 20, 2020
Authors
Nick Craswell; Daniel Campos; Bhaskar Mitra; Emine Yilmaz; Bodo Billerbeck
Description
ORCAS is a click-based dataset. It covers 1.4 million of the TREC DL documents, providing 18 million connections to 10 million distinct queries.
Yunfan Shao, Linyang Li, Zhaoye Fei, Hang Yan, Dahua Lin, Xipeng Qiu (2024)....
service.tib.eu
Updated Dec 16, 2024
Cite
(2024). Yunfan Shao, Linyang Li, Zhaoye Fei, Hang Yan, Dahua Lin, Xipeng Qiu (2024). Dataset: Open-Orca. https://doi.org/10.57702/pmheosqy [Dataset]. https://service.tib.eu/ldmservice/dataset/open-orca
Dataset updated
Dec 16, 2024
Description
The dataset used for training large language models, with a focus on balancing the text distribution and mitigating overfitting.
open-orca-flan
huggingface.co
Updated Oct 17, 2024
Cite
Ryan Marten (2024). open-orca-flan [Dataset]. https://huggingface.co/datasets/ryanmarten/open-orca-flan
Dataset updated
Oct 17, 2024
Authors
Ryan Marten
Description
ryanmarten/open-orca-flan dataset hosted on Hugging Face and contributed by the HF Datasets community
open-orca-cot-judged-test
huggingface.co
Updated Nov 30, 2024
Cite
Ryan Marten (2024). open-orca-cot-judged-test [Dataset]. https://huggingface.co/datasets/ryanmarten/open-orca-cot-judged-test
Dataset updated
Nov 30, 2024
Authors
Ryan Marten
Description
ryanmarten/open-orca-cot-judged-test dataset hosted on Hugging Face and contributed by the HF Datasets community
ORCA pipeline
data.oregon.gov
application/rdfxml +5
Updated Jul 21, 2025
Cite
Oregon Housing and Community Services (2025). ORCA pipeline [Dataset]. https://data.oregon.gov/dataset/ORCA-pipeline/uphc-pyji
Available download formats: csv, json, application/rssxml, xml, application/rdfxml, tsv
Dataset updated
Jul 21, 2025
Dataset authored and provided by
Oregon Housing and Community Services
Description
ORCA applications received and in-process
Identification Catalogue of the Killer whales that frequent inner Vestlandet...
figshare.com
pdf
Updated Mar 29, 2022
Cite
Eve Jourdain; Vegard Byrkjeland Aasen; Olve Vaagø Erdal; Simon Johnsen (2022). Identification Catalogue of the Killer whales that frequent inner Vestlandet [Dataset]. http://doi.org/10.6084/m9.figshare.19350476.v1
Available download formats: pdf
Unique identifier
https://doi.org/10.6084/m9.figshare.19350476.v1
Dataset updated
Mar 29, 2022
Dataset provided by
figshare
Authors
Eve Jourdain; Vegard Byrkjeland Aasen; Olve Vaagø Erdal; Simon Johnsen
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Western Norway
Description
This ID-Catalogue (English and Norwegian versions available) compiles ID-photographs of the killer whales that were encountered regularly in the inner fjords of the Vestlandet region (Norway) between 2016 and 2021.
Integrated Building Health Management - Dataset - NASA Open Data Portal
data.nasa.gov
Updated Mar 31, 2025
Cite
nasa.gov (2025). Integrated Building Health Management - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/integrated-building-health-management
Dataset updated
Mar 31, 2025
Dataset provided by
NASA: http://nasa.gov/
Description
Abstract: Building health management is an important part of running an efficient and cost-effective building. Many problems in a building's system can go undetected for long periods of time, leading to expensive repairs or wasted resources. This project aims to help detect and diagnose the building's health with data-driven methods throughout the day. Orca and IMS are two state-of-the-art algorithms that observe an array of building health sensors and provide feedback on the overall system's health as well as localize the problem to one, or possibly two, components. With this level of feedback the hope is to quickly identify problems and provide appropriate maintenance while reducing the number of complaints and service calls.

Introduction: To prepare these technologies for the new installation, the proposed methods are being tested on a current system that behaves similarly to the future green building. Building 241 was determined to best resemble the proposed building 232 and therefore was chosen for this study. Building 241 is currently outfitted with 34 sensors that monitor the heating and cooling temperatures for the air and water systems as well as various other subsystem states. The daily sensor recordings were logged and sent to the IDU group for analysis. The period of analysis ran from July 1st through August 10th 2009.

Methodology: The two algorithms used for analysis were Orca and IMS. Both methods look for anomalies using a distance-based scoring approach. Orca can take a single data set and find outliers within that data set. This tactic was applied to each day. After scoring each time sample throughout a given day, the Orca score profiles were compared by computing the correlation against all other days. Days with high overall correlations were considered normal; days with lower overall correlations were considered more anomalous. IMS, on the other hand, needs a normal set of data to build a model, which can then be applied to a set of test data to assess how anomalous that data set is. The typical days identified by Orca were used as the reference/training set for IMS, while all the other days were passed through IMS, resulting in an anomaly score profile for each day. The mean of the IMS score profile was then calculated for each day to produce a summary IMS score. These summary scores were ranked and the top outliers were identified (see Figure 1). Once the anomalies were identified, the contributing parameters were ranked by the algorithm.

Analysis: The contributing parameters identified by IMS were localized to the return air temperature duct system.
- 7/03/09 (Figure 2 & 3): AHU-1 Return Air Temperature (RAT); Calculated Average Return Air Temperature
- 7/19/09 (Figure 3 & 4): AHU-2 Return Air Temperature (RAT); Calculated Average Return Air Temperature
IMS identified significantly higher temperatures compared to other days during the months of July and August.

Conclusion: The proposed algorithms Orca and IMS were able to pick up significant anomalies in the building system as well as diagnose each anomaly by identifying the sensor values that were anomalous. In the future these methods can be used on live streaming data to produce a real-time anomaly score that helps building maintenance detect and diagnose problems.
ORCA Waitlist
data.oregon.gov
application/rdfxml +5
Updated Jun 24, 2025
Cite
Oregon Housing and Community Services (2025). ORCA Waitlist [Dataset]. https://data.oregon.gov/dataset/ORCA-Waitlist/3xud-wvpj
Available download formats: tsv, application/rssxml, csv, application/rdfxml, json, xml
Dataset updated
Jun 24, 2025
Dataset authored and provided by
Oregon Housing and Community Services
Description
A list of projects currently on the ORCA waitlist
FIN Average MPG and Fuel Usage
internal.open.piercecountywa.gov
open.piercecountywa.gov
application/rdfxml +5
Updated Feb 28, 2024
Cite
Finance (2024). FIN Average MPG and Fuel Usage [Dataset]. https://internal.open.piercecountywa.gov/Finance/FIN-Average-MPG-and-Fuel-Usage/kirv-exad
Available download formats: application/rdfxml, tsv, xml, csv, json, application/rssxml
Dataset updated
Feb 28, 2024
Dataset authored and provided by
Finance
Description
This dataset includes total fuel usage data for the County fleet, including the Finance and ESD managed fleet vehicles. It also includes Ferry fuel usage and data on transit trips based on ORCA card usage.
ORCA waitlist
data.oregon.gov
application/rdfxml +5
Updated Jun 24, 2025
Cite
Oregon Housing and Community Services (2025). ORCA waitlist [Dataset]. https://data.oregon.gov/dataset/ORCA-waitlist/nfv3-nxmu
Available download formats: csv, xml, application/rssxml, json, application/rdfxml, tsv
Dataset updated
Jun 24, 2025
Dataset authored and provided by
Oregon Housing and Community Services
Description
Applications currently on a waitlist for the Oregon Housing and Community Services' Oregon Centralized Application (ORCA) process.
Northern Resident Killer Whale Group Cohesion (1980-2010)
open.canada.ca
datasets.ai
csv, pdf
Updated Jul 11, 2023
Cite
Fisheries and Oceans Canada (2023). Northern Resident Killer Whale Group Cohesion (1980-2010) [Dataset]. https://open.canada.ca/data/en/dataset/8c773994-1031-411b-a1ad-933928daa4ac
Available download formats: csv, pdf
Dataset updated
Jul 11, 2023
Dataset provided by
Fisheries and Oceans Canada: http://www.dfo-mpo.gc.ca/
License
Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
Time period covered
Jan 1, 1980 - Dec 31, 2010
Description
Data from: Stredulinsky et al. (2021) Family feud: permanent group splitting in a highly philopatric mammal, the killer whale (Orcinus orca). Behavioural Ecology and Sociobiology. https://link.springer.com/article/10.1007/s00265-021-02992-8. Group cohesion and demographic parameters were derived from annual censuses of Northern Resident Killer Whales (NRKW) in Pacific Canadian coastal waters, conducted by DFO's Cetacean Research Program since 1973. For animals that tend to remain with their natal group rather than individually disperse, group sizes may become too large to benefit individual fitness. In such cases, group splitting (or fission) allows philopatric animals to form more optimal group sizes without sacrificing all familiar social relationships. Although permanent group splitting is observed in many mammals, it occurs relatively infrequently. Here, we use combined generalized modeling and machine learning approaches to provide a comprehensive examination of group splitting in a population of killer whales (Orcinus orca) that occurred over three decades. Fission occurred both along and across maternal lines, where animals dispersed in parallel with their closest maternal kin. Group splitting was more common: (1) in larger natal groups, (2) when the common maternal ancestor was no longer alive, and (3) among groups with greater substructuring. The death of a matriarch did not appear to immediately trigger splitting. Our data suggest intragroup competition for food, leadership experience, and kinship are important factors that influence group splitting in this population. Our approach provides a foundation for future studies to examine the dynamics and consequences of matrilineal fission in killer whales and other taxa.


According to Google Scholar, 382 scholarly articles cite the OpenOrca dataset (Open-Orca/OpenOrca).
