88 datasets found
  1. OpenOrca

    • huggingface.co
    • opendatalab.com
    Updated Jun 29, 2023
    Cite
    OpenOrca (2023). OpenOrca [Dataset]. https://huggingface.co/datasets/Open-Orca/OpenOrca
    Dataset updated
    Jun 29, 2023
    Dataset authored and provided by
    OpenOrca
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    🐋 The OpenOrca Dataset! 🐋

    We are thrilled to announce the release of the OpenOrca dataset! This rich collection of augmented FLAN data aligns, as best as possible, with the distributions outlined in the Orca paper. It has been instrumental in generating high-performing model checkpoints and serves as a valuable resource for all NLP researchers and developers!

    Official Models

    Mistral-7B-OpenOrca

    Our latest model, the first 7B to score better overall than all… See the full description on the dataset page: https://huggingface.co/datasets/Open-Orca/OpenOrca.

  2. SlimOrca

    • huggingface.co
    • opendatalab.com
    Updated Oct 11, 2023
    Cite
    OpenOrca (2023). SlimOrca [Dataset]. https://huggingface.co/datasets/Open-Orca/SlimOrca
    Dataset updated
    Oct 11, 2023
    Dataset authored and provided by
    OpenOrca
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Overview

    This is a new curated subset of our OpenOrca data. This release provides an efficient means of reaching performance on par with using larger slices of our data, while only including ~500k GPT-4 completions. The key change in this dataset is that we've done an additional pass, using GPT-4 to remove answers which appear wrong based on the human annotations from the FLAN dataset. This reduces the dataset size to only ~500k entries, allowing training to a similar quality level… See the full description on the dataset page: https://huggingface.co/datasets/Open-Orca/SlimOrca.

  3. Open-Orca-OpenOrca

    • huggingface.co
    Updated Aug 1, 2023
    Cite
    AGIE AI Technology (2023). Open-Orca-OpenOrca [Dataset]. https://huggingface.co/datasets/agie-ai/Open-Orca-OpenOrca
    Dataset updated
    Aug 1, 2023
    Dataset authored and provided by
    AGIE AI Technology
    Description

    Dataset Card for "Open-Orca-OpenOrca"

    More Information needed

  4. Open Orca Dataset Embedding

    • kaggle.com
    Updated Feb 29, 2024
    Cite
    Atman_Coder (2024). Open Orca Dataset Embedding [Dataset]. https://www.kaggle.com/datasets/atmancoder/open-orca-dataset-embedding/data
    Dataset updated
    Feb 29, 2024
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Atman_Coder
    Description

    Dataset

    This dataset was created by Atman_Coder

    Contents

  5. FLAN

    • huggingface.co
    Updated Aug 3, 2023
    Cite
    OpenOrca (2023). FLAN [Dataset]. https://huggingface.co/datasets/Open-Orca/FLAN
    Dataset updated
    Aug 3, 2023
    Dataset authored and provided by
    OpenOrca
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    🍮 The WHOLE FLAN Collection! 🍮

    Overview

    This repository includes the full dataset from the FLAN Collection, totalling ~300GB as parquets. Generated using the official seqio templating from the Google FLAN Collection GitHub repo. The data is subject to all the same licensing of the component datasets. To keep up with our continued work on OpenOrca and other exciting research, find our Discord here: https://AlignmentLab.ai

    Motivation

    This work was done as part of… See the full description on the dataset page: https://huggingface.co/datasets/Open-Orca/FLAN.

  6. Orca - Dataset - NASA Open Data Portal

    • data.nasa.gov
    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    Updated Mar 28, 2010
    Cite
    nasa.gov (2025). Orca - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/orca
    Dataset updated
    Mar 28, 2010
    Dataset provided by
    NASA: http://nasa.gov/
    Description

    Orca is a data-driven, unsupervised anomaly detection algorithm that uses a distance-based approach. It uses a novel pruning rule that allows it to run in nearly linear time. Orca was co-developed by Stephen Bay of ISLE and Mark Schwabacher of NASA ARC. More information about Orca, including downloadable software, can be found here: http://stephenbay.net/orca/ A conference paper about Orca can be found here: https://dashlink.arc.nasa.gov/paper/mining-distance-based-outliers-in-near-linear-time/
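As a rough illustration of what a distance-based outlier score with a pruning rule can look like, the following minimal Python sketch scores each point by its mean distance to its k nearest neighbours and abandons a candidate as soon as its running score falls below the weakest score in the current top-n. This is not Orca's code: the function name `top_outliers`, the parameters `k`, `n`, and `seed`, and the exact scoring function are assumptions made for this example.

```python
import heapq
import math
import random


def top_outliers(points, k=2, n=1, seed=0):
    """Return up to n (score, index) pairs with the largest mean k-NN distance.

    Hypothetical sketch of distance-based outlier mining with pruning;
    names and defaults are illustrative, not taken from Orca itself.
    """
    rng = random.Random(seed)
    order = list(range(len(points)))
    rng.shuffle(order)  # a random scan order makes early pruning more likely

    top = []       # min-heap of (score, index) for the current top-n outliers
    cutoff = 0.0   # score of the weakest outlier currently kept

    for i in range(len(points)):
        nearest = []  # negated distances: a max-heap of the k smallest so far
        pruned = False
        for j in order:
            if i == j:
                continue
            d = math.dist(points[i], points[j])
            if len(nearest) < k:
                heapq.heappush(nearest, -d)
            elif d < -nearest[0]:
                heapq.heapreplace(nearest, -d)
            # The running score only shrinks as closer neighbours are found,
            # so once it is below the cutoff the point cannot enter the top-n.
            if len(nearest) == k and -sum(nearest) / k < cutoff:
                pruned = True
                break
        if pruned or len(nearest) < k:
            continue
        score = -sum(nearest) / k
        if len(top) < n:
            heapq.heappush(top, (score, i))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, i))
        if len(top) == n:
            cutoff = top[0][0]
    return sorted(top, reverse=True)
```

Calling `top_outliers(points, k=2, n=1)` on a list of coordinate tuples returns the single most isolated point together with its score. Most points are discarded after only a few distance computations, which is the intuition behind the near-linear running time mentioned in the description.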

  7. slimorca-deduped-cleaned-corrected

    • huggingface.co
    Updated Apr 13, 2025
    Cite
    OpenOrca (2025). slimorca-deduped-cleaned-corrected [Dataset]. https://huggingface.co/datasets/Open-Orca/slimorca-deduped-cleaned-corrected
    Dataset updated
    Apr 13, 2025
    Dataset authored and provided by
    OpenOrca
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    CREDIT: https://huggingface.co/cgato. There were some minor formatting errors, which were corrected and pushed to the Open-Orca org. What is this dataset? Half of the Slim Orca Deduped dataset, but further cleaned by removing instances of soft prompting. I removed a ton of prompt prefixes which did not add any information or were redundant. Ex. "Question:", "Q:", "Write the Answer:", "Read this:", "Instructions:" I also removed a ton of prompt suffixes which were simply there to lead the model to answer as… See the full description on the dataset page: https://huggingface.co/datasets/Open-Orca/slimorca-deduped-cleaned-corrected.

  8. SurgeGlobal/Orca Dataset

    • paperswithcode.com
    Updated Apr 17, 2024
    Cite
    Chandeepa Dissanayake; Lahiru Lowe; Sachith Gunasekara; Yasiru Ratnayake (2024). SurgeGlobal/Orca Dataset [Dataset]. https://paperswithcode.com/dataset/surgeglobal-orca
    Dataset updated
    Apr 17, 2024
    Authors
    Chandeepa Dissanayake; Lahiru Lowe; Sachith Gunasekara; Yasiru Ratnayake
    Description

    Dataset Generation

    Base Model: h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2
    Seed Instructions: Derived from the FLAN-v2 Collection.
    Generation Approach: Explanation tuning with detailed responses generated from h2ogpt-gm-oasst1-en-2048-falcon-40b-v2.
    Total Instructions: 5,507 explanation tuning data samples.

    Dataset Sources

    Repository: Bitbucket Project
    Paper: Pre-Print

    Structure

    The dataset entries consist of:
    - Query
    - Response
    - System Message (when applicable)

    Usage

    The Orca Dataset is intended for fine-tuning language models to not only imitate the style but also the reasoning process of LFMs, thereby improving the safety and quality of the models’ responses.

    Citation

    If you find our work useful, please cite our paper as follows:

    @misc{surge2024openbezoar,
      title={OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data},
      author={Chandeepa Dissanayake and Lahiru Lowe and Sachith Gunasekara and Yasiru Ratnayake},
      year={2024},
      eprint={2404.12195},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
    }

    Dataset Authors

    Chandeepa Dissanayake, Lahiru Lowe, Sachith Gunasekara, and Yasiru Ratnayake

  9. SlimOrca-Dedup

    • huggingface.co
    Updated Nov 1, 2023
    Cite
    OpenOrca (2023). SlimOrca-Dedup [Dataset]. https://huggingface.co/datasets/Open-Orca/SlimOrca-Dedup
    Dataset updated
    Nov 1, 2023
    Dataset authored and provided by
    OpenOrca
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Overview

    "SlimOrca Dedup" is a deduplicated, unfiltered subset of the SlimOrca dataset, excluding RLHF instances, resulting in 363k unique examples.

    Key Features

    Removal of RLHF instances. Deduplication using minhash and Jaccard similarity techniques.
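The dataset card names MinHash and Jaccard similarity but gives no parameters, so the following is only a minimal sketch of how near-duplicate removal with these two techniques might look. The shingle size, the 64 hash functions, the 0.8 threshold, and all function names are assumptions made for illustration, not the settings used to build SlimOrca-Dedup.

```python
import hashlib


def shingles(text, size=3):
    """Set of word n-grams (shingles) for a text; size is an assumed default."""
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(max(1, len(words) - size + 1))}


def jaccard(a, b):
    """Exact Jaccard similarity: |A intersect B| / |A union B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def minhash_signature(items, num_perm=64):
    """One minimum hash value per salted hash function."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(texts, threshold=0.8):
    """Keep the first text of each near-duplicate group (quadratic, sketch only)."""
    kept, kept_signatures = [], []
    for text in texts:
        sig = minhash_signature(shingles(text))
        if all(estimated_jaccard(sig, other) < threshold for other in kept_signatures):
            kept.append(text)
            kept_signatures.append(sig)
    return kept
```

Comparing every new signature against every kept one is quadratic in the number of examples. At the scale of hundreds of thousands of entries, a locality-sensitive hashing index over the signatures would normally replace that inner loop.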

    Demo Models

    Note: These models were trained on the full SlimOrca dataset, not the deduplicated, unfiltered version.

  10. ORCAS Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Dec 20, 2020
    Cite
    Nick Craswell; Daniel Campos; Bhaskar Mitra; Emine Yilmaz; Bodo Billerbeck (2020). ORCAS Dataset [Dataset]. https://paperswithcode.com/dataset/orcas
    Dataset updated
    Dec 20, 2020
    Authors
    Nick Craswell; Daniel Campos; Bhaskar Mitra; Emine Yilmaz; Bodo Billerbeck
    Description

    ORCAS is a click-based dataset. It covers 1.4 million of the TREC DL documents, providing 18 million connections to 10 million distinct queries.

  11. Yunfan Shao, Linyang Li, Zhaoye Fei, Hang Yan, Dahua Lin, Xipeng Qiu (2024)....

    • service.tib.eu
    Updated Dec 16, 2024
    Cite
    (2024). Yunfan Shao, Linyang Li, Zhaoye Fei, Hang Yan, Dahua Lin, Xipeng Qiu (2024). Dataset: Open-Orca. https://doi.org/10.57702/pmheosqy [Dataset]. https://service.tib.eu/ldmservice/dataset/open-orca
    Dataset updated
    Dec 16, 2024
    Description

    A dataset used for training large language models, with a focus on balancing the text distribution and mitigating overfitting.

  12. open-orca-flan

    • huggingface.co
    Updated Oct 17, 2024
    Cite
    Ryan Marten (2024). open-orca-flan [Dataset]. https://huggingface.co/datasets/ryanmarten/open-orca-flan
    Dataset updated
    Oct 17, 2024
    Authors
    Ryan Marten
    Description

    ryanmarten/open-orca-flan dataset hosted on Hugging Face and contributed by the HF Datasets community

  13. open-orca-cot-judged-test

    • huggingface.co
    Updated Nov 30, 2024
    Cite
    Ryan Marten (2024). open-orca-cot-judged-test [Dataset]. https://huggingface.co/datasets/ryanmarten/open-orca-cot-judged-test
    Dataset updated
    Nov 30, 2024
    Authors
    Ryan Marten
    Description

    ryanmarten/open-orca-cot-judged-test dataset hosted on Hugging Face and contributed by the HF Datasets community

  14. ORCA pipeline

    • data.oregon.gov
    Updated Jul 21, 2025
    Cite
    Oregon Housing and Community Services (2025). ORCA pipeline [Dataset]. https://data.oregon.gov/dataset/ORCA-pipeline/uphc-pyji
    Available download formats: csv, json, application/rssxml, xml, application/rdfxml, tsv
    Dataset updated
    Jul 21, 2025
    Dataset authored and provided by
    Oregon Housing and Community Services
    Description

    ORCA applications received and in-process

  15. Identification Catalogue of the Killer whales that frequent inner Vestlandet...

    • figshare.com
    Updated Mar 29, 2022
    Cite
    Eve Jourdain; Vegard Byrkjeland Aasen; Olve Vaagø Erdal; Simon Johnsen (2022). Identification Catalogue of the Killer whales that frequent inner Vestlandet [Dataset]. http://doi.org/10.6084/m9.figshare.19350476.v1
    Available download formats: pdf
    Dataset updated
    Mar 29, 2022
    Dataset provided by
    figshare
    Authors
    Eve Jourdain; Vegard Byrkjeland Aasen; Olve Vaagø Erdal; Simon Johnsen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Western Norway
    Description

    This ID-Catalogue (English and Norwegian versions available) compiles ID-photographs of the killer whales that were encountered regularly in the inner fjords of the Vestlandet region (Norway) between 2016 and 2021.

  16. Integrated Building Health Management - Dataset - NASA Open Data Portal

    • data.nasa.gov
    Updated Mar 31, 2025
    Cite
    nasa.gov (2025). Integrated Building Health Management - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/integrated-building-health-management
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    NASA: http://nasa.gov/
    Description

    Abstract: Building health management is an important part of running an efficient and cost-effective building. Many problems in a building’s system can go undetected for long periods of time, leading to expensive repairs or wasted resources. This project aims to help detect and diagnose the building’s health with data-driven methods throughout the day. Orca and IMS are two state-of-the-art algorithms that observe an array of building health sensors and provide feedback on the overall system’s health as well as localize the problem to one, or possibly two, components. With this level of feedback the hope is to quickly identify problems and provide appropriate maintenance while reducing the number of complaints and service calls.

    Introduction: To prepare these technologies for the new installation, the proposed methods are being tested on a current system that behaves similarly to the future green building. Building 241 was determined to best resemble the proposed Building 232 and was therefore chosen for this study. Building 241 is currently outfitted with 34 sensors that monitor the heating and cooling temperatures for the air and water systems as well as various other subsystem states. The daily sensor recordings were logged and sent to the IDU group for analysis. The period of analysis ran from July 1st through August 10th, 2009.

    Methodology: The two algorithms used for analysis were Orca and IMS. Both methods look for anomalies using a distance-based scoring approach. Orca can take a single data set and find outliers within that data set. This tactic was applied to each day. After scoring each time sample throughout a given day, the Orca score profiles were compared by computing the correlation against all other days. Days with high overall correlations were considered normal, whereas days with lower overall correlations were considered more anomalous. IMS, on the other hand, needs a normal set of data to build a model, which can then be applied to a set of test data to assess how anomalous that data set is. The typical days identified by Orca were used as the reference/training set for IMS, while all the other days were passed through IMS, resulting in an anomaly score profile for each day. The mean of the IMS score profile was then calculated for each day to produce a summary IMS score. These summary scores were ranked and the top outliers were identified (see Figure 1). Once the anomalies were identified, the contributing parameters were ranked by the algorithm.

    Analysis: The contributing parameters identified by IMS were localized to the return air temperature duct system:
    - 7/03/09 (Figures 2 & 3): AHU-1 Return Air Temperature (RAT), Calculated Average Return Air Temperature
    - 7/19/09 (Figures 3 & 4): AHU-2 Return Air Temperature (RAT), Calculated Average Return Air Temperature
    IMS identified significantly higher temperatures compared to other days during the months of July and August.

    Conclusion: The proposed algorithms Orca and IMS were able to pick up significant anomalies in the building system and to diagnose each anomaly by identifying the sensor values that were anomalous. In the future these methods can be used on live streaming data to produce a real-time anomaly score that helps building maintenance detect and diagnose problems.

  17. ORCA Waitlist

    • data.oregon.gov
    Updated Jun 24, 2025
    Cite
    Oregon Housing and Community Services (2025). ORCA Waitlist [Dataset]. https://data.oregon.gov/dataset/ORCA-Waitlist/3xud-wvpj
    Available download formats: tsv, application/rssxml, csv, application/rdfxml, json, xml
    Dataset updated
    Jun 24, 2025
    Dataset authored and provided by
    Oregon Housing and Community Services
    Description

    A list of projects currently on the ORCA waitlist

  18. FIN Average MPG and Fuel Usage

    • internal.open.piercecountywa.gov
    • open.piercecountywa.gov
    Updated Feb 28, 2024
    Cite
    Finance (2024). FIN Average MPG and Fuel Usage [Dataset]. https://internal.open.piercecountywa.gov/Finance/FIN-Average-MPG-and-Fuel-Usage/kirv-exad
    Available download formats: application/rdfxml, tsv, xml, csv, json, application/rssxml
    Dataset updated
    Feb 28, 2024
    Dataset authored and provided by
    Finance
    Description

    This dataset includes total fuel usage data for the County fleet, including the Finance and ESD managed fleet vehicles. It also includes Ferry fuel usage and data on transit trips based on ORCA card usage.

  19. ORCA waitlist

    • data.oregon.gov
    Updated Jun 24, 2025
    Cite
    Oregon Housing and Community Services (2025). ORCA waitlist [Dataset]. https://data.oregon.gov/dataset/ORCA-waitlist/nfv3-nxmu
    Available download formats: csv, xml, application/rssxml, json, application/rdfxml, tsv
    Dataset updated
    Jun 24, 2025
    Dataset authored and provided by
    Oregon Housing and Community Services
    Description

    Applications currently on a waitlist for the Oregon Housing and Community Services' Oregon Centralized Application (ORCA) process.

  20. Northern Resident Killer Whale Group Cohesion (1980-2010)

    • open.canada.ca
    • datasets.ai
    Updated Jul 11, 2023
    Cite
    Fisheries and Oceans Canada (2023). Northern Resident Killer Whale Group Cohesion (1980-2010) [Dataset]. https://open.canada.ca/data/en/dataset/8c773994-1031-411b-a1ad-933928daa4ac
    Available download formats: csv, pdf
    Dataset updated
    Jul 11, 2023
    Dataset provided by
    Fisheries and Oceans Canada: http://www.dfo-mpo.gc.ca/
    License

    Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
    License information was derived automatically

    Time period covered
    Jan 1, 1980 - Dec 31, 2010
    Description

    Data from: Stredulinsky et al. (2021) Family feud: permanent group splitting in a highly philopatric mammal, the killer whale (Orcinus orca). Behavioural Ecology and Sociobiology. https://link.springer.com/article/10.1007/s00265-021-02992-8. Group cohesion and demographic parameters were derived from annual censuses of Northern Resident Killer Whales (NRKW) in Pacific Canadian coastal waters, conducted by DFO's Cetacean Research Program since 1973. For animals that tend to remain with their natal group rather than individually disperse, group sizes may become too large to benefit individual fitness. In such cases, group splitting (or fission) allows philopatric animals to form more optimal group sizes without sacrificing all familiar social relationships. Although permanent group splitting is observed in many mammals, it occurs relatively infrequently. Here, we use combined generalized modeling and machine learning approaches to provide a comprehensive examination of group splitting in a population of killer whales (Orcinus orca) that occurred over three decades. Fission occurred both along and across maternal lines, where animals dispersed in parallel with their closest maternal kin. Group splitting was more common: (1) in larger natal groups, (2) when the common maternal ancestor was no longer alive, and (3) among groups with greater substructuring. The death of a matriarch did not appear to immediately trigger splitting. Our data suggest intragroup competition for food, leadership experience, and kinship are important factors that influence group splitting in this population. Our approach provides a foundation for future studies to examine the dynamics and consequences of matrilineal fission in killer whales and other taxa.

OpenOrca (Open-Orca/OpenOrca): 382 scholarly articles cite this dataset (view in Google Scholar).