MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
🐋 The OpenOrca Dataset! 🐋
We are thrilled to announce the release of the OpenOrca dataset! This rich collection of augmented FLAN data aligns, as best as possible, with the distributions outlined in the Orca paper. It has been instrumental in generating high-performing model checkpoints and serves as a valuable resource for all NLP researchers and developers!
Official Models
Mistral-7B-OpenOrca
Our latest model, the first 7B to score better overall than all… See the full description on the dataset page: https://huggingface.co/datasets/Open-Orca/OpenOrca.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Overview
This is a new curated subset of our OpenOrca data. This release provides an efficient means of reaching performance on par with using larger slices of our data, while including only ~500k GPT-4 completions. The key change in this dataset is an additional pass in which GPT-4 removes answers that appear wrong based on the human annotations from the FLAN dataset. This reduces the dataset size to only ~500k entries, allowing training to a similar quality level… See the full description on the dataset page: https://huggingface.co/datasets/Open-Orca/SlimOrca.
Dataset Card for "Open-Orca-OpenOrca"
More Information needed
This dataset was created by Atman_Coder
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
🍮 The WHOLE FLAN Collection! 🍮
Overview
This repository includes the full dataset from the FLAN Collection, totalling ~300GB as parquets. It was generated using the official seqio templating from the Google FLAN Collection GitHub repo. The data is subject to the same licensing as the component datasets. To keep up with our continued work on OpenOrca and other exciting research, find our Discord here: https://AlignmentLab.ai
Motivation
This work was done as part of… See the full description on the dataset page: https://huggingface.co/datasets/Open-Orca/FLAN.
Orca is a data-driven, unsupervised anomaly detection algorithm that uses a distance-based approach. It uses a novel pruning rule that allows it to run in nearly linear time. Orca was co-developed by Stephen Bay of ISLE and Mark Schwabacher of NASA ARC. More information about Orca, including downloadable software, can be found here: http://stephenbay.net/orca/ A conference paper about Orca can be found here: https://dashlink.arc.nasa.gov/paper/mining-distance-based-outliers-in-near-linear-time/
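The pruning idea behind a distance-based outlier search can be illustrated with a small sketch. This is not the released Orca software: the function name `top_outliers`, the scoring choice (average distance to the k nearest neighbours), and the parameters `k`, `n`, and `seed` are assumptions made for illustration. The sketch abandons a candidate as soon as its running score falls below the weakest score among the current top-n outliers, since the score can only shrink as closer neighbours are found.

```python
import heapq
import math
import random


def top_outliers(points, k=2, n=1, seed=0):
    """Return up to n (score, index) pairs, highest score first.

    Score = average distance to the k nearest neighbours (an assumed choice).
    """
    if k >= len(points):
        raise ValueError("k must be smaller than the number of points")
    order = list(range(len(points)))
    random.Random(seed).shuffle(order)  # random order helps pruning trigger early
    top = []       # min-heap of (score, index) for the best outliers so far
    cutoff = 0.0   # weakest score in `top` once it holds n entries
    for i in order:
        nearest = []   # max-heap (negated) of the k smallest distances seen
        pruned = False
        for j in order:
            if i == j:
                continue
            d = math.dist(points[i], points[j])
            if len(nearest) < k:
                heapq.heappush(nearest, -d)
            elif d < -nearest[0]:
                heapq.heapreplace(nearest, -d)
            if len(nearest) == k and len(top) == n:
                # The running score is an upper bound on the final score.
                if -sum(nearest) / k < cutoff:
                    pruned = True
                    break
        if pruned:
            continue
        score = -sum(nearest) / k
        if len(top) < n:
            heapq.heappush(top, (score, i))
        else:
            heapq.heappushpop(top, (score, i))
        if len(top) == n:
            cutoff = top[0][0]
    return sorted(top, reverse=True)


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
    print(top_outliers(pts, k=2, n=1))
```

In the worst case this is still a nested loop over all pairs; the benefit comes from most non-outlying points being pruned after only a few distance computations.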
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
CREDIT: https://huggingface.co/cgato (there were some minor formatting errors in the original, which were corrected and pushed to the Open-Orca org). What is this dataset? Half of the Slim Orca Deduped dataset, but further cleaned by removing instances of soft prompting. I removed a ton of prompt prefixes which did not add any information or were redundant, e.g. "Question:", "Q:", "Write the Answer:", "Read this:", "Instructions:". I also removed a ton of prompt suffixes which were simply there to lead the model to answer as… See the full description on the dataset page: https://huggingface.co/datasets/Open-Orca/slimorca-deduped-cleaned-corrected.
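The prefix removal described above might look something like the following minimal sketch. It uses only the five example prefixes quoted in the description; the function name `strip_prefix`, the case-insensitive matching, and the restriction to the very start of the prompt are assumptions, not the dataset author's actual cleaning script. Suffix removal is not sketched because the description of it is truncated.

```python
import re

# Example prefixes quoted in the dataset description.
PREFIXES = ["Question:", "Q:", "Write the Answer:", "Read this:", "Instructions:"]

# Match one prefix at the start of the prompt, ignoring case (an assumption).
_PREFIX_RE = re.compile(
    r"^\s*(?:" + "|".join(re.escape(p) for p in PREFIXES) + r")\s*",
    re.IGNORECASE,
)


def strip_prefix(prompt):
    """Remove a single leading soft-prompt prefix, if one is present."""
    return _PREFIX_RE.sub("", prompt, count=1)


if __name__ == "__main__":
    print(strip_prefix("Question: What is the capital of Norway?"))
    print(strip_prefix("Summarize the passage below."))
```

Prompts that do not start with one of the listed prefixes are returned unchanged.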
Dataset Generation
Base Model: h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2
Seed Instructions: Derived from the FLAN-v2 Collection.
Generation Approach: Explanation tuning with detailed responses generated from h2ogpt-gm-oasst1-en-2048-falcon-40b-v2.
Total Instructions: 5,507 explanation tuning data samples.
Dataset Sources
Repository: Bitbucket Project
Paper: Pre-Print
Structure
The dataset entries consist of:
- Query
- Response
- System Message (when applicable)
Usage
The Orca Dataset is intended for fine-tuning language models to imitate not only the style but also the reasoning process of LFMs, thereby improving the safety and quality of the models’ responses.
Citation
If you find our work useful, please cite our paper as follows:
@misc{surge2024openbezoar,
  title={OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data},
  author={Chandeepa Dissanayake and Lahiru Lowe and Sachith Gunasekara and Yasiru Ratnayake},
  year={2024},
  eprint={2404.12195},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
Dataset Authors
Chandeepa Dissanayake, Lahiru Lowe, Sachith Gunasekara, and Yasiru Ratnayake
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Overview
"SlimOrca Dedup" is a deduplicated, unfiltered subset of the SlimOrca dataset, excluding RLHF instances, resulting in 363k unique examples.
Key Features
Removal of RLHF instances. Deduplication using minhash and Jaccard similarity techniques.
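As a rough illustration of how MinHash can approximate Jaccard similarity for deduplication, here is a standard-library sketch. The shingle size, the number of hash functions, the 0.8 threshold, and all function names are assumptions; the description does not state the settings or tooling actually used for SlimOrca Dedup.

```python
import hashlib


def shingles(text, size=3):
    """Split text into a set of overlapping word n-grams."""
    words = text.lower().split()
    if len(words) < size:
        return {" ".join(words)}
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def minhash_signature(items, num_perm=64):
    """One minimum hash value per salted hash function."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(texts, threshold=0.8):
    """Keep a text only if it is not too similar to any text already kept."""
    kept, kept_sigs = [], []
    for text in texts:
        sig = minhash_signature(shingles(text))
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(text)
            kept_sigs.append(sig)
    return kept


if __name__ == "__main__":
    docs = [
        "the quick brown fox jumps over the lazy dog",
        "the quick brown fox jumps over the lazy dog",
        "an entirely different sentence about killer whales",
    ]
    print(deduplicate(docs))
```

This sketch compares every new signature against all kept ones, which is quadratic; large-scale pipelines typically add locality-sensitive hashing on top of the signatures to avoid that.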
Demo Models
Note: These models were trained on the full SlimOrca dataset, not the deduplicated, unfiltered version.
ORCAS is a click-based dataset. It covers 1.4 million of the TREC DL documents, providing 18 million connections to 10 million distinct queries.
A dataset used for training large language models, with a focus on balancing the text distribution and mitigating overfitting.
ryanmarten/open-orca-flan dataset hosted on Hugging Face and contributed by the HF Datasets community
ryanmarten/open-orca-cot-judged-test dataset hosted on Hugging Face and contributed by the HF Datasets community
ORCA applications received and in-process
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This ID-Catalogue (English and Norwegian versions available) compiles ID-photographs of the killer whales that were encountered regularly in the inner fjords of the Vestlandet region (Norway) between 2016 and 2021.
Abstract: Building health management is an important part of running an efficient and cost-effective building. Many problems in a building's systems can go undetected for long periods of time, leading to expensive repairs or wasted resources. This project aims to help detect and diagnose the building's health with data-driven methods throughout the day. Orca and IMS are two state-of-the-art algorithms that observe an array of building health sensors and provide feedback on the overall system's health as well as localize the problem to one, or possibly two, components. With this level of feedback the hope is to quickly identify problems and provide appropriate maintenance while reducing the number of complaints and service calls.
Introduction: To prepare these technologies for the new installation, the proposed methods are being tested on a current system that behaves similarly to the future green building. Building 241 was determined to best resemble the proposed building 232 and therefore was chosen for this study. Building 241 is currently outfitted with 34 sensors that monitor the heating and cooling temperatures for the air and water systems as well as various other subsystem states. The daily sensor recordings were logged and sent to the IDU group for analysis. The analysis covered the period from July 1st through August 10th, 2009.
Methodology: The two algorithms used for analysis were Orca and IMS. Both methods look for anomalies using a distance-based scoring approach. Orca can take a single data set and find outliers within that data set. This tactic was applied to each day. After scoring each time sample throughout a given day, the Orca score profiles were compared by computing the correlation against all other days. Days with high overall correlations were considered normal, whereas days with lower overall correlations were considered more anomalous. IMS, on the other hand, needs a normal set of data to build a model, which can then be applied to a set of test data to assess how anomalous that particular data set is. The typical days identified by Orca were used as the reference/training set for IMS, while all the other days were passed through IMS, resulting in an anomaly score profile for each day. The mean of the IMS score profile was then calculated for each day to produce a summary IMS score. These summary scores were ranked and the top outliers were identified (see Figure 1). Once the anomalies were identified, the contributing parameters were ranked by the algorithm.
Analysis: The contributing parameters identified by IMS were localized to the return air temperature duct system.
- 7/03/09 (Figure 2 & 3): AHU-1 Return Air Temperature (RAT); Calculated Average Return Air Temperature
- 7/19/09 (Figure 3 & 4): AHU-2 Return Air Temperature (RAT); Calculated Average Return Air Temperature
IMS identified significantly higher temperatures on these days compared to other days during the months of July and August.
Conclusion: The proposed algorithms Orca and IMS have shown that they were able to pick up significant anomalies in the building system as well as diagnose the anomaly by identifying the sensor values that were anomalous. In the future these methods can be used on live streaming data to produce a real-time anomaly score that helps building maintenance detect and diagnose problems.
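The day-level comparison described in the methodology can be sketched as follows, under stated assumptions. Neither Orca nor IMS is implemented here: the per-sample score profiles are taken as given, "overall correlation" is interpreted as the mean Pearson correlation against all other days, and the function names and sample values are hypothetical.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length score profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile has no defined correlation
    return cov / (var_x * var_y) ** 0.5


def overall_correlation(profiles):
    """Mean correlation of each day's profile against every other day."""
    return {
        day: mean([pearson(p, q) for other, q in profiles.items() if other != day])
        for day, p in profiles.items()
    }


def rank_by_summary_score(profiles):
    """Rank days by the mean of their score profile, most anomalous first."""
    return sorted(profiles, key=lambda day: mean(profiles[day]), reverse=True)


if __name__ == "__main__":
    # Made-up score profiles, one list of per-sample scores per day.
    orca_scores = {
        "day1": [1.0, 2.0, 3.0, 4.0],
        "day2": [2.0, 4.0, 6.0, 8.0],
        "day3": [1.1, 2.1, 2.9, 4.2],
        "day4": [4.0, 3.0, 2.0, 1.0],
    }
    corr = overall_correlation(orca_scores)
    print(min(corr, key=corr.get))              # least typical day
    print(rank_by_summary_score(orca_scores))   # same ranking idea as the IMS summary
```

Days with the lowest overall correlation would be set aside as anomalous, and the remaining typical days would serve as the reference set for the second stage.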
A list of projects currently on the ORCA waitlist
This dataset includes total fuel usage data for the County fleet, including the Finance and ESD managed fleet vehicles. It also includes Ferry fuel usage and data on transit trips based on ORCA card usage.
Applications currently on a waitlist for the Oregon Housing and Community Services' Oregon Centralized Application (ORCA) process.
Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
Data from: Stredulinsky et al. (2021) Family feud: permanent group splitting in a highly philopatric mammal, the killer whale (Orcinus orca). Behavioural Ecology and Sociobiology. https://link.springer.com/article/10.1007/s00265-021-02992-8. Group cohesion and demographic parameters were derived from annual censuses of Northern Resident Killer Whales (NRKW) in Pacific Canadian coastal waters, conducted by DFO's Cetacean Research Program since 1973. For animals that tend to remain with their natal group rather than individually disperse, group sizes may become too large to benefit individual fitness. In such cases, group splitting (or fission) allows philopatric animals to form more optimal group sizes without sacrificing all familiar social relationships. Although permanent group splitting is observed in many mammals, it occurs relatively infrequently. Here, we use combined generalized modeling and machine learning approaches to provide a comprehensive examination of group splitting in a population of killer whales (Orcinus orca) that occurred over three decades. Fission occurred both along and across maternal lines, where animals dispersed in parallel with their closest maternal kin. Group splitting was more common: (1) in larger natal groups, (2) when the common maternal ancestor was no longer alive, and (3) among groups with greater substructuring. The death of a matriarch did not appear to immediately trigger splitting. Our data suggest intragroup competition for food, leadership experience, and kinship are important factors that influence group splitting in this population. Our approach provides a foundation for future studies to examine the dynamics and consequences of matrilineal fission in killer whales and other taxa.