Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Summary
databricks-dolly-15k is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. This dataset can be used for any purpose, whether academic or commercial, under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported… See the full description on the dataset page: https://huggingface.co/datasets/databricks/databricks-dolly-15k.
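For orientation, here is a minimal sketch of loading the records with the Hugging Face `datasets` library; the field names follow the official dataset card:

```python
from datasets import load_dataset

# Load all ~15k instruction-following records (the release has a single "train" split).
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

# Each record carries instruction, context, response, and category fields.
print(dolly[0]["category"], "->", dolly[0]["instruction"])
```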
This dataset was created by hyc
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Dataset Card for "dolly-15k"
Summary
This is the dataset supplied by Databricks for training Dolly V2. This set is split 99% training / 1% validation, should you want to set aside some records for evaluation purposes.
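Should you want to reproduce such a split from the original release, a minimal sketch with the Hugging Face `datasets` library (the seed below is an arbitrary illustration, not part of this card):

```python
from datasets import load_dataset

# Start from the original single-split release and carve out 1% for validation.
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")
splits = dolly.train_test_split(test_size=0.01, seed=42)  # seed: arbitrary choice
train_set, valid_set = splits["train"], splits["test"]
print(len(train_set), len(valid_set))
```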
Special thanks to ❤️ Databricks for creating and making this set available.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Matt M.
Released under Apache 2.0
BSC Dolly 15k EN
A reviewed version of the Argilla Dolly v2 English dataset, originally created by Databricks. We provide two subsets: "annotated", in which some instances are labelled with potential problems, and "filtered", which contains only the instances without the issues we observed.
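As a sketch, assuming the two subsets are exposed as Hugging Face configurations (the repo id below is hypothetical; substitute the actual dataset path):

```python
from datasets import load_dataset

# Hypothetical repo id and configuration names, based on the description above.
annotated = load_dataset("bsc/dolly-15k-en", "annotated", split="train")
filtered = load_dataset("bsc/dolly-15k-en", "filtered", split="train")
```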
Annotation process
While analysing the Argilla Dolly v2 English version, we observed the following:
Task classification:
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
MixInstruct
Introduction
This is the official release of the MixInstruct dataset for the LLM-Blender project. It contains responses from 11 popular instruction-following LLMs:
Stanford Alpaca, FastChat Vicuna, Dolly V2, StableLM, Open Assistant, Koala, Baize, Flan-T5, ChatGLM, MOSS, and Mosaic MPT.
We evaluate each response with automatic metrics including BLEU, ROUGE, BERTScore, and BARTScore, and we provide pairwise comparison results obtained by prompting ChatGPT for the… See the full description on the dataset page: https://huggingface.co/datasets/llm-blender/mix-instruct.
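A hedged illustration of this kind of automatic scoring, using the Hugging Face `evaluate` library (the exact evaluation code used by LLM-Blender may differ):

```python
import evaluate

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

# Score one candidate response against a reference answer.
candidates = ["Paris is the capital of France."]
references = ["The capital of France is Paris."]

print(bleu.compute(predictions=candidates, references=[references]))
print(rouge.compute(predictions=candidates, references=references))
```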
https://choosealicense.com/licenses/other/
Automatically generated multi-turn dataset
Q&A exchanges automatically generated with Calm3-22b from open data sources.
Part of the computation used the TSUBAME4.0 supercomputer at Tokyo Institute of Technology.
Data sources
The initial question (q1) was collected from various data sources; all subsequent turns were generated by Calm (a minimal sketch of this loop follows the source list below). The question texts remain subject to the licenses of their original data sources.
oasst2-33k-ja (Apache 2.0)
databricks-dolly-15k-ja (CC BY-SA 3.0)
minnade (CC0)
cyberagent/chatbot-arena-ja-calm2-7b-chat-experimental (CC BY 4.0)
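A minimal, hypothetical sketch of that generation loop, assuming the cyberagent/calm3-22b-chat checkpoint and the standard transformers chat pipeline (the actual generation code is not published here):

```python
from transformers import pipeline

# Assumed checkpoint; the card only says "Calm3-22b" was used.
chat = pipeline("text-generation", model="cyberagent/calm3-22b-chat")

# q1 comes from an open data source; every later turn is model-generated.
q1 = "Pythonのリストとタプルの違いは何ですか？"  # example seed question (Japanese, matching the dataset)
messages = [{"role": "user", "content": q1}]
result = chat(messages, max_new_tokens=256)
a1 = result[0]["generated_text"][-1]["content"]  # the model's answer
messages.append({"role": "assistant", "content": a1})
```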
Overview
The LaMini Dataset is an instruction dataset generated using h2ogpt-gm-oasst1-en-2048-falcon-40b-v2. It is designed for instruction-tuning pre-trained models to specialize them in a variety of downstream tasks.
Dataset Generation
Base Model: h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2
Seed Instructions: Sourced from the databricks/databricks-dolly-15k dataset.
Generation Approach: Example-guided and topic-guided strategies.
Total Instructions: 1,504 unique instruction examples.
Dataset Sources
Repository: Bitbucket Project
Paper: Pre-Print
Structure
Each entry in the dataset contains:
- Instruction
- Response
Usage
The LaMini Dataset can be used to fine-tune language models to improve their ability to follow instructions and generate relevant responses.
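A hedged sketch of preparing the pairs for supervised fine-tuning; the prompt template and lowercase field names are illustrative assumptions, not necessarily the format used in the paper:

```python
from datasets import load_dataset

lamini = load_dataset("SurgeGlobal/LaMini", split="train")

def to_prompt(example):
    # Illustrative template; field names assumed from the structure above.
    return {"text": f"### Instruction:\n{example['instruction']}\n\n"
                    f"### Response:\n{example['response']}"}

sft_ready = lamini.map(to_prompt)
```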
Access
The dataset is available on HuggingFace at the following link: https://huggingface.co/datasets/SurgeGlobal/LaMini
Citation
If you find our work useful, please cite our paper as follows:
@misc{surge2024openbezoar,
  title={OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data},
  author={Chandeepa Dissanayake and Lahiru Lowe and Sachith Gunasekara and Yasiru Ratnayake},
  year={2024},
  eprint={2404.12195},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
Dataset Authors
Chandeepa Dissanayake, Lahiru Lowe, Sachith Gunasekara, and Yasiru Ratnayake
https://choosealicense.com/licenses/odc-by/
Dataset Card for Tulu Instruction Mix
For a newer version, see Tulu V2. This version, the human data mixture, consists of a mix of:
FLAN (Apache 2.0): FLAN v2 with CoT examples (most of the tasks in SuperNatural Instructions are included here)
Open Assistant 1 (Apache 2.0)
Dolly (CC BY-SA 3.0)
ShareGPT (Apache 2.0 listed, no official repo found)
GPT4-Alpaca (CC BY-NC 4.0)
Code-Alpaca (CC BY-NC 4.0)
These are made by taking either just the training set of the subsets or the… See the full description on the dataset page: https://huggingface.co/datasets/allenai/tulu-v1-sft-mixture.