Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Summary
databricks-dolly-15k is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. This dataset can be used for any purpose, whether academic or commercial, under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported… See the full description on the dataset page: https://huggingface.co/datasets/databricks/databricks-dolly-15k.
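For orientation, here is a minimal sketch of loading the dataset with the Hugging Face datasets library; the column names in the comment are the ones listed on the dataset card.

from datasets import load_dataset

ds = load_dataset("databricks/databricks-dolly-15k", split="train")
print(ds.column_names)  # ['instruction', 'context', 'response', 'category']

# Filter to one of the behavioral categories named above, e.g. closed QA.
closed_qa = ds.filter(lambda r: r["category"] == "closed_qa")
print(closed_qa[0]["instruction"])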
Dataset Card for Evaluation run of databricks/dolly-v2-7b
Dataset automatically created during the evaluation run of model databricks/dolly-v2-7b. The dataset is composed of 44 configurations, each corresponding to one of the evaluated tasks. The dataset has been created from 2 runs. Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split always points to the latest results. An additional… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/databricks_dolly-v2-7b-details.
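A hedged sketch of pulling one of those configurations; the configuration names are not listed here, so the snippet discovers them rather than assuming one.

from datasets import get_dataset_config_names, load_dataset

repo = "open-llm-leaderboard/databricks_dolly-v2-7b-details"
configs = get_dataset_config_names(repo)  # one entry per evaluated task

# "train" always points at the latest run; timestamped splits hold earlier runs.
details = load_dataset(repo, configs[0], split="train")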
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Dataset Card for "databricks-dolly-15k-curated-multilingual"
A curated and multilingual version of the Databricks Dolly instructions dataset. It includes a programmatically and manually corrected version of the original en dataset. See below. STATUS: Currently, the original Dolly v2 English version has been curated by combining automatic processing and collaborative human curation using Argilla (~400 records have been manually edited and fixed). The following graph shows a summary… See the full description on the dataset page: https://huggingface.co/datasets/argilla/databricks-dolly-15k-curated-multilingual.
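A minimal sketch of loading the curated data; whether each language lives in its own split or its own configuration is an assumption here, so the snippet lists the splits first.

from datasets import get_dataset_split_names, load_dataset

repo = "argilla/databricks-dolly-15k-curated-multilingual"
print(get_dataset_split_names(repo))       # expect language-coded splits (assumption)
dolly_en = load_dataset(repo, split="en")  # assumes an "en" split exists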
Dataset Card for Evaluation run of databricks/dolly-v2-3b
Dataset automatically created during the evaluation run of model databricks/dolly-v2-3b. The dataset is composed of 44 configurations, each corresponding to one of the evaluated tasks. The dataset has been created from 1 run. Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split always points to the latest results. An additional… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/databricks_dolly-v2-3b-details.
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Dataset Card for "dolly-15k"
Summary
This is the dataset supplied by Databricks for training Dolly V2. This set is split 99% training / 1% validation, should you want to set aside some records for evaluation purposes.
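If you would rather cut the partition yourself from the raw Databricks release, a sketch along these lines reproduces the 99/1 proportions (the seed value is an arbitrary choice, not the one used here).

from datasets import load_dataset

raw = load_dataset("databricks/databricks-dolly-15k", split="train")
parts = raw.train_test_split(test_size=0.01, seed=42)
train, validation = parts["train"], parts["test"]
print(len(train), len(validation))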
Special thanks to ❤️ Databricks for creating and making this set available.
More Information needed
BSC Dolly 15k EN
A reviewed version of the Argilla Dolly v2 English dataset, originally created by Databricks. We provide two subsets: "annotated", where some instances are labelled with potential problems, and "filtered", which contains only the instances without the issues we observed.
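A hedged sketch of loading the two subsets; the subset names "annotated" and "filtered" come from the description above, but the repository path below is a hypothetical placeholder, since the card does not spell it out here.

from datasets import load_dataset

REPO_ID = "bsc/dolly-15k-en"  # hypothetical placeholder; check the actual repo path
annotated = load_dataset(REPO_ID, "annotated", split="train")
filtered = load_dataset(REPO_ID, "filtered", split="train")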
Annotation process
While analysing the Argilla Dolly v2 English version, we observed the following:
Task classification:
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
An explain-tuned Dolly-V2 dataset (~15K records) created using approaches from the Orca research paper. We leverage all 15 system instructions provided in the Orca research paper to generate explain-tuned data, in contrast to the vanilla instruction-tuning approach used by the original datasets. This helps student models like orca_mini_13b, orca_mini_7b, or orca_mini_3b learn the thought process of the teacher model, which is ChatGPT (gpt-3.5-turbo-0301 version). Please see how the System prompt is added before… See the full description on the dataset page: https://huggingface.co/datasets/pankajmathur/dolly-v2_orca.
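A minimal sketch of the idea, not the authors' pipeline: prepend a system instruction to each Dolly record so the student model sees the reasoning-eliciting prompt during fine-tuning. The system string is an illustrative paraphrase, not a quote from the Orca paper.

from datasets import load_dataset

SYSTEM = "You are a helpful assistant. Think step by step and explain your reasoning."

dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

def add_system_prompt(record):
    # Combine the system instruction with the original Dolly instruction.
    record["prompt"] = f"### System:\n{SYSTEM}\n\n### User:\n{record['instruction']}"
    return record

explain_tuned = dolly.map(add_system_prompt)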
https://market.us/privacy-policy/
The Hand Truck and Dolly market is expected to be worth around USD 2.0 billion by 2034, up from USD 1.3 billion in 2024, at a CAGR of 4.3%.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Translated with googletrans==3.1.0a0 from the original dataset. *Part of the data (up to 600 records) was lost during translation.
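A sketch of what that translation step likely looked like with the pinned googletrans build; the try/except mirrors how records can silently drop, as the note above mentions, and the target language is a placeholder.

from googletrans import Translator  # pip install googletrans==3.1.0a0

translator = Translator()

def translate_or_none(text, dest="en"):  # dest is a placeholder, not the card's target
    try:
        return translator.translate(text, dest=dest).text
    except Exception:
        return None  # failed records end up dropped, like the ~600 noted above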
license: apache-2.0
Dataset Card for "final_training_set_v1"
Fine-tuning datasets for WangChanGLM sourced from LAION OIG chip2 and infill_dbpedia (Apache-2.0), Databricks Dolly v2 (Apache-2.0), OpenAI TL;DR (MIT), and Hello-SimpleAI HC3 (CC BY-SA)
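A hedged sketch of assembling such a mixture: load each source, normalize it to a shared instruction/input/output schema, then concatenate. Only the Dolly step is shown, and the column mapping is an assumption.

from datasets import concatenate_datasets, load_dataset

dolly = load_dataset("databricks/databricks-dolly-15k", split="train")
dolly = dolly.rename_columns({"context": "input", "response": "output"}).remove_columns("category")
# ...normalize OIG chip2, infill_dbpedia, TL;DR, and HC3 the same way, then:
# mixture = concatenate_datasets([dolly, oig, infill, tldr, hc3])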
Dataset Card for "InstructMix"
Description: InstructMix is a versatile instruction-tuning dataset available in Alpaca format. It encompasses a variety of instruction-related tasks and sources, making it well suited for fine-tuning instruction-following large language models.
Included Datasets:
Dataset Name | Size | Type | Details | GitHub Repo
Alpaca_GPT4 | 52,002 examples | General Instruction | Generated by GPT-4 using Alpaca | GitHub Repo
dolly 2.0 | 15,015 examples | Closed… See the full description on the dataset page: https://huggingface.co/datasets/Xilabs/instructmix.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Here is a collective list of the instruction datasets used for Neural Chat fine-tuning. The total numbers of instruction samples and tokens are about 1.5M and 5M respectively.
Type | Language | Dataset | Number
HC3 | en | HC3 | 24K
dolly | en | databricks-dolly-15k | 15K
alpaca-zh | zh | tigerbot-alpaca-zh-0.5m | 500K
alpaca-en | en | TigerResearch/tigerbot-alpaca-en-50k | 50K
math | en | tigerbot-gsm-8k-en | 8K
general | en | tigerbot-stackexchange-qa-en-0.5m | 500K
OpenOrca | en | Open-Orca/OpenOrca | 400K (sampled)… See the full description on the dataset page: https://huggingface.co/datasets/Intel/neural-chat-dataset-v2.
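A minimal sketch of the "400K (sampled)" row: shuffle OpenOrca with a fixed seed and keep the first 400K rows. The seed is an arbitrary choice, not the one Intel used.

from datasets import load_dataset

openorca = load_dataset("Open-Orca/OpenOrca", split="train")
sampled = openorca.shuffle(seed=42).select(range(400_000))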
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for "KULLM-v2"
Dataset Summary
Korean translation of GPT4ALL, Dolly, and Vicuna data. Repository: nlpai-lab/KULLM. Hugging Face: nlpai-lab/kullm-v2.
Translate dataset
The 'instruction', 'input', and 'output' fields in the dataset were translated via the DeepL API.
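A sketch of that per-field translation step using the official deepl Python client; the client choice is an assumption, since the card only says the DeepL API was used.

import deepl  # pip install deepl

translator = deepl.Translator("YOUR_AUTH_KEY")

def translate_record(record):
    # Translate each of the three fields named above into Korean.
    for key in ("instruction", "input", "output"):
        record[key] = translator.translate_text(record[key], target_lang="KO").text
    return record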
License
Apache-2.0
from datasets import load_dataset
ds = load_dataset("nlpai-lab/kullm-v2", split="train")
ds
# DatasetDict({ train: Dataset({ features: ['id'… See the full description on the dataset page: https://huggingface.co/datasets/csujeong/kullm-v2.1.
https://choosealicense.com/licenses/other/
Automatically generated multi-turn dataset
Q&A exchanges automatically generated with Calm3-22b from open data sources.
Part of the computation used the TSUBAME 4.0 supercomputer at the Tokyo Institute of Technology.
Data sources
The initial question (q1) of each conversation was collected from various data sources; every subsequent turn was generated by Calm. The question texts follow the licenses of their source data, listed below; a schematic of the generation loop follows the list.
oasst2-33k-ja apache 2.0
databricks-dolly-15k-ja cc-by-sa-3.0
minnade CC0
cyberagent/chatbot-arena-ja-calm2-7b-chat-experimental cc-by-4.0
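A schematic of the generation loop described above, not the authors' code: seed with a question from one of the sources, then let the model write both the answers and the follow-up questions. Here `generate` stands in for a Calm3-22b inference call.

def build_multiturn(q1, generate, n_turns=3):
    """Seed with q1, then alternate model-written answers and follow-ups."""
    messages = [{"role": "user", "content": q1}]
    for turn in range(n_turns):
        # Model answers the current question.
        messages.append({"role": "assistant", "content": generate(messages)})
        if turn < n_turns - 1:
            # Model also writes the next user question, per the card's description.
            messages.append({"role": "user", "content": generate(messages)})
    return messages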
Overview
The LaMini Dataset is an instruction dataset generated using h2ogpt-gm-oasst1-en-2048-falcon-40b-v2. It is designed for instruction-tuning pre-trained models to specialize them in a variety of downstream tasks.
Dataset Generation
Base Model: h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2
Seed Instructions: sourced from the databricks/databricks-dolly-15k dataset
Generation Approach: example-guided and topic-guided strategies
Total Instructions: 1,504 unique… See the full description on the dataset page: https://huggingface.co/datasets/SurgeGlobal/LaMini.
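A hedged sketch of the example-guided strategy named above: sample a few seed instructions from dolly-15k, place them in a prompt, and ask the generator model for a new instruction in the same style. The prompt wording is illustrative, and the completion call to falcon-40b is left abstract.

import random
from datasets import load_dataset

seeds = load_dataset("databricks/databricks-dolly-15k", split="train")["instruction"]

def example_guided_prompt(k=3):
    # Show k seed instructions, then ask for one new instruction like them.
    examples = "\n".join(f"- {s}" for s in random.sample(seeds, k))
    return ("Here are example instructions:\n"
            f"{examples}\n"
            "Write one new instruction in the same style:")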
https://choosealicense.com/licenses/odc-by/
Dataset Card for Tulu Instruction Mix
For a newer version, see Tulu V2. This version, the human data mixture, consists of a mix of:
FLAN (Apache 2.0): FLAN v2 with CoT examples (most of the tasks in SuperNatural Instructions are included here)
Open Assistant 1 (Apache 2.0)
Dolly (CC BY-SA 3.0)
ShareGPT (Apache 2.0 listed, no official repo found)
GPT4-Alpaca (CC BY-NC 4.0)
Code-Alpaca (CC BY-NC 4.0)
These are made by taking either just the training set of the subsets or the… See the full description on the dataset page: https://huggingface.co/datasets/allenai/tulu-v1-sft-mixture.
Automatically generated multi-turn dataset
Q&A exchanges automatically generated with MaziyarPanahi/Mixtral-8x22B-Instruct-v0.1-GGUF from open data sources.
Related code
Part of the computation used the TSUBAME 4.0 supercomputer at the Tokyo Institute of Technology.
Data sources
The initial question (q1) of each conversation was collected from various data sources; every subsequent turn was generated by Mixtral. The question texts follow the licenses of their source data.
oasst2-33k-ja apache 2.0
databricks-dolly-15k-ja cc-by-sa-3.0
minnade CC0
cyberagent/chatbot-arena-ja-calm2-7b-chat-experimental cc-by-4.0