Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Summary
databricks-dolly-15k is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. This dataset can be used for any purpose, whether academic or commercial, under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported… See the full description on the dataset page: https://huggingface.co/datasets/databricks/databricks-dolly-15k.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
databricks-dolly-15k
This dataset was not originally created by AI Squared. It was curated and created by Databricks. The text below comes from the original release of the dataset's README file on GitHub (available at https://github.com/databrickslabs/dolly/tree/master/data):
Summary
databricks-dolly-15k is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in… See the full description on the dataset page: https://huggingface.co/datasets/aisquared/databricks-dolly-15k.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Korean translation of databricks-dolly-15k via the DeepL API. Note: In some cases, multilingual data was converted to monolingual data during batch translation to Korean using the API. Below is databricks-dolly-15k's README.
Summary
databricks-dolly-15k is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification… See the full description on the dataset page: https://huggingface.co/datasets/nlpai-lab/databricks-dolly-15k-ko.
Dataset Card for "databricks-databricks-dolly-15k"
More Information needed
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
databricks-dolly-15k-ja
This repository provides an instruction tuning dataset developed by LLM-jp, a collaborative project launched in Japan. This dataset is a Japanese translation of databricks-dolly-15k using DeepL.
Send Questions to
llm-jp(at)nii.ac.jp
Model Card Authors
The names are listed in alphabetical order. Hirokazu Kiyomaru, Hiroshi Matsuda, Jun Suzuki, Namgi Han, Saku Sugawara, Shota Sasaki, Shuhei Kurita, Taishi Nakamura, Takashi Kodama, Takumi… See the full description on the dataset page: https://huggingface.co/datasets/llm-jp/databricks-dolly-15k-ja.
In this dataset, you will find a collection of records that show a category, an instruction, a context and a response to that instruction. The aim of the project is to correct the instructions, input and responses to make sure they are of the highest quality and that they match the task category that they belong to. All three texts should be clear and include real information. In addition, the response should be as complete but concise as possible.
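As a rough illustration of the record layout described above, the sketch below checks that a record carries the four text fields and flags any that are blank. The field names follow the description (category, instruction, context, response); the helper names and the sample record are hypothetical and not part of any dataset tooling.

```python
# Field names taken from the description above; everything else is illustrative.
REQUIRED_FIELDS = ("category", "instruction", "context", "response")


def missing_fields(record):
    """Return the required fields that are absent or not strings."""
    return [name for name in REQUIRED_FIELDS if not isinstance(record.get(name), str)]


def blank_fields(record):
    """Return the required string fields that contain only whitespace."""
    return [
        name
        for name in REQUIRED_FIELDS
        if isinstance(record.get(name), str) and not record[name].strip()
    ]


# Hypothetical record, not taken from the dataset.
example = {
    "category": "open_qa",
    "instruction": "What is the capital of France?",
    "context": "",
    "response": "Paris.",
}

problems = missing_fields(example)
empty = blank_fields(example)
```

An empty context is reported by `blank_fields` but is not necessarily an error, since not every task category needs supporting context.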
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Dataset Card for "databricks-dolly-15k-curated-multilingual"
A curated and multilingual version of the Databricks Dolly instructions dataset. It includes a programmatically and manually corrected version of the original en dataset. See below. STATUS: Currently, the original Dolly v2 English version has been curated combining automatic processing and collaborative human curation using Argilla (~400 records have been manually edited and fixed). The following graph shows a summary… See the full description on the dataset page: https://huggingface.co/datasets/argilla/databricks-dolly-15k-curated-multilingual.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Conversion of the databricks/databricks-dolly-15k dataset for use in pretraining. Python code used for conversion:
from datasets import load_dataset
import pandas

dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

def format(columns):
    # Join the stripped instruction and response with a newline.
    instruction = columns["instruction"].strip()
    answer = columns["response"].strip()
    return f"{instruction}\n{answer}"

# Write one "text" column with a formatted row per record.
pandas.DataFrame({"text": [format(columns) for columns in dataset]}).to_csv("train.csv", index=False)
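To see what this kind of conversion produces without downloading anything, here is a minimal offline sketch that applies the same instruction-then-response formatting to an in-memory record and writes it as CSV with the standard library. The single-newline separator, the `to_text` helper, and the sample record are assumptions for illustration only.

```python
import csv
import io


def to_text(record):
    # Instruction first, then the response, both stripped.
    # The single "\n" separator is an assumption.
    instruction = record["instruction"].strip()
    answer = record["response"].strip()
    return f"{instruction}\n{answer}"


# Hypothetical record, not taken from the dataset.
records = [
    {"instruction": " Name a primary color. ", "context": "", "response": "Red.\n"},
]

# Write a one-column CSV into memory instead of a file on disk.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["text"])
for record in records:
    writer.writerow([to_text(record)])
csv_text = buffer.getvalue()
```

Because each row contains an embedded newline, the CSV writer quotes the field, so a reader must parse the file as CSV rather than split it line by line.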
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
databricks/databricks-dolly-15k in ChatML format. Python code used for conversion:
from datasets import load_dataset
import pandas
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path="Felladrin/Llama-160M-Chat-v1"
)

dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

def format(columns):
    instruction = columns["instruction"].strip()
    context = columns["context"].strip()
    response =… See the full description on the dataset page: https://huggingface.co/datasets/Felladrin/ChatML-databricks-dolly-15k.
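Since the conversion code above is truncated, the following is only a hand-written sketch of what a ChatML rendering of one record can look like. It does not use the tokenizer's chat template. The way the context is appended to the user turn, the helper names, and the sample record are all assumptions.

```python
def to_messages(record):
    """Turn one record into a user/assistant message pair."""
    instruction = record["instruction"].strip()
    context = record["context"].strip()
    response = record["response"].strip()
    # Assumption: when context is present, it follows the instruction
    # in the same user turn, separated by a blank line.
    user = f"{instruction}\n\n{context}" if context else instruction
    return [
        {"role": "user", "content": user},
        {"role": "assistant", "content": response},
    ]


def to_chatml(messages):
    """Render messages using the <|im_start|> / <|im_end|> ChatML markers."""
    return "\n".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages
    )


# Hypothetical record, not taken from the dataset.
record = {"instruction": "Q", "context": "", "response": "A"}
rendered = to_chatml(to_messages(record))
```

In practice, a tokenizer's own chat template may add a system turn or different whitespace, so the published dataset can differ from this rendering.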
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Post-training-Data-Flywheel/databricks-dolly-15k dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for "databricks-dolly-15k-chatml"
Dataset Summary
This dataset was created by Re:cast AI to transform the existing databricks/databricks-dolly-15k dataset into a ChatML-friendly format for use in SFT tasks with pretrained models.
Dataset Structure
messages = [ { "content": "You are an expert Q&A system that is trusted around the world. You always... etc.", "role": "system" }, { "content": "(Optional) Context information is… See the full description on the dataset page: https://huggingface.co/datasets/recastai/databricks-dolly-15k-chatml.
Dataset Card for databricks-dolly-15k-curated-es
This dataset has been created with Argilla. As shown in the sections below, this dataset can be loaded into Argilla as explained in Load with Argilla, or used directly with the datasets library in Load with datasets.
Dataset Summary
This dataset contains:
A dataset configuration file conforming to the Argilla dataset format named argilla.cfg. This configuration file will be used to configure the dataset when using the… See the full description on the dataset page: https://huggingface.co/datasets/mariagrandury/databricks-dolly-15k-curated-es.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
okamototk/databricks-dolly-15k-nyan dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
This is a Thai instruction dataset translated from databricks-dolly-15k using Google Cloud Translation. databricks-dolly-15k is an open-source dataset of instruction-following records generated by thousands of Databricks employees in several behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.
vaibhavad/databricks-dolly-15k dataset hosted on Hugging Face and contributed by the HF Datasets community
marcov/instruct-rl-databricks-dolly-15k dataset hosted on Hugging Face and contributed by the HF Datasets community
Summary
aaditya/databricks-dolly-15k-Hindi is an open source Hinglish-Codemix version dataset of databricks/databricks-dolly-15k. This dataset can be used for any purpose, whether academic or commercial, under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported License. Supported Tasks:
Training LLMs, Synthetic Data Generation, Data Augmentation
Languages: Hindi
Version: 1.0
Original Dataset repo… See the full description on the dataset page: https://huggingface.co/datasets/aaditya/databricks-dolly-15k-Hinglish-Codemix.
Dataset Card for "databricks-dolly-15k-llama"
More Information needed
rislemy/databricks-dolly-15k-single-text dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
sam-liu-lmi/databricks-dolly-15k-alpaca-style dataset hosted on Hugging Face and contributed by the HF Datasets community