94 datasets found

databricks-dolly-15k
huggingface.co
Cite
Databricks, databricks-dolly-15k [Dataset]. https://huggingface.co/datasets/databricks/databricks-dolly-15k
Explore at: Croissant
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset authored and provided by
Databricks (http://databricks.com/)
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Description
Summary

databricks-dolly-15k is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. This dataset can be used for any purpose, whether academic or commercial, under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported… See the full description on the dataset page: https://huggingface.co/datasets/databricks/databricks-dolly-15k.
databricks-dolly-15k
huggingface.co
Cite
AI Squared, Inc., databricks-dolly-15k [Dataset]. https://huggingface.co/datasets/aisquared/databricks-dolly-15k
Explore at: Croissant
Dataset authored and provided by
AI Squared, Inc.
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
Description
databricks-dolly-15k

This dataset was not originally created by AI Squared; it was curated and created by Databricks. The text below comes from the original release of the dataset's README file on GitHub (available at https://github.com/databrickslabs/dolly/tree/master/data):

Summary

databricks-dolly-15k is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in… See the full description on the dataset page: https://huggingface.co/datasets/aisquared/databricks-dolly-15k.
databricks-dolly-15k-ko
huggingface.co
Updated Apr 12, 2023
Cite
NLP & AI - Korea University (2023). databricks-dolly-15k-ko [Dataset]. https://huggingface.co/datasets/nlpai-lab/databricks-dolly-15k-ko
Dataset updated
Apr 12, 2023
Dataset authored and provided by
NLP & AI - Korea University
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
Description
Korean translation of databricks-dolly-15k via the DeepL API. Note: in some cases, multilingual data was converted to monolingual data during batch translation to Korean through the API. Below is databricks-dolly-15k's README.

Summary

databricks-dolly-15k is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification… See the full description on the dataset page: https://huggingface.co/datasets/nlpai-lab/databricks-dolly-15k-ko.
databricks-dolly-15k-ja
huggingface.co
Updated Feb 7, 2024
Cite
LLM-jp (2024). databricks-dolly-15k-ja [Dataset]. https://huggingface.co/datasets/llm-jp/databricks-dolly-15k-ja
Explore at: Croissant
Dataset updated
Feb 7, 2024
Dataset authored and provided by
LLM-jp
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
Description
databricks-dolly-15k-ja

This repository provides an instruction tuning dataset developed by LLM-jp, a collaborative project launched in Japan. This dataset is a Japanese translation of databricks-dolly-15k using DeepL.

Send Questions to

llm-jp(at)nii.ac.jp

Model Card Authors

The names are listed in alphabetical order. Hirokazu Kiyomaru, Hiroshi Matsuda, Jun Suzuki, Namgi Han, Saku Sugawara, Shota Sasaki, Shuhei Kurita, Taishi Nakamura, Takashi Kodama, Takumi… See the full description on the dataset page: https://huggingface.co/datasets/llm-jp/databricks-dolly-15k-ja.
Databricks Human Instruction Dataset
opendatabay.com
Updated Jul 4, 2025
Cite
Datasimple (2025). Databricks Human Instruction Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/78cf60f8-b078-411f-aa41-bc5794f3121c
Dataset updated
Jul 4, 2025
Dataset authored and provided by
Datasimple
Area covered
Data Science and Analytics
Description
This dataset is a collection of over 15,000 records generated by Databricks employees, specifically designed to enable large language models to exhibit the interactive qualities of conversational AI. It serves as an open-source, human-generated instruction corpus, invaluable for fine-tuning large language models. The contributors created prompt and response pairs across eight distinct instruction categories, carefully avoiding external web sources (with the exception of Wikipedia for certain subsets) and generative AI in their formulations. This dataset holds significant value for instruction fine-tuning, synthetic data generation, and data augmentation, and is openly available for any purpose, including academic and commercial applications.

Columns

instruction: Represents the prompt or question provided.

context: Serves as reference material relevant to the instruction.

response: Contains the generated response to the instruction.

category: Indicates the annotator behavioural category, derived from the InstructGPT paper.

Distribution

The dataset is provided as a CSV file, containing fields for instruction, context, response, and category. It comprises over 15,000 records, with 14,781 unique values for 'instruction' and 14,944 unique values for 'category'.
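
The unique-value counts quoted above can be checked against the CSV directly. The following is a minimal sketch, assuming only the four column names given in this listing; the sample rows and the `unique_counts` helper are hypothetical, and a real check would read the downloaded file instead of an in-memory string.

```python
import csv
import io

# Hypothetical sample standing in for the real CSV file.
SAMPLE_CSV = """instruction,context,response,category
What is a dataset?,,A collection of records.,open_qa
Name three colors.,,"Red, green, blue.",brainstorming
What is a dataset?,,A set of data points.,open_qa
"""

def unique_counts(csv_text):
    """Return the number of distinct values found in each column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = {name: set() for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            seen[name].add(value)
    return {name: len(values) for name, values in seen.items()}

counts = unique_counts(SAMPLE_CSV)
print(counts)
```

On the full file, a `category` count far above the eight categories described earlier would be worth double-checking, since it may indicate that a figure was attributed to the wrong column.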

Usage

This dataset is ideal for several applications, including:
* Instruction fine-tuning of large language models to enhance their interactive capabilities.
* Generating synthetic data by using the human-generated prompts as few-shot examples for large open language models.
* Data augmentation techniques, such as paraphrasing prompts or short responses to regularise the dataset and improve model robustness.

Coverage

The dataset has a global reach. It was listed on 11/06/2025. The data is human-generated by Databricks employees. While the language used is American English, it is noted that some annotators may not be native English speakers. The demographic profile and subject matter of the data may reflect the composition of Databricks employees. It is important to note that as Wikipedia was consulted for certain categories, the dataset may reflect biases, factual errors, or topical focuses present in Wikipedia.

License

CC-BY-SA

Who Can Use It

This dataset is intended for a wide range of users, including:
* Data Scientists and Machine Learning Engineers: For fine-tuning and developing large language models.
* Researchers: For studies on instruction-following, synthetic data generation, and data augmentation in natural language processing.
* Developers: Building applications that require interactive or instruction-based language model capabilities.
* Organisations: For commercial product development involving custom language models.

Dataset Name Suggestions

Dolly 15K Instruction Corpus

Databricks Human Instruction Data

LLM Fine-tuning Prompt Dataset

Opendatabay Dolly 15K

Interactive AI Training Data

Attribute

Original Data Source: Databricks Dolly 15K Dataset
databricks-dolly-15k-curated-multilingual
huggingface.co
Updated Apr 28, 2023
Cite
Argilla (2023). databricks-dolly-15k-curated-multilingual [Dataset]. https://huggingface.co/datasets/argilla/databricks-dolly-15k-curated-multilingual
Explore at: Croissant
Dataset updated
Apr 28, 2023
Dataset authored and provided by
Argilla
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
Description
Dataset Card for "databricks-dolly-15k-curated-multilingual"

A curated and multilingual version of the Databricks Dolly instructions dataset. It includes a programmatically and manually corrected version of the original en dataset. See below. STATUS: Currently, the original Dolly v2 English version has been curated combining automatic processing and collaborative human curation using Argilla (~400 records have been manually edited and fixed). The following graph shows a summary… See the full description on the dataset page: https://huggingface.co/datasets/argilla/databricks-dolly-15k-curated-multilingual.
Dolly 15K AI Chat Data
opendatabay.com
Updated Jul 6, 2025
Cite
Datasimple (2025). Dolly 15K AI Chat Data [Dataset]. https://www.opendatabay.com/data/ai-ml/a2914db9-a1d3-4d91-84c9-be253ae09386
Dataset updated
Jul 6, 2025
Dataset authored and provided by
Datasimple
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Area covered
Telecommunications & Network Data
Description
This dataset provides over 15,000 prompt-response records designed to power ChatGPT-style conversational applications. It was created by Databricks employees to facilitate the use of large language models (LLMs) for interactive dialogue. The dataset contains prompt-response pairs across eight distinct instruction categories and deliberately avoids information from external web sources, with the exception of Wikipedia for specific instruction sets. This open-source resource is suited to exploring text-based conversation and natural language processing.

Columns

Instruction (Text): This field contains the text prompt intended to generate an appropriate response from a machine learning model or chatbot, utilising natural language processing techniques. It represents what one individual says in a conversation.

Context (Text): Providing additional information, the context field enhances accuracy by offering the model more detail about the ongoing conversation or request execution. Like the instruction, it captures what is said by one individual.

Response (Text): This column holds the conversational reply or what is said back by the other individual in the dialogue.

Category (Text): Each prompt-response pair is classified into one of eight distinct categories based on its content. Examples of unique category values include 'open_qa' and 'general_qa'.

Distribution

The dataset is typically provided as a data file, usually in CSV format. The main train.csv file contains over 15,000 records. Each record represents a unique prompt-response pair, or a single turn in a conversation between two individuals. All columns are of a string data type.

Usage

This dataset is suited for a variety of applications and use cases:
* Training dialogue systems by developing multiple funneling pipelines to enrich models with real-world conversations.
* Creating intelligent chatbot interactions.
* Generating natural language answers as part of Q&A systems.
* Utilising excerpts from Wikipedia for particular subsets of instruction categories.
* Leveraging the classification labels with supervised learning techniques, such as multi-class classification neural networks or logistic regression classifiers.
* Developing deep learning models to detect and respond to conversational intent.
* Training language models for customer service queries using natural language processing (NLP).
* Creating custom dialogue agents capable of handling more intricate conversational interactions.

Coverage

The dataset has a global reach. It was listed on 17/06/2025, and its content focuses on general conversational and Q&A interactions, without specific demographic limitations.

License

CC0

Who Can Use It

This dataset is valuable for a wide range of users, including AI/ML developers, researchers, and data scientists looking to:
* Build and train conversational AI models.
* Develop advanced chatbot applications.
* Explore new insights in natural language processing.
* Create bespoke dialogue agents for various sectors, such as customer service.
* Apply supervised learning to classify conversational data.

Dataset Name Suggestions

Databricks Dolly (15K) Dialogue Data

LLM Training Conversation Dataset

Dolly 15K AI Chat Data

Prompt-Response Pairs for LLMs

Attributes

Original Data Source: Databricks Dolly (15K)
thai_databricks_dolly
huggingface.co
Updated Jun 20, 2024
Cite
SEACrowd (2024). thai_databricks_dolly [Dataset]. https://huggingface.co/datasets/SEACrowd/thai_databricks_dolly
Dataset updated
Jun 20, 2024
Dataset authored and provided by
SEACrowd
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
Description
This is a Thai instruction dataset translated from databricks-dolly-15k using Google Cloud Translation. databricks-dolly-15k is an open-source dataset of instruction-following records generated by thousands of Databricks employees in several behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.
databricks-databricks-dolly-15k
huggingface.co
Updated Sep 21, 2024
Cite
AGIE AI Technology (2024). databricks-databricks-dolly-15k [Dataset]. https://huggingface.co/datasets/agie-ai/databricks-databricks-dolly-15k
Explore at: Croissant
Dataset updated
Sep 21, 2024
Dataset authored and provided by
AGIE AI Technology
Description
Dataset Card for "databricks-databricks-dolly-15k"

More Information needed
databricks-dolly-15k
huggingface.co
Updated Aug 27, 2024
Cite
Post-training-Data-Flywheel (2024). databricks-dolly-15k [Dataset]. https://huggingface.co/datasets/Post-training-Data-Flywheel/databricks-dolly-15k
Explore at: Croissant
Dataset updated
Aug 27, 2024
Dataset authored and provided by
Post-training-Data-Flywheel
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
Description
Post-training-Data-Flywheel/databricks-dolly-15k dataset hosted on Hugging Face and contributed by the HF Datasets community
databricks-dolly-15k
huggingface.co
Updated Oct 19, 2024
Cite
Vaibhav Adlakha (2024). databricks-dolly-15k [Dataset]. https://huggingface.co/datasets/vaibhavad/databricks-dolly-15k
Explore at: Croissant
Dataset updated
Oct 19, 2024
Authors
Vaibhav Adlakha
Description
vaibhavad/databricks-dolly-15k dataset hosted on Hugging Face and contributed by the HF Datasets community
Databricks-Dolly-8k
huggingface.co
Cite
Vishva R, Databricks-Dolly-8k [Dataset]. https://huggingface.co/datasets/Vishva007/Databricks-Dolly-8k
Authors
Vishva R
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
Description
Databricks-Dolly-8k

The resulting dataset contains 8000 samples of the databricks/databricks-dolly-15k dataset.
This split of an even smaller subset is provided for very fast experimentation and evaluation of models when computational resources are highly limited or for quick prototyping.

Dataset Structure

The dataset is provided as a DatasetDict with the following splits:

train: Contains 8000 samples.

Each split contains the following features, identical to the… See the full description on the dataset page: https://huggingface.co/datasets/Vishva007/Databricks-Dolly-8k.
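
One way such a fixed-size subset might be drawn is by sampling row indices with a fixed seed. This is a minimal sketch and not the procedure used by the dataset author; the `sample_subset` helper, the seed value, and the placeholder records are assumptions for illustration.

```python
import random

def sample_subset(records, size, seed=42):
    """Return `size` records drawn without replacement, reproducibly."""
    if size > len(records):
        raise ValueError("subset size exceeds the number of records")
    rng = random.Random(seed)  # local generator, so global random state is untouched
    indices = sorted(rng.sample(range(len(records)), size))
    return [records[i] for i in indices]

# Placeholder records standing in for the full 15k-row dataset.
full = [{"instruction": f"prompt {i}", "response": f"answer {i}"} for i in range(15000)]
subset = sample_subset(full, 8000)
print(len(subset))
```

Fixing the seed keeps the subset stable across runs, which matters when the subset is used to compare models.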
ChatML-databricks-dolly-15k
huggingface.co
Updated Feb 3, 2024
Cite
Victor Nogueira (2024). ChatML-databricks-dolly-15k [Dataset]. https://huggingface.co/datasets/Felladrin/ChatML-databricks-dolly-15k
Explore at: Croissant
Dataset updated
Feb 3, 2024
Authors
Victor Nogueira
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
Description
databricks/databricks-dolly-15k in ChatML format. Python code used for conversion:

from datasets import load_dataset
import pandas
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path="Felladrin/Llama-160M-Chat-v1"
)

dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

def format(columns):
    instruction = columns["instruction"].strip()
    context = columns["context"].strip()
    response =…

See the full description on the dataset page: https://huggingface.co/datasets/Felladrin/ChatML-databricks-dolly-15k.
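
The conversion code above is truncated, so the following is only a sketch of what a ChatML rendering of one record could look like. It writes the `<|im_start|>` and `<|im_end|>` markers by hand instead of using the tokenizer's chat template as the original appears to; the `to_chatml` helper and the way context is folded into the user turn are assumptions.

```python
def to_chatml(record):
    """Render one instruction record as a ChatML-style string."""
    instruction = record["instruction"].strip()
    context = record.get("context", "").strip()
    response = record["response"].strip()
    # Assumption: context, when present, is appended to the user turn.
    user = f"{instruction}\n\n{context}" if context else instruction
    return (
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n{response}<|im_end|>"
    )

# Hypothetical record using the dataset's column names.
example = {
    "instruction": "Summarize the text.",
    "context": "Dolly is an instruction dataset.",
    "response": "Dolly contains instructions.",
}
chatml_text = to_chatml(example)
print(chatml_text)
```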
databricks-dolly-1k
huggingface.co
Updated Feb 16, 2023
Cite
Wenqi Glantz (2023). databricks-dolly-1k [Dataset]. https://huggingface.co/datasets/wenqiglantz/databricks-dolly-1k
Explore at: Croissant
Dataset updated
Feb 16, 2023
Authors
Wenqi Glantz
Description
This is a subset (1000 samples) of databricks/databricks-dolly-15k dataset, processed to match Mistral-7B-instruct-v0.2's prompt format. It was created using the colab notebook.
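
The notebook itself is not reproduced in this listing. As a rough illustration of the target format, the sketch below wraps one record in the `[INST] ... [/INST]` markers used by Mistral instruct models. The `to_mistral_instruct` helper and the handling of context are assumptions, and writing `<s>` and `</s>` as literal text is a simplification, since a tokenizer would normally add these special tokens.

```python
def to_mistral_instruct(record):
    """Wrap one record in Mistral-style instruction markers."""
    instruction = record["instruction"].strip()
    context = record.get("context", "").strip()
    response = record["response"].strip()
    # Assumption: context, when present, follows the instruction in the prompt.
    prompt = f"{instruction}\n\n{context}" if context else instruction
    # Literal <s> and </s> are a simplification of the tokenizer's special tokens.
    return f"<s>[INST] {prompt} [/INST] {response}</s>"

# Hypothetical record using the dataset's column names.
mistral_text = to_mistral_instruct(
    {"instruction": "Name a color.", "context": "", "response": "Blue."}
)
print(mistral_text)
```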
databricks-dolly-15k-chatml
huggingface.co
Cite
Re:cast AI, databricks-dolly-15k-chatml [Dataset]. https://huggingface.co/datasets/recastai/databricks-dolly-15k-chatml
Explore at: Croissant
Dataset authored and provided by
Re:cast AI
Description
Dataset Card for "databricks-dolly-15k-chatml"

Dataset Summary

This dataset has been created by Re:cast AI to transform the existing dataset databricks/databricks-dolly-15k into a ChatML-friendly format for use in SFT tasks with pretrained models.

Dataset Structure

messages = [ { "content": "You are an expert Q&A system that is trusted around the world. You always... etc.", "role": "system" }, { "content": "(Optional) Context information is… See the full description on the dataset page: https://huggingface.co/datasets/recastai/databricks-dolly-15k-chatml.
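
The structure shown above is truncated, so the exact system prompt and user template are not recoverable here. The following is a minimal sketch of assembling a `messages` list of the same shape; `SYSTEM_PROMPT`, the user-turn template, and the `to_messages` helper are hypothetical placeholders, not the text used by Re:cast AI.

```python
SYSTEM_PROMPT = "You are a helpful Q&A assistant."  # hypothetical placeholder

def to_messages(record):
    """Build a system/user/assistant message list from one record."""
    instruction = record["instruction"].strip()
    context = record.get("context", "").strip()
    user = instruction
    if context:
        # Hypothetical template: the context is optional and placed before the question.
        user = f"Context:\n{context}\n\nQuestion:\n{instruction}"
    return [
        {"content": SYSTEM_PROMPT, "role": "system"},
        {"content": user, "role": "user"},
        {"content": record["response"].strip(), "role": "assistant"},
    ]

# Hypothetical record using the dataset's column names.
messages = to_messages(
    {"instruction": "Who wrote it?", "context": "A short note.", "response": "Unknown."}
)
print([m["role"] for m in messages])
```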
pretrain-databricks-dolly-15k
huggingface.co
Updated Jan 24, 2024
Cite
Victor Nogueira (2024). pretrain-databricks-dolly-15k [Dataset]. https://huggingface.co/datasets/Felladrin/pretrain-databricks-dolly-15k
Explore at: Croissant
Dataset updated
Jan 24, 2024
Authors
Victor Nogueira
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
Description
Conversion of databricks/databricks-dolly-15k dataset to be used in pretraining. Python code used for conversion:

from datasets import load_dataset
import pandas

dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

def format(columns):
    # Join the instruction and its answer with a blank line between them.
    instruction = columns["instruction"].strip()
    answer = columns["response"].strip()
    return f"{instruction}\n\n{answer}"

pandas.DataFrame({"text": [format(columns) for columns in dataset]}).to_csv("train.csv", index=False)
databricks-dolly-15k-curated-es
huggingface.co
Cite
María Grandury, databricks-dolly-15k-curated-es [Dataset]. https://huggingface.co/datasets/mariagrandury/databricks-dolly-15k-curated-es
Explore at: Croissant
Authors
María Grandury
Description
Dataset Card for databricks-dolly-15k-curated-es

This dataset has been created with Argilla. As shown in the sections below, this dataset can be loaded into Argilla as explained in Load with Argilla, or used directly with the datasets library in Load with datasets.

Dataset Summary

This dataset contains:

A dataset configuration file conforming to the Argilla dataset format named argilla.cfg. This configuration file will be used to configure the dataset when using the… See the full description on the dataset page: https://huggingface.co/datasets/mariagrandury/databricks-dolly-15k-curated-es.
databricks-dolly-15k-ja-annotated
huggingface.co
Updated Feb 5, 2025
Cite
System K Dev. (2025). databricks-dolly-15k-ja-annotated [Dataset]. https://huggingface.co/datasets/systemk/databricks-dolly-15k-ja-annotated
Dataset updated
Feb 5, 2025
Dataset authored and provided by
System K Dev.
Description
systemk/databricks-dolly-15k-ja-annotated dataset hosted on Hugging Face and contributed by the HF Datasets community
databricks-dolly-100
huggingface.co
Updated Oct 21, 2014
Cite
Wittawat Rakchat (2014). databricks-dolly-100 [Dataset]. https://huggingface.co/datasets/wt-golf/databricks-dolly-100
Explore at: Croissant
Dataset updated
Oct 21, 2014
Authors
Wittawat Rakchat
Description
wt-golf/databricks-dolly-100 dataset hosted on Hugging Face and contributed by the HF Datasets community
databricks-dolly-15k-single-text
huggingface.co
Updated May 6, 2024
Cite
Yassin Elsir (2024). databricks-dolly-15k-single-text [Dataset]. https://huggingface.co/datasets/rislemy/databricks-dolly-15k-single-text
Explore at: Croissant
Dataset updated
May 6, 2024
Authors
Yassin Elsir
Description
rislemy/databricks-dolly-15k-single-text dataset hosted on Hugging Face and contributed by the HF Datasets community

databricks/databricks-dolly-15k

178 scholarly articles cite this dataset (View in Google Scholar)
