Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Summary
databricks-dolly-15k is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. This dataset can be used for any purpose, whether academic or commercial, under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported… See the full description on the dataset page: https://huggingface.co/datasets/databricks/databricks-dolly-15k.
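For orientation, here is a minimal sketch of loading the dataset with the Hugging Face datasets library; the column names in the comment are the ones listed on the dataset card.

from datasets import load_dataset

ds = load_dataset("databricks/databricks-dolly-15k", split="train")
print(ds.column_names)  # ['instruction', 'context', 'response', 'category']

# Filter to one of the behavioral categories named above, e.g. closed QA.
closed_qa = ds.filter(lambda r: r["category"] == "closed_qa")
print(closed_qa[0]["instruction"])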
Dataset Card for Evaluation run of databricks/dolly-v2-7b
Dataset automatically created during the evaluation run of model databricks/dolly-v2-7b. The dataset is composed of 44 configurations, each corresponding to one of the evaluated tasks. The dataset has been created from 2 runs. Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split always points to the latest results. An additional… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/databricks_dolly-v2-7b-details.
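A hedged sketch of pulling one of those configurations; the configuration names are not listed here, so the snippet discovers them rather than assuming one.

from datasets import get_dataset_config_names, load_dataset

repo = "open-llm-leaderboard/databricks_dolly-v2-7b-details"
configs = get_dataset_config_names(repo)  # one entry per evaluated task

# "train" always points at the latest run; timestamped splits hold earlier runs.
details = load_dataset(repo, configs[0], split="train")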
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Dataset Card for "databricks-dolly-15k-curated-multilingual"
A curated and multilingual version of the Databricks Dolly instructions dataset. It includes a programmatically and manually corrected version of the original en dataset. See below. STATUS: Currently, the original Dolly v2 English version has been curated by combining automatic processing and collaborative human curation using Argilla (~400 records have been manually edited and fixed). The following graph shows a summary… See the full description on the dataset page: https://huggingface.co/datasets/argilla/databricks-dolly-15k-curated-multilingual.
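A minimal sketch of loading the curated data; whether each language lives in its own split or its own configuration is an assumption here, so the snippet lists the splits first.

from datasets import get_dataset_split_names, load_dataset

repo = "argilla/databricks-dolly-15k-curated-multilingual"
print(get_dataset_split_names(repo))       # expect language-coded splits (assumption)
dolly_en = load_dataset(repo, split="en")  # assumes an "en" split exists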
Dataset Card for Evaluation run of databricks/dolly-v2-3b
Dataset automatically created during the evaluation run of model databricks/dolly-v2-3b. The dataset is composed of 44 configurations, each corresponding to one of the evaluated tasks. The dataset has been created from 1 run. Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split always points to the latest results. An additional… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/databricks_dolly-v2-3b-details.
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Dataset Card for "dolly-15k"
Summary
This is the dataset supplied by Databricks for training Dolly V2. This set is split 99% training / 1% validation, should you want to set aside some records for evaluation purposes.
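If you would rather cut the partition yourself from the raw Databricks release, a sketch along these lines reproduces the 99/1 proportions (the seed value is an arbitrary choice, not the one used here).

from datasets import load_dataset

raw = load_dataset("databricks/databricks-dolly-15k", split="train")
parts = raw.train_test_split(test_size=0.01, seed=42)
train, validation = parts["train"], parts["test"]
print(len(train), len(validation))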
Special thanks to ❤️ Databricks for creating and making this set available.
More Information needed
BSC Dolly 15k EN
A reviewed version of the Argilla Dolly v2 English dataset, originally created by Databricks. We provide two subsets: "annotated", where some instances are labelled with potential problems, and "filtered", which contains only the instances without the issues we observed.
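A hedged sketch of loading the two subsets; the subset names "annotated" and "filtered" come from the description above, but the repository path below is a hypothetical placeholder, since the card does not spell it out here.

from datasets import load_dataset

REPO_ID = "bsc/dolly-15k-en"  # hypothetical placeholder; check the actual repo path
annotated = load_dataset(REPO_ID, "annotated", split="train")
filtered = load_dataset(REPO_ID, "filtered", split="train")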
Annotation process
While analysing the Argilla Dolly v2 English version, we observed the following:
Task classification:
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
An explain-tuned Dolly-V2 dataset (~15K records) created using approaches from the Orca research paper. We leverage all 15 system instructions provided in the Orca research paper to generate explain-tuned data, in contrast to the vanilla instruction-tuning approach used by the original datasets. This helps student models like orca_mini_13b, orca_mini_7b, or orca_mini_3b learn the thought process of the teacher model, which is ChatGPT (gpt-3.5-turbo-0301 version). Please see how the System prompt is added before… See the full description on the dataset page: https://huggingface.co/datasets/pankajmathur/dolly-v2_orca.
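A minimal sketch of the idea, not the authors' pipeline: prepend a system instruction to each Dolly record so the student model sees the reasoning-eliciting prompt during fine-tuning. The system string is an illustrative paraphrase, not a quote from the Orca paper.

from datasets import load_dataset

SYSTEM = "You are a helpful assistant. Think step by step and explain your reasoning."

dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

def add_system_prompt(record):
    # Combine the system instruction with the original Dolly instruction.
    record["prompt"] = f"### System:\n{SYSTEM}\n\n### User:\n{record['instruction']}"
    return record

explain_tuned = dolly.map(add_system_prompt)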
https://market.us/privacy-policy/
The Hand Truck and Dolly market is expected to be worth around USD 2.0 billion by 2034, up from USD 1.3 billion in 2024, at a CAGR of 4.3%.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Translated with googletrans==3.1.0a0 from the original dataset. *Part of the data (up to 600 records) was lost during translation.
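A sketch of what that translation step likely looked like with the pinned googletrans build; the try/except mirrors how records can silently drop, as the note above mentions, and the target language is a placeholder.

from googletrans import Translator  # pip install googletrans==3.1.0a0

translator = Translator()

def translate_or_none(text, dest="en"):  # dest is a placeholder, not the card's target
    try:
        return translator.translate(text, dest=dest).text
    except Exception:
        return None  # failed records end up dropped, like the ~600 noted above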
license: apache-2.0
Dataset Card for "final_training_set_v1"
Fine-tuning datasets for WangChanGLM sourced from LAION OIG chip2 and infill_dbpedia (Apache-2.0), Databricks Dolly v2 (Apache-2.0), OpenAI TL;DR (MIT), and Hello-SimpleAI HC3 (CC BY-SA)
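A hedged sketch of assembling such a mixture: load each source, normalize it to a shared instruction/input/output schema, then concatenate. Only the Dolly step is shown, and the column mapping is an assumption.

from datasets import concatenate_datasets, load_dataset

dolly = load_dataset("databricks/databricks-dolly-15k", split="train")
dolly = dolly.rename_columns({"context": "input", "response": "output"}).remove_columns("category")
# ...normalize OIG chip2, infill_dbpedia, TL;DR, and HC3 the same way, then:
# mixture = concatenate_datasets([dolly, oig, infill, tldr, hc3])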
Dataset Card for "InstructMix"
Description: InstructMix is a versatile instruction-tuning dataset available in Alpaca format. It encompasses a variety of instruction-related tasks and sources, making it well suited for fine-tuning instruction-following large language models.
Included Datasets:
Dataset Name | Size | Type | Details | GitHub Repo
Alpaca_GPT4 | 52,002 examples | General Instruction | Generated by GPT-4 using Alpaca | GitHub Repo
dolly 2.0 | 15,015 examples | Closed… See the full description on the dataset page: https://huggingface.co/datasets/Xilabs/instructmix.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Here is a collective list of the instruction datasets used for Neural Chat fine-tuning. The total numbers of instruction samples and tokens are about 1.5M and 5M respectively.
Type | Language | Dataset | Number
HC3 | en | HC3 | 24K
dolly | en | databricks-dolly-15k | 15K
alpaca-zh | zh | tigerbot-alpaca-zh-0.5m | 500K
alpaca-en | en | TigerResearch/tigerbot-alpaca-en-50k | 50K
math | en | tigerbot-gsm-8k-en | 8K
general | en | tigerbot-stackexchange-qa-en-0.5m | 500K
OpenOrca | en | Open-Orca/OpenOrca | 400K (sampled)… See the full description on the dataset page: https://huggingface.co/datasets/Intel/neural-chat-dataset-v2.
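A minimal sketch of the "400K (sampled)" row: shuffle OpenOrca with a fixed seed and keep the first 400K rows. The seed is an arbitrary choice, not the one Intel used.

from datasets import load_dataset

openorca = load_dataset("Open-Orca/OpenOrca", split="train")
sampled = openorca.shuffle(seed=42).select(range(400_000))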
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for "KULLM-v2"
Dataset Summary
Korean translation of GPT4ALL, Dolly, and Vicuna data. Repository: nlpai-lab/KULLM. Hugging Face: nlpai-lab/kullm-v2.
Translate dataset
The 'instruction', 'input', and 'output' fields in the dataset were translated via the DeepL API.
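A sketch of that per-field translation step using the official deepl Python client; the client choice is an assumption, since the card only says the DeepL API was used.

import deepl  # pip install deepl

translator = deepl.Translator("YOUR_AUTH_KEY")

def translate_record(record):
    # Translate each of the three fields named above into Korean.
    for key in ("instruction", "input", "output"):
        record[key] = translator.translate_text(record[key], target_lang="KO").text
    return record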
License
Apache-2.0
from datasets import load_dataset
ds = load_dataset("nlpai-lab/kullm-v2", split="train")
ds
# DatasetDict({ train: Dataset({ features: ['id'… See the full description on the dataset page: https://huggingface.co/datasets/csujeong/kullm-v2.1.
https://choosealicense.com/licenses/other/
Automatically generated multi-turn dataset
Q&A exchanges automatically generated with Calm3-22b from open data sources.
Part of the computation used the TSUBAME 4.0 supercomputer at the Tokyo Institute of Technology.
Data sources
The initial question (q1) of each conversation was collected from various data sources; every subsequent turn was generated by Calm. The question texts follow the licenses of their source data, listed below; a schematic of the generation loop follows the list.
oasst2-33k-ja apache 2.0
databricks-dolly-15k-ja cc-by-sa-3.0
minnade CC0
cyberagent/chatbot-arena-ja-calm2-7b-chat-experimental cc-by-4.0
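A schematic of the generation loop described above, not the authors' code: seed with a question from one of the sources, then let the model write both the answers and the follow-up questions. Here `generate` stands in for a Calm3-22b inference call.

def build_multiturn(q1, generate, n_turns=3):
    """Seed with q1, then alternate model-written answers and follow-ups."""
    messages = [{"role": "user", "content": q1}]
    for turn in range(n_turns):
        # Model answers the current question.
        messages.append({"role": "assistant", "content": generate(messages)})
        if turn < n_turns - 1:
            # Model also writes the next user question, per the card's description.
            messages.append({"role": "user", "content": generate(messages)})
    return messages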
Overview
The LaMini Dataset is an instruction dataset generated using h2ogpt-gm-oasst1-en-2048-falcon-40b-v2. It is designed for instruction-tuning pre-trained models to specialize them in a variety of downstream tasks.
Dataset Generation
Base Model: h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2
Seed Instructions: sourced from the databricks/databricks-dolly-15k dataset
Generation Approach: example-guided and topic-guided strategies
Total Instructions: 1,504 unique… See the full description on the dataset page: https://huggingface.co/datasets/SurgeGlobal/LaMini.
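A hedged sketch of the example-guided strategy named above: sample a few seed instructions from dolly-15k, place them in a prompt, and ask the generator model for a new instruction in the same style. The prompt wording is illustrative, and the completion call to falcon-40b is left abstract.

import random
from datasets import load_dataset

seeds = load_dataset("databricks/databricks-dolly-15k", split="train")["instruction"]

def example_guided_prompt(k=3):
    # Show k seed instructions, then ask for one new instruction like them.
    examples = "\n".join(f"- {s}" for s in random.sample(seeds, k))
    return ("Here are example instructions:\n"
            f"{examples}\n"
            "Write one new instruction in the same style:")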
https://choosealicense.com/licenses/odc-by/
Dataset Card for Tulu Instruction Mix
For a newer version, see Tulu V2. This version, the human data mixture, consists of a mix of:
FLAN (Apache 2.0): FLAN v2 with CoT examples (most of the tasks in SuperNatural Instructions are included here)
Open Assistant 1 (Apache 2.0)
Dolly (CC BY-SA 3.0)
ShareGPT (Apache 2.0 listed, no official repo found)
GPT4-Alpaca (CC BY-NC 4.0)
Code-Alpaca (CC BY-NC 4.0)
These are made by taking either just the training set of the subsets or the… See the full description on the dataset page: https://huggingface.co/datasets/allenai/tulu-v1-sft-mixture.
Automatically generated multi-turn dataset
Q&A exchanges automatically generated with MaziyarPanahi/Mixtral-8x22B-Instruct-v0.1-GGUF from open data sources.
Related code
Part of the computation used the TSUBAME 4.0 supercomputer at the Tokyo Institute of Technology.
Data sources
The initial question (q1) of each conversation was collected from various data sources; every subsequent turn was generated by Mixtral. The question texts follow the licenses of their source data.
oasst2-33k-ja apache 2.0
databricks-dolly-15k-ja cc-by-sa-3.0
minnade CC0
cyberagent/chatbot-arena-ja-calm2-7b-chat-experimental cc-by-4.0