databricks-dolly-15k is an open source dataset of instruction-following records used in training databricks/dolly-v2-12b. It was generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.
This dataset can be used for any purpose, whether academic or commercial, under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported License.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('databricks_dolly', split='train')
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
databricks-dolly-15k-ja
This repository provides an instruction tuning dataset developed by LLM-jp, a collaborative project launched in Japan. This dataset is a Japanese translation of databricks-dolly-15k using DeepL.
Send Questions to
llm-jp(at)nii.ac.jp
Model Card Authors
The names are listed in alphabetical order. Hirokazu Kiyomaru, Hiroshi Matsuda, Jun Suzuki, Namgi Han, Saku Sugawara, Shota Sasaki, Shuhei Kurita, Taishi Nakamura, Takashi Kodama, Takumi… See the full description on the dataset page: https://huggingface.co/datasets/llm-jp/databricks-dolly-15k-ja.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This is a subset of the databricks-dolly-15k dataset (databricks/databricks-dolly-15k) used for fine-tuning Google's Gemma model (google/gemma-2b). This version contains only the records without context, to match the dataset used in Google's Keras fine-tuning example.
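A minimal sketch of how such a no-context subset could be reproduced with the Hugging Face datasets library; the filter below is an assumption about how the subset was built, not the maintainers' documented script:

from datasets import load_dataset

# Load the original Dolly dataset from the Hugging Face Hub.
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

# Keep only records with an empty context field, matching the subset
# used in the Keras fine-tuning example (assumed filtering criterion).
no_context = dolly.filter(lambda ex: ex["context"].strip() == "")
print(len(no_context), "records without context")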
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
databricks-dolly-15k
This dataset was not originally created by AI Squared. This dataset was curated and created by Databricks. The text below comes from the README file of the dataset's original GitHub release (available at https://github.com/databrickslabs/dolly/tree/master/data):
Summary
databricks-dolly-15k is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in… See the full description on the dataset page: https://huggingface.co/datasets/aisquared/databricks-dolly-15k.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Korean translation of databricks-dolly-15k via the DeepL API. Note: during batch translation to Korean using the API, some multilingual records were converted into monolingual Korean records. Below is databricks-dolly-15k's README; a sketch of such a translation pipeline follows the excerpt.
Summary
databricks-dolly-15k is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification… See the full description on the dataset page: https://huggingface.co/datasets/nlpai-lab/databricks-dolly-15k-ko.
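As a rough illustration, a batch translation like the one described above could be driven by the official deepl Python client together with the Hugging Face datasets library. This is a hedged sketch, not the pipeline nlpai-lab actually used; the API key is a placeholder and the per-field mapping is an assumption:

import deepl
from datasets import load_dataset

translator = deepl.Translator("YOUR_DEEPL_AUTH_KEY")  # placeholder key
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

def to_korean(example):
    # Translate each text field to Korean; empty fields are left untouched.
    for field in ("instruction", "context", "response"):
        if example[field]:
            example[field] = translator.translate_text(
                example[field], target_lang="KO"
            ).text
    return example

dolly_ko = dolly.map(to_korean)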
igntrevor/Databricks dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
databricks-dolly-15k is a corpus of more than 15,000 records generated by thousands of Databricks employees to enable large language models to exhibit the magical interactivity of ChatGPT. Databricks employees were invited to create prompt/response pairs in each of eight different instruction categories, including the seven outlined in the InstructGPT paper, as well as an open-ended free-form category.

The contributors were instructed to avoid using information from any source on the web with the exception of Wikipedia (for particular subsets of instruction categories), and explicitly instructed to avoid using generative AI in formulating instructions or responses. Examples of each behavior were provided to motivate the types of questions and instructions appropriate to each category.

Halfway through the data generation process, contributors were given the option of answering questions posed by other contributors. They were asked to rephrase the original question and only select questions they could be reasonably expected to answer correctly. For certain categories, contributors were asked to provide reference texts copied from Wikipedia. Reference text (indicated by the context field in the actual dataset) may contain bracketed Wikipedia citation numbers (e.g. [42]), which we recommend users remove for downstream applications.
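Stripping those citation markers is a one-line regular expression; a minimal Python example:

import re

def strip_citations(text):
    # Remove bracketed Wikipedia citation numbers such as [42].
    return re.sub(r"\[\d+\]", "", text)

print(strip_citations("Paris is the capital of France.[1][23]"))
# Paris is the capital of France.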
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Dataset Card for "databricks-dolly-15k-curated-multilingual"
A curated and multilingual version of the Databricks Dolly instructions dataset. It includes a programmatically and manually corrected version of the original en dataset. See below. STATUS: Currently, the original Dolly v2 English version has been curated combining automatic processing and collaborative human curation using Argilla (~400 records have been manually edited and fixed). The following graph shows a summary… See the full description on the dataset page: https://huggingface.co/datasets/argilla/databricks-dolly-15k-curated-multilingual.
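A minimal loading sketch with the Hugging Face datasets library; the "en" configuration name is an assumption based on the card's mention of a curated en version:

from datasets import load_dataset

# "en" is assumed to be the configuration name for the curated English subset.
ds = load_dataset("argilla/databricks-dolly-15k-curated-multilingual", "en")
print(ds)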
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Post-training-Data-Flywheel/databricks-dolly-15k dataset hosted on Hugging Face and contributed by the HF Datasets community
Overview
The LaMini Dataset is an instruction dataset generated using h2ogpt-gm-oasst1-en-2048-falcon-40b-v2. It is designed for instruction-tuning pre-trained models to specialize them in a variety of downstream tasks.
Dataset Generation
Base Model: h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2
Seed Instructions: Sourced from the databricks/databricks-dolly-15k dataset
Generation Approach: Example-guided and topic-guided strategies
Total Instructions: 1,504 unique instruction examples
Dataset Sources
Repository: Bitbucket Project
Paper: Pre-Print
Structure
Each entry in the dataset contains:
- Instruction
- Response
Usage
The LaMini Dataset can be used to fine-tune language models to improve their ability to follow instructions and generate relevant responses.
Access
The dataset is available on Hugging Face at the following link: https://huggingface.co/datasets/SurgeGlobal/LaMini
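A minimal loading sketch with the Hugging Face datasets library; the "train" split name is an assumption:

from datasets import load_dataset

# Load the LaMini instruction dataset; each record holds an
# instruction/response pair as described above.
lamini = load_dataset("SurgeGlobal/LaMini", split="train")  # split name assumed
print(lamini[0])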
Citation
If you find our work useful, please cite our paper as follows:

@misc{surge2024openbezoar,
  title={OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data},
  author={Chandeepa Dissanayake and Lahiru Lowe and Sachith Gunasekara and Yasiru Ratnayake},
  year={2024},
  eprint={2404.12195},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
Dataset Authors
Chandeepa Dissanayake, Lahiru Lowe, Sachith Gunasekara, and Yasiru Ratnayake
wt-golf/databricks-dolly-100 dataset hosted on Hugging Face and contributed by the HF Datasets community
kranthigv/databricks-dolly-15k_standardized dataset hosted on Hugging Face and contributed by the HF Datasets community
sremigere/databricks-dolly-llama2-1k dataset hosted on Hugging Face and contributed by the HF Datasets community
ray12332/databricks-mini dataset hosted on Hugging Face and contributed by the HF Datasets community
https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
The Intel/Orca/DPO Dialogue Pairs dataset is a resource for natural language processing (NLP) research that combines AI and human conversations collected from online sources. It is useful for exploring how human conversations can inform the development of conversational AI models. With columns such as System and Question extracted from chat logs, the dataset can help researchers understand how to better connect people with technology through meaningful dialogue. The data also includes columns for ChatGPT and Llama2-13b-Chat, two widely used conversational AI models. Researchers can use this dataset to explore conversational techniques that enable humans and machines to communicate in natural language.
This guide will provide an overview of how to use the Intel/Orca/DPO Dialogue Pairs dataset efficiently for human-centric natural language processing research.
Step 1: Understand the dataset
The Intel/Orca/DPO Dialogue Pairs dataset is composed of two main columns: System and Question. The System column contains responses from AI systems, and the Question column contains questions asked by humans. Additionally, the dataset contains columns for ChatGPT and Llama2-13b-Chat, two models used in developing conversational AI systems.
Step 2: Prepare your environment
Before analyzing data from this dataset, prepare your environment: make sure any necessary libraries or services are installed on your machine to avoid errors during usage.
Step 3: Access the data
To access the data in this dataset, you can either download it directly via a Kaggle account or access it through one of its REST endpoints if available on other services (e.g., Databricks).

Step 4: Explore and analyze the data

Step 5: Report results
Once explorations and analyses are complete, report the results accurately. This matters especially for ethically sensitive datasets such as dialogue pairs, where disseminating misinformation can have serious consequences. Reporting should declare the standard relevant indicators and apply appropriate statistical tests to rule out anomalous outcomes.
- Developing and improving natural language processing algorithms for AI-human conversation.
- Building user-friendly chatbots that are better at recognizing and understanding human intent by training the model using this dataset.
- Designing recommendation systems to predict user questions and generate more accurate responses based on previous conversations in the dataset.
If you use this dataset in your research, please credit the original authors.
Data Source
License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. No copyright: you can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission.
File: train.csv

| Column name     | Description                                                                  |
|:----------------|:-----------------------------------------------------------------------------|
| system          | Contains the AI system's response to the user's question. (Text)             |
| chatgpt         | Contains the ChatGPT model's response to the user's question. (Text)         |
| llama2-13b-chat | Contains the Llama2-13b-Chat model's response to the user's question. (Text) |
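A quick way to inspect these columns with pandas, assuming a local copy of train.csv:

import pandas as pd

# Load the dialogue-pairs file and confirm the documented columns are present.
df = pd.read_csv("train.csv")
print(df.columns.tolist())  # expected to include: system, chatgpt, llama2-13b-chat
print(df.head(3))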
If you use this dataset in your research, please credit the original authors and Huggingface Hub.
applecrumble123/databricks-mini dataset hosted on Hugging Face and contributed by the HF Datasets community
bergr7f/databricks-dolly-15k-subset-general_qa dataset hosted on Hugging Face and contributed by the HF Datasets community
SoumyaM/databricks-micro dataset hosted on Hugging Face and contributed by the HF Datasets community
vaibhavad/databricks-dolly-15k dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
pratapswati/databricks-mini-pratap dataset hosted on Hugging Face and contributed by the HF Datasets community