Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Summary
databricks-dolly-15k is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. This dataset can be used for any purpose, whether academic or commercial, under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported… See the full description on the dataset page: https://huggingface.co/datasets/databricks/databricks-dolly-15k.
andrewbai/databricks-dolly-15k dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
This dataset (databricks-dolly-15k-ja-gozaru) was created using "kunishou/databricks-dolly-15k-ja" (https://huggingface.co/datasets/kunishou/databricks-dolly-15k-ja). It is licensed under CC BY-SA 3.0. Last update: 2023-05-28.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Beyza Coban
Released under Apache 2.0
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Databricks-dolly-15k is a corpus of more than 15,000 records generated by thousands of Databricks employees, intended to enable large language models to exhibit the interactivity of ChatGPT. Databricks employees were invited to create prompt/response pairs in each of eight different instruction categories: the seven outlined in the InstructGPT paper plus an open-ended, free-form category. Contributors were instructed to refrain from using information from any source on the web except Wikipedia (for a specific subset of instruction categories), and were explicitly instructed to avoid using generative AI in formulating instructions or responses. Examples of each behavior were provided to motivate the question types and instructions appropriate to each category.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
This dataset contains 14,934 instructions, contexts and responses, in several natural language categories such as classification, closed QA, generation, etc. The original English dataset was created by @databricks, who crowd-sourced the data creation via its employees. The current dataset is a translation of that dataset through ChatGPT (gpt-3.5-turbo).
Data Instances
{
"id": 14963,
"instruction": "Wat zijn de duurste steden ter wereld?",
"context": "",
"response": "Dit is een uitgebreide lijst van de duurste steden: Singapore, Tel Aviv, New York, Hong Kong, Los Angeles, Zurich, Genève, San Francisco, Parijs en Sydney.",
"category": "brainstorming"
}
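A record like the one above can be read into a small typed structure. The following is only a minimal sketch: `DollyRecord` and `parse_record` are illustrative names that are not part of the dataset or of any library, and the response text in the example is shortened.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class DollyRecord:
    """One translated record; field names follow the example instance."""
    id: int
    instruction: str
    context: str  # empty string when the task has no context
    response: str
    category: str


def parse_record(line: str) -> DollyRecord:
    """Parse one JSON object (for example a JSON Lines row) into a DollyRecord."""
    raw = json.loads(line)
    return DollyRecord(
        id=int(raw["id"]),
        instruction=raw["instruction"],
        context=raw.get("context", ""),  # assumption: a missing context means "none"
        response=raw["response"],
        category=raw["category"],
    )


# Shortened copy of the example instance shown above.
example = parse_record(
    '{"id": 14963, "instruction": "Wat zijn de duurste steden ter wereld?", '
    '"context": "", "response": "Dit is een uitgebreide lijst ...", '
    '"category": "brainstorming"}'
)
```

Freezing the dataclass keeps records hashable and guards against accidental edits while filtering or grouping by category.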
Data Fields
The id field identifies each record. The following 77 IDs are missing because their translation failed (see Dataset Creation): [1502, 1812, 1868, 4179, 4541, 6347, 8851, 9321, 10588, 10835, 11257, 12082, 12319, 12471, 12701, 12988, 13066, 13074, 13076, 13181, 13253, 13279, 13313, 13346, 13369, 13446, 13475, 13528, 13546, 13548, 13549, 13558, 13566, 13600, 13603, 13657, 13668, 13733, 13765, 13775, 13801, 13831, 13906, 13922, 13923, 13957, 13967, 13976, 14028, 14031, 14045, 14050, 14082, 14083, 14089, 14110, 14155, 14162, 14181, 14187, 14200, 14221, 14222, 14281, 14473, 14475, 14476, 14587, 14590, 14667, 14685, 14764, 14780, 14808, 14836, 14891, 14966]
Dataset Creation
Both the translations and the topics were translated with OpenAI's API for gpt-3.5-turbo, with max_tokens=1024 and temperature=0 as parameters.
The prompt template to translate the input is (where src_lang was English and tgt_lang Dutch):
CONVERSATION_TRANSLATION_PROMPT = """You are asked to translate a task's instruction, optional context to the task, and the response to the task, from {src_lang} to {tgt_lang}.
Here are the requirements that you should adhere to:
1. maintain the format: the task consists of a task instruction (marked `instruction: `), optional context to the task (marked `context: `) and response for the task marked with `response: `;
2. do not translate the identifiers `instruction: `, `context: `, and `response: ` but instead copy them to your output;
3. make sure that text is fluent to read and does not contain grammatical errors. Use standard {tgt_lang} without regional bias;
4. translate the instruction and context text using informal, but standard, language;
5. make sure to avoid biases (such as gender bias, grammatical bias, social bias);
6. if the instruction is to correct grammar mistakes or spelling mistakes then you have to generate a similar mistake in the context in {tgt_lang}, and then also generate a corrected output version in the output in {tgt_lang};
7. if the instruction is to translate text from one language to another, then you do not translate the text that needs to be translated in the instruction or the context, nor the translation in the response (just copy them as-is);
8. do not translate code fragments but copy them to your output. If there are English examples, variable names or definitions in code fragments, keep them in English.
Now translate the following task with the requirements set out above. Do not provide an explanation and do not add anything else.
"""
The system message was:
You are a helpful assistant that translates English to Dutch according to the requirements that are given to you.
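To make the setup concrete, here is a rough sketch of how the template, the system message and one record could be combined into a chat request. It is not the exact code used for this dataset. The `build_messages` helper, the shortened `PROMPT_TEMPLATE` stand-in and the way the three marked fields are appended after the prompt are all assumptions; only the model name, `max_tokens` and `temperature` come from the description above. No API call is made.

```python
# Shortened stand-in for CONVERSATION_TRANSLATION_PROMPT (assumption: the real
# template is filled in the same way, via str.format).
PROMPT_TEMPLATE = (
    "You are asked to translate a task's instruction, optional context to the "
    "task, and the response to the task, from {src_lang} to {tgt_lang}.\n"
    "Now translate the following task with the requirements set out above.\n"
)
SYSTEM_TEMPLATE = (
    "You are a helpful assistant that translates {src_lang} to {tgt_lang} "
    "according to the requirements that are given to you."
)

# Request parameters as stated in the card.
REQUEST_PARAMS = {"model": "gpt-3.5-turbo", "max_tokens": 1024, "temperature": 0}


def build_messages(instruction, context, response, src_lang="English", tgt_lang="Dutch"):
    """Return a chat-style message list for one record (hypothetical helper)."""
    # Assumed layout: the three marked fields follow the prompt, one per block.
    task = f"instruction: {instruction}\n\ncontext: {context}\n\nresponse: {response}"
    prompt = PROMPT_TEMPLATE.format(src_lang=src_lang, tgt_lang=tgt_lang)
    system = SYSTEM_TEMPLATE.format(src_lang=src_lang, tgt_lang=tgt_lang)
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": prompt + "\n" + task},
    ]


messages = build_messages(
    "What are the most expensive cities in the world?", "", "Singapore, Tel Aviv, ..."
)
```

The resulting `messages` list and `REQUEST_PARAMS` would then be passed to whichever chat-completion client is in use.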
Note that 77 items (0.5%) were not successfully translated. This can mean either that the prompt was too long for the given limit (max_tokens=1024) or that the generated translation could not be parsed into instruction, context and response fields. The missing IDs are [1502, 1812, 1868, 4179, 4541, 6347, 8851, 9321, 10588, 10835, 11257, 12082, 12319, 12471, 12701, 12988, 13066, 13074, 13076, 13181, 13253, 13279, 13313, 13346, 13369, 13446, 13475, 13528, 13546, 13548, 13549, 13558, 13566, 13600, 13603, 13657, 13668, 13733, 13765, 13775, 13801, 13831, 13906, 13922, 13923, 13957, 13967, 13976, 14028, 14031, 14045, 14050, 14082, 14083, 14089, 14110, 14155, 14162, 14181, 14187, 14200, 14221, 14222, 14281, 14473, 14475, 14476, 14587, 14590, 14667, 14685, 14764, 14780, 14808, 14836, 14891, 14966].
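The parsing step mentioned above can be pictured roughly as follows. This is a simplified sketch, not the parser that was actually used. `parse_translation` is a hypothetical helper that looks for the three markers the prompt asks the model to copy, and returns `None` when a marker is missing or out of order, as would happen when the output is cut off at the token limit. It does not handle the case where a marker string also occurs inside the translated text itself.

```python
def parse_translation(text):
    """Split model output into its three marked fields, or return None."""
    markers = ("instruction:", "context:", "response:")
    positions = [text.find(marker) for marker in markers]
    # Every marker must be present, and they must appear in the expected order.
    if -1 in positions or positions != sorted(positions):
        return None
    i, c, r = positions
    fields = {
        "instruction": text[i + len(markers[0]):c].strip(),
        "context": text[c + len(markers[1]):r].strip(),
        "response": text[r + len(markers[2]):].strip(),
    }
    # An empty context is allowed; instruction and response are required.
    if not fields["instruction"] or not fields["response"]:
        return None
    return fields


parsed = parse_translation("instruction: Wat is X?\n\ncontext: \n\nresponse: Y.")
truncated = parse_translation("instruction: Wat is X?\n\ncontext: een lange tekst")

# Consistency check on the counts given in this card: 77 missing, 14,934 kept.
missing, kept = 77, 14934
missing_share_percent = round(100 * missing / (missing + kept), 1)
```

Here `parsed` holds the three fields, `truncated` is `None`, and the share of missing items rounds to the 0.5% quoted above.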
Initial Data Collection and Normalization
Initial data collection by databricks. See their repository for more information about this dataset.
Considerations for Using the Data
Note that the translations in this new dataset have not been verified by humans! Use at your own risk, both in terms of quality and biases.
Discussion of Biases
As with any machine-generated text, users should be aware of potential biases in this dataset. Although the prompt specifically includes "make sure to avoid biases (such as gender bias, grammatical bias, social bias)", the impact of such an instruction is not known. Biases likely remain in the dataset, so use it with caution.
Other Known Limitations
The translation quality has not been verified. Use at your own risk!
Licensing Information
This repository follows the original databricks license, which is CC BY-SA 3.0 but see below for a specific restriction.
This text was generated (either in part or in full) with GPT-3 (gpt-3.5-turbo), OpenAI's large-scale language-generation model. Upon generating draft language, the author reviewed, edited, and revised the language to their own liking and takes ultimate responsibility for the content of this publication.
If you use this dataset, you must also follow the Sharing and Usage policies.
As clearly stated in their Terms of Use, specifically 2c.iii, "[you may not] use output from the Services to develop models that compete with OpenAI". That means that you cannot use this dataset to build models that are intended to commercially compete with OpenAI. As far as I am aware, that is a specific restriction that should serve as an addendum to the current license.
This dataset is also available on the Hugging Face hub, its canonical repository.
In this dataset, you will find a collection of records that show a category, an instruction, a context and a response to that instruction. The aim of the project is to correct the instructions, input and responses to make sure they are of the highest quality and that they match the task category that they belong to. All three texts should be clear and include real information. In addition, the response should be as complete yet concise as possible.
rislemy/databricks-dolly-15k-single-text dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for "databricks-dolly-15k-chatml"
Dataset Summary
This dataset has been created by Re:cast AI to transform the existing dataset databricks/databricks-dolly-15k into a chatml friendly format for use in SFT tasks with pretrained models.
Dataset Structure
messages = [ { "content": "You are an expert Q&A system that is trusted around the world. You always... etc.", "role": "system" }, { "content": "(Optional) Context information is… See the full description on the dataset page: https://huggingface.co/datasets/recastai/databricks-dolly-15k-chatml.
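As an illustration of what such a conversion involves, the sketch below turns one dolly-style record into a role/content message list and renders it with ChatML delimiters. It is not Re:cast AI's code. The system prompt is a placeholder because the card's actual prompt is truncated above, and the way context and instruction are joined in the user message is an assumption.

```python
# Placeholder only: the card's actual system prompt is truncated in the text above.
SYSTEM_PROMPT = "<system prompt goes here>"


def to_messages(record):
    """Convert one dolly-style record into a list of role/content messages."""
    user_parts = []
    if record.get("context"):
        # Assumed layout: optional context first, then the instruction.
        user_parts.append(f"Context:\n{record['context']}")
    user_parts.append(record["instruction"])
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "\n\n".join(user_parts)},
        {"role": "assistant", "content": record["response"]},
    ]


def to_chatml(messages):
    """Render messages using ChatML start and end delimiters."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )


record = {
    "instruction": "Name three primary colors.",
    "context": "",
    "response": "Red, yellow and blue.",
    "category": "open_qa",
}
chat = to_messages(record)
rendered = to_chatml(chat)
```

Keeping the message list separate from the rendered string lets a tokenizer's own chat template be applied to `chat` instead of `to_chatml` when one is available.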
kamrr/databricks-dolly-15k-alpaca dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Suchinthana/databricks-dolly-15k-tamil dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Dataset Card for "databricks-dolly-15k-sinhala"
More Information needed
Overview
The LaMini Dataset is an instruction dataset generated using h2ogpt-gm-oasst1-en-2048-falcon-40b-v2. It is designed for instruction-tuning pre-trained models to specialize them in a variety of downstream tasks.
Dataset Generation
Base Model: h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2.
Seed Instructions: Sourced from the databricks/databricks-dolly-15k dataset.
Generation Approach: Example-guided and topic-guided strategies.
Total Instructions: 1,504 unique instruction examples.
Dataset Sources
Repository: Bitbucket Project
Paper: Pre-Print
Structure
Each entry in the dataset contains:
- Instruction
- Response
Usage
The LaMini Dataset can be used to fine-tune language models to improve their ability to follow instructions and generate relevant responses.
Access
The dataset is available on HuggingFace at the following link: https://huggingface.co/datasets/SurgeGlobal/LaMini
Citation
If you find our work useful, please cite our paper as follows:
@misc{surge2024openbezoar,
  title={OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data},
  author={Chandeepa Dissanayake and Lahiru Lowe and Sachith Gunasekara and Yasiru Ratnayake},
  year={2024},
  eprint={2404.12195},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
Dataset Authors
Chandeepa Dissanayake, Lahiru Lowe, Sachith Gunasekara, and Yasiru Ratnayake
lilac/databricks-dolly-15k-curated-en
This dataset is a Lilac processed dataset. Original dataset: https://huggingface.co/datasets/argilla/databricks-dolly-15k-curated-en
To download the dataset to a local directory:
lilac download lilacai/lilac-databricks-dolly-15k-curated-en
or from Python with:
ll.download("lilacai/lilac-databricks-dolly-15k-curated-en")
rchu233/databricks-dolly-15k-modernbert-split-kmeans-dim768-20250130 dataset hosted on Hugging Face and contributed by the HF Datasets community
NamburiSrinath/databricks-dolly-15k-modernbert-train-kmeans-dim768-20250316 dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Dataset Card for "dolly-15k"
Summary
This is the dataset supplied by Databricks for training Dolly V2. This set is split 99% training / 1% validation, should you want to set aside some records for evaluation purposes.
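The card does not say how the 99% / 1% split was produced. One common way to get a reproducible split of that shape is a seeded shuffle, sketched below. The function name, the seed and the rounding rule are assumptions, not a description of how this dataset was actually split.

```python
import random


def train_validation_split(records, validation_fraction=0.01, seed=0):
    """Shuffle a copy of `records` and split off a validation slice."""
    shuffled = list(records)
    # A dedicated Random instance keeps the split reproducible and avoids
    # touching the global random state.
    random.Random(seed).shuffle(shuffled)
    # Always keep at least one validation record, even for tiny inputs.
    n_validation = max(1, round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


# Integers stand in for dataset records in this toy example.
train, validation = train_validation_split(range(1000))
```

With 1,000 toy records this yields 990 training and 10 validation items, and the same seed always gives the same partition.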
Special thanks to ❤️ Databricks for creating and making this set available.
More Information needed
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Overview
This dataset is edited from kunishou/databricks-dolly-15k-en. It was edited to sound like Yuki Nagato, who appears in "The Melancholy of Haruhi Suzumiya", with an emotionless and indifferent way of speaking. In more detail, I used VS Code and similar tools to replace "です、ます" with "だ、である", etc.
It's a dataset for my hobby, but feel free to use it.
Links… See the full description on the dataset page: https://huggingface.co/datasets/WarriorMama777/databricks-dolly-15k-ja_cool.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Summary databricks-dolly-15k ( https://huggingface.co/datasets/databricks/databricks-dolly-15k/ ) is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. This translation into Portuguese was executed utilizing a technique from the HIPPO benchmark. By… See the full description on the dataset page: https://huggingface.co/datasets/Gustrd/dolly-15k-hippo-translated-pt-12k.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Original dataset: databricks/databricks-dolly-15k