databricks-dolly-15k is an open source dataset of instruction-following records used in training databricks/dolly-v2-12b. It was generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.
This dataset can be used for any purpose, whether academic or commercial, under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported License.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('databricks_dolly', split='train')
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
databricks-dolly-15k-ja
This repository provides an instruction tuning dataset developed by LLM-jp, a collaborative project launched in Japan. This dataset is a Japanese translation of databricks-dolly-15k using DeepL.
Send Questions to
llm-jp(at)nii.ac.jp
Model Card Authors
The names are listed in alphabetical order. Hirokazu Kiyomaru, Hiroshi Matsuda, Jun Suzuki, Namgi Han, Saku Sugawara, Shota Sasaki, Shuhei Kurita, Taishi Nakamura, Takashi Kodama, Takumi… See the full description on the dataset page: https://huggingface.co/datasets/llm-jp/databricks-dolly-15k-ja.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This is a subset of the databricks-dolly-15k dataset (databricks/databricks-dolly-15k) used for fine-tuning Google's Gemma model (google/gemma-2b). This version contains only the records without context, to match the dataset used in Google's Keras fine-tuning example.
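A minimal sketch of how such a no-context subset could be reproduced with the Hugging Face datasets library; the filter below is an assumption about how the subset was built, not the maintainers' documented script:

from datasets import load_dataset

# Load the original Dolly dataset from the Hugging Face Hub.
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

# Keep only records with an empty context field, matching the subset
# used in the Keras fine-tuning example (assumed filtering criterion).
no_context = dolly.filter(lambda ex: ex["context"].strip() == "")
print(len(no_context), "records without context")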
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
databricks-dolly-15k
This dataset was not originally created by AI Squared. This dataset was curated and created by Databricks. The text below comes from the README file of the dataset's original GitHub release (available at https://github.com/databrickslabs/dolly/tree/master/data):
Summary
databricks-dolly-15k is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in… See the full description on the dataset page: https://huggingface.co/datasets/aisquared/databricks-dolly-15k.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Korean translation of databricks-dolly-15k via the DeepL API. Note: during batch translation to Korean using the API, some multilingual records were converted into monolingual Korean records. Below is databricks-dolly-15k's README; a sketch of such a translation pipeline follows the excerpt.
Summary
databricks-dolly-15k is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification… See the full description on the dataset page: https://huggingface.co/datasets/nlpai-lab/databricks-dolly-15k-ko.
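As a rough illustration, a batch translation like the one described above could be driven by the official deepl Python client together with the Hugging Face datasets library. This is a hedged sketch, not the pipeline nlpai-lab actually used; the API key is a placeholder and the per-field mapping is an assumption:

import deepl
from datasets import load_dataset

translator = deepl.Translator("YOUR_DEEPL_AUTH_KEY")  # placeholder key
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

def to_korean(example):
    # Translate each text field to Korean; empty fields are left untouched.
    for field in ("instruction", "context", "response"):
        if example[field]:
            example[field] = translator.translate_text(
                example[field], target_lang="KO"
            ).text
    return example

dolly_ko = dolly.map(to_korean)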
igntrevor/Databricks dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
databricks-dolly-15k is a corpus of more than 15,000 records generated by thousands of Databricks employees to enable large language models to exhibit the magical interactivity of ChatGPT. Databricks employees were invited to create prompt/response pairs in each of eight different instruction categories, including the seven outlined in the InstructGPT paper, as well as an open-ended free-form category.

The contributors were instructed to avoid using information from any source on the web with the exception of Wikipedia (for particular subsets of instruction categories), and explicitly instructed to avoid using generative AI in formulating instructions or responses. Examples of each behavior were provided to motivate the types of questions and instructions appropriate to each category.

Halfway through the data generation process, contributors were given the option of answering questions posed by other contributors. They were asked to rephrase the original question and only select questions they could be reasonably expected to answer correctly. For certain categories, contributors were asked to provide reference texts copied from Wikipedia. Reference text (indicated by the context field in the actual dataset) may contain bracketed Wikipedia citation numbers (e.g. [42]), which we recommend users remove for downstream applications.
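Stripping those citation markers is a one-line regular expression; a minimal Python example:

import re

def strip_citations(text):
    # Remove bracketed Wikipedia citation numbers such as [42].
    return re.sub(r"\[\d+\]", "", text)

print(strip_citations("Paris is the capital of France.[1][23]"))
# Paris is the capital of France.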
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Dataset Card for "databricks-dolly-15k-curated-multilingual"
A curated and multilingual version of the Databricks Dolly instructions dataset. It includes a programmatically and manually corrected version of the original en dataset. See below. STATUS: Currently, the original Dolly v2 English version has been curated combining automatic processing and collaborative human curation using Argilla (~400 records have been manually edited and fixed). The following graph shows a summary… See the full description on the dataset page: https://huggingface.co/datasets/argilla/databricks-dolly-15k-curated-multilingual.
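A minimal loading sketch with the Hugging Face datasets library; the "en" configuration name is an assumption based on the card's mention of a curated en version:

from datasets import load_dataset

# "en" is assumed to be the configuration name for the curated English subset.
ds = load_dataset("argilla/databricks-dolly-15k-curated-multilingual", "en")
print(ds)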
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Post-training-Data-Flywheel/databricks-dolly-15k dataset hosted on Hugging Face and contributed by the HF Datasets community
Overview
The LaMini Dataset is an instruction dataset generated using h2ogpt-gm-oasst1-en-2048-falcon-40b-v2. It is designed for instruction-tuning pre-trained models to specialize them in a variety of downstream tasks.
Dataset Generation
Base Model: h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2
Seed Instructions: Sourced from the databricks/databricks-dolly-15k dataset
Generation Approach: Example-guided and topic-guided strategies
Total Instructions: 1,504 unique instruction examples
Dataset Sources
Repository: Bitbucket Project
Paper: Pre-Print
Structure
Each entry in the dataset contains:
- Instruction
- Response
Usage
The LaMini Dataset can be used to fine-tune language models to improve their ability to follow instructions and generate relevant responses.
Access
The dataset is available on Hugging Face at the following link: https://huggingface.co/datasets/SurgeGlobal/LaMini
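A minimal loading sketch with the Hugging Face datasets library; the "train" split name is an assumption:

from datasets import load_dataset

# Load the LaMini instruction dataset; each record holds an
# instruction/response pair as described above.
lamini = load_dataset("SurgeGlobal/LaMini", split="train")  # split name assumed
print(lamini[0])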
Citation
If you find our work useful, please cite our paper as follows:

@misc{surge2024openbezoar,
  title={OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data},
  author={Chandeepa Dissanayake and Lahiru Lowe and Sachith Gunasekara and Yasiru Ratnayake},
  year={2024},
  eprint={2404.12195},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
Dataset Authors
Chandeepa Dissanayake, Lahiru Lowe, Sachith Gunasekara, and Yasiru Ratnayake
wt-golf/databricks-dolly-100 dataset hosted on Hugging Face and contributed by the HF Datasets community
kranthigv/databricks-dolly-15k_standardized dataset hosted on Hugging Face and contributed by the HF Datasets community
sremigere/databricks-dolly-llama2-1k dataset hosted on Hugging Face and contributed by the HF Datasets community
ray12332/databricks-mini dataset hosted on Hugging Face and contributed by the HF Datasets community
https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
The Intel/Orca/DPO Dialogue Pairs dataset is a resource for natural language processing (NLP) research that combines AI and human conversations collected from online sources. It is useful for exploring how human conversations can inform the development of conversational AI models. With columns such as System and Question extracted from chat logs, the dataset can help researchers understand how to better connect people with technology through meaningful dialogue. The data also includes columns for ChatGPT and Llama2-13b-Chat, two widely used conversational AI models. Researchers can use this dataset to explore conversational techniques that enable humans and machines to communicate in natural language.
This guide will provide an overview of how to use the Intel/Orca/DPO Dialogue Pairs dataset efficiently for human-centric natural language processing research.
Step 1: Understand the dataset
The Intel/Orca/DPO Dialogue Pairs dataset is composed of two main columns: System and Question. The System column contains responses from AI systems, and the Question column contains questions asked by humans. Additionally, the dataset contains columns for ChatGPT and Llama2-13b-Chat, two models used in developing conversational AI systems.
Step 2: Prepare your environment
Before analyzing data from this dataset, prepare your environment: make sure any necessary libraries or services are installed on your machine to avoid errors during usage.
Step 3: Access the data
To access the data in this dataset, you can either download it directly via a Kaggle account or access it through one of its REST endpoints if available on other services (e.g., Databricks).

Step 4: Explore and analyze the data

Step 5: Report results
Once explorations and analyses are complete, report the results accurately. This matters especially for ethically sensitive datasets such as dialogue pairs, where disseminating misinformation can have serious consequences. Reporting should declare the standard relevant indicators and apply appropriate statistical tests to rule out anomalous outcomes.
- Developing and improving natural language processing algorithms for AI-human conversation.
- Building user-friendly chatbots that are better at recognizing and understanding human intent by training the model using this dataset.
- Designing recommendation systems to predict user questions and generate more accurate responses based on previous conversations in the dataset.
If you use this dataset in your research, please credit the original authors.
Data Source
License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. No copyright: you can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission.
File: train.csv

| Column name     | Description                                                                  |
|:----------------|:-----------------------------------------------------------------------------|
| system          | Contains the AI system's response to the user's question. (Text)             |
| chatgpt         | Contains the ChatGPT model's response to the user's question. (Text)         |
| llama2-13b-chat | Contains the Llama2-13b-Chat model's response to the user's question. (Text) |
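A quick way to inspect these columns with pandas, assuming a local copy of train.csv:

import pandas as pd

# Load the dialogue-pairs file and confirm the documented columns are present.
df = pd.read_csv("train.csv")
print(df.columns.tolist())  # expected to include: system, chatgpt, llama2-13b-chat
print(df.head(3))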
If you use this dataset in your research, please credit the original authors and Huggingface Hub.
applecrumble123/databricks-mini dataset hosted on Hugging Face and contributed by the HF Datasets community
bergr7f/databricks-dolly-15k-subset-general_qa dataset hosted on Hugging Face and contributed by the HF Datasets community
SoumyaM/databricks-micro dataset hosted on Hugging Face and contributed by the HF Datasets community
vaibhavad/databricks-dolly-15k dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
pratapswati/databricks-mini-pratap dataset hosted on Hugging Face and contributed by the HF Datasets community