94 datasets found
  1. databricks_dolly

    • tensorflow.org
    Updated Sep 9, 2023
    Cite
    (2023). databricks_dolly [Dataset]. https://www.tensorflow.org/datasets/catalog/databricks_dolly
    Dataset updated
    Sep 9, 2023
    Description

    databricks-dolly-15k is an open source dataset of instruction-following records used in training databricks/dolly-v2-12b that was generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.

    This dataset can be used for any purpose, whether academic or commercial, under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported License.

    To use this dataset:

    import tensorflow_datasets as tfds

    # Load the training split (downloads the data on first use).
    ds = tfds.load('databricks_dolly', split='train')
    # Print the first four examples.
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.

  2. databricks-dolly-15k-ja

    • huggingface.co
    Updated Feb 7, 2024
    Cite
    LLM-jp (2024). databricks-dolly-15k-ja [Dataset]. https://huggingface.co/datasets/llm-jp/databricks-dolly-15k-ja
    Dataset updated
    Feb 7, 2024
    Dataset authored and provided by
    LLM-jp
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    databricks-dolly-15k-ja

    This repository provides an instruction tuning dataset developed by LLM-jp, a collaborative project launched in Japan. This dataset is a Japanese translation of databricks-dolly-15k using DeepL.

      Send Questions to
    

    llm-jp(at)nii.ac.jp

      Model Card Authors
    

    The names are listed in alphabetical order. Hirokazu Kiyomaru, Hiroshi Matsuda, Jun Suzuki, Namgi Han, Saku Sugawara, Shota Sasaki, Shuhei Kurita, Taishi Nakamura, Takashi Kodama, Takumi… See the full description on the dataset page: https://huggingface.co/datasets/llm-jp/databricks-dolly-15k-ja.

  3. databricks-mini

    • huggingface.co
    Updated Feb 27, 2024
    Cite
    Shrinivasan Sankar (2024). databricks-mini [Dataset]. https://huggingface.co/datasets/ai-bites/databricks-mini
    Dataset updated
    Feb 27, 2024
    Authors
    Shrinivasan Sankar
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This is a subset of the databricks 15k dataset databricks/databricks-dolly-15k used for finetuning Google's Gemma model google/gemma-2b. This version has only those records without context to match the dataset used in the fine-tuning Keras example from Google.
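The "records without context" selection described above can be sketched as follows. This is a minimal illustration, not the dataset author's actual script: it assumes each record is a dict with `instruction`, `context`, and `response` keys and that a missing context is stored as an empty or whitespace-only string. The helper name `without_context` and the sample records are made up for the example.

```python
def without_context(records):
    """Keep only records whose context field is empty or whitespace."""
    # Assumption: a record "without context" has an empty or missing context string.
    return [r for r in records if not r.get("context", "").strip()]


# Made-up sample records, for illustration only.
sample = [
    {"instruction": "Suggest a name for a cat.", "context": "", "response": "Whiskers."},
    {"instruction": "Summarize the passage.", "context": "Some reference text.", "response": "A short summary."},
]

subset = without_context(sample)
print(len(subset))
```

Filtering on the stripped string, rather than on an exact match with `""`, also drops records whose context holds only spaces or newlines.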

  4. databricks-dolly-15k

    • huggingface.co
    Cite
    AI Squared, Inc., databricks-dolly-15k [Dataset]. https://huggingface.co/datasets/aisquared/databricks-dolly-15k
    Dataset authored and provided by
    AI Squared, Inc.
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    databricks-dolly-15k

    This dataset was not originally created by AI Squared. This dataset was curated and created by Databricks. The below text comes from the original release of the dataset's README file in GitHub (available at https://github.com/databrickslabs/dolly/tree/master/data):

      Summary
    

    databricks-dolly-15k is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in… See the full description on the dataset page: https://huggingface.co/datasets/aisquared/databricks-dolly-15k.

  5. databricks-dolly-15k-ko

    • huggingface.co
    Updated Apr 12, 2023
    Cite
    NLP & AI - Korea University (2023). databricks-dolly-15k-ko [Dataset]. https://huggingface.co/datasets/nlpai-lab/databricks-dolly-15k-ko
    Dataset updated
    Apr 12, 2023
    Dataset authored and provided by
    NLP & AI - Korea University
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Korean translation of databricks-dolly-15k via the DeepL API. Note: in some cases, multilingual data was converted to monolingual data during batch translation to Korean using the API. Below is databricks-dolly-15k's README.

      Summary
    

    databricks-dolly-15k is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification… See the full description on the dataset page: https://huggingface.co/datasets/nlpai-lab/databricks-dolly-15k-ko.

  6. Data from: Databricks

    • huggingface.co
    Updated Jul 31, 2024
    Cite
    Ignatius Balayo (2024). Databricks [Dataset]. https://huggingface.co/datasets/igntrevor/Databricks
    Dataset updated
    Jul 31, 2024
    Authors
    Ignatius Balayo
    Description

    igntrevor/Databricks dataset hosted on Hugging Face and contributed by the HF Datasets community

  7. databricks-dolly-15k-ja-reformat-v1

    • opendatalab.com
    • huggingface.co
    zip
    Updated Apr 13, 2023
    Cite
    (2023). databricks-dolly-15k-ja-reformat-v1 [Dataset]. https://opendatalab.com/OpenDataLab/databricks-dolly-15k-ja-reformat-v1
    Available download formats: zip
    Dataset updated
    Apr 13, 2023
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    databricks-dolly-15k is a corpus of more than 15,000 records generated by thousands of Databricks employees to enable large language models to exhibit the magical interactivity of ChatGPT. Databricks employees were invited to create prompt / response pairs in each of eight different instruction categories, including the seven outlined in the InstructGPT paper, as well as an open-ended free-form category. The contributors were instructed to avoid using information from any source on the web with the exception of Wikipedia (for particular subsets of instruction categories), and explicitly instructed to avoid using generative AI in formulating instructions or responses. Examples of each behavior were provided to motivate the types of questions and instructions appropriate to each category. Halfway through the data generation process, contributors were given the option of answering questions posed by other contributors. They were asked to rephrase the original question and only select questions they could be reasonably expected to answer correctly. For certain categories contributors were asked to provide reference texts copied from Wikipedia. Reference text (indicated by the context field in the actual dataset) may contain bracketed Wikipedia citation numbers (e.g. [42]) which we recommend users remove for downstream applications.
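The recommended removal of bracketed Wikipedia citation numbers from the context field could look roughly like this. It is a sketch under stated assumptions: the pattern only targets purely numeric markers such as `[42]`, the helper name `strip_citations` is made up, and the sample sentence is illustrative rather than taken from the dataset.

```python
import re

# Matches bracketed numeric markers such as [42]; other bracketed text is left alone.
CITATION = re.compile(r"\[\d+\]")


def strip_citations(context: str) -> str:
    """Remove [n]-style citation markers and collapse leftover double spaces."""
    cleaned = CITATION.sub("", context)
    return re.sub(r" {2,}", " ", cleaned).strip()


# Illustrative sample text, not an actual dataset record.
print(strip_citations("Tokyo is the capital of Japan.[42] It is a large city.[7]"))
```

Restricting the pattern to digits keeps legitimate bracketed text (for example editorial insertions) intact. Whether that is the right trade-off depends on the downstream application.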

  8. databricks-dolly-15k-curated-multilingual

    • huggingface.co
    Updated Apr 28, 2023
    Cite
    Argilla (2023). databricks-dolly-15k-curated-multilingual [Dataset]. https://huggingface.co/datasets/argilla/databricks-dolly-15k-curated-multilingual
    Dataset updated
    Apr 28, 2023
    Dataset authored and provided by
    Argilla
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Dataset Card for "databricks-dolly-15k-curated-multilingual"

    A curated and multilingual version of the Databricks Dolly instructions dataset. It includes a programmatically and manually corrected version of the original en dataset. See below. STATUS: Currently, the original Dolly v2 English version has been curated combining automatic processing and collaborative human curation using Argilla (~400 records have been manually edited and fixed). The following graph shows a summary… See the full description on the dataset page: https://huggingface.co/datasets/argilla/databricks-dolly-15k-curated-multilingual.

  9. databricks-dolly-15k

    • huggingface.co
    Updated Aug 27, 2024
    Cite
    Post-training-Data-Flywheel (2024). databricks-dolly-15k [Dataset]. https://huggingface.co/datasets/Post-training-Data-Flywheel/databricks-dolly-15k
    Dataset updated
    Aug 27, 2024
    Dataset authored and provided by
    Post-training-Data-Flywheel
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Post-training-Data-Flywheel/databricks-dolly-15k dataset hosted on Hugging Face and contributed by the HF Datasets community

  10. SurgeGlobal/LaMini Dataset

    • paperswithcode.com
    Updated Apr 17, 2024
    Cite
    Chandeepa Dissanayake; Lahiru Lowe; Sachith Gunasekara; Yasiru Ratnayake (2024). SurgeGlobal/LaMini Dataset [Dataset]. https://paperswithcode.com/dataset/surgeglobal-lamini
    Dataset updated
    Apr 17, 2024
    Authors
    Chandeepa Dissanayake; Lahiru Lowe; Sachith Gunasekara; Yasiru Ratnayake
    Description

    Overview

    The LaMini Dataset is an instruction dataset generated using h2ogpt-gm-oasst1-en-2048-falcon-40b-v2. It is designed for instruction-tuning pre-trained models to specialize them in a variety of downstream tasks.

    Dataset Generation

    • Base Model: h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2
    • Seed Instructions: Sourced from the databricks/databricks-dolly-15k dataset
    • Generation Approach: Example-guided and topic-guided strategies
    • Total Instructions: 1,504 unique instruction examples

    Dataset Sources

    • Repository: Bitbucket Project
    • Paper: Pre-Print

    Structure

    Each entry in the dataset contains:

    • Instruction
    • Response

    Usage

    The LaMini Dataset can be used to fine-tune language models to improve their ability to follow instructions and generate relevant responses.

    Access

    The dataset is available on HuggingFace at the following link: https://huggingface.co/datasets/SurgeGlobal/LaMini

    Citation

    If you find our work useful, please cite our paper as follows:

    @misc{surge2024openbezoar,
      title={OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data},
      author={Chandeepa Dissanayake and Lahiru Lowe and Sachith Gunasekara and Yasiru Ratnayake},
      year={2024},
      eprint={2404.12195},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
    }

    Dataset Authors

    Chandeepa Dissanayake, Lahiru Lowe, Sachith Gunasekara, and Yasiru Ratnayake
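As a rough illustration of the fine-tuning use mentioned above, an Instruction/Response entry can be flattened into a single training string. Everything specific here is an assumption: the source does not give a prompt template, so `TEMPLATE` is hypothetical, as are the lowercase key names `instruction` and `response` and the sample entry.

```python
# Hypothetical prompt template; the dataset description does not specify one.
TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"


def to_training_text(entry: dict) -> str:
    """Render one instruction/response entry as a single training string."""
    # Assumption: entries expose lowercase 'instruction' and 'response' keys.
    return TEMPLATE.format(
        instruction=entry["instruction"].strip(),
        response=entry["response"].strip(),
    )


# Made-up entry, for illustration only.
example = {"instruction": " List two fruits. ", "response": "Apples and pears.\n"}
print(to_training_text(example))
```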

  11. databricks-dolly-100

    • huggingface.co
    Updated Oct 21, 2014
    Cite
    Wittawat Rakchat (2014). databricks-dolly-100 [Dataset]. https://huggingface.co/datasets/wt-golf/databricks-dolly-100
    Dataset updated
    Oct 21, 2014
    Authors
    Wittawat Rakchat
    Description

    wt-golf/databricks-dolly-100 dataset hosted on Hugging Face and contributed by the HF Datasets community

  12. databricks-dolly-15k_standardized

    • huggingface.co
    Updated Aug 31, 2000
    Cite
    Kranthi Kiran GV (2000). databricks-dolly-15k_standardized [Dataset]. https://huggingface.co/datasets/kranthigv/databricks-dolly-15k_standardized
    Dataset updated
    Aug 31, 2000
    Authors
    Kranthi Kiran GV
    Description

    kranthigv/databricks-dolly-15k_standardized dataset hosted on Hugging Face and contributed by the HF Datasets community

  13. databricks-dolly-llama2-1k

    • huggingface.co
    Updated Feb 16, 2023
    Cite
    Stephane Remigereau (2023). databricks-dolly-llama2-1k [Dataset]. https://huggingface.co/datasets/sremigere/databricks-dolly-llama2-1k
    Dataset updated
    Feb 16, 2023
    Authors
    Stephane Remigereau
    Description

    sremigere/databricks-dolly-llama2-1k dataset hosted on Hugging Face and contributed by the HF Datasets community

  14. databricks-mini

    • huggingface.co
    Updated Mar 9, 2025
    Cite
    Ray (2025). databricks-mini [Dataset]. https://huggingface.co/datasets/ray12332/databricks-mini
    Dataset updated
    Mar 9, 2025
    Authors
    Ray
    Description

    ray12332/databricks-mini dataset hosted on Hugging Face and contributed by the HF Datasets community

  15. Orca DPO Dialogue Pairs

    • kaggle.com
    • opendatabay.com
    Updated Nov 23, 2023
    Cite
    The Devastator (2023). Orca DPO Dialogue Pairs [Dataset]. https://www.kaggle.com/datasets/thedevastator/intel-orca-dialogue-pairs
    Dataset updated
    Nov 23, 2023
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Intel Orca Dialogue Pairs

    Orca style for preference training (Intel's DPO dataset)

    By Huggingface Hub [source]

    About this dataset

    The Intel/Orca/DPO Dialogue Pairs dataset is a resource for natural language processing (NLP) research that combines AI and human conversations collected from online sources. It is useful for exploring how human conversations can inform the development of conversational AI models. With columns such as System and Question extracted from chat logs, the dataset can help researchers understand how to better connect people with technology through meaningful dialogue. The data also includes columns for ChatGPT and Llama2-13b-Chat, two widely used conversational AI models. With this dataset, researchers can explore conversational techniques that enable humans and machines to communicate in natural language.


    How to use the dataset

    This guide will provide an overview of how to use the Intel/Orca/DPO Dialogue Pairs dataset efficiently for human-centric natural language processing research.

    Step 1: Understand the dataset

    The Intel/Orca/DPO Dialogue Pairs dataset is composed of two main columns: System and Question. The System column contains responses from AI systems, and the Question column contains questions asked by humans. Additionally, this dataset also contains columns for ChatGPT and Llama2-13b-Chat, two models used in developing conversational AI systems.

    Step 2: Prepare your environment

    Before getting started with analyzing data from this dataset, you should first prepare your environment accordingly. Make sure that any necessary libraries or services are installed on your machine before attempting to work with the data from this dataset in order to avoid potential issues or errors during usage.

    Step 3: Access the data

    To access the data, either download it directly with a Kaggle account or, where available, use a REST endpoint offered by another service (e.g. Databricks).

    Step 4: Explore and analyze the data

    Step 5: Report results

    Once exploration and analysis are complete, report the results accurately. This matters especially for ethically sensitive data such as dialogue pairs, where spreading misinformation could have serious consequences. Reports should state the standard relevant indicators and include appropriate statistical tests to rule out anomalous results.

    Research Ideas

    • Developing and improving natural language processing algorithms for AI-human conversation.
    • Building user-friendly chatbots that are better at recognizing and understanding human intent by training the model using this dataset.
    • Designing recommendation systems to predict user questions and generate more accurate responses based on previous conversations in the dataset

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv

    | Column name     | Description                                                                  |
    |:----------------|:-----------------------------------------------------------------------------|
    | system          | Contains the AI system's response to the user's question. (Text)             |
    | chatgpt         | Contains the ChatGPT model's response to the user's question. (Text)         |
    | llama2-13b-chat | Contains the Llama2-13b-Chat model's response to the user's question. (Text) |
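One way the columns above might be turned into preference pairs for DPO-style training is sketched below. Several points are assumptions rather than facts from the dataset page: that `train.csv` also has a lowercase `question` header, and that the `chatgpt` response is treated as preferred and the `llama2-13b-chat` response as dispreferred. The helper name `preference_pairs` and the in-memory sample standing in for the real file are made up.

```python
import csv
import io

# Tiny made-up stand-in for train.csv; real rows would come from
# open("train.csv", newline="", encoding="utf-8").
SAMPLE_CSV = (
    "system,question,chatgpt,llama2-13b-chat\n"
    "You are a helpful assistant.,What is 2 + 2?,4,The answer might be 5.\n"
)


def preference_pairs(handle):
    """Yield one dict per CSV row with prompt, chosen, and rejected text."""
    for row in csv.DictReader(handle):
        yield {
            "prompt": row["question"],           # assumption: column is named 'question'
            "chosen": row["chatgpt"],            # assumption: treated as the preferred answer
            "rejected": row["llama2-13b-chat"],  # assumption: treated as the dispreferred answer
        }


pairs = list(preference_pairs(io.StringIO(SAMPLE_CSV)))
print(pairs[0]["chosen"])
```

Using `csv.DictReader` rather than splitting on commas keeps quoted fields intact, which matters because model responses often contain commas and newlines.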

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Huggingface Hub.

  16. databricks-mini

    • huggingface.co
    Updated Oct 20, 2024
    Cite
    Johnathon (2024). databricks-mini [Dataset]. https://huggingface.co/datasets/applecrumble123/databricks-mini
    Dataset updated
    Oct 20, 2024
    Authors
    Johnathon
    Description

    applecrumble123/databricks-mini dataset hosted on Hugging Face and contributed by the HF Datasets community

  17. databricks-dolly-15k-subset-general_qa

    • huggingface.co
    Updated Oct 9, 2024
    Cite
    Bernardo Garcia (2024). databricks-dolly-15k-subset-general_qa [Dataset]. https://huggingface.co/datasets/bergr7f/databricks-dolly-15k-subset-general_qa
    Dataset updated
    Oct 9, 2024
    Authors
    Bernardo Garcia
    Description

    bergr7f/databricks-dolly-15k-subset-general_qa dataset hosted on Hugging Face and contributed by the HF Datasets community

  18. databricks-micro

    • huggingface.co
    Updated Sep 12, 2024
    Cite
    Soumyadipta Maiti (2024). databricks-micro [Dataset]. https://huggingface.co/datasets/SoumyaM/databricks-micro
    Dataset updated
    Sep 12, 2024
    Authors
    Soumyadipta Maiti
    Description

    SoumyaM/databricks-micro dataset hosted on Hugging Face and contributed by the HF Datasets community

  19. databricks-dolly-15k

    • huggingface.co
    Updated Oct 19, 2024
    Cite
    Vaibhav Adlakha (2024). databricks-dolly-15k [Dataset]. https://huggingface.co/datasets/vaibhavad/databricks-dolly-15k
    Dataset updated
    Oct 19, 2024
    Authors
    Vaibhav Adlakha
    Description

    vaibhavad/databricks-dolly-15k dataset hosted on Hugging Face and contributed by the HF Datasets community

  20. databricks-mini-pratap

    • huggingface.co
    Updated Mar 25, 2024
    Cite
    Surwase (2024). databricks-mini-pratap [Dataset]. https://huggingface.co/datasets/pratapswati/databricks-mini-pratap
    Dataset updated
    Mar 25, 2024
    Authors
    Surwase
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    pratapswati/databricks-mini-pratap dataset hosted on Hugging Face and contributed by the HF Datasets community
