100+ datasets found
  1. databricks-dolly-15k

    • huggingface.co
    + more versions
    Cite
    Databricks, databricks-dolly-15k [Dataset]. https://huggingface.co/datasets/databricks/databricks-dolly-15k
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset authored and provided by
    Databricks (http://databricks.com/)
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Summary

    databricks-dolly-15k is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. This dataset can be used for any purpose, whether academic or commercial, under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported… See the full description on the dataset page: https://huggingface.co/datasets/databricks/databricks-dolly-15k.

  2. databricks-dolly-15k

    • huggingface.co
    Cite
    AI Squared, Inc., databricks-dolly-15k [Dataset]. https://huggingface.co/datasets/aisquared/databricks-dolly-15k
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset authored and provided by
    AI Squared, Inc.
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    databricks-dolly-15k

    This dataset was not originally created by AI Squared. This dataset was curated and created by Databricks. The below text comes from the original release of the dataset's README file in GitHub (available at https://github.com/databrickslabs/dolly/tree/master/data):

      Summary
    

    databricks-dolly-15k is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in… See the full description on the dataset page: https://huggingface.co/datasets/aisquared/databricks-dolly-15k.

  3. databricks-dolly-15k-ja

    • huggingface.co
    Updated Feb 7, 2024
    Cite
    LLM-jp (2024). databricks-dolly-15k-ja [Dataset]. https://huggingface.co/datasets/llm-jp/databricks-dolly-15k-ja
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 7, 2024
    Dataset authored and provided by
    LLM-jp
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    databricks-dolly-15k-ja

    This repository provides an instruction tuning dataset developed by LLM-jp, a collaborative project launched in Japan. This dataset is a Japanese translation of databricks-dolly-15k using DeepL.

      Send Questions to
    

    llm-jp(at)nii.ac.jp

      Model Card Authors
    

    The names are listed in alphabetical order. Hirokazu Kiyomaru, Hiroshi Matsuda, Jun Suzuki, Namgi Han, Saku Sugawara, Shota Sasaki, Shuhei Kurita, Taishi Nakamura, Takashi Kodama, Takumi… See the full description on the dataset page: https://huggingface.co/datasets/llm-jp/databricks-dolly-15k-ja.

  4. Databricks Human Instruction Dataset

    • opendatabay.com
    Updated Jul 4, 2025
    + more versions
    Cite
    Datasimple (2025). Databricks Human Instruction Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/78cf60f8-b078-411f-aa41-bc5794f3121c
    Explore at:
    Dataset updated
    Jul 4, 2025
    Dataset authored and provided by
    Datasimple
    Area covered
    Data Science and Analytics
    Description

    This dataset is a collection of over 15,000 records generated by Databricks employees, specifically designed to enable large language models to exhibit the interactive qualities of conversational AI. It serves as an open-source, human-generated instruction corpus, invaluable for fine-tuning large language models. The contributors created prompt and response pairs across eight distinct instruction categories, carefully avoiding external web sources (with the exception of Wikipedia for certain subsets) and generative AI in their formulations. This dataset holds significant value for instruction fine-tuning, synthetic data generation, and data augmentation, and is openly available for any purpose, including academic and commercial applications.

    Columns

    • instruction: Represents the prompt or question provided.
    • context: Serves as reference material relevant to the instruction.
    • response: Contains the generated response to the instruction.
    • category: Indicates the annotator behavioural category, derived from the InstructGPT paper.

    Distribution

    The dataset is provided as a CSV file, containing fields for instruction, context, response, and category. It comprises over 15,000 records, with 14,781 unique values for 'instruction' and 14,944 unique values for 'category'.
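    The column layout described above can be checked with a short script. This is a minimal sketch, not the publisher's tooling: the two sample rows are invented stand-ins for the real CSV, and the helper names load_records and unique_counts are hypothetical.

```python
import csv
import io

# Hypothetical two-row sample standing in for the real CSV file.
SAMPLE_CSV = """instruction,context,response,category
What is the capital of France?,,Paris.,open_qa
Name three primary colors.,,"Red, yellow, and blue.",brainstorming
"""

# The four columns named in the listing.
EXPECTED_COLUMNS = ["instruction", "context", "response", "category"]


def load_records(handle):
    """Read all rows and check that the header matches the documented columns."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    return list(reader)


def unique_counts(records):
    """Count the distinct values found in each column."""
    return {col: len({row[col] for row in records}) for col in EXPECTED_COLUMNS}


records = load_records(io.StringIO(SAMPLE_CSV))
counts = unique_counts(records)
```

    For a real download, the same functions could be given a file opened with newline="" and encoding="utf-8" in place of the in-memory sample.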

    Usage

    This dataset is ideal for several applications, including:

    • Instruction fine-tuning of large language models to enhance their interactive capabilities.
    • Generating synthetic data by using the human-generated prompts as few-shot examples for large open language models.
    • Data augmentation techniques, such as paraphrasing prompts or short responses to regularise the dataset and improve model robustness.
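    The few-shot use mentioned above could look roughly like the following. It is only a sketch: the "Instruction:" / "Response:" layout, the function name build_few_shot_prompt, and the sample records are assumptions made for illustration, not something the listing prescribes.

```python
def build_few_shot_prompt(examples, new_instruction):
    """Join human-written instruction/response pairs into one few-shot prompt.

    The prompt layout is an assumed convention, not part of the dataset.
    """
    parts = [
        f"Instruction: {ex['instruction']}\nResponse: {ex['response']}"
        for ex in examples
    ]
    # Leave the final response open so a model can complete it.
    parts.append(f"Instruction: {new_instruction}\nResponse:")
    return "\n\n".join(parts)


# Hypothetical records with the dataset's column names.
examples = [
    {"instruction": "Name a primary color.", "response": "Red."},
    {"instruction": "Name a planet.", "response": "Mars."},
]
prompt = build_few_shot_prompt(examples, "Name a mammal.")
```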

    Coverage

    The dataset has a global reach. It was listed on 11/06/2025. The data is human-generated by Databricks employees. While the language used is American English, it is noted that some annotators may not be native English speakers. The demographic profile and subject matter of the data may reflect the composition of Databricks employees. It is important to note that as Wikipedia was consulted for certain categories, the dataset may reflect biases, factual errors, or topical focuses present in Wikipedia.

    License

    CC-BY-SA

    Who Can Use It

    This dataset is intended for a wide range of users, including:

    • Data Scientists and Machine Learning Engineers: For fine-tuning and developing large language models.
    • Researchers: For studies on instruction-following, synthetic data generation, and data augmentation in natural language processing.
    • Developers: Building applications that require interactive or instruction-based language model capabilities.
    • Organisations: For commercial product development involving custom language models.

    Dataset Name Suggestions

    • Dolly 15K Instruction Corpus
    • Databricks Human Instruction Data
    • LLM Fine-tuning Prompt Dataset
    • Opendatabay Dolly 15K
    • Interactive AI Training Data

    Attribute

    Original Data Source: Databricks Dolly 15K Dataset

  5. databricks-dolly-15k-ko

    • huggingface.co
    Updated Apr 12, 2023
    Cite
    NLP & AI - Korea University (2023). databricks-dolly-15k-ko [Dataset]. https://huggingface.co/datasets/nlpai-lab/databricks-dolly-15k-ko
    Explore at:
    Dataset updated
    Apr 12, 2023
    Dataset authored and provided by
    NLP & AI - Korea University
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Korean translation of databricks-dolly-15k via the DeepL API. Note: in some cases, multilingual data was converted to monolingual data during batch translation to Korean using the API. Below is databricks-dolly-15k's README.

      Summary
    

    databricks-dolly-15k is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification… See the full description on the dataset page: https://huggingface.co/datasets/nlpai-lab/databricks-dolly-15k-ko.

  6. Dolly 15k Dutch

    • data.niaid.nih.gov
    Updated Jun 20, 2023
    Cite
    Vanroy, Bram (2023). Dolly 15k Dutch [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8054097
    Explore at:
    Dataset updated
    Jun 20, 2023
    Dataset authored and provided by
    Vanroy, Bram
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    This dataset contains 14,934 instructions, contexts and responses, in several natural language categories such as classification, closed QA, generation, etc. The English original dataset was created by @databricks, who crowd-sourced the data creation via its employees. The current dataset is a translation of that dataset through ChatGPT (gpt-3.5-turbo).

    Data Instances

    { "id": 14963, "instruction": "Wat zijn de duurste steden ter wereld?", "context": "", "response": "Dit is een uitgebreide lijst van de duurste steden: Singapore, Tel Aviv, New York, Hong Kong, Los Angeles, Zurich, Genève, San Francisco, Parijs en Sydney.", "category": "brainstorming" }

    Data Fields

    id: the ID of the item. The following 77 IDs are not included because they could not be translated (or were too long): [1502, 1812, 1868, 4179, 4541, 6347, 8851, 9321, 10588, 10835, 11257, 12082, 12319, 12471, 12701, 12988, 13066, 13074, 13076, 13181, 13253, 13279, 13313, 13346, 13369, 13446, 13475, 13528, 13546, 13548, 13549, 13558, 13566, 13600, 13603, 13657, 13668, 13733, 13765, 13775, 13801, 13831, 13906, 13922, 13923, 13957, 13967, 13976, 14028, 14031, 14045, 14050, 14082, 14083, 14089, 14110, 14155, 14162, 14181, 14187, 14200, 14221, 14222, 14281, 14473, 14475, 14476, 14587, 14590, 14667, 14685, 14764, 14780, 14808, 14836, 14891, 14966]

    instruction: the instruction (question)

    context: additional context that the AI can use to answer the question

    response: the AI's expected response

    category: the category of this type of question (see Dolly for more info)
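    As a rough illustration of the fields listed above, the sample instance from this card can be parsed and checked as follows. The validate_record helper and its type rules (integer id, string for every other field) are assumptions inferred from that single example.

```python
import json

# The data instance shown on the card, as a JSON string.
RECORD_JSON = (
    '{"id": 14963, '
    '"instruction": "Wat zijn de duurste steden ter wereld?", '
    '"context": "", '
    '"response": "Dit is een uitgebreide lijst van de duurste steden: '
    'Singapore, Tel Aviv, New York, Hong Kong, Los Angeles, Zurich, Genève, '
    'San Francisco, Parijs en Sydney.", '
    '"category": "brainstorming"}'
)

TEXT_FIELDS = ("instruction", "context", "response", "category")


def validate_record(record):
    """Check that a record carries the five documented fields with plausible types."""
    if not isinstance(record.get("id"), int):
        raise ValueError("id must be an integer")
    for field in TEXT_FIELDS:
        if not isinstance(record.get(field), str):
            raise ValueError(f"{field} must be a string")
    return record


record = validate_record(json.loads(RECORD_JSON))
```

    An empty context, as in this record, is treated as valid because context is optional.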

    Dataset Creation

    Both the translations and the topics were translated with OpenAI's API for gpt-3.5-turbo, using max_tokens=1024 and temperature=0 as parameters.

    The prompt template to translate the input is (where src_lang was English and tgt_lang Dutch):

    CONVERSATION_TRANSLATION_PROMPT = """You are asked to translate a task's instruction, optional context to the task, and the response to the task, from {src_lang} to {tgt_lang}.

    Here are the requirements that you should adhere to:
    1. maintain the format: the task consists of a task instruction (marked instruction:), optional context to the task (marked context:) and response for the task marked with response:;
    2. do not translate the identifiers instruction:, context:, and response: but instead copy them to your output;
    3. make sure that text is fluent to read and does not contain grammatical errors. Use standard {tgt_lang} without regional bias;
    4. translate the instruction and context text using informal, but standard, language;
    5. make sure to avoid biases (such as gender bias, grammatical bias, social bias);
    6. if the instruction is to correct grammar mistakes or spelling mistakes then you have to generate a similar mistake in the context in {tgt_lang}, and then also generate a corrected output version in the output in {tgt_lang};
    7. if the instruction is to translate text from one language to another, then you do not translate the text that needs to be translated in the instruction or the context, nor the translation in the response (just copy them as-is);
    8. do not translate code fragments but copy them to your output. If there are English examples, variable names or definitions in code fragments, keep them in English.

    Now translate the following task with the requirements set out above. Do not provide an explanation and do not add anything else.

    """

    The system message was:

    You are a helpful assistant that translates English to Dutch according to the requirements that are given to you.
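    One way the template, system message, and parameters might be combined into a chat request is sketched below. Several parts are assumptions: SHORT_TEMPLATE is an abbreviated stand-in for the full prompt, the way the task text is appended after the template is a guess based on requirement 1, and build_messages is a hypothetical helper. No API call is made.

```python
# Abbreviated stand-in for the full prompt template quoted on the card.
SHORT_TEMPLATE = (
    "You are asked to translate a task's instruction, optional context to the task, "
    "and the response to the task, from {src_lang} to {tgt_lang}.\n\n"
    "Now translate the following task with the requirements set out above. "
    "Do not provide an explanation and do not add anything else.\n\n"
)

# System message as quoted on the card.
SYSTEM_MESSAGE = (
    "You are a helpful assistant that translates English to Dutch "
    "according to the requirements that are given to you."
)

# Parameters stated on the card.
REQUEST_PARAMS = {"model": "gpt-3.5-turbo", "max_tokens": 1024, "temperature": 0}


def build_messages(instruction, context, response, src_lang="English", tgt_lang="Dutch"):
    """Fill the template and append the task using the marked identifiers."""
    # Assumed layout: one marked line per field, appended after the template.
    task = f"instruction: {instruction}\ncontext: {context}\nresponse: {response}"
    # The task is added after formatting so braces in the data are left alone.
    prompt = SHORT_TEMPLATE.format(src_lang=src_lang, tgt_lang=tgt_lang) + task
    return [
        {"role": "system", "content": SYSTEM_MESSAGE},
        {"role": "user", "content": prompt},
    ]


messages = build_messages("What is the capital of France?", "", "Paris.")
```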

    Note that 77 items (0.5%) were not successfully translated. This can either mean that the prompt was too long for the given limit (max_tokens=1024) or that the generated translation could not be parsed into instruction, context and response fields. The missing IDs are [1502, 1812, 1868, 4179, 4541, 6347, 8851, 9321, 10588, 10835, 11257, 12082, 12319, 12471, 12701, 12988, 13066, 13074, 13076, 13181, 13253, 13279, 13313, 13346, 13369, 13446, 13475, 13528, 13546, 13548, 13549, 13558, 13566, 13600, 13603, 13657, 13668, 13733, 13765, 13775, 13801, 13831, 13906, 13922, 13923, 13957, 13967, 13976, 14028, 14031, 14045, 14050, 14082, 14083, 14089, 14110, 14155, 14162, 14181, 14187, 14200, 14221, 14222, 14281, 14473, 14475, 14476, 14587, 14590, 14667, 14685, 14764, 14780, 14808, 14836, 14891, 14966].
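    The figures in this note can be cross-checked with a few lines of arithmetic. The sketch assumes that the final ID in the list reads 14966 and that the 14,934 translated records plus the missing ones make up the whole source set; both are inferences from this card rather than stated facts.

```python
# Missing IDs as listed on the card (final entry read as 14966, an assumption).
MISSING_IDS = [
    1502, 1812, 1868, 4179, 4541, 6347, 8851, 9321, 10588, 10835,
    11257, 12082, 12319, 12471, 12701, 12988, 13066, 13074, 13076, 13181,
    13253, 13279, 13313, 13346, 13369, 13446, 13475, 13528, 13546, 13548,
    13549, 13558, 13566, 13600, 13603, 13657, 13668, 13733, 13765, 13775,
    13801, 13831, 13906, 13922, 13923, 13957, 13967, 13976, 14028, 14031,
    14045, 14050, 14082, 14083, 14089, 14110, 14155, 14162, 14181, 14187,
    14200, 14221, 14222, 14281, 14473, 14475, 14476, 14587, 14590, 14667,
    14685, 14764, 14780, 14808, 14836, 14891, 14966,
]

TRANSLATED_COUNT = 14_934  # records reported as present in the Dutch dataset

missing_count = len(MISSING_IDS)
implied_total = TRANSLATED_COUNT + missing_count
missing_share_percent = 100 * missing_count / implied_total
```

    Under these assumptions the list holds 77 distinct IDs and the missing share rounds to the quoted 0.5%.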

    Initial Data Collection and Normalization

    Initial data collection by databricks. See their repository for more information about this dataset.

    Considerations for Using the Data

    Note that the translations in this new dataset have not been verified by humans! Use at your own risk, both in terms of quality and biases.

    Discussion of Biases

    As with any machine-generated text, users should be aware of potential biases in this dataset. Although the prompt specifically includes "make sure to avoid biases (such as gender bias, grammatical bias, social bias)", the impact of such a command is not known. It is likely that biases remain in the dataset, so use it with caution.

    Other Known Limitations

    The translation quality has not been verified. Use at your own risk!

    Licensing Information

    This repository follows the original databricks license, which is CC BY-SA 3.0 but see below for a specific restriction.

    This text was generated (either in part or in full) with GPT-3 (gpt-3.5-turbo), OpenAI’s large-scale language-generation model. Upon generating draft language, the author reviewed, edited, and revised the language to their own liking and takes ultimate responsibility for the content of this publication.

    If you use this dataset, you must also follow the Sharing and Usage policies.

    As clearly stated in their Terms of Use, specifically 2c.iii, "[you may not] use output from the Services to develop models that compete with OpenAI". That means that you cannot use this dataset to build models that are intended to commercially compete with OpenAI. As far as I am aware, that is a specific restriction that should serve as an addendum to the current license.

    This dataset is also available on the Hugging Face hub, its canonical repository.

  7. databricks-dolly-15k

    • huggingface.co
    Updated Aug 27, 2024
    Cite
    Post-training-Data-Flywheel (2024). databricks-dolly-15k [Dataset]. https://huggingface.co/datasets/Post-training-Data-Flywheel/databricks-dolly-15k
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 27, 2024
    Dataset authored and provided by
    Post-training-Data-Flywheel
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Post-training-Data-Flywheel/databricks-dolly-15k dataset hosted on Hugging Face and contributed by the HF Datasets community

  8. Dolly 15K AI Chat Data

    • opendatabay.com
    Updated Jul 6, 2025
    Cite
    Datasimple (2025). Dolly 15K AI Chat Data [Dataset]. https://www.opendatabay.com/data/ai-ml/a2914db9-a1d3-4d91-84c9-be253ae09386
    Explore at:
    Dataset updated
    Jul 6, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Telecommunications & Network Data
    Description

    This dataset provides over 15,000 prompt-response records designed to support ChatGPT-style chat applications. It was created by Databricks employees to facilitate the use of large language models (LLMs) for interactive dialogue. The dataset contains prompt-response pairs across eight distinct instruction categories and deliberately avoids information from external web sources, with the exception of Wikipedia for specific instruction sets. This open-source resource is suited to exploring text-based conversations and natural language processing.

    Columns

    • Instruction (Text): This field contains the text prompt intended to generate an appropriate response from a machine learning model or chatbot, utilising natural language processing techniques. It represents what one individual says in a conversation.
    • Context (Text): Providing additional information, the context field enhances accuracy by offering the model more detail about the ongoing conversation or request execution. Like the instruction, it captures what is said by one individual.
    • Response (Text): This column holds the conversational reply or what is said back by the other individual in the dialogue.
    • Category (Text): Each prompt-response pair is classified into one of eight distinct categories based on its content. Examples of unique category values include 'open_qa' and 'general_qa'.

    Distribution

    The dataset is typically provided as a data file, usually in CSV format. The main train.csv file contains over 15,000 records. Each record represents a unique prompt-response pair, or a single turn in a conversation between two individuals. All columns are of a string data type.

    Usage

    This dataset is suited for a variety of applications and use cases:

    • Training dialogue systems by developing multiple funneling pipelines to enrich models with real-world conversations.
    • Creating intelligent chatbot interactions.
    • Generating natural language answers as part of Q&A systems.
    • Utilising excerpts from Wikipedia for particular subsets of instruction categories.
    • Leveraging the classification labels with supervised learning techniques, such as multi-class classification neural networks or logistic regression classifiers.
    • Developing deep learning models to detect and respond to conversational intent.
    • Training language models for customer service queries using natural language processing (NLP).
    • Creating custom dialogue agents capable of handling more intricate conversational interactions.

    Coverage

    The dataset has a global reach. It was listed on 17/06/2025, and its content focuses on general conversational and Q&A interactions, without specific demographic limitations.

    License

    CC0

    Who Can Use It

    This dataset is valuable for a wide range of users, including AI/ML developers, researchers, and data scientists looking to:

    • Build and train conversational AI models.
    • Develop advanced chatbot applications.
    • Explore new insights in natural language processing.
    • Create bespoke dialogue agents for various sectors, such as customer service.
    • Apply supervised learning to classify conversational data.

    Dataset Name Suggestions

    • Databricks Dolly (15K) Dialogue Data
    • LLM Training Conversation Dataset
    • Dolly 15K AI Chat Data
    • Prompt-Response Pairs for LLMs

    Attributes

    Original Data Source: Databricks Dolly (15K)

  9. dolly-15k

    • huggingface.co
    Cite
    Ritesh Khanna, dolly-15k [Dataset]. https://huggingface.co/datasets/treadon/dolly-15k
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Authors
    Ritesh Khanna
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    Dataset Card for "dolly-15k"

      Summary
    

    This is the dataset supplied by Databricks for training Dolly V2. This set is split 99% training / 1% validation, should you want to set aside some records for evaluation purposes.
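    A 99% / 1% split of this kind could be produced along the following lines. This is a sketch only: the card does not say how the split was made, so the shuffling, the seed, and the split_indices helper are all assumptions.

```python
import random


def split_indices(n_records, validation_fraction=0.01, seed=0):
    """Shuffle record indices and hold out a small validation slice."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_val = max(1, round(n_records * validation_fraction))
    return indices[n_val:], indices[:n_val]


# Toy size chosen for illustration; not the real record count.
train_idx, val_idx = split_indices(1000)
```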

      Special thanks to ❤️ Databricks for creating and making this set available.
    

    More Information needed

  10. dolly-15k-oai-style

    • huggingface.co
    Updated Nov 15, 2023
    Cite
    Philipp Schmid (2023). dolly-15k-oai-style [Dataset]. https://huggingface.co/datasets/philschmid/dolly-15k-oai-style
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 15, 2023
    Authors
    Philipp Schmid
    Description

    Dataset Card for "dolly-15k-oai-style"

    More Information needed

  11. thai_databricks_dolly

    • huggingface.co
    Updated Jun 20, 2024
    + more versions
    Cite
    SEACrowd (2024). thai_databricks_dolly [Dataset]. https://huggingface.co/datasets/SEACrowd/thai_databricks_dolly
    Explore at:
    Dataset updated
    Jun 20, 2024
    Dataset authored and provided by
    SEACrowd
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    This is a Thai-instructed dataset translated from databricks-dolly-15k using Google Cloud Translation. databricks-dolly-15k is an open-source dataset of instruction-following records generated by thousands of Databricks employees in several behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.

  12. pretrain-databricks-dolly-15k

    • huggingface.co
    Updated Jan 24, 2024
    Cite
    Victor Nogueira (2024). pretrain-databricks-dolly-15k [Dataset]. https://huggingface.co/datasets/Felladrin/pretrain-databricks-dolly-15k
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 24, 2024
    Authors
    Victor Nogueira
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Conversion of databricks/databricks-dolly-15k dataset to be used in pretraining. Python code used for conversion:

    from datasets import load_dataset
    import pandas

    dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

    def format(columns):
        instruction = columns["instruction"].strip()
        answer = columns["response"].strip()
        return f"{instruction}\n\n{answer}"

    pandas.DataFrame({"text": [format(columns) for columns in dataset]}).to_csv("train.csv", index=False)

  13. databricks-databricks-dolly-15k

    • huggingface.co
    Updated Sep 21, 2024
    + more versions
    Cite
    AGIE AI Technology (2024). databricks-databricks-dolly-15k [Dataset]. https://huggingface.co/datasets/agie-ai/databricks-databricks-dolly-15k
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 21, 2024
    Dataset authored and provided by
    AGIE AI Technology
    Description

    Dataset Card for "databricks-databricks-dolly-15k"

    More Information needed

  14. ChatML-databricks-dolly-15k

    • huggingface.co
    Updated Feb 3, 2024
    Cite
    Victor Nogueira (2024). ChatML-databricks-dolly-15k [Dataset]. https://huggingface.co/datasets/Felladrin/ChatML-databricks-dolly-15k
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 3, 2024
    Authors
    Victor Nogueira
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    databricks/databricks-dolly-15k in ChatML format. Python code used for conversion:

    from datasets import load_dataset
    import pandas
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(
        pretrained_model_name_or_path="Felladrin/Llama-160M-Chat-v1"
    )

    dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

    def format(columns):
        instruction = columns["instruction"].strip()
        context = columns["context"].strip()
        response =… See the full description on the dataset page: https://huggingface.co/datasets/Felladrin/ChatML-databricks-dolly-15k.

  15. dolly-15k-it

    • huggingface.co
    Cite
    Pierpaolo Basile, dolly-15k-it [Dataset]. https://huggingface.co/datasets/basilepp19/dolly-15k-it
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Authors
    Pierpaolo Basile
    License

    Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/
    License information was derived automatically

    Description

    This dataset is obtained by automatically translating the dolly 15k dataset (https://huggingface.co/datasets/databricks/databricks-dolly-15k) in Italian using an open-source machine translation tool: https://pypi.org/project/argostranslate/

  16. databricks-dolly-15k

    • huggingface.co
    Updated Oct 19, 2024
    + more versions
    Cite
    Vaibhav Adlakha (2024). databricks-dolly-15k [Dataset]. https://huggingface.co/datasets/vaibhavad/databricks-dolly-15k
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 19, 2024
    Authors
    Vaibhav Adlakha
    Description

    vaibhavad/databricks-dolly-15k dataset hosted on Hugging Face and contributed by the HF Datasets community

  17. Bactrian-X Dataset

    • paperswithcode.com
    Updated Oct 24, 2024
    Cite
    Haonan Li; Fajri Koto; Minghao Wu; Alham Fikri Aji; Timothy Baldwin (2024). Bactrian-X Dataset [Dataset]. https://paperswithcode.com/dataset/bactrian-x
    Explore at:
    Dataset updated
    Oct 24, 2024
    Authors
    Haonan Li; Fajri Koto; Minghao Wu; Alham Fikri Aji; Timothy Baldwin
    Description

    Bactrian-X is a comprehensive multilingual parallel dataset of 3.4 million instruction-response pairs across 52 languages. The instructions were obtained from alpaca-52k and dolly-15k and translated into 52 languages (52 languages x 67k instances = 3.4M instances).

  18. databricks-dolly-15k-curated-multilingual

    • huggingface.co
    Updated Apr 28, 2023
    Cite
    databricks-dolly-15k-curated-multilingual [Dataset]. https://huggingface.co/datasets/argilla/databricks-dolly-15k-curated-multilingual
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 28, 2023
    Dataset authored and provided by
    Argilla
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Dataset Card for "databricks-dolly-15k-curated-multilingual"

    A curated and multilingual version of the Databricks Dolly instructions dataset. It includes a programmatically and manually corrected version of the original en dataset. See below. STATUS: Currently, the original Dolly v2 English version has been curated combining automatic processing and collaborative human curation using Argilla (~400 records have been manually edited and fixed). The following graph shows a summary… See the full description on the dataset page: https://huggingface.co/datasets/argilla/databricks-dolly-15k-curated-multilingual.

  19. databricks-dolly-15k-chatml

    • huggingface.co
    Cite
    Re:cast AI, databricks-dolly-15k-chatml [Dataset]. https://huggingface.co/datasets/recastai/databricks-dolly-15k-chatml
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset provided by
    CAST AI Group, Inc.
    Authors
    Re:cast AI
    Description

    Dataset Card for "databricks-dolly-15k-chatml"

      Dataset Summary
    

    This dataset has been created by Re:cast AI to transform the existing dataset databricks/databricks-dolly-15k into a chatml friendly format for use in SFT tasks with pretrained models.

      Dataset Structure
    

    messages = [ { "content": "You are an expert Q&A system that is trusted around the world. You always... etc.", "role": "system" }, { "content": "(Optional) Context information is… See the full description on the dataset page: https://huggingface.co/datasets/recastai/databricks-dolly-15k-chatml.
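    For orientation, a list of role/content messages like the one above is commonly rendered into ChatML text with <|im_start|> and <|im_end|> markers. The sketch below shows that rendering under stated assumptions: the message contents are invented placeholders (the card's own system prompt is truncated and is not reproduced here), and to_chatml is a hypothetical helper, not the dataset author's code.

```python
def to_chatml(messages):
    """Render role/content messages as ChatML-delimited text."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )


# Placeholder conversation; contents are illustrative only.
sample_messages = [
    {"content": "You are a helpful assistant.", "role": "system"},
    {"content": "Name a planet.", "role": "user"},
    {"content": "Mars.", "role": "assistant"},
]
chatml_text = to_chatml(sample_messages)
```

    In practice a tokenizer's chat template usually performs this step, so the helper is mainly useful for inspecting what a model would see.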

  20. Bactrain-X

    • opendatalab.com
    zip
    Updated May 1, 2023
    + more versions
    Cite
    University of Melbourne (2023). Bactrain-X [Dataset]. https://opendatalab.com/OpenDataLab/Bactrain-X
    Explore at:
    Available download formats: zip
    Dataset updated
    May 1, 2023
    Dataset provided by
    Monash University
    University of Melbourne
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The Bactrain-X dataset is a collection of 3.4M instruction-response pairs in 52 languages, obtained by translating 67K English instructions (alpaca-52k + dolly-15k) into 51 languages using the Google Translate API. The translated instructions are then fed to ChatGPT to obtain its natural responses, resulting in 3.4M instruction-response pairs in 52 languages (52 languages x 67k instances = 3.4M instances).
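    The instance count quoted above follows from simple arithmetic, shown here as a quick check. The rounded source sizes (52,000 and 15,000) are taken from the dataset names alpaca-52k and dolly-15k and are approximations, not exact record counts.

```python
# Rounded sizes implied by the dataset names (approximate, not exact counts).
ALPACA_INSTRUCTIONS = 52_000
DOLLY_INSTRUCTIONS = 15_000
LANGUAGES = 52  # English plus 51 translated languages

english_instructions = ALPACA_INSTRUCTIONS + DOLLY_INSTRUCTIONS
total_pairs = LANGUAGES * english_instructions
total_pairs_millions = total_pairs / 1_000_000
```

    With these rounded inputs the product is about 3.48 million, which the description quotes as 3.4M.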

databricks-dolly-15k

databricks/databricks-dolly-15k

Explore at:
178 scholarly articles cite this dataset (View in Google Scholar)
