17 datasets found
  1. databricks-dolly-15k

    • huggingface.co
    Cite
    Databricks, databricks-dolly-15k [Dataset]. https://huggingface.co/datasets/databricks/databricks-dolly-15k
    Dataset authored and provided by
    Databricks (http://databricks.com/)
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Summary

    databricks-dolly-15k is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. This dataset can be used for any purpose, whether academic or commercial, under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported… See the full description on the dataset page: https://huggingface.co/datasets/databricks/databricks-dolly-15k.

  2. databricks_dolly-v2-7b-details

    • huggingface.co
    Updated Jul 30, 2025
    Cite
    Open LLM Leaderboard (2025). databricks_dolly-v2-7b-details [Dataset]. https://huggingface.co/datasets/open-llm-leaderboard/databricks_dolly-v2-7b-details
    Dataset updated
    Jul 30, 2025
    Dataset authored and provided by
    Open LLM Leaderboard
    Description

    Dataset Card for Evaluation run of databricks/dolly-v2-7b

    Dataset automatically created during the evaluation run of model databricks/dolly-v2-7b. The dataset is composed of 44 configurations, each one corresponding to one of the evaluated tasks. The dataset has been created from 2 runs. Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split always points to the latest results. An additional… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/databricks_dolly-v2-7b-details.

  3. databricks-dolly-15k-curated-multilingual

    • huggingface.co
    Cite
    Argilla, databricks-dolly-15k-curated-multilingual [Dataset]. https://huggingface.co/datasets/argilla/databricks-dolly-15k-curated-multilingual
    Dataset authored and provided by
    Argilla
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Dataset Card for "databricks-dolly-15k-curated-multilingual"

    A curated and multilingual version of the Databricks Dolly instructions dataset. It includes a programmatically and manually corrected version of the original en dataset. See below. STATUS: Currently, the original Dolly v2 English version has been curated combining automatic processing and collaborative human curation using Argilla (~400 records have been manually edited and fixed). The following graph shows a summary… See the full description on the dataset page: https://huggingface.co/datasets/argilla/databricks-dolly-15k-curated-multilingual.

  4. databricks_dolly-v2-3b-details

    • huggingface.co
    Updated Jul 30, 2025
    Cite
    Open LLM Leaderboard (2025). databricks_dolly-v2-3b-details [Dataset]. https://huggingface.co/datasets/open-llm-leaderboard/databricks_dolly-v2-3b-details
    Dataset updated
    Jul 30, 2025
    Dataset authored and provided by
    Open LLM Leaderboard
    Description

    Dataset Card for Evaluation run of databricks/dolly-v2-3b

    Dataset automatically created during the evaluation run of model databricks/dolly-v2-3b. The dataset is composed of 44 configurations, each one corresponding to one of the evaluated tasks. The dataset has been created from 1 run. Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split always points to the latest results. An additional… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/databricks_dolly-v2-3b-details.

  5. dolly-15k

    • huggingface.co
    Cite
    Ritesh Khanna, dolly-15k [Dataset]. https://huggingface.co/datasets/treadon/dolly-15k
    Authors
    Ritesh Khanna
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    Dataset Card for "dolly-15k"

    Summary

    This is the dataset supplied by Databricks for training Dolly V2. This set is split 99% training / 1% validation, should you want to set aside some records for evaluation purposes.

    Special thanks to ❤️ Databricks for creating and making this set available.

    More Information needed
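
The 99% / 1% split mentioned in this description can be reproduced on any list of records. The following is a minimal sketch, not the dataset author's procedure: the helper `split_records`, the fixed seed, and the placeholder records are all hypothetical, and it uses only the Python standard library rather than the actual dataset files.

```python
import random


def split_records(records, validation_fraction=0.01, seed=0):
    """Shuffle a copy of the records and hold out a validation slice.

    Hypothetical helper for illustration; the fraction and seed are assumptions.
    """
    shuffled = list(records)
    # A seeded Random instance keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


# Placeholder records standing in for instruction/response pairs.
records = [{"instruction": f"q{i}", "response": f"a{i}"} for i in range(15000)]
train, validation = split_records(records)
print(len(train), len(validation))
```

Fixing the seed means the same records land in the validation slice on every run, which keeps evaluation numbers comparable.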

  6. bsc-dolly-15k-en

    • huggingface.co
    Updated Aug 31, 2000
    Cite
    Language Technologies Laboratory @ Barcelona Supercomputing Center (2000). bsc-dolly-15k-en [Dataset]. https://huggingface.co/datasets/BSC-LT/bsc-dolly-15k-en
    Dataset updated
    Aug 31, 2000
    Dataset authored and provided by
    Language Technologies Laboratory @ Barcelona Supercomputing Center
    Description

    BSC Dolly 15k EN

    Reviewed version from the Argilla Dolly v2 English version, originally created by Databricks. We provide two subsets: "annotated", where some instances were labelled with potential problems; and "filtered", which only contains the instances without the issues that we observed.

    Annotation process

    While analysing the Argilla Dolly v2 English version, we observed the following:

    Task classification:

  7. dolly-v2_orca

    • huggingface.co
    Updated Jul 15, 2018
    Cite
    Pankaj Mathur (2018). dolly-v2_orca [Dataset]. https://huggingface.co/datasets/pankajmathur/dolly-v2_orca
    Dataset updated
    Jul 15, 2018
    Authors
    Pankaj Mathur
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    An explain-tuned Dolly-V2 dataset (~15K) created using approaches from the Orca research paper. We leverage all 15 system instructions provided in the Orca research paper to generate explain-tuned datasets, in contrast to the vanilla instruction-tuning approaches used by the original datasets. This helps student models like orca_mini_13b, orca_mini_7b or orca_mini_3b learn the thought process of the teacher model, which is ChatGPT (gpt-3.5-turbo-0301 version). Please see how the System prompt is added before… See the full description on the dataset page: https://huggingface.co/datasets/pankajmathur/dolly-v2_orca.

  8. Hand Truck and Dolly Market Size, Share | CAGR of 4.3%.

    • market.us
    csv, pdf
    Updated Jun 17, 2025
    Cite
    Market.us (2025). Hand Truck and Dolly Market Size, Share | CAGR of 4.3%. [Dataset]. https://market.us/report/hand-truck-and-dolly-market/
    Available download formats: csv, pdf
    Dataset updated
    Jun 17, 2025
    Dataset provided by
    Market.us
    License

    https://market.us/privacy-policy/

    Time period covered
    2022 - 2032
    Area covered
    Global
    Description

    Hand Truck and Dolly Market size is expected to be worth around USD 2.0 Billion by 2034, from USD 1.3 Billion in 2024, at a CAGR of 4.3%.

  9. databricks-dolly-15k-es

    • huggingface.co
    Cite
    David Quispe, databricks-dolly-15k-es [Dataset]. https://huggingface.co/datasets/daqc/databricks-dolly-15k-es
    Authors
    David Quispe
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Translated with googletrans==3.1.0a0 from the original dataset. Note: part of the data (up to 600) was lost during the translation.

    license: apache-2.0

  10. final_training_set_v1

    • huggingface.co
    Updated May 4, 2023
    Cite
    PyThaiNLP (2023). final_training_set_v1 [Dataset]. https://huggingface.co/datasets/pythainlp/final_training_set_v1
    Dataset updated
    May 4, 2023
    Dataset authored and provided by
    PyThaiNLP
    Description

    Dataset Card for "final_training_set_v1"

    Finetuning datasets for WangChanGLM sourced from LAION OIG chip2 and infill_dbpedia (Apache-2.0), DataBricks Dolly v2 (Apache-2.0), OpenAI TL;DR (MIT), and Hello-SimpleAI HC3 (CC-BY SA)

  11. instructmix

    • huggingface.co
    Updated Dec 15, 2000
    Cite
    Ξ Labs (2000). instructmix [Dataset]. https://huggingface.co/datasets/Xilabs/instructmix
    Dataset updated
    Dec 15, 2000
    Dataset authored and provided by
    Ξ Labs
    Description

    Dataset Card for "InstructMix"

    Description: InstructMix is a versatile instruction-tuning dataset available in Alpaca format. It encompasses a variety of instruction-related tasks and sources, making it well suited for finetuning instruction following Large Language Models.

    Included Datasets:

    Dataset Name | Size | Type | Details | GitHub Repo
    Alpaca_GPT4 | 52,002 examples | General Instruction | Generated by GPT-4 using Alpaca | GitHub Repo
    dolly 2.0 | 15,015 examples | Closed… See the full description on the dataset page: https://huggingface.co/datasets/Xilabs/instructmix.

  12. neural-chat-dataset-v2

    • huggingface.co
    Updated Aug 23, 2024
    Cite
    Intel (2024). neural-chat-dataset-v2 [Dataset]. https://huggingface.co/datasets/Intel/neural-chat-dataset-v2
    Dataset updated
    Aug 23, 2024
    Dataset authored and provided by
    Intel (http://intel.com/)
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Here is a collective list of the instruction datasets used for Neural Chat fine-tuning. The total numbers of instruction samples and tokens are about 1.5M and 5M respectively.

    Type | Language | Dataset | Number
    HC3 | en | HC3 | 24K
    dolly | en | databricks-dolly-15k | 15K
    alpaca-zh | zh | tigerbot-alpaca-zh-0.5m | 500K
    alpaca-en | en | TigerResearch/tigerbot-alpaca-en-50k | 50K
    math | en | tigerbot-gsm-8k-en | 8K
    general | en | tigerbot-stackexchange-qa-en-0.5m | 500K
    OpenOrca | en | Open-Orca/OpenOrca | 400K (sampled)… See the full description on the dataset page: https://huggingface.co/datasets/Intel/neural-chat-dataset-v2.

  13. kullm-v2.1

    • huggingface.co
    Updated Mar 4, 2024
    Cite
    Jeong (2024). kullm-v2.1 [Dataset]. https://huggingface.co/datasets/csujeong/kullm-v2.1
    Dataset updated
    Mar 4, 2024
    Authors
    Jeong
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for "KULLM-v2"

    Dataset Summary

    Korean translation of GPT4ALL, Dolly, and Vicuna data.
    repository: nlpai-lab/KULLM
    huggingface: nlpai-lab/kullm-v2

    Translate dataset

    Translated 'instruction', 'input', and 'output' in the dataset via the DeepL API

    License

    Apache-2.0

    from datasets import load_dataset

    ds = load_dataset("nlpai-lab/kullm-v2", split="train")
    ds
    DatasetDict({ train: Dataset({ features: ['id'… See the full description on the dataset page: https://huggingface.co/datasets/csujeong/kullm-v2.1.

  14. AutoMultiTurnByCalm3-22B

    • huggingface.co
    Cite
    kan hatakeyama, AutoMultiTurnByCalm3-22B [Dataset]. https://huggingface.co/datasets/kanhatakeyama/AutoMultiTurnByCalm3-22B
    Authors
    kan hatakeyama
    License

    https://choosealicense.com/licenses/other/

    Description

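    Automatically generated multi-turn dataset

    Q&A pairs generated automatically from open data sources using Calm3-22b.

    Part of the computation was carried out on the TSUBAME4.0 supercomputer at Tokyo Institute of Technology.

    Data sources

    The initial questions (q1) were collected from various data sources. All subsequent turns were generated by Calm. The question text follows the license of the original data.

    oasst2-33k-ja: apache 2.0

    databricks-dolly-15k-ja: cc-by-sa-3.0

    minnade: CC0

    cyberagent/chatbot-arena-ja-calm2-7b-chat-experimental: cc-by-4.0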

  15. LaMini

    • huggingface.co
    Updated Sep 27, 2023
    Cite
    Surge Global (2023). LaMini [Dataset]. https://huggingface.co/datasets/SurgeGlobal/LaMini
    Dataset updated
    Sep 27, 2023
    Dataset authored and provided by
    Surge Global
    Description

    Overview

    The LaMini Dataset is an instruction dataset generated using h2ogpt-gm-oasst1-en-2048-falcon-40b-v2. It is designed for instruction-tuning pre-trained models to specialize them in a variety of downstream tasks.

    Dataset Generation

    Base Model: h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2.
    Seed Instructions: Sourced from databricks/databricks-dolly-15k dataset.
    Generation Approach: Example-guided and topic-guided strategies.
    Total Instructions: 1,504 unique… See the full description on the dataset page: https://huggingface.co/datasets/SurgeGlobal/LaMini.

  16. tulu-v1-sft-mixture

    • huggingface.co
    Updated Nov 14, 2023
    Cite
    Ai2 (2023). tulu-v1-sft-mixture [Dataset]. https://huggingface.co/datasets/allenai/tulu-v1-sft-mixture
    Dataset updated
    Nov 14, 2023
    Dataset provided by
    Allen Institute for AI (http://allenai.org/)
    Authors
    Ai2
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    Dataset Card for Tulu Instruction Mix

    For a newer version, see Tulu V2. This version, the human data mixture, consists of a mix of:

    FLAN (Apache 2.0): FLAN v2 with CoT examples (most of the tasks in SuperNatural Instructions are included here)
    Open Assistant 1 (Apache 2.0)
    Dolly (CC By SA 3.0)
    ShareGPT (Apache 2.0 listed, no official repo found)
    GPT4-Alpaca (CC By NC 4.0)
    Code-Alpaca (CC By NC 4.0)

    These are made by taking either just the training set of the subsets or the… See the full description on the dataset page: https://huggingface.co/datasets/allenai/tulu-v1-sft-mixture.

  17. AutoMultiTurnByMixtral8x22b

    • huggingface.co
    Updated May 20, 2024
    Cite
    kan hatakeyama (2024). AutoMultiTurnByMixtral8x22b [Dataset]. https://huggingface.co/datasets/kanhatakeyama/AutoMultiTurnByMixtral8x22b
    Dataset updated
    May 20, 2024
    Authors
    kan hatakeyama
    Description

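    Automatically generated multi-turn dataset

    Q&A pairs generated automatically from open data sources using MaziyarPanahi/Mixtral-8x22B-Instruct-v0.1-GGUF.

    Related code

    Part of the computation was carried out on the TSUBAME4.0 supercomputer at Tokyo Institute of Technology.

    Data sources

    The initial questions (q1) were collected from various data sources. All subsequent turns were generated by Mixtral. The question text follows the license of the original data.

    oasst2-33k-ja: apache 2.0

    databricks-dolly-15k-ja: cc-by-sa-3.0

    minnade: CC0

    cyberagent/chatbot-arena-ja-calm2-7b-chat-experimental: cc-by-4.0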

databricks-dolly-15k

databricks/databricks-dolly-15k

208 scholarly articles cite this dataset (View in Google Scholar)